XML-L  December 1998

Subject: Re: Search Engines
From: Jerome McDonough <[log in to unmask]>
Reply-To: General discussion of Extensible Markup Language <[log in to unmask]>
Date: Wed, 9 Dec 1998 10:05:51 -0800
Content-Type: text/plain

At 12:26 PM 12/9/1998 -0000, Mark wrote:
>This implies to me that the next generation of index servers will have
>to take more than simply a document's DTD into account - since this
>relationship cannot be expressed there. (And if a document has no DTD
>then that's even worse. If nothing else you don't know what fields could
>be there but are empty.)

Right, and actually not that difficult a problem. Not to proselytize
for Cheshire II *too* much, but I think the approach Ray Larson developed
for that system addresses this problem nicely. A single configuration
file for the search/indexing engine specifies what indexes to generate
(author index, title index, subject index, etc.), specifies what elements
(or subelements of particular elements) for a particular document class
should be used to generate those indexes, and maps those indexes to
a well-defined set of search attributes (author, title, subject, etc.)
in the Z39.50 search protocol. As an example, the configuration file
that we developed for a library catalog system that used an SGML version
of MARC bibliographic records specified that an author index should be
generated using subelements <a>, <q>, <b>, and <c> of elements
<FLD100>, <FLD700>, and <FLD800> (if you're a MARC junkie, that made
sense, trust me), and that Z39.50 searches using BIB1 use attributes
of 1 and 1004 (personal name and author-personal name) should be mapped
to that index. In this setup, a Z39.50 client which requested a search
for an author using a personal name would have that search conducted
against the author index, which was generated by extracting the keywords
contained only within the subelements of the elements specified
in the config file.
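
A rough sketch of that mapping in Python may make it more concrete. This is not Cheshire II's actual configuration format or code: the dictionary layout, the function names, and the two toy records are all invented for illustration, and the records are written as well-formed XML rather than real SGML MARC. Only the element names, the subelement names, and the Bib-1 use attributes 1 and 1004 come from the description above.

```python
import re
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical configuration layout (NOT the real Cheshire II file format).
# One entry per index: which elements/subelements feed it, and which
# Z39.50 Bib-1 use attributes are routed to it.
INDEX_CONFIG = {
    "author": {
        "elements": ["FLD100", "FLD700", "FLD800"],
        "subelements": ["a", "q", "b", "c"],
        "bib1_use": [1, 1004],  # personal name, author-personal name
    },
}

# Invented toy records; FLD245 (title) is deliberately not in the config.
RECORDS = {
    "rec1": "<USMARC><FLD100><a>Larson, Ray</a></FLD100>"
            "<FLD245><a>Cheshire notes</a></FLD245></USMARC>",
    "rec2": "<USMARC><FLD100><a>Smith, Jane</a></FLD100>"
            "<FLD245><a>Letters to Larson</a></FLD245></USMARC>",
}


def build_indexes(records, config):
    """Return {index_name: {keyword: set of record ids}}."""
    indexes = {name: defaultdict(set) for name in config}
    for rec_id, xml_text in records.items():
        root = ET.fromstring(xml_text)
        for name, spec in config.items():
            for elem_tag in spec["elements"]:
                for elem in root.iter(elem_tag):
                    # Only direct subelements named in the config contribute.
                    for sub in elem:
                        if sub.tag in spec["subelements"] and sub.text:
                            for word in re.findall(r"\w+", sub.text.lower()):
                                indexes[name][word].add(rec_id)
    return indexes


def index_for_use_attribute(use_attr, config):
    """Map a Bib-1 use attribute to the name of the index that serves it."""
    for name, spec in config.items():
        if use_attr in spec["bib1_use"]:
            return name
    return None


def search(indexes, config, use_attr, term):
    """Run a single-keyword search against the index mapped to use_attr."""
    name = index_for_use_attribute(use_attr, config)
    if name is None:
        return set()
    return set(indexes[name].get(term.lower(), set()))


INDEXES = build_indexes(RECORDS, INDEX_CONFIG)
print(sorted(search(INDEXES, INDEX_CONFIG, 1004, "Larson")))
```

In this toy data "Larson" also appears in the title field of rec2, but since <FLD245> is not listed in the configuration, an author search with use attribute 1 or 1004 only matches rec1.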

This setup 1. lets you generate indexes using only the precise
elements within a document that you want; 2. employs an *already
defined* common set of search attributes; and 3. makes mapping from
the search attributes to the elements of your choice relatively trivial.
I think Cheshire II also demonstrated (at least to my satisfaction)
that you *cannot* have a distributed search system run by multiple
providers that provides this level of sophistication in searching
*without* a decent search protocol for communication. That may or may
not mean Z39.50, but it definitely does not mean HTTP; any efforts to
try to implement more precise, distributed searching on top of HTTP
are going to be hamstrung from the start.

>Anyway, if we want more mind-boggling problems to ponder, what about the
>actual syntax used to search? Everyone's talking about the trickiness of
>searching for the <author> field, but what about its context? What if I
>only want books written by a certain author and not magazine articles?

Specifying a query for only books by an author and not magazine articles is a
straightforward boolean query, as long as whoever generated the indexes
did their job properly. But you're right that contextual searches can
get tricky (C. M. Sperberg-McQueen has some great example searches that
combine contextual searching with partial document retrieval that
he's been using to make search engine designers, including myself, acutely
uncomfortable). I think the most important things for the XML community
to keep in mind at this point are 1. document designers and authors should
be able to do more or less as they please in creating documents, and not be
constrained in their work by the needs of the information retrieval
community, and 2. efforts to better support retrieval of XML/SGML documents
should build on existing standards for query language and search engine
design, and not reinvent the wheel. There are a lot of bright people
working on the problem of intelligent retrieval of structured documents;
I suspect it won't be very long at all before decent search engine
support for structured document querying is commonplace.
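
To illustrate the earlier point that "books by an author, not magazine articles" is a plain boolean query once the right indexes exist, here is a minimal Python sketch. It assumes the indexer produced both an author index and a document-type index; the index contents, record IDs, and the function name are invented for the example.

```python
# Toy inverted indexes: keyword -> set of record ids. All data is invented.
AUTHOR_INDEX = {
    "smith": {"r1", "r2", "r3"},
    "jones": {"r4"},
}
DOCTYPE_INDEX = {
    "book": {"r1", "r4"},
    "article": {"r2", "r3"},
}


def books_by(author):
    """author AND doctype=book AND NOT doctype=article, as set operations."""
    hits = AUTHOR_INDEX.get(author.lower(), set())
    return (hits & DOCTYPE_INDEX["book"]) - DOCTYPE_INDEX["article"]


print(sorted(books_by("Smith")))
```

If no document-type index had been generated, there would be nothing to intersect against, which is the sense in which the query is only straightforward when the indexing was done properly.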





Jerome McDonough -- [log in to unmask] | (......)
Library Systems Office, 386 Doe, U.C. Berkeley | \ * * /
Berkeley, CA 94720-6000 (510) 642-5168 | \ <> /
"Well, it looks easy enough...." | \ -- / SGNORMPF!!!
         -- From the Famous Last Words file | ||||
