
At 12:26 PM 12/9/1998 -0000, Mark wrote:
>This implies to me that the next generation of index servers will have
>to take more than simply a document's DTD into account - since this
>relationship cannot be expressed there. (And if a document has no DTD
>then that's even worse. If nothing else you don't know what fields could
>be there but are empty.)

Right, and actually not that difficult a problem.  Not to proselytize
for Cheshire II *too* much, but I think the approach Ray Larson developed
for that system addresses this problem nicely.  A single configuration
file for the search/indexing engine specifies what indexes to generate
(author index, title index, subject index, etc.), specifies what elements
(or subelements of particular elements) for a particular document class
should be used to generate those indexes, and maps those indexes to
a well-defined set of search attributes (author, title, subject, etc.)
in the Z39.50 search protocol.  As an example, the configuration file
that we developed for a library catalog system that used an SGML version
of MARC bibliographic records specified that an author index should be
generated using subelements <a>, <q>, <b>, and <c> of elements
<FLD100>, <FLD700>, and <FLD800> (if you're a MARC junkie, that made
sense, trust me), and that Z39.50 searches using the BIB-1 Use attributes
1 and 1004 (personal name and author-personal name) should be mapped
to that index.  In this setup, a Z39.50 client which requested a search
for an author using a personal name would have that search conducted
against the author index, which was generated by extracting the keywords
contained only within the subelements of the elements specified
in the config file.
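
To make that concrete, here is a minimal Python sketch of the idea.
It is *not* Cheshire II's actual configuration syntax or code: the
CONFIG layout, the function names, the sample records, and the title
entry (FLD245 mapped to Use attribute 4) are all assumptions made up
for illustration.  Only the author entry mirrors the setup described
above.

```python
import re
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical configuration layout (not Cheshire II's real file format).
# Each index names the elements/subelements to extract from and the
# BIB-1 Use attributes that should be routed to it.
CONFIG = {
    "author": {
        "elements": ["FLD100", "FLD700", "FLD800"],
        "subelements": ["a", "q", "b", "c"],
        "use_attributes": [1, 1004],  # personal name, author-personal name
    },
    "title": {  # assumed entry, for contrast only
        "elements": ["FLD245"],
        "subelements": ["a", "b"],
        "use_attributes": [4],  # title
    },
}

# Made-up sample records in an SGML-MARC-like shape.
RECORDS = {
    "rec1": "<USMARC><FLD100><a>Doe, Jane</a></FLD100>"
            "<FLD245><a>Structured retrieval</a></FLD245></USMARC>",
    "rec2": "<USMARC><FLD245><a>A letter to Jane Doe</a></FLD245></USMARC>",
}

def keywords(text):
    """Lowercase the text and split it into simple keyword tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_indexes(records, config):
    """Build keyword -> record-id postings from the configured subelements only."""
    indexes = {name: defaultdict(set) for name in config}
    for rec_id, xml_text in records.items():
        root = ET.fromstring(xml_text)
        for name, spec in config.items():
            for element in spec["elements"]:
                for field in root.iter(element):
                    for sub in field:
                        if sub.tag in spec["subelements"] and sub.text:
                            for word in keywords(sub.text):
                                indexes[name][word].add(rec_id)
    return indexes

def use_attribute_map(config):
    """Map each Use attribute number to the name of the index that serves it."""
    return {attr: name
            for name, spec in config.items()
            for attr in spec["use_attributes"]}

def search(indexes, attr_map, use_attribute, term):
    """Run a keyword search against whichever index the Use attribute maps to."""
    index = indexes[attr_map[use_attribute]]
    hits = [index.get(word, set()) for word in keywords(term)]
    return set.intersection(*hits) if hits else set()

indexes = build_indexes(RECORDS, CONFIG)
attr_map = use_attribute_map(CONFIG)
print(search(indexes, attr_map, 1004, "doe"))  # author search: rec1 only
print(search(indexes, attr_map, 4, "doe"))     # title search: rec2 only
```

The point of the sketch is that the client never needs to know about
<FLD100> or its subelements; it asks for Use attribute 1004 and the
server-side mapping decides which index, and therefore which elements,
answer the query.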

This setup 1. lets you generate indexes using only the precise
elements within a document that you want; 2. employs an *already
defined* common set of search attributes; and 3. makes mapping from
the search attributes to the elements of your choice relatively trivial.
I think Cheshire II also demonstrated (at least to my satisfaction)
that you *cannot* have a distributed search system run by multiple
providers that provides this level of sophistication in searching
*without* a decent search protocol for communication.  That may or may
not mean Z39.50, but it definitely does not mean HTTP; any efforts to
try to implement more precise, distributed searching on top of HTTP
are going to be hamstrung from the start.

>Anyway, if we want more mind-boggling problems to ponder, what about the
>actual syntax used to search? Everyone's talking about the trickiness of
>searching for the <author> field, but what about its context? What if I
>only want books written by a certain author and not magazine articles?

Specifying a query for only books by an author and not magazine articles is a
straightforward boolean query, as long as whoever generated the indexes
did their job properly.  But you're right that contextual searches can
get tricky (C. M. Sperberg-McQueen has some great example searches that
combine contextual searching with partial document retrieval that
he's been using to make search engine designers, including myself, acutely
uncomfortable).  I think the most important things for the XML community
to keep in mind at this point are 1. document designers and authors should
be able to do more or less as they please in creating documents, and not be
constrained in their work by the needs of the information retrieval
community, and 2. efforts to better support retrieval of XML/SGML documents
should build on existing standards for query language and search engine
design, and not reinvent the wheel.  There are a lot of bright people
working on the problem of intelligent retrieval of structured documents;
I suspect it won't be very long at all before decent search engine
support for structured document querying is commonplace.
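
As a footnote on the books-versus-articles case mentioned earlier,
here is a small Python sketch of why it reduces to plain boolean
operations once the indexes exist.  The index names ("author",
"doctype"), the terms, and the record ids are all hypothetical; this
assumes whoever built the indexes also indexed the document type.

```python
# Hypothetical postings: index name -> term -> set of record ids.
INDEXES = {
    "author": {"doe": {"r1", "r2", "r3"}},
    "doctype": {"book": {"r1", "r4"}, "article": {"r2", "r3"}},
}

def lookup(index, term):
    """Return the set of record ids posted under a term in the named index."""
    return INDEXES[index].get(term, set())

# author = doe AND doctype = book
books_by_doe = lookup("author", "doe") & lookup("doctype", "book")

# author = doe AND NOT doctype = article
doe_not_articles = lookup("author", "doe") - lookup("doctype", "article")

print(books_by_doe, doe_not_articles)
```

Both forms pick out the same record here, but they differ in general:
the AND NOT form would also return anything by the author that is
neither a book nor an article.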





Jerome McDonough -- [log in to unmask]  |  (......)
Library Systems Office, 386 Doe, U.C. Berkeley     |  \ *  * /
Berkeley, CA 94720-6000    (510) 642-5168          |  \  <>  /
"Well, it looks easy enough...."                   |   \ -- /  SGNORMPF!!!
         -- From the Famous Last Words file        |    ||||