At 11:01 AM 9/23/98 +0100, Senior, Chris wrote:
>Ok, this is fine but surely we still have a problem. If I did a search for
>"International Standard Book Number" - My "thicko" browser/search engine
>does not "understand" that "ISBN" and "International Standard Book Number"
>are the same thing. So it won't return the information I want, even though
>it is there.
This is indeed a problem, and ISBNs are by no means the worst example.
Imagine trying to do a search for author/auteur/autor/autore/etc.
However, there are a few points to keep in mind that, while they
don't exactly solve this problem, do highlight the benefits of XML
1. XML *is* beneficial to search engines, although perhaps not in
the context of global Internet searching that keeps getting hyped
(or at least not in the *way* it's currently being hyped; see below).
Having helped build a search engine that searches SGML data (see
http://cheshire.lib.berkeley.edu), I can tell you that there are
real advantages to search engine designers in being able to build
a retrieval system that works with SGML/XML. By designing a search
engine which works with tagged text, you can build a system which
can be easily adapted to a large variety of search tasks. We've
used the Cheshire II system for both a production library catalog
and as a full-text retrieval system for the annual TREC competitions,
with no modifications to the search engine. We've even been able to
use it as part of an image retrieval system (for those interested,
see http://elib.cs.berkeley.edu/cgi-bin/blobs/blobworld?108014 for
an example search of image data using Cheshire). That's a big win; you
can see the paper we reference on the Cheshire site for more details.
Granted, what we're talking about here is searching within a very
local, restricted domain, but search engines get used for much more
than just overall Internet searching. Within that context, XML
can be a big help, giving a search engine something *real* to
index, instead of the hopeless glop that is HTML.
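To make the point concrete, here's a minimal sketch (my own toy example, not Cheshire's actual code) of what "something real to index" buys you: with tagged text you can build field-specific indexes keyed by element name, instead of one undifferentiated bag of words. The element names and records below are made up for illustration.

```python
# Toy field-aware indexer: one inverted index per XML element type.
import xml.etree.ElementTree as ET
from collections import defaultdict

records = """
<catalog>
  <record id="1"><author>Ayn Rand</author><title>Anthem</title></record>
  <record id="2"><author>Jules Verne</author><title>Paris in the Twentieth Century</title></record>
</catalog>
"""

def build_indexes(xml_text):
    # indexes["author"]["verne"] -> set of record ids
    indexes = defaultdict(lambda: defaultdict(set))
    root = ET.fromstring(xml_text)
    for rec in root.findall("record"):
        rid = rec.get("id")
        for field in rec:
            for word in field.text.lower().split():
                indexes[field.tag][word].add(rid)
    return indexes

idx = build_indexes(records)
print(sorted(idx["author"]["verne"]))  # records whose *author* field matches
```

An HTML page gives you nothing comparable to key on; here, a search for "verne" as an author can't be confused with "verne" appearing in a title.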
2. I think XML actually *can* be of benefit to more distributed searching,
but we'll have to give up on the current model of "one search engine
continually trolls the Internet to generate one humongous index".
Let's face it; that model isn't the most scalable anyway. It's amazing
it still works at all, and given the continuing explosion of materials
available over the Internet, it's not going to work much longer.
I'd say what we need instead is a much *larger* number of search engines,
all of them cooperating with each other. In point of fact, I'd say
that you'd probably want to see every webserver in the world come with
a search engine built in, and preferably one that speaks a sensible
IR protocol (ok, I'm a Z39.50 bigot, so sue me). If you *do* have
a wider distribution of search engines speaking a common IR protocol,
then XML is still a benefit. It allows you to define your own local
markup to assist with management/presentation of text, and generate
sensible indexes off that tagging structure, something that's not
possible with HTML. And you can map from those indexes to the common
search attributes the IR protocol defines.
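That mapping step can be sketched in a few lines. The table below uses Use attribute numbers from the Z39.50 Bib-1 attribute set (4 = Title, 7 = ISBN, 1003 = Author); the local element names, and the translate function itself, are hypothetical.

```python
# Map locally-defined XML indexes onto a common protocol's search
# attributes, so an incoming standard query can be answered from
# whatever local tagging scheme a site happens to use.
LOCAL_TO_BIB1 = {
    "author": 1003,  # Bib-1 Use attribute: Author
    "title": 4,      # Bib-1 Use attribute: Title
    "isbn": 7,       # Bib-1 Use attribute: ISBN
}

def translate_query(use_attribute, term, mapping=LOCAL_TO_BIB1):
    """Return (local_index, term) for an incoming Bib-1 Use attribute."""
    for local_index, attr in mapping.items():
        if attr == use_attribute:
            return (local_index, term)
    raise ValueError("unsupported Use attribute: %d" % use_attribute)

print(translate_query(1003, "verne"))  # -> ('author', 'verne')
```

The win is that the site's markup can be as idiosyncratic as it likes, so long as the mapping to the common attributes exists.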
Search engines that want to provide global search services can then
contact these smaller, local search engines, and say 'what kind of
searches can you perform? What indexes do you support?' and keep
a record of that information. The global search engine might even
do the sort of trolling it does now, but in a more directed fashion,
asking each smaller engine if it supports author indexes, and if so,
asking for a full listing of all authors indexed at that site, and
adding that to its own global author index.
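A toy sketch of that discovery-and-harvest loop, under the obvious assumptions (the LocalEngine interface, site names, and index contents are all invented; a real version would speak an IR protocol over the network rather than call methods):

```python
# Global engine asks each local engine what indexes it supports, then
# harvests the author lists into one merged global author index.
class LocalEngine:
    def __init__(self, site, indexes):
        self.site = site
        self._indexes = indexes  # e.g. {"author": {"rand", "verne"}}

    def supported_indexes(self):
        return set(self._indexes)

    def list_index(self, name):
        return self._indexes.get(name, set())

def build_global_author_index(engines):
    global_index = {}  # author term -> set of sites that index it
    for engine in engines:
        if "author" in engine.supported_indexes():
            for author in engine.list_index("author"):
                global_index.setdefault(author, set()).add(engine.site)
    return global_index

sites = [
    LocalEngine("lib.example.edu", {"author": {"rand", "verne"}}),
    LocalEngine("docs.example.org", {"title": {"anthem"}}),
    LocalEngine("books.example.net", {"author": {"verne"}}),
]
print(sorted(build_global_author_index(sites)["verne"]))
```

Note the global engine only pulls from sites that advertise an author index; the trolling becomes directed rather than exhaustive.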
XML by itself won't solve all the world's problems. But XML can play
a vital part in a much more intelligent, distributed global search
system. That's my take, anyway.
Jerome McDonough -- [log in to unmask] | (......)
Library Systems Office, 386 Doe, U.C. Berkeley | \ * * /
Berkeley, CA 94720-6000 (510) 643-2058 | \ <> /
"Well, it looks easy enough...." | \ -- / SGNORMPF!!!
-- From the Famous Last Words file | ||||