XML-L Archives

Subject: Re: Sorting Out Searching (Was "RE: The key Benefit of XML?")
From: Peter Flynn
Reply-To: General discussion of Extensible Markup Language
Date: Tue, 29 Sep 1998 01:46:49 +0100

> Does anybody else know what the search engines are up to?

Nottalot. As far as I know they are all waiting for XML to take off.
Those who are using a search engine with a parser beside it (like the
ill-fated PAT component of OpenText's engine) will be sitting pretty
because they'll be able to offer context-sensitive searching straight
away (although we've already seen the problem of how to detect which
element type names J Random Corp has used for "paragraph", "bulleted
item", "address", or "table cell"). Those who have been relying on
plain indexed text-search are going to be up the creek.

There are several keys to providing decent searching of XML:

a. robust indexing;

b. really fast search code;

c. sensitivity to markup, so you can guarantee you are only
   searching (i) genuine user PCDATA, (ii) known markup, e.g.
   looking for a specific attribute value, or (iii) both;

d. full parse, so you can offer the markup as a guide to the
   searcher (e.g. search for 3.142 in table cells only, provided
   all cell data in the table is numeric, or search for "quality
   of mercy" as a cited title within section headings only);

e. context-based return, so you get a sensible (define that :-)
   amount of the surrounding context, like a para's worth, not
   the whole damn 500 KB file (which you can request if the hit
   is the one you were looking for);

f. proximity search, Boolean search, search including stop-words

[and others]. Tim Bray can explain far better than I can why no-one
will ever buy this, and hence why there's no money in it.
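
As a rough illustration of points (c) and (e) only, here is a minimal
sketch using Python's standard xml.etree.ElementTree. The sample
document, its element type names (section, title, cite, para, table,
row, cell) and the search() helper are all made up for the example,
and a linear scan like this does nothing about (a) or (b).

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document; the element type names are assumptions.
SAMPLE = """<doc>
  <section>
    <title>On the <cite>quality of mercy</cite></title>
    <para>The phrase quality of mercy also turns up in running text.</para>
  </section>
  <table>
    <row><cell>3.142</cell><cell>2.718</cell></row>
  </table>
  <para note="3.142">No number in the text here.</para>
</doc>"""


def search(root, phrase, within, context):
    """Return the text of each `context` element that contains a `within`
    element whose character data includes `phrase`."""
    hits = []
    for ctx in root.iter(context):
        # iter() also yields ctx itself when its tag equals `within`.
        for target in ctx.iter(within):
            # itertext() yields character data only, so tag names and
            # attribute values are never matched.
            if phrase in "".join(target.itertext()):
                # Return the enclosing context element, not the whole file.
                hits.append(" ".join(" ".join(ctx.itertext()).split()))
                break
    return hits


root = ET.fromstring(SAMPLE)
print(search(root, "quality of mercy", within="cite", context="title"))
print(search(root, "3.142", within="cell", context="row"))
print(search(root, "3.142", within="para", context="para"))
```

The first call finds the phrase only where it is a cited title inside a
heading and ignores the same words in the running paragraph; the third
call returns nothing, because the only 3.142 outside the table is an
attribute value, not character data. Joining text nodes with a space is
a shortcut that would split a word containing inline markup.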

> "XML, So what!" is already echoing inside my head.  The adoption
> process has been a lot slower than I wanted.  While I'm not nearly
> ready to say it's failing, I can certainly understand the
> impatience.

I can sympathize, but the world is not quite ready for this level of
sophistication. Most users are grossly uninterested in what goes on
inside or how it works. Like my good colleagues in the Adult Education
Dept of another institution who searched for "adut education", they
don't understand why searches return irrelevant garbage 90% of the
time. (Nor should they need to, but for them to start asking for
better, they need to have demonstrated to them that "better" is
actually possible.) Don't forget that the search engines make money
from the adverts, regardless of whether the search results were what
you wanted. How many of you complain to the advertiser when the search
site fails to deliver, and how many advertisers could care?

