> Does anybody else know what the search engines are up to?
Nottalot. As far as I know, they are all waiting for XML to take off.
Those who are using a search engine with a parser beside it (like the
ill-fated PAT component of OpenText's engine) will be sitting pretty
because they'll be able to offer context-sensitive searching straight
away (although we've already seen the problem of how to detect which
element type names J Random Corp has used for "paragraph", "bulleted
item", "address", or "table cell"). Those who have been relying on
plain indexed text-search are going to be up the creek.
There are several keys to providing decent searching of XML:
a. robust indexing;
b. really fast search code;
c. sensitivity to markup, so you can guarantee you are only
searching (i) genuine user PCDATA, (ii) known markup, e.g.
looking for a specific attribute value, or (iii) both;
d. full parse, so you can offer the markup as a guide to the
searcher (e.g. search for 3.142 in table cells only provided
all cell data in the table is numeric, or search for "quality
of mercy" as a cited title within section headings only);
e. context-based return, so you get a sensible (define that :-)
amount of the surrounding context, like a para's worth: not
the whole damn 500KB file (which you can request if the hit
is the one you were looking for);
f. proximity search, Boolean search, search including stop-words
[and others]. Tim Bray can explain far better than I can why no-one
will ever buy this, and hence why there's no money in it.
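
To make (c), (d) and (e) above a bit more concrete, here is a minimal
sketch using Python's standard xml.etree.ElementTree. It is only an
illustration, not how any real engine does it: the element type names
(doc, section, head, title, para, table, row, cell), the sample text
and the find_hits function are all made up for the example, and it
does a plain linear scan of one parsed document where a real engine
would consult an index.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document; every element type name here is invented.
SAMPLE = """<doc>
  <section>
    <head>On the <title>quality of mercy</title></head>
    <para>A heading can cite a title.</para>
  </section>
  <section>
    <head>Numbers</head>
    <para>Someone typed quality of mercy in running text.</para>
    <table><row><cell>3.142</cell> <cell>2.718</cell></row></table>
  </section>
</doc>"""


def find_hits(root, phrase, target, inside=None,
              context_tags=("para", "head", "row")):
    """Search for phrase in the character data of <target> elements only.

    If inside is given, the target must have an ancestor of that type.
    Each hit is returned as the text of the nearest enclosing element
    listed in context_tags, not as the whole document.
    """
    # ElementTree keeps no parent pointers, so build a child -> parent map.
    parents = {child: parent for parent in root.iter() for child in parent}
    hits = []
    for elem in root.iter(target):
        if phrase not in "".join(elem.itertext()):
            continue
        # Collect the element and its ancestors, innermost first.
        chain = []
        node = elem
        while node is not None:
            chain.append(node)
            node = parents.get(node)
        if inside is not None and all(n.tag != inside for n in chain[1:]):
            continue
        # Nearest context container, falling back to the element itself.
        context = next((n for n in chain if n.tag in context_tags), elem)
        hits.append(" ".join("".join(context.itertext()).split()))
    return hits


root = ET.fromstring(SAMPLE)
# A cited title within a section heading only; the para is ignored.
print(find_hits(root, "quality of mercy", "title", inside="head"))
# A number in table cells only, returned with its row as context.
print(find_hits(root, "3.142", "cell"))
```

A plain text search for "quality of mercy" would hit both the heading
and the para in this sample; restricting the search by element type is
what separates them. It does nothing about the harder problem of
finding out which names J Random Corp actually used.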
> "XML, So what!" is already echoing inside my head. The adoption
> process has been a lot slower than I wanted. While I'm not nearly
> ready to say it's failing, I can certainly understand the
I can sympathize, but the world is not quite ready for this level of
sophistication. Most users are grossly uninterested in what goes on
inside or how it works. Like my good colleagues in the Adult Education
Dept of another institution who searched for "adut education", they
don't understand why searches return irrelevant garbage 90% of the
time. (Nor should they need to, but for them to start asking for
better, they need to have demonstrated to them that "better" is
actually possible.) Don't forget that the search engines make money
from the adverts, regardless of whether the search results were what
you wanted. How many of you complain to the advertiser when the search
site fails to deliver, and how many advertisers could care?