searching for search

Thu May 27 20:10:32 BST 1999

Disclaimer: I'm a Staff Engineer at Infoseek Corp. I work
on Ultraseek Server, our search product (which can search XML).

At 01:00 PM 5/23/99 -0700, Mark D. Anderson wrote:
>Regarding the recent "Indexing XML Document Collections" thread...
>
>I'm interested in these questions:
>
>- in general, why would I pick one of these over another
>(i.e. boolean query vs. structured query; scalability in size
>or requests; pluggable format drivers for source data;
>stemming and concept support; etc.)

What is your search problem? If professional writers are
searching in a repository they understand, you can give
them a fairly complex search interface. If Joe AOL is searching
your public site, you have to return good results for one-word
queries, and those results have to be on the first page.

>- in general, what are the features that push a technology
>into another level of complexity and why (i.e. what is so
>hard here?)

Making something "excellent" instead of "pretty good" usually
means that you have to actually deal with all the picky cases
instead of pretending they don't exist. For example, our
spider has special code for Lotus Domino, special code to
recognize directory listings generated by various web servers,
special code to handle spaces in filenames on MS FTP servers,
and so on. Our spider has a *lot* more code than our search engine.

>- specifically, what are the characteristics of each of
>these in performance/reliability/features (personal experience
>from non-vendors and public benchmarks are of course preferred,
>but vendor claims might be of interest too)

I'll mostly defer to customer evals and our product web site, 
but for scalability you can try out www.infoseek.de, which has 
about 10 million documents and does about 1 million queries/day. 
The search back end is stock Ultraseek Server, and the front end 
is a custom pagebuilder.

>- can i safely ignore the non open source ones without giving
>up capabilities

Not really, at least according to our customers. In some areas,
open source tools are competitive; in others, they aren't. Search
is one of the latter. We routinely beat free tools in customer evals.

Personally, I use a free editor (Emacs), but a commercial bug-tracking
system (Globetrack). You've got to make your own evaluations, of
course.

>- if all i wanted to do was boolean search on field values with
>no stemming/concept support, then regardless of how i did the
>indexing, what is wrong with using standard b-trees and/or just
>putting the index data in a sql db?

Relevancy ranking would be nice. Going through thirty pages of
hits really bites.
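
To make the ranking point concrete, here is a minimal sketch of what a
plain boolean match over a b-tree or SQL table does not give you: a score
for ordering the hits. It uses a simple TF-IDF sum, which is only one
textbook approach and is not a description of how Ultraseek Server ranks.
The names `tokenize` and `rank` and the sample documents are made up for
illustration.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace (no stemming, no punctuation handling)."""
    return text.lower().split()


def rank(query, docs):
    """Return (doc_id, score) pairs, best first, using a simple TF-IDF sum.

    docs maps a document id to its text. Documents that match no query
    term are left out, which is what a boolean OR would also do; the
    difference is that the survivors come back in a useful order.
    """
    tokenized = {doc_id: tokenize(text) for doc_id, text in docs.items()}
    n_docs = len(tokenized)

    # Document frequency: in how many documents does each term appear?
    df = Counter()
    for terms in tokenized.values():
        df.update(set(terms))

    scores = {}
    for doc_id, terms in tokenized.items():
        tf = Counter(terms)
        score = 0.0
        for term in tokenize(query):
            if tf[term]:
                # Rare terms count for more than common ones.
                idf = math.log(1 + n_docs / df[term])
                # Normalize by length so long documents do not win by bulk.
                score += (tf[term] / len(terms)) * idf
        if score > 0:
            scores[doc_id] = score

    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


if __name__ == "__main__":
    sample = {
        "a": "xml search engine for xml documents",
        "b": "cooking with xml once and many other unrelated words here",
        "c": "gardening tips",
    }
    for doc_id, score in rank("xml", sample):
        print(doc_id, round(score, 3))
```

A SQL query can find rows "a" and "b" just as easily; the ordering is
the part you would have to build yourself.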

And stemming does help. Phrase search helps a lot. Counting
inter-site links helps with very short queries. Anti-spam
algorithms help. Field weights help. Find Similar (query by
example) is useful. Indexing Microsoft Word, PDF, PostScript,
and XML is handy. And so on.
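
As one example from that list, phrase search needs word positions in the
index, not just a term-to-document mapping. The sketch below shows the
idea with an in-memory positional index. It is an illustration under
simple assumptions (whitespace tokenization, exact adjacency), not the
way any particular product does it, and `build_positional_index` and
`phrase_search` are hypothetical names.

```python
from collections import defaultdict


def build_positional_index(docs):
    """Map term -> doc_id -> list of word positions where the term occurs."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in docs.items():
        for pos, term in enumerate(text.lower().split()):
            index[term][doc_id].append(pos)
    return index


def phrase_search(index, phrase):
    """Return sorted ids of documents containing the words of phrase adjacently."""
    terms = phrase.lower().split()
    if not terms or any(term not in index for term in terms):
        return []

    # Boolean AND first: only documents containing every term can match.
    candidates = set(index[terms[0]])
    for term in terms[1:]:
        candidates &= set(index[term])

    hits = []
    for doc_id in sorted(candidates):
        following = [set(index[term][doc_id]) for term in terms[1:]]
        # A match is a start position whose successors hold the later terms.
        for start in index[terms[0]][doc_id]:
            if all(start + offset + 1 in positions
                   for offset, positions in enumerate(following)):
                hits.append(doc_id)
                break
    return hits


if __name__ == "__main__":
    sample = {
        1: "indexing xml document collections",
        2: "document indexing for xml",
        3: "xml document",
    }
    idx = build_positional_index(sample)
    print(phrase_search(idx, "xml document"))
```

All three sample documents pass a boolean AND on "xml" and "document";
only the ones with the words side by side pass the phrase test.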

Finally, please add this commercial product to your list:

Ultraseek Server: http://software.infoseek.com/

wunder

--
Walter R. Underwood
wunder at infoseek.com
wunder at best.com (home)
http://software.infoseek.com/cce/ (my product)
http://www.best.com/~wunder/
1-408-543-6946

xml-dev: A list for W3C XML Developers. To post, mailto:xml-dev at ic.ac.uk
Archived as: http://www.lists.ic.ac.uk/hypermail/xml-dev/ and on CD-ROM/ISBN 981-02-3594-1
List coordinator, Henry Rzepa (mailto:rzepa at ic.ac.uk)