searching for search

Mark D. Anderson mda at
Sun May 23 22:02:14 BST 1999

Regarding the recent "Indexing XML Document Collections" thread...

I've been doing some breadth-first searching for indexing/query
technology, and here is a summary of what I've learned.
I'm posting this because I'm interested in the area but don't
have the time to investigate all of these, and it seems like
there are some real experts on this list.

I'm interested in these questions:

- in general, why would I pick one of these over another?
(e.g. boolean query vs. structured query; scalability in size
or requests; pluggable format drivers for source data;
stemming and concept support; etc.)

- in general, what are the features that push a technology
into another level of complexity and why (i.e. what is so
hard here?)

- specifically, what are the characteristics of each of
these in performance/reliability/features (personal experience
from non-vendors and public benchmarks are of course preferred,
but vendor claims might be of interest too)

- can I safely ignore the non-open-source ones without giving
up capabilities?

- if all I wanted to do was boolean search on field values with
no stemming/concept support, then regardless of how I did the
indexing, what is wrong with using standard b-trees and/or just
putting the index data in a sql db?
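
To make that last question concrete, here is a minimal sketch of what
"boolean search on field values in a sql db" could look like, using Python
and its standard sqlite3 module. The table layout, the function names
(build_index, search) and the sample documents are all hypothetical, made up
for illustration; none of it comes from the packages listed in this post.
It assumes exact, case-insensitive matching on whole field values.

```python
import sqlite3


def build_index(conn, docs):
    """Store one (field, value, doc_id) row per field value.

    `docs` maps doc_id -> {field: [values]}. The SQL index created here
    is an ordinary b-tree, which is all the lookup below relies on.
    """
    conn.execute(
        "CREATE TABLE IF NOT EXISTS postings ("
        "field TEXT NOT NULL, value TEXT NOT NULL, doc_id TEXT NOT NULL)"
    )
    conn.execute(
        "CREATE INDEX IF NOT EXISTS postings_by_value "
        "ON postings (field, value, doc_id)"
    )
    rows = [
        (field, value.lower(), doc_id)
        for doc_id, fields in docs.items()
        for field, values in fields.items()
        for value in values
    ]
    conn.executemany("INSERT INTO postings VALUES (?, ?, ?)", rows)
    conn.commit()


def search(conn, must, must_not=()):
    """Return sorted doc_ids matching every (field, value) pair in `must`
    and none of the pairs in `must_not` (boolean AND / AND NOT)."""
    if not must:
        raise ValueError("need at least one required term")
    term = "SELECT doc_id FROM postings WHERE field = ? AND value = ?"
    # SQLite evaluates compound SELECTs left to right, so this reads as
    # ((t1 INTERSECT t2 ...) EXCEPT n1) EXCEPT n2 ...
    sql = " INTERSECT ".join([term] * len(must))
    sql += "".join(" EXCEPT " + term for _ in must_not)
    params = [
        p
        for field, value in list(must) + list(must_not)
        for p in (field, value.lower())
    ]
    return sorted({row[0] for row in conn.execute(sql, params)})


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    build_index(conn, {
        "a.xml": {"author": ["Anderson"], "keyword": ["xml", "index"]},
        "b.xml": {"author": ["Anderson"], "keyword": ["sgml"]},
        "c.xml": {"author": ["Rzepa"], "keyword": ["xml"]},
    })
    # author = Anderson AND keyword = xml
    print(search(conn, [("author", "Anderson"), ("keyword", "xml")]))
    # author = Anderson AND NOT keyword = xml
    print(search(conn, [("author", "Anderson")], must_not=[("keyword", "xml")]))
```

This only covers exact-match AND / AND NOT. It does nothing for ranking,
phrase or proximity queries, or stemming, so it does not answer the question
of what the dedicated engines add beyond this.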

indexing/query technologies
what: sgrep
license: GPL
comment: does structured document grep, with an indexing phase.

what: Xtract
license: GPL
comment: another xml grep; more XQL-like. no indexing.

what: swish (Simple Web Indexing System for Humans)
license: sort of free
comment: see swish-e

what: swish-e (swish-enhanced)
license: GPL
comment: focused specifically on web site indexing.

what: MG (managing gigabytes)
license: GPL
comment: based on the book "Managing Gigabytes"; the commercial version is SIM.

what: wais and freeWAIS and freewais-sf/SFgate
comment: now supplanted by Isearch/Isite.

what: Isearch
license: non-copyleft free.
comment: Isearch is behind dmoz/newhoo.

what: dig or "ht://dig"
license: GPL

what: glimpse
license: non-commercial use, open source.

commercial:
 Excalibur RetrievalWare
 oracle intermedia
 fulcrum (now pcdocs)
 OpenText (soon to be pcdocs?)

no cost, but object code only:
 excite for web servers
 PLS (acquired by AOL)
 thunderstone webinator

"XML Servers" (which can mean anything)
 odi excelon
 softwareag tamino
 poet cms
 oracle ifs, dbweb, etc.

query/search languages and standards

Z39.50
 aka ISO 23950; formerly ISO 10162 and ISO 10163.
 basically the U.S. started branching the original ISO standard, and now they lead the ISO standard.
 WAIS was based on the first version, Z39.50-1988.

GILS (government information locator service)
 for technology, just aggregates other projects (uses Isearch, htdig, etc.).
 at a standards level, it subsets Z39.50 and articulates some 150 specific attributes/elements for semantics,
  in the "GILS Profile"
 [there, I've now saved you from reading a horrific amount of verbiage.]

 a standardization effort like GILS. subsets Z39.50.
 complementary (sort of) to publication/metadata/robots.txt standards like dublin/rdf.

SDQL (structured document query language)
 DSSSL thing.

SOIF (Summary Object Interchange Format)
 first made up by Harvest in 1994.

CIP (Common Indexing Protocol)
 output of the moribund IETF FIND working group

XQL and XML-QL and a gazillion more


Search UI
comment: web interface to WAIS and SWISH search engines

what: webglimpse
comment: web interface

what: HURL (Hypertext Usenet Reader & Linker)
license: will be free software.
comment: uses glimpse underneath

what: harvest
comment: just does the spidering; the indexing is done with glimpse
(verity etc. could be used instead of glimpse).
does provide a "Broker" cgi around the indexer.
maps SGML to "SOIF".

Papers/Reading on IR

xml-dev: A list for W3C XML Developers.