searching for search
Edward C. Zimmermann
edz at bsn.com
Sun May 23 23:27:07 BST 1999
>
> Regarding the recent "Indexing XML Document Collections" thread...
>
> I've been doing some breadth-first search for indexing/query
> technology, and here is a summary of what i've learned.
> I'm posting this because I'm interested in the area but don't
> have the time to investigate all these, and it seems like
> there are some real experts on this list.
>
> I'm interested in these questions:
>
> - in general, why would I pick one of these over another
> (i.e. boolean query vs. structured query; scalability in size
> or requests; pluggable format drivers for source data;
> stemming and concept support; etc.)
Why pick one over the other?
Well, I'd forget about stemming and concept support: both sound
good on paper but, IMHO, create more noise than anything else
(except in some special conditions)....
>
> - in general, what are the features that push a technology
> into another level of complexity and why (i.e. what is so
> hard here?)
Designing fulltext engines is not difficult :-)
>
> - specifically, what are the characteristics of each of
> these in performance/reliability/features (personal experience
> from non-vendors and public benchmarks are of course preferred,
> but vendor claims might be of interest too)
>
> - can i safely ignore the non open source ones without giving
> up capabilities
What do you mean (I seem to be on a roll at not understanding
questions these days)?
>
> - if all i wanted to do was boolean search on field values with
> no stemming/concept support, then regardless of how i did the
> indexing, what is wrong with using standard b-trees and/or just
> putting the index data in a sql db?
The short answer: it depends upon what you want to do.
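As a rough illustration of the plain boolean case (a minimal sketch, not how
Isearch or any product listed below works): boolean search on field values
comes down to set operations over the document ids stored per (field, token)
key. The sample documents and the helper names build_index and lookup are
made up for the example.

```python
from collections import defaultdict


def build_index(docs):
    """Map each (field, token) pair to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, fields in docs.items():
        for field, text in fields.items():
            # Naive tokenization: lowercase and split on whitespace, no stemming.
            for token in text.lower().split():
                index[(field, token)].add(doc_id)
    return index


def lookup(index, field, token):
    """Return the set of document ids whose field contains the token."""
    return index.get((field, token.lower()), set())


# Hypothetical sample data, invented for illustration only.
docs = {
    1: {"title": "Indexing XML collections", "author": "smith"},
    2: {"title": "Searching SGML archives", "author": "jones"},
    3: {"title": "XML query languages", "author": "jones"},
}
index = build_index(docs)

# Boolean operators become set operations on the looked-up id sets.
xml_and_jones = lookup(index, "title", "xml") & lookup(index, "author", "jones")
xml_or_sgml = lookup(index, "title", "xml") | lookup(index, "title", "sgml")
xml_not_jones = lookup(index, "title", "xml") - lookup(index, "author", "jones")
```

The same (field, token) to document-id mapping could just as well live in a
b-tree or a SQL table keyed on those two columns. What the sketch leaves out
(ranking, proximity, phrase search, index size) is where "what you want to do"
starts to matter.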
>
>
> indexing/query technologies
> ---------------------------
>
> what: Isearch
> url: http://www.etymon.com/Isearch
> license: non-copyleft free.
> comment: Isearch is behind dmoz/newhoo (http://www.news.com/News/Item/0,4,28964,00.html?st.cn.News.today.ne)
Well... It's more that DMOZ/NewHoo uses the public Isearch (as do many other
sites).
>
> what: dig or "ht://dig"
> url: http://www.htdig.org/
> license: GPL
>
> what: glimpse
> url: http://glimpse.cs.arizona.edu/
> license: non-commercial use, open source.
>
> commercial:
> Readware http://www.readware.com/products.htm
> Excalibur RetrievalWare http://www.excalib.com/
> verity http://www.verity.com
> oracle intermedia http://www.oracle.com
> fulcrum http://www.fulcrum.com (now pcdocs)
> OpenText http://www.opentext.com/ (soon to be pcdocs?)
Actually, OpenText was trying to acquire PCDOCS, not the other way around....
And PCDOCS/Fulcrum goes to Hummingbird...
> SIM: http://www.mds.rmit.edu.au
We too support XML:
http://www.bsn.com/Z39.50
There are, of course, *many* more products out there (and many that support
XML and Z39.50 as well).
>
>
> query/search languages and standards
> -------------------------------------
>
> Z39.50-1995 http://lcweb.loc.gov/z3950/agency
> aka ISO 23950 ; formerly ISO 10162 and ISO 10163.
> basically the U.S. started branching the original ISO standard, and now they lead the ISO standard.
> WAIS was based on the first version Z39.50-1988.
> see also http://www.faqs.org/rfcs/rfc1729.html
> for history see http://mirrored.ukoln.ac.uk/lis-journals/dlib/dlib/dlib/april97/04lynch.html and
> http://slis6000.slis.uwo.ca/~jxerri/index.html
>
> GILS (government information locator service) http://www.gils.net/locator.html
> for technology, just aggregates other projects (uses Isearch, htdig, etc.).
> at a standards level, it subsets Z39.50 and articulates some 150 specific attributes/elements for semantics,
> in the "GILS Profile" http://www.gils.net/prof_v2.html
> [there, i've now saved you from reading a horrific amount of verbiage.]
Actually not. You are not talking here about GILS but about the ASF freeware...
ASF is a framework for GILS, but it's about more than just GILS: it's about
distributed search and retrieval (S/R).
It's something like Z39.50 + Whois++ + Gathering +.
The freeware was a mish-mash, but I won't talk about that :-)
If one is also talking about distributed search, one might want to refer to the work
in the TF-CHIC at Terena (http://www.terena.nl/task-forces/tf-chic/).
In this context one might also want to talk about the work on XER (XML Encoding) of
Z39.50 and IETF's WebDAV.....
>
> STARTS http://www-db.stanford.edu/~gravano/starts.html
> a standardization effort like GILS. subsets Z39.50.
> complementary (sort of) to publication/metadata/robots.txt standards like dublin/rdf.
No... But.... Actually, STARTS was proposed to compete with Z39.50 under the notion that
"Z39.50 is too complicated". The counter from the ZIG was ZSTARTS...
>
> SDQL (structured document query language)
> DSSSL thing. http://www.jclark.com/dsssl/sgml95/sdql.html, http://www.jclark.com/dsssl/IS/dsssl85.htm
>
> SOIF (Summary Object Interchange Format)
> first made up by Harvest in 1994.
SOIF is not a query/search language or protocol but, as the name says, an "interchange format".
There are many of these around, including the IAFA recs, which have taken on a new lease of
life within ROADS (http://www.ilrt.bris.ac.uk/roads/).
This all belongs under the chapter "Resource Discovery and Metadata"...
>
> CIP (Common Indexing Protocol)
> output of the moribund ietf FIND working group
>
> XQL and XML-QL and a gazillion more http://www.w3.org/TandS/QL/QL98/pp.html
>
> OQL http://www.odmg.org/standard/odmgbookextract.htm#Chapter 4
>
>
--
______________________
<A HREF="whois://rs.internic.net/ecz">Edward C. Zimmermann</A>
<A HREF="http://www.bsn.com/">Basis Systeme netzwerk/Munich</A>