searching for search

Mon May 24 04:07:34 BST 1999

On Sun, May 23, 1999 at 11:31:43PM +0200, Edward C. Zimmermann wrote:
> > - in general, what are the features that push a technology
> > into another level of complexity and why (i.e. what is so
> > hard here?)
> Designing fulltext engines is not difficult :-)

Making them updateable, small and fast is.  Full structure indexing,
in particular, is easy in principle but quite awkward to make
efficient and small in a dynamic update environment.  Phrase querying
is another feature that has to be implemented carefully if it is to
be fast without causing the index to blow out in size.
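
To make the phrase-querying trade-off concrete, here is a minimal
sketch in Python.  It is purely illustrative and is not how SIM or any
particular engine does it; the names build_positional_index and
phrase_query are made up for this example.  The point is that
answering a phrase query from the index alone requires storing every
position of every term, and that is where the space goes.

```python
from collections import defaultdict


def build_positional_index(docs):
    """Map term -> {doc_id: [positions]}.

    Keeping a position list per term per document is what makes
    phrase queries possible, and also what inflates the index.
    """
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for pos, term in enumerate(text.lower().split()):
            index[term].setdefault(doc_id, []).append(pos)
    return index


def phrase_query(index, phrase):
    """Return sorted doc ids in which the phrase terms occur consecutively."""
    terms = phrase.lower().split()
    if not terms or any(t not in index for t in terms):
        return []
    # Only documents that contain every term can possibly match.
    candidates = set(index[terms[0]])
    for t in terms[1:]:
        candidates &= set(index[t])
    hits = []
    for doc_id in candidates:
        later = [set(index[t][doc_id]) for t in terms[1:]]
        for start in index[terms[0]][doc_id]:
            # Term i+1 of the phrase must sit exactly i+1 places after the first.
            if all(start + i + 1 in later[i] for i in range(len(later))):
                hits.append(doc_id)
                break
    return sorted(hits)


# Hypothetical three-document collection.
docs = {
    1: "full text engines are easy",
    2: "text full of structure",
    3: "updateable full text index",
}
index = build_positional_index(docs)
```

A real engine would compress the position lists and avoid building
sets at query time; this sketch ignores both, along with updates.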

It is quite common for engines to generate indexes that are two to ten
times the size of the data, and this is generally considered
acceptable.  (SIM rarely goes over half the size of the data.  We
haven't implemented full structure or phrase indexing yet, but we have
done some research to establish the cost: structure indexes should be
minimal in size, and efficient phrase indexes should roughly double
our current index size.)
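
As a back-of-the-envelope illustration of those ratios, the following
Python sketch works out index sizes for a 100GB collection.  The
collection size is just an example figure, the ratios are the rough
ones quoted above, and index_size_gb is a made-up helper name.

```python
def index_size_gb(data_gb, ratio):
    """Estimated index size for a collection, given an index/data ratio."""
    return data_gb * ratio


data_gb = 100  # example collection size, in GB

# Ratios are the rough figures from the text, not measurements.
estimates = {
    "typical engine, low end (2x)": index_size_gb(data_gb, 2.0),
    "typical engine, high end (10x)": index_size_gb(data_gb, 10.0),
    "SIM today (about 0.5x)": index_size_gb(data_gb, 0.5),
    "SIM with phrase indexes (about 1x)": index_size_gb(data_gb, 1.0),
}

for label, size in estimates.items():
    print(f"{label}: {size:.0f} GB")
```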

Here is a list of features that have the potential to push up
complexity (the list is neither comprehensive nor in any particular
order):

  * Size minimisation
  * Large collections (e.g. exceeding 2 or 4 GB, which can pose
    unique problems that are non-trivial to solve)
  * Performance
  * Interactive updates
  * Full-structure querying
  * Phrase querying
  * Transactions
  * Incremental backups (important for large collections)
  * Multi-database queries
  * Multi-database ranked/sorted queries
  * Multi-server queries
  * Multi-server ranked/sorted queries
  * Multi-server multi-vendor queries
  * Multi-server multi-vendor ranked queries

(I list the various multi-database options separately because each
of them introduces new and quite different issues, though some of the
issues may only arise in the context of Z39.50, with which we deal.)
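
One of the issues that ranking adds can be sketched in a few lines of
Python.  Each database scores on its own scale, so the result lists
cannot simply be interleaved by raw score.  The sketch below assumes
each source returns (score, doc_id) pairs and rescales by that
source's top score before merging.  That rescaling rule is only one
naive possibility, chosen for illustration; normalise and merge_ranked
are made-up names and nothing here comes from Z39.50.

```python
import heapq
from itertools import islice


def normalise(results):
    """Scale one source's scores into [0, 1] by dividing by its top score."""
    if not results:
        return []
    top = max(score for score, _ in results)
    if top <= 0:
        return [(0.0, doc) for _, doc in results]
    return [(score / top, doc) for score, doc in results]


def merge_ranked(result_lists, limit=10):
    """Merge per-source ranked lists into one list of doc ids, best first."""
    normalised = [
        sorted(normalise(r), key=lambda pair: pair[0], reverse=True)
        for r in result_lists
    ]
    # heapq.merge needs each input already sorted in the merge order.
    merged = heapq.merge(*normalised, key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in islice(merged, limit)]


# Hypothetical results: source A scores out of roughly 12, source B out of 1.
server_a = [(12.0, "a1"), (6.0, "a2")]
server_b = [(0.9, "b1"), (0.3, "b2")]
ranked = merge_ranked([server_a, server_b])
```

Merging on the raw scores would put both of A's documents ahead of
B's best hit simply because A uses bigger numbers.  Whether any
rescaling gives a meaningful combined ranking is a separate question,
and it gets harder when the servers come from different vendors.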

> > - specifically, what are the characteristics of each of
> > these in performance/reliability/features (personal experience
> > from non-vendors and public benchmarks are of course preferred,
> > but vendor claims might be of interest too)
> > 
> > - can i safely ignore the non open source ones without giving
> > up capabilities
> What do you mean (I seem to be on a roll at not understanding
> questions these days)?

Possibly the triple negative (_ignore_, _non_, _without_) contributed
in this case. :-)

> > - if all i wanted to do was boolean search on field values with
> > no stemming/concept support, then regardless of how i did the
> > indexing, what is wrong with using standard b-trees and/or just
> > putting the index data in a sql db?
> To make the answer short: depends upon what you want to do.

A slightly longer answer is, if you have 100GB of data that you want
to index in an SQL database then you'd better grab a terabyte of hard
disk and be prepared to wait a LONG time for your queries to come back
to you.

Cheers,
Marcelo

-- 
http://www.simdb.com/~marcelo/

xml-dev: A list for W3C XML Developers. To post, mailto:xml-dev at ic.ac.uk
Archived as: http://www.lists.ic.ac.uk/hypermail/xml-dev/ and on CD-ROM/ISBN 981-02-3594-1
List coordinator, Henry Rzepa (mailto:rzepa at ic.ac.uk)