What is a good database for very large collections?

Marcelo Cantos marcelo at mds.rmit.edu.au
Thu Feb 4 03:51:26 GMT 1999

On Mon, Feb 01, 1999 at 12:33:53PM -0500, Borden, Jonathan wrote:
> >
> > Can I try to shift it back to a vital question asked earlier, but not
> > answered?
> >
> > What is a good database for XML?

SIM (http://www.simdb.com/sim_2.1/)

> > The criteria are:
> >     * over 20,000,000 document fragments, each less than 256
> > characters, each with some flat metadata, able to be incrementally
> > reloaded onto the live system
> >     * about 30 simultaneous users accessing about 10 fragments a minute
> > each, grouped together (along with other dynamic data) and transformed,
> > with a high need for immediate response

We can load about 200 MB per hour while live (actually I think we can
load 400-500 MB/hr but we claim 200 MB to add a safety factor).  We
handle small documents quite well through DTD caching techniques (we
also plan to include expat in the near future for unvalidated XML. We
do currently support unvalidated XML, but through SP, which is not as
fast as we'd like).  Queries are fast (we queried "to be or not to be"
across 55 GB in 74 seconds on a 2x336 MHz UltraSPARC with 1 GB
RAM--note that this was a word position query using several stop
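As a rough check of the figures quoted in this thread against the stated criteria, here is a back-of-the-envelope sketch. It assumes one byte per character, ignores metadata and index overhead, and takes 1 MB as 10**6 bytes; the function names are illustrative only and are not part of SIM.

```python
# Back-of-the-envelope sizing from the figures quoted in this thread.
# Assumptions (not from the original post): 1 byte per character,
# metadata and index overhead ignored, 1 MB = 10**6 bytes.

FRAGMENTS = 20_000_000            # "over 20,000,000 document fragments"
MAX_FRAGMENT_BYTES = 256          # "each less than 256 characters"
LOAD_MB_PER_HOUR = 200            # the conservative live-load figure
USERS = 30                        # simultaneous users
FRAGMENTS_PER_USER_PER_MIN = 10   # fragments fetched per user per minute


def corpus_mb(fragments=FRAGMENTS, max_bytes=MAX_FRAGMENT_BYTES):
    """Upper bound on the raw text size, in MB."""
    return fragments * max_bytes / 1e6


def full_load_hours(size_mb, rate_mb_per_hour=LOAD_MB_PER_HOUR):
    """Hours needed to load size_mb at the given rate."""
    return size_mb / rate_mb_per_hour


def fetches_per_second(users=USERS, per_min=FRAGMENTS_PER_USER_PER_MIN):
    """Aggregate fragment fetch rate across all users."""
    return users * per_min / 60


def effective_mb_per_second(gigabytes=55, seconds=74):
    """Data covered per second by the quoted query (index-based, not a scan)."""
    return gigabytes * 1000 / seconds


if __name__ == "__main__":
    size = corpus_mb()
    print(f"corpus upper bound: {size:.0f} MB")
    print(f"full load at 200 MB/hr: {full_load_hours(size):.1f} hours")
    print(f"aggregate fetch rate: {fetches_per_second():.1f} fragments/s")
    print(f"quoted query covers about {effective_mb_per_second():.0f} MB/s")
```

Under these assumptions the whole collection is at most about 5 GB, so a complete reload at the conservative rate takes roughly a day, and the user load comes to about five fragment fetches per second.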

> How are the fragments selected? By query? If you can easily
> represent the 20M fragments in tabular form, and if you can easily
> represent the queries in SQL then a relational db is the way to go.
> this is not a particularly large, nor high-volume application for

And if you can't represent them in tabular form, try SIM.

> Ought you store the 20m fragments each in its own file ... probably
> not (a big waste). Ought you employ an ODBMS? not unless SQL
> wouldn't work well (you could always load it into say Oracle/SQL
> Server/DB2 etc vs. ODI/Poet etc and test it out). My expectation
> would be that if you need to run queries, the RDB will win.

For content queries (e.g. summary CONTAINS "stock option*") SIM will
easily outperform an RDBMS.  Customers have chosen our product over
RDBMSs for this very reason.
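To illustrate what a content query such as CONTAINS "stock option*" involves, here is a minimal sketch of a positional inverted index with trailing-wildcard support. This is a generic textbook technique, not SIM's actual implementation; the class and method names are hypothetical, and the sketch does no stop-word handling, stemming, or compression.

```python
import re
from collections import defaultdict


class PositionalIndex:
    """Toy positional inverted index (illustrative, not SIM's design)."""

    def __init__(self):
        # term -> doc_id -> list of word positions within that document
        self.postings = defaultdict(lambda: defaultdict(list))

    def add(self, doc_id, text):
        """Index one document, recording the position of every word."""
        for pos, word in enumerate(re.findall(r"\w+", text.lower())):
            self.postings[word][doc_id].append(pos)

    def _expand(self, token):
        """Return indexed terms matching a token; trailing '*' is a prefix match."""
        if token.endswith("*"):
            prefix = token[:-1]
            return [t for t in self.postings if t.startswith(prefix)]
        return [token] if token in self.postings else []

    def contains(self, phrase):
        """Return the ids of documents containing the phrase as adjacent words."""
        tokens = phrase.lower().split()
        if not tokens:
            return set()
        current = None  # doc_id -> positions where the phrase so far ends
        for token in tokens:
            positions = defaultdict(set)
            for term in self._expand(token):
                for doc, plist in self.postings[term].items():
                    positions[doc].update(plist)
            if current is None:
                current = dict(positions)
            else:
                nxt = {}
                for doc, ends in current.items():
                    # The next word must sit immediately after the previous one.
                    hit = {p + 1 for p in ends} & positions.get(doc, set())
                    if hit:
                        nxt[doc] = hit
                current = nxt
            if not current:
                return set()
        return set(current)


if __name__ == "__main__":
    idx = PositionalIndex()
    idx.add(1, "Employee stock options were granted.")
    idx.add(2, "The stock market has an option.")
    idx.add(3, "A stock option plan")
    print(sorted(idx.contains("stock option*")))
```

Because word positions are stored, the phrase match needs only the postings for the query terms rather than a pass over the document text, which is the general reason this kind of query suits a text index better than a LIKE-style table scan.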

> >     * constant data-mining tools using various ad hoc AI and linguistic
> > retrieval software augmenting the metadata in the background.

We support stored queries and scheduled queries with filters to exclude
previously returned records.  I'm not sure if this meets the above
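The idea of a stored query that filters out previously returned records can be sketched as follows. This is only an illustration of the concept under simple assumptions (records held in a dict keyed by id, the query expressed as a Python predicate); the StoredQuery class is hypothetical and says nothing about how SIM implements the feature.

```python
class StoredQuery:
    """Hypothetical stored query that returns each matching record only once."""

    def __init__(self, predicate):
        self.predicate = predicate  # function: record -> bool
        self.seen = set()           # ids already returned by earlier runs

    def run(self, records):
        """Return matching records not returned by any earlier run.

        records: dict mapping record id -> record.
        """
        fresh = {
            rid: rec
            for rid, rec in records.items()
            if rid not in self.seen and self.predicate(rec)
        }
        self.seen.update(fresh)  # remember the ids so they are excluded next time
        return fresh


if __name__ == "__main__":
    store = {1: "stock option plan", 2: "weather report"}
    query = StoredQuery(lambda text: "stock" in text)
    print(query.run(store))   # first scheduled run
    store[3] = "stock split announced"
    print(query.run(store))   # later run: only the newly added match
```

A background tool could call run() on a schedule and process only the records that arrived since its last pass.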

To say there are no scalable solutions (as someone did recently on
xml-dev) is simply false.  There may be no scalable solutions that do
everything you want--and I'm certainly not touting SIM as the be-all
and end-all (we have yet to support XQL, full path indexing,
transactions, etc.; all are pending with varying levels of
priority)--but there are products available right now that scale and
solve people's problems.

SIM has been used in law (http://www.thelaw.tas.gov.au is the world's
first legislation to officially go online), taxation
(http://www.ato.gov.au/general/advanced/adv.htm), other government
(libraries, NSA--no URL, sorry :-), aviation (Boeing), etc.  Moreover,
our customers don't go away dissatisfied.  We are quite proud of the
fact that every SIM site is a reference site.  We are also pleased
that in some instances, project managers have been promoted as a
result of using SIM!

Marcelo Cantos
SIM developer


xml-dev: A list for W3C XML Developers. To post, mailto:xml-dev at ic.ac.uk
Archived as: http://www.lists.ic.ac.uk/hypermail/xml-dev/ and on CD-ROM/ISBN 981-02-3594-1
To (un)subscribe, mailto:majordomo at ic.ac.uk the following message;
(un)subscribe xml-dev
To subscribe to the digests, mailto:majordomo at ic.ac.uk the following message;
subscribe xml-dev-digest
List coordinator, Henry Rzepa (mailto:rzepa at ic.ac.uk)
