Storing Lots of Fiddly Bits

Sun Jan 31 18:22:37 GMT 1999

W. Eliot Kimber wrote:
>
>
> At 12:55 PM 1/30/99 -0500, Borden, Jonathan wrote:
>
> >	In general, object databases have been designed to efficiently store lots
> >of C++ (or Java) objects which contain embedded pointers (or references) and
> >they provide a mechanism to navigate the database using the semantics of a
> >pointer dereference. They are not designed to *efficiently* perform complex
> >queries, especially those that SQL databases excel at.
>
> If this is the definition of object database, then I don't think it
> qualifies as a "database" at all--it's just persistent object storage,
> which is useful, but not very interesting.  At least my layman's idea of a
> "database" is that it is both general and supports queries.

	Many object databases do provide query capability, e.g. OQL. The point is
not that they aren't capable of handling queries, rather that relational
databases excel at queries. The question is not one of ability, rather
efficiency (and scalability).

>
> Of course, this has always been one of my problems with object-oriented
> programming in general: it tends to cause people to conflate the data with
> the processing to the degree that the objects end up becoming primary,
> rather than things that serve the data.  Persistent objects are useful as
> an optimization technique but they should never be a substitute for
> standards-based data repositories.

	Exactly! (from an OO guy :-)
>
...
>
> As an example of the cost of deserialization, we have a client with about
> 80 Meg of SGML data organized into about 15000 small documents (most
> documents are less than 2K in length).  On a 400 MHz Pentium II with 128 Meg
> of memory (running Windows NT) and gigs of free disk space, it takes 21
> hours to load this data into the repository (one of the leading SGML
> element manager databases, implemented on top of a leading object database)
> and 8-10 hours to export it.  And, unless we're doing something wrong, the
> import process does not include indexing of the data, only objectizing it.
> This seems a little extreme to me. It may be that this product is
> particularly poorly implemented or that we have failed to perform some
> essential tuning action, but still, 21 hours?  I hope that this anecdotal
> evidence is not indicative of other, similar systems, but it's not very
> encouraging.
>

	Scary, isn't it? The problem is this: suppose we wish to use XQL as a query
"language", and suppose the documents are stored as XML in files. What would
an index look like, and how would the processor efficiently represent
containment etc.? This is where SQL has trouble. For example, select href
from documents where tagname = 'p' is a piece of cake, but
selectSingleNode("/repository/*//*/chapter/*//*p"); (or something like that
:-)) makes things a bit more difficult in SQL.
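	To make the contrast concrete, here is a minimal sketch using Python's
sqlite3 module. The table layout (one row per element with a parent_id
column), the table and column names, and the sample rows are all assumptions
made up for illustration, not anything from the systems discussed here. The
tag-name lookup is a single indexed select; the containment question ("which
p elements sit somewhere under a chapter?") needs a recursive walk over the
parent links, which plain SQL without recursion cannot express in one
statement.

```python
import sqlite3

# Hypothetical layout: one row per element, each pointing at its parent.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE elements (
        id        INTEGER PRIMARY KEY,
        parent_id INTEGER REFERENCES elements(id),
        tagname   TEXT NOT NULL
    );
    CREATE INDEX idx_elements_tag ON elements(tagname);
""")

# Made-up sample tree: one p inside a chapter, one p outside any chapter.
rows = [
    (1, None, "repository"),
    (2, 1, "book"),
    (3, 2, "chapter"),
    (4, 3, "section"),
    (5, 4, "p"),
    (6, 2, "preface"),
    (7, 6, "p"),
]
conn.executemany("INSERT INTO elements VALUES (?, ?, ?)", rows)

# The easy case: a flat lookup by tag name, served by the index.
all_p = [r[0] for r in conn.execute(
    "SELECT id FROM elements WHERE tagname = 'p' ORDER BY id")]

# The containment case: collect every chapter and everything beneath it
# by following parent links, then keep only the p elements.
p_in_chapter = [r[0] for r in conn.execute("""
    WITH RECURSIVE under_chapter(id) AS (
        SELECT id FROM elements WHERE tagname = 'chapter'
        UNION
        SELECT e.id FROM elements e
        JOIN under_chapter u ON e.parent_id = u.id
    )
    SELECT e.id FROM elements e
    JOIN under_chapter u ON e.id = u.id
    WHERE e.tagname = 'p'
    ORDER BY e.id
""")]

print(all_p)
print(p_in_chapter)
```

	The recursive query is only available because SQLite supports WITH
RECURSIVE; without it, each level of nesting costs another self-join or
another round trip from application code.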

	the approach I believe might make most sense would be to store the
documents in an in-memory DOM or grove format (which isn't far off from what
an object database essentially is). The cost of doing so ought not be too
far off from building the DOM parse tree in the first place.
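	As a rough sketch of what querying an in-memory DOM looks like, the
following uses xml.dom.minidom from the Python standard library. The sample
document and the helper name paragraphs_in_chapters are made up for
illustration. Once the tree is in memory, containment is just a walk over
child pointers.

```python
from xml.dom import minidom

# Made-up sample: one p inside a chapter, one p outside any chapter.
SAMPLE = (
    "<repository><book>"
    "<chapter><section><p>one</p></section></chapter>"
    "<preface><p>two</p></preface>"
    "</book></repository>"
)

def paragraphs_in_chapters(doc):
    """Return the text of every p element found at any depth under a chapter."""
    found = []
    for chapter in doc.getElementsByTagName("chapter"):
        # getElementsByTagName on an element searches all of its descendants.
        for p in chapter.getElementsByTagName("p"):
            found.append(p.firstChild.data)
    return found

doc = minidom.parseString(SAMPLE)
result = paragraphs_in_chapters(doc)
print(result)
```

	Note that this simple helper would report a p twice if chapters were
nested inside one another; a real query processor would have to deduplicate.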

	I can hear many people choking at this point about efficiency, memory,
swapping etc., but this approach is essentially exactly how object databases
work. The issue is disk and swap space. 32-bit architectures are limited to
about 4 GB (2^32 bytes), but 64-bit architectures are limited to 2^64 bytes,
and filling that does require a large RAID farm.
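	The address-space arithmetic can be checked in a few lines. This is only
a worked calculation of the two limits, using binary units (GiB, EiB); the
variable names are arbitrary.

```python
GIB = 2**30   # bytes in one gibibyte
EIB = 2**60   # bytes in one exbibyte

addr32_bytes = 2**32   # bytes addressable with a 32-bit pointer
addr64_bytes = 2**64   # bytes addressable with a 64-bit pointer

addr32_gib = addr32_bytes // GIB      # size of the 32-bit space in GiB
addr64_eib = addr64_bytes // EIB      # size of the 64-bit space in EiB
ratio = addr64_bytes // addr32_bytes  # how many 32-bit spaces fit in a 64-bit one

print(addr32_gib, "GiB")
print(addr64_eib, "EiB")
print(ratio)
```

	So the 32-bit limit is 4 GiB, while the 64-bit limit is 16 EiB, about
four billion times larger, which is why the practical limit becomes the disk
behind it rather than the pointer size.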

	So 21 hours to process 80 MB of information is grossly out of whack and
points to a very poorly designed system (a 128 MB machine should be able to
process 80 MB of data in memory, give or take a few tens of MB).
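	To put a number on "out of whack", here is the load rate implied by the
figures quoted above (about 80 Meg, about 15000 documents, 21 hours). Treating
"80 Meg" as 80 * 2^20 bytes is an assumption; the conclusion does not depend
on it.

```python
# Figures taken from the quoted message; the byte interpretation is assumed.
DATA_BYTES = 80 * 2**20   # "about 80 Meg", read here as 80 MiB
DOCUMENTS = 15_000        # "about 15000 small documents"
LOAD_HOURS = 21           # reported load time

load_seconds = LOAD_HOURS * 3600
bytes_per_second = DATA_BYTES / load_seconds
seconds_per_document = load_seconds / DOCUMENTS

print(f"{bytes_per_second:.0f} bytes/sec")
print(f"{seconds_per_document:.2f} sec/document")
```

	That works out to roughly 1.1 KB per second, or about five seconds for
each document of under 2K.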

	For example, with as small a system as you are talking about, I'd slap a
swapfile on Jade and be done with it. You ought to be able to run an XSL
query directly.

	The system I work with scales to terabytes, and employs SQL indexes and
files which aggregate individual objects into 20-80 MB chunks. This hooks into
an HSM for essentially unlimited storage capability and accepts information at
~1 MB/sec on a 10BASE-T network with a 180 MHz Pentium Pro and 64 MB of memory.
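	For scale, a back-of-the-envelope sketch of what that ingest rate means.
It assumes the ~1 per second figure is one million bytes per second and uses
the nominal 10 Mbit/s line rate of 10BASE-T; the two systems do different
work, so this is an order-of-magnitude comparison and not a benchmark.

```python
LINE_RATE_BITS = 10_000_000            # nominal 10BASE-T line rate, bits/sec
line_rate_bytes = LINE_RATE_BITS / 8   # the same rate in bytes/sec

INGEST_BYTES_PER_SEC = 1_000_000       # assumption: ~1 MB/sec as 10^6 bytes/sec
utilization = INGEST_BYTES_PER_SEC / line_rate_bytes

CORPUS_BYTES = 80 * 1_000_000          # the 80 MB corpus from the quoted message
ingest_seconds = CORPUS_BYTES / INGEST_BYTES_PER_SEC

REPORTED_LOAD_SECONDS = 21 * 3600      # the 21-hour load reported in the quote
slowdown = REPORTED_LOAD_SECONDS / ingest_seconds

print(f"{utilization:.0%} of line rate")
print(f"{ingest_seconds:.0f} seconds for the corpus")
print(f"{slowdown:.0f}x slower at 21 hours")
```

	At that rate the network is the bottleneck (about 80% of line rate), and
an 80 MB corpus would arrive in about 80 seconds.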

Jonathan Borden
http://jabr.ne.mediaone.net
