Storing Lots of Fiddly Bits (was Re: What is XML for?)

W. Eliot Kimber eliot at
Sat Jan 30 23:37:58 GMT 1999

At 12:55 PM 1/30/99 -0500, Borden, Jonathan wrote:

>	In general, object databases have been designed to efficiently store lots
>of c++ (or java) objects which contain embedded pointers (or references) and
>they provide a mechanism to navigate the database using the semantics of a
>pointer dereference. They are not designed to *efficiently* perform complex
>queries, especially those that SQL databases excel at.

If this is the definition of object database, then I don't think it
qualifies as a "database" at all--it's just persistent object storage,
which is useful, but not very interesting.  At least my layman's idea of a
"database" is that it is both general and supports queries.

Of course, this has always been one of my problems with object-oriented
programming in general: it tends to cause people to conflate the data with
the processing to the degree that the objects end up becoming primary,
rather than things that serve the data.  Persistent objects are useful as
an optimization technique but they should never be a substitute for
standards-based data repositories.

As a Certified SGML Paranoid Nutcase (CSPN) I distrust all software
implicitly and therefore always prefer solutions in which the data,
represented using SGML or XML, is the primary data store, with any other
representations being merely transient reflections of that data for
purposes of optimization.  (Of course, sometimes you are forced to trust
your software not to screw up your data too badly.)  I realize that this
extreme view can't work for some use scenarios, but it turns out to work
really well for a lot of them, especially high-volume *publishing*
scenarios, where the input to the publishing system is the SGML or XML.
There the cost of reserializing documents stored as objects at production
time is orders of magnitude higher than the cost of objectizing them at
indexing or editing time, largely because the throughput requirements are
different for these different processes.  In other words, if the SGML
data weren't the primary format, it would be impossible to meet the
production throughput requirements.  For one particular customer, even
the cost of not having the files directly on the file system is too
high, so they have to go around behind the back of their storage manager
(which provides access control and file-level versioning).
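
The "SGML as primary store, objects as transient cache" arrangement
described above can be sketched roughly as follows (a minimal
illustration, not code from any actual repository product; the class and
method names are invented for the example):

```python
# Sketch of the pattern: the serialized XML on disk is authoritative;
# the in-memory object form is a transient, regenerable reflection
# used only as an optimization at indexing/editing time.
import xml.etree.ElementTree as ET

class TransientObjectCache:
    """Objectized documents are derived from, never authoritative over,
    the serialized SGML/XML."""

    def __init__(self):
        self._cache = {}

    def objectize(self, doc_id, xml_text):
        # Parse on demand; the parsed tree can be thrown away and
        # rebuilt from the primary serialized form at any time.
        self._cache[doc_id] = ET.fromstring(xml_text)
        return self._cache[doc_id]

    def publish(self, doc_id, xml_text):
        # At production time, feed the primary serialized form straight
        # through -- no reserialization from objects required.
        return xml_text

# The primary store: serialized documents, e.g. files on disk.
store = {"doc1": "<doc><title>Example</title></doc>"}

cache = TransientObjectCache()
tree = cache.objectize("doc1", store["doc1"])   # editing-time object view
title = tree.find("title").text                 # navigate the object form
output = cache.publish("doc1", store["doc1"])   # production path uses the XML itself
```

The point of the sketch is only that `publish` never touches the object
cache, so production throughput is decoupled from the cost of
objectizing.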

Or said another way: optimizing for one part of the process usually, if not
always, deoptimizes for another part. Not news, but it bears repeating once
in a while.

As an example of the cost of deserialization, we have a client with about
80 Meg of SGML data organized into about 15000 small documents (most
documents are less than 2K in length).  On a 400MHz Pentium II with 128 Meg
of memory (running Windows NT) and gigs of free disk space, it takes 21
hours to load this data into the repository (one of the leading SGML
element manager databases, implemented on top of a leading object database)
and 8-10 hours to export it.  And, unless we're doing something wrong, the
import process does not include indexing of the data, only objectizing it.
This seems a little extreme to me. It may be that this product is
particularly poorly implemented or that we have failed to perform some
essential tuning action, but still, 21 hours?  I hope that this anecdotal
evidence is not indicative of other, similar systems, but it's not very
encouraging.
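
To put those figures in perspective, a back-of-the-envelope calculation
using only the numbers quoted above (80 Meg, 15000 documents, 21 hours
to import):

```python
# Back-of-the-envelope check on the import throughput reported above.
data_mb = 80.0          # total SGML data, in megabytes
doc_count = 15000       # number of small documents
import_hours = 21.0     # reported load time

import_seconds = import_hours * 3600
throughput_kb_per_s = (data_mb * 1024) / import_seconds
seconds_per_doc = import_seconds / doc_count

print(f"effective import rate: {throughput_kb_per_s:.2f} KB/s")
print(f"average time per document: {seconds_per_doc:.2f} s")
```

That works out to roughly one kilobyte per second, or about five seconds
per mostly-sub-2K document, which is what makes the figure seem so
extreme.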


<Address HyTime=bibloc>
W. Eliot Kimber, Senior Consulting SGML Engineer
ISOGEN International Corp.
2200 N. Lamar St., Suite 230, Dallas, TX 75202.  214.953.0004
