Opinions requested

Sat Mar 6 08:46:24 GMT 1999

On Fri, Mar 05, 1999 at 09:37:29AM -0800, Jerome McDonough wrote:
> At 02:17 PM 3/5/1999 +1100, Marcelo Cantos wrote:
> >>"Jeffrey E. Sussna" wrote:
> >>
> >> There is not (AFAIK) yet any such thing as an XDBMS (though you
> >> could consider a file system of XML documents plus a web server
> >> to resolve URLs to those documents as such a thing).
> >
> >I am continually surprised to hear remarks such as this.  SIM _is_
> >an XDBMS (it is also an SGML, MARC, RTF, etc. database with
> >structure and full content query capabilities).
> 
> I think one of the reasons you hear these kinds of remarks is that
> the terminology surrounding these systems is used differently by
> different folks.  For instance, from what I know of SIM, I wouldn't
> call it a DBMS system of any kind, as I don't believe (I could be
> wrong) it supports referential integrity constraints, concurrency
> control, recoverable transactions, and other features I would expect
> out of a reasonable DBMS.  Granted it has hooks that allow you to
> get it to work with a DBMS that can provide all that, but that
> doesn't make SIM itself a DBMS.  I would instead class SIM as an
> information retrieval system, and a pretty damned good one at that.
> However, SIM performs as well as it does in great part because it's
> not doing the extra work that a DBMS should do, and which adds
> greatly to retrieval time from database systems (as well as limiting
> their ability to handle complex data formats gracefully).

Thank you, Jerome, for the candid and quite fair assessment of SIM.

On the point of referential integrity, you are quite right: there is
no built-in support.  With our new event hook mechanism (similar to
the triggers found in most relational systems), however, one will be
able to attach event handlers to various update operations and
prevent them from completing in the event of a referential integrity
violation.  This probably wouldn't work together with concurrency
controls (though this will be moot when transaction support comes
in).
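
As a rough illustration of the hook idea, here is a minimal sketch in
Python.  It is not SIM's API: the Store class, the on_update method,
and the check_refs handler are all hypothetical names invented for
this example.  The point is only that a handler runs before the update
is applied and can veto it by raising an error.

```python
class IntegrityError(Exception):
    """Raised by a hook to veto an update."""


class Store:
    """Hypothetical document store with pre-update hooks (not SIM's API)."""

    def __init__(self):
        self.records = {}   # record key -> list of keys it refers to
        self.hooks = []     # handlers called before every update

    def on_update(self, hook):
        self.hooks.append(hook)

    def put(self, key, refs):
        # Run every hook before touching the data; any hook may raise to veto.
        for hook in self.hooks:
            hook(self, key, refs)
        self.records[key] = list(refs)


def check_refs(store, key, refs):
    """Reject the update if it refers to a record that does not exist."""
    missing = [r for r in refs if r not in store.records and r != key]
    if missing:
        raise IntegrityError(f"{key} refers to missing records: {missing}")


store = Store()
store.on_update(check_refs)
store.put("doc1", [])
store.put("doc2", ["doc1"])      # accepted: doc1 exists

rejected = False
try:
    store.put("doc3", ["nope"])  # vetoed: "nope" does not exist
except IntegrityError:
    rejected = True
```

Whether one wants the veto at all is a separate question; a handler
could just as well log the broken reference and let the update through.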

However, in one particular project, we have put in referential
integrity control using a single query per reference as part of the
check-in mechanism.  Another project generates references only
dynamically, at query time, in effect with a single reverse-reference
index lookup.  The problem with referential integrity checking is
that sometimes you need to be able to manage broken data, and this is
more often the case with documents than with the more typical
applications of RDBMS technology (financial transactions, etc.).  Of
course, when you store whole documents instead of unnaturally breaking
them up into millions of tiny pieces, you don't have nearly the same
referential integrity problems in the first place.

With respect to concurrency control you are mistaken.  We support
short term locks, which prevent individual records, at least, from
ever entering an undefined state under concurrent loads.  These locks
can be held as long as desired, but cannot persist beyond the lifetime
of a session.   Long term locks (which outlive the session) are in the
offing, and stand a good chance of getting into release 3.0 (scheduled
for mid-year, I think -- it could be earlier).
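
To make the distinction concrete, here is a toy sketch of a
session-scoped lock table in Python.  The LockTable class and its
method names are assumptions made up for this example and say nothing
about how SIM implements locking.  A lock may be held for as long as
the session lives, but ending the session releases everything it held.

```python
class LockTable:
    """Hypothetical short-term lock table: locks die with their session."""

    def __init__(self):
        self.owner = {}  # record id -> id of the session holding the lock

    def acquire(self, session, record):
        # setdefault takes the lock only if nobody holds it already.
        holder = self.owner.setdefault(record, session)
        return holder == session

    def release(self, session, record):
        if self.owner.get(record) == session:
            del self.owner[record]

    def end_session(self, session):
        # A short-term lock cannot outlive the session that took it.
        for record in [r for r, s in self.owner.items() if s == session]:
            del self.owner[record]


locks = LockTable()
first = locks.acquire("session-1", "rec-42")     # session-1 gets the lock
blocked = locks.acquire("session-2", "rec-42")   # session-2 is refused
locks.end_session("session-1")                   # lock is dropped here
second = locks.acquire("session-2", "rec-42")    # now session-2 succeeds
```

A long-term lock would differ only in that end_session would leave it
in place, which means it has to be stored somewhere persistent.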

Transactions we most definitely do not support.  We do, however,
provide recovery through log files, which record server activity and
can be played back in a batch load operation.  It's a little crude
(you make the server read-only, back it up, and start a new log file.
When you crash, restore the last backup and replay the log) but it is
safe and effective.
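
The backup-and-replay cycle can be sketched in a few lines of Python.
This is an in-memory toy under stated assumptions, not SIM's log
format: the LoggedStore class and its methods are hypothetical, and a
real server would of course write both the backup and the log to disk.

```python
import copy


class LoggedStore:
    """Hypothetical store that recovers by restoring a backup and replaying a log."""

    def __init__(self):
        self.data = {}
        self.backup = {}   # last snapshot taken
        self.log = []      # every update made since that snapshot

    def put(self, key, value):
        self.log.append(("put", key, value))   # record the update, then apply it
        self.data[key] = value

    def delete(self, key):
        self.log.append(("delete", key, None))
        self.data.pop(key, None)

    def take_backup(self):
        # Snapshot the data and start a new, empty log.
        self.backup = copy.deepcopy(self.data)
        self.log = []

    def recover(self):
        # Restore the last backup, then replay the log in order.
        self.data = copy.deepcopy(self.backup)
        for op, key, value in self.log:
            if op == "put":
                self.data[key] = value
            else:
                self.data.pop(key, None)


s = LoggedStore()
s.put("a", 1)
s.take_backup()
s.put("b", 2)
s.delete("a")
s.data = {}        # simulate a crash that loses the live data
s.recover()        # s.data is now {"b": 2}
```

Note what this does not give you: there is no way to group several
updates so that they succeed or fail together, which is the part a
transaction would add.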

More important than any specifics, however, is the issue of what you
call a DBMS.  To me, a DBMS is a database management system (seems
painfully obvious, but I think it bears repeating).  You may argue
that a product is not a DBMS if it does not support feature X, and I
don't entirely disagree.  When one talks of a DBMS one is conjuring up
a certain image in the mind of the listener, and that image may well
include feature X.  To be fair to SIM, however, the essence of a DBMS
is that it manages a collection of data.  If it doesn't support
transactions, this does not entail that it does not manage data.
Rather it simply has limits on the way the data is managed (i.e. it
doesn't manage data as well as one would like).

You clearly believe that transaction support is part of the essence of
what makes a DBMS.  I disagree; indeed, I profoundly disagree.  There
is nothing in the concept of a database that mandates any such
requirement.  Rather I would say that transaction support is an
important issue for any _good_ DBMS.  Likewise for referential
integrity and concurrency (and, for that matter, support for
declarative queries, use of indexes, a rich set of fundamental data
types, etc.).  If I recall correctly, dBase III was generally
acknowledged to be a DBMS though it lacked most of these requirements,
and could barely even call itself relational!

Now, don't get me wrong here.  I am not trying to defend SIM by
deprecating the features you demand.  They are very important and
highly desirable features in a DBMS (the fact that they are amazingly
difficult to do well is of no concern to the user).  Their absence in
SIM is of ongoing concern to us.  Furthermore, it is far from
satisfying to be able to insist that SIM fits a strict, minimalist
definition of a DBMS if it lacks features that are typically
associated with DBMSs.  One of the primary reasons they are
not in at this stage is that, as you pointed out so well, the primary
focus of SIM has always been performance and scalability; and all of
the aforementioned features can have a significant impact on
performance if implemented naively (transaction support, in
particular, is an onerous requirement, though by no means untenable).

SIM is not a full-featured DBMS.  But it is not a mere information
retrieval system either.  It does support recovery (though not full
transaction support), it does support concurrency, and it can be
coerced to support referential integrity.  It also bears mentioning
that you don't have to talk out to an RDBMS to do any of these things.
In fact the only use I have heard of for our ODBC capability is one
client who wanted to access a personnel database for authentication
purposes (it had nothing to do with the database server per se).

I guess this all boils down to what's in a name.  At the end of the
day, it is far more important to know what a product does and does not
do than what you call it.

> This isn't to knock SIM; anyone who needs a flexible information
> retrieval system should be taking a very serious look at it.  The
> Z39.50 support alone puts it way ahead of the market as far as I'm
> concerned.  But I don't think SIM is evidence that there are DBMS
> systems that handle SGML/XML well; I don't think they do.  Oracle
> may very well be getting there with its latest release, but I
> suspect there's still a lot of work to be done there.

I am sceptical that any RDBMS vendor can come to the party in terms of
performance.  Past attempts to force text into a relational,
table- or object-based paradigm have not met with great success (Oracle's
ConText comes to mind as an example of how forcing a square peg into a
round hole requires sacrificing the edges of performance).  I would be
surprised if any of the major database vendors would be prepared to
venture away from their core competency (the relational model) to
address the performance issues.

But why parse XML to split it up into tables when you can store the
XML directly?  Why build thousands of index entries against
system-generated element IDs so that you can do joins to build up an
XML fragment, when you can build a single index and pull the fragment
in its entirety out of the document from which it comes?  Why use
inferior content indexing technology that takes as much as 10 to 20
times the size of the data being indexed when you can use compressed
inverted files that take between 15% (document-level index) and 50%
(multi-level word-position index) of the size of the data?  And all
this with faster update speed than many standard text retrieval
systems.
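
For readers unfamiliar with inverted files, here is a small generic
sketch in Python.  It is an illustration of the general technique, not
a description of SIM's index format: the function names are
hypothetical, and gap encoding is just one common way of making
posting lists compress well (each number is stored as the difference
from the one before it, so the values stay small).

```python
from collections import defaultdict


def build_index(docs):
    """Map each word to {doc_id: [word positions within that document]}."""
    index = defaultdict(dict)
    for doc_id, text in enumerate(docs):
        for pos, word in enumerate(text.lower().split()):
            index[word].setdefault(doc_id, []).append(pos)
    return index


def to_gaps(sorted_numbers):
    """Replace each number with its distance from the previous one."""
    gaps, prev = [], 0
    for n in sorted_numbers:
        gaps.append(n - prev)
        prev = n
    return gaps


def from_gaps(gaps):
    """Invert to_gaps by accumulating the differences."""
    total, out = 0, []
    for g in gaps:
        total += g
        out.append(total)
    return out


docs = ["store the xml directly", "index the xml and query the xml"]
index = build_index(docs)

doc_level = sorted(index["xml"])   # documents containing "xml"
positions = index["xml"][1]        # word positions of "xml" in document 1
encoded = to_gaps(positions)       # small numbers, cheap to store
restored = from_gaps(encoded)      # round-trips to the original positions
```

The document-level list answers "which documents contain this word",
while the position lists are what make phrase and proximity queries
possible, which is why the more detailed index costs more space.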

There is an additional overhead in the relational paradigm which has
nothing to do with transactions, concurrency control, or referential
integrity checking.  That cost is that relational tables do not map
cleanly onto hierarchical documents (or data collections, to pick up on
another thread).  Every fragment you insert, update, or remove has to
be taken apart to map it onto some underlying representation, modified
piece by piece, and then reassembled to be delivered.
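
A toy example may help show what that mapping involves.  The Python
sketch below uses the standard xml.etree.ElementTree module to shred a
small document into rows and put it back together.  The row layout
(id, parent id, tag, text) and the function names are assumptions made
for this example, not any vendor's schema, and the sketch ignores
attributes and mixed content, which only make the real job harder.

```python
import xml.etree.ElementTree as ET


def shred(xml_text):
    """Flatten a document into (id, parent_id, tag, text) rows."""
    rows = []

    def walk(elem, parent_id):
        row_id = len(rows)
        rows.append((row_id, parent_id, elem.tag, (elem.text or "").strip()))
        for child in elem:
            walk(child, row_id)

    walk(ET.fromstring(xml_text), None)
    return rows


def reassemble(rows):
    """Rebuild the tree by joining each row to its parent row by id.

    Relies on rows being in document order, so parents come first.
    """
    elems = {}
    root = None
    for row_id, parent_id, tag, text in rows:
        elem = ET.Element(tag)
        elem.text = text or None
        elems[row_id] = elem
        if parent_id is None:
            root = elem
        else:
            elems[parent_id].append(elem)
    return ET.tostring(root, encoding="unicode")


doc = "<memo><to>Jerome</to><body><p>Hello</p><p>World</p></body></memo>"
rows = shred(doc)            # one row per element, five in all
rebuilt = reassemble(rows)   # identical to doc for this simple case
```

Five rows and a self-join for a one-line memo; a store that keeps the
document whole simply hands back the string it was given.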

I strongly disagree that SIM doesn't handle SGML/XML well.  In the
five years of successfully selling SIM, no customer has ever replaced
SIM with another product. In fact none of them have even mentioned to
us that they ever considered replacing SIM.  This in itself is
remarkable given that, because our customers use SIM to store their
SGML/XML natively, they can get the data out of SIM much more easily
than if it were mapped onto some proprietary internal database format.
People buy SIM because it is flexible enough to do whatever they need
to do with their XML/SGML.  It doesn't force them to adopt a
non-XML/SGML approach.  It doesn't force them to translate their data
into some proprietary format in order to interact with the data.  It
deals directly with the XML.  Precisely what the original post was
asking for, in fact.

Cheers,
Marcelo

P.S.: Some thanks go to my colleague, Tim Arnold-Moore, for providing
some of the content (including the closing) for this article.

-- 
http://www.simdb.com/~marcelo/

xml-dev: A list for W3C XML Developers. To post, mailto:xml-dev at ic.ac.uk
Archived as: http://www.lists.ic.ac.uk/hypermail/xml-dev/ and on CD-ROM/ISBN 981-02-3594-1