Opinions requested
Marcelo Cantos
marcelo at mds.rmit.edu.au
Sat Mar 6 06:45:35 GMT 1999
Thank you, Walter, for the erudite response. I am left in a bit of a
quandary as to how or even whether to respond. This is in large part
due to the fact that, while your post was in response to mine, it is
not immediately clear to me whether you are addressing my comments
specifically or rather the general theme of this thread.
Having the vague impression (though no firm conviction) that it is in
response to my claims that you waxed eloquent on the theme of what
defines an XML database, I will proceed to provide commentary, and
occasionally direct response/rebuttal, to a smattering of your points.
My humble apologies, Walter, if I have in any way misconstrued your
post.
On Fri, Mar 05, 1999 at 02:22:51AM -0500, W. E. Perry wrote:
> Marcelo Cantos wrote:
>
> > "Jeffrey E. Sussna" wrote:
> >
> > > There is not (AFAIK) yet any such thing as an XDBMS
> >
> > I am continually surprised to hear remarks such as this. SIM _is_
> > an XDBMS (it is also an SGML, MARC, RTF, etc. database with
> > structure and full content query capabilities). As an XDBMS it
> > has weaknesses (it only supports predefined indexes and limited
> > structure querying), but in some ways provides a model that is
> > even richer than XML (it provides structure below element level,
> > and has the concept of fields
>
> In addition to this vision of an XML database, there has been much
> discussion of XML as a front end or a query-and-response framework
> for data stores, but I would argue that such applications of XML
> markup are not an XML database. A true XML database is shaped by the
> essential characteristics of XML itself: it should be freely
> eXtensible; it should be defined and manipulated by Markup; and it
> should be cast in a Document Structure within which Elements
> identify Data Constructs, and Attributes provide Data
> Characterization.
It seems here that I may have provided an incorrect characterisation
of what we do, and hence given Walter cause to provide some qualifiers
on anyone wishing to define themselves as an XML database.
On this point, I must make it quite clear that SIM is _not_ an XML
front end to a data store. It is an XML (etc.) document repository.
One additional, crucial point is that SIM _is_ extensible (though I
will qualify this presently). It can be defined to accept markup to
any degree of strictness or laxity (within the bounds of
well-formedness or validity, of course). It can be set up to accept
any and all markup and do _something_ intelligent with it. It can
also be configured to make stringent demands (well in excess of the
DTD, both with respect to strictness and complexity of constraints) of
its inputs.
This quality of SIM renders the product amenable to both of the
major application streams of XML: data and documents. It can provide
strict data validation as well as extensibility.
Now, by way of qualification, SIM does not provide free-form runtime
extensibility (runtime from the administrator's perspective, not
ours). Rather it provides the application developer with the
requisite tools to define, at design time, what structures will be
supported. For instance, you cannot, with SIM, perform queries such
as, "find me all sections containing subsections with an attribute of
security="public" and at least one paragraph with fewer than four
words in it". The semantic complexity of such a query is beyond the
scope of our product. However, if one were to know in advance that
queries about the minimum paragraph length in public subsections will
be commonplace in the particular application one is developing, then
SIM could, at design time, be told to create an appropriate index and
then the above query could, indeed, be performed.
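The idea of binding such a query at design time can be sketched roughly as follows. This is only an illustration in Python using the standard xml.etree.ElementTree module; it is not SIM's interface or implementation, and the element names (section, subsection, para), the sample document and the helper functions are all hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document; element and attribute names are assumptions.
DOC = """
<doc>
  <section id="s1">
    <subsection security="public">
      <para>Too short here</para>
      <para>This paragraph has more than four words</para>
    </subsection>
  </section>
  <section id="s2">
    <subsection security="public">
      <para>Every paragraph here is reasonably long</para>
    </subsection>
  </section>
  <section id="s3">
    <subsection security="secret">
      <para>Short one</para>
    </subsection>
  </section>
</doc>
"""

def min_para_words(subsection):
    """Smallest word count among the paragraphs of a subsection."""
    counts = [len("".join(p.itertext()).split()) for p in subsection.iter("para")]
    return min(counts) if counts else None

def build_index(root):
    """Design-time step: record (section id, shortest paragraph length)
    for every public subsection."""
    index = []
    for section in root.iter("section"):
        for sub in section.findall("subsection[@security='public']"):
            shortest = min_para_words(sub)
            if shortest is not None:
                index.append((section.get("id"), shortest))
    return index

def sections_with_short_public_para(index, limit=4):
    """Query-time step: answer from the index alone, without reparsing."""
    return sorted({sid for sid, shortest in index if shortest < limit})

root = ET.fromstring(DOC)
index = build_index(root)
print(sections_with_short_public_para(index))
```

The point of the sketch is the split: the expensive structural walk happens once, when the index is defined, and the query itself only consults the index.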
In short, SIM _is_ extensible, but the extensibility is bound somewhat
earlier than runtime. In practice, clients never complain about this
quality. In fact, it is usually a benefit rather than a hindrance,
for the same reason that compile time type checking is a good thing to
have in a programming language.
I also take issue with Walter's remark that an XML database should be
manipulated by and defined through the medium of XML. This sounds
analogous to suggesting that relational databases should be defined
and manipulated by markup. Now, it is true that relational schema
are, themselves, typically stored as relations (one will, for example,
find a ".TABLES" table, a ".FIELDS" table, a ".INDEXES" table, etc.
inside a database). However, it seems to me patently absurd to
suggest that SQL (whether DML or DDL) be expressed in terms of tuples
and relations. Now, while it does not seem likewise absurd to suggest
that XML queries and data definition constructs be defined as XML, the
truth of such a suggestion is anything but self-evident. Why should
one not use an SQL-like language to define and query XML databases?
There may or may not be merit in such an approach, but it seems no
more or less appropriate than a query/data definition language cast in
XML. Indeed, many of the query language position papers at W3C do not
use XML syntax. Data definition and query languages are
meta-constructs. They are not part of the data, but rather operate on
the data and structures. This suggests that while it may be possible
to fold the system in on itself by expressing meta-structure as data,
it would be unwise to proceed down this path in _a priori_ fashion.
(Now, have I completely missed Walter's point here? I'm not sure.)
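The remark about relational schemas being stored as relations can be seen in a small sketch. It assumes SQLite (through Python's standard sqlite3 module) as a stand-in relational engine; the table and index names are made up for the example, and the catalogue table here is SQLite's own sqlite_master rather than the ".TABLES" style names mentioned above.

```python
import sqlite3

# In-memory database; "documents" and "documents_title" are hypothetical names.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE documents (id INTEGER PRIMARY KEY, title TEXT)")
conn.execute("CREATE INDEX documents_title ON documents (title)")

# The schema is itself queryable as an ordinary relation ...
catalog = conn.execute(
    "SELECT type, name, tbl_name FROM sqlite_master ORDER BY type, name"
).fetchall()
for row in catalog:
    print(row)
conn.close()
```

Note that the schema is stored as tuples, yet the statements that define and query it are written in SQL, not as tuples. That is the asymmetry the paragraph above points at.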
> Like XML itself, the XML database is fundamentally mismatched to the
> familiar storage and transmission frameworks of filesystem,
> relational table, object serialization or data stream. In the first
> case, any item--document, data table, or executable--whether 'text'
> or binary--which is committed to storage in a filesystem is treated
> as a file: that is, as unitary and indivisible within the
> perspective and capabilities of the filesystem. A word processing
> program may, by opening a document, be able to identify and to
> manipulate as individual elements the sentences, paragraphs and
> chapters of that document. By contrast, the filesystem in which
> that document is stored reads, writes, renames, searches for or
> deletes the document as a whole. In XML terms, the filesystem sees
> the document as a single element--a root. Regardless of how many
> subelements we might mark up within that <root>, the
> filesystem--designed for a generic 'file-like' document, is capable
> of manipulating only one.
One must be careful, here, to discriminate between interfaces and
implementations. I basically agree with all of Walter's points in the
above paragraph, but would add that many systems store conceptual XML
documents as files. Our system uses a highly tuned variable length
record manager (unsurprisingly named the VLRM) to store documents and
fragments of any size in a highly efficient manner (both in terms of
size and speed). Consequently, we store entire documents for the most
part. If parsing time starts to weigh heavily due to retrieval of
excessively large documents (the entire Australian Tax Legislation,
say, or a complete Boeing Aircraft Maintenance Manual), then we
fragment the documents to a level where parsing is no longer a
bottleneck.
In all of this, however, SIM can always treat the XML as XML. The
developer always sees trees, not files, or BLOB's. It doesn't matter
how it is stored in the background; that is an implementation issue.
The one caveat with our product is that fragmented documents cannot be
treated as a conceptual whole without physically rejoining the parts.
This is one thing which OODBMS's do better than us at present, though we
are looking at ways to provide that additional level of abstraction
(we are also considering the usefulness of doing so, since fragments
are more commonly the unit of interest, rather than the entire
document).
> In the terms of both filesystem and relational table, an XML
> document is effectively a BLOB, in that its specifically XML
> structure is outside the ability of either to discern or to make any
> use of. Just as, for example, with audio or video content more
> commonly recognized as BLOBs, the filesystem or relational database
> engine is obliged to invoke a particular, content-specific processor
> in order to understand, and then to implement, the structure
> conveyed by markup in every XML document. Yet this need for
> pre-defined, content-specific handlers obviates the benefits of XML
> as a general solution. Indeed, it is not really XML at all if the
> markup possibilities are circumscribed by the need to conform to
> what a pre-defined handler can implement.
I disagree with the last sentence above. Not from the pedagogical
perspective (which seems quite evident in Walter's prose, and with
which I largely sympathise), but from the pragmatic perspective. Yes,
the purist will rightly decry the notion of predefinition of structure
in an ostensibly XML-friendly environment, but the end-user comes
along and not only accepts, but vociferously demands, that her
environment be constrained. The user doesn't want the flexibility to
store anything; she wants the flexibility to store only what she wants
to store.
The serious user of XML does not have a heterogeneous collection of
vaguely defined documents with a motley crew of DTD's and well-formed
markup. Most users have a well defined data set for which they want
to define efficient structures for storage and retrieval (if they
aren't interested in efficiency then their problem isn't particularly
interesting -- any tool will do). In the few cases where they do have
arbitrary structure to deal with, more often than not they are only
interested in the content and are likely to throw the structure away.
After all, what is the use of structure if you don't know, say,
whether the prolog element contains an abstract element, or whether
"date" attributes refer to creation time, last modification time, or
effectivity (or, worse still, whether they are in U.S., Australian or
international format)? In the real world, I suspect that cases where
structure is arbitrary but important will be few and far between.
This is borne out by the almost complete absence of demand for
arbitrary structure querying capability from our clients or potential
clients. It just never seems to be an issue.
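The "date" attribute problem is easy to demonstrate. The following minimal sketch, using Python's standard datetime module and a made-up attribute value, shows the same string yielding two different dates depending on which convention one assumes.

```python
from datetime import datetime

# Hypothetical value of a "date" attribute whose format is not documented.
raw = "03/04/1999"

as_us = datetime.strptime(raw, "%m/%d/%Y").date()  # U.S.: month/day/year
as_au = datetime.strptime(raw, "%d/%m/%Y").date()  # Australian: day/month/year

print(as_us.isoformat())
print(as_au.isoformat())
```

Without prior knowledge of the structure's semantics, nothing in the markup itself says which reading is correct.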
A qualifier is also in order for the above remarks, lest there be a
misunderstanding. XML tools, in general, must be extensible and
accept any and all valid and/or well-formed inputs. My comments
specifically address the issue of repositories (DBMS's). XML may be
extensible, but it, too, expresses the notion of constraint through
the concept of DTD's. Databases, likewise, not only can, but should
constrain the inputs, both for simplicity and efficiency. Perhaps
this is, after all, what Walter meant when repudiating the idea of
predefined handlers.
Cheers,
Marcelo
--
http://www.simdb.com/~marcelo/