ANN: XML and Databases article

Thu Sep 9 01:40:12 BST 1999

[Ron Bourret:]

> I've read Paul's tutorial and the GroveMinder summary on the Web, so
> let's see if I've got this straight.  A grove is basically a
> property set, broken down into classes, each of which has
> properties. There are probably relationships between those
> classes. For example, a grove for XML could have classes for
> elements, attributes, entities, and so on, where the element class
> points to the attribute class. A grove for a relational database
> would have classes for tables, columns, etc., where the table class
> points to the column class.

Pretty close, but not quite on the money.  First of all, a
terminological problem: A grove is the set of objects that results
from understanding (parsing and processing) some particular logical
resource.  No grove is made from more than one logical resource (I say
"logical" resource because some single resources are distributed in
multiple physical containers).  However, more than one grove can be
made from a single resource.  This is because resources have multiple
layers.  For example, in the case of XML documents, there is always
the XML syntax layer of "understanding".  The property set (schema)
for this layer is probably strongly reminiscent of the DOM.  However,
there are one or more vocabularies used in every XML document (there's
always at least one because the element types have names, even if
there's no DTD).  The semantics of these vocabularies may imply
"emergent properties" of the information contained in the resource,
and there can be a property set for each vocabulary's emergent
properties.  So preparing a single resource for application-internal
exploitation may involve creating groves for each vocabulary.  By
giving names to the emergent properties of vocabularies, such property
sets can be, in effect, APIs to the semantics of each vocabulary, thus
opening the way for vocabulary-specific software engines, and for far
more reliable cross-application information interchange than the Web
has ever seen.

So, instead of saying, 

> A grove for XML could have classes for elements, attributes,
> entities, and so on, where the element class points to the attribute
> class.

 ... you might better have said any one of the following (this has to
 be said with extreme precision, so look closely):

| A property set for the XML language could have classes for elements,
| attributes, entity references, and so on, where the element class
| has, as one of its nodal properties, an "attribute specification
| list" property, whose value is a list of "attribute value
| specification" nodes.

 or:

| The primary grove form of an XML resource could have nodes
| conforming to the "element", "attribute specification", and "entity"
| classes, and so on, where the "element" class has, as one of its
| properties, an "attribute specification list" property, whose value
| consists of a list of nodes that all must be of the class "attribute
| value specification".

 or, in view of the fact that the DTD of an XML resource is part of
 its grove (when it appears or is referenced by the DOCTYPE
 declaration in an XML resource):

| The primary grove form of an XML resource could have element type
| definitions, attribute list definitions, entity declarations, and so
| on, where the element type definition class has, as one of its nodal
| properties, an "attribute definition list" property, whose value
| consists of a list of nodes that all must be of the class "attribute
| definition".

The second problem with your summary statement is that "points to" is
actually an implementation detail.  The standard only says that nodes
(objects) in groves have properties, and that some properties can be
"nodal" -- that is, the values of such properties can be other nodes
(in the same grove and/or in other groves).  The manner in which a
node is represented to be a property value in any given implementation
is almost certainly going to be via pointing (at least in a von
Neumann architecture machine), but it's important to realize that that
is an implementation decision, and it's inaccurate to say that
"pointing" has anything to do with the grove paradigm.  A property set
can only say that the value of a property is nodal, and
implementations of the grove paradigm must make it appear that the
value of such a property is indeed one or more nodes, but how that is
made to happen is not part of the standard (nor should it be).

So, instead of saying:

> "where the table class points to the column class"

 ... it would be much more accurate to say:

| where the "table" class has a property named "columns" whose value
| is a list of "column"-class nodes.

> In this sense, the XML information set has much in common with
> groves, as it is a property set.

Yes, except that it's not yet clear that the XML info set will be
expressed using the ISO Property Set DTD -- but this is merely a
syntax issue.  I agree with David Megginson: I expect it to be readily
convertible.

> Similarly, the DOM could be viewed as an API for a grove.

Yes, to a single kind of grove, specifically an XML syntactic grove.
(A grove governed by the properties of XML's syntax.)

(Aside: I hope we're not facing a future in which the semantics of
certain chosen vocabularies will be directly supported by future
versions of the DOM.  Such support should "plug into" (and be
unpluggable from) the DOM.  No vocabulary-specific support should
become a required feature of all DOM implementations.  For example,
making XLink a vocabulary is fine; making the DOM able to support
XLink but no other linking vocabularies would be the start of a long
nightmare with a bad ending.  To do that would significantly reduce
the freedom of industries to design their own information
architectures, and to evolve them according to their own perceived
needs.  It would also destroy the DOM, which must stay simple in order
to survive.  No API can do everything for everybody, and once you
start putting support for DTD-specific (or namespace-specific)
semantics into the DOM, where do you stop?  I've watched a couple of
systems bloat uncontrollably and meet their demise in similar ways,
and the stage is perfectly set for the same thing to happen to the
DOM.)

> The XML information set is not a grove because ... it is not
> ... expressed in grove notation.

If you replace the word "grove" with "property set" (twice) in the
above sentence, you are exactly correct.  (There is no such thing as
"grove notation".  "Grove" is an abstract concept that, when sensibly
implemented, makes a grove exactly as human readable as a hex dump of
RAM in which there are C structs in no particular order.)

> The DOM is not an API for a grove because it's a bit wishy-washy in
> places -- for example, four characters of PCDATA could be one node
> or four, so it's not built on a rigid enough data model.)

Close enough. I would put the same thought differently: The DOM
doesn't have a formalized underlying data model, so the DOM doesn't
answer the need for a solid basis on which to express the addresses of
the components of XML resources.  I'm hoping and believing that after
the XML infoset is done we'll have a basis for implementing a powerful
version of XPath (or XPointers or whatever the idea of generalized
addressing of components of XML resources is being called at that
time).

> The nice thing about groves is that all groves, regardless of what
> they are built on, have certain commonalities, such as
> addressability, so you can perform certain common functions with
> them.

Right.  All nodes in groves have the same "object model".  (I'm using
this term in a more formal, scientific sense than it has in the phrase
"Document Object Model (DOM)".)  The grove object model is:
Groves have nodes, nodes conform to classes, and classes have named
properties with value constraints.  Nodes have named properties, and
values for those properties.  That's about it; the rest is detail.
(It's pretty interesting detail.)
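
To make that object model concrete, here is a minimal Python sketch.
It is not GroveMinder code and it is not the ISO property set
notation; every name in it (PropertyDef, ClassDef, Node, PropertySet,
validate) is invented for illustration.  It only shows classes with
named properties, a simple value constraint on nodal properties, and
nodes that carry values for those properties.

```python
from dataclasses import dataclass, field


@dataclass
class PropertyDef:
    """One named property of a class, with a simple value constraint."""
    name: str
    nodal: bool = False            # True when the value is a list of nodes
    allowed_classes: tuple = ()    # classes permitted as nodal values


@dataclass
class ClassDef:
    """A node class: a name plus its named properties."""
    name: str
    properties: dict               # property name -> PropertyDef


@dataclass
class Node:
    """A node conforms to one class and carries values for its properties."""
    class_name: str
    values: dict = field(default_factory=dict)


class PropertySet:
    """A schema: the set of classes that nodes in a grove may conform to."""

    def __init__(self, *class_defs):
        self.classes = {c.name: c for c in class_defs}

    def validate(self, node):
        """Check a node (and, recursively, its nodal values) against the schema."""
        class_def = self.classes[node.class_name]
        for prop_name, value in node.values.items():
            prop_def = class_def.properties[prop_name]  # KeyError if undefined
            if prop_def.nodal:
                for child in value:
                    if child.class_name not in prop_def.allowed_classes:
                        raise TypeError(
                            f"{prop_name!r} may not hold a {child.class_name!r} node")
                    self.validate(child)
        return True


# A "table" class with a "columns" property whose value is a list of
# "column"-class nodes.  Nothing here says how the list is stored.
relational = PropertySet(
    ClassDef("column", {"name": PropertyDef("name")}),
    ClassDef("table", {
        "name": PropertyDef("name"),
        "columns": PropertyDef("columns", nodal=True,
                               allowed_classes=("column",)),
    }),
)

customers = Node("table", {
    "name": "customers",
    "columns": [Node("column", {"name": "id"}),
                Node("column", {"name": "email"})],
})
print(relational.validate(customers))
```

Note that the sketch says only that the value of "columns" is a list
of nodes; whether that list is held by pointers, keys, or anything
else is left to the implementation.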

> GroveMinder is generic grove middleware. It has plug-ins, called
> Minders (I think of them as drivers),

Hooray, thank you!  I have sometimes called them "notation drivers"
only to get the blankest stares imaginable.  (I then have asked
something lame, like, "Do you know what a device driver is, and why we
have them?")  But you obviously get the point of Minders: Minders
represent plug and play support for individual notations, in a system
that makes all content look alike (i.e., conform to the grove object
model).

> that can build groves over different property sets. For example,
> there is one Minder for SGML/XML documents and a different Minder
> for relational databases.

Well, actually, there's probably a one-to-one correspondence between
property sets and database schemas.  In order to address information
in terms of its structure, you have to know the structure.  In
grove-land, the structure is defined by a property set.  Different
databases have different structures, normally expressed as database
schemas.  Making a database look like a grove is very straightforward.
The bulk of the work is translating the schema into a property set
(which is, after all, a kind of schema).  There's a bit of coding
involved, too, but the GroveMinder developer kit has tools that make
this amazingly easy.  (At least the Lockheed-Martin people were
amazed, and they said so publicly at XML '98.)

The grove paradigm breaks down the distinction between documents
(resources) and databases.  Everything, in its addressable form, is a
grove, and a grove is a database.  But a grove is convertible into an
interchangeable resource (that is, if the property set is a
comprehensive expression of the syntactic features of the notation of
an interchangeable resource).  Obviously, a resource is also
convertible into a grove, given a property set for its notation.
Property sets are the bridge between the world of information
interchange, and the world in which interchanged information is
immediately useful (i.e., the world in which information exists after
parsing and common semantic processing of interchangeable resources
has been done).  If the resource is *already* a database, there's
probably no parsing or processing involved.  All that needs to be done
is to put a translating layer over it that makes the database look
like a grove.  Then, the database and all its contents are fully able
to participate in the wider world of interchangeable information
resources: they can be linked, re-used by reference, have any kind of
metadata associated with them, etc. etc.
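
As a rough illustration of such a translating layer, the following
Python sketch reads the schema of a SQLite database and presents it
as "table" and "column" nodes.  It is an assumption-laden toy, not
how GroveMinder does it: the function name database_as_grove, the
dict-based node shape, and the class and property names are all
hypothetical, and the nodes are built eagerly rather than on demand.

```python
import sqlite3


def database_as_grove(conn):
    """Present a SQLite schema as nested 'database'/'table'/'column' nodes."""
    tables = []
    names = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    ).fetchall()
    for (table_name,) in names:
        # PRAGMA table_info rows are (cid, name, type, notnull, dflt_value, pk).
        info = conn.execute(f'PRAGMA table_info("{table_name}")').fetchall()
        columns = [
            {"class": "column", "props": {"name": row[1], "type": row[2]}}
            for row in info
        ]
        tables.append(
            {"class": "table",
             "props": {"name": table_name, "columns": columns}}
        )
    return {"class": "database", "props": {"tables": tables}}


# Usage: an in-memory database with one table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER, email TEXT)")
grove = database_as_grove(conn)
for table in grove["props"]["tables"]:
    column_names = [c["props"]["name"] for c in table["props"]["columns"]]
    print(table["props"]["name"], column_names)
```

No data is parsed or copied out of the database here; the layer only
re-expresses a structure the database already has.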

> (There can actually be different property sets for a "type" of
> data. For example, one property set for XML might include entities
> and another might not, specifying that each entity is replaced by
> its value. A different Minder is needed for each property set.)

Strictly speaking, you're correct: people can disagree about the
properties of, say, PostScript as a notation, or they might agree
about the properties but not about what the names of the properties
should be.  Nothing prevents people from writing their own property
sets.  In fact, however, the situation is not as chaotic as your
example might lead one to believe, because of "grove plans".  A "grove
plan" is a way of selectively deleting properties from classes, and of
deleting classes altogether, as a way of avoiding the overhead of
storing and/or processing those properties and classes.  For example,
the property set for SGML is comprehensive, but an application may not
need, for example, to store nonsignificant white spaces found in the
start tags of SGML elements.  The application may therefore use a
"grove plan" to delete the properties whose values would be those
white space characters.
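
A grove plan can be pictured as a filter over nodes.  The Python
sketch below is only an illustration under stated assumptions: nodes
are plain dicts, any list-valued property is treated as nodal, and
the class and property names ("comment", "start_tag_whitespace", and
so on) are made up rather than taken from the SGML property set.

```python
def apply_grove_plan(node, omit_classes=frozenset(), omit_properties=None):
    """Return a pruned copy of node, or None if its whole class is omitted."""
    omit_properties = omit_properties or {}
    if node["class"] in omit_classes:
        return None
    dropped = omit_properties.get(node["class"], set())
    pruned = {"class": node["class"], "props": {}}
    for name, value in node["props"].items():
        if name in dropped:
            continue  # the plan deletes this property from this class
        if isinstance(value, list):
            # Assumed nodal: prune each child and discard omitted ones.
            kids = [apply_grove_plan(k, omit_classes, omit_properties)
                    for k in value]
            pruned["props"][name] = [k for k in kids if k is not None]
        else:
            pruned["props"][name] = value
    return pruned


element = {
    "class": "element",
    "props": {
        "gi": "p",
        "start_tag_whitespace": "  ",
        "content": [
            {"class": "data", "props": {"text": "hi"}},
            {"class": "comment", "props": {"text": "ignore me"}},
        ],
    },
}

lean = apply_grove_plan(
    element,
    omit_classes={"comment"},
    omit_properties={"element": {"start_tag_whitespace"}},
)
print(sorted(lean["props"]))
print([child["class"] for child in lean["props"]["content"]])
```

The original node is left untouched, so the same comprehensive grove
can still be viewed under a different plan.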

The addresses of nodes in groves are always expressed with respect to
a property set and a grove plan.  If it were not so, you wouldn't know
whether to count a certain node type or not, when counting nodes to
get to a particular node.  And it's true that, for example, some
people want to count the text that was inserted via an entity
reference as a distinct node, while other people don't; this kind of
flexibility is needed in order to keep peace in the family, and allow
people to do addressing in the way they want to do it.
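
The counting problem can be shown in a few lines of Python.  This is
a hypothetical sketch: the dict-based nodes, the "from_entity" flag,
and the rule that adjacent data nodes merge when entity text is not
counted separately are all assumptions made for the example.

```python
def view_children(children, entity_text_distinct):
    """Return the child list as seen under one of two illustrative plans."""
    if entity_text_distinct:
        return list(children)
    merged = []
    for child in children:
        if (child["class"] == "data" and merged
                and merged[-1]["class"] == "data"):
            # Entity-inserted text is not a distinct node under this plan,
            # so it joins the preceding run of data.
            merged[-1] = {"class": "data",
                          "text": merged[-1]["text"] + child["text"]}
        else:
            merged.append(dict(child))
    return merged


def resolve(children, position, entity_text_distinct):
    """Resolve a 1-based child position under the chosen plan."""
    return view_children(children, entity_text_distinct)[position - 1]


content = [
    {"class": "data", "text": "Hello "},
    {"class": "data", "text": "World", "from_entity": True},
    {"class": "element", "gi": "em"},
]

print(resolve(content, 2, entity_text_distinct=True))
print(resolve(content, 2, entity_text_distinct=False))
```

The address "second child" lands on the entity-inserted text under
one plan and on the "em" element under the other, which is why an
address is meaningless unless the plan is stated.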

Property sets are modularizable, so that it's relatively easy to
express commonplace grove plans, to establish conformance levels for
processing systems, and to understand the rules for interpreting
address expressions.

A Minder that implements a property set comprehensively can optionally
view groves less comprehensively, so as to be able to resolve
addresses that were expressed according to lesser grove plans.  There
doesn't have to be a different Minder for each different grove plan.
(And that's where your example might be misleading.)

> One thing GroveMinder can do is store a grove in its own
> database. (Note that this is separate from the database addressed by
> the relational database Minder -- it has a structure designed to
> store groves.) Thus, GroveMinder can store an XML document in a
> database as a grove and is what I, in my article, called a content
> management system. That is, it can store and retrieve an XML
> document as a document.

Sounds right to me.  ("...its own database" sounds a bit odd because
GroveMinder can use any ODBMS for grove storage.)

> Some questions:

> 1) Is it possible to combine groves of different types? For example,
> can I take a grove representing a table in a relational database and
> stuff it into a grove for an XML document, for example as the
> content of an element?

I'm afraid I don't grasp the intent of this question.  When such an
XML document is exported from its grove as an XML document, what
should the document look like?

There's no need (and no way) to stuff something into something else.
It is only necessary that the "content" property of the element have,
as its value, the node in the database grove that represents the
table.  The ISO standard SGML Property Set does not allow this; only
certain classes of nodes within the same grove are allowed as the
value of the "content" property of "element" nodes.  However, if you
want to change your operative SGML Property Set so that this will be
permitted, nothing (other than good sense) prevents you from doing it;
the grove paradigm will readily support you in your madness.

I don't know why it would be sensible to regard an RDBMS table as the
content of an SGML or XML element.  The normal meaning of "content" is
elements, character data, and/or other SGML constructs, right there,
inside the element.  There is no way to write a general purpose
grove-to-SGML converter unless the classes of the nodes that can
appear in element content are limited and known.  (We certainly don't
want to dump arbitrary data into the content of an element; this would
invite a situation in which the document that is ultimately exported
is unparsable.)

>  If so, does the table grove retain its table-ness, or is it
> converted to one or more XML elements?  Both cases seem reasonable,
> although the latter would presumably require a special converter. If
> the latter case is true, then GroveMinder might also fit what I call
> data transfer middleware, depending on how the conversion is done.

I would suggest that an efficient way to handle this would be to
convert the table into node classes that *are* permitted to appear in
element content, and then make *those* nodes the value of the content
property.  If you do it this way, you're necessarily making the
decisions that must be made about how the XML document, when exported,
will reflect the table data.
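
A minimal sketch of that conversion, using Python's standard
xml.etree.ElementTree rather than any grove software: the element
and attribute names ("table", "row", "cell", "column") are one
arbitrary choice among many, and choosing them is exactly the
decision described above.

```python
import xml.etree.ElementTree as ET


def table_to_elements(table_name, column_names, rows):
    """Convert table data into ordinary elements that may appear in content."""
    table_el = ET.Element("table", {"name": table_name})
    for row in rows:
        row_el = ET.SubElement(table_el, "row")
        for column, value in zip(column_names, row):
            cell_el = ET.SubElement(row_el, "cell", {"column": column})
            cell_el.text = "" if value is None else str(value)
    return table_el


# Usage: make the converted nodes the content of an existing element.
report = ET.Element("report")
report.append(
    table_to_elements("customers", ["id", "email"],
                      [(1, "a@example.com")])
)
exported = ET.tostring(report, encoding="unicode")
print(exported)
```

Because only ordinary elements and character data end up in the
content, the exported document stays parsable.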

You're right that one application of GroveMinder is data transfer
middleware.  The conversion program is comparatively easy to write,
since everything already conforms to the same object model.

> 2) Are groves themselves relevant at a high level in a discussion of
> XML and databases? It strikes me that, like SAX and the DOM, they
> are a useful tool in implementing software that stores/retrieves XML
> documents (or data from those documents) in a database but are not
> directly relevant to the discussion itself. Instead, they are most
> relevant to the user in that they are likely to weigh heavily in the
> feature set exposed by a content management system or (possibly)
> data transfer system.

Good question.  I guess that's for the person who's doing the
discussing to decide.  Since groves can be persistent (e.g., stored in
databases), and since XML resources can become groves, it seems to me
that groves are relevant.  You're right, the real reason they're
interesting is their impact on feature sets.  But aren't feature sets
(and especially tradeoffs between feature sets) what technical
discussions are all about?

> 3) This isn't directly related to XML/databases, but what other
> common functionality do all groves have? For example, can I write an
> application that navigates groves, regardless of their source (I
> think the answer is yes)?

Yes.  We have a demonstration of that.
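
The demonstration itself is not reproduced here, but the idea can be
sketched in Python under simple assumptions: nodes are dicts with a
"class" and a "props" mapping, and a list of such dicts is a nodal
property value.  The walk function and both sample groves are
hypothetical.

```python
def walk(node, depth=0):
    """Yield (depth, class name) for a node and everything reachable from it."""
    yield depth, node["class"]
    for value in node["props"].values():
        if isinstance(value, list):
            for child in value:
                if isinstance(child, dict) and "class" in child:
                    yield from walk(child, depth + 1)


# Two groves built from very different sources, same object model.
xml_grove = {
    "class": "element",
    "props": {
        "gi": "p",
        "attributes": [
            {"class": "attribute", "props": {"name": "id", "value": "x1"}},
        ],
        "content": [{"class": "data", "props": {"text": "hello"}}],
    },
}
db_grove = {
    "class": "table",
    "props": {
        "name": "customers",
        "columns": [
            {"class": "column", "props": {"name": "id"}},
            {"class": "column", "props": {"name": "email"}},
        ],
    },
}

for grove in (xml_grove, db_grove):
    for depth, class_name in walk(grove):
        print("  " * depth + class_name)
```

The walker knows nothing about XML or relational databases; it relies
only on nodes having classes and properties.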

> Can I combine groves of different types or convert painlessly --
> that is, without writing any additional code -- from one type to
> another (I think the answer is no -- additional code is needed)?

Probably no, but it really depends on what you mean by "code."  You
have to decide how instances of nodes of particular classes and in
particular contexts will be mapped onto instances of nodes of
particular classes in the new context, and you have to express your
decisions in a formal, machine processable fashion.  Right now, using
GroveMinder, you can do that with a Python script, which seems about
as quick, intuitive, and flexible a way to do it as any.  I don't know
of any transformation specification language with which a similar feat
(transforming one kind of grove into another kind of grove) can be
done, except possibly DSSSL (which relies on (and was written in terms
of) the grove paradigm, by the way).  We haven't implemented DSSSL,
but it shouldn't be too hard to do that on top of GroveMinder.  Would
you call a DSSSL transformation specification "code"?  (I guess I
would.)
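
For a feel of what "expressing your decisions in a formal, machine
processable fashion" might look like in a script, here is a small
Python sketch.  It does not use the GroveMinder API; the node shape,
the RULES table, and the transform function are assumptions made for
illustration, and the rules ignore context for brevity.

```python
def transform(node, rules):
    """Map a source node and its children onto nodes of the target kind."""
    children = [transform(child, rules) for child in node.get("children", [])]
    return rules[node["class"]](node, children)


def table_rule(node, children):
    return {"class": "element", "gi": "table", "children": children}


def row_rule(node, children):
    return {"class": "element", "gi": "row", "children": children}


def cell_rule(node, children):
    # Decision: each cell becomes an element named after its column.
    return {"class": "element", "gi": node["column"],
            "children": [{"class": "data", "text": str(node["value"])}]}


# One rule per source class: this table *is* the mapping specification.
RULES = {"table": table_rule, "row": row_rule, "cell": cell_rule}

source = {
    "class": "table",
    "children": [
        {"class": "row", "children": [
            {"class": "cell", "column": "id", "value": 1},
            {"class": "cell", "column": "email", "value": "a@example.com"},
        ]},
    ],
}

target = transform(source, RULES)
print(target["children"][0]["children"][0])
```

A source class with no entry in RULES raises a KeyError, which is one
blunt way of forcing every mapping decision to be made explicitly.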

> Can I hyperlink from one grove to another (I think the answer is
> yes)?

Yes.  The interesting thing here is that traversal can be initiated
from any node in any grove, on account of a link in any grove, and
traversal can be made to any node in any grove.  Neither the traversal
initiation point, nor the traversal target, has to be a linking
construct.  Neither has to "know" anything about the fact that they
are actually anchors.
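
A toy Python model of that independence, with the caveat that it is
only a sketch: LinkIndex, add_link, and traverse_from are invented
names, nodes are plain dicts, and anchors are matched by object
identity rather than by any real addressing scheme.

```python
class LinkIndex:
    """Links kept apart from the nodes they connect (a separate 'link grove')."""

    def __init__(self):
        self._links = []

    def add_link(self, *anchors):
        """Record a link among any nodes, in any groves."""
        self._links.append(tuple(anchors))

    def traverse_from(self, node):
        """Return the other anchors of every link in which node participates."""
        targets = []
        for anchors in self._links:
            if any(anchor is node for anchor in anchors):
                targets.extend(a for a in anchors if a is not node)
        return targets


# Nodes in two unrelated groves; neither contains any linking construct.
xml_node = {"class": "element", "gi": "para"}
db_node = {"class": "table", "name": "customers"}

links = LinkIndex()
links.add_link(xml_node, db_node)

print(links.traverse_from(db_node))
print(links.traverse_from(xml_node))
```

Traversal works in both directions, and neither node was modified to
make it an anchor.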

> And so on.

I'll provide you with a copy of the GroveMinder demo, if you like.
There are lots of playful possibilities.  Some people have even
written their own HyTime documents to use with the demo software.
It's a challenge for puzzle lovers, because the demo does not report
errors in documents.

-Steve

--
Steven R. Newcomb, President, TechnoTeacher, Inc.
srn at techno.com  http://www.techno.com  ftp.techno.com

voice: +1 972 231 4098
fax    +1 972 994 0087
pager (150 characters max): srn-page at techno.com

3615 Tanner Lane
Richardson, Texas 75082-2618 USA

xml-dev: A list for W3C XML Developers. To post, mailto:xml-dev at ic.ac.uk
Archived as: http://www.lists.ic.ac.uk/hypermail/xml-dev/ and on CD-ROM/ISBN 981-02-3594-1