Groves, the next big thing (Re: ANN: XML and Databases article)

Fri Sep 10 17:03:38 BST 1999

Michael Champion wrote:
> 
> Third, there was a widespread
> perception that the groves model implies, in DOM terms, that "every
> character is a node",  and people concerned about implementing the DOM API
> felt strongly that this would lead to unacceptable footprint and run-time
> overhead.

Two years later we have incompatible models of string handling in the
DOM, the infoset and XSLT. This is *exactly* the problem that groves
were invented to solve. I don't mean that in a loose way. I mean that
very specifically: it was precisely to head off these sorts of
incompatibilities between HyTime and DSSSL that groves were invented.
Character handling was only one issue but a major one.

The grove world decided that character-level addressing is:

 a) necessary
 b) highly optimizable using basic CS 201 lazy evaluation tricks.
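
To make (b) concrete, here is a minimal sketch in Python of the kind of
lazy evaluation trick I have in mind. It is not how Jade does it; the
names TextRun and CharNode are invented for illustration. The text is
stored as one ordinary string, and a per-character node object only
comes into existence when somebody actually addresses that character.

```python
class CharNode:
    """A node for a single character. Hypothetical name, for illustration."""
    __slots__ = ("parent", "index")

    def __init__(self, parent, index):
        self.parent = parent
        self.index = index

    @property
    def char(self):
        # The character is read from the parent's string, never copied.
        return self.parent.text[self.index]


class TextRun:
    """Stores characters as one string; builds CharNodes only on demand."""

    def __init__(self, text):
        self.text = text
        self.created = 0  # how many nodes were materialized (illustration only)

    def __len__(self):
        return len(self.text)

    def __getitem__(self, index):
        if not 0 <= index < len(self.text):
            raise IndexError(index)
        self.created += 1
        return CharNode(self, index)


run = TextRun("every character is a node")
node = run[6]  # address one character; only one node object is built
```

Every character is addressable, but the footprint is that of the string
plus whichever nodes a client has actually asked for.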

I claim that there is no DOM implementation that can match the
performance of James Clark's Jade grove builder. Of course I haven't
tried them all, but I know that the Jade grove builder is an astoundingly
fast part of an astoundingly fast style processor.

> Most importantly, someone is going to have to write a *clear* statement of
> the paradigm, its power, why it's "the next big thing", etc.

You're asking the impossible. Pretend I am a skeptical mid-1980s dBASE
user. Now write the one-page description of the relational model that
will convince me that the model is better than dBASE and other
proprietary, ad hoc models.

Pretend that I am a mid-1980s C programmer. Now write a one-page
description of the object-oriented model that will convince me that it
is better than the procedural model.

Insofar as there is always a way to hack around the limitations of the
current model, it is essentially impossible to sell the benefits of the
new model in simple terms. Rather, the listener needs to have wrestled
with the problem and needs to have developed a dislike for ad hoc
half-answers. And there are still smart people whom I respect who reject
both OO and the relational model, so I wouldn't expect groves ever to be
uncontroversial. I am still discovering the Zen of OO and the Zen of
relational myself, so how could I brain-dump the Zen of groves (which I
am also still discovering)?

I can give you some hints though:

 1. addressing is the basis of everything. For the most part, the DOM
could be replaced with an API of one method: "EvaluateQuery()". Of
course that presumes a more powerful "query" language than we have -- we
need a full data manipulation language. So just as almost all of
Microsoft's "ADO", "DAO", "ODBC" and "RecordSet" APIs boil down to
optimizations of and layers on top of "EvaluateSQLQuery", the DOM could
be (!should be!) considered an optimization of and layer on top of
"EvaluateXMLDMLQuery()".

 2. addressing is always done in terms of a data model. This is a
universal truth which I do not expect anyone will dispute. Even the USPS
has a logical model of abstractions such as cities and states.

 3. we need to address many different data types. The DOM already
supports CSS, HTML and XML. XPath/XPointer supports XML. URLs into PDF
are also meaningful. Most other web media are pretty much hyperlink
opaque (but should not be!). As soon as you can build a hyperlink that
has anchors in a PDF and a JPEG (i.e. NOW) you need terminology to
describe the linked objects. "Anchor" is not good enough -- what do the
anchors contain? I claim "nodes". What are the universal properties of
nodes? When you define them you will have reinvented a major part of the
grove model.

 4. inventing, from scratch, a new data model for every media type would
be incredibly tedious and hard to implement. Better to set up a framework
for describing the data models of media types: a meta data-model (very
different from a meta-data model).

 5. inventing, from scratch, a new query language for each media type
would also be tedious and hard to implement. This is the path we are on
now when we invent new fragment identifiers. If we had a standard data
model then many media types could share a query language with shared
concepts. Of course optimized query languages will never completely go
away but often we don't need them. Addressing an anchor in a PDF is very
much like addressing an anchor in HTML. Other media are similarly
related. We could start with the goal that there be a universal syntax
for fragment identifiers. A universal underlying model is just a few
steps beyond that.

 6. inventing, from scratch, a new API for each media type would ALSO be
tedious and hard to implement. We already have three such APIs under the
title "DOM." If an API is just a layer on a query/data manipulation
language, then couldn't we algorithmically develop programming-language
specific APIs from the same data model definition that we use to develop
our query language? This would work in the same way that I can
algorithmically generate Java interfaces from IDL.

 7. new layers on existing data types have all of the same problems as
new data types. We need a data model, query language and API for XML,
namespaced XML, namespaced XML with XLink, namespaced XML with RDF,
XHTML, BizTalk and so forth. Developing each level by hand gets tiresome
(and expensive) pretty quickly.
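
Hints 1, 4 and 6 can be sketched together in a few lines of Python. This
is only an illustration under loud assumptions: XML_MODEL, PDF_MODEL,
Node, evaluate_query and make_api are all hypothetical names, and the
"models" are toy dictionaries, far simpler than the real property set
machinery in HyTime and DSSSL. The point is only that one declaration
of node classes and properties can drive both a single generic query
function and a media-specific convenience API generated from it.

```python
# Hypothetical meta data-model: each media type declares its node classes
# and the properties each class may carry.
XML_MODEL = {"element": ["name", "children"], "text": ["data"]}
PDF_MODEL = {"page": ["number", "children"], "anchor": ["name"]}


class Node:
    """A generic node: a class name plus properties checked against a model."""

    def __init__(self, model, cls, **props):
        unknown = set(props) - set(model[cls])
        if unknown:
            raise ValueError(f"{cls} has no properties {sorted(unknown)}")
        self.cls = cls
        self.props = props


def evaluate_query(root, cls, **where):
    """Return, in document order, nodes of class `cls` matching `where`."""
    found = []
    stack = [root]
    while stack:
        node = stack.pop()
        if node.cls == cls and all(
            node.props.get(key) == value for key, value in where.items()
        ):
            found.append(node)
        # Push children reversed so they are visited left to right.
        stack.extend(reversed(node.props.get("children", [])))
    return found


def make_api(model):
    """Generate one find_<class>() function per node class in the model."""
    api = {}
    for cls in model:
        def finder(root, _cls=cls, **where):
            return evaluate_query(root, _cls, **where)
        api[f"find_{cls}"] = finder
    return api


doc = Node(XML_MODEL, "element", name="doc", children=[
    Node(XML_MODEL, "element", name="a",
         children=[Node(XML_MODEL, "text", data="hi")]),
    Node(XML_MODEL, "text", data="bye"),
])
pdf = Node(PDF_MODEL, "page", number=1,
           children=[Node(PDF_MODEL, "anchor", name="intro")])

xml_api = make_api(XML_MODEL)  # the "DOM" as a generated layer
texts = [n.props["data"] for n in xml_api["find_text"](doc)]
anchors = evaluate_query(pdf, "anchor", name="intro")  # same query, other media
```

Note that evaluate_query knows nothing about XML or PDF. Only the model
declarations differ, which is the whole argument for a meta data-model.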

 Paul Prescod

xml-dev: A list for W3C XML Developers. To post, mailto:xml-dev at ic.ac.uk
Archived as: http://www.lists.ic.ac.uk/hypermail/xml-dev/ and on CD-ROM/ISBN 981-02-3594-1