Data models: Groves and tutti quanti.
Didier PH Martin
martind at netfolder.com
Tue Feb 1 15:08:41 GMT 2000
Hi,
I followed the thread on groves with a lot of interest since we are actually
right in the middle of preparing the new scope of work for DSSSL-2. As you
know, I am also doing a lot of experiments in Didier's labs. So, this thread
made me check in my notebooks and see what I already have found on Groves,
data models and tutti quanti.
So, for the curious ones, I'll talk in this Lab's note about the Grove data
model, its relationship with the DOM and also about other possible data
structures used to model an XML document.
Groves:
--------
Groves are based on a certain data model that can be easily used by anything
that can process lists. The atom of a Grove is the node. And as atoms, nodes
are divisible :-) and are composed of properties. A node is thus, in fact, a
set of properties. Some properties are singletons like, for instance, a
"Boolean" or a "string". Others are collections like, for instance, the
"Attributes" or "Content" properties. If we just take the "element node" as
an example we have, according to the SGML Grove plan (that can easily be
mapped on XML):
General Identifier (Gi) - singleton (string)
Identifier (ID) - singleton (string)
Attributes - collection (a list of attribute nodes)
Content - collection (a list of element nodes that also contains a single
data node member)
For graphically inclined people let's translate this basic model into a
figure:
Element node
  |____ Gi
  |____ ID
  |____ Attributes
  |       |___ attribute node
  |       |___ attribute node
  |____ Content
          |___ data node
          |___ element node
          |___ element node
          |___ element node
As you noticed, either from the text or the figure, the whole hierarchy is
built on the "Content" property. It is this property that contains the other
element nodes, and therefore it is the one used to map the SGML or XML
document tree (at least, the elements). Of course, other nodes can be added
to the model, like a root node (generally mapped to the document),
processing instruction nodes, comment nodes, etc.
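For those who prefer code to figures, here is a minimal Python sketch of the element node described above. The class names (ElementNode, AttributeNode, DataNode) and the helper element_names() are my own illustration, not names taken from the SGML grove plan; only the four properties (Gi, ID, Attributes, Content) come from the text.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Union


@dataclass
class DataNode:
    """Character data inside an element (hypothetical class name)."""
    text: str


@dataclass
class AttributeNode:
    """One attribute of an element (hypothetical class name)."""
    name: str
    value: str


@dataclass
class ElementNode:
    """An element node: nothing more than a set of properties."""
    gi: str                          # General Identifier, singleton (string)
    id: Optional[str] = None         # Identifier, singleton (string)
    attributes: List[AttributeNode] = field(default_factory=list)  # collection
    content: List[Union["ElementNode", DataNode]] = field(default_factory=list)  # collection


def element_names(node: ElementNode) -> List[str]:
    """Walk the element tree depth first, using only the Content property."""
    names = [node.gi]
    for child in node.content:
        if isinstance(child, ElementNode):
            names.extend(element_names(child))
    return names


doc = ElementNode(
    gi="book",
    id="b1",
    attributes=[AttributeNode("subject", "XML")],
    content=[
        DataNode("Professional XML"),
        ElementNode(gi="chapter", content=[ElementNode(gi="para")]),
        ElementNode(gi="chapter"),
    ],
)
print(element_names(doc))
```

Note that the traversal never looks at Gi, ID or Attributes to find its way: the hierarchy lives entirely in the Content collection.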
As you also probably noticed, the Grove is already behind the DOM. Why?
DOM:
----
The DOM is an API that lets you manipulate a hidden and unknown data
structure. However, as is the case for any API, the data structure is not
so silent and, sooner or later, it percolates through the API. The DOM is
no exception to this observation.
As you all know, the DOM also has its atom: the node. The DOM node is a
collection object or, to be more precise, an API that wraps a collection.
With the API you can manipulate a collection of nodes, whatever their type.
Even if its properties are different from those of the SGML Grove plan, the
DOM follows about the same pattern. And as you can see for yourself, the
DOM includes a plethora of node types:
- ELEMENT_NODE The node is an Element.
- ATTRIBUTE_NODE The node is an Attr.
- TEXT_NODE The node is a Text node.
- CDATA_SECTION_NODE The node is a CDATASection.
- ENTITY_REFERENCE_NODE The node is an EntityReference.
- ENTITY_NODE The node is an Entity.
- PROCESSING_INSTRUCTION_NODE The node is a ProcessingInstruction.
- COMMENT_NODE The node is a Comment.
- DOCUMENT_NODE The node is a Document.
- DOCUMENT_TYPE_NODE The node is a DocumentType.
- DOCUMENT_FRAGMENT_NODE The node is a DocumentFragment.
- NOTATION_NODE The node is a Notation.
So, to make a long story short, the DOM is implicitly an interface to something
so similar to the Grove that we can even say that what's behind it is a Grove
with a different Grove plan than the SGML grove plan. So, let's say that the
DOM is a Grove with an XML grove plan and that the Infoset is this Grove
plan, even if the Infoset is a bit fuzzy concerning the structure.
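To see these node types percolate through the API, here is a small sketch using Python's standard xml.dom.minidom. The helper describe() and the sample document are my own; the nodeType, nodeName and childNodes members and the *_NODE constants are part of the DOM binding that ships with Python.

```python
from xml.dom import Node
from xml.dom.minidom import parseString

# Map the numeric nodeType codes back to their constant names.
TYPE_NAMES = {
    getattr(Node, name): name
    for name in dir(Node)
    if name.endswith("_NODE")
}


def describe(node, depth=0, out=None):
    """Collect (depth, node type name, node name) for every node, depth first."""
    if out is None:
        out = []
    out.append((depth, TYPE_NAMES[node.nodeType], node.nodeName))
    for child in node.childNodes:
        describe(child, depth + 1, out)
    return out


doc = parseString('<Book subject="XML"><!-- note -->Professional XML</Book>')
rows = describe(doc)
for depth, type_name, name in rows:
    print("  " * depth + f"{type_name} {name}")

# Attributes are nodes too, but they are reached through the element,
# not through childNodes.
attr = doc.documentElement.getAttributeNode("subject")
print(TYPE_NAMES[attr.nodeType], attr.name, attr.value)
```

Everything the walk touches is a node carrying a type code, which is very much the Grove pattern seen through an API.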
Observation:
------------
What you can observe is that the Grove is more or less what you would find
in a parse tree, with some properties added to the nodes. Another thing
that you can observe is that the document is decomposed into collections and
singletons. So far so good: this is a data model that seems to be the
reassuring base for both the ISO and W3C worlds. But the Grove has the effect
of creating a multitude of node types: because the original document is parsed
and each of its entities is transformed into a node, we end up with a
multitude of nodes. Imagine: as we add more complexity to the XML world, we
may end up multiplying the number of entities (i.e. nodes), which would
make Mr Occam feel bad and make us look like children who never learn from
wisdom (who said that humanity learns only by crisis?).
Possible alternative model:
---------------------------
Can we envision a possible different data model for an XML document? You bet
we can. An even simpler one closer to the object world and not necessarily
resembling a parse tree. Let's call the atom an object and let's say that an
object is composed of a set of properties and that a property is simply a
name/value pair. An object is composed of two basic collections:
a) a collection of properties
b) a collection of objects
The collection of objects is used to build a hierarchy, and each object can
be mapped to an XML element. Let's take a concrete example. In this data
model, an element is transformed into an object and a set of properties
attached to this object. Say that all attributes are properties and that
the data content is also a property. Then, if we have an element like
<Book author="Didier PH Martin" publisher="Wrox" subject="XML">Professional
XML</Book>
Then this element is transformed into
object=Book
  |____ property={author,Didier PH Martin}
  |____ property={publisher,Wrox}
  |____ property={subject,XML}
  |____ property={content,Professional XML}
Then if we want to build a hierarchy of elements we simply have:
object ----- set of properties
  |____ object ----- set of properties
  |____ object ----- set of properties
  |____ object ----- set of properties
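Here is a minimal sketch of that transformation in Python, using the standard xml.etree.ElementTree parser. The names Obj and to_object are hypothetical, and I assume that the data content is stored under a property called "content" (as in the figure above). This simplification ignores mixed content and would clash with a real attribute named "content".

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Obj:
    """An object: a collection of properties plus a collection of objects."""
    name: str
    properties: Dict[str, str] = field(default_factory=dict)
    objects: List["Obj"] = field(default_factory=list)


def to_object(elem: ET.Element) -> Obj:
    """Map an element to an object: attributes and data content become properties."""
    obj = Obj(name=elem.tag, properties=dict(elem.attrib))
    text = (elem.text or "").strip()
    if text:
        obj.properties["content"] = text  # assumed property name
    obj.objects = [to_object(child) for child in elem]
    return obj


book = to_object(ET.fromstring(
    '<Book author="Didier PH Martin" publisher="Wrox" subject="XML">'
    'Professional XML</Book>'
))
print(book.name, book.properties)
```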
I guess that by now you have caught on to the data model. It is easy to grasp.
Why use such a model, and what are the advantages of using it?
From Macrocosm to microcosm
---------------------------
Question: do you know what kind of data model a directory service has?
Simple:
object ----- set of properties
  |____ object ----- set of properties
  |____ object ----- set of properties
  |____ object ----- set of properties
What are the advantages of using this alternative data model over the one that
we currently use, which is based on the concept of the Grove? Simple. Suppose
you have the same model for a directory service as for a document's
internals, and that you reunite documents used to encode knowledge and
documents used to encode data into a single format. Then you have the same
data model for the macro (the directory service) and the micro (the
document): a powerful paradigm that reunites the macrocosm and the
microcosm.
So, maybe the right questions to ask are: Is the Grove the right model? Is
there any alternative that can offer more? What are the others doing?
What is their model? Can we synthesize the two worlds? What are the effects
of having a single data model for:
- directory services
- data messages
- documents used to encode knowledge
- etc...
Benefits:
--------
If we instead use a model based on objects (as defined above), we have
potentially fewer entities, since we can map any kind of element to an
object, and we can provide a simpler API:
a) for object collection manipulation - to add, remove, etc. - mapping any
element to an object (even a processing instruction)
b) for property manipulation - to add, remove, etc. - mapping attributes
to properties and data content to a property, and even adding new
properties not coming from the document to a particular object type.
Only two basic interfaces are needed:
a) an interface to manipulate the object collection and also ease the memory
load because we know that an element is mapped to an object.
b) an interface to manipulate the properties.
Some languages can even map property access directly onto their own syntax, like
object.property = value
value = object.property
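Python is one such language. As a sketch only (the class name PropertyBag is hypothetical), attribute access can be redirected to the property collection with the standard __getattr__ and __setattr__ hooks:

```python
class PropertyBag:
    """Hypothetical object whose attributes are its name/value properties."""

    def __init__(self, name):
        # Bypass our own __setattr__ for the two internal fields.
        object.__setattr__(self, "_name", name)
        object.__setattr__(self, "_props", {})

    def __getattr__(self, key):
        # Called only when normal attribute lookup fails.
        try:
            return self._props[key]
        except KeyError:
            raise AttributeError(key) from None

    def __setattr__(self, key, value):
        # Every assignment lands in the property collection.
        self._props[key] = value


book = PropertyBag("Book")
book.author = "Didier PH Martin"   # object.property = value
value = book.author                # value = object.property
print(value)
```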
So for object manipulation (i.e. elements manipulation)
add()
remove()
update()
move()
copy()
find()
etc...
Notice that we added copy() and move(), which are useful methods, and also a
find() that may take a query expression as a parameter and return an object.
For properties
add()
rename()
remove()
get()
put()
etc...
And you know what? This simple API could be used for either a directory
service or for document internals.
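To make this concrete, here is one possible shape for the two interfaces, sketched in Python. Everything here is an assumption made for illustration: the class names Properties and Obj, the method signatures, and the use of a plain predicate function in place of a real query expression for find(). I left update() out because its semantics are not spelled out above.

```python
from copy import deepcopy
from typing import Callable, Dict, List, Optional


class Properties:
    """Property interface: a plain collection of name/value pairs."""

    def __init__(self) -> None:
        self._items: Dict[str, str] = {}

    def add(self, name: str, value: str) -> None:
        if name in self._items:
            raise KeyError(f"property {name!r} already exists")
        self._items[name] = value

    def put(self, name: str, value: str) -> None:
        self._items[name] = value  # create or overwrite

    def get(self, name: str, default: Optional[str] = None) -> Optional[str]:
        return self._items.get(name, default)

    def rename(self, old: str, new: str) -> None:
        self._items[new] = self._items.pop(old)

    def remove(self, name: str) -> None:
        del self._items[name]


class Obj:
    """Object interface: a named node with properties and child objects."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.properties = Properties()
        self.objects: List["Obj"] = []

    def add(self, child: "Obj") -> "Obj":
        self.objects.append(child)
        return child

    def remove(self, child: "Obj") -> None:
        self.objects.remove(child)

    def move(self, child: "Obj", new_parent: "Obj") -> None:
        self.remove(child)
        new_parent.add(child)

    def copy(self, child: "Obj", new_parent: "Obj") -> "Obj":
        return new_parent.add(deepcopy(child))

    def find(self, predicate: Callable[["Obj"], bool]) -> Optional["Obj"]:
        """Depth-first search; returns the first matching object or None."""
        if predicate(self):
            return self
        for child in self.objects:
            hit = child.find(predicate)
            if hit is not None:
                return hit
        return None


# A document fragment...
library = Obj("Library")
book = library.add(Obj("Book"))
book.properties.put("author", "Didier PH Martin")
book.properties.put("content", "Professional XML")

# ...and a directory-like tree, built with the very same calls.
root = Obj("org")
staff = root.add(Obj("staff"))
archive = root.add(Obj("archive"))
user = staff.add(Obj("user"))
user.properties.put("mail", "user@example.org")
user.properties.rename("mail", "email")
staff.move(user, archive)
found = root.find(lambda o: o.properties.get("email") is not None)
print(found.name, found.properties.get("email"))
```

The document and the directory are handled by exactly the same two classes; only the names and the property values differ.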
OK, time to return to the lab, I have a presentation to prepare. By the way,
I'll be in New York on Wednesday and Thursday (at WebNewYork), so, if you
want to discuss these issues in person, I'll be more than glad to exchange
ideas and understand the world through new points of view. Otherwise, for
the curious of this world, my email is always open to discussion and learning.
"A different point of view is worth a thousand points in IQ"
Cheers
Didier PH Martin
----------------------------------------------
Email: martind at netfolder.com
Conferences: Web New York (http://www.mfweb.com)
Book: XML Pro published by Wrox Press
Products: http://www.netfolder.com
xml-dev: A list for W3C XML Developers. To post, mailto:xml-dev at ic.ac.uk
Archived as: http://www.lists.ic.ac.uk/hypermail/xml-dev/ or CD-ROM/ISBN 981-02-3594-1
Please note: New list subscriptions and unsubscriptions
are now ***CLOSED*** in preparation for list transfer to OASIS.