Call for unifying and clarifying XML 1.0, DOM, XPATH, and XML Infoset

Mon Jan 24 15:49:28 GMT 2000

XML should be about a universal and simple model of trees based on the
linear syntax of XML 1.0, right?  Well, it's not.  I hope to generate
a discussion of how the current multitude of models can be unified.
This message is long, reflecting the enormity of the confusion that's
being sown.  And, I want to convince everybody who's interested that
it'll be a conspicuous failure not to unify terminology and models;
conversely, I believe that at a small price, involving a little
back-pedaling, XML could get an attractive and universal model.
Time is running out, however.

Consider the following five W3C contributions to the question of what
tree an XML document represents:

- LOST CHILDREN (DOM2):

  "Attr nodes are not actually child nodes of the element they
  describe, the DOM does not consider them part of the document tree"

  So, attribute nodes are attached to their element node, but the
  element node is not a parent.  The document tree doesn't represent
  the document (but presumably, the "document hierarchy" does).

- LOSING YOUR CHILDHOOD (XPATH): 

  "The element is the parent of each of these attribute nodes, but an
  attribute node is not a child of its parent element"

  So, I'm not a child of my parent (if I am an attribute
  node). Please, don't say you meant that!

- THE TONGUE TWISTER (INFOSET):

  Tree = XML Information Set
  Node = XML Information Item

  This abstract model is the promising one.  Modulo the unhelpful
  terminology, it's exactly the simple tree model that's so needed.
  Its nodes are not called nodes and its children are not called
  children, because apparently the authors believe that would make the
  confusion even more explicit.

- THE NARROW VIEW (XML 1.0):

  "for each non-root element C in the document, there is one other
  element P in the document such that C is in the content of P, but is
  not in the content of any other element that is in the content of
  P. P is referred to as the parent of C, and C as a child of P."

  So, according to XML 1.0, only children that are elements are in
  fact children.

- THE PC VIEW (XML Schema):

  "The shipDate element daughter of PurchaseOrderType is..."

  The term daughter is not defined in the Schema draft.
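
The DOM behaviour quoted under LOST CHILDREN can be observed directly.
What follows is only a minimal sketch, assuming Python's standard
xml.dom.minidom as a stand-in for a DOM Level 2 implementation; the
sample document and the variable names are made up for illustration.

```python
from xml.dom.minidom import parseString

# Hypothetical sample document: one element with one attribute and some content.
doc = parseString('<order id="42">text<item/></order>')
order = doc.documentElement
attr = order.getAttributeNode("id")

# The attribute is attached to its element ...
owner_is_order = attr.ownerElement is order
# ... but it has no parent, and it is not among the element's children.
attr_parent = attr.parentNode
attr_is_child = any(child is attr for child in order.childNodes)
child_names = [child.nodeName for child in order.childNodes]

print(owner_is_order, attr_parent, attr_is_child, child_names)
```

In this implementation the element can reach the attribute and the
attribute can reach the element, yet neither the parent link nor the
child list records the relationship.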

This is a big mess.  I'll outline a modest simplification that affects
several of the (draft) recommendations with the result that there is
one model and one terminology.  But before that, I'll give more
examples to show some adverse effects of the lack of consistency among
the tree models.  At the end, I'll show how some of the examples
appear after the simplification.  My simplification is certainly not
the only way to go about these fundamental problems, but I hope that
it'll show that they are solvable.

1. TEXT AND MARKUP CONFUSION

In XPATH and DOM, text denotes a maximal contiguous sequence of
characters (with no tags), but in XML 1.0 a very different explanation
is provided:

  "Text consists of intermingled character data and markup", 

where markup is defined as

   "Markup takes the form of start-tags, end-tags, empty-element tags,
    entity references, character references, comments, CDATA section
    delimiters, document type declarations, and processing
    instructions."

INFOSET does not take a position, but introduces a finer-grained model,
where individual characters are nodes in the tree representation.

2. XPATH/XSLT NODE CONFUSION

- Apparently, node means node, but not quite in XSLT:

  "node() matches any node other than an attribute node and the root node"  (2)

- and the contrary opinion is offered in XPATH:

  "A node test node() is true for any node of any type whatsoever" 

In fact, there is no technical inconsistency, only an intricate
explanation: when node() is used as a pattern, it is assumed that the
pattern applies to children (the ones that are not attributes), since
"child" is the default axis.  So to include attribute nodes one has to
write "@* | node()".  The "@" overrides the default axis, but node()
doesn't.  This is pretty wild.  I wrote a long XSLT program in October,
and in January I couldn't understand even the patterns I had used
until I'd spent 20 minutes re-reading XPATH and XSLT.

3. MORE MARKUP AND TEXT CONFUSION

In DOM, we read

   "If there is no markup inside an element's content, the text is   (3)
    contained in a single object implementing the Text interface that
    is the only child of the element."

But, the sentence that follows says:

   "If there is markup, it is parsed into the information items    (3')
    (elements, comments, etc.) and Text nodes that form the list of
    children of the element."

So, in this sentence, markup now means markup in the XML 1.0 sense +
character data.  Also, since information items, in fact, include the
character data, the sentence says that both the fine-grained character
information items and some corresponding Text nodes somehow are
included in the list of children.
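
What sentences (3) and (3') amount to in practice can be checked
against an implementation.  This is a rough sketch, again assuming
Python's xml.dom.minidom as the DOM implementation; describe_children
is a hypothetical helper and the two documents are invented examples.

```python
from xml.dom.minidom import parseString

def describe_children(xml_text):
    """List (nodeName, data-or-None) for each child of the document element."""
    element = parseString(xml_text).documentElement
    described = []
    for child in element.childNodes:
        # Only Text and Comment children carry character data here.
        has_data = child.nodeType in (child.TEXT_NODE, child.COMMENT_NODE)
        described.append((child.nodeName, child.data if has_data else None))
    return described

# Case (3): no markup inside the content, so a single Text child.
no_markup = describe_children("<p>just character data</p>")

# Case (3'): markup inside the content, so Text nodes interleaved with others.
with_markup = describe_children("<p>Hello <b>big</b><!-- note -->world</p>")

print(no_markup)
print(with_markup)
```

The children list holds Text nodes next to element and comment nodes;
no per-character items show up at this level.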

4. ROOT OR DOCUMENT CONFUSION

Let's look at XML 1.0 again: "There is exactly one element, called the
root, or document element, no part of which appears in the content of
any other element."  Now, the root element node is a child of the root
node according to (DOM, XPATH)!  This consequence would be formulated
in INFOSET speak as the prose:

  A reference to the document element information item is contained in
  the children list of the document information item.  (4)

(John, if you read this, please correct me if I'm wrong.)
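
The DOM side of this can be illustrated with a short sketch, assuming
Python's xml.dom.minidom as the implementation; the document is an
invented example.

```python
from xml.dom.minidom import parseString

# Hypothetical document with a comment in the prolog, before the document element.
doc = parseString("<!-- prolog comment --><book><title>T</title></book>")

root_children = [child.nodeName for child in doc.childNodes]
doc_element = doc.documentElement

# The XML 1.0 "root" (document element) is itself a child of the DOM root.
element_parent_is_doc = doc_element.parentNode is doc
doc_has_no_parent = doc.parentNode is None

print(doc.nodeName, root_children, element_parent_is_doc, doc_has_no_parent)
```

So there are two candidates for the word "root": the Document object at
the top, and the one element among its children.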

5. THE TRANSLATION OF INFOSET INTO CONVENTIONAL TERMINOLOGY

In XPATH, a whole section is dedicated to describing a natural data
model, even though it substantially replicates INFOSET.  Since the
authors of XPATH wisely use familiar concepts, they've been obliged to
include tautologies:

  "An element node comes from an element information item. The
   children of an element node come from the children and children -
   comments properties. The attributes of an element node come from
   the attributes property."

And who does that help?

THE PROPOSAL

There are three main kinds of nodes: root nodes, property nodes, and
content nodes.  They form a hierarchy of node concepts as follows:

root node

property nodes: 
  attribute node
  notation node
  namespace declaration node

content nodes: 
  cdata nodes: 
    CDSect node  (for CDATA sections)
    text node 
  markup nodes:
    element node 
    comment node 
    entity node 
    PI node

This terminology seems to be rather consistent with XML 1.0 except
that we use "text" in the sense found in DOM and XPATH and that
"child" is not just applied to elements, but all nodes that are
immediate descendants.  By an official resolution, this difference
should be made clear.
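
The hierarchy of node concepts listed above can be written down as a
class hierarchy.  This is only a sketch: the Python class names and the
main_kind helper are illustrative assumptions, not part of any
specification.

```python
class Node:
    """Any node of the proposed XML tree."""

class RootNode(Node):
    """The single node at the top of the tree."""

class PropertyNode(Node):
    """Nodes describing an element rather than forming its content."""

class AttributeNode(PropertyNode): pass
class NotationNode(PropertyNode): pass
class NamespaceDeclarationNode(PropertyNode): pass

class ContentNode(Node):
    """Nodes that make up the content of the root or of an element."""

class CDataNode(ContentNode):
    """Character data, whether plain text or a CDATA section."""

class CDSectNode(CDataNode): pass
class TextNode(CDataNode): pass

class MarkupNode(ContentNode):
    """Content that is not character data."""

class ElementNode(MarkupNode): pass
class CommentNode(MarkupNode): pass
class EntityNode(MarkupNode): pass
class PINode(MarkupNode): pass

def main_kind(node_class):
    """Name the main kind (root, property, or content) a node class belongs to."""
    for kind in (RootNode, PropertyNode, ContentNode):
        if issubclass(node_class, kind):
            return kind.__name__
    raise ValueError("not a node class of the proposed tree")

print(main_kind(TextNode), main_kind(AttributeNode), main_kind(RootNode))
```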

A root node is the document information item of INFOSET or the
Document interface of DOM or the root node of XPATH.  The root node
has exactly one element child, which is called the document node,
since it corresponds to the document element of XML 1.0.  By 
resolution, the term "root element" in XML 1.0 is banished.

Now, define the *text view* of the XML tree as the tree obtained by
grouping together maximal consecutive sequences of text and CDSect
nodes into one text node.
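
The grouping step can be sketched for one list of sibling nodes.  The
(kind, value) tuple representation and the function name text_view are
assumptions made for this illustration; a complete version would also
recurse into the content of each element.

```python
from itertools import groupby

# Kinds of nodes that carry character data in the proposed hierarchy.
CDATA_KINDS = {"text", "cdsect"}

def text_view(children):
    """Merge each maximal run of text/CDSect siblings into a single text node."""
    merged = []
    runs = groupby(children, key=lambda node: node[0] in CDATA_KINDS)
    for is_cdata, run in runs:
        run = list(run)
        if is_cdata:
            merged.append(("text", "".join(value for _, value in run)))
        else:
            merged.extend(run)
    return merged

# Hypothetical content of one element: text, a CDATA section, text, an element, text.
children = [
    ("text", "a < "),
    ("cdsect", "b & c"),
    ("text", " > d"),
    ("element", "em"),
    ("text", "tail"),
]
view = text_view(children)
print(view)
```

The first three siblings collapse into one text node, while the text
after the element stays separate because the run is interrupted.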

That's all.

(I am omitting document declarations from this discussion; they are
less important, although they need a model, too.)

WHAT ARE THE REPERCUSSIONS?

INFOSET will become XML-TREE, and it will be the enjoyable gold
standard that defines the XPATH data model and for which DOM is the
API---all without notational and conceptual confusion. 

For example, (4) becomes

  "The document node is a child of the root node."

The XPATH data model *is* the text view of the XML tree.
But now XPATH and XSLT can make use of additional predicates:

  content() is the pattern that matches any content node

That solves (2).  In particular, an erratum could be issued that would
get rid of the node() pattern puzzle.  (Even without an erratum, good
practice would dictate that content() be used in most situations where
node() is now used.)  The erratum would further specify that the
"child" axis will now be called the "content" axis.

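The difference between the three notions can be tabulated.  This is a
sketch of my reading of the quoted sentences, restricted to the seven
XPATH node types; the function names and the string labels for node
types are assumptions for illustration, and content() is the proposed
predicate, not an existing one.

```python
XPATH_NODE_TYPES = ["root", "element", "attribute", "namespace",
                    "text", "comment", "pi"]

# Node types that can be reached on the default "child" axis.
CHILD_AXIS_TYPES = {"element", "text", "comment", "pi"}

# Content nodes of the proposal, as they appear in the text view.
CONTENT_TYPES = {"element", "text", "comment", "pi"}

def xpath_node_test(node_type):
    """XPATH: the node test node() is true for any node whatsoever."""
    return node_type in XPATH_NODE_TYPES

def xslt_node_pattern(node_type):
    """XSLT: the pattern node() silently means child::node()."""
    return node_type in CHILD_AXIS_TYPES

def content_pattern(node_type):
    """Proposed content(): says explicitly what the pattern matches."""
    return node_type in CONTENT_TYPES

table = {
    t: (xpath_node_test(t), xslt_node_pattern(t), content_pattern(t))
    for t in XPATH_NODE_TYPES
}
for node_type, row in table.items():
    print(node_type, row)
```

Under these assumptions content() selects what the node() pattern
selects today, but without relying on an implicit default axis.
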
For DOM, there will be some changes that I hope people would find
entirely innocent: for example, the introduction of the DOM structure
model in section 1.1.1:

  "The DOM presents documents as a hierarchy of Node objects that also
  implement other, more specialized interfaces. Some types of nodes
  may have child nodes of various types, and others are leaf nodes
  that cannot have anything below them in the document structure. The
  node types, and which node types they may have as children, are as
  follows: "

could be recast:

  "The DOM presents documents as Node objects organized according to
   the XML Tree model.  Some nodes also implement other, more
   specialized interfaces.  An element node may have child nodes of
   various types that represent content, attributes, and namespace
   declarations.  The node types, and which kinds of node types their
   content children may have, are as follows:"

So this is not a revolution!  There would be very minor changes to
the IDL specification as well:

interface Node {
  // NodeType
  ...
  readonly attribute Node             parentNode;
  readonly attribute NodeList         childNodes;
  readonly attribute Node             firstChild;
  readonly attribute Node             lastChild;
  ...}

becomes

interface Node {
  // NodeType
  ...
  readonly attribute Node             parentNode;
  readonly attribute NodeList         contentNodes;
  readonly attribute Node             firstContentChild;
  readonly attribute Node             lastContentChild;
  ...},

and the ownerElement can now be removed from

interface Attr : Node {
  readonly attribute DOMString        name;
  readonly attribute boolean          specified;
           attribute DOMString        value;
                                        // raises(DOMException) on setting

  // Introduced in DOM Level 2:
  readonly attribute Element          ownerElement;
};

since an attribute node has a parent, which the ownerElement was
supposed to denote. 
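
A toy model shows how the revised interface would behave.  This is a
sketch under stated assumptions: the class below is not a DOM
implementation, and propertyNodes, add_property, and add_content are
hypothetical names introduced here; only parentNode, contentNodes,
firstContentChild, and lastContentChild come from the IDL above.

```python
class Node:
    """Toy node in which attributes are children with an ordinary parent."""

    def __init__(self, kind, name=None, value=None):
        self.kind = kind
        self.name = name
        self.value = value
        self.parentNode = None
        self.propertyNodes = []   # hypothetical list for property children
        self.contentNodes = []    # content children, as in the revised IDL

    def add_property(self, node):
        node.parentNode = self
        self.propertyNodes.append(node)
        return node

    def add_content(self, node):
        node.parentNode = self
        self.contentNodes.append(node)
        return node

    @property
    def firstContentChild(self):
        return self.contentNodes[0] if self.contentNodes else None

    @property
    def lastContentChild(self):
        return self.contentNodes[-1] if self.contentNodes else None

order = Node("element", "order")
id_attr = order.add_property(Node("attribute", "id", "42"))
item = order.add_content(Node("element", "item"))
note = order.add_content(Node("text", value="rush"))

# The attribute's parent is its element, so no ownerElement is needed.
print(id_attr.parentNode is order, order.firstContentChild is item)
```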

And (3) and (3') would simply become

  "The contentNodes list contains the content nodes of the element."

For XML Schema, there would be significant simplifications in
terminology.

Simpletons, are you still there?

/Nils

xml-dev: A list for W3C XML Developers. To post, mailto:xml-dev at ic.ac.uk
Archived as: http://www.lists.ic.ac.uk/hypermail/xml-dev/ or CD-ROM/ISBN 981-02-3594-1