Call for unifying and clarifying XML 1.0, DOM, XPATH, and XML Infoset
Nils Klarlund
klarlund at research.att.com
Mon Jan 24 15:49:28 GMT 2000
XML should be about a universal and simple model of trees based on the
linear syntax of XML 1.0, right? Well, it's not. I hope to generate
a discussion of how the current multitude of models can be unified.
This message is long, reflecting the enormity of the confusion that's
being sown. And, I want to convince everybody who's interested that
it'll be a conspicuous failure not to unify terminology and models;
conversely, I believe that at a little price, involving a small amount
of back-pedaling, XML could get an attractive and universal model.
Time is running out, however.
Consider the following five W3C contributions to the question of what
tree an XML document represents:
- LOST CHILDREN (DOM2):
"Attr nodes are not actually child nodes of the element they
describe, the DOM does not consider them part of the document tree"
So, attribute nodes are attached to their element node, but the
element node is not a parent. The document tree doesn't represent
the document (but presumably, the "document hierarchy" does).
- LOSING YOUR CHILDHOOD (XPATH):
"The element is the parent of each of these attribute nodes, but an
attribute node is not a child of its parent element"
So, I'm not a child of my parent (if I am an attribute
node). Please, don't say you meant that!
- THE TONGUE TWISTER (INFOSET):
Tree = XML Information Set
Node = XML Information Item
This abstract model is the promising one. Modulo the unhelpful
terminology, it's exactly the simple tree model that's so needed.
Its nodes are not called nodes and its children are not called
children, because apparently the authors believe that would make the
confusion even more explicit.
- THE NARROW VIEW (XML 1.0):
"for each non-root element C in the document, there is one other
element P in the document such that C is in the content of P, but is
not in the content of any other element that is in the content of
P. P is referred to as the parent of C, and C as a child of P."
So, according to XML 1.0, only children that are elements are in
fact children.
- THE PC VIEW (XML Schema):
"The shipDate element daughter of PurchaseOrderType is..."
The term daughter is not defined in the Schema draft.
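
The DOM2 behaviour quoted under LOST CHILDREN can be observed in a DOM implementation. Here is a minimal sketch using Python's xml.dom.minidom; the sample document and variable names are mine, and other DOM implementations may differ in detail:

```python
from xml.dom import minidom

# Hypothetical sample document, for illustration only.
doc = minidom.parseString('<order id="42"><item/></order>')
order = doc.documentElement
id_attr = order.getAttributeNode("id")

# The attribute is attached to its element...
print(id_attr.ownerElement is order)
# ...but it is absent from the element's child list...
print([n.nodeName for n in order.childNodes])
# ...and the element is not reported as its parent.
print(id_attr.parentNode)
```

So the attribute hangs off the element through ownerElement while parentNode stays empty, which is exactly the asymmetry complained about above.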
This is a big mess. I'll outline a modest simplification that affects
several of the (draft) recommendations with the result that there is
one model and one terminology. But before that, I'll give more
examples to show some adverse effects of the lack of consistency among
the tree models. At the end, I'll show how some of the examples
appear after the simplification. My simplification is certainly not
the only way to go about these fundamental problems, but I hope that
they'll show that they are solvable.
1. TEXT AND MARKUP CONFUSION
In XPATH and DOM, text denotes a maximal contiguous sequence of
characters (with no tags), but in XML 1.0 a very different explanation
is provided:
"Text consists of intermingled character data and markup",
where markup is defined as
"Markup takes the form of start-tags, end-tags, empty-element tags,
entity references, character references, comments, CDATA section
delimiters, document type declarations, and processing
instructions."
INFOSET does not take a position, but introduces a finer-grained model,
where individual characters are nodes in the tree representation.
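
To make the DOM sense of "text" concrete, here is a small sketch with Python's xml.dom.minidom (the sample document and the describe helper are my own, hypothetical names). Each run of characters between two pieces of markup shows up as one Text node:

```python
from xml.dom import minidom

doc = minidom.parseString("<p>Hello <b>big</b> world<!-- note -->!</p>")
p = doc.documentElement

def describe(node):
    """Return ("text", data) for Text nodes, (nodeName, None) otherwise."""
    if node.nodeType == node.TEXT_NODE:
        return ("text", node.data)
    return (node.nodeName, None)

children = [describe(n) for n in p.childNodes]
for child in children:
    print(child)
```

In the XML 1.0 sense, by contrast, the whole content of p, tags and comment included, would count as "text".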
2. XPATH/XSLT NODE CONFUSION
- Apparently, node means node, but not quite in XSLT:
"node() matches any node other than an attribute node and the root node" (2)
- and the contrary opinion is offered in XPATH:
"A node test node() is true for any node of any type whatsoever"
In fact, there is not a technical inconsistency. There is an
intricate explanation: when node() is used as a pattern, it is assumed
that the pattern applies to children (the ones that are not
attributes), since "child" is the default axis. So to include
attribute nodes one has to write "@* | node()". The "@" overrides
the default axis, but node() doesn't. This is pretty wild. I wrote a
long XSLT program in October, and in January I couldn't understand even
my own patterns without first spending 20 minutes re-reading XPATH and
XSLT.
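
The interplay of node test and axis can be mimicked in a few lines. This is only a sketch over xml.dom.minidom, not an XPath engine; child_axis, attribute_axis and node_test are hypothetical helper names of mine:

```python
from xml.dom import minidom

def child_axis(node):
    """Nodes on the child axis, the default axis of a step."""
    return list(node.childNodes)

def attribute_axis(node):
    """Nodes on the attribute axis (what "@" selects)."""
    attrs = node.attributes  # None for non-element nodes in minidom
    if attrs is None:
        return []
    return [attrs.item(i) for i in range(attrs.length)]

def node_test(node):
    """node() as a node test: true for any node of any type."""
    return True

doc = minidom.parseString('<a x="1" y="2">text<b/></a>')
a = doc.documentElement

# "node()" is short for "child::node()": attributes never get tested.
plain = [n.nodeName for n in child_axis(a) if node_test(n)]
# "@* | node()" adds the attribute axis explicitly.
union = [n.nodeName for n in attribute_axis(a) + child_axis(a) if node_test(n)]
print(plain)
print(union)
```

The test itself accepts everything; it is the implicit axis that keeps the attributes out of sight.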
3. MORE MARKUP AND TEXT CONFUSION
In DOM, we read
"If there is no markup inside an element's content, the text is (3)
contained in a single object implementing the Text interface that
is the only child of the element."
But, the sentence that follows says:
"If there is markup, it is parsed into the information items (3')
(elements, comments, etc.) and Text nodes that form the list of
children of the element."
So, in this sentence, markup now means markup in the XML 1.0 sense +
character data. Also, since information items, in fact, include the
character data, the sentence says that both the fine-grained character
information items and some corresponding Text nodes somehow are
included in the list of children.
4. ROOT OR DOCUMENT CONFUSION
Let's look at XML 1.0 again: "There is exactly one element, called the
root, or document element, no part of which appears in the content of
any other element." Now, the root element node is a child of the root
node according to (DOM, XPATH)! This consequence would be formulated
in INFOSET speak as the prose:
A reference to the document element information item is contained in
the children list of the document information item. (4)
(John, if you read this, please correct me if I'm wrong.)
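
The two notions of "root" are easy to tell apart in a DOM implementation. A sketch with Python's xml.dom.minidom, using a made-up document:

```python
from xml.dom import minidom

doc = minidom.parseString("<!-- preamble --><report><title/></report>")

root_node = doc                          # DOM Document, XPATH root node
document_element = doc.documentElement   # XML 1.0 "root, or document element"

# The XML 1.0 root is a child of the DOM/XPATH root, next to the comment.
print([n.nodeName for n in root_node.childNodes])
print(document_element.parentNode is root_node)
print(root_node.parentNode)
```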
5. THE TRANSLATION OF INFOSET INTO CONVENTIONAL TERMINOLOGY
In XPATH, a whole section is dedicated to describing a natural data
model, even though it substantially replicates INFOSET. Since the
authors of XPATH wisely use familiar concepts, they've been obliged to
include tautologies:
"An element node comes from an element information item. The
children of an element node come from the children and children -
comments properties. The attributes of an element node come from
the attributes property."
And who does that help?
THE PROPOSAL
There are three main kinds of nodes: root nodes, property nodes, and
content nodes. They form a hierarchy of node concepts as follows:
  root node
  property nodes:
    attribute node
    notation node
    namespace declaration node
  content nodes:
    cdata nodes:
      CDSect node (for CDATA sections)
      text node
    markup nodes:
      element node
      comment node
      entity node
      PI node
This terminology seems to be rather consistent with XML 1.0 except
that we use "text" in the sense found in DOM and XPATH and that
"child" is not just applied to elements, but all nodes that are
immediate descendants. By an official resolution, this difference
should be made clear.
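
For those who think in code, the hierarchy could be written down as a class tree. This is only a sketch: the class names are my own rendering of the node kinds above, and the parent/children bookkeeping is an assumption about how "child" would work for all immediate descendants:

```python
class Node:
    def __init__(self):
        self.parent = None
        self.children = []   # every immediate descendant, properties included

    def append(self, child):
        child.parent = self
        self.children.append(child)
        return child

class RootNode(Node): pass

class PropertyNode(Node): pass
class AttributeNode(PropertyNode): pass
class NotationNode(PropertyNode): pass
class NamespaceDeclarationNode(PropertyNode): pass

class ContentNode(Node): pass
class CdataNode(ContentNode): pass
class CDSectNode(CdataNode): pass
class TextNode(CdataNode): pass
class MarkupNode(ContentNode): pass
class ElementNode(MarkupNode): pass
class CommentNode(MarkupNode): pass
class EntityNode(MarkupNode): pass
class PINode(MarkupNode): pass

root = RootNode()
elem = root.append(ElementNode())
attr = elem.append(AttributeNode())
text = elem.append(TextNode())

# An attribute is a child of its parent, but it is not content.
print(attr.parent is elem, attr in elem.children)
print(isinstance(attr, ContentNode), isinstance(text, ContentNode))
```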
A root node is the document information item of INFOSET or the
Document interface of DOM or the root node of XPATH. The root node
has exactly one element child, which is called the document node,
since it corresponds to the document element of XML 1.0. By
resolution, the term "root element" in XML 1.0 is banished.
Now, define the *text view* of the XML tree as the tree obtained by
grouping maximal consecutive sequences of text and CDSect nodes
into one text node.
That's all.
(I am omitting document declarations from this discussion; they are
less important, although they need a model, too.)
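
The text view defined above amounts to a simple merge over each child list. A sketch, assuming children are represented as (kind, value) pairs with the hypothetical kind names "text", "cdsect" and "element"; it handles one child list, so a full implementation would apply it at every element:

```python
from itertools import groupby

CHARACTER_KINDS = {"text", "cdsect"}

def text_view(children):
    """Merge each maximal run of adjacent text/CDSect items into one text item."""
    merged = []
    is_characters = lambda child: child[0] in CHARACTER_KINDS
    for chars, run in groupby(children, key=is_characters):
        run = list(run)
        if chars:
            merged.append(("text", "".join(value for _, value in run)))
        else:
            merged.extend(run)
    return merged

children = [
    ("text", "a "),
    ("cdsect", "<b>"),
    ("text", " c"),
    ("element", "em"),
    ("text", "d"),
]
print(text_view(children))
```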
WHAT ARE THE REPERCUSSIONS?
INFOSET will become XML-TREE, and it will be the enjoyable gold
standard that defines the XPATH data model and for which DOM is the
API---all without notational and conceptual confusion.
For example, (4) becomes
"The document node is a child of the root node."
The XPATH data model *is* the text view of the XML tree.
But now XPATH and XSLT can make use of additional predicates:
content() is the pattern that matches any content node
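
Here is roughly what such a predicate would do, sketched over xml.dom.minidom. The set of node types counted as content is my reading of the hierarchy in the proposal (mapping "entity node" to the DOM entity reference type is an assumption), and proposed_children is a hypothetical helper that lists attributes together with content:

```python
from xml.dom import Node, minidom

CONTENT_TYPES = {
    Node.ELEMENT_NODE,
    Node.TEXT_NODE,
    Node.CDATA_SECTION_NODE,
    Node.COMMENT_NODE,
    Node.PROCESSING_INSTRUCTION_NODE,
    Node.ENTITY_REFERENCE_NODE,  # assumption: stands in for "entity node"
}

def content(node):
    """True for content nodes, false for property nodes such as attributes."""
    return node.nodeType in CONTENT_TYPES

def proposed_children(element):
    """All immediate descendants under the proposal: attributes, then content."""
    attrs = element.attributes
    return [attrs.item(i) for i in range(attrs.length)] + list(element.childNodes)

doc = minidom.parseString('<a x="1">t<!-- c --><b/></a>')
a = doc.documentElement
matched = [n.nodeName for n in proposed_children(a) if content(n)]
print(matched)
```

With attributes counted as children, the filtering is done by an explicit predicate rather than by a default axis.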
That solves (2). In particular, an erratum could be issued that would
get rid of the node() pattern puzzle. (Even without, future good
practice would dictate that content() be used in most situations where
node() is now used). The erratum would further specify that the
"child" axis will now be called the "content" axis.
For DOM, there will be some changes that I hope people would find
entirely innocent: for example, the introduction of the DOM structure
model in section 1.1.1:
"The DOM presents documents as a hierarchy of Node objects that also
implement other, more specialized interfaces. Some types of nodes
may have child nodes of various types, and others are leaf nodes
that cannot have anything below them in the document structure. The
node types, and which node types they may have as children, are as
follows: "
could be recast:
"The DOM presents documents as Node objects organized according to
the XML Tree model. Some nodes also implement other, more
specialized interfaces. An element node may have child nodes of
various types that represent content, attributes, and namespace
declarations. The node types, and which kinds of node types their
content children may have, are as follows:"
So this is not a revolution! There would be very minor changes to
the IDL specification as well:
  interface Node {
    // NodeType
    ...
    readonly attribute Node parentNode;
    readonly attribute NodeList childNodes;
    readonly attribute Node firstChild;
    readonly attribute Node lastChild;
    ...}
becomes
  interface Node {
    // NodeType
    ...
    readonly attribute Node parentNode;
    readonly attribute NodeList contentNodes;
    readonly attribute Node firstContentChild;
    readonly attribute Node lastContentChild;
    ...},
and the ownerElement can now be removed from
  interface Attr : Node {
    readonly attribute DOMString name;
    readonly attribute boolean specified;
    attribute DOMString value;
    // raises(DOMException) on setting
    // Introduced in DOM Level 2:
    readonly attribute Element ownerElement;
  };
since an attribute node has a parent, which the ownerElement was
supposed to denote.
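
As a thought experiment, the revised interface can be emulated on top of an existing DOM. A sketch over xml.dom.minidom; proposed_parent and content_nodes are hypothetical adapter functions of mine, not part of any DOM:

```python
from xml.dom import Node, minidom

def proposed_parent(node):
    """parentNode under the proposal: an attribute's parent is its element."""
    if node.nodeType == Node.ATTRIBUTE_NODE:
        return node.ownerElement
    return node.parentNode

def content_nodes(element):
    """contentNodes under the proposal: today's childNodes list."""
    return list(element.childNodes)

doc = minidom.parseString('<a x="1"><b/></a>')
a = doc.documentElement
x = a.getAttributeNode("x")
b = content_nodes(a)[0]

# One notion of parent now serves attributes and content alike.
print(proposed_parent(x) is a, proposed_parent(b) is a)
```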
And (3) and (3') would simply become
"The contentNodes list contains the content nodes of the element."
For XML Schema, there would be significant simplifications in
terminology.
Simpletons, are you still there?
/Nils
xml-dev: A list for W3C XML Developers. To post, mailto:xml-dev at ic.ac.uk
Archived as: http://www.lists.ic.ac.uk/hypermail/xml-dev/ or CD-ROM/ISBN 981-02-3594-1