Call for unifying and clarifying XML 1.0, DOM, XPATH, and XML Infoset

Nils Klarlund klarlund at
Thu Jan 27 17:55:54 GMT 2000

> So I see several reasons for the general failure to understand what
> groves are all about:
>  - lack of suitable presentational material
>  - the fact that it comes from a different community with a different
>    philosophy and terminology
>  - the extreme suspicion with which many W3C (and other) people view
>    anything coming out of the ISO processes (this is not directed at
>    you, Michael; I have no idea what your stance is)
>  - the fact that groves and the DOM have different purposes

Or maybe that groves are just too abstract? (I'm saying this solely
based on what I read in the excellent tutorial by Paul Prescod that
you referred to.) In fact, even the XML quintessence, trees, is not a
clear sell: recursion and trees are a standard part of a computer
science curriculum, but these concepts are not easily swallowed by

The best hope we have is to base what we develop directly on concepts
that we can assume have been somewhat understood through training or
mathematical intuition.  Groves appear well thought-out for their
purpose, but the mathematical abstractions they embody are not
necessarily any easier to grasp than if those abstractions were
applied directly to the problem domain at hand.  In fact, it is my
experience that formal methods frameworks are often a hindrance to
exposing simple ideas.  Talking mathematically is not bad, but the
talking must be in mathematics that's known: sets, maps, trees, etc.,
not in a little-known lingo that requires extra training.  (By the
way, I realize belatedly that Infosets are more or less couched in
grove language, mostly unknown to the world, whether justified or not.
XML Schema is in turn couched in Infoset speak, and so on.)

I am not qualified to comment on SGML itself, but even XML 1.0 does
appear to be suffering from over-conceptualization (too many concepts
that don't fit together too precisely).  As a simple example, look at
content models:

- a content model is not a model for content in general, but only for
  two kinds of content, namely elements and character data, not
  processing instructions and not comments (incidentally, I think it
  could just as well have been termed "markup model", since markup is
  a more general concept than content)

- the content model concept is further split into two concepts:

  (1) element content, which allows only elements in content
  (2) mixed content, which allows character data interspersed
      with elements

Thus, there are now two similar sets of syntax and regular expressions
for describing not content, but the projection of content onto
elements and character data.

An alternative approach would have declared "content" to simply
consist of just element nodes and text nodes ("text nodes" as in
XPATH) representing character data.  Then there would be no need for
(2), since a content model now describes a regular language over the
alphabet consisting of what you would expect: element names and the
token text() (or #PCDATA).  And, you'd be able to describe, say, HTML
with Appendix elements that must appear at the end:

  ((#PCDATA | H1 | H2 |...)*, Appendix*)

So, the distinction between element content and mixed content is a
needless one that both restricts what can be expressed and muddles
the conceptual framework.  (The way of treating content just outlined
is what we chose for the DSD schema notation, incidentally.)
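
To make the idea concrete, here is a minimal sketch in Python of
checking content against a single regular language over element names
and #PCDATA.  It is only an illustration, not how DSD or any XML
parser does it.  The names MODEL, child_tokens and matches are made up
for this example, the alphabet is cut down to H1, H2 and Appendix, and
whitespace-only text is ignored as a simplifying assumption.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical content model ((#PCDATA | H1 | H2)*, Appendix*), written as a
# regular expression over space-terminated tokens.
MODEL = re.compile(r"(?:(?:#PCDATA|H1|H2) )*(?:Appendix )*")

def child_tokens(element):
    """List the content of an element as tokens: child element names,
    plus '#PCDATA' for each run of non-blank character data."""
    tokens = []
    if element.text and element.text.strip():
        tokens.append("#PCDATA")
    for child in element:
        tokens.append(child.tag)
        # Text following a child is stored on that child's 'tail'.
        if child.tail and child.tail.strip():
            tokens.append("#PCDATA")
    return tokens

def matches(element):
    """True if the element's content is a word of the MODEL language."""
    word = "".join(token + " " for token in child_tokens(element))
    return MODEL.fullmatch(word) is not None

good = ET.fromstring("<Body>intro <H1>t</H1> more <Appendix/><Appendix/></Body>")
bad = ET.fromstring("<Body><Appendix/>late text<H1>t</H1></Body>")
print(matches(good), matches(bad))
```

The first document mixes text and headings and ends with its Appendix
elements, so it is accepted; the second has text and a heading after
an Appendix, so it is rejected.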

XPATH with its tree model goes a long way toward clearing things up; so
does DOM2, but the crown unifying these not-quite-compatible models is

