Marcus Carr mrc at allette.com.au
Mon Aug 25 11:39:37 BST 1997

Apologies in advance to all those who have thought and fought over this
issue for a long time, but as a self-confessed critic of the claim that
"XML is SGML", I feel compelled to throw my hat into the ring.

As far as I can see, there are only two circumstances when whitespace is
an issue - receiving an XML document or authoring one. When receiving,
it doesn't matter whether you have a DTD - the application can determine
from a well-formed document whether it should regard an element's
content as MIXED or ELEMENT. It does involve parsing it, but only until
it sees mixed content. If elements are assumed to be ELEMENT until
proven otherwise, surely this wouldn't be a massive overhead. Authoring
applications would be similar - the first time a tag contained mixed
content, the application would reset the status of the element. The onus
would from then on be on the application to assist the user in creating
semantically correct documents by such mechanisms as not allowing hard
returns at element boundaries; in short, by making significant
whitespace look like significant whitespace.
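
To make the receiving case concrete, here is a rough sketch in Python. It
is only an illustration under my own assumptions: the function name
classify_content is made up, it uses the standard xml.etree.ElementTree
module, and for brevity it reads the whole document rather than stopping
at the first mixed content it sees. Every element type starts out as
ELEMENT and is switched to MIXED the first time non-whitespace character
data turns up directly inside it.

```python
import xml.etree.ElementTree as ET


def classify_content(xml_text):
    """Map each element name to 'ELEMENT' or 'MIXED' (hypothetical helper).

    An element type is assumed to have element content until some instance
    of it is seen to contain non-whitespace character data.
    """
    status = {}
    root = ET.fromstring(xml_text)
    for elem in root.iter():
        # Assume ELEMENT until proven otherwise.
        status.setdefault(elem.tag, "ELEMENT")
        # Character data directly inside elem is its own leading text
        # plus the tail text that follows each of its children.
        chunks = [elem.text or ""] + [child.tail or "" for child in elem]
        if any(chunk.strip() for chunk in chunks):
            status[elem.tag] = "MIXED"
    return status


doc = "<doc>\n  <sec>\n    <p>Some <b>bold</b> text</p>\n  </sec>\n</doc>"
print(classify_content(doc))
```

Here doc and sec contain nothing but whitespace between their child
elements, so they stay ELEMENT, while p and b carry real character data
and are reset to MIXED.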

MURATA Makoto wrote:

> Suppose that we have different kinds of tags for mixed-content
> elements (e.g, <name:mixed> and </name:mixed>) and element-content
> elements (e.g, <name:element> and </name:element>).  Then, even
> non-validating parsers can tell element contents and mixed contents.
> Does this help?

It seems that the choices are either the current proposal that nobody
seems to feel is entirely satisfactory, or suggestions such as the
above, which would certainly work, but ultimately may involve as great
an overhead as sending the DTD. It seems to me that we're throwing the
baby out with the bathwater by ignoring a solution such as declaring at
the start of the document how whitespace in elements should be handled.

I would also like to see DTDs sent to non-validating parsers, just so
they could determine how to apply whitespace rules without necessarily
having to do any structural parsing. If need be, two new types of
declared content could be added, ELEMENT and MIXED. They might behave
the same way as ANY, or the DTD could be constructed even more loosely,
where only MIXED elements were declared and everything else was
defaulted to ELEMENT. This would result in a small DTD sent only for the
sake of making the application aware of how to deal with whitespace. If
desirable, no DTD need be sent, but the application's performance may
suffer marginally for it. This is in keeping with the idea that an
application need not know how to deal with a document as it comes in. As
far as I can see, much of the functionality in XML (such as linking)
relies on a DTD, so it's not going to be foreign to most XML
applications anyway.
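
As a rough sketch of what an application could do with such a minimal
declaration, suppose the names of the MIXED elements have already been
read from it (the declaration syntax itself is left open here, and the
function name and the mixed_names parameter below are my own inventions).
Everything not listed defaults to ELEMENT, and whitespace-only text in
those elements is discarded without any structural validation:

```python
import xml.etree.ElementTree as ET


def strip_insignificant_whitespace(xml_text, mixed_names):
    """Drop whitespace-only text inside elements not declared MIXED.

    mixed_names is a set of element names assumed to come from a small
    declaration sent with the document; all other elements are treated
    as having element content.
    """
    root = ET.fromstring(xml_text)
    for elem in root.iter():
        if elem.tag in mixed_names:
            # Whitespace in mixed content is significant; leave it alone.
            continue
        # Whitespace before the first child element.
        if elem.text is not None and not elem.text.strip():
            elem.text = None
        # Whitespace after each child element, still inside elem.
        for child in elem:
            if child.tail is not None and not child.tail.strip():
                child.tail = None
    return ET.tostring(root, encoding="unicode")


doc = "<doc>\n  <p>Some <b>bold</b> <i>text</i></p>\n</doc>"
print(strip_insignificant_whitespace(doc, {"p", "b", "i"}))
```

The line breaks and indentation inside doc disappear, but the single
space between the b and i elements survives because p was declared MIXED.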

The whitespace rules in SGML can be simplified - most people accept that
they should be. Because inclusions and exclusions aren't valid in XML
anyway, the rules are already somewhat simpler. I would really like to
see XML and SGML stay in synch - I think anything else would be to
everyone's disadvantage. There really isn't a lot of point in flaming me
for this; the question is well intentioned and the current solution
seems to have satisfied few. The concept of declaring things at the
start is a tried and true methodology, yet we seem to be fleeing it in
favor of something nobody's quite sure about.


Marcus Carr                  email:  mrc at allette.com.au
Allette Systems (Australia)  email:  info at allette.com.au
Level 10, 91 York Street     www:    http://www.allette.com.au
Sydney 2000 NSW Australia    phone:  +61 2 9262 4777
                             fax:    +61 2 9262 4774

xml-dev: A list for W3C XML Developers
Archived as: http://www.lists.ic.ac.uk/hypermail/xml-dev/
To unsubscribe, send to majordomo at ic.ac.uk the following message;
unsubscribe xml-dev
List coordinator, Henry Rzepa (rzepa at ic.ac.uk)
