Mixed content considered harmful...

Paul Prescod paul at prescod.net
Tue May 11 01:33:18 BST 1999


XML Schema Part 1 seems to import a mistake from SGML and XML. This is the
idea that content models must be either text-containing ("mixed") or
element-containing, and that the former sort of model must not constrain
the ordering of elements and text nodes.

"A content model for mixed content provides for mixing elements with
character data in document instances. The allowed element types are named,
but neither their order or their number of occurrences are constrained."

SGML had a separation between mixed and element-containing models, but it
did not forbid constraining the order and occurrence of text nodes and
element nodes. #PCDATA was just a token and you could use it where you
wanted.

What it did have was a massive bug in its parsing algorithm that made
these "constrained" mixed content models impossible to use. The bug had
nothing to do with validation -- it was a parser problem.

There sprang up a superstition that these mixed content models were evil
when the truth is that the particular bug in SGML was the real problem.
Before it was clear that we could change SGML, XML adopted a ridiculously
confusing rule about the use of mixed content. It didn't occur to me (or
probably anyone else) that it would have been better to just fix the bug.
We probably didn't know at that point that we had that option.

Now this rule has been imported into XML Schema. The rule is even more out
of place in XML Schema than it was in XML itself. Then we had the
opportunity to fix the bug. Today the bug is not even relevant -- XML
Schema works on the result of the parse; it does not influence the
parse.

#PCDATA is just a data type that is unconstrained. You should be able to
mix data type refs, #PCDATA and element type refs in content models with
impunity (barring real parsing ambiguity). Using old syntax:

<!ELEMENT SECTION (#PCDATA, P+)>
<!ELEMENT FIG (#PCDATA|IMG)>
<!ELEMENT HTML (TITLE,(#PCDATA|P)+)>
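
To make the point concrete, here is a minimal sketch in Python of checking
models like these against content that has already been parsed. It is only
an illustration under stated assumptions: the helper names (child_tokens,
model_to_regex, matches) are hypothetical, #PCDATA is assumed to match
zero or more characters, whitespace-only text is ignored, and only the
",", "|", "+", "*" and "?" operators are handled. Each text run becomes a
"#PCDATA" token, so it can be ordered and counted like any element name.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical helpers for illustration only; none of these names come
# from SGML, XML or the XML Schema draft.

def child_tokens(elem):
    """Flatten parsed content into tokens: '#PCDATA' or an element name."""
    tokens = []
    if elem.text and elem.text.strip():
        tokens.append("#PCDATA")
    for child in elem:
        tokens.append(child.tag)
        # Text that follows a child element is stored on that child's tail.
        if child.tail and child.tail.strip():
            tokens.append("#PCDATA")
    return tokens

# A model is made of names (including #PCDATA), separators and operators.
_MODEL_PART = re.compile(r"(?P<name>#?[A-Za-z_][\w.-]*)|[,\s]+")

def model_to_regex(model):
    """Translate DTD-style model syntax into a regex over 'NAME ' tokens."""
    def translate(m):
        name = m.group("name")
        if name is None:
            return ""  # ',' (sequence) becomes plain concatenation
        if name == "#PCDATA":
            return "(?:#PCDATA )?"  # assumption: may match no text at all
        return "(?:" + re.escape(name) + " )"
    # Parentheses, '|', '+', '*' and '?' pass through as regex operators.
    return _MODEL_PART.sub(translate, model)

def matches(model, elem):
    """True if elem's parsed children satisfy the content model."""
    stream = "".join(tok + " " for tok in child_tokens(elem))
    return re.fullmatch(model_to_regex(model), stream) is not None

section = ET.fromstring("<SECTION>Intro<P>one</P><P>two</P></SECTION>")
print(child_tokens(section))
print(matches("(#PCDATA, P+)", section))
```

With this sketch the SECTION above is accepted by (#PCDATA, P+), while a
SECTION with text after its last P is rejected. The check runs entirely
on the parsed tree and never feeds back into the parser.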

You can handle any of these with wrappers but I claim that the instinct to
wrap these things arises more from exposure to the superstition than from
fundamental design considerations. We can make XSchema more uniform by
removing the concept of "mixed content" and by introducing a PCDATA
content token type.
-- 
 Paul Prescod  - ISOGEN Consulting Engineer speaking for only himself
 http://itrc.uwaterloo.ca/~papresco

And so, in one of history's little ironies, the global triumph of bad
software in the age of the PC was reversed by a surprising combination
of forces: the social transformation initiated by the network, a
long-discarded European theory of political economy, and a small band
of programmers throughout the world mobilized by a single simple idea. 
 - http://old.law.columbia.edu/my_pubs/anarchism.html

xml-dev: A list for W3C XML Developers. To post, mailto:xml-dev at ic.ac.uk
Archived as: http://www.lists.ic.ac.uk/hypermail/xml-dev/ and on CD-ROM/ISBN 981-02-3594-1
List coordinator, Henry Rzepa (mailto:rzepa at ic.ac.uk)



