CDATA by any other name... (was The raw and the cooked)

John Cowan cowan at
Wed Nov 4 17:00:00 GMT 1998

Paul Prescod wrote:

> XML DTDs are in the business of constraining people to the data models and
> data that the software is expecting/can deal with. I don't see any big
> difference between saying: "This content must be restricted to this set of
> characters" and "this content must be a NMTOKEN or base-64 encoded."

Put that way, I suppose you are right.  As I said before, this could and
should be handled as a special case of "The character data of this
element must conform to the following regular expression."
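
A rough sketch of what such a check could look like, outside any DTD
machinery.  The element name "key", the BASE64 pattern, and the function
name char_data_conforms are all made up for illustration; this is not a
proposal for actual schema syntax.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical constraint: the character data must look like base-64 text.
BASE64 = re.compile(r"[A-Za-z0-9+/\s]*={0,2}\s*")

def char_data_conforms(xml_text, tag, pattern):
    """True if the character data of every <tag> element matches pattern."""
    root = ET.fromstring(xml_text)
    return all(
        # itertext() gathers all character data inside the element.
        pattern.fullmatch("".join(el.itertext())) is not None
        for el in root.iter(tag)
    )

print(char_data_conforms("<msg><key>SGVsbG8=</key></msg>", "key", BASE64))
print(char_data_conforms("<msg><key>not base-64!</key></msg>", "key", BASE64))
```

A restriction to a particular set of characters is then just another
pattern, e.g. a character class, handed to the same function.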

> Nevertheless, this is clearly a schema problem and CDATA sections seem to
> me to be a really bad tool for enforcing this distinction.

Particularly because it would mean that the charset of an XML document
would become part of its schema: a document in US-ASCII can have
only ASCII in its CDATA sections (character references are not
recognized there), but if it were transcoded to Shift_JIS, then it
could have any JIS X 0208 character in the CDATA section.
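
To make the dependency concrete, here is a small sketch that asks whether
a given string could sit inside a single CDATA section of a document
stored in a given encoding.  The function name is invented, and it
leans on Python's own "ascii" and "shift_jis" codecs, which I am assuming
are close enough to the charsets in question for the purpose.

```python
def fits_in_cdata(text, encoding):
    """True if text could appear literally in one CDATA section of a
    document stored in the given encoding."""
    if "]]>" in text:
        # The end delimiter cannot occur inside a CDATA section.
        return False
    try:
        # No character references in CDATA, so every character must be
        # directly representable in the document's encoding.
        text.encode(encoding)
    except UnicodeEncodeError:
        return False
    return True

print(fits_in_cdata("日本語", "ascii"))
print(fits_in_cdata("日本語", "shift_jis"))
```

The same content gets a different answer depending only on the encoding
argument, which is the sense in which the charset leaks into the schema.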

This means that transcoding arbitrary XML documents *requires*
parsing them: if you are reducing the repertoire, you may need
to break up CDATA sections, and you cannot (?) recognize a
CDATA section reliably without parsing.  (In particular, what
looks like a CDATA section start/end could appear inside an attribute
value, PI data, or a comment.)  An interesting side effect!

John Cowan		cowan at
	You tollerday donsk?  N.  You tolkatiff scowegian?  Nn.
	You spigotty anglease?  Nnn.  You phonio saxo?  Nnnn.
		Clear all so!  'Tis a Jute.... (Finnegans Wake 16.5)
