CDATA by any other name... (was The raw and the cooked)

Rick Jelliffe ricko at allette.com.au
Tue Nov 3 11:52:09 GMT 1998


A CDATA marked section is not only a way to prevent delimiter recognition.
It is also a way to declare that the characters in that section are limited
to ones available in the direct document encoding of the originating system.
(SGML has a CDATA keyword you can use instead of a content model. XML was
felt not to need it because you could use <![CDATA[; that perhaps shows the
mind of the XML WG at that time, in that they were down-playing the need
for schemas.) It declares "this section does not use character references
or entities or subelements". So, conceptually, it can sometimes act as
markup, not merely as a way to suppress delimiter recognition.

For example, I cannot see why a smart editor could not use the CDATA section
to confine editing to the repertoire of the character set named by the
encoding attribute of the XML header. In the case of editing the XML
specification, for example, when a CDATA marked section is being edited and
the user types "<", a smart editor should know not to replace it with "&lt;"
or expect it to be a STAGO. And if someone is editing a CDATA marked section
and uses a character that is not in the incoming encoding's repertoire, I
think it might be useful (to have an option to) warn the data-enterer. This
way, for example, it might be possible to constrain documents intended to
interact with existing non-ISO10646 systems.
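
A rough sketch of that repertoire check, in Python. The function names are
made up for illustration. It assumes the editor already knows the encoding
name from the XML declaration and that Python has a codec of that name; a
real editor would hook something like this into its input handling.

```python
def unencodable_chars(text, encoding):
    """Return the characters of `text` that `encoding` cannot represent,
    in order of first appearance and without repeats."""
    missing = []
    for ch in text:
        try:
            ch.encode(encoding)
        except UnicodeEncodeError:
            if ch not in missing:
                missing.append(ch)
    return missing


def check_cdata_edit(text, declared_encoding):
    """Return a warning string for the data-enterer, or None if every
    character fits the declared encoding's repertoire."""
    missing = unencodable_chars(text, declared_encoding)
    if not missing:
        return None
    listed = ", ".join("U+%04X" % ord(ch) for ch in missing)
    return "not in %s: %s" % (declared_encoding, listed)


if __name__ == "__main__":
    # "<" needs no escaping inside a CDATA section and is in the repertoire.
    print(check_cdata_edit("if (a < b) x = 1;", "iso-8859-1"))
    # The euro sign is outside ISO 8859-1, so the editor could warn here.
    print(check_cdata_edit("price: \u20ac5", "iso-8859-1"))
```

The check only warns; whether to block the keystroke or merely flag it is a
policy decision left to the editor.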

Of course, that might be useful, but whether it is so useful that XML should
keep to that way of thinking I don't know. I certainly don't think it is so
evil that it must be rejected outright.

(Apologies if this next part is getting too far off-topic)

> From:   Michael Kay
> Sent: Tuesday, 3 November 1998 21:43
> To: xml-dev at ic.ac.uk
> Subject: Re: CDATA by any other name... (was The raw and the cooked)
>
>
> >> marked sections actually mark up
> >> notations: at ISO there has been discussion of whether to allow something
> >> like (for example)
> >>         <![JAVA[ java code here ]]>
> >
> >While I applaud the ongoing proliferation of real Java(tm), I admit I
> >don't like that either ... <Java><![CDATA[ java code ]]></Java> has
> >worked just as well, and does no damage to XML.  (Not as pretty though!)
>
>
> Neither really works well, because "]]>" can legitimately occur in a Java
> program. For example, it is quite likely to occur in a Java program that
> generates XML.

The <![JAVA[ ]]> idea was not that JAVA would be a "CDATA marked section",
but an "RCDATA marked section", which means that character references and
entity references would be allowed. XML does not have RCDATA marked
sections, in the interests of simplicity. So writing "]]>" inside the
section by way of a character reference might have been a possibility for
SGML, but it is not for XML.
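
For what it is worth, the usual workaround in XML itself is to end the CDATA
section in the middle of each "]]>" and open a new one straight away. A
minimal Python sketch follows; the function names are hypothetical, and
xml.dom.minidom is used only to check that the text survives a round trip.

```python
from xml.dom import minidom


def wrap_cdata(text):
    """Wrap `text` in CDATA sections, splitting at every "]]>" so that the
    terminator never appears inside a single section."""
    return "<![CDATA[" + text.replace("]]>", "]]]]><![CDATA[>") + "]]>"


def roundtrip(text):
    """Put `text` inside a <Java> element, parse it, and return what the
    parser reports as the element's character data."""
    doc = minidom.parseString("<Java>" + wrap_cdata(text) + "</Java>")
    return "".join(node.data for node in doc.documentElement.childNodes)


if __name__ == "__main__":
    # A line of Java that itself generates XML and so contains "]]>".
    java = 'out.println("<![CDATA[" + s + "]]>");'
    print(wrap_cdata(java))
    print(roundtrip(java) == java)
```

This keeps the data intact, but the result is no longer one section, which
is part of why neither form is pretty for code that generates XML.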

Why have anything like this? The primary reason (apart from orthogonality)
to me is the contention that if you make element structure do too much, you
make the structure difficult to model with simple schema notations.

Think of a "wrapper" element type (this is a pattern, by the way), such as
the RDF elements. Using a foreign wrapper element in a document means that

* you will have to rewrite the content models in order to validate the
  document, or
* you have to create a more complicated schema convention, e.g.,
  ** call the existing DTD an architecture and make it external, then use
     the RDF DTD as the DTD of the current document and make dummy
     declarations with ANY content models for all the old document, or
  ** make up schema definition languages that rely on more than one level
     of context.

But if, instead of a wrapper element, you used PIs for the wrappers, then
the content model is undisturbed, and the element structure keeps its
previous simplicity and the goals of its original authors. It would be nice
if W3C allowed this, but the less that a PI can be treated (by XLL or DOM or
SAX or whatever) as a kind of element, the less that this kind of simplicity
is possible. I have little sympathy for some of the people who say content
models are inexpressive, when they deliberately choose to ignore the other
markup options available.
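
To illustrate with a toy case, the Python sketch below compares the element
children a parser reports for the same fragment marked up three ways. The
element names and the PI targets wrap-start and wrap-end are invented for
the example; they are not part of RDF or any other specification.

```python
from xml.dom import minidom

# The original structure, as its content model expects it.
PLAIN = "<sec><title>T</title><para>P</para></sec>"
# The same structure with the wrapper expressed as a pair of PIs.
WITH_PIS = ("<sec><?wrap-start rdf?><title>T</title><?wrap-end rdf?>"
            "<para>P</para></sec>")
# The same structure with a foreign wrapper element.
WITH_ELEMENT = "<sec><wrap><title>T</title></wrap><para>P</para></sec>"


def child_elements(xml_text):
    """Names of the element children of the document element, in order."""
    root = minidom.parseString(xml_text).documentElement
    return [n.tagName for n in root.childNodes
            if n.nodeType == n.ELEMENT_NODE]


def child_pi_targets(xml_text):
    """Targets of the PIs that are children of the document element."""
    root = minidom.parseString(xml_text).documentElement
    return [n.target for n in root.childNodes
            if n.nodeType == n.PROCESSING_INSTRUCTION_NODE]


if __name__ == "__main__":
    print(child_elements(PLAIN))        # what (title, para) would match
    print(child_elements(WITH_PIS))     # same element sequence
    print(child_elements(WITH_ELEMENT)) # the wrapper changes the sequence
    print(child_pi_targets(WITH_PIS))   # the wrapper is still recoverable
```

A content model such as (title, para) for sec would still accept the PI
version, while the wrapper-element version needs the model rewritten.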

Rick Jelliffe


xml-dev: A list for W3C XML Developers. To post, mailto:xml-dev at ic.ac.uk
Archived as: http://www.lists.ic.ac.uk/hypermail/xml-dev/
To (un)subscribe, mailto:majordomo at ic.ac.uk the following message;
(un)subscribe xml-dev
To subscribe to the digests, mailto:majordomo at ic.ac.uk the following message;
subscribe xml-dev-digest
List coordinator, Henry Rzepa (mailto:rzepa at ic.ac.uk)



