SAX needs from our point of view

Fri Apr 24 03:06:38 BST 1998

Michael Amster writes:

 > In our case, having embedded XML languages with our own language
 > controlling flow of execution, we have a real need for an accurate
 > reproduction of the XML elements parsed so they can be rewritten
 > correctly.

SAX reports all elements, together with character data, ignorable
whitespace, and processing instructions, so you won't lose anything
there.
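
As a rough illustration of what gets reported, here is a minimal
sketch using Python's standard xml.sax module as a stand-in for a SAX
driver.  The names EventLogger and log_events, and the sample
document, are made up for this example and are not part of SAX itself:

```python
import xml.sax


class EventLogger(xml.sax.ContentHandler):
    """Hypothetical handler that records every event it is given."""

    def __init__(self):
        super().__init__()
        self.events = []

    def startElement(self, name, attrs):
        # Copy the attributes into a plain dict so they can be compared later.
        self.events.append(("start", name, dict(attrs.items())))

    def endElement(self, name):
        self.events.append(("end", name))

    def characters(self, content):
        # A parser may deliver one run of text in several chunks.
        self.events.append(("chars", content))

    def processingInstruction(self, target, data):
        self.events.append(("pi", target, data))


def log_events(document: bytes):
    """Parse the document and return the list of recorded events."""
    handler = EventLogger()
    xml.sax.parseString(document, handler)
    return handler.events


# Sample document invented for this sketch.
events = log_events(
    b'<WEIF COND="true"><?note keep me?><WETHEN>text</WETHEN></WEIF>'
)
```

Elements, attributes, character data, and the processing instruction
all show up in the event list, in document order.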

 >  Specifically, the issue is important in distinguishing between text and
 > CDATA.  Let me illustrate with a simple example:
 > 
 > <WEIF COND="true">
 > 	<WETHEN>
 > 		<ARBITRARYXML/>
 > 		<![CDATA[
 > 			This is data with &references; which should not be parsed!
 > 		]]>
 > 		<MOREXML>
 > 			This is just text
 > 		</MOREXML>
 > 	</WETHEN>
 > </WEIF>
 > 
 > When this is reported up from a SAX parser, we do not differentiate between
 > text and the CDATA, but let's say that we want to output the subset of
 > arbitrary XML back out from our DOM or other object structure:
 > 
 > 		<ARBITRARYXML/>
 > 			This is data with &references; which should not be parsed!
 > 		<MOREXML>
 > 			This is just text
 > 		</MOREXML>

Your output routine is wrong: it should automatically escape all
instances of '&', '<', and '>':

  <ARBITRARYXML/>
   This is data with &amp;references; which should not be parsed!
  <MOREXML>
   This is just text
  </MOREXML>

or even

  <ARBITRARYXML/>
   This is data with &#x26;references; which should not be parsed!
  <MOREXML>
   This is just text
  </MOREXML>
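
A minimal sketch of such an output routine, assuming Python and its
standard xml.sax.saxutils.escape helper (the function name write_text
is hypothetical; any routine that escapes character data on the way
out would do):

```python
from xml.sax.saxutils import escape


def write_text(text: str) -> str:
    """Return character data in a form that is safe to write into XML.

    escape() replaces '&', '<', and '>' with '&amp;', '&lt;', and '&gt;'.
    """
    return escape(text)


raw = "This is data with &references; which should not be parsed!"
escaped = write_text(raw)
```

Because the ampersand is escaped on output, a parser reading the
result back sees the literal text "&references;" rather than an entity
reference, whether or not the original came from a CDATA section.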

 > Now you see that the CDATA will have all references made when it is
 > reparsed.  We really do want to preserve CDATA as different from
 > text in SAX.

If there's a semantic attached to your use of CDATA, you should
represent it with an element (which is guaranteed to make it through
processing); the two forms below then carry exactly the same
information:

  <listing><![CDATA[
    Here is a listing: 1 < 2
  ]]></listing>

  <listing>
    Here is a listing: 1 &lt; 2
  </listing>
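
The equivalence can be checked with a short sketch, again assuming
Python's standard xml.sax module as a stand-in SAX driver (the names
TextCollector and text_of are hypothetical):

```python
import xml.sax


class TextCollector(xml.sax.ContentHandler):
    """Hypothetical handler that gathers all reported character data."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def characters(self, content):
        # Text may arrive in several pieces; keep them all in order.
        self.chunks.append(content)


def text_of(document: bytes) -> str:
    """Parse the document and return its character data as one string."""
    handler = TextCollector()
    xml.sax.parseString(document, handler)
    return "".join(handler.chunks)


with_cdata = b"<listing><![CDATA[\n    Here is a listing: 1 < 2\n  ]]></listing>"
with_escape = b"<listing>\n    Here is a listing: 1 &lt; 2\n  </listing>"

# Both spellings yield the same character data for the element.
same = text_of(with_cdata) == text_of(with_escape)
```

An application that keys on the listing element gets identical content
either way, so nothing depends on which spelling the author used.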

There is no need for general XML processing tools _ever_ to know about
CDATA sections; authoring and repository tools (including tools for
authoring transforms) might want to preserve them, but those fall
outside the target audience for SAX level 1.

Think of the analogy of C: the preprocessor takes care of surface
things like macros and hides them from the compiler, which produces
exactly the same object code for

  #define FOO 1

  printf("%d", FOO + FOO);

and

  printf("%d", 1 + 1);

All the best, and thanks for the comments,

David

-- 
David Megginson                 ak117 at freenet.carleton.ca
Microstar Software Ltd.         dmeggins at microstar.com
      http://home.sprynet.com/sprynet/dmeggins/
