SAX needs from our point of view
Tyler Baker
tyler at infinet.com
Fri Apr 24 03:22:41 BST 1998
Michael Amster wrote:
> Quoting Ray Cromwell:
>
> >Ok, now that I've started a flame war and gotten that off my chest :),
> >I'd like to nominate the three biggest features I'd like in SAX Level 2
> >(or SAX2.0), in order of importance.
> >1) access to DTD information
> >2) comments, CDATA, and location information for Attributes
> >3) sax.util classes that take an ElementFactory (which return DOM
> >interfaces), and build a tree. (maybe Don Park would like to contribute
> >this). IBM's XML for Java is a starting point, but it has the fatal flaw
> >that the return values of the ElementFactory are not the DOM interfaces
> >(such as Element or PI) but IBM base classes, like TXElement or PI,
> >which means you are forced to inherit from TXElement instead of just
> >implementing Element.
>
> In our case, having embedded XML languages with our own language
> controlling flow of execution, we have a real need for an accurate
> reproduction of the XML elements parsed so they can be rewritten correctly.
> Specifically, the issue is important in distinguishing between text and
> CDATA. Let me illustrate with a simple example:
>
> <WEIF COND="true">
> <WETHEN>
> <ARBITRARYXML/>
> <![CDATA[
> This is data with &references; which should not be parsed!
> ]]>
> <MOREXML>
> This is just text
> </MOREXML>
> </WETHEN>
> </WEIF>
>
> When this is reported up from a SAX parser, we do not differentiate between
> text and the CDATA, but let's say that we want to output the subset of
> arbitrary XML back out from our DOM or other object structure:
>
> <ARBITRARYXML/>
> This is data with &references; which should not be parsed!
> <MOREXML>
> This is just text
> </MOREXML>
>
> Now you see that the references in the former CDATA content will be
> expanded when it is reparsed. We really do want to keep CDATA distinct
> from text in
> SAX. I can live without comments and to some degree, I can even reduce the
> amount of DTD info available to me, but I hope that CDATA and text are
> reported differently through the interface. It should not substantially
> complicate things for parser writers or application developers if it is
> just a Document handler event.
>
> -MA
The solution I have found for the XMLReader (formatter) I have been working on is to
scan each string of character content for characters that would otherwise need to be
escaped and, if any are found, to embed that content in a CDATA section. Algorithmically,
this operation is somewhat expensive, but for the content I have had to format, the
formatting process is still 5-10 times faster than the parsing process.
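
A minimal sketch of the scan-and-wrap idea, for illustration only. The class and
method names (CdataFormatter, needsEscaping, format) are hypothetical and are not
part of SAX or of the formatter mentioned above. It assumes the only characters
that force a CDATA section are '<' and '&', plus the sequence "]]>", which cannot
appear inside a single CDATA section and so is split across two.

```java
class CdataFormatter {

    // Hypothetical helper: true if the text contains markup-significant
    // characters ('<' or '&') or the CDATA terminator "]]>".
    static boolean needsEscaping(String text) {
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c == '<' || c == '&') {
                return true;
            }
        }
        return text.contains("]]>");
    }

    // Returns the text unchanged when it is safe to write as-is,
    // otherwise wraps it in one or more CDATA sections.
    static String format(String text) {
        if (!needsEscaping(text)) {
            return text;
        }
        // "]]>" would end the section early, so close the section after
        // "]]" and reopen a new one before ">".
        String safe = text.replace("]]>", "]]]]><![CDATA[>");
        return "<![CDATA[" + safe + "]]>";
    }
}
```

With this sketch, format("This is just text") returns the string unchanged, while
format("This is data with &references; which should not be parsed!") returns the
same text wrapped in a single CDATA section. The scan touches every character once,
which is where the extra cost comes from.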
Tyler
xml-dev: A list for W3C XML Developers. To post, mailto:xml-dev at ic.ac.uk
Archived as: http://www.lists.ic.ac.uk/hypermail/xml-dev/
To (un)subscribe, mailto:majordomo at ic.ac.uk the following message;
(un)subscribe xml-dev
To subscribe to the digests, mailto:majordomo at ic.ac.uk the following message;
subscribe xml-dev-digest
List coordinator, Henry Rzepa (mailto:rzepa at ic.ac.uk)