Canonical Encoding for XML Elements

Fri Jan 9 02:45:22 GMT 1998

Here, as mentioned, is our process for creating a canonical form of
XML elements. Comments are welcome.

In particular, do parsers keep CDATA sections distinct from character
data?

-------------------------------------------

Canonical Encoding Format for XML

The canonical format of an XML element is created by first
parsing the underlying XML document to derive its logical
content and structure, and then generating the canonical
physical form of the element from that logical structure
using the process defined below.

For the XML element being generated or any of its child
elements:

*  convert all characters in the element to [UTF16] format

*  expand all external entities and all character and entity
   references in the element so that they are completely resolved

*  exclude comments and processing instructions (PIs)

*  reduce all attributes to their canonical form using the
   attribute type in the DTD. Replace all single and double
   quotes present in attribute values with &#39; and &#34;
   respectively so that attribute values can be enclosed in
   double quotes

*  create, using the default value, any attribute that is not
   present in the original but has a default value in the DTD

*  sort the original and generated attributes in ascending
   attribute name order according to the UTF-16 encoding of the
   attribute name (i.e. not the native character ordering)

*  for whitespace inside markup but not inside attribute
   values, generate it as minimally as possible. Specifically:
   -  remove non-essential whitespace, and
   -  represent required whitespace by a single space character

*  generate the content of all start tags using only the
   element name and the attributes as described above. If the
   element is an "empty" element then generate it using the
   single empty tag format, with a trailing slash. Generate end
   tags using only the element name, with no added whitespace.

*  remove all whitespace in the element content

*  keep CDATA sections as CDATA sections. Also:
   -  do not convert CDATA sections to character data with
      character references
   -  convert all occurrences of the right angle bracket ">" to
      &#62;

*  character data that is not in CDATA sections must have all
   occurrences of "<", ">", and "&" converted to &#60;, &#62;, and
   &#38; respectively.

*  start tags, end tags, empty tags, CDATA sections, and text
   sections are assembled in the same order as the original
   document.
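
For illustration only, a few of the generation steps above (attribute
ordering, quote replacement in attribute values, tag generation, and
escaping of character data) might be sketched in Python roughly as
follows. The function names are hypothetical and not part of the
proposal. The sketch assumes that a DTD-aware parser has already
normalized the attribute values and supplied the defaults, and it does
not attempt CDATA sections or whitespace handling in content.

```python
def utf16_key(name):
    """Sort key: the UTF-16 code units of the name.

    Big-endian bytes without a byte order mark compare in the same
    order as the 16-bit code units themselves.
    """
    return name.encode("utf-16-be")


def escape_attr(value):
    """Replace single and double quotes so the value fits in double quotes."""
    return value.replace("'", "&#39;").replace('"', "&#34;")


def escape_text(text):
    """Escape character data that is not inside a CDATA section.

    "&" is handled first so that the ampersands introduced by the
    other two replacements are not escaped a second time.
    """
    return (text.replace("&", "&#38;")
                .replace("<", "&#60;")
                .replace(">", "&#62;"))


def start_tag(name, attrs, empty=False):
    """Build a start tag (or empty tag) with sorted attributes.

    attrs is a dict of attribute name -> already-normalized value.
    A single space separates the name and each attribute.
    """
    parts = [name]
    for key in sorted(attrs, key=utf16_key):
        parts.append('%s="%s"' % (key, escape_attr(attrs[key])))
    return "<" + " ".join(parts) + ("/>" if empty else ">")


def end_tag(name):
    """Build an end tag from the element name only, with no whitespace."""
    return "</" + name + ">"


if __name__ == "__main__":
    print(start_tag("item", {"b": 'say "hi"', "a": "it's"}))
    print(escape_text("a < b && c > d"))
    print(start_tag("br", {}, empty=True) + end_tag("doc"))
```

The sort key matters only for names containing characters outside the
Basic Multilingual Plane: such characters are encoded in UTF-16 as
surrogate pairs, so they sort before characters in the range U+E000 to
U+FFFF, whereas an ordering by character code would place them after.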

---------------------------------------------------------------------------
 Chris Smith                                          <smith at interlog.com>
