Canonical Encoding for XML Elements
Chris Smith
smith at interlog.com
Fri Jan 9 02:45:22 GMT 1998
Here, as mentioned, is our process for creating a canonical form of
XML elements. Comments are welcome.
In particular, do parsers keep CDATA sections distinct from character
data?
-------------------------------------------
Canonical Encoding Format for XML
The canonical format of an XML element is created by first
deriving the logical content and structure of the underlying
XML document by parsing it, and then generating the canonical
physical form of the element based on the logical structure
using the process defined below.
For the XML element being generated or any of its child
elements:
* convert all characters in the element to [UTF16] format.
* expand all external entities and all character and entity
references in the element so that they are completely resolved,
* exclude comments and processing instructions (PIs),
* reduce all attributes to their canonical form using the
attribute type in the DTD. Replace all single and double
quotes present in attributes with &apos; and &quot; respectively
so that attributes can be enclosed in double quotes
* create any attributes that are not present in the original
but have default values in the DTD, using those default values
* sort the original and generated attributes in ascending
attribute name order according to the UTF-16 encoding of the
attribute name (i.e. not the native character ordering)
* for whitespace inside markup but not inside attribute
values, generate it as minimally as possible. Specifically:
- remove non-essential whitespace, and
- represent required whitespace by a single space character
* generate the content of all start tags using only the
element name and the attributes as described above. If the
element is an "empty" element then generate it using the
single empty tag format, with a trailing slash. Generate end
tags using only the element name, with no added whitespace.
* remove all whitespace in the element content
* keep CDATA sections as CDATA sections. Also:
- do not convert CDATA sections to character data with
character references
- convert all occurrences of the right angle bracket ">" to
&gt;
* character data that is not in CDATA sections must have all
occurrences of "<", ">", and "&" converted to &lt; &gt; and
&amp; respectively.
* start tags, end tags, empty tags, CDATA sections, and text
sections are assembled in the same order as the original
document.
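
As a rough illustration of the generation rules above, here is a minimal
Python sketch. It is not a reference implementation. It assumes the
document has already been parsed, entities resolved, DTD defaults applied
and whitespace removed, and it works on a small hypothetical in-memory
tree (the Element, Text and CData classes and all function names are
invented for this example). Two further assumptions: an element counts as
"empty" when it has no children, and "&" and "<" in attribute values are
escaped as well as quotes, which the list above does not state.

```python
from dataclasses import dataclass, field


@dataclass
class Text:
    data: str


@dataclass
class CData:
    data: str


@dataclass
class Element:
    name: str
    attrs: dict = field(default_factory=dict)
    children: list = field(default_factory=list)


def utf16_key(name: str) -> bytes:
    # Big-endian UTF-16 bytes compare in the same order as UTF-16 code units.
    return name.encode("utf-16-be")


def escape_attr(value: str) -> str:
    # Assumption: "&" and "<" are escaped too; the rules only name the quotes.
    value = value.replace("&", "&amp;").replace("<", "&lt;")
    return value.replace("'", "&apos;").replace('"', "&quot;")


def escape_text(data: str) -> str:
    # "&" goes first so the references produced below are not escaped again.
    return data.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;")


def canonical(node) -> str:
    if isinstance(node, Text):
        return escape_text(node.data)
    if isinstance(node, CData):
        # CDATA stays CDATA; ">" is rewritten exactly as the rule above says.
        return "<![CDATA[" + node.data.replace(">", "&gt;") + "]]>"
    # Attributes in ascending UTF-16 order, one space before each, double quotes.
    attrs = "".join(
        f' {name}="{escape_attr(node.attrs[name])}"'
        for name in sorted(node.attrs, key=utf16_key)
    )
    if not node.children:
        return f"<{node.name}{attrs}/>"  # single empty tag with trailing slash
    body = "".join(canonical(child) for child in node.children)
    return f"<{node.name}{attrs}>{body}</{node.name}>"


doc = Element("order", {"id": "7", "by": 'A "B" & C'}, [
    Element("item", {}, [Text("1 < 2")]),
    CData("a > b"),
    Element("void"),
])
print(canonical(doc))
```

The sort key matters only for names containing characters outside the
Basic Multilingual Plane: in UTF-16 such a character is stored as a
surrogate pair starting at D800 or above, so it sorts before characters
in the range E000 to FFFF even though its code point is higher.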
---------------------------------------------------------------------------
Chris Smith <smith at interlog.com>