Attribute value normalization

Wed May 27 13:35:39 BST 1998

> While translating the XML specification, I find that I do not understand 
> the attribute normalization mechanism of XML.

The result produced by RXP and LT-XML is given at the end (except that
carriage return characters have been replaced by the sequence ^M for
ease of reading).  Here is my explanation for each case.  The relevant
section of the standard is of course 3.3.3.

> <test a="
> 
> test
> 
> test
> 
> "/>

In this case, the linefeeds (or whatever record boundaries are in your
system) are replaced by spaces.  Then, because the attribute is not of
type CDATA, the leading and trailing spaces are removed and each
remaining run of spaces is compressed to a single space.  So the result is

  <test a="test test"/>

This is of course the intended way for NMTOKENS to work.
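
As a rough illustration of those two steps (a sketch only, not the RXP
or LT-XML code; the function name normalize_tokenized is made up here,
and the input is assumed to contain no references at all):

```python
def normalize_tokenized(value: str) -> str:
    """Normalise a reference-free attribute value of non-CDATA type."""
    # Step 1: every literal tab, linefeed or carriage return becomes a space.
    spaced = value.translate({0x9: " ", 0xA: " ", 0xD: " "})
    # Step 2: drop leading/trailing spaces and collapse runs of spaces.
    # Splitting on " " (not on all whitespace) mirrors the rule that only
    # #x20 characters are stripped and compressed at this stage.
    return " ".join(part for part in spaced.split(" ") if part)


print(repr(normalize_tokenized("\n\ntest\n\ntest\n\n")))
```

For the first test case this prints 'test test'.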

> <test a="&D;&A;&D;&A;test&D;&A;&D;&A;test&D;&A;&D;&A;"/>
> <test a="&DA;&DA;test&DA;&DA;test&DA;&DA;"/>

In these cases the character references were expanded (into carriage
returns and linefeeds) when the general entities were defined.  So
when the replacement text of the entities is "recursively processed",
they get turned into spaces.  These spaces then get stripped or
compressed, producing the same result as the first case.

[However, if the attribute were of type CDATA, the result would be
different from the first case: each run of whitespace would come out
as 4 spaces instead of 2, because the cr/lf pairs in the first case
were reduced to linefeeds (probably on input, see section 2.11),
whereas in the second case they are not part of the *literal* entity
value of the internal entity.]

> <test a="&#xD;&#xA;&#xD;&#xA;test&#xD;&#xA;&#xD;&#xA;test&#xD;&#xA;&#xD;&#xA;"/>
> <test a="&#xD;&#xD;test&#xD;&#xD;test&#xD;&#xD;"/>
> <test a="&#xA;&#xA;test&#xA;&#xA;test&#xA;&#xA;"/>

In these cases, the character references are appended, but unlike the
case of general entity references the result is not recursively
processed.  So there are no space characters to normalise, and the
result is the same as if the attribute had had type CDATA - that is,
the carriage returns and linefeeds appear in the normalised value.
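
To make the difference between the two kinds of reference concrete,
here is a small sketch of the 3.3.3 rules as I read them.  It is only
an illustration, not the RXP or LT-XML implementation: the names
normalize and entities are invented for the example, the entities
mapping is assumed to hold *replacement text* (character references
already expanded at declaration time), line ends in the literal are
assumed to have been reduced to linefeeds already, and the predefined
entities and all error handling are left out.

```python
import re

# Hex character reference, decimal character reference, or entity reference.
_REF = re.compile(r"&#x([0-9A-Fa-f]+);|&#([0-9]+);|&([A-Za-z_:][\w.:-]*);")
_WHITESPACE = "\x20\x09\x0A\x0D"


def normalize(literal, entities, cdata=False):
    """Sketch of attribute value normalisation (hypothetical helper)."""
    out = []

    def process(text):
        pos = 0
        while pos < len(text):
            match = _REF.match(text, pos)
            if match:
                hex_ref, dec_ref, name = match.groups()
                if name is not None:
                    # Entity reference: recursively process replacement text.
                    process(entities[name])
                else:
                    # Character reference: append the character as it is,
                    # with no further processing.
                    code = int(hex_ref, 16) if hex_ref is not None else int(dec_ref)
                    out.append(chr(code))
                pos = match.end()
            else:
                ch = text[pos]
                # Literal whitespace becomes a space; anything else is kept.
                out.append(" " if ch in _WHITESPACE else ch)
                pos += 1

    process(literal)
    value = "".join(out)
    if not cdata:
        # Non-CDATA types: strip and compress #x20 characters only.
        value = " ".join(part for part in value.split(" ") if part)
    return value


# Replacement text of the entities D, A and DA from the test document.
entities = {"D": "\r", "A": "\n", "DA": "\r\n"}
print(repr(normalize("&DA;&DA;test&DA;&DA;test&DA;&DA;", entities)))
print(repr(normalize("&#xD;&#xD;test&#xD;&#xD;test&#xD;&#xD;", entities)))
print(repr(normalize("&DA;&DA;test", entities, cdata=True)))
```

The first call gives 'test test', the second keeps the carriage
returns ('\r\rtest\r\rtest\r\r'), and the third gives four spaces
before "test", which is the CDATA difference mentioned above.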

Here is the RXP/LT-XML output:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE test [
<!ELEMENT test (#PCDATA|test)*>
<!ATTLIST test 
        a NMTOKENS #IMPLIED>
<!ENTITY D "&#xD;"> 
<!ENTITY A "&#xA;">
<!ENTITY DA "&#xD;&#xA;">  ]>
<test>
<test a="test test"/>
<test a="test test"/>
<test a="test test"/>
<test a="^M
^M
test^M
^M
test^M
^M
"/>
<test a="^M^Mtest^M^Mtest^M^M"/>
<test a="

test

test

"/>
</test>

-- Richard

xml-dev: A list for W3C XML Developers. To post, mailto:xml-dev at ic.ac.uk
Archived as: http://www.lists.ic.ac.uk/hypermail/xml-dev/