REC-xml-19980210: whitespace

Thu Oct 15 23:27:24 BST 1998

John Cowan wrote:
> David Brownell wrote:
> 
> > Moreover, 2.10 says (albeit oddly) that CRLF gets normalized
> > to LF everywhere, and the LF would get normalized to a single
> > space inside of an attribute (or Public Identifier) value.
> 
> Hmmm.  Real CRLFs get normalized to LFs, but does that apply
> to the appearance of "&#xD;&#xA;"?  I think not.
> In attribute values, however, 

At a minimum, Section 3.3.3 Attribute-Value Normalization 
and the other places where white space is discussed seem a little
short of exact.

In Section 3.3.3, bullet #1:

  "a character reference is processed by appending the referenced
  character to the attribute value"

None of the other bullets deals with character references. The
value of the character reference is appended to the attribute value.
It is part of the attribute value. 

  Note:
    When do characters in the attribute value get put into the 
    normalized value? Bullet #4 states that "other characters 
    are processed by appending them to the normalized value."
    It seems that character references are never explicitly 
    transferred from the attribute value to the normalized value.

What happens when the character reference is: #x20?
  If the attribute is CDATA, then nothing.
  If the attribute is not CDATA, then 
    if the #x20 is leading or trailing, then it is stripped, else
    if the #x20 is part of a sequence of #x20's, then only
      one #x20 takes the place of the sequence.

What happens when the character reference is: #x09?
    Nothing happens, because bullet #3 does not apply to
      character references.

What happens when the character reference is: #x0A?
    Nothing happens, because bullet #3 does not apply to
      character references.

What happens when the character reference is: #x0D?
    Nothing happens, because bullet #3 does not apply to
      character references.

This implies that the sequence "&#xD;&#xA;" is not converted into a #x20
(this was noted by John Cowan above). Section 2.11 End-of-Line Handling
does not apply, since a single character reference cannot contain both
#x0D and #x0A.
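
To make that reading concrete, here is a minimal sketch in Python of
the single-pass, literal interpretation described above. The function
name normalize_literal_reading is mine, not the spec's; entity
references are not handled, and I assume literal line ends have already
been through End-of-Line Handling. It illustrates one reading only and
is not how any particular parser behaves.

```python
import re

# One token per match: a hex or decimal character reference,
# or else any single character (including newlines).
_TOKEN = re.compile(r"&#x([0-9A-Fa-f]+);|&#([0-9]+);|(.)", re.DOTALL)


def normalize_literal_reading(raw: str) -> str:
    """Single pass over the raw attribute text (hypothetical helper)."""
    out = []
    for match in _TOKEN.finditer(raw):
        hex_ref, dec_ref, ch = match.groups()
        if hex_ref is not None:
            # Bullet #1: the referenced character is appended untouched.
            out.append(chr(int(hex_ref, 16)))
        elif dec_ref is not None:
            out.append(chr(int(dec_ref)))
        elif ch in "\t\n\r":
            # Bullet #3: literal white space becomes #x20.
            out.append(" ")
        else:
            # Bullet #4: other characters are appended as they are.
            out.append(ch)
    return "".join(out)
```

Under this reading, normalize_literal_reading("a&#xD;&#xA;b") keeps the
#x0D#x0A pair, while a literal tab or line feed in the same position
becomes a #x20.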

So, it is possible for normalized attribute values (of type CDATA and
not CDATA) to contain sequences of #x20s and the character sequence
#x0D#x0A. If this is not the intent of the spec, then Section 3.3.3
needs a little work.

I do not know if this is what was really desired by the authors of the
spec, but (at least to me) that's what the spec says.

If even white space that comes from character references is to be
processed/normalized, I would recommend a two-step description. First,
a value is created by appending the characters named by character
references, recursively appending the replacement text of entity
references, and simply appending all other characters. Then the
CDATA/non-CDATA normalization takes place on that value. That way the
sequence "&#xD;&#xA;" will be normalized, again assuming that that's
what the spec is trying to say.
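
A rough sketch of that two-step description, again in Python. The names
expand_char_refs and normalize_two_step and the collapse flag are
hypothetical; recursive entity expansion is left out, and treating an
expanded #x0D#x0A pair as a single line end is my assumption about the
intent, not something the spec states.

```python
import re

_CHAR_REF = re.compile(r"&#x([0-9A-Fa-f]+);|&#([0-9]+);")


def expand_char_refs(raw: str) -> str:
    """Step 1: replace each character reference by its character."""
    def replace(match):
        hex_ref, dec_ref = match.groups()
        if hex_ref is not None:
            return chr(int(hex_ref, 16))
        return chr(int(dec_ref))
    return _CHAR_REF.sub(replace, raw)


def normalize_two_step(raw: str, collapse: bool) -> str:
    """Step 2: normalize white space in the already expanded value.

    collapse=True stands for the declared types whose values are also
    trimmed and have runs of #x20 reduced to one.
    """
    expanded = expand_char_refs(raw)
    # Assumption: a #x0D#x0A pair counts as one line end.
    expanded = expanded.replace("\r\n", "\n")
    value = re.sub(r"[\t\n\r]", " ", expanded)
    if collapse:
        value = re.sub(r" +", " ", value).strip(" ")
    return value
```

With this ordering it no longer matters whether white space was typed
literally or written as a character reference; both go through the
same second step.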

I am not out to criticize the spec or its authors; I'm just trying to
build a validating parser. Not being an SGML-techie, and not having 
partaken in the year+ W3C XML spec development process, I only have
the spec to go on. I have the feeling that, since there is so much
semantics, not just syntax, in the spec, someone like Guy Steele
ought to take a pass at it.

Richard Emberson
