REC-xml-19980210: whitespace
Richard Emberson
emberson at faslab.com
Thu Oct 15 23:27:24 BST 1998
John Cowan wrote:
> David Brownell wrote:
>
> > Moreover, 2.10 says (albeit oddly) that CRLF gets normalized
> > to LF everywhere, and the LF would get normalized to a single
> > space inside of an attribute (or Public Identifier) value.
>
> Hmmm. Real CRLFs get normalized to LFs, but does that apply
> to the appearance of "&#x0D;&#x0A;"? I think not.
> In attribute values, however,
At a minimum, Section 3.3.3 Attribute-Value Normalization
and the other places where white space is discussed seem a little
short of exact.
In Section 3.3.3, bullet #1:
"a character reference is processed by appending the referenced
character to the attribute value"
None of the other bullets deals with character references. The
value of the character reference is appended to the attribute value.
It's part of the attribute value.
Note:
When do characters in the attribute value get put into the
normalized value? Bullet #4 states that "other characters
are processed by appending them to the normalized value."
It seems that character references are never explicitly
transferred from the attribute value to the normalized value.
What happens when the character reference is: #x20?
If the attribute is CDATA, then nothing.
If the attribute is not CDATA, then
if the #x20 is leading or trailing, then it's stripped, else
if the #x20 is part of a sequence of #x20's, then only
one #x20 takes the place of the sequence.
What happens when the character reference is: #x09?
Nothing happens because the bullet #3 does not apply to
character references.
What happens when the character reference is: #x0A?
Nothing happens because the bullet #3 does not apply to
character references.
What happens when the character reference is: #x0D?
Nothing happens because the bullet #3 does not apply to
character references.
This implies that the sequence "&#x0D;&#x0A;" is not converted into a #x20
(this was noted by John Cowan above). Section 2.11 End-of-Line Handling
does not apply, since a single character reference cannot contain both
#x0D and #x0A.
So, it is possible for normalized attribute values (of type CDATA and
not CDATA) to contain sequences of #x20s and the character sequence
#x0D#x0A. If this is not the intent of the spec, then Section 3.3.3
needs a little work.
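
To make the literal reading above concrete, here is a minimal Python
sketch. It is not taken from the spec or from any parser: the names
normalize_literal and CHAR_REF are hypothetical, entity references are
not expanded (that would need a DTD), and the input is assumed to have
had its literal line ends handled already per Section 2.11. The final
trimming step follows the closing paragraph of Section 3.3.3, which
applies it to attributes whose declared type is not CDATA.

```python
import re

# Matches a decimal or hexadecimal character reference, e.g. &#10; or &#xA;
CHAR_REF = re.compile(r"&#(?:x([0-9A-Fa-f]+)|([0-9]+));")
WHITESPACE = "\x20\x0d\x0a\x09"


def normalize_literal(raw, is_cdata=True):
    """Literal reading of Section 3.3.3 (hypothetical sketch).

    Characters produced by character references are appended as-is;
    only whitespace typed directly in the value becomes #x20.
    """
    out = []
    pos = 0
    while pos < len(raw):
        m = CHAR_REF.match(raw, pos)
        if m:
            # Bullet 1: append the referenced character, no further processing.
            code = int(m.group(1), 16) if m.group(1) else int(m.group(2))
            out.append(chr(code))
            pos = m.end()
        elif raw[pos] in WHITESPACE:
            # Bullet 3: a literal whitespace character becomes #x20.
            out.append(" ")
            pos += 1
        else:
            # Bullet 4: every other character is copied through.
            out.append(raw[pos])
            pos += 1
    value = "".join(out)
    if not is_cdata:
        # Trim leading/trailing #x20 and collapse runs of #x20 only.
        value = re.sub(" +", " ", value.strip(" "))
    return value
```

Under these assumptions, normalize_literal("a&#x0D;&#x0A;b") keeps a
carriage return and line feed between "a" and "b", while the same value
with a literally typed line feed comes back with a single #x20.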
I do not know if this is what was really desired by the authors of the
spec, but (at least to me) that's what the spec says.
If even whitespace that comes from character references is to be
processed/normalized, I would recommend a two-step description: first a
value is created by appending the characters named by character
references, recursively appending the replacement text of entity
references, and simply appending the other characters; then the
CDATA/non-CDATA normalization takes place on that value. That way the
sequence "&#x0D;&#x0A;" will be normalized, again assuming that that's
what the spec is trying to say.
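
The two-step description proposed above might look like the following
Python sketch. Everything here is an assumption made for illustration:
the function names are hypothetical, entity references are left
unexpanded because no DTD is available, and a #x0D#x0A pair is treated
as a single #x20, which is only one possible interpretation of
"normalized".

```python
import re

# Matches a decimal or hexadecimal character reference, e.g. &#10; or &#xA;
CHAR_REF = re.compile(r"&#(?:x([0-9A-Fa-f]+)|([0-9]+));")


def expand_references(raw):
    """Step 1: build the full value by resolving character references."""
    def to_char(m):
        return chr(int(m.group(1), 16) if m.group(1) else int(m.group(2)))
    return CHAR_REF.sub(to_char, raw)


def normalize_two_step(raw, is_cdata=True):
    """Step 2 runs on the expanded value, so referenced whitespace is
    treated exactly like whitespace typed directly in the value."""
    value = expand_references(raw)
    # Assumption: a CR LF pair yields one #x20; any other single
    # whitespace character yields one #x20 as well.
    value = re.sub(r"\r\n|[ \r\n\t]", " ", value)
    if not is_cdata:
        # Non-CDATA: trim leading/trailing #x20 and collapse runs of #x20.
        value = re.sub(" +", " ", value.strip(" "))
    return value
```

With this ordering, normalize_two_step("a&#x0D;&#x0A;b") gives "a b",
which is the behaviour the literal reading of the bullets does not
produce.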
I am not out to criticize the spec or its authors; I'm just trying to
build a validating parser, and not being an SGML-techie and not having
partaken in the year+ W3C XML spec development process, I only have
the spec to go on. I have the feeling that, since there is so much
semantics, not just syntax, in the spec, someone like Guy Steele,
for example, ought to take a pass at it.
Richard Emberson