XML and Using It With Whitespace

Peter Murray-Rust peter at ursus.demon.co.uk
Wed Jan 7 00:36:03 GMT 1998


Since no one has responded publicly, Chris, here's my take on your concerns...

At 02:06 05/01/98 -0500, Chris Smith wrote:
>
>Sorry if the subject is confusing, but it's a really concise proposal
>for getting whitespace through - guaranteed.
>
>I earlier posted a query about the behaviour of various parsers
>surrounding whitespace. I guess I'm not as hopeful as I was earlier,
>at least based on the answers I received. Thanks to those who took the
>time to reply.
>
>Essentially, I had hoped that using &#32; to replace a space would
>allow for the creation of a 'magic' difference, the same way that
>the &lt; and < are treated differently. Ideally all spaces could
>become &#32; and we could use the (invalid!) xml:space="none", leaving
>only the &#32; behind.

I am not sure this buys you anything. The &#32; is presumably occurring in
content. If it occurs in mixed content or ANY it will be emitted by the
parser as " " and would look the same as if you had put ordinary spaces in
the document. If it occurs in element content (no character data allowed as
children) then if the parser accepts it as whitespace it will be treated as
if it was a " ". [I still have my concerns as to *where* the spec
explicitly allows  " " in element content...]
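
A minimal sketch of the mixed-content case, assuming Python's standard
xml.etree.ElementTree module as the parser (an illustrative choice, not a
parser discussed in this thread). It only shows that the character reference
and the literal space reach the application as the same text:

```python
import xml.etree.ElementTree as ET

# Same mixed content, once with a character reference and once with a literal space.
with_reference = ET.fromstring("<p>one&#32;two</p>")
with_literal = ET.fromstring("<p>one two</p>")

# The parser resolves &#32; before handing text to the application,
# so the two documents are indistinguishable at this level.
same_text = with_reference.text == with_literal.text
print(repr(with_reference.text), repr(with_literal.text), same_text)
```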

[...]
>
>I think parsers can still correctly read such files. But it points to
>a more general problem. If I read such a file with a parser, how can I
>write it out again exactly (and I mean *exactly*) the way it was read? 
>If the parser doesn't indicate clearly where substitutions with
>entities were done, then I can't put them back in the file. The same
>problem occurs with empty elements. Although the XML spec wants to
>imply that <tag></tag> and <tag/> are the same, some might see them as
>the difference between a zero-length content and null content. Either
>way, if the original XML contains <tag></tag>, then that is what
>should go back out. If it later contains <tag/> then both
>references should remain different from each other and unchanged.

There has been discussion on this and my understanding is that the unequivocal
policy is that <TAG></TAG> and <TAG/> result in exactly the same events or
grove and there is NO way of distinguishing which the original document
contained. Some people regret this, but the decision is clear.
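
To illustrate, here is a small sketch that records the events reported for
each form. It assumes Python's standard xml.sax module as a stand-in for an
event-based parser; the EventRecorder class and events_for helper are names
made up for this example:

```python
import xml.sax


class EventRecorder(xml.sax.ContentHandler):
    """Collects the element and character events the parser reports."""

    def __init__(self):
        super().__init__()
        self.events = []

    def startElement(self, name, attrs):
        self.events.append(("start", name))

    def endElement(self, name):
        self.events.append(("end", name))

    def characters(self, content):
        self.events.append(("chars", content))


def events_for(document: bytes):
    """Parse a document and return the list of recorded events."""
    recorder = EventRecorder()
    xml.sax.parseString(document, recorder)
    return recorder.events


pair_events = events_for(b"<TAG></TAG>")
empty_events = events_for(b"<TAG/>")
print(pair_events == empty_events)
```

An application that sees only these events cannot tell which form the
original document used, so it cannot write the original form back out.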

>
>To wrap up the options, I'll run through the same paragraph using
>three different techniques.
>
>2....Using character entities - still my favourite, since they work in
>attributes as well. Out of all of them, this, to my eyes, looks like
>it could easily have been placed in the XML 1.0 spec without breaking
>anything else that is in the spec, simply by adding the
>xml:space="none". &spc; could be &#32; and &lf; is &#10; so no new
>entities would have to be added.

xml:space="none" is NOT allowed in the XML spec.

>
><p xml:space="none">Finally,&spc;the&spc;other&spc;idea&spc;is&spc;the
>&spc;one&spc;at&spc;the&spc;bottom&spc;-&spc;use&spc;elements&spc;for&lf;
>spaces,&spc;tabs,&spc;and&spc;lineends.&spc;&spc;There&spc;is&spc;a&spc;
>single&spc;attribute&spc;n&spc;to&spc;indicate&lf;repeat&spc;counts.</p>

Assuming that you have something like:

<!ENTITY spc " ">

Then the paragraph above will result in the same parser output as if
they had been spaces (except that it might report the internal entity events).
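
A hedged sketch of this, again assuming Python's xml.etree.ElementTree as the
parser and declaring the entity in an internal DTD subset. This particular
API hands the application plain text only, so any internal entity events are
not visible here:

```python
import xml.etree.ElementTree as ET

# The spc entity is declared in the internal subset, as in the declaration above.
with_entity = ET.fromstring(
    '<!DOCTYPE p [<!ENTITY spc " ">]>'
    "<p>Finally,&spc;the&spc;other&spc;idea</p>"
)
with_spaces = ET.fromstring("<p>Finally, the other idea</p>")

# After entity expansion the application sees identical character data.
print(with_entity.text == with_spaces.text)
```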

>
>3.....With only elements.
>
><p xml:space="none">Finally,<s/>the<s/>other<s/>idea<s/>is<s/>the<s/>
>one<s/>at<s/>the<s/>bottom<s/>-<s/>use<s/>elements<s/>for<l/>
>spaces,<s/>tabs,<s/>and<s/>lineends.<s n="2"/>There<s/>is<s/>a<s/>single
><s/>attribute<s/>n<s/>to<s/>indicate<l/>repeat<s/>counts.</p>

If you really care about every character this is a reasonable way of doing
it, but it will generate a large number of events or (in a tree) require a
lot of nodes to be created. Both will impact performance.
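
As a rough sketch of what the application then has to do, the following
rebuilds the text from the <s/> and <l/> elements and counts the extra nodes.
It assumes Python's xml.etree.ElementTree; the rebuild_text helper and the
rule that all literal whitespace is discarded are assumptions made for this
example (standing in for the proposed xml:space="none"), not anything defined
by the spec:

```python
import re
import xml.etree.ElementTree as ET

# Convention from the quoted example: <s/> is a space, <l/> a line end,
# and the attribute n is a repeat count (taken to default to 1).
WHITESPACE_ELEMENTS = {"s": " ", "l": "\n"}


def strip_literal_whitespace(text):
    # Assumption: every wanted space is marked up, so literal whitespace is noise.
    return re.sub(r"\s+", "", text or "")


def rebuild_text(paragraph):
    """Turn the whitespace elements back into characters."""
    parts = [strip_literal_whitespace(paragraph.text)]
    for child in paragraph:
        char = WHITESPACE_ELEMENTS.get(child.tag, "")
        parts.append(char * int(child.get("n", "1")))
        parts.append(strip_literal_whitespace(child.tail))
    return "".join(parts)


source = (
    '<p>spaces,<s/>tabs,<s/>and<s/>lineends.<s n="2"/>There<s/>is\n'
    "<s/>a<l/>repeat<s/>count.</p>"
)
paragraph = ET.fromstring(source)
text = rebuild_text(paragraph)
node_count = len(paragraph)  # one child node per marked-up whitespace run
print(repr(text), node_count)
```

Even this nine-word fragment needs eight child nodes where plain character
data would need none, which is the cost I mean.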

Part of the problem arises from the requirement (which I strongly support)
that "XML documents should be human legible and reasonably clear". In some
cases something has to be sacrificed and it looks like you are happy to let
this one go...

>
>Clearly, you must have the DTD to make sense of the last one! However,
>I see a rather interesting side-effect, namely that this one could
>likely be added using a namespace. (Tangent: any parsers experimenting
>with namespaces?)

Parsers are NOT allowed to experiment with namespaces :-).  Parsers must
recognise ":" as a valid name character. That's all.
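
A small sketch of that minimal requirement, assuming Python's xml.sax with
namespace processing left off (its default). The "vsms:" prefix and the
element names are invented for the example; the point is only that names
containing ":" are accepted and passed through unchanged:

```python
import xml.sax


class NameRecorder(xml.sax.ContentHandler):
    """Records element names exactly as the parser reports them."""

    def __init__(self):
        super().__init__()
        self.names = []

    def startElement(self, name, attrs):
        self.names.append(name)


recorder = NameRecorder()
# No namespace declaration is present; the colon is just a name character here.
xml.sax.parseString(b"<vsms:molecule><vsms:atom/></vsms:molecule>", recorder)
print(recorder.names)
```

What the prefix means is then up to the application, not the parser.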

Humans can experiment with namespaces.  So can applications.  PaulG has
pointed out that the latest namespace proposal is confidential, so
discussion of that is inappropriate. However, going on the information in
the public domain (e.g. the RDF draft) JUMBO has implemented a namespace
experiment.  For what you are doing, I suspect stylesheets would be more
valuable.
>
>In summary, the distinction is, as a reply noted, between "wanted" 
>whitespace and "unwanted" whitespace. The XML specification wants to
>leave it to the application because there are far more 'whitespace
>convention sets' than it is desirable to put in the spec. However,
>there are far more applications than there are 'whitespace convention
>sets', and the application designer wants to pick one, not reinvent
>the wheel. 

I fully agree with this, and if no one else makes proposals... But we need
to concentrate on SAX at the moment.
>

Peter Murray-Rust, Director Virtual School of Molecular Sciences, domestic
net connection
VSMS http://www.nottingham.ac.uk/vsms, Virtual Hyperglossary
http://www.venus.co.uk/vhg

xml-dev: A list for W3C XML Developers. To post, mailto:xml-dev at ic.ac.uk
Archived as: http://www.lists.ic.ac.uk/hypermail/xml-dev/
To (un)subscribe, mailto:majordomo at ic.ac.uk the following message;
(un)subscribe xml-dev
To subscribe to the digests, mailto:majordomo at ic.ac.uk the following message;
subscribe xml-dev-digest
List coordinator, Henry Rzepa (mailto:rzepa at ic.ac.uk)



