XML and Using It With Whitespace

Chris Smith smith at interlog.com
Mon Jan 5 07:07:57 GMT 1998

Sorry if the subject is confusing, but it's a really concise proposal
for getting whitespace through - guaranteed.

I earlier posted a query about the behaviour of various parsers
surrounding whitespace. I guess I'm not as hopeful as I was earlier,
at least based on the answers I received. Thanks to those who took the
time to reply.

Essentially, I had hoped that using &#32; to replace a space would
allow for the creation of a 'magic' difference, the same way that
&lt; and < are treated differently. Ideally all spaces could
become &#32; and we could use the (invalid!) xml:space="none", so that
literal whitespace is discarded and only the &#32; is left behind.

It appears that most of the parsers will have a tough enough time
consistently declaring ignorable whitespace in element content -
tracking where in PCDATA a &#32; became a ' ' is just not on the radar.
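
A minimal sketch of that loss of information. It uses Python's
standard xml.etree.ElementTree purely as a stand-in (an assumption; it
is not one of the parsers discussed on the list). By the time the
application sees the text, the character reference and the literal
space are the same character:

```python
import xml.etree.ElementTree as ET

# One paragraph with a literal space, one with a character reference.
literal = ET.fromstring("<p>wanted unwanted</p>").text
by_ref = ET.fromstring("<p>wanted&#32;unwanted</p>").text

# The parser expands &#32; before handing text to the application,
# so nothing records which spaces were written as references.
print(literal == by_ref)
```

An application that wants to treat the two differently has to look at
the raw stream itself, before reference expansion.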

That doesn't mean I'm abandoning the idea - the message authentication
we're doing is important enough to the application that I'm prepared
to sacrifice the use of all the parsers to get the above behaviour. It
doesn't hurt that we are likely going to have standalone applications
processing the XML stream - it's not really a file-based system.

I think parsers can still correctly read such files. But it points to
a more general problem. If I read such a file with a parser, how can I
write it out again exactly (and I mean *exactly*) the way it was read? 
If the parser doesn't indicate clearly where substitutions with
entities were done, then I can't put them back in the file. The same
problem occurs with empty elements. Although the XML spec wants to
imply that <tag></tag> and <tag/> are the same, some might see them as
the difference between a zero-length content and null content. Either
way, if the original XML contains <tag></tag>, then that is what
should go back out. If it later contains <tag/> then both
references should remain different from each other and unchanged.
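
The round-trip problem with empty elements can be sketched the same
way. Again this assumes Python's xml.etree.ElementTree as an example
parser and serializer; other tools may choose a different output form,
but the point is that the original distinction is gone once parsed:

```python
import xml.etree.ElementTree as ET

source = "<r><tag></tag><tag/></r>"
root = ET.fromstring(source)

# Both children come back with no text and no record of their syntax.
first_text, second_text = root[0].text, root[1].text

# Serializing picks one form for both, so the output differs from the input.
round_tripped = ET.tostring(root, encoding="unicode")
first, second = (ET.tostring(child, encoding="unicode") for child in root)

print(source)
print(round_tripped)
```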

To wrap up the options, I'll run through the same paragraph using
three different techniques.


Finally, the other idea is the one at the bottom - use elements for
spaces, tabs, and lineends.  There is a single attribute n to indicate
repeat counts.

2....Using character entities - still my favourite, since they work in
attributes as well. Out of all of them, this, to my eyes, looks like
it could easily have been placed in the XML 1.0 spec without breaking
anything else that is in the spec, simply by adding the
xml:space="none". &spc; could be &#32; and &lf; is &#10; so no new
entities would have to be added.

<p xml:space="none">Finally,&spc;the&spc;other&spc;idea&spc;is&spc;the
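
A rough sketch of how an application could process the
character-entity form itself, without a parser. The function names are
hypothetical, and the xml:space="none" behaviour is only the proposal
above (it is not valid XML 1.0): literal whitespace is thrown away,
then numeric character references are expanded. It uses &#32;, &#9;
and &#10; directly instead of declaring &spc; and &lf;.

```python
import re
from xml.sax.saxutils import escape

# Wanted whitespace is written as numeric character references.
_WS_TO_REF = {" ": "&#32;", "\t": "&#9;", "\n": "&#10;", "\r": "&#13;"}

def encode_wanted_whitespace(text):
    """Escape markup characters, then turn every whitespace char into a reference."""
    escaped = escape(text)  # handles &, < and >
    return "".join(_WS_TO_REF.get(ch, ch) for ch in escaped)

def decode_space_none(raw):
    """Hypothetical xml:space="none" handling for PCDATA without child elements."""
    # 1. Any literal whitespace is "unwanted" and is dropped.
    stripped = re.sub(r"[ \t\r\n]+", "", raw)
    # 2. Decimal character references are expanded (this restores wanted whitespace).
    text = re.sub(r"&#(\d+);", lambda m: chr(int(m.group(1))), stripped)
    # 3. The predefined entities produced by escape() are undone last.
    return text.replace("&lt;", "<").replace("&gt;", ">").replace("&amp;", "&")

original = "Finally, the other\tidea & <more>\n"
wire = encode_wanted_whitespace(original)
# Re-wrapping the stream with literal newlines and indentation is harmless.
reflowed = wire.replace(";", ";\n   ")
print(decode_space_none(reflowed) == original)
```

Because the literal whitespace carries no meaning, the stream can be
re-wrapped in transit without changing the text that gets
authenticated.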

3.....With only elements.

<p xml:space="none">Finally,<s/>the<s/>other<s/>idea<s/>is<s/>the<s/>
spaces,<s/>tabs,<s/>and<s/>lineends.<s n="2"/>There<s/>is<s/>a<s/>single

Clearly, you must have the DTD to make sense of the last one! However,
I see a rather interesting side-effect, namely that this one could
likely be added using a namespace. (Tangent: any parsers experimenting
with namespaces?)
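
The element-only form can be decoded with an ordinary parser, since
the wanted whitespace is now markup and not character data. A minimal
sketch, assuming Python's xml.etree.ElementTree; <s/> and its n
attribute come from the example above, while the names "t" and "lf"
for tabs and lineends are hypothetical, as is the function name:

```python
import xml.etree.ElementTree as ET

# "s" is from the example above; "t" and "lf" are hypothetical names.
WS_ELEMENTS = {"s": " ", "t": "\t", "lf": "\n"}

def _drop_literal_whitespace(text):
    """Remove all literal whitespace from a text or tail value (may be None)."""
    return "".join((text or "").split())

def decode_element_whitespace(xml_source):
    """Rebuild the text of one paragraph whose whitespace is given as elements."""
    root = ET.fromstring(xml_source)
    parts = [_drop_literal_whitespace(root.text)]
    for child in root:
        ws = WS_ELEMENTS.get(child.tag)
        if ws is not None:
            # n is the repeat count; a missing n means one character.
            parts.append(ws * int(child.get("n", "1")))
        parts.append(_drop_literal_whitespace(child.tail))
    return "".join(parts)

sample = ('<p xml:space="none">Finally,<s/>the<s/>other\n'
          '   <s/>idea.<s n="2"/>There</p>')
print(repr(decode_element_whitespace(sample)))
```

Unlike the character-entity form, this one cannot be used inside
attribute values.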

In summary, the distinction is, as a reply noted, between "wanted" 
whitespace and "unwanted" whitespace. The XML specification wants to
leave it to the application because there are far more 'whitespace
convention sets' than it is desirable to put in the spec. However,
there are far more applications than there are 'whitespace convention
sets', and the application designer wants to pick one, not reinvent
the wheel. 

This seems to be the missing middle ground. How can we reusably
specify the relatively few whitespace options we need, which are more
than the XML spec provides, but far fewer in number than the number of
applications that we hope to see using XML? 

 Chris Smith                                          <smith at interlog.com>

