XML vs the Dreaded Whitespace

Peter Murray-Rust peter at ursus.demon.co.uk
Thu Dec 11 09:51:44 GMT 1997


Thanks very much Chris,
	I'm probably not going to be much practical help, but I hope your posting
catalyses a practical response from the SGML experts. I'd be surprised if
conventional XML-enhanced SGML tools couldn't handle this problem, but I
have no idea what they would cost. [The last flier I got was 2 orders of
magnitude greater than an impecunious academic could afford.]

At 03:00 11/12/97 -0500, Chris Smith wrote:
>
[... first problem punted ...]

>The second question is much less firm right now. We would like to make
>whitespace handling robust - if someone along the way uses a tool
>which breaks a line, we should be able to fix it rather than die.
>
>If we add the following character entities to our DTD,
>
><!ENTITY spc    "&#32;">
><!ENTITY tab    "&#9;">
><!ENTITY cr     "&#13;">
><!ENTITY lf     "&#10;">
>
>then it should be possible to use these to represent 'wanted'
>whitespace, and thus allow for a simple rule prior to checking message
>authentication - that is, remove all 'native' space, tab, LF, and CR
>from the #PCDATA and check what remains (whitespace inside tags is
>handled in a more draconian fashion). (According to the previous
>section, "Hi&spc;there!" will be checked exactly the way you see it
>here - not as "Hi there!") The question: is this distinction (between,
>e.g., the native 0x0009 and &tab;, which converts to 0x0009) going to be
>difficult to keep track of?

As one of the few authors of a generic native XML application I have to
face this problem and have repeatedly failed to get practical solutions.
The main response is:
	Yes, it's a problem and
	Yes, it's your problem
As I understand it, your XML document may contain two sorts of white space:
	whitespace that matters
	whitespace that doesn't matter
The latter may be inserted randomly by authors whose lines don't wrap. From
my very limited experience of SGML I would say your approach looks a
sensible one. 
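
A minimal sketch of the rule Chris describes, in Python. It assumes the check
runs on the raw, unparsed #PCDATA text, because a typical parser hands the
application "Hi there!" for "Hi&spc;there!" and the native/entity distinction
is then gone. The function names are hypothetical; only the four entity names
come from the DTD fragment quoted above.

```python
import re

# Entity names from the proposed DTD, mapped to the characters they expand to.
WANTED = {"spc": " ", "tab": "\t", "cr": "\r", "lf": "\n"}

NATIVE_WS = re.compile(r"[ \t\r\n]+")        # literal space, tab, CR, LF
ENTITY_REF = re.compile(r"&(spc|tab|cr|lf);")  # the four 'wanted' references


def strip_native_whitespace(raw_pcdata: str) -> str:
    """Drop literal whitespace; entity references survive as plain text."""
    return NATIVE_WS.sub("", raw_pcdata)


def expand_wanted(canonical: str) -> str:
    """Expand the whitespace entities once the native whitespace is gone."""
    return ENTITY_REF.sub(lambda m: WANTED[m.group(1)], canonical)


original = "Hi&spc;there!"
rewrapped = "Hi&spc;\n   there!"  # some tool broke the line and indented it

# Both forms reduce to the same string, so an authentication check on the
# stripped text is not upset by the re-wrapping.
assert strip_native_whitespace(original) == strip_native_whitespace(rewrapped)
print(repr(expand_wanted(strip_native_whitespace(rewrapped))))
```

The stripped form ("Hi&spc;there!") is what would be fed to the
authentication check; expansion happens only afterwards, for display or
further processing.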

However the major problem is 'where is your application software going to
come from?' I have argued very strongly (and shall continue to do so), that
there need to be generic conventions honoured by common application
programs. Otherwise you have to write your own application for your
problem. At present you have only two options:
	- write it yourself (and maintain it)
	- pay an SGML house to solve your problem for you

I hope shortly to propose some generic whitespace conventions (implemented in
JUMBO) for certain types of document. I don't know whether they would solve
your problems, but thanks for giving me the chance to think about a real
problem. :-)

As a corollary: Is anyone testing the ESIS output of the current crop of
XML parsers (4 Java + nsgmls, I think)? Regardless of the whitespace model
or the value of xml:space they should all produce identical ESIS (right?).
If not, then one or more is wrong. And all applications should (IMO) be
prepared to work with ESIS which I think is isomorphous with a WF XML
document.
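
A rough sketch of what such a comparison could look like, in Python. It is
only a harness: both front ends used here (xml.sax and xml.parsers.expat from
the standard library) sit on the same expat parser, so agreement between them
proves little; the point is the normalisation step. The event codes loosely
follow the nsgmls output style ("A" attribute, "(" start, ")" end, "-" data),
simplified by assumption, and the helper names are hypothetical.

```python
import xml.sax
import xml.parsers.expat


def coalesce(events):
    """Merge adjacent data events; parsers may chunk text differently."""
    out = []
    for kind, value in events:
        if kind == "-" and out and out[-1][0] == "-":
            out[-1] = ("-", out[-1][1] + value)
        else:
            out.append((kind, value))
    return out


class EsisHandler(xml.sax.ContentHandler):
    """Collect SAX callbacks as ESIS-like (code, value) pairs."""

    def __init__(self):
        super().__init__()
        self.events = []

    def startElement(self, name, attrs):
        # Attributes are emitted before the start event, sorted for stability.
        for key in sorted(attrs.getNames()):
            self.events.append(("A", f"{key} {attrs.getValue(key)}"))
        self.events.append(("(", name))

    def endElement(self, name):
        self.events.append((")", name))

    def characters(self, content):
        self.events.append(("-", content))


def esis_sax(data: bytes):
    handler = EsisHandler()
    xml.sax.parseString(data, handler)
    return coalesce(handler.events)


def esis_expat(data: bytes):
    events = []
    parser = xml.parsers.expat.ParserCreate()

    def start(name, attrs):
        for key in sorted(attrs):
            events.append(("A", f"{key} {attrs[key]}"))
        events.append(("(", name))

    parser.StartElementHandler = start
    parser.EndElementHandler = lambda name: events.append((")", name))
    parser.CharacterDataHandler = lambda text: events.append(("-", text))
    parser.Parse(data, True)
    return coalesce(events)


doc = b'<msg id="1">Hi\n  <b>there</b>!</msg>'
assert esis_sax(doc) == esis_expat(doc)
for code, value in esis_sax(doc):
    print(code + repr(value))
```

Whitespace is deliberately left untouched in the data events: any difference
in how two parsers report it would show up as a mismatch in the two lists.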

	P.

Peter Murray-Rust, Director Virtual School of Molecular Sciences, domestic
net connection
VSMS http://www.nottingham.ac.uk/vsms, Virtual Hyperglossary
http://www.venus.co.uk/vhg

xml-dev: A list for W3C XML Developers. To post, mailto:xml-dev at ic.ac.uk
Archived as: http://www.lists.ic.ac.uk/hypermail/xml-dev/
To (un)subscribe, mailto:majordomo at ic.ac.uk the following message;
(un)subscribe xml-dev
To subscribe to the digests, mailto:majordomo at ic.ac.uk the following message;
subscribe xml-dev-digest
List coordinator, Henry Rzepa (mailto:rzepa at ic.ac.uk)



