Whitespace rules (v2)

Peter Murray-Rust Peter at ursus.demon.co.uk
Sun Aug 17 00:08:29 BST 1997

In message < at pop.intergate.bc.ca> Tim Bray writes:

Thanks very much for your support, Tim.  We believe that XML-DEV has a role in
coming up with workable pragmatic solutions to 'parts' of the XML process. 
Getting those all right at once (i.e. for the spec) may be impossible; getting
a few of them mainly right may be a useful step.

> I gotta say that it's noble of you guys to take aim at this particular
> problem, but you should bear in mind that it's really really really 
> hard.  The original goal as stated in SGML was to ignore white
> space "caused by markup" by which they meant "used to prettyprint
> markup".  A worthy goal, but in fact most people would agree that
> the rules you have to write to achieve this are horrendously complicated
> and some would argue that SGML never actually did get it right.  

I'd agree with this. And XML does not work in precisely the same way as SGML
here.  It's most useful IMO to proceed on the basis that most XML-DEV'ers
will not understand the niceties of SGML-whitespace but *will be prepared to
work to a (fairly) simple set of rules*.

If we go for an 80/20 solution (i.e. 80% of users/applications find it useful
80% of the time), that solves 64% - a reasonable starting point...

> We spent a huge amount of time on this in the XML committee and 

Yes. And it's essential we don't go round this loop again. It will always be
possible to pick holes in a proposed set of rules - so we have to accept there
will be holes from the start. Just minimise their size and point them out.

> eventually decided that if simple rules could be written, we weren't
> smart enough to figure them out.

I don't think there *is* a solution in terms that a cast-iron spec could 
contemplate (any more than there is one universal DTD). We have to seek a 
compromise solution.  

> So good luck, don't expect it to be easy, but if you get it right
> the world will be grateful. -Tim

Obviously there will be applications which come 'out-of-the-box' - the 
authoring and processing tools are already written and validated, and most
people won't need to see the intermediate XML text.  Maybe CDF is in this
category.  I think we are aiming at those documents which might be processed
by generic XML processors, or composed of cut-n-paste from a variety of
sources (or both). For example, in a combined MathML and CML document, it
is reasonable to expect the whitespace processing to be openly declared, easily
implementable and (hopefully) easy to understand.

I think we can aim for one (or possibly two) protocols that service 'most'
applications.  With those there would be simple guidelines for authors (of
documents and of processing software).

Firstly there are some 'gotchas'. I don't think anyone *wants* CR/LF problems
to be platform-dependent. So we have to address this independently of other
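
One way to take the platform out of the CR/LF question is to normalise line
ends before any other whitespace rule is applied. The following is only a
sketch in Python, assuming that a CR LF pair and a lone CR should each be
treated as a single LF; the function name is made up for illustration and is
not part of any agreed rule.

```python
def normalise_line_ends(text: str) -> str:
    """Map CR LF pairs and lone CRs to a single LF (assumed policy)."""
    # Replace the two-character pair first so it does not become two LFs.
    return text.replace("\r\n", "\n").replace("\r", "\n")
```

With this in place, the same document read on different platforms would
present identical line ends to whatever rules come next.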

IMO most XML documents will fall into the categories:
	(a) precise whitespace matters (PRESERVE or <(HTML)PRE>). The main
problem with using this is the CR/LF one.
	(b) text-like, where markup is for formatting (mixed content, 
event-stream processing).
	(c) structured, often with pretty-printing (i.e. redundant whitespace)
(element content).
	(d) mixtures of (b) and (c). This would be common in technical documents
with a mixture of 'text' and 'non-textual' structured information.
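
To make the categories concrete, here is a rough Python sketch of what a
processor might do with one chunk of character data in each case. The
policies are assumptions chosen for illustration (leave (a) alone, collapse
runs in (b), drop whitespace-only chunks in (c)); they are not the rules
being proposed, and the function and category labels are hypothetical.

```python
import re

# Runs of the four XML whitespace characters: space, tab, CR, LF.
_WS_RUN = re.compile(r"[ \t\r\n]+")

def treat_text(text: str, category: str) -> str:
    """Apply an assumed whitespace policy to one chunk of character data."""
    if category == "a":
        # Precise whitespace matters: return the text untouched.
        return text
    if category == "b":
        # Text-like: collapse each run of whitespace to a single space.
        return _WS_RUN.sub(" ", text)
    if category == "c":
        # Structured: discard chunks consisting only of whitespace
        # (pretty-printing between elements); keep anything else as is.
        return "" if _WS_RUN.fullmatch(text) else text
    raise ValueError(f"unknown category: {category!r}")
```

Category (d) would then amount to choosing "b" or "c" element by element,
which is where the declaration of the whitespace processing comes in.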

I believe we can come up with simple rules for b/c/d which are reasonably 
intuitive to the webhacker and also cover a wide enough range of applications.


Peter Murray-Rust, domestic net connection
Virtual School of Molecular Sciences

xml-dev: A list for W3C XML Developers
Archived as: http://www.lists.ic.ac.uk/hypermail/xml-dev/
To unsubscribe, send to majordomo at ic.ac.uk the following message;
unsubscribe xml-dev
List coordinator, Henry Rzepa (rzepa at ic.ac.uk)
