Whitespace rules (v2)

Mon Aug 25 13:33:45 BST 1997

I have been away for a few days so maybe it's a useful time to try to summarise
the Whitespace debate and to ask a few questions. You don't need to read the 
rest of this unless you believe there is a problem to be addressed :-)

In message <v03007800b01fa935a1f1@[205.181.197.116]> dgd at cs.bu.edu (David G. Durand) writes:
> I observed with dismay that the issue of whitespace has surfaced on this
> list, after we finally gave it the wooden-stake-in-the-heart treatment on
> the WG discussion lists. As a chief proponent of the current method, I'll

:-) I am not sure what has been killed :-)

> take a shot at explaining the rationale, as that is something that doesn't
> really fit in a standard, but actually helps a great deal in understanding
> one.

I will take David's points first, because I *do* believe that many of those
who were involved in the development of the spec feel that there is no scope
for further discussion of this *IN THE SPEC*.  I agree with this.

Essentially the spec says:
	- This is a difficult problem.  [Actually it doesn't say this, but 
it might help if it did in a footnote.]
	- We have taken a minimalist approach where we do not give any support
to any whitespace philosophy [other than PRESERVE which passes everything and
can be platform-dependent], but leave this to the community. DEFAULT is simply
the absence of PRESERVE.
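
To make the PRESERVE/DEFAULT distinction concrete, here is a minimal sketch
of how an application might work out which mode applies to each element.
It assumes Python's standard xml.etree.ElementTree, the lower-case attribute
values "preserve" and "default", and that the setting is inherited by
descendants. The helper name space_modes and the sample document are made
up for illustration and are not taken from the spec.

```python
import xml.etree.ElementTree as ET

# ElementTree reports the reserved xml: prefix using this namespace URI.
XML_SPACE = "{http://www.w3.org/XML/1998/namespace}space"

def space_modes(elem, inherited="default"):
    """Yield (element, mode) pairs; a child inherits its parent's mode
    unless it carries its own xml:space attribute (an assumption here)."""
    mode = elem.get(XML_SPACE, inherited)
    yield elem, mode
    for child in elem:
        yield from space_modes(child, mode)

# Hypothetical sample document.
doc = ET.fromstring(
    '<doc><pre xml:space="preserve"><line>  a  b  </line></pre><p> c </p></doc>'
)
modes = {elem.tag: mode for elem, mode in space_modes(doc)}
print(modes)
```

Note that this only tells the application which flag is in force. What
"default" then means is still left entirely to the application.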

I believe this solves one species of problem, where the authoring tool/system
is closely coupled to the application. CDF might be such a system (e.g. I have
never seen a native CDF file).

*IF* this is the major use of XML - where there is a one-to-one communication
of this sort - then there is no real problem.  I do not believe this is the
case, and I think there are at least two areas where XML will run into this
general problem on numerous occasions:

(A) There is a defined DTD (e.g. TEI, HTML) but a variety of authoring tools
and a variety of applications from different providers. Traditionally these
will come from the SGML community. I believe that there will certainly be
initial problems where m'facturer X emits whitespace in a particular way
which is incompatible with Y's tools for rendering/transforming it. It may
also be platform dependent.  We've seen this in the development of HTML systems
although they are improving. 

Remember that most SGML systems are currently implemented within a single site
(the tools are chosen to be compatible throughout the process). Very little
SGML is delivered over the WWW to be consistent between different m'facturers.
XML is specifically designed to be delivered over the WWW in (I assume)
a platform and m'facturer-independent way.  Do we expect to see 'this XML
file best viewed with FOO software'??? If so, we might as well give up now.

IMO any developer needs to be able to say:
	(i) I support a wide range of XML DTDs.
	(ii) I can easily customise my software to support a range of commonly
used DTDs
	(iii) Documents authored by my software should be readable by software
from another m'facturer with whom I have had no formal discussions
	(iv) My system can support a range of applications which read documents
produced by other m'facturers' systems and with whom I have had no formal
discussions.

If all the manufacturers tell me this is a non-problem, I'll shut up (on this
issue!).  If each DTD defines its own use of whitespace (or worse, doesn't
define it) they may have a lot of work.

(B) There are generic XML applications. The XML community continues to discuss
documents which 'contain information from more than one DTD' or 'are WF but
not necessarily valid(atable)'. Examples of these are:
	(i) an XML document to which meta-data has been prepended.
	(ii) an XML document which includes chunks conforming to well-defined
DTDs such as MathML.

The possible combinations are indefinitely large.
It is impossible to write bespoke software to process these documents, and we
need generic mechanisms. Perhaps many will be dealt with by stylesheets, and
maybe the WS issue is a question of developing appropriate conventions in
stylesheets.  In documents of this sort there have to be conventions and flags
that indicate how to interpret the documents. The spec has indicated that it
shouldn't be in the XML markup - no problem.  Somehow conventions have to
evolve, either conveyed implicitly or explicitly (e.g. through PIs). 
[Remember that there are - as yet - no agreed conventions as to what a PI can
look like - you can put anything in after the target.]

> 
[...]
> >Axiomatic? Call me stubborn (you won't be the first), but I, for one,
> >retain some hope. :-)
> 
> We all did at first. The problem is really the last point -- _universal_
> and while I am tempted to agree with Peter, I do not, in fact, because I
> think the current method actually does satisfy all four points -- but not
> necessarily in the way that you would expect.

Note: I am NOT trying to find a universal solution here.  I am suggesting that
we develop some common, useful approaches which will solve a reasonable 
number of problems.

> 
> >>[Peter states in detail different policies on whitespace he might need in
> >>different contexts.]
> >>
> >>What I am after here is a convention that I can state which instructs the
> >>processor how to treat this whitespace.  ***I do not wish to have to devise
> >>a specific convention for CML***.  I want to be able to indicate that
> >>the W/S after <MOL> is irrelevant, and that the whitespace in the ATOMS
> >>content is normalisable and used only as a delimiter of tokens.
> 
> The problem with this is that there are a large number of ways that
> whitespace can be used: the "tokens" form mentioned at the end, for
> example, has never been proposed for XML.

I agree there are a large number of ways.  Some classification would be 
valuable and IMO the sort of thing that XML-DEV could usefully provide.
[The WS-separated tokens are no different from 'words' in HTML and I would
expect that a large number of people would welcome a convention on 
normalising whitespace between 'words'.]
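
A minimal sketch of the "tokens" reading, in Python: whitespace acts purely
as a delimiter and carries no other meaning. The function name ws_tokens and
the sample ATOMS content are hypothetical. Note also that str.split() treats
any Unicode whitespace as a separator, which is broader than the space, tab,
CR and LF that XML counts as whitespace.

```python
def ws_tokens(content):
    """Split character data into tokens, treating any run of whitespace
    as a single delimiter and ignoring leading/trailing whitespace."""
    return content.split()

# Hypothetical prettyprinted content of an ATOMS element.
atoms = "  C   H\n\tH  O \r\n"
tokens = ws_tokens(atoms)
print(tokens)
```

The same call serves for 'words' in HTML-like text; rejoining the tokens
with a single space gives the normalised form.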

> 
> >>I expect that many other applications will use a similar approach, so I want
> >>to share the effort with them.  Examples of metadata in XML have often been
> >>portrayed as prettyprinted and I expect that CML could use the same
> >>conventions.
> 
> This sharing makes sense, only when the sharing of effort is not imposing
> an unreasonable burden on others. The problem with whitespace is that the
> different possible policies are all unneeded by many applications.

Then the application needn't implement them :-)  Applications have to do
*something* about whitespace.  This can be:
	- ignore the problem (or use PRESERVE)
	- their own thing
	- a set of choices which is understood by the community
	- refuse to process the document.

> 
> The typical browser/formatter may never need "token" style whitespace, and
> may implement such things by passing data to applets or other external
> processes that will handle them.
> 
> In fact, the need to write xml->xml transducers (SGML has taught us that
> this need never goes away), argues that it must be _possible_ to see all
> whitespace at least _some_ of the time, regardless of document. That's one
> reason that the current "pass all whitespace" model works.

It 'works' in that it shifts the problem to the application developer. I like
the idea of an XML->XML transducer - perhaps in front of the application, or
callable within it.  If David thinks that such tools could be built 
independently of applications that is exactly what I am suggesting :-)
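
As a rough sketch of such a transducer sitting in front of an application
(Python, using the standard xml.etree.ElementTree): it parses a document,
applies a caller-supplied whitespace policy to every piece of character
data, and re-serialises the result as XML. The names transduce and collapse
and the sample document are illustrative assumptions only, and the collapse
policy is just one possible choice.

```python
import xml.etree.ElementTree as ET

def transduce(xml_text, policy):
    """XML -> XML: apply `policy` to all character data in the document."""
    root = ET.fromstring(xml_text)
    for elem in root.iter():
        # ElementTree keeps character data in .text (before the first child)
        # and .tail (after the element's end tag).
        if elem.text is not None:
            elem.text = policy(elem.text)
        if elem.tail is not None:
            elem.tail = policy(elem.tail)
    return ET.tostring(root, encoding="unicode")

def collapse(text):
    """Example policy: drop whitespace-only data, otherwise trim the ends
    and reduce each internal run of whitespace to one space."""
    return " ".join(text.split())

source = "<MOL>\n  <ATOMS> C  H\n H </ATOMS>\n</MOL>"
result = transduce(source, collapse)
print(result)
```

The application never sees the policy; it just reads the transduced
document. An application that wants the raw whitespace reads the original.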

> 
> The other reason that it works, is that you can always ignore data that
> you're not interested in (whitespace) but you can never get access to data
> that is hidden from you -- therefore the convenience of "automatic
> whitespace removal" is an inability to see that space without using
> non-standard tools.

It's clear that an application *must* have access to all whitespace if it
wants it (this is made clear by, say, the requirement of XML-LINK to search
on pseudoelements).  However it should also be able to access a normalised
form of the document.

> 
> >>I think that we can aim for a set of options that could be used by a
> >>post-parser processor. Different applications (**or document authors**)
> >>could choose between them. Examples might be:
> >>	- normaliseCRLF (Neil's Rule 1)
> >>	- discardAllWS
> >>	- normaliseToSingleSpace
> 
> I agree that this is the right place for such processing to happen (between
> a parser and an application). I'm not yet sure whether these things are as
> reusable as people think. I do know that without the use of #FIXED
> attributes (so I could avoid markup in the instance) I would _not_ use
> these, but rather make sure that my application (or stylesheet language)
> had the ability to apply these policies on request, as needed.

But we do have #FIXED, right? In which case I generally agree.
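
For concreteness, the three post-parser options quoted above might look
like this. It is a minimal sketch in Python: the function names are
transliterations of the option names, and normaliseCRLF is assumed to mean
folding CR LF pairs and bare CRs to a single LF, since Neil's Rule 1 is
not restated in this message.

```python
import re

# The four characters XML treats as whitespace: space, tab, CR, LF.
XML_WS_RUN = re.compile(r"[ \t\r\n]+")

def normalise_crlf(text):
    """Assumed reading of normaliseCRLF: CR LF and lone CR become LF."""
    return text.replace("\r\n", "\n").replace("\r", "\n")

def discard_all_ws(text):
    """discardAllWS: remove every whitespace character."""
    return XML_WS_RUN.sub("", text)

def normalise_to_single_space(text):
    """normaliseToSingleSpace: each run of whitespace becomes one space.
    Leading and trailing runs are kept as a single space, not trimmed."""
    return XML_WS_RUN.sub(" ", text)

print(repr(normalise_crlf("a\r\nb\rc\nd")))
print(repr(discard_all_ws(" a \t b\n")))
print(repr(normalise_to_single_space("a \n\t b")))
```

Each is a plain string-to-string function, so an author or an application
can pick one per element without the parser being involved at all.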

> 
[...]
> This is the option that XML universally adopts. That means  that any other
> method can be implemented _by any processor that cares_. If one can imagine
> destroying meaning of a document's content by the flattening of all
> whitespace strings to a single space, then you may need more elements in
> your content model, if you are not able to control the software that will
> process the document.

This is a good point.

> 
> In other words the parser guarantees all WS will be visible to applications
> -- this makes designing and implementing WS dependent processing easy --
> but since applications are _not_ constrained as folding or other WS
> processing behaviour, document authors will have to be cautious in using
> significant whitespace. If you can't assume that applications to process
> your markup will do the right thing, then you should not play games with WS.

Yes. But where is the rigour in authoring going to come from? This is where
I believe that XML-DEV has a role.

> 
> This actually is not much of an issue for CML, since it's a reasonable
> assumption that any implementation of CML markup-display will have to do
> lots of special things, of which whitespace is the least.

No, the point was that CML wishes to re-use HTML and MathML as additional
components in the document. And then meta-data, and ... So that the 
application will become bloated unless it can re-use the approaches from 
the rest of the community.

> 
[...]
> 
> 
> I think XML's agnostic position is the correct one for the language.
> Authors should probably assume (unless they anticipate absolutely no
> re-use) that HTML-style draconian normalization might occur anywhere and
> use markup rather than whitespace, or at least CDATA sections. This
> position _may_ be moderated (a little) where a well-known DTD with
> well-defined WS rules can be used (like the TEI or HTML).

I agree on this.  The point I have been trying to promote is that it should
be possible to collate the requirements of such systems and offer them
on a re-usable basis.

I know from experience that it's extremely easy to go round in circles here.
If this discussion is going to achieve something - and I think that a number
of people would welcome this - then perhaps a revised set of the rules
recently suggested, and addressed to HTML-like usage (with perhaps other
common current DTDs as well) would be beneficial.

An author could then say:
	- the content of FOO, BAR, FLIP can be expected to be treated by 
XML-DEV-HTML-like WS normalisation.
	- the content of BAZ, BLORT suffers WS stripping as described in
XML-DEV-HTML-like-stripping.  

and that's about it. If we can get something along those lines, then 
I think a reasonable number of people would take note. It doesn't just have 
to apply to HTML DTDs.
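
A sketch of how an application might act on such a statement (Python,
standard xml.etree.ElementTree). The two rule functions are stand-ins only:
the actual XML-DEV rules have not been written yet, so their behaviour here
is assumed. The element names come from the example above, apart from DOC
and QUX, which are invented for the sample. For brevity only the text
directly inside each named element is treated, not text following child
elements.

```python
import xml.etree.ElementTree as ET

def html_like_normalise(text):
    # Stand-in for "XML-DEV-HTML-like WS normalisation" (assumed behaviour).
    return " ".join(text.split())

def html_like_strip(text):
    # Stand-in for "XML-DEV-HTML-like-stripping" (assumed behaviour).
    return text.strip()

# The author's declaration: element name -> whitespace rule.
POLICIES = {
    "FOO": html_like_normalise,
    "BAR": html_like_normalise,
    "FLIP": html_like_normalise,
    "BAZ": html_like_strip,
    "BLORT": html_like_strip,
}

def apply_policies(root, policies):
    """Apply the declared rule to the leading text of each named element;
    elements with no declared rule are left exactly as parsed."""
    for elem in root.iter():
        rule = policies.get(elem.tag)
        if rule is not None and elem.text is not None:
            elem.text = rule(elem.text)
    return root

doc = ET.fromstring(
    "<DOC><FOO>  a   b </FOO><BAZ>  c   d </BAZ><QUX>  e  </QUX></DOC>"
)
apply_policies(doc, POLICIES)
texts = {elem.tag: elem.text for elem in doc}
print(texts)
```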

	P.

-- 
Peter Murray-Rust, domestic net connection
Virtual School of Molecular Sciences
http://www.vsms.nottingham.ac.uk/

xml-dev: A list for W3C XML Developers
Archived as: http://www.lists.ic.ac.uk/hypermail/xml-dev/
To unsubscribe, send to majordomo at ic.ac.uk the following message;
unsubscribe xml-dev
List coordinator, Henry Rzepa (rzepa at ic.ac.uk)