Whitespace rules (v2)

David G. Durand dgd at cs.bu.edu
Tue Aug 19 23:57:37 BST 1997


I observed with dismay that the issue of whitespace has surfaced on this
list, after we finally gave it the wooden-stake-in-the-heart treatment on
the WG discussion lists. As a chief proponent of the current method, I'll
take a shot at explaining the rationale, as that is something that doesn't
really fit in a standard, but actually helps a great deal in understanding
one.

I'm taking some recent notes on this list as a starting point.

At 5:17 PM -0500 8/18/97, Russell Chamberlain wrote:

>> Peter Murray-Rust wrote:
>>
>>I think - along with TimB - that it is unrealistic to come up with a single
>>set of rules that will serve every application.  There was an enormous
>>amount of discussion on the XML group last year and I take it as axiomatic
>>that we cannot produce a set of rules which everyone agrees are:
>>	- simple to state
>>	- unambiguous
>>	- intuitive and easy to learn
>>	- universal (i.e. cover every situation)
>
>Axiomatic? Call me stubborn (you won't be the first), but I, for one,
>retain some hope. :-)

We all did at first. The problem is really the last point, _universal_.
While I am tempted to agree with Peter, I do not, in fact, because I
think the current method actually does satisfy all four points -- but not
necessarily in the way that you would expect.

>>[Peter states in detail different policies on whitespace he might need in
>>different contexts.]
>>
>>What I am after here is a convention that I can state which instructs the
>>processor how to treat this whitespace.  ***I do not wish to have to devise
>>a specific convention for CML***.  I want to be able to indicate that the
>>W/S after <MOL> is irrelevant, and that the whitespace in the ATOMS content
>>is normalisable and used only as a delimiter of tokens.

The problem with this is that there are a large number of ways that
whitespace can be used: the "tokens" form mentioned at the end, for
example, has never been proposed for XML.

>>I expect that many other applications will use a similar approach, so I want
>>to share the effort with them.  Examples of metadata in XML have often been
>>portrayed as prettyprinted and I expect that CML could use the same
>>conventions.

This sharing makes sense only when it does not impose an unreasonable
burden on others. The problem with whitespace is that each of the
possible policies is unneeded by many applications.

The typical browser/formatter may never need "token" style whitespace, and
may implement such things by passing data to applets or other external
processes that will handle them.

In fact, the need to write xml->xml transducers (SGML has taught us that
this need never goes away) argues that it must be _possible_ to see all
whitespace at least _some_ of the time, regardless of document. That's one
reason that the current "pass all whitespace" model works.

The other reason that it works is that you can always ignore data that
you're not interested in (whitespace), but you can never get access to data
that is hidden from you. The price of the convenience of "automatic
whitespace removal" is therefore an inability to see that space without
using non-standard tools.
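
A minimal sketch of the "pass all whitespace" model, using Python's standard
xml.etree.ElementTree parser as a stand-in for an XML processor. The document
string and the roundtrip helper are assumptions made for illustration; the
element names follow the MOL/ATOMS example quoted above. The parser reports
every whitespace character, and the application simply ignores what it does
not need:

```python
import xml.etree.ElementTree as ET

# Hypothetical CML-like fragment, pretty-printed by its author.
DOC = "<MOL>\n  <ATOMS>C   H\n    H  O</ATOMS>\n</MOL>"

root = ET.fromstring(DOC)
atoms = root.find("ATOMS")

# Nothing is hidden: the whitespace around the child element is delivered.
leading = root.text    # whitespace between <MOL> and <ATOMS>
trailing = atoms.tail  # whitespace between </ATOMS> and </MOL>

# An application that only wants tokens discards the whitespace itself.
tokens = atoms.text.split()


def roundtrip(element):
    """Serialise the tree again, as an xml->xml transducer would."""
    return ET.tostring(element, encoding="unicode")
```

Because the whitespace was never thrown away, roundtrip(root) can reproduce
the author's layout, while tokens gives the "delimiter only" view of the same
content.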

>>I think that we can aim for a set of options that could be used by a
>>post-parser processor. Different applications (**or document authors**)
>>could choose between them. Examples might be:
>>	- normaliseCRLF (Neil's Rule 1)
>>	- discardAllWS
>>	- normaliseToSingleSpace

I agree that this is the right place for such processing to happen (between
a parser and an application). I'm not yet sure whether these things are as
reusable as people think. I do know that without the use of #FIXED
attributes (so I could avoid markup in the instance) I would _not_ use
these, but rather make sure that my application (or stylesheet language)
had the ability to apply these policies on request, as needed.
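
As a rough sketch of what "applying these policies on request" between a
parser and an application could look like, the three options named above can
be written as plain string filters. The function names, the POLICIES table
and the apply_policy helper are hypothetical, and the reading of discardAllWS
as "remove every whitespace character" is an assumption, since the thread
does not define it:

```python
import re

# The four XML whitespace characters: space, tab, CR, LF.
_WS_RUN = re.compile(r"[ \t\r\n]+")


def normalise_crlf(text: str) -> str:
    """Map CR LF pairs and bare CRs to a single LF."""
    return text.replace("\r\n", "\n").replace("\r", "\n")


def discard_all_ws(text: str) -> str:
    """Remove every whitespace character (assumed meaning of discardAllWS)."""
    return _WS_RUN.sub("", text)


def normalise_to_single_space(text: str) -> str:
    """Fold each run of whitespace into one space."""
    return _WS_RUN.sub(" ", text)


# Policy names as they appear in the quoted proposal.
POLICIES = {
    "normaliseCRLF": normalise_crlf,
    "discardAllWS": discard_all_ws,
    "normaliseToSingleSpace": normalise_to_single_space,
}


def apply_policy(name: str, text: str) -> str:
    """Apply one named policy to character data handed over by the parser."""
    return POLICIES[name](text)
```

Each filter is a few lines, which is part of the argument: an application or
stylesheet language that needs one can supply it without the parser knowing.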

><GeneralDiscussionOfWhitespace>
>
>A notation for describing whitespace handling must communicate the notion
>that whitespace processing is modal, and provide words for each mode and
>phrases for the transitions.
>
>Let's consider Peter's tentative rules:
>
>>	- normaliseCRLF (Neil's Rule 1)
>
>Please correct me if I am wrong, but this looks like a document-wide
>setting whose behaviour/interpretation isn't affected by the application
>type. A simple on/off PI setting could be used to set this.

One might want to do this only in specific elements. Say I'm piping some
sub-elements to a stupid processor that requires a fixed line-end
convention, while none of my other processing cares.

>
>The rest of the rules, though, could be applied on a per-element basis:
>
>>	- discardAllWS
>>	- normaliseToSingleSpace
>
>I would add:
>
>    - keepAllWS
>
>(I haven't read every word of every post in this thread. Has this third one
>been discarded as a reasonable option? Even if it has, the rest of my
>discussion here isn't affected)

This is the option that XML universally adopts. That means that any other
method can be implemented _by any processor that cares_. If you can imagine
the meaning of a document's content being destroyed by the flattening of all
whitespace strings to a single space, and you cannot control the software
that will process the document, then you may need more elements in your
content model.

In other words, the parser guarantees all WS will be visible to applications
-- this makes designing and implementing WS-dependent processing easy --
but since applications are _not_ constrained as to folding or other WS
processing behaviour, document authors will have to be cautious in using
significant whitespace. If you can't assume that the applications processing
your markup will do the right thing, then you should not play games with WS.
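
To illustrate "implemented by any processor that cares", here is a sketch of
an application layering per-element modes on top of a parser that keeps all
whitespace. Everything in it is an assumption for illustration: the MODES
table, the PRE element, the rule that unlisted elements inherit the mode of
their parent, and the reading of discardAllWS as removing every whitespace
character. It uses Python's standard xml.etree.ElementTree:

```python
import re
import xml.etree.ElementTree as ET

_WS_RUN = re.compile(r"[ \t\r\n]+")

# Hypothetical per-element table; mode names are taken from the thread.
MODES = {
    "MOL": "discardAllWS",
    "ATOMS": "normaliseToSingleSpace",
    "PRE": "keepAllWS",
}


def _fold(text, mode):
    """Apply one mode to a piece of character data (None means no data)."""
    if text is None or mode == "keepAllWS":
        return text
    if mode == "discardAllWS":
        return _WS_RUN.sub("", text)
    return _WS_RUN.sub(" ", text)  # normaliseToSingleSpace


def apply_modes(element, inherited="keepAllWS"):
    """Rewrite whitespace in place; unlisted elements inherit the mode."""
    mode = MODES.get(element.tag, inherited)
    element.text = _fold(element.text, mode)
    for child in element:
        apply_modes(child, mode)
        # A child's tail is character data of the parent, so the parent's
        # mode governs it.
        child.tail = _fold(child.tail, mode)
    return element


def process(xml_text):
    """Parse, apply the application's modes, and serialise the result."""
    root = apply_modes(ET.fromstring(xml_text))
    return ET.tostring(root, encoding="unicode")
```

The folding is a choice made by this one application. Another program reading
the same document is free to apply different rules, which is why authors
cannot rely on any particular treatment of significant whitespace.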

This actually is not much of an issue for CML, since it's a reasonable
assumption that any implementation of CML markup-display will have to do
lots of special things, of which whitespace is the least.

[[[Geek note: I think that authors might be a little safer if significant
WS is in a CDATA marked section. Since CDATA is essentially a quoting
mechanism, applications should be more careful about such content.]]]

>Would being able to specify one of the three modes on a per-element basis
>be powerful enough? If we used PIs to do this then some HTML tags, for
>example, might be listed as follows (just a hypothetical notation example,
>_not_ a final suggestion for notation):
>
>    <?XML-SPACE-DISCARD  HTML, HEAD, BODY, ... ?>
>    <?XML-SPACE-COLLAPSE TITLE, P, H1, H2, ... ?>
>    <?XML-SPACE-KEEP     PRE, XMP, LISTING, ... ?>
>
>Notes:
>
>- HTML applications could just imply these rules.
>
>- Any elements that aren't listed would just use the current mode, which
>depends on the context.
>
>- If the desired whitespace mode depends on something other than the
>current element (an attribute, say) then this mechanism won't be powerful
>enough.
>
>- Specifying the whitespace mode on a per-element basis should make this
>technique well-suited to architectural forms, though.

One way to see that this is inadequate is to think about typesetting, where
you may need to consider the whitespace and adjacent typefaces independent
of their placement with respect to markup, in order to correctly handle
italic corrections and the like. This is something that authors frequently
fail to get right, and that is probably best solved, 90% of the time, by
smart software. (Let's not even consider the problem of punctuation in the
same environments!)

I think XML's agnostic position is the correct one for the language.
Authors should probably assume (unless they anticipate absolutely no
re-use) that HTML-style draconian normalization might occur anywhere and
use markup rather than whitespace, or at least CDATA sections. This
position _may_ be moderated (a little) where a well-known DTD with
well-defined WS rules can be used (like the TEI or HTML).

  -- David

_________________________________________
David Durand              dgd at cs.bu.edu  \  david at dynamicDiagrams.com
Boston University Computer Science        \  Sr. Analyst
http://www.cs.bu.edu/students/grads/dgd/   \  Dynamic Diagrams
--------------------------------------------\  http://dynamicDiagrams.com/
MAPA: mapping for the WWW                    \__________________________


