Whitespace rules (v2)

Sat Aug 16 19:56:46 BST 1997

Firstly, many thanks to Neil for posting these proposed rules, and to those who
have answered.  On balance (I am an optimist!) I think there is something
desirable and achievable here.  I think a lot of us feel there has to be
some guidance on whitespace and I think Neil has covered much of the ground.

I think what is achievable is a set of rules at the 80/20 level (80%
of XML-DEV'ers think they are 80% useful). There are certainly areas where there
will be disagreement - this was a voluminous topic on XML-WG last autumn.

XML-DEV has the advantage and disadvantage that it has no formal standing, so
those who don't like anything that comes out of it can ignore it :-).  So if
we can come up with a set of rules and a label for them, application developers
can use them (or not) as they wish.  An advantage is that because all discussion
is publicly archived, we can always point back and say 'that is why we 
suggested X'.  

If a set of rules *does* emerge, then how can we generally inform an application
that it should take them as DEFAULT?  I assume this is through a PI:

<?XML-SPACE-DEFAULT 
   HREF="http://www.lists.ic.ac.uk/hypermail/xml-dev/12345.html"?>
...
<FOO XML-SPACE="DEFAULT">
    The <!-- munge this according to XML-DEV whitespace -->whitespace[CR][LF]is
normalised</FOO>

So I think we need a mechanism from XML-WG to show the application where
it should get its DEFAULT processing mechanism from.

Specific points:

[Rule 1 - normalisation]
I think it's essential to have something like Neil's proposal for [CR][LF].
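
A minimal sketch of what such a normalisation might look like. Neil's exact
wording of Rule 1 is not quoted in this message, so the behaviour here (every
[CR][LF] pair and every lone [CR] becomes a single [LF]) is an assumption, and
the function name is hypothetical:

```python
def normalise_line_ends(text: str) -> str:
    """Map CR LF pairs and lone CRs to a single LF (assumed reading of Rule 1)."""
    # Replace the two-character sequence first so it does not turn into two LFs.
    return text.replace("\r\n", "\n").replace("\r", "\n")
```

With this, "whitespace[CR][LF]is" and "whitespace[LF]is" reach the later rules
in the same form, whatever platform the file came from.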

In message <Pine.BSI.3.95.970816011323.12788A-100000 at shell1.interlog.com> 
Liam Quin writes:
> On Sun, 10 Aug 1997, Neil Bradley wrote:
> 
> > [...]
> > RULE 2. All whitespace preceding the start-tag and following the end-tag 
> > of a 'block enclosing' element is discarded.
> > ---
> > Note: a non-validating application must refer to a style sheet or
> > configuration file to identify 'block enclosing' elements (perhaps by 
> > applying this rule to elements not specified as in-line elements).
> 
> No -- "blockness" is not at all the same as element content.
> For example, you have to allow for a run-in heading, which starts out
> looking like an HTML H3 (say) except that the rest of the paragraph
> follows on on the same line.  So it isn't a block in the paragraph sense.
> 
> > As a validating application cannot easily determine this rule from the
> > content model (the first mixed content element in the hierarchy is 
> > block enclosing, as well as all outer layers), it may choose the same 
> > approach. 
> 
> I think this is too complicated, as well as being not 100% right.
> I don't think there's a single "right" solution.  This is why it's
> best to allow the parser to pass _all_ whitespace back to the application,
> although it is certainly useful if a DTD-aware parser, even if it isn't
> validating, distinguishes element content whitespace from PCDATA whitespace
> in some way.

I agree with Liam - I didn't understand 'blockness'.  I also think that whatever
is done here has to be independent of stylesheets and DTDs.  The average hacker
like me simply won't understand the subtleties.
> 
> More than this is a bad idea, I think.
> 
> 
> > Note: If PI's, comments or empty elements remain in the data stream,
> > they are deemed transparent to this process, so:
> >  [SP]<!--comment--><p>Some text...
> > 
> > becomes:
> > 
> >  <!--comment--><p>Some text...
> 
> Note that if you have a very large comment, you might need a lot of
> lookahead here.

I would assume that this processing takes place in the application, not the
parser.  How/whether comments are passed to the application is part of the
parser API.  I assume that at this stage the comment is recognised as a single
chunk which can be deleted with or without surrounding whitespace as required.

> 
> > RULE 3. A sequence of one or more line-end codes immediately
> > following a start-tag, or immediately preceding an end-tag, are
> > discarded (except in preserved content).
> 
> This means that
> <Paragraph>This is<Emphasis>
> very
> </Emphasis>strange.</Paragraph>
> 
> becomes
> <Paragraph>This is<Emphasis>very</Emphasis>strange.</Paragraph>
> 
> or, if you format without distinguishing emphasis,
> <Paragraph>This isverystrange.</Paragraph>
> 
> which I don't think is what you want.
> 
> But SGML itself is broken in this regard.
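
Liam's example can be reproduced with a rough sketch of Rule 3. This is a
regex approximation for illustration only, not a parser: it assumes simple
tags with no '/' or '>' inside attribute values, it ignores the "preserved
content" exception, and the names are hypothetical:

```python
import re

# One or more line-end codes immediately after a start-tag.
# The '/' exclusion keeps empty-element tags such as <BR/> out of the match.
_AFTER_START = re.compile(r"(<[A-Za-z][^<>/]*>)[\r\n]+")
# One or more line-end codes immediately before an end-tag.
_BEFORE_END = re.compile(r"[\r\n]+(</[A-Za-z][^<>]*>)")


def apply_rule_3(markup: str) -> str:
    """Discard line ends after start-tags and before end-tags."""
    markup = _AFTER_START.sub(r"\1", markup)
    return _BEFORE_END.sub(r"\1", markup)


def strip_tags(markup: str) -> str:
    """Drop all tags, as a formatter that ignores emphasis might."""
    return re.sub(r"<[^<>]+>", "", markup)
```

Applied to the <Paragraph> example above, apply_rule_3 leaves no whitespace on
either side of "very", so strip_tags then yields "This isverystrange.".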

This one is tough.  Please criticise my current view :-).  SGML documents seem
to use markup as structure in some places (e.g. OL/LI in HTML) and as
event streams in others (e.g. EM, B in HTML). Authors/readers expect different
processing modes from these types. The example above is best treated as structuring
markup (P) containing an event stream (#PCDATA|EM)* [sorry for abbreviations].
So we have to indicate to the processor that P is structuring and that 
whitespace after <P> or before </P> is irrelevant, and that its content is an 
event stream where all whitespace is normalised to a single space (cf HTML.)
Therefore can we have something like this:
<?XML-SPACE STRUCTURE="YES"?>
<Paragraph>
<?XML-SPACE EVENT="YES"?>
This is<Emphasis>very</Emphasis>strange.
<?XML-SPACE STRUCTURE="YES"?>
</Paragraph>

(I am sure there are cleaner ways of doing this, especially declaring this
for all <Paragraph>s).  The question is whether a model like this meets the
80/20 rule.
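
To make the model concrete, here is a small sketch of the processing it
implies for the content of a structuring element: whitespace next to the
enclosing tags is dropped, and every run of whitespace inside the event stream
becomes a single space (cf HTML).  The function name is hypothetical, and the
content is treated as a plain string rather than parsed:

```python
import re


def normalise_event_stream(content: str) -> str:
    """Normalise the content found between <P> and </P>."""
    # Every run of spaces, tabs and line ends collapses to one space.
    collapsed = re.sub(r"[ \t\r\n]+", " ", content)
    # Whitespace just after <P> or just before </P> is irrelevant.
    return collapsed.strip(" ")
```

Note that whitespace around the inline <Emphasis> tags is kept (as one space),
which avoids the "This isverystrange." result.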

> 
> > RULE 4.  A remaining line-end code is converted into a space, except when it is 
> > preceded by a normal (hard) hyphen, or by a soft hyphen ('&#173;'), 
> > in which case it is removed (a soft hyphen is also then removed). 
> > ---

I have to argue against this :-(.  A hyphen is indistinguishable from a minus
to lots of people. There are also many cases where people may wish to end
a line with a minus:
<MOL>
<ATOMS>
CL-
H+
</ATOMS>
</MOL>

Since we are normalising whitespace, lines can always be arranged so that
line-end hyphens are unnecessary.
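
A sketch of Rule 4 as I read it shows the problem.  It assumes line ends have
already been normalised to a single [LF] and takes the soft hyphen to be
U+00AD; the function name is hypothetical:

```python
SOFT_HYPHEN = "\u00ad"


def apply_rule_4(text: str) -> str:
    """Turn each line end into a space, unless a hyphen precedes it."""
    out = []
    for ch in text:
        if ch != "\n":
            out.append(ch)
        elif out and out[-1] == "-":
            pass  # hard hyphen: drop the line end, keep the hyphen
        elif out and out[-1] == SOFT_HYPHEN:
            out.pop()  # soft hyphen: drop both it and the line end
        else:
            out.append(" ")
    return "".join(out)
```

Ordinary text behaves as expected ("one[LF]two" becomes "one two"), but the
two atoms in the <ATOMS> example run together as "CL-H+".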

Let's see if there is a solution which is simple, covers most of the common
problems and which is intuitively obvious to the webhackers who graduate from
HTML.  We clearly need something more than <PRE> and </PRE>, but it shouldn't
be more than, say, twice as complex.  I think we are a long way towards that.

	P.

-- 
Peter Murray-Rust, domestic net connection
Virtual School of Molecular Sciences
http://www.vsms.nottingham.ac.uk/

xml-dev: A list for W3C XML Developers
Archived as: http://www.lists.ic.ac.uk/hypermail/xml-dev/
To unsubscribe, send to majordomo at ic.ac.uk the following message;
unsubscribe xml-dev
List coordinator, Henry Rzepa (rzepa at ic.ac.uk)