Whitespace rules (v2)

Tue Aug 19 00:16:25 BST 1997

<HI/>

In message <199708170743.IAA28970 at andromeda.ndirect.co.uk> "Neil Bradley"
writes:
> 
> Peter Murray-Rust wrote:
> 
>I think - along with TimB - that it is unrealistic to come up with a single
>set of rules that will serve every application.  There was an enormous amount
>of discussion on the XML group last year and I take it as axiomatic that we
>cannot produce a set of rules which everyone agrees are:
>	- simple to state
>	- unambiguous
>	- intuitive and easy to learn
>	- universal (i.e. cover every situation)

Axiomatic? Call me stubborn (you won't be the first), but I, for one,
retain some hope. :-)

>
>I think that XML will include applications beyond 'browsers and typesetting 
>systems' although these will be the commonest. MathML and CML will have 
>chunks of material which contain whitespace not used primarily as part of
>text.  Here's a simple example:
><MOL>
>  <ATOMS>
>[HT]C H N    Cl[CR][LF]
>[HT]O P Br[CR][LF]
>  </ATOMS>
></MOL>
>where the whitespace is used (a) for visual effect and potential ease in 
>editing (b) as a delimiter (within ATOMS) [HT]=tab, for example. 
>
>What I am after here is a convention that I can state which instructs the 
>processor how to treat this whitespace.  ***I do not wish to have to devise
>a specific convention for CML***.  I want to be able to indicate that
>the W/S after <MOL> is irrelevant, and that the whitespace in the ATOMS
>content is normalisable and used only as a delimiter of tokens.
>
>I expect that many other applications will use a similar approach, so I want
>to share the effort with them.  Examples of metadata in XML have often been 
>portrayed as prettyprinted and I expect that CML could use the same
>conventions.
>[BTW I think that there will be more human editing of XML files than is often
>assumed - and metadata is a good example. Prettyprinting is a useful tool
>in those cases.]
>
>I think that we can aim for a set of options that could be used by a
>post-parser processor. Different applications (**or document authors**)
>could choose between them. Examples might be:
>	- normaliseCRLF (Neil's Rule 1)
>	- discardAllWS
>	- normaliseToSingleSpace
>
>An author or application could then state which of these it was using. 
>
>It might be that in the first instance we can only agree on (say) Rule 1, but
>this would be a useful start.
>
>>  
>> > I agree with Liam - I didn't understand 'blockness'.  I also think that
>> > whatever is done here has to be independent of stylesheets and DTDs.  The
>> > average hacker like me simply won't understand the subtleties.
>> 
>> I am merely trying to distinguish in-line elements from other 
>> elements. An in-line element implies no line-breaks above or below 
>> it. A 'Block' element therefore DOES imply such a break. I do not use 
>> the terms element and mixed content here, because it is not quite the 
>> same thing. As I have said before, a Para element is a 'block' 
>> element, and has mixed content, but an Emph element is an 'in-line' 
>> element, yet also has mixed content. All style sheets, including 
>> CSS, understand the concept of in-line and block elements. Any 
>> whitespace surrounding a block element MUST be irrelevant.
>
>It looks like the context, rather than the content, is the significant
>feature.
>
>> 
>> Liam raised the issue of a half-way element type, such as a header 
>> which implies a line-break before it, but not after, so that 
>> following text will appear on the same line. This one is tricky. 
>> Suggestions anybody?
>

<FormattingSpecificDiscussionOfWhitespace>

The idea of a "half-way" element type just highlights the fact that element
nesting does not necessarily map nicely to block/paragraph structure in
formatting applications. I like to say that block formatting _transcends_
element nesting -- there is no direct mapping.

In my experience, a pair of lower-level concepts (e.g. "block start" and
"block end") has proven quite useful. In the current discussion, the
"blockness" of the elements might be described as follows:

           "block start"   "block end"
    -----------------------------------------
    Para       Yes            Yes
    Emph       No             No
    Hn         Yes            No

where:

  "block start" - means start a block at the start of the element
  "block end"   - means end a block at the end of the element
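
To make the table concrete, here is a minimal sketch in Python of how a
formatter might use the two flags. The event format and the names BLOCKNESS
and layout are hypothetical, invented only for this illustration; elements
missing from the table are assumed to behave like Emph.

```python
# Hypothetical "blockness" flags, copied from the table above.
# Each entry is (block start, block end).
BLOCKNESS = {
    "Para": (True, True),
    "Emph": (False, False),
    "Hn":   (True, False),
}

def layout(events):
    """Turn ("start", name) / ("end", name) / ("text", s) events into lines.

    A line break is produced only where an element asks for a block
    boundary; element nesting by itself never forces one.
    """
    lines, current = [], []

    def flush():
        # Close the line being built, if it holds any text.
        if current:
            lines.append("".join(current))
            current.clear()

    for kind, value in events:
        if kind == "text":
            current.append(value)
        else:
            # Assumption: unlisted elements are treated as in-line.
            starts, ends = BLOCKNESS.get(value, (False, False))
            if (kind == "start" and starts) or (kind == "end" and ends):
                flush()
    flush()
    return lines

events = [
    ("start", "Hn"), ("text", "Note: "), ("end", "Hn"),
    ("text", "run-in text with "),
    ("start", "Emph"), ("text", "emphasis"), ("end", "Emph"),
    ("start", "Para"), ("text", "A new paragraph."), ("end", "Para"),
]
print(layout(events))
```

Because Hn ends without closing its block, the text that follows it stays on
the same line, which is the "half-way" behaviour Liam described.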

</FormattingSpecificDiscussionOfWhitespace>

<GeneralDiscussionOfWhitespace>

A notation for describing whitespace handling must communicate the notion
that whitespace processing is modal, and provide words for each mode and
phrases for the transitions. 

Let's consider Peter's tentative rules:

>	- normaliseCRLF (Neil's Rule 1)

Please correct me if I am wrong, but this looks like a document-wide
setting whose behaviour/interpretation isn't affected by the application
type. A simple on/off PI setting could be used to set this.
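
As a rough sketch of such a switch (Python): Neil's Rule 1 is not quoted in
this message, so I am assuming it means mapping each CR LF pair, and each
lone CR, to a single LF. The function name and the "enabled" flag are
hypothetical stand-ins for whatever the PI would control.

```python
import re

def normalise_crlf(text, enabled=True):
    """Map CR LF pairs and lone CRs to a single LF when the switch is on.

    Assumption: this is what "normaliseCRLF" is meant to do; the rule
    itself is not spelled out in this message.
    """
    if not enabled:
        return text
    return re.sub(r"\r\n?", "\n", text)

print(repr(normalise_crlf("C H N\r\nO P Br\r")))
```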

The rest of the rules, though, could be applied on a per-element basis:

>	- discardAllWS
>	- normaliseToSingleSpace

I would add:

    - keepAllWS

(I haven't read every word of every post in this thread. Has this third one
been discarded as a reasonable option? Even if it has, the rest of my
discussion here isn't affected)
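
For concreteness, a minimal sketch of the three modes as Python functions,
applied to the content of Peter's ATOMS example. The function names and the
MODES table are my own invention. I am also assuming that "whitespace" means
space, tab, CR and LF, and that normaliseToSingleSpace does not trim leading
or trailing whitespace.

```python
import re

# Assumption: whitespace means space, tab, CR and LF.
WS_RUN = re.compile(r"[ \t\r\n]+")

def discard_all_ws(text):
    """Remove every whitespace character."""
    return WS_RUN.sub("", text)

def normalise_to_single_space(text):
    """Replace each run of whitespace with one space (no trimming)."""
    return WS_RUN.sub(" ", text)

def keep_all_ws(text):
    """Leave the text exactly as it was written."""
    return text

# The three mutually exclusive modes, keyed by Peter's names plus mine.
MODES = {
    "discardAllWS": discard_all_ws,
    "normaliseToSingleSpace": normalise_to_single_space,
    "keepAllWS": keep_all_ws,
}

# Content of the ATOMS element from Peter's example ([HT], [CR], [LF]).
atoms = "\tC H N    Cl\r\n\tO P Br\r\n"

for name, mode in MODES.items():
    print(name, repr(mode(atoms)))

# Using the normalised text purely as a token delimiter:
print(normalise_to_single_space(atoms).split())
```

Note that discardAllWS would glue the atom symbols together, so for ATOMS
the normalising mode is the one that preserves the token boundaries.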

Assuming that the three mutually exclusive rules (or _modes_) can be
applied to any element, how can we specify this?

Would being able to specify one of the three modes on a per-element basis
be powerful enough? If we used PIs to do this then some HTML tags, for
example, might be listed as follows (just a hypothetical notation example,
_not_ a final suggestion for notation):

    <?XML-SPACE-DISCARD  HTML, HEAD, BODY, ... ?>
    <?XML-SPACE-COLLAPSE TITLE, P, H1, H2, ... ?>
    <?XML-SPACE-KEEP     PRE, XMP, LISTING, ... ?>

Notes:

- HTML applications could just imply these rules.

- Any elements that aren't listed would just use the current mode, which
depends on the context.

- If the desired whitespace mode depends on something other than the
current element (an attribute, say) then this mechanism won't be powerful
enough.

- Specifying the whitespace mode on a per-element basis should make this
technique well-suited to architectural forms, though.
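
Here is a minimal sketch (Python) of how a post-parser processor might read
the hypothetical PIs above and pick a mode for an element. Everything in it
is an assumption made for illustration: the function names, the "keep"
default, and in particular my reading of "current mode" as the mode of the
nearest listed ancestor.

```python
import re

# Matches the hypothetical XML-SPACE-* PIs used in the notation above.
PI_PATTERN = re.compile(r"<\?XML-SPACE-(DISCARD|COLLAPSE|KEEP)\s+(.*?)\s*\?>")

def parse_space_pis(source):
    """Return {element name: mode} collected from the PIs in `source`."""
    modes = {}
    for mode, names in PI_PATTERN.findall(source):
        for name in names.split(","):
            name = name.strip()
            if name and name != "...":   # skip the elision in the example
                modes[name] = mode.lower()
    return modes

def mode_for(path, modes, default="keep"):
    """Resolve the mode for the innermost element of `path`.

    `path` lists element names from the root down. An unlisted element
    uses the mode of its nearest listed ancestor (assumed reading of
    "the current mode"); `default` is an assumed fallback.
    """
    for name in reversed(path):
        if name in modes:
            return modes[name]
    return default

pis = """
<?XML-SPACE-DISCARD  HTML, HEAD, BODY ?>
<?XML-SPACE-COLLAPSE TITLE, P, H1, H2 ?>
<?XML-SPACE-KEEP     PRE, XMP, LISTING ?>
"""
modes = parse_space_pis(pis)

print(mode_for(["HTML", "BODY", "P", "EM"], modes))   # EM is unlisted
print(mode_for(["HTML", "BODY", "PRE"], modes))
print(mode_for(["HTML", "BODY"], modes))
```

As the third note says, a lookup keyed only on element names cannot express
a mode that depends on an attribute; that would need a richer key than the
element name alone.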

</GeneralDiscussionOfWhitespace>

 - Russ

PS - Should whitespace be blacklisted? ;-)

xml-dev: A list for W3C XML Developers
Archived as: http://www.lists.ic.ac.uk/hypermail/xml-dev/
To unsubscribe, send to majordomo at ic.ac.uk the following message;
unsubscribe xml-dev
List coordinator, Henry Rzepa (rzepa at ic.ac.uk)