Whitespace rules (v2)
Russell Chamberlain
russc at watfac.org
Tue Aug 19 00:16:25 BST 1997
<HI/>
In message <199708170743.IAA28970 at andromeda.ndirect.co.uk> "Neil Bradley"
writes:
>
> Peter Murray-Rust wrote:
>
>I think - along with TimB - that it is unrealistic to come up with a single
>set of rules that will serve every application. There was an enormous amount
>of discussion on the XML group last year and I take it as axiomatic that we
>cannot produce a set of rules which everyone agrees are:
> - simple to state
> - unambiguous
> - intuitive and easy to learn
> - universal (i.e. cover every situation)
Axiomatic? Call me stubborn (you won't be the first), but I, for one,
retain some hope. :-)
>
>I think that XML will include applications beyond 'browsers and typesetting
>systems' although these will be the commonest. MathML and CML will have
>chunks of material which contains whitespace not used primarily as part of
>text. Here's a simple example:
><MOL>
> <ATOMS>
>[HT]C H N Cl[CR][LF]
>[HT]O P Br[CR][LF]
> </ATOMS>
></MOL>
>where the whitespace is used (a) for visual effect and potential ease in
>editing (b) as a delimiter (within ATOMS) [HT]=tab, for example.
>
>What I am after here is a convention that I can state which instructs the
>processor how to treat this whitespace. ***I do not wish to have to devise
>a specific convention for CML***. I want to be able to indicate that the
>W/S after <MOL> is irrelevant, and that the whitespace in the ATOMS content
>is normalisable and used only as a delimiter of tokens.
>
>I expect that many other applications will use a similar approach, so I want
>to share the effort with them. Examples of metadata in XML have often been
>portrayed as prettyprinted and I expect that CML could use the same
>conventions.
>[BTW I think that there will be more human editing of XML files than is often
>assumed - and metadata is a good example. Prettyprinting is a useful tool
>in those cases.]
>
>I think that we can aim for a set of options that could be used by a
>post-parser processor. Different applications (**or document authors**)
>could choose between them. Examples might be:
> - normaliseCRLF (Neil's Rule 1)
> - discardAllWS
> - normaliseToSingleSpace
>
>An author or application could then state which of these it was using.
>
>It might be that in the first instance we can only agree on (say) Rule 1, but
>this would be a useful start.
>
>>
>> > I agree with Liam - I didn't understand 'blockness'. I also think that
>> > whatever is done here has to be independent of stylesheets and DTDs. The
>> > average hacker like me simply won't understand the subtleties.
>>
>> I am merely trying to distinguish in-line elements from other
>> elements. An in-line element implies no line-breaks above or below
>> it. A 'Block' element therefore DOES imply such a break. I do not use
>> the terms element and mixed content here, because it is not quite the
>> same thing. As I have said before, a Para element is a 'block'
>> element, and has mixed content, but an Emph element is an 'in-line'
>> element, yet also has mixed content. All style sheets, including
>> CSS, understand the concept of in-line and block elements. Any
>> whitespace surrounding a block element MUST be irrelevant.
>
>It looks like the context, rather than the content is the significant
>feature.
>
>>
>> Liam raised the issue of a half-way element type, such as a header
>> which implies a line-break before it, but not after, so that
>> following text will appear on the same line. This one is tricky.
>> Suggestions anybody?
>
<FormattingSpecificDiscussionOfWhitespace>
The idea of a "half-way" element type just highlights the fact that element
nesting does not necessarily map nicely to block/paragraph structure in
formatting applications. I like to say that block formatting _transcends_
element nesting -- there is no direct mapping.
In my experience, a pair of lower-level concepts (e.g. "block start" and
"block end") has proven quite useful. In the current discussion, the
"blockness" of the elements might be described as follows:
          "block start"   "block end"
-----------------------------------------
Para      Yes             Yes
Emph      No              No
Hn        Yes             No
where:
"block start" - means start a block at the start of the element
"block end" - means end a block at the end of the element
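
The table above can be sketched as data that a formatter consults. This is only an illustration: the event list, the BLOCKNESS table layout and the render function are hypothetical, and "start a block" is assumed to mean nothing more than "begin on a new output line".

```python
from typing import NamedTuple


class Blockness(NamedTuple):
    block_start: bool  # start a block at the start of the element
    block_end: bool    # end a block at the end of the element


# The three rows of the table above (Hn stands for any heading element).
BLOCKNESS = {
    "Para": Blockness(True, True),
    "Emph": Blockness(False, False),
    "Hn": Blockness(True, False),
}


def render(events):
    """Turn (kind, value) events into text, breaking lines at block edges.

    kind is "start", "end" or "text"; value is an element name or a string.
    Assumption: a block boundary is rendered as a single line break.
    """
    out = []

    def break_line():
        # Only break if something has been written and we are mid-line.
        if out and not out[-1].endswith("\n"):
            out.append("\n")

    for kind, value in events:
        if kind == "text":
            out.append(value)
        elif kind == "start":
            if BLOCKNESS[value].block_start:
                break_line()
        elif kind == "end":
            if BLOCKNESS[value].block_end:
                break_line()
    return "".join(out)
```

With this table a heading starts a new line but the text that follows it runs on, which is the "half-way" behaviour Liam asked about.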
</FormattingSpecificDiscussionOfWhitespace>
<GeneralDiscussionOfWhitespace>
A notation for describing whitespace handling must communicate the notion
that whitespace processing is modal, and provide words for each mode and
phrases for the transitions.
Let's consider Peter's tentative rules:
> - normaliseCRLF (Neil's Rule 1)
Please correct me if I am wrong, but this looks like a document-wide
setting whose behaviour/interpretation isn't affected by the application
type. A simple on/off PI setting could be used to set this.
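
A rough sketch of what such a document-wide setting might do. Neil's Rule 1 is not quoted in this message, so the reading used here (every CR LF pair and every lone CR becomes a single LF) is an assumption, and the function name is hypothetical.

```python
import re


def normalise_crlf(text: str) -> str:
    """Map CR LF pairs and lone CR characters to a single LF.

    Assumption: this is what "normaliseCRLF" is meant to do; the rule
    itself is not spelled out in the message.
    """
    return re.sub(r"\r\n?", "\n", text)
```

Because the result does not depend on which element the text sits in, a single on/off switch for the whole document would be enough to drive it.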
The rest of the rules, though, could be applied on a per-element basis:
> - discardAllWS
> - normaliseToSingleSpace
I would add:
- keepAllWS
(I haven't read every word of every post in this thread. Has this third one
been discarded as a reasonable option? Even if it has, the rest of my
discussion here isn't affected)
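
To make the three modes concrete, here is a minimal sketch applied to character data. The exact semantics are assumptions, not something settled in this thread: discardAllWS is taken to drop every whitespace character, and normaliseToSingleSpace is taken to collapse each run of whitespace to one space and trim both ends. The function names are hypothetical.

```python
import re

# Whitespace is assumed to be space, tab, CR and LF.
WS_RUN = re.compile(r"[ \t\r\n]+")


def keep_all_ws(text: str) -> str:
    """keepAllWS: pass the text through untouched."""
    return text


def normalise_to_single_space(text: str) -> str:
    """normaliseToSingleSpace: one space per run, ends trimmed (assumed)."""
    return WS_RUN.sub(" ", text).strip(" ")


def discard_all_ws(text: str) -> str:
    """discardAllWS: remove every whitespace character (assumed)."""
    return WS_RUN.sub("", text)


# The content of Peter's ATOMS element, with [HT], [CR] and [LF] spelled out.
atoms = "\tC H N Cl\r\n\tO P Br\r\n"
tokens = normalise_to_single_space(atoms).split(" ")
```

Under these assumptions normaliseToSingleSpace leaves the seven atom symbols cleanly delimited, whereas discardAllWS would run them together, so the choice of mode matters for content like ATOMS.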
Assuming that the three mutually exclusive rules (or _modes_) can be
applied to any element, how can we specify this?
Would being able to specify one of the three modes on a per-element basis
be powerful enough? If we used PIs to do this then some HTML tags, for
example, might be listed as follows (just a hypothetical notation example,
_not_ a final suggestion for notation):
<?XML-SPACE-DISCARD HTML, HEAD, BODY, ... ?>
<?XML-SPACE-COLLAPSE TITLE, P, H1, H2, ... ?>
<?XML-SPACE-KEEP PRE, XMP, LISTING, ... ?>
Notes:
- HTML applications could just imply these rules.
- Any elements that aren't listed would just use the current mode, which
depends on the context.
- If the desired whitespace mode depends on something other than the
current element (an attribute, say) then this mechanism won't be powerful
enough.
- Specifying the whitespace mode on a per-element basis should make this
technique well-suited to architectural forms, though.
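
The notes above can be sketched as a small lookup. Everything here is hypothetical and follows the example notation only: the PI names, the parsing, and the choice of "keep" as the mode in force before any listed element is seen are all assumptions.

```python
import re

# Matches the hypothetical PIs used in the example notation above.
PI_PATTERN = re.compile(r"<\?XML-SPACE-(DISCARD|COLLAPSE|KEEP)\s+(.*?)\s*\?>")


def parse_space_pis(source: str) -> dict:
    """Collect an element-name -> mode table from the PIs in source."""
    modes = {}
    for mode, names in PI_PATTERN.findall(source):
        for name in names.split(","):
            name = name.strip()
            if name:
                modes[name] = mode.lower()
    return modes


def effective_mode(path, modes, default="keep"):
    """Return the mode in force for the innermost element of path.

    path lists element names from the root down. An unlisted element
    inherits the current mode from its context; default is an assumed
    starting mode for documents whose root is not listed.
    """
    mode = default
    for name in path:
        mode = modes.get(name, mode)
    return mode


pis = """<?XML-SPACE-DISCARD HTML, HEAD, BODY ?>
<?XML-SPACE-COLLAPSE TITLE, P, H1, H2 ?>
<?XML-SPACE-KEEP PRE, XMP, LISTING ?>"""
modes = parse_space_pis(pis)
```

For instance, effective_mode(["HTML", "BODY", "P", "EM"], modes) gives "collapse": EM is not listed, so it takes the mode of the enclosing P. A mode that depends on an attribute value could not be expressed in a name-only table like this one.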
</GeneralDiscussionOfWhitespace>
- Russ
PS - Should whitespace be blacklisted? ;-)
xml-dev: A list for W3C XML Developers
Archived as: http://www.lists.ic.ac.uk/hypermail/xml-dev/
To unsubscribe, send to majordomo at ic.ac.uk the following message;
unsubscribe xml-dev
List coordinator, Henry Rzepa (rzepa at ic.ac.uk)