Whitespace rules (v2)

Peter Murray-Rust Peter at ursus.demon.co.uk
Sun Aug 17 13:50:28 BST 1997

In message <199708170743.IAA28970 at andromeda.ndirect.co.uk> "Neil Bradley" writes:
> Peter Murray-Rust wrote:
> > If a set of rules *does* emerge, then how can we generally inform an application
> > that it should take them as DEFAULT?  I assume this is through a PI:
> I was hoping that relevant applications (mainly browsers and 
> typesetting systems) will ALWAYS assume the rules that are finally 
> determined, except where preserved content (or some other set of 
> rules) is explicitly actioned.

I think - along with TimB - that it is unrealistic to come up with a single
set of rules that will serve every application.  There was an enormous amount 
of discussion on the XML group last year and I take it as axiomatic that we
cannot produce a set of rules which everyone agrees are:
	- simple to state
	- unambiguous
	- intuitive and easy to learn
	- universal (i.e. cover every situation)

I think that XML will include applications beyond 'browsers and typesetting 
systems' although these will be the commonest. MathML and CML will have 
chunks of material which contain whitespace not used primarily as part of
text.  Here's a simple example:
[HT]C H N    Cl[CR][LF]
[HT]O P Br[CR][LF]
where the whitespace is used (a) for visual effect and potential ease in 
editing and (b) as a delimiter (within ATOMS).  [HT]=tab, for example. 

What I am after here is a convention that I can state which instructs the 
processor how to treat this whitespace.  ***I do not wish to have to devise
a specific convention for CML***.  I want to be able to indicate that 
the W/S after <MOL> is irrelevant, and that the whitespace in the ATOMS content 
is normalisable and used only as a delimiter of tokens.
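
As a minimal sketch of the "delimiter only" treatment described above, the
ATOMS content from the example can be reduced to a list of tokens.  The
function name atoms_tokens and the literal string are hypothetical, chosen
only for illustration; nothing here is part of CML or of any agreed rule.

```python
# Hypothetical raw content of an ATOMS element, as in the example above.
# "\t" stands for [HT] and "\r\n" stands for [CR][LF].
ATOMS_CONTENT = "\tC H N    Cl\r\n\tO P Br\r\n"


def atoms_tokens(content):
    """Treat every run of whitespace purely as a delimiter between tokens."""
    # str.split() with no argument splits on any run of whitespace
    # (spaces, tabs, CR, LF) and drops leading and trailing whitespace.
    return content.split()


print(atoms_tokens(ATOMS_CONTENT))
# ['C', 'H', 'N', 'Cl', 'O', 'P', 'Br']
```

The layout whitespace (tabs, repeated spaces, line ends) disappears and only
the token boundaries survive, which is the behaviour asked for here.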

I expect that many other applications will use a similar approach, so I want
to share the effort with them.  Examples of metadata in XML have often been 
portrayed as prettyprinted and I expect that CML could use the same conventions.
[BTW I think that there will be more human editing of XML files than is often
assumed - and metadata is a good example. Prettyprinting is a useful tool
in those cases.]

I think that we can aim for a set of options that could be used by a post-parser
processor. Different applications (**or document authors**) could choose between
them. Examples might be:
	- normaliseCRLF (Neil's Rule 1)
	- discardAllWS
	- normaliseToSingleSpace

An author or application could then state which of these it was using. 
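
The options above could be sketched as small post-parser functions that an
application selects by name.  This is only an illustration under stated
assumptions: the behaviour given to each option is inferred from its name
(Neil's Rule 1 is not restated in this message), and WS_OPTIONS and
apply_ws_options are hypothetical names.

```python
import re


def normalise_crlf(text):
    # Assumed meaning: map CR LF pairs and bare CR to a single LF.
    return text.replace("\r\n", "\n").replace("\r", "\n")


def discard_all_ws(text):
    # Assumed meaning: remove every whitespace character.
    return "".join(text.split())


def normalise_to_single_space(text):
    # Assumed meaning: collapse each run of whitespace to one space.
    return re.sub(r"\s+", " ", text)


# Hypothetical registry keyed by the option names used in the list above.
WS_OPTIONS = {
    "normaliseCRLF": normalise_crlf,
    "discardAllWS": discard_all_ws,
    "normaliseToSingleSpace": normalise_to_single_space,
}


def apply_ws_options(text, option_names):
    """Apply the chosen options, in order, to character data from the parser."""
    for name in option_names:
        text = WS_OPTIONS[name](text)
    return text


print(repr(apply_ws_options("C H N    Cl\r\nO P Br\r\n", ["normaliseCRLF"])))
print(repr(apply_ws_options("C H N    Cl\r\nO P Br\r\n", ["normaliseToSingleSpace"])))
```

Keeping each option as an independent function means an application can
adopt one of them without being committed to the others.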

It might be that in the first instance we can only agree on (say) Rule 1, but
this would be a useful start.

> > I agree with Liam - I didn't understand 'blockness'.  I also think that whatever
> > is done here has to be independent of stylesheets and DTDs.  The average hacker
> > like me simply won't understand the subtleties.
> I am merely trying to distinguish in-line elements from other 
> elements. An in-line element implies no line-breaks above or below 
> it. A 'Block' element therefore DOES imply such a break. I do not use 
> the terms element and mixed content here, because it is not quite the 
> same thing. As I have said before, a Para element is a 'block' 
> element, and has mixed content, but an Emph element is an 'in-line' 
> element, yet also has mixed content. All style sheets, including 
> CSS, understand the concept of in-line and block elements. Any 
> whitespace surrounding a block element MUST be irrelevant.

It looks like the context, rather than the content is the significant

> Liam raised the issue of a half-way element type, such as a header 
> which implies a line-break before it, but not after, so that 
> following text will appear on the same line. This one is tricky. 
> Suggestions anybody?

> > I would assume that this processing takes place in the application, not the
> > parser.  How/whether comments are passed to the application is part of the
> > parser API.  I assume that at this stage the comment is recognised as a single
> > chunk which can be deleted with/out surrounding whitespace as required.
> As I say at the top of the rules, ALL these rules are applied by the 
> application, not the XML processor.

Agreed.  This discussion is about how the application behaves.  The question
is whether we can give it some generic instructions.  I'd delete the word
'ALL' if it suggests that you either take all the rules or none.

> > This one is tough.  Please criticise my current view :-).  SGML documents seem
> > to use markup as structure in some places (e.g. OL/LI in HTML) or
> > event streams (e.g. EM, B in HTML). Authors/readers expect different processing
> > modes from these types. The example above is best treated as structuring
> > markup (P) containing an event stream (#PCDATA|EM)* [sorry for abbreviations].
> > So we have to indicate to the processor that P is structuring and that 
> > whitespace after <P> or before </P> is irrelevant, and that its content is an 
> > event stream where all whitespace is normalised to a single space (cf HTML.)
> > Therefore can we have something like this:
> > <Paragraph>
> > This is<Emphasis>very</Emphasis>strange.
> > </Paragraph>
> I think that, ultimately, some combinations of markup will always 
> break whatever rules we come up with. We must ensure that only 
> obscure, non-intuitive combinations do this, then just shout from 
> the rooftops that these combinations are not to be used.

It is clear that a set of guidelines and examples must accompany these rules.
If necessary we may have to educate people to write XML like:
(although I think if we have to go to this stage we have lost 95% of potential
XML webhackers).

> My concern was to address existing text files, where hyphens are 
> often used in this way. Maybe I am over-estimating this problem.

I don't think we need to address the conversion of existing non-XML files to
XML in this discussion. The question is what the application does to the
output of the XML parser.


WS is probably among the commonest problems that most newcomers to XML will 
face, so it's well worth trying to develop guidelines.


Peter Murray-Rust, domestic net connection
Virtual School of Molecular Sciences

xml-dev: A list for W3C XML Developers
Archived as: http://www.lists.ic.ac.uk/hypermail/xml-dev/
To unsubscribe, send to majordomo at ic.ac.uk the following message;
unsubscribe xml-dev
List coordinator, Henry Rzepa (rzepa at ic.ac.uk)
