Specification Questions

Neil Bradley neil at bradley.co.uk
Tue Aug 5 11:49:56 BST 1997



Reply-to:      Peter at ursus.demon.co.uk (Peter Murray-Rust)

> Some additional - hopefully constructive - thoughts on whitespace.
> 
> The XML-lang spec does not (and I suspect will not) give detailed guidance
> on how whitespace will be managed.  My impression is that it is up to 
> implementers and/or groups like this to come up with particular solutions.
> My worry is that these will be inconsistent and not inter-operable.

I agree totally. This was my original concern.

> ***
> Therefore I propose that those on XML-DEV who care about this problem come
> up with some guidelines for implementers. 
> ***

I very much hope this happens.

> XML does NOT treat whitespace like SGML and does NOT behave like HTML 
> (although it can be configured to do so).  As far as I see them, the rules
> are:
> 
> 'All characters that are not markup are passed to the application'.  (This
> is independent of any value of XML-SPACE (see below), processing instructions,
> stylesheets, etc.)  These characters include HT, CR, LF, SP, and probably
> a number of other Unicode 'whitespace' characters.  What the application
> does with them is *undefined* in XML-lang.
> 
> Note that this means that CR and LF are passed as separate characters. No
> normalisation takes place.  Therefore
> 
> Line one\n\rline two
> 
> is different from
> 
> Line one\nline two
> 
> even if they are visually similar on various text editors/displays, etc.
> (My impression was that SGML normalised these two strings to the same 
> ESIS output - is that right?).
> 
> This means that the author/processor 'contract' has to be aware of this.

I think all applications should be expected to treat either character,
or both in sequence, as a single line-end signal, so that platform
dependencies can be eliminated. If there is no good reason to omit
this task from the XML processor itself, I think it should be done
there.
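
A minimal sketch of that normalisation, in Python. The function name
is hypothetical, and treating LF followed by CR as one line end (as in
Peter's example) is my assumption, not anything taken from the XML-lang
draft:

```python
import re

# CR LF, LF CR, a lone CR, or a lone LF each count as one line end.
# The two-character pairs come first so they are consumed as a unit.
_LINE_END = re.compile(r"\r\n|\n\r|\r|\n")


def normalise_line_ends(text: str) -> str:
    """Replace every platform-specific line end with a single LF."""
    return _LINE_END.sub("\n", text)


# Peter's two strings compare equal once normalised.
a = normalise_line_ends("Line one\n\rline two")
b = normalise_line_ends("Line one\nline two")
print(a == b)
```

Doing this once in the processor would mean no application ever sees
the platform difference. In a file that mixes conventions the pairing
can be ambiguous, so this is only a starting point.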


> *** In some cases the document author and the application author are both
> aware of this problem and so the whitespace characters inserted by the
> author will be processed in the way that they expect.  However, in most cases
> I suspect this will NOT be true and that authors will inadvertently create
> documents that are processed differently ***
> 
> XML provides an attribute XML-SPACE (local to an element BUT inherited by
> its children) which can have three values:
> 	- #IMPLIED (no signals about whitespace handling)
> 	- PRESERVE (applications preserve all the whitespace)
> 	- DEFAULT (the *application's* default white-space processing modes
> 		are acceptable for this element).
> 
> PRESERVE seems clear.  All whitespace is passed to the application.  The 
> others seem to be dangerous unless there are some general conventions. 

> If possible, we should propose a *general* default mechanism for whitespace
> handling for XML-SPACE="DEFAULT".  If everyone adopts this, it will greatly
> reduce this problem.  Is this a reasonable strategy?

I believe so. In addition, can we not put 'XML-SPACE
(PRESERVE|IMPLIED) "PRESERVE"' in an attribute declaration for an
element which will always have preserved content? It is common
practice for a DTD to have some kind of pre-formatted element, such
as HTML's '<pre>'.
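
To show how a declared default might combine with the inheritance
Peter describes, here is a rough Python sketch. The Element class, the
declared_defaults table, and the lookup order (explicit attribute
first, then the declared default, then the parent) are all
assumptions for illustration; the draft does not spell this out:

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class Element:
    """Hypothetical minimal element: a name, attributes, and a parent link."""
    name: str
    attrs: Dict[str, str] = field(default_factory=dict)
    parent: Optional["Element"] = None


def effective_xml_space(elem: Element,
                        declared_defaults: Dict[str, str]) -> Optional[str]:
    """Return the XML-SPACE value in force for elem, or None if unspecified.

    declared_defaults maps element names to a default taken from the
    attribute declaration, e.g. {"PRE": "PRESERVE"}.
    """
    node = elem
    while node is not None:
        value = node.attrs.get("XML-SPACE")
        if value is None:
            value = declared_defaults.get(node.name)
        if value is not None:
            return value
        node = node.parent  # nothing set here, so inherit from the parent
    return None


defaults = {"PRE": "PRESERVE"}
doc = Element("DOC")
pre = Element("PRE", parent=doc)
code = Element("CODE", parent=pre)
para = Element("PARA", parent=doc)

print(effective_xml_space(code, defaults))  # inherited from PRE
print(effective_xml_space(para, defaults))  # no signal at all
```

With this arrangement an author never has to remember to set the
attribute on a pre-formatted element; the declaration does it.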


> If so, we can propose that the DEFAULT mode for any whitespace processing is
> something along the lines (similar to HTML?).  Within an element with
> XML-SPACE="DEFAULT"
> 

> All whitespace sequences are mapped into a single space character.

Agreed.

> All whitespace pseudo-elements are ignored (i.e. whitespace between markup)

Ummm. What about 'the <b>bold</b>  <i>italic</i> styles...'?
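
The problem can be sketched in Python. The list of text chunks is a
hypothetical stand-in for the character data a processor would hand
over around the <b> and <i> tags, and both function names are mine:

```python
import re

_WS_RUN = re.compile(r"[ \t\r\n]+")


def collapse(text: str) -> str:
    """Map every run of whitespace to a single space."""
    return _WS_RUN.sub(" ", text)


def render(chunks, drop_whitespace_only: bool) -> str:
    """Join the text chunks of a mixed-content element.

    If drop_whitespace_only is true, chunks consisting solely of
    whitespace (the 'pseudo-elements' between markup) are ignored.
    """
    out = []
    for chunk in chunks:
        if drop_whitespace_only and chunk.strip(" \t\r\n") == "":
            continue
        out.append(collapse(chunk))
    return "".join(out)


# Text around the tags in: the <b>bold</b>  <i>italic</i> styles...
chunks = ["the ", "bold", "  ", "italic", " styles..."]

print(render(chunks, drop_whitespace_only=True))
print(render(chunks, drop_whitespace_only=False))
```

Dropping the whitespace-only chunk runs 'bold' and 'italic' together,
whereas collapsing it to one space keeps the words apart.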

> All leading and trailing whitespace in #PCDATA is ignored.

I think all applications should remove leading and trailing CR and LF
characters in a mixed content element. But not SP or HT, as this would
be undesirable in the following fragment:

A<emph>  bold  </emph>word.

Although an unusual layout, some people may use it, and it would be
unfortunate if it resulted in 'Aboldword'.
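
As a small Python illustration of the difference (trim_line_ends is a
hypothetical name, and this only handles CR and LF at the very edges
of the text):

```python
def trim_line_ends(text: str) -> str:
    """Remove leading and trailing CR and LF, but keep SP and HT."""
    return text.strip("\r\n")


emph_text = "  bold  "

# Trimming only line ends leaves the spaces around the word intact.
print("A" + trim_line_ends(emph_text) + "word.")

# Trimming all whitespace runs the three words together.
print("A" + emph_text.strip() + "word.")
```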


> Example:
> <FOO XML-SPACE="DEFAULT">
> <BAR> this
> <!-- comment -->
> is<!-- comment -->a 
DID YOU INTEND A SPACE SOMEWHERE BETWEEN 'is' AND 'a'?
> bar
> </BAR></FOO>
> 
> folds to:
> <FOO XML-SPACE="DEFAULT"><BAR>this is a bar</BAR></FOO>
> 
> I think it's important to address this, since otherwise I predict we shall
> have considerable confusion, especially when implementors of authoring or
> processing software have not thought this through completely.

Again, I agree, and I think it will be possible to achieve this with 
a bit more discussion in this forum.
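
On my question about the example above: a quick Python sketch of the
proposed folding, applied to the content of BAR as typed. It assumes
that comments contribute no characters at all, and the fold function
is only my reading of the three proposed rules:

```python
import re


def fold(content: str) -> str:
    """Apply the proposed DEFAULT rules to the text content of one element."""
    # Comments are markup, so nothing is passed on in their place.
    without_comments = re.sub(r"<!--.*?-->", "", content, flags=re.DOTALL)
    # Every whitespace sequence becomes a single space.
    collapsed = re.sub(r"[ \t\r\n]+", " ", without_comments)
    # Leading and trailing whitespace is ignored.
    return collapsed.strip(" ")


# Content of <BAR> exactly as laid out in the example.
bar_content = " this\n<!-- comment -->\nis<!-- comment -->a \nbar\n"

print(fold(bar_content))  # prints "this isa bar"
```

Under these assumptions there is no whitespace between 'is' and 'a',
so the result is not 'this is a bar' unless a comment is itself
treated as a word break.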

> Peter Murray-Rust, domestic net connection

Neil.

-----------------------------------------------
Neil Bradley - Author of The Concise SGML Companion.
neil at bradley.co.uk
www.bradley.co.uk

xml-dev: A list for W3C XML Developers
Archived as: http://www.lists.ic.ac.uk/hypermail/xml-dev/
To unsubscribe, send to majordomo at ic.ac.uk the following message;
unsubscribe xml-dev
List coordinator, Henry Rzepa (rzepa at ic.ac.uk)



