Specification Questions
Neil Bradley
neil at bradley.co.uk
Tue Aug 5 11:49:56 BST 1997
Reply-to: Peter at ursus.demon.co.uk (Peter Murray-Rust)
> Some additional - hopefully constructive - thoughts on whitespace.
>
> The XML-lang spec does not ( and I suspect will not) give detailed guidance
> on how whitespace will be managed. My impression is that it is up to
> implementers and/or groups like this to come up with particular solutions.
> My worry is that these will be inconsistent and not inter-operable.
I agree totally. This was my original concern.
> ***
> Therefore I propose that those on XML-DEV who care about this problem come
> up with some guidelines for implementers.
> ***
I very much hope this happens.
> XML does NOT treat whitespace like SGML and does NOT behave like HTML
> (although it can be configured to do so). As far as I see them, the rules
> are:
>
> 'All characters that are not markup are passed to the application'. (This
> is independent of any value of XML-SPACE (see below), processing instructions,
> stylesheets, etc.) These characters include HT, CR, LF, SP, and probably
> a number of other Unicode 'whitespace' characters. What the application
> does with them is *undefined* in XML-lang.
>
> Note that this means that CR and LF are passed as separate characters. No
> normalisation takes place. Therefore
>
> Line one\n\rline two
>
> is different from
>
> Line one\nline two
>
> even if they are visually similar on various text editors/displays, etc.
> (My impression was that SGML normalised these two strings to the same
> ESIS output - is that right?).
>
> This means that the author/processor 'contract' has to be aware of this.
I think all applications should be expected to treat either character,
or both in sequence, as a line-end signal, so that platform
dependencies can be eliminated. If there is no good reason to omit
this task from the XML processor itself, I think it should be done
there.
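To make the suggestion concrete, here is a minimal Python sketch of that
normalisation, assuming it runs before the application sees the text.
The name normalise_line_ends is hypothetical, and nothing in XML-lang
currently mandates this behaviour.

```python
import re

# One line end is CR LF, LF CR, a lone CR, or a lone LF.
# The two-character forms come first so a pair is consumed as one line end.
LINE_END = re.compile(r"\r\n|\n\r|\r|\n")

def normalise_line_ends(text):
    """Map every line-end sequence to a single LF (hypothetical helper)."""
    return LINE_END.sub("\n", text)

a = normalise_line_ends("Line one\n\rline two")
b = normalise_line_ends("Line one\nline two")
print(a == b)
```

With this in place, the two strings quoted above compare equal once
they reach the application.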
> *** In some cases the document author and the application author are both
> aware of this problem and so the whitespace characters inserted by the
> author will be processed in the way that they expect. However, in most cases
> I suspect this will NOT be true and that authors will inadvertently create
> documents that are processed differently ***
>
> XML provides an attribute XML-SPACE (local to an element BUT inherited by
> its children) which can have three values:
> - #IMPLIED (no signals about whitespace handling)
> - PRESERVE (applications preserve all the whitespace)
> - DEFAULT (the *application's* default white-space processing modes
> are acceptable for this element).
>
> PRESERVE seems clear. All whitespace is passed to the application. The
> others seem to be dangerous unless there are some general conventions.
> If possible, we should propose a *general* default mechanism for whitespace
> handling for XML-SPACE="DEFAULT". If everyone adopts this, it will greatly
> reduce this problem. Is this a reasonable strategy?
I believe so. In addition, can we not put 'XML-SPACE
(PRESERVE|IMPLIED) "PRESERVE"' in the attribute declaration for an
element which will always have preserved content? It is common
practice for a DTD to have some kind of pre-formatted element, such
as HTML's '<pre>'.
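As a rough sketch of how an application might resolve the mode for each
element, assuming the attribute is inherited by children as described
above: the DTD_DEFAULTS table, the element names DOC, PARA, PRE and CODE,
and the function effective_modes are all hypothetical. The table stands
in for a defaulted attribute declaration on a pre-formatted element.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for attribute defaults declared in a DTD.
DTD_DEFAULTS = {"PRE": "PRESERVE"}

def effective_modes(elem, inherited=None, out=None):
    """Return [(tag, mode)] in document order.

    An explicit XML-SPACE attribute wins, then the declared default,
    then whatever the parent passed down. None means no signal.
    """
    if out is None:
        out = []
    mode = elem.get("XML-SPACE") or DTD_DEFAULTS.get(elem.tag) or inherited
    out.append((elem.tag, mode))
    for child in elem:
        effective_modes(child, mode, out)
    return out

# Build a small tree by hand rather than parsing a document.
doc = ET.Element("DOC", {"XML-SPACE": "DEFAULT"})
para = ET.SubElement(doc, "PARA")
pre = ET.SubElement(doc, "PRE")
code = ET.SubElement(pre, "CODE")

modes = effective_modes(doc)
for tag, mode in modes:
    print(tag, mode)
```

Here PARA inherits DEFAULT from DOC, while PRE and everything inside it
resolve to PRESERVE without the author writing the attribute.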
> If so, we can propose that the DEFAULT mode for any whitespace processing is
> something along the lines (similar to HTML?). Within an element with
> XML-SPACE="DEFAULT"
>
> All whitespace sequences are mapped into a single space character.
Agreed.
> All whitespace pseudo-elements are ignored (i.e. whitespace between markup)
Ummm. What about 'the <b>bold</b> <i>italic</i> styles...'?
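A small Python sketch of the problem, assuming 'ignored' means that
whitespace-only text between markup is dropped entirely. The wrapping
<p> element and the function text_of are my own illustration, not part
of the proposal.

```python
import xml.etree.ElementTree as ET

def text_of(xml, ignore_whitespace_only):
    """Concatenate the character data of a fragment, in document order."""
    chunks = ET.fromstring(xml).itertext()
    if ignore_whitespace_only:
        # Drop chunks that consist of nothing but whitespace.
        chunks = (c for c in chunks if c.strip())
    return "".join(chunks)

sample = "<p>the <b>bold</b> <i>italic</i> styles...</p>"
kept = text_of(sample, False)
dropped = text_of(sample, True)
print(repr(kept))
print(repr(dropped))
```

The second result runs 'bold' and 'italic' together, because the single
space between </b> and <i> is a whitespace-only chunk.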
> All leading and trailing whitespace in #PCDATA is ignored.
I think all applications should remove leading and trailing CR and LF
characters in a mixed content element. But not SP or HT, as this would
be undesirable in the following fragment:
A<emph> bold </emph>word.
Although an unusual layout, some people may use it, and it would be
unfortunate if it resulted in 'Aboldword'.
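The difference can be sketched in Python as follows. The three strings
are the #PCDATA chunks of the fragment above, and the function names are
hypothetical.

```python
def trim_line_ends_only(chunk):
    """Remove leading and trailing CR and LF, but keep SP and HT."""
    return chunk.strip("\r\n")

def trim_all_whitespace(chunk):
    """Remove leading and trailing CR, LF, SP and HT."""
    return chunk.strip("\r\n \t")

# Character data of: A<emph> bold </emph>word.
chunks = ["A", " bold ", "word."]

gentle = "".join(trim_line_ends_only(c) for c in chunks)
harsh = "".join(trim_all_whitespace(c) for c in chunks)
print(repr(gentle))
print(repr(harsh))
```

Trimming only CR and LF keeps the words apart; trimming SP and HT as
well joins them.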
> Example:
> <FOO XML-SPACE="DEFAULT">
> <BAR> this
> <!-- comment -->
> is<!-- comment -->a
DID YOU INTEND A SPACE SOMEWHERE BETWEEN 'is' AND 'a'?
> bar
> </BAR></FOO>
>
> folds to:
> <FOO XML-SPACE="DEFAULT"><BAR>this is a bar</BAR></FOO>
>
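For what it is worth, here is one reading of the proposed folding as a
Python sketch. It assumes comments are removed first, then each run of
SP, HT, CR and LF becomes one space, then the ends are trimmed. The
function fold and the string bar_content (the content of BAR as it
appears in the example) are my own illustration.

```python
import re

WS_RUN = re.compile(r"[ \t\r\n]+")
COMMENT = re.compile(r"<!--.*?-->", re.DOTALL)

def fold(pcdata):
    """Drop comments, map each whitespace run to one space, trim both ends."""
    without_comments = COMMENT.sub("", pcdata)
    return WS_RUN.sub(" ", without_comments).strip(" ")

# Content of <BAR> exactly as laid out in the example.
bar_content = " this\n<!-- comment -->\nis<!-- comment -->a\nbar\n"
folded = fold(bar_content)
print(repr(folded))

# The same content with a space typed after 'is'.
spaced = fold(" this\n<!-- comment -->\nis <!-- comment -->a\nbar\n")
print(repr(spaced))
```

Under these assumptions the example as typed folds to 'this isa bar',
and only the variant with an explicit space gives 'this is a bar'.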
> I think it's important to address this, since otherwise I predict we shall
> have considerable confusion, especially when implementors of authoring or
> processing software have not thought this through completely.
Again, I agree, and I think it will be possible to achieve this with
a bit more discussion in this forum.
> Peter Murray-Rust, domestic net connection
Neil.
-----------------------------------------------
Neil Bradley - Author of The Concise SGML Companion.
neil at bradley.co.uk
www.bradley.co.uk