Specification Questions

Sun Aug 3 15:58:14 BST 1997

In message <199708020838.JAA11135 at andromeda.ndirect.co.uk> "Neil Bradley" writes:
[...]

> <p>This is a long paragraph that is broken over two
> <!-- comment -->
> lines, with an implied space between 'two' and 'lines'.</p>
> 
> Is this interpreted as "two <!-- comment --> lines...", which reduces
> to "two   lines"?

Some additional - hopefully constructive - thoughts on whitespace.

The XML-lang spec does not (and I suspect will not) give detailed guidance
on how whitespace will be managed.  My impression is that it is up to 
implementers and/or groups like this to come up with particular solutions.
My worry is that these will be inconsistent and not inter-operable.

***
Therefore I propose that those on XML-DEV who care about this problem come
up with some guidelines for implementers. 
***

XML does NOT treat whitespace like SGML and does NOT behave like HTML 
(although it can be configured to do so).  As far as I see them, the rules
are:

'All characters that are not markup are passed to the application'.  (This
is independent of any value of XML-SPACE (see below), processing instructions,
stylesheets, etc.)  These characters include HT, CR, LF, SP, and probably
a number of other Unicode 'whitespace' characters.  What the application
does with them is *undefined* in XML-lang.

Note that this means that CR and LF are passed as separate characters. No
normalisation takes place.  Therefore:

Line one\n\rline two

is different from

Line one\nline two

even if they are visually similar on various text editors/displays, etc.
(My impression was that SGML normalised these two strings to the same 
ESIS output - is that right?).
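
A minimal sketch in Python of the point above, using the two strings exactly
as written. The collapse_line_ends helper is a hypothetical application-level
policy, not something XML-lang defines; it only shows that any such
normalisation would have to be agreed between author and processor.

```python
import re

a = "Line one\n\rline two"   # LF followed by CR, as written above
b = "Line one\nline two"     # LF only

def collapse_line_ends(text):
    # Hypothetical policy: any run of CR and/or LF becomes a single LF.
    return re.sub(r"[\r\n]+", "\n", text)

# As raw character data the two strings differ.
print(a == b)                                          # False

# Only after an agreed normalisation do they compare equal.
print(collapse_line_ends(a) == collapse_line_ends(b))  # True
```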

This means that the author/processor 'contract' has to be aware of this.

Note also that *all* line-ends are passed (even immediately before/after
markup) unlike SGML.  Therefore:
<FOO>
line one
</FOO>

and
<FOO>line one</FOO>
are different.

Note also that:
<FOO><BAR>baz</BAR></FOO>
is different from
<FOO>
<BAR>baz</BAR>
</FOO>

The latter contains two pseudo-elements which contain only whitespace
(line-end characters) and FOO therefore has three children.
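
The difference can be observed with any parser that hands all character data
to the application. The sketch below uses Python's xml.dom.minidom purely as a
stand-in for such a parser (an assumption; it is not an implementation of the
draft discussed here). The helper name child_summary is made up for the
illustration.

```python
from xml.dom.minidom import parseString

def child_summary(xml_text):
    # List (node name, node value) for each child of the root element.
    root = parseString(xml_text).documentElement
    return [(n.nodeName, n.nodeValue) for n in root.childNodes]

compact = "<FOO><BAR>baz</BAR></FOO>"
spread = "<FOO>\n<BAR>baz</BAR>\n</FOO>"

print(child_summary(compact))  # [('BAR', None)]
print(child_summary(spread))   # [('#text', '\n'), ('BAR', None), ('#text', '\n')]
```

The second document yields the three children described above: a line-end,
BAR, and another line-end.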

[Note that to make documents readable, the following trick can be used:
<FOO
><BAR
>baz</BAR
></FOO
>
since whitespace within the tag is ignored.  I do not think newcomers will
adopt this easily, and I suspect it can lead to errors in document editing.]
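
A quick check of the trick, again assuming Python's xml.dom.minidom as a
stand-in parser: the line-ends sit inside the tags, so no whitespace
pseudo-elements appear and FOO has a single child.

```python
from xml.dom.minidom import parseString

# Line breaks are placed inside the tags, never between them.
trick = "<FOO\n><BAR\n>baz</BAR\n></FOO\n>"

root = parseString(trick).documentElement
children = [n.nodeName for n in root.childNodes]
bar_text = root.firstChild.firstChild.data

print(children)   # ['BAR']
print(bar_text)   # baz
```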

*** In some cases the document author and the application author are both
aware of this problem and so the whitespace characters inserted by the
author will be processed in the way that they expect.  However, in most cases
I suspect this will NOT be true and that authors will inadvertently create
documents that are processed differently from what they intended. ***

XML provides an attribute XML-SPACE (local to an element BUT inherited by
its children) which can have three values:
	- #IMPLIED (no signals about whitespace handling)
	- PRESERVE (applications preserve all the whitespace)
	- DEFAULT (the *application's* default white-space processing modes
		are acceptable for this element).

PRESERVE seems clear.  All whitespace is passed to the application.  The 
others seem to be dangerous unless there are some general conventions. 

[Note also that XML parsers or processors have to ensure that children
inherit the XML-SPACE attributes of their parents.  Where does this get
done? In the parser (it is part of XML-lang)? Or in the processor, in which
case there is ample scope for inconsistent treatment?

Inheritance is already required in two places - XML-SPACE and XML-ATTRIBUTES
(XML-link). This is a generic mechanism and presumably should be implemented
in some package independently of the application.  Comments?]
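
One way such a generic package might resolve an inherited value is to walk up
from an element to the nearest ancestor that sets the attribute. This is only
a sketch under assumptions: the function name effective_xml_space is
hypothetical, Python's xml.dom.minidom stands in for the parser, and returning
None is used here to represent #IMPLIED.

```python
from xml.dom.minidom import parseString

def effective_xml_space(element, name="XML-SPACE"):
    # Walk from the element towards the root; the nearest setting wins.
    node = element
    while node is not None and node.nodeType == node.ELEMENT_NODE:
        if node.hasAttribute(name):
            return node.getAttribute(name)
        node = node.parentNode
    return None  # nothing set anywhere: treated as #IMPLIED in this sketch

doc = parseString(
    '<FOO XML-SPACE="PRESERVE"><BAR><BAZ XML-SPACE="DEFAULT"/></BAR></FOO>'
)
foo = doc.documentElement
bar = foo.firstChild
baz = bar.firstChild

print(effective_xml_space(bar))  # PRESERVE (inherited from FOO)
print(effective_xml_space(baz))  # DEFAULT  (set locally, overrides FOO)
```

The same walk would serve any other inherited attribute by passing a
different name.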

If possible, we should propose a *general* default mechanism for whitespace
handling for XML-SPACE="DEFAULT".  If everyone adopts this, it will greatly
reduce this problem.  Is this a reasonable strategy?

If so, we can propose that the DEFAULT mode for any whitespace processing is
something along the following lines (similar to HTML?).  Within an element
with XML-SPACE="DEFAULT":

All whitespace sequences are mapped into a single space character.
All whitespace pseudo-elements (i.e. whitespace between markup) are ignored.
All leading and trailing whitespace in #PCDATA is ignored.

Does this cover everything? Is it workable?

Example:
<FOO XML-SPACE="DEFAULT">
<BAR> this
<!-- comment -->
is<!-- comment -->a 
bar
</BAR></FOO>

folds to:
<FOO XML-SPACE="DEFAULT"><BAR>this is a bar</BAR></FOO>
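
The three proposed rules can be sketched as a small function applied to one
run of character data. This is an illustration under assumptions, not a
definitive algorithm: fold_default is a hypothetical name, whitespace is taken
to mean only HT, CR, LF and SP, and the sketch takes no position on how
comments inside a run of text should be treated.

```python
import re

_WS_RUN = re.compile(r"[ \t\r\n]+")   # HT, CR, LF, SP only (an assumption)

def fold_default(chunk):
    # Rule 1: every whitespace sequence becomes a single space.
    collapsed = _WS_RUN.sub(" ", chunk)
    # Rule 3: leading and trailing whitespace is dropped.
    stripped = collapsed.strip(" ")
    # Rule 2: a chunk that was only whitespace is ignored altogether.
    return stripped or None

print(fold_default(" this\nis a \nbar\n"))  # this is a bar
print(fold_default("\n"))                   # None
```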

[Note that the Xpointer STRING syntax and the use of pseudo-elements
works on the *raw* data  (i.e. all non-markup characters).  Therefore the
application has to have access to this - it has to maintain a PRESERVEd
version of the document as well as (say) displaying or transforming a
DEFAULTed document.]

I think it's important to address this, since otherwise I predict we shall
have considerable confusion, especially when implementors of authoring or
processing software have not thought this through completely.

	P.

-- 
Peter Murray-Rust, domestic net connection
Virtual School of Molecular Sciences
http://www.vsms.nottingham.ac.uk/
