White Space

Peter Murray-Rust Peter at ursus.demon.co.uk
Sat Apr 12 18:07:32 BST 1997

In message < at jclark.com> James Clark writes:
> One might think so, but since C has mixed content and no white-space in
> mixed content is automatically ignored, the white-space following the D and
> E elements will be data and hence constitute pseudo-elements.  Thus
> ID(F)PREVIOUS(-2) will actually designate E.

Having read this (helpful) reply, which pointed out a problem I had overlooked, I've 
gone back to the XML-LANG spec (2.8) to clarify my thoughts and failed to do
so :-(.  Regardless of how desirable the present policy is (and I'm sympathetic
to those trying to formulate a policy) I can't put a precise meaning on 2.8.
Please forgive my normal blundering through this.

para 2:  'An XML processor which does not read the DTD must always pass all
characters  that are not markup through to the application'.  The implication
is that the processor (== 'parser' at this stage?) must recognise mixed
content, so that [without a DTD]:
is mixed content and contains 3 elements (the first and third being 
pseudoelements consisting of a newline).  [My naive understanding of SGML is
that there would only be one element, since start and end newlines are ignored
in mixed content.  Since all SGML applications require a DTD, SGML and XML
give 'different' results here.]
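
A minimal sketch of the no-DTD case, assuming Python's xml.dom.minidom as a
stand-in for an XML processor that does not read a DTD. The document string
and the element names C and D are hypothetical, chosen only to match the
description above; minidom reports the newlines as text nodes rather than
"pseudoelements".

```python
from xml.dom import minidom

# Hypothetical document: C and D are illustrative names, not taken from the spec.
DOC = "<C>\n<D/>\n</C>"

def child_summary(xml_text):
    """Return (kind, value) pairs for the children of the root element."""
    root = minidom.parseString(xml_text).documentElement
    summary = []
    for node in root.childNodes:
        if node.nodeType == node.TEXT_NODE:
            # Character data that is not markup is passed through untouched.
            summary.append(("text", node.data))
        elif node.nodeType == node.ELEMENT_NODE:
            summary.append(("element", node.tagName))
    return summary

children = child_summary(DOC)
print(children)
```

With no DTD to consult, the root has three children: a newline, the D
element, and another newline.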

'An XML processor which *does* read the DTD must always pass all characters
in mixed content that are not markup through to the application.'  [Presumably
the newlines are not markup?]  'It may also **choose** to pass white space
occurring in element content to the application.  If it does so, it must
signal to the application that...' [and the rest of the sentence appears to have
been truncated in the public drafts; please can we have it back :-)]

Presumably this latter occurs if something like:
has been included, making it clear that C does not contain mixed content.
My reading is that the *parser* can decide (choose) what to do
with this whitespace, so that different *parsers* can give different results
here.  The *application* (e.g. browser) has to be prepared for differing 
inputs from the same document according to the parser used...
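
A rough sketch of how two parsers could legitimately differ here. The set
ELEMENT_CONTENT is a hypothetical stand-in for what a DTD declaration would
tell the parser (that C has element content), and the "ignorable-ws" label is
only an illustrative way of flagging such white space to the application; the
draft's own wording on the signal is truncated, so nothing here should be read
as what the spec requires.

```python
from xml.dom import minidom

# Hypothetical document and names, for illustration only.
DOC = "<C>\n<D/>\n<E/>\n</C>"

# Assumption: stands in for a DTD saying that C has element content.
ELEMENT_CONTENT = {"C"}

def children_seen_by_application(xml_text, pass_whitespace):
    """Model a parser that may or may not pass white space in element content."""
    root = minidom.parseString(xml_text).documentElement
    seen = []
    for node in root.childNodes:
        if node.nodeType == node.TEXT_NODE:
            ignorable = (root.tagName in ELEMENT_CONTENT
                         and node.data.strip() == "")
            if ignorable and not pass_whitespace:
                continue  # this parser chooses to drop the white space
            # Otherwise pass it on, flagged so the application can tell.
            seen.append(("ignorable-ws" if ignorable else "text", node.data))
        elif node.nodeType == node.ELEMENT_NODE:
            seen.append(("element", node.tagName))
    return seen

dropping = children_seen_by_application(DOC, pass_whitespace=False)
passing = children_seen_by_application(DOC, pass_whitespace=True)
print(len(dropping), len(passing))
```

The same document yields two children from one parser and five from the
other, which is the difference the application has to be prepared for.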

The treatment of DEFAULT|PRESERVE is that the parser simply passes 
this flag to the *application* but takes no special action itself, so that 
all parsers should behave identically.
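
A sketch of that division of labour, under stated assumptions: the attribute
name XML-SPACE is assumed (the text above names only the values DEFAULT and
PRESERVE), the document is hypothetical, and collapsing white space in the
default case is just one thing an application might choose to do. The parser
(minidom here) hands over the text and the flag unchanged; only the
application acts on it.

```python
from xml.dom import minidom

SPACE_ATTR = "XML-SPACE"  # assumed attribute name, not confirmed by the text above

# Hypothetical document: same text in both elements, only the flag differs.
DOC = '<C><D XML-SPACE="PRESERVE">  a  b  </D><E>  a  b  </E></C>'

def text_as_application_sees_it(element):
    """Application-side policy: honour PRESERVE, otherwise collapse white space."""
    text = "".join(n.data for n in element.childNodes
                   if n.nodeType == n.TEXT_NODE)
    if element.getAttribute(SPACE_ATTR) == "PRESERVE":
        return text
    return " ".join(text.split())

root = minidom.parseString(DOC).documentElement
preserved = text_as_application_sees_it(root.getElementsByTagName("D")[0])
defaulted = text_as_application_sees_it(root.getElementsByTagName("E")[0])
print(repr(preserved), repr(defaulted))
```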

Presumably a parser without a DTD has to create pseudoelements when it 
encounters characters that are not part of markup.  (Is the term pseudoelement
used in the spec?)  So according to whether the parser finds a DTD or not
it will create different numbers of elements/pseudoelements for the 
application.  It is under no obligation to tell the application how it arrived
at what it is passing to it :-)  So the occurrence of pseudoelements
consisting of newlines does not imply mixed content, since they may have
occurred in element content and the parser chose to pass them through.

My current hope would be that this is a problem which we could separate into
parser and application, and that parsers could hide some of the intricacies 
from the application developer (including those writing generic browsers like
me).  I'm not clear (a) whether this distinction is clear in the spec, or (b) 
whether current parser writers all agree on what should be done.


Peter Murray-Rust, domestic net connection
Virtual School of Molecular Sciences