SAX and whitespace (was Re: Problems with whitespace and msxml)

Peter Murray-Rust peter at ursus.demon.co.uk
Thu Jan 1 17:08:43 GMT 1998


[I think this discussion is another good reason why SAX is urgently needed]

At 09:57 01/01/98 -0500, David Megginson wrote:
> > >   An XML processor must always pass all characters in a document
> > >   that are not markup through to the application. A validating
> > >   XML processor must distinguish white space in element content
> > >   from other non-markup
>
>What the PR means to say here is that a DTD-driven XML parser has to
>treat whitespace in element content differently than whitespace in
>mixed content -- this, of course, has nothing to do with xml:space.
>If there is no DTD, then all element types are assumed to allow mixed
>content, so a DTD-driven XML parser ("validating XML processor") would
>report all whitespace as significant.

I would agree with this interpretation and prefer the phrase "DTD-driven
XML parser (?processor?)". I interpret this to mean: 
	"a processor which uses any DTD information given in the document, and
which uses it to do as much validation as it and the document are capable of."

However, having read the spec more carefully, I am having great difficulty
in deciding *where* it allows whitespace in element content. Take the
document:
<!ELEMENT FOO (BAR)>
<!ELEMENT BAR EMPTY>
...
<FOO>
  <BAR>
  </BAR>
</FOO>

My reading of the spec suggests that this is an *invalid* document. Please
show me where I have gone wrong...

FOO has declared element content [3.2.1]. "... elements of that type must
contain only child elements ***(no character data)*** [my asterisks]..."

for BAR:
[3.2] An element is valid if there is a declaration matching elementdecl
where the Name matches the element type and ...
	1. the declaration matches EMPTY and the element has ***no content***

the context of content is [39]
	STag content ETag   <!-- no S? -->
and its definition is: [43]
	(element | CharData | Reference | CDSect | PI | Comment)*

Again there is no place for whitespace.

Therefore I cannot see where (apart from [2.10] which raises the whitespace
question) whitespace can be defined as 'non-significant'. IOW whitespace
***in the content of an element*** is only formally allowed as CharData in
mixed content, and in mixed content it must be significant.
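To make the whitespace at issue concrete, here is a minimal sketch, assuming Python's standard-library xml.dom.minidom as a stand-in for a non-validating processor. It does not read the DTD and so says nothing about validity; it only shows which whitespace-only character data ends up inside FOO and BAR for the instance above. The names DOC and text_children are hypothetical, introduced for illustration.

```python
from xml.dom import minidom

# The instance from the example above, without its DTD
# (minidom is non-validating, so validity is not checked here).
DOC = "<FOO>\n  <BAR>\n  </BAR>\n</FOO>"


def text_children(xml_text):
    """Map each element name to the text-node data found directly inside it."""
    dom = minidom.parseString(xml_text)
    found = {}
    for el in dom.getElementsByTagName("*"):
        found[el.tagName] = [
            node.data for node in el.childNodes
            if node.nodeType == node.TEXT_NODE
        ]
    return found


if __name__ == "__main__":
    for name, chunks in text_children(DOC).items():
        print(name, [repr(c) for c in chunks])
```

Both FOO (declared with element content) and BAR (declared EMPTY) come back with whitespace-only text inside them, which is exactly the character data the productions quoted above seem to leave no room for.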

I am *sure* I've missed something here as the WG has debated this for ages,
but I can't see where.
>
>What should SAX do with ignorable whitespace?

Assuming that ignorable WS is found only in element content...

>
>1) Report it as a distinct event, like Ælfred does?
>2) Treat it as regular character data?
>3) Ignore it (as in regular SGML)?
>
>(1) seems to be what the PR requires.  Either (2) or (3) could cause
>strange results.

(3) is forbidden - it has to be passed through. I think it has to be (2)
and (1) simultaneously. IOW in event mode you must report that whitespace
(space, 3 tabs, one newline, 10 spaces) occurs "now"; in tree mode you
report "I have made you an element/node consisting of PCDATA, all
whitespace - it's up to you to keep/destroy it..."
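As a rough illustration of option (1) in event mode, the sketch below uses Python's standard-library xml.sax. It is only a sketch under stated assumptions: the ELEMENT_CONTENT set is a hypothetical stand-in for what a DTD-driven processor would learn from the declarations, and the WhitespaceReporter class, the report function and the event labels ("ignorable", "chars") are invented for this example, not part of any SAX proposal. Nothing is dropped: whitespace inside an element declared with element content is passed through, but tagged as a distinct event.

```python
import xml.sax

# Hypothetical stand-in for DTD information: element types declared
# with element content (children only, no #PCDATA).
ELEMENT_CONTENT = {"FOO"}


class WhitespaceReporter(xml.sax.ContentHandler):
    """Record events, tagging whitespace in element content separately."""

    def __init__(self, element_content):
        super().__init__()
        self.element_content = element_content
        self.stack = []   # names of the currently open elements
        self.events = []  # (kind, value) pairs in document order

    def startElement(self, name, attrs):
        self.stack.append(name)
        self.events.append(("start", name))

    def endElement(self, name):
        self.stack.pop()
        self.events.append(("end", name))

    def characters(self, data):
        parent = self.stack[-1] if self.stack else None
        if parent in self.element_content and data.strip() == "":
            # Passed through, but reported as a distinct event (option 1).
            self.events.append(("ignorable", data))
        else:
            # Ordinary character data, whitespace included (option 2).
            self.events.append(("chars", data))


def report(xml_text):
    """Parse xml_text and return the recorded list of events."""
    handler = WhitespaceReporter(ELEMENT_CONTENT)
    xml.sax.parseString(xml_text.encode("utf-8"), handler)
    return handler.events


if __name__ == "__main__":
    for event in report("<FOO>\n  <BAR/>\n</FOO>"):
        print(event)
```

The parser may deliver a run of whitespace in more than one chunk, so an application that cares about the exact run should concatenate adjacent events of the same kind. An application that wants option (2) can simply treat both kinds alike.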

	P.

Peter Murray-Rust, Director Virtual School of Molecular Sciences, domestic
net connection
VSMS http://www.nottingham.ac.uk/vsms, Virtual Hyperglossary
http://www.venus.co.uk/vhg

xml-dev: A list for W3C XML Developers. To post, mailto:xml-dev at ic.ac.uk
Archived as: http://www.lists.ic.ac.uk/hypermail/xml-dev/
To (un)subscribe, mailto:majordomo at ic.ac.uk the following message;
(un)subscribe xml-dev
To subscribe to the digests, mailto:majordomo at ic.ac.uk the following message;
subscribe xml-dev-digest
List coordinator, Henry Rzepa (mailto:rzepa at ic.ac.uk)



