XML vs the Dreaded Whitespace

Thu Dec 11 13:39:31 GMT 1997

At 06:41 11/12/97 -0500, David Megginson wrote:
>Peter Murray-Rust writes:
>
> > As a corollary: Is anyone testing the ESIS output of the current crop of
> > XML parsers (4 Java + nsgmls, I think)? Regardless of the whitespace model
> > or the value of xml:space they should all produce identical ESIS (right?)
> > If not, then one or more is wrong. And all applications should (IMO) be
> > prepared to work with ESIS which I think is isomorphous with a WF XML
> > document.
>
>There are quite a few more XML parsers out there, including at least
>one in TCL -- see 
>
>  http://www.sil.org/sgml/XML.html#xmlSoftware

Apologies to anyone I missed. I am a great fan of tcl and wrote costwish in
it to sit on top of Joe English's CoST...

>
>As for ESIS, there are some problems that we'd have to overcome first:

Are there? How does a WF document differ from the corresponding ESIS
stream? IOW if I do the transformation:
WF -> ESIS -> WF shouldn't I be able to recover the original?

>
>1) How should empty elements be represented?  Right now, Ælfred generates a
>   startElement event immediately followed by an endElement event.

Yes - and JUMBO is happy with that. As far as JUMBO is concerned
<FOO></FOO> and <FOO/> are processed in the same way and I will need a very
clear argument to convince me that it should do otherwise.
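
To make the equivalence concrete, here is a minimal sketch in Python (not
JUMBO or Ælfred code). It assumes the standard xml.sax module and borrows
the "(" and ")" prefixes of nsgmls-style ESIS output for start and end
events; the names EventLog and element_events are hypothetical.

```python
import xml.sax


class EventLog(xml.sax.ContentHandler):
    """Record element start/end events in an ESIS-like notation."""

    def __init__(self):
        super().__init__()
        self.events = []

    def startElement(self, name, attrs):
        # "(" marks an element start, as in nsgmls output.
        self.events.append("(" + name)

    def endElement(self, name):
        # ")" marks an element end.
        self.events.append(")" + name)


def element_events(document):
    """Parse a document string and return its list of element events."""
    handler = EventLog()
    xml.sax.parseString(document.encode("utf-8"), handler)
    return handler.events


print(element_events("<FOO></FOO>"))  # ['(FOO', ')FOO']
print(element_events("<FOO/>"))       # ['(FOO', ')FOO']
```

Both spellings yield a start event immediately followed by an end event, so
an application working from the event stream cannot tell them apart.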

>
>2) How should the XML declaration be represented?  Should it appear as
>   a processing instruction, or should it be ignored?

JUMBO regards it as a PI. I hang all PIs off the preceding ELEMENT (not
PCDATA). In that way the tree can be processed with these intact. JUMBO
understands namespace PIs, <?JUMBO ...?> PIs and will also store the
others. It's useful to store them in case one wants to compare trees. BTW -
although it is nowhere stated, most people seem to create PIs as name-value
pairs, and JUMBO expects this.
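
A rough sketch of reading PI data as name-value pairs, in Python. This is
an illustration only, not JUMBO's actual code: the function pi_pairs, the
regular expression, and the sample PI content are all hypothetical, and the
pattern accepts only quoted values.

```python
import re

# name = "value" or name = 'value'; the name syntax here is a simplification.
PAIR = re.compile(r"""([A-Za-z_][\w.\-]*)\s*=\s*(?:"([^"]*)"|'([^']*)')""")


def pi_pairs(data):
    """Return the name-value pairs found in the data part of a PI."""
    pairs = {}
    for name, double_quoted, single_quoted in PAIR.findall(data):
        # Exactly one of the two quoted groups is non-empty (or both empty).
        pairs[name] = double_quoted or single_quoted
    return pairs


# Data part of a made-up PI such as <?JUMBO colour="red" size='10'?>
print(pi_pairs("colour=\"red\" size='10'"))
```

A PI whose data does not follow this convention simply yields an empty
dictionary, so it can still be stored untouched alongside the others.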

>
>3) How should space in element content be handled?  According to the
>   spec, a DTD-aware parser should handle whitespace in element
>   content differently from whitespace in mixed content (Ælfred just
>   ignores whitespace in element content right now).

This is a critical area for the parser writers to agree on. I assume that
for the DTD-aware stuff there has to be a validating parser (i.e. one that
matches contentspec against element content). I am not sure what algorithms
are being used - JUMBO wants a java one for its birthday, please - but I
can imagine that with certain contentspecs they might get different answers.
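
One way to picture the distinction, as a sketch under stated assumptions:
the set element_content below is a hypothetical stand-in for what a
DTD-aware parser would derive from the contentspecs (the elements declared
to contain only child elements). Whitespace-only data directly inside those
elements is dropped; everything in mixed content is kept. The class and
function names are made up for illustration.

```python
import xml.sax


class WhitespaceFilter(xml.sax.ContentHandler):
    """Drop whitespace-only data inside elements that have element content."""

    def __init__(self, element_content):
        super().__init__()
        self.element_content = set(element_content)  # assumed known from a DTD
        self.stack = []    # names of the currently open elements
        self.events = []

    def startElement(self, name, attrs):
        self.stack.append(name)
        self.events.append("(" + name)

    def endElement(self, name):
        self.stack.pop()
        self.events.append(")" + name)

    def characters(self, content):
        parent = self.stack[-1]
        if parent in self.element_content and content.strip() == "":
            return  # whitespace in element content: not passed on as data
        # Escape newlines so that each event stays on one line.
        self.events.append("-" + content.replace("\n", "\\n"))


def filter_events(document, element_content):
    """Parse a document string and return the filtered event list."""
    handler = WhitespaceFilter(element_content)
    xml.sax.parseString(document.encode("utf-8"), handler)
    return handler.events


doc = "<LIST>\n  <ITEM>a <B>b</B> c</ITEM>\n</LIST>"
for event in filter_events(doc, {"LIST"}):
    print(event)
```

With LIST declared as element content, the indentation around ITEM
disappears from the stream while the spaces inside ITEM survive. With an
empty set (no DTD knowledge) all of the whitespace is reported as data,
which is exactly where two parsers could disagree.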

>
>4) DTD-aware and non-DTD-aware parsers will handle whitespace in
>   attribute values differently.  Non-DTD-aware parsers will treat all
>   attributes as CDATA, but DTD-aware parsers will treat tokenised
>   attributes specially, by stripping all leading and trailing
>   whitespace, and normalising internal whitespace to single spaces.

In this case presumably only the TYPE in the ATTLIST is needed.
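
The rule in point 4 can be sketched as a function of the declared type
alone. This is a simplified illustration in Python: the function name is
hypothetical, CDATA values are returned unchanged, and any run of
whitespace in a tokenised value is treated as a separator.

```python
def normalise_attribute(value, attr_type="CDATA"):
    """Normalise an attribute value according to its declared type."""
    if attr_type == "CDATA":
        # No DTD, or declared CDATA: left as-is in this sketch.
        return value
    # Tokenised types (ID, IDREF, NMTOKEN, NMTOKENS, ...): strip leading and
    # trailing whitespace and collapse internal runs to a single space.
    return " ".join(value.split())


print(repr(normalise_attribute("  a   b  ")))              # '  a   b  '
print(repr(normalise_attribute("  a   b  ", "NMTOKENS")))  # 'a b'
```

The same literal therefore reaches the application in two different forms
depending on whether the parser has seen the ATTLIST declaration.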

	P.

Peter Murray-Rust, Director Virtual School of Molecular Sciences, domestic
net connection
VSMS http://www.nottingham.ac.uk/vsms, Virtual Hyperglossary
http://www.venus.co.uk/vhg
