Whitespace

Mon Aug 25 14:02:00 BST 1997

Thanks Marcus,

In message <340152AF.F224A51C at allette.com.au> Marcus Carr writes:
> Apologies in advance to all those who have thought and fought over this
> issue for a long time, but as a self-confessed critic of the claim that
> "XML is SGML", I feel compelled to throw my hat into the ring.
> 
> As far as I can see, there are only two circumstances when whitespace is
> an issue - receiving an XML document or authoring one. Receiving, it
> doesn't matter if you have a DTD or not - the application can determine
> from a well formed document whether it should regard an element's
> content as MIXED or ELEMENT. It does involve parsing it, but only until
> it sees mixed content. If elements are assumed to be ELEMENT until

I may have misunderstood this, but the problem seems to be that we cannot
reliably determine this if authors use whitespace for pretty-printing. If
what you mean is 'non-whitespace MIXED content' (i.e. content which contains at
least one non-WS character) then I'm sympathetic. IOW, it is possible
to say 'treat anything with only WS content or element content as having element
content'.  This is exactly the sort of convention that I have been suggesting
people might propose. Whether it's workable depends on the reaction you get :-)
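
A minimal sketch of that convention, purely as an illustration and not as
anything the XML spec defines: every element is assumed to be ELEMENT until
some instance of it directly contains a non-whitespace character, at which
point it is reset to MIXED. The function name classify_content is
hypothetical, and Python's standard xml.etree.ElementTree stands in for a
non-validating parser.

```python
import xml.etree.ElementTree as ET

# The four characters XML treats as whitespace.
XML_WS = " \t\r\n"


def classify_content(xml_text):
    """Map each element name to 'MIXED' or 'ELEMENT' (hypothetical helper).

    An element name starts out as ELEMENT and is switched to MIXED the
    first time any instance of it directly contains a non-WS character.
    """
    root = ET.fromstring(xml_text)
    kinds = {}
    for elem in root.iter():
        # Character data directly inside elem: its leading text plus the
        # tail text that follows each of its child elements.
        chunks = [elem.text or ""] + [child.tail or "" for child in elem]
        if any(chunk.strip(XML_WS) for chunk in chunks):
            kinds[elem.tag] = "MIXED"
        else:
            # Whitespace-only or element-only content: leave as ELEMENT
            # unless an earlier instance already proved it MIXED.
            kinds.setdefault(elem.tag, "ELEMENT")
    return kinds
```

With a pretty-printed list such as <UL> containing <LI> items on separate
lines, UL stays ELEMENT because it holds only WS and child elements, while
any LI that carries text becomes MIXED.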

> proven otherwise, surely this wouldn't be a massive overhead. Authoring
> applications would be similar - the first time a tag contained mixed
> content, the application would reset the status of the element. The onus
> would from then on be on the application to assist the user in creating
> semantically correct documents, by such mechanisms as not allowing hard
> returns at element boundaries, in short, making significant whitespace
> look like significant whitespace.
> 
> MURATA Makoto wrote:
> 
> > Suppose that we have different kinds of tags for mixed-content
> > elements (e.g, <name:mixed> and </name:mixed>) and element-content
> > elements (e.g, <name:element> and </name:element>).  Then, even
> > non-validating parsers can tell element contents and mixed contents.
> > Does this help?

I think this approach does help, but it might be implementable through PIs
(see below).

> 
> It seems that the choices are either the current proposal that nobody
                                           ^^^^^^^^^^^^^^^^
I assume you mean the current XML spec.  

> seems to feel is entirely satisfactory, or suggestions such as the
> above, which would certainly work, but ultimately may involve as great
> an overhead as sending the DTD. It seems to me that we're throwing the
> baby out with the bathwater by ignoring a solution such as declaring at
> the start of the document how whitespace in elements should be handled.

I think that this is exactly what some members of this list are striving
for.  The spec requires them to use one or more of:
	- a specific markup element (e.g. <NEWLINE/>)
	- a stylesheet
	- a PI

> 
> I would also like to see DTDs sent to non-validating parsers, just so
> they could determine how to apply whitespace rules without necessarily
> having to do any structural parsing. If need be, two new types of

It seems axiomatic that there are already documents that do not conform
to any given DTD, so this isn't an option. It has been suggested that
content could be defined on a per-element basis, but at present parsers are
expected to use this to validate the whole document.

> declared content could be added, ELEMENT and MIXED. They might behave
> the same way as ANY, or the DTD could be constructed even more loosely,
> where only MIXED elements were declared and everything else was
> defaulted to ELEMENT. This would result in a small DTD sent only for the
> sake of making the application aware of how to deal with whitespace. If
> desirable, no DTD need be sent, but the application's performance may
> suffer marginally for it. This is in keeping with the idea that an
> application need not know how to deal with a document as it comes in. As
> far as I can see, much of the functionality in XML (such as linking)
> relies on a DTD, so it's not going to be foreign to most XML
> applications anyway.

This seems possible, but it requires a change to the XML spec.  XML WG
members read this list and if any of them think it's a good idea they might
take it up.  But my impression is that most take the view that David Durand
has posted - the spec is not capable of further refinement at this stage.

It may be possible to implement this through a PI. This could define which
elements had which type of content, e.g.
<?XML-WHITESPACE CONTENT="ELEMENT" ELEMENTS="UL OL"?>
<?XML-WHITESPACE CONTENT="MIXED" ELEMENTS="P EM B H1"?>
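
As a rough sketch of how an application might read such PIs (assuming the
XML-WHITESPACE target and the CONTENT/ELEMENTS pseudo-attributes shown
above, none of which are defined by the XML spec), the following uses
Python's standard xml.parsers.expat module. The function name
read_whitespace_pis is hypothetical.

```python
import re
import xml.parsers.expat

# Hypothetical PI target, taken from the example in this message.
PI_TARGET = "XML-WHITESPACE"
# Pseudo-attributes of the form NAME="value" inside the PI data.
PSEUDO_ATTR = re.compile(r'([A-Za-z_][\w.-]*)\s*=\s*"([^"]*)"')


def read_whitespace_pis(xml_text):
    """Return {element name: 'ELEMENT' or 'MIXED'} from XML-WHITESPACE PIs."""
    content_types = {}

    def handle_pi(target, data):
        # Ignore every PI except the hypothetical whitespace one.
        if target != PI_TARGET:
            return
        attrs = dict(PSEUDO_ATTR.findall(data))
        content = attrs.get("CONTENT")
        if content is None:
            return
        # ELEMENTS is a space-separated list of element names.
        for name in attrs.get("ELEMENTS", "").split():
            content_types[name] = content

    parser = xml.parsers.expat.ParserCreate()
    parser.ProcessingInstructionHandler = handle_pi
    parser.Parse(xml_text, True)
    return content_types
```

An application could then discard WS-only text inside any element mapped to
ELEMENT and preserve it inside any element mapped to MIXED; what to do with
elements named in no PI is a separate choice that this sketch leaves open.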

> 
> The whitespace rules in SGML can be simplified - most people accept that
> they should. Because inclusions and exclusions aren't valid in XML
> anyway, the rules are already somewhat simpler. I would really like to
> see XML and SGML stay in synch - I think anything else would be to
> everyone's disadvantage. There really isn't a lot of point in flaming me
> for this; the question is well intentioned and the current solution

There are no flames on xml-dev :-) We are all trying to solve a difficult
technical, perceptual and cultural problem. [The general standard of debate
and courtesy within the SGML community is impressive.]

	P.

-- 
Peter Murray-Rust, domestic net connection
Virtual School of Molecular Sciences
http://www.vsms.nottingham.ac.uk/

xml-dev: A list for W3C XML Developers
Archived as: http://www.lists.ic.ac.uk/hypermail/xml-dev/
To unsubscribe, send to majordomo at ic.ac.uk the following message;
unsubscribe xml-dev
List coordinator, Henry Rzepa (rzepa at ic.ac.uk)