Whitespace

Mon Aug 25 17:01:47 BST 1997

At 4:38 AM -0500 8/25/97, Marcus Carr wrote:
>Apologies in advance to all those who have thought and fought over this
>issue for a long time, but as a self-confessed critic of the claim that
>"XML is SGML", I feel compelled to throw my hat into the ring.

I looked with interest for the criticism of the claim, since that would be
useful information -- we've gone so far as to hold off critical features
of XML in a few places to wait for the ISO to catch up in the current SGML
revision. One of the things they kindly agreed to update is the whitespace
rules, so that the XML rules can be turned on in the SGML declaration.

>As far as I can see, there are only two circumstances when whitespace is
>an issue - receiving an XML document or authoring one. Receiving, it
>doesn't matter if you have a DTD or not - the application can determine
>from a well formed document whether it should regard an element's
>content as MIXED or ELEMENT.

Since XML must deal with well formed documents (no DTD) the traditional
SGML whitespace rules _cannot_ be used, as element content and mixed
content are not distinguished in instances by _any_ dependable cues. The
limited DTD proposal pleased neither the DTD-haters, nor the DTD-lovers,
though it was in a draft for a long time.
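
To make the "no dependable cues" point concrete, here is a minimal sketch in Python using the standard xml.etree.ElementTree module. The two sample instances and the helper name whitespace_runs are hypothetical, invented only for illustration. Read one as having element content and the other as mixed content whose text happens to be all spaces: without a DTD, the whitespace a processor sees is identical.

```python
import xml.etree.ElementTree as ET

# Two hypothetical instances; neither carries a DTD.
LIST_DOC = "<list>\n  <item>one</item>\n  <item>two</item>\n</list>"
PARA_DOC = "<para>\n  <b>one</b>\n  <i>two</i>\n</para>"


def whitespace_runs(xml_text):
    """Return the whitespace-only text chunks directly inside the root element."""
    root = ET.fromstring(xml_text)
    # Text before the first child, then the text following each child.
    chunks = [root.text] + [child.tail for child in root]
    return [c for c in chunks if c is not None and c.strip() == ""]


# The whitespace is the same in both; nothing in the instance says
# which of the two the author meant to be significant.
print(whitespace_runs(LIST_DOC) == whitespace_runs(PARA_DOC))
```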

> It does involve parsing it, but only until
>it sees mixed content. If elements are assumed to be ELEMENT until
>proven otherwise, surely this wouldn't be a massive overhead.

It might involve buffering large amounts of whitespace across an arbitrary
parser lookahead, since there is no limit on the size of an element, or
where the non-space PCDATA might show up.
One would have to buffer the entire document in the parser before one could
decide whether to emit any whitespace in the root element. This might be a
bit of a memory performance hit...
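
A rough sketch of the buffering cost, in Python. This is not any real parser's API: the event tuples and the function name filter_whitespace are assumptions made up for the illustration. It applies the "ELEMENT until proven otherwise" rule to the direct content of one element, and reports how many events had to be held back before the decision could be made.

```python
def filter_whitespace(events):
    """Apply 'ELEMENT until proven otherwise' to one element's direct content.

    `events` is an iterable of ("text", str) or ("child", name) tuples
    (a hypothetical event format). Returns (emitted_events, peak_held).
    """
    emitted = []
    held = []      # events waiting on the mixed/element decision
    peak = 0       # largest number of events held at once
    mixed = False
    for kind, value in events:
        is_space = kind == "text" and value.strip() == ""
        if mixed:
            # Decision already made: pass everything through.
            emitted.append((kind, value))
        elif kind == "text" and not is_space:
            # Non-space text proves mixed content: flush what was held.
            mixed = True
            emitted.extend(held)
            held.clear()
            emitted.append((kind, value))
        elif is_space or held:
            # Undecided whitespace, or anything that follows it, must wait
            # so that the output order is preserved.
            held.append((kind, value))
            peak = max(peak, len(held))
        else:
            emitted.append((kind, value))
    # End of element with no non-space text: treat as element content
    # and drop the held whitespace, keeping the held children.
    emitted.extend(e for e in held if e[0] != "text")
    return emitted, peak
```

If the only non-space text is the last thing in the element, everything after the first whitespace is held until then. For the root element, that is most of the document.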

> Authoring
>applications would be similar - the first time a tag contained mixed
>content, the application would reset the status of the element. The onus
>would from then on be on the application to assist the user in creating
>semantically correct documents, by such mechanisms as not allowing hard
>returns at element boundaries, in short, making significant whitespace
>look like significant whitespace.

Many people have claimed that they use editors incapable of functioning
without inserting line ends (of their local flavor) every 200 characters or
so. I (personally) wasn't very sympathetic to this argument, but it stood
in for the empirical observation that people are very loose with
whitespace/line ends, and that forcing tools not to emit whatever
line-ending codes they want could be a problem.

>MURATA Makoto wrote:
>
>> Suppose that we have different kinds of tags for mixed-content
>> elements (e.g, <name:mixed> and </name:mixed>) and element-content
>> elements (e.g, <name:element> and </name:element>).  Then, even
>> non-validating parsers can tell element contents and mixed contents.
>> Does this help?
>
>It seems that the choices are either the current proposal that nobody
>seems to feel is entirely satisfactory, or suggestions such as the
>above, which would certainly work, but ultimately may involve as great
>an overhead as sending the DTD. It seems to me that we're throwing the
>baby out with the bathwater by ignoring a solution such as declaring at
>the start of the document how whitespace in elements should be handled.

The real problem is that there's an assumption that a generic processor can
solve the "whitespace problem" -- and that is not really true. In a very
real sense the meaning of whitespace is a product of the document _and_
the application. For instance, line breaks (as indicated by whitespace)
might be critical in a typesetting application for poetry (but _only_ in
<poem> elements). The same document, however, would be best processed with
some form of whitespace-collapsing everywhere, when indexed by a full-text
search engine. The same data may have different significance when processed
differently.
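
A small sketch of that, in Python with the standard xml.etree.ElementTree module. The sample document and the two function names are hypothetical, and the "typesetter" and "indexer" are toy stand-ins, not real applications. The point is only that both start from the same unnormalized data and want different things from its whitespace.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document with a <poem> and an ordinary paragraph.
DOC = (
    "<doc><poem>Roses are red,\n  violets are blue</poem>\n"
    "<p>Plain   prose\nhere.</p></doc>"
)


def typeset_lines(root):
    """Typesetting view: keep line breaks, but only inside <poem>."""
    out = []
    for el in root:
        text = "".join(el.itertext())
        if el.tag == "poem":
            # Each source line of the poem becomes an output line.
            out.extend(line.strip() for line in text.split("\n"))
        else:
            # Everywhere else, runs of whitespace collapse to one space.
            out.append(" ".join(text.split()))
    return out


def index_tokens(root):
    """Indexing view: collapse whitespace everywhere, poems included."""
    return "".join(root.itertext()).split()


root = ET.fromstring(DOC)
print(typeset_lines(root))
print(index_tokens(root))
```

Neither view could be computed reliably if a parser had already thrown away or rewritten the line breaks before the application saw them.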

The fact is that whitespace should be controlled by the application. For
typesetting and display, this means that practically, it's going to be part
of the "stylesheet" or other processing mechanism. The advantage of "parser
handled whitespace" would be the ability to create meaningful, error-free
applications that can work on arbitrary markup _without a stylesheet or
other processing specification_. The only small problem with that
convenience is that such processing is basically impossible, for many more
reasons than telling where words end, or if CR is a line end or just part
of a CRLF sequence.
>
> .....
> As
>far as I can see, much of the functionality in XML (such as linking)
>relies on a DTD, so it's not going to be foreign to most XML
>applications anyway.

This is not necessarily the case. It's also harder to detect mixed content
from DTD declarations than simply to recognize #FIXED attributes.

>
>The whitespace rules in SGML can be simplified - most people accept that
>they should.

>I would really like to
>see XML and SGML stay in synch - I think anything else would be to
>everyones disadvantage.

Yes, this is very true -- and this battle has been won by the compatibility
camp -- they are in synch. SGML has a new "pass all whitespace" option for
the declaration. This is not going to be a big problem for existing
implementations, since it's incredibly easy for parsers to implement --
most have had to anyway, if they attempt to support SGML->SGML
transformation tools. I think SP already can do the right thing.

> There really isn't a lot of point in flaming me
>for this; the question is well intentioned and the current solution
>seems to have satisfied few. The concept of declaring things at the
>start is a tried and true methodology, yet we seem to be fleeing it in
>favor of something nobody's quite sure about.

No flameage required. I agree with the intent -- just not your proposed
solutions. We went through all these permutations -- any form of
normalization _before_ the application causes some kind of problem. And
since there is, in any case, no universal way to handle markup without an
external processing spec (that can include whitespace among its many other
factors) there's no reason to make the parser cause applications more
problems than they will have to solve already.

_________________________________________
David Durand              dgd at cs.bu.edu  \  david at dynamicDiagrams.com
Boston University Computer Science        \  Sr. Analyst
http://www.cs.bu.edu/students/grads/dgd/   \  Dynamic Diagrams
--------------------------------------------\  http://dynamicDiagrams.com/
MAPA: mapping for the WWW                    \__________________________
