Whitespace

Tue Aug 26 16:27:24 BST 1997

At 6:17 PM -0500 8/25/97, Marcus Carr wrote:
>David G. Durand wrote:
>
>> > It does involve parsing it, but only until
>> >it sees mixed content. If elements are assumed to be ELEMENT until
>> >proven otherwise, surely this wouldn't be a massive overhead.
>>
>> It might involve buffering large amounts for whitespace across an
>> arbitrary parser lookahead, since there is no limit on the size of an
>> element, or where the non-space PCDATA might show up. One would have
>> to buffer the entire document in the parser before one could decide
>> whether to emit any whitespace in the root element. This might be a
>> bit of a memory performance hit...
>
>Why would you need to buffer anything? Every element starts with a
>default value of 'element'. As they're shown to be otherwise, their
>status is revised. This involves tracking open elements, not picking up
>chunks and reviewing them. One linear pass of the document tells you all
>you need to know.

Well, you can't send any of the data within an element while you are
"tracking", since you don't know if whitespace is data or noise. So you
have to buffer element opens and closes, and any PCDATA, until the element
is over (or you find non-WS PCDATA). The easy-to-see worst case has
megabytes of doc with one lone string, "The end", in the content of the
top-level element, which otherwise contains only elements and no data.
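
A minimal sketch of that worst case, in Python. The event tuples and the
function name filter_whitespace are hypothetical, invented only for this
illustration; this is not any real parser's API. The filter assumes every
element has element content until it sees non-whitespace text directly
inside it, and it reports how many events it had to hold back at the peak.

```python
from collections import deque

def filter_whitespace(events):
    """Drop whitespace-only text from elements that never show real text.

    events: list of ("start", name), ("end", name) or ("text", data).
    Returns (emitted_events, peak_number_of_held_events).
    """
    stack = []       # one status record per open element
    queue = deque()  # held events: (event, owning record or None)
    out = []
    peak = 0
    for ev in events:
        if ev[0] == "start":
            stack.append({"mixed": False, "closed": False})
            queue.append((ev, None))
        elif ev[0] == "end":
            stack.pop()["closed"] = True
            queue.append((ev, None))
        else:
            frame = stack[-1]
            if ev[1].strip():
                # Non-whitespace text: the enclosing element is mixed.
                frame["mixed"] = True
                queue.append((ev, None))
            else:
                # Whitespace: its fate depends on the enclosing element.
                queue.append((ev, frame))
        # Release events from the head of the queue while they are decided.
        while queue:
            held, frame = queue[0]
            undecided = (frame is not None
                         and not frame["mixed"] and not frame["closed"])
            if undecided:
                break  # everything behind it must wait, to keep order
            queue.popleft()
            if frame is None or frame["mixed"]:
                out.append(held)
            # otherwise: whitespace in an element-only element, dropped
        peak = max(peak, len(queue))
    return out, peak

n = 1000
body = [("text", "\n"), ("start", "item"), ("end", "item")] * n
doc = [("start", "doc")] + body + [("text", "The end"), ("end", "doc")]
out, peak = filter_whitespace(doc)
print(len(doc), peak)
```

In this sketch the number of held events grows with the size of the root
element's content, because nothing after the first undecided whitespace can
be released until "The end" turns up.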

>> Many people have claimed that they use editors incapable of
>> functioning without inserting line ends (of their local flavor) every 200
>> characters or so. I (personally) wasn't very sympathetic to this
>> argument, but it stood in for the empirical observation that people
>> are very loose with whitespace/line ends, and that forcing tools not to
>> emit whatever line-ending codes they want could be a problem.
>
>This would still respect the limits set by the user in the same way an
>application would behave when you turn off hyphenation - the line might
>be shorter, but it's broken in a sensible place.

Exactly. So as I said, this potential justification for whitespace mangling
is a non-starter. Thanks for the support.

>
>> The real problem is that there's an assumption that a generic
>> processor can solve the "whitespace problem" -- and that is not really
>> true. In a very real sense the meaning of whitespace is a product of
>> the document _and_ the application. For instance, line breaks (as
>> indicated by whitespace) might be critical in a typesetting
>> application for poetry (but _only_ in <poem> elements). The same
>> document, however, would be best processed with some form of
>> whitespace-collapsing everywhere, when indexed by a full-text search
>> engine. The same data may have different significance when processed
>> differently.
>
>If line breaks are critical, they should be marked explicitly. If you
>gave a hand written poem to a data entry person with no knowledge of
>poetry, you may have to specify that you want the current line
>boundaries respected. Why should an application not be given the same
>info?

It should. In a stylesheet. I've still not seen an example of a case where
an application that doesn't know the DTD and doesn't have a processing
spec needs to _collapse_ whitespace. XML always passes all the whitespace,
so it is never lost except by _explicit application action_.
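
To make "explicit application action" concrete, here is a small sketch in
Python using the standard xml.etree.ElementTree module (a present-day
stand-in, not a tool from this discussion). The sample document and the two
function names are made up for the example. Both functions read the same
parsed data; only the application decides what the whitespace means.

```python
import xml.etree.ElementTree as ET

DOC = (
    "<doc>\n"
    "<p>Some   prose\n here.</p>\n"
    "<poem>Roses are red,\nViolets are blue.</poem>\n"
    "</doc>"
)

def index_text(root):
    """Full-text indexing view: collapse every whitespace run to one space."""
    return " ".join("".join(root.itertext()).split())

def typeset_lines(root):
    """Typesetting view: keep line breaks, but only inside <poem>."""
    lines = []
    for child in root:
        text = "".join(child.itertext())
        if child.tag == "poem":
            lines.extend(text.split("\n"))      # line breaks are significant
        else:
            lines.append(" ".join(text.split()))  # line breaks are noise
    return lines

root = ET.fromstring(DOC)
print(index_text(root))
print(typeset_lines(root))
```

Neither policy is built into the parser; each one is a few lines of
application (or stylesheet) logic applied after the fact.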

>
>> The fact is that whitespace should be controlled by the application.
>> For typesetting and display, this means that practically, it's going
>> to be part of the "stylesheet" or other processing mechanism.
>
>Whitespace is also a mechanism used to make data readable. In that
>sense, a space is a character in its own right, not just something that
>appears around words. Imagine the response if it wasn't whitespace that
>was being discussed, it was the letter 'x', and we were telling people
>'x' may or may not appear in their data.

You are the one arguing that it must be possible to "turn off x's" when
convenient. XML _passes all whitespace_. The only kind of convention we can
create is one that turns off some whitespace. I see that as dangerous, for
the reasons you give. We may be in raging agreement!

>
>> > As
>> >far as I can see, much of the functionality in XML (such as linking)
>> >relies on a DTD, so it's not going to be foreign to most XML
>> >applications anyway.
>>
>> This is not necessarily the case. It's also harder to detect mixed
>> content from DTD declarations than simply to recognize #FIXED
>> attributes.
>
>It can't be that hard. If parameter entities (I assume they're allowed?)
>have to be unravelled anyway, surely it's just a case of looking at the
>content model? If it starts with #PCDATA and contains anything else,
>it's mixed content.

If you don't have a DTD, you don't have content models. Even if you do have
the DTD, a minimal parse would involve "entity unravelling" -- a serious
increment in complexity just to be able to ignore a few spaces. In any
case, XML has decided to eliminate SGML's arcane whitespace rules, since
the ISO has agreed to create an SGML declaration option that will have the
same effect.
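
For a feel of what the "entity unravelling" step adds, here is a rough
Python sketch of the test described in the quoted text (a content model that
starts with #PCDATA and contains anything else). The function names, the
entity table and the sample models are all hypothetical. It assumes the
parameter entities have already been collected into a dictionary and do not
refer to each other in a cycle, and it ignores everything else a real DTD
parser has to handle.

```python
import re

PE_REF = re.compile(r"%([A-Za-z_][\w.-]*);")

def expand(model, entities):
    """Replace %name; references until none remain (assumes no cycles)."""
    while PE_REF.search(model):
        model = PE_REF.sub(lambda m: entities[m.group(1)], model)
    return model

def is_mixed(model, entities):
    """Apply the quoted rule to a content model after entity expansion."""
    expanded = re.sub(r"\s+", "", expand(model, entities))
    return expanded.startswith("(#PCDATA") and "|" in expanded

entities = {
    "inline": "em | strong",
    "text": "#PCDATA | %inline;",
}
print(is_mixed("(%text;)*", entities))
print(is_mixed("(title, para+)", entities))
```

The check itself is one line; collecting and expanding the entities is the
part a DTD-less processor has no way to do at all.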

>"Pass all whitespace" will go some distance toward fixing the problem,
>but what else does it impact? Does it mean that inclusions and
>exclusions suddenly appear differently than they did in the 'old SGML'?

There are no inclusions or exclusions in XML. If you are using the new
declaration in the new SGML you'll have to read the spec and find out, but
it's irrelevant to XML.

The XML authors worked through the consequences, but it wasn't very hard,
since most of the problematic features of SGML (inclusion exceptions,
shortrefs, minimization) were already gone, so the interactions were simple.

The one weird thing is that the distinction between whitespace behavior for
element and mixed content no longer exists. You see all whitespace
regardless. This was essentially required for DTDless and DTDfull parsing
to produce equivalent results.

So in "pure XML" whitespace is never "source-code formatting", but is
_always_ data.
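
A quick way to see this with a present-day parser, sketched here with
Python's standard xml.etree.ElementTree (my choice of tool for illustration;
the sample document is made up). No DTD is involved, and the indentation
between the elements still arrives as data.

```python
import xml.etree.ElementTree as ET

SRC = "<list>\n  <item>a</item>\n  <item>b</item>\n</list>"
root = ET.fromstring(SRC)

# Collect every whitespace-only string the parser handed over:
# the text before the first child, then the text after each child.
whitespace_runs = [root.text] + [child.tail for child in root]
print([repr(s) for s in whitespace_runs])
```

Whether those runs are then kept or thrown away is up to the code that
consumes the tree.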

>My understanding is that one of the basic requirements of XML was that
>the applications had to be easy to write, so things could be allowed to
>happen quickly. As much as I do agree that this would have to be a good
>thing, (and as you pointed out, applications are coming out already) I
>would argue that maybe they should be more difficult to write, but
>should address this issue correctly.

They do -- even when "correctly" means compatible with ISO SGML -- but we did
get ISO to simplify some of the hard bits of SGML.

  -- David

_________________________________________
David Durand              dgd at cs.bu.edu  \  david at dynamicDiagrams.com
Boston University Computer Science        \  Sr. Analyst
http://www.cs.bu.edu/students/grads/dgd/   \  Dynamic Diagrams
--------------------------------------------\  http://dynamicDiagrams.com/
MAPA: mapping for the WWW                    \__________________________

xml-dev: A list for W3C XML Developers
Archived as: http://www.lists.ic.ac.uk/hypermail/xml-dev/
To unsubscribe, send to majordomo at ic.ac.uk the following message;
unsubscribe xml-dev
List coordinator, Henry Rzepa (rzepa at ic.ac.uk)