Whitespace
Arnaud Le Taillanter
arnaud21 at club-internet.fr
Sun Sep 14 01:52:28 BST 1997
Hello,
Still about white space, sorry :-)
First part : comments on the XML draft approach to
WS handling.
Second part : comments on Neil Bradley's five rules for
WS handling (version 1).
**First part**
In the current draft, I see 3 rules concerning WS :
*Rule 1* : all WS is preserved and fed to the application.
A very simple rule indeed, in accordance with XML
design goals. But Neil Bradley's five rules are
simple to implement too (though incorrect). By contrast, consider
parameter entities: the committee members acknowledged
they had some difficulty designing a grammar for DTD
declarations, because of PEs. So implementing such a grammar
won't be trivial (BTW, someone said he had designed
a W grammar. It could be interesting
to see what it looks like. Please post!),
far less trivial than replacing CR, LF, CRLF by
a single character! (NB: the WG agreed a few days ago
on that rule :-)
So the simplicity argument doesn't hold.
The real issue is that the application must be fed with
a credible tree structure. Take a document without
a DTD:
<DOC>CR
<PART>CR
<P> foo</P>CR
</PART>CR
</DOC>
What kind of tree structure will the processor offer us?
A root node "DOC". So far, so good. But everybody
expects now a single child node (the "PART" element).
The processor gives us *three* for the same price:
the very useful "CR" element. The "PART" element.
And another "CR" node. What kind of ridiculous
tree is that? A Chernobyl tree, I guess.
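The three-child tree can be observed with any parser that passes all WS through. The following is a minimal sketch using Python's standard xml.dom.minidom (a present-day parser, chosen here only for illustration), with the line ends written as "\n" rather than CR. The name child_summary is a hypothetical helper, not part of any library.

```python
from xml.dom import minidom

# The DTD-less document from the example, with each line end written as "\n".
SOURCE = "<DOC>\n<PART>\n<P> foo</P>\n</PART>\n</DOC>"

def child_summary(xml_text):
    """Return a readable list of the root element's direct children."""
    root = minidom.parseString(xml_text).documentElement
    summary = []
    for node in root.childNodes:
        if node.nodeType == node.ELEMENT_NODE:
            summary.append("element " + node.tagName)
        elif node.nodeType == node.TEXT_NODE:
            summary.append("text " + repr(node.data))
    return summary

# DOC has three children: a line-end text node, PART, and another line-end text node.
print(child_summary(SOURCE))
```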
*Rule 2*: a validating parser must distinguish WS in
element content and signal to the application that such
WS is not significant.
I observe that it is not said how the parser will tell
the application about such insignificant WS. A minor point,
I concede. Whether the parser is validating or not, a
solution should be found where WS in element content
is *discarded* : this is the important point. No node
with only WS in it : it is completely against the
philosophy of SGML/XML: (well)*structured* content.
If the parser is able to distinguish what is element
content and what is not (the hard part without a DTD),
it should discard those completely useless WSs (the
easy part).
*Rule 3*: A special attribute may be inserted in documents
to signal an intention that the element to which this
attribute applies requires all white space to be treated as
significant by applications.
The value DEFAULT signals that applications' default white-space
processing modes are acceptable for this element; the value PRESERVE
indicates the intent that applications preserve all
the white space.
As someone observed, this is contradictory with the
position "the application should manage WS issues, the
parser doesn't intervene".
BTW, the attribute is hardly useful: suppose I put on the web a
document, with a "FOO" element with the attribute
"XML-SPACE" set to "DEFAULT". Application A
normalizes WS by default. Application B does nothing
with WS by default. As a result, an attribute set to "DEFAULT"
conveys absolutely no information. It will be the same as
"PRESERVE" with some applications. Basically, it
will be a mess :-) But we are used to that :-))
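To make the argument concrete, here is a small sketch with two hypothetical applications, application_a and application_b, standing in for "Application A" and "Application B" above. Their behaviour is an assumption made for illustration; the draft does not define what any application does by default.

```python
def application_a(text, xml_space):
    """Hypothetical application whose default mode normalizes white space."""
    if xml_space == "PRESERVE":
        return text
    # Default mode: collapse every run of white space to one space.
    return " ".join(text.split())

def application_b(text, xml_space):
    """Hypothetical application whose default mode leaves white space alone."""
    return text

content = "two  spaces\nand a line end"

# With DEFAULT, the two applications hand back different text for the same
# document, and for application_b DEFAULT is indistinguishable from PRESERVE.
print(repr(application_a(content, "DEFAULT")))
print(repr(application_b(content, "DEFAULT")))
```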
What is strange too, is that there is no default value
for this attribute by default. Those SGML guys are really
subtle :-)) A default value of "DEFAULT" would seem to be
natural, but in that case the application does anything
it wants to, so who cares :-)
**Second part**
Neil Bradley proposed some simple rules (this is "version 1", a second
version, a little more complex, but simple enough, was proposed). I
really like the approach, even if it doesn't work for the moment.
*Rule 1*: standardization of input from different OSs.
CR, LF, CRLF are translated to a line end code.
OBVIOUS!!!!!
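For what it's worth, a minimal sketch of this normalization, assuming the "line end code" is represented as "\n" (the rule itself does not say which character is chosen). The name normalize_line_ends is made up for the example.

```python
import re

LINE_END = "\n"  # assumed representation of the single "line end code"

def normalize_line_ends(text):
    """Translate CRLF, lone CR and lone LF to one line end code."""
    # CRLF is listed first so that it becomes one line end, not two.
    return re.sub(r"\r\n|\r|\n", LINE_END, text)

print(repr(normalize_line_ends("a\r\nb\rc\nd")))
```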
*Rule 2*: line end codes after a start tag or before an end tag are
discarded. A simple rule. For usual elements, it is exactly what you
expect :
<P>
blabla
</P>
becomes <P>blabla</P>
for PRE-like elements:
<PRE>
SPSPblabla
</PRE>
becomes <PRE>SPSPblabla</PRE>, so two line ends are discarded.
It seems nevertheless natural that these line ends are dropped.
BTW, this rule was in the first (11/14/96) XML draft.
There is a first problem with this approach: in
default content (preserved content will be examined later):
<P><EM>Two
</EM>words</P>
becomes
<P><EM>Two</EM>words</P>
The space between "Two" and "words" evaporated.
Same thing with:
<P><EM>
Two
</EM>words</P>
I don't think this particular problem is important: the encoding
is not natural. It should be an error!
I think everybody would write:
<P><EM>Two</EM> words</P>, or
<P>
<EM>Two</EM> words
</P>, etc...
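Rule 2 and the evaporating space can be reproduced with a naive sketch that works on the raw text with regular expressions. It assumes line ends are already normalized to "\n" and that no attribute value contains ">"; a real parser would not work this way. The name apply_rule_2 is hypothetical.

```python
import re

# A start tag (not an end tag, comment, PI or empty-element tag) followed by a line end.
START_TAG_THEN_LINE_END = re.compile(r"(<[^/!?>][^>]*(?<!/)>)\n")
# A line end followed by an end tag.
LINE_END_THEN_END_TAG = re.compile(r"\n(</[^>]+>)")

def apply_rule_2(text):
    """Discard a line end right after a start tag or right before an end tag."""
    text = START_TAG_THEN_LINE_END.sub(r"\1", text)
    return LINE_END_THEN_END_TAG.sub(r"\1", text)

print(apply_rule_2("<P>\nblabla\n</P>"))
print(apply_rule_2("<P><EM>Two\n</EM>words</P>"))
```

The second call shows the problem: the only separator between "Two" and "words" was the line end, and the rule removes it.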
Inside a preserved element, line end codes are wrongly discarded
after element start tags and before element end tags:
<PRE XML-SPACE="PRESERVE">
blabla <EM>
bloblo</EM>
blublu
</PRE>
The coding in this case is natural: bla, blo and blu are very
aesthetically aligned!
But: a line end code is discarded after "<EM>", it shouldn't be.
So: preserved elements need a special rule. It seems quite natural
they need a special rule concerning line end codes (and
space codes).
A possibility: the parser closes a "default" (not preserved) element,
and opens a "preserved" element: the line end codes after the start tag
and before the end tag are discarded. But for a preserved element
directly embedded in a preserved element, line end codes
are left intact.
*Rule 3*: WS in element content is discarded.
WS in element content *must* be discarded. The problem
is: without a DTD, one doesn't know if an element contains only
other elements.
Suppose we have :
<P><EM>blabla</EM>SP<EM>bloblo</EM></P>
We could choose a rule like: an element in which the parser
finds only other elements and WS (no characters) is an element
content element. But as the above example shows, it doesn't work.
If we follow this rule, we have a tree with a root node "P" and
two child nodes "EM". And what we want is a root node with three
child nodes: two "EM" elements and between the two a "PCDATA"
element (the space between "blabla" and "bloblo")
So a different method must be found.
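The failure of that heuristic can be checked with a short sketch. It uses Python's standard xml.dom.minidom for the tree, and the two function names are hypothetical; the heuristic coded here is the one described above (only elements and WS means element content), not a rule from the draft.

```python
from xml.dom import minidom

def looks_like_element_content(element):
    """Heuristic: at least one child element and no non-WS text."""
    has_element_child = False
    for node in element.childNodes:
        if node.nodeType == node.ELEMENT_NODE:
            has_element_child = True
        elif node.nodeType == node.TEXT_NODE and node.data.strip():
            return False
    return has_element_child

def drop_ws_by_heuristic(element):
    """Remove WS-only text wherever the heuristic sees element content."""
    if looks_like_element_content(element):
        for node in list(element.childNodes):
            if node.nodeType == node.TEXT_NODE:
                element.removeChild(node)
    for node in element.childNodes:
        if node.nodeType == node.ELEMENT_NODE:
            drop_ws_by_heuristic(node)

doc = minidom.parseString("<P><EM>blabla</EM> <EM>bloblo</EM></P>")
drop_ws_by_heuristic(doc.documentElement)
# The significant space between the two EM elements has been lost.
print(doc.documentElement.toxml())
```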
A radical constraint put on the user would be: don't input a single
space character in element content. With this rule the parser
will be able to recognize easily element content. But you
can forget about indentation in that case. The rule for the
user would be: "when you type a space, you mean a space".
BTW, this is always the case, except for indentation.
If the semantic overloading for the space character is removed
(a space is either a "real" space or an indentation space),
things are so much easier.
*Rule 4*: Except in preserved elements (elements
with a space attribute set to "PRESERVE") line end codes are
discarded when preceded by a hard or
soft hyphen (in the process, a soft hyphen is also discarded) and
remaining line end codes are treated as space.
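Rule 4 as stated can be sketched in a few lines, assuming the text comes from a default (not preserved) element, that line ends are already normalized to "\n", that the hard hyphen is "-" and that the soft hyphen is U+00AD. The function name apply_rule_4 is made up for the example.

```python
SOFT_HYPHEN = "\u00ad"  # assumed code point for the soft hyphen

def apply_rule_4(text):
    """Apply Rule 4 to the character data of a default element."""
    # Soft hyphen + line end: both are discarded.
    text = text.replace(SOFT_HYPHEN + "\n", "")
    # Hard hyphen + line end: only the line end is discarded.
    text = text.replace("-\n", "-")
    # Every remaining line end is treated as a space.
    return text.replace("\n", " ")

print(apply_rule_4("well-\nknown rule\nof thumb"))
```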
The rule concerning hyphens is not necessary. If it's a hard hyphen,
don't put it at line end (who would do that?)
Moreover, there is no use in an XML source file to put a soft
hyphen at line end. Who would do that? In my poor life, I have no occa-
sion to see some text with hyphens at line end.
There is a possible problem with the replacement of line end codes
in default (that is, not preserved) elements by a space character.
Suppose we have a text coded with Unicode (that could
happen :-)), with Chinese ideographs. In Chinese,
there is no concept of a word (sequence of letters): each ideograph is a
"word".
I don't know how in fact the Chinese encode their texts, but there
is obviously no utility in putting a space after each ideograph.
The Chinese must nevertheless use the end of line
character. And one shouldn't replace such a character by a space, which
would be an error, but simply discard it.
Depending on the class
of characters, there could be a different treatment of line end codes.
But this becomes complex :-(
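A rough sketch of such a class-dependent treatment follows. It assumes, purely for illustration, that "ideograph" means a character in the basic CJK Unified Ideographs block (U+4E00 to U+9FFF) and that a line end is dropped only when it sits between two such characters. Real line-breaking conventions are more involved, and both function names are hypothetical.

```python
import re

def is_cjk_ideograph(ch):
    """True for characters in the basic CJK Unified Ideographs block only."""
    return "\u4e00" <= ch <= "\u9fff"

def replace_line_ends(text):
    """Drop a line end between two ideographs; turn any other into a space."""
    def replacement(match):
        before = text[match.start() - 1] if match.start() > 0 else ""
        after = text[match.end()] if match.end() < len(text) else ""
        if before and after and is_cjk_ideograph(before) and is_cjk_ideograph(after):
            return ""
        return " "
    return re.sub(r"\n", replacement, text)

print(replace_line_ends("two\nwords"))
print(replace_line_ends("\u4e2d\n\u6587"))
```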
Another approach: simply ignore line end codes. But you
have to put a space at the end of a line. The idea is quite
natural: line end codes are there for our eyes, they don't add
anything to the meaning of a text. The XML tree should
reflect the substance of a text, not the particular way it
was input:
<P>
We should
get rid of
line end
codes
</P>
and
<P>We should get rid of line end codes</P>
should give the same node in the document tree.
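A tiny sketch of this approach, assuming line ends are normalized to "\n" and that the author has typed a space at the end of each line that needs one (the trailing spaces are written out explicitly in the string below). The name ignore_line_ends is hypothetical.

```python
def ignore_line_ends(text):
    """Simply discard every line end code."""
    return text.replace("\n", "")

# Note the explicit space before each line end inside the paragraph.
wrapped = "<P>\nWe should \nget rid of \nline end \ncodes\n</P>"
single = "<P>We should get rid of line end codes</P>"

# Both encodings reduce to the same text, hence the same node.
print(ignore_line_ends(wrapped) == single)
```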
If line end codes must be preserved: use a preserved element, or
an empty element (<BR/>).
*Rule 5*: except in preserved elements, consecutive WS characters
are reduced to a single space.
I don't like this rule. If I put two spaces after a full stop, I mean two
spaces.
It's a typographic decision.
Rule 5 is meant to allow some indentation:
<P>
  He said:
  <QUOTE>
    I need some
    indentation.SPSPIndentation is needed.
  </QUOTE>
</P>
In the above example, it is necessary to get rid of spaces caused
by indentation. But the two spaces marked "SP" should be retained.
So the new rule would be: SPs at the beginning of a line should be
discarded.
This rule must be applied before line end codes are discarded, i.e. before
rule 2. What a headache :-)
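The difference between Rule 5 and the alternative can be sketched as follows, assuming line ends are normalized to "\n" and indentation is made of spaces or tabs. Both function names are made up for the example.

```python
import re

def collapse_ws(text):
    """Rule 5 as proposed: any run of white space becomes one space."""
    return re.sub(r"\s+", " ", text)

def strip_indentation(text):
    """Alternative: discard only the spaces and tabs that begin a line."""
    return re.sub(r"(?m)^[ \t]+", "", text)

quote = "  I need some\n  indentation.  Indentation is needed."

# Rule 5 also swallows the two deliberate spaces after the full stop.
print(repr(collapse_ws(quote)))
# The alternative removes the indentation and keeps the two spaces.
print(repr(strip_indentation(quote)))
```

As noted, the alternative has to run while the line ends are still in place, since "beginning of a line" means nothing once they are gone.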
Perhaps a simple rule could be: don't use indentation in XML files, or
you'll get burned.
More generally, if we want the parser to produce a clean data structure
out of an XML file, some burden will have to be put on the user's shoulders.
The contract could be: the user accepts some limitations on the way to
input the source code. He might have to write, instead of the above,
something like:
<P>
He said:
<QUOTE>
I need some
indentation.SPSPIndentation is needed.
</QUOTE>
</P>
The reward (invaluable) will be: a clean data
structure available for applications.
Thanks for your attention!
Regards,
Arnaud
xml-dev: A list for W3C XML Developers
Archived as: http://www.lists.ic.ac.uk/hypermail/xml-dev/
To unsubscribe, send to majordomo at ic.ac.uk the following message;
unsubscribe xml-dev
List coordinator, Henry Rzepa (rzepa at ic.ac.uk)