Whitespace
Rick Jelliffe
ricko at allette.com.au
Wed Aug 27 02:38:05 BST 1997
> From: Sean Mc Grath <digitome at iol.ie>
> Could someone out there who reckons this is easy kindly put
> me out of my misery by showing how it can be best handled?
Without addressing your dolorous (if not rubescent) herring,
Knuth's comments in "The Errors of TeX" are useful:
"The stickiest issue in TeX has always been the treatment of
blank spaces. Users tend to insert spaces in their computer
files so that files look nice, but document processors must also
treat spaces as objects that appear in the final output...
I kept searching for rules that would be simple enough to
be easily learned, yet natural enough that they could be applied
almost unconsciously. I finally concluded that no such rules
existed, and I opted for the best compromise I could find."
Charles Goldfarb commented at the Barcelona WG8 meeting
that whitespace handling was one of the design areas that
he felt SGML got wrong (by which I think he did not mean
that the SGML86 rules are not a workable, justifiable and
rational compromise -- given the constraint of having to work
with fixed-line-length text editors, which is the nub of the
design decision for SGML86 -- merely that perhaps the XML
'solution' of making it someone else's problem would
have deflected some consternation away from ISO 8879, and
partitioned functionality more neatly).
The solution that I think XML *now* has is this:
1) There are ISO 10646 characters available for lots of different
kinds of spaces. These can be specified directly by numeric
character references, or indirectly using the ISO public entities.
Some of these entities are already familiar to HTML people: in
particular &nbsp; is generated almost pathologically by some
versions of Netscape's HTML editor. So if you want to force
a break or space, these should be used.
2) If you want to prevent normal spaces from being collapsed,
then the attribute XML-SPACE="preserve" should be specified on
the containing element.
3) Otherwise, you should use spaces and newlines only when you
need them, and expect whitespace sequences to be collapsed.
XML generators that have access to the DTD should strip out
confusing whitespaces from element and mixed content.
4) SGML86 and XML have different whitespace rules. So you should
expect to have to process the files to add or remove space when
you convert between the two, unless you write your SGML DTD
without mixed content and/or impose some stricter discipline on
document creation.
5) If you need to prettyprint your document text, then you are best
advised to use whitespace within tags, rather than between tags.
For example:
<p x=1
>An element</p
>
Rather than
<p>
An element
</p>
If this looks strange to XML people, then remember that Bert Bos
found it natural to do (something like) this in a paper he wrote:
<x >blah< /x>
<x >blurt< /x>
So I do not think that we should assume too much about how HTML
people naturally view tag integrity. (In SGML and XML, Bert's
experimental markup would be invalid and not well-formed, despite
its nice pretty-printing: ETAGO '</' cannot be divided by whitespace.)
6) The XML stylesheet language must be strong enough to handle forcing
spaces between elements. It must be possible to define that, for example,
a keyword element must be separated by whitespace or punctuation (or
superscripted note references) from adjacent words, in languages that
use spaces as word separators.
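
Points 1 to 3 and 5 can be illustrated with a small Python sketch. It is
only an illustration under some assumptions: the receiving application
(not the parser) does the collapsing, the helper names collapse and
visible_text are made up for this example, and the attribute is spelled
xml:space as in the XML 1.0 Recommendation rather than the XML-SPACE
spelling used above.

```python
import re
import xml.etree.ElementTree as ET

# ElementTree reports the xml:space attribute under the reserved XML namespace.
XML_SPACE = "{http://www.w3.org/XML/1998/namespace}space"

# Only space, tab, CR and LF count as XML whitespace; U+00A0 does not.
_XML_WS = re.compile(r"[ \t\r\n]+")


def collapse(text):
    """Collapse runs of XML whitespace to one space and trim the ends."""
    return _XML_WS.sub(" ", text).strip(" ")


def visible_text(elem):
    """Text of a leaf element as a collapsing application might see it.

    Hypothetical helper: honours xml:space="preserve", otherwise collapses.
    """
    text = elem.text or ""
    if elem.get(XML_SPACE) == "preserve":
        return text
    return collapse(text)


doc = ET.fromstring(
    "<doc>"
    "<p>\n  An   element\n</p>"                       # point 3: collapsed
    "<p xml:space='preserve'>\n  An   element\n</p>"  # point 2: preserved
    "<p>An&#160;&#160;element</p>"                    # point 1: forced spaces
    "<p x='1'\n>An element</p\n>"                     # point 5: breaks inside tags
    "</doc>"
)
collapsed, preserved, nbsp, in_tag = doc.findall("p")

print(repr(visible_text(collapsed)))
print(repr(visible_text(preserved)))
print(repr(visible_text(nbsp)))
# Line breaks inside the tags never reach the content at all.
print(repr(in_tag.text))
```

The last case needs no collapsing step: the parser itself drops the
newlines placed inside the start and end tags, which is why that style
of prettyprinting is safe.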
I think these are good enough. If developers implement their systems to
allow them, then users will learn to tailor their documents
appropriately. I tend to think that users will always be able to mark up
documents incorrectly, no matter how hard we try.
Rick Jelliffe
xml-dev: A list for W3C XML Developers
Archived as: http://www.lists.ic.ac.uk/hypermail/xml-dev/
To unsubscribe, send to majordomo at ic.ac.uk the following message;
unsubscribe xml-dev
List coordinator, Henry Rzepa (rzepa at ic.ac.uk)