5 Whitespace Rules
neil at bradley.co.uk
Sat Aug 9 12:15:01 BST 1997
I think it's time to pin down some rules or guidelines regarding
the use of whitespace. I am not suggesting that the following is
exhaustive or totally unambiguous, but maybe it is a starting point
for discussion. I would really like to see a small list of rules such
as the following being defined, as I am sure it will help avoid potentially
damaging confusion arising when products arrive and prove to be
One of the problems of defining rules for XML has been the grouping
of line-end codes with space separating characters under the 'S'
rule. By separating these concepts, it is quite easy to define rules
which are both backward compatible with SGML and HTML (very important
in its own right) and also intuitive.
While the idea of ignoring all line-end codes and manually inserting spaces at
the start of each line to compensate is at first sight attractive,
it is certainly not intuitive, and there are plenty of text files in existence
(including SGML and HTML files, of course), which do not follow this
An application should remove or transform whitespace characters
received from the XML-processor according to the following 5 rules:
RULE 1. Every CR and LF code is regarded as a line-end signal, except
when it immediately follows the other code ([CR][LF] or [LF][CR]), in which
case it is discarded (and is also ignored, so has no effect on
calculations for the next character). This rule applies even in 'preserved' content.
This rule standardizes input from documents prepared on Mac, Unix and
[CR] ---> line-end
[LF] ---> line-end
[CR][LF] ---> line-end
[LF][LF] ---> line-end, line-end
[CR][CR] ---> line-end, line-end
[CR][LF][CR][LF] ---> line-end, line-end (because both LFs are discarded)
By including this rule in preserved content, we avoid alternate blank
lines appearing in documents prepared on an MS-DOS system but viewed
on another system.
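
As a rough illustration of Rule 1, here is a minimal Python sketch. The function name and the choice of "\n" as the internal line-end signal are assumptions made for the example, not part of the proposal. It also takes one reading of "ignored": a discarded code is forgotten, so it cannot pair with the code that follows it.

```python
CR, LF = "\r", "\n"
LINE_END = "\n"  # assumed internal representation of a line-end signal


def normalize_line_ends(text: str) -> str:
    """Apply Rule 1: map CR, LF, CR+LF and LF+CR to one line-end each."""
    out = []
    pending = None  # the CR or LF that produced the most recent line-end
    for ch in text:
        if ch in (CR, LF):
            if pending is not None and ch != pending:
                # Second half of a CR/LF pair: discard it and forget it,
                # so it has no effect on the next character.
                pending = None
                continue
            out.append(LINE_END)
            pending = ch
        else:
            out.append(ch)
            pending = None
    return "".join(out)
```

Under this reading, [CR][LF][LF] yields two line-ends: one for the pair and one for the trailing LF.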
RULE 2. A line-end code (or codes) immediately following a start-tag, PI or
declaration, or immediately preceding an end-tag, is discarded (except in
<note>[CR][CR]<p>[CR]This is a para in a note.[CR]</p>
becomes:
<note><p>This is a para in a note.</p>
But the CRs below are not removed (they are later converted to a space; see rule 4):
<p>Here is an[CR]
<p>Here is an <em>emphasised</em> word.</p>
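
Rule 2 could be sketched in Python roughly as follows. This assumes Rule 1 has already been applied, so every line-end is a single "\n". The function name is hypothetical, and the regular expressions are a deliberately naive stand-in for a real XML processor (they do not handle ">" inside attribute values, and they treat comments like declarations).

```python
import re

# A start-tag, PI or declaration followed by one or more line-ends.
# "<" followed by "/" (an end-tag) and tags ending in "/>" are excluded.
AFTER_OPEN = re.compile(r"(<[^/>][^>]*(?<!/)>)\n+")

# One or more line-ends immediately before an end-tag.
BEFORE_CLOSE = re.compile(r"\n+(</[^>]*>)")


def drop_line_ends_at_tags(text: str) -> str:
    """Apply Rule 2 to text whose line-ends are already normalized to '\\n'."""
    text = AFTER_OPEN.sub(r"\1", text)
    return BEFORE_CLOSE.sub(r"\1", text)
```

A line-end that sits between character data and a following start-tag matches neither pattern, so it is left in place for the later rules.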
RULE 3. All other whitespace in element content is discarded.
<note>[SP][TAB]<p>This is a para in a note...
becomes (in validated input):
<note><p>This is a para in a note...
Note that only the presence of spaces and tabs in element content,
which is not common, will cause discrepancies between validated and
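
A small Python sketch of Rule 3 follows, using the standard xml.etree.ElementTree module. The set ELEMENT_CONTENT is a hypothetical stand-in for what a validating processor would learn from the DTD (which elements may contain only child elements); the function name is likewise invented for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical: names of elements declared with element content in the DTD.
ELEMENT_CONTENT = {"note"}

WHITESPACE = " \t\r\n"


def strip_element_content_whitespace(root: ET.Element) -> ET.Element:
    """Apply Rule 3: drop whitespace-only text inside element content."""
    for el in root.iter():
        if el.tag not in ELEMENT_CONTENT:
            continue  # mixed content: leave its whitespace alone
        if el.text is not None and not el.text.strip(WHITESPACE):
            el.text = None  # whitespace before the first child
        for child in el:
            if child.tail is not None and not child.tail.strip(WHITESPACE):
                child.tail = None  # whitespace after each child
    return root
```

Without the DTD information an application cannot tell element content from mixed content, which is why this step only applies to validated input.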
RULE 4. Line-end codes are discarded when preceded by a hard
or soft ('°') hyphen (and a soft hyphen is also discarded).
Remaining line-end codes are treated as spaces.
end code sep°[CR]
A line-end code separates lines.
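
Rule 4 might be sketched in Python like this. It assumes line-ends are already normalized to "\n" (Rule 1) and that the soft hyphen is the character U+00AD; both are assumptions of the example, as is the function name.

```python
import re

SOFT_HYPHEN = "\u00ad"  # assumed code point for the soft hyphen

# An optional hard or soft hyphen, followed by a line-end.
HYPHEN_LINE_END = re.compile(f"([-{SOFT_HYPHEN}])?\n")


def resolve_line_ends(text: str) -> str:
    """Apply Rule 4 in a single left-to-right pass."""
    def replace(match: re.Match) -> str:
        hyphen = match.group(1)
        if hyphen == "-":
            return "-"   # hard hyphen stays, line-end is discarded
        if hyphen == SOFT_HYPHEN:
            return ""    # soft hyphen and line-end are both discarded
        return " "       # any other line-end becomes a space
    return HYPHEN_LINE_END.sub(replace, text)
```

Doing the work in one pass means a line-end is judged only by the character that preceded it in the input, not by anything produced by an earlier replacement.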
RULE 5. Consecutive whitespace characters (including translated
line-end codes) are reduced to a single space, except in preserved mode.
These lines are divided by a space[SP][CR]
These lines are divided by a space and carriage return.
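
Rule 5 is simple enough to sketch in a few lines of Python. The sketch assumes the earlier rules have run, so remaining line-ends are already spaces, and it models "preserved mode" with a hypothetical boolean parameter.

```python
import re

WHITESPACE_RUN = re.compile(r"[ \t]+")


def collapse_whitespace(text: str, preserve: bool = False) -> str:
    """Apply Rule 5: reduce each run of spaces and tabs to one space."""
    if preserve:
        return text  # preserved content is passed through untouched
    return WHITESPACE_RUN.sub(" ", text)
```

For instance, a trailing space followed by a translated line-end gives two adjacent spaces, which this step reduces to one.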
Neil Bradley - Author of The Concise SGML Companion.
neil at bradley.co.uk
xml-dev: A list for W3C XML Developers
Archived as: http://www.lists.ic.ac.uk/hypermail/xml-dev/