5 Whitespace Rules

Sat Aug 9 12:15:01 BST 1997

I think it's time to pin down some rules or guidelines regarding 
the use of whitespace. I am not suggesting that the following is 
exhaustive or totally unambiguous, but maybe it is a starting point 
for discussion. I would really like to see a small list of rules such 
as the following defined, as I am sure it would help avoid the potentially 
damaging confusion that arises when products arrive and prove to be 
incompatible.

One of the problems of defining rules for XML has been the grouping 
of line-end codes with space separating characters under the 'S' 
rule. By separating these concepts, it is quite easy to define rules 
that are both backward compatible with SGML and HTML (very important 
in its own right) and also intuitive.

While the idea of ignoring all line-end codes and manually inserting spaces at 
the start of each line to compensate is attractive at first sight, 
it is certainly not intuitive, and there are plenty of text files in existence 
(including SGML and HTML files, of course) that do not follow this 
convention.

--------------------
An application should remove or transform whitespace characters 
received from the XML-processor according to the following 5 rules:

RULE 1. Every CR and LF code is regarded as a line-end signal, except 
when it immediately follows the other code ([CR][LF] or [LF][CR]), in which 
case it is discarded (and is also ignored, so has no effect on 
calculations for the next character). This rule applies even in 'preserved' content.

/*
This rule standardizes input from documents prepared on Mac, Unix and 
MS-DOS/Windows platforms.

[CR] ---> line-end
[LF] ---> line-end
[CR][LF] ---> line-end
[LF][LF] ---> line-end, line-end
[CR][CR] ---> line-end, line-end
[CR][LF][CR][LF] ---> line-end, line-end (because both LFs are 
ignored)

By including this rule in preserved content, we avoid alternate blank 
lines appearing in documents prepared on an MS-DOS system but viewed 
on another system.
*/
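
As a rough illustration, rule 1 might be sketched in Python as follows. The
function name `normalize_line_ends` is hypothetical, and "\n" is used here
only as a stand-in for the abstract line-end signal. The sketch assumes that
a discarded code resets the state, so the character after a [CR][LF] pair is
judged as if nothing preceded it.

```python
def normalize_line_ends(text: str) -> str:
    """Turn CR, LF, CR LF and LF CR into a single line-end ("\\n")."""
    out = []
    prev = None  # the CR or LF most recently kept as a line-end, if any
    for ch in text:
        if ch in "\r\n":
            if prev is not None and prev != ch:
                # Second half of a CR LF or LF CR pair: discard it, and
                # forget it so it has no effect on the next character.
                prev = None
                continue
            out.append("\n")
            prev = ch
        else:
            out.append(ch)
            prev = None
    return "".join(out)
```

Because the rule applies even in preserved content, a function like this
would run before any of the other rules are considered.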

RULE 2. A line-end code (or codes) immediately following a start-tag, PI or 
declaration, or immediately preceding an end-tag, is discarded (except in 
preserved content).

/*
 <note>[CR][CR]<p>[CR]This is a para in a note.[CR]</p>

becomes:

 <note><p>This is a para in a note.</p>

But the CRs below are not removed (they are later converted to a space - see rule 
4):

 <p>Here is an[CR]
 <em>emphasised</em>[CR]
 word.</p>

becomes:

 <p>Here is an <em>emphasised</em> word.</p>  
*/
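
One possible sketch of rule 2, in Python, operating on raw markup whose
line-ends have already been reduced to "\n". The names below are
hypothetical, and the regular expressions are a simplification: they assume
no ">" appears inside attribute values or declarations, they do not treat
empty-element tags as start-tags, and they take no account of preserved
content.

```python
import re

# A start-tag, PI or declaration (anything opening with "<" that is not an
# end-tag and does not close with "/>"), followed by one or more line-ends.
AFTER_OPEN = re.compile(r"(<[^/>][^>]*(?<!/)>)\n+")

# One or more line-ends followed directly by an end-tag.
BEFORE_END = re.compile(r"\n+(</[^>]*>)")


def drop_line_ends_at_tags(markup: str) -> str:
    """Discard line-ends just after an opening construct or just before an end-tag."""
    markup = AFTER_OPEN.sub(r"\1", markup)
    return BEFORE_END.sub(r"\1", markup)
```

Line-ends in the middle of mixed content, as in the second example above,
are left alone for rule 4 to deal with.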

RULE 3. All other whitespace in element content is discarded.

/*
 <note>[SP][TAB]<p>This is a para in a note...

becomes (in validated input):

 <note><p>This is a para in a note...

Note that only the presence of spaces and tabs in element content, 
which is not common, will cause discrepancies between validated and 
non-validated processing.
*/
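
A minimal sketch of rule 3, assuming the application already knows (from a
DTD, say) which elements have element content. Here that knowledge is passed
in as a plain set of names, `element_content_names`, which is a hypothetical
stand-in, and Python's `xml.etree.ElementTree` is used purely for
illustration.

```python
import xml.etree.ElementTree as ET


def strip_element_content_whitespace(root, element_content_names):
    """Drop whitespace-only text directly inside elements with element content."""
    for el in root.iter():
        if el.tag not in element_content_names:
            continue  # mixed or unknown content: leave the whitespace alone
        if el.text is not None and not el.text.strip():
            el.text = None  # whitespace before the first child
        for child in el:
            if child.tail is not None and not child.tail.strip():
                child.tail = None  # whitespace after each child
    return root


root = ET.fromstring("<note> \t<p>This is a para in a note.</p> </note>")
strip_element_content_whitespace(root, {"note"})
result = ET.tostring(root, encoding="unicode")
```

Without the set of names there is nothing to tell element content from mixed
content, which is why validated and non-validated processing can differ here.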

RULE 4. Line-end codes are discarded when preceded by a hard 
or soft ('&#173;') hyphen (and a soft hyphen is also discarded).
Remaining line-end codes are treated as spaces.

/*
 A[CR]
 line-[CR]
 end code sep&#173;[CR]
 arates lines.

becomes:

 A line-end code separates lines.
*/
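
Rule 4 could be sketched in Python along the following lines. The function
name is hypothetical, line-ends are assumed to have been reduced to "\n"
already, and the soft hyphen is assumed to be the ISO 8859-1 / Unicode
character U+00AD (decimal 173).

```python
SOFT_HYPHEN = "\u00ad"  # assumption: soft hyphen is U+00AD (decimal 173)


def resolve_line_ends(text: str) -> str:
    """Apply rule 4 to text whose line-ends are already "\\n"."""
    # A soft hyphen before a line-end disappears along with the line-end.
    text = text.replace(SOFT_HYPHEN + "\n", "")
    # A hard hyphen is kept, but the line-end after it is dropped.
    text = text.replace("-\n", "-")
    # Every remaining line-end is treated as a space.
    return text.replace("\n", " ")
```

The soft-hyphen case is handled first so that the later replacements never
see a line-end that should have vanished together with its hyphen.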

RULE 5. Consecutive whitespace characters (including translated 
line-end codes) are reduced to a single space, except in preserved content.

/*
 These lines are divided by a space[SP][CR]
 and carriage[SP][TAB][SP]return.

becomes:

 These lines are divided by a space and carriage return.
*/
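
Rule 5 is the simplest to sketch. The Python below is only an illustration:
the function name and the `preserve` flag are hypothetical, and whitespace is
taken to mean space, tab, CR and LF.

```python
import re

WHITESPACE_RUN = re.compile(r"[ \t\r\n]+")


def collapse_whitespace(text: str, preserve: bool = False) -> str:
    """Reduce each run of whitespace to one space, unless content is preserved."""
    if preserve:
        return text  # preserved content is passed through untouched
    return WHITESPACE_RUN.sub(" ", text)
```

Including the line-end characters in the pattern means the result is the same
whether or not rule 4 has already turned them into spaces.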
------------------------------

-----------------------------------------------
Neil Bradley - Author of The Concise SGML Companion.
neil at bradley.co.uk
www.bradley.co.uk

xml-dev: A list for W3C XML Developers
Archived as: http://www.lists.ic.ac.uk/hypermail/xml-dev/