Whitespace rules (v2)

Neil Bradley neil at bradley.co.uk
Mon Aug 11 11:48:30 BST 1997


Due to some useful feedback, and further thoughts of my own, I would 
like to amend my list of 5 whitespace rules in a few respects.

For people who read the previous set of rules, the corrections are:

a) block-enclosing elements must be identified via a list or style sheet
b) PI, comment and empty element processing has changed completely
c) all rules explicitly apply to both validating and non-validating applications
d) the rules are explicitly to be applied in sequence

The new rules can be summarized as:

1. Normalize line-end codes
2. Remove block surrounding whitespace
3. Remove leading/trailing block line-ends
4. Join lines and de-hyphenate
5. Remove surplus spaces in text

------WHITESPACE RULES------

A formatting application should remove or transform whitespace characters 
received from the XML-processor according to the following 5
rules. These rules are to be applied in sequence, by both validating and 
non-validating applications.

Note 1: PI's, comments and empty elements may be removed at any point 
in the process. 

Note 2: in some cases, 'line-end' codes (CR and LF) are distinguished 
from 'spacing' characters (SP and TAB), but the term 'whitespace' 
continues to indicate all these characters.


----------
RULE 1. Every line-end code is regarded as a line terminator, except
when it immediately follows the other code ([CR] following [LF] or 
[LF] following [CR]), in which case it is discarded (and is also
ignored, so has no effect on calculations for the next character).
This rule also applies in 'preserved' content.
---
Note: this rule standardizes input from documents prepared on Mac, Unix and
MS-DOS/Windows platforms.

[CR] ---> line-end
[LF] ---> line-end
[CR][LF] ---> line-end
[LF][CR] ---> line-end
[LF][LF] ---> line-end, line-end
[CR][CR] ---> line-end, line-end
[CR][LF][CR][LF] ---> line-end, line-end (because both LF's are 
ignored)

Note: by including this rule in preserved content, we avoid alternate blank
lines appearing in documents prepared on an MS-DOS system but viewed
on another system.
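
As a rough illustration of rule 1, here is a minimal Python sketch. The
function name normalize_line_ends is my own, not part of any XML tool, and
the choice of "\n" as the internal 'line-end' marker is an assumption. It
reads the rule pair-wise: a CR/LF or LF/CR pair counts as one line-end,
and the discarded second code never pairs with the code that follows it.

```python
import re

# One line-end is either a two-code pair (CR LF or LF CR) or a single code.
# Alternation order matters: pairs are tried before single codes.
_LINE_END = re.compile(r"\r\n|\n\r|\r|\n")


def normalize_line_ends(text: str) -> str:
    """Replace every line-end (Mac, Unix or MS-DOS style) with '\n'."""
    return _LINE_END.sub("\n", text)


if __name__ == "__main__":
    # [CR][LF][CR][LF] gives two line-ends, as in the table above.
    print(repr(normalize_line_ends("one\r\n\r\ntwo")))
```

Because the regular expression scans left to right without overlapping
matches, [LF][LF] and [CR][CR] each still give two line-ends.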


----------
RULE 2. All whitespace preceding the start-tag and following the end-tag 
of a 'block enclosing' element is discarded.
---
Note: a non-validating application must refer to a style sheet or
configuration file to identify 'block enclosing' elements (perhaps by 
applying this rule to elements not specified as in-line elements).
As a validating application cannot easily determine this rule from the
content model (the first mixed content element in the hierarchy is 
block enclosing, as well as all outer layers), it may choose the same 
approach. 


Note:

 <chapter>[SP]<note>[SP][TAB]<p>This is a[SP]<em>para</em>...

becomes:

 <chapter><note><p>This is a[SP]<em>para</em>...

and:

 <p>Para 1.</p>[CR]
 <p>Para 2.</p>

becomes:

 <p>Para 1.</p><p>Para 2.</p>

Note: If PI's, comments or empty elements remain in the data stream,
they are deemed transparent to this process, so:

 [SP]<!--comment--><p>Some text...

becomes:

 <!--comment--><p>Some text...
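
A possible way to apply rule 2, sketched in Python under some assumptions:
the markup is well formed, contains no CDATA sections or DOCTYPE, and the
set of 'block enclosing' element names is supplied by the caller (standing
in for the list or style sheet). The names strip_block_whitespace and
block_names are hypothetical. Comments, PI's and empty elements are
skipped over when looking for the neighbouring tag.

```python
import re

WS = " \t\r\n"
# Split markup into comments, PIs, tags and runs of text.
TOKEN = re.compile(r"<!--.*?-->|<\?.*?\?>|<[^>]*>|[^<]+", re.DOTALL)


def _is_transparent(tok: str) -> bool:
    """Comments, PIs and empty elements are transparent to rule 2."""
    return tok.startswith(("<!--", "<?")) or (
        tok.startswith("<") and tok.endswith("/>"))


def _tag_name(tok: str):
    m = re.match(r"</?([^\s/>]+)", tok)
    return m.group(1) if m else None


def strip_block_whitespace(markup: str, block_names: set) -> str:
    tokens = TOKEN.findall(markup)

    def neighbour(i: int, step: int) -> str:
        # Nearest token in the given direction, ignoring transparent ones.
        j = i + step
        while 0 <= j < len(tokens) and _is_transparent(tokens[j]):
            j += step
        return tokens[j] if 0 <= j < len(tokens) else ""

    out = []
    for i, tok in enumerate(tokens):
        if not tok.startswith("<"):          # a run of text
            nxt, prev = neighbour(i, 1), neighbour(i, -1)
            if (nxt.startswith("<") and not nxt.startswith("</")
                    and _tag_name(nxt) in block_names):
                tok = tok.rstrip(WS)         # before a block start-tag
            if prev.startswith("</") and _tag_name(prev) in block_names:
                tok = tok.lstrip(WS)         # after a block end-tag
        out.append(tok)
    return "".join(out)


if __name__ == "__main__":
    blocks = {"chapter", "note", "p"}
    print(strip_block_whitespace("<p>Para 1.</p>\n<p>Para 2.</p>", blocks))
```

The space before <em> in the first example survives because 'em' is not
in the block set.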


----------
RULE 3. A sequence of one or more line-end codes immediately
following a start-tag, or immediately preceding an end-tag, is
discarded (except in preserved content).
---
Note:

 <note>[CR]
 <p>[CR]
 This is a para in a note.[CR]
 </p>

becomes:

 <note><p>This is a para in a note.</p>

Note: If PI's, comments or empty elements remain in the data stream, 
they are deemed transparent to this process, so:

 <p><!-- a comment -->[CR]
 some text...

becomes:

 <p><!-- a comment -->some text...
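
Rule 3 could be sketched along the following lines in Python. This
assumes rule 1 has already run, so every line-end is a single "\n", and
that the input is well-formed markup without CDATA sections or DOCTYPE.
The function name strip_edge_line_ends is hypothetical, and preserved
content is not handled here.

```python
import re

# Split markup into comments, PIs, tags and runs of text.
TOKEN = re.compile(r"<!--.*?-->|<\?.*?\?>|<[^>]*>|[^<]+", re.DOTALL)


def _kind(tok: str) -> str:
    """Classify a token as 'start', 'end', 'text' or 'skip'."""
    if not tok.startswith("<"):
        # Text made only of line-ends does not hide the tag behind it.
        return "skip" if tok.strip("\n") == "" else "text"
    if tok.startswith(("<!--", "<?")) or tok.endswith("/>"):
        return "skip"                        # transparent to this rule
    return "end" if tok.startswith("</") else "start"


def strip_edge_line_ends(markup: str) -> str:
    tokens = TOKEN.findall(markup)

    def neighbour_kind(i: int, step: int):
        j = i + step
        while 0 <= j < len(tokens) and _kind(tokens[j]) == "skip":
            j += step
        return _kind(tokens[j]) if 0 <= j < len(tokens) else None

    out = []
    for i, tok in enumerate(tokens):
        if not tok.startswith("<"):          # a run of text
            if neighbour_kind(i, -1) == "start":
                tok = tok.lstrip("\n")       # line-ends after a start-tag
            if neighbour_kind(i, 1) == "end":
                tok = tok.rstrip("\n")       # line-ends before an end-tag
        out.append(tok)
    return "".join(out)


if __name__ == "__main__":
    print(strip_edge_line_ends(
        "<note>\n<p>\nThis is a para in a note.\n</p>"))
```

Only line-ends are stripped; a space or tab next to the tag stops the
stripping, since the rule says 'immediately'.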


----------
RULE 4. A remaining line-end code is converted into a space, except when it is 
preceded by a normal (hard) hyphen, or by a soft hyphen ('&#173;'), 
in which case it is removed (a soft hyphen is also then removed). 
---
Note:

 A[CR]
 line-[CR]
 end code sep&#173;[CR]
 arates lines.

becomes:

 A line-end code separates lines.

Note: PI's, comments and empty elements are treated as text, so:

 <p>Some[CR]
 <!-- comment -->[CR]
 text.

becomes:

 <p>Some[SP]<!-- comment -->[SP]text.

Note: if a space is required after the hyphen, it must be inserted before the 
line-end:

 4 -[SP][CR]
 3 = 1

becomes:

 4 -[SP][SP]3 = 1 
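
A minimal Python sketch of rule 4 follows. It assumes line-ends have
already been normalized to "\n" by rule 1, and that the soft hyphen
arrives from the XML-processor as the single character U+00AD rather
than as a character reference. The name join_lines is my own.

```python
import re

SOFT_HYPHEN = "\u00ad"


def join_lines(text: str) -> str:
    """Join lines, de-hyphenating at hard and soft hyphens."""
    # Soft hyphen before a line-end: drop both.
    text = text.replace(SOFT_HYPHEN + "\n", "")
    # Hard hyphen before a line-end: drop the line-end, keep the hyphen.
    text = re.sub(r"(?<=-)\n", "", text)
    # Every remaining line-end becomes a space.
    return text.replace("\n", " ")


if __name__ == "__main__":
    print(join_lines("A\nline-\nend code sep\u00ad\narates lines."))
    print(repr(join_lines("4 - \n3 = 1")))
```

Comments and PI's need no special case here: they are ordinary text as
far as this function is concerned, so a line-end on either side of them
simply becomes a space.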


----------
RULE 5. Consecutive whitespace characters (including translated 
line-end codes) are reduced to a single space, except in preserved
mode.
---
Note:

 4 -[SP][SP]3 = 1 

becomes:

 4 -[SP]3 = 1 

Note: if PI's, comments or empty elements are removed after rule 5:

 <p>Some[SP]<!-- comment -->[SP]text.

has already become:

 <p>Some[SP][SP]text.

but now becomes:

 <p>Some[SP]text.

Note: Multiple spaces can be preserved using the non-break space
character ('&#160;').

 <p>Some&#160;&#160;&#160;spaces.
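
Rule 5 might look like this in Python. The function name collapse_spaces
is hypothetical, preserved mode is not handled, and treating a lone TAB
as a run of one (so it also becomes a space) is my reading of the rule
rather than something stated above.

```python
import re

# Only SP, TAB, CR and LF count as whitespace here. Python's \s is
# deliberately avoided because it also matches the non-break space.
_WS_RUN = re.compile(r"[ \t\r\n]+")


def collapse_spaces(text: str) -> str:
    """Reduce each run of whitespace characters to a single space."""
    return _WS_RUN.sub(" ", text)


if __name__ == "__main__":
    print(collapse_spaces("4 -  3 = 1"))
    # Non-break spaces (U+00A0) are left alone.
    print(repr(collapse_spaces("Some\u00a0\u00a0\u00a0spaces.")))
```

If comments are removed after this step, calling the function once more
on the result removes the doubled space they leave behind.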
------------------------------

-----------------------------------------------
Neil Bradley - Author of The Concise SGML Companion.
neil at bradley.co.uk
www.bradley.co.uk

xml-dev: A list for W3C XML Developers
Archived as: http://www.lists.ic.ac.uk/hypermail/xml-dev/