Whitespace rules (v2)

Sat Aug 16 07:27:37 BST 1997

On Sun, 10 Aug 1997, Neil Bradley wrote:

> [...]
> RULE 2. All whitespace preceding the start-tag and following the end-tag 
> of a 'block enclosing' element is discarded.
> ---
> Note: a non-validating application must refer to a style sheet or
> configuration file to identify 'block enclosing' elements (perhaps by 
> applying this rule to elements not specified as in-line elements).

No -- "blockness" is not at all the same as element content.
For example, you have to allow for a run-in heading, which starts out
looking like an HTML H3 (say) except that the rest of the paragraph
follows on the same line.  So it isn't a block in the paragraph sense.

> As a validating application cannot easily determine this rule from the
> content model (the first mixed content element in the hierarchy is 
> block enclosing, as well as all outer layers), it may choose the same 
> approach. 

I think this is too complicated, as well as being not 100% right.
I don't think there's a single "right" solution.  This is why it's
best to allow the parser to pass _all_ whitespace back to the application,
although it is certainly useful if a DTD-aware parser, even if it isn't
validating, distinguishes element content whitespace from PCDATA whitespace
in some way.

Having the parser do more than this is a bad idea, I think.
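
To make the distinction concrete, here is a minimal sketch in Python using
the standard xml.parsers.expat module.  It passes every text run through
unchanged and only labels whitespace-only runs whose parent is known to have
element content.  The function name classify_text and the element_content
set are assumptions for illustration; a DTD-aware parser would derive that
set from the content models rather than being handed it.

```python
import xml.parsers.expat

XML_WHITESPACE = " \t\r\n"

def classify_text(xml_text, element_content):
    """Return (kind, text) pairs; no whitespace is ever discarded."""
    events = []
    stack = []  # names of the currently open elements

    def start(name, attrs):
        stack.append(name)

    def end(name):
        stack.pop()

    def chars(data):
        parent = stack[-1] if stack else None
        # Whitespace-only text inside an element-content parent is labelled,
        # not dropped: the application decides what to do with it.
        if data.strip(XML_WHITESPACE) == "" and parent in element_content:
            events.append(("element-whitespace", data))
        else:
            events.append(("pcdata", data))

    parser = xml.parsers.expat.ParserCreate()
    parser.buffer_text = True  # deliver each text run as one chunk
    parser.StartElementHandler = start
    parser.EndElementHandler = end
    parser.CharacterDataHandler = chars
    parser.Parse(xml_text, True)
    return events

sample = "<doc>\n  <p>Some <em>text</em> here</p>\n</doc>"
events = classify_text(sample, element_content={"doc"})
for kind, text in events:
    print(kind, repr(text))
```

The application still sees every character; it simply gets enough
information to decide for itself which whitespace matters.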

> Note: If PI's, comments or empty elements remain in the data stream,
> they are deemed transparent to this process, so:
>  [SP]<!--comment--><p>Some text...
> 
> becomes:
> 
>  <!--comment--><p>Some text...

Note that if you have a very large comment, you might need a lot of
lookahead here.

> RULE 3. A sequence of one or more line-end codes immediately
> following a start-tag, or immediately preceding an end-tag, are
> discarded (except in preserved content).

This means that
<Paragraph>This is<Emphasis>
very
</Emphasis>strange.</Paragraph>

becomes
<Paragraph>This is<Emphasis>very</Emphasis>strange.</Paragraph>

or, if you format without distinguishing emphasis,
<Paragraph>This isverystrange.</Paragraph>

which I don't think is what you want.

But SGML itself is broken in this regard.
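
The effect of RULE 3 on the example above can be reproduced with a rough
sketch in Python.  This is a regex toy, not a parser: apply_rule3 and
strip_tags are hypothetical names, and the patterns assume simple tags with
no ">" or "/" inside attribute values.

```python
import re

LINE_ENDS = r"(?:\r\n|\r|\n)+"

def apply_rule3(markup):
    """Drop line ends right after a start-tag or right before an end-tag."""
    # Start-tags only: the pattern excludes "/" so end-tags and
    # empty-element tags are left alone.
    markup = re.sub(r"(<[A-Za-z][^<>/]*>)" + LINE_ENDS, r"\1", markup)
    markup = re.sub(LINE_ENDS + r"(</[^<>]+>)", r"\1", markup)
    return markup

def strip_tags(markup):
    """Crude rendering that ignores all markup, emphasis included."""
    return re.sub(r"<[^<>]+>", "", markup)

sample = "<Paragraph>This is<Emphasis>\nvery\n</Emphasis>strange.</Paragraph>"
collapsed = apply_rule3(sample)
print(collapsed)
print(strip_tags(collapsed))
```

The second line printed has the three words run together, which is the
problem described above.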

> RULE 4.  A remaining line-end code is converted into a space, except when it is 
> preceded by a normal (hard) hyphen, or by a soft hyphen ('&#176;'), 
> in which case it is removed (a soft hyphen is also then removed). 
> ---
> Note:
> 
>  A[CR]
>  line-[CR]
>  end code sep&#176;[CR]
>  erates lines.
> 
> becomes:
> 
>  A line-end code seperates lines.

Well, note that there is no hyphen in that paragraph!!
The character "-" in ISO 8859-1 (Latin 1) and ASCII is _not_ a hyphen.
It is a minus sign.

The hyphen is 0255 octal (173 decimal).  It is a hyphen, not a soft hyphen.
There is no soft hyphen in Latin 1.

I don't have the necessary copy of Unicode in front of me, but last time
I checked (Unicode 1.1) it was the same in this regard, and also in having
the ` character be a spacing grave accent, not a single quote.
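
For anyone who wants to check the code points mentioned in this thread, a
small sketch using Python's standard unicodedata module prints the names
that module carries.  Note the assumption: these come from whatever Unicode
version ships with the Python in use, which is much later than Unicode 1.1,
so they may not match the tables being discussed here.

```python
import unicodedata

# Code points mentioned in this thread: "-", 173, 176, "`" and 160.
CODE_POINTS = [0x2D, 0xAD, 0xB0, 0x60, 0xA0]

def describe(code_point):
    """Return (code point, character name) from the unicodedata tables."""
    return (code_point, unicodedata.name(chr(code_point)))

names = dict(describe(cp) for cp in CODE_POINTS)
for cp, name in names.items():
    print(f"{cp:3d} decimal, {cp:o} octal: {name}")
```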

This should be done by applications.  I wouldn't want your message:
    ----------
    RULE 5. Consecutive whitespace characters (including translated 
turning into
    ----------RULE 5. Consecutive whitespace characters (including translated 
for example.
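
A minimal sketch of the hard-hyphen part of RULE 4 in Python shows how this
happens.  The name apply_rule4 is hypothetical, soft-hyphen handling is left
out, and the sketch assumes that every "-" before a line end counts as a
hyphen, which is exactly the assumption in question.

```python
import re

def apply_rule4(text):
    """Turn line ends into spaces, except after "-", where they vanish."""
    # Normalise CR LF and bare CR to LF first.
    text = re.sub(r"\r\n|\r", "\n", text)
    # A line end preceded by "-" is removed; the "-" itself is kept.
    text = text.replace("-\n", "-")
    # Every remaining line end becomes a single space.
    return text.replace("\n", " ")

ruled = "----------\nRULE 5. Consecutive whitespace characters"
print(apply_rule4(ruled))
print(apply_rule4("A\nline-\nend code"))
```

The rule cannot tell a word broken at a hyphen from a row of dashes used as
a separator; only the application knows which one it is looking at.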

> Note: Multiple spaces can be preserved using the non-break space
> character ('&#160;').
> 
>  <p>Some&#160;&#160;&#160;spaces.

Er, is this defined in Unicode or in ISO 10646??

Lee

-- 
Liam Quin --  the barefoot typographer -- Toronto
lq-text: freely available Unix text retrieval

email address:
l i a m q u i n    at host:    i n t e r l o g   dot   c o m

xml-dev: A list for W3C XML Developers
Archived as: http://www.lists.ic.ac.uk/hypermail/xml-dev/
To unsubscribe, send to majordomo at ic.ac.uk the following message;
unsubscribe xml-dev
List coordinator, Henry Rzepa (rzepa at ic.ac.uk)