Whitespace rules (v2)
Liam Quin
liamquin at interlog.com
Sat Aug 16 07:27:37 BST 1997
On Sun, 10 Aug 1997, Neil Bradley wrote:
> [...]
> RULE 2. All whitespace preceding the start-tag and following the end-tag
> of a 'block enclosing' element is discarded.
> ---
> Note: a non-validating application must refer to a style sheet or
> configuration file to identify 'block enclosing' elements (perhaps by
> applying this rule to elements not specified as in-line elements).
No -- "blockness" is not at all the same as element content.
For example, you have to allow for a run-in heading, which starts out
looking like an HTML H3 (say) except that the rest of the paragraph
follows on the same line. So it isn't a block in the paragraph sense.
> As a validating application cannot easily determine this rule from the
> content model (the first mixed content element in the hierarchy is
> block enclosing, as well as all outer layers), it may choose the same
> approach.
I think this is too complicated, as well as being not 100% right.
I don't think there's a single "right" solution. This is why it's
best to allow the parser to pass _all_ whitespace back to the application,
although it is certainly useful if a DTD-aware parser, even if it isn't
validating, distinguishes element content whitespace from PCDATA whitespace
in some way.
More than this is a bad idea, I think.
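
To make the distinction concrete, here is a rough sketch in Python of a
layer that passes every whitespace run through and merely labels it. The
event tuples, the ELEMENT_CONTENT set and the classify function are all
hypothetical names made up for this illustration; no real parser API is
implied, and a DTD-aware parser would derive the set from the content
models rather than hard-code it.

```python
# Hypothetical: elements whose DTD content model allows only child elements.
ELEMENT_CONTENT = {"doc", "section"}

def classify(events):
    """Yield every event, re-tagging whitespace-only text.

    events: iterable of ("start", name), ("end", name) or ("text", data).
    Whitespace-only text becomes "ws-element" when its parent has element
    content and "ws-pcdata" otherwise. Nothing is ever dropped.
    """
    stack = []
    for kind, value in events:
        if kind == "start":
            stack.append(value)
        elif kind == "end":
            stack.pop()
        elif kind == "text" and value != "" and value.strip() == "":
            parent = stack[-1] if stack else None
            kind = "ws-element" if parent in ELEMENT_CONTENT else "ws-pcdata"
        yield kind, value

events = [
    ("start", "doc"), ("text", "\n  "),
    ("start", "p"), ("text", "a"), ("text", " "), ("end", "p"),
    ("end", "doc"),
]
result = list(classify(events))
for event in result:
    print(event)
```

The application still sees all the whitespace and decides for itself what
to keep; the parser only says where each run occurred.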
> Note: If PI's, comments or empty elements remain in the data stream,
> they are deemed transparent to this process, so:
> [SP]<!--comment--><p>Some text...
>
> becomes:
>
> <!--comment--><p>Some text...
Note that if you have a very large comment, you might need a lot of
lookahead here.
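
A small sketch of why the lookahead grows, in Python. The event tuples and
the TRANSPARENT and BLOCK sets are assumptions invented for this example,
not part of the proposal: whitespace, and any comment or PI that follows it,
has to be held back until the next start-tag shows whether the whitespace is
to be discarded.

```python
TRANSPARENT = {"comment", "pi"}   # kinds the rule treats as transparent
BLOCK = {"p"}                     # hypothetical 'block enclosing' elements

def strip_before_block(events):
    """Drop whitespace preceding a block start-tag, looking past comments
    and PIs. Returns (output_events, largest_buffer_in_characters)."""
    out, pending, max_buf = [], [], 0
    for kind, value in events:
        is_ws = kind == "text" and value != "" and value.strip() == ""
        if is_ws or (pending and kind in TRANSPARENT):
            pending.append((kind, value))        # cannot decide yet
            max_buf = max(max_buf, sum(len(v) for _, v in pending))
            continue
        if kind == "start" and value in BLOCK:
            # Discard the held whitespace, keep the comments and PIs.
            pending = [e for e in pending if e[0] in TRANSPARENT]
        out.extend(pending)
        pending = []
        out.append((kind, value))
    out.extend(pending)
    return out, max_buf

small, small_buf = strip_before_block(
    [("text", " "), ("comment", "comment"), ("start", "p"), ("text", "Some text...")]
)
big, big_buf = strip_before_block(
    [("text", " "), ("comment", "x" * 10000), ("start", "p")]
)
print(small, small_buf)
print(big_buf)
```

The buffer has to hold the whole comment as well as the space before it, so
its size is bounded only by the size of the largest comment.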
> RULE 3. A sequence of one or more line-end codes immediately
> following a start-tag, or immediately preceding an end-tag, is
> discarded (except in preserved content).
This means that
<Paragraph>This is<Emphasis>
very
</Emphasis>strange.</Paragraph>
becomes
<Paragraph>This is<Emphasis>very</Emphasis>strange.</Paragraph>
or, if you format without distinguishing emphasis,
<Paragraph>This isverystrange.</Paragraph>
which I don't think is what you want.
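
The effect is easy to reproduce. The following Python sketch applies RULE 3
as I read it, using regular expressions; apply_rule_3 and drop_emphasis are
names made up for this illustration, and the patterns are a crude
approximation that ignores comments, PIs and preserved content.

```python
import re

def apply_rule_3(markup):
    """Discard line ends right after a start-tag or right before an end-tag."""
    markup = re.sub(r"(<[^/!?][^>]*>)[\r\n]+", r"\1", markup)  # after start-tag
    markup = re.sub(r"[\r\n]+(</[^>]+>)", r"\1", markup)       # before end-tag
    return markup

def drop_emphasis(markup):
    """Stand-in for a formatter that does not distinguish emphasis."""
    return re.sub(r"</?Emphasis>", "", markup)

source = "<Paragraph>This is<Emphasis>\nvery\n</Emphasis>strange.</Paragraph>"
collapsed = apply_rule_3(source)
flattened = drop_emphasis(collapsed)
print(collapsed)
print(flattened)
```

Both line ends are gone before any rule gets a chance to turn them into
spaces, so the three words run together.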
But SGML itself is broken in this regard.
> RULE 4. A remaining line-end code is converted into a space, except when it is
> preceded by a normal (hard) hyphen, or by a soft hyphen ('°'),
> in which case it is removed (a soft hyphen is also then removed).
> ---
> Note:
>
> A[CR]
> line-[CR]
> end code sep°[CR]
> arates lines.
>
> becomes:
>
> A line-end code separates lines.
Well, note that there is no hyphen in that paragraph!!
The character "-" in ISO 8859-1 (Latin 1) and ASCII is _not_ a hyphen.
It is a minus sign.
The hyphen is 0255 octal (173 decimal). It is a hyphen, not a soft hyphen.
There is no soft hyphen in Latin 1.
I don't have the necessary copy of Unicode in front of me, but last time
I checked (Unicode 1.1) it was the same in this regard, and also in having
the ` character be a spacing grave accent, not a single quote.
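
The code points involved can be checked with a few lines of Python. Note the
assumption: the names come from whatever Unicode database ships with the
Python in use, which is much later than Unicode 1.1, so this only shows how
the characters are catalogued there. The describe helper is a name invented
for this example.

```python
import unicodedata

def describe(ch):
    """Return (hex code point, name in Python's Unicode database)."""
    return hex(ord(ch)), unicodedata.name(ch)

# Octal 0255, decimal 173 and hex AD are the same code point.
same_code_point = (0o255 == 173 == 0xAD)

minus = describe("-")   # the ASCII / Latin 1 "-" character
grave = describe("`")   # the ` character
print(same_code_point)
print(minus)
print(grave)
```

The "-" key produces code point 0x2d, not 0xad, and the ` character is
listed as an accent rather than a quotation mark.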
This should be done by applications. I wouldn't want your message:
----------
RULE 5. Consecutive whitespace characters (including translated
turning into
----------RULE 5. Consecutive whitespace characters (including translated
for example.
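
A sketch of RULE 4 in Python shows the problem. The function name
apply_rule_4 is made up for this illustration, "\n" stands in for the
line-end code written [CR] in the proposal, and the soft hyphen is assumed
to be code point 0xAD as the proposal appears to intend.

```python
SOFT_HYPHEN = "\u00ad"  # assumption: the proposal's soft hyphen is 0xAD

def apply_rule_4(text):
    """Turn each line end into a space, unless a hyphen precedes it."""
    lines = text.split("\n")
    out = []
    for i, line in enumerate(lines):
        if i == len(lines) - 1:
            out.append(line)               # no line end after the last line
        elif line.endswith("-"):
            out.append(line)               # hard hyphen: drop the line end
        elif line.endswith(SOFT_HYPHEN):
            out.append(line[:-1])          # soft hyphen: drop both
        else:
            out.append(line + " ")         # ordinary case: line end -> space
    return "".join(out)

intended = apply_rule_4("A\nline-\nend code sep\u00ad\narates lines.")
mangled = apply_rule_4("----------\nRULE 5. Consecutive whitespace characters")
print(intended)
print(mangled)
```

The first call gives the result the proposal wants; the second glues a row
of dashes onto the heading below it, because a parser cannot tell a
typographic hyphen from a ruled line.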
> Note: Multiple spaces can be preserved using the non-break space
> character (' ').
>
> <p>Some   spaces.
Er, is this defined in Unicode or in ISO 10646??
Lee
--
Liam Quin -- the barefoot typographer -- Toronto
lq-text: freely available Unix text retrieval
email address:
l i a m q u i n at host: i n t e r l o g dot c o m
xml-dev: A list for W3C XML Developers
Archived as: http://www.lists.ic.ac.uk/hypermail/xml-dev/
To unsubscribe, send to majordomo at ic.ac.uk the following message;
unsubscribe xml-dev
List coordinator, Henry Rzepa (rzepa at ic.ac.uk)