Whitespace rules (v2)
Neil Bradley
neil at bradley.co.uk
Sat Aug 16 19:52:00 BST 1997
Dear Liam,
Thanks for the feedback.
> > [...]
> > RULE 2. All whitespace preceding the start-tag and following the end-tag
> > of a 'block enclosing' element is discarded.
> > ---
> > Note: a non-validating application must refer to a style sheet or
> > configuration file to identify 'block enclosing' elements (perhaps by
> > applying this rule to elements not specified as in-line elements).
>
> No -- "blockness" is not at all the same as element content.
> For example, you have to allow for a run-in heading, which starts out
> looking like an HTML H3 (say) except that the rest of the paragraph
> follows on the same line. So it isn't a block in the paragraph sense.
>
> > As a validating application cannot easily determine this rule from the
> > content model (the first mixed content element in the hierarchy is
> > block enclosing, as well as all outer layers), it may choose the same
> > approach.
>
> I think this is too complicated, as well as being not 100% right.
> I don't think there's a single "right" solution. This is why it's
> best to allow the parser to pass _all_ whitespace back to the application,
> although it is certainly useful if a DTD-aware parser, even if it isn't
> validating, distinguishes element content whitespace from PCDATA whitespace
> in some way.
Note that these rules are intended for the application, not the
parser or any other part of the XML processor. As I state at the top
of the rules, "A formatting application should......according to the
following 5 rules".
> > Note: If PI's, comments or empty elements remain in the data stream,
> > they are deemed transparent to this process, so:
> > [SP]<!--comment--><p>Some text...
> >
> > becomes:
> >
> > <!--comment--><p>Some text...
>
> Note that if you have a very large comment, you might need a lot of
> lookahead here.
Actually no, because the application would already KNOW that it is
currently in block content.
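
To illustrate why no lookahead is needed, here is a minimal Python sketch. It assumes the application sees a stream of (kind, value) events and has a list of elements that contain only blocks; the names BLOCK_CONTAINERS and drop_block_whitespace, and the "body" and "section" elements, are hypothetical and not part of the rules. Whitespace is dropped as soon as it is seen, whatever comment or PI follows it.

```python
# Hypothetical configuration (e.g. from a style sheet): elements whose
# children are blocks rather than running text.
BLOCK_CONTAINERS = {"body", "section"}


def drop_block_whitespace(events):
    """Yield events, discarding whitespace-only text found in block content."""
    stack = []  # names of the currently open elements
    for kind, value in events:
        # The application already knows its context from the open elements.
        in_block_content = not stack or stack[-1] in BLOCK_CONTAINERS
        if kind == "text" and in_block_content and value.strip(" \t\r\n") == "":
            continue  # discarded at once; comments and PIs are never inspected
        if kind == "start":
            stack.append(value)
        elif kind == "end":
            stack.pop()
        yield kind, value


events = [
    ("start", "body"),
    ("text", " "),
    ("comment", "comment"),
    ("start", "p"),
    ("text", "Some text..."),
    ("end", "p"),
    ("text", "\n"),
    ("end", "body"),
]
result = list(drop_block_whitespace(events))
```

The comment passes through untouched, and the decision to drop the leading space is made before the comment is even read.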
> > RULE 3. A sequence of one or more line-end codes immediately
> > following a start-tag, or immediately preceding an end-tag, is
> > discarded (except in preserved content).
>
> This means that
> <Paragraph>This is<Emphasis>
> very
> </Emphasis>strange.</Paragraph>
>
> becomes
> <Paragraph>This is<Emphasis>very</Emphasis>strange.</Paragraph>
>
> or, if you format without distinguishing emphasis,
> <Paragraph>This isverystrange.</Paragraph>
>
> which I don't think is what you want.
>
> But SGML itself is broken in this regard.
I know, and it is impossible to cover all angles. I think your
example is one of the least likely things to happen in reality, and if
necessary document authors must be educated to avoid it.
I am open to other suggestions, of course. I am only trying to get
detailed discussions rolling. For example, we could get rid of both
rules 2 and 3, and improve rule 5 to say that all surrounding white
space is removed.
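
For anyone who wants to try the example, here is a rough Python sketch of rule 3 as currently worded. It is only an approximation: the regular expressions assume simple start-tags and end-tags (no empty elements, comments, PIs or preserved content), and the function names are made up for this illustration.

```python
import re

# A start-tag followed by one or more line-end codes.
START_TAG_THEN_EOL = re.compile(r"(<[^/!?>][^>]*>)[\r\n]+")
# One or more line-end codes followed by an end-tag.
EOL_THEN_END_TAG = re.compile(r"[\r\n]+(</[^>]+>)")


def apply_rule_3(markup):
    """Discard line-end codes just after a start-tag or just before an end-tag."""
    markup = START_TAG_THEN_EOL.sub(r"\1", markup)
    return EOL_THEN_END_TAG.sub(r"\1", markup)


def strip_tags(markup):
    """Format without distinguishing any elements: drop every tag."""
    return re.sub(r"<[^>]+>", "", markup)


source = "<Paragraph>This is<Emphasis>\nvery\n</Emphasis>strange.</Paragraph>"
after_rule_3 = apply_rule_3(source)
formatted = strip_tags(after_rule_3)
print(after_rule_3)
print(formatted)
```

This reproduces the run-together words from the example above, which is the case authors would need to avoid under the current wording.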
> > RULE 4. A remaining line-end code is converted into a space, except when it is
> > preceded by a normal (hard) hyphen, or by a soft hyphen ('°'),
> > in which case it is removed (a soft hyphen is also then removed).
> > ---
> > Note:
> >
> > A[CR]
> > line-[CR]
> > end code sep°[CR]
> > arates lines.
> >
> > becomes:
> >
> > A line-end code separates lines.
>
> Well, note that there is no hyphen in that paragraph!!
> The character "-" in ISO 8859-1 (Latin 1) and ASCII is _not_ a hyphen.
> It is a minus sign.
Well, most people in the past have used it as a hyphen in text
documents, which I think is the important point here.
Also, my source tells me that this character is the official ISO
hyphen - but my source may be wrong.
> The hyphen is 0255 octal (173 decimal). It is a hyphen, not a soft hyphen.
> There is no soft hyphen in Latin 1
OK. I will take your word on this. Again, my source of information may be wrong.
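
Whichever code point turns out to be correct, the mechanics of rule 4 can be sketched in a few lines of Python. This is an illustration under stated assumptions: the "hard" hyphen is taken to be the ASCII "-" that people actually type, and the soft hyphen is taken to be code 173 decimal (0255 octal), which is the name the Unicode database shipped with Python gives that code. The function name apply_rule_4 is hypothetical.

```python
import unicodedata

HARD_HYPHEN = "-"       # assumption: the ASCII character people type as a hyphen
SOFT_HYPHEN = "\u00ad"  # assumption: code 173 decimal / 0255 octal


def apply_rule_4(text):
    """Turn remaining line-end codes into spaces, except after a hyphen."""
    # Treat CR, LF and CR+LF alike.
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    # Soft hyphen + line end: both are removed.
    text = text.replace(SOFT_HYPHEN + "\n", "")
    # Hard hyphen + line end: only the line end is removed.
    text = text.replace(HARD_HYPHEN + "\n", HARD_HYPHEN)
    # Every other line end becomes a single space.
    return text.replace("\n", " ")


sample = "A\rline-\rend code sep\u00ad\rarates lines."
joined = apply_rule_4(sample)
print(joined)

# Names reported by the Unicode database bundled with Python.
hard_name = unicodedata.name(HARD_HYPHEN)
soft_name = unicodedata.name(SOFT_HYPHEN)
print(hard_name, "/", soft_name)
```

If a different hyphen character is agreed on, only the two constants at the top need to change.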
> I don't have the necessary copy of Unicode in front of me, but last time
> I checked (Unicode 1.1) it was the same in this regard, and also in having
> the ` character be a spacing grave accent, not a single quote.
>
> This should be done by applications. I wouldn't want your message:
It is being done by the application.
What "wouldn't you want your message:"?
> ----------
> RULE 5. Consecutive whitespace characters (including translated
> turning into
> ----------RULE 5. Consecutive whitespace characters (including translated
> for example.
>
> > Note: Multiple spaces can be preserved using the non-break space
> > character (' ').
> >
> > <p>Some   spaces.
> Er, is this defined in Unicode or in ISO 10646??
Don't know. I have it as a non-breaking space, which I am 'liberally'
interpreting here as a required space (if it can't be broken over
lines, it must be pretty important). If Unicode has a more explicit
required space character, then fine, let's use that.
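
As a sketch of how an application might honour this, the Python below collapses runs of ordinary whitespace to one space while leaving non-break spaces alone. The wording of rule 5 is cut short in the quote above, so the "collapse to a single space" behaviour is an assumption here, as are the use of U+00A0 for the non-break space and the function name collapse_whitespace.

```python
import re

NBSP = "\u00a0"  # assumption: the non-break space is U+00A0

# Only space, tab, CR and LF are collapsed. The generic \s class is
# avoided on purpose, because in Python it also matches U+00A0.
ORDINARY_WHITESPACE = re.compile(r"[ \t\r\n]+")


def collapse_whitespace(text):
    """Collapse runs of ordinary whitespace into one space, keeping NBSPs."""
    return ORDINARY_WHITESPACE.sub(" ", text)


sample = "<p>Some" + NBSP * 3 + "spaces and   some \n  more."
collapsed = collapse_whitespace(sample)
print(repr(collapsed))
```

The three non-break spaces survive, while the ordinary runs each shrink to a single space.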
> Lee
Neil.
-----------------------------------------------
Neil Bradley - Author of The Concise SGML Companion.
neil at bradley.co.uk
www.bradley.co.uk
xml-dev: A list for W3C XML Developers
Archived as: http://www.lists.ic.ac.uk/hypermail/xml-dev/