Whitespace rules (v2)

Neil Bradley neil at bradley.co.uk
Sat Aug 16 19:52:00 BST 1997

Dear Liam,

Thanks for the feedback.

> > [...]
> > RULE 2. All whitespace preceding the start-tag and following the end-tag 
> > of a 'block enclosing' element is discarded.
> > ---
> > Note: a non-validating application must refer to a style sheet or
> > configuration file to identify 'block enclosing' elements (perhaps by 
> > applying this rule to elements not specified as in-line elements).
> No -- "blockness" is not at all the same as element content.
> For example, you have to allow for a run-in heading, which starts out
> looking like an HTML H3 (say) except that the rest of the paragraph
> follows on on the same line.  So it isn't a block in the paragraph sense.
> > As a validating application cannot easily determine this rule from the
> > content model (the first mixed content element in the hierarchy is 
> > block enclosing, as well as all outer layers), it may choose the same 
> > approach. 
> I think this is too complicated, as well as being not 100% right.
> I don't think there's a single "right" solution.  This is why it's
> best to allow the parser to pass _all_ whitespace back to the application,
> although it is certainly useful if a DTD-aware parser, even if it isn't
> validating, distinguishes element content whitespace from PCDATA whitespace
> in some way.

Note that these rules are intended for the application, not the
parser or any other part of the XML processor. As I state at the top
of the rules, "A formatting application should ... according to the
following 5 rules".

> > Note: If PI's, comments or empty elements remain in the data stream,
> > they are deemed transparent to this process, so:
> >  [SP]<!--comment--><p>Some text...
> > 
> > becomes:
> > 
> >  <!--comment--><p>Some text...
> Note that if you have a very large comment, you might need a lot of
> lookahead here.

Actually no, because the application would already KNOW that it is 
currently in block content.
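
To illustrate why no lookahead is needed, here is a minimal sketch of
rule 2 over a stream of parse events. The event tuples, the function
name apply_rule2 and the MIXED_CONTENT set are all hypothetical; a real
application would take the element classification from its style sheet
or configuration file. It does not attempt to handle run-in headings.

```python
# Hypothetical configuration: elements with mixed content (text plus
# in-line elements). Anything outside them is treated as an outer,
# block-only layer. This set is an assumption for illustration only.
MIXED_CONTENT = {"p", "h3"}

WHITESPACE = " \t\r\n"


def apply_rule2(events):
    """Yield events, dropping whitespace-only text seen in block content.

    `events` is an iterable of (kind, value) tuples, where kind is one of
    "start", "end", "text", "comment" or "pi" (an assumed event model).
    """
    open_elements = []
    for kind, value in events:
        # The state is known before the event is examined, so the
        # decision about whitespace never waits on what follows.
        in_block_content = not any(n in MIXED_CONTENT for n in open_elements)
        if kind == "start":
            open_elements.append(value)
        elif kind == "end":
            open_elements.pop()
        elif kind == "text" and in_block_content and value.strip(WHITESPACE) == "":
            continue  # discard; comments and PIs pass through untouched
        yield kind, value
```

With the events for "[SP]<!--comment--><p>Some text..." the leading
space is dropped as soon as it is seen, however long the comment is.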

> > RULE 3. A sequence of one or more line-end codes immediately
> > following a start-tag, or immediately preceding an end-tag, are
> > discarded (except in preserved content).
> This means that
> <Paragraph>This is<Emphasis>
> very
> </Emphasis>strange.</Paragraph>
> becomes
> <Paragraph>This is<Emphasis>very</Emphasis>strange.</Paragraph>
> or, if you format without distinguishing emphasis,
> <Paragraph>This isverystrange.</Paragraph>
> which I don't think is what you want.
> But SGML itself is broken in this regard.

I know, and it is impossible to cover all angles. I think your
example is one of the least likely things to happen in reality, and if
necessary document authors must be educated to avoid it.
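
For reference, rule 3 as quoted can be sketched with two regular
expressions over the serialised markup. This is only a rough
illustration: the function name apply_rule3 is made up, and the
patterns ignore comments, PIs, preserved content and attribute values
containing ">".

```python
import re

# A start-tag (not an empty-element tag ending in "/>") followed by one
# or more line-end codes.
START_TAG_THEN_LINE_ENDS = re.compile(r"(<[A-Za-z][^<>]*(?<!/)>)[\r\n]+")

# One or more line-end codes followed by an end-tag.
LINE_ENDS_THEN_END_TAG = re.compile(r"[\r\n]+(</[A-Za-z][^<>]*>)")


def apply_rule3(markup):
    """Discard line ends right after a start-tag or right before an end-tag."""
    markup = START_TAG_THEN_LINE_ENDS.sub(r"\1", markup)
    return LINE_ENDS_THEN_END_TAG.sub(r"\1", markup)
```

Applied to your three-line Paragraph example, this gives exactly the
single-line form you quote, with "very" run up against its neighbours.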

I am open to other suggestions, of course. I am only trying to get 
detailed discussions rolling. For example, we could get rid of both 
rules 2 and 3, and improve rule 5 to say that all surrounding white 
space is removed.

> > RULE 4.  A remaining line-end code is converted into a space, except
> > when it is preceded by a normal (hard) hyphen, or by a soft hyphen
> > ('&#176;'), in which case it is removed (a soft hyphen is also then
> > removed).
> > ---
> > Note:
> > 
> >  A[CR]
> >  line-[CR]
> >  end code sep&#176;[CR]
> >  arates lines.
> > 
> > becomes:
> > 
> >  A line-end code separates lines.
> Well, note that there is no hyphen in that paragraph!!
> The character "-" in ISO 8859-1 (Latin 1) and ASCII is _not_ a hyphen.
> It is a minus sign.

Well, most people in the past have used it as a hyphen in text 
documents, which I think is the important point here.

Also, my source tells me that this character is the official ISO 
hyphen - but my source may be wrong.
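
Whichever code points are finally agreed, the logic of rule 4 stays the
same. A minimal sketch follows, with both hyphen characters passed in
as parameters because this thread does not settle them. The function
name apply_rule4 and the defaults are assumptions: "-" for the hard
hyphen, and code point 173 (U+00AD) for the soft hyphen.

```python
import re

LINE_END = r"(?:\r\n|\r|\n)"


def apply_rule4(text, hard_hyphen="-", soft_hyphen="\u00ad"):
    """Turn remaining line ends into spaces, except after a hyphen.

    Both hyphen arguments are single characters; the defaults are
    assumptions, not something the rules themselves fix.
    """
    # Soft hyphen + line end: remove both.
    text = re.sub(re.escape(soft_hyphen) + LINE_END, "", text)
    # Hard hyphen + line end: remove the line end, keep the hyphen.
    text = re.sub("(?<=" + re.escape(hard_hyphen) + ")" + LINE_END, "", text)
    # Every other line end becomes a single space.
    return re.sub(LINE_END, " ", text)
```

Because the characters are arguments, the same function serves whether
the soft hyphen turns out to be 173, 176 or something else.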

> The hyphen is 0255 octal (173 decimal).  It is a hyphen, not a soft hyphen.
> There is no soft hyphen in Latin 1

OK. I will take your word on this. Again, my source of information may
be wrong.

> I don't have the necessary copy of Unicode in front of me, but last time
> I checked (Unicode 1.1) it was the same in this regard, and also in having
> the ` character be a spacing grave accent, not a single quote.
> This should be done by applications.  I wouldn't want your message:

It is being done by the application.

What "wouldn't you want your message:"?

>     ----------
>     RULE 5. Consecutive whitespace characters (including translated 
> turning into
>     ----------RULE 5. Consecutive whitespace characters (including translated 
> for example.
> > Note: Multiple spaces can be preserved using the non-break space
> > character ('&#160;').
> > 
> >  <p>Some&#160;&#160;&#160;spaces.
> Er, is this defined in Unicode or in ISO 10646??

Don't know. I have it as a non-breaking space, which I am 'liberally' 
interpreting here as a required space (if it can't be broken over 
lines, it must be pretty important). If Unicode has a more explicit 
required space character, then fine, let's use that.
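
As a rough sketch of that interpretation: the quoted text of rule 5 is
cut short above, so this assumes it reduces each run of whitespace to
one space, and it treats only space, tab and the line-end codes as
collapsible. The name apply_rule5 is made up for the example.

```python
import re

# Explicit character class on purpose: in Python, \s on a str pattern
# also matches U+00A0, which would defeat the point of the non-break
# space.
COLLAPSIBLE = re.compile(r"[ \t\r\n]+")


def apply_rule5(text):
    """Collapse runs of ordinary whitespace; leave U+00A0 untouched."""
    return COLLAPSIBLE.sub(" ", text)
```

Here "Some" followed by three ordinary spaces collapses to one space,
while three '&#160;' characters survive as they are.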

> Lee


Neil Bradley - Author of The Concise SGML Companion.
neil at bradley.co.uk

xml-dev: A list for W3C XML Developers
Archived as: http://www.lists.ic.ac.uk/hypermail/xml-dev/
To unsubscribe, send to majordomo at ic.ac.uk the following message;
unsubscribe xml-dev
List coordinator, Henry Rzepa (rzepa at ic.ac.uk)
