XML and whitespace: let's just dump CR and LF!

Rick Jelliffe ricko at allette.com.au
Thu Aug 7 18:46:54 BST 1997


 
> From: Eric Baatz <eric.baatz at East.Sun.COM>
 
> > XML applications should ignore *ALL* CR and LF as a bad joke.
> 
> That doesn't seem reasonable from my point of view, although an option to do 
> so might be reasonable.  For example, my XML application, which reads text 
> and speaks it, is likely to be fed existing text that is only lightly marked 
> up with XML and that uses CR/LF (or newlines) and whitespace to convey 
> important information.  My application needs to see that information to 
> operate in an acceptable manner.  For example, input could be narrative 
> paragraphs denoted by adjacent newlines (or CR/LF's), poetry (lots of 
> prosodic information is in the breaks and whitespace), or columns of 
> text (such as newspapers) and numbers (such as spreadsheets) that have not 
> been reduced to a single logical flow of characters.

Under the current proposals, white-space is preserved or defaulted. (This 
relates to labelling data for applications, not to how the application
presents it.) So there is no way to indicate whether newlines are hard returns 
or soft returns.

I think this hearkens back to XML last year, when the idea was around that 
XML without declarations would be mainly used for closed systems, where the
receiving end had been built with a specific DTD in mind. 

Now it seems that this is not a big factor in the WG's mind, as the
XML-ATTRIBUTE discussion shows: the WG wants to support systems that work
with many DTDs, even if they are not declared.  (I, of course, think this
is a mistaken change in direction for XML, but I bow to collective wisdom.)

Under a closed-system approach, it made sense to say "default" or "preserve",
since "default" and "preserve" might have some determinate meaning.  Under
the new all-singing-all-dancing direction for XML, I think they make little sense.

If XML-SPACE is just "preserve" or "default", then a document instance's
newline conventions must be tailored for each application.  But what if we
are processing against an architectural form? Then every instance must
use the newline conventions belonging to the meta-Document Type Definition.
And what if you have different AFs active at different parts of the document,
or even applicable concurrently on some elements? Then all the meta-DTDs' 
newline conventions must match, or you must adopt different conventions
at different parts of the document.  

A hard return should be explicitly marked up: whether it is an attribute or
a PI or a <BR/> element or &#x2028;, it should not be stuck outside the
element in CSS or DSSSL--it is part of the data, not an artifact of formatting.

(I suppose that the Remappers will think it desirable to define a new standard
XML attribute that specifies which convention you use (PI, attribute, <BR>,
character reference, entity reference) to signify hard returns, and then
provide other attributes to let us cope with existing DTDs that have churlishly
adopted their own, prior, conventions.  But I think it is simpler to merely
say "The only way to signify hard returns in XML is &#x2028;".)

If you have gotten rid of hard returns, then next we need to sort out
newlines that are soft returns in data from newlines that are in 
(or "attributable to") markup or element content.  For this distinction,
XML-SPACE may be good enough, in a brutish way.  But I think that the
Interleaf option, of making newlines not significant for presentation, is
superior, for the reasons given before.  I would also add another: it
may simplify indexing into character strings--if you decide "CR and LF
are not significant for presentation or indexing" then you get rid of 
the problem of documents needing to tell you which newline conventions they
have adopted: you don't care, and the users are free to translate between
different conventions without impacting indexes into documents (all other 
things being equal).
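
To make the indexing point concrete, here is a minimal sketch in Python.
The helper names (drop_cr_lf, index_of) are mine, purely for illustration,
and it assumes an index is simply a character offset counted after CR and
LF have been discarded.

```python
def drop_cr_lf(text: str) -> str:
    """Return text with every CR and LF removed."""
    return "".join(ch for ch in text if ch not in "\r\n")


def index_of(text: str, needle: str) -> int:
    """Offset of needle, counted as if CR and LF were not there."""
    return drop_cr_lf(text).find(needle)


# The same data stored under three newline conventions.
unix = "one \ntwo \nthree"
dos = unix.replace("\n", "\r\n")
mac = unix.replace("\n", "\r")

# Raw offsets depend on the convention in use...
raw_offsets = [s.find("three") for s in (unix, dos, mac)]

# ...but offsets that ignore CR and LF do not.
stable_offsets = [index_of(s, "three") for s in (unix, dos, mac)]
```

Here raw_offsets differs between the LF and CR LF copies, while
stable_offsets is the same for all three, so translating a file between
conventions leaves such an index valid.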


Rick Jelliffe


P.S. An Omnimark program to mark up an existing well-formed HTML-in-XML 
document would merely add the following rules to an XML normaliser:

TRANSLATE "%n" WHEN ANCESTOR IS PRE        
	OUTPUT "&#x2028;%n"

TRANSLATE "%n" 
	OUTPUT "%n "

This does not seem too complex at all. 
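
For readers without Omnimark, the two rules above can be approximated in
Python. This is only a sketch under assumptions: it reads the first rule
as "anywhere inside a PRE element", it uses the standard xml.sax module in
place of an XML normaliser, and the names HardReturnMarker and
mark_hard_returns are hypothetical. It also drops comments, PIs and the
XML declaration, and writes empty elements as start/end tag pairs.

```python
import io
import xml.sax
from xml.sax.saxutils import escape, quoteattr


class HardReturnMarker(xml.sax.ContentHandler):
    """Re-serialise a document, rewriting newlines in character data."""

    def __init__(self, out):
        super().__init__()
        self.out = out
        self.pre_depth = 0  # greater than 0 while inside a PRE element

    def startElement(self, name, attrs):
        if name.upper() == "PRE":
            self.pre_depth += 1
        attr_text = "".join(
            f" {key}={quoteattr(value)}" for key, value in attrs.items()
        )
        self.out.write(f"<{name}{attr_text}>")

    def endElement(self, name):
        self.out.write(f"</{name}>")
        if name.upper() == "PRE":
            self.pre_depth -= 1

    def characters(self, content):
        text = escape(content)
        if self.pre_depth:
            # Hard return: mark it explicitly, keep the newline for layout.
            text = text.replace("\n", "&#x2028;\n")
        else:
            # Soft return: add a space so words stay apart once CR and LF
            # are ignored.
            text = text.replace("\n", "\n ")
        self.out.write(text)


def mark_hard_returns(xml_text: str) -> str:
    """Apply the two newline rules to a well-formed XML string."""
    out = io.StringIO()
    xml.sax.parseString(xml_text.encode("utf-8"), HardReturnMarker(out))
    return out.getvalue()


sample = "<DOC><P>soft\nbreak</P><PRE>hard\nbreak</PRE></DOC>"
marked = mark_hard_returns(sample)
```

With the sample above, the newline inside P gains a following space and
the newline inside PRE gains a preceding &#x2028; reference.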

xml-dev: A list for W3C XML Developers
Archived as: http://www.lists.ic.ac.uk/hypermail/xml-dev/
To unsubscribe, send to majordomo at ic.ac.uk the following message;
unsubscribe xml-dev
List coordinator, Henry Rzepa (rzepa at ic.ac.uk)



