Specification Questions

Peter Murray-Rust Peter at ursus.demon.co.uk
Sat Aug 2 11:56:33 BST 1997


In message <199708020838.JAA11135 at andromeda.ndirect.co.uk> "Neil Bradley" writes:
[...]
[Paul Prescod]
> > The spec makes no special provision for whitespace at the beginning
> > and end of elements. I believe that this is intended to be one of
> > its simplifications over "regular" SGML. This seeming
> > incompatibility is mitigated by an SGML TC which will allow XML
> > to remain compatible with (post-TC) SGML.

The spec is consistent over this, I think, and says that all characters that 
are not markup should be passed to the application.  This includes whitespace.
My personal view is that without some central guidance at least, the
XML treatment of whitespace will cause problems and incompatibility for
two groups of people:
	- those who are familiar with SGML
	- those who are not familiar with SGML.

The first group are accustomed to SGML parsers (primarily James Clark's) 
carrying out consistent operations on whitespace.  This includes:
	- removing line-ends immediately after and before markup
	- translating line-ends into a small number of platform-independent codes
		(e.g. ' ' and '\n').

The second group will be familiar with HTML, where whitespace is normalised
according to rules whose consistency varies between user agents/browsers.
Apart from characters within <PRE> and related markup, all whitespace is
normalised to single spaces, and line-ends are inserted according to
the user-agent software, not the document's content. Treatment of 'special'
characters (e.g. &nbsp; &#32; and other escaped characters or entities) is
probably inconsistent.  However, in general, whitespace is not a current
concern of the second group.
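
As a rough illustration of the HTML-style normalisation described above, here
is a minimal Python sketch. It assumes that "whitespace" means space, tab, CR
and LF, and it ignores <PRE> and entity handling entirely. The function name
collapse_whitespace is hypothetical; it is not taken from any browser or spec.

```python
import re

def collapse_whitespace(text: str) -> str:
    """Replace every run of space, tab, CR or LF with one single space."""
    return re.sub(r"[ \t\r\n]+", " ", text)

# Two spaces become one, so the author's spacing is lost.
print(repr(collapse_whitespace("two  spaces")))        # 'two spaces'
# Line-ends inside the text also become single spaces.
print(repr(collapse_whitespace("\nno line feeds\n")))  # ' no line feeds '
```

Whether the remaining leading and trailing space should also be dropped is
exactly the kind of decision that differs between user agents.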

***Both groups are in for a serious problem with XML unless there is some
central guidance.  Otherwise we are at the mercy of any software implementor.***

<QUESTION>
What whitespace characters can be passed to the application? Regardless of 
what is done with it, is CR+LF treated in the same way as LF or CR alone
in a document?  
</QUESTION>

If not, we appear to be in for variations according to the platform on which
the document was created.  It will be no use telling people that this is
what the spec says - I had always assumed that one of the attractions of
SGML was that it removed platform dependence from documents.  But my reading
of XML-lang [2] suggests that CR and CR+LF produce different results.
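
If the parser does not fold the three line-end conventions together, an
application could do so itself. The following is a minimal Python sketch of
one possible convention, assuming LF is chosen as the single line-end code.
The function name normalise_line_ends is hypothetical and is not something
XML-lang defines.

```python
def normalise_line_ends(text: str) -> str:
    """Map CR+LF and lone CR to LF so that all platforms compare equal."""
    # Replace CR+LF first; otherwise each pair would turn into two line feeds.
    return text.replace("\r\n", "\n").replace("\r", "\n")

unix_text = "two\nlines"    # LF only
dos_text = "two\r\nlines"   # CR+LF
mac_text = "two\rlines"     # CR only

# After normalisation the three variants are the same string.
print(normalise_line_ends(dos_text) == unix_text)  # True
print(normalise_line_ends(mac_text) == unix_text)  # True
```

Without a step like this, the same paragraph saved on three platforms reaches
the application as three different strings.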

The parser, therefore, passes the original whitespace to the
application.  Thus:

<P>two  spaces</P>

and

<P>two spaces</P>

are different documents.

So are:

<P>no line feeds</P>

and

<P>
no line feeds
</P>

The first will confuse anyone accustomed to HTML only.  The second will also
confuse them, and in addition will confuse some current users of SGML.
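
The differences can be observed with any parser that passes character data
through untouched. The sketch below uses Python's xml.etree.ElementTree purely
as a stand-in; it is not software discussed in this thread, and the helper
name content is hypothetical.

```python
import xml.etree.ElementTree as ET

def content(doc: str) -> str:
    """Return the character data of the root element exactly as parsed."""
    return ET.fromstring(doc).text

# The double space survives parsing, so these two documents differ.
print(repr(content("<P>two  spaces</P>")))        # 'two  spaces'
print(repr(content("<P>two spaces</P>")))         # 'two spaces'

# The line-ends next to the tags are also handed to the application.
print(repr(content("<P>no line feeds</P>")))      # 'no line feeds'
print(repr(content("<P>\nno line feeds\n</P>")))  # '\nno line feeds\n'
```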

> > 
> >  Paul Prescod
> 
> Is it up to the application to decide what to do with any leading line
> ending code in these positions then?
> 
> I am pleased to be rid of the 'record' concept (using RS and RE)
> defined for SGML, particularly as I have tended to use Mac and UNIX
> systems which use a single character to end a line (albeit different
> ones!). However, I still think there is too little information on the
> effect of line ending codes in mixed content. Obviously the safe thing
> to do is to make the content of all elements with a mixed content
> model fit on a single line, as in:
> 
> <p>This is a <b>long</b> paragraph.........................</p>
> 
> But with large text blocks, created using text editors, people will
> continue to use line ending codes to make it readable on-screen.
> Normally, a break between words would be interpreted as a space when
> the block is paginated:
> 
> <p>This is a <b>long</b> paragraph that is broken over two
> lines, with an implied space between 'two' and 'lines'.</p>

Yes.  Most people will want to work this way.  Very long lines are a menace
for many types of software.  We must assume (and in many cases encourage)
people will read and even edit XML documents with non-XML tools.

> 
> Yet what happens when a comment or processing instruction
> appears on its own line?
> 
> <p>This is a long paragraph that is broken over two
> <!-- comment -->
> lines, with an implied space between 'two' and 'lines'.</p>
> 
> Is this interpreted as "two <!-- comment --> lines...", which reduces
> to "two   lines"?

No.  It reduces (I think) to:

"...two

lines..."
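
This reading can be checked against a parser that discards comments and passes
the rest through. The sketch below again uses Python's xml.etree.ElementTree
as an assumed stand-in, and the helper name text_without_markup is
hypothetical.

```python
import xml.etree.ElementTree as ET

DOC = (
    "<p>This is a long paragraph that is broken over two\n"
    "<!-- comment -->\n"
    "lines, with an implied space between 'two' and 'lines'.</p>"
)

def text_without_markup(doc: str) -> str:
    """Join all character data in the document, leaving out the comment."""
    return "".join(ET.fromstring(doc).itertext())

result = text_without_markup(DOC)
# Both line-ends around the comment remain, so 'two' and 'lines' are
# separated by two line feeds rather than by spaces.
print("two\n\nlines" in result)  # True
```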

If there is one single 'obvious' issue which will prevent the take-up of XML 
by 'ordinary' people (like myself) it is whitespace.  The present position
on whitespace is:
	- the rules are clear but not prescriptive
	- the rules are non-intuitive to most people
	- the rules allow many different ways of processing a given document
	- the role of whitespace in a given document will depend on the
		software used to process it

The philosophy of the XML-lang authors is consistently:
	- whitespace is a problem for the application, not the spec.
	- there is no generic way of treating whitespace
[I should make it clear that this issue has been debated at great length,
and that the present position is the considered opinion of many experts.
I accept it, although I think it will be difficult to work with in practice.]

Without consistent treatment, a document author has to ask

	'which application is going to process my document?'

It means, for example, that the way that whitespace is treated in MathML 
may be different from that in CML and FooML and ... It effectively
destroys the possibility of (sub)document re-use, without a generally agreed
convention.

	I know that XML-lang authors read this group and may therefore
take some of these points on board.

	P.
> 
> 
> Neil.
> 
> 
> -----------------------------------------------
> Neil Bradley - Author of The Concise SGML Companion.
> neil at bradley.co.uk
> www.bradley.co.uk
> 
> xml-dev: A list for W3C XML Developers
> Archived as: http://www.lists.ic.ac.uk/hypermail/xml-dev/
> To unsubscribe, send to majordomo at ic.ac.uk the following message;
> unsubscribe xml-dev
> List coordinator, Henry Rzepa (rzepa at ic.ac.uk)
> 
> 

-- 
Peter Murray-Rust, domestic net connection
Virtual School of Molecular Sciences
http://www.vsms.nottingham.ac.uk/




