Specification Questions

Peter Murray-Rust Peter at ursus.demon.co.uk
Wed Aug 6 17:19:57 BST 1997


In message <199708050949.KAA07792 at andromeda.ndirect.co.uk> "Neil Bradley" writes:
> 
> 
> Reply-to:      Peter at ursus.demon.co.uk (Peter Murray-Rust)
> 
> > Some additional - hopefully constructive - thoughts on whitespace.
> > 
> > The XML-lang spec does not ( and I suspect will not) give detailed guidance
> > on how whitespace will be managed.  My impression is that it is up to 
> > implementers and/or groups like this to come up with particular solutions.
> > My worry is that these will be inconsistent and not inter-operable.
> 
> I agree totally. This was my original concern.
> 
> > ***
> > Therefore I propose that those on XML-DEV who care about this problem come
> > up with some guidelines for implementers. 
> > ***
> 
> I very much hope this happens.
> 
[...]
> 
> I think all applications should be expected to treat either or both 
> characters in sequence as a line end signal, so that platform 
> dependencies can be eliminated. If there is no good reason to omit 
> this task from the XML-processor itself, I think it should be done 
> there.
> 
> 
[...]
> 
> I believe so. In addition, can we not put 'XML-SPACE 
> (PRESERVE|IMPLIED) "PRESERVE" in an attribute declaration for an 
            ^^^^^^^
I think you meant DEFAULT - #IMPLIED is when no value is given.

> element which will always have preserved content. It is common 
> practice for a DTD to have some kind of pre-formatted element, such 
> as HTML's '<pre>'.
> 
> 
> > If so, we can propose that the DEFAULT mode for any whitespace processing is
> > something along the lines (similar to HTML?).  Within an element with
> > XML-SPACE="DEFAULT"
> > 
> 
> > All whitespace sequences are mapped into a single space character.
> Agreed.
> 
> > All whitespace pseudo-elements are ignored (i.e. whitespace between markup)
> 
> Ummm. what about 'the <b>bold</b>  <i>italic</i> styles...'?
> 
> > All leading and trailing whitespace in #PCDATA is ignored.
> 
> I think all applications should remove leading and trailing CR and LF
> characters in a mixed content element. But not SP or HT, as this would
> be undesirable in the following fragment:
> 
> A<emph>  bold  </emph>word.
> 
> Although an unusual layout, some people may use it, and it would be
> unfortunate if it resulted in 'Aboldword'.
> 
OK - I had overlooked this.
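
To make the difference concrete, here is a minimal Python sketch of the rule
Neil suggests: strip leading and trailing CR and LF from each piece of
character data, but leave SP and HT alone. The function name and the list of
pieces are illustrative assumptions, not anything defined by the XML-lang spec.

```python
def trim_line_ends(text: str) -> str:
    """Strip leading and trailing CR/LF only; spaces and tabs are kept."""
    return text.strip("\r\n")

# Character data from Neil's fragment: A<emph>  bold  </emph>word.
pieces = ["A", "  bold  ", "word."]

# Trimming all whitespace at the edges runs the words together.
naive = "".join(p.strip() for p in pieces)

# Trimming only line ends keeps the spaces around 'bold'.
joined = "".join(trim_line_ends(p) for p in pieces)

print(repr(naive))
print(repr(joined))
```

Whether the remaining runs of spaces are then collapsed to a single space is a
separate decision from the edge trimming shown here.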

Taking account of other posts on this subject here and elsewhere, there seems to
be a positive view that a set of Guidelines/Best Practice/Generally Agreed 
Conventions should be developed, and that XML-DEV is probably the right place.

It's also clear that the more of this that can be done before the XMLProcessor
output gets to the *specific* application - e.g. a browser or transformer - the
better.  We seem to be looking at a filter or layer immediately after/on_top_of
the XMLProcessor.  At the ESIS stream level we could have:

Document ->[Parser] -> ESIS -> [XMLWhitespace] -> NewESIS -> [Application]

and at the API level something that either sits on top of the EventStream or
the  final TreeFactory (or whatever it's called).
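
As a rough illustration of such a layer, the Python sketch below treats the
ESIS stream as a sequence of (kind, value) tuples and puts a generator between
the parser output and the application. The tuple format, the function names and
the choice of "collapse every whitespace run to one space" as the rule are all
assumptions made for the example; no existing parser API is implied.

```python
import re
from typing import Iterable, Iterator, Tuple

Event = Tuple[str, str]  # (kind, value); kind is "start", "end" or "text"

WS_RUN = re.compile(r"[ \t\r\n]+")

def xml_whitespace(events: Iterable[Event]) -> Iterator[Event]:
    """The [XMLWhitespace] stage: collapse each whitespace run in text to one space."""
    for kind, value in events:
        if kind == "text":
            value = WS_RUN.sub(" ", value)
        yield kind, value

def application(events: Iterable[Event]) -> str:
    """Stand-in for the application: it only concatenates the text it receives."""
    return "".join(value for kind, value in events if kind == "text")

# Hand-written stand-in for what the parser would emit.
esis = [
    ("start", "P"),
    ("text", "This is a\nlong         space in a "),
    ("start", "B"),
    ("text", "paragraph"),
    ("end", "B"),
    ("text", ".\n"),
    ("end", "P"),
]

result = application(xml_whitespace(esis))
print(repr(result))
```

Because the filter is just another stage in the stream, the application never
needs to know which whitespace rule was applied upstream.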

(There is a difficulty in filtering any document, in that XPtrs in XML-LINK
would appear to have to operate on the unfiltered document; this is not
specifically stated, but it is implied.  So it might have to be that the 
stream or tree contains both 'significant' and 'non-significant' whitespace,
and that the application has to be able to recognise the flag.  All XPtr
activity has to take place on *all* whitespace, although I don't think this
is pretty.)
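
One way to picture the flag is sketched below in Python: nothing is removed
from the stream, but each event carries a 'significant' marker, so a link
resolver can still count every piece of whitespace while the application skips
the flagged ones. The event format, the Flagged type and the rule used here
(whitespace-only text is non-significant) are assumptions for illustration.

```python
from typing import Iterable, Iterator, NamedTuple, Tuple

class Flagged(NamedTuple):
    kind: str          # "start", "end" or "text"
    value: str
    significant: bool  # False only for whitespace-only text

def flag_whitespace(events: Iterable[Tuple[str, str]]) -> Iterator[Flagged]:
    """Pass every event through unchanged, marking whitespace-only text."""
    for kind, value in events:
        is_blank_text = kind == "text" and value.strip(" \t\r\n") == ""
        yield Flagged(kind, value, not is_blank_text)

events = [
    ("start", "PRETTY"),
    ("text", "\n  "),
    ("start", "PRINT"),
    ("text", "data"),
    ("end", "PRINT"),
    ("text", "\n"),
    ("end", "PRETTY"),
]

flagged = list(flag_whitespace(events))

# What pointer resolution would see: the stream exactly as parsed.
pointer_view = [(f.kind, f.value) for f in flagged]

# What the application would see: flagged whitespace skipped.
app_view = [(f.kind, f.value) for f in flagged if f.significant]
```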

The current switch PRESERVE is clear (everything goes through).  It would go
against the spec if it didn't do this. That means (I suppose) that CR+LF is 
different from LF - that's the price paid for PRESERVE. The other option DEFAULT
cannot map onto a single set of actions that we all agree on for all documents. Therefore
we have to give DEFAULT some hints at the *document* level - presumably through
PIs.

Can we propose, therefore, a set of PIs that would control whitespace 
processing? I would hope that we could keep this to a very small number 
(ca. 3-4).  Is it too simple to suggest that there are two types of markup
(STRUCTURE and TEXT) that need to normalise whitespace?  The former would
deal with things like:
<PRETTY>
  <PRINT>
  </PRINT>
</PRETTY>
where the author did not intend there to be any whitespace, and the second
would deal with
<P>
This is a
long         space in a <B>paragraph</B>.
</P>
where all whitespace would be normalised to a single space as in HTML?

Where a document contained both, the author could use a PI to switch between 
them.
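
The following Python sketch shows how small such a filter could be. It assumes
a hypothetical PI of the form <?whitespace STRUCTURE?> or <?whitespace TEXT?>
(the target name and its values are invented for this example, as is the tuple
event format). In STRUCTURE mode whitespace-only text is dropped; in TEXT mode
every whitespace run becomes a single space.

```python
import re
from typing import Iterable, Iterator, Tuple

Event = Tuple[str, str]  # kind is "start", "end", "text" or "pi"

WS_RUN = re.compile(r"[ \t\r\n]+")
PI_TARGET = "whitespace"  # hypothetical PI target, not defined by any spec

def apply_modes(events: Iterable[Event], mode: str = "TEXT") -> Iterator[Event]:
    """Normalise text according to the mode set by the most recent PI."""
    for kind, value in events:
        if kind == "pi":
            target, _, data = value.partition(" ")
            if target == PI_TARGET and data in ("STRUCTURE", "TEXT"):
                mode = data
                continue  # the switching PI itself is consumed
        elif kind == "text":
            if mode == "STRUCTURE":
                if value.strip(" \t\r\n") == "":
                    continue  # layout whitespace the author did not intend
            else:
                value = WS_RUN.sub(" ", value)
        yield kind, value

events = [
    ("pi", "whitespace STRUCTURE"),
    ("start", "PRETTY"), ("text", "\n  "),
    ("start", "PRINT"), ("text", "\n  "), ("end", "PRINT"),
    ("text", "\n"), ("end", "PRETTY"),
    ("pi", "whitespace TEXT"),
    ("start", "P"), ("text", "\nThis is a\nlong         space in a "),
    ("start", "B"), ("text", "paragraph"), ("end", "B"),
    ("text", ".\n"), ("end", "P"),
]

out = list(apply_modes(events))
texts = [value for kind, value in out if kind == "text"]
```

The sketch deliberately leaves leading and trailing spaces in TEXT mode alone,
since whether to trim them is one of the points still under discussion.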

If we could come up with a very simple set of options, a standard filter
could perhaps be devised, or at least the application programmer would have
a much simpler strategy.  Is consensus possible?

	P.

-- 
Peter Murray-Rust, domestic net connection
Virtual School of Molecular Sciences
http://www.vsms.nottingham.ac.uk/

xml-dev: A list for W3C XML Developers
Archived as: http://www.lists.ic.ac.uk/hypermail/xml-dev/
To unsubscribe, send to majordomo at ic.ac.uk the following message;
unsubscribe xml-dev
List coordinator, Henry Rzepa (rzepa at ic.ac.uk)



