Whitespace rules (v2)

Mon Aug 25 17:02:02 BST 1997

At 6:36 AM -0500 8/25/97, Peter Murray-Rust wrote:
>I have been away for a few days so maybe it's a useful time to try to
>summarise
>the Whitespace debate and to ask a few questions. You don't need to read the
>rest of this unless you believe there is a problem to be addressed :-)

Afraid that I have to chime in when I see a non-problem consuming valuable
time...

>
>In message <v03007800b01fa935a1f1@[205.181.197.116]> dgd at cs.bu.edu (David
>G. Durand) writes:
>> I observed with dismay that the issue of whitespace has surfaced on this
>> list, after we finally gave it the wooden-stake-in-the-heart treatment on
>> the WG discussion lists. As a chief proponent of the current method, I'll
>
>:-) I am not sure what has been killed :-)

The discussion, I hoped. Certainly I hoped we had killed the shibboleth of
a parser "normalizing" whitespace on behalf of the application.

>I will take David's points first, because I *do* believe that many of those
>who were involved in the development of the spec feel that there is no scope
>for further discussion of this *IN THE SPEC*.  I agree with this.

Actually, the only question remaining, in my mind, is how the XML
stylesheet language should allow whitespace to be processed. I disagree
that there is any need for a non-stylesheet, non-application convention for
whitespace. Note that in some sense, the Document type _description_ (i.e.
descriptive prose describing the intent of a DTD) and the "schema" notions
are application specifications, and are entitled to declare whitespace
handling rules.

>Essentially the spec says:
>	- This is a difficult problem.  [Actually it doesn't say this, but
>it might help if it did in a footnote.]
It's only difficult if you think that it's a parser problem. It's easy in
XML, because all whitespace is visible. I can think of no _simpler_ rule
that a _parser_ could implement.

>	- We have taken a minimalist approach where we do not give any support
>to any whitespace philosophy [other than PRESERVE which passes everything and
>can be platform-dependent], but leave this to the community. DEFAULT is simply
>the absence of PRESERVE.

Yes, since there is not a universal "whitespace philosophy" even for a
single document (see my response to Marcus for an example), there's no
reason to declare it in the instance.

>I believe this solves one species of problem, where the authoring tool/system
>is closely coupled to the application. CDF might be such a system (e.g. I have
>never seen a native CDF file).

No, it's a case where the "philosophy" is coupled to the application, not
to the "document" in the abstract -- except insofar as it is defined by a
"document type description" or "schema" -- which is essentially a set of
ideal constraints that applications are expected to follow.

>(A) There is a defined DTD (e.g. TEI, HTML) but a variety of authoring tools
>and a variety of applications from different providers. Traditionally these
>will come from the SGML community. I believe that there will certainly be
>initial problems where m'facturer X emits whitespace in a particular way
>which is incompatible with Y's tools for rendering/transforming it. It may
>also be platform dependent.  We've seen this in the development of HTML
>systems
>although they are improving.

TEI defines where whitespace is significant (almost nowhere if I remember
correctly).

>Remember that most SGML systems are currently implemented within a single site
>(the tools are chosen to be compatible throughout the process). Very little
>SGML is delivered over the WWW to be consistent between different m'facturers.
>XML is specifically designed to be delivered over the WWW in (I assume)
>a platform and m'facturer-independent way.  Do we expect to see 'this XML
>file best viewed with FOO software'??? If so, we might as well give up now.

No, but every document will _have_ to either conform to a well-known DTD or
schema of some sort, or be delivered with a stylesheet, and those are
useful places in which this behavior should be explained.

>IMO any developer needs to be able to say:
>	(i) I support a wide range of XML DTDs.
>	(ii) I can easily customise my software to support a range of commonly
>used DTDs
>	(iii) Documents authored by my software should be readable by software
>from another m'facturer with whom I have had no formal discussions
>	(iv) My system can support a range of applications which read documents
>produced by other m'facturers systems and with whom I have had no formal
>discussions

Nothing in a stylesheet-based solution violates this, to my mind.

>If all the manufacturers tell me this is a non-problem, I'll shut up (on this
>issue!) If each DTD defines its own use of whitespace (or worse, doesn't
>define it) they may have a lot of work.
>
>(B) There are generic XML applications. The XML community continues to discuss
>documents which 'contain information from more than one DTD' or 'are WF but
>not necessarily valid(atable)'. Examples of these are:
>	(i) an XML document to which meta-data has been prepended.
I'm probably not the best person to address this, as I think that the
mix-and-match proposals are ill-thought out, but since the data is supposed
to be recognizable, presumably it is also to be ignored by all applications
other than "meta-applications". So that's not a problem.

>	(ii) an XML document which includes chunks conforming to well-defined
>DTDs such as MathML.

In which case, they should have well-known stylesheets or descriptions that
explain any whitespace conventions in use.
>
>The possible combinations are indefinitely large.

But since each individual part must have defined behavior, this should not
be a problem.

>It is impossible to write bespoke software to process these documents, and we
>need generic mechanisms. Perhaps many will be dealt with by stylesheets, and
>maybe the WS issue is a question of developing appropriate conventions in
>stylesheets.  In documents of this sort there have to be conventions and flags
>that indicate how to interpret the documents. The spec has indicated that it
>shouldn't be in the XML markup - no problem.  Somehow conventions have to
>evolve, either conveyed implicitly or explicitly (e.g. through PIs).
>[Remember that there are - as yet - no agreed conventions as to what a PI can
>look like - you can put anything in after the target.]

I used to think this might be useful, but I can't actually think of any
application that could plausibly care about whitespace folding and also do
meaningful processing without knowledge of the DTD. A text-indexer can work
without a DTD, but also doesn't need any whitespace info (folding is always
good enough) -- and it needs to see every byte, because it may have to
track file offsets of hits.
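
To make the indexer case concrete, here is a minimal sketch in C of a scanner
that treats every whitespace run as a mere separator while still recording
where each word starts in the original bytes. The names index_words and
struct hit are assumptions made up for this illustration, and the sketch
makes no attempt to recognize or skip markup.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Hypothetical record for one indexed word: where it starts in the
 * original byte stream and how long it is. */
struct hit {
    size_t offset;
    size_t length;
};

/* Scan buf[0..len) and record up to max words, treating any run of
 * whitespace as a separator. Offsets refer to the unmodified input,
 * which is why the scanner must see every byte.
 * Returns the number of words recorded. */
size_t index_words(const char *buf, size_t len, struct hit *hits, size_t max)
{
    size_t i = 0, count = 0;

    assert(buf != NULL && hits != NULL);
    while (i < len && count < max) {
        while (i < len && isspace((unsigned char)buf[i]))
            i++;                      /* skip the separator run */
        if (i == len)
            break;
        hits[count].offset = i;       /* offset into the original bytes */
        while (i < len && !isspace((unsigned char)buf[i]))
            i++;
        hits[count].length = i - hits[count].offset;
        count++;
    }
    return count;
}
```

The offsets only stay valid if nothing upstream has already collapsed or
stripped whitespace, which is the reason for wanting the parser to deliver
every byte.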

Can you think of any other useful examples of "DTD-blind" applications that
might care about how the document _intended_ the whitespace to be
processed? I confess that I can't.

>Note; I am NOT trying to find a universal solution here.  I am suggesting that
>we develop some common, useful approaches which will solve a reasonable
>number of problems.

But I don't actually see what problems we can solve with such solutions
that are not better addressed in either the stylesheet or the DTD/schema.

>> The problem with this is that there are a large number of ways that
>> whitespace can be used: the "tokens" form mentioned at the end, for
>> example, has never been proposed for XML.
>
>I agree there are a large number of ways.  Some classification would be
>valuable and IMO the sort of thing that XML-DEV could usefully provide.
>[The WS-separated tokens are no different from 'words' in HTML and I would
>expect that a large number of people would welcome a convention on
>normalising whitespace between 'words'.]

Enumerating these might have some pedagogical value, but I no longer see
the practical value of declaring the behaviors. I used to think it might be
useful, but I'm not so sure.

>Then the application needn't implement them :-)  Applications have to do
>*something* about whitespace.  This can be:
>	- ignore the problem (or use PRESERVE)
>	- their own thing
>	- a set of choices which is understood by the community
>	- refuse to process the document.

Only 2 (their own thing) makes any sense -- and is typically driven by
their knowledge of a DTD or possession and following of the dictates of a
stylesheet.

>It 'works' in that it shifts the problem to the application developer. I like
>the idea of an XML->XML transducer - perhaps in front of the application, or
>callable within it.  If David thinks that such tools could be built
>independently of applications that is exactly what I am suggesting :-)

They are close to a _null_ application, and require _no_ whitespace
normalization, since they need only pass any whitespace they see straight
through. This was my original point. Only if you insist on "normalizing" do
you _create_ problems with transduction.

>it's clear that an application *must* have access to all whitespace if it
>wants it (this is made clear by, say, the requirement of XML-LINK to search
>on pseudoelements).  However it should also be able to access a normalised
>form of the document.
Why? I think I've argued effectively that this is not useful without a
stylesheet or well-known DTD, and in those cases, it is not necessary (as
the DTD or stylesheet should declare the conventions in use).

>> This is the option that XML universally adopts. That means  that any other
>> method can be implemented _by any processor that cares_. If one can imagine
>> destroying meaning of a document's content by the flattening of all
>> whitespace strings to a single space, then you may need more elements in
>> your content model, if you are not able to control the software that will
>> process the document.
>
>This is a good point.
>
>>
>> In other words the parser guarantees all WS will be visible to applications
>> -- this makes designing and implementing WS dependent processing easy --
>> -- but since applications are _not_ constrained as to folding or other WS
>> processing behaviour, document authors will have to be cautious in using
>> significant whitespace. If you can't assume that applications to process
>> your markup will do the right thing, then you should not play games with WS.
>
>Yes. But where is the rigour in authoring going to come from? This is where
>I believe that XML-DEV has a role.
I'm not sure what you mean here... If the application or DTD depends on
whitespace critically (a bad idea, probably, but a permissible one) -- then
it is the author's responsibility to use it properly (and select a tool
that lets her). Since the generic dumb text-editor is such a tool, and
it's widely available, I don't see a big problem here.

>> This actually is not much of an issue for CML, since it's a reasonable
>> assumption that any implementation of CML markup-display will have to do
>> lots of special things, of which whitespace is the least.
>
>No, the point was that CML wishes to re-use HTML and MathML as additional
>components in the document. And then meta-data, and ... So that the
>application will become bloated unless it can re-use the approaches from
>the rest of the community.

I'm afraid I don't see how you're going to share code with an HTML
processor. Nor can I psych myself up to believe that whitespace folding
code:
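  /* `in` is the input stream; c ends up as the first non-space character */
  while (isspace(c = getc(in)))
    ;
  outchar = ' ';  /* the whole run folds to one space */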
is a big bloat problem in a program that can render organic chem reaction
diagrams.
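
For completeness, a self-contained version of that folding might look like
the sketch below. The function name fold_whitespace and the buffer-based
interface are assumptions chosen for illustration and do not come from any
XML specification or library; the function collapses each run of whitespace
to a single space and copies everything else unchanged.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Hypothetical helper: copy src to dst, replacing each run of whitespace
 * with a single space. dst must have room for strlen(src) + 1 bytes.
 * Returns the number of bytes written, not counting the terminator. */
size_t fold_whitespace(const char *src, char *dst)
{
    size_t n = 0;
    int in_run = 0;               /* are we inside a whitespace run? */

    assert(src != NULL && dst != NULL);
    for (; *src != '\0'; src++) {
        if (isspace((unsigned char)*src)) {
            if (!in_run)
                dst[n++] = ' ';   /* first char of a run becomes one space */
            in_run = 1;
        } else {
            dst[n++] = *src;      /* non-space bytes pass through as-is */
            in_run = 0;
        }
    }
    dst[n] = '\0';
    return n;
}
```

Leading and trailing runs are folded to one space as well, not stripped;
stripping would be a different policy that an application could layer on
top.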

>> I think XML's agnostic position is the correct one for the language.
>> Authors should probably assume (unless they anticipate absolutely no
>> re-use) that HTML-style draconian normalization might occur anywhere and
>> use markup rather than whitespace, or at least CDATA sections. This
>> position _may_ be moderated (a little) where a well-known DTD with
>> well-defined WS rules can be used (like the TEI or HTML).
>
>I agree on this.  The point I have been trying to promote is that it should
>be possible to collate the requirements of such systems and offer them
>on a re-usable basis.

If it's useful, just list some policies and be done with it, I guess. In
answering this mail I've found that I no longer believe that it's very
important, because I don't see how to use it effectively anywhere.

>An author could then say:
>	- the content of FOO, BAR, FLIP can be expected to be treated by
>XML-DEV-HTML-like WS normalisation.
>	- the content of BAZ, BLORT suffers WS stripping as described in
>XML-DEV-HTML-like-stripping.
>
>and that's about it. If we can get something along those lines, then
>I think a reasonable number of people would take note. It doesn't just have
>to apply to HTML DTDs.

Why not? Make a web page for the policies, create a notation declaration
that points at it, and then use that notation as a prefix on a PI to
declare these things. It can't do any harm other than maybe wasting time.

  -- David

_________________________________________
David Durand              dgd at cs.bu.edu  \  david at dynamicDiagrams.com
Boston University Computer Science        \  Sr. Analyst
http://www.cs.bu.edu/students/grads/dgd/   \  Dynamic Diagrams
--------------------------------------------\  http://dynamicDiagrams.com/
MAPA: mapping for the WWW                    \__________________________
