Whitespace

Marcus Carr mrc at allette.com.au
Tue Aug 26 01:21:50 BST 1997


David G. Durand wrote:

> > It does involve parsing it, but only until
> >it sees mixed content. If elements are assumed to be ELEMENT until
> >proven otherwise, surely this wouldn't be a massive overhead.
>
> It might involve buffering large amounts of whitespace across an
> arbitrary parser lookahead, since there is no limit on the size of an
> element, or where the non-space PCDATA might show up. One would have
> to buffer the entire document in the parser before one could decide
> whether to emit any whitespace in the root element. This might be a
> bit of a memory performance hit...

Why would you need to buffer anything? Every element starts with a
default value of 'element'. As they're shown to be otherwise, their
status is revised. This involves tracking open elements, not picking up
chunks and reviewing them. One linear pass of the document tells you all
you need to know.
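
A minimal sketch of that single pass, assuming Python's standard xml.sax
module. The names ContentClassifier and classify and the labels "element"
and "mixed" are made up for illustration, and any element seen to contain
non-space character data is simply marked "mixed".

```python
import xml.sax


class ContentClassifier(xml.sax.ContentHandler):
    """Treat every element type as 'element' until non-space text is seen."""

    def __init__(self):
        super().__init__()
        self.open_elements = []  # stack of currently open element names
        self.status = {}         # element name -> "element" or "mixed"

    def startElement(self, name, attrs):
        # Default status is 'element'; never downgrade an existing 'mixed'.
        self.status.setdefault(name, "element")
        self.open_elements.append(name)

    def endElement(self, name):
        self.open_elements.pop()

    def characters(self, content):
        # Non-space character data revises the status of the innermost
        # open element; whitespace-only data changes nothing.
        if self.open_elements and content.strip():
            self.status[self.open_elements[-1]] = "mixed"


def classify(xml_text):
    """Run one linear pass over the document and return the status table."""
    handler = ContentClassifier()
    xml.sax.parseString(xml_text.encode("utf-8"), handler)
    return handler.status
```

The sketch only records each element type's status as the pass proceeds.
It does not decide what to do with whitespace that went by before a
status was revised, which is the case the buffering objection is about.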

> Many people have claimed that they use editors incapable of
> functioning without inserting line ends (of their local flavor) every 200
> characters or so. I (personally) wasn't very sympathetic to this
> argument, but it stood in for the empirical observation that people
> are very loose with whitespace/line ends, and that forcing tools not to
> emit whatever line-ending codes they want could be a problem.

This would still respect the limits set by the user in the same way an
application would behave when you turn off hyphenation - the line might
be shorter, but it's broken in a sensible place.

> The real problem is that there's an assumption that a generic
> processor can solve the "whitespace problem" -- and that is not really
> true. In a very real sense the meaning of whitespace is a product of
> the document _and_ the application. For instance, line breaks (as
> indicated by whitespace) might be critical in a typesetting
> application for poetry (but _only_ in <poem> elements). The same
> document, however, would be best processed with some form of
> whitespace-collapsing everywhere, when indexed by a full-text search
> engine. The same data may have different significance when processed
> differently.
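
The two readings of the same data described in the quoted paragraph can
be sketched as follows, assuming Python's standard xml.etree.ElementTree.
The function names index_text and poem_lines and the sample markup are
hypothetical; only the <poem> element name comes from the text above.

```python
import xml.etree.ElementTree as ET


def index_text(root):
    """Full-text indexing view: collapse every run of whitespace."""
    return " ".join("".join(root.itertext()).split())


def poem_lines(root):
    """Typesetting view: keep line breaks, but only inside <poem>."""
    lines = []
    for poem in root.iter("poem"):
        text = "".join(poem.itertext())
        lines.extend(line.strip() for line in text.strip().splitlines())
    return lines


# Hypothetical sample document used only to exercise the two views.
SAMPLE = (
    "<doc>\n"
    "<p>Some   prose\nhere.</p>\n"
    "<poem>Roses are red\nViolets are blue</poem>\n"
    "</doc>"
)

root = ET.fromstring(SAMPLE)
indexed = index_text(root)
lines = poem_lines(root)
```

Both functions read the same parsed tree; only the application decides
whether a line break inside the data means anything.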

If line breaks are critical, they should be marked explicitly. If you
gave a hand written poem to a data entry person with no knowledge of
poetry, you may have to specify that you want the current line
boundaries respected. Why should an application not be given the same
info?

> The fact is that whitespace should be controlled by the application.
> For typesetting and display, this means that practically, it's going
> to be part of the "stylesheet" or other processing mechanism.

Whitespace is also a mechanism used to make data readable. In that
sense, a space is a character in its own right, not just something that
appears around words. Imagine the response if the letter 'x' were being
discussed instead of whitespace, and we were telling people that 'x' may
or may not appear in their data.

> > As
> >far as I can see, much of the functionality in XML (such as linking)
> >relies on a DTD, so it's not going to be foreign to most XML
> >applications anyway.
>
> This is not necessarily the case. It's also harder to detect mixed
> content from DTD declarations than simply to recognize #FIXED
> attributes.

It can't be that hard. If parameter entities (I assume they're allowed?)
have to be unravelled anyway, surely it's just a case of looking at the
content model? If it starts with #PCDATA and contains anything else,
it's mixed content.
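
A rough sketch of that check, in Python, assuming the parameter entities
have already been expanded so that each declaration carries its literal
content model. The function name classify_declarations and the labels it
returns are made up for illustration, and the regular expression is not
a full DTD parser.

```python
import re

# Matches <!ELEMENT name model> once parameter entities are expanded.
ELEMENT_DECL = re.compile(r"<!ELEMENT\s+(\S+)\s+(.*?)>", re.DOTALL)


def classify_declarations(dtd_text):
    """Label each declared element by looking only at its content model."""
    result = {}
    for name, model in ELEMENT_DECL.findall(dtd_text):
        model = "".join(model.split())  # drop all whitespace in the model
        if model in ("EMPTY", "ANY"):
            result[name] = model
        elif model.lstrip("(").startswith("#PCDATA"):
            # Starts with #PCDATA: mixed if anything else follows it.
            inner = model.strip("()*")
            result[name] = "pcdata" if inner == "#PCDATA" else "mixed"
        else:
            result[name] = "element"
    return result
```

For example, a declaration such as <!ELEMENT para (#PCDATA | em)*> comes
back as "mixed", while <!ELEMENT doc (title, para+)> comes back as
"element".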

> >I would really like to
> >see XML and SGML stay in synch - I think anything else would be to
> >everyone's disadvantage.
>
> Yes, this is very true -- and this battle has been won by the
> compatibility camp -- they are in synch. SGML has a new "pass all
> whitespace" option for the declaration. This is not going to be a big
> problem for existing implementations, since it's incredibly easy for
> parsers to implement -- most have had to anyway, if they attempt to
> support SGML->SGML transformation tools. I think SP already can do the
> right thing.

"Pass all whitespace" will go some distance toward fixing the problem,
but what else does it impact? Does it mean that inclusions and
exclusions suddenly appear differently than they did in the 'old SGML'?

> And since there is, in any case, no universal way to handle markup
> without an external processing spec (that can include whitespace among
> its many other factors) there's no reason to make the parser cause
> applications more problems than they will have to solve already.

My understanding is that one of the basic requirements of XML was that
the applications had to be easy to write, so things could be allowed to
happen quickly. As much as I agree that this is a good thing (and, as
you pointed out, applications are coming out already), I would argue
that maybe they should be more difficult to write, but should address
this issue correctly.


--
Regards

Marcus Carr                  email:  mrc at allette.com.au
_______________________________________________________________
Allette Systems (Australia)  email:  info at allette.com.au
Level 10, 91 York Street     www:    http://www.allette.com.au
Sydney 2000 NSW Australia    phone:  +61 2 9262 4777
                             fax:    +61 2 9262 4774
_______________________________________________________________



xml-dev: A list for W3C XML Developers
Archived as: http://www.lists.ic.ac.uk/hypermail/xml-dev/
To unsubscribe, send to majordomo at ic.ac.uk the following message;
unsubscribe xml-dev
List coordinator, Henry Rzepa (rzepa at ic.ac.uk)



