Re Whitespace

David G. Durand dgd at cs.bu.edu
Thu Sep 18 21:38:33 BST 1997


At 1:40 PM -0500 9/18/97, Sean Mc Grath wrote:
>[David Durand]
>>No, but fgets (unlike gets) can deal with long lines --- you have to
>>recognize that you overflowed and make accommodations, but you can do the
>>right thing. I was giving you the benefit of the doubt, since gets, at
>>least, has the problem that you are raising, while fgets does not.
>>
>[Sean Mc Grath]
>You mentioned gets(). I didn't. How your insertion of an irrelevant reference
>to gets() can be construed as giving me "the benefit of the doubt" I don't
>know.


Well, as fgets does not support your argument that "long lines cause
problems", I thought it might be a typo for gets (which does have serious
problems with long lines, but is of course a canonical example of bad
design, and not something we want to accommodate).

As to fgets, I confess that I don't see that it should have any problem
with any file, newline-containing or not. Am I clear now?
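To make the fgets point concrete, here is a minimal sketch in C of reading a line of any length through a small fixed buffer. The function name read_long_line and the 16-byte chunk size are illustrative assumptions, not anything from the thread. A chunk that does not end in a newline means the buffer overflowed, so the loop keeps reading and appending.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Read one line of any length from fp using a small fgets buffer.
 * Returns a malloc'd string without the trailing newline, or NULL at
 * end of file (or on allocation failure). The caller frees the result.
 * read_long_line is a hypothetical helper name. */
char *read_long_line(FILE *fp)
{
    char chunk[16];           /* deliberately tiny, to force overflow */
    char *line = NULL;
    size_t len = 0;

    while (fgets(chunk, sizeof chunk, fp) != NULL) {
        size_t n = strlen(chunk);
        char *grown = realloc(line, len + n + 1);
        if (grown == NULL) {
            free(line);
            return NULL;
        }
        line = grown;
        memcpy(line + len, chunk, n + 1);   /* copies the NUL as well */
        len += n;

        /* A chunk ending in '\n' completes the line. Otherwise the
         * buffer overflowed and the rest of the line is still unread. */
        if (len > 0 && line[len - 1] == '\n') {
            line[len - 1] = '\0';
            return line;
        }
    }
    return line;   /* final line without a newline, or NULL at EOF */
}
```

The same loop works whether the file has short lines, one enormous line, or no newline at all. gets offers no way to do this because it takes no buffer size.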

>[David Durand]
>>Just try that in tables. You have to know the meaning of the markup, even
>>in HTML, if you want to do this. Now you can claim that table markup is
>>broken, and you might be right, but HTML does not support your argument.
>
>[Sean Mc Grath]
>Why not? Why cannot I replace say, "<TD>" with "<TD>\n" everywhere?
>The problem then reduces to long data chunks such as...
>pre elements:-

Well, because people use tables to format, and that extra space queers the
pitch, inducing funny spacing behavior. Agreed that a better table model
could avoid this.

>[David Durand]
>>
>>Similarly for pre elements: You can't do anything to line ends in there --
>>maybe I'm using a 20K line in <pre> to force horizontal scrolling for a
>>rhetorical reason.
>
>[Sean Mc Grath]
>Absolutely agreed. The <data><line end><data> case is fundamentally different.
>These line-ends are truly part of the data and a processor that adds new ones
>is blowing the integrity of the data. Thus the plausible argument in favour
>of not using line-end as data content.

I confess to not understanding why a line end cannot occur at the beginning
of an element. Even SGML never proposed to remove more than _1_ such line
break.

So you want to take them all away, so that grep won't break.

>[David Durand]
>>
>>>>Can you suggest any solution to the "grep" problem other than requiring a
>>>>fixed line-max in XML?
>>>
>[Sean Mc Grath]
>>>Yes. Ignore all line ends. I know this presents its own set of difficult
>>>problems but I'd prefer to tackle these - and maintain compatibility with
>>>a decade's worth of tools - rather than break the tools.

Well, it makes data rather unrevealing.

And of course, the tools are only broken if common practice leads to the
use of long lines -- and if that becomes the case, then it will only have
been because the tools are _not_ actually that important.

This is a social argument that you have not addressed yet, but it cuts to
the core of why we should not do this... We get a simpler, easier model, and
there is nothing to stop people from any self-imposed discipline their
tools require.

And if people are _not_ following such a discipline, then there's no reason
to worry about the tools, because it can only happen if people are not
using those tools for XML.

>[David Durand]
>>lack of <pre>-style elements
>
>Broken As Designed. If something has to give I think <pre> elements should
>be first to go.
Well, theoretically there's a lot of reasonableness to using explicit markup
for such line breaks. But, the pragmatist in me has to note that there has
been _no_ successful markup or document processing language without such a
feature (except for word-processors, but the case there is complicated
because the user never _sees_ the relevant representation).

>Alternatively the problem can always be "arcformed" away. We use
>     <!ATTLIST <e> DIGITOME CDATA #FIXED "PREFORM">
>all the time. Our pretty printing, word wrapping SGML processing tools use
>this to avoid adding extraneous WS that would blow the data content.

Doesn't solve the problem you raised. That data has a long line in it and
grep crashes. You have to split the line, and take the consequences, or not
use grep.
If you don't allow arbitrary line-break introduction anywhere, you haven't
solved the legacy tool problem, which weakens your argument somewhat. If
you do, you've made it impossible to count on line-breaks _ever_ being
significant. The XML committee considered this and rejected it as too
divergent from current practice (that people did not want to give up).
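The dilemma can be sketched in C. The helper below (split_long_lines is a hypothetical name, and the limit of 3 in the usage note is only for illustration) forces every line under a limit by inserting newlines. That satisfies a tool with a hard line-length limit, but the output no longer equals the input, so any consumer that treats line breaks as data sees different content.

```c
#include <stdlib.h>
#include <string.h>

/* Copy `in`, inserting '\n' whenever a line would exceed `limit` bytes.
 * Returns a malloc'd string (caller frees), or NULL if limit is 0 or
 * allocation fails. split_long_lines is a hypothetical helper name. */
char *split_long_lines(const char *in, size_t limit)
{
    size_t n = strlen(in);
    size_t o = 0;     /* write position in out */
    size_t col = 0;   /* bytes written on the current line */
    char *out;

    if (limit == 0)
        return NULL;

    /* Worst case: one inserted newline per `limit` input bytes. */
    out = malloc(n + n / limit + 1);
    if (out == NULL)
        return NULL;

    for (size_t i = 0; i < n; i++) {
        if (in[i] == '\n') {
            out[o++] = '\n';
            col = 0;
        } else {
            if (col == limit) {      /* line is full: break it here */
                out[o++] = '\n';
                col = 0;
            }
            out[o++] = in[i];
            col++;
        }
    }
    out[o] = '\0';
    return out;
}
```

With a limit of 3, the input "abcdefgh" comes back as three lines, "abc", "def" and "gh". A line-oriented tool can now read it, but a byte-for-byte comparison with the original fails.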

>[David Durand]
>>, inability to write XML filters that preserve linespace just from generic
>>XML parsers.
>
>[Sean Mc Grath]
>Line ends (at least those) tipping up to start-end tags would *not* be part
>of the data. They could thus be added/dropped without affecting the data.
>The CGR output of the grove would be the final arbiter on "equivalence" and
>the launching pad for offsets used in addressing.

Yes, and the "looks the same in my editor" arbiter of equivalence would
fail. This has long been felt unacceptable by those who use such
transformations. If any hand-editing is involved it is unacceptable
behaviour to change all the line-ends.

>[Sean Mc Grath]
>>>Yes. Line oriented text processing has been a hugely popular paradigm for
>>>many years now. I don't think of these tools as "defective" at all. I dare
>>>say many wielders of these tools are of the same opinion. These people will
>>>be rightly miffed at the suggestion that they are defective by virtue of the
>>>use of a line oriented paradigm. They will also be rightly miffed that they
>>>cannot bring their tools/skills to bear in the XML world.

>[David Durand]
>>But they can, they just need to limit their files to correspond to the
>>limitation of their tools. People do this all the time, without difficulty.

Yes. If your editor and tools have a 72 character line limit, you don't
create files with long lines. Then your tools always work. If you want
everyone's tools to always work, and you admit a maximum line-length for
tools, you need to pick that number so I can make files that won't toast
your software. Either that, or someone with different software will exceed
the limits of your software, of whose existence she has never even heard!
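The self-imposed discipline is easy to check once the number is known. As a rough sketch in C, the function below reports the longest line in a file, so a sender could compare it against an agreed limit such as the 72 characters used as an example above. The name longest_line is an illustrative assumption.

```c
#include <stdio.h>

/* Return the length in bytes of the longest line in fp, not counting
 * the newline itself. Reads one byte at a time, so it has no line-length
 * limit of its own. longest_line is a hypothetical helper name. */
size_t longest_line(FILE *fp)
{
    size_t longest = 0;
    size_t current = 0;
    int ch;

    while ((ch = getc(fp)) != EOF) {
        if (ch == '\n') {
            if (current > longest)
                longest = current;
            current = 0;
        } else {
            current++;
        }
    }
    /* Account for a final line that has no trailing newline. */
    if (current > longest)
        longest = current;
    return longest;
}
```

A file is safe for a 72-character tool when longest_line(fp) <= 72. Without a published number, the sender has nothing to compare the result against.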

>
>[Sean Mc Grath]
>No difficulty?
>
>Problem : I receive an XML file from a user who works with <1024 lines in
>his tools.
>
>I use <512. How do I munge his file to suit my tools? I can't without
>blowing the data. If tag-tipping line ends were transient I could make
>a stab at it. I would still have to address the "<data><line end><data>"
>case. But hey! I never said this was simple! I just said that the alternate
>set of problems this presents have the benefit of not throwing out our
>existing line oriented tools and techniques.

Look, we have a solution. Proposing a new solution based on a new problem
(grep and other tools with hard line-length limitations) requires that the
new solution actually _solve_ the problem. Your solution does not solve the
problem you yourself pose, so it's hard for me to take seriously.

>[David Durand]
>>Of course if the world at large decides to abandon the "line paradigm" then
>>those who stick to it will be inconvenienced. But then if "the world" makes
>>the shift, then there's still not a very big problem, is there?
>
>[Sean Mc Grath]
>That is one-helluva shift IMHO! I am not sure to what extent the world is
>   a) aware of this aspect of XML
>   b) willing to bite that bullet.

In that case, they create files with short lines, and there is no bullet to
bite. The only way this problem can become common is if long lines become
very popular. I don't see how long lines can become popular if they create
fatal tool problems with popular tools. Either long lines will not be
common, or tools that cope with long lines will be common along with the
long lines themselves.

It's a simple feedback loop. No need to change the standard, just let
people's desire to share data feed back into the general knowledge of what
data is shareable.
>[David Durand]
>>if XML is
>>supposed to require lines no longer than some limit, we need to specify
>>that limit in the standard.
>
>[Sean Mc Grath]
>No we don't! We need to have a well defined mechanism whereby a tool with
>a line length limit of N can work with XML with line length > N without
>blowing the integrity of the data.
How do we do this for legacy tools like grep with a hard-compiled limit
(that is not documented, and varies from vendor to vendor)?
If files that work with arbitrary tools are to be possible, we need to know
the constraints that those tools impose.

>[David Durand]
>>Otherwise all we can say is that any XML
>>processor is free to reject any document if the lines are "too long for
>>that tool". That's an even worse prescription for interoperability.
>>
>See above.

I saw. I didn't see how you're going to fix grep (for your data\ndata
case). Or rather the "40K of data with no \n" case which is the real killer.

>[David Durand]
>>If there are limits, a standard has to tell you how to be safe and not
>>break any of those limits. At least, a good standard should.
>>
>
>[Sean Mc Grath]
>The standard does not have to establish a limit. It could help users
>of "legacy" tools to *cope* with limits though. "Buy/build better tools"
>is one line that can be taken but it is not the only one.

Well, how could the standard do that?

Actually, since the standard is almost certainly not going to change, I
don't really care how it could do it. My sense is that people won't do
without <pre> equivalents -- so you can never get total freedom to
remove/add line ends. So since the problem is unsolvable, let's not waste
time, and complicate the standard to get a partial solution (i.e. a solution
that fails to solve the problem) at the cost of a popular feature.

  -- David

I think that's it for me.

_________________________________________
David Durand              dgd at cs.bu.edu  \  david at dynamicDiagrams.com
Boston University Computer Science        \  Sr. Analyst
http://www.cs.bu.edu/students/grads/dgd/   \  Dynamic Diagrams
--------------------------------------------\  http://www.dynamicDiagrams.com/
MAPA: mapping for the WWW                    \__________________________



xml-dev: A list for W3C XML Developers
Archived as: http://www.lists.ic.ac.uk/hypermail/xml-dev/
To unsubscribe, send to majordomo at ic.ac.uk the following message;
unsubscribe xml-dev
List coordinator, Henry Rzepa (rzepa at ic.ac.uk)



