Is there anyone working on a binary version of XML?
Rick Jelliffe
ricko at allette.com.au
Sun Mar 28 12:39:57 BST 1999
From: David Megginson <david at megginson.com>
>Rick Jelliffe writes:
>
> > One trivial way to minimise file sizes for transmission is to
> > collapse white-space inside markup (e.g. [\ \t\n\r]+ becomes
> > [\n]),
>
>Yes, that might be helpful (but only minimally in most cases).
The reason I suggest it is this: at several stages in a network there is
liable to be some point-to-point compression. In particular, of course,
at the modem of the receiver (well, most receiving ends). XML's
verboseness can be partially justified by the existence of this
compression.
Attempting to compress already-compressed data does not always lead to
increased benefits: in fact, compressing already-compressed data can
easily lead to larger files, which is why many compression systems first
check that they have made any gains before writing out the compressed
block. (And if you are going through 7-bit mail systems, then you can
increase your transmission size by compressing data, if the data is
ASCII, because the compressed binary must be re-encoded as 7-bit text.)
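
As a minimal sketch of the "check for gains before writing" idea, assuming Python's zlib as a stand-in for whatever compressor sits in the pipeline: the helper name compress_if_smaller and both test inputs are hypothetical, and seeded random bytes stand in for data that an earlier stage has already compressed. The last lines illustrate the 7-bit mail case, using base64 as the assumed re-encoding.

```python
import base64
import random
import zlib


def compress_if_smaller(block: bytes):
    """Return (was_compressed, payload); keep the raw block if deflate gains nothing."""
    packed = zlib.compress(block, 1)  # level 1 = fastest deflate setting
    if len(packed) < len(block):
        return True, packed
    return False, block


# Hypothetical inputs: repetitive markup, and seeded random bytes standing in
# for data that some earlier stage has already compressed.
xml_like = b"<item><name>widget</name><price>10</price></item>\n" * 200
rng = random.Random(1999)
noise = bytes(rng.getrandbits(8) for _ in range(4096))

for label, block in (("markup", xml_like), ("already compressed", noise)):
    was_compressed, payload = compress_if_smaller(block)
    state = "compressed" if was_compressed else "stored raw"
    print(label, len(block), "->", len(payload), state)

# 7-bit transport: compressed bytes must be re-encoded (base64 here), which
# adds about a third, so a short ASCII note ends up larger than it started.
note = b"A short ASCII note with little redundancy."
armoured = base64.b64encode(zlib.compress(note, 1))
print(len(note), "->", len(armoured))
```

The size check costs one comparison per block and guarantees the output never exceeds the input by more than the flag needed to mark a block as stored.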
When judging an XML compression, it is important to judge its effect
after being recompressed by the kind of compression that is found in
modems (i.e., at the bottleneck): the simplest, fastest deflate setting
found in gzip can be a useful stand-in. Furthermore, it is important to recognise that,
because of the slow-start algorithm in TCP/IP and the WWW having quite
long ACK delays, a compression of 2:1 is not the same thing as a
doubling in arrival speed: more data will arrive earlier in each packet
group, but the number of packet groups may be the same. In the case of
the binary version of XML being mentioned, it would be interesting to
see the four-way comparison (raw XML, binary "XML", compressed XML,
compressed binary).
One interesting result of my tests on the interaction of
short-referencing and compression was that collapsing white-space was
(for my independently-produced RDF test files) just as effective as
short-referencing. (One reason might be that many compression algorithms
only have a certain dictionary size, and a certain match-string size:
reducing unneeded white-space may free up dictionary entries and allow
more useful match-strings, especially for on-the-fly compression such
as that in modems.)
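
A rough way to try this kind of comparison, sketched under stated assumptions: the sample document is made up (it is not the RDF test data mentioned above), zlib at level 1 stands in for modem-style compression, and the tag-matching regex is deliberately naive (it assumes no '>' inside attribute values, no comments or CDATA sections, and that whitespace runs inside attribute values do not matter).

```python
import re
import zlib

TAG = re.compile(r"<[^>]*>")        # naive tag matcher, see assumptions above
WS_RUN = re.compile(r"[ \t\r\n]+")  # any run of white-space characters


def collapse_markup_whitespace(xml: str) -> str:
    """Replace each white-space run inside a tag with a single newline."""
    return TAG.sub(lambda m: WS_RUN.sub("\n", m.group(0)), xml)


# Hypothetical sample: one padded element repeated to give deflate some input.
sample = (
    '<rdf:Description   about="http://example.org/a"\n'
    '                   xmlns:dc="http://example.org/dc#"  >\n'
    '  <dc:title>Example</dc:title>\n'
    '</rdf:Description>\n'
) * 50

collapsed = collapse_markup_whitespace(sample)

# Report raw size and size after fast deflate for both variants.
for label, text in (("raw", sample), ("collapsed", collapsed)):
    data = text.encode("utf-8")
    print(label, len(data), len(zlib.compress(data, 1)))
```

White-space in element content is left alone; only runs inside the angle brackets are touched. How much the compressed size changes depends entirely on the data, so the numbers printed for this sample should not be read as general.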
I was surprised, because I thought that white-space was fairly
insignificant: but I was wrong, for the data I was using (some data
would fare better, I would hope, but some may be worse). So developers
should pay attention to letting users keep their file sizes down: a 10
percent reduction in file size may not seem much, but if, at an extreme,
all the packets are just over the size of the first packet group and the
ACK latency is greater than the packet transmission time, it can result
in the files completing in half the time. Given the small file sizes
typical of XML, and the trend towards linking to external stylesheets and
so on, reducing the crap in headers is quite important. In fact, I would
think it good policy to have no unnecessary whitespace in header data in
XML documents.
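
The "half the time" extreme can be worked through with a deliberately simplified slow-start model. Everything numeric here is an assumption for illustration: 1460-byte segments, a first packet group of 2 segments, a window that doubles every round trip, no loss, and ACK latency dominating transmission time so that elapsed time is roughly proportional to the number of round trips. The function name round_trips is hypothetical.

```python
import math


def round_trips(size_bytes: int, segment_bytes: int = 1460, initial_window: int = 2) -> int:
    """Count round trips needed to deliver size_bytes under idealised slow start."""
    segments = math.ceil(size_bytes / segment_bytes)
    sent, window, trips = 0, initial_window, 0
    while sent < segments:
        sent += window   # one packet group goes out per round trip
        window *= 2      # the window doubles once the group is acknowledged
        trips += 1
    return trips


print(round_trips(3000))  # 3 segments: just over the first group, so 2 round trips
print(round_trips(2700))  # 10 percent smaller: 2 segments fit in the first group, 1 round trip
```

Under these assumptions a 10 percent saving halves the round-trip count, while going from 6000 to 3000 bytes (a 2:1 compression) saves nothing at all: both need 2 round trips.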
>> and to minimize whitespace in data: (removing trailing spaces, where
>> [\ \t]+\n becomes [\n], is a safe transformation, for example.)
>No. It might be a safe transformation for specific XML formats, but
>not for XML in general, because you don't know what people might be
>using that whitespace for.
Of course. But in practice text editors and some kinds of processing
systems will often strip out trailing whitespace on opening or closing.
So I should have said something like "It is not prudent to generate
'[\ \t]+\n' where the whitespace is significant unless you are sure how
software which uses that data treats trailing white-space." In any
case, I was trying to say that one good way to reduce file sizes is to
not generate unneeded characters in the first place: I was not proposing
an external compression mechanism based on white-space collapsing.
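
To make the risk concrete, here is a small sketch of what a trailing-white-space-stripping tool does to data in which that white-space was meant to be significant. The function name and the sample string are hypothetical, and the regex assumes "\n" line ends.

```python
import re

TRAILING = re.compile(r"[ \t]+\n")  # spaces or tabs immediately before a newline


def strip_trailing_whitespace(text: str) -> str:
    """Mimic an editor that drops trailing spaces and tabs from every line."""
    return TRAILING.sub("\n", text)


# Hypothetical data where the two spaces before the line end were intended.
significant = "<pre>two trailing spaces  \nnext line</pre>\n"
stripped = strip_trailing_whitespace(significant)
print(repr(significant))
print(repr(stripped))
```

The two representations differ, which is the point: a generator that never emits such sequences has nothing to lose when an editor or filter later strips them.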
Rick
xml-dev: A list for W3C XML Developers. To post, mailto:xml-dev at ic.ac.uk
Archived as: http://www.lists.ic.ac.uk/hypermail/xml-dev/ and on CD-ROM/ISBN 981-02-3594-1
To (un)subscribe, mailto:majordomo at ic.ac.uk the following message;
(un)subscribe xml-dev
To subscribe to the digests, mailto:majordomo at ic.ac.uk the following message;
subscribe xml-dev-digest
List coordinator, Henry Rzepa (mailto:rzepa at ic.ac.uk)