Free Tool for Efficient XML Data Compression

Thomas B. Passin tpassin at
Sun Dec 19 18:54:21 GMT 1999

Philip Boutros
> Thomas B. Passin wrote:
> > I also took a 2-column Word97 document - 63.5K -
> > and opened it with Abiword (
> > then saved it.  Abiword uses an XML file format as
> > its native format.  Abiword XML file size: 48.8K.
> While I would love to chime in on the whole compressed XML discussion (my
> guess is word dictionaries and skip lists should slaughter gzip in terms of
> compression size) I would like to address this statement in particular.
> 1.
> Comparing Word97's file size to that of the current version of Abiword is a
> ludicrous exercise. I am very familiar with the Microsoft Word file format
> (no, I don't work for Microsoft and never have) and while it contains a
> number of inefficiencies and chunks of legacy garbage, given how much it
> encodes it is reasonably efficient for large documents. I can name at least
> a hundred features (styles, page layout, frames, borders, backgrounds,
> graphics, fields, properties, etc.) that Word97 must deal with in its file
> format that Abiword's format does not address. In fact, given how little
> Abiword encodes, I was surprised that Abiword wasn't 10 times as
> See #2 for a tirade about that.
Well, of course I know Word documents include a ton of stuff that, say,
Abiword files don't.  And this particular file doesn't even have any VBA
macros of my own in it. And I'm not arguing for Abiword's file format,
either.  In this case, though, my document doesn't need the rest of that
stuff Word includes.  So doing this conversion gave me some rough way to
compare sizes when the two documents were typical of the type I often use.
For an XML document, you could have argued that some other format doesn't
need end tags so of course XML would be bigger.  That's not the point.  The
point is that - I think it will turn out this way - for many actual cases,
the supposed size disadvantage of an XML document will be relatively small
or non-existent.
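
For anyone who wants to try this sort of comparison on their own data, here
is a rough sketch in Python. It is not how the figures above were obtained.
It writes the same records once as XML and once as a terse comma-separated
format, then reports raw and gzip-compressed sizes for each. The sample
records and the `sizes` helper are made up for illustration, and the result
will depend heavily on what your real documents look like.

```python
import gzip


def sizes(data: bytes) -> tuple[int, int]:
    """Return (raw size, gzip-compressed size) in bytes."""
    return len(data), len(gzip.compress(data))


# Hypothetical sample data: 500 small records (id, name, quantity).
rows = [(i, f"item{i}", i * 3 % 17) for i in range(500)]

# The same records as an XML document with start and end tags...
xml_doc = (
    "<items>"
    + "".join(
        f"<item><id>{i}</id><name>{n}</name><qty>{q}</qty></item>"
        for i, n, q in rows
    )
    + "</items>"
).encode("utf-8")

# ...and as a terse comma-separated format with no tags at all.
csv_doc = "\n".join(f"{i},{n},{q}" for i, n, q in rows).encode("utf-8")

xml_raw, xml_gz = sizes(xml_doc)
csv_raw, csv_gz = sizes(csv_doc)

print(f"XML: {xml_raw} bytes raw, {xml_gz} bytes gzipped")
print(f"CSV: {csv_raw} bytes raw, {csv_gz} bytes gzipped")
```

Comparing the raw pair of numbers against the gzipped pair gives a quick
feel for how much of the tag overhead survives compression.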

Of course, the XML standard was developed under the guideline that
"terseness ... is of minimal importance", so if file size alone is going to
be the driver, XML might not be a favorable candidate.

Tom Passin