Binary-encoding of XML for communication

Mon Sep 20 16:58:19 BST 1999

I looked at the WAP spec and the subsequent comments on this list, and have
concluded:

1) Yeah, a binary spec for XML is a cool idea after all.  If nothing else,
we can probably parse a binary rep faster than we can parse text.

2) The WAP spec does not seem to have any guiding principle for making the
transition from text to binary.  In particular, tossing out comments with
the bathwater is a strange choice.  Also, giving themselves a special set
of enumerations for DTDs is politically curious.

3) I wouldn't be surprised if a document encoded in their binary format
ended up *bigger* than a text XML doc rammed thru zlib (their use of octets
and 32-bit integers is going to lead to lots and lots of 0 bits).  Is LZ
decompression a problem in embedded devices or something?

4) This spec is a lot closer to a network protocol than it is to the XML
spec, and, IMHO, it should be an IETF RFC, not a W3C Rec.
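
The hunch in (3) is easy to measure rather than argue about. Below is a
minimal Python sketch, not a benchmark: the sample document is made up, and
it does not implement the WAP encoding at all. It only shows the two halves
of the argument, namely how far zlib shrinks repetitive XML text, and how
many zero bytes a small value costs when stored in a fixed 32-bit field.

```python
import struct
import zlib

# Hypothetical sample document; any repetitive XML will do.
record = '<item id="%d"><name>widget</name><price>9.99</price></item>'
doc = "<catalog>" + "".join(record % i for i in range(200)) + "</catalog>"
text = doc.encode("utf-8")

# Text XML rammed thru zlib at the highest compression level.
packed = zlib.compress(text, 9)

# A small value stored in a fixed 32-bit field is mostly zero bytes.
fixed = struct.pack(">I", 5)
zero_bytes = fixed.count(0)

print("text bytes:", len(text))
print("zlib bytes:", len(packed))
print("zero bytes in one 32-bit field:", zero_bytes)
```

To settle the question properly, the same document would have to be run
through a real encoder for the WAP format and the byte counts compared.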

Anyone agree?

I propose we small-fry developers could do the following:

A) Decide *why* we want a binary XML spec, including rationale for that
decision
B) Produce an elegant spec and a reference implementation in C and Java
C) Use IETF or a similarly open forum to promulgate it

I'm willing to step up to take the lead on this, although I'd happily back
off and let someone else take the reins.  I think this can help with both
download size and startup time issues with my company's product, so I'm
motivated to work on it.

With your permission, I'll take a crack at step A (using my best
approximation of the funny language of specs):

<Preamble>

The binary XML format specification, hereafter referred to as XML-bin, is
required to reduce the transmission size of XML documents, to speed
processing of those documents, and to reduce the size and complexity of XML
parser software.  (For purposes of this specification, the existing XML
specification will be referred to as XML-text.)

The XML-bin format shall be a lossless encoding of a textual
XML document.  That is, a document can be translated from XML-text to
XML-bin and back an arbitrary number of times, and no information content
will be lost.  Information content, in this sense, excludes those
properties of the text which are defined as "insignificant white space" in
the XML specification [anything else we need to exclude here?].

<Rationale>
The motivation for adjusting the machine representation of XML should be
expressed in the terms of computing machinery.  Allowing this effort to
attempt to change the rules of what should be in an XML document (e.g., the
WAP attempt to banish comments), or to fix some bigger issues (e.g.,
allowing more expressive DTDs) would doubtless interfere with acceptance of
this specification as a standard.
</Rationale>

</Preamble>

How's that?  The obvious (to me, anyway) way to implement that is to choose
a reasonable binary representation of a parse tree -- the way many
programming language compilers store data between their front-end and
back-end processes.  Maybe a string table followed by a binary dump of a
heap (a tree stored in a vector, for those of you who never took a data
structures course), all rammed thru zlib to compress out common patterns.
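
To make that concrete, here is a rough Python sketch of the idea, under
several assumptions of mine: the byte layout (32-bit little-endian counts
and indexes), the preorder flattening of the tree, and the names encode()
and decode() are all hypothetical, not a proposed format. It also covers
only elements, attributes, and character data; Python's ElementTree parser
drops comments by default, so a truly lossless XML-bin would need more node
types than this.

```python
import struct
import zlib
import xml.etree.ElementTree as ET


def encode(xml_text):
    """Encode XML text as zlib(string table + preorder node vector)."""
    strings, index, nodes = [], {}, []

    def intern(s):
        # Map each distinct string to one slot in the string table.
        s = s or ""
        if s not in index:
            index[s] = len(strings)
            strings.append(s)
        return index[s]

    def walk(elem):
        # Preorder flattening: a node's fixed fields, its attributes,
        # then its children, all as indexes into the string table.
        nodes.extend([intern(elem.tag), intern(elem.text),
                      intern(elem.tail), len(elem.attrib), len(elem)])
        for name, value in elem.attrib.items():
            nodes.extend([intern(name), intern(value)])
        for child in elem:
            walk(child)

    walk(ET.fromstring(xml_text))
    out = [struct.pack("<I", len(strings))]
    for s in strings:
        raw = s.encode("utf-8")
        out.append(struct.pack("<I", len(raw)) + raw)
    out.append(struct.pack("<%dI" % len(nodes), *nodes))
    return zlib.compress(b"".join(out), 9)


def decode(blob):
    """Rebuild an ElementTree element from the output of encode()."""
    data = zlib.decompress(blob)
    (count,) = struct.unpack_from("<I", data, 0)
    pos = 4
    strings = []
    for _ in range(count):
        (size,) = struct.unpack_from("<I", data, pos)
        pos += 4
        strings.append(data[pos:pos + size].decode("utf-8"))
        pos += size
    nodes = struct.unpack_from("<%dI" % ((len(data) - pos) // 4), data, pos)
    cursor = 0

    def build():
        nonlocal cursor
        tag, text, tail, n_attrs, n_children = nodes[cursor:cursor + 5]
        cursor += 5
        elem = ET.Element(strings[tag])
        elem.text, elem.tail = strings[text], strings[tail]
        for _ in range(n_attrs):
            elem.set(strings[nodes[cursor]], strings[nodes[cursor + 1]])
            cursor += 2
        for _ in range(n_children):
            elem.append(build())
        return elem

    return build()


# Round trip on a made-up document: text -> binary -> tree.
doc = '<order id="7"><item sku="a1">bolt</item><item sku="a2">nut</item></order>'
blob = encode(doc)
restored = decode(blob)
print(ET.tostring(restored, encoding="unicode"))
```

The round trip is the property the preamble asks for: the decoded tree
should serialize the same as the parsed original. On a document this small
the blob may well be larger than the text, since the table and zlib
overhead only pay off once tags and attribute names start repeating.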

But before we decide on the implementation, we need to reach consensus on
the motivation.  Did I capture it?

-Joshua Smith

xml-dev: A list for W3C XML Developers. To post, mailto:xml-dev at ic.ac.uk
Archived as: http://www.lists.ic.ac.uk/hypermail/xml-dev/ and on CD-ROM/ISBN 981-02-3594-1
List coordinator, Henry Rzepa (mailto:rzepa at ic.ac.uk)