Streaming XML and SAX

Sun Feb 28 03:29:32 GMT 1999

David Megginson wrote:

> ...As you
> can see in the above excerpt, the character-set discovery heuristics in
> XML are intended for use only in the absence of protocol-specific
> encoding information.

I suspect those lengthy notes were written to explain how developers should reconcile the XML
encoding declaration with the external way of declaring the encoding that already existed in
HTTP, which it would have been rather unkind to ignore.  Tim Bray's annotations to the spec
seem to confirm this.

But since we're designing a protocol independent of HTTP, we ought to let the XML encoding
declaration do its job.
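
To illustrate what "letting the encoding declaration do its job" can look like, here is a
minimal Java sketch.  It assumes a JAXP parser is available and hands it raw bytes with no
external (protocol-level) encoding hint, so the only encoding information is the declaration
inside the document.  The class name EncodingDeclarationSketch, the method rootText and the
sample document are invented for illustration.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

// Hypothetical example class; not part of any protocol discussed here.
class EncodingDeclarationSketch {

    // Parses raw bytes without any out-of-band encoding information, so the
    // parser has to rely on the XML encoding declaration in the document.
    static String rootText(byte[] rawXml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(rawXml));
        return doc.getDocumentElement().getTextContent();
    }

    public static void main(String[] args) throws Exception {
        // Assumed sample document: declared and serialized as ISO-8859-1.
        String xml = "<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?>"
                   + "<name>caf\u00e9</name>";
        byte[] raw = xml.getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(rootText(raw));
    }
}
```

The point of the sketch is only that the byte stream is self-describing: the receiver needs
nothing from the transport to decode it.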

> For example, imagine that I have a Java class
> like this:
>
>   public class Purchase {
>     public int seqno;
>     public int customerId;
>     public int vendorId;
>     public int invoiceId;
>     public float total;
>   }
>
> In XML, an instance of this information might look like this:
>
>   <purchase xmlns="http://www.ecommerce.net/ns/ec/">
>    <seqno>12345678</seqno>
>    <customer-id>87654321</customer-id>
>    <vendor-id>18273645</vendor-id>
>    <invoice-id>81726354</invoice-id>
>    <total>92674.12</total>
>   </purchase>
>
> Based on my (limited) understanding of the Java VM, the Java versions
> of Purchase objects will require 24 bytes of storage each; I'd guess
> that even a heavily-optimised generic DOM implementation would require
> at least 5-10 times as much storage (I'll welcome corrections from any
> DOM implementors on this list).
>
> In other words, if I go straight from the XML to my own object model,
> I can store 100,000 purchases in 2,400,000 bytes of storage; if I go
> from XML to a generic DOM object model, I will require between
> 12,000,000 and 24,000,000 (or more) bytes to store the same
> information, and then I will *still* have to build my own object model
> afterwards.

Multiplying your numbers by 100,000 is a little gratuitous, since it would be lousy
application design to force all 100,000 objects to be stored in DOM format at the same time
(say, by cramming them all into some super-document).  I will be the first to admit that it
takes resources to parse XML out to a standard memory representation, but I see no reason why
those resources shouldn't be in line with the work accomplished, which is mostly converting
markup to memory structures.  And actually, you should be comparing the DOM's storage with
the storage required by unparsed XML, not with that of your application object.  Unparsed XML
is how you would need to store the data if you chopped up the stream into chunks to be passed
off to separate threads or boxes as you suggest.
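
As a rough sketch of the two quantities being compared, the Java below goes straight from the
quoted XML to a Purchase object through a SAX handler, and also reports how many bytes the
unparsed XML occupies.  The names PurchaseSaxSketch, PurchaseHandler, parse and unparsedBytes
are invented for illustration, the sample document is the quoted one with whitespace removed,
and a default (non-namespace-aware) JAXP SAX parser is assumed.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical example class built around the quoted Purchase fields.
class PurchaseSaxSketch {

    // Same fields as the Purchase class quoted above.
    static class Purchase {
        int seqno, customerId, vendorId, invoiceId;
        float total;
    }

    // Copies element text straight into a Purchase; no generic tree is built.
    static class PurchaseHandler extends DefaultHandler {
        final Purchase purchase = new Purchase();
        private final StringBuilder text = new StringBuilder();

        @Override
        public void startElement(String uri, String local, String qName, Attributes atts) {
            text.setLength(0);  // start collecting text for this element
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            text.append(ch, start, length);  // may be called more than once per element
        }

        @Override
        public void endElement(String uri, String local, String qName) {
            String value = text.toString().trim();
            switch (qName) {
                case "seqno":       purchase.seqno = Integer.parseInt(value); break;
                case "customer-id": purchase.customerId = Integer.parseInt(value); break;
                case "vendor-id":   purchase.vendorId = Integer.parseInt(value); break;
                case "invoice-id":  purchase.invoiceId = Integer.parseInt(value); break;
                case "total":       purchase.total = Float.parseFloat(value); break;
                default:            break;  // e.g. the enclosing <purchase> element
            }
        }
    }

    // The quoted instance document, without the indentation.
    static final String XML =
          "<purchase xmlns=\"http://www.ecommerce.net/ns/ec/\">"
        + "<seqno>12345678</seqno>"
        + "<customer-id>87654321</customer-id>"
        + "<vendor-id>18273645</vendor-id>"
        + "<invoice-id>81726354</invoice-id>"
        + "<total>92674.12</total>"
        + "</purchase>";

    static Purchase parse(byte[] rawXml) throws Exception {
        PurchaseHandler handler = new PurchaseHandler();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new ByteArrayInputStream(rawXml), handler);
        return handler.purchase;
    }

    // Storage needed to hold the document unparsed, assuming UTF-8 bytes.
    static int unparsedBytes(String xml) {
        return xml.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) throws Exception {
        Purchase p = parse(XML.getBytes(StandardCharsets.UTF_8));
        System.out.println("seqno=" + p.seqno + " total=" + p.total);
        System.out.println("unparsed XML bytes: " + unparsedBytes(XML));
    }
}
```

Whatever the exact figures turn out to be on a given VM, the unparsed form is several times
larger than the 24-byte estimate for the object, which is the baseline a DOM representation
would have to be measured against under the argument above.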

Tom Harding
