Streaming XML and SAX

Sat Feb 27 19:03:58 GMT 1999

Tom Harding writes:

 > How?  You would doubtless agree that mandating a specific encoding
 > for all streams sidesteps one of the major benefits of XML.
 > Introducing an encoding declaration mechanism into the transport
 > protocol, as HTTP does, would duplicate the function of the XML
 > processor.  

Here's a short excerpt from the non-normative Appendix F of the XML
1.0 Recommendation:

  The second possible case occurs when the XML entity is accompanied by
  encoding information, as in some file systems and some network
  protocols. When multiple sources of information are available, their
  relative priority and the preferred method of handling conflict should
  be specified as part of the higher-level protocol used to deliver
  XML. Rules for the relative priority of the internal label and the
  MIME-type label in an external header, for example, should be part of
  the RFC document defining the text/xml and application/xml MIME
  types. In the interests of interoperability, however, the following
  rules are recommended.

  - If an XML entity is in a file, the Byte-Order Mark and
    encoding-declaration PI are used (if present) to determine the
    character encoding. All other heuristics and sources of
    information are solely for error recovery.

  - If an XML entity is delivered with a MIME type of text/xml, then
    the charset parameter on the MIME type determines the character
    encoding method; all other heuristics and sources of information
    are solely for error recovery.

  - If an XML entity is delivered with a MIME type of application/xml,
    then the Byte-Order Mark and encoding-declaration PI are used (if
    present) to determine the character encoding. All other heuristics
    and sources of information are solely for error recovery. 

  These rules apply only in the absence of protocol-level documentation;
  in particular, when the MIME types text/xml and application/xml are
  defined, the recommendations of the relevant RFC will supersede these
  rules.

If I were defining a streaming protocol for e-commerce, news,
financial markets, etc., I probably would mandate a single encoding
for all packets (UTF-8 or UTF-16), just to keep things simple.  As you 
can see in the above excerpt, the character-set discovery heuristics in 
XML are intended for use only in the absence of protocol-specific
encoding information.
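
To make the priority order in that excerpt concrete, here is a rough
Java sketch of the three recommended rules.  The class and method
names are hypothetical (nothing like them appears in the
Recommendation), and two details are my own assumptions: when both an
encoding declaration and a Byte-Order Mark are present I let the
declaration win, and when text/xml arrives without a charset parameter
I return null rather than guess, since the excerpt leaves that case to
the protocol.

```java
// Hypothetical helper: a sketch of the Appendix F priority rules quoted above.
final class XmlEncodingRules {
    private XmlEncodingRules() {}

    /**
     * @param mimeType     "text/xml", "application/xml", or null for a plain file
     * @param charsetParam charset parameter from the external header, or null
     * @param bomEncoding  encoding implied by a Byte-Order Mark, or null
     * @param declared     encoding named in the encoding declaration, or null
     * @return the encoding to use, or null if the quoted rules do not decide
     */
    static String choose(String mimeType, String charsetParam,
                         String bomEncoding, String declared) {
        String type = (mimeType == null) ? "" : mimeType.trim().toLowerCase();
        if (type.equals("text/xml")) {
            // The external charset parameter determines the encoding;
            // internal labels are only for error recovery.
            return charsetParam; // may be null: no default is assumed here
        }
        // Files and application/xml: use the BOM and encoding declaration.
        // Assumption: prefer the declaration when both are present.
        if (declared != null) {
            return declared;
        }
        if (bomEncoding != null) {
            return bomEncoding;
        }
        // With neither a BOM nor a declaration, XML 1.0 expects UTF-8.
        return "UTF-8";
    }
}
```

A protocol that mandates one encoding for every packet would replace
all of this with a constant, which is the simplification I had in
mind.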

 <snip/>

 > It's amazing how two people can see things so differently.  I think
 > it's supremely elegant that only the XML processor needs to look at
 > data coming off the wire.  It's also as efficient as it gets.  

It is efficient only if you know for certain that you need to use a
single object model for all of the XML information that you're
receiving; otherwise, you'll end up building a generic object model
(like a DOM), then tearing it down to build an optimised
domain-specific one (such as a vector graphic or a
financial-transaction object), and that process would be painful.

 > course the software architecture that handles the documents emitted
 > must be modular and extensible, but the task of parsing is done.

Parsing is relatively easy (though it's wasteful to do it twice);
building an object model from the parsing is time- and
resource-consuming.  For example, imagine that I have a Java class
like this:

  public class Purchase {
    public int seqno;
    public int customerId;
    public int vendorId;
    public int invoiceId;
    public float total;
  }

In XML, an instance of this information might look like this:

  <purchase xmlns="http://www.ecommerce.net/ns/ec/">
   <seqno>12345678</seqno>
   <customer-id>87654321</customer-id>
   <vendor-id>18273645</vendor-id>
   <invoice-id>81726354</invoice-id>
   <total>92674.12</total>
  </purchase>

Based on my (limited) understanding of the Java VM, the Java version
of a Purchase object will require 24 bytes of storage; I'd guess
that even a heavily-optimised generic DOM implementation would require
at least 5-10 times as much storage (I'll welcome corrections from any 
DOM implementors on this list).

In other words, if I go straight from the XML to my own object model,
I can store 100,000 purchases in 2,400,000 bytes of storage; if I go
from XML to a generic DOM object model, I will require between
12,000,000 and 24,000,000 (or more) bytes to store the same
information, and then I will *still* have to build my own object model 
afterwards.
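
Going straight from the XML to the domain objects might look
something like the sketch below.  It is only an illustration under
some assumptions: it uses the namespace-aware SAX and JAXP interfaces
found in current Java releases, the handler class and its parse()
helper are hypothetical names, and the Purchase fields are redeclared
inside the handler only so that the example compiles on its own.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: fills Purchase records from SAX events as they
// arrive, so no generic tree is ever built.
class PurchaseHandler extends DefaultHandler {
    // Same fields as the Purchase class above, repeated for self-containment.
    static class Purchase {
        int seqno, customerId, vendorId, invoiceId;
        float total;
    }

    static final String NS = "http://www.ecommerce.net/ns/ec/";

    final List<Purchase> purchases = new ArrayList<>();
    private Purchase current;                              // record being filled
    private final StringBuilder text = new StringBuilder(); // current element text

    @Override
    public void startElement(String uri, String local, String qName, Attributes atts) {
        text.setLength(0); // discard whitespace between elements
        if (NS.equals(uri) && "purchase".equals(local)) {
            current = new Purchase();
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        text.append(ch, start, length); // may be called several times per element
    }

    @Override
    public void endElement(String uri, String local, String qName) {
        if (current == null || !NS.equals(uri)) {
            return;
        }
        String value = text.toString().trim();
        switch (local) {
            case "seqno":       current.seqno = Integer.parseInt(value); break;
            case "customer-id": current.customerId = Integer.parseInt(value); break;
            case "vendor-id":   current.vendorId = Integer.parseInt(value); break;
            case "invoice-id":  current.invoiceId = Integer.parseInt(value); break;
            case "total":       current.total = Float.parseFloat(value); break;
            case "purchase":    purchases.add(current); current = null; break;
            default:            break; // ignore elements this sketch does not know
        }
    }

    // Convenience wrapper: parse one document held in a string.
    static List<Purchase> parse(String xml) {
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true);
            PurchaseHandler handler = new PurchaseHandler();
            factory.newSAXParser().parse(new InputSource(new StringReader(xml)), handler);
            return handler.purchases;
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("could not parse purchase stream", e);
        }
    }
}
```

The only per-document state kept here is one partly filled Purchase
and a small text buffer; each finished record goes directly into the
list, which is the point of the storage comparison above.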

All the best,

David

-- 
David Megginson                 david at megginson.com
           http://www.megginson.com/
