SAX: New Idea for Entity Resolution

Fri Apr 17 13:45:25 BST 1998

James Clark writes:

 [my example omitted]

 > This is fine except that it should use byte streams not character
 > streams.  What you get if you are reading from the net or from an
 > archive or a database or whatever is bytes not characters and it is
 > part of the function of an XML processor to manage the conversion
 > into characters using the encoding declaration and the XML specified
 > mechanisms for encoding auto-detection.  You could provide both,
 > but the fundamental one is for a stream of bytes.  Also the
 > EntityResolver needs to be able to indicate an externally specified
 > encoding (as with the additional argument for parse with a
 > SAXByteStream).  In other words SAXEntityResolver needs to return
 > an object with two members: a SAXByteStream and a (possibly null)
 > String.
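
If I read this correctly, the resolver's return value would look
roughly like the Java sketch below.  The names ByteEntity and
openReader are placeholders of my own, not part of any SAX draft, and
java.io.InputStream stands in for SAXByteStream.  The UTF-8 fallback
is only an assumption standing in for the encoding auto-detection
that the XML spec describes.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;

// Hypothetical holder: a byte stream plus a (possibly null) external encoding.
class ByteEntity {
    final InputStream bytes;
    final String encoding; // null means "no externally specified encoding"

    ByteEntity(InputStream bytes, String encoding) {
        this.bytes = bytes;
        this.encoding = encoding;
    }

    // Parser-side conversion from bytes to characters.
    // Assumption: a real parser would auto-detect the encoding when none
    // is supplied; this sketch simply falls back to UTF-8.
    Reader openReader() throws IOException {
        String name = (encoding != null) ? encoding : "UTF-8";
        return new InputStreamReader(bytes, name);
    }
}
```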

I hope that people will at least admire my wisdom if I admit that I am
not smart enough to figure this one out myself.  I suspect that this
will be the Last Great Issue with SAX before we can finalise it, so
help will be appreciated.

Here are what seem to me to be the costs and benefits of supporting
character streams, byte streams, or both:

* Character streams only

  Pro: - the application writer has specialised knowledge about the
         information source that the parser writer lacks; as a
         result, the application writer can better optimise the
         conversion, if necessary
       - information from dialogue boxes, internal buffers, and
         (eventually, with internationalisation) databases will all be
         characters rather than bytes
       - most programming languages are moving towards characters and
         away from processing raw bytes 
       - many programming languages (such as Java) already have
         standard methods for converting byte streams to character
         streams, and application writers can use these if needed or
         desired

  Con: - the application may have to convert from bytes to characters
         itself if its input source supplies only bytes
       - the parser may have its own, internal, efficient mechanism
         for byte-stream conversion

* Byte streams only

  Pro: - supports the lowest common denominator: all platforms have
         some concept of a byte stream
       - allows parsers to use their own, efficient, internal methods
         for byte-stream conversion

  Con: - adds serious inefficiencies, since characters (say, from a
         dialogue box, an internal buffer, or a database with I18N
         support) will have to be decomposed back into bytes to be
         passed to the parser, then reassembled back into characters
         by the parser
       - requires a new SAX class encapsulating a ByteStream and its
         recommended encoding

* Both byte and character streams

  Pro: - keeps everyone happy

  Con: - requires more interfaces
       - requires another method in the Parser interface
       - requires a new SAX class encapsulating a ByteStream and its
         recommended encoding (or perhaps the ByteStream interface
         will have a getEncoding() method)
       - will greatly complicate the EntityResolver mechanism (the
         application will need to be able to return a byte stream _or_
         a character stream -- how could I handle this?)
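
One shape the combined option might take, purely as a sketch: a
single holder that carries either kind of stream, so that the
resolver still returns one type.  Every name here (EntitySource,
fromBytes, fromCharacters, open) is hypothetical, and the UTF-8
default is an assumption standing in for real encoding
auto-detection.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;

// Hypothetical holder carrying a byte stream OR a character stream.
class EntitySource {
    private final InputStream byteStream; // null when characters are supplied
    private final Reader charStream;      // null when bytes are supplied
    private final String encoding;        // may be null; only used for bytes

    private EntitySource(InputStream byteStream, Reader charStream, String encoding) {
        this.byteStream = byteStream;
        this.charStream = charStream;
        this.encoding = encoding;
    }

    // For sources that are bytes: the net, an archive, a file.
    static EntitySource fromBytes(InputStream in, String encoding) {
        return new EntitySource(in, null, encoding);
    }

    // For sources that are already characters: a dialogue box, a buffer.
    static EntitySource fromCharacters(Reader in) {
        return new EntitySource(null, in, null);
    }

    // Parser side: characters pass through untouched; bytes are decoded once.
    Reader open() throws IOException {
        if (charStream != null) {
            return charStream;
        }
        // Assumption: UTF-8 stands in for the parser's own auto-detection.
        String name = (encoding != null) ? encoding : "UTF-8";
        return new InputStreamReader(byteStream, name);
    }
}
```

An application holding text from a dialogue box would call
EntitySource.fromCharacters(new StringReader(text)), while one
reading from the net would call EntitySource.fromBytes(stream, null)
and leave the conversion to the parser.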

Thanks, and all the best,

David

-- 
David Megginson                 ak117 at freenet.carleton.ca
Microstar Software Ltd.         dmeggins at microstar.com
      http://home.sprynet.com/sprynet/dmeggins/

xml-dev: A list for W3C XML Developers. To post, mailto:xml-dev at ic.ac.uk
Archived as: http://www.lists.ic.ac.uk/hypermail/xml-dev/