SAX: Byte Stream Needed?

Fri Apr 17 05:03:20 BST 1998

David Megginson wrote:
> 
> James Clark writes:
> 
>  [...]
> 
>  > InputStreamReader, however, leaves something to be desired because
>  > it doesn't allow users to supply their own character-to-byte
>  > conversion routines. But if you have an InputStream you should be
>  > using the interface to the parser that takes an InputStream.  In
>  > any case it's not practical to use an InputStreamReader for XML
>  > because that won't deal with XML's rules for detecting encodings.
> 
> I have actually been toying with omitting the byte-stream parse()
> method altogether, so that there would be only two parse methods:
> 
>   public abstract void parse (String publicId, String systemId)
>     throws java.lang.Exception;
> 
>   public abstract void parse (String publicId, String systemId,
>                               SAXCharacterStream input)
>     throws java.lang.Exception;
> 
> I've defined SAXCharacterStream as follows:
> 
>   public interface SAXCharacterStream {
>     public abstract int read ()
>       throws SAXException;

Why do you need this?

>     public abstract int read (char ch[], int start, int count)
>       throws SAXException;
>   }
> 
> (Where SAXException is, in the Java version, a direct and unmodified
> subclass of java.io.IOException).  The result of either method is -1
> if there are no characters left to read; otherwise, it is a UTF-16
> character value for the first, and the number of characters read for
> the second.
> 
> The advantage of using SAXCharacterStream is that behaviour over CORBA
> (or, I suppose, DCOM) is now well-defined.  The disadvantage is
> another bloody interface.
> 
> I had also written a SAXByteStream, but then I started wondering why
> we really need it -- information coming from a database, for example,
> or from a buffer should already be in characters, not in raw bytes
> (and in Java, at least, it is simple to wrap a Reader around any
> InputStream when necessary -- I expect that other languages will have
> good internationalisation support soon).
> 
> Can anyone put forward a convincing case for having a standard SAX
> method parsing from a raw byte stream (remembering that
> implementations can always extend the SAXParser interface themselves
> for special requirements)?

You would be biasing SAX towards implementations that work internally by
converting into UTF-16 and then parsing.  Not all parsers work like this
and it is not the most efficient way to write a parser.  My parsers work
directly on a stream of bytes and don't convert to a character stream
first.  That's one reason why they are faster than other parsers.  In
fact the way I would implement support for a SAXCharacterStream is to
wrap an InputStream around it to turn it into a sequence of bytes.
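
As a rough illustration of that last point (a sketch only, not code from
any actual parser): the CharacterStream interface below is a hypothetical
stand-in for the draft SAXCharacterStream, with IOException in place of
SAXException, and all class names are made up. The wrapper emits each
UTF-16 code unit as two big-endian bytes with no byte order mark, so the
parser would have to be told that the encoding is UTF-16BE.

```java
import java.io.IOException;
import java.io.InputStream;

final class CharStreamBytes {
    /** Hypothetical stand-in for the draft SAXCharacterStream. */
    interface CharacterStream {
        int read() throws IOException;
        int read(char[] ch, int start, int count) throws IOException;
    }

    /** Presents a character stream as big-endian UTF-16 bytes (no BOM). */
    static final class Utf16InputStream extends InputStream {
        private final CharacterStream in;
        private int pendingLow = -1; // low byte of the current code unit, if not yet returned

        Utf16InputStream(CharacterStream in) {
            this.in = in;
        }

        @Override
        public int read() throws IOException {
            if (pendingLow >= 0) {
                int b = pendingLow;
                pendingLow = -1;
                return b;
            }
            int c = in.read();
            if (c < 0) {
                return -1; // end of the character stream
            }
            pendingLow = c & 0xFF;   // keep the low byte for the next call
            return (c >> 8) & 0xFF;  // return the high byte first
        }
    }

    /** In-memory character stream, used here only for demonstration. */
    static CharacterStream fromString(final String s) {
        return new CharacterStream() {
            private int pos = 0;

            public int read() {
                return pos < s.length() ? s.charAt(pos++) : -1;
            }

            public int read(char[] ch, int start, int count) {
                if (pos >= s.length()) {
                    return -1;
                }
                int n = Math.min(count, s.length() - pos);
                s.getChars(pos, pos + n, ch, start);
                pos += n;
                return n;
            }
        };
    }

    public static void main(String[] args) throws IOException {
        InputStream in = new Utf16InputStream(fromString("<a/>"));
        for (int b = in.read(); b >= 0; b = in.read()) {
            System.out.printf("%02X ", b);
        }
        System.out.println();
    }
}
```

A byte-oriented parser can then consume the wrapper like any other
InputStream, at the cost of an extra conversion step.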

XML implementations may well provide their own machinery for converting
from bytes to characters.  The system-provided facilities (as in Java)
are in practice often slow, buggy (lacking surrogate support, for
example), and inconsistent between platforms.  By providing only
SAXCharacterStream you would be preventing users from taking advantage
of this machinery when not reading from a URL.

Another reason is that the XML-defined mechanisms for specifying the
encoding (the encoding declaration and auto-detection of encodings)
would not be available when reading from a character stream.

Yet another issue is that the XML spec specifies how to parse byte
streams, not character streams.  When you try to infer from it how to
parse character streams, issues arise, such as the treatment of the byte
order mark and the encoding declaration, which the XML spec does not
define for that case.
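
To make the byte-level detection concrete, here is a partial sketch of
guessing the encoding family from the first few bytes of an entity,
loosely following the non-normative auto-detection appendix of the XML
recommendation. The class name, method name, and returned labels are
illustrative assumptions; the UCS-4 and EBCDIC cases are omitted, and a
real parser would go on to read the encoding declaration itself.

```java
final class EncodingSniffer {
    /**
     * Guesses an encoding family from up to four leading bytes.
     * Returns "ASCII-COMPATIBLE" when the bytes spell "<?xm" in an
     * ASCII-based encoding, meaning the encoding declaration must be read.
     */
    static String guess(byte[] head) {
        int[] u = new int[4];
        for (int i = 0; i < 4; i++) {
            u[i] = i < head.length ? (head[i] & 0xFF) : -1; // -1 marks a missing byte
        }
        // Byte order marks.
        if (u[0] == 0xFE && u[1] == 0xFF) return "UTF-16BE";
        if (u[0] == 0xFF && u[1] == 0xFE) return "UTF-16LE";
        if (u[0] == 0xEF && u[1] == 0xBB && u[2] == 0xBF) return "UTF-8";
        // No byte order mark: look for "<?" in 16-bit code units.
        if (u[0] == 0x00 && u[1] == 0x3C && u[2] == 0x00 && u[3] == 0x3F) return "UTF-16BE";
        if (u[0] == 0x3C && u[1] == 0x00 && u[2] == 0x3F && u[3] == 0x00) return "UTF-16LE";
        // "<?xm" in an ASCII-compatible encoding.
        if (u[0] == 0x3C && u[1] == 0x3F && u[2] == 0x78 && u[3] == 0x6D) return "ASCII-COMPATIBLE";
        // No declaration and no byte order mark: UTF-8 is the default.
        return "UTF-8";
    }
}
```

None of these checks can be made once the input has already been turned
into characters, because the leading bytes are no longer visible.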

I think SAX is getting way too complicated and both stream interfaces
should be left out for now.  If you are going to have only one, it
should be SAXByteStream, not SAXCharacterStream.

James
