SAX: Character Stream vs. Byte Stream proposal...
Tyler Baker
tyler at infinet.com
Fri Apr 17 18:18:34 BST 1998
David Megginson wrote:
> Tyler Baker writes:
>
> > Why not simply have a standard factory that takes any type of
> > InputStream (UTF-16, UTF-8, etc) similar to how the parse method
> > works and it returns a type (say CharacterStream) which can then be
> > passed to either the parser or the application. In this case the
> > implementations for doing all of this low level character reading
> > from bytes could be standardized for each platform.
>
> The problem is that SAX is an API, not an architecture -- that is, it
> attempts to impose the fewest possible constraints on implementations.
> There are several good reasons for this approach:
>
> 1. SAX is one of (possibly) many APIs that an XML parser will
> implement, and other APIs may make conflicting demands.
In this case it usually makes sense to have a separate parser for each API set
rather than having code like this:
    if (parserAPICode == SAX) {
        // Do SAX parsing
    }
    else if (parserAPICode == Foo) {
        // Do Foo parsing
    }
Conditionals like this will greatly degrade the speed of your code if every
method is littered with them. Better to just write a new parser for every new
API. Nevertheless, having a standard way for each parser to get at the low level
stuff makes sense from a code-reuse as well as consistency standpoint.
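
To make that concrete, here is a minimal sketch of two separate front ends that share one low-level component instead of branching on an API code. Every name in it (TagScanner, StartTagHandler, EventParser, ListParser) is hypothetical and none of it is part of SAX; the scanner is deliberately naive and only stands in for "the low level stuff".

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical shared low-level component: pulls start-tag names out of a
// character stream. Naive on purpose (no attributes, comments, or CDATA).
final class TagScanner {
    private final Reader in;

    TagScanner(Reader in) {
        this.in = in;
    }

    // Returns the next start-tag name, or null at end of input.
    String nextStartTag() {
        try {
            int c;
            while ((c = in.read()) != -1) {
                if (c != '<') {
                    continue;
                }
                StringBuilder name = new StringBuilder();
                while ((c = in.read()) != -1 && Character.isLetterOrDigit(c)) {
                    name.append((char) c);
                }
                if (name.length() > 0) {
                    return name.toString();
                }
            }
            return null;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}

// Hypothetical callback for an event-style API (SAX-like in spirit only).
interface StartTagHandler {
    void startTag(String name);
}

// One parser per API: this one reports events to a handler.
final class EventParser {
    void parse(Reader in, StartTagHandler handler) {
        TagScanner scanner = new TagScanner(in);
        for (String n = scanner.nextStartTag(); n != null; n = scanner.nextStartTag()) {
            handler.startTag(n);
        }
    }
}

// A second parser for a different, equally hypothetical API: it returns a list.
final class ListParser {
    List<String> parse(Reader in) {
        TagScanner scanner = new TagScanner(in);
        List<String> names = new ArrayList<>();
        for (String n = scanner.nextStartTag(); n != null; n = scanner.nextStartTag()) {
            names.add(n);
        }
        return names;
    }
}

final class SharedScannerDemo {
    public static void main(String[] args) {
        System.out.println(new ListParser().parse(new StringReader("<a><b/></a>")));
    }
}
```

Neither front end contains an "if SAX ... else if Foo" test; the only thing they have in common is the scanner they both reuse.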
> 2. XML parsers need to compete on speed, memory usage, etc., and to do
> so, they need to be free to take different approaches.
I was suggesting that you would still have an interface, but I feel that a default
implementation of byte-to-character decoding in the SAX package is perfectly
reasonable. I may get flames for this, but I think most parsers will compete on
how they solve an application's XML handling problems (the design), not on
whether one parser is 1% faster than another. In this case, a default solid
implementation for character encoding would allow parser writers to concentrate
on coming up with new and interesting ways to allow applications to model XML
content, instead of having to worry about bit shifting all over the place.
Typically, low-level stuff such as this I feel should be implemented once and
then reused over and over again. There are only so many ways to write character
encoders / decoders and I would wager that most parsers out there have very
similar implementations for reading from byte streams. XML's beauty is not in
the fact that the spec defines support for about 6 or so different character
encoding formats; it is in everything else. If another character encoding format
comes out, then every SAX parser may need a rewrite. If people
could agree upon one good efficient dependable implementation, then no one (other
than the people doing the 600 or so lines of character encoding implementation
code) will have to do a thing. Of course, people could plug in their own
character encoder / decoder implementations if they so choose, but at least they
would have the choice.
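
As a rough illustration of what I mean by "an interface plus a default implementation", here is a minimal sketch. The names CharacterStreamFactory, DefaultCharacterStreamFactory and CharacterStreams are my own invention and are not part of SAX; the default simply delegates to the JDK's java.io.InputStreamReader, which is an assumption about where the decoding code would come from.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;

// Hypothetical interface: turns a byte stream in a named encoding into characters.
interface CharacterStreamFactory {
    Reader open(InputStream bytes, String encoding) throws IOException;
}

// Default implementation: reuse the platform's decoders rather than
// hand-writing the bit shifting in every parser.
final class DefaultCharacterStreamFactory implements CharacterStreamFactory {
    @Override
    public Reader open(InputStream bytes, String encoding) throws IOException {
        // Throws UnsupportedEncodingException if the encoding name is unknown.
        return new InputStreamReader(bytes, encoding);
    }
}

// Hypothetical registry: parsers ask here for a Reader, and anyone who wants
// a different decoder can plug in their own factory.
final class CharacterStreams {
    private static CharacterStreamFactory factory = new DefaultCharacterStreamFactory();

    static void setFactory(CharacterStreamFactory replacement) {
        factory = replacement;
    }

    // Convenience helper: decode a whole byte array with the current factory.
    static String decode(byte[] bytes, String encoding) {
        try (Reader in = factory.open(new ByteArrayInputStream(bytes), encoding)) {
            StringBuilder text = new StringBuilder();
            char[] buf = new char[1024];
            for (int n = in.read(buf); n != -1; n = in.read(buf)) {
                text.append(buf, 0, n);
            }
            return text.toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

A parser written against CharacterStreamFactory never sees bytes at all, and swapping the decoder is a single call to setFactory.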
I really think it would have made a hell of a lot more sense for XML to have one
standard encoding format, say UTF-16 or UTF-8, instead of defining the legal
encoding formats in the spec itself. It would make much more sense, I feel, to
just convert everything to UTF-8 or UTF-16 if documents were in a different
format, rather than to force parser writers to handle just about every major
character encoding format known to man. One example would be databases which may
store XML content in a proprietary character format. An XML parser for the
database will need to translate from the native character format to something
defined in the XML spec anyway (unless you want to deviate from it).
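
The "convert everything first" step could look something like the sketch below. Utf8Transcoder is a hypothetical helper, not anything from SAX or an existing parser, and it assumes the source encoding is already known and is one the JDK can decode.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;

// Hypothetical helper: re-encode a byte stream as UTF-8 before it ever
// reaches the parser, so the parser only has to understand one encoding.
final class Utf8Transcoder {

    // Copies 'in' to 'out', decoding from sourceEncoding and encoding as UTF-8.
    static void transcode(InputStream in, String sourceEncoding, OutputStream out)
            throws IOException {
        Reader reader = new InputStreamReader(in, sourceEncoding);
        Writer writer = new OutputStreamWriter(out, StandardCharsets.UTF_8);
        char[] buf = new char[4096];
        for (int n = reader.read(buf); n != -1; n = reader.read(buf)) {
            writer.write(buf, 0, n);
        }
        writer.flush(); // push any buffered characters through to 'out'
    }

    // Convenience wrapper for in-memory data.
    static byte[] toUtf8(byte[] source, String sourceEncoding) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try {
            transcode(new ByteArrayInputStream(source), sourceEncoding, out);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }
}
```

Note that this only re-encodes the bytes. If the document carries an encoding declaration in its XML declaration, that text is copied through unchanged and would have to be fixed up separately.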
Anyways, just some suggestions...
Tyler