Character Streams vs. Byte Streams

Fri Apr 17 18:50:28 BST 1998

Michael Kay wrote:

> James Clark:
> >This is fine except that it should use byte streams not character
> >streams.  What you get if you are reading from the net or from an
> >archive or a database or whatever is bytes not characters...
>
> I have enormous respect for James's arguments as always but on this one I
> beg to disagree. The reason I have asked for support for character streams
> is so that the parser can process not only stuff stored on disc but *the
> output of another program*. For example, I have an application where the
> XML document is constructed as the result of an SQL query that pulls
> together fragments of XML stored in different places in a database. The SQL
> query, like most other programs I use and write, prefers to output
> characters rather than bytes. That, after all, is the reason XML was
> designed to be human-readable.
>
> And I have to say that in my experience so far, the parsers are so lightning
> fast compared with the application that generates the XML or consumes it,
> that an argument based on saving microseconds will not sway me much.

This matches my own experience, and it is why I decided to write my own parser
for my application.  When using SAX I found that my DocumentHandler
implementation took about 75% of the processing time while the parser took only
about 25%.  The main reason, I found, is that String.equals() is quite
expensive, and it is really the only good way to recognize elements in SAX.
When I switched to the Object framework I designed, the parsing time of my own
parser implementation was lower, but more importantly, the time spent in the
application handling the XML content dropped below the time spent in the
parser, which was a big surprise.
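
To make the element-recognition point concrete, here is a minimal Java sketch
of two ways a handler can recognize elements: a chain of String.equals() calls,
and a table that maps each element name to a handler object once, so that each
event needs a single lookup.  It is an illustration only and is not the Object
framework mentioned above.  The class ElementDispatchSketch, the interface
ElementAction, and the element names are all hypothetical, and no timing is
claimed for either approach.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: none of these names come from SAX or from the post.
final class ElementDispatchSketch {

    /** Action bound to one element type; looked up once per start-tag. */
    interface ElementAction {
        void start(StringBuilder log);
    }

    // Style 1: recognize the element with a chain of String.equals() calls.
    // In the worst case every known name is compared before a match is found.
    static void dispatchByEquals(String name, StringBuilder log) {
        if (name.equals("title")) {
            log.append("T");
        } else if (name.equals("author")) {
            log.append("A");
        } else if (name.equals("para")) {
            log.append("P");
        }
        // Unknown element names are ignored.
    }

    // Style 2: bind each element name to a handler object up front.
    private final Map<String, ElementAction> actions = new HashMap<>();

    ElementDispatchSketch() {
        actions.put("title", log -> log.append("T"));
        actions.put("author", log -> log.append("A"));
        actions.put("para", log -> log.append("P"));
    }

    // One hash lookup per event, regardless of how many names are known.
    void dispatchByLookup(String name, StringBuilder log) {
        ElementAction action = actions.get(name);
        if (action != null) {
            action.start(log);
        }
    }

    public static void main(String[] args) {
        String[] events = {"title", "para", "unknown", "author"};
        ElementDispatchSketch table = new ElementDispatchSketch();
        StringBuilder viaEquals = new StringBuilder();
        StringBuilder viaLookup = new StringBuilder();
        for (String name : events) {
            dispatchByEquals(name, viaEquals);
            table.dispatchByLookup(name, viaLookup);
        }
        System.out.println(viaEquals); // prints TPA
        System.out.println(viaLookup); // prints TPA
    }
}
```

Both styles produce the same result; the difference is only in how much string
comparison happens per event as the number of known element names grows.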

> I don't think there is a real problem with the XML spec. This defines the
> syntax of XML in terms of characters. It requires the parser to accept
> certain encodings of the character stream as a byte stream, but it permits
> the parser to accept other encodings and therefore by implication to
> delegate the decoding of the byte stream to another object in the system.
> In fact it explicitly recognises that an "external transport protocol" might
> have a say in the matter, and that is a term we could interpret very widely.

This is another reason why I feel a CharacterStreamFactory is a good idea.  It
separates the low-level character encoding from the rest of the parser, which I
feel should only really need to deal with one encoding format in the first
place.  If there were a default CharacterStreamFactory implementation, I feel
the following issues would be important:

- The default implementation should support all of the character encoding
formats defined in the XML 1.0 spec.
- The default implementation should have a way to add support for custom
character encoding formats (as with databases).
- The default implementation should have a mechanism for replacing the
implementation used for a given encoding, if the parser writer chooses to do so
for optimization or for some other reason.
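
The three points above could be sketched in Java roughly as follows.  This is
only one possible shape under stated assumptions, not an existing API: the
ReaderFactory interface and the register() and createReader() methods are
hypothetical names, and the defaults are limited to UTF-8 and UTF-16, the two
encodings every XML 1.0 processor must be able to read.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UnsupportedEncodingException;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

/** Hypothetical: turns a byte stream in one encoding into a character stream. */
interface ReaderFactory {
    Reader open(InputStream bytes) throws IOException;
}

/** Hypothetical registry; the parser itself would only ever see a Reader. */
final class CharacterStreamFactory {
    private final Map<String, ReaderFactory> decoders = new HashMap<>();

    CharacterStreamFactory() {
        // Defaults: the encodings an XML 1.0 processor is required to read.
        register("UTF-8", in -> new InputStreamReader(in, StandardCharsets.UTF_8));
        register("UTF-16", in -> new InputStreamReader(in, StandardCharsets.UTF_16));
    }

    /** Adds a custom encoding, or replaces the decoder for an existing one. */
    void register(String encoding, ReaderFactory factory) {
        decoders.put(encoding.toUpperCase(Locale.ROOT), factory);
    }

    /** Looks up the decoder by case-insensitive encoding name. */
    Reader createReader(InputStream bytes, String encoding) throws IOException {
        ReaderFactory factory = decoders.get(encoding.toUpperCase(Locale.ROOT));
        if (factory == null) {
            throw new UnsupportedEncodingException(encoding);
        }
        return factory.open(bytes);
    }

    /** Small helper for the demo: drains a Reader into a String. */
    static String readAll(Reader reader) throws IOException {
        StringBuilder out = new StringBuilder();
        char[] buffer = new char[256];
        int count;
        while ((count = reader.read(buffer)) != -1) {
            out.append(buffer, 0, count);
        }
        return out.toString();
    }

    public static void main(String[] args) throws IOException {
        CharacterStreamFactory factory = new CharacterStreamFactory();
        // Adding an encoding that is not among the defaults.
        factory.register("ISO-8859-1",
                in -> new InputStreamReader(in, StandardCharsets.ISO_8859_1));

        byte[] doc = "<greeting>caf\u00e9</greeting>"
                .getBytes(StandardCharsets.ISO_8859_1);
        Reader chars = factory.createReader(new ByteArrayInputStream(doc), "iso-8859-1");
        System.out.println(readAll(chars));
    }
}
```

Because register() overwrites any earlier entry for the same name, the same call
covers both adding a custom encoding and swapping in an optimized decoder for a
default one.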

The alternative, I feel, is never-ending code bloat, as with the current major
word processors, which all carry endless amounts of kludgy code for reading
each other's proprietary document formats and in the end just significantly
bloat the application's resource consumption.

Tyler

xml-dev: A list for W3C XML Developers. To post, mailto:xml-dev at ic.ac.uk
Archived as: http://www.lists.ic.ac.uk/hypermail/xml-dev/