Character Stream vs. Byte Stream proposal...

Fri Apr 17 07:50:14 BST 1998

Why not simply have a standard factory that takes any type of
InputStream (UTF-16, UTF-8, etc.), similar to how the parse method
works, and returns a type (say CharacterStream) which can then be passed
to either the parser or the application?  The implementations for all of
this low-level character reading from bytes could then be standardized
for each platform.  That way you could have a lot of different parsers
without redundant character-converting implementations in each of them,
which, as I have seen, add almost 50% code bloat in some instances.
Yes, this would mean that a concrete implementation of all these stream
types in a CharacterStream factory would have to be agreed upon for each
language, but I feel this is absolutely essential to SAX: it makes
writing parsers a ton easier, since you don't have to worry about very
low-level encoding formats that can take years to learn.  Java would not
be successful at all if all the low-level stuff were simply defined as
interfaces and not as concrete implementations.  If XML adds or removes
encoding formats or other low-level specifications in the future, many
parser writers may not have the time or expertise to redo everything all
the time.
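
As a rough illustration of the factory idea above, here is a minimal
Java sketch.  The names CharacterStreamFactory,
DefaultCharacterStreamFactory and createCharacterStream are hypothetical
(none of them is part of SAX), and java.io.Reader stands in for the
"CharacterStream" type.  The default implementation simply delegates to
the JDK's own converters, so the sketch only supports whatever encodings
the platform supports.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;

// Hypothetical interface: not part of SAX, the name is an assumption.
interface CharacterStreamFactory {
    // Wraps a raw byte stream in a character stream for the named encoding.
    Reader createCharacterStream(InputStream bytes, String encoding) throws IOException;
}

// Hypothetical shared default: decoding is left to the JDK converters,
// so no parser has to ship its own byte-to-character code.
class DefaultCharacterStreamFactory implements CharacterStreamFactory {
    public Reader createCharacterStream(InputStream bytes, String encoding) throws IOException {
        // Throws UnsupportedEncodingException (an IOException) when the
        // platform has no converter for the requested encoding.
        return new InputStreamReader(bytes, encoding);
    }
}

class CharacterStreamDemo {
    public static void main(String[] args) throws IOException {
        byte[] utf8 = "<doc>caf\u00e9</doc>".getBytes("UTF-8");
        CharacterStreamFactory factory = new DefaultCharacterStreamFactory();
        Reader chars = factory.createCharacterStream(new ByteArrayInputStream(utf8), "UTF-8");

        // A parser (or the application) would consume characters from here on.
        StringBuilder text = new StringBuilder();
        for (int c = chars.read(); c != -1; c = chars.read()) {
            text.append((char) c);
        }
        System.out.println(text);
    }
}
```

The point of the sketch is only that the parser sees characters and
never touches the bytes; how the encoding name is discovered (for
example from the XML declaration) is left out.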

The closest analogy I can think of is if everyone had to write their own
Java version of System.arraycopy().

Having about 5 billion different byte to character translation
implementations would be akin to having 5 billion java.util.Vector
implementations.

Nevertheless, the standard factory could be represented as an interface
so that parsers which absolutely need to do their own byte-to-character
translation could still do so.

The closest analogy I can think of to this is the pluggable sockets
framework in JDK 1.1 and beyond.
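
A minimal sketch of what such a plug-in point might look like, loosely
modeled on the way java.net.Socket.setSocketImplFactory lets an
application install its own factory once.  ByteToCharFactory and
PluggableCharacterStreams are hypothetical names chosen for this
example, and the set-once rule is an assumption borrowed from the
sockets framework, not something any SAX draft specifies.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;

// Hypothetical plug-in interface a parser could implement itself.
interface ByteToCharFactory {
    Reader open(InputStream bytes, String encoding) throws IOException;
}

// Hypothetical registry: uses the JDK converters unless a custom
// factory has been installed.
final class PluggableCharacterStreams {
    private static ByteToCharFactory factory; // null means "use the default"

    private PluggableCharacterStreams() {}

    // May be called at most once, like Socket.setSocketImplFactory.
    static synchronized void setFactory(ByteToCharFactory custom) {
        if (custom == null) {
            throw new IllegalArgumentException("factory must not be null");
        }
        if (factory != null) {
            throw new IllegalStateException("factory already set");
        }
        factory = custom;
    }

    // Entry point used by parsers and applications alike.
    static synchronized Reader open(InputStream bytes, String encoding) throws IOException {
        if (factory != null) {
            return factory.open(bytes, encoding);
        }
        return new InputStreamReader(bytes, encoding); // default implementation
    }
}
```

A parser that is happy with the default just calls
PluggableCharacterStreams.open(stream, "UTF-8"); one that needs its own
translation calls setFactory(...) once at startup.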

Any ideas?  I don't want to see SAX turn into an interface explosion,
nor do I feel all parsers should do the most redundant activities
possible at the I/O level.

Last but not least, some parsers (such as the one I have) could of
course benefit immensely from a concrete default implementation of
these character streams, since for people like me low-level byte to
character I/O is not a personal forte.  The parser I have written uses
its own proprietary XML Object framework, which I feel is more efficient
in some respects for modeling data in Java than an event-based approach
like SAX.  It is non-validating at the moment (unfinished), and it seems
to parse about twice as fast as Aelfred right now for my documents,
which was a huge surprise: 220 milliseconds of parsing vs. Aelfred's 459
milliseconds after several tests.  Writing out XML data in a tree-like
form consistently took under 20 milliseconds.  Please take these numbers
with a grain of salt, since the parser is currently pretty much
non-validating and the XML documents were not, I feel, large enough for
any true comparison.  The main goal of the framework was to eliminate
the common if-then-else handling in an event-based parser, which may be
part of the speed increase.  Simply having a fast parser is, I feel, not
useful if the way it hands data to applications requires significant
overhead to handle.  This approach, I feel, has significant advantages
over event-based parsing; however, it also has significant drawbacks
that are hard to elaborate on unless I go in depth about how the parser
works.

For the actual application I have, modeling data in an event-based way
has maintenance problems, and the Element factory concept of parsers
like MSXML is, I feel, very resource hungry, since they essentially
construct a symbolic tree at runtime (at least that is my
understanding).  I would have preferred not to do any XML parser writing
at all, but I felt that both event-based parsing and a parser that
represents elements as an XML tree were flawed in design for the needs
of my particular application.

I would make the XML Object Framework free, since its design is totally
removed from the application itself and we will never actually try to
make money off it, but the startup I am with is in the process of
incorporating itself, and until that happens I cannot just hand out
stuff for free for legal reasons other than under my personal name (-:

For those interested, it handles both input and output of XML data in a
way very similar to how Object serialization works in Java.  In fact,
the application I am developing needs to represent its content in both
formats for various technical and political reasons.  Oh well, enough of
the self-aggrandizement...

In summary, I think this would immensely help out all parser writers,
not just the ones who have event-based parsers.  It would significantly
reduce code bloat for SAX parsers (and therefore the applications), and
it would allow all parsers to use an efficient default byte-to-character
factory rather than have to muddy themselves with bit shifting of
octets.
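
To give a sense of the "bit shifting of octets" that each parser would
otherwise carry around, here is a deliberately naive UTF-8 decoder in
Java.  NaiveUtf8Decoder is a hypothetical helper written for this
example, not code from any existing parser.  It handles only one-, two-
and three-byte sequences, does not reject overlong encodings, and covers
a single encoding; a real parser would need the equivalent for every
encoding it supports.

```java
import java.io.IOException;
import java.io.InputStream;

// Hypothetical helper: a minimal, incomplete UTF-8 decoder.
class NaiveUtf8Decoder {
    // Returns the next character (up to U+FFFF), or -1 at end of stream.
    static int readChar(InputStream in) throws IOException {
        int b0 = in.read();
        if (b0 < 0) {
            return -1;                                   // end of stream
        }
        if (b0 < 0x80) {
            return b0;                                   // 0xxxxxxx
        }
        if ((b0 & 0xE0) == 0xC0) {                       // 110xxxxx 10xxxxxx
            return ((b0 & 0x1F) << 6) | continuation(in);
        }
        if ((b0 & 0xF0) == 0xE0) {                       // 1110xxxx 10xxxxxx 10xxxxxx
            int high = (b0 & 0x0F) << 12;
            int mid = continuation(in) << 6;
            return high | mid | continuation(in);
        }
        throw new IOException("unsupported or malformed lead byte: 0x"
                + Integer.toHexString(b0));
    }

    // Reads one 10xxxxxx continuation byte and returns its six payload bits.
    private static int continuation(InputStream in) throws IOException {
        int b = in.read();
        if (b < 0 || (b & 0xC0) != 0x80) {
            throw new IOException("malformed UTF-8 continuation byte");
        }
        return b & 0x3F;
    }
}
```

With a shared default factory, none of this would need to live in the
parser at all.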

If there is even a dream of dynamically loading various parsers at
runtime, I think it should be a priority to eliminate as many
redundancies between parsers as possible, not just for the parser
writers' sake, but for the actual people who use XML in their
applications.  Byte-to-character encoding via a default factory
interface (with a default implementation that comes with SAX) would, I
think, be a good start.

Tyler

P.S. - My comments about my parser in comparison to Aelfred are in no
way meant as a challenge to Aelfred, as I have the greatest respect for
David.  In fact, my parser may in the end, with validation and such, be
much less efficient than Aelfred or any other major parser.  I guess
when I finally finish it up, I will be able to see what the true results
are.  Nevertheless, I think my approach will significantly improve the
application's performance in handling XML documents even if the parser
itself is inefficient.

xml-dev: A list for W3C XML Developers. To post, mailto:xml-dev at ic.ac.uk
Archived as: http://www.lists.ic.ac.uk/hypermail/xml-dev/