SAX: New Idea for Entity Resolution
David Megginson
ak117 at freenet.carleton.ca
Fri Apr 17 13:45:25 BST 1998
James Clark writes:
[my example omitted]
> This is fine except that it should use byte streams not character
> streams. What you get if you are reading from the net or from an
> archive or a database or whatever is bytes not characters and it is
> part of the function of an XML processor to manage the conversion
> into bytes using the encoding declaration and the XML specified
> mechanisms for encoding auto-detection. You could provide both,
> but the fundamental one is for a stream of bytes. Also the
> EntityResolver needs to be able to indicate an externally specified
> encoding (as with the additional argument for parse with a
> SAXByteStream). In other words SAXEntityResolver needs to return
> an object with two members: a SAXByteStream and a (possibly null)
> String.
I hope that people will at least admire my wisdom if I admit that I am
not smart enough to figure this one out myself. I suspect that this
will be the Last Great Issue with SAX before we can finalise it, so
help will be appreciated.
Here are what seem to me to be the costs and benefits of supporting
character streams, byte streams, or both:
* Character streams only
  Pro: - the application writer has specialised knowledge about the
         information source that the parser writer lacks; as a
         result, the application writer can better optimise the
         conversion, if necessary
       - information from dialogue boxes, internal buffers, and
         (eventually, with internationalisation) databases will all be
         characters rather than bytes
       - most programming languages are moving towards characters and
         away from processing raw bytes
       - many programming languages (such as Java) already have
         standard methods for converting byte streams to character
         streams, and application writers can use these if needed or
         desired
  Con: - the application may have to convert from bytes to characters
         itself if a character source is not available
       - the parser may have its own, internal, efficient mechanism
         for byte-stream conversion, which would go unused
* Byte streams only
  Pro: - supports the lowest common denominator: all platforms have
         some concept of a byte stream
       - allows parsers to use their own, efficient, internal methods
         for byte-stream conversion
  Con: - adds serious inefficiencies, since characters (say, from a
         dialogue box, an internal buffer, or a database with I18N
         support) will have to be decomposed back into bytes to be
         passed to the parser, then reassembled back into characters
         by the parser
       - requires a new SAX class encapsulating a ByteStream and its
         recommended encoding
* Both byte and character streams
  Pro: - keeps everyone happy
  Con: - requires more interfaces
       - requires another method in the Parser interface
       - requires a new SAX class encapsulating a ByteStream and its
         recommended encoding (or perhaps the ByteStream interface
         will have a getEncoding() method)
       - will greatly complicate the EntityResolver mechanism (the
         application will need to be able to return a byte stream _or_
         a character stream -- how could I handle this?)
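
One possible way to make the "both" option concrete is a single holder
object that carries either a byte stream (with an optional, externally
specified encoding) or a character stream. The following is only a
minimal sketch in present-day Java: the class name ResolvedInput and all
of its methods are hypothetical and are not part of any SAX draft, and
the UTF-8 fallback stands in for the encoding auto-detection that a real
XML processor would perform.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical holder: exactly one of byteStream / characterStream is set.
final class ResolvedInput {
    private final InputStream byteStream;  // raw bytes, or null
    private final String encoding;         // externally specified encoding, or null
    private final Reader characterStream;  // already-decoded characters, or null

    private ResolvedInput(InputStream bytes, String encoding, Reader chars) {
        this.byteStream = bytes;
        this.encoding = encoding;
        this.characterStream = chars;
    }

    // For input read from the net, an archive, or a database: bytes plus
    // a (possibly null) encoding name known from outside the entity.
    static ResolvedInput ofBytes(InputStream bytes, String encoding) {
        return new ResolvedInput(bytes, encoding, null);
    }

    // For input that is already characters (dialogue box, internal buffer).
    static ResolvedInput ofCharacters(Reader chars) {
        return new ResolvedInput(null, null, chars);
    }

    InputStream getByteStream() { return byteStream; }
    String getEncoding() { return encoding; }
    Reader getCharacterStream() { return characterStream; }

    // What a parser might do on receipt: prefer the character stream,
    // otherwise decode the bytes itself.
    Reader openReader() {
        if (characterStream != null) {
            return characterStream;
        }
        // Assumption: fall back to UTF-8 when no encoding is supplied.
        // A real parser would inspect the first bytes and the encoding
        // declaration instead.
        Charset cs = (encoding != null) ? Charset.forName(encoding)
                                        : StandardCharsets.UTF_8;
        return new InputStreamReader(byteStream, cs);
    }

    // Small helper for trying the sketch out: drain a Reader into a String.
    static String readAll(Reader reader) {
        try {
            StringBuilder sb = new StringBuilder();
            int c;
            while ((c = reader.read()) != -1) {
                sb.append((char) c);
            }
            return sb.toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

With something of this shape, the resolver would have a single return
type, and the application would choose ofBytes() or ofCharacters()
according to what it actually has in hand; the cost is the extra class
already listed among the cons above.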
Thanks, and all the best,
David
--
David Megginson ak117 at freenet.carleton.ca
Microstar Software Ltd. dmeggins at microstar.com
http://home.sprynet.com/sprynet/dmeggins/