String interning (WAS: SAX2/Java: Towards a final form)
Clark C. Evans
clark.evans at manhattanproject.com
Thu Jan 13 20:58:09 GMT 2000
On Thu, 13 Jan 2000, David Brownell wrote:
> "Clark C. Evans" wrote:
>> There are going to be lots of server side filter
>> architectures using the SAX interface which may
>> not do this. Indeed, I'd say that the "parser"
>> interface is mis-named. It's really an "emitter".
>> And I'd go so far to say that in a few years,
>> 99% of the "emitters" out there won't be parsers!
>
> I sort of hope so. XML data models shouldn't be forced to stop right
> above parsing; it's not always appropriate. It should certainly be
> possible to assemble pipelines of components which may optionally
> be sourced by a parser, but don't need to be.
I've been doing a lot of work building a filter architecture.
And I've found the SAX 1.0 interface lacking. Below is a
suggested modification for SAX2. Your comments would be
cool.
Here is the relevant post to the SML-DEV list (edited
here for clarity). The concepts are identical;
just add attributes and all of the other XML stuff.
---------- Forwarded message ----------
Date: Thu, 13 Jan 2000 15:04:41 -0500 (EST)
From: Clark C. Evans <clark.evans at manhattanproject.com>
Reply-To: sml-dev at egroups.com
To: sml-dev at egroups.com
Subject: [sml-dev] Character Tugging
Consider the following interface for
"push" elements, and "pull" characters.
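import java.io.Reader;

public interface Handler {
    public void begin(String name);         // pushed for every start tag
    public void characters(CharTug value);  // content is pulled from the CharTug
    public void end(String name);           // pushed for every end tag
}
public interface CharTug {
    public Reader toReader();    // pull the characters as a stream
    public String toString();    // pull the characters as a String
    public boolean hasObject();  // true if an application object is available
    public Object getObject();   // the application object, if any
}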
Thus, a handler would be pushed the "begin" event,
for every SML start tag, and an "end" event for
every SML end tag.
This much is very similar to the SAX API.
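As a rough illustration of the push side, here is a minimal sketch of
a handler being driven by hand. The two interfaces are the ones
proposed above; TextTug and TraceHandler are hypothetical names made
up for this example, and TextTug is a text-only stand-in rather than
anything a real parser would supply.

```java
import java.io.Reader;
import java.io.StringReader;

// The two interfaces as proposed in the post.
interface Handler {
    void begin(String name);
    void characters(CharTug value);
    void end(String name);
}

interface CharTug {
    Reader toReader();
    String toString();
    boolean hasObject();
    Object getObject();
}

// Hypothetical text-only CharTug, used here only to drive the handler.
final class TextTug implements CharTug {
    private final String text;
    TextTug(String text) { this.text = text; }
    public Reader toReader() { return new StringReader(text); }
    @Override public String toString() { return text; }
    public boolean hasObject() { return false; }
    public Object getObject() { return null; }
}

// Hypothetical handler that records the events pushed to it.
final class TraceHandler implements Handler {
    private final StringBuilder trace = new StringBuilder();
    public void begin(String name) { trace.append('<').append(name).append('>'); }
    // The handler decides how to pull the content; here it asks for a String.
    public void characters(CharTug value) { trace.append(value.toString()); }
    public void end(String name) { trace.append("</").append(name).append('>'); }
    String trace() { return trace.toString(); }
}
```

Any emitter, parser or not, can then call begin(), characters() and
end() on the handler in document order.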
However, where it differs is "characters". For the
characters event, most SAX implementations that I
have read make a temporary copy of the relevant parts
of the input buffer and pass it to the handler in
zero or more events.
The "characters" event for the SAX interface
has several problems:
1. The handler may receive two or more sequential
characters() event calls when an element's
content crosses a buffer boundary. Thus state
must be maintained, and the termination of a
sequence of characters is determined by two
other events, the begin or end. Hardly obvious.
2. Most of the time, the character array is
converted into a string, thus the temporary
memory is allocated and then immediately
de-allocated. This is not optimal.
Alternatively, the characters passed can
be direct pointers into the parser's character
buffer -- but the value may be stored,
and this could cause unexpected problems.
3. If SAX events are put into a processor
pipeline, then an application specific
object, let's say "Currency", must be
converted to characters and back for
each stage of the processing. This is,
to say the very least, inefficient.
4. In the common case of building a string,
the handler must put in special code.
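To make problems 1 and 4 concrete, here is a rough sketch of the
"special code" a handler ends up writing. BufferingHandler is a
hypothetical, simplified stand-in with SAX-like method shapes; it is
not the real org.xml.sax API.

```java
// Simplified stand-in for a SAX 1.0 style handler (not the real org.xml.sax API).
final class BufferingHandler {
    // State kept between calls, because one run of text may arrive in pieces.
    private final StringBuilder pending = new StringBuilder();
    private String lastText = "";

    void startElement(String name) { flush(); }

    // May be called several times for a single run of character content.
    void characters(char[] ch, int start, int length) {
        pending.append(ch, start, length);
    }

    void endElement(String name) { flush(); }

    // Only a begin or end event tells the handler the text is complete.
    private void flush() {
        if (pending.length() > 0) {
            lastText = pending.toString();
            pending.setLength(0);
        }
    }

    String lastText() { return lastText; }
}
```

The text is only known to be complete once the next begin or end
event arrives, which is the non-obvious part.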
By passing a CharTug instead, most of these
problems are solved.
1. If the application would rather 'read'
the information directly, it can ask for a
reader, toReader(). The parser is then like
a FilterReader, scanning for begin/end tags,
and properly terminating the character sequence.
2. In the case of toReader(), no additional
intermediate storage is needed.
3. For the pipeline case, hasObject() can be
called to see if an application specific
object, like Currency or Integer, has
already been built. If so, the handler can ask
for it with getObject() instead -- rather than
breaking down the currency into characters and
then re-building it at the other end.
4. And, for the common case, toString() is
a helper function which will return the
characters as a String object. For the
parser case, the parser would build the
string directly from its input buffer.
For the pipeline case, the previous stage
could use the toString() method of its
application specific object. If a
reader is requested, then it can either
build a custom reader, or it can return
a StringReader from the toString() result.
This case can be provided as a helper class.
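A rough sketch of what such a helper class might look like, together
with a pipeline stage that prefers the object over the characters.
ValueTug and IntegerStage are hypothetical names invented for this
example; the CharTug interface is repeated only so the sketch
compiles on its own.

```java
import java.io.Reader;
import java.io.StringReader;

// The CharTug interface as proposed in the post.
interface CharTug {
    Reader toReader();
    String toString();
    boolean hasObject();
    Object getObject();
}

// Hypothetical helper: carries either plain text or an application object.
final class ValueTug implements CharTug {
    private final Object object; // null when only text is available
    private final String text;   // null when built from an object

    private ValueTug(Object object, String text) {
        this.object = object;
        this.text = text;
    }

    static ValueTug ofText(String text) { return new ValueTug(null, text); }
    static ValueTug ofObject(Object object) { return new ValueTug(object, null); }

    public boolean hasObject() { return object != null; }
    public Object getObject() { return object; }

    // Falls back to the object's own toString() when no text was supplied.
    @Override public String toString() {
        return object != null ? object.toString() : text;
    }

    // Simplest option from the post: a StringReader over the toString() result.
    public Reader toReader() { return new StringReader(toString()); }
}

// Hypothetical pipeline stage that wants an Integer.
final class IntegerStage {
    static int read(CharTug value) {
        // Use the already-built object if the previous stage supplied one.
        if (value.hasObject() && value.getObject() instanceof Integer) {
            return (Integer) value.getObject();
        }
        // Otherwise fall back to parsing the characters.
        return Integer.parseInt(value.toString().trim());
    }
}
```

A parser-backed CharTug would presumably build its own reader over
the input buffer instead of going through a String.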
Note:
If the handler wants to disregard the characters
content, then at worst case, a tiny CharTug object
(does it need any member variables?) will have been
created and destroyed, with far less memory usage than
the corresponding char[] in a characters call.