String interning (WAS: SAX2/Java: Towards a final form)

Clark C. Evans clark.evans at manhattanproject.com
Thu Jan 13 20:58:09 GMT 2000


On Thu, 13 Jan 2000, David Brownell wrote:
> "Clark C. Evans" wrote: 
>> There are going to be lots of server side filter 
>> architectures using the SAX interface which may 
>> not do this.  Indeed, I'd say that the "parser" 
>> interface is mis-named.  It's really an "emitter". 
>> And I'd go so far to say that in a few years, 
>> 99% of the "emitters" out there won't be parsers!
>
> I sort of hope so.  XML data models shouldn't be forced to stop right
> above parsing; it's not always appropriate.  It should certainly be
> possible to assemble pipelines of components which may optionally
> be sourced by a parser, but don't need to be.

I've been doing a lot of work building a filter architecture.  
And I've found the SAX 1.0 interface lacking.  Below is a 
suggested modification for SAX2.  Your comments would be
cool.  

Here is the relevant post to the SML-DEV list (edited 
here for clarity).  The concepts are identical;
just add attributes and all of the other XML stuff.


---------- Forwarded message ----------
Date: Thu, 13 Jan 2000 15:04:41 -0500 (EST)
From: Clark C. Evans <clark.evans at manhattanproject.com>
Reply-To: sml-dev at egroups.com
To: sml-dev at egroups.com
Subject: [sml-dev] Character Tugging

Consider the following interface for
"push" elements, and "pull" characters.

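  import java.io.Reader;

  public interface Handler {
    public void begin(String name);        // pushed for every start tag
    public void characters(CharTug value); // content is pulled via the CharTug
    public void end(String name);          // pushed for every end tag
  }
  public interface CharTug {
    public Reader  toReader();   // stream the characters
    public String  toString();   // the characters as a String
    public boolean hasObject();  // has an application object been built?
    public Object  getObject();  // that object, if hasObject() is true
  }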

Thus, a handler would be pushed the "begin" event, 
for every SML start tag, and an "end" event for 
every SML end tag.  

  This much is very similar to the SAX API.

However, where it differs is "characters".  For the
characters event, most SAX implementations that I
have read make a temporary copy of the relevant parts 
of the input buffer and pass it to the handler in zero
or more events.

The "characters" event for the SAX interface
has several problems:

1. The handler may receive two or more sequential
   characters() event calls when an element's 
   content crosses a buffer boundary.  Thus state
   must be maintained, and the termination of a
   sequence of characters is determined by two
   other events, the begin or end.  Hardly obvious.

2. Most of the time, the character array is 
   converted into a string, thus the temporary
   memory is allocated and then immediately 
   de-allocated.  This is not optimal.  
   Alternatively, the characters passed can
   be direct pointers into the parser's character
   buffer -- but the value may be stored,
   and this could cause unexpected problems.

3. If SAX events are put into a processor 
   pipeline, then an application-specific
   object, let's say "Currency", must be 
   converted to characters and back for
   each stage of the processing.  This is,
   to say the very least, inefficient.

4. In the common case of building a string,
   the handler must put in special code.
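
To make problems 1 and 4 concrete, here is a minimal sketch of the
"special code" a handler needs today.  It assumes the SAX2 helper
class org.xml.sax.helpers.DefaultHandler purely for convenience (the
characters() signature is the same as in SAX 1.0); the class name
TextCollector and its texts field are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: joins however many characters() calls the
// emitter makes for one run of text into a single String.
class TextCollector extends DefaultHandler {
    private final StringBuilder buffer = new StringBuilder();
    final List<String> texts = new ArrayList<>();

    @Override
    public void startElement(String uri, String localName,
                             String qName, Attributes atts) {
        flush(); // a start tag terminates any pending text
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // May be called several times for one run of text,
        // so the state has to be kept between calls.
        buffer.append(ch, start, length);
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        flush(); // an end tag terminates any pending text
    }

    private void flush() {
        if (buffer.length() > 0) {
            texts.add(buffer.toString());
            buffer.setLength(0);
        }
    }
}
```

Only the begin and end events tell the handler that the text is
complete, which is the non-obvious part described in problem 1.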

By passing a CharTug instead, most of these
problems are solved.

1. If the application would rather 'read' 
   the information directly, it can ask for a 
   reader, toReader().  The parser is then like 
   a FilterReader, scanning for begin/end tags, 
   and properly terminating the character sequence.

2. In the case of toReader(), no additional 
   intermediate storage is needed.

3. For the pipeline case, hasObject() can be
   called to see if an application-specific
   object, like Currency or Integer, has 
   already been built.  If so, then the handler
   can ask for this instead -- rather than breaking
   down the currency into characters and then
   re-building it at the other end.

4. And, for the common case, toString() is
   a helper function which will return the
   characters as a String object.  For the
   parser case, the parser would build the
   string directly from its input buffer.

   For the pipeline case, the previous stage
   could use the toString() method of its
   application specific object.  If a
   reader is requested, then it can either
   build a custom reader, or it can return
   a StringReader from the toString() result.
   This case can be provided as a helper class.
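
A rough sketch of how the four points above could fit together,
under stated assumptions: StringCharTug, ObjectCharTug and
TotalHandler are hypothetical names, BigDecimal stands in for an
application object such as Currency, and the interfaces are repeated
(without the public modifiers) only so the example is self-contained.

```java
import java.io.Reader;
import java.io.StringReader;
import java.math.BigDecimal;

interface Handler {
    void begin(String name);
    void characters(CharTug value);
    void end(String name);
}

interface CharTug {
    Reader toReader();
    String toString();
    boolean hasObject();
    Object getObject();
}

// Hypothetical helper for the parser case: only text is available.
class StringCharTug implements CharTug {
    private final String text;
    StringCharTug(String text) { this.text = text; }
    public Reader toReader() { return new StringReader(text); }
    @Override public String toString() { return text; }
    public boolean hasObject() { return false; }
    public Object getObject() { return null; }
}

// Hypothetical helper for the pipeline case: an earlier stage has
// already built an object, so later stages need not re-parse it.
class ObjectCharTug implements CharTug {
    private final Object value;
    ObjectCharTug(Object value) { this.value = value; }
    // Falls back to a StringReader over the toString() result.
    public Reader toReader() { return new StringReader(value.toString()); }
    @Override public String toString() { return value.toString(); }
    public boolean hasObject() { return true; }
    public Object getObject() { return value; }
}

// Example stage: sums amounts, using the object when one is offered
// and parsing the characters otherwise.
class TotalHandler implements Handler {
    BigDecimal total = BigDecimal.ZERO;

    public void begin(String name) { }

    public void characters(CharTug value) {
        BigDecimal amount;
        if (value.hasObject() && value.getObject() instanceof BigDecimal) {
            amount = (BigDecimal) value.getObject();
        } else {
            amount = new BigDecimal(value.toString().trim());
        }
        total = total.add(amount);
    }

    public void end(String name) { }
}
```

The handler receives exactly one characters() call per run of text,
so it keeps no buffer of its own, and the same stage works whether
the emitter in front of it is a parser or another pipeline stage.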

Note:  

   If the handler wants to disregard the character
   content, then in the worst case a tiny CharTug object
   (does it need any member variables?) will have been 
   created and destroyed, with far less overhead than
   the corresponding char[] in a characters() call.



xml-dev: A list for W3C XML Developers. To post, mailto:xml-dev at ic.ac.uk
Archived as: http://www.lists.ic.ac.uk/hypermail/xml-dev/ or CD-ROM/ISBN 981-02-3594-1
Please note: New list subscriptions now closed in preparation for transfer to OASIS.




