String interning (WAS: SAX2/Java: Towards a final form)

Fri Jan 14 16:40:23 GMT 2000

Tim Bray wrote,
> Miles Sabin wrote:
> > Anyhow, maybe the waters are getting a bit muddied. I'm
> > assuming that all parsers will do interning of one sort or
> > another internally. The issue for me is how much of that 
> > gets exposed via the SAX API. I don't want java-interning 
> > exposed, because that means my parser has no option but to 
> > use String.intern().
>
> Yes.   Given that *every* credible parser does this,

No argument here (assuming that my 'all parsers' => your 'every
credible parser').

> ... it's a major convenience for programmers using the 
> API to be able to compare strings with ==, there is at some 
> level an argument that we ought to expose this fact.
>
> I'd go further; based on having written a parser, it seems to 
> me that the only sane tactic is for the parser to use 
> java.intern(), but only once for each unique name, with some 
> sort of internal char[] or equivalent table.  If this is 
> true, it's an even stronger argument for just saying "element
> types and attribute names coming out of the parser are
> intern()ed, period".

OK, this is David M's position.

Sure, there's a case for this. But there's a case against too. 
There are at least two scenarios in which this would be a 
burden.

One is where SAX isn't sitting on top of a parser (this is
Arkin's worry). Instead it's generating SAX events from a DOM 
tree, Java reflection, or some other data structure, a JDBC 
query perhaps.

Unlike a parser, these event sources deliver Strings directly, 
so if there were no requirement to String.intern() they could 
simply pass Strings straight through the ContentHandler API. A 
requirement that SAX return String.intern()'d Strings rules 
that out tho', because none of DOM, reflection, or JDBC make 
any guarantees that the Strings they return are interned. The 
cost of interning (whether via a direct call on String.intern() 
or via a David M style lookup against a table of interned 
Strings) would be a significant additional overhead.
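
A minimal sketch of why pass-through is ruled out, assuming only
standard java.lang.String behaviour. The class and method names
(InternDemo, fromExternalSource) are hypothetical; the method simply
stands in for a name handed over by a DOM, reflection, or JDBC source.

```java
final class InternDemo {
    // Hypothetical stand-in for a non-parser event source: the name is
    // built at run time, so it is a distinct object from any literal.
    static String fromExternalSource(String name) {
        return new String(name.toCharArray());
    }

    public static void main(String[] args) {
        String literal = "elem1";   // string literals are interned by the JVM
        String external = fromExternalSource("elem1");

        System.out.println(external.equals(literal));      // true: same characters
        System.out.println(external == literal);           // false: different objects
        System.out.println(external.intern() == literal);  // true: only after the extra call
    }
}
```

The third line is the per-event cost that an interning requirement
would push onto every such event source.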

You could argue that these aren't legitimate or central uses
of the SAX API. But if you want to do that you should make it
explicit, because it's likely to be quite a controversial
line.

The other scenario is mine (multiple parsers running over
arbitrary documents in multiple threads) where the global
String.intern() map is a point of contention. I won't bore
everyone with the details again.
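
For illustration, a per-parser table of the kind discussed in this
thread might be sketched as follows. This is an assumption-laden
sketch, not any particular parser's implementation: LocalInternTable
is a hypothetical name, and a real parser would more likely look names
up from a char[] range so that no String is created on a hit.

```java
import java.util.HashMap;
import java.util.Map;

// One instance per parser, used from that parser's thread only, so no
// synchronization and no shared global map is involved.
final class LocalInternTable {
    private final Map<String, String> table = new HashMap<>();

    // Returns the canonical instance for s within this table only.
    String intern(String s) {
        String canonical = table.get(s);
        if (canonical == null) {
            table.put(s, s);
            canonical = s;
        }
        return canonical;
    }

    public static void main(String[] args) {
        LocalInternTable parserA = new LocalInternTable();
        String first = parserA.intern(new String("elem1"));
        String second = parserA.intern(new String("elem1"));
        System.out.println(first == second);  // true: unique within one parser

        LocalInternTable parserB = new LocalInternTable();
        String other = parserB.intern(new String("elem1"));
        System.out.println(first == other);   // false: tables are independent
    }
}
```

Strings uniqued this way compare with == only against other Strings
from the same table, which is exactly what would not be guaranteed if
the API promised String.intern() semantics.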

Here too you could argue that this isn't a core use for the
SAX API, but again it would be helpful if that argument were
made up front, because I don't think I'm the only one who wants
to use SAX this way: I'd guess there are people who want to use
XML -> HTML transforms via XSL in servlets on heavily loaded
servers ... they could be hit by this problem too.

There seem to be two main points to your argument for
String.intern()ing.

1. Reducing the amount of String object creation in parsers.

   I don't think _anybody_ thinks that this isn't important.
   The only issue is how best to do it. String.intern() is
   one way. An internal parser data structure is another.

2. Allowing ContentHandler implementors to use == instead of
   String.equals()

   This isn't at all clear cut.

   First, I suspect that at least some of the push for this is
   the possibility that some people are implementing handlers
   as huge case statements,

     if(foo.equals("elem1"))
       // handle elem1
     else if(foo.equals("elem2"))
       // handle elem2

     // repeat many times

     else if(foo.equals("elemn"))
       // handle elemn

   Whilst it's certainly true that being able to replace all
   the calls on String.equals() with == would significantly
   improve performance here (if there were a large number of
   cases), it's highly likely that switching to a better
   algorithm (chained conditionals are effectively linear
   search) would do even better, e.g.,

     ElementHandler handler =
       (ElementHandler)someTable.lookup(foo);

     handler.handleElement();

   On the other hand if there are few branches, then, with a
   decent JVM, the difference between String.equals and == is
   going to be insignificant.

   In any case, with a small addition to the XMLReader
   interface it'd be possible for internal parser tables to
   support most of the core cases of this programming style 
   (even if it's poor style). If we added the following method
   to XMLReader,

     String intern(String toBeInterned)

   Then a ContentHandler implementor could write code like
   this,

     XMLReader r = XMLReaderFactory.createReader();

     // Pre-populate the reader's intern table and get refs
     // to the (per-parser) uniqued Strings

     String ELEM1 = r.intern("elem1");
     String ELEM2 = r.intern("elem2");
     // ...
     String ELEMN = r.intern("elemn");

     r.parseDocument(...);
     r.parseDocument(...);
     // etc.

     // Then in the ContentHandler implementation

     if(foo == ELEM1)
       // handle elem1
     else if(foo == ELEM2)
       // handle elem2

     // Repeat many times

     else if(foo == ELEMN)
       // handle elemn

I think that something like this will get most people most of
what they want.

To be honest, tho', I don't see any particular reason why the
SAX API should be expected to support this sort of code.
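
For comparison, the table-driven alternative mentioned under point 2
might be sketched like this. It is only a sketch under assumptions:
ElementHandler here is a hypothetical one-method interface,
java.util.HashMap stands in for "someTable", and handleElement()
returns a String purely so the example has something to print.

```java
import java.util.HashMap;
import java.util.Map;

final class DispatchDemo {
    // Hypothetical handler type; not part of the SAX API.
    interface ElementHandler {
        String handleElement();
    }

    static Map<String, ElementHandler> buildTable() {
        Map<String, ElementHandler> table = new HashMap<>();
        table.put("elem1", () -> "handled elem1");
        table.put("elem2", () -> "handled elem2");
        return table;
    }

    // One hash lookup replaces the chain of comparisons; it relies on
    // equals()/hashCode(), so the name need not be interned at all.
    static String dispatch(Map<String, ElementHandler> table, String name) {
        ElementHandler handler = table.get(name);
        return handler == null ? "unhandled " + name : handler.handleElement();
    }

    public static void main(String[] args) {
        Map<String, ElementHandler> table = buildTable();
        System.out.println(dispatch(table, new String("elem2")));  // handled elem2
        System.out.println(dispatch(table, "other"));              // unhandled other
    }
}
```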

> > But I'd much prefer it if the SAX API didn't expose any
> > interning behaviour at all. I think we agree on that?
>
> I think we're *arguing* about that... I don't detect 
> agreement yet. -T.

Err ... sorry, misunderstanding. My question was directed at
David B (who I _think_ does agree with me on that).

-- 
Miles Sabin                       Cromwell Media
Internet Systems Architect        5/6 Glenthorne Mews
+44 (0)20 8817 4030               London, W6 0LJ, England
msabin at cromwellmedia.com          http://www.cromwellmedia.com/

xml-dev: A list for W3C XML Developers. To post, mailto:xml-dev at ic.ac.uk
Archived as: http://www.lists.ic.ac.uk/hypermail/xml-dev/ or CD-ROM/ISBN 981-02-3594-1
Please note: New list subscriptions now closed in preparation for transfer to OASIS.