String interning (WAS: SAX2/Java: Towards a final form)
Miles Sabin
msabin at cromwellmedia.co.uk
Fri Jan 14 16:40:23 GMT 2000
Tim Bray wrote,
> Miles Sabin wrote:
> > Anyhow, maybe the waters are getting a bit muddied. I'm
> > assuming that all parsers will do interning of one sort or
> > another internally. The issue for me is how much of that
> > gets exposed via the SAX API. I don't want java-interning
> > exposed, because that means my parser has no option but to
> > use String.intern().
>
> Yes. Given that *every* credible parser does this,
No argument here (assuming that my 'all parsers' => your 'every
credible parser').
> ... it's a major convenience for programmers using the
> API to be able to compare strings with ==, there is at some
> level an argument that we ought to expose this fact.
>
> I'd go further; based on having written a parser, it seems to
> me that the only sane tactic is for the parser to use
> java.intern(), but only once for each unique name, with some
> sort of internal char[] or equivalent table. If this is
> true, it's an even stronger argument for just saying "element
> types and attribute names coming out of the parser are
> intern()ed, period".
OK, this is David M's position.
Sure, there's a case for this. But there's a case against too.
There are at least two scenarios in which this would be a
burden.
One is where SAX isn't sitting on top of a parser (this is
Arkin's worry). Instead it's generating SAX events from a DOM
tree, java reflection, or some other data structure, a JDBC
query perhaps.
Unlike a parser, these event sources deliver Strings directly,
so if there were no requirement to String.intern() they could
simply pass Strings straight through the ContentHandler API. A
requirement that SAX return String.intern()'d Strings rules
that out tho', because none of DOM, reflection, or JDBC make
any guarantees that the Strings they return are interned. The
cost of interning (whether via a direct call on String.intern()
or via a David M style lookup against a table of interned
Strings) would be a significant additional overhead.
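To make the identity point concrete, here is a minimal sketch. The
class and method names are made up for illustration, and the
run-time concatenation merely stands in for a String handed back by
a DOM, reflection or JDBC source. Such a String is equal to the
corresponding literal but is not the same object until it has been
interned.

```java
public class InternIdentityDemo {
    // Hypothetical stand-in for a non-parser event source: it builds
    // the name at run time, so the result is a fresh String object.
    static String buildAtRuntime(String a, String b) {
        return new StringBuilder(a).append(b).toString();
    }

    public static void main(String[] args) {
        String literal = "elem1";                    // literals are interned by the JVM
        String built = buildAtRuntime("elem", "1");  // same characters, new object

        System.out.println(built.equals(literal));      // true
        System.out.println(built == literal);           // false
        System.out.println(built.intern() == literal);  // true
    }
}
```

So an == comparison in a handler is only safe if every event source
pays for that intern() call (or an equivalent table lookup) on every
name it delivers.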
You could argue that these aren't legitimate or central uses
of the SAX API. But if you want to do that you should make it
explicit, because it's likely to be quite a controversial
line.
The other scenario is mine (multiple parsers running over
arbitrary documents in multiple threads) where the global
String.intern() map is a point of contention. I won't bore
everyone with the details again.
Here too you could argue that this isn't a core use for the
SAX API, but again it would be helpful if that argument were
made up front, because I don't think I'm the only one who wants
to use SAX this way: I'd guess there are people who want to use
XML -> HTML transforms via XSL in servlets on heavily loaded
servers ... they could be hit by this problem too.
There seem to be two main points to your argument for
String.intern()ing.
1. Reducing the amount of String object creation in parsers.
I don't think _anybody_ thinks that this isn't important.
The only issue is how best to do it. String.intern() is
one way. An internal parser data structure is another.
2. Allowing ContentHandler implementors to use == instead of
String.equals().
This isn't at all clear-cut.
First, I suspect that at least some of the push for this
comes from the possibility that some people are implementing
handlers as huge case statements,
  if(foo.equals("elem1"))
    // handle elem1
  else if(foo.equals("elem2"))
    // handle elem2
  // repeat many times
  else if(foo.equals("elemn"))
    // handle elemn
Whilst it's certainly true that being able to replace all
the calls on String.equals() with == would significantly
improve performance here (if there were a large number of
cases), it's highly likely that switching to a better
algorithm (chained conditionals are effectively linear
search) would do even better, e.g.,
  ElementHandler handler =
    (ElementHandler)someTable.lookup(foo);
  handler.handleElement();
On the other hand if there are few branches, then, with a
decent JVM, the difference between String.equals and == is
going to be insignificant.
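A minimal sketch of the table-driven alternative, assuming a
java.util.HashMap keyed by element name. ElementHandler here is a
hypothetical one-method interface, and it returns a String only so
that the example has something to print. The lookup goes through
hashCode() and equals(), so it works whether or not the name was
interned.

```java
import java.util.HashMap;
import java.util.Map;

public class DispatchSketch {
    // Hypothetical callback type, standing in for ElementHandler.
    interface ElementHandler {
        String handleElement();
    }

    private final Map<String, ElementHandler> table = new HashMap<>();

    void register(String name, ElementHandler handler) {
        table.put(name, handler);
    }

    // One hash lookup replaces the chain of if/else comparisons.
    String dispatch(String name) {
        ElementHandler handler = table.get(name);
        return handler == null ? "unhandled: " + name : handler.handleElement();
    }

    public static void main(String[] args) {
        DispatchSketch d = new DispatchSketch();
        d.register("elem1", () -> "handled elem1");
        d.register("elem2", () -> "handled elem2");

        // A deliberately non-interned name still finds its handler.
        System.out.println(d.dispatch(new String("elem2")));
        System.out.println(d.dispatch("other"));
    }
}
```

The cost per event stays roughly constant however many element
types are registered, which is the point of preferring it to the
linear chain.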
In any case, with a small addition to the XMLReader
interface it'd be possible for internal parser tables to
support most of the core cases of this programming style
(even if it's poor style). If we added the following method
to XMLReader,
String intern(String toBeInterned)
then a ContentHandler implementor could write code like
this,
  XMLReader r = XMLReaderFactory.createReader();

  // Pre-populate the reader's intern table and get refs
  // to the (per-parser) uniqued Strings
  String ELEM1 = r.intern("elem1");
  String ELEM2 = r.intern("elem2");
  // ...
  String ELEMN = r.intern("elemn");

  r.parseDocument(...);
  r.parseDocument(...);
  // etc.

  // Then in the ContentHandler implementation
  if(foo == ELEM1)
    // handle elem1
  else if(foo == ELEM2)
    // handle elem2
  // Repeat many times
  else if(foo == ELEMN)
    // handle elemn
I think that something like this will get most people most of
what they want.
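For what it's worth, the table behind such an intern() method
needn't be complicated. The following is only a sketch under my own
assumptions: ParserInternTable is a hypothetical name, not part of
SAX, and a real parser would more likely key its table on char[]
ranges to avoid creating the String in the first place. Each
instance has its own map, so nothing is shared between parsers
running in different threads.

```java
import java.util.HashMap;
import java.util.Map;

public class ParserInternTable {
    // One table per parser instance; not shared, so no locking is needed
    // as long as each parser is confined to a single thread.
    private final Map<String, String> table = new HashMap<>();

    // Returns the canonical instance for this table, registering the
    // argument as canonical if no equal String has been seen before.
    public String intern(String toBeInterned) {
        String canonical = table.putIfAbsent(toBeInterned, toBeInterned);
        return canonical == null ? toBeInterned : canonical;
    }

    public static void main(String[] args) {
        ParserInternTable reader = new ParserInternTable();
        String ELEM1 = reader.intern("elem1");

        // A name as the parser might produce it while scanning a document.
        String fromParser = reader.intern(new String("elem1"));
        System.out.println(fromParser == ELEM1);   // true

        // A second parser has its own table, so identity does not carry over.
        ParserInternTable other = new ParserInternTable();
        System.out.println(other.intern(new String("elem1")) == ELEM1);   // false
    }
}
```

The second result is the price of the approach: constants obtained
from one reader are only comparable with == against names delivered
by that same reader.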
To be honest, tho', I don't see any particular reason why the
SAX API should be expected to support this sort of code.
> > But I'd much prefer it if the SAX API didn't expose any
> > interning behaviour at all. I think we agree on that?
>
> I think we're *arguing* about that... I don't detect
> agreement yet. -T.
Err ... sorry, misunderstanding. My question was directed at
David B (who I _think_ does agree with me on that).
--
Miles Sabin Cromwell Media
Internet Systems Architect 5/6 Glenthorne Mews
+44 (0)20 8817 4030 London, W6 0LJ, England
msabin at cromwellmedia.com http://www.cromwellmedia.com/