String interning (WAS: SAX2/Java: Towards a final form)
Tyler Baker
tyler at infinet.com
Mon Jan 17 22:23:50 GMT 2000
Miles Sabin wrote:
> Tim Bray wrote,
> > Miles Sabin wrote:
> > > Anyhow, maybe the waters are getting a bit muddied. I'm
> > > assuming that all parsers will do interning of one sort or
> > > another internally. The issue for me is how much of that
> > > gets exposed via the SAX API. I don't want java-interning
> > > exposed, because that means my parser has no option but to
> > > use String.intern().
> >
> > Yes. Given that *every* credible parser does this,
>
> No argument here (assuming that my 'all parsers' => your 'every
> credible parser').
>
> > ... it's a major convenience for programmers using the
> > API to be able to compare strings with ==, there is at some
> > level an argument that we ought to expose this fact.
> >
> > I'd go further; based on having written a parser, it seems to
> > me that the only sane tactic is for the parser to use
> > java.intern(), but only once for each unique name, with some
> > sort of internal char[] or equivalent table. If this is
> > true, it's an even stronger argument for just saying
> > "element types and attribute names coming out of the
> > parser are intern()ed, period".
>
> OK, this is David M's position.
>
> Sure, there's a case for this. But there's a case against too.
> There are at least two scenarios in which this would be a
> burden.
>
> One is where SAX isn't sitting on top of a parser (this is
> Arkin's worry). Instead it's generating SAX events from a DOM
> tree, java reflection, or some other data structure, a JDBC
> query perhaps.
>
> Unlike a parser, these event sources deliver Strings directly,
> so if there were no requirement to String.intern() they could
> simply pass Strings straight through the ContentHandler API. A
> requirement that SAX return String.intern()'d Strings rules
> that out tho', because none of DOM, reflection, or JDBC make
> any guarantees that the Strings they return are interned. The
> cost of interning (whether via a direct call on String.intern()
> or via a David M style lookup against a table of interned
> Strings) would be a significant additional overhead.
>
> You could argue that these aren't legitimate or central uses
> of the SAX API. But if you want to do that you should make it
> explicit, because it's likely to be quite a controversial
> line.
In a DOM package I wrote, I faced exactly this problem whenever a user generated a DOM
document tree programmatically. If they built the tree from a file, all names would be
interned anyway, since the parser presents the DOM document with only interned names.
The way around this problem is somewhat complex, but it is doable. What you need is a
String table internal to the document. Whenever someone invokes:

    Document.createElement(String name);

you just replace the argument String with the interned String from that table. The other
alternative, though a little more expensive in some cases (such as multi-threaded
situations), would be to just call String.intern() every time the user invokes:

    Document.createElement(String name);

I have not looked at many of the popular DOM packages lately, but I am sure they have
found a similar workaround as well.
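
A minimal sketch of the document-local table described above. NameTable and its methods
are hypothetical names of my own, not part of DOM or SAX, and a real createElement()
would call canonical() on its argument before storing the name. The sketch is not
thread-safe; it assumes each document is built from a single thread.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper, not part of any DOM API: a per-document name table.
final class NameTable {
    // Maps each name to its canonical (interned) instance.
    private final Map<String, String> names = new HashMap<>();

    // Returns the canonical instance for the given name. String.intern()
    // is called only the first time this table sees a particular name.
    String canonical(String name) {
        String known = names.get(name);
        if (known == null) {
            known = name.intern();
            names.put(known, known);
        }
        return known;
    }

    // Number of distinct names seen so far.
    int size() {
        return names.size();
    }

    public static void main(String[] args) {
        NameTable table = new NameTable();
        // new String(...) forces distinct, non-interned instances,
        // as a caller building a tree programmatically might supply.
        String first = table.canonical(new String("para"));
        String second = table.canonical(new String("para"));
        System.out.println(first == second); // both are the canonical instance
        System.out.println(table.size());
    }
}
```

The table lookup costs one hash probe per call, while String.intern() is only paid once
per unique name in the document.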
> There seem to be two main points to your argument for
> String.intern()ing.
>
> 1. Reducing the amount of String object creation in parsers.
>
> I don't think _anybody_ thinks that this isn't important.
> The only issue is how best to do it. String.intern() is
> one way. An internal parser data structure is another.
Most parsers do both. You don't need to Java-intern your Strings to reduce String object
allocation; an internal table handles that on its own. Java interning by itself has
nothing to do with decreasing object allocation anyway.
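
To illustrate the distinction, here is a rough sketch of the kind of internal table a
parser might keep. CharSymbolTable and everything in it are hypothetical, not taken from
any particular parser. The allocation saving comes from matching names directly against
the character buffer, so a String is only created the first time a name is seen. The
intern() call is a separate decision that only affects which instance gets handed out.

```java
// Hypothetical parser-internal symbol table keyed on a char[] range.
final class CharSymbolTable {
    private static final class Entry {
        final char[] chars;
        final String symbol;
        final Entry next;

        Entry(char[] chars, String symbol, Entry next) {
            this.chars = chars;
            this.symbol = symbol;
            this.next = next;
        }
    }

    private final Entry[] buckets = new Entry[256];
    private int created = 0;

    // Returns the String for buf[offset .. offset+length), allocating a new
    // String only when this character sequence has not been seen before.
    String lookup(char[] buf, int offset, int length) {
        int hash = 0;
        for (int i = 0; i < length; i++) {
            hash = 31 * hash + buf[offset + i];
        }
        int index = (hash & 0x7fffffff) % buckets.length;
        for (Entry e = buckets[index]; e != null; e = e.next) {
            if (matches(e.chars, buf, offset, length)) {
                return e.symbol; // hit: no allocation at all
            }
        }
        // Miss: create the String once. Dropping .intern() here would leave
        // the allocation saving unchanged.
        String symbol = new String(buf, offset, length).intern();
        created++;
        buckets[index] = new Entry(symbol.toCharArray(), symbol, buckets[index]);
        return symbol;
    }

    private static boolean matches(char[] stored, char[] buf, int offset, int length) {
        if (stored.length != length) {
            return false;
        }
        for (int i = 0; i < length; i++) {
            if (stored[i] != buf[offset + i]) {
                return false;
            }
        }
        return true;
    }

    // Number of Strings this table has created.
    int stringsCreated() {
        return created;
    }

    public static void main(String[] args) {
        char[] buf = "<item><item>".toCharArray();
        CharSymbolTable table = new CharSymbolTable();
        String first = table.lookup(buf, 1, 4);
        String second = table.lookup(buf, 7, 4);
        System.out.println(first == second);
        System.out.println(table.stringsCreated());
    }
}
```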
Tyler
xml-dev: A list for W3C XML Developers. To post, mailto:xml-dev at ic.ac.uk
Archived as: http://www.lists.ic.ac.uk/hypermail/xml-dev/ or CD-ROM/ISBN 981-02-3594-1
Please note: New list subscriptions now closed in preparation for transfer to OASIS.