String interning (WAS: SAX2/Java: Towards a final form)

Wed Jan 12 19:47:45 GMT 2000

Miles Sabin wrote:

> I think we need to clarify a couple of ambiguities here.
> There are two sorts of interning being talked about wrt SAX2:
> the standard java interning performed by String.intern(); and
> other, parser specific, mechanisms for ensuring that Strings
> which are String.equal() are also ==. I'll call the former
> java-interning and the latter app-interning.
>
> I have a strong objection to SAX requiring that Strings
> returned from it's methods be java-interned. I'm not bothered
> about requiring app-interning so long as the guarantees are
> weakened a little.
>
> First, the problem with java-interning. The way this is
> implemented (in all the JVM's I've seen the sources of) is
> via a hash-lookup of the pre-interned String in a JVM-internal
> table. Because this table is shared by all threads in a JVM
> this lookup has to be synchronized. The upshot is that there
> is a huge potential for lock-contention where many threads
> are interning simultaneously. This is bad enough on a single
> processor machine, but could seriously clobber performance on
> a multi-processor box. I, for one, want to use multiple SAX
> parser instances driven from multiple threads on SMP machines,
> and I'd be a tad distressed if java-interning were a SAX
> requirement.

This can be true in theory but in reality no one implements their parsers to make a call
to String.intern() every time they encounter a new String. Most parsers including the one
I wrote a long time ago, simply call String.intern() once for each new instance of a Name.
This interned string is then cached in a hashtable using a hashcode of the character
sequence of the string so that whenever you encounter another name you can quickly
calculate the hashcode of the Name and see if a matching interned string is in your
interned string table.

> David Megginson has mentioned a way of reducing the overhead of
> java-interning: here we have a parser-internal map from
> character sequences onto java-interned Strings ... if when you
> lookup on the char sequence you get non-null String back then
> that's the java-interned result; otherwise you convert the char
> sequence to a String, java-intern it and enter it in the table.
>
> Whilst this might improve things a bit, it's still a
> performance hit: if the parser internal map is shared between
> parsers then we have the same contention problem back again
> (tho' this time in application code rather than the JVM); if it
> isn't (and hence is parser-/thread-local), then it has to be

Having a shared map of interned strings among different parser instances is kinda dumb
anyways as synchronization would be necessary for each lookup. You might save a little
memory, but at the cost of performance of course.

> repopulated at least for each new parser instance, probably for
> each new document. Even tho' this only requires one java-intern
> for each distinct name it still provides plenty of
> opportunities for synchronization collisions.

Nope. Names in XML are highly redundant especially for Namespace prefixes. Also, even if
the number of calls to String.intern() were significant (which they rarely if ever are),
modern Java runtimes have lowered synchronization overhead to be small enough that you
don't really have to think about it much in terms of impacting performance anymore.

> App-interning could be fine tho' ... so long as it's defined in
> such a way that it can be implemented in a completely thread-
> local way. Doing that means we'd have to,
>
> 1. Weaken the guarantees on the equivalence of String.equals()
>    and ==.
>
>    To avoid synchronization issues we'd have to say that
>    app-interning is done relative to a given parser call,
>
>    ie. where foo and bar are both obtained via a callbacks from
>    the same call on XMLReader.parse()
>
>      foo.equals(bar) iff foo == bar
>
>    but if foo and bar are not both obtained via callbacks from
>    the same call on XMLReader.parse()
>
>      foo.equals(bar) does not imply foo == bar

The entire purpose of testing for identity instead of equality is you don't need to invoke
any method whatsoever in your case statements. All you need to do is test for object
identity which is the fastest operation in Java next to integer comparisons.

> 2. Adopt something like Lars proposal of a StringInterner
>    interface.
>
>    We'd need this to allow a SAX client to app-intern any
>    literal Strings it wants to == test against in it's
>    handlers.

This would force my application code to use some special interface to populate my
constants rather than simply writing code that goes like:

if (name == "foo") {
 // Do something
}

> This should get us what most people want: fast equality
> comparisons and shared representation within the implementation
> of a ContentHandler, but without any need for synchronization.

Actually this method would likely be slower than just a plain string comparison using the
equals method, unless of course your internal parser string map is global and ends up in
effect just being a redundant form of the JVM string map. Your global string map would
likely have to be synchronized as well.

> One point to bear in mind: none of the foregoing would
> _prevent_ a SAX implementor from using java-interning if they
> wanted to.

Yes, but it would make it so that using Java-interning would be a pointless exercise at
that point because unless you maintain a map of interned strings in the actual parser,
using String.intern() does not help you at all. Unless the application is presented with
an already interned string for each Name, then using String.intern() will just be more
overhead and trouble than not using it at all.

Tyler

xml-dev: A list for W3C XML Developers. To post, mailto:xml-dev at ic.ac.uk
Archived as: http://www.lists.ic.ac.uk/hypermail/xml-dev/ or CD-ROM/ISBN 981-02-3594-1
Please note: New list subscriptions now closed in preparation for transfer to OASIS.