String interning (WAS: SAX2/Java: Towards a final form)

Miles Sabin msabin at cromwellmedia.co.uk
Wed Jan 12 18:04:37 GMT 2000


I think we need to clarify a couple of ambiguities here.
There are two sorts of interning being talked about wrt SAX2:
the standard java interning performed by String.intern(); and
other, parser specific, mechanisms for ensuring that Strings
which are String.equal() are also ==. I'll call the former
java-interning and the latter app-interning.

I have a strong objection to SAX requiring that Strings
returned from it's methods be java-interned. I'm not bothered 
about requiring app-interning so long as the guarantees are 
weakened a little.

First, the problem with java-interning. The way this is
implemented (in all the JVM's I've seen the sources of) is
via a hash-lookup of the pre-interned String in a JVM-internal
table. Because this table is shared by all threads in a JVM
this lookup has to be synchronized. The upshot is that there
is a huge potential for lock-contention where many threads
are interning simultaneously. This is bad enough on a single
processor machine, but could seriously clobber performance on
a multi-processor box. I, for one, want to use multiple SAX
parser instances driven from multiple threads on SMP machines,
and I'd be a tad distressed if java-interning were a SAX
requirement.

David Megginson has mentioned a way of reducing the overhead of
java-interning: here we have a parser-internal map from 
character sequences onto java-interned Strings ... if when you 
lookup on the char sequence you get non-null String back then 
that's the java-interned result; otherwise you convert the char 
sequence to a String, java-intern it and enter it in the table.

Whilst this might improve things a bit, it's still a 
performance hit: if the parser internal map is shared between 
parsers then we have the same contention problem back again 
(tho' this time in application code rather than the JVM); if it 
isn't (and hence is parser-/thread-local), then it has to be 
repopulated at least for each new parser instance, probably for 
each new document. Even tho' this only requires one java-intern 
for each distinct name it still provides plenty of 
opportunities for synchronization collisions.

App-interning could be fine tho' ... so long as it's defined in 
such a way that it can be implemented in a completely thread-
local way. Doing that means we'd have to,

1. Weaken the guarantees on the equivalence of String.equals() 
   and ==.

   To avoid synchronization issues we'd have to say that
   app-interning is done relative to a given parser call,

   ie. where foo and bar are both obtained via a callbacks from 
   the same call on XMLReader.parse()

     foo.equals(bar) iff foo == bar

   but if foo and bar are not both obtained via callbacks from 
   the same call on XMLReader.parse()

     foo.equals(bar) does not imply foo == bar

2. Adopt something like Lars proposal of a StringInterner 
   interface.

   We'd need this to allow a SAX client to app-intern any
   literal Strings it wants to == test against in it's
   handlers.

This should get us what most people want: fast equality 
comparisons and shared representation within the implementation 
of a ContentHandler, but without any need for synchronization.

One point to bear in mind: none of the foregoing would 
_prevent_ a SAX implementor from using java-interning if they 
wanted to.

Cheers,


Miles

-- 
Miles Sabin                       Cromwell Media
Internet Systems Architect        5/6 Glenthorne Mews
+44 (0)20 8817 4030               London, W6 0LJ, England
msabin at cromwellmedia.com          http://www.cromwellmedia.com/



xml-dev: A list for W3C XML Developers. To post, mailto:xml-dev at ic.ac.uk
Archived as: http://www.lists.ic.ac.uk/hypermail/xml-dev/ or CD-ROM/ISBN 981-02-3594-1
Please note: New list subscriptions now closed in preparation for transfer to OASIS.





More information about the Xml-dev mailing list