Slowness of JDK 1.1.x String.intern() [was Re: SAX, Java, and Namespaces ]

Fri Feb 12 08:49:19 GMT 1999

David Brownell wrote:

> Tim Bray wrote:
> >
> > At 10:12 AM 2/5/99 -0800, Jeff Greif wrote:
> > >JDK 1.1.7 intern is native, but is slow because it first converts the
> > >characters in the string [to a canonical form]
>
> No comment ... that's not my code ... ;-)
>
> > Actually, the real reason that most XML parsers will *never* use
> > built-in intern is because they probably have the name available in a
> > character array, and can go look things up in the handcrafted
> > table without String-i-fying it - thus skipping several steps
> > of work that a built-in intern is going to have to do.  E.g. Lark's
> > symbol table is a double array, storing both the character-array
> > and String version of each name - you lookup based on the
> > character array and return the string if it's already there.  The
> > point is that you call new String() only once per unique name.
>
> This gives "per-parse" uniqueness, which is valuable to a fair
> degree beyond the performance win of avoiding allocating a new
> string.
>
> However, Sun's package currently goes one step further and actually
> interns that string.  It's such a small cost (on top of the cost
> to check that array-to-string cache in the first place) that it's
> barely measurable.  (Anyone try "java -Xrunhprof:cpu=samples ..." on
> JDK 1.2/SPARC?)

This is what I do in an XML parser as well.  The cost would only be
relatively high if nearly every element in the document had its own
distinct element type.  In the real world that will never happen; instead
you will have lots of repeated element and attribute Names, which can be
cached and interned the first time they are seen.
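
To make the idea concrete, here is a rough sketch of such a cache.  It is not
Lark's or Sun's actual code; the class and method names (NameCache, lookup,
stringsCreated) are made up for illustration, and the fixed table size is an
arbitrary assumption.  It looks a Name up directly from the parser's character
buffer and only calls new String() and intern() the first time a Name is seen.

```java
/** Hypothetical per-parse Name cache keyed on the raw characters. */
final class NameCache {
    /** One chained hash entry: the characters plus the String built from them. */
    private static final class Entry {
        final char[] chars;
        final String name;
        final Entry next;
        Entry(char[] chars, String name, Entry next) {
            this.chars = chars;
            this.name = name;
            this.next = next;
        }
    }

    private final Entry[] table = new Entry[256]; // arbitrary size for the sketch
    private int created = 0;

    /** Returns the cached String for buf[off..off+len), creating it only once. */
    String lookup(char[] buf, int off, int len) {
        int h = 0;
        for (int i = 0; i < len; i++) h = 31 * h + buf[off + i];
        int idx = (h & 0x7fffffff) % table.length;

        for (Entry e = table[idx]; e != null; e = e.next) {
            if (matches(e.chars, buf, off, len)) return e.name; // no allocation
        }

        // First sighting: pay for new String() and intern() exactly once.
        String name = new String(buf, off, len).intern();
        created++;
        char[] copy = new char[len];
        System.arraycopy(buf, off, copy, 0, len);
        table[idx] = new Entry(copy, name, table[idx]);
        return name;
    }

    private static boolean matches(char[] stored, char[] buf, int off, int len) {
        if (stored.length != len) return false;
        for (int i = 0; i < len; i++) {
            if (stored[i] != buf[off + i]) return false;
        }
        return true;
    }

    /** Number of Strings allocated so far, i.e. the number of unique Names. */
    int stringsCreated() { return created; }

    public static void main(String[] args) {
        NameCache cache = new NameCache();
        char[] buf = "<para><para>".toCharArray();
        String first = cache.lookup(buf, 1, 4);
        String second = cache.lookup(buf, 7, 4);
        System.out.println(first == second);        // same instance both times
        System.out.println(cache.stringsCreated()); // one String for two lookups
    }
}
```

The intern() call is the only part that goes beyond "per-parse" uniqueness;
drop it and you still get one String per unique Name within a single parse.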

> That provides "per-VM" uniqueness which has turned out to be handy
> for things like stylesheet processing -- comparing strings in the
> stylesheet and source document is quite fast, and that does add
> up to a performance difference in template matching.

This is very true.  Some DOM implementations such as Docuverse's also do
this for the DOM tree.  You have a relatively low performance cost for
interning Names in a document, but you could possibly get huge benefits when
doing node iteration.  As of JDK 1.1.7 the String.equals() method is now
something of the form:

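public boolean equals(Object o) {
  if (o == this) return true;        // same object: interned Names stop here
  if (!(o instanceof String)) return false;

  String s = (String) o;
  if (s.length() != length()) return false;

  // Otherwise compare character by character
  for (int i = 0; i < length(); i++)
    if (s.charAt(i) != charAt(i)) return false;
  return true;
}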

Actually, I think just about all DOM implementations in Java that I am aware
of intern Names so a call to Node.getNodeName() will always return an
interned string.
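
As a small illustration of what that buys you during node iteration, here is
a sketch that compares Names by identity alone.  The class and method names
(NameMatch, countMatches) are hypothetical, and the whole thing rests on the
assumption that every Name handed to it really was interned; a single
non-interned string will silently fail to match.

```java
/** Hypothetical helper: matches Names by reference, assuming all are interned. */
final class NameMatch {
    /** Counts the entries that are the same interned String as target. */
    static int countMatches(String[] nodeNames, String target) {
        int n = 0;
        for (int i = 0; i < nodeNames.length; i++) {
            // Reference comparison only; no length check, no character loop.
            if (nodeNames[i] == target) n++;
        }
        return n;
    }

    public static void main(String[] args) {
        // Stand-ins for what getNodeName() would return from an interning DOM.
        String[] names = {
            new String("para").intern(),
            new String("title").intern(),
            new String("para").intern()
        };
        // String literals are interned by the language, so "para" is the
        // same object as the interned Names above.
        System.out.println(countMatches(names, "para")); // prints 2
    }
}
```

If you cannot be sure of the source of the strings, String.equals() is still
the safe choice, and with interned Names it returns at the identity check
anyway.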

It would be nice for applications if SAX stated that all Names are presented
to the DocumentHandler interface as interned strings, since Names are nothing
more than symbols anyway and should be treated as such, with of course the
exception of the weirdness of namespace declaration names appearing as
attribute names (e.g. "xmlns:" + some prefix name).

Tyler

xml-dev: A list for W3C XML Developers. To post, mailto:xml-dev at ic.ac.uk
Archived as: http://www.lists.ic.ac.uk/hypermail/xml-dev/ and on CD-ROM/ISBN 981-02-3594-1