String interning (WAS: SAX2/Java: Towards a final form)

Miles Sabin msabin at cromwellmedia.co.uk
Fri Jan 14 19:33:43 GMT 2000


David Megginson wrote,
> I was very concerned about this use case at first, but my 
> concerns lessened a bit once I started to consider 
> implementation details.
>
> If I'm writing a filter, where do the strings for the names 
> I'm passing on come from?

I'd put filters in a slightly different box. Layers of SAX
handlers feeding into XMLReaders feeding into handlers ...
will come out fine, because, as you say, everything is
String.intern()'d from source to sink.

> Iterating over a DOM, on the other hand, is a legitimate 
> problem. Every DOM implementation worth its salt will have 
> interned all element and attribute names (a DOM tree is big 
> enough already), but there's no way to be sure of that in the 
> general case, or to be sure that the names are == the results 
> of java.lang.String.intern().  

Not just a DOM. Quite a few people are sitting SAX on top of
all sorts of data-structures which don't necessarily make any
interning guarantees. And don't forget database queries.
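
To make the concern concrete, here is a minimal sketch of what goes wrong when names come from a source that makes no interning guarantee. The class and the nameFromExternalSource() helper are made up for illustration (standing in for a DOM walk or a database row); only String.intern(), equals() and == are real Java behaviour.

```java
public class InternBoundaryDemo {
    // String literals are interned by the JVM, so this constant is the
    // canonical instance that String.intern() returns for "about".
    static final String ABOUT = "about";

    // Hypothetical stand-in for a name read from a non-interning source.
    // Building the String from a char array yields a fresh object.
    static String nameFromExternalSource() {
        return new String(new char[] {'a', 'b', 'o', 'u', 't'});
    }

    // Identity comparison on the raw name: same characters, different object.
    static boolean identicalWithoutIntern() {
        return nameFromExternalSource() == ABOUT;
    }

    // Interning at the boundary restores the identity guarantee.
    static boolean identicalWithIntern() {
        return nameFromExternalSource().intern() == ABOUT;
    }

    public static void main(String[] args) {
        String raw = nameFromExternalSource();
        System.out.println(raw.equals(ABOUT));     // true
        System.out.println(raw == ABOUT);          // false
        System.out.println(raw.intern() == ABOUT); // true
    }
}
```

So a handler that compares names with == only works if whatever sits at the source end interns every name before passing it on.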

> Too bad the DOM level one Java binding didn't require that.

Hmm ... similar issues. Some people layer DOM implementations
on top of non-DOM data structures, java-reflection and DB
queries.

> > The other scenario is mine (multiple parsers running over
> > arbitrary documents in multiple threads) where the global
> > String.intern() map is a point of contention. I won't bore
> > everyone with the details again.
>
> I'm much more skeptical about this one, because there are so 
> many preconditions:
>
> [snip: 4 conditions]
>
> If all of these conditions arise at the same time (and I 
> question #3 and #4), then perhaps over-all XML parsing might 
> slow down by 1-2%; if the actual XML parsing represents even 
> as much as 30% of the processing time (the rest is taken by 
> whatever the ContentHandler callbacks do with the 
> information), that's a 0.6% slowdown under these
> circumstances.

Those four conditions cover my situation pretty accurately.
You'll just have to swallow (3), but (4) is the single-
processor vs. multi-processor thing. 

> Granted, the potential speedup for other apps probably isn't 
> much greater, but since the vast majority of SAX apps will 
> not meet the above criteria, and since the penalty when one 
> does meet these criteria is so small, it makes sense not to 
> penalize everyone else.

OK, there's not much I can say to that. If I really am doing
something very far out then it'd be unreasonable for you to
twist the API to suit me.

I'm not convinced tho'. This isn't quite my app, but I can 
imagine people wanting to serve HTML generated via XSL from 
heterogeneous XML on a heavily loaded, multi-threaded HTTP server.
It'd be a shame if lock contention issues made it harder for
them to scale up to more users by sticking a couple more
processors in the box.

> If there's any real concern, I think, it's the DOM scenario.

Arkin? Comments?

> > [snip big case statement example]
> > To be honest, tho', I don't see any particular reason why 
> > the SAX API should be expected to support this sort of 
> > code.
>
> How about running in a tight loop?

I doubt that the difference between String.equals() and == would 
be critical even here if the code under the conditionals does
much work. But even if it _is_, adding an interning method
to XMLReader,

  String intern(String toBeInterned);

would do the trick,

  // In startDocument() or outside the ContentHandler
  // altogether

  RDF = r.intern("http://www.w3.org/1999/02/22-rdf-syntax-ns#");
  ABOUT = r.intern("about");
  ID = r.intern("ID");
  ABOUT_EACH = r.intern("aboutEach");

  XHTML = r.intern("http://www.w3.org/1999/xhtml");
  HREF = r.intern("href");
  CLASS = r.intern("class");
  NAME = r.intern("name");


  // In startElement()

  for (int i = 0; i < len; i++) {
    String name = atts.getName(i);
    if (atts.getURI(i) == RDF) {
      if (name == ABOUT) {
        // do something
      } else if (name == ID) {
        // do something
      } else if (name == ABOUT_EACH) {
        // do something
      }
    } else if (atts.getURI(i) == XHTML) {
      if (name == HREF) {
        // do something
      } else if (name == CLASS) {
        // do something
      } else if (name == NAME) {
        // do something
      }
    }
  }
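
For what it's worth, here is one way a reader might back such an intern() method without touching the global String.intern() map. This is only a sketch under assumptions: LocalInternTable is a hypothetical class, not part of SAX, and it presumes one table per parser instance, used from a single thread, so no shared lock is involved.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical per-reader intern table. Each parser would own one
// instance, so lookups never contend with other parser threads.
// Deliberately not thread-safe: it is meant to be confined to one parser.
public class LocalInternTable {
    private final Map<String, String> table = new HashMap<>();

    // Returns the canonical instance for s within this table only.
    public String intern(String s) {
        String canonical = table.get(s);
        if (canonical == null) {
            table.put(s, s);
            canonical = s;
        }
        return canonical;
    }

    public static void main(String[] args) {
        LocalInternTable r = new LocalInternTable();

        // Handler side: obtain the constant through the reader's table.
        String about = r.intern("about");

        // Parser side: a name freshly built from input characters,
        // canonicalised through the same table before being reported.
        String parsed = r.intern(new String(new char[] {'a', 'b', 'o', 'u', 't'}));

        System.out.println(parsed == about); // true
    }
}
```

The catch is that the == guarantee then holds only between strings obtained from the same reader, which is why the constants above are fetched via r.intern() rather than String.intern().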

Cheers,


Miles

-- 
Miles Sabin                       Cromwell Media
Internet Systems Architect        5/6 Glenthorne Mews
+44 (0)20 8817 4030               London, W6 0LJ, England
msabin at cromwellmedia.com          http://www.cromwellmedia.com/

xml-dev: A list for W3C XML Developers. To post, mailto:xml-dev at ic.ac.uk
Archived as: http://www.lists.ic.ac.uk/hypermail/xml-dev/ or CD-ROM/ISBN 981-02-3594-1
Please note: New list subscriptions now closed in preparation for transfer to OASIS.




