String interning (WAS: SAX2/Java: Towards a final form)
Miles Sabin
msabin at cromwellmedia.co.uk
Fri Jan 14 19:33:43 GMT 2000
David Megginson wrote,
> I was very concerned about this use case at first, but my
> concerns lessened a bit once I started to consider
> implementation details.
>
> If I'm writing a filter, where do the strings for the names
> I'm passing on come from?
I'd put filters in a slightly different box. Layers of SAX
handlers feeding into XMLReaders feeding into handlers ...
will come out fine, because, as you say, everything is
String.intern()'d from source to sink.
> Iterating over a DOM, on the other hand, is a legitimate
> problem. Every DOM implementation worth its salt will have
> interned all element and attribute names (a DOM tree is big
> enough already), but there's no way to be sure of that in the
> general case, or to be sure that the names are == the results
> of java.lang.String.intern().
Not just a DOM. Quite a few people are sitting SAX on top of
all sorts of data-structures which don't necessarily make any
interning guarantees. And don't forget database queries.
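A minimal sketch of why such sources matter, assuming a hypothetical ColumnNameSource that builds names at run time the way a database or reflection layer might. The class and method names are illustrative only and are not part of SAX. Names built this way are equals() to the expected constants but not ==, unless the source interns them before passing them on.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for a SAX-style source sitting on a non-XML data structure. */
final class ColumnNameSource {
    /** Builds a name at run time; StringBuilder.toString() returns a fresh, non-interned String. */
    static String buildName(String prefix, int index) {
        return new StringBuilder(prefix).append(index).toString();
    }

    /** Returns count names, optionally interned before they are handed on. */
    static List<String> names(int count, boolean intern) {
        List<String> names = new ArrayList<>();
        for (int i = 0; i < count; i++) {
            String name = buildName("col", i);
            names.add(intern ? name.intern() : name);
        }
        return names;
    }

    public static void main(String[] args) {
        String raw = names(1, false).get(0);
        String interned = names(1, true).get(0);
        System.out.println(raw.equals("col0"));  // true: same characters
        System.out.println(raw == "col0");       // false: different object from the literal
        System.out.println(interned == "col0");  // true: both come from the global intern table
    }
}
```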
> Too bad the DOM level one Java binding didn't require that.
Hmm ... similar issues. Some people layer DOM implementations
on top of non-DOM data structures, java-reflection and DB
queries.
> > The other scenario is mine (multiple parsers running over
> > arbitrary documents in multiple threads) where the global
> > String.intern() map is a point of contention. I won't bore
> > everyone with the details again.
>
> I'm much more skeptical about this one, because there are so
> many preconditions:
>
> [snip: 4 conditions]
>
> If all of these conditions arise at the same time (and I
> question #3 and #4), then perhaps over-all XML parsing might
> slow down by 1-2%; if the actual XML parsing represents even
> as much as 30% of the processing time (the rest is taken by
> whatever the ContentHandler callbacks do with the
> information), that's a 0.6% slowdown under these
> circumstances.
Those four conditions cover my situation pretty accurately.
You'll just have to swallow (3), but (4) is the single-
processor vs. multi-processor thing.
> Granted, the potential speedup for other apps probably isn't
> much greater, but since the vast majority of SAX apps will
> not meet the above criteria, and since the penalty when one
> does meet these criteria is so small, it makes sense not to
> penalize everyone else.
OK, there's not much I can say to that. If I really am doing
something very far out then it'd be unreasonable for you to
twist the API to suit me.
I'm not convinced tho'. This isn't quite my app, but I can
imagine people wanting to serve HTML generated via XSL from
heterogeneous XML on a heavily loaded, multi-threaded HTTP server.
It'd be a shame if lock contention issues made it harder for
them to scale up to more users by sticking a couple more
processors in the box.
> If there's any real concern, I think, it's the DOM scenario.
Arkin? Comments?
> > [snip big case statement example]
> > To be honest, tho', I don't see any particular reason why
> > the SAX API should be expected to support this sort of
> > code.
>
> How about running in a tight loop?
I doubt that the difference between String.equals() and == would
be critical even here if the code under the conditionals does
much work. But even if it _is_, adding an interning method
to XMLReader,
    String intern(String toBeInterned);
would do the trick,
    // In startDocument() or outside the ContentHandler
    // altogether
    RDF = r.intern("http://www.w3.org/1999/02/22-rdf-syntax-ns#");
    ABOUT = r.intern("about");
    ID = r.intern("ID");
    ABOUT_EACH = r.intern("aboutEach");
    XHTML = r.intern("http://www.w3.org/1999/xhtml");
    HREF = r.intern("href");
    CLASS = r.intern("class");
    NAME = r.intern("name");

    // In startElement()
    for (int i = 0; i < len; i++) {
      String name = atts.getName(i);
      if (atts.getURI(i) == RDF) {
        if (name == ABOUT) {
          // do something
        } else if (name == ID) {
          // do something
        } else if (name == ABOUT_EACH) {
          // do something
        }
      } else if (atts.getURI(i) == XHTML) {
        if (name == HREF) {
          // do something
        } else if (name == CLASS) {
          // do something
        } else if (name == NAME) {
          // do something
        }
      }
    }
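One way a reader might back such an intern() method is a private table per reader, so that no parser thread touches the global String.intern() map. The sketch below is an assumption, not part of SAX: LocalInterner is a hypothetical name, and the table is deliberately unsynchronized on the premise that each reader is driven by a single thread.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical per-reader name table; each reader would own exactly one. */
final class LocalInterner {
    // Maps each distinct string to the one canonical instance for this table.
    private final Map<String, String> table = new HashMap<>();

    /** Returns the canonical instance equal to s, registering s if none exists yet. */
    String intern(String s) {
        String canonical = table.putIfAbsent(s, s);
        return canonical != null ? canonical : s;
    }

    public static void main(String[] args) {
        LocalInterner r = new LocalInterner();
        String about = r.intern("about");
        // new String(...) simulates a name freshly built while parsing.
        String parsed = r.intern(new String("about"));
        System.out.println(about == parsed); // true: same table, same instance
    }
}
```

With a table like this, names are only == to constants obtained from the same reader, which is why the constants are set up through r.intern() rather than String.intern().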
Cheers,
Miles
--
Miles Sabin Cromwell Media
Internet Systems Architect 5/6 Glenthorne Mews
+44 (0)20 8817 4030 London, W6 0LJ, England
msabin at cromwellmedia.com http://www.cromwellmedia.com/
xml-dev: A list for W3C XML Developers. To post, mailto:xml-dev at ic.ac.uk
Archived as: http://www.lists.ic.ac.uk/hypermail/xml-dev/ or CD-ROM/ISBN 981-02-3594-1
Please note: New list subscriptions now closed in preparation for transfer to OASIS.