String interning (WAS: SAX2/Java: Towards a final form)
David Megginson
david at megginson.com
Fri Jan 14 18:33:49 GMT 2000
Miles Sabin <msabin at cromwellmedia.co.uk> writes:
> One is where SAX isn't sitting on top of a parser (this is Arkin's
> worry). Instead it's generating SAX events from a DOM tree, java
> reflection, or some other data structure, a JDBC query perhaps.
I was very concerned about this use case at first, but my concerns
lessened a bit once I started to consider implementation details.
If I'm writing a filter, where do the strings for the names I'm
passing on come from?
Well, most of the time, they'll come from upstream in the filter
chain, so they're already interned and I don't have to worry about
it.
Now, let's say that instead I want to introduce my own names, so that
(for example) I rename every "foo" element to "bar". Good news! The
string literal that I use in my code for "bar" is already interned
automatically, so there's nothing to worry about.
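
A minimal sketch of that point, assuming nothing beyond the standard library: string literals are interned by the Java VM, so an identity comparison against them is safe. The class and method names here are hypothetical, chosen only for illustration.

```java
// Sketch: compile-time string literals are interned automatically.
class LiteralInternDemo {
    // Hypothetical filter step: rename every "foo" element to "bar".
    // The == test is valid only if incoming names are already interned.
    static String rename(String name) {
        return name == "foo" ? "bar" : name;
    }

    // A literal is the same object as its own interned form.
    static boolean literalIsInterned() {
        String bar = "bar";
        return bar == bar.intern();
    }

    public static void main(String[] args) {
        System.out.println(literalIsInterned());
        System.out.println(rename("foo") == "bar");
    }
}
```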
The only problem comes if my filter is reading names dynamically from
an external source, like a database or a non-XML text file, and
introducing them into the filter stream: in that case, the filter
would be required to invoke some kind of interning function for all of
the names.
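
To illustrate the dynamic case, here is a rough sketch in which a name is assembled at run time (standing in for a value read from a database or text file). The helper readNameFromExternalSource() is hypothetical; only String.intern() is the real API.

```java
// Sketch: names built at run time are not interned until intern() is called.
class DynamicNameDemo {
    // Hypothetical stand-in for a name read from a database or text file.
    static String readNameFromExternalSource() {
        return new StringBuilder("sal").append("ary").toString();
    }

    // A freshly built string is a distinct object from the literal.
    static boolean identicalBeforeIntern() {
        return readNameFromExternalSource() == "salary";
    }

    // After intern(), the name is the same object as the literal.
    static boolean identicalAfterIntern() {
        return readNameFromExternalSource().intern() == "salary";
    }

    public static void main(String[] args) {
        System.out.println(identicalBeforeIntern());
        System.out.println(identicalAfterIntern());
    }
}
```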
Note that this applies only when element or attribute *names* are
being read from the external source, not when attribute values or
character data content are. For example, imagine that I have some
database tables that I'm always going to dump into the same XML
structure:
  <employee id="E12345">
    <name>David Megginson</name>
    <position>Grand Poohbah</position>
    <salary>Underpaid</salary>
  </employee>
There's no problem with interning here, because the string literals
that my filter uses for "employee", "name", "position", and "salary"
are already interned by the Java VM.
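
As a sketch of such a generator, the code below emits the employee structure as SAX events. It assumes the released SAX2 interfaces that ship with the JDK (org.xml.sax.ContentHandler and org.xml.sax.helpers.AttributesImpl), whose signatures may differ from the draft under discussion; the class and method names are hypothetical. Only the data values vary at run time, so every name comes from a literal.

```java
import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;

// Sketch: element and attribute names are literals; only values are dynamic.
class EmployeeEventsDemo {
    static void emitEmployee(ContentHandler out, String id, String name,
                             String position, String salary) throws SAXException {
        AttributesImpl atts = new AttributesImpl();
        atts.addAttribute("", "id", "id", "CDATA", id);
        out.startElement("", "employee", "employee", atts);
        emitField(out, "name", name);
        emitField(out, "position", position);
        emitField(out, "salary", salary);
        out.endElement("", "employee", "employee");
    }

    static void emitField(ContentHandler out, String element, String text)
            throws SAXException {
        out.startElement("", element, element, new AttributesImpl());
        out.characters(text.toCharArray(), 0, text.length());
        out.endElement("", element, element);
    }

    // Counts how many element names arrive already interned.
    static int countInternedNames() {
        final int[] interned = {0};
        ContentHandler probe = new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName,
                                     String qName, Attributes atts) {
                if (localName == localName.intern()) interned[0]++;
            }
        };
        try {
            // The values are deliberately non-literal strings.
            emitEmployee(probe, new String("E12345"), new String("David Megginson"),
                         new String("Grand Poohbah"), new String("Underpaid"));
        } catch (SAXException e) {
            throw new IllegalStateException(e);
        }
        return interned[0];
    }

    public static void main(String[] args) {
        System.out.println(countInternedNames());
    }
}
```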
Iterating over a DOM, on the other hand, is a legitimate problem.
Every DOM implementation worth its salt will have interned all element
and attribute names (a DOM tree is big enough already), but there's no
way to be sure of that in the general case, or to be sure that the
names are == the results of java.lang.String.intern(). Too bad the
DOM level one Java binding didn't require that.
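
One possible workaround, sketched here with the JDK's bundled DOM parser (an assumption; any DOM implementation would do), is for the tree walker to call intern() on each name itself before passing it on. The class and method names are hypothetical.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Sketch: intern element names defensively while walking a DOM tree.
class DomNameInternDemo {
    static void collectElementNames(Node node, List<String> names) {
        if (node.getNodeType() == Node.ELEMENT_NODE) {
            // The DOM does not promise interned names, so intern here.
            names.add(node.getNodeName().intern());
        }
        for (Node child = node.getFirstChild(); child != null;
             child = child.getNextSibling()) {
            collectElementNames(child, names);
        }
    }

    static List<String> internedNames(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            List<String> names = new ArrayList<>();
            collectElementNames(doc, names);
            return names;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        List<String> names = internedNames(
                "<employee id=\"E12345\"><name>David Megginson</name></employee>");
        System.out.println(names.get(0) == "employee");
    }
}
```

The cost is one intern() lookup per element visited, which is the overhead the DOM binding could have avoided by requiring interned names in the first place.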
> The other scenario is mine (multiple parsers running over
> arbitrary documents in multiple threads) where the global
> String.intern() map is a point of contention. I won't bore
> everyone with the details again.
I'm much more skeptical about this one, because there are so many
preconditions:
  1. you have to have many SAX parsers running in many threads on the
     same system;
  2. the SAX parsers have to be reused over and over in a
     time-critical environment;
  3. the XML documents being processed have to be extremely
     heterogeneous, or else each parser will have seen most of the
     available names after the first five or ten documents; AND
  4. the rest of the parsing process has to be fast and interning has
     to be slow enough that there's serious contention for the interning
     Hashtable even when each parser is looking up only 20-30 names
     (perhaps fewer) for each parse.
If all of these conditions arise at the same time (and I question #3
and #4), then perhaps overall XML parsing might slow down by 1-2%; if
the actual XML parsing represents even as much as 30% of the
processing time (the rest is taken by whatever the ContentHandler
callbacks do with the information), that's a slowdown of 0.3-0.6% for
the application as a whole under these circumstances.
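
The arithmetic behind that estimate can be spelled out as a small sketch. The 1-2% and 30% figures are the guesses from the paragraph above, not measurements, and the names are hypothetical.

```java
// Sketch: overall slowdown = parsing slowdown x parsing's share of total time.
class SlowdownEstimate {
    // Both arguments and the result are percentages.
    static double overallSlowdownPct(double parseSlowdownPct, double parseSharePct) {
        return parseSlowdownPct * parseSharePct / 100.0;
    }

    public static void main(String[] args) {
        System.out.println(overallSlowdownPct(1.0, 30.0)); // low end of the guess
        System.out.println(overallSlowdownPct(2.0, 30.0)); // high end of the guess
    }
}
```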
Granted, the potential speedup for other apps probably isn't much
greater, but since the vast majority of SAX apps will not meet the
above criteria, and since the penalty when one does meet these
criteria is so small, it makes sense not to penalize everyone else.
If there's any real concern, I think, it's the DOM scenario.
[snip big case statement example]
> To be honest, tho', I don't see any particular reason why the
> SAX API should be expected to support this sort of code.
How about running in a tight loop?
  int len = atts.getLength();
  for (int i = 0; i < len; i++) {
    String name = atts.getName(i);
    // The == comparisons below rely on all names and URIs being interned.
    if (atts.getURI(i) == "http://www.w3.org/1999/02/22-rdf-syntax-ns#") {
      if (name == "about") {
        // do something
      } else if (name == "ID") {
        // do something
      } else if (name == "aboutEach") {
        // do something
      }
    } else if (atts.getURI(i) == "http://www.w3.org/1999/xhtml") {
      if (name == "href") {
        // do something
      } else if (name == "class") {
        // do something
      } else if (name == "name") {
        // do something
      }
    }
  }
All the best,
David
--
David Megginson david at megginson.com
http://www.megginson.com/