SAX2/Java: Towards a final form

Tyler Baker tyler at infinet.com
Tue Jan 11 20:46:37 GMT 2000


Lars Marius Garshol wrote:

> * Stefan Haustein
> |
> | Is "no namespace" reported with a null or empty String (for interned
> | Strings, the equals problem does not exist)?
>
> * David Megginson
> |
> | Empty string sounds like a reasonable suggestion when Namespace
> | processing is being performed; null when it is not (so that bugs
> | in code will show up sooner).
>
> There is a problem with this: SAX filters should be able to compare
> names without knowing whether namespace processing is on or not.
> Allowing parts of names to be null makes this much more complicated,
> since this is a comparison of two three-string tuples. So from a
> filter point of view it would be much better if no part of a name
> could ever be null. (I'm a bit unsure what to do with the raw name
> when there is no original raw name.)
>
> | That's a good question -- should SAX2 require that all names and
> | Namespace URIs be interned (i.e. == to the results of
> | java.lang.String.intern)?
>
> This sounds like it could cause a huge performance gap between
> implementations. I think the MSXML driver and the SAX1 adapter will
> have to intern every name-part string that is passed to them, which I
> assume would be very costly. (The alternative would be breaking
> applications, unless there is a cheaper way.)
>
> Also, many parsers already do their own interning and support for
> SAX2, and these would then require either the solution above or a
> (non-costly) change to the parser itself. This definitely sounds like
> something that is easily forgotten, thus causing incompatibilities.

Very true, but parsers can keep interned strings (the result of java.lang.String.intern())
mapped in their own parser string table. So whenever the parser comes across a new element or
attribute name, for instance, its readName() method (assuming it has one) would:

- Check to see if the character sequence comprises a legal XML Name.
- Generate a hashcode of the XML Name characters.
- Look in your string map using the generated hashcode and character sequence to see if there
is an already stored interned String in it.
- If there is an interned string in the string map that is equal to the XML Name character
sequence, return the interned string.
- If there is no matching interned string, create a new String object using the read
characters, and call intern on it to retrieve an interned string. Store the interned string in
the string map, and return it as well.
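
The lookup steps above can be sketched roughly as follows. This is only a minimal
illustration, not code from any actual parser: the class name NameTable, the method
lookup(), and the fixed bucket count are all assumptions made for the example, and the
XML Name legality check from the first step is left out.

```java
// Hypothetical parser-side string table: maps raw name characters to
// interned Strings without allocating a String on a cache hit.
final class NameTable {
    // One node in a bucket's collision chain.
    private static final class Entry {
        final int hash;
        final String name;   // always the result of String.intern()
        final Entry next;

        Entry(int hash, String name, Entry next) {
            this.hash = hash;
            this.name = name;
            this.next = next;
        }
    }

    // Fixed power-of-two size; a real table would probably resize.
    private final Entry[] buckets = new Entry[256];
    private int size;

    // Returns the interned String for buf[offset .. offset+length).
    String lookup(char[] buf, int offset, int length) {
        // Hash the characters directly (same formula as String.hashCode).
        int h = 0;
        for (int i = 0; i < length; i++) {
            h = 31 * h + buf[offset + i];
        }
        int idx = h & (buckets.length - 1);

        // Look for an already stored interned String with the same characters.
        for (Entry e = buckets[idx]; e != null; e = e.next) {
            if (e.hash == h && matches(e.name, buf, offset, length)) {
                return e.name;
            }
        }

        // Miss: create the String once, intern it, and remember it.
        String interned = new String(buf, offset, length).intern();
        buckets[idx] = new Entry(h, interned, buckets[idx]);
        size++;
        return interned;
    }

    private static boolean matches(String s, char[] buf, int offset, int length) {
        if (s.length() != length) {
            return false;
        }
        for (int i = 0; i < length; i++) {
            if (s.charAt(i) != buf[offset + i]) {
                return false;
            }
        }
        return true;
    }

    int size() {
        return size;
    }
}
```

With a table like this, String.intern() is only called the first time a given name is
seen; every later occurrence costs one hash computation and one character comparison.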

I have found this to be a significant performance enhancement at the application level, as your
case statements using XML names can safely test for identity and not equality (which is much
more expensive, especially in a large case statement).
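
As a rough illustration of the application side, assuming the parser guarantees that
every name it reports is interned: the class ElementDispatcher, the element names, and
the return codes below are invented for the example. Java string literals are
themselves interned, so constants like these can be compared against parser-supplied
names with ==.

```java
// Hypothetical application-side dispatch that relies on interned names.
final class ElementDispatcher {
    // String literals are interned by the JVM, so each constant is the
    // same object that String.intern() returns for those characters.
    static final String TITLE = "title";
    static final String AUTHOR = "author";
    static final String PRICE = "price";

    // Classifies an element name using identity tests only.
    // Correct ONLY if the caller passes an interned String.
    static int classify(String name) {
        if (name == TITLE) {
            return 1;
        }
        if (name == AUTHOR) {
            return 2;
        }
        if (name == PRICE) {
            return 3;
        }
        return 0; // unknown element
    }
}
```

Note the flip side: a name that is equal to "author" but was never interned falls
through to the unknown case, which is exactly the incompatibility risk raised in the
quoted message if a parser or driver forgets to intern.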

For SAX drivers, interning every string could cause big performance problems, but most parsers
now support SAX natively, so drivers for XML parsers that have lackluster performance in the
first place should not be a big concern here.

Tyler


xml-dev: A list for W3C XML Developers. To post, mailto:xml-dev at ic.ac.uk
Archived as: http://www.lists.ic.ac.uk/hypermail/xml-dev/ or CD-ROM/ISBN 981-02-3594-1
Please note: New list subscriptions now closed in preparation for transfer to OASIS.




