XML processing experiments
Tim Bray
tbray at textuality.com
Wed Nov 5 08:36:07 GMT 1997
First off, thanks to James for some very thought-provoking work.
At 07:03 PM 04/11/97 +0700, James Clark wrote:
>If all you want to do is be able to
>correctly parse well-formed XML, and you don't care about detecting
>whether or not it is well-formed, how much code does it take and is it
>significantly faster than using an XML parser ...
>Lark: 10.5 seconds .. MSXML: 24 seconds .. nsgmlsu: 8 seconds .. sgcount: 11 seconds ..
>xmlec (C): 0.5 seconds .. xmlec (Java): 1.5 seconds.
[BTW, when I got Lark to run "almost as fast as SP", I decided that
was qualitatively fast enough for now].
>I was quite surprised that there was such a big performance difference
No kidding.
Discussions here are a bit dangerous, since in the Java domain we are
operating somewhat in the dark; we don't have profiling tools
with really good granularity. This is my excuse for engaging in
performance analysis based on intuition, something for which I have
personally fried more than one junior programmer.
Let's look at James' code consuming a "-quoted literal, where the characters
are in the byte array buf[], with start and end being integer indices into it:
case (byte)'"':
{
for (++start; start != end; ++start) {
if (buf[start] == (byte)'"') {
nextTokenIndex = start + 1;
return TOK_LITERAL;
}
The following are candidate reasons why a program like Lark or MSXML
might run slower:
- works with Java char rather than byte variables
- does a method dispatch (or at least a few conditionals) per
character processed, for at least two reasons: to manage the entity
stack, and to have a place to put the different character-encoding
processing modules (see the sketch after this list)
[Note: A look at James' code makes me wonder if this is
*really* as necessary as I thought]
- does quite a bit more work upon recognizing some markup
constructs; in particular, for a start tag it pulls
apart the attribute list and packages up the element type
& attributes in a nice structure convenient for an API user
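To make the dispatch point concrete, here's an illustrative sketch of
the shape of that per-character path; this is not Lark's (or anyone's)
actual code, and the names CharSource, Latin1Source, InputStack, and
getXmlCharacter() are made up for the example:

    // Illustrative only: shows where the per-character method
    // dispatch and entity-stack check live, versus James' direct
    // buf[start] indexing.
    abstract class CharSource {
        abstract int next();  // next char, or -1 at end of this entity
    }

    final class Latin1Source extends CharSource {
        private final byte[] buf;
        private int pos = 0;
        Latin1Source(byte[] buf) { this.buf = buf; }
        int next() { return pos == buf.length ? -1 : (buf[pos++] & 0xFF); }
    }

    final class InputStack {
        private final java.util.Stack entities = new java.util.Stack();

        void push(CharSource s) { entities.push(s); }

        // One virtual next() call plus an entity-stack check
        // for every single character processed.
        int getXmlCharacter() {
            while (!entities.empty()) {
                int c = ((CharSource) entities.peek()).next();
                if (c != -1) return c;
                entities.pop();  // entity exhausted; resume the including one
            }
            return -1;  // all input consumed
        }
    }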
I went and looked at Lark's main loop. For a 'typical' character-processing
step, i.e. we're not at the beginning or end of a tag or attribute or
anything, no buffers run out, and the text is being saved, it ends up
executing 25 lines of Java, including one getXmlCharacter() method
dispatch; none of them are monster conditionals or anything.
James' code above, in the equivalent case, executes 3, I think.
So while lines-of-code is a very shaky yardstick indeed, the ratio is
8 or 9 to 1, which is not out of line with the observed performance
difference.
My intuition is that what's holding Lark back is
(a) the per-char dispatching, and
(b) turning the DFA crank, which requires a 2D array reference, then
a shift & mask, for every character (sketched below).
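For concreteness, (b) might look something like this; the table layout,
the packing, and the names (DfaCrank, crank) are assumptions for
illustration, not Lark's actual encoding:

    // Illustrative sketch of the "DFA crank": each transition entry
    // packs an action code above the next-state bits, so every
    // character costs a 2D array reference, then a shift and a mask.
    final class DfaCrank {
        static final int STATE_BITS = 8;
        static final int STATE_MASK = (1 << STATE_BITS) - 1;

        private final int[][] transitions;  // [state][charClass] -> packed entry
        private int state = 0;

        DfaCrank(int[][] transitions) { this.transitions = transitions; }

        // Feed one character class; advances the state and returns
        // the action code for the caller to dispatch on.
        int crank(int charClass) {
            int entry = transitions[state][charClass];  // 2D array reference
            state = entry & STATE_MASK;                 // mask: next state
            return entry >>> STATE_BITS;                // shift: action code
        }
    }

The point is the fixed per-character cost: an array index chain plus a
couple of ALU operations before any real work happens.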
I have some ideas on how to fix both, but first I have to make Lark
do conditional sections and validate (neither should slow it down
significantly).
One other experiment might be useful, and might shed light from
a different angle. James, how about doing element counts per type,
i.e. actually *using* some of the info coming back from the tokenizer?
Nothing fancy, just a java.util.Hashtable or some such; very similar
code should run on both Lark and your TokenStream thing, and I wonder
if it would change the numbers. I'll get around to this sometime
if nobody else does, but not for the next 2-3 weeks. -Tim
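PS: here's the sort of counting harness I have in mind; purely
illustrative, with hypothetical names (ElementCounter, countStartTag),
driven from whatever start-tag callback or token loop the parser under
test provides:

    import java.util.Enumeration;
    import java.util.Hashtable;

    // Hypothetical harness: count start tags per element type.
    public class ElementCounter {
        private final Hashtable counts = new Hashtable();

        public void countStartTag(String elementType) {
            int[] n = (int[]) counts.get(elementType);
            if (n == null)
                counts.put(elementType, n = new int[1]);
            n[0]++;
        }

        public void report() {
            for (Enumeration e = counts.keys(); e.hasMoreElements();) {
                String type = (String) e.nextElement();
                System.out.println(type + ": " + ((int[]) counts.get(type))[0]);
            }
        }
    }

(The int[1] is just a cheap mutable counter, to avoid allocating a new
Integer per start tag.)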