XML processing experiments

Wed Nov 5 08:36:07 GMT 1997

First off, thanks to James for a some very thought-provoking work.

At 07:03 PM 04/11/97 +0700, James Clark wrote:
>If all you want to do is be able to
>correctly parse well-formed XML, and you don't care about detecting
>whether or not it is well-formed, how much code does it take and is it
>significantly faster than using an XML parser ...
>Lark: 10.5 seconds .. MSXML: 24 .. nsgmlsu: 8 .. sgcount:11 .. 
>xmlec (C): 0.5 seconds .. (Java): 1.5 seconds.

[BTW, when I got Lark to run "almost as fast as SP", I decided that
was qualitatively fast enough for now].

>I was quite surprised that there was such a big performance difference

No kidding.  

Discussions here are a bit dangerous, since in the Java domain, we are 
kind of operating in the dark; we don't have profiling tools 
with really good granularity.  This is my excuse for engaging in 
performance analysis based on intuition, something for which I have 
personally fried more than one junior programmer.

Let's look at James' code eating up a "-quoted literal, where characters 
are in the byte array buf[], start and end being integer indices therein:
      case (byte)'"':
	{
	  for (++start; start != end; ++start) {
	    if (buf[start] == (byte)'"') {
	      nextTokenIndex = start + 1;
	      return TOK_LITERAL;
	    }
The following are candidates for why a program like Lark or MSXML
might run slower.
 - works with Java char rather than byte variables
 - does a method dispatch (or at least a few conditionals) per 
   character processed for at least two reasons: to manage the entity
   stack, and to have a place to put the different character encoding
   processing modules.
   [Note: A look at James' code makes me wonder if this is
    *really* as necessary as I thought]
 - does quite a bit more work upon recognizing some markup
   constructs; in particular for a start tag it pulls
   appart the attribute list and packages up the element type
   & attributes in a nice structure convenient for an API user

I went and looked at Lark's main loop, and for a 'typical' character 
processing mode, i.e. it's not the begin or end of a tag or attribute or 
something and no buffers run out but the text is being saved, it ends up 
executing 25 lines of Java including one getXmlCharacter() method 
dispatch; none of them are monster conditionals or anything.  

James' code above, in the equivalent case,  is executing 3 I think.

so while lines-of-code is very shaky yardstick indeed, the difference is
8 or 9 to 1, which is not out of line with the observed performance
difference. 

My intuition is that what's holding Lark back is 
(a) the per-char dispatching, and
(b) turning the DFA crank, which requires a 2D array reference, then
    a shift & mask

I have some ideas on how to fix both, but first I have to make Lark
do conditional sections and validate (neither should slow it down
significantly). 

One other experiment would be useful, that might shed light from
a different angle.  James, how about doing element counts per type;
i.e. actually *using* some of the info come back from the tokenizer,
nothing fancy, just use a java.util.Hashtable or some such; should be 
able to run very similar code on Lark and your TokenStream thing; I
wonder if it would change the numbers.  I'll get around to this sometime
if nobody else does, but not for the next 2-3 weeks. -Tim

xml-dev: A list for W3C XML Developers. To post, mailto:xml-dev at ic.ac.uk
Archived as: http://www.lists.ic.ac.uk/hypermail/xml-dev/
To (un)subscribe, mailto:majordomo at ic.ac.uk the following message;
(un)subscribe xml-dev
To subscribe to the digests, mailto:majordomo at ic.ac.uk the following message;
subscribe xml-dev-digest
List coordinator, Henry Rzepa (mailto:rzepa at ic.ac.uk)