XML processing experiments

Tue Nov 4 12:12:03 GMT 1997

One nice feature of XML is that it is easily processable by the
Desperate C/C++/Java/Perl hacker: the syntax is simple enough that you
can do useful things with XML without a full XML parser. I've been
exploring this sort of processing.  If all you want to do is be able to
correctly parse well-formed XML, and you don't care about detecting
whether or not it is well-formed, how much code does it take and is it
significantly faster than using an XML parser that does proper
well-formedness or validity checking?

I used Jon's Old Testament XML file as test data (after removing the
doctype line), which is about 3.7Mb.  I ran the tests on a Toshiba Tecra
720CDT (133MHz Pentium, 80Mb RAM) with Windows NT 4.0.  I used the IE
4.0 Java VM. The timings I give are after a couple of runs, so there's
little or no disk I/O involved.  Lark 0.97 parsed the file in about 10.5
seconds, MSXML in about 24 seconds.  I suspect the difference is partly
because MSXML is building a tree (I didn't see any command line switch
to turn this off).  By comparison nsgmlsu -s took about 8 seconds. I
also tried LT XML (which is written in C). I didn't find a program that
did nothing but parsing.  The fastest one I found was the sgcount
program (which counts the number of each element type); it took about 11
seconds.  That's much slower than I expected; I suspect there may be
some Windows-specific performance problems.

The code I wrote is available at
<URL:ftp://ftp.jclark.com/pub/test/xmltok.zip>. First I wrote a little
library in C for doing XML "tokenization".  This code just splits the
input up into "tokens" where each token is data or some kind of XML
markup (start-tag, end-tag, comment etc).  The idea is that it does the
minimum necessary to do any kind of useful XML-aware processing. I wrote
a little application xmlec that just counts the number of elements in an
XML document.  This can be compiled either to use Win32 file mapping (if
FILEMAP is defined) or normal read() calls.  You'll probably have to
tweak the code a little if you're using anything other than Visual C++. 
I then translated this into Java (I'm not much of a Java programmer, so
there's probably plenty of scope for improvement in the Java version).
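
As a rough illustration of the approach (this is not the xmltok code, just
a minimal sketch written for this note), an element counter that trusts its
input to be well-formed might look like the following.  The names
count_elements, skip_past, skip_tag and starts_with are made up for the
example.  It assumes the whole document is already in memory (for example
via file mapping or read()), that the encoding is ASCII-compatible, and that
any DOCTYPE declaration has no internal subset.

```c
#include <stddef.h>
#include <string.h>

/* True if the bytes in [p, lim) begin with the string s. */
static int starts_with(const char *p, const char *lim, const char *s)
{
    size_t n = strlen(s);
    return (size_t)(lim - p) >= n && memcmp(p, s, n) == 0;
}

/* Return a pointer just past the first occurrence of `end` in [p, lim),
   or lim if it does not occur. */
static const char *skip_past(const char *p, const char *lim, const char *end)
{
    for (; p < lim; p++)
        if (starts_with(p, lim, end))
            return p + strlen(end);
    return lim;
}

/* Skip the rest of a tag.  A '>' inside a quoted attribute value
   does not end the tag. */
static const char *skip_tag(const char *p, const char *lim)
{
    char quote = 0;
    for (; p < lim; p++) {
        if (quote) {
            if (*p == quote)
                quote = 0;
        } else if (*p == '"' || *p == '\'') {
            quote = *p;
        } else if (*p == '>') {
            return p + 1;
        }
    }
    return lim;
}

/* Count start-tags and empty-element tags; no well-formedness checks. */
long count_elements(const char *buf, size_t len)
{
    const char *p = buf, *lim = buf + len;
    long count = 0;

    while (p < lim) {
        if (*p != '<') {                          /* character data */
            p++;
        } else if (starts_with(p, lim, "<!--")) { /* comment */
            p = skip_past(p + 4, lim, "-->");
        } else if (starts_with(p, lim, "<![CDATA[")) {
            p = skip_past(p + 9, lim, "]]>");
        } else if (starts_with(p, lim, "<?")) {   /* processing instruction */
            p = skip_past(p + 2, lim, "?>");
        } else if (starts_with(p, lim, "</")) {   /* end-tag */
            p = skip_tag(p + 2, lim);
        } else if (starts_with(p, lim, "<!")) {   /* simple declaration */
            p = skip_tag(p + 2, lim);
        } else {                                  /* start or empty tag */
            count++;
            p = skip_tag(p + 1, lim);
        }
    }
    return count;
}
```

Calling count_elements(buf, len) on a buffer holding the document returns
the number of elements.  Nothing here checks that tags nest or that names
are legal, which is exactly the work a conforming parser cannot skip.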

xmlec parses the test file in about 0.5 seconds. Using read() instead of
file mapping increases the time to about 0.65 seconds.  The Java
version takes about 1.5 seconds.

I also wrote a Java version of the LT XML textonly program (which
extracts the non-markup of an XML document).  The LT XML version ran in
about 13.5 seconds.  My Java version ran in about 3.5 seconds.
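
The same scanning loop can extract text instead of counting.  The sketch
below is in C rather than Java and is not the program timed above; the
names extract_text, at and find are invented for the example.  It copies
character data and the contents of CDATA sections to an output buffer that
the caller must make at least len bytes long.  As a simplification it
copies entity and character references through undecoded, and it assumes
there is no DOCTYPE internal subset.

```c
#include <stddef.h>
#include <string.h>

/* True if the bytes in [p, lim) begin with the string s. */
static int at(const char *p, const char *lim, const char *s)
{
    size_t n = strlen(s);
    return (size_t)(lim - p) >= n && memcmp(p, s, n) == 0;
}

/* Advance p to the first occurrence of `end`, or to lim if none. */
static const char *find(const char *p, const char *lim, const char *end)
{
    while (p < lim && !at(p, lim, end))
        p++;
    return p;
}

/* Copy the non-markup of buf[0..len) into out; return bytes written. */
size_t extract_text(const char *buf, size_t len, char *out)
{
    const char *p = buf, *lim = buf + len;
    char *o = out;

    while (p < lim) {
        if (*p != '<') {                        /* character data */
            *o++ = *p++;
        } else if (at(p, lim, "<![CDATA[")) {   /* keep CDATA contents */
            const char *start = p + 9;
            const char *end = find(start, lim, "]]>");
            memcpy(o, start, (size_t)(end - start));
            o += end - start;
            p = end < lim ? end + 3 : lim;
        } else if (at(p, lim, "<!--")) {        /* drop comments */
            p = find(p + 4, lim, "-->");
            p = p < lim ? p + 3 : lim;
        } else if (at(p, lim, "<?")) {          /* drop PIs */
            p = find(p + 2, lim, "?>");
            p = p < lim ? p + 2 : lim;
        } else {                                /* drop any other tag */
            char quote = 0;
            for (p++; p < lim; p++) {
                if (quote) {
                    if (*p == quote)
                        quote = 0;
                } else if (*p == '"' || *p == '\'') {
                    quote = *p;
                } else if (*p == '>') {
                    p++;
                    break;
                }
            }
        }
    }
    return (size_t)(o - out);
}
```

The output can never be longer than the input, so a buffer of len bytes is
always enough.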

The class files for the Java element counting program total about 6k. 
The source for the C version is about 750 lines, including both the
file-mapping and read() versions.

I was quite surprised that there was such a big performance difference
between real, conforming XML processing that does well-formedness
checking, and quick and dirty XML processing that does the minimum
necessary to get the correct result.  This doesn't seem right to me... 

James

xml-dev: A list for W3C XML Developers. To post, mailto:xml-dev at ic.ac.uk
Archived as: http://www.lists.ic.ac.uk/hypermail/xml-dev/