XML processing experiments

Istvan Cseri istvanc at microsoft.com
Tue Nov 4 16:57:42 GMT 1997


I can offer a couple of reasons why a 'real' parser would be slower than
an ad-hoc processor:

- abstraction for encapsulating different encodings
- keeping track of line and column information for error reporting
- storing attributes and checking for uniqueness
- checking for valid element close tags
- processing entity references
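
To make the cost of these checks concrete, here is a minimal sketch in
Java of the bookkeeping behind three of the items above: line/column
tracking, attribute uniqueness, and matching of close tags. The class
CheckingState and all of its method names are hypothetical and chosen
for illustration only; this is not MSXML code. An ad-hoc processor
skips all of this work.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashSet;
import java.util.Set;

/** Hypothetical per-document state that a checking parser has to maintain. */
final class CheckingState {
    private int line = 1;
    private int column = 0;
    private final Deque<String> openElements = new ArrayDeque<>();
    private final Set<String> attributesInTag = new HashSet<>();

    /** Called for every input character so that errors can report a position. */
    void advance(char c) {
        if (c == '\n') {
            line++;
            column = 0;
        } else {
            column++;
        }
    }

    /** Remembers the open element and starts a fresh attribute set for it. */
    void startTag(String name) {
        openElements.push(name);
        attributesInTag.clear();
    }

    /** Rejects an attribute name that was already seen in the current tag. */
    void attribute(String name) {
        if (!attributesInTag.add(name)) {
            throw error("duplicate attribute '" + name + "'");
        }
    }

    /** Checks that a close tag matches the most recently opened element. */
    void endTag(String name) {
        if (openElements.isEmpty()) {
            throw error("close tag '" + name + "' without an open element");
        }
        String expected = openElements.pop();
        if (!expected.equals(name)) {
            throw error("expected </" + expected + "> but found </" + name + ">");
        }
    }

    /** Number of elements that are currently open. */
    int depth() {
        return openElements.size();
    }

    private IllegalStateException error(String message) {
        return new IllegalStateException(line + ":" + column + ": " + message);
    }
}
```

Every start tag costs a push, every attribute a hash lookup, and every
character a position update, which is work the tokenizer-only approach
never does.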

In addition to this, the MSXML parser builds the tree. We are going to
have a version where this can be turned off, but when XML is used as
data it is extremely useful to have the tree around so that you can do
different kinds of lookups on it, navigate it, and update it.
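
As a rough illustration of what having the tree around buys you, here
is a small Java sketch of an in-memory element tree that supports
lookup, navigation and update. The TreeNode class and its members are
made up for this example and are not the MSXML object model.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical element node; a real parser's tree would carry more. */
final class TreeNode {
    final String name;
    String text = "";                 // updatable character data
    TreeNode parent;                  // navigation upwards
    final List<TreeNode> children = new ArrayList<>();

    TreeNode(String name) {
        this.name = name;
    }

    /** Appends a child, links it to this node, and returns the child. */
    TreeNode add(TreeNode child) {
        child.parent = this;
        children.add(child);
        return child;
    }

    /** Lookup: every descendant with the given name, in document order. */
    List<TreeNode> find(String wanted) {
        List<TreeNode> hits = new ArrayList<>();
        collect(wanted, hits);
        return hits;
    }

    private void collect(String wanted, List<TreeNode> hits) {
        for (TreeNode child : children) {
            if (child.name.equals(wanted)) {
                hits.add(child);
            }
            child.collect(wanted, hits);
        }
    }
}
```

With such a tree, something like book.find("verse") can be answered
repeatedly without re-reading the input, at the price of allocating one
object per element while parsing.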

Istvan

> ----------
> From: 	James Clark[SMTP:jjc at jclark.com]
> Reply To: 	James Clark
> Sent: 	Tuesday, November 04, 1997 4:03 AM
> To: 	XML Developers' List
> Subject: 	XML processing experiments
> 
> One nice feature of XML is that it is easily processable by the
> Desperate C/C++/Java/Perl hacker: the syntax is simple enough that you
> can do useful things with XML without a full XML parser. I've been
> exploring this sort of processing.  If all you want to do is be able
> to correctly parse well-formed XML, and you don't care about detecting
> whether or not it is well-formed, how much code does it take and is it
> significantly faster than using an XML parser that does proper
> well-formedness or validity checking?
> 
> I used Jon's Old Testament XML file as test data (after removing the
> doctype line), which is about 3.7Mb.  I ran the tests on a Toshiba
> Tecra 720CDT (133MHz Pentium, 80Mb RAM) with Windows NT 4.0.  I used
> the IE 4.0 Java VM. The timings I give are after a couple of runs, so
> there's little or no disk I/O involved.  Lark 0.97 parsed the file in
> about 10.5 seconds, MSXML in about 24 seconds.  I suspect the
> difference is partly because MSXML is building a tree (I didn't see
> any command line switch to turn this off).  By comparison nsgmlsu -s
> took about 8 seconds. I also tried LT XML (which is written in C). I
> didn't find a program that did nothing but parsing.  The fastest one I
> found was the sgcount program (which counts the number of each element
> type); it took about 11 seconds.  That's much slower than I expected;
> I suspect there may be some Windows-specific performance problems.
> 
> The code I wrote is available at
> <URL:ftp://ftp.jclark.com/pub/test/xmltok.zip>. First I wrote a little
> library in C for doing XML "tokenization".  This code just splits the
> input up into "tokens" where each token is data or some kind of XML
> markup (start-tag, end-tag, comment etc).  The idea is that it does
> the minimum necessary to do any kind of useful XML-aware processing. I
> wrote a little application xmlec that just counts the number of
> elements in an XML document.  This can be compiled either to use Win32
> file mapping (if FILEMAP is defined) or normal read() calls.  You'll
> probably have to tweak the code a little if you're using anything
> other than Visual C++.  I then translated this into Java (I'm not much
> of a Java programmer, so there's probably plenty of scope for
> improvement in the Java version).
> 
> xmlec parses the test file in about 0.5 seconds. Using read() instead
> of file mapping increases the time to about 0.65 seconds.  The Java
> version takes about 1.5 seconds.
> 
> I also wrote a Java version of the LT XML textonly program (which
> extracts the non-markup of an XML document).  The LT XML version ran
> in about 13.5 seconds.  My Java version ran in about 3.5 seconds.
> 
> The class files for the Java element counting program total about 6k.
> The source for the C version is about 750 lines, including both the
> file mapping and read()ing version.
> 
> I was quite surprised that there was such a big performance difference
> between real, conforming XML processing that does well-formedness
> checking, and quick and dirty XML processing that does the minimum
> necessary to get the correct result.  This doesn't seem right to me...
> 
> 
> James
> 
> xml-dev: A list for W3C XML Developers. To post,
> mailto:xml-dev at ic.ac.uk
> Archived as: http://www.lists.ic.ac.uk/hypermail/xml-dev/
> To (un)subscribe, mailto:majordomo at ic.ac.uk the following message;
> (un)subscribe xml-dev
> To subscribe to the digests, mailto:majordomo at ic.ac.uk the following
> message;
> subscribe xml-dev-digest
> List coordinator, Henry Rzepa (mailto:rzepa at ic.ac.uk)
> 
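
For comparison with the checks listed at the top, the "tokenize and
count" approach from the quoted message can be sketched in a few lines
of Java. This is not the code from xmltok.zip; the class ElementCounter
and its helpers are hypothetical, and the sketch assumes the input is
already well-formed and has no DOCTYPE with an internal subset (the
quoted test removed the doctype line as well).

```java
/** Hypothetical minimal element counter; does no well-formedness checking. */
final class ElementCounter {

    /** Counts start tags and empty-element tags in assumed well-formed XML. */
    static int countElements(String xml) {
        int count = 0;
        int i = 0;
        while ((i = xml.indexOf('<', i)) >= 0) {
            if (xml.startsWith("<!--", i)) {
                i = skipPast(xml, "-->", i + 4);        // comment
            } else if (xml.startsWith("<![CDATA[", i)) {
                i = skipPast(xml, "]]>", i + 9);        // CDATA section
            } else if (xml.startsWith("<?", i)) {
                i = skipPast(xml, "?>", i + 2);         // processing instruction
            } else if (xml.startsWith("</", i) || xml.startsWith("<!", i)) {
                i = skipPast(xml, ">", i + 2);          // end tag or simple declaration
            } else {
                count++;                                // start tag or empty-element tag
                i = skipTag(xml, i + 1);
            }
        }
        return count;
    }

    /** Skips to just past the closing '>' of a tag, ignoring '>' inside quotes. */
    private static int skipTag(String xml, int i) {
        char quote = 0;
        for (; i < xml.length(); i++) {
            char c = xml.charAt(i);
            if (quote != 0) {
                if (c == quote) {
                    quote = 0;
                }
            } else if (c == '"' || c == '\'') {
                quote = c;
            } else if (c == '>') {
                return i + 1;
            }
        }
        return xml.length();
    }

    /** Returns the index just past the terminator, or the end of input if absent. */
    private static int skipPast(String xml, String terminator, int from) {
        int end = xml.indexOf(terminator, from);
        return end < 0 ? xml.length() : end + terminator.length();
    }
}
```

There is no element stack, no attribute table, no position tracking and
no entity handling here, so malformed input simply yields a meaningless
count instead of an error.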
