XML processing experiments

Tue Nov 4 15:46:09 GMT 1997

>I also tried LT XML (which is written in C). I didn't find a program that
>did nothing but parsing.  The fastest one I found was the sgcount
>program (which counts the number of each element type); it took about 11
>seconds.  That's much slower than I expected; I suspect there may be
>some Windows-specific performance problems.

It's true that we do our development under unix, and I don't have any
benchmarks for MS Windows.  I just ran "sgcount <ot.xml" on an AMD K5
PR-100 (supposedly equivalent to a 100MHz Pentium) under FreeBSD, and
it took 6.8 seconds.  This suggests that we run about twice as fast
under unix as under MS Windows, which is something we will have to
look into.

But in any case, the currently-released version of LT-XML (0.9.5) is
far too slow on all platforms.  The next version, which we hope to
release by the end of the year, has a completely new parser and is
roughly three times as fast.

Why is the old version so slow?

- It's written in yacc and lex.  I didn't expect this to be slow, but
profiling shows that it's spending most of its time in the yacc and lex
internals, which we can't do much about.  The new version is written in
plain C, and I actually think it's much clearer.  Yacc is not well-suited
to the sort of context-dependent tokenising that is required in DTDs.  We
had to abandon lex anyway to handle 16-bit characters.

- It does a malloc() and free() for every start tag, end tag, attribute
name, attribute value, and pcdata.  The new version only does that for
attribute values and pcdata.
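
To illustrate the allocation point, here is a minimal sketch of one way to avoid a malloc()/free() pair per name: copy each tag or attribute name into a single buffer that is reused and only grows when a longer token arrives. This is not LT-XML's code; NameBuf, namebuf_set and namebuf_free are hypothetical names invented for the example, and the initial capacity of 16 bytes is arbitrary.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical reusable buffer for token text; not taken from LT-XML. */
typedef struct {
    char  *data;    /* storage reused from one token to the next */
    size_t cap;     /* bytes currently allocated */
    size_t allocs;  /* how many times the storage was (re)allocated */
} NameBuf;

/* Copy a token of len bytes into the buffer, growing it only when the
   token does not fit.  The returned pointer is valid until the next
   call.  Returns NULL if memory cannot be allocated. */
static const char *namebuf_set(NameBuf *b, const char *tok, size_t len)
{
    assert(b != NULL && tok != NULL);
    if (len + 1 > b->cap) {
        size_t ncap = b->cap ? b->cap : 16;
        char *p;
        while (ncap < len + 1)
            ncap *= 2;
        p = realloc(b->data, ncap);   /* realloc(NULL, n) acts as malloc */
        if (p == NULL)
            return NULL;
        b->data = p;
        b->cap = ncap;
        b->allocs++;
    }
    memcpy(b->data, tok, len);
    b->data[len] = '\0';
    return b->data;
}

/* Release the storage once parsing is finished. */
static void namebuf_free(NameBuf *b)
{
    free(b->data);
    b->data = NULL;
    b->cap = 0;
}
```

The trade-off is that the caller must copy any name it wants to keep beyond the next token, which is presumably why values handed back to the application (attribute values, pcdata) are harder to treat this way.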

Another reason that both versions are slower than the desperate C
hacker's programs is that they maintain a stack of input sources to
implement entity expansion.  This adds an overhead even when entities
are not being expanded.
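
As a rough sketch of what such a stack of input sources can look like (an illustration only, not the LT-XML implementation; Source, Input, input_push, input_getc and MAX_DEPTH are all hypothetical names), each entity reference pushes its replacement text, and reading resumes in the enclosing source once the replacement is exhausted:

```c
#include <assert.h>
#include <stddef.h>

#define MAX_DEPTH 8   /* arbitrary nesting limit for this sketch */

/* One input source: the document itself or an entity's replacement text. */
typedef struct {
    const char *text;
    size_t      pos;
} Source;

/* Hypothetical stack of sources; the top of the stack is the one being read. */
typedef struct {
    Source stack[MAX_DEPTH];
    int    depth;   /* number of active sources */
} Input;

/* Start reading from text.  Returns 0 on success, -1 if nested too deeply. */
static int input_push(Input *in, const char *text)
{
    assert(in != NULL && text != NULL);
    if (in->depth >= MAX_DEPTH)
        return -1;
    in->stack[in->depth].text = text;
    in->stack[in->depth].pos = 0;
    in->depth++;
    return 0;
}

/* Return the next character, popping sources as they run out.
   Returns -1 when every source is exhausted. */
static int input_getc(Input *in)
{
    while (in->depth > 0) {
        Source *s = &in->stack[in->depth - 1];
        unsigned char c = (unsigned char)s->text[s->pos];
        if (c != '\0') {
            s->pos++;
            return c;
        }
        in->depth--;   /* this source is finished; resume the enclosing one */
    }
    return -1;
}
```

Note that every character passes through the depth check and the indirection via the top of the stack, even for a document that contains no entity references at all, which is the overhead described above.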

The figures above are all for 8-bit-character systems.  The next
release will have a compile-time option to support 16-bit characters.
I expect the 16-bit version to be about 30% slower than the 8-bit
version (for the same 8-bit data).
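
A compile-time character-width option of this kind is commonly done with a typedef selected by a macro. The following is a sketch under assumptions, not the actual LT-XML source: the macro name CHAR_SIZE, the type name Char, and the helper functions are hypothetical, and unsigned short is assumed to be 16 bits wide.

```c
#include <assert.h>
#include <stddef.h>

/* CHAR_SIZE is a hypothetical build option: compile with -DCHAR_SIZE=16
   to get 16-bit characters; the default here is 8. */
#ifndef CHAR_SIZE
#define CHAR_SIZE 8
#endif

#if CHAR_SIZE == 16
typedef unsigned short Char;   /* assumed to be 16 bits on the target */
#else
typedef unsigned char Char;
#endif

/* strlen() equivalent that works for either character width. */
static size_t char_strlen(const Char *s)
{
    size_t n = 0;
    assert(s != NULL);
    while (s[n] != 0)
        n++;
    return n;
}

/* Widen 8-bit input into a Char buffer holding up to cap characters,
   including the terminator.  Returns the number of characters stored,
   not counting the terminator. */
static size_t char_from_bytes(Char *dst, size_t cap, const char *src)
{
    size_t i = 0;
    if (cap == 0)
        return 0;
    while (i + 1 < cap && src[i] != '\0') {
        dst[i] = (Char)(unsigned char)src[i];
        i++;
    }
    dst[i] = 0;
    return i;
}
```

With 16-bit characters, 8-bit input has to be widened on the way in and every buffer occupies twice the memory, which is one plausible source of the slowdown estimated above.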

We also plan to release the parser itself separately from the rest of
the LT-XML/LT-NSL toolkit, for use in programs that just need an XML
parser.  I expect it to be about 25% faster than the LT-XML version, just
because a layer is removed.

> >I was quite surprised that there was such a big performance difference
> >between real, conforming XML processing that does well-formedness
> >checking, and quick and dirty XML processing that does the minimum
> >necessary to get the correct result.  This doesn't seem right to me...

It isn't, and we're hoping to reduce it.

-- Richard
