XML processing experiments

Fri Nov 7 07:29:17 GMT 1997

Tim Bray wrote:

> The following are candidates for why a program like Lark or MSXML
> might run slower.
>  - works with Java char rather than byte variables
>  - does a method dispatch (or at least a few conditionals) per
>    character processed for at least two reasons: to manage the entity
>    stack,

Given XML's requirements that entity references in the instance are
synchronous, I would have thought that the overhead of an entity stack
could be avoided for parsing the instance.  The parser passes the
application an entity reference event, and the application can then, if
it chooses, recursively invoke the parser to parse the referenced
entity.

>    and to have a place to put the different character encoding
>    processing modules.

The issue of how to deal with multiple encodings is an interesting one. 
The straightforward approach is to abstract an encoding as a process
that converts from bytes to characters (and vice-versa) and perform this
conversion process before parsing.  This involves a significant
performance hit.  This is particularily the case if you want to get
correct byte offsets when using a variable width encoding (such as
UTF-8); it's hard to do this without a method call per character.

I think it ought to be possible to abstract an encoding in a way that
avoids this.  Instead of having a two-stage process -- convert a stream
of bytes to a stream of characters and then divide the stream of
characters into tokens -- there would be a one stage process that
converted a stream of bytes into a stream of characters already split up
into tokens.  I haven't worked this through yet.

>  - does quite a bit more work upon recognizing some markup
>    constructs; in particular for a start tag it pulls
>    appart the attribute list and packages up the element type
>    & attributes in a nice structure convenient for an API user

One ought to be able to do at least some of this work lazily.  For
example, your API can say here's the start-tag, and can then provide a
separate function that pulls apart the start-tag if the user needs it. 

This could be useful in reducing the heap usage when building trees for
large documents (MSXML used about 25Mb on the ot.xml file): instead of
building some complex data structure representing the element type and
attributes for each element, you simply store a pointer to the element's
start tag, and then parse the start-tag to extract the element type and
attributes if and when the user requests them.  This would work
particularily well with file mapping (CreateFileMapping/mmap).  Is there
any way to get at this sort of OS functionality in Java?

> One other experiment would be useful, that might shed light from
> a different angle.  James, how about doing element counts per type;
> i.e. actually *using* some of the info come back from the tokenizer,
> nothing fancy, just use a java.util.Hashtable or some such; should be
> able to run very similar code on Lark and your TokenStream thing; I
> wonder if it would change the numbers.

I've done this.  Here are the timings I got:

[0. Using bytes, counting total number of elements: 1.5s]

1. Using chars instead of bytes and Reader instead of InputStream; use
the standard InputStreamReader: 2.3s

2. Same as 1, but roll my own XmlInputStreamReader that simply throws an
exception on non-ASCII characters: 1.9s

3. Same as 2, but extract element type name on end tag as a String: 2.3s

4. Same as 3, but count element types using a Hashtable: 2.8s

5. Same as 5, but use custom hash table to avoid allocating String when
element type name is already in the table: 2.4s

James

xml-dev: A list for W3C XML Developers. To post, mailto:xml-dev at ic.ac.uk
Archived as: http://www.lists.ic.ac.uk/hypermail/xml-dev/
To (un)subscribe, mailto:majordomo at ic.ac.uk the following message;
(un)subscribe xml-dev
To subscribe to the digests, mailto:majordomo at ic.ac.uk the following message;
subscribe xml-dev-digest
List coordinator, Henry Rzepa (mailto:rzepa at ic.ac.uk)