XML tools and big documents (was: Re: Is there a size limitation on XML file given to MSXSL as input?)

Tyler Baker tyler at infinet.com
Wed Sep 2 03:38:34 BST 1998


David Megginson wrote:

> Ingo Macherius writes:
>
>  > My afterall impression is that most available tools do well with
>  > toy examples, but any input being in the MB range easily blasts
>  > them. At least that's true for what came from MS so far.
>
> I don't think that that's true in general.  Most of the Java-based XML
> parsers I've tried seem to be able to handle Jon Bosak's XML Old
> Testament (nearly 4MB) just fine, if somewhat slowly -- I used ot.xml
> for routine testing and profiling while developing AElfred, and
> AElfred barely kicked up a sweat.
>
> The problem comes if the parser tries to build a tree rather than
> simply reporting an event stream.  Depending on the implementation,
> document trees tend to be very large.  With a naive tree
> implementation, a 10MB document might use 100MB or more of virtual
> memory for the document tree -- that'll bring most current desktop
> systems to a screeching halt.

This is especially true for Java, which is very memory hungry.  Most of the memory
problems with objects can be significantly reduced if your nodes only allocate
memory for child arrays as needed (I would assume most implementations use an
array rather than a Vector to store children).  Also, if there is only one child,
do not create an array just to store that one child.

In other words, you have something like this:

class Node {
  Node child;       // set while the node has exactly one child
  Node[] children;  // allocated only once there is more than one child
  int nodeLength;   // number of children currently stored
}

If both child and children are null, there are no children; if child is non-null,
it is the sole child.  If the number of children ever goes above one, copy the
existing child into children[0], put the newly appended node into children[1], and
then set child to null.
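Filling that in, a minimal sketch of the append step might look like the following.
The appendChild method name and the initial array capacity of 4 are my own
illustrative choices, not taken from any particular DOM implementation:

class Node {
  Node child;       // holds the only child while there is exactly one
  Node[] children;  // allocated only once a second child is added
  int nodeLength;   // number of children currently stored

  void appendChild(Node newChild) {
    if (child == null && children == null) {
      // first child: a plain reference assignment, no array allocated
      child = newChild;
      nodeLength = 1;
    } else if (children == null) {
      // second child: promote to an array, copying the old single child first
      children = new Node[4];
      children[0] = child;
      children[1] = newChild;
      child = null;
      nodeLength = 2;
    } else {
      // third and later children: grow the array only when it is full
      if (nodeLength == children.length) {
        Node[] bigger = new Node[children.length * 2];
        System.arraycopy(children, 0, bigger, 0, nodeLength);
        children = bigger;
      }
      children[nodeLength++] = newChild;
    }
  }
}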

Then, when you look up a child by index or name, you first test whether child is
null.  If it is not, return child when the requested index is 0; otherwise the
index is out of bounds.  If child is null, test whether children is null.  If
children is not null, just look up the node by index.  If children is also null,
there are no children (nothing has been added, or everything has been removed).
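A lookup by index along those lines might look like this (getChild is again a
hypothetical method name; a lookup by name would walk the same storage):

// Added to the Node class sketched above.
Node getChild(int index) {
  if (child != null) {
    // single-child case: only index 0 is valid
    if (index == 0) {
      return child;
    }
    throw new IndexOutOfBoundsException("index: " + index);
  }
  if (children != null) {
    if (index >= 0 && index < nodeLength) {
      return children[index];
    }
    throw new IndexOutOfBoundsException("index: " + index);
  }
  // no children at all
  throw new IndexOutOfBoundsException("node has no children");
}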

For a lot of trees where it is fairly common for nodes to have only one child,
this can save you a lot of memory.  It can also speed up your tree traversals a
bit, since you do not have to look up the child node by index in the case where
there is only one child.  Also, when building the tree, you will likely speed your
app up a lot, since you now only have to create a new array object once a node
acquires its second child.  Otherwise it is just a reference assignment, which is
about as fast as an integer assignment.
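As a quick illustration of the allocation behaviour under this scheme (using the
hypothetical appendChild sketched above):

Node root = new Node();
root.appendChild(new Node());  // reference assignment only, no array created
root.appendChild(new Node());  // array created now, on the second child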

I have not had a lot of problems building trees.  For the DOM implementation, in
conjunction with the parser I have, I build a DOM tree off of Jon Bosak's ot.xml
in about 12 seconds running a JIT with JDK 1.2 b4, on an old P-120 with 64 MB
of RAM running Windows NT 4.0.  I have not been able to do any reliable memory
benchmarks because the GC seems to be invoked very frequently with Sun's JDK 1.2
VM.

I would suspect that the DOM package provided by Don Park has similar
performance and memory consumption.  Your best bet would probably be to look at
an XSL package which takes a DOM tree of your XML data and a DOM tree of an XSL
stylesheet and spits out the content.  That way you are not stuck with an MS,
IBM, Oracle, or whatever implementation that you are not happy with.

Tyler

