XML tools and big documents

Thu Sep 3 17:24:56 BST 1998

David Megginson wrote:

> Don Park writes:
>
>  > > As for the memory issue, I have thought about some sort of LZW
>  > > compression of all of the text in a document tree.  This would
>  > > save a lot of memory, but may slow down building the DOM tree a
>  > > bit.  Any ideas on this?
>  >
>  >
>  > Your performance will suffer and memory problem still remains.
>
> Agreed.  The overhead comes from the node objects, not from the text.
> The biggest hogs can be attributes, especially in the standard SGML
> DTDs which often include dozens of defaulted attributes for each
> document type.  If you can optimise those (allocating nodes only on
> demand and then freeing them as soon as they're not needed), you're
> half-way there.
>
> The second biggest hogs are leaf elements which contain only text.  If
> you can treat those as special cases and allocate only one object for
> each one instead of three (element node, node list, text node), then
> you're another quarter of the way there.

Very true.  However, in Java at least, you can avoid allocating a separate object
for the node list by having your Node implementation also implement the NodeList
interface, and by allocating a buffer for the children only when one is needed.
You can do the same thing for attributes in your Element implementation.  This
saves much of the memory and heap allocation you would otherwise incur.
Nevertheless, every object allocated in Java carries its own memory overhead to
begin with.
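
A minimal sketch of that trick follows.  It assumes a stripped-down node
type rather than the real org.w3c.dom interfaces; CompactNode and ChildList are
hypothetical names invented for this illustration, and the HashMap for
attributes is just one possible lazy store.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

// Hypothetical two-method list view, mirroring the shape of a DOM NodeList.
interface ChildList {
    CompactNode item(int index);
    int getLength();
}

// Hypothetical node that acts as its own child list, so asking for the
// children never allocates a separate list object.
final class CompactNode implements ChildList {
    private final String name;
    private CompactNode[] children;          // null until the first child is added
    private int childCount;
    private Map<String, String> attributes;  // null until the first attribute is set

    CompactNode(String name) {
        this.name = name;
    }

    String getName() {
        return name;
    }

    // The node itself is returned as the list of its children.
    ChildList getChildNodes() {
        return this;
    }

    void appendChild(CompactNode child) {
        if (children == null) {
            children = new CompactNode[2];   // buffer allocated on demand
        } else if (childCount == children.length) {
            children = Arrays.copyOf(children, childCount * 2);
        }
        children[childCount++] = child;
    }

    @Override
    public CompactNode item(int index) {
        return (index < 0 || index >= childCount) ? null : children[index];
    }

    @Override
    public int getLength() {
        return childCount;
    }

    void setAttribute(String key, String value) {
        if (attributes == null) {
            attributes = new HashMap<>(4);   // attribute store allocated on demand
        }
        attributes.put(key, value);
    }

    String getAttribute(String key) {
        return attributes == null ? null : attributes.get(key);
    }
}
```

A leaf node built this way costs one object plus its name until a child or
attribute is actually added; getChildNodes() on it simply returns the node,
whose getLength() is 0.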

> PIs, doctype declarations, notations, etc. are rare enough that you
> don't gain much by optimising them.  Your mileage on comments, entity
> references and CDATA sections may vary, but you're probably best
> skipping them or replacing them with their contents when you build the
> tree, unless your application has very specialised requirements.

This is very true.  For large documents, whether heavily document-oriented or
transaction-oriented, I still think that compressing all of the text in the
document tree may have some promise.  Before spending any more time talking
about it, I should just put in the necessary hours and do it.
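
To make the trade-off concrete, here is a rough sketch of a compressed text
holder.  It is only an illustration under stated assumptions: CompressedText is
a hypothetical class, and it uses the DEFLATE support in java.util.zip rather
than LZW, since the standard library has no LZW codec.  Per-string compression
like this only pays off for long runs of text; short strings can come out
larger than they went in.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

// Hypothetical holder that keeps character data compressed in memory and
// inflates it again each time the text is requested.
final class CompressedText {
    private final byte[] data;  // DEFLATE-compressed UTF-8 bytes

    CompressedText(String text) {
        byte[] raw = text.getBytes(StandardCharsets.UTF_8);
        Deflater deflater = new Deflater();
        deflater.setInput(raw);
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[1024];
        while (!deflater.finished()) {
            int n = deflater.deflate(buf);
            out.write(buf, 0, n);
        }
        deflater.end();
        data = out.toByteArray();
    }

    // Number of bytes actually held for this text.
    int storedSize() {
        return data.length;
    }

    // Every read pays the cost of decompression; that is the trade-off.
    String getText() {
        Inflater inflater = new Inflater();
        inflater.setInput(data);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[1024];
        try {
            while (!inflater.finished()) {
                int n = inflater.inflate(buf);
                if (n == 0 && !inflater.finished()) {
                    throw new IllegalStateException("truncated compressed data");
                }
                out.write(buf, 0, n);
            }
        } catch (DataFormatException e) {
            throw new IllegalStateException("corrupt compressed data", e);
        } finally {
            inflater.end();
        }
        return new String(out.toByteArray(), StandardCharsets.UTF_8);
    }
}
```

Comparing storedSize() against the UTF-8 length of the original string for
representative documents would show whether the saving is worth the extra work
on every getText() call.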

Tyler

xml-dev: A list for W3C XML Developers. To post, mailto:xml-dev at ic.ac.uk
Archived as: http://www.lists.ic.ac.uk/hypermail/xml-dev/