XML tools and big documents
Tyler Baker
tyler at infinet.com
Thu Sep 3 17:24:56 BST 1998
David Megginson wrote:
> Don Park writes:
>
> > > As for the memory issue, I have thought about some sort of LZW
> > > compression of all of the text in a document tree. This would
> > > save a lot of memory, but may slow down building the DOM tree a
> > > bit. Any ideas on this?
> >
> >
> > Your performance will suffer and memory problem still remains.
>
> Agreed. The overhead comes from the node objects, not from the text.
> The biggest hogs can be attributes, especially in the standard SGML
> DTDs which often include dozens of defaulted attributes for each
> document type. If you can optimise those (allocating nodes only on
> demand and then freeing them as soon as they're not needed), you're
> half-way there.
>
> The second biggest hogs are leaf elements which contain only text. If
> you can treat those as special cases and allocate only one object for
> each one instead of three (element node, node list, text node), then
> you're another quarter of the way there.
Very true. However, in Java at least, you can avoid allocating a separate object
for the node list by having your Node implementation also implement the NodeList
interface, allocating a buffer for the children only when one is needed.
You can do the same thing with the Element node with regard to attributes. This
saves a lot of memory and a lot of heap-based object allocation that you would
otherwise have to do. Nevertheless, in Java, allocating raw Objects is a memory
hog to begin with.
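
A minimal sketch of the node-as-its-own-list idea, in case it helps. The
MiniNode and MiniNodeList interfaces and the CompactNode class are hypothetical
stand-ins invented for this illustration; they are not the real DOM interfaces,
which have many more methods. The initial capacity of 4 is an arbitrary
assumption as well.

```java
import java.util.Arrays;

/** Hypothetical stand-in for the DOM NodeList interface. */
interface MiniNodeList {
    MiniNode item(int index);
    int getLength();
}

/** Hypothetical stand-in for the DOM Node interface. */
interface MiniNode {
    String getNodeName();
    MiniNodeList getChildNodes();
}

/** A node that acts as its own child list, so no separate list object exists. */
class CompactNode implements MiniNode, MiniNodeList {
    private static final int INITIAL_CAPACITY = 4; // arbitrary assumption

    private final String name;
    private MiniNode[] children; // stays null until the first child is appended
    private int count;

    CompactNode(String name) {
        this.name = name;
    }

    public String getNodeName() {
        return name;
    }

    /** Returns this node itself instead of allocating a list wrapper. */
    public MiniNodeList getChildNodes() {
        return this;
    }

    /** Returns the child at index, or null when the index is out of range. */
    public MiniNode item(int index) {
        return (index >= 0 && index < count) ? children[index] : null;
    }

    public int getLength() {
        return count;
    }

    /** Allocates the child buffer on first use and doubles it when full. */
    void appendChild(MiniNode child) {
        if (children == null) {
            children = new MiniNode[INITIAL_CAPACITY];
        } else if (count == children.length) {
            children = Arrays.copyOf(children, count * 2);
        }
        children[count++] = child;
    }

    /** Exposed only so the lazy allocation can be observed. */
    boolean hasChildBuffer() {
        return children != null;
    }
}
```

A leaf built this way costs one object and no array; the same lazy-buffer
trick would apply to an element's attributes.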
> PIs, doctype declarations, notations, etc. are rare enough that you
> don't gain much by optimising them. Your mileage on comments, entity
> references and CDATA sections may vary, but you're probably best
> skipping them or replacing them with their contents when you build the
> tree, unless your application has very specialised requirements.
This is very true. For large documents, whether heavily document-oriented or
transaction-oriented, I still think that compressing all of the text in the
document tree may have some promise. I guess before spending any more time
talking about it, I should spend the necessary hours to just do it.
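
A rough sketch of what a compressed text holder might look like, under some
loud assumptions: the CompressedText class is hypothetical, and it uses the
DEFLATE implementation in java.util.zip rather than LZW, simply because that
is what the standard library ships. It compresses each piece of text
separately, which is unlikely to pay off for short strings; treat it as an
experiment, not a measured result.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

/** Hypothetical text holder that keeps its character data compressed in memory. */
class CompressedText {
    private final byte[] packed;   // DEFLATE-compressed UTF-8 bytes
    private final int rawLength;   // size of the uncompressed UTF-8 bytes

    CompressedText(String text) {
        byte[] raw = text.getBytes(StandardCharsets.UTF_8);
        rawLength = raw.length;

        // BEST_SPEED trades compression ratio for faster tree building.
        Deflater deflater = new Deflater(Deflater.BEST_SPEED);
        deflater.setInput(raw);
        deflater.finish();

        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[256];
        while (!deflater.finished()) {
            out.write(buf, 0, deflater.deflate(buf));
        }
        deflater.end();
        packed = out.toByteArray();
    }

    /** Decompresses on every call; nothing is cached. */
    String getData() {
        byte[] raw = new byte[rawLength];
        Inflater inflater = new Inflater();
        inflater.setInput(packed);
        try {
            int off = 0;
            while (off < rawLength && !inflater.finished()) {
                int n = inflater.inflate(raw, off, rawLength - off);
                if (n == 0 && (inflater.needsInput() || inflater.needsDictionary())) {
                    break; // no further progress is possible
                }
                off += n;
            }
        } catch (DataFormatException e) {
            throw new IllegalStateException("corrupt compressed text", e);
        } finally {
            inflater.end();
        }
        return new String(raw, StandardCharsets.UTF_8);
    }

    /** Number of bytes actually held for the text. */
    int packedSize() {
        return packed.length;
    }
}
```

Comparing packedSize() with the original length for real documents would show
whether the saving is worth the decompression cost on each getData() call.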
Tyler