XML tools and big documents

Thu Sep 3 17:14:12 BST 1998

Don Park wrote:

> >As for the memory issue, I have thought about some sort of LZW compression
> >of all of the text in a document tree.  This would save a lot of memory, but
> >may slow down building the DOM tree a bit.  Any ideas on this?
>
> Your performance will suffer and memory problem still remains.
>
> Don

Well, the memory problem will remain, but it could be reduced significantly for
large, redundant documents.  Some people have claimed they get 97% compression of
some XML documents when using popular compression utilities like WinZip.
Reducing the memory overhead of Names can be done at the parser level, and is
actually implemented in some fashion by every major parser I know of.  As for
character content, the idea centers on each text node allocating a new String
only if the application requests it.  The String is then built by looking up the
node's character fragments in some sort of symbol table and concatenating them,
and the result is cached.  If the text node is mutated in any way, the cached
String reference is set to null so that it gets rebuilt on the next request.
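
To make the idea concrete, here is a minimal sketch of such a text node.  It is
only an illustration under my own assumptions: the names SymbolTable, LazyText,
getData and setData are hypothetical, modern Java collections are used, and the
fragments are simply runs of whitespace and non-whitespace characters rather
than real LZW codes.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Small demo entry point (hypothetical). */
class LazyTextDemo {
    public static void main(String[] args) {
        SymbolTable table = new SymbolTable();
        LazyText text = new LazyText(table, "to be or not to be");
        System.out.println(text.getData()); // String is built here, on first request
        text.setData("to be");              // mutation drops the cached String
        System.out.println(text.getData());
    }
}

/** Hypothetical table shared by all text nodes: fragment <-> small integer id. */
final class SymbolTable {
    private final Map<String, Integer> ids = new HashMap<>();
    private final List<String> fragments = new ArrayList<>();

    /** Returns the id of a fragment, adding it the first time it is seen. */
    int intern(String fragment) {
        Integer id = ids.get(fragment);
        if (id == null) {
            id = fragments.size();
            fragments.add(fragment);
            ids.put(fragment, id);
        }
        return id;
    }

    String lookup(int id) { return fragments.get(id); }

    int size() { return fragments.size(); }
}

/** Text node that stores symbol ids and only builds a String on request. */
final class LazyText {
    private final SymbolTable table;
    private int[] symbols;
    private String cached; // null until requested, and again after any mutation

    LazyText(SymbolTable table, String text) {
        this.table = table;
        this.symbols = encode(text);
    }

    /** Splits text into alternating whitespace / non-whitespace runs. */
    private int[] encode(String text) {
        List<Integer> out = new ArrayList<>();
        int start = 0;
        for (int i = 1; i <= text.length(); i++) {
            boolean endOfRun = i == text.length()
                    || Character.isWhitespace(text.charAt(i))
                       != Character.isWhitespace(text.charAt(start));
            if (endOfRun) {
                out.add(table.intern(text.substring(start, i)));
                start = i;
            }
        }
        int[] result = new int[out.size()];
        for (int i = 0; i < result.length; i++) result[i] = out.get(i);
        return result;
    }

    /** Rebuilds the String from the symbol table on first use, then caches it. */
    String getData() {
        if (cached == null) {
            StringBuilder sb = new StringBuilder();
            for (int id : symbols) sb.append(table.lookup(id));
            cached = sb.toString();
        }
        return cached;
    }

    /** Any mutation re-encodes the text and invalidates the cached String. */
    void setData(String text) {
        symbols = encode(text);
        cached = null;
    }
}
```

Whether this actually saves memory depends on how redundant the text is; a
document with few repeated fragments would pay for the table and gain nothing.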

On second thought, this may not degrade performance too much, as you get the
added benefit of only needing to allocate an integer array (the sequence of
symbols used to rebuild the string from the symbol table) instead of a String,
which allocates two objects: the String object itself and the character array
contained within it.  Of course this optimization is Java-specific, and in
languages like C++ or Eiffel, where heap-based objects are not as expensive to
deal with, it may be counter-productive.  Who knows, it might be
counter-productive in Java too.  I guess there is only one way to find out,
unless someone has already tried this and has some insight they can lend.

Most parsers and parser interfaces like SAX present the character data as
character arrays and not as Strings.  So building the DOM tree without creating
any new String objects up front is very much doable.
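
As a rough sketch of what that could look like, the handler below copies the
characters it is handed straight into one growing char array and never creates
a String while parsing.  The class and method names (CharBufferHandler, parse,
text) are my own, and it assumes the SAX2 DefaultHandler and the JAXP
SAXParserFactory rather than any particular parser's native interface.

```java
import java.io.StringReader;
import java.util.Arrays;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

/** Small demo entry point (hypothetical). */
class CharBufferDemo {
    public static void main(String[] args) {
        CharBufferHandler h = CharBufferHandler.parse("<a>hello <b>big</b> world</a>");
        System.out.println(h.length() + " chars stored without creating a String");
        System.out.println(h.text()); // a String is only created here, on request
    }
}

/** Collects all character data into a single char array. */
final class CharBufferHandler extends DefaultHandler {
    private char[] buffer = new char[1024];
    private int used = 0;

    /** SAX hands over a slice of its own buffer; copy it, do not wrap it. */
    @Override
    public void characters(char[] ch, int start, int length) {
        if (used + length > buffer.length) {
            // Grow geometrically, but always enough for the incoming slice.
            buffer = Arrays.copyOf(buffer, Math.max(buffer.length * 2, used + length));
        }
        System.arraycopy(ch, start, buffer, used, length);
        used += length;
    }

    /** Number of characters collected so far. */
    int length() { return used; }

    /** Materializes the collected characters as a String, only when asked. */
    String text() { return new String(buffer, 0, used); }

    /** Convenience wrapper: parses an XML string with the default JAXP parser. */
    static CharBufferHandler parse(String xml) {
        CharBufferHandler handler = new CharBufferHandler();
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException("could not parse document", e);
        }
        return handler;
    }
}
```

A real tree builder would also record an offset and length per text node so
that each node can find its own slice of the buffer; that bookkeeping is left
out here.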

I guess the real question is: should the DOM even be used for multi-megabyte
documents in the first place?  Initially I thought of XML as something that would
be used for two main purposes: EDI-like web transactions and as a replacement for
HTML.  It seems that people are now using it for so many other things, many of
which may not suit XML's abilities.  I guess the responsibility of XML tools
developers is to provide the most abstract functionality possible so people can
do many more things with XML than it was intended for.  Nevertheless, I think it
is also a responsibility not to sell XML as the do-all solution to every
computing problem known to man.

Tyler

xml-dev: A list for W3C XML Developers. To post, mailto:xml-dev at ic.ac.uk
Archived as: http://www.lists.ic.ac.uk/hypermail/xml-dev/