Storing Lots of Fiddly Bits (was Re: What is XML for?)

Tyler Baker tyler at infinet.com
Thu Feb 4 06:00:15 GMT 1999


Clark Evans wrote:

> Jonathan Borden wrote:
> > What are you trying to say here?
> > Are you criticizing objects?
>
> You can't always treat a stream as an object.  If you do, you
> lose significant power.
>
> > Suppose I want to process the data using XSL? Is this conceivably an
> > acceptable reason to use a DOM interface (assuming I don't actually want to
> > convert my database to serialized XML itself).
>
> I would see this as the last thing you would want to do.
> However, I don't have XSL experience, so someone
> with real-world experience would be a better spokesperson.
>
> DOM requires the entire stream be read before the
> document object is returned and processing can begin.
> Not only does this chew significant memory for very large
> streams, but it causes significant delay before output
> could be generated.  In the worst case, it turns a
> perfectly simple problem into an "impossible" one
> where the memory requirements and time delay make
> the solution useless.

This is only true if you are building the DOM from a file.  What if you are building up the
DOM Document programmatically, or the DOM is merely an interface to structured data in a
DBMS?
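
For example, here is a rough sketch of building a Document entirely in memory.  I am using the
JAXP DocumentBuilderFactory to obtain an empty Document, and the element names are made up, so
treat the details as illustrative rather than as anybody's real code:

import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

public class BuildDom {
  public static Document buildOrderDocument() throws Exception {
    // Ask the implementation for an empty Document; no XML text
    // is ever read from or written to a file here.
    Document doc = DocumentBuilderFactory.newInstance()
        .newDocumentBuilder().newDocument();

    // Assemble the tree node by node, e.g. from database rows.
    Element order = doc.createElement("order");
    order.setAttribute("id", "42");
    Element item = doc.createElement("item");
    item.appendChild(doc.createTextNode("widget"));
    order.appendChild(item);
    doc.appendChild(order);
    return doc;
  }
}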

> If the stream is only going to be "filtered", why read
> the entire thing into memory before starting the
> transformation process (in this case filtering)?

In some cases this is possible and even desirable.  In the case of XSL, however, the spec
enforces constraints which make it impossible to properly process an XML document unless it
has been fully parsed into an in-memory tree structure (for most people this will be the DOM).
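
To make that concrete: XSL match patterns and select expressions can look backwards and upwards
in the document (preceding siblings, ancestors, and so on), which a one-pass event stream has
already thrown away.  A small sketch in Java, using the javax.xml.xpath API purely for
illustration:

import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

public class BackwardAxisDemo {
  // Selecting every item that has an earlier item sibling requires
  // access to nodes that occur *before* the current one, so the
  // whole tree has to be available in memory.
  public static NodeList itemsWithEarlierSiblings(Document doc) throws Exception {
    XPath xpath = XPathFactory.newInstance().newXPath();
    return (NodeList) xpath.evaluate(
        "//item[preceding-sibling::item]", doc, XPathConstants.NODESET);
  }
}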

> > Certainly XSL is best served by a DOM representation if
> > the data is presented via a DOM interface.
>
> I would speculate to the contrary, and would think that
> driving XSL with SAX would be a far better choice.

No way.  You are totally throwing out all of the applications that create a DOM document
programmatically (such as through scripting).  The alternative is to build a DOM document
programmatically, write it out as XML, reparse it with an XML parser, and then process the
document as SAX parsing events.  This is an extra layer of indirection that is otherwise
totally unnecessary if you use the DOM.  The only other option is to take the DOM (an already
parsed in-memory tree), walk it to generate SAX events using something like SAXON, and then
reparse things back into another entire custom source tree.  Again, another layer of
indirection, which in languages like Java that are particularly sensitive to unnecessary
object allocation adds a major cost to processing an XML document.
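
For the record, the kind of adapter I mean in that second case looks roughly like this:
recursively walking an existing DOM tree and replaying it as SAX events (namespaces, comments,
and processing instructions omitted; this is a sketch of the technique using SAX2's
ContentHandler interface, not SAXON's actual code):

import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;

public class DomToSax {
  // Replays an existing DOM subtree as a stream of SAX events.
  public static void walk(Node node, ContentHandler handler) throws SAXException {
    switch (node.getNodeType()) {
      case Node.ELEMENT_NODE: {
        // Copy the DOM attributes into a SAX Attributes object.
        AttributesImpl atts = new AttributesImpl();
        NamedNodeMap map = node.getAttributes();
        for (int i = 0; i < map.getLength(); i++) {
          Node a = map.item(i);
          atts.addAttribute("", a.getNodeName(), a.getNodeName(),
                            "CDATA", a.getNodeValue());
        }
        handler.startElement("", node.getNodeName(), node.getNodeName(), atts);
        for (Node c = node.getFirstChild(); c != null; c = c.getNextSibling()) {
          walk(c, handler);
        }
        handler.endElement("", node.getNodeName(), node.getNodeName());
        break;
      }
      case Node.TEXT_NODE: {
        char[] text = node.getNodeValue().toCharArray();
        handler.characters(text, 0, text.length);
        break;
      }
      default:
        break;  // everything else ignored in this sketch
    }
  }
}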

> > The other option is to serialize everything.
>
> No.  The option is to move to Event based processing
> of streams.  You can then model with "event objects"

You still need to waste time recreating a source tree when you already have the DOM.  Writing
code that recursively spits out SAX parse events directly (instead of building a tree first) is
not an easy chore, and in many cases it is totally impractical (you have to do everything
sequentially in document order to make it work).

> > This makes no sense unless the DOM implementation is sub-optimal.
>
> No.  It's a computational complexity issue.  For a
> decent size stream, with a transformation that can
> be done in a single-pass (XML->HTML), no DOM
> implementation will even come close to an implementation
> using SAX.  Crunch some numbers.

Already have.  My results are quite the opposite.  The most significant overhead in using the
DOM is, quite frankly, the dynamic method dispatch involved in node iteration.  With respect to
Java and future optimizing compilers, this will become less and less of an issue.  The cost of
object allocation, in one way or another, will always be there.
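
For reference, the inner loop being measured is nothing more exotic than interface calls like
these (a generic traversal, not my benchmark code):

import org.w3c.dom.Node;

public class CountNodes {
  // Every step goes through the org.w3c.dom.Node interface, so each
  // getFirstChild()/getNextSibling() call is a dynamic dispatch that
  // the VM has to resolve at run time; that is the overhead in question.
  public static int count(Node node) {
    int n = 1;
    for (Node c = node.getFirstChild(); c != null; c = c.getNextSibling()) {
      n += count(c);
    }
    return n;
  }
}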

> If you're still not convinced, read Abelson, Structure and
> Interpretation of Computer Programs, ISBN 0-07-000484-6,
> Section 3.5.1, page 317.  There he talks about:
>
>         "severe inefficiency with respect to both time and space".

In general I would agree with your assertions, but it is an engineering fact that in languages
like C++ and Java, object allocation tends to be the number one performance bottleneck in
applications that are not written with an eye toward reducing it.

One of the most famous examples in Java is in the AWT: calling getSize() over and over in
paint methods.  Doing this can bring some apps to a crawl, because the code of
java.awt.Component.getSize() looks something like this:

public Dimension getSize() {
  // A brand new Dimension object is allocated on every call.
  return new Dimension(width, height);
}

java.awt.Component in JDK 1.2 now has getX(), getY(), getWidth(), and getHeight() methods to
reduce unnecessary object allocation, and for some GUI apps this has made a major difference
in paint routines.
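
As a hypothetical illustration of the difference in a paint routine (not code from any
particular app):

import java.awt.Component;
import java.awt.Graphics;

public class SizeDemo extends Component {
  public void paint(Graphics g) {
    // Old style: getSize() allocates a new Dimension on every repaint.
    //   Dimension d = getSize();
    //   g.drawRect(0, 0, d.width - 1, d.height - 1);

    // JDK 1.2 style: plain int accessors, no object allocation at all.
    g.drawRect(0, 0, getWidth() - 1, getHeight() - 1);
  }
}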

Tyler


xml-dev: A list for W3C XML Developers. To post, mailto:xml-dev at ic.ac.uk
Archived as: http://www.lists.ic.ac.uk/hypermail/xml-dev/ and on CD-ROM/ISBN 981-02-3594-1
To (un)subscribe, mailto:majordomo at ic.ac.uk the following message;
(un)subscribe xml-dev
To subscribe to the digests, mailto:majordomo at ic.ac.uk the following message;
subscribe xml-dev-digest
List coordinator, Henry Rzepa (mailto:rzepa at ic.ac.uk)



