RFC: Simple XML Event-Based API for Java

Thu Dec 18 08:42:48 GMT 1997

Don Park wrote:

> >I didn't suggest XmlApplication should store XmlProcessor in a
> >member variable.  I suggested that implementations of XmlApplication
> >could (if they needed to make callbacks to XmlProcessor) store
> >XmlProcessor in a member variable.
>
> OOPS.  Point taken.
>
> >I don't think it's appropriate to carry over patterns from GUI events
> >and apply them to XML events just because we happen to use the word
> >"event" to describe them both.  I believe performance is important for
> >XML processing, and an interface shouldn't impose an unnecessary
> >performance cost.
>
> >
> >The real merit of this interface is that it's simple; unless there's a
> >really compelling need for a feature, I think it should be left out.
>
> While David suggested that add/removeApplication methods allow
> implementation of XmlProcessors which support multiple XmlApplications, it
> is completely up to the implementation whether to support multiple
> XmlApplications or only one at a time.  As the JavaBeans spec suggests,
> TooManyListenersException should be thrown if an XmlProcessor supports only
> one XmlApplication for the sake of performance and simplicity.
>
> >> I do not think so.  Just as every Mac developer loved having RefCon to
> >> hang things onto, I like userData.
> >
> >Could you explain a typical case where you need this?
> >
> >Are there any standard Java classes that do this?
>
> userData is a cheap way to associate extra info with the XmlProcessor.  For
> example, I can store the source URL in the userData.  There are other ways
> to have XmlProcessors provide the URL info (e.g. the JavaBeans Activation
> Framework has URLDataSource for this) but they are fairly expensive and would
> unnecessarily taint the API with URL-related stuff.  It should be possible
> to use XmlProcessor with a File, and building a URL out of a File is not
> reliable on all platforms.
>
> Don
>
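
To make the two points above concrete, here is a minimal sketch of a
processor that accepts only one application and carries a userData slot.
Everything except java.util.TooManyListenersException is an assumption for
illustration: the class name UnicastXmlProcessor, the startElement callback
on XmlApplication, and the setUserData/getUserData and fireStartElement
methods are hypothetical and are not taken from the proposed API.

```java
import java.util.TooManyListenersException;

// Hypothetical callback interface; the real XmlApplication methods are
// not spelled out in this thread.
interface XmlApplication {
    void startElement(String name);
}

// Hypothetical processor that supports exactly one XmlApplication.
class UnicastXmlProcessor {
    private XmlApplication application;
    private Object userData;   // opaque slot, e.g. for the source URL

    public synchronized void addApplication(XmlApplication app)
            throws TooManyListenersException {
        // JavaBeans-style unicast registration: refuse a second listener.
        if (application != null)
            throw new TooManyListenersException("only one XmlApplication is supported");
        application = app;
    }

    public synchronized void removeApplication(XmlApplication app) {
        if (application == app)
            application = null;
    }

    public void setUserData(Object data) { userData = data; }

    public Object getUserData() { return userData; }

    // Delivers one event with a single direct call, no listener list to walk.
    void fireStartElement(String name) {
        if (application != null)
            application.startElement(name);
    }
}
```

With a single listener there is no list to iterate on every event, which is
where the performance argument comes from.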

I am not sure whether this is relevant to this discussion, but I got a tip by
email from the JDC newsletter on how to build tree structures efficiently
without using too much RAM.  I figure that the most efficient way of storing
the parsed data would be of some help to XML parser writers.  Anyway, here is
the tip.

PERFORMANCE -- using Object to represent disparate types.  This tip is a
little tricky, but it recently came up in an actual application, and
illustrates how Java language features are used to efficiently represent a
large data structure.

The application is one where a very large tree structure, consuming
millions of bytes, is built up.  Some of the nodes in the tree reference
child nodes (non-terminals), while others are leaf nodes (terminals) and
have no children, but contain String information.  The application involves
parsing a large Java program and representing it internally via a tree.

One simple approach to this problem is to define a Node class such as the
following:

        public class Node {
                private int type;
                private Node child[];
                private String info;
        }

If the node is a leaf node, then info is used.  Otherwise, child refers to
the children of the node, and child.length to the number of children.

This approach works pretty well, but uses a lot of memory.  Only one of
child and info is used at any one time, meaning that the other field is
wasted.  Also, child is an array, with attendant overhead such as storing
the length of the array for subscript checking.  For certain large inputs,
the parser program runs out of memory.

The first refinement of this approach is to collapse child and info:

        public class Node {
                private int type;
                private Object info;
        }

In this scheme, info can refer to either a String, for a leaf node, or to a
child node array.  Object is the root of the Java class hierarchy, so that
for example, the following:

        class A {}

implicitly means:

        class A extends Object {}

An instance of a subclass of Object, such as String, can be assigned to an
Object reference.  An array of Nodes can likewise be assigned to an Object.
The instanceof operator can be used to determine the actual type of an
Object reference.
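
As a small illustration of the paragraph above, the sketch below assigns a
String, a single node, and a node array to an Object reference and uses
instanceof to tell them apart.  TreeNode and InfoKind are hypothetical names
used only for this example; TreeNode stands in for the Node class of the tip.

```java
// Hypothetical stand-in for the Node class discussed in the tip.
class TreeNode { }

// Hypothetical helper that reports what an Object reference points to.
final class InfoKind {
    static String describe(Object info) {
        if (info instanceof String)
            return "String";
        if (info instanceof TreeNode[])
            return "TreeNode[" + ((TreeNode[]) info).length + "]";
        if (info instanceof TreeNode)
            return "TreeNode";
        // instanceof is false for null, so null also ends up here
        return "unknown";
    }
}
```

Note that instanceof never matches a null reference, so code that relies on
it has to decide separately what a null field means.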

In the parser application, using Object to represent both data types is not
good enough because it still takes up too much memory.  So a further change
has been implemented.  Some research showed that the child array consisted
of a single Node element about 95 percent of the time.  So it's possible to
represent one-child cases directly using an Object reference to the child
node, rather than a reference to a one-element array of child nodes.

This representation is complicated, and it's useful to define a method for
encapsulating the abstraction as in the following example:

        public class Node {
                private int type;
                private Object info;    // String, Node, or Node[]

                // constructors, other methods here ...

                // gets the i-th child reference, or null if there is none
                public Node getChild(int i)
                {
                        if (info instanceof String)
                                return null;    // leaf node
                        else if (info instanceof Node)
                                // single child stored directly; only index 0 exists
                                return i == 0 ? (Node)info : null;
                        else
                                return ((Node[])info)[i];
                }
        }

getChild returns the i-th child, or null for leaf nodes.  If there is
exactly one child, then info is of type Node, referencing that child.  If
there is more than one child, info is of type Node[], and a cast to Node[]
is done, followed by a retrieval and return of the child reference.
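
For completeness, here is one way the elided constructors and a child count
could look.  This is only a sketch under my own assumptions: the class name
CompactNode, the leaf and parent factory methods, childCount and text are
hypothetical and do not come from the newsletter, and parent assumes its
arguments are non-null.

```java
// Hypothetical fleshed-out variant of the Node class from the tip.
final class CompactNode {
    private final int type;
    private final Object info;   // String, CompactNode, or CompactNode[]

    private CompactNode(int type, Object info) {
        this.type = type;
        this.info = info;
    }

    // Leaf node: info holds the text directly.
    static CompactNode leaf(int type, String text) {
        if (text == null)
            throw new IllegalArgumentException("leaf text must not be null");
        return new CompactNode(type, text);
    }

    // Non-terminal: a single child is stored without a wrapping array.
    static CompactNode parent(int type, CompactNode... children) {
        Object info = (children.length == 1) ? children[0] : children.clone();
        return new CompactNode(type, info);
    }

    int type() { return type; }

    String text() { return (info instanceof String) ? (String) info : null; }

    int childCount() {
        if (info instanceof String) return 0;
        if (info instanceof CompactNode) return 1;
        return ((CompactNode[]) info).length;
    }

    // Returns the i-th child, or null if there is no such child.
    CompactNode getChild(int i) {
        if (info instanceof String) return null;
        if (info instanceof CompactNode) return (i == 0) ? (CompactNode) info : null;
        CompactNode[] kids = (CompactNode[]) info;
        return (i >= 0 && i < kids.length) ? kids[i] : null;
    }
}
```

Callers only ever see leaf, parent, childCount and getChild, so the choice
between the direct reference and the array stays an internal detail.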

In the parser application, this change is enough to tip the scales, so that
the application no longer runs out of memory.  The internal representation
in this example is tricky, but it can be hidden via methods such as
getChild.  In general, it's wise to avoid tricky coding, but useful to know
how to do it when the need arises.

The example also illustrates the utility of using one Object reference to
represent several different data types.  In C/C++ similar techniques would
use void* pointers or unions.

xml-dev: A list for W3C XML Developers. To post, mailto:xml-dev at ic.ac.uk
Archived as: http://www.lists.ic.ac.uk/hypermail/xml-dev/
To (un)subscribe, mailto:majordomo at ic.ac.uk the following message;
(un)subscribe xml-dev
To subscribe to the digests, mailto:majordomo at ic.ac.uk the following message;
subscribe xml-dev-digest
List coordinator, Henry Rzepa (mailto:rzepa at ic.ac.uk)