Streaming XML and SAX
Didier PH Martin
martind at netfolder.com
Sun Feb 28 17:04:10 GMT 1999
Building a DOM every time is inefficient, but I have to agree with Tom
that having XML act as the protocol as well is quite elegant. Why
presume that the XML processor capable of handling the protocol layer
would have to build a _generic_ object model? And why presume that an
XML processor has to build a _single_ object from all the information?
> <purchase xmlns="http://www.ecommerce.net/ns/ec/">
It seems like parsers could be made a whole lot more configurable than
they currently are. If more configurable, the top level XML processor
could build the domain-specific objects itself. Continuing with your
<purchase> example, I can envision a processing model like this:
Parser sees: <purchase>
Checks: Is a 'purchase' parser registered?
Yes: Pass control to it, 'purchase' parser reads until
</purchase>, then returns control to top level parser.
or Yes: Slurp text until </purchase>, pass "<purchase>...</purchase>"
(unparsed) to a 'purchase' parser running under another thread
or Yes: Slurp text until </purchase> and store it (unparsed) in the
DOM to be handled on a later pass.
No: keep parsing text and adding nodes to the DOM.
or No: Throw away text (unparsed) up until </purchase>
It would then be up to the subparser to build its own objects which
could be used later. Or the subparser could return an already
processed node to be inserted into the generic object model (or DOM).
Is this model possible with any existing parsers?
This architecture brings more work than required. Another way to do it
would be (in fact we are already doing that with our DSSSL, XSL
a) parse the document or the stream
b) an interpreter router checks for certain GIs or PIs. On matching one, it
loads the appropriate interpreter
c) the interpreter interprets parsed GIs until the end of the document (in
your example: </purchase>)
d) When the end of the document is reached, the router goes back to listen
mode for this multiplexed channel (a channel is a multiplexed stream within
a session) and the interpreter is unloaded
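Steps a) to d) can be sketched roughly as follows. This is only an illustration built on Python's standard xml.sax module, not our actual implementation: the names Router and PurchaseInterpreter, the registry dictionary and the sample document are all assumptions made for the example, and the multiplexed channel is reduced to a single in-memory document.

```python
import xml.sax
from xml.sax.handler import ContentHandler


class PurchaseInterpreter:
    """Hypothetical interpreter: records the text content of each child GI."""

    def __init__(self):
        self.fields = {}
        self._current = None

    def start(self, name, attrs):
        self._current = name
        self.fields[name] = ""

    def text(self, data):
        if self._current is not None:
            self.fields[self._current] += data

    def end(self, name):
        self._current = None


class Router(ContentHandler):
    """Listens for registered GIs and hands parsed events to an interpreter."""

    def __init__(self, registry):
        super().__init__()
        self.registry = registry  # GI name -> interpreter factory (assumed shape)
        self.active = None        # interpreter currently loaded, if any
        self.depth = 0            # open elements since the interpreter was loaded
        self.results = []         # interpreters that finished, in document order

    def startElement(self, name, attrs):
        if self.active is not None:
            self.depth += 1
            self.active.start(name, attrs)      # step c: interpret parsed GIs
        elif name in self.registry:
            self.active = self.registry[name]()  # step b: match, load interpreter
            self.depth = 1

    def characters(self, content):
        if self.active is not None:
            self.active.text(content)

    def endElement(self, name):
        if self.active is None:
            return
        self.depth -= 1
        if self.depth == 0:
            # step d: end reached, unload and go back to listen mode
            self.results.append(self.active)
            self.active = None
        else:
            self.active.end(name)


doc = (b"<orders><purchase><vendor-id>V42</vendor-id>"
       b"<customer>Alice</customer></purchase><note>ignored</note></orders>")
router = Router({"purchase": PurchaseInterpreter})
xml.sax.parseString(doc, router)  # step a: parse the document or the stream
print(router.results[0].fields)
```

The parser itself stays generic; only the registry decides which GIs wake up an interpreter, and anything outside a registered GI is simply ignored by the router.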
For document based parsing, as usual, we use file protocols. For streaming
parsing, we are using HTTP-NG or MEMUX techniques. MEMUX is a work in
progress but, basically, it is multiplexing on a single session. Because
this protocol level takes care of the multiplexing, the parser does not have
to care about mixing streams: its universe is only a single stream with
documents organized in strict sequence. In a multiplexed stream all
documents are in a row and follow a strict sequence. However, globally, on a
single session, several documents are sent simultaneously. Thus, this
architecture has several layers:
For file based or blob based documents, replace the first two layers by the
file protocol (file, http, ftp, etc.)
An SGML/XML document without an interpreter is like a sleeping beauty :-). To
transform an XML document into something useful, you not only have to parse
it but also to interpret what to do with each GI.
Actually, because MEMUX is still a moving target, we implemented our own
version of it until we get a consensus around a new spec which should be the
conclusion of the newly created IETF MEMUX workgroup.
Building the object model is probably the more expensive part, but in
many cases multiple selective parsing passes (skimming) would be more
efficient than parsing everything completely the first time through.
It seems that all current parsers assume that their duty is always to
create a faithful model of the entire document they are
presented with, and thus parse the entire document in a single pass
with a single thread of control. Why this assumption?
Not all parsers make this assumption :-). In our case, our parser either does
event based processing or builds a grove or a DOM. In fact, for a DOM-like
interface, we prefer a new model we use internally, which is based on
generalized property sets. This kind of interface can deal with either
directory service objects or document objects. We merged both worlds because,
when you look at them through the perspective of property sets, both are
very similar. Then, with a property set based model, an interpreter supports an
interface based on the composite pattern (ref: "Patterns" - Gamma & al.). It
can _do_ something with directory service objects, relational
database rows or document elements. This abstraction sets apart the
interpretation and the parsing operations.
What is a property set based API?
A hierarchy of objects in which each object has a property set attached to it.
An object can contain other objects (i.e. the composite pattern). Thus, if each
object is a collection of objects, and each member of the collection is
indexed through an associative array (i.e. a map, B+ tree, etc.), then,
because an object can contain another object, you obtain a tree.
A) the object has members to manipulate its collection of objects, like:
B) a property set is also a collection and the property set interface has
the same members:
C) an enumerator can be implemented as:
Thus, if this is implemented with object-oriented languages or object middleware
(Java, DCOM, CORBA, ILU, etc.), an interpreter just has to get/find the
object, enumerate its content and, for each object, get its properties. In the
case of a document object, one of the properties is the GI content. For
instance, to get a property from the <vendor-id> GI we call
property->Get("Content", Content) or, with an interpreted language, Content =
Get("Content"). Note that with a composite pattern interface we don't need
to know all property names in advance, nor do we have to know every object's
type. Therefore, the interface is more lightweight and we don't have to
create a new interface with each new object. The interface is general enough
to process a lot of whole-part structures as long as each member can be
associated with a name.
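To make the idea concrete, here is a minimal sketch of such an interface in Python. It is not our actual API: the class names PropertySet and Node, the members get, put, add and find, and the sample data are all hypothetical, chosen only to show a composite whose members carry named properties. For simplicity the children are kept in an ordered list and looked up linearly rather than through a map or a B+ tree.

```python
class PropertySet:
    """A collection of named values (hypothetical interface)."""

    def __init__(self, **props):
        self._props = dict(props)

    def get(self, name, default=None):
        return self._props.get(name, default)

    def put(self, name, value):
        self._props[name] = value

    def __iter__(self):
        # Enumerator: yields (name, value) pairs.
        return iter(self._props.items())


class Node:
    """Composite: an object with a property set that may contain other objects."""

    def __init__(self, name, **props):
        self.name = name
        self.properties = PropertySet(**props)
        self._children = []

    def add(self, child):
        self._children.append(child)
        return child

    def find(self, name):
        # Linear lookup; returns the first child with that name, or None.
        return next((c for c in self._children if c.name == name), None)

    def __iter__(self):
        # Enumerator over the contained objects.
        return iter(self._children)


def walk(node, depth=0):
    """Generic interpreter: visits every object without knowing its type."""
    yield depth, node.name, dict(node.properties)
    for child in node:
        yield from walk(child, depth + 1)


# A document element and a directory-style entry go through the same interface.
purchase = Node("purchase")
purchase.add(Node("vendor-id", Content="V42"))
purchase.add(Node("cn=Alice", mail="alice@example.org"))

content = purchase.find("vendor-id").properties.get("Content")
for depth, name, props in walk(purchase):
    print("  " * depth + name, props)
```

The walk function never asks what kind of object it is visiting; it only enumerates members and reads their property sets, which is why the same interpreter code can be pointed at document elements or directory entries.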
If we don't need a property set based interface because of memory footprint
or other constraints, the interpreter is event based and implements an object
event handler which receives a property set enumerator as a parameter, like:
The interpreter just enumerates the property set and does something with it.
Because the interpreter knows only certain keywords, it will process only
these keywords. But all interpreters, being event based in this case, use
the same interface with parsers or something else like a directory
To replicate DSSSL or XSL mechanisms, each event handler can carry the GI
name. For instance, On_Vendor-id corresponds to <vendor-id>, On_Customer
to <customer>, etc. This way, each event handler can process property sets
differently. This last mechanism replicates, in a
certain way, the pattern match mechanism found in style languages or
transformation languages. Thus, to process your document, we would have:
// enumerate all properties with enumerator
// and _do_ something
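The same per-GI handler idea can be sketched in Python as follows. This is an illustration under assumptions, not our code: the class PurchaseHandlers, the dispatch function and the rule that maps a GI such as vendor-id to a method named on_vendor_id are all invented for the example (Python identifiers cannot contain hyphens, hence the underscore). Each handler receives an enumerator of (name, value) pairs and only reacts to the property names it knows.

```python
class PurchaseHandlers:
    """Hypothetical interpreter with one event handler per GI."""

    def __init__(self):
        self.seen = []

    def on_vendor_id(self, properties):
        # Enumerate all properties and _do_ something with the known ones.
        for name, value in properties:
            if name == "Content":
                self.seen.append(("vendor", value))

    def on_customer(self, properties):
        for name, value in properties:
            if name == "Content":
                self.seen.append(("customer", value))


def dispatch(handlers, gi, properties):
    """Route an element event to the handler named after its GI, if any."""
    method_name = "on_" + gi.lower().replace("-", "_")
    method = getattr(handlers, method_name, None)
    if method is None:
        return False  # no handler registered for this GI: ignore it
    method(iter(properties))
    return True


handlers = PurchaseHandlers()
events = [
    ("vendor-id", {"Content": "V42"}),
    ("customer", {"Content": "Alice", "Lang": "en"}),
    ("comment", {"Content": "not handled"}),
]
handled = [dispatch(handlers, gi, props.items()) for gi, props in events]
print(handlers.seen, handled)
```

Looking the handler up by GI name plays the role of the pattern match: adding support for a new GI means adding one method, and unknown GIs fall through without any change to the dispatcher.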
So, not all architectures are primitive :-).
Didier PH Martin
mailto:martind at netfolder.com