XML parsing memory overhead concerns

Clark C. Evans clark.evans at manhattanproject.com
Fri Dec 17 07:19:44 GMT 1999


James,  

Could you comment once more?  *confused look*

On Fri, 17 Dec 1999, James Clark wrote:
> "Clark C. Evans" wrote:
> > On Fri, 17 Dec 1999, James Clark wrote:
> > > Paul Miller wrote:
> > > > The only way I could have done that
> > > > and keep the sub-element parsing model that I want is to have expat
> > > > parse entire document into one big internal memory buffer.
> > >
> > > How so?  You can layer a next event style interface on top of expat by
> > > maintaining a queue of events.  A request for the next event returns the
> > > head of the queue; if the queue is empty, it fills the queue by reading
> > > another chunk of input and passing it to XML_Parse() with event handlers
> > > that append to the queue of events.
> > 
> > This would require a multi-threaded approach,
> > is this correct?
> 
> No. Unlike most parsers, expat has a "push" input model: instead of the
> parser making calls to get each block of input bytes, the application
> calls the parser passing it each block in sequence whenever it is
> convenient for the application.
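
The queue-of-events layering James describes above can be sketched roughly as follows. This is only an illustration, using Python's standard `xml.parsers.expat` binding rather than the C API; the names `EventReader`, `next_event`, `process` and `process_document` are hypothetical and belong to neither expat nor Paul's code.

```python
import io
from collections import deque
from xml.parsers import expat


class EventReader:
    """Hypothetical next-event interface layered on expat's push parser."""

    def __init__(self, stream, chunk_size=16):
        self._stream = stream
        self._chunk_size = chunk_size
        self._queue = deque()
        self._done = False
        self._parser = expat.ParserCreate()
        # The handlers only append to the queue and return immediately.
        self._parser.StartElementHandler = (
            lambda name, attrs: self._queue.append(("start", name)))
        self._parser.EndElementHandler = (
            lambda name: self._queue.append(("end", name)))

    def next_event(self):
        """Return the head of the queue, refilling it from the input if empty."""
        while not self._queue and not self._done:
            chunk = self._stream.read(self._chunk_size)
            if chunk:
                self._parser.Parse(chunk, False)
            else:
                self._parser.Parse(b"", True)  # tell expat the input has ended
                self._done = True
        return self._queue.popleft() if self._queue else None


def process(reader, name, log):
    """Single function with the child processing in the middle."""
    log.append("pre " + name)
    while True:
        kind, child = reader.next_event()
        if kind == "end":          # end tag of the current element
            break
        process(reader, child, log)
    log.append("post " + name)


def process_document(data, chunk_size=16):
    reader = EventReader(io.BytesIO(data), chunk_size)
    log = []
    _, root = reader.next_event()  # first event is the root start tag
    process(reader, root, log)
    return log
```

In this sketch the application drives the loop: expat runs only inside `next_event()`, and only for one chunk at a time, so at most one chunk's worth of events is buffered. There is no error handling, and text and attributes are ignored.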

Paul (and correct me if I'm wrong) is attempting to 
develop a "pull" model for his C++ program, similar 
to the following XSL...

  <xsl:template match="my-criteria" >
     <!-- pre-process -->
     <xsl:apply-templates />
     <!-- post-process -->
  </xsl:template>

Kinda like:

   process(String elementName) {
      ;  // pre-process
      process-children();
      ;  // post-process
   }

As opposed to the following "equivalent" SAX:

  startElement(String name, AttributeList atts)
  {
       ; // pre-process
  }

  endElement(String name)
  {
       ; // post-process
  }


Anyway, given a SAX event source, pushing
the entire document his way, I don't see
how a single threaded solution is possible.

And, from the expat declaration of XML_SetElementHandler, 
which takes both a StartElementHandler and an 
EndElementHandler, I assumed that expat works in 
a similar (if not identical) manner.  

Assume the XML source is "<parent><child/></parent>"
From the StartElementHandler, Paul's process()
would be called on parent.  The pre-processing 
would occur, and then process-children() would 
be the next item on the execution stack.  
However, since StartElementHandler has not
yet returned, expat cannot move on
to the child... thus the "push" model is
incompatible with Paul's "pull" mechanism.

Thus, two options are left: the XML source
is stored in random-access memory, or, 
the system is broken into two threads,
communicating through an element queue.
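
The second option, two threads communicating through an element queue, might look roughly like this. It is only a sketch using Python's standard `xml.parsers.expat`, `threading` and `queue` modules; `parse_into`, `process` and `run_pipeline` are hypothetical names, and error handling is omitted (a parse error in the producer would leave the consumer waiting).

```python
import queue
import threading
from xml.parsers import expat


def parse_into(events, data):
    """Producer thread: expat pushes events and the handlers enqueue them."""
    parser = expat.ParserCreate()
    parser.StartElementHandler = lambda name, attrs: events.put(("start", name))
    parser.EndElementHandler = lambda name: events.put(("end", name))
    parser.Parse(data, True)


def process(events, name, log):
    """Consumer side: one function with child processing in the middle."""
    log.append("pre " + name)
    while True:
        kind, child = events.get()  # blocks until the parser thread delivers
        if kind == "end":           # end tag of the current element
            break
        process(events, child, log)
    log.append("post " + name)


def run_pipeline(data):
    # Bounded queue: the producer blocks when the consumer falls behind,
    # so the whole event stream is never held in memory at once.
    events = queue.Queue(maxsize=4)
    producer = threading.Thread(target=parse_into, args=(events, data))
    producer.start()
    log = []
    _, root = events.get()          # first event is the root start tag
    process(events, root, log)
    producer.join()
    return log
```

Here the handlers return as soon as an event is queued, so expat can keep moving to the child while `process()` is still suspended inside its parent.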

Is there something I'm missing here?

If I'm not going crazy above, as a consequence 
of this "push" model (even if expat is not used
for the XML source), Paul's processor proposal 
requires a thread for each stage of processing 
in a pipe-line.

More generally, this has implications for
multi-stage XSL processing.  It would require
one of three things: (a) random access 
to the source document, (b) each stage 
in a separate thread, or (c) a pre-compiler which
rewrites expressions of the first form (single 
function with a call-back in the middle) to 
expressions of the second form (two functions,
one for the begin and one for the end).
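
To make option (c) concrete, here is a hand-written version of what such a rewrite might produce, sketched with Python's standard `xml.parsers.expat`. The function `run_handlers` and the per-element `state` dictionary are hypothetical; the point is only that whatever the single function would keep in local variables across the call in the middle has to move onto an explicit stack shared by the two handlers.

```python
from xml.parsers import expat


def run_handlers(data):
    log = []
    stack = []  # per-element state that the single function kept in locals

    def start(name, attrs):
        # First half of the original function: the pre-processing.
        if stack:
            stack[-1]["children"] += 1
        stack.append({"name": name, "children": 0})
        log.append("pre " + name)

    def end(name):
        # Second half: the post-processing, with its state restored.
        state = stack.pop()
        log.append("post %s (%d children)" % (state["name"], state["children"]))

    parser = expat.ParserCreate()
    parser.StartElementHandler = start
    parser.EndElementHandler = end
    parser.Parse(data, True)
    return log
```

This runs in a single thread with no buffering of the document, at the cost of splitting the logic in two and managing the stack by hand.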

Am I completely on a different planet?

Thanks James!

Clark

P.S.  Paul's processor proposal also uses a
nested matching system -- which I believe is 
a tangential issue.  This would be something 
like this in XSL:


  <xsl:template name="x" />
 
  <xsl:template match="my-criteria">
    <paul:register match="my-sub-criteria" to-template-named="x" />
    <!-- pre-child processing -->
    <xsl:apply-templates />
    <!-- post-child processing -->
  </xsl:template>
                   



xml-dev: A list for W3C XML Developers. To post, mailto:xml-dev at ic.ac.uk
Archived as: http://www.lists.ic.ac.uk/hypermail/xml-dev/ and on CD-ROM/ISBN 981-02-3594-1
To unsubscribe, mailto:majordomo at ic.ac.uk the following message;
unsubscribe xml-dev
To subscribe to the digests, mailto:majordomo at ic.ac.uk the following message;
subscribe xml-dev-digest
List coordinator, Henry Rzepa (mailto:rzepa at ic.ac.uk)




