XML parsing memory overhead concerns

Paul Miller stele at fxtech.com
Sat Dec 18 14:56:05 GMT 1999


> I would suggest that what you need here is a layer
> above Expat that queues up the
> "events" generated from expat's callbacks for consumption
> by your application.

I've tried this. The problem comes when you get to character data
between element tags: I'd have to buffer all of that data inside the
event queue. For my needs this wouldn't be a problem, since my custom
element data is generally pretty small. But for a more general-purpose
solution, an implementation that does not have to copy all of the data
would be preferable. I can't make my design work over expat without
copying the data (so I can provide it to my user handlers as a single
unit). In other words, because of expat's (and most other parsers') push
model, I can't implement the pull model I need for element data without
also buffering this data somewhere. It's *possible*, but is it
desirable?
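
To make the buffering cost concrete, here is a minimal sketch of such a
layer. It does not call expat itself: BufferingLayer, OnStart, OnCharData,
OnEnd and BufferedBytes are hypothetical names, shaped like the
start-element, character-data and end-element callbacks a push parser
would drive. Since a push parser may deliver one run of text in several
chunks, every chunk has to be copied until the end tag arrives.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>
#include <utility>
#include <vector>

// Hypothetical layer between push-style callbacks and a handler that
// wants each element's text as a single unit.
class BufferingLayer {
public:
    using Handler = std::function<void(const std::string &name,
                                       const std::string &data)>;

    explicit BufferingLayer(Handler h) : handler_(std::move(h)) {}

    // Shaped like a start-element callback: open a new pending element.
    void OnStart(const char *name) {
        stack_.push_back({name, std::string()});
    }

    // Shaped like a character-data callback: text may arrive in several
    // chunks, so each chunk is copied onto the innermost open element.
    void OnCharData(const char *s, std::size_t len) {
        if (!stack_.empty()) stack_.back().data.append(s, len);
    }

    // Shaped like an end-element callback: only now is the text complete,
    // so only now can the user handler be given the whole unit.
    void OnEnd(const char * /*name*/) {
        assert(!stack_.empty());  // an unmatched end tag is a caller bug
        Pending done = std::move(stack_.back());
        stack_.pop_back();
        handler_(done.name, done.data);
    }

    // Bytes currently held in copies of element text.
    std::size_t BufferedBytes() const {
        std::size_t n = 0;
        for (const Pending &p : stack_) n += p.data.size();
        return n;
    }

private:
    struct Pending {
        std::string name;
        std::string data;
    };
    Handler handler_;
    std::vector<Pending> stack_;
};
```

The copy held in each pending element grows with the element's text, which
is exactly the overhead in question.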

I asked around a few days ago about general memory-usage requirements,
and got a lot of feedback asking for a low-memory-overhead solution. My
current implementation, with its several limitations, has almost zero
memory overhead, which is a huge advantage in some situations.

To recap, for those just joining us: I want to be able to handle a
single element all at once. For this XML fragment:

	<Point>100x100</Point>

I want to write code like this inside an element handler (this is C++):

void Point::Parse(XML::Element &elem)
{
	// this gets called when the parser sees <Point>
	// I'll ask for the element data, and the parser will
	// pull this to me until it sees </Point>
	char buf[40];
	elem.GetData(buf, sizeof(buf));
	sscanf(buf, "%dx%d", &x, &y);
}

The only way to implement this over expat is to queue up the element
tags as they are found, as you suggest. When the characterDataHandler is
called, it would have to buffer the character data with that element
(until the ending element tag is found), storing a complete copy of the
data. When I see the </Point> end tag, I can then call my Point parser
and allow it to "pull" the data from the buffered element data. To
implement this, I would be storing THREE copies of the element data in
memory: expat's buffered chunk of the document, my copy of the data in
the element queue, and the element handler's copy while it is
interpreting the data (buf[40] above). With my implementation, there is
only one "copy" of the data in memory, and that's the element handler's
private buffer (buf[40] above), since the implementation pulls a
character at a time.
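
For comparison, a rough sketch of the character-at-a-time pull idea. This
is not my actual parser: PullElement and ParsePoint are hypothetical
names, the input is assumed to be a std::istream positioned just after
the start tag, and entities, CDATA sections, comments and nested elements
are ignored entirely.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>
#include <istream>
#include <sstream>  // std::istringstream, for the usage shown below

// Hypothetical pull-style element: text is read on demand, one character
// at a time, straight from the input into the caller's buffer.
class PullElement {
public:
    // Assumes the stream is positioned just after the start tag.
    explicit PullElement(std::istream &in) : in_(in) {}

    // Copies character data into buf until the next '<', end of input,
    // or a full buffer; NUL-terminates it and returns the number of
    // characters stored. No intermediate copy of the text is kept.
    std::size_t GetData(char *buf, std::size_t size) {
        assert(buf != nullptr && size > 0);
        std::size_t n = 0;
        while (n + 1 < size) {
            int c = in_.peek();
            if (c == std::istream::traits_type::eof() || c == '<') break;
            buf[n++] = static_cast<char>(in_.get());
        }
        buf[n] = '\0';
        return n;
    }

private:
    std::istream &in_;
};

// Mirrors the Point handler above: the only copy of the text is buf.
bool ParsePoint(std::istream &in, int &x, int &y) {
    PullElement elem(in);
    char buf[40];
    elem.GetData(buf, sizeof(buf));
    return std::sscanf(buf, "%dx%d", &x, &y) == 2;
}
```

Feeding ParsePoint a std::istringstream holding "100x100</Point>" fills in
x and y and leaves the stream sitting on the '<' of the end tag.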

--
Paul Miller - stele at fxtech.com

xml-dev: A list for W3C XML Developers. To post, mailto:xml-dev at ic.ac.uk
Archived as: http://www.lists.ic.ac.uk/hypermail/xml-dev/ and on CD-ROM/ISBN 981-02-3594-1
To unsubscribe, mailto:majordomo at ic.ac.uk the following message;
unsubscribe xml-dev
To subscribe to the digests, mailto:majordomo at ic.ac.uk the following message;
subscribe xml-dev-digest
List coordinator, Henry Rzepa (mailto:rzepa at ic.ac.uk)




