half-baked parsers vs binary XML

Sun Mar 28 04:54:31 BST 1999

Gabe Beged-Dov writes:

[on a validating parser]

 > There would be a little speed difference from not having to check
 > for defaulted attributes.

Not a measurable one -- the parser just needs to set a boolean flag
when the DTD declares no default values, and then it can skip the
check for every start tag.
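
As a rough illustration of that flag, here is a minimal sketch in
present-day Java.  It is not code from AElfred or any other parser;
the class and method names (AttributeDefaulter, declareDefault,
applyDefaults) are hypothetical.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch; not taken from AElfred or any real parser.
class AttributeDefaulter {
    // element name -> (attribute name -> default value), filled from the DTD
    private final Map<String, Map<String, String>> defaults = new HashMap<>();

    // Stays false until the DTD declares at least one default value.
    private boolean anyDefaults = false;

    // Called for each attribute declaration that carries a default value.
    void declareDefault(String element, String attribute, String value) {
        defaults.computeIfAbsent(element, k -> new LinkedHashMap<>())
                .put(attribute, value);
        anyDefaults = true;
    }

    // Called once per start tag with the attributes actually specified.
    Map<String, String> applyDefaults(String element,
                                      Map<String, String> specified) {
        if (!anyDefaults) {
            return specified;          // fast path: one boolean test
        }
        Map<String, String> declared = defaults.get(element);
        if (declared == null) {
            return specified;          // no defaults for this element
        }
        Map<String, String> result = new LinkedHashMap<>(declared);
        result.putAll(specified);      // specified values override defaults
        return result;
    }
}
```

With no defaults declared, every start tag costs a single boolean
test, which is why the difference should not be measurable.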

 > The half-baked parser might also be able to directly point to the
 > xml input without having to copy it, i.e. use start-length pointers
 > for the tags and attrs.  This would be more cumbersome if there was
 > less of a one to one correspondence between the raw xml and what
 > you got after expansion and defaulting.

I think that James Clark does something like that with Expat, which
does read the prolog properly, though it doesn't expand external
entities by default.  At least, Expat can always return the exact
string where an event originated.

Most efficient XML parsers play pretty clever tricks with their input
buffers, even with entity expansion.
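
The start-length idea from the quoted text can be sketched as
follows.  This is a hypothetical Java illustration, not how Expat
(which is written in C) is implemented; Span and TagScanner are
made-up names, and the scanner assumes it is positioned on the '<'
of a start tag.

```java
// Hypothetical sketch; Span and TagScanner are illustrative names only.
final class Span {
    final char[] buffer;  // the shared input buffer, never copied
    final int start;      // offset of the first character
    final int length;     // number of characters

    Span(char[] buffer, int start, int length) {
        this.buffer = buffer;
        this.start = start;
        this.length = length;
    }

    // Compare against a known name without allocating a String.
    boolean contentEquals(String s) {
        if (s.length() != length) {
            return false;
        }
        for (int i = 0; i < length; i++) {
            if (buffer[start + i] != s.charAt(i)) {
                return false;
            }
        }
        return true;
    }

    // Copy the characters out only when the application asks for them.
    @Override
    public String toString() {
        return new String(buffer, start, length);
    }
}

class TagScanner {
    // Returns the element name of the start tag whose '<' is at pos.
    static Span elementName(char[] buffer, int pos) {
        int start = pos + 1;
        int end = start;
        while (end < buffer.length && !endsName(buffer[end])) {
            end++;
        }
        return new Span(buffer, start, end - start);
    }

    private static boolean endsName(char c) {
        return c == '>' || c == '/' || c == ' '
            || c == '\t' || c == '\n' || c == '\r';
    }
}
```

A span like this stays valid only while the buffer is neither
refilled nor rewritten, which is where entity expansion and
attribute defaulting make the bookkeeping harder.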

 > > There will be a small size difference, but it will be less
 > > exciting than you think -- the code to detect the prologue and
 > > load the module will make up much of the difference.
 > 
 > Detecting the prologue and loading an alternate module takes a few
 > lines of Java code.  

Well, a little more than that, because you'll have to pass the current
state on to the new module.
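
A hedged sketch of what "passing the current state" might involve,
in present-day Java.  Every name here (ParserState, PrologModule,
LightParser) is hypothetical, the Supplier stands in for whatever
mechanism would load the module on demand, and the detection is
simplified to input that begins directly with "<!DOCTYPE".

```java
import java.util.function.Supplier;

// All names are hypothetical; this sketches the hand-off only.
class ParserState {
    final char[] buffer;  // the input read so far
    int pos;              // index of the next unread character
    int line = 1;         // current line, kept for error messages

    ParserState(char[] buffer) {
        this.buffer = buffer;
    }
}

// The optional module that knows how to read a doctype declaration.
interface PrologModule {
    // Must leave state.pos just past the declaration it consumed.
    void parseDoctype(ParserState state);
}

class LightParser {
    private static final String DOCTYPE = "<!DOCTYPE";

    // True if the unread input begins with a doctype declaration.
    static boolean atDoctype(ParserState s) {
        int n = Math.min(DOCTYPE.length(), s.buffer.length - s.pos);
        return new String(s.buffer, s.pos, n).equals(DOCTYPE);
    }

    // Loads the module only when it is needed and hands it the shared
    // state; returns whether the module was used.
    static boolean parseProlog(ParserState s,
                               Supplier<PrologModule> loader) {
        if (!atDoctype(s)) {
            return false;
        }
        loader.get().parseDoctype(s);
        return true;
    }
}
```

The detection itself is indeed only a few lines; the extra code is
the state object that both parts must agree on.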

 > Prologue processing, entity expansion and attribute defaulting take
 > up a little more than that in the parsers that I've looked at.

The version of AElfred that I wrote was around 27K (uncompressed)
including full parsing of element, attribute, and entity declarations,
and expansion of external entities (including the external DTD
subset); even then, AElfred would have been about 7K smaller if I
hadn't written my own hashing, interning, buffer-handling etc. for
speed's sake.

I still believe that a 10K non-validating XML parser class in Java is
not out of reach, *including* parsing the prolog, if people are
willing to use the standard Java classes.

 > > doing the well-formedness checks for legal characters can take up
 > > a lot of code, but you're supposed to do that anyway (I cheated
 > > with AElfred).
 > 
 > I'm not sure I understand. Could you elaborate on how you cheated :-?

At least when I was maintaining it, AElfred didn't perform all of the
required well-formedness checks for the ranges of Unicode characters
allowed and not allowed in names, attribute values, character data,
etc.  I tried adding them, but they bloated the code by about 7-8K
(much more than parsing the prolog and DTD).
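
For a sense of what these checks look like, here is a sketch of the
simplest one, the "Char" production from the XML 1.0
Recommendation, in present-day Java.  The class name XmlChars is
hypothetical and this is not AElfred code.  The name-character
checks need the much longer range tables in Appendix B of the
Recommendation, which are not reproduced here.

```java
// Hypothetical sketch of one well-formedness check; not AElfred code.
class XmlChars {
    // XML 1.0: Char ::= #x9 | #xA | #xD | [#x20-#xD7FF]
    //                 | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
    static boolean isXmlChar(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
            || (cp >= 0x20 && cp <= 0xD7FF)
            || (cp >= 0xE000 && cp <= 0xFFFD)
            || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    // Returns the index of the first illegal character, or -1 if the
    // whole string is legal.  Unpaired surrogates are reported as
    // illegal because they fall outside every allowed range.
    static int firstIllegalChar(String text) {
        int i = 0;
        while (i < text.length()) {
            int cp = text.codePointAt(i);
            if (!isXmlChar(cp)) {
                return i;
            }
            i += Character.charCount(cp);
        }
        return -1;
    }
}
```

Six ranges fit in a few lines; the cost comes from the name
productions, where each character class is a long list of ranges.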

All the best,

David

-- 
David Megginson                 david at megginson.com
           http://www.megginson.com/

xml-dev: A list for W3C XML Developers. To post, mailto:xml-dev at ic.ac.uk
Archived as: http://www.lists.ic.ac.uk/hypermail/xml-dev/ and on CD-ROM/ISBN 981-02-3594-1
List coordinator, Henry Rzepa (mailto:rzepa at ic.ac.uk)