Roll-Your-Own Parsers

Mon Feb 15 20:16:22 GMT 1999

>>We have taken that approach with our 'Version 2' parsers, Java and C++.
>>They are pretty well layered and pluggable. Don't plug in a validation
>>handler and you won't do any validation work. Don't plug in an entity
>>handler, and you won't get any entity information, etc... Basically we've
>>just extended the concept of a SAX-like handler all the way into the core
>>of the parser. It allows both for extensibility by rolling your own
>>handler, and for the client who is putting together a particular type of
>>parser configuration to tell the lowest level of the parser "do the least
>>work possible for this group of things, since I'm not even interested".
>
>This sounds (and looks) promising.  I'm not clear exactly _how_ modular it
>is, though.  Can I take info from the SAX parser, abuse (or nicely process)
>it, and feed it back into the DOM tree builder?  Or am I stuck to choosing
>validating/non-validating and DOM/SAX?  Is the 'SAX-like handler' really
>SAX with extras, or is it incompatible?
>

A 'parser' in our system is really just a small amount of code which wires
the events coming from our internal APIs to any kind of standard outgoing
API you want to support. For a 'SAX Parser' it's pretty much a one-to-one
mapping of an internal event to an external event, throwing away some
information that cannot be passed through the SAX API. You can also choose
to write your program in terms of our internal event API, if you want to
have full access to the maximum amount of information.
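
To make the wiring concrete, here is a minimal Java sketch. The names
InternalHandler, SaxAdapter and Recorder are made up for illustration and
are not the real internal API; the outgoing side uses the standard
org.xml.sax ContentHandler interface as an assumption about which SAX
flavour you target. The scanner is faked by firing three events by hand.

```java
import java.util.ArrayList;
import java.util.List;
import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;

class SaxWiringSketch {
    /** Hypothetical internal event API; it carries more than SAX can express. */
    interface InternalHandler {
        void startElement(String name, boolean declaredInDtd);
        void characters(String text, boolean fromCData);
        void endElement(String name);
    }

    /** The whole "SAX parser": one internal event in, one SAX event out. */
    static class SaxAdapter implements InternalHandler {
        private final ContentHandler out;

        SaxAdapter(ContentHandler out) { this.out = out; }

        @Override public void startElement(String name, boolean declaredInDtd) {
            try {
                // declaredInDtd has no place in the SAX call, so it is dropped.
                out.startElement("", name, name, new AttributesImpl());
            } catch (SAXException e) { throw new IllegalStateException(e); }
        }

        @Override public void characters(String text, boolean fromCData) {
            try {
                // fromCData is dropped here for the same reason.
                char[] chars = text.toCharArray();
                out.characters(chars, 0, chars.length);
            } catch (SAXException e) { throw new IllegalStateException(e); }
        }

        @Override public void endElement(String name) {
            try {
                out.endElement("", name, name);
            } catch (SAXException e) { throw new IllegalStateException(e); }
        }
    }

    /** Client-side SAX handler that records what it receives. */
    static class Recorder extends DefaultHandler {
        final List<String> events = new ArrayList<>();

        @Override public void startElement(String uri, String local, String qName, Attributes atts) {
            events.add("start:" + qName);
        }
        @Override public void characters(char[] ch, int start, int length) {
            events.add("chars:" + new String(ch, start, length));
        }
        @Override public void endElement(String uri, String local, String qName) {
            events.add("end:" + qName);
        }
    }

    /** Stands in for the scanner by firing three internal events by hand. */
    static List<String> demo() {
        Recorder client = new Recorder();
        InternalHandler parser = new SaxAdapter(client);
        parser.startElement("doc", true);
        parser.characters("hi", false);
        parser.endElement("doc");
        return client.events;
    }
}
```

A client that wants the extra flags would implement InternalHandler
directly instead of going through the adapter.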

So, you could write a 'mutating SAX parser' that takes the internal events,
passes them through a 'look aside' plug-in object which mutates them, then
passes them on out the SAX interface to client code. The possible scenarios
are pretty much endless, I think. But the basic deal is that a 'parser' is
not something that we define; it's an open-ended configuration of a scanner,
whatever internal handlers you want to install on it, and any outgoing APIs
that you want to then spit the data out through (massaged in any way you
want.)
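
A rough sketch of such a 'mutating' configuration, in Java. Everything
here is hypothetical: EventSink stands in for whatever event interface
sits on either side of the look-aside object, and Serializer stands in
for the client code behind the outgoing API. The only point is the
shape of the chain: events in, mutator, events out.

```java
import java.util.function.UnaryOperator;

class MutatingParserSketch {
    /** Hypothetical event interface, used on both sides of the mutator. */
    interface EventSink {
        void startElement(String name);
        void characters(String text);
        void endElement(String name);
    }

    /** The "look aside" object: rewrites each event, then forwards it. */
    static class Mutator implements EventSink {
        private final UnaryOperator<String> rename;
        private final UnaryOperator<String> rewriteText;
        private final EventSink downstream;

        Mutator(UnaryOperator<String> rename,
                UnaryOperator<String> rewriteText,
                EventSink downstream) {
            this.rename = rename;
            this.rewriteText = rewriteText;
            this.downstream = downstream;
        }

        @Override public void startElement(String name) {
            downstream.startElement(rename.apply(name));
        }
        @Override public void characters(String text) {
            downstream.characters(rewriteText.apply(text));
        }
        @Override public void endElement(String name) {
            downstream.endElement(rename.apply(name));
        }
    }

    /** Stand-in for the client behind the outgoing API: re-serializes events. */
    static class Serializer implements EventSink {
        final StringBuilder out = new StringBuilder();

        @Override public void startElement(String name) {
            out.append('<').append(name).append('>');
        }
        @Override public void characters(String text) {
            out.append(text);
        }
        @Override public void endElement(String name) {
            out.append("</").append(name).append('>');
        }
    }

    /** Upper-cases element names and trims text on the way through. */
    static String demo() {
        Serializer client = new Serializer();
        EventSink parser = new Mutator(String::toUpperCase, String::trim, client);
        parser.startElement("doc");
        parser.characters("  hi  ");
        parser.endElement("doc");
        return client.out.toString();
    }
}
```

Because the mutator has the same interface on both sides, several of
them can be stacked, or the downstream sink can be swapped for a tree
builder instead of a serializer.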

The internal APIs have more info than can be passed out through SAX. The C++
version can actually, using the internal APIs, spit back out the original
file almost character for character (after required entity substitution
anyway.) It can also spit back out the internal and external subsets very
close to the original. If you ask the scanner for 'advanced callbacks', it
will tell you about whitespace everywhere (not just in the content) and about
markup decls that it parsed but isn't going to use because they are overrides
of previously declared decls, etc...
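
As an illustration of the ignored-decl case, here is a small Java
sketch. The names (DeclHandler, setAdvancedCallbacks, ignoredEntityDecl)
are invented for this example and are not the real API. It relies only
on the XML 1.0 rule that the first declaration of an entity is the
binding one, so a later declaration of the same name is parsed but not
used.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class AdvancedCallbackSketch {
    /** Hypothetical handler for declaration events. */
    interface DeclHandler {
        void entityDecl(String name, String value);
        /** Only called when advanced callbacks are switched on. */
        void ignoredEntityDecl(String name, String value);
    }

    /** Toy scanner fragment: keeps an entity decl pool, first declaration wins. */
    static class Scanner {
        private final Map<String, String> entityPool = new LinkedHashMap<>();
        private final DeclHandler handler;
        private boolean advancedCallbacks;

        Scanner(DeclHandler handler) { this.handler = handler; }

        void setAdvancedCallbacks(boolean on) { advancedCallbacks = on; }

        void scanEntityDecl(String name, String value) {
            if (!entityPool.containsKey(name)) {
                entityPool.put(name, value);
                handler.entityDecl(name, value);
            } else if (advancedCallbacks) {
                // Parsed but not used: report it only if the client asked.
                handler.ignoredEntityDecl(name, value);
            }
        }
    }

    /** Declares the same entity twice and returns what the handler saw. */
    static List<String> demo(boolean advanced) {
        List<String> log = new ArrayList<>();
        Scanner scanner = new Scanner(new DeclHandler() {
            @Override public void entityDecl(String n, String v) {
                log.add("used " + n + "=" + v);
            }
            @Override public void ignoredEntityDecl(String n, String v) {
                log.add("ignored " + n + "=" + v);
            }
        });
        scanner.setAdvancedCallbacks(advanced);
        scanner.scanEntityDecl("co", "Acme");
        scanner.scanEntityDecl("co", "Other");
        return log;
    }
}
```

With the flag off the handler sees only the binding declaration; with
it on, it also hears about the second one, which is what a tool needs
in order to write the subset back out close to the original.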

One of the powerful features is that it allows you to create and plug in a
custom validator. We ship a default validator which does DTD-like things.
However, if you want to, you can create a DCD validator or an XSchema
validator and plug it in as well. The scanner will maintain all of the info
that is required to do DTD-like validation (and make it available to the
validator) and the validator can just maintain any extra information that it
needs to do the extra work (such as type info and whatnot.)
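
A minimal Java sketch of the division of labour, with made-up names
(Validator, DtdLikeValidator, TypedValidator; none of them is the real
API). The scanner owns the element decl pool and passes it to whatever
validator is plugged in; the typed validator keeps only its own extra
type info. The integer check is just a stand-in for schema-style typing.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class ValidatorPlugSketch {
    /** Hypothetical validator plug-in; the scanner hands it its decl pool. */
    interface Validator {
        void validateElement(String name, String text,
                             Set<String> declaredElements, List<String> errors);
    }

    /** Default validator: checks only what a DTD could express. */
    static class DtdLikeValidator implements Validator {
        @Override public void validateElement(String name, String text,
                Set<String> declaredElements, List<String> errors) {
            if (!declaredElements.contains(name)) {
                errors.add("undeclared element: " + name);
            }
        }
    }

    /** Schema-style validator: DTD-like checks plus its own type info. */
    static class TypedValidator extends DtdLikeValidator {
        private final Set<String> integerElements = new HashSet<>();

        void declareInteger(String elementName) { integerElements.add(elementName); }

        @Override public void validateElement(String name, String text,
                Set<String> declaredElements, List<String> errors) {
            super.validateElement(name, text, declaredElements, errors);
            if (integerElements.contains(name) && !text.matches("-?\\d+")) {
                errors.add("not an integer: " + name);
            }
        }
    }

    /** Toy scanner fragment: owns the decl pool, delegates the checking. */
    static class Scanner {
        private final Set<String> declaredElements = new HashSet<>();
        private final List<String> errors = new ArrayList<>();
        private Validator validator; // null means: do no validation work

        void setValidator(Validator validator) { this.validator = validator; }
        void declareElement(String name) { declaredElements.add(name); }

        void element(String name, String text) {
            if (validator != null) {
                validator.validateElement(name, text, declaredElements, errors);
            }
        }

        List<String> errors() { return errors; }
    }

    /** Scans one declared and one undeclared element with the given validator. */
    static List<String> demo(Validator validator) {
        Scanner scanner = new Scanner();
        scanner.setValidator(validator);
        scanner.declareElement("count");
        scanner.element("count", "twelve");
        scanner.element("note", "hello");
        return scanner.errors();
    }
}
```

Passing no validator at all gives the "do the least work possible"
configuration: the scanner still keeps its pools but nothing is checked.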

>Reading the API is kind of weird.  I'd like to know what this 'scanner'
>critter is doing too.
>

The scanner is the core of the system. It handles the actual scanning of XML
text, the generation of internal events, basic well-formedness and validation
checking, etc... You install handlers on it (the handlers for the internal
events), which it calls. It also maintains an element decl pool, attribute
decl pool, entity decl pools, etc... And it does most of the standard
well-formedness and validation checks (with validators only handling any
extended checking, and handling structural validation.)

The C++ version of the scanner presents the same basic internal interface
as the Java one, though its internal implementation is much different.

xml-dev: A list for W3C XML Developers. To post, mailto:xml-dev at ic.ac.uk
Archived as: http://www.lists.ic.ac.uk/hypermail/xml-dev/ and on CD-ROM/ISBN 981-02-3594-1