ANNOUNCEMENT: Proposed SAX Revisions

Fri Mar 20 21:37:24 GMT 1998

* David Megginson
|
| If you are interested in SAX, either as a parser writer or as an
| application writer, please take a few minutes to read through this
| web page. I will look forward to receiving your comments,
| corrections, and suggestions.

Before my comments I should perhaps add some background: 

A Python special interest group for XML processing has recently been
established and as a part of that effort there are now several budding
XML parsers in Python (plus one C module), a SAX library, a
prototypical DOM implementation etc.

See <URL:http://www.python.org/sigs/xml-sig/> for more details.

My part in all this has so far been an XML parser as well as the
Python SAX translation and drivers. (This is available from the link
above. A parser version with full well-formedness checking will
probably be released later this weekend.)

Apart from the comments below, I agree with all the changes, and
mostly with the rationales for them as well.

However, I don't like adding the Location argument to every *Handler
method. IMHO, that clutters the interfaces too much. I'd much prefer
this alternative:

Make a new interface BaseHandler, which only has two methods:
getLocator and setLocator, which can be used to give the handler a
Location object[1] it can ask about the current location.

This interface can then be implemented by DTDHandler, EntityHandler,
DocumentHandler and ErrorHandler. It would simplify those four
interfaces (by removing an argument from every method in each
interface) and would probably simplify both implementation and the
transition to the new SAX version. (SAX version numbers might perhaps
be an idea?)
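
A minimal sketch of this idea in Python, assuming a locator that can
report a line and column. The names SimpleLocator, RecordingHandler,
getLineNumber and getColumnNumber are made up for illustration and are
not taken from any SAX draft; only BaseHandler, getLocator and
setLocator come from the proposal above.

```python
class SimpleLocator:
    """Hypothetical locator; a real parser would update it while scanning."""

    def __init__(self):
        self.line = 1
        self.column = 1

    def getLineNumber(self):
        return self.line

    def getColumnNumber(self):
        return self.column


class BaseHandler:
    """Proposed common base: it only stores and returns the locator."""

    def __init__(self):
        self._locator = None

    def setLocator(self, locator):
        self._locator = locator

    def getLocator(self):
        return self._locator


class RecordingHandler(BaseHandler):
    """Document handler whose methods take no location argument."""

    def __init__(self):
        super().__init__()
        self.seen = []

    def startElement(self, name, attrs):
        loc = self.getLocator()
        # Ask for the position only when it is actually wanted.
        self.seen.append((name, loc.getLineNumber(), loc.getColumnNumber()))


locator = SimpleLocator()
handler = RecordingHandler()
handler.setLocator(locator)  # done once by the parser, before parsing starts

handler.startElement("doc", {})
locator.line, locator.column = 3, 5  # the parser moves on through the input
handler.startElement("para", {})
```

The handler methods keep their short signatures, and a handler that
never cares about positions simply never calls getLocator.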

The specification should perhaps also state exactly where the
Location object should point. The most obvious choice is the first
character of the reported construct, but IMHO that should be spelled
out.

The last issue is that of AttributeList. In Python (and many other
languages) lists, hash tables and tuples are "native" types and this
is basically what AttributeList is. Also, Java is now going to have a
standardized Collections API with Java 1.2. 

I think AttributeList should be in a form that can be implemented
with the "native" types where that is natural, while still conforming
to the Collections API of Java 1.2.

One way to do that might be to have an Attribute object with Name,
Type and Value attributes and just make AttributeList a hash table
that maps attribute names to those objects. In Python/Common Lisp/Perl
this might be implemented with hash tables and lists/tuples.

Alternatively, one could throw out the type information and just use a
plain hash table/associative array.
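
Both variants might look roughly like this in Python. This is only a
sketch: the Attribute record and the make_attribute_list helper are
hypothetical names, and the sample attributes are invented.

```python
from collections import namedtuple

# Hypothetical Attribute record with the three fields named above.
Attribute = namedtuple("Attribute", ["name", "type", "value"])


def make_attribute_list(triples):
    """Build a name -> Attribute mapping from (name, type, value) triples."""
    return {name: Attribute(name, type_, value)
            for name, type_, value in triples}


attrs = make_attribute_list([
    ("id", "ID", "sec1"),
    ("lang", "CDATA", "en"),
])

# The alternative: drop the type information and keep a plain mapping
# from attribute name to attribute value.
plain = {name: attr.value for name, attr in attrs.items()}
```

With the first form a client writes attrs["id"].type or
attrs["id"].value; with the second it is just plain["id"].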

Below this point I have two ideas that may clash with what people want
from SAX. If they are out of the question that's OK; I just want
to hear the reactions to them.

One thing that would be very nice would be to make it possible for SAX
clients to do validation themselves in case the underlying parser does
not support it. 

This would make it possible to build a validating XML parser in
languages like Python/tcl/Scheme from three components: a C module for
fast document scanning, a Python/tcl/Scheme module for the same in
case the C one hasn't been compiled in and finally the validation
itself, written in Python/tcl/Scheme.

What's necessary for this is basically the doctype method, access to
the internal subset somehow and access to the XML declaration.[2]
These things should perhaps be in a separate interface, since they are
pretty different from most of the other things SAX is concerned with.
(DTDHandler is probably the wrong place for them, since that would
probably only be implemented if the parser already does validation.)

This is of course not a matter of life and death, since it's possible
to do this without SAX, but IMHO it would be nice as it would provide
a clean decoupling of XML scanning and XML validation.
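
As a rough sketch of that decoupling in Python: a validating layer
that sits between a non-validating scanner and the application, and
checks a single validity constraint (the root element must have the
name given in the document type declaration). The doctype event and
its signature, the feed helper that stands in for a scanner, and all
class names here are assumptions for illustration only.

```python
class ValidityError(Exception):
    pass


class RootCheckingFilter:
    """Sits between a scanner and the application's handler.

    It only checks that the root element has the name given in the
    document type declaration; a full validator would do much more.
    """

    def __init__(self, downstream):
        self.downstream = downstream
        self.doctype_name = None
        self.depth = 0

    def doctype(self, name, public_id=None, system_id=None):
        # Hypothetical event: the scanner reports <!DOCTYPE name ...>.
        self.doctype_name = name

    def startElement(self, name, attrs):
        if self.depth == 0 and self.doctype_name is not None:
            if name != self.doctype_name:
                raise ValidityError(
                    "root element %r does not match DOCTYPE %r"
                    % (name, self.doctype_name))
        self.depth += 1
        self.downstream.startElement(name, attrs)

    def endElement(self, name):
        self.depth -= 1
        self.downstream.endElement(name)


class Collector:
    """Stand-in for the application: records the events it receives."""

    def __init__(self):
        self.events = []

    def startElement(self, name, attrs):
        self.events.append(("start", name))

    def endElement(self, name):
        self.events.append(("end", name))


def feed(events, handler):
    """Replay (method_name, args) pairs, as a scanner would emit them."""
    for method, args in events:
        getattr(handler, method)(*args)


app = Collector()
feed([("doctype", ("doc",)),
      ("startElement", ("doc", {})),
      ("endElement", ("doc",))],
     RootCheckingFilter(app))
```

The scanner behind feed could be the C module or its pure-Python
fallback; the filter does not need to know which one it is.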

Finally, one thought that struck me when I read this API was that SAX
seemed to be biased towards parsers that read the DTD, while I'm
biased the other way. This is probably due to the differences between
the situation Python is in (and Perl/... probably will be in) and the
current state of affairs in Java.

This makes me think that it might perhaps be an idea to add a second
level to SAX: one that provides logical information about elements,
entities, attributes and notations as they are declared in the DTD.

The existing SAX can then be simplified to only provide logical
information about the document itself. The second level will of course
only be supported by the parsers that actually parse the DTD and, if my
suggestions above are adopted, the second level can be built on the
first.

Doing this would also solve the AttributeList and DTDHandler
"problems", since AttributeList would then become a plain hash table
in most languages[3] while DTDHandler would become part of the DTD
interfaces. One advantage of this is that there would no longer be
several methods in SAX level 1 that many simple parsers will not
support.

Just an idea. The problem is of course how to provide access to the
complex stuff: element content models. (Possibly by avoiding the issue
entirely and instead having methods that ask the DTD "is this element
allowed here?")
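
To make the "ask the DTD" idea concrete, here is a deliberately
simplified Python sketch. SimpleDTD, declare, is_declared and
is_allowed are hypothetical names, and the big simplification is that
each element only has a set of permitted child element names, so
order and repetition in the content model are ignored.

```python
class SimpleDTD:
    """Hypothetical DTD object that answers questions instead of
    exposing content models.

    Simplification: each element just has a set of permitted child
    element names; order and repetition are ignored.
    """

    def __init__(self):
        self._children = {}

    def declare(self, element, allowed_children):
        self._children[element] = set(allowed_children)

    def is_declared(self, element):
        return element in self._children

    def is_allowed(self, parent, child):
        """Is `child` permitted somewhere inside `parent`?"""
        return child in self._children.get(parent, set())


dtd = SimpleDTD()
dtd.declare("doc", ["title", "para"])
dtd.declare("title", [])
dtd.declare("para", ["em"])
```

A real interface would also have to know how far into the parent's
content the parser has come in order to answer "here" properly; this
sketch only answers "anywhere inside this parent".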

[1] With this change it would probably be best to rename Location to
    Locator, since that's really what the object would then be.
    Locator might also perhaps be merged with Parser.

[2] It would be nice if the SAX spec could specify whether or not the
    XML declaration should be reported as an ordinary processing
    instruction. Reporting it would solve the problem with access to
    it, but would add complexity for the users.

[3] Probably an assoc list in R5RS Scheme and object/record arrays in
    VB/Delphi/C/....

-- 
"These are, as I began, cumbersome ways / to kill a man. Simpler, direct, 
and much more neat / is to see that he is living somewhere in the middle /
of the twentieth century, and leave him there."     -- Edwin Brock

 http://www.stud.ifi.uio.no/~larsga/      http://birk105.studby.uio.no/
