YAXPAPI (Yet Another XML Parser API) - an XDEV proposal

Peter Murray-Rust peter at ursus.demon.co.uk
Sat Dec 13 14:25:43 GMT 1997

In case anyone has missed my postings over the last 10 months, I would like
an API for XML parsers :-). JUMBO has been interfaced to 3 publicly
available Java parsers (besides its Mus Michaelis one), and adding more is
hard grunt work because of the inconsistency in what is presented through
the existing APIs. Note that all three parsers
(Lark0.97, NXP97-09, AElfred1.0beta) provide EventDriven interfaces. I have
not tried - and do not at present intend to try - to interface with someone
else's Tree or Grove model. [Lark builds a tree if required - the others
don't. NXP has classes for a CompleteGrove - I haven't used them.]

*** Please understand that any apparent frustration below is NOT criticism
of these three parsers and their authors - all of whom have made an
extremely important contribution. Nor is the omission of MSXML, tcl-based
parsers and JamesC's software anything other than lack of time ***

It's also clear that none of the three allow me to get at all the
information in the document I want, though I think AElfred is almost there
[I haven't looked at the latest version.] Let's assume I want the Name in
the DOCTYPE [29] - the root elementType.

In Lark0.97:
public boolean doDoctype(Entity e, String rootType, String publicID, String

OK - I can manage this, but I have no idea what the Entity class is in any
of Lark's calls. "Those names Element, Attribute and Entity are obvious in
their function." This is just another example of my Dumbness, but it's a
reality. I don't have time to explore precisely what it is - and I can't
actually print it out.

In NXP97-09-05 (I think) I can grep and find (XML.java):

final public String doctypedecl(); 

Since the code is autogenerated by JACC I haven't the first idea what the
contents of the String are (I would have to experiment). If it goes by the
spec it's the whole String contents of all the subsets, I assume.

In AElfred1.0beta:
public abstract void doctypedecl(XmlParser parser, String name, String
pubid, String sysid);

This is fully documented in javadoc.

[Note: javadoc is free, comes with the system, is relatively easy to use
after you have fought the classpath, and there is no good reason not to use
it.]

So three parsers, three quite different interfaces, three more midnight
hacks for JUMBO. I haven't looked at MSXML but I would be amazed if there
wasn't yetanotherinterface.

All of this makes JUMBO very tired.

There seem to be several reasons for this lethargy in producing an API -
we've been at this since February. Since there is relatively little
discussion, I am guessing these reasons from "vibes". :-)
	- it's too early to do anything - the language spec has only been
published this week.
	- it's all in the spec - if you can't work out what to do properly that's
not our problem.
	- a proper grove plan takes care of this.  Anything simpler is inadequate.
	- this will all be sorted out by the DOM, so let's do nothing until this
	- parsers are unlikely to be interoperable anyway. 
	- this is an area which should be left to the software houses - the W3C is
primarily to develop markets for its members.
	- it's in our interests to have non-interoperability because we'll protect
our markets that way.
	- it's too difficult and I'm not paid to spend the time thinking about it.

So - as a first step - I make the following proposal and ask for
constructive comments. I am quite prepared to be shown it's shallow and


*Simple* Java interfaces are usually built by identifying the objects
involved and using a consistent style for naming objects, methods,
interfaces and related hooks. An example is Java Beans, where getXyz() and
setXyz() have semantics which the Beans reflection mechanism can identify. 
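
As a reminder of how little is needed for that convention to work, here is a
minimal sketch using the standard java.beans.Introspector. The RootInfo class
and its "name" property are invented purely for illustration; only the
getXyz()/setXyz() naming makes the property discoverable.

```java
import java.beans.IntrospectionException;
import java.beans.Introspector;
import java.beans.PropertyDescriptor;
import java.util.ArrayList;
import java.util.List;

class BeanNamingSketch {
    /** Hypothetical bean; the class and its property are invented for this sketch. */
    public static class RootInfo {
        private String name;
        public String getName() { return name; }
        public void setName(String name) { this.name = name; }
    }

    /** Lists the properties that have both a getXyz() and a setXyz() method. */
    static List<String> readWriteProperties(Class<?> beanClass) {
        List<String> names = new ArrayList<>();
        try {
            // Stop at Object so that getClass() is not reported as a property.
            PropertyDescriptor[] descriptors =
                Introspector.getBeanInfo(beanClass, Object.class).getPropertyDescriptors();
            for (PropertyDescriptor descriptor : descriptors) {
                if (descriptor.getReadMethod() != null && descriptor.getWriteMethod() != null) {
                    names.add(descriptor.getName());
                }
            }
        } catch (IntrospectionException e) {
            throw new IllegalStateException(e);
        }
        return names;
    }

    public static void main(String[] args) {
        // Prints the property names found by naming convention alone.
        System.out.println(readWriteProperties(RootInfo.class));
    }
}
```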

The XML spec has very precise definitions of the components that are
required in an interface. 

My proposal is simply that we should use these two approaches wherever
possible in naming classes and methods, and that we should list the
functions in the interface. That's all :-).

If I want the rootType of the document I refer to [29] and see that it is a
Name. Therefore I could do all I want with code like:

/** extract the string directly from the document [29] */
public String Document.getDoctypedeclName()  OR:

/** or have a class for Doctypedecl [29] */
public Doctypedecl Document.getDoctypedecl();
public String Doctypedecl.getName();
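
To make the two styles concrete, here is a minimal sketch with both on one
hypothetical Document class. Nothing here comes from an existing parser: the
constructors, the null convention for a missing DOCTYPE and the wrapper class
DoctypeSketch are my assumptions; only the method names follow the proposal.

```java
class DoctypeSketch {
    /** Hypothetical class named after production [29] doctypedecl. */
    static final class Doctypedecl {
        private final String name;

        Doctypedecl(String name) { this.name = name; }

        /** The Name in the DOCTYPE, i.e. the root element type. */
        public String getName() { return name; }
    }

    /** Hypothetical document object offering both access styles. */
    static final class Document {
        private final Doctypedecl doctypedecl; // null when there is no DOCTYPE (assumption)

        Document(Doctypedecl doctypedecl) { this.doctypedecl = doctypedecl; }

        /** Style 2: hand back an object for production [29]. */
        public Doctypedecl getDoctypedecl() { return doctypedecl; }

        /** Style 1: extract the string directly. */
        public String getDoctypedeclName() {
            return doctypedecl == null ? null : doctypedecl.getName();
        }
    }

    public static void main(String[] args) {
        Document doc = new Document(new Doctypedecl("Foo"));
        System.out.println(doc.getDoctypedeclName());       // direct string
        System.out.println(doc.getDoctypedecl().getName()); // via the class
    }
}
```

The first style keeps the parser small; the second leaves room for the
public and system identifiers without changing Document again.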


To get the contentspec of the element Foo, and the default value of its Bar
attribute (note the differences in capitalisation of the string 'decl' in
the spec):

Enumeration elementdecls = Document.getElementdecls(); /*[29-30]*/
while (elementdecls.hasMoreElements()) {
    Elementdecl elementdecl = (Elementdecl) elementdecls.nextElement();
    if (elementdecl.getName().equals("Foo")) {    /*[45]*/
        String contentspec = elementdecl.getContentspec();
    }
}
Enumeration attlistdecls = Document.getAttlistDecls();  /*[29, 30]*/
while (attlistdecls.hasMoreElements()) {
    AttlistDecl attlistDecl = (AttlistDecl) attlistdecls.nextElement();
    if (attlistDecl.getName().equals("Foo")) {
        Vector attDefVector = attlistDecl.getAttdefs();  /*[52]*/
        for (int i = 0; i < attDefVector.size(); i++) {
            AttDef attDef = (AttDef) attDefVector.elementAt(i);
            if (attDef.getName().equals("Bar")) {       /*[53]*/
                String value = attDef.getDefault();     /*[54]*/
            }
        }
    }
}
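
For anyone who wants to try the lookup, here is a self-contained sketch with
throwaway stand-in classes. The class bodies, the add() methods and the
helpers contentspecOf() and defaultValueOf() are assumptions made for this
example; only the getter names follow the proposal, and no existing parser
provides any of this.

```java
import java.util.Enumeration;
import java.util.Vector;

class DtdLookupSketch {
    /** Stand-in for an element declaration. */
    static final class Elementdecl {
        private final String name, contentspec;
        Elementdecl(String name, String contentspec) { this.name = name; this.contentspec = contentspec; }
        String getName() { return name; }
        String getContentspec() { return contentspec; }
    }

    /** Stand-in for one attribute definition inside an attribute list. */
    static final class AttDef {
        private final String name, defaultValue;
        AttDef(String name, String defaultValue) { this.name = name; this.defaultValue = defaultValue; }
        String getName() { return name; }
        String getDefault() { return defaultValue; }
    }

    /** Stand-in for an attribute-list declaration. */
    static final class AttlistDecl {
        private final String name;
        private final Vector<AttDef> attDefs = new Vector<>();
        AttlistDecl(String name, AttDef... defs) {
            this.name = name;
            for (AttDef def : defs) attDefs.addElement(def);
        }
        String getName() { return name; }
        Vector<AttDef> getAttdefs() { return attDefs; }
    }

    /** Stand-in document holding the declarations in document order. */
    static final class Document {
        private final Vector<Elementdecl> elementdecls = new Vector<>();
        private final Vector<AttlistDecl> attlistDecls = new Vector<>();
        void add(Elementdecl e) { elementdecls.addElement(e); }
        void add(AttlistDecl a) { attlistDecls.addElement(a); }
        Enumeration<Elementdecl> getElementdecls() { return elementdecls.elements(); }
        Enumeration<AttlistDecl> getAttlistDecls() { return attlistDecls.elements(); }
    }

    /** Returns the contentspec of the named element, or null if undeclared. */
    static String contentspecOf(Document doc, String elementName) {
        Enumeration<Elementdecl> decls = doc.getElementdecls();
        while (decls.hasMoreElements()) {
            Elementdecl decl = decls.nextElement();
            if (decl.getName().equals(elementName)) return decl.getContentspec();
        }
        return null;
    }

    /** Returns the default value of an attribute, or null if none is found. */
    static String defaultValueOf(Document doc, String elementName, String attName) {
        Enumeration<AttlistDecl> decls = doc.getAttlistDecls();
        while (decls.hasMoreElements()) {
            AttlistDecl decl = decls.nextElement();
            if (!decl.getName().equals(elementName)) continue;
            for (AttDef attDef : decl.getAttdefs()) {
                if (attDef.getName().equals(attName)) return attDef.getDefault();
            }
        }
        return null;
    }

    public static void main(String[] args) {
        Document doc = new Document();
        doc.add(new Elementdecl("Foo", "(#PCDATA)"));
        doc.add(new AttlistDecl("Foo", new AttDef("Bar", "baz")));
        System.out.println(contentspecOf(doc, "Foo"));
        System.out.println(defaultValueOf(doc, "Foo", "Bar"));
    }
}
```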


If something is defined in the spec, it has a clear place where it is
defined, and a clear term. Why not use this? It should only take a few
hours to go through the 82 productions and decide which of them return
anything useful (we are unlikely to require [26], for example :-) - many
productions are irrelevant to the parsed, normalised document. The
semantics are clear (at least as clear as the spec can provide), and can be
precisely pinpointed.

We have to decide which components require classes and which are simply
Strings. In some cases capitalisation is a problem. Java strongly urges
initial caps so I would write:

public Prolog getProlog();  /*[23]*/

(I am not sure whether there are name collisions separated only by case).

In some cases the names clash with existing java classes, so in [59] we
might have to write:
public jumbo.parser.Enumeration getEnumeration();

since there is a java.util.Enumeration.
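
A minimal sketch of how the two names can coexist follows. To keep it in one
file I use a nested class rather than a jumbo.parser package, and the
constructor, getNmtokens() and the AttDef holder are assumptions for
illustration. The point is only that the spec-named class takes the simple
name and java.util.Enumeration is written out in full.

```java
import java.util.Vector;

class NameClashSketch {
    /** Hypothetical class for production [59], an enumerated type such as (a|b|c). */
    static final class Enumeration {
        private final Vector<String> nmtokens = new Vector<>();

        Enumeration(String... tokens) {
            for (String token : tokens) nmtokens.addElement(token);
        }

        /** The java.util type must be fully qualified here to avoid the clash. */
        java.util.Enumeration<String> getNmtokens() {
            return nmtokens.elements();
        }
    }

    /** Hypothetical owner of the enumerated type. */
    static final class AttDef {
        private final Enumeration enumeration;

        AttDef(Enumeration enumeration) { this.enumeration = enumeration; }

        /** Inside this file the simple name means the spec's Enumeration. */
        Enumeration getEnumeration() { return enumeration; }
    }

    public static void main(String[] args) {
        AttDef attDef = new AttDef(new Enumeration("a", "b", "c"));
        java.util.Enumeration<String> tokens = attDef.getEnumeration().getNmtokens();
        while (tokens.hasMoreElements()) {
            System.out.println(tokens.nextElement());
        }
    }
}
```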

In some cases there are repeatable values [e.g. [58] ] where we might need:

public String[] NotationType.getNames();

or we may choose to have Vector, etc.
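
The two choices can be compared in a small sketch. NotationType here is a
hypothetical stand-in, and getNameVector() is a name I made up for the Vector
variant. Both accessors return copies so that callers cannot disturb the
parser's own data, which is a design assumption rather than part of the
proposal.

```java
import java.util.Vector;

class RepeatableValuesSketch {
    /** Hypothetical class for production [58] NotationType. */
    static final class NotationType {
        private final Vector<String> names = new Vector<>();

        NotationType(String... notationNames) {
            for (String name : notationNames) names.addElement(name);
        }

        /** Array style: fixed size, no casts needed by the caller. */
        String[] getNames() {
            String[] copy = new String[names.size()];
            names.copyInto(copy);
            return copy;
        }

        /** Vector style: growable, fits code that already uses Vector. */
        Vector<String> getNameVector() {
            return new Vector<>(names);
        }
    }

    public static void main(String[] args) {
        NotationType type = new NotationType("gif", "jpeg");
        System.out.println(type.getNames().length);
        System.out.println(type.getNameVector());
    }
}
```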

The use of many classes might make the parsers too large or slow, so maybe
some other style might be useful.


This is simple, and is easy to implement. Dumb hackers like me can
understand it by reading the spec - they don't need to know about groves,
DOM or whatever. I expect that it's not comprehensive - there is no error
model for example - but I can't see much that I need from a document that
isn't in the spec. Anything else would be parser-specific flags, or perhaps
retrieval of unnormalised input.



Peter Murray-Rust, Director Virtual School of Molecular Sciences, domestic
net connection
VSMS http://www.nottingham.ac.uk/vsms, Virtual Hyperglossary

