JUMBO

Peter Murray-Rust Peter at ursus.demon.co.uk
Sat Mar 22 19:46:38 GMT 1997


JUMBO is a prototype browser/editor/search/transformation tool for
XML documents.  I have now managed to bolt in both Lark and NXP 
instead of my parser (which was crude and did not support some of the
XML constructs).  The bolting-in is still rather crude and concentrates
my mind on the need for a simple API at this level.  Here are some comments
which may be useful.

NXP.
----
NXP has an interface Esis, with functions such as open_tag, close_tag,
process_instruction, etc.  [I think they would be more properly called 
start_element, etc.??].  JUMBO uses this to build up a Vector representing the
ESIS event stream, something like:
"_START_TAG" "CML"  AttributeList "_START_TAG" "MOL" ... "_END_TAG" "MOL"...
JUMBO then builds a tree out of this, adding attributes, etc.
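
A minimal sketch of that tree-building step, in case it helps discussion.
It assumes the flat layout shown above (marker, name, attribute list for a
start tag; marker, name for an end tag).  SketchNode and EventStreamToTree
are hypothetical names, not JUMBO classes, the attribute list is modelled
as a plain Map, and a List stands in for the Vector:

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import java.util.Map;

// Hypothetical node type for this sketch; not JUMBO's actual tree class.
class SketchNode {
    final String name;
    final Map<String, String> attributes;
    final List<SketchNode> children = new ArrayList<>();

    SketchNode(String name, Map<String, String> attributes) {
        this.name = name;
        this.attributes = attributes;
    }
}

class EventStreamToTree {
    // Assumed layout: "_START_TAG", name, attribute map, ..., "_END_TAG", name.
    static SketchNode build(List<Object> events) {
        Deque<SketchNode> open = new ArrayDeque<>(); // currently open elements
        SketchNode root = null;
        int i = 0;
        while (i < events.size()) {
            Object marker = events.get(i);
            if ("_START_TAG".equals(marker)) {
                String name = (String) events.get(i + 1);
                @SuppressWarnings("unchecked")
                Map<String, String> atts = (Map<String, String>) events.get(i + 2);
                SketchNode node = new SketchNode(name, atts);
                if (open.isEmpty()) {
                    if (root != null) {
                        throw new IllegalStateException("More than one root element");
                    }
                    root = node;
                } else {
                    open.peek().children.add(node); // attach to innermost open element
                }
                open.push(node);
                i += 3;
            } else if ("_END_TAG".equals(marker)) {
                String name = (String) events.get(i + 1);
                if (open.isEmpty() || !open.peek().name.equals(name)) {
                    throw new IllegalStateException("Unbalanced end tag: " + name);
                }
                open.pop();
                i += 2;
            } else {
                throw new IllegalArgumentException("Unexpected event: " + marker);
            }
        }
        if (!open.isEmpty()) {
            throw new IllegalStateException("Unclosed element: " + open.peek().name);
        }
        return root;
    }
}
```

Content and PI events are left out here; they would need their own markers
in the stream and their own branches in the loop.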

NXP has a class XML which is built by JACC.  This contains inter alia
an Esis_Stdout object (implements Esis).  There are several objects in XML
which are private and therefore not easily accessed - I think they should
have accessors, but at present I have subclassed it to PMRXML, which has
the requisite accessors.

My test program then creates a PMRXML object, and extracts the event stream
which is then passed to JUMBO's existing tree object:
    NXP.PMRXML xml = new PMRXML(NXP.Streams.load_File(file, true));
    pmr.chemime.ChemTree chemTree = new ChemTree(xml.getStreamVector());
    pmr.sgml.GeneralTOC toc = chemTree.createGeneralTOC(3);

Comments:  I have still to work out what whitespace NXP creates - there seems 
to be a lot of content which is simply whitespace.  Maybe we have to address
COLLAPSE and KEEP at this stage?  Also it isn't easy to extract certain 
info - for example I had to hack XML.java to get the doctype - this isn't a good
idea and we need an accessor.  I am also still not clear how NXP does (or should)
behave with:
<!DOCTYPE CML>
and <!DOCTYPE CML SYSTEM "cml.dtd">
(the default on the latter is to try to validate, I think, even if validate
is set to false.  I'd prefer to be able to turn off validation, but I may have
missed something).
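
As a stopgap for the whitespace problem, the application could discard
content that consists only of whitespace before building the tree.  A rough
sketch, assuming the content arrives as Strings; WhitespaceFilter is a
hypothetical helper, not part of NXP or JUMBO, and it takes no account of
any KEEP setting on an element:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for this sketch; not part of NXP or JUMBO.
class WhitespaceFilter {
    // True if the text contains only XML whitespace characters
    // (space, tab, carriage return, line feed). Empty text also counts.
    static boolean isXmlWhitespaceOnly(String text) {
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c != ' ' && c != '\t' && c != '\r' && c != '\n') {
                return false;
            }
        }
        return true;
    }

    // Returns a new list without the whitespace-only chunks;
    // the remaining chunks are kept unchanged and in order.
    static List<String> dropWhitespaceOnly(List<String> chunks) {
        List<String> kept = new ArrayList<>();
        for (String chunk : chunks) {
            if (!isXmlWhitespaceOnly(chunk)) {
                kept.add(chunk);
            }
        }
        return kept;
    }
}
```

This is deliberately crude: it throws away whitespace that may be
significant in mixed content, so it is only reasonable where the
application knows the elements concerned hold no meaningful whitespace.
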
	In general I'd like to be able to treat NXP as a black box, and subclass
my Esis object.  That could mean passing it as an argument to XML, e.g.:
   
public class PMREsis implements Esis {
    public void open_tag(String name) {
...
    }
}

    PMREsis esis = new PMREsis();
    NXP.XML xml = new NXP.XML(esis, NXP.Streams.load_File(file, true));
    pmr.sgml.SGMLTree tree = new pmr.sgml.SGMLTree(xml);

and so on.
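
To make the idea concrete, here is a sketch of a handler that builds the
tree directly from the callbacks, with no intermediate Vector.  SketchEsis
is a hypothetical stand-in for NXP's Esis interface: open_tag(String) is as
in the fragment above, but the close_tag(String) signature is my assumption,
and attributes, content and PIs are omitted:

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Hypothetical stand-in for NXP's Esis interface; the real interface has
// more methods and its signatures may differ from these.
interface SketchEsis {
    void open_tag(String name);
    void close_tag(String name); // assumed signature
}

class TreeBuildingEsis implements SketchEsis {
    // Hypothetical element type for this sketch; not a JUMBO class.
    static class Element {
        final String name;
        final List<Element> children = new ArrayList<>();

        Element(String name) {
            this.name = name;
        }
    }

    private final Deque<Element> open = new ArrayDeque<>(); // open elements
    private Element root;

    public void open_tag(String name) {
        Element element = new Element(name);
        if (open.isEmpty()) {
            root = element;                     // first element becomes the root
        } else {
            open.peek().children.add(element);  // child of innermost open element
        }
        open.push(element);
    }

    public void close_tag(String name) {
        if (open.isEmpty() || !open.peek().name.equals(name)) {
            throw new IllegalStateException("Unbalanced close_tag: " + name);
        }
        open.pop();
    }

    Element getRoot() {
        return root;
    }
}
```

The parser then needs to know nothing about the tree, and the application
needs to know nothing about the parser beyond the handler interface.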

NXP is a validating parser, but my DTDs are still struggling with Parameter
Entities so I have no experience here.

Lark
----
	Lark creates a tree (called Lark) and provides a handler for 
the user to pick up a variety of events (e.g. doDoctype(), doPI()).  The
tree contains Elements ('Nodes') which have Attributes and a type (String).

Rather than subclassing these elements, I process Lark by iterating through
the Elements and creating a JUMBO SGMLTree (this can be delayed if required).
The tree seems complete, but I am not sure I have got all the doFOO routines
working correctly.  I have also had problems with PIs (if the ?> delimiter
is used) - these may be mine.

Lark does not validate.  However it is easy to interface and is fast.


General
-------
I do not use PIs myself though I shall start to do so.  If they are
kept in the document tree, is there a convention where they live?  (The last
opened element?  What if they occur in PCDATA?).

I intend to make JUMBO available with both Lark and NXP but it's a bit creaky
at present and the interface is a bit slow.  I have been told that the larger
the number of classes, the slower the program - any comments?  Also I don't
know whether I should be deliberately garbage-collecting at this stage.

Any general thoughts would be welcome.  I intend to bolt a crude search tool
into JUMBO along the TEI lines.  I shall also see whether I can extract the
bits of NXP that do the validating, because then we have a crude validating
editor.  

Any feedback from the current JUMBOs would be appreciated.  (I already know
it's slow, and the graphics creak in several places :-)

P.

-- 
Peter Murray-Rust, domestic net connection
Virtual School of Molecular Sciences
http://www.vsms.nottingham.ac.uk/

xml-dev: A list for W3C XML Developers
Archived as: http://www.lists.ic.ac.uk/hypermail/xml-dev/
To unsubscribe, send to majordomo at ic.ac.uk the following message;
unsubscribe xml-dev
List coordinator, Henry Rzepa (rzepa at ic.ac.uk)



