JUMBO
Peter Murray-Rust
Peter at ursus.demon.co.uk
Sat Mar 22 19:46:38 GMT 1997
JUMBO is a prototype browser/editor/search/transformation tool for
XML documents. I have now managed to bolt in both Lark and NXP
instead of my parser (which was crude and did not support some of the
XML constructs). The bolting-in is still rather crude and concentrates
my mind on the need for a simple API at this level. Here are some comments
which may be useful.
NXP.
----
NXP has an interface Esis, with functions such as open_tag, close_tag,
process_instruction, etc. [I think they would be more properly called
start_element??]. JUMBO uses this to build up a Vector representing the
ESIS event stream, something like:
"_START_TAG" "CML" AttributeList "_START_TAG" "MOL" ... "_END_TAG" "MOL"...
JUMBO then builds a tree out of this, adding attributes, etc.
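A minimal sketch of that tree-building step, assuming the flat layout shown above (marker, element name, attribute list for a start tag; marker, element name for an end tag). The classes EventNode and EventTreeBuilder are hypothetical and are not JUMBO's own; the attribute list is modelled as a plain Map, which is also an assumption.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import java.util.Map;

// Hypothetical tree node; JUMBO's real classes are not shown in this post.
class EventNode {
    final String name;
    final Map<String, String> attributes;
    final List<EventNode> children = new ArrayList<>();

    EventNode(String name, Map<String, String> attributes) {
        this.name = name;
        this.attributes = attributes;
    }
}

class EventTreeBuilder {
    // Walks a flat stream of the form
    //   "_START_TAG", name, attributes, ..., "_END_TAG", name
    // and returns the root element. Any other entry is skipped.
    @SuppressWarnings("unchecked")
    static EventNode build(List<Object> events) {
        Deque<EventNode> open = new ArrayDeque<>(); // currently open elements
        EventNode root = null;
        int i = 0;
        while (i < events.size()) {
            Object marker = events.get(i);
            if ("_START_TAG".equals(marker)) {
                String name = (String) events.get(i + 1);
                Map<String, String> attrs = (Map<String, String>) events.get(i + 2);
                EventNode node = new EventNode(name, attrs);
                if (open.isEmpty()) {
                    root = node;                    // first element opened
                } else {
                    open.peek().children.add(node); // child of innermost open element
                }
                open.push(node);
                i += 3;
            } else if ("_END_TAG".equals(marker)) {
                String name = (String) events.get(i + 1);
                if (open.isEmpty() || !open.peek().name.equals(name)) {
                    throw new IllegalStateException("Unbalanced end tag: " + name);
                }
                open.pop();
                i += 2;
            } else {
                i++; // event types this sketch does not model
            }
        }
        if (!open.isEmpty()) {
            throw new IllegalStateException("Unclosed element: " + open.peek().name);
        }
        return root;
    }
}
```

The stack of open elements is what turns the flat stream back into nesting: each start tag attaches to whatever is on top of the stack, and each end tag must match it.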
NXP has a class XML which is built by JACC. This contains inter alia
an Esis_Stdout object (implements Esis). There are several objects in XML
which are private and therefore not easily accessed - I think they should
have accessors, but at present I have subclassed it to PMRXML, which has
the requisite accessors.
My test program then creates a PMRXML object, and extracts the event stream
which is then passed to JUMBO's existing tree object:
NXP.PMRXML xml = new PMRXML(NXP.Streams.load_File(file, true));
pmr.chemime.ChemTree chemTree = new ChemTree(xml.getStreamVector());
pmr.sgml.GeneralTOC toc = chemTree.createGeneralTOC(3);
Comments: I have still to work out what whitespace NXP creates - there seems
to be a lot of content which is simply white. Maybe we have to address
COLLAPSE and KEEP at this stage? Also it isn't easy to extract certain
info - for example I had to hack XML.java to get the doctype - this isn't a good
idea and we need an accessor. I am also still not clear how NXP does (or should)
behave with:
<!DOCTYPE CML>
and <!DOCTYPE CML SYSTEM "cml.dtd">
(the default on the latter is to try to validate, I think, even if validate
is set to false. I'd prefer to be able to turn off validation, but I may have
missed something).
In general I'd like to be able to treat NXP as a black box, and subclass
my Esis object. That could mean passing it as an argument to XML, e.g.:
public class PMREsis implements Esis {
    public void open_tag(String name) {
        ...
    }
}
PMREsis esis = new PMREsis();
NXP.XML xml = new NXP.XML(esis, NXP.Streams.load_File(file, true));
pmr.sgml.SGMLTree tree = new pmr.sgml.SGMLTree(xml);
and so on.
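The same idea as a self-contained sketch that runs without NXP. Everything here is a stand-in: EsisLike borrows only the method names open_tag and close_tag from the post (the signatures are assumed), and BlackBoxParser is not a parser at all but simply replays a fixed nesting of element names, to show the handler being passed in and called back.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for NXP's Esis interface; signatures are assumed.
interface EsisLike {
    void open_tag(String name);
    void close_tag(String name);
}

// A handler owned by the application: it records events rather than printing them.
class RecordingEsis implements EsisLike {
    final List<String> events = new ArrayList<>();

    public void open_tag(String name) {
        events.add("_START_TAG");
        events.add(name);
    }

    public void close_tag(String name) {
        events.add("_END_TAG");
        events.add(name);
    }
}

// Hypothetical stand-in for the parser. It is given the handler at
// construction and never exposes its own internals.
class BlackBoxParser {
    private final EsisLike handler;

    BlackBoxParser(EsisLike handler) {
        this.handler = handler;
    }

    // Not real parsing: opens the given names in order, then closes them
    // in reverse, as if each element were nested inside the previous one.
    void parse(String... nestedNames) {
        for (String name : nestedNames) {
            handler.open_tag(name);
        }
        for (int i = nestedNames.length - 1; i >= 0; i--) {
            handler.close_tag(nestedNames[i]);
        }
    }
}
```

With this shape the caller never needs accessors on the parser: whatever it wants to keep, it keeps in its own handler object.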
NXP is a validating parser, but my DTDs are still struggling with Parameter
Entities so I have no experience here.
Lark
----
Lark creates a tree (called Lark) and provides a handler for
the user to pick up a variety of events (e.g. doDoctype(), doPI()). The
tree contains Elements ('Nodes') which have Attributes and a type (String).
Rather than subclassing these elements, I process Lark by iterating through
the Elements and creating a JUMBO SGMLTree (this can be delayed if required).
The tree seems complete, but I am not sure I have got all the doFOO routines
working correctly. I have also had problems with PIs (if the ?> delimiter
is used) - these may be mine.
Lark does not validate. However it is easy to interface and is fast.
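A rough sketch of that kind of conversion, copying a parser-side tree into an application-side one instead of subclassing the parser's elements. ParsedElement and AppNode are hypothetical stand-ins; they are not Lark's or JUMBO's actual classes, and the fields are assumed from the description above (a type string, attributes, children).

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for a parser-side element; not Lark's real class.
class ParsedElement {
    final String type;
    final Map<String, String> attributes = new LinkedHashMap<>();
    final List<ParsedElement> children = new ArrayList<>();

    ParsedElement(String type) {
        this.type = type;
    }
}

// Hypothetical application-side node, standing in for a JUMBO tree node.
class AppNode {
    final String type;
    final Map<String, String> attributes;
    final List<AppNode> children = new ArrayList<>();

    AppNode(String type, Map<String, String> attributes) {
        this.type = type;
        this.attributes = attributes;
    }
}

class TreeCopier {
    // Recursively copies an element and its descendants. Attributes are
    // copied into a new map so the two trees share no mutable state.
    static AppNode copy(ParsedElement source) {
        AppNode target = new AppNode(source.type, new LinkedHashMap<>(source.attributes));
        for (ParsedElement child : source.children) {
            target.children.add(copy(child));
        }
        return target;
    }
}
```

Because the copy is a separate pass over a finished tree, it can be run whenever the application-side tree is first needed rather than during parsing.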
General
-------
I do not use PIs myself though I shall start to do so. If they are
kept in the document tree, is there a convention where they live? (The last
opened element? What if they occur in PCDATA?).
I intend to make JUMBO available with both Lark and NXP but it's a bit creaky
at present and the interface is a bit slow. I have been told that the larger
the number of classes, the slower the program - any comments? Also I don't
know whether I should be deliberately garbage-collecting at this stage.
Any general thoughts would be welcome. I intend to bolt a crude search tool
into JUMBO along the TEI lines. I shall also see whether I can extract the
bits of NXP that do the validating, because then we have a crude validating
editor.
Any feedback from the current JUMBOs would be appreciated. (I already know
it's slow, and the graphics creak in several places :-)
P.
--
Peter Murray-Rust, domestic net connection
Virtual School of Molecular Sciences
http://www.vsms.nottingham.ac.uk/
xml-dev: A list for W3C XML Developers
Archived as: http://www.lists.ic.ac.uk/hypermail/xml-dev/
To unsubscribe, send to majordomo at ic.ac.uk the following message;
unsubscribe xml-dev
List coordinator, Henry Rzepa (rzepa at ic.ac.uk)