JUMBO

Peter Murray-Rust Peter at ursus.demon.co.uk
Mon Mar 24 09:51:08 GMT 1997


In message <3336AFF1.1A23 at edu.uni-klu.ac.at> "Norbert H. Mikula" writes:
> Peter Murray-Rust wrote:
> > NXP has an interface Esis, with functions such as open_tag, close_tag,
> > process_instruction, etc.  [I think they would be more properly called
> > start_element??].  
> 
> You are absolutely right !

I learnt the importance of precise terminology when Erik Naggum was a 
regular contributor to c.t.s.:-) :-)  He used to point out gently but firmly any
lapse in terminology.  The problem with SGML is that its terminology is
sufficiently different from other disciplines that people make guesses and
also don't realise the distinctions matter.  (I have been very guilty in
this respect).  However there are a number of areas where the distinction
is subtle - I still don't know if there is a difference between 'GI' and 
'Element type', for example.

There is no doubt that adherence to the agreed terminology is a key aspect 
of the API.

> 
> > JUMBO uses this to build up a Vector representing the
> > ESIS event stream, something like:
> > "_START_TAG" "CML"  AttributeList "_START_TAG" "MOL" ... "_END_TAG" "MOL"...
> > JUMBO then builds a tree out of this, adding attributes, etc.
> > 
> > NXP has a class XML which is built by JACC.  This contains inter alia
> > an Esis_Stdout object (implements Esis).  There are several objects in XML
> > which are private and therefore not easily accessed -
> 
> Would it be possible to send me a list of those objects ?

/** start of NXP/PMR list */
// from NXP with PMR comments

package NXP;

//... 

public class XML implements XMLConstants {

// PMR - I guess most of these would be valuable.
// note that unless they are 'protected' they can't be accessed 
// by a subclass from another package
// '//?' means that I don't know what they are for yet (I haven't spent
// time looking :-)
// '//+' means I need them
// '//+?' means I think I might need them :-)

//+?
  XMLCatalogMain catalog = null;
//?
  boolean start = true;
//?
  int state_counter = 0;
//+
  static protected boolean validate = false;
//+
  static protected boolean talkative = false;
//?
  final static int NO_SWITCH = -1;
  final static protected int ALL = 0;
  final static protected int INTERNAL = 1;
  final static protected int NONE = 2;
  static protected int rmd = ALL;
//+
  final static protected Esis_Stdout esis = new Esis_Stdout();
//+?
  final static protected Hashtable element_hash = new Hashtable(30);
//+?
  final static protected Hashtable open_element_hash = new Hashtable();
//+?
  final static protected Hashtable notation_hash = new Hashtable(5);
//+?
  final static protected Hashtable id_hash = new Hashtable(100);
//+?
  final static protected Hashtable idref_hash = new Hashtable(100);
//?
  static protected Element open_el = null;
//?
  static protected Vector att_val = new Vector();
//?
  final static protected Hashtable found_attributes = new Hashtable();
//?
  final static protected Hashtable gen_entity_hash = new Hashtable(10);
//?
  final static protected Hashtable par_entity_hash = new Hashtable(10);
//?
  final static protected Stack lexer_stack = new Stack();
//?
  final static protected Stack openel_stack = new Stack();
//?
  static protected String stop_external = null;
//?
  final static int GENERAL = 0;
  final static int PARAMETER = 1;
  final static Element NULL_ELEMENT = new Element();
//+
  static String base_url;
//+
  static String base_path;
//+
  static boolean base = true;
//+
  final static int URL_INPS = 0;
//+
  final static int FILE_INPS = 1;
//+
  static int input_stream;
//?
  final static Object DUMMY = new Object();
//+ (I had to add this to the XML code :-( )
  protected String pmrDoctype;

final void popTokenManager()
{
  XMLTokenManager tok_man = (XMLTokenManager) lexer_stack.pop();

  ReInit(tok_man);
}

//+ (Note that this is NOT accessible to a subclass, and as it is final
// cannot be overridden)
final void setCatalog(XMLCatalogMain catalog)
{
  this.catalog = catalog;
}

....
// This was my own class PMRXML, which I added to NXP.

package NXP;

import java.io.InputStream;

import java.util.Vector;

import NXP.Catalog.XMLCatalogMain;

public class PMRXML extends XML {
    public PMRXML(InputStream is) {
        super(is);
    }

    public void setTalkative(boolean t) {
        talkative = t;
    }

    public void setValidate(boolean t) {
        validate = t;
    }

    public static int FILE_INPS() {
        return XML.FILE_INPS;
    }

    public static int URL_INPS() {
        return XML.URL_INPS;
    }

    public void setBaseUrl(String u) {
        base_url = u;
    }

    public String getBaseUrl() {
        return base_url;
    }

    public void setBasePath(String p) {
        base_path = p;
    }

    public String getBasePath() {
        return base_path;
    }

    public void setBase(boolean b) {
        base = b;
    }

    public void setInputStream(int is) {
        input_stream = is;
    }

// the 'junk' was to avoid the same signature as setCatalog above
// which is 'final'

    public void setCatalog(XMLCatalogMain c, String junk) {
        this.catalog = c;
    }

    public Vector getStreamVector() {
        return esis.vector;
    }

    public String getDoctype() {
// this was just to get it to run.
        if (pmrDoctype == null) pmrDoctype = "CML";
        return pmrDoctype;
    }
}
/** end of NXP/PMR */
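
The 'junk' trick in the listing above relies on a general Java rule: a final method cannot be overridden, but a method with the same name and a different parameter list is an overload and is allowed. A minimal sketch, in which Base and Derived are hypothetical stand-ins for XML and PMRXML and a String replaces XMLCatalogMain:

```java
// Hypothetical stand-ins only; none of these classes are part of NXP.
class OverloadDemo {
    public static void main(String[] args) {
        Derived d = new Derived();
        d.setCatalog("my.catalog", "junk");
        System.out.println(d.getCatalog());
    }
}

class Base {
    // Package-private field, like 'catalog' in the listing above.
    String catalog = null;

    // final: a subclass cannot declare a method with this exact signature.
    final void setCatalog(String catalog) {
        this.catalog = catalog;
    }
}

class Derived extends Base {
    // The extra parameter changes the signature, so this is an overload
    // (not an override) and compiles even though Base.setCatalog is final.
    public void setCatalog(String c, String junk) {
        // Direct field access works only because Derived shares Base's package.
        this.catalog = c;
    }

    public String getCatalog() {
        return catalog;
    }
}
```

A differently named public method that delegates to the final one would probably read better than a dummy argument, but both depend on the subclass living in the same package as its parent.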
> 
[...]
> > 
> > Comments:  I have still to work out what whitespace NXP creates - there seems
> > to be a lot of content which is simply white.  Maybe we have to address
> > COLLAPSE and KEEP at this stage?  
> 
> As soon as I know how the standard defines the treatment of whitespace
> in all those scenarios (for instance with a DTD, without a DTD, in
> element content, etc.), I will implement it that way. (I admit that the
> whitespace is really annoying, but I didn't want to waste my time with
> experiments.)

Agreed.  I have (pragmatically) deleted all elements from NXP which consist
only of whitespace.  (This is because my DTDs are biassed to this, since the
chance of getting a molecular scientist to know and love the SGML
whitespace/RE/RS rules is outwith the 2nd law of thermodynamics.)
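
A minimal sketch of that pragmatic filter, assuming the content chunks arrive as plain Strings in a Vector. The class and method names are hypothetical, and the real JUMBO stream also carries tag markers and attribute lists, which are ignored here:

```java
import java.util.Vector;

// Hypothetical helper; not part of NXP or JUMBO.
class WhitespaceFilter {
    public static void main(String[] args) {
        Vector<String> content = new Vector<String>();
        content.add("\n  ");
        content.add("C6H6");
        content.add("\t\n");
        content.add("benzene");
        System.out.println(dropWhitespace(content));
    }

    // True if s contains nothing but space, tab, CR or LF.
    // An empty string also counts as whitespace-only.
    static boolean isAllWhitespace(String s) {
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c != ' ' && c != '\t' && c != '\r' && c != '\n') {
                return false;
            }
        }
        return true;
    }

    // Returns a new Vector without the whitespace-only chunks;
    // the input Vector is left unchanged.
    static Vector<String> dropWhitespace(Vector<String> in) {
        Vector<String> out = new Vector<String>();
        for (String s : in) {
            if (!isAllWhitespace(s)) {
                out.add(s);
            }
        }
        return out;
    }
}
```

This discards whitespace regardless of content model, so it is only safe for DTDs where whitespace-only content is never significant.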

> 
> > Also it isn't easy to extract certain
> > info - for example I had to hack XML.java to get the doctype - this isn't a good
> > idea and we need an accessor.  
> 
> People didn't seem to be too interested in my idea of an interface for
> passing along a complete grove. At least I didn't get too much 
> feedback.

(a) some people (e.g. me) didn't know what a complete grove was :-)
(b) I think we were worried about overkill before we had got the plane off
the ground.
(c) I am not sure I would recognise a doctype within a complete grove :-)
Most of the names seemed to have come out of a FORTRAN program (i.e. 6 
consonants).

> 
> > I am also still not clear how NXP does (or should)
> > behave with:
> > <!DOCTYPE CML>
> > and <!DOCTYPE CML SYSTEM "cml.dtd">
> > (the default on the latter is to try to validate, I think, even if validate
> > is set to false.  I'd prefer to be able to turn off validation, but I may have
> > missed something).
> 
> I will check it. Thanks for pointing it out to me !

Great.

> 
> >         In general I'd like to be able to treat NXP as a black box, and subclass
> > my Esis object.  That could mean passing it as an argument to XML, e.g.:
> > 
> > public class PMREsis implements Esis {
> >     public void open_tag(String name) {
> > ...
> >     }
> > }
> > 
> >     PMREsis esis = new PMREsis();
> >     NXP.XML xml = new NXP.XML(esis, NXP.Streams.load_File(file, true));
> >     pmr.sgml.SGMLTree tree = new pmr.sgml.SGMLTree(xml);
> 
> That's the basic idea that I had in mind. We really must continue with
> working on our unified interface for XML/Java based applications.

Splendid.

NXP has behaved fine on my document instances (but they aren't torturing it!).
It also seems to be fast - at least *much* faster than my own stuff. 
Partly that is because my code builds a tree, which I think hammers the
memory, so anything that helps at parse time would be useful.
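
A rough sketch of the black-box idea: the parser calls an event handler, and the handler builds the tree directly with a stack instead of first storing a flat Vector. The method names open_tag and close_tag are borrowed from the discussion above, but the EventHandler interface, its signatures, and the Node and TreeBuilder classes are assumptions for illustration, not NXP's actual Esis interface:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Stack;

// All types here are hypothetical; they only illustrate the callback pattern.
class TreeDemo {
    public static void main(String[] args) {
        TreeBuilder builder = new TreeBuilder();
        // Events a parser might fire for <CML><MOL></MOL></CML>
        builder.open_tag("CML");
        builder.open_tag("MOL");
        builder.close_tag("MOL");
        builder.close_tag("CML");
        System.out.println(builder.getRoot().describe());
    }
}

interface EventHandler {
    void open_tag(String name);
    void close_tag(String name);
}

class Node {
    final String name;
    final List<Node> children = new ArrayList<Node>();

    Node(String name) {
        this.name = name;
    }

    // Renders the subtree as name(child,child,...).
    String describe() {
        StringBuilder sb = new StringBuilder(name);
        if (!children.isEmpty()) {
            sb.append('(');
            for (int i = 0; i < children.size(); i++) {
                if (i > 0) sb.append(',');
                sb.append(children.get(i).describe());
            }
            sb.append(')');
        }
        return sb.toString();
    }
}

class TreeBuilder implements EventHandler {
    private final Stack<Node> open = new Stack<Node>();  // currently open elements
    private Node root = null;

    public void open_tag(String name) {
        Node n = new Node(name);
        if (open.isEmpty()) {
            if (root != null) {
                throw new IllegalStateException("second root element: " + name);
            }
            root = n;
        } else {
            open.peek().children.add(n);  // attach to the innermost open element
        }
        open.push(n);
    }

    public void close_tag(String name) {
        if (open.isEmpty() || !open.peek().name.equals(name)) {
            throw new IllegalStateException("unbalanced end tag: " + name);
        }
        open.pop();
    }

    Node getRoot() {
        return root;
    }
}
```

Attributes, character data and processing instructions are left out; a fuller handler would need one callback for each.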

<WG>
The only thing I can recall that the WG might consider is character entities.
NXP announces that &gt; cannot be resolved.  My own feeling is that parsers
should be at liberty to insert these as a default option (perhaps a
command-line switch such as '-e', meaning: assume that <!ENTITY gt '>'> is
included).
</WG>
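
One way the default option suggested above could behave, as a sketch only: the parser keeps a small table of assumed entities and consults it when a name has not been declared and the switch is on. The class, the resolve method and the assumeDefaults flag are hypothetical; only 'gt' comes from the message, and 'lt' and 'amp' are added purely for illustration:

```java
import java.util.Hashtable;

// Hypothetical sketch; not NXP code.
class DefaultEntities {
    static final Hashtable<String, String> DEFAULTS = new Hashtable<String, String>();
    static {
        DEFAULTS.put("gt", ">");   // the case mentioned in the message
        DEFAULTS.put("lt", "<");   // assumed, for illustration
        DEFAULTS.put("amp", "&");  // assumed, for illustration
    }

    public static void main(String[] args) {
        Hashtable<String, String> declared = new Hashtable<String, String>();
        System.out.println(resolve("gt", declared, true));
        System.out.println(resolve("gt", declared, false));
    }

    // Declared entities always win; the defaults are consulted only when
    // the (hypothetical) '-e' style switch is on. Returns null if unresolved.
    static String resolve(String name, Hashtable<String, String> declared,
                          boolean assumeDefaults) {
        String value = declared.get(name);
        if (value == null && assumeDefaults) {
            value = DEFAULTS.get(name);
        }
        return value;
    }
}
```

Letting declared entities override the defaults means a document that does declare gt is treated exactly as before.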


-- 
Peter Murray-Rust, domestic net connection
Virtual School of Molecular Sciences
http://www.vsms.nottingham.ac.uk/

xml-dev: A list for W3C XML Developers
Archived as: http://www.lists.ic.ac.uk/hypermail/xml-dev/
To unsubscribe, send to majordomo at ic.ac.uk the following message;
unsubscribe xml-dev
List coordinator, Henry Rzepa (rzepa at ic.ac.uk)



