SAX: New Idea for Entity Resolution

David Megginson ak117 at freenet.carleton.ca
Sat Apr 18 16:19:42 BST 1998


James Clark writes:

 > You could just have a class that encapsulates a structure with three
 > members:
 > 
 > - a CharacterStream
 > - a ByteStream
 > - a String
 > 
 > At least one of the CharacterStream and ByteStream must be non-null. If
 > the ByteStream is non-null the String can specify the encoding.

[Read on to the bottom for a large-ish design change.]

This implies, then, the following three interfaces:

  public interface ByteStream {
    public abstract int read ()
      throws SAXException;
    public abstract int read (byte b[], int start, int count)
      throws SAXException;
  }

  public interface CharacterStream {
    public abstract int read ()
      throws SAXException;
    public abstract int read (char ch[], int start, int count)
      throws SAXException;
  }

  public class InputSource {
    // For each variable, imagine a get/set pair instead...
    public ByteStream byteStream;
    public CharacterStream characterStream;
    public String encoding;
  }

The nice thing here is that all of these can live on separate systems
in a distributed environment: the InputSource can be a C program on a
VAX, the CharacterStream can come from a Python program running under
Alpha Linux, and the parser can be running in Java on a Windows box.
There is no dependency on language- or system-specific features
(except for java.lang.String, which should be able to map predictably
to other languages).
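
As a rough illustration of how little the ByteStream interface asks of
an implementation, here is a minimal sketch of an adapter that exposes
an ordinary java.io.InputStream as a ByteStream.  The class name
InputStreamByteStream is hypothetical, and the sketch assumes that
read() returns -1 at the end of the stream, as java.io does; the
proposal above does not say so.

```java
import java.io.IOException;
import java.io.InputStream;
import org.xml.sax.SAXException;

// The ByteStream interface as proposed in this message.
interface ByteStream {
  int read () throws SAXException;
  int read (byte b[], int start, int count) throws SAXException;
}

// Hypothetical adapter (not part of the proposal): wraps a
// java.io.InputStream and reports I/O failures as SAXException.
class InputStreamByteStream implements ByteStream {
  private final InputStream in;

  InputStreamByteStream (InputStream in) {
    this.in = in;
  }

  // Read one byte; assumed to return -1 at end of stream.
  public int read () throws SAXException {
    try {
      return in.read();
    } catch (IOException e) {
      throw new SAXException(e);
    }
  }

  // Read up to count bytes into b, beginning at index start.
  public int read (byte b[], int start, int count) throws SAXException {
    try {
      return in.read(b, start, count);
    } catch (IOException e) {
      throw new SAXException(e);
    }
  }
}
```

An application could then hand the parser something like
new InputStreamByteStream(new java.io.FileInputStream("doc.xml"))
without the parser ever seeing a java.io type.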

Now, why not take this a step further?

  public class InputSource {
    // For each variable, imagine a get/set pair instead...
    public String publicId;
    public String systemId;
    public ByteStream byteStream;
    public CharacterStream characterStream;
    public String encoding;
  }

We'd have to define rules of precedence:

1) if there is a character stream, use it;

2) if there is no character stream but there is a byte stream, use the
   byte stream;

3) if there is neither a character stream nor a byte stream but there
   is a system identifier, open a connection to the system identifier;

4) if there is no character stream, byte stream, or system identifier,
   throw an exception (or invoke the ErrorHandler).
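
The four rules above could be sketched roughly as follows.  The
SourceSelector class and its Kind enum are hypothetical names used
only for illustration, and the sketch throws a SAXException for rule 4
rather than invoking the ErrorHandler, which is just one of the two
options mentioned.

```java
import org.xml.sax.SAXException;

// Interfaces and class as proposed in this message.
interface ByteStream {
  int read () throws SAXException;
  int read (byte b[], int start, int count) throws SAXException;
}

interface CharacterStream {
  int read () throws SAXException;
  int read (char ch[], int start, int count) throws SAXException;
}

class InputSource {
  public String publicId;
  public String systemId;
  public ByteStream byteStream;
  public CharacterStream characterStream;
  public String encoding;
}

// Hypothetical helper: decides which member of an InputSource the
// parser should read from, following the precedence rules.
class SourceSelector {
  enum Kind { CHARACTER_STREAM, BYTE_STREAM, SYSTEM_ID }

  static Kind choose (InputSource source) throws SAXException {
    if (source.characterStream != null) {
      return Kind.CHARACTER_STREAM;          // rule 1
    }
    if (source.byteStream != null) {
      return Kind.BYTE_STREAM;               // rule 2
    }
    if (source.systemId != null) {
      return Kind.SYSTEM_ID;                 // rule 3
    }
    // rule 4: nothing to read from
    throw new SAXException(
      "InputSource has no character stream, byte stream, or system identifier");
  }
}
```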

Now, we can get away with only one parse() method in
org.xml.sax.Parser:

  public abstract void parse (InputSource source)
    throws Exception;

It might still be useful to keep two separate methods in
EntityResolver, though:

  public interface EntityResolver
  {
    public String resolveSystemId (String publicId, String systemId)
      throws SAXException;
    public InputSource openEntity (String systemId)
      throws Exception;
  }
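
To make the split between the two methods concrete, here is a minimal
sketch of an application-side resolver that redirects known public
identifiers to local copies.  CatalogResolver and addEntry are
hypothetical names, the InputSource is cut down to the two fields the
sketch needs, and openEntity simply returns a source with only the
system identifier set, so that the parser would open the connection
itself under rule 3.

```java
import java.util.HashMap;
import java.util.Map;
import org.xml.sax.SAXException;

// Reduced to the fields this sketch needs.
class InputSource {
  public String publicId;
  public String systemId;
}

// The EntityResolver interface as proposed in this message.
interface EntityResolver {
  String resolveSystemId (String publicId, String systemId)
    throws SAXException;
  InputSource openEntity (String systemId)
    throws Exception;
}

// Hypothetical resolver: maps public identifiers to local system
// identifiers and leaves everything else untouched.
class CatalogResolver implements EntityResolver {
  private final Map<String, String> catalog = new HashMap<>();

  void addEntry (String publicId, String localSystemId) {
    catalog.put(publicId, localSystemId);
  }

  // Return the local system identifier for a known public identifier,
  // or the original system identifier if there is no catalog entry.
  public String resolveSystemId (String publicId, String systemId) {
    String local = (publicId == null) ? null : catalog.get(publicId);
    return (local != null) ? local : systemId;
  }

  // No streams are supplied; only the system identifier is filled in.
  public InputSource openEntity (String systemId) {
    InputSource source = new InputSource();
    source.systemId = systemId;
    return source;
  }
}
```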

Comments?


All the best,


David

-- 
David Megginson                 ak117 at freenet.carleton.ca
Microstar Software Ltd.         dmeggins at microstar.com
      http://home.sprynet.com/sprynet/dmeggins/
