SAX: New Idea for Entity Resolution

James Clark jjc at jclark.com
Sun Apr 19 07:48:13 BST 1998


David Megginson wrote:
> 
> James Clark writes:
> 
>  > You could just have a class that encapsulates a structure with three
>  > members:
>  >
>  > - a CharacterStream
>  > - a ByteStream
>  > - a String
>  >
>  > At least one of the CharacterStream and ByteStream must be non-null. If
>  > the ByteStream is non-null the String can specify the encoding.
> 
> [Read on to the bottom for a large-ish design change.]
> 
> This implies, then, the following three interfaces:
> 
>   public interface ByteStream {
>     public abstract int read ()
>       throws SAXException;
>     public abstract int read (byte b[], int start, int count)
>       throws SAXException;
>   }
> 
>   public interface CharacterStream {
>     public abstract int read ()
>       throws SAXException;
>     public abstract int read (char ch[], int start, int count)
>       throws SAXException;
>   }

Why are the single-character read calls there?  They unnecessarily
complicate the interface.
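
As a rough sketch of what a bulk-only interface might look like, the
following keeps just the array-based read and builds a single-character
read on top of it for callers who want one. The names BulkReadSketch,
readOne and fromString are hypothetical, the -1 end-of-input convention
is assumed, and the quoted "throws SAXException" clause is left out so
that the example stands alone.

```java
public class BulkReadSketch {
    /** Hypothetical bulk-only variant of the quoted CharacterStream. */
    interface CharacterStream {
        // Assumed convention: returns the number of chars read, or -1 at end of input.
        int read(char[] ch, int start, int count);
    }

    /** Single-character read derived from the bulk call (hypothetical helper). */
    static int readOne(CharacterStream in) {
        char[] one = new char[1];
        int n;
        do {
            n = in.read(one, 0, 1);   // retry if the stream returned zero chars
        } while (n == 0);
        return n < 0 ? -1 : one[0];
    }

    /** Small in-memory stream used only to exercise the sketch. */
    static CharacterStream fromString(final String s) {
        return new CharacterStream() {
            private int pos = 0;

            public int read(char[] ch, int start, int count) {
                if (pos >= s.length()) {
                    return -1;        // end of input
                }
                int n = Math.min(count, s.length() - pos);
                s.getChars(pos, pos + n, ch, start);
                pos += n;
                return n;
            }
        };
    }

    public static void main(String[] args) {
        CharacterStream in = fromString("ab");
        int c;
        while ((c = readOne(in)) != -1) {
            System.out.println((char) c);
        }
    }
}
```

With this shape, an implementor writes one method per stream, and the
single-character convenience lives in one place outside the interface.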

>   public class InputSource {
>     // For each variable, imagine a get/set pair instead...
>     public ByteStream byteStream;
>     public CharacterStream characterStream;
>     public String encoding;
>   }
> 
> The nice thing here is that all of these can live on separate systems
> in a distributed environment: the InputSource can be a C-program on a
> VAX, the CharacterStream can come from a Python program running under alpha
> Linux, and the parser can be running in Java on a Windows box.  There
> is no dependency on language- or system-specific features (except for
> java.lang.String, which should be able to map predictably to other
> languages).
> 
> Now, why not take this a step further?
> 
>   public class InputSource {
>     // For each variable, imagine a get/set pair instead...
>     public String publicId;
>     public String systemId;
>     public ByteStream byteStream;
>     public CharacterStream characterStream;
>     public String encoding;
>   }
> 
> We'd have to define rules of precedence:
> 
> 1) if there is a character stream, use it;
> 
> 2) if there is no character stream but there is a byte stream, use the
>    byte stream;
> 
> 3) if there is neither a character stream nor a byte stream but there
>    is a system identifier, open a connection to the system identifier;
> 
> 4) if there is no character stream, byte stream, or system identifier,
>    throw an exception (or invoke the ErrorHandler).
> 
> Now, we can get away with only one parse() method in
> org.xml.sax.Parser:
> 
>   public abstract void parse (InputSource source)
>     throws Exception;

I don't think this is a good idea: it makes SAX harder to use in the
simple case of reading from a URL.
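
To make the trade-off concrete, here is a minimal sketch of the four
precedence rules quoted above. It is an illustration under assumptions,
not the proposed SAX API: java.io.Reader and java.io.InputStream stand
in for the CharacterStream and ByteStream interfaces, and the names
PrecedenceSketch, Origin, resolve and forSystemId are hypothetical.

```java
import java.io.InputStream;
import java.io.Reader;

public class PrecedenceSketch {
    /** Stand-in for the quoted InputSource; java.io types replace the proposed streams. */
    static class InputSource {
        String publicId;
        String systemId;
        InputStream byteStream;
        Reader characterStream;
        String encoding;
    }

    /** Which member the parser would read from (hypothetical enum). */
    enum Origin { CHARACTER_STREAM, BYTE_STREAM, SYSTEM_ID }

    /** Applies the quoted rules 1 to 4 in order. */
    static Origin resolve(InputSource source) {
        if (source.characterStream != null) {
            return Origin.CHARACTER_STREAM;   // rule 1
        }
        if (source.byteStream != null) {
            return Origin.BYTE_STREAM;        // rule 2
        }
        if (source.systemId != null) {
            return Origin.SYSTEM_ID;          // rule 3
        }
        // rule 4: nothing to read from
        throw new IllegalArgumentException("InputSource has no stream and no system identifier");
    }

    /** The extra step a caller needs when all they have is a URL. */
    static InputSource forSystemId(String systemId) {
        InputSource source = new InputSource();
        source.systemId = systemId;
        return source;
    }

    public static void main(String[] args) {
        InputSource source = forSystemId("http://example.com/doc.xml");
        System.out.println(resolve(source));
    }
}
```

Under the single-method design, a caller with only a URL has to build
the InputSource first (forSystemId here) before calling parse.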

James

