Announcement: SAX Java Implementation (pre-release)

David Megginson ak117 at freenet.carleton.ca
Sun Apr 12 23:45:22 BST 1998


James Clark writes:
 > David Megginson wrote:
 > 
 > > I have put together a new, beta version of SAX with quite a few
 > > changes
 > 
 > This looks good.  I have some nits:

Actually, these are very good points, all of which deserve detailed
answers, and several of which are so self-evidently correct that I'd
like to make the changes right away.

Could anyone implementing SAX (on either the parser or application
side) please read through this entire reply?  There are several points
where I'd appreciate feedback.

 > 1. Why has a SAX prefix been added to all classes?

There are a few benefits to this decision:

1. Programmers can import SAX classes into their own namespaces
   with less danger of collision (they will often have their
   own "Parser" and "DocumentHandler" classes).  Experienced programmers
   might snort at this, but I have had several messages from people
   who couldn't understand why their code wasn't compiling properly.

2. I save a lot of time that I would have had to spend helping people
   who still had the old Java SAX classes somewhere on their
   CLASSPATH.

3. Porting to languages like C, which don't have namespaces, becomes
   clearer (though the use of overloading in SAXParser will still
   cause trouble -- perhaps I should fix that before the final release).

 > 2. For consistency with SAXException, in SAXLocator getSystemId should
 > return null if no system id is available, and getLineNumber,
 > getColumnNumber should similarly return -1 if no line or column
 > number is available.

Absolutely correct -- I will change the documentation.
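
A minimal sketch of how an application might cope with these "not
available" values.  PositionSource, FixedPosition and PositionFormat
are hypothetical names used only for illustration; they are not part
of the SAX distribution.

```java
// Hypothetical stand-in for a locator that follows the conventions above.
interface PositionSource {
    String getSystemId();   // null if no system id is available
    int getLineNumber();    // -1 if no line number is available
    int getColumnNumber();  // -1 if no column number is available
}

// Simple fixed-value implementation, used here only to exercise the formatter.
final class FixedPosition implements PositionSource {
    private final String systemId;
    private final int line;
    private final int column;

    FixedPosition(String systemId, int line, int column) {
        this.systemId = systemId;
        this.line = line;
        this.column = column;
    }

    public String getSystemId() { return systemId; }
    public int getLineNumber() { return line; }
    public int getColumnNumber() { return column; }
}

final class PositionFormat {
    /** Builds "systemId:line:column", leaving out whatever is unavailable. */
    static String describe(PositionSource loc) {
        StringBuilder sb = new StringBuilder();
        String id = loc.getSystemId();
        sb.append(id != null ? id : "(unknown source)");
        if (loc.getLineNumber() != -1) {
            sb.append(':').append(loc.getLineNumber());
            // A column is only meaningful when the line is known.
            if (loc.getColumnNumber() != -1) {
                sb.append(':').append(loc.getColumnNumber());
            }
        }
        return sb.toString();
    }
}
```

With this sketch, a position of ("doc.xml", 3, 1) is reported as
"doc.xml:3:1", and a position with nothing available is reported as
"(unknown source)".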

 > 3. The interface for reading character streams needs more
 > specification if it is to be interoperable.

 > a) There's a critical ambiguity in the concept of a character stream:
 > a Java concept of a char does not correspond to the XML concept of a
 > character. A character outside the BMP is a single XML character but
 > is represented by a pair of Java chars.  If you want to use the Java
 > Reader interface, then a character stream must be a stream not of
 > characters in the XML sense but in the Java sense.  I don't have the
 > Unicode standard handy, but it has precisely defined terms for these
 > two different things; I suggest referencing the Unicode standard and
 > using the appropriate term.

The real challenge here is to define the level of interoperability
that we need.  My first impulse is to leave "byte stream" and
"character stream" deliberately undefined, so that each language can
use its native implementation (if one is available).  I think most
users will find life easier if in Java, for example, they can use
java.io.InputStream for a byte stream and java.io.Reader for a
character stream; C++ programmers can use istream for a byte stream
and whatever the ANSI committee is considering for character streams;
etc., etc.

This is somewhat messy, since (as you correctly point out) the exact
behaviour becomes language-specific (that is one reason that I didn't
include these in the first pass); I am reluctant, though, to create
SAXByteStream and SAXCharacterStream, and to force everyone to use 
wrappers for their InputStream/Reader-type classes.

What does everyone else think about this point?  Is this a good case
for pragmatism over logical consistency, or am I introducing an ugly
kludge that will come back to haunt us all?
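
To make the ambiguity in (a) concrete, here is a small sketch of the
difference between counting Java chars and counting XML characters.
It assumes a JDK that provides String.codePointCount and
Character.toChars, which were added in later Java releases; CharCount
is a hypothetical name.

```java
// Sketch: one XML character outside the BMP occupies two Java chars.
final class CharCount {
    /** Number of Java chars (16-bit code units) in the string. */
    static int javaChars(String s) {
        return s.length();
    }

    /** Number of XML characters (Unicode code points) in the string. */
    static int xmlChars(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        // U+10000 is the first character outside the BMP.
        String s = "a" + new String(Character.toChars(0x10000));
        System.out.println(javaChars(s)); // prints 3
        System.out.println(xmlChars(s));  // prints 2
    }
}
```

A parser reading from java.io.Reader sees the first count, so line and
column bookkeeping and name-character checks have to recombine the
pair themselves.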

 > b) Is it legal for a byte order mark character to be present at the
 > start of the character stream? The right answer is that it should not
 > be legal: this should be stripped out in the byte to character
 > conversion process.

This is a tricky point.  I had planned to leave it in -- what is the
default behaviour for java.io.Reader (and for other languages with
character streams)?
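
As a point of reference for Java: as far as I am aware, an
InputStreamReader decoding UTF-8 passes a leading U+FEFF through to
the caller rather than removing it.  The sketch below checks that and
shows one way a parser or application could strip the mark itself.
BomCheck and its methods are hypothetical names, and
StandardCharsets comes from a later JDK than the one current as this
is written.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.PushbackReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

final class BomCheck {
    /** Returns a reader positioned after a leading U+FEFF, if one is present. */
    static Reader skipBom(Reader in) throws IOException {
        PushbackReader pr = new PushbackReader(in, 1);
        int first = pr.read();
        if (first != -1 && first != 0xFEFF) {
            pr.unread(first); // not a byte order mark: put it back
        }
        return pr;
    }

    /** Decodes the bytes as UTF-8 and returns the first char the reader delivers. */
    static int firstChar(byte[] bytes, boolean strip) {
        try {
            Reader r = new InputStreamReader(
                    new ByteArrayInputStream(bytes), StandardCharsets.UTF_8);
            if (strip) {
                r = skipBom(r);
            }
            return r.read();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Given the bytes EF BB BF followed by '<', firstChar returns 0xFEFF
when strip is false and '<' when strip is true.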

 > c) How does this interact with the encoding declaration in the XML
 > document?  The docs should say that it's legal for the character
 > stream to include an encoding declaration and it doesn't matter what
 > encoding it specifies.

I'd think that it should be ignored under these circumstances, since
the characters are already decoded (though again, in an underspecified
way -- are we dealing with UCS-2, UCS-4, or UTF-16?).

 > 4. The doc for SAXDTDHandler should say that the order in which DTD
 > events are fired is unspecified except that they will all be fired
 > after startDocument and before startElement.

Thanks.  I will change this.

 > 5. Maybe the name of SAXDTDHandler should be changed to reflect the
 > fact that it is not attempting to be a complete DTD interface.  Some
 > future version of SAX might provide optional support for full DTDs and
 > it would be nice to be able to use the name SAXDTDHandler as the name
 > for that.

I thought for a while about this -- SAXDocumentHandler also provides
only partial document information, so I was thinking that we would
have something like

  public interface SAX2DocumentHandler extends SAXDocumentHandler {
  }

and

  public interface SAX2DTDHandler extends SAXDTDHandler {
  }

in SAX2 (any suggestions for a better prefix than "SAX2" will be
gratefully acknowledged).

 > 6. I strongly object to including the name argument in
 > SAXEntityResolver.resolveEntity.  There's nothing in XML that says
 > that the name should be used in resolving an entity and so there's no
 > reason to suppose a parser will make it available.  I also think it's
 > wrong in principle to make use of it.  This business with "[document]"
 > and "[dtd]" is gross. At the very least the spec should say that name
 > may be null if this information is not available.

I'm neutral on this point, though I do agree that "[document]" and
"[dtd]" are ugly.  Does anyone object to the removal of the name
argument?

 > 7. Is the first character on the line at column 0 or column 1? (GNU
 > Emacs says column 0, but others say column 1.)  The docs need to make
 > this clear.

The first character is in column 1.  I will fix the docs.

 > 8. I don't think SAXException.getLocalizedMessage is the right
 > approach to internationalization.  Although the JDK does have
 > Throwable.getLocalizedMessage, as far as I can tell nothing uses it
 > and it's not at all convenient.  It would be better to have a
 > setLocale(Locale locale) method on SAXParser that specified the locale
 > in which messages should be returned.  This is the approach that is
 > used in AWT.  In any case SAXException.getLocalizedMessage is entirely
 > redundant since SAXException has Throwable as an indirect superclass,
 > and Throwable includes an identical definition of getLocalizedMessage.

This was a last-minute addition before release: the redundancy is
deliberate, since non-Java implementations will not inherit
getLocalizedMessage.  I will gladly bend to the will of the
localisation experts (or at least, cognoscenti) on this list -- if
SAXParser.setLocale() is a better approach, then I am happy to use it.
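
For comparison, the setLocale approach might look roughly like the
sketch below.  LocalizedMessages and its two hard-coded messages are
hypothetical and stand in for whatever message catalogue a real
parser would use; nothing here is part of the current SAX interfaces.

```java
import java.util.Locale;

// Hypothetical parser-side component: the locale is set once, and every
// message produced afterwards is returned in that locale.
final class LocalizedMessages {
    private Locale locale = Locale.ENGLISH;

    /** Selects the locale in which later messages are returned. */
    void setLocale(Locale locale) {
        if (locale == null) {
            throw new IllegalArgumentException("locale must not be null");
        }
        this.locale = locale;
    }

    /** Returns the "not well-formed" message in the current locale. */
    String notWellFormed() {
        // Tiny in-memory catalogue, for illustration only.
        if ("de".equals(locale.getLanguage())) {
            return "Dokument ist nicht wohlgeformt";
        }
        return "Document is not well-formed";
    }
}
```

The application calls setLocale once, and exceptions then need only
the ordinary getMessage.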

 > 9. I think SAXHandlerBase.error and SAXHandlerBase.warning should be
 > no-ops like almost all the other methods.  Having the default be to
 > print messages on System.err introduces a command-line bias that seems
 > inappropriate to me.  In addition using a PrintStream (which
 > System.err is) is irretrievably broken from an internationalization
 > perspective, as is made clear in the PrintStream docs.

And even worse, it's not clear how useful printing to STDERR is when
the parser is running in a distributed environment.  I agree fully,
and will change the default behaviour.
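
A rough sketch of the changed default, using hypothetical names
(ErrorCallbacks, QuietHandlerBase, CollectingHandler) and simplified
String arguments rather than the real SAX signatures: the base class
does nothing, and an application that wants reports overrides only
what it needs.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, simplified callback interface.
interface ErrorCallbacks {
    void warning(String message);
    void error(String message);
}

// Default implementation: both callbacks are no-ops, so nothing is
// written to System.err unless the application asks for it.
class QuietHandlerBase implements ErrorCallbacks {
    public void warning(String message) { }
    public void error(String message) { }
}

// An application that cares about errors overrides just that method
// and decides for itself where the messages go.
final class CollectingHandler extends QuietHandlerBase {
    final List<String> errors = new ArrayList<>();

    @Override
    public void error(String message) {
        errors.add(message);
    }
}
```

Here warnings are silently dropped, and errors end up in a list that a
GUI or a server can present however it sees fit.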


All the best, and thank you very much for the feedback.


David

-- 
David Megginson                 ak117 at freenet.carleton.ca
Microstar Software Ltd.         dmeggins at microstar.com
      http://home.sprynet.com/sprynet/dmeggins/

xml-dev: A list for W3C XML Developers. To post, mailto:xml-dev at ic.ac.uk
Archived as: http://www.lists.ic.ac.uk/hypermail/xml-dev/
To (un)subscribe, mailto:majordomo at ic.ac.uk the following message;
(un)subscribe xml-dev
To subscribe to the digests, mailto:majordomo at ic.ac.uk the following message;
subscribe xml-dev-digest
List coordinator, Henry Rzepa (mailto:rzepa at ic.ac.uk)



