Announcement: SAX 1.0gamma

James Clark jjc at jclark.com
Sun May 3 18:47:28 BST 1998


David Megginson wrote:
> 
> James Clark writes:

>  > It should be specified whether a byte order mark at the beginning
>  > of an XML byte stream is included as part of the character stream.
>  > I don't think it should be, since the byte-order mark isn't included in
>  > the XML document production, and the XML spec says explicitly that
>  > the byte order mark "is an encoding signature, not part of either
>  > the markup or the character data of the XML document".
> 
> My first hunch is the opposite: the XML productions deal with
> characters, not bytes.  When I provide a raw byte stream
> (java.io.InputStream), I'm requiring the XML parser to take on two
> logical tasks:
> 
> 1) convert the bytes to characters
> 
> 2) apply the XML productions to the characters
> 
> You have already mentioned that, unlike many XML parsers (including
> AElfred), XP does not perform these as independent, serial steps;
> conceptually, however, the tasks are still distinct.  The BOM is part
> of the raw byte stream, but not part of the character stream.
> 
> I think that it also simplifies the Java implementation if the parser can
> behave the same way with an InputStream from a URLConnection and an
> InputStream supplied explicitly by an application.

I'm a bit confused by your reply.  You say you're disagreeing with me,
but the points you make don't seem to contradict my suggestion. I agree
there are conceptually two stages.  My point is that the BOM bytes are
removed as part of the first stage because they are part of the encoding
signature, not part of the sequence of characters that matches the
document production.  Thus the InputStream should include the BOM bytes,
but the Reader shouldn't include the 0xFEFF character.
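
To make the distinction concrete, here is a minimal sketch of the first
stage, written against the standard java.io classes.  The class and method
names (BomSketch, openReader) are hypothetical and are not part of SAX or
XP.  As a simplifying assumption it only recognizes the two UTF-16 byte
order marks and falls back to UTF-8 otherwise; a real parser would also
have to honor the encoding declaration.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.PushbackInputStream;
import java.io.Reader;

// Hypothetical helper, for illustration only.
final class BomSketch {

    /**
     * Wraps a raw byte stream in a Reader.  A leading UTF-16 byte order
     * mark is consumed here, so the 0xFEFF character never appears in
     * the character stream that the XML productions are applied to.
     */
    static Reader openReader(InputStream in) throws IOException {
        PushbackInputStream pin = new PushbackInputStream(in, 2);
        byte[] head = new byte[2];
        int n = 0;
        // Read up to two bytes; a short stream simply yields fewer.
        while (n < 2) {
            int r = pin.read(head, n, 2 - n);
            if (r < 0) {
                break;
            }
            n += r;
        }

        // Assumption: no BOM means UTF-8 (encoding declarations ignored).
        String encoding = "UTF-8";
        boolean hasBom = false;
        if (n == 2) {
            int b0 = head[0] & 0xFF;
            int b1 = head[1] & 0xFF;
            if (b0 == 0xFE && b1 == 0xFF) {
                encoding = "UTF-16BE";
                hasBom = true;
            } else if (b0 == 0xFF && b1 == 0xFE) {
                encoding = "UTF-16LE";
                hasBom = true;
            }
        }

        // Without a BOM, the bytes read are document content: put them back.
        if (!hasBom && n > 0) {
            pin.unread(head, 0, n);
        }
        return new InputStreamReader(pin, encoding);
    }
}
```

The BOM is removed by hand because Java's "UTF-16BE" and "UTF-16LE"
decoders assume the byte order is already known and would otherwise hand
a leading 0xFEFF to the caller as an ordinary character.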

>  > How are relative system identifiers supposed to be handled in
>  > DTDHandler?  Suppose I have a DTD with a system id of dir/foo.dtd,
>  > which declares an unparsed entity with a system id of foo.eps
>  > (which refers to dir/foo.eps). If the systemId argument to
>  > DTDHandler.unparsedEntityDecl is foo.eps, then the application is
>  > going to have problems.  There's a similar issue with
>  > EntityResolver.resolveEntity.
> 
> This does seem to be a serious problem.  One solution is to require
> the parser to fully resolve system identifiers before reporting them
> (as AElfred already does).  This approach will work well with URLs,
> but may break for other URI schemes.
> 
> Any other solutions?

In XP, my analog of InputSource has both an InputStream and, optionally, a
URL to use as the base URL for system identifiers in that InputStream.  In
each case where the application is passed a system identifier (whether
for a parsed or an unparsed entity), the parser passes both the specified
system identifier and the base URL from the InputSource analog.  This
gives the application complete control over resolving relative URLs,
although at the cost of some complexity.

In implementing the SAX driver for XP, I try to make an absolute URL from
the specified system identifier and the base.  If that succeeds, I pass
the result (after conversion to a String); if it fails (for example,
because the parser is reading from an InputStream with no specified
system identifier), I pass the specified system identifier unchanged.
That is the approach I would suggest for SAX.
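
A minimal sketch of that fallback, using only java.net.URL.  The names
SystemIdSketch and resolveSystemId are hypothetical; this is not the
actual XP driver code, only an illustration of the rule described above.

```java
import java.net.MalformedURLException;
import java.net.URL;

// Hypothetical helper, for illustration only.
final class SystemIdSketch {

    /**
     * Returns the system identifier to report to the application.
     * base may be null when the entity was read from a stream with
     * no system identifier of its own.
     */
    static String resolveSystemId(URL base, String systemId) {
        try {
            // Resolve the (possibly relative) identifier against the base.
            return new URL(base, systemId).toString();
        } catch (MalformedURLException e) {
            // No usable base, or an unrecognized scheme:
            // report the identifier exactly as specified.
            return systemId;
        }
    }
}
```

With a base of http://example.org/dir/foo.dtd (an invented address), the
identifier foo.eps would be reported as http://example.org/dir/foo.eps;
with no base at all it would be reported as foo.eps.  The URL(URL, String)
constructor is deprecated in recent Java releases but is the closest match
to the URL-based resolution discussed here.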

James




