SAX2 Namespace Support

Sun Jan 2 18:48:30 GMT 2000

David Megginson wrote:
> 
> David Brownell writes:
> 
>  > There's been way too much email on this topic -- I should have
>  > weighed in earlier.  In all honesty I'd prefer to see all namespace
>  > support be cleanly layered on top of SAX1.  It's easy to do it that
>  > way; just add some optional code to postprocess a SAX event stream.
> 
> The argument against that is efficiency: I have found that even the
> most efficient Namespace post-processor that I can write adds about
> 25% to parsing time.  The reason, I think, is that there is a high
> cost to iterating through every attribute list and examining every
> attribute name, and copying or wrapping the attribute lists to give a
> Namespace view.

Ah -- but what was that post-processor trying to do, then?

I once did a pretty quick'n'dirty one that cost barely 10% even though
it used DOM data structures (which, for attribute lists and values, are
often grossly inefficient).  I've every reason to believe it could be
done with much lower cost.  Perhaps your 25% was doing more work than
was required.

Then there's also the approach of having the processing be integrated
with the next layer, which already needs to iterate and examine.  (And
with different goals than a parser!)  Such approaches can reduce the
cost much further -- effectively, to zero.
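
As a rough illustration of the per-element work such a post-processing
layer does, here is a minimal sketch in Java.  It is not the
post-processor measured by either author in this thread.  The class
NamespaceScopes, its methods, the "{uri}local" output form, and the
urn:example:html URI are all assumptions made up for the example, and
attributes are passed as a plain Map rather than a SAX AttributeList so
the code stays self-contained.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of SAX: tracks prefix bindings per element.
final class NamespaceScopes {
    // The innermost element's declarations sit at the head of the deque.
    private final Deque<Map<String, String>> scopes = new ArrayDeque<>();

    /** Opens a scope from raw attributes (qualified name to value) and
     *  returns the non-declaration attributes keyed by expanded name. */
    Map<String, String> startElement(Map<String, String> rawAttrs) {
        // Pass 1: collect xmlns and xmlns:prefix declarations.
        Map<String, String> declared = new HashMap<>();
        for (Map.Entry<String, String> a : rawAttrs.entrySet()) {
            String name = a.getKey();
            if (name.equals("xmlns")) {
                declared.put("", a.getValue());
            } else if (name.startsWith("xmlns:")) {
                declared.put(name.substring(6), a.getValue());
            }
        }
        scopes.push(declared);

        // Pass 2: expand every remaining attribute name.
        Map<String, String> resolved = new LinkedHashMap<>();
        for (Map.Entry<String, String> a : rawAttrs.entrySet()) {
            String name = a.getKey();
            if (!name.equals("xmlns") && !name.startsWith("xmlns:")) {
                resolved.put(expand(name, false), a.getValue());
            }
        }
        return resolved;
    }

    /** Closes the scope opened by the matching startElement call. */
    void endElement() {
        scopes.pop();
    }

    /** Expands a qualified name against the scopes currently open. */
    String expand(String qName, boolean isElement) {
        int colon = qName.indexOf(':');
        String local = qName.substring(colon + 1);
        if (colon < 0 && !isElement) {
            return local;  // unprefixed attributes are in no namespace
        }
        String prefix = colon < 0 ? "" : qName.substring(0, colon);
        for (Map<String, String> scope : scopes) {
            String uri = scope.get(prefix);
            if (uri != null) {
                return uri.isEmpty() ? local : "{" + uri + "}" + local;
            }
        }
        if (prefix.isEmpty()) {
            return local;  // no default namespace in scope
        }
        throw new IllegalArgumentException("undeclared prefix: " + prefix);
    }

    public static void main(String[] args) {
        NamespaceScopes ns = new NamespaceScopes();
        Map<String, String> raw = new LinkedHashMap<>();
        raw.put("xmlns:html", "urn:example:html");
        raw.put("html:class", "greeting");
        raw.put("id", "p1");
        Map<String, String> resolved = ns.startElement(raw);
        System.out.println(ns.expand("html:p", true));
        System.out.println(resolved);
        ns.endElement();
    }
}
```

The two passes over the attribute list are where the cost under
discussion comes from.  A layer that already walks the attributes for
its own purposes could fold both passes into that walk.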

>  > With respect to this particular proposal, I have several comments.
>  >
>  > First, it's unclear to me what's happened to our old friend, the
>  > org.xml.sax.DocumentHandler.startElement callback:
>  >
>  >     public void startElement (String name, AttributeList attrs)
>  >     throws SAXException;
>  >
>  > If that call is gone, I anticipate migration problems to SAX2.
> 
> There have been so many proposals that I'm starting to lose track.
> The idea, I think, is that this would be replaced ...

That raises the point about compatibility.  There's not been much
discussion about the fairly substantial change you're proposing:
that SAX1 and SAX2 seemingly be incompatible at this basic level.
It's only come out implicitly in reaction to other points.

>  > If it's still there, then it must be the application's choice to use
>  > the new sax2.DocumentHandler interface or the original ... presumably
>  > it would use Configurable.setProperty() with some ID for the new
>  > namespace-aware sax2.DocumentHandler to identify its choice.
> 
> One option that no one has suggested yet is to create the
> NamespaceHandler a little differently:
> 
>   public class NamespaceHandler
>   {
>     public void startElement (String namespaceURI, String localName,
>                               NSAttributeList atts)
>       throws Whatever;
> 
>     [ deletia ]
>   }
> 
> That way, SAX parsers could still use the original DocumentHandler to
> report the XML 1.0 view (with prefixed names), and the
> NamespaceHandler to report the Namespace view of elements and
> attributes, which is the only place the view differs.
>
> 	[ deletia ]
>
> Personally, I find this approach a little brittle ...

I'd thought about that approach too.  I couldn't quite put a finger
on why it bothered me, beyond its adding extra calls which would
surely cause some application trouble someday.

>  > Second, it's unclear how to report violations of namespace conformance.
>  
>  [ deletia ]
>
>  > That is, faced with this document
>  >
>  >      <?xml version="1.0"?>
>  >      <html:p>Hello again! :-)</html:p>
>  >      <?at-end-of-document?>
>  >
>  > Two reporting issues arise:  (a) How does one know that namespaces are
>  > to be used at all?  It's a legal XML 1.0 document, so inherently there
>  > is no error.
> 
> That's a big problem.  My SAX2 proposal is for XML+Namespaces by
> default, but it's possible to try to disable Namespace support.  That
> means that, by default, you would get an error for this document.

Apart from that "try to", I like that approach.

I've been leaning in various cases to "XML+namespaces" as a default,
but think it _must_ be possible to do more than "try" to disable it.
Namespaces are not mandatory (not part of the XML spec) and by now
it's well known they're not trouble-free.  SAX2 shouldn't preclude
systems that choose not to use namespaces.
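
To make the two behaviours concrete, here is a small sketch that parses
the sample document quoted above with namespace processing off and then
on.  It uses the JAXP SAXParserFactory.setNamespaceAware() switch from
later Java releases, which is an assumption of this example and not the
SAX2 mechanism under discussion in this thread.  The class name
NamespaceToggle and its parse() helper are hypothetical.  Whether the
namespace-aware run fails, and with what severity, depends on the
parser implementation in use.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class; only the JAXP and SAX calls are real APIs.
final class NamespaceToggle {
    // The sample document from the thread: legal XML 1.0, undeclared prefix.
    static final String DOC =
        "<?xml version=\"1.0\"?>\n"
        + "<html:p>Hello again! :-)</html:p>\n"
        + "<?at-end-of-document?>\n";

    /** Returns null if the parse completes, else the parser's message. */
    static String parse(boolean namespaceAware) {
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(namespaceAware);
            // DefaultHandler ignores warnings and errors, rethrows fatal errors.
            factory.newSAXParser().parse(
                new InputSource(new StringReader(DOC)), new DefaultHandler());
            return null;
        } catch (SAXParseException e) {
            return e.getMessage();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("namespaces off: " + parse(false));
        System.out.println("namespaces on:  " + parse(true));
    }
}
```

With namespace processing off, the prefixed name is just an XML 1.0
name containing a colon, so the document should parse cleanly.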

>  > (b) If one knows that namespaces are to be used, is the undeclared
>  > "html" prefix to generate a warning, recoverable error, or fatal
>  > error through sax.ErrorHandler?  Is it reported some other way?
> 
> I think that it would be wrong to use fatalError to report Namespace
> violations, but others may disagree.

And given the W3C's original actions, I believe some of them are at W3C.
(Else they'd have made this explicit in the namespace specification, when
the issue came up in review comments.)

>  > I think that using ErrorHandler.error() is the best solution, but then
>  > that leads to the issue of how to report namespace URIs that aren't
>  > available.  (And as I recall, there were more errors to deal with than
>  > just unresolved namespace prefixes.)
> 
> Error numbers would be helpful, if someone were willing to invent some.

I'll start your collection by proposing SAXParseException get a bunch of
integer literal constants.  All undefined constants would be available for
use by revisions of SAX, but these numbers increment from zero (so folk
wanting proprietary codes have some sand to stand on for a while).

	/** Error, details unspecified (default for SAX1) */
	public static final int SAX_ERR_UNSPECIFIED = 0;

	/** Namespace prefix was undeclared.  The undeclared prefix is
	 *  in a String member of the exception (access TBD). */
	public static final int SAX_ERR_NAMESPACE_PREFIX_UNDECLARED = 1;

Folk wanting to follow up -- please change the title to "SAX2 error numbers"!

I'll anticipate three general categories of followup:  about the framework
(e.g. proprietary codes), about the errors (more than two will arrive :-),
about access to error-specific detail info, about how I can't count, and
finally about things I didn't anticipate.

Once the basic framework is settled, I'll scan my enhanced AElfred (and a
layered validator) to come up with more error codes.
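
One way such codes might be carried, shown as a sketch only.  The
proposal puts the constants on SAXParseException itself; since that
class can't be changed from outside, this example uses a hypothetical
subclass, CodedSAXParseException.  The accessors getErrorCode() and
getDetail(), and the static helper codeOf(), are assumptions standing
in for the "access TBD" part of the proposal.

```java
import org.xml.sax.Locator;
import org.xml.sax.SAXParseException;

// Hypothetical subclass; the proposal would add these to SAXParseException.
class CodedSAXParseException extends SAXParseException {
    private static final long serialVersionUID = 1L;

    /** Error, details unspecified (default for SAX1). */
    public static final int SAX_ERR_UNSPECIFIED = 0;

    /** Namespace prefix was undeclared; the detail string holds the prefix. */
    public static final int SAX_ERR_NAMESPACE_PREFIX_UNDECLARED = 1;

    private final int errorCode;
    private final String detail;

    CodedSAXParseException(String message, Locator locator,
                           int errorCode, String detail) {
        super(message, locator);  // a null locator is accepted
        this.errorCode = errorCode;
        this.detail = detail;
    }

    int getErrorCode() {
        return errorCode;
    }

    /** Error-specific detail, e.g. the undeclared prefix; may be null. */
    String getDetail() {
        return detail;
    }

    /** Reads the code from any SAXParseException; exceptions that carry
     *  no code are treated as SAX_ERR_UNSPECIFIED. */
    static int codeOf(SAXParseException e) {
        if (e instanceof CodedSAXParseException) {
            return ((CodedSAXParseException) e).getErrorCode();
        }
        return SAX_ERR_UNSPECIFIED;
    }

    public static void main(String[] args) {
        SAXParseException coded = new CodedSAXParseException(
            "prefix \"html\" is not declared", null,
            SAX_ERR_NAMESPACE_PREFIX_UNDECLARED, "html");
        SAXParseException plain = new SAXParseException("old-style error", null);
        System.out.println(codeOf(coded));
        System.out.println(codeOf(plain));
    }
}
```

An ErrorHandler.error() implementation could then switch on codeOf(e)
instead of parsing message strings, and existing SAX1 code that throws
plain exceptions would keep working with code zero.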

>  > > This would never be enabled by default, but for the relatively small
>  > > class of apps that needed to know the original prefix, the prefix
>  > > would be available simply by splitting the name argument.
>  >
>  > Clearly that class includes "DOM-using applications", which for better
>  > or worse (opinions do vary :-) isn't a small class.
>  >
>  > DOM L2 applications explicitly have the same option that I noted above:
>  > use (or non-use) of namespace information is the choice of the application,
>  > not the choice of some version of an XML infrastructure.
> 
> Is DOM2 more explicit about processing than DOM1, then?  There's
> nothing in DOM1 that says (for example) that you have to include
> comments and other stuff from the original XML document, if in fact
> there is an original XML document.

DOM L2 still doesn't say any more than DOM L1 did about the association
between a given XML document (comments, entity refs, DTD, etc) and a
given DOM tree ... and its APIs are still incomplete for parsing into
a "full" DOM tree without using proprietary API extensions.  Sigh.

For the record, I've updated my "DOM2" implementation's javadoc to
explicitly discourage applications from using the following features:
CDATASection, DocumentType, Entity, EntityReference, Notation, and
the fact that attribute nodes can have children.  (The "DOM Functionality
to Avoid" section in the package javadoc provides more info.)

That list doesn't include comments.  Comments may be unwise, and not
all parsers can report them, but their functionality is at least
complete within the scope of DOM, and ignoring them is trivial.

> Even in DOM2, I wonder if you'd have to have the *original* prefixes
> or just some prefixes?  After all, the DOM won't always be built from
> an XML document; it might be a wrapper around a bunch of DB tables
> (for example) where there are no original prefixes available.

Prefixes are settable, for namespace-aware nodes.  As we all know,
such prefixes _should_ be used only for output.  After massaging a
document (perhaps built from multiple databases or B2B partners)
some pre-output processing stage needs to massage the tree to ensure
that all namespace prefixes get declared.
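
A minimal sketch of such a pre-output pass, using the DOM Level 2
namespace calls as they appear in the org.w3c.dom package.  The class
PrefixFixup and its methods are hypothetical names for this example.
It only adds missing declarations.  It does not invent prefixes for
namespaced attributes that lack one, does not undeclare an inherited
default namespace, and does not repair one prefix bound to two URIs on
the same element.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Attr;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;

// Hypothetical helper: declares prefixes that are used but not yet bound.
final class PrefixFixup {
    static final String XMLNS_URI = "http://www.w3.org/2000/xmlns/";

    /** Walks the subtree, adding xmlns attributes where a binding is missing. */
    static void declareMissing(Element elem, Map<String, String> inherited) {
        Map<String, String> scope = new HashMap<>(inherited);
        Map<String, String> needed = new LinkedHashMap<>();
        NamedNodeMap attrs = elem.getAttributes();

        // Record declarations already present on this element.
        for (int i = 0; i < attrs.getLength(); i++) {
            Attr a = (Attr) attrs.item(i);
            if (XMLNS_URI.equals(a.getNamespaceURI())) {
                boolean isDefault = "xmlns".equals(a.getNodeName());
                scope.put(isDefault ? "" : a.getLocalName(), a.getValue());
            }
        }
        // Check the element name, then each ordinary attribute name.
        require(elem.getPrefix(), elem.getNamespaceURI(), true, scope, needed);
        for (int i = 0; i < attrs.getLength(); i++) {
            Attr a = (Attr) attrs.item(i);
            if (!XMLNS_URI.equals(a.getNamespaceURI())) {
                require(a.getPrefix(), a.getNamespaceURI(), false, scope, needed);
            }
        }
        // Add declarations only after iterating, so the map is not changed mid-loop.
        for (Map.Entry<String, String> d : needed.entrySet()) {
            String qName = d.getKey().isEmpty() ? "xmlns" : "xmlns:" + d.getKey();
            elem.setAttributeNS(XMLNS_URI, qName, d.getValue());
        }
        for (Node n = elem.getFirstChild(); n != null; n = n.getNextSibling()) {
            if (n instanceof Element) {
                declareMissing((Element) n, scope);
            }
        }
    }

    private static void require(String prefix, String uri, boolean isElement,
                                Map<String, String> scope, Map<String, String> needed) {
        if (uri == null) {
            return;  // name is in no namespace: nothing to declare
        }
        if (prefix == null) {
            if (!isElement) {
                return;  // attributes cannot use the default namespace
            }
            prefix = "";
        }
        if (!uri.equals(scope.get(prefix))) {
            scope.put(prefix, uri);
            needed.put(prefix, uri);
        }
    }

    /** Creates an empty document, wrapping the checked configuration error. */
    static Document newDocument() {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        Document doc = newDocument();
        Element root = doc.createElementNS("urn:example:a", "a:root");
        doc.appendChild(root);
        root.appendChild(doc.createElementNS("urn:example:b", "b:kid"));
        declareMissing(root, new HashMap<>());
        System.out.println(root.getAttributeNS(XMLNS_URI, "a"));
    }
}
```

Declarations are added on the first element that needs them, so a
prefix used throughout a subtree is declared once at the top of that
subtree rather than on every node.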

As for "original", the intent is that prefixes that start out with
some human-meaningful connotation ("xsl:template" vs "a82mfx:template")
preserve it, but that's clearly up to the application.  Apps can do
whatever output mangling they choose, including removing almost all
whitespace to make the document completely unintelligible.  Neither
of those is a good thing to do, or IMHO to encourage.

- Dave

xml-dev: A list for W3C XML Developers. To post, mailto:xml-dev at ic.ac.uk
Archived as: http://www.lists.ic.ac.uk/hypermail/xml-dev/ and on CD-ROM/ISBN 981-02-3594-1
List coordinator, Henry Rzepa (mailto:rzepa at ic.ac.uk)