Streaming XML and SAX

Wed Feb 24 02:57:55 GMT 1999

I've been following the discussions on streaming (on both XML-Dev and
XSL-list, which has been interesting to compare) with lots of interest.
Unfortunately, it's been through the haze of just having finished
(yesterday) another book, so my observations may be tilted.

I started thinking about the subject with the XP (Extensible Protocol)
proposal at the IETF, which is about three weeks old now:

http://www.ietf.org/internet-drafts/draft-harding-extensible-protocol-00.txt

(I've cc'd the author of the draft since I don't know if he's one of our
lurkers, and figure he might be interested in knowing that we've got a live
discussion going on here.  I'm not saying XP is the cure to all our
troubles, but it's a good place to start.)

XP proposes a pretty simple mechanism for sending requests and responses as
streams of XML documents.  As the draft puts it,

>To extend XML from a class of data objects into a protocol is to
>extend the rules for constructing a single document into rules for
>constructing two interrelated streams of documents.  Accordingly, we
>introduce mechanisms for handling both the sequential and
>interrelated aspects of the document streams.

Requests are prefaced with a processing instruction (PI) that uses the form:

>      RequestPI ::= '<?xp' S 'Request' Eq Nmtoken '?>'

Responses are prefaced with a PI using the form:

>      ResponseToPI ::= '<?xp' S 'ResponseTo' Eq Nmtoken '?>'

A 'terminator PI' is used to mark the end of a document, using the form:

>      TerminatorPI ::= '<?xp' S '/?>'

It's a pretty simple mechanism, using Nmtokens to keep two streams of
processing and information in sync with each other.
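
To make the three forms concrete, here is a rough Python sketch that
recognizes them.  It follows the productions exactly as quoted above (so
the Nmtoken is unquoted), and it assumes an ASCII-only approximation of
Nmtoken.  The names XP_PI and classify_xp_pi are mine, not the draft's.

```python
import re

# XML white space (S) and the Eq production (S? '=' S?).
S = r"[ \t\r\n]+"
EQ = r"[ \t\r\n]*=[ \t\r\n]*"
# Assumption: ASCII-only stand-in for Nmtoken; the real production
# also admits many non-ASCII name characters.
NMTOKEN = r"[A-Za-z0-9._:-]+"

XP_PI = re.compile("".join([
    r"<\?xp", S,
    r"(?:",
    r"(?P<kind>Request|ResponseTo)", EQ, r"(?P<token>", NMTOKEN, r")\?>",
    r"|",
    r"(?P<terminator>/)\?>",
    r")",
]))


def classify_xp_pi(text):
    """Return (kind, token) for an XP PI, or None if text is not one.

    kind is "Request", "ResponseTo" or "Terminator"; token is None
    for the terminator.
    """
    match = XP_PI.fullmatch(text)
    if match is None:
        return None
    if match.group("terminator"):
        return ("Terminator", None)
    return (match.group("kind"), match.group("token"))
```

A response can then be paired with its request by comparing tokens: a
ResponseTo PI carrying the same Nmtoken as an earlier Request PI answers
that request.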

XP doesn't directly address the issues that seem to be bedeviling this
list.  The issue of associating DTDs with documents, for instance, is left
untouched, and the examples use simple well-formed XML.  It does, however,
suggest a fairly simple approach to stream processing that might be
appropriate in a number of situations. 

Basically, rather than arguing about documents and streams and how they
should relate to each other within the context of XML, maybe it's time to
step outside the tight XML framework and start thinking of streams as a set
of XML documents presented in some kind of sequence with meaningful
delimiters.  The stream itself may not be a valid or even well-formed XML
document - since the end tag may appear a long ways in the future, or
even possibly never appear - but the stream can be decomposed into a set of
valid XML documents.
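
As a rough illustration of that decomposition, here is a Python sketch
that cuts a stream at XP-style terminator PIs and parses each piece on
its own.  It assumes the terminator is always written with a single
space, and it splits on the literal text, so a terminator hiding inside
a comment or CDATA section would fool it.  The function names and the
sample <order> documents are made up for the example.

```python
import xml.etree.ElementTree as ET

# Assumption: the terminator PI always appears in exactly this form.
TERMINATOR = "<?xp /?>"


def split_stream(stream_text):
    """Yield the root element of each document in the stream."""
    for chunk in stream_text.split(TERMINATOR):
        chunk = chunk.strip()
        if chunk:  # skip the empty piece after the last terminator
            yield ET.fromstring(chunk)


def is_well_formed(text):
    """Report whether text parses as a single XML document."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


# Two hypothetical documents sent back to back on one stream.
stream = (
    "<?xp Request=a1?><order><item>widget</item></order><?xp /?>"
    "<?xp Request=a2?><order><item>gadget</item></order><?xp /?>"
)
documents = list(split_stream(stream))
```

Here is_well_formed(stream) comes back False, because the stream has two
root elements, while each piece produced by split_stream parses cleanly.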

Some folks on this list have suggested mechanisms like control characters -
^L or ^C - to manage these streams.  While that might work, it doesn't
provide very much flexibility of expression.  For example, it provides no
information about the relation of the documents in the stream except their
sequence.  In many cases, relating documents in the stream to each other -
or, like XP, to an entirely separate stream - may be important.  The use of
processing instructions (or, if you want to be grouchy, markup that uses a
PI-like syntax) seems appropriate.

This might also reduce the need for preprocessing, or for parsers that look
specifically for control characters, and would allow the reuse of
mechanisms we've already got.  A SAX parser might be able to carry out
stream parsing, sending standard SAX events to multiple threads
representing different document components of the stream, for example.  The
PIs could be sent as part of the prolog - it might mean rearranging the
prolog so <?xml?> comes before the PI, but I think that is doable - so the
application could get the information.  It could give startDocument and
endDocument some real work to do that isn't just the province of the first
startElement and the last endElement.  (Yes, I know startDocument is
important for catching stuff that appears before the root element.)
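
For what it's worth, existing SAX implementations already hand prolog PIs
to the application.  Here is a small sketch using Python's xml.sax
module, where processingInstruction(target, data) is the standard
callback.  The handler class, the events list and the sample document
are illustrative assumptions, not part of SAX or XP.

```python
import xml.sax


class XPHandler(xml.sax.ContentHandler):
    """Record SAX events, including any 'xp' PIs in the prolog."""

    def __init__(self):
        super().__init__()
        self.events = []

    def startDocument(self):
        self.events.append(("startDocument",))

    def processingInstruction(self, target, data):
        # Only XP-style PIs are of interest here; others are ignored.
        if target == "xp":
            self.events.append(("xp", data))

    def startElement(self, name, attrs):
        self.events.append(("startElement", name))

    def endElement(self, name):
        self.events.append(("endElement", name))

    def endDocument(self):
        self.events.append(("endDocument",))


def sax_events(document_bytes):
    """Parse one document and return the recorded event list."""
    handler = XPHandler()
    xml.sax.parseString(document_bytes, handler)
    return handler.events


events = sax_events(b"<?xp Request=a1?><order><item/></order>")
```

The PI arrives after startDocument and before the first startElement,
which is exactly the window where an application could decide which
thread or component should receive the rest of the document.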

Defining this in a general way doesn't seem like it would be too painful.
It might be a general description of a mechanism that XP applies in a
particular request/response situation, or it might be something else.

In any event, defining XML streams and rules for dealing with them is an
important issue, one with very important implications for interchange.  If
we could hammer this down, we might be able to ensure that all kinds of
developers will be able to share XML streams as easily as they share XML
documents.  If we define streams cleanly, we might even be able to nest
streams within streams and, with luck, avoid the next round of
multiple-container processing battles.

It'd be worth fleshing out, and I could see adding two new events to SAX -
beginStream and endStream or something like that.
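
To give a feel for what that might look like, here is a purely
hypothetical Python sketch.  Neither beginStream nor endStream exists in
SAX; the handler subclass, the parse_stream driver and the literal
terminator it splits on are all assumptions made for illustration.

```python
import xml.sax

# Assumption: documents in the stream are separated by this exact PI.
TERMINATOR = b"<?xp /?>"


class StreamHandler(xml.sax.ContentHandler):
    """A ContentHandler with two extra, hypothetical stream callbacks."""

    def __init__(self):
        super().__init__()
        self.log = []

    def beginStream(self):  # hypothetical event, not part of SAX
        self.log.append("beginStream")

    def endStream(self):  # hypothetical event, not part of SAX
        self.log.append("endStream")

    def startDocument(self):
        self.log.append("startDocument")

    def endDocument(self):
        self.log.append("endDocument")


def parse_stream(stream_bytes, handler):
    """Bracket the per-document SAX events with stream-level events."""
    handler.beginStream()
    for chunk in stream_bytes.split(TERMINATOR):
        chunk = chunk.strip()
        if chunk:
            # Each document gets its own ordinary SAX parse.
            xml.sax.parseString(chunk, handler)
    handler.endStream()


handler = StreamHandler()
parse_stream(
    b"<?xp Request=a1?><order/><?xp /?><?xp Request=a2?><order/><?xp /?>",
    handler,
)
```

With this arrangement startDocument and endDocument fire once per
document, and the two new events bracket the whole sequence, so an
application can tell "this document is over" from "this stream is over".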

On the other hand, maybe I've just been working too hard too long and it's
time for a nice long vacation.  If folks think this is worthwhile, though,
I'd be happy to put some work into it.

Simon St.Laurent
XML: A Primer / Building XML Applications (April)
Sharing Bandwidth / Cookies
http://www.simonstl.com

xml-dev: A list for W3C XML Developers. To post, mailto:xml-dev at ic.ac.uk
Archived as: http://www.lists.ic.ac.uk/hypermail/xml-dev/ and on CD-ROM/ISBN 981-02-3594-1
List coordinator, Henry Rzepa (mailto:rzepa at ic.ac.uk)