Streaming XML and SAX

Mon Mar 1 22:32:12 GMT 1999

On Sun, Feb 28, 1999 at 09:40:40PM -0800, Tom Harding wrote:
> Marcelo Cantos wrote:
> 
> > It has already been pointed out in this discussion that some
> > environments try to increase the throughput by dispatching
> > documents off to different threads.  A system with 50 CPUs is
> > going to be operating as low as 2% capacity if it is forced to
> > pipe the entire parsing load through a single thread.  I don't see
> > how you can argue that this is efficient.
> 
> Even if you believe that parsing to convert markup into memory
> structures is slower than back-end processing, if parsing is faster
> than the stream itself there is no difference in the two approaches.

That is an awfully big _if_ to enshrine in a standard (if that's where
all this brouhaha ultimately ends up).  What if client and server
are on the same machine?

> Anyway, in the general case the question is moot because there may
> be inter-document dependencies, so you have to look inside the
> document before trying to parallelize.

The question is far from moot since an enormous class of very
interesting problems does not fall into this category.  There are
myriad applications for self-contained XML packets.

Furthermore, inter-document dependencies are not a fundamental problem
for parallelisation.  Threads can talk to each other and block waiting
for other threads to finish parsing, while allowing other threads to
continue independent tasks.  You are suggesting that because in some
cases it isn't trivial to parallelise we should therefore never even
allow the possibility of such a thing to occur.
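
To make the "threads can talk to each other" point concrete, here is a
minimal sketch in Python.  It is illustrative only: the `needs`
attribute, the `resolved` attribute and the `ParsedDocs` table are
hypothetical names invented for this example, not part of any XML
standard or proposal.  One thread parses each document; a document
that depends on another blocks until that one is published, while
independent documents carry on.

```python
import threading
import xml.etree.ElementTree as ET


class ParsedDocs:
    """Shared table where parser threads publish results and wait on others."""

    def __init__(self, names):
        self._done = {name: threading.Event() for name in names}
        self._trees = {}

    def publish(self, name, tree):
        self._trees[name] = tree
        self._done[name].set()

    def wait_for(self, name, timeout=5.0):
        # Block until the named document has been parsed by its own thread.
        if not self._done[name].wait(timeout):
            raise TimeoutError(f"{name} was never parsed")
        return self._trees[name]


def parse_worker(name, text, table):
    root = ET.fromstring(text)
    needs = root.get("needs")  # hypothetical dependency marker
    if needs is not None:
        other = table.wait_for(needs)      # only this thread blocks
        root.set("resolved", other.tag)    # hypothetical use of the dependency
    table.publish(name, root)


def parse_all(docs):
    """Parse a {name: xml_text} mapping, one thread per document."""
    table = ParsedDocs(docs)
    threads = [
        threading.Thread(target=parse_worker, args=(name, text, table))
        for name, text in docs.items()
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return {name: table.wait_for(name) for name in docs}
```

The dependency is resolved entirely between the parsing threads; the
code that split the stream into documents never had to look inside
them.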

> The whole point of this discussion was whether the document
> terminator ought to be XML or non-XML.  Aside from the fact that I
> haven't yet seen a workable suggestion for a non-XML terminator,

I am frankly incredulous that there are no systems, protocols or
standards available today that adequately address the need to stream
multiple logical units of information.  This is not a new problem.
Let me suggest one off the top of my head: send a null-terminated
decimal length, followed by a document.  This is sufficient to
dispatch data to multiple threads and raise concurrency levels.  Any
further processing can be done inside the parsers.
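
A rough sketch of that framing in Python, under some assumptions of my
own: the length is a byte count written in ASCII digits, the stream is
a buffered file-like object whose read(n) returns n bytes unless the
stream ends, and the function names are made up for illustration.  The
reader never inspects the document body, so each body can be handed
straight to a worker.

```python
import io
from concurrent.futures import ThreadPoolExecutor
import xml.etree.ElementTree as ET


def write_frame(stream, document: bytes) -> None:
    """Write one frame: ASCII decimal byte length, a NUL, then the document."""
    stream.write(str(len(document)).encode("ascii") + b"\0" + document)


def read_frames(stream):
    """Yield each document body without looking inside it."""
    while True:
        digits = bytearray()
        while True:
            byte = stream.read(1)
            if not byte:
                if digits:
                    raise ValueError("stream ended inside a length header")
                return  # clean end of stream between frames
            if byte == b"\0":
                break
            digits += byte
        if not digits.isdigit():
            raise ValueError("length header is not a decimal number")
        length = int(digits.decode("ascii"))
        body = stream.read(length)
        if len(body) != length:
            raise ValueError("stream ended inside a document")
        yield body


def root_tag(document: bytes) -> str:
    """Stand-in for real per-document work: parse and report the root tag."""
    return ET.fromstring(document).tag


def parse_stream(stream, workers=4):
    """Split the stream in one thread, parse the documents in a pool."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(root_tag, read_frames(stream)))
```

For example, writing b"<a/>" and b"<b><c/></b>" produces the bytes
4\0<a/>11\0<b><c/></b> on the wire.  In CPython a thread pool will not
give real CPU parallelism for parsing; a process pool could be swapped
in without touching the framing.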

> it
> isn't necessary to completely examine a document or convert it to a
> tree just to find an XML terminator.

You can do better than a well-formedness parser?  What are you going
to do, grep for </doc>?
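
To illustrate why a plain text search is not enough, here is a small
Python sketch.  The helper names are mine, and `</doc>` is just the
example terminator from above.  The end tag can legally appear as
literal text inside a CDATA section (or a comment), so cutting at the
first match yields a fragment that is not well-formed.

```python
import xml.etree.ElementTree as ET


def split_by_grep(stream: str, terminator: str = "</doc>"):
    """Naively cut the stream at the first textual match of the terminator."""
    end = stream.find(terminator)
    if end == -1:
        return None
    return stream[: end + len(terminator)]


def is_well_formed(text: str) -> bool:
    """Check well-formedness by running a real parser over the text."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


# Two documents back to back; the first contains "</doc>" inside CDATA.
stream = "<doc><![CDATA[literal </doc> here]]></doc><doc/>"
first = split_by_grep(stream)
```

Here `first` stops in the middle of the CDATA section, and
is_well_formed(first) reports False.  Finding the real terminator
means tracking at least CDATA sections, comments and nesting, which is
most of the way to a well-formedness parser.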

> As Nathan pointed out, you
> could write a semi-parser to find terminators and then actually
> parse documents in parallel, but you'd need to suggest a way for
> dealing with inter-document dependencies.

You get the threads to talk.  Inter-document dependencies are not and
need not be a protocol issue.

At the end of the day, the problem of streaming documents is not a
difficult one to solve at the protocol level (HTTP-NG will have it
built in, AFAIK).  Why do you want to complicate life by overloading
the parser's job?

Actually, my real question is, what on earth do you hope to gain?  Or
is this just a philosophical preference thing?

Cheers,
Marcelo

-- 
http://www.simdb.com/~marcelo/

xml-dev: A list for W3C XML Developers. To post, mailto:xml-dev at ic.ac.uk
Archived as: http://www.lists.ic.ac.uk/hypermail/xml-dev/ and on CD-ROM/ISBN 981-02-3594-1