Fw: DOM vs. SAX??? Nah. (was RE: Storing Lots of FiddlyBits (was Re: What is XML for?)

Thu Feb 11 18:30:13 GMT 1999

Walter Underwood wrote:
> At 02:17 PM 2/11/99 +0200, Oren Ben-Kiki wrote:
> >David Megginson <david at megginson.com> wrote:
> >>1. SAX and DOM are complementary
> >IMVHO SAX should be defined not as a "parser interface" but as a "DOM tree
> >visitor interface".
> 
> We use a fair amount of XML inside Infoseek, and were just having
> this DOM vs. SAX discussion on Monday. There are applications that
> really are interested in the document, and the DOM interface is a
> tremendous help for those. For some other applications, the DOM is
> a total waste of time -- they need to turn the contents of the
> document into application data (maybe objects, maybe not), and
> creating DOM objects for everything is an unnecessary step that slows
> things down and bloats code.
> 
> An example of the latter is the XML text extractor in the Ultraseek
> Server search engine. It needs to convert the incoming XML document
> to fieldname/textbuffer pairs so they can be further analyzed and
> inserted into the search index. The expat handlers are about 80 lines
> of Python. Works great.
> 
> Other applications use XML in an RPC-like manner. Those parsers
> need to behave like an RPC marshalling parser, oriented towards
> translating into user structures/objects, not RPC- or XML-centered
> objects.
> 
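
A rough sketch of the kind of extractor Walter describes, using Python's
standard xml.parsers.expat module. This is not the Ultraseek code; the
function name extract_fields and the sample document are made up for
illustration, and it assumes each field is simply an element that directly
contains text.

```python
import xml.parsers.expat


def extract_fields(xml_bytes):
    """Return (fieldname, text) pairs for elements that directly contain text."""
    fields = []    # finished (fieldname, text) pairs, in document order
    buffers = []   # one text buffer per currently open element

    def start(name, attrs):
        buffers.append([])

    def chars(data):
        if buffers:
            buffers[-1].append(data)

    def end(name):
        text = "".join(buffers.pop()).strip()
        if text:
            fields.append((name, text))

    parser = xml.parsers.expat.ParserCreate()
    parser.StartElementHandler = start
    parser.CharacterDataHandler = chars
    parser.EndElementHandler = end
    parser.Parse(xml_bytes, True)   # True marks this as the final chunk
    return fields


doc = b"<page><title>DOM vs. SAX</title><body>Use both.</body></page>"
print(extract_fields(doc))
```

No tree is ever built; the handlers keep only the text buffers for the
elements that are currently open.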

Neat-O.  I heard this quote

> SAX is just a way to populate DOM, it's at lower level.

a while back and it gave me convulsions.  Glad to hear the 
real-world experience.   *smile*

Clark Evans

P.S.

Older posts on the subject that may be of use....
I spend too much time e-mailing.

-------- Original Message --------
Subject: Re: Forking the DOM (was Re: Storing Lots of Fiddly Bits)
Date: Wed, 03 Feb 1999 17:29:15 +0000
From: Clark Evans <clark.evans at manhattanproject.com>
To: XML Developers' List <xml-dev at ic.ac.uk>
References: <000e01be4f36$c7b29370$d3228018 at jabr.ne.mediaone.net> <199902031447.JAA20034 at hesketh.net>

"Simon St.Laurent" wrote:
> 
> Given the fairly strong comments excerpted below (and Paul's not the only
> one muttering like this), is it time to contemplate a very different API?

I think that (at least) one other (very different) API already 
exists, it's called SAX.  Perhaps the contemplating to do
is in pattern study.  Identifying the various ways to 
view information and creating a taxonomy (or pattern 
language) of *existing* tools and methods which can 
best handle those views may be the ticket.  I feel that
having a generative grammar of tools and techniques is 
better than one big sledge hammer.  Although, I must admit,
the sledge hammer is much more impressive and seductive.

For large amounts of "simple" information with complex 
interaction I have found through experience that
modeling the Relations (and implementing with an RDBMS) 
works wonderfully.  Also, for large complicated
inter-related processing units, I have found that 
modeling with Objects to be very useful (and 
implementing in an OOPL).   

However, every time I have tried modeling 
time-oriented streams in a Relational database or 
with an Object language, the result has been UGLY.
It just doesn't work well; I call it a "mis-fit".

Once I saw XML, it clicked.  And it clicked hard.  
You use XML to implement stream-oriented messaging
systems.  XML is very loosely typed, and with
Architectural forms it is actually the reverse
of object-orientation!  Explanation: A given data 
stream has multiple classifications depending upon 
an observer's context.  In object-orientation, an object
is a single class.  Where objects are encapsulated
so that a single behavior is tied to a given data
structure, a stream is exposed so that multiple
behaviors can be activated from a single event.
By trying to shoe-horn Messages into objects, you 
must assign a single class and a single behavior, 
and thus you lose the very thing which makes 
XML powerful.

Of course, you can use object-orientation to model
your message-processing software, just as you can use
message-orientation to model your object-processing
software.  SAX is a good example of the former, while
Scenarios is a good example of the latter.   I don't
doubt the power of recursive application of the pattern.

Here is a related post I made to the news server
last week.  My thoughts have changed some, but the
primary argument still stands: treating something as
an Object and as a Stream are very different
perspectives, and the DOM and SAX interfaces reflect
this reality.

>    Subject: DOM and SAX: complementary aspects of XML
>       Date: Thu, 28 Jan 1999 19:50:22 +0000
>       From: Clark Evans <clark.evans at manhattanproject.com>
> Newsgroups: comp.text.xml, comp.text.sgml
>
> I saw some debating a while back, asking "Which
> should I use, DOM or SAX?" Then a few people were
> stating that the W3C had standardized on DOM and
> not on SAX.   This puzzled me.   Anyway, I spent
> a small amount of time reviewing each and it kinda
> struck me what was going on.  So, I figured I'd take
> a crack at the explanation...
> 
> DOM and SAX are "complementary" ways to look at an XML document.
> 
> SAX - XML AS STREAM, A PUSH MODEL
> ~~~
> 
> SAX views the document as a stream, sending events
> as the document passes through its view:
> 
> SOURCE ==XML==> DESTINATION (Bit Bucket?)
>           ^
>           |
>          SAX
>           |
>           E (Event Notification)
>           |
> Your      |
> App  <--<-+
> Prog
> 
> In this diagram, I picture an information
> stream where XML documents move from the
> SOURCE to the DESTINATION.  I picture SAX
> as an "Observer" (see design pattern book),
> picking off the events of interest and
> passing along notifications to your application
> program.
> 
> Big Advantage:    You don't have to store the stream.
> Big DisAdvantage: You can't go back in time.
> 

I really missed the big point of Architectural
Forms here: the Stream can be Observed
by more than one Object, each with a different
context.  Thus the Stream has multiple Classifications
depending upon the Object which is doing the
Observing.
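
To make the "more than one Observer" idea concrete, here is a minimal
sketch using Python's standard xml.sax module. The class names (Fanout,
TradeCounter, VolumeTotal) and the ticker markup are invented for
illustration; this shows plain SAX observers, not Architectural Forms
themselves.

```python
import xml.sax
from xml.sax.handler import ContentHandler


class Fanout(ContentHandler):
    """Forwards each SAX event to every registered observer."""

    def __init__(self, observers):
        super().__init__()
        self.observers = observers

    def startElement(self, name, attrs):
        for obs in self.observers:
            obs.startElement(name, attrs)

    def characters(self, content):
        for obs in self.observers:
            obs.characters(content)

    def endElement(self, name):
        for obs in self.observers:
            obs.endElement(name)


class TradeCounter(ContentHandler):
    """Classifies the stream as a sequence of trades to count."""

    def __init__(self):
        super().__init__()
        self.count = 0

    def startElement(self, name, attrs):
        if name == "trade":
            self.count += 1


class VolumeTotal(ContentHandler):
    """Classifies the very same stream as share volume to add up."""

    def __init__(self):
        super().__init__()
        self.shares = 0

    def startElement(self, name, attrs):
        if name == "trade":
            self.shares += int(attrs.get("shares", "0"))


ticker = (b'<ticker><trade sym="ABC" shares="100"/>'
          b'<trade sym="XYZ" shares="250"/></ticker>')
counter, volume = TradeCounter(), VolumeTotal()
xml.sax.parseString(ticker, Fanout([counter, volume]))
print(counter.count, volume.shares)
```

The document passes through once; each observer applies its own
classification and keeps only what it cares about.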

> DOM - XML AS OBJECT, A PULL MODEL
> ~~~
> 
>     +-------+
>     |  XML  |
>     | STORE |  <==> DOM
>     +-------+       | |
>                     | |
> Your >-(Request)-->-+ |
> App                   |
> Prog <-(Response)-<---+
> 
> In this diagram, I picture a storage
> facility, be it memory, disk, database,
> etc., which holds the XML document for
> random access.  I picture DOM as a Broker?
> answering requests from your application
> program about the structure and content
> of the XML document.
> 
> Big Advantage:    You have random access to the
>                   document object.
> Big DisAdvantage: You must provide storage for the
>                   document object.
> 

As the complement, here the Stream is treated
as an Object, so that a single Classification
mechanism is applied.  
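
For comparison, a minimal sketch of the request/response style using
Python's standard xml.dom.minidom module. The order document and its
element names are made up for illustration.

```python
from xml.dom.minidom import parseString

# The whole document is parsed and stored before any question is asked.
order = parseString(
    "<order id='17'><buyer>Pat</buyer>"
    "<item sku='A1' qty='2'/><item sku='B7' qty='1'/></order>"
)

# Random access: ask for the parts in any order, as often as needed.
items = order.getElementsByTagName("item")
last_sku = items[1].getAttribute("sku")
buyer = order.getElementsByTagName("buyer")[0].firstChild.data
order_id = order.documentElement.getAttribute("id")
total_qty = sum(int(item.getAttribute("qty")) for item in items)

print(order_id, buyer, last_sku, total_qty)
```

Note that the queries jump backwards and forwards through the document,
which is exactly what the stored tree pays for.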

For XML, I feel that this is much less useful 
than SAX.  If you want to go this far with 
objects, perhaps it's better to "translate" 
the Stream (using SAX) into an Object Framework
that better reflects your problem domain.
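
A sketch of what such a "translation" might look like, assuming Python's
standard xml.sax module. Order and OrderBuilder are hypothetical names, the
markup is invented, and the handler assumes well-shaped input (every item
sits inside an order).

```python
import xml.sax
from dataclasses import dataclass, field
from xml.sax.handler import ContentHandler


@dataclass
class Order:
    """Hypothetical domain class; it knows nothing about XML."""
    buyer: str = ""
    skus: list = field(default_factory=list)


class OrderBuilder(ContentHandler):
    """Builds Order objects straight from SAX events, with no DOM in between."""

    def __init__(self):
        super().__init__()
        self.orders = []
        self._current = None
        self._text = []

    def startElement(self, name, attrs):
        self._text = []                      # start a fresh text buffer
        if name == "order":
            self._current = Order()
        elif name == "item":
            self._current.skus.append(attrs["sku"])

    def characters(self, content):
        self._text.append(content)

    def endElement(self, name):
        if name == "buyer":
            self._current.buyer = "".join(self._text).strip()
        elif name == "order":
            self.orders.append(self._current)
            self._current = None


builder = OrderBuilder()
xml.sax.parseString(
    b"<orders><order><buyer>Pat</buyer>"
    b"<item sku='A1'/><item sku='B7'/></order></orders>",
    builder,
)
print(builder.orders)
```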

By trying to avoid the "impedance" mismatch,
you undermine the relative strengths of both 
object-oriented systems and message-oriented
middleware, and end up with a crippled compromise.

I guess if the information in the XML document
has a single Observer, then DOM will work well,
but then the question becomes, why XML?
Just use object serialization.  If you absolutely
must use XML for buzz-word compliance, have 
your serialization library use XML.  Then you
don't need DOM at all.

> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> 
> Conclusion:
> 
> I would not say that one is better than the other.
> A true XML programmer should learn both.  For
> sequential access, SAX would be preferred since
> it requires less memory.  For random access, you
> should use DOM.  Any "complete" XML parsing tool
> should support both attitudes of reasoning to
> best adapt to the user's needs.
> 
> It is tempting to pick one and only use that one,
> however, additional coding to store previous
> state will be the penalty for those who always
> use SAX, and extra memory/cycles will be the
> penalty for those who always use DOM.
> 
> As for the W3C, if both are not equally treated
> as standards, I respectfully submit that they
> should both be approved since they are both
> complementary aspects of the same process.
> 
> A unified standard, DOMSAX should be the result.
> Hope this helps!

Here I was not suggesting that they are "unified"
into "one" language, but rather "interwoven" so
that a programmer can switch back and forth
depending upon his context.  
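
One way to picture this kind of switching is Python's standard
xml.dom.pulldom module, which postdates this thread. It is shown only as a
sketch of the idea, and the quote markup is invented for illustration.

```python
from xml.dom import pulldom

feed = (
    "<quotes>"
    "<quote sym='ABC'><price>10.5</price></quote>"
    "<quote sym='XYZ'><price>99.0</price></quote>"
    "</quotes>"
)

watched = []
events = pulldom.parseString(feed)
for event, node in events:
    # Stream context: let everything flow past until a quote starts.
    if event == pulldom.START_ELEMENT and node.tagName == "quote":
        if node.getAttribute("sym") == "XYZ":
            # Object context: pull just this subtree into a DOM fragment.
            events.expandNode(node)
            price = node.getElementsByTagName("price")[0].firstChild.data
            watched.append((node.getAttribute("sym"), float(price)))

print(watched)
```

Only the quote of interest is ever materialized as a tree; the rest of the
stream is seen as events and then dropped.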

> 
> Question:  For SGML, do NSGMLS and GROVE
> share the same complementary pattern?
> 
> Best wishes,
> 
> Clark Evans

Here is another followup post.

> -------- Original Message --------
> Subject: Re: DOM and SAX: complementary aspects of XML
> Date: Thu, 28 Jan 1999 21:18:55 +0000
> From: Clark Evans <clark.evans at manhattanproject.com>
> Organization: Posted via RemarQ, http://www.remarQ.com - Discussions start here!
> CC: "Joseph Kesselman (yclept Keshlam)" <keshlam at alum.mit.edu>,dan at capecod.net
> Newsgroups: comp.text.xml,comp.text.sgml
> References: <36B0BF7E.4E0E09F at manhattanproject.com> <36B0C97E.87535497 at alum.mit.edu>
> 
> "Joseph Kesselman (yclept Keshlam)" wrote:
> > 
> > you _can_ use both DOM and SAX as part of the
> > processing stream for a single document.
> 
> Glad to hear that some people doing XML don't
> think of it as an exclusive choice.  This is 
> great.  Would you comment on the following?
> 
> Would it be safe to put XML processing on a Spectrum? Say..
> 
> 
> STREAM <--------------?---------------> OBJECT
> 
> Where a particular use of XML would fit somewhere in-between?
> 
> "Mostly" Stream Examples:
> ~~~~~~~~~~~~~~~~~~~~~~~~~
> * Production information from a piece of equipment on
> the plant floor, say the torque measurement on a drill.
> * The NASDAQ ticker tape during the middle of the day.
> * A video camera feed from a live site.
> * etc.
> 
> In these cases SAX might be the better choice, since it
> is the _event_ that is of interest ... waiting until the
> entire stream is processed before acting would be a bad 
> idea (in the case of NASDAQ, perhaps financially devastating?)
> 
> "Mostly" Object Examples:
> ~~~~~~~~~~~~~~~~~~~~~~~~~
> * An order placed on a web site.
> * The number of holes drilled at the end of the day
> by the plant floor production machine.
> * The NASDAQ closing prices.
> * etc.
> 
> In these cases, DOM might be the better choice, since it
> is the entire object that is of interest.  In these 
> cases it's not the parts of the stream that matter, but
> the entire stream taken as a whole.  Having the buyer
> without what he/she purchased doesn't do any good.
> 
> In Between Examples:
> ~~~~~~~~~~~~~~~~~~~
> * A drawing.
> * NASDAQ 15 min snapshot
> 
> In these cases I was thinking it would be "ideal" to 
> have an integrated DOM/SAX tool, where the programmer
> could have the incoming stream spark SAX events, but
> would result in a queryable DOM object.  For instance,
> the drawing could be drawn as the stream is being
> read, but any editing would have to wait until
> the whole drawing has been received.  For the NASDAQ
> example, having the current ticker would be nice,
> but having a "snapshot" in time accessible via 
> DOM might be very useful as well.
> 
> Question:  Are there any XML parsing tools that take
> both of these approaches by allowing _both_ DOM 
> and SAX?  If you didn't register for any events, 
> you would not be using the SAX part, where, if you
> didn't ask for the "DOM" object, the stream
> would be discarded after events are fired.  This
> type of "unified" tool would be great.
> 
> Your thoughts?
> 
> Clark Evans
> 
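
The drawing case can be approximated with iterparse from Python's standard
xml.etree.ElementTree module. ElementTree is neither SAX nor DOM, so treat
this as an analogy to the "unified" tool rather than an instance of it; the
drawing markup and the drawn list (a stand-in for real rendering) are made
up for illustration.

```python
import io
import xml.etree.ElementTree as ET

drawing = io.BytesIO(
    b"<drawing><line x1='0' y1='0' x2='5' y2='5'/>"
    b"<circle cx='2' cy='2' r='1'/></drawing>"
)

drawn = []
parsing = ET.iterparse(drawing, events=("end",))
for event, elem in parsing:
    if elem.tag != "drawing":
        drawn.append(elem.tag)   # stand-in for rendering the shape right away

# Once the stream has ended, the whole tree is there for queries and edits.
root = parsing.root
circle = root.find("circle")
circle.set("r", "3")
print(drawn, len(root), circle.get("r"))
```

Shapes are handled as they arrive, while editing waits for the complete
tree, which is the split described in the post.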

Clark Evans wrote:
> 
> <philosophy>
> 
> Perhaps the role of namespaces is fundamentally
> different in the "stream processing" paradigm
> than it is in "object processing" paradigm?
> 
> Could this be the issue underlying the current
> debate?  I don't know enough on the topic to
> say.  However, I feel I can help by explaining
> my observations about the differences between
> the paradigms.
> 
> 1. A tenet of object oriented programming
> is encapsulation, data hiding.  For stream
> processing it is the opposite, data exposure.
> 
> 2. Objects are modified or undergo state change
> by invoking methods. Where streams are re-written
> or translated by transformations.
> 
> 3. Ideally, an object retains its identity.
> The entire goal of a stream is to merge its
> information with each and every observer; this
> is equivalent to identity loss.
> 
> 4. An object has a 1-1 correspondence between its
> data and its code.  A stream has a 1-M correspondence
> between its data and its code.  Where the document is
> the data, and the code is the observer's
> transformation system.
> 
> 5. Objects are finite, they have a boundary.
> Streams may be effectively infinite.  For
> example, a pressure transducer sending water
> level measurements may operate continuously
> for years!   Thus, you can store an entire
> object in memory, you may not want to store
> an entire stream in memory.
> 
> 6. An object's interface describes a block of
> functionality provided.  A stream's interface
> describes the information content that it carries.
> 
> 7. An object has one type or class which is
> assigned to the data, where a stream can
> be classified differently by each and every
> observer.  This is especially clear if
> you read about Architectures.
> 
> etc.
> 
> Anyway, I'm not saying that one is better
> than the other, just that they are different
> and subtly interwoven. For instance, Scenarios
> is the study of object interactions as
> a stream of events.  And SAX is a wonderful
> event-driven stream observer object.
> 
> I feel that the key to the success of XML
> is to recognize that it is part of a different
> paradigm --XML complements existing technology.
> As such, it is important to scrutinize the
> application of object-oriented idioms to the
> new paradigm.
> 
> </philosophy>
> 
> Hope this helps,
> 
> Clark Evans
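
Point 5 above (a stream may be far too long to store) can be sketched with
the incremental interface of Python's standard xml.sax parser. HighWater
and transducer are hypothetical names, the markup is invented, and the
demo source ends after three readings only so that the script terminates.

```python
import xml.sax
from xml.sax.handler import ContentHandler


class HighWater(ContentHandler):
    """Keeps only a running maximum, never the readings themselves."""

    def __init__(self):
        super().__init__()
        self.peak = None

    def startElement(self, name, attrs):
        if name == "level":
            value = float(attrs["cm"])
            if self.peak is None or value > self.peak:
                self.peak = value


def transducer():
    """Hypothetical source; a real one might yield chunks for years."""
    yield "<levels>"
    for cm in (101.5, 140.0, 98.2):
        yield f"<level cm='{cm}'/>"
    yield "</levels>"


handler = HighWater()
parser = xml.sax.make_parser()
parser.setContentHandler(handler)
for chunk in transducer():
    parser.feed(chunk)   # hand over one chunk at a time, keeping none of them
parser.close()           # signals the end of the demo stream

print(handler.peak)
```

Memory use depends on the handler's state (one number here), not on how
long the stream runs.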

xml-dev: A list for W3C XML Developers. To post, mailto:xml-dev at ic.ac.uk
Archived as: http://www.lists.ic.ac.uk/hypermail/xml-dev/ and on CD-ROM/ISBN 981-02-3594-1