FYI: YML: A Grand Unification of SAX and DOM? (fwd)

Clark C. Evans clark.evans at manhattanproject.com
Fri Dec 3 19:15:13 GMT 1999


Hello everyone,

I'd like to carry on a sub-discussion of SML called
YML on the SML list.  It is of particular importance
to the interaction between XML and XSL.

Clark

---------- Forwarded message ----------
Date: Fri, 3 Dec 1999 02:13:25 -0500 (EST)
From: Clark C. Evans <clark.evans at manhattanproject.com>
To: sml-dev at eGroups.com
Subject: YML: A Grand Unification of SAX and DOM?

Paul,

I didn't want to speak up a few days ago -- claiming that I was 
going to do a grand unfication of SAX and DOM (even though this 
was exactly what I was thinking may be the case).

On Thu, 2 Dec 1999, Paul Tchistopolskii wrote:
> > Isn't it  the end of long discussion of Elements vs Attributes?
> > Now when I see the question: "Should I use attributes or
> > elements?" - I know the answer:
> >
> > "If you want it to be processed by current APIs not keeping
> > the entire docuemnt in the memory  - use attributes everywhere
> > you can."

Exactly...  It is a matter of what type of access you would 
like to have when processing the information -- you have two 
choices:   Random Access  (DOM) or Sequential Access (SAX)

The trick, however, is subtle.  You don't want _all_ random 
access nor do you want _all_ sequential access.  You want a 
ballance.  And this is where the binary doubly recursive
pattern comes in. 

On Thu, 2 Dec 1999, G. Ken Holman wrote:
> I posit that expressing properties of hierarchical components as
> sub-elements of ancestry does not work well in information design 
> for the following reasons:
>
> (1) - I claim that the information in <b> has no (and should have no)
> direct relation to the information in <e>, but that the information in
> attr1= may or may not have direct relation to the information in <e>
> because of descendent scope (<e> is a descendant of <a> but not of <b>
> so how could <b>'s "influence" be construed as impacting on <e>?)
>
> (2) - when I am processing <e> (say with XSLT and XPath) I can very
> easily determine properties of <e> by examining ascendent places in 
> the hierarchy (<a> is an ascendant of <e> therefore <a> and its
> attributes are easily addressed via the ancestor:: axis)
>   - if I didn't have attributes and I was obliged to use sub-elements,
> the extra processing involved to examine all child elements of all
> ancestors for possible applicable properties would be both unwieldy 
> and ambiguous    

This is a great summary.  The last day I jotted down a rough
sketch of the Idea I've had running thorugh my head the last
few days.  It is posted at

   http://clarkevans.com/yml.html

With a text version below.  It's *far* from perfect, but I wanted 
to get the idea out as a cohesive unit so we can work on it
as a community.

I'd like to hear what you all think...

Clark


--------------------------------- 
YML - The Why Markup Language
---------------------------------- 
Authors:  Clark Evans 
History:  Version .1, 03-DEC-1999 

Summary

  YML is currently an assembly of thoughts regarding
  the creation of a doubly recursive markup language
  and parser description.  YML is an extension of
  the simple markup language ("SML"), which is a
  strict subset of the extensible markup language
  ("XML").  Further, YML is a unification of the XML
  document object model ("DOM") and the simple
  application programming interface for XML ("SAX"). 

Motivation

  YML was motivated from two reoccurring debates on the
  XML list, under the titles "SAX vs DOM" and "element
  vs attribute".  It is interesting how they are interwoven.
  The SAX vs DOM debate often centers around which is
  better for processing information: random access
  method (RAM) or a sequential access method (SAM).
  Those from the DOM camp state that having the entire
  document in memory makes things easy to program; while
  those from the SAX camp point to efficiencies of
  stream processing.

  The element vs attribute debate is concerned with
  the distinction between an element's content and
  its attributes.  One camp believes that the difference
  reflects an clear contextual binary decomposition, while
  the other camp views the distinction as syntax sugar.
  These debates are subtly linked since SAX provides
  an accepted, de-facto interpretation -- included
  with sequential access for each element, is a
  random access collection of its attributes.  This
  interpretation is of huge value, as it ties the
  element vs attribute debate to a more tangeable
  processing concern, sequential access or
  random access.

  SAX is of interest for one other reason.  It does
  not notify the processor individually for each
  attribute -- instead, it waits until it has
  collected each and every attribute before providing
  them as a single collection.  This is in sharp
  contrast to its treatment of elements, which are
  handled individually, one by one. 

Another Motivation: Transformation Languages

  The real value in XML isn't just data representation
  or ease of parsing, it is the promise of a
  transformation language expressed in XML itself.
  The XML style sheet language ("XSL") is one approach
  to markup language transformations.  XSL is the composite
  of many wonderful constructs, lessened by a few
  particularly bad restrictions.

  The delightful recursive template matching system is XSL's
  claim to fame.  XSL is a collection of such templates,
  where a match clause identifies an expression which
  will trigger particular elements (and not attributes) to
  be processed according to the rules provided. These 'xpath'
  expressions define multiple axis.  The ancestor axis is the
  most important, it is the current element stack.  Of secondary
  importance is the attribute axis, which allows access
  to an element's attributes.  These axis together allow for
  a very powerful way to identify and process elements.
  One disturbing aspect of XSL is the inclusion of the forward
  and previous axes in this xpath expression syntax.  Furthermore,
  loop constructs and the ability to re-visit elements was also
  included in the language.  For an XSL processor to reasonably
  support these features, random access to the information is
  a requirement. This is a problem for large-size information
  sets or low-memory processing devices.

  There are a few individuals who are contemplating a stream
  based alternative to XSL which will work without these large
  memory restrictions.  Assuming that SAX was the underlying
  basis for such a processor; the only items available in
  random-access memory at any given time would be an element
  stack, including each element's attributes.  This is hardly
  good enough to be efficient.  An extension to a minimal
  processor build on top of SAX, could enable those elements
  on the "previous" axis -- as long as they are mentioned
  somewhere in the stylesheet.  A smart collector could identify
  an element which must be used later, pinning it for random
  access in the future.  Unfortunately, this method would not
  work well with dynamically generated xpath expressions.
  There are many other concerns as well, such as how to
  accomplish sorting, repeat performances, and other
  clear benifits that the random access brings.  However,
  so far, there has not been a clear approach. 

Direction

  The goal of YML is to be a building block upon which
  an alternative to XSL can be built.  One which is more
  space efficient than XSL, yet one does not sacrifice
  time as a pure stream based alternative appears to do.
  To accomplish this, memory must be managed differently
  at the parser level; thus a new parser description ("PD")
  must be provided -- one that ballances the constraints of
  SAX with the power of DOM.  And, to accomplish this,
  the syntax of the markup language ("ML") itself must be
  substantially altered.  Strictly speaking, the ML could
  easily be an XML extension, however, the data model
  presented here would be too hard to grok with all of XML's
  subtleties.

  These are serious changes, however, if it is possible to
  unify SAX and DOM, and perhaps enable the generation of a
  better transformation processor, it may be worth it. 

Background

  Consider these included by reference:
  http://www.lists.ic.ac.uk/hypermail/xml-dev/xml-dev-Nov-1999/1120.html
  http://www.lists.ic.ac.uk/hypermail/xml-dev/xml-dev-Nov-1999/1136.html
  http://www.lists.ic.ac.uk/hypermail/xml-dev/xml-dev-Nov-1999/1205.html
  http://www.egroups.com/group/sml-dev/31.html
  http://www.egroups.com/group/sml-dev/89.html 

Development

  Consider an enhanced SAX parser with an element stack enabling
  the new XSL processor random access to the entire ancestor
  and attribute axis, with sequential access otherwise.

  Consider further:

    <root r="x" >
      <s1/>
      <s2/>
    </root>

  Here, both sequential access nodes s1 and s2 have random access
  to the node r.  However, these nodes cannot access each other
  since they are provided sequentially: When the s1 is visited,
  s2 has not yet been provided.  Also, when s2 is visited, s1
  has already been dropped from memory.
  Note, that this is recursive:

    <root r="x">
      <s sr="a">
        <ss/>
      </s>
    </root>

  Here it is clear that the node ss can access node
  s, sr, and r.  So far so good.  Lets enumerate the
  possible node types:

       s, r, sr, ss, ssr, sss, sssr, ...

  Notice that given the current XML syntax, and this
  processing model, random access nodes with children
  are not allowed.  In other words, a given xpath
  may only consist of sequentially accessed nodes
  followed by an optional, random access tail node. Meat
  It is shown below how a change in XML's syntax to
  permit recursion on the attribute axis would allow
  a parser to be built having random access nodes
  allowing children.

  It is hypothesized that this syntax change would allow
  a construction of a parser that could be used in lieu
  of both DOM and SAX, giving random access or sequential
  access in a context sensitive manner, as a function
  of the source information.

  It is further hypothesized that this parser could be
  used to build a processor that has most of XSL's
  advantages without sacrificing performance. 

The Change

  Consider the following syntax (due to John Cowan):

  <root
    <r
      <rr/>
    >
      <rs/>
    </r>
  >
    <s
      <sr/>
    >
      <ss/>
    </s>
  </root>

  With this change, it is possible to generate all of
  the possible node types:

    r s rr rs sr ss rrr rrs rsr rss srr srs ssr sss ...

  This may not be the prettiest syntax; however,
  XML becomes a sub-set of this new syntax -- where
  the following definition is used for backwards
  compatability.

    <el att="val" /> <=> <el <att>val</att> />

  And perhaps allowing the following syntax sugar
  is used for nested attributes (due to Colas Nahaboo):

    <el att=<ch>val</ch> /> <=> <el <att <ch>val</ch>/>/>

  Further, there should be no problem for XML parsers to
  enable the recognition of the new syntax since the
  above are since neither of the above expressions are
  well-formed.

  Thus, this is the basis for a completely different
  parser behavior that alternates between random access
  or serial access depending upon the type of node
  which is encountered, according to something like
  the following:

    interface yml-node {
      boolean is-random();
    }
    interface yml-branch extends yml-node {
      String name();
    }
    interface yml-leaf extends yml-node {
      String text();
    }
    interface yml-stack {
      yml-node current();
      yml-stack parent();
      // list of random children
      int count();
      yml-stack child(int i);
      // private
      yml-stack(yml-stack parent, yml-node current);
      add( yml-stack child );
      complete();
    }
    interface yml-output {
      void push( yml-stack element );
      void leaf( yml-stack element );
      void pop( yml-stack element );
    }
    interface yml-input {
      // to be defined
    }
    void yml-process( yml-input in, yml-output out,
                      yml-stack stack, boolean is-random )
    {
      if(in.peek-is-leaf()) {
        yml-stack top =
          new yml-stack( stack ,
            new yml-leaf(is-random,in.next());
        if(is-random)
          stack.add(top);
        else
          out.leaf(top);
        return;
      }
      // it's a branch
      yml-stack top =
        new yml-stack( stack,
          new yml-branch(is-random, in.next());
      while(in.inside-the-tag())
         yml-process(in,out,top,true);
      top.complete();
      if(!is-random)
        out.push(top);
      while(in.outside-the-tag())
         yml-process(in,out,top,false);
      if(is-random)
        stack.add(top);
      else
        out.pop(top);
      return;
    }

  Thus, if the entire output uses the "random recursion"
  extreme, with only a single node (the top one) being
  sequential, then this method looks very much like DOM.
  In the other extreme, if "sequential recursion" is used,
  with an occasional attribute, then this method looks
  very much like SAX.  However, if the input stream is
  a unique mixture then the result is surprising:
  the parser configures its memory usage subject
  to the structure of the information being processed.
  Thus, a unified parser is created.

  For an transformation processor built on top of this
  type of parser, it motivates an additional 'random'
  axis.  Define a sequential node as one visited by
  the procesor's interface, and is dropped from
  memory afterwords.  Define a random node as
  one not visited by the processor's interface, but
  made available through a random access method.
  Access of random nodes by sequential nodes is
  provided by the following rules: (a) Sequential
  nodes may reference its or any of its sequential
  ancestors's random siblings.  (b) If a sequential
  node may reference a random node, then it may also
  reference any random children of the random node.
 
  Furthermore, if XML syntax compliance is absolutely
  needed -- the attribution notion could be used to
  mark random nodes:
 
  <root>
    <r random-access="yes">
      <rr random-access="yes"/>
      <rs/>
    <r>
    <s>
      <sr random-access="yes" />
      <ss/>
    </s>
  </root>
 
  Alternatively, the distinction between sequential
  or random nodes could be detailed in a DTD or some
  other schema document.  However, I feel all of these
  are kluges and that John Cowan's syntax is the best
  expression of the idea.  So the syntax becomes a bit
  more complicated... maybye.  I say make the lower level
  a bit more complicated... to simplify everything else.
 
  It may be a ways off, however, I believe that this
  binary recursive method provides an novel and
  unexpected approach to making information processors
  more efficient.
 
 
Best Wishes,
 
Clark Evans
 
 
Credits
 
Too many to mention, first the xml-dev and sgml-dev and xsl lists; filled
with smart people.  Second, to Dan Palanza for introducing me to binary
recursive models.  Further, to the huge amount of philosophical, and
technical literature out there regarding programming and systems theory
that has shaped the manner in which
I approach problems.  Thanks!  





xml-dev: A list for W3C XML Developers. To post, mailto:xml-dev at ic.ac.uk
Archived as: http://www.lists.ic.ac.uk/hypermail/xml-dev/ and on CD-ROM/ISBN 981-02-3594-1
To unsubscribe, mailto:majordomo at ic.ac.uk the following message;
unsubscribe xml-dev
To subscribe to the digests, mailto:majordomo at ic.ac.uk the following message;
subscribe xml-dev-digest
List coordinator, Henry Rzepa (mailto:rzepa at ic.ac.uk)





More information about the Xml-dev mailing list