FYI: YML: A Grand Unification of SAX and DOM? (fwd)
Clark C. Evans
clark.evans at manhattanproject.com
Fri Dec 3 19:15:13 GMT 1999
Hello everyone,
I'd like to carry on a sub-discussion of SML called
YML on the SML list. It is of particular importance
to the interaction between XML and XSL.
Clark
---------- Forwarded message ----------
Date: Fri, 3 Dec 1999 02:13:25 -0500 (EST)
From: Clark C. Evans <clark.evans at manhattanproject.com>
To: sml-dev at eGroups.com
Subject: YML: A Grand Unification of SAX and DOM?
Paul,
I didn't want to speak up a few days ago -- claiming that I was
going to do a grand unfication of SAX and DOM (even though this
was exactly what I was thinking may be the case).
On Thu, 2 Dec 1999, Paul Tchistopolskii wrote:
> > Isn't it the end of long discussion of Elements vs Attributes?
> > Now when I see the question: "Should I use attributes or
> > elements?" - I know the answer:
> >
> > "If you want it to be processed by current APIs not keeping
> > the entire docuemnt in the memory - use attributes everywhere
> > you can."
Exactly... It is a matter of what type of access you would
like to have when processing the information -- you have two
choices: Random Access (DOM) or Sequential Access (SAX)
The trick, however, is subtle. You don't want _all_ random
access nor do you want _all_ sequential access. You want a
ballance. And this is where the binary doubly recursive
pattern comes in.
On Thu, 2 Dec 1999, G. Ken Holman wrote:
> I posit that expressing properties of hierarchical components as
> sub-elements of ancestry does not work well in information design
> for the following reasons:
>
> (1) - I claim that the information in <b> has no (and should have no)
> direct relation to the information in <e>, but that the information in
> attr1= may or may not have direct relation to the information in <e>
> because of descendent scope (<e> is a descendant of <a> but not of <b>
> so how could <b>'s "influence" be construed as impacting on <e>?)
>
> (2) - when I am processing <e> (say with XSLT and XPath) I can very
> easily determine properties of <e> by examining ascendent places in
> the hierarchy (<a> is an ascendant of <e> therefore <a> and its
> attributes are easily addressed via the ancestor:: axis)
> - if I didn't have attributes and I was obliged to use sub-elements,
> the extra processing involved to examine all child elements of all
> ancestors for possible applicable properties would be both unwieldy
> and ambiguous
This is a great summary. The last day I jotted down a rough
sketch of the Idea I've had running thorugh my head the last
few days. It is posted at
http://clarkevans.com/yml.html
With a text version below. It's *far* from perfect, but I wanted
to get the idea out as a cohesive unit so we can work on it
as a community.
I'd like to hear what you all think...
Clark
---------------------------------
YML - The Why Markup Language
----------------------------------
Authors: Clark Evans
History: Version .1, 03-DEC-1999
Summary
YML is currently an assembly of thoughts regarding
the creation of a doubly recursive markup language
and parser description. YML is an extension of
the simple markup language ("SML"), which is a
strict subset of the extensible markup language
("XML"). Further, YML is a unification of the XML
document object model ("DOM") and the simple
application programming interface for XML ("SAX").
Motivation
YML was motivated from two reoccurring debates on the
XML list, under the titles "SAX vs DOM" and "element
vs attribute". It is interesting how they are interwoven.
The SAX vs DOM debate often centers around which is
better for processing information: random access
method (RAM) or a sequential access method (SAM).
Those from the DOM camp state that having the entire
document in memory makes things easy to program; while
those from the SAX camp point to efficiencies of
stream processing.
The element vs attribute debate is concerned with
the distinction between an element's content and
its attributes. One camp believes that the difference
reflects an clear contextual binary decomposition, while
the other camp views the distinction as syntax sugar.
These debates are subtly linked since SAX provides
an accepted, de-facto interpretation -- included
with sequential access for each element, is a
random access collection of its attributes. This
interpretation is of huge value, as it ties the
element vs attribute debate to a more tangeable
processing concern, sequential access or
random access.
SAX is of interest for one other reason. It does
not notify the processor individually for each
attribute -- instead, it waits until it has
collected each and every attribute before providing
them as a single collection. This is in sharp
contrast to its treatment of elements, which are
handled individually, one by one.
Another Motivation: Transformation Languages
The real value in XML isn't just data representation
or ease of parsing, it is the promise of a
transformation language expressed in XML itself.
The XML style sheet language ("XSL") is one approach
to markup language transformations. XSL is the composite
of many wonderful constructs, lessened by a few
particularly bad restrictions.
The delightful recursive template matching system is XSL's
claim to fame. XSL is a collection of such templates,
where a match clause identifies an expression which
will trigger particular elements (and not attributes) to
be processed according to the rules provided. These 'xpath'
expressions define multiple axis. The ancestor axis is the
most important, it is the current element stack. Of secondary
importance is the attribute axis, which allows access
to an element's attributes. These axis together allow for
a very powerful way to identify and process elements.
One disturbing aspect of XSL is the inclusion of the forward
and previous axes in this xpath expression syntax. Furthermore,
loop constructs and the ability to re-visit elements was also
included in the language. For an XSL processor to reasonably
support these features, random access to the information is
a requirement. This is a problem for large-size information
sets or low-memory processing devices.
There are a few individuals who are contemplating a stream
based alternative to XSL which will work without these large
memory restrictions. Assuming that SAX was the underlying
basis for such a processor; the only items available in
random-access memory at any given time would be an element
stack, including each element's attributes. This is hardly
good enough to be efficient. An extension to a minimal
processor build on top of SAX, could enable those elements
on the "previous" axis -- as long as they are mentioned
somewhere in the stylesheet. A smart collector could identify
an element which must be used later, pinning it for random
access in the future. Unfortunately, this method would not
work well with dynamically generated xpath expressions.
There are many other concerns as well, such as how to
accomplish sorting, repeat performances, and other
clear benifits that the random access brings. However,
so far, there has not been a clear approach.
Direction
The goal of YML is to be a building block upon which
an alternative to XSL can be built. One which is more
space efficient than XSL, yet one does not sacrifice
time as a pure stream based alternative appears to do.
To accomplish this, memory must be managed differently
at the parser level; thus a new parser description ("PD")
must be provided -- one that ballances the constraints of
SAX with the power of DOM. And, to accomplish this,
the syntax of the markup language ("ML") itself must be
substantially altered. Strictly speaking, the ML could
easily be an XML extension, however, the data model
presented here would be too hard to grok with all of XML's
subtleties.
These are serious changes, however, if it is possible to
unify SAX and DOM, and perhaps enable the generation of a
better transformation processor, it may be worth it.
Background
Consider these included by reference:
http://www.lists.ic.ac.uk/hypermail/xml-dev/xml-dev-Nov-1999/1120.html
http://www.lists.ic.ac.uk/hypermail/xml-dev/xml-dev-Nov-1999/1136.html
http://www.lists.ic.ac.uk/hypermail/xml-dev/xml-dev-Nov-1999/1205.html
http://www.egroups.com/group/sml-dev/31.html
http://www.egroups.com/group/sml-dev/89.html
Development
Consider an enhanced SAX parser with an element stack enabling
the new XSL processor random access to the entire ancestor
and attribute axis, with sequential access otherwise.
Consider further:
<root r="x" >
<s1/>
<s2/>
</root>
Here, both sequential access nodes s1 and s2 have random access
to the node r. However, these nodes cannot access each other
since they are provided sequentially: When the s1 is visited,
s2 has not yet been provided. Also, when s2 is visited, s1
has already been dropped from memory.
Note, that this is recursive:
<root r="x">
<s sr="a">
<ss/>
</s>
</root>
Here it is clear that the node ss can access node
s, sr, and r. So far so good. Lets enumerate the
possible node types:
s, r, sr, ss, ssr, sss, sssr, ...
Notice that given the current XML syntax, and this
processing model, random access nodes with children
are not allowed. In other words, a given xpath
may only consist of sequentially accessed nodes
followed by an optional, random access tail node. Meat
It is shown below how a change in XML's syntax to
permit recursion on the attribute axis would allow
a parser to be built having random access nodes
allowing children.
It is hypothesized that this syntax change would allow
a construction of a parser that could be used in lieu
of both DOM and SAX, giving random access or sequential
access in a context sensitive manner, as a function
of the source information.
It is further hypothesized that this parser could be
used to build a processor that has most of XSL's
advantages without sacrificing performance.
The Change
Consider the following syntax (due to John Cowan):
<root
<r
<rr/>
>
<rs/>
</r>
>
<s
<sr/>
>
<ss/>
</s>
</root>
With this change, it is possible to generate all of
the possible node types:
r s rr rs sr ss rrr rrs rsr rss srr srs ssr sss ...
This may not be the prettiest syntax; however,
XML becomes a sub-set of this new syntax -- where
the following definition is used for backwards
compatability.
<el att="val" /> <=> <el <att>val</att> />
And perhaps allowing the following syntax sugar
is used for nested attributes (due to Colas Nahaboo):
<el att=<ch>val</ch> /> <=> <el <att <ch>val</ch>/>/>
Further, there should be no problem for XML parsers to
enable the recognition of the new syntax since the
above are since neither of the above expressions are
well-formed.
Thus, this is the basis for a completely different
parser behavior that alternates between random access
or serial access depending upon the type of node
which is encountered, according to something like
the following:
interface yml-node {
boolean is-random();
}
interface yml-branch extends yml-node {
String name();
}
interface yml-leaf extends yml-node {
String text();
}
interface yml-stack {
yml-node current();
yml-stack parent();
// list of random children
int count();
yml-stack child(int i);
// private
yml-stack(yml-stack parent, yml-node current);
add( yml-stack child );
complete();
}
interface yml-output {
void push( yml-stack element );
void leaf( yml-stack element );
void pop( yml-stack element );
}
interface yml-input {
// to be defined
}
void yml-process( yml-input in, yml-output out,
yml-stack stack, boolean is-random )
{
if(in.peek-is-leaf()) {
yml-stack top =
new yml-stack( stack ,
new yml-leaf(is-random,in.next());
if(is-random)
stack.add(top);
else
out.leaf(top);
return;
}
// it's a branch
yml-stack top =
new yml-stack( stack,
new yml-branch(is-random, in.next());
while(in.inside-the-tag())
yml-process(in,out,top,true);
top.complete();
if(!is-random)
out.push(top);
while(in.outside-the-tag())
yml-process(in,out,top,false);
if(is-random)
stack.add(top);
else
out.pop(top);
return;
}
Thus, if the entire output uses the "random recursion"
extreme, with only a single node (the top one) being
sequential, then this method looks very much like DOM.
In the other extreme, if "sequential recursion" is used,
with an occasional attribute, then this method looks
very much like SAX. However, if the input stream is
a unique mixture then the result is surprising:
the parser configures its memory usage subject
to the structure of the information being processed.
Thus, a unified parser is created.
For an transformation processor built on top of this
type of parser, it motivates an additional 'random'
axis. Define a sequential node as one visited by
the procesor's interface, and is dropped from
memory afterwords. Define a random node as
one not visited by the processor's interface, but
made available through a random access method.
Access of random nodes by sequential nodes is
provided by the following rules: (a) Sequential
nodes may reference its or any of its sequential
ancestors's random siblings. (b) If a sequential
node may reference a random node, then it may also
reference any random children of the random node.
Furthermore, if XML syntax compliance is absolutely
needed -- the attribution notion could be used to
mark random nodes:
<root>
<r random-access="yes">
<rr random-access="yes"/>
<rs/>
<r>
<s>
<sr random-access="yes" />
<ss/>
</s>
</root>
Alternatively, the distinction between sequential
or random nodes could be detailed in a DTD or some
other schema document. However, I feel all of these
are kluges and that John Cowan's syntax is the best
expression of the idea. So the syntax becomes a bit
more complicated... maybye. I say make the lower level
a bit more complicated... to simplify everything else.
It may be a ways off, however, I believe that this
binary recursive method provides an novel and
unexpected approach to making information processors
more efficient.
Best Wishes,
Clark Evans
Credits
Too many to mention, first the xml-dev and sgml-dev and xsl lists; filled
with smart people. Second, to Dan Palanza for introducing me to binary
recursive models. Further, to the huge amount of philosophical, and
technical literature out there regarding programming and systems theory
that has shaped the manner in which
I approach problems. Thanks!
xml-dev: A list for W3C XML Developers. To post, mailto:xml-dev at ic.ac.uk
Archived as: http://www.lists.ic.ac.uk/hypermail/xml-dev/ and on CD-ROM/ISBN 981-02-3594-1
To unsubscribe, mailto:majordomo at ic.ac.uk the following message;
unsubscribe xml-dev
To subscribe to the digests, mailto:majordomo at ic.ac.uk the following message;
subscribe xml-dev-digest
List coordinator, Henry Rzepa (mailto:rzepa at ic.ac.uk)
More information about the Xml-dev
mailing list