Streaming XML (Was RE: XML Information Set Requirements, W3C Note 18-February-1999)

Steven R. Newcomb srn at
Wed Feb 24 22:51:24 GMT 1999

[Jonathan Borden:]

> ... this is basic stuff, but the point is to
> emphasize that the distinction between what an
> object 'does' and what an object 'is' is not so
> clearcut.

Actually, property sets make it very clearcut.
Remember that property sets are not implementation
descriptions, whereas UML models are.

In property sets there are never any methods
whatsoever.  This point is emphasized by the fact
that, in the grove paradigm, the information
components are called "nodes" rather than
"objects".  If you choose to instantiate a grove
as a collection of objects (as many reasonable
people, including those at my own company,
certainly would), that's OK, but the fundamental
abstraction does not have the concept of methods.
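
To make the "nodes, not objects" point concrete, here is a
minimal sketch in Python.  The names Node, class_name,
properties and describe are my own illustrative assumptions,
not anything defined by the ISO property set; the only point
is that the node carries named property values and nothing
else, while behaviour lives in the application.

```python
from dataclasses import dataclass, field
from typing import Any, Dict


@dataclass(frozen=True)
class Node:
    """A grove node: a class name plus named property values, no methods."""
    class_name: str  # name of a node class declared in some property set
    properties: Dict[str, Any] = field(default_factory=dict)


def describe(node: Node) -> str:
    """Application-side behaviour, kept outside the node itself."""
    names = ", ".join(sorted(node.properties))
    return f"{node.class_name} node with properties: {names}"


# Hypothetical node for an element such as <title>Groves</title>.
title = Node("element", {"gi": "title", "content": ("Groves",)})
print(describe(title))
```

An implementation is free to wrap such nodes in richer objects;
the sketch only shows that the abstraction itself needs none.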

[Much good stuff from Jonathan Borden omitted,
with all points taken.]

> In fact, when you get out of the SGML/XML world,
> the use of the terms 'property set' and 'grove'
> get replaced by terms 'UML', 'persistence' and
> 'object model'. What you promise that use of
> property sets and grove plans will automate
> processing of data and interoperability, CASE
> tools vendors promise using UML. What is the
> essence of the difference between an information
> set and/or property set and/or grove plan versus
> UML?

I was hoping you would ask this question!

Let me begin by oversimplifying: the difference is
that you can do much more with UML, and that
oversufficiency is precisely UML's deficiency in
this problem-space.

It is very difficult for people who have made
their careers in *information processing* to
perceive the virtue of making a complete
distinction between processing and information.
Even so, it's of paramount importance to make this
distinction, if any of the following statements
are true:

* the information may outlast existing processing,

* the information may have unforeseen uses in an
  ever-changing world, and

* the information must be interchanged in an open,
  multivendor environment.

Instead of encapsulating such information in
methods, as objects often do, we need to
encapsulate it in semantics, as XML can be used to
do.  Having rendered the information as XML, and
having chosen appropriate semantic-bearing tags
and other attributes for its various components,
we now have the information in a totally useless
but highly interchangeable form that can become
input to any application for any purpose,
including unforeseen purposes.

For me, this useless but interchangeable XML form
of the information is the form that is most
deserving of its owner's respect.  It is the
owner's best choice of representation as the
"maintained source code" of the information asset.
It's the form that nobody but the information
owner owns or controls.  It's the form that no
software vendor has a lock on.  It's the form that
(presumably) has everything needed to reconstitute
a useful, application-ready form of the same
information asset, regardless of the nature of
that application, foreseen or unforeseen.

Now let's consider how well-described this XML
asset really is.  After all, if the asset doesn't
have a very accurate description, we can't be
sure that unforeseen applications will find the
information intelligible.

With DTDs, we have a way to model the structural
relationships of the elements to each other.  But
that's not enough to guarantee that the
information will be understood in the manner that
its architects and creators intended.  With
various proposed XML schema languages, we can
impose lexical typing requirements and certain
additional syntactic/structural requirements, but,
again, that doesn't guarantee that the information
will be understood in the manner that was
intended.  Neither the DTD nor the schema
extensions so far proposed can tell us the
information set that is supposed to be derivable
from the XML form of the information asset.  The
information is still not described well enough to
allow unforeseen applications, developed by
unforeseeable developers, to use the information
or to create new but similar information.  All of
the generic structural/syntactic validation in the
world will not guarantee that!

This is because the interchangeable form of the
information is not the same as the useful form,
which we will assume, for purposes of this
discussion, is objects that conform to certain
classes and have certain constellations of
properties and relationships.  Now the question
becomes, "What defines the data,
interrelationships, and semantics of those
objects?"  The ISO/SGML answer is, "A property
set, designed as part of the interchange
architecture, that defines the classes of objects
that will reflect the quintessential information
set conveyed by the resource."

The object classes defined by a property set, and
the node-objects in the groves that conform to
those classes, are strictly the canonical, static
*result* of the processing that is explicitly (but
only conceptually) *required* to be done to all
resources that conform to the architecture, before
they are used by an application.  Conceptually
speaking, these "groves" fully respect the
characteristics of the interchangeable resource
that they represent, including the fact that an
interchangeable resource has no methods, and there
is nothing dynamic (or even useful) about it when
it's in its XML form.

A property set is an abstract model of the useful
information that can be extracted from an
interchangeable resource.  There is nothing in a
grove that isn't already in the corresponding
resource.  Property sets are designed to exactly
reflect the characteristics of information that
can be extracted from information resources.
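
As a rough illustration of a grove being the static result of
processing a resource, the following Python sketch derives
nodes from an XML string.  PROPERTY_SET here is a drastically
reduced, hypothetical stand-in for a real property set, and
make_node, build and build_grove are names I have made up for
the example; a real grove would model far more than this.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass, field
from typing import Any, Dict

# Hypothetical, heavily reduced property set: node class -> declared properties.
PROPERTY_SET = {
    "element": ("gi", "attributes", "content"),
    "data": ("chars",),
}


@dataclass
class Node:
    class_name: str
    properties: Dict[str, Any] = field(default_factory=dict)


def make_node(class_name: str, **props: Any) -> Node:
    """Create a node, rejecting properties the property set does not declare."""
    unknown = set(props) - set(PROPERTY_SET[class_name])
    if unknown:
        raise ValueError(f"undeclared properties for {class_name}: {sorted(unknown)}")
    return Node(class_name, props)


def build(elem: ET.Element) -> Node:
    """Turn one parsed element, and everything under it, into grove nodes."""
    content = []
    if elem.text:
        content.append(make_node("data", chars=elem.text))
    for child in elem:
        content.append(build(child))
        if child.tail:  # character data that follows the child element
            content.append(make_node("data", chars=child.tail))
    return make_node("element", gi=elem.tag,
                     attributes=dict(elem.attrib), content=content)


def build_grove(xml_text: str) -> Node:
    return build(ET.fromstring(xml_text))


grove = build_grove('<memo id="m1"><to>Jonathan</to>See attached.</memo>')
```

Every value in the resulting nodes comes from the XML text;
the sketch adds structure, not information.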

An intelligent person like yourself may remark,
"Well, then, I guess the abstract properties of
C++ notation must be very complex, because they
can describe arbitrarily complex processes."
You're right, they are, and the abstract
properties of C++ notation can be modeled using
the property set paradigm.  (And modeling C++
notation would be an interesting exercise,
although I'm not yet confident of commercial
interest.)  A property set for C++ notation might
include node classes with such names as "variable
name", "passed argument", "operator", "method",
"object", "class definition", etc.

So why bother with property sets, when UML is more powerful?

* Because property sets impose the design
  discipline of focusing on what is being
  interchanged, rather than on what might be done
  by particular applications.  They force you to
  focus on the precise nature of the "maintained
  source code" of the information.  They force you
  to think more abstractly, which can be
  uncomfortable but is often very worthwhile.
  They force you to recognize that interchangeable
  information cannot modify itself, and has no
  built-in methods.

* Because property sets are designed to support
  the addressing of arbitrary components of
  information, and their nature imposes the
  discipline of designing for various forms of
  addressing.  Everything that is modeled in a
  property set can become a node in a grove, and
  everything that can become a node in a grove is
  predictably and reproducibly addressable.  This
  means that addresses created and recorded by one
  application will be understandable and correctly
  resolvable by other applications.  This is the
  key to the solution of the general hyperlinking
  problem.  If, for example, we're addressing some
  node by counting other nodes, all of the counted
  nodes must exist, at least conceptually.
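
The addressing point can be sketched in Python as well.  This
assumes a toy grove held as nested dicts with a "content"
property, and an address expressed as a list of 1-based child
positions; node, resolve and address_of are hypothetical names
for the example, not part of any standard.

```python
def node(class_name, **props):
    """Build a toy grove node as a plain dict of property values."""
    return {"class": class_name, **props}


grove = node("element", gi="memo", content=[
    node("element", gi="to", content=[node("data", chars="Jonathan")]),
    node("data", chars="See attached."),
])


def resolve(root, path):
    """Follow 1-based child positions through each node's 'content'."""
    current = root
    for position in path:
        children = current.get("content", [])
        if not 1 <= position <= len(children):
            raise LookupError(f"no child {position} under {current['class']} node")
        current = children[position - 1]
    return current


def address_of(root, target, path=()):
    """Return the counting address of target, or None if it is not in the grove."""
    if root is target:
        return list(path)
    for i, child in enumerate(root.get("content", []), start=1):
        found = address_of(child, target, path + (i,))
        if found is not None:
            return found
    return None


recorded = address_of(grove, grove["content"][0]["content"][0])
print(recorded, resolve(grove, recorded)["chars"])
```

An address recorded by one application resolves to the same
node in another only because both count the same nodes, which
is why every counted node has to be part of the agreed model.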

> Don't get me wrong, I think the work on
> information sets, property sets and groves is
> terrific and needs to be continued. One way to
> do this is to turn our heads sideways every so
> often to see what colleagues in the distributed
> object world are doing. These problems are
> universal.

Very true.

But information interchange is a funny thing.  XML
does not proceed from the study of computer
programming.  It comes from another direction, and
it's a different problem space.
Portable-software-ology is a specialized subdomain
of, and not the same thing as,
portable-information-ology.

(I sure wouldn't want to try to support portable
information without portable software, though!)

At the risk of confusing the reader, let me add
that the property set syntax is just one syntax
for doing what property sets do, albeit the ISO
standard one for doing it.  The claim has been
made by Eliot Kimber that the STEP schema
language, EXPRESS, would do as well or better.  I
think he's probably right.  EXPRESS, however, is a
more powerful language that is more demanding to
learn.  By contrast, the property set syntax is
defined as an SGML or XML DTD, and a small and
simple one at that.


Steven R. Newcomb, President, TechnoTeacher, Inc.
srn at

voice: +1 972 231 4098 (at ISOGEN: +1 214 953 0004 x137)
fax    +1 972 994 0087 (at ISOGEN: +1 214 953 3152)

3615 Tanner Lane
Richardson, Texas 75082-2618 USA
