Storing Lots of Fiddly Bits (was Re: What is XML for?)

Sat Jan 30 00:54:02 GMT 1999

At 02:17 PM 1/29/99 -0800, Tim Bray wrote:
>At 02:49 PM 1/29/99 -0600, Paul Prescod wrote:
>>The data structures observed in XML are "annotated tree with second-class
>>links." This can be used to model "annotated directed graph" and even just
>>"annotated graph" if you pretend that the links are first-class.
>>"Annotated graphs" are the basic structures used by object databases. So
>>you seem to be saying that it would be really nice if there were
>>high-performance object databases. 

[...]

>What was really worrying me was what I thought was an assertion that 
>a repository that directly models XML document structures on a large
>scale wasn't interesting; I think it is. -T.

But aren't XML document structures just one instance of a more general
class of data that is composed of lots of little fiddly bits organized into
complex hierarchies and graphs? Other instances include vector graphics,
descriptions of power plants, models of human enterprises, etc. 

Doesn't this suggest that, rather than using XML's abstract data model as
the base for modeling other kinds of data, we should develop a more general
data model, develop supporting technologies around it, and then apply that
to XML? If the result can handle XML at scale, it should be able to handle
the rest.  Thus, I agree with Tim, but probably for a different reason
(notwithstanding that my main job is explicitly to help people handle
documents, and not powerplants).

In other words, there seems to be a bit of poor reasoning at work in a lot
of quarters that goes like this [not that I'm accusing Tim of this--he just
provided a convenient segue for the following rant]:

1. I have data that doesn't fit into a relational database
2. XML lets me represent this data using an easy-to-see and
easy-to-create-and-use data format.
3. I can use "XML tools" to manage this data and it will be cheap and
effective.

Unfortunately, the jump from 2 to 3 is not justified. That's because at
point 2 you are working in the *syntactic domain*. XML works very well for
serializing complex data structures because its hierarchy provides rich
organizational facilities and its robust definition helps ensure
transmission fidelity. 

But the XML data model, that is, the abstract model for the
*serialization*, is not the same as the abstract model of the data being
serialized.  That is, the representation is not the thing.  Thus, when you
move from the syntactic domain back to the abstract domain, the abstraction
you get is not the abstraction you started with--it's the abstraction of an
XML document serialization of the abstraction you started with.  There's
another step you have to perform before you get back your original
abstraction, which is to translate the serialization back into the original
abstraction.

For example, I start with the following abstract model (using EXPRESS syntax
just because I happen to know it and it doesn't get out much, so why
not--also I've been working on the XML serialization grammar for EXPRESS
and EXPRESS-driven data, so it's fresh in my mind):

TYPE gender = ENUMERATION OF
  (male,
   female,
   unknown);
END_TYPE;

ENTITY person SUBTYPE OF (being);
  name : STRING;
  sex  : gender;
  employer : OPTIONAL enterprise;
END_ENTITY;

ENTITY enterprise;
  name : STRING;
  address : STRING;
END_ENTITY;

Now I create some instance data (using lisp syntax to represent the
in-memory abstractions):

(person
  (oid 1)
  (name "Eliot")
  (sex male)
  (employer (oid-ref 2)))
(enterprise
  (oid 2)
  (name "ISOGEN International Corp.")
  (address "Dallas, TX")
  (derived::employs (oid-ref 1)))

Here is one of an infinite number of possible XML serializations:

<?xml version="1.0"?>
<data-serialization>
<schema-ref>business objects schema</schema-ref>
<data-instances>
<entity-instance id="i0000">
 <types>
  <type>person</type>
  <attributes>
   <attribute>
    <attname>name</attname>
    <attvalue>Eliot</attvalue>
   </attribute>
   <attribute>
    <attname>sex</attname>
    <attvalue>male</attvalue>
   </attribute>
   <attribute>
    <attname>employer</attname>
    <attvalue><entity-ref>i0001</entity-ref></attvalue>
   </attribute>
  </attributes>
 </types>
</entity-instance>
<entity-instance id="i0001">
 <types>
  <type>enterprise</type>
  <attributes>
   <attribute>
    <attname>name</attname>
    <attvalue>ISOGEN International Corp.</attvalue>
   </attribute>
   <attribute>
    <attname>address</attname>
    <attvalue>Dallas, TX</attvalue>
   </attribute>
  </attributes>
 </types>
</entity-instance>
</data-instances>
</data-serialization>
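
As a rough illustration of the forward mapping (not the only one, and not
any standard one), here is a minimal Python sketch that writes instance
data into the element structure used above.  The dictionary layout, the
("ref", oid) convention, the serialize() name, and the "i0000"-style ID
scheme are all assumptions made for this sketch:

```python
import xml.etree.ElementTree as ET

def serialize(instances):
    """Serialize {oid: (type_name, {attr: value})} into the XML form above.

    A value written as ("ref", oid) stands for a reference to another
    instance (an assumption of this sketch).
    """
    root = ET.Element("data-serialization")
    ET.SubElement(root, "schema-ref").text = "business objects schema"
    data = ET.SubElement(root, "data-instances")
    # Hypothetical ID scheme: the first oid becomes "i0000", the next "i0001".
    xml_id = {oid: f"i{n:04d}" for n, oid in enumerate(sorted(instances))}
    for oid in sorted(instances):
        type_name, attrs = instances[oid]
        inst = ET.SubElement(data, "entity-instance", id=xml_id[oid])
        types = ET.SubElement(inst, "types")
        ET.SubElement(types, "type").text = type_name
        attributes = ET.SubElement(types, "attributes")
        for name, value in attrs.items():
            att = ET.SubElement(attributes, "attribute")
            ET.SubElement(att, "attname").text = name
            attvalue = ET.SubElement(att, "attvalue")
            if isinstance(value, tuple) and value[0] == "ref":
                # Object references become <entity-ref> elements.
                ET.SubElement(attvalue, "entity-ref").text = xml_id[value[1]]
            else:
                attvalue.text = str(value)
    return root

instances = {
    1: ("person", {"name": "Eliot", "sex": "male", "employer": ("ref", 2)}),
    2: ("enterprise", {"name": "ISOGEN International Corp.",
                       "address": "Dallas, TX"}),
}
document = serialize(instances)
xml_text = ET.tostring(document, encoding="unicode")
```

Note that everything the schema knows (that "sex" is an enumeration, that
"employer" is optional) is flattened into generic attname/attvalue pairs;
the serialization carries the data but not the model.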

If you now parse this document into an abstraction conforming to the DOM,
SGML Property Set, or similar rational abstract data model for XML
documents, you'll get something like this:

(xml-document
  (prolog
    (pi xml version="1.0")
    (doctype-decl))
  (document-element
    (gi data-serialization)
    (content
      (element 
        (gi schema-ref)
        (content
          (literal "business objects schema")))
      (element
        (gi data-instances)
        (content
           ...)))))
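
The same late-bound view can be reproduced with any generic XML parser.
The following Python sketch uses the standard xml.etree.ElementTree module;
the late_bound() helper and its tuple layout are assumptions of mine, chosen
only to mirror the notation above (and it ignores text that follows child
elements, which is good enough here):

```python
import xml.etree.ElementTree as ET

def late_bound(element):
    """Return a generic (element (gi ...) (content ...)) view of one element."""
    content = []
    if element.text and element.text.strip():
        content.append(("literal", element.text.strip()))
    for child in element:
        content.append(late_bound(child))
    return ("element", ("gi", element.tag), ("content", content))

# A cut-down version of the serialization, just enough to show the shape.
sample = ET.fromstring(
    "<data-serialization>"
    "<schema-ref>business objects schema</schema-ref>"
    "<data-instances/>"
    "</data-serialization>"
)
view = late_bound(sample)
```

Nothing in the resulting structure says "person" or "enterprise"; every
node is just an element with a generic identifier and content.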

You get the idea--clearly the in-memory abstraction of the document bears
no direct relationship to the in-memory abstraction of the original data.
Even if you do an early-bound abstraction where you take the element types
as node types, you still get something that is not the abstraction:

(xml-document
  (prolog
    (pi xml version="1.0")
    (doctype-decl))
  (data-serialization
    (schema-ref
      (literal "business objects schema"))
    (data-instances
      (entity-instance
        (types "person")
        (attributes
          (attribute
            (attname "name")
            (attvalue "Eliot"))
     ....))))

You get the idea. 

Even in this early-bound form, the abstraction is still reflecting the
structure of the serialization, not the original abstraction.  To get the
original abstraction back, I have to apply the reverse of the original
serialization algorithm.  I might do this literally or I might do it by
providing a set of query functions over my document that does it (e.g.,
translates the query "select person where name is 'Eliot'" to a more
complex XML-specific query defined in terms of the semantics and structure
of the serialization structure).  Either way, the mapping has to be defined
and implemented.  Whether doing it literally (that is, importing the
database back into some "non-XML" repository) or doing it virtually on top
of an "XML repository" is an implementation/optimization choice.

Thus, even saying "XML means the data abstraction you get from XML syntax"
isn't very helpful, because the resulting abstraction isn't really what
you want.

However, the characteristics of the XML in-memory abstraction *as a class
of data* are very much similar to the characteristics of other
abstractions. For example, the abstract data objects that describe a power
plant are very much like the abstract data objects that describe a document:

- There are a lot of them (every pipe, valve, pump, joint, etc., represents
at least one node, with many relationships to other nodes)
- Each node has lots of properties (position, identifier, operating
characteristics, geometry, status, age, etc.)
- The nodes exist in both a hierarchy reflecting their physical structure
(plant-unit-assembly-subassembly-part) and a graph representing their
connected nature to other parts (valve one must be closed before valve two
can be opened)
- They are equally static and dynamic, that is, a large part of the data
never changes and a large part of it is constantly changing.
- I want to ask a lot of questions about the data and I can't predict what
sort of questions I might want to ask
- If something is wrong in the data, bad things may happen
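
A toy Python sketch of this class of data, with nodes that carry
properties and sit in both a containment hierarchy and a graph of typed
links, might look like the following.  The Node class, the
"requires-closed" relation name, and the sample plant are all invented for
illustration; a real store would of course have to do this for millions of
nodes:

```python
from dataclasses import dataclass, field

@dataclass(eq=False)
class Node:
    ident: str
    props: dict = field(default_factory=dict)
    parent: "Node | None" = field(default=None, repr=False)
    children: list = field(default_factory=list)   # containment hierarchy
    links: list = field(default_factory=list)      # (relation, target) pairs

    def add(self, child):
        """Attach child beneath this node in the hierarchy."""
        child.parent = self
        self.children.append(child)
        return child

    def walk(self):
        """Yield this node and all of its descendants, depth first."""
        yield self
        for child in self.children:
            yield from child.walk()

plant = Node("plant")
unit = plant.add(Node("unit-1"))
valve1 = unit.add(Node("valve-1", {"status": "open"}))
valve2 = unit.add(Node("valve-2", {"status": "closed"}))
# Graph link: valve one must be closed before valve two can be opened.
valve2.links.append(("requires-closed", valve1))

# An ad hoc question nobody planned for: which parts must not be opened now?
blocked = [node.ident
           for node in plant.walk()
           for relation, target in node.links
           if relation == "requires-closed"
           and target.props.get("status") != "closed"]
```

Replace "plant", "unit", and "valve" with "document", "chapter", and
"paragraph", and the links with cross-references, and the same structure
describes a document.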

This suggests that the technology that can handle documents at large scale
can also handle powerplants at large scale (or ships or airplanes or
buildings or electronic components or enterprises or governments or ...).

This, I think, leads to an excitement about XML and its application to
managing large data stores because it provides an easy-to-understand entry
into the problem space and an easy-to-get-started place to start stressing
and testing the technology. This is all good, but we have to be careful not
to lose sight of the fact that the goal shouldn't be to shoe-horn all
complex structures into XML's abstract data model; it should be to develop
data management technologies that will handle documents well, because if we
do, they will also handle powerplants and airplanes well.  And the reverse
is true as well--if I have a database that can handle a powerplant or an
aircraft, chances are it will handle documents at scale too.

Near as I can tell from my work in the STEP world and in the document
world, the technology to manage data of this sort at the scales we need
simply doesn't yet exist. I don't know if this is a hardware problem or a
science problem, but I suspect it's a bit of both.  I suspect that the
solution requires an entirely new way of thinking about storing little
fiddly bits of data that is neither relational nor object nor
object-relational, but something else entirely (or at least different
enough to count as something else).

Cheers,

E.
--
<Address HyTime=bibloc>
W. Eliot Kimber, Senior Consulting SGML Engineer
ISOGEN International Corp.
2200 N. Lamar St., Suite 230, Dallas, TX 75202.  214.953.0004
www.isogen.com
</Address>
