Storing Lots of Fiddly Bits (was Re: What is XML for?)

W. Eliot Kimber eliot at
Sun Jan 31 20:32:38 GMT 1999

At 06:45 PM 1/31/99 -0000, Mark Birbeck wrote:
>I know this thread has progressed but one of the original points has not
>been addressed so I'd like to re-raise it.
>All this is true, but I wonder if you are not comparing like with like.
>The model for your data should be more like:
><person oid="1">
>  <name>Eliot</name>
>  <sex>male</sex>
>  <employer oid-ref="2" />
></person>
><enterprise oid="2">
>  <name>ISOGEN International Corp</name>
>  <address>Dallas, TX</address>
>  <derived obj="employs" oid-ref="1" />
></enterprise>
>Although you are right to say there are 'an infinite number' of possible
>mappings, if you use a representation that relies on a completely
>different DTD to represent the data, you shouldn't then be surprised if
>your XML no longer mirrors that original data. Nothing says my mapping
>is right, but at least it is based on the same schema that you
>introduced at the beginning of your message, namely:

But you've not solved my problem, because the in-memory abstraction of the
*document* is still:

      (gi "person")
          (gi "name")
            (literal "Eliot")))))))

So, while the result is closer to the abstraction of the data, it is still
not the original abstraction. And note that my example is trivial, so the
mapping from one possible DTD for the serialization document to the
abstract schema is both obvious and possible, but in a real-world example,
neither is likely to be (for example, there's no way to represent the
results of having class hierarchies using a DTD alone--you have to have
specialized markup constructs to bind instance attributes to the class or
superclass that defines them).
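
To make the point about the in-memory abstraction concrete, here is a
minimal Python sketch using the standard xml.etree.ElementTree module. The
<data> wrapper element and the closing tags are assumptions added so that
the quoted example parses as a single well-formed document. It shows that
what the parser hands back is a tree of elements, and that the employer is
only an empty element carrying a reference string.

```python
import xml.etree.ElementTree as ET

# Early-bound document based on the quoted example. The <data> wrapper is an
# assumption added so the document has a single root element.
DOC = """
<data>
  <person oid="1">
    <name>Eliot</name>
    <sex>male</sex>
    <employer oid-ref="2" />
  </person>
  <enterprise oid="2">
    <name>ISOGEN International Corp</name>
    <address>Dallas, TX</address>
    <derived obj="employs" oid-ref="1" />
  </enterprise>
</data>
"""

root = ET.fromstring(DOC)
person = root.find("person")

# The parsed result is a tree of elements, not a person object.
node_kinds = [child.tag for child in person]

# The employer is an empty element holding a reference, not an enterprise.
employer = person.find("employer")
employer_is_empty = employer.text is None and len(employer) == 0
employer_ref = employer.get("oid-ref")

print(node_kinds, employer_is_empty, employer_ref)
```

Getting from employer_ref to the enterprise itself takes extra code that
knows what oid and oid-ref mean in this particular DTD.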

And note that even for an early-bound form, there are still infinitely many
ways to construct it, and at least three reasonable ways
(attribute-primary, content-primary, mix of attributes and content).
Another tricky design choice: how is redundancy represented--use by
reference? data copying? Do you use your own addressing scheme or use XLink
(or HyTime or ID/IDREF or ...)?  There are too many design choices when
defining an early-bound serialization to be able to make general
predictions about any given such format or to impose general conventions or

So no matter how you slice it, there will always be a disconnect between the
abstraction of the serialization form and the abstraction of the data
objects being serialized, which means that a query onto the abstraction of
the serialization will not be the same as a query onto the abstraction of
the data that has been serialized. The gap might be bigger or smaller, but
there will always be a gap.  

Of course, for any given binding of abstract schema to serialization, one
can define query functions that implement the de-serialization algorithm,
but these bindings must be defined on a per-schema or per-schema-mechanism
basis, so there's no potential generalization or standardization benefit
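
A per-schema de-serialization function of the kind described above might
look like the following Python sketch. The Person and Enterprise classes,
the <data> wrapper element, and the deserialize function are all
hypothetical; the code only works for this one binding, which is the point.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass
from typing import Optional


@dataclass
class Enterprise:
    name: str
    address: str


@dataclass
class Person:
    name: str
    sex: str
    employer: Optional[Enterprise] = None


# The <data> wrapper is an assumption so the document has a single root.
DOC = """
<data>
  <person oid="1">
    <name>Eliot</name>
    <sex>male</sex>
    <employer oid-ref="2" />
  </person>
  <enterprise oid="2">
    <name>ISOGEN International Corp</name>
    <address>Dallas, TX</address>
    <derived obj="employs" oid-ref="1" />
  </enterprise>
</data>
"""


def deserialize(xml_text):
    """Rebuild data objects from this one specific early-bound binding."""
    root = ET.fromstring(xml_text)
    enterprises = {
        e.get("oid"): Enterprise(e.findtext("name"), e.findtext("address"))
        for e in root.findall("enterprise")
    }
    people = {}
    for p in root.findall("person"):
        ref = p.find("employer")
        # Resolve the oid-ref string into an actual object reference.
        employer = enterprises.get(ref.get("oid-ref")) if ref is not None else None
        people[p.get("oid")] = Person(p.findtext("name"), p.findtext("sex"), employer)
    # The <derived> elements are ignored in this sketch.
    return people


people = deserialize(DOC)
eliot = people["1"]
print(eliot.employer.name)
```

Only after this step can a query be phrased in terms of the data
(eliot.employer.name) rather than in terms of elements and attributes.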

Which raises the question: if the abstraction of the document is not the
abstraction of the data, why bother to create and store the abstraction of
the document when you can just as easily create and store the abstraction
of the data?

Note that for the EXPRESS language (part of the STEP family of standards,
ISO 10303, see <>), we are in the process of defining
an XML-based serialization format, which will include an algorithm for
going from any EXPRESS schema and its data instances to their late-bound
serialization and back again. Given that, you can then, of course, write
queries in XML-specific systems (DSSSL, XSL, XQL, GroveMinder, DOM, etc.)
that will implement the deserializations and treat the documents as though
they were the original data instances. This will be useful, because you'll
be able to turn existing (and largely free) document-processing tools into
EXPRESS-based data access tools, but that is simply a side effect of using
XML as the serialization format--it is not our primary motivation for using
XML. In a production environment, you would not normally introduce an
additional layer of indirection between the request for data and the data
itself when that layer adds no additional value over saving a few Euros on
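
To show what a late-bound form looks like in general terms, here is a
Python sketch of a schema-independent round trip. It is not the EXPRESS
serialization under development; the element names (data, instance,
attribute) and both function names are assumptions chosen for illustration.
The markup names stay fixed, and the schema's own names appear only as
attribute values.

```python
import xml.etree.ElementTree as ET


def to_late_bound(instances):
    """Serialize {id: (entity_type, {attr: value})} with fixed element names."""
    root = ET.Element("data")
    for oid, (entity_type, attrs) in instances.items():
        inst = ET.SubElement(root, "instance", {"id": oid, "type": entity_type})
        for name, value in attrs.items():
            ET.SubElement(inst, "attribute", {"name": name}).text = value
    return ET.tostring(root, encoding="unicode")


def from_late_bound(xml_text):
    """Invert to_late_bound without knowing anything about the schema."""
    instances = {}
    for inst in ET.fromstring(xml_text).findall("instance"):
        attrs = {a.get("name"): a.text for a in inst.findall("attribute")}
        instances[inst.get("id")] = (inst.get("type"), attrs)
    return instances


instances = {
    "1": ("person", {"name": "Eliot", "sex": "male"}),
    "2": ("enterprise", {"name": "ISOGEN International Corp",
                         "address": "Dallas, TX"}),
}

xml_text = to_late_bound(instances)
round_tripped = from_late_bound(xml_text)
print(round_tripped == instances)
```

Because neither function mentions person or enterprise, the same pair
works unchanged for instances of any other schema expressed in this shape.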

Our primary motivation for using XML is that, as a serialization syntax, it
does a better job of enabling reliable interchange than the current
serialization format for EXPRESS-driven data.

Note that I'm making a distinction between the motivation for standardized,
industrial-scale solutions and one-off, small-scale solutions. In the
latter case, using document-processing tools to do database-like things off
the serialization format can be a big win; believe me, I'm depending on
that myself. But for large-system implementation, it would not be the right
thing to do.

One of the things we quickly realized as we thought through our design
principles, requirements, and goals was that we *cannot* and *should not*
define the standard *early bound* form of the serialization, because there
are simply too many useful ways to do it.  Rather, we will provide a
general mechanism for mapping from early-bound serialization syntaxes to
the standardized late-bound syntax (we will almost certainly use SGML
architectures for this--name spaces do not help in this case, because the
whole point is to let the designer of the early-bound syntax define their
own element and attribute names).
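
The idea of letting the designer pick the names and then mapping them back
to the schema can be sketched in Python as below. This is a plain
dictionary lookup standing in for the mapping mechanism, not SGML
architectural-form processing; the element name staff-member, its attribute
names, and the function name are all hypothetical.

```python
import xml.etree.ElementTree as ET

# Designer-chosen early-bound names (hypothetical) mapped to schema names:
# element name -> (entity type, {early-bound attribute: schema attribute}).
MAPPING = {
    "staff-member": ("person", {"full-name": "name", "gender": "sex"}),
}

EARLY = '<staff-member full-name="Eliot" gender="male" />'


def to_schema_terms(early_xml, mapping):
    """Translate one early-bound element into (entity_type, attributes)."""
    elem = ET.fromstring(early_xml)
    entity_type, attr_names = mapping[elem.tag]
    attrs = {attr_names[key]: value for key, value in elem.attrib.items()}
    return entity_type, attrs


entity_type, attrs = to_schema_terms(EARLY, MAPPING)
print(entity_type, attrs)
```

The early-bound document never has to use the schema's own names; only the
mapping does.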


<Address HyTime=bibloc>
W. Eliot Kimber, Senior Consulting SGML Engineer
ISOGEN International Corp.
2200 N. Lamar St., Suite 230, Dallas, TX 75202.  214.953.0004
