Serializations and data structures (was Re: Topic Maps on SQL)

Wed Nov 25 04:19:21 GMT 1998

Tim Bray wrote:
> 
> You know, this goes straight to the core of a deep issue.  Where
> I have often felt out of sync with the grove/property-set evangelists
> is that I perceive syntax as fundamental and any particular
> data model as ephemeral.  

On the issue of what works and is interoperable in the real world, you are
quite likely right, but on this "chicken and egg" issue of serialization
vs. data model you are not. The serialization only exists to provide for
the longevity of the data. Thus the data model is fundamental and the
serialization ephemeral.

Your argument is that a sentence is more fundamental than an idea, because
the sentence is easier to transmit, record, replay and otherwise
manipulate. But *by definition* ideas are more fundamental than sentences,
because there can only be a sentence after there is an idea, and in the
XML-world, the idea can be reconstructed from the sentence losslessly and
thus lives *at least as long as* the sentence.

Consider the following cases: 

Document A must be published three times. It is encoded now in SGML with
full minimizations. It is sent to the publisher and prints beautifully.
Years pass. RCS SGML is superseded in the organization by XML. The 
document's syntax is changed radically by running it through "sx." But 
the person in charge of the conversion is careful to make sure that the 
grove does not change. They run the print job again: it will print 
beautifully -- and identically -- as long as the formatter is grove 
driven. 10 more years pass. XML fades into oblivion and is replaced by 
the more compact Lisp S-Expression notation (yes, Lisp has finally caught 
on). But the S-Expression notation is designed to be losslessly 
compatible with the XML grove, so the software runs off of the grove 
instead of the serialization syntax. The document will print identically 
*again*.
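
A minimal sketch of the grove-driven case, in Python. Everything here is
an assumption made for illustration: format_document is a hypothetical
formatter, the doc/title/para vocabulary is invented, and two different
XML spellings of one document stand in for the SGML-to-XML change. The
formatter only ever sees the parsed tree, so the surface syntax can vary
without changing the output.

```python
import xml.etree.ElementTree as ET

def format_document(root):
    """Hypothetical formatter: renders from the parsed tree, never from raw text."""
    lines = []
    for child in root:
        text = (child.text or "").strip()
        if child.tag == "title":
            lines.append(text.upper())
            lines.append("=" * len(text))
        elif child.tag == "para":
            lines.append(text)
    return "\n".join(lines)

# Two serializations of the same document: quoting style, whitespace
# inside tags and whitespace between elements all differ.
SPELLING_ONE = '<doc><title>Report</title><para align="left">First paragraph.</para></doc>'
SPELLING_TWO = "<doc ><title >Report</title >\n<para align='left' >First paragraph.</para ></doc >"

output_one = format_document(ET.fromstring(SPELLING_ONE))
output_two = format_document(ET.fromstring(SPELLING_TWO))
print(output_one == output_two)
```

A formatter that matched on the literal characters of SPELLING_ONE would
have to be revised for SPELLING_TWO; this one does not notice the change.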

Now what would happen if the formatter were driven directly from the
syntax in each of these cases? It would have broken. Now the usual
syntax-oriented way of handling this problem would be to do a 
transformation from S-Expressions to XML and then to RCS SGML so that
the syntax-driven software can find it. But how do you actually *do*
this translation?

Well, how do you translate from English into French? You take the 
English serialization for the idea, "parse" it back into an idea and
choose a French serialization that preserves the meaning. Meaning 
is king.

Similarly, when we do text transformations, we do something like this:

<tag> -> Element Object -> S-Expr Object -> (S-Expr serialization)

You might avoid creating literal objects, but you can't avoid 
understanding the data in terms of its data model if you don't
want the translation to be lossy. The serializations are 
ephemeral, arbitrary encodings for the underlying data.
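
The pipeline above can be sketched in Python as follows. This is an
illustration under stated assumptions, not a definition of any standard
S-expression binding for XML: the (tag (attributes) children...) layout,
the function names, and the use of JSON string quoting for text are all
invented here, and namespaces, comments and processing instructions are
ignored. The point is only that both directions pass through the same
in-memory tree.

```python
import json
import re
import xml.etree.ElementTree as ET

def xml_to_tree(xml_text):
    """Parse XML into nested tuples: (tag, attributes, children)."""
    def build(elem):
        children = []
        if elem.text:
            children.append(elem.text)
        for child in elem:
            children.append(build(child))
            if child.tail:
                children.append(child.tail)
        return (elem.tag, tuple(sorted(elem.attrib.items())), tuple(children))
    return build(ET.fromstring(xml_text))

def to_sexpr(node):
    """Serialize the tree as an S-expression (assumed, ad hoc layout)."""
    if isinstance(node, str):
        return json.dumps(node)  # quoted, escaped string literal
    tag, attrs, children = node
    attr_part = "(" + " ".join(f"({k} {json.dumps(v)})" for k, v in attrs) + ")"
    return "(" + " ".join([tag, attr_part] + [to_sexpr(c) for c in children]) + ")"

TOKEN = re.compile(r'\(|\)|"(?:[^"\\]|\\.)*"|[^\s()"]+')

def from_sexpr(text):
    """Parse the S-expression back into the same nested tuples."""
    tokens = TOKEN.findall(text)
    pos = 0

    def read():
        nonlocal pos
        tok = tokens[pos]
        pos += 1
        if tok.startswith('"'):
            return json.loads(tok)      # text node
        assert tok == "("
        tag = tokens[pos]
        pos += 1
        assert tokens[pos] == "("       # opens the attribute list
        pos += 1
        attrs = []
        while tokens[pos] != ")":
            assert tokens[pos] == "(" and tokens[pos + 3] == ")"
            attrs.append((tokens[pos + 1], json.loads(tokens[pos + 2])))
            pos += 4
        pos += 1                        # closes the attribute list
        children = []
        while tokens[pos] != ")":
            children.append(read())
        pos += 1                        # closes the element
        return (tag, tuple(attrs), tuple(children))

    return read()

xml_text = '<doc id="1"><title>Hello</title><p>Some <em>rich</em> text</p></doc>'
tree = xml_to_tree(xml_text)
sexpr = to_sexpr(tree)
print(sexpr)
print(from_sexpr(sexpr) == tree)
```

Neither to_sexpr nor from_sexpr knows anything about angle brackets, and
xml_to_tree knows nothing about parentheses. The only thing the two
notations share is the tree.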

> The reason is that I know how to interoperate robustly and efficiently
> based on syntax; but in my 20 years of experience in software I have seen
> very little successful interoperation based on data structures or APIs;
> and usually at high cost and irritating fragility.

I think that what you are really saying is that it is easier to work with
a well-defined serialization (e.g. XML) than a poorly defined API (e.g.
ODBC). Furthermore, XML makes it easy to define serializations (with DTDs)
but there isn't anything as simple but powerful for defining data models.

> This was what
> originally drew me to SGML.  Also I have trouble believing that any
> one data model will work well across the infinite breadth of application
> types and computing infrastructures.  Even one as sophisticated as groves.

There are a few points to make here:

 * the property set paradigm is best described as a meta data-model.
Different applications work with completely different groves constructed
from the same XML document(s). For instance, one could work with the
syntactic grove (e.g. an XML editor). Another would work with the abstract
grove (e.g. a browser) and another would work with the HyTime semantic
grove (e.g. a browser). Yet another would work with a system-specific
grove such as the RDF data model.

 * a complete XML syntactic grove has all of the information in an XML
document. If an XML document provides enough information for you to
accomplish what you need to accomplish, then the grove does too. Of course
building a syntactic grove to pretend it is a character serialization is a
little silly, but the point is that a complete grove is really
complete.

 * the grove is not the data model to end all data models. It is merely
the foundational model for data from the SGML family of standards. It
makes perfect sense to abstract over that and build Python lists or Java
iterators or a whole other semantic level. If you encode a programming
language in XML, then of course you want a data structure that represents
functions, parameters and classes directly. That may or may not be best
described as a grove.
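
As a sketch of the last point, here is what abstracting over the element
tree might look like in Python. The module/function/param vocabulary and
the Function and Parameter classes are invented for this example, and
ElementTree is used as a stand-in for a grove, which it is not. The
application code ends up working with functions and parameters and never
touches elements or attributes.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass
from typing import List

@dataclass
class Parameter:
    name: str
    type: str

@dataclass
class Function:
    name: str
    parameters: List[Parameter]

def functions_from_xml(xml_text):
    """Build domain objects from a hypothetical XML encoding of a program."""
    root = ET.fromstring(xml_text)
    return [
        Function(
            name=f.get("name"),
            parameters=[
                Parameter(p.get("name"), p.get("type"))
                for p in f.findall("param")
            ],
        )
        for f in root.findall("function")
    ]

SOURCE = """<module>
  <function name="add">
    <param name="x" type="int"/>
    <param name="y" type="int"/>
  </function>
  <function name="greet">
    <param name="who" type="str"/>
  </function>
</module>"""

functions = functions_from_xml(SOURCE)
for fn in functions:
    signature = ", ".join(f"{p.name}: {p.type}" for p in fn.parameters)
    print(f"{fn.name}({signature})")
```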

--

Let me ground my discussion in reality: there are strong sociological and
historical reasons why character-based linearizations are more 
interoperable than binary linearizations. And of course, any transmission
of data between computers will require linearization. So syntax is not
unimportant: but it is a) transient and b) secondary. XML's brilliance
derives from the recognition that something can be transient and
secondary but still very, very important.

 Paul Prescod  - http://itrc.uwaterloo.ca/~papresco

The United Nations Declaration of Human Rights will be 50 years old on
December 10, 1998. These are your fundamental rights:
http://www.udhr.org/history/default.htm

xml-dev: A list for W3C XML Developers. To post, mailto:xml-dev at ic.ac.uk
Archived as: http://www.lists.ic.ac.uk/hypermail/xml-dev/