Integrity in the Hands of the Client

Sun Nov 23 01:23:53 GMT 1997

	From: Joe Lapp <jlapp at acm.org>

	In this posting I'm going to be a little bold and propose that both
	the XML and DOM specifications are flawed.  The existence of these
	flaws ride on the assumption that we care to use SGML/XML to create
	domain models for data where the data evolves over time.  I'm also
	assuming that it is unacceptable for the client objects of a document
	to maintain the integrity of the document.

I've not been following this thread closely, so I apologize if I get
something wrong. I'll stop first, too, to note that when
interconverting data formats we can rarely represent every validity
constraint in the new format. If I dump a DB record to tabbed files
I lose referential (and all other) integrity checks, but I may have
much better luck moving to a competing vendor's system.

When using XML, we may reasonably expect that the richer formalism
will give us more control (and for hierarchical data, that
expectation is well, if not perfectly, met). We may also expect that
other properties can be preserved (e.g. IDREFs eliminate broken
pointers, but don't allow typed references), but some probably won't be.

	We need to design the DTD for this document.  Here is our first pass:

	<!DOCTYPE catalog [
	<!ELEMENT catalog (books, authors)>
	<!ELEMENT books (book*)>
	<!ELEMENT authors (author*)>
	<!ELEMENT book (summary)>
	<!ATTLIST book
	title CDATA #REQUIRED
	author IDREF #REQUIRED>
	<!ELEMENT author (bio)>
	<!ATTLIST author
	id ID #REQUIRED
	name CDATA #REQUIRED>
	<!ELEMENT summary (#PCDATA)>
	<!ELEMENT bio (#PCDATA)>
	]>

	To get a better feel for what we've designed, we create a little sample
	document:

	<catalog>
	<books>
	    <book title="The Postman" author="A1">
	        <summary>Text goes here.</summary></book>
	    <book title="Startide Rising" author="A1">
	        <summary>Text goes here.</summary></book>
	    <book title="Hitchhiker's Guide to the Galaxy" author="A2">
	        <summary>Text goes here.</summary></book>
	</books>
	<authors>
	    <author id="A1" name="David Brin"><bio>Text goes here.</bio></author>
	    <author id="A2" name="Douglas Adams"><bio>Text goes here.</bio></author>
	</authors>
	</catalog>

	This seems to work.  It stores information about books and authors,
	and it is not possible to add a book without associating it with
	the description of some author.  But we can see that it breaks as
	soon as we add any other kind of element that has an ID.  We know
	that every book will eventually have an ID, because we'll soon want
	to have an element whose content elements reference the New York
	Times Bestsellers.  Once we do that, nothing prevents an administrator
	(or the client program he or she is using) from indicating that the
	author of a book is another book.  This DTD will not suffice.

The problem with this is that it uses database style "joins" on ID
values. XML's most powerful constraints are tree constraints, based on
containment. For example the following structure does not have this
problem:

 <catalog>
   <authors>
     <author id="A1"><name>David Brin</name>
       <bio>whatever</bio>
       <books>
         <book><title>The Postman</title>
           <summary> whatever </summary></book>
         <!-- other books go here. If we have more than one author: -->
         <book coauthors="A2 A3"> ...etc </book>
       </books>
     </author>
   </authors>
 </catalog>

Note that you do have to pick a "by author" or "by book" hierarchy to
use this technique. I also moved the title and the author's name into
elements: titles frequently contain markup, and names can be complex enough that
it's often a good idea to be prepared for the eventual need for
markup. Consider Chinese names where the order of family and personal
names is different than it is in most European cultures.

	It seems that we might have to use links, but lets look at other
	approaches first.  We entertain the idea that an author's books
	belong to the content of the author.  We quickly throw that one out
	when we realize that a book can have more than one author.

Or take an alternative approach (as I sketched above).

	I have not been able to find a way to have the document server force
	clients to ensure that whenever they add a book, that book is
	associated with some author.  Clients are given the responsibility
	of maintaining the integrity of the document.

No. Servers that want to impose non-XML integrity constraints (such as
you are demanding) must impose those constraints themselves. XML, like
traditional databases (which seem to be your starting point),
represents some things well, and some things very badly. Attempting to
create relational schemas for XML documents produces the same kind of
hairy, unnatural specifications, and requires similar extra integrity
checks on update, to represent typical document information.
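
A minimal sketch of what such a server-side check might look like for
the flat catalog above, written in Python with the standard
xml.etree.ElementTree module. The id attribute on book and the helper
name bad_author_refs are my assumptions for illustration (the original
DTD gives books no ID yet); nothing here is part of XML or the DOM.

```python
import xml.etree.ElementTree as ET

# Sample data: the second book wrongly names another book as its author.
# The id attribute on <book> is an assumption, not part of the original DTD.
CATALOG = """
<catalog>
  <books>
    <book id="B1" title="The Postman" author="A1"><summary>Text goes here.</summary></book>
    <book id="B2" title="Startide Rising" author="B1"><summary>Text goes here.</summary></book>
  </books>
  <authors>
    <author id="A1" name="David Brin"><bio>Text goes here.</bio></author>
  </authors>
</catalog>
"""

def bad_author_refs(root):
    """Return titles of books whose author attribute does not name an <author>."""
    # Map every ID in the document to the kind of element that carries it.
    kind_by_id = {el.get("id"): el.tag for el in root.iter() if el.get("id")}
    return [book.get("title")
            for book in root.iter("book")
            if kind_by_id.get(book.get("author")) != "author"]

print(bad_author_refs(ET.fromstring(CATALOG)))  # ['Startide Rising']
```

A validating parser would accept this document, since B1 is a
perfectly good ID; only the application-level check notices that it
names the wrong kind of element.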

Basically, I think that the flaw of not providing what you ask for is
in fact no flaw, but an artifact of different tools being targeted to
different purposes. There is a difference: since XML is a data
format and _not_ a processing technology the way a database is, it may
be useful as a way to represent and transport data best _manipulated_
in non-XML ways. You get a rich language of structures for free by
using an XML parser, and that may save some time in writing data
transporters -- for instance, a DTD for the transport of complete RDB
table sets would be easy to write -- but checking those tables for
semantic correctness would not be one of the things you get for free.
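
As a rough illustration of that transport idea, here is a sketch in
Python (standard xml.etree.ElementTree only). The element names
tableset, table, row and field, and the function tables_to_xml, are
invented for this example; they are not from any existing DTD.

```python
import xml.etree.ElementTree as ET

def tables_to_xml(tables):
    """Serialize {table_name: [row_dict, ...]} into a <tableset> element."""
    tableset = ET.Element("tableset")
    for name, rows in tables.items():
        table = ET.SubElement(tableset, "table", name=name)
        for row in rows:
            row_el = ET.SubElement(table, "row")
            for column, value in row.items():
                field = ET.SubElement(row_el, "field", name=column)
                field.text = str(value)
    return tableset

tables = {
    "author": [{"id": "A1", "name": "David Brin"}],
    # Dangling foreign key: there is no author A9.
    "book": [{"title": "The Postman", "author_id": "A9"}],
}

xml_text = ET.tostring(tables_to_xml(tables), encoding="unicode")
reparsed = ET.fromstring(xml_text)  # parses fine despite the dangling key
```

The dangling author_id goes through serialization and parsing
untouched. The structure comes for free; the semantic check has to be
written separately.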

	I think the XML specification as it currently stands is extremely
	well-suited for describing data that does not change over time, but
	that it is lacking in specifying how documents are to evolve.

You overstate the case here. It's suited for describing how data
whose integrity constraints correspond to XML validity should evolve.
These constraints are not theoretically justified, but are
pragmatically justified by the fact that people can get useful
document management work done using them.

It is the same with relational databases: all those theorems
about normal forms and algebra merely show that the system is well
defined. The fact that tables are useful for many kinds of data is
still a pragmatic one, and not a theoretical one. The world is still
full of things that don't fit the relational model very well.

I know that our current data-manipulation savior is OO databases, but
once we have experience with them we'll grow to understand the ways in
which they fall short of perfection as well.

Nevertheless, future versions of XML might have small improvements
that will help cases like this. The provision of multiple ID spaces
(ability to have typed IDs and typed IDrefs) is one that has been
suggested a number of times. It would also be very useful in
documents, since (for example) only <figure> elements would have
"fignum" attributes, and so the user of a "figref" attribute would be
prevented from referring instead to a paragraph of random text.
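
Until something like that exists, the same effect can be approximated
outside the parser. The sketch below (Python, standard library only)
treats each reference attribute as resolving against one named ID
attribute. The mapping ID_SPACE_FOR_REF, the xref element and the
function name are assumptions made up for this example, not a
proposed syntax.

```python
import xml.etree.ElementTree as ET

# Hypothetical convention: each reference attribute resolves only against
# IDs declared through one specific ID attribute.
ID_SPACE_FOR_REF = {"figref": "fignum"}

DOC = """
<doc>
  <para id="P1">See <xref figref="F1"/> and <xref figref="P1"/>.</para>
  <figure fignum="F1"/>
</doc>
"""

def unresolved_refs(root, id_space_for_ref):
    """Return (ref_attr, value) pairs that have no target in their own ID space."""
    problems = []
    for ref_attr, id_attr in id_space_for_ref.items():
        # Collect every ID declared in this particular ID space.
        declared = {el.get(id_attr) for el in root.iter()
                    if el.get(id_attr) is not None}
        for el in root.iter():
            value = el.get(ref_attr)
            if value is not None and value not in declared:
                problems.append((ref_attr, value))
    return problems

print(unresolved_refs(ET.fromstring(DOC), ID_SPACE_FOR_REF))  # [('figref', 'P1')]
```

The reference to P1 would pass ordinary ID/IDREF validation, because
P1 is a real ID; it fails here only because it lives in the wrong ID
space.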

Small suggestions like this that also offer a lot of leverage may get
considered for XML 1.1. (Small in the sense that little syntax is
required to support it, and little processing beyond that already
required for ID/IDREF processing).

To my mind, such suggestions are compelling to the extent that they
are useful in _document_ management (as well as general data
management) because that really describes the primary focus of XML
design. XML may well be useful beyond that area, but I think it should
stay away from bidding on the "universal data format of the ages"
title, which may well be impossible ever to attain.

   -- David

------------------------------------------+----------------------------
David Durand                 dgd at cs.bu.edu| david at dynamicDiagrams.com
Boston University Computer Science        | Dynamic Diagrams
http://www.cs.bu.edu/students/grads/dgd/  | http://dynamicDiagrams.com/
                                          | MAPA: mapping for the WWW

xml-dev: A list for W3C XML Developers. To post, mailto:xml-dev at ic.ac.uk
Archived as: http://www.lists.ic.ac.uk/hypermail/xml-dev/