Integrity in the Hands of the Client
Joe Lapp
jlapp at acm.org
Fri Nov 21 15:56:17 GMT 1997
In this posting I'm going to be a little bold and propose that both
the XML and DOM specifications are flawed. The existence of these
flaws rides on the assumption that we care to use SGML/XML to create
domain models for data that evolves over time. I'm also
assuming that it is unacceptable for the client objects of a document
to maintain the integrity of the document.
In order for me to most convincingly convey the point, I need you to
bear with me as I explore an example of how we might use XML. I do
not directly suggest how to correct the XML specification, but I
think I end up implying a few different solutions. However, it seems
that the correction to DOM is a bit more straightforward, so I make
the obvious suggestions.
Suppose we want to create a document that contains information about
books and about the authors of those books, and suppose we require
that whenever the document has a book, it also has information about
the author of the book. The document will reside on a server, and
one or more administrators will populate the document from their
clients. Other users will be free to browse the document.
We need to design the DTD for this document. Here is our first pass:
<!DOCTYPE catalog [
  <!ELEMENT catalog (books, authors)>
  <!ELEMENT books (book*)>
  <!ELEMENT authors (author*)>
  <!ELEMENT book (summary)>
  <!ATTLIST book
    title  CDATA #REQUIRED
    author IDREF #REQUIRED>
  <!ELEMENT author (bio)>
  <!ATTLIST author
    id   ID    #REQUIRED
    name CDATA #REQUIRED>
  <!ELEMENT summary (#PCDATA)>
  <!ELEMENT bio (#PCDATA)>
]>
To get a better feel for what we've designed, we create a little sample
document:
<catalog>
  <books>
    <book title="The Postman" author="A1">
      <summary>Text goes here.</summary></book>
    <book title="Startide Rising" author="A1">
      <summary>Text goes here.</summary></book>
    <book title="Hitchhiker's Guide to the Galaxy" author="A2">
      <summary>Text goes here.</summary></book>
  </books>
  <authors>
    <author id="A1" name="David Brin"><bio>Text goes here.</bio></author>
    <author id="A2" name="Douglas Adams"><bio>Text goes here.</bio></author>
  </authors>
</catalog>
This seems to work. It stores information about books and authors,
and it is not possible to add a book without associating it with
the description of some author. But we can see that it breaks as
soon as we add any other kind of element that has an ID, because an
IDREF may point at any element in the document that carries an ID,
whatever its type. We know that every book will eventually have an
ID, because we'll soon want to have an element whose content elements
reference the New York Times Bestsellers. Once we do that, nothing
prevents an administrator (or the client program he or she is using)
from indicating that the author of a book is another book. This DTD
will not suffice.
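To make the failure concrete, here is a minimal Python sketch. It assumes the sample document has been extended so that books carry an "id" attribute (that extension, and the helper name check_author_refs, are hypothetical). Python's standard library does not validate against a DTD, so the sketch emulates IDREF resolution by hand: every ID lives in one document-wide namespace, and a reference is accepted as long as some element owns that ID.

```python
import xml.etree.ElementTree as ET

# Hypothetical extension of the sample: books now carry IDs too.
DOC = """
<catalog>
  <books>
    <book id="B1" title="The Postman" author="A1"><summary>Text.</summary></book>
    <book id="B2" title="Startide Rising" author="B1"><summary>Text.</summary></book>
  </books>
  <authors>
    <author id="A1" name="David Brin"><bio>Text.</bio></author>
  </authors>
</catalog>
"""

def check_author_refs(root):
    """Return (dangling, wrong_type) lists of book titles."""
    # One ID namespace for the whole document, as with the ID attribute type.
    by_id = {el.get("id"): el for el in root.iter() if el.get("id") is not None}
    dangling, wrong_type = [], []
    for book in root.iter("book"):
        target = by_id.get(book.get("author"))
        if target is None:
            dangling.append(book.get("title"))    # an IDREF check catches this
        elif target.tag != "author":
            wrong_type.append(book.get("title"))  # an IDREF check misses this
    return dangling, wrong_type

root = ET.fromstring(DOC)
dangling, wrong_type = check_author_refs(root)
print(dangling, wrong_type)
```

The second book names "B1" as its author. That reference resolves, so an ID/IDREF check alone would accept it; only the extra test on the element type, which the DTD cannot express, flags it.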
It seems that we might have to use links, but let's look at other
approaches first. We entertain the idea that an author's books
belong to the content of the author. We quickly throw that one out
when we realize that a book can have more than one author.
Now we consider having authors belong to the content of a book,
but we throw that idea out because authors may author many books.
It is possible to put author information in the content of each book,
but then we'd be duplicating the lengthy bio and wasting disk space
as well as introducing the headache of managing duplicate copies.
The same problem arises if we were to duplicate book information
under each of the authors of the book, especially since each book has
a lengthy book description.
So now we ask whether links can do the job. Links allow us to use
URLs and XPointers to reference other elements. For the moment,
consider trying to accomplish our task using a single DTD, so that
all element IDs have the same scope. In this case, the URL of any
link references the document that contains the link, so all of our
distinguishing information resides in the XPointers. The ID()
location term looks useful, but this term cannot constrain the
element type of the element that it references. Using ID() as the
first locator term would not be sufficient to distinguish between
books and authors.
Suddenly a brilliant idea comes to mind. We'll use a locator term
to specify the <authors> element and then follow that with the ID()
term to select the ID of the particular <author> element. But
this idea has a problem: when the ID() term appears, it must appear
as the first locator term.
Another idea comes to mind. We could use the following combination
of locator terms:
CHILD(1,authors)(1,author,id,'A3')
Here 'A3' is the identifier of the author. We know that we cannot
try to match the author's name, because more than one author may
have the same name. IDs are guaranteed to be unique.
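The following Python sketch shows how such a pair of locator terms might be resolved. It rests on an assumed reading of the draft: each term selects the n-th child of the current element that has the given element type and, optionally, the given attribute value. The helper name child and the "Hypothetical Author" entry are illustrative only.

```python
import xml.etree.ElementTree as ET

def child(node, n, tag, attr=None, value=None):
    """Return the n-th (1-based) child of `node` with the given tag and,
    optionally, the given attribute value; None if there is no such child."""
    matches = [c for c in node
               if c.tag == tag and (attr is None or c.get(attr) == value)]
    return matches[n - 1] if 0 < n <= len(matches) else None

catalog = ET.fromstring(
    '<catalog><books/><authors>'
    '<author id="A1" name="David Brin"><bio>Text.</bio></author>'
    '<author id="A3" name="Hypothetical Author"><bio>Text.</bio></author>'
    '</authors></catalog>'
)

# CHILD(1,authors)(1,author,id,'A3'), one locator term at a time.
authors = child(catalog, 1, "authors")
target = child(authors, 1, "author", "id", "A3")
print(target.get("name"))
```

Because the path runs through <authors> and names the element type at each step, an element of another type that happened to carry the ID 'A3' would not be selected.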
That seems to work. Something similar could have been accomplished
by separating books and authors into different documents and then
using the URL portion of the href to specify the document that
contains the target element.
However, these link solutions all have one problem: nothing in the
link specification allows a link element declaration to constrain
the kind of resource to which a link links. WD-XML-LINK-970731
indicates that an href is a URL, and that when the URL references
another XML document, XPointer locator terms may be appended to
the URL. I do not see any mechanism by which a link element can
constrain the kind of element that the link references.
I have not been able to find a way to have the document server force
clients to ensure that whenever they add a book, that book is
associated with some author. Clients are given the responsibility
of maintaining the integrity of the document.
The problem grows more complicated when we also ask that no author
exist in the document unless at least one book is associated with
that author. Solving the first problem would not, by itself, handle
this additional requirement. Once constraints operate in both
directions, every change to a document must occur within a
transaction, so that the document is validated against the DTD only
at transaction boundaries.
(If every book had to have at least one author and every author
had to have at least one book, then when it comes time to add a
new book by a new author, the document will not validate against
the DTD after we add one and before we add the other.)
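The intermediate state can be shown with a short Python sketch. The function violations is a hypothetical stand-in for the checks a constraint-aware DTD validator would perform; the title and author name are made up for the illustration.

```python
import xml.etree.ElementTree as ET

def violations(catalog):
    """List breaches of the two constraints: every book names an existing
    author, and every author is named by at least one book."""
    author_ids = {a.get("id") for a in catalog.iter("author")}
    referenced = {b.get("author") for b in catalog.iter("book")}
    problems = [f"book {b.get('title')!r} has no author entry"
                for b in catalog.iter("book")
                if b.get("author") not in author_ids]
    problems += [f"author {a.get('name')!r} has no book"
                 for a in catalog.iter("author")
                 if a.get("id") not in referenced]
    return problems

catalog = ET.fromstring("<catalog><books/><authors/></catalog>")
books, authors = catalog.find("books"), catalog.find("authors")

# Step 1: add the new book first. Its author entry does not exist yet.
ET.SubElement(books, "book", title="Hypothetical Title", author="A3")
after_book = violations(catalog)

# Step 2: add the author. Only now do both constraints hold.
ET.SubElement(authors, "author", id="A3", name="Hypothetical Author")
after_both = violations(catalog)

print(after_book)
print(after_both)
```

Reversing the two steps only swaps which constraint is broken in between, so checking after every single update can never succeed; the check has to wait until both updates are in place.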
The example I have given here may seem trivial. Surely we can find
a way to live with books that don't have associated author entries
and authors that don't have associated book entries. However, in
general, constraints between elements will be important. For
example, it would not be acceptable to store away an account
deduction entry without having an associated account entry or to
have an account entry that does not have at least one associated
account-owner entry. It seems to me that there are very few domains
that can be represented without these kinds of constraints.
I think the solution to this problem resides partly in the XML
specification and partly in the document access language. A DTD
needs to be able to express these kinds of constraints among
elements, so that the document server can enforce the constraints.
We would then not be relying on the proper behavior of all the
clients that wish to add to or modify the document. (Let me know
if you need an argument for why clients should not hold this
responsibility; I'm assuming we agree on this point.) The access
language also needs to reflect the solution because in order for
a server to implement constraints, all document update operations
must be couched in the language of transactions. That is, every
document update operation must be associated with a transaction.
The DOM model allows us to manage documents from a client, so long
as clients assume part of the responsibility for maintaining object
model constraints. However, if we decide that the document server
is responsible for maintaining these constraints, then the DOM
model as it is currently architected will not suffice, since its
document-update operations are not architected around transactions.
Moreover, I do not see a way to extend the current DOM design so
that it can safely support transactions. One way to correct DOM
is to redesign it so that it submits query/edit objects to the server,
where each query/edit object is submitted via a transaction object.
Another way to correct DOM is to add a transaction parameter to all
document-update method signatures. I don't think of this latter
approach as an extension to DOM, since the corrected DOM would not
be backwards-compatible with the current DOM.
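As a rough illustration of the first approach, here is a Python sketch in which a client collects edit objects in a transaction and the server validates only once, when the transaction is submitted. None of these names (AddBook, AddAuthor, Transaction, CatalogServer, submit) come from DOM or any other specification; they are assumptions made for the example, and the validity check covers only the two book/author constraints discussed above.

```python
import copy
import xml.etree.ElementTree as ET

class AddBook:
    def __init__(self, title, author_id):
        self.title, self.author_id = title, author_id
    def apply(self, catalog):
        ET.SubElement(catalog.find("books"), "book",
                      title=self.title, author=self.author_id)

class AddAuthor:
    def __init__(self, author_id, name):
        self.author_id, self.name = author_id, name
    def apply(self, catalog):
        ET.SubElement(catalog.find("authors"), "author",
                      id=self.author_id, name=self.name)

class Transaction:
    """Client-side container: it only collects edit objects."""
    def __init__(self):
        self.edits = []
    def add(self, edit):
        self.edits.append(edit)
        return self

class CatalogServer:
    def __init__(self):
        self.catalog = ET.fromstring("<catalog><books/><authors/></catalog>")

    @staticmethod
    def _valid(catalog):
        # Every book names an existing author; every author has a book.
        author_ids = {a.get("id") for a in catalog.iter("author")}
        referenced = {b.get("author") for b in catalog.iter("book")}
        return author_ids == referenced

    def submit(self, txn):
        """Apply all edits to a copy; keep the copy only if it validates."""
        draft = copy.deepcopy(self.catalog)
        for edit in txn.edits:
            edit.apply(draft)
        if not self._valid(draft):
            return False          # stored document is left untouched
        self.catalog = draft
        return True

server = CatalogServer()
rejected = server.submit(
    Transaction().add(AddBook("Hypothetical Title", "A3")))
accepted = server.submit(
    Transaction().add(AddBook("Hypothetical Title", "A3"))
                 .add(AddAuthor("A3", "Hypothetical Author")))
print(rejected, accepted)
```

The point of the design is that the client never touches the stored document directly: a lone AddBook is refused, while the same edit paired with its AddAuthor is accepted as a unit.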
I think the XML specification as it currently stands is extremely
well-suited for describing data that does not change over time, but
that it is lacking in specifying how documents are to evolve.
--
Joe Lapp (Java Apps Developer/Consultant)
Unite for Java! - http://www.javalobby.org
jlapp at acm.org