Weak DTDs

Fri Oct 17 23:03:00 BST 1997

W. Eliot Kimber wrote:
> Peter has run head-on into one of the fundamental problems with DTDs as
> currently defined by SGML (and XML): we want them to describe *classes* of
> documents when they actually describe *individual* documents (and are
> incapable of defining classes of documents except in very weak ways).

I don't see how you can say that. Clearly the TEI DTD does not define an
*individual document*. It defines a class of documents with certain
constraints. Peter merely wants those constraints loosened a little. We
can do that in part by providing the #ANY keyword he asks for (or
directing him to the workaround), by pointing him to SGML's "&" operator
that can loosen ordering constraints (sorry, not in XML!) and by
providing subclassing mechanisms that allow people to build upon the CML
DTD and re-add constraints incrementally.
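
As a rough sketch of the first two options in SGML syntax (MOL, ATOMS and
BONDS are borrowed from the CML example quoted further down; EXTRA and the
entity name are made up for illustration):

<!-- "&" (SGML only): ATOMS and BONDS are both required, in either order -->
<!ELEMENT MOL   - - (ATOMS & BONDS)>
<!ELEMENT ATOMS - - (#PCDATA)>
<!ELEMENT BONDS - - (#PCDATA)>

<!-- One possible stand-in for #ANY: a repeatable "or" group over every declared type -->
<!ENTITY % anything "#PCDATA | MOL | ATOMS | BONDS | EXTRA">
<!ELEMENT EXTRA - - (%anything;)*>

The "or" group has to be kept in step with the rest of the DTD by hand,
which is presumably why a built-in #ANY keyword is attractive.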

> Draconian rules are fine when your use scenario requires draconian
> policies, such as when creating military documents or documents that drive
> well-defined and specific processes.  However, not all uses of SGML require
> draconian policies (i.e., the TEI). XML, in particular, is expressly
> designed for situations that *probably don't* require draconian policies
> (as evidenced by the potential lack of any DTD declarations).  In other
> words, there is a continuum of possible constraint policies, from no
> variation allowed to anything is allowed.  Unfortunately, SGML only really
> supports the 'no variation' end of that spectrum and XML only really
> supports the 'anything is allowed' or the 'no variation' ends, with no
> obvious support for the middle ground, where you want some constraints but
> not necessarily full constraint.

XML allows you to specify certain element types and leave others
unspecified. Many of us have argued that we should explicitly define the
result of piece-wise validation. It only makes sense that if
declarations are provided there should be an option to validate against
them. This will give you and Peter the middle ground you need.
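
To make the piece-wise idea concrete, here is a hypothetical XML document
that declares only two of its element types (the REMARK element and the
data are invented):

<!DOCTYPE MOL [
<!ELEMENT ATOMS (#PCDATA)>
<!ELEMENT BONDS (#PCDATA)>
]>
<MOL>
  <ATOMS>C H H H H</ATOMS>
  <BONDS>1-2 1-3 1-4 1-5</BONDS>
  <REMARK>methane</REMARK>
</MOL>

A piece-wise validator, as imagined here, would check ATOMS and BONDS
against their declarations and pass over MOL and REMARK, which have none.
How such a result should be reported is exactly what would need to be
defined.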

> Thus the frustration that Peter describes is unavoidable with DTDs alone:
> he has clearly defined a general document type, the CML, that needs to
> allow a range of specialization options.  However, if the CML is defined as
> a set of declarations to be used directly in documents as their DTD
> declarations, it cannot do that, as the declarations define the *complete
> set* of constraints on those documents. The CML must either impose
> arbitrary constraints that are not necessarily appropriate for all CML
> documents or it must be so loose as to define no constraints beyond type
> names.
> 
> In short: DTDs don't define document classes.  

That just isn't true. DTDs have *always* defined document classes. 

SGML Handbook page 124

"Document type: A class of documents having similar characteristics; for
example journal, article, technical manual, or memo"

Yes, the facilities for defining those classes are a) a little too
strict and b) not well designed for incremental extension. But we can
attack both of those problems *directly* without introducing another
"level" of processing in the way that architectural forms do. In any
type system, from Aristotle's classification of animals to Simula's
simulation of real-world type systems, the mechanism for making more
flexible classification rules is subclassing and we can add this
directly to SGML with straight-forward semantics.

> This is why something like architectures is required for the productive and
> large-scale use of SGML and XML: you must have a way to define true
> document classes with clear, machine-processible and validatable
> specialization constraints that don't, at the same time, impose unnecessary
> constraints on individual documents.  

True, something is needed. But I do not see why it must be a new level
of processing, an "architecture" when the DTD needs only to be made more
flexible.

> An architecture is defined by the *combination* of a set of DTD
> declarations and accompanying documentation that together define the rules
> for a class of documents (the documentation is vitally important because
> there will always be rules and constraints that cannot be expressed through
> syntax, regardless of what syntax you are using to formally express
> constraints).  As part of these rules, the range of allowed variation among
> documents that conform to the class can be defined, both formally in the
> syntax and completely in the documentation.   The DTD declarations form a
> "meta-DTD", that is a DTD that defines the syntactic rules for the class,
> not for instances.  Instances will have their individual DTDs (explicit or
> implicit) that define their individual syntax rules.

But this is the definition of "Document Type Definition". See page 126.
You've just paraphrased it.

A DTD defines the allowed occurrence of elements and attributes for a
class of instances. A "meta-DTD" (I prefer the term "architectural DTD")
defines the allowed occurrence of architectural elements and
architectural attributes for a class of instances. It's basically the
same thing, except one uses the straightforward SGML syntax and the
other uses the architectural syntax. Both work on classes of documents.

They are not inherently "more flexible" than DTDs at all. One way that
they are flexible, I'll admit, is that they allow piece-wise validation
which SGML has not had in the past. But there is no reason that we
cannot add this directly to SGML and XML. I wouldn't be surprised if the
SGML WebTC adds this already, but I'm not sure.

> Architectures can themselves be derived from other architectures, allowing
> you to form a hierarchy of document classes. By the same token, any
> architecture can be used as the base for a more specialized architecture.
> In addition, a single document or architecture can be derived from many
> different architectures (for example, the CML might be derived in part from
> some RDF architecture in order to standardize the way the CML structures
> metadata).

The phrase "derived from" can be very misleading in this context.
Basically, you include a few processing instructions and notation
declarations. You do not inherit any declarations. There are no
constraints placed on the DTD. This DTD:

<!ELEMENT ANY ANY>
<!ATTLIST ANY RDF CDATA #IMPLIED
	  	EXTRAINFO CDATA #IMPLIED>

could be "derived from" RDF or CML with a couple of extra notation
statements. But instances conforming to this DTD are not constrained at
the SGML level to be valid RDF or CML instances at all. As you can see,
this particular DTD is actually much more flexible than RDF.
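
For instance, assuming the DTD above, an instance like the following
(contents invented) would be valid against it and yet tell you nothing
about RDF or CML conformance:

<ANY EXTRAINFO="whatever you like">
<ANY>Some text that is neither RDF nor CML.</ANY>
</ANY>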

> ...Thus, the CML is an SGML architecture because the CML DTD can be used as
> an architectural meta-DTD (with the possible addition of a few small
> changes to better express its specialization constraints).  To use this
> architecture with documents, you need to define a mapping between the
> elements, attributes, and data of the document with the elements and
> attributes in the architectural meta-DTD. The AFDR mechanism does this
> with attributes and provides a natural automatic mapping mechanism so that
> documents that are very similar to their meta-DTDs need provide mappings
> only for those things that differ from the meta-DTDs (that is, those things
> that are specialized beyond what the architectures define).

Right. In the AFDR, the mapping from elements to architectural element
types is done through attributes. In DTDs, the mapping is done with GIs
and attribute names. I don't think it follows from that that
architectures are inherently more flexible than DTDs. They seem to have
almost the same flexibility modulo piece-wise validation.
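
A small sketch of the two mappings, assuming an architecture named CML
whose form attribute is also called CML (the MOLECULE element type is
hypothetical, and the declarations that enable the architecture are
omitted):

<!-- Plain DTD: the GI itself names the type -->
<!ELEMENT MOL      - - (ATOMS, BONDS)>

<!-- AFDR style: a locally named type is mapped to the MOL form by an attribute -->
<!ELEMENT MOLECULE - - (ATOMS, BONDS)>
<!ATTLIST MOLECULE CML NAME #FIXED "MOL">

Either way a name ties the element to a set of rules; only the place
where the name is written differs.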

> The idea of a "wildcard" for content models is expressed in the AFDR by the
> notion of "bridging" element forms, "bridging" in the sense that they
> bridge between the architecture and non-architectural stuff. In the
> meta-DTD, a bridging form simply says "anything can go here".  Thus, rather
> than saying the following in the document's declarations:
> 
> <!ELEMENT MOL (#ANY,ATOMS,BONDS)*>
> 
> You would say this in the meta-DTD:
> 
> <!ELEMENT ANY -- Bridging form that allows anything to occur --
>   - - (#PCDATA | ANY)*>
>
> This is essentially the same as what Rick suggested, except that we're
> doing it in the meta-DTD, rather than the document's DTD (the document may
> not have a DTD).

Right. So we haven't really bought any document-level flexibility (which
is what I interpreted Peter's request as). We've just 

a) moved it up a level

and 

b) Provided the option for specializing the element in a "derived" DTD.

The former is a bad thing, in that it adds up to more work. We could
provide the latter just as well by making element type subclassing a
first-class feature of DTDs.

> Note that if you have an existing SGML document with an explicit DTD, you
> can make that SGML document into a DTD-less XML document simply by using
> the existing DTD as an architectural meta-DTD.  This removes the necessity
> of parsing the declarations with the document any time you want to parse it
> without removing the connection between the document and its syntactic and
> semantic constraints (thus allowing validation on demand). This is
> particularly useful when the DTD you use is large (e.g., Docbook, full TEI,
> etc.).

XML already removes the necessity of parsing declarations without
removing the connection between documents and their syntactic and
semantic constraints. The reason we have an RMD is to allow this. So
once again we haven't bought anything by making our DTD into an
architecture.
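
A sketch of what that looks like, assuming the RMD syntax of the current
XML working draft (the system identifier and the content are
placeholders):

<?XML version="1.0" RMD="NONE"?>
<!DOCTYPE BOOK SYSTEM "docbook.dtd">
<BOOK><TITLE>Weak DTDs</TITLE></BOOK>

With RMD set to NONE a processor may skip the declarations entirely,
while the DOCTYPE line still records where to find them if validation is
wanted.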

> But wouldn't it be cool if XML editors *were* architecture aware such that
> you could say "I want to create documents that conform to architecture X"
> and the editor would determine and enforce the specialization rules,
> letting you define new element types (or modify existing ones) and either
> warn you when you were doing something outside the architecture or prevent
> you from doing something outside the architecture (depending on what your
> local specialization policies are)?  I think so.  In fact, I think this is
> the only way you can have a useful XML editor at all.

I agree with your direction, but feel that AFDR architectures are poorly
suited to this in the long run.

By definition, they express constraints on *elements* and not *element
types*. That means that you can define an element that behaves according
to the architecture 100 times, but on the 101st time you will get a
cryptic error message about architectural non-compliance. Worse, that
error message could be for a base architecture of a base architecture of
a base architecture of the architecture you are familiar with. I don't
think that that is what we want. Every *element type* in the "derived"
DTD should subclass from a particular *element type* in the base DTD.
And the *DTD* should be constrained such that it in turn constrains
documents to conformance with the meta-class DTD. Before you make a
single instance you should know that there is nothing you can do in the
instance that could invalidate any of your base classes.

I agree that we need a) more flexible DTDs (#ANY etc.) and b) a
mechanism for extending and constraining these flexible DTDs. I do NOT
agree that we need a concept of "architectures" to do so. Extending TEI
(for example) should be as simple as:

<!ENTITY % TEI SYSTEM "http://....">
%TEI;
<!ELEMENT CAUTION TYPEOF P>

And the result should a) be a single document type, not a document type
and an architecture and b) be guaranteed to constrain documents to TEI
conformance. In other words, we need element type subclassing, but we
don't have to bring the whole HyTime architecture mechanism to do so.
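
Assuming such a subclassing declaration were supported, an instance could
then use CAUTION wherever the base DTD permits P (a hypothetical
fragment, with invented content):

<DIV>
<P>Dissolve the sample in water.</P>
<CAUTION>Add the acid slowly.</CAUTION>
</DIV>

Under this proposal a processor that knows only TEI could treat CAUTION
as a P, and the declaration itself would be rejected up front if it ever
allowed something that P does not.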

 Paul Prescod

xml-dev: A list for W3C XML Developers. To post, mailto:xml-dev at ic.ac.uk
Archived as: http://www.lists.ic.ac.uk/hypermail/xml-dev/