Weak DTDs

Fri Oct 17 15:21:21 BST 1997

At 06:28 PM 10/13/97, Peter Murray-Rust wrote:

[...]
>I'd like *constructive* views on the value of DTDs in XML. [I know that the
>community has strongly held ones, so please avoid too much passion :-).
>There was a very interesting discussion a few weeks back on the aesthetics
>of DTDs - a good DTD is a thing of beauty.] I can see the following reasons
>for DTDs.

[...]

>In creating CML documents I find myself:
>	(a) wanting to introduce foreign names (e.g. <DC:author>, or <MathML:EQN>)
>These could reasonably come at many places in the document
>	(b) forgetting my own 'rules', e.g. order of elements within a content
>model. So I can't expect others to follow them :-)
>	(c) adding new components to content models - for good reasons. There is
>no reason why an <MOLECULE> cannot contain a <FIGURE>, but I didn't think
>of that earlier. I don't want to have to think of all combinations and ask
>'is that reasonable?'.

Peter has run head-on into one of the fundamental problems with DTDs as
currently defined by SGML (and XML): we want them to describe *classes* of
documents when they actually describe *individual* documents (and are
incapable of defining classes of documents except in very weak ways).  

It was clearly the intent of the SGML designers that DTDs describe
*classes* of documents (thus the term 'document type').  Unfortunately, by
making the DTD declarations a property of individual documents, they are
prevented from being used in that way except in the most draconian fashion:
all documents of a type must have *exactly* the same rules (because they
all share exactly the same declaration set as part of their syntactic
content).  Valiant attempts at making configurable declaration sets,
typified by the TEI and Docbook, simply emphasize the problem: there is no
useful way with DTDs alone to define flexible document classes that can be
easily specialized at the document level.

Draconian rules are fine when your use scenario requires draconian
policies, such as when creating military documents or documents that drive
well-defined and specific processes.  However, not all uses of SGML require
draconian policies (e.g., the TEI). XML, in particular, is expressly
designed for situations that *probably don't* require draconian policies
(as evidenced by the potential lack of any DTD declarations).  In other
words, there is a continuum of possible constraint policies, from 'no
variation allowed' to 'anything is allowed'.  Unfortunately, SGML only
really supports the 'no variation' end of that spectrum and XML only really
supports the 'anything is allowed' or the 'no variation' ends, with no
obvious support for the middle ground, where you want some constraints but
not necessarily full constraint.

Thus the frustration that Peter describes is unavoidable with DTDs alone:
he has clearly defined a general document type, the CML, that needs to
allow a range of specialization options.  However, if the CML is defined as
a set of declarations to be used directly in documents as their DTD
declarations, it cannot do that, as the declarations define the *complete
set* of constraints on those documents. The CML must either impose
arbitrary constraints that are not necessarily appropriate for all CML
documents or it must be so loose as to define no constraints beyond type
names. 

In short: DTDs don't define document classes.  The use of parameter
entities to create configurable declaration sets is a very weak way of
expressing the allowed range of specialization, one that depends entirely
on syntax tricks and conventions and one that cannot be reliably machine
processed (it is impossible to impute meaning to the names and/or positions
of parameter entities in the general case). And one that cannot be used at
the document level with any of the commercial SGML editors I'm familiar
with (because none allow element or attribute declarations in the internal
subset).

This is why something like architectures is required for the productive and
large-scale use of SGML and XML: you must have a way to define true
document classes with clear, machine-processible and validatable
specialization constraints that don't, at the same time, impose unnecessary
constraints on individual documents.  SGML architectures, as defined by the
AFDR (http://www.ornl.gov/sgml/wg8/docs/n1920/html/clause-A.3.html),
provide such a mechanism.

An architecture is defined by the *combination* of a set of DTD
declarations and accompanying documentation that together define the rules
for a class of documents (the documentation is vitally important because
there will always be rules and constraints that cannot be expressed through
syntax, regardless of what syntax you are using to formally express
constraints).  As part of these rules, the range of allowed variation among
documents that conform to the class can be defined, both formally in the
syntax and completely in the documentation.  The DTD declarations form a
"meta-DTD", that is a DTD that defines the syntactic rules for the class,
not for instances.  Instances will have their individual DTDs (explicit or
implicit) that define their individual syntax rules.  

Architectures can themselves be derived from other architectures, allowing
you to form a hierarchy of document classes. By the same token, any
architecture can be used as the base for a more specialized architecture.
In addition, a single document or architecture can be derived from many
different architectures (for example, the CML might be derived in part from
some RDF architecture in order to standardize the way the CML structures
metadata).

Because architectures are defined using normal DTD syntax, any existing DTD
declaration set can be used as an architecture without modification
(although most existing DTDs can benefit from some redesign in order to
make them better architectures).

Thus, the CML, in the abstract, is clearly an architecture in the general
sense: it defines the rules for a class of documents.  It does (or needs
to) define specialization constraints.  The current definition of the CML
includes a declaration set...

...Thus, the CML is an SGML architecture because the CML DTD can be used as
an architectural meta-DTD (with the possible addition of a few small
changes to better express its specialization constraints).  To use this
architecture with documents, you need to define a mapping between the
elements, attributes, and data of the document and the elements and
attributes in the architectural meta-DTD.  The AFDR mechanism does this
with attributes and provides a natural automatic mapping mechanism so that
documents that are very similar to their meta-DTDs need provide mappings
only for those things that differ from the meta-DTDs (that is, those things
that are specialized beyond what the architectures define).

[...]

>These are powerful conditions, but if we try to express them in DTDs,
>validation will fail. What I'd like to have is a wildcard #ANY (this has
>already been suggested) which can be used for content models something like
>the (currently illegal) XML:

The idea of a "wildcard" for content models is expressed in the AFDR by the
notion of "bridging" element forms, "bridging" in the sense that they
bridge between the architecture and non-architectural stuff. In the
meta-DTD, a bridging form simply says "anything can go here".  Thus, rather
than saying the following in the document's declarations:

<!ELEMENT MOL (#ANY,ATOMS,BONDS)*>

You would say this in the meta-DTD:

<!ELEMENT ANY -- Bridging form that allows anything to occur --
  - - (#PCDATA | ANY)*
>

This is essentially the same as what Rick suggested, except that we're
doing it in the meta-DTD, rather than the document's DTD (the document may
not have a DTD).
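
As a rough illustration of what a bridging form buys a validator, the
following Python sketch checks an element's children against a content
model written as a regular expression over architectural form names.  The
content model used for MOL here (ATOMS, then BONDS, with the bridging form
ANY allowed around them) is an assumption made for the example, not the
actual CML meta-DTD, and children_conform is a hypothetical helper:

```python
import re

# Hypothetical meta-DTD content model for MOL, written as a regular
# expression over space-terminated architectural form names.
# Assumption: ATOMS then BONDS, with the bridging form ANY allowed
# before, between, and after them.
MOL_MODEL = re.compile(r"(ANY )*ATOMS (ANY )*BONDS (ANY )*")

def children_conform(child_forms, model=MOL_MODEL):
    """Return True if the sequence of architectural forms matches the model."""
    sequence = "".join(form + " " for form in child_forms)
    return model.fullmatch(sequence) is not None
```

The point of the design is that the check runs over architectural forms,
not local element type names, so any number of locally invented element
types can pass as long as each maps to ANY.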

To define the mapping from a document to a governing architecture, you
declare the architecture and then define the mapping.  In the AFDR as
written the architecture is declared using a NOTATION declaration [several
people, including myself and Peter Newcomb, have suggested alternative
PI-based mechanisms for doing these declarations as XML doesn't yet provide
data attributes, which the AFDR mechanism relies on--what's important is
making the connection, not the precise syntax by which it is made.].

A document that is derived from the CML and takes advantage of the above
might look like this:

<!DOCTYPE CML [
 <!NOTATION CML 
   PUBLIC "-//VSMS//DTD Chemical Markup Language Architecture//EN"> 
 <!ATTLIST #NOTATION CML 
   ArcDTD  CDATA #FIXED "CML.meta-DTD"
   ArcBridge NAME #FIXED "ANY"
 >
 <!NOTATION SGML 
  PUBLIC "ISO 8879:1986//NOTATION Standard Generalized Markup Language//EN">
 <!ENTITY CML.meta-DTD 
   SYSTEM "http://www.vsms.nottingham.ac.uk/vsms/cml.dtd" CDATA SGML >

 <!-- Map the local element type 'MyElement' to the CML bridging form
      'ANY': -->
 <!ATTLIST MyElement
    CML    NAME #FIXED "ANY"
 >
]>
<CML>
 ...[normal CML stuff]
 <MOL>
  <MyElement>...</MyElement>
  <ATOMS>...</ATOMS>
  <BONDS>...</BONDS>
 </MOL>
</CML>

By the normal rules of automatic architectural mapping, I only had to
explicitly map the 'MyElement' element to something in the CML meta-DTD
because everything else used the same names as in the meta-DTD.  This means
that I didn't need any other DTD declarations in the document in order to
be able to interpret it as a CML document (that is, as a document that
conforms to the general rules defined by the CML architecture).  

To process it as a CML document, I can simply derive the "architectural
instance" using an architecture processor like SGMLNORM:

sgmlnorm -A CML mydoc.sgml > cmlai.sgm

The code samples Peter shows in his note could easily be used for
architecture-aware processing simply by looking at the result of the
architectural mapping rather than directly at element types.  Resolving
architectural mapping in an ad-hoc way requires about 20 lines of code if
you make some reasonable assumptions about the use of the architecture
(assuming you aren't prepared to do fully-general architectural processing
involving actually loading the meta-DTD, which you don't usually need to do
for most purposes).
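
A minimal Python sketch of that kind of ad-hoc resolution might look like
the following.  It assumes the document has already been parsed into an
ElementTree, that the #FIXED attribute defaults from the document's ATTLIST
declarations are supplied as a plain dictionary (ElementTree does not apply
DTD defaults), and that any element without a mapping is automatically
mapped to the form with its own name.  It does not check that the name
actually exists in the meta-DTD.  The function names arch_form and
forms_by_element are hypothetical:

```python
import xml.etree.ElementTree as ET

def arch_form(element, arch_name, fixed_defaults):
    """Return the architectural form name that `element` maps to.

    Lookup order (a simplifying assumption, not the full AFDR rules):
    1. an attribute named after the architecture on the element itself,
    2. a #FIXED default taken from the document's ATTLIST declarations,
       supplied here as a plain dict,
    3. automatic mapping: the element's own type name.
    """
    explicit = element.get(arch_name)
    if explicit is not None:
        return explicit
    return fixed_defaults.get(element.tag, element.tag)

def forms_by_element(root, arch_name, fixed_defaults):
    """List (element type, architectural form) pairs in document order."""
    return [(el.tag, arch_form(el, arch_name, fixed_defaults))
            for el in root.iter()]

doc = ET.fromstring("<CML><MOL><MyElement/><ATOMS/><BONDS/></MOL></CML>")
# Mirrors <!ATTLIST MyElement CML NAME #FIXED "ANY"> from the example above.
defaults = {"MyElement": "ANY"}
pairs = forms_by_element(doc, "CML", defaults)
```

Processing code can then dispatch on the second member of each pair (the
architectural form) instead of on the element type name.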

To define the sort of attribute constraints Peter wants, you must still
rely on either documentation that states the rules that must then be
enforced by an architecture-aware processor or you have to use something
like the lextype facility in ISO/IEC 10744 Annex A.2.  However, if you're
building a processor for a specific architecture (e.g., a CML-aware
processor), building in rules for specific attributes isn't a big deal and
is no different than the sorts of things people do in specialized SGML
processors every day.  The architecture does give you a central place to
put the documentation of the constraint and lets you make your
implementation as generalized as you want (or have time for).
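
For instance, a processor built for one architecture could keep its
attribute rules in a small table keyed by architectural form and attribute
name.  The Python sketch below is only an illustration: the attribute names
"count" and "order", the patterns, and the helper attribute_errors are all
hypothetical and are not taken from the CML:

```python
import re

# Hypothetical rules: (architectural form, attribute name) -> pattern.
# Neither the attribute names nor the patterns come from the CML.
ATTRIBUTE_RULES = {
    ("ATOMS", "count"): re.compile(r"[0-9]+"),
    ("BONDS", "order"): re.compile(r"[123]"),
}

def attribute_errors(form, attributes, rules=ATTRIBUTE_RULES):
    """Return a sorted list of attribute names whose values break a rule."""
    bad = []
    for name, value in attributes.items():
        pattern = rules.get((form, name))
        if pattern is not None and pattern.fullmatch(value) is None:
            bad.append(name)
    return sorted(bad)
```

Because the table is keyed by architectural form, the same rules apply to
any specialized element type that maps to that form.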

Thus we can use architectural meta-DTDs to really and truly define the
syntactic rules for classes of documents and then create documents that are
specialized from those classes.  The specialization rules are (mostly)
machine processible and enforceable (there will always be semantic rules
that can't be enforced by syntax alone).  Because of automatic mapping,
documents derived from architectures need have no explicit declarations of
their own except as needed to express specific specializations (as shown
above).  

Note that if you have an existing SGML document with an explicit DTD, you
can make that SGML document into a DTD-less XML document simply by using
the existing DTD as an architectural meta-DTD.  This removes the necessity
of parsing the declarations with the document any time you want to parse it
without removing the connection between the document and its syntactic and
semantic constraints (thus allowing validation on demand).  This is
particularly useful when the DTD you use is large (e.g., Docbook, full TEI,
etc.).

This then continues to raise the question: why have DTDs for documents at all?

In fact, most documents need never have a full set of explicit declarations
if they are derived from an architecture and are also well formed.  The
only time you'd need explicit declarations would be to define
specializations or to drive non-architecture-aware authoring or validation.

But wouldn't it be cool if XML editors *were* architecture aware such that
you could say "I want to create documents that conform to architecture X"
and the editor would determine and enforce the specialization rules,
letting you define new element types (or modify existing ones) and either
warn you when you were doing something outside the architecture or prevent
you from doing something outside the architecture (depending on what your
local specialization policies are)?  I think so.  In fact, I think this is
the only way you can have a useful XML editor at all [I find it interesting
that the ADEPT*Editor product has had for many years a non-SGML-conforming
mechanism for creating specialized element types while editing, although
ADEPT does it through the use of PIs and creates documents that are really
only processible in that form by ADEPT tools. But clearly they recognized a
strong requirement to allow specialization of documents by
authors--unfortunately, no architectural mechanism, certainly not a
standardized one, existed at the time they built that facility.  I wonder
how difficult it would be to make ADEPT into an architecture-aware editor
that provided the same specialization facilities it does now, but expressed
using the AFDR syntax rather than the proprietary ADEPT syntax?  Certainly
the work that Paul Grosso has done to demonstrate XML editing and
on-the-fly element declaration suggests it might be possible, even if it
requires something of a hack in the short term.]  

If you don't have an editor like this, then you are requiring the author to
know the architecture's rules, which as Peter points out, can be difficult,
even when you are the creator of the rules to begin with.  In other words,
"DTD-less authoring" is not attractive for most people because most people
create documents that need to have at least some minimal consistency with
other documents.

My personal feeling is that without architectures [in the general sense,
not necessarily using the AFDR mechanism, although I think the AFDR is a
very good mechanism] neither SGML nor XML is really very useful; that is,
architectures are required to use SGML and XML at large scales and across
wide domains.  Almost all the problems people have
with using SGML at large scales come not from technological limitations but
from limitations in the ability of document types alone to define document
classes and the inability of SGML processors to operate at the class level,
rather than the document level. 

Having said that, let me stress that SGML and XML are still the best thing
going for creating structured documents.  Obviously, we need to add the
architectural mechanism to SGML and XML, not discard them in favor of
something else.  I think publication of ISO/IEC 10744:1997 demonstrates the
desire to do this addition and, in fact, accomplished it (at least within
the constraints of 8879:1986--there's lots of room for improvement to this
mechanism as the syntax of SGML is improved through the SGML revision).

Cheers,

Eliot
--
<Address HyTime=bibloc>
W. Eliot Kimber, Senior Consulting SGML Engineer
Highland Consulting, a division of ISOGEN International Corp.
2200 N. Lamar St., Suite 230, Dallas, TX 95202.  214.953.0004
www.isogen.com
</Address>

xml-dev: A list for W3C XML Developers. To post, mailto:xml-dev at ic.ac.uk
Archived as: http://www.lists.ic.ac.uk/hypermail/xml-dev/
To (un)subscribe, mailto:majordomo at ic.ac.uk the following message;
(un)subscribe xml-dev
To subscribe to the digests, mailto:majordomo at ic.ac.uk the following message;
subscribe xml-dev-digest
List coordinator, Henry Rzepa (mailto:rzepa at ic.ac.uk)