Weak DTDs

Fri Oct 17 00:59:04 BST 1997

I am in the throes of revising CML (Chemical Markup Language - an XML-based
application) and trying to work out what the value of conventional DTDs
is. The previous version has a traditional SGML-like DTD - lots of
parameter entities and other clever stuff. I am finding this too
restrictive for several reasons, mainly because:
	(a) XML-* is moving so rapidly (e.g. LINK, STYLE, etc.). This is a Good
Thing, but CML has to react to it.
	(b) RDF, DC, MathML etc. will be involved in CML and I can't say exactly
how at present.
	(c) My ideas on CML itself keep changing as I gain experience of new
problems.

I'd like *constructive* views on the value of DTDs in XML. [I know that the
community has strongly held ones, so please avoid too much passion :-).
There was a very interesting discussion a few weeks back on the aesthetics
of DTDs - a good DTD is a thing of beauty.] I can see the following reasons
for DTDs.
	(a) the author has to conform to a pre-defined spectrum of ideas (e.g. a
tax-return). [This is not required for CML, and any conformance is outside
what a DTD can deliver - e.g. value verification.]
	(b) the document may get corrupted in transmission or elsewhere. I suspect
this is not a very important reason these days.
	(c) it *may* make it easier to develop authoring tools
	(d) it *may* give guidance to implementers of applications.
	(e) it should (but doesn't always) act as an incentive to develop
human-readable documentation of the semantics.
	(f) it shows that the author has defined the language at some point in time.

I'd be grateful for other reasons. For CML I expect that (c-e) have some
limited value. (f) may impress some people and horrify others.

In creating CML documents I find myself:
	(a) wanting to introduce foreign names (e.g. <DC:author>, or <MathML:EQN>).
These could reasonably come at many places in the document.
	(b) forgetting my own 'rules', e.g. order of elements within a content
model. So I can't expect others to follow them :-)
	(c) adding new components to content models - for good reasons. There is
no reason why a <MOLECULE> cannot contain a <FIGURE>, but I didn't think
of that earlier. I don't want to have to think of all combinations and ask
'is that reasonable?'.

However the power of structured documents means that I can often use very
fuzzily constructed documents. Thus:
	'if a MOLECULE contains ATOMS and BONDS, the software can draw a picture'
	'if any parent contains a FIGURE, allow that to be displayed by the reader'.
	'if a VARiable has attribute BUILTIN=FOO, inform the software that it
could process this with special FOO-specific code'
and so on.
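
As a rough sketch of how rules like these might look in software, here is
one possible Java version. It assumes the standard org.w3c.dom tree API as
the parser interface; the class name FuzzyRules, the helper hasChild and the
action strings are invented for illustration and are not part of CML.
Elements the code does not know about are simply ignored.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class FuzzyRules {

    // True if 'parent' has a direct child element with the given name.
    static boolean hasChild(Element parent, String name) {
        for (Node n = parent.getFirstChild(); n != null; n = n.getNextSibling()) {
            if (n instanceof Element && n.getNodeName().equals(name)) {
                return true;
            }
        }
        return false;
    }

    // Visits every element in document order and records which rules fire.
    // The returned strings are hypothetical stand-ins for real actions.
    static List<String> actions(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        List<String> out = new ArrayList<>();
        NodeList all = doc.getElementsByTagName("*");
        for (int i = 0; i < all.getLength(); i++) {
            Element e = (Element) all.item(i);
            String name = e.getNodeName();
            // Rule 1: a MOLECULE with ATOMS and BONDS can be drawn.
            if (name.equals("MOLECULE") && hasChild(e, "ATOMS") && hasChild(e, "BONDS")) {
                out.add("draw molecule");
            }
            // Rule 2: a FIGURE under any parent can be displayed.
            if (name.equals("FIGURE")) {
                out.add("display figure");
            }
            // Rule 3: a VAR with a BUILTIN attribute gets special handling.
            if (name.equals("VAR") && e.hasAttribute("BUILTIN")) {
                out.add("special code for " + e.getAttribute("BUILTIN"));
            }
        }
        return out;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<DOC><MOLECULE><NOTE>anything</NOTE><ATOMS/><BONDS/><FIGURE/></MOLECULE>"
                + "<VAR BUILTIN=\"FOO\">1.0</VAR></DOC>";
        System.out.println(actions(xml));
    }
}
```

Note that nothing here needs the document to be valid against a DTD, only
well-formed: the unknown <NOTE> child does no harm.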

These are powerful conditions, but if we try to express them in DTDs,
validation will fail. What I'd like to have is a wildcard #ANY (this has
already been suggested) which can be used for content models something like
the (currently illegal) XML:

<!ELEMENT MOL (#ANY,ATOMS,BONDS)*>

This says that MOL can contain anything, but that ATOMS and BONDS have a
special role. The authoring tool might present a menu with the items ATOMS,
BONDS, Other. The software for MOL.java could contain routines to identify
children:
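	int natom = 0;
	int nbond = 0;
	// count the children the software knows about
	for (int i = 0; i < this.getChildCount(); i++) {
	    Node n = this.getNode(i);
	    if (n instanceof ATOMS) {
	        /* atom-specific stuff */
	        natom++;
	    } else if (n instanceof BONDS) {
	        /* bond-specific stuff */
	        nbond++;
	    }
	    // any other child is permitted by #ANY and is simply skipped
	}
	// draw only when both special children are present
	if (natom > 0 && nbond > 0) {
	    displayMol();
	}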

Obviously this can't be written automatically, but the 'DTD' helps the author.

In some cases there will be stricter rules such as:

<!ELEMENT VAR (#PCDATA)>
<!ATTLIST VAR 
    BUILTIN CDATA #IMPLIED 
    TYPE (INTEGER|FLOAT|STRING) "STRING" ...>

which clearly help both authoring tool authors and applications authors.
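
To illustrate what such a declaration buys the application author, here is
a hedged sketch using the standard Java DOM parser. It assumes the
declaration is written in legal XML 1.0 syntax (#PCDATA, '|' between the
enumerated values, a quoted default) and leaves out the elided '...' part.
The class name VarDefaults and the sample document are invented for
illustration. The point is that the application can ask for TYPE without
checking whether the author bothered to write it.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

public class VarDefaults {

    // A tiny document carrying the VAR declarations in its internal subset.
    // The author did not write a TYPE attribute on the element itself.
    static final String DOC =
          "<?xml version=\"1.0\"?>\n"
        + "<!DOCTYPE VAR [\n"
        + "  <!ELEMENT VAR (#PCDATA)>\n"
        + "  <!ATTLIST VAR\n"
        + "      BUILTIN CDATA #IMPLIED\n"
        + "      TYPE (INTEGER|FLOAT|STRING) \"STRING\">\n"
        + "]>\n"
        + "<VAR>hello</VAR>";

    // Parses the document and returns the TYPE attribute of the root element.
    static String typeOf(String xml) throws Exception {
        Element var = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)))
                .getDocumentElement();
        return var.getAttribute("TYPE");
    }

    // True if the author wrote a BUILTIN attribute (it has no default).
    static boolean hasBuiltin(String xml) throws Exception {
        Element var = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)))
                .getDocumentElement();
        return var.hasAttribute("BUILTIN");
    }

    public static void main(String[] args) throws Exception {
        System.out.println("TYPE = " + typeOf(DOC));
        System.out.println("BUILTIN present: " + hasBuiltin(DOC));
    }
}
```

The parser supplies the declared default for TYPE, while BUILTIN, being
#IMPLIED, is simply absent unless the author provides it.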

At present I would like to keep a simple DTD but most of the content models
will be 'ANY' and most of the attribute values will be CDATA. It would be
nice to have attribute values which could take a list of values *and* CDATA
:-) - like:
<!ATTLIST VAR TYPE (INTEGER|FLOAT|STRING|#ANY)>
which would inform the software that it should cater for three specific
values, but that the user can add FOO if they really want.
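
On the software side this might amount to no more than a switch with a
fall-through case. A minimal sketch, in which the class name OpenType and
the choice of Java types for each value are my own assumptions rather than
anything CML defines:

```java
public class OpenType {

    // True for the three values the software caters for specifically.
    static boolean isKnown(String type) {
        return type.equals("INTEGER") || type.equals("FLOAT") || type.equals("STRING");
    }

    // Converts the element text according to TYPE.
    // Unknown, user-supplied types (e.g. FOO) are kept as raw text
    // instead of being rejected.
    static Object convert(String type, String text) {
        switch (type) {
            case "INTEGER":
                return Integer.valueOf(text.trim());
            case "FLOAT":
                return Double.valueOf(text.trim());
            case "STRING":
                return text;
            default:
                return text; // user-defined type: pass through untouched
        }
    }

    public static void main(String[] args) {
        System.out.println(convert("INTEGER", " 42 "));
        System.out.println(convert("FLOAT", "1.5"));
        System.out.println(convert("FOO", "whatever the user wants"));
    }
}
```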

Any sympathisers out there :-)?

	P.

Peter Murray-Rust, Director Virtual School of Molecular Sciences, domestic
net connection
VSMS http://www.nottingham.ac.uk/vsms, Virtual Hyperglossary
http://www.venus.co.uk/vhg

xml-dev: A list for W3C XML Developers. To post, mailto:xml-dev at ic.ac.uk
Archived as: http://www.lists.ic.ac.uk/hypermail/xml-dev/