Why validate? (was Re: Parser compliance)

Robin Cover robin at isogen.com
Thu Nov 18 17:22:42 GMT 1999


While agreeing that SGML/XML DTD syntax is of limited value
in supporting "validation," I think David Megginson and
Tim Bray have somewhat short-changed the notion of 
declarative, constraint-based (ontologic/relational) semantics
and corresponding (generalized, middle-tier) machine
processing of constraint-based semantics within the "markup"
framework.  I'm speaking of how things could be, not how
they were (ISO 8879:1986) and are (XML 1.0) given
the expressiveness of DTDs.

It's true that SGML/XML DTDs provide almost *no* support
for expression and validation of semantic integrity constraints.
I've written on this because misunderstandings abound (see, for
example, http://www.oasis-open.org/cover/xmlAndSemantics.html).

It's probably also true that some SGML practitioners placed
too much weight on DTD design, when other analysis and design
methodologies would have been more appropriate to the
enterprise problem, as Megginson says:

> Unfortunately, SGML consultancies who knew mainly just
> DTDs and FOSIs were substituting DTD design for data
> analysis, domain modelling, system design, user interface
> design, and lots of other things for which DTDs are 
> woefully inadequate.

I would add to this "requirements engineering" and a number
of other disciplines followed in modern object-oriented
analysis and design frameworks.  The most critical aspect
in any analysis/design/modelling effort is getting the
semantics "right."  DTDs just don't help much in this
connection; if anything, they are a distraction and
frequently a detriment; they impose serious and
"unnatural" constraints upon a data modelling process
(the world is *not* hierarchical or serialized; attributes
*do* have complex values; relationships *do* admit of
semantic constraint properties; etc.)   Personally, I think
SGML's flat-out aversion to semantics was a mistake,
though it represents an understandable position in the
world of "paper print publishing."

On the other hand, I find the current XML Schema Definition
Language endeavor very well motivated with respect to the
goal of "defining datatypes" (an "extensible datatype system")
for use in the markup context.  I am particularly optimistic
about the goal of supporting "user-defined datatypes"
(generated datatypes built upon axiomatic ones, via 
construction rules).  A schema definition language having
access to these datatype declaration facilities, together
with a general schema-validation processor, should provide
the basis for some semantic processing -- which has always
been anathema to the "SGML" view of the world.
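
As a rough illustration of "generated datatypes built upon
axiomatic ones, via construction rules," here is a minimal
Python sketch.  It is not XML Schema syntax and not taken from
the W3C drafts: the names (integer, string, restrict,
value_range, pattern, month, part_number) are all hypothetical,
and a datatype is modelled simply as a predicate over the
lexical form of a value.

```python
import re
from typing import Callable

# Assumption: a "datatype" is modelled as a predicate on a string.
Datatype = Callable[[str], bool]


def integer(text: str) -> bool:
    """Axiomatic type: an optionally signed run of decimal digits."""
    return re.fullmatch(r"[+-]?[0-9]+", text) is not None


def string(text: str) -> bool:
    """Axiomatic type: any character data at all."""
    return True


def restrict(base: Datatype, facet: Datatype) -> Datatype:
    """Construction rule: derive a new type by narrowing a base type."""
    def derived(text: str) -> bool:
        # The facet is only consulted once the base type has accepted.
        return base(text) and facet(text)
    return derived


def value_range(lo: int, hi: int) -> Datatype:
    """Facet for integer-based types: lo <= value <= hi."""
    return lambda text: lo <= int(text) <= hi


def pattern(regex: str) -> Datatype:
    """Facet: the whole lexical form must match the regular expression."""
    compiled = re.compile(regex)
    return lambda text: compiled.fullmatch(text) is not None


# User-defined (generated) datatypes, built from the axiomatic ones.
month = restrict(integer, value_range(1, 12))
part_number = restrict(string, pattern(r"[A-Z]{2}-[0-9]{4}"))
```

With these definitions month("7") is accepted while month("13")
and month("July") are rejected; a general schema-validation
processor would apply such predicates to element and attribute
values.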

In this connection I am happy to see the words "meaning"
and "relationships" in the W3C draft spec:

  The purpose of an XML Schema: Structures schema
  is to define and describe a class of XML documents by 
  using these constructs to constrain and document the
  MEANING, usage and RELATIONSHIPS of their constituent
  parts: datatypes, elements and their content, 
  attributes and their values, entities and their contents
  and notations.

We should be able to say formally what we "mean" by an
<ISBN> (modulo the checksum, perhaps) and a <date>. At
that point, a schema processor will set off alarms if
we say "<date>smelly socks</date>".  This is analogous to
what we can already say at the meta-level: "What do you
'mean' by an XML document?" "I 'mean' something
that 'matches the production [1] document'."  Machines
can understand "meaning" at this level, without
the prose documentation (useful to humans).
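
To make the <date> and <ISBN> case concrete, here is a small
Python sketch of the kind of check such a processor might
perform.  It is only an illustration under stated assumptions:
a <date> is taken to be an ISO 8601 calendar date, an <ISBN> is
taken to be a ten-digit ISBN with its standard mod-11 checksum,
and the helper names (is_date, is_isbn10, CHECKS, violations)
are hypothetical, not part of any schema specification.

```python
import re
import xml.etree.ElementTree as ET
from datetime import date


def is_date(text: str) -> bool:
    """Assumption: a <date> holds an ISO 8601 date such as 1999-11-18."""
    try:
        date.fromisoformat(text.strip())
        return True
    except ValueError:
        return False


def is_isbn10(text: str) -> bool:
    """Check a ten-digit ISBN, including its mod-11 checksum."""
    digits = text.replace("-", "").replace(" ", "").upper()
    if re.fullmatch(r"[0-9]{9}[0-9X]", digits) is None:
        return False
    # Weights run 10, 9, ..., 1; a final 'X' stands for the value 10.
    total = sum((10 - i) * (10 if ch == "X" else int(ch))
                for i, ch in enumerate(digits))
    return total % 11 == 0


# Hypothetical mapping from element names to what they "mean".
CHECKS = {"date": is_date, "ISBN": is_isbn10}


def violations(xml_text: str) -> list:
    """Return (element name, content) pairs that fail their check."""
    problems = []
    for elem in ET.fromstring(xml_text).iter():
        check = CHECKS.get(elem.tag)
        if check is not None and not check(elem.text or ""):
            problems.append((elem.tag, elem.text))
    return problems


doc = "<doc><date>smelly socks</date><ISBN>0-306-40615-2</ISBN></doc>"
print(violations(doc))
```

Here the <ISBN> passes its checksum and only the <date> is
reported.  DTD syntax has no way to state either constraint: to
a DTD both elements are just character data.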

David also said:

> The DTD has two small but important roles in system
> implementation: as a partial set of structural validation
> rules and as a partial schema for guided authoring
> (it's nearly always supplemented in both cases).

I think DTDs (with all their weaknesses) do more than just
"two small" things.  XML Schemas, to the extent that they
do more than simply mirror DTD notions in instance syntax,
will do even more.   Some sample use cases are described in
the XML Schema requirements document (see
http://www.w3.org/TR/NOTE-xml-schema-req), for example
"Use schema to help [sensible] query formulation and
[query] optimization" -- a function which DTDs can already
support.

Finally:

> As Eliot, Tim, and I have all mentioned (in different
> contexts), the primary contract *has* to be the 
> human readable documentation, even though the DTD is 
> useful for detecting a certain subset of problems.

I don't see the benefit of driving a wedge between "primary"
and its opposite: it depends upon perspective.  EBNFs,
XSDs, DTDs (or any grammar) may be viewed as human readable
documentation, and in some contexts, they have primacy
(contractually) over the informal prose.  Both are normally
necessary.

Still, as Len might say: "formal semantics is hard,
syntax is easy, now get (back) to work."

For anyone whose mind is not already made up on the
matter of generalized processing of (e.g., 'business logic') 
semantics in the markup context, see:

  http://www.cs.utexas.edu/users/mfkb/related.html
  http://www.oasis-open.org/cover/brml.html
  http://www.oasis-open.org/cover/xol.html
  http://www.oasis-open.org/cover/xml.html#xml-ontology

- Robin Cover

