Validation

Marcelo Cantos marcelo at mds.rmit.edu.au
Mon Mar 22 02:12:06 GMT 1999


On Thu, Mar 18, 1999 at 01:26:42PM -0600, Paul Prescod wrote:
> Chris Lilley wrote:
> > 
> > Unfortunately I came across EBNF long before I came across DTD syntax,
> > so about half an hour after meeting DTDs I was, like, what do you mean
> > it can't express that this attribute is a url? Why can't it express that
> > this attribute is an ISO standard date?
> 
> I can guarantee you today that the XML schema effort will not allow you to
> express everything that EBNF will so if that's your standard it will fail.
> But even if we use EBNF as our standard: do you know of any programming
> languages expressed entirely in EBNF? Or even entirely in *any formalism*?
> 
> > Yes, validation is important - and I mean real validation, with no
> > critical-path human-readable comments in the DTD and multiple utilities
> > to check different aspects of validity (like separate scripts to ensure
> > that an attribute is a valid date or customer number).
> 
> It will never be the case that it will be possible to write schemas that
> are so tight that they remove the need for comments that describe
> additional constraints to other human beings. There will always be a need
> not only for multiple schema languages but also for the ultimately
> flexible schema language: prose text.

At the risk of sounding repetitious, an analogy may be of use here:

One mark of a good programming language is strong compile-time
checking (type safety, pre- and post-conditions and invariants are
typical measures of this).  Users of such languages often remark
that programs, once compiled, usually work the first time.  Of
course, no one would argue that this is a bad thing.  It would be
silly, however, to take this as evidence that such languages are bug
free.  There will never be a bug-free programming language, for the
simple reason that no language can guess the desired semantics if you
get them wrong.  No language can stop you
from trying to implement sort and ending up with reverse sort!  All
they can do is prevent you from ever exhibiting undefined behaviour
(post-conditions can make it painfully difficult to do something so
silly, but I doubt they could do so at compile time).
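
To make the point concrete, here is a minimal sketch of such a
run-time post-condition (in Python, with invented names; not any
particular checked language) catching the sort-vs-reverse-sort bug,
though only when the code actually runs:

```python
def sort_ascending(xs):
    result = sorted(xs, reverse=True)  # the bug: reverse sort by mistake
    # post-condition: the output must be in non-decreasing order
    assert all(a <= b for a, b in zip(result, result[1:])), \
        "post-condition violated: result is not ascending"
    return result

try:
    sort_ascending([3, 1, 2])
except AssertionError as err:
    print("caught:", err)  # caught at run time, not compile time
```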

Languages will never be capable of fully expressing what we want to
achieve in a declarative way.  Declarative languages such as SQL do
provide a very elegant and expressive mapping between intent and code.
But they typically address a very narrow and well-defined problem
domain.  No such silver bullet has been discovered for programming in
the large.

The purpose of this analogy is to illustrate what I believe to be the
same situation in the notion of a DTD.  A DTD defines the language
used to express a certain data domain.  This language provides some
constraints on what constitutes a legal piece of data.  However, just
as a programming language can never fully express the intent of the
user (i.e. it must always include procedural elements which implicitly
rather than explicitly embody the intent), so too can a schema
language never express the full set of constraints one might wish to
impose on a document.  It is easy to come up with trivial cases that
demonstrate this: imagine a document class in which the number of
paragraphs inside the nth section must be less than or equal to the
nth Fibonacci number; or another in which the content model of a
CONTENT element is defined in the PCDATA of the preceding MODEL
element; or how about one in which the maximum depth of the element
hierarchy is defined by the ASCII value of the 100th character of the
file stored at a given URL!
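
Pathological as it is, the Fibonacci constraint takes only a few
lines to check in a general-purpose language (a Python sketch,
assuming invented SECTION and PARA element names):

```python
import xml.etree.ElementTree as ET

def fib(n):
    # 1, 1, 2, 3, 5, ... (1-indexed)
    a, b = 1, 1
    for _ in range(n - 1):
        a, b = b, a + b
    return a

def check_fibonacci_constraint(xml_text):
    # the nth <section> may contain at most fib(n) <para> children
    root = ET.fromstring(xml_text)
    return all(
        len(section.findall("para")) <= fib(n)
        for n, section in enumerate(root.findall("section"), start=1)
    )

ok  = "<doc><section><para/></section><section><para/></section></doc>"
bad = "<doc><section><para/></section><section><para/><para/></section></doc>"
print(check_fibonacci_constraint(ok))   # True
print(check_fibonacci_constraint(bad))  # False: 2 paras > fib(2) = 1
```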

Yes, these are pathological examples, but the point of illustrating
with extremes is to make plain that no matter how sophisticated
your schema language, it will never be able to handle all
contingencies.  There will always be someone, somewhere, for whom the
available schemas simply don't express the semantics they need.  If
you seriously want to address the widest possible set of schema
language requirements, EBNF doesn't come close; you would need Turing
completeness just as the starting point.

For those who would prefer to see a more concrete case-in-point, I
adduce the very example Chris Lilley provided, that of validating a
customer number.  What if a valid customer number is a composite of
the customer's priority level, her date of birth, and a sequence
number?  How would you express that the 3rd to 10th digits constitute
a date in the format yyyymmdd?  This would require a language with
built-in functions for type conversion, substring extraction and date
composition.  What if people born after 1990 couldn't have a priority
level higher than 25?  This would require branching constructs.  It is
quite common for things like customer codes, site codes, product
codes, etc. to have a composite structure, particularly in legacy
systems.  Often the parts are mutually dependent in non-trivial ways.
Sometimes validity can only be checked by consulting an external,
volatile source of information, such as a database (e.g. "Customer
code is invalid if the first two digits do not appear in the CODE
field of a record in the PRIORITY table...").
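
A hypothetical checker for such a composite customer number might
look like this in Python (the exact layout, two priority digits
followed by eight date digits and a sequence number, is invented for
illustration):

```python
import datetime

def customer_number_valid(code):
    # invented layout: digits 1-2 = priority level,
    # digits 3-10 = date of birth as yyyymmdd, remainder = sequence
    if len(code) < 11 or not code.isdigit():
        return False
    priority = int(code[:2])
    try:
        dob = datetime.datetime.strptime(code[2:10], "%Y%m%d").date()
    except ValueError:
        return False  # digits 3-10 do not form a real calendar date
    # the branching rule: born after 1990 means priority capped at 25
    if dob.year > 1990 and priority > 25:
        return False
    return True

print(customer_number_valid("1219850315042"))  # True
print(customer_number_valid("3019911201042"))  # priority 30, born 1991: False
```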

This is not, despite appearances, a counsel of despair; rather, it
points out that completeness is not necessarily a desirable
goal.  If you wish to insist on developing a schema language that
handles all the validation requirements of any conceivable data
domain, you are aiming not only for an impossible goal, but for one
which, if it could be attained, would be so complex, so arbitrary and
so unwieldy that no-one would want to use it.  In real life, multiple
subsystems are brought into play when validating and transforming
data.  No single component can know everything about a piece of data,
and certainly not enough to definitively ascertain validity in all its
semantic richness.  In fact, the full set of constraints may not even
exist in one environment, but may be distributed across multiple
independent subsystems, or even across multiple hosts; for instance, a
timesheet workflow system may operate as follows:

  1. Employee fills in XML timesheet and submits to accounts server.

  2. The accounts server validates the project number field against
     projects the employee is permitted to book against, and passes
     the document on to the personnel server.

  3. The personnel server knows that the employee doesn't work on
     weekends and Tuesdays and checks for this.  It then dispatches
     the document to the repository.

  4. The repository performs basic DTD validation, which the other two
     servers probably did anyway.
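
The staged validation above can be sketched as a chain of independent
checks, each knowing only its own rules (Python; all employee and
project data invented for illustration):

```python
PERMITTED = {"E42": {"P01", "P07"}}    # accounts: bookable projects
NON_WORK_DAYS = {"Sat", "Sun", "Tue"}  # personnel: employee's days off

def repository_check(doc):             # the "basic DTD validation" stage
    return {"employee", "project", "day"} <= doc.keys()

def accounts_check(doc):
    return doc["project"] in PERMITTED.get(doc["employee"], set())

def personnel_check(doc):
    return doc["day"] not in NON_WORK_DAYS

def validate(doc):
    # no single stage claims completeness; each applies what it knows
    stages = (repository_check, accounts_check, personnel_check)
    return all(stage(doc) for stage in stages)

print(validate({"employee": "E42", "project": "P07", "day": "Mon"}))  # True
print(validate({"employee": "E42", "project": "P07", "day": "Tue"}))  # False
```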

The DTD (or XML Schemas) can provide a basic level of validation, but
there will always be more to do.  And this is not a problem if you
accept that there may be multiple stages to the process, probably
involving multiple languages and environments.

The complaint might be raised that the examples I have given mostly
involve table lookup and therefore belong more properly in the domain
of referential integrity maintenance, but this is not necessarily so.
For instance the accounts server may know that the project number and
employee number must begin with the same two digits (due to the
organisational structure) unless the project number begins with 99
(which represents admin codes).  Furthermore, in the case of a complex
customer code, validation may involve table lookup, but it is not with
a view to ensuring that the customer code refers to an existing
record, and hence is not a referential integrity constraint (the
constraint could even be revised to: "Customer code is invalid if the
first two digits do not appear in the CODE field ... _and_ the date of
birth is after 1990," in which case the record could be valid even
though the lookup did not find any matching records).
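
The accounts-server rule just described is trivial to state
procedurally yet beyond any declarative schema; a hypothetical Python
sketch (numbers invented):

```python
def project_booking_ok(project, employee):
    # invented rule: project and employee numbers share their first
    # two digits (organisational prefix), unless the project is an
    # admin code beginning with 99
    if project.startswith("99"):
        return True
    return project[:2] == employee[:2]

print(project_booking_ok("4711", "4788"))  # True: same org prefix 47
print(project_booking_ok("9903", "4788"))  # True: admin code
print(project_booking_ok("5201", "4788"))  # False: prefix mismatch
```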

Quite apart from the problem of intractability, there is the equally
important issue of parsimony.  For many purposes, a fully expressive
language is more than one needs.  Consequently, the user is forced
to learn a complex environment to perform a simple task.  This is why
a language like CSS is in no danger of being superseded by XSL.  It
doesn't express everything XSL (or DSSSL) can, but it is simple.  An
average hack Web master can come to grips with CSS in a matter of
minutes, and can be using it to good effect within half an hour.  Not
to mention the fact that CSS is just plain easier to read (I hear much
debate about whether it is appropriate for humans to edit XML
directly, but I haven't heard anyone suggest that XSL should be
machine generated; I wonder about this from time to time).  For that
matter, XSL and DSSSL can't express every conceivable typography
requirement either.

Another concrete example comes to mind in the domain of configuration
files.  I have played around with moving our configuration file format
(which is a little ugly at present) to XML.  I was horrified at the
result and am now looking far more seriously at something like .INI
files.  It may not be intrinsically hierarchical (and hence is less
expressive), but it is much simpler, and much easier for a human to
read and manipulate.
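
For what it is worth, such a flat format is also trivial to process
with off-the-shelf tools; this Python sketch uses the standard
configparser module on an invented [server] section:

```python
import configparser

INI_TEXT = """
[server]
host = localhost
port = 8080
debug = true
"""

cfg = configparser.ConfigParser()
cfg.read_string(INI_TEXT)
print(cfg["server"]["host"])              # localhost
print(cfg.getint("server", "port"))       # 8080
print(cfg.getboolean("server", "debug"))  # True
```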

Likewise, DTDs and XML Schemas will offer differing levels of
constraint specification, but neither of them (nor any future
language) can express every kind of validation rule that people will
want to express.  Life is simply too complex for that to be possible
(more specifically, real life is arbitrarily complex, and hence so are
the systems that try to model it).

> Luckily, eliminating all other schema languages is not a goal of the W3C
> schema language effort. 
> 
> > So what is critically needed is a real, namespace-aware, schema 
> > language that can be used to do real validation.
> 
> I hear a lot of users saying that. They don't seem to realize that there
> is no such thing as "real validation" there is only "the validation I need
> to do today." Ten years from now, we'll be griping that XMLSchemas don't
> do "real validation" for some other arbitrarily advanced definition of
> "real."

I heartily concur.  There is no silver bullet, so it is a waste of
time looking for one.  The focus should be on developing standards
that solve today's problems today, with an eye to leaving room for
future wisdom without being prescriptive.

Of course, none of the above discourse will eliminate the need for
discussion on what, exactly, is needed and how that need is to be
satisfied.  As one colleague astutely pointed out to me, I am really
transforming the issue from "real validation" to "sufficient
validation".  It would be a mistake, however, to conclude that this is
a trivial transformation in the statement of the problem.  It diverts
the emphasis of the search markedly away from completeness and towards
practicality and usability (of course, completeness remains
desirable; it merely ceases to be a central goal).


Cheers,
Marcelo

-- 
http://www.simdb.com/~marcelo/
