<html>
<title>Using XSL for Structural Validation
</title>
<body>
<h1>Using XSL as a Validation Language
</h1>
<p><a href="mailto:ricko@gate.sinica.edu.tw">Rick Jelliffe</a><br />
Academia Sinica<br />
Taipei, Taiwan</p>
<p>1999-01-24</p>
<h2>Abstract</h2>
<p>XSL can be used as a validation language.
An XSL stylesheet can be used as a validation specification.
Because XSL uses a tree-pattern-matching approach,
it validates documents against fundamentally
different criteria than the content model.
This paper gives some examples.
</p>
<p>
XSL can be used on structured documents which
do not use markup declarations. And XSL
used in concert with XML markup declarations
seems a very nice and straightforward approach:
two small languages, each good at different
things.
</p>
<p>What is missing? The current XSL does not have some features
which would be desirable (how to report the current
line and entity, in particular) for a user-friendly
system. Regular expression pattern matching on
strings would be very useful.
(The main thing missing from this note is a
definite way to create the message "<i>This file is
valid</i>"; validity is shown by an empty list of
validity errors.)
</p>
<h2>Definitions</h2>
<p>A <em>validator</em> is software which examines
a structured document (e.g., an XML or HTML document,
a WebCGM document) and reports on the conformance
of that document's structures against some patterns.
</p>
<p>A <em>validation specification</em> is a set of such
patterns expressed in some formal way, in particular
for use by a program.
In object-oriented software engineering terms (see B. Meyer),
a validation specification gives the pre- and post-conditions
we want to assert about a structured document's structures;
it is useful to make such assertions, because doing so
clarifies a programmer's tasks and the capabilities
and nature of the data. It can also have a valuable role
in contractual conformance.
In markup terms (see TEI),
a validation specification (such as a DTD) gives a theory about a
document's structure.
</p>
<p>A validator can be specified with a general purpose
language, or a specific <em>validation language</em>.
A validation language therefore embodies a theory about
which kinds of patterns are common, useful, important,
interesting, expected by users,
easy to implement, or which cannot be
validated readily by other validators or validation
languages.
Theories about which patterns are common, useful, etc.
are in turn judgements based on particular technologies
and usage domains.
</p>
<p>Just as with programming languages, the syntax
and operation of a validation language are controversial.
So a validation language also embodies a theory about
which syntactic and paradigmatic features are
common, useful, important,
interesting, expected by users,
easy to implement, or which are not available
in other validation languages.
</p>
<p>A <em>schema</em> is a collection of
rules about a document's structures. A schema definition
language is not a validation language, but may contain
a validation language. A schema definition
language may also allow any of the following:
<ul>
<li>information about data storage, encoding, transmission and
notation;
</li>
<li>human readable documentation;
</li>
<li>information to allow the automatic construction of
input front-ends;
</li>
<li>information about the meaning of elements, and various
linkages to other schemas.
</li>
</ul>
</p>
<p>An important distinction between a schema language and
a validation language is that a schema language will specify,
for example,
"<i>this element is a date</i>", while a validation language
will concentrate on more lexical/structural issues:
"<i>this element should conform to the regular expression</i>
<tt>/nnnn-nn-nn/</tt>".
</p>
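<p>As a rough illustration of the lexical side of this distinction,
the following Python sketch checks a string against a pattern
equivalent to <tt>/nnnn-nn-nn/</tt>. The name <tt>matches_date_shape</tt>
and the sample values are hypothetical. The check deliberately says
nothing about whether the string is a real calendar date, which would
be the schema-level question.</p>

```python
import re

# Lexical pattern only: four digits, hyphen, two digits, hyphen, two digits.
DATE_SHAPE = re.compile(r"\d{4}-\d{2}-\d{2}")


def matches_date_shape(text):
    """Return True if text has the lexical form nnnn-nn-nn."""
    return DATE_SHAPE.fullmatch(text.strip()) is not None


if __name__ == "__main__":
    # The last value has the right shape but is not a real date;
    # a purely lexical check accepts it.
    for value in ["1999-01-24", "24 January 1999", "1999-13-45"]:
        print(value, matches_date_shape(value))
```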
<p>Examples of validation languages are:
<ul>
<li>W3C <em>XML markup declarations</em>;</li>
<li>ISO <em>SGML markup declarations</em>, which are a superset of XML markup
declarations;</li>
<li>ISO Architectural Forms, which allow a document to be
validated against multiple parallel content models, keyed not
only against element type names, but also against attribute values;</li>
<li>ISO Lexical Type Definitions, which allow element or attribute
values to be validated against a POSIX regular expression;</li>
<li>DDML (formerly XSchema), a subset of the XML markup declarations
expressed using XML instance syntax.</li>
</ul>
</p>
<h2>Limitations of Markup Declarations</h2>
<p>The XML markup declarations
(in particular, the content models)
have many desirable properties
as a validation language:
<ul>
<li>terse;
</li>
<li>declarative;
</li>
<li>simple, and modest in its aims;</li>
<li>fragment-friendly, since the interpretation of content models
does not depend on the document context;
</li>
<li>familiar, since their operation is familiar to people
exposed to BNF or formal grammars;
</li>
<li>standard, through the ISO heritage;</li>
<li>widely implemented;</li>
<li>understood--the nature and deficiencies of
content models have been well explored for more
than a decade on many projects.</li>
</ul>
</p>
<p>However, there are situations which the markup
declarations do not address, and some other system
would be useful:
<ul>
<li>the markup declarations are not available
as structured documents in their own right
(in the absence of nodes in DOM to do this);
</li>
<li>this in turn prevents hypertext linking,
structured annotations, and extending the
validation language to become a full schema
definition language;
</li>
<li>various kinds of partial validation,
where only targeted structures are checked;
</li>
<li>extended validation, where more than the
immediate context is checked--for example
to check that
<ul><li>
if a certain attribute is specified with
a particular value, some other attribute
has also been specified; or</li>
<li>that a certain element type is not
used if its parent's parent is some
other element type (e.g., to exclude
an RDF:RDF element from any subelement of
an RDF:RDF element).
</li>
</ul>
</li>
</ul>
</p>
<p>The XML markup declarations are part of XML.
In my view, there is scope for the development of
a validation language which complements XML
markup declarations rather than reinventing them.
(No disrespect, criticism or lack of enthusiasm for any
schema definition language or validation language
is intended by this comment.)
</p>
<h2>XSL Match Patterns</h2>
<p>Such a language already exists: XSL.
XSL match-patterns represent a very different view of
a document's structure than XML content models.
XSL match-patterns therefore can be used to complement
and enhance XML content models, as well as any
other content-model-based validation language.
</p>
<p>Doing this enables us to see validation as
merely another kind of document transformation.
In this case,
the input document is transformed into a document
which marks up structures in the original which are
not valid.
</p>
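<p>The idea of validation as transformation can be sketched outside
XSL as well. The following Python fragment is a minimal sketch, not
the method used in this note: it turns an input document into a small
report document listing the elements judged invalid. The function
<tt>validate_as_transform</tt> and the caller-supplied predicate
<tt>is_invalid</tt> are assumptions made for illustration; the
predicate stands in for a set of match patterns.</p>

```python
import xml.etree.ElementTree as ET


def validate_as_transform(source_xml, is_invalid):
    """Transform a document into a UL element listing its invalid elements.

    is_invalid is a hypothetical caller-supplied predicate on an element.
    An empty UL means no validity errors were found.
    """
    root = ET.fromstring(source_xml)
    report = ET.Element("UL")
    for elem in root.iter():  # document order, root included
        if is_invalid(elem):
            item = ET.SubElement(report, "LI")
            item.text = "Invalid element: " + elem.tag
    return report


if __name__ == "__main__":
    doc = "<doc><p>ok</p><BLINK>no</BLINK></doc>"
    result = validate_as_transform(doc, lambda e: e.tag == "BLINK")
    print(ET.tostring(result, encoding="unicode"))
```

<p>As in the rest of this note, validity shows up only as an empty
list of errors.</p>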
<p>(Note, a kind of validation can also be provided
by treating validation as a kind of formatting:
for example, a CSS stylesheet could be provided
which highlights in red any element which
is not valid. The CSS pattern-matching rules
may be complex enough to create a useful validator
based on this idea in some circumstances.)
</p>
<p>This use of a transformation language for validation
is hardly novel. Indeed, one reason why SGML
systems constructed on top of transformation languages
(e.g. OmniMark, Perl) have a good rate of success
is that system developers can (and do) build extended
validation systems readily. Such validators help the
programmers discover structural
patterns: useful or pathological.
They can also allow looser
and simpler content models in the markup declarations,
resulting in better layering of validation.
</p>
<p>The advantages of using XSL as a validation language
are:
<ul>
<li>terse--the match patterns are very terse, like
XML content models;
</li>
<li>declarative;
</li>
<li>simple, and modest in its aims;</li>
<li>fragment-friendly, since the interpretation of content models
does not depend on the document context;
</li>
<li>familiar, since their operation
will be familiar to people using XSL for
transformation or formatting purposes;
</li>
<li>widely implemented--James Clark and IBM already have
XSL tools available;</li>
<li>understood--the nature and deficiencies of
tree-based patterns have been well explored for more
than a decade on many projects in languages such as
OmniMark.</li>
</ul>
</p>
<h2>Template for the Validator</h2>
<p>Following is a stub which can be used to
construct a validator.
</p>
<pre>
<?xml version="1.0"?>
<!-- Template for XSL Validator -->
<xsl:stylesheet
xmlns:xsl="http://www.w3.org/TR/WD-xsl"
xmlns="http://www.w3.org/TR/REC-html40"
result-ns=""
xmlns:rdf="http://w3.org/TR/1999/PR-rdf-syntax-19990105#"
><font color="red"><!-- add any other namespace declarations above --></font>
<!-- Root template - start processing here -->
<xsl:template match="/">
<HTML>
<HEAD>
<META http-equiv="Content-Type" content="text/html; charset=iso-8859-1"/>
<META http-equiv="Expires" content="0"/>
<TITLE>Results of Validation (using XSL)</TITLE>
</HEAD>
<BODY>
<H1>Results of Validation (using XSL)</H1>
<UL>
<xsl:apply-templates/>
</UL>
</BODY>
</HTML>
</xsl:template>
<xsl:macro name="element_warning_message" >
The invalid element is found at tree location <xsl:number level="multi" count="*" format="1." />
<xsl:if test='.[@ID]'>
The element's ID is <xsl:value-of select="@ID" />.
</xsl:if>
<xsl:if test='..[@ID]'> The element's parent's ID is <xsl:value-of select="../@ID" />.
</xsl:if>
</xsl:macro >
<xsl:macro name="attribute_warning_message" >
The element with the invalid attribute is found at tree location <xsl:number level="multi" count="*" format="1." />
<xsl:if test='.[@ID]'>
The element's ID is <xsl:value-of select="@ID" />.
</xsl:if>
<xsl:if test='..[@ID]'> The element's parent's ID is <xsl:value-of select="../@ID" />.
</xsl:if>
</xsl:macro >
<font color="red"> <!-- Good patterns. Put your instructions here. --></font>
<font color="red"> <!-- Bad patterns. Put your instructions here. --></font>
<!-- Do not change after here. This handles defaulting. -->
<xsl:template match="text()" priority="-1">
<!-- strip characters -->
</xsl:template>
</xsl:stylesheet>
</pre>
<p>Accept good patterns using the following template:
<pre>
<xsl:template match="<font color="red">pattern</font>" priority="2" >
<xsl:apply-templates/>
</xsl:template>
</pre>
<p>Validate against bad patterns using the following template:
<pre>
<xsl:template match="<font color="red">pattern</font>">
<LI>
<font color="red"><!--put message here--></font>
<xsl:invoke macro="<font color="red">node</font>_warning_message" />
</LI>
<xsl:apply-templates/>
</xsl:template>
</pre>
<p>You can use these in two ways.</p>
<p>The positive way is to make "good patterns"
which cover every context in which your element type (if that is what you
are validating) is allowed to appear. Then you put a simple case which
catches simple occurrences of the element as the "bad pattern".
</p>
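<p>The positive way, and the role of the priority attribute, can be
sketched in Python. This is an illustration under stated assumptions
rather than a description of how any XSL processor works: each rule is
a (priority, predicate, message) tuple, a message of <tt>None</tt>
marks a "good pattern", and the highest-priority matching rule wins.
The <tt>DEMO</tt> element and the names <tt>check</tt>,
<tt>in_demo</tt> and <tt>RULES</tt> are hypothetical.</p>

```python
import xml.etree.ElementTree as ET


def check(source_xml, rules):
    """Apply the highest-priority matching rule to each element.

    rules is a list of (priority, predicate, message) tuples. A message
    of None marks a "good pattern", which accepts the element silently.
    Each predicate receives the element and a child-to-parent map.
    """
    root = ET.fromstring(source_xml)
    parents = {child: parent for parent in root.iter() for child in parent}
    errors = []
    for elem in root.iter():
        matching = [rule for rule in rules if rule[1](elem, parents)]
        if not matching:
            continue
        _, _, message = max(matching, key=lambda rule: rule[0])
        if message is not None:
            errors.append(message)
    return errors


def in_demo(elem, parents):
    """Good pattern: a BLINK whose parent is a (hypothetical) DEMO element."""
    parent = parents.get(elem)
    return elem.tag == "BLINK" and parent is not None and parent.tag == "DEMO"


RULES = [
    (2, in_demo, None),  # good pattern: the one allowed context
    (0, lambda e, p: e.tag == "BLINK", "BLINK used outside DEMO"),  # catch-all
]

if __name__ == "__main__":
    doc = "<doc><DEMO><BLINK>x</BLINK></DEMO><p><BLINK>y</BLINK></p></doc>"
    print(check(doc, RULES))
```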
<p>The negative way is to make "bad patterns" which find element
types in contexts you specifically want to deem invalid. The
"good patterns" can contain any exceptions to this. You can use the
"good patterns" to create a stop list of specific cases which break
a more general rule about "bad patterns". Use the priority
attribute to show that the "good patterns" should be tested before the
"bad patterns".
</p>
<h2>Examples</h2>
<p>These examples were developed with the LotusXSL beta.
There may be slightly different syntaxes required for the
other XSL betas (i.e., James Clark's and Microsoft's).
The examples each validate something which an
XML markup declaration cannot directly specify.</p>
<h3>1: Unwanted Element</h3>
<p>This example imposes additional requirements compared
to the HTML DTD. It acts a little like an SGML global
exclusion, in that the content model of the markup declarations
may allow the blink element,
but this validation layer exposes the invalidity.
</p>
<pre>
<font color="red"><!-- Put this in the "bad patterns" section in the template --></font>
<xsl:template match="BLINK">
<LI>
Element "BLINK" has been used. This is against our house style.
<xsl:invoke macro="element_warning_message" />
</LI>
<xsl:apply-templates/>
</xsl:template>
</pre>
<p>If a BLINK is found, a warning is generated. The location in the tree
is given. The ID attribute of the element (if any exists) is given.
</p>
<h3>2: Element Context</h3>
<p>This example checks that an rdf:RDF element
never appears as a descendant of another rdf:RDF
element.
</p>
<pre>
<font color="red"><!-- Put this in the "bad patterns" section in the template --></font>
<xsl:template match="rdf:RDF[ancestor(rdf:RDF)]">
<LI>
The element "rdf:RDF" has been found inside another element "rdf:RDF".
<xsl:invoke macro="element_warning_message" />
</LI>
<xsl:apply-templates/>
</xsl:template>
</pre>
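<p>For comparison, the same ancestor test can be sketched in Python
with the standard library. This is an illustrative equivalent only.
The namespace URI is copied from the template in this note; the
function name <tt>nested_rdf</tt> is hypothetical.</p>

```python
import xml.etree.ElementTree as ET

# Namespace URI as declared in the validator template in this note.
RDF_NS = "http://w3.org/TR/1999/PR-rdf-syntax-19990105#"
RDF_TAG = "{" + RDF_NS + "}RDF"


def nested_rdf(source_xml):
    """Return the rdf:RDF elements that have an rdf:RDF ancestor."""
    root = ET.fromstring(source_xml)
    found = []

    def walk(elem, inside_rdf):
        # inside_rdf is True when some ancestor of elem is rdf:RDF.
        if elem.tag == RDF_TAG and inside_rdf:
            found.append(elem)
        for child in elem:
            walk(child, inside_rdf or elem.tag == RDF_TAG)

    walk(root, False)
    return found


if __name__ == "__main__":
    doc = (
        '<doc xmlns:rdf="' + RDF_NS + '">'
        "<rdf:RDF><x><rdf:RDF/></x></rdf:RDF>"
        "<rdf:RDF/>"
        "</doc>"
    )
    print(len(nested_rdf(doc)))
```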
<h3>3: Attribute Context</h3>
<p>This example checks that the
"other-unit" attribute is specified,
with a non-empty value, whenever the value of the "unit"
attribute is "other".
</p>
<pre>
<font color="red"><!-- Put this in the "Bad patterns" section of your template --></font>
<xsl:template match='fig[(@unit="other") and (@other-unit="")]' priority="2" >
<LI>
The element "fig" has attribute "unit" specified as "other".
But the attribute "other-unit" has a zero length.
<xsl:invoke macro="attribute_warning_message" />
</LI>
<xsl:apply-templates />
</xsl:template>
<xsl:template match='fig[(@unit="other") and (not(@other-unit))]'>
<LI>
The element "fig" has attribute "unit" specified as "other".
But the attribute "other-unit" has not been specified.
<xsl:invoke macro="attribute_warning_message" />
</LI>
<xsl:apply-templates/>
</xsl:template>
</pre>
<p>Checking attributes requires answering two questions.
First, has the attribute been specified in the document?
Second, even if it is specified, does it have a zero-length
value?
</p>
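<p>The two questions can be kept apart in a Python sketch as well.
This is an illustration, not part of the XSL validator: it relies on
<tt>Element.get</tt> returning <tt>None</tt> for an absent attribute
and an empty string for a zero-length one. The function name
<tt>check_fig_units</tt> is hypothetical.</p>

```python
import xml.etree.ElementTree as ET


def check_fig_units(source_xml):
    """Report fig elements with unit="other" but no usable other-unit."""
    root = ET.fromstring(source_xml)
    errors = []
    for fig in root.iter("fig"):
        if fig.get("unit") != "other":
            continue
        other = fig.get("other-unit")  # None when the attribute is absent
        if other is None:
            errors.append("other-unit has not been specified")
        elif other == "":
            errors.append("other-unit has a zero length")
    return errors


if __name__ == "__main__":
    doc = (
        "<doc>"
        '<fig unit="other"/>'
        '<fig unit="other" other-unit=""/>'
        '<fig unit="other" other-unit="furlong"/>'
        '<fig unit="cm"/>'
        "</doc>"
    )
    for message in check_fig_units(doc):
        print(message)
```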
<hr />
<p>Copyright (C) 1999 Rick Jelliffe.
Please feel free to publish this in any way you like,
but try to update it to the most recent version,
and keep my name on it.
</p>
</body>
</html>