Content models considered bad...errr sometimes (was: Re: SGML the next big thing?)

Sat Dec 4 09:09:36 GMT 1999

From: Liam R. E. Quin <liamquin at interlog.com>

>    This & thing is so far outside the way most other computer
>    languages work that standard off-the-shelf parser generators roll
>    on their backs and wave their paws in the air and admit defeat.

I am interested in whether you think this also reveals anything about
the persistent claims that SGML is bad because it doesn't conform to
the expectations of computer science (as influenced by an early
generation of tools such as YACC).  I would tend towards the
view that uncritical acceptance of academic paradigms has held
SGML/XML development up.  In the case of XML (and SGML,
which is really a compiler compiler, though with a different
target to YACC and Lex) I think the view that a schema should
be viewed as a language definition is holding things back
(which is *not* to say that there is no benefit in being able
to implement a schema as a language, or that there is no
benefit in being able to reason about a schema using
formal language theory).

No-one says "Windows, Icons, Menus
and Popups are not easy to implement in YACC, so we should not
have them": in fact, in the 90s, the trend for specifying GUIs has
been solidly away from formal grammatical descriptions of the
total interface language, even if just for flexibility.

>(3) The & connector interacts with #PCDATA to form pernicious content
>    models (see below).  The XML WG went to great lengths to make sure
>    that no valid XML document suffers from this SGML bogosity.
>    Similar lengths are needed for "&".

Paul Prescod had an excellent idea a while back for adding a #WS
particle that explicitly modelled whitespace. That would get rid of
most problems, but I presume an ambiguity would still be possible with
    (#PCDATA | #WS )
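
To make the presumed ambiguity concrete, here is a rough Python sketch.
#WS was only ever a proposal, so its semantics here are my assumption
(it matches a non-empty run of XML whitespace and nothing else), and
matching_particles is a made-up helper, not part of any parser.

```python
XML_WHITESPACE = " \t\r\n"

def matching_particles(text):
    """List the particles of (#PCDATA | #WS) that could match a text run."""
    particles = ["#PCDATA"]            # #PCDATA accepts any character data
    # Assumption: #WS accepts only a non-empty run of XML whitespace.
    if text != "" and text.strip(XML_WHITESPACE) == "":
        particles.append("#WS")
    return particles

print(matching_particles("muddy"))     # ['#PCDATA']
print(matching_particles("\n  "))      # ['#PCDATA', '#WS']
```

A whitespace-only run satisfies both branches of the choice, which is
the ambiguity I mean.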

But outside all this there is the basic issue of whether content models
really should be the only direct mechanism for implementing
data models in XML: if the idea of namespaces is
to allow ad hoc inclusion of elements from different domains at
the user's discretion, the idea that a schema should be a language
description becomes less and less convincing. How useful is
"," when we might want to interpose elements from any other
namespace anywhere, for example?
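
A small Python sketch of that situation, under my own assumptions: the
namespace URIs, the element names and the helper functions are all
invented for illustration. A strict reading of the sequence
(mud,shoes,trouble) rejects a document as soon as a foreign element is
interposed, while a check that only looks at the home namespace still
accepts it.

```python
import xml.etree.ElementTree as ET

HOME_NS = "urn:example:boys"       # hypothetical namespace of the schema
SEQUENCE = ["mud", "shoes", "trouble"]

def split_tag(tag):
    """Split ElementTree's '{uri}local' tag form into (uri, local)."""
    if tag.startswith("{"):
        uri, local = tag[1:].split("}", 1)
        return uri, local
    return "", tag

def child_names(parent, only_ns=None):
    """Child element names in document order, optionally one namespace only."""
    names = []
    for child in parent:
        uri, local = split_tag(child.tag)
        if only_ns is None or uri == only_ns:
            names.append(local)
    return names

doc = ET.fromstring(
    '<b:boy xmlns:b="urn:example:boys" xmlns:n="urn:example:notes">'
    '<b:mud/><n:remark/><b:shoes/><b:trouble/>'
    '</b:boy>'
)

print(child_names(doc) == SEQUENCE)            # False: n:remark breaks ","
print(child_names(doc, HOME_NS) == SEQUENCE)   # True: foreign elements skipped
```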

For example, here is your content model, followed by a Schematron
schema.  I would say that the Schematron schema captures much
more directly what the content model might be modeling: in fact, the
content model establishes relationships but fails to say what they
mean.

> <!ELEMENT boy
>     (noise & (dirt,mud)+ & (mud,shoes,trouble)* & #PCDATA) +smell

<schema>
<pattern name="A Boy">
 <rule context="boy">
    <assert test="count(noise)=1">Boys need noise</assert>
    <assert test="dirt">Boys need dirt</assert>
    <assert test="mud">Boys need mud</assert>
    <assert test="count(mud)=count(dirt) + count(shoes)"
    >Some mud comes from dirt and some mud comes from shoes.</assert>
    <assert test="count(shoes)= count(trouble)"
    >A boy will have as much trouble as he has muddy shoes.</assert>
 </rule>
 <rule context="smell">
    <assert  test="ancestor::boy">Boys can smell</assert>
 </rule>
 <rule context="boy/trouble">
    <assert test="preceding-sibling::shoes">Muddy shoes lead to trouble</assert>
    <!-- the context node is trouble, so count on the parent boy -->
    <assert test="count(../mud)=count(../dirt) + count(../trouble)"
    >The mud that comes from dirt is independent of the
    mud that causes trouble</assert>
 </rule>
 <rule context="boy/shoes">
    <assert test="preceding-sibling::mud">A boy's shoes must be muddy</assert>
 </rule>
 <rule context="boy/dirt">
    <assert test="following-sibling::mud">All dirt leads to mud</assert>
    <assert test="name(following-sibling::*[position()=1])='smell'
                or name(following-sibling::*[position()=1])='mud'"
    >Dirt must be followed by mud or smells</assert>
 </rule>
</pattern>
</schema>
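
For readers without a Schematron implementation to hand, the first rule
(context="boy") can be approximated in a few lines of Python. This is
only a sketch of what the assertions test, not how Schematron works
internally; check_boy and both sample documents are invented for the
illustration.

```python
import xml.etree.ElementTree as ET

def check_boy(boy):
    """Return the messages of the assertions that fail for one <boy>."""
    def n(tag):
        return len(boy.findall(tag))   # number of direct children named tag
    asserts = [
        (n("noise") == 1, "Boys need noise"),
        (n("dirt") >= 1, "Boys need dirt"),
        (n("mud") >= 1, "Boys need mud"),
        (n("mud") == n("dirt") + n("shoes"),
         "Some mud comes from dirt and some mud comes from shoes."),
        (n("shoes") == n("trouble"),
         "A boy will have as much trouble as he has muddy shoes."),
    ]
    return [message for ok, message in asserts if not ok]

good = ET.fromstring(
    "<boy><noise/><dirt/><mud/><mud/><shoes/><trouble/></boy>")
bad = ET.fromstring("<boy><dirt/><mud/><shoes/></boy>")

print(check_boy(good))   # []
for message in check_boy(bad):
    print(message)       # three failed assertions, each with its reason
```

The point of the exercise is the output: each failure comes back as a
sentence about boys, noise and mud, rather than as "content model
violated".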

Other rules could be added to capture the intricacies of the inclusion,
but the question should be asked whether the content model captures the
intent of the schema developer more than the Schematron schema does:
to what extent does the elegance of regular expressions force decisions
to be made that are extraneous to modeling requirements, i.e. that
are merely artifacts of the notation/paradigm?

I think that a good number of the people who claim to dislike DTDs
will find that really their problem is with regular grammars. Of
course, the people who need to convert from class-based data into XML
will find XML Schema's provision of inheritance and class mechanisms
very useful, but that still won't help matters if the relationship
between elements is important.

Rick Jelliffe

xml-dev: A list for W3C XML Developers. To post, mailto:xml-dev at ic.ac.uk
Archived as: http://www.lists.ic.ac.uk/hypermail/xml-dev/ and on CD-ROM/ISBN 981-02-3594-1