Parser compliance

Fri Nov 19 15:21:25 GMT 1999

> From: Paul Prescod <paul at prescod.net>
> To: xml-dev at ic.ac.uk
> Subject: Re: Parser compliance
> 
[David Megginson] wrote:
> 
> That's probably a rhetorical question, but for those new to the field
> (i.e. who didn't come from SGML), SGML consultancies throughout the
> 1990's made an enormous portion of their money writing (and rewriting
> and rerewriting and rererewriting) massive and incomprehensible DTDs
> for government, military, and big industry, so naturally they (OK,
> "we") hyped the importance of DTDs as the cornerstone of any system.

[Paul Prescod]
There are two issues here. Tim raised one about how many mistakes DTDs
actually catch. You raise a (mostly unrelated, IMHO) one about how much
of a project's energy should be spent on the DTD design. Even though you
are talking about different things, I think you're both wrong.

[RCC]
It appears that all three of you are "right" within narrow contexts, but
that the questions/contexts are very different.  Being right or wrong
isn't important; communication is.

[Paul Prescod]
I claim that when people start using XML for what it's good for, schemas
(nee DTDs) will come back to the fore. After all, schema design is the
only hard thing about XML. The system design aspects are no different
than NON-XML system design. But for schemas, you could erase the "XML
here" box and put "s-expressions here" (or maybe even "CORBA here") and
nothing else would change.

[RCC]
In one respect I hope your prophecy comes true, but a lot is at stake
in how much "schemas (nee DTDs)" improve upon DTDs.  DTDs
are extremely limited in their relevance/utility WRT data modelling
because they cannot model semantics.  I would say that the "hard
thing" about the system design is getting the semantics right.  DTDs
don't help, "XML schemas" might help -- we'll see.

[Paul Prescod]
Once the core standards mature, people will set about building vertical
schemas (they already have). They will spend an enormous portion of
money writing, and rewriting, and rewriting, massive and
incomprehensible DTDs for electronic commerce, manufacturing, three-D
graphics and so forth. It's precisely because getting these things right
is so hard AND so important that so much energy is spent on them.

[RCC]
If design efforts revolve around "[incomprehensible] DTDs" then the
new age is no better than the old.  DTDs cannot express/capture the
truly important features with respect to which "getting these
things right" is a relevant and critically important notion.  I agree
that massive amounts of energy are required.  But what tools
and methodologies are appropriate?  I don't think DTDs should
be seen, heard, smelled, tasted, touched, or even thunk upon during
analysis/design.  UML diagrams -- maybe.  Isn't this the problem?
As David says [elsewhere], it's a conundrum because conceptual
design tends to bump into implementation design.  What we need is
a language of conversation that works for the domain experts
(we want semantic transparency when we get to the UI), the system
designers, and the implementation designers -- *and* which
represents executable/testable code for competence/correctness
checking.  The requirements engineering community does this...

[Paul Prescod]
The XML world won't spend less money, it will just spend it differently.
Instead of several large, colossal failures, it will have dozens and
dozens of tiny failures (there are probably at least a dozen failed
XML-based languages already).

[RCC]
It would be an interesting investigation: why have a dozen
(or two) XML-based languages "failed" already?  How would
the designers explain the failure?  How would their competitors?

[David Megginson]
> In brief, then, SGML systems tend to be DTD-centric while XML systems
> tend to be component-centric.  There's nothing in SGML or XML that
> forces that distinction; it's just the way things fell out.  Tim's
> right -- DTD-based validation will tell you only a tiny portion of
> what's wrong with your document, though that portion can be helpful
> in some circumstances.

[RCC]
"DTD-centric" vs. "component-centric" makes sense to
me, but "the way things fell out" begs for explanation. 
<warning>Avert your eyes, if you twitch and spaz out 
when someone claims there's a fundamental difference
between vanilla "documents" and "database 
[components]".</warning>  If XML systems are indeed
"component-centric", is it not because the information
being carried by XML markup is a different "kind" of
information than is carried in traditional (1986-vintage,
static, paper-print, "documents")?  The information
substrate (the datatype) in the latter is "character text".
The characters in "content" represent themselves
directly, aggregated into morphemes, words, phrases,
sentences -- which find quintessential expression in
the mere display/print of these characters.  The data in
the leaf nodes are all (credits Gary Simons) "from the
same domain -- the domain of character text."
DTDs ("sort of") work for this application domain
because *printable character text* is inherently ordered
(serialized) and can often be delimited naturally by
markers representing a hierarchy, etc. etc.

Conversely: when I look at the 200 or so XML (putative,
proposed) applications
(http://www.oasis-open.org/cover/siteIndex.html#toc-contentsApps),
I see information modelled in many exotic domains
which are NOT quintessentially 'character text'.  In these
spheres of application, the most important features cannot
be captured/modelled/expressed using DTD language
(relationships and other semantics).  Will XML Schemas
[XML Schema Definition Language] change things
a little -- or a lot?  I can't tell yet from the Schema
specs. The last I looked, there was still no provision
for something as fundamental as constraining the
target of a referent (ID-IDREF mechanism in
current XML-speak);  *that* would represent two
steps toward first base.
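
A minimal sketch of the ID-IDREF limitation just described.  The
element and attribute names (shop, customer, product, order) are
hypothetical, and the two functions are hand-rolled stand-ins, not a
real validating parser: the first mimics the only guarantee the DTD
mechanism gives (the reference resolves to *some* ID), the second is
the kind of constraint on the target that a DTD has no way to state.

```python
import xml.etree.ElementTree as ET

# Hypothetical document.  In a DTD, "id" would be declared as type ID
# and "customer" as type IDREF; nothing more could be said about it.
DOC = """
<shop>
  <customer id="c1"/>
  <product id="p1"/>
  <order id="o1" customer="c1"/>
  <order id="o2" customer="p1"/>
</shop>
"""

def index_ids(root):
    """Map every id value to the element that carries it."""
    return {el.get("id"): el for el in root.iter() if el.get("id") is not None}

def dangling_refs(root, ref_attr):
    """DTD-level check: a reference is wrong only if no element has that ID."""
    ids = index_ids(root)
    return [el.get(ref_attr) for el in root.iter()
            if el.get(ref_attr) is not None and el.get(ref_attr) not in ids]

def mistyped_refs(root, ref_attr, target_tag):
    """Stricter check: the referenced element must be of a given type."""
    ids = index_ids(root)
    return [el.get(ref_attr) for el in root.iter()
            if el.get(ref_attr) in ids and ids[el.get(ref_attr)].tag != target_tag]

root = ET.fromstring(DOC)
print(dangling_refs(root, "customer"))              # []
print(mistyped_refs(root, "customer", "customer"))  # ['p1']
```

Order "o2" names a product as its customer.  Every reference
resolves, so the ID-IDREF style check reports nothing; only the
typed check notices the mistake.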

If "DTD-based validation will tell you only a
tiny portion of what's wrong with your document",
the reasons WHY differ with respect to "README,
PRINTME" documents and "databases."  The
question of the hour is whether XML schemas
will be able (politically) to redress the DTD situation.
I do not share the skepticism of some about
generalized, extensible facilities for modelling/validating
primitive (relational, ontologic) semantics in the
markup context.  Whether the W3C members will be able to
transcend partisan "company interest" issues to make this
design happen: we can pray.  If the energy spent on bashing
and trying to outflank MS were spent on Good Design, I have
no doubt that the committees could succeed.

[Paul Prescod, WRT DM's 'DTD-centric']
That's a massive generalization. I'll bet that if we cleaned up all
HTML, 80% of all errors would be caught by a well-formedness parser
(mandated by XML), and 80% of the rest would be caught by a validator.
Most of the remaining errors would be fixed if we ran an automated link
checker against them.

[RCC]
Maybe.  What kinds of "errors" are we talking about?  If
I can re-make Tim's point, or perhaps a related one:  a limerick
or sonnet can be marked up in HTML (html, div, ul, li, p) and shown
in a typical HTML browser client.  But "errors" of interest to the
literature teacher are completely outside the scope of HTML.  An
error-checker that can address ONLY markup syntax is feeble in
the extreme, though certainly not useless.  But it's nearly irrelevant
with respect to the domain knowledge/information we hold to be
of principal importance.  DTDs didn't *have to be* that way, but
they are.  A matter of "grand equivocation," as I have discussed
elsewhere.
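
To make the limerick point concrete, here is a small sketch under
stated assumptions.  The markup, the class name "limerick", and the
rule "a limerick has five lines" are illustrative choices, and
xml.etree.ElementTree is used only as a well-formedness check, not
as a DTD validator.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup: syntactically clean, but one line short.
POEM = """
<div class="limerick">
  <ul>
    <li>first line</li>
    <li>second line</li>
    <li>third line</li>
    <li>fourth line</li>
  </ul>
</div>
"""

def is_well_formed(text):
    """Syntax-only check: does the markup parse at all?"""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

def has_limerick_line_count(text):
    """Domain check the markup syntax knows nothing about."""
    root = ET.fromstring(text)
    return len(root.findall(".//li")) == 5

print(is_well_formed(POEM))           # True
print(has_limerick_line_count(POEM))  # False
```

The syntax check is satisfied while the one error a literature
teacher would care about goes unreported.  Rhyme scheme and meter
are further still from anything the markup can express.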

[Paul Prescod]
I think that Tim should spend some time getting to know Lauren's
customers. I would guess that most people in the document publishing
world who use validating editors and parsers cut their error checking
code by nine-tenths. More important, they elevate the error checking into a
syntax that can easily be read and shared.

Your average XML editor purchaser has only two non-negotiable
requirements: realtime DTD checking and realtime stylesheet application.
Tim seems to think that they are naive but I don't believe so. That
feature can and does save many companies millions of dollars.

[RCC]
This emphasizes my point about the difference between XML as
created in XMetaL -- and XML as used in XMI, XOL, BRML, CBL,
fixML, and so forth.  And so, again: for printable text
(characters!) structured as vanilla documents via XMetaL, DTDs
and validating parsers give substantial payback, even if the
level of QA on information quality is zero to marginal.

I think what we want/need are conceptual modelling languages
and software tools that allow one to machine-test the correctness
of the design (the requirements, the system architecture, the
software modules).  Most critically: to model the semantics of
the problem domain -- something SGML/XML DTDs cannot do.  For
all the respect I have for SGML/XML, and DTDs [I have given a chunk
of my life in support of SGML!], and for all the pardon that should
charitably be granted to markup language design [1984] badly
botched in places [over the protest of computer scientists; so F.
Chahuneau], I still have to agree with Tim and David
about the serious inherent limitations.  I therefore hope
XML Schemas (Version 1 and Version [N]) will succeed in
helping migrate a very popular and accessible markup
notation in the direction of modern OO principles, where
semantics can be modelled directly.

xml-dev: A list for W3C XML Developers. To post, mailto:xml-dev at ic.ac.uk
Archived as: http://www.lists.ic.ac.uk/hypermail/xml-dev/ and on CD-ROM/ISBN 981-02-3594-1
To unsubscribe, mailto:majordomo at ic.ac.uk the following message;
unsubscribe xml-dev
To subscribe to the digests, mailto:majordomo at ic.ac.uk the following message;
subscribe xml-dev-digest
List coordinator, Henry Rzepa (mailto:rzepa at ic.ac.uk)