XML Torture Test: Parsers Fail

Wed Apr 7 05:43:32 BST 1999

At 3:51 PM -0700 4/6/99, David Brownell wrote:
>Chris, these aren't errors ... unless there are references
>to those entities (&baseball; and &season;) in the document,
>which is not currently done.
>
>If IE5 is treating those as errors, it shouldn't.
>
>- Dave
>
>
>Chris Lovett wrote:
>>
>> The problem appears to be in braves.dtd.  You have the following:
>>
>>         <!ENTITY baseball SYSTEM "braves/baseball.dtd">
>>         <!ENTITY season SYSTEM "braves/season.dtd">
>>
>> and these DTD's exist - so you have general parsed entities pointing to DTD
>> information which is not right.
>>
>> Once these two lines are removed from braves.dtd everything loads fine in
>> IE5.
>>

That does seem to be the problem.  Once I fixed that, IE 5.0 could load the
document from my local hard drive, but it still failed to load it from the
Web site. I don't yet know why.
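
Dave's point, that declaring an external entity without referencing it is
not an error, can be checked with a rough sketch like the one below. It uses
Python's expat binding, which is non-validating, so it only speaks to
well-formedness. The document here is a hypothetical stand-in: the root
element name and content are invented, and the two declarations are moved
from braves.dtd into an internal subset to keep the example self-contained.

```python
import xml.parsers.expat

# Hypothetical stand-in for the real document: two external general
# entities are declared but never referenced in the body.
DOC = b"""<?xml version="1.0"?>
<!DOCTYPE SEASON [
  <!ENTITY baseball SYSTEM "braves/baseball.dtd">
  <!ENTITY season SYSTEM "braves/season.dtd">
]>
<SEASON>1998</SEASON>
"""


def check(data):
    """Return (well_formed, declared_entity_names) for a document."""
    declared = []
    parser = xml.parsers.expat.ParserCreate()

    def on_entity_decl(name, is_param, value, base, system_id, public_id,
                       notation):
        # Called once per <!ENTITY ...> declaration the parser sees.
        declared.append(name)

    parser.EntityDeclHandler = on_entity_decl
    try:
        parser.Parse(data, True)
    except xml.parsers.expat.ExpatError:
        return False, declared
    return True, declared


if __name__ == "__main__":
    print(check(DOC))
```

The parser reports both declarations and never tries to fetch the files
they name, since nothing in the body refers to them.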

I think what this whole mess is showing, given the widely varying problems
with so many parsers, is that validation is not nearly as simple as it
seems, especially when the validators are asked to handle large files.  A
couple of decades ago a lot of bugs were exposed in compilers for many
languages when the output of program generators like lex and yacc
was thrown at them. While these compilers could handle anything a
human programmer was likely to write, they failed when faced with
automatically generated code.  The compilers made too many assumptions
about what code looked like that weren't part of the language specs.

I suspect we're seeing something like that here. These files and the DTDs
containing the entity references were all created by a program that pulled
data out of a database. Only the basic structure of the document was
designed by hand. Pouring a database into a custom designed XML vocabulary
is not unusual, but programmatically creating the entity references does
seem to be unusual. I worry about what's going to happen when we start
writing programs that not only generate the data and entity references but
also the vocabulary. We're likely to uncover even more bugs and underlying
assumptions about what XML files look like.  This one document uncovered
verifiable, repeatable problems in four separate independently developed
parsers.  What's interesting is that these were four completely different
problems.

We may be able to learn something from the more formal, verifiable approach
to compiler design that's taken hold over the last 20 years.  We need to
think about a more formal specification of XML, and perhaps provably
correct parsers.  At the very least there needs to be more connection
between the spec and validating parsers.  The BNF grammar is
straightforward (though at least one parser doesn't seem to be relying on
it) but the validity constraints are a mess.  The various schema proposals
may present an opportunity to fix this. We should consider very carefully
whether a given schema grammar can be easily (preferably automatically)
translated into a parser for schemas written in that grammar and for
documents that conform to particular schemas.
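
As a small illustration of what "automatically translated" could mean, here
is a sketch that mechanically turns a DTD-style element content model into a
regular expression over child element names and uses it as a checker. The
function names and the sample content model are made up for this example.
It handles only names, sequences, choices, and the ?, * and + indicators,
not #PCDATA, mixed content, or attributes.

```python
import re


def content_model_to_regex(model):
    """Translate a DTD-style content model into a compiled regex.

    Each element name becomes a group matching that name followed by a
    single space, so a list of children can be matched as one string.
    """
    tokens = re.findall(r"[A-Za-z_][\w.\-]*|[(),|?*+]", model)
    parts = []
    for tok in tokens:
        if tok == "(":
            parts.append("(?:")
        elif tok == ",":
            continue  # a sequence is plain concatenation in a regex
        elif tok in ")|?*+":
            parts.append(tok)  # same meaning in both notations
        else:
            parts.append("(?:" + re.escape(tok) + " )")
    return re.compile("".join(parts))


def children_valid(model, child_names):
    """Check a sequence of child element names against a content model."""
    pattern = content_model_to_regex(model)
    flat = "".join(name + " " for name in child_names)
    return pattern.fullmatch(flat) is not None


if __name__ == "__main__":
    model = "(TEAM_NAME, PLAYER+, COACH?)"  # hypothetical content model
    print(children_valid(model, ["TEAM_NAME", "PLAYER", "PLAYER"]))
    print(children_valid(model, ["PLAYER", "TEAM_NAME"]))
```

The checker is derived entirely from the declaration, with no hand-written
logic per element type, and that is the property worth asking of any schema
language.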

+-----------------------+------------------------+-------------------+
| Elliotte Rusty Harold | elharo at metalab.unc.edu | Writer/Programmer |
+-----------------------+------------------------+-------------------+
|        XML: Extensible Markup Language (IDG Books 1998)            |
|   http://www.amazon.com/exec/obidos/ISBN=0764531999/cafeaulaitA/   |
+----------------------------------+---------------------------------+
|  Read Cafe au Lait for Java News:  http://sunsite.unc.edu/javafaq/ |
|  Read Cafe con Leche for XML News: http://sunsite.unc.edu/xml/     |
+----------------------------------+---------------------------------+

xml-dev: A list for W3C XML Developers. To post, mailto:xml-dev at ic.ac.uk
Archived as: http://www.lists.ic.ac.uk/hypermail/xml-dev/ and on CD-ROM/ISBN 981-02-3594-1
To (un)subscribe, mailto:majordomo at ic.ac.uk the following message;
(un)subscribe xml-dev
To subscribe to the digests, mailto:majordomo at ic.ac.uk the following message;
subscribe xml-dev-digest
List coordinator, Henry Rzepa (mailto:rzepa at ic.ac.uk)