XML Torture Test: Parsers Fail

Elliotte Rusty Harold elharo at metalab.unc.edu
Mon Apr 5 17:13:36 BST 1999


Without intending to do so, I have devised an XML document that exposes
many problems in almost all XML validating parsers and non-validating
parsers that resolve external entity references.  You will find this
torture test at

http://metalab.unc.edu/examples/players/index.xml

It has broken every parser I've thrown at it in one way or another,
including the one in IE5, with the single exception of RXP.  However, RXP
reports some warnings that do not appear to be errors, and on an earlier
version of the document it missed some problems involving missing encoding
declarations in the text declarations, which xml4j 2.0.4 (but not 1.1.14)
picked up.  Those problems have now been fixed.

As best I can tell this document is both well-formed and valid. It's hard
to say for sure when many different parsers all fail to process it, mostly
after either giving up completely or generating incorrect error messages.
Until I'm more confident the document is correct, I'm simply defining a
broken parser as one that

1. describes a valid document as invalid (Microsoft?, xml4j?)
2. describes an invalid document as valid (RXP)
3. describes an invalid document as invalid but gives the wrong reason.
(Microsoft?, xml4j?)

Once I've conclusively determined whether my document is valid, I should be
able to determine whether Microsoft and xml4j fit into category 1 or
3 or both.
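
The three-way definition above can be written down as a small decision
function.  This is only an illustrative sketch in Python; the names Verdict
and classify are hypothetical and are not part of any parser's API.  It
assumes you already know, by some independent means, whether the document
really is valid and whether the reported reason is the right one.

```python
from enum import Enum


class Verdict(Enum):
    OK = "not broken by this definition"
    REJECTS_VALID = "1. describes a valid document as invalid"
    ACCEPTS_INVALID = "2. describes an invalid document as valid"
    WRONG_REASON = "3. describes an invalid document as invalid, wrong reason"


def classify(document_is_valid, parser_says_valid, reason_is_correct=True):
    """Map one parser run onto the three categories of brokenness.

    reason_is_correct only matters when an invalid document is rejected.
    """
    if document_is_valid:
        return Verdict.OK if parser_says_valid else Verdict.REJECTS_VALID
    if parser_says_valid:
        return Verdict.ACCEPTS_INVALID
    return Verdict.OK if reason_is_correct else Verdict.WRONG_REASON
```

Note that categories 1 and 3 are mutually exclusive for any single version
of the document, which is why the question cannot be settled until the
document's validity is known.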

What's torturous about this example is that it declares over 1000 separate
external general entities in several dozen different DTDs.
Currently only one of those entities is actually used in the main document,
but I plan to expand it to use all 1000+ entities.  Thus it's likely to
become even more difficult to parse properly.  Leaving aside the question
of whether this is the proper design for this document, it's nonetheless
the case that parsers should be able to handle it.  Parser authors may wish
to investigate further.  I would also appreciate help from anyone who can
spot by eye genuine mistakes of mine that the parsers may be reporting
incorrectly.
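
For anyone who wants to reproduce the general shape of this test locally,
here is a minimal sketch in Python using the standard xml.parsers.expat
module.  It is not the actual document at the URL above: the file names
(players.dtd, playerN.xml), the element names, and both function names are
made up for illustration, and it uses a single DTD rather than several
dozen.  It writes a DTD declaring many external general entities, a main
document that uses only one of them, and then parses the result while
resolving external entities.  expat is non-validating, so this exercises
entity resolution only, not validity checking.

```python
import os
import tempfile
from xml.parsers import expat


def build_torture_files(directory, count=1000):
    """Write `count` entity files, one DTD declaring them, and a main document."""
    decls = []
    for i in range(count):
        name = f"player{i}"  # hypothetical entity name
        with open(os.path.join(directory, name + ".xml"), "w", encoding="utf-8") as f:
            f.write(f"<player>Player {i}</player>")
        decls.append(f'<!ENTITY {name} SYSTEM "{name}.xml">')
    with open(os.path.join(directory, "players.dtd"), "w", encoding="utf-8") as f:
        f.write("\n".join(decls))
    main_path = os.path.join(directory, "index.xml")
    with open(main_path, "w", encoding="utf-8") as f:
        f.write('<?xml version="1.0"?>\n'
                '<!DOCTYPE players SYSTEM "players.dtd">\n'
                '<players>&player0;</players>\n')  # only one entity is used
    return main_path


def parse_with_external_entities(main_path):
    """Parse main_path, loading the external DTD and any referenced entities.

    Returns the element names in the order their start tags were seen.
    """
    base_dir = os.path.dirname(os.path.abspath(main_path))
    seen = []

    def make_resolver(parser):
        def resolve(context, base, system_id, public_id):
            # context is None for the external DTD subset.
            child = parser.ExternalEntityParserCreate(context)
            child.ExternalEntityRefHandler = make_resolver(child)
            with open(os.path.join(base_dir, system_id), "rb") as f:
                child.ParseFile(f)
            return 1  # nonzero tells expat the entity was handled
        return resolve

    parser = expat.ParserCreate()
    # Without this, expat does not ask for the external DTD subset at all.
    parser.SetParamEntityParsing(expat.XML_PARAM_ENTITY_PARSING_ALWAYS)
    parser.StartElementHandler = lambda name, attrs: seen.append(name)
    parser.ExternalEntityRefHandler = make_resolver(parser)
    with open(main_path, "rb") as f:
        parser.ParseFile(f)
    return seen


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        print(parse_with_external_entities(build_torture_files(d)))
```

A parser that resolves external entities should report the element pulled
in through the one entity reference; a parser that skips them will see only
the root element.  Raising count, or referencing more of the declared
entities in the main document, makes the test progressively harsher.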



+-----------------------+------------------------+-------------------+
| Elliotte Rusty Harold | elharo at metalab.unc.edu | Writer/Programmer |
+-----------------------+------------------------+-------------------+
|        XML: Extensible Markup Language (IDG Books 1998)            |
|   http://www.amazon.com/exec/obidos/ISBN=0764531999/cafeaulaitA/   |
+----------------------------------+---------------------------------+
|  Read Cafe au Lait for Java News:  http://sunsite.unc.edu/javafaq/ |
|  Read Cafe con Leche for XML News: http://sunsite.unc.edu/xml/     |
+----------------------------------+---------------------------------+



xml-dev: A list for W3C XML Developers. To post, mailto:xml-dev at ic.ac.uk
Archived as: http://www.lists.ic.ac.uk/hypermail/xml-dev/ and on CD-ROM/ISBN 981-02-3594-1
To (un)subscribe, mailto:majordomo at ic.ac.uk the following message;
(un)subscribe xml-dev
To subscribe to the digests, mailto:majordomo at ic.ac.uk the following message;
subscribe xml-dev-digest
List coordinator, Henry Rzepa (mailto:rzepa at ic.ac.uk)



