Classification: XML Parser Features

David Megginson ak117 at
Fri Dec 12 17:19:13 GMT 1997

Tim Bray writes:

 > >a) Scanning
 > >     This type of parser simply skips the DOCTYPE declaration (using
 > >   regular expressions) and parses the markup in the document
 > >   instances.  
 > This is not a conformant XML processor per the spec.
 > There are certain things a processor is required to do with the internal
 > subset, including parse it and check it for syntax.

Quite right; to my knowledge, however, there exist no XML processors
that do so, except possibly for James's new one (I haven't tried it).
In particular, few handle UTF-8 correctly.  As I've mentioned in
private e-mail, even the 1997-12-08 spec is not currently well-formed,
since it uses ISO-8859-1 encoding without saying so in its encoding
declaration, so any conforming processor would have to reject it.
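
To illustrate the encoding point, here is a minimal sketch using Python's bundled expat binding as a stand-in for a conforming processor (an assumption for illustration; it is not one of the parsers discussed in this thread). The helper name and the sample documents are hypothetical. A byte that is legal in ISO-8859-1 but not in UTF-8 should be rejected unless the encoding declaration names the encoding.

```python
import xml.parsers.expat


def is_well_formed(data: bytes) -> bool:
    """Return True if expat accepts the bytes as a well-formed document."""
    parser = xml.parsers.expat.ParserCreate()
    try:
        parser.Parse(data, True)
        return True
    except xml.parsers.expat.ExpatError:
        return False


# 0xE9 is "e with acute" in ISO-8859-1, but on its own it is not valid UTF-8.
undeclared = b"<doc>caf\xe9</doc>"
declared = b'<?xml version="1.0" encoding="ISO-8859-1"?><doc>caf\xe9</doc>'

print(is_well_formed(undeclared))
print(is_well_formed(declared))
```

The first document has no encoding declaration, so the processor must assume UTF-8 and fail on the stray byte; the second differs only in declaring its encoding.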

More generally, this requirement makes no provision for the desperate
Perl hacker who has played such a central role in XML discussions.
Creating a parser that truly enforces well-formedness is very, very difficult, because
of the enormous number of constraints imposed both explicitly and
implicitly by the grammar (I could probably write a full SGML parser
with about the same level of effort, especially if I limited myself to
a single, simple SGML declaration).

For example, both Ælfred and Lark fail to report the two errors in the
following document:

<doc note="1<2">
<para>This is a ]]> paragraph.</para>

I could support complete well-formedness error reporting in Ælfred,
but its size would bloat to about 35-40K (entity-boundary checking, in
particular, would be messy), while I still want to get it down to
under 20K so that Java applet writers can use it.  I did have a
version that passed the first 101 of James Clark's 141 tests, but it
was already at about 30K, and I was aware of many other cases that he
wasn't testing for.
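
The two errors in the sample document are a literal "<" inside an attribute value and the sequence "]]>" in character data. As a rough sketch, each can be isolated and fed to a parser that does check for them; Python's expat binding is used here purely as an assumed example of such a parser, and the helper name and test strings are hypothetical.

```python
import xml.parsers.expat


def first_error(text):
    """Return the parser's error message, or None if the text is accepted."""
    parser = xml.parsers.expat.ParserCreate()
    try:
        parser.Parse(text, True)
    except xml.parsers.expat.ExpatError as err:
        return str(err)
    return None


# Each document isolates one of the two errors.
lt_in_attribute = '<doc note="1<2"/>'
cdata_end_in_text = '<para>This is a ]]> paragraph.</para>'

# The same content with both constructs escaped.
corrected = '<doc note="1&lt;2"><para>This is a ]]&gt; paragraph.</para></doc>'

for sample in (lt_in_attribute, cdata_end_in_text, corrected):
    print(repr(sample), "->", first_error(sample))
```

Both checks are cheap in isolation; the cost described above comes from covering every such constraint, including the ones that cross entity boundaries.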

 > >b) DTD-driven
 > There are a whole range of behaviors.  Parsers may, not must, read 
 > external markup declarations and external parsed entities.  

Yes, you control that using the standalone declaration.  I am
recommending that parsers that do not handle the full DTD (internal
and external) be referred to as "scanning parsers", while parsers that
handle everything be referred to as "DTD-driven parsers".  If
necessary, we could always add another degree in the middle.
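
For concreteness, the "scanning" approach can be sketched in a few lines. This is only an illustration under stated assumptions: the regular expression and the function name are hypothetical, and the pattern will misfire on an internal subset that itself contains "]>" (inside a comment or an entity value, for example).

```python
import re

# Hypothetical pattern: a DOCTYPE declaration, with or without an internal
# subset in square brackets. The subset is skipped, not parsed or checked.
DOCTYPE_RE = re.compile(r"<!DOCTYPE[^\[>]*(?:\[.*?\]\s*)?>", re.DOTALL)


def skip_doctype(text):
    """Drop the first DOCTYPE declaration and return the rest unchanged."""
    return DOCTYPE_RE.sub("", text, count=1)


sample = """<?xml version="1.0"?>
<!DOCTYPE doc [
  <!ELEMENT doc (#PCDATA)>
  <!ENTITY greeting "hello">
]>
<doc>&greeting;</doc>"""

instance = skip_doctype(sample)
print(instance)
```

The entity declared in the skipped subset is lost, so the reference in the instance can no longer be resolved; that is the practical difference between a scanning parser and a DTD-driven one.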

 > >Realm #2: Validation
 > >
 > >a) Non-validating
 > >     This type of parser assumes that its input document is both
 > >   well-formed and valid, and is not required to report any errors at
 > >   all.
 > No such animal is envisioned in the standard.  If it doesn't check for
 > WF problems, it's not an XML processor.

I am aware of the constraints in the spec, but I believe that this
requirement is a serious strategic error.  Ælfred is a non-conforming XML processor,
as are Lark, MSXML, and all others that I have had a chance to try:
Ælfred will produce correct output for valid and well-formed XML
documents, but will not necessarily report errors for documents that
are not valid/well-formed.

If the XML spec does not make allowance for software tools like these,
then it will have little to distinguish it from full SGML except for a
bit of marketing hype.

 > I'll stop here.  I suggest you go back and re-work your
 > (potentially helpful) list based on a re-reading of the
 > specification. -Tim

Thank you very much for your comments.  I am grateful for the work
that you and the rest of the WG have done with the spec, and I hope
that you find my comments constructive rather than confrontational.

All the best,


David Megginson                 ak117 at
Microstar Software Ltd.         dmeggins at

xml-dev: A list for W3C XML Developers. To post, mailto:xml-dev at