Classification: XML Parser Features

Sat Dec 13 06:03:07 GMT 1997

David Megginson wrote:
> 
> Tim Bray writes:
> 
>  > >a) Scanning
>  > >     This type of parser simply skips the DOCTYPE declaration (using
>  > >   regular expressions) and parses the markup in the document
>  > >   instances.
>  >
>  > This is not a conformant XML processor per the spec.
>  >
>  > There are certain things a processor is required to do with the internal
>  > subset, including parse it and check it for syntax.
> 
> Quite right; to my knowledge, however, there exist no XML processors
> that do so, except possibly for James's new one (I haven't tried it).
> In particular, few handle UTF-8 correctly.  As I've mentioned in
> private e-mail, even the 1997-12-08 spec is not currently well-formed,
> since it uses ISO-8859-1 encoding without saying so in its encoding
> declaration, so any conforming processor would have to reject it.

The spec says that failing to specify the right encoding is merely an
error (which means a processor is not required to detect it) rather than
a fatal error.  In general a processor can't detect whether the
specified encoding is correct (consider ISO-8859-1 versus ISO-8859-2).
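
To illustrate the point about undetectable encoding errors, here is a
minimal Python sketch.  The sample bytes and the helper name
decodes_cleanly are made up for the illustration.  The same byte string
decodes without complaint as either ISO-8859-1 or ISO-8859-2, though the
two readings disagree about the last character, whereas the same bytes
are detectably malformed if they are taken to be UTF-8.

```python
def decodes_cleanly(data: bytes, encoding: str) -> bool:
    """Return True if data can be decoded in the given encoding without error."""
    try:
        data.decode(encoding)
        return True
    except UnicodeDecodeError:
        return False


# Hypothetical sample bytes: "caf", then 0xE9, a space, then 0xB1.
sample = b"caf\xe9 \xb1"

latin1_text = sample.decode("iso-8859-1")
latin2_text = sample.decode("iso-8859-2")

# Both single-byte decodings succeed, yet they disagree about byte 0xB1,
# so nothing in the byte stream reveals which label is the right one.
print(decodes_cleanly(sample, "iso-8859-1"), repr(latin1_text))
print(decodes_cleanly(sample, "iso-8859-2"), repr(latin2_text))

# Read as UTF-8 the same bytes are malformed, which a processor can detect.
print(decodes_cleanly(sample, "utf-8"))
```

Only the mislabelling that produces illegal byte sequences can be caught
mechanically; confusing one single-byte encoding with another cannot.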

> More generally, this requirement makes no provision for the desperate
> Perl hacker who has played such a central role in XML discussions.

The desperate Perl hacker doesn't require his code to be blessed as a
conforming XML processor.  One reason for requiring conforming parsers
to detect and report errors is to avoid the situation we see now with
HTML, where it has become extremely difficult to create a
production-quality HTML processor because users have come to expect an
HTML processor to accept almost any random garbage they throw at it.
Personally I would have preferred to see XML allow conforming processors
to continue processing in the presence of errors, but I think the
decision to require that errors be detected and reported was the right
one.

> Creating a truly well-formed parser is very, very difficult, because
> of the enormous number of constraints imposed both explicitly and
> implicitly by the grammar (I could probably write a full SGML parser
> with about the same level of effort, especially if I limited myself to
> a single, simple SGML declaration).

I think that assessment is way off base.  My xmlwf processor aims to
catch all well-formedness errors.  There are a couple of cases I know
the current version doesn't catch and there are probably a few cases
I've missed, but I think it is pretty close.  I wouldn't say writing it
was very, very difficult.  However, it's certainly not trivial, and does
require considerable attention to detail.  I think having a test suite
should help here.  Getting good performance also requires effort.
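
As a rough illustration of what a well-formedness test suite looks like,
here is a small Python sketch.  It uses the standard library's
xml.parsers.expat module as a stand-in checker (it is not xmlwf), and the
four cases are made up for the example rather than taken from the real
test files.  The last case has a syntax error inside the internal
subset, which a processor is required to parse and check.

```python
import xml.parsers.expat


def is_well_formed(document: str) -> bool:
    """Return True if the parser accepts the whole document without error."""
    parser = xml.parsers.expat.ParserCreate()
    try:
        parser.Parse(document, True)  # True marks this as the final chunk
        return True
    except xml.parsers.expat.ExpatError:
        return False


# Hypothetical cases: (document, expected verdict).
CASES = [
    ("<a><b/></a>", True),                        # properly nested
    ("<a><b></a></b>", False),                    # overlapping elements
    ("<a x=1/>", False),                          # unquoted attribute value
    ("<!DOCTYPE a [ <!ELEMENT a ]><a/>", False),  # broken internal subset
]


def run_suite(cases):
    """Return the documents whose verdict differs from the expected one."""
    return [doc for doc, expected in cases if is_well_formed(doc) != expected]


print(run_suite(CASES))
```

Each failing document is listed by run_suite, so a parser under
development can be rerun against the cases after every change.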

There are a couple of things in this area that I would like to see
changed in 1.1:

- for well-formedness almost any character should be allowed as a name
character; detailed checking of a character against the table of name
characters should be a validity check;

- whitespace in the prolog shouldn't be handled in the grammar, but
should instead be regularised (still compatible with ISO 8879 of course)
and handled at a lexical level.
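
The first proposal above could be sketched roughly as follows in Python.
Everything here is an assumption made for illustration: the delimiter
set, the function names, and the table, which is only an excerpt
covering code points below U+0100 (letters, digits, a few punctuation
marks and the middle dot; the multiplication and division signs are
excluded).  The real table is far larger.

```python
import string

# Assumption: characters a tokenizer must treat as markup or whitespace,
# so they can never be part of a name under the relaxed rule.
DELIMITERS = set("<>&'\"=/?!;%[]()|,*+#") | set(" \t\r\n")

# Excerpt of a name-character table, limited to code points below U+0100.
ASCII_NAME_CHARS = set(string.ascii_letters + string.digits + ".-_:")
LATIN1_RANGES = [(0xB7, 0xB7), (0xC0, 0xD6), (0xD8, 0xF6), (0xF8, 0xFF)]


def lax_is_name_char(ch: str) -> bool:
    """Relaxed well-formedness check: anything that is not a delimiter."""
    return ch not in DELIMITERS


def table_is_name_char(ch: str) -> bool:
    """Strict check against the excerpted table (meaningful below U+0100 only)."""
    if ch in ASCII_NAME_CHARS:
        return True
    code_point = ord(ch)
    return any(low <= code_point <= high for low, high in LATIN1_RANGES)


def check_name(name: str):
    """Return (passes_lax_check, passes_table_check) for a candidate name."""
    lax = bool(name) and all(lax_is_name_char(ch) for ch in name)
    strict = bool(name) and all(table_is_name_char(ch) for ch in name)
    return lax, strict


# U+00D7 (multiplication sign) passes the lax check but not the table check.
print(check_name("a\u00d7b"))
print(check_name("caf\u00e9"))
print(check_name("a<b"))
```

Under this split, the cheap delimiter test is all a well-formedness
checker would need, and the table lookup moves to the validating layer.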

A fully conforming SGML parser (even one limited to a single SGML
declaration) is substantially more difficult.  For example, in order to
enforce the RS/RE ignoring rules a parser has to determine whether an
element is an inclusion or not, which in turn requires it to do content
checking.

>  I did have a
> version that passed the first 101 of James Clark's 141 tests, but it
> was already at about 30K, and I was aware of many other cases that he
> wasn't testing for.

Additional test cases are welcome. (By the way, test 088.xml was
overtaken by events and is now well-formed.)

>  > >b) DTD-driven
>  >
>  > There are a whole range of behaviors.  Parsers may, not must, read
>  > external markup declarations and external parsed entities.
> 
> Yes, you control that using the standalone declaration.  I am
> recommending that parsers that do not handle the full DTD (internal
> and external) be referred to as "scanning parsers", while parsers that
> handle everything be referred to as "DTD-driven parsers".  If
> necessary, we could always add another degree in the middle.

The intent (at least as I understand it) was to enable the following two
classes of parser:

- standalone parsers which can handle only the internal subset (and
hence which are able to produce the correct parse only for documents
which specify or could specify standalone="yes")

- full parsers which can parse the complete DTD.
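
As a small illustration of how the two classes relate to the standalone
declaration, here is a Python sketch using the standard library's
xml.parsers.expat module.  The helper name declared_standalone and the
sample documents are made up for the example.  It only reports what the
document declares; a document that omits the declaration might still be
one that could specify standalone="yes", and this sketch does not try to
decide that.

```python
import xml.parsers.expat


def declared_standalone(document: str) -> int:
    """Return 1 for standalone="yes", 0 for "no", -1 if nothing is declared."""
    seen = {"standalone": -1}

    def on_xml_decl(version, encoding, standalone):
        # expat passes 1, 0 or -1 for yes, no or an omitted declaration.
        seen["standalone"] = standalone

    parser = xml.parsers.expat.ParserCreate()
    parser.XmlDeclHandler = on_xml_decl
    parser.Parse(document, True)
    return seen["standalone"]


# Hypothetical sample documents.
print(declared_standalone('<?xml version="1.0" standalone="yes"?><doc/>'))
print(declared_standalone('<?xml version="1.0" standalone="no"?><doc/>'))
print(declared_standalone("<doc/>"))
```

A result of 1 means a parser limited to the internal subset is enough
for the correct parse; 0 or -1 means a full parser may be needed.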

James

xml-dev: A list for W3C XML Developers. To post, mailto:xml-dev at ic.ac.uk
Archived as: http://www.lists.ic.ac.uk/hypermail/xml-dev/