Classification: XML Parser Features

Sat Dec 13 14:22:46 GMT 1997

At 17:08 12/12/97 -0800, Tim Bray wrote:
>At 12:17 PM 12/12/97 -0500, David Megginson wrote:
>>Creating a truly well-formed parser is very, very difficult, because
                                         ^^^^^^^^^^^^^^^^^^^^
I think I would rephrase this - like TimB - to read something like:
"Creating a WF parser is a *lot* of work with a large number of small
decisions where the author may not always get help from the spec."
The author has to make (small) decisions which may appear intuitive to her
but may be interpreted differently by others. These decisions may not
matter in the vast majority of cases. 

There is/was a measure for XML that a 'mythical computer science graduate
student' could hack up a parser in a couple of weeks. Armed with this
promise I set about writing a recursive descent parser (which still exists
in JUMBO and is the default). But I have stopped working on it because (a)
others have written much better ones and (b) it's a lot more work than it
looks. Not difficult, I suspect, (it *was* difficult with the early version
of PEs) but lots of unrelated niggles. 

As an example I started writing an editor for WF XML, including editing
elementTypes and attributes. I suddenly realised that I had to check for
Name validity - as highlighted by James Clark. This requires validating
characters against Appendix B of the spec. I applaud and support the WG's
concentration on Internationalization (i18n) but when confronted with
Appendix B at midnight, the heart sinks. The tendency is just to insert
'This document is not yet i18n-conformant' and get on with more exciting
things (like why the program crashes).
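
As a rough illustration of the kind of check involved, here is a minimal
sketch in Python. It assumes the Name production of the XML spec (a Letter,
'_' or ':' followed by name characters) and approximates the Appendix B
character classes with Unicode general categories. That approximation is an
assumption made for brevity and will not agree with Appendix B in every
case; the function names are hypothetical.

```python
import unicodedata

# Assumption: Unicode general categories stand in for the Appendix B tables.
_LETTER_CATEGORIES = {"Lu", "Ll", "Lt", "Lo", "Nl"}
_EXTRA_NAMECHAR_CATEGORIES = {"Nd", "Mn", "Mc", "Me", "Lm"}


def _is_letter(ch):
    """Approximate the spec's Letter class."""
    return unicodedata.category(ch) in _LETTER_CATEGORIES


def _is_name_char(ch):
    """Approximate NameChar: letters, digits, combining marks, extenders, . - _ :"""
    if ch in "._-:":
        return True
    return _is_letter(ch) or unicodedata.category(ch) in _EXTRA_NAMECHAR_CATEGORIES


def is_probable_name(s):
    """Return True if s looks like an XML Name under the approximation above."""
    if not s:
        return False
    first = s[0]
    if not (first in "_:" or _is_letter(first)):
        return False
    return all(_is_name_char(ch) for ch in s[1:])
```

An editor could call is_probable_name("elementType") before accepting a new
element type or attribute name; a strictly conforming check would replace the
two category sets with the ranges listed in Appendix B.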

In writing JUMBO I have come across a large number of these little things
which I don't feel the spec resolves.  I am very happy to leave the
parser-related things to those people who do it better (than me). But
SeanM/DavidM correctly raise the question of what a parser emits. I am
still not sure what the distinction between a parser, a processor and an
application is - I keep asking and have failed to get a reply.  This is
dangerous because (a) 'processor' is used in the spec but 'parser' isn't, and
(b) it's quite clear from discussions on this list that:

        - some people think processor and parser are synonyms.

        ----------------------              ---------------------
        |Parser aka Processor| ---------->  |    Application    |
        ----------------------              ---------------------

        - some people think parser and processor are completely separate.

        --------             -------------              ---------------------
        |Parser|  ------->   | Processor | ---------->  |    Application    |
        --------             -------------              ---------------------

        - some people think that a processor is a unit which contains a parser
          but has additional integrated facilities.

        ---------------------------------
        |           Processor           |
        |           -----------         |              ---------------------
        |           | Parser  |         | ---------->  |    Application    |
        |           -----------         |              ---------------------
        ---------------------------------

*** I suggest that the first time anyone uses the word 'parser' or
'processor' in this discussion they indicate what they think a processor
is. Unless we have some ideas of each other's ontologies we shall have
serious problems.

The problems of what a parser is are tricky, but nothing compared with
the semantic difficulties of passing the output of 'a processor' to 'an
application'. The spec gives no help with this, except to highlight some
areas of difficulty and - effectively - to say 'this is up to you'. I'd
like it to be partly 'up to XML-DEV', which is why this discussion is *so*
important. 

Please don't think that anyone raising problems here is simply unable to
understand the spec or hasn't read it properly. Those involved in writing
the spec have a combined weight of perhaps 500 years of working with SGML
and other document processing tools. Many of the readers of this list are
coming to these discussions with different backgrounds and do not pick up
the 'implied' or 'given' semantics in the spec. I'm one, and I think that
if someone genuinely can't *implement* the spec because of semantic
uncertainties, there is a problem. [I am also clear, and have said so all
along, that many problems will *only* come to light when people try to
implement the spec.] However, it's also important to realise that the spec is
written with very great care and very great precision, and many sentences need
to be read very carefully and repeatedly. [In this alone I doubt that many
MCSGS can effectively understand all the concepts in the spec in less than
two weeks. And most DPHs and DumbXMLBrowserHackers (like me) will miss a
lot of the subtlety, through cursory reading.]

>>of the enormous number of constraints imposed both explicitly and
>>implicitly by the grammar (I could probably write a full SGML parser
>>with about the same level of effort, especially if I limited myself to
>>a single, simple SGML declaration).

I think the problems are different. SGML is complex, but precise. A year or
two back someone estimated on comp.text.sgml that SGML defined something
like 2^16 variants. I think that XML is one such variant, and one of the
simplest.  Writing a full SGML parser is very hard, with the result that
very few complete standalone parsers were ever written. In one sense that
was very valuable because people like me would just run their document
through sgmls - if it crashed, the document was wrong.  [I have no idea
whether there are parsers which take a semantically different view of 8879
from sgmls. However, even sgmls did not implement all the hairy options in
SGML, and many of these are not covered in many textbooks].
The XML process is very different. The syntax is trivial to write a parser
for. But the freedom of WF documents presents difficult and unresolved
problems of semantics. Therefore the time spent writing an XML parser goes
not into coding the BNF, but into worrying about what to do with the code. In
particular the question of 'validity' is fuzzy and crops up repeatedly. Where
a feature is optional in an XML document (e.g. the DOCTYPE statement), does
its *presence* (not its content) imply anything about how the software should
behave? I don't find this easy, but it's a very different sort of
difficulty from the difficulty of coding a validating algorithm for content
in full SGML.
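
To make the 'presence' question concrete, here is a minimal sketch in Python
of software that records only whether a DOCTYPE was seen and leaves the
decision about behaviour to the caller. It uses the standard
xml.parsers.expat module, which is not one of the parsers discussed in this
thread; the function name has_doctype is hypothetical.

```python
import xml.parsers.expat


def has_doctype(document):
    """Report whether a well-formed document carries a DOCTYPE declaration."""
    seen = []

    def on_doctype(name, system_id, public_id, has_internal_subset):
        # Only the presence is recorded; the content of the DTD is ignored.
        seen.append(name)

    parser = xml.parsers.expat.ParserCreate()
    parser.StartDoctypeDeclHandler = on_doctype
    # Raises xml.parsers.expat.ExpatError if the input is not well-formed.
    parser.Parse(document, True)
    return bool(seen)
```

Whether a True result should switch the software into a validating mode,
merely be reported to the application, or be ignored is exactly the kind of
decision the spec leaves open.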

[Tim's areas of difficulty]
>1. handling multiple input encodings, and
>2. making it run real fast while you're doing #1.
>
>These don't really bother me that much as we are in the infancy of 
>learning what the right way is to build truly internationalized
>software; for example, I can parse the UTF16 Japanese version of the
>XML spec in a few seconds; then it takes the best part of a minute
>to load the .ttf for the Unicode font so you can look at anything;
>so we have a few problems in this area.

Because this is uncharted territory it's certain to throw up problems.

>
>Having said that, I am now in the middle of coding up validation for
>Lark, and there are a TREMENDOUS NUMBER of irritating little
                       ^^^^^^^^^^^^^^^^^
Yup, yup, yup.

Each of these is 'small'. Let's assume that 95% of people agree with your
interpretation for each one in precise implementation (e.g. implementation
of Name), and let's assume that you have 20 such problems. 0.95^20 is about
0.36; so only about 36% of people will think that Lark is totally conforming
and does exactly what they want. This is a possibly naughty way of addressing the
problem, but it can only (IMO) be resolved by identifying those niggling
problems and agreeing communally either the 'right' way, or adding a switch
to the operation. Leaving each parser writer to make personal decisions is
a guarantee that parsers will behave differently.
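
The estimate above can be checked with a few lines of Python. The sketch
assumes that the 20 decisions are independent and that each is agreed with by
the same 95% of people; both are simplifications, and the function name is
hypothetical.

```python
def fraction_in_full_agreement(p_agree, n_decisions):
    """Fraction of people who agree with every one of n independent decisions."""
    return p_agree ** n_decisions


if __name__ == "__main__":
    # 20 small decisions, each accepted by 95% of people.
    print(round(fraction_in_full_agreement(0.95, 20), 3))  # prints 0.358
```

With 40 such decisions the same calculation gives roughly 13%, which is why
agreeing on the niggles communally matters more as their number grows.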

This is why JUMBO can use multiple parsers. DavidD suggested that it was
because they had bugs. In a sense that's exactly right ('features' is
probably more accurate). [It's also because no one has - yet - got a
complete Java implementation of a 'parser'.]

The thing that really frustrates me is that we lost the communal will to
create an API for parsers. Why, why, why - can't we do this? 

I'm going to suggest a slightly revised approach. AElfred comes close to it.
I'll write another msg, rather than make this too long.

[...]
>
>Mind you, the validator is in a separate package and can be bypassed, so 
>Lark effectively need be no larger.  But still; I wonder if validation
>is intrinsically hard or we could have found a better 80/20 point? -Tim

You're going to find out whether it's hard when you try to implement it
:-). I have no idea whether it's *really hard*. I think I could do content
validation in a week on a desert island. I would probably use a completely
stupid approach.
However I have received a gift of a validator (not in Java, but many
thanks) and please keep them coming. We need more than one, precisely to
see whether we all agree :-)

	P.

Peter Murray-Rust, Director Virtual School of Molecular Sciences, domestic
net connection
VSMS http://www.nottingham.ac.uk/vsms, Virtual Hyperglossary
http://www.venus.co.uk/vhg
