Classification: XML Parser Features
tbray at textuality.com
Sat Dec 13 01:06:50 GMT 1997
At 12:17 PM 12/12/97 -0500, David Megginson wrote:
>Creating a truly well-formed parser is very, very difficult, because
>of the enormous number of constraints imposed both explicitly and
>implicitly by the grammar (I could probably write a full SGML parser
>with about the same level of effort, especially if I limited myself to
>a single, simple SGML declaration).
To start with, "full SGML parser" directly contradicts "a single
SGML declaration": the abstract syntax is in fact one of the things
that make a full parser hard to write.
As to David's main point, that a WF parser is hard to write, I don't
agree; most of the work can be done in the low-level lexer, and the
number of constraints that require ad-hoc code is pretty small. Two
things are in fact hard, it seems:
1. handling multiple input encodings, and
2. making it run real fast while you're doing #1.
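To make point 1 concrete, here is a minimal sketch (in Python, not taken from Lark) of the first step in handling multiple input encodings: guessing the encoding family from the first few bytes of an entity, in the spirit of the autodetection guidance in the XML spec. The names sniff_encoding and decode_entity are hypothetical. The sketch assumes only UTF-8 and UTF-16 input, ignores UTF-32 and EBCDIC, and does not read the encoding declaration, all of which a real parser would have to deal with.

```python
import codecs


def sniff_encoding(head: bytes) -> str:
    """Guess a Python codec name from the first bytes of an XML entity.

    Hypothetical helper: covers only UTF-8 and UTF-16, with or without
    a byte order mark. Anything unrecognized falls back to UTF-8.
    """
    # A byte order mark, when present, settles the question.
    if head.startswith(codecs.BOM_UTF8):
        return "utf-8-sig"  # this codec strips the BOM while decoding
    if head.startswith((codecs.BOM_UTF16_BE, codecs.BOM_UTF16_LE)):
        return "utf-16"  # this codec reads the BOM to pick the byte order
    # No BOM: look for "<?" as it appears in each 16-bit byte order.
    if head.startswith(b"\x00<\x00?"):
        return "utf-16-be"
    if head.startswith(b"<\x00?\x00"):
        return "utf-16-le"
    # Assumption: everything else is treated as UTF-8.
    return "utf-8"


def decode_entity(data: bytes) -> str:
    """Decode a whole entity using the sniffed encoding."""
    return data.decode(sniff_encoding(data[:4]))
```

Decoding the whole buffer in one call is the easy part. Point 2 is about doing the same thing incrementally, character by character, inside the lexer's inner loop, without giving up speed.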
These don't really bother me that much, as we are in the infancy of
learning the right way to build truly internationalized software.
For example, I can parse the UTF-16 Japanese version of the XML spec
in a few seconds; then it takes the best part of a minute to load
the .ttf for the Unicode font so you can look at anything. So we
have a few problems in this area.
Having said that, I am now in the middle of coding up validation for
Lark, and there are a TREMENDOUS NUMBER of irritating little
details involved. No rocket science at all, but the code is going
to be substantially larger than the rest of Lark, and it's all real
code, whereas more than half of Lark is compressed parser tables.
Mind you, the validator is in a separate package and can be bypassed, so
Lark effectively need be no larger. But still, I wonder whether validation
is intrinsically hard or whether we could have found a better 80/20 point. -Tim