XML parser using lex & yacc

Wed Sep 1 18:33:18 BST 1999

This note is only for people really interested in low-level parser
implementation details; others, please ignore.

Richard Tobin wrote:
> 
> > I want to develop an XML parser in C or maybe C++ for an
> > undergraduate university project. My approach will be to prototype
> > the parser using flex and bison.

> Using your own lexer may be the best approach

Probably not for an undergraduate project.  Too error prone and time
consuming, I'd wager.  And in fact, if XML had to be approached this way
I'd say that it was fundamentally mis-designed.  It would be a very bad
mistake to design a modern markup language this way.

I don't know if the XML designers thought about these issues, but it is
possible, with a few kludges, to parse it with Flex and Bison.

> Or you might be able to replace the lexer's input functions and change
> its character type to integer (if it isn't already); this would work
> for UTF-16 (the other required encoding) too.

Get the flex source at prep.ai.mit.edu (/pub/gnu/flex or whatever) and
patch the source with James Lauth's Unicode patches:
ftp://ftp.lauton.com/pub/flex-2.5.4-unicode-patch.tar.gz.

Override the default Flex input routine with one that checks the file
format.  All it has to do is look at the first few bytes of the XML
decl, as per the relevant appendix to the XML spec, and then read the
entire XML decl for an encoding decl.  You then rewind the file and
store the file's format and other important information in a lookup
table.  When reading in characters, use that lookup table to determine
what translation to use for that file.  Convert everything to UCS-4, or
perhaps UTF-16, internally, so the above Flex patches will work.  You
only need to do the format check once for every file, since thereafter
the lookup table may be consulted.
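
As a rough illustration of the first step of that check, here is a
minimal sketch in C.  It only guesses the encoding family from the
first four bytes, using byte patterns listed in the autodetection
appendix of the XML 1.0 spec.  Reading the encoding decl, rewinding,
and the lookup table are left out.  The names xml_encoding_family and
guess_encoding_family are made up for this example; they do not come
from flex or from the patches mentioned above.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical encoding families; the names are illustrative only. */
typedef enum {
    ENC_UCS4_BE,
    ENC_UCS4_LE,
    ENC_UTF16_BE,
    ENC_UTF16_LE,
    ENC_ASCII_COMPATIBLE, /* "<?xm" seen: go on to read the encoding decl */
    ENC_EBCDIC,           /* "<?xm" in EBCDIC: likewise read the decl */
    ENC_UTF8_DEFAULT      /* nothing recognized: assume UTF-8 */
} xml_encoding_family;

/* Return 1 if the first n bytes of buf equal pat, else 0. */
static int has_prefix(const unsigned char *buf, size_t len,
                      const unsigned char *pat, size_t n)
{
    return len >= n && memcmp(buf, pat, n) == 0;
}

/* Guess the encoding family from the first bytes of an XML entity.
 * This is not exhaustive: the unusual UCS-4 byte orders are omitted. */
xml_encoding_family guess_encoding_family(const unsigned char *buf,
                                          size_t len)
{
    static const unsigned char ucs4_be[]     = {0x00, 0x00, 0x00, 0x3C};
    static const unsigned char ucs4_le[]     = {0x3C, 0x00, 0x00, 0x00};
    static const unsigned char utf16_be[]    = {0xFE, 0xFF};
    static const unsigned char utf16_le[]    = {0xFF, 0xFE};
    static const unsigned char utf16_be_nb[] = {0x00, 0x3C, 0x00, 0x3F};
    static const unsigned char utf16_le_nb[] = {0x3C, 0x00, 0x3F, 0x00};
    static const unsigned char ascii_decl[]  = {0x3C, 0x3F, 0x78, 0x6D};
    static const unsigned char ebcdic_decl[] = {0x4C, 0x6F, 0xA7, 0x94};

    if (has_prefix(buf, len, ucs4_be, 4))     return ENC_UCS4_BE;
    if (has_prefix(buf, len, ucs4_le, 4))     return ENC_UCS4_LE;
    if (has_prefix(buf, len, utf16_be, 2))    return ENC_UTF16_BE;
    if (has_prefix(buf, len, utf16_le, 2))    return ENC_UTF16_LE;
    if (has_prefix(buf, len, utf16_be_nb, 4)) return ENC_UTF16_BE;
    if (has_prefix(buf, len, utf16_le_nb, 4)) return ENC_UTF16_LE;
    if (has_prefix(buf, len, ascii_decl, 4))  return ENC_ASCII_COMPATIBLE;
    if (has_prefix(buf, len, ebcdic_decl, 4)) return ENC_EBCDIC;
    return ENC_UTF8_DEFAULT;
}
```

A replacement input routine could call guess_encoding_family once per
file on the first four bytes, store the result, and pick its
translation to UCS-4 from that stored value on every later read.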

> The most obvious problem with using yacc/lex type tools for XML is
> that keywords aren't always keywords.  For example, in some places
> in the DTD "SYSTEM" is a keyword and in others it would just be
> a name.

Just make sure that all your keywords can be both keywords and Name
sequences (you'll see what I mean when you read the spec).  Then write
your syntax rules so that wherever you need an Nmtoken sequence in the
parser it will accept a Name or a Nmtoken sequence (this can easily
be accomplished by having a rule, NameOrNmtoken : Name | Nmtoken, and
by then using NameOrNmtoken wherever you'd be inclined to use Nmtoken).
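
The same idea can be sketched in plain C, outside of any generated
parser.  This is only an illustration under some assumptions: the
token kinds, the three keywords chosen, and the function names are
all hypothetical, and the character classes are cut down to ASCII
(the real Name and Nmtoken productions cover far more of Unicode).
The lexer hands back a keyword token whenever the text matches one,
and the parser-side predicates accept a keyword wherever a Name is
wanted, and a Name or keyword wherever an Nmtoken is wanted.

```c
#include <ctype.h>
#include <string.h>

/* Hypothetical token kinds for this sketch. */
typedef enum {
    TOK_NAME, TOK_NMTOKEN, TOK_SYSTEM, TOK_PUBLIC, TOK_NDATA, TOK_OTHER
} token_kind;

/* ASCII-only approximations of the XML character classes. */
static int is_name_char(int c)
{
    return isalnum(c) || c == '.' || c == '-' || c == '_' || c == ':';
}

static int is_name_start(int c)
{
    return isalpha(c) || c == '_' || c == ':';
}

/* Lexer side: a keyword always comes back as its keyword token. */
token_kind classify_word(const char *text)
{
    const unsigned char *p = (const unsigned char *)text;
    size_t i;

    if (p[0] == '\0')
        return TOK_OTHER;
    for (i = 0; p[i] != '\0'; i++)
        if (!is_name_char(p[i]))
            return TOK_OTHER;
    if (strcmp(text, "SYSTEM") == 0) return TOK_SYSTEM;
    if (strcmp(text, "PUBLIC") == 0) return TOK_PUBLIC;
    if (strcmp(text, "NDATA") == 0)  return TOK_NDATA;
    return is_name_start(p[0]) ? TOK_NAME : TOK_NMTOKEN;
}

/* Parser side: every keyword is also acceptable as a Name ... */
int usable_as_name(token_kind k)
{
    return k == TOK_NAME || k == TOK_SYSTEM ||
           k == TOK_PUBLIC || k == TOK_NDATA;
}

/* ... and anything acceptable as a Name is acceptable as an Nmtoken. */
int usable_as_nmtoken(token_kind k)
{
    return k == TOK_NMTOKEN || usable_as_name(k);
}
```

In a Bison grammar the two predicates would become alternation rules
instead, in the style of the NameOrNmtoken rule above, with each
keyword token listed as one more alternative.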

You'll see what I mean once you start writing the parser.

Would have been a lot easier if XML had introduced the notion of
reserved words.  Would also have been easier if the XML spec had
abandoned the notion of whitespace as a grammatically significant token
inside of markup.  Inside markup it should essentially be ignored at
the parser's (as opposed to lexer's) level.  It's the way virtually
all modern languages are designed.  And I gather (Handbook, 65
[371:16]) that it's largely how SGML should work as well.

-- 

Richard Goerwitz
PGP key fingerprint:    C1 3E F4 23 7C 33 51 8D  3B 88 53 57 56 0D 38 A0
For more info (mail, phone, fax no.):  finger richard at goon.stg.brown.edu

xml-dev: A list for W3C XML Developers. To post, mailto:xml-dev at ic.ac.uk
Archived as: http://www.lists.ic.ac.uk/hypermail/xml-dev/ and on CD-ROM/ISBN 981-02-3594-1