XML parser using lex & yacc

Richard Tobin richard at cogsci.ed.ac.uk
Wed Sep 1 17:45:01 BST 1999

> I want to develop an XML parser in C or maybe C++ for an
> undergraduate university project. My approach will be to prototype
> the parser using flex and bison. As I understand it, flex won't be
> able to handle all of the character encodings required in the
> 1.0 spec.

Using your own lexer may be the best approach, but all the "syntax
characters" of XML are plain ASCII, so it might well be possible to
use [f]lex to tokenise it.  For UTF-8 it is straightforward: no byte
of a multibyte character falls in the ASCII range, so the lexer
doesn't even have to know that a multibyte character is anything other
than a run of separate characters - the next level up can translate
them.
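
As a rough illustration of the UTF-8 point (a hand-written sketch, not
flex output; both function names are made up for this example), a
byte-oriented scanner only has to look for the ASCII syntax characters,
much as a pattern like [^<&]+ would in an 8-bit scanner, and a later
stage can work out where the real characters are:

```c
#include <stddef.h>

/* Hypothetical helper: length in bytes of the run of character data
 * starting at s, i.e. everything up to the next '<' or '&' (or the
 * end of the string).  Bytes >= 0x80 are passed through untouched;
 * in UTF-8 they can never be mistaken for an ASCII syntax character. */
size_t chardata_length(const char *s)
{
    size_t n = 0;
    while (s[n] != '\0' && s[n] != '<' && s[n] != '&')
        n++;
    return n;
}

/* Hypothetical helper for "the next level up": number of code points
 * in n bytes of well-formed UTF-8.  Every byte that is not a
 * continuation byte (10xxxxxx) starts a new character. */
size_t utf8_codepoints(const char *s, size_t n)
{
    size_t i, count = 0;
    for (i = 0; i < n; i++)
        if (((unsigned char)s[i] & 0xC0) != 0x80)
            count++;
    return count;
}
```

The scanner never inspects the high-bit bytes; only the second
function cares that two of them make up one character.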

Or you might be able to replace the lexer's input functions and change
its character type to integer (if it isn't already); this would work
for UTF-16 (the other required encoding) too.
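
A sketch of what such a replacement input function might do for
UTF-16, assuming big-endian input with any byte order mark already
consumed.  The function name and the choice to emit U+FFFD for an
unpaired surrogate are assumptions of this example; whether a given
lex implementation can actually be fed integers this way (for flex,
input is normally supplied through the YY_INPUT macro) would need
checking.

```c
#include <stddef.h>

/* Hypothetical decoder: turn big-endian UTF-16 bytes into integer
 * code points.  Returns the number of code points written to out.
 * Unpaired surrogates become U+FFFD (a choice made for this sketch);
 * a trailing odd byte is ignored. */
size_t utf16be_to_codepoints(const unsigned char *in, size_t nbytes,
                             unsigned long *out, size_t max_out)
{
    size_t i = 0, n = 0;
    while (i + 1 < nbytes && n < max_out) {
        unsigned long u = ((unsigned long)in[i] << 8) | in[i + 1];
        i += 2;
        if (u >= 0xD800 && u <= 0xDBFF && i + 1 < nbytes) {
            /* High surrogate: combine with a following low surrogate. */
            unsigned long lo = ((unsigned long)in[i] << 8) | in[i + 1];
            if (lo >= 0xDC00 && lo <= 0xDFFF) {
                u = 0x10000 + ((u - 0xD800) << 10) + (lo - 0xDC00);
                i += 2;
            } else {
                u = 0xFFFD;
            }
        } else if (u >= 0xD800 && u <= 0xDFFF) {
            /* Lone low surrogate, or high surrogate at end of input. */
            u = 0xFFFD;
        }
        out[n++] = u;
    }
    return n;
}
```

The lexer then sees one integer per character, and the ASCII syntax
characters keep their usual values.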

The most obvious problem with using yacc/lex type tools for XML is
that keywords aren't always keywords.  For example, in some places
in the DTD "SYSTEM" is a keyword and in others it would just be
a name.  You can have the parser switch the lexer between states
but it's not pretty.
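
To make the problem concrete: in <!DOCTYPE doc SYSTEM "doc.dtd"> the
word SYSTEM is a keyword, but in <!ELEMENT SYSTEM (#PCDATA)> it is
just the element's name.  A minimal sketch of the state-switching
approach follows; the struct, the flag and the token names are all
invented for this example, and with flex the same effect would more
likely be had with start conditions.

```c
#include <string.h>

enum token { TOK_NAME, TOK_SYSTEM, TOK_PUBLIC };

/* Hypothetical lexer state.  The parser sets expect_external_id when
 * the grammar has reached a point where an external ID may begin, and
 * clears it again afterwards. */
struct lexer {
    int expect_external_id;
};

/* Classify a name-shaped lexeme.  SYSTEM and PUBLIC are keywords only
 * while the parser has asked for them; otherwise they are names. */
enum token classify_name(const struct lexer *lx, const char *text)
{
    if (lx->expect_external_id) {
        if (strcmp(text, "SYSTEM") == 0)
            return TOK_SYSTEM;
        if (strcmp(text, "PUBLIC") == 0)
            return TOK_PUBLIC;
    }
    return TOK_NAME;
}
```

Part of what makes this ugly in practice is timing: a yacc-style
parser may already have read a lookahead token before the action that
flips the flag runs.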

-- Richard
