REX: XML Shallow Parsing with Regular Expressions
Rob Cameron
cameron at cs.sfu.ca
Thu Dec 3 20:06:08 GMT 1998
Recently I've been having a great deal of fun building XML shallow
parsers using regular expressions. The result is REX 1.0 as
documented in the paper described below. The fun comes from the
several cute techniques (shallow parsing with a single regular
expression, literate regular expression programming, UTF-8 processing
using 8-bit extended ASCII regular expression packages) that
combine in a very nice way. In particular, REX parsers for XML
are generated from an XML representation of regular expressions
which is processed by tools written using REX! Needless to say,
initial hand-written parsers were needed to bootstrap the process.
Fun aside, I think there is serious room for REX in the area of lightweight
XML tool implementation. I'd be interested in feedback from the
XML development community about possible applications of REX.
Robert D. Cameron, "REX: XML Shallow Parsing with Regular Expressions",
CMPT TR 1998-17, School of Computing Science, Simon Fraser University,
November 1998.
http://www.cs.sfu.ca/~cameron/REX.html
Abstract
The syntax of XML is simple enough that it is possible to parse an XML
document into a list of its markup and text items using a single regular
expression. Such a shallow parse of an XML document can be very useful
for the construction of a variety of lightweight XML processing tools.
However, complex regular expressions can be difficult to construct and
even more difficult to read. Using a form of literate programming for
regular expressions, this paper documents a set of XML shallow parsing
expressions that can be used as a basis for simple, correct, efficient, robust
and language-independent XML shallow parsing. Complete shallow
parser implementations of less than 50 lines each in Perl, JavaScript and
Lex/Flex are also given.
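
To give a flavour of what a shallow parse produces, here is a minimal sketch in Python. It is not the REX expression from the paper: the pattern below is a much-simplified approximation written for illustration, and the names SHALLOW and shallow_parse are hypothetical. It assumes that declarations have no internal subset and does no well-formedness checking. Unterminated markup falls through to the last alternative, so every character of the input ends up in some item.

```python
import re

# Simplified, illustrative pattern (NOT the REX expression from the paper).
# Alternatives are tried left to right, so the more specific markup forms
# must come before the generic tag pattern.
SHALLOW = re.compile(
    r"""
      <!--.*?-->                                    # comment
    | <\?.*?\?>                                     # processing instruction
    | <!\[CDATA\[.*?\]\]>                           # CDATA section
    | <![^>]*>                                      # declaration (assumes no internal subset)
    | <[^<>"']*(?:(?:"[^"]*"|'[^']*')[^<>"']*)*>    # tag; quoted values may contain '>'
    | [^<]+                                         # text
    | <                                             # stray '<', kept so nothing is lost
    """,
    re.VERBOSE | re.DOTALL,
)


def shallow_parse(xml):
    """Split xml into a flat list of markup and text items, in document order."""
    return SHALLOW.findall(xml)


if __name__ == "__main__":
    sample = '<?xml version="1.0"?><doc><!-- note --><p class="a>b">Hi &amp; bye</p><br/></doc>'
    for item in shallow_parse(sample):
        print(repr(item))
```

Because every character lands in exactly one item, joining the items back together reproduces the input unchanged. That property is what makes this kind of parse convenient for lightweight tools that rewrite only selected items and pass the rest through untouched.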
Robert D. Cameron, Associate Professor cameron at cs.sfu.ca
School of Computing Science FAX: (604) 291-3045
Simon Fraser University
Burnaby, B.C., Canada V5A 1S6
Internet Electronic Library Project at SFU
http://elib.cs.sfu.ca/