REX: XML Shallow Parsing with Regular Expressions

Rob Cameron cameron at cs.sfu.ca
Thu Dec 3 20:06:08 GMT 1998


Recently I've been having a great deal of fun building XML shallow
parsers using regular expressions. The result is REX 1.0, as
documented in the paper described below. The fun comes from
several cute techniques (shallow parsing with a single regular
expression, literate regular expression programming, UTF-8 processing
using 8-bit extended ASCII regular expression packages) that
combine in a very nice way. In particular, REX parsers for XML
are generated from an XML representation of regular expressions,
which is processed by tools written using REX! Needless to say,
initial hand-written parsers were needed to bootstrap the process.

Fun aside, I think there is serious room for REX in the area of lightweight
XML tool implementation.  I'd be interested in feedback from the
XML development community about possible applications of REX.

Robert D. Cameron, "REX: XML Shallow Parsing with Regular Expressions",
CMPT TR 1998-17, School of Computing Science, Simon Fraser University,
November 1998.
http://www.cs.sfu.ca/~cameron/REX.html

Abstract

The syntax of XML is simple enough that it is possible to parse an XML
document into a list of its markup and text items using a single regular
expression. Such a shallow parse of an XML document can be very useful
for the construction of a variety of lightweight XML processing tools.
However, complex regular expressions can be difficult to construct and
even more difficult to read. Using a form of literate programming for
regular expressions, this paper documents a set of XML shallow parsing
expressions that can be used as a basis for simple, correct, efficient, robust
and language-independent XML shallow parsing. Complete shallow
parser implementations of less than 50 lines each in Perl, JavaScript and
Lex/Flex are also given. 
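
To give a feel for what a single-expression shallow parse looks like,
here is a rough sketch in Python. It is not the REX 1.0 expression from
the paper: the pattern and the names SHALLOW_XML and shallow_parse are
made up for illustration, and the tag rule is deliberately simplified
(it would mis-split an attribute value containing '>' and does not
handle a DOCTYPE with an internal subset). The commented, verbose
pattern style is only loosely in the spirit of the literate approach.

```python
import re

# Simplified illustration only; NOT the REX 1.0 expression from the paper.
# Alternatives are tried left to right at each position.
SHALLOW_XML = re.compile(
    r"""
    <!--.*?-->            # comment
  | <!\[CDATA\[.*?\]\]>   # CDATA section
  | <\?.*?\?>             # processing instruction
  | <[^<>]*>              # start, end, or empty-element tag (simplified)
  | [^<]+                 # run of character data
  | <                     # stray '<', kept so that no input is dropped
    """,
    re.VERBOSE | re.DOTALL,
)


def shallow_parse(xml_text):
    """Split xml_text into a flat list of markup and text items."""
    return SHALLOW_XML.findall(xml_text)


if __name__ == "__main__":
    for item in shallow_parse('<a x="1">hi<!-- c --><b/></a>'):
        print(repr(item))
```

In this sketch, joining the returned items always reproduces the input
string, because every character matches at least one alternative. That
makes the flat item list a convenient base for small filtering or
rewriting tools.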

Robert D. Cameron, Associate Professor           cameron at cs.sfu.ca
School of Computing Science                      FAX: (604) 291-3045
Simon Fraser University
Burnaby, B.C., Canada  V5A 1S6

Internet Electronic Library Project at SFU       
http://elib.cs.sfu.ca/






