Classification: XML Parser Features
ak117 at freenet.carleton.ca
Fri Dec 12 15:10:10 GMT 1997
Sean Mc Grath writes:
> The real truth behind XML's simplicity and ease of implementation is being badly
> let down by the haziness with which parsers are classified:-
> Well Formed
> Type Valid (In the DOM level 1 spec.)
> Tag Valid (ditto)
> DTD Aware (Aelfred)
I'd suggest that there are at least three logically separate realms
here, all of which we've been overloading onto the same single set of
terminology. Here's what I suggest:
Realm #1: Functionality
A scanning parser simply skips the DOCTYPE declaration (using
regular expressions) and parses the markup in the document
instance. It is not required to handle any but the built-in
entities, and as a result, does not include any external entities.
For the purposes of whitespace handling, it assumes that all
specified attributes are CDATA and that all elements have mixed
content.
Optionally, a scanning parser may attempt to extract some
information from the DOCTYPE declaration, such as entity
declarations and attribute default values.
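As a rough illustration of the scanning approach, here is a minimal
Python sketch. It is not taken from any existing parser: the names
DOCTYPE_RE, scan and SAMPLE are hypothetical, and the regular expression
assumes that no "]" followed by ">" appears inside a quoted literal in
the internal subset.

```python
import re
import xml.etree.ElementTree as ET

# Rough pattern for a DOCTYPE declaration with an optional internal subset.
# Assumption: no "]" followed by ">" occurs inside a quoted literal.
DOCTYPE_RE = re.compile(r"<!DOCTYPE\s+[^\[>]*(\[.*?\]\s*)?>", re.DOTALL)


def scan(document):
    """Drop the DOCTYPE declaration, then parse only the instance markup."""
    instance = DOCTYPE_RE.sub("", document, count=1)
    return ET.fromstring(instance)


SAMPLE = """<!DOCTYPE memo [
  <!ATTLIST memo status CDATA "draft">
]>
<memo>Hello &amp; goodbye</memo>"""

root = scan(SAMPLE)
print(root.tag, root.attrib, root.text)
```

Because the declaration is discarded before parsing, the default value
for "status" never reaches the application, while the built-in entity
&amp; is still handled.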
A DTD-driven parser reads the DTD (both internal and external
subsets) to obtain entity declarations, attribute declarations, and
element-type declarations. It handles any entities declared in the
DTD (internal or external), and provides default values when
attributes are not specified. For the purposes of whitespace
handling, it uses the declared type for each attribute, and
distinguishes between element types with element content and
elements with mixed content.
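To make the difference concrete, here is a small sketch using Python's
xml.parsers.expat module as a stand-in. It covers only the internal
subset (expat does not fetch the external subset unless asked to), and
the names parse_with_dtd and SAMPLE are hypothetical.

```python
import xml.parsers.expat

SAMPLE = """<!DOCTYPE memo [
  <!ELEMENT memo (#PCDATA)>
  <!ATTLIST memo status CDATA "draft">
  <!ENTITY sig "All the best">
]>
<memo>&sig;, David</memo>"""


def parse_with_dtd(document):
    """Return (attributes of the root element, concatenated character data)."""
    seen = {"attrs": None, "text": []}

    def start(name, attrs):
        # Record the attributes of the first (root) element only.
        if seen["attrs"] is None:
            seen["attrs"] = dict(attrs)

    parser = xml.parsers.expat.ParserCreate()
    parser.StartElementHandler = start
    parser.CharacterDataHandler = seen["text"].append
    parser.Parse(document, True)
    return seen["attrs"], "".join(seen["text"])


attrs, text = parse_with_dtd(SAMPLE)
print(attrs, text)
```

Here the parser supplies the declared default for "status" and expands
the entity declared in the DTD, neither of which the document instance
states explicitly.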
Realm #2: Validation
A non-validating parser assumes that its input document is both
well-formed and valid, and is not required to report any errors at
all.
Optionally, a non-validating parser may report some lexical or
DTD-related errors, but it does not qualify as a well-formed or
validating parser unless it reports _all_ relevant errors.
A well-formed parser reports any lexical errors in an XML document
(including violations of the well-formedness constraints in the spec), but is not
required to report DTD-related errors (such as attribute-type
mismatches, elements out of context, etc.). A well-formed parser
must report an error for all 141 tests in James Clark's test suite.
Optionally, a well-formed parser may report some DTD-related
errors, but it does not qualify as a validating parser unless it
reports _all_ DTD-related errors.
A validating parser must report all of the errors reported by a
well-formed parser, together with all DTD-related errors ("validity
constraints" in the spec), such as elements in contexts not allowed
by the current content model, attempts to change #FIXED attributes,
failure to specify #REQUIRED attributes, unresolved IDREFS, and so on.
Validating parsers must provide DTD-driven functionality.
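The gap between the last two levels can be sketched in Python. The
example assumes xml.parsers.expat, which reports well-formedness errors
but does not validate; the function name well_formedness_error and both
sample documents are hypothetical.

```python
import xml.parsers.expat


def well_formedness_error(document):
    """Return None if the document is well-formed, else an error description."""
    parser = xml.parsers.expat.ParserCreate()
    try:
        parser.Parse(document, True)
    except xml.parsers.expat.ExpatError as err:
        return str(err)
    return None


# Mismatched end tag: a well-formedness error.
NOT_WELL_FORMED = "<memo><title>Hi</memo>"

# Well-formed, but the content model requires a title element.
INVALID_ONLY = """<!DOCTYPE memo [
  <!ELEMENT memo (title)>
  <!ELEMENT title (#PCDATA)>
]>
<memo>no title element here</memo>"""

print(well_formedness_error(NOT_WELL_FORMED))
print(well_formedness_error(INVALID_ONLY))
```

The first document is rejected; the second passes without complaint,
which is exactly the kind of DTD-related error that only a validating
parser is required to report.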
Realm #3: Interface
An event-based parser returns a series of XML document events,
such as character data or the start or end of an element, usually
through call-backs to user-defined handlers. Events are returned
in the order that they occur in the XML source document.
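A short sketch of the call-back style, using Python's xml.sax module as
a stand-in for any event-based interface. The class EventLogger and the
function list_events are hypothetical names chosen for illustration.

```python
import xml.sax


class EventLogger(xml.sax.ContentHandler):
    """User-defined handler that records each event as it arrives."""

    def __init__(self):
        super().__init__()
        self.events = []

    def startElement(self, name, attrs):
        self.events.append(("start", name))

    def characters(self, content):
        # Ignore whitespace-only chunks to keep the log short.
        if content.strip():
            self.events.append(("chars", content))

    def endElement(self, name):
        self.events.append(("end", name))


def list_events(document):
    """Parse a document string and return the recorded events in order."""
    handler = EventLogger()
    xml.sax.parseString(document.encode("utf-8"), handler)
    return handler.events


events = list_events("<memo><to>xml-dev</to><body>Hello</body></memo>")
for event in events:
    print(event)
```

The handler never sees the document as a whole; it only learns about
each piece of markup at the moment the parser reaches it.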
A tree-based parser builds an in-memory tree of an entire
document, then provides some means for the user to navigate the
tree. The user is not constrained to navigating the tree in the
order in which it was parsed. Tree-based parsers are often built on
top of an event-based layer.
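The layering can be sketched as follows, again in Python with
xml.parsers.expat supplying the events. The Node class and build_tree
function are hypothetical and keep only element names, attributes,
children, and text.

```python
import xml.parsers.expat


class Node:
    """One element in the in-memory tree."""

    def __init__(self, name, attrs):
        self.name = name
        self.attrs = attrs
        self.children = []
        self.text = ""


def build_tree(document):
    """Build a tree of Node objects from the parser's event stream."""
    root = None
    stack = []  # currently open elements, innermost last

    def start(name, attrs):
        nonlocal root
        node = Node(name, dict(attrs))
        if stack:
            stack[-1].children.append(node)
        else:
            root = node
        stack.append(node)

    def end(name):
        stack.pop()

    def chars(data):
        stack[-1].text += data

    parser = xml.parsers.expat.ParserCreate()
    parser.StartElementHandler = start
    parser.EndElementHandler = end
    parser.CharacterDataHandler = chars
    parser.Parse(document, True)
    return root


tree = build_tree("<memo><to>xml-dev</to><body>Hello</body></memo>")
# Visit the children in reverse of their document order.
for child in reversed(tree.children):
    print(child.name, child.text)
```

Once the tree exists, the user can visit nodes in any order, at the
cost of holding the entire document in memory.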
According to this classification, Ælfred is a DTD-driven,
non-validating, event-based XML parser.
There are other realms, including the type of information delivered by
a parser (simple ESIS-like production information, or full information
for an XML editor, such as comments, ignored whitespace, etc.), but I
think that we would do best to standardise a few basic terms first.
All the best,
David Megginson ak117 at freenet.carleton.ca
Microstar Software Ltd. dmeggins at microstar.com
xml-dev: A list for W3C XML Developers. To post, mailto:xml-dev at ic.ac.uk
Archived as: http://www.lists.ic.ac.uk/hypermail/xml-dev/
List coordinator, Henry Rzepa (mailto:rzepa at ic.ac.uk)