Classification: XML Parser Features

David Megginson ak117 at freenet.carleton.ca
Fri Dec 12 15:10:10 GMT 1997


Sean Mc Grath writes:

 > The real truth behind XML's simplicity and ease of implementation is being badly
 > let down by the haziness with which parsers are classified:-
 > 
 > Well Formed
 > Valid
 > Type Valid (In the DOM level 1 spec.)
 > Tag Valid (ditto)
 > DTD Aware (Aelfred)

I'd suggest that there are at least three logically separate realms
here, all of which we've been overloading onto a single set of
terminology.  Here's what I suggest:

Realm #1: Functionality

a) Scanning
     This type of parser simply skips the DOCTYPE declaration (using
   regular expressions) and parses the markup in the document
   instance.  It is not required to handle any but the built-in
   entities, and as a result, does not include any external entities.
   For the purposes of whitespace handling, it assumes that all
   specified attributes are CDATA and that all elements have mixed
   content.
     Optionally, a scanning parser may attempt to extract some
   information from the DOCTYPE declaration, such as entity
   declarations and attribute default values.

b) DTD-driven
     This type of parser reads the DTD (both internal and external
   subsets) to obtain entity declarations, attribute declarations, and
   element-type declarations.  It handles any entities declared in the
   DTD (internal or external), and provides default values when
   attributes are not specified.  For the purposes of whitespace
   handling, it uses the declared type for each attribute, and
   distinguishes between element types with element content and
   elements with mixed content.
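
To make the "scanning" case concrete, here is a minimal sketch in
Python (not taken from any parser discussed here).  It skips the
DOCTYPE declaration with a regular expression, expands only the five
built-in entities, and treats every attribute as CDATA and every
element as mixed content, so all whitespace is kept.  The names scan,
expand, DOCTYPE, TOKEN and ATTR are hypothetical, and the sketch
assumes the input has no comments, processing instructions, or CDATA
sections, and no ">" or "]" inside quoted literals.

```python
import re

# Matches a DOCTYPE declaration, with or without an internal subset.
# Assumption: no ">" or "]" appears inside quoted literals in the declaration.
DOCTYPE = re.compile(r"<!DOCTYPE[^\[>]*(\[[^\]]*\])?\s*>")
# A start, end, or empty-element tag, or a run of character data.
TOKEN = re.compile(r"<(/?)([A-Za-z_:][\w:.-]*)([^<>]*?)(/?)>|([^<]+)")
ATTR = re.compile(r"""([A-Za-z_:][\w:.-]*)\s*=\s*(?:"([^"]*)"|'([^']*)')""")
BUILTIN = {"lt": "<", "gt": ">", "amp": "&", "apos": "'", "quot": '"'}


def expand(text):
    # Only the built-in entities are known; any other reference is left as-is.
    return re.sub(r"&(\w+);", lambda m: BUILTIN.get(m.group(1), m.group(0)), text)


def scan(source):
    """Return a flat list of start/end/text events for the document instance."""
    body = DOCTYPE.sub("", source, count=1)  # skip the DOCTYPE, unread
    events = []
    for m in TOKEN.finditer(body):
        close, name, attrs, empty, text = m.groups()
        if text is not None:
            events.append(("text", expand(text)))  # mixed content: keep everything
        elif close:
            events.append(("end", name))
        else:
            # Every attribute is treated as CDATA: the value is kept verbatim.
            values = {a[0]: expand(a[1] or a[2]) for a in ATTR.findall(attrs)}
            events.append(("start", name, values))
            if empty:
                events.append(("end", name))
    return events
```

Note that a reference to an entity declared in the skipped DOCTYPE
(say, &e;) passes through unexpanded, which is exactly the gap that a
DTD-driven parser closes.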


Realm #2: Validation

a) Non-validating
     This type of parser assumes that its input document is both
   well-formed and valid, and is not required to report any errors at
   all.
     Optionally, a non-validating parser may report some lexical or
   DTD-related errors, but it does not qualify as a well-formed or
   validating parser unless it reports _all_ relevant errors.

b) Well-formed
     This type of parser reports any lexical errors in an XML document
   (including well-formedness constraints in the spec), but is not
   required to report DTD-related errors (such as attribute-type
   mismatches, elements out of context, etc.).  A well-formed parser
   must report an error for all 141 tests in James Clark's test suite.
     Optionally, a well-formed parser may report some DTD-related
   errors, but it does not qualify as a validating parser unless it
   reports _all_ DTD-related errors.

c) Validating
     A validating parser must report all of the errors reported by a
   well-formed parser, together with all DTD-related errors ("validity
   constraints" in the spec), such as elements in contexts not allowed
   by the current content model, attempts to change #FIXED attributes,
   failure to specify #REQUIRED attributes, unresolved IDREFS, and
   attribute-type mismatches.
     Validating parsers must provide DTD-driven functionality.
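
The gap between the well-formed and validating levels can be seen
with a short Python sketch.  It uses the expat bindings from the
Python standard library purely as a stand-in: expat checks
well-formedness but does not validate.  The helper name
well_formedness_error is hypothetical.

```python
import xml.parsers.expat


def well_formedness_error(document):
    """Return None if expat accepts the document, else the error message."""
    parser = xml.parsers.expat.ParserCreate()
    try:
        parser.Parse(document, True)  # True marks this as the final chunk
    except xml.parsers.expat.ExpatError as err:
        return str(err)
    return None


# A lexical error: the end tag does not match the open element.
NOT_WELL_FORMED = "<a><b></a>"

# Well-formed but not valid: "a" is declared EMPTY yet has content.
# A validating parser would have to report this; expat does not.
INVALID = "<!DOCTYPE a [<!ELEMENT a EMPTY>]><a>text</a>"
```

Here well_formedness_error(NOT_WELL_FORMED) returns a message, while
well_formedness_error(INVALID) returns None, because the only problem
in the second document is a DTD-related one.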


Realm #3: Interface

a) Event-based
     An event-based parser returns a series of XML document events,
   such as character data or the start or end of an element, usually
   through call-backs to user-defined handlers.  Events are returned
   in the order that they occur in the XML source document.

b) Tree-based
     A tree-based parser builds an in-memory tree of an entire
   document, then provides some means for the user to navigate the
   tree.  The user is not constrained to navigating the tree in the
   order in which it was parsed.  Tree-based parsers are often built
   on top of an event-based layer.
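
The two interface styles can be compared side by side with a small
Python sketch.  It relies on the xml.sax and xml.dom.minidom modules
from the Python standard library as stand-ins for an event-based and
a tree-based parser; these are not the parsers under discussion, and
the names EventLog, event_view and tree_view are hypothetical.

```python
import xml.sax
from xml.dom import minidom

DOC = "<doc><title>Hi</title><p>One</p><p>Two</p></doc>"


class EventLog(xml.sax.ContentHandler):
    """User-defined handlers: each call-back records one document event."""

    def __init__(self):
        super().__init__()
        self.events = []

    def startElement(self, name, attrs):
        self.events.append(("start", name))

    def endElement(self, name):
        self.events.append(("end", name))

    def characters(self, content):
        # Character data may arrive in more than one chunk.
        self.events.append(("chars", content))


def event_view(text):
    # Events arrive strictly in source-document order.
    handler = EventLog()
    xml.sax.parseString(text.encode("utf-8"), handler)
    return handler.events


def tree_view(text):
    # The whole document is in memory, so any traversal order is possible;
    # here the paragraphs are visited last to first.
    dom = minidom.parseString(text)
    paragraphs = dom.getElementsByTagName("p")
    return [p.firstChild.data for p in reversed(paragraphs)]
```

With the event interface the application has to remember whatever it
will need later; with the tree interface it pays for holding the
entire document in memory but can revisit any part of it.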


According to this classification, Ælfred is a DTD-driven,
non-validating, event-based XML parser.
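
Since the three realms are independent, a parser's classification is
simply one choice from each.  As a rough illustration, this could be
modelled in Python as follows; all of the type and field names are
hypothetical, and the only cross-realm rule encoded is the one stated
above, that a validating parser must be DTD-driven.

```python
from dataclasses import dataclass
from enum import Enum


class Functionality(Enum):
    SCANNING = "scanning"
    DTD_DRIVEN = "DTD-driven"


class Validation(Enum):
    NON_VALIDATING = "non-validating"
    WELL_FORMED = "well-formed"
    VALIDATING = "validating"


class Interface(Enum):
    EVENT_BASED = "event-based"
    TREE_BASED = "tree-based"


@dataclass(frozen=True)
class ParserClass:
    functionality: Functionality
    validation: Validation
    interface: Interface

    def __post_init__(self):
        # Validating parsers must provide DTD-driven functionality.
        if (self.validation is Validation.VALIDATING
                and self.functionality is not Functionality.DTD_DRIVEN):
            raise ValueError("a validating parser must be DTD-driven")

    def describe(self):
        return ", ".join(
            [self.functionality.value, self.validation.value, self.interface.value]
        )


AELFRED = ParserClass(
    Functionality.DTD_DRIVEN, Validation.NON_VALIDATING, Interface.EVENT_BASED
)
```

AELFRED.describe() then yields the same three-part label used in the
sentence above.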

There are other realms, including the type of information delivered by
a parser (simple ESIS-like production information, or full information
for an XML editor, such as comments, ignored whitespace, etc.), but I
think that we would do best to standardise a few basic terms first.


All the best,


David

-- 
David Megginson                 ak117 at freenet.carleton.ca
Microstar Software Ltd.         dmeggins at microstar.com
      http://home.sprynet.com/sprynet/dmeggins/




