Parser considerations (was: MS XML parser only works with IE...)

Peter Murray-Rust peter at ursus.demon.co.uk
Wed Nov 26 01:17:31 GMT 1997


At 14:13 25/11/97 -0500, [many people] wrote about MSXML

Some of the things we mustn't forget at this time are:
	- there is as yet no frozen XML 'recommendation' (I hope that's the
correct term). Under those circumstances it is unlikely that there are any
completely conforming parsers; the spec is still changing and so any parser
has addressed a moving target. 
	- for many people helping in the development of XML the question of 'best
parser' is not appropriate at this stage - and I suspect not for at least 3
months. The spec is quite large and is a lot of effort to implement (those
of us who have hacked parsers know). Many of us give up on points we don't
understand (for me it was parameter entities, and that caused others grief
as well :-). So until we see the next spec [is there a later public one
than Aug 7?] we can't be sure whether a parser 'gets PEs right' :-). I
sympathise with anyone who has failed to implement part of the current
spec, and I hope that people trying out parsers and other software will
take a constructive view of such 'failings'.
	- I believe that all parser writers at present would like their parsers
validated. Validation *of* a parser seems to me to include checks on
		- reporting errors in non-conforming XML documents
		- asserting that a conforming XML document is conforming
		- carrying out defined transformations on the original input
All of these require a set of test inputs, which I believe we badly need at
present. It is very likely that a parser writer at present will overlook
something in the spec.
	Checking the transformations is less easy as there is no defined output.
How, for example, do we check that parser A transforms all the entities
correctly? An important way is to make sure that the outputs of two
independent parsers agree. To this extent, whatever we think about
'steenking ESIS' [a quote from the source code of a well known XML parser],
it is at least checkable :-)
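
A minimal sketch of the "two independent parsers must agree" check, in Java.
It assumes a SAX-style event interface (SAX postdates this message, so it is
purely an illustration) and writes a simplified ESIS-like listing: 'A' lines
for attributes, '(' and ')' for element start and end, '-' for character data.
The class and method names (EsisDump, dump, firstDifference) are hypothetical.
Attributes are sorted and adjacent text chunks merged so that harmless
differences in how a parser delivers events do not show up as disagreements.

```java
import java.io.StringReader;
import java.util.Map;
import java.util.TreeMap;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class EsisDump {
    /** Parse xml with the given factory and return a simplified ESIS-like listing. */
    public static String dump(SAXParserFactory factory, String xml) throws Exception {
        final StringBuilder out = new StringBuilder();
        DefaultHandler handler = new DefaultHandler() {
            // Parsers may split character data into several chunks; merge them.
            private final StringBuilder text = new StringBuilder();

            private void flushText() {
                if (text.length() > 0) {
                    out.append('-').append(text.toString().replace("\n", "\\n")).append('\n');
                    text.setLength(0);
                }
            }

            @Override
            public void startElement(String uri, String local, String qName, Attributes atts) {
                flushText();
                // Sort attributes by name so that attribute order cannot cause a mismatch.
                Map<String, String> sorted = new TreeMap<>();
                for (int i = 0; i < atts.getLength(); i++) {
                    sorted.put(atts.getQName(i), atts.getValue(i));
                }
                for (Map.Entry<String, String> e : sorted.entrySet()) {
                    out.append('A').append(e.getKey()).append(' ').append(e.getValue()).append('\n');
                }
                out.append('(').append(qName).append('\n');
            }

            @Override
            public void endElement(String uri, String local, String qName) {
                flushText();
                out.append(')').append(qName).append('\n');
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                text.append(ch, start, length);
            }
        };
        factory.newSAXParser().parse(new InputSource(new StringReader(xml)), handler);
        return out.toString();
    }

    /** Index of the first line on which two listings differ, or -1 if they agree. */
    public static int firstDifference(String a, String b) {
        String[] x = a.split("\n");
        String[] y = b.split("\n");
        int n = Math.min(x.length, y.length);
        for (int i = 0; i < n; i++) {
            if (!x[i].equals(y[i])) {
                return i;
            }
        }
        return x.length == y.length ? -1 : n;
    }
}
```

To compare parser A with parser B, call dump once with each parser's factory
on every document in the test set and report any document for which
firstDifference is not -1.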

	- the really hard bit comes when the semantics of behaviour are unclear.
Does the statement <!DOCTYPE CML SYSTEM "cml.dtd"> require the parser to
*do* anything? Different authors will certainly have different ideas - some
see it as a request by the author that the document must be validated,
others as advice from the author that if the reader wishes to validate it,
then this is the doctype that should be used.
There are many subtleties of this sort.
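
To make the second reading concrete, here is a hedged sketch in Java using
the JAXP SAX interface (which postdates this message; it is an assumption of
the example, not something the draft spec defines). Under this reading the
DOCTYPE only names the DTD, and it is the reader who decides whether to
validate, here through setValidating. The class and method names are
hypothetical, and the DTD is given as an internal subset rather than
SYSTEM "cml.dtd" so that the example is self-contained.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

public class DoctypeChoice {
    /** Parse xml and count validity errors; the caller chooses whether to validate. */
    public static int validityErrors(String xml, boolean validate) throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setValidating(validate); // the reader's decision, not the document's
        final int[] count = {0};
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void error(SAXParseException e) {
                // Validity violations are reported here as recoverable errors.
                count[0]++;
            }
        };
        factory.newSAXParser().parse(new InputSource(new StringReader(xml)), handler);
        return count[0];
    }

    public static void main(String[] args) throws Exception {
        // Well-formed but invalid: CML is declared to hold text only.
        String doc = "<!DOCTYPE CML [<!ELEMENT CML (#PCDATA)>]><CML><atom/></CML>";
        System.out.println("without validation: " + validityErrors(doc, false));
        System.out.println("with validation:    " + validityErrors(doc, true));
    }
}
```

The same document is accepted silently in one mode and draws complaints in
the other, which is exactly the difference in behaviour that the two readings
of the DOCTYPE statement imply.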

	I believe that the development of XML has been one of the outstanding
achievements of the WWW. It has been fast, rigorous, fair, open, and
required extraordinary commitment and patience from those involved. Often
the SIG has had 50 emails a day, and many have required a great deal of
careful reading.

	I have been very gratified by the level and amount of constructive
contributions to XML-DEV as this is an important area for ironing out the
parts the spec cannot reach. I remember the agonies of early C++ compilers
where every platform and vendor had messages 'this feature not supported'
and so on. I believe that all contributors on this list want to avoid this
and to ensure that 'any valid XML document can be parsed with any XML
parser'. Since some parsers may purport to be XML compliant without being
so, it is critical that this fact can be recognised, and a test suite of
documents seems to be a
key instrument. I hope very much that authors of such parsers will be able
to find the energy to mend them :-)

	If - at some future time - I were looking for attractive features in an
XML parser then, after discarding the non-compliant ones, I would want to
consider a wide range, and I doubt that any one parser would 'win' in all
aspects. To this end I am trying to make JUMBO accept a range of parsers by
a simple commandline switch (or button). Thus:
	java jumbo.sgml.SGMLTree foo.xml parser=NXP (or Lark)
I can quite envisage a case where a user wants to use parser A to read in
the initial document (perhaps because it is large, or tree-structured) and
parser B to read the entities.
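
A switch of this kind could be sketched roughly as follows. This is not
JUMBO's code: ParserSwitch, ParserAdapter and select are hypothetical names,
and the two registered adapters are placeholders that do not call the real
NXP or Lark parsers. The only thing taken from the command line above is the
parser=NAME argument form.

```java
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

public class ParserSwitch {
    /** Hypothetical wrapper that each supported parser would sit behind. */
    public interface ParserAdapter {
        String describe();
    }

    // Registry of known adapters, keyed by lower-case name.
    private static final Map<String, ParserAdapter> ADAPTERS = new TreeMap<>();
    static {
        // Placeholders only; a real adapter would delegate to the parser itself.
        ADAPTERS.put("nxp", () -> "placeholder adapter for NXP");
        ADAPTERS.put("lark", () -> "placeholder adapter for Lark");
    }

    /** Pick the adapter named by a parser=NAME argument, or the default if none is given. */
    public static ParserAdapter select(String[] args, String defaultName) {
        String name = defaultName;
        for (String arg : args) {
            if (arg.startsWith("parser=")) {
                name = arg.substring("parser=".length());
            }
        }
        ParserAdapter adapter = ADAPTERS.get(name.toLowerCase(Locale.ROOT));
        if (adapter == null) {
            throw new IllegalArgumentException(
                "unknown parser: " + name + " (known: " + ADAPTERS.keySet() + ")");
        }
        return adapter;
    }

    public static void main(String[] args) {
        System.out.println(select(args, "lark").describe());
    }
}
```

Keeping the lookup in one table means that adding a further parser is a
one-line registration, and an unknown name fails with a message listing the
parsers that are available.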

I am delighted to hear about WORA-MSXML, and shall hope to look at it
shortly. I hope it's easy to bolt into JUMBO.

	I am slightly disappointed that Xapi-J seems to have become dormant,
because with it the work inside JUMBO would be minimal. At present most of the
parsers I have encountered are event-driven (e.g. doStartTag, doError...)
and not all build trees (JUMBO is happy to build trees from streams). If,
indeed, this is the model most people use, then let's get a standard
terminology (Element, PI, ElementType, Attribute, etc.). It would make
things so much simpler. I also expect we could get a very very simple API
defined...
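
As a rough illustration of how small such an API could be, here is a sketch
of an event interface and a tree builder that consumes it. Only the names
doStartTag and doError, and the terms Element and ElementType, come from the
text above; doEndTag, doCharacters, StreamToTree and TreeBuilder are
hypothetical, and attributes and PIs are left out to keep it short. It is not
the API of Xapi-J, JUMBO or any existing parser.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

public class StreamToTree {
    /** Hypothetical event interface that an event-driven parser would call. */
    public interface EventHandler {
        void doStartTag(String elementType);
        void doCharacters(String text);
        void doEndTag(String elementType);
        void doError(String message);
    }

    /** A tree node: an element type plus child Elements and text Strings. */
    public static class Element {
        public final String elementType;
        public final List<Object> children = new ArrayList<>();

        Element(String elementType) {
            this.elementType = elementType;
        }

        @Override
        public String toString() {
            StringBuilder sb = new StringBuilder("<" + elementType + ">");
            for (Object child : children) {
                sb.append(child);
            }
            return sb.append("</").append(elementType).append(">").toString();
        }
    }

    /** Builds a tree from the event stream using a stack of open elements. */
    public static class TreeBuilder implements EventHandler {
        private final Deque<Element> open = new ArrayDeque<>();
        private Element root;
        public final List<String> errors = new ArrayList<>();

        public void doStartTag(String elementType) {
            Element e = new Element(elementType);
            if (!open.isEmpty()) {
                open.peek().children.add(e);
            } else if (root == null) {
                root = e;
            } else {
                doError("second root element " + elementType);
            }
            open.push(e);
        }

        public void doCharacters(String text) {
            if (open.isEmpty()) {
                doError("text outside root element");
            } else {
                open.peek().children.add(text);
            }
        }

        public void doEndTag(String elementType) {
            if (open.isEmpty() || !open.peek().elementType.equals(elementType)) {
                doError("unexpected end tag " + elementType);
                return;
            }
            open.pop();
        }

        public void doError(String message) {
            errors.add(message);
        }

        public Element root() {
            return root;
        }
    }
}
```

Any parser that can call these four methods could feed the same tree builder,
which is the sense in which an agreed terminology would keep the work on the
application side small.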

	P.


Peter Murray-Rust, Director Virtual School of Molecular Sciences, domestic
net connection
VSMS http://www.nottingham.ac.uk/vsms, Virtual Hyperglossary
http://www.venus.co.uk/vhg

xml-dev: A list for W3C XML Developers. To post, mailto:xml-dev at ic.ac.uk
Archived as: http://www.lists.ic.ac.uk/hypermail/xml-dev/
To (un)subscribe, mailto:majordomo at ic.ac.uk the following message;
(un)subscribe xml-dev
To subscribe to the digests, mailto:majordomo at ic.ac.uk the following message;
subscribe xml-dev-digest
List coordinator, Henry Rzepa (mailto:rzepa at ic.ac.uk)



