Partial XML Processors (was Re: JavaScript parser update and Questions)

Fri Jan 16 11:32:59 GMT 1998

At 18:13 15/01/98 -0600, Jeremie Miller wrote:
>I've just updated my JavaScript parser < http://www.jeremie.com/xparse/ >,
>and have a few questions...
>
>First, the update.  Unlike normal software aging, I cut the code size by 50%
>(below 5k w/o my comments) and increased the speed and compatibility.  It
>should work with almost _any_ incarnation of JavaScript.  It now properly
>and according to spec for a well-formed parser understands elements,
>attributes, the prolog, comments, processing instructions, and CDATA
>sections.  What I am working on yet is entities and DOM compatibility(just
>have to print out the spec and read it).

Excellent.

>
>My question is this, being a fairly simple parser, how should I handle
>entities?  I'm confused by the spec as to how a well-formed parser should
>handle them.  Should I parse <!ENTITY definitions in an included DTD, or
>simply handle &amp; &lt; &gt; &quot; &apos; ?  If those are all I should
>handle, which ones where?  The spec does talk about these things, but I
>don't feel right about my interpretation of it.

You are not alone :-). There is a difficult decision here for parser
writers - do they implement everything in the spec or do they go for a
subset? If the latter they are not full XML implementations (and therefore
cannot use the label "XML parser"). If the former, they have a *lot* of
work to do in understanding the spec and getting it right. I have heralded
my own incompetence in understanding NOTATION on this list :-)

Every software writer therefore has to decide whether they are going to
write a fully conformant XML processor. I am not sure whether *anyone* has
yet done this other than James Clark (and those who adapt SGML systems to
process XML). [XML *is* SGML, of course, but you have to use a customised
SGML declaration for standard SGML tools to read XML.] Most of my work is
done with Lark and AElfred and I think they both may have some small bits
to fill in (please forgive if I'm wrong :-). 

For my own parser (Jumbo) I gave up about 6 months ago and do not process
entities (other than the hardcoded ones). That means that if I get a
document that uses them, my parser fails and I switch to Larkfred. (In
fact I'll make one of them the default as soon as the dust settles...)

So you have the following choice:
	- encode the *whole* spec (and nothing but the spec - i.e. no tricky
non-compliant extensions) and give yourself the label "conforming XML tool".
	- encode the bits you feel are cost effective and label it "processes most
XML documents, but gives 'Sorry' messages for some".
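
As a rough illustration of the second choice, here is a minimal JavaScript
sketch that expands only the five hardcoded entities plus numeric character
references, and gives a 'Sorry' message for anything that would need a DTD.
The names PREDEFINED and expandReferences are made up for this example and
are not taken from xparse, Jumbo, Lark or AElfred.

```javascript
// Hypothetical helper, for illustration only.
// The five entities every XML processor must recognise without a DTD.
var PREDEFINED = { amp: "&", lt: "<", gt: ">", quot: '"', apos: "'" };

function expandReferences(text) {
  // Single pass: replacement text is not rescanned, so "&amp;lt;" yields "&lt;".
  return text.replace(/&([^;&\s]+);/g, function (match, name) {
    // Numeric character reference: &#65; (decimal) or &#x41; (hexadecimal).
    var charRef = /^#(?:x([0-9a-fA-F]+)|([0-9]+))$/.exec(name);
    if (charRef) {
      var code = charRef[1] ? parseInt(charRef[1], 16) : parseInt(charRef[2], 10);
      return String.fromCodePoint(code);
    }
    if (Object.prototype.hasOwnProperty.call(PREDEFINED, name)) {
      return PREDEFINED[name];
    }
    // Anything else would have to be declared in a DTD, which this sketch
    // does not read, so refuse loudly instead of guessing.
    throw new Error("Sorry: cannot process reference &" + name + ";");
  });
}
```

For example, expandReferences("a &lt; b") returns "a < b", while
expandReferences("see &chap1;") throws the 'Sorry' error rather than passing
the reference through silently.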

>Other question:  Either I can't find it or I am reading right by it, but how
>do I handle whitespace in attribute values as a well-formed parser, just
>allow anything, including \n?

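It depends on the type of the attribute value. See section 3.3.3 (Attribute
Value Normalization). If the attribute value is of type CDATA it stays as is
(apart from line ends, which become spaces), else its white space is
normalised as well: runs collapse to a single space and leading and trailing
white space is removed. How do you tell if it's not CDATA?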
	- there has to be an ATTLIST for the element. This is in the external or
internal subsets. So you have to be able to process those.
	- these subsets can use Parameter Entities. So you have to be able to
process those.
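
A minimal JavaScript sketch of that type-dependent step, assuming the
declared type has already been looked up from an ATTLIST (or is undefined
when no declaration was read, in which case the value is treated as CDATA).
The function name normalizeAttributeValue is hypothetical, and the rule that
tabs and line breaks become spaces for every attribute is one reading of
section 3.3.3, so check it against the spec before relying on it. Expansion
of entity and character references is left out.

```javascript
// Hypothetical helper, for illustration only.
function normalizeAttributeValue(rawValue, declaredType) {
  // Assumed to apply to every attribute: each line break (CR LF, CR or LF)
  // or tab is replaced by a single space.
  var value = rawValue.replace(/\r\n|[\r\n\t]/g, " ");

  // No declaration seen, or declared CDATA: stop here.
  if (!declaredType || declaredType === "CDATA") {
    return value;
  }

  // Any other type (ID, IDREF, NMTOKENS, ENTITY, ...): collapse runs of
  // spaces to one space, then drop a leading and a trailing space.
  return value.replace(/ +/g, " ").replace(/^ | $/g, "");
}
```

With this sketch, normalizeAttributeValue("  a\n  b\t", "NMTOKENS") returns
"a b", whereas the same raw value declared as CDATA keeps all its spacing
and only has the line break and the tab turned into spaces.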

The alternative is not to process any ATTLISTs. This has the slight
disadvantage that it can totally change the meaning of the document. e.g.
an attribute value can be an ENTITY which effectively means it is a pointer
to a chunk of information, whereas if it is assumed to be CDATA it's just a
string.
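
To make the difference concrete, here is a small JavaScript sketch. All the
names and data in it (interpretAttribute, the figure element, its src
attribute, the logo entity) are invented for illustration, and the two
lookup tables stand in for whatever a parser would collect from ATTLIST and
ENTITY declarations if it read them.

```javascript
// Hypothetical data, as if collected from the DTD:
//   <!ATTLIST figure src ENTITY #REQUIRED>
//   <!ENTITY logo SYSTEM "logo.gif" NDATA gif>
var attributeTypes = { "figure/src": "ENTITY" };
var unparsedEntities = { logo: { systemId: "logo.gif", notation: "gif" } };

// Hypothetical helper: decide what an attribute value means.
function interpretAttribute(element, attribute, value, types, entities) {
  // Without a declaration the attribute can only be treated as CDATA.
  var type = types[element + "/" + attribute] || "CDATA";
  if (type === "ENTITY") {
    if (!Object.prototype.hasOwnProperty.call(entities, value)) {
      throw new Error("Sorry: undeclared entity " + value);
    }
    var entity = entities[value];
    // The value is a pointer to a chunk of information outside the document.
    return { kind: "entity", name: value,
             systemId: entity.systemId, notation: entity.notation };
  }
  // Assumed CDATA: the value is just a string.
  return { kind: "string", value: value };
}
```

Called with the tables above, interpretAttribute("figure", "src", "logo",
attributeTypes, unparsedEntities) yields a pointer to logo.gif. Called with
empty tables, as a parser that skips ATTLISTs effectively does, the same
attribute comes back as the plain string "logo".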

So the bottom line is that *if* the document author uses ENTITYs and your
software doesn't, then you will end up with something radically different
from what the author intended. This may or may not matter.

If you are the author of the document as well as the parser, then you can
make a bargain with yourself that you will never use ENTITYs so your
software doesn't need to. If you then want other people to use your
software you either have to add in entity processing OR give them a
statement that you cannot process the document. What you must not do (IMO)
is to ignore ENTITYs and assume the result is more or less OK :-) 

	P.

Peter Murray-Rust, Director Virtual School of Molecular Sciences, domestic
net connection
VSMS http://www.nottingham.ac.uk/vsms, Virtual Hyperglossary
http://www.venus.co.uk/vhg
