Partial XML Processors (was Re: JavaScript parser update and Questions)

Peter Murray-Rust peter at ursus.demon.co.uk
Sat Jan 17 11:15:36 GMT 1998


At 01:38 17/01/98 -0600, Jeremie Miller wrote:
>> [paragraphs removed]
>
>>
>>So you have the following choice:
>> - encode the *whole* spec (and nothing but the spec - i.e. no tricky
>>non-compliant extensions) and give yourself the label "conforming XML
>>tool".
>> - encode the bits you feel are cost effective and label it "processes most
>>XML documents, but gives 'Sorry' messages for some".

[... picking up some of David Durand's concerns ...]

I appreciate the strength of David's arguments and personally wish to
work with totally XML-compliant software. However, it is a *lot* of work.
One design goal (number 4 in the spec) is that it should be "easy to write
programs which process XML documents". If that is interpreted as "easy to
write software that processes *all* XML documents, throwing errors wherever
one is required", then that goal is already lost. For example, James Clark
has come up with about 140 carefully incorrect XML documents for testing
parsers. DavidM has said that AElfred spots 80% of them, but that catching
the other 20% would increase AElfred's size and decrease its speed. [And
probably involve the author in a lot more work.] I'm not making a moral
judgment - simply reporting facts in the public domain.

Personally I think that XML is overly complex for goal 4 and have been
privileged to be able to say so on numerous occasions. However I accept the
consensus and will do what I can to support it.

However, I think there will be domains where the full functionality (or at
least the full syntax) of XML will not be used. In that case there will be
"simple tools" that process XML documents. Not *all* XML documents, but a
lot. It seems to me reasonable that these tools can tell the user if they
can't process a document. It's common for compilers to say "sorry, this
expression is just too complicated for me to deal with - you'll have to
break it up a bit". I can see a tool saying "sorry, I don't deal with
CDATA; please try another parser". [The reason I have several parsers
running under JUMBO is that - at this stage - they all have things they
can't do...]
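
To make the idea concrete, here is a minimal sketch in Python (purely
illustrative - not JUMBO code; the markers and messages are my own
invention) of a partial processor that refuses, rather than mis-parses,
documents using constructs it does not implement:

import re

# Constructs this hypothetical tool knows it cannot handle.
UNSUPPORTED = {
    "<![CDATA[": "CDATA sections",
    "<!DOCTYPE": "DTD subsets",
}

def process(document):
    """List the element names seen, or say 'Sorry' for unsupported input."""
    for marker, feature in UNSUPPORTED.items():
        if marker in document:
            raise NotImplementedError(
                "Sorry, I don't deal with %s; please try another parser"
                % feature)
    # A real tool would do a well-formedness scan here; this just lists tags.
    return re.findall(r"</?([A-Za-z_][\w.:-]*)", document)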

The WG has (I think rightly) said that there should not be conformance
levels in XML. [For those not familiar with SGML, there are a large number
of different options, many of which are not supported by many parsers.]
But I suspect there will be a number of tools which don't support the whole
spec - this is a neutral statement. And there will be a number of documents
that don't use the whole functionality of XML - this is also a neutral
statement. We have frequently talked about the Desperate Perl Hacker
writing tools which are sufficient to process a class of XML documents, but
not all. I can see convergence between these activities.

>
>More questions/issues then:
>
>A well-formed XML document is not required to have a DTD, internal or
>external, correct? 

Correct. The point can be put the other way round: if a document does not
have a DTD subset, then it can only be well-formed (it can never be valid).
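
For example, the following is a complete well-formed document with no
internal or external subset of any kind:

<?xml version="1.0"?>
<GREETING>Hello, world</GREETING>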


> Is a well-formed parser not an XML parser that does not
>have access to or does not process a DTD, internal or external?  I guess I
>haven't found a clear definition of what a well-formed parser is yet.

I think we are all looking for enlightenment in this area. There are at
least the following categories (example prologs follow the list):

A Document + DTD + request to validate document. Requires a validating parser.

B Document + full DTD but no request to validate. 

C Document + parts of a DTD (e.g. a few ELEMENTs and ATTLISTs, maybe an
external subset which covers some of the ELEMENTs in the document).

D Document with no internal or external subset. Can only be well-formed.
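
To make these concrete, hypothetical prologs for each category (the
filenames and declarations are invented) might look like this; note that A
and B are indistinguishable on disk, since the request to validate is not
part of the document itself:

A/B: <!DOCTYPE FOO SYSTEM "foo.dtd">

C:   <!DOCTYPE FOO [
       <!ENTITY warning "Handle with care">
       <!ATTLIST BAR TYPE CDATA "plain">
     ]>

D:   (no DOCTYPE declaration at all)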

The difference between A and B is not clear to me. IMO there are
several people/robots who can urge that a document be validated
(author/server/client/application/reader). What *is* clear is that all the
information in the DTDs must be processed and the document altered
accordingly.

Note that Lark and AElfred both throw errors for 
<!DOCTYPE FOO SYSTEM "bar.dtd">
if bar.dtd cannot be found. This is reasonable (though frustrating) since
bar.dtd can alter the information in the document. 

NOTE BTW: if an entity is declared in both the internal and external
subsets, then the one in the internal subset is processed first and
therefore takes precedence. [This fooled me for some time, because the
internal subset occurs 'later' in the physical document than the SYSTEM
reference to the external one...]
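
For example (a sketch - assume the hypothetical bar.dtd also declares
&greeting;):

<!DOCTYPE FOO SYSTEM "bar.dtd" [
  <!ENTITY greeting "hello from the internal subset">
]>
<FOO>&greeting;</FOO>

Here &greeting; expands to the internal-subset value, even though the
SYSTEM reference to bar.dtd appears textually *before* the internal subset.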

C is similar to B, but validation is not possible. It is *essential* that
if ATTLISTs and ENTITYs (and NOTATIONs) exist, then the information in them
MUST be applied to the document. I think it is here that the differences of
opinion occur. If I get a document with a NOTATION, I may just say "sorry,
I can't grok NOTATION" and bomb out, but others see this as an unacceptable
position.

D seems to me entirely acceptable. If there is no DTD subset, then a parser
can be cleanly built which deals with exactly what can potentially be
carried in well-formed/no_subset documents. [You can see we need a
terminology here :-)]

>If this is true, then a well-formed parser doesn't even have to acknowledge
>that entities exist except for the built in ones, 

NO. *IFF* an ENTITY is declared (case C), the parser MUST process it.
Otherwise the content of the emitted information is incorrect. If a WF
document contains a reference to an entity (e.g. &foo;) then a 'correct'
document automatically falls into (C). A WF/no_subset parser can then only
report that an undeclared entity was discovered (and that even if it had
been declared, that parser couldn't manage it).
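
A sketch of that obligation in Python (again purely illustrative): expand
what has been declared, know the five built-in entities, and *report*
anything undeclared rather than silently passing it through:

import re

BUILTINS = {"lt": "<", "gt": ">", "amp": "&", "apos": "'", "quot": '"'}

def expand_entities(text, declared):
    """Expand &name; references. 'declared' maps entity names from the DTD
    subset to replacement text. Character references (&#160;) and recursive
    expansion are omitted to keep the sketch short."""
    def replace(match):
        name = match.group(1)
        if name in declared:
            return declared[name]    # case C: declared, so it MUST be applied
        if name in BUILTINS:
            return BUILTINS[name]
        # WF/no_subset case: all we can do is report the undeclared entity.
        raise ValueError("undeclared entity &%s;" % name)
    return re.sub(r"&([A-Za-z_][\w.:-]*);", replace, text)

So expand_entities("a &lt; b", {}) gives "a < b", while an undeclared
&foo; raises an error instead of leaking through unexpanded.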


>and absolutely all whitespace is preserved, right?

Yes :-).  The *application* can throw this away, the parser can't. So JUMBO
will soon have the options "discard all PCDATA elements which contain only
whitespace", or "ignore all [these elements] when emitted by a parser." A
human has to press the button to make this happen :-).
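
As a sketch of what that button might do (this is not JUMBO's actual
interface - the (tag, children) tree shape is assumed for illustration,
with plain strings standing for PCDATA):

def discard_ws_only(node):
    """Drop whitespace-only PCDATA children from a (tag, children) tree.
    The parser delivered the whitespace faithfully; it is the *application*
    that chooses to throw it away."""
    tag, children = node
    kept = []
    for child in children:
        if isinstance(child, str):       # PCDATA
            if child.strip():            # keep only non-whitespace text
                kept.append(child)
        else:                            # element: recurse
            kept.append(discard_ws_only(child))
    return (tag, kept)

For example, ("FOO", ["\n  ", ("BAR", ["text"]), "\n"]) becomes
("FOO", [("BAR", ["text"])]).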

NOTE that it is possible that the subset in a (C) document contains enough
information to determine what the parser could do with whitespace. Whether
it should *act* on that information is unclear. For example, the single
declaration:

<!ELEMENT FOO (PLUGH|BAR)*>

says that FOO contains element content and therefore cannot contain PCDATA.
Any whitespace PCDATA is therefore "ignorable". This information is not
sufficient to *validate* the document (there are no declarations for BAR
and PLUGH, for example). The declaration

<!ELEMENT FOO ANY>

allows PCDATA, so doesn't help much.  Some people have argued for a content
model which includes something like #ANYNONPCDATA, but that is not legal XML.
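
To illustrate the "ignorable" case: given the FOO declaration above, in

<FOO>
  <BAR>some text</BAR>
  <PLUGH>more text</PLUGH>
</FOO>

the newlines and indentation directly inside FOO are ignorable whitespace,
while "some text" and "more text" are ordinary character data (about which
that single declaration says nothing, since BAR and PLUGH are undeclared).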

	P.

Peter Murray-Rust, Director Virtual School of Molecular Sciences, domestic
net connection
VSMS http://www.nottingham.ac.uk/vsms, Virtual Hyperglossary
http://www.venus.co.uk/vhg
