Re WF, V, and MSXML

Mon Jun 9 01:55:19 BST 1997

Peter Murray-Rust, replying to me replying to him, etc.:
[Terry:]
| > Yes, I think there is a somewhat different information model in XML
| > than in SGML, and this parser (whether it's doing all the right things
| > or not) is useful for learning and thinking about the differences.
| 
| My problem is more basic - I don't think that there are (yet) 'right and
| wrong things'.  That is why I have been so keen on implementation, because it's
| only when we get to this stage that the problems of the WF/V boundary come out.

Right.  That's why the IETF assigns such importance to running code.

| > I, too, think that my "palmy" input document is invalid but WF.  Thus,
| > if MSXML is parsing to validate, it is (due to a bug or two) doing
| > error recovery (and should be fixed on this point not to do so).
| 
| I think this is more a question of terminology.  NXP (Norbert Mikula) is a
| 'validating parser', but the validation can be switched off.  This is a
| client-side decision.  So with NXP 'palmy' could be either invalid or WF
| according to the reader's wishes.

Agreed, but from the viewpoint of the document preparer, it is both.  MSXML
needs the switch NXP has.  I think the behavior is unintentional, but
I would be alarmed at a processor/parser (they mean the same to me in
this context) that attempted to parse for validity, and if it found
an error, silently switched to WF-parse mode.
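A minimal sketch of the switch I have in mind, in Python purely for illustration.  The names `process`, `ValidityError`, and `validator` are hypothetical; the standard-library expat binding only checks well-formedness, so the validity check is assumed to be a caller-supplied function.  The point is only that a failed validation is reported, never quietly downgraded to a WF-only parse.

```python
import xml.parsers.expat


class ValidityError(Exception):
    """Raised when a document is well-formed but fails validation."""


def process(text, validate, validator=None):
    """Parse `text` and return "valid" or "well-formed".

    `validator` is a hypothetical callable(text) -> list of error strings;
    it stands in for whatever validity checking a real processor does.
    """
    parser = xml.parsers.expat.ParserCreate()
    parser.Parse(text, True)  # raises ExpatError if not well-formed
    if not validate:
        return "well-formed"
    errors = validator(text) if validator else ["no validator supplied"]
    if errors:
        # The reader asked for validation, so report the failure instead
        # of silently falling back to WF-parse mode.
        raise ValidityError("; ".join(errors))
    return "valid"
```

With this shape, the client decides which mode it wants (as with NXP), and the document preparer can rely on an explicit error whenever validation was requested and failed.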

| > I can also see some gotchas for early adopters, such as that a WF document 
| > that makes reference to the wrong DTD is still WF.  And the WF-parser will 
|                               ^^^^^^^^^^^^^^^^^^^^^
| I'd agree with this, and I don't necessarily think it's wrong until the ERB
| tells us it is.  Its validity is presently decidable from axioms and we are
| waiting for the ERB to think about the problem.

Agreed, it's just something to watch out for and perhaps to guard against
(by not reusing entity names in different DTDs, etc.)

| > check the WFness of the element declarations (even in the right DTD) even 
| > if it isn't going to use them, at least in the internal subset.  Also, the 
| > internal subset is part of the XML document, and, as the spec is 
| > written, the parser must parse the subset and deliver it as part of 
| > the output (as MSXML does), even though the same is not true of an 
| > external subset.  (Right?)
| 
| I don't think so.  My formal reading of the spec is that no 'output' is
| defined.  [After all, processing of an XML document can be done by a human
| reader :-)].  I think the ERB has been careful to say nothing about output,
| implementation, APIs, etc.  My own view has been that the scope for
| confusion has been sufficient (as in the present case) that guidance is 
| important.  At present we do not know what documents are validatable, what
| the validity criterion can be computed to be, etc.

Point taken; but the spec is not entirely clean on this point.  If the
application requests the processor to process, the processor must
inform the application of certain things.  And it is hard to get
around

"*An XML processor which does not read the DTD must always pass all 
characters in a document that are not markup through to the application.* 
An XML processor which does read the DTD must always pass all characters 
in mixed content that are not markup through to the application. It may 
also choose to pass white space occurring in element content to the 
application; if it does so, it must signal to the application that ..."
		[2.8, truncated para, emphasis added]
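To make the emphasized sentence concrete, here is a small sketch using Python's standard expat binding, which is a non-validating parser.  The sample document and the helper name `character_data` are my own assumptions; the sketch only shows that every non-markup character, white space included, reaches the application when no DTD is in play.

```python
import xml.parsers.expat


def character_data(text):
    """Return all character data the parser hands to the application."""
    chunks = []
    parser = xml.parsers.expat.ParserCreate()
    # expat may deliver text in several pieces, so collect and join them.
    parser.CharacterDataHandler = chunks.append
    parser.Parse(text, True)
    return "".join(chunks)


doc = "<a>\n  <b>x</b>\n</a>"
print(repr(character_data(doc)))  # prints '\n  x\n'
```

The line breaks and indentation between the tags come through along with the "x", which is what the quoted paragraph requires of a processor that does not read the DTD.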

| Note that NXP and Lark do not have 'outputs', they have APIs.  NXP allows
| the programmer to subclass at the Esis level, whilst Lark provides a
| tree of Elements.  Neither passes any DTD information.  In Lark I suspect this
| is discarded - in NXP it requires a bit of digging to extract.  MSXML comes
| closer to delivering the whole grove, I think.  (It subclasses PIs and DOCTYPE 
| from Element).

Right.  My problem as a document preparer is that I don't know what
an application may request the processor to do, so I must guard against
any kind of failure.

 ...

| > IOW, an SGML parser such as nsgmls combines both subsets
| > into a DTD and deals with information following as another unit,
| > the "document instance set" (if I have the terminology right, per
| > 8879 production 2), which is the part of an SGML document entity
| > *following* the prologue. 
| 
| nsgmls attempts to validate *every* document it receives.  XML parsers need
| not.  It's not clear whether an XML parser can insist on validating every 
| document.  [The spec says nothing about *parsers* - again I have been asking
| for more concrete terminology than 'processor'].
|
| > But for an XML parser, the boundaries are shifted, because
| > it has to deal with an XML document that *includes* the prologue
| > (XMLlang production 23, where "element" corresponds to the SGML 
| > "document instance set", I think).  I don't know whether this is a good 
| > idea or not, just trying to understand it as an early adopter.
...
| I am actually unclear whether a WF-only parser (e.g. Lark) has to read the
| internal subset at all, other than skipping to the ']>' at the end.  If it 
| *does* read and parse it, what does it do with the information?  For example,

The soft spot here is the first line of 2.2, where "match" is not
defined except that later in that section it "implies" a few things,
which are not apparently meant to be a complete set.  What the
WF document matches is production 23, Prolog element Misc*.  As
the processor attempting to determine WFness must look inside element to 
determine WFness, presumably the same is true of prolog.

 ... unless I determine WFness by *parsing* with a *real parser* which
the processor is not meant to be ...
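As an illustration of the WF/V boundary in the prolog, here is a sketch using Python's standard expat binding, which checks well-formedness only.  The two sample documents are hypothetical ones of my own, and how any particular XML processor of the day behaves is not something this shows; it only demonstrates one non-validating parser that does look inside the internal subset.

```python
import xml.parsers.expat


def is_well_formed(text):
    """Return True if expat accepts `text` as a well-formed document."""
    parser = xml.parsers.expat.ParserCreate()
    try:
        parser.Parse(text, True)
    except xml.parsers.expat.ExpatError:
        return False
    return True


# Invalid (the content of <a> does not match its declaration) but WF:
invalid_but_wf = "<!DOCTYPE a [<!ELEMENT a (b)>]><a><c/></a>"

# The declaration itself is malformed (unclosed content model), so the
# document is not WF even though the instance <a/> is fine on its own:
broken_subset = "<!DOCTYPE a [<!ELEMENT a (b>]><a/>"

print(is_well_formed(invalid_but_wf), is_well_formed(broken_subset))
```

The first document is accepted and the second is rejected, so this parser at least treats the internal subset as something that must itself be well-formed, without ever using the declarations to validate.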

No, not per XMLlang alone.  FOO's only declared attribute has as its name
the unreserved string "XML-LINK" although it uses an undeclared attribute
name "HREF".  So it is WF but not valid.

As for whether you can have attlists without element decls, 
the 2nd sentence following production 47 (emended for entity>element)
reads "At user option, an XML processor may issue a warning
if attributes are declared for an [element] type not itself
declared, but this is not an error", so the document is still WF
but not valid per XMLlang alone.
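The two checks just described can be sketched with Python's standard expat binding.  The sample document below is my reconstruction of the kind of case under discussion (an ATTLIST for FOO with no element declaration, and an undeclared HREF in the instance), not the original example, and the function name `check_attlists` is hypothetical.

```python
import xml.parsers.expat


def check_attlists(text):
    """Return (warnings, undeclared) for a well-formed document.

    warnings:   element types given an ATTLIST but never declared
    undeclared: (element, attribute) pairs used without a declaration
    """
    declared_elements = set()
    declared_attrs = {}  # element type -> set of declared attribute names
    undeclared = []

    def on_element_decl(name, model):
        declared_elements.add(name)

    def on_attlist_decl(elname, attname, type_, default, required):
        declared_attrs.setdefault(elname, set()).add(attname)

    def on_start(name, attrs):
        known = declared_attrs.get(name, set())
        for att in attrs:
            if att not in known:
                undeclared.append((name, att))

    parser = xml.parsers.expat.ParserCreate()
    parser.ElementDeclHandler = on_element_decl
    parser.AttlistDeclHandler = on_attlist_decl
    parser.StartElementHandler = on_start
    parser.Parse(text, True)  # raises ExpatError only if not well-formed
    return sorted(set(declared_attrs) - declared_elements), undeclared


doc = ('<!DOCTYPE FOO [<!ATTLIST FOO XML-LINK CDATA #FIXED "SIMPLE">]>'
       '<FOO HREF="http://example.org/"/>')
print(check_attlists(doc))
```

The parse succeeds, so the document is WF; the first list corresponds to the optional warning in the sentence after production 47, and the second list is the kind of thing a validating processor would have to reject.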

Were the XMLlink spec to contain language such that the processor is 
supposed to go out and fetch the attribute declarations implied 
by the use of the FIXED attribute (implied by the XMLlink spec, that 
is), then the document shown is not only WF but perhaps even valid!
But it doesn't, and barely talks of validity and processing
by a *processor*.

That's my take, anyway.  Maybe the SGML ERB will want to revise
the language about validity in XMLlang, or create new concepts
of validity in XMLlink.  

Regards,

  Terry Allen    Electronic Publishing Consultant    tallen[at]sonic.net
                   http://www.sonic.net/~tallen/
    Davenport and DocBook:  http://www.ora.com/davenport/index.html
          T.A. at Passage Systems:  terry.allen[at]passage.com 

xml-dev: A list for W3C XML Developers
Archived as: http://www.lists.ic.ac.uk/hypermail/xml-dev/
List coordinator, Henry Rzepa (rzepa at ic.ac.uk)