Re WF, V, and MSXML

Sun Jun 8 23:32:38 BST 1997

In message <199706081956.MAA31488 at bolt.sonic.net> Terry Allen writes:
[...]
> 
> Yes, I think there is a somewhat different information model in XML
> than in SGML, and this parser (whether it's doing all the right things
> or not) is useful for learning and thinking about the differences.

My problem is more basic - I don't think that there are (yet) 'right and
wrong things'.  That is why I have been so keen on implementation, because it's
only when we get to this stage that the problems of the WF/V boundary come out.

> I, too, think that my "palmy" input document is invalid but WF.  Thus,
> if MSXML is parsing to validate, it is (due to a bug or two) doing
> error recovery (and should be fixed on this point not to do so).

I think this is more a question of terminology.  NXP (Norbert Mikula) is a
'validating parser', but the validation can be switched off.  This is a
client-side decision.  So with NXP 'palmy' could be either invalid or WF
according to the reader's wishes.
> 
> I can also see some gotchas for early adopters, such as that a WF document 
> that makes reference to the wrong DTD is still WF.  And the WF-parser will 
                              ^^^^^^^^^^^^^^^^^^^^^
I'd agree with this, and I don't necessarily think it's wrong until the ERB
tells us it is.  Its validity is presently undecidable from axioms and we are
waiting for the ERB to think about the problem.
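
[A rough illustration of the gotcha quoted above, as a sketch only.  It
assumes Python's bundled expat bindings, a non-validating parser that is not
one of the parsers discussed in this thread.  The names is_well_formed,
WRONG_DTD and BROKEN and the file name no-such.dtd are invented for the
example.  A WF check accepts a document whose DOCTYPE names the wrong root
and an unreachable DTD, and rejects only the document with the missing
end-tag.]

```python
import xml.parsers.expat


def is_well_formed(text):
    """Return True if expat parses text without a well-formedness error."""
    parser = xml.parsers.expat.ParserCreate()
    try:
        parser.Parse(text, True)  # True marks the final chunk of input
    except xml.parsers.expat.ExpatError:
        return False
    return True


# DOCTYPE names BAR and a DTD that does not exist, but the root is FOO.
# That is a validity matter, so a WF-only check does not complain.
WRONG_DTD = '<!DOCTYPE BAR SYSTEM "no-such.dtd">\n<FOO HREF="bar"/>'

# Same prologue, but the root element is never closed: not WF.
BROKEN = '<!DOCTYPE BAR SYSTEM "no-such.dtd">\n<FOO HREF="bar">'

print(is_well_formed(WRONG_DTD))
print(is_well_formed(BROKEN))
```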

> check the WFness of the element declarations (even in the right DTD) even 
> if it isn't going to use them, at least in the internal subset.  Also, the 
> internal subset is part of the XML document, and, as the spec is 
> written, the parser must parse the subset and deliver it as part of 
> the output (as MSXML does), even though the same is not true of an 
> external subset.  (Right?)

I don't think so.  My formal reading of the spec is that no 'output' is
defined.  [After all, processing of an XML document can be done by a human
reader :-)].  I think the ERB has been careful to say nothing about output,
implementation, APIs, etc.  My own view has been that the scope for
confusion has been sufficient (as in the present case) that guidance is 
important.  At present we do not know what documents are validatable, what
the validity criterion can be computed to be, etc.

Note that NXP and Lark do not have 'outputs', they have APIs.  NXP allows
the programmer to subclass at the ESIS level, whilst Lark provides a
tree of Elements.  Neither passes any DTD information.  In Lark I suspect this
is discarded - in NXP it requires a bit of digging to extract.  MSXML comes
closer to delivering the whole grove, I think.  (It subclasses PIs and DOCTYPE 
from Element).

> 
> Doesn't it seem as though the reasons for conveying the internal
> subset information to the application (such as those you mention) 
> are also reasons for extracting the same information from the external 
> subset and conveying it to the application, too?  whether the document 
> is dealt with as WF or not?

Again, the spec (and the ERB) are unclear about conveying this information
to the application at all.

> 
> IOW, an SGML parser such as nsgmls combines both subsets
> into a DTD and deals with information following as another unit,
> the "document instance set" (if I have the terminology right, per
> 8879 production 2), which is the part of an SGML document entity
> *following* the prologue. 

nsgmls attempts to validate *every* document it receives.  XML parsers need
not.  It's not clear whether an XML parser can insist on validating every 
document.  [The spec says nothing about *parsers* - again I have been asking
for more concrete terminology than 'processor'].
> 
> But for an XML parser, the boundaries are shifted, because
> it has to deal with an XML document that *includes* the prologue
> (XMLlang production 23, where "element" corresponds to the SGML 
> "document instance set", I think).  I don't know whether this is a good 
> idea or not, just trying to understand it as an early adopter.
> 
> (I also notice now that per productions 23 and 27, white space
> after the end of the end-tag of the root element is also part
> of the document, which is okay by me; but this seems 
> not to be dealt with explicitly s.v. 2.8, "White Space Handling." 
> I read that section to mean that such white space must be passed
> to the application by a WF-parser [the language referring to
> "processors which ... read the DTD" or not should be changed,
> because, as we see, a WF parser must read at least the internal

I am actually unclear whether a WF-only parser (e.g. Lark) has to read the
internal subset at all, other than skipping to the ']>' at the end.  If it 
*does* read and parse it, what does it do with the information?  For example,
what is the implied structure of the document in:

<!DOCTYPE FOO [
<!ATTLIST FOO XML-LINK CDATA #FIXED "SIMPLE">
]>
<FOO HREF="bar"/>

Can we assume that FOO (which has no Element declaration) has an ATTLIST as
given, and that therefore it inherits the SHOW and ACTUATE attributes?
IOW *must* a parser decorate all matching elements with the ATTLISTS in the 
internal subset?
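
[To make the two possible answers concrete, here is a minimal sketch, again
assuming Python's expat bindings rather than any parser named in this thread.
The helper root_attributes and the constant DOC are invented for the example.
expat's specified_attributes switch selects between reporting the attributes
declared in the internal subset and reporting only those written in the
start-tag.  It says nothing about what the spec requires, and it does not
address the SHOW and ACTUATE question.]

```python
import xml.parsers.expat

DOC = """<!DOCTYPE FOO [
<!ATTLIST FOO XML-LINK CDATA #FIXED "SIMPLE">
]>
<FOO HREF="bar"/>"""


def root_attributes(text, specified_only):
    """Return the attributes expat reports for the first start-tag."""
    seen = []
    parser = xml.parsers.expat.ParserCreate()
    # When true, attributes derived from ATTLIST declarations are omitted.
    parser.specified_attributes = specified_only
    parser.StartElementHandler = lambda name, attrs: seen.append(dict(attrs))
    parser.Parse(text, True)
    return seen[0]


# Answer 1: the parser decorates FOO using the internal subset.
with_defaults = root_attributes(DOC, specified_only=False)

# Answer 2: the application sees only what the instance contains.
instance_only = root_attributes(DOC, specified_only=True)

print(sorted(with_defaults))
print(sorted(instance_only))
```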

> subset part of the DTD], whereas a validating parser must not
> pass such white space to the application.)

My confusion on this issue is well publicised :-)

	P.

-- 
Peter Murray-Rust, domestic net connection
Virtual School of Molecular Sciences
http://www.vsms.nottingham.ac.uk/

xml-dev: A list for W3C XML Developers
Archived as: http://www.lists.ic.ac.uk/hypermail/xml-dev/
To unsubscribe, send to majordomo at ic.ac.uk the following message;
unsubscribe xml-dev
List coordinator, Henry Rzepa (rzepa at ic.ac.uk)