Re: WF, V, and MSXML
tallen at sonic.net
Sun Jun 8 22:17:06 BST 1997
Peter Murray-Rust wrote:
>Note that an internal subset may be present for other reasons than validation
>(adding attribute values and types, as required for XML-LINK, for example).
>Therefore I do not think the author's intentions can be deduced from the
>presence of an internal subset. Presumably a pointer (SYSTEM) to an
>external DTD is likely to refer to a DTD which can be used for validation, but
>I'm not sure whether this is explicit.
Yes, I think there is a somewhat different information model in XML
than in SGML, and this parser (whether it's doing all the right things
or not) is useful for learning and thinking about the differences.
I, too, think that my "palmy" input document is invalid but WF. Thus,
if MSXML is parsing to validate, it is (due to a bug or two) doing
error recovery instead of reporting the error, and it should be fixed
so that it does not.
I can also see some gotchas for early adopters, such as that a WF document
that makes reference to the wrong DTD is still WF. And the WF-parser will
check the WFness of the element declarations (even in the right DTD) even
if it isn't going to use them, at least in the internal subset. Also, the
internal subset is part of the XML document, and, as the spec is
written, the parser must parse the subset and deliver it as part of
the output (as MSXML does), even though the same is not true of an
external subset. (Right?)
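
The gotchas above can be tried out with a small sketch. It assumes
Python's standard expat binding as a stand-in for a WF-only parser (it is
not MSXML and may not behave as MSXML does), and the three document
strings are hypothetical examples, not the "palmy" input. The helper
names parse_wf and is_well_formed are made up for illustration.

```python
from xml.parsers import expat

# Hypothetical inputs, invented for this sketch.
# Declares that doc contains para, but the instance contains other:
# invalid against its own internal subset, yet well-formed.
WF_BUT_INVALID = "<!DOCTYPE doc [<!ELEMENT doc (para)>]><doc><other/></doc>"
# The element declaration itself is malformed (missing close paren).
BAD_DECL = "<!DOCTYPE doc [<!ELEMENT doc (para>]><doc/>"
# Points at an external DTD that is never fetched by this parser setup.
WRONG_DTD = '<!DOCTYPE doc SYSTEM "no-such.dtd"><doc/>'


def parse_wf(text):
    """Parse without validating; report what the parser saw of the DTD."""
    seen = {"internal_subset": False, "decls": []}
    parser = expat.ParserCreate()

    def start_doctype(name, system_id, public_id, has_internal_subset):
        seen["internal_subset"] = bool(has_internal_subset)

    def element_decl(name, model):
        # Declarations in the internal subset are delivered to the caller.
        seen["decls"].append(name)

    parser.StartDoctypeDeclHandler = start_doctype
    parser.ElementDeclHandler = element_decl
    parser.Parse(text, True)
    return seen


def is_well_formed(text):
    """True if the non-validating parse completes without an error."""
    try:
        parse_wf(text)
        return True
    except expat.ExpatError:
        return False
```

With this setup the invalid document parses without complaint and its
element declaration is reported, the malformed declaration is rejected
even though nothing would be validated against it, and the reference to
an external DTD is accepted without the DTD being read.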
Doesn't it seem as though the reasons for conveying the internal
subset information to the application (such as those you mention)
are also reasons for extracting the same information from the external
subset and conveying it to the application, too, whether the document
is dealt with as WF or not?
IOW, an SGML parser such as nsgmls combines both subsets
into a DTD and deals with what follows as another unit,
the "document instance set" (if I have the terminology right, per
ISO 8879 production 2), which is the part of an SGML document entity
*following* the prologue.
But for an XML parser, the boundaries are shifted, because
it has to deal with an XML document that *includes* the prologue
(XMLlang production 23, where "element" corresponds to the SGML
"document instance set", I think). I don't know whether this is a good
idea or not; I'm just trying to understand it as an early adopter.
(I also notice now that per productions 23 and 27, white space
after the end of the end-tag of the root element is also part
of the document, which is okay by me; but this seems
not to be dealt with explicitly s.v. 2.8, "White Space Handling."
I read that section to mean that such white space must be passed
to the application by a WF-parser [the language referring to
"processors which ... read the DTD" or not should be changed,
because, as we see, a WF parser must read at least the internal
subset part of the DTD], whereas a validating parser must not
pass such white space to the application.)
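
For comparison only, here is a minimal sketch of how one present-day
non-validating parser treats such trailing white space. It assumes
Python's standard expat binding; it says nothing about what the 1997
draft requires or what MSXML does. The function name character_data and
the sample strings are hypothetical.

```python
from xml.parsers import expat


def character_data(text):
    """Return all text the parser reports through its character data handler."""
    chunks = []
    parser = expat.ParserCreate()
    # Every piece of character data the parser reports is collected here.
    parser.CharacterDataHandler = chunks.append
    parser.Parse(text, True)
    return "".join(chunks)


# White space inside the root element is reported as character data.
inside = character_data("<doc> a </doc>")

# White space after the root element's end-tag is not reported as
# character data by this parser.
trailing = character_data("<doc>text</doc>\n\n   \n")
```

In this parser, at least, the application sees the white space inside
the element but not the white space that follows the root element's
end-tag.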
Terry Allen Electronic Publishing Consultant tallen[at]sonic.net
Davenport and DocBook: http://www.ora.com/davenport/index.html
T.A. at Passage Systems: terry.allen[at]passage.com
xml-dev: A list for W3C XML Developers
Archived as: http://www.lists.ic.ac.uk/hypermail/xml-dev/
List coordinator, Henry Rzepa (rzepa at ic.ac.uk)