XML syntax (was Re: external subset syntax)
mecom-gmbh at mixx.de
Tue Dec 16 19:24:54 GMT 1997
perhaps it's time for a new role to complement the mcsgs, namely the npw - or
niggling parser writer - not rebelling, just niggling.
i admit to that fault.
my problem is, whenever i come to a point in the proposed recommendation at
which a parser is required to report an error and "must not continue normal
processing" even though the result which the stream would denote would be
sufficiently unambiguous if allowed, then i feel compelled to ask, "why does one
have to exclude this"?
which does not mean "in which production does the standard exclude or prescribe
it", but rather why does the standard exclude or prescribe it. what is the
useful purpose? particularly when excluding it makes the parser more complex and
the document encoding more exacting.
more than likely, when i've followed discussions of similar questions, the
design goal #3 gets hoisted like a commandment: "XML shall be compatible with
SGML". as an npw i tend to adhere more to #'s 1, 4, 6, and 9: it should be easy to
generate, easy to program, and easy to read. SGML processors are already pretty
complex, so an argument to increase the complexity of XML strictly in order to
keep SGML processors simpler is difficult to accept on logical terms. (i know
i'm being naive here, and i'm ignoring the past, but i would wager that the
future is going to bear me out...)
the simplest thing would have been a document form which distinguished inline
definitions, external references (ie XLL built-in), content, and (maybe) a
declaration (autorecognition of encoding being the criterion for the latter). it
is true that all of that is there, but the standard requires at least twice as
many syntactic forms as are necessary. so despite having read mr murray-rust's
note on background to the list itself (re: XML-DEV (was Re: YAXPAPI)) which
gave me some sense of the effort which has gone into the proposed
recommendation, the distance between the simple form of the denoted data and the
complexity of the syntactic form often leads me to ask "why?"
one such example concerns the external subset, xml declaration, doctype
declaration, and text declaration. in particular, the productions
 XMLDecl ::= '<?xml' VersionInfo EncodingDecl? SDDecl? S? '?>'
 doctypedecl ::= '<!DOCTYPE' S Name (S ExternalID)? S? ('[' (markupdecl | PEReference | S)* ']' S?)? '>'
 TextDecl ::= '<?xml' VersionInfo? EncodingDecl S? '?>'
 ExtPE ::= TextDecl? extSubset
i observe that, while one can well label the XMLDecl and TextDecl productions
differently, lexically speaking they are not disjoint, and practically speaking
there is no difference between their situation and that concerning the presence
of a doctype form at a location analogous to that of the textdecl. yet one is
"standard" and the other is "nonsense". not to a niggling parser writer. from
the stream content, the permitted case (almost) appears (by analogy to the
remarks below) as one xml document within another. the other thing which is
disconcerting is that the standard goes to great length to, on one hand,
specify that the presence of an xml document may be introduced by a form with
the (not)PI keyword 'xml' (all lower case only) but on the other hand engenders
lexical ambiguity where it does not introduce a distinct keyword for the
distinctly different purpose and context of specifying the encoding of the
external dtd subset. why?
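the overlap between the two productions can be checked mechanically. what follows is a minimal sketch in python, not anything taken from the proposed recommendation: the regular expressions are my own simplified renderings of XMLDecl and TextDecl (S is assumed to be plain ascii whitespace, and the character sets for the version number and encoding name are assumptions), and `classify` is a hypothetical helper name.

```python
import re

# Simplified regex renderings of the quoted productions.
# Assumptions: S is ASCII whitespace; the value character sets are approximations.
S = r"[ \t\r\n]+"
EQ = r"[ \t\r\n]*=[ \t\r\n]*"
VERSION_INFO = S + "version" + EQ + r"""(?:"[A-Za-z0-9_.:-]+"|'[A-Za-z0-9_.:-]+')"""
ENCODING_DECL = S + "encoding" + EQ + r"""(?:"[A-Za-z][A-Za-z0-9._-]*"|'[A-Za-z][A-Za-z0-9._-]*')"""
SD_DECL = S + "standalone" + EQ + r"""(?:"(?:yes|no)"|'(?:yes|no)')"""

# XMLDecl ::= '<?xml' VersionInfo EncodingDecl? SDDecl? S? '?>'
XML_DECL = re.compile(
    r"<\?xml" + VERSION_INFO + "(?:" + ENCODING_DECL + ")?(?:" + SD_DECL + ")?(?:" + S + r")?\?>"
)
# TextDecl ::= '<?xml' VersionInfo? EncodingDecl S? '?>'
TEXT_DECL = re.compile(
    r"<\?xml(?:" + VERSION_INFO + ")?" + ENCODING_DECL + "(?:" + S + r")?\?>"
)


def classify(text):
    """Return the names of the productions that match the whole string."""
    names = []
    if XML_DECL.fullmatch(text):
        names.append("XMLDecl")
    if TEXT_DECL.fullmatch(text):
        names.append("TextDecl")
    return names


if __name__ == "__main__":
    for sample in (
        '<?xml version="1.0" encoding="UTF-8"?>',  # matches both productions
        '<?xml version="1.0"?>',                   # XMLDecl only
        '<?xml encoding="UTF-8"?>',                # TextDecl only
    ):
        print(sample, classify(sample))
```

a declaration carrying both a version and an encoding satisfies both patterns, so only the position in the stream tells a parser which production it is looking at.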
Per-Ake Ling wrote:
> > From jjc at jclark.com Mon Dec 15 11:59:21 1997
> > It is a requirement that the external subset *not* begin with a document
> > type declaration.
> If it were permitted, it would mean that there is a doctype declaration
> within a doctype declaration, which is clearly nonsense. It is a common
> misunderstanding that DTD means "document type declaration" instead of
> "document type definition".
(as an aside, i didn't - and still don't - see that as, in itself, a sufficient
explanation, since the case would comprise two instances of a "document type
declaration": one in the xml document and the other in the prolog of the
external portion of the "document type definition", which was referred to from
the first, but is not contained in the first, and which serves to constrain the
root element <em>if</em> so desired.)
another example is the MDC (']]>') exclusion in CharData which means that one
needs a state machine to scan character data. why?
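to make the cost concrete, here is a minimal sketch in python of what the exclusion asks of a scanner. it is only an illustration under my own assumptions: `scan_char_data` is a hypothetical name, and the function handles nothing but the rule that character data ends at '<' or '&' and must not contain ']]>'. without the exclusion, a plain search for the next '<' or '&' would do; with it, the scanner has to remember how many ']' it has just seen.

```python
def scan_char_data(text, start=0):
    """Scan character data from `start` and return the index where it ends.

    Scanning stops before '<' or '&' (or at end of input).
    Raises ValueError if the sequence ']]>' occurs in the character data.
    """
    brackets = 0  # state: number of consecutive ']' just seen, capped at 2
    i = start
    while i < len(text):
        ch = text[i]
        if ch in "<&":
            break  # markup or a reference begins here
        if ch == "]":
            brackets = min(brackets + 1, 2)
        elif ch == ">" and brackets == 2:
            raise ValueError(
                "']]>' is not allowed in character data (offset %d)" % (i - 2)
            )
        else:
            brackets = 0  # any other character resets the state
        i += 1
    return i


if __name__ == "__main__":
    print(scan_char_data("some text<tag>"))  # stops at the '<'
    try:
        scan_char_data("a ]]> b")
    except ValueError as err:
        print("rejected:", err)
```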
another example is the xml declaration form, in itself, where the npw believes his point (in
a previous posting) was misunderstood, and can only repeat the question
<em>why</em> is a PI-close specified to be '?>' and not '>', which would be
easier, or ('?>' | '>'), which would be more robust. he also observes (wrt 'XML'
itself) that the standard, cf #6 with irony, engenders an encoding where, of the
four obvious humanly legible encodings (that is, neglecting 'xMl' et al.:
('<?XML' | '<?xml') ... ('?>' | '>')), only one is legitimized. why?
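the four encodings can be enumerated directly. the python sketch below is an illustration only: `STRICT` and `LENIENT` are hypothetical patterns of my own, written under the assumption that the body of the declaration never contains '>', and the lenient one is the reading argued for here, not anything the standard permits.

```python
import re

# Hypothetical patterns; both assume the declaration body contains no '>'.
STRICT = re.compile(r"<\?xml\s[^>]*\?>")         # lower-case keyword, '?>' close only
LENIENT = re.compile(r"<\?(?:xml|XML)\s[^>]*>")  # either case, '?>' or '>' close

# The four "obvious humanly legible" encodings of the same declaration.
VARIANTS = [
    '<?xml version="1.0"?>',
    '<?xml version="1.0">',
    '<?XML version="1.0"?>',
    '<?XML version="1.0">',
]


def accepted(pattern):
    """Return the variants that the given pattern matches in full."""
    return [v for v in VARIANTS if pattern.fullmatch(v)]


if __name__ == "__main__":
    print("strict: ", accepted(STRICT))
    print("lenient:", accepted(LENIENT))
```

the strict pattern accepts one of the four variants, the lenient one accepts all four, and the lenient pattern is no longer than the strict one.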
if the precision of an encoding depends so much on uniqueness, then why does one
start out with such a level of lexical complexity in the first place, only to
then exclude much of it as 'malformed'? all you need is <, >, ', & and / (if
you allow element recursion) - and even the distinction between < and > is more
for the eye than anything else.
Ingo Macherius wrote:
> > how about
> > <?XML version="1.0" ?>
> This is wrong, too. "xml" must be lower-case.
> > i've yet to understand why, but isn't that the way it needs to be?
> Why ? Productions  and  in section 2.8 !
>  XMLDecl ::= '<?xml' VersionInfo EncodingDecl? SDDecl? S? '?>'
>  VersionInfo ::= S 'version' Eq
> ('"VersionNum"'| "'VersionNum'")
> So the minimal correct PI is: <?xml version="1.0"?>