XML syntax (was Re: external subset syntax)

Chris Maden crism at ora.com
Tue Dec 16 20:26:02 GMT 1997


[James Anderson]
> my problem is, whenever i come to a point in the proposed
> recommendation at which a parser is required to report an error and
> "must not continue normal processing" even though the result which
> the stream would denote would be sufficiently unambiguous if
> allowed, then i feel compelled to ask, "why does one have to exclude
> this"?  which does not mean "in which production does the standard
> exclude or prescribe it", but rather why does the standard exclude
> or prescribe it.  what is the useful purpose? particularly when
> excluding it makes the parser more complex and the document encoding
> more exacting.

I am not particularly fond of this rule.  However, I can explain its
justification.  The WG made this decision at the request of both
Microsoft and Netscape.  In the HTML arena, both companies spend a
fair amount of their time reverse engineering the other's error-
recovery behavior, since Web page authors "validate" by seeing if it
looks OK in their browser of choice.  By requiring parsers to fail on
non-conformant documents, there is no chance that a user can think
erroneous data is acceptable in a conforming browser; if a browser
accepts the data, its opponent can level the charge that it is non-
conforming.
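
To make the rule concrete, here is a minimal sketch in Python.  It
assumes the standard library's expat-based ElementTree parser as a
stand-in for "a conforming XML processor"; the helper name try_parse
is only for illustration.  The point is that HTML-style tag soup is
reported as an error instead of being silently repaired:

```python
import xml.etree.ElementTree as ET

def try_parse(text):
    """Return the root element's tag if text is well-formed XML,
    otherwise a string describing the parser's error."""
    try:
        return ET.fromstring(text).tag
    except ET.ParseError as err:
        # No recovery is attempted: the error is simply reported.
        return "error: %s" % err

well_formed = "<p>one <b>two</b></p>"
tag_soup = "<p>one <b>two</p>"   # <b> is never closed

print(try_parse(well_formed))    # the root tag
print(try_parse(tag_soup))       # an error message, no partial tree
```

An HTML browser would typically render both strings the same way,
which is exactly the behavior the two vendors wanted to stop having to
reverse engineer.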

> more than likely, when i've followed discussions of similar
> questions, the design goal #3 gets hoisted like a commandment: "XML
> shall be compatible with SGML". as a npw i tend to adhere more to
> #'s 1,4, 6, and 9: it should be easy to generate, easy to program,
> and easy to read. SGML processors are already pretty complex, so an
> argument to increase the complexity of XML strictly in order to keep
> SGML processors simpler is difficult to accept on logical terms. (i
> know i'm being naive here, and i'm ignoring the past, but i would
> wager that the future is going to bear me out...)

Rule 3 is critical for two reasons: (a) technologically, it allows
easier application of existing SGML technology to the new problem
space, and (b) politically, it encourages XML's adoption in rigorously
standards-based arenas, like the Military-Industrial Complex.

> the simplest thing would have been a document form which
> distinguished inline definitions, external references (ie XLL
> built-in), content, and (maybe) a declaration (autorecognition of
> encoding being the criteria on the latter). it is true, that that is
> all there, but the standard requires at least twice as many
> syntactic forms as are necessary. so despite having read mr
> murray-rust's note on background to the list itself (re: XML-DEV
> (was Re: YAXPAPI)) which gave me some sense of the effort which has
> gone into the proposed recommendation, the distance between the
> simple form of the denoted data and the complexity of the syntactic
> form often leads me to ask "why?"

Many people have had discussions of the form "a markup language might
...", in which a clean, new theoretical language is designed.  These
discussions are useful and interesting, but completely outside of the
scope of XML, whose charter was to enable the transfer of SGML over
the Web.

If you want to design such a language, and are successful in
encouraging its adoption, many current SGMLheads would be very
grateful.  We use SGML because it is the best existing tool, not
because it is the best possible.

> (as an aside, i didn't - and still don't - see that as, in itself, a
> sufficient explanation, since the case would comprise two instances
> of a "document type declaration": one in the xml document and the
> other in the prolog of the external portion of the "document type
> definition", which was referred to from the first, but is not
> contained in the first, and which serves to constrain the root
> element <em>if</em> so desired.)

And indeed, some older SGML software produces documents like this.
This is a purely backwards-compatibility issue, from one point of
view; disambiguation rules could easily be developed, but then that
language would not be SGML.  See the XML charter.

> another example is the MDC (']]>') exclusion in CharData which means
> that one needs a state machine to scan character data. why?

This is because floating msc/mdc combos (']]>') that look harmless
today can get you later in a big way, for instance once the text ends
up inside a marked section and the stray ']]>' closes it early.  See
_The SGML FAQ Book_, and trust us on this.  I'd recommend avoiding
marked sections in the document instance altogether, but if you
don't, *ALWAYS* escape any occurrence of ']]>' in data.
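
For anyone generating XML programmatically, the escaping can be done
in one place.  The following is only a sketch in Python: the function
name escape_chardata is made up for this example, and the standard
library's ElementTree parser is used just to check the round trip.

```python
import xml.etree.ElementTree as ET

def escape_chardata(text):
    """Escape text for use as element content, including any ']]>'."""
    # '&' must go first so the entities added below are not re-escaped.
    text = text.replace("&", "&amp;")
    text = text.replace("<", "&lt;")
    # Only the '>' of a ']]>' sequence has to be escaped in content.
    text = text.replace("]]>", "]]&gt;")
    return text

raw = "if (a[b[0]]>1 && c<2)"          # contains a floating ']]>'
escaped = escape_chardata(raw)
doc = "<code>" + escaped + "</code>"

print(escaped)
print(ET.fromstring(doc).text == raw)  # the parser gives back the original
```

Escaping every '>' as '&gt;' would also work and is simpler to reason
about; the version above just shows the minimum that is needed.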

> another example is that of [24], in itself, where the npw believes
> his point (in a previous posting) was misunderstood, and can only
> repeat the question <em>why</em> is a PI-close specified to be '?>'
> and not '>', which would be easier, or ('?>' | '>'), which would be
> robuster and observes (wrt to 'XML' itself) that the standard, cf #6
> with irony, engenders an encoding where of the four obvious humanly
> legible encodings (that is, neglecting 'xMl' et.al.: ('<?XML' |
> '<?xml') ... ('?>' | '>')) only one is legitimized. why?  if the
> precision of an encoding depends so much on uniqueness, then why
> does one start out with such a level of lexical complexity in the
> first place, only to then exclude much of it as 'malformed'? all you
> need is <, >, ', & and / (if you allow element recursion) - and even
> the distinction between < and > is more for the eye than anything
> else.

The pic *was* '>' in SGML.  It was explicitly changed to '?>' for two
reasons.  First, there is no standardized way of escaping characters
in a PI, so with pic='>' there's no way to put a greater-than in a
processing instruction.  '<?JScript if (1>2)>' is illegal.  Yes, you
can use application conventions, but are authors going to buy
'<?JScript if (1&gt;2)>'?  Since '?>' is much less likely to occur
*within* PIs, it makes a safer delimiter.  Second, the symmetry is
appealing, especially for new authors.  Have you never seen <!-- --!>
used as a comment on Web pages?  The <? ... ?> syntax is more
intuitive.
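
The delimiter problem is easy to see with a toy scanner.  This is a
sketch in Python, not how any real parser is written; pi_body is a
hypothetical helper that returns everything between '<?' and the first
occurrence of whatever pic it is told to use.

```python
def pi_body(text, pic):
    """Return the text between the leading '<?' and the first pic."""
    if not text.startswith("<?"):
        raise ValueError("not a processing instruction")
    end = text.index(pic, 2)   # first occurrence of the close delimiter
    return text[2:end]

# With pic='>' the scanner stops at the '>' inside '1>2'.
print(pi_body("<?JScript if (1>2)>", ">"))

# With pic='?>' the whole instruction survives.
print(pi_body("<?JScript if (1>2)?>", "?>"))
```

The first call returns a truncated instruction and leaves '2)>' behind
as stray data; the second returns the instruction the author wrote.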

Take the time to search the SGML WG archives
(<URL:http://lists.w3.org/Archives/Public/w3c-sgml-wg>), which go
through July of this year and are open to the public, and the XML SIG
archives (address unknown).  Searching them will lead to answers to
many of these questions.  See also the XML FAQ at
<URL:http://www.ucc.ie/xml/>.

-Chris
-- 
<!NOTATION SGML.Geek PUBLIC "-//Anonymous//NOTATION SGML Geek//EN">
<!ENTITY crism PUBLIC "-//O'Reilly//NONSGML Christopher R. Maden//EN"
"<URL>http://www.oreilly.com/people/staff/crism/ <TEL>+1.617.499.7487
<USMAIL>90 Sherman Street, Cambridge, MA 02140 USA" NDATA SGML.Geek>




