First draft of proposed XML TC for Unicode 3.0 (unofficial)
Nik O
niko at cmsplatform.com
Fri Sep 10 21:05:18 BST 1999
<escape_clause>
If an overriding design goal of XML 1.0 is to ensure that all existing
well-formed documents will always be well-formed, forever and ever, then the
rest of this message is moot, and should be promptly sent to the trash-bin.
If, OTOH, it might be acceptable to break a minuscule number of documents in
return for a more dynamic and extensible handling of characters in XML,
please consider this message.
</escape_clause>
It is a given that changes from Unicode 2.0 to 3.0 will require changes to
XML 1.0, and thus all existing XML-compliant parsers will cease to be
compliant when the changes are made. These Unicode changes aren't
"corrections of printer's errors" -- they are real changes in the XML spec,
and will require changes to XML parsers and apps, as well.
I guess my previous message was sufficiently obscure, since my real intention
was to raise the issue of how these sorts of changes are to be managed.
I previously wrote:
>>
>> This change of classification may well break some existing XML parsers
>> and/or apps, no matter whether or not these characters remain legal in
>> XML names.
>
>John Cowan replied:
>
> Only if those parsers are not compliant. Appendix B explicitly lays out
> in the BNF what is and what is not legal in XML names, which is precisely
> why it needs revision now.
I should have said that "..the new BaseChars changes _will_ break _all_
existing XML parsers and/or apps..". Once this change is made to XML,
existing parsers won't be compliant since they've implemented BNF rule 85
from REC-xml-19980210, and thus won't recognize these new BaseChars (e.g.
#x01F6) as legal name characters.
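To make the #x01F6 case concrete, here is a minimal Python sketch. The range
list is only the slice of rule 85 around #x01F6, transcribed by hand from
REC-xml-19980210, not the whole production. The function names are made up for
illustration, and the second check reflects whatever Unicode version the local
Python's unicodedata module ships with.

```python
import unicodedata

# Hand-copied excerpt of BaseChar (rule 85) near #x01F6; NOT the full rule.
RULE_85_EXCERPT = [
    (0x0180, 0x01C3),
    (0x01CD, 0x01F0),
    (0x01F4, 0x01F5),
    (0x01FA, 0x0217),
]


def in_rule_85_excerpt(cp):
    """True if the code point falls in one of the excerpted ranges."""
    return any(lo <= cp <= hi for lo, hi in RULE_85_EXCERPT)


def is_letter_in_local_unicode(cp):
    """True if the local Unicode tables give the code point a letter category."""
    return unicodedata.category(chr(cp)).startswith("L")


if __name__ == "__main__":
    cp = 0x01F6
    print(hex(cp), "in rule 85 excerpt:", in_rule_85_excerpt(cp))
    print(hex(cp), "letter per local Unicode:", is_letter_in_local_unicode(cp))
```

A parser built from the frozen table rejects the character, while one that
consults current Unicode data accepts it as a letter.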
XML 1.0 is frozen in the time of Unicode 2.0 since XML used a copy of,
rather than a reference to, the Unicode character encodings. What i was
suggesting (rather poorly, it seems) is that if XML were to simply refer to
Unicode there would never be the need to squeeze changes to XML into
"corrigenda".
I do understand that using BNF to describe XML required this sort of
copying, since rule 85 is the foundation of many other rules. But i
seriously doubt that most parsers actually use the BNF directly to build
their internal tables -- the BNF is merely the specification of data that
are translated into internal bitmaps or whatever.
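As a rough illustration of the "internal bitmaps" idea, the sketch below turns
a list of BNF-style ranges into a bit table covering #x0000-#xFFFF. It is an
assumption about how a parser might do it, not a description of any particular
parser. All names are hypothetical, and only the first two ranges of rule 85
(the ASCII letters) are loaded.

```python
def build_bitmap(ranges, size=0x10000):
    """Pack inclusive (lo, hi) code point ranges into one bit per code point."""
    bits = bytearray(size // 8)
    for lo, hi in ranges:
        for cp in range(lo, hi + 1):
            bits[cp >> 3] |= 1 << (cp & 7)
    return bits


def is_member(bits, cp):
    """Test one code point against the bitmap; out-of-range is never a member."""
    if cp < 0 or cp >= len(bits) * 8:
        return False
    return bool(bits[cp >> 3] & (1 << (cp & 7)))


# First two ranges of BaseChar (rule 85): [#x0041-#x005A] | [#x0061-#x007A]
ASCII_LETTERS = [(0x41, 0x5A), (0x61, 0x7A)]
BITMAP = build_bitmap(ASCII_LETTERS)

if __name__ == "__main__":
    print(is_member(BITMAP, ord("A")), is_member(BITMAP, ord("1")))
```

Since the table is just data, it could be regenerated from a different source
(the BNF today, a Unicode data file tomorrow) without touching the lookup code.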
I previously wrote:
>>
>> IMHO, "backward compatibility" does not justify a special rule for the
>> treatment of these characters! If symbols, in general, are not legal
>> name characters, then these symbols should not receive special treatment,
>> just because they were erroneously classified in an earlier Unicode.
>
>John Cowan replied:
>
> Adding an extra rule isn't that hard, I_M_HO.
Very true, but isn't this the top of a slippery slope, whereby every change
to Unicode might require yet another special rule to maintain backward
compatibility?
It would be possible to define XML characters as being based directly upon
the current Unicode data tables, i.e., replace the whole BNF rule 85 table
with a rule that directly referenced Unicode: "BaseChar ::= [..what Unicode
says..]". I realise that this example isn't real BNF, but it is just as
valid a method of specifying characters. We could perhaps refer to
Unicode's BNF rules for the purpose of the XML grammar, but use Unicode
tables for actual XML implementations.
This way, XML would simply need to use "..a set of rules whereby you can
extract the XML lists from those in the [Unicode] standard automatically."
It's true that some characters might be reclassified and thus cause some
documents that were well-formed to lose that status (again, i believe this
to be a minuscule subset of all XML documents). But the advantage would be
a simple and open identification of documents as "XML 1.0 + Unicode x.x
compliant".
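A rough sketch of what deriving the name-start set from Unicode tables might
look like, in Python. It assumes the general rules given in Appendix B of the
XML 1.0 Rec (categories Ll, Lu, Lo, Lt, Nl; nothing from the compatibility
area; nothing with a font or compatibility decomposition) and deliberately
ignores the individually listed exceptions, so it is an approximation only. The
function names are hypothetical, and the answers depend on the Unicode version
bundled with the local Python.

```python
import unicodedata

# General categories allowed for name-start characters (XML 1.0, Appendix B).
NAME_START_CATEGORIES = {"Ll", "Lu", "Lo", "Lt", "Nl"}


def derived_name_start(cp):
    """Approximate name-start test derived from the local Unicode tables."""
    ch = chr(cp)
    if unicodedata.category(ch) not in NAME_START_CATEGORIES:
        return False
    # Compatibility area: greater than #xF900 and less than #xFFFE.
    if 0xF900 < cp < 0xFFFE:
        return False
    # A tagged decomposition such as "<compat> ..." marks a font or
    # compatibility decomposition.
    if unicodedata.decomposition(ch).startswith("<"):
        return False
    return True


def compliance_label():
    """Label a processor by the Unicode version its tables come from."""
    return "XML 1.0 + Unicode " + unicodedata.unidata_version


if __name__ == "__main__":
    print(compliance_label())
    print(hex(0x01F6), derived_name_start(0x01F6))
```

The label is the point: the same code gives different answers for a
reclassified character depending on which Unicode tables sit underneath it, and
the label says which.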
If i'm using XML in a real-world environment, it doesn't matter if XML 1.0
has been changed to allow some new character if i haven't upgraded my
Unicode support, and vice versa. This would tighten the bond between XML
and Unicode, since the latter organization couldn't make their changes
oblivious to their impact upon XML (no insult intended to Unicode, Inc.).
Since XML is based upon Unicode, XML developers are also, by definition,
Unicode developers -- these two communities are already interdependent.
As mentioned in the annotated version of XML 1.0, there exists a
contradiction between the abstract and section 1.1 of XML 1.0 regarding the
completeness of the XML spec. A specification's text typically takes
precedence over its abstract. Given this, we could presume that XML is
intended to be based upon Unicode and ISO 10646, and we could/should defer
to those standards for classifications of characters, assignments of values,
etc.
I'm just speculating about a future implementation of these interlocking
standards that would be extensible by relying upon commonly shared data
tables, rather than specified grammars -- a little more OO, a little less
BNF/ML.
Regards,
Nik O, Teton Data Systems, Jackson, Wyo.
======= Begin excerpts (from XML 1.0 Rec) =======
Abstract
The Extensible Markup Language (XML) is a subset of SGML that is completely
described in this document.
:
1.1 Origin and Goals
:
This specification, together with associated standards (Unicode and ISO/IEC
10646 for characters, Internet RFC 1766 for language identification tags,
ISO 639 for language name codes, and ISO 3166 for country name codes),
provides all the information necessary to understand XML Version 1.0 and
construct computer programs to process it.
======= End excerpt =======
======= Begin excerpts (from Tim Bray's Annotated XML 1.0) =======
XML Rules For Character Classification
Although the Working Group emphatically did argue over the inclusion and
exclusion of individual characters, we (well, mostly James Clark) were able
to work out a set of rules whereby you can extract the XML lists from those
in the standard automatically.
======= End excerpt =======