First draft of proposed XML TC for Unicode 3.0 (unofficial)

Nik O niko at cmsplatform.com
Thu Sep 9 21:02:41 BST 1999


Even though the XML 1.0 Rec does specify the use of Unicode 2.0, I heartily
agree that we should be moving to support Unicode 3.0, rather than remaining
with the older version of Unicode.

John Cowan wrote:
>  :
>In addition, the following characters no longer pass the tests given
>in Appendix B for valid name or name-start characters, but should
>remain legal in XML names for backward compatibility, and therefore
>should be explicitly enumerated in the corrigendum:
>
>03D0;GREEK BETA SYMBOL
>03D1;GREEK THETA SYMBOL
>03D2;GREEK UPSILON WITH HOOK SYMBOL
>03D5;GREEK PHI SYMBOL
>03D6;GREEK PI SYMBOL
>03F0;GREEK KAPPA SYMBOL
>03F1;GREEK RHO SYMBOL
>03F2;GREEK LUNATE SIGMA SYMBOL
>  :

I disagree that these characters should remain legal in XML names.

  1)  Were the above changes based upon the recognition that Unicode 2.1
erroneously classified these symbols as letters?

  2)  If these characters continue to be considered legal name-start
characters, won't productions [4], [5], [84], and [85] now contradict the
text (following the legal characters table in Appendix B) regarding legal
name and name-start characters?

  3)  If the answer to question #2 is yes, won't the text then need to be
modified to read: "Name start characters must have one of the categories Ll,
Lu, Lo, Lt, Nl [, except for these "special" ones..]"?
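
To make question 3 concrete, here is a minimal Python sketch of what a
category rule with an explicit exception list might look like.  The names
(GRANDFATHERED, is_name_start, keep_grandfathered) are made up for
illustration, and the category lookup uses whatever Unicode version ships
with the Python runtime, not 2.0 or 3.0.  It also ignores the other Appendix
B conditions and the '_' and ':' alternatives of production [5].

```python
import unicodedata

# Categories named in the Appendix B text for name-start characters.
NAME_START_CATEGORIES = frozenset({"Ll", "Lu", "Lo", "Lt", "Nl"})

# The eight Greek symbols enumerated in the quoted message (hypothetical
# exception list; a corrigendum might word this differently).
GRANDFATHERED = frozenset(
    chr(cp)
    for cp in (0x03D0, 0x03D1, 0x03D2, 0x03D5, 0x03D6, 0x03F0, 0x03F1, 0x03F2)
)


def is_name_start(ch, keep_grandfathered=True):
    """Category rule, optionally extended with the special-case list."""
    if keep_grandfathered and ch in GRANDFATHERED:
        return True
    return unicodedata.category(ch) in NAME_START_CATEGORIES


if __name__ == "__main__":
    for ch in ("A", "+", "\u03d0"):
        print("U+%04X" % ord(ch), is_name_start(ch))
```

The extra branch is the whole cost of the "special" rule: one more table to
carry and one more place for implementations to disagree.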

This change of classification may well break some existing XML parsers
and/or apps, no matter whether or not these characters remain legal in XML
names.

Consider that there are two ways that an XML parser might have implemented
production [85]: 1) use a simple table of character ranges, copied directly
from the XML 1.0 Rec; or 2) a truly Unicode-aware parser might have instead
used a table of categories derived from the Unicode data file, and
implemented the "Ll, Lu, Lo, Lt, Nl" rule, based upon that table.
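
As a rough sketch of the two strategies (not taken from any real parser),
the Python below puts them side by side.  The static table holds only the
ranges of production [85] shown in the excerpt at the end of this message,
so it is incomplete.  The category version relies on Python's unicodedata
module, whose data version depends on the runtime, and it leaves out the
other Appendix B conditions.  The function names are hypothetical.

```python
import unicodedata
from bisect import bisect_right

# Approach 1: static table copied from the excerpted part of production [85].
BASECHAR_RANGES = [
    (0x0041, 0x005A), (0x0061, 0x007A), (0x00C0, 0x00D6), (0x00D8, 0x00F6),
    (0x00F8, 0x00FF), (0x0100, 0x0131), (0x0134, 0x013E), (0x0141, 0x0148),
    (0x014A, 0x017E), (0x0180, 0x01C3), (0x01CD, 0x01F0), (0x01F4, 0x01F5),
    (0x01FA, 0x0217), (0x0250, 0x02A8), (0x02BB, 0x02C1), (0x0386, 0x0386),
    (0x0388, 0x038A), (0x038C, 0x038C), (0x038E, 0x03A1), (0x03A3, 0x03CE),
    (0x03D0, 0x03D6), (0x03DA, 0x03DA), (0x03DC, 0x03DC), (0x03DE, 0x03DE),
    (0x03E0, 0x03E0), (0x03E2, 0x03F3),
]
_RANGE_STARTS = [lo for lo, _ in BASECHAR_RANGES]


def is_basechar_static(ch):
    """Binary-search the sorted range table for the code point."""
    cp = ord(ch)
    i = bisect_right(_RANGE_STARTS, cp) - 1
    return i >= 0 and cp <= BASECHAR_RANGES[i][1]


# Approach 2: derive the answer from the Unicode general category.
NAME_START_CATEGORIES = frozenset({"Ll", "Lu", "Lo", "Lt", "Nl"})


def is_name_start_by_category(ch):
    """Apply the "Ll, Lu, Lo, Lt, Nl" rule using the runtime's Unicode data."""
    return unicodedata.category(ch) in NAME_START_CATEGORIES


if __name__ == "__main__":
    print("Unicode data version:", unicodedata.unidata_version)
    for ch in ("A", "1", "\u00d7", "\u03d0"):
        print("U+%04X" % ord(ch), is_basechar_static(ch),
              is_name_start_by_category(ch))
```

The first function never changes unless someone edits the table; the second
can give different answers on a different Unicode database without a single
line of parser code being touched.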

If I were the developer of serious Unicode-aware software, I'd probably have
chosen the second approach, since it is _extensible_ (my parser changes in
sync with the Unicode changes); whereas the first is based upon a _static_
table (that changes only when the W3C decrees, if ever).

I do suppose we could argue that Unicode was expected to change more often
than XML, and that the first approach would therefore require less frequent
parser software updates.  Either way -- if Unicode changes then those things
built upon it (e.g. Java, XML) also have to change.

I argue that keeping simple "legal name character" rules is more important
than the rather slight possibility of breaking some existing XML documents.
At the risk of being labeled Anglo-centric, how many docs are likely to have
used these Greek, Arabic, Thai, Lao, or Tibetan symbols in XML names?  (I do
suppose that James Clark's choice of residence might have skewed the
frequency of Thai in XML, though ;-).

IMHO, "backward compatibility" does not justify a special rule for the
treatment of these characters!  If symbols, in general, are not legal name
characters, then these symbols should not receive special treatment, just
because they were erroneously classified in an earlier Unicode.  If these
characters indeed aren't letters, then they should be removed from
production [85].  This way the corrigendum need only correct [85], a
relatively simple change.

Also, won't the entry in "A.1 Normative References" also need to be changed
to reference the Unicode 3.0 spec, rather than the older version?

I, too, have no insight into the W3C process in this matter.  Presumably
there will one day be an XML 1.1, if only after the XML 1.0 errata reach a
critical mass, and/or the Namespaces issue is resolved...

Regards,
 Nik O, Teton Data Systems, Jackson, Wyo.

======= Begin excerpt (from XML 1.0 Rec) =======

[4]  NameChar ::=  Letter | Digit | '.' | '-' | '_' | ':' | CombiningChar |
Extender
[5]  Name ::=  (Letter | '_' | ':') (NameChar)*
  :

[84]  Letter ::=  BaseChar | Ideographic

[85]  BaseChar ::=  [#x0041-#x005A] | [#x0061-#x007A] | [#x00C0-#x00D6] |
[#x00D8-#x00F6] | [#x00F8-#x00FF] | [#x0100-#x0131] | [#x0134-#x013E] |
[#x0141-#x0148] | [#x014A-#x017E] | [#x0180-#x01C3] | [#x01CD-#x01F0] |
[#x01F4-#x01F5] | [#x01FA-#x0217] | [#x0250-#x02A8] | [#x02BB-#x02C1] |
#x0386 | [#x0388-#x038A] | #x038C | [#x038E-#x03A1] | [#x03A3-#x03CE] |
[#x03D0-#x03D6] | #x03DA | #x03DC | #x03DE | #x03E0 | [#x03E2-#x03F3]
  :

The character classes defined here can be derived from the Unicode character
database as follows:

* Name start characters must have one of the categories Ll, Lu, Lo, Lt, Nl.

* Name characters other than Name-start characters must have one of the
categories Mc, Me, Mn, Lm, or Nd.
  :
======= End excerpt =======


