First draft of proposed XML TC for Unicode 3.0 (unofficial)

John Cowan cowan at locke.ccil.org
Tue Sep 7 23:07:03 BST 1999


This is version 0.1 of a proposed technical corrigendum to XML 1.0
to incorporate the new characters of Unicode 3.0 into the allowable
sets used in XML Names.  It presumes that XML should not
remain limited to an obsolete version of the Unicode and ISO 10646
standards.

The new scripts handled are:
Cherokee, Ethiopic, Khmer, Mongolian, Myanmar, Ogham, Runic, Sinhala,
Syriac, Thaana, Unified Canadian Aboriginal Syllabics, Yi.

These lists of new characters were constructed by applying the rules given
in Appendix B of XML 1.0 to the current Unicode 3.0 data file from the
Unicode Consortium.  This version of the proposal does not yet incorporate
information from the Unicode 3.0 properties list.

(Unicode 3.0 is technically still in beta, but the character list has
been frozen for months now.)
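
The construction described above (Unicode data plus the Appendix B rules)
can be sketched roughly as follows.  This is only an illustration under
stated assumptions, not the script actually used: it relies on Python's
unicodedata module, which reflects whatever Unicode version ships with the
interpreter rather than the Unicode 3.0 data file, and the function name
appendix_b_class is hypothetical.  The special-case exceptions that
Appendix B lists by hand are left out.

```python
import unicodedata

# General categories allowed by the Appendix B rules of XML 1.0.
NAME_START_CATS = {"Ll", "Lu", "Lo", "Lt", "Nl"}
NAME_OTHER_CATS = {"Mc", "Me", "Mn", "Lm", "Nd"}


def appendix_b_class(cp):
    """Return "name-start", "name", or None for a code point.

    Hypothetical helper: applies only the general tests (category,
    compatibility area, font/compatibility decomposition), not the
    hand-listed exceptions.
    """
    ch = chr(cp)
    # Characters in the compatibility area are not allowed.
    if 0xF900 < cp < 0xFFFE:
        return None
    # Font and compatibility decompositions are tagged like "<compat> 03B2";
    # canonical decompositions carry no tag and are acceptable.
    if unicodedata.decomposition(ch).startswith("<"):
        return None
    cat = unicodedata.category(ch)
    if cat in NAME_START_CATS:
        return "name-start"
    if cat in NAME_OTHER_CATS:
        return "name"
    return None
```

With the data bundled in recent Python versions, appendix_b_class(0x1040)
(a Myanmar digit) gives "name", while appendix_b_class(0x03D0) gives None
because GREEK BETA SYMBOL carries a compatibility decomposition.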

New BaseChars (BNF rule 85):

[#x01F6-#x01F9] /* new Latin letters */
| [#x0218-#x021F]
| [#x0222-#x0233]
| [#x02A9-#x02AD] /* new IPA Latin letters */
| #x03D7 /* new Greek letters */
| #x03DB
| #x03DD
| #x03DF
| #x03E1
| #x0400 /* new Cyrillic letters */
| #x040D
| #x0450
| #x045D
| [#x048C-#x048F]
| [#x04EC-#x04ED]
| [#x06B8-#x06B9] /* new Arabic letters */
| #x06BF
| #x06CF
| [#x06FA-#x06FC]
| #x0710 /* new Syriac script */
| [#x0712-#x072C]
| [#x0780-#x07A5] /* new Thaana script */
| #x0950 /* OM letters */
| #x0AD0
| [#x0D85-#x0D96] /* new Sinhala script */
| [#x0D9A-#x0DB1]
| [#x0DB3-#x0DBB]
| #x0DBD
| [#x0DC0-#x0DC6]
| #x0E2F /* new Thai characters */
| #x0EAF
| #x0F00 /* Tibetan OM */
| #x0F6A /* new Tibetan letters */
| [#x1000-#x1021] /* new Myanmar script */
| [#x1023-#x1027]
| [#x1029-#x102A]
| [#x1050-#x1055]
| #x1101 /* Hangul jamo that are no longer compatibility characters */
| #x1104
| #x1108
| #x110A
| #x110D
| [#x1113-#x113B]
| #x113D
| #x113F
| [#x1141-#x114B]
| #x114D
| #x114F
| [#x1151-#x1153]
| [#x1156-#x1158]
| #x1162
| #x1164
| #x1166
| #x1168
| [#x116A-#x116C]
| [#x116F-#x1171]
| #x1174
| [#x1176-#x119D]
| [#x119F-#x11A2]
| [#x11A9-#x11AA]
| [#x11AC-#x11AD]
| [#x11B0-#x11B6]
| #x11B9
| #x11BB
| [#x11C3-#x11EA]
| [#x11EC-#x11EF]
| [#x11F1-#x11F8]
| [#x1200-#x1206] /* new Ethiopic script */
| [#x1208-#x1246]
| #x1248
| [#x124A-#x124D]
| [#x1250-#x1256]
| #x1258
| [#x125A-#x125D]
| [#x1260-#x1286]
| #x1288
| [#x128A-#x128D]
| [#x1290-#x12AE]
| #x12B0
| [#x12B2-#x12B5]
| [#x12B8-#x12BE]
| #x12C0
| [#x12C2-#x12C5]
| [#x12C8-#x12CE]
| [#x12D0-#x12D6]
| [#x12D8-#x12EE]
| [#x12F0-#x130E]
| #x1310
| [#x1312-#x1315]
| [#x1318-#x131E]
| [#x1320-#x1346]
| [#x1348-#x135A]
| [#x13A0-#x13F4] /* new Cherokee script */
| [#x1401-#x166C] /* new Canadian Syllabics script */
| [#x166F-#x1676]
| [#x1681-#x169A] /* new Ogham script */
| [#x16A0-#x16EA] /* new Runic script */
| [#x1780-#x17B3] /* new Khmer script */
| [#x1820-#x1842] /* new Mongolian script */
| [#x1844-#x1877]
| [#x1880-#x18A8]
| #x3006 /* Ideographic closing mark */
| [#x31A0-#x31B7] /* new Bopomofo letters */
| [#xA000-#xA48C] /* new Yi script */

IMHO none of these are controversial except perhaps the Hangul jamo.
Formerly, some Hangul jamo had compatibility decompositions into
sequences of other Hangul jamo.  These decompositions have been
removed from the Unicode Standard (actually in 2.1), so the jamo
should now be allowed in XML names in accordance with the rules in Appendix B.
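
Since the productions in this message use the same notation as the XML 1.0
BNF, they can be read mechanically.  The following is a minimal sketch,
assuming the input is text in exactly this "[#xNNNN-#xNNNN] | #xNNNN"
form with /* ... */ comments; the names parse_production and contains are
hypothetical, and SAMPLE is just three lines copied from the list above.

```python
import re

# Either a bracketed range "[#xAAAA-#xBBBB]" or a single "#xAAAA".
TOKEN = re.compile(r"\[#x([0-9A-Fa-f]+)-#x([0-9A-Fa-f]+)\]|#x([0-9A-Fa-f]+)")
COMMENT = re.compile(r"/\*.*?\*/", re.S)


def parse_production(text):
    """Turn a BNF character-class production into sorted (lo, hi) pairs."""
    ranges = []
    # Drop comments first so that only the character classes remain.
    for m in TOKEN.finditer(COMMENT.sub("", text)):
        if m.group(3):
            cp = int(m.group(3), 16)
            ranges.append((cp, cp))
        else:
            ranges.append((int(m.group(1), 16), int(m.group(2), 16)))
    return sorted(ranges)


def contains(ranges, cp):
    """True if the code point falls inside any of the (lo, hi) pairs."""
    return any(lo <= cp <= hi for lo, hi in ranges)


SAMPLE = """
[#x13A0-#x13F4] /* new Cherokee script */
| [#x1401-#x166C] /* new Canadian Syllabics script */
| #x3006 /* Ideographic closing mark */
"""

ranges = parse_production(SAMPLE)
```

A linear scan is enough for lists of this size; a binary search over the
sorted pairs would be the obvious refinement for bulk checking.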

New Ideographics (BNF rule 86):

[#x3400-#x4DB5] /* CJK Ideograph Extension A */

New CombiningChars (BNF rule 87):

[#x0346-#x034E] /* new IPA combining characters */
| #x0362
| [#x0488-#x0489] /* new Cyrillic combining characters */
| [#x0653-#x0655] /* new Arabic combining characters */
| #x0711 /* combining characters for new Syriac script */
| [#x0730-#x074A]
| [#x07A6-#x07B0] /* combining characters for new Thaana script */
| [#x0D82-#x0D83] /* combining characters for new Sinhala script */
| #x0DCA
| [#x0DCF-#x0DD4]
| #x0DD6
| [#x0DD8-#x0DDF]
| [#x0DF2-#x0DF3]
| #x0F96 /* new Tibetan subjoined letters */
| [#x0FAE-#x0FB0]
| #x0FB8
| [#x0FBA-#x0FBC]
| #x0FC6 /* new Tibetan combining character */
| [#x102C-#x1032] /* combining characters for new Myanmar script */
| [#x1036-#x1039]
| [#x1056-#x1059]
| [#x17B4-#x17D3] /* combining characters for new Khmer script */
| #x18A9 /* combining character for new Mongolian script */
| [#x20E2-#x20E3] /* new general combining characters */

IMHO none of these are controversial except perhaps #x20E2 and #x20E3,
which are primarily intended for use with symbol characters, and therefore
should perhaps be excluded as #x20DD-#x20E0 are.

New Digits (BNF rule 88):

[#x1040-#x1049] /* digits for new Myanmar script */
| [#x1369-#x1371] /* digits for new Ethiopic script */
| [#x17E0-#x17E9] /* digits for new Khmer script */
| [#x1810-#x1819] /* digits for new Mongolian script */

IMHO none of these will be controversial.

New Extenders (BNF rule 89):

#x02EE /* Modifier letter double apostrophe */
| #x1843 /* Modifier letter for new Mongolian script */

IMHO none of these will be controversial.

In addition, the following characters no longer pass the tests given
in Appendix B for valid name or name-start characters, but should
remain legal in XML names for backward compatibility, and therefore
should be explicitly enumerated in the corrigendum:

03D0;GREEK BETA SYMBOL
03D1;GREEK THETA SYMBOL
03D2;GREEK UPSILON WITH HOOK SYMBOL
03D5;GREEK PHI SYMBOL
03D6;GREEK PI SYMBOL
03F0;GREEK KAPPA SYMBOL
03F1;GREEK RHO SYMBOL
03F2;GREEK LUNATE SIGMA SYMBOL
0675;ARABIC LETTER HIGH HAMZA ALEF
0676;ARABIC LETTER HIGH HAMZA WAW
0677;ARABIC LETTER U WITH HAMZA ABOVE
0678;ARABIC LETTER HIGH HAMZA YEH
0E33;THAI CHARACTER SARA AM
0EB3;LAO VOWEL SIGN AM
0F77;TIBETAN VOWEL SIGN VOCALIC RR
0F79;TIBETAN VOWEL SIGN VOCALIC LL
1E9A;LATIN SMALL LETTER A WITH RIGHT HALF RING
212E;ESTIMATED SYMBOL

###

-- 
John Cowan                                   cowan at ccil.org
       I am a member of a civilization. --David Brin
