Defn. of Extender (Pdn. 89) again
Paul W. Abrahams
abrahams at valinet.com
Thu Aug 26 02:47:33 BST 1999
Having searched the unicode.org website, I'm still puzzled as to what an
extender character is. The issue was raised once before, back in
January, by the following interchange:
------------------
Re: Extender characters, Production 89 of XML 1.0
John Cowan (cowan at locke.ccil.org)
Mon, 11 Jan 1999 14:07:50 -0500
Elliotte Rusty Harold wrote:
> In XML ["extender"]
> characters can be used anywhere a base character or ideographic
> character can be used.
This is not quite true, because extenders are not name-start characters
in either XML or Unicode.
> However I have been unable to find in the Unicode book or Web site any
> definition of what makes a character an extender. Can anyone clue me in on
> why some Unicode characters have the extender property while others don't?
> What's the logic behind this grouping of characters across languages?
Roughly (and unofficially) speaking, an extender is something that isn't
a letter or combining mark but often appears embedded in words.
For example, one may use L plus MIDDLE DOT as a compatibility equivalent
of L WITH MIDDLE DOT in writing Catalan, and we do not want a
Catalan name to break into two names at the MIDDLE DOT.
(The dot is used to distinguish two successive Ls, written with
a dot, from the unitary Catalan letter "ll", written without a dot.)
Extenders are enumerated (but not explained) in Section 5.14 of
the Unicode Standard.
-----------
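To make the quoted point concrete, here is a minimal Python sketch of how
an extender behaves inside an XML 1.0 name. The EXTENDERS set is
transcribed from production 89. Everything else is a simplification:
is_simplified_name is a hypothetical helper, str.isalpha() and
str.isdigit() only approximate the spec's Letter and Digit classes, and
CombiningChar is left out entirely.

```python
# Code points listed in production 89 (Extender) of XML 1.0.
EXTENDERS = (
    {0x00B7, 0x02D0, 0x02D1, 0x0387, 0x0640, 0x0E46, 0x0EC6, 0x3005}
    | set(range(0x3031, 0x3036))  # 3031-3035
    | set(range(0x309D, 0x309F))  # 309D-309E
    | set(range(0x30FC, 0x30FF))  # 30FC-30FE
)


def is_name_start(ch):
    # Assumption: str.isalpha() stands in for the spec's Letter class.
    # Extenders are excluded explicitly, since many of them are Lm
    # characters that isalpha() would otherwise accept.
    return ch in "_:" or (ch.isalpha() and ord(ch) not in EXTENDERS)


def is_name_char(ch):
    # Assumption: str.isdigit() stands in for the spec's Digit class.
    return (
        is_name_start(ch)
        or ch in ".-"
        or ch.isdigit()
        or ord(ch) in EXTENDERS
    )


def is_simplified_name(s):
    # Hypothetical helper: first character must be a name-start
    # character, every later character must be a name character.
    return bool(s) and is_name_start(s[0]) and all(
        is_name_char(c) for c in s[1:]
    )


# A Catalan word with MIDDLE DOT embedded stays a single name...
print(is_simplified_name("col\u00b7lecci\u00f3"))
# ...but a name may not begin with the extender.
print(is_simplified_name("\u00b7lecci\u00f3"))
```

Under these assumptions the MIDDLE DOT is accepted in the middle of a
name and rejected at the start, which is the behaviour described in the
quoted reply.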
The description of the Unicode 2.1 character database says nothing about
what an extender is. The extenders listed in that database are:
00B7;MIDDLE DOT;Po;0;ON;;;;;N;;;;;
02D0;MODIFIER LETTER TRIANGULAR COLON;Lm;0;ON;;;;;N;;;;;
02D1;MODIFIER LETTER HALF TRIANGULAR COLON;Lm;0;ON;;;;;N;;;;;
0387;GREEK ANO TELEIA;Po;0;ON;00B7;;;;N;;;;;
0640;ARABIC TATWEEL;Lm;0;R;;;;;N;;;;;
0E46;THAI CHARACTER MAIYAMOK;Lm;0;L;;;;;N;THAI MAI YAMOK;;;;
0EC6;LAO KO LA;Lm;0;L;;;;;N;;;;;
3005;IDEOGRAPHIC ITERATION MARK;Lm;0;L;;;;;N;;;;;
3031;VERTICAL KANA REPEAT MARK;Lm;0;L;;;;;N;;;;;
3032;VERTICAL KANA REPEAT WITH VOICED SOUND MARK;Lm;0;L;;;;;N;;;;;
3033;VERTICAL KANA REPEAT MARK UPPER HALF;Lm;0;L;;;;;N;;;;;
3034;VERTICAL KANA REPEAT WITH VOICED SOUND MARK UPPER HALF;Lm;0;L;;;;;N;;;;;
3035;VERTICAL KANA REPEAT MARK LOWER HALF;Lm;0;L;;;;;N;;;;;
309D;HIRAGANA ITERATION MARK;Lm;0;L;;;;;N;;;;;
309E;HIRAGANA VOICED ITERATION MARK;Lm;0;L;309D 3099;;;;N;;;;;
30FC;KATAKANA-HIRAGANA PROLONGED SOUND MARK;Lm;0;L;;;;;N;;;;;
30FD;KATAKANA ITERATION MARK;Lm;0;L;;;;;N;;;;;
30FE;KATAKANA VOICED ITERATION MARK;Lm;0;L;30FD 3099;;;;N;;;;;
The extenders each fall into category Po (Punctuation, Other) or
category Lm (Letter, Modifier). However, many other characters fall
into these categories also. For example:
02B2;MODIFIER LETTER SMALL J;Lm;0;L;<super> 006A;;;;N;;;;;
02B3;MODIFIER LETTER SMALL R;Lm;0;L;<super> 0072;;;;N;;;;;
Both of these fall into category Lm. And the following, among many
others, fall into category Po:
0021;EXCLAMATION MARK;Po;0;ON;;;;;N;;;;;
0022;QUOTATION MARK;Po;0;ON;;;;;N;;;;;
0023;NUMBER SIGN;Po;0;ET;;;;;N;;;;;
So despite the statement in the XML spec that "the character classes
defined here can be derived from the Unicode character database as
follows", there doesn't seem to be anything in that database that
would uniquely characterize the extenders. The statement "Character
#x00B7 is classified as an extender, because the property list so
identifies it" is puzzling, since nothing in the property list cited
above identifies it as such; in fact, apart from the code point and
name, its entry is identical to that of 0021;EXCLAMATION MARK.
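The comparison can be checked mechanically. The following Python sketch
parses a handful of the records quoted above, groups them by general
category, and compares the property fields of MIDDLE DOT and EXCLAMATION
MARK. The parse helper and the variable names are mine, and the
EXTENDERS set is transcribed from production 89 of XML 1.0 rather than
derived from the database, which is exactly the gap at issue.

```python
# A few records copied from the listings above (semicolon-separated
# fields: code point, name, general category, then the other properties).
RECORDS = """\
00B7;MIDDLE DOT;Po;0;ON;;;;;N;;;;;
0387;GREEK ANO TELEIA;Po;0;ON;00B7;;;;N;;;;;
0640;ARABIC TATWEEL;Lm;0;R;;;;;N;;;;;
02B2;MODIFIER LETTER SMALL J;Lm;0;L;<super> 006A;;;;N;;;;;
0021;EXCLAMATION MARK;Po;0;ON;;;;;N;;;;;
"""

# Code points listed in production 89 (Extender) of XML 1.0.
EXTENDERS = (
    {0x00B7, 0x02D0, 0x02D1, 0x0387, 0x0640, 0x0E46, 0x0EC6, 0x3005}
    | set(range(0x3031, 0x3036))  # 3031-3035
    | set(range(0x309D, 0x309F))  # 309D-309E
    | set(range(0x30FC, 0x30FF))  # 30FC-30FE
)


def parse(line):
    # Split one record into (code point, name, remaining property fields).
    fields = line.split(";")
    return int(fields[0], 16), fields[1], fields[2:]


rows = [parse(line) for line in RECORDS.splitlines()]

# Group names by general category, noting which ones XML calls extenders.
by_category = {}
for code, name, props in rows:
    by_category.setdefault(props[0], []).append((name, code in EXTENDERS))

for category, members in sorted(by_category.items()):
    print(category, members)

# Compare every field after the name for the two characters in question.
props_by_code = {code: props for code, _, props in rows}
same_props = props_by_code[0x00B7] == props_by_code[0x0021]
print("MIDDLE DOT and EXCLAMATION MARK share all listed fields:", same_props)
```

Each category comes out containing both extenders and non-extenders,
and the final comparison finds no field that separates MIDDLE DOT from
EXCLAMATION MARK.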
Can anyone elaborate on John Cowan's statement that "an extender is
something that isn't a letter or combining mark but often appears
embedded in words"?
And finally: I have the Unicode 2.0 book in front of me, and "extender"
appears neither in the General Index nor, as far as I can tell, in the
Table of Contents.
Paul Abrahams
xml-dev: A list for W3C XML Developers. To post, mailto:xml-dev at ic.ac.uk
Archived as: http://www.lists.ic.ac.uk/hypermail/xml-dev/ and on CD-ROM/ISBN 981-02-3594-1