Unicode, xml:lang, and variant glyphs

Tue Nov 3 20:40:43 GMT 1998

Rick Jelliffe wrote:

> Not so. The additions are use composed of standard radicals and
> combinations. There are various projects around (such as C.C.Hsieh in
> Taiwan) to figure out encodings to "spell" Han ideographs by component
> radicals. 

I'm glad to hear about this; I find the IRG archives utterly
impenetrable.

> I guess the point is that John thinks that if an XML system can produce
> characters which a recipient system cannot process, because it does not use
> ISO 10646, that is not something that CDATA sections should be used to
> address. I think his reasons are that he cannot see it in the spec. [...]
> I think a lot
> of people now think that any non-ISO10646 system is for losers anyway
> (except for whatever character set they use, probably).

Well, actually I would say the latter rationale has more effect on me
than the former, if I must choose either.  It just seemed to me that
using CDATA sections to constrain the behavior of editors was not
particularly user-friendly; if the user wants a character, let her
have it, using a character reference if possible.

In general, transcoding an XML document means inserting numeric
character references (NCRs) for any characters the target encoding
cannot represent; if the target is UTF-8 or UTF-16, none are needed.
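
A minimal sketch of that transcoding step, in Python.  The sample
string, the helper name `transcode`, and the choice of ISO-8859-1 as a
target are my own assumptions for illustration.  It relies on the
standard `xmlcharrefreplace` error handler, which writes a decimal NCR
for each character the target encoding lacks.  Note that this is only
safe for character data and attribute values; NCRs are not recognized
in names, comments, processing instructions, or CDATA sections.

```python
def transcode(text: str, target: str) -> bytes:
    """Encode text for `target`, writing NCRs for characters it lacks."""
    return text.encode(target, errors="xmlcharrefreplace")


# Hypothetical sample: u-umlaut exists in Latin-1, U+8349 does not.
sample = "gr\u00fcn \u8349"

latin1 = transcode(sample, "iso-8859-1")  # U+8349 becomes an NCR
utf8 = transcode(sample, "utf-8")         # nothing needs replacing

print(latin1)  # b'gr\xfcn &#33609;'
print(utf8.decode("utf-8") == sample)  # True
```

The declared encoding in the XML declaration would of course have to
be changed to match the target as well; the sketch leaves that out.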

> The primary purpose of xml:lang, as far as I am concerned, should be to
> convey the information lost by ISO 10646 unification: where the Japanese and
> Chinese glyphs

Actually, the problem isn't that clear-cut.  As John Jenkins posted
to the Unicode list last year:

# FACT.  It is true that some Unihan characters are typically written 
# differently within the Japanese, Taiwanese, Korean, and Mainland Chinese 
# typographic traditions.  
# 
# FACT.  These differences of writing style are within the general range of 
# allowable differences within each typographic tradition.  
# 
# E.g., the official "Taiwanese" glyph for U+8349 ("grass") per ISO/IEC 
# 10646 uses four strokes for the "grass" radical, whereas the PRC, 
# Japanese, and Korean glyphs use three.  As it happens, Apple's LiSung 
# Light font for Big Five (which follows the "Taiwanese" typographic 
# tradition) uses three strokes.  
# 
# (This is easily confirmed by accessing 
# http://www.unicode.org/unihan/unihan.acgi$8349.)  
# 
# FACT.  Japanese users prefer to see Japanese text written with "Japanese" 
# glyphs.  
# 
# FACT.  It is also acceptable to Japanese users to see Chinese text 
# written with "Japanese" glyphs.  
# 
# E.g., I just borrowed from Lee Collins a standard Japanese dictionary 
# which quotes Chinese authors (e.g., Mencius) to show how a character is 
# used.  When doing so, they use "Japanese" glyphs, not Chinese ones. 
# 
# In particular, it is acceptable within Japanese typography for a small 
# stretch of Chinese quoted in a predominantly Japanese text to be written 
# with "Japanese" glyphs.  
# 
# FACT.  Han unification allows for the possibility that a Japanese user 
# might be required to use a Chinese font to display some Japanese text 
# (e.g., if it uses a rare kanji).  
# 
# FACT.  Ditto for JIS or an ISO 2022-based solution.  
# 
# FACT.  Unicode doesn't include all the characters in actual use in Japan 
# today, particularly for personal names.  
# 
# FACT.  Neither does JIS or an ISO 2022-based solution.  There are vendor 
# sets which include many of these characters, and Unicode is working with 
# the IRG and East Asian national bodies to add them.

> (or Polish and Russian)

How's that again?

Polish uses Latin, Russian uses Cyrillic!  What could possibly
count as a unification between these two??  *Nobody* thinks that
LATIN CAPITAL LETTER A and CYRILLIC CAPITAL LETTER A should be
unified....
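
For anyone who wants to check, here is a small sketch using Python's
standard `unicodedata` module (the variable names are mine).  It shows
that the two letters are separate code points with separate names,
however alike they look in print.

```python
import unicodedata

latin_a = "\u0041"     # Latin script
cyrillic_a = "\u0410"  # Cyrillic script, visually similar

print(unicodedata.name(latin_a))     # LATIN CAPITAL LETTER A
print(unicodedata.name(cyrillic_a))  # CYRILLIC CAPITAL LETTER A
print(latin_a == cyrillic_a)         # False
```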

> for a unified character differ, then
> I think transcoding and unifying the characters into ISO 10646 can lose
> information unless the xml:lang attribute is set.

It doesn't lose information about meaning.  It may make characters
harder to read, but the distinction is one of typographic tradition,
not language, and can cross languages.
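
To make that concrete, here is a rough sketch in Python using the
standard `xml.etree.ElementTree` parser.  The element names `doc` and
`p` and the two language tags are hypothetical, chosen only for
illustration.  Both paragraphs carry the same code point, U+8349; the
only thing that differs is the xml:lang value, which a renderer might
(or might not) use as a hint for choosing a font.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes xml:lang under the predefined XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Hypothetical document: same character, two different language tags.
doc = ET.fromstring(
    "<doc>"
    '<p xml:lang="ja">&#x8349;</p>'
    '<p xml:lang="zh-TW">&#x8349;</p>'
    "</doc>"
)

paragraphs = doc.findall("p")
for p in paragraphs:
    # The character data is identical; only the hint differs.
    print(p.get(XML_LANG), [hex(ord(c)) for c in p.text])
```

Dropping the attribute changes nothing in the character data itself,
which is the sense in which no information about meaning is lost.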

-- 
John Cowan	http://www.ccil.org/~cowan		cowan at ccil.org
	You tollerday donsk?  N.  You tolkatiff scowegian?  Nn.
	You spigotty anglease?  Nnn.  You phonio saxo?  Nnnn.
		Clear all so!  'Tis a Jute.... (Finnegans Wake 16.5)

xml-dev: A list for W3C XML Developers. To post, mailto:xml-dev at ic.ac.uk
Archived as: http://www.lists.ic.ac.uk/hypermail/xml-dev/
List coordinator, Henry Rzepa (mailto:rzepa at ic.ac.uk)