Unicode, xml:lang, and variant glyphs

Thu Nov 5 17:16:29 GMT 1998

Rick Jelliffe wrote:

> FACT: Many times that someone says two characters are variants and should be
> unified, someone else has used them not as variants. Hence the Unicode
> compatibility area.

Unicode had to be round-trip compatible with many character sets formed
on different principles.  The KSC character sets, for example, encode
some hanja (Chinese characters) more than once if they have more than
one reading, for the sake of making hanja-hangeul conversions easy.
Nobody denies that these are the same *characters*; even their glyphs
are bit for bit the same.
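
As a rough illustration of the round-trip point, here is a small Python
sketch using the standard unicodedata module.  It assumes that U+F900
is one of the compatibility ideographs carried over from the Korean
standard and that U+8C48 is its unified counterpart; the helper names
are mine, not anything from Unicode itself.

```python
import unicodedata

# Assumed pair: a compatibility ideograph and the unified ideograph
# it duplicates.  They are distinct code points with the same glyph.
compat = "\uF900"
unified = "\u8C48"


def describe(ch):
    """Return the code point and Unicode name of one character."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch)}"


def same_character(a, b):
    """Compare two strings after canonical (NFC) normalization."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


print(describe(compat))
print(describe(unified))
print(compat == unified)                 # raw code points differ
print(same_character(compat, unified))   # canonically the same character
```

The duplicate code point exists only so that text converted from the
source character set can be converted back without loss; normalization
treats the two as one character.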

> Oops I meant Russian and Byelorussian (or Kazakh or Ukrainian) where some of
> the national characters have a different form.

I don't know about this.  Are there really glyphic differences?
I know about the character-level differences, like Ukrainian using
GHE WITH UPTURN except for a period from Stalin till a few years
ago, when Ukrainians were forced to use GHE indiscriminately for
both GHE and GHE WITH UPTURN.
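
To show why this is a character-level matter rather than a glyph-level
one, here is a minimal Python sketch.  The Ukrainian letter in question
is named CYRILLIC ... GHE WITH UPTURN in current Unicode data; the
folding table and the example word pair are my own illustration, not
part of any standard.

```python
import unicodedata

# Hypothetical folding table imitating the old orthography:
# GHE WITH UPTURN is replaced by plain GHE, in both cases.
FOLD_GHE = {
    0x0490: 0x0413,  # capital GHE WITH UPTURN -> capital GHE
    0x0491: 0x0433,  # small GHE WITH UPTURN   -> small GHE
}


def fold_ghe(text):
    """Collapse GHE WITH UPTURN into GHE, losing the distinction."""
    return text.translate(FOLD_GHE)


bars = "\u0491\u0440\u0430\u0442\u0438"     # "bars, grating"
to_play = "\u0433\u0440\u0430\u0442\u0438"  # "to play"

print(unicodedata.name("\u0491"))
print(bars == to_play)                      # distinct as plain text
print(fold_ghe(bars) == fold_ghe(to_play))  # identical after folding
```

Two different words become one string after folding, which is what
makes this a plaintext distinction and not a typographic nicety.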

I also know about Polish accents, which are properly placed lower
over the character than similar-looking Western accents.  That
certainly is a glyph difference that fine Polish typography should
take into account, but getting it wrong does not interfere with
*meaning*: it is not a plaintext distinction.  (See below.)

A borderline case is ISO 8859-2's use of S WITH CEDILLA and T WITH
CEDILLA to represent Romanian's S and T WITH COMMA BELOW.  This is
finally being undone, so that Turkish can keep S WITH CEDILLA and
Romanian will get a proper S WITH COMMA BELOW.  (Nobody actually
needs T WITH CEDILLA.)  My *National Geographic* world map uses
S WITH CEDILLA in Romanian place names, but you have to look closely
and compare with Turkish place names to be sure.
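
The following Python sketch shows the two sets of letters side by side.
It assumes Python's iso8859_2 codec and a Unicode database recent enough
to contain separately encoded S and T WITH COMMA BELOW (U+0218 and
U+021A); the helper function is hypothetical.

```python
import unicodedata

# Bytes 0xAA and 0xDE are the capital S and T with cedilla in ISO 8859-2.
latin2_bytes = bytes([0xAA, 0xDE])
decoded = latin2_bytes.decode("iso8859_2")
cedilla_names = [unicodedata.name(ch) for ch in decoded]

# The comma-below letters, assumed here to be separately encoded.
comma_below = "\u0218\u021A"
comma_names = [unicodedata.name(ch) for ch in comma_below]


def combining_marks(text):
    """Decompose text (NFD) and return the names of its combining marks."""
    decomposed = unicodedata.normalize("NFD", text)
    return [unicodedata.name(c) for c in decomposed if unicodedata.combining(c)]


print(cedilla_names)
print(comma_names)
print(combining_marks(decoded))
print(combining_marks(comma_below))
```

Text decoded from ISO 8859-2 therefore arrives with the cedilla
letters whatever language it was written in; telling Romanian from
Turkish usage needs information from outside the character codes.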

> Are you saying that characters carry information, and never glyphs (or
> character + locale + markup)?

No, I am talking about the CJK case specifically.  A unified font
may look ugly, and certainly shouldn't be used for fine typography,
but a language indicator is neither necessary nor sufficient to
solve this problem.

This is not to say that in documents to be finely rendered, an
attribute called "cjkv-typographic-tradition" might not be
useful.
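
A sketch of what that could look like, in Python with the standard
ElementTree parser.  The attribute "cjkv-typographic-tradition" is only
the hypothetical one named above, its values are invented for the
example, and U+76F4 is used because it is an ideograph often cited as
being drawn differently in Chinese and Japanese fonts.

```python
import xml.etree.ElementTree as ET

# ElementTree reports xml:lang under the reserved XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Two paragraphs holding the same unified ideograph, U+76F4.
# The typographic-tradition attribute and its values are hypothetical.
doc = (
    "<doc>"
    '<p xml:lang="ja" cjkv-typographic-tradition="japanese">\u76f4</p>'
    '<p xml:lang="zh-TW" cjkv-typographic-tradition="traditional-chinese">\u76f4</p>'
    "</doc>"
)

root = ET.fromstring(doc)
rows = [
    (p.get(XML_LANG), p.get("cjkv-typographic-tradition"), p.text)
    for p in root.iter("p")
]

for lang, tradition, text in rows:
    print(lang, tradition, f"U+{ord(text):04X}")

# The plain text is the same in both paragraphs; only markup differs.
texts = {text for _, _, text in rows}
print(len(texts) == 1)
```

The language tag and the rendering hint are separate pieces of markup
here, and the character data is identical under both.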

> if it is mathematics, then the font definitely
> carries information that the unified character does not.

Which is why there are a whole bunch of "letterlike symbols" for
math purposes.
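
For instance, the double-struck capitals used for number sets have
their own code points.  A small Python sketch, assuming the standard
unicodedata module; the choice of these three symbols is mine.

```python
import unicodedata

# Double-struck C, R and Z from the Letterlike Symbols block.
symbols = "\u2102\u211D\u2124"

for ch in symbols:
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}")

# Canonical normalization keeps them distinct from ordinary letters.
canonical = unicodedata.normalize("NFC", symbols)

# Compatibility normalization folds them to plain letters,
# discarding the mathematical distinction.
folded = unicodedata.normalize("NFKC", symbols)

print(canonical == symbols)
print(folded)
```

So the mathematical distinction survives in plain text unless one
deliberately applies a compatibility folding.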

> If you have a
> multi-language dictionary or a list of names which requires exactness, the
> font (or markup which selects the font) again is important.

Sure, font is important when it's important.  My claim is confined
to this: that for plain-text purposes, Han unification does not
obscure anything essential.

> "Harder to read" is no criterion at all. If it is harder to read, it is
> because it has lost information.

Au contraire.  The Unicode definition of a "plain text distinction"
is one which is necessary for mere legibility.

-- 
John Cowan	http://www.ccil.org/~cowan		cowan at ccil.org
	You tollerday donsk?  N.  You tolkatiff scowegian?  Nn.
	You spigotty anglease?  Nnn.  You phonio saxo?  Nnnn.
		Clear all so!  'Tis a Jute.... (Finnegans Wake 16.5)
