Unicode, xml:lang, and variant glyphs
Rick Jelliffe
ricko at allette.com.au
Thu Nov 5 19:43:22 GMT 1998
Rick:
> > Are you saying that characters carry information, and never
> > glyphs (or character + locale + markup)?
> From: John Cowan
> No, I am talking about the CJK case specifically. A unified font
> may look ugly, and certainly shouldn't be used for fine typography,
> but a language indicator is neither necessary nor sufficient to
> solve this problem.
But I am not thinking "What is sufficient?"; I am thinking "Is something
being lost here?" and "How nice is the thing being lost?"
If an XML document arrives with an encoding of SJIS in the XML declaration, it
will have been created with a Japanese editor: in the absence of any
information to the contrary, shouldn't it be displayed using Japanese fonts?
And if a document arrives in Big5, shouldn't it, in the absence of anything
else, be displayed using a (presumably traditional, though this is not
clear-cut now) Chinese font?
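
Purely as a sketch of the kind of defaulting I have in mind (the table below
is my own illustration, not anything from a spec, and the mapping is a matter
of policy):

    # Sketch only: choose a default script/font family from the declared
    # encoding when nothing else is known.
    DEFAULT_SCRIPT_BY_ENCODING = {
        "shift_jis": "Japanese",
        "euc-jp":    "Japanese",
        "big5":      "Traditional Chinese",
        "gb2312":    "Simplified Chinese",
        "euc-kr":    "Korean",
    }

    def default_script(declared_encoding):
        # Fall back to the receiver's own locale if the encoding is unhelpful.
        return DEFAULT_SCRIPT_BY_ENCODING.get(declared_encoding.lower(),
                                              "receiver locale default")
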
In XML terms, if there is no xml:lang in effect, and the sender wrote using
a different script variant from the receiver's, mightn't heuristic defaulting
of xml:lang based on the originating character set (and, for example, the
originating country in the URL) be the desired behaviour for some? And if it
is desired in that circumstance, wouldn't it be useful to preserve that
information when cutting and pasting documents or transcluding portions? (The
XML encoding declaration presumably will not survive in the grove of every
document, so I am not sure it could reliably be available in the case of
transcluded data.)
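
Extending the same idea to xml:lang, something like the following (again only
a sketch; the language tags and country-code mappings are my assumptions, and
a real policy would want to be more careful):

    from urllib.parse import urlparse

    LANG_BY_ENCODING = {"shift_jis": "ja", "euc-jp": "ja",
                        "big5": "zh-TW", "gb2312": "zh-CN", "euc-kr": "ko"}
    LANG_BY_COUNTRY = {"jp": "ja", "tw": "zh-TW", "cn": "zh-CN", "kr": "ko"}

    def guess_xml_lang(declared_encoding, source_url):
        # Prefer the encoding; fall back to the country code in the URL;
        # return None rather than guess wildly.
        lang = LANG_BY_ENCODING.get(declared_encoding.lower())
        if lang:
            return lang
        host = urlparse(source_url).hostname or ""
        return LANG_BY_COUNTRY.get(host.rsplit(".", 1)[-1])

    # e.g. guess_xml_lang("utf-8", "http://www.example.co.jp/doc.xml")
    # falls through to the country code and yields "ja".
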
I am very loath to say "everything that you need to know arrives marked up
explicitly" in this particular case. For example, if a Japanese document
arrives in XML, and it was originally encoded in Shift-JIS, then we should
suspect that where there is a backslash character, a yen glyph might be
intended. I know that it would be better to encode the document properly
first, but it seems that a policy of choosing a variant font based on the
sending encoding (or, for that matter, the country in the URL) is just as
legitimate a default policy as simply using the receiver's current-locale
variant font.
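
For instance (a sketch only, using whatever decoder is to hand): in a
Shift-JIS byte stream the code position 0x5C is traditionally shown as a yen
sign by Japanese fonts, but a decoder will normally hand it on as U+005C, the
backslash, so a receiver wanting to honour the likely intent might remap it
after decoding:

    raw = b"YEN: \x5c100"                        # Shift-JIS bytes containing 0x5C
    text = raw.decode("shift_jis")               # decoders give U+005C (backslash)
    display = text.replace("\u005c", "\u00a5")   # show it as the yen sign
    print(display)                               # YEN: ¥100

Of course a blanket substitution like this will also hit any backslash that
really was meant as a backslash, which is exactly why encoding the document
properly in the first place is preferable.
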
> My claim is confined
> to this: that for plain-text purposes, Han unification does not
> obscure anything essential.
As for the plain-text distinction, were laypeople actually tested on this, or
is it the conjecture of scholars who already know all the variants and their
connections (no disrespect intended)? As is the case with Fraktur for English
readers, if you have not been taught the characters you cannot read them, and
once you have been taught them you can no longer serve as a fair test of
whether they are readable untaught.
The plain-text criterion may be good for character-set people. But there is
no reason to assume that preserving minimal readability is a criterion good
enough for documents. I guess this is the PDF-versus-SGML debate writ small:
should fidelity to the originating publication be the policy, or should
rendering be determined by the setup of the receiver? And maybe it is a
content-related thing too: the closer text is to literature or names, the
greater the chance that the sender intends a particular glyph variant for
the character they chose.
Rick Jelliffe