Unicode, xml:lang, and variant glyphs

Thu Nov 5 21:12:34 GMT 1998

Rick Jelliffe wrote:

> If an XML document arrives with an encoding in the XML header of SJIS it will
> have been created on a Japanese editor: in the absence of any information to
> the contrary, shouldn't it be displayed using Japanese fonts?

Perhaps as a heuristic.  But I find it very hard to swallow that
the charset encoding of a document is part of its semantics.
Would you assume that, in the absence of other evidence, a
document in ASCII was in en-US?  And if so, what assumption
would you make about an 8859-1 document?

> And if a
> document arrives in Big Five, shouldn't it be displayed in the absence of
> anything else, using a (presumably traditional but this is not clear cut
> now) Chinese font?

Note:  Contrary to a common assumption, Unicode does *not* unify
simplified hanzi with their traditional counterparts.
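
To make the point concrete, here is a minimal sketch in Python.  The two
character pairs are hand-picked examples, and the helper name
same_after_nfkc is purely illustrative.  It only shows that each simplified
form and its traditional counterpart sit at different code points and are
not folded together by compatibility normalization:

```python
import unicodedata

# Simplified / traditional pairs, written as escapes so the snippet does
# not depend on the encoding of this message.
PAIRS = [
    ("\u56fd", "\u570b"),  # "country": simplified, traditional
    ("\u9a6c", "\u99ac"),  # "horse": simplified, traditional
]


def same_after_nfkc(a: str, b: str) -> bool:
    """Return True if a and b are identical after NFKC normalization."""
    return unicodedata.normalize("NFKC", a) == unicodedata.normalize("NFKC", b)


for simplified, traditional in PAIRS:
    print(
        f"U+{ord(simplified):04X} vs U+{ord(traditional):04X}: "
        f"same after NFKC = {same_after_nfkc(simplified, traditional)}"
    )
```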

> I am very loath to say "everything that you need to know arrives marked-up
> explicitly" in this particular case. For example, if a Japanese document
> arrives in XML, and it was originally encoded in shift-JIS, then we should
> have a suspicion that when there is a backslash character, a Yen glyph might
> be intended.

If it is really encoded in SJIS, then an \x5C byte represents
a yen character, not a backslash, and had better be treated as such
by the application.  Of course, since the document character set of XML
is always ISO 10646, a &#x5C; character reference means a backslash, not
a yen symbol.  Ditto for KSC with a won symbol (U+20A9).
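
A rough sketch of that distinction in Python follows.  The lookup table is
hand-written for illustration and follows the strict reading given above; it
is not taken from any codec library, the "ks-roman" label is a made-up name
for the Korean case, and real converters often map the byte to U+005C
instead.  The character reference, by contrast, is resolved by the XML
parser itself and does not depend on the encoding declaration:

```python
import unicodedata
import xml.etree.ElementTree as ET

# Illustrative table (an assumption, not a library API): what byte 0x5C
# denotes under a strict reading of each encoding.
STRICT_0X5C = {
    "us-ascii": "\u005c",   # REVERSE SOLIDUS
    "shift_jis": "\u00a5",  # YEN SIGN (JIS X 0201 Roman half)
    "ks-roman": "\u20a9",   # WON SIGN (hypothetical label)
}


def byte_5c_as_char(encoding: str) -> str:
    """Character that byte 0x5C stands for under the table above."""
    return STRICT_0X5C[encoding.lower()]


def char_reference_value(ref: str) -> str:
    """Resolve a character reference such as '&#x5C;' with a real parser."""
    return ET.fromstring(f"<r>{ref}</r>").text


for enc in STRICT_0X5C:
    print(enc, unicodedata.name(byte_5c_as_char(enc)))

# The reference always names the 10646 code point, whatever the encoding.
print("&#x5C;", unicodedata.name(char_reference_value("&#x5C;")))
```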

> As far as the plain-text distinction, were laypeople actually tested for
> this, or is it the conjecture of scholars who already know all the variants
> and their connections (no disrespect intended)?

I don't know, as I am not part of the Ideographic Rapporteur Group
and find their documents very hard to follow.

> The plain-text criterion may be good for character-set people. But there is
> no reason to assume that preserving minimal readability is a criterion good
> enough for documents.

No doubt it is not.  The point is that anything that is not a plain
text distinction should be encoded using our favorite markup mechanism:
XML.
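
As one hedged illustration of carrying such a distinction in markup, the
sketch below uses xml:lang, which is inherited by child elements unless
overridden.  The sample document, its element names, and the helper
effective_langs are all invented for the example; the same code point
(U+9AA8) appears twice, and only the markup tells a renderer which
language's glyph conventions to prefer:

```python
import xml.etree.ElementTree as ET

# ElementTree exposes xml:lang under the reserved XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Invented sample: the same character in a Japanese and a Chinese context.
SAMPLE = (
    '<doc xml:lang="ja">'
    "<title>\u9aa8</title>"
    '<quote xml:lang="zh-TW">\u9aa8</quote>'
    "</doc>"
)


def effective_langs(elem, inherited=None):
    """Yield (tag, language) pairs, applying xml:lang inheritance."""
    lang = elem.get(XML_LANG, inherited)
    yield elem.tag, lang
    for child in elem:
        yield from effective_langs(child, lang)


root = ET.fromstring(SAMPLE)
for tag, lang in effective_langs(root):
    print(tag, lang)
```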

> I guess this is the PDF versus SGML debate writ small;
> should fidelity to the originating publication be the policy or should
> rendering be determined by the setup of the receiver?

In the end the receiver always controls: a variant PDF renderer
could exist, although there's no reason for it to.  Fidelity to
the originating publication is a reasonable goal, but requires
reasonable cooperation.

> And maybe it is a
> content-related thing too: the closer text is to literature or names, the
> greater the chance that the sender intends a particular glyph variant for
> the character they chose.

Very true, which is why I am interested to hear about methods for
explicitly encoding variants.

-- 
John Cowan	http://www.ccil.org/~cowan		cowan at ccil.org
	You tollerday donsk?  N.  You tolkatiff scowegian?  Nn.
	You spigotty anglease?  Nnn.  You phonio saxo?  Nnnn.
		Clear all so!  'Tis a Jute.... (Finnegans Wake 16.5)

xml-dev: A list for W3C XML Developers. To post, mailto:xml-dev at ic.ac.uk
Archived as: http://www.lists.ic.ac.uk/hypermail/xml-dev/
To (un)subscribe, mailto:majordomo at ic.ac.uk the following message;
(un)subscribe xml-dev
To subscribe to the digests, mailto:majordomo at ic.ac.uk the following message;
subscribe xml-dev-digest
List coordinator, Henry Rzepa (mailto:rzepa at ic.ac.uk)