Unicode, xml:lang, and variant glyphs
cowan at locke.ccil.org
Thu Nov 5 21:12:34 GMT 1998
Rick Jelliffe wrote:
> If an XML document arrives with an encoding in the XML header of SJIS it will
> have been created on a Japanese editor: in the absence of any information to
> the contrary, shouldn't it be displayed using Japanese fonts?
Perhaps as a heuristic. But I find it very hard to swallow that
the charset encoding of a document is part of its semantics.
Would you assume that, in the absence of other evidence, a
document in ASCII was in en-US? And if so, what assumption
would you make about an 8859-1 document?
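To make the heuristic concrete, here is a minimal Python sketch in which an explicit xml:lang always wins and the encoding label is only a fallback guess. The table contents and the name guess_display_lang are hypothetical, chosen for illustration; ASCII and 8859-1 deliberately yield no guess at all.

```python
from typing import Optional

# Hypothetical fallback table: encoding label -> language guess.
# These entries are illustrative assumptions, not part of any specification.
ENCODING_HINTS = {
    "sjis": "ja",
    "shift_jis": "ja",
    "big5": "zh-TW",  # "presumably traditional", itself an assumption
}

def guess_display_lang(xml_lang: Optional[str], encoding: str) -> Optional[str]:
    """Return a language tag for font selection, or None if nothing is known."""
    if xml_lang:
        # Explicit markup is the only real evidence; it always takes precedence.
        return xml_lang
    # The encoding is at best a hint. ASCII and ISO-8859-1 are not in the
    # table, so they fall through to None rather than to en-US.
    return ENCODING_HINTS.get(encoding.strip().lower())
```

With this shape, the encoding never overrides anything the author actually said; it only fills a gap, and for encodings shared by many languages it fills nothing.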
> And if a
> document arrives in Big Five, shouldn't it be displayed in the absence of
> anything else, using a (presumably traditional but this is not clear cut
> now) Chinese font?
Note: Contrary to a common assumption, Unicode does *not* unify
simplified hanzi with their traditional counterparts.
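A quick way to check this is to compare one simplified/traditional pair. The sketch below uses the character for "country" as an example pair of my own choosing and shows that the two forms have distinct code points that normalization does not merge.

```python
import unicodedata

simplified = "\u56fd"   # simplified form of "country"
traditional = "\u570b"  # traditional form of the same word

# Two separate code points, not one unified character.
print(f"U+{ord(simplified):04X} U+{ord(traditional):04X}")

# Compatibility normalization leaves both untouched, so they stay distinct.
same_after_nfkc = (
    unicodedata.normalize("NFKC", simplified)
    == unicodedata.normalize("NFKC", traditional)
)
print(same_after_nfkc)
```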
> I am very loath to say "everything that you need to know arrives marked-up
> explicitly" in this particular case. For example, if a Japanese document
> arrives in XML, and it was originally encoded in shift-JIS, then we should
> have a suspicion that when there is a backslash character, a Yen glyph might
> be intended.
If it is really encoded in SJIS, then an \x5C byte represents
a yen character, not a backslash, and had better be treated as such
by the application. Of course, since the document character set is
always 10646, a &#x5C; character reference means a backslash, not a
yen symbol. Ditto for KSC with a won symbol (U+20A9).
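The distinction between a raw byte and a character reference can be sketched in Python. The table BYTE_5C_MEANING and both function names are hypothetical, and the table simply encodes the reading given above; real codec libraries disagree about what byte 0x5C means in Shift-JIS, so this is not a description of any particular library.

```python
import xml.etree.ElementTree as ET

# Hypothetical table: what a raw 0x5C byte denotes under each encoding,
# following the reading above. The labels are illustrative, not registered names.
BYTE_5C_MEANING = {
    "sjis": "\u00a5",  # YEN SIGN
    "ksc": "\u20a9",   # WON SIGN
}

def char_for_byte_5c(encoding: str) -> str:
    """Character that a raw 0x5C byte stands for in the given encoding."""
    # Anything not in the table is treated as ASCII-compatible here.
    return BYTE_5C_MEANING.get(encoding, "\u005c")  # REVERSE SOLIDUS

def char_for_reference(ref: str) -> str:
    """Resolve a numeric character reference by parsing a one-element document."""
    # Character references name ISO 10646 code points, so the result does
    # not depend on the encoding the document was stored in.
    return ET.fromstring(f"<r>{ref}</r>").text
```

So char_for_byte_5c("sjis") gives the yen sign, while char_for_reference("&#x5C;") gives a backslash no matter how the document was encoded.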
> As far as the plain-text distinction, were laypeople actually tested for
> this, or is it the conjecture of scholars who already know all the variants
> and their connections (no disrespect intended)?
I don't know, as I am not part of the Ideographic Rapporteur Group
and find their documents very hard to follow.
> The plain-text criterion may be good for character-set people. But there is
> no reason to assume that preserving minimal readability is a criterion good
> enough for documents.
No doubt it is not. The point is that anything that is not a plain
text distinction should be encoded using our favorite markup mechanism:
> I guess this is the PDF versus SGML debate writ small;
> should fidelity to the originating publication be the policy or should
> rendering be determined by the setup of the receiver.
In the end the receiver always controls: a variant PDF renderer
could exist, although there's no reason for it to. Fidelity to
the originating publication is a reasonable goal, but requires
> And maybe it is a
> content-related thing too: the closer text is to literature or names, the
> greater the chance that the sender intends a particular glyph variant for
> the character they chose.
Very true, which is why I am interested to hear about methods for
explicitly encoding variants.
John Cowan http://www.ccil.org/~cowan cowan at ccil.org
You tollerday donsk? N. You tolkatiff scowegian? Nn.
You spigotty anglease? Nnn. You phonio saxo? Nnnn.
Clear all so! 'Tis a Jute.... (Finnegans Wake 16.5)
xml-dev: A list for W3C XML Developers. To post, mailto:xml-dev at ic.ac.uk
Archived as: http://www.lists.ic.ac.uk/hypermail/xml-dev/