Non-Unicode Character Sets
roddey at us.ibm.com
Mon Jan 31 22:56:41 GMT 2000
>I am told that conversion of some character sets through Unicode is
>lossy and cannot be round-tripped. But it occurs to me that as long as
>one has the private use area, "unknown" characters can always be
>preserved. If a particular mapping loses information, isn't that more a
>weakness in the mapping than in Unicode itself? Are there some
>standardized national character sets with so many non-Unicode characters
>that they cannot fit into the PUA? Even with planes 15 and 16?
Don't know the answer to that, but just as a related aside...
In some cases the problem isn't round-tripping, it's 'half-tripping', due to
the weird design of the encoding. For instance, we have had problems with
some Japanese and Korean encodings because a single code point is ambiguous
between the backslash and the Yen sign. When you transcode that code point
to Unicode, you have to know the context of the text being transcoded in
order to know which translation is the correct one. If you transcode it to
the Yen sign and then pass that text to, say, a 'file open' Unicode API on a
system that is inherently Unicode enabled, it breaks, because the Unicode
Yen sign probably isn't a legal path separator on that platform. If you
transcode it to backslash and the text was a monetary value, then it is
incorrect in its Unicode incarnation as well.
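As a rough illustration of the half-trip problem, here is a toy Python sketch. It is not how any real converter is written: decode_legacy and its meaning_of_5c parameter are made up for this example, and it only handles 7-bit input. The point is just that the caller has to guess what byte 0x5C means, and either guess is wrong for some text.

```python
# Toy model of the ambiguity: in the legacy encoding, byte 0x5C may mean
# either a backslash or a Yen sign. All names here are hypothetical.
BACKSLASH = "\u005c"  # U+005C REVERSE SOLIDUS
YEN = "\u00a5"        # U+00A5 YEN SIGN


def decode_legacy(data: bytes, meaning_of_5c: str) -> str:
    """Decode 7-bit legacy text; the caller must say what 0x5C means."""
    if meaning_of_5c not in (BACKSLASH, YEN):
        raise ValueError("0x5C must map to U+005C or U+00A5")
    if any(b >= 0x80 for b in data):
        raise ValueError("this sketch only handles 7-bit input")
    return "".join(meaning_of_5c if b == 0x5C else chr(b) for b in data)


path_bytes = b"C:\\temp\\a.txt"   # here 0x5C is meant as a path separator
price_bytes = b"\\1000"           # here 0x5C is meant as a currency sign

# Wrong guess for the path: no U+005C is left, so splitting on the
# path separator no longer finds the components.
path_as_yen = decode_legacy(path_bytes, YEN)
print(path_as_yen.split(BACKSLASH))

# Wrong guess for the price: the currency sign has become a backslash.
price_as_backslash = decode_legacy(price_bytes, BACKSLASH)
print(price_as_backslash.startswith(YEN))
```

Nothing in the bytes themselves says which call is the right one; that knowledge lives only in whoever produced the text.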
If you round-trip it, it's probably OK, because both Unicode code points get
translated back to the single, ambiguous code point, and the text is then
processed by an API that knows it's dealing with this situation and can use
its context sensitivity to do the right thing (i.e. the file open knows
what that ambiguous code point means in that situation).
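The round-trip case can be sketched the same way. Again this is only a toy model with made-up names (decode_legacy, encode_legacy), assuming the reverse mapping sends both U+005C and U+00A5 back to byte 0x5C, as described above:

```python
# Toy model: both Unicode code points collapse back to the one legacy byte.
# All function names are hypothetical; only 7-bit input is handled.
BACKSLASH = "\u005c"  # U+005C REVERSE SOLIDUS
YEN = "\u00a5"        # U+00A5 YEN SIGN


def decode_legacy(data: bytes, meaning_of_5c: str) -> str:
    """Decode 7-bit legacy text, mapping 0x5C to the chosen code point."""
    return "".join(meaning_of_5c if b == 0x5C else chr(b) for b in data)


def encode_legacy(text: str) -> bytes:
    """Encode back to legacy bytes; U+005C and U+00A5 both become 0x5C."""
    out = bytearray()
    for ch in text:
        if ch in (BACKSLASH, YEN):
            out.append(0x5C)
        elif ord(ch) < 0x80:
            out.append(ord(ch))
        else:
            raise ValueError(f"no legacy mapping in this sketch for {ch!r}")
    return bytes(out)


original = b"C:\\temp"
as_backslash = decode_legacy(original, BACKSLASH)
as_yen = decode_legacy(original, YEN)

# The two Unicode strings differ, but both return to the same bytes.
print(as_backslash != as_yen)
print(encode_legacy(as_backslash) == original)
print(encode_legacy(as_yen) == original)
```

So the bytes survive the round trip whichever guess was made; it is only while the text sits in Unicode that the guess matters.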
It's all due to a psycho encoding design, I guess, which could mostly be
dealt with as long as the code handling it was specific to that locale and
worked with the text in the original encoding. But once you move to a
Unicode world and have to choose between the two Unicode code points to
transcode to, it gets weird, and I don't see how it could really be made to
work consistently, since no one is going to write entire software systems
that carry context information around with the text wherever it goes.
If some of you folks who deal with these encodings think I'm just confused,
please say so. But this is the best we can figure out with these types of
encodings.
----------------------------------------
Dean Roddey
Software Weenie
IBM Center for Java Technology - Silicon Valley
roddey at us.ibm.com
xml-dev: A list for W3C XML Developers. To post, mailto:xml-dev at ic.ac.uk
Archived as: http://www.lists.ic.ac.uk/hypermail/xml-dev/ or CD-ROM/ISBN 981-02-3594-1
Please note: New list subscriptions and unsubscriptions
are now ***CLOSED*** in preparation for list transfer to OASIS.