Unicode confusion

roddey at us.ibm.com
Mon Jan 10 19:44:56 GMT 2000

>> If anything, it should go the other way. Unicode should be the core
>> API, and there should be helper API to allow the use of local code
>> page chars where necessary. Everything should be set up to optimize
>> use of the Unicode API, with local code page use paying the price,
>> since Unicode is the more desirable format.
>No one's disagreeing with the use of Unicode; we're talking about
>which character encoding we'll use to represent it.  You can represent
>Unicode in variable-width 8-bit or 16-bit encodings or in fixed-width
>32-bit encodings.
>Note that Java uses UTF-16, which isn't quite fixed-width, though no
>one really notices.

Our parser already adapts to whether the native wchar_t is 16 or 32 bits,
though it still uses surrogates and stores 16-bit code units in the 32-bit
values when it's 32 bits. However, it could also pretty reasonably adapt to
not using surrogates if the local wchar_t is 32 bits. I guess it comes down
to whatever the local system's wide character APIs expect. If they expect
32-bit values without surrogates, then it would be kind of necessary to give
them that. If they expect 16-bit code units with surrogates, regardless of
the fact that the wchar_t is 32 bits, then it would be best to give them
that.
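
To make the two schemes concrete, here is a rough sketch of what the
internalization step might look like. This is not the actual Xerces/XML4C
code; the function name internalize and the combineSurrogates flag are made
up for illustration. It copies UTF-16 code units into wchar_t storage and
only merges surrogate pairs when asked to and when wchar_t is at least 32
bits wide.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper (not part of Xerces/XML4C): copies UTF-16 code units
// into wchar_t storage. If combineSurrogates is true and wchar_t has at
// least 32 bits, each valid surrogate pair becomes a single code point.
// Otherwise every 16-bit unit is stored unchanged, one per wchar_t.
std::vector<wchar_t> internalize(const std::vector<std::uint16_t>& src,
                                 bool combineSurrogates)
{
    const bool canCombine = combineSurrogates && sizeof(wchar_t) >= 4;
    std::vector<wchar_t> out;
    out.reserve(src.size());

    for (std::size_t i = 0; i < src.size(); ++i)
    {
        const std::uint32_t hi = src[i];
        const bool isHigh = hi >= 0xD800 && hi <= 0xDBFF;
        if (canCombine && isHigh && i + 1 < src.size())
        {
            const std::uint32_t lo = src[i + 1];
            if (lo >= 0xDC00 && lo <= 0xDFFF)
            {
                // Standard UTF-16 decoding of a surrogate pair.
                const std::uint32_t cp =
                    0x10000 + ((hi - 0xD800) << 10) + (lo - 0xDC00);
                out.push_back(static_cast<wchar_t>(cp));
                ++i; // the low surrogate has been consumed as well
                continue;
            }
        }
        // Surrogates are being kept, or this is an ordinary (or unpaired) unit.
        out.push_back(static_cast<wchar_t>(hi));
    }
    return out;
}
```

With combineSurrogates set to false this matches what is described above:
16-bit units stored one per wchar_t, whatever its width. Unpaired surrogates
are passed through untouched in this sketch; a real parser would probably
want to report them.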

Going this far would require some support in parsers that might not be
common, but I think that we could do that reasonably in the Xerces/XML4C
stuff without too much pulling out of hair or added complexity. The
internalization of text into the local format is pretty constrained. The
big issue, though, is that you are kind of dependent upon what transcoding
package you use. For those encodings that we handle intrinsically, we could
do this well enough. But we allow each platform to use its own transcoding
mechanism if it chooses to, and that mechanism is probably going to support
only one scheme or the other. Hopefully it would support the local scheme,
but you could also choose to use some portable package such as ICU, which is
going to do one thing.

So, perhaps the question is: Are there any systems out there which use 32
bit wchar_t *and* expect that surrogates will not be used?
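
One partial way to probe this from code, offered only as a sketch: the C99
macro __STDC_ISO_10646__, where an implementation defines it, says that
wchar_t values are ISO 10646 code points, which together with a 32-bit
wchar_t suggests no surrogates are expected. The function name wcharScheme
is made up, and the absence of the macro proves nothing either way.

```cpp
#include <string>

// Hypothetical probe: gives a rough description of how wchar_t is likely
// used on the platform this is compiled for. It is a hint, not a guarantee.
std::string wcharScheme()
{
    if (sizeof(wchar_t) < 4)
        return "narrow wchar_t, surrogates needed";
#if defined(__STDC_ISO_10646__)
    // The implementation advertises that wchar_t holds ISO 10646 code points.
    return "32-bit code points, no surrogates";
#else
    return "wide wchar_t, encoding not advertised";
#endif
}
```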

Dean Roddey
Software Weenie
IBM Center for Java Technology - Silicon Valley
roddey at us.ibm.com
