UTF-8 vs UTF-16...?
Kragen Sitaker
kragen at pobox.com
Wed Nov 17 18:09:58 GMT 1999
According to the latest Unicode book (is it version 2.0? Or 3.0?),
UTF-8 does not allow you to encode more than the first 17 planes of ISO
10646. If I remember correctly, the formats are as follows (omitting
leading zero bits in the output):
one byte:
0xxxxxxx -> xxxxxxx
two bytes:
110yyyyy 10xxxxxx -> yyy yyxxxxxx
three bytes:
1110zzzz 10yyyyyy 10xxxxxx -> zzzzyyyy yyxxxxxx
four bytes:
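11110uuu 10uuzzzz 10yyyyyy 10xxxxxx -> uuuuu zzzzyyyy yyxxxxxx
where uuuuu is the plane number, at most 10000 binary. (These characters
are encoded with surrogate pairs in UTF-16, where the high surrogate
carries wwww = uuuuu - 1 in place of uuuuu.) I may be mistaken about
this one; my book is at home.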
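To make the four layouts concrete, here is a minimal sketch in Python. The
function names utf8_encode and utf16_surrogates are mine, purely for
illustration, and I am assuming the bit layouts as I understand them from the
Unicode tables: the four-byte form carries the code point's bits directly, and
the offset by one shows up only in the UTF-16 high surrogate. The result can be
checked against Python's own codecs.

```python
def utf8_encode(cp: int) -> bytes:
    """Encode one code point with the one- to four-byte UTF-8 forms."""
    if cp < 0 or cp > 0x10FFFF:
        raise ValueError("outside the 17 planes that UTF-8 covers")
    if 0xD800 <= cp <= 0xDFFF:
        raise ValueError("surrogate code points are not encoded in UTF-8")
    if cp < 0x80:        # 0xxxxxxx
        return bytes([cp])
    if cp < 0x800:       # 110yyyyy 10xxxxxx
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp < 0x10000:     # 1110zzzz 10yyyyyy 10xxxxxx
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    # 11110uuu 10uuzzzz 10yyyyyy 10xxxxxx
    return bytes([0xF0 | (cp >> 18),
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])


def utf16_surrogates(cp: int) -> tuple[int, int]:
    """Split a code point above 0xFFFF into a UTF-16 surrogate pair."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("only planes 1 through 16 use surrogate pairs")
    v = cp - 0x10000     # this subtraction is where wwww = uuuuu - 1 comes from
    return 0xD800 | (v >> 10), 0xDC00 | (v & 0x3FF)


if __name__ == "__main__":
    cp = 0x10400
    print(utf8_encode(cp).hex(" "))
    print([hex(unit) for unit in utf16_surrogates(cp)])
```

The wwww field sits in bits 6 through 9 of the high surrogate, so
(high >> 6) & 0xF should always come out one less than cp >> 16.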
No five-byte or longer sequences are listed. No valid sequences
starting with more than four ones are listed. Presumably these two
omissions correspond, and an extended UTF-8 with these additions would
allow you to handle larger character sets.
It may be that other standards actually specify such an extended UTF-8.
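As far as I know, RFC 2279 and ISO 10646's own UTF-8 annex do describe such
an extension, with five-byte sequences starting 111110xx and six-byte
sequences starting 1111110x, reaching 31 bits. The following is only a sketch
of that scheme under that assumption; extended_utf8_encode is a made-up name,
and it skips the surrogate check for brevity.

```python
def extended_utf8_encode(cp: int) -> bytes:
    """Encode cp with the generalized scheme: up to six bytes, 31 bits."""
    if cp < 0 or cp > 0x7FFFFFFF:
        raise ValueError("needs more than 31 bits")
    if cp < 0x80:
        return bytes([cp])
    # (exclusive upper limit, total bytes, lead-byte prefix)
    forms = (
        (0x800, 2, 0xC0),
        (0x10000, 3, 0xE0),
        (0x200000, 4, 0xF0),
        (0x4000000, 5, 0xF8),    # 111110xx, not in the Unicode book's table
        (0x80000000, 6, 0xFC),   # 1111110x, not in the Unicode book's table
    )
    for limit, length, prefix in forms:
        if cp < limit:
            break
    # Peel off six bits at a time, lowest first, for the trailing bytes.
    tail = []
    for _ in range(length - 1):
        tail.append(0x80 | (cp & 0x3F))
        cp >>= 6
    # Whatever bits remain go into the lead byte after its prefix.
    return bytes([prefix | cp] + tail[::-1])


if __name__ == "__main__":
    print(extended_utf8_encode(0x200000).hex(" "))
    print(extended_utf8_encode(0x7FFFFFFF).hex(" "))
```

For values up to four bytes this produces the same bytes as the forms listed
earlier; only the last two rows go beyond them.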
So "bigger character range" is probably not a valid reason for wanting
to use UTF-8 -- quite aside from the question of whether you really
need more than the million or so characters UTF-16 can encode --
because UTF-8 decoders implemented according to Unicode's spec will
choke if you try to encode bigger characters in it.
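A quick way to see the choking, using Python 3's built-in decoder as a stand-in
for "a decoder implemented according to Unicode's spec" (that substitution is
my assumption, and strict_decode_ok is just a helper name I made up). The same
snippet also works out the "million or so" figure for UTF-16.

```python
def strict_decode_ok(data: bytes) -> bool:
    """Return True if Python's strict UTF-8 decoder accepts the bytes."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# A five-byte sequence, as an extended UTF-8 would write 0x200000.
FIVE_BYTES = bytes([0xF8, 0x88, 0x80, 0x80, 0x80])

# A four-byte sequence whose value is 0x110000, just past plane 16.
PAST_PLANE_16 = bytes([0xF4, 0x90, 0x80, 0x80])

# The last code point of plane 16, 0x10FFFF.
LAST_VALID = bytes([0xF4, 0x8F, 0xBF, 0xBF])

# What UTF-16 can reach: the BMP minus the 2048 surrogate code units,
# plus 1024 * 1024 surrogate pairs.
UTF16_CHARACTERS = (0x10000 - 0x800) + 0x400 * 0x400

if __name__ == "__main__":
    print(strict_decode_ok(FIVE_BYTES))
    print(strict_decode_ok(PAST_PLANE_16))
    print(strict_decode_ok(LAST_VALID))
    print(UTF16_CHARACTERS)
```

So the two encodings stop at the same place, and anything encoded beyond that
point is an error to a strict decoder rather than a bigger character.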
--
<kragen at pobox.com> Kragen Sitaker <http://www.pobox.com/~kragen/>
The Internet stock bubble didn't burst on 1999-11-08. Hurrah!
<URL:http://www.pobox.com/~kragen/bubble.html>
xml-dev: A list for W3C XML Developers. To post, mailto:xml-dev at ic.ac.uk
Archived as: http://www.lists.ic.ac.uk/hypermail/xml-dev/ and on CD-ROM/ISBN 981-02-3594-1