UTF-8 vs UTF-16...?

Wed Nov 17 18:09:58 GMT 1999

According to the latest Unicode book (is it version 2.0?  Or 3.0?),
UTF-8 does not allow you to encode more than the first 17 planes of ISO
10646.  If I remember correctly, the formats are (omitting leading
zero bits in the output):

one byte:
0xxxxxxx -> xxxxxxx
two bytes:
110yyyyy 10xxxxxx -> yyy yyxxxxxx
three bytes:
1110zzzz 10yyyyyy 10xxxxxx -> zzzzyyyy yyxxxxxx
four bytes:
11110uuu 10uuzzzz 10yyyyyy 10xxxxxx -> uuuuu zzzzyyyy yyxxxxxx
(These characters are encoded with surrogate pairs in UTF-16, where the
high surrogate carries wwww = uuuuu - 1 rather than uuuuu itself.)  I
may be mistaken about this one; my book is at home.
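
To make the bit layouts above concrete, here is a rough Python sketch
of an encoder for the one- to four-byte forms, plus the UTF-16
surrogate split for comparison.  The names encode_utf8 and
utf16_surrogates are my own, invented for illustration; they are not
taken from the Unicode book or from any library.

```python
def encode_utf8(cp: int) -> bytes:
    """Encode one code point with the one- to four-byte forms (hypothetical helper)."""
    if cp < 0 or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError("not encodable in the four listed forms: %#x" % cp)
    if cp < 0x80:
        # 0xxxxxxx
        return bytes([cp])
    if cp < 0x800:
        # 110yyyyy 10xxxxxx
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp < 0x10000:
        # 1110zzzz 10yyyyyy 10xxxxxx
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    # 11110uuu 10uuzzzz 10yyyyyy 10xxxxxx
    return bytes([0xF0 | (cp >> 18),
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])


def utf16_surrogates(cp: int) -> tuple:
    """Split a code point above U+FFFF into (high, low) surrogates (hypothetical helper)."""
    v = cp - 0x10000          # this subtraction is where "uuuuu - 1" comes from
    return 0xD800 | (v >> 10), 0xDC00 | (v & 0x3FF)


if __name__ == "__main__":
    for cp in (0x41, 0xE9, 0x20AC, 0x10400):
        print("U+%04X" % cp, encode_utf8(cp).hex())
    high, low = utf16_surrogates(0x10400)
    print("surrogates: %04X %04X" % (high, low))
```

For U+10400 the top five bits uuuuu are 00001, while the four bits
stored in the high surrogate are 0000, which is the off-by-one between
the two encodings.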

No five-byte or longer sequences are listed.  No valid sequences
starting with more than four ones are listed.  Presumably these two
omissions correspond, and an extended UTF-8 with these additions would
allow you to handle larger character sets.

It may be that other standards actually specify such an extended UTF-8;
RFC 2279, for one, describes sequences of up to six bytes.

So "bigger character range" is probably not a valid reason for wanting
to use UTF-8 -- quite aside from the question of whether you really
need more than the million or so characters UTF-16 can encode --
because UTF-8 decoders implemented according to Unicode's spec will
choke if you try to encode bigger characters in it.
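
As a quick way to see a decoder choke, here is a small Python check.
It assumes a present-day Python 3, whose built-in "utf-8" codec follows
the strict definition and stops at U+10FFFF; the helper name
is_valid_utf8 is made up for this example, and other decoders may
behave differently.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if Python's strict UTF-8 codec accepts the bytes (hypothetical helper)."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# Highest code point reachable with the four-byte form: U+10FFFF.
last_valid = bytes([0xF4, 0x8F, 0xBF, 0xBF])

# A four-byte sequence that would decode to U+110000, past plane 16.
past_plane_16 = bytes([0xF4, 0x90, 0x80, 0x80])

# A five-byte sequence in the "extended" style (lead byte 111110xx).
five_bytes = bytes([0xF8, 0x88, 0x80, 0x80, 0x80])

if __name__ == "__main__":
    for label, seq in (("last valid", last_valid),
                       ("past plane 16", past_plane_16),
                       ("five bytes", five_bytes)):
        print(label, seq.hex(), is_valid_utf8(seq))
```

Only the first sequence is accepted; the other two are rejected rather
than being decoded as bigger characters.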

-- 
<kragen at pobox.com>       Kragen Sitaker     <http://www.pobox.com/~kragen/>
The Internet stock bubble didn't burst on 1999-11-08.  Hurrah!
<URL:http://www.pobox.com/~kragen/bubble.html>
