UTF-8 vs UTF-16...? (Was: Feeling good about SML)

Wed Nov 17 15:53:18 GMT 1999

At 17 Nov 1999 14:29 GMT, Steve Schafer wrote:
 > On 17 Nov 1999 13:24:27 +0100, you wrote:
 > 
 > >Not sure if I understand the UTF-16 bit above, but I'm reading this:
 > >        <URL:http://www.unicode.org/unicode/faq/#UTF-16 and UCS-4>
 > >to UTF-16 being able to represent the full UCS-4, which is what you
 > >say UTF-8 can do, if I interpret you correctly...?
 > 
 > Section C.3 of the Unicode 2.0 spec, paragraph 4:
 > 
 > "UTF-16 does not support the representation of all the UCS-4 code
 > space but is limited to the BMP and the next 16 planes...."

True, but that's more code values than anybody expects to ever
standardise (although that's the opinion of the same people that
thought that they'd never need more than the BMP).
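
For anyone wondering where "the BMP and the next 16 planes" comes from,
here is a rough sketch of the surrogate-pair arithmetic in Python.  The
function and constant names are mine, purely for illustration; only
the surrogate ranges (D800-DBFF and DC00-DFFF) come from the standard.

```python
# Sketch of UTF-16 surrogate-pair arithmetic; all names are illustrative.

HIGH_START, LOW_START = 0xD800, 0xDC00   # first high / low surrogate code units
PLANE_SIZE = 0x10000                     # 65,536 code positions per plane


def to_surrogates(cp):
    """Split a code point beyond the BMP into a (high, low) surrogate pair."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("code point is not in planes 1 to 16")
    offset = cp - 0x10000                # a 20-bit value
    return HIGH_START + (offset >> 10), LOW_START + (offset & 0x3FF)


def from_surrogates(high, low):
    """Recombine a surrogate pair into the code point it stands for."""
    return 0x10000 + ((high - HIGH_START) << 10) + (low - LOW_START)


# 1,024 high surrogates times 1,024 low surrogates, counted in whole planes.
planes_beyond_bmp = (1024 * 1024) // PLANE_SIZE

print(planes_beyond_bmp)
print([hex(unit) for unit in to_surrogates(0x10000)])
print([hex(unit) for unit in to_surrogates(0x10FFFF)])
```

The pairs give 1,024 x 1,024 = 1,048,576 extra code positions, which is
exactly 16 planes of 65,536, so the highest value UTF-16 can reach is
10FFFF.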

All of the currently defined Unicode and ISO/IEC 10646 characters
(both standards define the same characters) are in the BMP.  It won't
be long until characters are defined in Plane 1 and Plane 2 (with
possible spill-over into Plane 3), and Planes 15 and 16 are reserved
for private use.

Currently the only things defined beyond Plane 16 of Group 00 (i.e.
beyond the characters addressable with UTF-16) are more areas
available for private use.

The fuss over UTF-8 or UTF-16 is over the number of bytes used to
represent the characters in the BMP, i.e. the currently defined
characters.  UTF-16 uses two bytes for every BMP character.  UTF-8
uses one byte per character for the ASCII characters, two bytes per
character for the next 1,920 code positions (U+0080 to U+07FF), and
three bytes per character for the rest of the BMP, which is most of
it.  Both UTF-8 and UTF-16 use four bytes per character to represent
the characters in planes 1 to 16.

(There's also UTF-32, which is four bytes per character for all the
characters that you can represent with UTF-16.)
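
If you want to check the byte counts yourself, something like the
following Python sketch will do it.  The helper name and the choice of
sample code points are mine; the codec names are the ones Python ships
with, and I use the big-endian variants so that no byte order mark is
counted.

```python
# Count the bytes each encoding form needs for a single code point.
# bytes_per_character is an illustrative helper, not a library function.

CODECS = {"UTF-8": "utf-8", "UTF-16": "utf-16-be", "UTF-32": "utf-32-be"}


def bytes_per_character(cp):
    """Return {encoding form: byte count} for the code point cp."""
    ch = chr(cp)
    return {name: len(ch.encode(codec)) for name, codec in CODECS.items()}


# Code points on either side of each UTF-8 length boundary.
SAMPLES = [0x41, 0x7F, 0x80, 0x7FF, 0x800, 0xFFFF, 0x10000, 0x10FFFF]

for cp in SAMPLES:
    sizes = bytes_per_character(cp)
    print("U+%06X  UTF-8: %d  UTF-16: %d  UTF-32: %d"
          % (cp, sizes["UTF-8"], sizes["UTF-16"], sizes["UTF-32"]))
```

UTF-8 steps up from one to two to three bytes inside the BMP, UTF-16
stays at two until the BMP runs out, and UTF-32 is four throughout.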

UTF-8 is efficient if you use a lot of ASCII, e.g. if you're an
English speaker and all you use is ASCII, but it's more bytes per
character than UTF-16 for a whole lot of other scripts (plus it's more
bytes per character than a lot of current script-specific encodings).
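
As a rough illustration, here is a Python sketch that measures three
short sample strings.  The strings and the pairing of each one with a
script-specific encoding (ISO 8859-7 for Greek, Shift_JIS for
Japanese) are my own picks, not anything from the standards, and three
words are hardly a benchmark.

```python
# Compare encoded sizes of short samples; the samples and names are illustrative.


def encoded_sizes(text, legacy_codec):
    """Return byte counts for text in UTF-8, UTF-16 and a legacy codec."""
    return {
        "utf-8": len(text.encode("utf-8")),
        "utf-16": len(text.encode("utf-16-be")),   # big-endian, so no BOM
        "legacy": len(text.encode(legacy_codec)),
    }


# (label, text, script-specific codec); escapes keep this listing pure ASCII.
SAMPLES = [
    ("English", "Hello, world", "ascii"),
    # The Greek word for "Greek", eight letters.
    ("Greek", "\u0395\u03bb\u03bb\u03b7\u03bd\u03b9\u03ba\u03ac", "iso8859_7"),
    # The Japanese word for "Japanese", three kanji.
    ("Japanese", "\u65e5\u672c\u8a9e", "shift_jis"),
]

for label, text, legacy in SAMPLES:
    sizes = encoded_sizes(text, legacy)
    print("%-9s %2d chars  UTF-8: %2d  UTF-16: %2d  %s: %2d"
          % (label, len(text), sizes["utf-8"], sizes["utf-16"],
             legacy, sizes["legacy"]))
```

The English sample is half the size in UTF-8 that it is in UTF-16, the
Greek one is the same size in both but half that in ISO 8859-7, and
the Japanese one is nine bytes in UTF-8 against six in either UTF-16
or Shift_JIS.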

So the issue isn't how many characters the different encodings can
represent, but how efficiently (or how uniformly) they represent the
currently defined characters.

Regards,

Tony Graham
======================================================================
Tony Graham                            mailto:tgraham at mulberrytech.com
Mulberry Technologies, Inc.                http://www.mulberrytech.com
17 West Jefferson Street                    Direct Phone: 301/315-9632
Suite 207                                          Phone: 301/315-9631
Rockville, MD  20850                                 Fax: 301/315-8285
----------------------------------------------------------------------
  Mulberry Technologies: A Consultancy Specializing in SGML and XML
======================================================================
