UTF-8 or ? for SML (was: Re: Feeler for SML (Simple Markup Language))

Tony Graham tgraham at mulberrytech.com
Sat Nov 13 19:13:22 GMT 1999

At 13 Nov 1999 15:46 -0000, Richard Anderson wrote:
 > But UTF-8 can support "foreign" characters so I don't see the argument for
 > having UTF-16 too.  Also, generally speaking UTF-8 encoding results in
 > smaller output for most cases.

Different people have different ideas of what constitutes "foreign".

For the majority of the characters in the Unicode Standard, UTF-8 uses
three bytes per character.  However, for the US-ASCII characters, it
uses only one byte per character.

For all characters in the Unicode Standard, UTF-16 uses two bytes per
character.

Whether a given file takes fewer bytes as UTF-8 or as UTF-16 is largely
a function of the proportion of unaccented Latin characters in the file.
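
The size trade-off can be checked directly. What follows is a minimal
sketch in Python, assuming its built-in codecs; the sample strings and
the encoded_sizes helper are made up for illustration only.  The
"utf-16-le" codec is used so that no byte order mark is counted.

```python
# Compare the encoded size of the same text in UTF-8 and UTF-16.
# The sample strings are illustrative assumptions, not taken from any file.
samples = {
    "ascii": "markup",
    "accented": "\u00e9l\u00e8ve",  # "eleve" with e-acute and e-grave
    "japanese": "日本語",
}


def encoded_sizes(text):
    """Return (utf8_bytes, utf16_bytes) for text, without a byte order mark."""
    return len(text.encode("utf-8")), len(text.encode("utf-16-le"))


for name, text in samples.items():
    utf8, utf16 = encoded_sizes(text)
    print(f"{name:9} chars={len(text)} utf-8={utf8} utf-16={utf16}")
```

The ASCII sample comes out at 6 bytes in UTF-8 against 12 in UTF-16,
the accented sample at 7 against 10, and the Japanese sample at 9
against 6, so which encoding wins depends on the mix of characters.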

Moreover, most legacy encodings for a single script use one byte per
character, although Chinese, Japanese, and Korean encodings use two or
more bytes per character.  UTF-8, therefore, isn't as efficient as the
legacy encodings of most scripts.  (Its advantage is that it can
represent more scripts than any legacy encoding.)
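
The comparison with legacy encodings can be sketched the same way.
This assumes Python's built-in "koi8_r", "tis_620", and "shift_jis"
codecs as stand-ins for single-script legacy encodings; the sample
words and the size_in helper are illustrative assumptions.

```python
# Compare a legacy single-script encoding with UTF-8 for the same text.
# Each entry pairs a sample word with a legacy codec for its script.
samples = [
    ("Russian", "язык", "koi8_r"),
    ("Thai", "ภาษา", "tis_620"),
    ("Japanese", "日本語", "shift_jis"),
]


def size_in(text, encoding):
    """Return the number of bytes text occupies in the given encoding."""
    return len(text.encode(encoding))


for script, text, legacy in samples:
    print(
        f"{script:9} chars={len(text)} "
        f"{legacy}={size_in(text, legacy)} utf-8={size_in(text, 'utf-8')}"
    )
```

For these samples the legacy encodings use one byte per character for
Russian and Thai and two for Japanese, while UTF-8 uses two bytes per
character for Russian and three for Thai and Japanese.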


Tony Graham
Tony Graham                            mailto:tgraham at mulberrytech.com
Mulberry Technologies, Inc.                http://www.mulberrytech.com
17 West Jefferson Street                    Direct Phone: 301/315-9632
Suite 207                                          Phone: 301/315-9631
Rockville, MD  20850                                 Fax: 301/315-8285
  Mulberry Technologies: A Consultancy Specializing in SGML and XML

xml-dev: A list for W3C XML Developers. To post, mailto:xml-dev at ic.ac.uk
Archived as: http://www.lists.ic.ac.uk/hypermail/xml-dev/ and on CD-ROM/ISBN 981-02-3594-1
To unsubscribe, mailto:majordomo at ic.ac.uk the following message;
unsubscribe xml-dev
To subscribe to the digests, mailto:majordomo at ic.ac.uk the following message;
subscribe xml-dev-digest
List coordinator, Henry Rzepa (mailto:rzepa at ic.ac.uk)
