Feeler for SML (Simple Markup Language)

Tony Graham tgraham at mulberrytech.com
Fri Nov 12 16:10:05 GMT 1999

At 11 Nov 1999 17:32 -0500, Clark C. Evans wrote:
 > > o UTF-8 encoding only
 > I'm kinda ignorant... would it still be
 > possible to handle oriental character sets
 > with UTF-8 ? 

You can still represent the characters from your oriental character
sets using UTF-8, but it takes three bytes per character to do so
(instead of two bytes with UTF-16 and most legacy encodings).

UTF-8 is a win for English text, since the ASCII characters are
represented with one byte each.  For most scripts, however, UTF-8 takes
up more bytes per character than UTF-16.  It is well known that the
three-bytes-per-character range (U+0800 through U+FFFF) includes the CJK
ideographs, but it also includes Hangul, the South and Southeast Asian
scripts, and others too numerous to mention here.
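
To make the byte counts above concrete, here is a minimal sketch in
Python.  The sample characters and the helper name encoded_lengths are
my own illustrative choices, not anything defined by SML or XML; the
"utf-16-be" codec is used only so that a byte order mark does not
inflate the UTF-16 count.

```python
# Hypothetical sample characters, one per script (chosen for illustration only).
SAMPLES = {
    "Latin (ASCII)": "A",
    "Greek": "\u03b1",          # GREEK SMALL LETTER ALPHA
    "Devanagari": "\u0915",     # DEVANAGARI LETTER KA
    "Thai": "\u0e01",           # THAI CHARACTER KO KAI
    "Hangul": "\ud55c",         # HANGUL SYLLABLE HAN
    "CJK ideograph": "\u4e2d",  # CJK UNIFIED IDEOGRAPH-4E2D
}


def encoded_lengths(ch):
    """Return (UTF-8 byte count, UTF-16 byte count) for one character."""
    # "utf-16-be" encodes without a byte order mark, so only the
    # character itself is counted.
    return len(ch.encode("utf-8")), len(ch.encode("utf-16-be"))


for script, ch in SAMPLES.items():
    utf8_len, utf16_len = encoded_lengths(ch)
    print(f"{script:15} U+{ord(ch):04X}  UTF-8: {utf8_len}  UTF-16: {utf16_len}")
```

The ASCII letter comes out at one byte in UTF-8 against two in UTF-16,
the Greek letter at two bytes in both, and the remaining samples at
three bytes in UTF-8 against two in UTF-16.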

Whether UTF-8 or UTF-16 is better depends both on what scripts you
mainly use and on what your tools support (since neither UTF-8 support
nor UTF-16 support is universal among general-purpose programming
languages or editors or...).
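
One rough way to check the first half of that for your own data is to
encode a representative sample both ways and compare the sizes.  The
sketch below assumes Python; the function name encoded_sizes and the two
sample strings are hypothetical, and it says nothing about the second
half of the question (what your tools support).

```python
def encoded_sizes(text):
    """Return the encoded size of text, in bytes, for UTF-8 and UTF-16."""
    return {
        "utf-8": len(text.encode("utf-8")),
        # Big-endian without a byte order mark, so only the text is counted.
        "utf-16": len(text.encode("utf-16-be")),
    }


# Hypothetical samples: 22 ASCII characters and 8 Japanese characters.
english = "Simple Markup Language"
japanese = "\u65e5\u672c\u8a9e\u306e\u30c6\u30ad\u30b9\u30c8"

for label, sample in (("English", english), ("Japanese", japanese)):
    sizes = encoded_sizes(sample)
    print(label, sizes, "smaller:", min(sizes, key=sizes.get))
```

For these two samples UTF-8 is the smaller encoding for the English
string and UTF-16 is the smaller one for the Japanese string; a real
document mixing ASCII markup with non-ASCII content could land on
either side.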


Tony Graham
Tony Graham                            mailto:tgraham at mulberrytech.com
Mulberry Technologies, Inc.                http://www.mulberrytech.com
17 West Jefferson Street                    Direct Phone: 301/315-9632
Suite 207                                          Phone: 301/315-9631
Rockville, MD  20850                                 Fax: 301/315-8285
  Mulberry Technologies: A Consultancy Specializing in SGML and XML

