Unicode confusion

Tue Jan 4 20:02:47 GMT 2000

> No one's disagreeing with the use of Unicode; we're talking about
> which character encoding we'll use to represent it.  You can represent
> Unicode in variable-width 8-bit or 16-bit encodings or in fixed-width
> 32-bit encodings.

My reading of the Unicode 2.x standard is that the above isn't strictly
correct.  It is correct if you change "Unicode" to "the ISO 10646 Universal
Character Set" though.
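
For what it's worth, the variable-width versus fixed-width distinction in
the quoted text is easy to see with the UCS transformation formats.  Below
is a minimal Python sketch, not anything taken from either standard; the
sample characters and the helper name encoded_widths are my own choices for
illustration.

```python
# Illustration only: the sample characters are arbitrary picks, one per
# UTF-8 length class (1, 2, 3 and 4 bytes).
SAMPLES = ["A", "\u00e9", "\u20ac", "\U00010000"]

def encoded_widths(ch):
    """Return the byte length of ch in UTF-8, UTF-16 and UTF-32.

    The big-endian codec variants are used so that no byte order mark
    is counted in the lengths.
    """
    return (
        len(ch.encode("utf-8")),
        len(ch.encode("utf-16-be")),
        len(ch.encode("utf-32-be")),
    )

for ch in SAMPLES:
    # Print the code point followed by its (UTF-8, UTF-16, UTF-32) sizes.
    print("U+%04X" % ord(ch), encoded_widths(ch))
```

The 8-bit and 16-bit columns vary from character to character, while the
32-bit column is always 4 bytes.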

> Note that Java uses UTF-16, which isn't quite fixed-width, though no
> one really notices.

It seems to me that Java uses Unicode, which maintains the semantics that
one 16-bit value equals one character.  Surrogates are characters in
Unicode, whereas in the UCS those code points are not legal characters, but
only artifacts of the UTF-16 encoding.

Unicode looks like UTF-16, but the semantics are slightly different.  So a
file using UTF-16 encoding containing a single "astral plane" character of
the UCS would be interpreted by Unicode as a file containing *two* surrogate
characters.  (I think it's a strange tack to take, but it seems fairly clear
to me that this was their position as of Unicode 2.x.  I haven't looked at
3.0 yet, so things may have changed since then.)
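
To illustrate the two readings, here is a rough Python sketch.  It assumes
a Python 3 interpreter, where a string is a sequence of code points rather
than 16-bit units; the function names are hypothetical helpers of mine, and
U+10000 is just an arbitrary character outside the 16-bit range.

```python
import struct

def utf16_code_units(text):
    """Return the 16-bit code units of text's UTF-16 (big-endian) form."""
    data = text.encode("utf-16-be")
    return list(struct.unpack(">%dH" % (len(data) // 2), data))

def is_surrogate(unit):
    """True if a 16-bit value lies in the surrogate range D800..DFFF."""
    return 0xD800 <= unit <= 0xDFFF

def combine_surrogates(high, low):
    """Map a high/low surrogate pair back to the code point it encodes."""
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)

astral = "\U00010000"              # one character in the UCS reading
units = utf16_code_units(astral)   # what a 16-bits-per-character reader sees

print(len(astral), [hex(u) for u in units])
print(all(is_surrogate(u) for u in units))
print(hex(combine_surrogates(*units)))
```

A reader that treats every 16-bit value as a character sees two surrogates
here; a UCS reader combines the pair and sees a single character.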

The XML character set is the UCS, not Unicode.

Cheers,
-Peter-    housel at acm.org

xml-dev: A list for W3C XML Developers. To post, mailto:xml-dev at ic.ac.uk
Archived as: http://www.lists.ic.ac.uk/hypermail/xml-dev/ and on CD-ROM/ISBN 981-02-3594-1