Unicode surrogate block in XML?

Sat Sep 18 06:39:27 BST 1999

At 17 Sep 1999 22:16 -0400, Paul W. Abrahams wrote:
 > Tony Graham (tgraham at mulberrytech.com)
 > Fri, 17 Sep 1999 01:15:51 -0400 (EST)
 > 
 > >> In any XML document, you can make numeric references to any Unicode
 > character in the range #x10000 to #x10FFFF (as well as to any other
 > legal character number).  These references are independent of the
 > encoding used in the XML document. <<
 > 
 > Is it really correct to refer to #x10FFFF, say, as a Unicode
 > character, since Unicode characters are limited to 16 bits?  I'd think
 > it's necessary here to refer to that as a UCS-4 character.

The Unicode Standard started out with the design principle that all
characters have a uniform width of 16 bits.  The expectation was that
the 65,000 or so characters that you can address with 16 bits would
far exceed the requirements.  However, reality intruded, and the
practicalities (and possibly the political realities) of defining a
universal character set have meant that there are more characters to
be defined than can fit in a 16-bit address space.

Unicode 2.0, published in 1996, defines the Surrogate block and a
mechanism for using two code values from the surrogate block to
address over one million extra characters.
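
The arithmetic behind that mechanism is short enough to sketch.  What
follows is my own illustration in Python, not text from the standard;
the function and constant names are made up for the example.  It
splits a character number above #xFFFF into a high and a low
surrogate and puts the pair back together again:

```python
# Sketch of the surrogate arithmetic; all names here are illustrative.

HIGH_START = 0xD800  # first high (leading) surrogate code value
LOW_START = 0xDC00   # first low (trailing) surrogate code value


def to_surrogates(code_point):
    """Split a character number in #x10000..#x10FFFF into two code values."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("not addressable with a surrogate pair")
    offset = code_point - 0x10000        # a 20-bit value
    high = HIGH_START + (offset >> 10)   # top 10 bits
    low = LOW_START + (offset & 0x3FF)   # bottom 10 bits
    return high, low


def from_surrogates(high, low):
    """Combine a high and a low surrogate into one character number."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid surrogate pair")
    return 0x10000 + ((high - HIGH_START) << 10) + (low - LOW_START)


print([hex(v) for v in to_surrogates(0x10000)])   # ['0xd800', '0xdc00']
print([hex(v) for v in to_surrogates(0x10FFFF)])  # ['0xdbff', '0xdfff']
print(1024 * 1024)                                # 1048576
```

With 1,024 high surrogates and 1,024 low surrogates you get 1,048,576
pairs, which is where "over one million" comes from.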

The Unicode Standard, Version 2.0, supports surrogates, but doesn't
quite know what to do about them.  Section 3.7 of the Unicode
Standard, Version 2.0, defines surrogates, and they are mentioned
again in section C.3, but you're left with the impression that they
and UTF-16 are really an ISO/IEC 10646 thing.  UTF-16 was initially
defined in Amendment 1 of ISO/IEC 10646-1:1993, so that impression
isn't far off the mark.

Planes 15 and 16 are reserved for private use, so there's been a
legitimate use for surrogates, or, more broadly, for using characters
outside Plane 0, since 1996.

Since 1996, however, there have been numerous proposals for scripts to
be included in the Unicode Standard and ISO/IEC 10646, and many of
these are slated for definition in Plane 1, i.e. they'll need more
than 16 bits to address the characters.  As far as I know, none have
been assigned code values yet, but that shouldn't be long in coming
after the release of the Unicode Standard, Version 3.0, and ISO/IEC
10646-1:2000.  Furthermore, Plane 2 is reserved as the CJK Unified
Ideographs Supplementary Plane, and it already has 41,000 characters
lined up for inclusion.

 > >> The sequence of #xD800 #xDC00 is the two Surrogate code values that
 > address #x10000.  That four-byte sequence may occur in a UTF-16
 > encoded file to represent #x10000.  In contrast, "&#xD800;&#xDC00;" in
 > an XML document is two illegal character references in a row. <<
 > 
 > I've been trying to fathom the distinction between Unicode and UTF-16,
 > if there is one, and how these in turn relate to the UCS-2 encoding of

There isn't one anymore.  The Unicode Standard used to say that it
corresponded to UCS-2, but now it has embraced UTF-16 (and given us
UTF-16BE and UTF-16LE for big-endian and little-endian representations
without the BOM, respectively).
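
If you want to see the byte-order variants side by side, here is a
small sketch.  It assumes Python's codec names ("utf-16-be",
"utf-16-le", and plain "utf-16", which prepends a BOM); those names
belong to Python, not to the Unicode Standard:

```python
# Encode one character beyond Plane 0 in the UTF-16 variants.
ch = "\U00010000"

be = ch.encode("utf-16-be")     # big-endian, no BOM
le = ch.encode("utf-16-le")     # little-endian, no BOM
with_bom = ch.encode("utf-16")  # BOM first, then platform byte order

print(be.hex())  # d800dc00
print(le.hex())  # 00d800dc
print(len(be), len(le), len(with_bom))  # 4 4 6
```

The two extra bytes in the last case are the BOM, which is how a
reader of plain UTF-16 works out the byte order.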

The Unicode Consortium now also defines UTF-32, which is a 32-bit
representation of the characters that you can address with UTF-16.
There is no difference between the UTF-32 representation of a
character and the UCS-4 representation of a character over the range
of characters that you can address with UTF-32.  The only difference
is that when you say that your document is UTF-32, you're saying that
it comes with the Unicode character semantics and conformance
requirements rather than the different requirements of UCS-4.
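
At the level of bits, a 32-bit representation is just the character
number itself.  A minimal check, assuming Python's "utf-32-be" codec
and a helper name of my own invention:

```python
# Sketch: in UTF-32 the four bytes of a character are its number.

def utf32_value(code_point):
    """Encode one character as big-endian UTF-32 and read it back as an integer."""
    data = chr(code_point).encode("utf-32-be")
    assert len(data) == 4  # always exactly four bytes per character
    return int.from_bytes(data, "big")


for cp in (0x41, 0x20AC, 0x10000, 0x10FFFF):
    print(hex(cp), utf32_value(cp) == cp)  # True on every line
```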

UTF-8 has also come into the fold since 1996.  In the Unicode
Standard, Version 2.0, UTF-8 was relegated to section A.2, but now
it's an accepted alternative to UTF-16.

 > ISO 10646.  There's also the question of whether an XML document can
 > be stored directly in Unicode, or whether instead it must be stored in
 > either UTF-8 or UTF-16,  as Section 2.2 seems to imply when it says
 > ``all XML processors must accept the UTF-8 and UTF-16 encodings of
 > 10646''.   The latter appears to be the case; but if it isn't, then
 > how would an XML  document be stored directly in Unicode?   I've

UTF-8 and UTF-16 can encode the characters of the Unicode Standard.
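
To tie this back to XML: the sketch below stores the same small
document in both encodings and checks that a numeric reference to
#x10000 comes out the same either way.  It also feeds the parser the
pair of surrogate references discussed earlier.  It assumes Python's
xml.etree.ElementTree (which sits on top of expat); the helper names
are mine, and other parsers may word their errors differently.

```python
import xml.etree.ElementTree as ET

TEMPLATE = '<?xml version="1.0" encoding="{}"?><a>&#x10000;</a>'

# The same document stored two ways.  Python's "utf-16" codec adds a BOM.
as_utf8 = TEMPLATE.format("UTF-8").encode("utf-8")
as_utf16 = TEMPLATE.format("UTF-16").encode("utf-16")


def parse_text(doc_bytes):
    """Parse an encoded document and return the text of its root element."""
    return ET.fromstring(doc_bytes).text


def surrogate_refs_rejected():
    """Return True if the parser refuses references to surrogate code values."""
    try:
        ET.fromstring(b"<a>&#xD800;&#xDC00;</a>")
    except ET.ParseError:
        return True
    return False


print(parse_text(as_utf8) == "\U00010000")   # True
print(parse_text(as_utf16) == "\U00010000")  # True
print(surrogate_refs_rejected())             # True
```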

The Unicode Standard used to leave out one of the aspects that some
people, and some ISO standards, distinguish when they define a
character set.

Roughly speaking, the base aspect is the character repertoire, which is
a collection of abstract characters.

The next aspect is a mapping of the character repertoire onto a set of
numbers.

The third aspect is mapping the character numbers onto some
representation as bits or bytes.

The Unicode Standard used to conflate the second and third aspects
since the character numbers are identical to the value of the 16-bit
quantities that you can use to represent the characters.  Hence it
seems like a Unicode character is its 16-bit character number.  This
simplification falls down when you have character numbers that you
can't express with 16 bits and you allow other bit representations for
the characters.

You'll find that the Unicode Consortium now speaks about UTF-8,
UTF-16, UTF-32, and UTF-EBCDIC.  The favourite is probably still
UTF-16, but even UTF-16 isn't one 16-bit quantity to one character.
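
A quick way to convince yourself of that last point is to count the
16-bit quantities.  This is only a sketch, with a helper name I've
made up, and it assumes a Python where a string is a sequence of
characters rather than of 16-bit values:

```python
# Sketch: count the 16-bit code values UTF-16 needs for a piece of text.

def utf16_units(text):
    """Number of 16-bit quantities in the UTF-16 form of text (no BOM)."""
    return len(text.encode("utf-16-be")) // 2


print(utf16_units("A"))           # 1
print(utf16_units("\U00010000"))  # 2
print(len("\U00010000"))          # 1
```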

Also, the Unicode character encoding model
(http://www.unicode.org/unicode/reports/tr17/) now has five levels.

 > pondered both Appendix C of the Unicode Standard and the relevant part
 > of the FAQ on the Unicode website, and I'm still unclear about all of
 > this.  (By the way, the FAQ erroneously refers to UTF as the Unicode
 > Transformation Format rather than the UCS transformation format.)

There are two definitions for UTF.  ISO/IEC 10646 always defines it as
"UCS transformation format", and the Unicode Consortium mostly defines
it as "Unicode transformation format" (see section C.3 of the Unicode
Standard, Version 2.0, for an exception).  They mean the same thing.

 > In any event, thanks, Tony, for your very enlightening response to my
 > original query.

I hope this remains enlightening, and not overwhelming.

Regards,

Tony Graham
======================================================================
Tony Graham                            mailto:tgraham at mulberrytech.com
Mulberry Technologies, Inc.                http://www.mulberrytech.com
17 West Jefferson Street                    Direct Phone: 301/315-9632
Suite 207                                          Phone: 301/315-9631
Rockville, MD  20850                                 Fax: 301/315-8285
----------------------------------------------------------------------
  Mulberry Technologies: A Consultancy Specializing in SGML and XML
======================================================================

xml-dev: A list for W3C XML Developers. To post, mailto:xml-dev at ic.ac.uk
Archived as: http://www.lists.ic.ac.uk/hypermail/xml-dev/ and on CD-ROM/ISBN 981-02-3594-1
List coordinator, Henry Rzepa (mailto:rzepa at ic.ac.uk)