Unicode surrogate block in XML?

Tony Graham tgraham at mulberrytech.com
Fri Sep 17 07:13:57 BST 1999


At 16 Sep 1999 18:12 -0400, Paul W. Abrahams wrote:
 > The XML 1.0 spec explicitly excludes the Unicode surrogate characters
 > from XML documents (production 2).  It now seems, from information
 > I've picked up on the Unicode web site, that surrogate characters are
 > likely to play a more important role in the future, since the
 > available 16-bit characters are almost all used up.  (Unicode 2.0 has
 > 18,134 spares but Unicode 3.0 has only 7827 spares.  The trend is
 > clear.)
 > 
 > Is any thought being given in W3C to allowing surrogate characters in
 > XML documents?

The code values from the Surrogate block (soon to be the High
Surrogates, High Private Use Surrogates, and Low Surrogates) are not
allowed in XML documents, but the characters that you reference with
the two parts of a Surrogate Pair are definitely allowed.

The characters that you can address with a Surrogate Pair are in the
range #x10000 to #x10FFFF.  In Unicode terminology, this is the
Unicode Scalar Value of the Surrogate Pair.
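
As a rough illustration of that mapping, here is a minimal Python
sketch of the standard UTF-16 arithmetic.  The function and constant
names are my own, chosen for illustration; they do not come from the
XML Recommendation or the Unicode Standard.

```python
# Boundaries of the high and low halves of the Surrogate block.
HIGH_START, HIGH_END = 0xD800, 0xDBFF
LOW_START, LOW_END = 0xDC00, 0xDFFF


def surrogate_pair_to_scalar(high, low):
    """Return the Unicode Scalar Value addressed by a Surrogate Pair."""
    if not HIGH_START <= high <= HIGH_END:
        raise ValueError("not a high surrogate: %#x" % high)
    if not LOW_START <= low <= LOW_END:
        raise ValueError("not a low surrogate: %#x" % low)
    # Each half contributes 10 bits; the result is offset by #x10000.
    return 0x10000 + ((high - HIGH_START) << 10) + (low - LOW_START)


def scalar_to_surrogate_pair(scalar):
    """Return the (high, low) Surrogate Pair that addresses a scalar value."""
    if not 0x10000 <= scalar <= 0x10FFFF:
        raise ValueError("not addressable by a Surrogate Pair: %#x" % scalar)
    offset = scalar - 0x10000
    return HIGH_START + (offset >> 10), LOW_START + (offset & 0x3FF)
```

The first and last pairs (#xD800 #xDC00 and #xDBFF #xDFFF) map to the
two ends of the range, #x10000 and #x10FFFF.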

Production 2 from the XML Recommendation shows that these are legal
characters:

[2]  Char ::=  #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD]
               | [#x10000-#x10FFFF] 
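
Production 2 translates almost directly into a range test.  The
following is only a sketch in Python, and is_xml_char is a name I made
up for it:

```python
def is_xml_char(code_point):
    """Return True if code_point matches production 2 (Char) of XML 1.0."""
    return (
        code_point in (0x9, 0xA, 0xD)
        or 0x20 <= code_point <= 0xD7FF
        or 0xE000 <= code_point <= 0xFFFD
        or 0x10000 <= code_point <= 0x10FFFF
    )
```

With this test, is_xml_char(0x10000) is true, while is_xml_char(0xD800)
is false because #xD800 falls in the gap between #xD7FF and #xE000.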

In a UTF-16 encoded document, you can use the code values from the
Surrogate block to refer to these characters. It would be an error if,
for example, you used an unpaired Surrogate code value, but any UTF-16
application is going to complain about or ignore an unpaired
surrogate.
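
To see one UTF-16 implementation complain, here is a small sketch
using Python's built-in "utf-16-be" codec.  The helper name
decodes_as_utf16 is hypothetical, and other decoders may report or
recover from the error differently.

```python
def decodes_as_utf16(data):
    """Return True if data is valid big-endian UTF-16 for Python's codec."""
    try:
        data.decode("utf-16-be")
    except UnicodeDecodeError:
        return False
    return True


# #xD800 #xDC00: a well-formed Surrogate Pair.
paired = bytes([0xD8, 0x00, 0xDC, 0x00])

# #xD800 followed by "A": the high surrogate has no partner.
unpaired = bytes([0xD8, 0x00, 0x00, 0x41])
```

Here decodes_as_utf16(paired) is true and decodes_as_utf16(unpaired)
is false.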

In a UTF-8 encoded document, you can refer to the characters in the
range #x10000 to #x10FFFF using a four-byte sequence that has no
relationship to the code values in the Surrogate block.

In UCS-4 (or the new UTF-32) you can directly represent characters in
the range #x10000 to #x10FFFF.
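
The three encodings can be compared side by side for #x10000.  This is
just a sketch using Python's standard codecs, with the big-endian forms
picked so that no byte order mark appears in the output:

```python
ch = chr(0x10000)

# UTF-8: a four-byte sequence unrelated to the Surrogate block.
utf8 = ch.encode("utf-8")

# UTF-16: the Surrogate Pair #xD800 #xDC00, two bytes per code value.
utf16 = ch.encode("utf-16-be")

# UTF-32 (UCS-4): the scalar value stored directly in four bytes.
utf32 = ch.encode("utf-32-be")

for name, data in (("UTF-8", utf8), ("UTF-16", utf16), ("UTF-32", utf32)):
    print(name, data.hex(" ").upper())
```

Only the UTF-16 form contains the byte pairs D8 00 and DC 00; the UTF-8
form is F0 90 80 80 and the UTF-32 form is 00 01 00 00.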

In any XML document, you can make numeric references to any Unicode
character in the range #x10000 to #x10FFFF (as well as to any other
legal character number).  These references are independent of the
encoding used in the XML document.
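
A quick way to check the encoding independence is to parse the same
reference under several encoding declarations.  The sketch below
assumes Python's xml.etree.ElementTree (which uses expat underneath);
text_of is a made-up helper name.

```python
import xml.etree.ElementTree as ET


def text_of(encoding):
    """Parse a one-element document in the given encoding; return its text."""
    doc = '<?xml version="1.0" encoding="%s"?><a>&#x10000;</a>' % encoding
    # The reference itself is plain ASCII, so it survives any of these
    # encodings unchanged; only the parser turns it into a character.
    return ET.fromstring(doc.encode(encoding)).text


results = [text_of(enc) for enc in ("US-ASCII", "ISO-8859-1", "UTF-8")]
```

Every entry in results should be the single character #x10000, even
though neither US-ASCII nor ISO-8859-1 could represent that character
directly.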

#x10000 is the first code value outside the Basic Multilingual Plane
(the ISO/IEC 10646 term for the characters in the range #x0 to
#xFFFF).  "&#x10000;" is the hexadecimal numeric reference for this
code value.

The sequence #xD800 #xDC00 is the pair of Surrogate code values that
addresses #x10000.  That four-byte sequence may occur in a UTF-16
encoded file to represent #x10000.  In contrast, "&#xD800;&#xDC00;" in
an XML document is two illegal character references in a row.
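
A conforming parser should accept the first form of reference and
reject the second.  Here is a minimal sketch, again assuming Python's
xml.etree.ElementTree; is_well_formed is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET


def is_well_formed(doc):
    """Return True if the parser accepts doc, False on a parse error."""
    try:
        ET.fromstring(doc)
    except ET.ParseError:
        return False
    return True


# A reference to the scalar value: legal under production 2.
by_scalar = "<a>&#x10000;</a>"

# References to the two Surrogate code values: neither matches Char.
by_surrogates = "<a>&#xD800;&#xDC00;</a>"
```

is_well_formed(by_scalar) is true, while is_well_formed(by_surrogates)
is false.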

Regards,


Tony Graham
======================================================================
Tony Graham                            mailto:tgraham at mulberrytech.com
Mulberry Technologies, Inc.                http://www.mulberrytech.com
17 West Jefferson Street                    Direct Phone: 301/315-9632
Suite 207                                          Phone: 301/315-9631
Rockville, MD  20850                                 Fax: 301/315-8285
----------------------------------------------------------------------
  Mulberry Technologies: A Consultancy Specializing in SGML and XML
======================================================================






