Unicode surrogate block in XML?

Tim Bray tbray at textuality.com
Fri Sep 17 02:35:28 BST 1999


At 06:12 PM 9/16/99 -0400, Paul W. Abrahams wrote:
>The XML 1.0 spec explicitly excludes the Unicode surrogate characters
>from XML documents (production 2).  It now seems, from information
>I've picked up on the Unicode web site, that surrogate characters are
>likely to play a more important role in the future, since the
>available 16-bit characters are almost all used up.  (Unicode 2.0 has
>18,134 spares but Unicode 3.0 has only 7827 spares.  The trend is
>clear.)

No. Production [2] says

[2] Char ::=  #x9 | #xA | #xD | [#x20-#xD7FF]
              | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
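
As a rough illustration, the production can be read as a range test on
code points. The sketch below is a direct transcription of the four
ranges in Python; the function name is_xml_char is made up for this
example and is not part of any XML library.

```python
def is_xml_char(cp: int) -> bool:
    """Return True if code point `cp` matches production [2] (Char) of XML 1.0.

    Hypothetical helper for illustration only: it transcribes the ranges
    from the production and does nothing else.
    """
    return (
        cp in (0x9, 0xA, 0xD)          # tab, line feed, carriage return
        or 0x20 <= cp <= 0xD7FF        # up to just below the surrogate block
        or 0xE000 <= cp <= 0xFFFD      # just above the surrogate block
        or 0x10000 <= cp <= 0x10FFFF   # planes 1 through 16
    )
```

Note that the gap between #xD7FF and #xE000 is exactly the surrogate
block, while everything from #x10000 up is allowed.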

This follows the Unicode model in allowing 17 planes of 64K characters
each, i.e. about a million characters.  For this to work in UTF-16, you
need surrogate pairs.  What XML rules out is *characters* whose numeric
value is that of one-half of a surrogate pair.  There will never be any
such characters precisely because those values are reserved for use in
surrogate pairs.  That's why XML rules them out. -Tim
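
To make the distinction concrete, here is a minimal sketch of the
standard UTF-16 surrogate arithmetic in Python. The function names
to_surrogate_pair and from_surrogate_pair are hypothetical, chosen for
this example. The point it tries to show is that a character above
#xFFFF is still a single character with its own value; only its UTF-16
encoding uses two 16-bit units drawn from the reserved block.

```python
from typing import Tuple


def to_surrogate_pair(cp: int) -> Tuple[int, int]:
    """Split a code point in #x10000-#x10FFFF into (high, low) UTF-16 units."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("only code points above #xFFFF need a surrogate pair")
    offset = cp - 0x10000              # 20-bit value
    high = 0xD800 + (offset >> 10)     # top 10 bits, lands in #xD800-#xDBFF
    low = 0xDC00 + (offset & 0x3FF)    # bottom 10 bits, lands in #xDC00-#xDFFF
    return high, low


def from_surrogate_pair(high: int, low: int) -> int:
    """Recombine a (high, low) pair into the single code point it encodes."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid surrogate pair")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)
```

For example, to_surrogate_pair(0x1D11E) gives (0xD834, 0xDD1E). The
character #x1D11E is allowed by production [2]; the values #xD834 and
#xDD1E taken as characters on their own are not.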





