Unicode surrogate block in XML?
Tim Bray
tbray at textuality.com
Fri Sep 17 02:35:28 BST 1999
At 06:12 PM 9/16/99 -0400, Paul W. Abrahams wrote:
>The XML 1.0 spec explicitly excludes the Unicode surrogate characters
>from XML documents (production 2). It now seems, from information
>I've picked up on the Unicode web site, that surrogate characters are
>likely to play a more important role in the future, since the
>available 16-bit characters are almost all used up. (Unicode 2.0 has
>18,134 spares but Unicode 3.0 has only 7827 spares. The trend is
>clear.)
No. Production [2] says
[2] Char ::= #x9 | #xA | #xD | [#x20-#xD7FF]
| [#xE000-#xFFFD] | [#x10000-#x10FFFF]
This follows the Unicode model in allowing 17 planes of 64K characters
each, i.e. about a million characters. For this to work in UTF-16, you
need surrogate pairs. What XML rules out is *characters* whose numeric
value is that of one-half of a surrogate pair. There will never be any
such characters precisely because those values are reserved for use in
surrogate pairs. That's why XML rules them out.
 -Tim
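
To make the distinction concrete, here is a minimal Python sketch of the point above. It is not part of the XML spec or of any library; the function names is_xml_char and to_surrogate_pair are made up for illustration. The first function transcribes production [2] as a range check on code points. The second shows the standard UTF-16 arithmetic that turns a code point above #xFFFF into two 16-bit code units, each of which falls in the excluded #xD800-#xDFFF range.

```python
from __future__ import annotations


def is_xml_char(cp: int) -> bool:
    """Return True if code point cp matches XML 1.0 production [2] Char."""
    return (
        cp in (0x9, 0xA, 0xD)
        or 0x20 <= cp <= 0xD7FF
        or 0xE000 <= cp <= 0xFFFD
        or 0x10000 <= cp <= 0x10FFFF
    )


def to_surrogate_pair(cp: int) -> tuple[int, int]:
    """Split a code point in #x10000-#x10FFFF into its two UTF-16 code units."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("only code points above #xFFFF use surrogate pairs")
    offset = cp - 0x10000                 # 20-bit value
    high = 0xD800 + (offset >> 10)        # top 10 bits -> high surrogate
    low = 0xDC00 + (offset & 0x3FF)       # bottom 10 bits -> low surrogate
    return high, low


if __name__ == "__main__":
    cp = 0x1D11E  # a character outside the first plane
    high, low = to_surrogate_pair(cp)
    print(hex(high), hex(low))
    # The character itself is allowed; the two code unit values,
    # taken as character numbers, are not.
    print(is_xml_char(cp), is_xml_char(high), is_xml_char(low))
```

So the character is a legal Char even though its UTF-16 encoding is a surrogate pair. Only the values of the individual halves, taken as character numbers, are excluded.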