Unicode surrogate block in XML?

Paul W. Abrahams abrahams at valinet.com
Sat Sep 18 04:10:55 BST 1999


Tony Graham (tgraham at mulberrytech.com)
Fri, 17 Sep 1999 01:15:51 -0400 (EST)

>> In any XML document, you can make numeric references to any Unicode
character in the range #x10000 to #x10FFFF (as well as to any other
legal character number).  These references are independent of the
encoding used in the XML document. <<

Is it really correct to refer to #x10FFFF, say, as a Unicode
character, since Unicode characters are limited to 16 bits?  I'd think
it's necessary here to refer to that as a UCS-4 character.
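
As a rough check of Tony's point that such references do not depend on the document's encoding, here is a minimal sketch. It assumes Python 3 and its standard xml.etree.ElementTree parser; the element name `a` and the helper `referenced_code_points` are made up for illustration and come from neither message.

```python
import xml.etree.ElementTree as ET

# The same document text, to be serialized in two different encodings.
DOC = '<?xml version="1.0" encoding="{enc}"?><a>&#x10000;</a>'

def referenced_code_points(encoding):
    """Encode DOC in the given encoding, parse it, and return the
    code points found in the text of the root element."""
    data = DOC.format(enc=encoding).encode(encoding)
    text = ET.fromstring(data).text
    return [ord(ch) for ch in text]

for enc in ("utf-8", "utf-16"):
    print(enc, [hex(cp) for cp in referenced_code_points(enc)])
```

Under those assumptions, both encodings should yield the same single character number, since the reference names the character rather than any particular byte sequence.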

>> The sequence of #xD800 #xDC00 is the two Surrogate code values that
address #x10000.  That four-byte sequence may occur in a UTF-16
encoded file to represent #x10000.  In contrast, "&#xD800;&#xDC00;" in
an XML document is two illegal character references in a row. <<
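
The arithmetic behind that surrogate pair can be sketched as follows. This is only an illustration, assuming Python 3 and its standard xml.etree.ElementTree parser; the function names `to_surrogates`, `from_surrogates`, and `is_well_formed` are hypothetical helpers, not part of any standard or of Tony's message.

```python
import xml.etree.ElementTree as ET

def to_surrogates(cp):
    """Split a character number in #x10000..#x10FFFF into its
    (high, low) UTF-16 surrogate code values."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("not in the range addressed by surrogates")
    v = cp - 0x10000                      # 20-bit offset
    return 0xD800 + (v >> 10), 0xDC00 + (v & 0x3FF)

def from_surrogates(high, low):
    """Recombine a surrogate pair into the character number it addresses."""
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)

def is_well_formed(xml_text):
    """Return True if the parser accepts the text, False on a parse error."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False

high, low = to_surrogates(0x10000)
print(hex(high), hex(low))                        # the pair for #x10000
print(chr(0x10000).encode("utf-16-be").hex())     # the four bytes in a UTF-16 file
print(is_well_formed("<a>&#x10000;</a>"))         # one legal reference
print(is_well_formed("<a>&#xD800;&#xDC00;</a>"))  # two surrogate references
```

The surrogate values appear only at the encoding level; a character reference has to name the character number itself, which is why the last document should be rejected.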

I've been trying to fathom the distinction between Unicode and UTF-16,
if there is one, and how these in turn relate to the UCS-2 encoding of
ISO 10646.  There's also the question of whether an XML document can
be stored directly in Unicode, or whether instead it must be stored in
either UTF-8 or UTF-16, as Section 2.2 seems to imply when it says
``all XML processors must accept the UTF-8 and UTF-16 encodings of
10646''.  The latter appears to be the case; but if it isn't, then
how would an XML document be stored directly in Unicode?  I've
pondered both Appendix C of the Unicode Standard and the relevant part
of the FAQ on the Unicode website, and I'm still unclear about all of
this.  (By the way, the FAQ erroneously refers to UTF as the Unicode
Transformation Format rather than the UCS transformation format.)

In any event, thanks, Tony, for your very enlightening response to my
original query.

Paul Abrahams



xml-dev: A list for W3C XML Developers. To post, mailto:xml-dev at ic.ac.uk
Archived as: http://www.lists.ic.ac.uk/hypermail/xml-dev/ and on CD-ROM/ISBN 981-02-3594-1
To (un)subscribe, mailto:majordomo at ic.ac.uk the following message;
(un)subscribe xml-dev
To subscribe to the digests, mailto:majordomo at ic.ac.uk the following message;
subscribe xml-dev-digest
List coordinator, Henry Rzepa (mailto:rzepa at ic.ac.uk)




