Char & Java implementation
richard at cogsci.ed.ac.uk
Wed Mar 4 10:53:00 GMT 1998
>  Char ::= #x9 | #xA | #xD | [#x20-#xD7FF]
> | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
> Am I right in thinking that, since the indicated characters are longer than
> 16 bits, they can't be represented in Java with the char data type, and int
> must be used instead?
The answer to this explains the otherwise mysterious missing range
D800 to DFFF. These 2 * 2^10 missing characters can be used in pairs
to represent the first 2^20 characters above FFFF. The character
10000 + x (all values in hex, 0 <= x <= FFFFF) is represented by the
pair D800 + (x >> 10), DC00 + (x & 3FF).
Since none of the characters above FFFF are name characters, they are
irrelevant to the syntax of XML, and you don't need to convert the
pairs of "surrogates" into the characters they represent - you can
just pass them through to the application.
So you can treat the range of legal characters as being 9,A,D,20-FFFD.
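
As an illustration of the pair arithmetic described above, here is a
minimal Java sketch. The class and method names (SurrogateSketch,
toSurrogates, fromSurrogates) are made up for this example and are not
taken from any parser; it assumes the caller has already got hold of the
character number as an int.

```java
class SurrogateSketch {
    // Split a character in the range 10000-10FFFF into its surrogate pair.
    static char[] toSurrogates(int codePoint) {
        if (codePoint < 0x10000 || codePoint > 0x10FFFF) {
            throw new IllegalArgumentException(
                "not in 10000-10FFFF: " + Integer.toHexString(codePoint));
        }
        int x = codePoint - 0x10000;                // 20-bit offset above FFFF
        char high = (char) (0xD800 + (x >> 10));    // top 10 bits
        char low = (char) (0xDC00 + (x & 0x3FF));   // bottom 10 bits
        return new char[] { high, low };
    }

    // Inverse mapping: recombine a pair into the character it represents.
    static int fromSurrogates(char high, char low) {
        return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00);
    }

    public static void main(String[] args) {
        char[] pair = toSurrogates(0x1D11E);
        System.out.println(Integer.toHexString(pair[0]) + " "
            + Integer.toHexString(pair[1]));
        System.out.println(Integer.toHexString(fromSurrogates(pair[0], pair[1])));
    }
}
```

Running main prints "d834 dd1e" and then "1d11e". A parser that only
passes the pairs through never needs fromSurrogates; it is included to
show that the mapping is reversible.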
There are a few things you have to take account of:
- the surrogates must appear in pairs in the input, one in the range
D800-DBFF followed by one in the range DC00-DFFF
- if a character entity refers to a character in the range 10000-10FFFF
it should be converted to a pair of surrogates before it is passed to
the application
- a character entity must not expand to a character in the surrogate
range
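
The three points above could be checked along the following lines. This
is only a sketch under assumptions: the names XmlCharCheckSketch,
isWellFormed and expandReference are hypothetical, and the input is
assumed to have been decoded into Java chars already.

```java
class XmlCharCheckSketch {
    static boolean isHigh(char c) { return c >= 0xD800 && c <= 0xDBFF; }
    static boolean isLow(char c) { return c >= 0xDC00 && c <= 0xDFFF; }

    // True if every char is legal (9, A, D, 20-FFFD) and every surrogate
    // is part of a high-then-low pair.
    static boolean isWellFormed(CharSequence s) {
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (isHigh(c)) {
                if (i + 1 >= s.length() || !isLow(s.charAt(i + 1))) {
                    return false;          // high surrogate with no partner
                }
                i++;                       // skip the low half of the pair
            } else if (isLow(c)) {
                return false;              // low surrogate with no preceding high
            } else if (!(c == 0x9 || c == 0xA || c == 0xD
                         || (c >= 0x20 && c <= 0xFFFD))) {
                return false;              // outside the legal range
            }
        }
        return true;
    }

    // Expand the numeric value of a character entity into Java chars.
    static char[] expandReference(int value) {
        if (value >= 0x10000 && value <= 0x10FFFF) {
            int x = value - 0x10000;
            return new char[] { (char) (0xD800 + (x >> 10)),
                                (char) (0xDC00 + (x & 0x3FF)) };
        }
        boolean legal = value == 0x9 || value == 0xA || value == 0xD
            || (value >= 0x20 && value <= 0xD7FF)
            || (value >= 0xE000 && value <= 0xFFFD);
        if (!legal) {
            // Covers the surrogate range D800-DFFF among other illegal values.
            throw new IllegalArgumentException(
                "illegal character: " + Integer.toHexString(value));
        }
        return new char[] { (char) value };
    }
}
```

With this sketch, expandReference(0x1D11E) returns the two chars D834
and DD1E, while expandReference(0xD800) throws, since an entity may not
name a surrogate directly.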
I think, but I'm not certain, that this encoding only applies to UTF-16
and not UCS-2 (which would mean that the surrogate characters are an
error if encountered in a UCS-2 stream). Can anyone confirm/deny this?
xml-dev: A list for W3C XML Developers. To post, mailto:xml-dev at ic.ac.uk
Archived as: http://www.lists.ic.ac.uk/hypermail/xml-dev/