Char & Java implementation

Wed Mar 4 10:53:00 GMT 1998

> [2]  Char ::= #x9 | #xA | #xD | [#x20-#xD7FF]
>               | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
>                                   ^^^^^^^^^^^^^^^^^^
> 
> Am I right in thinking that, since the indicated characters are longer than
> 16 bits, they can't be represented in Java with the char data type, and int
> must be used instead?

The answer to this explains the otherwise mysterious missing range
D800 to DFFF.  These 2 * 2^10 missing values can be used in pairs
to represent the first 2^20 characters above FFFF.  For 0 <= x < 2^20,
the character 10000 + x is represented by the pair D800 + (x >> 10),
DC00 + (x & 3FF), where 10000, D800, DC00 and 3FF are hex.
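
As a rough illustration of that arithmetic, here is a minimal Java sketch.
The class and method names (SurrogateSketch, toSurrogates, fromSurrogates)
are made up for this example and do not come from any library; it only
assumes that a character above FFFF is held in an int.

```java
// Illustrative sketch only; names are hypothetical, not a library API.
final class SurrogateSketch {

    // Split a character in 0x10000..0x10FFFF into a {high, low} pair.
    static char[] toSurrogates(int codePoint) {
        if (codePoint < 0x10000 || codePoint > 0x10FFFF) {
            throw new IllegalArgumentException(
                "not in 10000-10FFFF: " + Integer.toHexString(codePoint));
        }
        int x = codePoint - 0x10000;                 // 20-bit offset above FFFF
        char high = (char) (0xD800 + (x >> 10));     // top 10 bits
        char low  = (char) (0xDC00 + (x & 0x3FF));   // bottom 10 bits
        return new char[] { high, low };
    }

    // Recombine a {high, low} pair into the character it represents.
    static int fromSurrogates(char high, char low) {
        if (high < 0xD800 || high > 0xDBFF || low < 0xDC00 || low > 0xDFFF) {
            throw new IllegalArgumentException("not a surrogate pair");
        }
        return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00);
    }

    public static void main(String[] args) {
        char[] pair = toSurrogates(0x10000);
        // prints "D800 DC00"
        System.out.printf("%04X %04X%n", (int) pair[0], (int) pair[1]);
    }
}
```

The two methods are inverses of each other over the whole range
10000-10FFFF, so a pair passed through unchanged loses nothing.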

Since none of the characters above FFFF are name characters, they are
irrelevant to the syntax of XML, and you don't need to convert the
pairs of "surrogates" into the characters they represent - you can
just pass them through to the application.

So you can treat the range of legal characters as being 9,A,D,20-FFFD.
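
In Java terms, that simplified range might be tested per 16-bit char
roughly as follows.  This is only a sketch: XmlCharSketch and isLegalUnit
are invented names, and the test deliberately lets surrogates through,
so it does not check that they are correctly paired.

```java
// Illustrative sketch only; names are hypothetical.
final class XmlCharSketch {

    // True if the 16-bit unit lies in 9, A, D, or 20-FFFD.
    // Surrogates (D800-DFFF) are accepted here; pairing is a separate check.
    static boolean isLegalUnit(char c) {
        return c == 0x9 || c == 0xA || c == 0xD
            || (c >= 0x20 && c <= 0xFFFD);
    }

    public static void main(String[] args) {
        System.out.println(isLegalUnit('A'));       // prints "true"
        System.out.println(isLegalUnit('\uFFFE'));  // prints "false"
    }
}
```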

There are a few things you have to take account of:

- the surrogates must appear in pairs in the input, one in the range
  D800-DBFF followed by one in the range DC00-DFFF

- if a character reference (e.g. &#x10000;) refers to a character in the
  range 10000-10FFFF, it should be converted to a pair of surrogates
  before it is passed to the application

- a character reference must not expand to a character in the surrogate
  range D800-DFFF.
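
The three points above could be sketched in Java along these lines.
Everything here is an assumption made for illustration: the names
SurrogateChecks, surrogatesWellPaired and expandCharRef are invented,
and the numeric value of the reference is assumed to have been parsed
into an int already.

```java
// Illustrative sketch only; names are hypothetical.
final class SurrogateChecks {

    // Every D800-DBFF unit must be followed directly by a DC00-DFFF unit,
    // and a DC00-DFFF unit may not appear anywhere else.
    static boolean surrogatesWellPaired(CharSequence s) {
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c >= 0xD800 && c <= 0xDBFF) {
                if (i + 1 >= s.length()) return false;   // nothing follows
                char next = s.charAt(i + 1);
                if (next < 0xDC00 || next > 0xDFFF) return false;
                i++;                                     // skip the second half
            } else if (c >= 0xDC00 && c <= 0xDFFF) {
                return false;                            // second half on its own
            }
        }
        return true;
    }

    // Turn the numeric value of a reference into the chars to pass on.
    static String expandCharRef(int value) {
        if (value >= 0xD800 && value <= 0xDFFF) {
            throw new IllegalArgumentException("reference to a surrogate");
        }
        if (value >= 0x10000 && value <= 0x10FFFF) {
            int x = value - 0x10000;
            return new String(new char[] {
                (char) (0xD800 + (x >> 10)),
                (char) (0xDC00 + (x & 0x3FF))
            });
        }
        boolean legal = value == 0x9 || value == 0xA || value == 0xD
            || (value >= 0x20 && value <= 0xD7FF)
            || (value >= 0xE000 && value <= 0xFFFD);
        if (!legal) {
            throw new IllegalArgumentException(
                "not a legal Char: " + Integer.toHexString(value));
        }
        return String.valueOf((char) value);
    }
}
```

For example, expandCharRef(0x10000) gives the two chars D800 DC00, while
expandCharRef(0xD800) is rejected.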

I think, but I'm not certain, that this encoding only applies to UTF-16
and not UCS-2 (which would mean that the surrogate characters are an
error if encountered in a UCS-2 stream).  Can anyone confirm/deny this?

-- Richard

xml-dev: A list for W3C XML Developers. To post, mailto:xml-dev at ic.ac.uk
Archived as: http://www.lists.ic.ac.uk/hypermail/xml-dev/