Character Encoding Detection

Fri May 8 22:06:31 BST 1998

	Thanks Chris!  That led me down the path of enlightenment.

On Fri, 8 May 1998, Chris Maden wrote:

> UCS-2 is identical to UTF-16, and so it is subject (presumably) to the
> same rule.

	Ahh, I hadn't read the details of the UTF-16 encoding.  So, as I
understand it now, UTF-16 only matters for UCS-4 (not UCS-2 ??)
documents: it transforms a UCS-4 document into a UCS-2 document that
uses a reserved code range (rows D8-DB and DC-DF of the BMP) to
represent the extra characters, and only UCS-4 characters below
0011 0000 (hex) can be mapped.  A normal UCS-2 document would not use
these reserved codes (??), so in that respect a UTF-16 encoded UCS-4
document is not identical to a "native" UCS-2 document.  What this means
is that you can only tell a UTF-16 document from a UCS-2 document by the
encoding declaration, or by encountering a character in the range that
UCS-2 reserves for UTF-16.  And both UCS-2 and UTF-16 docs must start with
a byte order mark.  Is this byte order mark, #xFEFF, specific to XML
documents, or is it part of UCS?
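
	To make that check concrete, here is a rough sketch in Python of
how I picture it: look for the byte order mark, then scan the two-octet
units for anything in the reserved D800-DFFF range.  The function name
sniff_16bit is my own invention, and the sketch assumes the input is one
of the two-octet forms only (it does not look at the encoding declaration
or at any other encoding).

```python
def sniff_16bit(data: bytes):
    """Return (byte_order, has_surrogates) for two-octet encoded bytes.

    byte_order is "big", "little", or None when no byte order mark
    (#xFEFF) is found at the start.  Hypothetical helper, for
    illustration only.
    """
    if data[:2] == b"\xfe\xff":
        order = "big"        # FE FF: most significant octet first
    elif data[:2] == b"\xff\xfe":
        order = "little"     # FF FE: least significant octet first
    else:
        return None, False   # no BOM, so do not guess further

    body = data[2:]
    # Walk the body two octets at a time, looking for reserved codes.
    for i in range(0, len(body) - 1, 2):
        unit = int.from_bytes(body[i:i + 2], order)
        if 0xD800 <= unit <= 0xDFFF:
            return order, True   # reserved range seen: UTF-16, not plain UCS-2
    return order, False
```

If the second value comes back False, the bytes alone do not say whether
the document is UCS-2 or UTF-16; only the encoding declaration would.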

> As a side note, I was unsure until just now whether they were
> equivalent, but I finally found ISO 10646-1 clause 8:
> 
>    Plane 00 of Group 00 shall be the Basic Multilingual Plane (BMP).
>    The BMP can be used as a two-octet coded character set in which
>    case it shall be called UCS-2.

	Without reading the whole document (ugh, I see it coming), I didn't
think this statement had anything to do with UTF; I thought it was just
explaining how UCS-2 is a subset of UCS-4.

Though I think the following does state what you say:

"UCS Transformation Format 16 (UTF-16)" at
http://www.stonehand.com/unicode/standard/wg2n1035.html

> The following method transforms the coded representation of over a
> million graphic characters of UCS-4 into a form that is compatible with
> the two-octet BMP form of UCS-2 (section 14.1). This permits the
> coexistence of those UCS-4 characters within coded character data that
> is in accordance with UCS-2. 
> 
> In UTF-16 each graphic character from the UCS-2 repertoire retains its
> UCS-2 coded representation. In addition, the coded representation of any
> character from a single contiguous block of 16 Planes in Group 00
> (1,048,576 code positions) is transformed to pairs of two-octet
> sequences, where each sequence corresponds to a cell in a single
> contiguous block of 8 Rows in the BMP (2,048 code positions). These
> codes are reserved for the use of this transformation method, and shall
> not be allocated for any other purpose. 
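
	The transformation quoted above can be sketched as plain
arithmetic.  This is only my own illustration in Python, not code from the
standard: the function names are made up, and it assumes the usual split
of the 8 reserved rows into D800-DBFF for the first unit of a pair and
DC00-DFFF for the second.

```python
def to_surrogates(cp: int):
    """Split a code position in 0x10000..0x10FFFF into two two-octet units."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("code position is outside the 16 mappable planes")
    offset = cp - 0x10000                # 20-bit value, 0..0xFFFFF
    high = 0xD800 + (offset >> 10)       # top 10 bits, rows D8-DB
    low = 0xDC00 + (offset & 0x3FF)      # bottom 10 bits, rows DC-DF
    return high, low


def from_surrogates(high: int, low: int) -> int:
    """Recombine a pair of reserved units into the original code position."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid pair of reserved units")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)
```

Ten bits in each half gives 1,024 x 1,024 = 1,048,576 pairs, which matches
the 16 planes of code positions, and 2 x 1,024 = 2,048 reserved cells,
which matches the 8 rows of the BMP mentioned in the quote.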

	As for the goal of it being relatively easy for the desperate Perl
hacker, or myself, the desperate Java hacker, to code an XML parser...it's
not that bad.  It helps if you come into it with prior knowledge of stuff
like character encodings.  I think all I would have changed in the spec is
to not allow parameter entities (PEs) _within_ markup declarations in the
external DTD; I still have to sort that out if I ever want to make my
parser validating.  I might also have been tempted to not allow recursive
PEs.  They could have left that out for 1.0 and added it for 1.1, but now
you have to be
backward compatible.  I think if you find yourself needing that kind of
thing, SGML might be the language for you.

---
Chris Hubick
mailto:chris at hubick.com
http://www.hubick.com/

xml-dev: A list for W3C XML Developers. To post, mailto:xml-dev at ic.ac.uk
Archived as: http://www.lists.ic.ac.uk/hypermail/xml-dev/
List coordinator, Henry Rzepa (mailto:rzepa at ic.ac.uk)