Character Encoding Detection
maillist at chris.hubick.com
Fri May 8 07:41:02 BST 1998
I am new to character encodings, and am trying to implement support
for them in my XML parser.
As I understand it, UCS has two flavors, UCS-2 and UCS-4, either of which
can optionally have a UCS transformation format (UTF) applied to it. It is
my understanding that you could author an XML document in either of these
without applying a transformation.
The UTF-16 spec at:
"In UTF-16, any UCS character from the BMP shall be represented by
its UCS-2 coded representation."
Now in UCS-2:
'<' is 00 3C
'?' is 00 3F
So the start of a UCS-2 or UTF-16 encoded XML document would be 00 3C 00 3F.
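The byte values above can be checked with a small Python sketch. It assumes that Python's "utf-16-be" codec is an acceptable stand-in for untransformed big-endian UCS-2, which only holds for BMP characters such as '<' and '?'. The helper name first_bytes is hypothetical.

```python
def first_bytes(text: str) -> str:
    """Return the big-endian 16-bit encoding of text as spaced hex bytes.

    Assumption: "utf-16-be" (no byte order mark) matches plain UCS-2
    for characters in the BMP, which is all this sketch is used for.
    """
    return text.encode("utf-16-be").hex(" ").upper()


# The first two characters of an XML declaration.
print(first_bytes("<?"))
```

Run on "<?", this prints the four bytes 00 3C 00 3F, with no byte order mark in front of them.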
In the section on autodetection of character encodings, the XML spec
states: "00 3C 00 3F: UTF-16, big-endian, no Byte Order Mark (and thus,
strictly speaking, in error)".
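For illustration, a parser could sniff the first bytes of a document roughly as follows. This is only a sketch under assumptions, not the spec's algorithm: the function name sniff_xml_encoding and the returned labels are hypothetical, only the 00 3C 00 3F case is quoted above, and the remaining cases (the FE FF and FF FE byte order marks, the little-endian pattern, and the ASCII-compatible pattern) are filled in from the byte values of "<?xm" and are not the complete list.

```python
def sniff_xml_encoding(prefix: bytes) -> str:
    """Guess an encoding family from the first bytes of an XML document.

    Hypothetical helper: covers only a few 16-bit and 8-bit cases.
    """
    # A byte order mark signals the 16-bit byte order explicitly.
    if prefix.startswith(b"\xfe\xff"):
        return "UTF-16, big-endian"
    if prefix.startswith(b"\xff\xfe"):
        return "UTF-16, little-endian"
    # No byte order mark: fall back on the byte pattern of "<?".
    if prefix.startswith(b"\x00\x3c\x00\x3f"):
        return "UTF-16, big-endian, no BOM"
    if prefix.startswith(b"\x3c\x00\x3f\x00"):
        return "UTF-16, little-endian, no BOM"
    # "<?xm" in an ASCII-compatible 8-bit encoding.
    if prefix.startswith(b"\x3c\x3f\x78\x6d"):
        return "UTF-8 or another ASCII-compatible encoding"
    return "unknown"


doc = '<?xml version="1.0"?>'.encode("utf-16-be")  # this codec writes no BOM
print(sniff_xml_encoding(doc))
```

A document that starts directly with 00 3C 00 3F lands in the "no BOM" branch, which is the case the quoted sentence calls an error.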
My question is, why is this an error rather than a perfectly
acceptable untransformed UCS-2 document?
<?xml version="1.0" encoding="ISO-10646-UCS-2"?>
mailto:chris at hubick.com