Character Encoding Detection

Rick Jelliffe ricko at allette.com.au
Sat May 9 02:23:50 BST 1998


From: Chris Hubick <maillist at chris.hubick.com>

>Now in UCS-2:
> '<' is 00 3C
> '?' is 00 3f
>
>So the start of a UCS-2 or UTF-16 encoded XML document would be 00 3C 00
>3F

Not always. When storing a double-octet value (i.e. two 8-bit bytes),
different CPUs use different byte orders: big-endian and little-endian.
So if you read a big-endian file as a sequence of bytes you will get
00 3C 00 3F, but if you read a little-endian file as a sequence of bytes
you will get 3C 00 3F 00.
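
To see the two byte sequences side by side, here is a minimal Python
sketch. It relies only on Python's built-in "utf-16-be" and "utf-16-le"
codecs; the variable names are just for illustration.

```python
# Encode the two characters "<?" as 16-bit code units in each byte order.
text = "<?"

big_endian = text.encode("utf-16-be")     # most significant byte first
little_endian = text.encode("utf-16-le")  # least significant byte first

print(big_endian.hex())     # 003c003f
print(little_endian.hex())  # 3c003f00
```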

This is why a binary file from an Intel machine will typically need to be
byte-swapped for use on a Motorola 68K machine. I believe the Motorola
PowerPC CPU has an instruction to switch between little-endian and
big-endian reading modes. When you are storing 4-octet data there are 4
possible arrangements, but in fact only 3 seem to have been used: the
PDP-11 used one, Motorola used another, Intel used another.

To cope with this, network protocols and binary file formats will usually
specify a particular byte order for storing or transferring 16- or 32-bit
data.
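
As a rough illustration of fixing the byte order in the format rather
than leaving it to the CPU, the sketch below uses Python's standard
struct module, where "!" selects network (big-endian) order and "<"
selects little-endian order. The value and variable names are only
examples.

```python
import struct

value = 0x003C  # the 16-bit code for '<'

network_order = struct.pack("!H", value)  # "!" = network (big-endian) order
little_endian = struct.pack("<H", value)  # "<" = little-endian order

# A reader that knows the agreed order recovers the value on any CPU.
(decoded,) = struct.unpack("!H", network_order)
```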

As an alternative, Unicode (UTF-16) reserves two code values near the high
end of the codespace: U+FEFF, the Byte Order Mark (BOM), and its
byte-swapped counterpart U+FFFE, which is never a valid character. From
whichever of these appears at the start of the file, it is possible to
determine whether the file is saved as big-endian or little-endian.
Unicode software will then resequence incoming bytes correctly to
construct the 16-bit values.
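
A minimal sketch of that check in Python, assuming the data is already in
memory as bytes. The function name utf16_byte_order is made up for this
example; the BOM constants come from the standard codecs module. It only
looks for a UTF-16 BOM and ignores every other encoding.

```python
import codecs

def utf16_byte_order(data: bytes):
    """Return 'big', 'little', or None if no UTF-16 BOM starts the data."""
    if data.startswith(codecs.BOM_UTF16_BE):  # bytes FE FF
        return "big"
    if data.startswith(codecs.BOM_UTF16_LE):  # bytes FF FE
        return "little"
    return None
```

A reader would call this on the first two bytes of a file and then decode
the rest with the matching byte order.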

The XML character encoding strategy is to encourage you, wherever possible,
to make sure that all the information needed to parse your document can be
marked up in the document. Operating systems typically just provide a
"text" type (.txt, Mac "TEXT") under the assumption, which was OK in the
West when everyone had standalone computers, that all "text" on a computer
would use the same character set or encoding. So, because operating systems
and protocols fail in this key regard, XML provides its own heuristic for
determining the encoding of text. This heuristic is reliable, in the sense
that if you use it, your data is clearly labelled--it takes the guesswork
out of character detection.

The heuristic is basically this:

1) If the operating system or transport protocol tells you the character
set, and if that information is reliable, use it.
2) If there is a BOM, then your data is Unicode.
3) Otherwise, use the XML declaration (i.e., the
<?xml version="1.0" encoding="xxx"?>) to determine the encoding (this is a
little bit complex, but straightforward).
4) Otherwise, it is UTF-8 (7-bit ASCII conforms to UTF-8 already).
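
The four steps above can be sketched in Python roughly as follows. This
is a simplified illustration, not the full procedure from the XML spec:
the function name guess_xml_encoding and the "external" parameter are
invented for the example, and step 3 only handles declarations written
in an ASCII-compatible encoding (UTF-16 without a BOM, EBCDIC and the
like need the extra byte-pattern checks the spec describes).

```python
import codecs
import re

# Matches encoding="..." inside an ASCII-compatible XML declaration.
_DECL = re.compile(
    rb'^<\?xml[^>]*?encoding\s*=\s*["\']([A-Za-z][A-Za-z0-9._-]*)["\']'
)

def guess_xml_encoding(data, external=None):
    """Guess the encoding of an XML document held in a bytes object."""
    # 1) Trust a reliable label from the OS or transport protocol.
    if external:
        return external
    # 2) A UTF-16 BOM means the data is Unicode.
    if data.startswith((codecs.BOM_UTF16_BE, codecs.BOM_UTF16_LE)):
        return "utf-16"
    # 3) Otherwise look for an encoding declaration.
    match = _DECL.match(data)
    if match:
        return match.group(1).decode("ascii")
    # 4) Otherwise assume UTF-8.
    return "utf-8"
```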

The draft RFC for the MIME type text/xml will be made public soon. It gives
some more policy on these issues.

>In the section on autodetection of character encodings the XML spec
>states "00 3C 00 3F: UTF-16, big-endian, no Byte Order Mark (and thus,
>strictly speaking, in error)"
>
> My question is, why is this an error rather than a perfectly
>acceptable untransformed UCS-2 document?
>
><?xml version="1.0" encoding="ISO-10646-UCS-2"?>


The reason that the encoding declaration and BOM are not mandatory was
to allow legacy documents and software systems to be used. But there seems
to be a general consensus among the XML SIG/WG people that all new XML
documents should have either a BOM or an explicit XML declaration
(with the encoding attribute).

As part of the XML-development process, some of us asked the question:
"What are the intrinsic properties of text?", with the idea that if
something was intrinsic ("prime metadata") then XML should provide a
standard way of representing it. The things we came up with were notation
(i.e. XML, including version and treatment of white-space via xml:space),
encoding, and language (i.e., xml:lang). All these things are fundamental
to any parsing of text files in a world wide web. (I think that written
script is also an intrinsic, but the ISO standard for script codes was not
finished, and the RFC 1766 language format was thought to be flexible
enough for most needs.)

Software developers should rise to this challenge. When you write out an XML
file, make sure you write out the appropriate XML header: don't treat it as
an option but as a necessity. And if you are writing UTF-16/Unicode tools,
always write out the BOM as the first character. With XML we have the chance
to get out of the character set and encoding maze: not by being forced
to use a single encoding, but by disconnecting encoding from document
character set (i.e. ISO 10646) and by clearly labelling which encoding is
being used.
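
As one possible way to follow this advice, here is a small Python sketch
that writes both the BOM and the XML declaration. The helper name
serialize_utf16 is hypothetical; the sketch leans on the fact that
Python's "utf-16" codec prepends a BOM in the machine's native byte order
when encoding.

```python
import codecs

def serialize_utf16(body: str) -> bytes:
    """Return body as UTF-16 bytes with a BOM and an XML declaration."""
    declaration = '<?xml version="1.0" encoding="UTF-16"?>\n'
    # The "utf-16" codec emits the BOM first, then the encoded text.
    return (declaration + body).encode("utf-16")

data = serialize_utf16("<greeting>hello</greeting>")

# The first two bytes are a BOM, whichever byte order this machine uses.
assert data[:2] in (codecs.BOM_UTF16_BE, codecs.BOM_UTF16_LE)
```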

Rick Jelliffe


xml-dev: A list for W3C XML Developers. To post, mailto:xml-dev at ic.ac.uk
Archived as: http://www.lists.ic.ac.uk/hypermail/xml-dev/
To (un)subscribe, mailto:majordomo at ic.ac.uk the following message;
(un)subscribe xml-dev
To subscribe to the digests, mailto:majordomo at ic.ac.uk the following message;
subscribe xml-dev-digest
List coordinator, Henry Rzepa (mailto:rzepa at ic.ac.uk)



