Character Encoding and the XML PR (was Re: PR.xml)
David Megginson
ak117 at freenet.carleton.ca
Fri Jan 16 16:41:07 GMT 1998
Peter Murray-Rust writes:
> Thanks. I am also aware of it now :-). Can I make the assumption that:
>
> - ISO-8859-1 and UTF-8 look identical to not-very-experienced humans.
They look identical to most English speakers, but differ in their
treatment of accented characters (> 0x7f), so French and German
speakers probably notice.
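As a rough illustration (a Python sketch added for clarity, not taken from the PR or from AElfred; the helper name encoded_bytes is made up), the same text can be encoded both ways and the bytes compared:

```python
# Illustration only: compare the bytes each encoding produces.

def encoded_bytes(text):
    """Return (ISO-8859-1 bytes, UTF-8 bytes) for the same text."""
    return text.encode("iso-8859-1"), text.encode("utf-8")

# Pure ASCII text: both encodings give identical bytes.
latin1, utf8 = encoded_bytes("XML")
print(latin1 == utf8)            # True

# Accented text: e-acute is one byte (0xe9) in ISO-8859-1
# but two bytes (0xc3 0xa9) in UTF-8.
latin1, utf8 = encoded_bytes("café")
print(latin1.hex(), utf8.hex())  # 636166e9 636166c3a9

# Reading ISO-8859-1 bytes as if they were UTF-8 is what goes wrong.
try:
    latin1.decode("utf-8")
except UnicodeDecodeError as err:
    print("not valid UTF-8:", err.reason)
```

The last step is the kind of failure a UTF-8 parser reports when it meets an undeclared ISO-8859-1 document.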
> - in principle I should be able to sort this by adding something like
>
> <?xml version="1.0" encoding="ISO-8859-1"?>
> to the top of the document
Correct. The other alternative is to configure your web server to
send the encoding ISO-8859-1 in the HTTP header for this document
(assuming the text/xml MIME type is approved), but the problem will
reappear if you download the file and then parse it on your own system.
> - in practice this fails because by the time it gets to the encoding
> declaration it has already assumed the encoding is UTF-8 and has crashed :-)
It should not fail with AElfred -- I just downloaded the PR and added
your XML declaration to the top, and AElfred reported no errors.
In fact, the XML declaration is guaranteed to use only ASCII
characters, which are the same in UTF-8 and ISO-8859-*. AElfred is
very careful not to read too far into the document until it
has discovered whether there is an explicit encoding declaration.
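The idea can be sketched in a few lines of Python. This is only an illustration of the approach, not AElfred's actual code: the function name, the 200-byte limit, and the regular expression are all assumptions, and it covers only ASCII-compatible encodings such as UTF-8 and ISO-8859-*.

```python
import re

# Hypothetical pattern: an XML declaration at the very start of the
# entity, with an encoding pseudo-attribute. It works on raw bytes,
# which is safe because the declaration itself is pure ASCII.
_DECL = re.compile(
    rb'^<\?xml[^>]*?\bencoding\s*=\s*["\']([A-Za-z][A-Za-z0-9._-]*)["\']'
)

def sniff_declared_encoding(data, default="utf-8"):
    """Return the declared encoding name, or the default if none is found."""
    head = data[:200]  # arbitrary limit: look only at the first bytes
    match = _DECL.match(head)
    if match is None:
        return default
    return match.group(1).decode("ascii")

doc = b'<?xml version="1.0" encoding="ISO-8859-1"?><p>caf\xe9</p>'
encoding = sniff_declared_encoding(doc)
print(encoding)              # ISO-8859-1
print(doc.decode(encoding))  # decoded only after the encoding is known
```

Nothing beyond the ASCII-only declaration is decoded until the encoding is known, so the 0xe9 byte never reaches a UTF-8 decoder.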
> I am not quite clear why we need this problem. Do different tools emit
> different encodings? If so, what should I work with? Can I convert this
> document?
ISO-8859-1, which is used for most web pages, contains characters only
for Western European languages. UTF-8 can encode any Unicode
character up to 0xffff (and a little higher with surrogates), so it can
handle Kanji, Han Chinese, Arabic, etc. The PR rightly specifies that
any entity that begins with neither an encoding declaration nor a
byte-order mark (for UCS-2) should be assumed to be encoded in UTF-8.
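That default rule can be written down as a short Python sketch. It is a simplification under stated assumptions: the function name is invented, it handles only the 16-bit byte-order mark, and it ignores encoding declarations and UCS-4 entirely.

```python
import codecs

def guess_encoding_without_declaration(data):
    """Pick an encoding for an entity that has no encoding declaration.

    Simplified: a 16-bit byte-order mark means UCS-2/UTF-16, and
    anything else is assumed to be UTF-8. UCS-4 is not handled.
    """
    if data.startswith((codecs.BOM_UTF16_BE, codecs.BOM_UTF16_LE)):
        # Python's "utf-16" codec reads the mark to choose the byte
        # order and drops it from the decoded text.
        return "utf-16"
    return "utf-8"

big_endian = codecs.BOM_UTF16_BE + "<doc/>".encode("utf-16-be")
print(guess_encoding_without_declaration(big_endian))  # utf-16
print(guess_encoding_without_declaration(b"<doc/>"))   # utf-8
print(big_endian.decode("utf-16"))                     # <doc/>
```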
Conversion should be fairly simple -- take a look at the AElfred
source to see how the different encodings are constructed. Just for
the record, AElfred accepts the following encodings, and to my
knowledge, supports them completely and correctly to the extent
allowed by Java's 16-bit characters and by surrogates:
- UTF-8
- ISO-10646-UCS-2 (both byte orders)
- ISO-10646-UCS-4 (four byte orders)
- UTF-16
- ISO-8859-1
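For the conversion itself, a minimal Python sketch (not derived from the AElfred source; the function name is hypothetical) is to decode the bytes as ISO-8859-1 and encode the result as UTF-8:

```python
def latin1_to_utf8(data):
    """Re-encode ISO-8859-1 bytes as UTF-8 bytes.

    Every byte value is a valid ISO-8859-1 character, so the decode
    step cannot fail; whether the result is meaningful depends on the
    input really being ISO-8859-1.
    """
    return data.decode("iso-8859-1").encode("utf-8")

print(latin1_to_utf8(b"caf\xe9"))  # b'caf\xc3\xa9'
```

This changes only the bytes. If the document carries an encoding declaration naming ISO-8859-1, that declaration still has to be updated or removed separately.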
All the best,
David
--
David Megginson ak117 at freenet.carleton.ca
Microstar Software Ltd. dmeggins at microstar.com
http://home.sprynet.com/sprynet/dmeggins/
xml-dev: A list for W3C XML Developers. To post, mailto:xml-dev at ic.ac.uk
Archived as: http://www.lists.ic.ac.uk/hypermail/xml-dev/
To (un)subscribe, mailto:majordomo at ic.ac.uk the following message;
(un)subscribe xml-dev
To subscribe to the digests, mailto:majordomo at ic.ac.uk the following message;
subscribe xml-dev-digest
List coordinator, Henry Rzepa (mailto:rzepa at ic.ac.uk)