Encoding detection again ...
msabin at cromwellmedia.co.uk
Tue Mar 2 12:07:28 GMT 1999
I've been browsing through the archives for an
answer to this question, but I haven't been able
to find anything that seems to give a completely
unambiguous answer ...
Appendix F of the spec says that given a document
starting with the 4 octet sequence,
00 3C 00 3F
I'm to infer BOM-less big-endian UTF-16, and
given a document starting with,
3C 00 3F 00
I'm to infer BOM-less little-endian UTF-16.
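For concreteness, the inference described above can be sketched in Python as follows. This is only a minimal illustration covering the two BOM-less octet patterns quoted here, not the full Appendix F table; the function name and the returned labels are assumptions made for the example.

```python
from typing import Optional


def sniff_bomless_utf16(prefix: bytes) -> Optional[str]:
    """Guess an encoding from the first four octets of a document.

    Hypothetical helper: handles only the two BOM-less patterns
    quoted above and returns None for anything else.
    """
    first4 = prefix[:4]
    if first4 == b"\x00\x3c\x00\x3f":  # 00 3C 00 3F
        return "UTF-16BE"
    if first4 == b"\x3c\x00\x3f\x00":  # 3C 00 3F 00
        return "UTF-16LE"
    return None


# "<?" serialised as 16-bit code units yields exactly these patterns.
big = "<?xml".encode("utf-16-be")
little = "<?xml".encode("utf-16-le")
print(sniff_bomless_utf16(big), sniff_bomless_utf16(little))
```

Note that the check looks only at the octets for "<?", both of which are BMP characters, so nothing in it can tell a UTF-16 document from a UCS-2 one.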
What I want to know is: why could these
sequences not equally represent (respectively)
big-endian UCS-2 or little-endian UCS-2? In
other words, surely these octet sequences are
ambiguous, and hence the encoding should be
resolved definitively with either,
<?xml version="1.0" encoding="UTF-16"?>
<?xml version="1.0" encoding="ISO-10646-UCS-2"?>
or an appropriate MIME header, i.e.,
Content-type: text/xml; charset="utf-16"
Content-type: text/xml; charset="ISO-10646-UCS-2"
Just so there's no confusion ... I'm assuming:
1. Unicode == UTF-16
2. UCS-2 != UTF-16 (because UCS-2 lacks UTF-16's
support for characters outside the BMP).
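Assumption 2 can be illustrated with a short Python sketch. The helper name is made up for the example, and Python's "utf-16-be" codec stands in for a UTF-16 serialiser. A BMP character occupies a single 16-bit code unit, which is the same value UCS-2 would use, while a character outside the BMP needs a surrogate pair that UCS-2 has no way to interpret as one character.

```python
from typing import List


def utf16_code_units(text: str) -> List[int]:
    """Return the 16-bit code units of text in big-endian UTF-16.

    Hypothetical helper for illustration only.
    """
    data = text.encode("utf-16-be")
    return [int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2)]


# BMP characters: one code unit each, identical in UCS-2 and UTF-16.
print([hex(u) for u in utf16_code_units("<?")])

# U+10000 is the first character outside the BMP: UTF-16 spends two
# code units (a surrogate pair) on it.
print([hex(u) for u in utf16_code_units("\U00010000")])
```

So the two encodings differ only once a document contains a character beyond U+FFFF, which the first four octets of an XML declaration never do.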
Miles Sabin Cromwell Media
Internet Systems Architect 5/6 Glenthorne Mews
+44 (0)181 410 2230 London, W6 0LJ
msabin at cromwellmedia.co.uk England
xml-dev: A list for W3C XML Developers. To post, mailto:xml-dev at ic.ac.uk
Archived as: http://www.lists.ic.ac.uk/hypermail/xml-dev/ and on CD-ROM/ISBN 981-02-3594-1
List coordinator, Henry Rzepa (mailto:rzepa at ic.ac.uk)