Encoding detection again ...
Miles Sabin
msabin at cromwellmedia.co.uk
Wed Mar 3 12:12:30 GMT 1999
David Brownell wrote,
> Put it this way: if you assume UTF-16, you're
> safe either way because UTF-16 is a superset.
Err ... is that true?
Maybe I'm being a bit obsessive about my
interpretation of the various standards docs, but
as far as I can see UCS-2 isn't a subset of
UTF-16. The BMP S-zone codes (D800-DFFF) are
undefined but reserved in UCS-2, and so should
not occur in a purportedly UCS-2 stream. I would
expect a processor which encountered such codes to
either,
1. Spit out an error and give up.
or,
2. Quietly ignore them and continue processing
with the next 2 octets.
Obviously these codes are defined and legal
in UTF-16, so an incorrect assumption of UTF-16
when the stream was in fact broken UCS-2 would
produce unpredictably incorrect behaviour (i.e.
the processor might continue processing a broken
doc in an indeterminate way).
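The two behaviours listed above can be sketched roughly as follows. This is only an illustration, assuming a big-endian stream with no BOM; the function name decode_ucs2_be and its strict flag are made up for the example and don't come from any spec or library.

```python
import struct

# The BMP S-zone, reserved (but undefined) in UCS-2.
S_ZONE_FIRST, S_ZONE_LAST = 0xD800, 0xDFFF


def decode_ucs2_be(data: bytes, strict: bool = True) -> str:
    """Decode big-endian 2-octet code units as UCS-2 (hypothetical helper)."""
    if len(data) % 2:
        raise ValueError("odd number of octets in a 2-octet stream")
    chars = []
    for (unit,) in struct.iter_unpack(">H", data):
        if S_ZONE_FIRST <= unit <= S_ZONE_LAST:
            if strict:
                # Option 1: spit out an error and give up.
                raise ValueError(f"S-zone code {unit:04X} in a UCS-2 stream")
            # Option 2: quietly ignore it and continue with the next 2 octets.
            continue
        chars.append(chr(unit))
    return "".join(chars)
```

A UTF-16 decoder handed the same octets would instead try to combine S-zone codes into pairs, which is where the two readings of a broken stream part company.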
In any case, on a less finickety note, I'd quite
like to be able to compute string lengths UCS-2
style where that's appropriate, because
byte-length/2 is a bit simpler than the UTF-16
equivalent ;-)
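To make the difference concrete, here is a rough sketch of the two length computations. It assumes big-endian octets and, for the UTF-16 case, a well-formed stream; both function names are invented for the example.

```python
def ucs2_length(data: bytes) -> int:
    # Every UCS-2 character occupies exactly 2 octets.
    return len(data) // 2


def utf16_be_length(data: bytes) -> int:
    # Count 2-octet code units, then subtract one for each trailing
    # (low) surrogate, DC00-DFFF, so that a surrogate pair counts as
    # a single character. Assumes the pairs are well formed.
    units = len(data) // 2
    trailing = sum(
        1 for i in range(0, len(data) - 1, 2) if 0xDC <= data[i] <= 0xDF
    )
    return units - trailing
```

For example, the octets 00 61 D8 00 DC 00 00 62 give a UCS-2 length of 4 but a UTF-16 length of 3, and the UTF-16 version has to look at every code unit to find that out.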
Anyway, here's a slightly updated version of a
proposal I mailed to Tim Bray yesterday ...
In the absence of an appropriate MIME header
the octet sequences,
1. FE FF
2. FF FE
3. 00 3C 00 3F
4. 3C 00 3F 00
may be inferred to be,
1. big-endian indeterminately encoded 2 octet
characters.
2. little-endian indeterminately encoded 2 octet
characters.
3. BOM-less big-endian indeterminately encoded 2
octet characters.
4. BOM-less little-endian indeterminately encoded
2 octet characters.
If either of the following PIs is found,
<?xml version="1.0" ?>
<?xml version="1.0" encoding="UTF-16"?>
or, in cases (1) and (2), if *no* PI is found,
then encoding is resolved to UTF-16. Otherwise
if,
<?xml version="1.0" encoding="ISO-10646-UCS-2"?>
is found then encoding is resolved to UCS-2.
This isn't very complicated and isn't a zillion miles away
from the current handling of UTF-8 vs. ISO 8859-x
vs. US-ASCII.
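For what it's worth, the proposal above might be sketched like this. It is a sketch only: the names sniff_two_octet and resolve_encoding are invented, the regular expression accepts just the declaration forms listed above, and returning None for anything the proposal doesn't cover (including a BOM-less stream with some other PI) is an assumption of the example.

```python
import re
from typing import Optional, Tuple

# Only the declaration forms from the proposal are recognised here.
XML_DECL = re.compile(
    r'<\?xml\s+version="1\.0"(?:\s+encoding="([^"]+)")?\s*\?>'
)


def sniff_two_octet(data: bytes) -> Optional[Tuple[str, bool]]:
    """Return (byte order, has BOM) for the four octet sequences, else None."""
    if data[:2] == b"\xFE\xFF":
        return ("big", True)
    if data[:2] == b"\xFF\xFE":
        return ("little", True)
    if data[:4] == b"\x00\x3C\x00\x3F":
        return ("big", False)
    if data[:4] == b"\x3C\x00\x3F\x00":
        return ("little", False)
    return None


def resolve_encoding(data: bytes) -> Optional[str]:
    """Resolve to "UTF-16" or "UCS-2", or None if undetermined."""
    sniffed = sniff_two_octet(data)
    if sniffed is None:
        return None
    order, has_bom = sniffed
    body = data[2:] if has_bom else data
    codec = "utf-16-be" if order == "big" else "utf-16-le"
    # The declaration is plain ASCII, so it reads the same whether the
    # stream later turns out to be UCS-2 or UTF-16.
    head = body[:200].decode(codec, errors="replace")
    match = XML_DECL.match(head)
    if match is None:
        # Cases (1) and (2) with no PI resolve to UTF-16.
        return "UTF-16" if has_bom else None
    declared = match.group(1)
    if declared is None or declared == "UTF-16":
        return "UTF-16"
    if declared == "ISO-10646-UCS-2":
        return "UCS-2"
    return None  # some other encoding name: outside the proposal
```

A caller would check for a MIME header first and only fall back to resolve_encoding when there isn't one.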
Cheers,
Miles
--
Miles Sabin Cromwell Media
Internet Systems Architect 5/6 Glenthorne Mews
+44 (0)181 410 2230 London, W6 0LJ
msabin at cromwellmedia.co.uk England
xml-dev: A list for W3C XML Developers. To post, mailto:xml-dev at ic.ac.uk
Archived as: http://www.lists.ic.ac.uk/hypermail/xml-dev/ and on CD-ROM/ISBN 981-02-3594-1
To (un)subscribe, mailto:majordomo at ic.ac.uk the following message;
(un)subscribe xml-dev
To subscribe to the digests, mailto:majordomo at ic.ac.uk the following message;
subscribe xml-dev-digest
List coordinator, Henry Rzepa (mailto:rzepa at ic.ac.uk)