Encoding detection again ...

Miles Sabin msabin at cromwellmedia.co.uk
Wed Mar 3 12:12:30 GMT 1999


David Brownell wrote,
> Put it this way:  if you assume UTF-16, you're
> safe either way because UTF-16 is a superset.

Err ... is that true?

Maybe I'm being a bit obsessive about my 
interpretation of the various standards docs, but 
as far as I can see UCS-2 isn't a subset of
UTF-16. The BMP S-zone codes (D800-DFFF) are 
undefined but reserved in UCS-2, and so should 
not occur in a purportedly UCS-2 stream. I would 
expect a processor which encountered such codes to
either,

1. Spit out an error and give up.

or,

2. Quietly ignore them and continue processing 
   with the next 2 octets.

Obviously these codes are defined and legal
in UTF-16, so an incorrect assumption of UTF-16
when the stream was in fact broken UCS-2 would
produce unpredictably incorrect behaviour (i.e.
the processor might continue processing a broken
doc in an indeterminate way).
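
To make the distinction concrete, here is a minimal
Python sketch (not taken from any real parser; the
name find_s_zone_units is made up for illustration).
It scans a stream of 2 octet units and reports where
S-zone codes occur. A strict UCS-2 processor could
use any hit as the trigger for option 1 or 2 above,
whereas a UTF-16 processor would go on to pair the
units up.

```python
import struct


def find_s_zone_units(data: bytes, big_endian: bool = True) -> list:
    """Return the byte offsets of 2 octet units in D800-DFFF.

    Hypothetical helper: assumes the byte order is already known
    and that the data holds a whole number of 2 octet units.
    """
    if len(data) % 2:
        raise ValueError("odd number of octets")
    fmt = (">" if big_endian else "<") + "H"
    offsets = []
    for off in range(0, len(data), 2):
        (unit,) = struct.unpack_from(fmt, data, off)
        if 0xD800 <= unit <= 0xDFFF:
            offsets.append(off)
    return offsets


# U+10000 needs two S-zone units in UTF-16, so a strict
# UCS-2 reader would object at byte offsets 2 and 4 here.
sample = "A\U00010000B".encode("utf-16-be")
print(find_s_zone_units(sample))
```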

In any case, on a less finickety note, I'd quite
like to be able to compute string lengths UCS-2
style where that's appropriate, because
byte-length/2 is a bit simpler than the UTF-16
equivalent ;-)
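
As a rough illustration of the difference, the
Python sketch below compares the two length
computations. Both function names are hypothetical,
and the UTF-16 version assumes unpaired S-zone units
simply count as one character each, which a real
processor might prefer to reject.

```python
def ucs2_length(data: bytes) -> int:
    """Character count for UCS-2: one character per 2 octets."""
    return len(data) // 2


def utf16_length(data: bytes, big_endian: bool = True) -> int:
    """Character count for UTF-16: a high/low pair counts once."""
    order = "big" if big_endian else "little"
    count = 0
    off = 0
    while off + 1 < len(data):
        unit = int.from_bytes(data[off:off + 2], order)
        off += 2
        # A high unit (D800-DBFF) followed by a low unit
        # (DC00-DFFF) forms a single character.
        if 0xD800 <= unit <= 0xDBFF and off + 1 < len(data):
            nxt = int.from_bytes(data[off:off + 2], order)
            if 0xDC00 <= nxt <= 0xDFFF:
                off += 2
        count += 1
    return count


data = "A\U00010000B".encode("utf-16-be")
print(ucs2_length(data), utf16_length(data))
```

The UCS-2 count needs only the byte length; the
UTF-16 count has to walk the data.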

Anyway, here's a slightly updated version of a 
proposal I mailed to Tim Bray yesterday ...

In the absence of an appropriate MIME header
the octet sequences,

1. FE FF 
2. FF FE
3. 00 3C 00 3F
4. 3C 00 3F 00

may be inferred to be,

1. big-endian indeterminately encoded 2 octet
   characters.

2. little-endian indeterminately encoded 2 octet
   characters.

3. BOM-less big-endian indeterminately encoded 2 
   octet characters.

4. BOM-less little-endian indeterminately encoded 
   2 octet characters.
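
The four cases above could be sketched in Python
roughly as follows. The function name and the
(byte order, BOM present) return shape are
assumptions made for the example, and only the four
octet sequences listed here are handled.

```python
def sniff_two_octet(prefix: bytes):
    """Map the leading octets to (byte_order, has_bom).

    Returns None when none of the four sequences match.
    Hypothetical helper covering only the cases listed above.
    """
    if prefix.startswith(b"\xfe\xff"):
        return ("big", True)       # case 1
    if prefix.startswith(b"\xff\xfe"):
        return ("little", True)    # case 2
    if prefix.startswith(b"\x00\x3c\x00\x3f"):
        return ("big", False)      # case 3: "<?" without BOM
    if prefix.startswith(b"\x3c\x00\x3f\x00"):
        return ("little", False)   # case 4: "<?" without BOM
    return None


print(sniff_two_octet("<?xml".encode("utf-16-le")))
```

Note that this step only settles byte order; whether
the 2 octet characters are UCS-2 or UTF-16 is still
open at this point.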

If either of the following PIs is found,

  <?xml version="1.0" ?>
  <?xml version="1.0" encoding="UTF-16"?>

or, in cases (1) and (2), if *no* PI is found,
then encoding is resolved to UTF-16. Otherwise 
if,

  <?xml version="1.0" encoding="ISO-10646-UCS-2"?>

is found then encoding is resolved to UCS-2.
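
A minimal Python sketch of this resolution step
follows. It assumes the start of the document has
already been decoded as 2 octet characters using the
inferred byte order, with any BOM stripped. The
function name, the simplified regular expression and
the None result for anything the proposal doesn't
cover are all assumptions made for the example.

```python
import re

# Simplified pattern: version="1.0" with an optional encoding
# name. Other attributes (e.g. standalone) are not handled.
_DECL = re.compile(
    r'^<\?xml\s+version\s*=\s*["\']1\.0["\']'
    r'(?:\s+encoding\s*=\s*["\']([A-Za-z][A-Za-z0-9._-]*)["\'])?'
    r'\s*\?>'
)


def resolve_encoding(head: str, has_bom: bool):
    """Return "UTF-16", "UCS-2", or None if unresolved."""
    m = _DECL.match(head)
    if m is None:
        # No PI found: UTF-16 only in the BOM cases (1) and (2).
        return "UTF-16" if has_bom else None
    name = m.group(1)
    if name is None or name.upper() == "UTF-16":
        return "UTF-16"
    if name.upper() == "ISO-10646-UCS-2":
        return "UCS-2"
    return None  # some other encoding name: outside this proposal


print(resolve_encoding(
    '<?xml version="1.0" encoding="ISO-10646-UCS-2"?>', False))
```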

This isn't very complicated and isn't a zillion
miles away from the current handling of UTF-8 vs.
ISO 8859-x vs. US-ASCII.

Cheers,


Miles

-- 
Miles Sabin                          Cromwell Media
Internet Systems Architect           5/6 Glenthorne Mews
+44 (0)181 410 2230                  London, W6 0LJ
msabin at cromwellmedia.co.uk           England





