UTF-8
Richard Emberson
emberson at faslab.com
Sat Oct 17 00:50:29 BST 1998
Does the UTF-8 encoding require that the minimum byte count
be used when a character is encoded?
Recall that the form of a UTF-8 encoding is:
0xxxxxxx
110xxxxx 10xxxxxx
1110xxxx 10xxxxxx 10xxxxxx
11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
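To make the bit layout above concrete, here is a minimal Python sketch that packs a value into a chosen number of bytes using those patterns. The names encode_fixed and PAYLOAD_BITS are made up for illustration, and the helper deliberately does not check that the chosen length is the shortest possible one.

```python
# Payload bits (the x's) available in each of the six forms listed above.
PAYLOAD_BITS = {1: 7, 2: 11, 3: 16, 4: 21, 5: 26, 6: 31}


def encode_fixed(value: int, nbytes: int) -> bytes:
    """Pack value into exactly nbytes using the patterns above.

    Hypothetical helper: no check is made that nbytes is the minimum.
    """
    if nbytes not in PAYLOAD_BITS:
        raise ValueError("nbytes must be between 1 and 6")
    if value < 0 or value >> PAYLOAD_BITS[nbytes]:
        raise ValueError("value does not fit in the requested length")
    if nbytes == 1:
        return bytes([value])          # 0xxxxxxx
    out = []
    # Emit the trailing bytes first, six payload bits each (10xxxxxx).
    for _ in range(nbytes - 1):
        out.append(0x80 | (value & 0x3F))
        value >>= 6
    # Lead byte: nbytes one-bits, a zero bit, then the remaining payload.
    lead_marker = (0xFF << (8 - nbytes)) & 0xFF
    out.append(lead_marker | value)
    return bytes(reversed(out))
```

For example, encode_fixed(0xE9, 2) gives the bytes C3 A9, which is the usual two-byte form of #xE9.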
So one could, for example, claim that:
00111111
and
11000000 10111111
represent the same character, #x3F, or
11110001 10111111 10111111 10111111
and
11111000 10000001 10111111 10111111 10111111
represent #x7FFFF (note: #x10000 < #x7FFFF < #x10FFFF, and so it is legal).
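The claim can be checked mechanically. The following is only a sketch, with made-up names (decode_raw, min_length, is_overlong): it decodes a single sequence purely by bit pattern, with no shortest-form check, and then compares the sequence length against the minimum needed for the decoded value. Whether a parser ought to reject the longer forms is exactly the question being asked; the code only shows how the two cases can be told apart.

```python
def decode_raw(seq: bytes) -> int:
    """Decode one sequence by pattern only (no shortest-form check)."""
    lead = seq[0]
    if lead < 0x80:
        n, value = 1, lead
    else:
        # The number of leading 1 bits in the lead byte gives the length.
        n = 0
        while n < 8 and lead & (0x80 >> n):
            n += 1
        if n < 2 or n > 6:
            raise ValueError("invalid lead byte")
        value = lead & (0x7F >> n)     # keep only the payload bits
    if len(seq) != n:
        raise ValueError("wrong sequence length")
    for b in seq[1:]:
        if b & 0xC0 != 0x80:           # trailing bytes must be 10xxxxxx
            raise ValueError("invalid continuation byte")
        value = (value << 6) | (b & 0x3F)
    return value


def min_length(value: int) -> int:
    """Smallest number of bytes whose payload bits can hold value."""
    for n, bits in ((1, 7), (2, 11), (3, 16), (4, 21), (5, 26), (6, 31)):
        if value < (1 << bits):
            return n
    raise ValueError("value too large for a 6-byte sequence")


def is_overlong(seq: bytes) -> bool:
    """True if seq uses more bytes than its decoded value requires."""
    return len(seq) > min_length(decode_raw(seq))


short_3f = bytes([0b00111111])
long_3f = bytes([0b11000000, 0b10111111])
four_byte = bytes([0b11110001, 0b10111111, 0b10111111, 0b10111111])
five_byte = bytes([0b11111000, 0b10000001, 0b10111111, 0b10111111, 0b10111111])
```

With these definitions, short_3f and long_3f both decode to #x3F, and four_byte and five_byte both decode to #x7FFFF; is_overlong is true only for long_3f and five_byte.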
The reason I ask is whether an XML parser has to worry about
5- and 6-byte UTF-8 encodings, or whether it can *always* assume that the
values represented by such encodings are not legal Unicode characters.
Thanks.
Richard Emberson
xml-dev: A list for W3C XML Developers. To post, mailto:xml-dev at ic.ac.uk
Archived as: http://www.lists.ic.ac.uk/hypermail/xml-dev/
To (un)subscribe, mailto:majordomo at ic.ac.uk the following message;
(un)subscribe xml-dev
To subscribe to the digests, mailto:majordomo at ic.ac.uk the following message;
subscribe xml-dev-digest
List coordinator, Henry Rzepa (mailto:rzepa at ic.ac.uk)