SAX/C++: UTF-8 v UTF-16
James Clark
jjc at jclark.com
Fri Dec 3 04:49:44 GMT 1999
David Megginson wrote:
> 4. Hold my nose and use UTF-8 rather than UTF-16, for compatibility
> with most existing C++ code.
I would say there is at least as much C++ code using UTF-16 as using
UTF-8. On Windows at least, UTF-16 is much more common. The DOM mandates
UTF-16, so if SAX mandated UTF-8 there would be an unfortunate mismatch.
This is a tough one, because there's a lot more diversity in the C++
world. My preference would be not to mandate either UTF-8 or UTF-16
exclusively. There are lots of apps using UTF-8 and there are lots of
apps using UTF-16; if you exclude either, then a lot of apps will take a
major performance/convenience hit. Expat allows a choice at compile time
between UTF-8 and UTF-16, and there are big projects using both (e.g. Perl
uses UTF-8 and Mozilla uses UTF-16).
There are a couple of possible solutions:
1. A lo-tech solution. Provide a SAXChar typedef, and define everything
in terms of SAXChar. SAXChar gets typedefed to either char or unsigned
short depending on whether SAX_UNICODE is defined or not. It's up to
implementations to decide whether to support both or just one, and up to
clients to decide whether to work with both or to require one.
A variation on this is to allow both UTF-8 and UTF-16 variants to exist
in a single library. To do this, you can do something along the lines
of
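    // UTF-16 variant: names are arrays of 16-bit code units.
    class AttributeList16 {
    public:
      virtual const unsigned short *getName(int pos) = 0;
    };

    // UTF-8 variant: names are arrays of bytes.
    class AttributeList8 {
    public:
      virtual const char *getName(int pos) = 0;
    };

    // The unsuffixed name selects one variant at compile time.
    #ifdef SAX_UNICODE
    typedef AttributeList16 AttributeList;
    #else
    typedef AttributeList8 AttributeList;
    #endif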
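
For comparison, the plain SAXChar typedef described at the start of option 1 might look roughly like the sketch below. Only SAXChar and SAX_UNICODE come from the text above; the helper saxLength, the DocumentHandler rendering, and CountingHandler are hypothetical names used for illustration, not part of any SAX/C++ proposal.

```cpp
#include <cassert>
#include <cstddef>

// Define SAX_UNICODE before this point to select UTF-16 code units.
#ifdef SAX_UNICODE
typedef unsigned short SAXChar;   // one UTF-16 code unit
#else
typedef char SAXChar;             // one UTF-8 code unit
#endif

// Hypothetical helper: length of a zero-terminated SAXChar string,
// counted in code units (not in characters).
inline std::size_t saxLength(const SAXChar *s) {
    assert(s != 0);
    std::size_t n = 0;
    while (s[n] != 0) ++n;
    return n;
}

// Hypothetical interface, declared once purely in terms of SAXChar.
class DocumentHandler {
public:
    virtual ~DocumentHandler() {}
    virtual void characters(const SAXChar *ch, int length) = 0;
};

// Client code written against SAXChar compiles in either mode,
// as long as it never assumes a particular code-unit width.
class CountingHandler : public DocumentHandler {
public:
    CountingHandler() : total_(0) {}
    virtual void characters(const SAXChar *ch, int length) {
        assert(ch != 0 && length >= 0);
        total_ += static_cast<std::size_t>(length);
    }
    std::size_t total() const { return total_; }
private:
    std::size_t total_;   // code units seen so far
};
```

The trade-off is that a given build of the library supports only one encoding, so a client and the library must agree on whether SAX_UNICODE is defined.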
2. A hi-tech solution. Do what the Standard C++ library does and make
the interface a template in the character type. This is the cleanest
solution, but lots of C++ projects eschew templates on portability
grounds.
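
A minimal sketch of the template approach, assuming an interface parameterized on the code-unit type in the way std::basic_string is. BasicAttributeList and VectorAttributeList are hypothetical names, and the vector-backed storage is just one possible implementation chosen to keep the example self-contained.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// The interface is written once, as a template in the code-unit type.
template <class CharT>
class BasicAttributeList {
public:
    virtual ~BasicAttributeList() {}
    virtual int getLength() const = 0;
    virtual const CharT *getName(int pos) const = 0;
};

// Hypothetical implementation that stores zero-terminated names.
template <class CharT>
class VectorAttributeList : public BasicAttributeList<CharT> {
public:
    // Copies the name, including its terminating zero.
    void add(const CharT *name) {
        std::size_t n = 0;
        while (name[n] != 0) ++n;
        names_.push_back(std::vector<CharT>(name, name + n + 1));
    }
    virtual int getLength() const {
        return static_cast<int>(names_.size());
    }
    virtual const CharT *getName(int pos) const {
        assert(pos >= 0 && pos < getLength());
        return &names_[pos][0];
    }
private:
    std::vector<std::vector<CharT> > names_;
};

// Both encodings can coexist in one program as distinct instantiations.
typedef BasicAttributeList<char> AttributeList8;             // UTF-8
typedef BasicAttributeList<unsigned short> AttributeList16;  // UTF-16
```

With this shape, the 8-bit and 16-bit variants of the typedef approach fall out as two instantiations of the same source, at the cost of requiring a compiler with working template support.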
If you feel that one needs to be mandated, I would pick UTF-16.
James
xml-dev: A list for W3C XML Developers. To post, mailto:xml-dev at ic.ac.uk
Archived as: http://www.lists.ic.ac.uk/hypermail/xml-dev/ and on CD-ROM/ISBN 981-02-3594-1