expat and encodings

Clark Cooper coopercl at sch.ge.com
Wed Dec 16 14:02:36 GMT 1998

Expat always passes XML_Char strings to your handlers. These are either
UTF-8, UTF-16 stored as wchar_t, or UTF-16 stored as unsigned short,
depending on the definition of XML_UNICODE and XML_UNICODE_WCHAR_T when
you compile the library and your program. It provides no option to change
the encoding of the strings you receive. This is a wise design choice, since
only the application knows what to do with characters that don't map from
Unicode to the encoding you want to receive. (Even if the document is in
ISO-8859-1, it can contain character references (&#8254;) or references
to external entities that are in a different encoding.)

You can force the parser to assume a particular encoding by passing a
non-null encoding name string to XML_ParserCreate. Normally, however, you
should pass it a NULL pointer so that the parser recognizes and uses the
XML encoding declaration.

If you were using Perl and the XML::Parser Perl module built on top of expat,
I could recommend one of the Unicode modules at CPAN (Comprehensive Perl
Archive Network) to help you map from UTF-8 to whatever encoding you need.
Even if you aren't using Perl, you can download one of these to see how to
build your own C function to do encoding mapping.
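
As a rough idea of what such a C function could look like, here is a
sketch that maps UTF-8 (as delivered by an expat built with the default
XML_Char) to ISO-8859-1. The function name utf8_to_latin1 and the choice of
substituting a caller-supplied fallback byte for unmappable characters are
assumptions of this example. Your application may well want to do something
else with them, which is exactly why expat leaves the decision to you.

```c
#include <stddef.h>

/* Hypothetical helper: convert NUL-terminated UTF-8 to ISO-8859-1.
   Code points above U+00FF have no ISO-8859-1 equivalent and are replaced
   by `fallback`. Returns the number of bytes written (not counting the
   terminating NUL), or (size_t)-1 if the input is malformed or the output
   buffer is too small. This sketch does not reject overlong encodings. */
static size_t utf8_to_latin1(const char *in, unsigned char *out,
                             size_t out_size, unsigned char fallback)
{
    const unsigned char *p = (const unsigned char *)in;
    size_t n = 0;

    if (out_size == 0)
        return (size_t)-1;

    while (*p != 0) {
        unsigned long cp;
        int extra;                      /* continuation bytes still expected */

        if (*p < 0x80)                { cp = *p;        extra = 0; }
        else if ((*p & 0xE0) == 0xC0) { cp = *p & 0x1F; extra = 1; }
        else if ((*p & 0xF0) == 0xE0) { cp = *p & 0x0F; extra = 2; }
        else if ((*p & 0xF8) == 0xF0) { cp = *p & 0x07; extra = 3; }
        else return (size_t)-1;         /* stray continuation or bad lead byte */
        p++;

        while (extra-- > 0) {
            if ((*p & 0xC0) != 0x80)
                return (size_t)-1;      /* truncated or malformed sequence */
            cp = (cp << 6) | (*p & 0x3F);
            p++;
        }

        if (n + 1 >= out_size)
            return (size_t)-1;          /* need room for this byte and the NUL */
        out[n++] = (cp <= 0xFF) ? (unsigned char)cp : fallback;
    }

    out[n] = 0;
    return n;
}
```

Called from a character data handler, a function like this would turn the
UTF-8 bytes C3 A9 into the single byte E9, and would replace a character
such as the euro sign (bytes E2 82 AC) with the fallback byte.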

Clark Cooper    Logic Technology Inc.		cccooper at ltionline.com
(518) 385-8380  650 Franklin St., Suite 304	coopercl at sch.ge.com
		Schenectady,  NY 12305

xml-dev: A list for W3C XML Developers. To post, mailto:xml-dev at ic.ac.uk
Archived as: http://www.lists.ic.ac.uk/hypermail/xml-dev/
To (un)subscribe, mailto:majordomo at ic.ac.uk the following message;
(un)subscribe xml-dev
To subscribe to the digests, mailto:majordomo at ic.ac.uk the following message;
subscribe xml-dev-digest
List coordinator, Henry Rzepa (mailto:rzepa at ic.ac.uk)
