arbitrary characters in XML document?

David Brownell david-b at pacbell.net
Thu Sep 2 22:40:15 BST 1999


Cliff Draper wrote:
> 
> Hi,  I have a question about dealing with multiple character sets.
> 
> I have an application where I want to store data in XML and retrieve
> it later.  Now a good chunk of the data I want to store is coming
> straight from the user and I have little control over exactly which
> character set the user is using.  One of my users apparently tried
> using 0x98 + 0x03 as an accented 'e'; I have no idea which character
> set he used (and I don't care),

You should.  Arbitrary binary garbage isn't necessarily going to
be legal -- as happened in this case -- and even if it chances to
be legal, it's likely to come out as something that wasn't intended.

Coming out as an error diagnostic is a useful outcome ... hidden
mangling of data is as likely, and causes severe problems later on.
A diagnostic lets you fix the problems early, before they get bad.


>	 but I still want to be able to store
> it and parse it later.  When I parse it with expat with an
> encoding="UTF-8", it complains that it's not well-formed.

Probably because it isn't.
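
A minimal sketch of that failure, assuming Python's standard-library
expat binding (xml.parsers.expat) stands in for however expat is
called in the application. The <name> element and the is_well_formed
helper are made up for illustration; only the two bytes come from the
question.

```python
import xml.parsers.expat


def is_well_formed(document: bytes) -> bool:
    """Return True if expat accepts the document, False if it reports an error."""
    parser = xml.parsers.expat.ParserCreate()
    try:
        parser.Parse(document, True)  # True marks this as the final chunk
    except xml.parsers.expat.ExpatError:
        return False
    return True


header = b'<?xml version="1.0" encoding="UTF-8"?>'

# The two bytes from the question, dropped into the document unchanged.
# 0x98 is a UTF-8 continuation byte with no lead byte in front of it.
bad = header + b"<name>" + bytes([0x98, 0x03]) + b"</name>"

# An accented 'e' (U+00E9) that really is encoded as UTF-8.
good = header + b"<name>" + "\u00e9".encode("utf-8") + b"</name>"

print(is_well_formed(bad))   # False
print(is_well_formed(good))  # True
```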


> Any ideas?

Don't permit arbitrary binary data into your text.  Ensure you know
what character encoding was used, and make sure that you either
transform that encoding to the one you're using, or switch to using
that encoding.
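
The first option might look roughly like the sketch below, again in
Python.  It assumes the application has already learned the user's
encoding somehow; ISO-8859-1 is only an example, since the encoding in
the original question is unknown.  The to_utf8_xml helper and the
<name> element are hypothetical.

```python
from xml.sax.saxutils import escape


def to_utf8_xml(raw: bytes, source_encoding: str) -> bytes:
    """Decode user input from a known encoding and emit a UTF-8 XML document."""
    # Strict decoding: bytes that are not valid in source_encoding raise
    # UnicodeDecodeError here instead of being silently mangled.
    text = raw.decode(source_encoding)
    # Escape markup characters (&, <, >) so the text is safe as element content.
    body = "<name>" + escape(text) + "</name>"
    return b'<?xml version="1.0" encoding="UTF-8"?>' + body.encode("utf-8")


# Assumed input: "Renee" with an accented 'e', typed in ISO-8859-1 (byte 0xE9).
doc = to_utf8_xml(b"Ren\xe9e", "iso-8859-1")
print(doc)
```

Transcoding does not make every character acceptable: control
characters such as U+0003 are not legal in XML 1.0 in any encoding, so
they would still need to be rejected or handled separately.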

- Dave



> thanks,
> -Cliff Draper
>  cliffwd at forte.com
> 
> xml-dev: A list for W3C XML Developers. To post, mailto:xml-dev at ic.ac.uk
> Archived as: http://www.lists.ic.ac.uk/hypermail/xml-dev/ and on CD-ROM/ISBN 981-02-3594-1
> To (un)subscribe, mailto:majordomo at ic.ac.uk the following message;
> (un)subscribe xml-dev
> To subscribe to the digests, mailto:majordomo at ic.ac.uk the following message;
> subscribe xml-dev-digest
> List coordinator, Henry Rzepa (mailto:rzepa at ic.ac.uk)
