How best to represent unrepresentable characters in NAME tokens?
jjc at jclark.com
Tue Nov 4 12:31:17 GMT 1997
Andrew Greene wrote:
> If you have a Unicode-friendly XML environment, then users can create
> elements whose GIs or attribute names contain "interesting"
> characters. (Yes? A NAME token can contain "BaseChars", which includes
> characters beyond ASCII and even beyond Latin-1.)
> So, if a user requests that the document instance be saved as an ASCII
> file, what is the best way for a Unicode-aware and standards-compliant
> application to represent these characters?
I would use numeric character references wherever XML allows them; if
there are non-ASCII characters in places where numeric character
references aren't allowed I would use UTF-8 and give a warning to the
user. The ASCII characters will still be there as ASCII, and the
non-ASCII characters won't get lost, although they will look a bit funny
in an 8-bit editor. An interesting case is when there are non-ASCII
characters in places where numeric character references are not
recognized but do not cause an error (e.g. PIs, comments); one could have
an application convention that recognizes numeric character references
in these cases.
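
To illustrate the first part of that, here is a minimal Python sketch of writing character data or an attribute value as ASCII using numeric character references. The function name is hypothetical. It relies on Python's standard "xmlcharrefreplace" error handler, which emits decimal references, and it assumes the input is content where XML allows references (not element or attribute names, PIs, or comments). It does not escape "&" or "<".

```python
def to_ascii_with_char_refs(text: str) -> str:
    """Return `text` as ASCII, with each non-ASCII character replaced
    by a decimal numeric character reference such as &#223;.

    Hypothetical helper: only meant for character data and attribute
    values, where XML recognizes numeric character references.
    """
    # "xmlcharrefreplace" is a standard codec error handler that
    # substitutes &#NNN; for characters the target encoding lacks.
    return text.encode("ascii", "xmlcharrefreplace").decode("ascii")


print(to_ascii_with_char_refs("Straße"))  # Stra&#223;e
```

ASCII characters pass through unchanged, so a file that is already ASCII is not altered by this step.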
> 2. Rename all the offending elements and attributes, and use PIs to
> ensure that when they're read back in we can patch things up.
> So, for example, the file could contain:
> <?GoodCitizen MangledGI Strae1="Straße"?>
> <Strae1>foo bar</Strae1>
> Advantages: It's fully compliant.
If I was going to do this sort of thing, I think I would use a variation
on URL % encoding. I would have a convention that underscore (say)
followed by 4 hex digits represented the Unicode character with that hex
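
A rough Python sketch of the underscore convention mentioned above, under stated assumptions: the function names are hypothetical, a literal underscore is itself escaped (as _005F) so that decoding is unambiguous, and characters above U+FFFF are rejected because they do not fit in 4 hex digits. None of these details come from the message itself.

```python
import re

# Matches an underscore followed by exactly four hex digits.
_ESCAPE = re.compile(r"_([0-9A-Fa-f]{4})")


def mangle_name(name: str) -> str:
    """Encode non-ASCII characters in a name as _XXXX (4 hex digits).

    Assumption: a literal "_" is also encoded, so every underscore in
    the output starts an escape sequence.
    """
    out = []
    for ch in name:
        cp = ord(ch)
        if cp > 0xFFFF:
            raise ValueError(f"U+{cp:X} does not fit in 4 hex digits")
        if ch == "_" or cp > 0x7F:
            out.append(f"_{cp:04X}")
        else:
            out.append(ch)
    return "".join(out)


def unmangle_name(mangled: str) -> str:
    """Reverse mangle_name by decoding each _XXXX sequence."""
    return _ESCAPE.sub(lambda m: chr(int(m.group(1), 16)), mangled)


print(mangle_name("Straße"))        # Stra_00DFe
print(unmangle_name("Stra_00DFe"))  # Straße
```

Unlike the PI-based renaming, the mapping is carried by the name itself, so no side table is needed to restore the original on reading the file back in.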
xml-dev: A list for W3C XML Developers. To post, mailto:xml-dev at ic.ac.uk
Archived as: http://www.lists.ic.ac.uk/hypermail/xml-dev/
List coordinator, Henry Rzepa (mailto:rzepa at ic.ac.uk)