How best to represent unrepresentable characters in NAME tokens?

Mon Nov 3 20:27:01 GMT 1997

If you have a Unicode-friendly XML environment, then users can create
elements whose GIs or attribute names contain "interesting"
characters. (Yes? A NAME token can contain "BaseChar" characters, which
include characters beyond ASCII and even beyond Latin-1.)

So, if a user requests that the document instance be saved as an ASCII
file, what is the best way for a Unicode-aware and standards-compliant
application to represent these characters? It's not legal to say

   <Stra&szlig;e>

and the user may already have an element type called "Strasse" so it
would be inappropriate to "reduce" it. [I chose this example because
it is easy to describe in email; the problem is much more difficult
if, instead of German, the user has used Cyrillic or Hebrew NAMEs.]
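
To make the question concrete, here is a minimal Python sketch of the
test an application would have to run before saving. The helper name
`representable` is hypothetical, not part of any existing tool.

```python
def representable(name: str, encoding: str = "ascii") -> bool:
    """Return True if every character of `name` can be written in `encoding`."""
    try:
        name.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


print(representable("Strasse"))                 # True
print(representable("Stra\u00dfe"))             # False
print(representable("Stra\u00dfe", "latin-1"))  # True
```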

I've thought of three solutions:

1. It's an error. Tell the user "Sorry, your file could not be saved
   in that character encoding because the element name 'StraBe' could
   not be represented."

   Advantages: It's fully compliant and no data can get lost.

   Disadvantages: No data can get out, either. Perhaps the user has
   an 8-bit app to massage the data in a particular way, and she
   doesn't want to rename all her elements.

2. Rename all the offending elements and attributes, and use PIs to
   ensure that when they're read back in we can patch things up.
   So, for example, the file could contain:

   <?GoodCitizen MangledGI Strae1="Stra&#x00DF;e"?>
   <Strae1>foo bar</Strae1>

   Advantages: It's fully compliant.

   Disadvantages: It assumes that all other processing applications
   will be nice and won't lose my processing instructions, and it
   makes the file hard to read. It's also non-portable; unless we
   as a community decide on a "semi-standard" PI to use, no one else 
   will know how to interpret this convention. (On the other hand, 
   this is exactly why I'm bringing the issue up here. Maybe we can 
   all agree on a semi-standard and I'll feel less uneasy about
   doing something like this....)

3. Violate the standard and use character references to represent the
   ineffable, for example:

   <Stra&#xDF;e>foo bar</Stra&#xDF;e>

   Advantages: It's compact and unambiguous (even if it's illegal :-).

   Disadvantages: It violates both XML and 8879 in a new and perverse
   way. The user's file will not be usable by any other piece of 
   standards-compliant software. That's worse than refusing to write
   the file at all (number 1).
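
A rough sketch of how number 2 might work, in Python. The function
names are hypothetical, the PI target and layout simply copy the
"GoodCitizen MangledGI" example above, and the rule for choosing a
stand-in name (drop the offending characters, append a counter) is
just one assumption among many possible ones.

```python
def mangle_names(names, encoding="ascii"):
    """Map each name that cannot be encoded to a unique stand-in that can."""
    def fits(s):
        try:
            s.encode(encoding)
            return True
        except UnicodeEncodeError:
            return False

    # Names that are already safe; a stand-in must never collide with them.
    taken = {n for n in names if fits(n)}
    mapping = {}
    for name in names:
        if fits(name) or name in mapping:
            continue
        # Assumed rule: keep the encodable characters as the stem.
        stem = "".join(ch for ch in name if fits(ch)) or "_"
        # A name must not start with a digit, hyphen or full stop.
        if not (stem[0].isalpha() or stem[0] == "_"):
            stem = "_" + stem
        n = 1
        while f"{stem}{n}" in taken:
            n += 1
        mapping[name] = f"{stem}{n}"
        taken.add(mapping[name])
    return mapping


def mangling_pi(mapping):
    """Build one PI per mangled name; non-ASCII characters of the
    original are written in hex character-reference notation."""
    lines = []
    for original, mangled in mapping.items():
        escaped = "".join(
            ch if ord(ch) < 128 else f"&#x{ord(ch):04X};" for ch in original
        )
        lines.append(f'<?GoodCitizen MangledGI {mangled}="{escaped}"?>')
    return lines


mapping = mangle_names(["Strasse", "Stra\u00dfe"])
print(mangling_pi(mapping)[0])
# <?GoodCitizen MangledGI Strae1="Stra&#x00DF;e"?>
```

Note that a parser does not expand references inside a PI, so the
reading application would have to decode the "&#x...;" notation itself
before restoring the original names.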

My questions to the assembled multitudes are:

* Is there a need for a "semi-standard" solution to this problem, or am
  I the only one struggling with it?

* Is there interest in adopting some variation of number 2 so that we're
  better able to exchange such data?

* I can't help but think that number 3 would be the most elegant solution
  if it were only legal. Yet I'm also sure that the XML committee had a 
  good reason for disallowing it. I'd be interested in hearing what their
  reason was, so that I may become enlightened. :-)

Thanks for your thoughts,
  Andrew Greene

xml-dev: A list for W3C XML Developers. To post, mailto:xml-dev at ic.ac.uk
Archived as: http://www.lists.ic.ac.uk/hypermail/xml-dev/
To (un)subscribe, mailto:majordomo at ic.ac.uk the following message;
(un)subscribe xml-dev
To subscribe to the digests, mailto:majordomo at ic.ac.uk the following message;
subscribe xml-dev-digest
List coordinator, Henry Rzepa (mailto:rzepa at ic.ac.uk)