How best to represent unrepresentable characters in NAME tokens?
Andrew Greene
agreene at bitstream.com
Mon Nov 3 20:27:01 GMT 1997
If you have a Unicode-friendly XML environment, then users can create
elements whose GIs or attribute names contain "interesting"
characters. (Yes? A NAME token can contain "BaseChars", which includes
characters beyond ASCII and even beyond Latin-1.)
So, if a user requests that the document instance be saved as an ASCII
file, what is the best way for a Unicode-aware and standards-compliant
application to represent these characters? It's not legal to say
<Stra&sz;e>
and the user may already have an element type called "Strasse" so it
would be inappropriate to "reduce" it. [I chose this example because
it is easy to describe in email; the problem is much more difficult
if, instead of German, the user has used Cyrillic or Hebrew NAMEs.]
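To make the problem concrete, here is a minimal sketch of how an application might detect which characters in a name cannot be written in the requested encoding. The function name unencodable_chars is made up for illustration, and the check relies only on Python's built-in codecs; it is not taken from any XML toolkit.

```python
def unencodable_chars(name, encoding="ascii"):
    """Return the characters of `name` that `encoding` cannot represent."""
    bad = []
    for ch in name:
        try:
            ch.encode(encoding)
        except UnicodeEncodeError:
            # This character has no representation in the target encoding.
            bad.append(ch)
    return bad


print(unencodable_chars("Straße"))             # ['ß']
print(unencodable_chars("Straße", "latin-1"))  # []
print(unencodable_chars("Strasse"))            # []
```

An empty result means the name can be saved as-is; anything else forces a choice among the kinds of solutions listed next.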
I've thought of three solutions:
1. It's an error. Tell the user "Sorry, your file could not be saved
in that character encoding because the element name 'Straße' could
not be represented."
Advantages: It's fully compliant and no data can get lost.
Disadvantages: No data can get out, either. Perhaps the user has
an 8-bit app to massage the data in a particular way, and she
doesn't want to rename all her elements.
2. Rename all the offending elements and attributes, and use PIs to
ensure that when they're read back in we can patch things up.
So, for example, the file could contain:
<?GoodCitizen MangledGI Strae1="Straße"?>
<Strae1>foo bar</Strae1>
Advantages: It's fully compliant.
Disadvantages: It assumes that all other processing applications
will be nice and won't lose my processing instructions, and it
makes the file hard to read. It's also non-portable; unless we
as a community decide on a "semi-standard" PI to use, no one else
will know how to interpret this convention. (On the other hand,
this is exactly why I'm bringing the issue up here. Maybe we can
all agree on a semi-standard and I'll feel less uneasy about
doing something like this....)
3. Violate the standard and use character references to represent the
ineffable, for example:
<Stra&#xDF;e>foo bar</Stra&#xDF;e>
Advantages: It's compact and unambiguous (even if it's illegal :-).
Disadvantages: It violates both XML and ISO 8879 in a new and perverse
way. The user's file will not be usable by any other piece of
standards-compliant software. That's worse than refusing to write
the file at all (number 1).
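For what it's worth, the renaming in number 2 could be sketched roughly as follows. Everything here is an assumption made for illustration: the helper names, the rule of dropping the unrepresentable characters and appending a counter, and the decision to write the original name inside the PI in a &#xHH;-style notation. An XML parser does not expand character references inside a PI, so the reading application would have to decode that notation itself.

```python
def mangle_names(names, taken=()):
    """Map each non-ASCII name to a unique ASCII stand-in (hypothetical scheme)."""
    # ASCII names already in the document must not be reused as stand-ins.
    used = set(taken) | {n for n in names if n.isascii()}
    mapping = {}
    for name in names:
        if name.isascii():
            continue
        stem = "".join(ch for ch in name if ch.isascii())
        # A name must start with a letter or underscore; pad if it would not.
        if not stem or not (stem[0].isalpha() or stem[0] == "_"):
            stem = "n" + stem
        counter = 1
        while f"{stem}{counter}" in used:
            counter += 1
        mangled = f"{stem}{counter}"
        used.add(mangled)
        mapping[mangled] = name
    return mapping


def escape_non_ascii(text):
    """Spell non-ASCII characters as &#xHH; so the PI itself stays ASCII."""
    return "".join(ch if ch.isascii() else f"&#x{ord(ch):X};" for ch in text)


def mangling_pi(mapping):
    """Build a PI recording mangled-name to original-name pairs."""
    pairs = " ".join(f'{m}="{escape_non_ascii(o)}"' for m, o in mapping.items())
    return f"<?GoodCitizen MangledGI {pairs}?>"


mapping = mangle_names(["Straße", "Strasse"])
print(mapping)               # {'Strae1': 'Straße'}
print(mangling_pi(mapping))  # <?GoodCitizen MangledGI Strae1="Stra&#xDF;e"?>
```

The existing "Strasse" is left alone, and the counter keeps the stand-in from colliding with any name already in use.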
My questions to the assembled multitudes are:
* Is there a need for a "semi-standard" solution to this problem, or am
I the only one struggling with it?
* Is there interest in adopting some variation of number 2 so that we're
better able to exchange such data?
* I can't help but think that number 3 would be the most elegant solution
if it were only legal. Yet I'm also sure that the XML committee had a
good reason for disallowing it. I'd be interested in hearing what their
reason was, so that I may become enlightened. :-)
Thanks for your thoughts,
Andrew Greene
xml-dev: A list for W3C XML Developers. To post, mailto:xml-dev at ic.ac.uk
Archived as: http://www.lists.ic.ac.uk/hypermail/xml-dev/
To (un)subscribe, mailto:majordomo at ic.ac.uk the following message;
(un)subscribe xml-dev
To subscribe to the digests, mailto:majordomo at ic.ac.uk the following message;
subscribe xml-dev-digest
List coordinator, Henry Rzepa (mailto:rzepa at ic.ac.uk)