How best to represent unrepresentable characters in NAME tokens?

Tue Nov 4 17:37:09 GMT 1997

At 2:52 PM -0500 11/3/97, Andrew Greene wrote:
>If you have a Unicode-friendly XML environment, then users can create
>elements whose GIs or attribute names contain "interesting"
>characters. (Yes? A NAME token can contain "BaseChars", which includes
>characters beyond ASCII and even beyond Latin-1.)

Sure can...

I'll give my solution at the end, but first, a few comments on the suggestions.

>So, if a user requests that the document instance be saved as an ASCII
>file, what is the best way for a Unicode-aware and standards-compliant
>application to represent these characters?

<snip>
>I've thought of three solutions:
>
>1. It's an error. Tell the user "Sorry, your file could not be saved
>   in that character encoding because the element name 'StraBe' could
>   not be represented.
>
>   Advantages: It's fully compliant and no data can get lost.
>
>   Disadvantages: No data can get out, either. Perhaps the user has
>   an 8-bit app to massage the data in a particular way, and she
>   doesn't want to rename all her elements.

This works, but isn't needed.

>2. Rename all the offending elements and attributes, and use PIs to
>   ensure that when they're read back in we can patch things up.
>   So, for example, the file could contain:
>
>   <?GoodCitizen MangledGI Strae1="Stra&#x00DF;e"?>
>   <Strae1>foo bar</Strae1>
>
>   Advantages: It's fully compliant.
>
>   Disadvantages: It assumes that all other processing applications
>   will be nice and won't lose my processing instructions, and it
>   makes the file hard to read. It's also non-portable; unless we
>   as a community decide on a "semi-standard" PI to use, no one else
>   will know how to interpret this convention. (On the other hand,
>   this is exactly why I'm bringing the issue up here. Maybe we can
>   all agree on a semi-standard and I'll feel less uneasy about
>   doing something like this....)

This is actively evil, in that it obfuscates the markup and makes it
impossible to validate against the original DTD. Validating against a DTD
at all requires a DTD translation tool to change element and attribute
names there as well. The use of PIs to affect the meaning of markup (as
opposed to enabling additional application processing that can't be
expressed in markup) is generally a bad idea. In fact, most SGML experts
concur that PIs are best used in _exceptional_ cases. The reason for this
is that applications are allowed to ignore (and usually do ignore) any PIs
that they are not specialized for.

>
>3. Violate the standard and use character entities to represent the
>   ineffable, for example:
>
>   <Stra&#0xDF;e>foo bar</Stra&#0xDF;e>
>
>   Advantages: It's compact and unambiguous (even if it's illegal :-).
>
>   Disadvantages: It violates both XML and 8879 in a new and perverse
>   way. The user's file will not be usable by any other piece of
>   standards-compliant software. That's worse than refusing to write
>   the file at all (number 1).

Yes, this is not good.
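
To make the objection concrete, here is a minimal sketch of how a conforming parser reacts to option 3. It assumes Python's standard xml.etree.ElementTree module (which is not part of this thread); the helper name parses() is made up for illustration.

```python
import xml.etree.ElementTree as ET


def parses(document: bytes) -> bool:
    """Return True if the byte string is well-formed XML (hypothetical helper)."""
    try:
        ET.fromstring(document)
        return True
    except ET.ParseError:
        return False


# Option 3: a character reference inside an element name is not well-formed.
illegal = b"<Stra&#xDF;e>foo bar</Stra&#xDF;e>"

# The same name written directly, with the document encoded as UTF-8.
legal = "<Stra\u00dfe>foo bar</Stra\u00dfe>".encode("utf-8")

print(parses(illegal))  # False
print(parses(legal))    # True
```

The second document carries the same name without any escaping, which is the direction the rest of this message argues for.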

>* Is there a need for a "semi-standard" solution to this problem, or am
>  I the only one struggling with it?

Yes, but it's already built into XML.

>* Is there interest in adopting some variation of number 2 so that we're
>  better able to exchange such data?

Not from me...

>* I can't help but think that number 3 would be the most elegant solution
>  if it were only legal. Yet I'm also sure that the XML committee had a
>  good reason for disallowing it. I'd be interested in hearing what their
>  reason was, so that I may become enlightened. :-)

Part of it is simply compatibility -- this cannot be done in SGML. The
argument about SGML compatibility is not worth rehashing here; the archive
of the working group discussions includes many messages on it.

So now that I've objected to all three solutions, you may think I'm a
negative kind of guy... But I do have a suggestion.

Support for UTF-8 is required for XML processors, so that an "8-bit" tool
can always be fed something that it can understand, even though some
strings may look funny in some editors. Since XML parsers do _not_ perform
any kind of character format normalization (e.g. of diacritical marks) each
element name will be a constant string, even if that string is not readable.
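
A rough sketch of what "constant string" means for a byte-oriented tool, assuming Python purely for illustration (the element name and the replacement name "Street" are examples, not anything prescribed by XML):

```python
import unicodedata

# An element name containing a non-ASCII character, serialized as UTF-8.
name = "Stra\u00dfe"
name_bytes = name.encode("utf-8")
document = f"<{name}>foo bar</{name}>".encode("utf-8")

# An "8-bit" tool never decodes anything: it just matches byte strings.
start_tag = b"<" + name_bytes + b">"
occurrences = document.count(start_tag)
renamed = document.replace(name_bytes, b"Street")

print(name_bytes)   # b'Stra\xc3\x9fe'
print(occurrences)  # 1
print(renamed)      # b'<Street>foo bar</Street>'

# No normalization: a precomposed e-acute and 'e' plus a combining accent
# look alike on screen but are different byte strings, hence different names.
precomposed = "caf\u00e9".encode("utf-8")
decomposed = unicodedata.normalize("NFD", "caf\u00e9").encode("utf-8")

print(precomposed)                # b'caf\xc3\xa9'
print(decomposed)                 # b'cafe\xcc\x81'
print(precomposed == decomposed)  # False
```

The flip side of the last comparison is that whoever writes the file must be consistent about which form is used; the parser will not paper over the difference.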

[[ Note for anyone who may be puzzled: UTF-8 is a clever little encoding
trick that uses variable length character codes to represent the larger
space of Unicode (and 10646) codes in 8-bit chunks. Codes < 128 represent
USASCII, and bytes of 128 and above are combined into multi-byte sequences
to represent larger values. The details (and sample code in C) can be found
at http://www.unicode.org/. So a plain ASCII file in UTF-8 looks the same,
but other characters show up as strings with leading chars >= 128. One
detail is that Latin-1 etc., are _not_ valid UTF-8 because they use the
eighth-bit high codes for single characters.]]
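
The properties in the note above can be checked in a few lines. This is only a sketch using Python's built-in codecs (an assumption of convenience, not the C sample code mentioned); is_valid_utf8() is a hypothetical helper.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if the bytes decode cleanly as UTF-8 (hypothetical helper)."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# Plain ASCII text is byte-for-byte identical in UTF-8.
ascii_text = "foo bar"
same_as_ascii = ascii_text.encode("utf-8") == ascii_text.encode("ascii")

# A character beyond ASCII becomes a sequence of bytes that are all >= 128.
sharp_s_utf8 = "\u00df".encode("utf-8")
all_high = all(byte >= 128 for byte in sharp_s_utf8)

# The Latin-1 form of the same character is one high byte, which is not
# a valid UTF-8 sequence on its own.
latin1_name = "Stra\u00dfe".encode("latin-1")

print(same_as_ascii)               # True
print(sharp_s_utf8)                # b'\xc3\x9f'
print(all_high)                    # True
print(latin1_name)                 # b'Stra\xdfe'
print(is_valid_utf8(latin1_name))  # False
```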

The core of your problem is a very good, and very real, point: writers of
XML processors need to remember that the Unicode basis of XML is
fundamental -- so conversion to another character set may fail because the
characters in a document may simply not exist in the target encoding. Of
course, for many documents, the markup will allow transcoding to Latin-1
(and other local processing codes), but this does depend on the document.
Text can be modified to use numeric character references, but this is
probably too horrible, especially for the Asian ideographic scripts.
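
As an illustration of the asymmetry between text and names when transcoding, here is a minimal sketch. It assumes Python's built-in "xmlcharrefreplace" error handler; the function names to_ascii_text() and to_ascii_name() are invented for this example, and escaping of "&" and "<" is deliberately left out.

```python
def to_ascii_text(text: str) -> bytes:
    """Transcode character data, falling back to numeric character references."""
    return text.encode("ascii", errors="xmlcharrefreplace")


def to_ascii_name(name: str) -> bytes:
    """Transcode a NAME token; there is no legal fallback, so fail loudly."""
    try:
        return name.encode("ascii")
    except UnicodeEncodeError as exc:
        raise ValueError(f"name {name!r} cannot be written in ASCII") from exc


# Character data survives, at a cost in readability and size.
print(to_ascii_text("Stra\u00dfe"))          # b'Stra&#223;e'
print(to_ascii_text("\u65e5\u672c\u8a9e"))   # b'&#26085;&#26412;&#35486;'

# The ideographic example grows from 9 bytes in UTF-8 to 24 bytes of references.
print(len("\u65e5\u672c\u8a9e".encode("utf-8")))     # 9
print(len(to_ascii_text("\u65e5\u672c\u8a9e")))      # 24

# A name with the same character has nowhere to go.
try:
    to_ascii_name("Stra\u00dfe")
except ValueError as err:
    print(err)
```

A real serializer would also have to handle attribute values and markup escaping; the point here is only that the reference trick applies to text and not to names.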

So, you can keep your 8-bit tools, but you may need UTF-8 display code to
make them maximally usable.

  -- David

_________________________________________
David Durand              dgd at cs.bu.edu  \  david at dynamicDiagrams.com
Boston University Computer Science        \  Sr. Analyst
http://www.cs.bu.edu/students/grads/dgd/   \  Dynamic Diagrams
--------------------------------------------\  http://www.dynamicDiagrams.com/
MAPA: mapping for the WWW                    \__________________________

xml-dev: A list for W3C XML Developers. To post, mailto:xml-dev at ic.ac.uk
Archived as: http://www.lists.ic.ac.uk/hypermail/xml-dev/
To (un)subscribe, mailto:majordomo at ic.ac.uk the following message;
(un)subscribe xml-dev
To subscribe to the digests, mailto:majordomo at ic.ac.uk the following message;
subscribe xml-dev-digest
List coordinator, Henry Rzepa (mailto:rzepa at ic.ac.uk)