How best to represent unrepresentable characters in NAMEtokens?

Tue Nov 4 19:20:46 GMT 1997

I'm left unclear by this response.  Suppose that I have a Java program with
an object class called "$Price" and I want to serialize this into XML.
Something such as the following is not legal XML:

<$Price>15.95</$Price>

What can I do?  One thing I could do is to avoid such names when writing
Java.  But suppose that isn't an option. I could do the following:

<OBJECT realtype="$Price">15.95</OBJECT>

But, as you say, this "obfuscates the markup, and makes it impossible to
validate against the original DTD" (in the sense that the declaration for
the OBJECT element type would be almost meaningless).
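
To make the "$Price" problem concrete: an XML name has to start with a letter, "_" or ":", and "$" is not allowed anywhere in it. Below is a rough Python sketch of that check. The function name is_probably_xml_name is made up for illustration, and the test only approximates the Name rule in the XML spec (it leans on Python's own notion of "letter" and "digit" rather than the spec's character tables), so treat it as a sanity check and not a validator.

```python
import unicodedata


def is_probably_xml_name(name: str) -> bool:
    """Approximate the XML Name rule; NOT the full production from the spec."""
    if not name:
        return False
    first, rest = name[0], name[1:]
    # A name must start with a letter, '_' or ':'.
    if not (first.isalpha() or first in "_:"):
        return False
    for ch in rest:
        # Later characters may also be digits, '.', '-' or combining marks.
        if ch.isalpha() or ch.isdigit() or ch in "_:.-":
            continue
        if unicodedata.category(ch).startswith("M"):
            continue
        return False
    return True


for candidate in ("Price", "$Price", "Straße"):
    print(candidate, is_probably_xml_name(candidate))
```

Under this approximation "Price" and "Straße" pass and "$Price" does not, which is why some mapping from Java identifiers to XML names is needed at all.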

What is the recommended solution?

--Andrew Layman
   AndrewL at microsoft.com

> -----Original Message-----
> From:	dgd at cs.bu.edu [SMTP:dgd at cs.bu.edu]
> Sent:	Tuesday, November 04, 1997 8:44 AM
> To:	xml-dev at ic.ac.uk
> Subject:	Re: How best to represent unrepresentable characters in
> NAMEtokens?
> 
> At 2:52 PM -0500 11/3/97, Andrew Greene wrote:
> >If you have a Unicode-friendly XML environment, then users can create
> >elements whose GIs or attribute names contain "interesting"
> >characters. (Yes? A NAME token can contain "BaseChars", which includes
> >characters beyond ASCII and even beyond Latin-1.)
> 
> Sure can...
> 
> I'll give my solution at the end, but first, a few comments on the
> suggestions.
> 
> >So, if a user requests that the document instance be saved as an ASCII
> >file, what is the best way for a Unicode-aware and standards-compliant
> >application to represent these characters?
> 
> <snip>
> >I've thought of three solutions:
> >
> >1. It's an error. Tell the user "Sorry, your file could not be saved
> >   in that character encoding because the element name 'Straße' could
> >   not be represented."
> >
> >   Advantages: It's fully compliant and no data can get lost.
> >
> >   Disadvantages: No data can get out, either. Perhaps the user has
> >   an 8-bit app to massage the data in a particular way, and she
> >   doesn't want to rename all her elements.
> 
> This works, but isn't needed.
> 
> >2. Rename all the offending elements and attributes, and use PIs to
> >   ensure that when they're read back in we can patch things up.
> >   So, for example, the file could contain:
> >
> >   <?GoodCitizen MangledGI Strae1="Stra&#x00DF;e"?>
> >   <Strae1>foo bar</Strae1>
> >
> >   Advantages: It's fully compliant.
> >
> >   Disadvantages: It assumes that all other processing applications
> >   will be nice and won't lose my processing instructions, and it
> >   makes the file hard to read. It's also non-portable; unless we
> >   as a community decide on a "semi-standard" PI to use, no one else
> >   will know how to interpret this convention. (On the other hand,
> >   this is exactly why I'm bringing the issue up here. Maybe we can
> >   all agree on a semi-standard and I'll feel less uneasy about
> >   doing something like this....)
> 
> This is actively evil, in that it obfuscates the markup, and makes it
> impossible to validate against the original DTD. Validating against a DTD
> at all requires a DTD translation tool to change element and attribute
> names there as well. The use of PIs to affect the meaning of markup (as
> opposed to enabling additional application processing that can't be
> expressed in markup) is generally a bad idea. In fact, most SGML experts
> concur that PIs are best used in _exceptional_ cases. The reason for this
> is that applications are allowed to ignore any PIs that they are not
> specialized for, and usually do.
> 
> >
> >3. Violate the standard and use character entities to represent the
> >   ineffable, for example:
> >
> >   <Stra&#0xDF;e>foo bar</Stra&#0xDF;e>
> >
> >   Advantages: It's compact and unambiguous (even if it's illegal :-).
> >
> >   Disadvantages: It violates both XML and 8879 in a new and perverse
> >   way. The user's file will not be usable by any other piece of
> >   standards-compliant software. That's worse than refusing to write
> >   the file at all (number 1).
> 
> Yes, this is not good.
> 
> >* Is there a need for a "semi-standard" solution to this problem, or am
> >  I the only one struggling with it?
> 
> Yes, but it's already built into XML.
> 
> >* Is there interest in adopting some variation of number 2 so that we're
> >  better able to exchange such data?
> 
> Not from me...
> 
> >* I can't help but think that number 3 would be the most elegant solution
> >  if it were only legal. Yet I'm also sure that the XML committee had a
> >  good reason for disallowing it. I'd be interested in hearing what their
> >  reason was, so that I may become enlightened. :-)
> 
> Part of it is simply compatibility -- this cannot be done in SGML. The
> argument about SGML compatibility is not worth rehashing here; the archive
> of the working group discussions includes many messages on it.
> 
> So now that I've objected to all three solutions, you may think I'm a
> negative kind of guy... But I do have a suggestion.
> 
> Support for UTF-8 is required for XML processors, so that an "8-bit" tool
> can always be fed something that it can understand, even though some
> strings may look funny in some editors. Since XML parsers do _not_ perform
> any kind of character format normalization (e.g. of diacritical marks),
> each element name will be a constant string, even if that string is not
> readable.
> 
> [[ Note for anyone who may be puzzled: UTF-8 is a clever little encoding
> trick that uses variable length character codes to represent the larger
> space of Unicode (and 10646) codes in 8-bit chunks. Codes < 128 represent
> USASCII, and codes >= 128 are combined in sequences to represent larger
> values. The details (and sample code in C) can be found at
> http://www.unicode.org/ So a plain ASCII file in UTF-8 looks the same, but
> other characters show up as strings with leading chars >= 128. One detail
> is that Latin-1 etc., are _not_ valid UTF-8 because they use the
> eighth-bit high codes for single characters.]]
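
The points in the UTF-8 note can be checked with a few lines of Python. This is only an illustrative sketch: the helper is_valid_utf8 and the sample name "Straße" are chosen here for the demonstration, and the standard codecs are assumed to behave as documented.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if the byte string decodes cleanly as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


name = "Straße"

# Plain ASCII text is byte-for-byte identical in UTF-8.
assert "Price".encode("utf-8") == b"Price"

# The non-ASCII character becomes a sequence of bytes that are all >= 128.
utf8_bytes = name.encode("utf-8")
print(utf8_bytes)      # b'Stra\xc3\x9fe'

# Latin-1 uses a single high byte for the same character, and that byte
# string is not valid UTF-8.
latin1_bytes = name.encode("latin-1")
print(latin1_bytes)                  # b'Stra\xdfe'
print(is_valid_utf8(latin1_bytes))   # False
```

An 8-bit tool that treats the UTF-8 bytes as an opaque string sees the same element name every time, which is the property the suggestion relies on.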
> 
> The core of your problem is a very good and very real point: writers of
> XML processors need to remember that the Unicode basis of XML is
> fundamental -- so conversion to another character set may fail because the
> characters in a document may simply not exist in the target code. Of
> course, for many documents, the markup will allow transcoding to Latin-1
> (and other local processing codes), but this does depend on the document.
> Text can be modified to use numeric character references but this is
> probably too horrible, especially for the Asian ideographic scripts.
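
A small Python sketch of the two cases in that paragraph: testing whether a string survives transcoding to a target encoding, and falling back to numeric character references for text. The names can_transcode and text_to_ascii are hypothetical helpers. The fallback is only appropriate for character data and attribute values, never for element or attribute names, and it does not escape "&" or "<".

```python
def can_transcode(text: str, encoding: str) -> bool:
    """Return True if every character of text exists in the target encoding."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


def text_to_ascii(text: str) -> str:
    """Replace non-ASCII characters in character data with &#NNN; references."""
    return text.encode("ascii", "xmlcharrefreplace").decode("ascii")


print(can_transcode("Straße", "latin-1"))  # True
print(can_transcode("Straße", "ascii"))    # False
print(text_to_ascii("Straße"))             # Stra&#223;e
```

For a name that fails can_transcode, the options are the ones already discussed: refuse the conversion or write UTF-8.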
> 
> So, you can keep your 8-bit tools, but you may need UTF-8 display code to
> make them maximally usable.
> 
>   -- David
> 
> _________________________________________
> David Durand              dgd at cs.bu.edu  \  david at dynamicDiagrams.com
> Boston University Computer Science        \  Sr. Analyst
> http://www.cs.bu.edu/students/grads/dgd/   \  Dynamic Diagrams
> --------------------------------------------\
> http://www.dynamicDiagrams.com/
> MAPA: mapping for the WWW                    \__________________________
> 
> 
> 
> xml-dev: A list for W3C XML Developers. To post, mailto:xml-dev at ic.ac.uk
> Archived as: http://www.lists.ic.ac.uk/hypermail/xml-dev/
> To (un)subscribe, mailto:majordomo at ic.ac.uk the following message;
> (un)subscribe xml-dev
> To subscribe to the digests, mailto:majordomo at ic.ac.uk the following
> message;
> subscribe xml-dev-digest
> List coordinator, Henry Rzepa (mailto:rzepa at ic.ac.uk)
