XML and special Characters : unicode v3.0 ?

Mon Mar 1 03:14:12 GMT 1999

 From: Baden Hughes <bmhughes at ozemail.com.au>

>I know that XML 1.0 allows you to use 'special' characters as included in
>the Unicode 2.0 specification. With the upcoming release of Unicode 3.0 how
>will we be able to refer to characters in 3.0 which were not in 2.0 ? The
>same way (meaning the actual version of Unicode spec is irrelevant as long
>as the method used is included in XML) or some new way ?
>
>For instance, the Sinhala character set was not in Unicode 2.0 but will be
>in 3.0. How do I get one of those characters in an XML document ? Or is that
>inconsequential to the document per se as it is simply a reference and its
>really up to the application to render it correctly ?

The document character set of XML is ISO 10646, as used by the Unicode
Consortium's character set Unicode. I think most people's strong
expectation is that XML will track ISO 10646, just as Unicode tracks it.
In fact, I think it is essential that XML automatically tracks ISO
10646: people will always try to do strange and interesting things with
characters and codes, and XML should try to allow as much freedom for
them to do this as possible.
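
To make the "same way" answer concrete, here is a minimal sketch in Python
(my choice of language for illustration, not something the XML spec
mandates). It assumes the standard-library ElementTree parser and uses
U+0D85, SINHALA LETTER AYANNA, as the example code point. The numeric
character reference is written exactly as it would be for any character
that was already in Unicode 2.0:

```python
import xml.etree.ElementTree as ET

# A numeric character reference names an ISO 10646 code point directly,
# so the document does not care which Unicode version first assigned it.
# U+0D85 is SINHALA LETTER AYANNA (illustrative choice).
doc = ET.fromstring("<p>&#x0D85;</p>")

char = doc.text
print(hex(ord(char)))
```

Whether the character then displays properly is a matter for the
application and its fonts, not for the document.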

Developers should be very wary of putting type-checking into their
systems which will cause future legitimate ISO 10646 characters to fail.
For example, when a new character is invented, like the Euro, the only
difficulty it should cause is if the font is not upgraded or if the
sort/type system doesn't allow new character registration.
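
As a sketch of the kind of check that stays safe, the following Python
function tests a code point against the ranges of the Char production in
XML 1.0, rather than against a table of characters assigned in some
particular Unicode version. The function name is_xml_char is my own,
chosen for illustration:

```python
def is_xml_char(cp: int) -> bool:
    """True if code point cp falls in the XML 1.0 Char production ranges."""
    return (
        cp in (0x9, 0xA, 0xD)
        or 0x20 <= cp <= 0xD7FF
        or 0xE000 <= cp <= 0xFFFD
        or 0x10000 <= cp <= 0x10FFFF
    )

# The Euro sign (U+20AC) and a Sinhala letter (U+0D85) both pass,
# with no table of "known" characters to upgrade.
print(is_xml_char(0x20AC), is_xml_char(0x0D85))
```

A check built on a list of currently assigned characters would instead
need revising every time the character set grows.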

We certainly need to abandon the expectation that the number of characters
is fixed or knowable, which is how some might interpret material from the
Unicode Consortium. A character set standard tries to put in what is
generally useful against some criteria--if your criteria do not match,
then you can quite legitimately decide that your character is not found in
the set. Is Apple's "apple" character a real character? Are variant
kanji characters real characters? Are roman, fraktur, italic and uncial
"a" characters different? Is English "W" a different character (i.e.,
"UU") from German "W" (i.e., "VV"), when using historical material? In my
book I use a dinosaur glyph as a word, and would have liked to have put it
in the index too: why is it not a character? Such questions can never be
resolved, but a character set must make a decision based on some
selection criteria; and those criteria will not be appropriate in every
situation.

The nice thing about markup is it lets us simulate the existence of a
character missing from a character set: however, we have no markup
conventions yet to do this systematically. There are no standard methods
for saying "when you find 'a' in this context, collate it differently"
for example (apart from, perhaps, language-tagged elements).
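
To illustrate what simulating a missing character with markup might look
like, here is a rough Python sketch. Everything in it is hypothetical: the
element name "g", its "name" attribute, and the FALLBACKS table are
invented for this example and are not any agreed convention. The idea is
only that an empty element stands where the character would be, and the
application decides what to substitute:

```python
import xml.etree.ElementTree as ET

# Hypothetical table mapping glyph names to a plain-text stand-in.
FALLBACKS = {"dinosaur": "[dinosaur]"}

def flatten(elem):
    """Return elem's text, replacing each <g name="..."/> by its stand-in."""
    parts = [elem.text or ""]
    for child in elem:
        if child.tag == "g":
            # Unknown glyph names fall back to a generic marker.
            parts.append(FALLBACKS.get(child.get("name", ""), "[?]"))
        else:
            parts.append(flatten(child))
        parts.append(child.tail or "")
    return "".join(parts)

doc = ET.fromstring('<p>The <g name="dinosaur"/> walked.</p>')
print(flatten(doc))  # prints: The [dinosaur] walked.
```

A renderer with a suitable font could draw the glyph instead, and an
indexer could assign it a sort key; the markup itself says nothing about
either.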

Rick Jelliffe

xml-dev: A list for W3C XML Developers. To post, mailto:xml-dev at ic.ac.uk
Archived as: http://www.lists.ic.ac.uk/hypermail/xml-dev/ and on CD-ROM/ISBN 981-02-3594-1
To (un)subscribe, mailto:majordomo at ic.ac.uk the following message;
(un)subscribe xml-dev
To subscribe to the digests, mailto:majordomo at ic.ac.uk the following message;
subscribe xml-dev-digest
List coordinator, Henry Rzepa (mailto:rzepa at ic.ac.uk)