Private-Use Characters (RE: Character Range: surrogate blocks)

Rick Jelliffe ricko at allette.com.au
Sun Oct 18 10:31:13 BST 1998



> From: owner-xml-dev at ic.ac.uk [mailto:owner-xml-dev at ic.ac.uk]On Behalf Of
> Richard Emberson

> To extend the available characters in Unicode one
> can use two 16 bit characters with surrogate blocks.

If you want to extend the available characters with your own, use the
"Private-Use" or "user-defined" character block. The surrogates are
codepoints reserved for messing up software later as more registered
national character sets are added; they are not for private use;
implementors of current systems can ignore them at least for the next year,
as far as I know.
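
To make the distinction concrete, here is a rough Python sketch that sorts a
code point into surrogate, Private-Use, or everything else. It only covers
the 16-bit ranges (D800-DFFF for surrogates, E000-F8FF for Private-Use), and
the name classify_code_point is made up for illustration.

```python
# Ranges within the 16-bit (BMP) code space only; this is a sketch, not a
# full Unicode classifier.
SURROGATES = range(0xD800, 0xE000)       # U+D800..U+DFFF
PRIVATE_USE_BMP = range(0xE000, 0xF900)  # U+E000..U+F8FF


def classify_code_point(cp: int) -> str:
    """Return 'surrogate', 'private-use', or 'other' for a code point."""
    if cp in SURROGATES:
        return "surrogate"      # reserved for extension, not for your own characters
    if cp in PRIVATE_USE_BMP:
        return "private-use"    # the block to use for user-defined characters
    return "other"
```

So a user-defined character belongs at something like E000, never at D800.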

FIRST check that your character could not be represented by using an
existing ISO 10646 character with some appropriate attribute on the element.
In particular, if it is a regional variant of a character, try to use the
xml:lang attribute. Note that a "language" includes far more than just
simple regional language: I could have xml:lang='en-US-legal' to indicate US
legalese, or xml:lang='x-physics' to indicate the language of physics, which
has not been registered with IANA (hence the 'x-' prefix). In that case,
your stylesheet can say "Oh, this is an X, but an X to be rendered as
physicists will want it rendered."
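
As a sketch of what the receiving end might do, the Python below reads
xml:lang off an element and picks a rendering from a lookup table. The
RENDERINGS table, the element names, and choose_rendering are all
hypothetical; a real stylesheet would do the equivalent in its own language.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes xml:lang under the reserved XML namespace URI.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Hypothetical mapping from language tag to a rendering choice.
RENDERINGS = {
    "x-physics": "physics-style glyph",
    "en-US-legal": "legal-style glyph",
}


def choose_rendering(element, default="default glyph"):
    """Pick a rendering from the element's own xml:lang, if it has one."""
    lang = element.get(XML_LANG)
    return RENDERINGS.get(lang, default)


doc = ET.fromstring("<p>Let <var xml:lang='x-physics'>X</var> be given.</p>")
var = doc.find("var")
print(choose_rendering(var))
```

Note this only looks at the element itself; xml:lang is inherited by child
elements, and the sketch does not walk up the tree to honour that.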

NEXT, if you need mathematical characters, check out MathML (MML)
	http://www.w3.org/TR/REC-MathML/chapter6.html
first.

FINALLY there are two contradictory needs for a user-defined character:
searching (collation) and display. Which fits you?

If your primary need is DISPLAY, then it is better to use an entity
reference for the character. The corresponding entity contains an element
with a hypertext reference to the glyph of the character: e.g.
	<!ENTITY my-alpha "<http:img src='url'/>">
If your system is smart, you could use content-negotiation to get the best
form: GIF or whatever. (And it lets you tie into some Web fonts system, as
that becomes available.) If you also need a little bit of collatability,
you could add an attribute to indicate the collation sequence position.

If your primary need is for simple SEARCHING (collation) rather than
presentation, then use the Private-Use area. (In the Private-Use characters,
avoid using E200-E600; MML uses them.) You should always enter any of the
Private-Use area characters using a numeric character reference (or, if you
use these characters more than once, or want to provide a modicum of
documentation, define an entity for them and use an entity reference). This
will prevent possible transcoding errors later, and also makes the text more
readable in editors which do not allow private-use characters to be added.
(Western readers may be surprised that allowing user-defined characters is
not uncommon in CJK publishing software, since the standard sets only go so
far, even though it is almost unheard of in the West.)
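
If the text already has raw Private-Use characters in it, something like the
following Python sketch would turn them into numeric character references
before the file goes anywhere. The function names are made up, the range is
the 16-bit Private-Use block E000-F8FF, and the E200-E600 check simply
encodes the MML caveat above.

```python
import re

# Matches any single character in the 16-bit Private-Use block.
_PRIVATE_USE = re.compile("[\ue000-\uf8ff]")


def escape_private_use(text: str) -> str:
    """Replace each Private-Use character with a hex numeric character reference."""
    return _PRIVATE_USE.sub(lambda m: "&#x%04X;" % ord(m.group()), text)


def in_mml_range(cp: int) -> bool:
    """True if the code point falls in E200-E600, which MML uses."""
    return 0xE200 <= cp <= 0xE600


print(escape_private_use("a\ue000b"))  # prints a&#xE000;b
```

Everything outside the Private-Use block is passed through untouched, so the
rest of the text stays as readable as it was.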


Rick Jelliffe

<kisses xml:lang='x-love'>XXX</kisses>



xml-dev: A list for W3C XML Developers. To post, mailto:xml-dev at ic.ac.uk
Archived as: http://www.lists.ic.ac.uk/hypermail/xml-dev/
To (un)subscribe, mailto:majordomo at ic.ac.uk the following message;
(un)subscribe xml-dev
To subscribe to the digests, mailto:majordomo at ic.ac.uk the following message;
subscribe xml-dev-digest
List coordinator, Henry Rzepa (mailto:rzepa at ic.ac.uk)



