How to process Japanese Code with XMLDSO(MS-XML)
MURATA Makoto
murata at apsdc.ksp.fujixerox.co.jp
Mon Aug 10 03:48:59 BST 1998
Rick Jelliffe wrote:
> MURATA Makoto wrote:
>
> > More than one conversion procedures certainly exist. The more I think
> > about this issue, the more pessimistic I become.
>
> On the other hand, there has just been little practical requirement for
> everyone to synchronize until now. It is still the earliest days of
> Unicode and XML deployment, so cheer up!
This is correct. We still have hope.
> perhaps a strong request stating XML's needs to JIS and Microsoft (and
> ISO) can force a resolution.
> If everyone remains stubborn, then the only thing to do is for IANA to
> register three different character sets. And perhaps XML will need
> another pre-defined attribute to indicate which character set variant is
> in use in an element, to handle cut-and-paste. What a cock-up. In the
> meantime, I guess the appropriate strategy is "damage control": as many
> Japanese implementors as possible should adopt a single mapping. Can you
> recommend one?
What I have in mind is as follows. First, we should clarify the definition
of the charset "shift_JIS" registered at IANA. I believe the least common
denominator, which is JIS X0201 + JIS X0208:1997, should be adopted as the
coded character set of SHIFT_JIS. NEC extensions and IBM extensions should
be eliminated. 0x5C should be backslash rather than yen sign, and 0x7E
should be tilde rather than overline. Second, we should revise the Japanese
profile for XML and encourage the use of character entities to represent
conversion-error-prone characters rather than using them directly. (See the
end of this message.) Then it will become easier for users to create
conversion-error-free documents. User-friendly XML processors should warn
users when documents in EUC-JP, ISO-2022-JP, or Shift_JIS contain such
characters.
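The kind of warning and replacement I mean can be sketched roughly as follows.
This is only an illustration in Python, not part of any existing XML
processor. The table is a hypothetical subset of the list at the end of this
message, and the code points are my assumed reading of those character names.

```python
# Hypothetical subset of the conversion-error-prone characters.
# The code points are an assumed reading of the character names
# given in the list at the end of the message.
ERROR_PRONE = {
    0x00A5: "YEN SIGN",
    0x203E: "OVERLINE",
    0x2014: "EM DASH",
    0x2015: "HORIZONTAL BAR",
    0x301C: "WAVE DASH",
    0xFF5E: "FULLWIDTH TILDE",
    0x2016: "DOUBLE VERTICAL LINE",
    0x2225: "PARALLEL TO",
    0x2212: "MINUS SIGN",
    0xFF0D: "FULLWIDTH HYPHEN-MINUS",
}


def find_error_prone(text):
    """Return (offset, code point, name) for each risky character in text."""
    return [(i, ord(ch), ERROR_PRONE[ord(ch)])
            for i, ch in enumerate(text) if ord(ch) in ERROR_PRONE]


def to_char_refs(text):
    """Replace each risky character with a hexadecimal character reference."""
    return "".join(f"&#x{ord(ch):X};" if ord(ch) in ERROR_PRONE else ch
                   for ch in text)
```

A processor could call find_error_prone() on character data to produce the
warnings, and an authoring tool could call to_char_refs() before saving.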
> deployed. It is better to converge on a single mapping, even if that
> mapping is not satisfactory to everyone (i.e. JIS).
Actually, I am not optimistic about this, because there are many conversion
policies. For example, Microsoft maps 0x5C (sjis) to 0x005C (unicode), but
the glyph for yen sign is used for this code point. Microsoft converts NEC
extensions and IBM extensions to Unicode characters. On the other hand, Java
ignores NEC extensions and IBM extensions. (What happens if J++ is used? I
do not know.) Apple appears to use more than one conversion table.
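Differences of this kind are easy to observe by decoding the same bytes with
two tables. The following is a minimal sketch that assumes Python's standard
"shift_jis" and "cp932" codecs as stand-ins for two conversion policies
(cp932 follows Microsoft code page 932; neither is claimed to match the Java
or Apple tables). The function name is hypothetical.

```python
def compare_decoders(data, codec_a="shift_jis", codec_b="cp932"):
    """Decode the same bytes with two codecs and list the disagreements.

    Returns (character offset, code point from codec_a, code point from
    codec_b) for every position where the decoded characters differ.
    Undecodable bytes are replaced rather than raising, so the sketch
    only compares positions present in both results.
    """
    a = data.decode(codec_a, errors="replace")
    b = data.decode(codec_b, errors="replace")
    return [(i, f"U+{ord(x):04X}", f"U+{ord(y):04X}")
            for i, (x, y) in enumerate(zip(a, b)) if x != y]
```

For example, compare_decoders(b"\x82\xa0\x81\x60") reports that the second
character (0x8160) becomes U+301C WAVE DASH under one table and U+FF5E
FULLWIDTH TILDE under the other, while the hiragana before it is unaffected.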
Rick Jelliffe wrote:
>
> I have made mapping tables for entity references to thousands of
> characters and glyphs.
You are talking about SPREAD entities. I recently tried to find ERCS
documents, but I could find only a few. In my understanding, the names of
SPREAD entities contain hexadecimal numbers, but XML already has hexadecimal
character entities. I would rather use natural-language markup such as
&enkigou; (enkigou should be in kanji).
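Such an entity could be declared in the internal DTD subset. Here is a
minimal sketch, assuming Python's xml.etree.ElementTree as the parser; the
entity name 円記号 (enkigou, "yen sign"), its replacement text, and the sample
document are all hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: the entity name and its replacement text are
# assumptions chosen for illustration only.
DOC = """<!DOCTYPE p [
  <!ENTITY 円記号 "&#xA5;">
]>
<p>&円記号;100</p>"""


def element_text(xml_source):
    """Parse the document and return the text of the root element."""
    return ET.fromstring(xml_source).text
```

The document itself then contains no yen sign in its encoded form, only the
entity reference, so a transcoding step has nothing to mis-convert.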
Here is a list of conversion-error-prone characters.
< YEN SIGN
> BACKSLASH
< OVERLINE
> TILDE
< OVERLINE
> FULLWIDTH MACRON
< EM DASH
> HORIZONTAL BAR
< BACKSLASH
> FULLWIDTH BACKSLASH
< WAVE DASH
> FULLWIDTH TILDE
< DOUBLE VERTICAL LINE
> PARALLEL TO
< MINUS SIGN
> FULLWIDTH HYPHEN-MINUS
< YEN SIGN
> FULLWIDTH YEN SIGN
< CENT SIGN
> FULLWIDTH CENT SIGN
< POUND SIGN
> FULLWIDTH POUND SIGN
< NOT SIGN
> FULLWIDTH NOT SIGN
< TILDE
> FULLWIDTH TILDE
< BROKEN BAR
> FULLWIDTH BROKEN BAR
Makoto
Fuji Xerox Information Systems
Tel: +81-44-812-7230 Fax: +81-44-812-7231
E-mail: murata at apsdc.ksp.fujixerox.co.jp