How to process Japanese Code with XMLDSO(MS-XML)

Mon Aug 10 03:48:59 BST 1998

Rick Jelliffe wrote:
> MURATA Makoto wrote:
> 
> >  More than one conversion procedure certainly exists.  The more I think
> > about this issue, the more pessimistic I become.
> 
> On the other hand, there has just been little practical requirement for
> everyone to synchronize until now.  It is still the earliest days of Unicode
> and XML

This is correct.  We still have hope.

> deployment, so cheer up!  Perhaps a strong request stating XML's needs to
> JIS and Microsoft (and ISO) can force a resolution.
> If everyone remains stubborn, then the only thing to do is for IANA to
> register three different character sets.  And perhaps XML will need another
> pre-defined attribute to indicate which character set variant is in use in
> an element, to handle cut-and-paste.  What a cock-up.  In the meantime, I
> guess the appropriate strategy is "damage control": as many Japanese
> implementors as possible should adopt a single mapping.  Can you recommend
> one?

What I have in mind is as follows.  First, we should clarify the definition 
of the charset "shift_JIS" registered at IANA.  I believe the least common 
denominator, which is JIS X0201 + JIS X0208:1997, should be adopted as the 
coded character set of SHIFT_JIS.  NEC extensions and IBM extensions should 
be eliminated.  0x5C is backslash rather than yen sign, and 0x7E is tilde 
rather than overline.  Second, we should revise the Japanese profile for 
XML and encourage the use of character entities to represent 
conversion-error-prone characters rather than using them directly.  (See the 
end of this message.)  It will then become easier for users to make 
conversion-error-free documents.  User-friendly XML processors should warn 
users when documents in EUC-jp, iso-2022-jp, or shift_jis contain such 
characters.

> deployed.  It is better to converge on a single mapping, even if that
> mapping is not satisfactory to everyone (i.e. JIS).

Actually, I am not optimistic about this, because there are many conversion 
policies.  For example, Microsoft maps 0x5C (sjis) to 0x005C (unicode), but 
the glyph for yen sign is used for this code point.  Microsoft converts NEC 
extensions and IBM extensions to Unicode characters.  On the other hand, Java 
ignores NEC extensions and IBM extensions.  (What happens if J++ is used?  I 
do not know.)  Apple appears to use more than one conversion table.
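
To make the "more than one conversion table" problem concrete, here is a 
minimal sketch that decodes the same Shift_JIS byte pairs with two different 
tables and prints the resulting Unicode character names.  It assumes Python 
and its bundled "shift_jis" and "cp932" codecs, which merely stand in for a 
JIS-style table and a Microsoft-style table; they are not the implementations 
discussed above.  The helper name compare_decodings and the sample byte pairs 
are illustrative choices.

```python
import unicodedata

# Assumption: Python's "shift_jis" and "cp932" codecs stand in for two
# different vendor conversion tables.
CODECS = ("shift_jis", "cp932")


def compare_decodings(raw: bytes, codecs=CODECS):
    """Decode the same bytes with each codec.

    Returns {codec name: [Unicode character names]} so that differences
    between the tables are visible by name rather than by glyph.
    """
    result = {}
    for codec in codecs:
        text = raw.decode(codec)
        result[codec] = [unicodedata.name(ch) for ch in text]
    return result


# Double-byte Shift_JIS sequences picked for illustration.
SAMPLES = {
    "0x8160": b"\x81\x60",
    "0x8161": b"\x81\x61",
    "0x817C": b"\x81\x7c",
}

for label, raw in SAMPLES.items():
    print(label, compare_decodings(raw))
```

If the two names printed for a byte pair differ, a document that passes 
through both tables will not round-trip for that character.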

Rick Jelliffe wrote:
> 
> I have made mapping tables for entity references to thousands of
> characters and glyphs.

You are talking about SPREAD entities.  I recently tried to find ERCS documents, 
but I could find only a few.  In my understanding, names of SPREAD entities contain 
hexadecimal numbers.  But XML already has hexadecimal character entities.  
I would rather use natural language markup such as &enkigou; (enkigou 
should be in kanji).

Here is a list of conversion-error-prone characters.

< YEN SIGN
> BACKSLASH

< OVERLINE
> TILDE

< OVERLINE
> FULLWIDTH MACRON

< EM DASH
> HORIZONTAL BAR

< BACKSLASH
> FULLWIDTH BACKSLASH

< WAVE DASH
> FULLWIDTH TILDE

< DOUBLE VERTICAL LINE
> PARALLEL TO

< MINUS SIGN
> FULLWIDTH HYPHEN-MINUS

< YEN SIGN
> FULLWIDTH YEN SIGN

< CENT SIGN
> FULLWIDTH CENT SIGN

< POUND SIGN
> FULLWIDTH POUND SIGN

< NOT SIGN
> FULLWIDTH NOT SIGN

< TILDE
> FULLWIDTH TILDE

< BROKEN BAR
> FULLWIDTH BROKEN BAR
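
As a rough sketch of how a tool might warn about these characters and rewrite 
them as hexadecimal character references, consider the following Python.  The 
function names are hypothetical.  The set covers only the non-ASCII characters 
from the list above; ASCII BACKSLASH and TILDE are left out on the assumption 
that escaping them everywhere would be impractical.  FULLWIDTH BACKSLASH is 
looked up under its Unicode name, FULLWIDTH REVERSE SOLIDUS.

```python
import unicodedata

# Non-ASCII characters from the list above, identified by Unicode name.
# Assumption: ASCII BACKSLASH and TILDE are deliberately not included.
ERROR_PRONE_NAMES = [
    "YEN SIGN", "OVERLINE", "FULLWIDTH MACRON", "EM DASH", "HORIZONTAL BAR",
    "FULLWIDTH REVERSE SOLIDUS",  # listed above as FULLWIDTH BACKSLASH
    "WAVE DASH", "FULLWIDTH TILDE", "DOUBLE VERTICAL LINE", "PARALLEL TO",
    "MINUS SIGN", "FULLWIDTH HYPHEN-MINUS", "FULLWIDTH YEN SIGN",
    "CENT SIGN", "FULLWIDTH CENT SIGN", "POUND SIGN", "FULLWIDTH POUND SIGN",
    "NOT SIGN", "FULLWIDTH NOT SIGN", "BROKEN BAR", "FULLWIDTH BROKEN BAR",
]
ERROR_PRONE = {unicodedata.lookup(name) for name in ERROR_PRONE_NAMES}


def find_error_prone(text):
    """Return (offset, character name) for each error-prone character."""
    return [
        (i, unicodedata.name(ch))
        for i, ch in enumerate(text)
        if ch in ERROR_PRONE
    ]


def escape_error_prone(text):
    """Replace each error-prone character with a hexadecimal reference.

    Intended for character data only; markup names and delimiters are
    assumed to be handled elsewhere.
    """
    return "".join(
        f"&#x{ord(ch):X};" if ch in ERROR_PRONE else ch for ch in text
    )


if __name__ == "__main__":
    sample = "price: 100\u00a5 \u301c 200\u00a5"
    for offset, name in find_error_prone(sample):
        print(f"warning: {name} at offset {offset}")
    print(escape_error_prone(sample))
```

The output of escape_error_prone contains only the references and the 
untouched characters, so it can be saved in EUC-jp, iso-2022-jp, or shift_jis 
without depending on how a given converter treats the listed characters.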

Makoto

Fuji Xerox Information Systems

Tel: +81-44-812-7230   Fax: +81-44-812-7231
E-mail: murata at apsdc.ksp.fujixerox.co.jp
