encoding problem fixed

David Brownell david-b at pacbell.net
Fri Jul 30 19:11:59 BST 1999


----- Original Message -----
From: John Cowan <cowan at locke.ccil.org>
To: XML Dev <xml-dev at ic.ac.uk>
Sent: Friday, July 30, 1999 7:59 AM
Subject: Re: encoding problem fixed


> James Tauber wrote:
>
> > In other words, rather than creating an InputSource using a FileReader, I
> > used James Clark's "fileInputSource" method in XT to make a URL out of a
> > file and create the InputSource from the URL string.
>
> Yes, indeed.  You should never use a Reader of any sort when processing
                           ^^^^^ wrong !!!
> XML (unless you have a non-standard Reader class that understands the
> XML declaration).  Always use an InputSource so that the parser can
> install its own bytes-to-chars converter based on the declaration.

Actually, that's not correct either.  My general advice is to pass a
URI to the parser -- which is required to do the correct thing! -- and
in those rare cases that can't be done:

    * If the data is externally typed according to character set,
      you MUST use some Reader ... e.g. given a MIME type of
      "application/xml;charset=Big5", use a reader set up for the
      "Big5" encoding (a Chinese encoding).  There isn't much choice
      of classes:  InputStreamReader, or a custom reader that
      understands that encoding.

    * If the data is NOT externally typed, then you MUST rely on
      the XML parser's autodetection ... pass an InputStream.
      (Both cases are sketched just below.)
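
A minimal SAX sketch of both cases (the file name is invented, and
exception handling is omitted for brevity):

    import java.io.FileInputStream;
    import java.io.InputStreamReader;
    import org.xml.sax.InputSource;

    // Externally typed, e.g. "application/xml;charset=Big5":
    // the Reader does the bytes-to-chars conversion itself.
    InputStreamReader reader = new InputStreamReader (
        new FileInputStream ("doc.xml"), "Big5");
    InputSource typed = new InputSource (reader);
    typed.setEncoding ("Big5");    // lets the parser cross-check

    // Not externally typed:  hand over raw bytes and let the
    // parser autodetect from the XML/text declaration (if any).
    InputSource untyped = new InputSource (new FileInputStream ("doc.xml"));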

Remember, with external typing (e.g. MIME objects) the MIME type is
authoritative.  And XML/text declarations are optional; for the top
level document, the "encoding=..." is also optional.  Autodetection
will not work in all cases ... which is why the notion of "always
use an InputStream" is incorrect.

Those using Sun's parser will notice a "Resolver" class with a static
method accepting a MIME type, interpreted according to the relevant
RFC, and another static method taking a "File".  The latter ignores
the JVM's usual notion of file encodings and autodetects instead,
which does better than any system default in that case!


> > The culprit is FileReader. It is the one doing the strange "read UTF-8 as
> > Windows code page".
>
> Actually, it's doing what it's expected to: reading the native charset,
> CP-1252.  (Unix JVMs use 8859-1 instead.)

Those are actually system-specific defaults ... many localized versions
of those environments behave differently.  For example, UNIX JVMs may
well use the "EUC-JP" encoding in Japan, and MS-Windows "Shift_JIS".
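
(You can see which default your JVM picked up by printing the standard
"file.encoding" system property; roughly speaking, that's what
FileReader ends up using.)

    System.out.println (System.getProperty ("file.encoding"));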


>     It has no way of knowing that
> *you* think the document charset is UTF-8.

The InputStreamReader class can be told the encoding explicitly, and
you can create one from a FileInputStream.
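
For example (assuming the bytes really are UTF-8; exception handling
omitted again):

    java.io.Reader in = new java.io.InputStreamReader (
        new java.io.FileInputStream ("doc.xml"), "UTF-8");
    InputSource source = new InputSource (in);
    source.setEncoding ("UTF-8");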

Another fix, for JDK 1.2 conformant JVMs, is to derive a URL string
from the relevant file and construct the InputSource from that:

    new InputSource (new File (path).toURL ().toString ())

In fact, my own basic guidance is never to pass any sort of I/O stream
(InputStream -or- Reader!) to the parser; let the parser work from the
URI, if at all possible.  That's normally quite possible, and the parser
is a lot less likely to get the encodings wrong than application code!!
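
As a sketch, using the SAX 1.0 helper classes (the driver class name
below is just a placeholder for whichever parser you use, and you would
normally set a DocumentHandler before parsing):

    import java.io.File;
    import org.xml.sax.Parser;
    import org.xml.sax.helpers.ParserFactory;

    Parser parser = ParserFactory.makeParser ("com.example.SAXDriver");
    String uri = new File (path).toURL ().toString ();
    parser.parse (uri);    // parser resolves the URI and picks the encoding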

- Dave




