IE5.0 does not conform to RFC2376

Sat Apr 10 23:58:59 BST 1999

MURATA Makoto wrote:
> Chris Lilley wrote:
> > An alternative method for achieving the same result is to use a filter
> > (this can be done in Apache and in Jigsaw) which automatically emits the
> > correct charset parameter based on reading the encoding declaration in
> > the XML instance. This can easily cache its results, and need not
> > result in processing overhead on each request.
> 
> I strongly agree.   This is the best approach.  I sincerely hope that such
> an attempt will happen at W3C.

I have spoken to the Jigsaw team about this, explained the urgency, and
hope to see an implementation in a forthcoming Jigsaw release. They said
it was about an hour's work or so.

> > > At *IETF*, the default of the charset parameter for text/HTML *is* 8859-1.
> >
> > Yes, which is different to the default for text/* - this demonstrates
> > that it is possible to give a more specific rule for a particular
> > registration.
> 
> Actually, in the case of HTTP MIME, the default of the charset parameter of
> text/* is always ISO-8859-1. 

Yes, we both agree there. And I said that this shows that the default
for a particular registration can be different from the default for
text/*.

>  In the case of real MIME, the default of
> the charset parameter of text/* is always US-ASCII. 

I don't think we need to get into "real MIME" versus "HTTP MIME" here; I
raised the issue very early on the IMC list and quickly got consensus
that the MIME registration applies to all uses of MIME. By drawing this
distinction, are you saying that RFC 2376 does not apply to HTTP and
only applies to email?

> text/xml is an exception, since the default is always US-ASCII.  This was
> recommended by ISEG.

Well, if a US-based group recommends US-ASCII that should not really be
a surprise ;-) However, while US-ASCII is compatible with UTF-8 it is
not the same; and it is not compatible with UTF-16. So, it is a very odd
choice for a default.

As I said, I think a better default would have been the same default as
specified in the XML Recommendation. While priority rules can always be
defined to figure out which of two conflicting labels (or label
defaults) has precedence, the whole issue is solved if the defaults are
the same. Unfortunately, RFC2376 did not do this.

> > > It is going to be very difficult or
> > > impossible, since HTTP and MIME people will disagree.
> >
> > I think you mean, HTTP and Mail(SMTP/IMAP/POP). MIME is used by both
> > email and HTTP.
> 
> HTTP MIME is not quite the same as real MIME.  There are many differences
> between the two.

Since HTTP is at a different (lower) position in the IETF standards
track to MIME, MIME cannot make any reference to HTTP but can only speak
of email. This is odd, but there we are. So, HTTP has to refer to MIME,
not the other way round; the use of MIME in HTTP is defined in the HTTP
specs. This is unfortunate, but does not make it "unreal".

> text/xml has to be consistent with HTTP and MIME.  Autodetection
> or the use of META tags as the default of the charset parameter has been
> extensively discussed by HTTP people and MIME people.  They strongly dissent.

In another thread, it was convincingly shown that the term
"autodetection" to refer to the encoding declaration in the XML
Recommendation was incorrect terminology. It is actually using a
designating sequence.

> > But, if it is not present,
> > then the XML Rec says exactly what should happen;
> 
> Appendix F is non-normative.  

Yes, but I was not referring to Appendix F. I was referring to section
4.3.3 which is normative:
  http://www.w3.org/TR/REC-xml#charencoding

   Parsed entities which are stored in an encoding other than UTF-8 
   or UTF-16 must begin with a text declaration containing an encoding 
   declaration [...]

> RFC2376 supersedes it, as intended by the XML WG. 

Supersedes Appendix F, or supersedes the whole of the XML
Recommendation? I assume you mean the former. So, all parsed entities
which are not in UTF-8 or UTF-16 must still begin with an encoding
declaration; it is an error for them not to do so, and it is an error
for an entity including an encoding declaration to be presented to the
XML processor in an encoding other than that named in the declaration.
All of which follows from the normative section 4.3.3 which is still, as
far as I am aware, the current XML 1.0 Recommendation.

>  XML 1.0 clearly says:

In Appendix F, which as you point out is non-normative.

> By the way, now that RFC 2376 is published, XML 1.0 will be revised.

I cannot conform today to a potential future revision of a
Recommendation. 

> > careful wording which
> > this RFC nullifies. Problems arise if an XML file is saved from the Web
> > to a local filesystem, perhaps for further editing; the MIME charset
> > information is lost. It could perhaps be stored in some way - but, there
> > is already a standard way - the XML encoding declaration.
> 
> Since it is a standard way, RFC 2376 recommends recipient programs to
> rewrite encoding declarations.

OK. It would be better if no rewrites were ever necessary, however. That
would have been possible, with suitable wording in RFC 2376.

If the MIME charset parameter were *always* derived from the encoding
declaration, as I have suggested, then 

a) it would never disagree
b) it would always be correct, when saved to local file, without
rewriting
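
As a rough illustration of deriving the parameter from the declaration,
here is a minimal Python sketch. It assumes the entity is in an
ASCII-compatible encoding (or carries a UTF-16 byte order mark), so that
the declaration can be read from the first bytes. The names
charset_from_declaration and content_type are hypothetical; they are not
part of Apache, Jigsaw, or any existing server.

```python
import re

# Encoding pseudo-attribute of an XML or text declaration (illustrative
# pattern; it only handles ASCII-compatible encodings).
_ENCODING_RE = re.compile(
    rb'^<\?xml\s[^>]*?\bencoding\s*=\s*(["\'])([A-Za-z][A-Za-z0-9._-]*)\1'
)


def charset_from_declaration(head: bytes) -> str:
    """Guess the charset to advertise from the first bytes of an entity."""
    # A UTF-16 byte order mark settles the question before any parsing.
    if head.startswith((b"\xfe\xff", b"\xff\xfe")):
        return "utf-16"
    # Skip an optional UTF-8 byte order mark.
    if head.startswith(b"\xef\xbb\xbf"):
        head = head[3:]
    match = _ENCODING_RE.match(head)
    if match:
        return match.group(2).decode("ascii").lower()
    # No declaration and no UTF-16 BOM: XML 1.0 section 4.3.3 implies UTF-8.
    return "utf-8"


def content_type(head: bytes) -> str:
    """Build a Content-Type value whose charset mirrors the declaration."""
    return f"text/xml; charset={charset_from_declaration(head)}"
```

Because the header value is computed from the document itself, the two
labels cannot disagree, and the saved file needs no rewriting.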

> > And if the charset parameter is present, then it should say the same
> > thing as the encoding declaration.
> 
> This disallows code conversion by proxy servers. 

No, it does not, any more than your proposal disallows saving to local
file. Your proposal requires rewriting the encoding declaration when
saving to file (but not by a proxy); my proposal requires rewriting the
encoding declaration when passing through a transcoding proxy (but not
when saving locally).

Since I observe that saving to file is a very common operation; since I
observe that there is existing XML client code deployed, and since a
transcoding proxy is rewriting all the bytes in the file anyway,
rewriting the encoding declaration is not a significant burden for the
proxy.

> One could argue
> that proxy servers should rewrite encoding declarations. 

Yes, I am doing so.

> However,
> documents should not be rewritten for security reasons. 

Your security argument is self defeating here:

a) it implies, don't use transcoding proxies because they rewrite
documents
b) it implies, saving to local file (which you want to require rewriting
the encoding) should be banned for security reasons; I don't see how
that could be enforced
c) a cryptographic hash or digital signature will be broken by any
transcoding proxy

So, if security is important and availability of a resource in multiple
encodings is important, it follows that the conversions should be done
one time on the server, the results signed and cached, and that as part
of this process the encoding declaration should be correctly rewritten
on the server before the document is signed.  
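
A minimal Python sketch of that server-side step, under stated
assumptions: the source document carries an encoding declaration and no
byte order mark, the charset names are ones Python's codecs accept, and
a SHA-256 digest stands in for whatever real signature scheme would be
used. The function names transcode and publish are hypothetical.

```python
import hashlib
import re

# Captures everything up to the encoding value, plus the quote character,
# so that only the encoding name itself is replaced.
_DECL_RE = re.compile(
    r'^(<\?xml\s[^>]*?\bencoding\s*=\s*)(["\'])[A-Za-z][A-Za-z0-9._-]*\2'
)


def transcode(data: bytes, source: str, target: str) -> bytes:
    """Re-encode a document and rewrite its encoding declaration to match."""
    text = data.decode(source)
    text, count = _DECL_RE.subn(
        lambda m: f"{m.group(1)}{m.group(2)}{target}{m.group(2)}", text, count=1
    )
    if count == 0:
        # This sketch does not try to insert a missing declaration.
        raise ValueError("no encoding declaration found to rewrite")
    return text.encode(target)


def publish(data: bytes, source: str, targets):
    """Convert once per target encoding; digest each result after rewriting."""
    variants = {}
    for target in targets:
        body = transcode(data, source, target)
        # The digest covers the final bytes, declaration included.
        variants[target] = (body, hashlib.sha256(body).hexdigest())
    return variants
```

The digest is taken only after the declaration has been rewritten, so
what is signed is exactly what is served and what is saved.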

> Moreover,
> if we require different code conversion for different subtypes of text,
> there is not much hope for interoperability, 

Uh, if something is capable of reading a MIME parameter to find out the
charset, it is equally capable of reading the MIME subtype.

> especially because fallback to text/plain is required.

Fallback to text/plain is overrated and rarely useful, as others have
noted. You seem to be sacrificing a lot of other things, just to
accommodate it.

> > The best way to ensure this is to
> > treat the XML encoding declaration as the primary metadata resource and
> > to programmatically derive the charset parameter from this;  greater
> 
> If it is done when the document is stored in the WWW server, that is
> superb.

Yes, that seems the best way.

I notice that the Apache 1.3 distribution has a mod_mime_magic which can
be used, perhaps, to do this sort of thing. However, it doesn't seem to
do caching, and involves server CPU load on a per-hit basis. It also
seems fragile, since it relies on fixed byte positions in the file.
http://www.apache.org/docs/mod/mod_mime_magic.html

There must be a better solution, which only computes the charset once,
and only recomputes if the document has changed.
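
One way this could look, as a minimal Python sketch rather than a real
server module: the charset is cached per path and recomputed only when
the file's modification time or size changes. The names sniff_charset
and cached_charset are hypothetical, and the in-memory dictionary is an
assumption; a real server would need its own storage and locking.

```python
import os
import re

_ENCODING_RE = re.compile(
    rb'^<\?xml\s[^>]*?\bencoding\s*=\s*(["\'])([A-Za-z][A-Za-z0-9._-]*)\1'
)
_cache = {}  # path -> ((mtime_ns, size), charset)


def sniff_charset(path):
    """Read the head of the file and report its declared encoding."""
    with open(path, "rb") as f:
        head = f.read(256)
    if head.startswith((b"\xfe\xff", b"\xff\xfe")):
        return "utf-16"
    if head.startswith(b"\xef\xbb\xbf"):
        head = head[3:]
    match = _ENCODING_RE.match(head)
    return match.group(2).decode("ascii").lower() if match else "utf-8"


def cached_charset(path, sniff=sniff_charset):
    """Return the charset for path, re-sniffing only after the file changes."""
    st = os.stat(path)
    stamp = (st.st_mtime_ns, st.st_size)
    hit = _cache.get(path)
    if hit is not None and hit[0] == stamp:
        return hit[1]
    charset = sniff(path)
    _cache[path] = (stamp, charset)
    return charset
```

Each request then costs one stat call; the file is opened and scanned
only on the first hit and after an edit.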

> > However, I will point out that it is the consensus of the XML 1.0
> > Recommendation that I am respecting - and that the RFC does not, by
> > altering the meaning of the default encoding. It could have been
> > harmonised with the XML REC; it was not.
> 
> RFC 2376 IS the consensus (it was not unanimous, though).  It is based
> on really extensive discussion at the XML SIG and XML WG.  My mail
> folder named text/xml has 687 e-mails ;-(   Larry Masinter (the HTTP WG
> chair) and Martin Duerst (the I18N IG chair) were heavily involved.  On
> the other hand, appendix in XML 1.0 is merely informative and was meant
> to be replaced by the XML media type RFC.
> 
> Cheers,
> 
> Makoto
> 
> Fuji Xerox Information Systems
> 
> Tel: +81-44-812-7230   Fax: +81-44-812-7231
> E-mail: murata at apsdc.ksp.fujixerox.co.jp
> 
> xml-dev: A list for W3C XML Developers. To post, mailto:xml-dev at ic.ac.uk
> Archived as: http://www.lists.ic.ac.uk/hypermail/xml-dev/ and on CD-ROM/ISBN 981-02-3594-1
> To (un)subscribe, mailto:majordomo at ic.ac.uk the following message;
> (un)subscribe xml-dev
> To subscribe to the digests, mailto:majordomo at ic.ac.uk the following message;
> subscribe xml-dev-digest
> List coordinator, Henry Rzepa (mailto:rzepa at ic.ac.uk)