IE5.0 does not conform to RFC2376

Sun Apr 4 15:28:47 BST 1999

David Brownell wrote:
> Chris Lilley wrote:
> >
> > What this RFC appears to do is remove author control over correctly
> > labelling the encoding, and ensure that most if not all XML documents
> > get incorrectly labelled as US-ASCII.
> 
> Not at all.  The best default MIME content type for all web
> servers is "application/xml". 

Why? Do you consider that anything not written in US-ASCII is not a
text document? I think the Unicode Consortium would disagree with you
there.

You don't actually show that application/xml is better, because you say:

> Without a "charset=Big5" or
> similar declaration, then the XML processor's autodetection
> kicks in ... minimally handling UTF-8 and UTF-16, and quite
> commonly handling a variety of additional encodings.

If it has poor code to autodetect, it has poor code for both text/xml
and application/xml. But it need not autodetect; in fact, autodetection
is a bad thing. I was not suggesting autodetection; quite the converse.

Rather, in the absence of an explicit MIME charset parameter, it should
use the encoding declaration. If there is none, then the document is in
UTF-8 or UTF-16 and the XML spec tells you how to determine which [1].
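
To make that rule concrete, here is a minimal sketch in Python. The
function name choose_encoding is hypothetical, and the detection is
deliberately simplified: it only looks at byte order marks and an
ASCII-readable encoding declaration, and it ignores the UCS-4, EBCDIC
and BOM-less UTF-16 cases that the XML spec also covers.

```python
import re
from typing import Optional

# An encoding declaration inside an ASCII-compatible XML declaration.
_ENCODING_DECL = re.compile(
    rb'^<\?xml[^>]*?\bencoding\s*=\s*["\']([A-Za-z][A-Za-z0-9._-]*)["\']'
)


def choose_encoding(body: bytes, charset: Optional[str] = None) -> str:
    """Pick an encoding for an XML entity (hypothetical helper).

    Order: explicit MIME charset, then a UTF-16 byte order mark,
    then the encoding declaration, then the UTF-8 default.
    """
    # 1. An explicit charset parameter on the MIME type wins.
    if charset:
        return charset
    # 2. A UTF-16 entity cannot be read as ASCII, so check its BOM first.
    if body.startswith((b"\xfe\xff", b"\xff\xfe")):
        return "UTF-16"
    # Skip a UTF-8 byte order mark, if there is one.
    if body.startswith(b"\xef\xbb\xbf"):
        body = body[3:]
    # 3. Otherwise trust the encoding declaration in the entity.
    match = _ENCODING_DECL.match(body)
    if match:
        return match.group(1).decode("ascii")
    # 4. No charset and no declaration: the entity is UTF-8.
    return "UTF-8"
```

Note that nothing here guesses from the content: every answer comes
from a label the author or the server supplied, or from the defaults
in the XML spec.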

If the processor is unable to deal with a particular encoding (8859-15,
for example) then that is still the case whether the information was
conveyed in a charset parameter on the MIME type (text/xml) or in the
encoding declaration in the entity (application/xml). So, in what way is
application/xml any better?

So, the only difference between text/xml and application/xml in this
regard is that the former *requires* the client to ignore the encoding
declaration in the entity and forces an interpretation of US-ASCII in
all cases.
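
As I read RFC 2376, the difference can be sketched as below. This is
an illustration and not anyone's actual implementation; the function
name charset_from_mime is hypothetical, and it uses the header parser
from the Python standard library.

```python
from email.message import Message
from typing import Optional


def charset_from_mime(content_type_header: str) -> Optional[str]:
    """Return the charset the MIME label imposes, or None if it imposes none.

    Hypothetical helper sketching the RFC 2376 defaults as described above.
    """
    msg = Message()
    msg["Content-Type"] = content_type_header
    charset = msg.get_param("charset")
    if charset is not None:
        # An explicit charset parameter is authoritative for both types.
        return str(charset)
    if msg.get_content_type() == "text/xml":
        # text/xml with no charset: US-ASCII, whatever the entity declares.
        return "us-ascii"
    # application/xml with no charset: defer to the entity itself.
    return None
```

So charset_from_mime("text/xml") yields "us-ascii" even for a document
that declares encoding="Big5", while charset_from_mime("application/xml")
yields None and leaves the decision to the XML processor.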

Now, the default for text/* over HTTP is ISO-8859-1 and the default for
XML in the absence of an encoding declaration is UTF-8 or UTF-16. 

My position is that the most preferable option when registering text/xml
would have been to use the rules in the XML spec (UTF-8 or UTF-16, unless
there is an encoding declaration).

> For example, Sun's XML processor handles about 140 encodings
> at last count ... and _does_ conform to RFC 2376.

You mean, when receiving a message body labelled as text/xml (via email
or via HTTP) it ignores the encoding declaration, assumes US-ASCII,
signals a fatal error because of invalid byte sequences in the file and
then halts? Great ;-(
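
A small Python sketch of that failure, assuming a strict decoder and a
made-up sample document (SAMPLE and both function names are
hypothetical):

```python
# Hypothetical sample: a Latin-1 document that labels itself correctly.
SAMPLE = (
    '<?xml version="1.0" encoding="ISO-8859-1"?><p>caf\u00e9</p>'
).encode("iso-8859-1")


def decode_strictly(body: bytes, charset: str) -> str:
    """Decode body as charset; raise UnicodeDecodeError on any invalid byte."""
    return body.decode(charset, errors="strict")


def is_fatal(body: bytes, charset: str) -> bool:
    """True if decoding body as charset hits an invalid byte sequence."""
    try:
        decode_strictly(body, charset)
    except UnicodeDecodeError:
        return True
    return False
```

Here is_fatal(SAMPLE, "us-ascii") is true because of the single 0xE9
byte, while is_fatal(SAMPLE, "iso-8859-1") is false: the author's own
label was right all along.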

> > So, this RFC removes at a stroke the possibility of authors correctly
> > labelling the encoding of their XML documents and takes us back to that
> > dark time (the present) when the majority of, say, Japanese Web content
> > was mis-labelled. And it seems to have done this simply to save a very
> > small part of coding effort for people writing transcoders.
> 
> Again, no it doesn't.  The idea is to get the web server to
> attach the correct MIME content type, which is NOT "text/xml"
> in many/most cases. 

So, your position is that since text/xml is unusable, best use
application/xml instead? Surely it would have been better not to make
text/xml unusable? Or if that was thought unreasonable, then why
register text/xml at all?

> Authors must rely on the administrator
> not breaking their content, and this is part of it.

Authors would love to rely on this, but have learned not to.

The vast majority of content authors have *no control whatsoever* over
server configuration. This isn't 1993; assuming that the person who
wrote the content is also the person who administers the server is
totally unwarranted.

99.9995% of folks sign up with an ISP, get around 5 Megs of web space,
and are allowed to upload documents there. They share that server with
thousands of other users. The server is not chosen by them and is
configured with all the default settings, and the ISP will not change
them no matter how many reasoned emails users send. So users cannot
choose the MIME type that is used, and certainly do not have the control
to have different documents served with different MIME parameters
depending on the encoding of each document.

That is my concern: control is removed from the users (who author the
documents, and are in a position to do the right thing) and put in the
hands of ISP administrators (who are installing new web servers at a
rate of several a day, and do not want any special cases or anything
that is not right out of the box).

Merely saying "so, ignore text/xml and use application/xml" does not
help matters; it's a workaround, not a solution.

[1] http://www.w3.org/TR/REC-xml#charencoding

--
Chris
