Plug and Play XML

Sat Dec 20 14:21:58 GMT 1997

> From: Peter Murray-Rust <peter at ursus.demon.co.uk>

> What will differentiate a text/xml document from an application/xml one?

When is each appropriate?  I think the idea is to use text/xml in the 
normal case, and application/xml as a fallback. 

I think I first suggested it, but it certainly was not my preferred 
option: I would prefer everything to be application/xml, because I do
not like the idea of dumb HTTP/MIME systems fiddling and transcoding data,
which they may do for text/xml. Application/xml is a binary transmission;
no bits are molested en route.

The trouble with text/xml is that XML positively encourages the use
of all ISO 10646 characters, for example all the symbol and publishing
characters. If the data is "transcoded" en route from a large character
set encoding (e.g. Unicode or an East Asian one) to a small encoding
(e.g. 8859-n), then a dumb transcoder will not translate a character
outside the target repertoire into its numeric character reference;
it will probably swallow it, or put out something strange.
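
To make the failure concrete, here is a minimal sketch in Python. The
sample string is made up, and the three error handlers merely stand in
for what a dumb or a smart transcoder might do when narrowing to
ISO 8859-1; real gateways may behave differently.

```python
# Hypothetical sample: U+2264 and U+2022 are outside ISO 8859-1,
# while U+00E9 (e with acute) is inside it.
text = "price: 5 \u2264 x \u2022 caf\u00e9"

# A dumb transcoder that silently swallows what it cannot represent.
swallowed = text.encode("iso-8859-1", errors="ignore").decode("iso-8859-1")

# A dumb transcoder that puts out something strange instead.
replaced = text.encode("iso-8859-1", errors="replace").decode("iso-8859-1")

# An XML-aware transcoder that falls back to numeric character references.
referenced = text.encode("iso-8859-1", errors="xmlcharrefreplace").decode("iso-8859-1")

print(swallowed)   # price: 5  x  café
print(replaced)    # price: 5 ? x ? café
print(referenced)  # price: 5 &#8804; x &#8226; café
```

Only the last form can be restored by the receiving XML parser; in the
first two the characters are gone for good.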

In practice this means that XML document generators should emit all
characters above 127 as numeric character references rather than
directly. Smart intermediate XML systems should also attempt
to replace such characters in data and attribute values with numeric
character references.  Numeric character references are not recognized
in PIs and comments, so when you are devising your own PI notations
and comment conventions you should provide an equivalent mechanism
of your own.
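
As an illustration of a generator that follows this advice, the sketch
below uses Python's standard xml.etree.ElementTree module. The element
name, attribute name and sample values are invented for the example;
the point is only that serializing as US-ASCII forces everything above
127 in data and attribute values into numeric character references.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample data: one attribute value and one text node,
# both containing characters above 127.
item = ET.Element("item", {"note": "caf\u00e9"})
item.text = "\u6771\u4eac"

# Serializing as US-ASCII makes the writer emit every character above
# 127 in text and attribute values as a numeric character reference.
wire = ET.tostring(item, encoding="us-ascii")

# The receiving parser resolves the references back to the characters.
parsed = ET.fromstring(wire)

print(wire)
print(parsed.text == item.text, parsed.get("note") == item.get("note"))
```

The bytes in wire are all below 128, so an intermediate system has
nothing left to transcode. Note that this covers data and attribute
values only; it does nothing for names, comments or PIs.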

The unpleasant implication in all this is for native language markup.
If your XML data will be sent to users who use other scripts, do not
use characters in XML names that are not available in their regional
character sets. Numeric character references do not apply, currently,
to names. (I hope this will eventually be changed in SGML and XML;
I think the facts and the affected users will speak for themselves
in due time.)

This is why you should be conservative in your choice of name characters.
The characters below 128 are OK. The 128-255 range of characters in 8859-1
and ISO 10646 is probably pretty safe too. This problem arises even
within a nation, if the nation has a few different repertoires in common
use: in particular, in Japan, Unix systems using EUC have available several
thousand more kanji than older PC (i.e. Shift-JIS) and Macintosh systems.
It is probably prudent for Japanese users to use only those characters
available in Shift-JIS for naming.
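
One way to apply this check mechanically is sketched below in Python.
The function name unencodable is my own invention, and the codec names
("iso-8859-1", "shift_jis") are Python's labels for those encodings,
whose tables may not match every vendor's variant exactly.

```python
def unencodable(name, encodings):
    """Map each encoding to the characters of `name` it cannot represent.

    Encodings that can represent the whole name are left out, so an
    empty result means the name is safe for every encoding listed.
    """
    problems = {}
    for enc in encodings:
        missing = []
        for ch in name:
            try:
                ch.encode(enc)
            except UnicodeEncodeError:
                missing.append(ch)
        if missing:
            problems[enc] = missing
    return problems


# Hypothetical element names checked against two regional encodings.
targets = ["iso-8859-1", "shift_jis"]
print(unencodable("title", targets))          # {}
print(unencodable("\u540d\u524d", targets))   # kanji: fails for 8859-1 only
print(unencodable("donn\u00e9es", targets))   # accented Latin: fails for Shift-JIS only
```

A name that yields an empty result for every encoding your readers are
likely to use is a conservative choice in the sense described above.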

None of these considerations were new for the XML discussion: what was
new was that XML works with a particular operating model that says that
documents must cope with HTTP/MIME systems but also must provide
enough information to create the MIME headers in the first place. 
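
A rough sketch of that second requirement, in Python: deriving a MIME
Content-Type value from what the document says about itself. The
function name content_type_for is hypothetical, the sketch assumes the
document is in an ASCII-compatible encoding so that the declaration
can be read as bytes, and it assumes UTF-8 when no encoding is
declared; a real server would also have to deal with byte order marks
and UTF-16.

```python
import re

# Matches an XML declaration at the very start of the document and
# captures the value of its encoding pseudo-attribute, if present.
_ENCODING_DECL = re.compile(
    rb"^<\?xml[^>]*?encoding\s*=\s*[\"']([A-Za-z][A-Za-z0-9._-]*)[\"']"
)


def content_type_for(document, media_type="text/xml"):
    """Build a Content-Type value from the document's own declaration."""
    match = _ENCODING_DECL.match(document)
    # Assumption: fall back to UTF-8 when nothing is declared.
    charset = match.group(1).decode("ascii") if match else "UTF-8"
    return f"{media_type}; charset={charset}"


print(content_type_for(b'<?xml version="1.0" encoding="ISO-8859-1"?><a/>'))
# text/xml; charset=ISO-8859-1
print(content_type_for(b"<a/>", media_type="application/xml"))
# application/xml; charset=UTF-8
```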

The restriction that numeric character references cannot be used
in markup, just in data and attribute values, comes from the old
character model of SGML. In this model, it made no sense to
allow numeric character references in names, and indeed doing so
would have been considered bad, because it would create markup that
could not be read in a simple editor.

XML is probably one of the most thoroughly internationalized software
systems around: in particular, this internationalization has been
in place and under discussion from the very beginning, and not
"tacked on". Internationalization (i18n) is one area of XML that
is bound to be difficult for parser writers to get right. But the
benefit is that once they have it right, it makes life much simpler
and richer for users.  This is not to say that XML i18n is perfect,
but it is certainly near state-of-the-art, given the need to fit
in with HTTP/MIME and operating systems. I certainly hope that XML
will not remain "state-of-the-art" for long, and that advances
in various technologies will overtake it: in particular, agreement
among operating system vendors on a charset/encoding labelling schema
that they all implement in their OS (or the adoption of MIME as a
file format, e.g. .MIM).

Rick Jelliffe

xml-dev: A list for W3C XML Developers. To post, mailto:xml-dev at ic.ac.uk
Archived as: http://www.lists.ic.ac.uk/hypermail/xml-dev/
List coordinator, Henry Rzepa (mailto:rzepa at ic.ac.uk)