SDATA or UNICODE

Thu Jan 29 05:13:12 GMT 1998

> From: Paul Prescod <papresco at technologist.com>

> On Wed, 28 Jan 1998, Gavin McKenzie wrote:
> > 
> > XML provides a way for specifying the encoding of an entity with the
> > ?XML pi encoding declaration.  Why wouldn't this be sufficient?  If the
> > euro or florin symbol is available in some non-Unicode character
> > encoding scheme, isn't it sufficient to encode the text which requires
> > the symbol in the appropriate scheme and use the encoding declaration?
> 
> No, for the reason Tim points out. On the other hand, you might be on the 
> right track. A processing instruction would serve as a hack to tell the 
> application where to insert the euro. <?EURO>

Underlying XML's decisions is the SGML model, which separates the
encoding of data (i.e. "storage management") from its logical representation
as a stream of characters in a single character set (i.e. "entity management").

This is a very flexible model, since it allows any system of encoding that
anyone can dream up to be used without having to alter XML/SGML: an entity
can be sourced from files, multipart MIME, databases, random number generators,
standard input, anything.  Allowing multiple encodings within a single XML
file, delimited by PIs, elements or internal entities, would violate
this model, and I would strongly recommend against it. If your customers
require multiple encodings, then they have to source each one from a separate
external entity. These entities can be bundled up or interleaved in any
fashion you like, but this is a *PRE* XML storage management issue, not
an XML issue. 
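
A minimal sketch of that separation, in Python: a hypothetical "storage
manager" function decodes each external entity from its own encoding, and
only the resulting single stream of characters reaches the parser. The name
assemble_document, the <doc> wrapper and the sample fragments are
illustrative assumptions, not part of XML or of any particular tool.

```python
import xml.etree.ElementTree as ET


def assemble_document(entities):
    """Decode each (raw_bytes, encoding) pair, then parse the joined characters.

    Hypothetical storage-management step: the parser never sees bytes or
    encoding labels, only one stream of Unicode characters.
    """
    decoded = [raw.decode(encoding) for raw, encoding in entities]
    return ET.fromstring("<doc>" + "".join(decoded) + "</doc>")


# Two well-formed fragments, each stored in a different encoding.
entities = [
    ("<p>caf\u00e9</p>".encode("iso-8859-1"), "iso-8859-1"),
    ("<p>\u65e5\u672c\u8a9e</p>".encode("shift_jis"), "shift_jis"),
]
root = assemble_document(entities)
print([p.text for p in root])
```

How the fragments are bundled or interleaved on disk is invisible to the
parser; only the decoding step has to know.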

I think there is a great desire that XML will be a Trojan horse to force
the development of wide-character applications, and Universal Character 
Set-using ones (UCS = ISO 10646 ~= Unicode) in particular. 
I, for one, hope that by disconnecting encoding and character "repertoire", 
XML will marginalise the character encoding issue to the extent that 
it will become easier to use Unicode than to use a regional encoding, 
in the long run.

> I think you should implement a language that allows this and is preprocessed 
> into XML. If I were you I would use marked sections and not attributes to 
> describe the boundaries. Marked sections are really easy to scan for.

But once you have changed encodings, do you scan for the end of the
marked section using the old or the new encoding? This kind of ISO 2022
mode changing is what we are trying to get rid of from XML (and from
SGML).
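
To make the ambiguity concrete, here is a small Python sketch (an
illustration only, assuming Shift_JIS as the inner encoding). The katakana
letter ZO encodes to the two bytes 0x83 0x5D, and 0x5D is also the ASCII
code for "]", so a scanner still working in the old byte-oriented encoding
sees a "]]>" that a scanner working in the new encoding does not.

```python
# U+30BE (katakana ZO) followed by the two ASCII characters "]>".
text = "\u30be]>"
raw = text.encode("shift_jis")  # the trail byte of ZO is 0x5D, same as "]"


def ends_section_bytewise(data: bytes) -> bool:
    """Scan as if still in the outer single-byte encoding."""
    return b"]]>" in data


def ends_section_decoded(data: bytes) -> bool:
    """Scan after switching to the inner encoding (Shift_JIS assumed)."""
    return "]]>" in data.decode("shift_jis")


print(ends_section_bytewise(raw))  # True: a false terminator appears
print(ends_section_decoded(raw))   # False: the characters hold no "]]>"
```

The same bytes give two different answers, so the boundary of the section
depends on which mode the scanner happens to be in.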

So you can have multiple encodings before the parser, but they must not be
presented to the parser. The other choice is multiple encodings after the
parser: e.g. embed the SJIS text, encoded in a Latin-1-safe way. This is the
same as Dave's comment about transliteration using notation. You can have a
document like

<?xml version="1.0" encoding="ISO-8859-1"?>
<!DOCTYPE x SYSTEM "x.dtd"
[
	<!NOTATION sjis-Qencoded SYSTEM "SjisQ.pl">
	<!ELEMENT SJIS-SECTION ( #PCDATA ) >
	<!ATTLIST SJIS-SECTION
		I-need-decoding NOTATION ( sjis-Qencoded ) #FIXED "sjis-Qencoded" >
]>
<x>
...

<SJIS-SECTION><![CDATA[
smdkfjhhjwfnnweofijslkdm
]]></SJIS-SECTION>
...
</x>

(You cannot do the same thing using internal entities in XML, since you 
cannot put a notation on an internal entity declaration.)
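
As a hedged sketch of the application side (Python, standard library only):
the Shift_JIS bytes are quoted-printable encoded so that only ASCII reaches
the Latin-1 document, and the application undoes that after parsing. Treating
"sjis-Qencoded" as quoted-printable is an assumption about what a script like
SjisQ.pl would do; the function names are hypothetical, and the attribute is
set explicitly because this sketch carries no DTD.

```python
import quopri
import xml.etree.ElementTree as ET


def embed_sjis(text):
    """Return an ISO-8859-1 XML document carrying text as Q-encoded Shift_JIS."""
    sjis_bytes = text.encode("shift_jis")
    # Quoted-printable output is pure ASCII, so it is safe inside Latin-1.
    q_encoded = quopri.encodestring(sjis_bytes).decode("ascii")
    root = ET.Element("x")
    section = ET.SubElement(root, "SJIS-SECTION")
    section.set("I-need-decoding", "sjis-Qencoded")
    section.text = q_encoded
    return ET.tostring(root, encoding="iso-8859-1")


def extract_sjis(document):
    """After parsing, decode every section flagged with the notation name."""
    root = ET.fromstring(document)
    results = []
    for section in root.iter("SJIS-SECTION"):
        if section.get("I-need-decoding") == "sjis-Qencoded":
            raw = quopri.decodestring((section.text or "").encode("ascii"))
            results.append(raw.decode("shift_jis"))
    return results


document = embed_sjis("\u65e5\u672c\u8a9e")
print(document.decode("iso-8859-1"))
print(extract_sjis(document))
```

The parser only ever sees one encoding; the second encoding is the
application's business, triggered by the notation name on the element.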

Rick Jelliffe

xml-dev: A list for W3C XML Developers. To post, mailto:xml-dev at ic.ac.uk
Archived as: http://www.lists.ic.ac.uk/hypermail/xml-dev/