character encoding questions

Nicolas Paris nico at ais.Berger-Levrault.fr
Wed Jul 16 14:12:03 BST 1997


Christina Portillo  wrote:
> 
> My questions are:
> 1. How are the software vendors (browser, parser, authoring) planning on
> supporting documents which utilize the UNICODE character set?
> 

The Double Byte Edition of the Balise SGML/XML transformation tool codes
characters internally on 16 bits, which allows it to transform any Unicode
document transparently (see http://www.balise.com/).

Balise is able to parse, read and write the most common encoding schemes:
UCS-2, UTF-8, ISO-8859-[1-9], Shift-JIS, EUC-JP, EUC-KR, CN-GB, and Big5.

The Balise XML scanner switches to the appropriate decoder when the
XML PI changes the encoding. For instance, 
	<?XML version='1.0' encoding='ISO-8859-1' ?>
specifies that the stream should be interpreted according to the ISO Latin 1
encoding scheme. When reading or writing character files, you can specify
the encoding scheme Balise should use, and by this mechanism it can transform
from one encoding scheme to another (as long as they are compatible).
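To make the mechanism concrete, here is a minimal sketch in Python (not
Balise code) of reading the encoding named in the declaration and
transcoding the stream. The helper names declared_encoding and transcode are
hypothetical, and the sketch assumes the declaration itself is in an
ASCII-compatible encoding, so it would not work as is on a UCS-2 stream.

```python
import re

# Hypothetical helper pattern: the encoding pseudo-attribute of the declaration.
_DECL = re.compile(
    rb"""<\?xml[^>]*?encoding\s*=\s*["']([A-Za-z][A-Za-z0-9._-]*)["']""",
    re.IGNORECASE,
)

def declared_encoding(data: bytes, default: str = "utf-8") -> str:
    """Return the encoding named in the declaration, or the default."""
    m = _DECL.match(data)
    return m.group(1).decode("ascii") if m else default

def transcode(data: bytes, target: str = "utf-8") -> bytes:
    """Decode with the declared encoding, then re-encode in the target one.

    The declaration text is copied unchanged; a real tool would also
    rewrite the encoding name it announces.
    """
    text = data.decode(declared_encoding(data))
    return text.encode(target)

sample = b"<?XML version='1.0' encoding='ISO-8859-1' ?><p>caf\xe9</p>"
converted = transcode(sample)
```

With the sample above, the single byte 0xE9 of ISO-8859-1 becomes the two
bytes 0xC3 0xA9 of UTF-8.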

Because characters are coded internally on two bytes, the user sees a single
flat Unicode character set directly. This is particularly important for
operations like searches and sorts.
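A small Python sketch of why this matters. Python's str type stands in here
for the flat internal character set (an assumption made for illustration; it
says nothing about how Balise is implemented): positions and comparisons are
counted in characters, whereas on the encoded bytes they depend on the
encoding scheme.

```python
text = "naïve café"
utf8 = text.encode("utf-8")

# Search: the same match has a different offset in characters and in bytes,
# because "ï" takes two bytes in UTF-8.
char_pos = text.index("café")
byte_pos = utf8.index("café".encode("utf-8"))

# Sort: on decoded text the order follows Unicode code points
# (which is not the same thing as a language-specific collation).
words = ["zone", "école", "eau"]
by_code_point = sorted(words)
```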

The Single Byte Edition of Balise is able to support ISO-8859-1 and UTF-8
(in its ASCII subset).

> 2. a) Can all the characters referenced in ISO LAT,1 positions 0-256, be
> referenced in the document without benefit of escape codes? 
> 

Only the UCS-2 and UTF-8 encoding schemes are absolutely required by the XML
spec. To process characters in the range 160-255 of ISO-8859-1 directly, a
tool must also support the ISO-8859-1 encoding scheme. If ISO-8859-1 is not
available, then you must code those characters with character references.

> 2. b) What about positions 0-125? 

Characters of ISO Latin 1 between 32 and 127 (ASCII part) are OK because
they are mapped to the same place in most encoding schemes, including UTF-8.
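This can be checked quickly in Python. The list of encodings below is just
an illustrative selection, not the set Balise supports. Note that a 16-bit
scheme keeps the same character codes but not the same bytes, which is why
the claim holds for "most" schemes only.

```python
# Bytes 32..127 decoded under several 8-bit-compatible encoding schemes.
ascii_bytes = bytes(range(32, 128))
encodings = ("ascii", "utf-8", "iso-8859-1", "iso-8859-2", "iso-8859-7")
decoded = {enc: ascii_bytes.decode(enc) for enc in encodings}

# True when every scheme yields exactly the same characters.
same_everywhere = len(set(decoded.values())) == 1

# In a 16-bit scheme the code is the same but each character takes two bytes.
ucs2_like = "A".encode("utf-16-be")
```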

> 
> 2. c) Must the characters above 126 be escaped?
> 

No, if the appropriate encoding scheme (here ISO-8859-1) is used.
If not, you should use character references like &#233; or express the
desired character in the current encoding scheme (UCS-2 or UTF-8). 
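As an illustration, Python's standard "xmlcharrefreplace" error handler does
exactly this fallback: a character is written directly when the target
encoding scheme has it, and as a numeric character reference otherwise. The
function name encode_for is hypothetical, and the sketch does not escape
markup characters such as & or <.

```python
def encode_for(text: str, encoding: str) -> bytes:
    """Encode text, falling back to &#NNN; for unrepresentable characters."""
    return text.encode(encoding, errors="xmlcharrefreplace")

direct = encode_for("été", "iso-8859-1")   # characters written as single bytes
escaped = encode_for("été", "ascii")       # characters written as references
as_utf8 = encode_for("été", "utf-8")       # characters written as two bytes each
```

Here direct is b"\xe9t\xe9", escaped is b"&#233;t&#233;", and as_utf8 is
b"\xc3\xa9t\xc3\xa9".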

For ISO Latin 1, the mapping of every character is the same as in Unicode.
This is not true for the other ISO-8859 encoding schemes, nor for ISO TECH,
ISO PUB, ...
This means that tools using an 8-bit internal representation are obliged to
code such characters internally in an escaped form, which may be inefficient
or inadequate for some encodings and some processing.
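A short Python check of the difference, using ISO-8859-2 as one example of
the "other" ISO-8859 schemes (the choice of ISO-8859-2 is mine, for
illustration):

```python
# ISO-8859-1: every byte value equals the Unicode code of its character.
latin1_is_identity = all(
    bytes([b]).decode("iso-8859-1") == chr(b) for b in range(256)
)

# ISO-8859-2: some byte values map to characters with codes above 255,
# so an 8-bit internal representation cannot hold them directly.
latin2_moved = [
    b for b in range(256) if bytes([b]).decode("iso-8859-2") != chr(b)
]
example = bytes([0xA1]).decode("iso-8859-2")   # U+0104, not U+00A1
```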


> 3. At what point in the ISO10646 character set must escaping be
> instituted in order to reference a character within the set?
> 

Character references (like &#233;) are a convenience that covers any Unicode
character, even one that is not compatible with the encoding scheme of the
document. Tools like Balise can be used to transform documents between any
character formats: special characters can be coded directly by their Unicode
character code (if compatible with the encoding scheme), as XML character
references, or as SGML SDATA entity references. You can use Balise at different
steps of your process to adapt your data to the capabilities and limitations
of other tools.
When tools do not code characters internally on 16 bits, they are obliged
to keep the characters they cannot represent in an escaped form.
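For the opposite direction, here is a minimal Python sketch of turning
numeric character references back into the characters themselves, as a tool
with a full Unicode internal representation could do. The function name
expand_char_refs is hypothetical. Named entities such as &amp; are
deliberately left alone, and no check is made that the number is a legal
XML character.

```python
import re

# Decimal (&#233;) and hexadecimal (&#xE9;) numeric character references.
_CHAR_REF = re.compile(r"&#(?:x([0-9A-Fa-f]+)|([0-9]+));")

def expand_char_refs(text: str) -> str:
    """Replace each numeric character reference by the character it names."""
    def repl(match: re.Match) -> str:
        hex_digits, dec_digits = match.groups()
        code = int(hex_digits, 16) if hex_digits else int(dec_digits)
        return chr(code)
    return _CHAR_REF.sub(repl, text)

expanded = expand_char_refs("&#233;t&#xE9; &amp; hiver")
```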

  
--------------------------------------------------------------------------------
Nicolas Paris                                         AIS Software
tel. : (33+1) 40 64 43 00                             17 rue Remy Dumoncel
fax. : (33+1) 40 64 43 10                             75014 Paris
email: nico at AIS.Berger-Levrault.fr                    FRANCE
web:   http://www.balise.com/

xml-dev: A list for W3C XML Developers
Archived as: http://www.lists.ic.ac.uk/hypermail/xml-dev/
To unsubscribe, send to majordomo at ic.ac.uk the following message;
unsubscribe xml-dev
List coordinator, Henry Rzepa (rzepa at ic.ac.uk)



