Character encoding questions

Thu Jun 26 06:52:52 BST 1997

> I was struck by the following sentence in the Microsoft XML White Paper:
> 
>   XML supports a range of encodings...subject only to the restriction
>   that an entire document must share the same encoding.
>   
> My immediate reaction was that that wasn't correct, although the
> definition of "document" above isn't obvious to me (for example, are
> external entities part of a document?).  However, when checking into the
> XML April specification, I got in over my head.  I am hoping that someone
> here will help me out of my hole.
> 
> If my XML document is a simple Unicode text file then I begin it like
> the following
> 
>   a Byte Order Mark
>   <?XML version="1.0" encoding="ISO-10646-UCS-2"?>
>   ...
> 
> with the Byte Order Mark being required even though an EncodingDecl is
> used?  (I would have said "yes" until I got to Appendix E "Autodetection
> of Character Sets," which worries about detecting UCS-2 when there
> is no Byte Order Mark.)  Is the EncodingDecl necessary if the file
> starts with a Byte Order Mark?
> 
> Where can I have an EncodingPI?  Section 4.3.3 talks about their being
> "at the beginning of a system entity, before any other character data or
> markup" but doesn't define "system entity" (perhaps one that has an
> ExternalID that contains "SYSTEM"?).  If my document references an
> external entity, then I believe that the external entity must start
> with an EncodingPI (see Appendix E "Autodetection of Character Sets")
> if it isn't in UTF-8 or start with a Byte Order Mark.
>
In classical SGML this information is contained in the system
declaration, where one or more character sets can be declared together
with the control characters used to switch between them, following
ISO 2022 and related standards. These declarations are read before the DTD.

However, if I understand the XML proposals correctly, they do not envisage
a system declaration. The best information on system declarations is a
white paper from OmniMark and an article in TAG by Wayne Wohler. On
character sets you might have a look at my article in CHUM a couple of
years ago. I have a preprint in PostScript available by ftp if you want to
see it. It omits the character set tables, for which ISO claims copyright.

With the implementation of Unicode/UCS we no longer need all those
mechanisms based on control characters, which are too susceptible to
corruption. All the characters you need (or almost all, in my case) are in
the new character set.
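
To make the Byte Order Mark question above a little more concrete, here is
a minimal Python sketch of how a parser might guess the encoding of an
entity from its first bytes. It is only an illustration under stated
assumptions: the function name sniff_encoding is hypothetical, the byte
patterns follow the general idea of autodetection (a Byte Order Mark, or
the characters "<?" seen as 16-bit units), and the exact rules in the
April draft may well differ.

```python
import re

# Hypothetical helper, not taken from any XML specification or library.
def sniff_encoding(data: bytes) -> str:
    """Guess the encoding of an XML entity from its leading bytes."""
    # A Byte Order Mark settles both the encoding family and the byte order.
    if data.startswith(b"\xfe\xff"):
        return "utf-16-be"
    if data.startswith(b"\xff\xfe"):
        return "utf-16-le"
    # No BOM: "<?" stored as two 16-bit units still reveals the byte order.
    if data.startswith(b"\x00<\x00?"):
        return "utf-16-be"
    if data.startswith(b"<\x00?\x00"):
        return "utf-16-le"
    # ASCII-compatible start: read the encoding name from the declaration.
    match = re.match(
        rb"<\?xml[^>]*?encoding\s*=\s*[\"']([A-Za-z][A-Za-z0-9._-]*)[\"']",
        data,
        re.IGNORECASE,
    )
    if match:
        return match.group(1).decode("ascii").lower()
    # Assumption: with no BOM and no declaration, fall back to UTF-8.
    return "utf-8"
```

The sketch suggests why the two mechanisms overlap: for a 16-bit file the
first bytes already give the answer, while the declaration only becomes
readable once the parser has guessed enough to decode it.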

The other option in classic SGML is to use a subdoc. As far as I can
remember it can contain its own DTD, but I don't think it can have its own
system declaration. My docs are at the office.

Harry Gaylord
former chair TEI committee on character sets
member ISO SC2 and NNI shadow committee


xml-dev: A list for W3C XML Developers
Archived as: http://www.lists.ic.ac.uk/hypermail/xml-dev/
To unsubscribe, send to majordomo at ic.ac.uk the following message;
unsubscribe xml-dev
List coordinator, Henry Rzepa (rzepa at ic.ac.uk)