Unix/Java design issues (Was: Re: Is CDATA "structure"?)

Hunter, David dhunter at Mobility.com
Wed Jul 21 20:47:46 BST 1999


From: Arthur Rother [mailto:arthur.rother at ovidius.com]

> For sure the LF vs CRLF and CR in theory (the spec) and for viewing in
> Notepad is all correctly debated or noted, but pragmatically, does this
> really provide a problem? The encoding for XML is UTF-8. So in almost any
> text viewer/editor, in normal(?) circumstances it will show strange in
> these applications, since they do not understand UTF-8 (in windows).
> The API on XML, for example DOM, is also UTF-8, which most applications may
> treat as 7-bit ASCII, but for encoding generic applications this should be
> treated as UTF-8. Windows is not UTF-8 aware, so it has to be converted to
> Unicode anyway.
<snip/>

Actually, the XML spec states in section 2.2 that "A character is an atomic
unit of text as specified by ISO/IEC 10646" - in other words, Unicode.
Since there are different ways of storing Unicode characters, XML processors
are allowed to accept Unicode in any of these formats, and it even goes on
to state that "all XML processors <em>must</em> accept the UTF-8 and UTF-16
encodings of 10646" (emphasis added), since [I believe] UTF-8 and UTF-16 are
the most common ways to store Unicode characters.  <aside>XML processors are
also allowed to <em>accept</em> data in any other encoding they want as
well, as long as the data is converted to Unicode.  At least, I believe
that's how Microsoft reads the spec, because I had to study this crap at
great length and talk to Microsoft many times for my multi-lingual
application.  :-) </aside>
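
To make the encoding point concrete, here is a minimal sketch in Python (my choice of language for illustration; nothing in this thread uses it). It assumes the standard-library xml.etree.ElementTree parser, which sits on top of expat, and the helper name parse_text is made up for the example. The same text is serialized as UTF-8, UTF-16, and ISO-8859-1, and the parser should hand back identical Unicode characters each time:

```python
import xml.etree.ElementTree as ET

# Sample text with non-ASCII characters: e-acute and Greek capital omega.
TEXT = "caf\u00e9 \u03a9"


def parse_text(text, declared_encoding, codec):
    """Serialize a tiny document in the given encoding, then parse it back.

    parse_text is a hypothetical helper for this illustration only.
    """
    doc = '<?xml version="1.0" encoding="%s"?><p>%s</p>' % (declared_encoding, text)
    # Passing bytes lets the parser do its own encoding detection.
    return ET.fromstring(doc.encode(codec)).text


# The two encodings every XML processor must accept.
utf8_text = parse_text(TEXT, "UTF-8", "utf-8")
utf16_text = parse_text(TEXT, "UTF-16", "utf-16")  # Python's codec prepends a BOM

# An optional encoding that this particular parser also happens to accept.
latin1_text = parse_text("caf\u00e9", "ISO-8859-1", "iso-8859-1")
```

Whatever the bytes on disk looked like, the application only ever sees Unicode characters, which is the behaviour the spec passage above describes. Support for encodings beyond UTF-8 and UTF-16 varies from parser to parser, so the ISO-8859-1 case should be read as an assumption about this one parser.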

Windows NT is perfectly Unicode aware, and I routinely view XML documents in
Notepad on my NT box.  All of the characters are fine, with the only problem
being the LF-CRLF-CR problem that started this thread in the first place.  I
am 87% sure that Windows 95 uses the windows-1250 or windows-1252 character
set internally, although it may also have some level of Unicode awareness.
(I'm not sure about that.)  And I haven't the faintest idea what character
set Windows 98 uses natively, although I'd like to hope that it's Unicode.
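
For what it's worth, the "shows strange" effect Arthur describes is easy to reproduce. The sketch below (Python again, purely for illustration) assumes a viewer that interprets every byte as windows-1252, which is my guess at what a non-Unicode Windows editor would do, not something I have verified on Windows 95:

```python
# Text with two non-ASCII characters: i-diaeresis and e-acute.
text = "na\u00efve caf\u00e9"

# What a UTF-8 encoded XML file would contain on disk.
utf8_bytes = text.encode("utf-8")

# A viewer that assumes windows-1252 turns each two-byte UTF-8 sequence
# into two separate characters; the plain ASCII letters come through unchanged.
misread = utf8_bytes.decode("cp1252")

# A Unicode-aware viewer decodes the same bytes back to the original text.
correct = utf8_bytes.decode("utf-8")
```

Here misread comes out as "naÃ¯ve cafÃ©" while correct matches the original, so a document that sticks to 7-bit ASCII looks the same either way and only the non-ASCII characters get mangled.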
