Feeler for SML (Simple Markup Language)

David Brownell david-b at pacbell.net
Mon Nov 15 22:58:17 GMT 1999


Tim Bray wrote:
> 
> At 01:08 PM 11/15/99 -0800, David Brownell wrote:
> >> The UTF-*'s are logically equivalent to most users, in that they share
> >> the property that almost no real-world data objects are encoded in either.
> >
> >Quite true, from what I know, if you don't consider all the documents
> >encoded in ASCII (which is a subset of UTF-8).  Many of them aren't
> >tagged as to encoding; assert they're UTF-8 not ASCII, and disproof is
> >often going to be impossible!
> 
> I used to think so too, but actually, if you look closely, the proportion
> of "ascii" that's actually pure US-ASCII is not that high. 

Well, ASCII is ASCII -- if it's not pure, it's not ASCII (and
hence it's not usable as UTF-8 either).  ASCII uses only seven
bits; always has (modulo parity), and I can't see that changing.

But while that's key to what I was saying (if it _really_ is
ASCII, it's also UTF-8, and there's lots of real ASCII), I
suspect that was likely not what you were getting at there.


>	 The prevalence
> of é's and õ's and so on these days is in my experience really growing,
> which means that the number of documents which are ideally ISO-8859-1
> but in fact some Microsoft codepage is really immense.  -T.

Those characters are actually in ISO-8859-1, but I understand that
Microsoft does cause real problems by its use of many characters
that are reserved in 8859-1 ... look at the number of web pages
with strange characters where you should have &ldquo; or &rdquo;
(but hmm, not all browsers accept those entities anyway).
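
To make the codepage clash concrete, here is a small Python sketch. The
sample bytes are my own illustration, not taken from any real page. It
decodes the same two bytes once as Windows-1252 and once as ISO-8859-1.

```python
# Bytes 0x93 and 0x94 are curly double quotes in Windows-1252, but they
# fall in the C1 control range (0x80-0x9F) of ISO-8859-1.
raw = b"\x93quoted\x94"  # hypothetical sample, for illustration only

as_cp1252 = raw.decode("cp1252")
as_latin1 = raw.decode("iso-8859-1")

# repr() keeps the invisible control characters readable.
print(repr(as_cp1252))
print(repr(as_latin1))

# True when the first and last characters are C1 controls, i.e. when the
# document was really written in a Microsoft codepage.
looks_like_cp1252 = all(0x80 <= ord(c) <= 0x9F
                        for c in (as_latin1[0], as_latin1[-1]))
```

A reader that trusts an ISO-8859-1 label ends up with control characters
where the author meant quotation marks, which is presumably what those
strange characters on web pages are.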

Assert that one of those documents is ASCII, and disproof is trivial:
some character has the eighth bit set.  (When was the last time you
saw a document using it for parity?  A LONG time ago, for me!)  Since
it's not ASCII, and a lone codepage byte with the high bit set isn't a
well-formed UTF-8 sequence, you clearly can't read it as UTF-8 either.
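
A rough sketch of both checks in Python, with helper names (is_ascii,
is_utf8) and sample strings that are mine, purely for illustration.
Note the third case: text that was deliberately encoded as UTF-8 fails
the ASCII test but still decodes fine. It is the single-byte codepage
data that fails both.

```python
def is_ascii(data: bytes) -> bool:
    """True if no byte has the eighth bit set."""
    return all(b < 0x80 for b in data)


def is_utf8(data: bytes) -> bool:
    """True if the bytes form a well-formed UTF-8 sequence."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


plain = b"plain text"                                # pure seven-bit ASCII
latin1 = "caf\u00e9 au lait".encode("iso-8859-1")    # e-acute is one byte, 0xE9
utf8 = "caf\u00e9 au lait".encode("utf-8")           # e-acute is two bytes, 0xC3 0xA9

for label, data in [("plain", plain), ("latin1", latin1), ("utf8", utf8)]:
    print(label, is_ascii(data), is_utf8(data))
```

Real ASCII passes both tests, which is the point about untagged ASCII
documents being safely readable as UTF-8.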

- Dave
