About sml and internationalization

Niels Möller nisse at lysator.liu.se
Mon Nov 29 16:06:37 GMT 1999


Sean McGrath <digitome at iol.ie> writes:

> I am thinking about the issue to do with allowing/disallowing
> sets of Unicode characters in element type names as per XML
> 1.0.
> 
> If SML has very few special tokens
> e.g. "<", "&" and whitespace, what would happen
> if any character outside this teeny weeny set is
> allowed in an element type name.

I would say this is the way to go. And I have seen it done before,
both with eight-bit charsets like latin1 and with unicode.

It gives people the ability to shoot themselves in the foot by using
strange characters (my favourite is using non-breakable space in
variable names in emacs lisp). But I still think it is the way to go:
The parser and language can define a small set of characters as
special, and just pass on whatever is between those special characters
to the application.
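
As a rough illustration of that idea, here is a minimal sketch in
Python. The set of special characters is an assumption: the message
only names "<", "&" and whitespace, and ">" and "/" are added here
just so the sample input looks like markup. The name tokenize is
hypothetical too.

```python
# Hypothetical special set; only "<", "&" and whitespace come from the
# discussion, ">" and "/" are assumptions made for this example.
SPECIALS = frozenset("<>&/ \n")

def tokenize(text):
    """Yield each special character on its own, and every run of
    non-special characters as one opaque token."""
    run = []
    for ch in text:
        if ch in SPECIALS:
            if run:
                yield "".join(run)
                run = []
            yield ch
        else:
            run.append(ch)
    if run:
        yield "".join(run)

# A non-breakable space (U+00A0) inside the name is passed through
# untouched, because the tokenizer does not treat it as special.
tokens = list(tokenize("<na\u00a0me>\u00e5\u00e4\u00f6</na\u00a0me>"))
```

The tokenizer never inspects what is inside a run; deciding whether
a name containing strange characters is acceptable is left to the
application.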

If you think about it this way, most of the charset considerations can
be removed from the parser. Treat the input as a sequence of
non-negative integers (which may be 7, 8 or 36 bits wide, depending on
the application; if you think in C++, the parser could be a template
parameterized on the character type). If an application needs to
handle several charsets, it can use something like a content-type:
text/sml; charset = iso-8859-2 header to convert the input into
unicode before feeding it into the parser.
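
A small sketch of this, in Python rather than as a C++ template: the
splitter below only ever sees non-negative integers, and the charset
conversion happens before it is called. The names split_codes and
to_code_points, and the choice of special codes, are assumptions
made for the example.

```python
# ASCII values of the (assumed) special characters: "<", "&", SPC, NL.
LT, AMP, SPC, NL = 0x3C, 0x26, 0x20, 0x0A
SPECIAL_CODES = frozenset({LT, AMP, SPC, NL})

def split_codes(codes, specials=SPECIAL_CODES):
    """Split any iterable of non-negative integers into tuples:
    one-element tuples for special codes, longer runs for the rest."""
    tokens, run = [], []
    for code in codes:
        if code in specials:
            if run:
                tokens.append(tuple(run))
                run = []
            tokens.append((code,))
        else:
            run.append(code)
    if run:
        tokens.append(tuple(run))
    return tokens

def to_code_points(raw, charset):
    """Done by the application, not the parser: decode bytes using the
    charset named in e.g. a content-type header."""
    return [ord(ch) for ch in raw.decode(charset)]

raw = b"<\xb3 x"
# An application that only knows one eight-bit charset feeds bytes as is
# (iterating over a bytes object yields integers).
as_bytes = split_codes(raw)
# An application handling several charsets converts to unicode first.
as_unicode = split_codes(to_code_points(raw, "iso-8859-2"))
```

The same split_codes handles both inputs; only the integers between
the special codes differ (0xB3 as a raw byte, 0x142 after decoding
from iso-8859-2).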

One could define the special characters more abstractly, and leave it
to the application to tell the parser how an "<" is represented today,
but I think that's overabstracting things. Using plain ascii values
(possibly embedded into an ascii superset like unicode or latin-2)
should be good enough.

This line of thinking also means that "whitespace", as far as the
parser is concerned, should be limited to a few ascii characters. SPC
and NL ought to be enough. To keep with tradition, perhaps TAB and CR
as well. Having the parser recognize all unicode whitespace characters
as whitespace adds some complexity. (There are 5 spacing control
characters in traditional ASCII, plus ordinary space, non-breakable
space (in latin-x and unicode), and an additional 18 in the rest of
unicode. I.e. 25 in all).
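
To make the restricted notion of whitespace concrete, here is a
minimal Python sketch. It assumes the four-character set (SPC, NL,
TAB, CR) suggested above; the function names are hypothetical.

```python
# The only whitespace the parser knows about: SPC, NL, TAB, CR.
PARSER_WHITESPACE = frozenset({0x20, 0x0A, 0x09, 0x0D})

def is_parser_whitespace(code):
    """True only for the four ASCII values above, whatever the charset."""
    return code in PARSER_WHITESPACE

def split_words(codes):
    """Split a sequence of character codes on parser whitespace."""
    words, current = [], []
    for code in codes:
        if is_parser_whitespace(code):
            if current:
                words.append(current)
                current = []
        else:
            current.append(code)
    if current:
        words.append(current)
    return words

# Non-breakable space (U+00A0) and em space (U+2003) do not separate
# words here; TAB and CR NL do.
codes = [ord(c) for c in "a\u00a0b\tc\u2003d\r\ne"]
words = ["".join(chr(c) for c in w) for w in split_words(codes)]
```

The parser needs no unicode tables for this; an application that
wants the other spacing characters treated as whitespace can
normalize them before or after parsing.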

/Niels

xml-dev: A list for W3C XML Developers. To post, mailto:xml-dev at ic.ac.uk
Archived as: http://www.lists.ic.ac.uk/hypermail/xml-dev/ and on CD-ROM/ISBN 981-02-3594-1
To unsubscribe, mailto:majordomo at ic.ac.uk the following message;
unsubscribe xml-dev
To subscribe to the digests, mailto:majordomo at ic.ac.uk the following message;
subscribe xml-dev-digest
List coordinator, Henry Rzepa (mailto:rzepa at ic.ac.uk)