languages with full unicode support
Joachim Durchholz
jo at durchholz.org
Sat Jul 1 03:46:50 EDT 2006
Chris Uppal wrote:
> Joachim Durchholz wrote:
>
>>> This is implementation-defined in C. A compiler is allowed to accept
>>> variable names with alphabetic Unicode characters outside of ASCII.
>> Hmm... that code would be nonportable, so C support for Unicode is
>> half-baked at best.
>
> Since the interpretation of characters which are yet to be added to
> Unicode is undefined (will they be digits, "letters", operators, symbols,
> punctuation.... ?), there doesn't seem to be any sane way that a language could
> allow an unrestricted choice of Unicode in identifiers.
I don't think this is a problem in practice. E.g. if a language uses the
usual definition for identifiers (first letter, then letters/digits),
you end up with a language that changes its definition on the whims of
the Unicode consortium, but that's less of a problem than one might
think at first.
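To make the "usual definition" concrete, here is a minimal sketch in Python, assuming that "letter" means any Unicode general category starting with L and "digit" means category Nd. The helper names are made up for illustration, and this is not the rule any particular language actually uses (Python itself, for instance, has its own definition behind str.isidentifier).

```python
import unicodedata

def is_letter(ch):
    # Assumption: "letter" = general categories Lu, Ll, Lt, Lm, Lo,
    # all of which start with "L".
    return unicodedata.category(ch).startswith("L")

def is_digit(ch):
    # Assumption: "digit" = general category Nd (decimal digit).
    return unicodedata.category(ch) == "Nd"

def is_identifier(name):
    # First character must be a letter, the rest letters or digits.
    if not name or not is_letter(name[0]):
        return False
    return all(is_letter(c) or is_digit(c) for c in name[1:])

# The answers depend on the Unicode tables bundled with the interpreter.
print(unicodedata.unidata_version)
print(is_identifier("größe"), is_identifier("1x"))
```

Since the result depends on whatever tables unicodedata was built with, the same source text can be accepted by a newer implementation and rejected by an older one, which is exactly the "definition changes over time" effect.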
I'd expect two kinds of changes in character categorization: additions
and corrections. (Any other?)
Additions are relatively unproblematic. Existing code will remain valid
and retain its semantics. The new characters will be available for new
programs.
There's a slight technological complication: the compiler needs to be
able to look up the newest definition. In other words, for a compiler to
run, it needs to be able to access http://unicode.org, or the language
infrastructure needs a way to carry around various revisions of the
Unicode tables and select the newest one.
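The "carry around various revisions" option could look roughly like the sketch below. Everything in it is hypothetical: the BUNDLED mapping, its one-entry toy tables, and the function names are stand-ins for whatever a real compiler would ship.

```python
def parse_version(text):
    # "5.0.0" -> (5, 0, 0), so that "10.0.0" sorts after "9.0.0"
    # (a plain string comparison would get that wrong).
    return tuple(int(part) for part in text.split("."))

def newest_table(tables):
    # tables maps a Unicode version string to a category table.
    if not tables:
        raise ValueError("no Unicode tables bundled")
    version = max(tables, key=parse_version)
    return version, tables[version]

# Hypothetical bundled revisions; each table is a one-entry toy
# excerpt (character -> general category), not a full Unicode table.
BUNDLED = {
    "4.0.1": {"a": "Ll"},
    "4.1.0": {"a": "Ll"},
    "5.0.0": {"a": "Ll"},
}

version, table = newest_table(BUNDLED)
print(version)
```

A compiler built this way would never need network access. It would simply lag behind until someone ships it a newer table.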
Corrections are technically more problematic, but then we can rely on
the common sense of the programmers. If the Unicode consortium
miscategorized a character as a letter, the programmers that use that
character set will probably know it well enough to avoid its use. It
will probably not even occur to them that that character could be a
letter ;-)
Actually I'm not sure that Unicode is important for long-lived code.
Code tends to not survive very long unless it's written in English, in
which case anything outside of strings is in 7-bit ASCII. So the
majority of code won't ever be affected by Unicode problems - Unicode is
more a way of lowering entry barriers.
Regards,
Jo