Python Unicode handling wins again -- mostly
Steven D'Aprano
steve+comp.lang.python at pearwood.info
Sat Nov 30 19:22:28 EST 2013
On Sun, 01 Dec 2013 11:37:30 +1300, Gregory Ewing wrote:
> Which makes it even sillier to have an 'ffi' character in this day and
> age, when you can simply space the characters so that they overlap.
It's in Unicode to support legacy character sets that included it[1].
There are a bunch of similar cases:
* LATIN CAPITAL LETTER A WITH RING ABOVE versus ANGSTROM SIGN
* KELVIN SIGN versus LATIN CAPITAL LETTER K
* DEGREE CELSIUS and DEGREE FAHRENHEIT
* the whole set of full-width and half-width forms
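If you want to see how these compatibility characters relate to their ordinary counterparts, normalization is one way to poke at it. This is only a sketch using the standard unicodedata module; the dict name "legacy" and the particular sample characters are my own choices for illustration, not an exhaustive list.

```python
import unicodedata

# A few compatibility characters, keyed by their official Unicode names.
# (The selection here is illustrative only.)
legacy = {
    "LATIN SMALL LIGATURE FFI": "\ufb03",
    "ANGSTROM SIGN": "\u212b",
    "KELVIN SIGN": "\u212a",
    "DEGREE CELSIUS": "\u2103",
    "FULLWIDTH LATIN CAPITAL LETTER A": "\uff21",
}

# NFKC folds each one into its preferred equivalent.
normalized = {
    name: unicodedata.normalize("NFKC", ch) for name, ch in legacy.items()
}

for name, ch in legacy.items():
    folded = normalized[name]
    codes = " ".join(f"U+{ord(c):04X}" for c in folded)
    print(f"{name} (U+{ord(ch):04X}) -> {folded!r} ({codes})")
```

The ligature comes back as the three plain letters 'ffi', and the Angstrom and Kelvin signs fold to the ordinary letters U+00C5 and U+004B. Note that NFKC is lossy: once folded, you can't tell that the text ever contained the legacy character.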
On the other hand, there are cases which to a naive reader might look
like needless duplication but actually aren't. For example, there are a
bunch of visually indistinguishable characters[2] in European languages,
like AΑА and BΒВ. The reason for this becomes more obvious[3] when you
lowercase them:
py> 'AΑА BΒВ'.lower()
'aαа bβв'
Sorting and case-conversion rules would become insanely complicated, and
context-sensitive, if Unicode only included a single code point per thing-
that-looks-the-same.
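To convince yourself that the three A's really are different characters, you can ask Python for their names and code points. A minimal sketch, again with unicodedata; I've written the string with escapes so it doesn't depend on your typeface or editor, and the variable names are just for illustration.

```python
import unicodedata

# Latin A, Greek Alpha, Cyrillic A: near-identical glyphs, distinct code points.
lookalikes = "\u0041\u0391\u0410"

names = [unicodedata.name(ch) for ch in lookalikes]
lowered = [ch.lower() for ch in lookalikes]

for ch, name, low in zip(lookalikes, names, lowered):
    print(f"U+{ord(ch):04X} {name} -> U+{ord(low):04X} {unicodedata.name(low)}")

# Plain code point ordering also keeps the three scripts apart:
# Latin sorts before Greek, which sorts before Cyrillic.
in_order = sorted(lookalikes, reverse=True) == list(reversed(lookalikes))
print(in_order)
```

Each capital lowercases within its own script, which is only possible because the three letters have separate code points to begin with.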
The rules for deciding what is and what isn't a distinct character can be
quite complex, and often politically charged. There's a lot of opposition
to Unicode in East Asian countries because it unifies Han ideograms that
look and behave the same in Chinese, Japanese and Korean. The Unicode
Consortium unifies them for the same reason that it doesn't distinguish
between (say) English A, German A and French A. Some East Asians want
separate code points for the same reason you or I might wish to flag a section of
text as English and another section of text as German, and have them
displayed in slightly different typefaces and spell-checked with a
different dictionary. The Unicode Consortium's answer to that is, this is
beyond the remit of the character set, and is best handled by markup or
higher-level formatting.
(Another reason for opposing Han unification is, let's be frank, pure
nationalism.)
[1] As far as I can tell, the only character supported by legacy
character sets which is not included in Unicode is the Apple logo from
Mac charsets.
[2] The actual glyphs depend on the typeface used.
[3] Again, modulo the typeface you're using to view them.
--
Steven