python 2.7 and unicode (one more time)

Chris Angelico rosuav at gmail.com
Fri Nov 21 02:39:21 CET 2014


On Fri, Nov 21, 2014 at 12:31 PM,  <random832 at fastmail.us> wrote:
> On Thu, Nov 20, 2014, at 20:10, Chris Angelico wrote:
>> 2) Languages which use a different alphabet (e.g. Cyrillic - Russian,
>> Bulgarian). You could possibly cram them into an eight-bit encoding
>> without tipping ASCII out, but I'm not sure. In Unicode, these
>> languages are all easily supported by the BMP, as they don't use a
>> huge number of characters each.
>
> There are numerous eight-bit encodings that support latin and one other
> alphabet. Remember, ASCII is a seven-bit encoding, and an eight-bit
> encoding is basically two seven-bit encodings.

I'm aware of this; Greek, for instance, fits quite happily into
ISO-8859-7, which is eight-bit.
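
A minimal sketch of that point, using Python's standard "iso8859_7"
codec. The sample string is made up for illustration, and the u""
literals are only there so the same code runs on 2.7 and 3.3+:

```python
# Sketch: Greek plus ASCII in a single eight-bit encoding (ISO-8859-7).
# The sample string is a made-up illustration, not from the thread.
sample = u"abc \u03b1\u03b2\u03b3"  # "abc " followed by alpha, beta, gamma

encoded = sample.encode("iso8859_7")

# One byte per character: the "a byte is a character" model holds here.
one_byte_each = len(encoded) == len(sample)

# ASCII stays in the low half (< 0x80); the Greek letters land in the
# high half (>= 0x80), i.e. the "second seven-bit encoding".
high_half = [b for b in bytearray(encoded) if b >= 0x80]

# Decoding with the same codec gives back the original text.
round_trip = encoded.decode("iso8859_7") == sample
```

The three Greek letters are the only bytes in the high half, which is
the "two seven-bit encodings" picture in practice.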

> The most difficult (of those still possible at all) language to encode
> in eight bits is actually Vietnamese, which uses the Latin alphabet, due
> to the sheer number of accented letters used. Windows' encoding of it
> (along with some other lesser used encodings, all for Vietnamese) is the
> only 8-bit encoding to use combining accents, in a way unfortunately
> incompatible with Unicode normalization if naively translated, whereas
> VISCII sacrifices a handful of C0 control characters in addition to
> fully packing the high half with letters.

This is what I was suspicious of. The very notion of "combining
accents" already breaks the notion that "a byte is a character is a
glyph", which most eight-bit encodings try to pretend. In any case,
the BMP still easily copes with them all.
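
A rough sketch of the normalization problem, assuming Python's stdlib
"cp1258" codec and taking U+1EC7 (e with circumflex and dot below) as
an arbitrary example letter. The cp1258 style keeps the circumflexed e
precomposed and adds the dot below as a combining character, and that
sequence is neither NFC nor NFD:

```python
import unicodedata

# cp1258-style spelling: precomposed e-circumflex (U+00EA) followed by
# COMBINING DOT BELOW (U+0323). Two code points, one visible glyph.
cp1258_style = u"\u00ea\u0323"

# Two bytes on the wire for a single glyph, so "a byte is a character
# is a glyph" no longer holds.
raw = cp1258_style.encode("cp1258")

# A naive byte-for-byte translation to Unicode keeps the same sequence.
naive = raw.decode("cp1258")

# Neither normal form matches it: NFC composes everything into one
# code point, NFD splits the circumflex off and reorders the marks.
nfc = unicodedata.normalize("NFC", naive)
nfd = unicodedata.normalize("NFD", naive)

# Every code point involved is still below U+10000, i.e. in the BMP.
in_bmp = all(ord(ch) < 0x10000 for ch in naive + nfc + nfd)
```

So text translated naively out of that encoding would need an explicit
unicodedata.normalize() pass before it compares equal to the same text
typed or stored in NFC.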

(Hmm. I wonder how you'd typeset the old "Self-Pronouncing Alphabet"
for English? It's basically English text with a few markings added to
letters - not standard diacriticals that already exist in Unicode, but
dots. Probably possible, one way or another... but I haven't seen SPA
text since the 90s, and that was in stuff published back in the 80s or
so.)

ChrisA


