choosing a default text-encoding in Python programs (was: To unicode or not to unicode)

Sun Feb 22 19:46:35 EST 2009

Denis Kasak <denis.kasak at gmail.com> writes:
>
> > > Python "assumes" ASCII and if the decoded/encoded text doesn't
> > > fit that encoding it refuses to guess.
> >
> > Which is reasonable given that Python is a programming language where it's
> > better to have a more conservative assumption about encodings so errors
> > can be more quickly diagnosed.  A newsreader however is a different
> > beast, where it's better to make a less conservative assumption that's
> > more likely to display messages correctly to the user.  Assuming ISO
> > 8859-1 in the absence of any specified encoding allows the message to be
> > correctly displayed if the character set is either ISO 8859-1 or ASCII.
> > Doing things the "pythonic" way and assuming ASCII only allows such
> > messages to be displayed if ASCII is used.
>
> Reading this paragraph, I've begun thinking that we've misunderstood
> each other. I agree that assuming ISO 8859-1 in the absence of
> specification is a better guess than most (since it's more likely to
> display the message correctly).

So, yeah--back on the subject of programming in Python and supporting
character sets beyond ASCII:

If you have to make an assumption, I really think it'd be better to
use whatever the host OS's default is, if the host OS has such a
thing. An assumption of ISO 8859-1 works only in select regions on
unix systems, and may fail even in those regions on Windows, Mac OS,
and other systems. Even leaving the OS considerations aside, the
regional constraints alone are likely to make an ISO 8859-1
assumption give /incorrect/ results anywhere eastward of central
Europe. Is a user in Russia (or China, or Japan) *really* most likely
to be using ISO 8859-1?
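
Here's a rough sketch of what I mean, in Python 3 syntax. It assumes
that locale.getpreferredencoding() is a fair stand-in for "the host
OS's default"; the helper name decode_with_host_default is made up
for illustration, and KOI8-R is just one encoding a Russian host
might plausibly use:

```python
import locale


def decode_with_host_default(raw, fallback="ascii"):
    """Decode bytes with the host's preferred encoding (hypothetical helper).

    Falls back to strict ASCII if the platform reports nothing, so that
    unexpected input still raises UnicodeDecodeError instead of being
    silently guessed at.
    """
    encoding = locale.getpreferredencoding(False) or fallback
    return raw.decode(encoding)


# Why a blanket ISO 8859-1 assumption misleads: decoding never raises,
# it just yields the wrong characters for text in another encoding.
russian = "\u041f\u0440\u0438\u0432\u0435\u0442"   # "Privet" in Cyrillic
raw = russian.encode("koi8_r")         # bytes as a KOI8-R host would write them
as_latin1 = raw.decode("iso8859_1")    # "succeeds", but is not the original text
as_koi8 = raw.decode("koi8_r")         # round-trips, given the right encoding
```

The ISO 8859-1 decode is the worrying case: since every byte value
maps to some character, nothing ever fails, and the wrong text just
gets passed along.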

As a point of reference, here's what's in the man-pages that I have
installed (note the /complete/ and conspicuous lack of references to
even some notable eastern languages or character-sets, such as Chinese
and Japanese, in the /entire/ ISO-8859 spectrum):

   "ISO 8859 Alphabets
       The full set of ISO 8859 alphabets includes:

       ISO 8859-1    West European languages (Latin-1)
       ISO 8859-2    Central and East European languages (Latin-2)
       ISO 8859-3    Southeast European and miscellaneous languages (Latin-3)
       ISO 8859-4    Scandinavian/Baltic languages (Latin-4)
       ISO 8859-5    Latin/Cyrillic
       ISO 8859-6    Latin/Arabic
       ISO 8859-7    Latin/Greek
       ISO 8859-8    Latin/Hebrew
       ISO 8859-9    Latin-1 modification for Turkish (Latin-5)
       ISO 8859-10   Lappish/Nordic/Eskimo languages (Latin-6)
       ISO 8859-11   Latin/Thai
       ISO 8859-13   Baltic Rim languages (Latin-7)
       ISO 8859-14   Celtic (Latin-8)
       ISO 8859-15   West European languages (Latin-9)
       ISO 8859-16   Romanian (Latin-10)"

       "ISO 8859-1 supports the following languages: Afrikaans, Basque,
       Catalan, Danish, Dutch, English, Faeroese, Finnish, French,
       Galician, German, Icelandic, Irish, Italian, Norwegian,
       Portuguese, Scottish, Spanish, and Swedish."

       "ISO 8859-2 supports the following languages: Albanian, Bosnian,
       Croatian, Czech, English, Finnish, German, Hungarian, Irish, Polish,
       Slovak, Slovenian and Sorbian."

       "ISO 8859-7 encodes the characters used in modern monotonic
       Greek."

       "ISO 8859-9, also known as the "Latin Alphabet No. 5", encodes
       the characters used in Turkish."

       "ISO 8859-15 supports the following languages: Albanian, Basque, Breton,
       Catalan, Danish, Dutch, English, Estonian, Faroese, Finnish, French,
       Frisian, Galician, German, Greenlandic, Icelandic, Irish Gaelic,
       Italian, Latin, Luxemburgish, Norwegian, Portuguese, Rhaeto-Romanic,
       Scottish Gaelic, Spanish, and Swedish."

       "ISO 8859-16 supports the following languages: Albanian, Bosnian,
       Croatian, English, Finnish, German, Hungarian, Irish, Polish, Romanian,
       Slovenian and Serbian."
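
The same gap can be checked from Python itself. This is only a
sketch: it assumes the interpreter ships codecs named iso8859_1
through iso8859_16 (CPython's standard library does), and the
function name is my own invention:

```python
# The fifteen parts listed above (there is no ISO 8859-12).
ISO_8859_PARTS = [n for n in range(1, 17) if n != 12]


def iso_8859_parts_that_can_encode(text):
    """Return the ISO 8859 part numbers able to represent `text` in full."""
    usable = []
    for part in ISO_8859_PARTS:
        try:
            text.encode("iso8859_%d" % part)
        except UnicodeEncodeError:
            continue
        usable.append(part)
    return usable


japanese = "\u65e5\u672c\u8a9e"   # "Japanese language", written in kanji
chinese = "\u4e2d\u6587"          # "Chinese (written language)"

print(iso_8859_parts_that_can_encode(japanese))   # []
print(iso_8859_parts_that_can_encode(chinese))    # []
print(iso_8859_parts_that_can_encode("hello"))    # all fifteen parts
```

Plain ASCII text fits every part, while the Japanese and Chinese
samples fit none of them.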

-- 
Don't be afraid to ask (Lf.((Lx.xx) (Lr.f(rr)))).