choosing a default text-encoding in Python programs (was: To unicode or not to unicode)
Joshua Judson Rosen
rozzin at geekspace.com
Sun Feb 22 19:46:35 EST 2009
Denis Kasak <denis.kasak at gmail.com> writes:
>
> > > Python "assumes" ASCII and if the decoded/encoded text doesn't
> > > fit that encoding it refuses to guess.
> >
> > Which is reasonable given that Python is a programming language where it's
> > better to have a more conservative assumption about encodings so errors
> > can be more quickly diagnosed. A newsreader however is a different
> > beast, where it's better to make a less conservative assumption that's
> > more likely to display messages correctly to the user. Assuming ISO
> > 8859-1 in the absence of any specified encoding allows the message to be
> > correctly displayed if the character set is either ISO 8859-1 or ASCII.
> > Doing things the "pythonic" way and assuming ASCII only allows such
> > messages to be displayed if ASCII is used.
>
> Reading this paragraph, I've begun thinking that we've misunderstood
> each other. I agree that assuming ISO 8859-1 in the absence of
> specification is a better guess than most (since it's more likely to
> display the message correctly).
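
For concreteness, the behaviour discussed in the quoted text can be
sketched roughly as below. This is only an illustration, assuming
Python 3 (where bytes and text are distinct types); `decode_or_none`
is a hypothetical helper name, not something from this thread.

```python
def decode_or_none(raw, encoding):
    """Return the decoded text, or None if `raw` is not valid in `encoding`."""
    try:
        return raw.decode(encoding)
    except UnicodeDecodeError:
        return None

sample = b"caf\xe9"  # "café" as encoded in ISO 8859-1

# ASCII is strict: byte 0xE9 is outside its range, so decoding is refused.
ascii_result = decode_or_none(sample, "ascii")

# ISO 8859-1 maps every byte value to a character, so decoding never fails,
# whether or not the guess is actually right for the data.
latin1_result = decode_or_none(sample, "iso8859_1")
```
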

So, yeah--back on the subject of programming in Python and supporting
character sets beyond ASCII:

If you have to make an assumption, I'd really think that it'd be
better to use whatever the host OS's default is, if the host OS has
such a thing. Assuming ISO 8859-1 works only in select regions on
unix systems, and may fail even in those regions on Windows, Mac OS,
and other systems. Even leaving the OS considerations aside, the
regional constraints alone are likely to make an ISO 8859-1
assumption produce /incorrect/ results anywhere eastward of central
Europe. Is a user in Russia (or China, or Japan) *really* most likely
to be using ISO 8859-1?
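
A minimal sketch of that suggestion, assuming Python 3 and the
standard `locale` module. `host_default_encoding` and `decode_text`
are hypothetical names used for illustration only, and the KOI8-R
sample is just one plausible example of non-Latin-1 input:

```python
import locale

def host_default_encoding(fallback="ascii"):
    """Return the encoding the host locale prefers, or `fallback` if none is reported."""
    # Passing False asks getpreferredencoding() not to call setlocale() itself.
    return locale.getpreferredencoding(False) or fallback

def decode_text(raw, encoding=None):
    """Decode bytes with the declared encoding, else with the host default."""
    return raw.decode(encoding or host_default_encoding())

# Russian text as it might arrive from a KOI8-R system.
koi8_bytes = "Привет".encode("koi8_r")

# Guessing ISO 8859-1 "succeeds" but yields the wrong characters.
wrong_guess = decode_text(koi8_bytes, "iso8859_1")

# Using the encoding the data was actually written in recovers the text.
right_guess = decode_text(koi8_bytes, "koi8_r")
```

The host default is consulted only when no encoding is declared, so
an explicit declaration always wins over the guess.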
As a point of reference, here's what's in the man-pages that I have
installed (note the /complete/ and conspicuous lack of references to
even some notable eastern languages or character-sets, such as Chinese
and Japanese, in the /entire/ ISO-8859 spectrum):
  "ISO 8859 Alphabets
      The full set of ISO 8859 alphabets includes:

      ISO 8859-1    West European languages (Latin-1)
      ISO 8859-2    Central and East European languages (Latin-2)
      ISO 8859-3    Southeast European and miscellaneous languages (Latin-3)
      ISO 8859-4    Scandinavian/Baltic languages (Latin-4)
      ISO 8859-5    Latin/Cyrillic
      ISO 8859-6    Latin/Arabic
      ISO 8859-7    Latin/Greek
      ISO 8859-8    Latin/Hebrew
      ISO 8859-9    Latin-1 modification for Turkish (Latin-5)
      ISO 8859-10   Lappish/Nordic/Eskimo languages (Latin-6)
      ISO 8859-11   Latin/Thai
      ISO 8859-13   Baltic Rim languages (Latin-7)
      ISO 8859-14   Celtic (Latin-8)
      ISO 8859-15   West European languages (Latin-9)
      ISO 8859-16   Romanian (Latin-10)"

  "ISO 8859-1 supports the following languages: Afrikaans, Basque,
  Catalan, Danish, Dutch, English, Faeroese, Finnish, French,
  Galician, German, Icelandic, Irish, Italian, Norwegian,
  Portuguese, Scottish, Spanish, and Swedish."

  "ISO 8859-2 supports the following languages: Albanian, Bosnian,
  Croatian, Czech, English, Finnish, German, Hungarian, Irish, Polish,
  Slovak, Slovenian and Sorbian."

  "ISO 8859-7 encodes the characters used in modern monotonic
  Greek."

  "ISO 8859-9, also known as the "Latin Alphabet No. 5", encodes
  the characters used in Turkish."

  "ISO 8859-15 supports the following languages: Albanian, Basque, Breton,
  Catalan, Danish, Dutch, English, Estonian, Faroese, Finnish, French,
  Frisian, Galician, German, Greenlandic, Icelandic, Irish Gaelic,
  Italian, Latin, Luxemburgish, Norwegian, Portuguese, Rhaeto-Romanic,
  Scottish Gaelic, Spanish, and Swedish."

  "ISO 8859-16 supports the following languages: Albanian, Bosnian,
  Croatian, English, Finnish, German, Hungarian, Irish, Polish, Romanian,
  Slovenian and Serbian."
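
The same gap can be checked from Python itself. The following is a
rough sketch, assuming Python 3 and its bundled `iso8859_N` codecs;
`iso8859_parts_that_can_encode` is a hypothetical helper name, and
the sample strings are my own:

```python
# The ISO 8859 parts listed in the man-page excerpt (there is no part 12).
ISO_8859_PARTS = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 13, 14, 15, 16]

def iso8859_parts_that_can_encode(text):
    """Return the ISO 8859 part numbers whose codec can represent `text`."""
    parts = []
    for part in ISO_8859_PARTS:
        try:
            text.encode("iso8859_%d" % part)
        except UnicodeEncodeError:
            continue  # this part has no mapping for some character
        parts.append(part)
    return parts

russian_parts = iso8859_parts_that_can_encode("Привет")
japanese_parts = iso8859_parts_that_can_encode("日本語")
chinese_parts = iso8859_parts_that_can_encode("中文")
```

Under these assumptions the Russian sample is representable only in
the Cyrillic part, and the Japanese and Chinese samples in none of
the parts at all.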
--
Don't be afraid to ask (Lf.((Lx.xx) (Lr.f(rr)))).