[Python-Dev] Add a new "locale" codec?

Victor Stinner victor.stinner at haypocalc.com
Wed Feb 8 17:40:03 CET 2012


>> The current locale is process-wide: if a thread changes the locale,
>> all threads are affected. Some functions have to use the current
>> locale encoding, and not the locale encoding read at startup. Examples
>> with C functions: strerror(), strftime(), tzname, etc.
>
> Could a core part of Python break because of a sequence like:
>
> 1) Encode unicode to bytes using locale codec.
> 2) Silly third-party library code changes the locale codec.
> 3) Attempt to decode bytes back to unicode using the locale codec
> (which is now a different underlying codec).

When you decode data from the OS, you have to use the current locale
encoding. If you use a variable to store the encoding and the locale
is changed, you have to update your variable or you get mojibake.

Example with Python 2:

lisa$ python2.7
Python 2.7.2+ (default, Oct  4 2011, 20:06:09)
>>> import locale, os
>>> encoding=locale.getpreferredencoding(False)
>>> encoding
'ANSI_X3.4-1968'
>>> os.strerror(23).decode(encoding)
u'Too many open files in system'
>>> locale.setlocale(locale.LC_ALL, '') # set the locale
'fr_FR.UTF-8'
>>> os.strerror(23).decode(encoding)
Traceback (most recent call last):
  ...
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 37: ordinal not in range(128)
>>> encoding=locale.getpreferredencoding(False)
>>> encoding
'UTF-8'
>>> os.strerror(23).decode(encoding)
u'Trop de fichiers ouverts dans le syst\xe8me'

You have to update the encoding variable manually because setlocale()
changed not only the LC_MESSAGES locale category (message language) but
also the LC_CTYPE locale category (encoding).

Using the "locale" encoding, you always get the current locale encoding.
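
As a rough Python 3 sketch of that behaviour (decode_locale() and
encode_locale() are hypothetical helper names, not an existing API, and
the surrogateescape error handler is an assumption on my side), the
encoding is looked up on every call instead of being cached in a
variable:

```python
import locale


def decode_locale(data: bytes) -> str:
    """Decode bytes using the locale encoding in effect right now."""
    # Ask for the encoding at call time, so a later setlocale() call
    # is picked up automatically. False means: do not call setlocale().
    encoding = locale.getpreferredencoding(False)
    # Assumption: surrogateescape keeps undecodable bytes round-trippable.
    return data.decode(encoding, errors="surrogateescape")


def encode_locale(text: str) -> bytes:
    """Encode text using the locale encoding in effect right now."""
    encoding = locale.getpreferredencoding(False)
    return text.encode(encoding, errors="surrogateescape")


print(decode_locale(b"Too many open files in system"))
```

With helpers like these, the traceback from the session above would not
occur just because the encoding variable went stale.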

In some cases, you must use sys.getfilesystemencoding() (e.g. to write
to the console or to encode/decode filenames); in other cases, you must
use the current locale encoding (e.g. for strerror() or strftime()).
Python 3 does most of the work for you, so you usually don't have to
care about the locale encoding (you just manipulate Unicode; it decodes
bytes or encodes back to bytes for you). But in some cases, you have to
decode or encode manually using the right encoding. In those cases, the
"locale" codec can help you.
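
For the filename side, a small Python 3 sketch (filename_to_text() and
text_to_filename() are hypothetical wrapper names; os.fsdecode() and
os.fsencode() are the standard functions that apply the filesystem
encoding and its error handler). It also prints both encodings, which
may or may not be the same on a given system:

```python
import locale
import os
import sys


def filename_to_text(raw: bytes) -> str:
    """Decode a filename from bytes using the filesystem encoding."""
    return os.fsdecode(raw)


def text_to_filename(name: str) -> bytes:
    """Encode a filename to bytes using the filesystem encoding."""
    return os.fsencode(name)


# The two encodings are queried separately because they can differ.
print("filesystem encoding:", sys.getfilesystemencoding())
print("current locale encoding:", locale.getpreferredencoding(False))
```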

The documentation will have to explain exactly what this new codec is,
because as expected, it is confusing :-)

Victor

