Question about encoding, I need a clue ...

Nobody nobody at
Sat Aug 6 01:59:41 EDT 2011

On Fri, 05 Aug 2011 14:07:54 -0400, Geoff Wright wrote:

> I guess what it boils down to is that I would like to get a better handle
> on what is going on so that I will know how best to work through future
> encoding issues.  Thanks in advance for any advice.
> Here are the specifics of my problem.
> On my Mac:

>>>> sys.getdefaultencoding()
> 'ascii'

sys.getdefaultencoding() is a red herring. It's almost always 'ascii',
and isn't affected by the locale (nor can it normally be changed once the
interpreter has started, because site.py deletes sys.setdefaultencoding()).
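
To see that the two settings are unrelated, here is a minimal sketch that
prints both. It is written so that it should run on Python 2 or 3; the
variable names are only for illustration. On Python 2 the first value is
normally 'ascii' and on Python 3 it is 'utf-8', and in neither case does it
follow the locale.

```python
import locale
import sys

# Adopt the environment's locale (LANG, LC_ALL, ...) for this process.
# If the environment names a locale the system lacks, keep the default "C".
try:
    locale.setlocale(locale.LC_ALL, "")
except locale.Error:
    pass

# Interpreter-wide default codec; not derived from the locale.
default_encoding = sys.getdefaultencoding()

# Encoding implied by the current locale; this is the one that decides
# how names such as calendar.month_name are encoded.
locale_encoding = locale.getpreferredencoding(False)

print("default encoding: " + default_encoding)
print("locale encoding:  " + locale_encoding)
```

Running it under different LANG settings should change the second line
but not the first.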

>>>> import calendar
>>>> calendar.month_name[8]
> 'ao\xc3\xbbt'

This is the "repr()" of 'août' in UTF-8.

>>>> print calendar.month_name[8]
> août
>>>> print unicode(calendar.month_name[8],"latin1")
> aoÃ»t

This is what you get if you decode the UTF-8 representation of 'août'
using ISO-8859-1 (aka ISO-Latin-1).
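
The same effect can be reproduced without touching the locale at all. The
sketch below is Python 3 (the session quoted above is Python 2, where these
byte strings are plain str objects), and the variable names are mine.

```python
# In Python 3, text is str and encoded data is bytes.
# The escape keeps the example independent of the source file's encoding.
word = "ao\u00fbt"  # 'août'

utf8_bytes = word.encode("utf-8")  # the bytes seen on the Mac

# Wrong codec on purpose: ISO-8859-1 maps every byte to one character,
# so the two-byte UTF-8 sequence C3 BB turns into two separate characters.
misdecoded = utf8_bytes.decode("iso-8859-1")

# ascii() escapes non-ASCII characters, so the output does not depend on
# whatever encoding the terminal happens to use.
print(ascii(utf8_bytes))            # b'ao\xc3\xbbt'
print(ascii(misdecoded))            # 'ao\xc3\xbbt'
print(len(word), len(misdecoded))   # 4 5
```

The extra character is the giveaway: one accented letter becoming two
characters usually means UTF-8 data was decoded as a single-byte encoding.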

> On the linux server:
>>>> calendar.month_name[8]
> 'ao\xfbt'

This is the "repr()" of 'août' in ISO-8859-1.

Conclusion: the Mac's "fr_CA" locale uses UTF-8, the Linux system uses
ISO-8859-1 (there may or may not be a distinct "fr_CA.utf8" locale which
uses UTF-8). The difference between the two /isn't/ responsible for your
problem; your problem is almost certainly due to a mismatch between the
encoding used by the terminal and the locale's encoding.

If you get a "?" on the Linux system, it's likely that the terminal (or
emulator) is configured to use something other than ISO-8859-1 (e.g. UTF-8
or ASCII). For a GUI-based emulator (xterm, etc), you need to consult the
documentation for the specific program. For the Linux console, refer to
the setfont(8) manual page.
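
As a rough illustration of where the "?" comes from, the Python 3 sketch
below feeds the Linux server's bytes to a pretend terminal. terminal_view()
is a hypothetical stand-in, not a real API; actual terminals differ in how
they render bytes that are invalid in their encoding.

```python
# The bytes the Linux server produces for 'août' in ISO-8859-1.
latin1_bytes = "ao\u00fbt".encode("iso-8859-1")  # b'ao\xfbt'


def terminal_view(data, terminal_encoding):
    """Hypothetical model of a terminal: decode the incoming bytes with the
    terminal's own encoding, substituting U+FFFD for anything invalid."""
    return data.decode(terminal_encoding, errors="replace")


for encoding in ("iso-8859-1", "utf-8", "ascii"):
    # ascii() keeps this output independent of the real terminal in use.
    print(encoding, ascii(terminal_view(latin1_bytes, encoding)))
```

Only the ISO-8859-1 "terminal" recovers the û. The other two see the byte
0xFB as invalid and substitute a replacement character, which many
terminals draw as "?" or a similar placeholder.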

In this situation, there probably isn't much point in converting to and
from Unicode. You can't perform the encoding step (Unicode -> whatever)
without knowing the terminal's encoding. It *should* be the same as the
locale's encoding, in which case converting to and from Unicode is an
identity transformation (i.e. you get out exactly what you put in). If it
isn't the same as the locale's encoding, well ... good luck trying to
figure out what it is.
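
A small Python 3 sketch of both cases, with the encodings hard-coded as
assumptions (in practice the terminal's encoding is exactly the part you
cannot reliably discover). transcode() is a helper name I made up.

```python
def transcode(data, source_encoding, target_encoding):
    """Decode bytes to Unicode, then encode them for the target."""
    return data.decode(source_encoding).encode(target_encoding)


data = b"ao\xfbt"        # 'août' as produced by an ISO-8859-1 locale
source = "iso-8859-1"    # assumed locale encoding

# Terminal uses the same encoding as the locale: nothing changes.
unchanged = transcode(data, source, source)

# Terminal assumed to use UTF-8 instead: the bytes must be converted.
converted = transcode(data, source, "utf-8")

print(unchanged == data)   # True
print(ascii(converted))    # b'ao\xc3\xbbt'
```

The conversion is only as good as the guess about the target encoding,
which is why getting the terminal and locale to agree is the real fix.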
