Question about encoding, I need a clue ...

Steven D'Aprano steve+comp.lang.python at pearwood.info
Fri Aug 5 23:12:02 EDT 2011


Geoff Wright wrote:

> Hi,
> 
> I use Mac OSX for development but deploy on a Linux server.  (Platform
> details provided below).
> 
> When the locale is set to FR_CA, I am not able to display a u circumflex
> consistently across the two machines even though the default encoding is
> set to "ascii" on both machines.

As somebody else already pointed out, û (u circumflex) is not an ASCII
character, so why would you expect to be able to use it with the ASCII
encoding?

Essential reading:

http://www.joelonsoftware.com/articles/Unicode.html

Drop everything and go read that!

In Python 2.x, so-called strings are byte strings, which complicates
matters greatly. The month name you get:

'ao\xc3\xbbt'

is a string of five bytes with hex values:

x61 x6f xc3 xbb x74

Depending on how your terminal is set up, that MAY be interpreted as the
characters a o û t but you could end up with anything:

>>> s = 'ao\xc3\xbbt'
>>> print s
ao羶t

(In theory, even the a, o and t could change, but I haven't found any
terminal settings *that* wacky.)
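
To make the point concrete, here is a minimal sketch that takes the same five
bytes and decodes them under two different encodings. The names raw,
hex_values and decoded are only for illustration, and the b prefix is there so
the snippet should behave the same on Python 2.6+ and Python 3.

```python
# The five bytes returned for the month name
raw = b'ao\xc3\xbbt'

# bytearray yields integers on both Python 2 and Python 3
hex_values = [hex(b) for b in bytearray(raw)]

# Decode the very same bytes under two different assumptions
decoded = {}
for encoding in ('utf-8', 'latin-1'):
    decoded[encoding] = raw.decode(encoding)

# UTF-8 reads \xc3\xbb as one character (u circumflex), giving 4 characters;
# Latin-1 reads them as two separate characters, giving 5.
print(hex_values)
print(len(decoded['utf-8']), len(decoded['latin-1']))
```

Neither result is "wrong" as far as Python is concerned. The bytes carry no
label saying which encoding they are in, so whoever interprets them has to
know or guess.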


> Specifically, calendar.month_name[8] 
> returns a ? (question mark) on the Linux server whereas it displays
> properly on the Mac OSX system.

That could mean either:

(1) the terminal on the Linux server is set to a different default encoding
from that on the Mac; or

(2) the two terminals have the same encoding, but the font used on the Linux
server doesn't include the right glyph to display û.

Of the two, I expect (1) is more likely.
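
One way to start telling these cases apart is to ask Python what encodings it
thinks are in play on each machine. The sketch below is only a diagnostic aid:
terminal_encodings is a made-up helper name, and the values it reports reflect
what Python detects, which is not guaranteed to match what the terminal
emulator actually does. It also shows one way a literal "?" can arise, namely
encoding text into an encoding that lacks the character with errors='replace'.
Whether that is what happened on the server cannot be told from the symptoms
alone.

```python
import locale
import sys

def terminal_encodings():
    """Return the encodings Python reports for stdout and for the locale."""
    return {
        # May be None when output is piped or redirected
        'stdout': getattr(sys.stdout, 'encoding', None),
        'locale': locale.getpreferredencoding(),
    }

# u circumflex has no ASCII equivalent, so 'replace' substitutes '?'
question_mark = u'ao\xfbt'.encode('ascii', 'replace')

print(terminal_encodings())
print(repr(question_mark))
```

Running this on both the Mac and the Linux server and comparing the output
should show whether the two environments disagree.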

The solution is to avoid relying on lucky accidents of the terminal
encoding, and deal with this the right way. The right way is nearly always
to use UTF-8 everywhere you can, not Latin 1. Make sure your terminal is
set to use UTF-8 as well (I believe this is the default for Mac OS's
terminal app, but I have no idea about the many different Linux terminals).
Then:

>>> raw = 'ao\xc3\xbbt'  # From calendar.month_name[8]
>>> s = raw.decode('utf-8')  # Like unicode(raw, 'utf-8')
>>> s
u'ao\xfbt'
>>> print s
août


Provided your Linux server terminal is also set to use UTF-8, this should
just work.
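
If the same decoding step is needed in several places, it can be wrapped in a
small helper. This is only a sketch: to_text is a hypothetical name, and the
default of UTF-8 is an assumption that holds only if the bytes really are
UTF-8. Bytes produced under a Latin-1 locale would need encoding='latin-1'
instead.

```python
def to_text(value, encoding='utf-8'):
    """Decode a byte string to text; pass text through unchanged."""
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value

# The same month name, arriving as bytes in two different encodings
from_utf8 = to_text(b'ao\xc3\xbbt')
from_latin1 = to_text(b'ao\xfbt', encoding='latin-1')

# Both decode to the same four-character text
print(from_utf8 == from_latin1)

# Encode explicitly when bytes are needed again, e.g. for a file or socket
output_bytes = from_utf8.encode('utf-8')
```

The idea is to decode once at the point where bytes enter the program, work
with text internally, and encode explicitly on the way out, rather than
depending on whatever the terminal happens to do.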




-- 
Steven



