[Pythonmac-SIG] Accented characters and Python

Bob Ippolito bob at redivi.com
Sat Oct 16 21:50:15 CEST 2004


On Oct 16, 2004, at 2:40 PM, João Leão wrote:

> Hi everyone
>
> I'm having some trouble with accented characters (or non ascii 
> characters).
> Everything works fine when I just want to print or save to a file. For 
> example:
>
> >>> s = 'á'  ------> accented a

It's generally a bad idea to keep non-ASCII text in byte strings; use 
unicode whenever possible.  With an OS X terminal, you know that the 
strings are in utf-8, so s = 'á'.decode('utf8') is a better way to do 
it.  Normally you could just do u'á'; however, Python's default 
encoding is ascii, so it won't know what to do with that.  In a Python 
script, you can declare the source encoding as you see fit and do u'á' 
with correct results.

#!/usr/bin/python
# -*- coding: utf-8 -*-

(see http://www.python.org/peps/pep-0263.html for more information 
about Python source encodings)
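
As a rough illustration of the decode step, here is a minimal sketch.  It 
assumes a Python new enough to accept the b'' prefix (2.6 or later), and 
the variable names are mine, not anything from your code:

```python
# -*- coding: utf-8 -*-

# The two bytes a UTF-8 terminal sends for "á" (U+00E1).
raw = b'\xc3\xa1'

# Decode once at the boundary, then work with the unicode object.
text = raw.decode('utf-8')

# Two bytes on the wire, but a single character once decoded.
byte_count = len(raw)
char_count = len(text)
```

The point is that the length (and anything else you compute) refers to 
characters only after decoding; before that you are counting bytes.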

> >>> print s
> á                  ------> output is fine

When you're using unicode strings, print won't work by default for 
non-ASCII text since the default encoding is ascii, so you should wrap 
stdout with something unicode-aware:

import sys, codecs
sys.stdout = codecs.getwriter('utf-8')(sys.stdout)

Then printing unicode will work just fine.  The alternative is to set 
the default encoding to utf-8 in a sitecustomize.py (it cannot 
typically be changed at runtime), but I recommend against that because 
it makes it more difficult to redistribute correct code (what works on 
your machine may not work elsewhere because of the different default 
encoding).
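
To see what the wrapper actually does, here is a small sketch that wraps 
an in-memory byte stream instead of sys.stdout, so the encoded bytes can 
be inspected.  The io.BytesIO stand-in is my assumption for the demo (it 
needs Python 2.6 or later); the codecs.getwriter call is the same one as 
above:

```python
import codecs
import io

# Stand-in for sys.stdout: a byte stream we can look at afterwards.
buffer = io.BytesIO()
writer = codecs.getwriter('utf-8')(buffer)

# Write a unicode string; the writer encodes it on the way out.
writer.write(u'\xe1\n')  # U+00E1 is "á"

encoded = buffer.getvalue()
```

Here encoded holds the utf-8 bytes for "á" followed by a newline, which 
is what a utf-8 terminal expects to receive.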

> >>> s
> '\x87\x8e'
>
> Even when the string is read from a function like os.listdir(), the 
> output with 'print' will work, preserving the accented characters.

Convert it to unicode first.  If you pass a unicode path to os.listdir, 
it will return unicode paths.

s = os.listdir(u'.')

If you have a regular string for whatever reason, it's probably in the 
filesystem encoding (utf-8 on OS X), so do:

s = os.listdir(path.decode(sys.getfilesystemencoding()))
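
A minimal sketch of that behaviour follows.  The scratch directory and 
the file name notes.txt are hypothetical, made up for the demo, and the 
isinstance check is only there so the same code runs on Pythons where 
tempfile.mkdtemp returns text rather than a byte string:

```python
import os
import shutil
import sys
import tempfile

# Hypothetical scratch directory, used only for this demo.
workdir = tempfile.mkdtemp()
fs_encoding = sys.getfilesystemencoding() or 'utf-8'

# Build both spellings of the same path: unicode text and raw bytes.
if isinstance(workdir, bytes):
    text_dir = workdir.decode(fs_encoding)
else:
    text_dir = workdir
byte_dir = text_dir.encode(fs_encoding)

# Create one file so the listing is not empty.
open(os.path.join(text_dir, u'notes.txt'), 'w').close()

text_names = os.listdir(text_dir)  # unicode path in, unicode names out
byte_names = os.listdir(byte_dir)  # byte path in, byte names out

# Clean up the scratch directory.
shutil.rmtree(text_dir)
```

The type of the names you get back follows the type of the path you 
pass in, so decoding the path up front saves decoding every name later.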

> The problem comes when I have to pass this string to some function not 
> so smart as 'print'.
> For the sake of example, here is a small snippet of code that draws 
> some text in a CoreGraphics's BitmapContext:
>
> This will work fine:
> _
> s = '<some_accented_characters>'
> c.showText (s, len(s))
> _
>
> But this won't work (the output will show other strange accented 
> characters):
> _
> s = os.listdir('.')[1]   #suppose s is again a string of accented 
> characters
> c.showText (s, len(s))
>
> This function is obviously using the wrong encodement for the output 
> (the right one would be iso-8859-1, I suppose) and I can't figure a way 
> to tell it which to use.
> I found some articles about encodements but I don't really know what 
> is the right thing to do. Then I tried several things like converting 
> the string to unicode and then decode it to latin-1, but that was not 
> the answer.

Nope.  Western-European Windows uses cp1252, which is close to latin-1; 
Mac OS X certainly does not.  It uses utf-8, which matches latin-1 only 
for the ASCII range (accented characters take two bytes), but can 
represent arbitrary unicode characters.
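
The difference between the two encodings is easy to check.  This is just 
a sketch with variable names of my own choosing; it says nothing about 
which encoding showText itself expects:

```python
# One character, "á" (U+00E1), encoded two different ways.
text = u'\xe1'

latin1_bytes = text.encode('latin-1')  # a single byte
utf8_bytes = text.encode('utf-8')      # two bytes

# Reading utf-8 bytes as if they were latin-1 yields two wrong
# characters instead of one right one.
garbled = utf8_bytes.decode('latin-1')

# Plain ASCII text is byte-for-byte identical in both encodings.
ascii_same = u'abc'.encode('latin-1') == u'abc'.encode('utf-8')
```

That kind of mismatch, bytes written in one encoding and read in 
another, is the usual source of "strange accented characters".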

-bob

