[Pythonmac-SIG] Accented characters and Python
Bob Ippolito
bob at redivi.com
Sat Oct 16 21:50:15 CEST 2004
On Oct 16, 2004, at 2:40 PM, João Leão wrote:
> Hi everyone
>
> I'm having some trouble with accented characters (or non ascii
> characters).
> Everything works fine when I just want to print or save to a file. For
> example:
>
> >>> s = 'á' ------> accented a
It's generally a bad idea to keep non-ascii text in byte strings; use
unicode whenever possible. With an OS X terminal, you know that the
strings are in utf-8, so s = 'á'.decode('utf8') is a better way to do
it. Normally you could just write u'á'; however, Python's default
encoding is ascii, so it won't know what to do with that. In a Python
script, you can declare the source encoding as you see fit and write
u'á' with correct results.
#!/usr/bin/python
# -*- coding: utf-8 -*-
(see http://www.python.org/peps/pep-0263.html for more information
about Python source encodings)
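To make the decode step concrete, here is a minimal sketch. It assumes the terminal really did hand you utf-8, and it spells the bytes out as escapes so it does not depend on the source encoding; the names raw and text are just for illustration.

```python
# The two bytes a utf-8 terminal sends for an accented a,
# written as escapes so the source file itself stays pure ascii.
raw = b'\xc3\xa1'

# Decode the bytes into a unicode string (one character, U+00E1).
text = raw.decode('utf-8')

# The byte string and the unicode string disagree about length,
# which matters for any API that takes an explicit length argument.
print(len(raw))   # number of bytes
print(len(text))  # number of characters
```

Note that the byte string has length 2 while the unicode string has length 1, so a call that takes len(s) will see different values depending on which one you pass.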
> >>> print s
> á ------> output is fine
When you're using unicode strings, this won't work by default since the
default encoding is ascii, so you should wrap stdout with something
unicode-aware:
import sys, codecs
sys.stdout = codecs.getwriter('utf-8')(sys.stdout)
Then printing unicode will work just fine. The alternative is to set
the default encoding to utf-8 in a sitecustomize.py (it typically
cannot be changed at runtime), but I recommend against that because
it makes it more difficult to redistribute correct code (what works on
your machine may not work elsewhere because of the different default
encoding).
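If you want to see what the codecs wrapper actually does without touching the real stdout, here is a small sketch that wraps an in-memory byte stream instead. The io.BytesIO object is only a stand-in for the terminal, and the variable names are made up for the example.

```python
import codecs
import io

# Stand-in for a byte-oriented output stream such as the terminal.
byte_stream = io.BytesIO()

# codecs.getwriter returns a StreamWriter class for the codec;
# calling it wraps the stream so unicode written to it is encoded.
writer = codecs.getwriter('utf-8')(byte_stream)

# Write a unicode string containing an accented a (U+00E1).
writer.write(u'\xe1\n')

# What reached the underlying stream is the utf-8 encoded bytes.
encoded = byte_stream.getvalue()
print(repr(encoded))
```

The wrapper does the encode for you on every write, which is why nothing else in the program has to care about the output encoding.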
> >>> s
> '\x87\x8e'
>
> Even when the string is read from a function like os.listdir(), the
> output with 'print' will work preserving the accented characters.
Convert it to unicode first. If you pass a unicode path to os.listdir,
it will return unicode paths.
s = os.listdir(u'.')
If you have a regular string for whatever reason, it's probably in the
filesystem encoding (utf-8 on OS X), so do:
s = os.listdir(path.decode(sys.getfilesystemencoding()))
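Putting those two cases together, a rough sketch of a helper might look like this. The names to_unicode and listdir_unicode are hypothetical (they are not part of os), and it assumes a byte path really is in the filesystem encoding.

```python
import os
import sys


def to_unicode(name, encoding=None):
    """Return name as unicode, decoding byte strings if necessary.

    Assumes byte strings are in the filesystem encoding unless
    another encoding is given explicitly.
    """
    if encoding is None:
        encoding = sys.getfilesystemencoding()
    if isinstance(name, bytes):
        return name.decode(encoding)
    return name


def listdir_unicode(path):
    """List a directory, always passing os.listdir a unicode path
    so that the returned names are unicode as well."""
    return os.listdir(to_unicode(path))
```

With that in place, listdir_unicode(path) can be called with either kind of string and the caller only ever deals with unicode names.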
> The problem comes when I have to pass this string to some function not
> so smart as 'print'.
> For the sake of example, here is a small snippet of code that draws
> some text in a CoreGraphics's BitmapContext:
>
> This will work fine:
> _
> s = '<some_accented_characters>'
> c.showText (s, len(s))
> _
>
> But this won't work (the output will show other strange accented
> characters):
> _
> s = os.listdir('.')[1]  # suppose s is again a string of accented characters
> c.showText (s, len(s))
>
> This function is obviously using the wrong encoding for the output
> (the right one would be iso-8859-1, I suppose) and I can't figure out
> a way to tell it which to use.
> I found some articles about encodings but I don't really know what
> is the right thing to do. Then I tried several things like converting
> the string to unicode and then encoding it to latin-1, but that was not
> the answer.
Nope, Windows uses latin-1; Mac OS X certainly does not. It uses
utf-8, which is byte-for-byte identical to latin-1 only for plain ascii
characters, but can represent arbitrary unicode characters (accented
characters take two or more bytes each).
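A quick sketch of the difference, in case it helps. It only uses the standard codecs, and the variable names are illustrative.

```python
# One accented a (U+00E1) as a unicode string.
text = u'\xe1'

# The same character encoded two different ways.
utf8_bytes = text.encode('utf-8')      # two bytes
latin1_bytes = text.encode('latin-1')  # one byte

# Reading utf-8 bytes as if they were latin-1 does not fail; it
# silently produces two wrong characters instead of one right one.
misread = utf8_bytes.decode('latin-1')

# Plain ascii text is the one case where the two encodings agree.
same_for_ascii = u'abc'.encode('utf-8') == u'abc'.encode('latin-1')

print(repr(utf8_bytes))
print(repr(latin1_bytes))
print(len(misread))
```

That misread case is the usual source of "other strange accented characters": the bytes are fine, they are just being interpreted with the wrong encoding.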
-bob