[Tutor] Python and unicode

Fri Mar 10 11:54:06 CET 2006

On Fri, 10 Mar 2006 08:55:35 +0100
Ferry Dave Jäckel <dave.jaeckel at arcor.de> wrote:

> Hello list,
> 
> I try hard to understand python and unicode support, but don't get it 
> really.
> 
> What I thought about this until yesterday :)
> If I write my script in unicode encoding and put the magic # -*- coding: 
> utf-8 -*- at its start, I can just use unicode everywhere without problems.
> Reading strings in different encodings, I have to decode them, specifying 
> their source encoding, and writing them in a different encoding I have to 
> encode them, giving the target encoding.
> 
> But I have problems with printing my strings with print >> sys.stderr, 
> mystring. I get "ASCII codec encoding errors". I'm on linux with python2.4
> 
> My programming problem where I'm stumbling about this:
> I have an xml-file from OO.org writer (encoded in utf-8), and I parse this 
> with sax, getting some values from it. This data should go into a mysql db 
> (as utf-8, too). I think this works quite well, but debug printing gives 
> these errors.
> 
> What is the right way to handle unicode and maybe different encodings in 
> python?
> What encoding should be put into the header of the file, and when to use the 
> strings encode and decode methods? Are there modules (as maybe sax) which 
> require special treatment because of lack of full unicode support?
> In general I'd like to keep all strings as unicode in utf-8, and just 
> convert strings from/to other encodings upon input/output.
> 

Hi Dave,

you should be aware that utf-8 is *not* unicode itself, but just one encoding of it,
a byte representation like any other codec.
Look here for more details:

  http://www.joelonsoftware.com/articles/Unicode.html

I am not sure what happens in your program, but generally, when a unicode
string is converted into a byte string, python uses the ascii codec unless another
codec is explicitly specified, which seems to be what occurs here.
In this case calling encode('utf-8') on the unicode string before printing it may help.
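
A minimal sketch of that behaviour, assuming the implicit conversion acts like an
explicit encode with the ascii codec. The sample string and the variable names are
made up for illustration, and the snippet avoids the print statement so that it also
runs on newer python versions:

```python
# -*- coding: utf-8 -*-
# Hypothetical sample data: a unicode string with one non-ASCII character.
name = u"J\u00e4ckel"

# Encoding with the ascii codec fails for the character u"\u00e4".
# This is assumed to be the same error that debug printing runs into.
try:
    name.encode("ascii")
    ascii_failed = False
except UnicodeEncodeError:
    ascii_failed = True

# Encoding explicitly with utf-8 succeeds and yields a byte string.
utf8_bytes = name.encode("utf-8")
```

The utf-8 byte string is longer than the unicode string, because the one non-ASCII
character is stored as two bytes.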

If you are using strings in different encodings, it is probably best to convert them
all into unicode objects with decode() right after reading them (you need to know which
codec to use, however) and to use only unicode internally. You are right, though, that some
modules may not be unicode-proof (I don't know about sax). You will have to encode
the strings again before passing them to those modules' functions.
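
As a rough sketch of this "decode on input, encode on output" pattern: the helper
names to_unicode and to_bytes are hypothetical, and the latin-1 input only stands in
for data that would really come from a file or another module.

```python
def to_unicode(raw, source_encoding):
    """Decode a byte string from outside into a unicode object."""
    return raw.decode(source_encoding)


def to_bytes(text, target_encoding):
    """Encode a unicode object for a consumer that expects byte strings."""
    return text.encode(target_encoding)


# Stand-in for bytes read from a latin-1 encoded source (made-up sample).
latin1_input = u"J\u00e4ckel".encode("latin-1")

# Input boundary: decode once, then work with unicode only.
text = to_unicode(latin1_input, "latin-1")

# Output boundary: encode once, here to utf-8 (e.g. for a database).
utf8_output = to_bytes(text, "utf-8")
```

Keeping the conversions at the edges of the program means the code in between never
has to care which encoding the data arrived in.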

If your program is supposed to run on different systems, it may help to know the encoding
each system uses when you want to read files there; the top of
the IOBinding module in idlelib shows one way to guess the system encoding.
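
One possible way to guess the system encoding is sketched here. This is not a copy of
the IOBinding code; it only uses locale.getpreferredencoding() from the standard
library, and the function name guess_system_encoding and the ascii fallback are
assumptions of mine.

```python
import codecs
import locale


def guess_system_encoding(fallback="ascii"):
    """Return the locale's preferred encoding, or fallback if it is unusable."""
    try:
        encoding = locale.getpreferredencoding()
    except Exception:
        # Be conservative if the locale settings cannot be queried.
        encoding = None
    if not encoding:
        return fallback
    try:
        # Make sure python actually has a codec with that name.
        codecs.lookup(encoding)
    except LookupError:
        return fallback
    return encoding


system_encoding = guess_system_encoding()
```

The result is only a guess: a file on that system may still be stored in a different
encoding than the locale suggests.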

I hope this helps

Michael