[Tutor] Python and unicode

Fri Mar 10 12:13:51 CET 2006

Ferry Dave Jäckel wrote:
> Hello list,
> 
> I try hard to understand python and unicode support, but don't get it 
> really.
> 
> What I thought about this until yesterday :)
> If I write my script in unicode encoding and put the magic # -*- coding: 
> utf-8 -*- at its start, I can just use unicode everywhere without problems.
> Reading strings in different encodings, I have to decode them, specifying 
> their source encoding, and writing them in a different encoding I have to 
> encode them, giving the target encoding.

Yes, this is all good practice. The coding declaration at the start of 
your program may not be necessary, though: it declares the encoding of 
the program file itself. You only need it if the file contains non-ASCII 
characters, for example utf-8 encoded string constants. It doesn't affect 
how the program handles data at run time.
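
A small sketch of the one case where the declaration matters, assuming
the file really is saved as utf-8. The names are made up for the
example. The first literal contains a raw non-ASCII character, the
second spells the same character with an escape and would work without
any declaration:

```python
# -*- coding: utf-8 -*-
# The declaration above tells Python how to decode the bytes of this file.

# Needs the declaration (in Python 2): a non-ASCII character typed directly.
name = u'Jäckel'

# Pure ASCII source: the same string written with an escape sequence.
escaped = u'J\xe4ckel'

same = (name == escaped)
```

If the file is saved in some other encoding than the one declared, the
first literal is the one that will come out wrong.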

But your strategy of keeping all strings as unicode, decoding and 
encoding as they enter and leave the program, is sound.
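
A minimal sketch of that pattern. The sample bytes, and the choice of
latin-1 for input and utf-8 for output, are assumptions for
illustration only. The b'' prefix needs Python 2.6 or later; on 2.4 a
plain '...' string literal plays the same role:

```python
# Bytes as they might arrive from a file or socket; assumed to be latin-1.
raw_in = b'J\xe4ckel'

# Decode once, at the input boundary, into a unicode string.
text = raw_in.decode('latin-1')

# Inside the program, work with unicode only.
shout = text.upper()

# Encode once, at the output boundary, into whatever the target expects.
raw_out = shout.encode('utf-8')
```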
> 
> But I have problems with printing my strings with print >> sys.stderr, 
> mystring. I get "ASCII codec encoding errors". I'm on linux with python2.4

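stderr and stdout have an encoding too. You can check it by printing 
sys.stderr.encoding and sys.stdout.encoding. Unicode strings that you 
print should be encoded to that encoding first, just as strings you send 
to MySQL are. If a stream reports no encoding, Python falls back to the 
default ascii codec, which is probably where your error comes from.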
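
Here is a rough sketch of doing that encoding by hand. encode_for and
its fallback parameter are hypothetical names I made up for this
example, not part of any library, and falling back to ascii with
backslash escapes is just one possible policy:

```python
def encode_for(stream, text, fallback='ascii'):
    """Encode text for stream, escaping what the encoding can't represent."""
    # The encoding attribute may be missing or None, e.g. for a redirected
    # stream, so fall back to a conservative default in that case.
    encoding = getattr(stream, 'encoding', None) or fallback
    return text.encode(encoding, 'backslashreplace')


class FakeStream(object):
    """Stand-in for sys.stderr so the sketch can be tried without a console."""
    def __init__(self, encoding):
        self.encoding = encoding


no_encoding = encode_for(FakeStream(None), u'J\xe4ckel')
utf8_stream = encode_for(FakeStream('utf-8'), u'J\xe4ckel')
```

In Python 2 the result is a byte string that can be printed as-is, for
instance with print >> sys.stderr, encode_for(sys.stderr, mystring).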

You can make this automatic by replacing sys.stdout with a codec 
wrapper, e.g.
   import codecs, sys
   sys.stdout = codecs.getwriter('latin-1')(sys.stdout)

Hmm, on my Windows machine sys.stderr.encoding is None and 
sys.stdout.encoding is Cp437. That suggests that stderr can't accept 
non-ASCII characters at all, so you probably need a lenient ascii 
encoder for it, for example:
   sys.stderr = codecs.getwriter('ascii')(sys.stderr, 'backslashreplace')
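
To see what the two wrappers actually write, here is a sketch that
uses in-memory byte streams in place of the real stdout and stderr.
That substitution is my assumption to keep the example self-contained.
The io module needs Python 2.6 or later; on 2.4 StringIO.StringIO
would be the equivalent:

```python
import codecs
import io

# In-memory byte streams standing in for sys.stdout / sys.stderr.
strict_out = io.BytesIO()
lenient_out = io.BytesIO()

# Strict latin-1 writer: unicode in, latin-1 bytes out.
writer = codecs.getwriter('latin-1')(strict_out)
writer.write(u'J\xe4ckel')

# Lenient ascii writer: anything outside ascii becomes a backslash escape
# instead of raising UnicodeEncodeError.
lenient = codecs.getwriter('ascii')(lenient_out, 'backslashreplace')
lenient.write(u'J\xe4ckel \u20ac')

latin1_bytes = strict_out.getvalue()
ascii_bytes = lenient_out.getvalue()
```

The strict writer would raise UnicodeEncodeError on the euro sign,
since latin-1 has no such character; the lenient one never raises.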

> What is the right way to handle unicode and maybe different encodings in 
> python?
> What encoding should be put into the header of the file, and when to use the 
> strings encode and decode methods? Are there modules (as maybe sax) which 
> require special treatment because of lack of full unicode support?
> In general I'd like to keep all strings as unicode in utf-8, and just 
> convert strings from/to other encodings upon input/output.

I think you are on the right track, though strictly speaking you are 
keeping all strings as unicode, not utf-8: a unicode string has no byte 
encoding until you encode it, and utf-8 is just one of the encodings you 
can choose at that point. Some modules have weak unicode support, but I 
don't know specifically which ones. I would expect the XML support to be 
unicode-aware, as utf-8 is common in XML.
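
A short illustration of the unicode versus utf-8 distinction, using a
sample string of my own choosing. The same unicode string produces
different byte strings depending on the encoding, and both decode back
to the same text:

```python
text = u'J\xe4ckel'                # unicode string: 6 characters

utf8 = text.encode('utf-8')        # 7 bytes, the a-umlaut takes two
latin1 = text.encode('latin-1')    # 6 bytes, the a-umlaut takes one

# Different bytes on the outside, identical unicode on the inside.
round_trip_ok = (utf8.decode('utf-8') == latin1.decode('latin-1') == text)
```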

Kent