[Tutor] Python and unicode
Kent Johnson
kent37 at tds.net
Fri Mar 10 12:13:51 CET 2006
Ferry Dave Jäckel wrote:
> Hello list,
>
> I try hard to understand python and unicode support, but don't get it
> really.
>
> What I thought about this until yesterday :)
> If I write my script in unicode encoding and put the magic # -*- coding:
> utf-8 -*- at its start, I can just use unicode everywhere without problems.
> Reading strings in different encodings, I have to decode them, specifying
> their source encoding, and writing them in different encodings I have to
> encode them, giving the target encoding.
Yes, this is all good practice. The coding declaration at the start of
your program may not be necessary - it declares the encoding of your
actual program file. You only need it if the file contains non-ASCII
characters, for example utf-8 encoded string constants. It doesn't affect
how the program encodes or decodes data at run time.
But your strategy of keeping all strings as unicode, decoding and
encoding as they enter and leave the program, is sound.
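As a rough sketch of that strategy (decode once on the way in, work on
unicode, encode once on the way out), something like the following should
do. The byte string and the choice of latin-1 as the source encoding are
made up for illustration, and the syntax is for a current Python rather
than 2.4.

```python
# Hypothetical input: bytes as they might arrive from a latin-1 source.
raw_in = b'J\xe4ckel'                 # 'Jäckel' encoded as latin-1

# Decode once, at the point where the data enters the program.
text = raw_in.decode('latin-1')

# Inside the program, work only with unicode text.
shouted = text.upper()

# Encode once, at the point where the data leaves the program.
raw_out = shouted.encode('utf-8')
```

The only places that need to know about latin-1 or utf-8 are the two
boundary lines; everything in between sees plain unicode.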
>
> But I have problems with printing my strings with print >> sys.stderr,
> mystring. I get "ASCII codec encoding errors". I'm on linux with python2.4
stderr and stdout have an encoding too. You can check it by printing
sys.stderr.encoding and sys.stdout.encoding. The strings that you print
should be converted to the correct encoding just as strings you send to
mysql are.
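To make the check concrete, here is a small sketch that looks at a
stream's encoding attribute and encodes text to match. The helper name
encode_for and the ascii fallback are my own assumptions, not anything
built in.

```python
import sys

def encode_for(stream, text, fallback='ascii'):
    """Encode text for a stream (hypothetical helper).

    Uses the stream's declared encoding if it has one, otherwise the
    fallback. Characters the encoding cannot represent are written as
    backslash escapes instead of raising an error.
    """
    encoding = getattr(stream, 'encoding', None) or fallback
    return text.encode(encoding, 'backslashreplace')

# See what the interpreter thinks the streams can handle.
print(sys.stdout.encoding)
print(sys.stderr.encoding)

data = encode_for(sys.stderr, u'J\xe4ckel')
```

What the two print calls show depends on the platform and on how the
program is started, which is exactly why it is worth checking.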
You can make this automatic by replacing sys.stdout with a codec
wrapper, e.g.
import codecs, sys
sys.stdout = codecs.getwriter('latin-1')(sys.stdout)
Hmm, on my Windows machine sys.stderr.encoding is None and
sys.stdout.encoding is Cp437. That implies that stderr can't accept
non-ASCII characters at all, so you probably need a lenient ascii encoder
for it, for example:
sys.stderr = codecs.getwriter('ascii')(sys.stderr, 'backslashreplace')
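To see what such a lenient writer does without touching the real
streams, you can wrap an in-memory byte buffer instead. The buffer here
is only a stand-in for stderr, chosen so the result can be inspected.

```python
import codecs
import io

# An in-memory byte buffer standing in for a stream like sys.stderr.
buf = io.BytesIO()

# Wrap it so that unicode written to it is encoded as ascii, with
# unencodable characters turned into backslash escapes.
writer = codecs.getwriter('ascii')(buf, 'backslashreplace')
writer.write(u'J\xe4ckel\n')

written = buf.getvalue()
```

The a-umlaut comes out as the four ASCII characters \xe4 rather than
raising an encoding error, so the output is ugly but nothing is lost.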
> What is the right way to handle unicode and maybe different encodings in
> python?
> What encoding should be put into the header of the file, and when to use the
> strings encode and decode methods? Are there modules (as maybe sax) which
> require special treatment because of lack of full unicode support?
> In general I'd like to keep all strings as unicode in utf-8, and just
> convert strings from/to other encodings upon input/output.
I think you are on the right track, though what you are keeping is
unicode strings, not utf-8; utf-8 is just one of the encodings you
convert to or from on input and output. There are some modules that have
weak unicode support but I don't know specifically which ones. I would
expect the XML support to be unicode-aware as utf-8 is common in XML.
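The difference between unicode and utf-8 shows up as soon as you look at
lengths. A small sketch, using a made-up sample string:

```python
text = u'J\xe4ckel'               # a unicode string: six characters

utf8 = text.encode('utf-8')       # one possible byte encoding
latin1 = text.encode('latin-1')   # another one

n_chars = len(text)               # counts characters
n_utf8 = len(utf8)                # counts bytes; a-umlaut takes two
n_latin1 = len(latin1)            # counts bytes; a-umlaut takes one

# Different bytes, but both decode back to the same unicode string.
same = utf8.decode('utf-8') == latin1.decode('latin-1') == text
```

The unicode string is the same text either way; only the encoded bytes
differ.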
Kent