davea at ieee.org
Thu Oct 1 12:50:21 CEST 2009
>> save in utf-8 the coding declaration also has to be utf-8
> ok, I understand, but what's the problem? Unfortunately, it seems that
> the Python interactive
> mode doesn't have Unicode support. It recognizes the latin-1 encoding
> So I have 2 options, how to write doctest:
> 1. Replace native characters with their encoded representation like
> u"\u017dabovit\xe1 zmie\u0161an\xe1 ka\u0161a" instead of u"Žabovitá
> zmiešaná kaša"
> 2. Use latin-1 encoding, where the file is saved in utf-8
> The first is bad because doctest is a great documentation tool and it
> is probably the main reason I use Python. And something like
> u"\u017dabovit\xe1 zmie\u0161an\xe1 ka\u0161a" is not the best
> documentation style. But the tests work.
> The second is bad, because the declaration is incorrect and if I use
> it in a Django model declaration, for example, I get bad data in the
> So what is the solution? Back to Java? :-)
Wait -- don't give up yet. Since I'm one of the ones who (partially)
steered you wrong, let me try to help.
The key variable here is how your text editor behaves. Since I've never
taken my (programming) text editor out of ASCII mode before this week,
it took some experimenting (and more importantly a message from Piet on
this thread) to make sense of things. I think I now know how to make my
own editor (Komodo IDE) behave in this environment, and you probably can
do as well or better. In fact, judging from your messages, you probably
are doing much better on the editor front.
When I tried this morning to re-open that test file from yesterday, many
of the characters were all messed up. I was okay as long as the project
was still open, but not today. The editor itself apparently looks to
that encoding declaration when it's deciding how to interpret the bytes
So I did the following, using Komodo IDE. I created a new file in the
project. Before saving it, I used
Edit->CurrentFileSettings->Properties->Encoding to set it to UTF-8.
*NOW* I pasted the stuff from your email message. And added the
#-*- coding: utf-8 -*-
as the second line of the file. Notice it's *NOT* latin-1.
At this point I save and run the file, and it seems to work fine.
My guess is that I could set these as default settings in Komodo, if I
were doing UTF-8 very often, and it would become painless. I know I
have certain stuff in my python template, and could add that encoding
line as well.
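
To see why the declaration has to match how the file was actually saved, here is a minimal sketch (my own illustration, nothing Komodo-specific) that takes the bytes a UTF-8 save would produce and reads them back under both encodings. The variable names are made up for the example.

```python
# -*- coding: utf-8 -*-
# Sketch: the same bytes read under two different encoding assumptions.
text = u"Žabovitá zmiešaná kaša"

# What an editor writes to disk when the file is saved as UTF-8.
raw = text.encode("utf-8")

# Reading those bytes back with the matching and with the wrong encoding.
as_utf8 = raw.decode("utf-8")
as_latin1 = raw.decode("latin-1")

print(as_utf8 == text)    # the round trip preserves the text
print(as_latin1 == text)  # latin-1 turns each 2-byte character into two
print(len(text), len(raw), len(as_latin1))
```

The latin-1 read never raises an error, because every byte is a valid latin-1 character. That is probably why a wrong declaration shows up as messed-up characters rather than as an exception.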
Anyway, that gets us to the step of running the doctest. The trick here
seems to be that we need to define the docstring as a Unicode docstring
to have it interpreted correctly. Try adding the u in front of the
triple quote as follows:
    def downcode(name):
        u"""
        >>> downcode(u"Žabovitá zmiešaná kaša")
        u'Zabovita zmiesana kasa'
        """
        for key, value in _MAP.iteritems():
            name = name.replace(key, value)
        return name
Now, if the doctest passes, we seem to be in good shape.
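
For reference, a self-contained file along those lines might look like the sketch below. The three-entry _MAP is a hypothetical stand-in for your real table, and I've made two changes so the same file also runs on Python 3: the doctest uses print() instead of relying on the u'...' repr (which differs between versions), and the loop uses items() instead of iteritems().

```python
# -*- coding: utf-8 -*-
import doctest

# Hypothetical, minimal mapping; the real _MAP would cover far more characters.
_MAP = {u"Ž": u"Z", u"á": u"a", u"š": u"s"}


def downcode(name):
    u"""Replace accented characters with their ASCII look-alikes.

    >>> print(downcode(u"Žabovitá zmiešaná kaša"))
    Zabovita zmiesana kasa
    """
    for key, value in _MAP.items():
        name = name.replace(key, value)
    return name


if __name__ == "__main__":
    # Runs every doctest in this module; silent when they all pass.
    doctest.testmod()
```

The file still has to be saved as UTF-8 for the coding line to be truthful.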
There's another problem, that hopefully somebody else can help with.
That's if doctest needs to report an error. When I deliberately changed
the "expect" string I get an error like the following.
UnicodeEncodeError: 'ascii' codec can't encode character u'\u017d' in
position 50: ordinal not in range(128)
I get a similar error if running the -v option on doctest. (Note that
I do *NOT* get the error when running inside Komodo. And what I've read
implies that the same would be true if running inside IDLE.) The
problem is similar to the one you'd have doing a simple:
I think these are avoided if sys.stdout.encoding (and maybe
sys.stderr.encoding) are set to utf-8. On my system they're set to
None, which says to use "the system default encoding." On my system
that would be ASCII, so I get the error. But perhaps yours is already
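
A quick way to check your own system, and to reproduce the error without involving doctest at all, is something like the following sketch. It forces the ASCII encoding by hand, which is my assumption about what happens internally when stdout has no encoding set. The escaped string is only there so the snippet does not depend on how it is saved.

```python
import sys

# Same kind of text as in the doctest, written with escapes on purpose.
text = u"\u017dabovit\xe1 ka\u0161a"

# What Python thinks the console encoding is (may be None when redirected).
print(getattr(sys.stdout, "encoding", None))

# Encoding by hand with ASCII triggers the same kind of exception.
error = None
try:
    text.encode("ascii")
except UnicodeEncodeError as exc:
    error = exc

print(type(error).__name__)
```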
I found links:
which indicate you may want to try:
set LC_CTYPE=en_GB.utf-8 python
at the command prompt before running python. This could be system specific; it didn't work for me on XP.
The workaround that works for me (so far) is:
    if __name__ == "__main__":
        import sys, codecs
        sys.stdout = codecs.getwriter('utf8')(sys.stdout)
        print u"Žabovitá zmiešaná kaša"
The codecs line tells Python that stdout should encode its output as UTF-8. That doesn't make the characters look good on my console, but at least it avoids the errors. I'm guessing that on my system I should use latin1 here instead of utf8. But I don't want to confuse things.
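
To show what that wrapper actually does, here is a small sketch that swaps in an in-memory byte stream for sys.stdout so the bytes can be inspected. The names fake_stdout and latin1_failed are made up for the example. It also tries a strict latin-1 writer on the same text.

```python
# -*- coding: utf-8 -*-
import codecs
import io

text = u"Žabovitá zmiešaná kaša"

# Stand-in for sys.stdout: an in-memory byte stream we can inspect.
fake_stdout = io.BytesIO()
writer = codecs.getwriter("utf8")(fake_stdout)
writer.write(text + u"\n")
utf8_bytes = fake_stdout.getvalue()

# latin-1 has no code points for Ž or š, so a strict latin-1 writer fails.
latin1_failed = False
try:
    codecs.getwriter("latin-1")(io.BytesIO()).write(text)
except UnicodeEncodeError:
    latin1_failed = True

print(len(text), len(utf8_bytes), latin1_failed)
```

So for this particular text latin1 would bring the error back, since Ž and š fall outside it. Passing errors="replace" as the second argument to the writer should avoid the exception, at the cost of question marks in the output.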