Need debugging knowhow for my creeping Unicodephobia

Wed Feb 10 18:03:32 EST 2010

On Wed, 2010-02-10 at 12:17 -0800, Anthony Tolle wrote:
> On Feb 10, 2:09 pm, kj <no.em... at please.post> wrote:
> > Some people have mathphobia.  I'm developing a wicked case of
> > Unicodephobia.
> > [snip]
> 
> Some general advice (Looks like I am reiterating what MRAB said -- I
> type slower :):
> 
> 1. If possible, use unicode strings for everything.  That is, don't
> use both str and unicode within the same project.
> 
> 2. If that isn't possible, convert strings to unicode as early as
> possible, work with them that way, then convert them back as late as
> possible.
> 
> 3. Know what type of string you are working with!  If a function
> returns or accepts a string value, verify whether the expected type is
> unicode or str.
> 
> 4. Consider switching to Python 3.x, since there is only one string
> type (unicode).
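
Points 2 and 3 can be sketched roughly as follows.  This is only a
minimal illustration, assuming the input is UTF-8; the helper names
(read_text, shout, write_text) are made up for the example and don't
come from any library:

```python
def read_text(raw_bytes, encoding="utf-8"):
    """Decode bytes to unicode at the input boundary (as early as possible)."""
    return raw_bytes.decode(encoding)

def shout(text):
    """Internal logic: only ever sees (and returns) unicode."""
    return text.upper()

def write_text(text, encoding="utf-8"):
    """Encode unicode back to bytes at the output boundary (as late as possible)."""
    return text.encode(encoding)

raw = b"\xce\xb1\xce\xb2\xce\xb3"   # UTF-8 bytes for alpha, beta, gamma
result = write_text(shout(read_text(raw)))
```

Everything between read_text() and write_text() can then assume a
single string type, which is most of what point 3 is asking for.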

Some further nasty gotchas:

5. Be wary of the encoding of sys.stdout (and stderr/stdin), e.g. when
issuing a "print" statement:  on Unix it can change depending on
whether the python process is directly connected to a tty or not.

(a) If they're directly connected to a tty, their encoding is taken from
the locale, UTF-8 on my machine:
[david@brick ~]$ python -c 'print u"\u03b1\u03b2\u03b3"'
αβγ
(prints alpha, beta, gamma to terminal, though these characters might
not survive being sent in this email)

(b) If they're not (e.g. in a cronjob, a daemon, or a shell pipeline),
no encoding is set on the stream and python falls back to the default
encoding, which is typically ascii.  Rerunning the same command, but
piping into "cat":
[david@brick ~]$ python -c 'print u"\u03b1\u03b2\u03b3"' | cat
Traceback (most recent call last):
  File "<string>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-2: ordinal not in range(128)
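
To see which of (a) or (b) a given process is in, something like the
following may help.  It's a rough sketch; describe_stream is a name
I've made up, and the report goes to stderr so that it still shows up
when stdout is piped:

```python
import sys

def describe_stream(stream):
    """Return (is_tty, encoding) for a file-like object.

    encoding is None when the stream does not declare one.
    """
    is_tty = hasattr(stream, "isatty") and stream.isatty()
    encoding = getattr(stream, "encoding", None)
    return is_tty, encoding

is_tty, encoding = describe_stream(sys.stdout)
sys.stderr.write("stdout: tty=%r encoding=%r\n" % (is_tty, encoding))
```

Running it once directly and once piped into "cat" should show the
difference; as far as I know, Python 2 reports an encoding of None in
the piped case.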

(c) These problems can lurk in sources and only manifest themselves
during _deployment_ of code.  You can set PYTHONIOENCODING=ascii in the
environment to force (a) to behave like (b), so that your code will fail
whilst you're _developing_ it, rather than on your servers at midnight:
[david@brick ~]$ PYTHONIOENCODING=ascii python -c 'print u"\u03b1\u03b2\u03b3"'
Traceback (most recent call last):
  File "<string>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-2: ordinal not in range(128)

(Given the above, it could be argued perhaps that one should never
"print" unicode instances, and instead should write the data to
file-like objects, specifying an encoding.  Not sure).
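
A minimal sketch of that alternative, using the standard codecs
module.  The choice of UTF-8 and the utf8_writer name are my own
assumptions for the example, and a BytesIO buffer stands in for a real
file or pipe:

```python
import codecs
import io

def utf8_writer(binary_stream):
    """Wrap a byte stream so that unicode written to it is encoded as UTF-8."""
    return codecs.getwriter("utf-8")(binary_stream)

buf = io.BytesIO()                   # stands in for a file or pipe opened in binary mode
out = utf8_writer(buf)
out.write(u"\u03b1\u03b2\u03b3\n")   # encoded explicitly, whatever the locale or tty state
data = buf.getvalue()
```

The same wrapper can be applied to sys.stdout on Python 2 (or to
sys.stdout.buffer on Python 3), so the encoding no longer depends on
whether the process is attached to a tty.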

6. If you're using pygtk (specifically the "pango" module, typically
implicitly imported), be warned that it abuses the C API to set the
default encoding inside python, which probably breaks any unicode
instances in memory at the time, and is likely to cause weird side
effects:
[david@brick ~]$ python
Python 2.6.2 (r262:71600, Jan 25 2010, 13:22:47)
[GCC 4.4.2 20100121 (Red Hat 4.4.2-28)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import sys
>>> sys.getdefaultencoding()
'ascii'
>>> import pango
>>> sys.getdefaultencoding()
'utf-8'
(the above is on Fedora 12, though I'd expect to see the same weirdness
on any linux distro running gnome 2)
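
If you suspect some other module of doing the same thing, a check
along these lines might catch it.  This is just a sketch:
default_encoding_change is a hypothetical helper, and it can only
notice a change if the module hasn't already been imported in the
current process:

```python
import sys

def default_encoding_change(module_name):
    """Import module_name; return the default encoding before and after."""
    before = sys.getdefaultencoding()
    __import__(module_name)
    after = sys.getdefaultencoding()
    return before, after

# "json" is used here only as a harmless stand-in; substitute the
# module you are suspicious of.
before, after = default_encoding_change("json")
if before != after:
    sys.stderr.write("default encoding changed: %s -> %s\n" % (before, after))
```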

Python 3 will probably make this all much easier; you'll still have to
care about encodings when dealing with files/sockets/etc, but it should
be much more clear what's going on.  I hope.

Hope this is helpful
Dave