[Tutor] unicode issue?

Danny Yoo dyoo at hkn.eecs.berkeley.edu
Wed Mar 22 00:37:34 CET 2006



On Tue, 21 Mar 2006, Matt Dempsey wrote:

> I'm having a new problem with my House vote script. It's returning the
> following error:
>
> Traceback (most recent call last):
>   File "C:/Python24/evenmorevotes", line 20, in -toplevel-
>     f.write
> (nm+'*'+pt+'*'+vt+'*'+md['vote-result'][0]+'*'+md['vote-desc'][0]+'*'+'\n')
> UnicodeEncodeError: 'ascii' codec can't encode character u'\u201c' in
> position 172: ordinal not in range(128)


Hi Matt,

Just wondering: how familiar are you with Unicode?  What's going on is
that one of the strings in the concatenation above is a Unicode string.


It's like an infection: anything that touches Unicode turns Unicode.
*grin*

######
>>> 'hello' + u'world'
u'helloworld'
######


This has repercussions: when we write these strings out to a file, we
must now be more explicit about how the Unicode is written, since files
are really full of bytes, not Unicode characters.  That is, we need to
specify an "encoding".  If we don't, Python falls back on the 'ascii'
codec, which can't represent a character like u'\u201c', and that's the
error you're seeing.
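
Here's a rough sketch of the failure in isolation, in case it helps to see
it outside of your script.  The string in `text` is made up for
illustration, and I'm assuming a reasonably recent Python (the
"except ... as" syntax won't work on 2.4).  It asks for the 'ascii' codec
explicitly, which is the codec named in your traceback:

```python
# A made-up Unicode string containing a left curly quote (U+201C).
text = u'He said \u201chello'

try:
    # 'ascii' only covers code points 0..127, so this cannot succeed.
    text.encode('ascii')
    failed = False
    message = ''
except UnicodeEncodeError as err:
    # Same kind of error as the one in the traceback.
    failed = True
    message = str(err)
```

After this runs, `failed` should be true, and `message` should end with
the same "ordinal not in range(128)" complaint.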


'utf-8' is a popular encoding that turns Unicode reliably into a bunch of
bytes:

######
>>> u'\u201c'.encode('utf8')
'\xe2\x80\x9c'
######

and this can be written to a file.  Recovering Unicode from bytes can be
done by going the other way, by "decoding":

######
>>> '\xe2\x80\x9c'.decode("utf8")
u'\u201c'
######
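
To tie the two directions together, here's a small round-trip sketch.  The
variable names are just ones I picked.  It's meant to show that a single
Unicode character can turn into several bytes, and that decoding with the
same encoding gets the original character back:

```python
# One Unicode character: the left curly quote from the traceback.
quote = u'\u201c'

# Encoding turns the character into bytes (three of them in UTF-8).
encoded = quote.encode('utf-8')

# Decoding with the same encoding recovers the original character.
decoded = encoded.decode('utf-8')

round_trip_ok = (decoded == quote)
```

The thing to watch for is using the same encoding on both sides: bytes
written as UTF-8 should be read back as UTF-8.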



The codecs.open() function in the Standard Library is useful for handling
this encode/decode thing so that all we need to do is concentrate on
Unicode:

    http://www.python.org/doc/lib/module-codecs.html#l2h-991

For example:

######
>>> import codecs
>>>
>>> f = codecs.open("foo.txt", "wb", "utf8")
>>> f.write(u'\u201c')
>>> f.close()
>>>
>>> open('foo.txt', 'rb').read()
'\xe2\x80\x9c'
>>>
>>> codecs.open("foo.txt", "rb", "utf-8").read()
u'\u201c'
######


We can see that if we read and write to a codec-opened file, it'll
transparently do the encoding/decoding step for us as we write() and
read() the file.
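
Applied to something shaped like your f.write() line, it might look
roughly like the sketch below.  I don't know what your data looks like, so
the field values, the `desc` variable, and the file name are all invented;
only the names nm, pt, and vt come from your traceback.  I'm also assuming
a Python new enough to have the "with" statement:

```python
import codecs
import os
import tempfile

# Hypothetical field values; the real ones come from the vote data.
nm = u'Smith'
pt = u'D'
vt = u'Yea'
desc = u'\u201cSample bill\u201d'   # contains curly quotes, like the failing string

# Write to a scratch directory so the sketch doesn't clobber anything.
path = os.path.join(tempfile.mkdtemp(), 'votes.txt')

# The codec-opened file encodes to UTF-8 on every write().
with codecs.open(path, 'w', 'utf-8') as f:
    f.write(nm + u'*' + pt + u'*' + vt + u'*' + desc + u'*' + u'\n')

# Reading through the same codec hands back Unicode again.
with codecs.open(path, 'r', 'utf-8') as f:
    line = f.read()
```

The only real change from the original script is how the file gets
opened; the string concatenation stays the same.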


You may also find Joel Spolsky's post on "The Absolute Minimum Every
Software Developer Absolutely, Positively Must Know About Unicode And
Character Sets (No Excuses!)" useful in clarifying the basic concepts of
Unicode:

    http://www.joelonsoftware.com/articles/Unicode.html


I hope this helps!


