[Tutor] Encoding

spir denis.spir at gmail.com
Thu Mar 4 17:07:01 CET 2010

On Thu, 4 Mar 2010 15:13:44 +0100
Giorgio <anothernetfellow at gmail.com> wrote:

> Thankyou.
> You have clarificated many things in those emails. Due to high numbers of
> messages i won't quote everything.
> So, as i can clearly understand reading last spir's post, python gets
> strings encoded by my editor and to convert them to unicode i need to
> specify HOW they're encoded. This makes clear this example:
> c = "giorgio è giorgio"
> d = c.decode("utf8")
> I create an utf8 string, and to convert it into unicode i need to tell
> python that the string IS utf8.
> Just don't understand why in my Windows XP computer in Python IDLE doesn't
> work:
> >>> ================================ RESTART
> ================================
> >>>
> >>> c = "giorgio è giorgio"
> >>> c
> 'giorgio \xe8 giorgio'
> >>> d = c.decode("utf8")
> Traceback (most recent call last):
>   File "<pyshell#10>", line 1, in <module>
>     d = c.decode("utf8")
>   File "C:\Python26\lib\encodings\utf_8.py", line 16, in decode
>     return codecs.utf_8_decode(input, errors, True)
> UnicodeDecodeError: 'utf8' codec can't decode bytes in position 8-10:
> invalid data
> >>>

How do you know that your Windows XP terminal is configured to handle text as UTF-8? Why do you think it should be? I don't know much about Windows, but I've read that it has its own character sets (code pages). So, unless you have changed the configuration, it probably does not use UTF-8. (Conversely, I guess Macs use UTF-8 by default. Can someone confirm?)
In other words, c is not a piece of text encoded in UTF-8. The repr shows è stored as the single byte \xe8, which is how Latin-1 and Windows-1252 encode it; in UTF-8 it would be the two bytes \xc3\xa8.
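
Here is a small sketch of what is probably happening. The bytes are copied from the session quoted above; the guess that they come from the Windows-1252 code page is only an assumption on my side, and try_decode is a helper name invented for the illustration. Byte literals are used so the snippet behaves the same on Python 2.6+ and Python 3.

```python
# Bytes as shown in the IDLE session: e-grave stored as the single byte 0xE8.
raw = b"giorgio \xe8 giorgio"


def try_decode(data, encoding):
    """Return the decoded text, or None if the bytes are invalid in that encoding."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError:
        return None


# 0xE8 announces a multi-byte UTF-8 sequence, but no continuation bytes follow.
as_utf8 = try_decode(raw, "utf-8")

# Assumption: the shell delivered the text in Windows-1252 (Latin-1 would give
# the same result for this particular character).
as_cp1252 = try_decode(raw, "cp1252")

print(repr(as_utf8))
print(repr(as_cp1252))
# Once the text is unicode, it can be re-encoded to real UTF-8 bytes.
print(repr(as_cp1252.encode("utf-8")))
```

So decoding only works when the encoding you name is the one that actually produced the bytes.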

> In IDLE options i've set encoding to UTF8 of course. I also have some linux
> servers where i can try the IDLE but Putty doesn't seem to support UTF8.
> But, let's continue:
> In that example i've specified UTF8 in the decode method. If i hadn't set it
> python would have taken the one i specified in the second line of the file,
> right?
> As last point, i can't understand why this works:
> >>> a = u"giorgio è giorgio"
> >>> a
> u'giorgio \xe8 giorgio'

This trial uses the default encoding of your system. It does the same as
   a = "giorgio è giorgio".decode(default_encoding)
It's a shortcut for unicode string *literals* (constants) written directly by the programmer. In source code, it would use the encoding declared at the top of the file.
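
To illustrate (only a sketch: the two byte strings are what I would expect a UTF-8 and a Latin-1 editor to save for this text, and the variable names are made up), decoding each with its own encoding gives the very same unicode string that a u"" literal produces:

```python
# The same text as two differently configured editors would store it on disk.
saved_as_utf8 = b"giorgio \xc3\xa8 giorgio"   # e-grave takes two bytes
saved_as_latin1 = b"giorgio \xe8 giorgio"     # e-grave takes one byte

# Decoding with the matching encoding yields the same abstract text.
from_utf8 = saved_as_utf8.decode("utf-8")
from_latin1 = saved_as_latin1.decode("latin-1")

# What the u"" literal stands for, written with an escape so that this snippet
# does not depend on the encoding of the file it is pasted into.
literal = u"giorgio \xe8 giorgio"

print(from_utf8 == literal and from_latin1 == literal)
print(len(saved_as_utf8), len(saved_as_latin1), len(literal))
```

The byte strings differ in length, but the unicode result does not, because it counts characters and not bytes.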

> And this one doesn't:
> >>> a = "giorgio è giorgio"
> >>> b = unicode(a)
> Traceback (most recent call last):
>   File "<pyshell#14>", line 1, in <module>
>     b = unicode(a)
> UnicodeDecodeError: 'ascii' codec can't decode byte 0xe8 in position 8:
> ordinal not in range(128)

This trial uses ASCII because you give no encoding (yes, it can be seen as a flaw). It does the same as
   b = "giorgio è giorgio".decode("ascii")
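
A sketch of the failure and of the usual way out, which is to name the encoding explicitly. Again, Latin-1 is my assumption about where the byte 0xE8 comes from, and the variable names are only for illustration. In Python 2 the explicit form can also be spelled unicode(raw, "latin-1"); the snippet uses .decode() so it runs on Python 2.6+ and Python 3 alike.

```python
import sys

raw = b"giorgio \xe8 giorgio"

# The interpreter-wide default used when no encoding is given.
# It differs between Python versions, so it is only printed here.
print(sys.getdefaultencoding())

# ASCII only covers bytes 0-127, so 0xE8 cannot be decoded.
try:
    raw.decode("ascii")
    ascii_failed = False
except UnicodeDecodeError as exc:
    ascii_failed = True
    bad_position = exc.start   # index of the offending byte

# Naming the encoding explicitly (assumed here to be Latin-1) works.
text = raw.decode("latin-1")

# If the encoding is truly unknown, undecodable bytes can be replaced
# with U+FFFD instead of raising, at the cost of losing information.
lossy = raw.decode("ascii", "replace")

print(ascii_failed, bad_position)
print(repr(text))
print(repr(lossy))
```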

> >>>
> The second doesn't work because i have not told python how the string was
> encoded. But in the first too i haven't specified the encoding O_O.
> Thankyou again for your help.
> Giorgio


la vita e estrany

