[Tutor] Encoding

Thu Mar 4 08:01:30 CET 2010

On Wed, 3 Mar 2010 20:44:51 +0100
Giorgio <anothernetfellow at gmail.com> wrote:

> Please let me post the third update O_o. You can forgot other 2, i'll put
> them into this email.
> 
> ---
> >>> s = "ciao è ciao"
> >>> print s
> ciao è ciao
> >>> s.encode('utf-8')
> 
> Traceback (most recent call last):
>   File "<pyshell#2>", line 1, in <module>
>     s.encode('utf-8')
> UnicodeDecodeError: 'ascii' codec can't decode byte 0xe8 in position 5:
> ordinal not in range(128)
> ---
> 
> I am getting more and more confused.

What you enter at the terminal prompt is text, encoded in a format (ascii, latin*, utf*,...) that probably depends on your system locale. Whatever the format, the result is always a sequence of bytes, so python stores it as a plain str:
>>> s = "ciao è ciao"
>>> s,type(s)
('ciao \xc3\xa8 ciao', <type 'str'>)
My system is configured for utf8. c3-a8 is the encoding of 'è' in utf8. It needs 2 bytes because of the rules of utf8 itself: every character outside ascii takes 2 bytes or more. Right?
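
To check the 2-byte claim yourself, here is a minimal sketch. It writes the character as an escape and uses u"..." / b"..." literals, so it should behave the same on Python 2.6+ and Python 3; the variable names are mine, only for illustration.

```python
# U+00E8 is e-grave; written as an escape so the file encoding does not matter.
e_grave = u"\xe8"

# Encoding turns text into bytes.
utf8_bytes = e_grave.encode("utf8")    # b"\xc3\xa8"
ascii_letter = u"a".encode("utf8")     # b"a"

# A character outside ascii takes 2 bytes here, an ascii one takes 1.
print(len(utf8_bytes))
print(len(ascii_letter))
```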

To get a python unicode string, it must be decoded from its format, for me utf8:
>>> u = s.decode("utf8")
>>> u,type(u)
(u'ciao \xe8 ciao', <type 'unicode'>)
e8 is the unicode code for 'è' (decimal 232). You can check that in tables. The repr shows it as a single \xe8 escape because 232 < 256; note that this is a code point, not a byte.
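
A small sketch of the difference between the code point and its encoded forms. The latin-1 line is my own addition for comparison, not something from your session: it only shows that the same character can be one byte in one format and two in another.

```python
raw = b"ciao \xc3\xa8 ciao"        # the utf8 bytes from the session above
text = raw.decode("utf8")          # bytes -> unicode text

code_point = ord(text[5])          # 232, i.e. 0xe8, whatever the encoding
as_latin1 = text.encode("latin-1") # one byte for e-grave: b"ciao \xe8 ciao"

# 12 bytes in utf8, but 11 characters, and 11 bytes in latin-1.
print(len(raw))
print(len(text))
print(len(as_latin1))
```

If your shell works in a one-byte format such as latin-1, that may be why your traceback mentions byte 0xe8 at position 5 while mine shows c3-a8.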

[comparison with php]

> Ok, now, the point is: you (and the manual) said that this line:
> 
> s = u"giorgio è giorgio"
> 
> will convert the string as unicode.

Yes and no: it will convert it *into* a <unicode> string, in the sense of a python representation for universal text. When seeing u"...", python will automagically *decode* the part in "...", taking as source format the one you indicate in a pseudo-comment at the top of your code file, eg:
# coding: utf8
Without such a declaration, python 2 assumes ascii for source files (see PEP 263). At the interactive prompt, as far as I know, it uses the terminal's encoding instead.
So, in my case u"giorgio è giorgio" is equivalent to "giorgio è giorgio".decode("utf8"):
>>> u1 = u"giorgio è giorgio"
>>> u2 = "giorgio è giorgio".decode("utf8")
>>> u1,u2
(u'giorgio \xe8 giorgio', u'giorgio \xe8 giorgio')
>>> u1 == u2
True

> But also said that the part between ""
> will be encoded with my editor BEFORE getting encoded in unicode by python.

will be encoded with my editor BEFORE getting encoded in unicode by python
-->
will be encoded *by* my editor BEFORE getting *decoded* *into* unicode by python
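
The two directions can be sketched as a round trip. This assumes a utf8 editor, as in your example; the names stored and loaded are hypothetical, and the sketch uses u"..." / b"..." literals so it should run on Python 2.6+ and Python 3 alike.

```python
text = u"giorgio \xe8 giorgio"     # text: a sequence of code points

# encode: text -> bytes. This is the editor's job when it saves the file.
stored = text.encode("utf8")

# decode: bytes -> text. This is what python does when it reads u"...".
loaded = stored.decode("utf8")

round_trip_ok = (loaded == text)
print(round_trip_ok)
```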

> So please pay attention to this example:
> 
> My editor is working in UTF8. I create this:
> 
> c = "giorgio è giorgio" // This will be an UTF8 string because of the file's
> encoding
Right.
> d = unicode(c) // This will be an unicode string
> e = c.encode() // How will be encoded this string? If PY is working like PHP
> this will be an utf8 string.

Have you tried it?
>>> c = "giorgio è giorgio" 
>>> d = unicode(c)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 8: ordinal not in range(128)

Now, tell us why! (the answer is below *)

> Can you help me?
> 
> Thankyou VERY much
> 
> Giorgio

Denis

(*)
You didn't say which format the source string is encoded in. By default, python uses ascii (I know, it's stupid), whose highest code is 127. So, 'è' is not accepted. Now, if I give a format, all works fine:
>>> d = unicode(c,"utf8")
>>> d
u'giorgio \xe8 giorgio'

Note: unicode(c,format) is equivalent to c.decode(format):
>>> c.decode("utf8")
u'giorgio \xe8 giorgio'
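
The same ascii default also explains your very first traceback. A hedged sketch, using only .decode on a b"..." literal so that it should run on Python 2.6+ and Python 3 (where unicode() no longer exists); bad_position is a name I made up.

```python
c = b"giorgio \xc3\xa8 giorgio"    # utf8 bytes, as in the session above

# With an explicit source format, decoding works.
d = c.decode("utf8")

# With ascii, decoding stops at the first byte above 127.
try:
    c.decode("ascii")
    bad_position = None
except UnicodeDecodeError as exc:
    bad_position = exc.start       # index of the offending byte

# In Python 2, calling .encode("utf-8") on a plain str first does this
# same implicit ascii decode, hence a *Decode* error from an encode call.
print(bad_position)
```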
________________________________

la vita e estrany

spir.wikidot.com