[Tutor] Decode and Encode

Wed Jan 28 11:26:43 CET 2015

On Wed, Jan 28, 2015 at 10:35 AM, Sunil Tech <sunil.techspk at gmail.com> wrote:
> Hi All,
>
> When I copied some text from the web and pasted it into the Python
> terminal, it was automatically converted into unicode (I suppose).
>
> Can anyone tell me how it does that?
> Eg:
> >>> p = "你好"
> >>> p
> '\xe4\xbd\xa0\xe5\xa5\xbd'
> >>> o = 'ªîV'
> >>> o
> '\xc2\xaa\xc3\xaeV'
> >>>

No, it didn’t.  You created a bytestring that contains some bytes.
Python does NOT think of `p` as a unicode string of 2 characters; it’s
a bytestring of 6 bytes.  You cannot use that bytestring to reliably
get only the first character, for example: `p[0]` will get you
garbage ('\xe4', which is only the first of the three bytes that encode
the first character, and which will typically render as a question mark
on a UTF-8 terminal).
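
To make the byte/character difference concrete, here is a minimal sketch.
It is written for Python 3 (where bytes and text are separate types), and
the variable names are mine, chosen only for illustration.  `raw` holds
the same six bytes as in the session above:

```python
# Python 3 sketch. `raw` holds the six bytes shown in the session above.
raw = b'\xe4\xbd\xa0\xe5\xa5\xbd'

byte_count = len(raw)         # len() on bytes counts bytes
first_byte = raw[:1]          # one byte: only a fragment of the first character

text = raw.decode('utf-8')    # interpret the bytes as UTF-8 text
char_count = len(text)        # len() on text counts characters
first_char = text[0]          # a whole character

print(byte_count, first_byte)           # 6 b'\xe4'
print(char_count, ascii(first_char))    # 2 '\u4f60'
```

ascii() is used only so the output looks the same regardless of what
your terminal can display.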

In order to get a real unicode string, you must do one of the following:

(a) Prefix the string literal with u, as in u'...'.  This works only if
your locale is set correctly and Python knows your terminal uses UTF-8.
For example:

>>> p = u"你好"
>>> p
u'\u4f60\u597d'

(b) Call decode on the bytestring and name the encoding explicitly.
This is safer and does not depend on a properly configured system.

>>> p = "你好".decode('utf-8')
>>> p
u'\u4f60\u597d'
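
Naming the codec yourself also means you have to name the right one.
The following is a small Python 3 sketch (not Python 2, and the variable
names are only illustrative) using the second bytestring from the
original question.  It shows that a wrong codec can silently give wrong
text, and that invalid input raises an error unless you ask for
replacement characters:

```python
# Python 3 sketch. `raw` is the second bytestring from the quoted session.
raw = b'\xc2\xaa\xc3\xaeV'

right = raw.decode('utf-8')      # correct codec: 3 characters
wrong = raw.decode('latin-1')    # wrong codec: no error, but 5 wrong characters

print(ascii(right))   # '\xaa\xeeV'
print(ascii(wrong))   # '\xc2\xaa\xc3\xaeV'

# Bytes that are not valid UTF-8 (here, a character cut short) raise an error...
truncated = b'\xe4\xbd'
try:
    truncated.decode('utf-8')
    failed = False
except UnicodeDecodeError:
    failed = True
print(failed)         # True

# ...unless you ask for U+FFFD replacement characters instead.
lenient = truncated.decode('utf-8', errors='replace')
```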

However, none of this is needed in Python 3.  There, plain string
literals are Unicode strings by default, so you can do:

>>> p = "你好"

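and have proper Unicode handling, assuming your system locale is set
correctly.  If it isn’t, spelling out the bytes as escapes and decoding
them,

>>> p = b'\xe4\xbd\xa0\xe5\xa5\xbd'.decode('utf-8')

would do it.  (Python 3 does not accept b"你好": a bytes literal may
contain only ASCII characters.)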
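
Since the subject line also asks about encoding, here is a rough Python 3
sketch of the round trip in both directions.  The text is written with
\u escapes so that the example does not depend on your terminal or
source-file encoding; the names are only illustrative.  It also checks
the Python 3 rule that a bytes literal may contain only ASCII characters:

```python
# Python 3 sketch: text (str) <-> bytes, using escapes throughout.
text = '\u4f60\u597d'                # the two characters from the examples above

encoded = text.encode('utf-8')       # str -> bytes
decoded = encoded.decode('utf-8')    # bytes -> str

print(encoded)            # b'\xe4\xbd\xa0\xe5\xa5\xbd'
print(decoded == text)    # True

# A bytes literal containing non-ASCII characters is rejected at compile time.
try:
    compile('b"\u4f60\u597d"', '<demo>', 'eval')
    literal_ok = True
except SyntaxError:
    literal_ok = False
print(literal_ok)         # False
```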

-- 
Chris Warrick <https://chriswarrick.com/>
PGP: 5EAAEA16