[Tutor] Decode and Encode
Chris Warrick
kwpolska at gmail.com
Wed Jan 28 11:26:43 CET 2015
On Wed, Jan 28, 2015 at 10:35 AM, Sunil Tech <sunil.techspk at gmail.com> wrote:
> Hi All,
>
> When I copied some text from the web and pasted it into the Python
> terminal, it was automatically converted into unicode (I suppose).
>
> Can anyone tell me how it does that?
> Eg:
> >>> p = "你好"
> >>> p
> '\xe4\xbd\xa0\xe5\xa5\xbd'
> >>> o = 'ªîV'
> >>> o
> '\xc2\xaa\xc3\xaeV'
> >>>
No, it didn’t. You created a bytestring that contains some bytes.
Python does NOT think of `p` as a unicode string of 2 characters; it’s
a bytestring of 6 bytes. You cannot use that bytestring to reliably
get only the first character, for example: `p[0]` will get you
garbage ('\xe4', which will render as a question mark on a UTF-8
terminal).
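To make the bytes-versus-characters difference concrete, here is a small sketch. It assumes Python 3 (where the two types are kept strictly apart), and the variable names are only for illustration:

```python
# The six bytes from the question, and the text they encode in UTF-8.
raw = b'\xe4\xbd\xa0\xe5\xa5\xbd'
text = raw.decode('utf-8')

byte_count = len(raw)    # counts bytes
char_count = len(text)   # counts characters (code points)
first_byte = raw[:1]     # one byte: an incomplete UTF-8 sequence
first_char = text[0]     # one whole character

# ascii() keeps the output readable even on a misconfigured terminal.
print(byte_count, char_count, first_byte, ascii(first_char))
```

Slicing the bytes gives you a fragment of a character, while indexing the decoded string gives you the character itself.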
In order to get a real unicode string, you must do one of the following:
(a) Prefix the string literal with u. This works only if your locale
is set correctly and Python knows you use UTF-8. For example:
>>> p = u"你好"
>>> p
u'\u4f60\u597d'
(b) Use decode on the bytestring, which is safer and does not depend
on a properly configured system.
>>> p = "你好".decode('utf-8')
>>> p
u'\u4f60\u597d'
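If you decode in more than one place, it can help to wrap the call. The sketch below assumes Python 3; `to_text` is a hypothetical helper name of mine, not something from the standard library. It also shows that decoding fails loudly when the bytes are not valid UTF-8, for example when a multi-byte character has been cut in half:

```python
def to_text(value, encoding='utf-8'):
    """Hypothetical helper: decode bytes to text, pass text through unchanged."""
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value

decoded = to_text(b'\xe4\xbd\xa0\xe5\xa5\xbd')

# A truncated sequence (only 2 of the 3 bytes of the first character)
# is not valid UTF-8, so decode() raises instead of guessing.
try:
    to_text(b'\xe4\xbd')
    truncated_ok = True
except UnicodeDecodeError:
    truncated_ok = False

print(ascii(decoded), truncated_ok)
```

Passing the encoding explicitly is what makes this independent of the system configuration.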
However, neither step is needed in Python 3. String literals there
are Unicode by default, so you can do:
>>> p = "你好"
and have proper Unicode handling, assuming your system locale is set
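correctly. If it isn’t, decode the raw bytes yourself. A Python 3 bytes
literal may contain only ASCII characters, so write the bytes as escapes:
>>> p = b'\xe4\xbd\xa0\xe5\xa5\xbd'.decode('utf-8')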
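Going the other way (text to bytes) is `encode`. As a rough Python 3 sketch, encoding the two strings from the question as UTF-8 reproduces exactly the byte values the Python 2 terminal displayed. The characters are written as escapes here so the snippet does not depend on the terminal or file encoding:

```python
text = '\u4f60\u597d'       # the two-character string from the first example
other = '\u00aa\u00eeV'     # the three-character string from the second example

encoded = text.encode('utf-8')
encoded_other = other.encode('utf-8')

# decode() reverses encode() when the same encoding is used for both.
round_trip = encoded.decode('utf-8') == text

print(encoded, encoded_other, round_trip)
```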
--
Chris Warrick <https://chriswarrick.com/>
PGP: 5EAAEA16