newbie with a encoding question, please help
Chris Rebert
clp2 at rebertia.com
Thu Apr 1 07:22:01 EDT 2010
2010/4/1 Mister Yu <eryan.yu at gmail.com>:
> hi experts,
>
> i m new to python, i m writing crawlers to extract data from some
> chinese websites, and i run into a encoding problem.
>
> i have a unicode object, which looks like this u'\xd6\xd0\xce\xc4'
> which is encoded in "gb2312",
No! Instances of type 'unicode' (i.e. strings with a leading 'u')
***aren't encoded at all***.
> but i have no idea of how to convert it
> back to utf-8
To convert u'\xd6\xd0\xce\xc4' to UTF-8, do u'\xd6\xd0\xce\xc4'.encode('utf-8')
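As a small sketch of what that call produces: the snippet below is written in Python 3 syntax, where the 'unicode' type is called str and bytestrings are called bytes. The variable names are only for illustration.

```python
# Python 3: str plays the role of Python 2's 'unicode' type.
su = '\xd6\xd0\xce\xc4'          # four code points: U+00D6, U+00D0, U+00CE, U+00C4

utf8_bytes = su.encode('utf-8')  # encoding text always yields bytes
print(type(utf8_bytes))
print(utf8_bytes)

# Every one of these code points is above U+007F, so UTF-8 needs two bytes each.
print(len(su), len(utf8_bytes))
```

Note that this gives the UTF-8 form of the four characters the string actually holds (U+00D6 and so on), which is not the same thing as the UTF-8 form of the Chinese text.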
> to re-create this one is easy:
>
> this will work
> ============================
>>>> su = u"中文".encode('gb2312')
>>>> su
> u
>>>> print su.decode('gb2312')
> 中文 -> (same as the original string)
>
> ============================
> but this doesn't,why
> ===========================
>>>> su = u'\xd6\xd0\xce\xc4'
>>>> su
> u'\xd6\xd0\xce\xc4'
>>>> print su.decode('gb2312')
You can't decode a unicode string; it's already been decoded!
One decodes a bytestring to get a unicode string.
One **encodes** a unicode string to get a bytestring.
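So the last line of your example should encode rather than decode. Note that
su.encode('gb2312') itself fails with a UnicodeEncodeError, because code points
like u'\xd6' have no GB2312 representation. Since the code points in su are
really the GB2312 byte values, encode with 'latin-1' to get those bytes back
and then decode them:
print su.encode('latin-1').decode('gb2312')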
Only call .encode() on things of type 'unicode'.
Only call .decode() on things of type 'str'.
[When using Python 2.x that is. Python 3.x renames the types in question.]
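To make the encode/decode direction concrete, here is a hedged sketch in Python 3 syntax (str is the text type, bytes is the bytestring type). It assumes, as the quoted example suggests, that the code points in su are really GB2312 byte values that were read in as if they were Latin-1 characters; under that assumption, encoding with 'latin-1' gives the original bytes back. The names raw, text, and repaired are made up for illustration.

```python
# Text -> bytes: encode.  Bytes -> text: decode.
raw = '中文'.encode('gb2312')       # text -> bytes
print(raw)                          # b'\xd6\xd0\xce\xc4'
text = raw.decode('gb2312')         # bytes -> text
print(text == '中文')               # True

# Assumption: su holds GB2312 byte values mistaken for Latin-1 characters.
su = '\xd6\xd0\xce\xc4'

# Latin-1 maps code points 0-255 to the same byte values,
# so encoding with it recovers the original bytes unchanged.
repaired = su.encode('latin-1').decode('gb2312')
print(repaired == '中文')           # True

# Once the text is correct, it can be encoded to any target encoding.
print(repaired.encode('utf-8'))     # b'\xe4\xb8\xad\xe6\x96\x87'
```

In Python 2 the same idea would be spelled with u'...' literals, and the 'latin-1' step is only appropriate when the string was mis-decoded in the first place.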
Cheers,
Chris
--
http://blog.rebertia.com