newbie with an encoding question, please help

Thu Apr 1 07:22:01 EDT 2010

2010/4/1 Mister Yu <eryan.yu at gmail.com>:
> hi experts,
>
> i m new to python, i m writing crawlers to extract data from some
> chinese websites, and i run into a encoding problem.
>
> i have a unicode object, which looks like this u'\xd6\xd0\xce\xc4'
> which is encoded in "gb2312",

No! Instances of type 'unicode' (i.e. string literals with a leading 'u')
***aren't encoded at all***. They hold characters (Unicode code points),
not bytes in some encoding.

> but i have no idea of how to convert it
> back to utf-8

To convert u'\xd6\xd0\xce\xc4' to UTF-8, do u'\xd6\xd0\xce\xc4'.encode('utf-8')

> to re-create this one is easy:
>
> this will work
> ============================
>>>> su = u"中文".encode('gb2312')
>>>> su
> u
>>>> print su.decode('gb2312')
> 中文    -> (same as the original string)
>
> ============================
> but this doesn't,why
> ===========================
>>>> su = u'\xd6\xd0\xce\xc4'
>>>> su
> u'\xd6\xd0\xce\xc4'
>>>> print su.decode('gb2312')
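You can't decode a unicode string; it's already been decoded!
(Python 2 lets you call .decode() on it anyway, but it first implicitly
encodes the string as ASCII, which fails with a UnicodeEncodeError for
non-ASCII characters like these.)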

One decodes a bytestring to get a unicode string.
One **encodes** a unicode string to get a bytestring.
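
To make the two directions concrete, here is a minimal sketch. It is written
for Python 3, where the byte type is called 'bytes' and the text type is
called 'str' (Python 2's 'str' and 'unicode' respectively). The variable
names are just for illustration.

```python
# Python 3 sketch: bytes -> text is decoding, text -> bytes is encoding.
raw = b'\xd6\xd0\xce\xc4'        # GB2312-encoded bytes (U+4E2D U+6587)

text = raw.decode('gb2312')      # decode: bytes -> text
utf8 = text.encode('utf-8')      # encode: text -> bytes, now in UTF-8

# Round trip back to the original GB2312 bytes.
round_trip = utf8.decode('utf-8').encode('gb2312')
```

The text object in the middle has no encoding of its own; an encoding only
comes into play at the moment you convert to or from bytes.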

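So the last line of your example needs .encode(), not .decode(). Note that
su.encode('gb2312') fails with a UnicodeEncodeError here, though: the
characters in su (U+00D6, U+00D0, U+00CE, U+00C4) don't exist in GB2312.
Your unicode object looks like GB2312 bytes that were mis-decoded as
Latin-1, so get the bytes back first and then decode them properly:
print su.encode('latin-1').decode('gb2312')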
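
For this particular value, a rough Python 3 sketch of the repair might look
like the following. It assumes the text came from GB2312 bytes that were
decoded as Latin-1 somewhere upstream (the crawler, for instance); that is a
guess based on the byte values, not something the original code confirms.
The helper name repair_misdecoded is made up for this example.

```python
def repair_misdecoded(text, wrong_codec='latin-1', right_codec='gb2312'):
    """Undo a decode done with wrong_codec, then decode with right_codec."""
    original_bytes = text.encode(wrong_codec)    # text -> the bytes it came from
    return original_bytes.decode(right_codec)    # bytes -> the intended text

su = '\xd6\xd0\xce\xc4'          # Python 3 spelling of u'\xd6\xd0\xce\xc4'
fixed = repair_misdecoded(su)    # the two intended Chinese characters
utf8_bytes = fixed.encode('utf-8')

# Encoding su straight to GB2312 does not work, because these four
# Latin-1 characters have no GB2312 representation.
try:
    su.encode('gb2312')
    direct_encode_failed = False
except UnicodeEncodeError:
    direct_encode_failed = True
```

This works because Latin-1 maps each code point in the range U+0000..U+00FF
to the byte with the same value, so encoding with it recovers the original
bytes exactly. The better long-term fix is to decode the downloaded bytes
with the right codec in the first place.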

Only call .encode() on things of type 'unicode'.
Only call .decode() on things of type 'str'.
[When using Python 2.x that is. Python 3.x renames the types in question.]
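
As a small illustration of the renaming, here is a Python 3 sketch. There
the text type is 'str' and the byte type is 'bytes', and the rule above is
enforced by the types themselves: the wrong method simply doesn't exist.

```python
# Python 3: 'str' is text (Python 2's unicode), 'bytes' is Python 2's str.
text = '\u4e2d\u6587'
data = text.encode('gb2312')

text_is_str = isinstance(text, str)
data_is_bytes = isinstance(data, bytes)

# Text objects have no .decode(), byte objects have no .encode().
str_has_decode = hasattr(text, 'decode')
bytes_has_encode = hasattr(data, 'encode')
```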

Cheers,
Chris
--
http://blog.rebertia.com