newbie with an encoding question, please help

Mister Yu eryan.yu at gmail.com
Thu Apr 1 07:38:41 EDT 2010


On Apr 1, 7:22 pm, Chris Rebert <c... at rebertia.com> wrote:
> 2010/4/1 Mister Yu <eryan... at gmail.com>:
>
> > hi experts,
>
> > I'm new to Python. I'm writing crawlers to extract data from some
> > Chinese websites, and I ran into an encoding problem.
>
> > i have a unicode object, which looks like this u'\xd6\xd0\xce\xc4'
> > which is encoded in "gb2312",
>
> No! Instances of type 'unicode' (i.e. strings with a leading 'u')
> ***aren't encoded at all***.
>
> > but i have no idea of how to convert it
> > back to utf-8
>
> To convert u'\xd6\xd0\xce\xc4' to UTF-8, do u'\xd6\xd0\xce\xc4'.encode('utf-8')
>
>
>
> > to re-create this one is easy:
>
> > this will work
> > ============================
> > >>> su = u"中文".encode('gb2312')
> > >>> su
> > '\xd6\xd0\xce\xc4'
> > >>> print su.decode('gb2312')
> > 中文    -> (same as the original string)
>
> > ============================
> > but this doesn't. Why?
> > ===========================
> > >>> su = u'\xd6\xd0\xce\xc4'
> > >>> su
> > u'\xd6\xd0\xce\xc4'
> > >>> print su.decode('gb2312')
>
> You can't decode a unicode string, it's already been decoded!
>
> One decodes a bytestring to get a unicode string.
> One **encodes** a unicode string to get a bytestring.
>
> So the last line of your example should be:
> print su.encode('gb2312')
>
> Only call .encode() on things of type 'unicode'.
> Only call .decode() on things of type 'str'.
> [When using Python 2.x that is. Python 3.x renames the types in question.]
>
> Cheers,
> Chris
> --http://blog.rebertia.com
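
For reference, the encode/decode rule quoted above can be sketched as
below. This is only a minimal illustration and not part of Chris's reply.
It uses u"" and b"" literals so that it should run on both Python 2.7 and
Python 3.3+, and the variable names are made up for the example.

```python
# -*- coding: utf-8 -*-
# Sketch of the rule: encode text to get bytes, decode bytes to get text.

text = u"\u4e2d\u6587"          # the two characters of "中文" as a unicode string

# unicode -> bytes ('str' in Python 2, 'bytes' in Python 3)
raw = text.encode("gb2312")
assert raw == b"\xd6\xd0\xce\xc4"

# bytes -> unicode
back = raw.decode("gb2312")
assert back == text

# The same text encoded as UTF-8 gives a different byte sequence.
utf8 = text.encode("utf-8")
assert utf8 == b"\xe4\xb8\xad\xe6\x96\x87"
```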

Hi, thanks for the tips.

But I'm still not sure how to convert the unicode object
u'\xd6\xd0\xce\xc4' back to "中文", the string it is supposed to be.

Thanks.

Sorry, I'm really new to Python.
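
One possible way to do this, assuming the unicode object was produced by
mistakenly decoding GB2312 bytes as Latin-1 (ISO-8859-1), so that each
code point is really one raw byte value: encode it back to bytes with
'latin-1', then decode those bytes as 'gb2312'. The helper name
recover_text is hypothetical, and the sketch only applies when that
assumption about the data holds.

```python
# -*- coding: utf-8 -*-
# Sketch: undo a wrong Latin-1 decode, then decode with the real encoding.
# Should run on Python 2.7 and Python 3.3+.

def recover_text(mangled, real_encoding="gb2312"):
    """Hypothetical helper: 'mangled' is a unicode string whose code points
    (all below 256) are assumed to be raw bytes of 'real_encoding'."""
    # Latin-1 maps code points 0-255 one-to-one onto byte values 0-255,
    # so this step gives back the original bytes unchanged.
    raw = mangled.encode("latin-1")
    # Now decode the bytes with the encoding they were really in.
    return raw.decode(real_encoding)

su = u"\xd6\xd0\xce\xc4"
text = recover_text(su)
assert text == u"\u4e2d\u6587"       # the two characters of "中文"

# Once the text is correct, it can be encoded to UTF-8 bytes as usual.
utf8_bytes = text.encode("utf-8")
```

If the crawler can be changed, it is probably cleaner to decode the raw
page bytes with the right encoding in the first place, so the mis-decoded
unicode object never appears.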


