newbie with an encoding question, please help

Thu Apr 1 08:26:23 EDT 2010

On Apr 1, 8:13 pm, Chris Rebert <c... at rebertia.com> wrote:
> On Thu, Apr 1, 2010 at 4:38 AM, Mister Yu <eryan... at gmail.com> wrote:
> > On Apr 1, 7:22 pm, Chris Rebert <c... at rebertia.com> wrote:
> >> 2010/4/1 Mister Yu <eryan... at gmail.com>:
> >> > hi experts,
>
> >> > i m new to python, i m writing crawlers to extract data from some
> >> > chinese websites, and i ran into an encoding problem.
>
> >> > i have a unicode object, which looks like this u'\xd6\xd0\xce\xc4'
> >> > which is encoded in "gb2312",
> <snip>
> > hi, thanks for the tips.
>
> > but i m still not very sure how to convert a unicode object
> > **u'\xd6\xd0\xce\xc4'** back to "中文", the string it is supposed to be?
>
> Ah, my apologies! I overlooked something (sorry, it's early in the
> morning where I am).
> What you have there is ***really*** screwy. It's the 2 Chinese
> characters, encoded in gb2312, and then somehow cast *directly* into a
> 'unicode' string (which ought never to be done).
>
> In answer to your original question (after some experimentation):
> gb2312_bytes = ''.join([chr(ord(c)) for c in u'\xd6\xd0\xce\xc4'])
> unicode_string = gb2312_bytes.decode('gb2312')
> utf8_bytes = unicode_string.encode('utf-8') #as you wanted
>
> If possible, I'd look at the code that's giving you that funky
> "string" in the first place and see if it can be fixed to give you
> either a proper bytestring or proper unicode string rather than the
> bastardized mess you're currently having to deal with.
>
> Apologies again and Cheers,
> Chris
> --http://blog.rebertia.com

Hi Chris,

thanks for the great tips! it works like a charm.

i m using the Scrapy project (http://doc.scrapy.org/intro/tutorial.html)
to write my crawler. when it extracts data with xpath, it puts the
gb2312-encoded bytes of the chinese characters directly into the
unicode object.
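
for anyone who hits the same thing, here is a rough sketch of the fix
as a small helper. the function name repair_mislabeled_text and its
real_encoding parameter are made up for illustration, they are not
part of Scrapy. it relies on latin-1 mapping code points 0-255 straight
to the same byte values, which should be equivalent to the
chr(ord(c)) join above. it assumes every character in the broken
string is below U+0100, otherwise encode() raises UnicodeEncodeError.

```python
# -*- coding: utf-8 -*-

def repair_mislabeled_text(text, real_encoding="gb2312"):
    """Hypothetical helper: recover text whose raw bytes were copied
    one-to-one into a unicode string instead of being decoded."""
    # latin-1 maps code points 0-255 to the identical byte values,
    # so this step gets the original bytes back unchanged.
    raw_bytes = text.encode("latin-1")
    # Decode those bytes with the encoding the page really used.
    return raw_bytes.decode(real_encoding)


broken = u"\xd6\xd0\xce\xc4"           # the string from this thread
fixed = repair_mislabeled_text(broken)  # the two characters for "Chinese"
utf8_bytes = fixed.encode("utf-8")      # UTF-8 bytes, as originally wanted
```

fixing the decoding where the page is first read is still the better
long-term answer, as Chris said. this only cleans up afterwards.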

thanks again chris, and have a good April Fools' Day.

Cheers,
Yu