newbie with an encoding question, please help

Thu Apr 1 09:31:08 EDT 2010

Mister Yu, 01.04.2010 14:26:
> On Apr 1, 8:13 pm, Chris Rebert wrote:
>> gb2312_bytes = ''.join([chr(ord(c)) for c in u'\xd6\xd0\xce\xc4'])
>> unicode_string = gb2312_bytes.decode('gb2312')
>> utf8_bytes = unicode_string.encode('utf-8') #as you wanted

Simplifying this hack a bit:

     gb2312_bytes = u'\xd6\xd0\xce\xc4'.encode('ISO-8859-1')
     unicode_string = gb2312_bytes.decode('gb2312')
     utf8_bytes = unicode_string.encode('utf-8')

Although I have to wonder why you want a UTF-8 encoded byte string as 
output instead of Unicode.
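
For readers on Python 3, where str is already Unicode, the same round trip
might look like the sketch below. It assumes, as in the snippet above, that
the garbled string holds exactly one character per original GB2312 byte.
The name `garbled` is made up for illustration.

```python
# Python 3 sketch of the same repair (assumption: one character per raw byte).
garbled = '\xd6\xd0\xce\xc4'

# ISO-8859-1 maps code points 0-255 to identical byte values,
# so encoding with it recovers the raw bytes unchanged.
gb2312_bytes = garbled.encode('iso-8859-1')

# Decode those bytes with the encoding they were really written in.
unicode_string = gb2312_bytes.decode('gb2312')

# Only needed if a UTF-8 byte string is really what you want as output.
utf8_bytes = unicode_string.encode('utf-8')

print(ascii(unicode_string))
print(utf8_bytes)
```

In most programs it is enough to stop at `unicode_string` and encode only
when writing to a file or socket.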

>> If possible, I'd look at the code that's giving you that funky
>> "string" in the first place and see if it can be fixed to give you
>> either a proper bytestring or proper unicode string rather than the
>> bastardized mess you're currently having to deal with.
>
> thanks for the great tips! it works like a charm.

I hope you're aware that it's a big ugly hack, though. You should really 
try to fix your input instead.

> I'm using the Scrapy project
> (http://doc.scrapy.org/intro/tutorial.html) to write my crawler. When it
> extracts data with XPath, it puts the Chinese characters directly into
> the unicode object.

My guess is that the HTML page you are parsing is broken and doesn't 
specify its encoding. In that case, all that Scrapy can do is guess, and it 
seems to have guessed incorrectly.
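
To illustrate how a wrong guess could produce exactly the string from this
thread, here is a small Python 3 sketch. It assumes the page bytes were
GB2312 and were decoded as ISO-8859-1. Which fallback Scrapy actually used
is not known here, so the fallback and the variable names are illustrative
only.

```python
# Bytes a GB2312 page would contain for the two characters U+4E2D U+6587.
page_bytes = '\u4e2d\u6587'.encode('gb2312')

# Assumed wrong guess: ISO-8859-1 turns every byte into one character.
wrong_guess = page_bytes.decode('iso-8859-1')

# Correct decoding: each pair of bytes becomes one Chinese character.
right_guess = page_bytes.decode('gb2312')

print(ascii(wrong_guess), len(wrong_guess))
print(ascii(right_guess), len(right_guess))
```

The wrongly decoded string has four characters instead of two, which is the
"funky" unicode object discussed above.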

You should check whether there is a way to tell Scrapy the expected page 
encoding, so that it can return correctly decoded unicode strings directly, 
instead of resorting to dirty hacks that may or may not work depending on 
the page you are parsing.
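
Independent of any particular library, the idea of decoding with a known
encoding can be sketched in Python 3 as follows. `decode_page` is a
hypothetical helper, not a Scrapy function, and it assumes you can get at
the raw page bytes and know the encoding in advance.

```python
def decode_page(raw_bytes, expected_encoding):
    """Decode raw page bytes with an encoding known in advance.

    Hypothetical helper for illustration, not part of Scrapy.
    Strict error handling raises UnicodeDecodeError on invalid input
    instead of silently producing garbage.
    """
    return raw_bytes.decode(expected_encoding, errors='strict')


raw = b'\xd6\xd0\xce\xc4'  # example bytes taken from this thread

text = decode_page(raw, 'gb2312')
print(ascii(text))

# A wrong expectation fails loudly rather than yielding a mangled string.
try:
    decode_page(raw, 'utf-8')
    utf8_failed = False
except UnicodeDecodeError:
    utf8_failed = True
print(utf8_failed)
```

Failing loudly on a mismatch is a deliberate choice here: it surfaces a bad
page early instead of passing a mangled string further down the pipeline.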

Stefan