newbie with an encoding question, please help
Stefan Behnel
stefan_ml at behnel.de
Thu Apr 1 09:31:08 EDT 2010
Mister Yu, 01.04.2010 14:26:
> On Apr 1, 8:13 pm, Chris Rebert wrote:
>> gb2312_bytes = ''.join([chr(ord(c)) for c in u'\xd6\xd0\xce\xc4'])
>> unicode_string = gb2312_bytes.decode('gb2312')
>> utf8_bytes = unicode_string.encode('utf-8') #as you wanted
Simplifying this hack a bit:
gb2312_bytes = u'\xd6\xd0\xce\xc4'.encode('ISO-8859-1')
unicode_string = gb2312_bytes.decode('gb2312')
utf8_bytes = unicode_string.encode('utf-8')
Although I have to wonder why you want a UTF-8 encoded byte string as
output instead of Unicode.
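The snippets above are Python 2. As a minimal sketch, assuming Python 3 (where `str` is already Unicode and the `u''` prefix is optional), the same round trip might look like this; the name `mangled` is illustrative only:

```python
# GB2312 byte values that were wrongly stored as Unicode code points.
mangled = '\xd6\xd0\xce\xc4'

# ISO-8859-1 maps code points 0-255 one-to-one onto byte values,
# so encoding with it recovers the original bytes unchanged.
gb2312_bytes = mangled.encode('iso-8859-1')

# Decode the recovered bytes with the encoding they were really in.
unicode_string = gb2312_bytes.decode('gb2312')

# Only encode to UTF-8 if a byte string is genuinely required.
utf8_bytes = unicode_string.encode('utf-8')
```

Note that the first step raises `UnicodeEncodeError` as soon as the string contains a code point above 255, which is one reason this remains a hack rather than a general fix.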
>> If possible, I'd look at the code that's giving you that funky
>> "string" in the first place and see if it can be fixed to give you
>> either a proper bytestring or proper unicode string rather than the
>> bastardized mess you're currently having to deal with.
>
> Thanks for the great tips! It works like a charm.
I hope you're aware that it's a big ugly hack, though. You should really
try to fix your input instead.
> I'm using the Scrapy project (http://doc.scrapy.org/intro/tutorial.html)
> to write my crawler. When it extracts data with XPath, it puts the
> Chinese characters directly into the unicode object.
My guess is that the HTML page you are parsing is broken and doesn't
specify its encoding. In that case, all that scrapy can do is guess, and it
seems to have guessed incorrectly.
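To illustrate how such a string can come about, here is a small Python 3 sketch. It assumes the wrong guess was ISO-8859-1, which is consistent with the workaround above but is not confirmed for this particular page:

```python
# Bytes as a server might send them for a GB2312-encoded page.
raw = '\u4e2d\u6587'.encode('gb2312')

# ISO-8859-1 accepts every byte value, so a wrong guess never fails;
# it silently yields one bogus character per byte.
wrong = raw.decode('iso-8859-1')

# Decoding with the real encoding gives the intended two characters.
right = raw.decode('gb2312')
```

Here `wrong` is exactly the four-character string `u'\xd6\xd0\xce\xc4'` from the original question, while `right` is the two-character Chinese text.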
You should check if there is a way to tell scrapy about the expected page
encoding, so that it can return correctly decoded unicode strings directly,
instead of resorting to dirty hacks that may or may not work depending on
the page you are parsing.
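The general idea, independent of any particular library, is to decode the raw bytes once with the encoding you know to be correct. The following Python 3 sketch uses a hypothetical helper, `decode_page`, that is not part of Scrapy; how to pass an expected encoding to Scrapy itself has to be looked up in its documentation:

```python
def decode_page(raw_bytes, expected_encoding='gb2312'):
    """Hypothetical helper: decode page bytes with a known encoding."""
    # The default error handling is strict, so bytes that are invalid
    # in the expected encoding raise UnicodeDecodeError instead of
    # silently producing garbage.
    return raw_bytes.decode(expected_encoding)


# Example input standing in for the body of a downloaded page.
page = '<p>\u4e2d\u6587</p>'.encode('gb2312')
text = decode_page(page)
```

With the input decoded correctly at the source, no repair step is needed further down the line.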
Stefan