newbie with an encoding question, please help
eryan.yu at gmail.com
Thu Apr 1 16:53:31 CEST 2010
On Apr 1, 9:31 pm, Stefan Behnel <stefan... at behnel.de> wrote:
> Mister Yu, 01.04.2010 14:26:
> > On Apr 1, 8:13 pm, Chris Rebert wrote:
> >> gb2312_bytes = ''.join([chr(ord(c)) for c in u'\xd6\xd0\xce\xc4'])
> >> unicode_string = gb2312_bytes.decode('gb2312')
> >> utf8_bytes = unicode_string.encode('utf-8') #as you wanted
> Simplifying this hack a bit:
> gb2312_bytes = u'\xd6\xd0\xce\xc4'.encode('ISO-8859-1')
> unicode_string = gb2312_bytes.decode('gb2312')
> utf8_bytes = unicode_string.encode('utf-8')
> Although I have to wonder why you want a UTF-8 encoded byte string as
> output instead of Unicode.
> >> If possible, I'd look at the code that's giving you that funky
> >> "string" in the first place and see if it can be fixed to give you
> >> either a proper bytestring or proper unicode string rather than the
> >> bastardized mess you're currently having to deal with.
> > thanks for the great tips! it works like a charm.
> I hope you're aware that it's a big ugly hack, though. You should really
> try to fix your input instead.
> > i m using the Scrapy project(http://doc.scrapy.org/intro/
> > tutorial.html) to write my crawler, when it extract data with xpath,
> > it puts the chinese characters directly into the unicode object.
> My guess is that the HTML page you are parsing is broken and doesn't
> specify its encoding. In that case, all that scrapy can do is guess, and it
> seems to have guessed incorrectly.
> You should check if there is a way to tell scrapy about the expected page
> encoding, so that it can return correctly decoded unicode strings directly,
> instead of resorting to dirty hacks that may or may not work depending on
> the page you are parsing.
i don't think the page is broken. you can take a look at the page
http://www.7176.com/Sections/Genre/Comedy , it's kind of like
a chinese IMDB rip-off.
from what i can see in the page's source, the header does
contain the encoding info:
<HTML><head><meta http-equiv="Content-Type" content="text/html;
charset=gb2312" /><meta http-equiv="Content-Language" content="zh-CN" /
><meta content="all" name="robots" /><meta name="author"
content="admin(at)7176.com" /><meta name="Copyright" content="www.
7176.com" /> <meta content="类别为 剧情 的电影列表 第1页" name="keywords" /><TITLE>
类别为 剧情 的电影列表 第1页</TITLE><LINK href="http://www.7176.com/images/
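Since the header does declare charset=gb2312, one way around a wrong guess would be to decode the raw page bytes yourself using that declaration. Below is a minimal sketch of the idea, assuming you can get hold of the undecoded bytes of the page. It is not Scrapy's API; `decode_page` and `META_CHARSET` are names made up for illustration.

```python
import re

# Looks for charset=... inside a <meta> tag. The search runs on raw bytes
# because the page cannot be decoded until its encoding is known.
META_CHARSET = re.compile(br'<meta[^>]+charset\s*=\s*["\']?([\w-]+)', re.IGNORECASE)


def decode_page(raw_bytes, default='utf-8'):
    """Decode raw HTML bytes using the charset declared in a <meta> tag.

    Hypothetical helper for illustration, not part of Scrapy.
    """
    # The declaration normally sits near the top of the document.
    match = META_CHARSET.search(raw_bytes[:4096])
    encoding = match.group(1).decode('ascii') if match else default
    try:
        return raw_bytes.decode(encoding, 'replace')
    except LookupError:
        # Python does not know the declared codec name; use the default.
        return raw_bytes.decode(default, 'replace')
```

With the header above, `decode_page` should pick up gb2312 and return a proper unicode string, so no repair step is needed afterwards.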
maybe i should take a look at the source code of Scrapy, but i've been
using python for less than a week, so i'm not sure i can understand it.
Chris's earlier workaround works pretty well until it meets a
string like this:
>>> su = u'一二三四 12345 一二三四'
>>> su
u'\u4e00\u4e8c\u4e09\u56db 12345 \u4e00\u4e8c\u4e09\u56db'
>>> gb2312_bytes = ''.join([chr(ord(c)) for c in u'\u4e00\u4e8c\u4e09\u56db 12345 \u4e00\u4e8c\u4e09\u56db'])
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: chr() arg not in range(256)
the digits don't get encoded, so it messes up the code.
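Looking at the traceback again, the failure may come not from the digits but from characters such as u'\u4e00', whose code point is above 255, so chr() rejects them. That would suggest this string was already decoded correctly and needs no repair. Here is a minimal sketch, assuming the same ISO-8859-1 round trip discussed earlier in the thread, that only applies the hack when every character fits in one byte. The name `repair_gb2312` is made up for illustration.

```python
def repair_gb2312(text, real_encoding='gb2312'):
    """Undo a wrong ISO-8859-1 decode, leaving proper unicode strings alone.

    Hypothetical helper for illustration; it is still a hack.
    """
    try:
        # Only possible when every code point is below 256, i.e. the string
        # may really be raw bytes that were decoded as ISO-8859-1.
        raw = text.encode('iso-8859-1')
    except UnicodeEncodeError:
        # Contains real characters above U+00FF, so it is assumed to be fine.
        return text
    try:
        return raw.decode(real_encoding)
    except UnicodeDecodeError:
        # The bytes are not valid in the target encoding; keep the original.
        return text
```

A string that mixes both kinds of characters is returned untouched, so this only covers the two cases seen so far. Fixing the decoding at the source is still the better option.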
once again, thanks for everybody's help!