newbie with an encoding question, please help

Mister Yu eryan.yu at gmail.com
Thu Apr 1 10:53:31 EDT 2010


On Apr 1, 9:31 pm, Stefan Behnel <stefan... at behnel.de> wrote:
> Mister Yu, 01.04.2010 14:26:
>
> > On Apr 1, 8:13 pm, Chris Rebert wrote:
> >> gb2312_bytes = ''.join([chr(ord(c)) for c in u'\xd6\xd0\xce\xc4'])
> >> unicode_string = gb2312_bytes.decode('gb2312')
> >> utf8_bytes = unicode_string.encode('utf-8') #as you wanted
>
> Simplifying this hack a bit:
>
>      gb2312_bytes = u'\xd6\xd0\xce\xc4'.encode('ISO-8859-1')
>      unicode_string = gb2312_bytes.decode('gb2312')
>      utf8_bytes = unicode_string.encode('utf-8')
>
> Although I have to wonder why you want a UTF-8 encoded byte string as
> output instead of Unicode.
>
> >> If possible, I'd look at the code that's giving you that funky
> >> "string" in the first place and see if it can be fixed to give you
> >> either a proper bytestring or proper unicode string rather than the
> >> bastardized mess you're currently having to deal with.
>
> > thanks for the great tips! it works like a charm.
>
> I hope you're aware that it's a big ugly hack, though. You should really
> try to fix your input instead.
>
> > I'm using the Scrapy project
> > (http://doc.scrapy.org/intro/tutorial.html) to write my crawler. When it
> > extracts data with XPath, it puts the Chinese characters directly into
> > the unicode object.
>
> My guess is that the HTML page you are parsing is broken and doesn't
> specify its encoding. In that case, all that scrapy can do is guess, and it
> seems to have guessed incorrectly.
>
> You should check if there is a way to tell scrapy about the expected page
> encoding, so that it can return correctly decoded unicode strings directly,
> instead of resorting to dirty hacks that may or may not work depending on
> the page you are parsing.
>
> Stefan

Hi Stefan,

I don't think the page is broken. You can take a look at it here:
http://www.7176.com/Sections/Genre/Comedy
It's kind of like a Chinese IMDB rip-off.

From what I can see in the page source, the header does contain the
encoding info:
<HTML><head>
<meta http-equiv="Content-Type" content="text/html; charset=gb2312" />
<meta http-equiv="Content-Language" content="zh-CN" />
<meta content="all" name="robots" />
<meta name="author" content="admin(at)7176.com" />
<meta name="Copyright" content="www.7176.com" />
<meta content="类别为 剧情 的电影列表 第1页" name="keywords" />
<TITLE>类别为 剧情 的电影列表 第1页</TITLE>
<LINK href="http://www.7176.com/images/pro.css" rel=stylesheet>
</HEAD>
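
For illustration, here is roughly how a charset declared like that could be
used when decoding the raw page bytes. This is only a sketch under my own
assumptions: decode_html and CHARSET_RE are made-up names, it is not how
Scrapy works internally, and the sample bytes are a cut-down stand-in for
the real page.

```python
import re

# Hypothetical pattern: finds the charset name in a <meta> declaration.
CHARSET_RE = re.compile(br'charset=["\']?([\w-]+)', re.IGNORECASE)

def decode_html(raw_bytes, default='utf-8'):
    """Decode raw page bytes using the charset the page itself declares."""
    # Only look near the top of the document, where <meta> tags live.
    match = CHARSET_RE.search(raw_bytes[:2048])
    encoding = match.group(1).decode('ascii') if match else default
    # 'replace' keeps going if a few bytes are invalid for that encoding.
    return raw_bytes.decode(encoding, 'replace')

# Cut-down sample: the title is the GB2312 byte sequence d6 d0 ce c4.
sample = (b'<HTML><head><meta http-equiv="Content-Type" '
          b'content="text/html; charset=gb2312" /><TITLE>'
          b'\xd6\xd0\xce\xc4</TITLE></head></HTML>')
text = decode_html(sample)
```

If the bytes are decoded this way up front, the title comes out as a real
unicode string and no re-encoding trick should be needed afterwards. An
unknown charset name would still raise LookupError in this sketch.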

Maybe I should take a look at the Scrapy source code, but I've been
using Python for less than a week, so I'm not sure I can understand
it.

Chris's workaround from earlier worked pretty well until it met a
string like this:
>>> su = u'一二三四 12345 一二三四'
>>> su
u'\u4e00\u4e8c\u4e09\u56db 12345 \u4e00\u4e8c\u4e09\u56db'
>>> gb2312_bytes = ''.join([chr(ord(c)) for c in u'\u4e00\u4e8c\u4e09\u56db 12345 \u4e00\u4e8c\u4e09\u56db'])
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: chr() arg not in range(256)

My first guess is that the digits don't get encoded and that messes
things up, although the error itself says chr() got a value outside
range(256).
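
One possible reading of the traceback is that this string is already
correctly decoded (its code points are far above 255), so the byte trick
does not apply to it at all. Below is a minimal sketch of a guarded version
of Stefan's ISO-8859-1 round trip that only converts strings it can actually
convert. fix_mojibake is a name I made up, and it assumes the broken strings
are always GB2312 bytes that were misread as Latin-1.

```python
def fix_mojibake(s):
    """Undo a GB2312-read-as-Latin-1 mix-up, leaving other strings alone."""
    try:
        # Raises UnicodeEncodeError if any code point is above 255,
        # i.e. the string already holds real non-Latin-1 characters.
        raw = s.encode('iso-8859-1')
        # Raises UnicodeDecodeError if the bytes are not valid GB2312.
        return raw.decode('gb2312')
    except UnicodeError:
        # Covers both errors above: return the input unchanged.
        return s

broken = u'\xd6\xd0\xce\xc4'
already_ok = u'\u4e00\u4e8c\u4e09\u56db 12345 \u4e00\u4e8c\u4e09\u56db'
fixed = fix_mojibake(broken)          # converted to the two real characters
untouched = fix_mojibake(already_ok)  # returned as-is
```

It is still a hack: a genuine Latin-1 string whose bytes happen to form
valid GB2312 would be converted by mistake, so fixing the decoding at the
source is the safer route.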

any ideas?

Once again, thanks for everybody's help!



