[lxml-dev] About encoding question !

Hey guys,

I recently started using lxml for my HTML parsing. It's really great, and indeed the fastest compared to other libraries. But since I began parsing pages that use the gb2312 encoding, I've had a problem. The output is here: http://david-paste.cn/paste/50/

Please help me with this. Thanks, guys.

Regards,
David
-- 
Attitude determines everything!

David Shieh <mykingheaven@gmail.com> (DS) wrote:
DS> Hey guys,
DS> I recently use lxml to do my HTML parsing, it's really great, and
DS> indeed the fastest one compare to other libraries.
DS> But since I begin to parse some other pages using gb2312 coding, I've a
DS> problem. The output is in here: http://david-paste.cn/paste/50/

From the first error message it seems that you have a byte string as input, not a Unicode string (this also seems to be implied by your message: 'pages using gb2312 coding').

Firstly, HTML is not XML. XHTML is, however. So if your input is not XHTML, you should use an HTML parser instead of the XML parser.

If you feed byte strings to the XML parser, they should contain an encoding declaration, like:

    <?xml version="1.0" encoding="gb2312"?>

Otherwise the parser assumes the input is utf-8, as the error message indicates.

contents.encode('utf-8') doesn't make sense when contents contains a byte string. It would only make sense when contents contains a Unicode string. The same goes for contents.encode('gb2312').

contents.decode('utf-8') is wrong if contents does not contain a utf-8 encoded byte string. However, contents.decode('gb2312') would make sense if contents contains a gb2312 encoded byte string. This will deliver a Unicode string that you can pass to etree.fromstring. So etree.fromstring(contents.decode('gb2312')) could be an alternative to specifying gb2312 in the file itself.

-- 
Piet van Oostrum <piet@vanoostrum.org>
WWW: http://pietvanoostrum.com/
PGP key: [8DAE142BE17999C4]
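
For illustration, the options discussed in this reply could be sketched roughly as follows. This is only a minimal sketch, assuming Python 3 and an lxml build whose libxml2 supports gb2312; the sample documents and the variable names (text, contents, html_bytes) are made up for the example and are not taken from the original paste.

```python
from lxml import etree

# Hypothetical sample data standing in for bytes read from a file or socket.
text = "<root><title>中文</title></root>"
contents = text.encode("gb2312")  # a gb2312 encoded byte string

# Decoding with the wrong codec fails, which matches the advice in the reply.
try:
    contents.decode("utf-8")
    wrong_codec_failed = False
except UnicodeDecodeError:
    wrong_codec_failed = True

# Option 1: decode to a Unicode string first, then parse.
root = etree.fromstring(contents.decode("gb2312"))

# Option 2: keep the bytes and declare the encoding in the document itself.
declared = b'<?xml version="1.0" encoding="gb2312"?>' + contents
root2 = etree.fromstring(declared)

# Option 3: for HTML (not XHTML) input, use the HTML parser and tell it
# the encoding of the byte string.
html_bytes = "<html><body><p>中文</p></body></html>".encode("gb2312")
parser = etree.HTMLParser(encoding="gb2312")
doc = etree.fromstring(html_bytes, parser)

print(root.findtext("title"), root2.findtext("title"), doc.findtext(".//p"))
```

Note that in Python 3 a bytes object has no encode method at all, so the contents.encode(...) calls criticized above fail immediately instead of raising a decoding error.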

participants (2)
- David Shieh
- Piet van Oostrum