[lxml-dev] html.xpath returns not decoded unicode string

Hi guys, I'm trying to parse HTML encoded in 'iso-8859-2' and use XPath to get some specific content. The content I usually get from XPath is a Python unicode string, but in this case it does not contain Unicode code points. Instead it contains the characters as encoded in 'iso-8859-2', as if the bytes had never been decoded and were put into the unicode object as they are. Take this URL for example: 'http://www.pkm.jaworzno.pl/rozklady/rozklad.php?kat=302_20100628&nr=14&kier=1', and run the following on the command line:
>>> from lxml import html
>>> import urllib2
>>> root = html.parse(urllib2.urlopen('http://www.pkm.jaworzno.pl/rozklady/rozklad.php?kat=302_20100628&nr=14&kier=1'))
>>> root.docinfo.encoding
'iso-8859-2'
>>> header = root.xpath('/html/body/center/center[1]/table/tr/td/table')[3].text_content().strip()
>>> header
u'Soboty, niedziele i \xb6wi\xeata'
>>> uc = u'Soboty, niedziele i święta'
>>> uc
u'Soboty, niedziele i \u015bwi\u0119ta'
>>> uc == header
False
I expect the header and uc variables to be equal, but they're not, even though uc is the actual unicode representation of my string. I use this code in a script that runs on Windows with an English locale, and the script has the # -*- coding: utf-8 -*- directive.

The interesting thing is that the script passes the comparison uc == header on http://www.pkm.jaworzno.pl/rozklady/rozklad.php?kat=302_20100628&nr=*13*&kier=1 but fails on http://www.pkm.jaworzno.pl/rozklady/rozklad.php?kat=302_20100628&nr=*14*&kier=1. Needless to say, the content I'm trying to get (Soboty, niedziele i święta) is byte-for-byte identical on both pages, as is the declared encoding, and both pages render correctly in a web browser.

Can anybody help me with this?

OS: Windows XP (English) 32 bit
Python: 2.6.5
lxml.etree: (2, 2, 0, 0)
libxml used: (2, 7, 2)
libxml compiled: (2, 7, 2)
libxslt used: (1, 1, 24)
libxslt compiled: (1, 1, 24)

Regards
Piotr

Piotr Owcarz, 30.06.2010 16:44:
> I'm trying to parse HTML encoded in 'iso-8859-2' and use XPath to get some specific content. The content I usually get from XPath is a Python unicode string, but in this case it does not contain Unicode code points. Instead it contains the characters as encoded in 'iso-8859-2', as if the bytes had never been decoded and were put into the unicode object as they are.
Note that the problem at hand is unrelated to XPath. Only the parser has an impact on encodings: it decodes the input bytes, and XPath merely returns what the parser stored in the tree.
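Since the decoding happens in the parser, one way to take encoding detection out of the picture is to tell the parser the encoding explicitly. The following is only a minimal sketch, not something taken from the thread: it assumes the `encoding` keyword of `lxml.html.HTMLParser`, and the sample bytes and variable names are made up for illustration rather than fetched from the page in question.

```python
from lxml import html

# Hypothetical sample: ISO-8859-2 encoded bytes without any encoding declaration.
# 0xB6 is "ś" and 0xEA is "ę" in ISO-8859-2.
raw = b'<html><body><p>Soboty, niedziele i \xb6wi\xeata</p></body></html>'

# Pass the encoding to the parser instead of relying on auto-detection.
parser = html.HTMLParser(encoding='iso-8859-2')
doc = html.document_fromstring(raw, parser=parser)

# The text in the tree is already decoded; no XPath-specific step is involved.
text = doc.findtext('.//p')
print(repr(text))
```

If the explicit encoding yields the expected string while auto-detection does not, that would point at encoding detection in the parser rather than at the XPath expression.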
> Take this URL for example: 'http://www.pkm.jaworzno.pl/rozklady/rozklad.php?kat=302_20100628&nr=14&kier=1', and run the following on the command line:
>
> >>> from lxml import html
> >>> import urllib2
> >>> root = html.parse(urllib2.urlopen('http://www.pkm.jaworzno.pl/rozklady/rozklad.php?kat=302_20100628&nr=14&kier=1'))
> >>> root.docinfo.encoding
> 'iso-8859-2'
> >>> header = root.xpath('/html/body/center/center[1]/table/tr/td/table')[3].text_content().strip()
> >>> header
> u'Soboty, niedziele i \xb6wi\xeata'
> >>> uc = u'Soboty, niedziele i święta'
> >>> uc
> u'Soboty, niedziele i \u015bwi\u0119ta'
> >>> uc == header
> False
Seems to work for me:

In [1]: from lxml import html
In [2]: root = html.parse('http://www.pkm.jaworzno.pl/rozklady/rozklad.php?kat=302_20100628&nr=14&kier=1')
In [3]: root.docinfo.encoding
Out[3]: 'iso-8859-2'
In [4]: root.xpath('/html/body/center/center[1]/table/tr/td/table')[3].text_content().strip()
Out[4]: u'Soboty, niedziele i \u015bwi\u0119ta'

Even when I use urllib2, I get

In [14]: root = html.parse(urllib2.urlopen('http://www.pkm.jaworzno.pl/rozklady/rozklad.php?kat=302_20100628&nr=14&kier=1'))
In [15]: header = root.xpath('/html/body/center/center[1]/table/tr/td/table')[3].text_content().strip()
In [16]: header
Out[16]: u'Soboty, niedziele i \u015bwi\u0119ta'
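The garbled value from the question, u'Soboty, niedziele i \xb6wi\xeata', is consistent with the ISO-8859-2 bytes of the page having been decoded as Latin-1 (ISO-8859-1) at some point. The sketch below only illustrates that relationship using the two strings shown in this thread. The variable names are made up, and the round trip is a diagnostic aid, not a claim about what libxml2 does internally.

```python
# -*- coding: utf-8 -*-

garbled = u'Soboty, niedziele i \xb6wi\xeata'        # value reported in the question
expected = u'Soboty, niedziele i \u015bwi\u0119ta'   # value seen in the reply

# Latin-1 maps code points 0-255 to the same byte values, so encoding the
# garbled string as Latin-1 recovers the original bytes of the page.
raw_bytes = garbled.encode('latin-1')

# Decoding those bytes with the declared encoding gives the intended text.
repaired = raw_bytes.decode('iso-8859-2')

print(repaired == expected)  # prints True
```

A round trip like this can repair an already garbled string, but it only hides the symptom; the underlying issue is still which encoding the parser applies.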
> I use this code in a script that runs on Windows with an English locale, and the script has the # -*- coding: utf-8 -*- directive.
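That doesn't matter. The coding directive only affects how string literals in your own source file are read, not how lxml parses the page.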
> The interesting thing is that the script passes the comparison uc == header on http://www.pkm.jaworzno.pl/rozklady/rozklad.php?kat=302_20100628&nr=*13*&kier=1 but fails on http://www.pkm.jaworzno.pl/rozklady/rozklad.php?kat=302_20100628&nr=*14*&kier=1.
Same result for both on my side.
> Needless to say, the content I'm trying to get (Soboty, niedziele i święta) is byte-for-byte identical on both pages, as is the declared encoding, and both pages render correctly in a web browser.
Note that "renders correctly in a web browser" is not a good indicator of a page being valid HTML. Browsers are extremely good at coping with broken HTML. But once a page is broken, there is no such thing as "correct" behaviour.
> Can anybody help me with this?
>
> OS: Windows XP (English) 32 bit
> Python: 2.6.5
> lxml.etree: (2, 2, 0, 0)
> libxml used: (2, 7, 2)
> libxml compiled: (2, 7, 2)
> libxslt used: (1, 1, 24)
> libxslt compiled: (1, 1, 24)
I'm using lxml 2.3alpha1 and libxml2 2.7.6. The libxml2 version may make a difference here. Try the lxml 2.2.4 binaries for Windows; I think they use a newer lib version.

Stefan
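For comparing installations, the version tuples quoted in this thread can be printed from Python. This is a small sketch using the version constants exposed by `lxml.etree`; the dictionary and its labels are only illustrative and simply mirror the wording used in the question.

```python
from lxml import etree

# Version tuples exposed by lxml.etree; the labels mirror the list in the question.
versions = {
    'lxml.etree': etree.LXML_VERSION,
    'libxml used': etree.LIBXML_VERSION,
    'libxml compiled': etree.LIBXML_COMPILED_VERSION,
    'libxslt used': etree.LIBXSLT_VERSION,
    'libxslt compiled': etree.LIBXSLT_COMPILED_VERSION,
}

for name in sorted(versions):
    print('%s: %s' % (name, versions[name]))
```

Running this on both machines shows whether the libxml2 version actually differs between the setup that fails and the one that works.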
participants (2): Piotr Owcarz, Stefan Behnel