[lxml-dev] html.xpath returns undecoded unicode string

Hi guys, I am trying to parse HTML encoded in 'iso-8859-2' and to extract specific content with XPath. The content I get back from xpath is usually a Python unicode object, but in this case it does not contain unicode code points; it contains characters encoded in 'iso-8859-2', as if the data had never been decoded and was put into the unicode object as-is. Let's take this URL as an example: 'http://www.pkm.jaworzno.pl/rozklady/rozklad.php?kat=302_20100628&nr=14&kier=1', and do something on the command line:
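(The original listing did not survive in the archive. A minimal sketch of the kind of session described, assuming the variable names header and uc from the text below and borrowing the XPath expression from the reply further down the thread:)

    import urllib2
    from lxml import html

    url = ('http://www.pkm.jaworzno.pl/rozklady/rozklad.php'
           '?kat=302_20100628&nr=14&kier=1')

    # Parse the page; lxml should pick up the declared iso-8859-2 encoding.
    root = html.parse(urllib2.urlopen(url))

    # Text extracted via XPath -- expected to come back as decoded unicode.
    header = root.xpath(
        '/html/body/center/center[1]/table/tr/td/table')[3].text_content().strip()

    # The expected unicode representation of the same string.
    uc = u'Soboty, niedziele i \u015bwi\u0119ta'

    print header == uc  # reported as False for nr=14, True for nr=13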
I expect the header and uc variables to be equal, but they are not, even though uc is the actual unicode representation of my string. I use this code in a script, run it on Windows with an English locale, and the script has a # -*- coding: utf-8 -*- directive. The interesting thing is that the script passes the comparison uc == header for http://www.pkm.jaworzno.pl/rozklady/rozklad.php?kat=302_20100628&nr=13&kier=1 but fails for http://www.pkm.jaworzno.pl/rozklady/rozklad.php?kat=302_20100628&nr=14&kier=1. Needless to say, the content I am trying to get (Soboty, niedziele i święta) is byte-for-byte identical on both pages, as is the declared encoding, and both pages render correctly in a web browser. Can anybody help me with this?

OS: Windows XP (English) 32 bit
Python: 2.6.5
lxml.etree: (2, 2, 0, 0)
libxml used: (2, 7, 2)
libxml compiled: (2, 7, 2)
libxslt used: (1, 1, 24)
libxslt compiled: (1, 1, 24)

Regards
Piotr

Piotr Owcarz, 30.06.2010 16:44:
Note that the problem at hand is unrelated to XPath. Only the parser has an impact on the encodings.
Seems to work for me:

In [1]: from lxml import html

In [2]: root = html.parse('http://www.pkm.jaworzno.pl/rozklady/rozklad.php?kat=302_20100628&nr=14&kier=1')

In [3]: root.docinfo.encoding
Out[3]: 'iso-8859-2'

In [4]: root.xpath('/html/body/center/center[1]/table/tr/td/table')[3].text_content().strip()
Out[4]: u'Soboty, niedziele i \u015bwi\u0119ta'

Even when I use urllib2, I get

In [14]: root = html.parse(urllib2.urlopen('http://www.pkm.jaworzno.pl/rozklady/rozklad.php?kat=302_20100628&nr=14&kier=1'))

In [15]: header = root.xpath('/html/body/center/center[1]/table/tr/td/table')[3].text_content().strip()

In [16]: header
Out[16]: u'Soboty, niedziele i \u015bwi\u0119ta'
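(For what it's worth, when a parser does mis-detect the encoding, one general workaround, not suggested in the thread itself, is to force the declared encoding on the parser explicitly; a rough sketch:)

    import urllib2
    from lxml import html

    url = ('http://www.pkm.jaworzno.pl/rozklady/rozklad.php'
           '?kat=302_20100628&nr=14&kier=1')

    # Override encoding auto-detection with the encoding the page declares.
    parser = html.HTMLParser(encoding='iso-8859-2')
    root = html.fromstring(urllib2.urlopen(url).read(), parser=parser)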
I use this code in a script and run it on Windows with english locale and the script has # -*- coding: utf-8 -*- directive.
That doesn't matter.
Same result for both on my side.
Note that "renders correctly in a web browser" is not a good indicator for a page being valid HTML. Browsers are extremely advanced when dealing with broken HTML. But once a page is broken, there is no such thing as "correct" behaviour.
I'm using lxml 2.3alpha1 and libxml2 2.7.6. The libxml2 version may make a difference here. Try the lxml 2.2.4 binaries for Windows, I think they use a newer lib version.

Stefan
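(As an aside, the library versions a given lxml build was compiled against and is running with can be checked directly; a generic sketch, not part of the original message:)

    from lxml import etree

    # Version tuples exposed by lxml.etree; useful when comparing builds.
    print etree.LXML_VERSION
    print etree.LIBXML_VERSION           # libxml2 version in use at runtime
    print etree.LIBXML_COMPILED_VERSION  # libxml2 version lxml was built against
    print etree.LIBXSLT_VERSION
    print etree.LIBXSLT_COMPILED_VERSION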

participants (2)
- Piotr Owcarz
- Stefan Behnel