Mailman 3 [lxml-dev] html.xpath returns not decoded unicode string - lxml - The Python XML Toolkit

June 30, 2010

Hi guys,

I am trying to parse HTML encoded in 'iso-8859-2' and use XPath to extract
specific content. The content I usually get from XPath is a Python unicode
object, but in this case it does not contain Unicode code points. Instead it
holds the raw 'iso-8859-2' byte values, as if the text had never been decoded
and had been put into the unicode object as is.
For example, take this URL:
http://www.pkm.jaworzno.pl/rozklady/rozklad.php?kat=302_20100628&nr=14&kier=1
and run the following at the Python prompt:
...
...
...
>>> from lxml import html
>>> import urllib2
>>> root = html.parse(urllib2.urlopen('http://www.pkm.jaworzno.pl/rozklady/rozklad.php?kat=302_20100628&nr=14&kier=1'))
>>> root.docinfo.encoding
'iso-8859-2'
>>> header = root.xpath('/html/body/center/center[1]/table/tr/td/table')[3].text_content().strip()
>>> header
u'Soboty, niedziele i \xb6wi\xeata'
>>> uc = u'Soboty, niedziele i święta'
>>> uc
u'Soboty, niedziele i \u015bwi\u0119ta'
>>> uc == header
False
I expect the header and uc variables to be equal, but they are not, even
though uc is the correct unicode representation of my string.
I use this code in a script that I run on Windows with an English locale,
and the script has the # -*- coding: utf-8 -*- directive.
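
One way to check the "never decoded" hypothesis without lxml or network access is to map the suspicious unicode string back to bytes and decode those bytes as 'iso-8859-2'. The sketch below is a diagnostic only, not a fix and not anything lxml does itself. The name `repaired` is introduced here for illustration. It relies on 'latin-1' mapping code points 0 to 255 onto the identical byte values.

```python
# -*- coding: utf-8 -*-
# The two values shown in the session above.
header = u'Soboty, niedziele i \xb6wi\xeata'      # what XPath returned
uc = u'Soboty, niedziele i \u015bwi\u0119ta'      # the expected text

# 'latin-1' turns each code point below 256 into the same byte value,
# so this recovers the bytes that were (apparently) never decoded.
raw_bytes = header.encode('latin-1')

# Decode those bytes with the encoding the document declares.
repaired = raw_bytes.decode('iso-8859-2')

print(repaired == uc)
```

If the comparison holds, the characters in `header` are the page's 'iso-8859-2' bytes read as if they were Latin-1, which matches the symptom described.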
Interestingly, the script passes the comparison uc == header on
http://www.pkm.jaworzno.pl/rozklady/rozklad.php?kat=302_20100628&nr=*13*&kier=1
but fails on
http://www.pkm.jaworzno.pl/rozklady/rozklad.php?kat=302_20100628&nr=*14*&kier=1.
The content I am trying to get (Soboty, niedziele i święta) is
byte-for-byte identical on both pages, as is the declared encoding, and both
pages render correctly in a web browser.
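
The claim that the phrase is the same at the byte level on both pages can be checked on the undecoded response bodies, independently of any parser. The sketch below assumes a hypothetical helper `contains_phrase`, and `page_13` and `page_14` are short stand-ins rather than the real page contents. In practice they would be the bytes returned by `urllib2.urlopen(url).read()`.

```python
# -*- coding: utf-8 -*-
def contains_phrase(raw_page, phrase, encoding):
    """Return True if `phrase`, encoded with `encoding`, occurs in the raw bytes."""
    return phrase.encode(encoding) in raw_page

phrase = u'Soboty, niedziele i \u015bwi\u0119ta'

# Hypothetical stand-ins for the undecoded bodies of the two pages.
page_13 = b'<td>Soboty, niedziele i \xb6wi\xeata</td>'
page_14 = b'<td>Soboty, niedziele i \xb6wi\xeata</td>'

for name, raw in (('nr=13', page_13), ('nr=14', page_14)):
    print(name, contains_phrase(raw, phrase, 'iso-8859-2'))
```

If both checks succeed on the real pages, the difference in behaviour has to come from something else in the documents or from how the parser detects the encoding, not from the phrase itself.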
Can anybody help me with this?

OS: Windows XP (English), 32-bit
Python: 2.6.5
lxml.etree:        (2, 2, 0, 0)
libxml used:       (2, 7, 2)
libxml compiled:   (2, 7, 2)
libxslt used:      (1, 1, 24)
libxslt compiled:  (1, 1, 24)

Regards
Piotr

Piotr Owcarz
