Hi guys,
I am trying to parse HTML encoded in 'iso-8859-2' and extract specific
content with XPath. The content I get from XPath is normally a Python
unicode object, but in this case it does not contain proper Unicode
code points. Instead it holds the raw 'iso-8859-2' byte values, as if
the text had never been decoded and was put into the unicode object as is.
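To make the symptom concrete, here is a minimal sketch. It assumes the bytes are copied one-to-one into the unicode object, which is the same as decoding them as latin-1. The sample string is made up for illustration and is not taken from the page:

```python
# -*- coding: utf-8 -*-
# Hypothetical sample text (three Polish letters), not from the real page.
original = u'\u0105\u0119\u0142'

# The bytes as a server would send them in iso-8859-2.
raw = original.encode('iso-8859-2')

# What I seem to get: every byte value becomes a code point unchanged.
# Decoding as latin-1 reproduces that effect (assumption, see above).
mangled = raw.decode('latin-1')

# What I expected to get instead.
expected = raw.decode('iso-8859-2')

# Possible workaround under the same assumption: turn the code points
# back into bytes, then decode them with the right charset.
repaired = mangled.encode('latin-1').decode('iso-8859-2')
```

If that assumption holds, repaired equals the original text, while mangled has the same length but the wrong characters.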
Take this URL as an example: 'http://www.pkm.jaworzno.pl/rozklady/rozklad.php?kat=302_20100628&nr=14&kier=1',
and run the following in the interpreter:
>>> from lxml import html
>>> import urllib2
>>> root = html.parse(urllib2.urlopen('http://www.pkm.jaworzno.pl/rozklady/rozklad.php?kat=302_20100628&nr=14&kier=1'))
>>> root.docinfo.encoding
'iso-8859-2'
>>> header = root.xpath('/html/body/center/center[1]/table/tr/td/table')[3].text_content().strip()