Hi guys, I am trying to parse HTML encoded in 'iso-8859-2' and use XPath to extract a specific piece of content. What I normally get back from XPath is a Python unicode object, but in this case it does not contain the proper Unicode code points. Instead it contains the raw 'iso-8859-2' byte values, as if the text had never been decoded and had simply been put into a unicode object as is. Take for example this URL: 'http://www.pkm.jaworzno.pl/rozklady/rozklad.php?kat=302_20100628&nr=14&kier=1', and run the following at the interactive prompt:

>>> from lxml import html
>>> import urllib2
>>> root = html.parse(urllib2.urlopen('http://www.pkm.jaworzno.pl/rozklady/rozklad.php?kat=302_20100628&nr=14&kier=1'))
>>> root.docinfo.encoding
'iso-8859-2'
>>> header = root.xpath('/html/body/center/center[1]/table/tr/td/table')[3].text_content().strip()
>>> header
u'Soboty, niedziele i \xb6wi\xeata'
>>> uc = u'Soboty, niedziele i święta'
>>> uc
u'Soboty, niedziele i \u015bwi\u0119ta'
>>> uc == header
False

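As a minimal sketch of what I mean by "never decoded", the snippet below needs no network access and no lxml. It takes the two strings from the session above and checks that the mangled one turns into the expected one when its code points are treated as raw bytes and decoded as 'iso-8859-2'. The name `redecode` is a hypothetical helper made up for this illustration; it is not an lxml or standard-library function.

```python
# -*- coding: utf-8 -*-

def redecode(text, declared_encoding):
    """Hypothetical helper: treat each code point below 256 as a raw byte
    (which is what encoding to Latin-1 does), then decode those bytes
    with the encoding the page declares."""
    return text.encode('latin-1').decode(declared_encoding)

# The value returned by xpath in the session above.
header = u'Soboty, niedziele i \xb6wi\xeata'
# The value I expected to get.
uc = u'Soboty, niedziele i \u015bwi\u0119ta'

repaired = redecode(header, 'iso-8859-2')

print(repaired == uc)   # True
print(header == uc)     # False
```

This only shows that the string I get is consistent with the page bytes having been read as Latin-1 rather than as 'iso-8859-2'. It does not tell me where that happens or why it affects one page and not the other.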
I expect the header and uc variables to be equal, but they are not, even though uc is the correct unicode representation of my string. I use this code in a script that runs on Windows with an English locale, and the script has the # -*- coding: utf-8 -*- directive.

The interesting thing is that the comparison uc == header succeeds for
http://www.pkm.jaworzno.pl/rozklady/rozklad.php?kat=302_20100628&nr=13&kier=1
but fails for
http://www.pkm.jaworzno.pl/rozklady/rozklad.php?kat=302_20100628&nr=14&kier=1
The two URLs differ only in the nr parameter (13 versus 14). The content I am trying to get (Soboty, niedziele i święta) is byte-for-byte identical on both pages, as is the declared encoding, and both pages render correctly in a web browser.

Can anybody help me with this?

OS: Windows XP (English) 32 bit
Python: 2.6.5
lxml.etree: (2, 2, 0, 0)
libxml used: (2, 7, 2)
libxml compiled: (2, 7, 2)
libxslt used: (1, 1, 24)
libxslt compiled: (1, 1, 24)

Regards
Piotr