[lxml-dev] Encoding bug in lxml.etree.HTML
Hello!

I've probably discovered a bug in lxml.etree.HTML:

    >>> from lxml import etree
    >>> a = u'<html><body><p>\u044b</p></body></html>'
    >>> b = etree.HTML(a)
    >>> b[0][0].text
    u'\xd1\x8b'

Expected: u'\u044b'

It seems that etree.HTML function works with non-ascii symbols incorrectly. I can reproduce it on Windows.

This bug is relatively new: it happens with lxml with statically linked libxml2 version 2.6.28 and libxslt2 version 1.1.19 (current version of lxml-1.2.1 from Cheese Shop and the newer releases). Older versions of lxml (lxml-1.2.1 with libxml2 version 2.6.26 and libxslt2 version 1.1.17, which are no longer available from Cheese Shop, or older releases such as lxml-1.2) work fine.

-- 
Best regards,
Alexander
mailto:alexander.kozlovsky@gmail.com
Hi,

Alexander Kozlovsky wrote:
I've probably discovered a bug in lxml.etree.HTML:
    >>> from lxml import etree
    >>> a = u'<html><body><p>\u044b</p></body></html>'
    >>> b = etree.HTML(a)
    >>> b[0][0].text
    u'\xd1\x8b'
Expected: u'\u044b'
It seems that etree.HTML function works with non-ascii symbols incorrectly. I can reproduce it on Windows.
Thanks for the extensive report. This is actually a bug that was fixed two days ago, so there isn't a release containing the fix yet. It will go away in lxml 1.3.3.

Stefan
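A note on the value in the report: u'\xd1\x8b' is exactly what you get when the UTF-8 bytes of u'\u044b' are decoded one byte per character, as Latin-1 would do. That is consistent with the parser misreading the encoding of the input, though the thread does not confirm the cause. The following is a minimal sketch of that reading, written for Python 3 (the session above is Python 2) and using only the standard library; the helper name undo_double_decoding is hypothetical and is not part of lxml.

```python
# The character from the report: CYRILLIC SMALL LETTER YERU.
expected = "\u044b"

# Its UTF-8 encoding is the two bytes 0xD1 0x8B.
utf8_bytes = expected.encode("utf-8")

# Decoding those bytes as Latin-1 maps each byte to one code point,
# which yields the two-character string shown in the report.
observed = utf8_bytes.decode("latin-1")


def undo_double_decoding(text):
    """Hypothetical helper (not an lxml API): reverse a UTF-8-read-as-Latin-1
    mix-up. Only valid if the text really was damaged in this way."""
    return text.encode("latin-1").decode("utf-8")


repaired = undo_double_decoding(observed)
```

Such a round trip only repairs text that was damaged in exactly this way; it is an illustration of the symptom, not a substitute for the fixed release.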
Participants (2): Alexander Kozlovsky, Stefan Behnel