-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Stefan Behnel wrote:
Sidnei da Silva wrote:
I get one test failure with lxml 1.3.2, doesn't look too bad. Maybe it has something to do with the libxml2 version?
======================================================================
FAIL: test_module_HTML_unicode (lxml.tests.test_htmlparser.HtmlParserTestCaseBase)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "c:\Python24\lib\unittest.py", line 260, in run
    testMethod()
  File "C:\src\lxml-build\lxml-1.3.2\src\lxml\tests\test_htmlparser.py", line 33, in test_module_HTML_unicode
    unicode(self.uhtml_str.encode('UTF8'), 'UTF8'))
  File "c:\Python24\lib\unittest.py", line 333, in failUnlessEqual
    raise self.failureException, \
AssertionError: u'<html><head><title>test \xc3\x83\xc2\xa1\xef\xa3\x92</title></head><body><h1>page \xc3\x83\xc2\xa1\xef\xa3\x92 title</h1></body></html>' != u'<html><head><title>test \xc3\xa1\uf8d2</title></head><body><h1>page \xc3\xa1\uf8d2 title</h1></body></html>'
Hmmm, didn't I take that test out? :)
Erik Swanson reported the same problem on OS X. I guess that makes parsing HTML from a unicode string pretty much a Unix-only thing, though maybe it's actually a UCS4-only thing instead. No idea how to fix that (or what actually goes wrong here).
It seems like the problem only arises on UCS-2 systems. Could anyone with a UCS-2 Linux system check whether this also fails there? A UCS-2 build can be detected by "sys.maxunicode" being 65535 (I think). UCS-4 systems say 1114111 here. I heard rumours that Red Hat systems have UCS-2 builds. Ubuntu definitely doesn't.
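For anyone who wants to check their own interpreter, here is a minimal sketch of that detection. The function name unicode_build is made up for illustration, and it only relies on the two sys.maxunicode values mentioned above:

```python
import sys


def unicode_build(maxunicode=sys.maxunicode):
    """Classify a Python build by its sys.maxunicode value.

    The value is a parameter so both cases can be tried on any interpreter
    (Python 3.3 and later always report 1114111).
    """
    if maxunicode == 0xFFFF:      # 65535: narrow (UCS-2) build
        return "UCS-2"
    if maxunicode == 0x10FFFF:    # 1114111: wide (UCS-4) build
        return "UCS-4"
    return "unknown"


print(unicode_build())
```

Calling unicode_build(65535) or unicode_build(1114111) directly shows what each kind of build would report.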
The test case itself is pretty simple:
  import lxml.etree as et
  html = et.HTML(u'<html><body>\xc3\xa1\uf8d2</body></html>')
  print repr(et.tounicode(html))
  u'<html><body>\xc3\xa1\uf8d2</body></html>'
To see that the actual problem is the parser, not the serialiser, you can do:
  print repr(et.tostring(html, 'utf-8'))
  '<html><body>\xc3\x83\xc2\xa1\xef\xa3\x92</body></html>'
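One observation that might help narrow it down: in the failing assertion from the traceback, the unexpected unicode string seems to have, as its code points, exactly the bytes of the UTF-8 encoding of the expected string. A minimal sketch of that check follows. It uses only the standard library (no lxml), the variable names are just for illustration, and it is written so that it also runs on newer Pythons:

```python
# Expected text from the test: U+00C3, U+00A1, U+F8D2.
expected = u'\xc3\xa1\uf8d2'
# Text that appears on the failing side of the assertion.
observed = u'\xc3\x83\xc2\xa1\xef\xa3\x92'

# Encode the expected text as UTF-8 ...
utf8_bytes = expected.encode('utf-8')
# ... then decode as Latin-1, which maps each byte to the code point
# with the same numeric value.
reinterpreted = utf8_bytes.decode('latin-1')

print(repr(utf8_bytes))
print(reinterpreted == observed)
```

That would be consistent with UTF-8 bytes being read back one byte per character somewhere along the way, though it doesn't say where that happens.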
Hoping for feedback and ideas,
Stefan
I have lxml installed in both UCS4 and UCS2 versions of python2.4 on my
Ubuntu laptop::

  $ cat et_test.py
  import sys
  print sys.version
  print sys.maxunicode
  import lxml.etree as et
  html = et.HTML(u'<html><body>\xc3\xa1\uf8d2</body></html>')
  print repr(et.tounicode(html))

  $ /path/to/ucs4/bin/python et_test.py
  2.4.3 (#2, Oct 6 2006, 07:52:30)
  [GCC 4.0.3 (Ubuntu 4.0.3-1ubuntu5)]
  1114111
  u'<html><body>\xc3\xa1\uf8d2</body></html>'

  [/home/tseaver] $ /path/to/ucs2/bin/python et_test.py
  2.4.4 (#1, Apr 19 2007, 16:14:47)
  [GCC 4.1.2 (Ubuntu 4.1.2-0ubuntu4)]
  65535
  u'<html><body>\xc3\xa1\uf8d2</body></html>'

Tres.
- --
===================================================================
Tres Seaver          +1 540-429-0999          tseaver@palladion.com
Palladion Software   "Excellence by Design"    http://palladion.com
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.6 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQFGllxz+gerLs4ltQ4RAjZ/AJ9Pvf4WBX1cZywNmaePspGyFiD/TQCfTGIO
mPMPYd0dfCk/uCVyRJpmAu4=
=Y4mN
-----END PGP SIGNATURE-----