[lxml-dev] Test Failures in lxml 1.3.2
I get one test failure with lxml 1.3.2, doesn't look too bad. Maybe it has something to do with the libxml2 version?

    ======================================================================
    FAIL: test_module_HTML_unicode (lxml.tests.test_htmlparser.HtmlParserTestCaseBase)
    ----------------------------------------------------------------------
    Traceback (most recent call last):
      File "c:\Python24\lib\unittest.py", line 260, in run
        testMethod()
      File "C:\src\lxml-build\lxml-1.3.2\src\lxml\tests\test_htmlparser.py", line 33, in test_module_HTML_unicode
        unicode(self.uhtml_str.encode('UTF8'), 'UTF8'))
      File "c:\Python24\lib\unittest.py", line 333, in failUnlessEqual
        raise self.failureException, \
    AssertionError: u'<html><head><title>test \xc3\x83\xc2\xa1\xef\xa3\x92</title></head><body><h1>page \xc3\x83\xc2\xa1\xef\xa3\x92 title</h1></body></html>' != u'<html><head><title>test \xc3\xa1\uf8d2</title></head><body><h1>page \xc3\xa1\uf8d2 title</h1></body></html>'
    ----------------------------------------------------------------------

--
Sidnei da Silva
Enfold Systems http://enfoldsystems.com
Fax +1 832 201 8856 Office +1 713 942 2377 Ext 214
Versions used, FWIW:

    TESTED VERSION:
        Python:             (2, 5, 0, 'final', 0)
        lxml.etree:         (1, 3, 2, 0)
        libxml used:        (2, 6, 28)
        libxml compiled:    (2, 6, 28)
        libxslt used:       (1, 1, 19)
        libxslt compiled:   (1, 1, 19)

-- Sidnei da Silva
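The two values in the AssertionError above are related in a regular way: the first one looks like the UTF-8 bytes of the expected text, with each byte shown as a separate character. The sketch below checks that relationship. It is written for Python 3 (the thread itself uses Python 2 syntax), the variable names are illustrative, and the Latin-1 step is only an assumption about how the bytes may have been misread. It says nothing about where in lxml or libxml2 this happens.

```python
# The non-ASCII part of the test string used in this thread.
expected = u'\xc3\xa1\uf8d2'

# Its UTF-8 encoding.
utf8_bytes = expected.encode('utf-8')

# Assumption: treating every UTF-8 byte as one character (Latin-1)
# models the mix-up that yields the first value in the AssertionError.
garbled = utf8_bytes.decode('latin-1')

print(repr(utf8_bytes))
print(garbled == u'\xc3\x83\xc2\xa1\xef\xa3\x92')
```

If the comparison prints True, the failing value is consistent with UTF-8 output being read back one byte per character.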
Sidnei da Silva wrote:
I get one test failure with lxml 1.3.2, doesn't look too bad. Maybe it has something to do with the libxml2 version?
Hmmm, didn't I take that test out? :)

Erik Swanson reported the same problem on OS-X. I guess that makes parsing HTML from a unicode string pretty much a Unix-only thing, though maybe it's actually rather a UCS4-only thing. No idea how to fix that (or what actually goes wrong here).

It seems like the problem only arises on UCS-2 systems. Could anyone with a UCS-2 Linux system check if this also fails there? UCS-2 can be detected with "sys.maxunicode" being 65535 (I think). UCS-4 systems say 1114111 here. I heard rumours that Redhat systems have UCS-2 builds. Ubuntu definitely doesn't.

The test case itself is pretty simple:
    >>> import lxml.etree as et
    >>> html = et.HTML(u'<html><body>\xc3\xa1\uf8d2</body></html>')
    >>> print repr(et.tounicode(html))
    u'<html><body>\xc3\xa1\uf8d2</body></html>'
To see that the actual problem is the parser, not the serialiser, you can do:
    >>> print repr(et.tostring(html, 'utf-8'))
    '<html><body>\xc3\x83\xc2\xa1\xef\xa3\x92</body></html>'
Hoping for feedback and ideas, Stefan
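For anyone who wants to report their build type, the sys.maxunicode check described above can be wrapped in a small helper. This is only a sketch: the function name unicode_build is made up for illustration, and the code is written for Python 3 syntax. Note that Python 3.3 and later always report 1114111, so the distinction is only meaningful on older interpreters such as the Python 2.4 and 2.5 builds discussed here.

```python
import sys

def unicode_build(maxunicode=None):
    """Classify a CPython build by its sys.maxunicode value.

    Hypothetical helper, not part of lxml or the standard library.
    """
    if maxunicode is None:
        maxunicode = sys.maxunicode
    if maxunicode == 0xFFFF:        # 65535
        return "UCS-2 (narrow build)"
    if maxunicode == 0x10FFFF:      # 1114111
        return "UCS-4 (wide build)"
    return "unknown"

print(sys.maxunicode, unicode_build())
```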
Hi,
It seems like the problem only arises on UCS-2 systems. Could anyone with a UCS-2 Linux system check if this also fails there? UCS-2 can be detected with "sys.maxunicode" being 65535 (I think). UCS-4 systems say 1114111 here. I
Runs without failures on Solaris with sys.maxunicode==65535:

    0 lb54320@adevp02 .../lxml-1.3 $ python2.4 -c "import sys; print sys.maxunicode"
    65535
    0 lb54320@adevp02 .../lxml-1.3 $ make test PYTHON=python2.4
    python2.4 setup.py build_ext -i
    Building lxml version 1.3.3-44945
    /apps/prod/lib/python2.4/distutils/dist.py:236: UserWarning: Unknown distribution option: 'zip_safe'
      warnings.warn(msg)
    running build_ext
    python2.4 test.py -p -v
    TESTED VERSION:
        Python:             (2, 4, 4, 'final', 0)
        lxml.etree:         (1, 3, 3, 44945)
        libxml used:        (2, 6, 27)
        libxml compiled:    (2, 6, 27)
        libxslt used:       (1, 1, 20)
        libxslt compiled:   (1, 1, 20)
    607/607 (100.0%): Doctest: xpathxslt.txt
    ----------------------------------------------------------------------
    Ran 607 tests in 1.310s

    OK
    PYTHONPATH=src python2.4 selftest.py
    126 tests ok.
    PYTHONPATH=src python2.4 selftest2.py
    88 tests ok.
    0 lb54320@adevp02 .../lxml-1.3 $

Note that I ran from 1.3 branch, not 1.3.2 release (found no 1.3.2 tag in the repository), so maybe the offending test has been disabled already (?)

Holger
jholg@gmx.de wrote:
It seems like the problem only arises on UCS-2 systems. Could anyone with a UCS-2 Linux system check if this also fails there? UCS-2 can be detected with "sys.maxunicode" being 65535 (I think). UCS-4 systems say 1114111 here. I
Runs without failures on Solaris with sys.maxunicode==65535:
Thanks for testing. Just to be sure, Sun Solaris machines are big endian, right? Intel is little endian, so Solaris actually uses a different byte encoding here. So I think we can restrict the problem to UCS-2 little endian. Any non-Windows, non-Mac-OS-X testers for that one? Or maybe any Mac-OS PPC testers?
Note that I ran from 1.3 branch, not 1.3.2 release (found no 1.3.2 tag in the repository)
Thanks for reminding me. It's there now.
so maybe the offending test has been disabled already (?)
No, it's still in there. Stefan
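To make the byte-order point concrete, here is a small sketch showing how one 16-bit character from the test string is laid out in little-endian and big-endian UTF-16. It assumes, as the discussion above does, that a narrow Python build keeps its strings as 16-bit units in the machine's native byte order. The code uses Python 3 syntax and the variable names are illustrative.

```python
import sys

ch = u'\uf8d2'                   # private-use character from the test string
as_le = ch.encode('utf-16-le')   # low byte first, as on Intel
as_be = ch.encode('utf-16-be')   # high byte first, as on Sparc

# Pick the layout that matches the machine running this script.
native = as_le if sys.byteorder == 'little' else as_be

print(sys.byteorder, repr(as_le), repr(as_be), repr(native))
```

The same two bytes appear in opposite order, so a parser handed the raw buffer has to be told (or has to detect) which layout it is looking at.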
Hi Stefan,
Runs without failures on Solaris with sys.maxunicode==65535:
Thanks for testing. Just to be sure, Sun Solaris machines are big endian, right? Intel is little endian, so Solaris actually uses a different byte encoding here.
So I think we can restrict the problem to UCS-2 little endian. Any non-Windows, non-Mac-OS-X testers for that one?
Right, and I've been a bit sloppy: Sun Sparc is a big endian architecture, whereas Intel is little endian, so I should rather have said "Runs without failures on *Sparc* Solaris".

Holger
Stefan Behnel wrote:
It seems like the problem only arises on UCS-2 systems. Could anyone with a UCS-2 Linux system check if this also fails there? UCS-2 can be detected with "sys.maxunicode" being 65535 (I think). UCS-4 systems say 1114111 here. I heard rumours that Redhat systems have UCS-2 builds. Ubuntu definitely doesn't.

I have lxml installed in both UCS4 and UCS2 versions of python2.4 on my Ubuntu laptop::

    $ cat et_test.py
    import sys
    print sys.version
    print sys.maxunicode
    import lxml.etree as et
    html = et.HTML(u'<html><body>\xc3\xa1\uf8d2</body></html>')
    print repr(et.tounicode(html))

    $ /path/to/ucs4/bin/python et_test.py
    2.4.3 (#2, Oct 6 2006, 07:52:30)
    [GCC 4.0.3 (Ubuntu 4.0.3-1ubuntu5)]
    1114111
    u'<html><body>\xc3\xa1\uf8d2</body></html>'

    [/home/tseaver] $ /path/to/ucs2/bin/python et_test.py
    2.4.4 (#1, Apr 19 2007, 16:14:47)
    [GCC 4.1.2 (Ubuntu 4.1.2-0ubuntu4)]
    65535
    u'<html><body>\xc3\xa1\uf8d2</body></html>'

Tres.
--
Tres Seaver +1 540-429-0999 tseaver@palladion.com
Palladion Software "Excellence by Design" http://palladion.com
Hi Tres, thanks for testing. Tres Seaver wrote:
I have lxml installed in both UCS4 and UCS2 versions of python2.4 on my Ubuntu laptop::
    $ cat et_test.py
    import sys
    print sys.version
    print sys.maxunicode
    import lxml.etree as et
    html = et.HTML(u'<html><body>\xc3\xa1\uf8d2</body></html>')
    print repr(et.tounicode(html))

    $ /path/to/ucs4/bin/python et_test.py
    2.4.3 (#2, Oct 6 2006, 07:52:30)
    [GCC 4.0.3 (Ubuntu 4.0.3-1ubuntu5)]
    1114111
    u'<html><body>\xc3\xa1\uf8d2</body></html>'

    [/home/tseaver] $ /path/to/ucs2/bin/python et_test.py
    2.4.4 (#1, Apr 19 2007, 16:14:47)
    [GCC 4.1.2 (Ubuntu 4.1.2-0ubuntu4)]
    65535
    u'<html><body>\xc3\xa1\uf8d2</body></html>'
Hmmm, that leaves me hoping that my test case actually touched the problem. Could we get feedback from someone with a non-working setup here?

So far, we have the following cases:

- it fails on MacOS-X (Intel) with a UCS-2 little endian Python
- it fails on Windows with a UCS-2 little endian Python
- it works on Linux/Intel with UCS-2 little endian
- it works on Linux/Intel with UCS-4 little endian
- it works on Solaris/Sparc with UCS-2 big endian

I can't really see a pattern there...

Stefan
Stefan Behnel wrote:
- it fails on MacOS-X (Intel) with a UCS-2 little endian Python
- it fails on Windows with a UCS-2 little endian Python
- it works on Linux/Intel with UCS-2 little endian
- it works on Linux/Intel with UCS-4 little endian
- it works on Solaris/Sparc with UCS-2 big endian
I can't really see a pattern there...
I've heard from a few people who tested (either failing or succeeding) that they have fairly recent libxml2 versions. Also, libxml2 works for me from 2.6.20 through 2.6.29. But what about the iconv version? Is there any difference on the systems that were tested so far? "iconv --version" says 2.5 for me. I assume it's about the same for Tres (who's on Ubuntu also). What about the others? Stefan
Stefan Behnel wrote:
I've heard from a few people who tested (either failing or succeeding) that they have fairly recent libxml2 versions. Also, libxml2 works for me from 2.6.20 through 2.6.29. But what about the iconv version? Is there any difference on the systems that were tested so far? "iconv --version" says 2.5 for me. I assume it's about the same for Tres (who's on Ubuntu also). What about the others?
Sure, that must be it. Erik said he had iconv 1.9 on MacOS-X, the Windows binaries of libxml2 come with a pre-built iconv 1.9.2, but most Linux systems have a more recent version installed (at least the Debian based ones). I sent a mail to Igor Zlatković to update the official libxml2 builds with a newer iconv version. Sidnei, in case it's not too much hassle for you, could you install a more recent iconv version yourself to try this out? Stefan
Hi,
2.6.20 through 2.6.29. But what about the iconv version? Is there any difference on the systems that were tested so far? "iconv --version" says 2.5 for me. I assume it's about the same for Tres (who's on Ubuntu also). What about the others?
libxml2 built without iconv here (Sparc Solaris).

Holger
jholg@gmx.de wrote:
2.6.20 through 2.6.29. But what about the iconv version? Is there any difference on the systems that were tested so far? "iconv --version" says 2.5 for me. I assume it's about the same for Tres (who's on Ubuntu also). What about the others?
libxml2 built without iconv here (Sparc Solaris).
I first thought your comment wasn't relevant as Sparc uses a different encoding already, but then I looked back into the code of libxml2 and found that iconv is not used for detecting the encoding, only for later decoding if libxml2 itself doesn't support the encoding.

So iconv isn't the real problem here, it's rather libxml2 that fails to detect the encoding on some platforms. What we use here is the function xmlDetectCharEncoding() in encoding.c, which (AFAICT) checks for a BOM. Maybe these platforms do not have one in their unicode strings...

Here is a patch that will print out the internal representation of a unicode string when importing etree. Could someone with a Windows or MacOS machine please try this and send me the results?

Stefan
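As an illustration of what BOM-based detection means in general, here is a minimal sketch in Python 3. It is not a port of xmlDetectCharEncoding() and makes no claim about what libxml2 actually checks. The helper name detect_bom and the table are made up for this example; only the BOM constants come from the standard codecs module.

```python
import codecs

# Longer BOMs first: the UTF-32 LE BOM starts with the UTF-16 LE BOM bytes.
_BOMS = [
    (codecs.BOM_UTF32_LE, 'utf-32-le'),
    (codecs.BOM_UTF32_BE, 'utf-32-be'),
    (codecs.BOM_UTF8, 'utf-8'),
    (codecs.BOM_UTF16_LE, 'utf-16-le'),
    (codecs.BOM_UTF16_BE, 'utf-16-be'),
]

def detect_bom(data):
    """Return an encoding name if data starts with a known BOM, else None."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name
    return None

raw = u'<test/>'.encode('utf-16-le')          # raw 16-bit units, no BOM
print(detect_bom(raw))                        # nothing to detect
print(detect_bom(codecs.BOM_UTF16_LE + raw))  # BOM present
```

A detector of this kind has nothing to go on when the buffer is handed over without a BOM, which is the situation speculated about above.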
On 7/13/07, Stefan Behnel <stefan_ml@behnel.de> wrote:
Here is a patch that will print out the internal representation of a unicode string when importing etree.
Could someone with a Windows or MacOS machine please try this and send me the results?
There you have it:

    C:\src\lxml-build\lxml-1.3.2>python test.py -vv
    '<\x00t\x00e\x00s\x00t\x00/\x00>\x00'

-- Sidnei da Silva
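The dumped bytes can be checked directly. The sketch below (Python 3 syntax, illustrative variable names) assumes the patch printed the buffer of a short test string, and tests whether the dump decodes as little-endian UTF-16 and whether it begins with a byte order mark.

```python
import codecs

# Bytes as printed in the report above.
dump = b'<\x00t\x00e\x00s\x00t\x00/\x00>\x00'

# Decode as little-endian UTF-16 without expecting a BOM.
decoded = dump.decode('utf-16-le')

# bytes.startswith accepts a tuple of candidate prefixes.
has_bom = dump.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE))

print(repr(decoded), has_bom)
```

If has_bom comes out False, the buffer carries no marker that a BOM check could pick up.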
Sidnei da Silva wrote:
On 7/13/07, Stefan Behnel <stefan_ml@behnel.de> wrote:
Here is a patch that will print out the internal representation of a unicode string when importing etree.
Could someone with a Windows or MacOS machine please try this and send me the results?
There you have it:
    C:\src\lxml-build\lxml-1.3.2>python test.py -vv
    '<\x00t\x00e\x00s\x00t\x00/\x00>\x00'
Thanks, the attached patch should work around it. Stefan
On 7/13/07, Stefan Behnel <stefan_ml@behnel.de> wrote:
Thanks, the attached patch should work around it.
Nope, still fails the same way.

    ======================================================================
    FAIL: test_module_HTML_unicode (lxml.tests.test_htmlparser.HtmlParserTestCaseBase)
    ----------------------------------------------------------------------
    Traceback (most recent call last):
      File "c:\Python24\lib\unittest.py", line 260, in run
        testMethod()
      File "C:\src\lxml-build\lxml-1.3.2\src\lxml\tests\test_htmlparser.py", line 33, in test_module_HTML_unicode
        unicode(self.uhtml_str.encode('UTF8'), 'UTF8'))
      File "c:\Python24\lib\unittest.py", line 333, in failUnlessEqual
        raise self.failureException, \
    AssertionError: u'<html><head><title>test \xc3\x83\xc2\xa1\xef\xa3\x92</title></head><body><h1>page \xc3\x83\xc2\xa1\xef\xa3\x92 title</h1></body></html>' != u'<html><head><title>test \xc3\xa1\uf8d2</title></head><body><h1>page \xc3\xa1\uf8d2 title</h1></body></html>'

-- Sidnei da Silva
Sidnei da Silva wrote:
On 7/13/07, Stefan Behnel <stefan_ml@behnel.de> wrote:
Thanks, the attached patch should work around it.
Nope, still fails the same way.
Ok, I read in the libxml2 sources a bit more and found that I was actually using iconv alias names for the UTF16 encodings, "UTF16LE" instead of "UTF-16LE". It looks like libxml2 only understands the latter natively, so I switched to using it instead. Could you test the attached patch? Thanks, Stefan
On 7/15/07, Stefan Behnel <stefan_ml@behnel.de> wrote:
Could you test the attached patch?
Tested, didn't make a difference apparently.

    ======================================================================
    FAIL: test_module_HTML_unicode (lxml.tests.test_htmlparser.HtmlParserTestCaseBase)
    ----------------------------------------------------------------------
    Traceback (most recent call last):
      File "c:\Python24\lib\unittest.py", line 260, in run
        testMethod()
      File "C:\src\lxml-build\lxml-1.3.2\src\lxml\tests\test_htmlparser.py", line 33, in test_module_HTML_unicode
        unicode(self.uhtml_str.encode('UTF8'), 'UTF8'))
      File "c:\Python24\lib\unittest.py", line 333, in failUnlessEqual
        raise self.failureException, \
    AssertionError: u'<html><head><title>test \xc3\x83\xc2\xa1\xef\xa3\x92</title></head><body><h1>page \xc3\x83\xc2\xa1\xef\xa3\x92 title</h1></body></html>' != u'<html><head><title>test \xc3\xa1\uf8d2</title></head><body><h1>page \xc3\xa1\uf8d2 title</h1></body></html>'
    ----------------------------------------------------------------------

-- Sidnei da Silva
On 7/13/07, Stefan Behnel <stefan_ml@behnel.de> wrote:
Sure, that must be it. Erik said he had iconv 1.9 on MacOS-X, the Windows binaries of libxml2 come with a pre-built iconv 1.9.2, but most Linux systems have a more recent version installed (at least the Debian based ones).
I sent a mail to Igor Zlatković to update the official libxml2 builds with a newer iconv version.
Sidnei, in case it's not too much hassle for you, could you install a more recent iconv version yourself to try this out?
As soon as Igor makes a new iconv build, sure.

-- Sidnei da Silva
participants (4)
- jholg@gmx.de
- Sidnei da Silva
- Stefan Behnel
- Tres Seaver