Hi Tres,

thanks for testing.

Tres Seaver wrote:
Stefan Behnel wrote:
It seems like the problem only arises on UCS-2 systems. Could anyone with a UCS-2 Linux system check if this also fails there? UCS-2 can be detected with "sys.maxunicode" being 65535 (I think). UCS-4 systems say 1114111 here. I heard rumours that Redhat systems have UCS-2 builds. Ubuntu definitely doesn't.
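As a rough sketch of the check described above, the build type can be read off "sys.maxunicode". The helper name `unicode_build_name` is made up for illustration. Note that Python 3.3 and later always report 1114111, so the distinction only shows up on older interpreters.

```python
import sys


def unicode_build_name(maxunicode):
    """Classify a sys.maxunicode value (hypothetical helper, not part of lxml)."""
    if maxunicode == 0xFFFF:      # 65535: narrow build, 16-bit code units
        return "UCS-2"
    if maxunicode == 0x10FFFF:    # 1114111: wide build, full code point range
        return "UCS-4"
    return "unknown"


# Report the value for the interpreter that runs this script.
print("%d %s" % (sys.maxunicode, unicode_build_name(sys.maxunicode)))
```

Running this with each interpreter in question should be enough to tell which kind of build it is.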
The test case itself is pretty simple:
    import lxml.etree as et
    html = et.HTML(u'<html><body>\xc3\xa1\uf8d2</body></html>')
    print repr(et.tounicode(html))

    u'<html><body>\xc3\xa1\uf8d2</body></html>'
To see that the actual problem is the parser, not the serialiser, you can do:
    print repr(et.tostring(html, 'utf-8'))

    '<html><body>\xc3\x83\xc2\xa1\xef\xa3\x92</body></html>'
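The byte string above is what a correct parse should serialise to. This can be checked without lxml by encoding the three test characters directly. This is only a sketch; the variable names are made up, and the `b'...'` literal needs Python 2.6 or later (or 3.3 or later for the `u'...'` literal).

```python
# The three characters from the test case: U+00C3, U+00A1, U+F8D2.
text = u'\xc3\xa1\uf8d2'

# The body bytes taken from the tostring() output in the working case.
expected = b'\xc3\x83\xc2\xa1\xef\xa3\x92'

encoded = text.encode('utf-8')
print(repr(encoded))

# True means the serialised bytes are exactly the UTF-8 form of the input.
print(encoded == expected)
```

On a setup where the parser mangles the input, comparing the tostring() output against `expected` in the same way should show where the two diverge.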
I have lxml installed in both UCS4 and UCS2 versions of python2.4 on my Ubuntu laptop::
    $ cat et_test.py
    import sys
    print sys.version
    print sys.maxunicode
    import lxml.etree as et
    html = et.HTML(u'<html><body>\xc3\xa1\uf8d2</body></html>')
    print repr(et.tounicode(html))

    $ /path/to/ucs4/bin/python et_test.py
    2.4.3 (#2, Oct 6 2006, 07:52:30)
    [GCC 4.0.3 (Ubuntu 4.0.3-1ubuntu5)]
    1114111
    u'<html><body>\xc3\xa1\uf8d2</body></html>'

    [/home/tseaver]
    $ /path/to/ucs2/bin/python et_test.py
    2.4.4 (#1, Apr 19 2007, 16:14:47)
    [GCC 4.1.2 (Ubuntu 4.1.2-0ubuntu4)]
    65535
    u'<html><body>\xc3\xa1\uf8d2</body></html>'
Hmmm, that leaves me hoping that my test case actually touched the problem. Could we get feedback from someone with a non-working setup here?

So far, we have the following cases:

- it fails on MacOS-X (Intel) with a UCS-2 little endian Python
- it fails on Windows with a UCS-2 little endian Python
- it works on Linux/Intel with UCS-2 little endian
- it works on Linux/Intel with UCS-4 little endian
- it works on Solaris/Sparc with UCS-2 big endian

I can't really see a pattern there...

Stefan