[lxml-dev] lxml eggs and unicode strings

Hi there,
I just found out that there is a hidden incompatibility in the compiled versions of lxml eggs we provide, at least in linux. Our provided versions are compiled with a Python that has 4 bytes unicode support (probably the default on ubuntu on which I built the 2.4 extension).
If you try to install such an egg on a machine where unicode support is compiled with 2 bytes only, it'll fail with errors such as:
ImportError: /usr/local/lib/python2.4/site-packages/lxml-1.0.2-py2.4-linux-i686.egg/lxml/etree.so: undefined symbol: PyUnicodeUCS4_FromEncodedObject
I wonder whether there's anything within the egg distribution mechanism that lets us distinguish between such platforms. If not, I wonder what to do instead -- the simplest would be to add a FAQ entry and tell people to recompile from the sources.
By the way, does Pyrex generate different C code depending on whether 4 or 2 byte unicode is used? If so, then that would mean an installation of pyrex as well for these people...
Regards,
Martijn

On Jul 31, 2006, at 11:41 AM, Martijn Faassen wrote:
Hi there,
I just found out that there is a hidden incompatibility in the compiled versions of lxml eggs we provide, at least in linux. Our provided versions are compiled with a Python that has 4 bytes unicode support (probably the default on ubuntu on which I built the 2.4 extension).
Noticed that last week, too. Sorry I forgot to mention it over there.
If you try to install such an egg on a machine where unicode support is compiled with 2 bytes only, it'll fail with errors such as:
ImportError: /usr/local/lib/python2.4/site-packages/lxml-1.0.2-py2.4-linux-i686.egg/lxml/etree.so: undefined symbol: PyUnicodeUCS4_FromEncodedObject
I wonder whether there's anything within the egg distribution mechanism that lets us distinguish between such platforms. If not, I wonder what to do instead -- the simplest would be to add a FAQ entry and tell people to recompile from the sources.
As far as I know, this is typical of the Ubuntu distribution, and I'm 100% sure this egg was laid from Ubuntu. If the egg system could make a difference between distributions, it would be ok, imho.
Charset problems are a plague.
By the way, does Pyrex generate different C code depending on whether 4 or 2 byte unicode is used? If so, then that would mean an installation of pyrex as well for these people...
I tried to compile from source on Mandriva, and it failed. I had no time to investigate (low priority for the task I was working on), it could very well have been something very trivial.
Yours,
--------- Georges Racinet Nuxeo SAS gracinet@nuxeo.com http://nuxeo.com Tel: +33 (0) 1 40 33 71 73

Georges Racinet wrote:
On Jul 31, 2006, at 11:41 AM, Martijn Faassen wrote:
Hi there,
I just found out that there is a hidden incompatibility in the compiled versions of lxml eggs we provide, at least in linux. Our provided versions are compiled with a Python that has 4 bytes unicode support (probably the default on ubuntu on which I built the 2.4 extension).
Noticed that last week, too. Sorry I forgot to mention it over there.
What platform were you on when you noticed this? Mandriva (as you mention below)?
[snip]
As far as I know, this is typical of the Ubuntu distribution, and I'm 100% sure this egg was laid from Ubuntu. If the egg system could make a difference between distributions, it would be ok, imho.
I think Red Hat has been compiling Python with 4 bytes characters for ages too, so while this was Ubuntu (I did it), I'm also pretty sure it's also the case on Fedora.
Charset problems are a plague.
This is not your common charset problem. Mostly one can avoid the plague by just using unicode, but that's what we're doing here.
By the way, does Pyrex generate different C code depending on whether 4 or 2 byte unicode is used? If so, then that would mean an installation of pyrex as well for these people...
I tried to compile from source on Mandriva, and it failed. I had no time to investigate (low priority for the task I was working on), it could very well have been something very trivial.
Interesting; let us know if you find out more.
It's important to have the lxml C sources compile on all platforms, as otherwise people will be forced to use Pyrex, possibly even the forked version of Pyrex Stephan is maintaining.
Regards,
Martijn

Hi Martijn,
Martijn Faassen wrote:
I just found out that there is a hidden incompatibility in the compiled versions of lxml eggs we provide, at least in linux. Our provided versions are compiled with a Python that has 4 bytes unicode support (probably the default on ubuntu on which I built the 2.4 extension).
AFAIK, UCS4 is the default on most (though maybe not all) Python desktop/server installations under Linux, including SuSE, Red Hat and (apparently) Debian/Ubuntu. Distributors tend to care more about broad support for all possible use cases than about memory requirements.
If you try to install such an egg on a machine where unicode support is compiled with 2 bytes only, it'll fail with errors such as:
ImportError: /usr/local/lib/python2.4/site-packages/lxml-1.0.2-py2.4-linux-i686.egg/lxml/etree.so: undefined symbol: PyUnicodeUCS4_FromEncodedObject
Sure. These cannot be compatible in current CPython (and that's highly unlikely to change).
I wonder whether there's anything within the egg distribution mechanism that lets us distinguish between such platforms. If not, I wonder what to do instead -- the simplest would be to add a FAQ entry and tell people to recompile from the sources.
I wouldn't know any way egg naming could help here. Google yields some discussions about this topic on the distutils list, but it seems they have not made their way into either distutils or setuptools.
http://mail.python.org/pipermail/distutils-sig/2005-October/005222.html
Anyway, if you have to recompile your Python version to get UCS2 strings, there's no reason not to require the same for the C extensions.
Given the fact that all major distributions seem to use UCS4, a FAQ entry should be enough.
By the way, does Pyrex generate different C code depending on whether 4 or 2 byte unicode is used? If so, then that would mean an installation of pyrex as well for these people...
No, the distinction between different unicode encodings is handled completely inside the Python interpreter. The C code is not affected and Pyrex does not rely on it.
To support parsing from unicode, lxml even has generic run-time support code to detect the internal unicode encoding, which should work for any encoding supported by libxml2/libiconv.
Stefan
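
The ImportError quoted above names the mismatch directly: in CPython 2.x, the header unicodeobject.h renames the public PyUnicode_* functions to PyUnicodeUCS2_* or PyUnicodeUCS4_* depending on the build, so the undefined symbol reveals which width the extension was compiled for. A minimal sketch, assuming only that naming convention (the helper name extension_unicode_width is hypothetical and not part of lxml or setuptools):

```python
import re


def extension_unicode_width(error_message):
    """Return 'ucs2' or 'ucs4' if the message names a PyUnicodeUCS* symbol, else None.

    Hypothetical helper for illustration only.
    """
    match = re.search(r"PyUnicode(UCS[24])_\w+", error_message)
    if match is None:
        return None
    return match.group(1).lower()


# The error text reported at the start of this thread.
message = ("/usr/local/lib/python2.4/site-packages/"
           "lxml-1.0.2-py2.4-linux-i686.egg/lxml/etree.so: "
           "undefined symbol: PyUnicodeUCS4_FromEncodedObject")
print(extension_unicode_width(message))  # prints: ucs4
```

If the reported width differs from that of the running interpreter, the egg was built against the other kind of Python and lxml has to be rebuilt from source.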

Stefan Behnel wrote: [snip]
Anyway, if you have to recompile your Python version to get UCS2 strings, there's no reason not to require the same for the C extensions.
Ah, so current CPython sources build with 4 byte unicode by default? If this is for sure, then we're fairly safe. If not, then I wonder what to do - you'd like lxml to work with hand-compiled Pythons.
Given the fact that all major distributions seem to use UCS4, a FAQ entry should be enough.
It definitely is encouraging.
By the way, does Pyrex generate different C code depending on whether 4 or 2 byte unicode is used? If so, then that would mean an installation of pyrex as well for these people...
No, the distinction between different unicode encodings is handled completely inside the Python interpreter. The C code is not affected and Pyrex does not rely on it.
Good, that's what I was hoping for. That at least means people should be able to recompile without installing Pyrex first.
To support parsing from unicode, lxml even has generic run-time support code to detect the internal unicode encoding, which should work for any encoding supported by libxml2/libiconv.
Cool!
Regards,
Martijn

Martijn Faassen wrote:
Ah, so current CPython sources build with 4 byte unicode by default? If this is for sure, then we're fairly safe. If not, then I wonder what to do - you'd like lxml to work with hand-compiled Pythons.
Nope. The distros all pass '--enable-unicode=ucs4' to configure. The default value for that option is 'yes', which maps to 'ucs2' unless you also have a ucs4-enabled Tcl.
Tres.
--
Tres Seaver +1 202-558-7113 tseaver@palladion.com Palladion Software "Excellence by Design" http://palladion.com

Tres Seaver wrote:
Martijn Faassen wrote:
Ah, so current CPython sources build with 4 byte unicode by default? If this is for sure, then we're fairly safe. If not, then I wonder what to do - you'd like lxml to work with hand-compiled Pythons.
Nope. The distros all pass '--enable-unicode=ucs4' to configure. The default value for that option is 'yes', which maps to 'ucs2' unless you also have a ucs4-enabled Tcl.
Tres.
Perhaps we could use the following test inside 'setup.py', and modify the name of the binary egg to include the 'ucs2' vs. 'ucs4' flag?::
ucs_flag = sys.maxunicode > 65536 and 'ucs4' or 'ucs2'
Tres.
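
The one-liner above relies on sys.maxunicode, which is 65535 (0xFFFF) on a narrow (UCS2) build and 1114111 (0x10FFFF) on a wide (UCS4) build. A slightly more explicit sketch of the same test follows. The function name ucs_flag and its optional argument are illustrative assumptions, and how the flag would be fed into the egg name is left open, since the thread does not settle that. On Python 3.3 and later sys.maxunicode is always 0x10FFFF, so the distinction only matters for older interpreters.

```python
import sys


def ucs_flag(maxunicode=None):
    """Return 'ucs4' for a wide unicode build and 'ucs2' for a narrow one.

    Hypothetical helper; 'maxunicode' defaults to the running interpreter's
    sys.maxunicode and can be overridden to check a known value.
    """
    if maxunicode is None:
        maxunicode = sys.maxunicode
    # Narrow builds report 0xFFFF; wide builds report 0x10FFFF.
    if maxunicode > 0xFFFF:
        return 'ucs4'
    return 'ucs2'


# Report the flag for the interpreter running this script.
print(ucs_flag())
```

Running this with the target Python tells a user whether the prebuilt UCS4 eggs can work there or whether a build from source is needed.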

Tres Seaver wrote:
Tres Seaver wrote:
Martijn Faassen wrote:
Ah, so current CPython sources build with 4 byte unicode by default? If this is for sure, then we're fairly safe. If not, then I wonder what to do - you'd like lxml to work with hand-compiled Pythons.
Nope. The distros all pass '--enable-unicode=ucs4' to configure. The default value for that option is 'yes', which maps to 'ucs2' unless you also have a ucs4-enabled Tcl.
Right, that's what I witness, too.
Perhaps we could use the following test inside 'setup.py', and modify the name of the binary egg to include the 'ucs2' vs. 'ucs4' flag?::
ucs_flag = sys.maxunicode > 65536 and 'ucs4' or 'ucs2'
While that's nice to have, it doesn't really help us as a) we'd still have to build and ship both eggs (while the current UCS4 eggs seem to fit most users) and b) easy_install doesn't currently handle these extensions, so it would most likely just stop finding the eggs on cheeseshop if we added additional sections to the egg name.
I still think it's enough to add a FAQ entry (which I already did) and otherwise ignore the problem for now. That way, the major distros are supported out-of-the-box. And for those who happen to use a UCS2 system, it's really not a big deal to build lxml from sources on a fairly recent and well installed Linux system.
Stefan

Stefan Behnel wrote:
Tres Seaver wrote:
[snip]
Perhaps we could use the following test inside 'setup.py', and modify the name of the binary egg to include the 'ucs2' vs. 'ucs4' flag?::
ucs_flag = sys.maxunicode > 65536 and 'ucs4' or 'ucs2'
While that's nice to have, it doesn't really help us as a) we'd still have to build and ship both eggs (while the current UCS4 eggs seem to fit most users)
There'd be a significant number of people who just build Python by hand though, and they can't use our eggs...
[snip]
I still think it's enough to add a FAQ entry (which I already did) and otherwise ignore the problem for now. That way, the major distros are supported out-of-the-box. And for those who happen to use a UCS2 system, it's really not a big deal to build lxml from sources on a fairly recent and well installed Linux system.
I agree that's all we can do on the lxml side.
Apart from that, we can also talk to the distutils/setuptools people and raise this issue again. It's a fundamental problem with binary eggs that use unicode as long as Python ships with this configuration option. I'll send off a mail on this to the distutils SIG.
Regards,
Martijn

Martijn Faassen wrote:
Apart from that, we can also talk to the distutils/setuptools people and raise this issue again. It's a fundamental problem with binary eggs that use unicode as long as Python ships with this configuration option. I'll send off a mail on this to the distutils SIG.
Good idea. Thanks for taking care of it.
It may no longer fit into the 2.5 time frame, but it's still a problem that needs to be solved some time...
Stefan
participants (4): Georges Racinet, Martijn Faassen, Stefan Behnel, Tres Seaver