[lxml-dev] PyUnicodeUCS4_FromEncodedObject - problem

Hello, I get the error 'undefined symbol: PyUnicodeUCS4_FromEncodedObject' when I install lxml using easy_install. I saw that this problem was discussed on this list last month. I scanned the mails addressing the issue, but I could not find a solution. How can I test whether my python installation (Python 2.3.5) is compiled with 2 bit unicode? Regards, Benno

Hi Benno, Luthiger Stoll Benno wrote:
We do not provide eggs for Python installations that use 16 bit unicode (UCS2). The solution is therefore to compile lxml yourself. I assume you're on Linux, so that's not too much of an effort. http://codespeak.net/lxml/build.html
How can I test whether my python installation (Python 2.3.5) is compiled with 2 bit unicode?
Ah, 2 bit unicode? No, that's pretty unlikely... ;) Stefan

Luthiger Stoll Benno wrote:
A straightforward compile of Python will use 2-byte unicode, not 4-byte. Unfortunately most Linux distributions ship with a 4-byte unicode version of Python, and distutils/setuptools cannot distinguish between 4-byte and 2-byte unicode builds yet. We've passed this problem (which goes beyond lxml) along to the setuptools developers, and they say "patches welcome". :) Regards, Martijn

Hey, [Benno]
How can I test whether my python installation (Python 2.3.5) is compiled with 2 bit unicode?
In order to write our patch to fix distutils/setuptools, we actually need an answer to Benno's question. Is there a straightforward way to find this out, in Python code? A brief glance through 'sys' didn't lead to an answer. A quick google likewise hasn't led to anything so far. Perhaps we need to resort to devious unicode string manipulation that behaves differently depending on the number of bytes your Python is compiled with for unicode representation. Or we could try asking Fredrik Lundh :). Regards, Martijn
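One possible way to answer the question in Python code is a minimal sketch built on `sys.maxunicode`, which reports the largest code point the interpreter's unicode type can hold: 0xFFFF on a 2-byte (UCS2) build and 0x10FFFF on a 4-byte (UCS4) build. The helper name `unicode_build` is hypothetical and not part of lxml or setuptools. Note that Python 3.3 and later always report 0x10FFFF, so the distinction only matters on older interpreters.

```python
import sys


def unicode_build(maxunicode=None):
    """Classify an interpreter as 'UCS2' (2-byte) or 'UCS4' (4-byte).

    Hypothetical helper for illustration; pass a value explicitly to
    classify a number taken from another interpreter.
    """
    if maxunicode is None:
        maxunicode = sys.maxunicode
    # 2-byte builds cannot represent code points above the BMP,
    # so they report 0xFFFF; 4-byte builds report 0x10FFFF.
    if maxunicode > 0xFFFF:
        return "UCS4"
    return "UCS2"


print(unicode_build())
```

An extension compiled against one build references symbols such as `PyUnicodeUCS4_FromEncodedObject` that the other build does not export, which is why a check of this kind would have to run before choosing a binary egg.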

Fredrik Lundh wrote:
*) "BMP plus sixteen supplemental planes should be enough for anybody"
Thanks for the info! I don't know what BMP is, and I only have a vague idea of the planes (I'll read the Wikipedia article :), but using 4 bytes to store something that could be stored in less than 3 seems like a waste. :) Oh well, I imagine machines can deal better with 4 bytes, especially if they're 64 bits. Anyway, we'll see whether we can come up with a patch that convinces distutils to distinguish between the two. Regards, Martijn
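The arithmetic behind the "BMP plus sixteen supplemental planes" remark can be sketched as follows, assuming the standard Unicode layout of 0x10000 code points per plane. The variable names are illustrative only, and `int.bit_length()` needs Python 2.7 or 3.1 and later.

```python
PLANES = 1 + 16                  # the BMP plus sixteen supplementary planes
CODE_POINTS_PER_PLANE = 0x10000  # 65536 code points in each plane

total_code_points = PLANES * CODE_POINTS_PER_PLANE
highest_code_point = total_code_points - 1

# Bits required to represent the highest code point, rounded up to bytes.
bits_needed = highest_code_point.bit_length()
bytes_needed = (bits_needed + 7) // 8

print(hex(highest_code_point), bits_needed, bytes_needed)
```

The highest code point is 0x10FFFF, which needs 21 bits and therefore fits in 3 bytes; a 4-byte representation rounds that up to a size machines handle more naturally.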

Martijn Faassen wrote:
start here: http://en.wikipedia.org/wiki/Mapping_of_Unicode_characters </F>
