
Stefan Behnel wrote: [snip]
Anyway, if you have to recompile your Python version to get UCS2 strings, there's no reason not to require the same for the C extensions.
Ah, so the current CPython sources build with 4-byte unicode by default? If that is certain, then we're fairly safe. If not, then I wonder what to do; you'd like lxml to work with hand-compiled Pythons.
Given the fact that all major distributions seem to use UCS4, a FAQ entry should be enough.
It definitely is encouraging.
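For such a FAQ entry, something along these lines might let people check which kind of build they have. This is only a sketch: the function name unicode_build is made up for illustration, and it relies on sys.maxunicode being 0xFFFF on narrow (UCS2) builds. On Python 3.3 and later the distinction no longer exists and this always reports UCS4.

```python
import sys

def unicode_build(maxunicode=sys.maxunicode):
    """Return 'UCS2' for a narrow build and 'UCS4' for a wide one.

    Hypothetical helper for illustration; not part of lxml or Pyrex.
    """
    # Narrow builds cap code points at 0xFFFF; wide builds go up to 0x10FFFF.
    return "UCS2" if maxunicode == 0xFFFF else "UCS4"

# Report the build of the interpreter running this script.
print(unicode_build())
```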
By the way, does Pyrex generate different C code depending on whether 4-byte or 2-byte unicode is used? If so, then that would mean an installation of Pyrex as well for these people...
No, the distinction between different unicode encodings is handled completely inside the Python interpreter. The C code is not affected and Pyrex does not rely on it.
Good, that's what I was hoping for. That at least means people should be able to recompile without installing Pyrex first.
To support parsing from unicode, lxml even has generic run-time support code to detect the internal unicode encoding, which should work for any encoding supported by libxml2/libiconv.
Cool!
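Just to check that I understand the idea, here is how I would picture the detection at the Python level. This is a rough sketch and not lxml's actual code: the function name internal_encoding_name is hypothetical, and I am assuming that the iconv-style names UTF-16LE/BE and UCS-4LE/BE are what the parser would be told to use.

```python
import sys

def internal_encoding_name(maxunicode=sys.maxunicode, byteorder=sys.byteorder):
    """Guess an iconv-style name for the interpreter's internal unicode layout.

    Hypothetical helper for illustration; lxml's real detection may differ.
    """
    # Narrow builds store 2-byte units, wide builds store 4-byte code points.
    base = "UTF-16" if maxunicode == 0xFFFF else "UCS-4"
    # The in-memory byte order follows the machine's native endianness.
    suffix = "LE" if byteorder == "little" else "BE"
    return base + suffix

print(internal_encoding_name())
```

The arguments default to the running interpreter's values, but they can be passed explicitly to see what a different build would report.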
Regards,
Martijn