Mailman 3 Unicode compatibility - capi-sig

Unicode compatibility

Daniel Stutzbach

May 21, 2010

9:34 a.m.

I'm working on http://bugs.python.org/issue8654 and I'd like to get some feedback from extension-writers, since it will impact them.

Synopsis of the problem:

If you try to load an extension module that uses any of Python's Unicode functions, and was compiled by a Python with the opposite Unicode setting (UCS2 vs UCS4), then you get an ugly "undefined symbol" error from the linker.

For Python 3, __repr__ must return a Unicode object, which means that almost all extensions will need to call some Unicode functions. It's basically fruitless to upload a binary egg for Python 3 to PyPI, since it will generate link errors for a large fraction of downloaders (as I discovered the hard way).

Proposed solution:

By default, extensions will compile in a "Unicode-agnostic" mode, where Py_UNICODE is an incomplete type. The extension's code can pass Py_UNICODE pointers back and forth between Python API functions, but it cannot dereference them nor use sizeof(Py_UNICODE). Unicode-agnostic modules will load and run in both UCS2 and UCS4 interpreters. Most extensions fall into this category.

If a module needs to dereference Py_UNICODE, it can define PY_REAL_PY_UNICODE before including Python.h to make Py_UNICODE a complete type. Attempting to load such a module into a mismatched interpreter will cause an ImportError (instead of an ugly linker error). If an extension uses PY_REAL_PY_UNICODE in any .c file, it must also use it in the .c file that calls PyModule_Create to ensure the Unicode width is stored in the module's information.

I have two questions for the greater community:

Do you have any fundamental concerns with this design?
Would you prefer the default be reversed? I.e., that Py_UNICODE be a complete type by default, and an extension must have a #define to compile in Unicode-agnostic mode?

Daniel Stutzbach, Ph.D. President, Stutzbach Enterprises, LLC <http://stutzbachenterprises.com>

Stefan Behnel

May 2010

12:51 p.m.

Daniel Stutzbach, 21.05.2010 16:34:

...

Well known problem, yes.

...

This is a pretty bad default for Cython code. Starting with version 0.13, Cython will try to infer Py_UNICODE for single character unicode strings and use that whenever possible, e.g. when for-looping over unicode strings and during character comparisons. Making Py_UNICODE an incomplete type will render this impossible.

...

So that would be an option that all Cython modules (or at least those that use Py_UNICODE and/or single unicode characters somewhere) would use automatically. Not much to win here.

...

Cython modules should normally be self-contained, but it will not be 100% sure that a module that wraps C code using Py_UNICODE will also use Py_UNICODE somewhere, so that Cython could enable that option automatically. Cython would therefore be forced to enable the option for basically all code that calls into C code.

...

Absolutely. IMHO, the only platform that always requires binaries due to incomplete operating system support for source distributions is MS Windows, where Py_UNICODE equals wchar_t anyway. In some cases, MacOS-X is broken enough to require binary releases, too, but the normal target on that platform is the system Python, which has a universal setting for the Py_UNICODE size as well.

So the only remaining platforms that suffer from binary incompatibility problems here are Linux and Unix systems, where the Py_UNICODE size differs between installations and distributions. Given that these systems are best targeted with a source distribution, it sounds like a bad default to complicate the usage of Py_UNICODE for everyone, unless users explicitly disable this behaviour. It's much better to provide this as an option for extension writers who really want (or need) to provide portable binary distributions for whatever reason.

Personally, I think the drawbacks totally outweigh the single advantage, though, so I could absolutely live without this change. It's easy enough to drop the linkage error message into a web search engine.

Stefan

Robert Bradshaw

11:18 a.m.

On May 23, 2010, at 10:51 AM, Stefan Behnel wrote:

...

I would (unsurprisingly) be against this change as well, given the
reasons listed above, but would like to suggest some alternatives.
First, is there a way to easily get the runtime size of Py_UNICODE?
Then the module could be sure to raise an error itself when there's a
mismatch before doing anything dangerous. A potentially better
alternative would be to record the UCS2/UCS4 distinction as part of
the binary specification/name, with support for choosing the right one
added into the package management infrastructures. Of course this will
double the number of binaries, but that's just a reflection of the
choice to make UCS2/UCS4 a binary-incompatible compile-time decision.

Robert
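
Robert's first suggestion, a module checking the interpreter's Unicode width before doing anything dangerous, can be approximated from Python code. The sketch below is only an illustration: it assumes the width can be inferred from `sys.maxunicode` (0xFFFF on a UCS2 build, 0x10FFFF on a UCS4 build), and the function names `interpreter_unicode_width` and `check_unicode_width` are hypothetical, not part of any Python API or of the patch under discussion.

```python
import sys


def interpreter_unicode_width(maxunicode=None):
    """Return the Py_UNICODE width in bytes implied by sys.maxunicode.

    A UCS2 (narrow) build reports 0xFFFF; a UCS4 (wide) build reports
    0x10FFFF. Interpreters from Python 3.3 onward always report 0x10FFFF.
    """
    if maxunicode is None:
        maxunicode = sys.maxunicode
    return 2 if maxunicode == 0xFFFF else 4


def check_unicode_width(built_width, maxunicode=None):
    """Raise ImportError if a module built for `built_width` bytes per
    code unit is about to run on an interpreter with a different width.

    Hypothetical helper: `built_width` stands in for whatever value the
    extension recorded at compile time.
    """
    actual = interpreter_unicode_width(maxunicode)
    if built_width != actual:
        raise ImportError(
            "module was built for UCS%d but this interpreter uses UCS%d"
            % (built_width, actual)
        )
```

A module built on a UCS4 interpreter would call `check_unicode_width(4)` early in its import path, so a mismatch surfaces as a readable ImportError instead of a linker message.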

Daniel Stutzbach

12:09 p.m.

Robert, Stefan, thank you for your feedback.

How about the following variation, which I believe will address your concerns:

By default, Py_UNICODE will be a fully-specified type. In a nutshell, the default will behave just like Python 2 or 3.1, except that trying to load a mismatched module will raise an ImportError with a more helpful error message (much friendlier to novice programmers). Cython would continue to use this mode.

Extension authors who want a Unicode-agnostic build can specify an option in their setup.py that will instruct distutils to pass a -D_Py_UNICODE_AGNOSTIC compiler flag to ensure that all of their .c files are built in Unicode-independent mode. That way, the whole extension is compiled in the same mode.

It would indeed be great if package managers included the Unicode setting as part of the platform type. PJE's proposed implementation of that feature (http://bit.ly/1bO62) would allow eggs to specify UCS2, UCS4, or "Don't Care". My patch greatly increases the number of eggs that could label themselves "Don't Care", reducing maintenance work for package maintainers who like to distribute binary eggs [1]. In other words, they are complementary solutions.

[1] A quick Google search of PyPI reveals many packages offering Linux binary eggs.
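
The three-way labelling described above (UCS2, UCS4, or "Don't Care") implies a simple matching rule on the installer side. The following is a minimal sketch of that rule, not PJE's implementation: the function name `egg_is_loadable` and the tag strings `"ucs2"` and `"ucs4"` are assumptions made for illustration, with `None` standing in for "Don't Care".

```python
def egg_is_loadable(egg_unicode, interpreter_unicode):
    """Decide whether a binary egg may be installed on an interpreter.

    egg_unicode: "ucs2", "ucs4", or None for a Unicode-agnostic egg
                 (the "Don't Care" label). Hypothetical tag values.
    interpreter_unicode: "ucs2" or "ucs4" for the running interpreter.
    """
    if egg_unicode is None:
        # A Unicode-agnostic egg loads on either kind of interpreter.
        return True
    return egg_unicode == interpreter_unicode
```

Under this rule an agnostic egg needs only one binary per platform, while an egg that dereferences Py_UNICODE still needs one binary per Unicode setting.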

M.-A. Lemburg

5:45 a.m.

Daniel Stutzbach wrote:

...

That would be our (eGenix) preferred implementation variant as well.

Building Unicode agnostic extensions should be a feature that the extension writers turn on explicitly, rather than being the default that has to be turned off.

However, rather than using a distutils option to enable the agnostic mode, I would presume that extension writers simply write:

    #define _Py_UNICODE_AGNOSTIC 1
    #include "Python.h"

in their code and then add

    [build_ext]
    unicode-agnostic=1

to their setup.cfg.

...

Rather than waiting for package managers to include support for this (I've been trying to get some awareness for this problem for years, without much success), it's probably better to just fix distutils to include a UCS2/UCS4 marker in the platform string.
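
A UCS2/UCS4 marker in the platform string could look roughly like the sketch below. This is an assumption-laden illustration rather than a distutils patch: `sysconfig.get_platform()` and `sys.maxunicode` are real, but the function name `platform_with_unicode_tag` and the `-ucs2` / `-ucs4` suffix format are made up here.

```python
import sys
import sysconfig


def platform_with_unicode_tag(platform=None, maxunicode=None):
    """Return a platform string with a UCS2/UCS4 marker appended.

    The suffix format is hypothetical. Both arguments default to the
    values of the running interpreter.
    """
    if platform is None:
        platform = sysconfig.get_platform()
    if maxunicode is None:
        maxunicode = sys.maxunicode
    # 0xFFFF identifies a narrow (UCS2) build; anything else is wide.
    tag = "ucs2" if maxunicode == 0xFFFF else "ucs4"
    return "%s-%s" % (platform, tag)
```

Binary distributions named with such a string could then coexist for both Unicode settings on the same platform.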

...

-- Marc-Andre Lemburg eGenix.com

Professional Python Services directly from the Source (#1, May 26 2010)

Daniel Stutzbach

8:57 a.m.

On Wed, May 26, 2010 at 5:45 AM, M.-A. Lemburg <mal@egenix.com> wrote:

...

I think I was much too vague when I said "distutils option". I fear that I implied a command-line option, which is not at all what I intended. I was picturing that the module author would include something like the following in their setup.py:

Extension("foo", ["foo.c"], unicode_agnostic=True)

which would arrange to add _Py_UNICODE_AGNOSTIC to their define_macros. The module author would not (and should not) define the macro themselves at the top of a .c file. By enabling it in setup.py, we guarantee that it will be defined when compiling all of the module's .c files or not at all.
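
Since `unicode_agnostic=True` is a proposed keyword and not an existing `Extension` argument, the same effect can be sketched with the existing `define_macros` mechanism. The helper below is a hypothetical illustration (`extension_kwargs` is not a distutils function); it only builds the keyword arguments, so the macro is applied to every source file of the extension or to none.

```python
def extension_kwargs(name, sources, unicode_agnostic=False, define_macros=None):
    """Build keyword arguments for an Extension(...) call.

    Hypothetical helper: when unicode_agnostic is true, the proposed
    _Py_UNICODE_AGNOSTIC macro is added to define_macros, which the
    build passes to the compiler for all of the extension's .c files.
    """
    macros = list(define_macros or [])
    if unicode_agnostic:
        macros.append(("_Py_UNICODE_AGNOSTIC", "1"))
    return {"name": name, "sources": list(sources), "define_macros": macros}


# Intended use in a setup.py (not executed here):
#   from setuptools import Extension
#   ext = Extension(**extension_kwargs("foo", ["foo.c"], unicode_agnostic=True))
```

Keeping the switch in setup.py rather than in a .c file is what guarantees a consistent mode across the whole extension.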

...

In principle, I agree. I don't personally have enough familiarity with the innards of distutils to feel comfortable writing a patch that alters the platform string.
