[python-ldap] Python3 Status Question

Wed May 28 15:51:06 CEST 2014

On 05/27/2014 05:53 PM, Raphaël Barrois wrote:
> On Tue, 27 May 2014 17:41:24 -0400
> David Stanek <dstanek at dstanek.com> wrote:
> 
>> On Tue, May 27, 2014 at 5:35 PM, Raphaël Barrois
>> <raphael.barrois at m4x.org> wrote:
>>
>>> On Tue, 27 May 2014 12:26:58 -0400
>>> David Stanek <dstanek at dstanek.com> wrote:
>>>
>>>> Hello All,
>>>>
>>>> I'm in the process of getting the OpenStack Keystone project to
>>>> work on Python 3. Part of this work is to get all of our
>>>> dependencies to also work against Python 3. It looks like there
>>>> have been several attempts for python-ldap, with the most
>>>> promising being the one recently announced[1] on this list.
>>>>
>>>> This patchset is working so far, but I have quite a bit more
>>>> testing to do including more changes to Keystone so that I can
>>>> actually run full tests. What is the process for getting these
>>>> patches accepted?
>>>>
>>>> Thanks!
>>>>
>>>> 1.
>>>> https://mail.python.org/pipermail/python-ldap/2014q1/003348.html
>>>>
>>>
>>
>>
>>> - Decide that switching to Python3 should be an occasion for
>>>   significant API redesign, and release that version under a new
>>> name
>>
>>
>> I'm curious about this option. Is an API redesign on the table because
>> of Py3 changes (str->bytes, etc.) or is it just an opportunity to
>> make a change?
>>
>>
>>
> 
> This seemed to be an option when I scrubbed the archives before
> starting my Py3 version, see for instance
> https://mail.python.org/pipermail/python-ldap/2012q2/003115.html.
> 
> Regarding my Py3 fork, the goal is to provide a consistent API for both
> Py2 and Py3, where code using python-ldap can be used the same in both
> versions.
> This is not always easy, especially in places where objects
> are returned as bytes where the RFC states they are actually UTF-8
> encoded text; in such situations, we'll have to decide whether we go
> for backwards-compatibility or for future-proof design.

The fact that the python-ldap API cannot accept unicode input containing
non-ASCII characters and only returns UTF-8 encoded data is one of the
most difficult aspects of using python-ldap. In most cases it requires
writing a new API for a Python application whose sole responsibility is
to act as an encode/decode wrapper insulating the application code from
python-ldap.

I believe the UTF-8 encode/decode should be the responsibility of the
LDAP binding (i.e. python-ldap) instead of the application. This would
be ideal because the application could just use unicode for its text
strings (as it should) and be able to simply call python-ldap without
concern.

However, we do have history to consider. Here is my suggestion.

1) Incoming parameters are checked to see if they are bytes or unicode.
If a parameter is bytes, assume it's already UTF-8 encoded. Otherwise,
encode the unicode to UTF-8. This is relatively easy to code in the C
extension.

2) Introduce a flag to control UTF-8 decoding for output values. In
Python 2 the flag defaults to False to be consistent with existing
behavior. In Python 3 it defaults to True (so applications can just use
strings in a sane manner consistent with Python 3 strings).
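
A minimal sketch of the two steps above, in pure Python rather than in
the C binding. The names to_utf8, from_utf8 and DECODE_OUTPUT_DEFAULT
are hypothetical and are not part of python-ldap; the sketch also
ignores attributes that hold binary data, which must never be decoded.

```python
import sys

# Hypothetical default for step 2: decode output only on Python 3.
DECODE_OUTPUT_DEFAULT = sys.version_info[0] >= 3


def to_utf8(value):
    """Step 1: pass bytes through unchanged, encode text to UTF-8."""
    if isinstance(value, bytes):
        return value  # assumed to be UTF-8 encoded already
    return value.encode("utf-8")


def from_utf8(value, decode=DECODE_OUTPUT_DEFAULT):
    """Step 2: decode UTF-8 output only when the flag is set."""
    if decode and isinstance(value, bytes):
        return value.decode("utf-8")
    return value
```

With something like this inside the binding, a caller could pass either
bytes or unicode, and legacy callers that expect bytes back would set
the flag to False.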

This way legacy code will continue to run correctly, Python 2 code could
avoid an encode/decode wrapper if it chose to, and Python 3 code could
just use strings without concern but would have the option to fall back
to the previous behavior. It's the best of both worlds. Having written a
fair amount of code which uses python-ldap, I'd love to get out of the
encode/decode game. It's the source of a lot of problems, usually not
discovered until the code is in the field and non-English users start
supplying values, at which point the code needs a fair amount of
refactoring to start using encode/decode wrappers.

P.S.: One also has to be careful to distinguish between the portions of
python-ldap which are written in pure Python and those which are written
in C against the CPython API. Currently the pure Python code behaves
differently: you can pass unicode and usually the right thing happens
because the interpreter recognizes it as unicode, instead of it being
subjected to the default encoding applied by PyArg_ParseTuple*() in the
C binding (when the parameter is marked as a string).
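
To illustrate the difference, here is a rough sketch of what an implicit
conversion through a default encoding amounts to. Python 2's default
encoding is normally ASCII; encode_with_default is a hypothetical helper
written for this example, not a python-ldap or CPython function.

```python
def encode_with_default(text, default_encoding="ascii"):
    """Mimic an implicit text-to-bytes conversion via a default encoding."""
    return text.encode(default_encoding)


name = u"Rapha\u00ebl"

try:
    encode_with_default(name)
    failed = False
except UnicodeEncodeError:
    # Non-ASCII input cannot pass through an ASCII default encoding.
    failed = True

# An explicit UTF-8 encode of the same value succeeds.
utf8_name = name.encode("utf-8")
```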

-- 
John