[Python-Dev] Pre-PEP: Redesigning extension modules

Sun Sep 1 14:23:02 CEST 2013

On 1 September 2013 18:11, Stefan Behnel <stefan_ml at behnel.de> wrote:
> Nick Coghlan, 01.09.2013 03:28:
>> On 1 Sep 2013 05:18, "Stefan Behnel" wrote:
>>> I can't really remember a case where I could afford the
>>> runtime overhead of implementing a wrapper in Python and going through
>>> something like ctypes or cffi. I mean, testing C libraries with Python
>>> tools would be one, but then, you wouldn't want to write an extension
>>> module for that and would instead want to call it from the test code as
>>> directly as possible.
>>>
>>> I'm certainly aware that that use case exists, though, and also the case
>>> of just wanting to get things done as quickly and easily as possible.
>>
>> Keep in mind I first came to Python as a tool for test automation of custom
>> C++ hardware APIs that could be written to be SWIG friendly.
>
> Interesting again. Would you still do it that way? I recently had a
> discussion with Holger Krekel of py.test fame about testing C code with
> Cython, and we quickly agreed that wrapping the code in an extension module
> was both too cumbersome and too inflexible for testing purposes.
> Specifically, neither of Cython's top selling points fits here, not speed,
> not clarity, not API design. It's most likely different for SWIG, which
> involves less (not no, just less) manual work and gives you API-wise more
> or less exactly what you put in. However, cffi is almost certainly the
> better way to do it, because it gives you all sorts of flexibility for your
> test code without having to think about the wrapper design all the time.
>
> The situation is also different for C++ where you have fewer options for
> wrapping it. I can imagine SWIG still being the tool of choice on that
> front when it comes to bare and direct testing of large code bases.

To directly wrap C++, I'd still use SWIG. It makes a huge difference
when you can tweak the C++ side of the API to be SWIG friendly rather
than having to live with whatever a third party C++ library provides.
Having classes in C++ map directly to classes in Python is the main
benefit of doing it this way over using a C wrapper and cffi.

However, for an existing C API, or a custom API where I didn't need
the direct object mapping that C++ can provide, using cffi would be a
more attractive option than SWIG these days (the stuff I was doing
with SWIG was back around 2003 or so).

I think this is getting a little off topic for the list, though :)

>> I now work for an OS vendor where the 3 common languages for system
>> utilities are C, C++ and Python.
>>
>> For those use cases, dropping a bunch of standard Python objects in a
>> module dict is often going to be a quick and easy solution that avoids a
>> lot of nasty pointer lifecycle issues at the C level.
>
> That's yet another use case, BTW. When you control the whole application,
> then safety doesn't really matter at these points and keeping a bunch of
> stuff in a dict will usually work just fine. I'm mainly used to writing
> libraries for (sometimes tons of) other people, in which case the
> requirements are so diverse on the user side that safety is a top thing to care
> about. Anything you can keep inside of C code should stay there.
> (Especially when dealing with libxml2&friends in lxml which continuously
> present their 'interesting' usability characteristics.)

I don't think it's a coincidence that it was the etree interface with
expat that highlighted the deficiencies of the current extension
module hooks when it comes to working properly with
test.support.import_fresh_module :)

>> * PEP 3121 with a size of "0". As above, but avoids the module state APIs
>> in order to support reloading. All module state (including type
>> cross-references) is stored in hidden state (e.g. an instance of a custom
>> type not exposed to Python, with a reference stored on each custom type
>> object defined in the module, and any module level "functions" actually
>> being methods of a hidden object).
>
> Thanks for elaborating. I had completely failed to make the mental link
> that you could simply stick bound methods as functions into the module
> dict, i.e. that they don't even have to be methods of the module itself.
> That's something that Cython could already use in older CPythons, even as a
> preparation for any future import protocol changes. The object that they
> are methods of would then eventually become the module instance.
>
> You'd still suffer a slight performance hit from going from a static global
> C variable to a pointer indirection - for everything: string constants,
> cached Python objects, all user defined global C variables would have to go
> there as Cython cannot know if they are module instance specific state or
> not (they usually will be, I guess). But that has to be done anyway if the
> goal is to get rid of static state to enable sub-interpreters. I can't wait
> to see lxml run threaded in mod_wsgi... ;-)

To be honest, I didn't realise that such a trick might already be
possible until I was writing down this list of alternatives. If you
manage to turn it into a real solution for lxml (or Cython in
general), it would be great to hear more about how you turned the
general idea into something real :)
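
For anyone trying to picture the "hidden state" trick, here is a rough
pure-Python sketch of the idea. It is only an illustration: a real
extension would do this in C (or in Cython-generated code), and the names
_HiddenState, make_module and bump are hypothetical, not part of any
existing API. The point is just that each module instance gets its own
state object, and the module-level "functions" are really bound methods
of that object.

```python
import types


class _HiddenState:
    """Hypothetical per-module state holder, never exposed by name."""

    def __init__(self):
        self.counter = 0

    def bump(self):
        # Uses state reached through ``self`` rather than a static global.
        self.counter += 1
        return self.counter


def make_module(name):
    """Build a module whose 'functions' are bound methods of hidden state."""
    module = types.ModuleType(name)
    state = _HiddenState()
    # The bound method keeps ``state`` alive; the module dict only ever
    # sees what looks like an ordinary callable.
    module.bump = state.bump
    return module


first = make_module("demo")
second = make_module("demo")

# Each module instance counts independently of the other.
results = (first.bump(), first.bump(), second.bump())
```

Because the state lives on the hidden object rather than in static
globals, creating a second module instance does not disturb the first.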

That means the powers any new extension initialisation API will offer
will be limited to:

* letting the module know its own name (and other details)
* letting the module explicitly block reloading
* letting the module support loading multiple copies at once by taking
the initial import out of sys.modules (but keeping a separate
reference to it alive)
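
The last point can be sketched from the Python side roughly as follows.
This is a minimal illustration using a pure-Python stdlib module
(colorsys); the helper name import_extra_copy is hypothetical, and
whether a given extension module copes with being loaded this way is
exactly the open question in this thread.

```python
import importlib
import sys


def import_extra_copy(name):
    """Return (original, extra): two independent copies of one module."""
    original = importlib.import_module(name)
    # Take the initial import out of sys.modules, but keep a separate
    # reference to it alive so it is not torn down.
    saved = sys.modules.pop(name)
    try:
        # With the cache entry gone, the import system loads a new copy.
        extra = importlib.import_module(name)
    finally:
        # Put the original back so other importers see the same module.
        sys.modules[name] = saved
    return original, extra


original, extra = import_extra_copy("colorsys")
```

Here original and extra are distinct module objects with the same name,
and sys.modules still points at the original afterwards.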

<snip>
>>> As soon as you have more than one extension type in your module, and they
>>> interact with each other, they will almost certainly have to do type checks
>>> against each other to make sure users haven't passed them rubbish before
>>> they access any C struct fields of the object. Doing a type check means
>>> that at least one type has a pointer to the other, meaning that it holds
>>> global module state.
>>
>> Sure, but you can use the CPython API rather than writing normal C code. We
>> do this fairly often in CPython when we're dealing with things stored in
>> modules that can be manipulated from Python.
>>
>> It incurs CPython's dynamic dispatch overhead, but sometimes that's worth
>> it to avoid needing to deal with C level lifecycle issues.
>
> Not so much of a problem in Cython, because all you usually have to do to
> get fast C level access to something is to change a "def" into a "cdef"
> somewhere, or add a decorator, or an assignment to a known extension type
> variable. Once the module global state is 'virtualised', this will also be
> a safe thing to do in the face of multiple module instances, and still be
> much faster than going through Python calls.

It's kinda cool how designing a next generation API can sometimes
reveal hidden possibilities of an *existing* API :)

We had another one of those recently on distutils-sig, when I realised
the much-maligned .pth files are actually a decent solution to
sharing distributions between virtual environments. I'm so used to
disliking their global side effects when used with the system Python
that it took me a long time to recognise the validity of using them to
make implicit path additions in a more controlled virtual environment
:)
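
As a rough illustration of the mechanism (not of any actual packaging
tool), the sketch below uses temporary directories as stand-ins for a
shared distribution directory and a virtual environment's site-packages.
The directory and file names are made up for the example; site.addsitedir
is the stdlib call that processes .pth files in a directory.

```python
import os
import site
import sys
import tempfile

# Stand-in for a directory of distributions shared between environments.
shared_dir = tempfile.mkdtemp(prefix="shared_dists_")
# Stand-in for one virtual environment's site-packages directory.
env_site_dir = tempfile.mkdtemp(prefix="env_site_")

# A .pth file lists one path per line; existing paths get added to sys.path.
pth_path = os.path.join(env_site_dir, "shared.pth")
with open(pth_path, "w", encoding="utf-8") as pth_file:
    pth_file.write(shared_dir + "\n")

# Process the directory the way site-packages is processed at startup.
site.addsitedir(env_site_dir)

shared_on_path = any(
    os.path.abspath(entry) == os.path.abspath(shared_dir) for entry in sys.path
)
```

The path addition is confined to whichever environment contains the
.pth file, which is what makes it more controlled than doing the same
thing in the system Python.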

In terms of where we go from here - do you mind if I use your pre-PEP
as the initial basis for a PEP of my own some time in the next week or
two (listing you as co-author)? Improving extension module
initialisation has been the driver for most of the PEP 451 feedback
I've been giving to Eric over on import-sig, so I have some definite
ideas on how I think that API should look :)

Cheers,
Nick.

-- 
Nick Coghlan   |   ncoghlan at gmail.com   |   Brisbane, Australia