advice needed: best approach to enabling "metamodules"?

Hi all,

There was some discussion on python-ideas last month about how to make it easier/more reliable for a module to override attribute access. This is useful for things like autoloading submodules (accessing 'foo.bar' triggers the import of 'bar'), or for deprecating module attributes that aren't functions (accessing 'foo.bar' emits a DeprecationWarning, "the bar attribute will be removed soon").

Python has had some basic support for this for a long time -- if a module overwrites its entry in sys.modules[__name__], then the object that's placed there will be returned by 'import'. This allows one to define custom subclasses of module and use them instead of the default, similar to how metaclasses allow one to use custom subclasses of 'type'.

In practice, though, it's very difficult to make this work safely and correctly for a top-level package. The main problem is that when you create a new object to stick into sys.modules, this necessarily means creating a new namespace dict. And now you have a mess, because now you have two dicts: new_module.__dict__, which is the namespace you export, and old_module.__dict__, which is the globals() for the code that's trying to define the module namespace. Keeping these in sync is extremely error-prone -- consider what happens, e.g., when your package __init__.py wants to import submodules which then recursively import the top-level package -- so it's difficult to justify for the kind of large packages that might be worried about deprecating entries in their top-level namespace.

So what we'd really like is a way to somehow end up with an object that (a) has the same __dict__ as the original module, but (b) is of our own custom module subclass. If we can do this then metamodules will become safe and easy to write correctly. (There's a little demo of working metamodules here: https://github.com/njsmith/metamodule/ but it uses ctypes hacks that depend on non-stable parts of the CPython ABI, so it's not a long-term solution.)

I've now spent some time trying to hack this capability into CPython and I've made a list of the possible options I can think of to fix this. I'm writing to python-dev because none of them are obviously The Right Way, so I'd like to get some opinions/ruling/whatever on which approach to follow up on.

Option 1: Make it possible to change the type of a module object in-place, so that we can write something like

    sys.modules[__name__].__class__ = MyModuleSubclass

Option 1 downside: The invariants required to make __class__ assignment safe are complicated, and only implemented for heap-allocated type objects. PyModule_Type is not heap-allocated, so making this work would require lots of delicate surgery to typeobject.c. I'd rather not go down that rabbit-hole.

----

Option 2: Make PyModule_Type into a heap type allocated at interpreter startup, so that the above just works.

Option 2 downside: PyModule_Type is exposed as a statically-allocated global symbol, so doing this would involve breaking the stable ABI.

----

Option 3: Make it legal to assign to the __dict__ attribute of a module object, so that we can write something like

    new_module = MyModuleSubclass(...)
    new_module.__dict__ = sys.modules[__name__].__dict__
    sys.modules[__name__].__dict__ = {}   # ***
    sys.modules[__name__] = new_module

The line marked *** is necessary because of the way modules are designed: they expect to control the lifecycle of their __dict__. When the module object is initialized, it fills in a bunch of stuff in the __dict__. When the module object (not the dict object!) is deallocated, it deletes everything from the __dict__. This latter feature in particular means that having two module objects sharing the same __dict__ is bad news.

Option 3 downside: The paragraph above. Also, there's stuff inside the module struct besides just the __dict__, and more stuff has appeared there over time.

----

Option 4: Add a new function sys.swap_module_internals, which takes two module objects and swaps their __dict__ and other attributes. By making the operation a swap instead of an assignment, we avoid the lifecycle pitfalls from Option 3. By making it a builtin, we can make sure it always handles all the module fields that matter, not just __dict__. Usage:

    new_module = MyModuleSubclass(...)
    sys.swap_module_internals(new_module, sys.modules[__name__])
    sys.modules[__name__] = new_module

Option 4 downside: Obviously a hack.

----

Options 3 and 4 both seem workable; it just depends on which way we prefer to hold our nose. Option 4 is slightly more correct in that it works for *all* modules, but OTOH at the moment the only time Option 3 *really* fails is for compiled modules with PEP 3121 metadata, and compiled modules can already use a module subclass via other means (since they instantiate their own module objects).

Thoughts? Suggestions on other options I've missed? Should I go ahead and write a patch for one of these?

-n -- Nathaniel J. Smith Postdoctoral researcher - Informatics - University of Edinburgh http://vorpus.org
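For concreteness, the deprecation use case as a ModuleType subclass might look roughly like this (a minimal sketch with made-up names; it assumes one of the options above is available to actually install the instance in sys.modules with the original __dict__):

    import types
    import warnings

    class DeprecatingModule(types.ModuleType):
        # Deprecated attribute names mapped to their legacy values (illustrative).
        _deprecated = {"bar": 42}

        def __getattr__(self, name):
            # Only called when normal lookup in the module __dict__ fails,
            # so attributes that still exist in the namespace pay nothing extra.
            if name in self._deprecated:
                warnings.warn("the %s attribute will be removed soon" % name,
                              DeprecationWarning, stacklevel=2)
                return self._deprecated[name]
            raise AttributeError(name)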

On Sat, Nov 29, 2014 at 12:59 PM, Nathaniel Smith <njs@pobox.com> wrote:
This one corresponds to what I've seen in quite a number of C APIs. It's not ideal, but nothing is; and at least this way, it's clear that you're fiddling with internals. Letting the interpreter do the grunt-work for you is *definitely* preferable to having recipes out there saying "swap in a new __dict__, then don't forget to clear the old module's __dict__", which will have massive versioning issues as soon as a new best-practice comes along; making it a function, like this, means its implementation can smoothly change between versions (even in a bug-fix release). Would it be better to make that function also switch out the entry in sys.modules? That way, it's 100% dedicated to this job of "I want to make a subclass of module and use that for myself", and could then be made atomic against other imports. I've no idea whether there's any other weird shenanigans that could be deployed with this kind of module switch, nor whether cutting them out would be a good or bad thing! ChrisA

Are these really all our options? All of them sound like hacks, none of them sound like anything the language (or even the CPython implementation) should sanction. Have I missed the discussion where the use cases and constraints were analyzed and all other approaches were rejected? (I might have some half-baked ideas, but I feel I should read up on the past discussion first, and they are probably more fit for python-ideas than for python-dev. Plus I'm just writing this email because I'm procrastinating on the type hinting PEP. :-) --Guido On Fri, Nov 28, 2014 at 7:45 PM, Chris Angelico <rosuav@gmail.com> wrote:
-- --Guido van Rossum (python.org/~guido)

On Sat, Nov 29, 2014 at 4:21 AM, Guido van Rossum <guido@python.org> wrote:
The previous discussions I was referring to are here:
http://thread.gmane.org/gmane.comp.python.ideas/29487/focus=29555
http://thread.gmane.org/gmane.comp.python.ideas/29788

There might well be other options; these are just the best ones I could think of :-). The constraints are pretty tight, though:

- The "new module" object (whatever it is) should have a __dict__ that aliases the original module globals(). I can elaborate on this if my original email wasn't enough, but hopefully it's obvious that making two copies of the same namespace and then trying to keep them in sync at the very least smells bad :-).

- The "new module" object has to be a subtype of ModuleType, b/c there are lots of places that do isinstance(x, ModuleType) checks (notably -- but not only -- reload()). Since a major goal here is to make it possible to do cleaner deprecations, it would be really unfortunate if switching an existing package to use the metamodule support itself broke things :-).

- Lookups in the normal case should have no additional performance overhead, because module lookups are extremely common. (So this rules out dict proxies and tricks like that -- we really need 'new_module.__dict__ is globals()' to be true.)

AFAICT there are three logically possible strategies for satisfying that first constraint:
(a) convert the original module object into the type we want, in-place
(b) create a new module object that acts like the original module object
(c) somehow arrange for our special type to be used from the start

My options 1 and 2 are means of accomplishing (a), and my options 3 and 4 are means of accomplishing (b) while working around the behavioural quirks of module objects (as required by the second constraint).

The python-ideas thread did also consider several methods of implementing strategy (c), but they're messy enough that I left them out here. The problem is that somehow we have to execute code to create the new subtype *before* we have an entry in sys.modules for the package that contains the code for the subtype. So one option would be to add a new rule: if a file pkgname/__new__.py exists, then it is executed first and is required to set up sys.modules["pkgname"] before we exec pkgname/__init__.py. So pkgname/__new__.py might look like:

    import sys
    from pkgname._metamodule import MyModuleSubtype
    sys.modules[__name__] = MyModuleSubtype(__name__, docstring)

This runs into a lot of problems though. To start with, the 'from pkgname._metamodule ...' line is an infinite loop, b/c this is the code used to create sys.modules["pkgname"]. It's not clear where the globals dict for executing __new__.py comes from (who defines __name__? Currently that's done by ModuleType.__init__). It only works for packages, not modules. The need to provide the docstring here, before __init__.py is even read, is weird. It adds extra stat() calls to every package lookup. And, the biggest showstopper IMHO: AFAICT it's impossible to write a polyfill to support this code on old python versions, so it's useless to any package which needs to keep compatibility with 2.7 (or even 3.4). Sure, you can backport the whole import system like importlib2, but telling everyone that they need to replace every 'import numpy' with 'import importlib2; import numpy' is a total non-starter.

So, yeah, those 4 options are really the only plausible ones I know of. Options 1 and 3 are pretty nice at the language level! Most Python objects allow assignment to __class__ and __dict__, and both PyPy and Jython at least do support __class__ assignment. Really the only downside with Option 1 is that actually implementing it requires attention from someone with deep knowledge of typeobject.c.

-n -- Nathaniel J. Smith Postdoctoral researcher - Informatics - University of Edinburgh http://vorpus.org
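Stated as code, the first two constraints amount to checks like the following, which a package could run from its own __init__.py once its metamodule is installed (an illustration only, not part of any proposal):

    import sys
    import types

    mod = sys.modules[__name__]
    assert isinstance(mod, types.ModuleType)   # subtype constraint: reload() etc. keep working
    assert mod.__dict__ is globals()           # aliasing constraint: one namespace, no proxy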

On 29/11/14 19:37, Nathaniel Smith wrote: [snip]
- The "new module" object has to be a subtype of ModuleType, b/c there are lots of places that do isinstance(x, ModuleType) checks (notably
It has to be a *subtype*; it does not need to be a *subclass*.
Cheers, Mark.

On Sun, Nov 30, 2014 at 11:07:57AM +1300, Greg Ewing wrote:
Perhaps I'm missing something, but won't that imply that every module which wants to use a "special" module type has to re-invent the wheel? If this feature is going to be used, I would expect to be able to re-use pre-written module types. E.g. having written "module with properties" (so to speak) once, I can just import it and use it in my next project. -- Steven

On Sun, Nov 30, 2014 at 12:05 AM, Steven D'Aprano <steve@pearwood.info> wrote:
I expect you'd package the special metamodule class in a stand-alone package, not directly in the ones that use it. You could import other packages freely, just the one that you're currently defining would be unavailable.

On Sat, Nov 29, 2014 at 8:37 PM, Nathaniel Smith <njs@pobox.com> wrote: [...]
As Greg Ewing said – you don't want to import from the package whose metamodule you're defining. You'd want to do as little work as possible in __new__.py. I'd use something like this:

    import types

    class __metamodule__(types.ModuleType):
        def __call__(self):
            return self.main()

where Python would get the attribute __metamodule__ from __new__.py, and use `__metamodule__(name, doc)` as the thing to execute __init__ in.
Well, it could still be in __metamodule__.__init__().
It only works for packages, not modules.
I don't see a need for this treatment for modules in a package – if you want `from mypkg import callme`, you can make "callme" a function rather than a callable module. If you *also* want `from mypkg.callme import something_else`, I say you should split "callme" into two differently named things; names are cheap inside a package. If really needed, modules in a package can use an import hook defined in the package, or be converted to subpackages. Single-module projects would be left out, yes – but those can be simply converted to a package.
The need to provide the docstring here, before __init__.py is even read, is weird.
Does it have to be before __init__.py is read? Can't __init__.py be compiled beforehand, to get __doc__, and only *run* in the new namespace? (Or should __new__.py define import hooks that say how __init__.py should be loaded/compiled? I don't see a case for that.)
It adds extra stat() calls to every package lookup.
Fair.
I'm probably missing something obvious, but where would this not work?
- As the first thing it does, __init__.py imports the polyfill and calls polyfill(__name__)
- The polyfill, if running non-recursively* under old Python:
  -- compiles __init__.py
  -- imports __new__.py to get __metamodule__
  -- instantiates the metamodule with the name, and the docstring from the compiled code
  -- * remembers the instance, to check for recursion later
  -- puts it in sys.modules
  -- execs __init__ in it
- afterwards the original __init__.py execution continues, filling up a now-unused module's namespace
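A rough sketch of what that polyfill might look like on older Pythons, following the steps above (a hypothetical, untested helper; it assumes pkgname/__new__.py defines a __metamodule__ class as described earlier in the thread, and that the docstring heuristic below is good enough):

    import os
    import sys
    import importlib

    _instances = {}  # remembers metamodule instances, to detect the recursive second run

    def polyfill(name):
        """Called as the first statement of pkgname/__init__.py: polyfill(__name__)."""
        if name in _instances:
            # We are already executing __init__.py inside the metamodule; do nothing.
            return
        old_module = sys.modules[name]
        init_path = os.path.join(os.path.dirname(old_module.__file__), "__init__.py")
        with open(init_path) as f:
            code = compile(f.read(), init_path, "exec")
        # crude docstring heuristic: a module docstring, if any, is the first constant
        doc = code.co_consts[0] if code.co_consts and isinstance(code.co_consts[0], str) else None
        # __new__.py lives inside the package and supplies the __metamodule__ class
        metaclass = importlib.import_module(name + ".__new__").__metamodule__
        metamodule = metaclass(name, doc)
        metamodule.__file__ = init_path
        _instances[name] = metamodule
        sys.modules[name] = metamodule
        exec(code, metamodule.__dict__)   # re-run __init__.py in the new namespace
        # The original __init__.py execution then continues in old_module's (now unused) dict.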

All the use cases seem to be about adding some kind of getattr hook to modules. They all seem to involve modifying the CPython C code anyway. So why not tackle that problem head-on and modify module_getattro() to look for a global named __getattr__ and if it exists, call that instead of raising AttributeError? On Sat, Nov 29, 2014 at 11:37 AM, Nathaniel Smith <njs@pobox.com> wrote:
-- --Guido van Rossum (python.org/~guido)
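Under the hook Guido describes, a package's __init__.py could handle the deprecation case with nothing but a module-level function. This is illustrative only, since it relies on the proposed (not yet existing) module_getattro() change; the names are made up:

    # foo/__init__.py, assuming the proposed module-level __getattr__ hook existed
    import warnings

    _removed = {"OLD_FLAG": 1}

    def __getattr__(name):
        # would be called by module_getattro() only after normal lookup fails
        if name in _removed:
            warnings.warn("foo.%s is deprecated" % name,
                          DeprecationWarning, stacklevel=2)
            return _removed[name]
        raise AttributeError("module %r has no attribute %r" % (__name__, name))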

On Sat, Nov 29, 2014, 21:55 Guido van Rossum <guido@python.org> wrote: All the use cases seem to be about adding some kind of getattr hook to modules. They all seem to involve modifying the CPython C code anyway. So why not tackle that problem head-on and modify module_getattro() to look for a global named __getattr__ and if it exists, call that instead of raising AttributeError?

Not sure if anyone thought of it. :) Seems like a reasonable solution to me. Be curious to know what the benchmark suite said the impact was. -brett

On Sun, Nov 30, 2014 at 6:15 AM, Brett Cannon <brett@python.org> wrote:
Why would there be any impact? The __getattr__ hook would be similar to the one on classes -- it's only invoked at the point where otherwise AttributeError would be raised. -- --Guido van Rossum (python.org/~guido)
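That class-level behaviour, for reference: __getattr__ is a fallback that only fires after ordinary lookup has failed, so successful lookups cost nothing extra (a small self-contained illustration):

    class Demo:
        x = 1
        def __getattr__(self, name):
            # reached only when normal attribute lookup has already failed
            return "fallback for %s" % name

    d = Demo()
    print(d.x)        # 1 -- found normally, __getattr__ never runs
    print(d.missing)  # "fallback for missing"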

On Sun Nov 30 2014 at 2:28:31 PM Ethan Furman <ethan@stoneleaf.us> wrote:
You don't; you just can't shoehorn everything back to 2.7. And just to make sure everyone participating in this discussion is up on the latest import stuff, Python 3.4 does have Loader.create_module() <https://docs.python.org/3/library/importlib.html#importlib.abc.Loader.create...> which lets you control what object is used for a module in the import machinery (this is prior to loading, though, so you can't specify it in the module itself, only at the loader level). This is how I was able to implement lazy loading for 3.5 <https://docs.python.org/3.5/library/importlib.html#importlib.util.LazyLoader> .
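For readers who haven't used it, the 3.5 LazyLoader is applied roughly like this (a sketch adapted from the importlib documentation; the real import is deferred until the first attribute access):

    import importlib.util
    import sys

    def lazy_import(name):
        spec = importlib.util.find_spec(name)
        loader = importlib.util.LazyLoader(spec.loader)
        spec.loader = loader
        module = importlib.util.module_from_spec(spec)
        sys.modules[name] = module
        loader.exec_module(module)   # sets up the lazy proxy; module body not run yet
        return module

    json = lazy_import("json")       # nothing actually imported yet
    json.dumps                       # first attribute access triggers the real load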

On Sun, Nov 30, 2014 at 7:27 PM, Ethan Furman <ethan@stoneleaf.us> wrote:
I think that's doable -- assuming I'm remembering correctly the slightly weird class vs. instance lookup rules for special methods, you can write a module subclass like

    class GetAttrModule(types.ModuleType):
        def __getattr__(self, name):
            return self.__dict__["__getattr__"](name)

and then use ctypes hacks to get it into sys.modules[__name__]. -n -- Nathaniel J. Smith Postdoctoral researcher - Informatics - University of Edinburgh http://vorpus.org

On 11/30/2014 03:41 PM, Terry Reedy wrote:
My understanding of one of the use-cases was being able to issue warnings about deprecated attributes, which would be most effective if a backport could be written for current versions. -- ~Ethan~

On Sun, 30 Nov 2014 11:15:50 -0800 Guido van Rossum <guido@python.org> wrote:
builtins are typically found by first looking up in the current globals (module) scope, failing, and then falling back on __builtins__. Depending on how much overhead is added to the "failing" step, there /might/ be a performance difference. Of course, that would only occur wherever a __getattr__ hook is defined. Regards Antoine.

On Sun, Nov 30, 2014 at 1:12 PM, Antoine Pitrou <solipsis@pitrou.net> wrote:
The builtins lookup process never does a module attribute lookup -- it only does dict lookups. So it would not be affected by a module __getattr__ hook (unless we were to use dict proxies, which Nathaniel already rejected). @Nathaniel: perhaps you could get what you want without any C code changes using the approach of Brett's LazyLoader? -- --Guido van Rossum (python.org/~guido)

On Sun, Nov 30, 2014 at 2:54 AM, Guido van Rossum <guido@python.org> wrote:
You need to allow overriding __dir__ as well for tab-completion, and some people wanted to use the properties API instead of raw __getattr__, etc. Maybe someone will want __getattribute__ semantics, I dunno. So since we're *so close* to being able to just use the subclassing machinery, it seemed cleaner to try and get that working instead of reimplementing bits of it piecewise. That said, __getattr__ + __dir__ would be enough for my immediate use cases. -n
-- Nathaniel J. Smith Postdoctoral researcher - Informatics - University of Edinburgh http://vorpus.org

On Sun, Nov 30, 2014 at 11:29 AM, Nathaniel Smith <njs@pobox.com> wrote:
Hm... I agree about __dir__ but the other things feel too speculative.
That would really be option 1, right? It's the one that looks cleanest from the user's POV (or at least from the POV of a developer who wants to build a framework using this feature -- for a simple one-off use case, __getattr__ sounds pretty attractive). I think that if we really want option 1, the issue of PyModuleType not being a heap type can be dealt with.
That said, __getattr__ + __dir__ would be enough for my immediate use cases.
Perhaps it would be a good exercise to try and write the "lazy submodule import"(*) use case three ways: (a) using only CPython 3.4; (b) using __class__ assignment; (c) using customizable __getattr__ and __dir__. I think we can learn a lot about the alternatives from this exercise. I presume there's already a version of (a) floating around, but if it's been used in practice at all, it's probably too gnarly to serve as a useful comparison (though its essence may be extracted to serve as such).

FWIW I believe all proposals here have a big limitation: the module *itself* cannot benefit much from all these shenanigans, because references to globals from within the module's own code are just dictionary accesses, and we don't want to change that.

(*) I originally wrote "lazy import", but I realized that messing with the module class object probably isn't the best way to implement that -- it requires a proxy for the module that's managed by an import hook. But if you think it's possible, feel free to use this example, as "lazy import" seems a pretty useful thing to have in many situations. (At least that's how I would do it. And I would probably add some atrocious hack to patch up the importing module's globals once the module is actually loaded, to reduce the cost of using the proxy over the lifetime of the process.)

-- --Guido van Rossum (python.org/~guido)

On Sun Nov 30 2014 at 3:55:39 PM Guido van Rossum <guido@python.org> wrote:
Start at https://hg.python.org/cpython/file/64bb01bce12c/Lib/importlib/util.py#l207 and read down the rest of the file. It really only requires changing __class__ to drop the proxy and that's done immediately after the lazy import. The approach also occurs *after* the finder so you don't get ImportError for at least missing a file.

On Sun, Nov 30, 2014 at 8:54 PM, Guido van Rossum <guido@python.org> wrote:
Options 1-4 all have the effect of making it fairly simple to slot an arbitrary user-defined module subclass into sys.modules. Option 1 is the cleanest API though :-).
(b) and (c) are very straightforward and trivial. Probably I could do a better job of faking dir()'s default behaviour on modules, but basically:

    ##### (b) __class__ assignment #####
    import sys, types, importlib

    class MyModule(types.ModuleType):
        def __getattr__(self, name):
            if name in _lazy_submodules:
                # importing the submodule implicitly assigns it to self.__dict__[name]
                return importlib.import_module("." + name, package=self.__package__)
            raise AttributeError(name)
        def __dir__(self):
            entries = set(self.__dict__)
            entries.update(_lazy_submodules)
            return sorted(entries)

    sys.modules[__name__].__class__ = MyModule
    _lazy_submodules = {"foo", "bar"}

    ##### (c) customizable __getattr__ and __dir__ #####
    import importlib

    def __getattr__(name):
        if name in _lazy_submodules:
            # importing the submodule implicitly assigns it to globals()[name]
            return importlib.import_module("." + name, package=__package__)
        raise AttributeError(name)

    def __dir__():
        entries = set(globals())
        entries.update(_lazy_submodules)
        return sorted(entries)

    _lazy_submodules = {"foo", "bar"}
I think that's fine -- IMHO the main use cases here are about controlling the public API. And a module that really wants to can always import itself if it wants to pull more shenanigans :-) (i.e., foo/__init__.py can do "import foo; foo.blahblah" instead of just "blahblah".) -n -- Nathaniel J. Smith Postdoctoral researcher - Informatics - University of Edinburgh http://vorpus.org

Guido van Rossum wrote:
If assignment to the __class__ of a module were permitted (by whatever means) then you could put this in a module:

    class __class__(types.ModuleType):
        ...

which makes it look almost like a deliberate language feature. :-) Seriously, of the options presented, I think that allowing __class__ assignment is the most elegant, since it solves a lot of problems in one go without introducing any new features -- just removing a restriction that prevents an existing language mechanism from working in this case. -- Greg

What if we had metaclass semantics on module creation? E.g., suppose the default were __metaclass__ = ModuleType. What if Python supported __prepare__ for modules? Thanks, -- Ionel M. On Sat, Nov 29, 2014 at 11:36 PM, Greg Ewing <greg.ewing@canterbury.ac.nz> wrote:

On Sat, 29 Nov 2014 01:59:06 +0000 Nathaniel Smith <njs@pobox.com> wrote:
Option 1b: have __class__ assignment delegate to a tp_classassign slot on the old class, so that typeobject.c doesn't have to be cluttered with many special cases.
[...]
How do these two options interact with the fact that module functions store their globals dict, not the module itself? Regards Antoine.

On 29 November 2014 at 21:32, Antoine Pitrou <solipsis@pitrou.net> wrote:
Aye, being able to hook class switching could be potentially useful (including the ability to just disallow it entirely if you really wanted to do that).
Right, that's the part I consider the most challenging with metamodules - the fact that there's a longstanding assumption that a module is just a "dictionary with some metadata", so the interpreter is inclined to treat them that way. Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia

On Sat, Nov 29, 2014 at 11:32 AM, Antoine Pitrou <solipsis@pitrou.net> wrote:
I'm intrigued -- how would this help? I have a vague impression that one could add another branch to object_set_class that went something like if at least one of the types is a subtype of the other type, and the subtype is a heap type with tp_dealloc == subtype_dealloc, and the subtype doesn't add any important slots, and ... then the __class__ assignment is legal. (This is taking advantage of the fact that if you don't have any extra slots added, then subtype_dealloc just basically defers to the base type's tp_dealloc, so it doesn't really matter which one you end up calling.) And my vague impression is that there isn't really anything special about the module type that would allow a tp_classassign function to simplify this logic. But these are just vague impressions :-)
I think that's totally fine? The whole point of all these proposals is to make sure that the final module object does in fact have the correct globals dict.

    ~$ git clone git@github.com:njsmith/metamodule.git
    ~$ cd metamodule
    ~/metamodule$ python3.4
If anything this is another argument for why we NEED something like this :-). -n -- Nathaniel J. Smith Postdoctoral researcher - Informatics - University of Edinburgh http://vorpus.org

On Sat, 29 Nov 2014 20:02:50 +0000 Nathaniel Smith <njs@pobox.com> wrote:
It would allow ModuleType to override tp_classassign to decide whether and how __class__ assignment on a module instance is allowed to work. So typeobject.c needn't know about any specifics of ModuleType or any other type.
Ok, I see. The code hacks up the new module to take ownership of the old module's __dict__. That doesn't look very clean to me. Regards Antoine.

On 29/11/14 01:59, Nathaniel Smith wrote:
Hi all,
[snip]
Why does MyModuleClass need to sub-class types.ModuleType? Modules have no special behaviour, apart from the inability to write to their __dict__ attribute, which is the very thing you don't want. If it quacks like a module... Cheers, Mark.

Hi, This discussion has been going on for a while, but no one has questioned the basic premise. Does this need any change to the language or interpreter? I believe it does not. I've modified your original metamodule.py to not use ctypes and to support reloading: https://gist.github.com/markshannon/1868e7e6115d70ce6e76 Cheers, Mark. On 29/11/14 01:59, Nathaniel Smith wrote:

On Sun, Nov 30, 2014 at 10:14 PM, Mark Shannon <mark@hotpy.org> wrote:
Interesting approach! As written, your code will blow up on any python < 3.4, because when old_module gets deallocated it'll wipe the module dict clean. And I guess even on >=3.4, this might still happen if old_module somehow manages to get itself into a reference loop before getting deallocated. (Hopefully not, but what a nightmare to debug if it did.) However, both of these issues can be fixed by stashing a reference to old_module somewhere in new_module.

The __class__ = ModuleType trick is super-clever but makes me irrationally uncomfortable. I know that this is documented as a valid method of fooling isinstance(), but I didn't know that until yesterday, and the idea of objects where type(foo) is not foo.__class__ strikes me as somewhat blasphemous. Maybe this is all fine though.

The pseudo-module objects generated this way still won't pass PyModule_Check, so in theory this could produce behavioural differences. I can't name any specific places where this will break things, though. From a quick skim of the CPython source, a few observations:
- It means the PyModule_* API functions won't work (e.g. PyModule_GetDict); maybe these aren't used enough to matter.
- It looks like the __reduce__ methods on "method objects" (Objects/methodobject.c) have a special check for ->m_self being a module object, and won't pickle correctly if ->m_self ends up pointing to one of these pseudo-modules. I have no idea how one ends up with a method whose ->m_self points to a module object, though -- maybe it never actually happens.
- PyImport_Cleanup treats module objects differently from non-module objects during shutdown.

I guess it also has the mild limitation that it doesn't work with extension modules, but eh. Mostly I'd be nervous about the two points above.

-n -- Nathaniel J. Smith Postdoctoral researcher - Informatics - University of Edinburgh http://vorpus.org
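For anyone who, like Nathaniel, hadn't seen the trick before: isinstance() falls back to an object's __class__ attribute when the real type check fails, so a class can lie about it. A stripped-down illustration (not Mark's actual gist, which does more work):

    import sys
    import types

    class FakeModule(object):
        # Reports ModuleType as its __class__ even though it isn't a subclass,
        # which is enough to satisfy isinstance(x, types.ModuleType).
        __class__ = types.ModuleType

        def __init__(self, real_module):
            # share the original module's namespace, roughly the "take ownership
            # of the old module's __dict__" step Antoine mentions above
            self.__dict__ = real_module.__dict__

    fake = FakeModule(sys)
    print(isinstance(fake, types.ModuleType))   # True
    print(type(fake) is fake.__class__)         # False -- the part that feels blasphemous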

On Mon, Dec 1, 2014 at 12:59 AM, Nathaniel Smith <njs@pobox.com> wrote:
Actually, there is one showstopper here -- the first version in which reload() uses isinstance() is actually 3.4. Before that you need a real module subtype for reload to work. But this is in principle workaroundable by using subclassing + ctypes on old versions of python and the __class__ = hack on new versions.
-- Nathaniel J. Smith Postdoctoral researcher - Informatics - University of Edinburgh http://vorpus.org

Nathaniel, did you look at Brett's LazyLoader? It overcomes the subclass issue by using a module loader that makes all modules instances of a (trivial) Module subclass. I'm sure this approach can be backported as far as you need to go. On Sun, Nov 30, 2014 at 5:02 PM, Nathaniel Smith <njs@pobox.com> wrote:
-- --Guido van Rossum (python.org/~guido)

On Mon, Dec 1, 2014 at 1:27 AM, Guido van Rossum <guido@python.org> wrote:
The problem is that by the time your package's code starts running, it's too late to install such a loader. Brett's strategy works well for lazy-loading submodules (e.g., making it so 'import numpy' makes 'numpy.testing' available, but without the speed hit of importing it immediately), but it doesn't help if you want to actually hook attribute access on your top-level package (e.g., making 'numpy.foo' trigger a DeprecationWarning -- we have a lot of stupid exported constants that we can never get rid of because our rules say that we have to deprecate things before removing them).

Or maybe you're suggesting that we define a trivial heap-allocated subclass of PyModule_Type and use that everywhere, as a quick-and-dirty way to enable __class__ assignment? (E.g., return it from PyModule_New?) I considered this before but hesitated b/c it could potentially break backwards compatibility -- e.g. if code A creates a PyModule_Type object directly without going through PyModule_New, and then code B checks whether the resulting object is a module by doing isinstance(x, type(sys)), this will break. (type(sys) is a pretty common way to get a handle to ModuleType -- in fact both types.py and importlib use it.) So in my mind I sorta lumped it in with my Option 2, "minor compatibility break". OTOH maybe anyone who creates a module object without going through PyModule_New deserves whatever they get.

-n
-- Nathaniel J. Smith Postdoctoral researcher - Informatics - University of Edinburgh http://vorpus.org
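The idiom Nathaniel is worried about is trivial but widespread; this is essentially how Lib/types.py obtains ModuleType (a short illustration of the compatibility concern):

    import sys

    # How types.py and importlib get a handle on the module type:
    ModuleType = type(sys)
    print(isinstance(sys, ModuleType))   # True today

    # If the import machinery started returning instances of a heap-allocated
    # subclass, then type(sys) would be that subclass, and any module object
    # still created directly from the static PyModule_Type would fail this check.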

On Sun, Nov 30, 2014 at 5:42 PM, Nathaniel Smith <njs@pobox.com> wrote:
Couldn't you install a package loader using some install-time hook? Anyway, I still think that the issues with heap types can be overcome. Hm, didn't you bring that up before here? Was the conclusion that it's impossible? -- --Guido van Rossum (python.org/~guido)

On Mon, Dec 1, 2014 at 4:06 AM, Guido van Rossum <guido@python.org> wrote:
I've brought it up several times but no-one's really discussed it :-). I finally attempted a deep dive into typeobject.c today myself. I'm not at all sure I understand the intricacies correctly here, but I *think* __class__ assignment could be relatively easily extended to handle non-heap types, and in fact the current restriction to heap types is actually buggy (IIUC).

object_set_class is responsible for checking whether it's okay to take an object of class "oldto" and convert it to an object of class "newto". Basically its goal is just to avoid crashing the interpreter (as would quickly happen if you e.g. allowed "[].__class__ = dict"). Currently the rules (spread across object_set_class and compatible_for_assignment) are:

(1) both oldto and newto have to be heap types
(2) they have to have the same tp_dealloc
(3) they have to have the same tp_free
(4) if you walk up the ->tp_base chain for both types until you find the most-ancestral type that has a compatible struct layout (as checked by equiv_structs), then either
    (4a) these ancestral types have to be the same, OR
    (4b) these ancestral types have to have the same tp_base, AND they have to have added the same slots on top of that tp_base (e.g. if you have class A(object): pass and class B(object): pass then they'll both have added a __dict__ slot at the same point in the instance struct, so that's fine; this is checked in same_slots_added).

The only place the code assumes that it is dealing with heap types is in (4b) -- same_slots_added unconditionally casts the ancestral types to (PyHeapTypeObject*). AFAICT that's why step (1) is there, to protect this code. But I don't think the check actually works -- step (1) checks that the types we're trying to assign are heap types, but this is no guarantee that the *ancestral* types will be heap types. [Also, the code for __bases__ assignment appears to also call into this code with no heap type checks at all.] E.g., I think if you do

    class MyList(list):
        __slots__ = ()

    class MyDict(dict):
        __slots__ = ()

    MyList().__class__ = MyDict

then you'll end up in same_slots_added casting PyDict_Type and PyList_Type to PyHeapTypeObjects and then following invalid pointers into la-la land. (The __slots__ = () is to maintain layout compatibility with the base types; if you find builtin types that already have __dict__ and weaklist and HAVE_GC then this example should still work even with perfectly empty subclasses.)

Okay, so suppose we move the heap type check (step 1) down into same_slots_added (step 4b), since AFAICT this is actually more correct anyway. This is almost enough to enable __class__ assignment on modules, because the cases we care about will go through the (4a) branch rather than (4b), so the heap type thing is irrelevant.

The remaining problem is the requirement that both types have the same tp_dealloc (step 2). ModuleType itself has tp_dealloc == module_dealloc, while all(?) heap types have tp_dealloc == subtype_dealloc. Here again, though, I'm not sure what purpose this check serves. subtype_dealloc basically cleans up extra slots, and then calls the base class tp_dealloc. So AFAICT it's totally fine if oldto->tp_dealloc == module_dealloc, and newto->tp_dealloc == subtype_dealloc, so long as newto is a subtype of oldto -- b/c this means newto->tp_dealloc will end up calling oldto->tp_dealloc anyway.
OTOH it's not actually a guarantee of anything useful to see that oldto->tp_dealloc == newto->tp_dealloc == subtype_dealloc, because subtype_dealloc does totally different things depending on the ancestry tree -- MyList and MyDict above pass the tp_dealloc check, even though list.tp_dealloc and dict.tp_dealloc are definitely *not* interchangeable. So I suspect that a more correct way to do this check would be something like

    PyTypeObject *old_real_deallocer = oldto, *new_real_deallocer = newto;
    while (old_real_deallocer->tp_dealloc == subtype_dealloc)
        old_real_deallocer = old_real_deallocer->tp_base;
    while (new_real_deallocer->tp_dealloc == subtype_dealloc)
        new_real_deallocer = new_real_deallocer->tp_base;
    if (old_real_deallocer->tp_dealloc != new_real_deallocer->tp_dealloc)
        /* error out */

Module subclasses would pass this check. Alternatively it might make more sense to add a check in equiv_structs that

    (child_type->tp_dealloc == subtype_dealloc
     || child_type->tp_dealloc == parent_type->tp_dealloc);

I think that would accomplish the same thing in a somewhat cleaner way. Obviously this code is really subtle though, so don't trust any of the above without review from someone who knows typeobject.c better than me! (Antoine?)

-n -- Nathaniel J. Smith Postdoctoral researcher - Informatics - University of Edinburgh http://vorpus.org

On Mon, Dec 1, 2014 at 1:38 PM, Nathaniel Smith <njs@pobox.com> wrote:
That's because nobody dares to touch it. (Myself included -- I increased the size of typeobject.c from ~50 to ~5000 lines in a single intense editing session more than a decade ago, and since then it's been basically unmaintainable. :-( )
Have you filed this as a bug? I believe nobody has discovered this problem before. I've confirmed it as far back as 2.5 (I don't have anything older installed).
Yeah, I can't see a way that type_new() can create a type whose tp_dealloc isn't subtype_dealloc.
I guess the simple check is an upper bound (or whatever that's called -- my math-speak is rusty ;-) for the necessary-and-sufficient check that you're describing.
I'm not set up to disagree with you on this any more...
Or Benjamin? -- --Guido van Rossum (python.org/~guido)

On Mon, 1 Dec 2014 21:38:45 +0000 Nathaniel Smith <njs@pobox.com> wrote:
I'm not sure. Many operations are standardized on heap types that can have arbitrary definitions on static types (I'm talking about the tp_ methods). You'd have to review them to double check. For example, a heap type's tp_new increments the type's refcount, so you have to adjust the type's refcount if you cast an instance from a non-heap type to a heap type, and vice-versa (see slot_tp_new()). (this raises the interesting question "what happens if you assign to __class__ from a __del__ method?")
Sounds good.
There's no "child" and "parent" types in equiv_structs(). Regards Antoine.

On Tue, Dec 2, 2014 at 9:19 AM, Antoine Pitrou <solipsis@pitrou.net> wrote:
Reading through the list of tp_ methods I can't see any other that look problematic. The finalizers are kinda intimate, but I think people would expect that if you swap an instance's type to something that has a different __del__ method then it's the new __del__ method that'll be called. If we wanted to be really careful we should perhaps do something cleverer with tp_is_gc, but so long as type objects are the only objects that have a non-trivial tp_is_gc, and the tp_is_gc call depends only on their tp_flags (which are unmodified by __class__ assignment), then we should still be safe (and anyway this is orthogonal to the current issues).
Right, fortunately this is easy :-).
(this raises the interesting question "what happens if you assign to __class__ from a __del__ method?")
subtype_dealloc actually attempts to take this possibility into account -- see the comment "Extract the type again; tp_del may have changed it". I'm not at all sure that its handling is *correct* -- there's a bunch of code that references 'type' between the call to tp_del and this comment, and there's code after the comment that references 'base' without recalculating it. But it is there :-)
Not as currently written, but every single call site is of the form equiv_structs(x, x->tp_base). And equiv_structs takes advantage of this -- e.g., checking that two types have the same tp_basicsize is pretty uninformative if they're unrelated types, but if they're parent and child then it tells you that they have exactly the same slots. I wrote a patch incorporating the above ideas: http://bugs.python.org/issue22986 -- Nathaniel J. Smith Postdoctoral researcher - Informatics - University of Edinburgh http://vorpus.org

On Sat, Nov 29, 2014 at 12:59 PM, Nathaniel Smith <njs@pobox.com> wrote:
This one corresponds to what I've seen in quite a number of C APIs. It's not ideal, but nothing is; and at least this way, it's clear that you're fiddling with internals. Letting the interpreter do the grunt-work for you is *definitely* preferable to having recipes out there saying "swap in a new __dict__, then don't forget to clear the old module's __dict__", which will have massive versioning issues as soon as a new best-practice comes along; making it a function, like this, means its implementation can smoothly change between versions (even in a bug-fix release). Would it be better to make that function also switch out the entry in sys.modules? That way, it's 100% dedicated to this job of "I want to make a subclass of module and use that for myself", and could then be made atomic against other imports. I've no idea whether there's any other weird shenanigans that could be deployed with this kind of module switch, nor whether cutting them out would be a good or bad thing! ChrisA

Are these really all our options? All of them sound like hacks, none of them sound like anything the language (or even the CPython implementation) should sanction. Have I missed the discussion where the use cases and constraints were analyzed and all other approaches were rejected? (I might have some half-baked ideas, but I feel I should read up on the past discussion first, and they are probably more fit for python-ideas than for python-dev. Plus I'm just writing this email because I'm procrastinating on the type hinting PEP. :-) --Guido On Fri, Nov 28, 2014 at 7:45 PM, Chris Angelico <rosuav@gmail.com> wrote:
-- --Guido van Rossum (python.org/~guido)

On Sat, Nov 29, 2014 at 4:21 AM, Guido van Rossum <guido@python.org> wrote:
The previous discussions I was referring to are here: http://thread.gmane.org/gmane.comp.python.ideas/29487/focus=29555 http://thread.gmane.org/gmane.comp.python.ideas/29788 There might well be other options; these are just the best ones I could think of :-). The constraints are pretty tight, though: - The "new module" object (whatever it is) should have a __dict__ that aliases the original module globals(). I can elaborate on this if my original email wasn't enough, but hopefully it's obvious that making two copies of the same namespace and then trying to keep them in sync at the very least smells bad :-). - The "new module" object has to be a subtype of ModuleType, b/c there are lots of places that do isinstance(x, ModuleType) checks (notably -- but not only -- reload()). Since a major goal here is to make it possible to do cleaner deprecations, it would be really unfortunate if switching an existing package to use the metamodule support itself broke things :-). - Lookups in the normal case should have no additional performance overhead, because module lookups are extremely extremely common. (So this rules out dict proxies and tricks like that -- we really need 'new_module.__dict__ is globals()' to be true.) AFAICT there are three logically possible strategies for satisfying that first constraint: (a) convert the original module object into the type we want, in-place (b) create a new module object that acts like the original module object (c) somehow arrange for our special type to be used from the start My options 1 and 2 are means of accomplishing (a), and my options 3 and 4 are means of accomplishing (b) while working around the behavioural quirks of module objects (as required by the second constraint). The python-ideas thread did also consider several methods of implementing strategy (c), but they're messy enough that I left them out here. The problem is that somehow we have to execute code to create the new subtype *before* we have an entry in sys.modules for the package that contains the code for the subtype. So one option would be to add a new rule, that if a file pkgname/__new__.py exists, then this is executed first and is required to set up sys.modules["pkgname"] before we exec pkgname/__init__.py. So pkgname/__new__.py might look like: import sys from pkgname._metamodule import MyModuleSubtype sys.modules[__name__] = MyModuleSubtype(__name__, docstring) This runs into a lot of problems though. To start with, the 'from pkgname._metamodule ...' line is an infinite loop, b/c this is the code used to create sys.modules["pkgname"]. It's not clear where the globals dict for executing __new__.py comes from (who defines __name__? Currently that's done by ModuleType.__init__). It only works for packages, not modules. The need to provide the docstring here, before __init__.py is even read, is weird. It adds extra stat() calls to every package lookup. And, the biggest showstopper IMHO: AFAICT it's impossible to write a polyfill to support this code on old python versions, so it's useless to any package which needs to keep compatibility with 2.7 (or even 3.4). Sure, you can backport the whole import system like importlib2, but telling everyone that they need to replace every 'import numpy' with 'import importlib2; import numpy' is a total non-starter. So, yeah, those 4 options are really the only plausible ones I know of. Option 1 and option 3 are pretty nice at the language level! 
Most Python objects allow assignment to __class__ and __dict__, and both PyPy and Jython at least do support __class__ assignment. Really the only downside with Option 1 is that actually implementing it requires attention from someone with deep knowledge of typeobject.c. -n -- Nathaniel J. Smith Postdoctoral researcher - Informatics - University of Edinburgh http://vorpus.org

On 29/11/14 19:37, Nathaniel Smith wrote: [snip]
- The "new module" object has to be a subtype of ModuleType, b/c there are lots of places that do isinstance(x, ModuleType) checks (notably
It has to be a *subtype* is does not need to be a *subclass*
Cheers, Mark.

On Sun, Nov 30, 2014 at 11:07:57AM +1300, Greg Ewing wrote:
Perhaps I'm missing something, but won't that imply that every module which wants to use a "special" module type has to re-invent the wheel? If this feature is going to be used, I would expect to be able to re-use pre-written module types. E.g. having written "module with properties" (so to speak) once, I can just import it and use it in my next project. -- Steven

On Sun, Nov 30, 2014 at 12:05 AM, Steven D'Aprano <steve@pearwood.info> wrote:
I expect you'd package the special metamodule class in a stand-alone package, not directly in the ones that use it. You could import other packages freely, just the one that you're currently defining would be unavailable.

On Sat, Nov 29, 2014 at 8:37 PM, Nathaniel Smith <njs@pobox.com> wrote: [...]
As Greg Ewing said – you don't want to import from the package whose metamodule you're defining. You'd want to do as little work as possible in __new__.py. I'd use something like this: import types class __metamodule__(types.ModuleType): def __call__(self): return self.main() where Python would get the attribute __metamodule__ from __new__.py, and use `__metamodule__(name, doc)` as the thing to execute __main__ in.
Well, it could still be in __metamodule__.__init__().
It only works for packages, not modules.
I don't see a need for this treatment for modules in a package – if you want `from mypkg import callme`, you can make "callme" a function rather than a callable module. If you *also* want `from mypkg.callme import something_else`, I say you should split "callme" into two differently named things; names are cheap inside a package. If really needed, modules in a package can use an import hook defined in the package, or be converted to subpackages. Single-module projects would be left out, yes – but those can be simply converted to a package.
The need to provide the docstring here, before __init__.py is even read, is weird.
Does it have to be before __init__.py is read? Can't __init__.py be compiled beforehand, to get __doc__, and only *run* in the new namespace? (Or should __new__.py define import hooks that say how __init__.py should be loaded/compiled? I don't see a case for that.)
It adds extra stat() calls to every package lookup.
Fair.
I'm probably missing something obvious, but where would this not work? - As the first thing it does, __init__.py imports the polyfill and calls polyfill(__name__) - The polyfill, if running non-recursively* under old Python: -- compiles __init__.py -- imports __new__.py to get __metamodule__ -- instantiates metamodule with name, and docstring from compiled code -- * remembers the instance, to check for recursion later -- puts it in sys.modules -- execs __init__ in it - afterwards the original __init__.py execution continues, filling up a now-unused module's namespace

All the use cases seem to be about adding some kind of getattr hook to modules. They all seem to involve modifying the CPython C code anyway. So why not tackle that problem head-on and modify module_getattro() to look for a global named __getattr__ and if it exists, call that instead of raising AttributeError? On Sat, Nov 29, 2014 at 11:37 AM, Nathaniel Smith <njs@pobox.com> wrote:
-- --Guido van Rossum (python.org/~guido)

Are these really all our options? All of them sound like hacks, none of
On Sat, Nov 29, 2014, 21:55 Guido van Rossum <guido@python.org> wrote: All the use cases seem to be about adding some kind of getattr hook to modules. They all seem to involve modifying the CPython C code anyway. So why not tackle that problem head-on and modify module_getattro() to look for a global named __getattr__ and if it exists, call that instead of raising AttributeError? Not sure if anyone thought of it. :) Seems like a reasonable solution to me. Be curious to know what the benchmark suite said the impact was. -brett On Sat, Nov 29, 2014 at 11:37 AM, Nathaniel Smith <njs@pobox.com> wrote: On Sat, Nov 29, 2014 at 4:21 AM, Guido van Rossum <guido@python.org> wrote: them
The previous discussions I was referring to are here: http://thread.gmane.org/gmane.comp.python.ideas/29487/focus=29555 http://thread.gmane.org/gmane.comp.python.ideas/29788 There might well be other options; these are just the best ones I could think of :-). The constraints are pretty tight, though: - The "new module" object (whatever it is) should have a __dict__ that aliases the original module globals(). I can elaborate on this if my original email wasn't enough, but hopefully it's obvious that making two copies of the same namespace and then trying to keep them in sync at the very least smells bad :-). - The "new module" object has to be a subtype of ModuleType, b/c there are lots of places that do isinstance(x, ModuleType) checks (notably -- but not only -- reload()). Since a major goal here is to make it possible to do cleaner deprecations, it would be really unfortunate if switching an existing package to use the metamodule support itself broke things :-). - Lookups in the normal case should have no additional performance overhead, because module lookups are extremely extremely common. (So this rules out dict proxies and tricks like that -- we really need 'new_module.__dict__ is globals()' to be true.) AFAICT there are three logically possible strategies for satisfying that first constraint: (a) convert the original module object into the type we want, in-place (b) create a new module object that acts like the original module object (c) somehow arrange for our special type to be used from the start My options 1 and 2 are means of accomplishing (a), and my options 3 and 4 are means of accomplishing (b) while working around the behavioural quirks of module objects (as required by the second constraint). The python-ideas thread did also consider several methods of implementing strategy (c), but they're messy enough that I left them out here. The problem is that somehow we have to execute code to create the new subtype *before* we have an entry in sys.modules for the package that contains the code for the subtype. So one option would be to add a new rule, that if a file pkgname/__new__.py exists, then this is executed first and is required to set up sys.modules["pkgname"] before we exec pkgname/__init__.py. So pkgname/__new__.py might look like: import sys from pkgname._metamodule import MyModuleSubtype sys.modules[__name__] = MyModuleSubtype(__name__, docstring) This runs into a lot of problems though. To start with, the 'from pkgname._metamodule ...' line is an infinite loop, b/c this is the code used to create sys.modules["pkgname"]. It's not clear where the globals dict for executing __new__.py comes from (who defines __name__? Currently that's done by ModuleType.__init__). It only works for packages, not modules. The need to provide the docstring here, before __init__.py is even read, is weird. It adds extra stat() calls to every package lookup. And, the biggest showstopper IMHO: AFAICT it's impossible to write a polyfill to support this code on old python versions, so it's useless to any package which needs to keep compatibility with 2.7 (or even 3.4). Sure, you can backport the whole import system like importlib2, but telling everyone that they need to replace every 'import numpy' with 'import importlib2; import numpy' is a total non-starter. So, yeah, those 4 options are really the only plausible ones I know of. Option 1 and option 3 are pretty nice at the language level! 
Most Python objects allow assignment to __class__ and __dict__, and both PyPy and Jython at least do support __class__ assignment. Really the only downside with Option 1 is that actually implementing it requires attention from someone with deep knowledge of typeobject.c. -n -- Nathaniel J. Smith Postdoctoral researcher - Informatics - University of Edinburgh http://vorpus.org _______________________________________________ Python-Dev mailing list Python-Dev@python.org https://mail.python.org/mailman/listinfo/python-dev Unsubscribe: https://mail.python.org/mailman/options/python-dev/guido%40python.org -- --Guido van Rossum (python.org/~guido) _______________________________________________ Python-Dev mailing list Python-Dev@python.org https://mail.python.org/mailman/listinfo/python-dev Unsubscribe: https://mail.python.org/mailman/options/python-dev/brett%40python.org

On Sun, Nov 30, 2014 at 6:15 AM, Brett Cannon <brett@python.org> wrote:
Why would there be any impact? The __getattr__ hook would be similar to the one on classes -- it's only invoked at the point where otherwise AttributeError would be raised. -- --Guido van Rossum (python.org/~guido)
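(A small illustration of that lookup behaviour for classes -- a sketch of the semantics the proposed module hook would mirror, not code from the thread.)

    class Demo:
        x = 1
        def __getattr__(self, name):
            # only called after normal attribute lookup has failed
            return "fallback for %s" % name

    d = Demo()
    print(d.x)        # 1 -- found normally, __getattr__ never runs
    print(d.missing)  # "fallback for missing" -- would otherwise raise AttributeError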

On Sun Nov 30 2014 at 2:28:31 PM Ethan Furman <ethan@stoneleaf.us> wrote:
You don't; you just can't shoehorn everything back to 2.7. And just to make sure everyone participating in this discussion is up on the latest import stuff, Python 3.4 does have Loader.create_module() <https://docs.python.org/3/library/importlib.html#importlib.abc.Loader.create...> which lets you control what object is used for a module in the import machinery (this is prior to loading, though, so you can't specify it in a module but at the loader level only). This is how I was able to implement lazy loading for 3.5 <https://docs.python.org/3.5/library/importlib.html#importlib.util.LazyLoader> .
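(For readers following along, a sketch of how the 3.5 LazyLoader is typically wired up, based on the importlib documentation; the helper name and the module being imported are placeholders.)

    import importlib.util
    import sys

    def lazy_import(name):
        # Wrap the real loader so the module body only executes on first
        # attribute access.
        spec = importlib.util.find_spec(name)
        loader = importlib.util.LazyLoader(spec.loader)
        spec.loader = loader
        module = importlib.util.module_from_spec(spec)
        sys.modules[name] = module
        loader.exec_module(module)
        return module

    json = lazy_import("json")   # cheap: the module body has not run yet
    json.dumps                   # first attribute access triggers the real import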

On Sun, Nov 30, 2014 at 7:27 PM, Ethan Furman <ethan@stoneleaf.us> wrote:
I think that's doable -- assuming I'm remembering correctly the slightly weird class vs. instance lookup rules for special methods, you can write a module subclass like

    class GetAttrModule(types.ModuleType):
        def __getattr__(self, name):
            return self.__dict__["__getattr__"](name)

and then use ctypes hacks to get it into sys.modules[__name__]. -n -- Nathaniel J. Smith Postdoctoral researcher - Informatics - University of Edinburgh http://vorpus.org

On 11/30/2014 03:41 PM, Terry Reedy wrote:
My understanding of one of the use-cases was being able to issue warnings about deprecated attributes, which would be most effective if a backport could be written for current versions. -- ~Ethan~

On Sun, 30 Nov 2014 11:15:50 -0800 Guido van Rossum <guido@python.org> wrote:
builtins are typically found by first looking up in the current globals (module) scope, failing, and then falling back on __builtins__. Depending on how much overhead is added to the "failing" step, there /might/ be a performance difference. Of course, that would only occur wherever a __getattr__ hook is defined. Regards Antoine.

On Sun, Nov 30, 2014 at 1:12 PM, Antoine Pitrou <solipsis@pitrou.net> wrote:
The builtins lookup process never does a module attribute lookup -- it only does dict lookups. So it would not be affected by a module __getattr__ hook (unless we were to use dict proxies, which Nathaniel already rejected). @Nathaniel: perhaps you could get what you want without any C code changes using the approach of Brett's LazyLoader? -- --Guido van Rossum (python.org/~guido)
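(A quick way to see this -- an illustration, not part of the thread: the bytecode for a global/builtin reference goes through dict lookups, not module attribute access.)

    import dis

    def f():
        return len   # resolved via LOAD_GLOBAL: globals dict, then builtins dict

    dis.dis(f)   # no module attribute lookup appears in the output, so a module
                 # __getattr__ hook would not be consulted on this path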

On Sun, Nov 30, 2014 at 2:54 AM, Guido van Rossum <guido@python.org> wrote:
You need to allow overriding __dir__ as well for tab-completion, and some people wanted to use the properties API instead of raw __getattr__, etc. Maybe someone will want __getattribute__ semantics, I dunno. So since we're *so close* to being able to just use the subclassing machinery, it seemed cleaner to try and get that working instead of reimplementing bits of it piecewise. That said, __getattr__ + __dir__ would be enough for my immediate use cases. -n
-- Nathaniel J. Smith Postdoctoral researcher - Informatics - University of Edinburgh http://vorpus.org

On Sun, Nov 30, 2014 at 11:29 AM, Nathaniel Smith <njs@pobox.com> wrote:
Hm... I agree about __dir__ but the other things feel too speculative.
That would really be option 1, right? It's the one that looks cleanest from the user's POV (or at least from the POV of a developer who wants to build a framework using this feature -- for a simple one-off use case, __getattr__ sounds pretty attractive). I think that if we really want option 1, the issue of PyModule_Type not being a heap type can be dealt with.
That said, __getattr__ + __dir__ would be enough for my immediate use cases.
Perhaps it would be a good exercise to try and write the "lazy submodule import"(*) use case three ways: (a) using only CPython 3.4; (b) using __class__ assignment; (c) using customizable __getattr__ and __dir__. I think we can learn a lot about the alternatives from this exercise. I presume there's already a version of (a) floating around, but if it's been used in practice at all, it's probably too gnarly to serve as a useful comparison (though its essence may be extracted to serve as such).

FWIW I believe all proposals here have a big limitation: the module *itself* cannot benefit much from all these shenanigans, because references to globals from within the module's own code are just dictionary accesses, and we don't want to change that.

(*) I originally wrote "lazy import", but I realized that messing with the module class object probably isn't the best way to implement that -- it requires a proxy for the module that's managed by an import hook. But if you think it's possible, feel free to use this example, as "lazy import" seems a pretty useful thing to have in many situations. (At least that's how I would do it. And I would probably add some atrocious hack to patch up the importing module's globals once the module is actually loaded, to reduce the cost of using the proxy over the lifetime of the process.) -- --Guido van Rossum (python.org/~guido)

On Sun Nov 30 2014 at 3:55:39 PM Guido van Rossum <guido@python.org> wrote:
Start at https://hg.python.org/cpython/file/64bb01bce12c/Lib/importlib/util.py#l207 and read down the rest of the file. It really only requires changing __class__ to drop the proxy and that's done immediately after the lazy import. The approach also occurs *after* the finder so you don't get ImportError for at least missing a file.
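(Not importlib's actual code -- a minimal sketch of the proxy idea Brett describes, assuming Python 3.5+ where __class__ assignment on module instances is permitted; the class name is made up.)

    import types

    class _LazyProxyModule(types.ModuleType):
        # First attribute access would perform the real load (exec_module in
        # the real implementation), then drop the proxy by flipping __class__
        # back to the plain module type.
        def __getattribute__(self, attr):
            # ... the real loading step would happen here ...
            self.__class__ = types.ModuleType
            return types.ModuleType.__getattribute__(self, attr)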

On Sun, Nov 30, 2014 at 8:54 PM, Guido van Rossum <guido@python.org> wrote:
Options 1-4 all have the effect of making it fairly simple to slot an arbitrary user-defined module subclass into sys.modules. Option 1 is the cleanest API though :-).
(b) and (c) are very straightforward and trivial. Probably I could do a better job of faking dir()'s default behaviour on modules, but basically:

    ##### __class__ assignment #####

    import sys, types, importlib

    class MyModule(types.ModuleType):
        def __getattr__(self, name):
            if name in _lazy_submodules:
                # implicitly assigns submodule to self.__dict__[name]
                return importlib.import_module("." + name, self.__package__)
            raise AttributeError(name)

        def __dir__(self):
            entries = set(self.__dict__)
            entries.update(_lazy_submodules)
            return sorted(entries)

    sys.modules[__name__].__class__ = MyModule

    _lazy_submodules = {"foo", "bar"}

    ##### customizable __getattr__ and __dir__ #####

    import importlib

    def __getattr__(name):
        if name in _lazy_submodules:
            # implicitly assigns submodule to globals()[name]
            return importlib.import_module("." + name, __package__)
        raise AttributeError(name)

    def __dir__():
        entries = set(globals())
        entries.update(_lazy_submodules)
        return sorted(entries)

    _lazy_submodules = {"foo", "bar"}
I think that's fine -- IMHO the main use cases here are about controlling the public API. And a module that really wants to can always import itself if it wants to pull more shenanigans :-) (i.e., foo/__init__.py can do "import foo; foo.blahblah" instead of just "blahblah".) -n -- Nathaniel J. Smith Postdoctoral researcher - Informatics - University of Edinburgh http://vorpus.org
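(A hypothetical client-side view of the sketches above -- the package and submodule names are made up, just to show what users would see.)

    import mypkg            # fast: "foo" and "bar" are not imported yet
    mypkg.foo               # first access triggers the submodule import
    print(dir(mypkg))       # __dir__ lists "foo" and "bar" for tab-completion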

Guido van Rossum wrote:
If assignment to the __class__ of a module were permitted (by whatever means) then you could put this in a module:

    class __class__(types.ModuleType):
        ...

which makes it look almost like a deliberate language feature. :-) Seriously, of the options presented, I think that allowing __class__ assignment is the most elegant, since it solves a lot of problems in one go without introducing any new features -- just removing a restriction that prevents an existing language mechanism from working in this case. -- Greg

What if we had metaclass semantics on module creation? E.g., suppose the default were: __metaclass__ = ModuleType. What if Python supported __prepare__ for modules? Thanks, -- Ionel M. On Sat, Nov 29, 2014 at 11:36 PM, Greg Ewing <greg.ewing@canterbury.ac.nz> wrote:

On Sat, 29 Nov 2014 01:59:06 +0000 Nathaniel Smith <njs@pobox.com> wrote:
Option 1b: have __class__ assignment delegate to a tp_classassign slot on the old class, so that typeobject.c doesn't have to be cluttered with many special cases.
[...]
How do these two options interact with the fact that module functions store their globals dict, not the module itself? Regards Antoine.
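(A small demonstration of the point Antoine raises -- module-level functions keep a reference to the globals dict, not to the module object; illustrative only, not from the thread.)

    import types

    mod = types.ModuleType("m")
    exec("def f(): return SOME_GLOBAL", mod.__dict__)
    f = mod.__dict__["f"]
    assert f.__globals__ is mod.__dict__   # the dict, not the module object

    # So any metamodule scheme has to keep that same dict alive and authoritative,
    # which is exactly the "aliasing" constraint stated earlier.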

On 29 November 2014 at 21:32, Antoine Pitrou <solipsis@pitrou.net> wrote:
Aye, being able to hook class switching could be potentially useful (including the ability to just disallow it entirely if you really wanted to do that).
Right, that's the part I consider the most challenging with metamodules - the fact that there's a longstanding assumption that a module is just a "dictionary with some metadata", so the interpreter is inclined to treat them that way. Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia

On Sat, Nov 29, 2014 at 11:32 AM, Antoine Pitrou <solipsis@pitrou.net> wrote:
I'm intrigued -- how would this help? I have a vague impression that one could add another branch to object_set_class that went something like if at least one of the types is a subtype of the other type, and the subtype is a heap type with tp_dealloc == subtype_dealloc, and the subtype doesn't add any important slots, and ... then the __class__ assignment is legal. (This is taking advantage of the fact that if you don't have any extra slots added, then subtype_dealloc just basically defers to the base type's tp_dealloc, so it doesn't really matter which one you end up calling.) And my vague impression is that there isn't really anything special about the module type that would allow a tp_classassign function to simplify this logic. But these are just vague impressions :-)
I think that's totally fine? The whole point of all these proposals is to make sure that the final module object does in fact have the correct globals dict.

    ~$ git clone git@github.com:njsmith/metamodule.git
    ~$ cd metamodule
    ~/metamodule$ python3.4
If anything this is another argument for why we NEED something like this :-). -n -- Nathaniel J. Smith Postdoctoral researcher - Informatics - University of Edinburgh http://vorpus.org

On Sat, 29 Nov 2014 20:02:50 +0000 Nathaniel Smith <njs@pobox.com> wrote:
It would allow ModuleType to override tp_classassign to decide whether and how __class__ assignment on a module instance is allowed to work. So typeobject.c needn't know about any specifics of ModuleType or any other type.
Ok, I see. The code hacks up the new module to take ownership of the old module's __dict__. That doesn't look very clean to me. Regards Antoine.

On 29/11/14 01:59, Nathaniel Smith wrote:
Hi all,
[snip]
Why does MyModuleClass need to sub-class types.ModuleType? Modules have no special behaviour, apart from the inability to write to their __dict__ attribute, which is the very thing you don't want. If it quacks like a module... Cheers, Mark.

Hi, This discussion has been going on for a while, but no one has questioned the basic premise. Does this need any change to the language or interpreter? I believe it does not. I've modified your original metamodule.py to not use ctypes and support reloading: https://gist.github.com/markshannon/1868e7e6115d70ce6e76 Cheers, Mark. On 29/11/14 01:59, Nathaniel Smith wrote:

On Sun, Nov 30, 2014 at 10:14 PM, Mark Shannon <mark@hotpy.org> wrote:
Interesting approach! As written, your code will blow up on any python < 3.4, because when old_module gets deallocated it'll wipe the module dict clean. And I guess even on >= 3.4, this might still happen if old_module somehow manages to get itself into a reference loop before getting deallocated. (Hopefully not, but what a nightmare to debug if it did.) However, both of these issues can be fixed by stashing a reference to old_module somewhere in new_module.

The __class__ = ModuleType trick is super-clever but makes me irrationally uncomfortable. I know that this is documented as a valid method of fooling isinstance(), but I didn't know that until yesterday, and the idea of objects where type(foo) is not foo.__class__ strikes me as somewhat blasphemous. Maybe this is all fine though. The pseudo-module objects generated this way still won't pass PyModule_Check, so in theory this could produce behavioural differences. I can't name any specific places where this will break things, though. From a quick skim of the CPython source, a few observations:

- It means the PyModule_* API functions won't work (e.g. PyModule_GetDict); maybe these aren't used enough to matter.
- It looks like the __reduce__ methods on "method objects" (Objects/methodobject.c) have a special check for ->m_self being a module object, and won't pickle correctly if ->m_self ends up pointing to one of these pseudo-modules. I have no idea how one ends up with a method whose ->m_self points to a module object, though -- maybe it never actually happens.
- PyImport_Cleanup treats module objects differently from non-module objects during shutdown.

I guess it also has the mild limitation that it doesn't work with extension modules, but eh. Mostly I'd be nervous about the two points above. -n -- Nathaniel J. Smith Postdoctoral researcher - Informatics - University of Edinburgh http://vorpus.org
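(For concreteness, a sketch of the general isinstance()-fooling trick under discussion -- not Mark's actual gist; the class name is made up.)

    import types

    class FakeModule:
        # Reporting ModuleType as __class__ satisfies isinstance() checks,
        # but type() and the C-level PyModule_Check still see the real type.
        __class__ = types.ModuleType

    obj = FakeModule()
    assert isinstance(obj, types.ModuleType)
    assert type(obj) is not types.ModuleType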

On Mon, Dec 1, 2014 at 12:59 AM, Nathaniel Smith <njs@pobox.com> wrote:
Actually, there is one showstopper here -- the first version where reload() uses isinstance() is actually 3.4. Before that you need a real module subtype for reload to work. But this is in principle workaroundable by using subclassing + ctypes on old versions of python and the '__class__ = ModuleType' hack on new versions.
-- Nathaniel J. Smith Postdoctoral researcher - Informatics - University of Edinburgh http://vorpus.org

Nathaniel, did you look at Brett's LazyLoader? It overcomes the subclass issue by using a module loader that makes all modules instances of a (trivial) Module subclass. I'm sure this approach can be backported as far as you need to go. On Sun, Nov 30, 2014 at 5:02 PM, Nathaniel Smith <njs@pobox.com> wrote:
-- --Guido van Rossum (python.org/~guido)

On Mon, Dec 1, 2014 at 1:27 AM, Guido van Rossum <guido@python.org> wrote:
The problem is that by the time your package's code starts running, it's too late to install such a loader. Brett's strategy works well for lazy-loading submodules (e.g., making it so 'import numpy' makes 'numpy.testing' available, but without the speed hit of importing it immediately), but it doesn't help if you want to actually hook attribute access on your top-level package (e.g., making 'numpy.foo' trigger a DeprecationWarning -- we have a lot of stupid exported constants that we can never get rid of because our rules say that we have to deprecate things before removing them). Or maybe you're suggesting that we define a trivial heap-allocated subclass of PyModule_Type and use that everywhere, as a quick-and-dirty way to enable __class__ assignment? (E.g., return it from PyModule_New?) I considered this before but hesitated b/c it could potentially break backwards compatibility -- e.g. if code A creates a PyModule_Type object directly without going through PyModule_New, and then code B checks whether the resulting object is a module by doing isinstance(x, type(sys)), this will break. (type(sys) is a pretty common way to get a handle to ModuleType -- in fact both types.py and importlib use it.) So in my mind I sorta lumped it in with my Option 2, "minor compatibility break". OTOH maybe anyone who creates a module object without going through PyModule_New deserves whatever they get. -n
-- Nathaniel J. Smith Postdoctoral researcher - Informatics - University of Edinburgh http://vorpus.org
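(A sketch of the compatibility worry, with hypothetical names -- simulating what would happen if the import machinery returned a trivial subclass while other code still created plain module objects directly.)

    import types

    class _ModuleSubclass(types.ModuleType):   # stand-in for a subclass returned by PyModule_New
        pass

    HypotheticalModuleType = _ModuleSubclass   # what "type(sys)" would then give you
    plain = types.ModuleType("made_directly")  # a module created without PyModule_New

    print(isinstance(plain, HypotheticalModuleType))   # False -- the breakage described above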

On Sun, Nov 30, 2014 at 5:42 PM, Nathaniel Smith <njs@pobox.com> wrote:
Couldn't you install a package loader using some install-time hook? Anyway, I still think that the issues with heap types can be overcome. Hm, didn't you bring that up before here? Was the conclusion that it's impossible? -- --Guido van Rossum (python.org/~guido)

On Mon, Dec 1, 2014 at 4:06 AM, Guido van Rossum <guido@python.org> wrote:
I've brought it up several times but no-one's really discussed it :-). I finally attempted a deep dive into typeobject.c today myself. I'm not at all sure I understand the intricacies correctly here, but I *think* __class__ assignment could be relatively easily extended to handle non-heap types, and in fact the current restriction to heap types is actually buggy (IIUC).

object_set_class is responsible for checking whether it's okay to take an object of class "oldto" and convert it to an object of class "newto". Basically its goal is just to avoid crashing the interpreter (as would quickly happen if you e.g. allowed "[].__class__ = dict"). Currently the rules (spread across object_set_class and compatible_for_assignment) are:

(1) both oldto and newto have to be heap types,
(2) they have to have the same tp_dealloc,
(3) they have to have the same tp_free,
(4) if you walk up the ->tp_base chain for both types until you find the most-ancestral type that has a compatible struct layout (as checked by equiv_structs), then either
    (4a) these ancestral types have to be the same, OR
    (4b) these ancestral types have to have the same tp_base, AND they have to have added the same slots on top of that tp_base (e.g. if you have class A(object): pass and class B(object): pass then they'll both have added a __dict__ slot at the same point in the instance struct, so that's fine; this is checked in same_slots_added).

The only place the code assumes that it is dealing with heap types is in (4b) -- same_slots_added unconditionally casts the ancestral types to (PyHeapTypeObject*). AFAICT that's why step (1) is there, to protect this code. But I don't think the check actually works -- step (1) checks that the types we're trying to assign are heap types, but this is no guarantee that the *ancestral* types will be heap types. [Also, the code for __bases__ assignment appears to also call into this code with no heap type checks at all.] E.g., I think if you do

    class MyList(list):
        __slots__ = ()

    class MyDict(dict):
        __slots__ = ()

    MyList().__class__ = MyDict

then you'll end up in same_slots_added casting PyDict_Type and PyList_Type to PyHeapTypeObjects and then following invalid pointers into la-la land. (The __slots__ = () is to maintain layout compatibility with the base types; if you find builtin types that already have __dict__ and weaklist and HAVE_GC then this example should still work even with perfectly empty subclasses.)

Okay, so suppose we move the heap type check (step 1) down into same_slots_added (step 4b), since AFAICT this is actually more correct anyway. This is almost enough to enable __class__ assignment on modules, because the cases we care about will go through the (4a) branch rather than (4b), so the heap type thing is irrelevant. The remaining problem is the requirement that both types have the same tp_dealloc (step 2). ModuleType itself has tp_dealloc == module_dealloc, while all(?) heap types have tp_dealloc == subtype_dealloc. Here again, though, I'm not sure what purpose this check serves. subtype_dealloc basically cleans up extra slots, and then calls the base class tp_dealloc. So AFAICT it's totally fine if oldto->tp_dealloc == module_dealloc, and newto->tp_dealloc == subtype_dealloc, so long as newto is a subtype of oldto -- b/c this means newto->tp_dealloc will end up calling oldto->tp_dealloc anyway.
OTOH it's not actually a guarantee of anything useful to see that oldto->tp_dealloc == newto->tp_dealloc == subtype_dealloc, because subtype_dealloc does totally different things depending on the ancestry tree -- MyList and MyDict above pass the tp_dealloc check, even though list.tp_dealloc and dict.tp_dealloc are definitely *not* interchangeable. So I suspect that a more correct way to do this check would be something like

    PyTypeObject *old_real_deallocer = oldto, *new_real_deallocer = newto;
    while (old_real_deallocer->tp_dealloc == subtype_dealloc)
        old_real_deallocer = old_real_deallocer->tp_base;
    while (new_real_deallocer->tp_dealloc == subtype_dealloc)
        new_real_deallocer = new_real_deallocer->tp_base;
    if (old_real_deallocer->tp_dealloc != new_real_deallocer->tp_dealloc)
        /* error out */

Module subclasses would pass this check. Alternatively it might make more sense to add a check in equiv_structs that (child_type->tp_dealloc == subtype_dealloc || child_type->tp_dealloc == parent_type->tp_dealloc); I think that would accomplish the same thing in a somewhat cleaner way. Obviously this code is really subtle though, so don't trust any of the above without review from someone who knows typeobject.c better than me! (Antoine?) -n -- Nathaniel J. Smith Postdoctoral researcher - Informatics - University of Edinburgh http://vorpus.org
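(To see the layout-compatibility rules from the Python side -- an illustrative sketch unrelated to the module case: same-layout classes can swap, different layouts are refused.)

    class A:
        pass

    class B:
        pass

    a = A()
    a.__class__ = B            # allowed: both add only __dict__/__weakref__ on top of object
    assert type(a) is B

    class C:
        __slots__ = ("x",)     # different instance layout (no __dict__, one slot)

    try:
        a.__class__ = C
    except TypeError as e:
        print(e)               # layout differs, so the assignment is refused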

On Mon, Dec 1, 2014 at 1:38 PM, Nathaniel Smith <njs@pobox.com> wrote:
That's because nobody dares to touch it. (Myself included -- I increased the size of typeobject.c from ~50 to ~5000 lines in a single intense editing session more than a decade ago, and since then it's been basically unmaintainable. :-(
Have you filed this as a bug? I believe nobody has discovered this problem before. I've confirmed it as far back as 2.5 (I don't have anything older installed).
Yeah, I can't see a way that type_new() can create a type whose tp_dealloc isn't subtype_dealloc.
I guess the simple check is an upper bound (or whatever that's called -- my math-speak is rusty ;-) for the necessary-and-sufficient check that you're describing.
I'm not set up to disagree with you on this any more...
Or Benjamin? -- --Guido van Rossum (python.org/~guido)

On Mon, 1 Dec 2014 21:38:45 +0000 Nathaniel Smith <njs@pobox.com> wrote:
I'm not sure. Many operations are standardized on heap types that can have arbitrary definitions on static types (I'm talking about the tp_ methods). You'd have to review them to double check. For example, a heap type's tp_new increments the type's refcount, so you have to adjust the instance refcount if you cast it from a non-heap type to a heap type, and vice-versa (see slot_tp_new()). (this raises the interesting question "what happens if you assign to __class__ from a __del__ method?")
Sounds good.
There's no "child" and "parent" types in equiv_structs(). Regards Antoine.

On Tue, Dec 2, 2014 at 9:19 AM, Antoine Pitrou <solipsis@pitrou.net> wrote:
Reading through the list of tp_ methods I can't see any other that look problematic. The finalizers are kinda intimate, but I think people would expect that if you swap an instance's type to something that has a different __del__ method then it's the new __del__ method that'll be called. If we wanted to be really careful we should perhaps do something cleverer with tp_is_gc, but so long as type objects are the only objects that have a non-trivial tp_is_gc, and the tp_is_gc call depends only on their tp_flags (which are unmodified by __class__ assignment), then we should still be safe (and anyway this is orthogonal to the current issues).
Right, fortunately this is easy :-).
(this raises the interesting question "what happens if you assign to __class__ from a __del__ method?")
subtype_dealloc actually attempts to take this possibility into account -- see the comment "Extract the type again; tp_del may have changed it". I'm not at all sure that its handling is *correct* -- there's a bunch of code that references 'type' between the call to tp_del and this comment, and there's code after the comment that references 'base' without recalculating it. But it is there :-)
Not as currently written, but every single call site is of the form equiv_structs(x, x->tp_base). And equiv_structs takes advantage of this -- e.g., checking that two types have the same tp_basicsize is pretty uninformative if they're unrelated types, but if they're parent and child then it tells you that they have exactly the same slots. I wrote a patch incorporating the above ideas: http://bugs.python.org/issue22986 -- Nathaniel J. Smith Postdoctoral researcher - Informatics - University of Edinburgh http://vorpus.org