advice needed: best approach to enabling "metamodules"?

Hi all,

There was some discussion on python-ideas last month about how to make it easier/more reliable for a module to override attribute access. This is useful for things like autoloading submodules (accessing 'foo.bar' triggers the import of 'bar'), or for deprecating module attributes that aren't functions (accessing 'foo.bar' emits a DeprecationWarning, "the bar attribute will be removed soon").

Python has had some basic support for this for a long time -- if a module overwrites its entry in sys.modules[__name__], then the object that's placed there will be returned by 'import'. This allows one to define custom subclasses of module and use them instead of the default, similar to how metaclasses allow one to use custom subclasses of 'type'.

In practice, though, it's very difficult to make this work safely and correctly for a top-level package. The main problem is that when you create a new object to stick into sys.modules, this necessarily means creating a new namespace dict. And now you have a mess, because now you have two dicts: new_module.__dict__, which is the namespace you export, and old_module.__dict__, which is the globals() for the code that's trying to define the module namespace. Keeping these in sync is extremely error-prone -- consider what happens, e.g., when your package __init__.py wants to import submodules which then recursively import the top-level package -- so it's difficult to justify for the kind of large packages that might be worried about deprecating entries in their top-level namespace.

So what we'd really like is a way to somehow end up with an object that (a) has the same __dict__ as the original module, but (b) is of our own custom module subclass. If we can do this then metamodules will become safe and easy to write correctly. (There's a little demo of working metamodules here: https://github.com/njsmith/metamodule/ but it uses ctypes hacks that depend on non-stable parts of the CPython ABI, so it's not a long-term solution.)

I've now spent some time trying to hack this capability into CPython and I've made a list of the possible options I can think of to fix this. I'm writing to python-dev because none of them are obviously The Right Way, so I'd like to get some opinions/ruling/whatever on which approach to follow up on.

Option 1: Make it possible to change the type of a module object in-place, so that we can write something like

    sys.modules[__name__].__class__ = MyModuleSubclass

Option 1 downside: The invariants required to make __class__ assignment safe are complicated, and only implemented for heap-allocated type objects. PyModule_Type is not heap-allocated, so making this work would require lots of delicate surgery to typeobject.c. I'd rather not go down that rabbit-hole.

----

Option 2: Make PyModule_Type into a heap type allocated at interpreter startup, so that the above just works.

Option 2 downside: PyModule_Type is exposed as a statically-allocated global symbol, so doing this would involve breaking the stable ABI.

----

Option 3: Make it legal to assign to the __dict__ attribute of a module object, so that we can write something like

    new_module = MyModuleSubclass(...)
    new_module.__dict__ = sys.modules[__name__].__dict__
    sys.modules[__name__].__dict__ = {}   # ***
    sys.modules[__name__] = new_module

The line marked *** is necessary because of the way modules are designed: they expect to control the lifecycle of their __dict__. When the module object is initialized, it fills in a bunch of stuff in the __dict__. When the module object (not the dict object!) is deallocated, it deletes everything from the __dict__. This latter feature in particular means that having two module objects sharing the same __dict__ is bad news.

Option 3 downside: The paragraph above. Also, there's stuff inside the module struct besides just the __dict__, and more stuff has appeared there over time.

----

Option 4: Add a new function sys.swap_module_internals, which takes two module objects and swaps their __dict__ and other attributes. By making the operation a swap instead of an assignment, we avoid the lifecycle pitfalls from Option 3. By making it a builtin, we can make sure it always handles all the module fields that matter, not just __dict__. Usage:

    new_module = MyModuleSubclass(...)
    sys.swap_module_internals(new_module, sys.modules[__name__])
    sys.modules[__name__] = new_module

Option 4 downside: Obviously a hack.

----

Options 3 and 4 both seem workable; it just depends on which way we prefer to hold our nose. Option 4 is slightly more correct in that it works for *all* modules, but OTOH at the moment the only time Option 3 *really* fails is for compiled modules with PEP 3121 metadata, and compiled modules can already use a module subclass via other means (since they instantiate their own module objects).

Thoughts? Suggestions on other options I've missed? Should I go ahead and write a patch for one of these?

-n -- Nathaniel J. Smith Postdoctoral researcher - Informatics - University of Edinburgh http://vorpus.org
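For concreteness, the deprecation use case as a ModuleType subclass might look roughly like this (a minimal sketch with made-up names; it assumes one of the options above is available to actually install the instance in sys.modules with the original __dict__):

    import types
    import warnings

    class DeprecatingModule(types.ModuleType):
        # Deprecated attribute names mapped to their legacy values (illustrative).
        _deprecated = {"bar": 42}

        def __getattr__(self, name):
            # Only called when normal lookup in the module __dict__ fails,
            # so attributes that still exist in the namespace pay nothing extra.
            if name in self._deprecated:
                warnings.warn("the %s attribute will be removed soon" % name,
                              DeprecationWarning, stacklevel=2)
                return self._deprecated[name]
            raise AttributeError(name)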

On Sat, Nov 29, 2014 at 12:59 PM, Nathaniel Smith <njs@pobox.com> wrote:
This one corresponds to what I've seen in quite a number of C APIs. It's not ideal, but nothing is; and at least this way, it's clear that you're fiddling with internals. Letting the interpreter do the grunt-work for you is *definitely* preferable to having recipes out there saying "swap in a new __dict__, then don't forget to clear the old module's __dict__", which will have massive versioning issues as soon as a new best-practice comes along; making it a function, like this, means its implementation can smoothly change between versions (even in a bug-fix release). Would it be better to make that function also switch out the entry in sys.modules? That way, it's 100% dedicated to this job of "I want to make a subclass of module and use that for myself", and could then be made atomic against other imports. I've no idea whether there's any other weird shenanigans that could be deployed with this kind of module switch, nor whether cutting them out would be a good or bad thing! ChrisA

Are these really all our options? All of them sound like hacks, none of them sound like anything the language (or even the CPython implementation) should sanction. Have I missed the discussion where the use cases and constraints were analyzed and all other approaches were rejected? (I might have some half-baked ideas, but I feel I should read up on the past discussion first, and they are probably more fit for python-ideas than for python-dev. Plus I'm just writing this email because I'm procrastinating on the type hinting PEP. :-) --Guido On Fri, Nov 28, 2014 at 7:45 PM, Chris Angelico <rosuav@gmail.com> wrote:
-- --Guido van Rossum (python.org/~guido)

On Sat, Nov 29, 2014 at 4:21 AM, Guido van Rossum <guido@python.org> wrote:
The previous discussions I was referring to are here:
http://thread.gmane.org/gmane.comp.python.ideas/29487/focus=29555
http://thread.gmane.org/gmane.comp.python.ideas/29788

There might well be other options; these are just the best ones I could think of :-). The constraints are pretty tight, though:

- The "new module" object (whatever it is) should have a __dict__ that aliases the original module globals(). I can elaborate on this if my original email wasn't enough, but hopefully it's obvious that making two copies of the same namespace and then trying to keep them in sync at the very least smells bad :-).

- The "new module" object has to be a subtype of ModuleType, b/c there are lots of places that do isinstance(x, ModuleType) checks (notably -- but not only -- reload()). Since a major goal here is to make it possible to do cleaner deprecations, it would be really unfortunate if switching an existing package to use the metamodule support itself broke things :-).

- Lookups in the normal case should have no additional performance overhead, because module lookups are extremely common. (So this rules out dict proxies and tricks like that -- we really need 'new_module.__dict__ is globals()' to be true.)

AFAICT there are three logically possible strategies for satisfying that first constraint:
(a) convert the original module object into the type we want, in-place
(b) create a new module object that acts like the original module object
(c) somehow arrange for our special type to be used from the start

My options 1 and 2 are means of accomplishing (a), and my options 3 and 4 are means of accomplishing (b) while working around the behavioural quirks of module objects (as required by the second constraint).

The python-ideas thread did also consider several methods of implementing strategy (c), but they're messy enough that I left them out here. The problem is that somehow we have to execute code to create the new subtype *before* we have an entry in sys.modules for the package that contains the code for the subtype. So one option would be to add a new rule: if a file pkgname/__new__.py exists, then it is executed first and is required to set up sys.modules["pkgname"] before we exec pkgname/__init__.py. So pkgname/__new__.py might look like:

    import sys
    from pkgname._metamodule import MyModuleSubtype
    sys.modules[__name__] = MyModuleSubtype(__name__, docstring)

This runs into a lot of problems though. To start with, the 'from pkgname._metamodule ...' line is an infinite loop, b/c this is the code used to create sys.modules["pkgname"]. It's not clear where the globals dict for executing __new__.py comes from (who defines __name__? Currently that's done by ModuleType.__init__). It only works for packages, not modules. The need to provide the docstring here, before __init__.py is even read, is weird. It adds extra stat() calls to every package lookup. And, the biggest showstopper IMHO: AFAICT it's impossible to write a polyfill to support this code on old python versions, so it's useless to any package which needs to keep compatibility with 2.7 (or even 3.4). Sure, you can backport the whole import system like importlib2, but telling everyone that they need to replace every 'import numpy' with 'import importlib2; import numpy' is a total non-starter.

So, yeah, those 4 options are really the only plausible ones I know of. Options 1 and 3 are pretty nice at the language level! Most Python objects allow assignment to __class__ and __dict__, and both PyPy and Jython at least do support __class__ assignment. Really the only downside with Option 1 is that actually implementing it requires attention from someone with deep knowledge of typeobject.c.

-n -- Nathaniel J. Smith Postdoctoral researcher - Informatics - University of Edinburgh http://vorpus.org
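Stated as code, the first two constraints amount to checks like the following, which a package could run from its own __init__.py once its metamodule is installed (an illustration only, not part of any proposal):

    import sys
    import types

    mod = sys.modules[__name__]
    assert isinstance(mod, types.ModuleType)   # subtype constraint: reload() etc. keep working
    assert mod.__dict__ is globals()           # aliasing constraint: one namespace, no proxy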

On 29/11/14 19:37, Nathaniel Smith wrote: [snip]
- The "new module" object has to be a subtype of ModuleType, b/c there are lots of places that do isinstance(x, ModuleType) checks (notably
It has to be a *subtype*; it does not need to be a *subclass*.
Cheers, Mark.

On Sun, Nov 30, 2014 at 11:07:57AM +1300, Greg Ewing wrote:
Perhaps I'm missing something, but won't that imply that every module which wants to use a "special" module type has to re-invent the wheel? If this feature is going to be used, I would expect to be able to re-use pre-written module types. E.g. having written "module with properties" (so to speak) once, I can just import it and use it in my next project. -- Steven

On Sun, Nov 30, 2014 at 12:05 AM, Steven D'Aprano <steve@pearwood.info> wrote:
I expect you'd package the special metamodule class in a stand-alone package, not directly in the ones that use it. You could import other packages freely, just the one that you're currently defining would be unavailable.

On Sat, Nov 29, 2014 at 8:37 PM, Nathaniel Smith <njs@pobox.com> wrote: [...]
As Greg Ewing said – you don't want to import from the package whose metamodule you're defining. You'd want to do as little work as possible in __new__.py. I'd use something like this:

    import types

    class __metamodule__(types.ModuleType):
        def __call__(self):
            return self.main()

where Python would get the attribute __metamodule__ from __new__.py, and use `__metamodule__(name, doc)` as the thing to execute __init__ in.
Well, it could still be in __metamodule__.__init__().
It only works for packages, not modules.
I don't see a need for this treatment for modules in a package – if you want `from mypkg import callme`, you can make "callme" a function rather than a callable module. If you *also* want `from mypkg.callme import something_else`, I say you should split "callme" into two differently named things; names are cheap inside a package. If really needed, modules in a package can use an import hook defined in the package, or be converted to subpackages. Single-module projects would be left out, yes – but those can be simply converted to a package.
The need to provide the docstring here, before __init__.py is even read, is weird.
Does it have to be before __init__.py is read? Can't __init__.py be compiled beforehand, to get __doc__, and only *run* in the new namespace? (Or should __new__.py define import hooks that say how __init__.py should be loaded/compiled? I don't see a case for that.)
It adds extra stat() calls to every package lookup.
Fair.
I'm probably missing something obvious, but where would this not work?
- As the first thing it does, __init__.py imports the polyfill and calls polyfill(__name__)
- The polyfill, if running non-recursively* under old Python:
  -- compiles __init__.py
  -- imports __new__.py to get __metamodule__
  -- instantiates the metamodule with the name, and the docstring from the compiled code
  -- * remembers the instance, to check for recursion later
  -- puts it in sys.modules
  -- execs __init__ in it
- afterwards the original __init__.py execution continues, filling up a now-unused module's namespace
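A rough sketch of what that polyfill might look like on older Pythons, following the steps above (a hypothetical, untested helper; it assumes pkgname/__new__.py defines a __metamodule__ class as described earlier in the thread, and that the docstring heuristic below is good enough):

    import os
    import sys
    import importlib

    _instances = {}  # remembers metamodule instances, to detect the recursive second run

    def polyfill(name):
        """Called as the first statement of pkgname/__init__.py: polyfill(__name__)."""
        if name in _instances:
            # We are already executing __init__.py inside the metamodule; do nothing.
            return
        old_module = sys.modules[name]
        init_path = os.path.join(os.path.dirname(old_module.__file__), "__init__.py")
        with open(init_path) as f:
            code = compile(f.read(), init_path, "exec")
        # crude docstring heuristic: a module docstring, if any, is the first constant
        doc = code.co_consts[0] if code.co_consts and isinstance(code.co_consts[0], str) else None
        # __new__.py lives inside the package and supplies the __metamodule__ class
        metaclass = importlib.import_module(name + ".__new__").__metamodule__
        metamodule = metaclass(name, doc)
        metamodule.__file__ = init_path
        _instances[name] = metamodule
        sys.modules[name] = metamodule
        exec(code, metamodule.__dict__)   # re-run __init__.py in the new namespace
        # The original __init__.py execution then continues in old_module's (now unused) dict.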

All the use cases seem to be about adding some kind of getattr hook to modules. They all seem to involve modifying the CPython C code anyway. So why not tackle that problem head-on and modify module_getattro() to look for a global named __getattr__ and if it exists, call that instead of raising AttributeError? On Sat, Nov 29, 2014 at 11:37 AM, Nathaniel Smith <njs@pobox.com> wrote:
-- --Guido van Rossum (python.org/~guido)
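Under the hook Guido describes, a package's __init__.py could handle the deprecation case with nothing but a module-level function. This is illustrative only, since it relies on the proposed (not yet existing) module_getattro() change; the names are made up:

    # foo/__init__.py, assuming the proposed module-level __getattr__ hook existed
    import warnings

    _removed = {"OLD_FLAG": 1}

    def __getattr__(name):
        # would be called by module_getattro() only after normal lookup fails
        if name in _removed:
            warnings.warn("foo.%s is deprecated" % name,
                          DeprecationWarning, stacklevel=2)
            return _removed[name]
        raise AttributeError("module %r has no attribute %r" % (__name__, name))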

On Sat, Nov 29, 2014, 21:55 Guido van Rossum <guido@python.org> wrote: All the use cases seem to be about adding some kind of getattr hook to modules. They all seem to involve modifying the CPython C code anyway. So why not tackle that problem head-on and modify module_getattro() to look for a global named __getattr__ and if it exists, call that instead of raising AttributeError?

Not sure if anyone thought of it. :) Seems like a reasonable solution to me. Be curious to know what the benchmark suite said the impact was. -brett

On Sun, Nov 30, 2014 at 6:15 AM, Brett Cannon <brett@python.org> wrote:
Why would there be any impact? The __getattr__ hook would be similar to the one on classes -- it's only invoked at the point where otherwise AttributeError would be raised. -- --Guido van Rossum (python.org/~guido)
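That class-level behaviour, for reference: __getattr__ is a fallback that only fires after ordinary lookup has failed, so successful lookups cost nothing extra (a small self-contained illustration):

    class Demo:
        x = 1
        def __getattr__(self, name):
            # reached only when normal attribute lookup has already failed
            return "fallback for %s" % name

    d = Demo()
    print(d.x)        # 1 -- found normally, __getattr__ never runs
    print(d.missing)  # "fallback for missing"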

On Sun Nov 30 2014 at 2:28:31 PM Ethan Furman <ethan@stoneleaf.us> wrote:
You don't; you just can't shoehorn everything back to 2.7. And just to make sure everyone participating in this discussion is up on the latest import stuff, Python 3.4 does have Loader.create_module() <https://docs.python.org/3/library/importlib.html#importlib.abc.Loader.create...> which lets you control what object is used for a module in the import machinery (this is prior to loading, though, so you can't specify it in the module itself, only at the loader level). This is how I was able to implement lazy loading for 3.5 <https://docs.python.org/3.5/library/importlib.html#importlib.util.LazyLoader> .
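For readers who haven't used it, the 3.5 LazyLoader is applied roughly like this (a sketch adapted from the importlib documentation; the real import is deferred until the first attribute access):

    import importlib.util
    import sys

    def lazy_import(name):
        spec = importlib.util.find_spec(name)
        loader = importlib.util.LazyLoader(spec.loader)
        spec.loader = loader
        module = importlib.util.module_from_spec(spec)
        sys.modules[name] = module
        loader.exec_module(module)   # sets up the lazy proxy; module body not run yet
        return module

    json = lazy_import("json")       # nothing actually imported yet
    json.dumps                       # first attribute access triggers the real load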

On Sun, Nov 30, 2014 at 7:27 PM, Ethan Furman <ethan@stoneleaf.us> wrote:
I think that's doable -- assuming I'm remembering correctly the slightly weird class vs. instance lookup rules for special methods, you can write a module subclass like

    class GetAttrModule(types.ModuleType):
        def __getattr__(self, name):
            return self.__dict__["__getattr__"](name)

and then use ctypes hacks to get it into sys.modules[__name__]. -n -- Nathaniel J. Smith Postdoctoral researcher - Informatics - University of Edinburgh http://vorpus.org

On 11/30/2014 03:41 PM, Terry Reedy wrote:
My understanding of one of the use-cases was being able to issue warnings about deprecated attributes, which would be most effective if a backport could be written for current versions. -- ~Ethan~

On Sun, 30 Nov 2014 11:15:50 -0800 Guido van Rossum <guido@python.org> wrote:
builtins are typically found by first looking up in the current globals (module) scope, failing, and then falling back on __builtins__. Depending on how much overhead is added to the "failing" step, there /might/ be a performance difference. Of course, that would only occur wherever a __getattr__ hook is defined. Regards Antoine.

On Sun, Nov 30, 2014 at 1:12 PM, Antoine Pitrou <solipsis@pitrou.net> wrote:
The builtins lookup process never does a module attribute lookup -- it only does dict lookups. So it would not be affected by a module __getattr__ hook (unless we were to use dict proxies, which Nathaniel already rejected). @Nathaniel: perhaps you could get what you want without any C code changes using the approach of Brett's LazyLoader? -- --Guido van Rossum (python.org/~guido)

On Sun, Nov 30, 2014 at 2:54 AM, Guido van Rossum <guido@python.org> wrote:
You need to allow overriding __dir__ as well for tab-completion, and some people wanted to use the properties API instead of raw __getattr__, etc. Maybe someone will want __getattribute__ semantics, I dunno. So since we're *so close* to being able to just use the subclassing machinery, it seemed cleaner to try and get that working instead of reimplementing bits of it piecewise. That said, __getattr__ + __dir__ would be enough for my immediate use cases. -n
-- Nathaniel J. Smith Postdoctoral researcher - Informatics - University of Edinburgh http://vorpus.org

On Sun, Nov 30, 2014 at 11:29 AM, Nathaniel Smith <njs@pobox.com> wrote:
Hm... I agree about __dir__ but the other things feel too speculative.
That would really be option 1, right? It's the one that looks cleanest from the user's POV (or at least from the POV of a developer who wants to build a framework using this feature -- for a simple one-off use case, __getattr__ sounds pretty attractive). I think that if we really want option 1, the issue of PyModuleType not being a heap type can be dealt with.
That said, __getattr__ + __dir__ would be enough for my immediate use cases.
Perhaps it would be a good exercise to try and write the "lazy submodule import"(*) use case three ways: (a) using only CPython 3.4; (b) using __class__ assignment; (c) using customizable __getattr__ and __dir__. I think we can learn a lot about the alternatives from this exercise. I presume there's already a version of (a) floating around, but if it's been used in practice at all, it's probably too gnarly to serve as a useful comparison (though its essence may be extracted to serve as such).

FWIW I believe all proposals here have a big limitation: the module *itself* cannot benefit much from all these shenanigans, because references to globals from within the module's own code are just dictionary accesses, and we don't want to change that.

(*) I originally wrote "lazy import", but I realized that messing with the module class object probably isn't the best way to implement that -- it requires a proxy for the module that's managed by an import hook. But if you think it's possible, feel free to use this example, as "lazy import" seems a pretty useful thing to have in many situations. (At least that's how I would do it. And I would probably add some atrocious hack to patch up the importing module's globals once the module is actually loaded, to reduce the cost of using the proxy over the lifetime of the process.)

-- --Guido van Rossum (python.org/~guido)

On Sun Nov 30 2014 at 3:55:39 PM Guido van Rossum <guido@python.org> wrote:
Start at https://hg.python.org/cpython/file/64bb01bce12c/Lib/importlib/util.py#l207 and read down the rest of the file. It really only requires changing __class__ to drop the proxy and that's done immediately after the lazy import. The approach also occurs *after* the finder so you don't get ImportError for at least missing a file.

On Sun, Nov 30, 2014 at 8:54 PM, Guido van Rossum <guido@python.org> wrote:
Options 1-4 all have the effect of making it fairly simple to slot an arbitrary user-defined module subclass into sys.modules. Option 1 is the cleanest API though :-).
(b) and (c) are very straightforward and trivial. Probably I could do a better job of faking dir()'s default behaviour on modules, but basically:

    ##### (b) __class__ assignment #####
    import sys, types, importlib

    class MyModule(types.ModuleType):
        def __getattr__(self, name):
            if name in _lazy_submodules:
                # importing the submodule implicitly assigns it to self.__dict__[name]
                return importlib.import_module("." + name, package=self.__package__)
            raise AttributeError(name)
        def __dir__(self):
            entries = set(self.__dict__)
            entries.update(_lazy_submodules)
            return sorted(entries)

    sys.modules[__name__].__class__ = MyModule
    _lazy_submodules = {"foo", "bar"}

    ##### (c) customizable __getattr__ and __dir__ #####
    import importlib

    def __getattr__(name):
        if name in _lazy_submodules:
            # importing the submodule implicitly assigns it to globals()[name]
            return importlib.import_module("." + name, package=__package__)
        raise AttributeError(name)

    def __dir__():
        entries = set(globals())
        entries.update(_lazy_submodules)
        return sorted(entries)

    _lazy_submodules = {"foo", "bar"}
I think that's fine -- IMHO the main use cases here are about controlling the public API. And a module that really wants to can always import itself if it wants to pull more shenanigans :-) (i.e., foo/__init__.py can do "import foo; foo.blahblah" instead of just "blahblah".) -n -- Nathaniel J. Smith Postdoctoral researcher - Informatics - University of Edinburgh http://vorpus.org

Guido van Rossum wrote:
If assignment to the __class__ of a module were permitted (by whatever means) then you could put this in a module:

    class __class__(types.ModuleType):
        ...

which makes it look almost like a deliberate language feature. :-) Seriously, of the options presented, I think that allowing __class__ assignment is the most elegant, since it solves a lot of problems in one go without introducing any new features -- just removing a restriction that prevents an existing language mechanism from working in this case. -- Greg

What if we had metaclass semantics on module creation? E.g., suppose the default were __metaclass__ = ModuleType. What if Python supported __prepare__ for modules? Thanks, -- Ionel M. On Sat, Nov 29, 2014 at 11:36 PM, Greg Ewing <greg.ewing@canterbury.ac.nz> wrote:

On Sat, 29 Nov 2014 01:59:06 +0000 Nathaniel Smith <njs@pobox.com> wrote:
Option 1b: have __class__ assignment delegate to a tp_classassign slot on the old class, so that typeobject.c doesn't have to be cluttered with many special cases.
[...]
How do these two options interact with the fact that module functions store their globals dict, not the module itself? Regards Antoine.

On 29 November 2014 at 21:32, Antoine Pitrou <solipsis@pitrou.net> wrote:
Aye, being able to hook class switching could be potentially useful (including the ability to just disallow it entirely if you really wanted to do that).
Right, that's the part I consider the most challenging with metamodules - the fact that there's a longstanding assumption that a module is just a "dictionary with some metadata", so the interpreter is inclined to treat them that way. Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia

On Sat, Nov 29, 2014 at 11:32 AM, Antoine Pitrou <solipsis@pitrou.net> wrote:
I'm intrigued -- how would this help? I have a vague impression that one could add another branch to object_set_class that went something like if at least one of the types is a subtype of the other type, and the subtype is a heap type with tp_dealloc == subtype_dealloc, and the subtype doesn't add any important slots, and ... then the __class__ assignment is legal. (This is taking advantage of the fact that if you don't have any extra slots added, then subtype_dealloc just basically defers to the base type's tp_dealloc, so it doesn't really matter which one you end up calling.) And my vague impression is that there isn't really anything special about the module type that would allow a tp_classassign function to simplify this logic. But these are just vague impressions :-)
I think that's totally fine? The whole point of all these proposals is to make sure that the final module object does in fact have the correct globals dict.

    ~$ git clone git@github.com:njsmith/metamodule.git
    ~$ cd metamodule
    ~/metamodule$ python3.4
If anything this is another argument for why we NEED something like this :-). -n -- Nathaniel J. Smith Postdoctoral researcher - Informatics - University of Edinburgh http://vorpus.org

On Sat, 29 Nov 2014 20:02:50 +0000 Nathaniel Smith <njs@pobox.com> wrote:
It would allow ModuleType to override tp_classassign to decide whether and how __class__ assignment on a module instance is allowed to work. So typeobject.c needn't know about any specifics of ModuleType or any other type.
Ok, I see. The code hacks up the new module to take ownership of the old module's __dict__. That doesn't look very clean to me. Regards Antoine.

On 29/11/14 01:59, Nathaniel Smith wrote:
Hi all,
[snip]
Why does MyModuleClass need to sub-class types.ModuleType? Modules have no special behaviour, apart from the inability to write to their __dict__ attribute, which is the very thing you don't want. If it quacks like a module... Cheers, Mark.

Hi, This discussion has been going on for a while, but no one has questioned the basic premise. Does this need any change to the language or interpreter? I believe it does not. I've modified your original metamodule.py to not use ctypes and to support reloading: https://gist.github.com/markshannon/1868e7e6115d70ce6e76 Cheers, Mark. On 29/11/14 01:59, Nathaniel Smith wrote:

On Sun, Nov 30, 2014 at 10:14 PM, Mark Shannon <mark@hotpy.org> wrote:
Interesting approach! As written, your code will blow up on any python < 3.4, because when old_module gets deallocated it'll wipe the module dict clean. And I guess even on >=3.4, this might still happen if old_module somehow manages to get itself into a reference loop before getting deallocated. (Hopefully not, but what a nightmare to debug if it did.) However, both of these issues can be fixed by stashing a reference to old_module somewhere in new_module.

The __class__ = ModuleType trick is super-clever but makes me irrationally uncomfortable. I know that this is documented as a valid method of fooling isinstance(), but I didn't know that until yesterday, and the idea of objects where type(foo) is not foo.__class__ strikes me as somewhat blasphemous. Maybe this is all fine though.

The pseudo-module objects generated this way still won't pass PyModule_Check, so in theory this could produce behavioural differences. I can't name any specific places where this will break things, though. From a quick skim of the CPython source, a few observations:
- It means the PyModule_* API functions won't work (e.g. PyModule_GetDict); maybe these aren't used enough to matter.
- It looks like the __reduce__ methods on "method objects" (Objects/methodobject.c) have a special check for ->m_self being a module object, and won't pickle correctly if ->m_self ends up pointing to one of these pseudo-modules. I have no idea how one ends up with a method whose ->m_self points to a module object, though -- maybe it never actually happens.
- PyImport_Cleanup treats module objects differently from non-module objects during shutdown.

I guess it also has the mild limitation that it doesn't work with extension modules, but eh. Mostly I'd be nervous about the two points above.

-n -- Nathaniel J. Smith Postdoctoral researcher - Informatics - University of Edinburgh http://vorpus.org
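For anyone who, like Nathaniel, hadn't seen the trick before: isinstance() falls back to an object's __class__ attribute when the real type check fails, so a class can lie about it. A stripped-down illustration (not Mark's actual gist, which does more work):

    import sys
    import types

    class FakeModule(object):
        # Reports ModuleType as its __class__ even though it isn't a subclass,
        # which is enough to satisfy isinstance(x, types.ModuleType).
        __class__ = types.ModuleType

        def __init__(self, real_module):
            # share the original module's namespace, roughly the "take ownership
            # of the old module's __dict__" step Antoine mentions above
            self.__dict__ = real_module.__dict__

    fake = FakeModule(sys)
    print(isinstance(fake, types.ModuleType))   # True
    print(type(fake) is fake.__class__)         # False -- the part that feels blasphemous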

On Mon, Dec 1, 2014 at 12:59 AM, Nathaniel Smith <njs@pobox.com> wrote:
Actually, there is one showstopper here -- the first version in which reload() uses isinstance() is actually 3.4. Before that you need a real module subtype for reload to work. But this is in principle workaroundable by using subclassing + ctypes on old versions of python and the __class__ = hack on new versions.
-- Nathaniel J. Smith Postdoctoral researcher - Informatics - University of Edinburgh http://vorpus.org

Nathaniel, did you look at Brett's LazyLoader? It overcomes the subclass issue by using a module loader that makes all modules instances of a (trivial) Module subclass. I'm sure this approach can be backported as far as you need to go. On Sun, Nov 30, 2014 at 5:02 PM, Nathaniel Smith <njs@pobox.com> wrote:
-- --Guido van Rossum (python.org/~guido)

On Mon, Dec 1, 2014 at 1:27 AM, Guido van Rossum <guido@python.org> wrote:
The problem is that by the time your package's code starts running, it's too late to install such a loader. Brett's strategy works well for lazy-loading submodules (e.g., making it so 'import numpy' makes 'numpy.testing' available, but without the speed hit of importing it immediately), but it doesn't help if you want to actually hook attribute access on your top-level package (e.g., making 'numpy.foo' trigger a DeprecationWarning -- we have a lot of stupid exported constants that we can never get rid of because our rules say that we have to deprecate things before removing them).

Or maybe you're suggesting that we define a trivial heap-allocated subclass of PyModule_Type and use that everywhere, as a quick-and-dirty way to enable __class__ assignment? (E.g., return it from PyModule_New?) I considered this before but hesitated b/c it could potentially break backwards compatibility -- e.g. if code A creates a PyModule_Type object directly without going through PyModule_New, and then code B checks whether the resulting object is a module by doing isinstance(x, type(sys)), this will break. (type(sys) is a pretty common way to get a handle to ModuleType -- in fact both types.py and importlib use it.) So in my mind I sorta lumped it in with my Option 2, "minor compatibility break". OTOH maybe anyone who creates a module object without going through PyModule_New deserves whatever they get.

-n
-- Nathaniel J. Smith Postdoctoral researcher - Informatics - University of Edinburgh http://vorpus.org
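The idiom Nathaniel is worried about is trivial but widespread; this is essentially how Lib/types.py obtains ModuleType (a short illustration of the compatibility concern):

    import sys

    # How types.py and importlib get a handle on the module type:
    ModuleType = type(sys)
    print(isinstance(sys, ModuleType))   # True today

    # If the import machinery started returning instances of a heap-allocated
    # subclass, then type(sys) would be that subclass, and any module object
    # still created directly from the static PyModule_Type would fail this check.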

On Sun, Nov 30, 2014 at 5:42 PM, Nathaniel Smith <njs@pobox.com> wrote:
Couldn't you install a package loader using some install-time hook? Anyway, I still think that the issues with heap types can be overcome. Hm, didn't you bring that up before here? Was the conclusion that it's impossible? -- --Guido van Rossum (python.org/~guido)

On Mon, Dec 1, 2014 at 4:06 AM, Guido van Rossum <guido@python.org> wrote:
I've brought it up several times but no-one's really discussed it :-). I finally attempted a deep dive into typeobject.c today myself. I'm not at all sure I understand the intricacies correctly here, but I *think* __class__ assignment could be relatively easily extended to handle non-heap types, and in fact the current restriction to heap types is actually buggy (IIUC).

object_set_class is responsible for checking whether it's okay to take an object of class "oldto" and convert it to an object of class "newto". Basically its goal is just to avoid crashing the interpreter (as would quickly happen if you e.g. allowed "[].__class__ = dict"). Currently the rules (spread across object_set_class and compatible_for_assignment) are:

(1) both oldto and newto have to be heap types
(2) they have to have the same tp_dealloc
(3) they have to have the same tp_free
(4) if you walk up the ->tp_base chain for both types until you find the most-ancestral type that has a compatible struct layout (as checked by equiv_structs), then either
    (4a) these ancestral types have to be the same, OR
    (4b) these ancestral types have to have the same tp_base, AND they have to have added the same slots on top of that tp_base (e.g. if you have class A(object): pass and class B(object): pass then they'll both have added a __dict__ slot at the same point in the instance struct, so that's fine; this is checked in same_slots_added).

The only place the code assumes that it is dealing with heap types is in (4b) -- same_slots_added unconditionally casts the ancestral types to (PyHeapTypeObject*). AFAICT that's why step (1) is there, to protect this code. But I don't think the check actually works -- step (1) checks that the types we're trying to assign are heap types, but this is no guarantee that the *ancestral* types will be heap types. [Also, the code for __bases__ assignment appears to also call into this code with no heap type checks at all.] E.g., I think if you do

    class MyList(list):
        __slots__ = ()

    class MyDict(dict):
        __slots__ = ()

    MyList().__class__ = MyDict

then you'll end up in same_slots_added casting PyDict_Type and PyList_Type to PyHeapTypeObjects and then following invalid pointers into la-la land. (The __slots__ = () is to maintain layout compatibility with the base types; if you find builtin types that already have __dict__ and weaklist and HAVE_GC then this example should still work even with perfectly empty subclasses.)

Okay, so suppose we move the heap type check (step 1) down into same_slots_added (step 4b), since AFAICT this is actually more correct anyway. This is almost enough to enable __class__ assignment on modules, because the cases we care about will go through the (4a) branch rather than (4b), so the heap type thing is irrelevant.

The remaining problem is the requirement that both types have the same tp_dealloc (step 2). ModuleType itself has tp_dealloc == module_dealloc, while all(?) heap types have tp_dealloc == subtype_dealloc. Here again, though, I'm not sure what purpose this check serves. subtype_dealloc basically cleans up extra slots, and then calls the base class tp_dealloc. So AFAICT it's totally fine if oldto->tp_dealloc == module_dealloc, and newto->tp_dealloc == subtype_dealloc, so long as newto is a subtype of oldto -- b/c this means newto->tp_dealloc will end up calling oldto->tp_dealloc anyway.
OTOH it's not actually a guarantee of anything useful to see that oldto->tp_dealloc == newto->tp_dealloc == subtype_dealloc, because subtype_dealloc does totally different things depending on the ancestry tree -- MyList and MyDict above pass the tp_dealloc check, even though list.tp_dealloc and dict.tp_dealloc are definitely *not* interchangeable. So I suspect that a more correct way to do this check would be something like

    PyTypeObject *old_real_deallocer = oldto, *new_real_deallocer = newto;
    while (old_real_deallocer->tp_dealloc == subtype_dealloc)
        old_real_deallocer = old_real_deallocer->tp_base;
    while (new_real_deallocer->tp_dealloc == subtype_dealloc)
        new_real_deallocer = new_real_deallocer->tp_base;
    if (old_real_deallocer->tp_dealloc != new_real_deallocer->tp_dealloc)
        /* error out */

Module subclasses would pass this check. Alternatively it might make more sense to add a check in equiv_structs that

    (child_type->tp_dealloc == subtype_dealloc
     || child_type->tp_dealloc == parent_type->tp_dealloc);

I think that would accomplish the same thing in a somewhat cleaner way. Obviously this code is really subtle though, so don't trust any of the above without review from someone who knows typeobject.c better than me! (Antoine?)

-n -- Nathaniel J. Smith Postdoctoral researcher - Informatics - University of Edinburgh http://vorpus.org

On Mon, Dec 1, 2014 at 1:38 PM, Nathaniel Smith <njs@pobox.com> wrote:
That's because nobody dares to touch it. (Myself included -- I increased the size of typeobject.c from ~50 to ~5000 lines in a single intense editing session more than a decade ago, and since then it's been basically unmaintainable. :-( )
Have you filed this as a bug? I believe nobody has discovered this problem before. I've confirmed it as far back as 2.5 (I don't have anything older installed).
Yeah, I can't see a way that type_new() can create a type whose tp_dealloc isn't subtype_dealloc.
I guess the simple check is an upper bound (or whatever that's called -- my math-speak is rusty ;-) for the necessary-and-sufficient check that you're describing.
I'm not set up to disagree with you on this any more...
Or Benjamin? -- --Guido van Rossum (python.org/~guido)

On Mon, 1 Dec 2014 21:38:45 +0000 Nathaniel Smith <njs@pobox.com> wrote:
I'm not sure. Many operations are standardized on heap types that can have arbitrary definitions on static types (I'm talking about the tp_ methods). You'd have to review them to double check. For example, a heap type's tp_new increments the type's refcount, so you have to adjust the type's refcount if you cast an instance from a non-heap type to a heap type, and vice-versa (see slot_tp_new()). (this raises the interesting question "what happens if you assign to __class__ from a __del__ method?")
Sounds good.
There's no "child" and "parent" types in equiv_structs(). Regards Antoine.

On Tue, Dec 2, 2014 at 9:19 AM, Antoine Pitrou <solipsis@pitrou.net> wrote:
Reading through the list of tp_ methods I can't see any other that look problematic. The finalizers are kinda intimate, but I think people would expect that if you swap an instance's type to something that has a different __del__ method then it's the new __del__ method that'll be called. If we wanted to be really careful we should perhaps do something cleverer with tp_is_gc, but so long as type objects are the only objects that have a non-trivial tp_is_gc, and the tp_is_gc call depends only on their tp_flags (which are unmodified by __class__ assignment), then we should still be safe (and anyway this is orthogonal to the current issues).
Right, fortunately this is easy :-).
(this raises the interesting question "what happens if you assign to __class__ from a __del__ method?")
subtype_dealloc actually attempts to take this possibility into account -- see the comment "Extract the type again; tp_del may have changed it". I'm not at all sure that its handling is *correct* -- there's a bunch of code that references 'type' between the call to tp_del and this comment, and there's code after the comment that references 'base' without recalculating it. But it is there :-)
Not as currently written, but every single call site is of the form equiv_structs(x, x->tp_base). And equiv_structs takes advantage of this -- e.g., checking that two types have the same tp_basicsize is pretty uninformative if they're unrelated types, but if they're parent and child then it tells you that they have exactly the same slots. I wrote a patch incorporating the above ideas: http://bugs.python.org/issue22986 -- Nathaniel J. Smith Postdoctoral researcher - Informatics - University of Edinburgh http://vorpus.org

On Sat, Nov 29, 2014 at 12:59 PM, Nathaniel Smith <njs@pobox.com> wrote:
This one corresponds to what I've seen in quite a number of C APIs. It's not ideal, but nothing is; and at least this way, it's clear that you're fiddling with internals. Letting the interpreter do the grunt-work for you is *definitely* preferable to having recipes out there saying "swap in a new __dict__, then don't forget to clear the old module's __dict__", which will have massive versioning issues as soon as a new best-practice comes along; making it a function, like this, means its implementation can smoothly change between versions (even in a bug-fix release). Would it be better to make that function also switch out the entry in sys.modules? That way, it's 100% dedicated to this job of "I want to make a subclass of module and use that for myself", and could then be made atomic against other imports. I've no idea whether there's any other weird shenanigans that could be deployed with this kind of module switch, nor whether cutting them out would be a good or bad thing! ChrisA

Are these really all our options? All of them sound like hacks, none of them sound like anything the language (or even the CPython implementation) should sanction. Have I missed the discussion where the use cases and constraints were analyzed and all other approaches were rejected? (I might have some half-baked ideas, but I feel I should read up on the past discussion first, and they are probably more fit for python-ideas than for python-dev. Plus I'm just writing this email because I'm procrastinating on the type hinting PEP. :-) --Guido On Fri, Nov 28, 2014 at 7:45 PM, Chris Angelico <rosuav@gmail.com> wrote:
-- --Guido van Rossum (python.org/~guido)

On Sat, Nov 29, 2014 at 4:21 AM, Guido van Rossum <guido@python.org> wrote:
The previous discussions I was referring to are here: http://thread.gmane.org/gmane.comp.python.ideas/29487/focus=29555 http://thread.gmane.org/gmane.comp.python.ideas/29788 There might well be other options; these are just the best ones I could think of :-). The constraints are pretty tight, though: - The "new module" object (whatever it is) should have a __dict__ that aliases the original module globals(). I can elaborate on this if my original email wasn't enough, but hopefully it's obvious that making two copies of the same namespace and then trying to keep them in sync at the very least smells bad :-). - The "new module" object has to be a subtype of ModuleType, b/c there are lots of places that do isinstance(x, ModuleType) checks (notably -- but not only -- reload()). Since a major goal here is to make it possible to do cleaner deprecations, it would be really unfortunate if switching an existing package to use the metamodule support itself broke things :-). - Lookups in the normal case should have no additional performance overhead, because module lookups are extremely extremely common. (So this rules out dict proxies and tricks like that -- we really need 'new_module.__dict__ is globals()' to be true.) AFAICT there are three logically possible strategies for satisfying that first constraint: (a) convert the original module object into the type we want, in-place (b) create a new module object that acts like the original module object (c) somehow arrange for our special type to be used from the start My options 1 and 2 are means of accomplishing (a), and my options 3 and 4 are means of accomplishing (b) while working around the behavioural quirks of module objects (as required by the second constraint). The python-ideas thread did also consider several methods of implementing strategy (c), but they're messy enough that I left them out here. The problem is that somehow we have to execute code to create the new subtype *before* we have an entry in sys.modules for the package that contains the code for the subtype. So one option would be to add a new rule, that if a file pkgname/__new__.py exists, then this is executed first and is required to set up sys.modules["pkgname"] before we exec pkgname/__init__.py. So pkgname/__new__.py might look like: import sys from pkgname._metamodule import MyModuleSubtype sys.modules[__name__] = MyModuleSubtype(__name__, docstring) This runs into a lot of problems though. To start with, the 'from pkgname._metamodule ...' line is an infinite loop, b/c this is the code used to create sys.modules["pkgname"]. It's not clear where the globals dict for executing __new__.py comes from (who defines __name__? Currently that's done by ModuleType.__init__). It only works for packages, not modules. The need to provide the docstring here, before __init__.py is even read, is weird. It adds extra stat() calls to every package lookup. And, the biggest showstopper IMHO: AFAICT it's impossible to write a polyfill to support this code on old python versions, so it's useless to any package which needs to keep compatibility with 2.7 (or even 3.4). Sure, you can backport the whole import system like importlib2, but telling everyone that they need to replace every 'import numpy' with 'import importlib2; import numpy' is a total non-starter. So, yeah, those 4 options are really the only plausible ones I know of. Option 1 and option 3 are pretty nice at the language level! 
Most Python objects allow assignment to __class__ and __dict__, and both PyPy and Jython at least do support __class__ assignment. Really the only downside with Option 1 is that actually implementing it requires attention from someone with deep knowledge of typeobject.c. -n -- Nathaniel J. Smith Postdoctoral researcher - Informatics - University of Edinburgh http://vorpus.org

On 29/11/14 19:37, Nathaniel Smith wrote: [snip]
- The "new module" object has to be a subtype of ModuleType, b/c there are lots of places that do isinstance(x, ModuleType) checks (notably
It has to be a *subtype* is does not need to be a *subclass*
Cheers, Mark.

On Sun, Nov 30, 2014 at 11:07:57AM +1300, Greg Ewing wrote:
Perhaps I'm missing something, but won't that imply that every module which wants to use a "special" module type has to re-invent the wheel? If this feature is going to be used, I would expect to be able to re-use pre-written module types. E.g. having written "module with properties" (so to speak) once, I can just import it and use it in my next project. -- Steven

On Sun, Nov 30, 2014 at 12:05 AM, Steven D'Aprano <steve@pearwood.info> wrote:
I expect you'd package the special metamodule class in a stand-alone package, not directly in the ones that use it. You could import other packages freely, just the one that you're currently defining would be unavailable.

On Sat, Nov 29, 2014 at 8:37 PM, Nathaniel Smith <njs@pobox.com> wrote: [...]
As Greg Ewing said – you don't want to import from the package whose metamodule you're defining. You'd want to do as little work as possible in __new__.py. I'd use something like this: import types class __metamodule__(types.ModuleType): def __call__(self): return self.main() where Python would get the attribute __metamodule__ from __new__.py, and use `__metamodule__(name, doc)` as the thing to execute __main__ in.
Well, it could still be in __metamodule__.__init__().
It only works for packages, not modules.
I don't see a need for this treatment for modules in a package – if you want `from mypkg import callme`, you can make "callme" a function rather than a callable module. If you *also* want `from mypkg.callme import something_else`, I say you should split "callme" into two differently named things; names are cheap inside a package. If really needed, modules in a package can use an import hook defined in the package, or be converted to subpackages. Single-module projects would be left out, yes – but those can be simply converted to a package.
The need to provide the docstring here, before __init__.py is even read, is weird.
Does it have to be before __init__.py is read? Can't __init__.py be compiled beforehand, to get __doc__, and only *run* in the new namespace? (Or should __new__.py define import hooks that say how __init__.py should be loaded/compiled? I don't see a case for that.)
It adds extra stat() calls to every package lookup.
Fair.
I'm probably missing something obvious, but where would this not work? - As the first thing it does, __init__.py imports the polyfill and calls polyfill(__name__) - The polyfill, if running non-recursively* under old Python: -- compiles __init__.py -- imports __new__.py to get __metamodule__ -- instantiates metamodule with name, and docstring from compiled code -- * remembers the instance, to check for recursion later -- puts it in sys.modules -- execs __init__ in it - afterwards the original __init__.py execution continues, filling up a now-unused module's namespace

All the use cases seem to be about adding some kind of getattr hook to modules. They all seem to involve modifying the CPython C code anyway. So why not tackle that problem head-on and modify module_getattro() to look for a global named __getattr__ and if it exists, call that instead of raising AttributeError? On Sat, Nov 29, 2014 at 11:37 AM, Nathaniel Smith <njs@pobox.com> wrote:
-- --Guido van Rossum (python.org/~guido)

Are these really all our options? All of them sound like hacks, none of
On Sat, Nov 29, 2014, 21:55 Guido van Rossum <guido@python.org> wrote: All the use cases seem to be about adding some kind of getattr hook to modules. They all seem to involve modifying the CPython C code anyway. So why not tackle that problem head-on and modify module_getattro() to look for a global named __getattr__ and if it exists, call that instead of raising AttributeError? Not sure if anyone thought of it. :) Seems like a reasonable solution to me. Be curious to know what the benchmark suite said the impact was. -brett On Sat, Nov 29, 2014 at 11:37 AM, Nathaniel Smith <njs@pobox.com> wrote: On Sat, Nov 29, 2014 at 4:21 AM, Guido van Rossum <guido@python.org> wrote: them
The previous discussions I was referring to are here: http://thread.gmane.org/gmane.comp.python.ideas/29487/focus=29555 http://thread.gmane.org/gmane.comp.python.ideas/29788 There might well be other options; these are just the best ones I could think of :-). The constraints are pretty tight, though: - The "new module" object (whatever it is) should have a __dict__ that aliases the original module globals(). I can elaborate on this if my original email wasn't enough, but hopefully it's obvious that making two copies of the same namespace and then trying to keep them in sync at the very least smells bad :-). - The "new module" object has to be a subtype of ModuleType, b/c there are lots of places that do isinstance(x, ModuleType) checks (notably -- but not only -- reload()). Since a major goal here is to make it possible to do cleaner deprecations, it would be really unfortunate if switching an existing package to use the metamodule support itself broke things :-). - Lookups in the normal case should have no additional performance overhead, because module lookups are extremely extremely common. (So this rules out dict proxies and tricks like that -- we really need 'new_module.__dict__ is globals()' to be true.) AFAICT there are three logically possible strategies for satisfying that first constraint: (a) convert the original module object into the type we want, in-place (b) create a new module object that acts like the original module object (c) somehow arrange for our special type to be used from the start My options 1 and 2 are means of accomplishing (a), and my options 3 and 4 are means of accomplishing (b) while working around the behavioural quirks of module objects (as required by the second constraint). The python-ideas thread did also consider several methods of implementing strategy (c), but they're messy enough that I left them out here. The problem is that somehow we have to execute code to create the new subtype *before* we have an entry in sys.modules for the package that contains the code for the subtype. So one option would be to add a new rule, that if a file pkgname/__new__.py exists, then this is executed first and is required to set up sys.modules["pkgname"] before we exec pkgname/__init__.py. So pkgname/__new__.py might look like: import sys from pkgname._metamodule import MyModuleSubtype sys.modules[__name__] = MyModuleSubtype(__name__, docstring) This runs into a lot of problems though. To start with, the 'from pkgname._metamodule ...' line is an infinite loop, b/c this is the code used to create sys.modules["pkgname"]. It's not clear where the globals dict for executing __new__.py comes from (who defines __name__? Currently that's done by ModuleType.__init__). It only works for packages, not modules. The need to provide the docstring here, before __init__.py is even read, is weird. It adds extra stat() calls to every package lookup. And, the biggest showstopper IMHO: AFAICT it's impossible to write a polyfill to support this code on old python versions, so it's useless to any package which needs to keep compatibility with 2.7 (or even 3.4). Sure, you can backport the whole import system like importlib2, but telling everyone that they need to replace every 'import numpy' with 'import importlib2; import numpy' is a total non-starter. So, yeah, those 4 options are really the only plausible ones I know of. Option 1 and option 3 are pretty nice at the language level! 
Most Python objects allow assignment to __class__ and __dict__, and both PyPy and Jython at least do support __class__ assignment. Really the only downside with Option 1 is that actually implementing it requires attention from someone with deep knowledge of typeobject.c. -n -- Nathaniel J. Smith Postdoctoral researcher - Informatics - University of Edinburgh http://vorpus.org _______________________________________________ Python-Dev mailing list Python-Dev@python.org https://mail.python.org/mailman/listinfo/python-dev Unsubscribe: https://mail.python.org/mailman/options/python-dev/guido%40python.org -- --Guido van Rossum (python.org/~guido) _______________________________________________ Python-Dev mailing list Python-Dev@python.org https://mail.python.org/mailman/listinfo/python-dev Unsubscribe: https://mail.python.org/mailman/options/python-dev/brett%40python.org

On Sun, Nov 30, 2014 at 6:15 AM, Brett Cannon <brett@python.org> wrote:
Why would there be any impact? The __getattr__ hook would be similar to the one on classes -- it's only invoked at the point where otherwise AttributeError would be raised. -- --Guido van Rossum (python.org/~guido)
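(A small illustration of that lookup behaviour for classes -- a sketch of the semantics the proposed module hook would mirror, not code from the thread.)

    class Demo:
        x = 1
        def __getattr__(self, name):
            # only called after normal attribute lookup has failed
            return "fallback for %s" % name

    d = Demo()
    print(d.x)        # 1 -- found normally, __getattr__ never runs
    print(d.missing)  # "fallback for missing" -- would otherwise raise AttributeError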

On Sun Nov 30 2014 at 2:28:31 PM Ethan Furman <ethan@stoneleaf.us> wrote:
You don't; you just can't shoehorn everything back to 2.7. And just to make sure everyone participating in this discussion is up on the latest import stuff, Python 3.4 does have Loader.create_module() <https://docs.python.org/3/library/importlib.html#importlib.abc.Loader.create...> which lets you control what object is used for a module in the import machinery (this is prior to loading, though, so you can't specify it in a module but at the loader level only). This is how I was able to implement lazy loading for 3.5 <https://docs.python.org/3.5/library/importlib.html#importlib.util.LazyLoader> .
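(For readers following along, a sketch of how the 3.5 LazyLoader is typically wired up, based on the importlib documentation; the helper name and the module being imported are placeholders.)

    import importlib.util
    import sys

    def lazy_import(name):
        # Wrap the real loader so the module body only executes on first
        # attribute access.
        spec = importlib.util.find_spec(name)
        loader = importlib.util.LazyLoader(spec.loader)
        spec.loader = loader
        module = importlib.util.module_from_spec(spec)
        sys.modules[name] = module
        loader.exec_module(module)
        return module

    json = lazy_import("json")   # cheap: the module body has not run yet
    json.dumps                   # first attribute access triggers the real import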

On Sun, Nov 30, 2014 at 7:27 PM, Ethan Furman <ethan@stoneleaf.us> wrote:
I think that's doable -- assuming I'm remembering correctly the slightly weird class vs. instance lookup rules for special methods, you can write a module subclass like

    class GetAttrModule(types.ModuleType):
        def __getattr__(self, name):
            return self.__dict__["__getattr__"](name)

and then use ctypes hacks to get it into sys.modules[__name__]. -n -- Nathaniel J. Smith Postdoctoral researcher - Informatics - University of Edinburgh http://vorpus.org

On 11/30/2014 03:41 PM, Terry Reedy wrote:
My understanding of one of the use-cases was being able to issue warnings about deprecated attributes, which would be most effective if a backport could be written for current versions. -- ~Ethan~

On Sun, 30 Nov 2014 11:15:50 -0800 Guido van Rossum <guido@python.org> wrote:
builtins are typically found by first looking up in the current globals (module) scope, failing, and then falling back on __builtins__. Depending on how much overhead is added to the "failing" step, there /might/ be a performance difference. Of course, that would only occur wherever a __getattr__ hook is defined. Regards Antoine.

On Sun, Nov 30, 2014 at 1:12 PM, Antoine Pitrou <solipsis@pitrou.net> wrote:
The builtins lookup process never does a module attribute lookup -- it only does dict lookups. So it would not be affected by a module __getattr__ hook (unless we were to use dict proxies, which Nathaniel already rejected). @Nathaniel: perhaps you could get what you want without any C code changes using the approach of Brett's LazyLoader? -- --Guido van Rossum (python.org/~guido)
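(A quick way to see this -- an illustration, not part of the thread: the bytecode for a global/builtin reference goes through dict lookups, not module attribute access.)

    import dis

    def f():
        return len   # resolved via LOAD_GLOBAL: globals dict, then builtins dict

    dis.dis(f)   # no module attribute lookup appears in the output, so a module
                 # __getattr__ hook would not be consulted on this path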

On Sun, Nov 30, 2014 at 2:54 AM, Guido van Rossum <guido@python.org> wrote:
You need to allow overriding __dir__ as well for tab-completion, and some people wanted to use the properties API instead of raw __getattr__, etc. Maybe someone will want __getattribute__ semantics, I dunno. So since we're *so close* to being able to just use the subclassing machinery, it seemed cleaner to try and get that working instead of reimplementing bits of it piecewise. That said, __getattr__ + __dir__ would be enough for my immediate use cases. -n
-- Nathaniel J. Smith Postdoctoral researcher - Informatics - University of Edinburgh http://vorpus.org

On Sun, Nov 30, 2014 at 11:29 AM, Nathaniel Smith <njs@pobox.com> wrote:
Hm... I agree about __dir__ but the other things feel too speculative.
That would really be option 1, right? It's the one that looks cleanest from the user's POV (or at least from the POV of a developer who wants to build a framework using this feature -- for a simple one-off use case, __getattr__ sounds pretty attractive). I think that if we really want option 1, the issue of PyModule_Type not being a heap type can be dealt with.
That said, __getattr__ + __dir__ would be enough for my immediate use cases.
Perhaps it would be a good exercise to try and write the "lazy submodule import"(*) use case three ways: (a) using only CPython 3.4; (b) using __class__ assignment; (c) using customizable __getattr__ and __dir__. I think we can learn a lot about the alternatives from this exercise. I presume there's already a version of (a) floating around, but if it's been used in practice at all, it's probably too gnarly to serve as a useful comparison (though its essence may be extracted to serve as such).

FWIW I believe all proposals here have a big limitation: the module *itself* cannot benefit much from all these shenanigans, because references to globals from within the module's own code are just dictionary accesses, and we don't want to change that.

(*) I originally wrote "lazy import", but I realized that messing with the module class object probably isn't the best way to implement that -- it requires a proxy for the module that's managed by an import hook. But if you think it's possible, feel free to use this example, as "lazy import" seems a pretty useful thing to have in many situations. (At least that's how I would do it. And I would probably add some atrocious hack to patch up the importing module's globals once the module is actually loaded, to reduce the cost of using the proxy over the lifetime of the process.) -- --Guido van Rossum (python.org/~guido)

On Sun Nov 30 2014 at 3:55:39 PM Guido van Rossum <guido@python.org> wrote:
Start at https://hg.python.org/cpython/file/64bb01bce12c/Lib/importlib/util.py#l207 and read down the rest of the file. It really only requires changing __class__ to drop the proxy and that's done immediately after the lazy import. The approach also occurs *after* the finder so you don't get ImportError for at least missing a file.
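(Not importlib's actual code -- a minimal sketch of the proxy idea Brett describes, assuming Python 3.5+ where __class__ assignment on module instances is permitted; the class name is made up.)

    import types

    class _LazyProxyModule(types.ModuleType):
        # First attribute access would perform the real load (exec_module in
        # the real implementation), then drop the proxy by flipping __class__
        # back to the plain module type.
        def __getattribute__(self, attr):
            # ... the real loading step would happen here ...
            self.__class__ = types.ModuleType
            return types.ModuleType.__getattribute__(self, attr)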

On Sun, Nov 30, 2014 at 8:54 PM, Guido van Rossum <guido@python.org> wrote:
Options 1-4 all have the effect of making it fairly simple to slot an arbitrary user-defined module subclass into sys.modules. Option 1 is the cleanest API though :-).
(b) and (c) are very straightforward and trivial. Probably I could do a better job of faking dir()'s default behaviour on modules, but basically:

    ##### __class__ assignment #####

    import sys, types, importlib

    class MyModule(types.ModuleType):
        def __getattr__(self, name):
            if name in _lazy_submodules:
                # implicitly assigns submodule to self.__dict__[name]
                return importlib.import_module("." + name, self.__package__)
            raise AttributeError(name)

        def __dir__(self):
            entries = set(self.__dict__)
            entries.update(_lazy_submodules)
            return sorted(entries)

    sys.modules[__name__].__class__ = MyModule

    _lazy_submodules = {"foo", "bar"}

    ##### customizable __getattr__ and __dir__ #####

    import importlib

    def __getattr__(name):
        if name in _lazy_submodules:
            # implicitly assigns submodule to globals()[name]
            return importlib.import_module("." + name, __package__)
        raise AttributeError(name)

    def __dir__():
        entries = set(globals())
        entries.update(_lazy_submodules)
        return sorted(entries)

    _lazy_submodules = {"foo", "bar"}
I think that's fine -- IMHO the main use cases here are about controlling the public API. And a module that really wants to can always import itself if it wants to pull more shenanigans :-) (i.e., foo/__init__.py can do "import foo; foo.blahblah" instead of just "blahblah".) -n -- Nathaniel J. Smith Postdoctoral researcher - Informatics - University of Edinburgh http://vorpus.org
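(A hypothetical client-side view of the sketches above -- the package and submodule names are made up, just to show what users would see.)

    import mypkg            # fast: "foo" and "bar" are not imported yet
    mypkg.foo               # first access triggers the submodule import
    print(dir(mypkg))       # __dir__ lists "foo" and "bar" for tab-completion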

Guido van Rossum wrote:
If assignment to the __class__ of a module were permitted (by whatever means) then you could put this in a module:

    class __class__(types.ModuleType):
        ...

which makes it look almost like a deliberate language feature. :-) Seriously, of the options presented, I think that allowing __class__ assignment is the most elegant, since it solves a lot of problems in one go without introducing any new features -- just removing a restriction that prevents an existing language mechanism from working in this case. -- Greg

What if we had metaclass semantics on module creation? E.g., suppose the default were: __metaclass__ = ModuleType. What if Python supported __prepare__ for modules? Thanks, -- Ionel M. On Sat, Nov 29, 2014 at 11:36 PM, Greg Ewing <greg.ewing@canterbury.ac.nz> wrote:

On Sat, 29 Nov 2014 01:59:06 +0000 Nathaniel Smith <njs@pobox.com> wrote:
Option 1b: have __class__ assignment delegate to a tp_classassign slot on the old class, so that typeobject.c doesn't have to be cluttered with many special cases.
[...]
How do these two options interact with the fact that module functions store their globals dict, not the module itself? Regards Antoine.
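(A small demonstration of the point Antoine raises -- module-level functions keep a reference to the globals dict, not to the module object; illustrative only, not from the thread.)

    import types

    mod = types.ModuleType("m")
    exec("def f(): return SOME_GLOBAL", mod.__dict__)
    f = mod.__dict__["f"]
    assert f.__globals__ is mod.__dict__   # the dict, not the module object

    # So any metamodule scheme has to keep that same dict alive and authoritative,
    # which is exactly the "aliasing" constraint stated earlier.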

On 29 November 2014 at 21:32, Antoine Pitrou <solipsis@pitrou.net> wrote:
Aye, being able to hook class switching could be potentially useful (including the ability to just disallow it entirely if you really wanted to do that).
Right, that's the part I consider the most challenging with metamodules - the fact that there's a longstanding assumption that a module is just a "dictionary with some metadata", so the interpreter is inclined to treat them that way. Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia

On Sat, Nov 29, 2014 at 11:32 AM, Antoine Pitrou <solipsis@pitrou.net> wrote:
I'm intrigued -- how would this help? I have a vague impression that one could add another branch to object_set_class that went something like if at least one of the types is a subtype of the other type, and the subtype is a heap type with tp_dealloc == subtype_dealloc, and the subtype doesn't add any important slots, and ... then the __class__ assignment is legal. (This is taking advantage of the fact that if you don't have any extra slots added, then subtype_dealloc just basically defers to the base type's tp_dealloc, so it doesn't really matter which one you end up calling.) And my vague impression is that there isn't really anything special about the module type that would allow a tp_classassign function to simplify this logic. But these are just vague impressions :-)
I think that's totally fine? The whole point of all these proposals is to make sure that the final module object does in fact have the correct globals dict.

    ~$ git clone git@github.com:njsmith/metamodule.git
    ~$ cd metamodule
    ~/metamodule$ python3.4
If anything this is another argument for why we NEED something like this :-). -n -- Nathaniel J. Smith Postdoctoral researcher - Informatics - University of Edinburgh http://vorpus.org

On Sat, 29 Nov 2014 20:02:50 +0000 Nathaniel Smith <njs@pobox.com> wrote:
It would allow ModuleType to override tp_classassign to decide whether and how __class__ assignment on a module instance is allowed to work. So typeobject.c needn't know about any specifics of ModuleType or any other type.
Ok, I see. The code hacks up the new module to take ownership of the old module's __dict__. That doesn't look very clean to me. Regards Antoine.

On 29/11/14 01:59, Nathaniel Smith wrote:
Hi all,
[snip]
Why does MyModuleClass need to sub-class types.ModuleType? Modules have no special behaviour, apart from the inability to write to their __dict__ attribute, which is the very thing you don't want. If it quacks like a module... Cheers, Mark.

Hi, This discussion has been going on for a while, but no one has questioned the basic premise. Does this need any change to the language or interpreter? I believe it does not. I've modified your original metamodule.py to not use ctypes and support reloading: https://gist.github.com/markshannon/1868e7e6115d70ce6e76 Cheers, Mark. On 29/11/14 01:59, Nathaniel Smith wrote:

On Sun, Nov 30, 2014 at 10:14 PM, Mark Shannon <mark@hotpy.org> wrote:
Interesting approach! As written, your code will blow up on any python < 3.4, because when old_module gets deallocated it'll wipe the module dict clean. And I guess even on >= 3.4, this might still happen if old_module somehow manages to get itself into a reference loop before getting deallocated. (Hopefully not, but what a nightmare to debug if it did.) However, both of these issues can be fixed by stashing a reference to old_module somewhere in new_module.

The __class__ = ModuleType trick is super-clever but makes me irrationally uncomfortable. I know that this is documented as a valid method of fooling isinstance(), but I didn't know that until yesterday, and the idea of objects where type(foo) is not foo.__class__ strikes me as somewhat blasphemous. Maybe this is all fine though. The pseudo-module objects generated this way still won't pass PyModule_Check, so in theory this could produce behavioural differences. I can't name any specific places where this will break things, though. From a quick skim of the CPython source, a few observations:

- It means the PyModule_* API functions won't work (e.g. PyModule_GetDict); maybe these aren't used enough to matter.
- It looks like the __reduce__ methods on "method objects" (Objects/methodobject.c) have a special check for ->m_self being a module object, and won't pickle correctly if ->m_self ends up pointing to one of these pseudo-modules. I have no idea how one ends up with a method whose ->m_self points to a module object, though -- maybe it never actually happens.
- PyImport_Cleanup treats module objects differently from non-module objects during shutdown.

I guess it also has the mild limitation that it doesn't work with extension modules, but eh. Mostly I'd be nervous about the two points above. -n -- Nathaniel J. Smith Postdoctoral researcher - Informatics - University of Edinburgh http://vorpus.org
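(For concreteness, a sketch of the general isinstance()-fooling trick under discussion -- not Mark's actual gist; the class name is made up.)

    import types

    class FakeModule:
        # Reporting ModuleType as __class__ satisfies isinstance() checks,
        # but type() and the C-level PyModule_Check still see the real type.
        __class__ = types.ModuleType

    obj = FakeModule()
    assert isinstance(obj, types.ModuleType)
    assert type(obj) is not types.ModuleType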

On Mon, Dec 1, 2014 at 12:59 AM, Nathaniel Smith <njs@pobox.com> wrote:
Actually, there is one showstopper here -- the first version where reload() uses isinstance() is actually 3.4. Before that you need a real module subtype for reload to work. But this is in principle workaroundable by using subclassing + ctypes on old versions of python and the '__class__ = ModuleType' hack on new versions.
-- Nathaniel J. Smith Postdoctoral researcher - Informatics - University of Edinburgh http://vorpus.org

Nathaniel, did you look at Brett's LazyLoader? It overcomes the subclass issue by using a module loader that makes all modules instances of a (trivial) Module subclass. I'm sure this approach can be backported as far as you need to go. On Sun, Nov 30, 2014 at 5:02 PM, Nathaniel Smith <njs@pobox.com> wrote:
-- --Guido van Rossum (python.org/~guido)

On Mon, Dec 1, 2014 at 1:27 AM, Guido van Rossum <guido@python.org> wrote:
The problem is that by the time your package's code starts running, it's too late to install such a loader. Brett's strategy works well for lazy-loading submodules (e.g., making it so 'import numpy' makes 'numpy.testing' available, but without the speed hit of importing it immediately), but it doesn't help if you want to actually hook attribute access on your top-level package (e.g., making 'numpy.foo' trigger a DeprecationWarning -- we have a lot of stupid exported constants that we can never get rid of because our rules say that we have to deprecate things before removing them). Or maybe you're suggesting that we define a trivial heap-allocated subclass of PyModule_Type and use that everywhere, as a quick-and-dirty way to enable __class__ assignment? (E.g., return it from PyModule_New?) I considered this before but hesitated b/c it could potentially break backwards compatibility -- e.g. if code A creates a PyModule_Type object directly without going through PyModule_New, and then code B checks whether the resulting object is a module by doing isinstance(x, type(sys)), this will break. (type(sys) is a pretty common way to get a handle to ModuleType -- in fact both types.py and importlib use it.) So in my mind I sorta lumped it in with my Option 2, "minor compatibility break". OTOH maybe anyone who creates a module object without going through PyModule_New deserves whatever they get. -n
-- Nathaniel J. Smith Postdoctoral researcher - Informatics - University of Edinburgh http://vorpus.org
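(A sketch of the compatibility worry, with hypothetical names -- simulating what would happen if the import machinery returned a trivial subclass while other code still created plain module objects directly.)

    import types

    class _ModuleSubclass(types.ModuleType):   # stand-in for a subclass returned by PyModule_New
        pass

    HypotheticalModuleType = _ModuleSubclass   # what "type(sys)" would then give you
    plain = types.ModuleType("made_directly")  # a module created without PyModule_New

    print(isinstance(plain, HypotheticalModuleType))   # False -- the breakage described above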

On Sun, Nov 30, 2014 at 5:42 PM, Nathaniel Smith <njs@pobox.com> wrote:
Couldn't you install a package loader using some install-time hook? Anyway, I still think that the issues with heap types can be overcome. Hm, didn't you bring that up before here? Was the conclusion that it's impossible? -- --Guido van Rossum (python.org/~guido)

On Mon, Dec 1, 2014 at 4:06 AM, Guido van Rossum <guido@python.org> wrote:
I've brought it up several times but no-one's really discussed it :-). I finally attempted a deep dive into typeobject.c today myself. I'm not at all sure I understand the intricacies correctly here, but I *think* __class__ assignment could be relatively easily extended to handle non-heap types, and in fact the current restriction to heap types is actually buggy (IIUC).

object_set_class is responsible for checking whether it's okay to take an object of class "oldto" and convert it to an object of class "newto". Basically its goal is just to avoid crashing the interpreter (as would quickly happen if you e.g. allowed "[].__class__ = dict"). Currently the rules (spread across object_set_class and compatible_for_assignment) are:

(1) both oldto and newto have to be heap types,
(2) they have to have the same tp_dealloc,
(3) they have to have the same tp_free,
(4) if you walk up the ->tp_base chain for both types until you find the most-ancestral type that has a compatible struct layout (as checked by equiv_structs), then either
    (4a) these ancestral types have to be the same, OR
    (4b) these ancestral types have to have the same tp_base, AND they have to have added the same slots on top of that tp_base (e.g. if you have class A(object): pass and class B(object): pass then they'll both have added a __dict__ slot at the same point in the instance struct, so that's fine; this is checked in same_slots_added).

The only place the code assumes that it is dealing with heap types is in (4b) -- same_slots_added unconditionally casts the ancestral types to (PyHeapTypeObject*). AFAICT that's why step (1) is there, to protect this code. But I don't think the check actually works -- step (1) checks that the types we're trying to assign are heap types, but this is no guarantee that the *ancestral* types will be heap types. [Also, the code for __bases__ assignment appears to also call into this code with no heap type checks at all.] E.g., I think if you do

    class MyList(list):
        __slots__ = ()

    class MyDict(dict):
        __slots__ = ()

    MyList().__class__ = MyDict

then you'll end up in same_slots_added casting PyDict_Type and PyList_Type to PyHeapTypeObjects and then following invalid pointers into la-la land. (The __slots__ = () is to maintain layout compatibility with the base types; if you find builtin types that already have __dict__ and weaklist and HAVE_GC then this example should still work even with perfectly empty subclasses.)

Okay, so suppose we move the heap type check (step 1) down into same_slots_added (step 4b), since AFAICT this is actually more correct anyway. This is almost enough to enable __class__ assignment on modules, because the cases we care about will go through the (4a) branch rather than (4b), so the heap type thing is irrelevant. The remaining problem is the requirement that both types have the same tp_dealloc (step 2). ModuleType itself has tp_dealloc == module_dealloc, while all(?) heap types have tp_dealloc == subtype_dealloc. Here again, though, I'm not sure what purpose this check serves. subtype_dealloc basically cleans up extra slots, and then calls the base class tp_dealloc. So AFAICT it's totally fine if oldto->tp_dealloc == module_dealloc, and newto->tp_dealloc == subtype_dealloc, so long as newto is a subtype of oldto -- b/c this means newto->tp_dealloc will end up calling oldto->tp_dealloc anyway.
OTOH it's not actually a guarantee of anything useful to see that oldto->tp_dealloc == newto->tp_dealloc == subtype_dealloc, because subtype_dealloc does totally different things depending on the ancestry tree -- MyList and MyDict above pass the tp_dealloc check, even though list.tp_dealloc and dict.tp_dealloc are definitely *not* interchangeable. So I suspect that a more correct way to do this check would be something like

    PyTypeObject *old_real_deallocer = oldto, *new_real_deallocer = newto;
    while (old_real_deallocer->tp_dealloc == subtype_dealloc)
        old_real_deallocer = old_real_deallocer->tp_base;
    while (new_real_deallocer->tp_dealloc == subtype_dealloc)
        new_real_deallocer = new_real_deallocer->tp_base;
    if (old_real_deallocer->tp_dealloc != new_real_deallocer->tp_dealloc)
        /* error out */

Module subclasses would pass this check. Alternatively it might make more sense to add a check in equiv_structs that (child_type->tp_dealloc == subtype_dealloc || child_type->tp_dealloc == parent_type->tp_dealloc); I think that would accomplish the same thing in a somewhat cleaner way. Obviously this code is really subtle though, so don't trust any of the above without review from someone who knows typeobject.c better than me! (Antoine?) -n -- Nathaniel J. Smith Postdoctoral researcher - Informatics - University of Edinburgh http://vorpus.org
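(To see the layout-compatibility rules from the Python side -- an illustrative sketch unrelated to the module case: same-layout classes can swap, different layouts are refused.)

    class A:
        pass

    class B:
        pass

    a = A()
    a.__class__ = B            # allowed: both add only __dict__/__weakref__ on top of object
    assert type(a) is B

    class C:
        __slots__ = ("x",)     # different instance layout (no __dict__, one slot)

    try:
        a.__class__ = C
    except TypeError as e:
        print(e)               # layout differs, so the assignment is refused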

On Mon, Dec 1, 2014 at 1:38 PM, Nathaniel Smith <njs@pobox.com> wrote:
That's because nobody dares to touch it. (Myself included -- I increased the size of typeobject.c from ~50 to ~5000 lines in a single intense editing session more than a decade ago, and since then it's been basically unmaintainable. :-(
Have you filed this as a bug? I believe nobody has discovered this problem before. I've confirmed it as far back as 2.5 (I don't have anything older installed).
Yeah, I can't see a way that type_new() can create a type whose tp_dealloc isn't subtype_dealloc.
I guess the simple check is an upper bound (or whatever that's called -- my math-speak is rusty ;-) for the necessary-and-sufficient check that you're describing.
I'm not set up to disagree with you on this any more...
Or Benjamin? -- --Guido van Rossum (python.org/~guido)

On Mon, 1 Dec 2014 21:38:45 +0000 Nathaniel Smith <njs@pobox.com> wrote:
I'm not sure. Many operations are standardized on heap types that can have arbitrary definitions on static types (I'm talking about the tp_ methods). You'd have to review them to double check. For example, a heap type's tp_new increments the type's refcount, so you have to adjust the instance refcount if you cast it from a non-heap type to a heap type, and vice-versa (see slot_tp_new()). (this raises the interesting question "what happens if you assign to __class__ from a __del__ method?")
Sounds good.
There's no "child" and "parent" types in equiv_structs(). Regards Antoine.

On Tue, Dec 2, 2014 at 9:19 AM, Antoine Pitrou <solipsis@pitrou.net> wrote:
Reading through the list of tp_ methods I can't see any other that look problematic. The finalizers are kinda intimate, but I think people would expect that if you swap an instance's type to something that has a different __del__ method then it's the new __del__ method that'll be called. If we wanted to be really careful we should perhaps do something cleverer with tp_is_gc, but so long as type objects are the only objects that have a non-trivial tp_is_gc, and the tp_is_gc call depends only on their tp_flags (which are unmodified by __class__ assignment), then we should still be safe (and anyway this is orthogonal to the current issues).
Right, fortunately this is easy :-).
(this raises the interesting question "what happens if you assign to __class__ from a __del__ method?")
subtype_dealloc actually attempts to take this possibility into account -- see the comment "Extract the type again; tp_del may have changed it". I'm not at all sure that its handling is *correct* -- there's a bunch of code that references 'type' between the call to tp_del and this comment, and there's code after the comment that references 'base' without recalculating it. But it is there :-)
Not as currently written, but every single call site is of the form equiv_structs(x, x->tp_base). And equiv_structs takes advantage of this -- e.g., checking that two types have the same tp_basicsize is pretty uninformative if they're unrelated types, but if they're parent and child then it tells you that they have exactly the same slots. I wrote a patch incorporating the above ideas: http://bugs.python.org/issue22986 -- Nathaniel J. Smith Postdoctoral researcher - Informatics - University of Edinburgh http://vorpus.org