Strange artifacts with PEP 3121 and monkey-patching sys.modules (in csv, ElementTree and others)
Hello, Recently as part of the effort of untangling the tests of ElementTree and general code improvements (e.g. http://bugs.python.org/issue15651), I ran into something strange about PEP 3121-compliant modules. I'll demonstrate with csv, just as an example. PEP 3121 mandates this function to look up the module-specific state in the current sub-interpreter: PyObject* PyState_FindModule(struct PyModuleDef*); This appears to make the following assumption: a given sub-interpreter only imports any C extension *once*. If it happens more than once, the assumption breaks in troubling ways. In normal code, it should never happen more than once because of the caching in sys.modules; However, many of our tests monkey-patch sys.modules (mainly by calling test.support.import_fresh_module) and hell breaks use. Here's a simple example: ---- import sys csv = __import__('csv') csv.register_dialect('unixpwd', delimiter=':', quoting=csv.QUOTE_NONE) print(csv.list_dialects()) # ==> ['unixpwd', 'excel-tab', 'excel', 'unix'] del sys.modules['csv'] # FUN del sys.modules['_csv'] some_other_csv = __import__('csv') print(csv.list_dialects()) # ==> ['excel-tab', 'excel', 'unix'] ---- Note how doing some sys.modules acrobatics and re-importing suddenly changes the internal state of a previously imported module. This happens because: 1. The first import of 'csv' (which then imports `_csv) creates module-specific state on the heap and associates it with the current sub-interpreter. The list of dialects, amongst other things, is in that state. 2. The 'del's wipe 'csv' and '_csv' from the cache. 3. The second import of 'csv' also creates/initializes a new '_csv' module because it's not in sys.modules. This *replaces* the per-sub-interpreter cached version of the module's state with the clean state of a new module So essentially, while PEP 3121 moves state from C-file globals to per-module state, the state is still global, and this fact can be exposed from pure Python code. The above is a toy example. Here's a more serious case I ran into with ET, but once again is demonstrated with 'csv' for simplicity: ---- import io from test.support import import_fresh_module import csv csv_other = import_fresh_module('csv', fresh=['_csv', 'csv']) f = io.StringIO('foo\x00,bar\nbaz,42') reader = csv.reader(f) try: for row in reader: print(row) except csv.Error as e: print('Caught csv.error', e) except Exception as e: print('Caught Exception', e) ---- In the above, the reader throws 'csv.Error' (because of the NULL byte) but the exception clause does not catch it where expected, because it's a different exception class called `csv.Error`, due to the same problem demonstrated above (if the seemingly innocent import_fresh_module is removed, all is good). Any ideas/suggestion regarding this are welcome. This is quite an esoteric problem, but I believe it's serious. PEP 3121 is not used much (yet), but recently there was talk again about committing some of the patches created for converting Modules/*.c extensions to it during a GSoC project. I believe that we should understand the implications first. There can be a number of solutions; including modifying the PEP 3121 implementation machinery to really create/keep state "per module" and not just "per kind of module in a single sub-interpreter". Eli
In a similar vein, Antoine recently noted that the fact the per-module state isn't a real PyObject creates a variety of interesting lifecycle management challenges. I'm not seeing an easy solution, either, except to automatically skip reinitialization when the module has already been imported.
n Sat, Aug 10, 2013 at 5:47 PM, Nick Coghlan
In a similar vein, Antoine recently noted that the fact the per-module state isn't a real PyObject creates a variety of interesting lifecycle management challenges.
I'm not seeing an easy solution, either, except to automatically skip reinitialization when the module has already been imported.
This solution has problems. For example, in the case of ET it would preclude testing what happens when pyexpat is disabled (remember we were discussing this...). This is because there would be no real way to create new instances of such modules (they would all cache themselves in the init function - similarly to what ET now does in trunk, because otherwise some of its global-dependent crazy tests fail). A more radical solution would be to *really* have multiple instances of state per sub-interpreter. Well, they already exist -- it's PyState_FindModule which is the problematic one because it only remembers the last one. But I see that it's only being used by extension modules themselves, to efficiently find modules they belong to. It feels a bit like a hack that was made to avoid rewriting lots of code, because in general a module's objects *can* know which module instance they came from. E.g. it can be saved as a private field in classes exported by the module. So a more radical approach would be: PyState_FindModule can be deprecated, but still exist and be documented to return the state the *last* module created in this sub-interpreter. stdlib extension modules that actually use this mechanism can be rewritten to just remember the module for real, and not rely on PyState_FindModule to fetch it from a global cache. I don't think this would be hard, and it would make the good intention of PEP 3121 more real - actual intependent state per module instance. Eli
On 10 Aug 2013 21:06, "Eli Bendersky"
n Sat, Aug 10, 2013 at 5:47 PM, Nick Coghlan
wrote: In a similar vein, Antoine recently noted that the fact the per-module
I'm not seeing an easy solution, either, except to automatically skip
reinitialization when the module has already been imported.
This solution has problems. For example, in the case of ET it would
state isn't a real PyObject creates a variety of interesting lifecycle management challenges. preclude testing what happens when pyexpat is disabled (remember we were discussing this...). This is because there would be no real way to create new instances of such modules (they would all cache themselves in the init function - similarly to what ET now does in trunk, because otherwise some of its global-dependent crazy tests fail). Right, it would still be broken, just in a less horrible way.
A more radical solution would be to *really* have multiple instances of
state per sub-interpreter. Well, they already exist -- it's PyState_FindModule which is the problematic one because it only remembers the last one. But I see that it's only being used by extension modules themselves, to efficiently find modules they belong to. It feels a bit like a hack that was made to avoid rewriting lots of code, because in general a module's objects *can* know which module instance they came from. E.g. it can be saved as a private field in classes exported by the module.
So a more radical approach would be:
PyState_FindModule can be deprecated, but still exist and be documented
to return the state the *last* module created in this sub-interpreter. stdlib extension modules that actually use this mechanism can be rewritten to just remember the module for real, and not rely on PyState_FindModule to fetch it from a global cache. I don't think this would be hard, and it would make the good intention of PEP 3121 more real - actual intependent state per module instance. Sounds promising to me. I suspect handling exported functions will prove to be tricky, though - they may need to be redesigned to behave more like "module methods".
Eli
On Sat, 10 Aug 2013 18:06:02 -0700
Eli Bendersky
This solution has problems. For example, in the case of ET it would preclude testing what happens when pyexpat is disabled (remember we were discussing this...). This is because there would be no real way to create new instances of such modules (they would all cache themselves in the init function - similarly to what ET now does in trunk, because otherwise some of its global-dependent crazy tests fail).
A more radical solution would be to *really* have multiple instances of state per sub-interpreter. Well, they already exist -- it's PyState_FindModule which is the problematic one because it only remembers the last one.
I'm not sure I understand your diagnosis. modules_per_index (and PyState_FindModule) is per-interpreter so we already have a per-interpreter state here. Something else must be interferring. Note that module state is just a field attached to the module object ("void *md_state" in PyModuleObject). It's really the extension modules which are per-interpreter, which is a good thing. Regards Antoine.
On Sun, Aug 11, 2013 at 2:58 AM, Antoine Pitrou
On Sat, 10 Aug 2013 18:06:02 -0700 Eli Bendersky
wrote: This solution has problems. For example, in the case of ET it would preclude testing what happens when pyexpat is disabled (remember we were discussing this...). This is because there would be no real way to create new instances of such modules (they would all cache themselves in the init function - similarly to what ET now does in trunk, because otherwise some of its global-dependent crazy tests fail).
A more radical solution would be to *really* have multiple instances of state per sub-interpreter. Well, they already exist -- it's PyState_FindModule which is the problematic one because it only remembers the last one.
I'm not sure I understand your diagnosis. modules_per_index (and PyState_FindModule) is per-interpreter so we already have a per-interpreter state here. Something else must be interferring.
Yes, it's per interpreter, but only one per interpreter is remembered in state->modules_by_index. What I'm trying to say is that currently two different instances of PyModuleObject *within the same interpterer* share the state if they get to it through PyState_FindModule, because they share the same PyModuleDef, and stat->modules_by_index keeps only one module per PyModuleDef.
Note that module state is just a field attached to the module object ("void *md_state" in PyModuleObject). It's really the extension modules which are per-interpreter, which is a good thing.
Hi Eli,
On Sat, 10 Aug 2013 17:12:53 -0700
Eli Bendersky
Note how doing some sys.modules acrobatics and re-importing suddenly changes the internal state of a previously imported module. This happens because:
1. The first import of 'csv' (which then imports `_csv) creates module-specific state on the heap and associates it with the current sub-interpreter. The list of dialects, amongst other things, is in that state. 2. The 'del's wipe 'csv' and '_csv' from the cache. 3. The second import of 'csv' also creates/initializes a new '_csv' module because it's not in sys.modules. This *replaces* the per-sub-interpreter cached version of the module's state with the clean state of a new module
I would say this is pretty much expected. The converse would be a bug IMO (but perhaps Martin disagrees). PEP 3121's stated goal is not only subinterpreter support: "Extension module initialization currently has a few deficiencies. There is no cleanup for modules, the entry point name might give naming conflicts, the entry functions don't follow the usual calling convention, and multiple interpreters are not supported well." Re-initializing state when importing a module anew makes extension modules more like pure Python modules, which is a good thing. I think the piece of interpretation you offered yesterday on IRC may be the right explanation for the ET shenanigans: "Maybe the bug is that ParseError is kept in per-module state, and also exported from the module?" PEP 3121 doesn't offer any guidelines for using its API, and its example shows PyObject* fields in a module state. I'm starting to think that it might be a bad use of PEP 3121. PyObjects can, and therefore should be stored in the extension module dict where they will participate in normal resource management (i.e. garbage collection). If they are in the module dict, then they shouldn't be held alive by the module state too, otherwise the (currently tricky) lifetime management of extension modules can produce oddities. So, the PEP 3121 "module state" pointer (the optional opaque void* thing) should only be used to hold non-PyObjects. PyObjects should go to the module dict, like they do in normal Python modules. Now, the reason our PEP 3121 extension modules abuse the module state pointer to keep PyObjects is two-fold: 1. it's surprisingly easier (it's actually a one-liner if you don't handle errors - a rather bad thing, but all PEP 3121 extension modules currently don't handle a NULL return from PyState_FindModule...) 2. it protects the module from any module dict monkeypatching. It's not important if you are using a generic API on the PyObject, but it is if the PyObject is really a custom C type with well-defined fields. Those two issues can be addressed if we offer an API for it. How about: PyObject *PyState_GetModuleAttr(struct PyModuleDef *def, const char *name, PyObject *restrict_type) *def* is a pointer to the module definition. *name* is the attribute to look up on the module dict. *restrict_type*, if non-NULL, is a type object the looked up attribute must be an instance of. Lookup an attribute in the current interpreter's extension module instance for the module definition *def*. Returns a *new* reference (!), or NULL if an error occurred. An error can be: - no such module exists for the current interpreter (ImportError? RuntimeError? SystemError?) - no such attribute exists in the module dict (AttributeError) - the attribute doesn't conform to *restrict_type* (TypeError) So code can be written like: PyObject *dialects = PyState_GetModuleAttr( &_csvmodule, "dialects", &PyDict_Type); if (dialects == NULL) return NULL; Regards Antoine.
On Sun, 11 Aug 2013 12:33:16 +0200
Antoine Pitrou
So, the PEP 3121 "module state" pointer (the optional opaque void* thing) should only be used to hold non-PyObjects. PyObjects should go to the module dict, like they do in normal Python modules. Now, the reason our PEP 3121 extension modules abuse the module state pointer to keep PyObjects is two-fold:
1. it's surprisingly easier (it's actually a one-liner if you don't handle errors - a rather bad thing, but all PEP 3121 extension modules currently don't handle a NULL return from PyState_FindModule...)
2. it protects the module from any module dict monkeypatching. It's not important if you are using a generic API on the PyObject, but it is if the PyObject is really a custom C type with well-defined fields.
I overlooked a third reason which is performance. But, those lookups are generally not performance-critical. Regards Antoine.
On 11 August 2013 06:33, Antoine Pitrou
So code can be written like:
PyObject *dialects = PyState_GetModuleAttr( &_csvmodule, "dialects", &PyDict_Type); if (dialects == NULL) return NULL;
This sounds like a good near term solution to me. Longer term, I think there may be value in providing a richer extension module initialisation API that lets extension modules be represented as module *subclasses* in sys.modules, since that would get us to a position where it is possible to have *multiple* instances of an extension module in the *same* subinterpreter by holding on to external references after removing them from sys.modules (which is what we do in the test suite for pure Python modules). Enabling that also ties into the question of passing info to the extension module about how it is being loaded (e.g. as a submodule of a larger package), as well as allowing extension modules to cleanly handle reload(). However, that's dependent on the ModuleSpec idea we're currently thrashing out on import-sig (and should be able to bring to python-dev soon), and I think getting that integrated at all will be ambitious enough for 3.4 - using it to improve extension module handling would then be a project for 3.5. Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
On Sun, 11 Aug 2013 07:04:40 -0400
Nick Coghlan
On 11 August 2013 06:33, Antoine Pitrou
wrote: So code can be written like:
PyObject *dialects = PyState_GetModuleAttr( &_csvmodule, "dialects", &PyDict_Type); if (dialects == NULL) return NULL;
This sounds like a good near term solution to me.
Longer term, I think there may be value in providing a richer extension module initialisation API that lets extension modules be represented as module *subclasses* in sys.modules, since that would get us to a position where it is possible to have *multiple* instances of an extension module in the *same* subinterpreter by holding on to external references after removing them from sys.modules (which is what we do in the test suite for pure Python modules).
Either that, or add a "struct PyMemberDef *m_members" field to PyModuleDef, to enable looking up stuff in the m_state using regular attribute lookup. Unfortunately, doing so would probably break the ABI. Also, allowing for module subclasses is probably more flexible in the long term. We just need to devise a convenience API for that (perhaps by allowing to create both the subclass *and* instantiate it in a single call).
However, that's dependent on the ModuleSpec idea we're currently thrashing out on import-sig (and should be able to bring to python-dev soon), and I think getting that integrated at all will be ambitious enough for 3.4 - using it to improve extension module handling would then be a project for 3.5.
Sounds reasonable. Regards Antoine.
Antoine Pitrou, 11.08.2013 13:48:
On Sun, 11 Aug 2013 07:04:40 -0400 Nick Coghlan wrote:
On 11 August 2013 06:33, Antoine Pitrou wrote:
So code can be written like:
PyObject *dialects = PyState_GetModuleAttr( &_csvmodule, "dialects", &PyDict_Type); if (dialects == NULL) return NULL;
This sounds like a good near term solution to me.
Longer term, I think there may be value in providing a richer extension module initialisation API that lets extension modules be represented as module *subclasses* in sys.modules, since that would get us to a position where it is possible to have *multiple* instances of an extension module in the *same* subinterpreter by holding on to external references after removing them from sys.modules (which is what we do in the test suite for pure Python modules).
Either that, or add a "struct PyMemberDef *m_members" field to PyModuleDef, to enable looking up stuff in the m_state using regular attribute lookup.
Hmm, yes, it's unfortunate that the module state isn't just a public part of the object struct.
Unfortunately, doing so would probably break the ABI. Also, allowing for module subclasses is probably more flexible in the long term.
+1000
We just need to devise a convenience API for that (perhaps by allowing to create both the subclass *and* instantiate it in a single call).
Right. This conflicts somewhat with the simplified module creation. If the module loader passed the readily instantiated module instance into the module init function, then module subtypes don't fit into this scheme anymore. One more reason why modules shouldn't be special. Essentially, we need an m_new() and m_init() for them. And the lifetime of the module type would have to be linked to the (sub-)interpreter, whereas the lifetime of the module instance would be determined by whoever uses the module and/or decides to unload/reload it. Stefan
On Sun, 11 Aug 2013 14:16:10 +0200
Stefan Behnel
We just need to devise a convenience API for that (perhaps by allowing to create both the subclass *and* instantiate it in a single call).
Right. This conflicts somewhat with the simplified module creation. If the module loader passed the readily instantiated module instance into the module init function, then module subtypes don't fit into this scheme anymore.
One more reason why modules shouldn't be special. Essentially, we need an m_new() and m_init() for them. And the lifetime of the module type would have to be linked to the (sub-)interpreter, whereas the lifetime of the module instance would be determined by whoever uses the module and/or decides to unload/reload it.
It may be simpler if the only strong reference to the module type is in the module instance itself. Successive module initializations would get different types, but that shouldn't be a problem in practice. Regards Antoine.
Antoine Pitrou, 11.08.2013 14:32:
On Sun, 11 Aug 2013 14:16:10 +0200 Stefan Behnel wrote:
We just need to devise a convenience API for that (perhaps by allowing to create both the subclass *and* instantiate it in a single call).
Right. This conflicts somewhat with the simplified module creation. If the module loader passed the readily instantiated module instance into the module init function, then module subtypes don't fit into this scheme anymore.
One more reason why modules shouldn't be special. Essentially, we need an m_new() and m_init() for them. And the lifetime of the module type would have to be linked to the (sub-)interpreter, whereas the lifetime of the module instance would be determined by whoever uses the module and/or decides to unload/reload it.
It may be simpler if the only strong reference to the module type is in the module instance itself. Successive module initializations would get different types, but that shouldn't be a problem in practice.
Agreed. Then the module instance would just be the only instance of a new type that gets created each time the module initialised. Even if module subtypes were to become common place once they are generally supported (because they're the easiest way to store per-module state efficiently), module reinitialisation should be rare enough to just buy them with a new type for each. The size of the complete module state+dict will almost always outweigh the size of the one additional type by factors. Stefan
Stefan Behnel, 11.08.2013 14:48:
Antoine Pitrou, 11.08.2013 14:32:
On Sun, 11 Aug 2013 14:16:10 +0200 Stefan Behnel wrote:
We just need to devise a convenience API for that (perhaps by allowing to create both the subclass *and* instantiate it in a single call).
Right. This conflicts somewhat with the simplified module creation. If the module loader passed the readily instantiated module instance into the module init function, then module subtypes don't fit into this scheme anymore.
One more reason why modules shouldn't be special. Essentially, we need an m_new() and m_init() for them. And the lifetime of the module type would have to be linked to the (sub-)interpreter, whereas the lifetime of the module instance would be determined by whoever uses the module and/or decides to unload/reload it.
It may be simpler if the only strong reference to the module type is in the module instance itself. Successive module initializations would get different types, but that shouldn't be a problem in practice.
Agreed. Then the module instance would just be the only instance of a new type that gets created each time the module initialised. Even if module subtypes were to become common place once they are generally supported (because they're the easiest way to store per-module state efficiently), module reinitialisation should be rare enough to just buy them with a new type for each. The size of the complete module state+dict will almost always outweigh the size of the one additional type by factors.
BTW, this already suggests a simple module initialisation interface. The extension module would expose a function that returns a module type, and the loader/importer would then simply instantiate that. Nothing else is needed. Stefan
Stefan Behnel, 11.08.2013 14:53:
Stefan Behnel, 11.08.2013 14:48:
Antoine Pitrou, 11.08.2013 14:32:
On Sun, 11 Aug 2013 14:16:10 +0200 Stefan Behnel wrote:
We just need to devise a convenience API for that (perhaps by allowing to create both the subclass *and* instantiate it in a single call).
Right. This conflicts somewhat with the simplified module creation. If the module loader passed the readily instantiated module instance into the module init function, then module subtypes don't fit into this scheme anymore.
One more reason why modules shouldn't be special. Essentially, we need an m_new() and m_init() for them. And the lifetime of the module type would have to be linked to the (sub-)interpreter, whereas the lifetime of the module instance would be determined by whoever uses the module and/or decides to unload/reload it.
It may be simpler if the only strong reference to the module type is in the module instance itself. Successive module initializations would get different types, but that shouldn't be a problem in practice.
Agreed. Then the module instance would just be the only instance of a new type that gets created each time the module initialised. Even if module subtypes were to become common place once they are generally supported (because they're the easiest way to store per-module state efficiently), module reinitialisation should be rare enough to just buy them with a new type for each. The size of the complete module state+dict will almost always outweigh the size of the one additional type by factors.
BTW, this already suggests a simple module initialisation interface. The extension module would expose a function that returns a module type, and the loader/importer would then simply instantiate that. Nothing else is needed.
Actually, strike the word "module type" and replace it with "type". Is there really a reason why Python needs a module type at all? I mean, you can stick arbitrary objects in sys.modules, so why not allow arbitrary types to be returned by the module creation function? Stefan
On 11 Aug 2013 09:02, "Stefan Behnel"
Stefan Behnel, 11.08.2013 14:53:
Stefan Behnel, 11.08.2013 14:48:
Antoine Pitrou, 11.08.2013 14:32:
On Sun, 11 Aug 2013 14:16:10 +0200 Stefan Behnel wrote:
We just need to devise a convenience API for that (perhaps by allowing
create both the subclass *and* instantiate it in a single call).
Right. This conflicts somewhat with the simplified module creation. If the module loader passed the readily instantiated module instance into
module init function, then module subtypes don't fit into this scheme anymore.
One more reason why modules shouldn't be special. Essentially, we need an m_new() and m_init() for them. And the lifetime of the module type would have to be linked to the (sub-)interpreter, whereas the lifetime of
to the the
module instance would be determined by whoever uses the module and/or decides to unload/reload it.
It may be simpler if the only strong reference to the module type is in the module instance itself. Successive module initializations would get different types, but that shouldn't be a problem in practice.
Agreed. Then the module instance would just be the only instance of a new type that gets created each time the module initialised. Even if module subtypes were to become common place once they are generally supported (because they're the easiest way to store per-module state efficiently), module reinitialisation should be rare enough to just buy them with a new type for each. The size of the complete module state+dict will almost always outweigh the size of the one additional type by factors.
BTW, this already suggests a simple module initialisation interface. The extension module would expose a function that returns a module type, and the loader/importer would then simply instantiate that. Nothing else is needed.
Actually, strike the word "module type" and replace it with "type". Is there really a reason why Python needs a module type at all? I mean, you can stick arbitrary objects in sys.modules, so why not allow arbitrary types to be returned by the module creation function?
That's exactly what I have in mind, but the way extension module imports currently work means we can't easily do it just yet. Fortunately, importlib means we now have some hope of fixing that :) Cheers, Nick.
Stefan
_______________________________________________ Python-Dev mailing list Python-Dev@python.org http://mail.python.org/mailman/listinfo/python-dev Unsubscribe:
http://mail.python.org/mailman/options/python-dev/ncoghlan%40gmail.com
Nick Coghlan, 11.08.2013 15:19:
On 11 Aug 2013 09:02, "Stefan Behnel" wrote:
BTW, this already suggests a simple module initialisation interface. The extension module would expose a function that returns a module type, and the loader/importer would then simply instantiate that. Nothing else is needed.
Actually, strike the word "module type" and replace it with "type". Is there really a reason why Python needs a module type at all? I mean, you can stick arbitrary objects in sys.modules, so why not allow arbitrary types to be returned by the module creation function?
That's exactly what I have in mind, but the way extension module imports currently work means we can't easily do it just yet. Fortunately, importlib means we now have some hope of fixing that :)
Well, what do we need? We don't need to care about existing code, as long as the current scheme is only deprecated and not deleted. That won't happen before Py4 anyway. New code would simply export a different symbol when compiling for a CPython that supports it, which points to the function that returns the type. Then, there's already the PyType_Copy() function, which can be used to create a heap type from a statically defined type. So extension modules can simply define an (arbitrary) additional type in any way they see fit, copy it to the heap, and return it. Next, we need to define a signature for the type's __init__() method. This can be done in a future proof way by allowing arbitrary keyword arguments to be added, i.e. such a type must have a signature like def __init__(self, currently, used, pos, args, **kwargs) and simply ignore kwargs for now. Actually, we may get away with not passing all too many arguments here if we allow the importer to add stuff to the type's dict in between, specifically __file__, __path__ and friends, so that they are available before the type gets instantiated. Not sure if this is a good idea, but it would at least relieve the user from having to copy these things over from some kind of context or whatever we might want to pass in. Alternatively, we could split the instantiation up between tp_new() and tp_init(), and let the importer set stuff on the instance dict in between the two. But given that this context won't actually change once the shared library is loaded, the only reason to prefer modifying the instance instead of the type would be to avoid requiring a tp_dict for the type. Open for discussion, I guess. Did I forget anything? Sounds simple enough to me so far. Stefan
On Sun, Aug 11, 2013 at 6:52 AM, Stefan Behnel
Nick Coghlan, 11.08.2013 15:19:
On 11 Aug 2013 09:02, "Stefan Behnel" wrote:
BTW, this already suggests a simple module initialisation interface. The extension module would expose a function that returns a module type, and the loader/importer would then simply instantiate that. Nothing else is needed.
Actually, strike the word "module type" and replace it with "type". Is there really a reason why Python needs a module type at all? I mean, you can stick arbitrary objects in sys.modules, so why not allow arbitrary types to be returned by the module creation function?
That's exactly what I have in mind, but the way extension module imports currently work means we can't easily do it just yet. Fortunately, importlib means we now have some hope of fixing that :)
Well, what do we need? We don't need to care about existing code, as long as the current scheme is only deprecated and not deleted. That won't happen before Py4 anyway. New code would simply export a different symbol when compiling for a CPython that supports it, which points to the function that returns the type.
Then, there's already the PyType_Copy() function, which can be used to create a heap type from a statically defined type. So extension modules can simply define an (arbitrary) additional type in any way they see fit, copy it to the heap, and return it.
Next, we need to define a signature for the type's __init__() method. This can be done in a future proof way by allowing arbitrary keyword arguments to be added, i.e. such a type must have a signature like
def __init__(self, currently, used, pos, args, **kwargs)
and simply ignore kwargs for now.
Actually, we may get away with not passing all too many arguments here if we allow the importer to add stuff to the type's dict in between, specifically __file__, __path__ and friends, so that they are available before the type gets instantiated. Not sure if this is a good idea, but it would at least relieve the user from having to copy these things over from some kind of context or whatever we might want to pass in.
Alternatively, we could split the instantiation up between tp_new() and tp_init(), and let the importer set stuff on the instance dict in between the two. But given that this context won't actually change once the shared library is loaded, the only reason to prefer modifying the instance instead of the type would be to avoid requiring a tp_dict for the type. Open for discussion, I guess.
Did I forget anything? Sounds simple enough to me so far.
Out of curiosity - can we list actual use cases for this new design? The previous thread, admittedly, deals with an isoteric corner-cases that comes up in overly-clever tests. If we plan to serious consider these changes - and this appears to be worth a PEP - we need a list of actual advantages over the current approach. It's not that a more conceptually pure design is an insufficient reason, IMHO, but it would be interesting to hear about other implications. Eli
Eli Bendersky, 11.08.2013 19:43:
Out of curiosity - can we list actual use cases for this new design? The previous thread, admittedly, deals with an isoteric corner-cases that comes up in overly-clever tests. If we plan to serious consider these changes - and this appears to be worth a PEP - we need a list of actual advantages over the current approach. It's not that a more conceptually pure design is an insufficient reason, IMHO, but it would be interesting to hear about other implications.
http://mail.python.org/pipermail/python-dev/2012-November/122599.html http://bugs.python.org/issue13429 http://bugs.python.org/issue16392 Yes, it definitely needs a PEP. Stefan
On 11 Aug 2013 09:55, "Stefan Behnel"
Nick Coghlan, 11.08.2013 15:19:
On 11 Aug 2013 09:02, "Stefan Behnel" wrote:
BTW, this already suggests a simple module initialisation interface.
extension module would expose a function that returns a module type, and the loader/importer would then simply instantiate that. Nothing else is needed.
Actually, strike the word "module type" and replace it with "type". Is there really a reason why Python needs a module type at all? I mean, you can stick arbitrary objects in sys.modules, so why not allow arbitrary types to be returned by the module creation function?
That's exactly what I have in mind, but the way extension module imports currently work means we can't easily do it just yet. Fortunately, importlib means we now have some hope of fixing that :)
Well, what do we need? We don't need to care about existing code, as long as the current scheme is only deprecated and not deleted. That won't happen before Py4 anyway. New code would simply export a different symbol when compiling for a CPython that supports it, which points to the function
The that
returns the type.
Then, there's already the PyType_Copy() function, which can be used to create a heap type from a statically defined type. So extension modules can simply define an (arbitrary) additional type in any way they see fit, copy it to the heap, and return it.
Next, we need to define a signature for the type's __init__() method.
We need the "ModuleSpec" object to pass here, which is what we're currently working on in import-sig. We're not going to define something specifically for C extensions when other modules suffer related problems. Cheers, Nick. This
can be done in a future proof way by allowing arbitrary keyword arguments to be added, i.e. such a type must have a signature like
def __init__(self, currently, used, pos, args, **kwargs)
and simply ignore kwargs for now.
Actually, we may get away with not passing all too many arguments here if we allow the importer to add stuff to the type's dict in between, specifically __file__, __path__ and friends, so that they are available before the type gets instantiated. Not sure if this is a good idea, but it would at least relieve the user from having to copy these things over from some kind of context or whatever we might want to pass in.
Alternatively, we could split the instantiation up between tp_new() and tp_init(), and let the importer set stuff on the instance dict in between the two. But given that this context won't actually change once the shared library is loaded, the only reason to prefer modifying the instance instead of the type would be to avoid requiring a tp_dict for the type. Open for discussion, I guess.
Did I forget anything? Sounds simple enough to me so far.
Stefan
_______________________________________________ Python-Dev mailing list Python-Dev@python.org http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/ncoghlan%40gmail.com
Nick Coghlan, 12.08.2013 00:41:
On 11 Aug 2013 09:55, "Stefan Behnel" wrote:
this already suggests a simple module initialisation interface. The extension module would expose a function that returns a module type, and the loader/importer would then simply instantiate that. Nothing else is needed. Actually, strike the word "module type" and replace it with "type". [...] Next, we need to define a signature for the type's __init__() method.
We need the "ModuleSpec" object to pass here, which is what we're currently working on in import-sig.
Ok but that's just the very final step. All the rest is C-API specific. And for clarification: you want to let the importer create the ModuleSpec object and the pass it into the module's __init__ method? I guess it could also be passed into the type creation function then, right? Since it wouldn't harm to do that, I think it's a good idea to provide as much information to the extension module as possible, as early as we can, and that's the first time we talk to the shared library. I've started writing up a pre-PEP that describes this protocol. I think it makes sense to keep it separate from the ModuleSpec PEP as the latter can easily be accepted without changing anything at the C-API level, but it shouldn't happen the other way round. Stefan
Antoine Pitrou, 11.08.2013 12:33:
On Sat, 10 Aug 2013 17:12:53 -0700 Eli Bendersky wrote:
Note how doing some sys.modules acrobatics and re-importing suddenly changes the internal state of a previously imported module. This happens because:
1. The first import of 'csv' (which then imports `_csv) creates module-specific state on the heap and associates it with the current sub-interpreter. The list of dialects, amongst other things, is in that state. 2. The 'del's wipe 'csv' and '_csv' from the cache. 3. The second import of 'csv' also creates/initializes a new '_csv' module because it's not in sys.modules. This *replaces* the per-sub-interpreter cached version of the module's state with the clean state of a new module
I would say this is pretty much expected. The converse would be a bug IMO (but perhaps Martin disagrees). PEP 3121's stated goal is not only subinterpreter support:
"Extension module initialization currently has a few deficiencies. There is no cleanup for modules, the entry point name might give naming conflicts, the entry functions don't follow the usual calling convention, and multiple interpreters are not supported well."
Re-initializing state when importing a module anew makes extension modules more like pure Python modules, which is a good thing.
It's the same as defining a type or function in a loop, or inside of a closure. The whole point of reimporting is that you get a new module. However, it should not change the content of the old module, just create a new one.
So, the PEP 3121 "module state" pointer (the optional opaque void* thing) should only be used to hold non-PyObjects. PyObjects should go to the module dict, like they do in normal Python modules. Now, the reason our PEP 3121 extension modules abuse the module state pointer to keep PyObjects is two-fold:
1. it's surprisingly easier (it's actually a one-liner if you don't handle errors - a rather bad thing, but all PEP 3121 extension modules currently don't handle a NULL return from PyState_FindModule...)
2. it protects the module from any module dict monkeypatching. It's not important if you are using a generic API on the PyObject, but it is if the PyObject is really a custom C type with well-defined fields.
Yes, it's a major safety problem if you can crash the interpreter by assigning None to a module attribute.
Those two issues can be addressed if we offer an API for it. How about:
PyObject *PyState_GetModuleAttr(struct PyModuleDef *def, const char *name, PyObject *restrict_type)
*def* is a pointer to the module definition. *name* is the attribute to look up on the module dict. *restrict_type*, if non-NULL, is a type object the looked up attribute must be an instance of.
Lookup an attribute in the current interpreter's extension module instance for the module definition *def*. Returns a *new* reference (!), or NULL if an error occurred. An error can be: - no such module exists for the current interpreter (ImportError? RuntimeError? SystemError?) - no such attribute exists in the module dict (AttributeError) - the attribute doesn't conform to *restrict_type* (TypeError)
So code can be written like:
PyObject *dialects = PyState_GetModuleAttr( &_csvmodule, "dialects", &PyDict_Type); if (dialects == NULL) return NULL;
At least for Cython it's unlikely that it'll ever use this. It's just way too much overhead for looking up a global name. Plus, not all global names are visible in the module dict, e.g. it's common to have types that are only used internally to keep some kind of state. Those would still have to live in the internal per-module state. ISTM that this is not a proper solution for the problem, because it only covers the simple use cases. Rather, I'd prefer making the handling of names in the per-module instance state safer. Essentially, with PEP 3121, modules are just one form of an extension type. So what's wrong with giving them normal extension type fields? Functions are essentially methods of the module, global types are just inner classes. Both should keep the module alive (on the one side) and be tied to it (on the other side). If you reimport a module, you'd get a new set of everything, and the old module would just linger in the background until the last reference to it dies. In other words, I don't see why modules should be any special. Stefan
On Sun, Aug 11, 2013 at 3:33 AM, Antoine Pitrou
Hi Eli,
On Sat, 10 Aug 2013 17:12:53 -0700 Eli Bendersky
wrote: Note how doing some sys.modules acrobatics and re-importing suddenly changes the internal state of a previously imported module. This happens because:
1. The first import of 'csv' (which then imports `_csv) creates module-specific state on the heap and associates it with the current sub-interpreter. The list of dialects, amongst other things, is in that state. 2. The 'del's wipe 'csv' and '_csv' from the cache. 3. The second import of 'csv' also creates/initializes a new '_csv'
module
because it's not in sys.modules. This *replaces* the per-sub-interpreter cached version of the module's state with the clean state of a new module
I would say this is pretty much expected.
I'm struggling to see how it's expected. The two imported csv modules are different (i.e. different id() of members), and yet some state is shared between them. I think the root reason for it is that "PyModuleDev _csvmodule" is uniqued per interpreter, not per module instance. Even if dialects were not a PyObject, this would still be problematic, don't you think? And note that here, unlike the ET.ParseError case, I don't think the problem is exporting internal per-module state as a module attribute. The following two are un-reconcilable, IMHO: 1. Wanting to have two instances of the same module in the same interpterer. 2. Using a global shared PyModuleDef between all instances of the same module in the same interpterer.
The converse would be a bug IMO (but perhaps Martin disagrees). PEP 3121's stated goal is not only subinterpreter support:
"Extension module initialization currently has a few deficiencies. There is no cleanup for modules, the entry point name might give naming conflicts, the entry functions don't follow the usual calling convention, and multiple interpreters are not supported well."
Re-initializing state when importing a module anew makes extension modules more like pure Python modules, which is a good thing.
I think the piece of interpretation you offered yesterday on IRC may be the right explanation for the ET shenanigans:
"Maybe the bug is that ParseError is kept in per-module state, and also exported from the module?"
PEP 3121 doesn't offer any guidelines for using its API, and its example shows PyObject* fields in a module state.
I'm starting to think that it might be a bad use of PEP 3121. PyObjects can, and therefore should be stored in the extension module dict where they will participate in normal resource management (i.e. garbage collection). If they are in the module dict, then they shouldn't be held alive by the module state too, otherwise the (currently tricky) lifetime management of extension modules can produce oddities.
So, the PEP 3121 "module state" pointer (the optional opaque void* thing) should only be used to hold non-PyObjects. PyObjects should go to the module dict, like they do in normal Python modules. Now, the reason our PEP 3121 extension modules abuse the module state pointer to keep PyObjects is two-fold:
1. it's surprisingly easier (it's actually a one-liner if you don't handle errors - a rather bad thing, but all PEP 3121 extension modules currently don't handle a NULL return from PyState_FindModule...)
2. it protects the module from any module dict monkeypatching. It's not important if you are using a generic API on the PyObject, but it is if the PyObject is really a custom C type with well-defined fields.
Those two issues can be addressed if we offer an API for it. How about:
PyObject *PyState_GetModuleAttr(struct PyModuleDef *def, const char *name, PyObject *restrict_type)
*def* is a pointer to the module definition. *name* is the attribute to look up on the module dict. *restrict_type*, if non-NULL, is a type object the looked up attribute must be an instance of.
Lookup an attribute in the current interpreter's extension module instance for the module definition *def*. Returns a *new* reference (!), or NULL if an error occurred. An error can be: - no such module exists for the current interpreter (ImportError? RuntimeError? SystemError?) - no such attribute exists in the module dict (AttributeError) - the attribute doesn't conform to *restrict_type* (TypeError)
So code can be written like:
PyObject *dialects = PyState_GetModuleAttr( &_csvmodule, "dialects", &PyDict_Type); if (dialects == NULL) return NULL;
On Sun, 11 Aug 2013 06:26:55 -0700
Eli Bendersky
On Sun, Aug 11, 2013 at 3:33 AM, Antoine Pitrou
wrote: Hi Eli,
On Sat, 10 Aug 2013 17:12:53 -0700 Eli Bendersky
wrote: Note how doing some sys.modules acrobatics and re-importing suddenly changes the internal state of a previously imported module. This happens because:
1. The first import of 'csv' (which then imports `_csv) creates module-specific state on the heap and associates it with the current sub-interpreter. The list of dialects, amongst other things, is in that state. 2. The 'del's wipe 'csv' and '_csv' from the cache. 3. The second import of 'csv' also creates/initializes a new '_csv'
module
because it's not in sys.modules. This *replaces* the per-sub-interpreter cached version of the module's state with the clean state of a new module
I would say this is pretty much expected.
I'm struggling to see how it's expected. The two imported csv modules are different (i.e. different id() of members), and yet some state is shared between them.
There are two csv modules, but there are not two _csv modules. Extension modules are currently immortal until the end of the interpreter:
csv = __import__('csv') wcsv = weakref.ref(csv) w_csv = weakref.ref(sys.modules['_csv']) del sys.modules['csv'] del sys.modules['_csv'] del csv gc.collect() 50 wcsv() w_csv()
So, "sharing" a state is pretty much expected, since you are re-initializating an existing module. (but the module does get re-initialized, which is the point of PEP 3121)
1. Wanting to have two instances of the same module in the same interpterer.
It could be nice, but really, that's not a common use case. And it's impossible for extension modules, currently. Regards Antoine.
On Sun, Aug 11, 2013 at 6:40 AM, Antoine Pitrou
On Sun, Aug 11, 2013 at 3:33 AM, Antoine Pitrou
wrote: Hi Eli,
On Sat, 10 Aug 2013 17:12:53 -0700 Eli Bendersky
wrote: Note how doing some sys.modules acrobatics and re-importing suddenly changes the internal state of a previously imported module. This
happens
because:
1. The first import of 'csv' (which then imports `_csv) creates module-specific state on the heap and associates it with the current sub-interpreter. The list of dialects, amongst other things, is in
state. 2. The 'del's wipe 'csv' and '_csv' from the cache. 3. The second import of 'csv' also creates/initializes a new '_csv' module because it's not in sys.modules. This *replaces* the
On Sun, 11 Aug 2013 06:26:55 -0700 Eli Bendersky
wrote: that per-sub-interpreter cached version of the module's state with the clean state of a new module
I would say this is pretty much expected.
I'm struggling to see how it's expected. The two imported csv modules are different (i.e. different id() of members), and yet some state is shared between them.
There are two csv modules, but there are not two _csv modules. Extension modules are currently immortal until the end of the interpreter:
csv = __import__('csv') wcsv = weakref.ref(csv) w_csv = weakref.ref(sys.modules['_csv']) del sys.modules['csv'] del sys.modules['_csv'] del csv gc.collect() 50 wcsv() w_csv()
So, "sharing" a state is pretty much expected, since you are re-initializating an existing module. (but the module does get re-initialized, which is the point of PEP 3121)
Yes, you're right - this is an oversight on my behalf. Indeed, the extensions dict in import.c keeps it alive once loaded, and only ever gets cleaned up in Py_Finalize. Eli
On Sun, 11 Aug 2013 08:49:56 -0700
Eli Bendersky
On Sun, Aug 11, 2013 at 6:40 AM, Antoine Pitrou
wrote: On Sun, Aug 11, 2013 at 3:33 AM, Antoine Pitrou
wrote: Hi Eli,
On Sat, 10 Aug 2013 17:12:53 -0700 Eli Bendersky
wrote: Note how doing some sys.modules acrobatics and re-importing suddenly changes the internal state of a previously imported module. This
happens
because:
1. The first import of 'csv' (which then imports `_csv) creates module-specific state on the heap and associates it with the current sub-interpreter. The list of dialects, amongst other things, is in
state. 2. The 'del's wipe 'csv' and '_csv' from the cache. 3. The second import of 'csv' also creates/initializes a new '_csv' module because it's not in sys.modules. This *replaces* the
On Sun, 11 Aug 2013 06:26:55 -0700 Eli Bendersky
wrote: that per-sub-interpreter cached version of the module's state with the clean state of a new module
I would say this is pretty much expected.
I'm struggling to see how it's expected. The two imported csv modules are different (i.e. different id() of members), and yet some state is shared between them.
There are two csv modules, but there are not two _csv modules. Extension modules are currently immortal until the end of the interpreter:
csv = __import__('csv') wcsv = weakref.ref(csv) w_csv = weakref.ref(sys.modules['_csv']) del sys.modules['csv'] del sys.modules['_csv'] del csv gc.collect() 50 wcsv() w_csv()
So, "sharing" a state is pretty much expected, since you are re-initializating an existing module. (but the module does get re-initialized, which is the point of PEP 3121)
Yes, you're right - this is an oversight on my behalf. Indeed, the extensions dict in import.c keeps it alive once loaded, and only ever gets cleaned up in Py_Finalize.
It's not the extensions dict in import.c, it's modules_by_index in the interpreter state. (otherwise it wouldn't be per-interpreter) The extensions dict holds the module *definition* (the struct PyModuleDef), not the module instance. Regards Antoine.
On Sun, Aug 11, 2013 at 8:56 AM, Antoine Pitrou
On Sun, 11 Aug 2013 08:49:56 -0700 Eli Bendersky
wrote:
Yes, you're right - this is an oversight on my behalf. Indeed, the extensions dict in import.c keeps it alive once loaded, and only ever gets cleaned up in Py_Finalize.
On Sun, Aug 11, 2013 at 6:40 AM, Antoine Pitrou
wrote: On Sun, 11 Aug 2013 06:26:55 -0700 Eli Bendersky
wrote: On Sun, Aug 11, 2013 at 3:33 AM, Antoine Pitrou
wrote:
Hi Eli,
On Sat, 10 Aug 2013 17:12:53 -0700 Eli Bendersky
wrote: Note how doing some sys.modules acrobatics and re-importing
changes the internal state of a previously imported module. This happens because:
1. The first import of 'csv' (which then imports `_csv) creates module-specific state on the heap and associates it with the current sub-interpreter. The list of dialects, amongst other things, is in
state. 2. The 'del's wipe 'csv' and '_csv' from the cache. 3. The second import of 'csv' also creates/initializes a new '_csv' module because it's not in sys.modules. This *replaces* the
suddenly that per-sub-interpreter
cached version of the module's state with the clean state of a new module
I would say this is pretty much expected.
I'm struggling to see how it's expected. The two imported csv modules are different (i.e. different id() of members), and yet some state is shared between them.
There are two csv modules, but there are not two _csv modules. Extension modules are currently immortal until the end of the interpreter:
csv = __import__('csv') wcsv = weakref.ref(csv) w_csv = weakref.ref(sys.modules['_csv']) del sys.modules['csv'] del sys.modules['_csv'] del csv gc.collect() 50 wcsv() w_csv()
So, "sharing" a state is pretty much expected, since you are re-initializating an existing module. (but the module does get re-initialized, which is the point of PEP
It's not the extensions dict in import.c, it's modules_by_index in the interpreter state. (otherwise it wouldn't be per-interpreter)
The extensions dict holds the module *definition* (the struct PyModuleDef), not the module instance.
Thanks for the clarification. Eli
participants (4)
-
Antoine Pitrou
-
Eli Bendersky
-
Nick Coghlan
-
Stefan Behnel