Mixed modules for both PyPy and CPython
Hi all,

This is a follow-up on last night's IRC discussion. It's come to a point where e-mail is better. The topic is "WP3", i.e. how to write built-in modules in Python that work for both CPython and PyPy ("built-in" == "extension" == "compileable" for the purpose of this mail). There are two levels of issues involved:

* In which style should the modules be written in the first place?
* Which style is easier to support in both CPython and PyPy given what we have now?

In PyPy, we are using "mixed modules" with an interp-level and an app-level part, both optional, with explicit ways to interact between them. The interp-level part is what gets compiled to C. It contains code like 'space.add(w_1, w_2)' to perform an app-level addition.

Christian is working on an alternative, a single-source approach where the annotator's failures to derive types are taken as a hint that the corresponding variables should contain wrapped objects. This is a convenient but hard-to-control extension of a historical hack, which works mostly because we needed it while we were still developing the annotator and the rtyper.

I am clearly biased towards "solution 1", which is to reuse the mixed-modules style that we have evolved over several years. Here is some comparison of the implementation effort between the two styles (1 = mixed module, 2 = single source).

It is easy to develop and test in 2, as it runs on CPython with no further effort. For 1, the module developer needs the whole of PyPy to test his complete module. We could ease this by writing a dummy interface-compatible gateway and object space, performing sanity checks and giving useful error messages. Then the developer only needs to check his code against this.
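To illustrate, such a sanity-checking space could start as small as this (a toy sketch with invented names -- `W_Object`, `CheckingObjSpace` -- not PyPy's real object space interface):

```python
# A toy "checking" object space: wraps values in a marker class and
# verifies that interp-level code only passes wrapped objects to
# space operations.  Purely illustrative; PyPy's real object space
# interface is much richer.

class W_Object:
    """Marker for wrapped (app-level) objects."""
    def __init__(self, value):
        self.value = value

class CheckingObjSpace:
    def wrap(self, value):
        return W_Object(value)

    def unwrap(self, w_obj):
        self._check(w_obj)
        return w_obj.value

    def add(self, w_1, w_2):
        self._check(w_1)
        self._check(w_2)
        return W_Object(w_1.value + w_2.value)

    def _check(self, w_obj):
        # the useful error message the dummy space is there to give
        if not isinstance(w_obj, W_Object):
            raise TypeError("expected a wrapped object, got %r" % (w_obj,))

space = CheckingObjSpace()
w_result = space.add(space.wrap(2), space.wrap(3))
print(space.unwrap(w_result))  # 5
```

Passing an unwrapped value to space.add() then fails immediately with a clear TypeError, instead of surfacing much later during translation.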
Annotation problems in the module are easier to spot early in model 1; indeed, the fact that with 2 we cannot gracefully crash on SomeObjects, and moreover the need for many fragile-looking small extensions to the flow object space and the annotator, is what makes me most uneasy about 2.

The mixed modules are designed for PyPy, so they already work there now. For the single-source approach to work on PyPy, however, there would be many interesting and delicate problems -- both for py.py and for the translated pypy-c. I'd rather avoid thinking about it too much right now :-)

For translation, for 2 we already have the basic machinery implemented as a leftover from PyPy's "early ages". But I want to convince you that the basic support for 1 is extremely easy, or will soon be. We need a new object space to compile with the mixed module; but this "CPyObjSpace" is completely trivial. It could be based on rctypes, in which case it would look like this:

    class CPyObjSpace:
        def newint(self, intval):
            return ctypes.pydll.PyInt_FromLong(intval)
        def add(self, w_1, w_2):
            return ctypes.pydll.PyNumber_Add(w_1, w_2)
        ...

Note that this even works on top of CPython! (Not out-of-the-box on Linux, however, where for obscure reasons CPython is by default not compiled as a separate .so file...) The gateway code can be written in the same style, using ctypes.pydll.PyObject_Call() to call the app-level parts, and ctypes callbacks for the reverse direction. Calls like 'ctypes.pydll.PyInt_FromLong()' return an instance of 'ctypes.py_object', which naturally plays the role of a wrapped object for the CPyObjSpace.

The more difficult bits of translation are e.g. support for creating, at interp-level, types that are exposed to app-level, in particular the special methods __add__, __getitem__ etc. There is an example of that in pypy.module.Numeric.interp_numeric, where __getitem__ is added to the TypeDef of "array".
This creates some difficulty that is common to approaches 1 and 2, and which I think Christian is also working on. In 1 we would probably make the TypeDef declaration turn, for CPython, into static PyTypeObject structures. The special methods can be recognized and put into the proper slots, with a bit of wrapping. The whole PyTypeObject structure with its pointers to functions could be expressed with ctypes too:

    # use pypy.rpython.rctypes.ctypes_platform to get at the layout
    PyTypeObject = ctypes_platform.getstruct("PyTypeObject *", ...)

    def TypeDef(name, **rawdict):
        mytype = PyTypeObject()
        mytype.tp_name = name
        if '__getitem__' in rawdict:
            # build a ctypes callback
            mp_subscript = callback_wrapper(rawdict['__getitem__'])
            mytype.tp_as_mapping.mp_subscript = mp_subscript
        return mytype

This gives us prebuilt PyTypeObject structures in ctypes, which get compiled into the C code by rctypes. At the same time, running on top of CPython is still possible; ctypes will let you issue the following kind of call dynamically when running on CPython:

    myinstance = ctypes.pydll.PyObject_Call(mytype)

The same line translated by rctypes becomes, in the C file:

    PyObject* myinstance = PyObject_Call(mytype);

Of course this latter part depends on rctypes being completed first. I understand and respect Christian's need for something that works right now; nevertheless, at this point I think the mixed-modules approach is the best medium-term solution. This is all open to discussion, of course. ...but do keep in mind that the people who completed the annotator as it is now are not likely to appreciate the addition of Yet More Ad-Hoc Heuristics all over the place :-/

A bientot,

Armin
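P.S. To make the slot-recognition idea a bit more concrete without ctypes, here is a toy model of the TypeDef-to-slots mapping (`FakeTypeObject` and `SLOT_NAMES` are invented for illustration; the real PyTypeObject layout and CPython's slot tables are of course much richer):

```python
# Toy model of mapping TypeDef special methods onto C-level slots.
# FakeTypeObject stands in for the ctypes PyTypeObject structure;
# SLOT_NAMES mirrors a tiny part of CPython's name-to-slot mapping.

SLOT_NAMES = {
    '__getitem__': 'mp_subscript',
    '__len__':     'mp_length',
    '__add__':     'nb_add',
}

class FakeTypeObject:
    def __init__(self, name):
        self.tp_name = name
        self.slots = {}      # slot name -> callable

def TypeDef(name, **rawdict):
    mytype = FakeTypeObject(name)
    for methname, func in rawdict.items():
        if methname in SLOT_NAMES:
            # in the real thing this would be a ctypes callback wrapper
            mytype.slots[SLOT_NAMES[methname]] = func
    return mytype

# an "array" type whose __getitem__ lands in the mp_subscript slot
array_type = TypeDef('array', __getitem__=lambda self, i: i * 2)
print(array_type.slots['mp_subscript'](None, 21))  # 42
```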
Re-hi all,

A clarification on a point that caused confusion... The purpose of the whole discussion was how to have a "write-once, run-everywhere" approach: developing an extension module a single time and then compiling it for either CPython or PyPy. This source code should not need to know whether it will be compiled for PyPy or CPython (or even just run on top of the CPython interpreter for testing). The two approaches I opposed are two ways to do that.

Some confusion came from the names of the two approaches, which I took from IRC. One approach is "single-source": this just means "source-in-a-single-file", with wrapped objects and native values mixed in the same source code. (It's not in the sense of "write the source a single time"; both approaches are about that.) By opposition, "mixed module" means "source-in-two-files", one running at interp-level and one at app-level. This is what we're doing in pypy/module/*/{app,interp}_*.py.

Holger suggests calling "single-source" the whole approach of writing once and running everywhere, which makes sense. If we do so, then we need better names for the two approaches. What about "explicit" versus "implicit" levels? Mixed modules provide explicit level separation, whereas the alternative is to rely on the annotator to separate the levels.

I'd also like to point out that the latter -- implicit levels -- has a kind of elegance to it that I like. What I dislike, though, beyond the fact that it would require yet another refactoring of PyPy's module approach (and in a way that is quite unclear), is that it requires many additional fragile rules and heuristics all over the flow objspace and the annotator to get the "expected" level-separation effects. I'd be happy to experiment more in that direction, but I believe that the other direction's work is needed for now and won't be lost anyway. Basically, we can still come back to this and think later about ways to allow more implicit-level manipulations of objects.
A bientot, Armin.
Hi Armin, hi all, On Wed, Apr 12, 2006 at 13:48 +0200, Armin Rigo wrote:
The purpose of the whole discussion was about how to have a "write-once, run-everywhere" approach of developing an extension module a single time and then compiling it for either CPython or PyPy. This source code should not need to know if it will be compiled for PyPy or CPython (or even just run on top of the CPython interpreter for testing).
... Mixed modules provide explicit level separation, whereas the alternative is to rely on the annotator to separate the levels.
Hum, i wonder how strongly opposed these explicit versus implicit level-separation models need to be. Is it not possible to support a programming model that can mostly avoid knowing about the interpreter-level versus application-level distinction without extending/refactoring the annotator? Maybe we could identify some interesting interaction use cases and try to support them explicitly, e.g. publishing an rpython-level class in CPython by providing glue code between the rpython class and its CPython representation?

However, having an underlying interp/app separation with 'space', wrapped objects and exceptions still seems like a very good (proven) foundation on which to build such glue code. (Btw, i wouldn't mind if such glue code would not allow all possible interactions - our primary goal is not to provide seamless integration with CPython here.)

And for the planned June PyPy release we likely want to have the "explicit" approach nicely working before experimenting with where we can go from there, right?

best,

holger
Hi Holger, On Fri, Apr 14, 2006 at 07:44:12PM +0200, holger krekel wrote:
Hum, i wonder how strongly opposed these explicit versus implicit level separation models need to be.
Yes, you're right here. It's mostly about what we need to do next: we must choose one of the two models and develop it enough, until it becomes useful for PyPy and CPython alike. We could possibly do both models in parallel, but I'm not sure it's the best way forward at this point.
Is it not possible to support a programming model that can mostly avoid knowing about interpreter versus application level distinction without extending/refactoring the annotator?
Sure, no-one is thinking about refactoring the annotator. The implicit model already works, using the existing support for SomeObjects, completed by Christian over time. It's a bit hackish, though, and we'll definitely need ways to control where SomeObjects are expected or not. At the moment, what makes me reluctant to continue with the implicit model are two other issues: on the one hand, it's unclear how it would work for PyPy (it works for CPython only); and there are many language design issues ahead that I'd rather avoid for the time being.
"explicit" approach nicely working before experimenting with where we can go from there, right?
Yes, exactly my opinion.
(Btw, i wouldn't mind if such glue code would not allow all possible interactions - our primary goal is not to provide seamless integration with CPython here).
I'm not too worried about this. Our mixed-module model already supports almost any kind of interaction, including defining new types with properties and overridden operations. The path to supporting the same for CPython extension modules is more or less clear, and very incremental.

A bientot,

Armin.
Hi Armin, On Sat, Apr 15, 2006 at 15:09 +0200, Armin Rigo wrote:
On Fri, Apr 14, 2006 at 07:44:12PM +0200, holger krekel wrote:
Hum, i wonder how strongly opposed these explicit versus implicit level separation models need to be.
Yes, you're right here. It's mostly about what we need to do next: we must choose one of the two models and develop it enough, until it becomes useful for PyPy and CPython alike. We could possibly do both models in parallel, but I'm not sure it's the best way forward at this point.
Is it not possible to support a programming model that can mostly avoid knowing about interpreter versus application level distinction without extending/refactoring the annotator?
Sure, no-one is thinking about refactoring the annotator. The implicit model already works, using the existing support for SomeObjects, completed by Christian over time. It's a bit hackish, though, and we'll definitely need ways to control where SomeObjects are expected or not. At the moment, what makes me reluctant to continue with the implicit model are two other issues: on the one hand, it's unclear how it would work for PyPy (it works for CPython only); and there are many language design issues ahead that I'd rather avoid for the time being.
I agree but may have a somewhat different idea in mind when talking about a more implicit model: namely assuming that all objects live within the current RPython model (no SomeObjects whatsoever) and providing explicit interactions (like gateway.interp2app), exposing of type definitions etc.
"explicit" approach nicely working before experimenting with where we can go from there, right?
Yes, exactly my opinion.
(Btw, i wouldn't mind if such glue code would not allow all possible interactions - our primary goal is not to provide seamless integration with CPython here).
I'm not too worried about this. Our mixed-module model already supports almost any kind of interaction, including defining new types with properties and overridden operations. The path to supporting the same for CPython extension modules is more or less clear, and very incremental.
Yes, the mixed modules (and interpreter/gateway's, typedef's) support interaction, but by rather explicitly programming the machinery. With "glue code" i mean code where the user does not need to know about such machinery so much. IOW, the question is which implicit models (as seen from the ext-module programmer) are possible without having SomeObjects around?

holger
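P.S. The kind of glue i have in mind might look roughly like this (a toy sketch; `expose` and `SimpleSpace` are invented names for illustration, not PyPy's actual gateway code):

```python
# Sketch of glue that hides the wrapping machinery from the module
# author: an interp-level function written in plain style is exposed
# to "app-level" callers without the author touching wrap/unwrap.
# All names here are invented for this illustration.

class SimpleSpace:
    """Minimal stand-in for an object space: tags values as wrapped."""
    def wrap(self, value):
        return ('wrapped', value)
    def unwrap(self, w_obj):
        assert w_obj[0] == 'wrapped'
        return w_obj[1]

def expose(space):
    """Decorator: adapt an interp-level function to app-level calls."""
    def decorator(func):
        def app_level_version(*w_args):
            args = [space.unwrap(w) for w in w_args]
            return space.wrap(func(*args))
        return app_level_version
    return decorator

space = SimpleSpace()

@expose(space)
def double(x):          # written without any space/wrapping knowledge
    return x * 2

w_result = double(space.wrap(21))
print(space.unwrap(w_result))  # 42
```

The point is that the body of double() stays ignorant of the machinery; only the glue layer knows about the space.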
Hi Holger, On Sat, Apr 15, 2006 at 03:58:09PM +0200, holger krekel wrote:
I agree but may have a somewhat different idea in mind when talking about a more implicit model: (...)
Ok, then we agree everywhere -- short of a confusing terminology: we gave the names "implicit levels" and "explicit levels" to very precise things and now you're calling "an implicit model" something that is in between :-) A bientot, Armin
On Sat, Apr 15, 2006 at 18:23 +0200, Armin Rigo wrote:
On Sat, Apr 15, 2006 at 03:58:09PM +0200, holger krekel wrote:
I agree but may have a somewhat different idea in mind when talking about a more implicit model: (...)
Ok, then we agree everywhere -- short of a confusing terminology: we gave the names "implicit levels" and "explicit levels" to very precise things and now you're calling "an implicit model" something that is inbetween :-)
indeed, i wasn't quite explicit enough, i guess :) IMO "implicit" and "explicit" do not denote a binary property; rather, there can be degrees of "implicit" or "explicit", hence the "more" in "more implicit model".

holger
Hello,

There is one other issue which may be relevant: restricted python. Restricted python implementations recently showed up on the py3k list as a wishlist item - and they are wanted even for python 2.X. I thought that pypy might be the answer to these restricted-python wishes.

Implementation would be as follows: for all allowed functionality, create a compilable pypy extension to handle it; for restricted functionality, delegate to a CPyObjectSpace which could allow/disallow or modify the operation. What makes this work where CPy couldn't is the ability to use two different cooperating object spaces to evaluate an expression. CPy doesn't have the object space abstraction which would allow this sort of partitioning. Another benefit is that different restricted interpreters could be created, each with a different set of allowed and disallowed operations.

In summary: (compile a restricted PyPy - rpy - as a CPy extension here)
    >>> import rpy
    >>> def filehandler(fname, mode):
    ...     if fname in ['allowed1.txt', 'allowed2.txt']:
    ...         return open(fname, mode)
    ...     else:
    ...         raise rpy.Restricted('File access not permitted')
    ...
    >>> interp = rpy.new()   # restrictions were defined at compile time
    >>> # Add a callback for some restricted functionality
    >>> interp.add_handler('file', filehandler)
    >>> interp.interact()

    # Now in rpy
    >>> 2 + 3                          # Allowed operation, didn't hit any restrictions
    5
    >>> open('allowed1.txt', 'r')      # allowed by handler
    <Restricted File 'allowed1.txt' at 0x23891910>
    >>> open('disallowed.txt', 'r')    # disallowed by handler
    Traceback (most recent call last):
      File "<interp_stdin>", line 1, in ?
    __main__.__interp__.Restricted: File access not permitted
Would either of these approaches lend itself better to this sort of restricted execution idea? VanL
Hi! 2006/4/12, VanL <news-8a9e0fd91190ca@northportal.net>:
There is one other issue which may be relevant: restricted python. Restricted python implementations recently showed up on the py3k list as a wishlist item - and they are wanted even for python 2.X.
Careful. "Restricted Python" (or RPython) means something very specific in PyPy lingo. Restricted Python means that you stick to some restrictions in your coding style to make type inference possible. See http://codespeak.net/pypy/dist/pypy/doc/coding-guide.html#restricted-python for more details. To your mail: security features are part of what we promised to do to the EU, so there will be something happening in that direction (even before November). I personally don't have any clue what :-) [snip] Cheers, Carl Friedrich
Hello, Carl Friedrich Bolz wrote:
Careful. "Restricted Python" (or RPython) means something very specific in PyPy lingo. Restricted Python means that you stick to some restrictions in your coding style to make type inference possible. See
http://codespeak.net/pypy/dist/pypy/doc/coding-guide.html#restricted-python
Sorry - I knew that. I just didn't make the connection from one side of my brain to the other. Maybe "secure python" (spy)?

What made me think this would work are a couple of comments made here and on the py3k list. First, Armin (about a year ago, I think) made some comments about the object space abstraction allowing a "Remote Object Space" and allowing different object spaces to participate in the evaluation of code. Second, comments on the py3k list indicated that secure python is difficult because of a) introspection, b) type inference, and c) GIL acquisition. However, a) a partitioned object space would allow introspection to be limited - go too far and you run into an opaque object-space wall; b) PyPy already has functionality to do some type inference - probably as much as necessary; and c) PyPy doesn't have a GIL. The only relevant GIL would be in the CPy host python, inaccessible to code in the Spy.

Of course, I also may be completely wrong :)

VanL
Hi VanL, On Wed, Apr 12, 2006 at 13:45 -0600, VanL wrote:
Carl Friedrich Bolz wrote:
Careful. "Restricted Python" (or RPython) means something very specific in PyPy lingo. Restricted Python means that you stick to some restrictions in your coding style to make type inference possible. See
http://codespeak.net/pypy/dist/pypy/doc/coding-guide.html#restricted-python
Sorry - I knew that. I just didn't make the connection from one side of my brain to another. Maybe "secure python" (spy)?
just as a side note: "secure" is a very vague term, technically (and politically). The professor i studied with warned against talking about security without specifying which kinds of attacks and which scenarios one is talking about. Did you see my security-related posting a few days ago, btw?
What made me think this would work are a couple of comments made here and on the py3k list. First, Armin (about a year ago, I think)made some comments about the objectspace abstraction allowing a "Remote Object Space" and allowing different object spaces to participate in the evaluation of code.
Right, that is still an open topic, a bit touched by the parallel discussion on this list of integrating PyPy (and RPython) with CPython.
Second, comments on py3k list indicated that secure python is difficult because of a) introspection, b) type inference, and c) GIL acquisition.
Hum, this list looks a bit weird to me. Could you state what the actual attacks are for which security measures are discussed? Or which use cases are people on py3k having in mind? cheers & thanks, holger
Hello, holger krekel wrote:
Second, comments on py3k list indicated that secure python is difficult because of a) introspection, b) type inference, and c) GIL acquisition.
Hum, this list looks a bit weird to me. Could you state what the actual attacks are for which security measures are discussed? Or which use cases are people on py3k having in mind?
This is an amalgam of several different posts (and maybe different threads) but here goes:

In the thread "Will we have a true restricted exec environment for python 3000," Vineet Jain asked for a restricted mode which would

"1. Limit the memory consumed by the script
 2. Limit access to file system and other system resources
 3. Limit cpu time that the script will take
 4. Be able to specify which modules are available for import."

In responses to that request, various people commented on the difficulties of implementing such a restricted mode. On that thread, several people had the same idea I had, to try to use PyPy for this purpose - however, it didn't look like many people were up-to-date reading both lists (and thus familiar-ish with PyPy's execution model).

A) Introspection

Nick Coghlan stated that: "I'm interested, but I'm also aware of how much work it would be. I'm disinclined to trust any mechanism which allows the untrusted code to run in the same process, as the implications of being able to do:

    self.__class__.__mro__[-1].__subtypes__()

are somewhat staggering, and designing an in-process sandbox to cope with that is a big ask (and demonstrating that the sandbox actually *achieves* that goal is even tougher)."

Vineet volunteered with a proposal to start a "light" python subinterpreter, which would be controlled by the main interpreter. Nick countered, "But will it allow you to use numbers or strings? If yes, then you can get to object(), and hence to pretty much whatever C builtins you want. So it's not enough to try to hide dangerous builtins like file(); you want to remove them from the light version entirely (routing all file system and network access requests through the main application). But if the file objects are gone, what happens to the Python machinery that relies on them (like import)? Python's powerful introspection is a severe drawback from a security POV - it is *really* hard to make a user stay in a box you put them in without crippling some part of the language as a side effect."

Thus, in CPy, allowing someone to access a C type effectively opens up all the C types. In PyPy, however, each type is effectively in its own box. Further, PyPy already has a structure that can deal with these sorts of accesses: the flowgraph. Operations in PyPy come about because of traversals of the graph - certain branches of the graph could be restricted or proxied out to a trusted interpreter.

B) GIL Acquisition

Another person suggested leveraging the multiple-subinterpreter code which already exists in CPython to create a restricted-exec interpreter. MvL noted that GIL acquisition made that difficult: "Part of the problem is that it doesn't really work. Some objects *are* shared across interpreters, such as global objects in extension modules (extension modules are initialized only once). I believe that the GIL management code (for acquiring the GIL out of nowhere) breaks if there are multiple interpreters."

C) Type inference

I tried to find the thread for this one - it's not from the Py3K list - but I recall a couple of years ago someone attempting to make an rexec version of python. One of the comments I recall from that discussion had to do with understanding which types were being manipulated. I believe there was an example somewhat like:

    # operator.add is trusted
    class A:
        def __add__(self, other):
            # ... something evil here ...
            pass

    a, b = A(), 1
    a + b     # [something evil happens]

However, this is a foggy memory that I have so far been unable to substantiate.

Thanks,

VanL
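P.S. Nick's point about introspection can be demonstrated in any stock CPython session (the real method is spelled __subclasses__): starting from a harmless literal, a couple of attribute hops reach object and enumerate every loaded type. The snippet below only shows the reachability, nothing evil:

```python
# From an innocuous tuple literal, walk up the MRO to `object` and
# enumerate its direct subclasses -- reaching types a sandbox may
# have tried to hide from the visible namespace.

base = ().__class__.__mro__[-1]      # this is `object` itself
dangerous = base.__subclasses__()    # every loaded direct subclass

print(base)                # <class 'object'>
print(len(dangerous) > 0)  # True
# even `type` itself is reachable this way:
print(any(cls.__name__ == 'type' for cls in dangerous))  # True
```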
Hi VanL! On Sat, Apr 15, 2006 at 17:38 -0600, VanL wrote:
holger krekel wrote:
Second, comments on py3k list indicated that secure python is difficult because of a) introspection, b) type inference, and c) GIL acquisition.
Hum, this list looks a bit weird to me. Could you state what the actual attacks are for which security measures are discussed? Or which use cases are people on py3k having in mind?
This is an amalgam of several different posts (and maybe different threads) but here goes:
hey, thanks for the effort!
In the thread "Will we have a true restricted exec environment for python 3000," Vineet Jain asked for a restricted mode which would
"1. Limit the memory consumed by the script
 2. Limit access to file system and other system resources
 3. Limit cpu time that the script will take
 4. Be able to specify which modules are available for import."
all more or less relates to the "sandbox" idea.
In responses to that request, various people commented on the difficulties of implementing such a restricted mode. On that thread, several people had the same idea I had, to try to use PyPy for this purpose - however, it didn't look like many people were up-to-date reading both lists (and thus familiar-ish with PyPy's execution model).
Yes, using PyPy's metaprogramming facilities for implementing sandbox models should make such tasks relatively easy - "metaprogramming" because PyPy is about writing a program that generates programs which happen to be a full Python interpreter. But it's also true that we haven't had concise discussions (or rather, never documented the results :) about how to implement the above. Still, the fact that we can instrument our GC, resource-handling code, or the main interpreter loop and accordingly produce a full python interpreter gives us various ways to go about the sandbox problem. Funny little thesis topic, i'd say :)
A) Introspection ... introspection funnyness ... Python's powerful introspection is a severe drawback from a security POV - it is *really* hard to make a user stay in a box you put them in without crippling some part of the language as a side effect."
All the possible ways to introspect your way around the python object model make one wonder whether protecting the resources isn't a more viable approach than protecting navigation.
Thus, in CPy, allowing someone to access a C type effectively opens up all the C types. In PyPy, however, each type is effectively in its own box. Further, PyPy already has a structure that can deal with these sorts of accesses: the flowgraph. Operations in PyPy come about because of traversals of the graph - certain branches of the graph could be restricted or proxied out to a trusted interpreter.
Applying systematic transformations to families of graphs (relating to IO resources, say) is one possibility. Lately we seem to want to express almost everything as graph transformations, btw.
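To make that concrete, a crude sketch of policy-checked operations (a toy stand-in, nothing like PyPy's actual flow graphs -- all names here are invented):

```python
# Toy model: a "flow graph" as a flat list of (opname, args) pairs,
# executed by a checker that either performs or refuses each
# operation according to a policy table -- a crude stand-in for
# restricting or proxying out branches of real flow graphs.

class Restricted(Exception):
    pass

POLICY = {
    'add': True,         # arithmetic is allowed
    'open_file': False,  # file access is not
}

def run_graph(ops):
    results = []
    for opname, args in ops:
        if not POLICY.get(opname, False):
            raise Restricted('operation %r not permitted' % opname)
        if opname == 'add':
            results.append(args[0] + args[1])
    return results

print(run_graph([('add', (2, 3))]))          # [5]
try:
    run_graph([('open_file', ('x.txt',))])
except Restricted as e:
    print(e)
```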
B) GIL Acquisition
Another person suggested leveraging the multiple subinterpreter code which already exists in CPython to create a restricted-exec interpreter. MvL noted that GIL acquisition made that difficult:
"Part of the problem is that it doesn't really work. Some objects *are* shared across interpreters, such as global objects in extension modules (extension modules are initialized only once). I believe that the GIL management code (for acquiring the GIL out of nowhere) breaks if there are multiple interpreters."
For PyPy, it need not be true that the GIL makes protecting resources harder. Then again, it doesn't matter so much because i don't think we'll require a sub-interpreter for implementing resource protection (but who knows :) I guess we should discuss sometime which path to follow (also see my other mail). Any opinions on that, btw?

best and thanks!

holger
Re-hi, On Wed, Apr 12, 2006 at 12:29:48PM +0200, Armin Rigo wrote:
class CPyObjSpace:
def newint(self, intval): return ctypes.pydll.PyInt_FromLong(intval)
def add(self, w_1, w_2): return ctypes.pydll.PyNumber_Add(w_1, w_2) ...
Correction here. It's not 'ctypes.pydll' -- that one is an object used to load other DLLs following the CPython calling conventions. It's 'ctypes.pythonapi' instead, and it works on standard Linux installations as well. Just try:

    >>> from ctypes import *
    >>> pythonapi.PyNumber_Add.restype = py_object
    >>> pythonapi.PyNumber_Add(py_object(5), py_object(6))
    11

(The result is 11 instead of py_object(11) because ctypes does automatic unwrapping of some return types.)

A bientot,

Armin
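P.S. For completeness, the same call with the full signature declared up front, so that ctypes converts the arguments automatically (this is standard ctypes behaviour on any CPython):

```python
from ctypes import pythonapi, py_object

# Declare the C-level signature once; ctypes then converts plain
# Python arguments to py_object automatically and unwraps the result.
PyNumber_Add = pythonapi.PyNumber_Add
PyNumber_Add.restype = py_object
PyNumber_Add.argtypes = [py_object, py_object]

print(PyNumber_Add(5, 6))        # 11
print(PyNumber_Add([1], [2]))    # [1, 2] -- works for arbitrary objects
```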
participants (6)
- Armin Rigo
- Carl Friedrich Bolz
- holger krekel
- hpk@trillke.net
- Van Lindberg
- VanL