support for 64-bit processors and eliminating global state

I'm new to PyPy, but I would encourage the development folks to apply some focus towards two things: support for both 32- and 64-bit processors, and eliminating global state, including the GIL. The near future of mainstream processors is multi-core x86_64, and for the short term both 32-bit and 64-bit platforms will be around. Code that makes "naked" assumptions about word size will break and needs to be refactored to hide the word-size dependencies. Similarly, code that assumes a single thread of execution, or that uses a GIL to protect global state, will not make efficient use of modern processors. Any language or system that cannot make the transition to 64-bit multi-core will start to lose ground to those that do.

At the Parallel Computing Laboratory (UC Berkeley), one of the projects we are working on is called SEJITS, which stands for Selective Embedded Just-in-Time Specialization. The idea is that one can extend a self-introspecting modern scripting language to call natively coded modules (e.g. C) at selected points, for handling specialized operations (e.g. vector/matrix operations using tuned SIMD or CUDA code). You can see the abstract of a recent SEJITS paper at http://pmea.ac.upc.edu/program.html (session 1a), but unfortunately the paper is not online yet. Both Python and Ruby are being looked at as potential target languages for SEJITS work; both have sufficient introspection facilities to support selective JIT operations. Python has an advantage in having been used by the scientific community for longer than Ruby, with more established users. I'd love to see this work integrate with PyPy; at the moment the folks involved are targeting CPython.

In any case, I think the transition to multi-core, multi-threaded 64-bit machines is a potential watershed of major importance, which it would behoove pypy-dev folks to keep in mind. Respectfully, Jeff Anderson-Lee

Hi. In general, we try hard not to make hard-coded assumptions, so the transition to 64-bit should be fairly smooth. However, we have limited resources and we're largely volunteer-run. For example, my laptop does not support 64-bit, which makes it significantly harder for me to work on (that being one example). So 64-bit support is on the back burner, not because we plan to hardcode 32-bit everywhere, but simply because our resources are limited. If you want to push for it by donating time or money, you're welcome to do so. It's nice to hear what you're doing; however, the full paper would be much better than the abstract. Is there some other way to obtain it? PS. I also did some work along these lines, lazily constructing numpy expressions and compiling them to assembler, so I'm personally interested in hearing more. Cheers, fijal On Wed, Sep 30, 2009 at 10:27 AM, Jeff Anderson-Lee <jonah@eecs.berkeley.edu> wrote:

On Sep 30, 2009, at 1:27 PM, Jeff Anderson-Lee wrote:
PyPy does support 32- and 64-bit processors. The JIT for x86_64 is not ready yet, but that is just a matter of time: once the 32-bit JIT is ready, doing a 64-bit one is simple (though many man-hours of work). The GIL in PyPy is only there because no one has proposed anything to change that; PyPy already does not depend on reference counting and can use a garbage collector, so it is probably much easier to change than CPython.
I haven't read the paper, but PyPy does already have a JIT; if you are interested in it, you can read more on the PyPy blog at http://morepypy.blogspot.com/ . Probably someone with more experience with both PyPy and the JIT is going to answer this email, so I will not try to explain it here. -- Leonardo Santagada santagada at gmail.com

Hi Leonardo. I think you're not reading this mail in detail; let me explain. On Wed, Sep 30, 2009 at 11:16 AM, Leonardo Santagada <santagada@gmail.com> wrote:
It shouldn't take that many hours to have a 64-bit JIT, as far as I know; I did a lot of refactoring recently, so it should be much easier. Also, we don't have a 64-bit buildbot, which means 64-bit support might rot over time without us knowing, and it's not officially supported.
It's true that we don't have a good story here and we need one. Something à la Jython would work (unlike in CPython), but it's work.
Note that's not precisely what Jeff wants. A general-purpose JIT is nice, but it's rather hard to imagine how it would generate efficient CUDA code automatically, without hints from the user. Since PyPy actually has a JIT generator, it should be far easier to implement this in PyPy than somewhere else (you can write code that is the "interpreter" and a JIT will be created for it automatically), but it's still work to get a nice parallelizable (or parallelizing?) framework. Cheers, fijal

Maciej Fijalkowski wrote:
It's great to hear that you are already working in the 64-bit direction. Most modern laptops have 64-bit-compatible chips; my two-year-old Centrino Duo does. A dual-boot setup (e.g. with Ubuntu for x86_64) can do the trick for an inexpensive development environment, though that's not the same as a buildbot. Some of the original 64-bit processors are nearing retirement age, so perhaps some kind soul will see this note and volunteer an old, replaced system to support pypy-dev. Just sayin' it's important; glad you seem to think so too.
The last time I looked, Hoard didn't support x86_64, although it did seem to work fairly efficiently in threaded environments, if I recall. Having a separate arena for each thread (or each virtual processor) helps avoid a lot of locking for frequent, small allocations in a VM. That may mean factoring out the allocation so that it calls something like myalloc(pool, size) rather than just malloc(size). I read that PyPy was trying to factor out the GC code to support multiple back-ends; having an API that supports multiple concurrent allocator pools can be useful in that regard. Similarly, a JIT can be modularized so as not to depend on globals, but instead take a JitContext structure: jit_xxx(struct JitContext *jc, ...). That allows JIT compilation to go on in multiple threads at once. I looked at libjit and it didn't have that structure, meaning that JIT processing of functions was a potential bottleneck; I haven't gotten deep enough into PyPy yet to know whether that is the case for you folks. In fact, I'd like to encourage a global-less coding style for the sake of improved parallelization: every global is another reason for a GIL.
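The per-thread arena idea can be sketched in Python (a toy illustration with hypothetical names; a real VM would do this in C): each thread bump-allocates from its own private pool, so small allocations never contend on a shared lock.

```python
import threading

class Arena:
    """A toy bump allocator: hands out slices of a private buffer."""
    def __init__(self, size=1 << 16):
        self.buf = bytearray(size)
        self.top = 0

    def alloc(self, nbytes):
        # No lock needed: each thread owns its arena exclusively.
        off = self.top
        if off + nbytes > len(self.buf):
            raise MemoryError("arena exhausted")
        self.top = off + nbytes
        return memoryview(self.buf)[off:off + nbytes]

_local = threading.local()

def myalloc(nbytes):
    """Allocate from the calling thread's private pool: the
    myalloc(pool, size) idea, with the pool found via thread-local
    storage instead of an explicit argument."""
    arena = getattr(_local, "arena", None)
    if arena is None:
        arena = _local.arena = Arena()
    return arena.alloc(nbytes)

block = myalloc(64)
block[:5] = b"hello"
```

Because each arena is thread-private, `alloc` takes no lock at all; only the one-time creation of a thread's arena touches shared machinery.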
Yes. The SEJITS approach can be used even with a Python that doesn't have a JIT, as long as it has a suitable foreign function interface. The trick is to interpose in the AST processing to recognize and handle "selective" patterns in the tree. The current system actually generates C code on the fly, then compiles and links it in with FFI hooks so that subsequent calls can access it more directly. This is obviously only worth doing for code where the native version is substantially faster and/or will be called sufficiently often. If I had the cash on hand I would gladly support your work with a donation; unfortunately I have neither sufficient personal resources nor access to corporate funds. (As a research group, we get our funds from outside donations, we don't dole them out!) I think it's a great project though, and if cheerleading counts, you definitely have my support in that regard.
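The interposition idea can be sketched without a real compiler (all names illustrative, not SEJITS's actual API): walk a function's AST, and when a recognized pattern appears, substitute a specialized implementation. Here the built-in `sum` stands in for generated-compiled-and-FFI-linked C.

```python
import ast

SRC = """
def total(xs):
    acc = 0
    for x in xs:
        acc += x
    return acc
"""

def specialize(src):
    """Parse the source; if the body matches the recognized
    'accumulate in a for loop' pattern, return a fast path instead
    of the interpreted original."""
    tree = ast.parse(src)
    fndef = tree.body[0]
    body = fndef.body
    looks_like_sum = (
        len(body) == 3
        and isinstance(body[0], ast.Assign)
        and isinstance(body[1], ast.For)
        and isinstance(body[2], ast.Return)
    )
    if looks_like_sum:
        # In SEJITS this is where C code would be generated,
        # compiled, and bound via FFI hooks; sum() stands in here.
        return sum
    ns = {}
    exec(compile(tree, "<src>", "exec"), ns)  # fall back: run as-is
    return ns[fndef.name]

total = specialize(SRC)
```

Unrecognized functions fall through unchanged, which is the "selective" part: only code where the specialized version pays off gets replaced.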

Hi Jeff, On Wed, Sep 30, 2009 at 11:19:06AM -0700, Jeff Anderson-Lee wrote:
That's all true, but the primary issue with the GIL in Python (either CPython or PyPy) is more fundamental. Its purpose is to protect concurrent object accesses: to avoid getting random nonsense, or crashing the interpreter, if you do two lst.append() calls on the same list from two threads in parallel, for example. The Python language reference itself says that doing this gives you the "expected" result, i.e. the same as if the lst.append() were protected with locking. From there to the GIL the distance is not very large. We could ignore this point and try to implement a free-threaded interpreter in PyPy; we would then hit all the issues you describe above and have to solve them somehow. But the point is that it would no longer be a fully compliant Python interpreter. Depending on the use case it might be either a very useful quasi-Python interpreter, or a useless one that crashes randomly when running any large body of code that otherwise works on CPython or Jython. Of course that's not the end of the story, as shown e.g. by Jython, in which the locking exists too but is more fine-grained: there is one lock per list or dictionary, which itself costs very little because the Java platform is extremely good at optimizing such locks. A bientot, Armin.
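The Jython-style alternative, one lock per container rather than one interpreter-wide lock, can be sketched as a toy Python wrapper (an illustration of the idea, not Jython's actual implementation):

```python
import threading

class LockedList:
    """A list with its own lock, so concurrent appends stay safe
    without any interpreter-wide GIL: only this one list is locked."""
    def __init__(self):
        self._lock = threading.Lock()
        self._items = []

    def append(self, item):
        with self._lock:  # fine-grained: contention only on this list
            self._items.append(item)

    def __len__(self):
        with self._lock:
            return len(self._items)

def worker(lst):
    for i in range(1000):
        lst.append(i)

lst = LockedList()
threads = [threading.Thread(target=worker, args=(lst,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
# All 4 * 1000 appends land, the same "expected" result the language
# reference promises for GIL-protected list.append.
```

Threads touching different lists never contend with each other, which is why the per-object scheme parallelizes where a single global lock cannot.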

Hi Gabriel, On Wed, Sep 30, 2009 at 04:01:25PM -0400, Gabriel Lavoie wrote:
Actually, what are the officially supported platforms?
Linux, Mac OS X and Windows. All in 32-bit only right now, although the 64-bit version on Linux and OS X generally works, more or less. A bientot, Armin.
participants (5)

- Armin Rigo
- Gabriel Lavoie
- Jeff Anderson-Lee
- Leonardo Santagada
- Maciej Fijalkowski