Feature Request: Py_NewInterpreter to create separate GIL (branch)
repeated from c.l.p : "Feature Request: Py_NewInterpreter to create separate GIL (branch)" Daniel Dittmar wrote:
robert wrote:
I'd like to use multiple CPU cores for selected time-consuming Python computations (incl. numpy/scipy) in a frictionless manner.
Interprocess communication is tedious and out of the question, so I thought about simply using multiple Python interpreter instances (Py_NewInterpreter), each with an extra GIL, in the same process.
If I understand Python/ceval.c, the GIL is really global, not specific to an interpreter instance: static PyThread_type_lock interpreter_lock = 0; /* This is the GIL */
That's the showstopper as of now. There are only a handful of functions in ceval.c that use that very global lock; the rest uses those functions around thread states.

Would it be a possibility in a future Python to have the lock separate for each interpreter instance? Thus: have *interpreter_lock separate in each PyThreadState instance, so that only threads of the same interpreter share the same GIL? Separation between interpreters seems to be enough. The interpreter runs mainly on the stack. Possibly only very few global C-level resources would require individual extra locks.

Sooner or later Python will have to answer the multi-processor question. A per-interpreter GIL and a nice module for tunneling Python objects directly between interpreters inside one process might be the answer at the right border-line? Existing extension code base would remain compatible, as far as there is already decent locking on module globals, which is the usual case.

Robert
On 11/3/06, Robert <kxroberto@googlemail.com> wrote:
repeated from c.l.p : "Feature Request: Py_NewInterpreter to create separate GIL (branch)"
Daniel Dittmar wrote:
robert wrote:
I'd like to use multiple CPU cores for selected time-consuming Python computations (incl. numpy/scipy) in a frictionless manner.
Interprocess communication is tedious and out of the question, so I thought about simply using multiple Python interpreter instances (Py_NewInterpreter), each with an extra GIL, in the same process.
If I understand Python/ceval.c, the GIL is really global, not specific to an interpreter instance: static PyThread_type_lock interpreter_lock = 0; /* This is the GIL */
That's the showstopper as of now. There are only a handful of functions in ceval.c that use that very global lock; the rest uses those functions around thread states.
Would it be a possibility in a future Python to have the lock separate for each interpreter instance? Thus: have *interpreter_lock separate in each PyThreadState instance, so that only threads of the same interpreter share the same GIL? Separation between interpreters seems to be enough. The interpreter runs mainly on the stack. Possibly only very few global C-level resources would require individual extra locks.
Right, but that's the trick. For instance, extension modules are shared between interpreters. Also look at the sys module: basically anything that is set by a function call is a process-level setting that would also need protection. Then you get into the fun stuff of the possibility of sharing objects created in one interpreter and then passed to another that is not necessarily known ahead of time (whether it be directly through C code or through process-level objects such as an attribute in an extension module). It is not as simple, unfortunately, as a few locks.

Sooner or later Python will have to answer the multi-processor question. A per-interpreter GIL and a nice module for tunneling Python objects directly between interpreters inside one process might be the answer at the right border-line? Existing extension code base would remain compatible, as far as there is already decent locking on module globals, which is the usual case.
This is not true (see above). From my viewpoint, the only way for this to work would be to come up with a way to wrap all access to module objects in extension modules so that they are not trampled on because of separate locks per interpreter, or to force all extension modules to be coded so that they are instantiated individually per interpreter. And of course deal with all other process-level objects somehow. The SMP issue for Python will most likely not be tackled until someone cares enough to write the code for it, and this take on it is no exception. There is no simple solution, or else someone would have done it by now. -Brett
Robert schrieb:
Would it be a possibility in a future Python to have the lock separate for each interpreter instance? Thus: have *interpreter_lock separate in each PyThreadState instance, so that only threads of the same interpreter share the same GIL? Separation between interpreters seems to be enough. The interpreter runs mainly on the stack. Possibly only very few global C-level resources would require individual extra locks.
Notice that at least the following objects are shared between interpreters, as they are singletons:

- None, True, False, (), "", u""
- strings of length 1, Unicode strings of length 1 with ord < 256
- integers between -5 and 256

How do you deal with the reference counters of these objects?

Also, type objects (in particular exception types) are shared between interpreters. These are mutable objects, so you actually have dictionaries shared between interpreters. How would you deal with these?

Also, the current thread state is a global variable (currently _PyThreadState_Current). How would you provide access to the current thread state if there are multiple simultaneous threads?

Regards, Martin
On Nov 4, 2006, at 3:49 AM, Martin v. Löwis wrote:
Notice that at least the following objects are shared between interpreters, as they are singletons:

- None, True, False, (), "", u""
- strings of length 1, Unicode strings of length 1 with ord < 256
- integers between -5 and 256

How do you deal with the reference counters of these objects?
Also, type objects (in particular exception types) are shared between interpreters. These are mutable objects, so you actually have dictionaries shared between interpreters. How would you deal with these?
All these should be dealt with by making them per-interpreter singletons, not per address space. That should be simple enough; unfortunately, the margins of this email are too small to describe how. ;) Also, it'd be backwards incompatible with current extension modules. James
On 11/5/06, James Y Knight <foom@fuhm.net> wrote:
On Nov 4, 2006, at 3:49 AM, Martin v. Löwis wrote:
Notice that at least the following objects are shared between interpreters, as they are singletons:

- None, True, False, (), "", u""
- strings of length 1, Unicode strings of length 1 with ord < 256
- integers between -5 and 256

How do you deal with the reference counters of these objects?
Also, type objects (in particular exception types) are shared between interpreters. These are mutable objects, so you actually have dictionaries shared between interpreters. How would you deal with these?
All these should be dealt with by making them per-interpreter singletons, not per address space. That should be simple enough; unfortunately, the margins of this email are too small to describe how. ;) Also, it'd be backwards incompatible with current extension modules.
I don't know how you define simple. In order to be able to have separate GILs you have to remove *all* sharing of objects between interpreters. And all other data structures, too. It would probably kill performance too, because currently obmalloc relies on the GIL. So I don't see much point in continuing this thread.

--Guido van Rossum (home page: http://www.python.org/~guido/)
Guido van Rossum wrote:
I don't know how you define simple. In order to be able to have separate GILs you have to remove *all* sharing of objects between interpreters. And all other data structures, too. It would probably kill performance too, because currently obmalloc relies on the GIL.
Nitpick: You have to remove all sharing of *mutable* objects. One day, when we get "pure" GC with no refcounting, that will be a meaningful distinction. :) -- Talin
Talin <talin@acm.org> wrote:
Guido van Rossum wrote:
I don't know how you define simple. In order to be able to have separate GILs you have to remove *all* sharing of objects between interpreters. And all other data structures, too. It would probably kill performance too, because currently obmalloc relies on the GIL.
Nitpick: You have to remove all sharing of *mutable* objects. One day, when we get "pure" GC with no refcounting, that will be a meaningful distinction. :)
Python already grew that feature a couple of years back, but it never became mainline. Search Google (I don't know the magic incantation off the top of my head), but if I remember correctly, it wasn't a significant win, if any at all. - Josiah
Talin wrote:
I don't know how you define simple. In order to be able to have separate GILs you have to remove *all* sharing of objects between interpreters. And all other data structures, too. It would probably kill performance too, because currently obmalloc relies on the GIL.

Nitpick: You have to remove all sharing of *mutable* objects. One day, when we get "pure" GC with no refcounting, that will be a meaningful distinction. :)
Is it mad? It could be a distinction now: the refcount of immutables/singletons could easily be held roughly fixed around MAXINT (by a loose periodic GC scheme, or by having Py_INCREF/Py_DECREF behave like { if (ob->refcount != MAXINT) ... }).

Dict-y things like Exception.x=5 could either be disabled, or handled via Exception.refcount=MAXINT / .__dict__=lockingdict ... or exceptions could be duplicated per interpreter, as they don't have to cross the bridge (weren't they in an ordinary Python module once?).

obmalloc.c/LOCK() could be something fast like:

    _retry:
        __asm LOCK INC malloc_lock
        if (malloc_lock != 1) { LOCK DEC malloc_lock; /* yield(); */ goto _retry; }

To know the final speed costs ( http://groups.google.de/group/comp.lang.python/msg/01cef42159fd1712 ) would require an experiment. Cheap signal processors (<1%) don't need to be supported for free-threading interpreters.

Builtin/extension modules' global __dict__ would become a lockingdict. Yet a speedy LOCK INC lock method may possibly even lead to generally free-threading interpreters (for most CPUs): almost all Python objects have static/uncritical attributes and require only few locks. A full-blown LOCK INC lock method on dict & list accesses (avoidable for fastlocals?) & default Py_INCREF/Py_DECREF (as far as there is still refcounting in Py3K). Py_FASTINCREF could be fast for known immutables (mainly Py_None) with the MAXINT method, and for fresh creations etc.

PyThreadState_GET(): a ts(PyThread_get_thread_ident()) / TlsGetValue() lookup would become necessary. Is there a fast thread-ID register in today's CPUs?

Robert
participants (7)

- "Martin v. Löwis"
- Brett Cannon
- Guido van Rossum
- James Y Knight
- Josiah Carlson
- Robert
- Talin