
On Wed, May 11, 2011 at 4:58 PM, Christian Heimes <lists@cheimes.de> wrote:
Hello,
today I've spent several hours debugging a segfault in JCC [1]. JCC is a framework to wrap Java code for Python. It's most prominently used in PyLucene [2]. You can read more about my debugging in [3].
With JCC every Python thread must be registered at the JVM through JCC. An unattached thread that accesses a wrapped Java object causes errors and may even segfault. Accessing also includes garbage collection. A code line like
a = {}
or "a b c".split()
can segfault since the allocation of a dict or a bound method runs through _PyObject_GC_New(), which may trigger a cyclic garbage collection run. If the current thread isn't attached to the JVM but triggers a gc.collect() with some Java objects in a cycle, the interpreter crashes. It's quite complicated and hard to "fix" third-party tools so that they attach all threads created inside the third-party library.
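To illustrate the point above, here is a minimal sketch that uses gc.callbacks to record which thread a collection starts in. The names record_thread and allocate_containers are made up for this example, and it assumes automatic collection is enabled (the default). No Java objects are involved; it only shows that plain allocations in a worker thread can start a collection in that thread.

```python
import gc
import threading

# Names of the threads in which a collection was observed to start.
triggered_in = set()

def record_thread(phase, info):
    # Functions in gc.callbacks are called with phase "start" or "stop".
    if phase == "start":
        triggered_in.add(threading.current_thread().name)

def allocate_containers(n=100_000):
    # Keep every dict alive so allocations clearly outnumber deallocations.
    keep = []
    for _ in range(n):
        keep.append({})
    return len(keep)

was_enabled = gc.isenabled()
gc.enable()
gc.callbacks.append(record_thread)
try:
    worker = threading.Thread(target=allocate_containers, name="worker")
    worker.start()
    worker.join()
finally:
    gc.callbacks.remove(record_thread)
    if not was_enabled:
        gc.disable()

print(sorted(triggered_in))
```

The worker never calls gc.collect() itself; the collection is a side effect of creating dicts, which is exactly the situation an unattached thread cannot guard against.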
The issue could be solved with a simple on_thread_start hook in the threading module. However, there is more to it. In order to free memory, threads must also be detached from the JVM when they end. A second on_thread_stop hook isn't enough, since the bound methods may also lead to a gc.collect() run after the thread is detached.
I propose three changes to Python in order to fix the issue:
on thread start hook
--------------------
Similar to the atexit module, third-party modules can register a callable with *args and **kwargs. The functions are called inside the newly created thread just before the target is called. The best place for the hook list is threading.Thread._bootstrap_inner(), right before the try: self.run() except: block. Exceptions are ignored during the call but reported to the user at the end (same as atexit's atexit_callfunc()).
on thread end hook
------------------
Same as the on thread start hook, but the callables are called inside the dying thread after self.run().
Makes sense to me. Something that needs clarifying: when the process dies (main python thread has exited and all remaining python threads are daemon threads) the on thread end hook will _not_ be called. +1

This is really two separate feature requests: the thread hooks above and the gc hooks below.
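The two thread hooks could be prototyped in pure Python roughly as follows. This is only a sketch: register_thread_start, register_thread_end and HookedThread are hypothetical names, and the hooks are run from an overridden run() in a Thread subclass rather than from threading.Thread._bootstrap_inner() as proposed, so threads created by third-party code would not be covered.

```python
import threading

# Hypothetical hook registries, modelled on atexit.register().
_start_hooks = []
_end_hooks = []

def register_thread_start(func, *args, **kwargs):
    _start_hooks.append((func, args, kwargs))
    return func

def register_thread_end(func, *args, **kwargs):
    _end_hooks.append((func, args, kwargs))
    return func

def _run_hooks(hooks):
    # Exceptions do not stop the remaining hooks; they are collected
    # so they can be reported afterwards.
    errors = []
    for func, args, kwargs in list(hooks):
        try:
            func(*args, **kwargs)
        except Exception as exc:
            errors.append(exc)
    return errors

class HookedThread(threading.Thread):
    """Thread that runs the registered hooks around its target."""

    def run(self):
        # Start hooks run inside the new thread, before the target.
        self.hook_errors = _run_hooks(_start_hooks)
        try:
            super().run()
        finally:
            # End hooks run inside the dying thread, after the target.
            self.hook_errors += _run_hooks(_end_hooks)

events = []

def note(label):
    events.append((label, threading.current_thread().name))

register_thread_start(note, "attach")
register_thread_end(note, "detach")

t = HookedThread(target=note, args=("work",), name="t1")
t.start()
t.join()
print(events)
```

In a JCC-like setting the "attach" and "detach" callables would be whatever the library uses to register and unregister the current thread with the JVM. As noted above, a daemon thread that is still running when the process exits never reaches the end hooks.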
gc.disable_thread(), gc.enable_thread(), gc.isenabled_thread()
--------------------------------------------------------------
Right now almost any code can trigger a gc.collect() run non-deterministically. Some applications like JCC want to control whether gc.collect() is wanted on a thread level. This could be solved with a new flag in PyThreadState. PyThreadState->gc_enabled is enabled by default. When the flag is false, _PyObject_GC_Malloc() doesn't start a gc.collect() run for that thread. The collection is delayed until another thread or the main thread triggers it.
The three functions should also have a C equivalent so C code can prevent gc in a thread.
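The intended semantics of the per-thread flag can be modelled in pure Python with threading.local. This is a toy model only: the three functions below are module-level stand-ins for the proposed gc.* functions, collect_if_allowed is a hypothetical stand-in for the check that _PyObject_GC_Malloc() would perform, and none of it affects the interpreter's real automatic collection, which would need the PyThreadState change described above.

```python
import gc
import threading

# Per-thread state standing in for the proposed PyThreadState->gc_enabled.
_state = threading.local()

def disable_thread():
    _state.gc_enabled = False

def enable_thread():
    _state.gc_enabled = True

def isenabled_thread():
    # The flag is enabled by default in every thread.
    return getattr(_state, "gc_enabled", True)

def collect_if_allowed():
    # Hypothetical stand-in for the allocator's check: only start a
    # collection when the current thread has not opted out.
    if isenabled_thread():
        return gc.collect()
    return None

results = {}

def worker():
    # A thread that is not attached to the JVM opts out of collections.
    disable_thread()
    results["worker_enabled"] = isenabled_thread()
    results["worker_collect"] = collect_if_allowed()

t = threading.Thread(target=worker)
t.start()
t.join()

# The main thread is unaffected and performs the delayed collection.
results["main_enabled"] = isenabled_thread()
results["main_collect"] = collect_if_allowed()
print(results)
```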
This also sounds useful, since we are a long, long way from concurrent gc. (And whenever we gain that, we'd need a way to control when it can or can't happen, or to register the gc threads with anything that needs to know about 'em: JCC, etc.) +1

-gps