[Python-ideas] Threading hooks and disable gc per thread
Christian Heimes
lists at cheimes.de
Thu May 12 01:58:05 CEST 2011
Hello,
today I've spent several hours debugging a segfault in JCC [1]. JCC is a
framework to wrap Java code for Python. It's most prominently used in
PyLucene [2]. You can read more about my debugging in [3].
With JCC every Python thread must be registered at the JVM through JCC.
An unattached thread that accesses a wrapped Java object leads to
errors and may even cause a segfault. "Accessing" also includes garbage
collection. A code line like
a = {}
or
"a b c".split()
can segfault since the allocation of a dict or a bound method runs
through _PyObject_GC_New(), which may trigger a cyclic garbage
collection run. If the current thread isn't attached to the JVM but
triggers a gc.collect() with some Java objects in a cycle, the
interpreter crashes. It's complicated and hard to "fix" third-party
tools so that they attach every thread they create.
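To illustrate the point that a collection runs in whichever thread
happens to allocate, here is a small sketch that uses gc.callbacks to
record the name of the thread in which each automatic collection starts.
The names record_collector and allocate_containers are made up for this
example, and the allocation count of 100000 is simply assumed to be well
above the collector's threshold.

```python
import gc
import threading

collecting_threads = set()


def record_collector(phase, info):
    # Entries in gc.callbacks are called with a phase ("start" or "stop")
    # and an info dict; we only note which thread the collection runs in.
    if phase == "start":
        collecting_threads.add(threading.current_thread().name)


def allocate_containers():
    # Plain container allocations; gc.collect() is never called explicitly.
    keep = [[] for _ in range(100000)]
    return len(keep)


gc.enable()  # make sure automatic collection is on for the demonstration
gc.callbacks.append(record_collector)
try:
    worker = threading.Thread(target=allocate_containers, name="worker")
    worker.start()
    worker.join()
finally:
    gc.callbacks.remove(record_collector)
```

The worker thread never asks for a collection, yet its name should show
up in collecting_threads. With JCC that thread would then have to be
attached to the JVM.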
The issue could be solved with a simple on_thread_start hook in the
threading module. However, there is more to it. To free memory, a
thread must also be detached from the JVM when it ends. A second
on_thread_stop hook isn't enough, since the bound methods may also
trigger a gc.collect() run after the thread is detached.
I propose three changes to Python in order to fix the issue:
on thread start hook
--------------------
Similar to the atexit module, third party modules can register a
callable with *args and **kwargs. The functions are called inside the
newly created thread just before the target is called. The best place
for the hook list is threading.Thread._bootstrap_inner() right before
the try: self.run() except: block. Exceptions are ignored during the
call but reported to the user at the end (same as atexit's
atexit_callfunc()).
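A rough sketch of how the start hook could behave, written as a Thread
subclass rather than a patch to threading itself. Everything named here
(register_thread_start, HookedThread, _start_hooks) is hypothetical and
only stands in for the proposed change to _bootstrap_inner().

```python
import sys
import threading
import traceback

_start_hooks = []  # hypothetical registry, not part of the threading module


def register_thread_start(func, *args, **kwargs):
    """Register func(*args, **kwargs) to be called in every new thread."""
    _start_hooks.append((func, args, kwargs))
    return func


def _run_start_hooks():
    # Like atexit, report a failing hook on stderr and carry on with the rest.
    for func, args, kwargs in _start_hooks:
        try:
            func(*args, **kwargs)
        except Exception:
            print("Error in thread start hook:", file=sys.stderr)
            traceback.print_exc()


class HookedThread(threading.Thread):
    """Stand-in for the proposed hook call in Thread._bootstrap_inner()."""

    def run(self):
        _run_start_hooks()  # executes inside the newly created thread
        super().run()


events = []


def attach(tag):
    events.append((tag, threading.current_thread().name))


def target():
    events.append(("target", threading.current_thread().name))


register_thread_start(attach, "attach")
t = HookedThread(target=target, name="worker")
t.start()
t.join()
```

A library like JCC would register its attach call once at import time,
and every thread started afterwards would be attached before its target
runs.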
on thread end hook
------------------
Same as on thread start hook but the callables are called inside the
dying thread after self.run().
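The end hook could be sketched the same way. The names
register_thread_stop and StopHookThread are hypothetical. Running the
hooks in a finally block, so they also fire when the target raises, is
my assumption and not something the proposal spells out.

```python
import sys
import threading
import traceback

_stop_hooks = []  # hypothetical registry, not part of the threading module


def register_thread_stop(func, *args, **kwargs):
    """Register func(*args, **kwargs) to be called when a thread ends."""
    _stop_hooks.append((func, args, kwargs))
    return func


class StopHookThread(threading.Thread):
    """Stand-in for a hook call placed after self.run() in the bootstrap."""

    def run(self):
        try:
            super().run()
        finally:
            # Still inside the dying thread; report hook errors and continue.
            for func, args, kwargs in _stop_hooks:
                try:
                    func(*args, **kwargs)
                except Exception:
                    print("Error in thread stop hook:", file=sys.stderr)
                    traceback.print_exc()


order = []


def detach(tag):
    order.append((tag, threading.current_thread().name))


def work():
    order.append(("target", threading.current_thread().name))


register_thread_stop(detach, "detach")
t = StopHookThread(target=work, name="worker")
t.start()
t.join()
```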
gc.disable_thread(), gc.enable_thread(), gc.isenabled_thread()
--------------------------------------------------------------
Right now almost any code can trigger a gc.collect() run
non-deterministically. Some applications like JCC want to control at the
thread level whether gc.collect() may run. This could be solved with a
new flag in PyThreadState. PyThreadState->gc_enabled is enabled by
default. When the flag is false, _PyObject_GC_Malloc() doesn't start a
gc.collect() run for that thread. The collection is delayed until
another thread or the main thread triggers it.
The three functions should also have a C equivalent so C code can
prevent gc in a thread.
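The intended semantics can be modelled in pure Python. This is only a
toy model: the real change would live in C, none of these functions
exist in the gc module today, and the counter and THRESHOLD are made-up
stand-ins for the collector's allocation count and threshold.

```python
import threading

_state = threading.local()  # models the proposed PyThreadState->gc_enabled
_lock = threading.Lock()
_pending = 0                # models the collector's allocation counter
THRESHOLD = 3               # made-up threshold for this demonstration
collections = []            # names of threads that ran a "collection"


def disable_thread():
    _state.gc_enabled = False


def enable_thread():
    _state.gc_enabled = True


def isenabled_thread():
    return getattr(_state, "gc_enabled", True)  # enabled by default


def allocate():
    """Toy model of _PyObject_GC_Malloc(): count, then maybe collect."""
    global _pending
    with _lock:
        _pending += 1
        if _pending > THRESHOLD and isenabled_thread():
            collections.append(threading.current_thread().name)
            _pending = 0


def unattached_worker():
    disable_thread()
    for _ in range(5):
        allocate()  # goes past the threshold but never collects here


t = threading.Thread(target=unattached_worker, name="unattached")
t.start()
t.join()
collected_in_worker = list(collections)

allocate()  # the calling thread has gc enabled, so it picks up the backlog
```

The worker crosses the threshold without collecting, and the pending
work is picked up by the next allocation in a thread that still has gc
enabled.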
Thoughts?
Christian
[1] http://lucene.apache.org/pylucene/jcc/index.html
[2] http://lucene.apache.org/pylucene/
[3] http://mail-archives.apache.org/mod_mbox/lucene-pylucene-dev/201105.mbox/browser