refleaks & test_tcl & threads

On my box, the latest batch of refleak fixes seems to have fixed all but one of the leaky tests. test_threading_local still leaks, but it now leaks rather consistently (which is new). I'm not able to make the other tests leak with any combination of '-u<resource>', removing .pyc's beforehand, or running longer or shorter. I hope it's not just my box :-)

test_threading_local is not entirely consistent, but it looks a lot more reliable on my box than on Neal's automated mails:

    test_threading_local
    beginning 11 repetitions
    12345678901
    ...........
    test_threading_local leaked [34, 34, 34, 34, 34, 26, 26, 22, 34] references

One remaining issue with refleakhunting on my machine is that test_tcl can't stand being run twice. Even without -R, this makes Python hang while waiting for a mutex in the second run through test_tcl:

    ...trunk $ ./python -E -tt Lib/test/regrtest test_tcl test_tcl

Attaching gdb to the hung process shows this unenlightening trace:

    #0  0x00002b7d6629514b in __lll_mutex_lock_wait () from /lib/libpthread.so.0
    #1  0x00002b7d6639a280 in completed.4801 () from /lib/libpthread.so.0
    #2  0x0000000000000004 in ?? ()
    #3  0x00002b7d66291dca in pthread_mutex_lock () from /lib/libpthread.so.0
    #4  0x0000000000000000 in ?? ()

The process has one other thread, which is stuck here:

    #0  0x00002b7d667f14d6 in __select_nocancel () from /lib/libc.so.6
    #1  0x00002b7d67512d8c in Tcl_WaitForEvent () from /usr/lib/libtcl8.4.so.0
    #2  0x00002b7d66290b1c in start_thread () from /lib/libpthread.so.0
    #3  0x00002b7d667f8962 in clone () from /lib/libc.so.6
    #4  0x0000000000000000 in ?? ()

It smells like test_tcl or Tkinter is doing something wrong with regards to threads. I can reproduce this on a few machines, but all of them run newish linux kernels with newish glibc's and newish tcl/tk. At least in kernel/libc land, various thread related things changed of late. I don't have access to other machines with tcl/tk right now, but I wonder if anyone can reproduce this in different situations.

--
Thomas Wouters <thomas@python.org>

Hi! I'm a .signature virus! copy me into your .signature file to help me spread!
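[Editor's note: for readers who haven't used it, the -R machinery being discussed boils down to something like the following sketch. It assumes a --with-pydebug build, which is the only configuration where sys.gettotalrefcount() exists; names and structure are simplified and this is not the actual regrtest code.]

    # Rough sketch of what regrtest's -R option does under the hood: run a
    # test repeatedly and compare the interpreter-wide reference count,
    # which only exists in --with-pydebug builds as sys.gettotalrefcount().
    import sys

    def hunt_refleaks(run_test, warmups=2, tracked=9):
        counts = []
        for i in range(warmups + tracked):
            run_test()
            counts.append(sys.gettotalrefcount())
        # Deltas between consecutive tracked runs; a leak-free test is all zeros.
        return [counts[i] - counts[i - 1] for i in range(warmups, len(counts))]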

On 4/16/06, Thomas Wouters <thomas@python.org> wrote:
test_threading_local is not entirely consistent, but it looks a lot more reliable on my box than on Neal's automated mails:
    test_threading_local
    beginning 11 repetitions
    12345678901
    ...........
    test_threading_local leaked [34, 34, 34, 34, 34, 26, 26, 22, 34] references
This is caused by _threading_local.local's __del__ method, or rather the fact that it's part of a closure enclosing threading.enumerate. Fixing the inner __del__ to call enumerate (which is 'threading.enumerate') directly, rather than through the cellvar 'threading_enumerate', makes the leak go away.

The fact that the leakage is inconsistent is actually easily explained: the 'local' instances are stored on the 'currentThread' object indexed by 'key', and keys sometimes get reused (since they're basically id(self)), so sometimes an old reference is overwritten. It doesn't quite explain why using the cellvar causes the cycle, nor does it explain why gc.garbage remains empty. I guess some Thread objects linger in threading's _active or _limbo dicts, but I have no idea why having a cellvar in the cycle matters; they seem to be participating in GC just fine, and I cannot reproduce the leak with a simpler test.

And on top of that, I'm not sure *why* _threading_local.local is doing the song and dance to get a cellvar. If the global 'enumerate' (which is threading.enumerate) disappears, it will be because Python is cleaning up. Even if we had a need to clean up the local dict at that time (which I don't believe we do), using a cellvar doesn't guarantee anything more than using a global name.

Chances are very good that the 'threading' module has also been cleared, meaning that while we still have a reference to threading.enumerate, it cannot use the three globals it uses (_active, _limbo, _active_limbo_lock.) All in all, I think matters improve significantly if it just deals with the NameError it'll get at cleanup (which it already does.)

--
Thomas Wouters <thomas@python.org>

Hi! I'm a .signature virus! copy me into your .signature file to help me spread!
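[Editor's note: to make the key-reuse point concrete, here is a heavily simplified sketch of the storage scheme being described. The names are invented and this is not the real _threading_local code: each instance parks its per-thread state on the current Thread object under a key built from id(self), so once an instance dies, a newly created one can inherit its id and overwrite the stale entry.]

    # Simplified sketch (invented names) of per-thread storage keyed by
    # id(self), as used by _threading_local-style objects.  Because CPython
    # reuses id()s after an object is freed, a new instance can land on a
    # dead instance's key and overwrite the stale reference -- which is why
    # the leak counts above jump around instead of growing steadily.
    from threading import currentThread

    class SketchLocal(object):
        def __init__(self):
            # Stored via object.__setattr__ so our own __setattr__ below
            # doesn't try to put it in the per-thread dict.
            object.__setattr__(self, '_key', 'sketch_local.%d' % id(self))

        def _dict_for_current_thread(self):
            thread = currentThread()
            d = thread.__dict__.get(self._key)
            if d is None:
                d = thread.__dict__[self._key] = {}
            return d

        def __setattr__(self, name, value):
            self._dict_for_current_thread()[name] = value

        def __getattr__(self, name):
            try:
                return self._dict_for_current_thread()[name]
            except KeyError:
                raise AttributeError(name)

A Thread object that lingers (for instance in threading's _active or _limbo dicts) would keep such entries alive, which is consistent with the guess above.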

[Thomas Wouters]
test_threading_local is not entirely consistent, but it looks a lot more reliable on my box than on Neal's automated mails:
    test_threading_local
    beginning 11 repetitions
    12345678901
    ...........
    test_threading_local leaked [34, 34, 34, 34, 34, 26, 26, 22, 34] references
[also Thomas]
This is caused by _threading_local.local's __del__ method, or rather the fact that it's part of a closure enclosing threading.enumerate. Fixing the inner __del__ to call enumerate (which is 'threading.enumerate') directly, rather than through the cellvar 'threading_enumerate', makes the leak go away.

The fact that the leakage is inconsistent is actually easily explained: the 'local' instances are stored on the 'currentThread' object indexed by 'key', and keys sometimes get reused (since they're basically id(self)), so sometimes an old reference is overwritten. It doesn't quite explain why using the cellvar causes the cycle, nor does it explain why gc.garbage remains empty. I guess some Thread objects linger in threading's _active or _limbo dicts, but I have no idea why having a cellvar in the cycle matters; they seem to be participating in GC just fine, and I cannot reproduce the leak with a simpler test.
And on top of that, I'm not sure *why* _threading_local.local is doing the song and dance to get a cellvar. If the global 'enumerate' (which is threading.enumerate) disappears, it will be because Python is cleaning up. Even if we had a need to clean up the local dict at that time (which I don't believe we do), using a cellvar doesn't guarantee anything more than using a global name.
The threading_enumerate = enumerate line creates a persistent local variable at module import time, which (unlike a global name) can't get "None'd out" at shutdown time. BTW, it's very easy to miss that this line _is_ executed at module import time, and is executed only once over the life of the interpreter; more on that below.
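[Editor's note: to illustrate the difference, here is a standalone sketch that simulates the shutdown behaviour by hand, since the real thing only happens while the interpreter is tearing modules down: rebinding a module global to None breaks functions that look the name up globally, but leaves a previously captured closure cell untouched.]

    # Simulates what module teardown does at interpreter shutdown: globals
    # are rebound to None.  A function doing a global lookup then breaks,
    # while a value captured earlier in a closure cell is unaffected.

    def helper():
        return "still here"

    def uses_global():
        return helper()          # looks 'helper' up in module globals each call

    def make_cell_user():
        captured = helper        # copies the *current* binding into a cell
        def uses_cell():
            return captured()    # reads the cell, not the module globals
        return uses_cell

    uses_cell = make_cell_user()

    helper = None                # crude stand-in for shutdown clearing globals

    assert uses_cell() == "still here"   # the cell still holds the old function
    try:
        uses_global()
    except TypeError:            # 'NoneType' object is not callable
        pass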
Chances are very good that the 'threading' module has also been cleared, meaning that while we still have a reference to threading.enumerate, it cannot use the three globals it uses (_active, _limbo, _active_limbo_lock.) All in all, I think matters improve significantly if it just deals with the NameError it'll get at cleanup (which it already does.)
Well, you missed something obvious :-): the code is so clever now that its __del__ doesn't actually do anything. In outline:

"""
...

# Threading import is at end

...

class local(_localbase):
    ...
    def __del__():
        threading_enumerate = enumerate

        ...

        def __del__(self):
            try:
                threads = list(threading_enumerate())
            except:
                # if enumerate fails, as it seems to do during
                # shutdown, we'll skip cleanup under the assumption
                # that there is nothing to clean up
                return
            ...
        return __del__
    __del__ = __del__()

from threading import currentThread, enumerate, RLock
"""

Lots of questions pop to mind, from why the import is at the bottom of the file, to why it's doing this seemingly bizarre nested-__del__ dance. I don't have good answers to any of them <0.1 wink>, but this is the bottom line: at the time

    threading_enumerate = enumerate

is executed, `enumerate` is bound to __builtin__.enumerate, not to threading.enumerate. That line is executed during the _execution_ of the "class local" statement, the first time the module is imported, and the import at the bottom of the file has not been executed by that time. So there is no global `enumerate` at the time, but there is one in __builtin__, so that's the one it captures.

As a result, "the real" __del__'s

    threads = list(threading_enumerate())

line _always_ raises an exception, namely the TypeError "enumerate() takes exactly 1 argument (0 given)", which is swallowed by the bare "except:", and __del__ then returns having accomplished nothing. Of course that's the real reason it never cleans anything up now -- while virtually any way of rewriting it causes it to get the _intended_ threading.enumerate, and then the leaks stop. I'll check one of those in.
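[Editor's note: a self-contained sketch of the trap Tim describes, using invented names rather than the stdlib's: because the class body executes at import time, before the late import at the bottom of the module, the closure cell captures the builtin enumerate, and calling it with no arguments raises exactly the TypeError that the bare except: then swallows.]

    # sketch_local.py -- illustrative module (invented names), mirroring the
    # structure Tim outlines above.  The class body runs when the module is
    # first imported; at that point the late 'from threading import ...' at
    # the bottom has not executed, so 'enumerate' resolves to the builtin.

    class Local(object):
        def _make_del():
            captured_enumerate = enumerate      # binds the *builtin* enumerate

            def __del__(self):
                try:
                    # builtin enumerate() needs an iterable argument, so
                    # this always raises TypeError...
                    threads = list(captured_enumerate())
                except:
                    # ...which the bare except swallows, so cleanup never runs.
                    return
                # ... per-thread cleanup that is never reached ...
            return __del__
        __del__ = _make_del()

    from threading import enumerate     # too late: the cell above is already bound

    if __name__ == '__main__':
        obj = Local()
        del obj      # __del__ runs, silently accomplishes nothing

Rebinding the cell after the import, or simply looking the name up at call time as Thomas suggests, makes the intended cleanup actually run.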

[Thomas Wouters]
... One remaining issue with refleakhunting on my machine is that test_tcl can't stand being run twice. Even without -R, this makes Python hang while waiting for a mutex in the second run through test_tcl:
...trunk $ ./python -E -tt Lib/test/regrtest test_tcl test_tcl
Attaching gdb to the hung process shows this unenlightening trace:

    #0  0x00002b7d6629514b in __lll_mutex_lock_wait () from /lib/libpthread.so.0
    #1  0x00002b7d6639a280 in completed.4801 () from /lib/libpthread.so.0
    #2  0x0000000000000004 in ?? ()
    #3  0x00002b7d66291dca in pthread_mutex_lock () from /lib/libpthread.so.0
    #4  0x0000000000000000 in ?? ()
The process has one other thread, which is stuck here:

    #0  0x00002b7d667f14d6 in __select_nocancel () from /lib/libc.so.6
    #1  0x00002b7d67512d8c in Tcl_WaitForEvent () from /usr/lib/libtcl8.4.so.0
    #2  0x00002b7d66290b1c in start_thread () from /lib/libpthread.so.0
    #3  0x00002b7d667f8962 in clone () from /lib/libc.so.6
    #4  0x0000000000000000 in ?? ()
It smells like test_tcl or Tkinter is doing something wrong with regards to threads. I can reproduce this on a few machines, but all of them run newish linux kernels with newish glibc's and newish tcl/tk. At least in kernel/libc land, various thread related things changed of late. I don't have access to other machines with tcl/tk right now, but I wonder if anyone can reproduce this in different situations.
FYI, there's no problem running test_tcl with -R on WinXP Pro SP2 from current trunk:

    C:\Code\python\PCbuild>python_d -E -tt ../lib/test/regrtest.py -R:: test_tcl
    test_tcl
    beginning 9 repetitions
    123456789
    .........
    1 test OK.
    [27306 refs]

That's using Tcl/Tk 8.4.12 (I don't know what newish means in Tcl-land these days).
participants (2)

- Thomas Wouters
- Tim Peters