Greg Ewing wrote:
- Objects owned by a thread
Python code creates and destroys temporary objects at a high rate -- stack frames, argument tuples, intermediate results, etc. If the code is executed by a thread, those objects are rarely if ever seen outside of that thread. It would be beneficial if refcount operations on such objects could be carried out by the thread that created them without locking.
While we are on the topic of reference counting, I'd like to direct your attention to Recycler, an IBM research project:
"The Recycler is a concurrent multiprocessor garbage collector with extremely low pause times (maximum of 6 milliseconds over eight benchmarks) while remaining competitive with the best throughput- oriented collectors in end-to-end execution times. This paper describes the overall architecture of the Recycler, including its use of reference counting and concurrent cycle collection, and presents extensive measurements of the system comparing it to a parallel, stop-the-world mark-and-sweep collector."
There are a bunch of research papers describing Recycler which can be found at the following link:
I'd start with the papers entitled "Java without the Coffee Breaks: A Non-intrusive Multiprocessor Garbage Collector" and "Concurrent Cycle Collection in Reference Counted Systems".
Let me describe a bit about how the Recycler works and how it relates to what you've proposed.
The basic idea is that for each thread, there is a set of thread local data (TLD) that contains a pair of "refcount buffers", one buffer for increfs and one buffer for decrefs. Each refcount buffer is a flat array of pointers which starts empty and gradually fills up.
The "incref" operation does not actually touch the reference count field of the object. Instead, an "incref" appends a pointer to the object to the end of the incref buffer for that thread. Similarly, a decref operation appends a pointer to the object to the decref buffer. Since the refcount buffers are thread-local, there is no need for locking or synchronization.
When one of the buffers gets full, both buffers are swapped out for new ones, and the old buffers are placed on a queue which is processed by the collector thread. The collector thread is the only thread which is allowed to actually touch the reference counts of the individual objects, and its the only thread which is allowed to delete objects.
Processing the buffers is relatively simple: First, the incref buffer is processed. The collector thread scans through each pointer in the buffer, and increments the refcount of each object. Then the decref buffer is processed in a similar way, decrementing the refcount.
However, it also needs to process the buffers for the other threads before the memory can be reclaimed. Recycler defines a series of "epochs" (i.e. intervals between collections). Within a refcount buffer, each epoch is represented as a contiguous range of values within the array -- all of the increfs and decrefs which occurred during that epoch. An auxiliary array records the high water mark for each epoch.
Using this information, the collector thread is able to process only those increfs and decrefs for all threads which occurred before the current epoch. This does mean that objects whose refcount falls to zero during the current epoch will not be deleted until the next collection cycle.
Recycler also handles cyclic garbage via cycle detection, which is described in the paper. It does not use a "mark and sweep" type of algorithm, but is instead able to detect cycles locally without scanning the entire heap.
Thus, the Recycler's use of refcount buffers achieves what you were trying to achieve, which is refcounting without locking. However, it does require access to thread-local data for each incref / release operation. The performance of this scheme will greatly depend on how quickly the code can get access to thread local data. The fastest possible method would be to dedicate a register, but that's infeasible on most systems. Another idea is for large functions to look up the TLD and stuff it in a local variable at the beginning of the function. For older source code the existing, backwards-compatible incref and decref macros could each individually get access to the TLD, but these would be slower than the more optimized methods in which the TLD was supplied as a parameter.