I know this may be tiresome by now, so feel free to ignore, but I'd like to share with the list an idea about the GIL, more specifically the reference counting mechanism.
Simply put, make the reference counter a sharded one. That is, separate it into several subcounters, in this case one for each thread.
Yeah, that's known as sloppy counters, see e.g. http://pdos.csail.mit.edu/papers/linux:osdi10.pdf, section 4.3. Actually you don't need a per-thread counter, only per-cpu (see sched_getcpu()/getcpu()), although I'm not sure it'd be as fast as using thread-register like Trent does.
Unfortunately, in a crude test of mine there is already a severe performance degradation, and that is without rwlocks. I've used a basic linked list, and changed the INCREF/DECREF macros to functions to accommodate the extra logic so it may not be the best approach (too many dereferences).
I think that's a dead-end. Extra indirection and cache miss will kill performance. Also, such counters are mostly useful for data structures which are only seldom deallocated (because reconciling the counters is expensive), and with garbage collection, you tend to have many short-lived objects (generational hypothesis). Finally, that'll increase the size of objects consequently (since you need a counter per core/thread), which is bad for cache.
Does this makes sense to anyone?
You can try ;-) cf P.S.: Starting a thread on the GIL on April fools' day is a great idea!