Multi-threaded computation: thread-local storage to reduce false sharing, read-only access to globals, and a paused GIL to allow parallel scaling of reads

Hi, my name is Josh Ring. I am interested in raw compute performance from C modules called from Python, most of which are single-threaded (e.g. NumPy, SciPy, etc.). Some things are sensible with many threads:

1. Read-only global state can avoid locks entirely in multi-threaded code; this avoids the cache-line invalidations that kill scaling beyond 2-4 threads.
2. Reference-count increments/decrements can be paused on entering the parallel region to avoid invalidating caches of shared objects; read-only access makes this safe.
3. Memory locality: using thread-local stack storage by default, with heap allocations bound per thread, is essential to scale beyond 4 threads and on NUMA server systems.
4. Leaving the GIL intact so single-threaded code can do the "cleanup stage" of temporaries after parallel computation has finished.

- I liked the approach of a "parallel region", where data does not need to be pickled and shared memory can be accessed directly, read-only.
- If global state is unchangeable from a threaded region, we can avoid many gotchas and races while leaving the GIL (almost) alone.
- If reference counting can be "paused" during the parallel region, we can avoid the cache invalidation caused by incr/decr from multiple threads, which limits scaling with more threads; this is evident even with 2 threads.
- Thread-local storage is the default and only option, avoiding clunky "threading.local()" storage classes; "thread-bound" heap allocations would also be a good thing, to increase efficiency and reduce "false sharing": https://en.wikipedia.org/wiki/False_sharing
- Implement the parallel region as a function with a decorator, akin to OpenMP? The function then defines the scope of the local variables, the start and end of the parallel region, when to return, etc., in a straightforward manner.
- By default, the objects returned from the parallel region go into separate per-thread objects (avoiding GIL contention); these temporary objects are then merged into a list once control returns to a single thread.
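The proposed decorator does not exist yet; as a rough sketch of the intended API, here is a hypothetical `parallel_region` decorator emulated with today's threading module. Each thread writes only to its own result slot (no shared mutation inside the region), and the per-thread results are merged into a list only after all threads have joined, back on a single thread:

```python
import threading

def parallel_region(num_threads=4):
    """Hypothetical decorator sketching the proposed API: run the
    decorated function once per thread, keep each return value in a
    separate per-thread slot, and merge into a list on one thread."""
    def decorator(func):
        def wrapper(*args, **kwargs):
            results = [None] * num_threads  # one slot per thread: no contention
            def worker(tid):
                # Each thread writes only to its own slot; shared inputs
                # are accessed read-only.
                results[tid] = func(tid, *args, **kwargs)
            threads = [threading.Thread(target=worker, args=(i,))
                       for i in range(num_threads)]
            for t in threads:
                t.start()
            for t in threads:
                t.join()
            # "Cleanup stage": the merged list is built on a single thread.
            return results
        return wrapper
    return decorator

@parallel_region(num_threads=4)
def partial_sum(tid, data):
    # Read-only access to the shared `data`; each thread sums its own stripe.
    return sum(data[tid::4])

data = list(range(1000))
parts = partial_sum(data)   # four per-thread partial sums
total = sum(parts)          # merged after control returns to one thread
```

Under the actual proposal the threads would run without incr/decr traffic and with thread-bound allocations; this emulation only illustrates the shape of the API.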
- Objects marked @thread_shared have their state merged from the thread-local copies once execution on all the threads has finished. This could be made intelligent if an index is provided to put each entry in the right place in a list/dict, etc.

Thoughts?

This proposal borrows several ideas from PyParallel, but without the focus on Windows-only IO. It is more focused on accelerating raw compute performance and is tailored to high-performance computing, for instance at CERN.
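To make the @thread_shared merge concrete, here is a small emulation using today's `threading.local()` (the very mechanism the proposal calls clunky). Each thread fills a private dict keyed by an index, and the copies are merged into one shared dict only after the per-thread work is done; the index determines where each entry lands, as suggested above. The names here are illustrative, not part of any existing API:

```python
import threading

local = threading.local()      # today's per-thread storage
merged = {}                    # the @thread_shared object being emulated
merge_lock = threading.Lock()  # only touched once per thread, at merge time

def worker(indices, data):
    # Thread-private copy: no sharing, hence no cache-line contention
    # during the compute phase.
    local.partial = {}
    for i in indices:
        local.partial[i] = data[i] * data[i]
    # Merge step: the index keys put each entry in the right place.
    with merge_lock:
        merged.update(local.partial)

data = list(range(8))
threads = [threading.Thread(target=worker, args=(range(t, 8, 2), data))
           for t in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()
# merged now holds every index -> square, assembled from two
# thread-local copies after both threads finished.
```

In the proposed design the merge would happen automatically when the parallel region exits, rather than via an explicit lock-guarded update as here.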