Huge shared read-only data with parallel access -- how? Multithreading? Multiprocessing?
Valery
khamenya at gmail.com
Wed Dec 9 09:58:11 EST 2009
Hi all,
Q: how to organize parallel access to a huge, common, read-only Python
data structure?
Details:
I have a huge data structure that takes >50% of RAM.
My goal is to have many computational threads (or processes) that have
efficient read access to the huge and complex data structure.
"Efficient" in particular means "without serialization" and "without
unneeded locking on read-only data".
As far as I can see, there are the following strategies:
1. multi-processing
 => a. child processes get their own *copies* of the huge data structure
    -- bad, and not possible at all in my case;
 => b. child processes communicate with the parent process via some IPC
    -- bad (serialization);
 => c. child processes access the huge structure via some shared-memory
    approach -- feasible without serialization?! (copy-on-write does not
    work well here in CPython/Linux: the interpreter updates reference
    counts on every object access, which dirties the pages and forces
    them to be copied anyway; see the sketch after this list);
2. multi-threading
 => d. CPython is said to have problems here because of the GIL -- any
    comments?
 => e. GIL-less implementations have their own issues -- any hot
    recommendations?
I am a big fan of the parallel map() approach -- either
multiprocessing.Pool.map or, even better, pprocess.pmap. However, this
no longer works in a straightforward way once "huge data" means >50% of
RAM
;-)
Comments and ideas are highly welcome!!
Here is a workbench example of my case:
######################
import time
from multiprocessing import Pool

def f(_):
    time.sleep(5)  # just to emulate the time used by my computation
    res = sum(parent_x)  # my sophisticated formula goes here
    return res

if __name__ == '__main__':
    parent_x = [1./i for i in xrange(1, 10000000)]  # my huge read-only data :o)
    p = Pool(7)
    res = list(p.map(f, xrange(10)))
    # switch to ps and see how fast your free memory is getting wasted...
    print res
######################
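To watch that happen without staring at ps, a tiny helper (hypothetical,
Linux-only) can report each worker's resident set size by parsing
/proc/<pid>/status; printing it at the start and at the end of f()
should show the child's RSS climbing as the copy-on-write pages get
touched:
######################
import os

def rss_mb():
    # read the VmRSS line from /proc/<pid>/status, e.g. "VmRSS:  123456 kB"
    with open('/proc/%d/status' % os.getpid()) as status:
        for line in status:
            if line.startswith('VmRSS:'):
                return int(line.split()[1]) // 1024   # kB -> MB
    return -1
######################
e.g. add  print 'worker %d: %d MB' % (os.getpid(), rss_mb())  as the
first and last statement of f().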
Kind regards
Valery