Huge shared read-only data with parallel access -- how? Multithreading? Multiprocessing?
Valery
khamenya at gmail.com
Wed Dec 9 09:58:11 EST 2009
Hi all,
Q: how to organize parallel access to a huge, common, read-only Python
data structure?
Details:
I have a huge data structure that takes >50% of RAM.
My goal is to have many computational threads (or processes) that have
efficient read access to the huge and complex data structure.
"Efficient" in particular means "without serialization" and "without
unneeded locking on read-only data".
As far as I can see, there are the following strategies:
1. multi-processing
 => a. child processes get their own *copies* of the huge data structure
    -- bad, and not possible at all in my case;
 => b. child processes communicate with the parent process via some IPC
    -- bad (serialization);
 => c. child processes access the huge structure via some shared-memory
    approach -- feasible without serialization?! (copy-on-write does not
    work well here in CPython/Linux: the interpreter updates reference
    counts on every object access, which dirties the pages and forces
    them to be copied anyway; see the sketch after this list);
2. multi-threading
 => d. CPython is said to have problems here because of the GIL -- any
    comments?
 => e. GIL-less implementations have their own issues -- any hot
    recommendations?
I am a big fan of the parallel map() approach -- either
multiprocessing.Pool.map or, even better, pprocess.pmap. However, this
no longer works in a straightforward way once "huge data" means >50% of
RAM
;-)
Comments and ideas are highly welcome!!
Here is a workbench example of my case:
######################
import time
from multiprocessing import Pool

def f(_):
    time.sleep(5)  # just to emulate the time used by my computation
    res = sum(parent_x)  # my sophisticated formula goes here
    return res

if __name__ == '__main__':
    parent_x = [1./i for i in xrange(1, 10000000)]  # my huge read-only data :o)
    p = Pool(7)
    res = list(p.map(f, xrange(10)))
    # switch to ps and see how fast your free memory is getting wasted...
    print res
######################
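To watch that happen without staring at ps, a tiny helper (hypothetical,
Linux-only) can report each worker's resident set size by parsing
/proc/<pid>/status; printing it at the start and at the end of f()
should show the child's RSS climbing as the copy-on-write pages get
touched:
######################
import os

def rss_mb():
    # read the VmRSS line from /proc/<pid>/status, e.g. "VmRSS:  123456 kB"
    with open('/proc/%d/status' % os.getpid()) as status:
        for line in status:
            if line.startswith('VmRSS:'):
                return int(line.split()[1]) // 1024   # kB -> MB
    return -1
######################
e.g. add  print 'worker %d: %d MB' % (os.getpid(), rss_mb())  as the
first and last statement of f().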
Kind regards
Valery