Pickle based workflow - looking for advice
Fabien
fabien.maussion at gmail.com
Mon Apr 13 10:58:01 EDT 2015
Folks,
I am writing a quite extensive piece of scientific software. Its
workflow is quite easy to explain. The tool performs a series of
operations on watersheds (such as mapping data onto them, geostatistics
and more). There are thousands of independent watersheds of different
sizes, and the size determines the computing time spent on each of them.
Say I have the operations A, B, C and D. B and C are completely
independent, but they need A to be run first; D needs B and C, and so
forth. Eventually the whole chain A, B, C and D will run just once, but
of course development is an iterative process and I rerun the
operations many times.
Currently my workflow is as follows: define a unique ID and working
directory for each watershed, and define A and B like this:
import os
import pickle

def A(watershed_dir):
    # read some external data
    # do stuff
    # store the stuff in a Watershed object
    # save it
    f_pickle = os.path.join(watershed_dir, 'watershed.p')
    with open(f_pickle, 'wb') as f:
        pickle.dump(watershed, f)
def B(watershed_dir):
    f_pickle = os.path.join(watershed_dir, 'watershed.p')
    with open(f_pickle, 'rb') as f:
        watershed = pickle.load(f)
    # do new stuff
    # store it in watershed and save
    with open(f_pickle, 'wb') as f:
        pickle.dump(watershed, f)
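Since every operation repeats the same load/save boilerplate, I could
factor it into two small helpers. A minimal sketch (load_watershed and
save_watershed are just placeholder names, not something I have yet):

import os
import pickle

def load_watershed(watershed_dir):
    # read the Watershed container back from the working directory
    with open(os.path.join(watershed_dir, 'watershed.p'), 'rb') as f:
        return pickle.load(f)

def save_watershed(watershed, watershed_dir):
    # overwrite the pickle with the updated container
    with open(os.path.join(watershed_dir, 'watershed.p'), 'wb') as f:
        pickle.dump(watershed, f)

def B(watershed_dir):
    watershed = load_watershed(watershed_dir)
    # do new stuff
    save_watershed(watershed, watershed_dir)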
So the Watershed object is a data container which grows in content. The
pickle that stores the info can reach a few MB in size. I chose this
strategy because A, B, C and D are independent, but they can share their
results through the pickle. The functions take a single argument (the
path to the working directory), which means that when I run the
thousands of catchments I can use a multiprocessing pool:
import multiprocessing as mp
poolargs = [list of directories]
pool = mp.Pool()
poolout = pool.map(A, poolargs, chunksize=1)
poolout = pool.map(B, poolargs, chunksize=1)
etc.
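For completeness, and assuming C and D take the same single directory
argument, the full chain would look something like this sketch:

import multiprocessing as mp

def run_all(directories):
    # each pool.map blocks until all watersheds are done, so the
    # dependency order (A before B and C, D last) is respected
    pool = mp.Pool()
    for operation in (A, B, C, D):
        pool.map(operation, directories, chunksize=1)
    pool.close()
    pool.join()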
I can easily choose to rerun just B without rerunning A. The time spent
reading and writing the pickles is small compared to the rest of the
work (running B or C on a single catchment can take seconds, for
example).
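If I wanted to check that claim, a rough timing of one pickle round trip
could look like this (time_pickle_roundtrip is just an illustrative
helper, not part of the tool):

import os
import pickle
import time

def time_pickle_roundtrip(watershed_dir):
    # rough check of pickle I/O overhead for one watershed
    f_pickle = os.path.join(watershed_dir, 'watershed.p')
    t0 = time.time()
    with open(f_pickle, 'rb') as f:
        watershed = pickle.load(f)
    t1 = time.time()
    with open(f_pickle, 'wb') as f:
        pickle.dump(watershed, f)
    t2 = time.time()
    print('load: %.3f s, dump: %.3f s' % (t1 - t0, t2 - t1))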
Now, to my questions:
1. Does that seem reasonable?
2. Should Watershed be an object or a simple dictionary? I thought an
object could be nice because it could take care of some operations such
as plotting and logging. Currently I define a class Watershed, but its
attributes are defined and filled by A, B and C (this seems a bit wrong
to me). I could give more responsibilities to this class, but it might
become way too big: since the whole purpose of the tool is to work on
watersheds, a Watershed class that does everything sounds like a code
smell (http://en.wikipedia.org/wiki/God_object). The small sketch after
question 4 shows roughly what I mean by a plain container.
3. The operation A opens an external file, reads data out of it and
writes it into the Watershed object. Is it a bad idea to multiprocess
this? (I guess it is, since the file might be read by two processes at
the same time.)
4. Other comments you might have?
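To make question 2 concrete, a container along these lines is what I
mean (the names are illustrative, not my actual code):

class Watershed(object):
    """Plain data container; A, B, C and D attach their results as attributes."""

    def __init__(self, ws_id, watershed_dir):
        self.ws_id = ws_id
        self.dir = watershed_dir

    def plot_summary(self):
        # convenience methods like plotting or logging could live here
        pass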
Sorry for the lengthy mail, but thanks for any tips.
Fabien