Pickle based workflow - looking for advice
davea at davea.name
Mon Apr 13 18:25:38 CEST 2015
On 04/13/2015 10:58 AM, Fabien wrote:
A comment. Pickle is a method of creating persistent data, most
commonly used to preserve data between runs. A database is another
method. Although either one can also be used with multiprocessing, you
seem to be worrying more about the mechanism, and not enough about the
> I am writing a quite extensive piece of scientific software. Its
> workflow is quite easy to explain. The tool realizes series of
> operations on watersheds (such as mapping data on it, geostatistics and
> more). There are thousands of independent watersheds of different size,
> and the size determines the computing time spent on each of them.
First question: what is the name or "identity" of a watershed?
Apparently it's named by a directory. But you mention ID as well. You
write a function A() that takes only a directory name. Is that the name
of the watershed? One per directory? And you can derive the ID from
the directory name?
Second question, is there any communication between watersheds, or are
they totally independent?
Third: this "external data", is it dynamic, do you have to fetch it in
a particular order, is it separated by watershed id, or what?
Fourth: when the program starts, are the directories all empty, so the
presence of a pickle file tells you that A() has run? Or is there some
other meaning for those files?
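
To make the fourth question concrete, here is a minimal sketch of the "pickle file as a marker" reading, assuming that the presence of the file really does mean A() has already run for that directory. The names run_a, a_cached and the file name "a.pickle" are hypothetical and are not taken from the original post.

```python
import os
import pickle


def run_a(watershed_dir):
    """Hypothetical stand-in for A(): compute something for one watershed."""
    return {"dir": watershed_dir, "step": "A"}


def a_cached(watershed_dir, filename="a.pickle"):
    """Run A() only if its pickle file is absent; otherwise load the stored result."""
    path = os.path.join(watershed_dir, filename)
    if os.path.exists(path):
        # Assumption: an existing file means A() has already completed here.
        with open(path, "rb") as f:
            return pickle.load(f)
    result = run_a(watershed_dir)
    with open(path, "wb") as f:
        pickle.dump(result, f)
    return result
```

If the files can mean something else, for example a partial result left behind by a crashed run, an existence check like this is not sufficient.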
> Say I have the operations A, B, C and D. B and C are completely
> independent but they need A to be run first, D needs B and C, and so
> forth. Eventually the whole operations A, B, C and D will run once for
For all what?
> but of course the whole development is an iterative process and I
> rerun all operations many times.
Based on what? Is the external data changing, and you have to rerun
functions to update what you've already stored about them? Or do you
just mean you call the A() function on every possible watershed?
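
For reference, one possible reading of the ordering quoted above (B and C need A, D needs B and C) can be sketched with the standard library's graphlib module (Python 3.9+). This is only an illustration under that assumption: DEPENDENCIES, run_step and process_watershed are hypothetical names, and the step bodies are placeholders.

```python
from graphlib import TopologicalSorter  # Python 3.9+

# Each step maps to the set of steps that must run before it.
DEPENDENCIES = {"A": set(), "B": {"A"}, "C": {"A"}, "D": {"B", "C"}}


def run_step(step, watershed, done):
    """Hypothetical placeholder for the real work of one step."""
    missing = DEPENDENCIES[step] - set(done)
    if missing:
        raise RuntimeError(f"{step} needs {sorted(missing)} first")
    return f"{step} done for {watershed}"


def process_watershed(watershed):
    """Run A, B, C and D for one watershed in an order that respects DEPENDENCIES."""
    done = {}
    for step in TopologicalSorter(DEPENDENCIES).static_order():
        done[step] = run_step(step, watershed, done)
    return done
```

If the watersheds really are independent (the second question), a function like process_watershed could be mapped over all of them, serially or with multiprocessing, without changing the per-watershed logic.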
(I suddenly have to go out, so I can't comment on the rest, except that
choosing to pickle, or to marshal, or to database, or to
custom-serialize seems a bit premature. You may have it all clear in
your head, but I can't see what the interplay between all these calls to
one-letter-named functions is intended to be.)