Renting CPU time for a Python script
Fernando Perez
fperez528 at yahoo.com
Sun Jul 21 15:36:24 EDT 2002
>> 1- code in automatic checkpointing and self-restarting abilities. It's fairly
>> easy to do, and saves a lot of headaches.
>
> Do you know of any examples of this in available code, or could you
> outline how you would implement something like that? I have a project
> right now that such a capability would come quite useful in, but
> rather than hazard my own implementation right off, I'd like to try
> and leverage some other efforts.
The specifics depend too much on the details of the problem, but the
principle is simple. First, you decide how much time you are willing to
lose in the event of a crash, balancing the cost of the checkpointing
operation with running time (you don't want to checkpoint every 10
minutes if checkpointing itself takes 10 minutes!). For the OP's 5-12
day-long runs, I'd guess a checkpoint every 6-12 hours should be enough,
and it's no big deal even if it costs a few minutes to do it. A rule of thumb
is that the checkpointing operation shouldn't add more than ~1-2% to
your running time. Think of it as the monthly percentage of your income
you're willing to spend on insurance.
Then you need to identify what data in your code defines the state of
the program. Globals, counters, progress markers, etc. Come up with a
format to save it and dump it to a file, with a suitable naming
convention (typically encoding the input parameters of the run into the
filename is a good idea, so that multiple parallel runs with different
parameters can all checkpoint into a common directory without clobbering
each other). If your code is well designed, this could be as simple as
dumping a single object with pickle, or dumping a state object plus a
few binary data files.
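As an illustration of the dump step (the function names, the parameter-encoding scheme, and the atomic-rename trick are my own, not something from the original setup):

```python
import os
import pickle

def checkpoint_name(params, directory="checkpoints"):
    """Encode the run's input parameters into the filename, so parallel
    runs with different parameters can share one checkpoint directory
    without clobbering each other."""
    tag = "_".join(f"{k}-{params[k]}" for k in sorted(params))
    return os.path.join(directory, f"run_{tag}.chkpoint")

def save_checkpoint(state, params, directory="checkpoints"):
    """Pickle the state object to the run's checkpoint file.  Writing to a
    temp file and renaming means a crash mid-checkpoint can't leave a
    corrupt (half-written) checkpoint behind."""
    os.makedirs(directory, exist_ok=True)
    fname = checkpoint_name(params, directory)
    tmp = fname + ".tmp"
    with open(tmp, "wb") as f:
        pickle.dump(state, f)
    os.replace(tmp, fname)  # atomic rename on POSIX
    return fname
```

Here `state` can be a single dict (or any picklable object) holding the globals, counters and progress markers identified above.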
Then your code is written so that:
1- if successful, it deletes the checkpoint file and leaves something
like 'checkpoint_filename.success' in its place, a zero-sized file. This
way you can know immediately which runs finished correctly. If you want
to get fancy, you can run a cgi script which allows you to check over
the web the status of your entire run set: for each parameter set
identifying a run, green marks it as successfully finished, yellow as
in progress, blue as not started yet, and red as crashed. Then you can
simply click on the red ones and resubmit them for re-run.
2- at startup, each run looks at the checkpoint directory to make sure
there's no .success file (to avoid repeating a run accidentally). Then
it looks for a .chkpoint file; if it finds one, that means the same
run was started earlier and crashed partway. At that point it invokes the
restarting routine, picks up at the checkpoint, and runs until completion.
3- if desired (I've done this in the past) you can have each run submit
the next one in the run stream when finished. This allows you to keep a
set of parallel running streams going, using as many machines as you
have available but with each machine doing one run at a time.
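A minimal sketch of the three steps above, assuming the .chkpoint/.success naming convention (the function names and the queue-file scheme for step 3 are my own illustration; in a real setup the resubmission would more likely go through your batch system):

```python
import os
import pickle

def run_status(chkpoint_file):
    """Step 1: classify a run from its marker files.  A zero-sized .success
    file replaces the checkpoint on normal completion; a leftover .chkpoint
    means the run stopped partway (distinguishing 'crashed' from 'still in
    progress' needs extra info, e.g. the file's mtime or the job queue)."""
    if os.path.exists(chkpoint_file.replace(".chkpoint", ".success")):
        return "finished"
    if os.path.exists(chkpoint_file):
        return "crashed or in progress"
    return "not started"

def mark_success(chkpoint_file):
    """On success: delete the checkpoint, leave a zero-sized .success marker."""
    if os.path.exists(chkpoint_file):
        os.remove(chkpoint_file)
    open(chkpoint_file.replace(".chkpoint", ".success"), "w").close()

def start_or_resume(chkpoint_file, fresh_state):
    """Step 2: return None for a finished run (don't repeat it), the saved
    state for a crashed one, or fresh_state for a first attempt."""
    if os.path.exists(chkpoint_file.replace(".chkpoint", ".success")):
        return None
    if os.path.exists(chkpoint_file):
        with open(chkpoint_file, "rb") as f:
            return pickle.load(f)  # pick up at the checkpoint
    return fresh_state

def pop_next_run(queue_file):
    """Step 3: pop the next parameter line from a shared queue file, so a
    finishing job can submit it and keep the run stream going (concurrent
    access from parallel machines would need file locking on top of this)."""
    with open(queue_file) as f:
        lines = [ln for ln in f.read().splitlines() if ln.strip()]
    if not lines:
        return None  # run stream exhausted
    with open(queue_file, "w") as f:
        f.write("\n".join(lines[1:]))
    return lines[0]
```

The cgi status page mentioned in step 1 then reduces to calling run_status() over every parameter set and coloring the result.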
With this kind of scheme you can very easily manage a project which
involves hundreds of long-running jobs on a distributed and possibly
unreliable network over several months, with minimal fuss.
Good luck,
f.
More information about the Python-list mailing list