Renting CPU time for a Python script
Fernando Perez
fperez528 at yahoo.com
Sun Jul 21 15:36:24 EDT 2002
>> 1- code in automatic checkpointing and self-restarting abilities. It's fairly
>> easy to do, and saves a lot of headaches.
>
> Do you know of any examples of this in available code, or could you
> outline how you would implement something like that? I have a project
> right now that such a capability would come quite useful in, but
> rather than hazard my own implementation right off, I'd like to try
> and leverage some other efforts.
The specifics depend too much on the details of the problem, but the
principle is simple. First, you decide how much time you are willing to
lose in the event of a crash, balancing the cost of the checkpointing
operation with running time (you don't want to checkpoint every 10
minutes if checkpointing itself takes 10 minutes!). For the OP's 5-12
day-long runs, I'd guess a checkpoint every 6-12 hours should be enough,
and it's no big deal even if it costs a few minutes to do it. A rule of thumb
is that the checkpointing operation shouldn't add more than ~1-2% to
your running time. Think of it as the monthly percentage of your income
you're willing to spend on insurance.
Then you need to identify what data in your code defines the state of
the program. Globals, counters, progress markers, etc. Come up with a
format to save it and dump it to a file, with a suitable naming
convention (typically encoding the input parameters of the run into the
filename is a good idea, so that multiple parallel runs with different
parameters can all checkpoint into a common directory without clobbering
each other). If your code is well designed, this could be as simple as
dumping a single object with pickle, or dumping a state object plus a
few binary data files.
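As an illustration of the dump step (the function names, the parameter-encoding scheme, and the atomic-rename trick are my own, not something from the original setup):

```python
import os
import pickle

def checkpoint_name(params, directory="checkpoints"):
    """Encode the run's input parameters into the filename, so parallel
    runs with different parameters can share one checkpoint directory
    without clobbering each other."""
    tag = "_".join(f"{k}-{params[k]}" for k in sorted(params))
    return os.path.join(directory, f"run_{tag}.chkpoint")

def save_checkpoint(state, params, directory="checkpoints"):
    """Pickle the state object to the run's checkpoint file.  Writing to a
    temp file and renaming means a crash mid-checkpoint can't leave a
    corrupt (half-written) checkpoint behind."""
    os.makedirs(directory, exist_ok=True)
    fname = checkpoint_name(params, directory)
    tmp = fname + ".tmp"
    with open(tmp, "wb") as f:
        pickle.dump(state, f)
    os.replace(tmp, fname)  # atomic rename on POSIX
    return fname
```

Here `state` can be a single dict (or any picklable object) holding the globals, counters and progress markers identified above.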
Then your code is written so that:
1- if successful, it deletes the checkpoint file and leaves something
like 'checkpoint_filename.success' in its place, a zero-sized file. This
way you can know immediately which runs finished correctly. If you want
to get fancy, you can run a cgi script which allows you to check over
the web the status of your entire run set: for each parameter set
identifying a run, green marks it as successfully finished, yellow as
in progress, blue as not started yet, and red as crashed. Then you can
simply click on the red ones and resubmit them for re-run.
2- at startup, each run looks at the checkpoint directory to make sure
there's no .success file (to avoid repeating a run accidentally). Then
it looks for a .chkpoint file; if it finds one, that means the same
run was started earlier and crashed partway. At that point it invokes the
restarting routine, picks up at the checkpoint, and runs until completion.
3- if desired (I've done this in the past) you can have each run submit
the next one in the run stream when finished. This allows you to keep a
set of parallel running streams going, using as many machines as you
have available but with each machine doing one run at a time.
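A minimal sketch of the three steps above, assuming the .chkpoint/.success naming convention (the function names and the queue-file scheme for step 3 are my own illustration; in a real setup the resubmission would more likely go through your batch system):

```python
import os
import pickle

def run_status(chkpoint_file):
    """Step 1: classify a run from its marker files.  A zero-sized .success
    file replaces the checkpoint on normal completion; a leftover .chkpoint
    means the run stopped partway (distinguishing 'crashed' from 'still in
    progress' needs extra info, e.g. the file's mtime or the job queue)."""
    if os.path.exists(chkpoint_file.replace(".chkpoint", ".success")):
        return "finished"
    if os.path.exists(chkpoint_file):
        return "crashed or in progress"
    return "not started"

def mark_success(chkpoint_file):
    """On success: delete the checkpoint, leave a zero-sized .success marker."""
    if os.path.exists(chkpoint_file):
        os.remove(chkpoint_file)
    open(chkpoint_file.replace(".chkpoint", ".success"), "w").close()

def start_or_resume(chkpoint_file, fresh_state):
    """Step 2: return None for a finished run (don't repeat it), the saved
    state for a crashed one, or fresh_state for a first attempt."""
    if os.path.exists(chkpoint_file.replace(".chkpoint", ".success")):
        return None
    if os.path.exists(chkpoint_file):
        with open(chkpoint_file, "rb") as f:
            return pickle.load(f)  # pick up at the checkpoint
    return fresh_state

def pop_next_run(queue_file):
    """Step 3: pop the next parameter line from a shared queue file, so a
    finishing job can submit it and keep the run stream going (concurrent
    access from parallel machines would need file locking on top of this)."""
    with open(queue_file) as f:
        lines = [ln for ln in f.read().splitlines() if ln.strip()]
    if not lines:
        return None  # run stream exhausted
    with open(queue_file, "w") as f:
        f.write("\n".join(lines[1:]))
    return lines[0]
```

The cgi status page mentioned in step 1 then reduces to calling run_status() over every parameter set and coloring the result.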
With this kind of scheme you can very easily manage a project which
involves hundreds of long-running jobs on a distributed and possibly
unreliable network over several months, with minimal fuss.
Good luck,
f.
More information about the Python-list mailing list