client-server parallelised number crunching

Hans Georg Schaathun georg at schaathun.net
Tue Apr 26 15:55:20 EDT 2011


I wonder if anyone has any experience with this ...

I am trying to set up a simple client-server system to do some number
crunching, using a simple ad hoc protocol over TCP/IP.  I use
two Queue objects on the server side to manage the input to and output
from the client processes.  A basic system running seemingly fine on a
single quad-core box was surprisingly simple to set up, and it seems to
give me a reasonable speed-up of a factor of around 3-3.5 using four
client processes in addition to the master process.  (If anyone wants
more details, please ask; a rough sketch of the server side follows.)
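
To keep it concrete, this is roughly what the server side does.  It is
a stripped-down sketch, not my actual code: the length-prefixed pickle
framing, the helper names and the port number are just stand-ins for my
ad hoc protocol, and error handling is left out entirely.

    import pickle
    import socket
    import struct
    import threading
    from queue import Queue              # Queue.Queue on Python 2

    task_queue = Queue()                  # tasks waiting for a client
    result_queue = Queue()                # results for the master thread

    def send_message(conn, obj):
        # Stand-in for my ad hoc protocol: length-prefixed pickle.
        data = pickle.dumps(obj)
        conn.sendall(struct.pack("!I", len(data)) + data)

    def recv_exact(conn, n):
        buf = b""
        while len(buf) < n:
            chunk = conn.recv(n - len(buf))
            if not chunk:
                raise socket.error("connection closed")
            buf += chunk
        return buf

    def recv_message(conn):
        (length,) = struct.unpack("!I", recv_exact(conn, 4))
        return pickle.loads(recv_exact(conn, length))

    def handle_client(conn):
        # One networking thread per connected client process.
        while True:
            task = task_queue.get()
            send_message(conn, task)              # hand the task over
            result_queue.put(recv_message(conn))  # block until it replies

    def serve(port=5000):
        srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        srv.bind(("", port))
        srv.listen(5)
        while True:
            conn, _addr = srv.accept()
            t = threading.Thread(target=handle_client, args=(conn,))
            t.daemon = True
            t.start()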

Now, I would like to use remote hosts as well, more precisely student
lab boxen, which are rather unreliable.  From experience I'd expect to
lose roughly 4-5 jobs per 100 CPU hours on average.  Thus I need some
way of detecting lost connections and requeueing unfinished tasks,
without any serious delay in the detection.  What is the best way to
do this in Python?
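
The naive approach I have in mind is a variant of handle_client from
the sketch above: set a timeout on each connection, catch socket errors
in the networking thread, and put the unfinished task back on the input
queue so another client picks it up.  The timeout value is a guess, and
I am not sure this catches every failure mode of the lab machines.

    import socket

    def handle_client(conn, timeout=600.0):
        conn.settimeout(timeout)       # give up on clients that go silent
        while True:
            task = task_queue.get()
            try:
                send_message(conn, task)
                result = recv_message(conn)
            except (socket.timeout, socket.error):
                task_queue.put(task)   # requeue the unfinished task
                conn.close()
                return                 # drop this client; thread exits
            result_queue.put(result)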

It is, of course, possible for the master thread, when processing the
results, to requeue the tasks whose results are missing, but it seems
to me a cleaner solution to detect disconnects and requeue the tasks
from the networking threads themselves.  Is that possible using
Python sockets?
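
As far as I understand, a blocked recv() only fails quickly if the
remote end closes the connection or sends a reset; a box that is simply
switched off leaves the thread hanging until the timeout above fires.
TCP keepalive might shorten that, though the fine-grained options below
are Linux-specific, so treat this as an assumption about the lab boxes.

    import socket

    def enable_keepalive(conn, idle=60, interval=10, count=3):
        # Start probing after `idle` seconds of silence, every
        # `interval` seconds, and declare the peer dead after `count`
        # failed probes, so the blocked recv() fails with an error
        # instead of hanging indefinitely.
        conn.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
        conn.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, idle)
        conn.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, interval)
        conn.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, count)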

Somebody will probably ask why I am not using one of the multiprocessing
libraries.  I have tried at least two, and got caught out by the overhead
of passing complex pickled objects back and forth.  Doing it myself has
at least helped me clarify what can be parallelised effectively.  Now
that I understand the parallelisable subproblems better, I could try
again, if I can trust that these libraries handle lost clients robustly.
I don't know whether I can.
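
For reference, the standard-library route I would presumably revisit is
a BaseManager exposing the two queues to remote hosts, roughly as
sketched below.  As far as I can tell it still leaves the lost-client
detection and requeueing to me, which is exactly the part I am asking
about.

    from multiprocessing.managers import BaseManager
    from queue import Queue

    task_queue = Queue()
    result_queue = Queue()

    class QueueManager(BaseManager):
        pass

    QueueManager.register('get_task_queue', callable=lambda: task_queue)
    QueueManager.register('get_result_queue', callable=lambda: result_queue)

    if __name__ == '__main__':
        # Remote clients register the same names without callables,
        # call connect() on a QueueManager pointed at this host, and
        # then get() tasks and put() results through the proxies.
        manager = QueueManager(address=('', 50000), authkey=b'secret')
        server = manager.get_server()
        server.serve_forever()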

Any ideas?
TIA
-- 
:-- Hans Georg


