client-server parallelised number crunching
drsalists at gmail.com
Tue Apr 26 22:31:02 CEST 2011
On Tue, Apr 26, 2011 at 12:55 PM, Hans Georg Schaathun
<georg at schaathun.net> wrote:
> I wonder if anyone has any experience with this ...
> I try to set up a simple client-server system to do some number
> crunching, using a simple ad hoc protocol over TCP/IP. I use
> two Queue objects on the server side to manage the input and the output
> of the client process. A basic system running seemingly fine on a single
> quad-core box was surprisingly simple to set up, and it seems to give
> me a reasonable speed-up of a factor of around 3-3.5 using four client
> processes in addition to the master process. (If anyone wants more
> details, please ask.)
> Now, I would like to use remote hosts as well, more precisely, student
> lab boxen which are rather unreliable. From experience I'd expect to
> lose roughly 4-5 jobs in 100 CPU hours on average. Thus I need some
> way of detecting lost connections and requeuing unfinished tasks,
> avoiding any serious delays in this detection. What is the best way to
> do this in Python?
> It is, of course, possible for the master thread upon processing the
> results, to requeue the tasks for any missing results, but it seems
> to me to be a cleaner solution if I could detect disconnects and
> requeue the tasks from the networking threads. Is that possible
> using Python sockets?
> Somebody will probably ask why I am not using one of the multiprocessing
> libraries. I have tried at least two, and got trapped by the overhead
> of passing complex pickled objects across. Doing it myself has at least
> helped me clarify what can be parallelised effectively. Now,
> understanding the parallelisable subproblems better, I could try again,
> if I can trust that these libraries can robustly handle lost clients.
> That I don't know if I can.
You should probably assign a unique identifier to each piece of work,
and implement two timeouts: one on your socket, using select or poll
or similar, and one on each piece of work, keyed by its identifier, so
that a task whose result has not come back in time can be requeued.
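
A rough sketch of the two timeouts, in case it helps. This is only an
illustration under some assumptions: it is written for Python 3 (the
module is called Queue rather than queue on Python 2), the names
TaskTracker and recv_or_timeout are made up here and do not come from
any library, and the tracker is not thread-safe as written, so it would
need a lock if several networking threads share it.

```python
import itertools
import queue
import select
import socket
import time


class TaskTracker:
    """Hypothetical helper: gives each task a unique id and a deadline."""

    def __init__(self, task_timeout):
        self.task_timeout = task_timeout   # seconds allowed per task
        self.pending = queue.Queue()       # (task_id, payload) awaiting a client
        self.in_flight = {}                # task_id -> (payload, deadline)
        self._ids = itertools.count()

    def submit(self, payload):
        """Queue a new piece of work and return its unique id."""
        task_id = next(self._ids)
        self.pending.put((task_id, payload))
        return task_id

    def checkout(self, now=None):
        """Take the next task for a client and start its deadline clock."""
        now = time.monotonic() if now is None else now
        task_id, payload = self.pending.get_nowait()
        self.in_flight[task_id] = (payload, now + self.task_timeout)
        return task_id, payload

    def complete(self, task_id):
        """Record a result; returns False if the id is not in flight
        (for example a late answer to a task that was already requeued)."""
        return self.in_flight.pop(task_id, None) is not None

    def requeue_expired(self, now=None):
        """Put every task whose deadline has passed back on the queue."""
        now = time.monotonic() if now is None else now
        expired = [tid for tid, (_, deadline) in self.in_flight.items()
                   if deadline <= now]
        for tid in expired:
            payload, _ = self.in_flight.pop(tid)
            self.pending.put((tid, payload))
        return expired


def recv_or_timeout(sock: socket.socket, timeout, bufsize=4096):
    """Socket-level timeout using select.

    Returns None if nothing arrived within `timeout` seconds, b"" if the
    peer closed the connection, and the received bytes otherwise.
    """
    readable, _, _ = select.select([sock], [], [], timeout)
    if not readable:
        return None
    return sock.recv(bufsize)
```

The networking thread would call recv_or_timeout in its loop and treat
b"" (or a socket error) as a lost client, while the master calls
requeue_expired every so often to catch clients that stay connected but
never answer. Keeping the same id when requeuing means a late duplicate
result can be recognised and dropped.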