[IPython-dev] Heartbeat Device
MinRK
benjaminrk at gmail.com
Mon Jul 12 15:51:47 EDT 2010
On Mon, Jul 12, 2010 at 09:15, Brian Granger <ellisonbg at gmail.com> wrote:
> On Fri, Jul 9, 2010 at 3:35 PM, MinRK <benjaminrk at gmail.com> wrote:
> > Brian,
> > Have you worked on the Heartbeat Device? Does that need to go in 0MQ itself,
>
> I have not. Ideally it could go into 0MQ itself. But, in principle,
> we could do it in pyzmq. We just have to write a nogil pure C
> function that uses the low-level C API to do the heartbeat. Then we
> can just run that function in a thread with a "with nogil" block.
> Shouldn't be too bad, given how simple the heartbeat logic is. The
> main thing we will have to think about is how to start/stop the
> heartbeat in a clean way.
>
> > or can it be part of pyzmq?
> > I'm trying to work out how to really tell that an engine is down.
> > Is the heartbeat to be in a separate process?
>
> No, just a separate C/C++ thread that doesn't hold the GIL.
>
> > Are we guaranteed that a zmq thread is responsive no matter what an engine
> > process is doing? If that's the case, is a moderate timeout on recv adequate
> > to determine engine failure?
>
> Yes, I think we can assume this. The only thing that would take the
> 0mq thread down is something semi-fatal like a signal that doesn't get
> handled. But as long as the 0MQ thread doesn't have any bugs, it
> should simply keep running no matter what the other thread does (OK,
> other than segfaulting)
>
> > If zmq threads are guaranteed to be responsive, it seems like a simple pair
> > socket might be good enough, rather than needing a new device. Or even
> > through the registration XREP socket.
>
> That (registration XREP socket) won't work unless we want to write all
> that logic in C.
> I don't know about a PAIR socket, because of the need for multiple clients.
>
I wasn't thinking of a single PAIR socket, but rather a pair for each
engine. We already have a pair for each engine for the queue, but I am not
quite seeing the need for a special device beyond a PAIR socket in the
heartbeat.
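
A minimal sketch of the per-engine idea, to show how a moderate recv timeout could decide liveness. It is not pyzmq code: a stdlib `socket.socketpair()` stands in for one engine's PAIR connection so the example runs without 0MQ, and the names `echo_beats` and `is_alive` are hypothetical. A plain Python thread plays the part of the GIL-free heartbeat thread.

```python
import socket
import threading

def echo_beats(sock, stop):
    """Engine side: echo every ping until `stop` is set (simulating a dead engine)."""
    sock.settimeout(0.05)
    while not stop.is_set():
        try:
            data = sock.recv(16)
        except socket.timeout:
            continue  # nothing arrived yet; check `stop` again
        if not data:
            break  # peer closed the connection
        sock.sendall(data)

def is_alive(sock, timeout=0.5):
    """Controller side: send one ping and wait up to `timeout` seconds for the echo."""
    sock.settimeout(timeout)
    try:
        sock.sendall(b"ping")
        return sock.recv(16) == b"ping"
    except (socket.timeout, OSError):
        return False

# One socket pair per engine; here a single engine is simulated.
controller_end, engine_end = socket.socketpair()
stop = threading.Event()
beater = threading.Thread(target=echo_beats, args=(engine_end, stop), daemon=True)
beater.start()

alive_before = is_alive(controller_end)  # heartbeat thread is echoing
stop.set()
beater.join()
alive_after = is_alive(controller_end)   # nobody echoes, so the recv times out

controller_end.close()
engine_end.close()
```

With one such pair per engine, the controller always knows which engine missed a beat, at the cost of one extra socket per engine.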
>
> > Can we formalize exactly what the heartbeat needs to be?
>
> OK, let's think. The engine needs to connect, the controller bind.
> It would be nice if the controller didn't need a separate heartbeat
> socket for each engine, but I guess we need the ability to track which
> specific engine is heartbeating. Also, there is the question of whether
> we want to do a request/reply or pub/sub style heartbeat. What do you
> think?
>
The way we talked about it, the heartbeat needs to issue commands both ways.
While it is used for checking whether an engine remains alive, it is also
the avenue for aborting jobs. If we do have a strict (heartbeat-only)
channel, then I think PUB/SUB is a good choice.

However, if heartbeat is all it does, then we need a _third_ connection to
each engine for control commands. Since messages cannot jump the queue, the
engine queue PAIR socket cannot be used for commands, and a PUB/SUB heartbeat,
being one-way, can _either_ deliver commands _or_ return results, not both.

control commands:
    beat (check alive)
    abort (remove a task from the queue)
    signal (SIGINT, etc.)
    exit (engine.kill)
    reset (clear queue, namespace)
    more?

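
As a rough illustration of how an engine might dispatch the commands listed above, here is a sketch in plain Python. Everything in it is an assumption made for the example: the `EngineState` class, the `handle_command` function, and the reply strings are hypothetical, and signals are only recorded rather than delivered.

```python
from collections import deque

class EngineState:
    """Hypothetical stand-in for the state an engine would hold."""
    def __init__(self):
        self.queue = deque()   # pending task ids
        self.namespace = {}    # user namespace
        self.running = True
        self.signals = []      # signals are recorded here, not actually raised

def handle_command(engine, name, arg=None):
    """Apply one control command and return a short reply string."""
    if name == "beat":
        return "alive"
    if name == "abort":
        try:
            engine.queue.remove(arg)  # drop the task if it is still queued
            return "aborted"
        except ValueError:
            return "unknown task"
    if name == "signal":
        engine.signals.append(arg)
        return "signaled"
    if name == "exit":
        engine.running = False
        return "exiting"
    if name == "reset":
        engine.queue.clear()
        engine.namespace.clear()
        return "reset"
    return "unknown command"

engine = EngineState()
engine.queue.extend(["t1", "t2", "t3"])
engine.namespace["x"] = 1

replies = [
    handle_command(engine, "beat"),
    handle_command(engine, "abort", "t2"),
    handle_command(engine, "abort", "t9"),
    handle_command(engine, "reset"),
    handle_command(engine, "exit"),
]
```

Because these commands arrive on their own channel, an abort can reach the engine while earlier tasks are still waiting in the job queue.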
It's possible that we could implement these with a PUB on the controller and
a SUB on each engine, only interpreting results received via the queue's
PAIR socket. But then every command would be sent to every engine, even
though many would only be meant for one (too inefficient/costly?). It would
however make the actual heartbeat command very simple as a single send.
It does not allow for the engine to initiate queries of the controller, for
instance a work stealing implementation. Again, it is possible that this
could be implemented via the job queue PAIR socket, but that would only
allow for stealing when completely starved for work, since the job queue and
communication queue would be the same.
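
To put a rough number on the broadcast cost of the controller-PUB / engine-SUB layout, here is a small simulation in plain Python. It models only the prefix matching that SUB sockets use for subscriptions. It assumes, pessimistically, that every subscriber receives a copy of every message before filtering, and the engine names and message layout are made up for the example.

```python
def matches(message, subscriptions):
    """SUB-style filtering: accept a message if it starts with any subscribed prefix."""
    return any(message.startswith(prefix) for prefix in subscriptions)

# Hypothetical engines; each subscribes to a broadcast prefix and to its own name.
# The trailing space keeps "engine-1 " from matching "engine-10 ...".
engines = {}
for i in range(4):
    name = "engine-%d" % i
    engines[name] = [b"all ", (name + " ").encode()]

def publish(message):
    """Return (copies sent, engines that act on it) for one published message."""
    acting = [name for name, subs in engines.items() if matches(message, subs)]
    return len(engines), acting  # assumption: one copy goes to every subscriber

beat_sent, beat_acting = publish(b"all beat")
abort_sent, abort_acting = publish(b"engine-2 abort task-7")
```

In this model the heartbeat is a single send that every engine acts on, while a targeted abort still costs one copy per engine even though only one engine uses it.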
There's also the issue of task dependency.
If we are to implement dependency checking as we discussed (depend on
taskIDs, and only execute once those tasks have been completed), the engine
needs to be able to query the controller about the tasks depended upon. That
rules out a design where the controller is only the PUB side.
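
A sketch of the query an engine would need to make for dependency checking, in plain Python with no sockets involved. The `Controller` class, its `query` method, and `can_run` are hypothetical names. The point is only that the request originates on the engine and needs a reply, which a one-way PUB channel cannot carry.

```python
class Controller:
    """Hypothetical controller that remembers which task ids have completed."""
    def __init__(self):
        self.completed = set()

    def mark_done(self, task_id):
        self.completed.add(task_id)

    def query(self, task_ids):
        """Reply to an engine's query: map each task id to whether it has completed."""
        return {tid: tid in self.completed for tid in task_ids}

def can_run(controller, depends_on):
    """Engine side: a task may run only once every task it depends on is done."""
    return all(controller.query(depends_on).values())

controller = Controller()
controller.mark_done("t1")
ready_partial = can_run(controller, ["t1", "t2"])  # t2 has not completed yet
controller.mark_done("t2")
ready_full = can_run(controller, ["t1", "t2"])
ready_nodeps = can_run(controller, [])             # no dependencies at all
```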
This says to me that we need two-way connections between the engines and the
controller. That can either be implemented as multiple connections (PUB/SUB
+ PAIR or REQ/REP), or simply a PAIR socket for each engine could provide
the whole heartbeat/command channel.
-MinRK
>
> Brian
>
>
> > -MinRK
>
>
>
> --
> Brian E. Granger, Ph.D.
> Assistant Professor of Physics
> Cal Poly State University, San Luis Obispo
> bgranger at calpoly.edu
> ellisonbg at gmail.com
>