[IPython-dev] Heartbeat Device

Mon Jul 12 19:10:55 EDT 2010

I've been thinking about this, and it seems like we can't have a responsive
rich control connection unless it is in another process, like the old
IPython daemon.  Pure heartbeat is easy with a C device, and we may not even
need a new one. For instance, I added support for the builtin devices of
zeromq to pyzmq with a few lines, and you can have simple is_alive style
heartbeat with a FORWARDER device.

I pushed a basic example of this (examples/heartbeat) to my pyzmq fork.

While a ~3 second numpy.dot call is running, the heartbeat pings remain
responsive, with replies coming back in <1ms.
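
For anyone who wants the shape of it without pulling the fork, here is a
minimal sketch of that kind of is_alive echo. It is not the code in
examples/heartbeat. It assumes a recent pyzmq (zmq.devices.ThreadDevice,
and DEALER/ROUTER, the later names for XREQ/XREP), and the function names,
the "engine-0" identity and the retry count are made up for illustration.

```python
import time

import zmq
from zmq.devices import ThreadDevice


def start_heart(ping_url, pong_url, ident):
    """Engine side: a FORWARDER device that echoes every ping back.

    The device loop runs in C without holding the GIL, so it keeps
    answering while Python code in the main thread is busy.
    """
    heart = ThreadDevice(zmq.FORWARDER, zmq.SUB, zmq.DEALER)
    heart.setsockopt_in(zmq.SUBSCRIBE, b"")      # accept every ping
    heart.setsockopt_out(zmq.IDENTITY, ident)    # lets the controller tell engines apart
    heart.setsockopt_out(zmq.LINGER, 0)
    heart.connect_in(ping_url)
    heart.connect_out(pong_url)
    heart.start()                                # daemon thread by default
    return heart


def is_alive(ping, pong, ident, timeout_ms=1000):
    """Controller side: send one ping and wait up to timeout_ms for its echo."""
    token = str(time.monotonic_ns()).encode()
    ping.send(token)
    deadline = time.monotonic() + timeout_ms / 1000.0
    while True:
        remaining_ms = int((deadline - time.monotonic()) * 1000)
        if remaining_ms <= 0 or not pong.poll(remaining_ms):
            return False
        who, echoed = pong.recv_multipart()
        if who == ident and echoed == token:
            return True


def demo():
    """Run controller and heart in one process, just to show the wiring."""
    ctx = zmq.Context.instance()
    ping = ctx.socket(zmq.PUB)
    pong = ctx.socket(zmq.ROUTER)
    ping_port = ping.bind_to_random_port("tcp://127.0.0.1")
    pong_port = pong.bind_to_random_port("tcp://127.0.0.1")
    start_heart("tcp://127.0.0.1:%i" % ping_port,
                "tcp://127.0.0.1:%i" % pong_port,
                b"engine-0")
    # The first pings can be lost while the SUB side is still connecting,
    # so retry a few times before giving up.
    alive = any(is_alive(ping, pong, b"engine-0", 200) for _ in range(25))
    ping.close(linger=0)
    pong.close(linger=0)
    return alive
```

Calling demo() should return True once the SUB side has connected. In a
real setup start_heart() would run in the engine process and the PUB and
ROUTER sockets would live in the controller.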

-MinRK

On Mon, Jul 12, 2010 at 12:51, MinRK <benjaminrk at gmail.com> wrote:

>
>
> On Mon, Jul 12, 2010 at 09:15, Brian Granger <ellisonbg at gmail.com> wrote:
>
>> On Fri, Jul 9, 2010 at 3:35 PM, MinRK <benjaminrk at gmail.com> wrote:
>> > Brian,
>> > Have you worked on the Heartbeat Device? Does that need to go in 0MQ
>> > itself,
>>
>> I have not.  Ideally it could go into 0MQ itself.  But, in principle,
>> we could do it in pyzmq.  We just have to write a nogil pure C
>> function that uses the low-level C API to do the heartbeat.  Then we
>> can just run that function in a thread with a "with nogil" block.
>> Shouldn't be too bad, given how simple the heartbeat logic is.  The
>> main thing we will have to think about is how to start/stop the
>> heartbeat in a clean way.
>>
>> > or can it be part of pyzmq?
>> > I'm trying to work out how to really tell that an engine is down.
>> > Is the heartbeat to be in a separate process?
>>
>> No, just a separate C/C++ thread that doesn't hold the GIL.
>>
>> > Are we guaranteed that a zmq thread is responsive no matter what an
>> > engine process is doing? If that's the case, is a moderate timeout on
>> > recv adequate to determine engine failure?
>>
>> Yes, I think we can assume this.  The only thing that would take the
>> 0MQ thread down is something semi-fatal like a signal that doesn't get
>> handled.  But as long as the 0MQ thread doesn't have any bugs, it
>> should simply keep running no matter what the other thread does (OK,
>> other than segfaulting).
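
A rough sketch of what "a moderate timeout on recv" could look like on the
controller side, assuming the heart thread always answers. HeartMonitor and
its method names are hypothetical, and the 3 second timeout is only a
placeholder. The clock is injectable so the logic can be checked without
sleeping.

```python
import time


class HeartMonitor:
    """Remember when each engine last answered (hypothetical helper)."""

    def __init__(self, timeout=3.0, clock=time.monotonic):
        self.timeout = timeout      # seconds of silence before an engine counts as down
        self.clock = clock
        self.last_seen = {}         # engine id -> time of last reply

    def register(self, engine_id):
        """Start tracking an engine from the moment it connects."""
        self.last_seen[engine_id] = self.clock()

    def beat(self, engine_id):
        """Record a heartbeat reply from an engine."""
        self.last_seen[engine_id] = self.clock()

    def dead_engines(self):
        """Return the ids of engines that have been silent for too long."""
        now = self.clock()
        return sorted(e for e, t in self.last_seen.items()
                      if now - t > self.timeout)
```

The controller would call beat() for every reply it receives and check
dead_engines() after each round of pings.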
>>
>> > If zmq threads are guaranteed to be responsive, it seems like a simple
>> > pair socket might be good enough, rather than needing a new device. Or
>> > even through the registration XREP socket.
>>
>> That (registration XREP socket) won't work unless we want to write all
>> that logic in C.
>> I'm not sure about a PAIR socket, given the need for multiple clients.
>>
> I wasn't thinking of a single PAIR socket, but rather a pair for each
> engine. We already have a pair for each engine for the queue, but I am not
> quite seeing the need for a special device beyond a PAIR socket in the
> heartbeat.
>
>
>>
>> > Can we formalize exactly what the heartbeat needs to be?
>>
>> OK, let's think.  The engine needs to connect, the controller bind.
>> It would be nice if the controller didn't need a separate heartbeat
>> socket for each engine, but I guess we need the ability to track which
>> specific engine is heartbeating.   Also, there is the question of
>> whether we want to do a request/reply or pub/sub style heartbeat.  What
>> do you think?
>>
> The way we talked about it, the heartbeat needs to issue commands both
> ways. While it is used for checking whether an engine remains alive, it is
> also the avenue for aborting jobs.  If we do have a strict heartbeat, then I
> think PUB/SUB is a good choice.
>
> However, if heartbeat is all it does, then we need a _third_ connection to
> each engine for control commands. Since messages cannot jump the queue, the
> engine queue PAIR socket cannot be used for commands, and a PUB/SUB
> heartbeat is one-way: it can _either_ deliver commands to the engine _or_
> carry results back to the controller, not both.
>
> control commands:
> beat (check alive)
> abort (remove a task from the queue)
> signal (SIGINT, etc.)
> exit (engine.kill)
> reset (clear queue, namespace)
>
> more?
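
One way the command list above might be handled on the engine side,
whatever socket ends up carrying it. This is only a sketch: EngineStub and
the do_* method names are hypothetical, and "signal" is merely recorded here
instead of being delivered to the process.

```python
from collections import deque


class EngineStub:
    """Toy engine that answers the control commands listed above."""

    def __init__(self):
        self.queue = deque()        # ids of tasks waiting to run
        self.namespace = {}         # user namespace
        self.signals = []           # signals requested so far (not delivered)
        self.running = True

    def handle(self, command, arg=None):
        """Dispatch one control command and return a (status, result) pair."""
        handler = getattr(self, "do_" + command, None)
        if handler is None:
            return ("error", "unknown command: %s" % command)
        return ("ok", handler(arg))

    def do_beat(self, arg):
        return "alive"

    def do_abort(self, task_id):
        """Remove a task from the queue; True if it was still pending."""
        try:
            self.queue.remove(task_id)
        except ValueError:
            return False
        return True

    def do_signal(self, signum):
        self.signals.append(signum)

    def do_exit(self, arg):
        self.running = False

    def do_reset(self, arg):
        self.queue.clear()
        self.namespace.clear()
```

For example, handle("abort", "t1") drops task "t1" if it has not started
yet, and an unknown command comes back as an ("error", ...) pair.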
>
> It's possible that we could implement these with a PUB on the controller
> and a SUB on each engine, only interpreting results received via the queue's
> PAIR socket. But then every command would be sent to every engine, even
> though many would only be meant for one (too inefficient/costly?). It would
> however make the actual heartbeat command very simple as a single send.
>
> It does not allow for the engine to initiate queries of the controller, for
> instance a work stealing implementation. Again, it is possible that this
> could be implemented via the job queue PAIR socket, but that would only
> allow for stealing when completely starved for work, since the job queue and
> communication queue would be the same.
>
> There's also the issue of task dependency.
>
> If we are to implement dependency checking as we discussed (depend on
> taskIDs, and only execute once the task has been completed), the engine
> needs to be able to query the controller about the tasks depended upon. This
> makes the controller being the PUB side unworkable.
>
> This says to me that we need two-way connections between the engines and
> the controller. That can either be implemented as multiple connections
> (PUB/SUB + PAIR or REQ/REP), or simply a PAIR socket for each engine could
> provide the whole heartbeat/command channel.
>
> -MinRK
>
>
>>
>> Brian
>>
>>
>> > -MinRK
>>
>>
>>
>> --
>> Brian E. Granger, Ph.D.
>> Assistant Professor of Physics
>> Cal Poly State University, San Luis Obispo
>> bgranger at calpoly.edu
>> ellisonbg at gmail.com
>>
>
>