[IPython-dev] Heartbeat Device
ellisonbg at gmail.com
Mon Jul 12 23:43:29 EDT 2010
On Mon, Jul 12, 2010 at 4:10 PM, MinRK <benjaminrk at gmail.com> wrote:
> I've been thinking about this, and it seems like we can't have a responsive
> rich control connection unless it is in another process, like the old
> IPython daemon.
I am not quite sure I follow what you mean by this. Can you elaborate?
> Pure heartbeat is easy with a C device, and we may not even
> need a new one. For instance, I added support for the builtin devices of
> zeromq to pyzmq with a few lines, and you can have simple is_alive style
> heartbeat with a FORWARDER device.
I looked at this and it looks very nice. I think for basic is_alive
type heartbeats this will work fine. The only thing to be careful of
is that 0MQ sockets are not thread safe. Thus, it would be best to
actually create the socket in the thread as well. But we do want the
flexibility to be able to pass in sockets to the device. We will have
to think about that issue.
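To make the thread-safety point concrete, here is a minimal sketch of creating the socket inside the thread that uses it, so it is never shared across threads. This is only an illustration under stated assumptions: it uses plain UDP sockets from the Python standard library as a stand-in, not 0MQ sockets or the pyzmq device API, and the names heartbeat_echo and is_alive are hypothetical.

```python
import queue
import socket
import threading


def heartbeat_echo(port_box: queue.Queue, stop: threading.Event) -> None:
    """Echo every ping back to its sender until `stop` is set."""
    # The socket is created here, in the thread that will use it,
    # and it never leaves this function.
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.bind(("127.0.0.1", 0))        # let the OS pick a free port
        sock.settimeout(0.1)               # wake up to check the stop flag
        port_box.put(sock.getsockname()[1])
        while not stop.is_set():
            try:
                data, addr = sock.recvfrom(64)
            except socket.timeout:
                continue
            sock.sendto(data, addr)


def is_alive(port: int, timeout: float = 1.0) -> bool:
    """Send one ping and report whether the same bytes come back in time."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        try:
            sock.sendto(b"ping", ("127.0.0.1", port))
            data, _ = sock.recvfrom(64)
        except OSError:                    # includes socket.timeout
            return False
        return data == b"ping"
```

The only thing passed into the thread is a queue used to report the chosen port back to the caller. Passing in an already-created socket, as discussed above, would need some other way to guarantee that only one thread ever touches it.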
> I pushed a basic example of this (examples/heartbeat) to my pyzmq fork.
> Running a ~3 second numpy.dot action, the heartbeat pings remain responsive
> at <1ms.
This is great!
> On Mon, Jul 12, 2010 at 12:51, MinRK <benjaminrk at gmail.com> wrote:
>> On Mon, Jul 12, 2010 at 09:15, Brian Granger <ellisonbg at gmail.com> wrote:
>>> On Fri, Jul 9, 2010 at 3:35 PM, MinRK <benjaminrk at gmail.com> wrote:
>>> > Brian,
>>> > Have you worked on the Heartbeat Device? Does that need to go in 0MQ
>>> > itself,
>>> I have not. Ideally it could go into 0MQ itself. But, in principle,
>>> we could do it in pyzmq. We just have to write a nogil pure C
>>> function that uses the low-level C API to do the heartbeat. Then we
>>> can just run that function in a thread with a "with nogil" block.
>>> Shouldn't be too bad, given how simple the heartbeat logic is. The
>>> main thing we will have to think about is how to start/stop the
>>> heartbeat in a clean way.
>>> > or can it be part of pyzmq?
>>> > I'm trying to work out how to really tell that an engine is down.
>>> > Is the heartbeat to be in a separate process?
>>> No, just a separate C/C++ thread that doesn't hold the GIL.
>>> > Are we guaranteed that a zmq thread is responsive no matter what an
>>> > engine process is doing? If that's the case, is a moderate timeout on
>>> > recv adequate to determine engine failure?
>>> Yes, I think we can assume this. The only thing that would take the
>>> 0MQ thread down is something semi-fatal like a signal that doesn't get
>>> handled. But as long as the 0MQ thread doesn't have any bugs, it
>>> should simply keep running no matter what the other thread does (OK,
>>> other than segfaulting).
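The "moderate timeout" idea in the question above can be sketched as simple bookkeeping on the controller side: record when each engine last answered, and treat any engine that has been silent for longer than the timeout as failed. This is a hedged illustration only. The HeartbeatMonitor class, its method names, and the engine IDs are hypothetical, and nothing here touches 0MQ.

```python
import time


class HeartbeatMonitor:
    """Track the last heartbeat per engine and flag the silent ones."""

    def __init__(self, timeout, clock=time.monotonic):
        self.timeout = timeout      # seconds of silence before "dead"
        self._clock = clock         # injectable so it can be tested
        self._last_seen = {}        # engine id -> time of last beat

    def beat(self, engine_id):
        """Record that a heartbeat reply arrived from this engine."""
        self._last_seen[engine_id] = self._clock()

    def dead_engines(self):
        """Return the engines whose last beat is older than the timeout."""
        now = self._clock()
        return sorted(
            engine_id
            for engine_id, seen in self._last_seen.items()
            if now - seen > self.timeout
        )
```

The timeout has to be comfortably larger than the ping interval, otherwise one delayed reply is enough to declare a healthy engine dead.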
>>> > If zmq threads are guaranteed to be responsive, it seems like a simple
>>> > pair socket might be good enough, rather than needing a new device. Or
>>> > even through the registration XREP socket.
>>> That (registration XREP socket) won't work unless we want to write all
>>> that logic in C.
>>> I don't know about a PAIR socket because of the need for multiple
>> I wasn't thinking of a single PAIR socket, but rather a pair for each
>> engine. We already have a pair for each engine for the queue, but I am not
>> quite seeing the need for a special device beyond a PAIR socket in the
>>> > Can we formalize exactly what the heartbeat needs to be?
>>> OK, let's think. The engine needs to connect, the controller bind.
>>> It would be nice if the controller didn't need a separate heartbeat
>>> socket for each engine, but I guess we need the ability to track which
>>> specific engine is heartbeating. Also, there is the question of whether
>>> we want to do a request/reply or pub/sub style heartbeat. What do you
>> The way we talked about it, the heartbeat needs to issue commands both
>> ways. While it is used for checking whether an engine remains alive, it is
>> also the avenue for aborting jobs. If we do have a strict heartbeat, then I
>> think PUB/SUB is a good choice.
>> However, if heartbeat is all it does, then we need a _third_ connection to
>> each engine for control commands. Since messages cannot jump the queue, the
>> engine queue PAIR socket cannot be used for commands, and a PUB/SUB model
>> for heartbeat can _either_ receive commands _or_ have results.
>> control commands:
>> beat (check alive)
>> abort (remove a task from the queue)
>> signal (SIGINT, etc.)
>> exit (engine.kill)
>> reset (clear queue, namespace)
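The five commands listed above could be handled on the engine side by a small dispatcher. The following is a rough sketch under assumptions, not the IPython implementation: EngineControl and the reply dictionaries are hypothetical, signals are only recorded rather than delivered, and "exit" just clears a flag instead of killing a process.

```python
from collections import deque


class EngineControl:
    """Toy engine-side handler for the control commands listed above."""

    def __init__(self):
        self.queue = deque()    # pending task IDs, oldest first
        self.namespace = {}     # user namespace that "reset" clears
        self.signals = []       # signals are recorded here, not delivered
        self.running = True

    def handle(self, command, arg=None):
        """Dispatch one command and return a small reply dict."""
        handlers = {
            "beat": self._beat,
            "abort": self._abort,
            "signal": self._signal,
            "exit": self._exit,
            "reset": self._reset,
        }
        handler = handlers.get(command)
        if handler is None:
            return {"status": "error", "reason": "unknown command"}
        return handler(arg)

    def _beat(self, arg):
        return {"status": "ok"}

    def _abort(self, task_id):
        if task_id not in self.queue:
            return {"status": "error", "reason": "no such task"}
        self.queue.remove(task_id)
        return {"status": "ok"}

    def _signal(self, signum):
        self.signals.append(signum)
        return {"status": "ok"}

    def _exit(self, arg):
        self.running = False
        return {"status": "ok"}

    def _reset(self, arg):
        self.queue.clear()
        self.namespace.clear()
        return {"status": "ok"}
```

Whichever socket type carries these commands, the handler has to run outside the thread that executes tasks, since an abort that waits behind a long-running task arrives too late.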
>> It's possible that we could implement these with a PUB on the controller
>> and a SUB on each engine, only interpreting results received via the queue's
>> PAIR socket. But then every command would be sent to every engine, even
>> though many would only be meant for one (too inefficient/costly?). It would
>> however make the actual heartbeat command very simple as a single send.
>> It does not allow for the engine to initiate queries of the controller,
>> for instance a work stealing implementation. Again, it is possible that this
>> could be implemented via the job queue PAIR socket, but that would only
>> allow for stealing when completely starved for work, since the job queue and
>> communication queue would be the same.
>> There's also the issue of task dependency.
>> If we are to implement dependency checking as we discussed (depend on
>> taskIDs, and only execute once the task has been completed), the engine
>> needs to be able to query the controller about the tasks depended upon. This
>> makes the controller being the PUB side unworkable.
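The dependency check described above amounts to a query from the engine to the controller: which of the tasks I depend on are still unfinished? A minimal sketch, assuming the controller keeps a set of completed task IDs. The Controller class, unfinished, and ready_to_run are hypothetical names, and the direct method call stands in for whatever request/reply exchange the two-way connection would carry.

```python
class Controller:
    """Toy controller that remembers which task IDs have completed."""

    def __init__(self):
        self._completed = set()

    def mark_completed(self, task_id):
        self._completed.add(task_id)

    def unfinished(self, task_ids):
        """Answer an engine's query: which of these are not yet done?"""
        return [t for t in task_ids if t not in self._completed]


def ready_to_run(controller, depends_on):
    """A task may run only once every task it depends on has completed."""
    return not controller.unfinished(depends_on)
```

The query originates on the engine and needs an answer, which is the part a controller-side PUB socket cannot provide on its own.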
>> This says to me that we need two-way connections between the engines and
>> the controller. That can either be implemented as multiple connections
>> (PUB/SUB + PAIR or REQ/REP), or simply a PAIR socket for each engine could
>> provide the whole heartbeat/command channel.
>>> > -MinRK
>>> Brian E. Granger, Ph.D.
>>> Assistant Professor of Physics
>>> Cal Poly State University, San Luis Obispo
>>> bgranger at calpoly.edu
>>> ellisonbg at gmail.com