[IPython-dev] Heartbeat Device

Tue Jul 13 01:04:24 EDT 2010

On Mon, Jul 12, 2010 at 9:49 PM, MinRK <benjaminrk at gmail.com> wrote:
>
>
> On Mon, Jul 12, 2010 at 20:43, Brian Granger <ellisonbg at gmail.com> wrote:
>>
>> Min,
>>
>> On Mon, Jul 12, 2010 at 4:10 PM, MinRK <benjaminrk at gmail.com> wrote:
>> > I've been thinking about this, and it seems like we can't have a
>> > responsive
>> > rich control connection unless it is in another process, like the old
>> > IPython daemon.
>>
>> I am not quite sure I follow what you mean by this.  Can you elaborate?
>
> The main advantage that we were to gain from the out-of-process ipdaemon was
> the ability to abort/kill (signal) blocking jobs. With 0MQ threads, the only
> logic we can have in a control/heartbeat thread must be implemented in
> GIL-free C/C++. That limits what we can do in terms of interacting with the
> main work thread, as I understand it.

Yes, but I think it might be possible to spawn an external process to
send a signal back to the process.  But I am not sure about this.

>>
>> > Pure heartbeat is easy with a C device, and we may not even
>> > need a new one. For instance, I added support for the builtin devices of
>> > zeromq to pyzmq with a few lines, and you can have simple is_alive style
>> > heartbeat with a FORWARDER device.
>>
>> I looked at this and it looks very nice.  I think for basic is_alive
>> type heartbeats this will work fine.  The only thing to be careful of
>> is that 0MQ sockets are not thread safe.  Thus, it would be best to
>> actually create the socket in the thread as well.  But we do want the
>> flexibility to be able to pass in sockets to the device.  We will have
>> to think about that issue.
>
>
> I wrote/pushed a basic ThreadsafeDevice, which creates/binds/connects inside
> the thread's run method.
> It adds bind_in/out, connect_in/out, and setsockopt_in/out methods which
> just queue up arguments to be called at the head of the run method. I added
> a tspong.py in the heartbeat example using it.

Cool, I will review this and merge it into master.

Cheers,

Brian

>>
>> > I pushed a basic example of this (examples/heartbeat) to my pyzmq fork.
>> > Running a ~3 second numpy.dot action, the heartbeat pings remain
>> > responsive
>> > at <1ms.
>>
>> This is great!
>>
>> Cheers,
>>
>> Brian
>> > -MinRK
>> >
>> > On Mon, Jul 12, 2010 at 12:51, MinRK <benjaminrk at gmail.com> wrote:
>> >>
>> >>
>> >> On Mon, Jul 12, 2010 at 09:15, Brian Granger <ellisonbg at gmail.com>
>> >> wrote:
>> >>>
>> >>> On Fri, Jul 9, 2010 at 3:35 PM, MinRK <benjaminrk at gmail.com> wrote:
>> >>> > Brian,
>> >>> > Have you worked on the Heartbeat Device? Does that need to go in 0MQ
>> >>> > itself,
>> >>>
>> >>> I have not.  Ideally it could go into 0MQ itself.  But, in principle,
>> >>> we could do it in pyzmq.  We just have to write a nogil pure C
>> >>> function that uses the low-level C API to do the heartbeat.  Then we
>> >>> can just run that function in a thread with a "with nogil" block.
>> >>> Shouldn't be too bad, given how simple the heartbeat logic is.  The
>> >>> main thing we will have to think about is how to start/stop the
>> >>> heartbeat in a clean way.
>> >>>
>> >>> > or can it be part of pyzmq?
>> >>> > I'm trying to work out how to really tell that an engine is down.
>> >>> > Is the heartbeat to be in a separate process?
>> >>>
>> >>> No, just a separate C/C++ thread that doesn't hold the GIL.
>> >>>
>> >>> > Are we guaranteed that a zmq thread is responsive no matter what an
>> >>> > engine
>> >>> > process is doing? If that's the case, is a moderate timeout on recv
>> >>> > adequate
>> >>> > to determine engine failure?
>> >>>
>> >>> Yes, I think we can assume this.  The only thing that would take the
>> >>> 0mq thread down is something semi-fatal like a signal that doesn't get
>> >>> handled.  But as long as the 0MQ thread doesn't have any bugs, it
>> >>> should simply keep running no matter what the other thread does (OK,
>> >>> other than segfaulting)
>> >>>
>> >>> > If zmq threads are guaranteed to be responsive, it seems like a
>> >>> > simple
>> >>> > pair
>> >>> > socket might be good enough, rather than needing a new device. Or
>> >>> > even
>> >>> > through the registration XREP socket.
>> >>>
>> >>> That (registration XREP socket) won't work unless we want to write all
>> >>> that logic in C.
>> >>> I don't know about a PAIR socket because of the need for multiple
>> >>> clients?
>> >>
>> >> I wasn't thinking of a single PAIR socket, but rather a pair for each
>> >> engine. We already have a pair for each engine for the queue, but I am
>> >> not
>> >> quite seeing the need for a special device beyond a PAIR socket in the
>> >> heartbeat.
>> >>
>> >>>
>> >>> > Can we formalize exactly what the heartbeat needs to be?
>> >>>
>> >>> OK, let's think.  The engine needs to connect, the controller bind.
>> >>> It would be nice if the controller didn't need a separate heartbeat
>> >>> socket for each engine, but I guess we need the ability to track which
>> >>> specific engine is heartbeating.   Also, there is the question of to
>> >>> do want to do a reqest/reply or pub/sub style heartbeat.  What do you
>> >>> think?
>> >>
>> >> The way we talked about it, the heartbeat needs to issue commands both
>> >> ways. While it is used for checking whether an engine remains alive, it
>> >> is
>> >> also the avenue for aborting jobs.  If we do have a strict heartbeat,
>> >> then I
>> >> think PUB/SUB is a good choice.
>> >> However, if heartbeat is all it does, then we need a _third_ connection
>> >> to
>> >> each engine for control commands. Since messages cannot jump the queue,
>> >> the
>> >> engine queue PAIR socket cannot be used for commands, and a PUB/SUB
>> >> model
>> >> for heartbeat can _either_ receive commands _or_ have results.
>> >> control commands:
>> >> beat (check alive)
>> >> abort (remove a task from the queue)
>> >> signal (SIGINT, etc.)
>> >> exit (engine.kill)
>> >> reset (clear queue, namespace)
>> >> more?
>> >> It's possible that we could implement these with a PUB on the
>> >> controller
>> >> and a SUB on each engine, only interpreting results received via the
>> >> queue's
>> >> PAIR socket. But then every command would be sent to every engine, even
>> >> though many would only be meant for one (too inefficient/costly?). It
>> >> would
>> >> however make the actual heartbeat command very simple as a single send.
>> >> It does not allow for the engine to initiate queries of the controller,
>> >> for instance a work stealing implementation. Again, it is possible that
>> >> this
>> >> could be implemented via the job queue PAIR socket, but that would only
>> >> allow for stealing when completely starved for work, since the job
>> >> queue and
>> >> communication queue would be the same.
>> >> There's also the issue of task dependency.
>> >> If we are to implement dependency checking as we discussed (depend on
>> >> taskIDs, and only execute once the task has been completed), the engine
>> >> needs to be able to query the controller about the tasks depended upon.
>> >> This
>> >> makes the controller being the PUB side unworkable.
>> >> This says to me that we need two-way connections between the engines
>> >> and
>> >> the controller. That can either be implemented as multiple connections
>> >> (PUB/SUB + PAIR or REQ/REP), or simply a PAIR socket for each engine
>> >> could
>> >> provide the whole heartbeat/command channel.
>> >> -MinRK
>> >>
>> >>>
>> >>> Brian
>> >>>
>> >>>
>> >>> > -MinRK
>> >>>
>> >>>
>> >>>
>> >>> --
>> >>> Brian E. Granger, Ph.D.
>> >>> Assistant Professor of Physics
>> >>> Cal Poly State University, San Luis Obispo
>> >>> bgranger at calpoly.edu
>> >>> ellisonbg at gmail.com
>> >>
>> >
>> >
>>
>>
>>
>> --
>> Brian E. Granger, Ph.D.
>> Assistant Professor of Physics
>> Cal Poly State University, San Luis Obispo
>> bgranger at calpoly.edu
>> ellisonbg at gmail.com
>
>

-- 
Brian E. Granger, Ph.D.
Assistant Professor of Physics
Cal Poly State University, San Luis Obispo
bgranger at calpoly.edu
ellisonbg at gmail.com