[IPython-dev] Heartbeat Device

MinRK benjaminrk at gmail.com
Tue Jul 13 00:49:01 EDT 2010


On Mon, Jul 12, 2010 at 20:43, Brian Granger <ellisonbg at gmail.com> wrote:

> Min,
>
> On Mon, Jul 12, 2010 at 4:10 PM, MinRK <benjaminrk at gmail.com> wrote:
> > I've been thinking about this, and it seems like we can't have a
> > responsive rich control connection unless it is in another process,
> > like the old IPython daemon.
>
> I am not quite sure I follow what you mean by this.  Can you elaborate?
>

The main advantage that we were to gain from the out-of-process ipdaemon was
the ability to abort/kill (signal) blocking jobs. With 0MQ threads, the only
logic we can have in a control/heartbeat thread must be implemented in
GIL-free C/C++. That limits what we can do in terms of interacting with the
main work thread, as I understand it.
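
To make the out-of-process case concrete, here is a minimal sketch using only
the Python standard library. It is not the old ipdaemon code: the function
name, the 60-second sleep standing in for a blocking job, and the timings are
all assumptions chosen for illustration. It only shows that a separate process
can stop a job that is stuck in a blocking call.

```python
import subprocess
import sys
import time


def kill_blocking_job(grace=0.5, wait=10.0):
    """Start a child that blocks, then stop it from the outside.

    Hypothetical helper for illustration; returns the child's exit status.
    """
    # Stand-in for an engine stuck in a long blocking call.
    child = subprocess.Popen(
        [sys.executable, "-c", "import time; time.sleep(60)"]
    )
    time.sleep(grace)    # give the child a moment to start up
    child.terminate()    # SIGTERM on POSIX, TerminateProcess on Windows
    return child.wait(timeout=wait)


if __name__ == "__main__":
    print("child exit status:", kill_blocking_job())
```

The child never gets a chance to finish its sleep, and the parent stays in
control throughout. A GIL-free C thread inside the engine process has no
comparably simple way to interrupt the Python work thread.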


>
> > Pure heartbeat is easy with a C device, and we may not even
> > need a new one. For instance, I added support for the builtin devices of
> > zeromq to pyzmq with a few lines, and you can have simple is_alive style
> > heartbeat with a FORWARDER device.
>
> I looked at this and it looks very nice.  I think for basic is_alive
> type heartbeats this will work fine.  The only thing to be careful of
> is that 0MQ sockets are not thread safe.  Thus, it would be best to
> actually create the socket in the thread as well.  But we do want the
> flexibility to be able to pass in sockets to the device.  We will have
> to think about that issue.
>

I wrote and pushed a basic ThreadsafeDevice, which creates, binds, and
connects its sockets inside the thread's run method. It adds bind_in/out,
connect_in/out, and setsockopt_in/out methods, which just queue up arguments
to be called at the head of the run method. I added a tspong.py to the
heartbeat example that uses it.
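
The queue-then-replay idea can be sketched without any 0MQ dependency. The
code below is not the ThreadsafeDevice from the pyzmq fork: DeferredDevice,
RecordingSocket, the socket_factory and relay parameters, and the example
addresses are all hypothetical names made up for this illustration. The
stand-in socket only records calls, so the sketch runs with the standard
library alone.

```python
import threading


class DeferredDevice(threading.Thread):
    """Hypothetical sketch: queue socket setup calls, replay them in run()."""

    def __init__(self, socket_factory, relay):
        super().__init__()
        self.daemon = True
        self._socket_factory = socket_factory  # only ever called inside run()
        self._relay = relay                    # callable(in_sock, out_sock)
        self._pending = {"in": [], "out": []}  # queued (method, args) pairs

    def _defer(self, side, method, *args):
        self._pending[side].append((method, args))

    def bind_in(self, addr):
        self._defer("in", "bind", addr)

    def bind_out(self, addr):
        self._defer("out", "bind", addr)

    def connect_in(self, addr):
        self._defer("in", "connect", addr)

    def connect_out(self, addr):
        self._defer("out", "connect", addr)

    def setsockopt_in(self, opt, value):
        self._defer("in", "setsockopt", opt, value)

    def setsockopt_out(self, opt, value):
        self._defer("out", "setsockopt", opt, value)

    def run(self):
        # Sockets are created here, so they never cross a thread boundary.
        socks = {}
        for side in ("in", "out"):
            sock = socks[side] = self._socket_factory(side)
            for method, args in self._pending[side]:
                getattr(sock, method)(*args)
        self._relay(socks["in"], socks["out"])


class RecordingSocket:
    """Stand-in for a real socket: records calls and its creating thread."""

    def __init__(self, side):
        self.side = side
        self.thread = threading.current_thread()
        self.calls = []

    def bind(self, addr):
        self.calls.append(("bind", addr))

    def connect(self, addr):
        self.calls.append(("connect", addr))

    def setsockopt(self, opt, value):
        self.calls.append(("setsockopt", opt, value))


if __name__ == "__main__":
    dev = DeferredDevice(RecordingSocket, lambda i, o: print(i.calls, o.calls))
    dev.bind_in("tcp://127.0.0.1:5555")
    dev.connect_out("tcp://127.0.0.1:5556")
    dev.start()
    dev.join()
```

With real 0MQ sockets, the factory would presumably create them from a
context and the relay callable would run the device loop. The point of the
sketch is only that nothing touches a socket until run() is executing in the
device's own thread.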


>
> > I pushed a basic example of this (examples/heartbeat) to my pyzmq fork.
> > Running a ~3 second numpy.dot action, the heartbeat pings remain
> > responsive at <1ms.
>
> This is great!
>
> Cheers,
>
> Brian
> > -MinRK
> >
> > On Mon, Jul 12, 2010 at 12:51, MinRK <benjaminrk at gmail.com> wrote:
> >>
> >>
> >> On Mon, Jul 12, 2010 at 09:15, Brian Granger <ellisonbg at gmail.com> wrote:
> >>>
> >>> On Fri, Jul 9, 2010 at 3:35 PM, MinRK <benjaminrk at gmail.com> wrote:
> >>> > Brian,
> >>> > Have you worked on the Heartbeat Device? Does that need to go in 0MQ
> >>> > itself,
> >>>
> >>> I have not.  Ideally it could go into 0MQ itself.  But, in principle,
> >>> we could do it in pyzmq.  We just have to write a nogil pure C
> >>> function that uses the low-level C API to do the heartbeat.  Then we
> >>> can just run that function in a thread with a "with nogil" block.
> >>> Shouldn't be too bad, given how simple the heartbeat logic is.  The
> >>> main thing we will have to think about is how to start/stop the
> >>> heartbeat in a clean way.
> >>>
> >>> > or can it be part of pyzmq?
> >>> > I'm trying to work out how to really tell that an engine is down.
> >>> > Is the heartbeat to be in a separate process?
> >>>
> >>> No, just a separate C/C++ thread that doesn't hold the GIL.
> >>>
> >>> > Are we guaranteed that a zmq thread is responsive no matter what an
> >>> > engine process is doing? If that's the case, is a moderate timeout
> >>> > on recv adequate to determine engine failure?
> >>>
> >>> Yes, I think we can assume this.  The only thing that would take the
> >>> 0mq thread down is something semi-fatal like a signal that doesn't get
> >>> handled.  But as long as the 0MQ thread doesn't have any bugs, it
> >>> should simply keep running no matter what the other thread does (OK,
> >>> other than segfaulting)
> >>>
> >>> > If zmq threads are guaranteed to be responsive, it seems like a
> >>> > simple pair socket might be good enough, rather than needing a new
> >>> > device. Or even through the registration XREP socket.
> >>>
> >>> That (registration XREP socket) won't work unless we want to write all
> >>> that logic in C.
> >>> I don't know about a PAIR socket because of the need for multiple
> >>> clients?
> >>
> >> I wasn't thinking of a single PAIR socket, but rather a pair for each
> >> engine. We already have a pair for each engine for the queue, but I am
> >> not quite seeing the need for a special device beyond a PAIR socket in
> >> the heartbeat.
> >>
> >>>
> >>> > Can we formalize exactly what the heartbeat needs to be?
> >>>
> >>> OK, let's think.  The engine needs to connect, the controller bind.
> >>> It would be nice if the controller didn't need a separate heartbeat
> >>> socket for each engine, but I guess we need the ability to track which
> >>> specific engine is heartbeating.   Also, there is the question of
> >>> whether we want to do a request/reply or pub/sub style heartbeat.  What
> >>> do you think?
> >>
> >> The way we talked about it, the heartbeat needs to issue commands both
> >> ways. While it is used for checking whether an engine remains alive, it
> >> is also the avenue for aborting jobs.  If we do have a strict heartbeat,
> >> then I think PUB/SUB is a good choice.
> >>
> >> However, if heartbeat is all it does, then we need a _third_ connection
> >> to each engine for control commands. Since messages cannot jump the
> >> queue, the engine queue PAIR socket cannot be used for commands, and a
> >> PUB/SUB model for heartbeat can _either_ receive commands _or_ have
> >> results.
> >>
> >> control commands:
> >>   beat (check alive)
> >>   abort (remove a task from the queue)
> >>   signal (SIGINT, etc.)
> >>   exit (engine.kill)
> >>   reset (clear queue, namespace)
> >>   more?
> >>
> >> It's possible that we could implement these with a PUB on the controller
> >> and a SUB on each engine, only interpreting results received via the
> >> queue's PAIR socket. But then every command would be sent to every
> >> engine, even though many would only be meant for one (too
> >> inefficient/costly?). It would however make the actual heartbeat command
> >> very simple as a single send.
> >>
> >> It does not allow for the engine to initiate queries of the controller,
> >> for instance a work stealing implementation. Again, it is possible that
> >> this could be implemented via the job queue PAIR socket, but that would
> >> only allow for stealing when completely starved for work, since the job
> >> queue and communication queue would be the same.
> >>
> >> There's also the issue of task dependency.
> >> If we are to implement dependency checking as we discussed (depend on
> >> taskIDs, and only execute once the task has been completed), the engine
> >> needs to be able to query the controller about the tasks depended upon.
> >> This makes the controller being the PUB side unworkable.
> >>
> >> This says to me that we need two-way connections between the engines and
> >> the controller. That can either be implemented as multiple connections
> >> (PUB/SUB + PAIR or REQ/REP), or simply a PAIR socket for each engine
> >> could provide the whole heartbeat/command channel.
> >> -MinRK
> >>
> >>>
> >>> Brian
> >>>
> >>>
> >>> > -MinRK
> >>>
> >>>
> >>>
> >>> --
> >>> Brian E. Granger, Ph.D.
> >>> Assistant Professor of Physics
> >>> Cal Poly State University, San Luis Obispo
> >>> bgranger at calpoly.edu
> >>> ellisonbg at gmail.com
> >>
> >
> >
>
>
>
> --
> Brian E. Granger, Ph.D.
> Assistant Professor of Physics
> Cal Poly State University, San Luis Obispo
> bgranger at calpoly.edu
> ellisonbg at gmail.com
>