[IPython-dev] Heartbeat Device

MinRK benjaminrk at gmail.com
Tue Jul 13 01:10:01 EDT 2010


On Mon, Jul 12, 2010 at 22:04, Brian Granger <ellisonbg at gmail.com> wrote:

> On Mon, Jul 12, 2010 at 9:49 PM, MinRK <benjaminrk at gmail.com> wrote:
> >
> >
> > On Mon, Jul 12, 2010 at 20:43, Brian Granger <ellisonbg at gmail.com> wrote:
> >>
> >> Min,
> >>
> >> On Mon, Jul 12, 2010 at 4:10 PM, MinRK <benjaminrk at gmail.com> wrote:
> >> > I've been thinking about this, and it seems like we can't have a
> >> > responsive
> >> > rich control connection unless it is in another process, like the old
> >> > IPython daemon.
> >>
> >> I am not quite sure I follow what you mean by this.  Can you elaborate?
> >
> > The main advantage that we were to gain from the out-of-process ipdaemon
> > was the ability to abort/kill (signal) blocking jobs. With 0MQ threads,
> > the only logic we can have in a control/heartbeat thread must be
> > implemented in GIL-free C/C++. That limits what we can do in terms of
> > interacting with the main work thread, as I understand it.
>
> Yes, but I think it might be possible to spawn an external process to
> send a signal back to the process.  But I am not sure about this.
>
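
A rough sketch of the "external process signals us back" idea, using only the standard library. The names here (CHILD_CODE, spawn_interrupter, blocking_job_was_interrupted) are made up for illustration, and it assumes a POSIX system with the job running in the main thread. time.sleep stands in for the blocking job.

```python
import os
import signal
import subprocess
import sys
import time

# Code run by the helper process: wait briefly, then signal the given PID.
CHILD_CODE = (
    "import os, signal, sys, time\n"
    "time.sleep(float(sys.argv[2]))\n"
    "os.kill(int(sys.argv[1]), signal.SIGINT)\n"
)


def spawn_interrupter(delay):
    """Start a helper process that sends SIGINT to this process after `delay` s."""
    return subprocess.Popen(
        [sys.executable, "-c", CHILD_CODE, str(os.getpid()), str(delay)]
    )


def blocking_job_was_interrupted(delay=0.2, job_seconds=5.0):
    """Run a stand-in blocking job; return True if the helper's SIGINT cut it short."""
    # Make sure SIGINT raises KeyboardInterrupt even if it was being ignored.
    previous = signal.signal(signal.SIGINT, signal.default_int_handler)
    helper = spawn_interrupter(delay)
    try:
        time.sleep(job_seconds)  # stand-in for long-running user code
        return False
    except KeyboardInterrupt:
        return True
    finally:
        helper.wait()
        if previous is not None:
            signal.signal(signal.SIGINT, previous)
```

One caveat: Python only runs its signal handler once the interpreter regains control, so a job stuck inside a long C call would not see the KeyboardInterrupt until that call returns.
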
> >>
> >> > Pure heartbeat is easy with a C device, and we may not even
> >> > need a new one. For instance, I added support for the builtin devices
> >> > of zeromq to pyzmq with a few lines, and you can have simple is_alive
> >> > style heartbeat with a FORWARDER device.
> >>
> >> I looked at this and it looks very nice.  I think for basic is_alive
> >> type heartbeats this will work fine.  The only thing to be careful of
> >> is that 0MQ sockets are not thread safe.  Thus, it would be best to
> >> actually create the socket in the thread as well.  But we do want the
> >> flexibility to be able to pass in sockets to the device.  We will have
> >> to think about that issue.
> >
> >
> > I wrote/pushed a basic ThreadsafeDevice, which creates/binds/connects
> > inside the thread's run method.
> > It adds bind_in/out, connect_in/out, and setsockopt_in/out methods which
> > just queue up arguments to be called at the head of the run method. I
> > added a tspong.py in the heartbeat example using it.
>
> Cool, I will review this and merge it into master.
>
>
I'd say it's not ready for master in one particular respect: The Device
thread doesn't respond to signals, so I have to kill it to stop it. I
haven't yet figured out why this is happening; it might be quite simple.

I'll push up some unit tests tomorrow.
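
For anyone who has not looked at the branch, the queue-then-replay idea looks roughly like this. This is a stdlib-only sketch, not the pyzmq code: RecordingSocket and DeferredSetupThread are hypothetical names, only bind_in and connect_out are shown, and the fake socket just records which thread called it.

```python
import threading


class RecordingSocket:
    """Hypothetical stand-in for a socket; records which thread touched it."""

    def __init__(self):
        self.owner = threading.get_ident()  # thread that created the socket
        self.calls = []

    def bind(self, addr):
        self.calls.append(("bind", addr, threading.get_ident()))

    def connect(self, addr):
        self.calls.append(("connect", addr, threading.get_ident()))


class DeferredSetupThread(threading.Thread):
    """Queue bind/connect requests now; apply them at the top of run()."""

    def __init__(self, socket_factory=RecordingSocket):
        super().__init__(daemon=True)
        self._factory = socket_factory
        self._pending = []  # (side, method name, args), applied in order
        self.sockets = {}

    def bind_in(self, addr):
        self._pending.append(("in", "bind", (addr,)))

    def connect_out(self, addr):
        self._pending.append(("out", "connect", (addr,)))

    def run(self):
        # Sockets are created here, so only this thread ever touches them.
        self.sockets = {"in": self._factory(), "out": self._factory()}
        for side, method, args in self._pending:
            getattr(self.sockets[side], method)(*args)
        # A real device would now block in its forwarding loop.


dev = DeferredSetupThread()
dev.bind_in("tcp://127.0.0.1:5555")
dev.connect_out("tcp://127.0.0.1:5556")
dev.start()
dev.join()
```

The caller never holds a socket, which sidesteps the thread-safety concern, at the cost of not being able to pass in an existing socket.
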



> Cheers,
>
> Brian
>
> >>
> >> > I pushed a basic example of this (examples/heartbeat) to my pyzmq
> >> > fork. Running a ~3 second numpy.dot action, the heartbeat pings remain
> >> > responsive at <1ms.
> >>
> >> This is great!
> >>
> >> Cheers,
> >>
> >> Brian
> >> > -MinRK
> >> >
> >> > On Mon, Jul 12, 2010 at 12:51, MinRK <benjaminrk at gmail.com> wrote:
> >> >>
> >> >>
> >> >> On Mon, Jul 12, 2010 at 09:15, Brian Granger <ellisonbg at gmail.com>
> >> >> wrote:
> >> >>>
> >> >>> On Fri, Jul 9, 2010 at 3:35 PM, MinRK <benjaminrk at gmail.com> wrote:
> >> >>> > Brian,
> >> >>> > Have you worked on the Heartbeat Device? Does that need to go in
> >> >>> > 0MQ itself,
> >> >>>
> >> >>> I have not.  Ideally it could go into 0MQ itself.  But, in principle,
> >> >>> we could do it in pyzmq.  We just have to write a nogil pure C
> >> >>> function that uses the low-level C API to do the heartbeat.  Then we
> >> >>> can just run that function in a thread with a "with nogil" block.
> >> >>> Shouldn't be too bad, given how simple the heartbeat logic is.  The
> >> >>> main thing we will have to think about is how to start/stop the
> >> >>> heartbeat in a clean way.
> >> >>>
> >> >>> > or can it be part of pyzmq?
> >> >>> > I'm trying to work out how to really tell that an engine is down.
> >> >>> > Is the heartbeat to be in a separate process?
> >> >>>
> >> >>> No, just a separate C/C++ thread that doesn't hold the GIL.
> >> >>>
> >> >>> > Are we guaranteed that a zmq thread is responsive no matter what an
> >> >>> > engine process is doing? If that's the case, is a moderate timeout
> >> >>> > on recv adequate to determine engine failure?
> >> >>>
> >> >>> Yes, I think we can assume this.  The only thing that would take the
> >> >>> 0mq thread down is something semi-fatal like a signal that doesn't get
> >> >>> handled.  But as long as the 0MQ thread doesn't have any bugs, it
> >> >>> should simply keep running no matter what the other thread does (OK,
> >> >>> other than segfaulting)
> >> >>>
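
The "moderate timeout" test on the controller side could be as small as the sketch below. It is plain Python with no 0MQ in it: HeartMonitor, beat and dead_engines are illustrative names, and the 1-second default is an arbitrary assumption, not a measured value.

```python
import time


class HeartMonitor:
    """Track the last beat seen from each engine and flag the ones that go quiet."""

    def __init__(self, timeout=1.0, clock=time.monotonic):
        self.timeout = timeout  # seconds of silence before an engine counts as down
        self.clock = clock      # injectable so the logic can be tested without sleeping
        self.last_beat = {}

    def beat(self, engine_id):
        """Record that a heartbeat reply arrived from `engine_id` just now."""
        self.last_beat[engine_id] = self.clock()

    def dead_engines(self):
        """Return the engines whose last beat is older than the timeout."""
        now = self.clock()
        return sorted(e for e, t in self.last_beat.items() if now - t > self.timeout)
```

This only says an engine has been silent for too long. It cannot tell a crashed engine from a slow network, so the timeout has to be chosen with that in mind.
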
> >> >>> > If zmq threads are guaranteed to be responsive, it seems like a
> >> >>> > simple pair socket might be good enough, rather than needing a new
> >> >>> > device. Or even through the registration XREP socket.
> >> >>>
> >> >>> That (registration XREP socket) won't work unless we want to write all
> >> >>> that logic in C.
> >> >>> I don't know about a PAIR socket because of the need for multiple
> >> >>> clients?
> >> >>
> >> >> I wasn't thinking of a single PAIR socket, but rather a pair for each
> >> >> engine. We already have a pair for each engine for the queue, but I am
> >> >> not quite seeing the need for a special device beyond a PAIR socket in
> >> >> the heartbeat.
> >> >>
> >> >>>
> >> >>> > Can we formalize exactly what the heartbeat needs to be?
> >> >>>
> >> >>> OK, let's think.  The engine needs to connect, the controller bind.
> >> >>> It would be nice if the controller didn't need a separate heartbeat
> >> >>> socket for each engine, but I guess we need the ability to track which
> >> >>> specific engine is heartbeating.  Also, there is the question of
> >> >>> whether we want to do a request/reply or pub/sub style heartbeat.
> >> >>> What do you think?
> >> >>
> >> >> The way we talked about it, the heartbeat needs to issue commands both
> >> >> ways. While it is used for checking whether an engine remains alive, it
> >> >> is also the avenue for aborting jobs.  If we do have a strict heartbeat,
> >> >> then I think PUB/SUB is a good choice.
> >> >> However, if heartbeat is all it does, then we need a _third_ connection
> >> >> to each engine for control commands. Since messages cannot jump the
> >> >> queue, the engine queue PAIR socket cannot be used for commands, and a
> >> >> PUB/SUB model for heartbeat can _either_ receive commands _or_ have
> >> >> results.
> >> >> control commands:
> >> >> beat (check alive)
> >> >> abort (remove a task from the queue)
> >> >> signal (SIGINT, etc.)
> >> >> exit (engine.kill)
> >> >> reset (clear queue, namespace)
> >> >> more?
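
To make the command list above concrete, here is one possible encoding. It is only a sketch: the Control enum, the JSON wire format, and the encode/decode helpers are assumptions for illustration, and nothing in the thread settles on a message format.

```python
import json
from enum import Enum


class Control(Enum):
    """The control commands listed above, as wire-level names."""
    BEAT = "beat"      # check alive
    ABORT = "abort"    # remove a task from the queue
    SIGNAL = "signal"  # SIGINT, etc.
    EXIT = "exit"      # engine.kill
    RESET = "reset"    # clear queue, namespace


def encode(command, **params):
    """Serialize a command and its parameters to bytes for sending."""
    return json.dumps({"cmd": command.value, "params": params}).encode("utf-8")


def decode(raw):
    """Parse bytes back into (Control, params); unknown commands raise ValueError."""
    msg = json.loads(raw.decode("utf-8"))
    return Control(msg["cmd"]), msg["params"]
```

For example, decode(encode(Control.ABORT, task_id=7)) gives back (Control.ABORT, {"task_id": 7}).
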
> >> >> It's possible that we could implement these with a PUB on the
> >> >> controller and a SUB on each engine, only interpreting results received
> >> >> via the queue's PAIR socket. But then every command would be sent to
> >> >> every engine, even though many would only be meant for one (too
> >> >> inefficient/costly?). It would however make the actual heartbeat
> >> >> command very simple as a single send.
> >> >> It does not allow for the engine to initiate queries of the controller,
> >> >> for instance a work stealing implementation. Again, it is possible that
> >> >> this could be implemented via the job queue PAIR socket, but that would
> >> >> only allow for stealing when completely starved for work, since the job
> >> >> queue and communication queue would be the same.
> >> >> There's also the issue of task dependency.
> >> >> If we are to implement dependency checking as we discussed (depend on
> >> >> taskIDs, and only execute once the task has been completed), the engine
> >> >> needs to be able to query the controller about the tasks depended upon.
> >> >> This makes the controller being the PUB side unworkable.
> >> >> This says to me that we need two-way connections between the engines
> >> >> and the controller. That can either be implemented as multiple
> >> >> connections (PUB/SUB + PAIR or REQ/REP), or simply a PAIR socket for
> >> >> each engine could provide the whole heartbeat/command channel.
> >> >> -MinRK
> >> >>
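
The dependency rule described above (run a task only once the task IDs it depends on have completed) comes down to a set comparison wherever the completed set lives. A minimal sketch, with runnable_tasks as a hypothetical helper name and plain dicts/lists standing in for whatever the controller would actually track:

```python
def runnable_tasks(dependencies, completed):
    """Return the not-yet-done task IDs whose dependencies are all completed.

    dependencies: mapping of task ID -> iterable of task IDs it depends on
    completed:    iterable of task IDs that have finished
    """
    done = set(completed)
    return sorted(
        task
        for task, deps in dependencies.items()
        if task not in done and set(deps) <= done
    )
```

The point for the channel design is where `completed` lives: if only the controller knows it, the engine has to be able to ask, which is the two-way requirement argued for above.
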
> >> >>>
> >> >>> Brian
> >> >>>
> >> >>>
> >> >>> > -MinRK
> >> >>>
> >> >>>
> >> >>>
> >> >>> --
> >> >>> Brian E. Granger, Ph.D.
> >> >>> Assistant Professor of Physics
> >> >>> Cal Poly State University, San Luis Obispo
> >> >>> bgranger at calpoly.edu
> >> >>> ellisonbg at gmail.com
> >> >>
> >> >
> >> >
> >>
> >>
> >>
> >
> >
>
>
>
>