[IPython-dev] Heartbeat Device

Brian Granger ellisonbg at gmail.com
Tue Jul 13 23:57:31 EDT 2010


Nice to know that works, but I don't think that will work for devices
that use blocking recv calls.  But it may.
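
As a rough illustration of the daemon-thread behaviour in question, the
sketch below starts a daemon thread that sits in a blocking call forever.
It is stdlib-only: queue.Queue.get() is a stand-in for a blocking recv, and
start_blocked_worker and device_loop are hypothetical names, not pyzmq API.
Whether a thread blocked inside a real 0MQ recv behaves the same way is not
something this sketch demonstrates.

```python
import queue
import threading


def start_blocked_worker():
    """Start a daemon thread that blocks forever and return the Thread."""
    inbox = queue.Queue()

    def device_loop():
        # Blocks indefinitely: nothing is ever put on the queue.
        # This stands in for a blocking recv call (an assumption).
        inbox.get()

    worker = threading.Thread(target=device_loop, name="device")
    # Attribute form of thread.setDaemon(True).
    worker.daemon = True
    worker.start()
    return worker


if __name__ == "__main__":
    worker = start_blocked_worker()
    print("alive:", worker.is_alive(), "daemon:", worker.daemon)
    # The main thread ends here. Because the worker is a daemon thread,
    # the interpreter does not wait for it before exiting.
```

Run as a script, this should return to the shell without needing to be
killed, even though the worker never leaves its blocking call.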

Brian

On Tue, Jul 13, 2010 at 12:51 PM, MinRK <benjaminrk at gmail.com> wrote:
> Re: not exiting without killing:
> I just needed to add thread.setDaemon(True), so the device threads do exit
> properly now.
> committed/pushed to git.
>
> -MinRK
> On Mon, Jul 12, 2010 at 22:10, MinRK <benjaminrk at gmail.com> wrote:
>>
>>
>> On Mon, Jul 12, 2010 at 22:04, Brian Granger <ellisonbg at gmail.com> wrote:
>>>
>>> On Mon, Jul 12, 2010 at 9:49 PM, MinRK <benjaminrk at gmail.com> wrote:
>>> >
>>> >
>>> > On Mon, Jul 12, 2010 at 20:43, Brian Granger <ellisonbg at gmail.com>
>>> > wrote:
>>> >>
>>> >> Min,
>>> >>
>>> >> On Mon, Jul 12, 2010 at 4:10 PM, MinRK <benjaminrk at gmail.com> wrote:
>>> >> > I've been thinking about this, and it seems like we can't have a
>>> >> > responsive
>>> >> > rich control connection unless it is in another process, like the
>>> >> > old
>>> >> > IPython daemon.
>>> >>
>>> >> I am not quite sure I follow what you mean by this.  Can you
>>> >> elaborate?
>>> >
>>> > The main advantage that we were to gain from the out-of-process
>>> > ipdaemon was
>>> > the ability to abort/kill (signal) blocking jobs. With 0MQ threads, the
>>> > only
>>> > logic we can have in a control/heartbeat thread must be implemented in
>>> > GIL-free C/C++. That limits what we can do in terms of interacting with
>>> > the
>>> > main work thread, as I understand it.
>>>
>>> Yes, but I think it might be possible to spawn an external process to
>>> send a signal back to the process.  But I am not sure about this.
>>>
>>> >>
>>> >> > Pure heartbeat is easy with a C device, and we may not even
>>> >> > need a new one. For instance, I added support for the builtin
>>> >> > devices of
>>> >> > zeromq to pyzmq with a few lines, and you can have simple is_alive
>>> >> > style
>>> >> > heartbeat with a FORWARDER device.
>>> >>
>>> >> I looked at this and it looks very nice.  I think for basic is_alive
>>> >> type heartbeats this will work fine.  The only thing to be careful of
>>> >> is that 0MQ sockets are not thread safe.  Thus, it would be best to
>>> >> actually create the socket in the thread as well.  But we do want the
>>> >> flexibility to be able to pass in sockets to the device.  We will have
>>> >> to think about that issue.
>>> >
>>> >
>>> > I wrote/pushed a basic ThreadsafeDevice, which creates/binds/connects
>>> > inside
>>> > the thread's run method.
>>> > It adds bind_in/out, connect_in/out, and setsockopt_in/out methods
>>> > which
>>> > just queue up arguments to be called at the head of the run method. I
>>> > added
>>> > a tspong.py in the heartbeat example using it.
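
The queue-then-replay idea described above can be sketched without 0MQ at
all. This is a minimal sketch under stated assumptions, not the
ThreadsafeDevice code itself: DeferredDevice and FakeSocket are hypothetical
names, and FakeSocket only records which calls were made and on which
thread. The method names bind_in, connect_out and so on follow the ones
mentioned in the message.

```python
import threading


class FakeSocket:
    """Hypothetical stand-in for a 0MQ socket; it only records calls."""

    def __init__(self):
        self.calls = []  # (method name, args, id of the calling thread)

    def _record(self, name, args):
        self.calls.append((name, args, threading.get_ident()))

    def bind(self, addr):
        self._record("bind", (addr,))

    def connect(self, addr):
        self._record("connect", (addr,))

    def setsockopt(self, opt, value):
        self._record("setsockopt", (opt, value))


class DeferredDevice(threading.Thread):
    """Queue socket setup calls and replay them at the head of run()."""

    def __init__(self, socket_factory=FakeSocket):
        super().__init__(daemon=True)
        self._socket_factory = socket_factory
        self._pending = []  # (side, method name, args), in call order
        self.in_socket = None
        self.out_socket = None

    def _defer(self, side, method, *args):
        self._pending.append((side, method, args))

    def bind_in(self, addr):
        self._defer("in", "bind", addr)

    def bind_out(self, addr):
        self._defer("out", "bind", addr)

    def connect_in(self, addr):
        self._defer("in", "connect", addr)

    def connect_out(self, addr):
        self._defer("out", "connect", addr)

    def setsockopt_in(self, opt, value):
        self._defer("in", "setsockopt", opt, value)

    def setsockopt_out(self, opt, value):
        self._defer("out", "setsockopt", opt, value)

    def run(self):
        # Sockets are created here, inside the device thread, so they are
        # never touched by the thread that configured the device.
        self.in_socket = self._socket_factory()
        self.out_socket = self._socket_factory()
        sockets = {"in": self.in_socket, "out": self.out_socket}
        for side, method, args in self._pending:
            getattr(sockets[side], method)(*args)
        # A real device would enter its forwarding loop at this point.


if __name__ == "__main__":
    dev = DeferredDevice()
    dev.bind_in("tcp://127.0.0.1:5555")
    dev.connect_out("tcp://127.0.0.1:5556")
    dev.start()
    dev.join()
    print(dev.in_socket.calls, dev.out_socket.calls)
```

The point of the design is that the caller's thread only ever appends to a
list; every socket operation happens on the thread that created the socket.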
>>>
>>> Cool, I will review this and merge it into master.
>>>
>>
>> I'd say it's not ready for master in one particular respect: The Device
>> thread doesn't respond to signals, so I have to kill it to stop it. I
>> haven't yet figured out why this is happening; it might be quite simple.
>> I'll push up some unit tests tomorrow.
>>
>>>
>>> Cheers,
>>>
>>> Brian
>>>
>>> >>
>>> >> > I pushed a basic example of this (examples/heartbeat) to my pyzmq
>>> >> > fork.
>>> >> > Running a ~3 second numpy.dot action, the heartbeat pings remain
>>> >> > responsive
>>> >> > at <1ms.
>>> >>
>>> >> This is great!
>>> >>
>>> >> Cheers,
>>> >>
>>> >> Brian
>>> >> > -MinRK
>>> >> >
>>> >> > On Mon, Jul 12, 2010 at 12:51, MinRK <benjaminrk at gmail.com> wrote:
>>> >> >>
>>> >> >>
>>> >> >> On Mon, Jul 12, 2010 at 09:15, Brian Granger <ellisonbg at gmail.com>
>>> >> >> wrote:
>>> >> >>>
>>> >> >>> On Fri, Jul 9, 2010 at 3:35 PM, MinRK <benjaminrk at gmail.com>
>>> >> >>> wrote:
>>> >> >>> > Brian,
>>> >> >>> > Have you worked on the Heartbeat Device? Does that need to go in
>>> >> >>> > 0MQ
>>> >> >>> > itself,
>>> >> >>>
>>> >> >>> I have not.  Ideally it could go into 0MQ itself.  But, in
>>> >> >>> principle,
>>> >> >>> we could do it in pyzmq.  We just have to write a nogil pure C
>>> >> >>> function that uses the low-level C API to do the heartbeat.  Then
>>> >> >>> we
>>> >> >>> can just run that function in a thread with a "with nogil" block.
>>> >> >>> Shouldn't be too bad, given how simple the heartbeat logic is.
>>> >> >>>  The
>>> >> >>> main thing we will have to think about is how to start/stop the
>>> >> >>> heartbeat in a clean way.
>>> >> >>>
>>> >> >>> > or can it be part of pyzmq?
>>> >> >>> > I'm trying to work out how to really tell that an engine is
>>> >> >>> > down.
>>> >> >>> > Is the heartbeat to be in a separate process?
>>> >> >>>
>>> >> >>> No, just a separate C/C++ thread that doesn't hold the GIL.
>>> >> >>>
>>> >> >>> > Are we guaranteed that a zmq thread is responsive no matter what
>>> >> >>> > an
>>> >> >>> > engine
>>> >> >>> > process is doing? If that's the case, is a moderate timeout on
>>> >> >>> > recv
>>> >> >>> > adequate
>>> >> >>> > to determine engine failure?
>>> >> >>>
>>> >> >>> Yes, I think we can assume this.  The only thing that would take
>>> >> >>> the
>>> >> >>> 0mq thread down is something semi-fatal like a signal that doesn't
>>> >> >>> get
>>> >> >>> handled.  But as long as the 0MQ thread doesn't have any bugs, it
>>> >> >>> should simply keep running no matter what the other thread does
>>> >> >>> (OK,
>>> >> >>> other than segfaulting)
>>> >> >>>
>>> >> >>> > If zmq threads are guaranteed to be responsive, it seems like a
>>> >> >>> > simple
>>> >> >>> > pair
>>> >> >>> > socket might be good enough, rather than needing a new device.
>>> >> >>> > Or
>>> >> >>> > even
>>> >> >>> > through the registration XREP socket.
>>> >> >>>
>>> >> >>> That (registration XREP socket) won't work unless we want to write
>>> >> >>> all
>>> >> >>> that logic in C.
>>> >> >>> I don't know about a PAIR socket because of the need for multiple
>>> >> >>> clients?
>>> >> >>
>>> >> >> I wasn't thinking of a single PAIR socket, but rather a pair for
>>> >> >> each
>>> >> >> engine. We already have a pair for each engine for the queue, but I
>>> >> >> am
>>> >> >> not
>>> >> >> quite seeing the need for a special device beyond a PAIR socket in
>>> >> >> the
>>> >> >> heartbeat.
>>> >> >>
>>> >> >>>
>>> >> >>> > Can we formalize exactly what the heartbeat needs to be?
>>> >> >>>
>>> >> >>> OK, let's think.  The engine needs to connect, the controller
>>> >> >>> bind.
>>> >> >>> It would be nice if the controller didn't need a separate
>>> >> >>> heartbeat
>>> >> >>> socket for each engine, but I guess we need the ability to track
>>> >> >>> which
>>> >> >>> specific engine is heartbeating.   Also, there is the question of
>>> >> >>> whether
>>> >> >>> we want to do a request/reply or pub/sub style heartbeat.  What do
>>> >> >>> you
>>> >> >>> think?
>>> >> >>
>>> >> >> The way we talked about it, the heartbeat needs to issue commands
>>> >> >> both
>>> >> >> ways. While it is used for checking whether an engine remains
>>> >> >> alive, it
>>> >> >> is
>>> >> >> also the avenue for aborting jobs.  If we do have a strict
>>> >> >> heartbeat,
>>> >> >> then I
>>> >> >> think PUB/SUB is a good choice.
>>> >> >> However, if heartbeat is all it does, then we need a _third_
>>> >> >> connection
>>> >> >> to
>>> >> >> each engine for control commands. Since messages cannot jump the
>>> >> >> queue,
>>> >> >> the
>>> >> >> engine queue PAIR socket cannot be used for commands, and a PUB/SUB
>>> >> >> model
>>> >> >> for heartbeat can _either_ receive commands _or_ have results.
>>> >> >> control commands:
>>> >> >> beat (check alive)
>>> >> >> abort (remove a task from the queue)
>>> >> >> signal (SIGINT, etc.)
>>> >> >> exit (engine.kill)
>>> >> >> reset (clear queue, namespace)
>>> >> >> more?
>>> >> >> It's possible that we could implement these with a PUB on the
>>> >> >> controller
>>> >> >> and a SUB on each engine, only interpreting results received via
>>> >> >> the
>>> >> >> queue's
>>> >> >> PAIR socket. But then every command would be sent to every engine,
>>> >> >> even
>>> >> >> though many would only be meant for one (too inefficient/costly?).
>>> >> >> It
>>> >> >> would
>>> >> >> however make the actual heartbeat command very simple as a single
>>> >> >> send.
>>> >> >> It does not allow for the engine to initiate queries of the
>>> >> >> controller,
>>> >> >> for instance a work stealing implementation. Again, it is possible
>>> >> >> that
>>> >> >> this
>>> >> >> could be implemented via the job queue PAIR socket, but that would
>>> >> >> only
>>> >> >> allow for stealing when completely starved for work, since the job
>>> >> >> queue and
>>> >> >> communication queue would be the same.
>>> >> >> There's also the issue of task dependency.
>>> >> >> If we are to implement dependency checking as we discussed (depend
>>> >> >> on
>>> >> >> taskIDs, and only execute once the task has been completed), the
>>> >> >> engine
>>> >> >> needs to be able to query the controller about the tasks depended
>>> >> >> upon.
>>> >> >> This
>>> >> >> makes the controller being the PUB side unworkable.
>>> >> >> This says to me that we need two-way connections between the
>>> >> >> engines
>>> >> >> and
>>> >> >> the controller. That can either be implemented as multiple
>>> >> >> connections
>>> >> >> (PUB/SUB + PAIR or REQ/REP), or simply a PAIR socket for each
>>> >> >> engine
>>> >> >> could
>>> >> >> provide the whole heartbeat/command channel.
>>> >> >> -MinRK
>>> >> >>
>>> >> >>>
>>> >> >>> Brian
>>> >> >>>
>>> >> >>>
>>> >> >>> > -MinRK
>>> >> >>>
>>> >> >>>
>>> >> >>>
>>> >> >>> --
>>> >> >>> Brian E. Granger, Ph.D.
>>> >> >>> Assistant Professor of Physics
>>> >> >>> Cal Poly State University, San Luis Obispo
>>> >> >>> bgranger at calpoly.edu
>>> >> >>> ellisonbg at gmail.com
>>> >> >>
>>> >> >
>>> >> >
>>> >>
>>> >>
>>> >>
>>> >
>>> >
>>>
>>>
>>>
>>
>
>



-- 
Brian E. Granger, Ph.D.
Assistant Professor of Physics
Cal Poly State University, San Luis Obispo
bgranger at calpoly.edu
ellisonbg at gmail.com
