[IPython-dev] Heartbeat Device

MinRK benjaminrk at gmail.com
Tue Jul 13 15:51:39 EDT 2010


Re: not exiting without killing:
I just needed to add thread.setDaemon(True), so the device threads do exit
properly now.

committed/pushed to git.
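
For anyone following along, here is a minimal sketch of what the daemon flag
changes, using a plain Python thread as a stand-in for the device thread.
The device_loop function is a hypothetical placeholder for the blocking
device call, not the actual pyzmq code. Setting the daemon attribute is
equivalent to the thread.setDaemon(True) call mentioned above.

```python
import threading
import time


def device_loop():
    # Hypothetical stand-in for a blocking device call that never
    # returns on its own.
    while True:
        time.sleep(0.1)


t = threading.Thread(target=device_loop)
# Mark the thread as a daemon before starting it; this has the same
# effect as the older t.setDaemon(True) spelling.
t.daemon = True
t.start()

# The interpreter does not wait for daemon threads when the main thread
# finishes, so the process can exit without being killed.
```

Without the daemon flag, the interpreter would wait on the never-ending
thread at shutdown, which matches the "have to kill it" behaviour described
in the quoted message below.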

-MinRK

On Mon, Jul 12, 2010 at 22:10, MinRK <benjaminrk at gmail.com> wrote:

>
>
> On Mon, Jul 12, 2010 at 22:04, Brian Granger <ellisonbg at gmail.com> wrote:
>
>> On Mon, Jul 12, 2010 at 9:49 PM, MinRK <benjaminrk at gmail.com> wrote:
>> >
>> >
>> > On Mon, Jul 12, 2010 at 20:43, Brian Granger <ellisonbg at gmail.com> wrote:
>> >>
>> >> Min,
>> >>
>> >> On Mon, Jul 12, 2010 at 4:10 PM, MinRK <benjaminrk at gmail.com> wrote:
>> >> > I've been thinking about this, and it seems like we can't have a
>> >> > responsive rich control connection unless it is in another process,
>> >> > like the old IPython daemon.
>> >>
>> >> I am not quite sure I follow what you mean by this.  Can you elaborate?
>> >
>> > The main advantage that we were to gain from the out-of-process ipdaemon
>> > was the ability to abort/kill (signal) blocking jobs. With 0MQ threads,
>> > the only logic we can have in a control/heartbeat thread must be
>> > implemented in GIL-free C/C++. That limits what we can do in terms of
>> > interacting with the main work thread, as I understand it.
>>
>> Yes, but I think it might be possible to spawn an external process to
>> send a signal back to the process.  But I am not sure about this.
>>
>> >>
>> >> > Pure heartbeat is easy with a C device, and we may not even
>> >> > need a new one. For instance, I added support for the builtin devices
>> >> > of zeromq to pyzmq with a few lines, and you can have simple is_alive
>> >> > style heartbeat with a FORWARDER device.
>> >>
>> >> I looked at this and it looks very nice.  I think for basic is_alive
>> >> type heartbeats this will work fine.  The only thing to be careful of
>> >> is that 0MQ sockets are not thread safe.  Thus, it would be best to
>> >> actually create the socket in the thread as well.  But we do want the
>> >> flexibility to be able to pass in sockets to the device.  We will have
>> >> to think about that issue.
>> >
>> >
>> > I wrote/pushed a basic ThreadsafeDevice, which creates/binds/connects
>> > inside the thread's run method.
>> > It adds bind_in/out, connect_in/out, and setsockopt_in/out methods which
>> > just queue up arguments to be called at the head of the run method. I
>> > added a tspong.py in the heartbeat example using it.
>>
>> Cool, I will review this and merge it into master.
>>
>>
> I'd say it's not ready for master in one particular respect: The Device
> thread doesn't respond to signals, so I have to kill it to stop it. I
> haven't yet figured out why this is happening; it might be quite simple.
>
> I'll push up some unit tests tomorrow
>
>
>
>> Cheers,
>>
>> Brian
>>
>> >>
>> >> > I pushed a basic example of this (examples/heartbeat) to my pyzmq fork.
>> >> > Running a ~3 second numpy.dot action, the heartbeat pings remain
>> >> > responsive at <1ms.
>> >>
>> >> This is great!
>> >>
>> >> Cheers,
>> >>
>> >> Brian
>> >> > -MinRK
>> >> >
>> >> > On Mon, Jul 12, 2010 at 12:51, MinRK <benjaminrk at gmail.com> wrote:
>> >> >>
>> >> >>
>> >> >> On Mon, Jul 12, 2010 at 09:15, Brian Granger <ellisonbg at gmail.com>
>> >> >> wrote:
>> >> >>>
>> >> >>> On Fri, Jul 9, 2010 at 3:35 PM, MinRK <benjaminrk at gmail.com> wrote:
>> >> >>> > Brian,
>> >> >>> > Have you worked on the Heartbeat Device? Does that need to go in 0MQ
>> >> >>> > itself,
>> >> >>>
>> >> >>> I have not.  Ideally it could go into 0MQ itself.  But, in principle,
>> >> >>> we could do it in pyzmq.  We just have to write a nogil pure C
>> >> >>> function that uses the low-level C API to do the heartbeat.  Then we
>> >> >>> can just run that function in a thread with a "with nogil" block.
>> >> >>> Shouldn't be too bad, given how simple the heartbeat logic is.  The
>> >> >>> main thing we will have to think about is how to start/stop the
>> >> >>> heartbeat in a clean way.
>> >> >>>
>> >> >>> > or can it be part of pyzmq?
>> >> >>> > I'm trying to work out how to really tell that an engine is down.
>> >> >>> > Is the heartbeat to be in a separate process?
>> >> >>>
>> >> >>> No, just a separate C/C++ thread that doesn't hold the GIL.
>> >> >>>
>> >> >>> > Are we guaranteed that a zmq thread is responsive no matter what an
>> >> >>> > engine process is doing? If that's the case, is a moderate timeout on
>> >> >>> > recv adequate to determine engine failure?
>> >> >>>
>> >> >>> Yes, I think we can assume this.  The only thing that would take the
>> >> >>> 0mq thread down is something semi-fatal like a signal that doesn't get
>> >> >>> handled.  But as long as the 0MQ thread doesn't have any bugs, it
>> >> >>> should simply keep running no matter what the other thread does (OK,
>> >> >>> other than segfaulting)
>> >> >>>
>> >> >>> > If zmq threads are guaranteed to be responsive, it seems like a simple
>> >> >>> > pair socket might be good enough, rather than needing a new device. Or
>> >> >>> > even through the registration XREP socket.
>> >> >>>
>> >> >>> That (registration XREP socket) won't work unless we want to write all
>> >> >>> that logic in C.
>> >> >>> I don't know about a PAIR socket because of the need for multiple
>> >> >>> clients?
>> >> >>
>> >> >> I wasn't thinking of a single PAIR socket, but rather a pair for each
>> >> >> engine. We already have a pair for each engine for the queue, but I am
>> >> >> not quite seeing the need for a special device beyond a PAIR socket in
>> >> >> the heartbeat.
>> >> >>
>> >> >>>
>> >> >>> > Can we formalize exactly what the heartbeat needs to be?
>> >> >>>
>> >> >>> OK, let's think.  The engine needs to connect, the controller bind.
>> >> >>> It would be nice if the controller didn't need a separate heartbeat
>> >> >>> socket for each engine, but I guess we need the ability to track which
>> >> >>> specific engine is heartbeating.   Also, there is the question of
>> >> >>> whether we want to do a request/reply or pub/sub style heartbeat.
>> >> >>> What do you think?
>> >> >>
>> >> >> The way we talked about it, the heartbeat needs to issue commands both
>> >> >> ways. While it is used for checking whether an engine remains alive, it
>> >> >> is also the avenue for aborting jobs.  If we do have a strict heartbeat,
>> >> >> then I think PUB/SUB is a good choice.
>> >> >> However, if heartbeat is all it does, then we need a _third_ connection
>> >> >> to each engine for control commands. Since messages cannot jump the
>> >> >> queue, the engine queue PAIR socket cannot be used for commands, and a
>> >> >> PUB/SUB model for heartbeat can _either_ receive commands _or_ have
>> >> >> results.
>> >> >> control commands:
>> >> >> beat (check alive)
>> >> >> abort (remove a task from the queue)
>> >> >> signal (SIGINT, etc.)
>> >> >> exit (engine.kill)
>> >> >> reset (clear queue, namespace)
>> >> >> more?
>> >> >> It's possible that we could implement these with a PUB on the controller
>> >> >> and a SUB on each engine, only interpreting results received via the
>> >> >> queue's PAIR socket. But then every command would be sent to every
>> >> >> engine, even though many would only be meant for one (too
>> >> >> inefficient/costly?). It would however make the actual heartbeat command
>> >> >> very simple as a single send.
>> >> >> It does not allow for the engine to initiate queries of the controller,
>> >> >> for instance a work stealing implementation. Again, it is possible that
>> >> >> this could be implemented via the job queue PAIR socket, but that would
>> >> >> only allow for stealing when completely starved for work, since the job
>> >> >> queue and communication queue would be the same.
>> >> >> There's also the issue of task dependency.
>> >> >> If we are to implement dependency checking as we discussed (depend on
>> >> >> taskIDs, and only execute once the task has been completed), the engine
>> >> >> needs to be able to query the controller about the tasks depended upon.
>> >> >> This makes the controller being the PUB side unworkable.
>> >> >> This says to me that we need two-way connections between the engines
>> >> >> and the controller. That can either be implemented as multiple
>> >> >> connections (PUB/SUB + PAIR or REQ/REP), or simply a PAIR socket for
>> >> >> each engine could provide the whole heartbeat/command channel.
>> >> >> -MinRK
>> >> >>
>> >> >>>
>> >> >>> Brian
>> >> >>>
>> >> >>>
>> >> >>> > -MinRK
>> >> >>>
>> >> >>>
>> >> >>>
>> >> >>> --
>> >> >>> Brian E. Granger, Ph.D.
>> >> >>> Assistant Professor of Physics
>> >> >>> Cal Poly State University, San Luis Obispo
>> >> >>> bgranger at calpoly.edu
>> >> >>> ellisonbg at gmail.com
>> >> >>
>> >> >
>> >> >
>> >>
>> >>
>> >>
>> >
>> >
>>
>>
>>
>>
>
>