<div>Re: not exiting without killing:</div>I just needed to add thread.setDaemon(True), so the device threads do exit properly now. <div><br></div><div>committed/pushed to git.<br><br></div><div>-MinRK</div><div><br><div class="gmail_quote">
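<div>For reference, a minimal sketch of the daemon-thread fix, using only the standard library. The name device_loop is a hypothetical stand-in for the blocking device call, not the actual pyzmq code. Setting the daemon flag before start() is what lets the interpreter exit without waiting on the thread.</div>
<pre>
```python
import threading
import time


def device_loop():
    # Hypothetical stand-in for a blocking device call that never returns.
    while True:
        time.sleep(0.1)


t = threading.Thread(target=device_loop)
t.daemon = True  # same effect as the older t.setDaemon(True)
t.start()

# The main thread can finish now; the interpreter will not wait for t.
print("daemon:", t.daemon, "alive:", t.is_alive())
```
</pre>
<div>The flag has to be set before start(); changing it on a running thread raises RuntimeError.</div>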


On Mon, Jul 12, 2010 at 22:10, MinRK <span dir="ltr"><<a href="mailto:benjaminrk@gmail.com">benjaminrk@gmail.com</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex;">


<br><br><div class="gmail_quote"><div><div></div><div class="h5">On Mon, Jul 12, 2010 at 22:04, Brian Granger <span dir="ltr"><<a href="mailto:ellisonbg@gmail.com" target="_blank">ellisonbg@gmail.com</a>></span> wrote:<br>


<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

<div>On Mon, Jul 12, 2010 at 9:49 PM, MinRK <<a href="mailto:benjaminrk@gmail.com" target="_blank">benjaminrk@gmail.com</a>> wrote:<br>

><br>

><br>

> On Mon, Jul 12, 2010 at 20:43, Brian Granger <<a href="mailto:ellisonbg@gmail.com" target="_blank">ellisonbg@gmail.com</a>> wrote:<br>

>><br>

>> Min,<br>

>><br>

>> On Mon, Jul 12, 2010 at 4:10 PM, MinRK <<a href="mailto:benjaminrk@gmail.com" target="_blank">benjaminrk@gmail.com</a>> wrote:<br>

>> > I've been thinking about this, and it seems like we can't have a<br>

>> > responsive<br>

>> > rich control connection unless it is in another process, like the old<br>

>> > IPython daemon.<br>

>><br>

>> I am not quite sure I follow what you mean by this.  Can you elaborate?<br>

><br>

> The main advantage that we were to gain from the out-of-process ipdaemon was<br>

> the ability to abort/kill (signal) blocking jobs. With 0MQ threads, the only<br>

> logic we can have in a control/heartbeat thread must be implemented in<br>

> GIL-free C/C++. That limits what we can do in terms of interacting with the<br>

> main work thread, as I understand it.<br>

<br>

</div>Yes, but I think it might be possible to spawn an external process to<br>

send a signal back to the process.  But I am not sure about this.<br>

<div><br>

>><br>

>> > Pure heartbeat is easy with a C device, and we may not even<br>

>> > need a new one. For instance, I added support for the builtin devices of<br>

>> > zeromq to pyzmq with a few lines, and you can have simple is_alive style<br>

>> > heartbeat with a FORWARDER device.<br>

>><br>

>> I looked at this and it looks very nice.  I think for basic is_alive<br>

>> type heartbeats this will work fine.  The only thing to be careful of<br>

>> is that 0MQ sockets are not thread safe.  Thus, it would be best to<br>

>> actually create the socket in the thread as well.  But we do want the<br>

>> flexibility to be able to pass in sockets to the device.  We will have<br>

>> to think about that issue.<br>

><br>

><br>

> I wrote/pushed a basic ThreadsafeDevice, which creates/binds/connects inside<br>

> the thread's run method.<br>

> It adds bind_in/out, connect_in/out, and setsockopt_in/out methods which<br>

> just queue up arguments to be called at the head of the run method. I added<br>

> a tspong.py in the heartbeat example using it.<br>
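<div>A rough sketch of the queue-then-replay pattern described above, under stated assumptions: FakeSocket and DeferredDevice are hypothetical names for illustration, not the ThreadsafeDevice that was pushed, and no real 0MQ sockets are involved. The setup calls are only recorded in the calling thread and are executed at the head of run(), so the sockets are created and configured entirely inside the device thread.</div>
<pre>
```python
import threading


class FakeSocket:
    """Hypothetical stand-in for a 0MQ socket; records calls and their thread."""

    def __init__(self):
        self.calls = []

    def bind(self, addr):
        self.calls.append(("bind", addr, threading.current_thread().name))

    def connect(self, addr):
        self.calls.append(("connect", addr, threading.current_thread().name))


class DeferredDevice(threading.Thread):
    """Queues socket setup calls and replays them at the head of run()."""

    def __init__(self):
        super().__init__(name="device")
        self.daemon = True
        self._pending_in = []   # (method name, args) for the input socket
        self._pending_out = []  # (method name, args) for the output socket
        self.sock_in = None
        self.sock_out = None

    def bind_in(self, addr):
        self._pending_in.append(("bind", (addr,)))

    def connect_out(self, addr):
        self._pending_out.append(("connect", (addr,)))

    def run(self):
        # Sockets are created in this thread, so they never cross threads.
        self.sock_in = FakeSocket()
        self.sock_out = FakeSocket()
        for name, args in self._pending_in:
            getattr(self.sock_in, name)(*args)
        for name, args in self._pending_out:
            getattr(self.sock_out, name)(*args)
        # A real device would now block forwarding messages; omitted here.


dev = DeferredDevice()
dev.bind_in("tcp://127.0.0.1:5555")
dev.connect_out("tcp://127.0.0.1:5556")
dev.start()
dev.join()
print(dev.sock_in.calls, dev.sock_out.calls)
```
</pre>
<div>Each recorded call carries the name of the thread that executed it, which makes it easy to check that nothing touched the sockets from the main thread.</div>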

<br>

</div>Cool, I will review this and merge it into master.<br>

<br></blockquote><div> </div></div></div><div>I'd say it's not ready for master in one particular respect: The Device thread doesn't respond to signals, so I have to kill it to stop it. I haven't yet figured out why this is happening; it might be quite simple.</div>


<div><br></div><div>I'll push up some unit tests tomorrow</div><div><div></div><div class="h5"><div><br></div><div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">


Cheers,<br>

<font color="#888888"><br>

Brian<br>

</font><div><div></div><div><br>

>><br>

>> > I pushed a basic example of this (examples/heartbeat) to my pyzmq fork.<br>

>> > Running a ~3 second numpy.dot action, the heartbeat pings remain<br>

>> > responsive<br>

>> > at <1ms.<br>

>><br>

>> This is great!<br>

>><br>

>> Cheers,<br>

>><br>

>> Brian<br>

>> > -MinRK<br>

>> ><br>

>> > On Mon, Jul 12, 2010 at 12:51, MinRK <<a href="mailto:benjaminrk@gmail.com" target="_blank">benjaminrk@gmail.com</a>> wrote:<br>

>> >><br>

>> >><br>

>> >> On Mon, Jul 12, 2010 at 09:15, Brian Granger <<a href="mailto:ellisonbg@gmail.com" target="_blank">ellisonbg@gmail.com</a>><br>

>> >> wrote:<br>

>> >>><br>

>> >>> On Fri, Jul 9, 2010 at 3:35 PM, MinRK <<a href="mailto:benjaminrk@gmail.com" target="_blank">benjaminrk@gmail.com</a>> wrote:<br>

>> >>> > Brian,<br>

>> >>> > Have you worked on the Heartbeat Device? Does that need to go in 0MQ<br>

>> >>> > itself,<br>

>> >>><br>

>> >>> I have not.  Ideally it could go into 0MQ itself.  But, in principle,<br>

>> >>> we could do it in pyzmq.  We just have to write a nogil pure C<br>

>> >>> function that uses the low-level C API to do the heartbeat.  Then we<br>

>> >>> can just run that function in a thread with a "with nogil" block.<br>

>> >>> Shouldn't be too bad, given how simple the heartbeat logic is.  The<br>

>> >>> main thing we will have to think about is how to start/stop the<br>

>> >>> heartbeat in a clean way.<br>

>> >>><br>

>> >>> > or can it be part of pyzmq?<br>

>> >>> > I'm trying to work out how to really tell that an engine is down.<br>

>> >>> > Is the heartbeat to be in a separate process?<br>

>> >>><br>

>> >>> No, just a separate C/C++ thread that doesn't hold the GIL.<br>

>> >>><br>

>> >>> > Are we guaranteed that a zmq thread is responsive no matter what an<br>

>> >>> > engine<br>

>> >>> > process is doing? If that's the case, is a moderate timeout on recv<br>

>> >>> > adequate<br>

>> >>> > to determine engine failure?<br>

>> >>><br>

>> >>> Yes, I think we can assume this.  The only thing that would take the<br>

>> >>> 0mq thread down is something semi-fatal like a signal that doesn't get<br>

>> >>> handled.  But as long as the 0MQ thread doesn't have any bugs, it<br>

>> >>> should simply keep running no matter what the other thread does (OK,<br>

>> >>> other than segfaulting)<br>

>> >>><br>

>> >>> > If zmq threads are guaranteed to be responsive, it seems like a<br>

>> >>> > simple<br>

>> >>> > pair<br>

>> >>> > socket might be good enough, rather than needing a new device. Or<br>

>> >>> > even<br>

>> >>> > through the registration XREP socket.<br>

>> >>><br>

>> >>> That (registration XREP socket) won't work unless we want to write all<br>

>> >>> that logic in C.<br>

>> >>> I don't know about a PAIR socket because of the need for multiple<br>

>> >>> clients?<br>

>> >><br>

>> >> I wasn't thinking of a single PAIR socket, but rather a pair for each<br>

>> >> engine. We already have a pair for each engine for the queue, but I am<br>

>> >> not<br>

>> >> quite seeing the need for a special device beyond a PAIR socket in the<br>

>> >> heartbeat.<br>

>> >><br>

>> >>><br>

>> >>> > Can we formalize exactly what the heartbeat needs to be?<br>

>> >>><br>

>> >>> OK, let's think.  The engine needs to connect, the controller bind.<br>

>> >>> It would be nice if the controller didn't need a separate heartbeat<br>

>> >>> socket for each engine, but I guess we need the ability to track which<br>

>> >>> specific engine is heartbeating.  Also, there is the question of whether<br>

>> >>> we want to do a request/reply or pub/sub style heartbeat.  What do you<br>

>> >>> think?<br>

>> >><br>

>> >> The way we talked about it, the heartbeat needs to issue commands both<br>

>> >> ways. While it is used for checking whether an engine remains alive, it<br>

>> >> is<br>

>> >> also the avenue for aborting jobs.  If we do have a strict heartbeat,<br>

>> >> then I<br>

>> >> think PUB/SUB is a good choice.<br>

>> >> However, if heartbeat is all it does, then we need a _third_ connection<br>

>> >> to<br>

>> >> each engine for control commands. Since messages cannot jump the queue,<br>

>> >> the<br>

>> >> engine queue PAIR socket cannot be used for commands, and a PUB/SUB<br>

>> >> model<br>

>> >> for heartbeat can _either_ receive commands _or_ have results.<br>

>> >> control commands:<br>

>> >> beat (check alive)<br>

>> >> abort (remove a task from the queue)<br>

>> >> signal (SIGINT, etc.)<br>

>> >> exit (engine.kill)<br>

>> >> reset (clear queue, namespace)<br>

>> >> more?<br>

>> >> It's possible that we could implement these with a PUB on the<br>

>> >> controller<br>

>> >> and a SUB on each engine, only interpreting results received via the<br>

>> >> queue's<br>

>> >> PAIR socket. But then every command would be sent to every engine, even<br>

>> >> though many would only be meant for one (too inefficient/costly?). It<br>

>> >> would<br>

>> >> however make the actual heartbeat command very simple as a single send.<br>

>> >> It does not allow for the engine to initiate queries of the controller,<br>

>> >> for instance a work stealing implementation. Again, it is possible that<br>

>> >> this<br>

>> >> could be implemented via the job queue PAIR socket, but that would only<br>

>> >> allow for stealing when completely starved for work, since the job<br>

>> >> queue and<br>

>> >> communication queue would be the same.<br>

>> >> There's also the issue of task dependency.<br>

>> >> If we are to implement dependency checking as we discussed (depend on<br>

>> >> taskIDs, and only execute once the task has been completed), the engine<br>

>> >> needs to be able to query the controller about the tasks depended upon.<br>

>> >> This<br>

>> >> makes it unworkable for the controller to be the PUB side.<br>

>> >> This says to me that we need two-way connections between the engines<br>

>> >> and<br>

>> >> the controller. That can either be implemented as multiple connections<br>

>> >> (PUB/SUB + PAIR or REQ/REP), or simply a PAIR socket for each engine<br>

>> >> could<br>

>> >> provide the whole heartbeat/command channel.<br>

>> >> -MinRK<br>

>> >><br>

>> >>><br>

>> >>> Brian<br>

>> >>><br>

>> >>><br>

>> >>> > -MinRK<br>

>> >>><br>

>> >>><br>

>> >>><br>

>> >>> --<br>

>> >>> Brian E. Granger, Ph.D.<br>

>> >>> Assistant Professor of Physics<br>

>> >>> Cal Poly State University, San Luis Obispo<br>

>> >>> <a href="mailto:bgranger@calpoly.edu" target="_blank">bgranger@calpoly.edu</a><br>

>> >>> <a href="mailto:ellisonbg@gmail.com" target="_blank">ellisonbg@gmail.com</a><br>

>> >><br>

>> ><br>

>> ><br>

>><br>

>><br>

>><br>


><br>

><br>

<br>

<br>

<br>

</div></div>--<br>

<div><div></div><div>Brian E. Granger, Ph.D.<br>

Assistant Professor of Physics<br>

Cal Poly State University, San Luis Obispo<br>

<a href="mailto:bgranger@calpoly.edu" target="_blank">bgranger@calpoly.edu</a><br>

<a href="mailto:ellisonbg@gmail.com" target="_blank">ellisonbg@gmail.com</a><br>

</div></div></blockquote></div></div></div><br>

</blockquote></div><br></div>