[IPython-dev] Parallel computing segfault behavior

Drain, Theodore R (392P) theodore.r.drain at jpl.nasa.gov
Wed Jan 29 12:48:48 EST 2014


I'd be interested in an automatic restart capability as well. We have some very long running jobs where the loss of one or more engines might be a problem. Could you outline what you mean by an "extra watcher"? Is that just a Client object that polls the engine ids to see if they change (I assume UUIDs would be needed, not the simple integer ids)?

Thanks,
Ted

________________________________
From: ipython-dev-bounces at scipy.org [ipython-dev-bounces at scipy.org] on behalf of Min RK [benjaminrk at gmail.com]
Sent: Wednesday, January 29, 2014 9:29 AM
To: IPython developers list
Cc: IPython developers list
Subject: Re: [IPython-dev] Parallel computing segfault behavior



On Jan 29, 2014, at 5:56, Patrick Fuller <patrickfuller at gmail.com> wrote:

Thanks for that code! It's good to know that the remaining cores are still working and that the results are all recoverable.

One last question: each segfault takes an engine offline, which means that the cluster slows down and eventually crashes as the number of segfaults approaches the number of ipengines. Should the controller instead start new engines to take the place of killed ones?

No, engine management is done by the user at this point; the controller never starts engines. If you want to monitor the cluster and bring up replacement engines, that is not hard to do with an extra watcher (or by starting engines with supervisord, etc.).
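A minimal sketch of such a watcher, assuming the ipengine command is on the PATH and that you want to keep a fixed number of engines alive (the profile name and engine count here are illustrative, not from this thread):

import subprocess
import time
from IPython.parallel import Client

TARGET_ENGINES = 8  # illustrative: how many engines you want alive
PROFILE = "something-parallel-here"

client = Client(profile=PROFILE)
while True:
    # client.ids only lists currently registered (live) engines
    missing = TARGET_ENGINES - len(client.ids)
    for _ in range(missing):
        # launch a replacement; it registers itself with the controller
        subprocess.Popen(["ipengine", "--profile=%s" % PROFILE])
    time.sleep(10)

Since replacement engines register with new integer ids, polling the count rather than tracking UUIDs should be enough to notice losses.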


Thanks,
Pat

On Tuesday, January 28, 2014, MinRK <benjaminrk at gmail.com> wrote:


On Tue, Jan 28, 2014 at 5:04 PM, Patrick Fuller <patrickfuller at gmail.com> wrote:
...the difference being that this would require starting a new engine on each segfault


On Tuesday, January 28, 2014, Patrick Fuller <patrickfuller at gmail.com> wrote:
I guess my question is more along the lines of: should the cluster continue to complete the queued jobs (as it would if the segfaults were instead Python exceptions)?

I see what you mean - the generator halts when it sees an exception, so it's inconvenient to get the successes while ignoring the failures. I guess we could add separate methods that iterate through only the successful results.

As far as task submission goes, it does indeed do what you seem to expect, so the issue is only in viewing the results.

Here is an example of iterating through only the successful results of a map that segfaults: http://nbviewer.ipython.org/gist/minrk/8680688
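A rough equivalent of that notebook (not its exact code) is to submit each input with apply_async, so every result can be fetched, and skipped, independently. segfaulty_function is the function from Pat's original message quoted below, and EngineError is the exception mentioned further down, assumed here to live in IPython.parallel.error:

from random import random
from IPython.parallel import Client
from IPython.parallel.error import EngineError  # raised when an engine dies mid-task

view = Client(profile="something-parallel-here").load_balanced_view()
async_results = [view.apply_async(segfaulty_function, random())
                 for _ in range(100)]

for i, ar in enumerate(async_results):
    try:
        print i, ar.get()
    except EngineError:
        print i, "lost its engine; skipping"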

-MinRK


On Tuesday, January 28, 2014, MinRK <benjaminrk at gmail.com> wrote:
I get an EngineError when an engine dies running a task:

http://nbviewer.ipython.org/gist/minrk/8679553

I think this is the desired behavior.


On Tue, Jan 28, 2014 at 2:18 PM, Patrick Fuller <patrickfuller at gmail.com> wrote:

Hi,

Has there been any discussion about how IPython.parallel handles segfaults?

To make this question more specific, the following code will cause some workers to crash. All results will become unreadable (or at least un-iterable), and future runs require a restart of the cluster. Is this behavior intended, or is it just something that hasn’t been discussed?

from IPython.parallel import Client
from random import random

def segfaulty_function(random_number, chance=0.25):
    if random_number < chance:
        # Deliberately crash the engine: scribble past the end of a
        # one-character buffer until the process segfaults.
        import ctypes
        i = ctypes.c_char('a')
        j = ctypes.pointer(i)
        c = 0
        while True:
            j[c] = 'a'
            c += 1
    else:
        return random_number

view = Client(profile="something-parallel-here").load_balanced_view()
results = view.map(segfaulty_function, [random() for _ in range(100)])

for i, result in enumerate(results):
    print i, result

Backstory: Recently I've been working with a large Monte Carlo library that segfaults seemingly at random. It's due to some weird underlying random number issue and happens once every 5-10 thousand runs. I currently have each worker spin out a child process to isolate the occasional segfault, but this seems excessive. (I'm also trying to fix the source of the segfaults, but debugging is a slow process.)
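That child-process isolation can be sketched roughly like this (an assumed approach, not Pat's actual code; it relies on Unix fork semantics so the risky call runs in a disposable multiprocessing child):

import multiprocessing

def run_isolated(func, *args):
    # Run func(*args) in a child process so a segfault only kills the child.
    queue = multiprocessing.Queue()
    proc = multiprocessing.Process(target=lambda: queue.put(func(*args)))
    proc.start()
    proc.join()
    if proc.exitcode != 0:
        # A negative exit code means the child died on a signal (e.g. SIGSEGV).
        raise RuntimeError("child exited with code %r" % proc.exitcode)
    return queue.get()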

Thanks,
Pat


_______________________________________________
IPython-dev mailing list
IPython-dev at scipy.org
http://mail.scipy.org/mailman/listinfo/ipython-dev

