[IPython-dev] Parallel computing segfault behavior

Min RK benjaminrk at gmail.com
Wed Jan 29 12:29:47 EST 2014



> On Jan 29, 2014, at 5:56, Patrick Fuller <patrickfuller at gmail.com> wrote:
> 
> Thanks for that code! It's good to know that the remaining cores are still working and that the results are all recoverable.
> 
> One last question: each segfault offlines an engine, which means that the cluster slows down and eventually crashes as the number of segfaults approaches the number of ipengines. Should the controller instead start new engines to take the place of killed ones?

No. Engine management is left to the user at this point; the controller never starts an engine. If you want to monitor the cluster and bring up replacement engines, that is not hard to do with an extra watcher process (or by starting the engines under supervisord, etc.).
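For instance, a minimal sketch of such a watcher (not part of IPython; `engines_to_start` and `watch` are hypothetical helper names, while `ipengine --profile` and `client.ids` are the standard command line and API):

```python
import subprocess
import time

def engines_to_start(target, alive):
    """How many replacement engines to launch to get back to `target`."""
    return max(0, target - alive)

def watch(profile, target, interval=10):
    """Poll the cluster and launch replacement ipengines as needed.

    Requires a running controller for the given profile; client.ids
    lists the currently registered engines.
    """
    from IPython.parallel import Client
    client = Client(profile=profile)
    while True:
        missing = engines_to_start(target, len(client.ids))
        for _ in range(missing):
            subprocess.Popen(["ipengine", "--profile", profile])
        time.sleep(interval)
```

Newly launched engines take a few seconds to register, so the polling interval should be long enough that the watcher does not over-provision.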

> 
> Thanks,
> Pat
> 
>> On Tuesday, January 28, 2014, MinRK <benjaminrk at gmail.com> wrote:
>> 
>> 
>>> On Tue, Jan 28, 2014 at 5:04 PM, Patrick Fuller <patrickfuller at gmail.com> wrote:
>>> ...the difference being that this would require starting a new engine on each segfault
>>> 
>>> 
>>>> On Tuesday, January 28, 2014, Patrick Fuller <patrickfuller at gmail.com> wrote:
>>>> I guess my question is more along the lines of: should the cluster continue on to complete the queued jobs (as it would if the segfaults were instead python exceptions)? 
>> 
>> I see what you mean - the generator halts when it sees an exception, so it's inconvenient to get the successes while ignoring the failures. I guess we could add separate methods that iterate through only the successful results.
>> 
>> As far as task submission goes, it does indeed do what you seem to expect, so the issue is only in viewing the results.
>> 
>> Here is an example of iterating through only the successful results of a map that segfaults.
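The linked notebook has the full example; the core of such a helper can be sketched as follows (an illustration, not IPython API, assuming only that the result object supports `len()` and per-item indexing that re-raises each task's exception, as IPython.parallel's AsyncMapResult does):

```python
def successful_results(amr):
    """Yield only the results that completed without raising.

    `amr` is anything indexable whose __getitem__ re-raises the
    corresponding task's exception (e.g. an AsyncMapResult).
    """
    for i in range(len(amr)):
        try:
            yield amr[i]
        except Exception:
            # the task raised, or its engine died; skip this entry
            continue
```

Unlike iterating the AsyncMapResult directly, this skips over failed tasks instead of halting at the first error.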
>> 
>> -MinRK
>>  
>> 
>> On Tuesday, January 28, 2014, MinRK <benjaminrk at gmail.com> wrote:
>> I get an EngineError when an engine dies running a task:
>> 
>> http://nbviewer.ipython.org/gist/minrk/8679553
>> 
>> I think this is the desired behavior.
>> 
>> 
>> On Tue, Jan 28, 2014 at 2:18 PM, Patrick Fuller <patrickfuller at gmail.com> wrote:
>> Hi,
>> 
>> Has there been any discussion around how ipython parallel handles segfaulting?
>> 
>> To make this question more specific, the following code will cause some workers to crash. All results will become unreadable (or at least un-iterable), and future runs require a restart of the cluster. Is this behavior intended, or is it just something that hasn’t been discussed?
>> 
>> from IPython.parallel import Client
>> from random import random
>> 
>> def segfaulty_function(random_number, chance=0.25):
>>     if random_number < chance:
>>         # Write past the end of a one-byte buffer until the
>>         # process segfaults; this loop never returns.
>>         import ctypes
>>         i = ctypes.c_char('a')
>>         j = ctypes.pointer(i)
>>         c = 0
>>         while True:
>>             j[c] = 'a'
>>             c += 1
>>     else:
>>         return random_number
>> 
>> view = Client(profile="something-parallel-here").load_balanced_view()
>> results = view.map(segfaulty_function, [random() for _ in range(100)])
>> 
>> for i, result in enumerate(results):
>>     print i, result
>> 
>> Backstory: recently I’ve been working with a large Monte Carlo library that segfaults seemingly at random, roughly once every 5-10 thousand runs, due to some obscure underlying random-number issue. I currently have each worker spin off a child process to isolate the occasional segfault, but this seems excessive. (I'm also trying to fix the source of the segfaults, but debugging is a slow process.)
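The child-process isolation described above can be sketched like this (a hypothetical helper, not from the thread; it assumes a POSIX system, since the "fork" start method lets the child call a function without pickling it):

```python
import multiprocessing as mp

# POSIX-only sketch: "fork" avoids pickling the target function.
_ctx = mp.get_context("fork")

def _call(conn, fn, args):
    conn.send(fn(*args))

def run_isolated(fn, *args, timeout=60):
    """Run fn(*args) in a child process so a segfault only kills the child.

    Returns (True, result) on success, or (False, exitcode) if the child
    died; a negative exitcode is the killing signal (e.g. -11 for SIGSEGV).
    """
    parent_end, child_end = _ctx.Pipe()
    proc = _ctx.Process(target=_call, args=(child_end, fn, args))
    proc.start()
    proc.join(timeout)
    if proc.exitcode == 0 and parent_end.poll():
        return True, parent_end.recv()
    if proc.is_alive():
        proc.terminate()
    return False, proc.exitcode
```

The per-call fork is what makes this feel excessive, but it does confine the crash: the parent worker survives and can report the failure instead of taking the engine down with it.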
>> 
>> Thanks,
>> Pat
>> 
>> 
>> 
> _______________________________________________
> IPython-dev mailing list
> IPython-dev at scipy.org
> http://mail.scipy.org/mailman/listinfo/ipython-dev

