Rockstar on multiple nodes
Hi Peter and yt developers,

(Peter - I am copying you on this message that is also going to the yt developers email list. I believe if you reply-all the yt-dev copy of the message will bounce back to you, or be held for moderation, if you're not subscribed to yt-dev.)

I am having trouble running Rockstar within yt on more than one node. For example, I am successful in running with 12 tasks total on one node, but running 6 tasks each on two nodes does not work. By "not working" I mean that it prints a few of these messages (one per PID, though not from all PIDs; the count appears to equal NUM_WRITERS)

"[Warning] Network IO Failure (PID NNNNN):"

with various explanations ("Connection reset by peer", "Address already in use", "Broken pipe"). After that it prints

"[Network] Packet send retry count at: 1"

And then it hangs.

I have turned on DEBUG_RSOCKET and I can see that tasks on both nodes are communicating with the server, and I also see "Accepted all reader / writer connections." and "Verified all reader / writer connections." The process gets as far as "Analyzing for halos / subhalos..." but it does not make it to "Constructing merger tree..." I am running on a single snapshot.

I have tracked down the hang to the call to accept() in _accept_connection() (in socket.c). It looks like that call is reached from repair_connection() (in rsocket.c). If I'm interpreting things correctly (please correct me if I'm not), the other half of repair_connection() is a matching call to _reconnect_to_addr() made by a different task. It looks to me like that matching call is never made by the other task, and that's where the hang is happening.

I have done a test with stand-alone Rockstar on the same machine, and I am successful in running it on 2 nodes. I think this means that the problem is some weirdness with the communication when running Rockstar as a library in yt/Python, and not with the machine's network.

I'm wondering if anyone has been successful running Rockstar in yt on more than one node? Also, does anyone have any intuition for what might be going wrong here?

Thanks!

--
Stephen Skory
s@skory.us
http://stephenskory.com/
510.621.3687 (google voice)
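(For readers following along: the failure mode described above appears to boil down to one task blocking in accept() while the peer never makes the matching connect() back. A minimal POSIX sketch of the two halves of such a connection repair is below; it is illustrative only and is not Rockstar's actual code, which lives in socket.c and rsocket.c.)

    /* Minimal sketch (illustrative only): the two halves of a connection
     * repair using plain POSIX sockets.  Rockstar's _accept_connection()
     * and _reconnect_to_addr() wrap roughly this pattern, with retries
     * and error handling on top; this is not the actual implementation. */
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>
    #include <netinet/in.h>
    #include <arpa/inet.h>
    #include <sys/socket.h>

    /* Side A: wait for the peer to re-establish the link.  If the peer
     * never connects, this blocks forever, which is the hang described. */
    int accept_half(int listen_fd) {
        struct sockaddr_in peer;
        socklen_t len = sizeof(peer);
        int fd = accept(listen_fd, (struct sockaddr *)&peer, &len); /* blocks */
        if (fd < 0) perror("accept");
        return fd;
    }

    /* Side B: the matching half, reconnecting to the advertised address. */
    int reconnect_half(const char *host, int port) {
        int fd = socket(AF_INET, SOCK_STREAM, 0);
        struct sockaddr_in addr;
        memset(&addr, 0, sizeof(addr));
        addr.sin_family = AF_INET;
        addr.sin_port = htons(port);
        inet_pton(AF_INET, host, &addr.sin_addr);
        if (connect(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0) {
            perror("connect");  /* prints the reason the reconnect failed */
            close(fd);
            return -1;
        }
        return fd;
    }

If the two halves disagree about which host and port to use, or the second half never runs at all, side A simply sits in accept(), which matches the hang described above.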
Hi Stephen,

To answer your last question first, I think there is exactly zero chance there's a problem with Rockstar itself, and I think the problem almost certainly is a result of the library wrapping. Comments below.

On Tue, Nov 27, 2012 at 3:50 PM, Stephen Skory <s@skory.us> wrote:
Hi Peter and yt developers,
(Peter - I am copying you on this message that is also going to the yt developers email list. I believe if you reply-all the yt-dev copy of the message will bounce back to you, or be held for moderation, if you're not subscribed to yt-dev.)
I am having trouble running Rockstar within yt on more than one node. For example, I am successful in running with 12 tasks total on one node, but running 6 tasks each on two nodes does not work. By "not working" I mean that it prints a few of these messages (one per PID, though not from all PIDs; the count appears to equal NUM_WRITERS)
"[Warning] Network IO Failure (PID NNNNN):"
with various explanations ("Connection reset by peer", "Address already in use", "Broken pipe"). After that it prints
This sounds to me like a problem with the forking model that's in place, but I might be wrong. The way I originally set it up, yt would guess at addresses and hosts, which get fed in after the call to _get_hosts. My recommendation would be to have all of these items printed out and then look into DEBUG_RSOCKET.
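Something along these lines, placed right after each task decides on the server host and port, would show whether the two nodes agree (the names below are placeholders for whatever the wrapper actually passes in, not Rockstar's actual globals):

    /* Hypothetical diagnostic; task_id, server_host, and server_port
     * are placeholder names, not Rockstar's actual variables. */
    fprintf(stderr, "[diag] task %d thinks the server is %s:%d\n",
            task_id, server_host, server_port);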
"[Network] Packet send retry count at: 1"
And then it hangs.
I have turned on DEBUG_RSOCKET and I can see that tasks on both nodes are communicating with the server, and I also see "Accepted all reader / writer connections." and "Verified all reader / writer connections." The process gets as far as "Analyzing for halos / subhalos..." but it does not make it to " Constructing merger tree..." I am running on a single snapshot.
Is it possible that a process has died?
I have tracked down the hang to the call to accept() in _accept_connection() (in socket.c). It looks like that call is reached from repair_connection() (in rsocket.c). If I'm interpreting things correctly (please correct me if I'm not), the other half of repair_connection() is a matching call to _reconnect_to_addr() made by a different task. It looks to me like that matching call is never made by the other task, and that's where the hang is happening.
I have done a test with stand-alone Rockstar on the same machine, and I am successful in running it on 2 nodes. I think this means that the problem is some weirdness with the communication when running Rockstar as a library in yt/Python, and not with the machine's network.
I'm wondering if anyone has been successful running Rockstar in yt on more than one node? Also, does anyone have any intuition for what might be going wrong here?
Thanks!
--
Stephen Skory
s@skory.us
http://stephenskory.com/
510.621.3687 (google voice)
Hi Matt,
This sounds to me like a problem with the forking model that's in place, but I might be wrong. The way I originally set it up, yt would guess at addresses and hosts, which get fed in after the call to _get_hosts. My recommendation would be to have all of these items printed out and then look into DEBUG_RSOCKET.
Since I'm not running it inline with Enzo (so there should be no forks made in python) and "FORK_READERS_FROM_WRITERS = 0" is set in rockstar_interface.pyx in Rockstar (this is confirmed in the auto-rockstar.cfg output file), I don't think it should be forking anywhere.
Is it possible that a process has died?
That might be it. When I run "top" on the nodes, I'm seeing "python <defunct>" a number of times equal to NUM_WRITERS.

--
Stephen Skory
s@skory.us
http://stephenskory.com/
510.621.3687 (google voice)
Hi Matt,
That might be it. When I run "top" on the nodes, I'm seeing "python <defunct>" a number of times equal to NUM_WRITERS.
I haven't figured it out, but I've learned a little. There are a few places where Rockstar calls fork(), and I've been looking at the one around line 790 of client.c (underneath "else if (!strcmp(cmd, "rock")) {"). I've added some printfs there, and what I'm seeing is that the forked processes are the ones going defunct, but they are not the PIDs that are reporting Network IO failures. It looks like none of the original unforked python tasks are going defunct before things hang. Could it be that the forked processes are quitting/finishing before they should, and that's why it's hanging? But in the words of Tina Turner, what does going from one to two nodes have to do with it?

--
Stephen Skory
s@skory.us
http://stephenskory.com/
510.621.3687 (google voice)
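(As an aside: a "<defunct>" entry in top just means a child has exited but its parent has not yet reaped it with wait()/waitpid(). A minimal sketch of that mechanism, using plain POSIX calls rather than Rockstar's actual client.c, is below; the diagnostic prints are the kind of thing described above.)

    #include <stdio.h>
    #include <unistd.h>
    #include <sys/types.h>
    #include <sys/wait.h>

    int main(void) {
        pid_t pid = fork();
        if (pid == 0) {
            /* Child: does its work and exits.  Until the parent reaps it,
             * it shows up as <defunct> (a zombie) in top/ps. */
            fprintf(stderr, "[diag] child %d started\n", (int)getpid());
            _exit(0);
        }
        fprintf(stderr, "[diag] parent %d forked child %d\n",
                (int)getpid(), (int)pid);

        /* If the parent blocks somewhere else (say, in accept()) before
         * getting here, the zombie lingers, consistent with seeing one
         * "python <defunct>" per writer while the run hangs. */
        int status;
        waitpid(pid, &status, 0);   /* reap: the <defunct> entry goes away */
        return 0;
    }

So the defunct entries on their own may just mean the forked workers finished and have not been reaped yet; the more telling symptom is still the tasks stuck in accept().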
Hi Stephen,

On Tue, Nov 27, 2012 at 7:46 PM, Stephen Skory <s@skory.us> wrote:
Hi Matt,
That might be it. When I run "top" on the nodes, I'm seeing "python <defunct>" a number of times equal to NUM_WRITERS.
I haven't figured it out, but I've learned a little. There are a few places where Rockstar calls fork(), and I've been looking at the one around line 790 of client.c (underneath "else if (!strcmp(cmd, "rock")) {"). I've added some printfs there, and what I'm seeing is that the forked processes are the ones going defunct, but they are not the PIDs that are reporting Network IO failures. It looks like none of the original unforked python tasks are going defunct before things hang. Could it be that the forked processes are quitting/finishing before they should, and that's why it's hanging? But in the words of Tina Turner, what does going from one to two nodes have to do with it?
What're the values of:

  FORK_READERS_FROM_WRITERS
  FORK_PROCESSORS_PER_MACHINE

?

-Matt
--
Stephen Skory
s@skory.us
http://stephenskory.com/
510.621.3687 (google voice)
Hi Matt,

they are:

  FORK_READERS_FROM_WRITERS = 0
  FORK_PROCESSORS_PER_MACHINE = 1

Also, I've put printfs next to all the fork() calls that I could find, and the only ones I'm seeing before the hang are at the line I mentioned earlier.

--
Stephen Skory
s@skory.us
http://stephenskory.com/
510.621.3687 (google voice)