Hi Stephen,
To answer your last question first, I think there is exactly zero
chance there's a problem with Rockstar itself, and I think the problem
almost certainly is a result of the library wrapping. Comments below.
On Tue, Nov 27, 2012 at 3:50 PM, Stephen Skory wrote:
Hi Peter and yt developers,
(Peter - I am copying you on this message that is also going to the yt developers email list. I believe if you reply-all the yt-dev copy of the message will bounce back to you, or be held for moderation, if you're not subscribed to yt-dev.)
I am having trouble running Rockstar within yt on more than one node. For example, a run with 12 tasks total on one node succeeds, but 6 tasks on each of two nodes does not work. By not working, I mean that it prints a few of these messages (one per PID, though not from every PID; the count appears to equal NUM_WRITERS)
"[Warning] Network IO Failure (PID NNNNN):"
with various explanations ("Connection reset by peer", "Address already in use", "Broken pipe"). After that it prints
"[Network] Packet send retry count at: 1"
And then it hangs.
This sounds to me like a problem with the forking model that's in place, but I might be wrong. The way I originally set it up, yt would guess at addresses and hosts, which get fed in after the call to _get_hosts. My recommendation would be to have all of these items printed out and then look into DEBUG_RSOCKET.
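To compare what each task believes about its own host and address, something like the following Python sketch could be run from every task. It uses only the standard library; `describe_endpoint` and the port value are hypothetical names for illustration and are not part of yt or Rockstar.

```python
import os
import socket


def describe_endpoint(port):
    """Summarize the PID, hostname, and resolved address of this task."""
    host = socket.gethostname()
    try:
        addr = socket.gethostbyname(host)
    except OSError:
        # The hostname may not resolve on some compute nodes.
        addr = "unresolved"
    return "pid=%d host=%s addr=%s port=%d" % (os.getpid(), host, addr, port)


# Hypothetical port; substitute whatever value the server was given.
print(describe_endpoint(port=12345))
```

If two tasks on different nodes report the same address, or an address that is not reachable from the other node, that would point at the guessed hosts rather than at the network itself.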
I have turned on DEBUG_RSOCKET and I can see that tasks on both nodes are communicating with the server, and I also see "Accepted all reader / writer connections." and "Verified all reader / writer connections." The process gets as far as "Analyzing for halos / subhalos..." but it does not make it to " Constructing merger tree..." I am running on a single snapshot.
Is it possible that a process has died?
I have tracked the hang down to the call of accept() in _accept_connection() (in socket.c). It looks like that has been called by repair_connection() (in rsocket.c). If I'm interpreting things correctly (please correct me if I'm not), the other half of repair_connection() is a call to _reconnect_to_addr() made by a different task. It looks to me like no other task is making that matching call, and that's why accept() never returns.
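The failure mode described here, an accept() with no matching connect from the other side, can be reproduced in isolation. The Python sketch below is not Rockstar's code; `accept_with_timeout` is a hypothetical helper that adds a timeout so the missing peer shows up as a result instead of a hang.

```python
import socket


def accept_with_timeout(listener, seconds):
    """Return (conn, addr), or None if no peer connects within `seconds`."""
    listener.settimeout(seconds)
    try:
        return listener.accept()
    except socket.timeout:
        return None


listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
listener.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
listener.listen(1)

# Nobody connects, so a plain blocking accept() would wait forever here.
result = accept_with_timeout(listener, 0.2)
print("no peer connected" if result is None else "peer connected")
listener.close()
```

A temporary timeout of this kind around the real accept() call would at least confirm whether the reconnecting task ever shows up.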
I have done a test with stand-alone Rockstar on the same machine, and I am successful running it on 2 nodes. I think this means that there is some weirdness with the communication when running Rockstar as a library in yt/Python, and not the machine's network.
I'm wondering if anyone has been successful running Rockstar in yt on more than one node? Also, does anyone have any intuition for what might be going wrong here?
Thanks!
--
Stephen Skory
s@skory.us
http://stephenskory.com/
510.621.3687 (google voice)