Rockstar on multiple nodes
Hi Peter and yt developers,

(Peter - I am copying you on this message that is also going to the yt developers email list. I believe if you reply-all the yt-dev copy of the message will bounce back to you, or be held for moderation, if you're not subscribed to yt-dev.)

I am having trouble running Rockstar within yt on more than one node. For example, I can run successfully with 12 tasks total on one node, but 6 tasks each on two nodes does not work.

By not working, I mean that it prints a few of these messages (one per PID, though not for every PID; the count appears to equal NUM_WRITERS):

"[Warning] Network IO Failure (PID NNNNN):"

with various explanations ("Connection reset by peer", "Address already in use", "Broken pipe"). After that it prints:

"[Network] Packet send retry count at: 1"

And then it hangs.

I have turned on DEBUG_RSOCKET and I can see that tasks on both nodes are communicating with the server, and I also see "Accepted all reader / writer connections." and "Verified all reader / writer connections." The process gets as far as "Analyzing for halos / subhalos..." but it does not make it to "Constructing merger tree..." I am running on a single snapshot.

I have tracked the hang down to the call of accept() in _accept_connection() (in socket.c). It looks like that has been called by repair_connection() (in rsocket.c). If I'm interpreting things correctly (please correct me if I'm not), the other half of repair_connection() is a call to _reconnect_to_addr() done by a different task. It looks to me like no other task is making that matching call, and that's where the hang is happening.

I have done a test with stand-alone Rockstar on the same machine, and I am successful running it on 2 nodes. I think this means that the problem is some weirdness with the communication when running Rockstar as a library in yt/Python, and not the machine's network.

I'm wondering if anyone has been successful running Rockstar in yt on more than one node? Also, does anyone have any intuition for what might be going wrong here?

Thanks!

-- 
Stephen Skory
s@skory.us
http://stephenskory.com/
510.621.3687 (google voice)
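The failure mode described in the message, a blocking accept() that never returns because no other task makes the matching connect call, can be reproduced in isolation. The following is a minimal Python sketch of that general socket behavior, not Rockstar's or yt's code; the helper names `make_listener` and `accept_with_timeout` are hypothetical and exist only for this illustration. It assumes a loopback TCP listener and uses a timeout so that the missing peer shows up as a reported result instead of an indefinite hang.

```python
import socket
import threading


def make_listener():
    """Create a TCP listener on the loopback interface (hypothetical helper)."""
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
    server.listen(1)
    return server


def accept_with_timeout(server, timeout):
    """Wait for one peer; return 'accepted' or 'timeout' (hypothetical helper).

    Without settimeout(), accept() would block forever when no peer connects,
    which is the kind of hang described in the message.
    """
    server.settimeout(timeout)
    try:
        conn, _addr = server.accept()
    except socket.timeout:
        return "timeout"
    conn.close()
    return "accepted"


# Case 1: no other task ever connects, so accept() can only time out.
server = make_listener()
no_peer_result = accept_with_timeout(server, 0.5)
server.close()

# Case 2: a second thread plays the role of the task that reconnects.
server = make_listener()
port = server.getsockname()[1]


def peer():
    # Connect to the listener and close immediately.
    with socket.create_connection(("127.0.0.1", port), timeout=2.0):
        pass


worker = threading.Thread(target=peer)
worker.start()
with_peer_result = accept_with_timeout(server, 2.0)
worker.join()
server.close()

print(no_peer_result, with_peer_result)
```

In a debugging session, the same idea (a bounded wait around the accept call plus a log line naming which task was expected to connect) may help show which side of the reconnect handshake is missing.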
participants (2): Matthew Turk, Stephen Skory