
Hi Matt,
That might be it. When I run "top" on the nodes, I'm seeing "python <defunct>" a number of times equal to NUM_WRITERS.
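For reference, here's a minimal Python sketch of how a forked child ends up listed as <defunct>. This is not Rockstar's code, and the function names are made up for illustration. A child that has exited stays in the process table until its parent calls wait() on it, so a defunct entry means the child is already finished and the parent has not collected its exit status yet.

```python
import os
import time


def pid_exists(pid):
    """Return True if the PID is still in the process table (running or defunct)."""
    try:
        os.kill(pid, 0)  # signal 0 delivers nothing, it only checks existence
    except ProcessLookupError:
        return False
    return True


def demo_defunct_child(exit_code=3):
    """Fork a child that exits at once, then reap it (hypothetical helper)."""
    pid = os.fork()
    if pid == 0:
        # Child: exit right away, skipping Python cleanup handlers.
        os._exit(exit_code)

    time.sleep(0.2)  # give the child time to exit
    # The child has exited but has not been reaped yet, so "top" or "ps"
    # would list it as <defunct> at this point.
    before_wait = pid_exists(pid)

    # Reaping the child collects its exit status and removes the entry.
    _, status = os.waitpid(pid, 0)
    after_wait = pid_exists(pid)
    return before_wait, os.WEXITSTATUS(status), after_wait


if __name__ == "__main__":
    print(demo_defunct_child())
```

On a POSIX system this should show the child still present before waitpid() and gone afterwards, along with the exit code it returned.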
I haven't figured it out, but I've learned a little. There are a few places where Rockstar is fork()ing, and I've been looking at the one around line 790 of client.c (underneath "else if (!strcmp(cmd, "rock")) {"). I've added some printfs there, and what I'm seeing is that the forked processes are the ones going defunct, but they are not the PIDs that are reporting Network IO failures. It looks like none of the original unforked python tasks are going defunct before things hang.

Could it be that the forked processes are quitting/finishing before they should, and that's why it's hanging? But in the words of Tina Turner, what does going from one to two nodes have to do with it?

--
Stephen Skory
s@skory.us
http://stephenskory.com/
510.621.3687 (google voice)