Yes, in fact, I can see that after the first MPI error, the other processes still go and try to get spheres before the task manager kills them.
Okay, hm, interesting.
There are no core dumps. I SSHed in, ran 'top', and watched both per-process memory and total usage on the node; it wasn't approaching the machine's limits when it crashed. The fact that it crashes at different places in the run cycle suggests something else is going on, but I don't know what.
This is suspicious. In fact, it makes me think there *could* be a problem with processes hanging while waiting at barriers. Do you have debug logging turned on? That should notify you whenever a barrier is entered, as long as it goes through the standard barrierization. (One of the reasons I try to avoid raw MPI calls.) I'll see if I can write up a long-overdue mechanism for distinguishing logs by processor and paste that; a rough sketch is below. -Matt
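
Something along these lines, as a minimal sketch assuming mpi4py and the standard logging module; the log filename pattern and the barrier() wrapper name are just placeholders, not anything that exists in the codebase yet:

from mpi4py import MPI
import logging

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

# Each processor writes to its own file, so output from different
# ranks can't interleave and hide a hang.
logger = logging.getLogger("parallel")
logger.setLevel(logging.DEBUG)
handler = logging.FileHandler("run_rank%04i.log" % rank)
handler.setFormatter(logging.Formatter(
    "%(asctime)s P" + ("%03i" % rank) + " %(levelname)s %(message)s"))
logger.addHandler(handler)

def barrier():
    # Log entry and exit, so a rank stuck waiting shows an
    # "Entering barrier" line with no matching "Past barrier" line.
    logger.debug("Entering barrier")
    comm.Barrier()
    logger.debug("Past barrier")

Then after a crash you could grep the per-rank logs for an "Entering barrier" with no "Past barrier" after it, which would tell you which processor is wedged and where.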