We are running Red Hat HPC solution 5.2 on top of RHEL 5.3, and using the Intel Cluster Toolkit, Compiler Edition, for our MPI libraries.
Our MPI application occasionally crashes for no apparent reason. It happens mostly on long-running jobs that run for many days: after a day or two, sometimes about halfway through the run, the job simply aborts. There are no error messages, and nothing in the system logs appears to correlate with the crash.
Has anyone else experienced this type of thing (whether with the Intel MPI libraries or any other implementation)?
We are looking into possible hardware issues, but it is hard to know where to start with no information to go on.
We will also compile with -trace=log to see whether the MPI logs can give us any information about the crashes.
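Since the job currently dies without leaving any trace, we are also considering adding some instrumentation of our own. Below is a rough sketch of the idea, not something we have tested on the cluster: a per-rank heartbeat line, plus a handler that prints the name of a fatal signal before the process dies. The function names (format_heartbeat, signal_label, install_crash_handlers) are made up for illustration, and we are assuming the MPI library does not replace these handlers with its own.

```c
#include <signal.h>
#include <stdio.h>
#include <unistd.h>

/* Hypothetical helper: format one progress line for a rank's log file. */
int format_heartbeat(char *buf, size_t len, int rank, long step)
{
    return snprintf(buf, len, "heartbeat rank=%d step=%ld\n", rank, step);
}

/* Map the trapped signals to fixed strings so the handler never formats text. */
const char *signal_label(int sig)
{
    switch (sig) {
    case SIGSEGV: return "fatal signal: SIGSEGV\n";
    case SIGFPE:  return "fatal signal: SIGFPE\n";
    case SIGABRT: return "fatal signal: SIGABRT\n";
    case SIGTERM: return "fatal signal: SIGTERM\n";
    default:      return "fatal signal: other\n";
    }
}

/* Handler limited to async-signal-safe calls: write(), signal(), raise(). */
static void crash_handler(int sig)
{
    const char *msg = signal_label(sig);
    size_t n = 0;

    while (msg[n] != '\0')
        n++;
    if (write(STDERR_FILENO, msg, n) < 0) {
        /* Nothing useful can be done about a failed write here. */
    }
    signal(sig, SIG_DFL); /* restore the default action (e.g. core dump) */
    raise(sig);           /* re-deliver so the process still terminates */
}

/* Install the handler for the signals above; returns 0 on success, -1 on error. */
int install_crash_handlers(void)
{
    static const int sigs[] = { SIGSEGV, SIGFPE, SIGABRT, SIGTERM };
    size_t i;

    for (i = 0; i < sizeof sigs / sizeof sigs[0]; i++) {
        if (signal(sigs[i], crash_handler) == SIG_ERR)
            return -1;
    }
    return 0;
}
```

The plan would be to call install_crash_handlers() right after MPI_Init and to append a heartbeat line to a per-rank file every so many steps, so that the last line written shows roughly where each rank was when the job died. This would not catch SIGKILL, which cannot be trapped.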
I would be interested in:
1) whether anyone else has experienced this kind of thing, and whether there are any suggestions on where to start or how to proceed with debugging it.
2) whether anyone can point to more detailed information or documentation about MPI logging and debugging. There is some information in the Intel documentation, but it is fairly cursory. Anything would be helpful, whether it is about the Intel MPI libraries or MPI logging and debugging in general. Any hints on how to configure logging and what to look for would be welcome.