Do any of you know a good (i.e. simple and effective) way to figure out which jobs caused a host to be closed? Something short of combing through the log files to figure out which jobs were running on it just before it was closed.
We set our EXIT_RATE low so that a single host with automount issues won't kill a big regression run, and now we are seeing multiple hosts getting closed due to excessive job exits. I'd like an easy way to find out which jobs got a host closed so I can verify that the job exits were "normal" - i.e. expected, and not caused by host issues (automount, memory, etc.).
This is a 7u3 cluster, running EDA applications.
TIA.

