Thread: Excessive exits - which jobs caused it?

  1. #1
    RhodeKille (Junior Member, joined September 23rd, 2008)

    Does anyone know a good (i.e. simple and effective) way to figure out which jobs caused a host to be closed for excessive exits? I'm hoping for something short of combing through the log files to see which jobs were running on the host just before it was closed.

    We set our EXIT_RATE low so that a single host with automount issues won't kill a big regression run, and now we are seeing multiple hosts getting closed due to excessive exits. I'd like an easy way to find out which jobs killed the host so I can verify that the job exits were "normal" - i.e. not unexpected and not caused by host issues (automount, memory, etc.).

    This is a 7u3 cluster, running EDA applications.

    TIA.

  2. #2
    gthomas (Junior Member, joined February 29th, 2008)

    When the threshold is reached, LSF invokes eadmin, which closes the host.
    The eadmin script is in LSF_SERVERDIR, and you can edit it to add your own custom action. If you look at the script, you will see a variable called LSB_UNDERRUN_JOBS. That is the information you are looking for (jobs that exit too quickly); just add a line to the script that writes it to a file.
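
    As a rough sketch of that extra line, something like the following could be pasted into a copy of eadmin. It only assumes that LSB_UNDERRUN_JOBS is set in the script's environment when LSF calls it; the exact format of its value is not shown here, so it is logged as-is. The log path, the EADMIN_JOB_LOG variable, and the log_underrun_jobs function are made-up names for illustration, not part of LSF.

```shell
#!/bin/sh
# Hypothetical addition to a copy of eadmin: record the jobs LSF reports
# in LSB_UNDERRUN_JOBS so they can be reviewed later.
# EADMIN_JOB_LOG is an assumed name, not an LSF parameter.
EADMIN_JOB_LOG="${EADMIN_JOB_LOG:-/tmp/eadmin_underrun_jobs.log}"

log_underrun_jobs() {
    # Nothing to record if the variable is unset or empty.
    [ -n "${LSB_UNDERRUN_JOBS:-}" ] || return 0
    # Append a timestamp plus the raw value; the format is left untouched.
    printf '%s %s\n' "$(date '+%Y-%m-%d %H:%M:%S')" "$LSB_UNDERRUN_JOBS" >> "$EADMIN_JOB_LOG"
}

log_underrun_jobs
```

    Keep a backup of the original eadmin before editing it, and point the log at a location the account running eadmin can write to.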
