Thread: Running jobs are suspending for unrunnable jobs.

  1. #1
    admin_lsf is offline Member
    Join Date
    July 11th, 2008
    Posts
    57
    Downloads
    0
    Uploads
    0

    Originally posted by: garym, Wed Nov 29, 2006 8:01 pm

    I have a test queue defined that is pre-emptive over all lower priority queues. We have added limits so that each user can run only one test 'job' at a time (where a job may use more than one CPU).

    I have defined a resource testcount under lsf.cluster.name:

    Code:
    Begin ResourceMap
    RESOURCENAME LOCATION
    testcount (0@[all])
    End ResourceMap
    
    and placed a limit on that resource under lsb.resources:

    Code:
    Begin Limit 
    NAME = testcount_limit 
    PER_USER = all 
    PER_QUEUE=test 
    RESOURCE = [testcount,1] 
    End Limit
    
    On the test queue, I have:

    Code:
    RES_REQ = rusage[testcount=1]
    

    The problem: when a user submits more than one job to the test queue, the extra job suspends other jobs in the cluster but then pends itself with the reason:

    Code:
    Resource (testcount) limit defined on queue has been reached;
    

    Jobs should not be suspended until the next test queue job is able to run. What am I missing?
    Last edited by admin_lsf; July 11th, 2008 at 04:06 PM.

  2. #2

    Originally posted by: Rasheed, Thu Dec 14, 2006 3:21 am

    Hi,

    Please contact support@platform.com for assistance on this issue and our support engineers will be glad to assist you further.

    Platform Support

  3. #3

    Originally posted by: ddunlap, Thu Feb 08, 2007 4:11 pm

    This is pretty much "after the fact", but here's a little follow-up, for general interest:

    - when you put "testcount (0@[all])" in lsf.cluster.name, that tells LSF:
    "there are 0 instances of the resource 'testcount', shared by all hosts".
    In your case, I think you wanted to set this to 1, i.e. "testcount (1@[all])".

    - the limit in lsb.resources is not needed. By setting up testcount as a
    shared resource, with an initial value of 1, that will enforce the limit,
    as long as you set up the queue correctly (which you almost did).

    - in lsb.queues, setting "RES_REQ=rusage[testcount=1]" tells LSF:
    "for all jobs that run in this queue, reserve 1 unit of 'testcount' for
    the duration of the job". But you didn't include any selection criteria,
    and I think you need that:
    RES_REQ="select[testcount>0] rusage[testcount=1]"
    This says:
    "select only machines that have a value of testcount>0, and reserve
    1 unit of testcount for the duration of the job"
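
    Pulling the three points above together, a minimal sketch of the configuration might look like the following. The Resource section in lsf.shared is not shown in the original post and is assumed here, as are the description text and the PRIORITY value; the rest follows the settings discussed above.

    Code:
    # lsf.shared (assumed): declare testcount as a static numeric resource
    Begin Resource
    RESOURCENAME  TYPE     INTERVAL  INCREASING  DESCRIPTION
    testcount     Numeric  ()        N           (test job token)
    End Resource

    # lsf.cluster.name: one instance, shared by all hosts
    Begin ResourceMap
    RESOURCENAME  LOCATION
    testcount     (1@[all])
    End ResourceMap

    # lsb.queues: select on the resource and reserve one unit per job
    Begin Queue
    QUEUE_NAME = test
    PRIORITY   = 70
    PREEMPTION = PREEMPTIVE
    RES_REQ    = select[testcount>0] rusage[testcount=1]
    End Queue

    After reconfiguring the cluster, "bhosts -s" should list testcount along with the amount currently reserved, which is a quick way to check that a running test job is holding the unit. Note that a single instance shared by all hosts caps the whole cluster at one running test job, not one per user.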

    On the other hand, there might be an easier way to accomplish the
    same thing: in the test queue, set:
    QJOB_LIMIT=1

    This says "use at most one job slot at a time in the test queue", so only one single-slot job runs there at once.
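
    A minimal sketch of that queue entry, assuming the queue is named test as in the original post. QJOB_LIMIT counts job slots for the whole queue, so a per-user variant using UJOB_LIMIT (the per-user job slot limit in lsb.queues) is shown commented out as an alternative.

    Code:
    Begin Queue
    QUEUE_NAME = test
    # at most one job slot in use in this queue, across all users
    QJOB_LIMIT = 1
    # alternative: at most one job slot per user in this queue
    # UJOB_LIMIT = 1
    End Queue

    Both limits count slots rather than jobs, so a parallel test job that needs several slots would most likely never be dispatched under a limit of 1.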

    My suggestion is: always look for the simplest way to do what you want.

    My $.02,
    Dale
    _________________
    Dale Dunlap
    Technical Consultant
