LAVA, Open MPI, Infiniband (OFED) and ... RLIMIT_MEMLOCK
When submitting an Open MPI job through LAVA on an IB-enabled (OFED) cluster, you will probably see this kind of error:
##################################################
[compute-0-0.local:07337] mca_mpool_openib_register: ibv_reg_mr(0x1711000,528384) failed with error: Cannot allocate memory
[compute-0-0.local:07337] mca_mpool_openib_register: ibv_reg_mr(0x1711000,528384) failed with error: Cannot allocate memory
[0,1,9][btl_openib.c:808:mca_btl_openib_create_cq_srq] error creating low priority cq for mthca0 errno says Cannot allocate memory
--------------------------------------------------------------------------
It looks like MPI_INIT failed for some reason; your parallel process is
likely to abort. There are many reasons that a parallel process can
fail during MPI_INIT; some of which are due to configuration or environment
problems. This failure appears to be an internal failure; here's some
additional information (which may only be relevant to an Open MPI
developer):
PML add procs failed
--> Returned "Error" (-1) instead of "Success" (0)
--------------------------------------------------------------------------
*** An error occurred in MPI_Init
*** before MPI was initialized
*** MPI_ERRORS_ARE_FATAL (goodbye)
[0,1,0][btl_openib.c:808:mca_btl_openib_create_cq_srq] error creating low priority cq for mthca0 errno says Cannot allocate memory
##################################################
This one is a little bit more explicit:
##################################################
Your job looked like:
------------------------------------------------------------
# LSBATCH: User input
openmpi-mpirun -np 8 ./hello
------------------------------------------------------------
Exited with exit code 143.
Resource usage summary:
CPU time : 0.08 sec.
The output (if any) follows:
libibverbs: Warning: RLIMIT_MEMLOCK is 32768 bytes.
This will severely limit memory registrations.
libibverbs: Warning: RLIMIT_MEMLOCK is 32768 bytes.
This will severely limit memory registrations.
libibverbs: Warning: RLIMIT_MEMLOCK is 32768 bytes.
This will severely limit memory registrations.
libibverbs: Warning: RLIMIT_MEMLOCK is 32768 bytes.
This will severely limit memory registrations.
--------------------------------------------------------------------------
The OpenIB BTL failed to initialize while trying to allocate some
locked memory. This typically can indicate that the memlock limits
are set too low. For most HPC installations, the memlock limits
should be set to "unlimited". The failure occured here:
Host: compute-00-00
OMPI source: btl_openib.c:828
Function: ibv_create_cq()
Device: mthca0
Memlock limit: 32768
You may need to consult with your system administrator to get this
problem fixed. This FAQ entry on the Open MPI web site may also be
helpful:
FAQ: Tuning the run-time characterisitics of MPI OpenFabrics communications (InfiniBand and iWARP)
--------------------------------------------------------------------------
It looks like MPI_INIT failed for some reason; your parallel process is
likely to abort. There are many reasons that a parallel process can
fail during MPI_INIT; some of which are due to configuration or environment
problems. This failure appears to be an internal failure; here's some
additional information (which may only be relevant to an Open MPI
developer):
PML add procs failed
--> Returned "Error" (-1) instead of "Success" (0)
--------------------------------------------------------------------------
*** An error occurred in MPI_Init
*** before MPI was initialized
*** MPI_ERRORS_ARE_FATAL (goodbye)
--------------------------------------------------------------------------
The OpenIB BTL failed to initialize while trying to allocate some
locked memory. This typically can indicate that the memlock limits
are set too low. For most HPC installations, the memlock limits
should be set to "unlimited". The failure occured here:
##################################################
Looks easy to fix, according to the Open MPI FAQ.
Let's check what the limits are on the nodes:
---------------------------------------------------------
[mbozzore@tyan04 basic]$ ssh -x compute-00-00
Last login: Mon Aug 11 08:03:13 2008 from tyan04.ocs5.org
[mbozzore@compute-00-00 ~]$ ulimit -a
core file size (blocks, -c) 0
data seg size (kbytes, -d) unlimited
scheduling priority (-e) 0
file size (blocks, -f) unlimited
pending signals (-i) 8184
max locked memory (kbytes, -l) 1026028
max memory size (kbytes, -m) unlimited
open files (-n) 1024
pipe size (512 bytes, -p) 8
POSIX message queues (bytes, -q) 819200
real-time priority (-r) 0
stack size (kbytes, -s) 10240
cpu time (seconds, -t) unlimited
max user processes (-u) 8184
virtual memory (kbytes, -v) unlimited
file locks (-x) unlimited
---------------------------------------------------------
Looks like this is not the problem, so the next step is to start the same job on the same nodes, but outside of lava:
---------------------------------------------------------
[mbozzore@tyan04 basic]$ mpirun -np 8 --machinefile ./hosts --prefix $MPIHOME ./hello
Hello, world, I am 0 of 8
Hello, world, I am 1 of 8
Hello, world, I am 2 of 8
Hello, world, I am 5 of 8
Hello, world, I am 3 of 8
Hello, world, I am 4 of 8
Hello, world, I am 7 of 8
Hello, world, I am 6 of 8
---------------------------------------------------------
Hmmm ... that works, which does not help; what did I miss?
I can try to force the use of IB, outside of LAVA:
---------------------------------------------------------
[mbozzore@tyan04 basic]$ mpirun -np 8 --machinefile ./hosts --prefix $MPIHOME --mca btl openib,self ./hello
Hello, world, I am 0 of 8
Hello, world, I am 1 of 8
Hello, world, I am 2 of 8
Hello, world, I am 7 of 8
Hello, world, I am 4 of 8
Hello, world, I am 3 of 8
Hello, world, I am 6 of 8
Hello, world, I am 5 of 8
---------------------------------------------------------
And force the use of tcp when running under lava:
---------------------------------------------------------
[mbozzore@tyan04 basic]$ bsub -o%J.out -n 8 openmpi-mpirun -np 8 --mca btl tcp,self ./hello
---------------------------------------------------------
And the job output will be something like:
##################################################
Your job looked like:
------------------------------------------------------------
# LSBATCH: User input
openmpi-mpirun -np 8 --mca btl tcp,self ./hello
------------------------------------------------------------
Successfully completed.
Resource usage summary:
CPU time : 0.08 sec.
The output (if any) follows:
Hello, world, I am 0 of 8
Hello, world, I am 1 of 8
Hello, world, I am 2 of 8
Hello, world, I am 3 of 8
Hello, world, I am 4 of 8
Hello, world, I am 5 of 8
Hello, world, I am 6 of 8
Hello, world, I am 7 of 8
##################################################
Well, this looks very strange ... let's try to check the limits again, but this time through lava:
---------------------------------------------------------
[mbozzore@tyan04 basic]$ bsub -Ip -m compute-00-00 bash
Job <626> is submitted to default queue <normal>.
<<Waiting for dispatch ...>>
<<Starting on compute-00-00>>
[mbozzore@compute-00-00 basic]$ ulimit -a
core file size (blocks, -c) unlimited
data seg size (kbytes, -d) unlimited
scheduling priority (-e) 0
file size (blocks, -f) unlimited
pending signals (-i) 8184
max locked memory (kbytes, -l) 32
max memory size (kbytes, -m) unlimited
open files (-n) 1024
pipe size (512 bytes, -p) 8
POSIX message queues (bytes, -q) 819200
real-time priority (-r) 0
stack size (kbytes, -s) unlimited
cpu time (seconds, -t) unlimited
max user processes (-u) 8184
virtual memory (kbytes, -v) unlimited
file locks (-x) unlimited
---------------------------------------------------------
Am I going crazy? The same node, but different limits inside and outside of LAVA ...
Actually, no (at least not yet): getting different values for the limits (through LAVA / through an ssh shell) is absolutely normal. The answer is in the init scripts and the default limits at boot / init time.
As for the limits, Linux provides several resource limits and one of them is RLIMIT_MEMLOCK (the maximum number of bytes of memory a process can lock into memory via mlock(), mlockall() or shmctl()). The default soft and hard limits for RLIMIT_MEMLOCK are 8 pages.
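To look at both values of this limit from a shell, here is a minimal sketch. It assumes a shell whose ulimit builtin supports -l (bash does, and reports kbytes); show_memlock is just a name picked for illustration.

```shell
# Print the soft and hard RLIMIT_MEMLOCK values seen by the current shell.
# ulimit -l reports kbytes (or "unlimited"), as in the ulimit -a output.
show_memlock() {
    echo "soft: $(ulimit -S -l) hard: $(ulimit -H -l)"
}

show_memlock
```

Running it once in an ssh shell and once inside an interactive LAVA job should show the two different values discussed here.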
Let's check in the source code. For example, the sys/resource.h header file (part of glibc) includes bits/resource.h, and this header file contains:
Code:
/* Transmute defines to enumerations. The macro re-definitions are
   necessary because some programs want to test for operating system
   features with #ifdef RUSAGE_SELF. In ISO C the reflexive
   definition is a no-op. */

/* Kinds of resource limit. */
enum __rlimit_resource
{
  /* Per-process CPU limit, in seconds. */
  RLIMIT_CPU = 0,
#define RLIMIT_CPU RLIMIT_CPU

  /* Largest file that can be created, in bytes. */
  RLIMIT_FSIZE = 1,
#define RLIMIT_FSIZE RLIMIT_FSIZE

  /* Maximum size of data segment, in bytes. */
  RLIMIT_DATA = 2,
#define RLIMIT_DATA RLIMIT_DATA

  /* Maximum size of stack segment, in bytes. */
  RLIMIT_STACK = 3,
#define RLIMIT_STACK RLIMIT_STACK

  /* Largest core file that can be created, in bytes. */
  RLIMIT_CORE = 4,
#define RLIMIT_CORE RLIMIT_CORE

  /* Largest resident set size, in bytes.
     This affects swapping; processes that are exceeding their
     resident set size will be more likely to have physical memory
     taken from them. */
  __RLIMIT_RSS = 5,
#define RLIMIT_RSS __RLIMIT_RSS

  /* Number of open files. */
  RLIMIT_NOFILE = 7,
  __RLIMIT_OFILE = RLIMIT_NOFILE, /* BSD name for same. */
#define RLIMIT_NOFILE RLIMIT_NOFILE
#define RLIMIT_OFILE __RLIMIT_OFILE

  /* Address space limit. */
  RLIMIT_AS = 9,
#define RLIMIT_AS RLIMIT_AS

  /* Number of processes. */
  __RLIMIT_NPROC = 6,
#define RLIMIT_NPROC __RLIMIT_NPROC

  /* Locked-in-memory address space. */
  __RLIMIT_MEMLOCK = 8,
#define RLIMIT_MEMLOCK __RLIMIT_MEMLOCK

  /* Maximum number of file locks. */
  __RLIMIT_LOCKS = 10,
#define RLIMIT_LOCKS __RLIMIT_LOCKS

  /* Maximum number of pending signals. */
  __RLIMIT_SIGPENDING = 11,
#define RLIMIT_SIGPENDING __RLIMIT_SIGPENDING
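Hmmmm ... 8 ... 8 what? Careful: the 8 in this enumeration is only the index of the resource, not its value. The default value, though, happens to be 8 as well: 8 pages. Let's check the page size then.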
I love man pages : man getpagesize
So, just create test.c :
Code:
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    /* getpagesize() returns an int, so print it with %d */
    int page_size = getpagesize();
    printf("page size=%d\n", page_size);
    return 0;
}
And then gcc test.c; ./a.out :
---------------------------------------------------------
[root@stakhanov conf]# ./a.out
page size=4096
---------------------------------------------------------
Cool, this is also consistent with what you can find here: /usr/include/linux/resource.h
Code:
#ifndef _LINUX_RESOURCE_H
#define _LINUX_RESOURCE_H

#include <linux/time.h>

/*
 * Resource control/accounting header file for linux
 */

/*
 * Definition of struct rusage taken from BSD 4.3 Reno
 *
 * We don't support all of these yet, but we might as well have them....
 * Otherwise, each time we add new items, programs which depend on this
 * structure will lose. This reduces the chances of that happening.
 */
...
...
...
/*
 * GPG wants 32kB of mlocked memory, to make sure pass phrases
 * and other sensitive information are never written to disk.
 */
#define MLOCK_LIMIT (8 * PAGE_SIZE)

/*
 * Due to binary compatibility, the actual resource numbers
 * may be different for different linux versions..
 */
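The arithmetic behind the 32768 bytes reported by libibverbs can be checked directly from a shell. This is only a sketch: it assumes getconf is available and takes the factor 8 from the MLOCK_LIMIT definition quoted here; the variable names are mine.

```shell
# Default memlock limit = 8 pages (MLOCK_LIMIT in linux/resource.h).
page_size=$(getconf PAGESIZE)
default_memlock=$(( 8 * page_size ))

echo "8 pages = ${default_memlock} bytes = $(( default_memlock / 1024 )) kbytes"
```

With 4096-byte pages this gives 32768 bytes, i.e. the 32 kbytes that ulimit -l shows inside a LAVA job.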
Nice, but I am still getting different limits inside and outside of LAVA, and I get the default limit _only_ inside LAVA.
OK, what can change these limits?
Well, ulimit ... and you can set up limits in many different ways:
/etc/profile for example sets up the following:
---------------------------------------------------------
# No core files by default
ulimit -S -c 0 > /dev/null 2>&1
---------------------------------------------------------
There is also some setup done in /etc/security/limits.conf
Another very interesting thing, and the key here, is how limits are changed and inherited:
- Any process is free to set a soft limit to any value from 0 up to the hard limit, or to lower its hard limit (an unprivileged process cannot raise it again). Children will inherit these updated limits during a fork.
- A privileged process is free to set a hard limit to any value. Children will inherit these updated limits during a fork.
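The inheritance rule is easy to observe without any scheduler. In this sketch (child_memlock is a hypothetical helper, and the value 8 is arbitrary), a subshell lowers its soft memlock limit and then starts a child process, which reports the lowered value:

```shell
# Lower the soft memlock limit (in kbytes) inside a subshell, then start a
# child process: the child inherits the lowered limit from its parent.
child_memlock() {
    ( ulimit -S -l "$1" && sh -c 'ulimit -S -l' )
}

echo "child sees:   $(child_memlock 8)"
# The calling shell is untouched, because the change was made in a subshell.
echo "parent still: $(ulimit -S -l)"
```

This assumes the hard limit of the calling shell is at least 8 kbytes, which should be the case on any reasonable system.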
This is exactly the problem. Long story short: the LAVA sbatchd forks children; your MPI instance runs "under" it and inherits the sbatchd memlock limit.
So:
- the init script (runlevel 3) starts the LAVA daemons with the default system limits (32k for memlock)
- these daemons fork children; your job is a forked process, so it inherits the sbatchd memlock limit
Let's check this out: just submit a sleep job and check on the node what is going on:
---------------------------------------------------------
[mbozzore@stakhanov ~]$ bsub sleep 3000
Job <212> is submitted to default queue <normal>.
[mbozzore@stakhanov ~]$ bjobs -w
JOBID USER STAT QUEUE FROM_HOST EXEC_HOST JOB_NAME SUBMIT_TIME
212 mbozzore RUN normal stakhanov compute-00-00 sleep 3000 Aug 21 00:41
[mbozzore@stakhanov ~]$ ssh -x compute-00-00
Last login: Tue Aug 26 19:43:54 2008 from stakhanov.ocs5
[mbozzore@compute-00-00 ~]$ ps -ef | grep sleep
mbozzore 14090 14089 0 19:49 ? 00:00:00 sleep 3000
mbozzore 14129 14094 0 19:49 pts/3 00:00:00 grep sleep
[mbozzore@compute-00-00 ~]$ ps -opid,ppid,comm,args 14089
PID PPID COMMAND COMMAND
14089 14088 1219293702.212 /bin/sh /home/mbozzore/.lsbatch/1219293702.212
[mbozzore@compute-00-00 ~]$ ps -opid,ppid,comm,args 14088
PID PPID COMMAND COMMAND
14088 30427 res /usr/sbin/res -d /etc/lava/conf -m stakhanov /home/mbozzore/.lsbatch/1219293702.212
[mbozzore@compute-00-00 ~]$ ps -opid,ppid,comm,args 30427
PID PPID COMMAND COMMAND
30427 1 sbatchd /usr/sbin/sbatchd
[mbozzore@compute-00-00 ~]$
---------------------------------------------------------
- if you just ssh to a compute node, you will see that ulimit reports unlimited for memlock
- if you submit an interactive job (bsub -Ip bash for example), then ulimit -l reports 32 (kbytes), even if the default shell limit for memlock is unlimited. This is simply because the original sbatchd had a 32k limit for memlock
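Instead of following the parent PIDs by hand with ps, the chain and the memlock limit of each ancestor can be printed in one go. This is a sketch under assumptions: it needs a Linux kernel that exposes /proc/<pid>/limits (older kernels may not), reading the entries of daemons owned by root may require root, and memlock_chain is a hypothetical helper name.

```shell
# Walk up the parent chain of a PID and print each ancestor's memlock limit.
memlock_chain() {
    pid=$1
    while [ -n "$pid" ] && [ "$pid" -gt 0 ] && [ -r "/proc/$pid/limits" ]; do
        name=$(awk '/^Name:/ {print $2}' "/proc/$pid/status")
        limit=$(grep 'Max locked memory' "/proc/$pid/limits")
        echo "$pid ($name): $limit"
        # Continue with the parent; PPid is 0 once we reach the top.
        pid=$(awk '/^PPid:/ {print $2}' "/proc/$pid/status")
    done
}

# Example: start from the current shell. On the compute node you would pass
# the PID of the job instead (14090 for the sleep job above).
memlock_chain $$
```

For a job started by LAVA, every line from the job up to sbatchd should show the same low limit.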
And of course, this is also why restarting LAVA by hand (full service stop / start) solves the problem: as soon as you log on to a node (ssh), you get a shell whose default memlock limit is unlimited, so the daemons started from this shell inherit that limit.
The key is to modify the LAVA init script (just insert a ulimit call before the daemons are started), so that the daemons also get the right limit at boot time.
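What that change could look like is sketched here. I am not reproducing the actual LAVA init script: raise_memlock is a hypothetical helper, and where exactly it belongs depends on how your script starts the daemons.

```shell
# Raise the locked-memory limit before the daemons are launched, so that
# sbatchd, and every job it forks, inherits it. This only succeeds when the
# script runs with enough privileges (init scripts normally run as root).
raise_memlock() {
    ulimit -l unlimited 2>/dev/null || {
        echo "warning: could not raise the memlock limit" >&2
        return 1
    }
}

raise_memlock || echo "continuing with memlock limit: $(ulimit -l)" >&2
# ... the existing commands that start the LAVA daemons would follow here ...
```

After a full stop / start of the service, ulimit -l inside an interactive job (bsub -Ip bash) is a quick way to confirm that the new limit is inherited.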
For reference, I found a lot of useful information reading this book:
Linux System Programming
by Robert Love
Publisher: O'Reilly
Pub Date: September 15, 2007
Print ISBN-10: 0-596-00958-5
Print ISBN-13: 978-0-59-600958-8
It is available through safari books online (O'Reilly - Safari Books Online - 0596009585 - Linux System Programming, 1st Edition)
Mehdi Bozzo-Rey
Posted August 21st, 2008