Because MPI programs need to start multiple processes, an "MPI Starter" program is usually included with the library. At a minimum, the starter program is responsible for staring and stopping processes on cluster nodes, but interestingly was not part of the original MPI specification. The first de-facto starter program was called mpirun and could be found as part of every MPI library. Most MPI libraries now support the MPI-2 recommended mpiexec starter program, but often provide some legacy support for mpirun. Throughout this article we will use mpiexec, but in almost all cases mpirun can be used as well.
Methods for starting MPI programs are often library dependent. Essentially, mpiexec must remotely start and manage programs on other nodes. This task is usually accomplished by using either rsh or ssh. Sometimes mpiexec will start the program directly, but most often it will start a remote daemon that will then start and manage the remote programs. The daemon is used as a means to obtain finer control of the remote processes, and in some cases faster startup times. We will not go into the various MPI starter mechanisms, but rather focus on some of the options offered to the user at runtime, i.e. The options end user can control.
When running MPI programs most users are content to use mpirun or mpiexec with a simple arguments indicating the number of processes (using the -np or -n argument) they wish to start and a list of machines (-machinefile argument) on which to run. For example, the command:
Code:
$ mpiexec -np 4 -machinefile hosts myprogram
Placing Processes
How does one indicate how many cores to use? This is where the machinefile comes into play. The machinefile is a list of machines on which to run the processes. In simple terms, mpiexec will take the list of machines and assign processes to them, one per machine, until it reaches the total number of processes requested by the -np argument. When using a resource manager, the nodes in the machinefile are determined at run time. In the absence of some kind of machinefile, most mpirun programs will start the requested number of processes on the current node. Newer MPI starters are also designed to work directly with the popular batch schedulers (Platform Lava, LSF, Grid Engine, SLURM, Torque, PBS Pro, LoadLeveler). That is, if an MPI job is submitted through the batch scheduler, it will "know what to do."
In order to better understand the process of core mapping, assume that we have two 4-core processors for a total of eight cores. If we want to run an MPI job that uses four cores, where will the jobs go? And, should we care where they are executed? The answer to the first question is a little long and MPI library specific, but some examples will illustrate why you may want to exert some control over your process placement. Consider Figure One where three basic scenarios are shown.

Figure One: Various ways four processes can be run on two 4-core processors.
One might assume that packing all the processes onto a single processor would provide the best performance. This assumption is not always true and may depend on the specific details of your hardware environment. The same can be said for the other two distributions. If we add more processors, we see that other process layouts are possible as indicated in Figure Two below.

Figure Two: Various ways four processes can be run on three or four 4-core processors.
If you extrapolate to large numbers of cores, processors, and processes, you can see that there is a huge variation in where processes can land. In general, it is either on the same processor, on the same server, or on a different server. From an operating system standpoint, all cores on the same server are treated the same. That is, processes can be moved from core to core by the OS in order to better load balance resources. It is possible to pin a specific process to a specific core using processor affinity options. This subject reaches beyond the mpiexec command line and will not be covered. We will assume that when a process is placed on a node, the OS will decide on what specific core it will run.
Where your processes land can affect performance. Indeed, close packing jobs may allow a program to take advantage of communication through shared messages, while dispersing jobs may provide better access to node resources (memory, disk, interconnect). As is often said, "It all depends".
One must be careful though, as mpiexec sometimes allows for you to oversubscribe a single node with MPI processes. That is, running
Code:
mpiexec -n 16
At this point the specific details and options of various MPI version come into play. In order to illustrate how you as an end user can control process placement, we will look at the options available from two popular MPI packages: Open MPI and MPICH2.
Open MPI Options
Users can control the process placement with Open MPI using a combination of command line options and and a hostfile. The hostfile is a text file with at least one hosts specified per line. Each host can also specify maximum number of cores (called slots by Open MPI) to be used on that host (i.e., the number of available cores on that host). Comments are also supported. For example the following is an sample hostfile from the Open MPI web page:
Code:
# This is an example hostfile. Comments begin with # # # The following node is a single processor machine: foo.example.com # The following node is a dual-processor machine: bar.example.com slots=2 # The following node is a quad-processor machine, and we absolutely # want to disallow over-subscribing it: yow.example.com slots=4 max-slots=4
In this mode, Open MPI will schedule processes on a node until all of its default slots are exhausted before proceeding to the next node. That is jobs are "packed" into the nodes starting with the first node in the hostfile.
The other mode is called "by node" which is indicated by using the --bynode option with mpiexec. In this mode, Open MPI will schedule a single process on each node in a round-robin fashion (looping back to the beginning of the node list as necessary) until all processes have been scheduled. Nodes are skipped once their default slot counts are exhausted. This mode attempts to disperse the processes across the nodes.
If the number of processes exceeds the number of cores in the hostfile, then nodes will be over subscribed. The maxslots parameter can be used to limit this situation and prevent oversubscribing nodes.
There are some very good reasons why you may want to keep tight control over your processes. Consider the case where you have a threaded program, i.e., once started on the node, an MPI process will use all available cores. The by node option is used in this case so that the node does not get inundated with threads.
As mentioned, not all environments require a hostfile. For example, Open MPI will automatically detect when it is running in certain batch environments and use host information provided by those systems (i.e., it will ignore any provided hostfiles). In this case, the packing options have now moved to the batch scheduler and are normally specified in a submit script. Also note that if you do not use a batch scheduler or a host file all processes are launched on the local host.
As you might imagine, the whole process layout topic can get rather complicated. There are many more aspects and options offered by Open MPI and it is recommenced that you read the Open MPI FAQ to gain more insights.
MPICH2 Options
Similar to Open MPI, MPICH2 supports the use of a hostfile. Again, if you are using an MPICH2 aware scheduler, then the hostfile specification is ignored. The -machinefile option can be used to specify hostfile information. Machines are listed one per line and the number of processes to start on each machine is indicated by appending the number to the host name. An example hostfile is shown below:
Code:
# comment line hosta hostb:2 hostc hostd:4
If you wanted to pack in your processes, then you would specify the number of cores after each node name. If you wanted to disperse your processes, then you would append a ":1" or nothing to the node name. As over subscription is not allowed, you will need to make sure that you have enough processes specified for your MPI job. Using less than the total is allowable as well.
Similar to Open MPI, if you do not specify a hostfile or use a batch scheduler, all your jobs will be started on the same node. In addition, MPICH2 can be configured to one of several MPI runtime environments. The default are the MPD daemons which must be started on each node before running mpiexec. Once started, the user can issue commands like:
Code:
$mpiexec -n 8 myprogram
Final Thoughts
Chances are that you will be insulated from some of these issues and options because almost all clusters use batch schedulers. It is important to remember that there are options to help direct your MPI processes at several levels (mpiexec and the batch scheduler). Initially, mpirun scripts were in charge of placing and running jobs. Early batch schedulers did very little except provide the node, which contained two single core processors, and let you run your program. There is now an increased level of integration in the schedulers, thus moving the final control from mpiexec to the submit script. In either case, it is always nice to know what your options are.

