The introduction of multi-core processors to HPC is often viewed as a huge benefit. Who would not want more cores in the same space for essentially the same cost? While core count is an advantage, multi-core has introduced an additional layer of complexity for the HPC users. There are many new decisions that the programmer and end user must make in regards to multi-core technology. In this article, we will take a look at some of the programming issues facing the HPC coder. In particular, we will use a few benchmarks to illustrate the differences between MPI, OpenMP, and hybrid programs that include both paradigms.
HPC Programming School
While some programming issues will be discussed, we will use existing publicly available benchmark suites for our comparisons. If you are unfamiliar with the programming aspects of MPI (Message Passing Interface) and OpenMP, please see the MPI In 30 Minutes and OpenMP in 30 Minutes tutorials. These tutorials will get you started quickly using open source tools (i.e. you don't have to buy anything). There are links to the source code and instructions on how to build OpenMP and MPI programs using GNU tools in each article. If you are using a cluster distribution then you will probably already have a pre-built version of MPI installed.
OpenMP != MPI
There is often some confusion between OpenMP and MPI. This confusion is understandable since there is even a version of MPI called "Open MPI." From a programmer's standpoint, MPI is library that contains message passing routines. OpenMP, on the other hand, is a set of compiler directives that tell an OpenMP enabled compiler what portions of the program can be run as threads. The difference is therefore "threads vs messages." Let's take a look at both methods.
A multi-core processor looks the same as a multi-socket single-core server to the operating system. (i.e. before multi-core, dual socket servers provided two processors like today's dual core processors) Programming in this environment is essentially a mater of using POSIX threads. Thread programming can be difficult and error prone. OpenMP was developed to give programmers a higher level of abstraction and make thread programming easier. In a threaded or OpenMP environment, communication happens through memory. A single program will branch or launch multiple threads, which are then executed on separate cores. The entire program shares the same memory space. If one thread wants to communicate with another thread, it would say, "there is a message at this memory location" (i.e. "Hello over there"). The receiving thread would look at the memory location and then tell the sender my response is here. T! here is no copied data, there is only one copy and it is shared between threads. (Note, this is not strictly how it happens, but the "no copying" is what is important).
Conversely, MPI is a message passing method which basically copies memory. Let's consider a simple MPI message sent from one program to another. The first program sends the text message "Hello over there," and the receiving program responds with "Hi." The sending program will construct the "Hello over there" string in memory then send it to the receiving program. The receiving program will then take the string and place it into its own memory. There are now two copies of the string. The reply works exactly the same way. This type of communication is best for distributed memory systems like clusters. The programs sending messages do not share memory. By design, MPI processes can be located either on the same server or on a separate servers. Regardless of where it runs, each MPI process has it's own memory space from which messages are copied.
As mentioned, MPI can run across distributed servers and on SMP (multi-core) servers. OpenMP, however, is best run on a single SMP server or on multiple servers using something like ScaleMP. There is also a product from Intel called Cluster OpenMP that can run OpenMP applications across a cluster. For this reason, MPI codes usually scale to larger numbers of servers, while OpenMP is restricted to a single operating system domain.
Processes or Threads
Another difference between MPI and OpenMP is the way programs are run. OpenMP programs run as a single process and the parallelism is expressed as threads. (i.e. the program is started as one binary which then separates into individual "threads" that are run on the available cores on a server.) This behavior can be seen quite clearly when viewing an OpenMP program (lu.B) using top. As an example, consider Figure One where a single OpenMP binary is running on a four-core processor.
Figure One: OpenMP program (lu.B) running on four cores shown as a single process
There are some important things to notice about Figure One. First, all four cores are running at nearly 100% (to get a break out of the cores using top, enter a "1" while top is running.) There is only one version of lu.B running, but note the CPU usage of 399%. If you enter "H" into top (as shown in Figure Two), it becomes more clear as to what is happening with the OpenMP program.
Figure Two: OpenMP program (lu.B) running on four cores with all processes shown
Figure Two now shows four processes for lu.B. (note: Linux threads each have their own process number which allows them to be treated like any other process by the scheduler.) Now the percentage of CPU utilization is 100 or less and should add up to the cumulative percent shown in Figure One. In addition, note that the amount of memory for each thread is exactly the same -- it is actually the same memory. As a side note, the number of OpenMP threads can be controlled a number of ways. By default, the program will use as many threads as there are cores in the server. With GNU compilers, this can be changed by setting the environment variable OMP_NUM_THREADS to the number of desired threads. There is normally no benefit derived from running with more threads than there are number of cores.
In contrast to OpenMP, MPI actually starts a new process for each parallel part of the program. For example when using an MPI starter program such as mpiexec the user provides the number of processes that should be run (e.g mpiexec -np 4 lu.B.4). This situation is depicted in Figure Three where an MPI version of the same program lu.B.4, is running. The program is running as four MPI processes on a single quad-core processor.
Figure Three: MPI program (lu.B.4) running on four cores
Note that the number of processes is now four and each is single threaded with a 100% utilization rate. The processor (core) loads are about the same. Also, note that the the memory sizes are slightly different.
OpenMP vs MPI
Based on the above Figures, one might be inclined to ask, "which method is faster?" This is a good question and one that can best be answered by running a few tests. In order to test OpenMP and MPI in an apples-to-apples comparison, we need a program that uses both methods. Fortunately, the NAS Parallel Benchmarks have both OpenMP and MPI versions for multiple programs. Version 3.2.1 of the NAS tests is used for the results below. A description of the tests can be found at the end of this article. The tests were run on the Limulus personal cluster, which has three dual-core compute nodes and a quad-core head node. Before, we discuss the results, a bit of background may be helpful.
There are two assumptions about OpenMP and MPI that are not always true. The first is that by the nature of threads, OpenMP is always faster than distributed MPI programs (running on multiple servers) for the same number of cores (presumably because of the communication overhead introduced by MPI). The second assumption is that MPI programs must be spread across multiple nodes in order to run effectively. As Table One demonstrates, neither of the assumptions hold true 100% of the time.
In Table One, the tests were run in three different ways. All tests used a parallel "grain size" of four. While this is not large by todays standards, it is large enough to demonstrate the weakness of the above assumptions. The first method, in column one, used MPI with one process on each of four nodes for each of the six NAS tests. Column two shows the results for running the same four MPI processes on single quad-core processor. Finally, column three shows the results for the OpenMP versions running on a single quad core.
Table One: NAS Parallel Benchmarks (V3.2.1) for Open MPI (1.3.2) and OpenMP (GNU C and Fortran 4.3.3). Results are in MOPS, or millions of operations per second; higher is better.
If the first assumption is correct, then the OpenMP shown in column three should show the best results. Clearly, this is not the case for programs IS, LU, and MG. If we consider the second assumption, we see that program IS runs best as an MPI program on one quad-core processor. OpenMP wins in all other cases. Although the reasons for this behavior can vary, the most common issue with multi-core nodes is memory contention. Recall that a distributed MPI program has a single process running on a single node. Thus, each process has exclusive access to its local memory -- no contention. Also be aware that results can vary due to compiler and hardware differences.
Hybrid Approaches
One recently popular approach to multi-core HPC is to use both MPI and OpenMP in the same program. For example, a traditional MPI program running on two 8-core servers (each server contains two quad-core processors) is shown in Figure Five, below. There are a total of 16 independent processes for the MPI job.
Figure Five: 16-way MPI program execution on two nodes (16 MPI processes)
If one were to use both MPI and OpenMP, then the strategy in Figure Six would be the best way to create a hybrid program. As shown in the figure, each node runs just one MPI process, which then spawns eight OpenMP threads. The clusters that run the Top500 benchmark (HPL) use this type of approach, but the threading is done at a low level in the BLAS library.
Figure Six: 16-way MPI-OpenMP program execution on two nodes (2 MPI processes, each with 8 OpenMP threads)
Of course the question, "Will a hybrid approach work for me?" is important to consider. Fortunately, trying some simple things is not that hard. Indeed, one way to approach the problem is to use MPI for the "outer loops" and use OpenMP for the "inner loops." If your program is already written in MPI, then all you need to do is check to see if you have parallel loops that are running as part of each MPI process. Be aware however, that improving the performance of the parallel portions of your MPI code may reduce scalability as the communication to computation ratio changes. Also, remember that once your MPI processes are threaded (using OpenMP directives), you only need one MPI task per node otherwise your nodes will be seriously oversubscribed (i.e. starting 8 MPI tasks that each spawn eight threads will overload an 8-core node).
If you want to explore hybrid OpenMP/MPI programs on your cluster, you can download the HOMP benchmark. According to the web page, "HOMB is a simple benchmark based on a parallel iterative Laplace solver aimed at comparing the performance of MPI, OpenMP, and hybrid codes on SMP and multi-core based machines."
The results for various combinations of MPI processes and OpenMP threads are shown in Table Two. The non-parallel time was 2.25 seconds. The program was then run using four and eight MPI processes with no extra OpenMP threads. The speed-ups were quite good and what is most interesting is that the performance went down when adding the OpenMP threads. Indeed, running the code with only OpenMP threads produced a time that was longer than the sequential time. This indicates that memory contention is probably an issue on this hardware platform. If this code were moved to different hardware the results would most likey improve. Again, testing assumptions proved to be valuable in this case.
Table Two: HOMP Benchmarks (V3.2.1) for Open MPI (1.3.2) and OpenMP (GNU C and Fortran 4.3.3). Results are in seconds; lower is better. Program parameters were NR=14000, NC=14000, 20 Iterations.
Summary
A few things should be clear from the data obtained above. First, assumptions can often be wrong and the best way to answer performance questions is to run your code on your hardware. For instance, rewriting an MPI program in OpenMP may not yield a great improvement, or even enough of a performance increase to justify the code changes. Second, the hardware you use may contribute to performance, as memory contention issues vary by architecture. Finally, you may get a performance boost by using OpenMP to create a hybrid version of your MPI program (The OpenMP directives can be easily ignored by the compiler). The lesson in all these efforts is to test assumptions and run code. It is almost impossible to determine how well a program will perform in the complex hardware environments found in clusters today. The good news is that we have plenty of cores!
Sidebar One: Description of the NAS tests used in these tests
LU is a simulated CFD application that uses a symmetric successive overrelaxation (SSOR) method to solve a seven block diagonal system resulting from finite difference discretization of the NavierStokes equations in 3D by splitting into block Lower and Upper triangular systems.
FT contains the computational kernel of a 3D fast Fourier Transform (FFT) based spectral method. FT performs three one-dimensional (1D) FFTs, one for each dimension.
MG uses a Vcycle MultiGrid method to compute the solution of the 3D scalar Poisson equation. The algorithm works continuously on a set of grids that are made between coarse and fine. It tests both short and long distance data movement.
CG uses a Conjugate Gradient method to compute an approximation to the smallest eigenvalue of a large, sparse, unstructured matrix. This kernel tests unstructured grid computations and communications by using a matrix with randomly generated locations of entries.
EP is an Embarrassingly Parallel benchmark. It generates pairs of Gaussian random deviates according to a specific scheme. The goal is to establish the reference point for peak performance of a given platform. EP is almost independent of the interconnect because communication is minimal.
IS is a parallel integer sort algorithm that is very sensitive the interconnect latency.


LinkBack URL
About LinkBacks






Reply With Quote