Parallel programming is a vast topic. A discussion of the underlying concepts could encompass several articles! In this presentation, however, we'll take a more pragmatic approach. First, the concept of data parallel computing will be introduced, after which some of the current software solutions will be presented. Lastly, we'll present an analysis of the programming methods with respect to cluster computing. Let's get started with some background.
Parallel Computing
There are many ways to compute in parallel. In 1972, Michael J. Flynn proposed the following classification:
* SISD - Single Instruction Single Data stream is a sequential computer that exploits no parallelism in either the instruction or the data streams. Examples of an SISD machine are the traditional PC or a single core on a modern multi-core processor.
* SIMD - Single Instruction Multiple Data streams is where a single instruction is executed against multiple data streams. An example is dividing all the values in a 2D array by the same number. This type of parallelism is called Data Parallel and is what is found on GP-GPUs and array processors. Data parallel processing is a very effective technique for high performance video cards. Data parallel machines often have some form of "shared memory" that is easily accessible by all of the processors.
* MIMD - Multiple Instruction Multiple Data streams is where multiple instructions are executed against different data streams. An example is dividing all the values in several 2D arrays by different numbers at the same time. Another is running two or more separate subroutines at the same time, which is often called task parallelism. A cluster is a good example of an MIMD architecture because each node is independent and has its own memory (distributed memory). Note that an MIMD machine can easily run SIMD programs, but an SIMD machine needs some help running MIMD codes. (See the MOG project below for an exception to this rule.)
* MISD - Multiple Instruction Single Data stream is often included for completeness, but it has no real practical use in HPC. In this mode, a single data stream is given to independent processing units. It is, however, an effective way to add processing redundancy to a single data stream.
An important point to remember about Flynn's classification scheme is that MIMD is the most flexible and can execute any of the other modes. For this reason, it is the most commonly used parallel architecture. Many HPC applications are data parallel and as such can be accelerated with GP-GPUs. Because the GP-GPU architecture is designed specifically for data parallel processing, it often executes data parallel operations much faster than a standard MIMD architecture, such as a multi-core processor.
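The two array examples from the classification can be sketched in a few lines of Python. This is only an illustration: Python threads stand in for hardware processors, and the function names `simd_divide` and `mimd_divide` are invented for this sketch.

```python
from concurrent.futures import ThreadPoolExecutor


def simd_divide(grid, divisor):
    """SIMD style: one instruction (divide by divisor) applied to every element."""
    return [[value / divisor for value in row] for row in grid]


def mimd_divide(grids, divisors):
    """MIMD style: independent tasks, each with its own data and its own divisor."""
    with ThreadPoolExecutor(max_workers=len(grids)) as pool:
        # Each submitted task runs independently of the others.
        futures = [pool.submit(simd_divide, g, d) for g, d in zip(grids, divisors)]
        return [f.result() for f in futures]


if __name__ == "__main__":
    a = [[2.0, 4.0], [6.0, 8.0]]
    b = [[9.0, 3.0], [12.0, 15.0]]
    print(simd_divide(a, 2.0))
    print(mimd_divide([a, b], [2.0, 3.0]))
```

Note that each MIMD task here is itself a small SIMD operation, which mirrors the point that an MIMD machine can run SIMD programs.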
SIMD parallelism trades flexibility for speed. For many applications this can be a bargain. Speed-ups of over 100x with GP-GPU processors have been reported, and it is not uncommon to realize a 10x speed-up with minimal programming work. Examples of applications can be found on the NVidia website.
As a summary, the following table illustrates the relationship between clusters, multi-core, and GP-GPUs. The table is based on the ability of each architecture to run SIMD or MIMD programs.
Table One: Usage modes of clusters, multi-core, and GP-GPUs. * signifies the ability of SIMD to simulate MIMD execution. This topic is currently under investigation; see MOG below.
The GP-GPU gets its speed from the ability to execute hundreds, even thousands, of lightweight threads in parallel. The management of thread execution is normally not part of the user's responsibility, however. For instance, a data parallel array operation is often specified by the following:
Code:
int gid = get_global_id(0);     /* index of this thread within the whole array */
array[gid] = array[gid] / b;    /* each thread divides exactly one element */
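To see what the runtime does on the user's behalf, the following Python sketch emulates the launch of such a kernel. The `launch` helper is a hypothetical stand-in for the GP-GPU runtime, and it runs the threads one after another rather than in parallel.

```python
def divide_kernel(gid, array, b):
    """Body of one lightweight thread: it touches only element gid."""
    array[gid] = array[gid] / b


def launch(kernel, global_size, *args):
    """Hypothetical stand-in for the runtime: run the kernel once per global id.

    A real GP-GPU would run these invocations in parallel; this loop is serial.
    """
    for gid in range(global_size):
        kernel(gid, *args)


if __name__ == "__main__":
    data = [10.0, 20.0, 30.0, 40.0]
    launch(divide_kernel, len(data), data, 10.0)
    print(data)
```

The programmer writes only the kernel body; how many threads run at once, and in what order, is left to the runtime.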
For more information on parallel computing, Lawrence Livermore Labs offers a Tutorial that provides good background and examples. In addition, there are plenty of other online references available.
Practical Concerns
Programmers must keep a few practical issues in mind because the GP-GPU does not exist in isolation and its main goal is video processing. It must be resident on a PC host. The host must be able to support the power requirements of the GP-GPU board(s) and provide adequate cooling. Aside from the physical requirements, GP-GPUs have a few other restrictions.
First, all data must be moved to and from the video card memory. This represents a potential bottleneck between the host processor and the GP-GPU. It also means that for large problem sizes, large amounts of memory are required on the GP-GPU card. In addition, standard video cards are designed for graphics. In this environment, small errors may not be noticeable because the entire screen is refreshed in 1/60th of a second. For this reason, both NVidia and AMD/ATI offer HPC versions of their GP-GPU cards. Presumably, these boards have been designed for HPC applications and provide better tolerances than those for video applications.
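The effect of the transfer bottleneck can be estimated with a very simple cost model. The sketch below assumes that transfers and computation do not overlap, and every number in it is made up for illustration; it is not a measurement of any real card.

```python
def effective_speedup(cpu_seconds, gpu_compute_seconds, bytes_moved, bus_bytes_per_second):
    """Speed-up over the CPU once host/device transfers are counted.

    All inputs are illustrative; bytes_moved counts traffic in both directions.
    """
    transfer_seconds = bytes_moved / bus_bytes_per_second
    return cpu_seconds / (gpu_compute_seconds + transfer_seconds)


if __name__ == "__main__":
    # Hypothetical case: the kernel alone is 100x faster than the CPU,
    # but 4 GB must cross a bus that moves 4 GB per second.
    print(effective_speedup(10.0, 0.1, 0.0, 4e9))   # kernel only
    print(effective_speedup(10.0, 0.1, 4e9, 4e9))   # kernel plus transfers
```

In this made-up case the one second spent moving data dominates the tenth of a second spent computing, which is why keeping data resident on the card between operations matters.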
Second, the amount of double precision support in GP-GPU chips can vary from none to partial. Normal video processing works quite well with single precision floating point. In many HPC applications, double precision is a critical part of the algorithm and an important requirement for scientific accuracy. Both NVidia and AMD/ATI have included double precision in their latest versions, but only in a limited capacity. That is, not every SIMD stream processor has its own double precision unit. The double precision unit must be used in a shared capacity and thus reduces the overall amount of double precision that can be done in parallel.
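A small experiment shows why double precision matters for long computations. The sketch below adds 0.1 to a running total many times, once with every intermediate result rounded to single precision and once in double precision. The single precision rounding is emulated with Python's `struct` module, so it says nothing about the speed of either format.

```python
import struct


def to_float32(x):
    """Round a Python float (double precision) to the nearest single precision value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def accumulate(step, count, single_precision):
    """Add step to a running total count times, optionally rounding to float32 each time."""
    total = 0.0
    if single_precision:
        step = to_float32(step)
    for _ in range(count):
        total += step
        if single_precision:
            total = to_float32(total)
    return total


if __name__ == "__main__":
    n = 100_000
    exact = n * 0.1
    err32 = abs(accumulate(0.1, n, True) - exact)
    err64 = abs(accumulate(0.1, n, False) - exact)
    print(f"single precision error: {err32:.3e}")
    print(f"double precision error: {err64:.3e}")
```

The single precision total drifts visibly away from the expected value, while the double precision error remains many orders of magnitude smaller.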
Third, GP-GPUs work best when there are hundreds and even thousands of threads, so applications that have large amounts of data parallelism perform better. Because the threads are lightweight, the thread code does not support recursion, dynamic memory allocation, or any kind of stack or heap. For many codes this does not seem to be a limitation, since many data parallel operations are array based.
Finally, while GP-GPUs represent a low-cost and fast co-processor for cluster servers, the integration into the cluster programming model (i.e., MPI, the Message Passing Interface) is a challenge. Many of the languages designed for GP-GPUs do not extend beyond a single server and thus do not scale across cluster nodes. This is a programming model limitation and not a hardware issue because any cluster node can be augmented with GP-GPUs.
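One possible way to combine the two models is to let message passing split the data between nodes and let each node hand its block to a local GP-GPU. The sketch below shows only the decomposition step. No MPI library is used; `rank` and `size` simply follow MPI naming, and the helpers `block_range` and `node_work` are hypothetical.

```python
def block_range(n_items, size, rank):
    """Return the [start, stop) slice of n_items owned by node `rank` out of `size` nodes.

    The first (n_items % size) nodes receive one extra item so the load is balanced.
    """
    base, extra = divmod(n_items, size)
    start = rank * base + min(rank, extra)
    stop = start + base + (1 if rank < extra else 0)
    return start, stop


def node_work(data, b, size, rank):
    """What one node would do: divide only its own block (the local GP-GPU's job)."""
    start, stop = block_range(len(data), size, rank)
    return [x / b for x in data[start:stop]]


if __name__ == "__main__":
    data = [float(i) for i in range(10)]
    size = 3  # hypothetical number of cluster nodes
    # Serial stand-in for the gather step a message passing library would perform.
    result = []
    for rank in range(size):
        result.extend(node_work(data, 2.0, size, rank))
    print(result)
```

The data parallel work inside `node_work` never needs to know about other nodes, which is why the limitation lies in the programming model rather than in the hardware.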
The good news for experimenters is that in many cases, standard video cards from both NVidia and AMD/ATI are capable of running GP-GPU programs. And, freely available development software exists as well. While standard video cards will not break any speed records, they do offer a very low cost platform for exploring data parallel programming. If you have a recent NVidia or AMD/ATI video card you may want to check and see if it supports GP-GPU programming. And, don't forget to check your laptop!
Programming Models
An issue facing many programmers is which programming model to use. For HPC users, the three main factors are portability, scalability, and complexity. Experienced programmers may find it useful to look at A Comparison of MPI, OpenMP, and Stream Processing. While a full discussion of the various programming languages and methodologies would constitute another article (or two), the following language descriptions should provide an overview of the current state of GP-GPU programming. Be aware that GP-GPU computing is a rapidly moving market. Indeed, the recent OpenCL (see below) specification took less than six months to become a standard. OpenCL (not to be confused with OpenGL) is a low level specification for GP-GPUs and multi-core CPUs.
OpenCL
OpenCL was initially developed by Apple and is a standard API for GP-GPU and multi-core hardware. It is based on the ANSI C language, but adds some extensions to support parallel operations. A large amount of vendor support exists for the OpenCL specification. As the specification is new, there are not that many OpenCL implementations, but both NVidia and AMD/ATI have pledged support for the standard. The model is powerful and supports both data parallel (GP-GPUs) and task parallel (multi-core) processing, and it was clearly aimed at resolving the multi-core/GP-GPU situation. It does not, however, clearly address using remote nodes as part of the computation.
In OpenCL, all computation resources in a host system are seen as peers. These include CPU cores, GPUs, mobile processors, microcontrollers, and DSPs. OpenCL also has a clearly defined floating point representation (IEEE 754 with specified rounding and error). An important aspect of OpenCL is the ability to choose different resources at run-time. That is, if an OpenCL application is designed correctly, it can probe for available hardware and adjust execution based on the current environment (i.e., run-time binaries can be portable across many different hardware platforms).
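The idea of probing for hardware and adapting at run-time can be sketched as follows. The `Device` record and the selection policy are hypothetical and are not part of OpenCL; in the OpenCL C API the probing itself is done with calls such as `clGetDeviceIDs` and `clGetDeviceInfo`.

```python
from dataclasses import dataclass


@dataclass
class Device:
    """Hypothetical device descriptor; a real runtime would report these properties."""
    name: str
    kind: str            # "gpu" or "cpu"
    compute_units: int
    double_precision: bool


def choose_device(devices, need_double):
    """Pick a device at run-time: prefer a usable GPU, otherwise fall back to a CPU."""
    usable = [d for d in devices if d.double_precision or not need_double]
    if not usable:
        raise RuntimeError("no device satisfies the requirements")
    # GPUs sort ahead of CPUs; ties are broken by the number of compute units.
    return max(usable, key=lambda d: (d.kind == "gpu", d.compute_units))


if __name__ == "__main__":
    found = [
        Device("host cpu", "cpu", 4, True),
        Device("video card", "gpu", 16, False),
    ]
    print(choose_device(found, need_double=False).name)
    print(choose_device(found, need_double=True).name)
```

The same program therefore runs on the video card when single precision is enough and falls back to the host CPU when double precision is required.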
OpenCL supports a memory hierarchy. At the lowest level is private memory that can only be used by a single stream compute unit. Local memory is memory that can be shared by a group of localized thread processors, often called a work group. Constant memory is memory that can be used to store constant data for read-only access by all of the thread processors in the GP-GPU device. Finally, global memory is memory that can be used by all the compute units on the device.
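The roles of the four memory levels can be illustrated with a serial emulation. The sketch below is plain Python rather than OpenCL, and the function name `work_group_sums` is invented; the variable names mark which level each value would live in on a real device.

```python
def work_group_sums(global_data, group_size, scale):
    """Emulate the OpenCL memory levels while summing scaled values per work group.

    global_data : global memory, visible to every work group
    scale       : constant memory, read-only for all work-items
    local       : local memory, shared only inside one work group
    private     : private memory, owned by a single work-item
    """
    results = []
    for group_start in range(0, len(global_data), group_size):
        local = []  # fresh local memory for this work group
        for value in global_data[group_start:group_start + group_size]:
            private = value * scale  # each work-item computes in private memory
            local.append(private)    # then publishes its result to local memory
        results.append(sum(local))   # one result per group goes back to global memory
    return results


if __name__ == "__main__":
    print(work_group_sums([1.0, 2.0, 3.0, 4.0, 5.0], group_size=2, scale=2.0))
```

Keeping intermediate values in private and local memory, and writing only one result per work group back to global memory, is the usual motivation for exposing the hierarchy to the programmer.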
Because of its complexity, OpenCL is considered a low-level interface and not the best choice for novice programmers. Indeed, although many HPC applications are already written in Fortran or C, only the C programs are possible candidates for porting to a GP-GPU. In addition, OpenCL is not designed for "distributed-device" (cluster) applications.
CUDA
When NVidia introduced its GP-GPU chip-sets, it also had the foresight to introduce the CUDA programming model and make it freely available to developers. CUDA, or the Compute Unified Device Architecture, is designed to work on NVidia GP-GPUs. The programming model is higher level than OpenCL, and like OpenCL abstracts away thread management from the user. CUDA is similar to ANSI C, but does not support recursion. Full Fortran and C++ support is coming.
CUDA applications are compiled through a PathScale Open64 C compiler for execution on the GPU. CUDA also supports the computational interfaces of OpenCL and DirectX Compute. Third party wrappers are also available for Python, Fortran, Java and Matlab. One of the nice features of the dynamic CUDA model is that applications can be developed on low cost NVidia cards (starting with the GeForce 8X series) and easily ported to the HPC ready Tesla line of GP-GPU units. That is, the cost of entry is minimal -- a low cost video card and some time. There are numerous success stories for CUDA applications.
BrookGPU
BrookGPU is the Stanford University Graphics group's compiler and run-time implementation of the Brook stream programming language. BrookGPU is still considered "beta" (although it has been around for a while) and supports both NVidia and AMD/ATI hardware under a BSD license. Brook is an extension of standard ANSI C and is designed to incorporate the ideas of data parallel computing and arithmetic intensity. The general computational model, referred to as streaming, provides two main benefits over conventional languages: data parallelism, which allows the programmer to specify how to perform the same operations in parallel on different data; and arithmetic intensity, which encourages programmers to specify operations on data that minimize global communication and maximize localized computation.
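Arithmetic intensity can be made concrete as the number of arithmetic operations performed per word moved through global memory. The sketch below compares two operations under simplified assumptions: each value crosses global memory exactly once and caching is ignored. The function names are invented for this example.

```python
def arithmetic_intensity(operations, words_moved):
    """Arithmetic operations performed per word read from or written to global memory."""
    return operations / words_moved


def elementwise_divide_intensity(n):
    """Divide n values by a constant: n operations, n reads plus n writes."""
    return arithmetic_intensity(n, 2 * n)


def matrix_multiply_intensity(n):
    """n x n matrix product, assuming each matrix crosses global memory exactly once.

    2*n**3 operations (multiply and add), 3*n**2 words (read A and B, write C).
    """
    return arithmetic_intensity(2 * n**3, 3 * n**2)


if __name__ == "__main__":
    print(elementwise_divide_intensity(1000))
    print(matrix_multiply_intensity(300))
```

Under these assumptions the element-wise divide performs half an operation per word no matter how large the array is, while the intensity of the matrix product grows with the matrix size, so the latter is far better placed to keep many stream processors busy.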
Brook+
Brook+ is available from AMD/ATI and is based on Stanford University Graphics group's compiler and run-time implementation of the Brook stream programming language. Similar to both CUDA and OpenCL, Brook+ is based on ANSI C, but like the other GP-GPU languages has no recursion. Brook+ is for AMD/ATI hardware only. As AMD/ATI has pledged support for OpenCL, it may not fully support Brook+ in the future.
PGI Compilers
Portland Group compilers now support automatic use of CUDA primitives. PGI Accelerator compilers allow programmers to add OpenMP-like compiler directives to existing high-level standard-compliant Fortran and C programs. These programs are then compiled with the appropriate options to create GP-GPU assisted codes. This is an attractive solution for existing Fortran and C codes (C++ is not currently supported). For more information, see GPU Programming For The Rest Of Us which provides examples of this method. While this approach is attractive, it is a single vendor solution and thus comes with some risk in terms of portability and long term support. It is hoped that an open standard can evolve from this work. The ability to quickly modify existing applications without learning a new language makes this approach very attractive.
Intel Ct
Currently, Intel Ct is a research language based on C/C++ that is designed to support hundreds to thousands of hardware threads. It will probably be limited to Intel hardware if it is released as a commercial product.
RapidMind
RapidMind is a portable API for multi-core, GP-GPU, and Cell processors. It is a commercial product for C++ and is based on the proprietary RapidMind API. In light of OpenCL, building code on a closed API may not be the best choice for future portability. As it is based on C++, some users may find it integrates well with current projects.
MOG
MOG is currently a research project, but it allows MIMD On GP-GPUs (MOG). This project is worth watching as it may allow MPI programs to address individual stream processors.
What Next?
Although adoption of GP-GPUs has been brisk, it is important to realize we are just in the beginning of the HPC/GP-GPU era. Similar to HPC cluster computing, GP-GPU computing takes advantage of the mass market demand for computer hardware. The challenge facing the cluster user is how best to use a GP-GPU. Issues about programming complexity, portability, and scalability need to be addressed before GP-GPU computing can become fully mainstream. Perhaps the biggest issue for HPC cluster users is how to integrate GP-GPUs into the cluster model. It would seem that the best method may be to treat each node as a potential multi-core/GP-GPU powerhouse and design programs that scale from that perspective. In order to adapt to local conditions, software will need to be more dynamic in nature and allow for flexible scaling. At this point, there is no easy answer.
Finally, if you want to keep your eye on GP-GPU developments check out GP-GPU.org for the latest news and developments in this field. And remember, the output from your next HPC device may be staring you right in the face.

