In a previous article, Accelerating HPC: Installing and Running nVidia CUDA, I presented a low-cost, low-effort introduction to the nVidia CUDA software development kit (version 2.3). nVidia recently released the CUDA 3.0 update, whose primary feature is support for the new Fermi processor. In this article we will discuss the other GP-GPU programming language: OpenCL.

If you have wondered how to use AMD/ATI hardware as an HPC accelerator, you might be surprised to learn that CUDA is not an option. CUDA is an nVidia solution that does not run on other vendors' hardware. Historically, AMD/ATI supported the BrookGPU model and included an enhanced version, Brook+, in its first SDK. While Brook+ allowed for GP-GPU computing, the entire industry realized that some sort of standard was needed. Thus, in June 2008, the Khronos Group and an industry-wide collection of companies launched the OpenCL Working Group in an effort to create a standard for CPU/GPU programming. The Khronos Group is a member-funded consortium focused on the creation of royalty-free open standards. In addition to OpenCL, it also maintains the OpenGL graphics standard (OpenGL is different from OpenCL).
Largely due to the work of Apple Computer, the OpenCL 1.0 standard was ratified in December of 2008, just six months after the Working Group was formed. The list of participating organizations includes: 3DLABS, Activision Blizzard, AMD, Apple, ARM, Broadcom, Codeplay, Electronic Arts, Ericsson, Freescale, Fujitsu, GE, Graphic Remedy, HI, IBM, Intel, Imagination Technologies, Los Alamos National Laboratory, Motorola, Movidia, Nokia, nVidia, Petapath, QNX, Qualcomm, Samsung, Seaweed, S3, ST Microelectronics, Takumi, Texas Instruments, Toshiba and Vivante.
As a supporter of OpenCL, AMD/ATI recently released the ATI Stream SDK v2.01 for both Linux (RHEL 5.3, Ubuntu 9.10, and openSUSE 11.0) and Windows (XP, Vista, and 7). This software is available at no cost from the AMD/ATI developer website. If you read the list of companies involved with the OpenCL specification, you will notice that nVidia is on it. While nVidia has made a considerable investment in CUDA, it also supports OpenCL on its hardware as a GP-GPU programming language.
One important feature of OpenCL that is not present in CUDA is the ability to use both the GP-GPU and the local processors (CPUs) for computation. OpenCL supports both a data-parallel programming model, which maps well onto GP-GPUs, and a task-parallel model, which maps well onto multiple CPU cores. The two models reflect the hardware differences between GPUs and CPUs, and they make it possible to write programs that use both. Indeed, it is possible to write programs that will run on the local CPU if no GP-GPU is found. The program may run slower, but it will still run. This feature is demonstrated below.
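Like CUDA, OpenCL is based on C: kernels are written in OpenCL C, a language derived from C99, and C++ bindings are available for the host API. AMD/ATI provides guides and documentation that can help get you started with OpenCL. As with CUDA, OpenCL can be incrementally added to existing C/C++ programs. Be aware, however, that OpenCL will increase the line count of any existing program because more housekeeping is required: the host code must discover platforms and devices, create a context and command queue, compile kernels at run time, and manage memory buffers explicitly.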
Hardware Setup
As with the CUDA software, AMD/ATI OpenCL will work with older hardware. For testing purposes this is a big advantage because there is no need to buy an expensive video card. You can even experiment with OpenCL programming without a supported GP-GPU, because programs can also run on the host processor. The AMD/ATI website lists the cards that are supported by the SDK.
For this article, I am using a low-cost Gigabyte GV-R455D3-512 (ATI 4550 Radeon). A brief set of specifications is provided in Table One, below.
| Specification | Value |
|---|---|
| Stream Processors | 80 |
| Stream Processing Cores | 16 |
| Core Clock (MHz) | 600 |
| Memory Clock (MHz) | 1600 |
| Memory Amount | 512MB DDR3 |
| Memory Interface | 64-bit |
Table One: ATI 4550 Radeon specifications
As you can see, the specifications are a bit lacking by today's standards. In particular, the 4550 has only 80 stream processors and supports only single-precision floating-point operations. While this may be enough for some production work, the card's real value here is as a development platform: the scalable nature of OpenCL allows for development on low-cost hardware and then scaling to larger numbers of stream processors.
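To put those numbers in perspective, the card's theoretical peaks can be worked out from Table One. This is a back-of-the-envelope sketch, assuming the 1600 MHz memory clock is the effective DDR3 data rate and that each stream processor can complete one single-precision multiply-add (two floating-point operations) per clock:

```latex
\begin{align*}
B_{\text{peak}} &= \frac{1600\times10^{6}\,\text{transfers/s}\times 64\,\text{bits}}{8\,\text{bits/byte}} = 12.8\,\text{GB/s}\\
F_{\text{peak}} &= 80\,\text{stream processors}\times 2\,\tfrac{\text{flops}}{\text{cycle}}\times 600\times10^{6}\,\text{cycles/s} = 96\,\text{GFLOPS (single precision)}\\
\frac{F_{\text{peak}}}{B_{\text{peak}}} &= \frac{96\times10^{9}\,\text{flops/s}}{12.8\times10^{9}\,\text{bytes/s}} = 7.5\,\text{flops/byte}
\end{align*}
```

Real applications rarely reach either peak, but the ratio is a useful rule of thumb: under these assumptions a kernel must perform roughly 7.5 floating-point operations per byte of memory traffic before arithmetic, rather than memory bandwidth, becomes the limit.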
My host system is an Intel Core2 Duo Model 6400 (2.1 GHz) with 2 GB of memory and a Gigabyte GA 945GM motherboard. Again, nothing spectacular by today's hardware standards.
Software Setup
In the previous article, I used Fedora 10, as recommended by nVidia for CUDA. To minimize re-installation, I decided to use Fedora 10 for the AMD/ATI SDK as well, even though it is not "officially" supported. I did encounter some issues, which turned out to have nothing to do with Fedora 10, and I managed to resolve them. I am pleased to report that the SDK installs and works properly. Please be aware that RHEL 5.3, Ubuntu 9.10, and openSUSE 11.0 are the officially supported Linux versions, and if you contact AMD/ATI with issues they may ask you to use one of these versions before they help resolve problems.
Downloads and System Preparation
Before you install the OpenCL SDK, ensure that you are running the latest video driver. An older driver may not work with the newer SDK. You can find the driver for your video card at the AMD/ATI Driver page. I downloaded and installed the driver according to the installation instructions found on the driver page. The file names are as follows:
- Current Video Driver - ati-driver-installer-10-3-x86.x86_64.run
- Linux Video Driver Installation Docs - catalyst_102_linux.pdf
The next step is to download the Stream SDK software. You can find the software on the Stream SDK page (toward the bottom). (Note: if you are not using Fedora, select the package for your supported Linux distribution.) You will also need to select the 32-bit or 64-bit version, depending on your host system. I downloaded the "lnx64" (Generic Linux 64-bit) version and not the "rhel64" (Red Hat) version. The file name is:
- ati-stream-sdk-v2.01-lnx64.tgz
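Picking the wrong architecture is an easy mistake. The small helper below maps the output of `uname -m` to a package name. It is only a sketch: the function name is hypothetical, and the 32-bit name "lnx32" is an assumption, since only the lnx64 file name appears above.

```shell
# Map a `uname -m` architecture string to the Stream SDK package name.
# ASSUMPTION: the 32-bit generic package is named "lnx32"; only the
# "lnx64" file name is confirmed in the text.
sdk_package_for_arch() {
    case "$1" in
        x86_64|amd64) echo "ati-stream-sdk-v2.01-lnx64.tgz" ;;
        i?86)         echo "ati-stream-sdk-v2.01-lnx32.tgz" ;;
        *)            echo "unsupported architecture: $1" >&2
                      return 1 ;;
    esac
}

# Example: print the package to download for this machine.
#   sdk_package_for_arch "$(uname -m)"
```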
I also found it useful to download some additional documentation. The SDK contains some documentation, but the ATI Stream SDK v2.01 Documentation page had other notes that were valuable.
- Stream SDK FAQ - ATI_Stream_SDK_FAQ.pdf (part of SDK)
- Stream SDK Getting Started Guide - ATI_Stream_SDK_Getting_Started_Guide_v2.01.pdf (part of SDK)
- Stream SDK Release Notes Samples - ATI_Stream_SDK_Release_Notes_Samples.pdf (from Documentation page)
- Stream SDK Installation Notes - ATI_Stream_SDK_Installation_Notes.pdf (from Documentation page)
Installing and Building the SDK
As covered in the Stream SDK Installation Notes, the install is relatively simple, but it is not conducive to shared use by multiple users. In the examples below everything was run as root; compiling and running the examples as a regular user requires modifications to the "make" process. I installed the SDK under /usr/local as follows:
# cd /usr/local
# tar xvzf ati-stream-sdk-v2.01-lnx64.tgz
This creates a directory called ati-stream-sdk-v2.01-lnx64 containing all the SDK software. To save typing, I created a symbolic link to it named ati-opencl:
# ln -s ati-stream-sdk-v2.01-lnx64/ ati-opencl
Next, the instructions ask that you set the following environment variables to point to the installation (the SDK root and its samples directory):
# export ATISTREAMSDKROOT=/usr/local/ati-opencl/
# export ATISTREAMSDKSAMPLESROOT=/usr/local/ati-opencl/samples
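The export commands above last only for the current shell. One way to make them permanent, sketched below under the assumption that your distribution sources /etc/profile.d/*.sh at login (Fedora does), is to write them to a profile snippet. The function name write_sdk_profile and the file name ati-stream.sh are hypothetical.

```shell
# Write the two SDK variables to a profile snippet so every new login
# shell picks them up.
#   $1: SDK root, e.g. /usr/local/ati-opencl
#   $2: snippet to create, e.g. /etc/profile.d/ati-stream.sh (hypothetical name)
write_sdk_profile() {
    sdk_root=$1
    profile=$2
    # Unquoted EOF: $sdk_root is expanded now, when the file is written.
    cat > "$profile" <<EOF
export ATISTREAMSDKROOT=$sdk_root
export ATISTREAMSDKSAMPLESROOT=$sdk_root/samples
EOF
}

# Example (as root):
#   write_sdk_profile /usr/local/ati-opencl /etc/profile.d/ati-stream.sh
```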
The next step suggests setting LD_LIBRARY_PATH to point to the OpenCL libraries. Instead of this method, I chose to edit /etc/ld.so.conf and add these lines:
/usr/local/ati-opencl/lib/x86_64/
/usr/local/ati-opencl/lib/x86
and then run ldconfig to update the shared library cache. The final step is to register the OpenCL installation. To do this step, create the following directory:
# mkdir -p /usr/lib/OpenCL/vendors
Then move to the directory and make the following soft links:
# cd /usr/lib/OpenCL/vendors
# ln -s /usr/local/ati-opencl/lib/x86_64/libatiocl64.so libatiocl64.so
# ln -s /usr/local/ati-opencl/lib/x86/libatiocl32.so libatiocl32.so
This places links to the vendor libraries in the registration directory, and the installation is complete. All that remains is to move to the root SDK directory and run make:
# cd /usr/local/ati-opencl
# make
The make should complete without problems and build all the examples.
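Before moving on, it can be worth confirming that the library and the vendor link created above actually resolve. The hypothetical check_sdk_install function below sketches such a check for the 64-bit library only, taking the paths used in this article as arguments:

```shell
# Post-install sanity check for the layout described above.
#   $1: SDK root, e.g. /usr/local/ati-opencl
#   $2: vendor registration directory, e.g. /usr/lib/OpenCL/vendors
# Only the 64-bit library is checked; add the x86 paths for 32-bit use.
check_sdk_install() {
    sdk_root=$1
    vendors=$2
    status=0
    for f in "$sdk_root/lib/x86_64/libatiocl64.so" "$vendors/libatiocl64.so"; do
        # -e follows symbolic links, so a dangling link counts as missing.
        if [ -e "$f" ]; then
            echo "ok:      $f"
        else
            echo "MISSING: $f"
            status=1
        fi
    done
    return $status
}

# Example:
#   check_sdk_install /usr/local/ati-opencl /usr/lib/OpenCL/vendors
```

Because `-e` follows links, a vendor link whose target was moved or deleted is reported as missing rather than silently passing.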
A Surprise Problem and Resolution
Once the video driver was working on the test machine, I decided to work remotely from my normal desktop machine. Many *NIX users work this way, because X Windows and remote logins via ssh provide flexible, networked access to almost any machine. When I attempted to run the OpenCL example programs, however, I was surprised to learn that the video card was not found. The programs would run on the local CPU, but there was no trace of the video card. A bit of Googling turned up no resolution, and I was starting to assume that the combination of Fedora 10, the video driver, and the SDK was not working.
As a final test, I ran the example program directly on the test machine, and at last the video card was recognized. Evidently, when you log in remotely, the video environment of the machine you are logging in from is checked, and if no compatible video card is found there, the OpenCL program uses only the CPU resources. This situation is bound to cause further problems as more Linux users experiment with the Stream SDK. AMD/ATI has prepared a document entitled "KB19 - Running ATI Stream Applications Remotely" that should help solve the problem, but the issue should be clearly marked in the usage notes.
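A quick way to tell whether a shell is in the situation described above is to look at DISPLAY. The sketch below is not taken from KB19; it assumes the test machine runs an X server on display :0 that accepts your connection (for example after running `xhost +local:` on its console). Both function names are hypothetical.

```shell
# Succeed when DISPLAY names an X server on this machine (":0", ":0.0",
# "unix:0"). After "ssh -X", DISPLAY is typically "localhost:10.0",
# which is forwarded back to the desktop you logged in from.
display_is_local() {
    case "$DISPLAY" in
        :*|unix:*) return 0 ;;
        *)         return 1 ;;
    esac
}

# ASSUMPTION: the test machine's own X server is on :0 and accepts
# local connections. Point DISPLAY there if it is not already local.
use_console_display() {
    if ! display_is_local; then
        export DISPLAY=:0
    fi
}

# Example, inside an ssh session on the test machine:
#   use_console_display
#   ./CLInfo
```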
Running Examples
Before we can run any examples, it is best to check that the SDK and video driver are working correctly. The easiest way to do this is to run the FindNumDevices program in the cal/bin directory. CAL (Compute Abstraction Layer) is a low-level layer that can talk to the video card. If your video card is present and working, you should see something similar to the results below:
# cd /usr/local/ati-opencl/samples/cal/bin/x86_64
# ./FindNumDevices
Supported CAL Runtime Version: 1.3.185
Found CAL Runtime Version: 1.4.556
Use -? for help
CAL initialized.
Finding out number of devices :-
Device Count = 1
CAL shutdown successful.
Press enter to exit...
If the device count is "0," then your video card is either not supported or there is a problem with your installation. Make sure you are using the machine directly and are not logged in remotely.
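When checking several machines, it can help to script this test. The sketch below is a hypothetical helper that relies only on the "Device Count = N" line shown in the output above:

```shell
# Read FindNumDevices output on stdin and print N from the
# "Device Count = N" line; print 0 if that line never appears.
cal_device_count() {
    awk -F'=' '/Device Count/ { gsub(/[^0-9]/, "", $2); n = $2 }
               END { print (n == "" ? 0 : n) }'
}

# FindNumDevices ends with "Press enter to exit...", so feed it an
# empty line when running it non-interactively:
#   count=$(echo | ./FindNumDevices | cal_device_count)
#   [ "$count" -gt 0 ] || echo "No GPU found: check the driver or remote login"
```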
If FindNumDevices worked, then it is time to try some OpenCL programs. The first program to run is CLInfo. To run it, move to the samples/opencl/bin directory and enter the following:
# cd /usr/local/ati-opencl/samples/opencl/bin/x86_64
# ./CLInfo
A lot of output will be generated. The output for my setup is shown in the Sidebar at the end of this article. Notice that CLInfo finds two devices, the CPU (CL_DEVICE_TYPE_CPU) and the GPU (CL_DEVICE_TYPE_GPU). If you only see one device, the CPU, then there is a problem finding your video card.
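The CPU fall-back idea described earlier can also be applied at the command line. This sketch scans CLInfo output for the CL_DEVICE_TYPE_GPU string and prints gpu or cpu, which can then be passed to a sample's --device option. It is a shell-level illustration, not the OpenCL API's own device selection, and the function name is hypothetical.

```shell
# Read CLInfo output on stdin; print "gpu" if a GPU device was reported,
# otherwise "cpu" (the host processor is always available as a fall-back).
pick_cl_device() {
    if grep -q 'CL_DEVICE_TYPE_GPU'; then
        echo gpu
    else
        echo cpu
    fi
}

# Example:
#   dev=$(./CLInfo | pick_cl_device)
#   ./Mandelbrot -t -q --device "$dev" -x 2000
```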
Finally, we come to the classic Mandelbrot application. Entering ./Mandelbrot should create the following image on your screen.
Figure One: Example Mandelbrot Image
Some interesting options can be used to further explore this program. Every example accepts the --device option, which selects either the CPU or the GPU; the GPU is the default. Thus, the Mandelbrot program can be run on the CPU as follows (the -t option reports the run time, the -q option suppresses drawing the picture, and -x sets the size of the problem):
# ./Mandelbrot -t -q --device cpu -x 2000
Executing kernel for 1 iterations
-------------------------------------------
Width Height Time(sec)
2048 2048 15.008
In contrast, the same program can be run on the GPU by changing the --device option to gpu:
# ./Mandelbrot -t -q --device gpu -x 2000
Executing kernel for 1 iterations
-------------------------------------------
Width Height Time(sec)
2048 2048 2.638
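Notice the time difference between the two runs: the GPU finished in 2.638 seconds versus 15.008 seconds on the CPU, roughly 5.7 times faster. (Note also that the program reports a 2048 x 2048 image even though -x 2000 was requested; it appears to round the size up.) You can play with the other sample programs as well. In addition to the standard Mandelbrot program, the requisite NBody program can be run for your enjoyment.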
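To repeat the comparison on other hardware, the timing can be scripted. This sketch assumes the output layout shown above (a Time(sec) header followed by a line whose third column is the time); the function names are hypothetical.

```shell
# Print the run time from Mandelbrot's "-t" output: the third column
# of the line that follows the "Time(sec)" header.
mandel_time() {
    awk 'index($0, "Time(sec)") { getline; print $3; exit }'
}

# Ratio of a slow run time to a fast one, rounded to one decimal place.
speedup() {
    awk -v slow="$1" -v fast="$2" 'BEGIN { printf "%.1f\n", slow / fast }'
}

# Example:
#   cpu=$(./Mandelbrot -t -q --device cpu -x 2000 | mandel_time)
#   gpu=$(./Mandelbrot -t -q --device gpu -x 2000 | mandel_time)
#   echo "GPU speedup: $(speedup "$cpu" "$gpu")x"
```

With the times reported above, `speedup 15.008 2.638` prints 5.7.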
OpenCL and HPC Clusters
Similar to CUDA, OpenCL is designed for a single SMP environment (i.e., a single server). OpenCL does not support MPI-type communications (i.e., between servers), so using multiple servers in a cluster requires combining OpenCL with a message-passing layer such as MPI. In terms of Fortran, it is probably a safe bet that the Portland Group is working on an OpenCL Fortran variant similar to its CUDA Fortran. When this happens, many users may forgo writing OpenCL directly and work entirely within a Fortran context. As core counts continue to increase (both GPU and CPU), tools like OpenCL become an important part of the solution to the ever-growing mixed bag of data parallelism and task parallelism in HPC.
Sidebar One: Example Output of ./CLInfo


