HPCCommunity.org
 
Multi-Core and GPU Background

Posted June 27th, 2008 at 05:26 PM by Bearcat
Updated September 3rd, 2008 at 04:11 PM by Bearcat
High-performance programs typically do the same thing many times over with different data or different parameters, as in a simulation. Real-world applications, such as a program that monitors a port, gets data, processes it, and returns a result, may need to run on multiple cores as well, but I typically call that type of multi-threading "concurrency": the program is doing different things at the same time. This may not be everyone's definition, but I tend to look at it this way.

I may indulge a little in concurrency here in the future, but for the most part we'll be looking at programs that fall into the high performance category.

There are a number of technologies for high performance programming that are available or are coming out, and they fall into a few different camps. I'm going to be looking into Camps 0, 1 and 2, with a sample program and performance numbers, but first I wanted to provide a little background on the different areas.

Camp 0: (Multi-threading)

The old tried-and-true method of using multiple processors or cores in a machine. You use the threading libraries available on your operating system, and manage the threads and data access yourself. This can be easy or complex, depending on your program.
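To make the "manage it yourself" part concrete, here is a minimal sketch of the do-it-yourself approach. It assumes a C++11 compiler and uses std::thread as a portable stand-in for the operating system's threading library; the names parallel_sum and sum_slice are mine, purely for illustration. Each thread sums its own slice of the data and writes to its own result slot, so no locking is needed.

```cpp
#include <algorithm>
#include <cstddef>
#include <functional>
#include <numeric>
#include <thread>
#include <vector>

// Sum one contiguous slice of the input into its own output slot.
// Each thread writes only to its own slot, so no lock is needed.
void sum_slice(const std::vector<double>& data, std::size_t begin,
               std::size_t end, double& out) {
    out = std::accumulate(data.begin() + begin, data.begin() + end, 0.0);
}

// Split the data into num_threads slices and sum them in parallel.
double parallel_sum(const std::vector<double>& data, unsigned num_threads) {
    if (num_threads == 0) num_threads = 1;
    std::vector<double> partial(num_threads, 0.0);
    std::vector<std::thread> workers;
    // Round up so that the last slice picks up any remainder.
    const std::size_t chunk = (data.size() + num_threads - 1) / num_threads;

    for (unsigned t = 0; t < num_threads; ++t) {
        const std::size_t begin = std::min(data.size(), t * chunk);
        const std::size_t end = std::min(data.size(), begin + chunk);
        workers.emplace_back(sum_slice, std::cref(data), begin, end,
                             std::ref(partial[t]));
    }
    for (auto& w : workers) w.join();  // wait before reading the partial sums

    return std::accumulate(partial.begin(), partial.end(), 0.0);
}
```

Calling parallel_sum(data, std::thread::hardware_concurrency()) would use one thread per reported core. Even in this small case the slicing, the joins, and the per-thread result slots are all the programmer's job, which is the bookkeeping the other camps try to take over.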

Camp 1: (Multi-Core)

Toolkits that allow some parallelism in your program through meta-tagging of code segments. An example is OpenMP, which lets you tag sections of code for parallelism; the tagged code is compiled to the native architecture of the chip you’re using. If I compile for x86, the resulting program runs on x86 and makes use of multiple x86 processors, if available. No special compiler is needed: OpenMP support is available in gcc (enabled with the -fopenmp flag).
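As a rough illustration of what tagging looks like, here is a small OpenMP sketch. The function name dot is my own choice; the pragma is standard OpenMP. Built with g++ -fopenmp, the loop is split across the available cores; built without the flag, the pragma is ignored and the same code runs serially.

```cpp
#include <cstddef>
#include <vector>

// Dot product of two vectors (uses the shorter length if they differ).
double dot(const std::vector<double>& a, const std::vector<double>& b) {
    const long n = static_cast<long>(a.size() < b.size() ? a.size() : b.size());
    double sum = 0.0;

    // The pragma is the "tag": iterations are divided among threads, and
    // reduction(+ : sum) gives each thread a private sum that is combined
    // when the loop finishes.
    #pragma omp parallel for reduction(+ : sum)
    for (long i = 0; i < n; ++i) {
        const std::size_t k = static_cast<std::size_t>(i);
        sum += a[k] * b[k];
    }
    return sum;
}
```

Compared with Camp 0, there is no thread creation, slicing, or joining in the source; the compiler and the OpenMP runtime handle that part.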

NOTE: Camps 1 and 2 don’t work together automatically; you have to combine the techniques yourself.

Camp 2: (GPU)

The GPU accelerator camp includes the previous generation of gpgpu programming, and now CUDA and ATI’s Stream SDK for their new FireStream processor cards. These are development kits that allow you to use the GPU as a floating-point co-processing unit. While each of these gets easier to use as new versions come out, they usually require a special compiler, which compiles the code to native GPU instructions; in the case of gpgpu programming, the program instead emits the specific shader language instructions to the GPU. Once the code is compiled, if it is run on a machine that doesn’t have the GPU, your program doesn’t work (go figure). Fast, but not general in nature. E.g., the CUDA SDK comes with an nVidia compiler and only supports 8000 series cards and up, and only if you have the correct drivers.

Camp 3: (either Multi-Core or GPU)

Companies such as RapidMind have a model that tags code, and the tagged code is emitted as program text. This text is compiled at runtime for whichever backend is required. RapidMind has back-ends for x86, ppc, and glsl (the shader language used by Camp 2 gpgpu). At runtime a backend is selected and the code text is compiled to native instructions for the target processor. Only one target backend is used at a time, either x86 or GPU; they are not mixed together (at least in the current versions). This allows the program to be compiled once and run on any machine that has the RapidMind runtime. It will use the ATI or nVidia GPU if available, or just multi-thread the program segments on x86 if multiple cores are available. The benefit is application portability to different machines with different cpu and graphics cores. The RapidMind runtime manages access to bound and unbound variables to eliminate the locking required in general multi-threaded programming.
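The select-one-backend-at-runtime idea can be sketched in a few lines. To be clear, this is not the RapidMind API: every name here (Backend, CompiledKernel, select_backend, compile_scale) is hypothetical, and the "GPU" path is a plain CPU stand-in, since a real runtime would generate shader code at that point. The sketch only shows the shape of the model: the program describes the work once, and the runtime picks exactly one backend for it.

```cpp
#include <functional>
#include <string>
#include <vector>

// Hypothetical backend identifiers; not taken from any real runtime.
enum class Backend { X86, Gpu };

// What a runtime might hand back after compiling a tagged segment.
struct CompiledKernel {
    Backend backend;
    std::string name;
    std::function<void(std::vector<float>&)> run;
};

// Pick exactly one backend: the GPU if present, otherwise x86.
Backend select_backend(bool gpu_available) {
    return gpu_available ? Backend::Gpu : Backend::X86;
}

// "Compile" a scale-by-factor segment for the chosen backend.
// Assumption: both paths share one CPU implementation here; a real
// runtime would emit different native code for each backend.
CompiledKernel compile_scale(Backend backend, float factor) {
    auto body = [factor](std::vector<float>& v) {
        for (float& x : v) x *= factor;
    };
    if (backend == Backend::Gpu) {
        return {Backend::Gpu, "gpu (stand-in)", body};
    }
    return {Backend::X86, "x86", body};
}
```

The calling code would be the same on every machine, along the lines of compile_scale(select_backend(gpu_available), 2.0f).run(data), which is the portability benefit described above.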

Camp 4: (both Multi-Core and GPU) (Near Future)

Apple seems to be taking an all-of-the-above approach to this. They have been a heavy participant in the LLVM compiler project, and have a number of Apple engineers working on it. The LLVM compiler project takes your code and generates intermediate code (think Java bytecode or the .NET CLR). This intermediate code can then be turned into native code by one of the LLVM code generation back-ends.

Apple has used this technology successfully in Mac OS X Leopard 10.5, in the new implementation of their OpenGL (note the GL). They compile the OpenGL libraries to intermediate code, which gets installed on your system. Then, when OpenGL is used, the Just In Time backend for the graphics card in your system compiles the intermediate code to native code for your machine, and the result is placed in caches on your hard drive for later execution.

This appears to be the same technology that they will use in OpenCL (Open Computing Language, using the GPU for calculations), announced for Mac OS X Snow Leopard 10.6, to be delivered next year. Sections of code destined for parallelism and the GPU will be compiled by LLVM to intermediate code, and then Just In Time compiled by the backend to native GPU instructions for execution.

Additionally, the announcement of Grand Central Dispatch, which tags program segments for parallelism and runs them on multiple cores, seems similar to OpenMP. Apple states in their announcement that you will be able to use GPUs and CPUs at the same time in your parallel segments. This implies that the same LLVM is used, with a different backend to Just In Time compile for x86 native execution.

This camp will allow JIT compilation for multiple backends, with their execution controlled by the Grand Central Dispatch environment. This is an additional step up from Camp 3.

This is impressive for a couple of reasons. First, it implies that they package up execution units and dispatch them to different processor cores or GPUs as needed. Second, they are baking this into their own developer tools and the operating system, so the required runtimes will always be available to these types of compiled applications. Let's hope this lives up to the hype I've just given it.

Camp X: (Cluster) (Far Future)

How do we get program segments running in a cluster? A problem someone will solve, I'm sure.

Leo

Total Comments 2

Comments

Hi Bearcat,

Nice summary. For visualization I have been working in almost all camps you described. The current framework I am working on (Equalizer: Parallel Rendering) supports Multi-GPU machines and clusters. Some users have used it with RMDP for computation already.
Posted July 2nd, 2008 at 10:15 AM by eilemann

Hi Stefan,

Equalizer looks like a very nice framework. I'll put it on my list of things to look at, when I'm done with the others on my list, especially considering it can be used for computation.

cheers, Leo
Posted August 11th, 2008 at 01:21 PM by Bearcat

All times are GMT.