<?xml version="1.0" encoding="ISO-8859-1"?>

<rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/">
	<channel>
		<title>HPC Community - High Performance Computing (HPC) Community - Blogs - Bearcat</title>
		<link>http://www.hpccommunity.org/blogs/bearcat/</link>
		<description><![CDATA[HPCCommunity.org is a technical discussion HPC community portal for the High Performance Computing (HPC) community. The community includes Platform Computing R&D team members, architects and developers, external collaborators and a growing community of users and developers in the HPC world.]]></description>
		<language>en</language>
		<lastBuildDate>Fri, 10 Feb 2012 05:00:10 GMT</lastBuildDate>
		<generator>vBulletin</generator>
		<ttl>60</ttl>
		<image>
			<url>http://www.hpccommunity.org/images/misc/rss.jpg</url>
			<title>HPC Community - High Performance Computing (HPC) Community - Blogs - Bearcat</title>
			<link>http://www.hpccommunity.org/blogs/bearcat/</link>
		</image>
		<item>
			<title>Exploring HPC Programming: OpenMP</title>
			<link>http://www.hpccommunity.org/blogs/bearcat/exploring-hpc-programming-openmp-98/</link>
			<pubDate>Thu, 20 Nov 2008 12:55:54 GMT</pubDate>
			<description><![CDATA[Today, we'll have a quick look at OpenMP. OpenMP is a set of programming APIs, and compiler pragmas that support multi-platform, shared memory...]]></description>
			<content:encoded><![CDATA[<blockquote class="blogcontent restore">Today, we'll have a quick look at OpenMP. OpenMP is a set of programming APIs and compiler pragmas that support multi-platform, shared memory multiprocessing programming in C/C++ and Fortran. The interesting thing about OpenMP is that it is a very nice, simple way to split loops (for, do) into tasks for multi-threading. Our program has a number of &quot;for&quot; loops in it, so that is the obvious avenue to explore.<br />
<br />
Anyone can get a quick overview of OpenMP from Wikipedia here: <a href="http://en.wikipedia.org/wiki/OpenMP" target="_self">OpenMP - Wikipedia, the free encyclopedia</a> and the page contains a number of other references for additional information. There is also the OpenMP site: <a href="http://openmp.org/wp/" target="_self">OpenMP.org</a> for information. Read up, it's an interesting topic.<br />
<br />
First off, I'll profess that I'm not an OpenMP expert. I've learned enough by reading, trying, and experimenting that I can use OpenMP in its basic form. To use OpenMP, you need a compiler that supports it. GCC 4.2 and up support OpenMP. Red Hat has also backported OpenMP into the compiler supplied with Red Hat Enterprise Linux 5.2, which is the same compiler found in CentOS 5.2. To check, look for the omp.h include file and the libgomp library.<br />
<br />
The really nice thing about OpenMP is that it is much less intrusive on your program than converting the program to use threads by hand. <br />
<br />
So let's get going.<br />
<br />
In previous articles, I changed a baseline program to multi-threaded giving a couple of options for attacking the problem. For the OpenMP example, I don't need the multi-threaded example, so I went back to the baseline example as a starting point. The multi-threaded examples executed in just under 5 minutes, so that will be a target. I don't expect to reach the target, but getting close would be nice, and convince me that OpenMP is a viable way of doing things.<br />
<br />
The first thing I did to the baseline example was replace the rand function (because it serializes when called from multiple threads) with the distribution function I created for the multi-threaded example. Now we have a new starting point for OpenMP. <br />
<br />
The first step is to add the include file (who would have guessed), at the beginning of the source file:<br />
<br />
<div class="bbcode_container">
	<div class="bbcode_description">Code:</div>
	<pre class="bbcode_code" style="height:36px;">#include &lt;omp.h&gt;</pre>
</div> That's really all you need to do for preparation. I wanted to ensure that OpenMP would start 4 threads on my Quad Core machine, so I added the following line in the &quot;main&quot; function of the program.<br />
<br />
<div class="bbcode_container">
	<div class="bbcode_description">Code:</div>
	<pre class="bbcode_code" style="height:36px;">omp_set_num_threads(4);</pre>
</div> That's it. Now to play with the pragmas.<br />
<br />
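As a rough sketch of the thread-count setup above, the call can be wrapped so the same source still builds when OpenMP is switched off. The helper name configure_threads is hypothetical and is not part of the attached program. With GCC, OpenMP is enabled by compiling and linking with the -fopenmp flag, which also defines the _OPENMP macro.<br />
<br />
```c
#ifdef _OPENMP
#include <omp.h>
#endif

/* Hypothetical helper: ask OpenMP for `requested` threads and report
 * how many a following parallel region may use. Without OpenMP the
 * program stays single threaded, so the answer is 1. */
int configure_threads(int requested)
{
#ifdef _OPENMP
    omp_set_num_threads(requested);
    return omp_get_max_threads();
#else
    (void)requested;  /* unused in a serial build */
    return 1;
#endif
}
```
Calling configure_threads(4) at the top of &quot;main&quot; would mirror the single line shown above, with the return value available for a start-up message.<br />
<br />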
We have a &quot;for&quot; loop in the &quot;blackscholes&quot; function, so we'll try that first. After all, it loops 1,000,000 times for each of the 1024 portfolio items. This is the innermost loop, which just performs the calculations. Parallelizing a &quot;for&quot; loop in OpenMP is quite simple: just place a pragma right before the &quot;for&quot; statement.<br />
<br />
<div class="bbcode_container">
	<div class="bbcode_description">Code:</div>
	<pre class="bbcode_code" style="height:48px;">   #pragma omp parallel for private(index)
   for (index = 0; index &lt; experiments; index++)</pre>
</div> I only put &quot;index&quot; in the private section, because I declared &quot;index&quot; outside the &quot;for&quot; loop. I'm not sure if I needed to do this (OpenMP already treats the loop variable of a &quot;parallel for&quot; as private, so listing it is most likely redundant but harmless). The private section of the pragma tells OpenMP which variables are private to each thread working on the &quot;for&quot; loop, so that every thread gets its own copy instead of all threads fighting over a single shared variable.<br />
<br />
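The loop pragma and its private clause can be tried in isolation. This is a minimal sketch, not the code from the attached program: price_one is a hypothetical stand-in for the real calculation, and the reduction clause is an addition that the original loop may not need. It gives each thread its own partial sum and combines them at the end, which avoids a race on a shared total.<br />
<br />
```c
/* Hypothetical stand-in for one experiment; NOT the blackscholes code. */
static double price_one(int index)
{
    return (index % 7) * 0.5;
}

/* Average of price_one() over `experiments` iterations. */
double average_price(int experiments)
{
    double sum = 0.0;
    int index;  /* declared outside the loop, as in the article */

    /* index is private to each thread; sum is combined across threads. */
    #pragma omp parallel for private(index) reduction(+:sum)
    for (index = 0; index < experiments; index++) {
        sum += price_one(index);
    }

    return experiments > 0 ? sum / experiments : 0.0;
}
```
If the file is built without OpenMP support, the pragma is ignored and the loop runs serially, which makes it easy to compare the two results.<br />
<br />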
Everyone should read about variable scope in OpenMP, as it makes a big difference. I've also found that trying different combinations helps understanding, and it's really simple: change the pragma line, compile, test, rinse, repeat. Easy, huh?<br />
<br />
So with these changes, and changes to the Makefile to compile with OpenMP and link the correct libraries, how does this fare?<br />
<br />
Here are the results of the run:<br />
<br />
<div class="bbcode_container">
	<div class="bbcode_quote">
		<div class="quote_container">
			<div class="bbcode_quote_container"></div>
			
				[leo@bearcat1 OpenMP]$ time ./openmpprice<br />
=== Option Portfolio Calculations (OpenMP Test) ==========<br />
Portfolio size                    : 1024<br />
Experiments run per item   : 1000000<br />
Average Call Price             : 36.560187<br />
Average Put  Price             : 1.669589<br />
<br />
real    6m2.281s<br />
user    23m49.077s<br />
sys    0m0.302s
			
		</div>
	</div>
</div> Not bad: 6 minutes and a couple of seconds, compared to the multi-threaded program at just under 5 minutes. Certainly a lot better than the original 19 minutes of the single-threaded program. I'd say a GREAT result for a small effort.<br />
<br />
The other big loop is the &quot;for&quot; loop for the number of times the calculations are performed, in the &quot;portfolio&quot; function. To test this loop, I removed the pragma from the previous test run, and put a pragma on the &quot;portfolio&quot; loop, like this:<br />
<br />
 <div class="bbcode_container">
	<div class="bbcode_description">Code:</div>
	<pre class="bbcode_code" style="height:48px;">  #pragma omp parallel for private(i,j,pnum,snum,ynum)
   for (i = 0; i &lt; num_options; i++)</pre>
</div> This is the outermost loop, which also includes the initialization and the &quot;blackscholes&quot; function call to perform the calculations. Here's the result:<br />
<br />
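For the outer loop, every scratch variable declared outside the loop has to be listed as private, otherwise the threads would overwrite each other's values. A minimal sketch, assuming a hypothetical fill_table function with made-up arithmetic in place of the real initialization and &quot;blackscholes&quot; call:<br />
<br />
```c
/* Hypothetical example: out[i] receives the sum of (i + j) for
 * j = 0 .. experiments-1. The arithmetic is a placeholder only. */
void fill_table(double *out, int num_options, int experiments)
{
    int i, j;    /* loop counters declared outside the loops */
    double acc;  /* per-item scratch value */

    /* Each thread needs its own i, j and acc; out[] is shared, but
     * every iteration writes a different element, so there is no race. */
    #pragma omp parallel for private(i, j, acc)
    for (i = 0; i < num_options; i++) {
        acc = 0.0;
        for (j = 0; j < experiments; j++) {
            acc += (double)(i + j);
        }
        out[i] = acc;
    }
}
```
Declaring j and acc inside the loop body would make them private automatically, which is an alternative to listing them in the pragma.<br />
<br />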
<div class="bbcode_container">
	<div class="bbcode_quote">
		<div class="quote_container">
			<div class="bbcode_quote_container"></div>
			
				[leo@bearcat1 OpenMP]$ time ./openmpprice<br />
=== Option Portfolio Calculations (OpenMP Test) ==========<br />
Portfolio size                    : 1024<br />
Experiments run per item   : 1000000<br />
Average Call Price             : 36.558490<br />
Average Put  Price             : 1.669706<br />
<br />
real    5m25.763s<br />
user    21m14.538s<br />
sys    0m0.239s
			
		</div>
	</div>
</div> Better, because I included more of the processing under OpenMP control. So 5 minutes and 25 seconds is not bad, and very close to the performance I got from hand coding my own threads.<br />
<br />
I'm impressed! Are you? A small effort, not intrusive, BIG gains.<br />
<br />
Two additional lines, and a pragma placed in the right spot, and OpenMP does a bang up job of multi-processing my program. I should note that during these 2 execution runs, my processors were pegged at 100% during the whole run, so the conversion of the program seems very efficient.<br />
<br />
Here's the program:<br />
<br />
<a href="http://www.hpccommunity.org/attachments/f21/190d1227185819-kusu-building-base-kit-openmp.zip" >OpenMP.zip</a><br />
<br />
OpenMP allows multiple pragmas, so if you have a program that has separate sections of calculations, you can put a pragma on each section to help speed it up. If you have older programs with a structure similar to this one, built around &quot;for&quot; loops, then OpenMP will very likely speed things up. As you learn more about OpenMP, I'm sure you'll find other uses for it in other sections of your program.<br />
<br />
Have Fun,<br />
<br />
Leo Stutzmann</blockquote>

 ]]></content:encoded>
			<dc:creator>Bearcat</dc:creator>
			<guid isPermaLink="true">http://www.hpccommunity.org/blogs/bearcat/exploring-hpc-programming-openmp-98/</guid>
		</item>
		<item>
			<title>Exploring HPC Programming: Multi-threading Pt. 2</title>
			<link>http://www.hpccommunity.org/blogs/bearcat/exploring-hpc-programming-multi-threading-pt-2-89/</link>
			<pubDate>Tue, 02 Sep 2008 12:18:30 GMT</pubDate>
			<description>In the previous article about multi-threading, I mentioned that there are 2 options for breaking up the work that needs to be done. Option 1 was to...</description>
			<content:encoded><![CDATA[<blockquote class="blogcontent restore">In the previous article about multi-threading, I mentioned that there are 2 options for breaking up the work that needs to be done. Option 1 was to have each thread work on a subset of the portfolio, but do all the experiments. Option 2 was to have each thread work on a subset of the experiments, and to work on all of the portfolio.<br />
<br />
Today, we'll look at option 2, and I'll try to explain what was needed in the program to accomplish this task. The option 1 and option 2 programs look similar on the surface, but differ in the details. We still need to think about how our data is arranged and how we can access it without incurring a performance penalty from restricting access to the data. Remember, it's all about the data. (I can't stress that enough.)<br />
<br />
So let's begin:<br />
<br />
The &quot;main&quot; function stays the same. The &quot;create_portfolio_threads&quot; function is changed to accomplish the goal. In option 1, I created arrays for each of the threads, because each thread operated on all the experiments which are housed in the arrays. In option 2, each thread will operate on a subset of the experiments, and a subset of the array, so I only need 1 set of arrays again. I added an item &quot;thexperiments&quot; to the thread structure to allow the threads to calculate the area of the array to work on.<br />
<br />
The next set of changes were made to the &quot;portfolio&quot; function, in which the area of the array to work on is based on the number of experiments, and the number of threads that will be used. Also I needed to change the average calculations to be based on the area of the array that was used, instead of the number of experiments.<br />
<br />
The last change was to the &quot;blackscholes&quot; function, to which I now pass the start and number of the indexes of the array to be worked on, instead of the number of experiments.<br />
<br />
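The range calculation described above can be sketched as follows. The names chunk_t and thread_chunk are hypothetical and the attached program may compute its ranges differently. Indexes here are 0-based, while the program's log output counts from 1.<br />
<br />
```c
/* Hypothetical description of one thread's slice of the arrays. */
typedef struct {
    long start;  /* first index handled by the thread (0-based) */
    long count;  /* number of indexes handled by the thread */
} chunk_t;

/* Split `total` experiments over `nthreads` threads as evenly as
 * possible; the first (total % nthreads) threads get one extra. */
chunk_t thread_chunk(long total, int nthreads, int tid)
{
    long base = total / nthreads;
    long extra = total % nthreads;
    chunk_t c;

    c.count = base + (tid < extra ? 1 : 0);
    c.start = tid * base + (tid < extra ? tid : extra);
    return c;
}
```
With 1,000,000 experiments and 4 threads, thread 1 gets start 250000 and count 250000, which corresponds to experiments 250001 to 500000 in the program's 1-based output.<br />
<br />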
That's it, not too hard, was it?<br />
<br />
Here is the code:<br />
<br />
<a href="http://www.hpccommunity.org/attachment.php?attachmentid=189&amp;d=1220358150" >MultiThread2.zip</a><br />
<br />
Let's run it and see how it does:<br />
<br />
[leo@compute70 MultiThread2]$ time ./mthread2<br />
=== Option Portfolio Calculations (Threading over number of Experiments Test) ==========<br />
Portfolio size             : 1024<br />
Experiments run per item   : 1000000<br />
Number of threads started  : 4<br />
Thread 0, Running Experiments 1 to 250000 for 1024 options.<br />
Thread 1, Running Experiments 250001 to 500000 for 1024 options.<br />
Thread 2, Running Experiments 500001 to 750000 for 1024 options.<br />
Thread 3, Running Experiments 750001 to 1000000 for 1024 options.<br />
Thread 0, Average Call Price         : 36.561064<br />
Thread 0, Average Put  Price         : 1.669508<br />
Thread 3, Average Call Price         : 36.829648<br />
Thread 3, Average Put  Price         : 2.142390<br />
Thread 1, Average Call Price         : 36.854430<br />
Thread 1, Average Put  Price         : 2.156009<br />
Thread 2, Average Call Price         : 36.809930<br />
Thread 2, Average Put  Price         : 2.150635<br />
<br />
real    4m50.480s<br />
user    19m3.825s<br />
sys     0m0.077s<br />
[leo@compute70 MultiThread2]$ <br />
<br />
<br />
There you go, 4 minutes and 50 seconds. This is about the same length of time that option 1 took to execute. So what does that tell us?<br />
<br />
It means that this sample (and only this sample) has a fixed number of calculations to do: 1,000,000 * 1024. However I slice up the work 4 ways, each thread ends up with the same share, so, all things being equal, the run takes about the same amount of time. I can slice it up by option 1 as 1,000,000 * 256 per thread, or by option 2 as 250,000 * 1024 per thread; either way each of the 4 threads performs 256,000,000 calculations.<br />
<br />
Each option may have an advantage to a particular program, depending on the program you want to make multi-threaded, and how the data is arranged, so it's nice to have a couple of options, knowing the end result is the same, from a performance standpoint.<br />
<br />
So, can we make this program any faster? Well, maybe, if we can run more threads, say 8, if we had hardware that big. Or, we could try to make the calculations run faster, if we had faster hardware, or maybe alternative hardware.<br />
<br />
I'll explore these additional methods in future articles. <br />
<br />
Multi-threading has been around for a long time, but judging by the programs I've seen, it is still not that well understood. Next I'll look at a newer piece of technology that is supposed to make multi-threading easier, although I don't find multi-threading that hard to begin with. OpenMP is a technology that allows you to thread tasks within your program. I'll take a look at the implications of using OpenMP in this program, and see how it performs.<br />
<br />
See you next time.<br />
<br />
Leo Stutzmann</blockquote>

 ]]></content:encoded>
			<dc:creator>Bearcat</dc:creator>
			<guid isPermaLink="true">http://www.hpccommunity.org/blogs/bearcat/exploring-hpc-programming-multi-threading-pt-2-89/</guid>
		</item>
		<item>
			<title>Exploring HPC Programming: Multi-threading</title>
			<link>http://www.hpccommunity.org/blogs/bearcat/exploring-hpc-programming-multi-threading-88/</link>
			<pubDate>Mon, 25 Aug 2008 16:38:47 GMT</pubDate>
			<description>One of the questions I hear a lot is: Is multi-threaded programming difficult? 
 
The answer to that question is: It depends. Not really an answer,...</description>
			<content:encoded><![CDATA[<blockquote class="blogcontent restore">One of the questions I hear a lot is: Is multi-threaded programming difficult?<br />
<br />
The answer to that question is: It depends. Not really an answer, but it does depend on a number of factors. The multi-threaded API is fairly simple, so coding a multi-threaded application is easy. What’s difficult depends on your application, what it does, and how the data is used and structured. If your application uses a lot of shared data, which you need to control access to with semaphores and mutexes, then it becomes more difficult and error-prone: threads waiting to access data slow down the process, or worse, become deadlocked if you have multiple shared data items and are not careful. These are just details, however, and multi-threaded programming is really easy; you just have to think about it before you write your code.<br />
<br />
In a previous post: “Where to start”, there is a sample program, so it’s best to start with that, and think about how we would go about breaking this down into threads of execution. Ah ha, you say, the program has 2 big loops that are ideal candidates. One loop is used to calculate 1 million experiments, and the second loop is to do this 1024 times to simulate a portfolio of calculations.<br />
<br />
Before we begin, let’s make the assumption, that because I’m going to run the program on a Quad Core machine, that the optimal number of threads will be 4 (who would have guessed). We’ll create the program to be somewhat flexible in how many threads it creates, but the default will be 4.<br />
<br />
If you look at the sample, there appears to be two different strategies that one could take to process the calculations in a threaded manner. Option 1: To have each thread process a subset of the portfolio. When using 4 threads, this would have each thread process 256 items in the portfolio. Option 2: To have each thread process a subset of the experiments. When using 4 threads, this would have each thread process 250,000 of the calculations. Since doing both would make this exercise too long, I’ll choose Option 1. I think investigating Option 2 in a future post is worthwhile, and will provide a nice comparison of the differences made to each program to accomplish the other scenario.<br />
<br />
Now that I have a strategy of how I want to parallelize this program, I can start thinking of the data, and how it’s going to be manipulated. I will try and keep the functions as similar as possible to the original sample, and add helper functions to smooth the transition to multiple threads. For Option 1: I’m going to try and keep the inner function “blackscholes” the same, and add the supporting code around it.<br />
<br />
The ideal way to have multiple threads running in a high performance program, is to have each thread using its own data, without any overlap or contention of data. This is an important strategy. It will make the program a little more complicated in the setup of data, but will maximize the processing, because the threads will not be waiting for shared data items that can only be accessed one at a time.<br />
<br />
The threading API is basic, and doesn’t take a list of parameters. It does take a pointer, so if I need to pass a thread all the items for it to do its calculations, the easiest way is to create a structure of things, and pass the thread the pointer to the structure. You will see this structure definition at the beginning of the program. In the original sample, I used arrays to store the calculations for the experiments, and this works out quite nicely for multi-threaded programs, although now that I have 4 sets of calculations running, I will need more arrays.<br />
<br />
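The structure-pointer pattern described above can be sketched like this. It is a minimal illustration under assumptions: thread_args_t, worker and run_workers are hypothetical names, the summing loop is a placeholder for the real calculations, and only pthread_create and pthread_join are actual API calls.<br />
<br />
```c
#include <pthread.h>

#define MAX_THREADS 16

/* Hypothetical per-thread structure: everything a thread needs. */
typedef struct {
    int first_item;  /* first portfolio item for this thread */
    int num_items;   /* how many items this thread processes */
    double result;   /* written by the thread, read after the join */
} thread_args_t;

/* Thread entry point: cast the pointer back and do the work. */
static void *worker(void *arg)
{
    thread_args_t *a = (thread_args_t *)arg;
    double sum = 0.0;
    int i;

    for (i = 0; i < a->num_items; i++) {
        sum += (double)(a->first_item + i);  /* placeholder work */
    }
    a->result = sum;
    return NULL;
}

/* Start nthreads workers over total_items, wait for them, and copy
 * each result into results[]. Returns 0 on success, -1 on failure. */
int run_workers(int nthreads, int total_items, double *results)
{
    pthread_t threads[MAX_THREADS];
    thread_args_t args[MAX_THREADS];
    int per, t, started = 0;

    if (nthreads < 1 || nthreads > MAX_THREADS) {
        return -1;
    }
    per = total_items / nthreads;

    for (t = 0; t < nthreads; t++) {
        args[t].first_item = t * per;
        /* the last thread also takes any remainder */
        args[t].num_items = (t == nthreads - 1) ? total_items - t * per : per;
        args[t].result = 0.0;
        if (pthread_create(&threads[t], NULL, worker, &args[t]) != 0) {
            break;
        }
        started++;
    }

    for (t = 0; t < started; t++) {
        pthread_join(threads[t], NULL);
        results[t] = args[t].result;
    }
    return started == nthreads ? 0 : -1;
}
```
Because each thread only touches its own structure, no locking is needed. On older toolchains the program has to be compiled and linked with the -pthread option.<br />
<br />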
The first change to the program is in the “main” function, which I recoded a little for clarity of parameters, and all it does is call a new helper function I added to create the threads (create_portfolio_thread). This function does all the setup and tear down of the threads that are going to be doing the calculations. Now, instead of creating one set of arrays to hold the calculations, each thread needs its own set of arrays, to make each thread independent of the other threads and not share any data. Instead of the 5 arrays of numbers, I now have to create (5 * number of threads) arrays. This is done with arrays of pointers that have one entry per thread. Then each thread’s arrays are allocated, and their addresses saved in those entries. Quite simple. All these pointers are saved in the thread structure for each thread, and the “pthread_create” API is used to create the execution thread and pass it the structure of data that the thread will work on. Then we wait on each thread to complete, and finish.<br />
<br />
The “portfolio_thread” function is a helper that casts the passed data, and calls the “portfolio” function.<br />
<br />
The “portfolio” function looks at the passed data and, based on the number of threads that will be running, calculates the range of the 1024 items it will perform all the experiments on. Because the range is derived from the number of threads, you don’t have to change anything if you change the number of threads to create. It then does the same thing the portfolio function did in the previous sample, but instead of 1024 items, it only does the items for this thread.<br />
<br />
That’s it. Now that wasn’t so hard, was it? Here is the program:<br />
<br />
<a href="http://www.hpccommunity.org/attachment.php?attachmentid=9" >Attachment 9</a><br />
<br />
<br />
Let’s run this program and see how well we did. The original program took almost 19 minutes. 4 threads should take under 5 minutes of elapsed time. OK, here goes:<br />
<br />
__________________________________________________  ___________________________________<br />
<br />
[leo@compute70 MultiThread1]$ time ./mthread1<br />
=== Option Portfolio Calculations (Threading over number of options Test) ==========<br />
Portfolio size             : 1024<br />
Experiments run per item   : 1000000<br />
Number of threads started  : 4<br />
Thread 0, Running Options 1 to 256 doing 256 iterations.<br />
Thread 3, Running Options 769 to 1024 doing 256 iterations.<br />
Thread 1, Running Options 257 to 512 doing 256 iterations.<br />
Thread 2, Running Options 513 to 768 doing 256 iterations.<br />
Thread 1, Average Call Price         : 36.836399<br />
Thread 1, Average Put  Price         : 2.151642<br />
Thread 2, Average Call Price         : 36.843320<br />
Thread 2, Average Put  Price         : 2.146836<br />
Thread 3, Average Call Price         : 36.849425<br />
Thread 3, Average Put  Price         : 2.141625<br />
Thread 0, Average Call Price         : 36.815204<br />
Thread 0, Average Put  Price         : 2.144223<br />
<br />
real    13m7.594s<br />
user    28m24.214s<br />
sys     22m30.525s<br />
__________________________________________________  ___________________________________<br />
<br />
Oh oh, Houston, we have a problem. While the program was faster at 13 minutes 7 seconds, that’s not even close to what I was expecting. My coding kung-fu is failing me. The run above shows that 28 minutes were spent in user time, which is OK because I was using 4 cores, but 22 and a half minutes were spent in system time. This is not good. Why would the program spend so much time in the system? You could get out the profiler and it would show you exactly where the time is going, but I have my own suspicions. In Linux the random functions depend on global state, so when they are called from multiple threads their execution is serialized and the threads spend their time waiting on each other.<br />
<br />
I need to get rid of the random system functions. In the next sample, I just changed the “RandFloat” function to provide an even distribution between low and high values, based on the thread that’s calling it, so each thread gets a slightly different even distribution from the others. Again, this is fine for our tests, but does not simulate a real options pricing program.<br />
<br />
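One way to keep random numbers from serializing the threads is to give every thread its own generator state. The sketch below is an assumption for illustration and is not the “RandFloat” replacement from the attached program: rng_t, rng_seed and rng_uniform are hypothetical names around a simple 64-bit linear congruential generator, which is no more suitable for real option pricing than the original.<br />
<br />
```c
/* Hypothetical per-thread generator state; no globals, no locks. */
typedef struct {
    unsigned long long state;
} rng_t;

/* Seed a generator from the thread number so streams differ. */
rng_t rng_seed(int tid)
{
    rng_t r;
    r.state = (unsigned long long)tid;
    return r;
}

/* Advance the generator and return a value between low and high. */
double rng_uniform(rng_t *r, double low, double high)
{
    double u;

    /* 64-bit linear congruential step; wraps modulo 2^64 */
    r->state = r->state * 6364136223846793005ULL + 1442695040888963407ULL;

    /* use the top 53 bits to build a fraction in [0, 1) */
    u = (double)(r->state >> 11) / 9007199254740992.0;
    return low + u * (high - low);
}
```
Each thread would keep its rng_t inside its own thread structure and pass a pointer to it on every call, so no thread ever waits on another for a random number.<br />
<br />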
Here is the changed program:<br />
<br />
<a href="http://www.hpccommunity.org/attachment.php?attachmentid=188&amp;d=1219682441" >MultiThread1x.zip</a><br />
<br />
<br />
Let’s run it again with the random system function eliminated:<br />
<br />
__________________________________________________  ___________________________________<br />
<br />
[leo@compute70 MultiThread1x]$ time ./mthread1<br />
=== Option Portfolio Calculations (Threading over number of options Test) ==========<br />
Portfolio size             : 1024<br />
Experiments run per item   : 1000000<br />
Number of threads started  : 4<br />
Thread 0, Running Options 1 to 256 doing 256 iterations.<br />
Thread 1, Running Options 257 to 512 doing 256 iterations.<br />
Thread 3, Running Options 769 to 1024 doing 256 iterations.<br />
Thread 2, Running Options 513 to 768 doing 256 iterations.<br />
Thread 1, Average Call Price         : 36.851565<br />
Thread 1, Average Put  Price         : 2.156646<br />
Thread 0, Average Call Price         : 36.558490<br />
Thread 0, Average Put  Price         : 1.669706<br />
Thread 2, Average Call Price         : 36.807067<br />
Thread 2, Average Put  Price         : 2.153668<br />
Thread 3, Average Call Price         : 36.826722<br />
Thread 3, Average Put  Price         : 2.140491<br />
<br />
real    4m41.657s<br />
user    18m35.799s<br />
sys     0m0.126s<br />
__________________________________________________  ___________________________________<br />
<br />
<br />
Now that’s better. 4 minutes and 41 seconds, using 18 and a half minutes of user time on 4 cores.<br />
<br />
With these multi-threading techniques, the time has come down to about 5 minutes, from 19 minutes in the single process example. Hope I’ve given you some ideas on how to incorporate multi-threading in your program. Got to watch out for those system functions though, or anything that will serialize your program execution.<br />
<br />
Cheers for now, and happy coding.<br />
<br />
Leo Stutzmann</blockquote>


 ]]></content:encoded>
			<dc:creator>Bearcat</dc:creator>
			<guid isPermaLink="true">http://www.hpccommunity.org/blogs/bearcat/exploring-hpc-programming-multi-threading-88/</guid>
		</item>
		<item>
			<title>Exploring HPC Programming: Where to start</title>
			<link>http://www.hpccommunity.org/blogs/bearcat/exploring-hpc-programming-where-start-83/</link>
			<pubDate>Mon, 11 Aug 2008 12:40:24 GMT</pubDate>
			<description>One of the topics I want to cover here, is HPC programming. That includes many things, so I want to look at such things as threading, toolkits such...</description>
			<content:encoded><![CDATA[<blockquote class="blogcontent restore">One of the topics I want to cover here is HPC programming. That includes many things, so I want to look at such things as threading, toolkits such as OpenMP, graphics processor (GPU) toolkits, and cluster kits such as MPI, as well as others that crop up from time to time. Learning how to use these toolkits can range from simple to complex. Getting the most out of each toolkit is an exercise for the reader. I will cover some basics about using each one, and run some performance tests to compare the different toolkits. This will more or less be an introduction, but I'm hoping that it will fuel some ideas in whoever reads this, and that you'll be able to apply some of the simple concepts to your own programs.<br />
<br />
Where to start is always the hard question. I'm going to start with a compute-intensive program that does Option Pricing. Lots of floating point calculations, lots of experiments to see what the best price might be over the years, and these calculations need to be run over many option positions. There are differences between this program and real option pricing programs. This program processes a number of experiments, and the experiments are set up using random number generation, with some approximation calculations. Real option pricing programs, I believe, will use a stochastic distribution calculation to seed the experiments, so my program is not going to be accurate. That means you should not use this program to try to price any options (that's the warning). This program is for testing and performance testing only.<br />
<br />
The program performs 1 million experiments (option price calculations), and also performs this 1024 times, to simulate a portfolio of this many options that need to be priced. As I am going to explore parallelism in future posts, I decided to arrange the data into arrays. Thinking about parallelism has everything to do with the data, and how you handle it. I hope the arrays I've used will be adaptable to all the toolkits that will be explored, but time will tell. The program also only prints minimal data out at the end. I only print the last experiment, just to show that the calculations are actually done. A real pricing program would probably spit out all calculations for all experiments, or analyze and print out the optimal prices and maturity, and it would do this for all option positions in the portfolio. But I'm more interested in execution time.<br />
<br />
The baseline program was run, and its results collected, on an Intel Core2 Quad (Q6600) processor running at 2.4 GHz. The machine has 4 GB of RAM, a 250 GB hard drive, and a graphics processor. The operating system is CentOS 5.2 with all the updates applied. Not a bad little machine, lots of horsepower, you say! As the baseline program is only a single thread of execution, it takes a long time on this machine, as the performance run will show. One core of the quad processor is pegged at 100%, while the other 3 sit idly around doing a little housekeeping here and there. The program gets the job done, but not very efficiently.<br />
<br />
So let's begin. Attached is a zip file containing the program and the Makefile.<br />
<br />
<a href="http://www.hpccommunity.org/attachment.php?attachmentid=7" >Attachment 7</a> <br />
 <br />
<br />
The test run was done using the Linux &quot;time&quot; command to time the execution. I'm more interested in the overall program execution time, as opposed to some benchmarks I've seen that only time inner calculations. I mean, when you're waiting for results, it's the whole program that counts. The output below shows the timed execution on the Core2 Quad 2.4 GHz processor.<br />
<br />
[leo@compute70 Baseline]$ time ./optionprice<br />
=== Option Portfolio Calculations (Basline Test) ==========<br />
Portfolio size                         : 1024<br />
Experiments run per item   : 1000000<br />
Average Call Price               : 36.868640<br />
Average Put  Price               : 2.147528<br />
<br />
real    18m49.352s<br />
user    18m49.075s<br />
sys    0m0.090s<br />
[leo@compute70 Baseline]$ <br />
<br />
<br />
Here is a picture of &quot;top&quot; running to show that the execution of the program only exercises 1 core of the cpu.<br />
<br />
<a href="http://www.hpccommunity.org/attachments/f13/186d1218470894-integrating-symphony-de-matlab-parallel-computing-toolbox-baseline.jpg" id="attachment186" rel="Lightbox_83" ><img src="http://www.hpccommunity.org/attachments/f13/186d1218470894t-integrating-symphony-de-matlab-parallel-computing-toolbox-baseline.jpg" border="0" alt="baseline.jpg: top output during the baseline run" class="thumbnail" /></a> <br />
 <br />
<br />
Well there you go, almost 19 minutes to perform all the calculations. You might say that's not so bad, I can go have a coffee while that's going on. But suppose you are the person responsible for managing this portfolio, and you have to make pricing decisions quickly. I doubt saying &quot;come back in 20, while I figure this out&quot;, is acceptable. And this is only a small portfolio, with minimal experiments to determine optimal pricing.<br />
<br />
Over the next few posts, I'll look at how to improve this execution time. Sure, I can optimize the program and use compiler optimizations to hopefully speed it up, but I want bigger speed-ups than that.<br />
<br />
Talk to you soon.<br />
<br />
<br />
Leo Stutzmann</blockquote>


 ]]></content:encoded>
			<dc:creator>Bearcat</dc:creator>
			<guid isPermaLink="true">http://www.hpccommunity.org/blogs/bearcat/exploring-hpc-programming-where-start-83/</guid>
		</item>
		<item>
			<title>Multi-Core and GPU Background</title>
			<link>http://www.hpccommunity.org/blogs/bearcat/multi-core-gpu-background-77/</link>
			<pubDate>Fri, 27 Jun 2008 17:26:26 GMT</pubDate>
			<description>High performance programs typically do the same thing many times with different data, or with different parameters, such as a simulation. Real...</description>
			<content:encoded><![CDATA[<blockquote class="blogcontent restore">High performance programs typically do the same thing many times with different data, or with different parameters, such as a simulation. Real-world applications, such as a program that monitors a port, gets data, processes the data, and returns a result, may need to run on multiple cores as well, but I typically call this type of multi-threading &quot;concurrency&quot;, where your program is doing different things concurrently. This may not be everyone's definition, but I tend to look at it this way.<br />
<br />
I may indulge a little in concurrency here in the future, but for the most part we'll be looking at programs that fall into the high performance category.<br />
<br />
There are a number of technologies for high performance programming that are available or are coming out, and they fall into a few different camps. I'm going to be looking into the areas of Camp 0, 1 and 2, with a sample program and performance numbers, but first, I just wanted to provide a little background into the different areas.<br />
<br />
<b>Camp 0: (Multi-threading)</b><br />
<br />
The old tried and true method of using multiple processors or cores in a machine. You use the threading libraries available on your operating system, and manage the threads and data access yourself. It can be easy or complex to do, depending on your program.<br />
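<br />
A minimal sketch of this camp, assuming POSIX threads (pthreads) on Linux. It is not taken from the option pricing program; the names threaded_average, slice_t and NUM_THREADS are made up for illustration. The array is split into slices by hand, each thread sums its own slice, and the results are combined after the threads are joined.<br />
<br />
```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

#define NUM_THREADS 4   /* assumption: one thread per core on a quad-core machine */

/* Work for one thread: a slice of the input plus a private result slot. */
typedef struct {
    const double *data;
    size_t begin, end;
    double partial;
} slice_t;

static void *sum_slice(void *arg)
{
    slice_t *s = arg;
    double acc = 0.0;
    for (size_t i = s->begin; i < s->end; i++)
        acc += s->data[i];
    s->partial = acc;   /* each thread writes only its own slot, so no lock is needed */
    return NULL;
}

/* Hypothetical helper: average of n values, summed by NUM_THREADS threads. */
double threaded_average(const double *data, size_t n)
{
    assert(data != NULL && n > 0);

    pthread_t tid[NUM_THREADS];
    slice_t slice[NUM_THREADS];
    int started[NUM_THREADS];
    size_t chunk = n / NUM_THREADS;

    for (int t = 0; t < NUM_THREADS; t++) {
        slice[t].data = data;
        slice[t].begin = (size_t)t * chunk;
        /* the last thread also takes whatever is left over */
        slice[t].end = (t == NUM_THREADS - 1) ? n : (size_t)(t + 1) * chunk;
        slice[t].partial = 0.0;
        started[t] = (pthread_create(&tid[t], NULL, sum_slice, &slice[t]) == 0);
        if (!started[t])
            sum_slice(&slice[t]);   /* fall back to the calling thread */
    }

    double total = 0.0;
    for (int t = 0; t < NUM_THREADS; t++) {
        if (started[t])
            pthread_join(tid[t], NULL);
        total += slice[t].partial;
    }
    return total / (double)n;
}
```
<br />
Build it with something like gcc -pthread. Note how much of the code is bookkeeping (slicing, creating, joining) rather than the calculation itself; that is the part you manage yourself in this camp.<br />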
<br />
<b>Camp 1: (Multi-Core)</b><br />
<br />
Toolkits that allow some parallelism in your program through meta-tagging of code segments. An example of this is OpenMP. This tool allows you to tag sections of code for parallelism, and the result is compiled to the native architecture of the chip you're using. If I compile for x86, then the resulting program runs on x86, and makes use of multiple x86 processors, if available. No special compiler is needed; it is available in gcc through the -fopenmp flag.<br />
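<br />
A minimal sketch of what tagging a loop looks like, assuming gcc with OpenMP support. The function name tagged_average is made up for illustration and this is not code from the option pricing program.<br />
<br />
```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: average of n values, with the loop tagged for OpenMP. */
double tagged_average(const double *data, size_t n)
{
    assert(data != NULL && n > 0);

    double total = 0.0;

    /* The pragma asks the compiler to split the iterations across threads.
       reduction(+:total) gives each thread a private copy of total and
       adds the copies together when the loop finishes. */
    #pragma omp parallel for reduction(+:total)
    for (long i = 0; i < (long)n; i++)
        total += data[i];

    return total / (double)n;
}
```
<br />
Compiled with gcc -fopenmp, the loop is shared among the available cores. Compiled without the flag, the pragma is ignored and the same source runs as an ordinary serial loop, which is a large part of the appeal of this camp.<br />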
<br />
NOTE: Camps 1 and 2 don't work together automatically, unless you explicitly combine the two techniques yourself.<br />
<br />
<b>Camp 2: (GPU)</b><br />
<br />
The GPU accelerator camp includes the previous generation of GPGPU programming, and now CUDA and ATI's Stream SDK for their new FireStream processor cards. These are development kits that allow you to use the GPU as a floating-point co-processing unit. While each of these gets easier to use as new versions come out, they usually require a special compiler, which compiles the code to native GPU instructions; in the case of GPGPU programming, the program instead emits the specific shader language instructions to the GPU. Once the code is compiled, if it is run on a machine which doesn't have the GPU, your program doesn't work (go figure). Fast, but not general in nature. E.g., the CUDA SDK comes with an nVidia compiler, and only supports 8000 series cards and up, provided you have the correct drivers.<br />
<br />
<b>Camp 3: (either Multi-Core or GPU)</b><br />
<br />
Companies such as RapidMind have a model that tags code, and the tagged code is emitted as program text. This text is compiled at runtime, based on the backend required. RapidMind has back-ends for x86, ppc, and glsl (the shader language used by Camp 2 GPGPU). At runtime a backend is selected and the code text is compiled to the native instructions for the target processor. Only one target backend is used at a time, either x86 or GPU; they are not mixed together (at least in the current versions). This allows the program to be compiled once and run on any machine that has the RapidMind runtime. It will use the ATI or nVidia GPU if available, or just multi-thread the program segments on x86 if multiple cores are available. The benefit is application portability to different machines with different CPU and graphics cores. The RapidMind runtime manages access to bound and unbound variables to eliminate the locking required in general multi-threaded programming.<br />
<br />
<b>Camp 4: (both Multi-Core and GPU) (Near Future)</b><br />
<br />
Apple seems to be taking an all-of-the-above approach to this. They have been a heavy participant in the LLVM compiler project, and have a number of Apple engineers working on it. The LLVM compiler project takes your code and generates intermediate code (think Java or the .NET CLR). This intermediate code can then be turned into native code by one of the LLVM code generation back-ends.<br />
<br />
Apple has used this technology successfully in Mac OS X Leopard 10.5, in the new implementation of their OpenGL (note the GL). They compile the OpenGL libraries that get installed on your system. Then, when OpenGL is used on your system, the Just In Time backend for your graphics card compiles the intermediate code to native code for your machine, and the result is placed in caches on your hard drive for later execution.<br />
<br />
This appears to be the same technology that they will use in OpenCL (Open Computing Language, using the GPU for calculations), announced for Mac OS X Snow Leopard 10.6 and to be delivered next year. Sections of code destined for parallelism and the GPU will be compiled by LLVM to intermediate code, and then Just In Time compiled by the backend to provide native GPU instructions for execution.<br />
<br />
Additionally, the announced Grand Central Dispatch, which tags program segments for parallelism and runs them on multiple cores, seems similar to OpenMP. Apple states in their announcement that you will be able to use GPUs and CPUs at the same time in your parallel segments. This implies that the same LLVM is used, with a different backend to Just In Time compile for x86 native execution.<br />
<br />
This camp will allow JIT compilation for multiple backends, with their execution controlled in the Grand Central Dispatch environment. This is an additional step up from Camp 3.<br />
<br />
This is impressive for a couple of reasons. First, it implies that they package up execution units and dispatch them to different processor cores or GPUs as needed, and second, they are baking this into their own developer tools and the operating system. The required runtimes will always be available to these types of compiled applications. Let's hope this lives up to the hype I've just given it.<br />
<br />
<b>Camp X: (Cluster) (Far Future)</b><br />
<br />
How do we get program segments running in a cluster? A problem someone will solve, I'm sure.<br />
<br />
Leo</blockquote>

 ]]></content:encoded>
			<dc:creator>Bearcat</dc:creator>
			<guid isPermaLink="true">http://www.hpccommunity.org/blogs/bearcat/multi-core-gpu-background-77/</guid>
		</item>
		<item>
			<title>Who Am I</title>
			<link>http://www.hpccommunity.org/blogs/bearcat/who-am-i-76/</link>
			<pubDate>Fri, 27 Jun 2008 16:26:31 GMT</pubDate>
			<description>My name is Leo Stutzmann, and I am an Architect at Platform Computing. I am in the research group, and tend to look at things related to the...</description>
			<content:encoded><![CDATA[<blockquote class="blogcontent restore">My name is Leo Stutzmann, and I am an Architect at Platform Computing. I am in the research group, and tend to look at things related to the developer. That means tools, compilers, coding methods, etc., within the area of High Performance Computing.<br />
<br />
One area I am looking at is <b>Multi-Core and GPU issues</b>. This may expand into other co-processors. Another area is <b>Parallel programming languages, Models, and Tools</b>. These areas loosely correspond to two areas identified in Khalid's wonderful summary here:<br />
<br />
<a href="http://www.hpccommunity.org/blogs/khalid/research-topics-hpc-65/" target="_self">Research Topics in HPC - HPC Community - High Performance Computing (HPC) Community</a><br />
<br />
A lot of what you'll see will be thoughts, examples, tests, performance numbers, etc. I will usually work with single-system hardware I have, or machines I can get my hands on. I will try to stick to common hardware, so you can try these things yourself, and see if some of these technologies benefit your own programming.<br />
<br />
cheers for now,<br />
Leo</blockquote>

 ]]></content:encoded>
			<dc:creator>Bearcat</dc:creator>
			<guid isPermaLink="true">http://www.hpccommunity.org/blogs/bearcat/who-am-i-76/</guid>
		</item>
	</channel>
</rss>

