Exploring HPC Programming: Multi-threading
Posted August 25th, 2008 at 04:38 PM by Bearcat
Updated August 25th, 2008 at 04:51 PM by Bearcat (Added attachments)
Updated August 25th, 2008 at 04:51 PM by Bearcat (Added attachments)
One of the questions I hear a lot is: Is multi-threaded programming difficult?
The answer to that question is: It depends. Not really an answer, but it does depend on a number of factors. The multi-threaded api is fairly simple, so coding a multi-threaded application is easy. What’s difficult depends on your application, what it does, and how the data is used and structured. If your application uses a lot of shared data, which you need to control access to, with semaphores and mutexes, then it becomes more difficult, and prone to error, as threads waiting to access data, slow down the process, or worse become dead-locked if you have multiple shared data items, and are not careful. These are just details however, and multi-threaded programming is really easy, you just have to think about it before you write your code.
In a previous post: “Where to start”, there is a sample program, so it’s best to start with that, and think about how we would go about breaking this down into threads of execution. Ah ha, you say, the program has 2 big loops that are ideal candidates. One loop is used to calculate 1 million experiments, and the second loop is to do this 1024 times to simulate a portfolio of calculations.
Before we begin, let’s make the assumption, that because I’m going to run the program on a Quad Core machine, that the optimal number of threads will be 4 (who would have guessed). We’ll create the program to be somewhat flexible in how many threads it creates, but the default will be 4.
If you look at the sample, there appears to be two different strategies that one could take to process the calculations in a threaded manner. Option 1: To have each thread process a subset of the portfolio. When using 4 threads, this would have each thread process 256 items in the portfolio. Option 2: To have each thread process a subset of the experiments. When using 4 threads, this would have each thread process 250,000 of the calculations. Since doing both would make this exercise too long, I’ll choose Option 1. I think investigating Option 2 in a future post is worthwhile, and will provide a nice comparison of the differences made to each program to accomplish the other scenario.
Now that I have a strategy of how I want to parallelize this program, I can start thinking of the data, and how it’s going to be manipulated. I will try and keep the functions as similar as possible to the original sample, and add helper functions to smooth the transition to multiple threads. For Option 1: I’m going to try and keep the inner function “blackscholes” the same, and add the supporting code around it.
The ideal way to have multiple threads running in a high performance program, is to have each thread using its own data, without any overlap or contention of data. This is an important strategy. It will make the program a little more complicated in the setup of data, but will maximize the processing, because the threads will not be waiting for shared data items that can only be accessed one at a time.
The threading API is basic, and doesn’t take a list of parameters. It does take a pointer, so if I need to pass a thread all the items for it to do its calculations, the easiest way is to create a structure of things, and pass the thread the pointer to the structure. You will see this structure definition at the beginning of the program. In the original sample, I used arrays to store the calculations for the experiments, and this works out quite nicely for multi-threaded programs, although now that I have 4 sets of calculations running, I will need more arrays.
The first change to the program is in the “main” function, which I recoded a little for clarity of parameters, and all it does is call a new helper function I added to create the threads (create_portfolio_thread). This function does all the setup and tear down of the threads that are going to be doing the calculations. Now, instead of creating the arrays to hold the calculations, each thread will need its own set of arrays to make each thread independent of the other threads, and not share any data. Instead of the 5 arrays of numbers, I now have to create (5 * number of threads) of arrays. This is done using a pointer to the address of the array, which are allocated by using the number of threads for its size. Then the arrays are allocated, and the pointers saved in the pointers to the address. Quite Simple. All these pointers are saved in the thread structure for each thread, and the “pthread_create” API is used to create the execution thread, and pass the structure of data that the thread will work on. Then we wait on each thread to complete, and finish.
The “portfolio_thread” function is a helper that casts the passed data, and calls the “portfolio” function.
The “portfolio” function looks at the passed data to operate on, and decides which of the 1024 items, based on the number of threads that will be running, and calculates the range of items it will perform all the experiments on. It does this based on the number of threads, so you don’t have to change anything, if you change the number of threads to create. It then does the same thing the portfolio function did in the previous sample, but instead of 1024 items, it only does the items for this thread.
That’s it. Now that wasn’t so hard, was it. Here is the program:
MultiThread1.zip
Let’s run this program and see how well we did. The original program took almost 19 minutes. 4 thread should take under 5 minutes, elapsed time. OK, here goes:
__________________________________________________ ___________________________________
[leo@compute70 MultiThread1]$ time ./mthread1
=== Option Portfolio Calculations (Threading over number of options Test) ==========
Portfolio size : 1024
Experiments run per item : 1000000
Number of threads started : 4
Thread 0, Running Options 1 to 256 doing 256 iterations.
Thread 3, Running Options 769 to 1024 doing 256 iterations.
Thread 1, Running Options 257 to 512 doing 256 iterations.
Thread 2, Running Options 513 to 768 doing 256 iterations.
Thread 1, Average Call Price : 36.836399
Thread 1, Average Put Price : 2.151642
Thread 2, Average Call Price : 36.843320
Thread 2, Average Put Price : 2.146836
Thread 3, Average Call Price : 36.849425
Thread 3, Average Put Price : 2.141625
Thread 0, Average Call Price : 36.815204
Thread 0, Average Put Price : 2.144223
real 13m7.594s
user 28m24.214s
sys 22m30.525s
__________________________________________________ ___________________________________
Oh oh, Houston, we have a problem. While the program was faster at 13 minutes 7 seconds, that’s not even close to what I was expecting. My coding kung-fu is failing me. When you look at the above run, it also shows that 28 minutes were spent in user time, which is ok, because I was using 4 cores, but 22 and a half minutes were spent in system time, this is not good, why would the program spend so much time in system. You could get out the profiler and it will show you exactly where it is spending the time, but I have my own suspicions. In Linux the random functions are not thread safe, and when compiling the program for threading, the random functions serialize their execution, because the function depends on a global variable.
I need to get rid of the random system functions. In the next sample, I just changed the “RandFloat” function to provide an even distribution between low and high values, based on the thread that’s calling it, so each thread gets a slightly different even distribution from the others. Again, this is fine for our tests, but does not simulate a real options pricing program.
Here is the changed program:
MultiThread1x.zip
Lets run it again with the random system function eliminated:
__________________________________________________ ___________________________________
[leo@compute70 MultiThread1x]$ time ./mthread1
=== Option Portfolio Calculations (Threading over number of options Test) ==========
Portfolio size : 1024
Experiments run per item : 1000000
Number of threads started : 4
Thread 0, Running Options 1 to 256 doing 256 iterations.
Thread 1, Running Options 257 to 512 doing 256 iterations.
Thread 3, Running Options 769 to 1024 doing 256 iterations.
Thread 2, Running Options 513 to 768 doing 256 iterations.
Thread 1, Average Call Price : 36.851565
Thread 1, Average Put Price : 2.156646
Thread 0, Average Call Price : 36.558490
Thread 0, Average Put Price : 1.669706
Thread 2, Average Call Price : 36.807067
Thread 2, Average Put Price : 2.153668
Thread 3, Average Call Price : 36.826722
Thread 3, Average Put Price : 2.140491
real 4m41.657s
user 18m35.799s
sys 0m0.126s
__________________________________________________ ___________________________________
Now that’s better. 4 minutes and 41 seconds, using 18 and a half minutes of user time on 4 cores.
With these multi-threading techniques, the time has come down to about 5 minutes, from 19 minutes in the single process example. Hope I’ve given you some ideas on how to incorporate multi-threading in your program. Got to watch out for those system functions though, or anything that will serialize your program execution.
Cheers for now, and happy coding.
Leo Stutzmann
The answer to that question is: It depends. Not really an answer, but it does depend on a number of factors. The multi-threaded api is fairly simple, so coding a multi-threaded application is easy. What’s difficult depends on your application, what it does, and how the data is used and structured. If your application uses a lot of shared data, which you need to control access to, with semaphores and mutexes, then it becomes more difficult, and prone to error, as threads waiting to access data, slow down the process, or worse become dead-locked if you have multiple shared data items, and are not careful. These are just details however, and multi-threaded programming is really easy, you just have to think about it before you write your code.
In a previous post: “Where to start”, there is a sample program, so it’s best to start with that, and think about how we would go about breaking this down into threads of execution. Ah ha, you say, the program has 2 big loops that are ideal candidates. One loop is used to calculate 1 million experiments, and the second loop is to do this 1024 times to simulate a portfolio of calculations.
Before we begin, let’s make the assumption, that because I’m going to run the program on a Quad Core machine, that the optimal number of threads will be 4 (who would have guessed). We’ll create the program to be somewhat flexible in how many threads it creates, but the default will be 4.
If you look at the sample, there appears to be two different strategies that one could take to process the calculations in a threaded manner. Option 1: To have each thread process a subset of the portfolio. When using 4 threads, this would have each thread process 256 items in the portfolio. Option 2: To have each thread process a subset of the experiments. When using 4 threads, this would have each thread process 250,000 of the calculations. Since doing both would make this exercise too long, I’ll choose Option 1. I think investigating Option 2 in a future post is worthwhile, and will provide a nice comparison of the differences made to each program to accomplish the other scenario.
Now that I have a strategy of how I want to parallelize this program, I can start thinking of the data, and how it’s going to be manipulated. I will try and keep the functions as similar as possible to the original sample, and add helper functions to smooth the transition to multiple threads. For Option 1: I’m going to try and keep the inner function “blackscholes” the same, and add the supporting code around it.
The ideal way to have multiple threads running in a high performance program, is to have each thread using its own data, without any overlap or contention of data. This is an important strategy. It will make the program a little more complicated in the setup of data, but will maximize the processing, because the threads will not be waiting for shared data items that can only be accessed one at a time.
The threading API is basic, and doesn’t take a list of parameters. It does take a pointer, so if I need to pass a thread all the items for it to do its calculations, the easiest way is to create a structure of things, and pass the thread the pointer to the structure. You will see this structure definition at the beginning of the program. In the original sample, I used arrays to store the calculations for the experiments, and this works out quite nicely for multi-threaded programs, although now that I have 4 sets of calculations running, I will need more arrays.
The first change to the program is in the “main” function, which I recoded a little for clarity of parameters, and all it does is call a new helper function I added to create the threads (create_portfolio_thread). This function does all the setup and tear down of the threads that are going to be doing the calculations. Now, instead of creating the arrays to hold the calculations, each thread will need its own set of arrays to make each thread independent of the other threads, and not share any data. Instead of the 5 arrays of numbers, I now have to create (5 * number of threads) of arrays. This is done using a pointer to the address of the array, which are allocated by using the number of threads for its size. Then the arrays are allocated, and the pointers saved in the pointers to the address. Quite Simple. All these pointers are saved in the thread structure for each thread, and the “pthread_create” API is used to create the execution thread, and pass the structure of data that the thread will work on. Then we wait on each thread to complete, and finish.
The “portfolio_thread” function is a helper that casts the passed data, and calls the “portfolio” function.
The “portfolio” function looks at the passed data to operate on, and decides which of the 1024 items, based on the number of threads that will be running, and calculates the range of items it will perform all the experiments on. It does this based on the number of threads, so you don’t have to change anything, if you change the number of threads to create. It then does the same thing the portfolio function did in the previous sample, but instead of 1024 items, it only does the items for this thread.
That’s it. Now that wasn’t so hard, was it. Here is the program:
MultiThread1.zip
Let’s run this program and see how well we did. The original program took almost 19 minutes. 4 thread should take under 5 minutes, elapsed time. OK, here goes:
__________________________________________________ ___________________________________
[leo@compute70 MultiThread1]$ time ./mthread1
=== Option Portfolio Calculations (Threading over number of options Test) ==========
Portfolio size : 1024
Experiments run per item : 1000000
Number of threads started : 4
Thread 0, Running Options 1 to 256 doing 256 iterations.
Thread 3, Running Options 769 to 1024 doing 256 iterations.
Thread 1, Running Options 257 to 512 doing 256 iterations.
Thread 2, Running Options 513 to 768 doing 256 iterations.
Thread 1, Average Call Price : 36.836399
Thread 1, Average Put Price : 2.151642
Thread 2, Average Call Price : 36.843320
Thread 2, Average Put Price : 2.146836
Thread 3, Average Call Price : 36.849425
Thread 3, Average Put Price : 2.141625
Thread 0, Average Call Price : 36.815204
Thread 0, Average Put Price : 2.144223
real 13m7.594s
user 28m24.214s
sys 22m30.525s
__________________________________________________ ___________________________________
Oh oh, Houston, we have a problem. While the program was faster at 13 minutes 7 seconds, that’s not even close to what I was expecting. My coding kung-fu is failing me. When you look at the above run, it also shows that 28 minutes were spent in user time, which is ok, because I was using 4 cores, but 22 and a half minutes were spent in system time, this is not good, why would the program spend so much time in system. You could get out the profiler and it will show you exactly where it is spending the time, but I have my own suspicions. In Linux the random functions are not thread safe, and when compiling the program for threading, the random functions serialize their execution, because the function depends on a global variable.
I need to get rid of the random system functions. In the next sample, I just changed the “RandFloat” function to provide an even distribution between low and high values, based on the thread that’s calling it, so each thread gets a slightly different even distribution from the others. Again, this is fine for our tests, but does not simulate a real options pricing program.
Here is the changed program:
MultiThread1x.zip
Lets run it again with the random system function eliminated:
__________________________________________________ ___________________________________
[leo@compute70 MultiThread1x]$ time ./mthread1
=== Option Portfolio Calculations (Threading over number of options Test) ==========
Portfolio size : 1024
Experiments run per item : 1000000
Number of threads started : 4
Thread 0, Running Options 1 to 256 doing 256 iterations.
Thread 1, Running Options 257 to 512 doing 256 iterations.
Thread 3, Running Options 769 to 1024 doing 256 iterations.
Thread 2, Running Options 513 to 768 doing 256 iterations.
Thread 1, Average Call Price : 36.851565
Thread 1, Average Put Price : 2.156646
Thread 0, Average Call Price : 36.558490
Thread 0, Average Put Price : 1.669706
Thread 2, Average Call Price : 36.807067
Thread 2, Average Put Price : 2.153668
Thread 3, Average Call Price : 36.826722
Thread 3, Average Put Price : 2.140491
real 4m41.657s
user 18m35.799s
sys 0m0.126s
__________________________________________________ ___________________________________
Now that’s better. 4 minutes and 41 seconds, using 18 and a half minutes of user time on 4 cores.
With these multi-threading techniques, the time has come down to about 5 minutes, from 19 minutes in the single process example. Hope I’ve given you some ideas on how to incorporate multi-threading in your program. Got to watch out for those system functions though, or anything that will serialize your program execution.
Cheers for now, and happy coding.
Leo Stutzmann
Total Comments 0
Comments
Total Trackbacks 0
Trackbacks
Recent Blog Entries by Bearcat
- Exploring HPC Programming: OpenMP (November 20th, 2008)
- Exploring HPC Programming: Multi-threading Pt. 2 (September 2nd, 2008)
- Exploring HPC Programming: Multi-threading (August 25th, 2008)
- Exploring HPC Programming: Where to start (August 11th, 2008)
- Multi-Core and GPU Background (June 27th, 2008)








