Benchmarking any computer is always useful, but it is not quite as simple as running a few programs and reporting some numbers. Indeed, benchmarking is an art that requires some diligence and attention to detail. To be effective, one must have clear goals and objectives in mind. In addition, interpretation of results must be done within the context of the benchmark. In this article, we will take a look at this process and at how benchmarks can be used to aid system administration without trying to win contests.
The Mother Of All HPC Benchmarks
Mention HPC benchmarking and the first thing that comes to mind is the Top500 list. The actual benchmark is called HPL, which stands for High Performance Linpack. The Linpack benchmark was designed to measure the floating point performance for solving a system of linear equations. If you don't know what that means, not to worry; it is simply one way to measure how much floating point performance a computer can achieve. There are other worthwhile benchmarks, but by virtue of the twice-yearly Top500 list, HPL has become the most famous. As benchmarks go, HPL has many tunable options and thus can take a considerable amount of time to optimize. Also, consider that a single HPL run can take hours or days, so getting a good "HPL number" can take a long time.
For the purposes of this article, we are going to assume that a benchmark is a reproducible rate of performance. For instance, in HPC the FLOPS metric is often used, where FLOPS stands for Floating Point Operations per Second. Clusters are very good at delivering FLOPS, so good that the prefixes Giga (or G) and Tera (or T) are used. One TFLOPS is 1x10^12 floating point operations per second, while one GFLOPS is 1,000 times less, or 1x10^9. The world’s fastest machines are now measured in PFLOPS, where P is Peta or 1x10^15.
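To make the prefixes concrete, here is a minimal Python sketch that scales a raw rate into GFLOPS, TFLOPS, or PFLOPS. The function name `format_flops` and the sample rate are made up for illustration; they do not come from any benchmark package.

```python
# Scale factors for the prefixes discussed above, largest first.
PREFIXES = [
    (1e15, "PFLOPS"),
    (1e12, "TFLOPS"),
    (1e9, "GFLOPS"),
]


def format_flops(rate):
    """Format a rate given in floating point operations per second."""
    for scale, unit in PREFIXES:
        if rate >= scale:
            return f"{rate / scale:.2f} {unit}"
    # Anything below one GFLOPS is reported without a prefix.
    return f"{rate:.2f} FLOPS"


# Hypothetical measurement: 2.5e12 operations per second.
print(format_flops(2.5e12))  # 2.50 TFLOPS
```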
FLOPS is not the only measure of performance for a cluster. In some cases, one may want to measure integer or I/O performance. The Standard Performance Evaluation Corporation (SPEC) is perhaps the best known independent source of benchmarks. SPEC has many benchmarks, including specific sets for MPI and OpenMP. Each SPEC benchmark is actually a suite of programs from which a composite rating is computed. The SPEC benchmarks must be purchased and are mainly used by vendors to report and rank the performance of new computer systems.
One Benchmark Equals One Data Point
Perhaps the biggest flaw in benchmarking is the interpretation of results. Let's take the Top500 list for example. Many people tout the system on the top of the list as the "fastest computer in the world." Unfortunately, they forgot to finish the sentence. The rest of the sentence should read "at running the HPL benchmark." Somehow the press releases do not sound as good with that qualifier.
HPL measures linear algebra performance. If your application does not do much linear algebra, then maybe HPL is not the benchmark you should be considering. Indeed, if you are not doing linear algebra on thousands of nodes, then the Top500 list is basically irrelevant. This is why the suggestion that your own applications are the best benchmark is sage advice.
There are times when application benchmarking is not possible or does not make sense. In these situations, some standard benchmarks can help you gauge various types of cluster designs. It is also good to have a suite of benchmarks with which you are familiar, and to develop a "feel" for how they should work on various pieces of hardware. A single benchmark number, like HPL, is useful, but a broader measure may be more valuable.
Performance, Baseline, or Burn-in
The intent of the benchmark is just as important as the benchmark itself. There are three main reasons to run a benchmark.
- Performance Optimization - In this mode, you want to achieve the best and fastest performance. HPL falls into this category. This type of benchmarking can take a large amount of time because very often one variable is changed (a program or compiler option) and the benchmark is re-run. If you are trying to win a contest, like the Top500, then the "90/10 rule" usually holds. That is, the last 10% of performance is going to take 90% of the effort. Other than press releases, contest numbers don't help end users. Reserving the whole cluster for two weeks to get a Top500 number usually does not sit well with end users either.
Application performance optimization is a much more useful endeavor. Very often, results that are "good enough" are acceptable to end users. That last 10% is just not worth the effort. These types of benchmarks are usually run without using a scheduler or sharing nodes. As will be discussed below, multi-core nodes pose an interesting challenge when reporting performance for clusters. Often these results are reported as a percent of peak, that is, the performance compared to all floating point hardware in the processor working 100% of the time. In actuality, programs never hit the peak numbers because they need to do more than floating point calculations, but some get pretty close.
- Baseline Performance - Baseline benchmarking is perhaps the most important type of benchmark a cluster administrator can perform. The concept is very simple: run a set of benchmarks on the current cluster configuration and tuck them away. After an upgrade (hardware or software) is performed, re-run the benchmarks and see if things are better, worse, or the same. Upgrades do not automatically mean better performance. For instance, if you perform an OS upgrade, re-running your benchmark suite is important. And, like the previous runs, keep them for future reference. Of course, if you just upgrade an MPI library, re-run only the tests that use MPI.
These benchmarks are usually run through a batch scheduler, because the results should mirror an environment similar to that of the end user. Baseline performance records are perhaps one of the best ways to troubleshoot problems. If something is not working correctly, running some tests and comparing numbers is an easy way to identify problems. If you don't know how it "should" work, how do you know what is broken?
- Burn-in and Confidence Testing - These tests are similar to baseline testing, but the goal is different. Basically, the idea is to put the cluster through its paces. You want the cluster to do as much as it can at the same time -- preferably a mixture of things. The goal is to see if anything breaks. Note that running a few codes may not be an effective way to test the cluster. Just because a program finishes does not mean the results are correct or that it is running optimally.
One good way to do this type of benchmarking is to first generate baseline numbers, then load the queue with a variety of jobs and see if the numbers change under a varying heavy load.
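The baseline idea (record numbers, re-run after a change, and see whether things are better, worse, or the same) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name, the 5% tolerance, the benchmark labels, and all of the numbers are invented, and every metric is assumed to be a rate where higher is better.

```python
def compare_to_baseline(baseline, current, tolerance=0.05):
    """Classify each benchmark as 'better', 'worse', 'same', or 'missing'.

    Both arguments map a benchmark name to a rate where higher is better.
    A relative change smaller than `tolerance` counts as 'same'.
    """
    report = {}
    for name, base in baseline.items():
        if name not in current:
            report[name] = "missing"
            continue
        change = (current[name] - base) / base
        if change > tolerance:
            report[name] = "better"
        elif change < -tolerance:
            report[name] = "worse"
        else:
            report[name] = "same"
    return report


# Made-up numbers for illustration only.
baseline = {"memory_MBps": 10000.0, "network_MBps": 1100.0, "hpl_GFLOPS": 850.0}
after_upgrade = {"memory_MBps": 10100.0, "network_MBps": 900.0, "hpl_GFLOPS": 950.0}

for name, verdict in compare_to_baseline(baseline, after_upgrade).items():
    print(f"{name}: {verdict}")
```

The same comparison works for burn-in testing: use the unloaded baseline as the first argument and the numbers collected under heavy load as the second. The tolerance should be chosen from the run-to-run variation you actually observe on your cluster, not from the default used here.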
Open Benchmarks
If there were ever a case for open source, it is with benchmarks. In order to be fair and minimize cheating, the source code should be available to those who run the benchmarks and those who want to repeat and verify them. To properly document and report a benchmark result, you should reference the source code version, the steps that were used to build the binary files (compilers and compiler options), data sets, and any run-time options used.
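One way to keep that documentation next to the numbers is to store it in a small structured record. The following Python sketch is only a suggestion; the class name, field names, and every value shown are placeholders invented for the example and do not describe a real benchmark run.

```python
import json
from dataclasses import dataclass, field, asdict


@dataclass
class BenchmarkRecord:
    """Everything needed to repeat and verify one benchmark result."""
    benchmark: str
    source_version: str
    compiler: str
    compiler_options: str
    data_set: str
    runtime_options: str
    results: dict = field(default_factory=dict)

    def to_json(self):
        # Sorted keys keep saved records easy to diff against each other.
        return json.dumps(asdict(self), indent=2, sort_keys=True)


# Placeholder values for illustration only.
record = BenchmarkRecord(
    benchmark="example-benchmark",
    source_version="1.0",
    compiler="example-cc 1.2.3",
    compiler_options="-O2",
    data_set="small-test-input",
    runtime_options="4 processes",
    results={"rate": 123.4},
)
print(record.to_json())
```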
There are many open benchmarks. A list of open cluster/HPC benchmarks can be found at Cluster Tweaks. Note: Cluster Tweaks is a community wiki sponsored by ClusterMonkey.net. Registered users can contribute their experience and knowledge to the community.
In general terms, there are two levels of benchmarking -- micro and macro. Micro-benchmarks usually measure one aspect of the cluster (processor, memory, network, disk) and ignore all other parts. In addition, micro-benchmarks can put upper limits on expected performance levels, because real applications usually do not push one aspect of the cluster at 100% load all the time. Collecting micro-benchmark data is important when pinpointing problems in your cluster. The following table is a non-exhaustive list of some common micro-benchmarks that are useful for HPC clusters.
[TABLE="head"]
Micro Benchmark|Measures
Bonnie++|Hard drive performance
Stream|Memory performance
Netperf|General network performance
Unixbench|General Unix benchmarks
LMbench|Low level system benchmarks
Netpipe|Detailed network performance
Intel MPI Benchmarks|Low/High level MPI benchmarks
[/TABLE]
Table One: Micro-benchmarks for clusters
In contrast to micro-benchmarks, macro-benchmarks (or real applications) tell a truer story of how the cluster is working. These benchmarks use multiple aspects of the cluster to run a real application (or something that resembles a real application). The HPC Challenge suite is included in this list, but it is really a collection of micro and macro benchmarks. Overall, it does exercise large aspects of the cluster. Your applications fall into the macro category, and running them with a known data set is a great way to track cluster performance. The NAS parallel benchmarks are self-validating; that is, they check that the answer is right. This check is important, as fast wrong answers are just as wrong as slow ones. The table below provides some good macro benchmarks. As these are fairly common benchmarks, you may find results from clusters posted on the Internet. As with the micro-benchmarks, this is not an exhaustive list.
You will notice that HPL is not on my list. It is part of the HPC Challenge Suite, however. The reason is that a good HPL number can require a large amount of work, and for best results the run should be sized to the number of nodes and the total amount of memory. Once this is done, it can be used as a baseline test, but there is no real need to "over optimize" the HPL portion of the Challenge Suite.
Another type of benchmark not mentioned is parallel I/O. Because parallel I/O is so application specific, it is hard to have a set of benchmarks that captures the variation across all of these systems. With this in mind, it is possible to measure some aspects of parallel I/O as described in A Benchmark for Parallel File Systems on ClusterMonkey.net.
There are some wrapper scripts that help run multiple benchmarks. Have a look at Cbench or the Beowulf Performance Suite (BPS) if you want to automate some of your benchmarking. One nice feature of the BPS is the ability to create HTML output pages for the results. An example of the output can be found here. (Full disclosure: I helped create the BPS suite and now maintain this package.)
Remember Your Statistics
When talking about benchmarks, it is important to remember statistics. Any time you take a measurement, random variation comes into play. Using basic statistics is a way to measure and report this variation. The first step in good statistical benchmarking is to do repeated runs. If you want your benchmarks to have meaning, you should run each test a minimum of three times (five is much better) and report the average and standard deviation (a measure of the spread of the results). Almost all spreadsheets have standard deviation functions, or you can include the calculation in a wrapper script. Of course, multiple benchmark runs require more time, which is why good benchmarking data is hard to find. With today's processors, single runs are not much use because shared caches, throttling, and power-saving methods may skew individual results.
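If you prefer a wrapper script to a spreadsheet, the calculation takes only a few lines with Python's standard `statistics` module. The helper name `summarize` and the five sample results below are invented for illustration; `statistics.stdev` computes the sample standard deviation, which is the usual choice for a handful of repeated runs.

```python
import statistics


def summarize(runs):
    """Return (mean, sample standard deviation) for repeated runs."""
    if len(runs) < 3:
        raise ValueError("run each test at least three times")
    return statistics.mean(runs), statistics.stdev(runs)


# Five made-up results from repeated runs of the same test.
runs = [98.0, 100.0, 102.0, 101.0, 99.0]
mean, stdev = summarize(runs)
print(f"{mean:.1f} +/- {stdev:.1f}")  # 100.0 +/- 1.6
```

A standard deviation that is large relative to the average is a hint that something other than the benchmark itself (other jobs, throttling, power management) is influencing the runs.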
You May Never Know The Full Story
One more issue needs to be mentioned: an often neglected or forgotten aspect of multi-core processors. You can read a longer description of the problem in Good Enough Will Have To Do, but the following example illustrates the point.
Assume you have a program that runs well using a single core. On a shared multi-core node, you have no control over what else is happening on the other cores. The best you can do is to have exclusive use of the whole socket (processor), but you usually don’t, because most nodes now have a minimum of eight cores. If you share the node, your performance can be reduced by other programs running on the same node and using the same interconnect. If you use a node exclusively, then you can make some statements about top speed. However, if your program can use eight cores, then it is parallel, and a parallel program may run best on eight individual nodes (one core each) because of memory contention and cache use. Therefore, if you spread your application across nodes, you are back to sharing a processor with other applications, unless you have reserved one core on each of eight nodes, which is not a very economical thing to do. Unless you have full non-shared access to a multi-core cluster (as is done with HPL runs), there is no way to ensure you are getting the best performance.
Your Turn
Benchmarking is an important and useful tool for any computer system. Benchmarks for HPC clusters are actually a good way to ensure that things are working and that any software or hardware changes are helping and not hurting performance.
It is important to understand that it is impossible to test everything and still have time for user applications. On a smoothly running cluster, a savvy administrator might drop a benchmark in the queue once in a while just to make sure things are working properly. And since you are an adroit system administrator, you can compare the results with the baseline numbers recorded earlier. Of course, if you don't keep tabs on how things are running, end users will always let you know!