In this section, I share more about selecting components of a cluster, what to look out for and some best practices.
Something that is very often forgotten is that building or buying a cluster is not just about the lowest cost or the most CPUs.
It is about building a BALANCED cluster that will solve your compute needs.
Balanced Cluster Design
Imagine having a high-performance racing car, but the road that you drive on is a two-lane expressway that is forever congested (fast processor but low network bandwidth). Or an 8-lane super highway on which you drive a big, slow truck laden with heavy goods (high bandwidth but slow processor or slow file server).
In HPC, you want to match processor performance to:
- Memory
- I/O
- Interconnect performance
And consider metrics such as:
- bytes of memory to FLOPS
- memory bandwidth to FLOPS
- interprocess communications bandwidth to FLOPS
- disk I/O bandwidth to FLOPS
- memory and interprocess communications latencies
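As a rough illustration of these ratios, the sketch below divides each resource by the peak floating-point rate of a node. The `NodeSpec` class, the `balance_ratios` function and all the numbers are hypothetical and chosen only for illustration; they are not measurements of any real system.

```python
from dataclasses import dataclass


@dataclass
class NodeSpec:
    """Hypothetical per-node figures; replace with your own measurements."""
    peak_gflops: float          # peak floating-point rate, GFLOPS
    memory_gb: float            # installed memory, GB
    memory_bw_gbs: float        # memory bandwidth, GB/s
    interconnect_bw_gbs: float  # interprocess communication bandwidth, GB/s
    disk_bw_gbs: float          # disk I/O bandwidth, GB/s


def balance_ratios(node: NodeSpec) -> dict:
    """Return each resource divided by peak FLOPS.

    The giga prefixes cancel, so the capacity ratio is in bytes per FLOPS
    and the bandwidth ratios are in bytes per floating-point operation.
    """
    flops = node.peak_gflops
    return {
        "memory_bytes_per_flops": node.memory_gb / flops,
        "memory_bw_bytes_per_flop": node.memory_bw_gbs / flops,
        "interconnect_bw_bytes_per_flop": node.interconnect_bw_gbs / flops,
        "disk_bw_bytes_per_flop": node.disk_bw_gbs / flops,
    }


# Illustrative numbers only.
example = NodeSpec(peak_gflops=80.0, memory_gb=16.0, memory_bw_gbs=20.0,
                   interconnect_bw_gbs=1.25, disk_bw_gbs=0.1)
for name, value in balance_ratios(example).items():
    print(f"{name}: {value:.5f}")
```

Comparing these ratios across candidate node configurations gives a quick feel for which one is starved of memory, network or disk relative to its processors. Latencies are not covered by this sketch and need to be measured separately.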
3.1 Cluster Software
Operating Systems
Today, for commodity Beowulf clusters, you have the choice of Linux or Windows (yes, Windows HPC Server 2008 is out, and feedback so far is that it is stable and relatively OK for HPC usage).
But Linux is the preferred OS for most Beowulf clusters today, be it in academic, government or commercial organizations. This is not surprising since the very first Beowulf clusters were built on Linux back in the mid-1990s.
Unfortunately (or fortunately depending on your opinions) with Linux there is a large number of Linux distributions to choose from, and this has often caused many debates and many variants of Linux HPC toolkits.
I would, however, recommend a mainstream Linux distribution such as Red Hat Enterprise Linux (RHEL) or its variants such as CentOS, or SUSE or openSUSE, only because you will most likely use commercial ISV software that is supported only on these distributions. These are also the most likely approved OSes when you are out there working in the commercial HPC world.
For cluster management, use an HPC-specific toolkit. Managing 1 server is easy; managing 128/512/1024 nodes without an HPC toolkit is just unforgivable. Today you have the choice of Kusu, Rocks, OSCAR etc.; just choose the one whose philosophy you agree with, which you find easiest to use, or whose community you like the most. All of them work and are productivity boosters.
Compilers
You will buy the fastest CPU and HDD, so why skimp a few dollars on a decent compiler? GCC is excellent for many codes, but I have personally seen tremendous increases in code performance from a simple recompile of a C/C++ code with the Intel compilers. So investigate a little (Intel and PGI offer free trial licenses) and choose the one that works for you. Improving the performance of your code on a 64-node cluster by 50% means you may have spent the cost of 1 node on the compiler, but you have avoided buying the 32 extra nodes otherwise needed to achieve the same performance, a net saving of about 31 nodes. That is something your CFO will agree makes sense.
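The arithmetic behind this example can be sketched as follows. It assumes, as an idealisation, that performance scales linearly with node count and that the compiler licence costs roughly as much as one node; both function names are hypothetical.

```python
def extra_nodes_for_speedup(nodes: int, speedup_fraction: float) -> float:
    """Nodes you would have to add to match a given code speedup,
    assuming performance scales linearly with node count."""
    return nodes * speedup_fraction


def net_nodes_saved(nodes: int, speedup_fraction: float,
                    compiler_cost_in_nodes: float = 1.0) -> float:
    """Extra nodes avoided, minus the compiler cost expressed in nodes."""
    return extra_nodes_for_speedup(nodes, speedup_fraction) - compiler_cost_in_nodes


# A 50% speedup on a 64-node cluster, compiler priced at about one node.
print(net_nodes_saved(64, 0.5))
```

With these inputs the speedup is worth 32 nodes, so the net saving after paying for the compiler is 31 nodes. Real codes rarely scale perfectly, so treat the result as an upper bound.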
Parallel or Distributed Toolkits
MPI is the standard way to do parallel computing. There are now two popular open source MPI toolkits: MPICH and Open MPI.
You should also investigate newer technologies such as Unified Parallel C (UPC).
From the non-HPC space, popular distributed toolkits such as Platform Symphony (a free Symphony Developer edition is available) are commonly used in the financial domain. We think its usefulness extends beyond finance, and researchers are exploring the use of Symphony in non-financial areas.
Maths Libraries
Never ever code your own common numerical routines. Make use of proven and time-tested ones like the popular open source LAPACK, ScaLAPACK, BLAS etc. For higher performance, explore commercial libraries from Intel, AMD and OptimaNumerics.
Batch Schedulers
Always use a batch scheduler to submit your jobs, yes, even if you are the only one using the cluster. Most batch schedulers today are able to re-queue and re-run failed jobs. This alone is enough reason to use one, as it frees you from having to monitor your jobs and manually re-run them. This is especially so when a job fails at 8pm after you have left your office and you need the result the next morning.
Typical benefits of using a batch scheduler:
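To make the re-queue idea concrete, here is a toy sketch in plain Python. It is not the API or behaviour of any real batch scheduler; `run_with_requeue`, `make_flaky` and the `max_attempts` limit are hypothetical names invented for this illustration.

```python
from collections import deque


def run_with_requeue(jobs, max_attempts=3):
    """Run each job; put failed jobs back on the queue until they
    succeed or have been attempted max_attempts times.

    jobs: dict mapping a job name to a callable that returns True on success.
    Returns (finished, failed): finished is a list of (name, attempts),
    failed is a list of names that never succeeded.
    """
    queue = deque((name, 1) for name in jobs)
    finished, failed = [], []
    while queue:
        name, attempt = queue.popleft()
        try:
            ok = bool(jobs[name]())
        except Exception:
            ok = False  # a crash counts as a failed attempt
        if ok:
            finished.append((name, attempt))
        elif attempt < max_attempts:
            queue.append((name, attempt + 1))  # re-queue for another try
        else:
            failed.append(name)
    return finished, failed


def make_flaky(failures):
    """Build a job that fails the given number of times, then succeeds."""
    state = {"left": failures}

    def job():
        if state["left"] > 0:
            state["left"] -= 1
            return False
        return True

    return job


finished, failed = run_with_requeue(
    {"stable": lambda: True, "flaky": make_flaky(2), "broken": lambda: False}
)
print(finished, failed)
```

A real scheduler does far more (resource matching, priorities, node health checks), but the core benefit is the same: the retry loop runs without you watching it.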
- fair share of cluster resources
- ability to prioritize jobs (more important jobs complete first)
- charge-back
- improved monitoring of cluster usage
- improved cluster utilization (assuming there are jobs running!)
Popular batch schedulers include:
Commercial: Platform LSF, PBSPro, N1GE
Free/Opensource: Lava, SGE, Condor, Torque
3.2 Cluster Hardware
Beowulf clusters started out with PC-class machines, with researchers putting together a small cluster of 16 - 64 PCs. But as cluster sizes grew over time, server-class machines were used for their smaller footprint, increased reliability and serviceability.
Today, typical clusters are made of 1U or 2U servers or blade servers.
My recommendation is to go for proven tier-1 OEM systems or well-known motherboard/system vendors. When you are running a production cluster of 1024 nodes, expect node failures, and you want to keep the failure rate low!
3.3 Cluster Interconnects
Most small clusters have only one network and this is typically Gigabit Ethernet (GE). For larger clusters it is common to see i) an administration network, typically GE, ii) an HPC network, usually one of InfiniBand, Quadrics or Myrinet, and iii) an out-of-band network.
The focus of this section is on the HPC Network.
First, determine whether your code will benefit from an HPC network. Most embarrassingly parallel codes do not, and you would be better off spending the money on more compute nodes.
For bandwidth-sensitive or IO-intensive codes, using InfiniBand, Quadrics or Myrinet would help tremendously. As in my analogy earlier, there is no point having a fast sports car if you are always traveling on a congested road.
I no longer track the performance characteristics of these HPC networks, but you would typically get in excess of 10 Gb/s bandwidth, MPI latencies of 1 - 3 µs and very low CPU overheads, as these HPC network adapters are typically mini-computers themselves and provide TCP/IP and OS stack bypass.
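A common first-order way to reason about these figures is the latency-plus-bandwidth model, in which the time to send a message is the latency plus the message size divided by the bandwidth. The sketch below applies that model; the Gigabit Ethernet latency of 50 µs and the other figures are assumptions picked for illustration, not measurements.

```python
def transfer_time(message_bytes, latency_s, bandwidth_bytes_per_s):
    """First-order model: time = latency + size / bandwidth."""
    return latency_s + message_bytes / bandwidth_bytes_per_s


# Assumed, illustrative figures (not benchmarks of real hardware).
NETWORKS = {
    "gigabit_ethernet": {"latency_s": 50e-6, "bandwidth": 125e6},   # 1 Gb/s
    "hpc_interconnect": {"latency_s": 2e-6, "bandwidth": 1.25e9},   # 10 Gb/s
}

for size in (8, 1024, 1_000_000):
    for name, net in NETWORKS.items():
        t = transfer_time(size, net["latency_s"], net["bandwidth"])
        print(f"{size:>9} bytes over {name}: {t * 1e6:.2f} us")
```

Small messages are dominated by latency and large ones by bandwidth, which is why the message sizes your code actually sends matter when you choose a network.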
Which HPC network is the best?
For that answer - you would have to know the characteristics of your code. Best to run your applications on the network you intend to buy - beg and borrow resources from your friends and colleagues in the community or the vendor to test.
In real-world MPI codes, Quadrics is typically the fastest due to its SHMEM capabilities. Quadrics also has excellent MPI management software in the form of RMS. People in the know are willing to pay the premium for Quadrics.
3.4 Cluster Filesystems
I have been advocating the use of parallel cluster filesystems since 2001, when we first integrated PVFS1 into our cluster management toolkit.
General observations:
- clusters are getting bigger
- IO is often neglected or a last-minute addition
- IO often becomes the bottleneck of a cluster (not the CPU!)
- NFS just does not scale!
- NAS/SAN have their limitations and were not designed for HPC use in the first place
NFS/NAS observations:
- shared storage and shared data
- use of single filer head
- multiple NFS/NAS heads are typically used for large clusters; however, issues such as the lack of a single namespace and poor manageability (/home1, /home2....) are a concern.

SAN observations:
- provides shared storage, but to be useful in an HPC cluster where data needs to be globally read/write (and not restricted to individual LUNs as per standard SAN architecture) a SAN filesystem is required, which adds cost
- not practical to connect SAN storage directly to compute nodes, as it is both expensive and non-scalable
- hence IO nodes are often used with backend SAN storage, which degrades performance (GFS -> NFS type conversions)

HPC Cluster Filesystems observations:
Most popular today are: Lustre, PVFS2 and GPFS. With the Red Hat HPC Solution, people may explore Red Hat GFS further as a mid-range cluster filesystem; since you have already paid for it, you might as well use it.
Such parallel cluster filesystems use the same concept as Beowulf clusters. You put together many commodity servers with lots of HDDs (some use SAN storage for better reliability) and use software to present a unified filesystem to the cluster, generally known as a Global Namespace Filesystem.
What is important to note in such an architecture is the scaling of file IO in TWO dimensions:
- increased CAPACITY from adding more HDDs to the nodes or by adding more IO nodes
- increased BANDWIDTH from adding more IO nodes; for example, a cluster filesystem with 32 IO nodes has 32 GE NICs pumping out data up to 32 times faster than a single NFS fileserver
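The bandwidth scaling can be put into numbers with a small sketch. It assumes perfect striping across IO nodes, one Gigabit Ethernet NIC per IO node and no protocol or disk overhead, so the results are ideal upper bounds; the function names and the 1 TB dataset are hypothetical.

```python
def aggregate_bandwidth_bytes_per_s(io_nodes, nic_gbit_per_s=1.0):
    """Ideal aggregate bandwidth: one NIC per IO node, perfect striping."""
    return io_nodes * nic_gbit_per_s * 1e9 / 8  # bits/s -> bytes/s


def read_time_s(dataset_bytes, io_nodes, nic_gbit_per_s=1.0):
    """Ideal time to stream a dataset out of the filesystem."""
    return dataset_bytes / aggregate_bandwidth_bytes_per_s(io_nodes, nic_gbit_per_s)


dataset = 1e12  # a hypothetical 1 TB dataset
single = read_time_s(dataset, io_nodes=1)     # single NFS fileserver
parallel = read_time_s(dataset, io_nodes=32)  # 32 IO nodes
print(f"single server: {single:.0f} s, 32 IO nodes: {parallel:.0f} s, "
      f"ratio: {single / parallel:.0f}x")
```

In practice the speedup is lower, because clients, switches and disks each impose their own limits, but the scaling with the number of IO nodes is the point.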

3.5 Some non-technical things to look out for
This section covers issues that are not specific to HPC, but which I find important to address as I have encountered many such problems in the deployment of HPC clusters.
Environment
So you have purchased your brand new cluster - but is your office ready for the cluster?
NEVER put your cluster in your office or your graduate student's office!
A cluster needs:
- Space
  - Density vs costs
  - SMP saves space but is costly
  - Rack saves space but is costly
- Power and Cooling
  - Ensure sufficient power and cooling for your cluster
  - Get a certified PE or experienced HPC SI or data centre engineer to assist
  - Do note that some MIS data centre engineers do not fully comprehend the requirements of a highly dense HPC cluster, as their typical environment is very UNLIKE ours
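For a first estimate of the power and cooling requirement before you talk to an engineer, a back-of-the-envelope calculation like the one below can help. It assumes that all electrical power drawn ends up as heat, uses the conversions 1 W ≈ 3.412 BTU/hr and 1 ton of refrigeration = 12,000 BTU/hr, and takes a hypothetical draw of 350 W per node; it is no substitute for a proper site survey.

```python
WATTS_TO_BTU_PER_HR = 3.412    # 1 W is about 3.412 BTU/hr
BTU_PER_HR_PER_TON = 12000.0   # 1 ton of refrigeration = 12,000 BTU/hr


def power_and_cooling(nodes, watts_per_node, overhead_fraction=0.0):
    """Estimate electrical load and the heat that must be removed.

    overhead_fraction covers switches, storage and so on as a fraction
    of the compute-node load (an assumption you must supply yourself).
    """
    total_watts = nodes * watts_per_node * (1.0 + overhead_fraction)
    btu_per_hr = total_watts * WATTS_TO_BTU_PER_HR
    return {
        "kw": total_watts / 1000.0,
        "btu_per_hr": btu_per_hr,
        "cooling_tons": btu_per_hr / BTU_PER_HR_PER_TON,
    }


# Hypothetical 128-node cluster drawing 350 W per node.
estimate = power_and_cooling(nodes=128, watts_per_node=350.0)
print({key: round(value, 2) for key, value in estimate.items()})
```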
Planning for your cluster
Make sure you consider the following factors in the diagram in your planning. Even very experienced HPC engineers have forgotten about floor loading and projects have been delayed for up to 6 months while building contractors were called in to reinforce the data centre floor.
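A very rough floor loading check can be done before the building contractors get involved. The sketch below simply divides the mass of a loaded rack by its footprint and compares the result with the rated floor capacity. The rack mass, footprint and floor rating are hypothetical, and a structural engineer will also consider point loads and how the weight is spread, which this ignores.

```python
def floor_load_kg_per_m2(rack_mass_kg, footprint_width_m, footprint_depth_m):
    """Average load of one rack over its own footprint."""
    return rack_mass_kg / (footprint_width_m * footprint_depth_m)


def exceeds_rating(rack_mass_kg, footprint_width_m, footprint_depth_m,
                   floor_rating_kg_per_m2):
    """True if the average rack load is above the rated floor capacity."""
    load = floor_load_kg_per_m2(rack_mass_kg, footprint_width_m, footprint_depth_m)
    return load > floor_rating_kg_per_m2


# Hypothetical fully loaded rack: 900 kg on a 0.6 m x 1.0 m footprint,
# checked against a hypothetical floor rating of 1000 kg per square metre.
print(round(floor_load_kg_per_m2(900.0, 0.6, 1.0)))
print(exceeds_rating(900.0, 0.6, 1.0, floor_rating_kg_per_m2=1000.0))
```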

Budget
You need to know what your budget is.
- Do not plan for, and ask vendors to quote for, a cluster size which you can never afford. It destroys your credibility.
- Do some research on estimated prices, work with your vendors and 'hint' at your budget to them.
- This would help both your vendors and yourself in coming to a balanced solution.
Never spend money on problems you won't have!
- Design the HPC system to suit your applications. For example, do you need InfiniBand? (This is where you need to know your applications/code.) Or do you need more than 4GB RAM on your compute node?
- If your code uses MPI, consider an HPC interconnect, as that would probably get you more bang-for-the-buck than more compute nodes.
Never solve problems you can't afford to solve
- Having multi-million simulations, time-steps or cells is ideal, but at what cost? Can you make do with less accuracy, or explore other innovative algorithms that would require less compute power (those within your means)?
Buying an HPC cluster is not just about server hardware! It includes:
- Interconnects
- Operating systems and cluster management software, schedulers, tools
- Services to design, implement and even manage a cluster
- Ongoing support
- Electricity and Cooling
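To see how these items add up against a budget, a simple cost roll-up such as the sketch below can be useful. Every figure in it is a made-up placeholder, the `total_cost` function is hypothetical, and the `cooling_overhead` factor (extra electricity spent on cooling per unit of IT load) is an assumption you should replace with figures from your own facility.

```python
def total_cost(hardware, interconnect, software, services,
               support_per_year, power_kw, price_per_kwh, years,
               cooling_overhead=0.5):
    """Roll up purchase and running costs over the cluster's lifetime."""
    hours = years * 365 * 24
    electricity = power_kw * (1.0 + cooling_overhead) * hours * price_per_kwh
    support = support_per_year * years
    total = hardware + interconnect + software + services + support + electricity
    return {
        "support": support,
        "electricity_and_cooling": electricity,
        "total": total,
    }


# Placeholder figures in an arbitrary currency, for illustration only.
costs = total_cost(hardware=500_000, interconnect=100_000, software=50_000,
                   services=40_000, support_per_year=30_000,
                   power_kw=44.8, price_per_kwh=0.10, years=3)
for item, value in costs.items():
    print(f"{item}: {value:,.0f}")
```

Even with placeholder numbers, the exercise shows that the server hardware is only part of the total, and that electricity and cooling over a few years can rival the cost of the interconnect.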
Selecting a HPC vendor/partner
An HPC cluster system is very complex, although many vendors want you to think it is a simple plug-and-play appliance.
- You need a long-term partner that can support you for your current and future HPC requirements.
- HPC is a niche solution requiring highly skilled experts and consultants (you need to pay a premium for the real experts)
- You want an HPC partner, not a hardware supplier!
This is the end of the KUSU 100 series. I hope this series has given newcomers to HPC a good overall understanding of HPC and what goes into selecting, purchasing and managing an HPC Beowulf cluster.
Happy Clustering!