-
May 29th, 2009 04:44 AM #1
So You Want To Build A Cluster: Five Things to Consider Before You Start
Before we begin, let's be clear on one point. This is not another "how to build a cluster recipe" or article. It is what to know and what you should think about before you build your cluster. Perhaps you are familiar with Linux and X86 hardware, but without a good cluster background many of the ideas or methods may seem mysterious. Maybe you have already looked at several of the "how to build a cluster" articles and noticed they all seem to be doing things a bit differently. Not to worry. There is nothing really new or secret about clusters, although building, running, and maintaining them does require basic skills. Good Linux/Unix administration experience is very helpful, as is a good understanding of networking and storage issues. If you don't have these skills you will certainly get a chance to develop them building a cluster! Let's begin by considering the building blocks of a cluster system.
1. Clusters Are A System
If there is one thing to remember about clusters it is they are large systems composed of many individual parts. The goal is to use all of the parts to solve a single problem (i.e. run a program that uses all the component parts at the same time). All the parts must function as intended for the system to work as a "whole." This aspect is what separates HPC clustering from many other forms of computing -- and it is what makes it so exciting. From a hardware standpoint, clusters have three major components:
- Servers make up the bulk of a cluster. Servers are usually rack-mounted or blade-based systems, and are often referred to as nodes in the cluster. The most popular servers are commodity-based Intel and AMD systems, but other platforms can be used as well. The choice of processor depends on your needs and budget. There can be several types of servers used in a cluster and the most numerous are the compute nodes. These nodes are where the computing is done. Compute nodes may either have local hard drives or be "diskless" systems. Another type of server is a head node, which serves several purposes. It is a login node for the users (users do not log in to compute nodes), and it is an I/O node that serves file systems (e.g. NFS) to the compute nodes. Finally, it can be an administrative node, to which jobs are submitted to a batch scheduling system, workflow is monitored, and other administrative tasks are performed. All the various tasks can take place on a single head node server, or they can broken out into individual servers depending on how large the cluster is. In general, the head node does not assist the compute nodes with computation.
- Network(s) are what connect the servers. In general, all nodes (compute and head node) can talk over the private network to any other server. At a minimum there is one Ethernet based cluster network that runs throughout the cluster. Cluster networks are usually only accessible from inside the cluster and any access to nodes is through the head node. A cluster can have additional networks as well, depending on its design. For instance, InfiniBand or 10 Gig Ethernet are often used as a high performance network within the cluster. A slow network may prevent the compute nodes from working at their peak potential.
- Storage is often the forgotten component of a cluster. Quite often storage is part of the head node, which may employ RAID sets or another single source storage option (Network Attached Storage, NAS). Larger clusters often use parallel file systems (a multiple source option) that can support many compute nodes writing to the same file at the same time. If the compute nodes must wait for I/O, they may not be working at their full potential.
For a cluster to work successfully all three systems need to work together. Using the latest and greatest compute nodes with a slow network may not provide the expected performance level. Likewise, if you have applications that generate large amounts of I/O, a single head node with a few disks may be a bottleneck in your performance.
At a minimum a fully functioning cluster can be built with a head node, some storage, a Gigabit Ethernet network, a gigabit switch, and two or more compute nodes (with or without disk drives). Figure One is an example of a "typical" cluster design. Note the use of two networks -- one for compute traffic between nodes and one for file (NFS) and administrative (batch scheduler) functions. We will talk more about the software below.
Figure One: Typical Cluster Design
2. The Network(s) Is What Holds The Cluster Together
The compute nodes would be isolated systems were it not for the network. The network moves data to and from various nodes, and in HPC clustering, the network is of critical importance. An HPC program uses multiple compute nodes to run a single program. This single program actually exists as many sub-programs running concurrently on the compute nodes. All the sub-programs are linked together over the network. Because the network "feeds" the individual sub-programs, a poorly designed or functioning network can starve and stall the compute nodes.
The most common form of networking is Ethernet, which is virtually everywhere. Various software layers are used to implement networking over Ethernet. These software layers were designed for robust use and will function even in mis-configured and spotty networks. The end user is often unaware of how the network is functioning because most interaction takes place at a very high level. With HPC computing, networks need to be optimized and working at peak capacity; otherwise there can be program bottlenecks or failures. For this reason, a good understanding of Linux/Unix networking is important.
From a user's standpoint there two areas that need attention. The software mechanics of configuring/starting/stopping kernel-networks (TCP/IP) and user-space networks is an issue you will deal with from the very start. (Note: kernel-networks are TCP/UDP/IP based connections that are managed by the Operating System (OS) kernel, the faster user-space networks are managed by the user and work outside the kernel.) Secondly, understating the characteristics that define network performance is equally as important. Network communications can be characterized by throughput vs packet size (how fast the data moves), latency (how long it takes to send the smallest amount of data), and N/2 (the packet size at which the throughput is half its maximum). There are other factors to consider as well, but these three are the most important when evaluating a network without running any application benchmarks. Figure Two illustrates the relationship between throughput and N/2 for a network.
Figure Two: Definition of Throughput and N/2 rate
Throughput is the most often quoted performance metric for a network. The maximum value is usually given with no regard to the fact that throughput varies by the payload or packet size (see Figure One). If you want to send 1,000 bytes the effective throughput may be quite different than if you want to send 10,000 bytes. Latency effects throughput and is often an important criterion in HPC networks. One way to think about latency is the number of short messages that can be sent in a given amount of time. If the sub-programs running on the cluster need to send many short messages, then a high latency per message means that a smaller number of messages can be sent (which means less effective throughput). The lower messaging rate means more time waiting by the cores, which means slow performance. The N/2 is also an interesting measure of performance as it tells us how fast the throughput curve rises. The most desirable (and expensive) networks have high throughput rates, low latency, and small N/2 values. Remember that all networks require a switch of some sort and the switch may reduce throughput and increase latency and N/2 values. There are ways to measure many of these characteristics using benchmark software (See Below).
Finally, another issue that needs to be considered with regard to networking is multi-core processors. The advent of multi-core has increased the load on the node-to-network connection because instead of servicing the needs of one or two cores, a typical compute node may have four or eight cores. It is therefore important to keep in mind the number of cores per node and the expected amount of network traffic generated by the cores.
3. Software, Software, Software
Before we discuss software, it is interesting to note that almost all clusters are built using GNU/Linux (other operating systems include Unix, Apple OS, and Windows). The advantage to using Linux is that the entire software infrastructure (the plumbing) is open. Indeed, the rapid growth of Linux HPC clusters is largely due to the open nature of the software. The types of software needed for a cluster fall into three categories. First there is the base operating system (OS) software suite (e.g. Red Hat, Fedora, CentOS, SuSE, etc.) Next, a collection of cluster software is needed, things like libraries, compilers, batch schedulers, and monitoring software. Finally, there is application software that includes end user applications. Depending on your needs it is possible to build an entire cluster using open software -- include application software. There are, however, commercially supported versions of all software needed to build a cluster. Some of the supported software is open source and some is closed source -- it all depends on your needs.
Without the right software, a cluster is just a pile of hardware. Even if an OS is installed on each node, something is still needed to "get the nodes to work together." That something is the MPI (Message Passing Interface) layer or API (Application Programming Interface). MPI is a software library for C, C++, and Fortran programs that allows the sub-programs mentioned above to send information to each other. An MPI program is therefore a collection of sub-programs that work together to solve a big problem. The problem may be too big computationally to complete in a reasonable time without utilizing multiple compute nodes, or it may be too big in size to fit onto one compute node, or a combination of both. The important thing to remember is that MPI is standard way for compute nodes to talk to one another (or more properly put "a standard way for a core on a compute node to talk to any other core on the same node or on a different node"). Programs that do not use MPI can only work on one core unless they use something like OpenMP or ptheads, but in either case they are still limited to the cores on one motherboard. And here is a final important note: unless the program has been explicitly changed to run on a cluster (or across multiple cores) it will only ever run on one core. There is no free lunch when it comes to cluster computing!
Beyond MPI there is a need for a batch scheduler. A batch scheduler is an important part of a cluster for several reasons. First, it allows a user to share the compute nodes with other users and second, it allows nodes to be taken out of service without the users knowing they are missing. A batch scheduler works because a user requests a certain number of nodes to run a program (all nodes are assumed identical). When the nodes are available the scheduler then "gives" these nodes to the user for his MPI program. Because the cluster often has more nodes than are needed by any single user's program, multiple users can request nodes for jobs at various times and the cluster runs them when the resources are available -- often simultaneously. Because the user does not request specific nodes, it is possible to take compute nodes out of service with no impact to the users. As an aside, nodes may fail, crashing the current running job, but not the whole cluster. A failed node is automatically removed from the scheduler.
In order to fill the gap between the standard GNU/Linux OS (e.g. Red Hat) and a fully functioning cluster, several cluster tool kits or "Cluster Distributions" have been developed. These toolkits are designed to work "on top of" a standard OS install, or they may include the underlying OS as well. The goal of these toolkits it to provide turn-key provisioning and ready to run software, making it easy to get a cluster up and running. The freely available (no cost) options include:
- Kusu - the foundation for Platform Cluster Manager, is a standardized approach to easily build, manage and use Linux clusters and is a freely available cluster distribution!
- Rocks Clusters - Rocks is an open-source Linux cluster distribution that enables end users to easily build computational clusters, grid endpoints and visualization tiled-display walls.
- OSCAR - OSCAR allows users, regardless of their experience level with a *nix environment, to install a Beowulf type high performance computing cluster.
- Perceus - Perceus is the next generation of enterprise and cluster provisioning toolkit. Created by the developers of Warewulf (one of the most utilized Linux cluster toolkits), Perceus redefines the limits of scalability, flexibility and simplicity.
- oneSIS - is an open-source software package aimed at simplifying diskless cluster management. It is a simple and highly flexible method for deploying and managing a system image for diskless systems that can turn any supported Linux distribution into a master image capable of being used in a diskless environment.
Commercial versions of cluster toolkits are also available. All of them are based on GNU/Linux, and they often provide extra commercial applications and support as part of the cluster package.
- Scyld Clusterware™ - is an HPC cluster management solution. It was designed to make the deployment and management of a Linux cluster as easy as the deployment and management of a single system.
- ClusterCorp Rocks+ - Clustercorp offers the only licensed commercial solutions based on the Rocks Cluster package.
- Platform Cluster Manager - enables a new class of users by simplifying Linux cluster application, deployment and management. Backed by global 24x7 enterprise support, Platform Cluster Manager is a modular and hybrid stack that transparently integrates open source and commercial software into a single consistent cluster operating environment.
- Red Hat HPC Solution - The Red Hat HPC Solution is a low-cost, end-to-end software stack for high performance computing. It provides all the tools needed to deploy, run, and manage an HPC cluster in one easy to install package.
4. Benchmarks Rule
HPC cluster computing is parallel computing, the process by which more than one processor core is used to run a program. As mentioned above, the main program is broken into sub-programs. It should be mentioned that clusters can be used to run single core tasks as well as parallel tasks. Indeed, a user tells a batch scheduler how many cores it will need when a program is run.
Parallel computing is different than sequential (single core) computing, and many of the rules that apply to sequential computing do not necessarily apply to parallel computing. For instance, consider processor speed. In the sequential world the fastest processor is always the best choice. In the parallel world, the fastest processor must be balanced with an appropriate network or the processors will "wait" for data. This situation is similar to connecting a slow disk and fast processor in the sequential world. If the application needs to perform a large amount of I/O, the slow disk will cause the processor to "wait" for the disk. The same applies to I/O in a cluster. Therefore, it is important to make sure that your cluster is a balanced system. If you have a specific type of workload (user application) in mind, then it is important to make sure the system is balanced for your environment.
In general, cluster users all have a fixed budget with which to build a cluster. There are many decisions that need to be made when selecting components, including the type of processor/motherboard, network, and storage system. The cost of these components depends on their performance -- the faster they are, the more expensive they will be. Including a fast network (InfiniBand or 10 GigE) can be expensive and will limit the amount of compute nodes you can fit in your budget. On the other hand, standard GigE is inexpensive and will allow you to squeeze more compute nodes into your budget. The decision on which way to design your cluster should be based on your needs. That is, do your applications need a fast network or can they get away with GigE? Storage is a similar issue. Will a head node running NFS provide enough I/O for your cluster, or do you need to look at an optimized attached storage system (Network Attached Storage) or a more expensive parallel file system? Unfortunately, these are questions that may take some investigation. The best way to answer them is through examining benchmark numbers.
In the above section, we discussed important network parameters (Throughput, Latency, N/2). These are considered micro-benchmarks because they measure a single aspect of a cluster. Full benchmarks, such as application programs, are a better measure of performance, but may lack the ability to pinpoint one particular bottleneck. There are multiple collections of cluster benchmarks from which to choose. Have a look at the Benchmarking Packages page on Cluster Tweaks to find plenty of benchmarks to test your cluster performance and ensure that it is working correctly.
If you have a specific application (or applications) in mind it may be best to consult previous results as well as benchmarks from other users. Most members of the HPC cluster community are very open and will often share their experiences and results with you. (See the Web Resources section below) The importance of benchmarks cannot be overstated. As with many complex systems, the "devil is in the details" and clusters are no exception. Running full systems benchmarks and applications is the only sure way to know if a cluster is working properly. One of the biggest mistakes cluster rookies will make is to design and buy a cluster from product data sheets or web pages. All too often, individual processor performance for each cluster node is added together to get an aggregate cluster performance number. As we have seen, network performance and I/O can have a strong influence on such assumptions and can push such assumptions way off the mark. Remember, one benchmark is worth thousands of assumptions (or dollars).
5. We Need More Than Five Topics
It is not really possible to fit everything into this article! However, if you got this far you are probably better off than you were before you started, but you also probably have more questions! Don't panic, there are plenty of resources out there. The first thing to consider is your goals. If you want to build an educational cluster (one in which you can learn about clustering), then there are plenty of freely available resources (and old hardware) that will get you started. Also, you may want to have a look at Building Your First Cluster. On the other hand, if you need or are interested in a production cluster (one that is required to perform real work), then it may be best to consult with a hardware company that has experience with HPC clusters (note: selling rack-mounted servers does not necessarily mean the company has experience with HPC, ask for customer references). You may also want to consult How to Write a Technical Cluster RFP before you ask for hardware quotes or select a vendor. If you plan on spending money on hardware, many of the experienced HPC vendors will assist you in benchmarking or may even have the benchmarks in which you are interested.
The most important piece of advice for those pursuing HPC clustering is to ask questions and test assumptions. A great place to ask polite questions (After you have searched the archives) is the Beowulf Mailing List. (Note: A Beowulf cluster is essentially an HPC cluster. The name Beowulf came from the NASA project that developed one of the first commodity HPC clusters). Of course, you can always turn to Google and the Internet. The following section lists some of the more popular and useful places to find additional information on HPC clustering.
HPC Web Resources
The following sites will help you get started with HPC cluster content and news. - Beowulf.org - home of the original Beowulf project. It also host the Beowulf mailing list, which is one of the best resources for learning and asking questions about cluster HPC. This is where the Rocket Scientists hang out.
- Cluster Monkey - an open community oriented website for HPC cluster geeks (and non geeks) of all levels. There are tutorials, projects, and many "How-To" articles on the site. There is also a "links" section with an up to date set of links to essential cluster topics including books, software distributions, and more.
- HPCCommunity.org - a technical knowledge sharing and HPC community discussion portal for the High Performance Computing (HPC) community.
- Cluster Tweaks - a community wiki for cluster users that provides a central location for cluster information, trick, tips, how-to's, benchmarks, products, etc. that may be of interest to HPC cluster users. It a document for the community by the community.
- Linux Magazine - a general Linux site with a large focus on HPC. There are plenty of articles on many aspects of HPC and Linux.
- HPC Wire - a good site to keep abreast of news, events and editorial from the HPC market.
- InsideHPC - another good site for news about the HPC market.
- Scalability.org - a great blog that provides real experiences and opinions from the HPC trenches.
- Intel HPC - source for Intel white papers, tools, and products.
- AMD HPC - source of AMD white papers, tools, and products.
- Sun HPC Blogs - Need a break, visit the Sun HPC Water Cooler. A great source of blogs, articles, videos, editorial and more.
And finally, have fun! Clustering is about building a high performance computing machine around your problem. Never before in the history of computing has the end user had the freedom and flexibility to create and control their own supercomputing solution.
About the author: Douglas Eadline is the lead editor at Cluster Monkey and an all round cluster kind of guy.
-
May 30th, 2009 10:17 AM #2
Not all clusters are for a single program
Hi Doug, excellent article, my only niggle is that you say the goal is to "run a program that uses all the component parts at the same time", which in my experience over here is very rare, usually clusters are targeting large user communities all trying to use parts of the cluster at the same time.
Some might be doing parametric studies with lots of single CPU jobs and others may be doing large parallel MPI codes, but nobody gets the whole machine to themselves (well, not unless they bribe the sysadmins! :-)).
cheers!
Chris
-
July 3rd, 2009 02:14 AM #3
Excellent article
Very crisp and enlightening
Abhi D
Posting Permissions
- You may not post new threads
- You may not post replies
- You may not post attachments
- You may not edit your posts
Forum Rules