The Evolution of Cluster Provisioning: From dd to Plug-and-Play

    The modern HPC cluster is often a complicated affair. It consists of various hardware components that include compute servers, communication networks, and a storage system. In order to create a functional cluster one needs many types of software. This software includes:
    • Provisioning Tools
    • The Operating System Distribution
    • Management and Monitoring
    • Development Tools and Libraries
    • User Applications
    Managing all these components (both hardware and software) is a big job. In the past, system administrators often took sole responsibility for software provisioning on the cluster. This approach was time consuming and usually produced a one-off custom solution. For those who do not want to go it alone, as well as those who are looking to reduce their effort, there are methodologies and tool sets that assist with configuring and provisioning HPC clusters. Of course, some hand tuning is still required, but most of the heavy lifting can be handled by the cluster distributions and tool kits.

    The Cluster Environment

    Before we talk about managing a cluster, let's take a look at the requirements for a functional cluster. As a review, consider the generalized cluster in Figure One. There is a head node, which serves as the access point for users and administrators. In our simple cluster, the head node is also the NFS file server and runs the batch scheduler, which manages users' jobs. Next, there are a number of compute nodes that are used to execute applications. Finally, there are one or more networks connecting all the nodes. In this case, there is a Gigabit Ethernet network for file sharing and administration and a high speed compute network, which is usually InfiniBand (IB) or 10 Gigabit Ethernet (10GigE).


    Figure One: General Cluster Design

    On the surface, the software needs of a cluster appear quite simple.

    1. Provide a full software environment on the head node that includes user home directories
    2. Provide identical copies of the head node software on the nodes
    3. Export the user /home directories from the head node to the compute nodes

    Given this environment, it is easy to see that a cluster can be viewed as a collection of workstations, each with its own self-contained environment (i.e., you can attach a keyboard and monitor to the nodes and use them as you would the head node). Since all of the nodes and the head node have identical software environments with a shared /home, parallel programs (multiple copies of a binary) can be run on any number of nodes. When the programs need to communicate, they can talk using the MPI (Message Passing Interface) API over the high speed compute network.

    The basic cluster environment is actually rather simple to install and set up. Indeed, many early clusters were configured using Red Hat install CDs on each node. The Red Hat Kickstart utility made the installation even simpler because nodes could be booted from a floppy disk and install their software from a central build server. You could also add customizations and additional software with the Kickstart floppy disk, so it was actually possible to automatically build a complete cluster. Other methods included using dd (the low-level disk copy utility) to clone the head node disk for use on the nodes, which still needed some further per-node configuration, and SystemImager, which can snapshot and copy disk images.

    While these methods worked well, there was one major issue that caused problems -- changes. Changes in the software environment can take several forms. The most common are a package upgrade or the addition of a new package. There are also changes that are local to the node itself. For instance, each node has a unique IP address and hostname, along with other node-specific configuration files. In essence, a consistent "state" must be maintained throughout the cluster, both locally and globally. If one node (or several nodes) falls out of sync, parallel programs may fail to run.
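    The node-local identity described above is one piece of state that can be generated rather than edited by hand. The following is a minimal sketch in shell, assuming nodes are named node1 through nodeN and sit on a made-up 10.0.0.x administration network; the function name gen_hosts, the naming scheme, and the addresses are all illustrative assumptions, not part of any toolkit mentioned in this article.

```shell
# Print /etc/hosts-style lines for compute nodes <prefix>1..<prefix>N.
# The default prefix, subnet, and address offset are illustrative assumptions.
# The function body runs in a subshell so its variables stay local.
gen_hosts() (
    count=$1
    prefix=${2:-node}
    subnet=${3:-10.0.0}
    offset=${4:-100}
    i=1
    while [ "$i" -le "$count" ]; do
        printf '%s.%d\t%s%d\n' "$subnet" "$((offset + i))" "$prefix" "$i"
        i=$((i + 1))
    done
)

# Example: entries for a three-node cluster
gen_hosts 3
```

    Generating such entries from a single source on the head node, instead of editing each node separately, is one simple way to keep this part of the cluster state consistent.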

    To help with this problem, there are parallel administration tools that allow commands to be run on all nodes simultaneously. For example, tools like C3 (Cluster Command and Control) or PDSH (Parallel Distributed Shell) provide cluster-wide commands. As an example, when using PDSH it is possible to run commands like:

    Code:
    # pdsh uptime
    
    which will execute the uptime command on all nodes and send the results back to your terminal. You can also run commands on selected nodes. For example,

    Code:
    # pdsh -w node[7,9-10] uptime
    
    will execute the uptime command on nodes with names node7, node9, and node10. There is also a parallel copy command that works similarly:

    Code:
    # pdcp -w node[6,8-10] /etc/hosts /etc
    
    In the case above, the local /etc/hosts is copied to the nodes named node6, node8, node9, and node10. While these tools may seem to solve many problems, they can actually create issues in the cluster. As you can see from the above example, it is possible to selectively change specific nodes and thus create different states across the cluster.
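    To make the bracket notation in the examples above concrete, here is a small sketch of how a range such as node[6,8-10] expands into individual hostnames. It is an illustration written for this article, not pdsh's own code: the function name expand_hosts is hypothetical, and it only handles a single bracket group with plain numbers and ranges (no zero padding), whereas the real tools accept more.

```shell
# Expand a simple host range such as "node[6,8-10]" into one hostname
# per line. Handles one bracket group containing comma-separated numbers
# and ranges; anything without brackets is printed unchanged.
# The function body runs in a subshell, so the IFS change stays local.
expand_hosts() (
    spec=$1
    case $spec in
        *\[*\]) ;;                           # bracket group present: expand it
        *) printf '%s\n' "$spec"; exit 0 ;;  # plain hostname
    esac
    prefix=${spec%%\[*}     # text before "["
    list=${spec#*\[}        # text after "["
    list=${list%\]}         # drop the trailing "]"
    IFS=,
    for item in $list; do
        start=${item%-*}    # "8-10" -> 8, "6" -> 6
        end=${item#*-}      # "8-10" -> 10, "6" -> 6
        i=$start
        while [ "$i" -le "$end" ]; do
            printf '%s%s\n' "$prefix" "$i"
            i=$((i + 1))
        done
    done
)

# Example: prints node6, node8, node9, and node10, one per line
expand_hosts 'node[6,8-10]'
```

    Seeing the expanded list makes it obvious which nodes a command touches and, just as importantly, which ones it skips.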

    Perhaps the most difficult issue facing a cluster is trying to determine the state of each node. Unless you have tracked every change on every node, it is almost impossible to be sure that all the nodes are in sync with the head node. Even with careful tracking, a down node may still be an issue. If for instance you have nodes that are offline for repair or to be used as spares, how do you keep these in sync with any changes? Plus, if you replace a disk drive in a node, you will need to retrace all the steps you've taken since the initial installation.
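    One rough way to spot-check state is to compare a checksum of a file on every node against the head node's copy. The sketch below assumes the input has the shape "host: checksum  path", which is what a parallel shell produces when it prefixes each node's checksum output with the hostname; the function name find_drift and the checksum values are placeholders invented for this example.

```shell
# Read "host: checksum  path" lines on stdin and print every host whose
# checksum differs from the reference value passed as the first argument.
find_drift() {
    awk -v ref="$1" '{
        host = $1
        sub(/:$/, "", host)          # strip the trailing colon from "host:"
        if ($2 != ref) print host
    }'
}

# Example with canned input instead of a live cluster; the checksums
# are made-up placeholders. Only node2 differs from the reference.
printf '%s\n' \
    'node1: aaa111  /etc/hosts' \
    'node2: bbb222  /etc/hosts' \
    'node3: aaa111  /etc/hosts' | find_drift aaa111
```

    On a live cluster the same filter could be fed from something like pdsh -w node[1-3] md5sum /etc/hosts. Note that a node that is offline produces no line at all and is therefore never flagged, which is exactly the spare and down node problem described above. A single file checksum also says nothing about the rest of the system.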

    Managing The Environment

    Managing change across the cluster is a major issue for any administrator. As mentioned, the range of software needed on a cluster is quite large, although not all software is needed on all nodes. Indeed, the head node has a unique role in the cluster because it acts as a login node, an NFS server, a batch scheduler, a node monitor, and a development node for users. Many, if not all, of these services are not needed on the compute nodes.

    In addition, when the PXE boot environment (booting over the network using DHCP and TFTP) became standard, the head node also took on the role of the boot server. It should be noted that in large clusters each of these services can be offloaded to an individual server if the load gets too high. That is, for a large cluster, running the batch scheduler, NFS, and monitoring from a single head node may overwhelm the node.

    Given that the head node could manage the boot environment, it made sense that it should also manage the software environment on the compute nodes at boot time.

    Image Based Management

    One of the first methodologies to manage compute nodes was based on hard drive images. While the software needed on the compute nodes is different from that needed on the head node, it is identical on each compute node. Since PXE booting and DHCP allow each node to obtain its identity at boot time, a standard compute node image can be maintained on the head node and copied to the nodes the first time they are booted. That is, rather than a traditional RPM based install, the node hard disk is provisioned with a full image. The next time the node boots, the image is ready for use in the cluster.
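    The identity a node receives at boot time is typically keyed on the MAC address of its network interface. As a rough sketch, assuming the head node runs the ISC DHCP server (an assumption; this article does not name a particular DHCP server), per-node host entries could be generated from a simple "name MAC IP" table. The function name gen_dhcp_hosts and the names, MAC addresses, and IP addresses shown are placeholders.

```shell
# Turn "name mac ip" lines on stdin into ISC dhcpd-style host entries.
# The function body runs in a subshell so its variables stay local.
gen_dhcp_hosts() (
    while read -r name mac ip; do
        [ -n "$name" ] || continue      # skip blank lines
        printf 'host %s {\n' "$name"
        printf '    hardware ethernet %s;\n' "$mac"
        printf '    fixed-address %s;\n' "$ip"
        printf '}\n'
    done
)

# Example with a single placeholder node
printf '%s\n' 'node1 00:11:22:33:44:01 10.0.0.101' | gen_dhcp_hosts
```

    Because every node's identity comes from one table on the head node, the disk image itself can stay identical across all compute nodes.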

    The advantage to this method was that node images were controlled on the head node. If the image is changed in some way it is simple to tell the nodes to re-image themselves with the new image. Of course this would require time for the new images to propagate, but it allowed tight control of exactly what was installed on the nodes.

    There are several freely-available distributions that provide this capability. Some of them have a commercial counterpart that includes support. Let’s look at some of the popular image based cluster distributions:

    • Kusu is the open foundation for the commercial Platform Cluster Manager (PCM) and is a standardized approach to easily build, manage, and use Linux HPC clusters. There is an active development and user community located at HPC Community.
    • Rocks is an open-source Linux cluster distribution that enables end users to easily build computational clusters for various purposes. There is also a commercial version, Rocks+.
    • OSCAR allows users, regardless of their experience level with a *nix environment, to install a Beowulf-type high performance computing cluster. It also contains everything needed to administer and program this type of HPC cluster. OSCAR's flexible package management system has a rich set of pre-packaged applications and utilities, which means you can get up and running without laboriously installing and configuring complex cluster administration and communication packages.

    Stateless Provisioning

    As the capability of node hardware continued to grow, many developers wondered whether booting the node from a hard disk image was really necessary. Several methods have been developed to boot nodes in a "stateless" mode where the entire boot image comes from the boot server and lives in memory. When the node is rebooted, all state information is lost. The state is managed from the head node and is highly configurable. In addition, stateless memory images are usually small, trimmed down versions of full system images that boot very quickly into RAM disks. The memory image is often supported by NFS mounts for application software or support programs and libraries. With stateless booting there is no need to have hard disk drives on the nodes, and some users choose to configure their clusters in this fashion. There is no reason, however, that stateless nodes cannot have and use local disks for local storage. It should also be noted that Kusu, Rocks, and OSCAR can now do stateless (disk-less) provisioning as well. The packages mentioned below were designed primarily for stateless operation:

    • Perceus is an enterprise and cluster provisioning toolkit created by the developers of Warewulf (one of the most utilized Linux cluster tool-kits). It aims to combine scalability, flexibility, and simplicity while allowing open customization and site-required technologies.
    • Scyld ClusterWare™ is a commercial HPC cluster management solution. It is designed to make the deployment and management of a Linux cluster as easy as the deployment and management of a single system. Scyld ClusterWare makes it possible to leverage the superior price/performance ratio of Linux on commodity hardware without the pain of having to individually manage a multitude of systems.
    • xCAT is an open source application that allows the user to provision operating systems on physical or virtual machines, remotely manage systems, and quickly set up and control management node services, including DNS, HTTP, DHCP, and TFTP.

    Remote State Management

    In this scenario, the node state is maintained on a remote filesystem (e.g., NFS) or disk (e.g., iSCSI). One popular package for remote state management is oneSIS, an open-source software package aimed at simplifying disk-less cluster management. oneSIS is a simple and highly flexible method for deploying and managing a system image for disk-less systems. It can turn any supported Linux distribution into a master image capable of being used in a disk-less environment. One image is sufficient for serving thousands of nodes.

    Final Thoughts


    The array of options for the cluster administrator is quite vast. The goal of all packages mentioned above is to manage state and maintain consistency across the cluster while at the same time providing flexibility to the administrator. As is often the case, users may request that specific software packages (even specific versions) be available on the nodes. The administrator must make sure that updating cluster node images (either disk or memory) does not cause a major interruption to cluster operation. Another aspect is scale. Almost all of the packages mentioned above will scale to very large clusters. If you are considering a small cluster, i.e., fewer than 64 nodes, virtually any of the above packages and methodologies will work for you. Deciding how to provision larger clusters may take a bit more due diligence since needs and cluster designs can vary.

    In terms of support, all open source packages have active user communities from which you may be able to obtain support. Commercial support is available from several vendors, but realize that you will need to be using the "commercial version" and not the open version in order to enjoy full support.

    Finally, there is no need to try and provision a cluster on your own. The tools have evolved way beyond the simple provisioning methods used for the first clusters. The current array of cluster software is quite powerful and you would be hard pressed to duplicate these efforts without a considerable time investment. Pick your method and start provisioning your HPC dream machine today.
    This article was originally published in forum thread: The Evolution of Cluster Provisioning: From dd to Plug-and-Play started by deadline