Use of Server Virtualization in HPC Environments
In this post, I will outline the characteristics of HPC environments and how server virtualization technologies can address some of their issues, along with what I see as the potential challenges in using these technologies. This post assumes you have some familiarity with the basics of server virtualization. For a quick overview see Hypervisor - Wikipedia, the free encyclopedia.
The typical HPC environment today is characterized by “cluster silos” in which different departments or application groups set up and configure workload clusters to suit their needs. Because different applications require different software stacks, including OS and system configuration, job schedulers, and middleware like MPI or PVM, each environment is relatively unique. Some larger organizations have managed to standardize different applications on one stack and consolidate multiple clusters into a single large central Grid managed by IT. However, in the majority of organizations, where a more decentralized structure is the norm, there is concern about the relatively low utilization of individual cluster silos and the management overhead associated with maintaining different physical infrastructures for each application or department.
One of the hot trends in the broader Enterprise IT landscape has been the rise of server virtualization technologies. While server virtualization has existed in the mainframe and most UNIX environments for a while, companies like VMware blazed the trail on commodity x86 hardware platforms. The ability to carve up a physical machine into multiple logical machines, in a manner transparent to the OS or application, enables the consolidation of lightly used enterprise servers for mail, print, web, etc. onto fewer physical machines, thus reducing the cost of physical servers as well as power, cooling and management costs.
For a long time VMware was the sole viable option for server virtualization on x86 hardware, but now there are a number of alternatives including Microsoft Hyper-V, Xen, and KVM, which was recently endorsed by Red Hat. This trend toward the commoditization of basic server virtualization functionality can be seen in the falling cost of the technology, which has gone from thousands of dollars per node to free or nearly free. Obviously a high-cost model does not work when deployed on the large-scale distributed environments typical of HPC.
With the availability of commodity server virtualization software, often baked into the OS, here are some of the possible applications of it within HPC environments:
Checkpoint/Restart/Migration: It has always been a challenge to deal with long-running jobs that need to be periodically checkpointed to avoid losing work in the event of failures. Typically this relied on the application preserving its own state, or on expensive hardware where the OS supported process-level checkpoint/restart. Today, most hypervisors on x86 can take a snapshot of the memory and disk state of a running VM, park it on disk, and later restart that VM on another physical machine. This provides a clean mechanism that is non-intrusive to the application. Some hypervisors even provide live migration, meaning that the VM is moved to another physical host without even impacting its network connections, which would be useful for MPI-style applications.
Dynamic Provisioning & Sharing: It is relatively easy to capture the entire application environment and its settings in a VM image, which can be started in a few minutes or even seconds. This allows for dynamically creating an environment, rapidly adding resources to scale out the cluster as workload demand increases, or shutting down machines when demand drops. Theoretically, each long-running job can be encapsulated in its own VM, or a set of VMs can be created for a parallel job and then shut down when no longer needed. Machines become disposable in a virtual environment because all the important settings for an application are captured in the image, which is maintained on disk.
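To make the scale-out and scale-in idea a little more concrete, here is a minimal Python sketch of the sizing decision. Everything in it is an assumption made for illustration: the function name `vms_to_change`, the `jobs_per_vm` ratio and the `min_vms`/`max_vms` limits are hypothetical, and actually starting or stopping VMs is left to whatever hypervisor tooling is in use.

```python
import math


def vms_to_change(active_jobs, running_vms, jobs_per_vm, min_vms=0, max_vms=100):
    """Return how many VMs to start (positive) or shut down (negative).

    active_jobs  -- jobs currently queued or running (hypothetical input)
    running_vms  -- VMs currently provisioned for this workload
    jobs_per_vm  -- assumed number of jobs one VM can serve at a time
    """
    if jobs_per_vm <= 0:
        raise ValueError("jobs_per_vm must be positive")
    # Number of VMs the current demand would need, rounded up.
    wanted = math.ceil(active_jobs / jobs_per_vm)
    # Keep the target inside the configured limits.
    target = max(min_vms, min(max_vms, wanted))
    return target - running_vms


if __name__ == "__main__":
    # 25 active jobs, 4 jobs per VM, 2 VMs already up: start 5 more.
    print(vms_to_change(25, 2, 4))
    # Demand has dropped to zero with 3 VMs up: shut all 3 down.
    print(vms_to_change(0, 3, 4))
```

A real policy would probably also wait for demand to stay high or low for some time before acting, so that VMs are not started and stopped on every small fluctuation.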
Application Isolation & Security: Another advantage of VMs is that they can isolate applications belonging to different departments or potentially different organizations. Running jobs from different organizations on the same physical machine can lead to issues of data privacy, and reliability if one misbehaving application causes the OS to
Dev/Test Clusters: When new applications are being developed or a new version of a commercial application is being tested, it is necessary to set up a new cluster for short periods of time. Rather than physically setting up a separate cluster, server virtualization makes it possible to create an entire virtual cluster for developers/QA and, when it is no longer in use, to shut it down and free up resources for production use.
Power Management/Green IT: In a virtualized environment, VMs can be migrated among physical machines. If several VMs are not fully utilizing the physical capacity, they can be migrated onto a smaller set of physical hosts to improve utilization, and the vacated physical machines can then be powered off, thereby saving power and cooling costs.
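The consolidation step in the power management case is essentially a bin-packing problem. The sketch below uses first-fit decreasing, a common heuristic that I have chosen purely for illustration; it is not what any particular hypervisor does. The VM names, the single "load" number per VM and the uniform host capacity are all assumptions.

```python
def consolidate(vm_loads, host_capacity):
    """Pack VMs onto hosts using the first-fit decreasing heuristic.

    vm_loads      -- dict of VM name -> load, in percent of one host (assumed metric)
    host_capacity -- capacity of each host in the same units (assumed uniform)
    Returns a list of hosts, each a list of the VM names placed on it.
    """
    hosts = []  # VM names per host
    free = []   # remaining capacity per host
    # Place the largest VMs first; they are the hardest to fit.
    for name, load in sorted(vm_loads.items(), key=lambda kv: kv[1], reverse=True):
        if load > host_capacity:
            raise ValueError(f"{name} does not fit on any host")
        for i, room in enumerate(free):
            if load <= room:
                hosts[i].append(name)
                free[i] -= load
                break
        else:
            # No existing host has room, so keep another host powered on.
            hosts.append([name])
            free.append(host_capacity - load)
    return hosts


if __name__ == "__main__":
    loads = {"vm-a": 50, "vm-b": 30, "vm-c": 20, "vm-d": 60, "vm-e": 40}
    placement = consolidate(loads, host_capacity=100)
    print(len(placement), "hosts needed:", placement)
```

With five lightly loaded VMs that would otherwise occupy five hosts, this placement needs only two, so the other three could be powered off. A real scheduler would also need to consider memory, I/O and the cost of the migrations themselves.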
While there are many potential uses and benefits to server virtualization technology in HPC, there are also some challenges:
Scalable Storage Backend: Given that machines are now transformed into files on disk, this places more strain on the storage system backing a compute farm. Some features, like live migration, work best if there is a centralized shared storage infrastructure. The cost of scaling up the storage backend across hundreds or thousands of nodes may outweigh the cost savings on the compute hardware. One option is to use local disks to store images and switch between them. But then the issue of updating or patching the local copies of an image arises, and some sort of image distribution mechanism will be required.
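One simple building block for such an image distribution mechanism is detecting which nodes hold an out-of-date local copy. The sketch below compares SHA-256 digests; the node names, the in-memory byte strings standing in for image files, and the idea that each node reports its own digest are all assumptions for the sake of the example.

```python
import hashlib


def image_digest(data: bytes) -> str:
    """Return the SHA-256 hex digest of an image's contents."""
    return hashlib.sha256(data).hexdigest()


def stale_nodes(master_digest, node_digests):
    """List the nodes whose local image differs from the master copy.

    node_digests -- dict of node name -> digest reported by that node,
                    or None if the node has no local copy (assumed format)
    """
    return sorted(
        node for node, digest in node_digests.items() if digest != master_digest
    )


if __name__ == "__main__":
    master = image_digest(b"image contents, patched")
    reported = {
        "node01": image_digest(b"image contents, patched"),
        "node02": image_digest(b"image contents, unpatched"),
        "node03": None,  # no local copy yet
    }
    print(stale_nodes(master, reported))
```

For real multi-gigabyte images the digest would be computed by reading the file in chunks rather than loading it into memory, and only the nodes returned here would need a fresh copy pushed to them.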
I/O Overhead of VMs: VM technologies still tend to have some overhead compared to running on raw physical hardware. This is especially true for I/O-intensive or latency-sensitive applications. One approach might be to run only one VM per physical machine, giving the single application maximum access to disk and network devices while still taking advantage of the flexibility of dynamic provisioning with VMs. Hypervisor support for specialized communication transports such as InfiniBand/Quadrics/Myrinet is another related issue.
VM Management: Management of VM environments becomes another challenge because you are no longer dealing with just the physical boxes and a single OS instance running on each of them, but potentially with several OS instances per box. Patching, updating images, monitoring the OS instances, troubleshooting and diagnostics, and policy-based resource allocation all become further complicated in a VM environment. Without an appropriate set of tools and procedures in place, this could negate the benefits. The hypervisor vendors provide their own tools to address these issues in the context of their own technology. Tools such as Platform VMO are attempting to address the heterogeneous management challenge, but there is still scope for improvement and for dealing with the unique requirements of HPC environments.
Application Performance Benchmarking: Given that server virtualization has not yet been widely adopted within HPC, not many ISVs or providers of libraries and tools like LINPACK have benchmarked their software on hypervisors. This is a chicken-and-egg scenario in which users have to push the vendors in order for the vendors to see the market demand.
So, as with any technology, one has to weigh the costs and benefits of introducing it into an existing environment. Server virtualization will be broadly used within the generic IT landscape over the coming years, so it would be prudent for HPC users to take a look and see how it can help. I would be interested in hearing people's thoughts on whether you are looking at server virtualization, which technologies you are considering and which use cases you are targeting.
Recent Blog Entries by Khalid
- Cloud Computing: Opportunities for HPC to go mainstream? (September 22nd, 2008)
- Use of Server Virtualization in HPC Environments (August 15th, 2008)
- Research Topics in HPC (May 21st, 2008)








