When the first Linux clusters were constructed, Ethernet was one of the few choices for an interconnect. There were more expensive, custom ways to connect computers, but Ethernet was the first network technology supported by Linux. It was also the most ubiquitous network, which made it the cheapest.

Today, Ethernet continues to be the most pervasive interconnect on the planet. It is well understood, provides true "plug and play" capability, is almost universally supported, and enjoys commodity pricing. For these reasons, Ethernet is the preferred interconnect for High Performance Computing (HPC). In the most recent survey of the fastest 500 computers in the world, 52% of the clusters used Gigabit Ethernet (see the November 2009 Top500 List at Top500.org).
Many of the first HPC clusters used Fast Ethernet (100 Megabits/second). Currently, many systems use 1 Gigabit Ethernet (1000 Megabits/second) or 10 Gigabit Ethernet (10,000 Megabits/second). The need for a faster interconnect is largely due to the increase in processing cores per node. For some applications, a standard 1 Gigabit Ethernet (GigE) link may have enough bandwidth to fully support the eight or more cores found in a typical HPC server. But in other cases, using GigE as a cluster interconnect could introduce a communication bottleneck between cores, resulting in poor overall performance.
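As a rough illustration of why core count matters, the sketch below divides a link's nominal bandwidth evenly among the cores of one node. It assumes every core communicates at the same time and ignores protocol overhead, so real applications will see different numbers; the function and table names are illustrative only.

```python
# Nominal link speeds quoted in the text, in Megabits/second.
LINK_SPEEDS_MBPS = {
    "Fast Ethernet": 100,
    "GigE": 1_000,
    "10 GigE": 10_000,
}


def per_core_bandwidth_mbps(link_mbps: float, cores: int) -> float:
    """Even share of one link's nominal bandwidth per core (Mbit/s)."""
    if cores <= 0:
        raise ValueError("cores must be positive")
    return link_mbps / cores


def share_table(cores: int) -> dict:
    """Per-core share for each link speed, assuming all cores talk at once."""
    return {
        name: per_core_bandwidth_mbps(speed, cores)
        for name, speed in LINK_SPEEDS_MBPS.items()
    }


if __name__ == "__main__":
    # Eight cores per node, as in the "typical HPC server" mentioned above.
    for name, share in share_table(8).items():
        print(f"{name:>13}: {share:8.1f} Mbit/s per core")
```

With eight cores, each core's even share of a GigE link is 125 Mbit/s, only slightly more than a whole Fast Ethernet link once provided to a single-core node.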
The growth of Ethernet networking in the Top500 List can be seen in Figure One. Since its introduction, GigE has shown considerable growth in HPC systems, but recently GigE use in the Top500 has declined, probably due to the communication needs of multi-core nodes. Meanwhile, InfiniBand has been growing in popularity due to its higher performance. Many consider InfiniBand and Ethernet to be a "two horse race" for the HPC interconnect market. The term "race" may not be quite accurate, however. There seem to be two populations of users in the HPC sector. The first is the power user who pushes the hardware to its limits in terms of scalability and performance. These users choose InfiniBand as their interconnect and often consider Ethernet a lower-speed supplemental network. The second type of user wants a low-cost commodity network with which they are familiar. For these users, Ethernet is often a popular choice, if not the only choice.

Figure One: Historical Interconnect Trend for Top 500 computers (Courtesy of http://top500.org)
Based on past trends, 10 Gigabit Ethernet (10 GigE) is destined to take over for GigE as a cluster interconnect. In addition to offering a familiar technology, it also has the ability to offer wide functionality, increased performance, and at the same time, a reduced cost. The transition to 10 GigE for HPC is considered by many to be similar to the natural progression from Fast Ethernet to GigE. The commodity market has been pushing 10 GigE costs down to the point where it has become a viable option for HPC users. The additional bandwidth and often better messaging rate make 10 GigE the logical next choice for many cluster users.
Concerns about 10 GigE performance as compared to InfiniBand depend largely on the application set being utilized. By design, InfiniBand offers a high performance kernel by-pass model for communication (less message overhead). Traditional Ethernet, however, uses standard kernel services and requires more overhead per message. As will be described below, there are kernel by-pass options for Ethernet as well. As always, the best method for testing performance is to benchmark your own applications.
A Good Friend
Over the years, Ethernet has been a good friend to the computer industry. It continues to work reliably over commodity equipment, is supported by all operating systems, provides a compatible upgrade path, and offers an established deployment model. For these reasons, Ethernet is often considered a plug-and-play network and the "simple" solution for cluster interconnects. The fact that virtually all HPC servers have on-board Ethernet capability also makes Ethernet the default network for many systems.
Another big advantage of using Ethernet is the ability to combine many of the separate networks found in clusters into a single network. Using 10 GigE, all storage, compute, and administrative traffic can be easily aggregated into one network, thus reducing cable count and the cost of switches and network cards. Indeed, the complete set of standard cluster services can be operated over Ethernet (e.g., NFS, batch scheduler, remote login). In addition, encapsulated storage technologies such as FCoE (Fibre Channel over Ethernet) or iSCSI offer new storage solutions that were not possible in the past.
Finally, the commodity nature of Ethernet means that, in addition to competitive pricing, multiple vendors can provide insurance against obsolete or orphaned hardware. The continued availability of 10 GigE is expected to ensure a continued good relationship between Ethernet and the HPC market.
The Cables Are The Network
Every network technology needs a medium over which it can move data. In general, the faster the network, the more expensive the cables. There are two methods of moving data. The first involves electrons and uses copper wire to connect nodes and switches. The second method uses light and fiber optics to make these connections. The use of light usually incurs an extra cost because the electrical signals (electrons) must be converted to light (photons) and back again as part of the network (the devices that do this are sometimes referred to as transceivers). Copper cables are usually bulkier than fiber optic cables and provide a cheaper method for short-haul communications. Fiber optic cables can transmit over long distances and can actually cost less than copper cables when long distances are factored into the cost (i.e., the price of copper soon outweighs the additional cost of fiber optic transceivers).
An immensely popular feature of Ethernet has been its cable technology. Anyone who has connected a cluster network (or a computer) knows the sound of an RJ-45 connector clicking into place. (RJ-45 connectors look like modular phone connectors but have more wires.) Ethernet cables are rated by Category, and in general each Category is rated for particular Ethernet speeds. The standard Category 5 cable (Cat-5 or Cat-5e) is currently used for GigE; it allows a small bend radius, is lightweight, and uses RJ-45 connectors on both ends. In contrast, InfiniBand, due to its higher performance, uses a bulkier copper cable and CX4 connectors with a higher weight and larger bend radius. (Fiber optic InfiniBand cables are available as well for extra cost.)
Ethernet's popular RJ-45 cabling feature has been retained in 10 GigE, but may offer slightly lower performance. The official name of this interface is 10GBASE-T. Using Cat-6 (or possibly Cat-5e) cables, 10GBASE-T provides a connection distance of 55 meters, which can be extended to 100 meters using Cat-6a cabling.
In addition to traditional copper cabling, the popular SFP+ (Small Form-factor Pluggable) interface allows both optical and copper cables to be used in a 10 Gigabit Ethernet network. The SFP+ interface is a modular design that allows different communication media to be used with network cards and switches. That is, 10 GigE network interface cards and switches have SFP+ sockets that will accept various types of transceivers. Examples of SFP+ transceivers are shown in Figure Two below.

Figure Two: Examples of SFP+ based transceivers
One popular method for connecting 10 GigE in clusters is to by-pass the transceivers and connect the SFP+ sockets directly with a passive copper twinax cable. This solution will work for distances of less than five meters and is usually used in "top of rack" switch designs. Keep in mind, longer distances may require SFP+ transceivers and optical cabling. An example of a twinax cable with SFP+ connectors is shown in Figure Three.

Figure Three: Example of a passive twinax copper cable
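The distance limits quoted in this section can be collected into a small lookup. The following is only a sketch: the function name is hypothetical, the thresholds are the figures given above (under five meters for passive twinax, 55 meters for 10GBASE-T over Cat-6, 100 meters over Cat-6a), and it ignores cost, switch port types, and vendor-specific limits.

```python
from typing import List


def suggest_10gige_media(distance_m: float) -> List[str]:
    """List the 10 GigE cabling options from the text that reach distance_m."""
    if distance_m <= 0:
        raise ValueError("distance_m must be positive")

    options = []
    if distance_m < 5:
        # Direct SFP+ to SFP+ connection, typical of top-of-rack designs.
        options.append("SFP+ passive copper twinax")
    if distance_m <= 55:
        options.append("10GBASE-T over Cat-6")
    if distance_m <= 100:
        options.append("10GBASE-T over Cat-6a")
    # Optical transceivers remain the fallback for longer runs.
    options.append("SFP+ optical transceivers with fiber")
    return options


if __name__ == "__main__":
    for distance in (3, 40, 80, 300):
        print(f"{distance:>4} m: {', '.join(suggest_10gige_media(distance))}")
```

A run of 80 meters, for example, rules out twinax and Cat-6 and leaves Cat-6a or optical cabling.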
Hardware Options
Due to its commodity nature, Ethernet offers many choices for both vendors and implementations. Currently, GigE is considered the standard low-cost method to deploy Ethernet, although there are still plenty of compatible 100BT (Fast Ethernet) networks. New switches accommodate all previous and current Ethernet standards, which allows for new and existing networks to be easily combined.
Unless you are certain that your applications can work with lower speeds, new cluster deployments should consider 10 GigE rather than GigE. As mentioned, the use of multi-core processors has created more traffic for each node than in the past. Expect prices to fall over the next year as the market for 10 GigE products increases.
If you are considering 10 GigE, there are many options in terms of Network Interface Cards (NICs). 10GBASE-T NICs from both Mellanox (model MNTH18-XTC) and Intel (model EXPX9501AT) offer single RJ-45 ports. There are also SFP+ NICs from Mellanox (model MNPH28B-XTR), QLogic (model QLE8150), Intel (model X520), and Myricom (model 10G-PCIE-8B-S). Note that vendors often have a family of NICs with various options.
When choosing a switch there are many options. A good switch should have low latency, high bandwidth (Gbps or Gigabits per second), and a high packet rate (Mpps or Mega packets per second). Double check all your connections and cables for compatibility and length before you commit to a purchase. Current 10 GigE switch vendors include HP, Force10, Arista, Cisco, Brocade (who recently purchased Foundry Networks), Fujitsu, Extreme Networks, and SMC.
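To put a switch's Mpps figure in context, it helps to know the highest packet rate a single port can carry. The sketch below computes that theoretical line rate from standard Ethernet framing, which is not covered in this article: each frame occupies an extra 20 bytes on the wire (7-byte preamble, 1-byte start-of-frame delimiter, 12-byte inter-frame gap), and the minimum frame is 64 bytes. The function name is illustrative.

```python
# Per-frame wire overhead outside the frame itself:
# 7 bytes preamble + 1 byte start-of-frame delimiter + 12 bytes inter-frame gap.
WIRE_OVERHEAD_BYTES = 20


def line_rate_mpps(link_gbps: float, frame_bytes: int) -> float:
    """Theoretical maximum packet rate of one port, in Mega packets/second."""
    if frame_bytes <= 0:
        raise ValueError("frame_bytes must be positive")
    bits_per_frame = (frame_bytes + WIRE_OVERHEAD_BYTES) * 8
    packets_per_second = link_gbps * 1e9 / bits_per_frame
    return packets_per_second / 1e6


if __name__ == "__main__":
    for gbps in (1, 10):
        for frame in (64, 1518):
            rate = line_rate_mpps(gbps, frame)
            print(f"{gbps:>2} Gbps, {frame:>4}-byte frames: {rate:.3f} Mpps")
```

For minimum-size frames this works out to roughly 14.88 Mpps per 10 GigE port, so a switch that forwards at full rate on every port needs a packet rate of about that figure multiplied by its port count.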
Faster Than Ethernet
When using Ethernet, communication over a network normally takes place through the Linux kernel (i.e., the kernel manages, and in a sense guarantees, that data will get to where it is supposed to go). This communication path, however, requires memory to be copied from the user's program space to a kernel buffer. The kernel then manages the communication. On the receiver node, the kernel will accept the data and place it in a kernel buffer. The buffer is then copied to the user's program space. This excess copying often adds to the latency for a given network. In addition, the kernel must process the TCP/IP stack for each communication. For applications that require low latency, the extra copying from user program space to kernel buffer on the sending node and then from kernel buffer to user program space on the receiving node can be very inefficient.
High performance network technologies such as InfiniBand use a kernel by-pass method to improve performance. This capability is also available for Ethernet, but is not widely used outside of the HPC community. One such methodology is Intel® Direct Ethernet Transport (DET), which works by providing a User Direct Access Programming Library (uDAPL) interface like InfiniBand. uDAPL defines a single set of user APIs for all Remote Direct Memory Access (RDMA)-capable transports. DET includes a kernel module and a uDAPL library for Ethernet and will work on almost any Ethernet NIC. It can be linked with any software requiring a uDAPL library, such as an MPI version.
Another popular kernel by-pass effort is the Open-MX project. Open-MX is based on the Myrinet MX protocol. Essentially, any software that links to the Myricom MX library should be able to link with Open-MX. Currently, Open MPI, MPICH2, and the PVFS2 file system have all been shown to work with Open-MX. While Open-MX will work with almost all GigE and 10 GigE chip-sets without modifying drivers, it does require kernel 2.6.15 or higher. Depending on the chip-set, Open-MX latencies as low as 10 microseconds for GigE have been reported.
10 GigE Is Here
The next wave of Ethernet performance is here, and you should expect to see new clusters sporting 10 GigE interfaces over the next several years. The use of GigE will fade as 10 GigE prices drop and the communication needs of HPC nodes continue to increase. The plethora of options and vendors in the Ethernet market continues to protect user investments in Ethernet technology. The simplicity and comfort offered by Ethernet will extend into the 10 GigE space and beyond, as vendors ready 40 GigE solutions. Ethernet, our old friend, will continue to move the data that feeds HPC servers for years to come.


