Hi, all...
I guess I'll get this thing started. I did a presentation at PGC08 about our experiences in deploying a Dell-based HPC at the University of Alabama. In the presentation I mainly tried to address some of our 'soft' issues in the deployment, with the backdrop of those issues being the delays in deploying our system and shortcomings in our support structure. After the session I had some good conversations and got a comment that some people had wanted more information on the 'nuts and bolts' of our implementation. That's a topic I largely glossed over in the presentation.
So, I'm dropping this out there to see if anyone has any questions on the topic. I am certainly not a cluster guru or by any stretch an expert, and our deployment was pretty rocky and took much, much longer than we originally planned.
The basic gist of my presentation was that our implementation was a lot harder than it had to be. I work in central technology services, and essentially the original design and agreements for our cluster were external to central IT. In our case, the design was set up by a cooperative effort between Dell and one of our research groups. At the time, this was not that odd, because our IT structure at the University of Alabama has lots of departmental silos that do various things for individual groups. As that cluster acquisition came closer, though, it became clear the original department lacked some critical infrastructure components (power, AC, staffing) for the system.
Central IT ended up taking the system and installing it in our relatively roomy raised floor. We also ended up footing the bill for the system. As part of our acceptance of the system, we required that it be a generally-available campus resource, that all groups would get fair access to the system, and that central IT would be the 'root' users -- so no individual research group would get superuser on the system.
Unfortunately, the timeline was pretty rough - we came in basically too late to do much tweaking of the system or changing of the build. When the equipment arrived on our docks, that was our first look at the Dell 1855 blades, as well as central IT's first time to work with Infiniband. The system as delivered was:
* 130 Dell 1855 blade nodes. Each has either 4 or 6 GB of RAM, two 73 GB disks, and two P4 3.2 GHz processors. The PERC RAID controllers in these nodes will mirror, but not stripe.
* The blades fit 10 to a chassis, and the chassis have Infiniband and gig ethernet port blocks. As originally delivered, the chassis had ethernet port aggregators. As we had to lean more on ethernet than Infiniband, we replaced these with passthroughs so that we didn't bottleneck so badly on chassis-to-chassis communications.
* Three 1850's to service our IBRIX filesystem. One is a fusion manager, the other two are segment servers.
* Two Dell PV-220S disk arrays. Each array is split into two RAID 5 containers and each container constitutes an IBRIX segment. Current IBRIX software allows larger segments so we could make this two segments now, but the version available when we launched had more restrictive segment size limits. Total usable storage is about 3.4 TB.
* One 1850 frontend node
* One 1850 'ITA' node for cluster management and monitoring
* Three Nortel Gbit ethernet switches, ganged together
* A TS-120 and a TS-740 for our Infiniband. These are now supported by Cisco after Cisco acquired Topspin.
Note I didn't mention a backup device. The design vision didn't include backups; the response to our concerns was that the shared storage was not for long term storage, but just for the storage of data sets. Unfortunately we weren't able to secure funding to remedy this problem with the design. The only real response we had procedurally was to include disclaimers and warnings to our users. We basically included language on the new account confirmations, in all responses to quota requests, on the cluster web site, and periodically we'd make an announcement on the listserv. All our cluster users are auto-subscribed to our announcement listserv when we generate their accounts. These announcements and notices were unfortunately little more in practice than a CYA measure; a significant portion of our users unsubscribe from the announcement listserv at the first opportunity, and I'm convinced that half of our grad students and all of our faculty ignore everything in the account generation confirmation except their login information.
Our deployment was a mixture of easy and not-so-easy. Dell sent on-site contractors to rack-and-stack the system - unfortunately, it appears the installers had not had much experience with Infiniband, and the cables as installed were unacceptable. They were run under the floor and inadequately secured, so we couldn't keep a good connection through the cluster. Simple cable weight kept breaking our connections. Since this was my first time working with Infiniband, I was unprepared for how picky the connectors and cables can be. Our first recabling was basically the task of pulling out our Infiniband cables and rerunning them through the tops of the racks. This was a better solution but still picky, and it was essentially guaranteed that a human hand in the back of a rack would result in some sort of Infiniband cable issue, requiring that we track down the bad connection and reseat the connectors. That was a pretty significant problem later, since we had to get back there on a regular basis to move KVM connections so we could manually rebuild compute nodes.
In order to get the system up and going we needed more power; luckily we had just removed an old IBM mainframe and installed a new AC unit, so we had plenty of cooling capacity (I know, that makes us exceptionally lucky!). We brought in a new power panel to service the HPC. Our burn-in went pretty well; out of our 130 compute nodes we only had, if I recall correctly, two that were problems. Dell got those replaced quickly.
Complicating most of the rest of the work described here was our project situation. I had been committed to a project that was deemed 'high priority' by the UA administration. My role in that other project was to facilitate two other campus groups in deploying an infrastructure application. This other project was both more important institutionally than the HPC and a major time-sink. Essentially, the product we purchased was purchased through a reseller, as the application developer doesn't sell directly to users. Our reseller had no experience with the software version we were deploying, no experience with the database used by the app, and no experience whatsoever with any Unix system. Our software deployment team also had very limited Unix experience and no training on the application or database. My role was to try to keep the project from going off the rails, and to be available when needed. Unfortunately, the vendor and reseller response to most issues was to start over - reformat their system and install the OS, app, etc. from scratch. Tech support issues required my presence - often sitting on hold with the reseller and software team for hours - since any tech support call that was made in my absence resulted in a diagnosis that "The OS must be the problem." The purpose of this tangent isn't to gripe about what was a highly dysfunctional deployment, but to underscore a problem. This challenging project was competing for the same person-hours as the HPC deployment, and my management considered it the highest priority. My researchers and other HPC users considered that project to be lower than the lowest priority; I fielded several furious calls from faculty who could not fathom why I'd be wasting time on the other project just because the administration wanted it. Most of these calls came from tenured faculty. Staff at UA don't get tenure.
My best way to get past this was to do more work than I really wanted to after-hours. A moment to praise my Dell deployment engineer - Aaron Rhoden really came through on this. He was very understanding of our project situation and had a strong commitment to doing what he could to ensure our success with the HPC.
The basic Platform Rocks installation went pretty smoothly. At that time, the process of imaging the nodes was much more bandwidth-dependent than in our current Platform OCS 4. Getting all our libraries installed and our compilers set up for the users took a little more time, but was still pretty easy. The next several days were spent running LINPACK and tweaking it to get a decent number. A couple of nodes locked up and had to be manually reimaged. At this time we didn't know a reliable way to fix nodes; we'd swap the two internal disks and try to reimage, but the node would just fail to PXE boot. Sometimes it would reimage, but instead of wiping the drives the installer would just slice up the largest existing partition on the disks. In a few cases we just couldn't get it to work until I got frustrated with the situation, booted from a recovery disc (RIP Linux is what we used, but there are a variety of options) and wiped the drives. Later I found that I could get past this sort of problem with a more conservative process:
1. Boot from a recovery disc.
2. Wipe the first portion of each internal hard drive:
    dd if=/dev/zero of=/dev/sda bs=2048k count=1
    dd if=/dev/zero of=/dev/sdb bs=2048k count=1
3. Reboot the system and physically hit F12 to PXE boot (otherwise the system just hangs)
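For anyone who would rather script step 2 than type it on every node, here is a minimal sketch. It assumes the same two internal disks (/dev/sda and /dev/sdb) and the same dd invocation as above; the function name wipe_disk_heads and the DRY_RUN switch are my own inventions for illustration, not part of any Platform or Dell tooling. It defaults to printing the commands so nothing gets wiped by accident.

```shell
#!/bin/sh
# Sketch: zero the first 2 MiB of each listed disk so the installer
# sees blank drives. DRY_RUN (hypothetical switch) defaults to 1, so
# nothing is written unless you explicitly set DRY_RUN=0.
DRY_RUN="${DRY_RUN:-1}"

wipe_disk_heads() {
    for disk in "$@"; do
        cmd="dd if=/dev/zero of=$disk bs=2048k count=1"
        if [ "$DRY_RUN" = "0" ]; then
            # Destructive: overwrites the partition table and boot sector.
            $cmd || return 1
        else
            echo "would run: $cmd"
        fi
    done
}
```

From the recovery disc you would call it as `wipe_disk_heads /dev/sda /dev/sdb`, check the printed commands, and then repeat with DRY_RUN=0. The F12 step still has to be done by hand.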
When we started testing user apps (like Gaussian) we saw significantly more failures. We could not reliably trigger the failure, though. Because of the issues we had with the partitioning of the drives, because all of our diagnostics on failed nodes showed good hardware, and because we had verified that our BIOS and firmware versions were current, we focused our troubleshooting on the software. During this interval, if I recall correctly, the Platform Rocks product became Platform OCS, so we moved to the OCS software. The problem persisted.
We still had the problem by April. At this point we had gone almost four months with the system unavailable, and it was politically impossible to delay opening it up to users. As you may imagine, though, when we added users and therefore workload to the system, we started having more node failures. Under some conditions we'd only have six or so nodes fail in a day. On the worst days I lost almost half the cluster. That manual 'dd' process didn't take very long, but multiplied to scale it was a big issue.
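To put those daily failure counts in terms of what a multi-node job sees, here is a back-of-envelope sketch. It assumes node failures are independent and spread evenly over the day, which is my simplification and not something we measured; in practice our failures tracked workload, so treat the output as a rough illustration only. The function name job_survival is made up for this example.

```shell
#!/bin/sh
# job_survival FAILS_PER_DAY TOTAL_NODES JOB_NODES JOB_HOURS
# Prints the estimated probability that none of the job's nodes fails
# before the job finishes, under the independence assumption above:
#   P = (1 - fails/total) ^ (job_nodes * job_hours / 24)
job_survival() {
    awk -v f="$1" -v t="$2" -v n="$3" -v h="$4" 'BEGIN {
        p_node_day = 1 - f / t          # chance one node survives a day
        printf "%.3f\n", exp(n * (h / 24) * log(p_node_day))
    }'
}
```

For example, `job_survival 6 130 64 24` estimates the odds for a 64-node, 24-hour job on a day when six of the 130 nodes fail. The point is the exponent: the survival odds shrink geometrically with both node count and run time, which is why the bigger jobs suffered most.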
Meanwhile, the users on the system were really irritated that no job of any significant size would complete. If the job needed more than 8 nodes or ran for more than 4 hours, it was almost certain it would fail when one of the worker nodes died of a kernel panic. The users we'd held off the system due to the combination of their inexperience with clusters and the unstable state of the system didn't really believe the system was unstable - several stated openly that they believed we were lying about the system and had no intention of opening it up to the campus. I never could figure out that position -- if we were publicly making false statements in email and in print (in our guide to services, for instance) it made no sense. It's not like I can argue, while holding a half-eaten cheeseburger with mustard in my beard, that I am a vegetarian. This was the political situation, though.
At this point we indicated to Dell that we needed to escalate the issue past the conventional support lines we'd been following. While we had escalated our problem a few times, things had sort of flattened out at Dell. I suspect that everyone had looked at the issue and our problem had become a sort of status quo. We essentially had to get a resolution or publicly pull the system as a failed deployment. Dell sent in a new consultant, who checked over our system and found nothing really wrong with our configuration. He suggested a wipe-and-rebuild so that he could confirm, 100%, that we were set up correctly.
During this wipe, we rechecked our BIOS and firmware versions and found a new hard drive firmware. I don't recall if this firmware update was released in April or March, but it was applicable to the internal hard drives in all of our compute nodes. As part of the rebuild we flashed the drives. We had also had enough of the Infiniband flakiness and were able to get funding for a full replacement of our old cabling with newer SuperFlex cables. We reran these cables with lots of support. Finally, we could replug KVM without breaking Infiniband connections - and we could finally make the safety guys happy by closing the back of our rack with the TS-740. Previously the bulk of those Infiniband cables meant we had to keep the rack open in the back, with cables sticking out. It turns out that safety auditors don't think that's such a good idea, especially in rooms with halon fire suppressors.
Since we did that, the cluster has been really stable. We have a few other issues that are architectural or design issues rather than deployment issues. In other words, the system is working, just not working in the way we would like. We occasionally have a node hang up or kernel panic (caused, to all appearances, by the Topspin Infiniband drivers). Hung nodes can be reimaged by simply powering the blade down, powering it on, and hitting F12 to PXE boot. It's trivial, and except for our KVM port shortage (detailed below) can be done by anybody at the console.
Case 1: Infiniband under-utilization
We got Infiniband as a cluster communication network. In our initial deployment, the folks from Dell pitched Infiniband as a high speed, low latency network we'd use for MPI and access to our IBRIX storage. In practice, though, the Topspin driver stack is a problem - we have several applications that won't build against it, and we end up running those over Ethernet. Our IBRIX storage only works over TCP/IP, not over native Infiniband. When we ran it over our Infiniband, the IBRIX storage ran like we were on a hub. We got the full 4x speed when we had two nodes talking. When we scaled it up, though, our IOzone benchmarks showed that our throughput on the Infiniband degraded faster than gig Ethernet as the node count increased. This wasn't what we expected. I have a sneaking suspicion that moving to OpenIB might revitalize the idea of running storage over Infiniband.
Case 2: KVM and management ports
This is an irritation - we don't have the gig E ports for all of our DRACs to be on the cluster network. We can have ONE DRAC online at a time. There are some nice features to the DRAC, like retrieval of software versions, service tags and some diagnostic information. Ideally, I'd have that all cabled in for monitoring and management. It's not as big of an issue as the KVM, though. We're short a few ports for the KVM. This becomes significant because we don't really want our operators reaching into the backs of the racks for any reason, so handling direct compute node intervention - like manually PXE booting a hung node - requires a sysadmin. I'd love to be able to delegate that to the operators - so when a node hangs and doesn't reimage in a fixed time, we could automatically have the operator on duty go in and PXE boot the node. It's a smaller issue than the others, but I really don't like the situation where only one or two people can perform what's essentially a very simple process.
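The detection half of that delegation idea is easy to sketch, even if the KVM ports aren't there yet. The following is a minimal example, not what we actually run: it assumes a Linux ping (iputils, where -W is a timeout in seconds), and the function names and the compute-0-N style hostnames in the usage line are placeholders.

```shell
#!/bin/sh
# probe_node HOST: succeed if the node answers one ping within 2 s.
# Assumes Linux iputils ping; swap in whatever health check you trust.
probe_node() {
    ping -c 1 -W 2 "$1" >/dev/null 2>&1
}

# list_unresponsive HOST...: print each node that fails the probe,
# one per line, so the output can be mailed or paged to an operator.
list_unresponsive() {
    for node in "$@"; do
        probe_node "$node" || echo "$node"
    done
}
```

Run from cron on the frontend as something like `list_unresponsive compute-0-0 compute-0-1 compute-0-2`, this gives the operator on duty a list of blades to power-cycle and PXE boot. A node that answers ping can still be wedged, so a real check would probably also look at the scheduler's view of the node.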
Case 3: Application support
We don't have the person-hours to keep up with app support demands. There's not much more to say; I have a pipeline of app installation requests that I'm working through, but I am also tapped for some production support, some other projects, and doing user support. I slot in the app installations as best I can, but it's not really work that I'm good at doing in 15-minute chunks. Essentially, I only work on application deployments when I have a time chunk of 1 hour or more.
Case 4: Backups
So, with no backups, you're going to end up losing data. We know this. In our case, our IBRIX was made of 4 RAID containers. Initially we were assured by the designing department that they'd need all of that storage. In practice, only about 10% is used. Once we had the system online, it was too hard a sell to take it down long enough to reprovision the storage. Eventually, though, we lost data. In our case, due to bad luck - two drives in a RAID 5 failed within about 15 minutes. IBRIX worked all weekend for us, but we still had corruption and data loss. After that, I was able to scavenge a tape library (which didn't work) and later another tape library (which is on my slate to cable in). During our downtime to fix the IBRIX, I split our one big filesystem into two smaller ones. I do a nightly rsync to give me at least some recovery options. Our users still insist they need all that storage, but they haven't missed the 50% I took off the top. We could still lose data but we're insulated a bit - we'd either need rampant software corruption or to lose BOTH of our PV-220S's to really lose data. Still, I look forward to getting that tape library fully online.
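For the curious, the nightly rsync amounts to something like the sketch below. The mount points, script path and schedule in the comments are hypothetical, and the RSYNC override is only there so the command can be previewed without copying anything.

```shell
#!/bin/sh
# mirror_fs SRC DST: copy the contents of SRC into DST with rsync -a
# (archive mode: recursive, preserves permissions, times, ownership).
# Set RSYNC=echo to print the arguments instead of copying.
mirror_fs() {
    src="${1%/}"    # strip one trailing slash so we control it below
    dst="${2%/}"
    "${RSYNC:-rsync}" -a "$src/" "$dst/"
}

# Hypothetical crontab entry on the node that mounts both filesystems:
#   0 2 * * * /usr/local/sbin/nightly-mirror.sh
# where that script calls, for example:
#   mirror_fs /mnt/ibrix1 /mnt/ibrix2/mirror
```

I've left out --delete on purpose in this sketch: the copy then keeps files that users remove by mistake, at the cost of the mirror growing until someone prunes it. This is a second copy on the same arrays, not a backup, which is why the tape library still matters.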
So. Wow. That's a lot. I may have glossed over something of interest. If anyone wants to pick into a part of our experience, or has comments about what I've put here, please let me know. Anybody else want to start their thread about deploying their cluster? I'm hoping to add some more dos and don'ts to my library of information, and maybe build a decent meta-guide for deploying these things. Most of my issues with the deployment are pretty shallow in hindsight; but they were pernicious when we were actually dealing with the situation.
-c



