Internet2 Performance Tools: LHC Testing Challenges
A multi-domain troubleshooting team used the Internet2 performance tool BWCTL to diagnose a network bandwidth issue. The data allowed UltraLight engineers to quickly pinpoint and fix a hardware failure rather than pursue a configuration workaround.
UltraLight is a collaboration of experimental physicists and network engineers working to provide the network capabilities needed for petabyte-scale analysis of globally distributed data. Existing Grid-based infrastructures provide massive computing and storage resources, but are limited by their treatment of the network as an external, passive, and largely unmanaged resource. UltraLight's goals are to:
- Develop and deploy global services that promote the network as an actively managed component.
- Integrate and test UltraLight in Grid-based physics production and analysis systems currently under development in the Large Hadron Collider (LHC) ATLAS and CMS experiments.
- Engineer and operate a trans- and intercontinental optical network testbed, including high-speed data caches and computing clusters.
High-energy physicists at the University of Michigan involved in the US-ATLAS experiment recently began monitoring the network infrastructure connecting their site to Brookhaven National Laboratory, the US-ATLAS Tier 1 location from which they receive their data. The physicists also began monitoring the paths between their site and the other peer LHC institutions—called Tier 2 sites—with which the University of Michigan shares data. Each of the US-ATLAS Tier 1 and Tier 2 sites has deployed a pair of perfSONAR Performance Nodes for this purpose.
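Mesh monitoring of this kind amounts to scheduling regular tests between every ordered pair of sites, so that each direction of each path is measured separately. A minimal sketch of generating such a schedule (site labels are hypothetical, not the actual US-ATLAS hostnames):

```python
from itertools import permutations

# Hypothetical site labels standing in for the Tier 1 and Tier 2 hosts.
sites = ["UM", "BNL", "Site-A", "Site-B"]

# Each ordered pair gets its own test, so both directions are covered;
# this is what lets a purely outbound problem show up later.
test_pairs = list(permutations(sites, 2))
print(len(test_pairs))  # 12 directed pairs for 4 sites
```

Measuring each direction independently matters here: as the next paragraph shows, the fault affected only the outbound direction.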
Regular monitoring tests were scheduled with BWCTL via perfSONAR-BUOY. A review of the results showed that outbound performance to every peer site was roughly one tenth of what it should have been. While it was now clear there was a performance problem, the team still had to pinpoint the cause and determine the best course of action to resolve it.
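The review described above can be sketched as a simple check over the measured throughputs: flag any peer whose outbound rate is far below expectation while its inbound rate is healthy. The peer names, rates, and the 940 Mbps nominal figure below are illustrative assumptions, not the actual US-ATLAS measurements:

```python
# Nominal achievable rate on a gigabit path (assumption for illustration).
EXPECTED_MBPS = 940.0

def flag_asymmetric_peers(results, threshold=0.5):
    """results: {peer: {"out": mbps, "in": mbps}}.
    Returns peers whose outbound rate falls below threshold * expected
    while the inbound rate is still at or above that mark."""
    flagged = []
    for peer, r in results.items():
        if r["out"] < threshold * EXPECTED_MBPS <= r["in"]:
            flagged.append(peer)
    return flagged

# Hypothetical measurements mimicking the case in the text:
# outbound ~1/10 of expected, inbound healthy.
measurements = {
    "Peer-1": {"out": 92.0, "in": 910.0},
    "Peer-2": {"out": 95.0, "in": 920.0},
    "Peer-3": {"out": 930.0, "in": 925.0},  # healthy in both directions
}
print(flag_asymmetric_peers(measurements))  # ['Peer-1', 'Peer-2']
```

The asymmetry between directions is the key signal: a uniformly slow path suggests congestion or a host limit, while a one-directional slowdown to every peer points back at the local site's outbound forwarding.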
A multi-domain troubleshooting team, which included staff from ESnet, Internet2, UltraLight, and the University of Michigan, began the analysis by using the NDT and NPAD tools to capture in-depth data on how TCP was operating over this path. These tools revealed a bottleneck in the path that was limiting the hosts' performance. By running tests from servers and clients at intermediate points along the path, applying the divide-and-conquer methodology proposed in the Internet2 Network Performance Workshops, the team was able to quickly isolate the problem to a subset of links on the path.
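The divide-and-conquer step can be sketched as a binary search over the ordered segments of the path, assuming the fault is "monotone": any test that crosses the faulty link shows the degraded throughput, and any test that stops short of it looks clean. The hop names and fault position below are hypothetical:

```python
def first_bad_segment(segments, is_bad):
    """Binary-search the ordered path segments for the first one whose
    test already shows degraded throughput. Assumes every test at or
    past the fault is bad and every test before it is clean."""
    lo, hi = 0, len(segments) - 1
    while lo < hi:
        mid = (lo + hi) // 2
        if is_bad(segments[mid]):
            hi = mid        # fault lies at or before this segment
        else:
            lo = mid + 1    # clean this far; the fault is further along
    return segments[lo]

# Hypothetical hops; suppose tests from hop index 3 onward show the slowdown.
path = ["site-edge", "regional", "backbone", "peering", "remote-edge"]
is_bad = lambda seg: path.index(seg) >= 3
print(first_bad_segment(path, is_bad))  # peering
```

The payoff is the same as in the workshop methodology: each intermediate test server halves the suspect region, so only a handful of tests are needed to narrow a long multi-domain path down to a subset of links.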
Armed with this knowledge, the UltraLight engineers began a detailed look into the switches and routers they controlled. “It was finally discovered that the forwarding engine on one of the switches had entered into a fault state causing all packets to be processor-switched,” noted Shawn McKee. “The switch counters were not indicating loss, but the error condition had disabled the hardware line-rate packet forwarding feature of the switch and slowed outbound packets by a factor of 10. A simple reboot resolved the problem.”
This type of ‘soft failure’, caused not by misconfiguration but by an actual hardware problem, normally goes undiagnosed; it is exactly the kind of problem for which the Internet2 Performance Node is an ideal troubleshooting tool. “If a hardware fault causes work to stop, engineers fix the problem; if the network is limping along, engineers don’t have time to jump in and fix it. In this case, the monitoring clearly showed an asymmetric behavior, which convinced the engineers that it was a ‘real’ problem – but the fault probably existed for weeks before it was finally caught and now resolved,” says Rich Carlson, developer of the NDT tool and part of the troubleshooting team.