The size and compute capacity of high performance computing (HPC) clusters continue to grow at a breakneck pace, with the realization of Exascale computing firmly within our grasp in the next few years. However, the obstacles to Exascale are daunting—CPU performance, power efficiency, reliability, memory, and storage are just a few. Intel is continuing to break down these barriers through its technological advances.
Meeting the Challenge for Exascale Computing
In addition to solving compute and power issues, a fabric is needed that can scale beyond the current InfiniBand*-based limits of tens of thousands of nodes to hundreds of thousands of nodes without losing performance or reliability.
But it’s not just about huge computer clusters. Both private industry and academia require HPC capabilities at different, and sometimes smaller, scales to bring products to market or accelerate research—all with limited resources. They need maximum cluster performance while staying within tight budgets.
HPC centers, ranging from several-node departmental or divisional clusters up to the largest supercomputers, face several key challenges:
- Performance. Processor capacity and memory bandwidth are scaling faster than system I/O. A solution is required that provides higher overall available I/O bandwidth per socket to accelerate Message Passing Interface (MPI) rates for tomorrow’s HPC deployments.
- Cost and density. More components in a server limit density and increase fabric cost. An integrated fabric controller helps eliminate the additional costs and required space of discrete cards, enabling higher server density while freeing up a valuable PCIe* slot for other storage and networking controllers.
- Reliability and power. Discrete interface cards consume many watts of power. An integrated interface card on the processor can draw less power with fewer discrete components.
The Future of High Performance Fabrics
Current standards-based high performance fabrics, such as InfiniBand*, were not originally designed for HPC, resulting in performance and scaling weaknesses that are currently impeding the path to Exascale computing. Intel® Omni-Path Architecture (Intel® OPA) is being designed specifically to address these issues and to scale cost-effectively from entry-level HPC clusters to larger clusters of 10,000 nodes or more. To improve on the InfiniBand specification and design, Intel is combining the industry’s best technologies, including those acquired from QLogic and Cray, with Intel® technologies.
While both Intel® OPA and InfiniBand* Enhanced Data Rate (EDR) run at 100 Gbps, there are many differences. The enhancements in Intel® OPA will help enable the progression toward Exascale while cost-effectively supporting clusters of all sizes, with optimization for HPC applications at both the host and fabric levels delivering benefits that are not possible with standard InfiniBand-based designs.
Intel® OPA is designed to provide:
- Features and functionality at both the host and fabric levels to greatly raise levels of scaling
- CPU and fabric integration necessary for the increased computing density, improved reliability, reduced power, and lower costs required by significantly larger HPC deployments
- Fabric tools to readily install, verify, and manage fabrics at this level of complexity
The Next-Generation Fabric
Intel® Omni-Path Architecture (Intel® OPA), an element of Intel® Scalable System Framework (Intel® SSF), delivers the performance for tomorrow’s high performance computing (HPC) workloads and the ability to scale to tens of thousands of nodes—and eventually more—at a price competitive with today’s fabrics. The Intel® OPA 100 Series product line is an end-to-end solution of PCIe* adapters, silicon, switches, cables, and management software. As the successor to Intel® True Scale Fabric, this optimized HPC fabric is built upon a combination of enhanced IP and Intel® technology.
For software applications, Intel® OPA will maintain consistency and compatibility with existing Intel® True Scale Fabric and InfiniBand* APIs by working through the open source OpenFabrics Alliance (OFA) software stack on leading Linux* distribution releases.
Intel® Omni-Path Key Fabric Features and Innovations
Adaptive Routing monitors the performance of the possible paths between fabric end-points and selects the least congested path to rebalance the packet load. While other fabric technologies also support adaptive routing, the implementation is what matters. Intel’s implementation is based on cooperation between the Fabric Manager and the switch ASICs. The Fabric Manager—with a global view of the topology—initializes the switch ASICs with several egress options per destination, updating these options as the fabric changes when links are added or removed. Once the switch egress options are set, the Fabric Manager monitors the fabric state, and the switch ASICs dynamically monitor and react to the congestion sensed on individual links. This approach enables Adaptive Routing to scale as fabrics grow larger and more complex.
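The division of labor described above can be sketched in a few lines. This is an illustrative simulation only, not Intel's actual implementation: the Fabric Manager pre-loads each switch with several egress options per destination, and the switch then picks the least congested option locally at forwarding time. All class and variable names are hypothetical.

```python
class Switch:
    """Toy model of a switch ASIC participating in adaptive routing."""

    def __init__(self, egress_options):
        # destination -> list of candidate egress ports,
        # installed by the (simulated) Fabric Manager
        self.egress_options = egress_options
        # port -> locally monitored congestion estimate (e.g., queue depth)
        self.congestion = {}

    def update_congestion(self, port, queue_depth):
        # the switch tracks congestion on its own links dynamically
        self.congestion[port] = queue_depth

    def forward(self, destination):
        # choose the least congested of the pre-approved egress options
        options = self.egress_options[destination]
        return min(options, key=lambda p: self.congestion.get(p, 0))

# Fabric Manager installs three egress options toward "node7"
sw = Switch({"node7": [1, 2, 3]})
sw.update_congestion(1, 40)
sw.update_congestion(2, 5)
sw.update_congestion(3, 25)
assert sw.forward("node7") == 2  # least congested egress wins
```

The point of the split is scalability: the global topology computation happens rarely (in the manager), while the per-packet congestion decision stays local to the switch.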
One of the critical roles of fabric management is the initialization and configuration of routes through the fabric between pairs of nodes. Intel® Omni-Path Fabric supports a variety of routing methods, including defining alternate routes that disperse traffic flows for redundancy, performance, and load balancing. Instead of sending all packets from a source to a destination via a single path, Dispersive Routing distributes traffic across multiple paths. Once received, packets are reassembled in their proper order for rapid, efficient processing. By leveraging more of the fabric to deliver maximum communications performance for all jobs, Dispersive Routing promotes optimal fabric efficiency.
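The spray-and-reassemble behavior of Dispersive Routing can be illustrated with a toy model. This is a sketch only; the path names, round-robin assignment, and data structures are hypothetical, not the fabric's actual scheduling policy.

```python
from itertools import cycle

def disperse(packets, paths):
    """Assign each (seq, payload) packet to a path round-robin."""
    rr = cycle(paths)
    return [(next(rr), seq, payload) for seq, payload in packets]

def reassemble(received):
    """Receiver restores sequence order regardless of arrival order."""
    return [payload for _, seq, payload in sorted(received, key=lambda t: t[1])]

packets = list(enumerate(["A", "B", "C", "D"]))
in_flight = disperse(packets, paths=["path0", "path1"])
in_flight.reverse()  # simulate packets arriving out of order
assert reassemble(in_flight) == ["A", "B", "C", "D"]
```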
Traffic Flow Optimization (TFO)
Traffic Flow Optimization (TFO) optimizes quality of service beyond simply selecting the priority—based on virtual lane or service level—of messages to be sent on an egress port. At the Intel® Omni-Path Architecture (Intel® OPA) link level, variable-length packets are broken up into fixed-sized containers that are in turn packaged into fixed-sized Link Transfer Packets (LTPs) for transmission over the link. Because packets are broken up into smaller containers, a higher priority container can request a pause and be inserted into the inter-switch link (ISL) data stream before the remaining containers of a lower priority packet finish transmitting.
The key benefit is that Traffic Flow Optimization (TFO) reduces the latency variation that high priority traffic experiences in the presence of lower priority traffic. It addresses a traditional weakness of both Ethernet and InfiniBand*, in which a packet must be transmitted to completion once transmission starts, even if higher priority packets become available.
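The container-level preemption idea can be sketched as follows. This is a hedged illustration, not the real LTP geometry: the container size, pause point, and function names are all invented for the example. A long low-priority packet is split into fixed-size containers, and a high-priority packet's containers are injected mid-stream instead of waiting for the bulk transfer to finish.

```python
CONTAINER = 16  # bytes per container (illustrative, not the real geometry)

def containers(packet_id, payload):
    """Split a variable-length packet into fixed-size containers."""
    return [(packet_id, payload[i:i + CONTAINER])
            for i in range(0, len(payload), CONTAINER)]

def transmit(bulk, urgent, preempt_after=2):
    """Model a pause: urgent containers cut in ahead of the remaining bulk."""
    wire = bulk[:preempt_after] + urgent + bulk[preempt_after:]
    return [pid for pid, _ in wire]

bulk = containers("low", b"x" * 64)     # 4 containers of low-priority data
urgent = containers("high", b"y" * 16)  # 1 high-priority container arrives late
order = transmit(bulk, urgent)
assert order == ["low", "low", "high", "low", "low"]
```

Without container-level preemption, the high-priority data would have had to wait for all four low-priority containers, which is exactly the latency-variation problem TFO targets.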
Packet Integrity Protection (PIP)
Packet Integrity Protection (PIP) allows for rapid and transparent recovery from transmission errors between a sender and a receiver on an Intel® Omni-Path Architecture (Intel® OPA) link. Given the very high Intel® OPA signaling rate (25.78125 Gbps per lane) and the goal of supporting large-scale systems with a hundred thousand or more links, transient bit errors must be tolerated while ensuring that the performance impact is insignificant. Packet Integrity Protection (PIP) enables recovery of transient errors whether they occur between a host and a switch or between two switches, eliminating the need for transport-level timeouts and end-to-end retries without the heavy latency penalty associated with alternative error recovery approaches.
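The link-level retry idea can be modeled in miniature. This sketch is illustrative only: it uses a CRC32 check and an immediate replay to stand in for PIP's link-level error detection and recovery, and the channel model and helper names are hypothetical. The key property shown is that recovery happens at the link, invisibly to the transport layer, with no end-to-end timeout.

```python
import zlib

def send_ltp(payload, channel):
    """Attach a CRC, send through a possibly-corrupting channel, replay on error."""
    crc = zlib.crc32(payload)
    retries = 0
    while True:
        received = channel(payload)
        if zlib.crc32(received) == crc:
            return received, retries
        retries += 1  # link-level replay, transparent to upper layers

# Channel that flips a bit on the first attempt only
flips = iter([True, False])
channel = lambda p: bytes([p[0] ^ 0xFF]) + p[1:] if next(flips) else p

data, retries = send_ltp(b"hello", channel)
assert data == b"hello" and retries == 1
```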
Dynamic Lane Scaling (DLS)
Dynamic Lane Scaling (DLS) allows an operation to continue even if one or more lanes of a 4x link fail, avoiding the need to restart the application or return to a previous checkpoint. The job can then run to completion before action is taken to resolve the issue. In contrast, InfiniBand* typically drops the whole 4x link if any one of its lanes drops, costing time and productivity.
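The contrast between the two failure behaviors can be made concrete with a small sketch. This is a hypothetical model, not vendor behavior as specified: with lane scaling, a 4x link that loses lanes keeps running at reduced bandwidth, whereas without it any lane failure takes down the whole link.

```python
LANE_GBPS = 25.78125  # per-lane signaling rate quoted in the text

def link_bandwidth(working_lanes, dynamic_lane_scaling):
    """Effective bandwidth of a 4x link after lane failures (toy model)."""
    if working_lanes == 0:
        return 0.0
    if dynamic_lane_scaling:
        return working_lanes * LANE_GBPS   # degrade gracefully, job keeps running
    # without lane scaling, any lane failure drops the whole 4x link
    return 4 * LANE_GBPS if working_lanes == 4 else 0.0

assert link_bandwidth(3, dynamic_lane_scaling=True) > 0    # degraded but alive
assert link_bandwidth(3, dynamic_lane_scaling=False) == 0  # whole link drops
```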
Intel is clearing the path to Exascale computing and addressing tomorrow’s HPC issues. Contact your Intel representative to discuss how Intel® Omni-Path Architecture (Intel® OPA) can improve the performance of your future HPC workloads.