Overview
Background
AlphaGo's victory in March 2016 was a major milestone in Artificial Intelligence (AI) research, signaling the arrival of the Fourth Industrial Revolution characterized by AI. More and more enterprises are incorporating AI into their digital transformation strategies. Against this backdrop, the focus of enterprise data centers in the AI era is shifting from fast service provisioning to efficient data processing. Specifically, emerging applications such as high-performance computing (HPC), distributed storage, and AI require data center networks (DCNs) with zero packet loss, low latency, and high throughput. Traditional TCP/IP-based networks cannot satisfy these requirements: their latency is high, and large amounts of CPU resources are consumed in key phases such as data copying.
Remote Direct Memory Access (RDMA) is an alternative networking technology that enables a network interface card (NIC) to read and write host memory directly, bypassing the CPU and achieving high bandwidth, low latency, and low resource consumption. However, the dedicated InfiniBand architecture that RDMA traditionally runs on is closed and incompatible with existing Ethernet networks, leading to high costs. RDMA over Converged Ethernet (RoCE) effectively solves these problems. RoCE is available in two versions: RoCEv1 and RoCEv2. RoCEv1 is a link layer protocol and cannot cross broadcast domains, whereas RoCEv2 is a UDP-encapsulated network layer protocol that can be routed across IP networks.
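To make the encapsulation difference concrete, the following Python sketch assembles the header stack of a RoCEv2 packet: Ethernet, then IPv4, then UDP with the well-known RoCEv2 destination port 4791, then the InfiniBand Base Transport Header (BTH). The header ordering and port 4791 come from the RoCEv2 specification; all addresses, field values, and the simplified BTH layout here are illustrative assumptions, not captured traffic.

```python
import struct

def rocev2_frame(payload: bytes) -> bytes:
    """Assemble a simplified RoCEv2 frame: Eth / IPv4 / UDP(4791) / BTH / payload.

    All addresses and identifiers below are illustrative placeholders.
    """
    # InfiniBand Base Transport Header (BTH), 12 bytes:
    # opcode, flags/version, partition key, dest QP, ack-request + PSN.
    bth = struct.pack(
        "!BBHII",
        0x04,            # opcode: RC SEND-only (illustrative choice)
        0x00,            # SE/M flags clear, pad 0, transport version 0
        0xFFFF,          # default partition key
        0x000000AB,      # high byte reserved, low 24 bits = destination QP
        0x00000001,      # high byte = ack-request flag, low 24 bits = PSN
    )
    # UDP header: arbitrary flow-identifying source port, destination 4791.
    udp_len = 8 + len(bth) + len(payload) + 4   # +4 for trailing ICRC
    udp = struct.pack("!HHHH", 0xC000, 4791, udp_len, 0)
    # Minimal IPv4 header, protocol 17 (UDP); checksum left at 0 for brevity.
    ip = struct.pack(
        "!BBHHHBBH4s4s",
        0x45, 0, 20 + udp_len, 0, 0, 64, 17, 0,
        bytes([192, 168, 0, 1]), bytes([192, 168, 0, 2]),
    )
    eth = bytes(6) + bytes(6) + struct.pack("!H", 0x0800)  # placeholder MACs
    icrc = bytes(4)  # the invariant CRC would be computed over the packet
    return eth + ip + udp + bth + payload + icrc

frame = rocev2_frame(b"hello")
print(len(frame), "bytes; UDP destination port is fixed at 4791 for RoCEv2")
```

Because the outer headers are ordinary IPv4/UDP, any standard router can forward RoCEv2 traffic, which is exactly what RoCEv1's pure link-layer framing cannot do.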
RoCEv2 is used by applications such as HPC, distributed storage, and AI to reduce CPU processing overhead and latency and to improve application performance. However, because RDMA was designed for lossless InfiniBand networks, RoCEv2 lacks a robust loss recovery mechanism and is therefore highly sensitive to packet loss. In addition, these distributed high-performance applications generate N:1 incast traffic, in which many senders transmit to one receiver simultaneously. Such bursts can congest and even overflow the internal queue buffers of Ethernet switches, increasing application latency and decreasing throughput, so the performance of distributed applications deteriorates.
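The incast problem can be seen in a toy queue model: N synchronized senders share one egress port with a finite buffer, arrivals exceed the drain rate, and the tail of the burst is dropped. The buffer size, rates, and flow counts below are arbitrary assumptions chosen only to make the overflow visible, not measurements of any real switch.

```python
# Toy model of N:1 incast: N senders burst into one switch egress queue.
# Buffer size, rates, and burst length are arbitrary illustrative values.

BUFFER_PKTS = 100      # egress queue capacity, in packets
DRAIN_PER_TICK = 10    # packets the egress link serves per tick
SENDERS = 32           # N concurrent senders
BURST_PER_SENDER = 2   # packets each sender injects per tick
BURST_TICKS = 5

queue = 0
dropped = 0
for tick in range(BURST_TICKS):
    arrivals = SENDERS * BURST_PER_SENDER
    space = BUFFER_PKTS - queue
    accepted = min(arrivals, space)
    dropped += arrivals - accepted          # tail drop once the buffer fills
    queue = max(0, queue + accepted - DRAIN_PER_TICK)
    print(f"tick {tick}: queue={queue:3d} dropped_total={dropped}")

# With 64 arrivals vs. 10 departures per tick, the queue saturates within
# two ticks and every later burst loses packets. Each drop stalls RoCEv2,
# which lacks fast loss recovery, hence the need for lossless operation.
```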
Solution
In the AI era, Huawei is seizing the opportunity presented by the RDMA upgrade of DCNs to build a next-generation intelligent lossless low-latency DCN solution: AI Fabric. Built on two levels of AI chips and a unique intelligent congestion scheduling algorithm, AI Fabric delivers zero packet loss, high throughput, and low latency for RDMA service flows. It accelerates computing and storage in the AI era and achieves performance comparable to that of a dedicated InfiniBand network at the lower cost of Ethernet, improving overall ROI 45-fold. This helps build a unified, converged, and highly efficient DCN for future data centers.
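Huawei's congestion scheduling algorithm itself is proprietary and not described here. As a rough illustration of the general mechanism such lossless RoCEv2 fabrics build on, the sketch below shows DCQCN-style, ECN-driven sender rate adjustment: marked packets trigger a multiplicative rate cut scaled by a congestion estimate, and unmarked periods allow gradual recovery. This is not Huawei's algorithm, and all constants are illustrative assumptions.

```python
# DCQCN-style ECN reaction at a sender (illustrative sketch only).
# Constants are arbitrary assumptions, not tuned parameters.

LINE_RATE = 100.0   # Gbps
G = 1.0 / 16        # gain used to age the congestion estimate
RAI = 0.5           # additive increase step, Gbps

rate = LINE_RATE    # current sending rate
target = LINE_RATE  # rate to recover toward
alpha = 1.0         # estimate of the fraction of ECN-marked packets

def on_update(cnp_received: bool) -> None:
    """Adjust the sending rate once per update period."""
    global rate, target, alpha
    if cnp_received:
        # A Congestion Notification Packet arrived: cut the rate
        # multiplicatively, scaled by the congestion estimate alpha.
        target = rate
        rate = rate * (1 - alpha / 2)
        alpha = (1 - G) * alpha + G       # congestion seen: raise estimate
    else:
        alpha = (1 - G) * alpha                 # no congestion: decay estimate
        target = min(LINE_RATE, target + RAI)   # additive probe upward
        rate = (rate + target) / 2              # fast recovery toward target

for marked in [True, True, False, False, False, False]:
    on_update(marked)
    print(f"marked={marked!s:5} rate={rate:6.2f} Gbps alpha={alpha:.3f}")
```

The design intent is the same as in the solution described above: react to congestion signals before switch buffers overflow, so the fabric stays lossless while keeping link utilization high.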