Scenario
AI is a branch of computer technology concerned with making machines behave intelligently, in a way similar to the human mind. AI applications include robotics, voice recognition, image recognition, autonomous driving, and intelligent recommendation. Deep learning, which relies on compute-intensive iterative floating-point operations, is crucial to AI. A deep learning algorithm extracts features from a large number of samples through multi-layer neural networks, continuously adjusting its parameters during training and then performing inference. To improve deep learning computing capability, distributed nodes are used for AI training. Distributed AI training performance is measured by the speedup, expressed as a percentage: Overall performance of N nodes/(Performance of a single node x N) x 100%.
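The speedup formula above can be sketched as a small helper function. The function name, parameter names, and the example figures are illustrative assumptions, not values from this document:

```python
def speedup(single_node_perf, overall_perf, n):
    """Speedup (%) = overall performance of N nodes
    / (performance of a single node x N) x 100."""
    return overall_perf / (single_node_perf * n) * 100

# Hypothetical example: 8 nodes, each delivering 100 samples/s alone,
# achieve 720 samples/s together (synchronization overhead costs 10%):
print(speedup(100, 720, 8))  # 90.0
```

A speedup of 100% would mean the cluster scales perfectly; real clusters fall below that because of the synchronization traffic described next.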
- In distributed AI training, the N:1 incast traffic model is used to synchronize the result of each iteration. The volume of burst traffic in each iteration is proportional to the number of model parameters. As computing capability and storage performance improve, the pressure on the AI training network increases sharply, and network devices must transmit incast traffic more efficiently.
- The AI GPU scenario imposes stringent requirements on computing performance. GPU chips are used for computing, and a relatively small number of servers are deployed.
- In large-scale scenarios, distributed AI training performance is limited by the number of network transmissions and by network latency. To maintain the speedup, the network must balance the tradeoff between throughput (bandwidth) and latency, that is, increase the throughput while reducing the latency.
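The throughput/latency tradeoff above can be illustrated with a simple, assumed cost model for data-parallel training: each iteration pays a fixed compute time plus a synchronization time that depends on parameter volume, network bandwidth, and latency. The model and all parameter values below are illustrative assumptions, not measurements from this document:

```python
def speedup_pct(compute_s, param_bytes, bandwidth_bps, latency_s):
    """Assumed model: per iteration, every node computes for compute_s
    seconds, then synchronizes param_bytes of gradients over a link of
    bandwidth_bps (bytes/s) with latency_s of fixed delay.
    Speedup (%) = compute time / (compute time + sync time) x 100."""
    sync_s = param_bytes / bandwidth_bps + latency_s
    return compute_s / (compute_s + sync_s) * 100

# Hypothetical baseline: 0.1 s of compute, 1 GB of parameters,
# 12.5 GB/s of usable bandwidth, 1 ms of latency.
base = speedup_pct(0.1, 1e9, 12.5e9, 1e-3)
# Doubling bandwidth or halving latency both raise the speedup:
more_bw = speedup_pct(0.1, 1e9, 25e9, 1e-3)
less_lat = speedup_pct(0.1, 1e9, 12.5e9, 0.5e-3)
print(base < more_bw and base < less_lat)  # True
```

Under this model, improving either throughput or latency shortens the synchronization phase, which is why both must be optimized together to sustain the speedup at scale.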