Server Networking
This section describes only the overall networking architecture. For networking details, see the Atlas Data Center Solution Networking Guide.
Logical Architecture
Atlas Data Center Solution provides training and inference computing capabilities. Servers are first networked within each cabinet, and the cabinets are then interconnected to form an AI cluster. Based on service types and security levels, the entire network is divided into the following service zones.
These zones are interconnected through core switches. Traffic streams are separated and protected based on the conditions of the data center; for example, traffic can be separated using VRF instances and protected using firewalls. These measures depend on user requirements and the existing data center environment and are not described in this document.
- Access zone: used for Internet and private line network access. External network access devices of the data center are deployed in this zone.
- Security service zone: provides security protection capabilities such as DDoS and intrusion detection.
- Network service zone: provides basic network services, such as vRouter, vLB, and vFW.
- Management zone: in-band management software is deployed in this zone to provide the service management system and O&M support components.
- Out-of-band management zone: connects to the management ports of network devices and BMC ports of servers to provide the out-of-band management network for physical devices. This network carries only the management traffic of physical devices and does not carry other service traffic.
- AI computing cluster zone: AI servers are deployed and form a cluster network to implement high-performance AI computing.
- General computing zone: provides general-purpose computing resources related to AI training, for example, to host software such as the deep learning platform.
- Storage zone: provides a high-speed, high-bandwidth interconnected storage system that stores training data in AI scenarios.
This document focuses on the network design of the AI computing cluster zone and provides suggestions on the design of the storage zone. The solutions for the other zones are independent of the Atlas Data Center Solution and follow traditional data center practices; therefore, this document does not describe them in detail.
In addition, the management zone uniformly maintains and manages all servers and devices in the data center, such as AI servers, general-purpose servers, storage devices, and network switches.
Each AI server connects to the following network planes:
- Parameter plane network: uses a two-layer leaf-spine network. The 100GE ports provided by the NPUs are connected to 100GE leaf switches to synchronize parameters during multi-node distributed training.
- Storage plane network: The two 25GE ports configured on each AI server are connected to 25GE leaf switches to access the high-speed, high-bandwidth storage systems in the storage zone.
- Service plane network: Each AI server is configured with two 25GE/10GE ports to connect to 25GE leaf switches for service scheduling and management.
- In-band management plane network: One GE or 10GE port on each AI server is connected to a management switch to access the management zone network for managing and maintaining cluster devices.
- Out-of-band management plane network: One GE port on each AI server is connected to a GE switch to access the out-of-band management zone network for out-of-band management and maintenance of cluster devices.
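The per-server port plan above can be captured as data to sanity-check cabling and leaf-switch port budgets for a given cluster size. The following is an illustrative sketch, not part of the product documentation: the plane names, helper functions (`ports_needed`, `plane_bandwidth_gbps`), and the assumption of eight NPU-provided 100GE ports per server are all hypothetical.

```python
# Hypothetical port plan per AI server, derived from the plane list above.
# The parameter-plane count of 8 x 100GE ports is an assumption (one port
# per NPU); the document itself does not state the NPU count.
PLANES = {
    # plane: (ports per AI server, port speed in Gbit/s)
    "parameter":              (8, 100),  # assumption: 8 NPUs, one 100GE port each
    "storage":                (2, 25),
    "service":                (2, 25),   # 25GE or 10GE; 25GE assumed here
    "in_band_management":     (1, 1),    # GE assumed; may also be 10GE
    "out_of_band_management": (1, 1),
}

def ports_needed(plane: str, num_servers: int) -> int:
    """Leaf-switch downlink ports required for one plane across the cluster."""
    ports_per_server, _speed = PLANES[plane]
    return ports_per_server * num_servers

def plane_bandwidth_gbps(plane: str, num_servers: int) -> int:
    """Aggregate access bandwidth of one plane across the cluster."""
    ports_per_server, speed = PLANES[plane]
    return ports_per_server * speed * num_servers

# Example: a 16-server cluster needs 128 x 100GE parameter-plane downlinks
# and offers 800 Gbit/s of aggregate storage-plane access bandwidth.
print(ports_needed("parameter", 16))        # 128
print(plane_bandwidth_gbps("storage", 16))  # 800
```

Keeping the plan as data makes it easy to adjust a single assumption (for example, 10GE instead of 25GE service ports) and recompute the switch-port budget.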
Physical Networking
Figure 7-2 shows the physical architecture of the Atlas Data Center Solution network.
In large-scale networking, the service/storage plane networks of the general computing zone, storage zone, and AI computing cluster zone are connected through core switches. In actual projects, if the number of aggregation switch ports is sufficient and you do not want to deploy core switches, you can directly connect them through aggregation switches.
The parameter plane network of the AI computing cluster zone is dedicated to distributed training and does not need to interconnect with other networks at the network layer. The parameter plane is designed as a 100GE non-blocking, non-converged network. During training, each NPU uses the RoCE protocol to synchronize gradient parameters, maximizing training performance.
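A non-converged (1:1 oversubscription) two-layer leaf-spine fabric means every leaf dedicates as much bandwidth to spine uplinks as to NPU downlinks. The sketch below illustrates one way to size such a fabric; the 64-port switch size, the one-uplink-per-spine-per-leaf wiring assumption, and the function name `size_fabric` are illustrative assumptions, not product specifications.

```python
import math

def size_fabric(num_npu_ports: int, switch_ports: int = 64):
    """Return (leaf_count, spine_count) for a non-blocking leaf-spine
    network in which each leaf splits its ports 50/50 between NPU
    downlinks and spine uplinks (1:1 oversubscription)."""
    down_per_leaf = switch_ports // 2              # e.g. 32 downlinks per 64-port leaf
    leaves = math.ceil(num_npu_ports / down_per_leaf)
    # Assumed wiring: each leaf runs exactly one uplink to every spine,
    # so the spine count equals the uplinks per leaf. A single-leaf
    # cluster needs no spine layer at all.
    spines = down_per_leaf if leaves > 1 else 0
    return leaves, spines

# Example: 512 NPU 100GE ports with 64-port leaves -> 16 leaves, 32 spines.
print(size_fabric(512))  # (16, 32)
```

Because the fabric is non-converged, any NPU can exchange RoCE gradient traffic with any other NPU at full 100GE line rate, which is what the parameter-plane design principle above requires.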
AI servers can be connected to the existing management network in a unified manner. There is no separate design requirement. In addition to managing AI servers, the management zone also manages all general-purpose servers, storage devices, and network switches in a unified manner.
The AI server cluster uses the typical Atlas 800 AI training server (model 9000)/Atlas 800 AI training card (model 9010) as the basic training server unit. Figure 7-3 and Figure 7-4 describe the functions of external ports on the Atlas 800 AI training server (model 9000)/Atlas 800 AI training card (model 9010). For details about the hardware types supported in the data center training scenario, see Hardware Introduction in Atlas Data Center Solution V100R020C10 Center Training Solution Description.
- In addition to the LOM ports shown in the preceding figure, the service plane and storage plane can also be networked using PCIe NICs.
- The parameter plane network port is also called the NPU network port.