Typical AI Fabric Configuration Case in the AI GPU Scenario
Networking Requirements
Figure 3-3 shows the AI Fabric networking of an AI-powered image recognition service. Servers have deep learning software installed for image training, and use GPUs for deep learning. The server NICs support the RoCEv2 and DCQCN functions. Leaf and spine switches are fully meshed through 100GE links. Servers are connected to leaf switches through 100GE links, and the oversubscription ratio is 1:1. In this example, the CE8861-4C-EI is used as the leaf node, and the CE8850-64CQ-EI is used as the spine node.
Priority Planning
Based on the service traffic characteristics, priorities in this example are planned as follows:
- Set the priority of CNP traffic to 6, scheduling mode to PQ, and DSCP value to 25.
- Set the priority of RoCEv2 traffic to 4, scheduling mode to DRR, weight to 65%, and DSCP value to 24.
- Set the priority of O&M traffic to 0, scheduling mode to DRR, and weight to 5%.
- Leave the other priorities reserved.
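The plan above amounts to a DSCP-to-queue classification table. As a rough illustration of that mapping (the dictionary and function below are illustrative assumptions for this example, not a switch API), the logic can be sketched as:

```python
# Sketch of the DSCP-to-queue classification planned above.
# The table mirrors this example's priority plan and is illustrative only.
DSCP_TO_QUEUE = {
    25: 6,  # CNP traffic -> queue 6 (PQ, scheduled first)
    24: 4,  # RoCEv2 traffic -> queue 4 (DRR, weight 65%)
}

def classify(dscp: int) -> int:
    """Return the egress queue for a packet with the given DSCP value.
    Unmapped DSCP values (e.g. O&M traffic) fall back to queue 0."""
    return DSCP_TO_QUEUE.get(dscp, 0)

print(classify(25))  # CNP -> queue 6
print(classify(24))  # RoCEv2 -> queue 4
print(classify(0))   # O&M -> queue 0
```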
Configuration Roadmap
In this example, the IP addresses and routes for interconnection between spine switches, leaf switches, and servers have been configured, and there are reachable routes between servers.
- Configure leaf and spine switches.
- Configure PFC. Before configuring PFC, you need to configure priority mapping and congestion scheduling.
- Configure the low-latency network function. After this function is configured, automatic buffer optimization and dynamic ECN threshold are enabled for lossless queues by default. You can further tune these two functions as needed.
- Enable the AI ECN function. Before enabling this function, you need to disable the dynamic ECN function.
- Configure PFC deadlock detection.
- Configure server NICs. (The detailed procedures are not provided.)
- Configure NICs to work in RoCEv2 mode.
- Configure the RoCEv2 link setup mode.
- Configure NICs to trust the DSCP value, and configure the DSCP value of RoCEv2 packets.
- Enable PFC for the priority of RoCEv2 packets on NICs.
- Enable DCQCN for the priority of RoCEv2 packets on NICs.
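DCQCN, enabled on the NICs above, reacts to the CNP packets whose priority was planned earlier: on each CNP the sender cuts its rate, and the cut stays deep while congestion persists. A simplified sketch of the published DCQCN rate-decrease rule follows (the gain constant `G` is an assumed illustrative value, and this is a model of the algorithm, not a vendor NIC API):

```python
# Simplified DCQCN sender-side rate decrease (per Zhu et al., SIGCOMM 2015).
# On receiving a CNP, the rate is cut in proportion to alpha, and alpha
# is pushed toward 1 while CNPs keep arriving. G is an assumed constant;
# rate-recovery phases are omitted for brevity.
G = 1 / 256  # alpha update gain (illustrative value)

def on_cnp(rate_gbps: float, alpha: float):
    """Return (new_rate, new_alpha) after one CNP arrives."""
    new_rate = rate_gbps * (1 - alpha / 2)
    new_alpha = (1 - G) * alpha + G  # alpha stays high while congestion persists
    return new_rate, new_alpha

rate, alpha = 100.0, 1.0  # start at line rate with alpha = 1
for _ in range(3):        # three consecutive CNPs
    rate, alpha = on_cnp(rate, alpha)
print(round(rate, 1))     # rate drops sharply under sustained congestion
```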
Procedure
The following describes the configurations on Leaf1. The configurations on Leaf2 are similar.
- Configure PFC.
- Configure priority mapping and congestion scheduling.
# In this example, the DSCP value of RoCEv2 packets is 24, and the DSCP value of CNP packets is 25. Configure a priority mapping profile in the DiffServ domain to map the priority of RoCEv2 packets to priority 4 (queue 4) and the priority of CNP packets to priority 6 (queue 6):
<HUAWEI> system-view
[~HUAWEI] sysname Leaf1
[*HUAWEI] commit
[~Leaf1] diffserv domain ds1
[*Leaf1-dsdomain-ds1] ip-dscp-inbound 24 phb af4 green //Map the priority of RoCEv2 packets to priority 4.
[*Leaf1-dsdomain-ds1] ip-dscp-inbound 25 phb cs6 green //Map the priority of CNP packets to priority 6.
[*Leaf1-dsdomain-ds1] quit
[*Leaf1] port-group all_using //Configure a port group.
[*Leaf1-port-group-all_using] group-member 100ge 1/0/1 to 100ge 1/0/6
[*Leaf1-port-group-all_using] quit
[*Leaf1] commit
[~Leaf1] port-group all_using
[*Leaf1-port-group-all_using] trust dscp
[*Leaf1-port-group-all_using] trust upstream ds1
[*Leaf1-port-group-all_using] quit
[*Leaf1] commit
# Configure the congestion scheduling mode for each queue. By default, queues on an interface use the PQ scheduling mode. Therefore, queue 6 can use the default scheduling mode to ensure preferential scheduling of CNP packets.
[~Leaf1] port-group all_using
[*Leaf1-port-group-all_using] qos drr 0 4
[*Leaf1-port-group-all_using] qos queue 0 drr weight 5
[*Leaf1-port-group-all_using] qos queue 4 drr weight 65
[*Leaf1-port-group-all_using] quit
[*Leaf1] commit
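With PQ on queue 6, the remaining bandwidth is shared by the DRR queues in proportion to their weights. A minimal sketch of that arithmetic (plain Python for illustration, not switch behavior):

```python
# Approximate bandwidth shares of the DRR queues configured above.
# Queue 6 is PQ and is always served first; the DRR weights divide
# whatever bandwidth PQ leaves over, in proportion to the weights.
drr_weights = {0: 5, 4: 65}  # queue -> DRR weight, as configured

def drr_shares(weights):
    """Return each queue's fraction of the bandwidth left over by PQ."""
    total = sum(weights.values())
    return {q: w / total for q, w in weights.items()}

shares = drr_shares(drr_weights)
print(shares[4])  # RoCEv2 queue gets 65/70, about 0.93 of the leftover bandwidth
```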
- Configure PFC for the priority of RoCEv2 traffic.
# Configure the queue with priority 4 to carry RoCEv2 traffic on the network. To implement this, enable PFC for priority 4 on each interface and implement PFC based on the priority mapped from the DSCP value.
[~Leaf1] dcb pfc //Enter the view of the default PFC profile.
[~Leaf1-dcb-pfc-default] priority 4
[*Leaf1-dcb-pfc-default] quit
[*Leaf1] port-group all_using
[*Leaf1-port-group-all_using] dcb pfc enable mode manual
[*Leaf1-port-group-all_using] quit
[*Leaf1] dcb pfc dscp-mapping enable slot 1
[*Leaf1] commit
After the preceding configurations are complete, RoCEv2 traffic is transmitted in the queue with priority 4, which is a lossless queue.
# As described in Optimizing Lossless Service Performance in Scenarios Without Packet Loss, CE8861-4C-EI switches are used as leaf nodes in this example, and fewer than eight ports are used on each switch. In this case, you can change the dynamic threshold for triggering PFC frames to 6 to improve the performance of RoCEv2 services.
<Leaf1> system-view
[~Leaf1] port-group all_using
[*Leaf1-port-group-all_using] dcb pfc buffer 4 xoff dynamic 6
[*Leaf1-port-group-all_using] quit
[*Leaf1] commit
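On shared-buffer switches, a dynamic XOFF threshold is typically computed as a fraction (alpha) of the currently free shared buffer rather than as a fixed cell count, so the pause point adapts as congestion consumes buffer. How the configured value (6 here) maps to alpha is device-specific; the alpha in the sketch below is an assumed placeholder, not the CE8861-4C-EI's actual mapping:

```python
# Illustration of a dynamic (alpha-based) XOFF threshold.
# On shared-buffer ASICs the PFC XOFF point typically scales with the
# free shared buffer: xoff = alpha * free_cells. The alpha value used
# below is an assumed placeholder for illustration only.
def dynamic_xoff(free_cells: int, alpha: float) -> int:
    """Cells a queue may occupy before PFC pause frames are sent."""
    return int(alpha * free_cells)

# As the free buffer shrinks under congestion, the pause point drops too:
print(dynamic_xoff(10000, alpha=0.5))
print(dynamic_xoff(2000, alpha=0.5))
```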
- Configure PFC deadlock detection.
# Set the PFC deadlock detection interval and recovery time to 100 ms for lossless queues (the detection timer precision is set to 10 ms, so a time value of 10 corresponds to 10 x 10 ms = 100 ms), and configure the device to disable PFC when five PFC deadlocks occur within 20s.
[~Leaf1] dcb pfc
[*Leaf1-dcb-pfc-default] dcb pfc deadlock-detect interval 10
[*Leaf1-dcb-pfc-default] priority 4 deadlock-detect time 10
[*Leaf1-dcb-pfc-default] priority 4 deadlock-recovery time 10
[*Leaf1-dcb-pfc-default] priority 4 turn-off threshold 5
[*Leaf1-dcb-pfc-default] quit
[*Leaf1] commit
After the configuration is complete, if you need to modify the PFC deadlock detection configuration, run the shutdown command to disable the PFC-enabled interface to prevent configuration failures caused by deadlock recovery on the switch.
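The turn-off behavior described above ("disable PFC when five deadlocks occur within 20s") is essentially a sliding-window event counter. A minimal sketch of that logic (plain Python; the 20 s window and threshold of 5 are taken from the text, the class itself is an illustrative model, not the switch implementation):

```python
from collections import deque

class PfcDeadlockGuard:
    """Disable PFC once `threshold` deadlocks occur within `window_s` seconds.
    Illustrative model of the turn-off threshold semantics described above."""

    def __init__(self, threshold: int = 5, window_s: float = 20.0):
        self.threshold = threshold
        self.window_s = window_s
        self.events = deque()  # timestamps of recent deadlocks
        self.pfc_enabled = True

    def on_deadlock(self, now: float):
        self.events.append(now)
        # Drop events that fell out of the sliding window.
        while self.events and now - self.events[0] > self.window_s:
            self.events.popleft()
        if len(self.events) >= self.threshold:
            self.pfc_enabled = False  # turn-off threshold reached

guard = PfcDeadlockGuard()
for t in (0, 4, 8, 12, 16):  # five deadlocks within 20 s
    guard.on_deadlock(t)
print(guard.pfc_enabled)  # False
```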
- Configure the low-latency network function.
- Configure the low-latency network function on Leaf1. This function takes effect after the switch restarts. After the configuration is successful, automatic buffer optimization and dynamic ECN threshold for lossless queues are enabled by default.
[~Leaf1] low-latency fabric
[*Leaf1-low-latency-fabric] quit
[*Leaf1] commit
[~Leaf1] quit
<Leaf1> save
Warning: The current configuration will be written to the device. Continue? [Y/N]: y
<Leaf1> reboot
Warning: The system will reboot. Continue? [Y/N]: y
- Enable the AI ECN function. Before enabling this function, you need to disable the dynamic ECN function of lossless queues.
[~Leaf1] low-latency fabric
[~Leaf1-low-latency-fabric] undo qos dynamic-ecn-threshold enable
[*Leaf1-low-latency-fabric] quit
[*Leaf1] commit
[~Leaf1] ai-service
[*Leaf1-ai-service] ai-ecn
[*Leaf1-ai-service-ai-ecn] ai-ecn enable
[*Leaf1-ai-service-ai-ecn] quit
[*Leaf1-ai-service] quit
[*Leaf1] commit
Configure Spine1.
- Configure PFC.
- Configure priority mapping and congestion scheduling.
# In this example, the DSCP value of RoCEv2 packets is 24, and the DSCP value of CNP packets is 25. Configure a priority mapping profile in the DiffServ domain to map the priority of RoCEv2 packets to priority 4 (queue 4) and the priority of CNP packets to priority 6 (queue 6):
<HUAWEI> system-view
[~HUAWEI] sysname Spine1
[*HUAWEI] commit
[~Spine1] diffserv domain ds1
[*Spine1-dsdomain-ds1] ip-dscp-inbound 24 phb af4 green //Map the priority of RoCEv2 packets to priority 4.
[*Spine1-dsdomain-ds1] ip-dscp-inbound 25 phb cs6 green //Map the priority of CNP packets to priority 6.
[*Spine1-dsdomain-ds1] quit
[*Spine1] port-group all_using //Configure a port group.
[*Spine1-port-group-all_using] group-member 100ge 1/0/1 to 100ge 1/0/9
[*Spine1-port-group-all_using] quit
[*Spine1] commit
[~Spine1] port-group all_using
[*Spine1-port-group-all_using] trust dscp
[*Spine1-port-group-all_using] trust upstream ds1
[*Spine1-port-group-all_using] quit
[*Spine1] commit
# Configure the congestion scheduling mode for each queue. By default, queues on an interface use the PQ scheduling mode. Therefore, queue 6 can use the default scheduling mode to ensure preferential scheduling of CNP packets.
[~Spine1] port-group all_using
[*Spine1-port-group-all_using] qos drr 0 4
[*Spine1-port-group-all_using] qos queue 0 drr weight 5
[*Spine1-port-group-all_using] qos queue 4 drr weight 65
[*Spine1-port-group-all_using] quit
[*Spine1] commit
- Configure PFC for the priority of RoCEv2 traffic.
# Configure the queue with priority 4 to carry RoCEv2 traffic on the network. To implement this, enable PFC for priority 4 on each interface and implement PFC based on the priority mapped from the DSCP value.
[~Spine1] dcb pfc //Enter the view of the default PFC profile.
[~Spine1-dcb-pfc-default] priority 4
[*Spine1-dcb-pfc-default] quit
[*Spine1] port-group all_using
[*Spine1-port-group-all_using] dcb pfc enable mode manual
[*Spine1-port-group-all_using] quit
[*Spine1] dcb pfc dscp-mapping enable slot 1
[*Spine1] commit
After the preceding configurations are complete, RoCEv2 traffic is transmitted in the queue with priority 4, which is a lossless queue.
# As described in Optimizing Lossless Service Performance in Scenarios Without Packet Loss, CE8850-64CQ-EI switches are used as spine nodes in this example, and 8 to 16 ports are used on each switch. In this case, you can change the dynamic threshold for triggering PFC frames to 5 to improve the performance of RoCEv2 services.
<Spine1> system-view
[~Spine1] port-group all_using
[*Spine1-port-group-all_using] dcb pfc buffer 4 xoff dynamic 5
[*Spine1-port-group-all_using] quit
[*Spine1] commit
- Configure PFC deadlock detection.
# Set the PFC deadlock detection interval and recovery time to 100 ms for lossless queues (the detection timer precision is set to 10 ms, so a time value of 10 corresponds to 10 x 10 ms = 100 ms), and configure the device to disable PFC when five PFC deadlocks occur within 20s.
[~Spine1] dcb pfc
[*Spine1-dcb-pfc-default] dcb pfc deadlock-detect interval 10
[*Spine1-dcb-pfc-default] priority 4 deadlock-detect time 10
[*Spine1-dcb-pfc-default] priority 4 deadlock-recovery time 10
[*Spine1-dcb-pfc-default] priority 4 turn-off threshold 5
[*Spine1-dcb-pfc-default] quit
[*Spine1] commit
After the configuration is complete, if you need to modify the PFC deadlock detection configuration, run the shutdown command to disable the PFC-enabled interface to prevent configuration failures caused by deadlock recovery on the switch.
- Configure the low-latency network function.
- Configure the low-latency network function on Spine1. This function takes effect after the switch restarts. After the configuration is successful, automatic buffer optimization and dynamic ECN threshold for lossless queues are enabled by default.
[~Spine1] low-latency fabric
[*Spine1-low-latency-fabric] quit
[*Spine1] commit
[~Spine1] quit
<Spine1> save
Warning: The current configuration will be written to the device. Continue? [Y/N]: y
<Spine1> reboot
Warning: The system will reboot. Continue? [Y/N]: y
- Enable the AI ECN function. Before enabling this function, you need to disable the dynamic ECN function of lossless queues.
[~Spine1] low-latency fabric
[~Spine1-low-latency-fabric] undo qos dynamic-ecn-threshold enable
[*Spine1-low-latency-fabric] quit
[*Spine1] commit
[~Spine1] ai-service
[*Spine1-ai-service] ai-ecn
[*Spine1-ai-service-ai-ecn] ai-ecn enable
[*Spine1-ai-service-ai-ecn] quit
[*Spine1-ai-service] quit
[*Spine1] commit
Verifying the Configuration
- Check the PFC threshold and headroom value.
[~Spine1] display dcb pfc buffer interface 100ge 1/0/1
Xon: PFC backpressure stop threshold
Xoff: PFC backpressure threshold
Hdrm: Headroom buffer threshold
Guaranteed: PFC guaranteed buffer threshold
The actual PFC backpressure stop threshold is the higher value between the value of xon and the difference between the value of xoff and the value of xon-offset.
C:cells B:bytes K:kilobytes M:megabytes D:dynamic alpha
------------------------------------------------------------------------------------
Interface             Queue  Guaranteed  Xon      Xon-Offset  Xoff    Hdrm
------------------------------------------------------------------------------------
100GE1/0/1            4      10(C)       100(C)   20(C)       5(D)    250(C)
------------------------------------------------------------------------------------
- Check the numbers of PFC deadlocks and recovery times. If the values of DeadlockNum and RecoveryNum are 0, no deadlock is triggered.
[~Spine1] display dcb pfc interface 100ge 1/0/1
-----------------------------------------------------------------------------------------
Interface    Queue  Received(Frames)     ReceivedRate(pps)     DeadlockNum
                    Transmitted(Frames)  TransmittedRate(pps)  RecoveryNum
-----------------------------------------------------------------------------------------
100GE1/0/1   4      0                    0                     0
                    0                    0                     0
-----------------------------------------------------------------------------------------
- Check the enabling status of the AI ECN function and the calculated ECN threshold.
[~Spine1] display ai-ecn calculated state interface 100ge 1/0/1
*: Indicates the queue where AI ECN takes effect.
AI-ECN State: enabled
--------------------------------------------------------------------
Interface             Queue   Low-Threshold   High-Threshold   Probability
                              (Byte)          (Byte)           (%)
--------------------------------------------------------------------
100GE1/0/1            0       0               0                0
                      1       0               0                0
                      2       0               0                0
                      3       0               0                0
                      *4      4896            18874080         40
                      5       0               0                0
                      6       0               0                0
                      7       0               0                0
--------------------------------------------------------------------
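When checking this output in scripts, a data row can be parsed mechanically: a leading "*" marks the queue where AI ECN takes effect. A rough sketch follows (the sample row is taken from the output above; the fixed whitespace-separated layout is an assumption):

```python
# Parse one data row of the "display ai-ecn calculated state" output.
# A leading "*" marks the queue where AI ECN takes effect. The layout
# assumed here matches the sample output shown above.
def parse_ai_ecn_row(row: str):
    fields = row.split()
    # An interface name may or may not lead the row; drop it if present.
    if fields and fields[0].startswith("100GE"):
        fields = fields[1:]
    queue, low, high, prob = fields
    return {
        "active": queue.startswith("*"),
        "queue": int(queue.lstrip("*")),
        "low_threshold": int(low),
        "high_threshold": int(high),
        "probability": int(prob),
    }

row = parse_ai_ecn_row("100GE1/0/1    *4    4896    18874080    40")
print(row["active"], row["high_threshold"])  # True 18874080
```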
Verifying the Result
As mentioned in Scenario, the distributed AI training performance is measured by the speedup, which is calculated using the following formula: Overall performance of N nodes/(Performance of a single node x N), in percentage.
To verify the effect of distributed AI training on the AI Fabric networking, you need to measure the following two aspects to obtain the speedup:
- Overall performance of N nodes: This refers to the parallel processing effect of multiple GPU servers and can be regarded as the number of images that can be trained by nine GPU servers that work in parallel per second in this example.
- Performance of a single node: This refers to the processing effect of a single GPU server and can be regarded as the number of images that can be trained by a single GPU server per second in this example.
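The speedup formula above can be computed directly from these two measurements. A small sketch (the images-per-second figures below are made-up placeholders, not measured results):

```python
# Speedup = overall performance of N nodes / (single-node performance * N),
# expressed as a percentage. The throughput figures used below are
# illustrative placeholders only.
def speedup(cluster_images_per_s: float, single_node_images_per_s: float, n: int) -> float:
    """Return the distributed training speedup as a percentage."""
    return cluster_images_per_s / (single_node_images_per_s * n) * 100

# e.g. 9 servers jointly training 4140 images/s vs 500 images/s alone:
print(round(speedup(4140, 500, 9), 1))  # 92.0
```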
You can use deep learning software or the visualized O&M function of iMaster NCE-FabricInsight to verify the configuration of the AI Fabric network. For details about how to deploy iMaster NCE-FabricInsight for visualized O&M, refer to Best Deployment Practices of AI Fabric for Visualized O&M.