Typical AI Fabric Configuration Case in the HPC Scenario
Networking Requirements
Figure 4-3 shows the AI Fabric networking for an HPC service. All servers use CPUs for parallel computing. The server NICs support the RoCEv2 and DCQCN functions. Leaf and spine switches are fully meshed through 100GE links. Servers are connected to leaf switches through 100GE links, and the oversubscription ratio is 1:1. In this example, CE8850-64CQ-EI switches are used as leaf and spine switches.
Priority Planning
Based on the service traffic characteristics, priorities in this example are planned as follows:
- Set the priority of CNP traffic to 6, scheduling mode to PQ, and DSCP value to 25.
- Set the priority of RoCEv2 traffic to 4, scheduling mode to DRR, weight to 65%, and DSCP value to 24.
- Set the priority of O&M traffic to 0, scheduling mode to DRR, and weight to 5%.
- Configure other priorities to be reserved.
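The plan above can be captured as data and sanity-checked before it is turned into switch configuration. The following Python sketch is illustrative only. The `QueuePlan` class and the `validate` and `drr_shares` helpers are hypothetical names, and the share calculation assumes textbook DRR behaviour (bandwidth not consumed by PQ queues is divided among backlogged DRR queues in proportion to their weights), which may differ in detail from the switch implementation.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class QueuePlan:
    """One row of the priority plan (hypothetical helper type)."""
    traffic: str
    priority: int            # local priority, which is also the queue index
    scheduling: str          # "PQ" or "DRR"
    weight: Optional[int]    # DRR weight in percent; None for PQ queues
    dscp: Optional[int]      # DSCP carried by the packets; None if not planned


# Values taken from the priority plan in this example.
PLAN = [
    QueuePlan("CNP", 6, "PQ", None, 25),
    QueuePlan("RoCEv2", 4, "DRR", 65, 24),
    QueuePlan("O&M", 0, "DRR", 5, None),
]


def validate(plan):
    """Raise ValueError if the plan is internally inconsistent."""
    priorities = [q.priority for q in plan]
    if len(set(priorities)) != len(priorities):
        raise ValueError("each traffic class needs its own priority")
    if any(not 0 <= p <= 7 for p in priorities):
        raise ValueError("priority must be in 0..7")
    dscps = [q.dscp for q in plan if q.dscp is not None]
    if len(set(dscps)) != len(dscps):
        raise ValueError("DSCP values must be unique")
    if any(not 0 <= d <= 63 for d in dscps):
        raise ValueError("DSCP must be in 0..63")
    total = sum(q.weight for q in plan if q.scheduling == "DRR")
    if total > 100:
        raise ValueError("DRR weights exceed 100%")


def drr_shares(plan):
    """Share of the DRR bandwidth per queue when all DRR queues are backlogged."""
    drr = [q for q in plan if q.scheduling == "DRR"]
    total = sum(q.weight for q in drr)
    return {q.traffic: q.weight / total for q in drr}


validate(PLAN)
shares = drr_shares(PLAN)
print({name: round(share, 3) for name, share in shares.items()})
```

Under these assumptions, when only queues 0 and 4 are backlogged and the PQ queue is idle, RoCEv2 traffic receives 65/70 (about 93%) of the link and O&M traffic the remaining 5/70.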
Configuration Roadmap
In this example, the IP addresses and routes for interconnection between spine switches, leaf switches, and servers have been configured, and there are reachable routes between servers.
- Configure leaf and spine switches.
  - Configure PFC. Before configuring PFC, configure priority mapping and congestion scheduling.
  - Configure the low-latency network function. After this function is configured, automatic buffer optimization and the dynamic ECN threshold are enabled for lossless queues by default. You can tune either function as needed.
  - Enable the AI ECN function. Before enabling this function, disable the dynamic ECN threshold function.
  - Configure PFC deadlock detection.
- Configure server NICs. (The detailed procedures are not provided.)
  - Configure NICs to work in RoCEv2 mode.
  - Configure the RoCEv2 link setup mode.
  - Configure NICs to trust the DSCP value, and configure the DSCP value of RoCEv2 packets.
  - Enable PFC for the priority of RoCEv2 packets on NICs.
  - Enable DCQCN for the priority of RoCEv2 packets on NICs.
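Server NICs and traffic tools often express the packet marking as a ToS (IPv4) or Traffic Class (IPv6) byte rather than as a DSCP value. Because DSCP occupies the upper six bits of that byte and ECN the lower two, the conversion is a two-bit shift. The Python sketch below shows the conversion together with a lookup table that mirrors the priority plan in this example. The function names and the `DSCP_TO_QUEUE` table are illustrative and are not part of any switch or NIC API.

```python
# DSCP-to-PHB/queue mapping planned in this example.
DSCP_TO_QUEUE = {
    24: ("af4", 4),   # RoCEv2 data packets
    25: ("cs6", 6),   # CNP packets
}


def dscp_to_tos(dscp: int, ecn: int = 0) -> int:
    """Return the ToS/Traffic Class byte for a DSCP value and a 2-bit ECN field."""
    if not 0 <= dscp <= 63:
        raise ValueError("DSCP must be in 0..63")
    if not 0 <= ecn <= 3:
        raise ValueError("ECN must be in 0..3")
    return (dscp << 2) | ecn


def tos_to_dscp(tos: int) -> int:
    """Extract the DSCP value from a ToS/Traffic Class byte."""
    if not 0 <= tos <= 255:
        raise ValueError("ToS must be in 0..255")
    return tos >> 2


for dscp, (phb, queue) in sorted(DSCP_TO_QUEUE.items()):
    tos = dscp_to_tos(dscp)
    print(f"DSCP {dscp} -> ToS {tos} (0x{tos:02x}), PHB {phb}, queue {queue}")
```

For example, DSCP 24 corresponds to a ToS byte of 96 (0x60) and DSCP 25 to 100 (0x64) when the ECN bits are zero.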
Procedure
The following describes the configurations on Leaf1. The configurations on Leaf2, Spine1, and Spine2 are similar.
- Configure PFC.
- Configure priority mapping and congestion scheduling.
# In this example, the DSCP value of RoCEv2 packets is 24, and the DSCP value of CNP packets is 25. Configure a priority mapping profile in the DiffServ domain as follows to map the priority of RoCEv2 packets to priority 4 (queue 4) and the priority of CNP packets to priority 6 (queue 6):
<HUAWEI> system-view
[~HUAWEI] sysname Leaf1
[*HUAWEI] commit
[~Leaf1] diffserv domain ds1
[*Leaf1-dsdomain-ds1] ip-dscp-inbound 24 phb af4 green  //Map the priority of RoCEv2 packets to priority 4.
[*Leaf1-dsdomain-ds1] ip-dscp-inbound 25 phb cs6 green  //Map the priority of CNP packets to priority 6.
[*Leaf1-dsdomain-ds1] quit
[*Leaf1] port-group all_using  //Configure a port group.
[*Leaf1-port-group-all_using] group-member 100ge 1/0/1 to 100ge 1/0/32
[*Leaf1-port-group-all_using] quit
[*Leaf1] commit
[~Leaf1] port-group all_using
[*Leaf1-port-group-all_using] trust dscp
[*Leaf1-port-group-all_using] trust upstream ds1
[*Leaf1-port-group-all_using] quit
[*Leaf1] commit
# Configure the congestion scheduling mode for each queue. By default, queues on an interface use the PQ scheduling mode. Therefore, queue 6 can use the default scheduling mode to ensure preferential scheduling of CNP packets.
[~Leaf1] port-group all_using
[*Leaf1-port-group-all_using] qos drr 0 4
[*Leaf1-port-group-all_using] qos queue 0 drr weight 5
[*Leaf1-port-group-all_using] qos queue 4 drr weight 65
[*Leaf1-port-group-all_using] quit
[*Leaf1] commit
- Configure PFC for the priority of RoCEv2 traffic.
# In this example, the queue with priority 4 carries RoCEv2 traffic. Enable PFC for priority 4 on each interface, and enable PFC to act on the priority mapped from the DSCP value.
[~Leaf1] dcb pfc  //Enter the view of the default PFC profile.
[~Leaf1-dcb-pfc-default] priority 4
[*Leaf1-dcb-pfc-default] quit
[*Leaf1] port-group all_using
[*Leaf1-port-group-all_using] dcb pfc enable mode manual
[*Leaf1-port-group-all_using] quit
[*Leaf1] dcb pfc dscp-mapping enable slot 1
[*Leaf1] commit
After the preceding configurations are complete, RoCEv2 traffic is transmitted in the queue with priority 4, which is a lossless queue.
In this example, CE8850-64CQ-EI switches are used as leaf and spine switches; each leaf switch uses 32 ports and each spine switch uses 16 ports. Based on Optimizing Lossless Service Performance in Scenarios Without Packet Loss, you can change the dynamic threshold for triggering PFC frames to 4 on leaf switches and 5 on spine switches to improve the performance of RoCEv2 services.
# Configure Leaf1. The configuration on Leaf2 is the same as that on Leaf1.
<Leaf1> system-view
[~Leaf1] port-group all_using
[*Leaf1-port-group-all_using] dcb pfc buffer 4 xoff dynamic 4
[*Leaf1-port-group-all_using] quit
[*Leaf1] commit
# Configure Spine1. The configuration on Spine2 is the same as that on Spine1.
<Spine1> system-view
[~Spine1] port-group all_using
[*Spine1-port-group-all_using] dcb pfc buffer 4 xoff dynamic 5
[*Spine1-port-group-all_using] quit
[*Spine1] commit
- Configure PFC deadlock detection.
# Set the PFC deadlock detection time and recovery time to 100 ms for lossless queues, and configure the device to disable PFC when five PFC deadlocks occur within 20 s.
[~Leaf1] dcb pfc
[*Leaf1-dcb-pfc-default] dcb pfc deadlock-detect interval 10
[*Leaf1-dcb-pfc-default] priority 4 deadlock-detect time 10
[*Leaf1-dcb-pfc-default] priority 4 deadlock-recovery time 10
[*Leaf1-dcb-pfc-default] priority 4 turn-off threshold 5
[*Leaf1-dcb-pfc-default] quit
[*Leaf1] commit
After the configuration is complete, if you need to modify the PFC deadlock detection configuration, first run the shutdown command on the PFC-enabled interfaces. This prevents configuration failures caused by deadlock recovery on the switch.
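The values in this step can be cross-checked with a little arithmetic. The Python sketch below assumes, based on the description above, that the effective detection and recovery times equal the configured multiplier times the detection interval (10 x 10 ms = 100 ms), and it models the turn-off rule as a sliding 20 s window. The `PfcTurnOffMonitor` class is a hypothetical model for reasoning about the behaviour, not the switch's actual algorithm.

```python
from collections import deque

INTERVAL_MS = 10     # dcb pfc deadlock-detect interval 10 (assumed to be in ms)
DETECT_MULT = 10     # priority 4 deadlock-detect time 10
RECOVERY_MULT = 10   # priority 4 deadlock-recovery time 10

# Assumption: effective time = multiplier x detection interval.
detect_ms = INTERVAL_MS * DETECT_MULT
recovery_ms = INTERVAL_MS * RECOVERY_MULT


class PfcTurnOffMonitor:
    """Toy model: report turn-off once `threshold` deadlocks fall within `window_s`."""

    def __init__(self, threshold: int = 5, window_s: float = 20.0):
        self.threshold = threshold
        self.window_s = window_s
        self._events = deque()

    def record_deadlock(self, t: float) -> bool:
        """Record a deadlock at time t (seconds); return True if PFC would be disabled."""
        self._events.append(t)
        # Drop deadlocks that fall outside the observation window.
        while self._events and t - self._events[0] >= self.window_s:
            self._events.popleft()
        return len(self._events) >= self.threshold


monitor = PfcTurnOffMonitor()
results = [monitor.record_deadlock(t) for t in (0.0, 3.0, 6.0, 9.0, 12.0)]
print(detect_ms, recovery_ms, results)
```

In this model, five deadlocks spaced 3 s apart trip the threshold on the fifth event, whereas deadlocks spaced 6 s apart never accumulate five events inside one window.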
- Configure the low-latency network function.
- Configure the low-latency network function on Leaf1. This function takes effect after the switch restarts. After the configuration is successful, automatic buffer optimization and dynamic ECN threshold for lossless queues are enabled by default.
[~Leaf1] low-latency fabric
[*Leaf1-low-latency-fabric] quit
[*Leaf1] commit
[~Leaf1] quit
<Leaf1> save
Warning: The current configuration will be written to the device. Continue? [Y/N]: y
<Leaf1> reboot
Warning: The system will reboot. Continue? [Y/N]: y
- Enable the AI ECN function. Before enabling this function, you need to disable the dynamic ECN function of lossless queues.
[~Leaf1] low-latency fabric
[~Leaf1-low-latency-fabric] undo qos dynamic-ecn-threshold enable
[*Leaf1-low-latency-fabric] quit
[*Leaf1] commit
[~Leaf1] ai-service
[*Leaf1-ai-service] ai-ecn
[*Leaf1-ai-service-ai-ecn] ai-ecn enable
[*Leaf1-ai-service-ai-ecn] quit
[*Leaf1-ai-service] quit
[*Leaf1] commit
Verifying the Configuration
- Check the PFC threshold and headroom value.
[~Leaf1] display dcb pfc buffer interface 100ge1/0/1
Xon: PFC backpressure stop threshold
Xoff: PFC backpressure threshold
Hdrm: Headroom buffer threshold
Guaranteed: PFC guaranteed buffer threshold
The actual PFC backpressure stop threshold is the higher value between the value of xon and the difference between the value of xoff and the value of xon-offset.
C:cells B:bytes K:kilobytes M:megabytes D:dynamic alpha
------------------------------------------------------------------------------------
Interface      Queue   Guaranteed   Xon      Xon-Offset   Xoff   Hdrm
------------------------------------------------------------------------------------
100GE1/0/1     4       10(C)        100(C)   20(C)        4(D)   250(C)
------------------------------------------------------------------------------------
- Check the number of PFC deadlocks and the number of deadlock recoveries. If the values of DeadlockNum and RecoveryNum are both 0, no deadlock has been triggered.
[~Leaf1] display dcb pfc interface 100ge 1/0/1
-----------------------------------------------------------------------------------------
Interface      Queue   Received(Frames)      ReceivedRate(pps)      DeadlockNum
                       Transmitted(Frames)   TransmittedRate(pps)   RecoveryNum
-----------------------------------------------------------------------------------------
100GE1/0/1     4       0                     0                      0
                       0                     0                      0
-----------------------------------------------------------------------------------------
- Check the enabling status of the AI ECN function and the calculated ECN threshold.
[~Leaf1] display ai-ecn calculated state interface 100ge 1/0/1
*: Indicates the queue where AI ECN takes effect.
AI-ECN State: enabled
--------------------------------------------------------------------
Interface      Queue   Low-Threshold   High-Threshold   Probability
                       (Byte)          (Byte)           (%)
--------------------------------------------------------------------
100GE1/0/1     0       0               0                0
               1       0               0                0
               2       0               0                0
               3       0               0                0
              *4       4896            18874080         40
               5       0               0                0
               6       0               0                0
               7       0               0                0
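When many interfaces need checking, this output can be scraped rather than read by eye. The Python sketch below parses text laid out one queue per row, as in an abbreviated copy of the sample above, and returns the queues on which AI ECN takes effect (rows marked with `*`). The `parse_ai_ecn` function is hypothetical, and the column layout is an assumption drawn from this sample; actual output may vary by software version.

```python
import re

# Abbreviated copy of the sample output, one queue per row.
SAMPLE = """\
AI-ECN State: enabled
--------------------------------------------------------------------
Interface      Queue   Low-Threshold   High-Threshold   Probability
                       (Byte)          (Byte)           (%)
--------------------------------------------------------------------
100GE1/0/1     0       0               0                0
               3       0               0                0
              *4       4896            18874080         40
               5       0               0                0
"""

# Optional interface name, optional '*' marker, queue index, then three integers.
ROW = re.compile(
    r"^\s*(?P<ifname>\S+/\S+)?\s*(?P<star>\*)?(?P<queue>[0-7])"
    r"\s+(?P<low>\d+)\s+(?P<high>\d+)\s+(?P<prob>\d+)\s*$"
)


def parse_ai_ecn(text):
    """Return one dict per queue row; header and separator lines are skipped."""
    rows = []
    current_if = None
    for line in text.splitlines():
        m = ROW.match(line)
        if not m:
            continue
        # The interface name appears only on the first row of each interface.
        current_if = m.group("ifname") or current_if
        rows.append({
            "interface": current_if,
            "queue": int(m.group("queue")),
            "active": m.group("star") is not None,
            "low_bytes": int(m.group("low")),
            "high_bytes": int(m.group("high")),
            "probability_pct": int(m.group("prob")),
        })
    return rows


rows = parse_ai_ecn(SAMPLE)
active = [r for r in rows if r["active"]]
print(active)
```

A check such as "exactly the lossless queue (queue 4 in this example) is marked active on every interface" can then be written against the returned rows.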
Verifying the Result
Latency is an important performance indicator in HPC scenarios. You can use a third-party tool or the visualized O&M function of iMaster NCE-FabricInsight to verify the configuration of the AI Fabric network. For details about how to deploy iMaster NCE-FabricInsight for visualized O&M, refer to Best Deployment Practices of AI Fabric for Visualized O&M.