Typical AI Fabric Configuration Case in the Distributed Storage Scenario
Networking Requirements
Figure 2-5 shows the AI Fabric networking of a distributed cloud storage service, where both TCP and RoCEv2 traffic is transmitted and all servers support the RoCEv2 protocol and have the DCQCN function enabled. Compute and storage servers are deployed in the same PoD, and the ratio of compute nodes to storage nodes is 3:1. Leaf and spine switches are fully meshed through 100GE links. Servers are connected to leaf switches through 25GE links, and the oversubscription ratio is 1:1. In this example, the CE6865-48S8CQ-EI is used as the leaf switch, and the CloudEngine 16800 (configured with CE-MPUE series MPUs) is used as the spine switch.
Priority Planning
Based on the service traffic characteristics, priorities in this example are planned as follows:
- Set the priority of CNP traffic to 6, scheduling mode to PQ, and DSCP value to 25.
- Set the priority of RoCEv2 traffic to 4, scheduling mode to DRR, weight to 65%, and DSCP value to 24.
- Set the priority of TCP traffic to 1, scheduling mode to DRR, weight to 15%, and DSCP value to 7.
- Set the priority of O&M traffic to 0, scheduling mode to DRR, and weight to 5%.
- Reserve priorities 2, 3, and 5 for future use.
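The priority plan above can be captured as a small lookup table. The following Python sketch is purely illustrative (it models the plan, not device configuration syntax); the function name and fallback behavior are assumptions for the example:

```python
# Illustrative model of the priority plan above (not device syntax).
# Each traffic class maps to a queue (priority), a scheduling mode and,
# where applicable, a DRR weight and a DSCP value.
PRIORITY_PLAN = {
    "CNP":    {"queue": 6, "scheduling": "PQ",  "weight": None, "dscp": 25},
    "RoCEv2": {"queue": 4, "scheduling": "DRR", "weight": 65,   "dscp": 24},
    "TCP":    {"queue": 1, "scheduling": "DRR", "weight": 15,   "dscp": 7},
    "O&M":    {"queue": 0, "scheduling": "DRR", "weight": 5,    "dscp": None},
}

RESERVED_QUEUES = {2, 3, 5}  # reserved for future use

def queue_for_dscp(dscp):
    """Return the queue a packet lands in, given its DSCP value
    (hypothetical helper; unmatched traffic falls back to queue 0)."""
    for plan in PRIORITY_PLAN.values():
        if plan["dscp"] == dscp:
            return plan["queue"]
    return 0

print(queue_for_dscp(24))  # RoCEv2 -> queue 4
```

Note that the three DRR weights sum to 85, not 100; the remaining priorities are reserved.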
Configuration Roadmap
In this example, the IP addresses and routes for interconnection between spine switches, leaf switches, and servers have been configured, and there are reachable routes between servers.
- Configure leaf switches.
- Configure PFC. Before configuring PFC, you need to configure priority mapping.
- Configure PFC deadlock detection.
- Configure the low-latency network function. After this function is configured, automatic buffer optimization and the dynamic ECN threshold are enabled for lossless queues by default. You can further tune both functions as required.
- Enable the AI ECN function. Before enabling this function, you need to disable the dynamic ECN function.
- Configure the fast CNP function.
- Configure spine switches.
- Configure PFC. Before configuring PFC, you need to configure priority mapping.
- Configure PFC deadlock detection.
- Optimize the buffer space.
- Enable the AI ECN function.
- Configure server NICs. (The detailed procedures are not provided.)
- Configure NICs to work in RoCEv2 mode.
- Configure the RoCEv2 link setup mode.
- Configure NICs to trust the DSCP value, and configure the DSCP values of RoCEv2 and CNP packets.
- Enable PFC for the priority of RoCEv2 packets on NICs.
- Enable DCQCN for the priority of RoCEv2 packets on NICs.
Procedure
The following describes the configurations on Leaf1. The configurations on Leaf2 are similar.
- Configure PFC.
- Configure priority mapping and congestion scheduling.
# In this example, the DSCP value of RoCEv2 packets is 24, and the DSCP value of CNP packets is 25. Configure a priority mapping profile in the DiffServ domain as follows to map the priority of RoCEv2 packets to priority 4 (queue 4) and the priority of CNP packets to priority 6 (queue 6), and to map the DSCP value 7 to priority 1:
<HUAWEI> system-view
[~HUAWEI] sysname Leaf1
[*HUAWEI] commit
[~Leaf1] diffserv domain ds1
[*Leaf1-dsdomain-ds1] ip-dscp-inbound 24 phb af4 green
[*Leaf1-dsdomain-ds1] ip-dscp-inbound 25 phb cs6 green
[*Leaf1-dsdomain-ds1] ip-dscp-inbound 7 phb af1 green
[*Leaf1-dsdomain-ds1] quit
[*Leaf1] port-group all_using
[*Leaf1-port-group-all_using] group-member 100ge 1/0/1 to 100ge 1/0/4
[*Leaf1-port-group-all_using] group-member 25ge 1/0/1 to 25ge 1/0/8
[*Leaf1-port-group-all_using] quit
[*Leaf1] commit
[~Leaf1] port-group all_using
[*Leaf1-port-group-all_using] trust dscp
[*Leaf1-port-group-all_using] trust upstream ds1
[*Leaf1-port-group-all_using] quit
[*Leaf1] commit
# Configure the congestion scheduling mode for each queue. By default, queues on an interface use the PQ scheduling mode. Therefore, queue 6 can use the default scheduling mode to ensure preferential scheduling of CNP packets.
[~Leaf1] port-group all_using
[*Leaf1-port-group-all_using] qos drr 0 1 4
[*Leaf1-port-group-all_using] qos queue 0 drr weight 5
[*Leaf1-port-group-all_using] qos queue 1 drr weight 15
[*Leaf1-port-group-all_using] qos queue 4 drr weight 65
[*Leaf1-port-group-all_using] quit
[*Leaf1] commit
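The combined PQ + DRR behavior configured above can be sketched as follows. This is an assumed scheduling model for illustration, not the switch's actual scheduler implementation: the PQ queue (queue 6) is always served first, and when all DRR queues are backlogged they share the leftover bandwidth in proportion to their weights.

```python
# Assumed PQ + DRR model: queue 6 (PQ) is drained first; the DRR queues
# then split the remaining bandwidth in proportion to their weights.
DRR_WEIGHTS = {0: 5, 1: 15, 4: 65}  # from the "qos queue ... drr weight" commands

def drr_share(queue, link_rate_gbps, pq_load_gbps=0.0):
    """Approximate steady-state bandwidth of one DRR queue when all
    DRR queues are permanently backlogged."""
    leftover = link_rate_gbps - pq_load_gbps
    total = sum(DRR_WEIGHTS.values())
    return leftover * DRR_WEIGHTS[queue] / total

# On a 100GE link with negligible CNP load, queue 4 gets about 76.5 Gbit/s
# (65/85 of the link), not 65 Gbit/s: DRR weights are relative, and the
# three configured weights sum to 85 rather than 100.
print(round(drr_share(4, 100.0), 1))
```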
- Configure PFC for the priority of RoCEv2 traffic.
# Configure the queue with priority 4 to carry RoCEv2 traffic on the network. To implement this, enable PFC for priority 4 on each interface and implement PFC based on the priority mapped from the DSCP value.
[~Leaf1] dcb pfc    //Enter the view of the default PFC profile.
[~Leaf1-dcb-pfc-default] priority 4
[*Leaf1-dcb-pfc-default] quit
[*Leaf1] port-group all_using
[*Leaf1-port-group-all_using] dcb pfc enable mode manual
[*Leaf1-port-group-all_using] quit
[*Leaf1] dcb pfc dscp-mapping enable slot 1
[*Leaf1] commit
After the preceding configurations are complete, RoCEv2 traffic is transmitted in the queue with priority 4, which is a lossless queue.
In Optimizing Lossless Service Performance in Scenarios Without Packet Loss, CE6865EI switches are used as leaf switches. On Leaf1, 12 ports are used; on Leaf2, 28 ports are used. You can change the dynamic threshold for triggering PFC frames to 5 to improve the performance of RoCEv2 services.
<Leaf1> system-view
[~Leaf1] port-group all_using
[*Leaf1-port-group-all_using] dcb pfc buffer 4 xoff dynamic 5
[*Leaf1-port-group-all_using] quit
[*Leaf1] commit
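A dynamic XOFF threshold tracks buffer occupancy rather than being a fixed cell count. The sketch below assumes the common dynamic-alpha model (threshold = alpha x free shared buffer); the mapping from the configured dynamic value to an alpha factor is chip-specific, and the table used here is hypothetical:

```python
# Assumed model: XOFF threshold = alpha x free shared buffer cells.
# The dynamic-value-to-alpha mapping below is HYPOTHETICAL; the real
# mapping is chip-specific.
HYPOTHETICAL_ALPHA = {3: 0.25, 4: 0.5, 5: 1.0, 6: 2.0}

def xoff_threshold_cells(dynamic_value, free_shared_cells):
    """Current XOFF trigger point given the remaining shared buffer."""
    alpha = HYPOTHETICAL_ALPHA[dynamic_value]
    return int(alpha * free_shared_cells)

# The more free shared buffer remains, the later PFC pauses the peer:
for free in (20000, 10000, 2000):
    print(free, xoff_threshold_cells(5, free))
```

A larger dynamic value raises the trigger point, so PFC frames are sent later and lossless queues can absorb bigger bursts, at the cost of leaving less buffer for other traffic.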
- Configure PFC deadlock detection.
# Set the PFC deadlock detection time and recovery time to 100 ms for lossless queues, and configure the device to disable PFC when five PFC deadlocks occur within 20 seconds.
[~Leaf1] dcb pfc
[*Leaf1-dcb-pfc-default] dcb pfc deadlock-detect interval 10
[*Leaf1-dcb-pfc-default] priority 4 deadlock-detect time 10
[*Leaf1-dcb-pfc-default] priority 4 deadlock-recovery time 10
[*Leaf1-dcb-pfc-default] priority 4 turn-off threshold 5
[*Leaf1-dcb-pfc-default] quit
[*Leaf1] commit
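The arithmetic behind the 100 ms figure can be shown in a few lines. The sketch assumes that the deadlock-detect and deadlock-recovery times are expressed in units of the configured interval precision (here 10 ms), which is what reconciles the commands with the 100 ms value stated above:

```python
# Assumption: detection/recovery times are multiples of the configured
# interval precision (10 ms here), so "time 10" means 10 x 10 ms.
INTERVAL_MS = 10        # dcb pfc deadlock-detect interval 10
DETECT_UNITS = 10       # priority 4 deadlock-detect time 10
RECOVERY_UNITS = 10     # priority 4 deadlock-recovery time 10

detect_ms = INTERVAL_MS * DETECT_UNITS      # detection window
recovery_ms = INTERVAL_MS * RECOVERY_UNITS  # recovery window
print(detect_ms, recovery_ms)  # 100 100
```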
After the configuration is complete, if you need to modify the PFC deadlock detection configuration, run the shutdown command to disable the PFC-enabled interface to prevent configuration failures caused by deadlock recovery on the switch.
- Configure the low-latency network function.
- Configure the low-latency network function on Leaf1. This function takes effect after the switch restarts. After the configuration is successful, automatic buffer optimization and dynamic ECN threshold for lossless queues are enabled by default.
[~Leaf1] low-latency fabric
[*Leaf1-low-latency-fabric] quit
[*Leaf1] commit
[~Leaf1] quit
<Leaf1> save
Warning: The current configuration will be written to the device. Continue? [Y/N]: y
<Leaf1> reboot
Warning: The system will reboot. Continue? [Y/N]: y
- Optimize the buffer space of lossless queues.
# In this example, both TCP and RoCEv2 traffic is transmitted. Therefore, you can manually reduce the threshold for the queue of TCP traffic. This makes more of the chip's shared buffer available to lossless RoCEv2 traffic.
[~Leaf1] port-group all_using
[*Leaf1-port-group-all_using] qos buffer queue 1 shared-threshold dynamic 1
[*Leaf1-port-group-all_using] quit
[*Leaf1] commit
- Enable the AI ECN function.
[~Leaf1] low-latency fabric
[~Leaf1-low-latency-fabric] undo qos dynamic-ecn-threshold enable
[*Leaf1-low-latency-fabric] quit
[*Leaf1] commit
[~Leaf1] ai-service
[*Leaf1-ai-service] ai-ecn
[*Leaf1-ai-service-ai-ecn] ai-ecn enable
[*Leaf1-ai-service-ai-ecn] quit
[*Leaf1-ai-service] quit
[*Leaf1] commit
- Enable fast CNP.
[~Leaf1] low-latency fabric
[*Leaf1-low-latency-fabric] qos fast-cnp enable
[*Leaf1-low-latency-fabric] quit
[*Leaf1] commit
The following describes the configurations on Spine1. The configurations on Spine2 are similar.
- Configure PFC.
- Configure priority mapping and congestion scheduling.
# In this example, the DSCP value of RoCEv2 packets is 24, and the DSCP value of CNP packets is 25. Configure a priority mapping profile in the DiffServ domain as follows to map the priority of RoCEv2 packets to priority 4 (queue 4) and the priority of CNP packets to priority 6 (queue 6), and to map the DSCP value 7 to priority 1:
<HUAWEI> system-view
[~HUAWEI] sysname Spine1
[*HUAWEI] commit
[~Spine1] diffserv domain ds1
[*Spine1-dsdomain-ds1] ip-dscp-inbound 24 phb af4 green
[*Spine1-dsdomain-ds1] ip-dscp-inbound 25 phb cs6 green
[*Spine1-dsdomain-ds1] ip-dscp-inbound 7 phb af1 green
[*Spine1-dsdomain-ds1] quit
[*Spine1] port-group all_using
[*Spine1-port-group-all_using] group-member 100ge 1/0/1 to 100ge 1/0/2
[*Spine1-port-group-all_using] quit
[*Spine1] commit
[~Spine1] port-group all_using
[*Spine1-port-group-all_using] trust upstream ds1
[*Spine1-port-group-all_using] quit
[*Spine1] commit
# Configure the congestion scheduling mode for each queue. By default, queues on an interface use the PQ scheduling mode. Therefore, queue 6 can use the default scheduling mode to ensure preferential scheduling of CNP packets.
[~Spine1] port-group all_using
[*Spine1-port-group-all_using] qos drr 0 1 4
[*Spine1-port-group-all_using] qos queue 0 drr weight 5
[*Spine1-port-group-all_using] qos queue 1 drr weight 15
[*Spine1-port-group-all_using] qos queue 4 drr weight 65
[*Spine1-port-group-all_using] quit
[*Spine1] commit
- Configure PFC for the priority of RoCEv2 traffic.
# Configure the queue with priority 4 to carry RoCEv2 traffic on the network. To implement this, enable PFC for priority 4 on each interface and implement PFC based on the priority mapped from the DSCP value.
[~Spine1] dcb pfc    //Enter the view of the default PFC profile.
[~Spine1-dcb-pfc-default] priority 4
[*Spine1-dcb-pfc-default] quit
[*Spine1] port-group all_using
[*Spine1-port-group-all_using] dcb pfc enable mode manual
[*Spine1-port-group-all_using] quit
[*Spine1] dcb pfc dscp-mapping enable slot 1
[*Spine1] commit
After the preceding configurations are complete, RoCEv2 traffic is transmitted in the queue with priority 4, which is a lossless queue.
# In Adjusting the Buffer and PFC Thresholds, the CloudEngine 16800 is used as the spine switch. Set the XOFF parameter to 3000 cells and the headroom buffer to 2000 cells to improve the performance of RoCEv2 services.
<Spine1> system-view
[~Spine1] port-group all_using
[*Spine1-port-group-all_using] dcb pfc buffer 4 xoff static 3000 hdrm 2000
[*Spine1-port-group-all_using] quit
[*Spine1] commit
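The headroom buffer must absorb the traffic still in flight after the XOFF frame is sent. A rough sizing sanity check can be sketched as follows; this is a simplified model under stated assumptions (a hypothetical 256-byte cell and an assumed round-trip time; real cell sizes, response latencies, and per-packet overheads are chip- and deployment-specific):

```python
# Rough headroom sizing check (assumptions: hypothetical 256-byte cell,
# assumed 10 us round-trip to the pausing peer, overheads ignored).
CELL_BYTES = 256   # hypothetical cell size; chip-specific in practice
LINK_GBPS = 100    # spine-leaf links in this example are 100GE
RTT_US = 10        # assumed pause round-trip time in microseconds

def inflight_cells(link_gbps, rtt_us, cell_bytes):
    """Cells that can still arrive after XOFF is sent:
    link rate x round-trip time worth of data, rounded up."""
    inflight_bytes = link_gbps * 1e9 / 8 * rtt_us * 1e-6
    return int(inflight_bytes / cell_bytes) + 1

needed = inflight_cells(LINK_GBPS, RTT_US, CELL_BYTES)
print(needed, "cells needed; headroom configured: 2000")
```

Under these assumptions the configured 2000-cell headroom leaves ample margin over the estimated in-flight traffic.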
- Configure PFC deadlock detection.
# Set the hardware-based PFC deadlock detection interval and recovery time of lossless queues to 1500 ms.
[~Spine1] dcb pfc
[*Spine1-dcb-pfc-default] dcb pfc deadlock-detect interval 100
[*Spine1-dcb-pfc-default] priority 4 deadlock-detect time 15
[*Spine1-dcb-pfc-default] priority 4 deadlock-recovery time 15
[*Spine1-dcb-pfc-default] quit
[*Spine1] commit
After the configuration is complete, if you need to modify the PFC deadlock detection configuration, run the shutdown command to disable the PFC-enabled interface to prevent configuration failures caused by deadlock recovery on the switch.
- Optimize the buffer space.
# In Adjusting the Buffer and PFC Thresholds, both TCP and RoCEv2 traffic is transmitted. Therefore, you can manually reduce the threshold for the queue of TCP traffic and increase the threshold for the queue of RoCEv2 traffic. This makes more of the chip's shared buffer available to lossless RoCEv2 traffic.
[~Spine1] port-group all_using
[*Spine1-port-group-all_using] qos buffer queue 1 shared-threshold dynamic 1
[*Spine1-port-group-all_using] qos buffer queue 4 shared-threshold dynamic 15
[*Spine1-port-group-all_using] quit
[*Spine1] commit
- Enable the AI ECN function.
[~Spine1] ai-service
[*Spine1-ai-service] ai-ecn
[*Spine1-ai-service-ai-ecn] ai-ecn enable
[*Spine1-ai-service-ai-ecn] quit
[*Spine1-ai-service] quit
[*Spine1] commit
Verifying the Configuration
- Check the PFC threshold and headroom value.
[~Leaf1] display dcb pfc buffer interface 100ge 1/0/1
Xon: PFC backpressure stop threshold
Xoff: PFC backpressure threshold
Hdrm: Headroom buffer threshold
Guaranteed: PFC guaranteed buffer threshold
The actual PFC backpressure stop threshold is the higher value between the value of xon and the difference between the value of xoff and the value of xon-offset.
C: cells  B: bytes  K: kilobytes  M: megabytes  D: dynamic alpha
------------------------------------------------------------------------------------
Interface            Queue  Guaranteed  Xon     Xon-Offset  Xoff   Hdrm
------------------------------------------------------------------------------------
100GE1/0/1           4      8(C)        200(C)  20(C)       5(D)   630(C)
------------------------------------------------------------------------------------
- Check the numbers of PFC deadlocks and recovery times. If the values of DeadlockNum and RecoveryNum are 0, no deadlock is triggered.
[~Leaf1] display dcb pfc interface 100ge 1/0/1
-----------------------------------------------------------------------------------------
Interface    Queue  Received(Frames)     ReceivedRate(pps)     DeadlockNum
                    Transmitted(Frames)  TransmittedRate(pps)  RecoveryNum
-----------------------------------------------------------------------------------------
100GE1/0/1   4      0                    0                     0
                    0                    0                     0
-----------------------------------------------------------------------------------------
- Check the enabling status of the AI ECN function and the calculated ECN threshold.
[~Leaf1] display ai-ecn calculated state interface 100ge 1/0/1
*: Indicates the queue where AI ECN takes effect.
AI-ECN State: enabled
--------------------------------------------------------------------
Interface            Queue   Low-Threshold   High-Threshold   Probability
                             (Byte)          (Byte)           (%)
--------------------------------------------------------------------
100GE1/0/1           0       0               0                0
                     1       0               0                0
                     2       0               0                0
                     3       0               0                0
                     *4      4896            18874080         40
                     5       0               0                0
                     6       0               0                0
                     7       0               0                0
--------------------------------------------------------------------
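The calculated thresholds above are typically applied using WRED-style linear marking; the sketch below assumes that model (no marking below the low threshold, full marking above the high threshold, and a linear ramp up to the configured maximum probability in between), which is the common ECN behavior but is stated here as an assumption:

```python
# Assumed WRED-style ECN marking model for the queue-4 thresholds
# reported by the display command above.
LOW, HIGH, MAX_PCT = 4896, 18874080, 40  # queue 4 values from the output

def mark_probability(queue_bytes, low=LOW, high=HIGH, max_pct=MAX_PCT):
    """ECN marking probability (%) as a function of queue depth."""
    if queue_bytes <= low:
        return 0.0
    if queue_bytes >= high:
        return 100.0
    return (queue_bytes - low) / (high - low) * max_pct

print(mark_probability(0))                           # empty queue: no marking
print(round(mark_probability((LOW + HIGH) / 2), 1))  # midway: 20.0 (%)
```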
Verifying the Result
In the distributed storage scenario, the input/output (I/O) performance of the storage media is tested by performing read and write operations on the storage to test the bandwidth and latency. You can use a third-party I/O test tool or the visualized O&M function of iMaster NCE-FabricInsight to verify the configuration of the AI Fabric network. For details about how to deploy iMaster NCE-FabricInsight for visualized O&M, refer to Best Deployment Practices of AI Fabric for Visualized O&M.