CloudEngine 16800, 12800, 9800, 8800, 7800, 6800, and 5800 Series Switches Typical Configuration Examples (V100 and V200)

Configuring an Intelligent Lossless Network

Recommended Models

Table 1-20 lists the recommended models for different application scenarios of an intelligent lossless network.

Table 1-20 Recommended models

Application Scenario   Leaf                           Spine
Storage                CE6865-48S8CQ-EI               CE8850-64CQ-EI/CloudEngine 16800 (configured with CE-MPUE series MPUs)
AI GPU                 CE8861-4C-EI, CE8850-64CQ-EI   CE8850-64CQ-EI/CloudEngine 16800 (configured with CE-MPUE series MPUs)
HPC                    CE8850-64CQ-EI                 CE8850-64CQ-EI/CloudEngine 16800 (configured with CE-MPUE series MPUs)

Remarks: For details about device selection and recommended networking, see the CloudEngine 16800, 8800, and 6800 Series Switches AI Fabric Deployment Best Practices.

Networking Requirements

Figure 1-24 shows the networking of a RoCEv2 high-performance application, where both TCP and RoCEv2 traffic is transmitted and all servers support the RoCEv2 protocol and have the DCQCN function enabled. Compute and storage servers are deployed in the same PoD, and the ratio of compute nodes to storage nodes is 3:1. Leaf and spine switches are fully meshed through 100GE links. Servers are connected to leaf switches through 25GE links, and the oversubscription ratio is 1:1. In this example, a CE6865-48S8CQ-EI is used as a leaf switch, and a CloudEngine 16800 (configured with CE-MPUE series MPUs) is used as a spine switch.

Figure 1-24 Networking diagram for configuring an intelligent lossless network

Priority Planning

Based on the service traffic characteristics, priorities in this example are planned as follows:

  • Set the priority of CNP traffic to 6, scheduling mode to PQ, and DSCP value to 25.
  • Set the priority of RoCEv2 traffic to 4, scheduling mode to DRR, weight to 65%, and DSCP value to 24.
  • Set the priority of TCP traffic to 1, scheduling mode to DRR, weight to 15%, and DSCP value to 7.
  • Set the priority of O&M traffic to 0, scheduling mode to DRR, and weight to 5%.
  • Reserve priorities 2, 3, and 5 for future use.
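
The planned DRR weights (65, 15, and 5) add up to 85, not 100. The following Python sketch records the plan as data and shows how the weights would translate into bandwidth shares. It assumes that the DRR queues split whatever bandwidth the PQ queue leaves over in proportion to their weights when all of them are backlogged. The names PLAN and drr_shares are illustrative and do not correspond to any switch command.

```python
# Priority plan from this example: queue -> (traffic, scheduling mode, weight).
PLAN = {
    6: ("CNP", "PQ", None),
    4: ("RoCEv2", "DRR", 65),
    1: ("TCP", "DRR", 15),
    0: ("O&M", "DRR", 5),
}


def drr_shares(plan):
    """Return each DRR queue's share of the bandwidth left over by PQ queues.

    Assumption: backlogged DRR queues are served in proportion to their
    configured weights, so the weights are normalized by their sum.
    """
    weights = {q: w for q, (_, mode, w) in plan.items() if mode == "DRR"}
    total = sum(weights.values())
    return {q: w / total for q, w in weights.items()}


if __name__ == "__main__":
    for queue, share in sorted(drr_shares(PLAN).items(), reverse=True):
        print(f"queue {queue} ({PLAN[queue][0]}): {share:.1%}")
```

Under this assumption, RoCEv2 traffic in queue 4 would receive 65/85 of the bandwidth that remains after CNP packets have been scheduled.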

Configuration Roadmap

  • The parameter settings in this example are for reference only. You need to configure each device based on the traffic model in the actual networking. For more information, refer to the CloudEngine 16800, 8800, and 6800 Series Switches AI Fabric Deployment Best Practices.
  • In this example, the IP addresses and routes for interconnection between spine switches, leaf switches, and servers have been configured, and there are reachable routes between servers.

The configuration roadmap is as follows:

  • Configure leaf switches.
    1. Configure PFC. Before configuring PFC, you need to configure priority mapping.
    2. Configure PFC deadlock detection.
    3. Configure the low-latency network function. After this function is configured, automatic buffer optimization and dynamic ECN threshold are enabled for lossless queues by default. You can then tune both functions.
    4. Enable the AI ECN function. Before enabling this function, you need to disable the dynamic ECN function. The AI ECN function is supported only in V200R019C10 and later versions. If a device does not support this function, you can configure the dynamic or static ECN function.
    5. Configure the fast CNP function.
  • Configure spine switches. This example uses the CloudEngine 16800 (configured with CE-MPUE series MPUs) as the spine switch. If CE8850-64CQ-EI switches are used as spine switches instead, their configuration is similar to that of leaf switches.
    1. Configure PFC. Before configuring PFC, you need to configure priority mapping.
    2. Configure PFC deadlock detection.
    3. Optimize the buffer space.
    4. Enable the AI ECN function. This function is supported only in V200R019C10 and later versions.
  • Configure server NICs. (The detailed procedures are not provided.)
    1. Configure NICs to work in RoCEv2 mode.
    2. Configure the RoCEv2 link setup mode.
    3. Configure NICs to trust DSCP values, and configure the DSCP values of RoCEv2 and CNP packets.
    4. Enable PFC for the priority of RoCEv2 packets on NICs.
    5. Enable DCQCN for the priority of RoCEv2 packets on NICs.

Procedure

The following describes the configurations on Leaf1. The configurations on Leaf2 are similar.

  1. Configure PFC.

    1. Configure priority mapping and congestion scheduling.
      # In this example, the DSCP value of RoCEv2 packets is 24, and the DSCP value of CNP packets is 25. Configure a priority mapping profile in the DiffServ domain as follows to map the priority of RoCEv2 packets to priority 4 (queue 4) and the priority of CNP packets to priority 6 (queue 6), and map the DSCP value 7 to priority 1:
      <HUAWEI> system-view
      [~HUAWEI] sysname Leaf1
      [*HUAWEI] commit 
      [~Leaf1] diffserv domain ds1 
      [*Leaf1-dsdomain-ds1] ip-dscp-inbound 24 phb af4 green  
      [*Leaf1-dsdomain-ds1] ip-dscp-inbound 25 phb cs6 green  
      [*Leaf1-dsdomain-ds1] ip-dscp-inbound 7 phb af1 green  
      [*Leaf1-dsdomain-ds1] quit 
      [*Leaf1] port-group all_using   
      [*Leaf1-port-group-all_using] group-member 100ge 1/0/1 to 100ge 1/0/4
      [*Leaf1-port-group-all_using] group-member 25ge 1/0/1 to 25ge 1/0/8
      [*Leaf1-port-group-all_using] quit
      [*Leaf1] commit
      [~Leaf1] port-group all_using
      [*Leaf1-port-group-all_using] trust dscp  
      [*Leaf1-port-group-all_using] trust upstream ds1  
      [*Leaf1-port-group-all_using] quit
      [*Leaf1] commit

      # Configure the congestion scheduling mode for each queue. By default, queues on an interface use the PQ scheduling mode. Therefore, queue 6 can use the default scheduling mode to ensure preferential scheduling of CNP packets.

      [~Leaf1] port-group all_using
      [*Leaf1-port-group-all_using] qos drr 0 1 4  
      [*Leaf1-port-group-all_using] qos queue 0 drr weight 5   
      [*Leaf1-port-group-all_using] qos queue 1 drr weight 15
      [*Leaf1-port-group-all_using] qos queue 4 drr weight 65    
      [*Leaf1-port-group-all_using] quit
      [*Leaf1] commit
    2. Configure PFC for the priority of RoCEv2 traffic.

      # Configure the queue with priority 4 to carry RoCEv2 traffic on the network. To implement this, enable PFC for priority 4 on each interface and enable PFC based on the priority mapped from the DSCP value.

      [~Leaf1] dcb pfc  
      [~Leaf1-dcb-pfc-default] priority 4  
      [*Leaf1-dcb-pfc-default] quit
      [*Leaf1] port-group all_using
      [*Leaf1-port-group-all_using] dcb pfc enable mode manual 
      [*Leaf1-port-group-all_using] quit
      [*Leaf1] dcb pfc dscp-mapping enable slot 1 
      [*Leaf1] commit

      After the preceding configurations are complete, RoCEv2 traffic is transmitted in the queue with priority 4, which is a lossless queue.

      # CE6865EI, CE6865E, and CE8850E-32CQ-EI switches are used as leaf switches. On Leaf1, 12 ports are used; on Leaf2, 28 ports are used. You can change the dynamic threshold for triggering PFC frames to 5 to improve the performance of RoCEv2 services.

      <Leaf1> system-view
      [~Leaf1] port-group all_using
      [*Leaf1-port-group-all_using] dcb pfc buffer 4 xoff dynamic 5
      [*Leaf1-port-group-all_using] quit
      [*Leaf1] commit

  2. Configure PFC deadlock detection.

    # Set the PFC deadlock detection interval and recovery time to 100 ms for lossless queues, and configure the switch to disable PFC when five PFC deadlocks occur within 20 seconds.

    [~Leaf1] dcb pfc
    [*Leaf1-dcb-pfc-default] dcb pfc deadlock-detect interval 10  
    [*Leaf1-dcb-pfc-default] priority 4 deadlock-detect time 10 
    [*Leaf1-dcb-pfc-default] priority 4 deadlock-recovery time 10  
    [*Leaf1-dcb-pfc-default] priority 4 turn-off threshold 5 
    [*Leaf1-dcb-pfc-default] quit
    [*Leaf1] commit

    After the configuration is complete, if you need to modify the PFC deadlock detection configuration, first run the shutdown command on the PFC-enabled interfaces. This prevents configuration failures caused by deadlock recovery on the switch.

  3. Configure the low-latency network function.

    1. Configure the low-latency network function on Leaf1. This function takes effect after the switch restarts. After the configuration is successful, automatic buffer optimization and dynamic ECN threshold for lossless queues are enabled by default.
      [~Leaf1] low-latency fabric 
      [*Leaf1-low-latency-fabric] quit
      [*Leaf1] commit
      [~Leaf1] quit
      <Leaf1> save
      Warning: The current configuration will be written to the device. Continue? [Y/N]: y 
      <Leaf1> reboot 
      Warning: The system will reboot. Continue? [Y/N]: y
    2. Optimize the buffer space of lossless queues.

      # In this example, both TCP and RoCEv2 traffic is transmitted on the network. Therefore, you can manually reduce the threshold for the queue of TCP traffic. This leaves more of the chip's shared buffer space available to lossless RoCEv2 traffic.

      [~Leaf1] port-group all_using
      [*Leaf1-port-group-all_using] qos buffer queue 1 shared-threshold dynamic 1 
      [*Leaf1-port-group-all_using] quit
      [*Leaf1] commit

  4. Enable the AI ECN function.

    [~Leaf1] low-latency fabric
    [~Leaf1-low-latency-fabric] undo qos dynamic-ecn-threshold enable  
    [*Leaf1-low-latency-fabric] quit
    [*Leaf1] commit
    [~Leaf1] ai-service
    [*Leaf1-ai-service] ai-ecn
    [*Leaf1-ai-service-ai-ecn] ai-ecn enable 
    [*Leaf1-ai-service-ai-ecn] quit 
    [*Leaf1-ai-service] quit 
    [*Leaf1] commit 

  5. Enable the fast CNP function.

    [~Leaf1] low-latency fabric
    [*Leaf1-low-latency-fabric] qos fast-cnp enable
    [*Leaf1-low-latency-fabric] quit
    [*Leaf1] commit 
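
The PFC turn-off rule configured on Leaf1 (disable PFC when five deadlocks occur within 20 seconds) can be pictured as a sliding-window counter. The following Python sketch is illustrative only. How the switch actually counts deadlocks is not described in this document, and the class TurnOffMonitor and its methods are hypothetical names.

```python
from collections import deque


class TurnOffMonitor:
    """Counts deadlock events in a sliding time window (illustrative only)."""

    def __init__(self, threshold=5, window_s=20.0):
        self.threshold = threshold    # deadlocks that trigger PFC turn-off
        self.window_s = window_s      # window length in seconds
        self.events = deque()         # timestamps of recent deadlocks
        self.pfc_enabled = True

    def record_deadlock(self, now_s):
        """Record a deadlock at now_s (seconds); return whether PFC stays enabled."""
        self.events.append(now_s)
        # Drop deadlocks that have fallen out of the window.
        while self.events and now_s - self.events[0] >= self.window_s:
            self.events.popleft()
        if len(self.events) >= self.threshold:
            self.pfc_enabled = False
        return self.pfc_enabled


if __name__ == "__main__":
    burst = TurnOffMonitor()
    for t in (0, 1, 2, 3, 4):          # five deadlocks within 5 seconds
        burst.record_deadlock(t)
    print("PFC enabled after burst:", burst.pfc_enabled)

    sparse = TurnOffMonitor()
    for t in (0, 6, 12, 18, 24):       # never five within any 20 seconds
        sparse.record_deadlock(t)
    print("PFC enabled after sparse deadlocks:", sparse.pfc_enabled)
```

In this model, a burst of deadlocks turns PFC off for the priority, whereas the same number of deadlocks spread over a longer period does not.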

The following describes the configurations on Spine1. The configurations on Spine2 are similar.

  1. Configure PFC.

    1. Configure priority mapping and congestion scheduling.
      # In this example, the DSCP value of RoCEv2 packets is 24, and the DSCP value of CNP packets is 25. Configure a priority mapping profile in the DiffServ domain as follows to map the priority of RoCEv2 packets to priority 4 (queue 4) and the priority of CNP packets to priority 6 (queue 6), and map the DSCP value 7 to priority 1:
      <HUAWEI> system-view
      [~HUAWEI] sysname Spine1
      [*HUAWEI] commit 
      [~Spine1] diffserv domain ds1 
      [*Spine1-dsdomain-ds1] ip-dscp-inbound 24 phb af4 green   
      [*Spine1-dsdomain-ds1] ip-dscp-inbound 25 phb cs6 green   
      [*Spine1-dsdomain-ds1] ip-dscp-inbound 7 phb af1 green     
      [*Spine1-dsdomain-ds1] quit 
      [*Spine1] port-group all_using  
      [*Spine1-port-group-all_using] group-member 100ge 1/0/1 to 100ge 1/0/2
      [*Spine1-port-group-all_using] quit
      [*Spine1] commit
      [~Spine1] port-group all_using 
      [*Spine1-port-group-all_using] trust upstream ds1  
      [*Spine1-port-group-all_using] quit
      [*Spine1] commit

      # Configure the congestion scheduling mode for each queue. By default, queues on an interface use the PQ scheduling mode. Therefore, queue 6 can use the default scheduling mode to ensure preferential scheduling of CNP packets.

      [~Spine1] port-group all_using
      [*Spine1-port-group-all_using] qos drr 0 1 4  
      [*Spine1-port-group-all_using] qos queue 0 drr weight 5   
      [*Spine1-port-group-all_using] qos queue 1 drr weight 15
      [*Spine1-port-group-all_using] qos queue 4 drr weight 65    
      [*Spine1-port-group-all_using] quit
      [*Spine1] commit
    2. Configure PFC for the priority of RoCEv2 traffic.

      # Configure the queue with priority 4 to carry RoCEv2 traffic on the network. To implement this, enable PFC for priority 4 on each interface and enable PFC based on the priority mapped from the DSCP value.

      [~Spine1] dcb pfc   
      [~Spine1-dcb-pfc-default] priority 4  
      [*Spine1-dcb-pfc-default] quit
      [*Spine1] port-group all_using
      [*Spine1-port-group-all_using] dcb pfc enable mode manual 
      [*Spine1-port-group-all_using] quit
      [*Spine1] dcb pfc dscp-mapping enable slot 1 
      [*Spine1] commit

      After the preceding configurations are complete, RoCEv2 traffic is transmitted in the queue with priority 4, which is a lossless queue.

      # CloudEngine 16800 switches are used as the spine switches. Set the XOFF parameter to 3000 cells and the headroom buffer to 2000 cells to improve the performance of RoCEv2 services.

      <Spine1> system-view
      [~Spine1] port-group all_using
      [*Spine1-port-group-all_using] dcb pfc buffer 4 xoff static 3000 hdrm 2000
      [*Spine1-port-group-all_using] quit
      [*Spine1] commit

  2. Configure PFC deadlock detection.

    # Set the hardware-based PFC deadlock detection interval and recovery time of lossless queues to 1500 ms.

    [~Spine1] dcb pfc
    [*Spine1-dcb-pfc-default] dcb pfc deadlock-detect interval 100  
    [*Spine1-dcb-pfc-default] priority 4 deadlock-detect time 15 
    [*Spine1-dcb-pfc-default] priority 4 deadlock-recovery time 15  
    [*Spine1-dcb-pfc-default] quit
    [*Spine1] commit

    After the configuration is complete, if you need to modify the PFC deadlock detection configuration, first run the shutdown command on the PFC-enabled interfaces. This prevents configuration failures caused by deadlock recovery on the switch.

  3. Optimize the buffer space.

    # In this example, both TCP and RoCEv2 traffic is transmitted on the network. Therefore, you can manually reduce the threshold for the queue of TCP traffic and increase the threshold for the queue of RoCEv2 traffic. This leaves more of the chip's shared buffer space available to lossless RoCEv2 traffic.

    [~Spine1] port-group all_using
    [*Spine1-port-group-all_using] qos buffer queue 1 shared-threshold dynamic 1 
    [*Spine1-port-group-all_using] qos buffer queue 4 shared-threshold dynamic 15 
    [*Spine1-port-group-all_using] quit
    [*Spine1] commit

  4. Enable the AI ECN function.

    [~Spine1] ai-service
    [*Spine1-ai-service] ai-ecn
    [*Spine1-ai-service-ai-ecn] ai-ecn enable 
    [*Spine1-ai-service-ai-ecn] quit 
    [*Spine1-ai-service] quit 
    [*Spine1] commit 
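
The deadlock timers on the leaf and spine switches are each set with two values: a detection interval and a per-priority time value. The following Python sketch shows the arithmetic that links the configured values to the stated results (100 ms on Leaf1 and 1500 ms on Spine1). It assumes that the interval is expressed in milliseconds and that the time value acts as a multiplier, which is inferred from the figures in this example and not taken from a command reference. The function name deadlock_timer_ms is hypothetical.

```python
def deadlock_timer_ms(interval_ms, multiplier):
    """Effective timer in ms = detection interval (ms) x configured time value.

    Assumption: the "deadlock-detect interval" value is in milliseconds and
    the per-priority "time" value is a multiplier of that interval.
    """
    if interval_ms <= 0 or multiplier <= 0:
        raise ValueError("interval and multiplier must be positive")
    return interval_ms * multiplier


# Values taken from the Leaf1 and Spine1 configurations in this example.
LEAF_TIMER_MS = deadlock_timer_ms(10, 10)
SPINE_TIMER_MS = deadlock_timer_ms(100, 15)

if __name__ == "__main__":
    print("Leaf1 detection/recovery time (ms):", LEAF_TIMER_MS)
    print("Spine1 detection/recovery time (ms):", SPINE_TIMER_MS)
```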

Verifying the Configuration

  • Check the PFC threshold and headroom value.
    [~Leaf1] display dcb pfc buffer interface 100ge 1/0/1   
    Xon:        PFC backpressure stop threshold                                     
    Xoff:       PFC backpressure threshold                                          
    Hdrm:       Headroom buffer threshold                                           
    Guaranteed: PFC guaranteed buffer threshold                                     
    The actual PFC backpressure stop threshold is the higher value between the value
    of xon and the difference between the value of xoff and the value of xon-offset.
    C:cells   B:bytes   K:kilobytes   M:megabytes   D:dynamic alpha                 
    ------------------------------------------------------------------------------------
    Interface      Queue    Guaranteed         Xon Xon-Offset       Xoff        Hdrm
    ------------------------------------------------------------------------------------
    100GE1/0/1         4         8(C)       200(C)      20(C)       5(D)      630(C)
    
    ------------------------------------------------------------------------------------
  • Check the number of PFC deadlocks and the number of deadlock recoveries. If the values of DeadlockNum and RecoveryNum are 0, no deadlock has been triggered.
    [~Leaf1] display dcb pfc interface 100ge 1/0/1 
    -----------------------------------------------------------------------------------------
    Interface         Queue         Received(Frames)        ReceivedRate(pps)     DeadlockNum
                                 Transmitted(Frames)     TransmittedRate(pps)     RecoveryNum
    -----------------------------------------------------------------------------------------
    100GE1/0/1            4                        0                        0               0
                                                   0                        0               0
    -----------------------------------------------------------------------------------------
  • Check the enabling status of the AI ECN function and the calculated ECN threshold.
    [~Leaf1] display ai-ecn calculated state interface 100ge 1/0/1 
    *: Indicates the queue where AI ECN takes effect.                               
    AI-ECN State: enabled                                                           
    --------------------------------------------------------------------            
    Interface       Queue   Low-Threshold   High-Threshold   Probability            
                                   (Byte)           (Byte)           (%)            
    --------------------------------------------------------------------            
    100GE1/0/1          0               0                0             0            
                        1               0                0             0            
                        2               0                0             0            
                        3               0                0             0 
                       *4            4896         18874080            40
                        5               0                0             0            
                        6               0                0             0            
                        7               0                0             0
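
The three values reported for queue 4 (low threshold, high threshold, and probability) can be read as a marking curve. The following Python sketch assumes the common WRED-style ECN model: no marking below the low threshold, a linear ramp up to the configured probability at the high threshold, and marking of every packet at or above the high threshold. This document does not confirm that the switch applies exactly this model, and the function name ecn_mark_probability is hypothetical.

```python
def ecn_mark_probability(queue_bytes, low, high, max_prob_pct):
    """Return the ECN marking probability (%) for a given queue depth.

    Assumption: WRED-style curve with a linear ramp between the thresholds.
    """
    if queue_bytes < low:
        return 0.0
    if queue_bytes >= high:
        return 100.0
    return max_prob_pct * (queue_bytes - low) / (high - low)


# Values shown for queue 4 on 100GE1/0/1 in the output above.
LOW, HIGH, PROB = 4896, 18874080, 40

if __name__ == "__main__":
    for depth in (0, LOW, (LOW + HIGH) // 2, HIGH):
        pct = ecn_mark_probability(depth, LOW, HIGH, PROB)
        print(f"{depth:>10} bytes queued -> {pct:.1f}% marked")
```

Under this model, a queue depth halfway between the two thresholds would be marked with half of the configured probability.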