CloudEngine 8800, 7800, 6800, and 5800 V200R019C10 Configuration Guide - Network Management and Monitoring

This document describes the configurations of Network Management and Monitoring, including SNMP, RMON, NETCONF, RESTCONF, OpenFlow, OVSDB, LLDP, NQA, Mirroring, Packet Capture, Packet Trace, Discarded Packet Capture, Path and Connectivity Detection, NetStream, sFlow, iPCA, IOAM, Telemetry, TWAMP, TWAMP Light and Intelligent Traffic Analysis.

Intelligent Traffic Analysis for RoCEv2 Flows


RoCEv2 networks are increasingly deployed in place of traditional TCP/IP networks to support applications such as high-performance computing (HPC), distributed storage, and artificial intelligence (AI), because they reduce CPU processing and latency and improve application performance. However, these distributed high-performance applications use the N:1 incast traffic model, which may cause instantaneous burst congestion or even packet loss in the internal queue buffers of Ethernet switches. As a result, application latency increases and throughput decreases, and the performance of distributed applications deteriorates. The intelligent traffic analysis function for RoCEv2 flows is designed to address these issues. This function enables the switch to match received RoCEv2 packets against ACL rules and send the matched packets to the TAP. The TAP then analyzes packet loss, latency, throughput, and path information to monitor the RoCEv2 network in real time.

Basic Concepts

Remote Direct Memory Access (RDMA), which is used on InfiniBand networks, transfers data directly from the memory of one computer to that of another without involving either computer's operating system or CPU. This provides high throughput, low latency, and high energy efficiency on the network.

RDMA can also be applied to Ethernet networks using the network protocol RDMA over Converged Ethernet (RoCE). There are two RoCE versions: RoCEv1 and RoCEv2.

  • RoCEv1 is a link layer protocol that allows communication between any two hosts in the same broadcast domain.
  • RoCEv2 is a network layer protocol that implements routing of RoCEv2 packets to allow hosts in different broadcast domains to communicate. RoCEv2 is encapsulated based on the UDP protocol. Figure 21-8 shows the format of an RoCEv2 packet.
    Figure 21-8 Format of an RoCEv2 packet
    • Ethernet Header: contains the source and destination MAC addresses.
    • IP Header: contains the source and destination IP addresses.
    • UDP Header: contains the source and destination port numbers. The destination port number is 4791.
    • InfiniBand Base Transport Header: contains main fields for intelligent traffic analysis. For details, see Table 21-5.
    • InfiniBand Payload: indicates the payload of a message.
    • ICRC and FCS: are used for redundancy check and frame check.
    Table 21-5 Main fields in the InfiniBand base transport header

    Field: Description

    Opcode: Specifies the type of RoCEv2 packets:
    • ConnectMsg: Packets of this type are used to set up an RoCEv2 connection. This process is called Communication Management (CM) connection setup. The local and remote ends can exchange data packets only after the connection is set up.
    • Send: Packets of this type are sent to the remote end. The sender does not control where the receiver stores data.
    • Write: Packets of this type carry the address, key, and length of data to be written to the remote end.
    • Read: Packets of this type carry the address, key, and length of data to be read by the remote end. RoCEv2 packets of the Send, Write, and Read types are analyzed during throughput analysis.
    • ACK (acknowledge): Packets of this type are response messages returned by the receiver. Depending on the ACK Extended Transport Header specific to RoCEv2 ACK packets, the ACK packet can be one of two types:
      • Common ACK packet: is a response packet indicating that data is successfully received.
      • NAK packet: indicates that packet loss occurs.

      The latency of data packets can be measured based on the ACK packet and the last Send packet.

    Pad Count: Specifies the number of extra bytes padded to the InfiniBand payload.

    Dest QP: Refers to Destination Queue Pair, which identifies an RoCEv2 flow. It is equivalent to the destination port number of an RoCEv2 packet. This field is also a key value used by the intelligent traffic analysis module to create an RoCEv2 flow table.

    PSN: Specifies the sequence number of an RoCEv2 packet. Packet loss is detected by checking whether the PSNs of packets are consecutive. If packet loss occurs, an NAK packet is returned.

    RoCEv2 packets of the ConnectMsg type include the Connect Request, Connect Reply, and ReadyToUse packets. Figure 21-9 shows the CM connection setup process.
    Figure 21-9 CM connection setup process
    • The client sends a Connect Request packet to the server to request to set up an RoCEv2 connection.
    • Upon receipt of the request, the server returns a Connect Reply packet to the client. After receiving the Connect Reply packet, the client considers that an RoCEv2 connection has been set up with the server.
    • The client sends a ReadyToUse packet to the server. After receiving this packet, the server considers that the CM connection is set up successfully.

    RDMA connections can be set up in either of the following modes: CM connection setup (based on RoCE packets) and TCP connection setup (based on user-defined fields in TCP packets). Currently, intelligent traffic analysis for RoCEv2 flows can analyze packets transmitted for CM connection setup as well as those transmitted for TCP connection setup in the public cloud scenario with Huawei's FusionStorage. The flow analysis processes for the two types of packets are similar. This section describes only the common CM connection setup mode.
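The fields in Table 21-5 can be illustrated with a minimal Python sketch that reads them from the start of a UDP payload and checks whether two PSNs are consecutive. This is not the switch's implementation. It assumes the commonly documented 12-byte InfiniBand base transport header layout (8-bit Opcode, a flags byte whose bits 5:4 hold Pad Count, a 16-bit partition key, 24-bit Dest QP, and 24-bit PSN) and a 24-bit PSN space that wraps around. The names `Bth`, `parse_bth`, and `is_next_psn` are hypothetical.

```python
import struct
from typing import NamedTuple

ROCEV2_UDP_DPORT = 4791   # destination UDP port given in the packet format description
BTH_LEN = 12              # assumed size of the base transport header in bytes
PSN_MODULUS = 1 << 24     # assumed 24-bit PSN space


class Bth(NamedTuple):
    """The four Table 21-5 fields (hypothetical container)."""
    opcode: int
    pad_count: int
    dest_qp: int
    psn: int


def parse_bth(udp_payload: bytes) -> Bth:
    """Extract Opcode, Pad Count, Dest QP, and PSN from the UDP payload."""
    if len(udp_payload) < BTH_LEN:
        raise ValueError("payload shorter than a base transport header")
    opcode, flags, _pkey, qp_word, psn_word = struct.unpack(
        "!BBHII", udp_payload[:BTH_LEN]
    )
    return Bth(
        opcode=opcode,
        pad_count=(flags >> 4) & 0x3,  # 2-bit Pad Count inside the flags byte
        dest_qp=qp_word & 0xFFFFFF,    # low 24 bits; the top byte is reserved
        psn=psn_word & 0xFFFFFF,       # low 24 bits; the top byte is not part of the PSN
    )


def is_next_psn(previous: int, current: int) -> bool:
    """Return True if `current` directly follows `previous`, allowing wraparound."""
    return current == (previous + 1) % PSN_MODULUS
```

A gap reported by `is_next_psn` corresponds to the situation in which the text says an NAK packet is returned.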

RoCEv2 Flow Matching

  • Flow matching on the TDE

    The intelligent traffic analysis module provides two intelligent traffic analysis functions for RoCEv2 flows: RoCEv2 packet loss visualization and RoCEv2 performance visualization. RoCEv2 networks are very sensitive to packet loss. Therefore, to help achieve zero packet loss, the RoCEv2 packet loss visualization function graphically displays all RoCEv2 traffic passing through devices. When packet loss occurs on a service flow, the RoCEv2 packet loss visualization function works with the RoCEv2 performance visualization function: based on the basic information about the RoCEv2 flow (source IP address, destination IP address, and inbound interface), high-precision performance indicators such as latency and throughput are measured and displayed, which helps locate the fault.

    When the intelligent traffic analysis module collects RoCEv2 traffic on the inbound interfaces of a switch for performance visualization, the switch identifies the UDP port number and Opcode field in the received RoCEv2 packets and matches the packets against the delivered ACL rules. The switch mirrors the matched RoCEv2 packets and sends them to the TAP. Currently, only the following advanced ACL rules are supported. Unsupported ACL rules cannot be delivered, so the TAP does not receive the corresponding service flows.

    • Rule 1: UDP + destination IPv4 address
    • Rule 2: UDP + source IPv4 address
    • Rule 3: UDP + source IPv4 address + destination IPv4 address

    Based on the configured ACL rule, the TDE delivers another ACL rule for matching the service flow in the opposite direction. In this way, packets in both directions of a service flow matching the ACL rules are sent to the TAP for high-precision analysis of flow characteristics.

    Measuring RoCEv2 packet loss on the network requires all RoCEv2 traffic passing through an interface to be monitored. Therefore, user-defined ACL rules cannot be configured to match specific RoCEv2 traffic. After intelligent traffic analysis for RoCEv2 packet loss visualization is enabled, the TDE automatically delivers ACL rules to match all RoCEv2 traffic passing through the interface.

  • Flow matching on the TAP

    The TAP analyzes a received RoCEv2 flow. Currently, only common IPv4 RoCEv2 packets can be analyzed. If the received packets are not common IPv4 RoCEv2 packets or exceed the processing capability of the TAP, the TAP discards these packets.
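The reverse-rule behaviour described under "Flow matching on the TDE" can be sketched as follows. This Python model is illustrative only and is not the device's ACL syntax. The class `UdpAclRule` and the function `rules_to_deliver` are hypothetical names, and treating a rule with no IPv4 address as unsupported is an assumption drawn from Rules 1 to 3.

```python
from dataclasses import dataclass
from ipaddress import IPv4Network
from typing import List, Optional


@dataclass(frozen=True)
class UdpAclRule:
    """A UDP match on an optional source and/or destination IPv4 prefix."""
    source: Optional[IPv4Network] = None
    destination: Optional[IPv4Network] = None

    def is_supported(self) -> bool:
        # Rules 1 to 3 each carry at least one IPv4 address (assumed reading).
        return self.source is not None or self.destination is not None

    def reversed(self) -> "UdpAclRule":
        # The opposite direction of the same service flow swaps the addresses.
        return UdpAclRule(source=self.destination, destination=self.source)


def rules_to_deliver(configured: UdpAclRule) -> List[UdpAclRule]:
    """Return the configured rule plus the rule for the opposite direction."""
    if not configured.is_supported():
        raise ValueError("rule is not one of the supported combinations")
    return [configured, configured.reversed()]
```

For example, a Rule 1 match on destination 10.1.1.0/24 yields a companion rule that matches source 10.1.1.0/24, so both directions of the flow reach the TAP.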

RoCEv2 Flow Analysis

After intelligent traffic analysis for RoCEv2 flows is enabled on a switch, the TDE automatically delivers ACL rules that match the Opcode field, which indicates the packet type, to obtain RoCEv2 packets. The TAP creates flow entries based on key values such as 4-tuple information in RoCEv2 connection setup packets.

A flow table is created based on RoCEv2 connection setup packets. Therefore, to ensure that intelligent traffic analysis takes effect, enable intelligent traffic analysis for RoCEv2 flows before an RoCEv2 connection is set up.

After creating a flow table, the TAP collects statistics on some key fields in the flow table based on RoCEv2 data packets sent from the TDE, and analyzes the statistical results to obtain characteristics of the RoCEv2 flow.

The statistics in the flow table can be viewed on the switch. In addition, the statistical results are exported to the TDA for further display and analysis after the flow is aged out.
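The dependency on connection setup packets can be made concrete with a small Python sketch: a flow entry exists only after a connection setup event has been seen, so data packets of a connection that was established earlier find no entry and are not counted. The class and method names are hypothetical. The lookup by source IP address, destination IP address, and Dest QP is an assumed reading of how data packets map to the 4-tuple (ServerIP, ClientIP, ClientQP, ServerQP).

```python
from dataclasses import dataclass
from typing import Dict, NamedTuple, Optional, Tuple


class FlowKey(NamedTuple):
    """4-tuple learned from connection setup packets."""
    server_ip: str
    client_ip: str
    client_qp: int  # Dest QP carried in packets sent by the client
    server_qp: int  # Dest QP carried in packets returned by the server


@dataclass
class FlowEntry:
    packets: int = 0
    octets: int = 0


class FlowTable:
    def __init__(self) -> None:
        self._entries: Dict[FlowKey, FlowEntry] = {}
        # (src_ip, dst_ip, dest_qp) -> FlowKey, one index entry per direction
        self._lookup: Dict[Tuple[str, str, int], FlowKey] = {}

    def on_connection_setup(self, key: FlowKey) -> FlowEntry:
        """Create the entry when connection setup packets have been seen."""
        if key not in self._entries:
            self._entries[key] = FlowEntry()
            self._lookup[(key.client_ip, key.server_ip, key.client_qp)] = key
            self._lookup[(key.server_ip, key.client_ip, key.server_qp)] = key
        return self._entries[key]

    def on_data_packet(self, src_ip: str, dst_ip: str, dest_qp: int, length: int) -> bool:
        """Count a data packet; return False if no flow entry exists for it."""
        key = self._lookup.get((src_ip, dst_ip, dest_qp))
        if key is None:
            return False  # connection was set up before analysis was enabled
        entry = self._entries[key]
        entry.packets += 1
        entry.octets += length
        return True

    def get(self, key: FlowKey) -> Optional[FlowEntry]:
        return self._entries.get(key)
```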

  • Flow table creation based on 4-tuple information

    Intelligent traffic analysis for RoCEv2 flows supports flow table creation based on 4-tuple information in RoCEv2 connection setup packets. The 4-tuple information uniquely identifies an RoCEv2 session. Table 21-6 lists the four key values in 4-tuple information for creating an RoCEv2 flow table.

    Table 21-6 Key values in 4-tuple information for creating an RoCEv2 flow table

    Key Value: Description

    ServerIP: Specifies the IP address of a server that sends and receives RoCEv2 flows. Currently, only IPv4 addresses are supported.

    ClientIP: Specifies the IP address of a client that sends and receives RoCEv2 flows. Currently, only IPv4 addresses are supported.

    ClientQP: Specifies the QP identifier of the RoCEv2 flow sent from the client. The value is the same as that of the Dest QP field in RoCEv2 packets. This field identifies an RoCEv2 flow.

    ServerQP: Specifies the QP identifier of the RoCEv2 flow returned by the server.

  • Flow table characteristics

    After creating an RoCEv2 flow table, the TAP collects statistics on fields in the flow table based on subsequent RoCEv2 data packets and analyzes the characteristics of the flow. You are advised to configure the 1588v2 function to make intelligent analysis of RoCEv2 traffic characteristics more precise.

    With the exception of M-LAG and stacking scenarios, the intelligent traffic analysis function for RoCEv2 flows requires that RoCEv2 packets be sent and received along the same path.

    Table 21-7 lists the RoCEv2 traffic characteristics that can be analyzed by the TAP.

    Table 21-7 RoCEv2 traffic characteristics that can be analyzed by the TAP

    Characteristic: Description

    Packet loss: The TAP can count the number of NAK packets in both directions of an RoCEv2 flow. A number other than 0 indicates that packet loss has occurred. If RoCEv2 packets are lost, the TAP records packet loss information, adds timestamps, and sends the packet loss information to the TDA.

    Latency: The TAP can calculate the smoothed RTT for packets transmitted in both directions, which is accurate to the nearest nanosecond.

    Throughput: The TAP can collect statistics on the throughput of RoCEv2 packets per unit of time. The TAP measures the throughput only for RoCEv2 packets in elephant flows, not for those in mice flows.

    Path: The TAP can collect statistics about the inbound interfaces of RoCEv2 packets and send the statistics to the TDA. After intelligent traffic analysis for RoCEv2 flows is configured on the entire network, you can view the actual path of the RoCEv2 flow on the TDA.

    NOTE:

    The intelligent traffic analysis function for RoCEv2 flows must be configured on the entire network to monitor the paths of network-wide RoCEv2 flows.
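The packet loss and latency characteristics in Table 21-7 can be illustrated with a short Python sketch. The source does not state how the smoothed RTT is computed. The exponentially weighted moving average below, with a smoothing factor of 1/8 as commonly used for TCP RTT estimation, is an assumption chosen only for illustration. `FlowStats` and its members are hypothetical names, and timestamps are taken to be nanosecond counters.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FlowStats:
    """Per-flow counters for loss and latency (hypothetical structure)."""
    nak_count: int = 0
    srtt_ns: Optional[float] = None
    alpha: float = 0.125  # assumed smoothing factor, not taken from the source

    def on_nak(self) -> None:
        """Record one NAK packet seen in either direction of the flow."""
        self.nak_count += 1

    @property
    def packet_loss_detected(self) -> bool:
        # A NAK count other than 0 indicates packet loss.
        return self.nak_count != 0

    def on_rtt_sample(self, send_ts_ns: int, ack_ts_ns: int) -> float:
        """Fold one (last Send, ACK) timestamp pair into the smoothed RTT."""
        sample = ack_ts_ns - send_ts_ns
        if sample < 0:
            raise ValueError("ACK timestamp precedes Send timestamp")
        if self.srtt_ns is None:
            self.srtt_ns = float(sample)  # first sample seeds the estimate
        else:
            self.srtt_ns = (1 - self.alpha) * self.srtt_ns + self.alpha * sample
        return self.srtt_ns
```

Nanosecond timestamps are meaningful only if the clocks that stamp the packets are precise, which is consistent with the recommendation to configure the 1588v2 function.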

Analysis of RoCEv2 Packets Sent and Received Along Different Paths

On a data center network, access gateways are typically deployed in a stack or an M-LAG system to ensure access reliability. Intelligent traffic analysis for RoCEv2 flows is also supported in these scenarios. The function is implemented in a similar way in the two scenarios. This section describes how intelligent traffic analysis for RoCEv2 flows is implemented when access gateways are deployed in an M-LAG active-active system.

Figure 21-10 Intelligent traffic analysis for RoCEv2 flows in the M-LAG dual-homing access scenario
On the RoCEv2 network shown in Figure 21-10, intelligent traffic analysis for RoCEv2 flows is deployed, and Leaf1, Leaf2, Leaf3, and Leaf4 are all M-LAG active-active gateways. The path along which RoCEv2 packets are sent from VM1 to VM3 is as follows: VM1 -> Leaf1 -> Spine1 -> Leaf4 -> VM3. The return path is as follows: VM3 -> Leaf3 -> Spine1 -> Leaf2 -> VM1. To ensure that characteristics of RoCEv2 packets transmitted in both directions are sent to the TDA, data needs to be synchronized between Leaf1 and Leaf2 and between Leaf3 and Leaf4. Using Leaf1 and Leaf2 as an example, the data synchronization process is as follows:
  1. After intelligent traffic analysis for RoCEv2 flows is enabled on Leaf2, Leaf2 automatically delivers ACL rules that match the Opcode field, which indicates the packet type, to obtain RoCEv2 packets.
  2. After matching a Connect Reply packet, Leaf2 sends the Connect Reply packet to Leaf1 through the peer-link interface (or a stack member interface in the stacking scenario).
  3. After receiving the packet, Leaf1 processes the Connect Reply packet together with other connection setup information and synchronizes the processing result to Leaf2. In this way, both Leaf1 and Leaf2 can create a bidirectional RoCEv2 flow table and analyze the packet loss, latency, throughput, and path of subsequent RoCEv2 data packets.

    To prevent repeated analysis of RoCEv2 packets, M-LAG or stack member switches do not create flow tables for or analyze RoCEv2 packets sent from peer-link interfaces or stack member interfaces.

  4. When Leaf2 matches RoCEv2 ACK packets, Leaf2 forwards the packets to Leaf1 through the peer-link interface.
  5. Leaf1 checks whether packet loss occurs based on ACK packets and measures the latency of data packets by analyzing Send packets.
  6. When the bidirectional RoCEv2 flow tables on Leaf1 and Leaf2 meet aging conditions, Leaf1 and Leaf2 export the flow tables to the TDA.

RoCEv2 Flow Table Export

After creating a flow table based on RoCEv2 packets sent from the TDE, the TAP exports the flow table that contains the flow analysis result to the specified TDA for further processing and graphical display of the flow information. The RoCEv2 flow table export process is similar to the TCP flow table export process. For details, see TCP Flow Table Export.

In contrast to intelligent traffic analysis for TCP flows, intelligent traffic analysis for RoCEv2 flows supports only the following two flow aging modes:

  • Active aging
    When the active time (from the flow table creation time to the current time) of an RoCEv2 flow exceeds the configured active aging period, the device considers that the RoCEv2 flow is active.
    • The flow table is periodically exported to the TDA at an interval of the active aging period.
    • If the number of NAK packets is not 0, packet loss has occurred. When the active aging period expires, the TAP deletes the NAK packet statistics from the flow table and prepares for packet loss detection in the next active aging period.
    • If statistics about the latency or throughput in the flow table do not change in an active aging period, the TAP deletes the statistics and re-collects statistics in the next active aging period.
  • Inactive aging

    When the inactive time (from the time when the last RoCEv2 packet is received to the current time) of an RoCEv2 flow exceeds the configured inactive aging period, the device considers that the RoCEv2 flow is inactive (the flow is interrupted). Because the TAP's capacity is limited, the TAP exports the current flow table to the TDA and deletes it from the switch.

    This mode is applicable to scenarios where a large number of flows are transmitted in a short period of time on the network.
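The two aging modes can be summarized as a decision made for each flow entry on a timer tick. The following Python sketch is one possible reading rather than the device's algorithm. The names `AgingAction`, `FlowTimers`, and `aging_action` are hypothetical, checking inactive aging before active aging is an assumption, and the periods are caller-supplied values in seconds.

```python
from dataclasses import dataclass
from enum import Enum


class AgingAction(Enum):
    NONE = "none"
    EXPORT = "export"                        # active aging: export, keep the entry
    EXPORT_AND_DELETE = "export_and_delete"  # inactive aging: export, then delete


@dataclass
class FlowTimers:
    created_at: float      # flow table creation time
    last_packet_at: float  # time the last RoCEv2 packet was received
    last_export_at: float  # time of the previous export (creation time at first)


def aging_action(flow: FlowTimers, now: float,
                 active_period: float, inactive_period: float) -> AgingAction:
    """Decide what to do with one flow entry at time `now`."""
    # Inactive aging: no packet for longer than the inactive aging period.
    if now - flow.last_packet_at > inactive_period:
        return AgingAction.EXPORT_AND_DELETE
    # Active aging: the flow is still alive, so export once per active period.
    if now - flow.last_export_at > active_period:
        return AgingAction.EXPORT
    return AgingAction.NONE
```

After an `EXPORT` result, a caller would set `last_export_at` to `now` and clear the NAK statistics so that packet loss detection starts afresh in the next active aging period.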