Packet Loss Occurs on the NIC of the RH2288 V3

Publication Date: 2019-04-23
Issue Description

Multiple RH2288 V3 servers are deployed at a site together with the Huawei big data platform. The FusionInsight Manager portal continuously reports a network packet loss alarm, and the service side reports long delays in data queries and writes.

Alarm Information

A network packet loss alarm is continuously reported on the FusionInsight Manager portal.

Handling Process

The packet loss statistics of the 10GE NIC are as follows:

eth2      Link encap:Ethernet  HWaddr E4:A8:B6:97:A2:CE
          UP BROADCAST RUNNING SLAVE MULTICAST  MTU:1500  Metric:1
          RX packets:22311179313 errors:0 dropped:212932 overruns:0 frame:0
          TX packets:19863061884 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:49594989710939 (45.1 TiB)  TX bytes:27686963916333 (25.1 TiB)

eth3      Link encap:Ethernet  HWaddr E4:A8:B6:97:A2:CE
          UP BROADCAST RUNNING SLAVE MULTICAST  MTU:1500  Metric:1
          RX packets:46459354889 errors:0 dropped:312597 overruns:0 frame:0
          TX packets:58374998148 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:95965226801957 (87.2 TiB)  TX bytes:69855612439915 (63.5 TiB)

 

ethtool -S eth2
NIC statistics:
     rx_packets: 22311179703
     tx_packets: 19863062276
     rx_bytes: 49594989747075
     tx_bytes: 27686963955170
     rx_pkts_nic: 35662916328
     tx_pkts_nic: 19863062274
     rx_bytes_nic: 50620484550566
     tx_bytes_nic: 27766799456380
     lsc_int: 9
     tx_busy: 0
     non_eop_descs: 20975198746
     rx_errors: 0
     tx_errors: 0
     rx_dropped: 0
     tx_dropped: 0
     multicast: 1091666
     broadcast: 937755
     rx_no_buffer_count: 0
     collisions: 0
     rx_over_errors: 0
     rx_crc_errors: 0
     rx_frame_errors: 0
     hw_rsc_aggregated: 18940077554
     hw_rsc_flushed: 5588340927
     fdir_match: 21123563285
     fdir_miss: 18307510764
     fdir_overflow: 58857
     rx_fifo_errors: 0
     rx_missed_errors: 212932

 

ethtool -S eth3
NIC statistics:
     rx_packets: 46459355292
     tx_packets: 58374998542
     rx_bytes: 95965226840776
     tx_bytes: 69855612477675
     rx_pkts_nic: 58927970523
     tx_pkts_nic: 58374998539
     rx_bytes_nic: 97025415708655
     tx_bytes_nic: 70089871434535
     lsc_int: 9
     tx_busy: 0
     non_eop_descs: 40836101363
     rx_errors: 0
     tx_errors: 0
     rx_dropped: 0
     tx_dropped: 0
     multicast: 1091738
     broadcast: 5089745
     rx_no_buffer_count: 0
     collisions: 0
     rx_over_errors: 0
     rx_crc_errors: 0
     rx_frame_errors: 0
     hw_rsc_aggregated: 20910806075
     hw_rsc_flushed: 8442190819
     fdir_match: 52579073218
     fdir_miss: 6482918673
     fdir_overflow: 116389
     rx_fifo_errors: 0
     rx_missed_errors: 312597
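
The counters above can be filtered mechanically to find the one that accounts for the drops. The following is a minimal sketch and not part of the original handling process: the function name nonzero_drop_counters and the keyword list (error, drop, miss, no_buffer) are assumptions chosen for illustration, and the function reads saved ethtool -S output from standard input instead of querying a NIC.

```shell
# List the non-zero error/drop counters found in "ethtool -S" output on stdin.
# The keyword list is an assumption made for this illustration.
nonzero_drop_counters() {
    awk -F': *' '
        /(error|drop|miss|no_buffer)/ && $2 + 0 > 0 {
            gsub(/^[ \t]+/, "", $1)   # strip the leading indentation
            print $1 "=" $2
        }'
}

# Excerpt of the eth2 counters shown above.
nonzero_drop_counters <<'EOF'
     rx_errors: 0
     rx_dropped: 0
     rx_no_buffer_count: 0
     rx_over_errors: 0
     rx_fifo_errors: 0
     rx_missed_errors: 212932
EOF
# prints: rx_missed_errors=212932
```

On a live host the same function could be fed directly, for example: ethtool -S eth2 | nonzero_drop_counters.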

The overall packet loss rate is below 1 in 100,000 (212932 of 22311179313 received packets on eth2 and 312597 of 46459354889 on eth3). Lab tests show that a loss rate of this order is normal under heavy load. All of the drops are counted as rx_missed_errors, which indicates that the CPU cannot drain the DMA ring buffer fast enough. The NIC hardware and the network protocol stack are normal.
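
The arithmetic behind that loss rate can be reproduced from the RX lines of the ifconfig output. This is a minimal sketch under the assumption that the input uses the ifconfig layout shown earlier (RX packets:N errors:N dropped:N ...); the function name rx_drop_rate is hypothetical.

```shell
# Print dropped/packets for every ifconfig-style "RX packets:" line on stdin.
rx_drop_rate() {
    awk '/RX packets:/ {
        split($2, pk, ":")     # $2 looks like "packets:22311179313"
        split($4, dr, ":")     # $4 looks like "dropped:212932"
        printf "%.2e\n", dr[2] / pk[2]
    }'
}

# RX lines of eth2 and eth3 from the statistics above.
rx_drop_rate <<'EOF'
          RX packets:22311179313 errors:0 dropped:212932 overruns:0 frame:0
          RX packets:46459354889 errors:0 dropped:312597 overruns:0 frame:0
EOF
# prints 9.54e-06 and 6.73e-06, both below 1e-05 (one in a hundred thousand)
```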

In addition, the servers still run the native RHEL 6.5 kernel, which has never been upgraded, as the boot entry shows:

Kernel /boot/vmlinuz-2.6.32-431.el6.x86_64 ro root=UUID=4646c097-2af0-44d5-9a1d-3c3c9a54f353 rd_NO_LUKS rd_NO_LVM LANG=en_US.UTF-8 rd_NO_MD SYSFONT=latarcyrheb-sun16 crashkernel=auto  KEYBOARDTYPE=pc KEYTABLE=us rd_NO_DM rhgb quiet

The bonding module in this kernel has a defect that degrades performance. For details, see the following article on the Red Hat official website:

https://access.redhat.com/solutions/631123
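
Whether a host still runs a kernel older than the release named in the Solution (2.6.32-431.11.2.el6) can be checked by comparing the numeric fields of the two release strings. The following is a sketch only: the function name kernel_older_than is hypothetical, and it assumes release strings of the form shown above, that is, numeric fields followed by a .elN suffix. Missing fields are treated as 0, so 2.6.32-431 ranks below 2.6.32-431.11.2.

```shell
# Exit status 0 if release $1 is older than release $2, otherwise 1.
# Assumes strings such as 2.6.32-431.el6.x86_64 (numeric fields, then .elN...).
kernel_older_than() {
    awk -v a="$1" -v b="$2" 'BEGIN {
        sub(/\.el[0-9].*$/, "", a)         # drop ".el6.x86_64" and similar
        sub(/\.el[0-9].*$/, "", b)
        na = split(a, x, /[.-]/)
        nb = split(b, y, /[.-]/)
        n = (na > nb) ? na : nb
        for (i = 1; i <= n; i++) {
            xi = (i <= na) ? x[i] + 0 : 0  # missing fields count as 0
            yi = (i <= nb) ? y[i] + 0 : 0
            if (xi < yi) exit 0
            if (xi > yi) exit 1
        }
        exit 1                             # equal releases are not older
    }'
}

# The kernel from the boot entry above, compared with the fixed release.
if kernel_older_than "2.6.32-431.el6.x86_64" "2.6.32-431.11.2.el6"; then
    echo "kernel upgrade needed"
fi
```

On a live host the first argument would typically be the output of uname -r.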

In conclusion, the network ports drop packets because the CPU cannot process the packets in the DMA area of memory in time, and the NIC bonding module of the OS kernel is defective. You are advised to upgrade the OS kernel and fine-tune the NIC.


Root Cause

The network ports drop packets because the CPU cannot process the packets in the DMA area of memory in time. In addition, the NIC bonding module of the OS kernel is defective. You are advised to upgrade the OS kernel and fine-tune the NIC.

Solution

1. Upgrade the RHEL kernel to 2.6.32-431.11.2.el6 or later to eliminate the negative impact of the bonding defect on NIC performance.

2. Update the NIC driver to the version released on the Huawei support website. The native driver performs poorly, offers no performance optimization, and does not allow RSS to be set. Then set the NIC RSS parameter so that the processing of the NIC interrupt queues is distributed evenly across the CPU cores. According to the logs, the servers on the live network use the E5-2618L v3 CPU, which presents 16 logical cores after hyper-threading is enabled. Therefore, you can set the RSS parameter to 16.

Reload the driver with the new RSS parameter so that the setting takes effect. Stop the services and the network service before performing this step. Assume that four Intel 82599 10GE ports are configured on the host. The operation procedure is as follows:

1. Run rmmod ixgbe to unload the old driver.

2. Run modprobe ixgbe RSS=16,16,16,16.

3. Run dmesg | grep -i tx to check whether information similar to the following is displayed. Ensure that the number of queues on all 10GE network ports has changed to 16.

0 to eth3, the eth0 to eth3 can be bound to the CPU threads 8-15 and 40-47 of the node1 (CPU2). This operation does not interrupt the network. For details, see Appendix 2. Alternatively, the irqbalance service provided by the OS can assign interrupts to CPUs automatically, provided that the RSS parameter has already been adjusted.

7. Add options ixgbe RSS=16,16,16,16 to /etc/modprobe.conf.

In this way, the configuration still takes effect after the server is restarted.
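
The option line added in step 7 carries one RSS value per 10GE port, which is why the value 16 appears four times for four ports. As an illustration, the line can be generated for any port count. The helper name ixgbe_rss_option is hypothetical, the port count is assumed to be at least 1, and the sketch only prints the line; it does not edit /etc/modprobe.conf or reload the driver.

```shell
# Print the modprobe option line for PORTS ixgbe ports with QUEUES RSS queues each.
# Usage: ixgbe_rss_option PORTS QUEUES   (PORTS is assumed to be >= 1)
ixgbe_rss_option() {
    ports=$1
    queues=$2
    list=$queues
    i=1
    while [ "$i" -lt "$ports" ]; do
        list="$list,$queues"           # one comma-separated value per port
        i=$(( i + 1 ))
    done
    echo "options ixgbe RSS=$list"
}

ixgbe_rss_option 4 16
# prints: options ixgbe RSS=16,16,16,16
```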

Suggestions

1. It is recommended that the server hardware and big data software platform be inspected periodically. If any alarm is found, handle it in a timely manner.

2. In the case of network packet loss, try to optimize the NIC parameters.
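
One NIC tuning step mentioned in the Solution is binding the NIC interrupts to CPU threads 8-15 and 40-47. On Linux, the CPUs allowed to service an interrupt are written to /proc/irq/<IRQ>/smp_affinity as comma-separated 32-bit hexadecimal words. The sketch below computes that mask for a contiguous CPU range. The helper name cpu_range_mask is hypothetical, CPU numbers are assumed to be below 64, and nothing is written to /proc.

```shell
# Print the smp_affinity mask (two 32-bit hex words, high word first)
# for the contiguous CPU range FIRST..LAST. Assumes CPU numbers below 64.
cpu_range_mask() {
    first=$1
    last=$2
    lo=0
    hi=0
    cpu=$first
    while [ "$cpu" -le "$last" ]; do
        if [ "$cpu" -lt 32 ]; then
            lo=$(( lo | (1 << cpu) ))          # CPUs 0-31 go into the low word
        else
            hi=$(( hi | (1 << (cpu - 32)) ))   # CPUs 32-63 go into the high word
        fi
        cpu=$(( cpu + 1 ))
    done
    printf '%08x,%08x\n' "$hi" "$lo"
}

cpu_range_mask 8 15     # prints 00000000,0000ff00
cpu_range_mask 40 47    # prints 0000ff00,00000000
```

Applying a mask requires root privileges and the IRQ number of each NIC queue, which can be looked up in /proc/interrupts.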
