Contents

4  Performance Problem Diagnosis

4.1  Overview

4.2  Locating Problems

4.3  Diagnosing Host Problems

4.3.1  Checking Whether Insufficient Host Memory or High CPU Usage Exists

4.3.2  Checking Whether the Host CPU Is Underclocked

4.3.3  Checking Whether Linux Is Using a Kernel Buffer

4.3.4  Optimizing Block Device Parameter Settings of Linux

4.3.5  Checking Whether the Maximum Concurrency Capability of the Host HBA Is Sufficient

4.3.6  Checking Whether the Host HBA Is Running the Latest Driver

4.4  Diagnosing Network and Link Problems

4.4.1  Checking the Zoning Settings on Fibre Channel Switches

4.4.2  Checking the Multipathing Status

4.4.3  Checking the Multipathing Parameter Settings

4.4.4  Checking Whether the Front-End Link Bandwidth Reaches a Bottleneck

4.5  Diagnosing Storage Problems

4.5.1  Diagnosis Process

4.5.2  Collecting Information

4.5.3  Checking the Performance Impacts of Internal Transactions and Value-Added Features

4.5.4  Analyzing the CPU Performance

4.5.5  Analyzing the Performance of Front-End Host Ports

4.5.6  Analyzing Cache Performance

4.5.7  Analyzing the LUN Performance

4.5.8  Analyzing the RAID Performance

4.5.9  Analyzing the Performance of Back-End Ports and Disks

4  Performance Problem Diagnosis

The performance of a storage system is determined by its weakest point. When tuning performance or diagnosing a problem, first learn about the service scenario and performance requirements. Then, follow the system I/O path to determine the module where the performance problem resides. Finally, diagnose the problem and tune the performance.

4.1  Overview

The most direct symptom of a performance problem is that application responses or service processing take longer.

Common performance problems are as follows:
  • The I/O latency is long. Users experience a longer service response time.
  • IOPS and bandwidth cannot meet service requirements.
  • Performance fluctuates greatly.
Figure 4-1 shows a typical storage network that consists of application servers, switches, and a storage system.
Figure 4-1  Schematic diagram of a storage network

From the perspective of networking, the service I/O delivery process can be roughly divided into the logical modules shown in Figure 4-2.

Figure 4-2  Service I/O processing flowchart

Performance problem diagnosis can be divided into two steps:

  • Determine the I/O path layer where a performance problem resides.
  • Determine the root cause of the performance problem.

4.2  Locating Problems

Performance is associated with the entire process. A performance problem may occur in any aspect of the entire I/O path. For example, the host application may have an error, the network where the host resides may be congested, or the back-end storage system may have an error. The performance of a storage system is determined by the weakest point. Therefore, you must check all aspects to find out the performance bottleneck and eliminate it.

Context

A common method of locating a performance problem is to compare the average latency of a host with that of a storage system and then determine whether the problem resides on the host side, the network link layer, or the storage side.
  • If both the latency on the host side and that on the storage side are large and the difference between the latencies is small, the problem may reside in the storage system. Common reasons are as follows: The disk performance reaches the upper limit; the mirror bandwidth reaches the upper limit; a short-term performance deterioration occurs because LUN formatting is not complete.
    NOTE:

    The preceding problems are related to the read latency. Write latency includes the time spent in transmitting data from a host to a storage system. Therefore, if the write latency is large, the problem is not always caused by the storage system, and you should check all the aspects that may cause the problem.

  • If the latency on the host side is much larger than that on the storage side, the configuration on the host side may be inappropriate, leading to a performance or network link problem. Common reasons are as follows: I/Os are stacked because the concurrency capability of the block device or HBA is insufficient; the CPU usage of a host reaches 100%; the bandwidth reaches a bottleneck; the switch configuration is inappropriate; the multipathing software selects paths incorrectly.
  • After determining the location where a performance problem resides, such as on the host, storage, or network side, analyze and troubleshoot the problem.

Procedure

  1. Check the latency on the host side.

    • On a Linux host, use different tools to query the host latency.
      • Use the performance statistics function of service software, such as the AWR report function of Oracle, to query the host latency.
      • Use iostat, a Linux disk I/O query tool, to query the latency.

        Run iostat -kx 1.

        In the command output, await indicates the average time in processing each I/O request, namely, the I/O response time, expressed in milliseconds.
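
        The following is an illustrative example of the output (the device name and values are hypothetical); here await is 4 ms at 80% utilization:

        Device:          rrqm/s  wrqm/s     r/s     w/s   rkB/s   wkB/s avgrq-sz avgqu-sz   await  svctm  %util
        sdb                0.00    0.00  800.00  200.00 6400.00 1600.00    16.00     4.00    4.00   0.80  80.00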

      • Use vdbench, a Linux performance test tool, to query the latency.

        In the command output, resp indicates the average time in processing each I/O request, namely, the I/O response time, expressed in milliseconds.

    • On a Windows host, use different tools to query the host latency.
      • Use the performance statistics function of service software to query the host latency.
      • Use IOmeter, a performance test tool commonly used in Windows, to query the host latency.

      • Use the performance monitoring tool delivered with Windows to query the host latency. Windows Performance Monitor, a performance monitoring tool integrated with Windows, can monitor the performance of CPUs, memory, disks, network connections, and applications.
        The method of using Windows Performance Monitor to monitor disk performance is as follows:
        • On the Windows desktop, choose Start > Run. In the Run dialog box, enter perfmon to open the performance monitoring tool. The Performance Monitor window is displayed.
        • In the left navigation tree, choose Monitoring Tools > Performance Monitor and click the plus icon to add performance items.
        • In the Add Counters window, select PhysicalDisk and add the performance items that you want to monitor. Then, click Add and then OK. Windows Performance Monitor starts monitoring disk performance.
      • Table 4-1 describes performance items related to latency.
        Table 4-1  Disk performance items related to latency

        Indicator          Subitem                 Description
        -----------------  ----------------------  -----------------------------------------------------------
        Latency indicator  Avg. Disk sec/Transfer  Average time in processing each I/O, expressed in seconds.
                           Avg. Disk sec/Read      Average time in processing each read I/O.
                           Avg. Disk sec/Write     Average time in processing each write I/O.
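
        These counters can also be sampled from the command line through PowerShell's Get-Counter cmdlet; the following is a sketch (the counter instance and sample count are examples):

        Get-Counter -Counter '\PhysicalDisk(*)\Avg. Disk sec/Transfer' -SampleInterval 1 -MaxSamples 5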

  2. Check the latency on the storage side.

    • Use SystemReporter to query the latency on the storage side.

      Operation path: Monitoring > Real-Time Monitoring > Controller

    • Run the CLI command to query the latency on the storage side.
      Log in to the CLI of the storage system and run show performance controller to query the average I/O response time of the specified controller, namely, Average I/O Latency (ms).
      admin:/>show performance controller controller_id=0A
      0.Max. Bandwidth (MB/s)                                 
      1.Usage Ratio (%)                                       
      2.Queue Length                                          
      3.Bandwidth (MB/s)                                      
      4.Throughput(IOPS) (IO/s)                               
      5.Read Bandwidth (MB/s)                                 
      6.Average Read I/O Size (KB)                            
      7.Read Throughput(IOPS) (IO/s)                          
      8.Write Bandwidth (MB/s)                                
      9.Average Write I/O Size (KB)                           
      10.Write Throughput(IOPS) (IO/s)                        
      11.Service Time (Excluding Queue Time) (ms)             
      12.Read I/O distribution: 512 B                         
      13.Read I/O distribution: 1 KB                          
      14.Read I/O distribution: 2 KB                          
      15.Read I/O distribution: 4 KB                          
      16.Read I/O distribution: 8 KB                          
      17.Read I/O distribution: 16 KB                         
      18.Read I/O distribution: 32 KB                         
      19.Read I/O distribution: 64 KB                         
      20.Read I/O distribution: 128 KB                        
      21.Read I/O distribution: 256 KB                        
      22.Read I/O distribution: 512 KB                        
      23.Write I/O distribution: 512 B                        
      24.Write I/O distribution: 1 KB                         
      25.Write I/O distribution: 2 KB                         
      26.Write I/O distribution: 4 KB                         
      27.Write I/O distribution: 8 KB                         
      28.Write I/O distribution: 16 KB                        
      29.Write I/O distribution: 32 KB                        
      30.Write I/O distribution: 64 KB                        
      31.Write I/O distribution: 128 KB                       
      32.Write I/O distribution: 256 KB                       
      33.Write I/O distribution: 512 KB                       
      34.CPU Usage (%)                                        
      35.Avg. Cache Usage (%)                                 
      36.Average I/O Latency (ms)                             
      37.Max. I/O Latency (ms)                                
      38.Percentage of Cache Flushes to Write Requests (%)    
      39.Cache Flushing Bandwidth (MB/s)                      
      40.Full-stripe Write Request                            
      41.Cache Read Usage (%)                                 
      42.Cache Write Usage (%)                                
      43.Average Read I/O Latency                             
      44.Average Write I/O Latency                            
      45.Average IO Size                                      
      46.Complete SCSI commands per second                    
      47.Verify commands per second                           
      48.% Read                                               
      49.% Write                                              
      50.Max IOPS(IO/s)                                       
      51.File Bandwidth(MB/s)                                 
      52.File OPS                                             
      53.File Bandwidth(MB/s) 
      Input item(s) number seperated by comma:
      NOTE:

      For details about how to log in to the CLI of a storage system and the usage of CLI commands, see the Command Reference of the corresponding version.

4.3  Diagnosing Host Problems

A host is the party that initiates I/Os. I/O characteristics are determined by the service software and operating system running on the host as well as the hardware configuration of the host. When diagnosing a host problem, check the typical host performance problems described in the following sections. If the performance problem still persists, check the network, links, and storage system.

4.3.1  Checking Whether Insufficient Host Memory or High CPU Usage Exists

Insufficient host memory or high CPU usage affects the service delivery capability of a host, leading to performance deterioration.

Windows Hosts

On a Windows host, use Performance Monitor or Task Manager, both delivered with Windows, to query the memory and CPU usage. The following describes how to use Task Manager:
  1. Start the task manager.
    1. Choose Start > Run.

      The Run dialog box is displayed.

    2. Enter taskmgr and click OK.

      The Windows Task Manager window is displayed.

  2. Select the Performance tab. View the memory and CPU usage of the host. See Figure 4-3.

    Figure 4-3  Memory and CPU usage of a host

Linux Hosts

On a Linux host, run top to view the resource usage. The top command, a performance analysis tool commonly used in Linux, displays the resource usage of each process in real time.
  1. Log in to a Linux host as user root.
  2. Run top to view the resource usage of each process.
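
    The following output is an illustrative example (values are hypothetical; press 1 in top to display per-CPU statistics):

    top - 14:32:01 up 10 days,  3:11,  2 users,  load average: 0.01, 0.02, 0.00
    Tasks: 180 total,   1 running, 179 sleeping,   0 stopped,   0 zombie
    Cpu0  :  0.1%us,  0.2%sy,  0.0%ni, 99.7%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
    Cpu1  :  0.3%us,  0.1%sy,  0.0%ni, 99.6%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st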



    In the preceding command output, 99.7%id indicates the idle percentage of CPU0; the usage of CPU0 is therefore 100% - 99.7% = 0.3%.

Follow-up Procedure

In application scenarios that require high performance, if the memory or CPU usage on the host side is high, replace the existing hosts with higher-performance hosts or add hosts for parallel testing.

4.3.2  Checking Whether the Host CPU Is Underclocked

In certain host running modes, a host takes the initiative to decrease the CPU clock frequency during off-peak hours, thereby reducing the power consumption. However, CPU underclocking affects the performance in interaction between a host and a storage system, increasing the I/O latency on the host side. Therefore, if you are concerned about the latency in light-load scenarios, configure the host running mode to ensure that the CPU clock frequency will not decrease, thereby obtaining the optimal performance.

Windows Hosts

On a Windows host, you can view the CPU clock frequency.

Choose Start > Run. In the Run dialog box, enter DXDIAG and click OK to check whether the CPU clock frequency is decreased.

You can change the host running mode to the high performance mode. In doing so, the CPU clock frequency will not decrease.

Operation path: Start > Control Panel > System and Security > Power Options > Select a power plan > High performance
NOTE:

System and Security is displayed only after View by (the Control Panel view mode) is set to Category.
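
The same change can also be made from the command line; the following is a sketch (SCHEME_MIN is the built-in alias of the High performance power plan):

powercfg /setactive SCHEME_MIN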


Linux Hosts

On a Linux host, run cat /proc/cpuinfo to view the CPU clock frequency.
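
For example, to list only the clock frequency of each core (output is illustrative):

linux-ob3a:~ # grep "cpu MHz" /proc/cpuinfo
cpu MHz         : 2100.000
cpu MHz         : 2100.000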

To ensure that the CPU clock frequency does not decrease, configure the host running mode as follows:

  1. Run cd to go to the /sys/devices/system/cpu/ directory.
  2. Run ll to view the number of CPUs in the /sys/devices/system/cpu directory.
  3. For each CPU in the directory, run echo performance to write performance to its cpufreq/scaling_governor file, as shown below (a loop sketch for all cores follows the single-CPU example).

    linux-ob3a:~ # cd /sys/devices/system/cpu
    linux-ob3a:/sys/devices/system/cpu # ll
    total 0
    drwxr-xr-x 7 root root    0 May 25 08:35 cpu0
    drwxr-xr-x 7 root root    0 May 25 08:35 cpu1
    drwxr-xr-x 7 root root    0 May 25 08:35 cpu10
    drwxr-xr-x 7 root root    0 May 25 08:35 cpu11
    drwxr-xr-x 7 root root    0 May 25 08:35 cpu2
    drwxr-xr-x 7 root root    0 May 25 08:35 cpu3
    drwxr-xr-x 7 root root    0 May 25 08:35 cpu4
    drwxr-xr-x 7 root root    0 May 25 08:35 cpu5
    drwxr-xr-x 7 root root    0 May 25 08:35 cpu6
    drwxr-xr-x 7 root root    0 May 25 08:35 cpu7
    drwxr-xr-x 7 root root    0 May 25 08:35 cpu8
    drwxr-xr-x 7 root root    0 May 25 08:35 cpu9
    drwxr-xr-x 3 root root    0 May 28 14:43 cpufreq
    drwxr-xr-x 2 root root    0 May 28 14:43 cpuidle
    -r--r--r-- 1 root root 4096 May 25 08:35 kernel_max
    -r--r--r-- 1 root root 4096 May 28 14:43 offline
    -r--r--r-- 1 root root 4096 May 25 08:35 online
    -r--r--r-- 1 root root 4096 May 28 14:43 possible
    -r--r--r-- 1 root root 4096 May 28 14:43 present
    --w------- 1 root root 4096 May 28 14:43 probe
    --w------- 1 root root 4096 May 28 14:43 release
    -rw-r--r-- 1 root root 4096 May 28 14:43 sched_mc_power_savings
    -rw-r--r-- 1 root root 4096 May 28 14:43 sched_smt_power_savings
    linux-ob3a:/sys/devices/system/cpu # echo performance > cpu0/cpufreq/scaling_governor
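    # A minimal sketch (assuming every core exposes a cpufreq/scaling_governor
    # file) to set the governor of all cores to performance in one loop:
    linux-ob3a:/sys/devices/system/cpu # for gov in cpu[0-9]*/cpufreq/scaling_governor; do echo performance > "$gov"; done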
    

4.3.3  Checking Whether Linux Is Using a Kernel Buffer

If a common performance test is conducted in a Linux environment, it is recommended that you set the disk access mode to Direct I/O.

Linux has a kernel buffer. The Buffered I/O mechanism of Linux stores I/O data to the page cache. That is, data is first copied to the buffer of the Linux kernel and then to the address space of the application program. The Direct I/O mechanism enables data to be directly transmitted between the buffer of user address space and disks without the need to use the page cache.

If you use the Vdbench tool or the dd command to test performance without specifying Direct I/O, Buffered I/O is used by default. The default page cache size employed by Linux is 4 KB. If an I/O is larger than 4 KB, the following issues occur:
  • The host splits the I/O in the kernel buffer, and the resulting small I/Os are then recombined at the block device layer, incurring a large CPU overhead.
  • In a large-concurrency scenario, small I/Os may not be combined into large I/Os on the block device layer before being written to disks. As a result, the I/O model is changed, and disks may not be able to provide the maximum bandwidth performance.
Set the disk access mode to Direct I/O as follows:
  • When Vdbench is used, set openflags to o_direct in sd.
    sd=default,openflags=o_direct,threads=32
  • When the dd command is used:
    • Set oflag to direct to test the write performance.
      dd if=/dev/zero of=/testw.dbf bs=4k oflag=direct count=100000
    • Set iflag to direct to test the read performance.
      dd if=/dev/sdb of=/dev/null bs=4k iflag=direct count=100000

4.3.4  Optimizing Block Device Parameter Settings of Linux

The block device layer is a major factor affecting host performance. Correctly configuring the block device queue depth, scheduling algorithm, prefetch volume, and I/O alignment helps improve system performance.

Queue Depth

The queue depth determines the maximum number of concurrent I/Os written to the block device. In Linux, the default value is 128. Do not change the value unless absolutely necessary. To query the queue depth of the block device, run cat.
linux-ob3a:~ # cat /sys/block/sdc/queue/nr_requests
128
When testing the highest system performance, you can set the queue depth to a larger value to increase the I/O write pressure and the probability of combining I/Os in the queue. You can use the following method to temporarily change the queue depth of the block device:
echo 256 > /sys/block/sdc/queue/nr_requests
NOTE:

You can tune the performance by temporarily changing the queue depth of the block device. After the application server is restarted, the queue depth is restored to the default value.
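
If the setting must persist across restarts, one possible approach is a udev rule; the following is a sketch (the rule file name and the sd[a-z] match pattern are examples and must be adapted to your devices):

# /etc/udev/rules.d/60-block-tuning.rules (hypothetical file name)
ACTION=="add|change", KERNEL=="sd[a-z]", ATTR{queue/nr_requests}="256"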

Scheduling Algorithm

Linux 2.6 kernel supports four types of block device scheduling algorithms: noop, anticipatory, deadline, and cfq. The default scheduling algorithm is cfq. To query the block device scheduling algorithm in use, run cat.
linux-ob3a:~ # cat /sys/block/sdc/queue/scheduler
noop deadline [cfq] 
Inappropriate configuration of the scheduling algorithm affects system performance, such as I/O concurrency. You can use the following method to temporarily change the scheduling algorithm of the block device:
echo noop > /sys/block/sdc/queue/scheduler
NOTE:

You can tune the performance by temporarily changing the scheduling algorithm. After the application server is restarted, the scheduling algorithm is restored to the default value.

Prefetch volume

Similar to the prefetch algorithm of a storage array, the prefetch function of Linux applies only to sequential reads: it identifies sequential streams and reads data of the read_ahead_kb length (in KB) in advance. For example, the default prefetch volume in SUSE 11 is 512 KB. To query the prefetch volume of the block device, run cat.

linux-ob3a:~ # cat /sys/block/sdc/queue/read_ahead_kb
512

If an application needs to read a large number of large-sized files, you can set the prefetch volume to a relatively large value for higher performance. To change the prefetch volume of a block device, use the following method:

echo 1024 > /sys/block/sdc/queue/read_ahead_kb
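
Equivalently, the blockdev command can set the readahead value, expressed in 512-byte sectors (2048 sectors correspond to 1024 KB):

blockdev --setra 2048 /dev/sdc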

I/O Alignment

If MBR partitions are created in Linux or Windows whose version is earlier than Windows Server 2003, the first 63 sectors of a disk are reserved for the master boot record and partition table. The first partition starts from the 64th sector by default. As a result, misalignment occurs between data blocks (database or file system) delivered by hosts and data blocks stored in the storage array, causing poor I/O processing efficiency.

In Linux, you can resolve I/O misalignment in either of the following ways:

  • Method 1: Change the start location of a partition.

    When creating MBR partitions in Linux, it is recommended that you enter the expert mode of the fdisk command and set the start location of the first partition to the start location of the second extent on a LUN. (The default extent size is 4 MB.) The following is a quick command used to create an MBR partition in /dev/sdb. The partition uses all space of /dev/sdb. The start sector is set to 8192, namely, 4 MB.

    printf "n\np\n1\n\n\nx\nb\n1\n8192\nw\n" | fdisk /dev/sdb
  • Method 2: Create GPT partitions.

    The following is a quick command used to create a GPT partition in /dev/sdb. The partition uses all space of /dev/sdb. The start sector is set to 8192, namely, 4 MB.

    parted -s -- /dev/sdb "mklabel gpt" "unit s" "mkpart primary 8192 -1" "print"
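
    To verify the start sector afterward, you can print the partition table in sector units; for example:

    parted -s /dev/sdb "unit s" "print"

    In the output, the Start column of the partition should show 8192s.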

To create MBR partitions in Windows whose version is earlier than Windows Server 2003, it is recommended that you run diskpart to set partition alignment.

diskpart> select disk 1
diskpart> create partition primary align=4096

4.3.5  Checking Whether the Maximum Concurrency Capability of the Host HBA Is Sufficient

The concurrency capability of an HBA indicates the maximum number of I/Os that each LUN can transmit at a time. In a high concurrency scenario, an insufficient concurrency capability of an HBA or block device typically leads to poor performance.

Windows Hosts

In Windows, the concurrency capability of most HBAs is 128. In certain Windows versions, the default concurrency capability of the HBA driver may be small. For example, in Windows Server 2012 R2, the default concurrency capability of an Emulex HBA is 32. Insufficient concurrency capability leads to the following issue: The host pressure cannot be fully transferred to the storage side. If the difference between the latency on the host side and that on the storage side is great, you can use the management software provided by the HBA vendor to query the concurrency capability of the HBA and set it to a proper value if necessary.

Linux Hosts

In Linux, the queue parameter settings of an HBA vary depending on the HBA type and driver. For details, see the specifications provided by the HBA vendor. For example, the QLogic 8 Gbit/s dual-port Fibre Channel HBA allows the maximum queue depth of each LUN to be 32. If the difference between the latency on the host side and that on the storage side is great, run iostat to check whether the concurrency bottleneck is reached.
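
The following is an illustrative example of iostat output under a concurrency bottleneck (the device name and values are hypothetical):

Device:          rrqm/s  wrqm/s     r/s    w/s   rkB/s   wkB/s avgrq-sz avgqu-sz   await  svctm  %util
sdc                0.00    0.00  320.00   0.00 2560.00    0.00    16.00    31.85   99.50   3.12  99.90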



Observe avgqu-sz, namely, the average queue depth of the block device corresponding to a LUN. If the value of avgqu-sz is 10 or larger for a long time, it is possible that I/Os are accumulated on the block device layer of the host due to the concurrency limitation. The host pressure is not transferred to the storage side. In this case, you can increase the concurrency capability of the HBA.

4.3.6  Checking Whether the Host HBA Is Running the Latest Driver

If the driver of a host HBA is outdated, the following issues may occur: The HBA splits large I/Os, the default concurrency capability is insufficient, and the I/O latency increases, especially in light-load, 100% hit ratio, and bandwidth-sensitive scenarios.

Windows Hosts

Open Server Manager. In the device list, select the desired HBA. On the device property page, click the Driver tab and view the driver version.

Linux Hosts

  1. Run lsscsi to query the channel IDs corresponding to the HBA.

    linux-ob3a:~ # lsscsi
    [0:2:0:0]    disk    IBM      ServeRAID M5015  2.12  /dev/sda 
    [1:0:0:0]    cd/dvd  TSSTcorp DVD-ROM TS-L333H ID03  /dev/sr0 
    [5:0:0:0]    disk    HUASY    XXXXXX           2105  -       
    [5:0:0:1]    disk    HUASY    XXXXXX           2105  -       
    [5:0:0:2]    disk    HUASY    XXXXXX           2105  -       
    [5:0:0:3]    disk    HUASY    XXXXXX           2105  -       
    [5:0:0:4]    disk    HUASY    XXXXXX           2105  -       
    [5:0:0:5]    disk    HUASY    XXXXXX           2105  -       
    [5:0:0:6]    disk    HUASY    XXXXXX           2105  -       
    [5:0:0:7]    disk    HUASY    XXXXXX           2105  -       
    [5:0:1:0]    disk    HUASY    XXXXXX           2105  -       
    [5:0:1:1]    disk    HUASY    XXXXXX           2105  -       
    [5:0:1:2]    disk    HUASY    XXXXXX           2105  -       
    [5:0:1:3]    disk    HUASY    XXXXXX           2105  -       
    [5:0:1:4]    disk    HUASY    XXXXXX           2105  -       
    [5:0:1:5]    disk    HUASY    XXXXXX           2105  -       
    [5:0:1:6]    disk    HUASY    XXXXXX           2105  -       
    [5:0:1:7]    disk    HUASY    XXXXXX           2105  -       
    [6:0:0:0]    disk    HUASY    XXXXXX           2105  -       
    [6:0:0:1]    disk    HUASY    XXXXXX           2105  -       
    [6:0:0:2]    disk    HUASY    XXXXXX           2105  -       
    [6:0:0:3]    disk    HUASY    XXXXXX           2105  -       
    [6:0:0:4]    disk    HUASY    XXXXXX           2105  -       
    [6:0:0:5]    disk    HUASY    XXXXXX           2105  -       
    [6:0:0:6]    disk    HUASY    XXXXXX           2105  -       
    [6:0:0:7]    disk    HUASY    XXXXXX           2105  -       
    [6:0:1:0]    disk    HUASY    XXXXXX           2105  -       
    [6:0:1:1]    disk    HUASY    XXXXXX           2105  -       
    [6:0:1:2]    disk    HUASY    XXXXXX           2105  -       
    [6:0:1:3]    disk    HUASY    XXXXXX           2105  -       
    [6:0:1:4]    disk    HUASY    XXXXXX           2105  -       
    [6:0:1:5]    disk    HUASY    XXXXXX           2105  -       
    [6:0:1:6]    disk    HUASY    XXXXXX           2105  -       
    [6:0:1:7]    disk    HUASY    XXXXXX           2105  -       
    

    In the preceding command output, the channel IDs corresponding to the HBA are 5 and 6.

  2. Run cd to go to the corresponding host directory.

    linux-ob3a:~ # cd /sys/class/scsi_host/host5
    linux-ob3a:/sys/class/scsi_host/host5 # ls
    84xx_fw_version  cmd_per_lun     driver_version    fw_state    isp_id      model_name           optrom_fcode_version    phy_version        prot_guard_type    sg_tablesize    thermal_temp       unique_id            zio_timer
    active_mode      device          fabric_param      fw_version  isp_name    mpi_version          optrom_fw_version       power              scan               state           total_isp_aborts   vlan_id
    beacon           diag_megabytes  flash_block_size  host_busy   link_state  optrom_bios_version  optrom_gold_fw_version  proc_name          serial_num         subsystem       uevent             vn_port_mac_address
    can_queue        diag_requests   fw_dump_size      host_reset  model_desc  optrom_efi_version   pci_info                prot_capabilities  sg_prot_tablesize  supported_mode  unchecked_isa_dma  zio

  3. Run cat to query driver_version.

    linux-ob3a:/sys/class/scsi_host/host5 # cat driver_version
    8.04.00.13.11.3-k
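
    Alternatively, for many HBAs you can query the kernel module version directly with modinfo (the module name qla2xxx is an example for QLogic HBAs; output is illustrative):

    linux-ob3a:~ # modinfo -F version qla2xxx
    8.04.00.13.11.3-k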

Follow-up Procedure

If the HBA driver is outdated, update it to the latest version.

4.4  Diagnosing Network and Link Problems

A performance problem may exist in a link if the latency on the host side is large, that on the storage side is small, and no problem is found on the host side.

4.4.1  Checking the Zoning Settings on Fibre Channel Switches

Considering reliability and load balancing, hosts and storage systems in an actual site deployment are connected to Fibre Channel switches to form a crossover network, instead of being directly interconnected. If zones are incorrectly configured, link contention exists, leading to a decrease in host performance.

Similar to the VLAN function of an Ethernet switch, the zoning function of a Fibre Channel switch allows users to isolate links, thereby reducing faulty domains and link contention between hosts or applications. In a zoning plan based on the point-to-point rule, each zone contains only one initiator and one target. This policy minimizes the interference between devices in different zones.

For example, each of two hosts at a site is connected to two switches to form a crossover network. Each switch is connected to controllers A and B through two 8 Gbit/s Fibre Channel cables per controller. No zone is configured. See Figure 4-4.
Figure 4-4  Network example (without zoning)

Theoretically, the bandwidth can reach a maximum of 3.2 GB/s. However, in the performance test where the dd command is used, the maximum read bandwidth is merely 2.4 GB/s. The reason is that the eight physical paths between the two switches and the storage system are shared by the two hosts. Under a heavy load, link contention occurs, leading to bandwidth fluctuations and the failure to reach the maximum bandwidth.

Based on zoning, shared links can be grouped for isolation purposes. See Figure 4-5. Each host uses four links between switches and controllers, preventing link contention. After zones are configured, the tested bandwidth is 3.2 GB/s.

Figure 4-5  Network example (with zoning)

In addition to zoning, you can isolate shared links by using multipathing software to disable certain paths or by binding a LUN group to a port group.

Zoning can be configured on a GUI or CLI, depending on the switch vendor. If a zoning problem occurs, ask the switch vendor to assist in diagnosing it.
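
As an illustration only, the following sketch creates and activates a single-initiator zone on a Brocade switch CLI (the WWNs, zone name, and configuration name are hypothetical; other vendors use different commands):

zonecreate "host1_ctl0A", "10:00:00:00:c9:a1:b0:ee; 20:08:00:22:a1:0b:eb:2a"
cfgcreate "fabric_cfg", "host1_ctl0A"
cfgenable "fabric_cfg"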

4.4.2  Checking the Multipathing Status

If the multipathing mechanism goes wrong, the working controller of a LUN and the owning controller of that LUN may be different, leading to I/O forwarding and performance deterioration.

Multipathing software ensures redundancy, reliability, and high performance of links between a host and a storage system. UltraPath, multipathing software developed by Huawei, is recommended. If the multipathing software delivered with the operating system or third-party multipathing software is used instead of UltraPath, paths may be selected incorrectly due to an inappropriate configuration or compatibility issues. For example, if the multipathing software delivered with an ESX host is used, you must enable the ALUA protocol, select the Fixed mode, and specify paths. Otherwise, the working controller of a LUN will be switched over repeatedly, leading to I/O forwarding.

  • On a Windows host, open UltraPath Console and check whether links are normal and owning controllers are correct.



  • On a Linux host, run show path to check whether physical paths are normal.
    UltraPath CLI #1 >show path
    ----------------------------------------------------------------------------------
     Path ID   Initiator Port   Array Name  Controller    Target Port     Path State  
        0     10000000c9a1b0ee   Array8.1       0A      20080022a10beb2a    Normal    
        1     10000000c9a1b0ef   Array8.1       0B      201a0022a10beb2a    Normal    
    ----------------------------------------------------------------------------------
    --------------------------------
    Check State  Port Type  Port ID 
        --          FC        --    
        --          FC        --    
    --------------------------------

    Path State indicates the status of a link.

NOTE:

For details about the query methods used in other operating systems, see the UltraPath User Guide intended for those operating systems.

In certain scenarios, links may be absent, that is, initiator ports are missing. Use UltraPath to check whether the number of links is the same as the number of physical connections. In IOPS-sensitive scenarios, link absence causes I/O forwarding, adversely affecting performance. In bandwidth-sensitive scenarios, link absence means that the number of physical connections is reduced by one, decreasing the bandwidth.

4.4.3  Checking the Multipathing Parameter Settings

If the load balancing mode, load balancing algorithm, or other parameters of multipathing software are set inappropriately, the I/O pressure on links will be imbalanced or the sequence of sequential I/Os is affected, preventing the optimal performance from being provided and adversely affecting the bandwidth capability.

  • On a Windows host, use UltraPath Console to query the multipathing parameter settings.

    Operation path: System > Global Settings



  • On a Linux host, run show upconfig to query the multipathing parameter settings.
    UltraPath CLI #2 >show upconfig
    =======================================================
    UltraPath Configuration
    =======================================================
    Basic Configuration
        Working Mode : load balancing within controller
        LoadBalance Mode : min-queue-depth
        Loadbanlance io threshold : 1
        LUN Trespass : on
    
    Advanced Configuration
        Io Retry Times : 10
        Io Retry Delay : 0
        Faulty path check interval : 10
        Idle path check interval : 60
        Failback Delay Time : 600
        Io Suspension Time : 60
        Max io retry timeout : 1800
    
    Path reliability configuration
        Timeout degraded statistical time : 600
        Timeout degraded threshold : 1
        Timeout degraded path recovery time : 1800
        Intermittent I/O error degraded statistical time : 300
        Min. I/Os for intermittent I/O error degraded statistical : 5000
        Intermittent I/O error degraded threshold : 20
        Intermittent I/O error degraded path recovery time : 1800
        Intermittent fault degraded statistical time : 1800
        Intermittent fault degraded threshold : 3
        Intermittent fault degraded path recovery time : 3600
        High latency degraded statistical time : 300
        High latency degraded threshold : 1000
        High latency degraded path recovery time : 3600
    
    HypperMetro configuration
        HyperMetro Primary Array SN : Not configured
        HyperMetro WorkingMode : read write within primary array
        HyperMetro Split Size : 128MB
NOTE:

For details about the query methods used in other operating systems, see the UltraPath User Guide intended for those operating systems.

Load Balancing Mode

UltraPath V100R008 provides the following load balancing modes:

  • Intra-controller load balancing
  • Inter-controller load balancing

By default, intra-controller load balancing is used. If inter-controller load balancing is enabled, UltraPath uses all paths to deliver I/Os without considering the preferred and non-preferred controllers of a LUN. Inter-controller load balancing is suitable for the scenario where the load must be balanced among all logical links. To prevent I/O forwarding, intra-controller load balancing should be selected in most cases for high performance.

Load Balancing Algorithm

If the load balancing algorithm of multipathing software is inappropriate, the I/O pressure among links may also be imbalanced, preventing the optimal performance from being offered. UltraPath R8 supports three load balancing policies: min-queue-depth, round-robin, and min-task.
  • min-queue-depth: This policy collects statistics about the number of I/Os queuing in each path in real time and sends I/Os to the path where the number of queuing I/Os is the least. When an application server sends I/Os to a storage system, the path with the minimum I/O queue takes precedence over other paths in sending I/Os.

    min-queue-depth is the default path selection algorithm and provides the optimal performance in most cases.

  • round-robin: I/Os sent by an application server to a storage system for the first time are transmitted over path 1, I/Os sent for the second time are transmitted over path 2, and so on. Paths are used in turn to bring each path into full play.

    The round-robin policy is typically used with the inter-controller load balancing policy. In most cases, the round-robin policy adversely affects the bandwidth performance because I/Os are sent in sequence to each logical link without considering the load of a link. As a result, a congested link may encounter a greater pressure, and the sequence of sequential I/Os may be affected.

  • min-task: When an application server sends I/Os to a storage system, the overall amount of data is calculated based on the block size of each I/O request. Then, I/Os are sent to the path that has the minimum amount of data.

    The min-task policy is seldom used. It has a similar performance impact as the min-queue-depth policy.

Load Balancing Consecutive I/O Quantity

Load balancing consecutive I/O quantity indicates the number of I/Os sent by a host at a time over each selected path. For example, in UltraPath for Linux, Loadbalance io threshold indicates the consecutive I/O quantity. The default value is 100, indicating that a host sends 100 I/Os over each selected path. Controlling the consecutive I/O quantity serves the following purpose: A Linux block device combines consecutive I/Os. If I/Os sent to a path involve consecutive addresses, the block device combines them into a large I/O and sends it to the storage system, thereby improving efficiency.

In certain scenarios, if the consecutive I/O quantity is set to an inappropriate value, the performance is adversely affected.

  • In a Windows operating system, the default value of the consecutive I/O quantity is 1. Do not change the value because Windows does not have a block device layer and does not combine I/Os.
  • In a Linux operating system, set the consecutive I/O quantity to a small value for large-I/O scenarios. The maximum I/O size adopted by a Linux block device is 512 KB. Therefore, the size of an I/O will not exceed 512 KB even if I/Os are combined. If I/Os whose sizes are 512 KB are sent, they will not be combined. In this case, set the consecutive I/O quantity to 1.
  • In random I/O scenarios, I/Os almost cannot be combined. To reduce the overhead, set the consecutive I/O quantity to 1.

4.4.4  Checking Whether the Front-End Link Bandwidth Reaches a Bottleneck

The bandwidth of a storage system is determined by the front-end link bandwidth, controller bandwidth, and back-end disk bandwidth. In bandwidth testing, the front-end link bandwidth bottleneck is most common.

To estimate the front-end link bandwidth, you must learn about the network plan and switch cascading, and identify the weakest point that restricts the bandwidth capability. For example, if a host is connected to a switch through two optical fibers whereas the switch is connected to a controller through only one 8 Gbit/s FC cable, the maximum unidirectional bandwidth is 780 MB/s. Table 4-2 lists the tested bandwidth of common Fibre Channel links.
Table 4-2  Tested bandwidth of common Fibre Channel links

Link                         Unidirectional Bandwidth (MB/s)  Bidirectional Bandwidth (MB/s)
---------------------------  -------------------------------  ------------------------------
4 Gbit/s Fibre Channel link  390                              690
8 Gbit/s Fibre Channel link  780                              1300 to 1400

NOTE:

The maximum bandwidth of a Fibre Channel link is the largest value provided in the event of sequential large I/Os. If random small I/Os are transmitted, the maximum bandwidth is reduced.

Host HBAs, Fibre Channel switches, and Fibre Channel modules on storage systems can work at different rates. The actual transmission rate is the lowest one among them. If the link rate is inappropriate or a link failure occurs, the performance may decrease. To offer the optimal performance and reliability, the link configuration must ensure load balancing and redundancy.

4.5  Diagnosing Storage Problems

If a performance problem is caused by the storage side, learn about the service types, I/O characteristics, and storage resource configuration, collect performance data, and diagnose the problem based on the I/O path.

4.5.1  Diagnosis Process

Before diagnosing a performance problem on the storage side, learn about the diagnosis process that helps improve diagnosis efficiency and accuracy.

Figure 4-6 illustrates the process of diagnosing a performance problem on the storage side.

Figure 4-6  Process of diagnosing a performance problem on the storage side

4.5.2  Collecting Information

The purpose of collecting information is to provide reference for diagnosing a performance problem on the storage side. For example, you can collect information about I/O characteristics to help analyze the CPU performance of a controller more accurately and efficiently. Information that needs to be collected includes the types of services served by the storage system, I/O characteristics, and storage resource plan.

Service Types and I/O Characteristics

Before diagnosing a performance problem on the storage side, learn about the types of services served by the storage system and I/O characteristics, so that you can determine and focus on the key points.
  • Service types include Oracle database OLTP and OLAP, VDI, and Exchange mail services. Analyzing service types helps determine whether you should pay attention to the IOPS or bandwidth and whether a low latency is required.
  • I/O characteristics include the I/O size, read/write ratio, cache hit ratio, and hotspot data distribution. You can use SystemReporter to observe I/O characteristics. Analyzing I/O characteristics helps understand whether the current service mainly involves sequential or random I/Os, large or small I/Os, and read or write I/Os.

Storage Resource Plan

Before diagnosing a performance problem on the storage side, learn about the storage resource plan so that you can determine the faulty domains. You can use DeviceManager or the Information Collection function of Toolkit to collect information about the storage resource plan. A storage resource plan covers the following aspects:
  • Product models, specifications, and software versions

    To ensure load balancing among multiple controllers, the number of interface modules and the number of connected front-end and back-end ports on each controller must be approximately equal to those on the other controllers. Otherwise, one controller may be overloaded whereas others are idle.

  • Types and number of disks

    SSDs provide the highest random small I/O performance, whereas they do not outperform HDDs in the bandwidth capability. Therefore, in a resource plan, SSDs should be used to serve applications that involve random small I/Os, and SAS or NL-SAS disks can be used to serve applications that involve large I/Os or bandwidth-sensitive applications.

  • Number of front-end and back-end interface modules and number of connections

    In a bandwidth-sensitive scenario, a proper number of back-end interface modules must be equipped to provide the bandwidth capacity required by the front-end interface modules, so that the bandwidth performance of the storage system can be brought into full play.

  • Disk domains and storage pools, including the number of disks in each disk domain, the owning relationship of storage pools, and the stripe size

    Resource allocation must be in direct proportion to the service pressure. If disk domains have disks of the same types or similar disk configuration ratios, the storage pool that will face the heaviest load must be created in the disk domain that consists of the largest number of disks. SSDs must be allocated to the disk domain that has the heaviest load. In a storage pool that contains multiple tiers, the LUN that has the heaviest load must be allocated to the high-performance tier or the performance tier. If SmartPartition is configured, sufficient cache resources must be reserved for the LUN that has the heaviest load.

  • LUN properties and space distribution, such as whether a LUN is a thin LUN or a thick LUN, the working controller of a LUN, and the LUN capacity percentage on each layer
  • Value-added feature configuration, for example, whether snapshot, clone, remote replication, and SmartTier are configured

    The mechanism for implementing a value-added feature causes extra performance overheads. On the one hand, a large number of metadata operations (such as initial LUN space allocation and formatting) are performed. On the other hand, many non-host I/Os may be generated (for example, remote replication records data differences and later deletes them). Therefore, you must learn about the value-added features configured on the storage system.

4.5.3  Checking the Performance Impacts of Internal Transactions and Value-Added Features

Internal transactions refer to the work completed in the background in response to user operations, for example, formatting a newly created LUN, reconstructing data upon a disk failure, and data balancing upon capacity expansion. In addition to numerous metadata operations, internal transactions and value-added features may generate a large number of non-host I/Os. Therefore, host services processed by the storage system are affected.

Before diagnosing a performance problem, check whether there are internal transactions or value-added features. Common internal transactions or value-added features are as follows:
  • LUNs not formatted

    LUNs that are not formatted generate a large amount of formatting data, greatly affecting host I/Os. Therefore, before service rollout or performance testing, ensure that LUNs are formatted.

    You can use DeviceManager or run the CLI command to query the properties of a LUN.

    admin:/>show lun_format_process lun_id=0
    
      LUN ID       : 0     
      Task Process : 1%    
      Remain Time  : 28506s 



  • Precopy, reconstruction, and balancing
    • Precopy: The precopy technique enables the storage system to regularly check the status of hardware. If a disk may fail, the storage system migrates the data from the disk to mitigate data loss risks.
    • Reconstruction: If a disk fails, the reconstruction function recovers data to newly allocated hot spare space based on a RAID redundancy mechanism, leading to a large amount of RAID computing and data copying.
    • Balancing: If disks are added to expand the capacity, data is automatically migrated based on SmartMotion to balance data among all disks. In this scenario, a large amount of data is copied.
    Data copy tasks are generated during the precopy, reconstruction, or balancing process. If a large amount of data needs to be copied, the service performance is adversely affected. Therefore, before conducting a performance test, log in to DeviceManager or run the related command on the CLI to check whether precopy, reconstruction, or balancing tasks exist.

    admin:/>show disk_domain task disk_domain_id=0
    
    Disk Domain ID : 0
    Type : Reconstruct
    Disk Enclosure : DAE101
    Disk Slot : 23
    Data Size : 17.625GB
    Data Finished Size : 1.375GB
    Remain Time : 0Day(s) 1Hour(s) 26Minute(s) 4Second(s)
    Progress : 7%
    -----------------------------------------------------------------
    Disk Domain ID : 0
    Type : Precopy
    Disk Enclosure : DAE000
    Disk Slot : 13
    Data Size : 19.125GB
    Data Finished Size : 2.250GB
    Remain Time : 0Day(s) 0Hour(s) 2Minute(s) 58Second(s)
    Progress : 11%
    -----------------------------------------------------------------
    Disk Domain ID : 0
    Type : Balancing
    tierId : tier0
    Data Size : 13.750GB
    Data Finished Size : 2.750GB
    Remain Time : 0Day(s) 0Hour(s) 1Minute(s) 59Second(s)
    Progress : 20%
    NOTE:

    If disk domains are involved in precopy, reconstruction, or balancing tasks, you can view the progress of each task. If disk domains are not involved in these tasks, a message is displayed indicating that the command is successfully executed.

  • Migration

    Migration mentioned in this section includes two cases: data migration implemented by SmartTier between tiers; LUN migration implemented by SmartMigration within or between storage devices. These two types of migration involve a large number of data read, write, and copy operations, leading to performance deterioration. The two types of migration allow you to specify a migration speed. A high speed has a great impact on host I/Os, and a low speed has a minor impact.

  • Other value-added features

    Value-added features such as snapshot, remote replication, and clone cause extra overheads and prolong the I/O processing. Therefore, value-added features adversely affect the performance.

4.5.4  Analyzing the CPU Performance

The CPU capability is the most critical factor that determines the maximum performance of a storage controller. Therefore, when you are locating a performance problem based on the I/O path, the first step is to analyze the CPU performance of controllers.

Checking the CPU Usage of Controllers

When the CPU usage is high, the latency of system scheduling increases. As a result, the I/O latency increases.

The CPU usage of a storage system is closely related to and varies with I/O models and networking modes. For example,
  • Write I/Os consume more CPU resources than read I/Os do.
  • Random I/Os consume more CPU resources than sequential I/Os do.
  • IOPS-sensitive services consume more CPU resources than bandwidth-sensitive services do.
  • iSCSI networks consume more CPU resources than Fibre Channel networks do.

You can use SystemReporter or run the CLI command to query the CPU usage of the current controller.

  • Use SystemReporter.

    Operation path: Monitoring > Real-Time Monitoring > Controller



  • On the CLI, run show performance controller.
    admin:/>show performance controller controller_id=0A
    0.Max. Bandwidth (MB/s)                                 
    1.Usage Ratio (%)                                       
    2.Queue Length                                          
    3.Bandwidth (MB/s)                                      
    4.Throughput(IOPS) (IO/s)                               
    5.Read Bandwidth (MB/s)                                 
    6.Average Read I/O Size (KB)                            
    7.Read Throughput(IOPS) (IO/s)                          
    8.Write Bandwidth (MB/s)                                
    9.Average Write I/O Size (KB)                           
    10.Write Throughput(IOPS) (IO/s)                        
    11.Service Time (Excluding Queue Time) (ms)             
    12.Read I/O distribution: 512 B                         
    13.Read I/O distribution: 1 KB                          
    14.Read I/O distribution: 2 KB                          
    15.Read I/O distribution: 4 KB                          
    16.Read I/O distribution: 8 KB                          
    17.Read I/O distribution: 16 KB                         
    18.Read I/O distribution: 32 KB                         
    19.Read I/O distribution: 64 KB                         
    20.Read I/O distribution: 128 KB                        
    21.Read I/O distribution: 256 KB                        
    22.Read I/O distribution: 512 KB                        
    23.Write I/O distribution: 512 B                        
    24.Write I/O distribution: 1 KB                         
    25.Write I/O distribution: 2 KB                         
    26.Write I/O distribution: 4 KB                         
    27.Write I/O distribution: 8 KB                         
    28.Write I/O distribution: 16 KB                        
    29.Write I/O distribution: 32 KB                        
    30.Write I/O distribution: 64 KB                        
    31.Write I/O distribution: 128 KB                       
    32.Write I/O distribution: 256 KB                       
    33.Write I/O distribution: 512 KB                       
    34.CPU Usage (%)                                        
    35.Avg. Cache Usage (%)                                 
    36.Average I/O Latency (ms)                             
    37.Max. I/O Latency (ms)                                
    38.Percentage of Cache Flushes to Write Requests (%)    
    39.Cache Flushing Bandwidth (MB/s)                      
    40.Full-stripe Write Request                            
    41.Cache Read Usage (%)                                 
    42.Cache Write Usage (%)                                
    43.Average Read I/O Latency                             
    44.Average Write I/O Latency                            
    45.Average IO Size                                      
    46.Complete SCSI commands per second                    
    47.Verify commands per second                           
    48.% Read                                               
    49.% Write                                              
    50.Max IOPS(IO/s)                                       
    51.File Bandwidth(MB/s)                                 
    52.File OPS                                             
    53.File Bandwidth(MB/s)                                 
    Input item(s) number seperated by comma:34
    
      CPU Usage (%) : 8 
    
      CPU Usage (%) : 6 
    

If the CPU usage remains high for a long time, the maximum performance of the controller is reached. It is recommended that you migrate some services to another storage system to mitigate the service pressure or add controllers to the existing storage system to improve the performance.

Functions Related to the CPU Performance

By default, a storage system enables the CPU Quality of Service (QoS) and underclocking functions, which take effect when the CPU usage reaches the threshold or when CPUs work under a light load, respectively. The underclocking function may reduce the system performance, and the CPU QoS function restricts the system performance.

When the CPU usage reaches 85%, the CPU QoS function triggers QoS flow control. Then, the front-end modules take the initiative to increase the latency to reduce the IOPS pressure, thereby reducing the CPU load. Therefore, during the performance monitoring, a sharp decrease in IOPS or a sharp increase in I/O latency may occur. If it occurs, check whether the service load is heavy or the CPU usage is high.

To conserve energy and protect the environment, a storage system has enabled the CPU underclocking function to reduce power consumption during off-peak hours. When the CPU usage is low, the CPU clock frequency is automatically reduced. Use the CLI to log in to the storage system and switch to the developer mode. In developer mode, run show cpu to check whether the CPU clock frequency is reduced.

admin:/>change user_mode current_mode user_mode=developer
developer:/>show cpu

  ID        Temperature(Celsius)  Voltage(V)  Work Frequency(MHz)  Current Frequency(MHz)  Frequency Enable  
  --------  --------------------  ----------  -------------------  ----------------------  ----------------  
  CTE0.A.0  55                    0.8         2100                 1200                    Yes               
  CTE0.A.1  57                    0.8         2100                 1200                    Yes               
  CTE0.B.0  52                    0.8         2100                 2101                    Yes               
  CTE0.B.1  56                    0.8         2100                 2101                    Yes 
  • Work Frequency (MHz) indicates the CPU clock frequency.
  • Current Frequency (MHz) indicates the current CPU clock frequency. If Current Frequency (MHz) is lower than Work Frequency (MHz), the CPU clock frequency is reduced.
  • Frequency Enable indicates whether the CPU underclocking function is enabled. Yes indicates the CPU underclocking function is enabled. No indicates the CPU underclocking function is disabled.

The performance provided when the CPU works at a low clock frequency is lower than that provided when the CPU works at a high clock frequency. In a light-load test, for example, running the dd command, copying a single file, and using IOmeter to test single concurrency, the performance is relatively low. Therefore, before conducting a low-load performance test, it is recommended that you run change cpu frequency in developer mode to disable the CPU underclocking function.

developer:/>change cpu frequency enabled=no
DANGER: You are going to modify the frequency for CPU. This operation may interrupt services or cause service exceptions. 
Suggestion: Before you perform this operation, determine whether the modification is necessary.
Have you read danger alert message carefully?(y/n)y

Are you sure you really want to perform the operation?(y/n)y
Command executed successfully.

4.5.5  Analyzing the Performance of Front-End Host Ports

Front-end host ports process host I/Os. Analyzing the factors that may affect the performance of front-end host ports helps discover possible performance bottlenecks in a storage system.

Checking Information About Front-End Host Ports

Before analyzing the performance of front-end host ports, confirm the locations of interface modules and the number, statuses, and speeds of connected ports.

You can use DeviceManager or run the CLI command to query information about front-end host ports.
  • Use DeviceManager to query information about front-end host ports.

  • Run show port general on the CLI to query information about front-end host ports.
    admin:/>show port general physical_type=FC
    
      ID              Health Status  Running Status  Type       Working Rate(Mbps)  WWN               Role         Working Mode  Configured Mode  
      --------------  -------------  --------------  ---------  ------------------  ----------------  -----------  ------------  ---------------  
      CTE0.A.IOM0.P0  Normal         Link Up         Host Port  8000                20001051720c8a51  INI and TGT  FC-AL         Auto-Adapt       
      CTE0.A.IOM0.P1  Normal         Link Down       Host Port  --                  20011051720c8a51  INI and TGT  --            Auto-Adapt       
      CTE0.B.IOM0.P0  Normal         Link Up         Host Port  8000                20101051720c8a51  INI and TGT  FC-AL         Auto-Adapt       
      CTE0.B.IOM0.P1  Normal         Link Down       Host Port  --                  20111051720c8a51  INI and TGT  --            Auto-Adapt 
After confirming the locations of front-end interface modules and the number, statuses, and speeds of connected ports, pay attention to the following aspects:
  • The interface modules in two adjacent slots share the same PCIe chip. In scenarios where a high bandwidth is required, it is recommended that front-end and back-end interface modules are deployed in turn, so that PCIe chips can bring bidirectional channels into full play, maximizing the bandwidth capability.
  • The number of connected front-end host ports on one controller should be equal to that on the other controller. The connected ports should be evenly distributed across the front-end interface modules for load balancing between controllers and between interface modules. For example, if controller A is equipped with two 8 Gbit/s Fibre Channel interface modules and a switch is connected to controller A through four optical fibers, each Fibre Channel interface module should be connected to two optical fibers.
  • Confirm that the working rates of front-end host ports displayed on DeviceManager or on the CLI are the same as the actual specifications. That is, no degrade issue exists.
  • Confirm that the displayed working mode complies with the actual connection mode. P2P indicates a switch-based connection, and FC-AL indicates a direct connection.

Checking the Concurrency Pressure of Front-End Host Ports

To test the maximum performance, you must ensure that the host side provides a sufficient concurrency pressure. If the number of concurrent tasks on the host side is large enough whereas the performance (IOPS and/or bandwidth) is not high, it is possible that the host pressure is not transferred to the front end of the storage system or the storage system has reached a bottleneck.

In addition to comparing latencies, you can check the front-end concurrency pressure of the storage system to help analysis.

  • Method 1: Use a formula.

    This method is suitable for scenarios where the pressure is fixed. For example, use IOmeter or Vdbench to test performance under a fixed pressure. During the test, the number of concurrent tasks is typically fixed. After observing the IOPS and latency, rearrange the formula IOPS = Number of concurrent tasks x 1000/Latency (ms) into Number of concurrent tasks = IOPS x Latency (ms)/1000 to obtain the number of concurrent tasks, thereby learning about the front-end concurrency pressure. For example, if the IOPS is 3546 and the latency is 6.77 ms, the number of concurrent tasks is about 24 (namely, 3546 x 6.77/1000). A worked sketch follows this list.

  • Method 2: Run the CLI command to obtain an approximate number of front-end concurrent tasks.
    This method is suitable for scenarios that involve a changing pressure. The show controller io io_type=frontEnd controller_id=XX command is used to query the front-end concurrent I/O tasks delivered to the specified controller. Run this command multiple times and use a stable value as an approximate number of front-end concurrent tasks. XX indicates a controller ID.
    admin:/>show controller io io_type=frontEnd controller_id=0A 
    
      Controller Id   : 0A     
      Front End IO    : 0      
      Front End Limit : 17408 
    NOTE:

    If the latency is low, the number of front-end concurrent tasks obtained by running show controller io may be inaccurate. In this case, use method 1 to assist analysis.
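
The calculation in method 1 is an application of Little's Law (concurrency = throughput x response time). A minimal Python sketch using the figures from the example in method 1:

# Estimate front-end concurrency from measured IOPS and latency
# (Little's Law: concurrency = throughput x response time).
def concurrency(iops, latency_ms):
    return iops * latency_ms / 1000.0

# Example from method 1: 3546 IOPS at 6.77 ms -> about 24 concurrent tasks.
print(round(concurrency(3546, 6.77)))  # 24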

If the front-end concurrency pressure is insufficient, it is recommended that you increase the concurrency pressure on the host side and check whether the front-end concurrency pressure increases. If the front-end concurrency pressure remains insufficient, locate problems on the host side.

Checking Whether Front-End Host Ports Have Bit Errors

If the performance frequently fluctuates or abnormally drops, the front-end host ports or links may be abnormal. Run the CLI command or use the inspection report function to check whether front-end host ports have bit errors.

Run show port bit_error to view the bit errors of front-end host ports.

admin:/>show port bit_error
ETH port:

  ID               Error Packets  Lost Packets  Over Flowed Packets  Start Time                     
  ---------------  -------------  ------------  -------------------  -----------------------------  
  CTE0.L1.IOM1.P0  0              0             0                    2015-08-14/23:02:16 UTC+08:00  
  CTE0.L1.IOM1.P1  0              0             0                    2015-08-14/23:02:16 UTC+08:00  
  CTE0.L1.IOM1.P2  0              0             0                    2015-08-14/23:01:58 UTC+08:00  
  CTE0.L1.IOM1.P3  0              0             0                    2015-08-14/23:01:58 UTC+08:00  
  CTE0.L2.IOM1.P0  0              0             0                    2015-08-14/23:01:58 UTC+08:00  
  CTE0.L2.IOM1.P1  0              0             0                    2015-08-14/23:01:58 UTC+08:00  
  CTE0.L2.IOM1.P2  0              0             0                    2015-08-14/23:01:58 UTC+08:00  
  CTE0.L2.IOM1.P3  0              0             0                    2015-08-14/23:01:58 UTC+08:00  
  CTE0.L3.IOM1.P0  0              0             0                    2015-08-14/23:02:16 UTC+08:00  
  CTE0.L3.IOM1.P1  0              0             0                    2015-08-14/23:02:16 UTC+08:00  
FC port:

  ID               Lost Signals  Link Errors Codes  Lost Synchronizations  Failed Connections  Start Time                     
  ---------------  ------------  -----------------  ---------------------  ------------------  -----------------------------  
  CTE0.R3.IOM0.P0  0             0                  0                      0                   2015-08-14/22:58:02 UTC+08:00  
  CTE0.R3.IOM0.P1  0             5                  0                      0                   2015-08-14/22:58:02 UTC+08:00  
  CTE0.R3.IOM0.P2  0             5                  0                      0                   2015-08-14/22:58:02 UTC+08:00  
  CTE0.R3.IOM0.P3  0             4                  0                      0                   2015-08-14/22:58:02 UTC+08:00  
  CTE0.R1.IOM1.P0  0             5                  0                      0                   2015-08-14/22:58:16 UTC+08:00  
  CTE0.R1.IOM1.P1  0             1                  0                      0                   2015-08-14/22:58:16 UTC+08:00  
  CTE0.R1.IOM1.P2  0             1                  0                      0                   2015-08-14/22:58:16 UTC+08:00  
  CTE0.R1.IOM1.P3  0             0                  0                      0                   2015-08-14/22:58:16 UTC+08:00  
SAS port:

  ID            Invalid Dword  Consist Errors  Loss Of DWORD  PHY Reset Errors  Start Time                     
  ------------  -------------  --------------  -------------  ----------------  -----------------------------  
  CTE0.R5.P0    61             52              2              25                2015-08-14/23:04:02 UTC+08:00  
  CTE0.R5.P1    0              0               0              0                 2015-08-14/23:04:03 UTC+08:00  
  CTE0.R5.P2    0              0               0              0                 2015-08-14/23:04:03 UTC+08:00  
  CTE0.R5.P3    0              0               0              0                 2015-08-14/23:04:04 UTC+08:00  
  CTE0.R5.P4    0              0               0              0                 2015-08-14/23:04:05 UTC+08:00  
  CTE0.R5.P5    0              0               0              0                 2015-08-14/23:04:06 UTC+08:00  
  CTE0.R5.P6    0              0               0              0                 2015-08-14/23:04:06 UTC+08:00  
  CTE0.R5.P7    0              0               0              0                 2015-08-14/23:04:07 UTC+08:00  
  CTE0.R5.P8    0              0               0              0                 2015-08-14/23:04:08 UTC+08:00  
  CTE0.R5.P9    0              0               0              0                 2015-08-14/23:04:08 UTC+08:00  
  CTE0.R5.P10   0              0               0              0                 2015-08-14/23:04:09 UTC+08:00  
  CTE0.R5.P11   0              0               0              0                 2015-08-14/23:04:10 UTC+08:00  
  CTE0.L5.P0    42             43              1              6                 2015-08-14/23:04:12 UTC+08:00  
  CTE0.L5.P1    0              0               0              0                 2015-08-14/23:04:13 UTC+08:00  
  CTE0.L5.P2    0              0               0              0                 2015-08-14/23:04:14 UTC+08:00  
  CTE0.L5.P3    0              0               0              0                 2015-08-14/23:04:15 UTC+08:00  
  CTE0.L5.P4    0              0               0              0                 2015-08-14/23:04:15 UTC+08:00  
  CTE0.L5.P5    0              0               0              0                 2015-08-14/23:04:16 UTC+08:00  
  CTE0.L5.P6    0              0               0              0                 2015-08-14/23:04:17 UTC+08:00  
  CTE0.L5.P7    0              0               0              0                 2015-08-14/23:04:17 UTC+08:00  
  CTE0.L5.P8    0              0               0              0                 2015-08-14/23:04:18 UTC+08:00  
  CTE0.L5.P9    0              0               0              0                 2015-08-14/23:04:25 UTC+08:00  
  CTE0.L5.P10   0              0               0              0                 2015-08-14/23:04:27 UTC+08:00  
  CTE0.L5.P11   0              0               0              0                 2015-08-14/23:04:28 UTC+08:00  
  DAE000.A.PRI  0              0               0              0                 2015-08-14/23:00:03 UTC+08:00  
  DAE000.A.EXP  0              0               0              0                 2015-08-14/23:00:03 UTC+08:00  
  DAE000.B.PRI  0              0               0              0                 2015-08-14/23:04:29 UTC+08:00  
  DAE000.B.EXP  0              0               0              0                 2015-08-14/23:04:29 UTC+08:00  
  DAE001.A.PRI  2253           2256            3              5                 2015-08-14/23:00:12 UTC+08:00  
  DAE001.A.EXP  0              0               0              0                 2015-08-14/22:59:46 UTC+08:00  
  DAE001.B.PRI  65             66              2              4                 2015-08-14/23:00:01 UTC+08:00  
  DAE001.B.EXP  0              0               0              0                 2015-08-14/23:00:01 UTC+08:00  
  DAE002.A.PRI  35             35              2              5                 2015-08-14/23:00:02 UTC+08:00  
  DAE002.A.EXP  0              0               0              0                 2015-08-14/23:00:02 UTC+08:00  
  DAE002.B.PRI  77             79              2              5                 2015-08-14/22:59:46 UTC+08:00  
  DAE002.B.EXP  0              0               0              0                 2015-08-14/23:00:02 UTC+08:00  
FCoE port:
        
  ID               Error Packets  Lost Packets  Over Flowed Packets  Start Time                     
  ---------------  -------------  ------------  -------------------  -----------------------------  
  CTE0.L3.IOM1.P0  0              0             0                    2015-08-14/23:02:16 UTC+08:00  
  CTE0.L3.IOM1.P1  0              0             0                    2015-08-14/23:02:16 UTC+08:00 

If the bit errors of a front-end host port keep increasing, the front-end host port or the link has a performance problem. In this case, replace the optical fiber or optical module.
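
Because the counters are cumulative since Start Time, a single reading is not conclusive; what matters is whether the counts grow between readings. The following minimal Python sketch compares two hand-transcribed snapshots of show port bit_error counters and reports the ports whose counts increased. The counter values are illustrative.

# Compare two snapshots of cumulative bit-error counters and report
# ports whose counts grew between readings.
def growing_ports(snapshot_t0, snapshot_t1):
    grown = {}
    for port, counters in snapshot_t1.items():
        baseline = snapshot_t0.get(port, {})
        deltas = {name: value - baseline.get(name, 0)
                  for name, value in counters.items()
                  if value > baseline.get(name, 0)}
        if deltas:
            grown[port] = deltas
    return grown

t0 = {"CTE0.R3.IOM0.P1": {"Link Errors Codes": 5}}   # first reading
t1 = {"CTE0.R3.IOM0.P1": {"Link Errors Codes": 42}}  # later reading
print(growing_ports(t0, t1))  # {'CTE0.R3.IOM0.P1': {'Link Errors Codes': 37}}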

Checking Performance Indicators of Front-End Host Ports

Pay attention to the following performance indicators of front-end host ports: average read I/O response time, average write I/O response time, average I/O size, IOPS, and bandwidth.

You can use SystemReporter or run the CLI command to query the performance indicators of front-end host ports.
  • Use SystemReporter to query performance indicators of front-end host ports.

    Operation path: Monitoring > Real-Time Monitoring > Front-End Host Port.



  • Run show performance port on the CLI to query performance indicators of front-end host ports.
    admin:/>show performance port port_id=CTE0.A.IOM0.P0
    0.Max. Bandwidth (MB/s)                        
    1.Usage Ratio (%)                              
    2.Queue Length                                 
    3.Bandwidth (MB/s)                             
    4.Throughput(IOPS) (IO/s)                      
    5.Read Bandwidth (MB/s)                        
    6.Average Read I/O Size (KB)                   
    7.Read Throughput(IOPS) (IO/s)                 
    8.Write Bandwidth (MB/s)                       
    9.Average Write I/O Size (KB)                  
    10.Write Throughput(IOPS) (IO/s)               
    11.Service Time (Excluding Queue Time) (ms)    
    12.Read I/O distribution: 512 B                
    13.Read I/O distribution: 1 KB                 
    14.Read I/O distribution: 2 KB                 
    15.Read I/O distribution: 4 KB                 
    16.Read I/O distribution: 8 KB                 
    17.Read I/O distribution: 16 KB                
    18.Read I/O distribution: 32 KB                
    19.Read I/O distribution: 64 KB                
    20.Read I/O distribution: 128 KB               
    21.Read I/O distribution: 256 KB               
    22.Read I/O distribution: 512 KB               
    23.Write I/O distribution: 512 B               
    24.Write I/O distribution: 1 KB                
    25.Write I/O distribution: 2 KB                
    26.Write I/O distribution: 4 KB                
    27.Write I/O distribution: 8 KB                
    28.Write I/O distribution: 16 KB               
    29.Write I/O distribution: 32 KB               
    30.Write I/O distribution: 64 KB               
    31.Write I/O distribution: 128 KB              
    32.Write I/O distribution: 256 KB              
    33.Write I/O distribution: 512 KB              
    34.Average I/O Latency (ms)                    
    35.Max. I/O Latency (ms)                       
    36.Average Read I/O Latency                    
    37.Average Write I/O Latency                   
    38.Average IO Size                             
    39.Complete SCSI commands per second           
    40.Verify commands per second                  
    41.% Read                                      
    42.% Write                                     
    43.Max IOPS(IO/s)                              

By analyzing the performance indicators of front-end host ports, you can determine possible performance problems.

  • Average read/write latency: When locating a performance problem, first compare the average latency of front-end ports you have queried with the latency observed on the host side. Check whether the values differ greatly, and determine whether the performance problem resides on the storage side. In addition, compare the latency of front-end ports with that of back-end disk domains and disks to determine I/O characteristics. For example, in a read service scenario, if the front-end port latency is much lower than the back-end port latency, sequential read I/Os may be delivered or the hit ratio may be high. If the front-end port latency is close to the back-end port latency, random read I/Os may be delivered.
    NOTE:

    The measured front-end I/O latency does not include the latency caused by the interaction between the host and the storage system and I/O transmission over links.

  • Average I/O size: It indicates the average size of I/Os received by the storage system. If the I/O size is different from that delivered by a host, the host side or the HBA driver has split or combined I/Os.
  • IOPS and bandwidth: Compare the IOPS and bandwidth of all front-end ports to determine the service pressure delivered by each connected host and check whether the front-end pressure is balanced. Also check whether the bandwidth of any front-end port is close to the theoretical maximum bandwidth of its link; if it is, the front-end bandwidth becomes a performance bottleneck. Both checks are automated in the sketch below.
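
A minimal Python sketch of both checks on per-port bandwidth readings. It assumes that an 8 Gbit/s Fibre Channel link delivers roughly 800 MB/s of payload bandwidth (8b/10b encoding); the imbalance and saturation thresholds are likewise illustrative assumptions.

# Check front-end ports for (a) load imbalance and (b) links running
# close to their theoretical maximum bandwidth.
LINK_MAX_MBPS = 800.0    # approx. payload bandwidth of an 8 Gbit/s FC port
SATURATION_RATIO = 0.9   # assumed: flag ports above 90% of link capacity
IMBALANCE_RATIO = 2.0    # assumed: flag if max port load > 2x min port load

def check_ports(bandwidth_by_port):
    loads = list(bandwidth_by_port.values())
    if max(loads) > IMBALANCE_RATIO * max(min(loads), 1e-9):
        print("Front-end pressure is unbalanced across ports.")
    for port, mbps in bandwidth_by_port.items():
        if mbps > SATURATION_RATIO * LINK_MAX_MBPS:
            print(f"{port} is near the link limit ({mbps:.0f} MB/s).")

check_ports({"CTE0.A.IOM0.P0": 760.0, "CTE0.B.IOM0.P0": 250.0})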

4.5.6  Analyzing Cache Performance

A cache is the key module that improves performance and user experience. When analyzing the cache performance, pay attention to the impact of the cache configuration on the read/write performance.

Impact of the Cache Configuration on the Read Performance

In terms of data reading, the cache prefetches data to increase the cache hit ratio and reduce the number of read I/Os delivered to disks, thereby minimizing the latency and providing higher performance.

When reading data from a LUN, a storage system prefetches data from disks to the cache based on the specified policy. A storage system supports four prefetch policies: non-prefetch, intelligent prefetch, constant prefetch, and variable prefetch.
  • Non-prefetch: Data requested by a host is read from disks. This policy is suitable for scenarios where all I/Os are random.
  • Intelligent prefetch: Whether to prefetch data is dynamically determined based on I/O characteristics. If I/Os are sequential, data is prefetched. Otherwise, data is not prefetched. Intelligent prefetch is the default policy and recommended in most scenarios.
  • Constant prefetch: The size of data prefetched each time is a predefined fixed value. This policy is suitable for scenarios where multiple channels of sequential I/Os are delivered and the I/O size is fixed. This policy can be used in media & entertainment (M&E) and video surveillance scenarios.
  • Variable prefetch: The size of data prefetched each time is a multiple of the I/O size. (The multiple is user-defined, ranging from 0 to 1024.) This policy is suitable for scenarios where multiple channels of sequential I/Os are delivered and the I/Os vary in size. This policy can be used in M&E and video surveillance scenarios.
If a prefetch policy is set inappropriately, excessive or insufficient prefetch may occur.
  • Excessive prefetch indicates that the amount of prefetched data is much larger than the amount of data that is actually read.

    For example, if the constant prefetch policy is set in a scenario where all I/Os are random, excessive prefetch definitely occurs. Excessive prefetch will cause poor read performance, fluctuations in the read bandwidth, or the failure to reach the maximum read bandwidth of disks. To determine whether excessive prefetch occurs, you can use SystemReporter to compare the read bandwidth of a storage pool with that of the disk domain where the storage pool resides. Excessive prefetch occurs if the read bandwidth of a storage pool is much lower than that of the disk domain where the storage pool resides. If excessive prefetch occurs, change the prefetch policy to non-prefetch.

  • Insufficient prefetch indicates that the amount of prefetched data is insufficient for sequential I/Os. As a result, all I/Os must be delivered to disks to read data, and no data is hit in the cache.

    For example, if the non-prefetch policy is set in the event of small sequential I/Os (database logs), insufficient prefetch definitely occurs. Insufficient prefetch leads to a relatively long I/O latency. To determine whether insufficient prefetch occurs, you can use SystemReporter to query the read cache hit ratio of a LUN. If the read cache hit ratio of a database log LUN is low, change the prefetch policy to intelligent prefetch or constant prefetch.
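
Both symptoms reduce to simple ratio checks on SystemReporter readings: pool read bandwidth versus disk-domain read bandwidth for excessive prefetch, and the LUN read cache hit ratio for insufficient prefetch. A minimal Python sketch follows; the 0.5 and 30% thresholds are illustrative assumptions, not product values.

# Heuristics from the text: excessive prefetch shows up as pool read
# bandwidth far below disk-domain read bandwidth; insufficient prefetch
# shows up as a low read cache hit ratio on a sequentially read LUN.
def diagnose_prefetch(pool_read_mbps, domain_read_mbps, lun_hit_ratio_pct,
                      workload_is_sequential):
    if domain_read_mbps > 0 and pool_read_mbps / domain_read_mbps < 0.5:
        return "Suspected excessive prefetch: consider non-prefetch."
    if workload_is_sequential and lun_hit_ratio_pct < 30:
        return "Suspected insufficient prefetch: consider intelligent or constant prefetch."
    return "No prefetch anomaly detected."

print(diagnose_prefetch(120.0, 450.0, 75.0, False))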

Impact of the Cache Configuration on the Write Performance

The write-back and write-through cache policies as well as the high and low watermarks affect the write performance of a storage system.

Cache Write Policy

A cache write policy can be classified into write-back and write-through.

  • Write-back: A write success message is returned to the host as soon as each write I/O arrives at the cache. Then, the cache sorts and combines the data and writes it to disks.
  • Write-through: Each write request is directly sent to its destination disk. The write performance is subject to the disk performance, especially the key disk parameters such as the disk type, rotational speed, and seek time. The reliability of write-through is higher than that of write-back.
NOTE:

You can set the cache write policy when you are creating a LUN by running create lun on the CLI. If you use DeviceManager to create a LUN, the default cache write policy is write-back and the policy cannot be changed.

The default cache write policy is write-back.

  • If HDDs are used, the write-back policy provides a higher performance level than the write-through policy because writing data to the cache is much faster than writing data to HDDs.
  • If SSDs are used, whether write-back or write-through provides a higher performance level is determined by the algorithm in use.

The following issues may trigger write-through: A BBU fails; only one controller is working; the temperature is high; the number of LUN fault pages exceeds the threshold.

You can use DeviceManager or run show lun general on the CLI to query the properties of a LUN, where the cache write policy is displayed.

admin:/>show lun general lun_id=0

  ID                              : 0                               
  Name                            : 000_Report_LUN001               
  Pool ID                         : 0                               
  Capacity                        : 5.000GB                         
  Subscribed Capacity             : 5.187GB                         
  Protection Capacity             : 0.000B                          
  Sector Size                     : 512.000B                        
  Health Status                   : Normal                          
  Running Status                  : Online                          
  Type                            : Thick                           
  IO Priority                     : Low                             
  WWN                             : 6111545100326912061d0d4500000000 
  Exposed To Initiator            : No                              
  Data Distributing               : 97,3,0                          
  Write Policy                    : Write Back                      
  Running Write Policy            : Write Through                   
  Prefetch Policy                 : Intelligent                     
  Read Cache Policy               : Default                         
  Write Cache Policy              : Default                         
  Cache Partition ID              : --                              
  Prefetch Value                  : --                              
  Owner Controller                : 0A                              
  Work Controller                 : 0A                              
  Snapshot ID(s)                  : --                              
  LUN Copy ID(s)                  : --                              
  Remote Replication ID(s)        : --                              
  Split Clone ID(s)               : 0                               
  Relocation Policy               : Highest Available               
  Initial Distribute Policy       : Automatic                       
  SmartQoS Policy ID              : 0                               
  Protection Duration(days)       : 0                               
  Has Protected For(h)            : 0                               
  Estimated Data To Move To Tier0 : 0.000B                          
  Estimated Data To Move To Tier1 : 0.000B                          
  Estimated Data To Move To Tier2 : 0.000B                          
  Is Add To Lun Group             : No                              
  Smart Cache Partition ID        : --                              
  DIF Switch                      : No                              
  Remote LUN WWN                  : --                              
  Disk Location                   : Internal                        
  LUN Migration                   : --                              
  Progress(%)                     : --                              
  Smart Cache Cached Size         : 0.000B                          
  Smart Cache Hit Rage(%)         : 0                               
  Mirror Type                     : --                              
  Thresholds Percent(%)           : --                              
  Thresholds Switch               : -- 
  • Write Policy indicates the configured cache write policy.
  • Running Write Policy indicates the cache write policy currently in use.

If write-through is triggered, find out the cause. Possible causes are as follows: A BBU fails; only one controller is working; the temperature is high; the number of LUN fault pages exceeds the threshold.

Cache High and Low Watermarks

The high or low watermark of a cache indicates the maximum or minimum amount of dirty data that can be stored in the cache. An inappropriate high or low watermark of a cache can cause the write performance to deteriorate.

When the amount of dirty data in the cache reaches the upper limit, the dirty data is synchronized to disks at a high speed. When the amount of dirty data in the cache is between the upper and lower limits, the dirty data is synchronized to disks at a medium speed. When the amount of dirty data in the cache is below the lower limit, the dirty data is synchronized to disks at a low speed.
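
This behavior can be modeled as a three-band policy keyed to the watermarks. A minimal Python sketch using the recommended 80%/20% watermark values; the band mapping is a simplified model of the synchronization speeds described above.

# Map the cache's dirty-data percentage to a destage (flush) speed,
# per the three bands described above.
HIGH_WATERMARK = 80  # percent of cache occupied by dirty data
LOW_WATERMARK = 20

def destage_speed(dirty_pct):
    if dirty_pct >= HIGH_WATERMARK:
        return "high"
    if dirty_pct > LOW_WATERMARK:
        return "medium"
    return "low"

for pct in (10, 50, 90):
    print(pct, "->", destage_speed(pct))  # low, medium, high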

It is recommended that you set the high and low watermarks as follows:
  • Do not set the high watermark too high. If it is too high, only a small amount of cache space remains available for absorbing new writes. When the front-end I/O traffic surges, the I/O performance becomes unstable and the latency is prolonged, adversely affecting the write performance.
  • Do not set the low watermark too low. If it is too low, cached data is frequently written to disks, reducing the write performance.
  • Keep a sufficient difference between the high and low watermarks. If the difference is too small, the back-end bandwidth cannot be fully utilized.
  • The recommended high and low watermarks for a cache are 80% and 20%, respectively.

Check whether the high and low watermarks meet the recommendations. If they do not, change the watermark values and check whether the performance improves.

To set high and low watermarks, run change system cache on the CLI. For details about the command usage, see the Restricted Command Reference of the corresponding version.
NOTICE:

Modifying the high or low watermark of a cache affects the frequency and size of writing data from the cache to disks. Do not modify the high or low watermark unless absolutely necessary.

4.5.7  Analyzing the LUN Performance

Analyzing the performance of LUNs, learning about LUN types, and understanding the performance impact of local access to LUNs help discover possible performance bottlenecks in a storage system.

Performance Differences Between a Thin LUN and a Thick LUN

After you enable the SmartThin function on a storage system, the storage system creates a thin LUN without allocating all the specified capacity to that LUN. Instead, the storage system dynamically allocates storage resources based on the capacity actually used by a host. A thin LUN and a thick LUN differ in read and write performance.

Write Performance
  • First write: A thick LUN is formatted immediately after it is created. Therefore, only host I/Os need to be written. When new data is written to a thin LUN, space is allocated at the same time, which generates a large number of metadata read and write I/Os. In addition, the write process is longer, and disks face extra pressure. In the first write scenario, the performance of a thick LUN is higher than that of a thin LUN.
  • Overwrite: In the overwrite scenario, both a thin LUN and a thick LUN have already been allocated spaces. Therefore, no extra overhead is generated, and the two types of LUNs provide similar performance.
Read Performance
  • Sequential read: A thin LUN provides storage resources on demand. Space allocation is not consecutive in time, so the spaces mapped to a disk may not be consecutive. When a thick LUN is created, the storage system allocates all the required storage resources at a time based on the automatic resource configuration technology, ensuring that the spaces mapped to a disk are consecutive. Sequential reads of consecutive space are more efficient on HDDs. Therefore, a thick LUN provides higher performance than a thin LUN in a sequential read scenario.
  • Random read: In a random read scenario, the access addresses are not consecutive. Therefore, the two types of LUNs provide similar performance.
You can use DeviceManager or run the CLI command to query the LUN type.
  • In DeviceManager, go to the LUN property page to view the LUN type.

  • On the CLI, run show lun general to query the LUN type.
    admin:/>show lun general lun_id=0
    
      ID                              : 0
      Name                            : LUN000
      Pool ID                         : 0     
      Capacity                        : 100.000GB
      Subscribed Capacity             : 100.187GB
      Protection Capacity             : 0.000B
      Sector Size                     : 512.000B
      Health Status                   : Normal
      Running Status                  : Online
      Type                            : Thick  
      IO Priority                     : Low   
      WWN                             : 63400a3100d716bf01574c9c00000000
      Exposed To Initiator            : No    
      Data Distributing               : 0,0,100
      Write Policy                    : Write Back
      Running Write Policy            : Write Back
      Prefetch Policy                 : Intelligent
      Read Cache Policy               : Default
      Write Cache Policy              : Default
      Cache Partition ID              : --    
      Prefetch Value                  : --    
      Owner Controller                : 0A    
      Work Controller                 : 0A    
      Snapshot ID(s)                  : --    
      LUN Copy ID(s)                  : --    
      Remote Replication ID(s)        : --    
      Split Clone ID(s)               : --    
      Relocation Policy               : None  
      Initial Distribute Policy       : Automatic
      SmartQoS Policy ID              : --    
      Protection Duration(days)       : 0     
      Has Protected For(h)            : 0     
      Estimated Data To Move To Tier0 : 0.000B
      Estimated Data To Move To Tier1 : 0.000B
      Estimated Data To Move To Tier2 : 0.000B
      Is Add To Lun Group             : No    
      Smart Cache Partition ID        : --    
      DIF Switch                      : No    
      Remote LUN WWN                  : --    
      Disk Location                   : Internal
      LUN Migration                   : --    
      Progress(%)                     : --    
      Smart Cache Cached Size         : 0.000B
      Smart Cache Hit Rage(%)         : 0     
      Mirror Type                     : --    
      Thresholds Percent(%)           : --    
      Thresholds Switch               : -- 

Local Access to LUNs

Ensure that LUNs are accessed locally, that is, by their owning controllers, to protect system performance.

Local Access
Local access to a LUN means that I/Os destined for a LUN are directly delivered to the owning controller of that LUN. As shown in Figure 4-7, a host is physically connected to controller A, the owning controller of LUN 1 is controller A, and that of LUN 2 is controller B.
  • When the host attempts to access LUN 1, controller A directly delivers the access requests to LUN 1. Such a LUN access mode is called local access.
  • When the host attempts to access LUN 2, the access requests are first delivered to controller A. Then, controller A forwards them to controller B through the mirror channel between controllers A and B. Finally, controller B delivers the access requests to LUN 2. Such a LUN access mode is called peer access.
Figure 4-7  Network diagram

Peer access traverses the mirror channel between controllers, and the channel's limitations affect LUN read/write performance. To prevent peer access, ensure that a host has a physical connection to each of controllers A and B. If a host is physically connected to only one controller, set the owning controller of the LUN to the controller connected to the host.

Ping-Pong Effect
In a clustered multipathing network environment, UltraPath is able to automatically switch over the working controller of a LUN. When two application servers attempt to access the same LUN whose owning controller is controller A:
  1. If a link connected to application server 1 fails as shown in Figure 4-8, the UltraPath running on application server 1 switches the working controller of the LUN to controller B.
  2. The two links connected to application server 2 work properly. In this case, the working controller and owning controller of the LUN are the same. Therefore, the UltraPath running on application server 2 attempts to switch the working controller of the LUN back to controller A. As a result, the two application servers keep switching the working controller of the LUN.
Such repeated switchovers of the LUN's working controller are called the ping-pong effect. This effect reduces LUN access performance and makes I/O timeouts likely to occur on application servers.
Figure 4-8  Schematic diagram of the ping-pong effect
If the ping-pong effect occurs in a storage system, take the following measures:
  1. Disable the automatic LUN switchover function of UltraPath. For details, see the UltraPath User Guide of the corresponding version.
  2. Recover the interrupted link as soon as possible. Ensure that the link between each node and each storage controller works properly.

4.5.8  Analyzing the RAID Performance

Analyzing the RAID performance and understanding the performance impacts of RAID levels and stripe depths help discover possible performance bottlenecks in a storage system.

Impacts of RAID Levels on Performance

RAID is an algorithm that combines disks into a group and implements striping. RAID enables multiple disks to work efficiently at the same time, improving the I/O processing capability of a storage system and enhancing data protection.

Different RAID levels provide different read/write performance levels in different I/O models.
  • Random read: In this I/O model, all RAID levels provide similar performance.
  • Random write: In this I/O model, the parity check and mirroring implemented by the RAID algorithm lead to an extra I/O overhead called the write penalty. A larger write penalty results in lower random write performance. Therefore, the random write performance of RAID 10 is higher than that of RAID 5, and that of RAID 5 is higher than that of RAID 6 (the per-level penalties are quantified in the sketch after this list).
  • Sequential read: In the RAID 2.0+ architecture, no independent parity disk is used. All disks in a disk domain can provide read performance. Therefore, RAID 5 and RAID 6 provide high sequential read performance. RAID 10 implements mirroring and therefore provides lower performance than RAID 5 and RAID 6.
  • Sequential write: In this I/O model, write I/Os can be delivered to a disk based on a full stripe. The extra write overhead lies in writing parity bits (RAID 5 and RAID 6) and mirroring (RAID 10). Therefore, in a sequential write scenario, a larger percentage of parity bits to the total bits results in lower performance.
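
The write penalties above can be made concrete with the classic per-level factors: each random host write costs about 2 disk I/Os under RAID 10, 4 under RAID 5, and 6 under RAID 6. The following minimal Python sketch estimates the random-I/O capability of a disk group under this simplified model; it ignores cache effects and RAID 2.0+ wide striping, and the per-disk IOPS figure is an assumption.

# Estimate effective host IOPS for a group of disks under random I/O,
# using the classic write-penalty factors.
WRITE_PENALTY = {"RAID10": 2, "RAID5": 4, "RAID6": 6}

def effective_iops(disk_count, iops_per_disk, raid_level, write_fraction):
    raw = disk_count * iops_per_disk
    penalty = WRITE_PENALTY[raid_level]
    # Each host read costs 1 disk I/O; each host write costs 'penalty' I/Os.
    return raw / ((1 - write_fraction) + write_fraction * penalty)

# 24 x 15k rpm SAS disks (~180 IOPS each, assumed), 30% writes:
for level in ("RAID10", "RAID5", "RAID6"):
    print(level, round(effective_iops(24, 180, level, 0.3)))
# RAID10 3323, RAID5 2274, RAID6 1728 -- matching the ordering above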

In the storage planning phase, select the most suitable RAID level based on site requirements, with performance, space efficiency, and reliability taken into consideration.

Impacts of Stripe Depths on Performance

Stripe depths determine the sizes of I/Os written to disks after host I/Os are processed based on the RAID algorithm. Therefore, the impacts of stripe depths on performance vary depending on I/O characteristics.

  • Random small I/Os: Such I/Os are typically smaller than 16 KB. Random I/Os cannot be combined in the cache, so they are typically delivered to disks at their original sizes. If the stripe depth (128 KB by default) is many times the I/O size, there is a low probability that a small I/O crosses stripes, that is, an I/O is unlikely to be split (the split probability is estimated in the sketch after this list). The random small I/O scenario is therefore only slightly affected by the stripe depth. It is recommended that you retain the default stripe depth, namely, 128 KB.
  • Sequential small I/Os: Multiple sequential small I/Os in the cache are combined into a large I/O. Ideally, small I/Os are combined into an I/O whose size is equal to the stripe depth and then delivered to a disk. Therefore, a large stripe depth helps reduce the number of I/Os written to disks, improving the data write efficiency. It is recommended that you set the stripe depth to 128 KB or larger.
  • Sequential or random large I/Os: Such an I/O is 256 KB or larger. If the stripe depth is smaller than the I/O size, I/Os are split on the RAID layer, affecting the data write efficiency. Therefore, it is recommended that you select the maximum stripe depth, namely, 512 KB.
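
For random I/Os, the probability of a split can be approximated from the I/O size and the stripe depth. A minimal Python sketch, assuming random start offsets aligned to 512-byte sectors (a simplified model, not the product's allocation logic):

# Approximate the probability that a random I/O crosses a stripe
# boundary, assuming start offsets are uniformly distributed on
# 'align_kb' boundaries within a stripe.
def split_probability(io_kb, stripe_kb, align_kb=0.5):
    positions = stripe_kb / align_kb                 # possible start slots
    crossing = max(io_kb - align_kb, 0) / align_kb   # slots that cross a boundary
    return min(crossing / positions, 1.0)

print(split_probability(8, 128))    # ~0.06: small I/Os rarely split
print(split_probability(256, 128))  # 1.0: large I/Os always split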

You can specify the stripe depth only when creating a storage pool in developer mode. Once the pool is created, the stripe depth cannot be changed. The default stripe depth is 128 KB.

developer:/>create storage_pool name=StoragePool disk_type=SAS capacity=100GB stripe_depth=256KB
Command executed successfully.

In developer mode, you can run show storage_pool tier to query the stripe depth of a storage pool.

developer:/>show storage_pool tier pool_id=1

  Name                     : Capacity     
  Pool ID                  : 1            
  Health Status            : Normal       
  Running Status           : Online       
  Capacity                 : 1.000TB      
  Allocated Capacity       : 2.937GB      
  Free Capacity            : 1021.062GB   
  RAID Level               : RAID6        
  RAID Disk Number         : 6            
  Stripe Depth             : 64.000KB     
  Estimated Move-up Data   : 0.000B       
  Estimated Move-down Data : 0.000B

4.5.9  Analyzing the Performance of Back-End Ports and Disks

Analyzing the performance of back-end ports and disks and understanding the impacts of back-end ports and disks on the storage performance help discover possible performance bottlenecks in a storage system.

Analyzing the Performance of Back-End Ports

Back-end ports refer to the SAS ports that connect a controller to disk enclosures and provide channels for reading data from and writing data to disks. Back-end SAS ports affect performance, and the impact typically manifests at the disk enclosure loop level. The OceanStor V3 series storage systems support 12 Gbit/s SAS ports.

The bandwidth of a single SAS port is limited. Therefore, ensure that the bandwidth supported by the SAS ports in a loop is higher than the total bandwidth of all disks in the disk enclosures that compose the loop (see the sketch after this list). In addition, as the number of disk enclosures in a loop increases, the latency introduced by expansion links grows, which affects the back-end I/O latency and thereby the IOPS. Considering the preceding factors, when there are sufficient SAS ports, the following measures are recommended:
  • Distribute disk enclosures to multiple loops.
  • If a single controller has multiple back-end interface modules, distribute loops to multiple modules instead of using the SAS ports on one module.
  • Form a loop using fewer than five disk enclosures.
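
A minimal Python sketch of the loop-bandwidth check described before this list. It assumes a 4-lane 12 Gbit/s SAS wide port delivering roughly 4 x 1200 MB/s = 4800 MB/s of payload bandwidth (8b/10b encoding); the per-disk bandwidth figure is an assumption.

# Check that a SAS loop's port bandwidth exceeds the aggregate disk
# bandwidth behind it.
SAS_LANE_MBPS = 1200   # approx. payload bandwidth per 12 Gbit/s SAS lane
LANES_PER_PORT = 4     # assumed wide-port width

def loop_is_bottlenecked(disks_in_loop, mbps_per_disk):
    port_mbps = SAS_LANE_MBPS * LANES_PER_PORT  # ~4800 MB/s
    return disks_in_loop * mbps_per_disk > port_mbps

# Two 25-disk enclosures of SAS HDDs at ~200 MB/s sequential each (assumed):
print(loop_is_bottlenecked(50, 200))  # True: ~10000 MB/s behind a ~4800 MB/s port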

You can use DeviceManager or the CLI to query the disk enclosure IDs and then determine loop connections. The ID of a disk enclosure is in the format of DAEabc (a, b, and c are integers), where a indicates the engine ID, b indicates the loop ID, and c indicates the ID of an enclosure in the loop. For example, if the disk enclosure IDs are DAE000, DAE010, DAE020, DAE021, DAE030, and DAE031, there are two 2-enclosure loops (DAE020 and DAE021 compose one loop; DAE030 and DAE031 compose another) and two single-enclosure loops (DAE000; DAE010). These loop connections comply with the preceding performance rules.
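
Grouping enclosure IDs into loops follows mechanically from the DAEabc format. A minimal Python sketch using the IDs from the example above:

# Group disk enclosure IDs into loops based on the DAEabc format:
# a = engine ID, b = loop ID, c = enclosure ID within the loop.
from collections import defaultdict

def group_by_loop(enclosure_ids):
    loops = defaultdict(list)
    for dae in enclosure_ids:
        engine, loop = dae[3], dae[4]  # the 'a' and 'b' digits of DAEabc
        loops[(engine, loop)].append(dae)
    return dict(loops)

ids = ["DAE000", "DAE010", "DAE020", "DAE021", "DAE030", "DAE031"]
print(group_by_loop(ids))
# {('0', '0'): ['DAE000'], ('0', '1'): ['DAE010'],
#  ('0', '2'): ['DAE020', 'DAE021'], ('0', '3'): ['DAE030', 'DAE031']}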

SSDs and HDDs differ greatly in performance. In a loop, the high latency of HDDs may keep SAS ports occupied and prevent SSD requests from being scheduled in time. Therefore, if there are sufficient SAS ports, do not cascade SSDs and HDDs in the same loop.

Rules of Selecting Disks for a Disk Domain

The performance of a storage system varies depending on the disks selected for a disk domain.

To ensure high performance, observe the following rules of selecting disks for a disk domain:

  • Prevent dual-controller access in a bandwidth-sensitive scenario.

    For the purpose of high reliability, each disk enclosure loop in a standard environment is connected to a SAS port on each of controllers A and B in the same engine. That is, controllers A and B can access the disks in the loop at the same time. Dual-controller access indicates that controllers A and B deliver I/Os to the disks in a disk domain at the same time; single-controller access indicates that only one controller delivers I/Os to those disks. In services that involve sequential I/Os, dual-controller access disturbs the sequence of I/Os delivered to disks, so its bandwidth performance is lower than that of single-controller access. In scenarios that involve sequential I/Os and require a high bandwidth, such as the M&E industry, adopt single-controller access, that is, set the owning controller of each LUN in the storage pool corresponding to the disk domain to the same controller. In scenarios that involve random I/Os, you can adopt dual-controller access.

  • Do not select disks across engines.

    You are allowed to select disks across engines for a disk domain. That is, when a storage system is equipped with multiple engines, you can select disks from disk enclosures connected to multiple engines for a disk domain. If disks are selected across engines, I/Os are forwarded through switching channels between engines, affecting the latency and bandwidth performance. Therefore, when disks are sufficient, do not select disks across engines for a disk domain. In addition, it is recommended that you select disks from the same disk enclosure for a disk domain.

  • Prevent intermixing of different types of disks.

    Disks of different rotational speeds and capacities vary in I/O processing latency and bandwidth. Therefore, in a RAID environment, disks that provide low performance may limit the performance of a stripe group. In addition, fast and slow disks may coexist, and disks may be used unevenly. If disks are sufficient, select disks of the same rotational speed and capacity for a disk domain. Avoid intermixing of different types of disks.

    You can use DeviceManager to query whether disks in the same disk domain are of the same type and capacity.



Analyzing the Disk Performance

Common storage media include SSDs, SAS disks, and NL-SAS disks. Different types of storage media vary greatly in storage cost and performance. Therefore, when making a storage plan, determine disk types based on the service pressure and I/O characteristics.

The difference in the performance levels provided by different tiers consisting of different types of disks is the basis of the tiered storage technology. Therefore, before configuring storage services, learn about the performance characteristics of different types of disks.

  • SSDs

    SSDs have none of the seek and rotational latency inherent in HDDs. SSDs greatly outperform HDDs in I/O models that access hotspot data and are sensitive to response latency, especially in random small-I/O read models of database applications. In bandwidth-sensitive applications, SSDs slightly outperform HDDs. In the tiered storage technology, SSDs compose the high-performance tier that addresses a high IOPS pressure. The performance of SSDs is determined by the type of flash chips: SLC outperforms eMLC, and eMLC outperforms MLC.

  • SAS disks

    SAS disks store data on platters that are spinning at a high speed. SAS disks provide relatively high performance, capacity, and reliability. Typically, the rotational speeds of SAS disks are 10k rpm and 15k rpm. In the tiered storage technology, SAS disks are used to compose the performance tier that provides high performance, including a stable latency, high IOPS, and high bandwidth. In addition, the prices of SAS disks are moderate.

  • NL-SAS disks

    The rotational speed of an NL-SAS disk is typically 7.2k rpm, lower than that of a SAS disk. NL-SAS disks provide the largest capacity but the lowest performance. Therefore, they are used to compose the capacity tier in the tiered storage technology. According to statistics, 60% to 80% of the capacity consumed by most applications addresses a light load. Inexpensive large-capacity NL-SAS disks can be used to meet such capacity needs. In addition, NL-SAS disks reduce the power consumption by 96% per TB compared with SAS disks. The performance of an HDD is determined by the rotational speed. 15k rpm outperforms 10k rpm, and 10k rpm outperforms 7.2k rpm.

If a performance problem occurs and the front end of the storage system is normal, check whether disks have reached the performance threshold. If disks have reached the performance threshold (the disk usage is almost 100%), the storage performance is restricted by the back-end disk performance, and the IOPS and bandwidth cannot increase. You can use SystemReporter to query the disk usage.

To ensure disk reliability and prolong disk service life, it is recommended that you maintain the disk usage under 70%. If the usage of most disks in a disk domain is over 90%, it is recommended that you add disks to the disk domain or migrate services to disks that provide higher performance.
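
These thresholds translate directly into a check on SystemReporter readings. A minimal Python sketch; the per-disk usage values and the "most disks" cutoff (more than half) are illustrative assumptions.

# Apply the usage thresholds from the text: warn when a disk exceeds
# 70% usage, and recommend expansion when most disks in the domain
# exceed 90%.
def assess_disk_domain(usage_by_disk):
    busy = [d for d, u in usage_by_disk.items() if u > 70]
    overloaded = [d for d, u in usage_by_disk.items() if u > 90]
    if len(overloaded) > len(usage_by_disk) / 2:
        return "Most disks above 90%: add disks or migrate services."
    if busy:
        return f"Disks above 70% usage: {', '.join(busy)}"
    return "Disk usage is within the recommended range."

print(assess_disk_domain({"DAE000.0": 95, "DAE000.1": 93, "DAE000.2": 88}))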