Performance issue of VMware on OceanStor S5500T V2

Publication Date: 2016-03-26
Issue Description

1. Fault symptom: Many of the VMs came to a complete standstill. About two hours later, they recovered after some of the VMs were restarted.

2. Storage version: OceanStor S5500T V2 V200R002C00

3. Storage configuration: One disk domain with 8 SSD disks, 24 SAS disks, and 64 NL_SAS disks;

One storage pool with RAID5-5 in tier0, RAID5-9 in tier1 and RAID6-10 in tier2;

17 LUNs are mapped to 8 ESXi cluster servers;

3 LUNs have a SmartQoS policy that protects their IO latency to below 15 ms; their IO priority is high;

2 LUNs have middle IO priority;

The other LUNs have low IO priority, including the 4 LUNs whose bandwidth is limited.

The detailed SmartQoS policies are shown below:

   ID:  1

    Name:Qos_limit_25MByte_APSIP_WH

    Description:

    Health Status:Normal

    Running Status:Running

    Enable Status:Enable

    Type:ReadWrite

    Performance Info:BandWidth

    Max IOPS:0

    Min IOPS:0

    Max BandWidth:25

    Min BandWidth:0

    Latency:0

    Priority:Normal

    Type:Control Type

    Schedule Policy:Weekly

    Schedule Days:Sunday,Monday,Tuesday,Wednesday,Thursday

    Schedule Start Date:2015-11-19

    Schedule Start Time:07:00

    Schedule Duration Time:12:0:0

    LUN list:13

 

    ID:  3

    Name:QoS_limit_rw60MByte_working_hou

    Description:

    Health Status:Normal

    Running Status:Running

    Enable Status:Enable

    Type:ReadWrite

    Performance Info:BandWidth

    Max IOPS:0

    Min IOPS:0

    Max BandWidth:60

    Min BandWidth:0

    Latency:0

    Priority:Normal

    Type:Control Type

    Schedule Policy:Weekly

    Schedule Days:Sunday,Monday,Tuesday,Wednesday,Thursday

    Schedule Start Date:2015-6-16

    Schedule Start Time:06:00

    Schedule Duration Time:13:0:0

    LUN list:8,11

 

    ID:  4

    Name:Qos_limit_rw_400MByte_working_h

    Description:

    Health Status:Normal

    Running Status:Running

    Enable Status:Enable

    Type:ReadWrite

    Performance Info:BandWidth

    Max IOPS:0

    Min IOPS:0

    Max BandWidth:400

    Min BandWidth:0

    Latency:0

    Priority:Normal

    Type:Control Type

    Schedule Policy:Weekly

    Schedule Days:Sunday,Monday,Tuesday,Wednesday,Thursday,Friday,Saturday

    Schedule Start Date:2015-6-16

    Schedule Start Time:06:00

    Schedule Duration Time:13:0:0

    LUN list:21

 

    ID:  7

    Name:QoS_protect_latency_15_VerwUser

    Description:LUN wszvcl01_User

    Health Status:Normal

    Running Status:Running

    Enable Status:Enable

    Type:ReadWrite

    Performance Info:Latency

    Max IOPS:0

    Min IOPS:0

    Max BandWidth:0

    Min BandWidth:0

    Latency:15

    Priority:Normal

    Type:Trigger Type

    Schedule Policy:Weekly

    Schedule Days:Sunday,Monday,Tuesday,Wednesday,Thursday,Friday,Saturday

    Schedule Start Date:2015-8-26

    Schedule Start Time:07:00

    Schedule Duration Time:12:0:0

    LUN list:3

 

    ID:  8

    Name:QoS_protect_latency_15_VerwDate

    Description:LUN_wszvcl01_Daten

    Health Status:Normal

    Running Status:Running

    Enable Status:Enable

    Type:ReadWrite

    Performance Info:Latency

    Max IOPS:0

    Min IOPS:0

    Max BandWidth:0

    Min BandWidth:0

    Latency:15

    Priority:Normal

    Type:Trigger Type

    Schedule Policy:Weekly

    Schedule Days:Sunday,Monday,Tuesday,Wednesday,Thursday,Friday,Saturday

    Schedule Start Date:2015-8-26

    Schedule Start Time:07:00

    Schedule Duration Time:12:0:0

    LUN list:4

 

    ID:  9

    Name:QoS_protect_latency_15_VM_01

    Description:LUN_VMware_Cluster_01_tiered

    Health Status:Normal

    Running Status:Running

    Enable Status:Enable

    Type:ReadWrite

    Performance Info:Latency

    Max IOPS:0

    Min IOPS:0

    Max BandWidth:0

    Min BandWidth:0

    Latency:15

    Priority:Normal

    Type:Trigger Type

    Schedule Policy:Weekly

    Schedule Days:Sunday,Monday,Tuesday,Wednesday,Thursday,Friday,Saturday

    Schedule Start Date:2015-8-26

    Schedule Start Time:00:00

    Schedule Duration Time:24:0:0

    LUN list:10

4. Networking topology: Half of the servers are connected through 8 Gbit/s Fibre Channel (two Brocade 6150 switches) and the others through 10 Gbit/s iSCSI (two Cisco Catalyst 4500 switches).

5. IO model: 50% read, nearly all random

Alarm Information

Null

Handling Process

1. Check the alarms and historical events of the storage. The file path is "[Master controller IP]_MAIN/Event/Event.txt" in the logs of the master controller. No abnormal events were found, so the storage seemed to be running normally when the issue happened.
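For reference, the exported event file can be pre-filtered before reading it by hand. The sketch below is a minimal Python example under assumptions that are not part of this case: it treats Event.txt as plain text with one event per line, each line carrying a timestamp and a severity word; the file name, time window, and keywords are placeholders to adapt to the real export format.

# Minimal sketch: pre-filter the exported event log around the incident window.
# Assumptions (not taken from this case): Event.txt is plain text, one event per
# line, each line containing a timestamp like "2015-11-19 15:30" and a severity
# word such as "Warning" or "Major". Adjust to the real export format.
import re

EVENT_FILE = "Event.txt"                                # local copy of the export
WINDOW = ("2015-11-19 14:00", "2015-11-19 18:00")       # example incident window
SEVERITIES = ("Warning", "Major", "Critical")

ts_pattern = re.compile(r"\d{4}-\d{2}-\d{2} \d{2}:\d{2}")

with open(EVENT_FILE, encoding="utf-8", errors="ignore") as f:
    for line in f:
        match = ts_pattern.search(line)
        if not match:
            continue
        ts = match.group(0)
        if WINDOW[0] <= ts <= WINDOW[1] and any(s in line for s in SEVERITIES):
            print(line.rstrip())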

2. Analyze the performance statistics logs with the "Performance History Monitor" tool in the OceanStor Toolkit.

Since the issue may have been caused by high IO latency, we analyzed the LUN IO latency first. When the issue occurred, the performance statistics grouped by LUN priority were as shown below. The read bandwidth of the high priority LUNs reached 220 MB/s, while that of the other LUNs was very small.

3. As shown in the LUN_PRIORITY_STATISTIC picture below, we found something very strange. At about 15:30, the IO latency of the high priority LUNs rose above 15 ms. After about half an hour, the latency of the low priority LUNs quickly reached 150 ms. Another 30 minutes later, the latency of the middle priority LUNs quickly reached 100 ms. After the IO latency of the high priority LUNs dropped below 15 ms at about 17:25, the IO latency of the middle and low priority LUNs returned to normal. We are confident that the poorly performing VMs were running on these middle or low priority LUNs.

Correlating this with the SmartQoS policy, we can confirm that the high latency of the low and middle priority LUNs was caused by SmartQoS flow control.
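The same pattern can also be checked programmatically once the statistics are exported. The following Python sketch is only an illustration under assumed file and column names (a CSV with a timestamp plus the average latency of each priority class); the real export format of the Performance History Monitor tool may differ.

# Sketch: flag intervals where the high priority LUNs miss the 15 ms QoS target
# while lower priority latency is inflated, which is the signature of SmartQoS
# flow control. The CSV file name and column names are assumptions for
# illustration; the real export format of the tool may differ.
import csv

QOS_TARGET_MS = 15.0

with open("lun_priority_latency.csv", newline="") as f:
    for row in csv.DictReader(f):
        high = float(row["high_latency_ms"])
        middle = float(row["middle_latency_ms"])
        low = float(row["low_latency_ms"])
        if high > QOS_TARGET_MS and (middle > 50.0 or low > 50.0):
            print(f"{row['timestamp']}: high={high:.1f} ms, "
                  f"middle={middle:.1f} ms, low={low:.1f} ms -> flow control likely")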

4. Why was the IO latency of the high priority LUNs over 15 ms? We analyzed the data distribution first. As shown below, most of the data in these LUNs was located in tier2 (NL_SAS).

        Lun Id: 3
        Lun Name: LUN_wszvcl01_User          
        Userpool Id: 0
        Type: Thin
        User Lun Capacity: 5368709120(KB)
        Allocated Capacity: 3184558080(KB)
        Sector Size: 512(B)
        Health status: Normal
        Running status: Online
        WWN: 6f84abf1005e2fd4d0cb0cb800000003
        Owning Controller: 0A
        Work Controller: 0A
        Home Pair Id: 0
        SmartQoS Id: 7
        Cache Partition Id: 0
        Snapshot IDs:
        LUN Copy IDs:
        Remote Replication IDs:
        Split Mirror IDs:
        Write Policy: Write Back
        Running Cache Write Policy: Write Back
        Lun Prefetch Type: Intelligent
        Lun Prefetch Value: 0(KB)
        Initial Distribute Policy: Automatic
        Relocation Policy: Automatic
        Lun Io Priority: High
        Read Cache Policy: Middle
        Write Cache Policy: Middle
        Is Mapped : Yes
        Dif Status: Close
        Data Distributing: [2%,12%,86%]
        Data Move To Tier0: 0(KB)
        Data Move To Tier1: 0(KB)
        Data Move To Tier2: 0(KB)
        Retentionable : Yes
        Retention State : Read/Write
        Retention Term :
        Retention PassTerm :
        Retention Set Time :
        Deduplication Switch: Close

        Lun Id: 4
        Lun Name: LUN_wszvcl01_Daten         
        Userpool Id: 0
        Type: Thin
        User Lun Capacity: 5368709120(KB)
        Allocated Capacity: 2013110272(KB)
        Sector Size: 512(B)
        Health status: Normal
        Running status: Online
        WWN: 6f84abf1005e2fd4d0cb362900000004
        Owning Controller: 0A
        Work Controller: 0A
        Home Pair Id: 0
        SmartQoS Id: 8
        Cache Partition Id: 0
        Snapshot IDs:
        LUN Copy IDs:
        Remote Replication IDs:
        Split Mirror IDs:
        Write Policy: Write Back
        Running Cache Write Policy: Write Back
        Lun Prefetch Type: Intelligent
        Lun Prefetch Value: 0(KB)
        Initial Distribute Policy: Automatic
        Relocation Policy: Lowest
        Lun Io Priority: High
        Read Cache Policy: Middle
        Write Cache Policy: Middle
        Is Mapped : Yes
        Dif Status: Close
        Data Distributing: [0%,0%,100%]
        Data Move To Tier0: 0(KB)
        Data Move To Tier1: 0(KB)
        Data Move To Tier2: 0(KB)
        Retentionable : Yes
        Retention State : Read/Write
        Retention Term :
        Retention PassTerm :
        Retention Set Time :
        Deduplication Switch: Close

        Lun Id: 10
        Lun Name: LUN_VMware_Cluster_01_tiered
        Userpool Id: 0
        Type: Thick
        User Lun Capacity: 4294967296(KB)
        Allocated Capacity: 4294967296(KB)
        Sector Size: 512(B)
        Health status: Normal
        Running status: Online
        WWN: 6f84abf1005e2fd40d121a950000000a
        Owning Controller: 0A
        Work Controller: 0A
        Home Pair Id: 0
        SmartQoS Id: 9
        Cache Partition Id: 0
        Snapshot IDs:
        LUN Copy IDs:
        Remote Replication IDs:
        Split Mirror IDs:
        Write Policy: Write Back
        Running Cache Write Policy: Write Back
        Lun Prefetch Type: Intelligent
        Lun Prefetch Value: 0(KB)
        Initial Distribute Policy: Automatic
        Relocation Policy: Automatic
        Lun Io Priority: High
        Read Cache Policy: Middle
        Write Cache Policy: Middle
        Is Mapped : Yes
        Dif Status: Close
        Data Distributing: [2%,8%,90%]
        Data Move To Tier0: 0(KB)
        Data Move To Tier1: 0(KB)
        Data Move To Tier2: 0(KB)
        Retentionable : Yes
        Retention State : Read/Write
        Retention Term :
        Retention PassTerm :
        Retention Set Time :
        Deduplication Switch: Close

Meanwhile, we found that the average IO size on these LUNs was far larger than usual. As a rule of thumb, when the average IO size exceeds 200 KB, the latency of NL_SAS disks can be expected to exceed 15 ms.

So we can see that the average IO latency of the high priority LUNs was about 20 ms even though the total IOPS had not reached the bottleneck.
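A rough service-time estimate shows why such large IOs push an NL_SAS disk past 15 ms. The Python sketch below uses typical catalogue values for a 7.2k rpm NL_SAS drive (average seek time, rotational delay, sustained transfer rate); these are assumptions for illustration, not measurements taken from this array.

# Back-of-the-envelope service time for one random read on a 7.2k rpm NL_SAS disk.
# All figures are typical catalogue values, used only to illustrate the trend.
def nl_sas_service_time_ms(io_size_kb,
                           avg_seek_ms=8.5,        # typical average seek time
                           rpm=7200,
                           transfer_mb_s=115):     # typical sustained transfer rate
    rotational_ms = 0.5 * 60_000 / rpm             # half a revolution on average
    transfer_ms = io_size_kb / 1024 / transfer_mb_s * 1000
    return avg_seek_ms + rotational_ms + transfer_ms

for size_kb in (8, 64, 200, 512):
    print(f"{size_kb:>4} KB random read ~ {nl_sas_service_time_ms(size_kb):.1f} ms")

# Small (8-64 KB) reads come out around 13 ms, a 200 KB read is already close to
# 15 ms from the mechanics alone, and larger IOs or any queueing on a busy tier2
# push the observed latency well past 15 ms.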





 

Root Cause

1. When the IO size became very large, the average IO latency of the NL_SAS disks exceeded 15 ms.

2. Since the storage could not meet the SmartQoS latency target for the high priority LUNs, it started flow control to reduce the performance of the middle and low priority LUNs, gradually increasing their latency up to the maximum value. By default, the IO latency ceiling is 100 ms for middle priority LUNs and 150 ms for low priority LUNs.
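This escalation can be pictured with a simplified model. The Python sketch below only illustrates the behaviour observed in step 3 of the handling process (the low priority LUNs are throttled first, then the middle priority ones, and the throttling backs off once the high priority target is met again); the step size, interval, and escalation order are assumptions chosen to match the observation, not the actual array firmware algorithm.

# Simplified illustration of SmartQoS flow control as described above: while the
# high priority LUNs miss their latency target, the array throttles lower
# priority LUNs by raising their latency toward a per-priority ceiling
# (100 ms for middle, 150 ms for low by default). The step size and escalation
# order are invented to match the observed behaviour, not firmware logic.
CEILING_MS = {"middle": 100.0, "low": 150.0}
STEP_MS = 10.0                       # invented escalation step per interval


def throttle(state, high_target_met):
    """Advance the injected latency of the lower priority classes by one interval."""
    if high_target_met:
        for prio in state:                   # back off once the target is met again
            state[prio] = max(0.0, state[prio] - STEP_MS)
        return state
    # Escalate low priority first; touch middle priority only after low is capped,
    # matching the order seen in the LUN_PRIORITY_STATISTIC chart.
    if state["low"] < CEILING_MS["low"]:
        state["low"] = min(CEILING_MS["low"], state["low"] + STEP_MS)
    else:
        state["middle"] = min(CEILING_MS["middle"], state["middle"] + STEP_MS)
    return state


# Example timeline: the 15 ms target is missed for a while, then met again.
state = {"middle": 0.0, "low": 0.0}
for minute, target_met in enumerate([False] * 20 + [True] * 5):
    state = throttle(state, target_met)
    print(f"t+{minute:02d}: middle={state['middle']:5.1f} ms  low={state['low']:5.1f} ms")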

Solution
Change the IO latency protection threshold of the high priority LUNs in the SmartQoS policy from 15 ms to 30 ms. The new threshold will be reached only if the NL_SAS disks are truly very busy.
Suggestions

1. The SmartQoS policy should be based on the real performance of the LUNs. Never set a threshold that the storage cannot achieve even when it is in a normal state. Please estimate carefully before planning the configuration.

2. We can enable the "Storage I/O Control" feature in VMware to control the IO latency, bandwidth, and other resources of the VMs. In this way, we can control the whole system at a finer granularity and improve its reliability.

3. We suggest adding more high performance disks, such as SSD and SAS disks, to ensure lower IO latency.

END