Too much read/write IO on sysdisk

Publication Date: 2014-09-19
Issue Description
After the system hung due to an overload during a peak hour, the customer had to perform a hard reboot to make the system usable again. After the reboot, two alarms were present in ISM.
Alarm Information
source: nssc01
SN: 589
ID: 0x3021f0015
Level: Major
Occurred At: 2014-02-13 19:32:32
Description: Too much read/write IO on sysdisk

Detail: nssc01_01:Read/Write IO on sysdisk is over workload of sysdisk.

Repair Suggest: Contact the Engineer.


source: nssc01
SN: 593
ID: 0x3021f0015
Level: Major
Occurred At: 2014-02-13 19:38:19
Description: Too much read/write IO on sysdisk

Detail: nssc01_02:Read/Write IO on sysdisk is over workload of sysdisk.

Repair Suggest: Contact the Engineer. 
Handling Process
The third thing checked was the number of shares created (173); please see NFSShares.txt attached. Serving that many shares at the same time will hang the system on every occasion. Supporting this are the hardware specifications of the system, which can run into problems when facing that kind of load; please see CPU.Memory.txt attached. On top of that, the CPU is the lowest-end model supported by this equipment, and the RAM in this case is far too low (16 GB). A quick way to count the exported shares from a client is sketched below.
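As a cross-check of the figure in NFSShares.txt, the export count can also be taken from any Linux client with standard NFS tools. This is a minimal sketch, assuming the server (nssc01 here, taken from the alarm source) answers showmount queries:

~~~   counting NFS shares from a client (sketch)   ~~~
#!/bin/bash
# Host name taken from the alarm source in this case.
NAS_HOST="nssc01"

# showmount -e prints one header line followed by one export per line,
# so skip the header and count the remaining lines.
share_count=$(showmount -e "$NAS_HOST" | tail -n +2 | wc -l)
echo "Exported NFS shares on $NAS_HOST: $share_count"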
Root Cause
The customer asked to install the OceanStor Toolkit so he could extract logs from the system. After access was granted and he had installed the application, he successfully provided the logs.

After checking the logs, I noticed the following:

First of all, during the period just before the alarm was received, the CPU load was at 100%, and a little later the engine showed some failures, putting the node into the LEAVING state.

2014/02/13 19:17:05 VCS INFO V-16-10061-14001 HostMonitor:VCShm:monitor:Updating System attribute with CPU usage = 100% and Swap usage = 0%.
2014/02/13 19:17:35 VCS INFO V-16-10061-14001 HostMonitor:VCShm:monitor:Updating System attribute with CPU usage = 100% and Swap usage = 0%.
2014/02/13 19:18:05 VCS INFO V-16-10061-14001 HostMonitor:VCShm:monitor:Updating System attribute with CPU usage = 100% and Swap usage = 0%.
2014/02/13 19:18:35 VCS INFO V-16-10061-14001 HostMonitor:VCShm:monitor:Updating System attribute with CPU usage = 100% and Swap usage = 0%.
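To gauge how long the saturation lasted, the same samples can be pulled out of the engine log with grep. This is a minimal sketch, assuming the standard VCS engine log location of /var/VRTSvcs/log/engine_A.log:

~~~   extracting the saturated-CPU samples (sketch)   ~~~
# List every HostMonitor sample that reported full CPU usage,
# then count them to see how long the saturation lasted.
grep "CPU usage = 100%" /var/VRTSvcs/log/engine_A.log
grep -c "CPU usage = 100%" /var/VRTSvcs/log/engine_A.log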

Second, the engine reported an error:

2014/02/13 19:23:05 VCS WARNING V-16-2-13108 Thread(4147590832) Engine reported error(e02)

After that, the node entered the LEAVING state:

2014/02/13 19:23:05 VCS WARNING V-16-2-13109 Thread(4147590832) Engine reported error; error info is (Operation 'hasys -modify ... -update' rejected as the node is in LEAVING state).
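LEAVING means the node was already on its way out of the cluster membership, which is why the modify operation was rejected. The node states can be confirmed from the VCS side; a minimal check, assuming a standard VCS installation:

~~~   checking node states (sketch)   ~~~
# Print the current state of every node in the cluster; a node in
# LEAVING state rejects configuration updates such as
# 'hasys -modify ... -update'.
hasys -state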

After noticing the high CPU load, I checked whether any shares were made during that time and found that that specific period was very busy, with the system struggling between finishing tasks and asking for resources for the new ones. It can also be seen there that even after a share was completed, the system was so loaded that faults appeared stating that the resources could not be recovered after the action.
Please see shares.txt attached for more details.

Suggestions
My first and basic suggestion is to reduce the high load during the peak hours. After doing that, please keep monitoring the system closely (a simple sampling sketch follows) and see whether the issue reoccurs; if it does, we must take this to the next level.
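For the monitoring itself, a sampling loop on the node is enough to correlate peak-hour load with the alarms. This is a minimal sketch using standard Linux tools; the 90% threshold, the one-minute interval, and the log path are illustrative choices, not values from this case:

~~~   peak-hour CPU sampling (sketch)   ~~~
#!/bin/bash
# Sample overall CPU usage once a minute and log every sample above
# the threshold, so overload periods can be matched against the
# sysdisk alarms afterwards.
THRESHOLD=90
while true; do
    # Take the idle percentage from vmstat's second (measured) sample
    # and turn it into a usage percentage.
    idle=$(vmstat 1 2 | tail -1 | awk '{print $15}')
    usage=$((100 - idle))
    if [ "$usage" -ge "$THRESHOLD" ]; then
        echo "$(date '+%Y/%m/%d %H:%M:%S') CPU usage = ${usage}%" >> /var/log/cpu_peak.log
    fi
    sleep 60
done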

For this particular client, I noticed that there were 16 file systems created and only 5 LUNs sustaining those file systems, as the listings below show.

~~~   storage fs list   ~~~
FS                        STATUS       SIZE    LAYOUT              MIRRORS   COLUMNS   USE%  NFS SHARED  CIFS SHARED  SECONDARY TIER  POOL LIST
========================= ======       ====    ======              =======   =======   ====  ==========  ===========  ==============  ==========
datafeeds                 online    512.00M    simple              -         -          43%    yes          no           no           tier2-pool
ftproot                   online      1.00T    simple              -         -          37%    yes          no           no           tier2-pool
gfx-core                  online      1.19T    simple              -         -            -    yes          no           no           tier1-pool
gfx-ggfx                  online     50.00G    simple              -         -            -    yes          no           no           tier1-pool
gfx-ilink                 online     50.00G    simple              -         -            -    yes          no           no           tier1-pool
gfx-pdf                   online    100.00G    simple              -         -            -    yes          no           no           tier1-pool
lonres3-gfx-core          online      2.00T    simple              -         -            -    yes          no           no           tier0-pool
lonres3-gfx-ggfx          online     50.00G    simple              -         -            -    yes          no           no           tier0-pool
lonres3-gfx-ilink         online     50.00G    simple              -         -            -    yes          no           no           tier0-pool
lonres3-gfx-pdf           online    100.00G    simple              -         -            -    yes          no           no           tier0-pool
lonres3-php_session       online      1.00G    simple              -         -            -     no          no           no           tier0-pool
mobile_site               online      1.00G    simple              -         -          13%    yes          no           no           tier1-pool
mylonres_assets           online     32.00G    simple              -         -           1%    yes          no           no           tier2-pool
php_session               online      1.00G    simple              -         -           4%    yes          no           no           tier1-pool
site-shares               online     32.00G    simple              -         -            -    yes          no           no           tier2-pool
upload-images             online      4.00G    simple              -         -            -    yes          no           no           tier2-pool


~~~   storage pool list   ~~~
Pool                                       List of disks                            
==================================         ====================================     
tier0-pool                                 huawei-s2600-0_14 huawei-s2600-0_15 huawei-s2600-0_16
tier1-pool                                 huawei-s2600-0_13
tier2-pool                                 huawei-s2600-0_12 
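The imbalance can be read straight off these two listings: 5 file systems sit on tier0-pool (3 disks), 6 on tier1-pool (1 disk), and 5 on tier2-pool (1 disk). A small awk sketch that does the tally, assuming the 'storage fs list' output has been saved to a plain file fs_list.txt:

~~~   file systems per pool (sketch)   ~~~
# Tally the data rows of 'storage fs list' by their last column (the
# pool name); the first two lines of the saved output are the header
# and its separator row, so they are skipped.
awk 'NR > 2 && NF { count[$NF]++ }
     END { for (pool in count) print pool ": " count[pool] " file systems" }' fs_list.txt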

My suggestion for this kind of case is to take everything back into consideration and start from scratch, creating the RAID groups properly, then the LUNs, data disks, and file systems, all of them configured accordingly: a proper RAID level for each type of business, and a proper number of LUNs, basically one for each file system. A rough sketch of such a layout follows.
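For illustration only, the target layout could look something like the lines below. The pool names, disk identifiers, and sizes are hypothetical, and the command forms are modeled on the console syntax shown above, so they should be verified against the product documentation before use:

~~~   illustrative target layout (sketch)   ~~~
# Hypothetical pools, each backed by enough LUNs for its workload;
# the disk identifiers below do not exist on this system and only
# mirror the naming seen in 'storage pool list'.
storage pool create web-pool  huawei-s2600-0_20,huawei-s2600-0_21
storage pool create data-pool huawei-s2600-0_22,huawei-s2600-0_23

# Roughly one LUN per file system, so no single disk ends up
# carrying a dozen file systems at peak hour.
storage fs create simple php_session 1g web-pool
storage fs create simple ftproot     1t data-pool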

END