User report: the service was suspended. The fault was traced to the underlying storage, and on-site resolution was requested.
Logging in to the storage ISM showed the system running normally. However, analysis of the low-level system logs showed that the HP host (version: HP-UX 11.31, ia64) had sent a large number of ABTS frames to controller B. ABTS (Abort Sequence) is the operation by which the host cancels an outstanding I/O. We chose a LUN on the host at random and tested it: “dd” read performance was poor, reading only 960 bytes in 5 seconds. The logs of the host DMP software (VCS version: 126.96.36.199) showed that the path for a certain LUN had been switched multiple times and had finally switched back to the primary path, but the data frames were still not delivered successfully.
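To put the “dd” result in perspective, the measured figures can be converted into a throughput. The following is a minimal sketch in Python; the function names are illustrative and are not part of any tool used in this case. Only the 960 bytes and 5 seconds come from the test above.

```python
def throughput_bytes_per_sec(bytes_read: int, seconds: float) -> float:
    """Average read throughput in bytes per second."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    return bytes_read / seconds


def seconds_to_read(total_bytes: int, rate: float) -> float:
    """Time needed to read total_bytes at a constant rate (bytes per second)."""
    if rate <= 0:
        raise ValueError("rate must be positive")
    return total_bytes / rate


# Figures from the "dd" test: 960 bytes read in 5 seconds.
rate = throughput_bytes_per_sec(960, 5)
print(f"{rate:.0f} bytes/s")  # 192 bytes/s

# At that rate, reading a single MiB would take about an hour and a half.
minutes = seconds_to_read(1024 * 1024, rate) / 60
print(f"{minutes:.0f} minutes per MiB")  # 91 minutes per MiB
```

A rate of this order indicates that almost every I/O is timing out and being retried, which is consistent with the ABTS frames seen in the logs.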
After restarting controller B, we repeated the “dd” read test from the host on a random LUN, and read performance returned to normal. However, many upper-layer hosts still reported numerous errors on LUNs connected to controller B. Suspecting a link fault, we inspected the links in the equipment room, checked the path between controller B and the Fibre Channel switch, and found that the fiber at the switch end was not seated properly. After the fiber was reseated, the system returned to normal.
Because controller B had received a large number of ABTS frames, we also suspected a fault in controller B itself, but a low-level check of controller B found no abnormal errors.
Subsequent analysis identified the root cause: the link produced a large number of bit errors, which put controller B into a protected state in which it could not process the I/O requests sent by the host. The Symantec DMP multipathing software installed on the host detected that reads and writes to a LUN were failing and requested a path switch, but it did not send the command to switch the controller that owned the LUN. As a result, the host accessed the LUN, which was still owned by controller B, through the mirror channel of controller A. Because controller B was in the protected state, it could not process the read and write requests forwarded over the mirror channel, and the upper-layer service eventually failed. We conclude that the storage protection mechanism and the Symantec DMP multipathing software did not cooperate well on the S6800T, which ultimately caused the upper-layer service to hang.
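The failure chain described above can be illustrated with a small model. This is only a sketch under simplifying assumptions: the classes Controller and Lun and the function submit_io are hypothetical names invented for illustration, and they do not represent the actual S6800T firmware or the DMP implementation. The model assumes that an I/O arriving at the non-owning controller is forwarded to the owning controller, and that a controller in the protected state services nothing.

```python
from dataclasses import dataclass


@dataclass
class Controller:
    name: str
    protected: bool = False  # True models the protected state caused by bit errors


@dataclass
class Lun:
    name: str
    owner: str  # name of the controller that currently owns the LUN


def submit_io(lun: Lun, via: str, controllers: dict) -> bool:
    """Return True if an I/O sent on a path to controller `via` is serviced.

    Hypothetical simplification: the receiving controller must be healthy,
    and if it does not own the LUN it forwards the request over the mirror
    channel to the owner, which must be healthy as well.
    """
    if controllers[via].protected:
        return False
    if lun.owner != via:
        return not controllers[lun.owner].protected
    return True


controllers = {"A": Controller("A"), "B": Controller("B", protected=True)}
lun = Lun("lun0", owner="B")

print(submit_io(lun, "B", controllers))  # False: direct path to protected controller B
print(submit_io(lun, "A", controllers))  # False: path switched, but the owner is still B

lun.owner = "A"  # ownership moved to controller A as well
print(submit_io(lun, "A", controllers))  # True
```

In this model, switching the path alone does not help, because the request still ends up at controller B. Only moving the ownership of the LUN to controller A lets the I/O complete, and that is the step that was not performed in this case.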
Low-level link bit errors cannot be identified in the storage ISM interface or in the Web management interface of the Fibre Channel switch. The controller has a mechanism for detecting and handling link bit errors, but it does not record them in its logs. Therefore, whenever a link problem is suspected, we suggest logging in to the CLI of the Fibre Channel switch and running the command that checks the link status. This makes the link fault easy to find and reduces the customer's downtime.
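When checking the link status from the switch CLI, a single reading of the error counters is less informative than two readings taken some minutes apart, because a counter that keeps growing points to a link that is currently faulty. The following minimal sketch compares two such readings. The port names and counter values are made-up sample data, and collecting the counters is left out because it depends on the switch model. The function name ports_with_growing_errors is illustrative only.

```python
def ports_with_growing_errors(before: dict, after: dict, min_increase: int = 1) -> dict:
    """Return {port: increase} for ports whose error counter grew.

    `before` and `after` map a port name to its cumulative bit-error
    (for example CRC error) count at two points in time.
    """
    suspects = {}
    for port, new_count in after.items():
        increase = new_count - before.get(port, 0)
        if increase >= min_increase:
            suspects[port] = increase
    return suspects


# Made-up sample readings, taken a few minutes apart.
first_reading = {"port0": 10, "port1": 0, "port2": 3}
second_reading = {"port0": 10, "port1": 5321, "port2": 3}

print(ports_with_growing_errors(first_reading, second_reading))
# {'port1': 5321}
```

A port singled out this way is the first place to inspect the fiber, the connector seating, and the optical module.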