AIX Cluster Cannot Start Up Properly Due to the Sequential Execution Error

Publication Date: 2012-07-22
Issue Description

The S2600 controller maps one private LUN to each of the two application servers (ASs). When both LUNs are activated and the cluster is started, the primary node starts properly but the other node fails to start.

Product and version information:
  • S2600 (with the single controller)
  • The version of the controller software is 1.04.01.205.T01
  • The version of the SES is S021
  • The OS of the AS nodes is AIX 6100-01
  • The AS node uses the HACMP (HA) cluster software
  • The version of the HA is 5.4.1.0
The networking of the cluster is as shown in Figure 1.
Figure 1 Networking of the cluster
Alarm Information
None
Handling Process
Take measures as suggested below to avoid this problem:
  • Adjust the startup sequence in the cluster: Start the HA first, and then run the varyonvg command to manually activate the volume group(s) to which the LUNs belong.
  • Adjust the stop sequence in the cluster: Run the varyoffvg command to manually deactivate the volume group(s) to which the LUNs belong, and then stop the HA.
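
The ordering in the two measures above can be sketched as a small Python helper that only builds the list of steps and does not run anything. This is a minimal sketch: the article does not give the exact HA start and stop commands, so `START_HA` and `STOP_HA` are placeholder labels, the function names are hypothetical, and it assumes volume groups are deactivated with varyoffvg, the AIX counterpart of varyonvg.

```python
from typing import List

# Placeholder labels: the article does not state the exact HA start/stop commands.
START_HA = "<start HA cluster services>"
STOP_HA = "<stop HA cluster services>"


def startup_steps(volume_groups: List[str]) -> List[str]:
    """Start the HA first, then manually activate each private volume group."""
    return [START_HA] + [f"varyonvg {vg}" for vg in volume_groups]


def shutdown_steps(volume_groups: List[str]) -> List[str]:
    """Manually deactivate each private volume group, then stop the HA."""
    return [f"varyoffvg {vg}" for vg in volume_groups] + [STOP_HA]
```

For a node whose private LUN belongs to a volume group named, say, `privvg01` (a made-up name), `startup_steps(["privvg01"])` lists the HA start before `varyonvg privvg01`, and `shutdown_steps(["privvg01"])` lists `varyoffvg privvg01` before the HA stop.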


Root Cause
  1. The system log contains records of the Reserve command being executed and cleared (strings such as Reserve and ClearReserve).
    Oct 14 23:25:56 AK-I kernel: [372919227]Reserve (6)[16] command for Host LUN 0, Device 
    Lun 8  @ [jif=372919227] SCSI_PrintDebugInfo : 1382
    ......
    Oct 14 23:25:56 AK-I kernel: [372919957]SCSI_ClearReserveExec
    Oct 14 23:25:56 AK-I kernel: [372919957]  @ [jif=372919957] SCSI_ClearReserveExec : 2200
    Oct 14 23:25:56 AK-I kernel: [372919957]This is the master controller
    Oct 14 23:25:56 AK-I kernel: [372919957]  @ [jif=372919957] SCSI_ClearReserveExec : 2207
    Oct 14 23:25:56 AK-I kernel: [372919957]Enter SCSI_ClearReserve
    Oct 14 23:25:56 AK-I kernel: [372919957]  @ [jif=372919957] SCSI_ClearReserve : 2286
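
When the log is long, the Reserve and ClearReserve records can be picked out with a short script. The following is a minimal sketch that assumes every record has the same layout as the excerpt above (syslog timestamp, host, `kernel:`, a bracketed counter, then the message). The function name `reserve_events` and the "reserve"/"clear" labels are illustrative and not part of any controller tooling.

```python
import re
from typing import Iterable, List, Tuple

# Timestamp, then anything up to the first "[digits]" counter, then the message.
LINE_RE = re.compile(
    r"^\s*(?P<ts>\w{3}\s+\d+\s+\d{2}:\d{2}:\d{2})\s.*?\[\d+\](?P<msg>.*)$"
)


def reserve_events(lines: Iterable[str]) -> List[Tuple[str, str]]:
    """Return (timestamp, kind) pairs, where kind is 'reserve' or 'clear'."""
    events = []
    for line in lines:
        match = LINE_RE.match(line)
        if not match:
            continue  # wrapped continuation lines and "......" are skipped
        msg = match.group("msg")
        if msg.startswith("Reserve ("):
            events.append((match.group("ts"), "reserve"))
        elif msg.startswith("Enter SCSI_ClearReserve"):
            events.append((match.group("ts"), "clear"))
    return events
```

A "reserve" event that is never followed by a "clear" event would be the pattern to look for when the second node fails to start.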

Conclusion:

  • The Advanced Interactive Executive (AIX) cluster cannot start properly because the nodes and their private LUNs are started and stopped in the wrong order.
Suggestions

After executing the Reserve command, the primary node, which starts first, does not clear the reservation in time.

If no private LUNs are mapped to the ASs, then when the cluster starts, the primary node (the node that starts first) executes the Reserve command and then the LOGOUT command, which clears the reservation. The other node then repeats the same procedure. The LOGOUT command clears all existing sessions, and on the primary node the private LUN and the shared disks share the same session. As a result, when a LUN that is private to a node is activated, the primary node does not execute the LOGOUT command, the reservation is not cleared, and the other node fails to start.
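
The reasoning above can be illustrated with a toy model. This is only a sketch of the sequence described in this article, not of the controller's actual SCSI implementation: the `SharedDisk` class, the `start_cluster` function, and the node names are all hypothetical.

```python
from typing import Dict, Optional


class SharedDisk:
    """Toy shared disk that can be reserved by at most one node at a time."""

    def __init__(self) -> None:
        self.reserved_by: Optional[str] = None

    def reserve(self, node: str) -> bool:
        """Reserve the disk; fails if another node already holds it."""
        if self.reserved_by in (None, node):
            self.reserved_by = node
            return True
        return False

    def logout(self, node: str) -> None:
        """LOGOUT clears the session and with it the node's reservation."""
        if self.reserved_by == node:
            self.reserved_by = None


def start_cluster(private_lun_active: bool) -> Dict[str, bool]:
    """Return whether each node obtained the reservation during startup."""
    disk = SharedDisk()
    result = {"primary": disk.reserve("primary")}
    # As described above, the primary node skips LOGOUT while its private
    # LUN is active, so its reservation stays in place.
    if not private_lun_active:
        disk.logout("primary")
    result["secondary"] = disk.reserve("secondary")
    return result
```

In this model, `start_cluster(False)` lets both nodes reserve the disk in turn, while `start_cluster(True)` leaves the reservation with the primary node and the second node's reserve fails, which corresponds to the startup failure described here.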


AIX is a UNIX operating system developed by IBM that complies with The Open Group's UNIX 98 Base Brand.

AIX supports the concurrent execution of 32-bit and 64-bit applications and runs on IBM pSeries and IBM RS/6000 workstations, servers, and parallel supercomputers.

AIX provides three shells (the Korn shell from System V, the Bourne shell, and the 4.3BSD C shell) that users can choose as their interface to the UNIX system.
