[T Series] A Cluster Heartbeat Failure During Sun Cluster Read/Write Operations Led to a Reset of All Cluster Nodes

Publication Date: 2012-07-19
Issue Description

Product and version information:

  • S5500T V100R001 V100R002
  • S5600T V100R001 V100R002
  • S5800T V100R001 V100R002
  • S6800T V100R001 V100R002
  • Host operating system:

       Host 1: Solaris 10 Update 9 for SPARC, configured with the QLogic 375-3356-02 HBA; HBA native driver version: 20100301-3.00.
       Host 2: Oracle Solaris 10 9/10 s10s_u9wos_14a SPARC, configured with the Emulex LPe11002-E HBA; HBA native driver version: 2.50o (2010.01.08.09.45).

  • Cluster software: Oracle Sun Cluster 3.3

The cluster network is shown in Figure 1.
    Figure 1 Cluster network

 

The two hosts were connected directly to the storage array over a single path, and the cluster environment was configured with three quorum devices. A LUN was mapped from the storage array to the cluster. Two resource groups were created, a file system was created on the LUN, and the file system was mounted on both host nodes. While data was being read from and written to the file system, the cluster heartbeat failed, and then both nodes reset.

 

Normally, when two nodes compete for the quorum devices after a heartbeat failure, the node that acquires them stays online and only the other node resets.

Alarm Information
None
Handling Process
  • Method 1: After the problem has occurred, restore the heartbeat between the cluster nodes. The nodes rejoin the cluster automatically after they restart.
  • Method 2: In initial cluster deployment, perform the following steps:

          Configure N - 1 quorum devices for the cluster, where N is the number of nodes. Fewer quorum devices reduce the time needed for arbitration.
          Set a higher arbitration timeout value in the /etc/system file, for example 60 seconds. Perform the following steps:

            1. Log in as the root user.
            2. On each cluster node, edit the /etc/system file and set xxx in "set cl_haci:qd_acquisition_timer=xxx" to 60. If the file does not contain a "set cl_haci:qd_acquisition_timer=xxx" line, add one.
            3. On any one node, run the phys-schost-1# cluster shutdown -g 0 -y command to shut down the cluster. The -g option is set to 0 so that the cluster shuts down immediately, without a grace period. The -y option answers yes to the shutdown confirmation prompt.
            4. On each node, run the boot command (on these SPARC hosts, at the OpenBoot ok prompt) to boot the node back into the cluster. The modifications to the /etc/system file then take effect.

           

            In Oracle RAC environments, do not change the default timeout value. In some split-brain scenarios caused by heartbeat failures, a changed timeout can lead to Oracle RAC VIP switchover failures. If a quorum device cannot complete its operations within the default timeout period (25 seconds) in such an environment, replace the quorum device.
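
The edit in step 2 of Method 2 can be scripted so that it behaves the same on every node. The following is a minimal sketch, not a vendor-supplied tool: the function name set_qd_timer is hypothetical, and it assumes a POSIX-compatible shell such as ksh or bash. It rewrites the existing line if one is present and appends the line otherwise.

```shell
# Hypothetical helper (not part of Sun Cluster): set the value of
# cl_haci:qd_acquisition_timer in a system file.
# Usage: set_qd_timer FILE SECONDS
set_qd_timer() {
    file=$1
    seconds=$2
    tmp="${file}.tmp.$$"

    if grep '^set cl_haci:qd_acquisition_timer=' "$file" >/dev/null 2>&1; then
        # The line already exists: rewrite its value. A temporary file is
        # used because the Solaris 10 sed has no in-place option.
        sed "s/^set cl_haci:qd_acquisition_timer=.*/set cl_haci:qd_acquisition_timer=${seconds}/" \
            "$file" > "$tmp" && cat "$tmp" > "$file"
        rm -f "$tmp"
    else
        # The line is missing: append it to the end of the file.
        echo "set cl_haci:qd_acquisition_timer=${seconds}" >> "$file"
    fi
}
```

A cautious way to use it is to back up /etc/system, try the function on a copy first, and then run set_qd_timer /etc/system 60 as root on each node before continuing with steps 3 and 4.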

Root Cause
 1. Check whether Sun Cluster reports the message "Unable to acquire the quorum device".
 2. Check whether the HBA used is the Emulex LPe11002-E HBA.
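
Both checks can be run from a shell on a cluster node. The sketch below rests on assumptions: the function names are hypothetical, the message is assumed to be written to the system log (on Solaris, usually /var/adm/messages), and the HBA model is assumed to appear in the text produced by an HBA listing command such as fcinfo hba-port.

```shell
# Hypothetical helper: print "yes" if the given log file contains the
# quorum acquisition message, otherwise "no".
has_quorum_error() {
    logfile=$1
    if grep 'Unable to acquire the quorum device' "$logfile" >/dev/null 2>&1; then
        echo yes
    else
        echo no
    fi
}

# Hypothetical helper: read an HBA listing on standard input and print "yes"
# if it mentions the Emulex LPe11002 model, otherwise "no".
is_emulex_lpe11002() {
    if grep 'LPe11002' >/dev/null 2>&1; then
        echo yes
    else
        echo no
    fi
}
```

Typical use would be has_quorum_error /var/adm/messages and fcinfo hba-port | is_emulex_lpe11002. If the log location or the model string differs on a given system, adjust the arguments accordingly.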
Suggestions

In initial cluster deployment, perform the following steps:
 1. Configure N - 1 quorum devices for the cluster, where N is the number of nodes.
 2. Set a higher cluster arbitration timeout value than the default so that competition for the quorum devices does not time out when the cluster is reconfigured. For details, see Method 2.
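
The N - 1 guideline in suggestion 1 is simple arithmetic and can be checked before deployment. The following sketch uses a hypothetical function, check_quorum_count, which takes the node count and the quorum device count as arguments. It assumes the two counts are read by hand, for example from the output of clnode list and clquorum status.

```shell
# Hypothetical helper: compare the number of configured quorum devices
# with the N - 1 guideline, where N is the number of cluster nodes.
# Usage: check_quorum_count NODES DEVICES
check_quorum_count() {
    nodes=$1
    devices=$2
    recommended=$((nodes - 1))

    if [ "$devices" -gt "$recommended" ]; then
        echo "too many: $devices configured, $recommended recommended"
    elif [ "$devices" -lt "$recommended" ]; then
        echo "too few: $devices configured, $recommended recommended"
    else
        echo "ok: $devices configured, $recommended recommended"
    fi
}
```

For the two-node cluster with three quorum devices described in this case, check_quorum_count 2 3 prints "too many: 3 configured, 1 recommended".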

This problem is a known bug recorded at http://wikis.sun.com/display/SunCluster/Known Bugs in Oracle Solaris Cluster 3.3. According to the record, the cluster nodes cannot complete the arbitration operations for reconfiguring the cluster within the default time (25 seconds). As a result, the nodes reset.
