[T Series] A Cluster Heartbeat Failure During Sun Cluster Read/Write Operations Led to a Reset of All Cluster Nodes

Publication Date: 2012-07-19
Issue Description

Product and version information:

  • S5500T V100R001 V100R002
  • S5600T V100R001 V100R002
  • S5800T V100R001 V100R002
  • S6800T V100R001 V100R002
  • Operating system of application server 1: Solaris 10 u9 for SPARC
  • HBA 1: QLogic 375-3356-02 with native driver version 20100301-3.00
  • Operating system of application server 2: Oracle Solaris 10 9/10 s10s_u9wos_14a SPARC
  • HBA 2: Emulex LPe11002-E with native driver version 2.50o(2010.01.08.09.45)
  • Cluster software: Oracle Sun Cluster 3.3

Symptom:
Cluster networking mode:

1. The hosts and the storage array were directly connected over a single path, and the cluster environment was set up.
2. The cluster was configured with three quorum devices. A LUN was mapped from the storage array to the cluster. Two resource groups were created, and a file system was created on them. The file system was mounted on both host nodes, and data read/write operations were performed on it.
3. During the operations, the cluster heartbeat failed, and then both nodes reset. (Normally, when two nodes compete for the quorum devices, one node stays online while the other resets.)

Alarm Information
None
Handling Process
  • Restore the heartbeat between the cluster nodes. After a restart, the nodes rejoin the cluster automatically.
  • In initial cluster deployment, perform the following steps:
    1. Become superuser (root).
    2. On each cluster node, edit the /etc/system file and set the value of cl_haci:qd_acquisition_timer to 60:
      set cl_haci:qd_acquisition_timer=60
      If the file contains no set cl_haci:qd_acquisition_timer=xxx line, add the line above to the file.
    3. On any node, run the cluster shutdown command to shut down the cluster. For example:
      phys-schost-1# cluster shutdown -g 0 -y
      The -g 0 option sets the shutdown grace period to zero, so the cluster shuts down immediately. The -y option answers yes to the confirmation prompt.
    4. On each node, run the boot command to restart the node. The modifications to the /etc/system file then take effect.

 

Do not change the default timeout value for Oracle RAC. In some split-brain scenarios caused by heartbeat failures, a longer timeout can cause the Oracle RAC VIP switchover to fail. If the quorum devices cannot complete arbitration within the default timeout of 25 seconds in such an environment, replace the quorum devices.
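
Step 2 of the procedure above can be scripted. The following is a minimal sketch, not a vendor-supplied tool: the function name set_qd_timer is hypothetical, and it assumes that an existing entry starts at column 1 exactly as "set cl_haci:qd_acquisition_timer=". It writes through a temporary file instead of relying on sed -i, which is not available in every sed.

```shell
#!/bin/sh
# Sketch: make sure FILE contains "set cl_haci:qd_acquisition_timer=SECONDS".
# set_qd_timer is a hypothetical helper name, not part of Sun Cluster.
# Assumption: an existing entry starts at column 1 with no extra spaces.
set_qd_timer() {
    file=$1
    seconds=$2
    key='set cl_haci:qd_acquisition_timer='
    tmp="${file}.tmp.$$"

    if grep "^${key}" "$file" >/dev/null 2>&1; then
        # Entry exists: rewrite its value. Writing through a temporary file
        # and copying the content back keeps the owner and mode of FILE.
        sed "s/^${key}.*/${key}${seconds}/" "$file" > "$tmp" &&
            cat "$tmp" > "$file"
        rm -f "$tmp"
    else
        # Entry is missing: append it, as step 2 instructs.
        echo "${key}${seconds}" >> "$file"
    fi
}
```

As root on each node, one would first keep a copy of the file (for example, cp /etc/system /etc/system.orig) and then call set_qd_timer /etc/system 60. The new value takes effect only after the cluster shutdown and boot described in steps 3 and 4.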

Root Cause
1. Check whether the alarm "Unable to acquire the quorum device, Sun cluster" is reported.
2. Check whether the HBA used is the Emulex LPe11002-E HBA.
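
The two checks above can be approximated from a shell on each node. This is a sketch under assumptions: the helper names are hypothetical, /var/adm/messages is assumed to be where the node logs the message, and the HBA check assumes that fcinfo hba-port prints a "Model:" line for each port that contains the adapter model string.

```shell
#!/bin/sh
# Hypothetical helpers for the two root-cause checks; names are illustrative.

# Print how many lines of LOGFILE contain the quorum acquisition failure text.
count_quorum_failures() {
    grep -c 'Unable to acquire the quorum device' "$1"
}

# Read HBA information on stdin and succeed if a "Model:" line mentions MODEL.
# Assumption: the input has the "Model: <name>" layout of `fcinfo hba-port`.
has_hba_model() {
    grep 'Model:' | grep -- "$1" >/dev/null
}

# Example use on a node (commented out so the sketch has no side effects):
#   count_quorum_failures /var/adm/messages
#   fcinfo hba-port | has_hba_model 'LPe11002-E' && echo "Emulex LPe11002-E present"
```

A nonzero count from the first helper together with a match from the second suggests that the node fits the conditions described in this case.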
Suggestions

In initial cluster deployment, perform the following steps:
1. Configure N - 1 quorum devices for the cluster, where N is the number of nodes.
2. Increase the cluster arbitration timeout from its default value so that contention for the quorum devices does not cause a timeout when the cluster is reconfigured.

This problem is a known bug and is recorded on http://wikis.sun.com/display/SunCluster/Known Bugs in Oracle Solaris Cluster 3.3

The record indicates that, after the cluster is reconfigured, the cluster nodes cannot complete arbitration within the default period (25 seconds). As a result, the nodes reset.
