Product and version information:
Host 1: Solaris 10u9 for SPARC; HBA: QLogic 375-3356-02; HBA native driver version: 20100301-3.00.
Host 2: Oracle Solaris 10 9/10 s10s_u9wos_14a SPARC configured with the Emulex LPe11002-E HBA; HBA native driver version: 2.50o (2010.01.08.09.45).
The cluster network is shown in Figure 1.
Figure 1 Cluster network
The hosts and the storage array were connected through a single-path direct connection, and the cluster environment was configured with three quorum devices. A LUN was mapped from the storage array to the cluster. Two resource groups were created, a file system was created on them, and the file system was mounted on both host nodes. While data was being read from and written to the file system, the cluster heartbeat failed, and then both nodes reset.
Normally, when two nodes preempt quorum devices, one node stays online while the other resets.
Method 1: Configure N - 1 quorum devices for the cluster, where N is the number of nodes. This can reduce the time needed for arbitration.
Method 2: Modify the /etc/system file to set a higher arbitration timeout value, for example 60 seconds. Perform the following steps:
1. Log in as the root user.
2. On each cluster node, edit the /etc/system file and set the value xxx in the line "set cl_haci:qd_acquisition_timer=xxx" to 60. If the file contains no "set cl_haci:qd_acquisition_timer=xxx" line, add the line to the file.
3. On any node, run the following command to shut down the cluster: phys-schost-1# cluster shutdown -g 0 -y. The -g 0 option sets the grace period to 0 so that the cluster shuts down immediately, and the -y option answers yes to the confirmation prompt.
4. On each node, run the boot command to restart the cluster (on SPARC systems, boot is run at the ok prompt). The modifications to the /etc/system file take effect after the nodes boot.
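Step 2 above can be scripted so that the edit is repeatable on every node. The following is a minimal sketch, not a vendor-supplied tool: set_qd_timer is a hypothetical helper name, and the script assumes a POSIX-compatible shell and that the timer line, if present, starts at the beginning of a line. It saves a copy of the file with a .bak suffix before changing it.

```shell
#!/bin/sh
# Sketch: make sure a system file contains
#   set cl_haci:qd_acquisition_timer=<seconds>
# set_qd_timer is a hypothetical helper, not part of Oracle Solaris Cluster.

set_qd_timer() {
    file=$1      # path to the system file, normally /etc/system
    seconds=$2   # desired timeout in seconds, for example 60
    tmp="${file}.tmp.$$"

    # Keep a backup copy of the file before touching it.
    [ -f "$file" ] && cp -p "$file" "${file}.bak"

    if grep '^set cl_haci:qd_acquisition_timer=' "$file" >/dev/null 2>&1; then
        # The line already exists: rewrite it with the new value.
        sed "s/^set cl_haci:qd_acquisition_timer=.*/set cl_haci:qd_acquisition_timer=${seconds}/" \
            "$file" > "$tmp" || return 1
        # cp (not mv) keeps the original file's owner and permissions.
        cp "$tmp" "$file" && rm -f "$tmp"
    else
        # The line is missing: append it.
        echo "set cl_haci:qd_acquisition_timer=${seconds}" >> "$file"
    fi
}

# Example (run as root on each cluster node):
#   set_qd_timer /etc/system 60
```

The function only edits the file. The cluster must still be shut down and booted as described in steps 3 and 4 before the new value takes effect.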
Do not change the default timeout value in an Oracle RAC environment, because in some split-brain scenarios caused by heartbeat failures, a changed timeout can lead to Oracle RAC VIP switchover failures. If the quorum devices in such an environment cannot complete the operation within the default timeout period (25 seconds), replace the quorum devices.
In initial cluster deployment, perform the following steps:
1. Configure N - 1 quorum devices for the cluster where N is the number of nodes.
2. Set a higher default cluster arbitration timeout value to avoid timeouts caused by quorum device preemption when the cluster is reconfigured. For details, see Method 2.
This problem is a known bug documented at http://wikis.sun.com/display/SunCluster/Known Bugs in Oracle Solaris Cluster 3.3. According to the record, cluster nodes cannot complete the arbitration operations required for cluster reconfiguration within the default timeout (25 seconds). As a result, the nodes reset.