A heartbeat fault while the Sun Cluster is reading and writing causes all cluster nodes to reset

Publication Date: 2012-11-07
Issue Description
Product and version:
• S5500T V100R001, V100R002
• S5600T V100R001, V100R002
• S5800T V100R001, V100R002
• S6800T V100R001, V100R002
Host system:
Host 1: Solaris 10 u9 for SPARC; HBA: QLogic 375-3356-02; HBA driver: the Solaris built-in driver, version 20100301-3.00.
Host 2: Oracle Solaris 10 9/10 s10s_u9wos_14a SPARC; HBA: Emulex LPe11002-E; HBA driver: the Solaris built-in driver, version 2.50o (2010.01.08.09.45).
Cluster software: Oracle Solaris Cluster (Sun Cluster) 3.3
The cluster networking is shown in Figure 1.
Figure 1. Cluster networking

The networking is direct single-path. The hosts connect to the disk array normally and the cluster environment is configured successfully. Three quorum devices are configured for the cluster, a LUN is mapped from the disk array, and two resource groups are created. After the file systems are created, they are mounted on the two host nodes respectively and then read and written. During the read/write operations a heartbeat fault occurs, and both nodes reset.
Under normal conditions, after the two nodes contend to preempt the quorum device, one node stays online and only the other node resets.
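The cluster membership and quorum state can be checked from either node before and after the fault. The following is a minimal sketch using the standard Oracle Solaris Cluster 3.3 status commands (the "phys-schost-1" prompt is a placeholder node name):

phys-schost-1# cluster status      # summary of nodes, quorum, and resource groups
phys-schost-1# clquorum status     # quorum votes held by each node and quorum device
phys-schost-1# clnode status       # online/offline state of each cluster node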

Alarm Information
None
Handling Process
Method 1: Restore the heartbeat links among the cluster nodes; the reset nodes rejoin the cluster after they restart.
Method 2: Take the following steps when deploying the cluster to avoid the problem:
Configure (N-1) quorum devices, where N is the number of cluster nodes; this reduces the quorum configuration time (see the command sketch after the steps below).
Modify the "/etc/system" configuration file to increase the default quorum timeout, for example to 60 seconds. The steps are as follows:
1. Acquire "root" authority.
2. On each cluster node, edit the "/etc/system" configuration file and change the value in the line "set cl_haci:qd_acquisition_timer=xxx" to 60, so that it reads "set cl_haci:qd_acquisition_timer=60". If the line does not exist, add it.
3. On any one node, shut down the cluster with the command "phys-schost-1# cluster shutdown -g 0 -y". The "-g" option sets the grace period before shutdown to 0 seconds, that is, shut down immediately; the "-y" option automatically answers "yes" to the confirmation prompt.
4. Boot each node (on these SPARC hosts, enter "boot" at the OpenBoot "ok" prompt) to restart the cluster; the modified "/etc/system" configuration then takes effect.
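Putting the two measures together, the following is a minimal command sketch for a two-node cluster; the DID device name "d4" and the "phys-schost-1" prompt are placeholders that must be replaced with values from the actual environment:

# Configure (N-1) quorum devices; for a two-node cluster, one device:
phys-schost-1# cldevice list -v      # find the DID device that maps to the array LUN
phys-schost-1# clquorum add d4       # register the shared disk as a quorum device
phys-schost-1# clquorum list         # confirm the registered quorum devices

# On each node, set the quorum acquisition timer in /etc/system
# (append the line only if it is not already present; otherwise edit it):
phys-schost-1# echo "set cl_haci:qd_acquisition_timer=60" >> /etc/system

# Shut down the whole cluster from any one node:
phys-schost-1# cluster shutdown -g 0 -y

# At the OpenBoot "ok" prompt of each SPARC node, boot the system:
ok boot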
For Oracle RAC, the default timeout must not be modified: in some split-brain conditions caused by heartbeat faults, a longer timeout causes the Oracle RAC VIP failover to fail. In this case, if the quorum device cannot complete the quorum operation within the default 25 seconds, replace the quorum device.
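Replacing the quorum device can be done online with the "clquorum" command. A sketch, assuming the faulty device is "d4" and the replacement LUN appears as DID device "d5" (both names are placeholders):

phys-schost-1# clquorum add d5       # register the new quorum device first, so quorum votes are not lost
phys-schost-1# clquorum remove d4    # then remove the faulty quorum device
phys-schost-1# clquorum status       # verify the node and device vote counts

Adding the replacement before removing the old device keeps the cluster's quorum vote count intact throughout.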

Root Cause
1. Check whether the host reports the Sun Cluster error "Unable to acquire the quorum device".
2. Check whether the HBA in use is an Emulex LPe11002-E.
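Both checks can be performed from the host. A sketch, using the default Solaris system log path and the Solaris 10 "fcinfo" utility:

# 1. Look for the quorum acquisition error in the system log:
phys-schost-1# grep "Unable to acquire the quorum device" /var/adm/messages

# 2. List the FC HBA ports and inspect the Manufacturer/Model fields:
phys-schost-1# fcinfo hba-port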
Suggestions
When deploying the cluster, take the following measures to avoid the above problem:
1. Configure (N-1) quorum devices, where N is the number of cluster nodes.
2. Increase the quorum's default timeout so that quorum device preemption does not time out while the cluster is being configured. For details, see Method 2 in the Handling Process.
The known-bug list at "http://wikis.sun.com/display/SunCluster/Known Bugs in Oracle Solaris Cluster 3.3" records this fault and explains its cause: a cluster node that cannot complete the quorum operation of the cluster configuration within the default time (25 seconds) resets.

END