Failed to Start the Cluster After the pg_xlog Is Deleted by Mistake

Publication Date: 2019-04-12

Issue Description

The data disk at a site became full and the cluster became unavailable. When pg_xlog was moved, the pg_xlog directory of the active node was moved by mistake. As a result, the cluster cannot be started, and a standby node is rebuilt repeatedly.

Alarm Information

As shown in the following figure, the output of the cm_ctl query -Cv command indicates that the active node cannot be started and that the standby node is either waiting for modification or stuck in recovery.

Handling Process

Run the cm_ctl query -Cvd command to view the directory of each data node.


Run the cm_ctl stop command with the node ID and data directory of each active node to stop it. For example:

cm_ctl stop -n 1 -D /srv/BigData/mppdb/data1/master1

In the preceding command, -n specifies the value shown in the node column of the query output, and -D specifies the data directory of that node.
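Because a mistyped node ID or data directory would point cm_ctl at the wrong instance, the two values can be checked before the command is composed. The following is a minimal sketch only: cm_ctl_target is a hypothetical helper name, and the rule that the data directory must be an absolute path is an assumption based on the paths used in this article. It prints the -n and -D arguments and runs nothing.

```shell
# Hypothetical helper: validate a node ID and data directory, then print
# the "-n <node> -D <dir>" arguments. It does not call cm_ctl itself.
cm_ctl_target() {
    node=$1
    datadir=$2
    # The node ID must consist of digits only.
    case $node in
        ''|*[!0-9]*)
            echo "node ID must be a number: '$node'" >&2
            return 1 ;;
    esac
    # Assumption: data directories are given as absolute paths.
    case $datadir in
        /*) ;;
        *)
            echo "data directory must be an absolute path: '$datadir'" >&2
            return 1 ;;
    esac
    echo "-n $node -D $datadir"
}
```

For example, cm_ctl_target 1 /srv/BigData/mppdb/data1/master1 prints the arguments used in the stop command above, while a non-numeric node ID is rejected with an error message.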

After all the active nodes are stopped, every standby node except those that are being built repeatedly starts and is promoted to active.


For a node that cannot be built automatically, stop the node and then build it manually, as shown in the following figure.

Stop the node:

cm_ctl stop -n 2 -D /srv/BigData/mppdb/data1/slave1

Run the following command to build the node:

cm_ctl build -n 2 -D /srv/BigData/mppdb/data1/slave1

After the build is complete, start each original active node. After startup, it runs as a standby node.

cm_ctl start -n 1 -D /srv/BigData/mppdb/data1/master1


Run the following command to perform the active/standby switchover to restore the cluster.

cm_ctl switchover -a

Root Cause

After the pg_xlog directory of the active node is moved by mistake, the active node goes down immediately. Arbitration then selects the standby node for promotion to active. However, because the pg_xlog of the active node is lost, the standby node is built repeatedly, cannot be started, and therefore cannot be promoted to active.
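Since the fault comes from a missing pg_xlog directory, a quick read-only check of a data directory can confirm which node is affected. The following is a minimal sketch: has_pg_xlog is a hypothetical helper name, and it assumes that pg_xlog sits directly under the data directory, as in standard PostgreSQL layouts. It only tests the filesystem and changes nothing.

```shell
# Hypothetical helper: report whether a data directory still contains pg_xlog.
# Returns 0 if pg_xlog is a directory (or a symlink that resolves to one),
# 1 otherwise. Read-only: nothing is created, moved, or deleted.
has_pg_xlog() {
    datadir=$1
    if [ -d "$datadir/pg_xlog" ]; then
        echo "OK: $datadir/pg_xlog is present"
        return 0
    fi
    echo "MISSING: $datadir/pg_xlog not found" >&2
    return 1
}
```

For example, has_pg_xlog /srv/BigData/mppdb/data1/master1 could be run on the host of the active node before and after any cleanup of the data disk.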

Solution

Stop the active node so that the standby node is promoted to active (a faulty standby node must be built manually). After that, start the original active node, and perform the active/standby switchover to restore the nodes.

Run the following command to stop the node:

cm_ctl stop -n 1 -D /srv/BigData/mppdb/data1/master1

Run the following command to start the node:

cm_ctl start -n 1 -D /srv/BigData/mppdb/data1/master1

Run the following command to build the node:

cm_ctl build -n 2 -D /srv/BigData/mppdb/data1/slave1

Run the following command to perform the active/standby switchover:

cm_ctl switchover -a
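The commands can be laid out in the order described in the handling process (stop the active node, stop and build the faulty standby node, start the original active node, switch over). The following is a dry-run sketch only: print_recovery_plan is a hypothetical helper, it prints the commands without running them, and the node IDs and data directories passed to it are placeholders that must be replaced with the values reported by cm_ctl query -Cvd.

```shell
# Dry-run sketch: prints the recovery commands in order; it does not run them.
# Arguments: active node ID, active data dir, standby node ID, standby data dir.
# All four values are placeholders; read the real ones from "cm_ctl query -Cvd".
print_recovery_plan() {
    active_id=$1
    active_dir=$2
    standby_id=$3
    standby_dir=$4
    echo "cm_ctl stop -n $active_id -D $active_dir"       # stop the active node
    echo "cm_ctl stop -n $standby_id -D $standby_dir"     # stop the faulty standby node
    echo "cm_ctl build -n $standby_id -D $standby_dir"    # build it manually
    echo "cm_ctl start -n $active_id -D $active_dir"      # start the original active node
    echo "cm_ctl switchover -a"                           # restore the original roles
}
```

For example, print_recovery_plan 1 /srv/BigData/mppdb/data1/master1 2 /srv/BigData/mppdb/data1/slave1 prints the five commands used in this article, which can then be reviewed and run one at a time.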
