Geographic Redundancy
- Disaster Recovery System Overview
- Maintenance Scenarios of the Disaster Recovery System
- DR Status Overview
- Creating or Deleting a DR System
- Freezing Products to Disable Automatic Service Startup
- Routine Maintenance
- Fault Maintenance
Disaster Recovery System Overview
The DR system consists of two sets of iMaster NCE-Campus. The two sets of iMaster NCE-Campus are deployed at two remote sites. When the DR system is running properly, data of the site that provides services is synchronized to the peer site in real time to ensure data consistency between the two sites. When a fault occurs at the site that provides services, you can manually switch the services from the faulty site to the peer site. Automatic switchover is provided if the arbitration service is deployed. This ensures service continuity and reduces the loss caused by disastrous incidents.
The DR system of iMaster NCE-Campus has the following benefits.
- Ease of use: One-click operations are performed on the client.
- Real-time database synchronization between the primary and secondary sites: Data in the recovery point objective (RPO) is consistent.
- Reliability: RPO and recovery time objective (RTO) are at minute-level.
- Powerful automatic recovery capability: Faults are automatically rectified when services are abnormal.
After a geographic redundancy switchover, the links between existing devices and iMaster NCE-Campus become unavailable. iMaster NCE-Campus brings the devices offline, and the devices automatically go online again. The device statuses are expected to return to normal on the iMaster NCE-Campus web UI after 10 to 20 minutes.
Common concepts
This section describes common concepts and the differences between the active and standby sites in the DR system.
Common concepts
Concept | Description |
---|---|
Primary site | Physical primary site. The primary site is determined during installation and does not change with an active/standby switchover. The primary site product is active and provides services most of the time. |
Secondary site | Physical secondary site. The secondary site is determined during installation and does not change with an active/standby switchover. The secondary site product is standby and provides protection for the primary site most of the time. The management plane at the secondary site maintains and monitors the secondary site. |
Active site | Site that provides services. |
Standby site | Site that provides protection for the active site. |
DR system | A set of the management plane and product with the same planning scheme is deployed at both the primary and secondary sites. |
DR relationship | Relationship between products with DR protection deployed at both the primary and secondary sites. |
Heartbeat link | The primary and secondary sites communicate with each other through heartbeat links to detect the status of the peer site. The DR system checks the heartbeat status every 10 seconds. If the results of two consecutive checks are consistent, the DR system updates the heartbeat status between the primary and secondary sites. |
Data replication link | Data is synchronized between the primary and secondary sites through the data replication link. The DR system checks the data replication status every 30 seconds and updates the data replication status between the primary and secondary sites. |
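The heartbeat rule in the table above (a check every 10 seconds, with the status updated only when two consecutive checks agree) can be illustrated with a minimal sketch. The class and method names are hypothetical, and the reading that a single deviating check leaves the displayed status unchanged is an assumption, not documented product behavior.

```python
from typing import Optional

CHECK_INTERVAL_S = 10  # heartbeat check period stated in the table above


class HeartbeatStatus:
    """Hypothetical model: publish a status only after two consecutive checks agree."""

    def __init__(self, initial: str = "normal") -> None:
        self.published = initial               # status currently displayed
        self._last_raw: Optional[str] = None   # result of the previous check

    def record_check(self, raw: str) -> str:
        # Update the published status only if this check matches the previous one.
        if raw == self._last_raw:
            self.published = raw
        self._last_raw = raw
        return self.published


hb = HeartbeatStatus()
history = [hb.record_check(r) for r in ["abnormal", "normal", "abnormal", "abnormal"]]
print(history)
```

Under this reading, an isolated failed check does not change the displayed status; only the second consecutive "abnormal" result does, roughly 20 seconds after the link actually fails.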
Differences between the active and standby sites
Scenario | Difference |
---|---|
Log in to the management plane | The DR system provides protection for site products. The data of the management plane is not synchronized, and the management plane of both the primary and secondary sites is running. Therefore, you can log in to the management plane web client of both sites. |
Log in to the service plane | After the DR system is created, only the product at the active site provides services externally. Therefore, you can log in to the service plane web client of the active site, but cannot log in to the service plane web client of the standby site. |
Service status | Compared with the active site, fewer services of the product are running at the standby site. At the standby site, only DR-related basic services of the product are running. You can view the services running at the standby site on the Services tab page of the Product > System Monitoring page of the standby site. Other services that the DR system does not depend on are not displayed on this tab page. |
DR operations | You can perform switch-to-standby operations for the product at the active site, but cannot perform takeover operations. You can perform takeover operations for the product at the standby site, but cannot perform switch-to-standby operations. |
Alarm reporting | Alarms of the active site are reported to the service plane of the active site. As login to the service plane client of the standby site is not allowed, alarms of the standby site are also reported to the service plane of the active site. |
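The role-dependent rules in the table above can be summarized as a small lookup. This is only an illustrative sketch: the dictionary and function names are hypothetical and do not correspond to any product interface.

```python
# Operations permitted per site role, as described in the table above.
ALLOWED_DR_OPERATIONS = {
    "active": {"switch-to-standby"},
    "standby": {"takeover"},
}


def can_perform(role: str, operation: str) -> bool:
    """Return True if the DR operation is allowed for a product with this role."""
    return operation in ALLOWED_DR_OPERATIONS.get(role, set())


def can_log_in_to_service_plane(role: str) -> bool:
    """Only the active site exposes the service plane web client."""
    return role == "active"


print(can_perform("standby", "takeover"))
print(can_perform("active", "takeover"))
```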
Solution Introduction
Manual Switchover
Solution introduction:
The primary and secondary sites communicate with each other through heartbeat links and detect the status of the peer site in real time. The primary site synchronizes product data to the secondary site in real time through the data replication link to ensure product data consistency between the primary and secondary sites.
When a disaster occurs at the primary site, perform the takeover operation at the secondary site. The secondary site becomes the active site and provides services externally. The primary site becomes standby.
Manual switchover trigger conditions:
- A disaster such as an earthquake, fire, or power failure occurs at the primary site, leaving the entire system unable to provide services.
- The primary site is faulty, and some key nodes are damaged and cannot provide their services. Examples include corruption of a database node (DB), platform service node (Common_Service), management domain service node (NMS), or control domain service node (Controller or TController).
Solution schematic diagram:
The DR network can reuse the original network of iMaster NCE-Campus to reduce the network configuration of the primary and secondary sites.
DR Link | IP Address | Network Plane |
---|---|---|
Data replication link | Replication IP address | DR network. NOTE: The DR network can reuse the inter-node communication network or northbound network, or use an independent network. |
Heartbeat link | Heartbeat IP address | DR network. The heartbeat IP address and replication IP address must be on the same network plane. |
Automatic Switchover (with Arbitration Service)
Solution introduction:
The arbitration service periodically checks the connectivity among the primary, secondary, and third-party sites, and shares the check results through data sharing links. When a network connection is abnormal or a site fault causes an arbitration heartbeat exception, the arbitration service selects the optimal site in the network based on its internal algorithms and performs an active/standby switchover.
Automatic switchover trigger conditions:
- A disaster such as an earthquake, fire, or power failure occurs at the primary site, and the fault is not rectified within the specified time.
- The heartbeat link between the primary and secondary sites is interrupted, and the data sharing link between the primary site and the third-party site is interrupted.
- In the manager+controller+analyzer compact deployment and manager deployment scenarios:
- If any of the default key microservices of the system is faulty, the DR system triggers an automatic switchover to ensure normal service running.
- If the service network (southbound or northbound network) is faulty due to a network port fault on the server, the system automatically triggers a switchover.
- If all database instances are faulty, the system automatically triggers a switchover.
In Manager+Controller+Analyzer deployment scenarios, nodes and application services are deployed in active/standby or cluster mode, and local protection is configured. Failover for key microservices, server service network ports, and all database instances is not separately configured.
The priorities of triggering an automatic switchover are as follows: All database instances are faulty > Server service network ports are faulty > Key microservices are faulty. If all database instances at the secondary site are faulty, an automatic switchover is not triggered even if key microservices at the primary site are faulty.
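The priority order above can be sketched as a comparison between the most severe fault at each site. The fault names and functions below are hypothetical. The text gives only one example of suppression (all database instances faulty at the secondary site), so generalizing it to "do not switch when the secondary site has a fault of equal or higher priority" is an assumption.

```python
from typing import Optional, Set

# Highest priority first, following the order stated in the text above.
TRIGGER_PRIORITY = [
    "all_db_instances_faulty",
    "service_network_port_faulty",
    "key_microservice_faulty",
]


def worst_rank(faults: Set[str]) -> Optional[int]:
    """Return the index of the highest-priority fault present, or None."""
    ranks = [TRIGGER_PRIORITY.index(f) for f in faults if f in TRIGGER_PRIORITY]
    return min(ranks) if ranks else None


def select_trigger(primary_faults: Set[str], secondary_faults: Set[str]) -> Optional[str]:
    """Return the fault that would trigger a switchover, or None if none should occur."""
    primary = worst_rank(primary_faults)
    if primary is None:
        return None  # nothing wrong at the primary site
    secondary = worst_rank(secondary_faults)
    # Assumption: switching is pointless if the secondary site is at least as badly affected.
    if secondary is not None and secondary <= primary:
        return None
    return TRIGGER_PRIORITY[primary]


print(select_trigger({"key_microservice_faulty"}, {"all_db_instances_faulty"}))
print(select_trigger({"key_microservice_faulty", "all_db_instances_faulty"}, set()))
```

The first call reproduces the example in the text: key microservices are faulty at the primary site, but no switchover is selected because all database instances at the secondary site are faulty.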
Arbitration service deployment:
- The CPU architecture of the primary site, secondary site, and third-party site must be the same. For example, if the primary and secondary sites use ARM servers, the third-party site must also use ARM servers.
- iMaster NCE-Campus in Manager+Controller+Analyzer deployment scenarios adopts five-node arbitration service deployment. The arbitration service is deployed at three sites in 2+2+1 mode.
- Two arbitration nodes are deployed at both the primary site and secondary site. It is recommended that the two arbitration nodes be deployed on the Common_Service node. The arbitration nodes between the two sites are mutually protected. One arbitration node is deployed at the third-party site.
- ETCD is deployed on the five arbitration nodes to form an ETCD cluster. Monitor is deployed on the four nodes of the primary site and secondary site, which monitors the network connectivity between sites and saves the results in the ETCD cluster.
Figure 3-24 A five-node DR system
- iMaster NCE-Campus in manager+controller+analyzer compact deployment and manager deployment scenarios adopts three-node arbitration service deployment. The arbitration service is deployed at three sites in 1+1+1 mode.
- One arbitration node is deployed at the primary site. One arbitration node is deployed at the secondary site. It is required that the arbitration node be deployed on the Common_Service node in manager+controller+analyzer compact deployment scenarios, and the arbitration node be deployed on the NMS_Server node in manager deployment scenarios. One arbitration node is deployed at the third-party site.
- ETCD is deployed on the three arbitration nodes to form an ETCD cluster. Monitor is deployed on the two nodes of the primary site and secondary site, which monitors the network connectivity between sites and saves the results in the ETCD cluster.
Figure 3-25 A three-node DR system
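The value of the third-party site in the 2+2+1 and 1+1+1 layouts above can be seen from general etcd behavior: an etcd cluster stays writable only while a majority of its members is reachable. The sketch below applies that majority rule to the layouts; the function names are hypothetical, and the arithmetic reflects etcd quorum in general rather than any documented internal of the arbitration service.

```python
from typing import Dict


def quorum(members: int) -> int:
    """Majority needed by an etcd (Raft) cluster of the given size."""
    return members // 2 + 1


def survives_site_loss(layout: Dict[str, int], lost_site: str) -> bool:
    """Return True if the cluster keeps quorum after every node at one site is lost."""
    total = sum(layout.values())
    remaining = total - layout[lost_site]
    return remaining >= quorum(total)


FIVE_NODE = {"primary": 2, "secondary": 2, "third_party": 1}
THREE_NODE = {"primary": 1, "secondary": 1, "third_party": 1}
TWO_SITES_ONLY = {"primary": 2, "secondary": 2}  # hypothetical layout without a third site

for name, layout in [("2+2+1", FIVE_NODE), ("1+1+1", THREE_NODE), ("2+2", TWO_SITES_ONLY)]:
    print(name, {site: survives_site_loss(layout, site) for site in layout})
```

Both documented layouts keep a majority when any single site is lost, whereas a two-site layout loses quorum as soon as either site fails.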
Solution schematic diagram:
The DR network can reuse the original network of iMaster NCE-Campus to reduce the network configuration of the primary and secondary sites.
DR Link | IP Address | Network Plane |
---|---|---|
Data replication link | Replication IP address | DR network. NOTE: The DR network can reuse the inter-node communication network or northbound network, or use an independent network. |
Heartbeat link | Heartbeat IP address | DR network. The heartbeat IP address and replication IP address must be on the same network plane. |
Arbitration heartbeat/data sharing link | Arbitration site communication IP address | DR network |
Prerequisites for Configuring a DR System
Before configuring the DR system, check that the configurations of the primary and secondary sites meet the requirements to ensure that the DR system can be configured.
- Select the DR system as the protection site to ensure the DR services are the same.
- The deployment schemes (such as product version, Product, component solution) of the two sites must be the same, that is, the software packages, number and specifications of the nodes at the two sites, Management plane and product languages, versions, services, service versions, DR system certificates, root keys, working keys, and the database user passwords of the product must be the same.
- The primary and secondary sites use different IP addresses to interconnect with an OSS. It is advisable to preconfigure the secondary site information on the OSS so that the secondary site can take over services when the primary site is down. If secondary site information cannot be preconfigured, change the currently connected IP address to the IP address of the secondary site after switchover in the DR system.
Maintenance Scenarios of the Disaster Recovery System
You need to routinely check and maintain running systems to identify and eliminate potential faults in advance, so that systems can run securely, stably, and reliably for a long time.
Differences Between the Active and Standby Sites
The deployment scheme is consistent between the active and standby sites. However, the operations and services at the two sites are different.
Maintenance Scenarios of the Disaster Recovery System
Maintenance Tasks |
Scheme |
---|---|
During routine maintenance, check whether the products at the secondary site can take over services from the products at the primary site. |
|
|
|
|
|
Modify the arbitration site communication IP address. |
See Geographic Redundancy System Installation to reinstall the arbitration service. |
Table 3-115 describes the operation scenarios in using the DR system when faults occur.
Scenario |
Scheme |
Remarks |
---|---|---|
Key services of products are abnormal and cannot provide services externally. Product hardware is faulty. A disaster occurs at the primary site. The primary site is faulty. Example:
|
|
|
If products at the primary and secondary sites are in the dual-active state, the heartbeat status between the primary and secondary sites is |
||
|
Synchronizing Product Data Between Primary and Secondary Sites |
|
If the heartbeat status between the primary and secondary sites is |
|
|
If data at the primary site is abnormal or lost due to misoperations or external attacks, you need to delete the data synchronization relationship between the primary and secondary sites, restore the data at the primary site, and then forcibly synchronize data between the primary and secondary sites to prevent abnormal data at the primary site from being synchronized to the secondary site. |
- |
DR Status Overview
This section describes the common status and status change principles in a DR system.
Table 3-116 lists DR system status and primary and secondary site product status.
DR System Status | Product DR Status | Description |
---|---|---|
Normal state | Primary site product: active. Secondary site product: standby. | The primary site product provides services. The heartbeat between the primary and secondary sites is normal, and data is synchronized from the primary site product to the secondary site product. The secondary site provides protection for the primary site. |
Switched over state | Primary site product: standby. Secondary site product: active. | The secondary site product provides services. The heartbeat between the primary and secondary sites is normal, and data is synchronized from the secondary site product to the primary site product. The primary site provides protection for the secondary site. |
Fault takeover state | Primary site product: unknown. Secondary site product: active. | The primary site product is faulty, and the secondary site product provides services. |
Dual-active state | Primary site product: active. Secondary site product: active. | Both the primary and secondary site products provide services. |
Dual-standby state | Primary site product: standby. Secondary site product: standby. | Neither the primary site product nor the secondary site product provides services. |
Protection loss state | Primary site product: active. Secondary site product: unknown. | The primary site product provides services. |
System failure state | Primary site product: faulty. Secondary site product: faulty. | Both the primary and secondary site products are faulty. No service is provided. The heartbeat status between the primary and secondary sites and the data synchronization between the primary and secondary site products are abnormal. |
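The mapping in Table 3-116 from the pair of product DR statuses to the DR system status can be written as a lookup table. This is an illustrative sketch with hypothetical names; combinations that the table does not list are reported as "Undefined" here.

```python
# (primary site product status, secondary site product status) -> DR system status,
# transcribed from Table 3-116.
DR_SYSTEM_STATUS = {
    ("active", "standby"): "Normal",
    ("standby", "active"): "Switched over",
    ("unknown", "active"): "Fault takeover",
    ("active", "active"): "Dual-active",
    ("standby", "standby"): "Dual-standby",
    ("active", "unknown"): "Protection loss",
    ("faulty", "faulty"): "System failure",
}


def dr_system_status(primary: str, secondary: str) -> str:
    """Return the DR system status for the given product statuses."""
    return DR_SYSTEM_STATUS.get((primary, secondary), "Undefined")


print(dr_system_status("active", "standby"))
print(dr_system_status("unknown", "active"))
```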
Figure 3-26 shows the changes of DR status between the primary and secondary sites, and Table 3-117 describes the trigger conditions of the changes.
- ←→ indicates that the two statuses can switch between each other.
- → indicates that a status can switch to the other status.
No. | Status Change | Trigger Condition |
---|---|---|
1 | Normal status (the primary site product is active, and the secondary site product is standby) → Dual-standby status (the primary site product is standby, and the secondary site product is standby) | When the heartbeat status between the primary and secondary sites is abnormal, the primary and secondary sites are powered off and then powered on, or the primary site product is switched to standby. The services will be interrupted when the switchover is being performed. Exercise caution. |
2 | Dual-standby status (the primary site product is standby, and the secondary site product is standby) → Normal status (the primary site product is active, and the secondary site product is standby) | The heartbeat status between the primary and secondary sites recovers, but the data synchronization between the primary site product and secondary site product is abnormal. When data is forcibly synchronized from the primary site product to the secondary site product, the data at the secondary site is overwritten. |
3 | Normal status (the primary site product is active, and the secondary site product is standby) → Protection loss status (the primary site product is active, and the secondary site product is faulty) | The secondary site is faulty. |
4 | Normal status (the primary site product is active, and the secondary site product is standby) → Fault takeover status (the primary site product is faulty, and the secondary site product is active after takeover) | When the primary site is powered off or faulty, the secondary site takes over services from the primary site. |
5 | Fault takeover status (the primary site product is faulty, and the secondary site product is active after takeover) → System failure status (the primary site product is faulty, and the secondary site product is faulty) | The primary and secondary sites are faulty. |
6 | Protection loss status (the primary site product is active, and the secondary site product is faulty) → System failure status (the primary site product is faulty, and the secondary site product is faulty) | The primary site becomes faulty after the secondary site is faulty. |
7 | Normal status (the primary site product is active, and the secondary site product is standby) ←→ Switched status (the primary site product is standby, and the secondary site product is active) | The products at the primary and secondary sites are switched over. |
8 | Fault takeover status (the primary site product is faulty, and the secondary site product is active after takeover) → Dual-active status (the primary site product is active, and the secondary site product is active after takeover) | In the manual switchover or automatic switchover scenario (without the arbitration service), when the heartbeat status between the primary and secondary sites is abnormal, the primary site recovers after faults. |
9 | Dual-active status (the primary site product is active, and the secondary site product is active after takeover) → Normal status (the primary site product is active, and the secondary site product is standby) | After forcible data synchronization is performed between the primary and secondary site products, product data is synchronized from the primary site to the secondary site, and the data at the secondary site is overwritten. |
10 | Dual-active status (the primary site product is active, and the secondary site product is active after takeover) → Switched status (the primary site product is standby, and the secondary site product is active) | After forcible data synchronization is performed between the primary and secondary site products, product data is synchronized from the secondary site to the primary site, and the data at the primary site is overwritten. |
11 | Dual-standby status (the primary site product is standby, and the secondary site product is standby) → Switched status (the primary site product is standby, and the secondary site product is active) | After forcible data synchronization is performed between the primary and secondary site products, product data is synchronized from the secondary site to the primary site, and the data at the primary site is overwritten. |
12 | Dual-standby status (the primary site product is standby, and the secondary site product is standby) → Protection loss status (the primary site product is active, and the secondary site product is faulty) | If the heartbeat status between the primary and secondary sites is abnormal, the DR system is abnormal after the DR service at the active site is restarted. |
13 | Normal status (the primary site product is active, and the secondary site product is standby) → Dual-active status (the primary site product is active, and the secondary site product is active after automatic takeover) | In the automatic switchover scenario (without the arbitration service), the heartbeat status between the primary and secondary sites is abnormal. |
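The status changes in Table 3-117 form a small directed graph. The sketch below encodes the 13 rows by number so that the statuses reachable from a given status can be listed; the names are hypothetical, the trigger conditions are omitted, and only row 7 is treated as bidirectional, as in the table.

```python
from typing import Dict, List, Tuple

# Row number -> (from status, to status), transcribed from Table 3-117.
TRANSITIONS: Dict[int, Tuple[str, str]] = {
    1: ("Normal", "Dual-standby"),
    2: ("Dual-standby", "Normal"),
    3: ("Normal", "Protection loss"),
    4: ("Normal", "Fault takeover"),
    5: ("Fault takeover", "System failure"),
    6: ("Protection loss", "System failure"),
    7: ("Normal", "Switched"),
    8: ("Fault takeover", "Dual-active"),
    9: ("Dual-active", "Normal"),
    10: ("Dual-active", "Switched"),
    11: ("Dual-standby", "Switched"),
    12: ("Dual-standby", "Protection loss"),
    13: ("Normal", "Dual-active"),
}
BIDIRECTIONAL = {7}  # rows marked with the two-way arrow


def next_states(state: str) -> List[str]:
    """Return the statuses directly reachable from the given status, sorted by name."""
    reachable = set()
    for row, (src, dst) in TRANSITIONS.items():
        if src == state:
            reachable.add(dst)
        if row in BIDIRECTIONAL and dst == state:
            reachable.add(src)
    return sorted(reachable)


print(next_states("Normal"))
print(next_states("System failure"))
```

In this encoding the system failure status has no outgoing transition, which matches the table: none of the 13 rows starts from it.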
Table 3-118 describes the status of the heartbeat between sites, product DR status, and data replication status in a DR system.
Monitoring Item | Description |
---|---|
Heartbeat status | |
DR status | Status of products with the DR relationship |
Data synchronization status | |
Creating or Deleting a DR System
You can delete the DR system if it is no longer required or if you need to modify configurations that affect it. You can create a DR system again when it is needed or after those configurations have been modified.
Checking Before Configuring the DR System
Before configuring the DR system, check that the configurations of the primary and secondary sites meet the requirements to ensure that the DR system can be configured.
Prerequisites
Before configuring the DR system, check that the configurations of the primary and secondary sites meet the following requirements.
- Deployment solution requirements:
- Select the DR system as the protection site to ensure the DR services are the same.
- The deployment schemes (such as product version, Product, component solution) of the two sites must be the same. When the deployment schemes of the two sites are the same, the software packages, number and specifications of the nodes at the two sites, management plane and product languages, versions, services, service versions, the root certificate of the DR system certificates, root keys, working keys, and the database user passwords of the product are the same by default.
- Configure the NTP server for the primary and secondary sites to ensure the UTC time consistency. The time zone for the primary site can be different from that for the secondary site. Set the time zone according to the actual situation.
- A CA certificate is dynamically generated when the management plane is installed, so you must update the CA certificate to ensure that the CA certificates at the two sites are consistent. For details, see Uploading and Updating CA Certificates (DR Scenario).
- Networking requirements:
- The inter-node communication IP addresses of the primary and secondary sites can communicate with each other properly.
- The IP addresses of all nodes, excluding the NTP servers, at the primary and secondary sites are different. The IP address version must be the same. If IPv6 addresses are planned, the primary and secondary sites must use IPv6 addresses.
- The bandwidth between the primary and secondary sites must meet the requirements. For details about bandwidth requirements, see "HLD > Bandwidth Planning" in EasySuite.
- iMaster NCE-Campus requirements: All services and database instances at the primary and secondary sites are normal. For details, see System Monitoring.
- The primary and secondary sites use different IP addresses to interconnect with an OSS. It is advisable to preconfigure the secondary site information on the OSS so that the secondary site can take over services when the primary site is down. If secondary site information cannot be preconfigured, change the currently connected IP address to the IP address of the secondary site after switchover in the DR system.
- In the Centralized deployment scenario, if the floating IP address for connecting to the northbound network is configured on the management node, ensure that the NIC for the floating IP address for connecting to the northbound network at the secondary site is disabled.
- Use PuTTY to log in to the management node at the secondary site as the ossadm user in SSH mode.
- Run the following command to query the NIC usage of the management node:
> ifconfig
- If the query result contains the NIC for the floating IP address for connecting to the northbound network (for example, bond0:1), the NIC is not disabled. Run the following command to disable the NIC:
> ifconfig bond0:1 down
- Run the following command again to query the NIC usage of the management node. The query result does not contain the NIC for the floating IP address for connecting to the northbound network.
> ifconfig
- You have logged in to the management plane at the primary or secondary site. For details, see Logging In to the Management Plane.
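The IP address planning requirement in the networking prerequisites above (node addresses differ between the two sites, and both sites use the same IP version) can be checked before configuration with a short script. This is a minimal sketch using Python's standard `ipaddress` module; the function name and the sample addresses are hypothetical, and NTP server addresses are assumed to have been excluded by the caller.

```python
import ipaddress
from typing import List


def check_ip_plan(primary_ips: List[str], secondary_ips: List[str]) -> List[str]:
    """Return a list of problems found in the planned node addresses (empty if none)."""
    problems = []
    primary = [ipaddress.ip_address(a) for a in primary_ips]
    secondary = [ipaddress.ip_address(a) for a in secondary_ips]

    # Both sites must use the same IP version.
    versions = {a.version for a in primary + secondary}
    if len(versions) > 1:
        problems.append("mixed IPv4 and IPv6 addresses")

    # No node address may be used at both sites.
    for addr in sorted(set(primary) & set(secondary), key=str):
        problems.append(f"address used at both sites: {addr}")
    return problems


print(check_ip_plan(["10.0.0.1", "10.0.0.2"], ["10.0.1.1", "10.0.0.2"]))
```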
Context
The precheck verifies that:
- The communication of the heartbeat links and data replication link between the primary and secondary sites is normal.
If the primary and secondary sites can communicate with each other, the heartbeat links between the two sites must be normal and the DR system certificates are consistent between the two sites.
- The node quantity is consistent between the primary and secondary sites.
- Services and their versions are consistent between the primary and secondary sites.
- The CA certificate is consistent between the primary and secondary sites.
- The root key and working keys are consistent between the primary and secondary sites, respectively.
- The password for the same database instance user is consistent between the primary and secondary sites.
- All database instances and local replication status are normal at the primary and secondary sites.
- The language of the management plane is consistent between the primary and secondary sites.
- The time difference between the primary and secondary sites is less than 1 minute.
- The enabling status of the SSL protocol of the local master and slave database instances is consistent between the primary and secondary site products.
- The installation interval between the primary and secondary sites is less than 7 days.
During initial DR creation, if the installation interval between the primary and secondary sites is greater than 7 days, the site installed later cannot be used as the primary site. If the site installed later is used as the primary site, the system displays the message "The newly installed cluster cannot be used as the primary cluster." This prevents data loss of the original cluster when the new cluster is used as the primary cluster in the site reconstruction scenario.
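Two of the time-based precheck items above (clock difference under 1 minute, and the 7-day installation interval rule for choosing the primary site) can be sketched as follows. The function names and sample timestamps are hypothetical. The text does not say how an interval of exactly 7 days is handled, so treating it as allowed is an assumption.

```python
from datetime import datetime, timedelta

MAX_CLOCK_SKEW = timedelta(minutes=1)       # time difference limit from the list above
MAX_INSTALL_INTERVAL = timedelta(days=7)    # installation interval limit from the list above


def clock_skew_ok(primary_utc: datetime, secondary_utc: datetime) -> bool:
    """Return True if the two site clocks differ by less than 1 minute."""
    return abs(primary_utc - secondary_utc) < MAX_CLOCK_SKEW


def may_be_primary(candidate_installed: datetime, peer_installed: datetime) -> bool:
    """Return True if the candidate site may act as the primary site during initial DR creation."""
    if candidate_installed <= peer_installed:
        return True  # the earlier-installed site is never restricted
    # Assumption: an interval of exactly 7 days is still accepted.
    return candidate_installed - peer_installed <= MAX_INSTALL_INTERVAL


old_site = datetime(2024, 1, 1, 8, 0, 0)
new_site = datetime(2024, 1, 10, 8, 0, 0)   # installed 9 days later
print(may_be_primary(new_site, old_site))   # later site, interval over 7 days
print(may_be_primary(old_site, new_site))
```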
Precautions
- Perform operations provided in this section only at one site.
- If a backup or restoration task is in progress, perform operations in this section after the task is complete. Otherwise, the task or the DR operation may fail.
Procedure
- On the management plane, choose HA > Remote High Availability System Management > Manage DR System from the main menu.
- On the Manage Remote DR System page, click Configure DR System.
- Refer to Table 3-119 to configure the site.
Table 3-119 DR system parameters
Parameter | Description |
---|---|
Site name | Identifier of a site. |
Heartbeat IP address | IP address of the management node used for heartbeat communication, that is, IP address of the management nodes in the DR network. In the distributed scenario, set the heartbeat IP addresses to the IP address in the DR network of OMP_01 and OMP_02 nodes at the primary and secondary sites respectively. In the centralized scenario, set the heartbeat IP addresses to the IP address in the DR network of NMS_Server node at the primary and secondary sites respectively. NOTE: The DR network can reuse the inter-node communication network or northbound network or use an independent network. The heartbeat IP address and replication IP address must be on the same network plane. Obtain the heartbeat IP address based on the site requirements. |
- Click Add Product. Select the primary and secondary site products, and the data replication direction. Click Precheck.
- If the precheck succeeds, the configuration requirements for the DR system are met. Click Save Draft.
- If the precheck fails, resolve the problems as prompted.
Configuring the DR System
You do not need to apply for a license for the secondary cluster. After the database synchronization is successful, the license of the primary cluster is automatically synchronized to the secondary cluster. After an active/standby switchover, you do not need to import a license to the original standby cluster.
Configuring the DR System (Manual Switchover)
This section describes how to create a DR system using the primary and secondary sites.
Prerequisites
- The check before configuring the DR system has been performed. For details, see Checking Before Configuring the DR System.
- You have logged in to the management plane of the primary or secondary site. For details, see Logging In to the Management Plane.
Precautions
- Perform operations provided in this section only at one site.
- The database status needs to be updated after the DR system is deleted. You are advised to create a DR system 5 minutes after the old one is deleted. Otherwise, the operation may fail.
- After a DR system is created, you can change the site names or heartbeat IP addresses only by deleting the DR system and creating a new one based on the planning information.
- If a backup or restoration task is in progress, perform operations in this section after the task is complete. Otherwise, the task or the DR operation may fail.
- During the configuration or deletion of the DR system, DR switchover, or forcible product data synchronization, the database at the secondary site is restarted. As a result, the management plane at the secondary site reports the alarm "GaussDB T V3 process has not started". After the DR operation is complete, the alarm is automatically cleared.
Procedure
- On the management plane, choose HA > Remote High Availability System > Manage DR System from the main menu.
- On the Manage DR System page, click Configure DR System.
- Click the Set Health Check Parameters tab. Enable scheduled evaluation is selected by default.
Daily start time defaults to 07:00, which means that the scheduled task performs a health check on the DR system at 07:00 every day. You can modify the value based on site requirements.
- Perform operations as prompted. Note that the manual switchover solution does not require DR Extended Configuration.
After the primary and secondary sites are associated, services at the primary site are still running, and some services at the secondary site are stopped.
- Check the operation result.
- On the management plane, choose HA > Remote High Availability System > Manage DR System from the main menu.
- On the Manage DR System page, verify that the heartbeat status between the primary and secondary sites is normal.
- On the Manage DR System page, verify that Data Synchronization Status of all products is Synchronized or Synchronizing. If Data Synchronization Status is Delayed, a large volume of data is being synchronized between the primary and secondary sites. Check the status after data synchronization is complete.
- Verify that you can log in to the service plane at the active site. For details, see Logging In to service plane.
- If the operation result is not as expected, contact Huawei technical support.
- (Optional) Manually start processes or change the process startup type based on site requirements.
- If Startup Type of a process is Manual, perform the following operations:
- On the management plane, choose Product > System Monitoring from the main menu.
- In the upper left corner of the System Monitoring page, move the pointer to the icon and select the product.
- On the Processes tab page, select the process to be started, click Start in the upper right corner of the process list, and perform operations as prompted.
- For details about how to change the startup type of a process, see Configuring Process Startup Types.
Configuring the DR System (Automatic Switchover, Without Arbitration service, Centralized)
This section describes how to create a DR system using the primary and secondary sites.
Prerequisites
- The check before configuring the DR system has been performed. For details, see Checking Before Configuring the DR System.
- You have logged in to the management plane of the primary or secondary site. For details, see Logging In to the Management Plane.
Precautions
- Perform operations provided in this section only at one site.
- The database status needs to be updated after the DR system is deleted. You are advised to create a DR system 5 minutes after the old one is deleted. Otherwise, the operation may fail.
- After a DR system is created, you can change the site names or heartbeat IP addresses only by deleting the DR system and creating a new one based on the planning information.
- If a backup or restoration task is in progress, perform operations in this section after the task is complete. Otherwise, the task or the DR operation may fail.
- During the configuration or deletion of the DR system, DR switchover, or forcible product data synchronization, the database at the secondary site is restarted. As a result, the management plane at the secondary site reports an alarm indicating that the GaussDB T V3 process has not started. The alarm is automatically cleared after the DR operation is complete.
Procedure
- On the management plane, choose HA > Remote High Availability System > Manage DR System from the main menu.
- On the Manage DR System page, click Configure DR System.
- Click the Set Health Check Parameters tab page. Enable scheduled evaluation is turned on by default.
Daily start time defaults to 07:00, which means that the scheduled task performs a health check on the DR system at 07:00 every day. You can change the value based on site requirements.
- On the DR Extended Configuration tab page, click the Automatic Switch button to enable the function.
- On the page that is displayed, set the type of arbitration service to No Arbitration Switching, set heartbeat parameters such as Switching Hold-off Time according to the product, and click OK.
Configure Heartbeat: The default time for heartbeat interruption detection plus switchover delay is 5 minutes, which is calculated as follows: Heartbeat Interval (10 seconds by default) x Number of Heartbeat Timeouts (18 by default) + Switching Hold-off Time (2 minutes by default), that is, 180 seconds + 120 seconds = 300 seconds. You can adjust the time based on the actual situation.
- Perform operations as prompted.
After the primary and secondary sites are associated, services at the primary site are still running, and some services at the secondary site are stopped.
- Check the operation result.
- On the management plane, choose HA > Remote High Availability System > Manage DR System from the main menu.
- On the Manage DR System page, verify that the heartbeat status between the primary and secondary sites is normal.
- On the Manage DR System page, verify that Data Synchronization Status of all products is Synchronized or Synchronizing. If Data Synchronization Status is Delayed, a large volume of data is being synchronized between the primary and secondary sites. Check the status after data synchronization is complete.
- Verify that you can log in to the service plane at the active site. For details, see Logging In to service plane.
- If the operation result is not as expected, contact Huawei technical support.
- (Optional) Manually start processes or change the process startup type based on site requirements.
- If Startup Type of a process is Manual, perform the following operations:
- On the management plane, choose Product > System Monitoring from the main menu.
- In the upper left corner of the System Monitoring page, move the pointer to the icon and select the product.
- On the Processes tab page, select the process to be started, click Start in the upper right corner of the process list, and perform operations as prompted.
- For details about how to change the startup type of a process, see Configuring Process Startup Types.
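As a worked check of the default delay described under Configure Heartbeat in the procedure above, the following minimal sketch recomputes the value from its three inputs. The function name switchover_delay_seconds is hypothetical and is not a product command; the hold-off time is assumed to be expressed in seconds.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of iMaster NCE-Campus): computes
# Heartbeat Interval x Number of Heartbeat Timeouts + Switching Hold-off Time.
switchover_delay_seconds() {
    local interval_s=$1   # Heartbeat Interval, in seconds
    local timeouts=$2     # Number of Heartbeat Timeouts
    local holdoff_s=$3    # Switching Hold-off Time, in seconds (assumed unit)
    echo $(( interval_s * timeouts + holdoff_s ))
}

# Defaults from the procedure: 10 s x 18 + 120 s
switchover_delay_seconds 10 18 120   # prints 300, that is, 5 minutes
```

Changing any of the three values on the DR Extended Configuration tab page shifts the total accordingly, so the same arithmetic can be used to estimate the delay before applying a new setting.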
Deleting the DR System
If the DR system is no longer required or configurations which affect the DR system are modified, you can delete the DR system.
Prerequisites
You have logged in to the management plane at the primary and secondary sites. For details, see Logging In to the Management Plane.
Precautions
- After the primary and secondary sites are deleted, the DR relationship between products at the two sites is deleted. In this case, the secondary site no longer provides DR protection for the primary site, and data is no longer synchronized between the two sites. However, the data at the two sites is not deleted.
- After the primary and secondary sites are deleted, the historical health check records of the sites are deleted.
- If a backup or restoration task is in progress, perform operations in this section after the task is complete. Otherwise, the task or the DR operation may fail.
- During the configuration or deletion of the DR system, DR switchover, or forcible product data synchronization, the database at the secondary site is restarted. As a result, the management plane at the secondary site reports an alarm indicating that the GaussDB T V3 process has not started. The alarm is automatically cleared after the DR operation is complete.
Procedure
- On the management plane, choose HA > Remote High Availability System > Manage DR System from the main menu.
- In the upper right corner, click Delete.
- Perform operations as prompted.
If the heartbeat status between the primary and secondary sites is normal, the DR relationship of both sites will be deleted. If the heartbeat status is abnormal, the DR relationship of the peer site cannot be deleted, and therefore you need to perform the operation at both sites.
- Check the operation result. If the operation result is not as expected, contact Huawei technical support.
- On the management plane, choose HA > Remote High Availability System > Manage DR System from the main menu.
- On the Manage Remote DR System page, verify that the DR system that has been deleted no longer exists.
- Verify that you can log in to the service plane at the active site. For details, see Logging In to service plane.
- On the Task List page, if the task details indicate that the DR system and product deletion is partially successful, the DR system may be deleted when a product node is powered off or abnormal. In this case, you need to manually clear the DR system information on the product node after the node is restored. Otherwise, the services on the node will be abnormal in a non-DR scenario. For details, see "Clearing the DR Information of the Product Nodes After the DR System Is Deleted" in Troubleshooting Guide.
- Back up the product data and the management plane, because the historical backup files become invalid after the DR system is deleted. For details, see Backing Up Products and Backing Up the Management Plane.
If you need to create a DR system again, back up the product data and the management plane after the DR system is created.
Follow-up Procedure
After the DR system is deleted, the product services at the primary and secondary sites are still in the state before the deletion.
- In the distributed scenario, do not start services at the secondary site. Otherwise, products at the primary and secondary sites enter the dual-active state, which causes service preemption and repeated service provisioning. If you need to start services at the secondary site, do not provision services at the primary or secondary site while the secondary site is starting.
- In the centralized scenario, if the floating IP address for northbound interconnection is configured for the primary and secondary sites and the services of the products at the secondary site are manually started, the floating IP address for northbound interconnection may conflict. Manually disable the NIC of the floating IP address for northbound interconnection at the secondary site.
- Use PuTTY to log in to the management node at the secondary site as the sopuser user in SSH mode.
- Run the following command to switch to the root user:
> su - root
Password: password for the root user
- Run the following command to query the NIC usage of the management node:
# ifconfig
- If the query result contains the NIC for the floating IP address for connecting to the northbound network (for example, bond0:0), the NIC is not disabled. Run the following command to disable the NIC:
# ifconfig bond0:0 down
- Run the following command again to query the NIC usage of the management node, and verify that the query result no longer contains the NIC for the floating IP address for connecting to the northbound network.
# ifconfig
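The check in the steps above can also be scripted. The following is a minimal sketch, assuming the floating IP address is bound to the alias bond0:0 as in the example. The function name nic_is_listed is hypothetical and is not part of the product. It only inspects ifconfig output read from standard input, so the decision to disable the NIC stays with the operator.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of iMaster NCE-Campus): succeeds if the
# interface named in $1 appears as an entry in ifconfig output read from stdin.
nic_is_listed() {
    # Interface entries start in column 1; detail lines are indented.
    # Newer ifconfig versions append ":" to the name, so strip one trailing colon.
    awk -v nic="$1" '
        /^[^ \t]/ { name = $1; sub(/:$/, "", name); if (name == nic) found = 1 }
        END { exit !found }
    '
}

# Example (run as root on the management node at the secondary site):
#   if ifconfig | nic_is_listed bond0:0; then ifconfig bond0:0 down; fi
```

Replace bond0:0 with the NIC name that actually carries the floating IP address for northbound interconnection at your site.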
Connecting the Primary and Secondary Site Products
After modifying configurations that affect the DR system, re-establish the product DR relationship between the primary and secondary sites.
Prerequisites
- The planning requirements for the primary and secondary sites are met. For details, see Checking Before Configuring the DR System.
- You have logged in to the management plane at the primary or secondary site. For details, see Logging In to the Management Plane.
- All services and database instances at the primary and secondary sites are running properly. For details, see System Monitoring.
- The heartbeat status between the primary and secondary sites is normal.
Precautions
- Perform operations provided in this section only at one site.
- The database status needs to be updated after the DR system is deleted. You are advised to create a DR system 5 minutes after the old one is deleted. Otherwise, the operation may fail.
- If a backup or restoration task is in progress, perform operations in this section after the task is complete. Otherwise, the task or the DR operation may fail.
- In the automatic switchover (with the arbitration service) scenario, after the DR relationship between the primary and secondary site products is deleted, the product services and databases at the primary and secondary sites are still in the active or standby state before the deletion. To prevent the dual-active problem caused by the conflict between the status of the new product and that of the product before the deletion, disable the automatic switchover function before re-establishing the DR relationship between the primary and secondary site products. After a product is added, enable the automatic switchover function again.
Procedure
- On the management plane, choose HA > Remote High Availability System > Manage DR System from the main menu.
- In the automatic switchover (with the arbitration service) scenario, you need to disable the automatic switchover function.
- On the Manage Remote DR System page, click Add Product.
- Perform operations as prompted.
- Check the operation result.
- On the management plane, choose HA > Remote High Availability System > Manage DR System from the main menu.
- On the Manage Remote DR System page, verify that the heartbeat status between the primary and secondary sites is normal.
- On the Manage Remote DR System page, verify that Data Synchronization Status of all products is Synchronized or Synchronizing. If Data Synchronization Status is Delayed, a large volume of data is being synchronized between the primary and secondary sites. Check the status after data synchronization is complete.
- Verify that you can log in to the service plane at the active site. For details, see Logging In to service plane.
- If the operation result is not as expected, contact Huawei technical support.
- In the automatic switchover (with the arbitration service) scenario, click the DR Extended Configuration tab and turn on Automatic Switch to enable the automatic switchover function. Restore the DR extension configuration by referring to 2.b.
Separating the Primary and Secondary Site Products
If the DR protection for the product is no longer required, or configuration that affects the product DR function is modified, you can separate the primary and secondary site products.
Prerequisites
You have logged in to the management plane of the primary or secondary site. For details, see Logging In to the Management Plane.
Precautions
- After the primary and secondary site products are separated, products at the primary and secondary sites are independent from each other. The secondary site product no longer provides DR protection for the primary site product and data is not synchronized between the two sites. However, the product data at the two sites is not deleted.
- If a backup or restoration task is in progress, perform operations in this section after the task is complete. Otherwise, the task or the DR operation may fail.
Procedure
- On the management plane, choose HA > Remote High Availability System > Manage DR System from the main menu.
- In the row of the product whose DR relationship is to be deleted, click the corresponding icon.
- Perform operations as prompted.
If the heartbeat status between the primary and secondary sites is normal, the operation at a site will synchronously delete the product from the DR system at the peer site. If the heartbeat status is abnormal, the operation cannot delete the product from the DR system at the peer site, and therefore you need to perform the operation at both sites.
- Check the operation result. If the operation result is not as expected, contact Huawei technical support.
- On the management plane, choose HA > Remote High Availability System > Manage DR System from the main menu.
- On the Manage Remote DR System page, verify that the heartbeat status between the primary and secondary sites is normal.
- On the Manage Remote DR System page, check that the deleted product does not exist.
- Verify that you can log in to the service plane of the active site. For details, see Logging In to service plane.
Follow-up Procedure
To prevent the dual-active state, the product services and databases remain in the state before the deletion after the primary and secondary site products are separated. If you need to start the services at the secondary site, log in to the secondary site and perform the operations there. For details, see Starting the Service Plane.
Deleting the Data Synchronization Relationship Between Products at the Primary and Secondary Sites
If data exceptions occur at the primary site product, you can stop data replication from the primary site product to the secondary site product to prevent data exceptions at the secondary site.
Prerequisites
- You have logged in to the management plane of the primary or secondary site. For details, see Logging In to the Management Plane.
- The heartbeat status between the primary and secondary sites is normal.
Precautions
- Perform operations provided in this section only at one site.
- After the data synchronization relationship between the primary and secondary site products is deleted, the data at the two sites still exists, but the data, excluding HFS files, can no longer be synchronized. To delete the synchronization relationship for HFS file data between the primary and secondary sites, you need to separate the primary and secondary site products or delete the DR system. For details, see Separating the Primary and Secondary Site Products and Deleting the DR System.
- If a backup or restoration task is in progress, perform operations in this section after the task is complete. Otherwise, the task or the DR operation may fail.
Procedure
- On the management plane, choose HA > Remote High Availability System > Manage DR System from the main menu.
- On the Manage Remote DR System page, verify that Data Synchronization Status of all products is Synchronized or Synchronizing. If Data Synchronization Status is Delayed, a large volume of data is being synchronized between the primary and secondary sites. Check the status after data synchronization is complete.
- In the row of the product whose data synchronization relationship is to be deleted, click the corresponding icon.
- Perform operations as prompted.
- Check the operation result. If the operation result is not as expected, contact Huawei technical support.
- On the management plane, choose HA > Remote High Availability System > Manage DR System from the main menu.
- On the Manage Remote DR System page, verify that the heartbeat status between the primary and secondary sites is normal.
- On the Manage Remote DR System page, verify that Data Synchronization Status of the products is Abnormal.
- Verify that you can log in to the service plane of the active site. For details, see Logging In to service plane.
Follow-up Procedure
After the fault of the primary site product is rectified, restore the data synchronization relationship between the primary and secondary site products based on site requirements.
- If data generated on the products at the primary site does not affect services, fully synchronize data from the primary site product to the secondary site product. For details, see Synchronizing Product Data Between Primary and Secondary Sites.
- If data generated on the products at the primary site affects services, fully synchronize data from the secondary site product to the primary site product. For details, see Synchronizing Product Data Between Primary and Secondary Sites. After the synchronization, restore the products at the two sites to the status before the faults. For details, see Performing DR System Drills.
Freezing Products to Disable Automatic Service Startup
By default, the DR system checks the startup status of all active site services and some of the product services of the standby site every 5 minutes. The system starts these services if they are stopped, which ensures that these services are running. When you need to stop services for maintenance or fault diagnosis, freeze the product so that the DR system will not automatically start the services that are not running. To ensure the proper running of product services, keep the product unfrozen.
Prerequisites
You have logged in to the management plane of the active or standby sites. For details, see Logging In to the Management Plane.
Precautions
- When the heartbeat status between the active and standby sites is normal, if you change the product freezing status of a site, the freezing status of the product at the peer site is changed accordingly. When the heartbeat status is abnormal, the change of the freezing status takes effect only on the selected product at the current site. The freezing status of the product at the peer site remains unchanged.
- Perform operations provided in this section only at one site.
Procedure
- On the management plane, choose HA > Remote High Availability System > Manage DR System from the main menu.
- In the Frozen column of the row that contains the product, perform operations by referring to Table 3-120.
Table 3-120 Freezing status
Status |
Description |
---|---|
Unfrozen |
The product is unfrozen. The DR system checks the service status of the product at the active and standby sites every 5 minutes, and starts all services at the active site and part of the system services at the standby site if they are not running. |
Frozen |
The product is frozen. The DR system does not check the service status of the product at the active and standby sites. After the product node is powered off and then powered on, the product services will not be automatically started. For details about how to manually start the product services, see Starting the Service Plane. NOTICE: In the scenario where the management node and product node are the same node, if the node of a frozen product is powered off and then powered on, the DR system starts a task that automatically restores the product DR status, and the services of the frozen product will be started. |
- Perform operations as prompted.
Routine Maintenance
Through routine maintenance, you can detect and rectify the potential faults to ensure the secure, stable, and reliable running of the DR system.
Performing DR System Drills
After the DR relationship between the primary and secondary sites is established, you can perform a switchover between the primary and secondary sites to check whether the secondary site can take over services from the primary site. Because services of the product node are restarted during the drill, you are advised to perform this operation during off-peak hours.
Prerequisites
- You have logged in to the management plane of the primary or secondary site. For details, see Starting the Management Plane.
- The heartbeat status between the primary and secondary sites is normal, and the data synchronization status of all products is Synchronized or Synchronizing.
Precautions
- Perform operations provided in this section only at one site.
- During the switchover, if the data synchronization status is Synchronizing, data of the product is being synchronized between the primary and secondary sites. The system waits for the data synchronization to complete and then performs the switchover. If the waiting times out, the switchover fails, that is, the primary site still functions as the active site and the secondary site functions as the standby site.
- If a backup or restoration task is in progress, perform operations in this section after the task is complete. Otherwise, the task or the DR operation may fail.
- During the configuration or deletion of the DR system, DR switchover, or forcible product data synchronization, the database at the secondary site is restarted. As a result, the management plane at the secondary site reports an alarm indicating that the GaussDB T V3 process has not started. The alarm is automatically cleared after the DR operation is complete.
Procedure
- On the management plane, choose HA > Remote High Availability System > Manage DR System from the main menu.
- On the Manage DR System page, perform operations as required.
- DR system drill for a single product
In the Operation column of the row that contains the product, click the switchover icon and perform operations as prompted.
- DR system drill for all products
Select all products, click Switch Over above the product list, and perform operations as prompted.
During the data synchronization, the database status on the System Monitoring page of the standby site may be Abnormal. After the data synchronization is complete, the database status changes to Normal.
- Check the operation result. If the operation result is not as expected, contact Huawei technical support.
- On the management plane, choose HA > Remote High Availability System > Manage DR System from the main menu.
- On the Manage DR System page, verify that the heartbeat status between the primary and secondary sites is normal.
- On the Manage DR System page, view information in the Primary Site Product and Secondary Site Product columns and verify that the product DR status is consistent with the switchover result.
- On the Manage DR System page, verify that Data Synchronization Status of the switched products is Synchronized or Synchronizing. If Data Synchronization Status is Delayed, a large volume of data is being synchronized between the primary and secondary sites. Check the status after data synchronization is complete.
- Verify that you can log in to the service plane at the active site and the menus can be displayed properly. For details, see Logging In to service plane.
- Perform a switchover again to restore the status of the original active and standby sites.
- (Optional) Manually start processes or change the process startup type based on site requirements.
- If Startup Type of a process is Manual, perform the following operations:
- On the management plane, choose Product > System Monitoring from the main menu.
- In the upper left corner of the System Monitoring page, move the pointer to the icon and select the product.
- On the Processes tab page, select the process to be started, click Start in the upper right corner of the process list, and perform operations as prompted.
- For details about how to change the startup type of a process, see Configuring Process Startup Types.
Checking Health for Primary and Secondary Sites
To ensure the stable running of the DR system, the system periodically checks the health status of the DR system. You can also manually check for abnormal items. You are advised to perform a health check on the DR system before performing a switchover between the primary and secondary site products to prevent a switchover failure due to system exceptions.
Prerequisites
- You have logged in to the management plane. For details, see Logging In to the Management Plane.
- The communication of the heartbeat links and data replication link between the primary and secondary sites is normal.
Context
- After the DR system is created, a scheduled task for the health check is automatically created on the management plane of the secondary site and is executed at 07:00:00 every day to perform the health check. For details about how to change the execution time of the scheduled task, see Modifying DR System Parameters.
- If a check item is abnormal, the DR system checks all items every hour until they are normal or until the next scheduled health check starts.
Procedure
- On the management plane, choose HA > Remote High Availability System > Manage DR System from the main menu.
- On the Manage Remote DR System page, click Evaluate Health and perform operations as prompted.
- Rectify the items whose Result is Abnormal based on the suggestions.
Modifying DR System Parameters
You can use this function to modify parameters, such as the execution time of scheduled health check tasks, of the DR system.
Prerequisites
You have logged in to the management plane. For details, see Logging In to the Management Plane.
Precautions
If you need to change the name or heartbeat IP address of the primary and secondary sites, or change the product deployed at the two sites, you need to delete the existing DR system and create a DR system again. For details, see Deleting the DR System and Configuring the DR System (Manual Switchover).
Procedure
- On the management plane at the primary site, choose HA > Remote High Availability System > Manage DR System from the main menu.
- On the Manage Remote DR System page, click Edit and perform operations as prompted.
Fault Maintenance
This section describes how to quickly rectify faults in a DR system to improve O&M efficiency.
Taking Over Faulty Products
If the primary site product is faulty and cannot provide services externally, you can take over services from the primary site product to reduce losses caused by the fault.
Prerequisites
You have logged in to the management plane of the secondary site. For details, see Logging In to the Management Plane.
Context
After the takeover, the product status at both sites changes. As shown in Figure 3-28, when the primary site product is faulty, the secondary site product takes over its services and changes from standby to active.
Precautions
- Service takeover can be performed only on products in the Standby state. After the takeover is successful, products at the standby site take over services of the products at the active site and provide services externally.
- During the takeover, if the data synchronization status is not Synchronized, the takeover may cause product data loss.
Procedure
- On the management plane, choose HA > Remote High Availability System > Manage DR System from the main menu.
- On the Manage Remote DR System page, perform operations as required.
- If a single product is faulty:
In the Operation column of the row that contains the product, click the takeover icon and perform operations as prompted.
- If multiple products are faulty:
Select the products to be taken over, click Take Over above the product list, and perform operations as prompted.
- Check the operation result. If the operation result is not as expected, contact Huawei technical support.
If the heartbeat status between the primary and secondary sites is abnormal, or the primary site product fails to become standby during the takeover, the DR system enters the dual-active state, which may cause data loss. After the heartbeat status between the primary and secondary sites is restored, you need to forcibly synchronize data between the products at the primary and secondary sites to ensure data consistency between the products at the primary and secondary sites. For details, see Synchronizing Product Data Between Primary and Secondary Sites.
- On the management plane, choose HA > Remote High Availability System > Manage DR System from the main menu.
- On the Manage Remote DR System page, check that the DR status of the product that took over services in step 2 is Active or Active after takeover.
- Verify that you can log in to the service plane of the new active site. For details, see Logging In to service plane.
- (Optional) Manually start processes or change the process startup type based on site requirements.
- If Startup Type of a process is Manual, perform the following operations:
- On the management plane, choose Product > System Monitoring from the main menu.
- In the upper left corner of the System Monitoring page, move the pointer to the icon and select the product.
- On the Processes tab page, select the process to be started, click Start in the upper right corner of the process list, and perform operations as prompted.
- For details about how to change the startup type of a process, see Configuring Process Startup Types.
Follow-up Procedure
After the takeover is complete, restore the DR system according to the system status. For details, see Table 3-121.
Symptom |
Possible Causes |
Measures |
---|---|---|
The heartbeat status between the primary and secondary sites is |
The primary site product has recovered from faults. |
On the Manage Remote DR System page, click |
The heartbeat status between the primary and secondary sites is |
|
|
The heartbeat status between the primary and secondary sites is |
|
If the heartbeat status between the primary and secondary sites is abnormal, or the products at the active site fail to become standby during the takeover, the product becomes dual-active after the takeover. This may cause data loss. You can perform the following operations to rectify the dual-active state.
|
A catastrophic fault occurs at the primary site. |
|
Abnormal DR System Heartbeat Status
Fault Symptoms
On the Manage Remote DR System page of the management plane, the heartbeat status is Abnormal or Unknown.
Figure 3-30 shows the DR heartbeat list. You can expand the faulty heartbeat list to view the faulty node.
Fault Symptom |
Heartbeat Function |
Possible Cause and Handling Measure |
---|---|---|
Abnormal |
dc1 and dc2 (manual switchover heartbeat between primary and secondary sites) |
For details, see Abnormal DR System Heartbeat Status Between the Primary and Secondary Sites. |
Abnormal |
dc1 and dc2 (automatic switchover heartbeat between primary and secondary sites) |
For details, see Abnormal DR System Heartbeat Status Between the Primary and Secondary Sites. |
Abnormal |
dc1 and dc2 (arbitration heartbeat between primary and secondary sites) |
For details, see Symptom 1: Abnormal Arbitration Heartbeat Between the Primary and Secondary Sites. |
Abnormal |
dc1 and the third-party arbitration site (heartbeat between the primary site and the arbitration site) |
For details, see Symptom 2: Abnormal Arbitration Heartbeat Between the Primary and Third-Party Sites. |
Abnormal |
dc2 and the third-party arbitration site (heartbeat between the secondary site and the arbitration site) |
For details, see Symptom 3: Abnormal Arbitration Heartbeat Between the Secondary Site and Third-Party Site. |
Unknown |
dc1 and the third-party arbitration site (heartbeat between the primary site and the arbitration site) |
For details, see Symptom 4: Unknown Arbitration Heartbeat Between the Primary/Secondary Site and Third-Party Site. |
Unknown |
dc2 and the third-party arbitration site (heartbeat between the secondary site and the arbitration site) |
For details, see Symptom 4: Unknown Arbitration Heartbeat Between the Primary/Secondary Site and Third-Party Site. |
Symptom 1: Abnormal Arbitration Heartbeat Between the Primary and Secondary Sites
Possible Cause | Verification Method | Rectification Method |
---|---|---|
The heartbeat between the arbitration nodes at the primary and secondary sites is interrupted. | | Contact the administrator to check and restore the network. |
The arbitration service is abnormal. | | Contact Huawei engineers to rectify the fault. After the fault is rectified, use PuTTY to log in to the rectified arbitration node at the primary and secondary sites as the sopuser user in SSH mode, switch to the root user, and then switch to the arbiter user to restart the monitor or ETCD process on the arbitration nodes. |
Symptom 2: Abnormal Arbitration Heartbeat Between the Primary and Third-Party Sites
Possible Cause | Verification Method | Rectification Method |
---|---|---|
The network between the management node at the primary site and the third-party site is disconnected. | | Contact the administrator to check and restore the network. |
The third-party site is faulty or the ETCD process of the third-party site stops. | | Contact Huawei engineers to rectify the fault. After the fault is rectified, use PuTTY to log in to the rectified arbitration node at the third-party site as the sopuser user in SSH mode, switch to the root user, and then switch to the arbiter user to restart the ETCD process of the third-party site: bash /opt/arbitration-etcd/script/service.sh restart |
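The restart step above must be run as the arbiter user. The following is a minimal sketch of a guard around that step, not a Huawei-provided tool: the function name restart_arbitration_etcd and the DRY_RUN variable are hypothetical, and only the script path comes from the procedure above.

```shell
#!/bin/sh
# Script path as given in the rectification method above.
ETCD_SERVICE_SCRIPT=/opt/arbitration-etcd/script/service.sh

# restart_arbitration_etcd [USER]
# Hypothetical helper: refuses to restart ETCD unless the caller is the
# arbiter user. USER defaults to the current login name.
# With DRY_RUN=1 the command is printed instead of executed.
restart_arbitration_etcd() {
    current_user=${1:-$(id -un)}
    if [ "$current_user" != "arbiter" ]; then
        echo "switch to the arbiter user first (current user: $current_user)" >&2
        return 1
    fi
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "bash $ETCD_SERVICE_SCRIPT restart"
        return 0
    fi
    bash "$ETCD_SERVICE_SCRIPT" restart
}
```

Running it first with DRY_RUN=1 shows the command that would be executed without touching the ETCD process.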
Symptom 3: Abnormal Arbitration Heartbeat Between the Secondary Site and Third-Party Site
Possible Cause | Verification Method | Rectification Method |
---|---|---|
The network between the management node at the secondary site and the third-party site is disconnected. | | Contact the administrator to check and restore the network. |
The third-party site is faulty or the ETCD process of the third-party site stops. | | Contact Huawei engineers to locate and rectify the fault. After the fault is rectified, use PuTTY to log in to the rectified arbitration node at the third-party site as the sopuser user in SSH mode, switch to the root user, and then switch to the arbiter user to restart the ETCD process of the arbitration node at the third-party site: bash /opt/arbitration-etcd/script/service.sh restart |
Symptom 4: Unknown Arbitration Heartbeat Between the Primary/Secondary Site and Third-Party Site
Possible Cause | Verification Method | Rectification Method |
---|---|---|
The primary site is isolated. That is, the network between the primary and secondary sites is disconnected, and the network between the primary and third-party sites is disconnected. | Determine the fault based on other status information. If the preceding symptoms occur, the primary site is isolated. | Contact the administrator to check and restore the network. |
The secondary site is isolated. That is, the network between the secondary and primary sites is disconnected, and the network between the secondary and third-party sites is disconnected. | Determine the fault based on other status information. If the preceding symptoms occur, the secondary site is isolated. | Contact the administrator to check and restore the network. |
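The isolation rule in the table above can be written as a small decision function. This is an illustrative sketch only: the function name and the up/down arguments are hypothetical, and the "undetermined" result for the case where all three links are down is an assumption of this sketch, because the table does not cover it.

```shell
#!/bin/sh
# classify_isolation PRIMARY_SECONDARY PRIMARY_THIRD SECONDARY_THIRD
# Each argument is "up" or "down" and describes one network link.
# Prints which site is isolated according to the table above.
classify_isolation() {
    ps=$1 pt=$2 st=$3
    if [ "$ps" = down ] && [ "$pt" = down ] && [ "$st" = down ]; then
        # Assumption of this sketch: with every link down, the isolated
        # site cannot be told apart from the link states alone.
        echo "undetermined"
    elif [ "$ps" = down ] && [ "$pt" = down ]; then
        echo "primary"
    elif [ "$ps" = down ] && [ "$st" = down ]; then
        echo "secondary"
    else
        echo "none"
    fi
}
```

For example, classify_isolation down down up reports that the primary site is isolated, which matches the first row of the table.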
Abnormal DR System Heartbeat Status Between the Primary and Secondary Sites
Symptom
On the Manage DR System page of the management plane, the heartbeat status between the primary and secondary sites is Abnormal.
Possible Causes
- The heartbeat network between the primary and secondary sites is abnormal.
- The DR service at the primary or secondary site is abnormal.
- The DR system certificates of the management nodes at the primary and secondary sites are inconsistent or have expired.
Prerequisites
- You have obtained the heartbeat IP address of the management node at the secondary site.
- You have obtained the passwords for the sopuser and ossadm users on the management nodes at the primary and secondary sites.
Troubleshooting Procedure
This section provides only the basic troubleshooting methods. If the fault persists after troubleshooting using the following methods, contact Huawei technical support.
- Check whether the heartbeat network between the primary site and secondary site is normal.
- Use PuTTY to log in to the management node at the primary site as the sopuser user in SSH mode. Run the following command to switch to the ossadm user:
> su - ossadm
Password: password for the ossadm user
If the management plane is deployed in cluster mode, perform operations only on OMP_01. For details about how to obtain the IP address of a node, see How Do I Query the IP Address of a Node?
- Run the following command to test the connectivity between the management nodes at the primary and secondary sites.
If the IP address version is IPv4:
> ping heartbeat IP address of the management node at the secondary site
If the IP address version is IPv6:
> ping6 heartbeat IP address of the management node at the secondary site
Check the command output.
- If information similar to the following is displayed, the IP address can be pinged, and the network connection is normal:
64 bytes from heartbeat IP address of the management node at the secondary site: icmp_seq=1 ttl=251 time=42.1 ms
- If no command output is displayed within 1 minute, the network connection is abnormal. Contact the administrator to check the network status and rectify the network fault.
- Press Ctrl+C to stop the ping command.
- Check whether the DR processes of the management node are normal at the primary and secondary sites.
- Log in to the management plane at the primary site. For details, see Logging In to the Management Plane.
- On the management plane choose Product > System Monitoring from the main menu.
- In the upper left corner of the System Monitoring page, move the pointer to the icon and select iMaster NCE-Campus-OMP.
- On the Services tab page, click UniEPMgr.
- In the Processes area, check whether the drmgrservice-x-x process exists and whether Status of the process is Running.
x indicates the instance number. Replace it based on site requirements.
- If yes, the processes exist and are running properly.
- If no, contact Huawei technical support.
- Log in to the management plane at the secondary site, and perform the preceding operations to check the DR processes at the secondary site. If abnormal, contact Huawei technical support to restore the DR processes.
- Check whether the DR system certificate of the management node at the primary and secondary sites has expired.
Check whether the 51025 Certificate of the Remote DR System Has Expired alarm is generated for the primary and secondary sites.
- If yes, update the DR system certificate. For details, see "Updating DR System Certificates" in Maintenance and Monitor (Management plane).
- If no, this fault is not caused by DR system certificate expiration.
- Contact Huawei technical support to check whether the DR system certificates of the management node match between the primary and secondary sites.
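The connectivity test in step 1 of the procedure above can be sketched as a short script. This is a hedged example rather than part of the product: the function names are hypothetical, and the -c 4 option (send four probes and stop) is an addition of this sketch so that the command ends on its own instead of being stopped with Ctrl+C.

```shell
#!/bin/sh
# ping_cmd_for ADDRESS
# Prints "ping6" for an IPv6 literal (it contains ':'), otherwise "ping",
# matching the two commands used in the procedure.
ping_cmd_for() {
    case "$1" in
        *:*) echo ping6 ;;
        *)   echo ping ;;
    esac
}

# reply_seen
# Reads ping output on stdin and succeeds if a reply line such as
# "64 bytes from ..." is present.
reply_seen() {
    grep -q "bytes from"
}

# check_heartbeat ADDRESS
# Sends four probes to the heartbeat IP address and reports the result.
check_heartbeat() {
    if "$(ping_cmd_for "$1")" -c 4 "$1" 2>/dev/null | reply_seen; then
        echo "network connection is normal"
    else
        echo "network connection is abnormal"
        return 1
    fi
}
```

Call check_heartbeat with the heartbeat IP address of the management node at the secondary site. If it reports an abnormal connection, contact the administrator as described in the procedure.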
Restoring Abnormal DR System Data Replication
Abnormal Data Synchronization Between Databases at Primary and Secondary Sites
Symptom
On the Manage Remote DR System page of the management plane, Data Synchronization Status between the primary and secondary sites is Abnormal. Click to view the product information, and check the item whose Data Type is Database. Status of the item is Abnormal.
Possible Causes
The data replication link between the primary and secondary site products is abnormal.
Figure 3-31 shows the method of locating abnormal data replication in the DR system. The databases are deployed in master/slave mode at each site. When data is written to the master database, the data is synchronized from the master database to the slave database. As shown in Figure 3-31, at the primary site, data is synchronized from DB01 to DB02, and at the secondary site, from DB03 to DB04. During remote replication, the data is synchronized from the master database at the primary site to that at the secondary site, that is, from DB01 at the primary site to DB03 at the secondary site.
Major factors that affect data replication are as follows:
- Data replication links between products at the primary and secondary sites
- Data replication links between local nodes
- Database running status
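To see which links a given database depends on, the replication topology described above can be written down as a lookup. This is only an illustration that uses the example instance names DB01 to DB04; the function names are hypothetical.

```shell
#!/bin/sh
# replication_source INSTANCE
# Prints the instance that INSTANCE receives its data from.
# Prints nothing for DB01, the master database at the primary site.
replication_source() {
    case "$1" in
        DB01) ;;                 # origin of all replicated data
        DB02) echo DB01 ;;       # local replication at the primary site
        DB03) echo DB01 ;;       # remote replication between the sites
        DB04) echo DB03 ;;       # local replication at the secondary site
        *) echo "unknown instance: $1" >&2; return 1 ;;
    esac
}

# replication_path INSTANCE
# Prints the chain of instances that data passes through to reach INSTANCE.
replication_path() {
    replication_source "$1" >/dev/null || return 1
    node=$1
    path=$1
    while src=$(replication_source "$node") && [ -n "$src" ]; do
        path="$src -> $path"
        node=$src
    done
    echo "$path"
}
```

For example, replication_path DB04 prints DB01 -> DB03 -> DB04, so stale data on DB04 can originate from the remote link, the local link at the secondary site, or either database on that path.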
Troubleshooting Procedure
- Check whether the data replication links between the primary and secondary sites are normal.
- Use PuTTY to log in to the management node at the primary site as the sopuser user in SSH mode.
If the management plane is deployed in cluster mode, perform operations only on OMP_01. For details about how to obtain the IP address of a node, see How Do I Query the IP Address of a Node?
- Run the following command to switch to the master database instance node at the primary site:
> ssh IP address of the master database instance node at the primary site
- Run the following command to test the connectivity between the database nodes at the primary and secondary sites:
In the following commands, replace IP address of a node at the secondary site with the IP address of the secondary-site node that hosts the database instance with the same name as the master database instance at the primary site.
- For an IPv4 address, run the following command:
> ping IP address of a node at the secondary site
- For an IPv6 address, run the following command:
> ping6 IP address of a node at the secondary site
Check the command output.
- If information similar to the following is displayed, the IP address can be pinged, and the network connection is normal. Press Ctrl+C to stop the ping command and go to 2.
64 bytes from IP address of a node at the secondary site: icmp_seq=1 ttl=251 time=42.1 ms
- If no command output is displayed within 1 minute, the network connection is abnormal. Press Ctrl+C to stop the ping command and contact the administrator to check and restore the network, and then go to 2.
- Check the local master and slave database instance status at the primary site.
- Log in to the management plane of the primary site. For details, see Logging In to the Management Plane.
- On the management plane, choose Product > System Monitoring from the main menu.
- In the upper left corner of the System Monitoring page, move the pointer to the icon and select the product.
- On the Relational Database tab page, check the statuses of the master and slave database instances.
- If Status of the master and slave database instances is Running and Replication Status is Normal, the database instances are normal. Go to 3.
- If Status of the master or slave database instance is Not Running or Unknown, or Replication Status is Abnormal, the database instance is abnormal. Rectify the fault by referring to "Database Faults" in Troubleshooting Guide.
- Forcibly synchronize data between the primary and secondary sites.
- On the management plane of the active site, choose HA > Remote High Availability System > Manage DR System from the main menu.
- On the Manage Remote DR System page, click the icon in the Operation column of the row that contains the product whose data is to be synchronized. Select the product data synchronization direction.
After you specify the data synchronization direction, the DR system performs full data synchronization in that direction, and data at the destination site is overwritten. You are advised to specify the product with the latest data as the active site product and synchronize data from it to the peer site product. If the direction is from standby to active, the standby product is first switched to active and then synchronizes its data to the product at the peer site.
- Perform operations as prompted.
- Check the operation result. If the operation result is not as expected, contact Huawei technical support.
- On the management plane, choose HA > Remote High Availability System > Manage DR System from the main menu.
- On the Manage Remote DR System page, verify that the heartbeat status between the primary and secondary sites is Normal.
- On the Manage Remote DR System page, verify that Data Synchronization Status of all products is Synchronized or Synchronizing. If Data Synchronization Status is Delayed, a large volume of data is being synchronized between the primary and secondary sites. Check the status after data synchronization is complete.
- Verify that you can log in to the service plane of the active site. For details, see Logging In to service plane.
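The result check on Data Synchronization Status can be summarized as a decision function. This is a minimal sketch with a hypothetical function name. The status values are the ones named in the procedure above, and treating any other value as a case for technical support is this sketch's reading of "not as expected".

```shell
#!/bin/sh
# sync_status_action STATUS
# Maps a Data Synchronization Status value to the follow-up action.
sync_status_action() {
    case "$1" in
        Synchronized|Synchronizing)
            echo "as expected" ;;
        Delayed)
            echo "wait and check again after data synchronization is complete" ;;
        *)
            # Assumption: any other status counts as "not as expected".
            echo "contact Huawei technical support"
            return 1 ;;
    esac
}
```

A Delayed status is therefore not a failure by itself. It only means that the check must be repeated once the large data volume has been transferred.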
Abnormal RHM Data Replication
Symptom
The RHM data replication between primary and secondary sites is abnormal.
Possible Causes
RHM is abnormal.
Troubleshooting Procedure
- Restart RHM at the site where RHM is abnormal.
- Log in to the management plane. For details, see Logging In to the Management Plane.
- On the management plane, choose Product > System Monitoring from the main menu.
- In the upper left corner of the System Monitoring page, move the pointer to the icon and select a product other than NCE-OMP.
- Click the Services tab.
- Select all instances whose Instance Name contains RHM, and click Stop.
- In the Warning dialog box, click OK.
- After RHM service instances are stopped, click Start.
- In the Warning dialog box, click OK.
If all RHM service instances are in the Running state, RHM is normal. Otherwise, contact Huawei technical support.
- If other products are deployed, repeat 1.c to 1.h to restart RHM of all products except NCE-OMP.
- Forcibly synchronize data between the primary and secondary sites.
- On the management plane of the active site, choose HA > Remote High Availability System > Manage DR System from the main menu.
- On the Manage Remote DR System page, click the icon in the Operation column of the row that contains the product whose data is to be synchronized. Select the product data synchronization direction.
After you specify the data synchronization direction, the DR system performs full data synchronization in that direction, and data at the destination site is overwritten. You are advised to specify the product with the latest data as the active site product and synchronize data from it to the peer site product. If the direction is from standby to active, the standby product is first switched to active and then synchronizes its data to the product at the peer site.
- Perform operations as prompted.
- Check the operation result. If the operation result is not as expected, contact Huawei technical support.
- On the management plane, choose HA > Remote High Availability System > Manage DR System from the main menu.
- On the Manage Remote DR System page, verify that the heartbeat status between the primary and secondary sites is Normal.
- On the Manage Remote DR System page, verify that Data Synchronization Status of all products is Synchronized or Synchronizing. If Data Synchronization Status is Delayed, a large volume of data is being synchronized between the primary and secondary sites. Check the status after data synchronization is complete.
- Verify that you can log in to the service plane of the active site. For details, see Logging In to service plane.
Restoring Product DR Status
Switching Products to Standby
If the products at the primary and secondary sites are in the dual-active state and the heartbeat links between the primary and secondary sites are abnormal, you cannot rectify the dual-active state by forcibly synchronizing the product data. In this case, switch the products at one site to standby so that the product status in the DR system is restored.
Prerequisites
You have logged in to the management plane of the site whose product is in the Active state. For details, see Logging In to the Management Plane.
Context
- When you switch the product at a site to standby, the product at the peer site becomes active if the heartbeat links are normal. This prevents the products at the primary and secondary sites from becoming dual-standby. If the heartbeat links are abnormal, only the product at the current site becomes standby. To ensure proper functionality of the product, do not switch the product at a site to standby when the heartbeat links are abnormal and the DR statuses of the products at the primary and secondary sites differ (one active and one standby).
Figure 3-32 Before and after a product is switched to standby
- If the data synchronization status is Synchronizing or Delayed when the product at a site is being switched to standby, product data is still being synchronized between the primary and secondary sites. The system waits for the data synchronization to complete, after which the standby product becomes active. If the wait times out, the system forcibly performs the switchover.
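The effect of switching a product to standby, as described in the context above, depends on the heartbeat status and on the current DR status at both sites. The following sketch encodes that rule. The function name and argument values are hypothetical, and the function only models the outcome without calling any product interface.

```shell
#!/bin/sh
# standby_switch_outcome HEARTBEAT LOCAL PEER
# HEARTBEAT: normal | abnormal
# LOCAL, PEER: active | standby (DR status before the operation)
# Prints the DR status of both products after the local product is
# switched to standby, or fails when the operation is discouraged.
standby_switch_outcome() {
    heartbeat=$1 local_state=$2 peer_state=$3
    if [ "$heartbeat" = normal ]; then
        # The peer product becomes active, which avoids dual-standby.
        echo "local=standby peer=active"
        return 0
    fi
    if [ "$local_state" != "$peer_state" ]; then
        echo "do not switch: heartbeat abnormal and DR statuses differ" >&2
        return 1
    fi
    # Heartbeat abnormal: only the local product changes.
    echo "local=standby peer=$peer_state"
}
```

In the dual-active case with an abnormal heartbeat, standby_switch_outcome abnormal active active yields one standby and one active product, which is the state this procedure aims to restore.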
Precautions
If a backup or restoration task is in progress, perform operations in this section after the task is complete. Otherwise, the task or the DR operation may fail.
Procedure
- On the management plane, choose HA > Remote High Availability System > Manage DR System from the main menu.
- On the Manage Remote DR System page, click the icon in the Operation column of the row that contains the product and perform operations as prompted.
To switch multiple products to standby on the management plane, select the products and click Switch to Standby above the product list and perform operations as prompted.
- Check the operation result. If the operation result is not as expected, rectify the fault by referring to "Failure to Switch Products to Standby Due to Site Faults" in Troubleshooting Guide.
- On the management plane, choose HA > Remote High Availability System > Manage DR System from the main menu.
- On the Manage Remote DR System page, verify that the DR status of the product that has been switched to standby is Standby.
- Verify that you cannot log in to the service plane of the site at which the product has become standby.
- (Optional) Manually start processes or change the process startup type based on site requirements.
- If Startup Type of a process is Manual, perform the following operations:
- On the management plane, choose Product > System Monitoring from the main menu.
- In the upper left corner of the System Monitoring page, move the pointer to the icon and select the product.
- On the Processes tab page, select the process to be started, click Start in the upper right corner of the process list, and perform operations as prompted.
- For details about how to change the startup type of a process, see Configuring Process Startup Types.
Synchronizing Product Data Between Primary and Secondary Sites
If the products at the primary and secondary sites are in the same state (dual-active or dual-standby), or exceptions occur during data synchronization between primary and secondary site products, you can specify the synchronization direction after the heartbeat status between the primary and secondary sites is restored. The system performs the synchronization based on the direction so that the product status becomes normal and the data of the primary and secondary site products is consistent.
Prerequisites
- You have logged in to the management plane at the primary or secondary site. For details, see Logging In to the Management Plane.
- All services and database instances of the primary site product are normal during data synchronization. For details, see System Monitoring.
- The heartbeat status between the primary and secondary sites is Normal.
Context
After forcible synchronization, the data synchronization direction follows the direction specified for the forcible synchronization. As shown in Figure 3-33, the heartbeat links between the two sites are normal, but the data replication link between products is abnormal, and the specified product data synchronization direction is from active to standby. After successful data synchronization, data of the active product is fully synchronized to the standby product and data of the standby product is overwritten. If the direction is from standby to active, the standby product is first switched to active and then synchronizes its data to the product at the peer site (the formerly active product).
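Because forcible synchronization overwrites data at the destination, it helps to confirm which product is the source before starting. The following sketch spells out both directions described above. The function name and the direction keywords are hypothetical, and the roles refer to the DR status before the operation.

```shell
#!/bin/sh
# sync_plan DIRECTION
# DIRECTION: active-to-standby | standby-to-active
# Prints which product supplies the data, which product is overwritten,
# and whether the standby product is switched to active first.
sync_plan() {
    case "$1" in
        active-to-standby)
            echo "source=active overwritten=standby role_switch=no" ;;
        standby-to-active)
            echo "source=standby overwritten=active role_switch=yes" ;;
        *)
            echo "unknown direction: $1" >&2
            return 1 ;;
    esac
}
```

Choose the direction whose source is the product with the latest data, because the data of the product listed under overwritten is replaced in full.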
Precautions
- Perform operations provided in this section only at one site.
- If a backup or restoration task is in progress, perform operations in this section after the task is complete. Otherwise, the task or the DR operation may fail.
During the configuration or deletion of the DR system, DR switchover, or forcible product data synchronization, the database at the secondary site is restarted. As a result, the management plane at the secondary site reports the alarm GaussDB T V3 process has not started. After the DR operation is complete, the alarm is automatically cleared.
Procedure
- On the management plane, choose HA > Remote High Availability System > Manage DR System from the main menu.
- In the Operation column of the row that contains the product with data to be synchronized, click the icon. Select the data synchronization direction between the primary and secondary site products.
After you specify the data synchronization direction, the DR system performs full data synchronization in that direction. You are advised to specify the product with the latest data as the active site product and synchronize data from it to the peer site product. If the direction is from standby to active, the standby product is first switched to active and then synchronizes its data to the product at the peer site (the formerly active product).
- Perform operations as prompted.
- Check the operation result.
- On the management plane, choose HA > Remote High Availability System > Manage DR System from the main menu.
- On the Manage DR System page, verify that the heartbeat status between the primary and secondary sites is Normal.
- On the Manage DR System page, verify that Data Synchronization Status of all products is Synchronized or Synchronizing. If Data Synchronization Status is Delayed, a large volume of data is being synchronized between the primary and secondary sites. Check the status after data synchronization is complete.
- Verify that you can log in to the service plane at the active site. For details, see Logging In to service plane.
- If the operation result is not as expected, contact Huawei technical support.
- (Optional) Manually start processes or change the process startup type based on site requirements.
- If Startup Type of a process is Manual, perform the following operations:
- On the management plane, choose Product > System Monitoring from the main menu.
- In the upper left corner of the System Monitoring page, move the pointer to the icon and select the product.
- On the Processes tab page, select the process to be started, click Start in the upper right corner of the process list, and perform operations as prompted.
- For details about how to change the startup type of a process, see Configuring Process Startup Types.