Troubleshooting Guide
- Instructions for Maintenance Engineers
- Fault Information Collection
- Troubleshooting
- OS Faults
- Database Faults
- Abnormal Single Database Instance Status of Products
- Abnormal Slave Database Instance Status of Products
- Abnormal Master and Slave Database Instance Status of Products
- Abnormal Master Database Instance Due to Multiple Restarts on Nodes
- The Database Replication Status Is Abnormal
- The Disk Usage of a Database Instance Is Close to 100%
- Node and Service Faults
- DR System Faults
- Abnormal DR System Heartbeat Status Between the Primary and Secondary Sites
- Abnormal DR System Data Replication
- Failure to Load the Page for the DR System or Perform DR Operations on the Web Client
- Abnormal Product Status After DR Services at the Primary and Secondary Sites Are Restarted
- Abnormal DR System After the DR Service at the Active Site Is Restarted
- Failed to Configure the DR Relationship
- Failure to Switch Products to Standby Due to Site Faults
- Both the Switchover and Rollback Fail
- Disaster Recovery Exception Caused by the Uninstallation and Reinstallation of the ZooKeeperService in the RHM DR Scenario
- Clearing the DR Information of the Product Nodes After the DR System Is Deleted
- Arbiter Third-Party Site Faults
- The HDFS Synchronization Task Does Not Exist
- The HDFS Synchronization Status of the DR System Is Abnormal
- Failed to Query the HDFS Replication Status
- File Synchronization Fails Between the Primary and Secondary Sites
- Log and Alarm Management
- Login Failures
- System Management
- Portal Authentication
- Failure to Redirect to the Authentication Page After a User Clicks an Image or Video Link on the Login Page
- Failure to Move the Cursor to the Dialog Box on the SMS Authentication Page After a User Installs Google Chrome for the First Time
- When a User Accesses a Website Through Any Port Except Port 80 from a Terminal, the User Cannot Be Redirected to the Target Portal Page
- Authorization Fails During Authentication
- Error 404 Is Displayed on the Portal Page During Authentication
- Service Configuration of LAN Network
- A Device Fails to Go Online (Device Unregistered)
- Operation Fails on iMaster NCE-Campus
- Service Configuration Delivery Fails
- The Configuration Result Is Displayed as Failed on iMaster NCE-Campus, But the Configuration Is Successfully Delivered to the Device
- Terminals Failed to Obtain IP Addresses from the DHCP Server
- A Network Disconnection Occurs After a Faulty Device with an Eth-Trunk as an Uplink Is Replaced
- Service Configuration of WAN Network
- A Device Fails to Go Online in WAN Network (Device Offline)
- No Deployment Email Can Be Received After Site Creation
- WAN Service Configuration Delivery Fails (The Configuration Result Is in Preconfigured State)
- WAN Service Configuration Delivery Fails (The Configuration Result Is in Failed State)
- WAN Service Configuration Delivery Fails (The Configuration Result Is in Alarm State)
- The Email Server Test Fails
- Maintenance
- VM or Physical Server Exception
- Failure to Display Alarm Information/Failure to Display Performance Data/Empty Device Login and Logout Logs/Failure to Upgrade Device Software/Failure to Add, Delete, or Modify Tenants, Devices, or Sites on iMaster NCE-Campus
- Failure to Access the Operating System During the VM Startup After Server Recovery from Power-Off
- Insufficient Memory Leads to Automatic Restart of VMs
- Login Failure to a FusionCompute VM
- FusionInsight Faults
- iMaster NCE-Campus Basic Operation Troubleshooting
- Security Management Faults
- Backup and Restoration Troubleshooting
- Failure to Clear Alarms Due to System Faults
- Blank Pages on the Management Plane Web Client Due to Used-up Space of the tmp Directory
- Abnormal mczkapp Process Due to Damaged MCZKService Data Files
- Abnormal zookeeperapp Process Due to Damaged ZookeeperService Data Files
- Some Unavailable Menus on the Service Plane
Instructions for Maintenance Engineers
Precautions
A maintenance engineer must read the following precautions before locating and troubleshooting faults:
- Confirm whether the fault is an emergency fault. If it is an emergency fault, recover the faulty module using the predefined troubleshooting methods immediately, and then restore services.
- Strictly conform to operation rules and industrial safety standards, ensuring human and device safety.
- Take ESD protection measures. For example, wear an ESD wrist strap when replacing or maintaining device components.
- Record first-hand information about the problems occurring during troubleshooting.
- Record all the operations you have performed, especially key operations such as restarting devices and clearing databases. Before performing a key operation, confirm that the operation is feasible, back up data, and prepare emergency and security measures. Only qualified personnel can perform key operations.
Troubleshooting Flowchart
Systematic troubleshooting means narrowing down fault causes step by step until the root cause is found and the fault is rectified.
Generally, troubleshooting involves observing the fault symptom, collecting information, analyzing the problem, and finding the root cause. Grouping all possible fault causes into cause sets makes troubleshooting easier.
Refer to Troubleshooting when locating and rectifying a fault.
If the fault cannot be located, collect fault logs according to Fault Information Collection, and contact Huawei technical support.
Asking for Help
Huawei Enterprise Website
At https://support.huawei.com/enterprise/, you can:
- Search for troubleshooting cases to find a way to fix your problem.
- Post your question on the support community and wait for answers from online technical experts.
Contacting Technical Support Personnel
If the problems persist, you can contact service providers for technical support.
Prepare device information, fault symptoms, logs, and other related information in advance to facilitate fault location.
Fault Information Collection
Collecting Logs
Collecting Service Logs
This function provides default and user-defined collection templates for O&M personnel to collect logs and database tables as required for fault diagnosis.
- On the iMaster NCE-Campus management plane, choose Maintenance > O&M Management > Data Collection from the main menu.
- On the Data Collection page, as shown in the following figure, refer to Table 3-127 and Table 3-128 to perform the operations.
Table 3-127 Scenario collection
Operation Type
Procedure
Collection by scenario
- Click the Fault Scenario-based tab and select a scenario that matches the actual symptom.
You can query keywords or directly select a fault scenario from the navigation tree.
- Set the start time and end time, click Collect to collect data, and perform operations as prompted.
Collection by microservice
- Click the Microservice-based tab and select one or more services for which data needs to be collected.
You can query keywords or directly select a service.
- Set the start time and end time, click Collect to collect data, and perform operations as prompted.
Collection by directory
- Click the Directory-based tab, select the type of the file to be collected, and enter a file path.
- Add one or more user-defined collection items.
- Set the start time and end time, click Collect to collect data, and perform operations as prompted.
Table 3-128 Other operations
Operation Type
Procedure
Collection by reproducing a fault
Fault reproduction means that, after a fault occurs, data is collected while the fault is being reproduced.
- After selecting a collection scenario, set Reproduce fault to On and select Timeout time.
NOTE:
All collection scenarios support fault reproduction. If the reproduction duration reaches Timeout time, the collection task automatically ends.
- Click Reproduce to reproduce the fault.
- After the fault is reproduced, click Stop Reproduction within the time specified by Timeout time to collect data. Otherwise, the reproduction and collection fail and you need to start the reproduction again. Perform operations as prompted.
Downloading collection results
- Use Google Chrome to download the result file and set Google Chrome as follows:
Choose Customize and control Google Chrome > Settings in the upper right corner of Google Chrome. Click Advanced and set Ask where to save each file before downloading to the disabled state in the Downloads area.
- After data collection is complete, perform the following operations as required:
- If the status of each node is Completed, click Download. When multiple result files are downloaded for the first time, the Download multiple files dialog box is displayed in the upper left corner of Google Chrome. Click Allow to download the files one by one.
- All nodes: Click the button to download the result files of all nodes.
- Single node: Click the button next to a node to download the result file of the node.
- If the message "The disk space of OMP is insufficient..." is displayed, the disk space in the /opt directory of the current OMP node is less than 5 GB. Download the collection results from the /opt/backup/hfs/dfs/logcollect directory on each node.
- If the status of a node is Failed, the disk space in the /opt directory of the node is less than 5 GB. Data cannot be collected. In this case, you need to clean up the disk space and collect data again.
- If the status of a node is Abnormal, perform operations as prompted.
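The 5 GB free-space requirement on the /opt directory mentioned above can be verified from the shell before retrying a collection. The following sketch is a hypothetical helper, not a product tool; it reports whether the filesystem holding a directory has at least the given number of free kilobytes:

```shell
# Print "ok" if the filesystem containing $1 has at least $2 KB free,
# otherwise print "low". Uses POSIX df -P output (column 4 = available KB).
check_free_space() {
  dir="$1"; need_kb="$2"
  avail_kb=$(df -Pk "$dir" | awk 'NR==2 {print $4}')
  if [ "$avail_kb" -ge "$need_kb" ]; then
    echo "ok"
  else
    echo "low"
  fi
}

# Example: 5 GB = 5242880 KB, mirroring the /opt threshold above.
check_free_space / 5242880
```

If the function prints low, clean up the disk space before collecting data again, as described in the steps above.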
Collecting Security Logs
The following uses security logs as an example to describe how to view and export logs.
- Log in to iMaster NCE-Campus using the admin account.
- Choose Security Logs from the main menu, and view log details on the page displayed.
- Click Export and save log files to a local directory as prompted.
Troubleshooting
OS Faults
System Time Inconsistent with the NTP Time
Symptom
- The OS time of the server is inconsistent with the time of the Network Time Protocol (NTP) time source. The time cannot be synchronized from the NTP clock source.
- On the management plane, choose Maintenance > Time Management > Configure NTP from the main menu. On the Configure NTP page, the time synchronization status of the NTP server is Abnormal.
- If the management plane is deployed in cluster mode, you are frequently logged out of the management plane and the login page is displayed.
Possible Causes
- The network is faulty.
- Sudden time change occurs on the NTP clock source.
Prerequisites
- You have obtained the passwords for the sopuser, ossadm, and root users of the node with time to be modified.
- You have obtained the IP address of the active NTP clock source.
- In a DR scenario, the DR relationship between the primary and secondary sites has been deleted. For details, see "Deleting the DR System" in Administrator Guide.
Precautions
The services and database on the node with the time to be modified will be restarted during synchronization. Perform the following operations during off-peak hours.
Troubleshooting Procedure
The time zone and time can be forcibly synchronized in two modes: GUI mode and manual mode. In GUI mode, the system gradually adjusts the time and no sudden time change occurs. In manual mode, the system forcibly synchronizes the time immediately and sudden time change occurs. Sudden time change affects time-sensitive functions, for example, backup and restore. Select a time synchronization mode based on site requirements.
- GUI mode
- Log in to the management plane. For details, see Logging In to the Management Plane.
- On the management plane, choose Maintenance > Time Management > Configure Time Zone and Time from the main menu.
- On the Configure Time Zone and Time page, click Forcibly Synchronize.
If you want to perform other configuration operations that require restarting product services or product databases after forcibly synchronizing the time zone and time, do not select Automatically start the product databases and product services after the forcible synchronization in the Warning dialog box. The product databases and services will then not start automatically after the synchronization, preventing them from being restarted several times.
In a DR scenario, when forcibly synchronizing the time zone and time of the secondary site, do not select Automatically start the product databases and product services after the forcible synchronization in the Warning dialog box. This prevents the product services of the secondary site from being started, which would cause the product to become dual-active.
- Choose System > Task List from the main menu. Wait until the task for forcibly synchronizing time zone and time is complete.
- After the time zone and time are forcibly synchronized, wait for 1 to 15 minutes and check whether the time of the server OS is the same as that of the NTP clock source.
- If yes, the fault has been rectified.
- If no, contact Huawei technical support.
- In a DR scenario, establish the DR relationship between the primary and secondary sites. For details, see "Configuring the DR System" in Administrator Guide.
- Manual mode
- Use PuTTY to log in to the node with time to be modified as the sopuser user in SSH mode. For details about how to obtain the IP address of a node, see How Do I Query the IP Address of a Node?
- Run the following command to switch to the ossadm user:
> su - ossadm
Password: password for the ossadm user
- Run the following commands to stop services on the node:
> source /opt/oss/manager/bin/engr_profile.sh
> ipmc_adm -cmd stopnode
Skip this step if the application and data of the management plane are faulty and not restored.
If information similar to the following is displayed and success is displayed for all processes, all services on the node are stopped successfully. Otherwise, contact Huawei technical support.
...
Stopping process mcdrrouteragent-1-0 ... success
Stopping process testdummyagent-1-0 ... success
Stopping process uniepservice-1-0 ... success
...
- Run the following command to switch to the root user:
> su - root
Password: password for the root user
- Run the following command to stop the NTP service on the node:
# service ntpd stop
- Run the following commands to synchronize the time from the NTP clock source:
# ntpdate clock source IP address
# timedatectl set-local-rtc 0
# hwclock --systohc -u
# echo $?
0
If 0 is displayed, the time is successfully synchronized from the NTP clock source. Otherwise, contact Huawei technical support.
Replace clock source IP address in the commands with the planned IP address. If the server whose time is to be modified is the management node, use the IP address of the NTP clock source. If the server is a product node, use the IP address of OMP_01.
- Run the following command to check whether the time of the server is consistent with that of the NTP clock source:
# date
Information similar to the following is displayed:
Tue May 26 19:46:12 CST 2019
- If the node with time to be modified is the management node, the time of the node must be consistent with that of the NTP clock source.
- If the node with time to be modified is a product node, the time of the node must be consistent with that of the management node.
If the time is correct and consistent, go to 8. Otherwise, contact Huawei technical support.
- Run the following command to start the NTP service on the node with the time to be modified:
# service ntpd start
- Run the following command to exit the root user:
# exit
- Run the following commands to start services on the node:
> source /opt/oss/manager/bin/engr_profile.sh
> ipmc_adm -cmd startnode
Skip this step if the application and data of the management plane are faulty and not restored.
If information similar to the following is displayed and success is displayed for all processes, all services on the node are started successfully. Otherwise, contact Huawei technical support.
...
Starting process testdummyagent-1-0 ... success
Starting process mcdrrouteragent-1-0 ... success
Starting process mcir-1-0 ... success
...
- In a DR scenario, establish the DR relationship between the primary and secondary sites. For details, see "Configuring the DR System" in Administrator Guide.
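Before choosing between the GUI and manual modes above, it can help to quantify how far the local clock has drifted from a reference timestamp (for example, the epoch seconds read with date +%s on the management node or NTP source). clock_drift below is a hypothetical convenience function, not part of the product CLI:

```shell
# Print the absolute drift, in seconds, between the local clock and a
# reference epoch timestamp passed as the first argument.
clock_drift() {
  ref_epoch="$1"
  now=$(date +%s)            # local time as seconds since the epoch
  drift=$(( now - ref_epoch ))
  echo "${drift#-}"          # strip the sign to get the absolute value
}

# Example: compare against a timestamp captured on the reference host.
clock_drift "$(date +%s)"
```

A large drift means a sudden time change is likely, so the gradual GUI mode may be the safer choice for time-sensitive functions such as backup and restore.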
IR, ER, and BER Startup Failure Due to Damaged Logs
Symptom
The IR, ER, and BER services fail to be started. The IR, ER, and BER log properties such as permissions and owners become ????, as shown in the following figure. This indicates that the log file is damaged.
For example, the owner of ER logs in /opt/oss/log/{tenant}/ERService is changed to ????.
-rw-------. 1 ossuser ossgroup 2803315 Jan  9 15:53 bus_adm.script.log
-rw-r-----. 1 ossuser ossgroup       0 Jan  3 16:22 cmd.check_backenderservice.sh.log
-rw-------. 1 ossuser ossgroup     168 Jan  3 16:22 cmd.post_install_backenderservice.sh.log
drwx------. 2 ossuser ossgroup    4096 Jan  9 15:50 nginx
?????????   ? ?       ?               ?              ? oss.bus_adm.script.trace
-rw-------. 1 ossuser ossgroup    1380 Jan  3 16:22 oss.busdeploy.trace
Possible Causes
The log file is damaged because the disk space is used up. As a result, the IR, ER, and BER services fail to be started.
Troubleshooting Methods
The methods for handling the IR, ER, and BER startup failures due to damaged logs are the same. The following uses ERService as an example.
- Use PuTTY to log in to the node where ERService resides as the sopuser user in SSH mode. For details about how to obtain the IP address of the node where the service resides, see How Do I Query the IP Address of the Node Where a Service Resides?
- Run the following command to switch to the root user:
su - root
Password: password for the root user
- Run the following commands to copy the log folder:
cd /opt/oss/log/{tenant}
cp -r ERService ERService1
After this command is executed successfully, all the files that are not damaged are copied to the new folder ERService1.
- Run the following command to move the original folder:
mv ERService ERService_bak
- Run the following command to move the ERService1 folder back to the directory of the original folder:
mv ERService1 ERService
- Run the following command to change the owner of the ERService folder:
chown -R ossuser:ossgroup ERService
- Run the following command to exit the root user:
exit
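The copy, move, and chown steps above can be wrapped in one function so the same repair applies to IRService, ERService, and BERService. This is a sketch under the assumption that the directory layout and the ossuser:ossgroup owner match the manual steps; repair_log_dir is a hypothetical helper, not a product tool:

```shell
# Repair a damaged service log directory: copy the readable files to a
# new folder, set the damaged folder aside, rename the copy back, and
# restore the owner. Runs in a subshell so the caller's cwd is untouched.
repair_log_dir() (
  log_root="$1"; service="$2"; owner="$3"
  cd "$log_root" || exit 1
  cp -r "$service" "${service}1"   # copies every readable file; damaged entries fail with errors
  mv "$service" "${service}_bak"   # keep the damaged folder for later analysis
  mv "${service}1" "$service"      # put the clean copy in place of the original
  chown -R "$owner" "$service"     # restore the expected owner
)

# Intended use (as root), matching the manual steps above:
# repair_log_dir /opt/oss/log/{tenant} ERService ossuser:ossgroup
```

Keeping the _bak folder rather than deleting it preserves the damaged files in case Huawei technical support needs them.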
Database Faults
This section describes how to rectify faults of GaussDB 100. To rectify the faults of other databases, contact Huawei technical support.
Context
On the management plane, choose Product > System Monitoring from the main menu. On the Relational Databases tab page, view the role of the database instance.
- If Role in the row that contains the database instance is Master, the database instance is a master instance.
- If Role in the row that contains the database instance is Slave, the database instance is a slave instance.
- If Role in the row that contains the database instance is --, the database instance is a single instance.
If a database instance is faulty, you are advised to rectify the database fault first. If multiple faults accumulate and multiple database instances become faulty, the instances may become dual-master. After one of the instances is switched to slave, data is automatically synchronized between the master and slave database instances, which may cause data loss.
Abnormal Single Database Instance Status of Products
Symptom
On the management plane, choose Product > System Monitoring from the main menu. At the top of the System Monitoring page, move the pointer to the icon and select the product. On the Relational Databases tab page, Status of the database instance is Not Running or Unknown.
In a DR scenario, operations in this section only apply to fault rectification at the primary site. For details about how to rectify the fault at the secondary site, see Abnormal DR System Data Replication.
Possible Causes
- The node where the database instance resides is stopped.
- The database instance is manually stopped.
- The database process fails to be restarted after an exception.
- The database data is damaged.
Locating Method
Figure 3-35 shows the locating method for abnormal database status.
Troubleshooting Procedure
- In a DR scenario, separate the primary and secondary site products. For details, see "Separating the Primary and Secondary Site Products" in Administrator Guide.
- Check whether the servers where database instances reside are stopped.
- Check whether nodes are faulty. For details, see Node and Service Faults.
- If yes, rectify faults by referring to corresponding sections.
- If no, go to 4.
- Check the database instance status.
- Log in to the management plane, and choose Product > System Monitoring from the main menu.
- In the upper left corner of the System Monitoring page, move the pointer to the icon and select the product.
- On the Relational Databases tab page, check whether Status of the database instances is Not Running.
- If yes, manually start the node where the database instance resides. For details, see Starting the Service Plane Databases.
If the database instance is in the Running state, the database instance is started successfully, and the fault is rectified.
If the database instance is manually stopped, locate the problem and start the instance after related tasks are complete.
If the database instance is in the Not Running state, the database instance fails to be started. Go to 5.
- If the database instance is not manually stopped, go to 5.
- Perform the following operations to check DBAgent and DeployAgent:
- Use PuTTY to log in to the node where the abnormal database instance resides as the sopuser user in SSH mode. For details about how to obtain the IP address of a node, see How Do I Query the IP Address of the Node Where a Database Instance Resides?
- Run the following command to switch to the ossadm user:
> su - ossadm
Password: password for the ossadm user
- Run the following commands to check the status of DBAgent:
> cd /opt/oss/manager/agent/bin
> bash ipmc_adm -cmd statusapp -tenant manager
Process Name      Process Type  App Name  Tenant Name  Process Mode  IP          PID    Status
mcfebservice-0-0  dbagentapp    DBAgent   manager      cluster       10.10.10.1  80125  RUNNING
... ...
[All Processes: 17] [Running: 17] [Not Running: 0]
- If DBAgent is in the RUNNING state, DBAgent is running.
- If DBAgent is in the NOT RUNNING state, DBAgent is stopped. Run the following command to start the service:
> bash ipmc_adm -cmd startapp -app DBAgent -tenant manager
If information similar to the following is displayed, DBAgent is started. Otherwise, contact Huawei technical support.
Starting process dbagentapp-0-0 ... success
- Run the following commands to check the status of DeployAgent:
> source /opt/oss/manager/bin/engr_profile.sh
> ipmc_adm -cmd statusmgr
ossadm 28228 1 0 10:37 ? 00:01:53 ... ... /opt/oss/manager/apps/DeployAgent-903.4.63/tools/pyscript/deployagent/DeployAgent.pyc -DNFW=deployagent
ossadm 28188 1 0 10:37 ? 00:00:09 ... ...
- If deployagent is displayed in the command output, DeployAgent is started.
- If deployagent is not displayed in the command output, DeployAgent is stopped. Run the following command to start DeployAgent:
> bash ipmc_adm -cmd startmgr -app DeployAgent
If information similar to the following is displayed, DeployAgent is started successfully. Otherwise, contact Huawei technical support.
============================
Starting management processes...
Starting deployagent...
.......
start mcwatchdog... success
============================
Starting management processes is complete.
- Use PuTTY to log in to the management node as the sopuser user in SSH mode.
If the management plane is deployed in cluster mode, perform operations only on OMP_01. For details about how to obtain the IP address of a node, see How Do I Query the IP Address of a Node?
- Run the following command to switch to the ossadm user:
> su - ossadm
Password: password for the ossadm user
- Perform the following operations to restore the single instance:
> cd /opt/oss/manager/apps/UniEPService/tools/DB_Recovery
> bash Single_DB_Recovery.sh instance name
Choose Product > System Monitoring, and view the instance name on the Relational Databases tab page.
Assume that the database instance name is xdj-1-1034. If the following information is displayed, the database instance is restored successfully. Otherwise, contact Huawei technical support.
...
The result:
xdj-1-1034: success
[2018-12-22 02:14:37] [185770] Recovery DB-Instance Success. You need to recovery the product data.
- Restore the product data. For details, see "Restoring Product Data" in Administrator Guide.
- Log in to the management plane, and check whether the database instance is restored.
- If Status of the database instance is Running, the fault is rectified. No further action is required.
- If Status of the database instance is Not Running or Unknown, the database application may be abnormal. Go to 11. If the database instance status is abnormal after the database application and the database instance are restored, contact Huawei technical support.
- Restore the database application. For details, see "Restoring Database Applications" in Administrator Guide.
- Repeat 6 to 10 to restore the database instance.
- In a DR scenario, connect the primary and secondary site products. For details, see "Connecting the Primary and Secondary Site Products" in Administrator Guide.
Abnormal Slave Database Instance Status of Products
Symptom
On the management plane, choose Product > System Monitoring from the main menu. At the top of the System Monitoring page, move the pointer to the icon and select the product. On the Relational Databases tab page, for the master database instance, Status and Replication Status are Running and Normal respectively. For the slave database instance, Status is Not Running or Replication Status is Abnormal.
In a DR scenario, operations in this section only apply to fault rectification at the primary site. For details about how to rectify the fault at the secondary site, see Abnormal DR System Data Replication.
Possible Causes
- The node where the database instance resides is stopped.
- The database instance is manually stopped.
- The database process fails to be restarted after an exception.
- The database data is damaged.
Locating Method
Figure 3-36 shows the locating method for abnormal database status.
Troubleshooting Procedure
- In a DR scenario, separate the primary and secondary site products. For details, see "Separating the Primary and Secondary Site Products" in Administrator Guide.
- Check whether the servers where database instances reside are stopped.
- Optional: If the master and slave database instances have been switched over within the past 30 minutes, perform the following operations to manually delete the corresponding database switchover record so that the database instances can be switched over again:
- Use PuTTY to log in to OMP_01 as the sopuser user in SSH mode. For details, see Logging In to a Server Using PuTTY.
- Run the following command to switch to the ossadm user:
> su - ossadm
Password: password for the ossadm user
- Run the following command to delete the database instance switchover record:
> cd /opt/oss/manager/apps/DBHASwitchService/bin
> bash switchtool.sh -cmd del-failover-time -instid name of the database instance whose switchover record is to be deleted
If the following information is displayed, the database instance switchover record is deleted successfully:
Successful.
- Check whether nodes are faulty. For details, see Node and Service Faults.
- If yes, rectify faults by referring to corresponding sections.
- If no, go to 5.
- Check the database instance status.
- Log in to the management plane, and choose Product > System Monitoring from the main menu.
- In the upper left corner of the System Monitoring page, move the pointer to the icon and select the product.
- On the Relational Databases tab page, check whether Status of the database instances is Not Running.
- If yes, manually start the node where the database instance resides. For details, see Starting the Service Plane Databases.
If the database instance is in the Running state, the database instance is started successfully, and the fault is rectified.
If the database instance is manually stopped, locate the problem and start the instance after related tasks are complete.
If the database instance is in the Not Running state, the database instance fails to be started. Go to 6.
- If the database instance is not manually stopped, go to 6.
- Perform the following operations to check DBAgent and DeployAgent:
- Use PuTTY to log in to the node where the abnormal database instance resides as the sopuser user in SSH mode. For details about how to obtain the IP address of a node, see How Do I Query the IP Address of the Node Where a Database Instance Resides?
- Run the following command to switch to the ossadm user:
> su - ossadm
Password: password for the ossadm user
- Run the following commands to check the status of DBAgent:
> cd /opt/oss/manager/agent/bin
> bash ipmc_adm -cmd statusapp -tenant manager
Process Name      Process Type  App Name  Tenant Name  Process Mode  IP          PID    Status
mcfebservice-0-0  dbagentapp    DBAgent   manager      cluster       10.10.10.1  80125  RUNNING
... ...
[All Processes: 17] [Running: 17] [Not Running: 0]
- If DBAgent is in the RUNNING state, DBAgent is running.
- If DBAgent is in the NOT RUNNING state, DBAgent is stopped. Run the following command to start the service:
> bash ipmc_adm -cmd startapp -app DBAgent -tenant manager
If information similar to the following is displayed, DBAgent is started. Otherwise, contact Huawei technical support.
Starting process dbagentapp-0-0 ... success
- Run the following commands to check the status of DeployAgent:
> source /opt/oss/manager/bin/engr_profile.sh
> ipmc_adm -cmd statusmgr
ossadm 28228 1 0 10:37 ? 00:01:53 ... ... /opt/oss/manager/apps/DeployAgent-903.4.63/tools/pyscript/deployagent/DeployAgent.pyc -DNFW=deployagent
ossadm 28188 1 0 10:37 ? 00:00:09 ... ...
- If deployagent is displayed in the command output, DeployAgent is started.
- If deployagent is not displayed in the command output, DeployAgent is stopped. Run the following command to start DeployAgent:
> bash ipmc_adm -cmd startmgr -app DeployAgent
If information similar to the following is displayed, DeployAgent is started successfully. Otherwise, contact Huawei technical support.
============================
Starting management processes...
Starting deployagent...
.......
start mcwatchdog... success
============================
Starting management processes is complete.
- Use PuTTY to log in to the management node as the sopuser user in SSH mode.
If the management plane is deployed in cluster mode, perform operations only on OMP_01. For details about how to obtain the IP address of a node, see How Do I Query the IP Address of a Node?
- Run the following command to switch to the ossadm user:
> su - ossadm
Password: password for the ossadm user
- Run the following command to check the database instance status:
> cd /opt/oss/manager/apps/DBAgent/bin/
> bash dbsvc_adm -cmd query-db-instance
Information similar to the following is displayed:
DBInstanceId                             ... IP           Port  ... Role  Rpl Status     ...
apmdbsvr-10_90_73_163-3@10_90_73_164-3   ... 10.90.73.164 32082 ... Slave Normal         ...
apmdbsvr-10_90_73_178-21@10_90_73_179-21 ... 10.90.73.179 32080 ... Slave Abnormal (101) ...
apmdbsvr-10_90_73_178-21@10_90_73_179-21 ... 10.90.73.179 32080 ... Slave Abnormal (103) ...
...
- If the value of Rpl Status is Normal, the status of the database instance is normal.
- If the value of Rpl Status is Abnormal, the status of the database instance is abnormal. Record the error code in the brackets next to Abnormal. Check whether the recorded error code exists in Table 3-129.
- If yes, go to 10 to restore the database instance.
- If no, contact Huawei technical support.
Table 3-129 Database status error codes
Error Code
Description
101
The database instance is not running, or the node where the database instance resides is not running.
310
The slave database instance needs to be rebuilt, and will be automatically restored.
403
The slave database instance needs to be rebuilt, and will be automatically restored.
- Run the following commands to restore the abnormal slave database instance:
> cd /opt/oss/manager/apps/UniEPService/tools/DB_Recovery
> bash DBSlaveInstance_Recovery.sh -instid servicedbsvr2-1-1@2-1 -tenant NCECAMPUS
-instid: database instance name. The value can be a database instance name or all. all indicates that all database instances of the product will be restored.
Assume that the database instance name is servicedbsvr2-1-1@2-1. If the following information is displayed, the database instance is restored successfully. Otherwise, contact Huawei technical support.
... The result: servicedbsvr2-1-1@2-1: success [2018-12-22 02:29:33] [264943] Recovery DB-Instance Success.
- Log in to the management plane and check whether the database instance is restored.
- If Status of the database instance is Running, and Replication Status is Normal, the fault is rectified. No further action is required.
- If Status of the database instance is Not Running or Unknown, and Replication Status is Abnormal, the database application may be abnormal. Go to 12. If the database instance status is abnormal after the database application and the database instance are restored, contact Huawei technical support.
- Restore the database application. For details, see "Restoring Database Applications" in Administrator Guide.
- Repeat 7 to 11 to restore the database instance.
- In a DR scenario, connect the primary and secondary site products. For details, see "Connecting the Primary and Secondary Site Products" in Administrator Guide.
Abnormal Master and Slave Database Instance Status of Products
Symptom
On the management plane, choose Product > System Monitoring from the main menu. On the top of the System Monitoring page, move the pointer to and select the product. On the Relational Databases tab page, Status of both the master and slave database instances is Not Running or Unknown.
In a DR scenario, operations in this section only apply to fault rectification at the primary site. For details about how to rectify the fault at the secondary site, see Abnormal DR System Data Replication.
Possible Causes
- The node where the database instance resides is stopped.
- The database instance is manually stopped.
- The database process fails to be restarted after an exception.
- The database data is damaged.
Locating Method
Figure 3-37 shows the locating method for abnormal database status.
Troubleshooting Procedure
- In a DR scenario, ensure that the primary and secondary site products have been separated. For details, see "Separating the Primary and Secondary Site Products" in Administrator Guide.
- Check whether the Servers where database instances reside are stopped.
- Optional: If the master and slave database instances have been switched over within the past 30 minutes, run the following command to manually delete the corresponding database switchover record to allow the database instances to be switched over again:
- Use PuTTY to log in to OMP_01 as the sopuser user in SSH mode. For details, see Logging In to a Server Using PuTTY.
- Run the following command to switch to the ossadm user:
> su - ossadm
Password: password for the ossadm user
- Run the following command to delete the database instance switchover record:
> cd /opt/oss/manager/apps/DBHASwitchService/bin
> bash switchtool.sh -cmd del-failover-time -instid name of the database instance whose switchover record is to be deleted
If the following information is displayed, the database instance switchover record is deleted successfully:
Successful.
- Check whether nodes are faulty. For details, see Node and Service Faults.
- If yes, rectify faults by referring to corresponding sections.
- If no, go to 5.
- Check the database instance status.
- Log in to the management plane, and choose Product > System Monitoring from the main menu.
- In the upper left corner of the System Monitoring page, move the pointer to and select the product.
- On the Relational Databases tab page, check whether Status of the database instances is Not Running.
- If yes, manually start the nodes where the database instances reside. For details, see Starting the Service Plane Databases.
If the database instances are in the Running state, the database instances are started successfully, and the fault is rectified.
If the database instances are manually stopped, locate the problem and start the instances after related tasks are complete.
If the database instances are in the Not Running state, the database instances fail to be started. Go to 6.
- If the database instances are not manually stopped, go to 6.
- Perform the following operations to check DBAgent and DeployAgent:
- Use PuTTY to log in to the node where the abnormal database instance resides as the sopuser user in SSH mode. For details about how to obtain the IP address of a node, see How Do I Query the IP Address of the Node Where a Database Instance Resides?
- Run the following command to switch to the ossadm user:
> su - ossadm
Password: password for the ossadm user
- Run the following commands to check the status of DBAgent:
> cd /opt/oss/manager/agent/bin
> bash ipmc_adm -cmd statusapp -tenant manager
Process Name Process Type App Name Tenant Name Process Mode IP PID Status
mcfebservice-0-0 dbagentapp DBAgent manager cluster 10.10.10.1 80125 RUNNING
... ...
[All Processes: 17] [Running: 17] [Not Running: 0]
- If DBAgent is in the RUNNING state, DBAgent is running.
- If DBAgent is in the NOT RUNNING state, DBAgent is stopped. Run the following command to start the service:
> bash ipmc_adm -cmd startapp -app DBAgent -tenant manager
If information similar to the following is displayed, DBAgent is started. Otherwise, contact Huawei technical support.
Starting process dbagentapp-0-0 ... success
- Run the following commands to check the status of DeployAgent:
> source /opt/oss/manager/bin/engr_profile.sh
> ipmc_adm -cmd statusmgr
ossadm 28228 1 0 10:37 ? 00:01:53 ... ... /opt/oss/manager/apps/DeployAgent-903.4.63/tools/pyscript/deployagent/DeployAgent.pyc -DNFW=deployagent
ossadm 28188 1 0 10:37 ? 00:00:09 ... ...
- If deployagent is displayed in the command output, DeployAgent is started.
- If deployagent is not displayed in the command output, DeployAgent is stopped. Run the following command to start DeployAgent:
> bash ipmc_adm -cmd startmgr -app DeployAgent
If information similar to the following is displayed, DeployAgent is started successfully. Otherwise, contact Huawei technical support.
============================
Starting management processes...
Starting deployagent...
.......
start mcwatchdog... success
============================
Starting management processes is complete.
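The two agent checks above can be combined into a single decision step. The following hedged sketch decides, from captured command output, whether DBAgent or DeployAgent needs to be started. The matched strings ("NOT RUNNING", "deployagent") follow the sample outputs in this section; the helper functions themselves are illustrative only.

```shell
# Illustrative helpers, not supported tools.
dbagent_needs_start() {
    # $1: output of `ipmc_adm -cmd statusapp -tenant manager`.
    # Any process in the NOT RUNNING state means DBAgent needs a start.
    grep -q 'NOT RUNNING' <<< "$1"
}
deployagent_needs_start() {
    # $1: output of `ipmc_adm -cmd statusmgr`.
    # No deployagent process in the output means DeployAgent needs a start.
    ! grep -qi 'deployagent' <<< "$1"
}
```

If either function returns success, run the corresponding start command described above (`bash ipmc_adm -cmd startapp -app DBAgent -tenant manager` or `bash ipmc_adm -cmd startmgr -app DeployAgent`).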
- Use PuTTY to log in to the management node as the sopuser user in SSH mode.
If the management plane is deployed in cluster mode, perform operations only on OMP_01. For details about how to obtain the IP address of a node, see How Do I Query the IP Address of a Node?
- Run the following command to switch to the ossadm user:
> su - ossadm
Password: password for the ossadm user
- Run the following commands to restore the abnormal master and slave database instances:
> cd /opt/oss/manager/apps/UniEPService/tools/DB_Recovery
> bash Master_Slave_DB_Recovery.sh -instid dbsvr-0-1001@1-1001 -tenant NCECAMPUS
-instid: database instance name. The value can be a database instance name or all. all indicates that all database instances of the product will be restored.
Assume that the database instance name is dbsvr-0-1001@1-1001. If the database instance is restored successfully, the following information is displayed. Otherwise, contact Huawei technical support.
... The result: dbsvr-0-1001@1-1001: success [2018-12-22 03:36:24] [175465] Recovery DB-Instance Success. You need to recovery the product data.
- Restore the product data. For details, see "Restoring Product Data" in Administrator Guide.
- Log in to the management plane and check whether the database instance is restored.
- If Status of the database instances is Running, and Replication Status is Normal, the fault is rectified. No further action is required.
- If Status of the database instances is Not Running or Unknown, and Replication Status is Abnormal, the database application may be abnormal. Go to 12. If the database instance status is abnormal after the database application and the database instances are restored, contact Huawei technical support.
- Restore the database application. For details, see "Restoring Database Applications" in Administrator Guide.
- Repeat 7 to 11 to restore the database instance.
- In a DR scenario, connect the primary and secondary site products. For details, see "Connecting the Primary and Secondary Site Products" in Administrator Guide.
Abnormal Master Database Instance Due to Multiple Restarts on Nodes
Symptom
If the management nodes are deployed in cluster mode and a management node has been restarted multiple times within 30 minutes, the master database instance on that node may become abnormal, and you cannot log in to the management plane.
If the database nodes of the product are deployed in cluster mode and a database node has been restarted multiple times within 30 minutes, the master database instance on that node may become abnormal.
Possible Causes
If the master database instance is abnormal, a switchover is automatically performed. To ensure system stability, the switchover between the master and slave database instances can be performed only once within 30 minutes when the two instances are running properly. If the master database instance is abnormal, a maximum of 60 seconds of data loss may occur during the switchover. If the nodes where the master and slave database instances reside are powered off or powered on for multiple times within 30 minutes, the master database instance may be abnormal.
Troubleshooting Procedure
After the power supply is stable, wait for 30 minutes and check whether the master and slave instances are recovered, that is, whether Status of the master and slave database instances is Running and Replication Status is Normal. This is because a switchover can be performed only once within 30 minutes. If the databases are still abnormal, perform the following operations based on site requirements:
- If you can log in to the management plane, check whether the slave database instance of the management plane or the product is abnormal, and restore it if abnormal. For details, see Abnormal Slave Database Instance Status of Products.
- If you cannot log in to the management plane, restore the management plane if the management node is abnormal. For details, see Co-Deployed Node Faults. Restore the master and slave database instances if the database node of the product is abnormal. For details, see Abnormal Master and Slave Database Instance Status of Products.
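The 30-minute throttle described above can be expressed as a one-line check. The following hedged sketch decides, from the number of minutes since the last master/slave switchover, whether another switchover would be permitted. The 30-minute window comes from this section; the helper is illustrative and not part of the product tooling.

```shell
# Illustrative only: $1 is the number of minutes since the last switchover.
# Returns success when another switchover would be permitted.
switchover_allowed() {
    [ "$1" -ge 30 ]
}
```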
The Database Replication Status Is Abnormal
Symptom
The replication status of a database instance is Abnormal for a long time on the Relational Database tab page. You can navigate to this tab page by choosing Product > System Monitoring on the management plane and clicking in the upper left corner to select the desired product.
Possible Causes
The I/O of the slave database node is too high. As a result, the database cannot be replayed in time, and becomes abnormal.
Troubleshooting Roadmap
Figure 3-38 shows the troubleshooting roadmap.
- In the DR scenario, the primary and secondary site products must be separated. For details, see "Separating the Primary and Secondary Site Products" in the Geographic Redundancy guide.
- Perform the following operations:
- Use PuTTY to log in to the abnormal database node as the sopuser user in SSH mode. For details about how to obtain the IP address of the node where a specific database instance resides, see How Do I Query the IP Address of the Node Where a Database Instance Resides?.
- Run the following command to switch to the ossadm user:
> su - ossadm
- Run the following commands to check the log file size:
> cd /opt/zenith/data/Database instance name/archive_log
> du -h
- If the log file size exceeds 16 GB, manually delete the log files. Log in to the node where the slave database resides and run the following command to delete the log files:
> rm -r /opt/zenith/data/Database instance name/archive_log
Run the following command to rebuild the instance:
> /usr/bin/sudo -u dbuser bash -c "source ~/.bashrc;/usr/bin/flock -ox /opt/zenith/data/Database instance name -c '/opt/oss/manager/agent/DeployAgent/rtsp/python/bin/python /opt/zenith/app/bin/zctl.py -t build -D /opt/zenith/data/Database instance name -P'"
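The 16 GB threshold check above can be scripted with a numeric comparison instead of reading `du -h` output by eye. The following hedged sketch uses `du -sk` (kilobytes) for the comparison. The path and threshold come from the procedure; the function itself is illustrative, not a supported tool.

```shell
# Illustrative only: return success when the archive log directory in $1
# exceeds 16 GB on disk.
archive_log_oversized() {
    local kb
    kb=$(du -sk "$1" | awk '{print $1}')
    # 16 GB expressed in KB
    [ "$kb" -gt $((16 * 1024 * 1024)) ]
}
```

Only if the check returns success should you proceed to delete the archive logs and rebuild the instance, and only on the node where the slave database resides.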
The Disk Usage of a Database Instance Is Close to 100%
Symptom
On the Relational Database tab page of the management plane, the disk usage of a database instance is close to 100%. To navigate to the Relational Database tab page, choose Product > System Monitoring from the main menu, move the pointer to , and select the desired product.
Possible Causes
The size of a configuration source table in the database instance is too large.
Procedure
- Clear the indexes of configuration source tables.
- Use PuTTY to log in to the management node as the sopuser user in SSH mode.
- Run the following commands to switch to the root user:
> su - root
Password: password for the root user
- Run the following command to log in to the database console:
su - dbuser -c "source appgsdb.bashrc&&gsql -d nepersistentdb -U dbuser -p '`cat /opt/oss/manager/var/tenants/NCECAMPUS/containerlist.json | grep -A 200 "nepersistentdb" | grep "port" | awk -F ':' '{print $2}' | awk -F '{' '{print $1}'`' -h `cat /opt/oss/manager/var/tenants/NCECAMPUS/containerlist.json | grep -A 200 "nepersistentdb" | grep "ip" | awk -F ':' '{print $2}' | awk -F '"' '{print $2}'`"
Enter the password for the dbuser user as prompted.
Password for user dbuser: password for the dbuser user
The following shows the command output.
Password for user dbuser:
gsql (9.2.4)
SSL connection (cipher: DHE-RSA-AES256-SHA, bits: 256)
Type "help" for help.
NEPERSISTENTDB=#
nepersistentdb: In the command, nepersistentdb indicates the abnormal database instance.
- Run the following commands on the database console to clear the indexes of the source and link tables:
CREATE OR REPLACE FUNCTION delete_index_func() RETURNS void AS $func$ DECLARE tablenamerefcursor refcursor; tablename text; sql text; BEGIN BEGIN execute 'SELECT 1 FROM YANG_SCHEMA;'; EXCEPTION WHEN others THEN RAISE NOTICE 'no need to execute'; return; END; open tablenamerefcursor for execute 'select RELNAME from pg_stat_user_tables where RELNAME like ''T\_DB\_DATASTORE\_%\_SOURCE'' escape ''\'';'; loop fetch tablenamerefcursor into tablename; if found then sql = 'drop index ' || upper(tablename) || upper('_idx_path;'); BEGIN execute sql; EXCEPTION WHEN others THEN RAISE NOTICE 'sql: %',sql; END; else exit; end if; end loop; close tablenamerefcursor; open tablenamerefcursor for execute 'select RELNAME from pg_stat_user_tables where RELNAME like ''T\_DB\_DATASTORE\_%\_LINK'' escape ''\'';'; loop fetch tablenamerefcursor into tablename; if found then sql = 'drop index ' || upper(tablename) || upper('_idx_path;'); BEGIN execute sql; EXCEPTION WHEN others THEN RAISE NOTICE 'sql: %',sql; CONTINUE; END; else exit; end if; end loop; close tablenamerefcursor; return; END; $func$ LANGUAGE plpgsql; select delete_index_func();
The command output is as follows:
select delete_index_func();
NEPERSISTENTDB$#
NEPERSISTENTDB$#
NEPERSISTENTDB$#
NEPERSISTENTDB$#
CREATE FUNCTION
NEPERSISTENTDB=#
NOTICE: sql: drop index T_DB_DATASTORE_AAA_SOURCE_IDX_PATH;
NOTICE: sql: drop index T_DB_DATASTORE_AAA_SOURCE_IDX_VALUE;
NOTICE: sql: drop index T_DB_DATASTORE_INTERFACES_SOURCE_IDX_PATH;
NOTICE: sql: drop index T_DB_DATASTORE_INTERFACES_SOURCE_IDX_VALUE;
NOTICE: sql: drop index T_DB_DATASTORE_GLOBAL_SOURCES_LINK_IDX_PATH;
DELETE_INDEX_FUNC
-------------------
(1 row)
NEPERSISTENTDB=#
- Check whether the fault persists after the preceding operations are complete. If so, go to step 2.
- Back up data and restore it.
- Check the tables that occupy a large amount of data in the database instance.
- Use PuTTY to log in to the management node as the sopuser user in SSH mode.
- Run the following commands to switch to the root user:
> su - root
Password: password for the root user
- Run the following command to log in to the database console:
su - dbuser -c "source appgsdb.bashrc&&gsql -d nepersistentdb -U dbuser -p '`cat /opt/oss/manager/var/tenants/NCECAMPUS/containerlist.json | grep -A 200 "nepersistentdb" | grep "port" | awk -F ':' '{print $2}' | awk -F '{' '{print $1}'`' -h `cat /opt/oss/manager/var/tenants/NCECAMPUS/containerlist.json | grep -A 200 "nepersistentdb" | grep "ip" | awk -F ':' '{print $2}' | awk -F '"' '{print $2}'`"
Enter the password for the dbuser user as prompted.
Password for user dbuser: password for the dbuser user
The following shows the command output.
Password for user dbuser:
gsql (9.2.4)
SSL connection (cipher: DHE-RSA-AES256-SHA, bits: 256)
Type "help" for help.
NEPERSISTENTDB=#
nepersistentdb: In the command, nepersistentdb indicates the abnormal database instance.
- Run the following command to view the source (in T_DB_DATASTORE_xxx_SOURCE format) and link (in T_DB_DATASTORE_GLOBAL_xxx_LINK format) tables that occupy the largest disk space:
select RELNAME,N_LIVE_TUP,N_DEAD_TUP,last_autovacuum,vacuum_count,pg_size_pretty(pg_total_relation_size('"' || relname || '"')) as totalsize,pg_size_pretty(pg_table_size('"' || relname || '"')) as tablesize, pg_size_pretty(pg_indexes_size('"' || relname || '"')) as indexsize from pg_stat_user_tables order by N_LIVE_TUP desc;
The following shows the command output.
- Stop the SouthboundService microservice.
- Log in to the management plane and choose Product > System Monitoring.
- In the upper left corner of the System Monitoring page, move the pointer to and select the product.
- On the Services tab page, search for southboundservice.
- Select all instances and stop them.
- Back up tables that occupy a large amount of database instance data.
- Use PuTTY to log in to the management node as the sopuser user in SSH mode.
- Run the following commands to switch to the root user:
> su - root
Password: password for the root user
- Back up tables with a large amount of data to the /tmp/dumpTable.sql file. The time required depends on the data volume. For example, run the following command to back up the T_DB_DATASTORE_INTERFACES_SOURCE table:
login_db="nepersistentdb";su - dbuser -c "source appgsdb.bashrc&&gs_dump ${login_db} -t T_DB_DATASTORE_INTERFACES_SOURCE -f /tmp/dumpTable.sql -U ossdbuser -p '`cat /opt/oss/manager/var/tenants/NCECAMPUS/containerlist.json | grep -A 200 "${login_db}" | grep "port" | awk -F ':' '{print $2}' | awk -F ',' '{print $1}'`' -h `cat /opt/oss/manager/var/tenants/NCECAMPUS/containerlist.json | grep -A 200 "${login_db}" | grep "ip" | awk -F ':' '{print $2}' | awk -F '"' '{print $2}'`"
nepersistentdb: In the command, nepersistentdb indicates the abnormal database instance.
Enter the password for the ossdbuser user as prompted.
Password for user ossdbuser: password for the ossdbuser user
The command output is as follows:
gs_dump: total time: 176635 ms
-bash: line 2: 192.168.6.17: command not found
[root@xxx ~]#
- Delete the database tables that have been backed up from the database.
- Use PuTTY to log in to the management node as the sopuser user in SSH mode.
- Run the following commands to switch to the root user:
> su - root
Password: password for the root user
- Run the following command to log in to the database console:
su - dbuser -c "source appgsdb.bashrc&&gsql -d nepersistentdb -U dbuser -p '`cat /opt/oss/manager/var/tenants/NCECAMPUS/containerlist.json | grep -A 200 "nepersistentdb" | grep "port" | awk -F ':' '{print $2}' | awk -F '{' '{print $1}'`' -h `cat /opt/oss/manager/var/tenants/NCECAMPUS/containerlist.json | grep -A 200 "nepersistentdb" | grep "ip" | awk -F ':' '{print $2}' | awk -F '"' '{print $2}'`"
Enter the password for the dbuser user as prompted.
Password for user dbuser: password for the dbuser user
The following shows the command output.
Password for user dbuser:
gsql (9.2.4)
SSL connection (cipher: DHE-RSA-AES256-SHA, bits: 256)
Type "help" for help.
NEPERSISTENTDB=#
nepersistentdb: In the command, nepersistentdb indicates the abnormal database instance.
- Run the following command to delete the tables backed up in 2.b:
drop table T_DB_DATASTORE_INTERFACES_SOURCE;
The following shows the command output.
NEPERSISTENTDB=# drop table T_DB_DATASTORE_INTERFACES_SOURCE;
DROP TABLE
NEPERSISTENTDB=#
- Restore the database tables that are backed up in 4.
- Use PuTTY to log in to the management node as the sopuser user in SSH mode.
- Run the following commands to switch to the root user:
> su - root
Password: password for the root user
- Run the following command to restore the /tmp/dumpTable.sql file:
login_db="nepersistentdb";su - dbuser -c "source appgsdb.bashrc&&gsql ${login_db} -f /tmp/dumpTable.sql -U ossdbuser -p '`cat /opt/oss/manager/var/tenants/NCECAMPUS/containerlist.json | grep -A 200 "${login_db}" | grep "port" | awk -F ':' '{print $2}' | awk -F ',' '{print $1}'`' -h `cat /opt/oss/manager/var/tenants/NCECAMPUS/containerlist.json | grep -A 200 "${login_db}" | grep "ip" | awk -F ':' '{print $2}' | awk -F '"' '{print $2}'`"
Enter the password for the ossdbuser user as prompted.
Password for user ossdbuser: password for the ossdbuser user
The following shows the command output.
Password for user ossdbuser:
SET SET SET SET SET SET SET SET SET
CREATE TABLE
gsql:/tmp/dumpTable.sql:1638262: NOTICE: ALTER TABLE / ADD PRIMARY KEY will create implicit index "T_DB_DATASTORE_INTERFACES_SOURCE_PKEY" for table "T_DB_DATASTORE_INTERFACES_SOURCE"
total time: 1761635 ms
[root@linux ~]#
- If the fault persists, contact technical support engineers.
Node and Service Faults
iMaster NCE-Campus and NCE-OMP Nodes Are in Abnormal State on the Management Plane
Symptom
On the management plane homepage, the statuses of the iMaster NCE-Campus and NCE-OMP nodes are abnormal and service exceptions occur. On the NCE-OMP and iMaster NCE-Campus system monitoring pages, the connection status of the nodes is normal but the service status is unknown, as shown in the following figures.
Possible Causes
A disk error occurs on the nodes.
Procedure
- Check the following items and rectify faults according to the check and troubleshooting methods.
A service status exception may be due to several reasons. This section describes how to troubleshoot disk faults. If the fault persists after you perform the following operations, collect the fault information and contact Huawei technical support.
No.
Check Item
Check Method
Troubleshooting Method
1
Disk file size
- On the management plane, select iMaster_NCE-Campus, select a faulty node, and click the node name to check the node IP address.
- Log in to the node as the sopuser user.
- Run the df -h command to check the disk usage.
- Run the ps -ef | grep /opt/oss | grep -v grep command to check the running status of service processes.
- If the disk usage of /opt or /var/log/ is 100% and no output is displayed after the ps -ef | grep /opt/oss | grep -v grep command is run, go to 2.
- If the disk usage of /opt or /var/log/ does not reach 100% and no output is displayed after the ps -ef | grep /opt/oss | grep -v grep command is run, go to 3.
- If the disk usage of /opt or /var/log/ does not reach 100% and the command output of the ps -ef | grep /opt/oss | grep -v grep command is not empty, contact technical support.
- Perform the following operations if the disk usage of /opt or /var/log/ is 100% and no output is displayed after the ps -ef | grep /opt/oss | grep -v grep command is run. The command outputs are displayed in the following figures.
- Log in to the faulty node as the sopuser user and switch to the root user.
- Access /opt or /var/log/ whose disk usage reaches 100%.
- Run the following commands to check whether a disk error occurs:
dd if="$(df -P /opt/ | tail -1 | awk '{print $1}')" of=/dev/zero bs=512 count=1 iflag=direct
dd if="$(df -P /var/log | tail -1 | awk '{print $1}')" of=/dev/zero bs=512 count=1 iflag=direct
The command output is shown in the following figure. If the commands fail to be executed, contact technical support to rectify the fault.
- Run the /sbin/chkconfig ossipmc01 on command to enable the ossipmc01 service at startup, and then restart the OS.
- After the restart, run the ps -ef | grep /opt/oss | grep -v grep command to check whether controller service processes are normal. The command output is shown in the following figure. If the command output is empty, contact technical support.
- Log in to the management plane as the admin user and check whether the node service status becomes normal.
It takes a long time to restore a node. In most cases, it takes about 60 minutes for node statuses to be displayed on the management plane. If the node is not restored after 60 minutes, contact technical support.
- Perform the following operations if the disk usage of /opt or /var/log/ does not reach 100% and no output is displayed after the ps -ef | grep /opt/oss | grep -v grep command is run.
- Log in to the faulty node as the sopuser user and switch to the root user.
- Run the following commands to check whether a disk error occurs:
dd if="$(df -P /opt/ | tail -1 | awk '{print $1}')" of=/dev/zero bs=512 count=1 iflag=direct
dd if="$(df -P /var/log | tail -1 | awk '{print $1}')" of=/dev/zero bs=512 count=1 iflag=direct
The command output is shown in the following figure. If the commands fail to be executed, contact technical support to rectify the fault.
- Run the /sbin/chkconfig ossipmc01 on command to enable the ossipmc01 service at startup, and then restart the OS.
- After the restart, run the ps -ef | grep /opt/oss | grep -v grep command to check whether controller service processes are normal. The command output is shown in the following figure. If the command output is empty, contact technical support.
- Log in to the management plane as the admin user and check whether the node service status becomes normal.
It takes a long time to restore a node. In most cases, it takes about 60 minutes for node statuses to be displayed on the management plane. If the node is not restored after 60 minutes, contact technical support.
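The decision table above can be summarized in one triage function. The following hedged sketch classifies a node from its worst partition usage (integer percent, as reported by `df`) and the number of service processes (as counted by `ps -ef | grep /opt/oss | grep -v grep | wc -l`). The step numbers refer to the numbered items in this section; the function itself is illustrative only.

```shell
# Illustrative only: $1 = disk usage percent, $2 = service process count.
triage_node() {
    local usage="$1" procs="$2"
    if [ "$procs" -gt 0 ]; then
        # Processes still running: the table directs you to support.
        echo "contact technical support"
    elif [ "$usage" -ge 100 ]; then
        echo "step 2: disk full, processes down"
    else
        echo "step 3: disk not full, processes down"
    fi
}
```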
Co-Deployed Node Faults
Symptom
When you log in to the management plane using a browser or log in to the co-deployed node using PuTTY, no response is returned or the login fails.
Possible Causes
- The network is faulty.
- The node is powered off.
- The OS fails.
- The applications or databases are damaged.
Troubleshooting Procedure
- Management node faults can have many causes. This section provides basic troubleshooting methods for rectifying them. If the faults persist after you perform the following operations, collect the fault information and contact Huawei technical support.
- During fault rectification, the system partition will be formatted when you restore the OS. Therefore, exercise caution when performing this operation.
- If the co-deployed node functions as the backup server and is faulty, the node fault cannot be rectified on the management plane.
- Contact the administrator to check and rectify the network fault.
- Contact the administrator to check whether Servers are abnormal, for example, powered-off or deleted. If abnormal, rectify the fault.
- Restart Servers and use PuTTY to check whether you can log in to the faulty node as the sopuser user in SSH mode.
- If the login is successful, the node faults have been rectified. No further action is required.
- If the login fails or no response is returned, the OS of the faulty node is abnormal. Restore the OS of the faulty node. For details, see "Restoring the OS of OMP Node" in Administrator Guide.
- Restore the product application. For details, see "Restoring Product Applications" in Administrator Guide.
Faults of Multiple Management Nodes (GaussDB 100)
Symptom
The management plane is deployed in cluster mode and uses the GaussDB 100 database. The management plane is unreachable.
Possible Causes
- The service or database of the management plane is abnormal.
- Multiple management nodes are faulty. For example:
- OMP_01 is faulty and OMP_02 is powered off.
- After the active/standby switchover, the database of the management plane is in the dual-standby state.
Troubleshooting Procedure
- Obtain the backup package management.tar.gz of the management plane and the signature file management.tar.gz.sign from the backup server. The backup packages are stored in the home directory of the backup server user/path specified in the backup parameters/management/management/timestamp/node name. For example, if the login user of the backup server is the ftpuser user, the path is /opt/backup/ftpboot/backup/management/management/20190729002834588/node146.
- Obtain the integrity check tool package BKSigntool-tool version-OS_system type_pkg.tar of the required version from Huawei technical support and save it to your PC.
- Use FileZilla to upload the integrity check tool package, the backup file of the management plane, and the signature file to the /tmp directory on all management nodes as the sopuser user in SFTP mode. For details, see Logging In to a Server Using PuTTY.
- Disable the switchover between the master and slave database instances.
- Use PuTTY to log in to OMP_01 as the sopuser user in SSH mode. For details, see Logging In to a Server Using PuTTY.
Perform this operation only on OMP_01. For details about how to obtain the IP address of a node, see How Do I Query the IP Address of a Node?
- Run the following command to switch to the ossadm user:
> su - ossadm
Password: password for the ossadm user
- Run the following commands to disable the switchover between the master and slave database instances within 180 minutes:
> cd /opt/oss/manager/agent/bin
> bash dbha_switch_tool.sh -cmd set-ignore-nodes -nodes all -expire 180
- Use PuTTY to log in to OMP_01 as the sopuser user in SSH mode. For details, see Logging In to a Server Using PuTTY.
- Stop the service and databases of the management plane.
- Use PuTTY to log in to each management node as the sopuser user in SSH mode and perform the following operations:
- Run the following command to switch to the ossadm user:
> su - ossadm
Password: password for the ossadm user
- Run the following commands to stop the service and databases of the management plane:
> source /opt/oss/manager/bin/engr_profile.sh
> ipmc_adm -cmd stopmgr
If information similar to the following is displayed, the service and databases of the management plane are stopped successfully. Go to 6. If the service and databases fail to be stopped, also go to 6.
...
============================
Stopping management processes is complete.
...
============================
Stopping management dc is complete
- Perform the pre-restoration processing operations based on the node type.
- Use PuTTY to log in to OMP_01, OMP_02, and OMP_03 as the sopuser user in SSH mode.
- Run the following command to switch to the root user:
> su - root
Password: password for the root user
- Run the following commands to perform the pre-restoration processing operations:
- On OMP_01 and OMP_02, run the following commands:
# [ -d /opt/oss/share/manager-bak ] || cp -a /opt/oss/share/manager /opt/oss/share/manager-bak
# rm -rf /opt/oss/share/manager/{Etcd/,MCZKService/,ServiceCenter/}
- On OMP_03, run the following commands:
# [ -d /opt/oss/share/manager-bak ] || cp -a /opt/oss/share/manager /opt/oss/share/manager-bak
# rm -rf /opt/oss/share/manager/{Etcd/,MCZKService/}
- On OMP_01 and OMP_02, run the following commands:
- Run the following command to check whether the processes started by the ossadm and dbuser users exist. Clear the processes if they exist.
# ps -ef
Information similar to the following is displayed:
UID    PID   PPID  C STIME TTY TIME     ...
root   5263  5475  0 15:04 ?   00:00:00 ...
ossadm 5270  35779 0 15:04 ?   00:00:00 ...
dbuser 5322  1     8 11:36 ?   00:18:26 ...
...
- Records whose UID is ossadm are processes started by the ossadm user. Run the following command to clear these processes:
# ps -fww -uossadm --no-headers |awk '{print $2}'|xargs kill -9
- Records whose UID is dbuser are processes started by the dbuser user. Run the following command to clear these processes:
# ps -fww -udbuser --no-headers |awk '{print $2}'|xargs kill -9
- If the values of UID do not contain ossadm or dbuser, skip this step.
After the clearing, run the ps -ef command to check that no process started by the ossadm or dbuser user exists.
- Run the following command to exit the root user:
# exit
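The check-and-clear steps above can be folded into a small helper for review before anything is killed. This is a sketch only; list_user_pids is a hypothetical helper name, not part of the product:

```shell
# list_user_pids USER: print, one per line, the PIDs that the procedure's
# "xargs kill -9" commands would target for the given user.
list_user_pids() {
    # ps exits non-zero for an unknown user; suppress the error so the
    # helper simply prints nothing in that case.
    ps -fww -u"$1" --no-headers 2>/dev/null | awk '{print $2}'
}

# Review first, then clear, for example:
#   list_user_pids ossadm
#   list_user_pids ossadm | xargs -r kill -9
```

The -r option of xargs prevents kill from running at all when no process is found, which the bare `xargs kill -9` in the procedure does not.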
- On OMP_01 or OMP_02, query the node where the master mgrdbInstanceName database instance resides. If the faulty node you have logged in to is not OMP_01 or OMP_02, skip this step.
- Use PuTTY to log in to OMP_01 or OMP_02 as the sopuser user in SSH mode.
- Run the following command to query the node where the master mgrdbInstanceName database instance resides:
> cd /tmp
> zgrep --binary-files=text 'mgrdbInstanceName=managedbsvr' management.tar.gz
- If information similar to the following is displayed, the master mgrdbInstanceName database instance resides on the node:
mgrdbInstanceName=managedbsvr-0-999
- If no information is displayed, the slave instance of the mgrdbInstanceName database instance resides on the node.
- Run the following command to switch to the ossadm user:
> su - ossadm
Password: password for the ossadm user
- Restore the application and data of the management plane. For details, see Table 3-130.
- Perform 9 to 11 on nodes in the following sequence to ensure the restoration is successful: node where the master mgrdbInstanceName database instance resides, node where the slave mgrdbInstanceName database instance resides, and other nodes.
- The restoration of the management plane takes a long time, so PuTTY may be disconnected during the restoration due to timeout. Configure PuTTY to prevent it from being disconnected. For details, see How Do I Prevent PuTTY from Being Disconnected upon Timeout?
Table 3-130 Restoring the management plane
Node
Operation
Node where the master mgrdbInstanceName database instance resides
> sudo /usr/local/uniepsudobin/execute.sh /tmp/BKSigntool-tool version-OS_system type_pkg.tar /opt/backupManagement restoreManagement.sh /tmp/management.tar.gz
NOTE: If the management node and product node are the same node and use the same database software, and the damaged database software needs to be restored, append yes to the end of the command. If yes is not appended, the database software is not restored by default. During database software restoration, the product functions may be unavailable for a short period of time. For details about how to check whether the management node and product node are the same node and use the same database software, see How Do I Determine the Deployment Mode of Nodes? and How Do I Check Whether Management Nodes and Product Nodes Use the Same Database Software?
For example:
> sudo /usr/local/uniepsudobin/execute.sh /tmp/BKSigntool-tool version-OS_system type_pkg.tar /opt/backupManagement restoreManagement.sh /tmp/management.tar.gz yes
When the following information is displayed, enter y and press Enter:
Are you sure you want to restore the database applications? [y/n]
Other nodes
> sudo /usr/local/uniepsudobin/execute.sh /tmp/BKSigntool-tool version-OS_system type_pkg.tar /opt/backupManagement recoveryGaussManagement.sh /tmp/management.tar.gz
NOTE: If the management node and product node are the same node and use the same database software, and the damaged database software needs to be restored, append yes to the end of the command. If yes is not appended, the database software is not restored by default. During database software restoration, the product functions may be unavailable for a short period of time.
For example:
> sudo /usr/local/uniepsudobin/execute.sh /tmp/BKSigntool-tool version-OS_system type_pkg.tar /opt/backupManagement recoveryGaussManagement.sh /tmp/management.tar.gz yes
When the following information is displayed, enter y and press Enter:
Are you sure you want to restore the database applications? [y/n]
- If the following information is displayed, the management plane is successfully restored, and the database instances and the management plane service are started successfully.
Restore management successfully.
- If the following information is displayed, the management plane service fails to be started during the restoration. Contact Huawei technical support to check the statuses of the database instances of the management plane.
ERROR: Start management app service falied.
ERROR: Please check if the dbInstance status is ok, if its not ok, please recovery the dbInstance first, and then try to start management.
ERROR: Restore management failure.
- If the statuses of the management plane database instances are normal, the management plane service startup failure is not caused by exceptions in the database instances of the management plane. Contact Huawei technical support.
- If the statuses of the management plane database instances are abnormal, restore the databases first. For details, see "Database Faults" in Troubleshooting Guide. Manually start the management plane service. For details, see Starting the Management Plane Service.
- If information similar to the following is displayed, the management plane backup file fails to be verified. Contact Huawei technical support.
ERROR: Verify /opt/backupManagement/management.tar.gz failed.
ERROR: Restore management failure.
- If the following information is displayed, the task execution fails. Contact Huawei technical support.
ERROR: Restore management failure.
- Run the following command to exit the ossadm user:
> exit
- Run the following commands to delete the files uploaded to the temporary directory:
> rm -rf /tmp/management.tar.gz
> rm -rf /tmp/management.tar.gz.sign
> rm -rf /tmp/BKSigntool-tool version-OS_system type_pkg.tar
- Enable the switchover between the master and slave database instances.
- Use PuTTY to log in to OMP_01 as the sopuser user in SSH mode.
If the management plane is deployed in cluster mode, perform the operations on OMP_01.
- Run the following command to switch to the ossadm user:
> su - ossadm
Password: password for the ossadm user
- Run the following commands to enable the switchover between the master and slave database instances:
> cd /opt/oss/manager/agent/bin
> bash dbha_switch_tool.sh -cmd del-ignore-nodes
If Successful is not displayed, the command execution fails. Contact Huawei technical support.
- Check the product database instance status.
- Log in to the management plane and choose Product > System Monitoring.
- In the upper left corner of the System Monitoring page, move the pointer to
and select the product.
- On the Relational Databases tab page, check whether Status of the database instances is Running.
- If yes, skip this step.
- If no, restore the databases first. For details, see "Database Faults" in Troubleshooting Guide.
- Optional: If the management node and product node are the same node, restore the application and data of the product. Restore the database application, product application, and product data in sequence. You only need to restore the product data once.
- Restore the database application. For details, see "Restoring Database Applications" in Administrator Guide.
- Restore the product application. For details, see "Restoring Product Applications" in Administrator Guide.
- Restore the product data. For details, see "Restoring Product Data" in Administrator Guide.
Longer Time for Node Switchover to Take Effect
Symptom
The switchover between the active and standby nodes takes longer than expected to take effect.
Possible Causes
For a service deployed in active/standby mode, if the active node service is still abnormal after three restarts, an active/standby switchover is performed and the floating IP address is migrated from the active node to the standby node. If the gateway supports broadband remote access server (BRAS) authentication, the migration of the floating IP address takes effect only after the BRAS route table ages or BRAS detects that the original active node is invalid. The time for the floating IP address to take effect depends on the aging time of the route table or the BRAS detection duration.
Troubleshooting Procedure
Contact the administrator to obtain the aging time of the BRAS route table and the BRAS detection duration. If the switchover does not take effect after this period, contact Huawei technical support.
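The time for the floating IP address to become reachable again can be measured with a small polling helper. This is a sketch only; the floating IP address shown is hypothetical, and the timeout should be set to the BRAS aging time obtained from the administrator:

```shell
# wait_until TIMEOUT CMD...: run CMD once per second until it succeeds or
# TIMEOUT seconds elapse; print the elapsed seconds on success.
wait_until() {
    timeout="$1"; shift
    start=$(date +%s)
    while :; do
        if "$@" >/dev/null 2>&1; then
            echo $(( $(date +%s) - start ))
            return 0
        fi
        [ $(( $(date +%s) - start )) -ge "$timeout" ] && return 1
        sleep 1
    done
}

# Example with a hypothetical floating IP address and a 600s bound:
#   wait_until 600 ping -c 1 -W 2 192.0.2.10
```

If the helper returns failure after the BRAS aging time has elapsed, that matches the condition under which the procedure asks you to contact Huawei technical support.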
Service Faults
Service Log Faults
Symptom
The latest service log information cannot be printed.
Possible Causes
- The service log rights are incorrect.
- An error is reported in the service log.
Troubleshooting Procedure
- Use PuTTY to log in to the service node where the fault occurs as the sopuser user in SSH mode.
- Run the following command to switch to the ossadm user:
> su - ossadm
- Run the following command to query the cause of the service exception:
> cd /var/log/oss/NCECAMPUS/XXXXXService/XXXXXXservice-XX-XX/tomcatlog
> vi catalina.out
Information similar to the following indicates the cause of the exception:
log4j:ERROR setFile error
java.io.FileNotFoundException: /var/log/oss/NCECAMPUS/XXXXXService/XXXXXservice-XX-XX/log/root.log (Permission denied)
    at java.io.FileOutputStream.open0(Native Method)
    at java.io.FileOutputStream.open(FileOutputStream.java:270)
    at java.io.FileOutputStream.<init>(FileOutputStream.java:213)
    at java.io.FileOutputStream.<init>(FileOutputStream.java:133)
    at org.apache.log4j.FileAppender.setFile(FileAppender.java:294)
    at org.apache.log4j.RollingFileAppender.setFile(RollingFileAppender.java:207)
    at com.huawei.bsp.log4j.extend.OssRollingFileAppender.setFile(OssRollingFileAppender.java:247)
    at com.huawei.bsp.log4j.extend.OssRollingFileAppender.subAppend(OssRollingFileAppender.java:212)
- Run the following command to restart the service where the fault occurs:
> /opt/oss/manager/agent/bin/ipmc_adm -cmd restartapp -app XXXXXService
- Check whether the latest service logs can be properly printed.
> cd /var/log/oss/NCECAMPUS/XXXXXService/XXXXXservice-XX-XX/log
> tailf root.log
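A "Permission denied" error in catalina.out usually means the log files have the wrong owner. The following sketch lists files whose owner differs from the expected user before any ownership change; the expected owner ossadm and the ossgroup group are assumptions to confirm for your deployment before running chown as root:

```shell
# list_wrong_owner DIR USER: print every file under DIR that is not owned
# by USER, so you can review them before fixing ownership.
list_wrong_owner() {
    find "$1" ! -user "$2" -print 2>/dev/null
}

# Review, then (as root) restore ownership, for example:
#   list_wrong_owner /var/log/oss/NCECAMPUS ossadm
#   chown -R ossadm:ossgroup /var/log/oss/NCECAMPUS   # assumed owner/group
```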
Abnormal SMPMQService Process Due to Damaged SMPMQService Data Files
Symptom
SMPMQService cannot be started due to damaged data files.
Possible Causes
The SMPMQService data files are damaged due to unexpected node power-off or management plane backup and restoration.
Prerequisites
- You have obtained the name of the SMPMQService process, for example, SMPMQService-0-0.
- You have obtained the management IP address of the node where the SMPMQService process is abnormal.
Context
SMPMQService is a service based on the Kafka third-party component. Data reported by each node is written into files. If a node is powered off unexpectedly or the management plane is backed up and restored, files on the node may be damaged, causing service startup failure.
Troubleshooting Procedure
- Use PuTTY to log in to the node where the SMPMQService process is abnormal as the sopuser user in SSH mode. For details about how to obtain the IP address of the node where SMPMQService resides, see How Do I Query the IP Address of the Node Where a Service Resides?
- Run the following command to switch to the ossadm user:
> su - ossadm
- Run the following command to delete the configuration files of the SMPMQService process:
> cd /opt/oss/manager/apps/SMPMQService/init/
> sh delete_mq_damaged_files.sh
- Run the following commands to start the SMPMQService process:
> /opt/oss/manager/agent/bin/ipmc_adm -cmd restartapp -app SMPMQService -tenant manager
If information similar to the following is displayed, the SMPMQService process is started successfully. Otherwise, contact Huawei technical support.
Starting process SMPMQService-0-0 ... success
- Check the value of Status of the SMPMQService process.
- Log in to the management plane. For details, see Logging In to the Management Plane.
- On the management plane, choose Product > System Monitoring from the main menu.
- In the upper left corner of the System Monitoring page, move the pointer to
and select iMaster NCE-Campus-OMP.
- On the Services tab page, click UniEPMgr.
- In the Processes area, check the value of Status of the process whose name starts with SMPMQService.
- If Status is Running, the fault is rectified. No further action is required.
- If Status is Starting or Stopping, the duration for starting or stopping a service is less than 1 minute. If the service is in this state for a long time, contact Huawei technical support.
- If Status is Faulty, Unknown, or Not Running, the SMPMQService process is abnormal. Contact Huawei technical support.
Memory Overflow Occurs When SMPMQService Is Restarted After the SMPMQService Data Files Are Damaged
Symptom
After the active and standby OMP nodes are powered off unexpectedly, data files of SMPMQService are damaged. If a file with the .hprof extension is generated in the /opt/oss/manager/apps/SMPMQService/ directory after SMPMQService is restarted, memory overflow has occurred.
Precautions
If the management plane is deployed in cluster mode, perform restoration only on the OMP node where memory overflow occurs.
Procedure
- Use PuTTY to log in to the OMP node as the sopuser user in SSH mode. For details, see Logging In to a Server Using PuTTY.
- Run the following command to switch to the ossadm user:
> su - ossadm
- Run the following commands to perform restoration:
> cd /opt/oss/manager/apps/SMPMQService/shellscript
> sh reset_log_offsets.sh
If information similar to the following is displayed, the restoration is successful:
Excute reset_log_offsets.sh operation
OMP IP:10.248.151.239
Authorized users only. All activities may be monitored and reported.
Stopping process smpmqservice-0-0 ... success
Authorized users only. All activities may be monitored and reported.
Starting process smpmqservice-0-0 ... success
Finish reset_log_offsets operation.
DR System Faults
Abnormal DR System Heartbeat Status Between the Primary and Secondary Sites
Symptom
On the Manage DR System page of the management plane, the heartbeat status between the primary and secondary sites is abnormal.
Possible Causes
- The heartbeat network between the primary and secondary sites is abnormal.
- The DR service at the primary or secondary site is abnormal.
- The DR system certificates of the management node at the primary and secondary sites are inconsistent or have expired.
Prerequisites
- You have obtained the heartbeat IP address of the management node at the secondary site.
- You have obtained the passwords of the sopuser and ossadm users on the management nodes at the primary and secondary sites.
Troubleshooting Procedure
This section provides only the basic troubleshooting methods. If the fault persists after troubleshooting using the following methods, contact Huawei technical support.
- Check whether the heartbeat network between the primary site and secondary site is normal.
- Use PuTTY to log in to the management node at the primary site as the sopuser user in SSH mode. Run the following command to switch to the ossadm user:
> su - ossadm
Password: password for the ossadm user
If the management plane is deployed in cluster mode, perform operations only on OMP_01. For details about how to obtain the IP address of a node, see How Do I Query the IP Address of a Node?
- Run the following command to test the connectivity between the management nodes at the primary and secondary sites.
If the IP address version is IPv4:
> ping heartbeat IP address of the management node at the secondary site
If the IP address version is IPv6:
> ping6 heartbeat IP address of the management node at the secondary site
Check the command output.
- If information similar to the following is displayed, the IP address can be pinged, and the network connection is normal:
64 bytes from heartbeat IP address of the management node at the secondary site: icmp_seq=1 ttl=251 time=42.1 ms
- If no command output is displayed within 1 minute, the network connection is abnormal. Contact the administrator to check the network status and rectify the network fault.
- Press Ctrl+C to stop the ping command.
- Check whether the DR processes of the management node are normal at the primary and secondary sites.
- Log in to the management plane at the primary site. For details, see Logging In to the Management Plane.
- On the management plane, choose Product > System Monitoring from the main menu.
- In the upper left corner of the System Monitoring page, move the pointer to
and select iMaster NCE-Campus-OMP.
- On the Services tab page, click UniEPMgr.
- In the Processes area, check whether the drmgrservice-x-x process exists and whether Status of the process is Running.
x indicates the instance number. Replace it based on site requirements.
- If yes, the processes exist and are running properly.
- If no, contact Huawei technical support.
- Log in to the management plane at the secondary site, and perform the preceding operations to check the DR processes at the secondary site. If abnormal, contact Huawei technical support to restore the DR processes.
- Check whether the DR system certificates of the management nodes at the primary and secondary sites have expired.
Check whether the 51025 Certificate of the Remote DR System Has Expired alarm is generated for the primary and secondary sites.
- If yes, update the DR system certificate. For details, see "Updating DR System Certificates" in Maintenance and Monitor (Management plane).
- If no, this fault is not caused by DR system certificate expiration.
- Contact Huawei technical support to check whether the DR system certificates of the management node match between the primary and secondary sites.
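Certificate expiry can also be checked from the command line with openssl, assuming you know the path of the DR system certificate file (the path is deployment-specific and not given here; this is a sketch, not the product's official check):

```shell
# cert_expired FILE: succeed (return 0) if the certificate has expired or
# cannot be read; fail (return 1) if it is still valid.
cert_expired() {
    ! openssl x509 -in "$1" -noout -checkend 0 >/dev/null 2>&1
}

# To see the exact expiry date of a certificate file:
#   openssl x509 -in <certificate file> -noout -enddate
```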
Abnormal DR System Data Replication
Abnormal Data Synchronization Between Databases at Primary and Secondary Sites
Symptom
On the Manage Remote DR System page of the management plane, Data Synchronization Status between the primary and secondary sites is Abnormal. Click the product to view its information and check the item whose Data Type is Database; Status of this item is Abnormal.
Possible Causes
The data replication link between the primary and secondary site products is abnormal.
Figure 3-39 shows how data is replicated in the DR system, which helps locate abnormal replication. The databases are deployed in master/slave mode at each site. When data is written to the master database, it is synchronized from the master database to the slave database. As shown in Figure 3-39, at the primary site, data is synchronized from DB01 to DB02, and at the secondary site, from DB03 to DB04. During remote replication, data is synchronized from the master database at the primary site to that at the secondary site, that is, from DB01 at the primary site to DB03 at the secondary site.
Major factors that affect data replication are as follows:
- Data replication links between products at the primary and secondary sites
- Data replication links between local nodes
- Database running status
Troubleshooting Procedure
- Check whether the data replication links between the primary and secondary sites are normal.
- Use PuTTY to log in to the management node at the primary site as the sopuser user in SSH mode.
If the management plane is deployed in cluster mode, perform operations only on OMP_01. For details about how to obtain the IP address of a node, see How Do I Query the IP Address of a Node?
- Run the following command to switch to the master database instance node at the primary site:
> ssh IP address of the master database instance node at the primary site
- Run the following command to test the connectivity between the database nodes at the primary and secondary sites:
Replace IP address of a node at the secondary site in the following commands with the IP address of the node where the database instance at the secondary site that shares the same name with the master database instance at the primary site resides.
- For an IPv4 address, run the following command:
> ping IP address of a node at the secondary site
- For an IPv6 address, run the following command:
> ping6 IP address of a node at the secondary site
Check the command output.
- If information similar to the following is displayed, the IP address can be pinged, and the network connection is normal. Press Ctrl+C to stop the ping command and go to 2.
64 bytes from IP address of a node at the secondary site: icmp_seq=1 ttl=251 time=42.1 ms
- If no command output is displayed within 1 minute, the network connection is abnormal. Press Ctrl+C to stop the ping command and contact the administrator to check and restore the network, and then go to 2.
- Check the local master and slave database instance status at the primary site.
- Log in to the management plane of the primary site. For details, see Logging In to the Management Plane.
- On the management plane, choose Product > System Monitoring from the main menu.
- In the upper left corner of the System Monitoring page, move the pointer to
and select the product.
- On the Relational Database tab page, check the statuses of the master and slave database instances.
- If Status of the master and slave database instances is Running and Replication Status is Normal, the database instances are normal. Go to 3.
- If Status of the master or slave database instance is Not Running or Unknown, or Replication Status is Abnormal, the database instance is abnormal. Rectify the fault by referring to "Database Faults" in Troubleshooting Guide.
- Forcibly synchronize data between the primary and secondary sites.
- On the management plane of the active site, choose HA > Remote High Availability System > Manage DR System from the main menu.
- On the Manage Remote DR System page, click
in the Operation column of the row that contains the product whose data is to be synchronized. Select the product data synchronization direction.
After you specify the data synchronization direction, the DR system performs full data synchronization in the specified direction, and data at the destination site is overwritten. You are advised to specify the product with the latest data as the active site product and synchronize data from it to the peer site product. If the direction is from the standby site to the active site, the standby product is switched to active and then synchronizes data to the product at the peer site.
- Perform operations as prompted.
- Check the operation result. If the operation result is not as expected, contact Huawei technical support.
- On the management plane, choose HA > Remote High Availability System > Manage DR System from the main menu.
- On the Manage Remote DR System page, verify that the heartbeat status between the primary and secondary sites is
.
- On the Manage Remote DR System page, verify that Data Synchronization Status of all products is Synchronized or Synchronizing. If Data Synchronization Status is Delayed, a large volume of data is being synchronized between the primary and secondary sites. Check the status after data synchronization is complete.
- Verify that you can log in to the service plane of the active site. For details, see Logging In to service plane.
Abnormal RHM Data Replication
Symptom
The RHM data replication between primary and secondary sites is abnormal.
Possible Causes
RHM is abnormal.
Troubleshooting Procedure
- Restart RHM at the site where RHM is abnormal.
- Log in to the management plane. For details, see Logging In to the Management Plane.
- On the management plane, choose Product > System Monitoring from the main menu.
- In the upper left corner of the System Monitoring page, move the pointer to
and select a product other than NCE-OMP.
- Click the Services tab.
- Select all instances whose Instance Name contains RHM, and click Stop.
- In the Warning dialog box, click OK.
- After RHM service instances are stopped, click Start.
- In the Warning dialog box, click OK.
If all RHM service instances are in the Running state, RHM is normal. Otherwise, contact Huawei technical support.
- If other products are deployed, repeat 1.c to 1.h to restart RHM of all products except NCE-OMP.
- Forcibly synchronize data between the primary and secondary sites.
- On the management plane of the active site, choose HA > Remote High Availability System > Manage DR System from the main menu.
- On the Manage Remote DR System page, click
in the Operation column of the row that contains the product whose data is to be synchronized. Select the product data synchronization direction.
After you specify the data synchronization direction, the DR system performs full data synchronization in the specified direction, and data at the destination site is overwritten. You are advised to specify the product with the latest data as the active site product and synchronize data from it to the peer site product. If the direction is from the standby site to the active site, the standby product is switched to active and then synchronizes data to the product at the peer site.
- Perform operations as prompted.
- Check the operation result. If the operation result is not as expected, contact Huawei technical support.
- On the management plane, choose HA > Remote High Availability System > Manage DR System from the main menu.
- On the Manage Remote DR System page, verify that the heartbeat status between the primary and secondary sites is
.
- On the Manage Remote DR System page, verify that Data Synchronization Status of all products is Synchronized or Synchronizing. If Data Synchronization Status is Delayed, a large volume of data is being synchronized between the primary and secondary sites. Check the status after data synchronization is complete.
- Verify that you can log in to the service plane of the active site. For details, see Logging In to service plane.
The DR Replication Is Abnormal and the Health Check Result Is Empty
Symptom
On the management plane, choose HA > Remote High Availability System > Manage DR System from the main menu. The data synchronization status is Abnormal.
In the upper right corner of the page, click Evaluate Health. After the check is complete, "No records found" is displayed in Inter-Site Evaluation and Primary Site Evaluation.
Possible Causes
When the standby management node of the primary site that is co-deployed with the service database is faulty, the service database cannot run properly. As a result, the DR replication is abnormal and the health check result is empty.
Troubleshooting Procedure
- Perform the takeover operation at the standby site. For details, see "Taking Over Faulty Products" in Maintenance and Monitor (Management plane).
- Delete the DR system at the current active site. For details, see "Deleting the DR System" in Maintenance and Monitor (Management plane).
- Rectify the faulty node.
- Reconfigure the DR relationship at the current active site. Data is synchronized from the current active site to the standby site. For details, see "Configuring the DR System" in Maintenance and Monitor (Management plane).
- Perform the switchover operation at the current active site to restore the site for which the fault has been rectified to the active state.
- On the management plane, choose HA > Remote High Availability System > Manage DR System from the main menu.
- On the Manage Remote DR System page, click
in the Operation column of the row that contains the product and perform operations as prompted.
- During the data synchronization, the database status on the System Monitoring page of the standby site may be Abnormal. After the data synchronization is complete, the database status changes to Normal.
- To perform the switchover for multiple products on the management plane, select the products and click Switch Over above the product list and perform operations as prompted.
- Check the operation result. If the operation result is not as expected, contact Huawei technical support.
- On the management plane, choose HA > Remote High Availability System > Manage DR System from the main menu.
- On the Manage Remote DR System page, verify that the heartbeat status between the primary and secondary sites is
.
- On the Manage Remote DR System page, view information in the Primary Site Product and Secondary Site Product columns and verify that the product DR status is consistent with the switchover result.
- On the Manage Remote DR System page, verify that Data Synchronization Status of the switched products is Synchronized or Synchronizing. If Data Synchronization Status is Delayed, a large volume of data is being synchronized between the primary and secondary sites. Check the status after data synchronization is complete.
- Verify that you can log in to the service plane at the active site and the menus can be displayed properly.
- (Optional) Manually start processes or change the process startup type based on site requirements.
- If Startup Type of a process is Manual, perform the following operations:
- On the management plane, choose Product > System Monitoring from the main menu.
- In the upper left corner of the System Monitoring page, move the pointer to
and select the product.
- Click Processes, select the process to be started, click Start in the upper right corner of the process list, and perform operations as prompted.
- For details about how to change the startup type of a process, see Configuring Process Startup Types.
Failure to Load the Page for the DR System or Perform DR Operations on the Web Client
Symptom
On the management plane, when the Manage Remote DR System page is loaded, an exception occurs, but other pages are normal. Alternatively, when a DR operation is performed on the Manage Remote DR System page, such as querying information about the DR system, a message indicating that the DR operation failed is displayed, but functions of other pages are normal.
Possible Causes
The DR service DRMgrService is abnormal.
Troubleshooting Procedure
- Use PuTTY to log in to the management node at the site where the DR system page is abnormal as the sopuser user in SSH mode.
If the management plane is deployed in cluster mode, perform operations on OMP_01 and then on OMP_02. For details about how to obtain the IP address of a node, see How Do I Query the IP Address of a Node?
- Run the following command to switch to the ossadm user:
> su - ossadm
Password: password for the ossadm user
- Run the following commands to query the DR service status:
> source /opt/oss/manager/bin/engr_profile.sh
> ipmc_adm -cmd statusapp -tenant manager
View the value of Status of DRMgrService when information similar to the following is displayed:
... drmgrservice-0-0 drmgrservice DRMgrService manager cluster 10.10.67.76 53986 RUNNING ...
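The status check can also be scripted. The sketch below parses a captured line of the ipmc_adm output and prints the Status column for DRMgrService; the field positions are inferred from the sample line above and should be verified in your environment.

```shell
# Sketch: extract the Status field of DRMgrService from a captured line of
# `ipmc_adm -cmd statusapp -tenant manager` output. The sample line is the
# one shown in this guide; field positions are an assumption.
sample='drmgrservice-0-0 drmgrservice DRMgrService manager cluster 10.10.67.76 53986 RUNNING'
status=$(printf '%s\n' "$sample" | awk '$3 == "DRMgrService" {print $8}')
echo "DRMgrService status: $status"
```

On a live management node you would pipe the real command output into the same awk filter instead of using the sample variable.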
- Perform the following operations based on the service status:
- If the value of Status is STOPPED, start the DR service:
> ipmc_adm -cmd startapp -app DRMgrService -tenant manager
If information similar to the following is displayed, the service is started successfully. Otherwise, contact Huawei technical support.
Starting process drmgrservice-0-0 ... success
- If the value of Status is RUNNING, restart the DR service:
> ipmc_adm -cmd restartapp -app DRMgrService -tenant manager
If information similar to the following is displayed, the service is restarted successfully. Otherwise, contact Huawei technical support.
Stopping process drmgrservice-0-0 ... success
Starting process drmgrservice-0-0 ... success
- If the value of Status is ABNORMAL, restart the DR service in the same way. If the status remains abnormal after the restart, contact Huawei technical support.
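The status-to-action mapping above can be sketched as a small helper. To stay safe it only prints the command to run (a dry run); the ipmc_adm invocations are the ones given in this section.

```shell
# Hedged sketch: map the queried DRMgrService status to the recovery action
# described above. Commands are printed, not executed.
status="STOPPED"   # replace with the value queried in the previous step
case "$status" in
  STOPPED) echo "run: ipmc_adm -cmd startapp -app DRMgrService -tenant manager" ;;
  RUNNING) echo "run: ipmc_adm -cmd restartapp -app DRMgrService -tenant manager" ;;
  *)       echo "status $status: contact Huawei technical support" ;;
esac
```

Run the printed command on the management node as the ossadm user after sourcing engr_profile.sh.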
- Check the status of the DR system 5 minutes after the services are started. If the check result is not as expected, contact Huawei technical support.
- On the management plane, choose HA > Remote High Availability System > Manage DR System from the main menu.
- On the Manage Remote DR System page, verify that the heartbeat status between the primary and secondary sites is normal.
- Check the DR status of all products at the primary and secondary sites. Verify that products at one site are in the Active state, and products at the other site are in the Standby state.
If a product is in the Initializing state, forcibly synchronize the product data between the primary and secondary sites. For details, see "Synchronizing Product Data Between Primary and Secondary Sites" in Administrator Guide.
- Verify that Data Synchronization Status of all the products is Synchronized or Synchronizing.
If Data Synchronization Status is Delayed, a large volume of data is being synchronized between the primary and secondary sites. Wait for 10 minutes and check the status again. If Data Synchronization Status of any product is still not Synchronized or Synchronizing, forcibly synchronize the product data between the primary and secondary sites. For details, see "Synchronizing Product Data Between Primary and Secondary Sites" in Administrator Guide.
- Verify that you can log in to the service plane at the active site.
Abnormal Product Status After DR Services at the Primary and Secondary Sites Are Restarted
Symptom
After the DR services at the primary and secondary sites are restarted, the products at both the primary and secondary sites are in the Initializing state on the Manage Remote DR System page.
Troubleshooting Procedure
- On the management plane, choose HA > Remote High Availability System > Manage DR System from the main menu.
- On the Manage Remote DR System page, check the heartbeat status between the primary and secondary sites.
- If the heartbeat status is normal or partially abnormal, go to 3.
- If the heartbeat status is abnormal, rectify it by referring to Abnormal DR System Heartbeat Status Between the Primary and Secondary Sites and then go to 3.
- On the Manage Remote DR System page, click the data synchronization icon in the Operation column of the row that contains the product whose data is to be synchronized. Select the product data synchronization direction.
After you specify the data synchronization direction, the DR system performs full data synchronization based on the specified direction, and data at the destination site is overwritten. You are advised to specify the product with the latest data as the active site product and synchronize data from it to the peer site product. If the direction is from the standby product to the active product, the standby product is first switched to active and then synchronizes its data to the product at the peer site.
- Perform operations as prompted.
- Check the operation result. If the operation result is not as expected, contact Huawei technical support.
- On the management plane, choose HA > Remote High Availability System > Manage DR System from the main menu.
- On the Manage Remote DR System page, check that the heartbeat status between the primary and secondary sites is normal.
- On the Manage Remote DR System page, check that Data Synchronization Status of all products is Synchronized or Synchronizing. If Data Synchronization Status is Delayed, a large volume of data is being synchronized between the primary and secondary sites. Check the status after data synchronization is complete.
- Verify that you can log in to the service plane of the active site. For details, see Logging In to service plane.
Abnormal DR System After the DR Service at the Active Site Is Restarted
Symptom
After the DR service at the active site is restarted, the heartbeat status between the primary and secondary sites is abnormal, and the DR system is abnormal.
Possible Causes
The heartbeat between the primary and secondary sites is abnormal.
Troubleshooting Procedure
For details about how to resolve the issue, see Abnormal Product Status After DR Services at the Primary and Secondary Sites Are Restarted.
Failed to Configure the DR Relationship
Symptom
- When a user configures the DR relationship between the primary and secondary sites, the status of the DR system creation task in System > Task List on the management plane of the primary site is Partially Succeeded. When the user expands the basic task information, the error message "failed to create the replication relationship of the database" is displayed for some database instances.
- When the user chooses Product > System Monitoring > Relationship Database from the main menu on the management plane of the secondary site, the corresponding database instances are displayed as Not Running.
Possible Causes
The bandwidth between the primary and secondary sites does not meet requirements or the network is intermittently disconnected, causing data synchronization failures. The database instances at the secondary site become faulty during the creation of the data replication relationship.
Precautions
If a backup or restoration task is in progress, perform operations in this section after the task is complete. Otherwise, the task or the DR operation may fail.
Procedure
- Use PuTTY to log in to the DB node where the database instance is not running on the secondary site as the sopuser user.
- Run the following command to switch to the ossadm user:
> su - ossadm
- Run the following command to force the database instance whose status is not running to rebuild:
The following uses the database instance tenantpuerdbsvr-2-42 as an example.
/usr/bin/sudo -u dbuser bash -c "source ~/.bashrc;/usr/bin/flock -ox /opt/zenith/data/tenantpuerdbsvr-2-42 -c '/opt/oss/manager/agent/DeployAgent/rtsp/python/bin/python /opt/zenith/app/bin/zctl.py -t build -c -D /opt/zenith/data/tenantpuerdbsvr-2-42 -P'"
Log in to the database as the sys user.
Need database connector's name and password:
Username:sys
Password:
Wait until the command execution is complete.
Begin to shutdown database ... Done
Begin to clear data and log ... Done
Begin to startup instance nomount ... Done
Begin to build database ... Done
Successfully build database
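For a different instance, only the instance name in the long command changes. The hypothetical helper below assembles, and only prints, the rebuild command for a given instance name, so you can review it before running it on the DB node; the instance name is a placeholder to replace.

```shell
# Hypothetical helper: build the zctl.py rebuild command for one database
# instance. It echoes the command instead of executing it; run the printed
# command manually on the DB node as described above.
instance="tenantpuerdbsvr-2-42"   # replace with your instance name
data_dir="/opt/zenith/data/${instance}"
cmd="/usr/bin/sudo -u dbuser bash -c \"source ~/.bashrc;/usr/bin/flock -ox ${data_dir} -c '/opt/oss/manager/agent/DeployAgent/rtsp/python/bin/python /opt/zenith/app/bin/zctl.py -t build -c -D ${data_dir} -P'\""
echo "$cmd"
```

The flock lock on the instance data directory mirrors the original command and prevents two rebuilds of the same instance from running at once.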
Check the operation result. On the management plane of the secondary site, choose Product > System Monitoring > Relationship Database from the main menu, and verify that all the database instances are in the Running state.
- On the management plane of the primary site, choose HA > Remote High Availability System > Manage DR System from the main menu. In the Operation column of the row that contains the product with data to be synchronized, click the data synchronization icon. Select the data synchronization direction between the primary and secondary site products, and forcibly synchronize data between the primary and secondary sites as instructed.
- Check the operation result. If the operation result is as expected, the fault is rectified. Otherwise, contact Huawei technical support.
On the management plane of the primary site, choose HA > Remote High Availability System > Manage DR System from the main menu, and verify that Data Synchronization Status of all products is Synchronized or Synchronizing. If Data Synchronization Status is Delayed, a large volume of data is being synchronized between the primary and secondary sites. Check the status after data synchronization is complete.
On the management plane of the secondary site, choose Product > System Monitoring > Relationship Database from the main menu, and verify that all the database instances are in the Running state.
Failure to Switch Products to Standby Due to Site Faults
Symptom
One of the following occurs:
- The heartbeat is normal. The task for switching the product to standby failed. The DR status of the product at the local site is Becoming Standby, and that of the peer site is Becoming Active, Initializing, or Active after takeover.
- The heartbeat is abnormal. The task for switching the product to standby failed. The DR status of the product at the local site is Becoming Standby.
Possible Causes
The current site is faulty, and its services fail to be stopped while the products at the site are being switched to standby.
Troubleshooting Procedure
- Rectify the fault at the current site based on the task information.
- Switch the product at the current site to standby.
- On the management plane, choose HA > Remote High Availability System > Manage DR System from the main menu.
- On the Manage Remote DR System page, click the switch-to-standby icon in the Operation column of the row that contains the product and perform operations as prompted.
To switch multiple products to standby on the management plane, select the products and click Switch to Standby above the product list and perform operations as prompted.
- Check the operation result. If the operation result is not as expected, contact Huawei technical support.
- On the management plane, choose HA > Remote High Availability System > Manage DR System from the main menu.
- On the Manage Remote DR System page, verify that the status of the product that has been switched to standby is Standby.
- Verify that you cannot log in to the service plane of the site at which the product has become standby.
- Restore the heartbeat status. For details, see Abnormal DR System Heartbeat Status Between the Primary and Secondary Sites.
- Specify the peer site as the active site to perform data synchronization.
- On the management plane, choose HA > Remote High Availability System > Manage DR System from the main menu.
- In the Operation column of the row that contains the product with data to be synchronized, click the data synchronization icon. Select the synchronization direction to synchronize the product data from the peer site to the local site.
- Perform operations as prompted.
- Choose System > Task List from the main menu to check the result of the task for data synchronization.
- If the task is successful, go to 4.e.
- If the task fails and the task details indicate that the system failed to stop services, on the Manage Remote DR System page of the peer site, click the takeover icon in the Operation column and perform operations as prompted to make the product at the peer site take over services from the product at the local site. Then, contact Huawei technical support.
If the data synchronization fails due to other causes, contact Huawei technical support.
- Check the heartbeat status and data synchronization status. If the check result is not as expected, contact Huawei technical support.
- On the management plane, choose HA > Remote High Availability System > Manage DR System from the main menu.
- On the Manage Remote DR System page, verify that the heartbeat status between the primary and secondary sites is normal.
- On the Manage Remote DR System page, verify that Data Synchronization Status of all products is Synchronized or Synchronizing. If Data Synchronization Status is Delayed, a large volume of data is being synchronized between the primary and secondary sites. Check the status after data synchronization is complete.
- Verify that you can log in to the service plane of the active site. For details, see Logging In to service plane.
Both the Switchover and Rollback Fail
Symptom
After a DR switchover is performed, the status of the DR switchover task on the System > Task List page on the management plane is Execution failed. In addition, on the HA > Remote High Availability System > Manage DR System page on the management plane, the active site before the switchover is displayed as Becoming Active, and the standby site before the switchover is displayed as Becoming Standby.
Possible Causes
The bandwidth between the primary and secondary sites does not meet requirements or the network is intermittently disconnected. As a result, the switchover and rollback cannot be performed.
Procedure
Perform a forcible takeover at the target active site.
In this section, the site that is planned as the active site is called the target active site, and the site that is planned as the standby site is called the target standby site.
- On the management plane of the target active site, choose HA > Remote High Availability System > Manage DR System from the main menu. In the Operation column of the row that contains the product, click the takeover icon and perform the takeover as prompted.
- Choose System > Task List from the main menu to check the execution status of the DR takeover task. Choose HA > Remote High Availability System > Manage DR System from the main menu to view the status of the primary and secondary sites.
- If the takeover task is in the Execution Succeeded state and the primary and secondary sites are in the Active and Standby states respectively, the DR system is restored.
- If the takeover task is in the Execution Succeeded state but the primary and secondary sites are in the Active and Becoming Standby states respectively, perform a forcible switchover to standby at the target standby site.
On the management plane of the target standby site, choose HA > Remote High Availability System > Manage DR System from the main menu. In the Operation column of the row that contains the product, click the forcible switchover icon and perform the forcible switchover to standby as prompted.
- If the takeover task is in the Execution failed state, stop all databases of the products at the target standby site and forcibly take over services at the target active site.
- On the management plane of the target standby site, choose Product > System Monitoring from the main menu. Click > in the upper left corner to switch to the corresponding product. Click Stop, choose Stop DB from the drop-down menu, and perform related operations as prompted.
- On the management plane of the target active site, choose HA > Remote High Availability System > Manage DR System from the main menu. In the Operation column of the row that contains the product, click the takeover icon and perform the takeover again.
- Choose HA > Remote High Availability System > Manage DR System from the main menu, click the data synchronization icon in the Operation column of a product at any site, and select the data synchronization direction of the products at the primary and secondary sites. Perform data synchronization operations as prompted.
After you specify the data synchronization direction, the DR system performs full data synchronization based on the direction. You are advised to specify the product with the latest data as the active site product, and synchronize data from it to the peer site product.
Disaster Recovery Exception Caused by the Uninstallation and Reinstallation of the ZooKeeperService in the RHM DR Scenario
Symptom
After the RHM DR relationship is established, if the ZooKeeper service is uninstalled and reinstalled on either the primary or secondary cluster, the DR relationship fails to be re-established.
Possible Causes
This exception is caused by the ZooKeeper component. After the RHM DR relationship is established, uninstalling and reinstalling the ZooKeeper service on either the primary or secondary cluster makes the zxid records inconsistent. For example, if the secondary cluster is uninstalled and reinstalled, ZooKeeper in the secondary cluster functions as the server and its recorded zxid is reset by the reinstallation, while ZooKeeper in the primary cluster functions as the client and its recorded zxid still exists. When the client sends requests to the server, the zxid on the client does not match that on the server. As a result, the client actively disconnects from the server.
Procedure
After either cluster is uninstalled and reinstalled, restart the RHM service on the peer cluster. For example, if the secondary cluster is uninstalled and reinstalled, restart the RHM service in the primary cluster.
Take the management plane as an example.
- Log in to the management plane.
- On the management plane, choose Product > System Monitoring from the main menu.
- In the upper left corner of the System Monitoring page, move the pointer over the product icon and select the target product.
- Enter RHM in the search box in the upper left corner of the Services tab page and click the search icon.
- Select the RHM service and click Stop on the right of the page.
- In the displayed dialog box, click Yes.
- On the Services tab page, check the status of the RHM service. If the status is Not Running, the RHM service is stopped successfully.
- Select the RHM service and click Start on the right of the page.
- In the displayed dialog box, click Yes.
- On the Services tab page, check the status of the RHM service. If the status is Running, the RHM service is restarted successfully.
Clearing the DR Information of the Product Nodes After the DR System Is Deleted
Symptom
If you delete the DR system when a product node is powered off or abnormal, on the Task List page, the statuses of the tasks for deleting the DR system and deleting the product are both partially successful. The IP address of the node where the deletion failed is displayed in the details of the task for deleting the product. After the node is restored, you need to clear the DR information on the product node. Otherwise, services on the node are abnormal in the non-DR scenario.
Possible Causes
The product node is powered off or abnormal.
Troubleshooting Procedure
- If a new DR system is required after the DR system is deleted, ignore operations described in this section. You can directly create a DR system.
- If no new DR system is required after the DR system is deleted, perform the operations described in this section at the corresponding site based on the IP address of the node that fails to be deleted in the task details.
- Use PuTTY to log in to the management node as the sopuser user in SSH mode.
If the management plane is deployed in cluster mode, perform operations only on OMP_01. For details about how to obtain the IP address of a node, see How Do I Query the IP Address of a Node?
- Run the following command to switch to the ossadm user:
> su - ossadm
Password: password for the ossadm user
- Run the following commands to clear the DR information on the product node:
> cd /opt/oss/manager/apps/DRMgrService/bin
> bash dr_repair.sh
If the following information is displayed, specify whether to start all services of the product after the node information is cleared based on site requirements:
Start the product services of cdo after the repair? (y/n):
If you choose to start the product services after the node information of the product is cleared, information similar to the following is displayed, and the operation is successful. Otherwise, contact Huawei technical support.
... Starting the product services of product... Product services of product started successfully. Complete.
- Log in to the management plane. For details, see Logging In to the Management Plane.
- On the management plane, choose Product > System Monitoring from the main menu.
- In the upper left corner of the System Monitoring page, move the pointer to the product icon and select the product.
- If the management plane is deployed in cluster mode, check the running status of the OMMHA service instance on the Services tab page. Otherwise, skip this step.
- On the Nodes tab page, check the status of the product node that has been restored.
- If Service Status is Running, the product node status is restored successfully.
- If Service Status is Partially Running, Unknown, or Fault, contact Huawei technical support.
- If Service Status is Not Running, in the upper left corner of the System Monitoring page, click Start, choose Start Service from the drop-down menu, and perform operations as prompted.
Arbiter Third-Party Site Faults
Symptom
The alarm "Arbiter node disconnected" is generated on the service plane. However, on the Product > System Monitoring page of the management plane, the connection status of the Common_Service node under the product is Normal.
Possible Causes
The arbiter node cannot be accessed or the arbitration service is abnormal.
Prerequisites
In a DR scenario, you need to delete the DR relationship of the product to be restored between the primary and secondary sites before rectifying the fault. For details, see "Separating the Primary and Secondary Site Products" in Maintenance and Monitor (Management plane).
Troubleshooting Procedure
- Perform the check items and check methods in Table 3-131 and rectify the fault by using the corresponding troubleshooting methods.
Arbiter third-party site faults can have complicated causes. This section provides basic methods for rectifying the fault. If the fault persists after you perform the following operations, collect the fault information and contact Huawei technical support.
Table 3-131 Troubleshooting arbiter third-party site faults
1. Network connection
Check method: Contact the administrator to check whether the network connection is normal.
Troubleshooting method: Contact the network administrator to restore the network.
2. Running status of VMs or physical machines
Check method: Contact the administrator to check whether VMs or physical machines are abnormal, for example, powered off or deleted.
Troubleshooting method: Contact the administrator to restore the VMs or physical machines.
3. OS running status
Check method: Restart the VMs or physical machines and use PuTTY to log in to the faulty node as the sopuser user in SSH mode.
Troubleshooting method: If the login fails or no response is returned, the OS of the faulty node is abnormal. Restore the OS and arbitration service of the faulty node. For details, see "Automatic Switchover (With the Arbitration Service)" in Geographic Redundancy System Installation.
4. Arbitration service instance status
Check method: Check whether there is an Arbiter Node Disconnected alarm on the service plane.
Troubleshooting method: If the alarm exists, restore the arbitration service. For details, see "Automatic Switchover (With the Arbitration Service)" in Geographic Redundancy System Installation.
- Log in to the service plane and check whether the "Arbiter node disconnected" alarm is cleared. Otherwise, contact Huawei technical support.
- In a DR scenario, re-establish the DR relationship between the primary and secondary sites for the product that has been restored. For details, see "Connecting the Primary and Secondary Site Products" in Maintenance and Monitor (Management plane).
The HDFS Synchronization Task Does Not Exist
Symptom
HDFS data is displayed in abnormal state on the Manage DR System page of the management plane. A message indicating that the HDFS synchronization task does not exist is displayed in the Details column.
Possible Causes
The bandwidth between the primary and secondary sites does not meet requirements or the network is intermittently disconnected. As a result, data synchronization fails.
Precautions
If a backup or restoration task is in progress, perform the operations described in this section after the task is complete. Otherwise, the task or operations may fail.
Procedure
- Log in to the management plane.
- On the management plane, choose HA > Remote High Availability System > Manage DR System from the main menu.
- On the Manage DR System tab page, click the data synchronization icon in the Operation column and select the direction for data synchronization between products at the primary and secondary sites.
- Perform data synchronization as prompted.
The HDFS Synchronization Status of the DR System Is Abnormal
Symptom
On the DR product list page of the management plane, a message is displayed indicating that the Hadoop distributed file system data is abnormal. After you click Detail, a message is displayed indicating that the replication status cannot be queried.
Possible Causes
The FusionInsight password of either or both the primary and secondary sites has expired. As a result, FusionInsight cannot be connected and the replication task fails to be queried.
Procedure
- Change the FusionInsight passwords of the primary and secondary sites on iMaster NCE-Campus simultaneously. For details, see How Do I Synchronize the Password Change to iMaster NCE-Campus After the Password of FusionInsight Manager Is Changed.
- After the passwords are changed, wait for about 5 minutes. The HDFS status then becomes normal.
Failed to Query the HDFS Replication Status
Symptom
The HDFS replication status cannot be queried.
Possible Causes
The FusionInsight password of one or both of the primary and secondary sites has expired. As a result, FusionInsight cannot be connected and the replication task fails to be queried.
Troubleshooting Procedure
- Change the FusionInsight password of the primary site.
- Change the FusionInsight password of the secondary site to the same as that of the primary site. For details, see How Do I Synchronize the Password Change to iMaster NCE-Campus After the Password of FusionInsight Manager Is Changed.
- After the modification is complete, wait for about 5 minutes. The HDFS status becomes normal.
File Synchronization Fails Between the Primary and Secondary Sites
Symptom
On the Manage DR System page of the management plane, the MSP customization page is displayed in abnormal state. A message indicating that file synchronization between the primary and secondary sites fails is displayed in the Details column.
Possible Causes
The mutual trust relationship between the primary and secondary sites is incorrectly configured, or the primary or secondary site is abnormal.
Precautions
If a backup or restoration task is in progress, perform the operations described in this section after the task is complete. Otherwise, the task or operations may fail.
Procedure
- Log in to the management plane of each of the primary and secondary sites.
- Verify the cluster status. If the cluster is abnormal, contact technical support engineers.
- If the cluster status is normal, reconfigure the mutual trust relationship between the primary and secondary controller clusters. For details, see Managing Mutual Trust Relationships Between Controller Clusters.
Log and Alarm Management
iMaster NCE-Campus Self-Monitored Alarm Query Failure
Symptoms
Some iMaster NCE-Campus alarms cannot be queried on iMaster NCE-Campus.
Possible Causes
- Possible cause 1: The alarm has been masked.
- Possible cause 2: There is a clock difference between the client on which the current browser runs and the background Linux environment of iMaster NCE-Campus.
Troubleshooting Procedures
- Possible cause 1: The alarm has been masked.
Log in to iMaster NCE-Campus using the admin account, and check masked alarms. If active alarms are in the masked alarm list, click the unmask icon. The active alarms are then displayed on the Current Alarms page.
Check whether the fault is rectified. If the fault is rectified, the process ends. If the fault is not rectified, continue with the following steps.
- Possible cause 2: There is a clock difference between the client on which the current browser runs and the background Linux environment of iMaster NCE-Campus.
Check whether the difference between the OS time of the client where the browser runs and the Linux time of the iMaster NCE-Campus backend equals the offset of the client's time zone from UTC. For example, if the client OS time zone is UTC+8, the difference between the backend Linux time and the foreground time must be 8 hours (the backend time can be queried by running the date -R command). If the time difference between the foreground and background does not equal the offset of the OS time zone from UTC, adjust the foreground and background time to meet this requirement.
Check whether the fault is rectified. If the fault is rectified, the process ends. If the fault is not rectified, contact technical support personnel.
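The offset comparison above can be sketched with the POSIX date command; %z prints the host's numeric UTC offset, which is the value to compare between the client and the backend (the assumption here is that both hosts provide date with %z support).

```shell
# Sketch: print this host's UTC offset so it can be compared between the
# client and the iMaster NCE-Campus backend. %z is the numeric offset,
# e.g. +0800 for UTC+8.
offset=$(date +%z)
hours=${offset%??}        # drop the minutes part, e.g. +0800 -> +08
echo "UTC offset of this host: ${offset} (${hours} hours)"
```

Run it on both the client and the backend; the printed offsets should account for the observed wall-clock difference.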
Alarms
None
Login Failures
iMaster NCE-Campus Login Failure
Symptoms
- The login page cannot be displayed after you enter the iMaster NCE-Campus address.
- You cannot log in to iMaster NCE-Campus after entering the user name and password on the login page.
- The page cannot be refreshed or opened after you click the refresh button.
Possible Causes
- Possible cause 1: The browser version is not supported.
- Possible cause 2: An IP address conflict exists on the network.
- Possible cause 3: The customer network is abnormal.
- Possible cause 4: The CPU usage or memory usage of an iMaster NCE-Campus node is excessively high.
- Possible cause 5: The number of file handles opened by an iMaster NCE-Campus process is excessively large.
Troubleshooting Procedures
- Possible cause 1: The browser version is not supported.
- Supported browsers are Google Chrome 57 or later. Check whether your browser version is supported.
- If the browser version is supported, the fault is not caused by an incorrect browser version. Continue to locate the fault.
- If the browser version is not supported, update the browser to a version supported by iMaster NCE-Campus.
- Check whether the fault is rectified. If the fault is rectified, the process ends. If the fault is not rectified, continue with the following steps.
- Possible cause 2: An IP address conflict exists on the network.
Use SSH to log in to the controller installation node. If the system prompts that the password is incorrect or you are logged in to another node, an IP address conflict may exist. In this case, check the IP addresses of the controller and FusionInsight nodes to ensure that there is no IP address conflict in the cluster. You can query the mapping between the IP address and MAC address of each network node through the gateway device to find the device corresponding to the conflicting IP address.
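One way to hunt for the conflicting device can be sketched as follows, under the assumption that the gateway or a Linux host exposes `ip neigh`-style ARP entries: if more than one MAC address has been recorded for the same IP address, a conflict is likely. The entries below are simulated.

```shell
# Sketch: count distinct MAC addresses recorded for one IP address in
# `ip neigh`-style output (simulated here). More than one MAC for the
# same IP suggests an IP address conflict.
sample='10.162.106.61 dev eth0 lladdr 00:11:22:33:44:55 REACHABLE
10.162.106.61 dev eth0 lladdr 66:77:88:99:aa:bb STALE'
macs=$(printf '%s\n' "$sample" | awk '{print $5}' | sort -u | wc -l | tr -d ' ')
echo "distinct MAC addresses seen for 10.162.106.61: $macs"
```

On a live network, replace the simulated sample with the real neighbor-table output collected over time from the gateway.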
- Possible cause 3: The customer network is abnormal.
- Check whether all the involved floating IP addresses and ports in the northbound of iMaster NCE-Campus are normal.
Log in to all iMaster NCE-Campus nodes and run the ifconfig command to check whether the northbound IP address and ER floating IP address are configured. In the ifconfig output, the interface alias :1 indicates the ER floating IP address and :nv indicates the northbound IP address. These two IP addresses may not be on the same node.
Run the netstat -anp | grep 31943 | grep LISTEN command on the :1 node to check whether the northbound port 31943 is normal.
If this northbound interface is normal, run the service keepalived restart command on the :nv node to restart the Keepalived service and check whether the fault is rectified. If the fault persists, continue with the following steps.
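The port check in the step above can be sketched against a captured line of `netstat -anp` output; the sample line below is simulated, and on a live node you would pipe the real netstat output through the same filter.

```shell
# Sketch: check a `netstat -anp`-style line for a LISTEN entry on the
# northbound port 31943 (output simulated here).
sample='tcp 0 0 10.162.106.61:31943 0.0.0.0:* LISTEN 1234/java'
if printf '%s\n' "$sample" | grep ':31943 ' | grep -q 'LISTEN'; then
  echo "northbound port 31943 is listening"
else
  echo "northbound port 31943 is NOT listening"
fi
```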
- Ping the iMaster NCE-Campus address (10.162.106.61) and gateway address (10.162.106.1) from the local PC and check whether they are reachable.
If so, the fault is not caused by an abnormal network. Continue to locate the fault.
If not, the network is abnormal.
- Check whether the fault is caused by local faults such as loose local PC network cables or an IP address configuration conflict. If so, rectify the faults.
Log in to iMaster NCE-Campus and check whether the fault is rectified. If so, the process ends. If not, perform the following steps.
- Check whether a firewall is deployed between the PC and iMaster NCE-Campus. If so, log in to the firewall. Check whether the source IP addresses connecting to iMaster NCE-Campus are restricted and whether access to the iMaster NCE-Campus port 18008 is prohibited. If so, remove the restriction.
- Log in to iMaster NCE-Campus and check whether the fault is rectified. If so, the process ends. If not, perform the following steps.
- On each node in the iMaster NCE-Campus cluster, ping the management interface IP addresses of other nodes (the default management interface is eth0).
- If the packet loss ratio and latency are close to zero, the fault is not caused by high packet loss ratio or severe latency on the customer's network. Check other causes.
- If the packet loss ratio is high (≥ 20%) or the network latency is high (≥ 2000 milliseconds), the fault occurs on the customer's network. Rectify the network fault.
- On each node in the iMaster NCE-Campus cluster, ping the service interface IP addresses of other nodes (the default service interface is eth1).
- If the packet loss ratio and latency are close to zero, the fault is not caused by high packet loss ratio or severe latency on the customer's network. Check other causes.
- If the packet loss ratio is high (≥ 20%) or the network latency is high (≥ 2000 milliseconds), the fault occurs on the customer's network. Rectify the network fault.
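The loss threshold described above can be applied mechanically to the ping summary line. The following sketch parses a sample summary (the figures are illustrative, not real measurements) and flags a packet loss ratio of 20% or higher:

```shell
# Sample ping summary line (illustrative); on a live node this comes from the ping output
summary="10 packets transmitted, 7 received, 30% packet loss, time 9012ms"

# Extract the loss percentage as a bare number
loss=$(echo "$summary" | grep -oE '[0-9]+% packet loss' | tr -dc '0-9')

# Apply the 20% threshold from the procedure
if [ "$loss" -ge 20 ]; then
  echo "high packet loss (${loss}%): suspect the customer network"
else
  echo "packet loss within limits (${loss}%)"
fi
```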
- If the fault persists, check whether the heartbeat between the LVS and ER is normal. Log in to the node where the floating IP address of :nv is located and run the ipvsadm -ln command.
Find a line in the format "TCP xxx.xxx.xxx.xxx:18008 lc persistent 50". If such a line is displayed, the heartbeat between the LVS and ER is normal; the IP address in the line (for example, 192.XXX.XXX.126) is the ER floating IP address. If the target line is not found, rectify the fault of the heartbeat cable.
- Check whether the fault is rectified. If so, the process ends. If not, continue with the following steps.
- Possible cause 4: The CPU usage or memory usage of an iMaster NCE-Campus node is excessively high.
- On each iMaster NCE-Campus node, run the following commands to check the CPU usage and memory usage of the node.
If the CPU usage is excessively high (≥ 90%), run the ps aux|head -1;ps aux|grep -v PID|sort -rn -k +3|head command to check the processes that consume the most CPU, and contact technical support personnel to locate the fault.
If the memory usage is excessively high (≥ 90%), run the ps aux|head -1;ps aux|grep -v PID|sort -rn -k +4|head command to check the processes that consume the most memory, and contact technical support personnel to locate the fault.
If a process unrelated to iMaster NCE-Campus consumes much CPU or much memory (≥ 5 GB), run the kill -9 <PID> command to kill the process. Then, check the usage again and contact technical support personnel for confirmation.
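The two ps pipelines above sort by the %CPU and %MEM columns respectively. The following sketch demonstrates the CPU variant on a few sample ps-style rows (the processes and figures are fictitious):

```shell
# Sample ps aux output (fictitious); a live check would substitute: ps aux
sample='USER PID %CPU %MEM CMD
root 101 95.0 1.2 java
oss 102 3.0 40.5 python
root 103 0.5 0.1 sshd'

echo "$sample" | head -1                          # keep the header row
# Drop the header, sort numerically on column 3 (%CPU), highest first
top=$(echo "$sample" | grep -v PID | sort -rn -k 3 | head -1)
echo "$top"                                       # process consuming the most CPU
```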
- Check whether the fault is rectified. If so, the process ends. If not, continue with the following steps.
- Possible cause 5: The number of file handles opened by an iMaster NCE-Campus process is excessively large.
- Log in to each iMaster NCE-Campus node and run the lsof -n|awk '{print $2}'|sort|uniq -c|sort -nr|more command to check the number of file handles that are opened. Note that the command execution period may be long (about 1 minute).
In the command output, the number of file handles that are opened and the corresponding process IDs are displayed in the first and second columns respectively. The number of file handles obtained by running the lsof command is inaccurate. To obtain the accurate number, perform the following steps for each process ID.
- For each process ID displayed in the preceding output, run the ll /proc/<PID>/fd | wc -l command to check the number of file handles opened by the process.
For example, for the process with the PID 24703, run the ll /proc/24703/fd | wc -l command.
If the number of file handles opened by the process is large (≥ 50,000), contact technical support personnel for further analysis and fault location.
If it is confirmed that the fault is caused by the large number of file handles opened by an iMaster NCE-Campus process, use either of the following methods to solve the problem temporarily:
- Restart the iMaster NCE-Campus process that occupies much memory. For example, to stop the process with the PID 24703, run the kill -9 24703 command. The daemon process will automatically start this process within 1 minute.
- Run the reboot command to restart the node.
Even if the fault is rectified temporarily, you still need to report this fault to technical support personnel.
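The lsof pipeline above counts occurrences of each PID in the second column. The following sketch reproduces that counting step on a few sample lsof-style rows (the PIDs and users are fictitious):

```shell
# Sample lsof output rows (fictitious); a live check would substitute: lsof -n
sample='java 24703 ossuser 1u
java 24703 ossuser 2u
sshd 1200 root 1u'

# Same pipeline as in the procedure: take column 2 (PID), count occurrences per PID
counts=$(echo "$sample" | awk '{print $2}' | sort | uniq -c | sort -nr)
echo "$counts"
# First line holds the PID with the most entries: "<count> <pid>"
first=$(echo "$counts" | head -1 | awk '{print $1, $2}')
```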
- Check whether the fault is rectified. If so, the process ends. If not, contact technical support personnel.
Alarms
None
System Management
Certificate Loading Failure
Symptoms
Certificate loading fails, and an error message is displayed.
Possible Causes
- Possible cause 1: The CA certificate file to be loaded already exists.
- Possible cause 2: The CA certificate file to be loaded is invalid.
- Possible cause 3: The number of CA certificate files to be loaded exceeds the threshold.
- Possible cause 4: The list of loaded certificates contains a CA certificate file with the same fingerprint as that of the one to be loaded.
- Possible cause 5: The certificate does not meet the requirements of iMaster NCE-Campus.
Troubleshooting Procedures
- Possible cause 1: The CA certificate file to be loaded already exists.
Log in to iMaster NCE-Campus using the admin account. On the main menu, choose . Then, click the Trusted Certificate tab. In the Certificate Information area on the Trusted Certificate tab, check whether there is any CA certificate file with the same name as the one to be loaded.
If such a CA certificate file exists, rename the CA certificate file to be loaded and then load it.
Check whether the fault is rectified. If so, the process ends. If not, continue with the following steps.
- Possible cause 2: The CA certificate file to be loaded is invalid.
- If the error message "The file size is incorrect." is displayed, the size of the CA certificate file is 0 KB or exceeds 50 KB. In this case, apply for another certificate from the CA certificate organization and load the certificate file.
- If the error message "The file type is incorrect." is displayed, the type of the CA certificate file is incorrect (only the .pem and .cer formats are supported). In this case, apply for another certificate from the CA certificate organization and load the certificate file.
- If the error message "The certificate has expired." is displayed, the CA certificate has expired. In this case, apply for another certificate from the CA certificate organization and load the certificate file.
- If the error message "The certificate has not taken effect." is displayed, the CA certificate is not yet valid. In this case, apply for another certificate from the CA certificate organization or wait for the certificate to take effect and then load the certificate file.
- If the error message "The certificate validity period exceeds the upper limit (50 years) or is less than the lower limit (90 days)." is displayed, the validity period of the CA certificate exceeds the upper limit (50 years) or is less than the lower limit (90 days). In this case, apply for another certificate from the CA certificate organization and load the certificate file.
- If the error message "The certificate must use a signature hash algorithm with a higher security than SHA256." is displayed, the security of the signature hash algorithm for the CA certificate does not meet the requirement. In this case, apply for another certificate from the CA certificate organization and load the certificate file.
- If the error message "The certificate must use a signature algorithm with an RSA key length being greater than 2048 bits." is displayed, the security of the signature algorithm for the CA certificate does not meet the requirement. In this case, apply for another certificate from the CA certificate organization and load the certificate file.
Check whether the fault is rectified. If so, the process ends. If not, continue with the following steps.
- Possible cause 3: The number of CA certificate files to be loaded exceeds the threshold.
Log in to iMaster NCE-Campus using the admin account. On the main menu, choose . Then, click the Trusted Certificate tab. In the Certificate Information area on the Trusted Certificate tab, view the number of loaded certificates.
If the number of loaded certificates is 128, which is the threshold, no new certificates can be loaded. To load the desired CA certificate, delete unneeded certificates.
Check whether the fault is rectified. If so, the process ends. If not, continue with the following steps.
- Possible cause 4: The list of loaded certificates contains a CA certificate file with the same fingerprint as that of the one to be loaded.
Log in to iMaster NCE-Campus using the admin account. On the main menu, choose . Then, click the Policy tab. When a CA certificate file is loaded on the Policy page, the error message "A certificate with the same fingerprint already exists." is displayed, indicating that a CA certificate file with a different name but the same fingerprint as that of the one to be loaded exists in the list of loaded certificates. In this case, apply for another certificate from the CA certificate organization and load the certificate file.
Check whether the fault is rectified. If so, the process ends. If not, continue with the following step.
- Possible cause 5: The certificate does not meet the requirements of iMaster NCE-Campus. Check whether the constraints are met based on the error message:
- The certificate must be in the X509 V3 Base64 format.
- The certificate must use a signature hash algorithm whose security is higher than SHA256.
- The certificate must use a signature algorithm whose RSA key length is more than 2048 bits.
- The certificate validity period cannot exceed the upper limit (50 years) or be less than the lower limit (90 days).
- The effective date of the certificate must be earlier than the current system date. Otherwise, the certificate does not take effect yet.
- The expiration date of the certificate must be later than the current system date. Otherwise, the certificate has expired.
- The certificate file must be larger than 0 KB and smaller than 50 KB.
- A maximum of 128 certificates can be uploaded.
- A maximum of 256 policies can be created.
If the requirements are not met, modify the certificate to make it meet the requirements.
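The validity-period constraint above (at least 90 days, at most 50 years) can be checked arithmetically once the certificate's Not Before and Not After timestamps are known, for example from openssl x509 output. The following sketch uses sample epoch timestamps; the 400-day validity is illustrative only:

```shell
# Sample epoch seconds (illustrative); real values come from the certificate's
# Not Before / Not After fields, e.g. via: openssl x509 -in <cert> -noout -dates
not_before=0
not_after=$((400 * 86400))            # sample: 400-day validity

days=$(( (not_after - not_before) / 86400 ))
min_days=90                            # lower limit from the requirements list
max_days=$((50 * 365))                 # upper limit (50 years)

if [ "$days" -lt "$min_days" ] || [ "$days" -gt "$max_days" ]; then
  echo "validity period out of range: ${days} days"
else
  echo "validity period acceptable: ${days} days"
fi
```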
Check whether the fault is rectified. If so, the process ends. If not, contact technical support personnel.
Alarms
None
Failed to Delete Certificates
Symptoms
Certificates cannot be deleted. The button for deleting a certificate is dimmed.
Possible Causes
The certificate to be deleted has been bound to a policy.
Troubleshooting Procedures
- Log in to iMaster NCE-Campus using the admin account. Choose . Then, click the Policy tab.
- View the policy list and locate the policy to which the certificate to be deleted is bound. Delete such a policy and then delete the certificate.
Check whether the fault is rectified. If so, the process ends. If not, contact technical support personnel.
The default certificate SyslogClientDefaultTrust.cer cannot be deleted.
Alarms
None
Failed to Delete a Policy
Symptoms
A policy cannot be deleted. The button for deleting a policy is dimmed.
Possible Causes
The current policy is in use.
Troubleshooting Procedures
- Log in to iMaster NCE-Campus using the admin account. Choose . Then, click the Policy tab.
- In the policy list, check whether some services are bound to the policy. If some services are bound to the policy, unbind these services from the policy and then delete the policy.
Check whether the fault is rectified. If so, the process ends. If not, contact technical support personnel.
Alarms
None
Third-Party SMS Server Interconnection Failure
Symptoms
An administrator configures third-party SMS server parameters on Service Manager, and clicks Test. The system prompts that the test fails.
Possible Causes
- The third-party SMS server is not supported by iMaster NCE-Campus.
- iMaster NCE-Campus is disconnected from the third-party SMS server.
- The domain name of the third-party SMS server is configured in the URL address, but iMaster NCE-Campus server fails to resolve this domain name.
- The third-party SMS server parameters configured on iMaster NCE-Campus are incorrect.
- The SMS message template of iMaster NCE-Campus does not comply with the requirements of the third-party SMS server.
Troubleshooting Procedures
- Verify that the third-party SMS server is supported by iMaster NCE-Campus.
Currently, iMaster NCE-Campus supports the following third-party SMS servers.
- fungo
- twilio
- Check whether iMaster NCE-Campus can connect to the third-party SMS server.
- On the iMaster NCE-Campus server, run the ping <IP address of the third-party SMS server> command to check whether iMaster NCE-Campus can connect to the third-party SMS server. If the ping fails, check network connectivity.
- Check whether the firewall of the iMaster NCE-Campus server permits the IP address and port of the third-party SMS server.
- Verify that the iMaster NCE-Campus server can correctly resolve the domain name of the third-party SMS server.
- On the iMaster NCE-Campus server, choose Start > Apps > Windows System > Command Prompt.
- Run the following command and check whether the iMaster NCE-Campus server can correctly resolve the domain name of the third-party SMS server.
nslookup <domain name of the third-party SMS server>
If not, check whether the NIC of the iMaster NCE-Campus server is configured with the correct IP address of the DNS server.
- Verify that the third-party SMS server parameters configured on Service Manager are correct.
- Verify that the SMS message template of iMaster NCE-Campus complies with the requirements of the third-party SMS server.
- If the third-party SMS server still cannot be configured successfully, go to /opt/oss/log/NCECAMPUS/CampusBaseService/log, and open the karaf.log.CampusBaseService file. Check log information returned by the third-party SMS server, and solve the problem according to the information.
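The following sketch shows one way to filter the returned error information out of the log file mentioned above. The sample log lines and the "error" search pattern are illustrative, not actual iMaster NCE-Campus output:

```shell
# Sample karaf-style log lines (illustrative); the real file is
# /opt/oss/log/NCECAMPUS/CampusBaseService/log/karaf.log.CampusBaseService
log='2024-05-01 10:00:01 INFO  SMS request sent to gateway
2024-05-01 10:00:02 ERROR SMS server returned 401: invalid account or password'

# Keep only the lines carrying error responses from the SMS server
errors=$(echo "$log" | grep -i 'error')
echo "$errors"
```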
Alarms
None
How Do I Query the DN Format of the AD Server Synchronization Account?
Symptoms
An error is reported when the synchronization range is configured for the AD server on iMaster NCE-Campus.
Procedures
- Connect to the AD/LDAP server, search for the user group, and search for distinguishedName in the attribute. The value is the DN format of the synchronization account.
Portal Authentication
Failure to Redirect to the Authentication Page After a User Clicks an Image or Video Link on the Login Page
Symptoms
After a user associates the terminal with an SSID and clicks an image or video link on the login page, the user is not redirected to the authentication page.
Possible Causes
The browser fails to obtain the redirected iMaster NCE-Campus address.
Procedures
No workaround is available for this problem. To reach the authentication page, do not click an image or video link on the login page.
Failure to Move the Cursor to the Dialog Box on the SMS Authentication Page After a User Installs Google Chrome for the First Time
Symptoms
After a user installs Google Chrome for the first time and clicks any page to access the authentication page of iMaster NCE-Campus, the user cannot move the cursor to the dialog box of the SMS authentication page to enter a mobile phone number.
Possible Causes
This is the behavior of Google Chrome.
Procedures
- Refresh the page or exit the browser.
- Access the SMS authentication page again.
When a User Accesses a Website Through Any Port Except Port 80 from a Terminal, the User Cannot Be Redirected to the Target Portal Page
Symptoms
After a user terminal is associated with an SSID in Portal authentication mode, the Portal page cannot be automatically displayed when the user accesses any website through any port except port 80.
Possible Causes
Due to specification limitation, iMaster NCE-Campus does not support Portal page redirection when the website is not accessed through port 80.
Procedures
In the browser, access any IP address or domain name without a port number. For example:
- https://www.huawei.com/
- https://192.168.1.100/
Authorization Fails During Authentication
Symptoms
During authentication on a mobile phone, a message indicating that the authorization fails is displayed on the page pushed in the browser.
Possible Causes
There is residual cache about a historical error in the browser on the mobile phone, leading to the authorization failure.
Procedures
Clear the cache in the browser on the mobile phone, and then access the page again for authentication.
Error 404 Is Displayed on the Portal Page During Authentication
Symptoms
When a user connects to the network through Wi-Fi, error 404 is displayed on the page.
Possible Causes
The AP SSID name starts or ends with spaces, and the involved AP version is V200R008.
Procedures
- Delete the spaces at the beginning and end of the AP SSID name.
- Alternatively, upgrade the AP to V200R009 or a later version.
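The trigger condition above (an SSID name starting or ending with spaces) can be checked before configuration. The following sketch detects and trims such spaces; the SSID value is an example:

```shell
# Sample SSID with a leading and a trailing space (illustrative)
ssid=' Guest-WiFi '

# Strip leading/trailing spaces
trimmed=$(echo "$ssid" | sed 's/^ *//;s/ *$//')

if [ "$ssid" != "$trimmed" ]; then
  echo "SSID has leading/trailing spaces: rename to '$trimmed'"
fi
```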
Service Configuration of LAN Network
A Device Fails to Go Online (Device Unregistered)
Symptoms
A device fails to register with iMaster NCE-Campus.
Choose . On the displayed page, Unregistered is displayed in the Status column for some devices.
Possible Causes
- Possible cause 1: The device version is incorrect.
- Possible cause 2: The ESNs that are added to iMaster NCE-Campus are different from the actual ones.
- Possible cause 3: The license of iMaster NCE-Campus has expired.
- Possible cause 4: The registration service is not started.
- Possible cause 5: Network IP addresses conflict.
- Possible cause 6: The device fails to obtain an IP address due to switching of the management VLAN.
- Possible cause 7: The length of the registration response packet exceeds the MTU of the device. As a result, the device fails to process the packet.
Troubleshooting Procedures
- Possible cause 1: The device version is incorrect.
- Log in to iMaster NCE-Campus as a tenant administrator.
- On the main menu, choose . The Device Online and Offline Log page is displayed.
- Click Filter Conditions, filter log records based on the ESN, and check the filtered log details in Failure/Offline Cause.
If the message "Get device basic information packet timeout or return the device to fail" is displayed in the log details, iMaster NCE-Campus does not support the current version.
- Upgrade the device to the version supported by iMaster NCE-Campus and try again.
- Possible cause 2: The ESNs that are added to iMaster NCE-Campus are different from the actual ones.
- Log in to iMaster NCE-Campus as a tenant administrator.
- On the main menu, choose . The Device List page is displayed.
- On the right of the page, use the ESN to search for a device. If the device is not found, the device is not added to iMaster NCE-Campus. Add the device to iMaster NCE-Campus and try again.
- Possible cause 3: The license of iMaster NCE-Campus has expired.
- Log in to iMaster NCE-Campus as a tenant administrator.
- On the main menu, choose . The Device Online and Offline Logs page is displayed.
- Click Filter Conditions, filter log records based on the ESN, and check the filtered log details in Failure/Offline Cause.
If the message "Expired license" is displayed in the log details, the license of iMaster NCE-Campus does not meet the requirements for the device to go online.
- Check the license usage information and perform corresponding operations.
- In the scenario where a tenant administrator manages the license, choose from the main menu as the tenant administrator and check the license information on the License tab page.
- If the license is not loaded, has expired, or is used up, purchase and load a new license.
- If the license is loaded and valid and still has available resources, contact Huawei engineers to check whether the license or the device management function is abnormal.
- In the scenario where the system administrator manages the license, choose from the main menu as the system administrator and check the license information. If the license is not loaded, has expired, or is used up, purchase and load a new license.
- Possible cause 4: The registration service is not started.
- Log in to iMaster NCE-Campus as a tenant administrator.
- On the main menu, choose . The Device Online and Offline Logs page is displayed.
- Click Filter Conditions, filter log records based on the ESN, and check the filtered log details.
If no login or logout log is found, the registration service of iMaster NCE-Campus is not started.
- Log in to each node of iMaster NCE-Campus as the ossuser user and run the su - root command to switch to the root user.
- Run the netstat -apn|grep 10020|grep -i "listen" command to check the usage of TCP port 10020.
- If there is no command output, the device registration service is not started. Contact technical support personnel.
- If the command output indicates that port 10020 is occupied, check whether the port is occupied by an iMaster NCE-Campus process. If the port is not occupied by an iMaster NCE-Campus process, a resource conflict occurs. Run the kill -9 <PID> command to kill the suspicious process and restart the iMaster NCE-Campus process. If the fault persists, contact technical support personnel.
For example, if the following command output is displayed, port 10020 is occupied by process 22127.
# netstat -apn|grep 10020|grep -i "listen"
tcp 0 0 10.1.2.101:10020 :::* LISTEN 22127/java
Find the process based on the PID and check whether the process is an iMaster NCE-Campus process.
# ps -ef|grep 22127
If the command output indicates that the process occupying the port is not related to iMaster NCE-Campus, kill the process and restart the iMaster NCE-Campus process.
# ps -ef | grep NetWork | grep Main | awk '{print $2}' | xargs kill -9
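In the netstat output above, the owning process appears in the last column in PID/name form. The following sketch extracts the PID from that sample line so it can be fed to ps or kill:

```shell
# The sample netstat output line from the procedure above
line='tcp 0 0 10.1.2.101:10020 :::* LISTEN 22127/java'

# Last field is "PID/name"; keep the part before the slash
pid=$(echo "$line" | awk '{print $NF}' | cut -d/ -f1)
echo "port 10020 is held by PID $pid"
```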
- Possible cause 5: Network IP addresses conflict.
If the device suddenly fails to go online, use SSH to log in to the controller installation node. If the system repeatedly prompts that the password is incorrect or the login lands on another node, an IP address conflict may have occurred.
In this case, check the IP addresses of the controller and FusionInsight nodes to ensure that there is no IP address conflict in the cluster. You can query the mapping between the IP address and MAC address of each network node through the gateway device to find the device corresponding to the conflicting IP address.
- Possible cause 6: The device fails to obtain an IP address due to switching of the management VLAN.
Devices registered with iMaster NCE-Campus automatically save configurations every 2 hours. After the management VLAN of a device is changed, the device will switch back to the original management VLAN if it is restarted due to power-off before the next automatic configuration saving, or due to version upgrade or downgrade. If this occurs, the device cannot obtain an IP address from the current DHCP address pool; therefore, it can no longer be managed by iMaster NCE-Campus. To prevent this problem, restore the original DHCP server environment.
- Possible cause 7: The length of the registration response packet exceeds the MTU of the device. As a result, the device fails to process the packet.
During the exchange of a registration packet between a device and the controller through networks, the intermediate networks may add packet headers to the packet. For example, after a packet passes through a VXLAN tunnel, a VXLAN header is added to the packet. The MTU value of devices on the live network is 1500, indicating that these devices can process a data packet with a maximum size of 1500 bytes. If the length of a packet exceeds 1500 bytes due to added packet headers, the packet is discarded. As a result, the device cannot register or go online.
- On any PC on the tenant network, run the ping <Southbound address or domain name of the controller> /f /l <size> command to determine the maximum size of a packet that can be transmitted successfully.
- Adjust network parameters.
- Method 1: Change the value of tcp-mss on devices. For details, see the product documentation of the devices.
- Method 2: Contact the network service provider to adjust the MTU values of the intermediate network devices.
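To choose a probe size for the ping test above, subtract the expected encapsulation overhead from the 1500-byte MTU. The following sketch assumes the standard 50-byte VXLAN overhead and 28 bytes of IP and ICMP headers; it is an illustration of the arithmetic, not a measured value for any particular network:

```shell
mtu=1500               # MTU of live-network devices, per the text
vxlan_overhead=50      # standard VXLAN encapsulation: outer Ethernet + IP + UDP + VXLAN
ip_icmp_headers=28     # IP header (20 bytes) + ICMP header (8 bytes)

# Largest ICMP payload expected to survive the tunnel without fragmentation
max_payload=$(( mtu - vxlan_overhead - ip_icmp_headers ))
echo "largest payload expected to pass: ${max_payload} bytes"
```

Probing with sizes above and below this value brackets the actual limit before adjusting tcp-mss or the intermediate MTU.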
Alarms
None
Operation Fails on iMaster NCE-Campus
Symptoms
When a tenant administrator is configuring services, a message indicating the configuration failure is displayed.
Possible Causes
A service data conflict occurs or the operation is invalid.
Troubleshooting Procedures
- If the window displayed after service configuration contains the error information and Details, click Details.
- Modify the configuration according to the error information. If the fault persists, continue with the following steps:
- Choose from the main menu.
- On the Operation Log tab page, set Operation result to Failure to filter the operation failure records.
- Click the content under the Additional Info column. In the window that is displayed, check Additional Info to view details of the failure cause.
- Modify the configuration based on the details of the failure cause.
Alarms
None
Service Configuration Delivery Fails
Symptoms
On the Configuration Result page, the value of Config Status of one or more devices is Pre-configuration, Alarm, or Fail.
Possible Causes
- Possible cause 1: The device does not go online.
- Possible cause 2: During configuration delivery, intermittent device disconnection or network flapping occurs.
- Possible cause 3: The device does not support the current feature.
- Possible cause 4: The iMaster NCE-Campus background is abnormal.
Troubleshooting Procedures
- Possible cause 1: The device does not go online.
If Config Status is Pre-configuration, the service data has not been delivered. Perform the following steps to rectify the fault:
- Check the device status.
If the device status is Unregistered, the status is normal, and no further operation needs to be performed. Otherwise, go to the next step.
- Wait for 10 minutes, refresh and check the configuration delivery result again.
- If the configuration delivery is successful, the fault is rectified.
- If the Configuration Status is still Pre-configuration, go to the next step.
- Check the device status.
- If the device is offline, rectify the fault according to A Device Fails to Go Online (Device Unregistered).
- If the device is online, contact technical support personnel.
- Possible cause 2: During configuration delivery, intermittent device disconnection or network flapping occurs.
If the Configuration Status is Fail and the error details indicate timeout, perform the following steps to rectify the fault:
- Check the device status.
- If the device is offline, rectify the fault according to A Device Fails to Go Online (Device Unregistered).
- If the device is online, go to the next step.
- Click Re-deliver if Failure.
- If the configuration fails, check whether the service is restarted successfully.
- If the service is not restarted successfully, select the device after the service is restarted successfully, click Re-deliver if Failure.
- Possible cause 3: The device does not support the current feature.
If the Configuration Result is Alarm, perform the following steps to rectify the fault:
- Click the icon to display the details.
- Find the feature with Status set to Alarm and click View Details.
- If the value of Error Message is This configuration is not supported by the current model or version of the device., the device does not support the current feature. Upgrade the device to a version that supports the current feature and try again.
- Otherwise, contact technical support personnel.
- Possible cause 4: The iMaster NCE-Campus background is abnormal.
If the Configuration Result is Fail and the "Configuration service exception. Contact maintenance engineers." message is displayed, perform the following steps to rectify the fault:
- Remove the faulty device from the site. Then, add the device to the site again.
- Check the configuration delivery result to see whether the fault is rectified.
If the fault persists, contact technical support personnel.
Alarms
None
The Configuration Result is Displayed as Failed on iMaster NCE-Campus, But the Configuration Is Successfully Delivered to the Device
Symptoms
The configuration result is displayed as failed on iMaster NCE-Campus, but the configuration is successfully delivered to the device.
Possible Causes
- Possible cause 1: After the configuration is delivered to the device, the response packet indicating configuration success is not sent back to iMaster NCE-Campus. As a result, iMaster NCE-Campus considers that the configuration delivery times out and displays the incorrect configuration result.
- Possible cause 2: When the device goes online for the first time or restarts, iMaster NCE-Campus delivers all service data packets to the device. If the device has residual configuration data or the software version does not match, the configuration fails to be delivered. In this case, the configuration result is displayed as failed. After the problem is solved by clearing the residual configuration data or upgrading the version, iMaster NCE-Campus re-delivers all service data packets to the device after it goes online again. The configuration takes effect on the device, but the configuration result is still displayed as failed.
Troubleshooting Procedures
- Choose from the main menu.
- Find the target device and click Redeploy in the Operation column.
Alarms
None
Terminals Failed to Obtain IP Addresses from the DHCP Server
Symptom
A switch functions as the DHCP server.
A tenant logs in to iMaster NCE-Campus, accesses the page, and views the online user list and historical online user list. It is found that the IP addresses of some online users are not displayed.
Possible Causes
The terminals failed to obtain IP addresses from the DHCP server.
Handling Suggestion
| Possible Cause | Verification | Solution |
|---|---|---|
| DHCP is disabled. | Run the display current-configuration \| include dhcp enable command in the user view to check whether DHCP is enabled. If the command output is empty, DHCP is disabled. | Run the dhcp enable command in the system view to enable DHCP. By default, DHCP is disabled in the system. |
| The configuration is incorrect. | | Check the configuration on the DHCP server and on the DHCP relay agent. |
| The address pool has no available IP address. | Run the display ip pool command to check whether there are available IP addresses in the address pool. The Idle(Expired) field displays the number of idle IP addresses in the address pool. If the value of this field is 0, there are no available IP addresses in the address pool. | Determine the number of DHCP clients on the network. |
| The Spanning Tree Protocol (STP) is enabled on access devices of a diskless workstation. | The timeout period of DHCP Discover messages sent from clients is shorter than the STP convergence time. As a result, the DHCP server cannot receive DHCP Discover messages and cannot allocate IP addresses to the diskless workstations. | Disable STP on access devices of the diskless workstations. |
| An IP address is manually configured for another host on the network. This causes an IP address conflict because the DHCP server does not exclude manually configured IP addresses from the address pool. | | To prevent clients from obtaining conflicting IP addresses, configure IP address conflict detection on the DHCP server. When an IP address conflict is detected, the DHCP server allocates another available IP address. |
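The first and third checks in the table can be scripted against saved command output. A minimal sketch, assuming the output of display current-configuration and display ip pool has been saved to local files (the file names and sample contents below are hypothetical):

```shell
#!/bin/sh
# Hypothetical saved output of "display current-configuration".
cat > config_dump.txt <<'EOF'
sysname Switch-A
dhcp enable
interface Vlanif10
EOF

# Hypothetical saved output of "display ip pool".
cat > pool_dump.txt <<'EOF'
Pool-name      : vlan10
Total          : 253
Used           : 253
Idle(Expired)  : 0(0)
EOF

# Check 1: is DHCP enabled globally?
if grep -q '^dhcp enable' config_dump.txt; then
  echo "DHCP: enabled"
else
  echo "DHCP: disabled - run 'dhcp enable' in the system view"
fi

# Check 3: extract the idle-address count from the Idle(Expired) field.
idle=$(sed -n 's/^Idle(Expired)[[:space:]]*:[[:space:]]*\([0-9]*\).*/\1/p' pool_dump.txt)
if [ "$idle" = "0" ]; then
  echo "Pool: no available IP addresses"
fi
```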
A Network Disconnection Occurs After a Faulty Device with an Eth-Trunk as an Uplink Is Replaced
Symptoms
A device uses an Eth-Trunk to connect to its uplink device and network services have been deployed successfully. After the device fails and is replaced by a new device, the new device cannot connect to the uplink device and cannot be automatically managed by the controller.
Possible Causes
- Possible cause 1: Eth-Trunk auto-negotiation is not enabled on the downlink Eth-Trunk interface of the uplink device.
- Possible cause 2: The downlink Eth-Trunk interface of the uplink device works in LACP mode, but the new downlink device is not configured with an Eth-Trunk interface. As a result, the uplink and downlink devices cannot communicate with each other.
- Possible cause 3: The downlink Eth-Trunk interface of the uplink device works in manual mode. However, the new downlink device is not configured with an Eth-Trunk interface and has multiple links connected to the uplink device. As a result, a network loop occurs between the devices and cannot be eliminated by STP.
Troubleshooting Procedure
- Possible cause 1: Eth-Trunk auto-negotiation is not enabled on the downlink Eth-Trunk interface of the uplink device.
- Choose from the main menu.
- In the displayed window, select a site from the Site drop-down list in the upper left corner.
- Click the Site Configuration tab.
- In the navigation pane, choose .
- Select the target device to be configured, select the desired Eth-Trunk interface, and enable Eth-Trunk auto-negotiation.
- Possible cause 2: The downlink Eth-Trunk interface of the uplink device works in LACP mode, but the new downlink device is not configured with an Eth-Trunk interface. As a result, the uplink and downlink devices cannot communicate with each other.
- Possible cause 3: The downlink Eth-Trunk interface of the uplink device works in manual mode. However, the new downlink device is not configured with an Eth-Trunk interface and has multiple links connected to the uplink device. In this case, a network loop occurs between the devices and cannot be eliminated by STP.
- Choose from the main menu.
- In the displayed window, select a site from the Site drop-down list in the upper left corner.
- Click the Site Configuration tab.
- In the navigation pane, choose .
- Select the target device to be configured and select the desired Eth-Trunk interface. Set Mode of the downlink Eth-Trunk interface on the uplink device to Manual Mode and enable Administrative status.
- Select the downlink Eth-Trunk interface of the uplink device, enable Administrative status for only one member interface and disable Administrative status for other member interfaces.
- The downlink device goes online and is automatically managed by the controller.
- On the downlink device, select the interfaces corresponding to the member interfaces of the downlink Eth-Trunk interface on the uplink device, and configure these interfaces as an Eth-Trunk interface.
- Enable Administrative status for all member interfaces of the downlink Eth-Trunk interface on the uplink device.
Service Configuration of WAN Network
A Device Fails to Go Online in WAN Network (Device Offline)
Symptoms
The device is offline.
Choose from the main menu and click Device. On the displayed page, Offline is displayed in the Status column of some devices.
Possible Causes
- Possible cause 1: The device is restarted or powered off.
- Possible cause 2: The cluster node to which the device connects is restarted.
- Possible cause 3: In the email-based deployment scenario, users' network access mode is different from that specified by Interface protocol for Site on iMaster NCE-Campus. In addition, the configuration parameters about email-based deployment are modified on a PC.
- Possible cause 4: The underlay network where the CPE resides is disconnected due to reasons such as an outstanding balance or a disconnected network cable.
Troubleshooting Procedures
- Possible cause 1: The device is restarted or powered off.
Wait until the device restart is complete, or power on the device again.
- Possible cause 2: The cluster node to which the device connects is restarted.
- Device logs are reported only to the system log center. Log in to the Syslog server and check the log details about the device's failure to go online. If "The iMaster NCE-Campus is restarted" is displayed in the log details, the fault is caused by a cluster node restart.
- Wait for 5 minutes until the cluster node restart is complete. Devices connected to this node automatically go online.
- Check whether the fault is rectified. If so, the troubleshooting procedure ends. If not, contact technical support personnel.
- Possible cause 3: In the email-based deployment scenario, users' network access mode is different from that specified by Interface protocol for Site on iMaster NCE-Campus. In addition, the configuration parameters about email-based deployment are modified on a PC.
- Check users' network access mode.
- Modify Interface protocol and IP address access mode for Site to be the same as the network access mode.
- Perform email-based deployment again.
Configuration parameters as shown in Figure 3-40 cannot be modified during email-based deployment.
- Possible cause 4: The underlay network where the CPE resides is disconnected due to reasons such as an outstanding balance or a disconnected network cable.
- Connect your PC to the WAN-side interface of the CPE and check the connectivity between them.
- Check the WAN-side network connections of the CPE.
Alarms
None
No Deployment Email Can Be Received After Site Creation
Symptoms
After a tenant administrator configures email-based deployment, no deployment email can be received and the email-based deployment fails.
Possible Causes
- Possible cause 1: If the tenant operating mode (system administrator - tenant) is used, the system administrator did not configure an email server or the configuration is incorrect.
- Possible cause 2: The iMaster NCE-Campus cluster nodes fail to ping the IP address or domain name of the SMTP server or the port used by the SMTP server is not enabled.
- Possible cause 3: If the MSP operating mode (system administrator - MSP - tenant) of iMaster NCE-Campus is used, neither the MSP administrator nor the system administrator configured an email server, or the configuration is incorrect.
Troubleshooting Procedures
- Possible cause 1: If the tenant operating mode (system administrator - tenant) is used, the system administrator did not configure an email server or the configuration is incorrect.
- Reconfigure email server parameters. The tenant operating mode is used as an example to describe how to configure an email server.
- Log in to iMaster NCE-Campus as the system administrator.
- On the main menu, choose , and click Email Server.
- Reconfigure email server parameters and click Test.
- If the message "The test succeeds" is displayed and the test email is received, the configuration is successful. Click Save.
- If the message "The test succeeds" is displayed but the test email is not received, check whether the email function of the SMTP server is normal.
- If the message "Config test error" is displayed, check whether the parameters are correctly configured.
Depending on the network quality and the performance of the SMTP server, emails may arrive up to 2 minutes late.
Some SMTP providers enable permission control for third-party application access. If the test fails, check whether third-party application access is enabled on the SMTP server and set the password parameter to the authentication password of the SMTP server.
- Modify site information and re-deliver the deployment email.
- Log in to iMaster NCE-Campus as a tenant administrator.
- On the main menu, choose .
- Click the ZTP tab.
- Click Send Email to reconfigure the email-based deployment function and then click OK. Verify that a deployment email is received from the specified CPE.
If the fault persists, contact technical support personnel.
- Modify the information about all other sites and re-deliver the deployment email.
- Reconfigure email server parameters.
- Possible cause 2: The iMaster NCE-Campus cluster nodes fail to ping the IP address or domain name of the SMTP server or the port used by the SMTP server is not enabled.
- Ping the IP address or domain name of the SMTP server on all iMaster NCE-Campus cluster nodes respectively.
- If the IP address or domain name can be pinged, go to 4.
- If the IP address or domain name cannot be pinged, enable the IP address or domain name of the SMTP server on the firewall or in the basic network configuration.
- Check whether the port is enabled. If the port is not enabled, enable it on the firewall or in the basic network configuration.
To obtain the communication port between iMaster NCE-Campus and the SMTP server, see the Communication Matrix.
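The ping and port checks above can be scripted on each cluster node. A minimal sketch using bash's built-in /dev/tcp redirection; the host and port values below are placeholders, and the real values come from your email server settings and the Communication Matrix:

```shell
#!/bin/bash
# check_tcp: attempt a TCP connection to <host> <port> with a 3-second timeout.
check_tcp() {
  timeout 3 bash -c "exec 3<>/dev/tcp/$1/$2" 2>/dev/null
}

host=127.0.0.1   # placeholder for the SMTP server address or domain name
port=25          # placeholder for the SMTP port (commonly 25, 465, or 587)

if ping -c 1 -W 2 "$host" >/dev/null 2>&1; then
  echo "ping $host: OK"
else
  echo "ping $host: FAILED - check the firewall and basic network configuration"
fi

if check_tcp "$host" "$port"; then
  echo "port $port: reachable"
else
  echo "port $port: unreachable - enable it on the firewall"
fi
```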
Alarms
None
WAN Service Configuration Delivery Fails (The Configuration Result Is in Preconfigured State)
Symptoms
Choose from the main menu. The value of Configuration Status of one or more sites is Preconfigured.
Possible Causes
The device of the site does not go online.
Troubleshooting Procedures
If the value of Configuration Status is Preconfigured, the service data is not delivered. Perform the following steps to rectify the fault:
- On the main menu, choose , and find the device by site name.
- If the value of Configuration Status is Unregistered, rectify the fault according to A Device Fails to Go Online (Device Unregistered).
- Otherwise, wait for 10 minutes and go to the next step.
- On the main menu, choose , and check the value of Configuration Status again.
- If the configuration delivery is successful, the fault is rectified.
- If the value of Configuration Status is still Preconfigured, contact technical support personnel.
Alarms
None
WAN Service Configuration Delivery Fails (The Configuration Result Is in Failed State)
Symptoms
Choose from the main menu. On the displayed page, the value of Configuration Status of one or more sites is Failed.
Possible Causes
- Possible cause 1: During configuration delivery, intermittent device disconnection or network flapping occurs.
- Possible cause 2: In the process of configuring links on the WAN or LAN side or between dual gateways or dual hub sites, an incorrect interface type is selected.
- Possible cause 3: The version of the SA signature database on the device is not updated.
- Possible cause 4: The device has residual configurations before deployment.
Troubleshooting Procedures
- Possible cause 1: During configuration delivery, intermittent device disconnection or network flapping occurs.
If the configuration result is Failed and the error details indicate timeout, perform the following steps to rectify the fault:
- On the main menu, choose , and find the device by site name.
- If the value of Configuration Status is Unregistered, rectify the fault according to A Device Fails to Go Online (Device Unregistered).
- If the value of Configuration Status is not Unregistered, go to the next step.
- On the main menu, choose , and click Redeploy.
If the delivery fails again, contact technical support personnel.
- Possible cause 2: In the process of configuring links on the WAN or LAN side or between dual gateways or dual hub sites, an incorrect interface type is selected.
- In manual deployment scenarios, perform the following operations:
Change the interface type to the correct one for the site. After the configuration is complete, the data needs to be delivered to the devices again.
- In email-based deployment scenarios, perform the following operations:
- Change the interface type to the correct one for the site.
- Perform email-based deployment operations again according to Configuring Email-based Deployment.
If the fault persists, contact technical support personnel.
- In manual deployment scenarios, perform the following operations:
- Possible cause 3: The version of the SA signature database on the device is not updated.
- Choose and find the corresponding site by site name. Select sites in the table and click Create New Policy. In the dialog box that is displayed, select Immediately and click OK.
- Wait until the status of the sites in the table changes to Succeeded. This process takes about 20 minutes, depending on the network speed and device model.
If the upgrade fails, contact technical support personnel.
- Choose and click Redeploy.
- Possible cause 4: The device has residual configurations before deployment.
Check whether the CPE has the factory configuration before deployment. If not, restore the factory settings on the CPE and perform the deployment again.
If the fault persists, contact technical support personnel.
Alarms
None
WAN Service Configuration Delivery Fails (The Configuration Result Is in Alarm State)
Symptoms
Choose from the main menu. On the displayed page, the value of Configuration Status of one or more sites is Alarm.
Possible Causes
The device does not support the current feature.
Troubleshooting Procedures
- On the main menu, choose .
- Find the policy with Configuration Status set to Alarm and click View Detail.
- If the value of Error Information is No device adapter package for this policy, the device version does not support the current feature. Upgrade the device to a version that supports the current feature and try again.
- If the value of Error Information is not No device adapter package for this policy, contact technical support personnel.
Alarms
None
The Email Server Test Fails
Symptoms
An administrator account is used to configure the email server, but the email server test fails.
Possible Causes
- Possible cause 1: The email server is unavailable.
- Possible cause 2: The account and password of the email server configured on iMaster NCE-Campus are incorrect.
- Possible cause 3: The iMaster NCE-Campus cluster node fails to ping the IP address/domain name of the SMTP server, or the corresponding port is disabled on the SMTP server.
Troubleshooting Procedures
- Possible cause 1: The email server is unavailable.
- Log in to the mailbox using the correct account and password and check whether you can normally send and receive emails.
- Check whether Post Office Protocol 3 (POP3) and the SMTP service have been enabled for the mailbox.
- If the mailbox cannot normally send or receive emails, contact the email server administrator.
- Possible cause 2: The account and password of the email server configured on iMaster NCE-Campus are incorrect.
- Log in to iMaster NCE-Campus using a system administrator account.
- Choose from the main menu, and click Email Server.
- Check whether Account and Password are correctly configured.
- Possible cause 3: The iMaster NCE-Campus cluster node fails to ping the IP address/domain name of the SMTP server, or the corresponding port is disabled on the SMTP server.
- Ping the IP address or domain name of the SMTP server on all iMaster NCE-Campus cluster nodes respectively.
- If the IP address or domain name can be pinged, go to 4.
- If the IP address or domain name cannot be pinged, enable the IP address or domain name of the SMTP server on the firewall or in the basic network configuration.
- Check whether the port is enabled. If the port is not enabled, enable it on the firewall or in the basic network configuration.
To obtain the communication port between iMaster NCE-Campus and the SMTP server, see the Communication Matrix.
Alarms
None
Maintenance
The Display of the Performance Data Is Abnormal
Symptoms
On the iMaster NCE-Campus web UI, no performance monitoring data is displayed, and the FusionInsight nodes are in an abnormal state.
Possible Causes
FusionInsight is restarted repeatedly. As a result, the HDFS enters the safe mode, providing only the data read service but not the data write service.
Troubleshooting Procedures
Quit the HDFS safe mode for FusionInsight. For details, see https://hadoop.apache.org/docs/r2.4.1/hadoop-project-dist/hadoop-common/CommandsManual.html.
A User Does Not Receive an Email After Password Reset
Symptoms
After a tenant administrator resets the password of a user, the user does not receive an email containing the reset password.
Possible Causes
- Possible cause 1: The email server is unavailable.
- Possible cause 2: The user's email address is incorrectly set on iMaster NCE-Campus.
- Possible cause 3: The mailbox configuration of the user is incorrect.
Troubleshooting Procedures
- Possible cause 1: The email server is unavailable.
- Log in to iMaster NCE-Campus using the system administrator account.
- Choose from the main menu, and click the Email Server tab.
- Click Test to test the connectivity of the email server.
- If the test fails, contact the email server administrator.
- Possible cause 2: The user's email address is incorrectly set on iMaster NCE-Campus.
- Log in to iMaster NCE-Campus using a tenant administrator account.
- Choose from the main menu, and click the Users tab.
- If an icon is displayed next to the email address, the email address is not verified or the verification fails.
- Click Verify next to the email address. In the dialog box that is displayed, click Obtain Verification Code.
- Obtain the verification code in the email received, and enter the verification code on iMaster NCE-Campus. If the verification succeeds, the email address is correct. If the verification fails, the email address is incorrect.
- If the email address is incorrect, click the icon next to the email address to change the email address.
- Possible cause 3: The mailbox configuration of the user is incorrect.
- Log in to the mailbox using the correct account and password, and check whether you can normally send and receive emails.
- Check whether the email containing the password is filtered as a junk email.
- If the mailbox cannot normally send or receive emails, contact the email server administrator.
Alarms
None
Device Upgrade Failure
Download Failure or Network Unreachability Between the Device and Server
Symptoms
On the Device Upgrade page under , Failure is displayed in the Upgrade Status column. On the Upgrade Detail page, the reported failure cause indicates that the download fails or the network between the device and server is unreachable.
Possible Causes
- FusionInsight is faulty.
- The network between the device and file server is disconnected, or the network between the file server and FusionInsight is disconnected.
- The firewall deployed between the device and iMaster NCE-Campus does not allow HTTPS packets to pass through.
- The certificate authentication between the device and file server fails. As a result, the HTTPS connection fails to be set up.
Troubleshooting Procedures
- Log in to FusionInsight and check whether an error occurs on FusionInsight.
- Check the network connectivity between the device and file server and between the file server and FusionInsight: verify that ping operations succeed in both directions, and check the connectivity of port 18021 or port 18020 on the server.
- Log in to the firewall and enable HTTPS packets to pass through the interface connecting the device to iMaster NCE-Campus.
- Check whether the device certificate has expired. If so, load a new certificate.
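The certificate check in the last step can be done offline with openssl. A sketch that generates a throwaway self-signed certificate purely to demonstrate the expiry test; in practice, point openssl at the actual device or file server certificate file:

```shell
#!/bin/sh
# Generate a throwaway certificate valid for 1 day (demo input only).
openssl req -x509 -newkey rsa:2048 -nodes -subj "/CN=demo" \
  -keyout demo.key -out demo.crt -days 1 2>/dev/null

# -checkend 0: exit status 0 if the certificate has not expired yet.
if openssl x509 -checkend 0 -noout -in demo.crt >/dev/null; then
  echo "certificate is still valid"
else
  echo "certificate has expired - load a new certificate"
fi
```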
Alarms
None
Communication Failure or Communication Abnormality
Symptoms
On the Device Upgrade page under , Failure is displayed in the Upgrade Status column. On the Upgrade Detail page, the reported failure cause indicates that the communication fails or is abnormal.
Possible Causes
NETCONF packets fail to be sent or are lost during transmission, or the device does not return response packets within the specified time.
Troubleshooting Procedures
View the device logs to check whether the device receives the NETCONF packets. If not, check the network connectivity.
Alarms
None
Packet Delivery Failure
Symptoms
On the Device Upgrade page under , Failure is displayed in the Upgrade Status column. On the Upgrade Detail page, the reported failure cause indicates that packets fail to be delivered.
Possible Causes
The format of the packets returned by the device is incorrect or the NETCONF service fails to be obtained.
Troubleshooting Procedures
- View the device logs to check whether the packets sent and received by the device are abnormal.
- Check whether iMaster NCE-Campus modules involved in packet exchange are working properly.
Alarms
None
Process Query Failure
Symptoms
On the Device Upgrade page under , Failure is displayed in the Upgrade Status column. On the Upgrade Detail page, the reported failure cause indicates that the process fails to be queried.
Possible Causes
The format of the packets returned by the device is incorrect or the NETCONF service fails to be obtained.
Troubleshooting Procedures
- View the device logs to check whether the packets sent and received by the device are abnormal.
- Check whether iMaster NCE-Campus modules involved in packet exchange are working properly.
Alarms
None
Download Cancellation Failure
Symptoms
On the Device Upgrade page under , Failure is displayed in the Upgrade Status column. On the Upgrade Detail page, the reported failure cause indicates that the download fails to be cancelled.
Possible Causes
Due to network problems, the device fails to respond to the packets requesting download cancellation. As a result, the upgrade fails to be cancelled.
Troubleshooting Procedures
Click the download cancellation button again to cancel the upgrade.
Alarms
None
Inconsistency Between the Version of the Uploaded System Software and That of the Used System Software
Symptoms
On the Device Upgrade page under , Failure is displayed in the Upgrade Status column. On the Upgrade Detail page, the reported failure cause indicates that the version number of the uploaded software is inconsistent with that of the actually used software.
Possible Causes
When uploading the device system file, the administrator manually changes the device version in the system file. As a result, the version of the uploaded software is inconsistent with that of the actually used software.
Troubleshooting Procedures
Delete the system file in which the version is changed, and upload the correct system software. Ensure that the version of the uploaded system software is the same as that of the actually used software.
Alarms
None
Inconsistency Between the Version of the Uploaded System Software and That in the EFS
Symptoms
On the Device Upgrade page under , Failure is displayed in the Upgrade Status column. On the Upgrade Detail page, the reported failure cause indicates that the version number of the uploaded software is inconsistent with that in the EFS.
Possible Causes
When uploading the system file of the central AP, the administrator manually changes the software version of the central AP. As a result, the version of the uploaded software is inconsistent with that in the EFS.
Troubleshooting Procedures
Delete the system file in which the version is changed, and upload the correct system software. Ensure that the version of the uploaded software is the same as that in the EFS.
Alarms
None
Performance Monitoring Data Cannot Be Displayed on iMaster NCE-Campus
Symptom
Performance monitoring data cannot be displayed on iMaster NCE-Campus.
Possible Causes
FusionInsight 6.5.1.6 is installed and the tokens of Spark2x tasks have expired.
- Check whether the FusionInsight version is 6.5.1.6.
- Go to the task list on the Yarn page and search for Yarn logs. The "Token has expired" log is found.
Troubleshooting Procedure
- Modify FusionInsight parameters.
- Log in to FusionInsight Manager and choose from the main menu.
- Click the Configuration tab, click the All Configurations tab, and choose SparkResource2x > Customization from the navigation pane. Add a custom parameter: set the name to spark.security.credentials.renewalRatio and the value to 0.000000000007.
If the following information is displayed, close the window.
- Choose JDBCServer2x > Customization from the navigation pane. Add a custom parameter: set the name to spark.security.credentials.renewalRatio and the value to 0.000000000007.
- Click Save in the upper left corner. In the dialog box that is displayed, click Confirm. The configuration will be saved after about 1 minute.
- Log in to the iMaster NCE-Campus server as the sopuser user and switch to the root user. Check whether the /opt/hadoopclient/Spark2x/spark/conf directory exists. If so, run the following command in this directory. If not, skip this step.
sed -i '$aspark.security.credentials.renewalRatio = 0.000000000007' spark-defaults.conf
- Restart CampusPerfService. Log in to iMaster NCE-Campus, choose from the main menu, click the Service tab, search for CampusPerfService, select the target service, and click Stop and then Start.
- Restart Spark tasks.
- Log in to FusionInsight Manager and choose Cluster > Yarn from the main menu.
- Click ResourceManager(Active). The Spark task page is displayed.
- Click RUNNING from the navigation pane to view running tasks. Click the target task ID to access the task page and click Kill Application to stop the task.
- Repeat the preceding steps to stop all running tasks.
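The sed command in the client-side step above appends the renewal-ratio setting to spark-defaults.conf in place. A self-contained demonstration on a scratch copy of the file (the real file lives in /opt/hadoopclient/Spark2x/spark/conf):

```shell
#!/bin/sh
# Create a scratch spark-defaults.conf with one existing setting.
cat > spark-defaults.conf <<'EOF'
spark.eventLog.enabled true
EOF

# '$a' appends the given text after the last line of the file.
sed -i '$aspark.security.credentials.renewalRatio = 0.000000000007' spark-defaults.conf

# Show the appended line.
tail -n 1 spark-defaults.conf
```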
VM or Physical Server Exception
Failure to Display Alarm Information/Failure to Display Performance Data/Empty Device Login and Logout Logs/Failure to Upgrade Device Software/Failure to Add, Delete, or Modify Tenants, Devices, or Sites on iMaster NCE-Campus
Symptoms
The following symptoms occur on iMaster NCE-Campus:
- No alarm information can be displayed.
- No performance data can be displayed.
- Login and logout logs are empty.
- The device software upgrade fails.
- Administrators cannot add, delete, or modify tenants, devices, and sites.
Possible Causes
More than half of the VMs or physical servers on which FusionInsight resides are not started.
Troubleshooting Procedures
- Check whether more than half of the VMs or physical servers on which FusionInsight resides are not started. If so, start them and try again.
- If not, contact technical support personnel for further fault location.
Alarms
None
Failure to Access the Operating System During the VM Startup After Server Recovery from Power-Off
Symptoms
A VM fails to access the operating system during startup. In FusionCompute, log in to the VM through VNC. The following exception information is displayed.
Possible Causes
The file system is damaged.
Troubleshooting Procedures
- In the FusionCompute, log in to a faulty VM and a normal VM with the same application deployed through the VNC. (For example, if one FusionInsight node fails, log in to this faulty node and another normal FusionInsight node as the root user through the VNC.) Then, run the df -T command to check the file systems respectively.
Compare the faulty node with the normal node, and check whether the faulty node has any file system missing. For example, in the following figure, the /srv/BigData partition is lost on the host node in the right window, while it is present in the left window. The corresponding file system is /dev/mapper/oss_vg-srv_bigdata.
- If the file system information is the same on the faulty and normal node, contact technical support personnel to locate the fault.
- If the faulty VM has any file system missing, modify the /etc/fstab file on the faulty VM by commenting out the damaged file systems with comment tags (#), and run the reboot command to restart the VM. The VM can then access the operating system.
- On the faulty VM, recover the damaged file systems.
- For EXT3 file systems, use the fsck command for recovery.
Command format: fsck <File system name>
Example: fsck /dev/mapper/oss_vg-srv_bigdata
During the recovery, the fsck command checks the file nodes and displays the fix or clear command lines.
- If the fix command line is displayed, press Enter or enter y to confirm the modification.
- If the clear command line is displayed, enter n, indicating that the data does not need to be cleared.
If you enter y when the clear command line is displayed, file loss may be caused. Only if the fault cannot be rectified through common methods should you enter y under the supervision of technical support personnel when the clear command line is displayed.
- For XFS file systems, use the xfs_repair command for recovery.
Command format: xfs_repair <File system name>
Example: xfs_repair /dev/mapper/oss_vg-srv_bigdata
If the recovery using the xfs_repair command fails (as shown in the following figure), use the xfs_repair -L command for recovery as prompted.
Command format: xfs_repair -L <File system name>
Example: xfs_repair -L /dev/mapper/oss_vg-srv_bigdata
Usage of -L may lead to the recovery failure of some data. Only if the fault cannot be rectified using the xfs_repair command should you use the xfs_repair -L command under the supervision of technical support personnel to recover the file systems.
- For EXT3 file systems, use the fsck command for recovery.
- After the file systems are recovered successfully, modify the /etc/fstab file again by recovering the file systems commented out before and run the reboot command to restart the node.
- After the VM is started normally, use the df -T command to check whether the file systems are recovered successfully.
- Restart the service on the faulty node.
- For iMaster NCE-Campus, run the following commands to restart the service:
su - ossadm -c ". /opt/oss/manager/agent/bin/engr_profile.sh;ipmc_adm -cmd startnode"
- For the FusionInsight, run the following commands to restart the service:
# su - omm
$ cd /opt/huawei/Bigdata/om-0.0.1/sbin
$ ./restart-oms.sh
- For iMaster NCE-Campus, run the following commands to restart the service:
- Check whether the fault is rectified. If so, the process ends. If not, contact technical support personnel.
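Step 2 above, commenting out a damaged file system in /etc/fstab, can be sketched as follows on a scratch copy (the volume names mirror the /dev/mapper/oss_vg-srv_bigdata example from step 1):

```shell
#!/bin/sh
# Scratch copy of an fstab with one healthy and one damaged entry.
cat > fstab.copy <<'EOF'
/dev/mapper/oss_vg-root        /             ext3  defaults  1 1
/dev/mapper/oss_vg-srv_bigdata /srv/BigData  ext3  defaults  1 2
EOF

damaged=/dev/mapper/oss_vg-srv_bigdata
# Prefix the damaged entry with '#'; '|' as the sed delimiter avoids
# escaping the slashes in the device path.
sed -i "s|^${damaged}|#${damaged}|" fstab.copy

# Show the commented-out line.
grep '^#' fstab.copy
```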
Alarms
None
Insufficient Memory Leads to Automatic Restart of VMs
Symptoms
The memory usage exceeds the alarm threshold, leading to automatic restart of VMs.
Possible Causes
Check /proc/sys/vm/panic_on_oom. If the value is 2, VM restart is triggered when the system memory is insufficient.
Troubleshooting Procedures
- Log in to the server as the sopuser user and run the su - root command to switch to the root user.
- Run the top command. Press Shift+M to sort the processes by memory usage.
- Find and check the memory-consuming processes.
- If a non-controller process, for example, a process started by the customer, consumes a large amount of memory resources, the process needs to be stopped.
- If a controller process consumes a large amount of memory resources, contact Huawei technical support personnel.
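The checks above can be combined into a short non-interactive script: read the panic_on_oom setting mentioned under Possible Causes, then list the most memory-hungry processes with ps instead of the interactive top:

```shell
#!/bin/sh
# A value of 2 means the kernel panics (and the VM restarts) on out-of-memory.
if [ -r /proc/sys/vm/panic_on_oom ]; then
  echo "panic_on_oom = $(cat /proc/sys/vm/panic_on_oom)"
fi

# --sort=-rss: highest resident memory first (GNU ps from procps).
ps aux --sort=-rss | head -n 6
```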
Alarms
None
Login Failure to a FusionCompute VM
Symptoms
The login to the background of a FusionCompute VM fails.
Check the VM status on the corresponding CNA node. The FusionCompute VM is in the paused state.
Possible Causes
The memory overcommitment function is enabled on FusionCompute. To rectify the fault, disable this function.
Troubleshooting Procedures
- Log in to the FusionCompute management system.
Change the password as prompted upon the first login.
If no security certificate is installed for Internet Explorer, when you attempt to log in to FusionCompute or log in to a VM using Virtual Network Computing (VNC) for the first time, the system displays an error message, indicating that the web page cannot be displayed. If this occurs, press F5 to refresh the web page.
The system supports the following browser versions:
- Internet Explorer 10 or later
- Google Chrome 55 or later
- Mozilla Firefox 50 or later
- In the navigation pane, click . The Resource Pools page is displayed.
- On the Cluster tab page, click the cluster to be configured.
The Summary tab page is displayed.
- On the Host tab page, check the memory usage of each host.
The memory overcommitment function can be disabled only when the memory usage is lower than 100%.
- On the Configuration tab page, choose Configuration > Control Cluster Resource.
- In the right pane, click Edit. On Basic Configuration, click Off to disable Host Memory Overcommitment.
- Click Confirm.
- On the Summary tab page, check whether the memory overcommitment function is disabled.
- Stop and then start the VMs on all hosts in the cluster.
Stop and then start the VMs one by one. Do not restart the VMs directly.
After the two VMs (one Controller VM and one FusionInsight VM) on a host are started, stop and then start the VMs on another host. On the management plane of Controller, check whether all services are normal. Log in to FusionInsight Manager and check whether all services are running properly.
You are advised to stop and then start the VMs on the third Controller node and FusionInsight node in the same way.
FusionInsight Faults
The FusionInsight Data Disk Is Damaged Due to Unexpected Power-Off
Symptoms
The server is powered off, the FusionInsight management plane cannot be logged in to, and the management floating IP address cannot be pinged. The FusionInsight cluster has three nodes, but only one of them can be logged in to.
Possible Causes
The FusionInsight data disk is damaged due to unexpected power-off.
Troubleshooting Procedures
- Log in to the FusionInsight server.
Log in to the FusionInsight server as the omm user and then switch to the root user. View the /etc/fstab file. If the following information is displayed, the data disk is mounted as an EXT4 file system.
- Unmount the data disk.
Comment out the five lines in the preceding figure in the /etc/fstab file and restart the VM.
- Restore the data disk.
If the disk is mounted as an EXT4 (or EXT3) file system, run the following commands to restore the disk:
fsck.ext4 -y -f /dev/oss_vg/srv_bigdata
fsck.ext4 -y -f /dev/oss_vg/opt_huawei_bigdata
fsck.ext4 -f -y /dev/oss_vg/var_log_bigdata
fsck.ext4 -f -y /dev/oss_vg/hadoop_bigdata
fsck.ext4 -f -y /dev/oss_vg/kafka_bigdata
If the disk is mounted as an XFS file system, run the following commands to restore the disk:
xfs_repair -L /dev/oss_vg/var_log_bigdata
xfs_repair -L /dev/oss_vg/opt_huawei_bigdata
xfs_repair -L /dev/oss_vg/srv_bigdata
xfs_repair -L /dev/oss_vg/hadoop_bigdata
xfs_repair -L /dev/oss_vg/kafka_bigdata
The recovery time depends on the disk size.
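The choice between the two repair tools can be sketched as a small helper that prints the matching command for each logical volume. `repair_cmd` is an illustrative name, and passing the filesystem type explicitly (for example, from `blkid -o value -s TYPE`) is an assumption, not part of the documented procedure; the helper only prints each command so you can review it before running it.

```shell
# Print the repair command matching a volume's filesystem type, following the
# ext4/XFS split above. repair_cmd is illustrative; it prints the command
# rather than executing it.
repair_cmd() {
    dev="$1"
    fstype="$2"   # e.g. from: blkid -o value -s TYPE "$dev"
    case "$fstype" in
        ext3|ext4) echo "fsck.$fstype -f -y $dev" ;;
        xfs)       echo "xfs_repair -L $dev" ;;
        *)         echo "unsupported filesystem '$fstype' on $dev" ;;
    esac
}

# Example: print the repair commands, assuming all volumes are ext4.
for lv in srv_bigdata opt_huawei_bigdata var_log_bigdata hadoop_bigdata kafka_bigdata; do
    repair_cmd "/dev/oss_vg/$lv" ext4
done
```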
- Re-mount the data disk.
Uncomment the five lines in the following figure in the /etc/fstab file and restart the VM.
- Check the statuses of the OMS and FusionInsight management plane.
Run the following commands on the faulty node:
cd /opt/huawei/Bigdata/omm-0.0.1/sbin/
sh status-oms.sh
If the OMS is restored to the normal state, no further action is required. If the OMS, FusionInsight management plane, and GaussDB of the standby FusionInsight management node are in the abnormal state (displayed as Exception), the FusionInsight data disk has bad sectors. In this case, perform the following operations.
- Stop all FusionInsight services.
Log in to FusionInsight Manager and choose Cluster > Stop to stop all services.
- Back up data.
Back up the data in /opt/huawei, /var/log/Bigdata, and /srv/BigData.
Go to each directory, compress its contents into a package, and copy the package to a directory with sufficient space. (You are advised to store the backup files on another server, because the data may be large.)
cd /opt/huawei
tar -cvf opt_huawei_bigdata.tar.gz *
cd /var/log/Bigdata
tar -cvf var_log_bigdata.tar.gz *
cd /srv/BigData
tar -cvf srv_bigdata.tar.gz *
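Before the damaged data disk is deleted, it is worth confirming that each backup package created above can be read back. The sketch below assumes the packages were copied to a directory named by `BACKUP_DIR`, which is an illustrative path.

```shell
# Verify that each backup package created above is readable before the data
# disk is deleted. BACKUP_DIR is an assumed location; adjust it to where you
# actually stored the packages.
BACKUP_DIR="${BACKUP_DIR:-/root/fi_backup}"
for f in opt_huawei_bigdata.tar.gz var_log_bigdata.tar.gz srv_bigdata.tar.gz; do
    if tar -tf "$BACKUP_DIR/$f" > /dev/null 2>&1; then
        echo "$f: readable"
    else
        echo "$f: NOT readable - re-create this backup before continuing"
    fi
done
```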
- Delete the data disk, log in to the faulty node, and comment out the information about the mounted BigData disk in the /etc/fstab file.
- Create a data disk.
The following describes how to create a data disk on FusionCompute.
- Create a data disk on FusionCompute.
- Unbind the original data disk from the FusionInsight cluster.
Log in to FusionCompute, click More, and then click Detach. The original data disk is unbound from the FusionInsight cluster.
- Delete the original data disk.
Choose Data Store > More > Safely Delete. The original data disk is deleted.
- Add a data disk.
Choose Virtual Hardware > Create and Attach Disk, set the basic information as that of the original data disk, and click Confirm. The data disk is added.
- Run the df -h command on the faulty node as the root user. The command output shows that the original disk partitions have been deleted and a new disk is added.
- Restore disk partitions.
- Obtain the comm_lib.sh and create_vol.sh scripts from the /var/software/install_FI_check.zip package in the EasySuite installation directory, copy the scripts to a directory (for example, /tmp) on the node to be restored, and run the fdisk -l command to check the name of the new disk.
- Run the sh create_vol.sh /dev/sdb command. If the following information is displayed, the disk is partitioned successfully.
- Restore data on the data disk.
- Log in to the server where the backup data is stored and run the following commands to transfer the backup packages to the corresponding directories of the node to be restored.
scp -r opt_huawei_bigdata.tar.gz root@FusionInsight IP address:/opt/huawei
scp -r srv_bigdata.tar.gz root@FusionInsight IP address:/srv/BigData
scp -r var_log_bigdata.tar.gz root@FusionInsight IP address:/var/log/Bigdata
- Log in to the faulty node and decompress the backup data package.
cd /opt/huawei
tar -xvf opt_huawei_bigdata.tar.gz
cd /srv/BigData
tar -xvf srv_bigdata.tar.gz
cd /var/log/Bigdata
tar -xvf var_log_bigdata.tar.gz
- Change user permissions on the /srv/BigData and /var/log/Bigdata directories.
chown omm:ficommon /srv/BigData
chown omm:ficommon /var/log/Bigdata/
chmod 770 /srv/BigData/
chmod o+t /srv/BigData/
chmod 770 /var/log/Bigdata/
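The permission changes can be wrapped in a helper that applies them and reports the resulting mode. `fix_perms` is an illustrative name, it applies the /srv/BigData variant (including the sticky bit), and it must be run as root on the node itself, where the omm user exists.

```shell
# Apply the ownership and mode changes from the step above to one directory
# and report the resulting mode. fix_perms is illustrative; chown is allowed
# to fail when run off-node, where the omm user does not exist.
fix_perms() {
    dir="$1"
    chown omm:ficommon "$dir" 2>/dev/null || true
    chmod 770 "$dir"
    chmod o+t "$dir"    # sticky bit, as set on /srv/BigData above
    stat -c '%a' "$dir" # expect 1770 (770 plus the sticky bit)
}
```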
- Log in to a FusionInsight management node and check the OMS status.
Run the cd /opt/huawei/Bigdata/om-0.0.1/sbin command and then the sh status-oms.sh command to check the OMS status. If the following information is displayed, the OMS is in the normal state.
- Start the FusionInsight cluster.
Log in to FusionInsight Manager and choose Cluster > Start to start the FusionInsight cluster.
The HBase Service Fails if the FusionInsight Server Is Powered off Unexpectedly
Symptom
After the FusionInsight server is powered off unexpectedly and then powered on, the alarm indicating an HBase service failure is displayed on FusionInsight Manager and cannot be automatically cleared, as shown in the following figure.
Possible Causes
HDFS files are damaged after the FusionInsight server is powered off unexpectedly.
Troubleshooting Procedure
- Log in to an iMaster NCE-Campus node and run the following commands to check HDFS files.
su ossuser
cd /opt/hadoopclient
source bigdata_env
hdfs fsck /
The command output shows the damaged HDFS files, as shown in the following figure.
- Run the following commands to clear bad blocks. The file paths specified in the commands are those displayed in the command output in the previous step.
hdfs dfs -mv /hbase/data/default/t_campus_performance_original/b87c82f8eea159ed6052b5701ecaacb8/f/25756da69916446898c398d72885735e /tmp/
hdfs dfs -mv /hbase/data/default/t_sdwan_performance_siteapp_netstream_original/3d8c30eb39afb49b77c5c8f172f4e69c/f/c08caac47ad54a339eddd3408daad17c /tmp/
- Run the following commands to restore HBase regions.
hbase shell
assign 'b87c82f8eea159ed6052b5701ecaacb8'
assign '3d8c30eb39afb49b77c5c8f172f4e69c'
Wait for about three minutes and then log in to FusionInsight Manager to check whether the HBase alarm is cleared. If the alarm is cleared, the HBase fault is rectified.
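When more than a couple of regions are affected, the assign commands can be generated instead of typed interactively. `assign_script` is an illustrative helper that emits hbase shell input for a list of region names; pipe its output into `hbase shell` on the node.

```shell
# Build the "assign" commands for a list of HBase regions, matching the step
# above. assign_script is an illustrative helper, not a FusionInsight tool.
assign_script() {
    for region in "$@"; do
        printf "assign '%s'\n" "$region"
    done
}

# The two regions from the procedure above:
assign_script b87c82f8eea159ed6052b5701ecaacb8 3d8c30eb39afb49b77c5c8f172f4e69c
# On the node: assign_script <region>... | hbase shell
```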
iMaster NCE-Campus Basic Operation Troubleshooting
Security Management Faults
This section describes how to rectify common security management faults.
The admin User Fails to Log In to the Service Plane
Symptom and Possible Causes
Table 3-133 describes the symptoms of admin user login failures on the service plane and their possible causes.
Symptom | Possible Causes | Troubleshooting Procedure
---|---|---
A message is displayed, indicating that the current user is not allowed to log in from the local host. | |
A message is displayed, showing that the number of online sessions of this user has reached the maximum. | The admin user cannot log in when the maximum number of online sessions of the admin user is reached. | Wait until other sessions of the admin user are logged out.
A message is displayed, showing that the login mode is the single-user mode. | The system administrator sets the system login mode to the single-user mode. | Wait until the system administrator completes the maintenance and switches the login mode to the multi-user mode.
A message is displayed, showing that the current IP address has been locked. | The number of consecutive login failures of the admin user due to incorrect password input reaches the maximum, and the IP address is locked. |
If the preceding measures are taken but the fault persists, contact the system administrator.
Troubleshooting Procedure
- Single-NIC client login
- Log in to the service plane as a security administrator.
- Choose from the main menu.
- In the navigation pane, choose Users.
- In the user list, click the admin user, and view the client IP address policy of the admin user on the Access Policies tab page.
- Use the IP address bound to the admin user to log in to the service plane as the admin user.
- Perform steps 2 to 4.
- Click Edit, and check whether any client IP address policies meet the requirements.
- If yes, select the client IP address policy and click OK.
- If no, click Create to create a client IP address policy, and click OK.
- Multi-NIC client login
- Check whether the admin user can log in to the system repeatedly.
- Log in to the service plane as a security administrator.
- Choose from the main menu.
- In the navigation pane, choose Users.
- In the user list, click the admin user, and view the client IP address policy of the admin user on the Access Policies tab page.
- Use the IP address bound to the admin user to log in to the service plane as the admin user.
- Perform steps 2 to 4.
- Click Edit, and add the IP addresses of all the NICs to the client IP address policy bound to the admin user.
Suggestions and Summary
- If you use a command tool to disable the admin user, the admin user may fail to log in. Therefore, exercise caution when using the command tool.
- Setting a client IP address policy for the admin user may cause the admin user whose IP address is not in the policy to forcibly log out. After logout, the admin user cannot log in again. Exercise caution when performing this operation.
- You are advised to set a proper number of online sessions of the admin user.
- If the maximum number of online sessions is set for the admin user and the number of online sessions reaches the value, the admin user fails to log in. Exercise caution when performing this operation.
- If the maximum number of online sessions set for the admin user does not meet requirements, you are advised to modify the policy in user management.
- Modify the maximum number of online sessions of the admin user.
- Set the login mode when the number of online sessions of the admin user reaches the maximum to Log out of the session.
- If the maximum number of online sessions set for the admin user does not need to be modified, forcibly log out an online session of the admin user in user management.
- If you want to maintain the system, switch the login mode to the single-user mode. Then, only the admin user can log in through one client and all the other users are forcibly logged out. You are advised to switch back to the multi-user mode immediately after finishing maintenance so that other users can use the system.
- If the IP address lockout policy does not meet the requirements, you are advised to modify it in the account policy.
A Non-admin User Fails to Log In to the Service Plane
Symptom and Possible Causes
Table 3-134 describes the symptoms of non-admin user login failures on the service plane and their possible causes.
Symptom | Possible Causes | Troubleshooting Procedure
---|---|---
A message is displayed, showing that the user is disabled. | |
A message is displayed, showing that the number of online sessions of this user has reached the maximum. | A user cannot log in when the maximum number of online sessions of the user is reached. |
A message is displayed, indicating that the current user is not allowed to log in at this time. | The current login time of the user is not within the range specified in the login time policy bound to the user. |
A message is displayed, indicating that the current user is not allowed to log in from the local host. | |
A message is displayed, indicating that the user is locked and will be automatically unlocked in XXX minutes. (XXX indicates the remaining lockout time.) | A user is locked because the number of consecutive login failures of the user due to incorrect password input reaches the maximum, and the account unlocking mode is set to automatic. |
A message is displayed, indicating that the user is locked and needs to contact the administrator for unlocking. | A user is locked because the number of consecutive login failures of the user due to incorrect password input reaches the maximum, and the account unlocking mode is set to manual. |
A message is displayed, showing that the current IP address has been locked. | The IP address is locked because the number of consecutive login failures of the user due to incorrect password input reaches the maximum. |
A message is displayed, showing that the login mode is the single-user mode. | The system administrator sets the system login mode to the single-user mode. | Wait until the system administrator completes the maintenance and switches the login mode to the multi-user mode.
A message is displayed, showing that the password has expired. When the user changes the password, the system displays a message indicating that the password cannot be changed. | The user password has expired, and the security administrator disables the user from changing the password. |
A message is displayed, prompting the user to enter the correct username and password. | |
If the preceding measures are taken but the fault persists, contact the system administrator.