FusionCloud 6.3.1.1 Troubleshooting Guide 02

Rectifying Database Faults

Rectifying Database Faults

GaussDB Database

Fault Locating and Troubleshooting Process

If a fault occurs, analyze the fault information before taking any action. You are advised to take corrective actions only after the fault cause is clear and a rectification method has been formulated, so that no new problems are introduced.

Obtaining Environment Information

The environment configuration information includes:

  • OS type and version
  • Database deployment solution (active and standby database servers; active, standby, cascaded standby, and sub-cascaded standby database servers; a single database instance; or multiple database instances)
  • GaussDB database version
  • FusionSphere OpenStack version
Fault Analysis

The fault locating process includes:

  1. Determine the fault cause, impact scope, and affected services after the fault occurs.
  2. Query logs, including database logs, OS logs, database control script logs, and active/standby arbitration component logs.
  3. Determine how frequently the fault occurs, for example, occasionally, periodically, or in other patterns.
  4. Prepare fault-related materials if you need off-site help.
Solution

Formulate a solution based on the fault causes. In addition, the solution must have been verified in advance, for example, by setting up an environment that reproduces the fault.

The solution must not only rectify the fault but also ensure that no new problems are introduced.

The solution should include both the troubleshooting procedure and the rollback procedure. If the fault persists or a new problem occurs after the solution is implemented, roll back the system to the status before the solution was implemented.

Troubleshooting

Perform troubleshooting based on the solution. During the troubleshooting, ensure that each operation is successfully performed, and verify and observe the implementation result.

Summary

After troubleshooting is complete, record the fault causes and update the relevant guide or solution to prevent similar faults, or summarize the troubleshooting as an FAQ or case for future reference.

Troubleshooting Database Status Exceptions

Information similar to the following is displayed in a database control script log:

server starting.... stopped waiting
gs_ctl: could not start server
Examine the log output: /opt/gaussdb/data/pg_log/gs_ctl-current.log or the log directory configured by parameter "log_directory" in postgresql.conf

The possible causes are as follows:

Host Disk Space Being Used Up
  1. Log in to the host where the faulty database resides. For details, see Logging In to a Host Running the Database Service. Choose 2 for the authentication mode.
  2. Run the df -h command to check whether a database partition is created.

    If information similar to the following is displayed, the database partition is created.

    A 100% disk usage will cause a database fault. To rectify the fault, expand the disk space first (by performing 2 to 20; a consolidated command sketch follows this procedure), and then contact technical support for assistance in locating the root cause.

Method for expanding the disk space:

  1. Log in to the host where the faulty database resides. For details, see Logging In to a Host Running the Database Service.
  2. Run the vgs command to query the available disk space of the host.

  3. Log in to the FusionSphere OpenStack web client, choose Configuration > Disk and check whether any disk capacity expansion task is in progress.

    • If yes, wait for the task to complete and then go to 4.
    • If no, go to 4.

  4. If the available disk space is sufficient, run the following commands to expand the database partition space:

    cps hostcfg-item-update --item logical-volume --lvname database --size lv_size --volumegroup vg_name --type storage hostcfg_name

    cps commit

    In the first command:

    • hostcfg_name indicates the created partition rule name. You can run the cps hostcfg-list --type storage command to query the storage name corresponding to the host ID. In the command output, the value in the name column is the storage name.

      Information similar to the following is displayed:

      +---------+-----------------+-----------------------------------------------------------+ 
      | type    | name            | hosts                                                     | 
      +---------+-----------------+-----------------------------------------------------------+ 
      | storage | default         | default:all                                               | 
      |         |                 |                                                           | 
      | storage | control_group_3 | hostid:139D2E64-1DD2-11B2-920A-000000821800               | 
      |         |                 |                                                           | 
      | storage | control_group_2 | hostid:13A3B054-1DD2-11B2-B425-000000821800               | 
      |         |                 |                                                           | 
      | storage | compute_group0  | hostid:13D9A9F2-1DD2-11B2-99C9-000000821800, 138DC820-1DD | 
      |         |                 | 2-11B2-BEC5-000000821800                                  | 
      |         |                 |                                                           | 
      | storage | control_group_1 | hostid:13934E58-1DD2-11B2-AAC6-000000821800               | 
      |         |                 |                                                           | 
      +---------+-----------------+-----------------------------------------------------------+
    • database indicates the name of the existing logical database partition on the host. You can run the cps hostcfg-show --type storage hostcfg_name command to query the name.
    • lv_size indicates the logical volume size which can be customized based on site requirements, such as 50 GB. The value must be smaller than the available disk space.
    • vg_name has a default value cpsVG. You can query it by running the vgs command.
    NOTE:

    Disk capacity expansion tasks in the FusionSphere OpenStack system are performed serially. Specifically, if multiple capacity expansion tasks are in progress, the system proceeds with the next task only after the current task is complete. Capacity expansion will take a long period of time if multiple tasks are waiting in the queue. Therefore, you are advised not to expand disk partitions concurrently.

  5. If databases are deployed in active/standby mode and different partition rules are applied to the active and standby database nodes, run the following command to query the other host ID and perform 4 to expand the disk capacity for this host:

    cps hostcfg-list --type storage |grep hostid

    hostid specifies the host ID of the disk to be expanded.

  6. On the host that you have logged in to, run the following commands to check the file that occupies the largest disk space.

    su - gaussdba

    cd $GAUSSDATA

    du -sh *

    Figure 7-6 Command output

    Check whether the base size is greater than 70% of the original disk space.

    • If yes, go to 7.
    • If no, contact technical support for assistance.
    NOTE:

    Repeat 6 to 20 to determine whether the fault is caused by an oversized token table. If yes, clear the token table to release disk space.

    Clearing the token table will block service operations on this table. However, if the table is not cleared, the disk space consumed by tokens is not released immediately.

  7. Run the following commands to enter the base directory and check the folder that occupies the largest disk size:

    cd base

    du -sh *

    Figure 7-7 Command output

    The 16390 folder occupies the largest disk space. 16390 is the oid value of a database.

  8. Run the following command to log in to the database:

    gsql postgres

    Enter the database username and password as prompted. By default, the password is FusionSphere123. If you have changed the password, enter the new one.

  9. Run the following command to query the database name:

    select datname from pg_database where oid = 16390;

    16390 is the oid value obtained in 7.

    Figure 7-8 Command output

    As shown in the command output, the Keystone database occupies the largest disk space.

  10. Run the following command to exit the database:

    \q

  11. Run the following command to switch to the root user:

    exit

  12. Run the following commands to log in to the Keystone database:

    su gaussdba

    gsql keystone

    Enter the database username and password as prompted. By default, the password is FusionSphere123. If you have changed the password, enter the new one.

  13. Run the following command to query the sizes of all objects:

    SELECT nspname || '.' || relname AS "relation",
           pg_size_pretty(pg_total_relation_size(C.oid)) AS "size"
    FROM pg_class C
    LEFT JOIN pg_namespace N ON (N.oid = C.relnamespace)
    WHERE nspname NOT IN ('PG_CATALOG', 'INFORMATION_SCHEMA', 'PG_TOAST')
    ORDER BY pg_total_relation_size(C.oid) DESC
    LIMIT 20;

    Figure 7-9 Command output

    As shown in the command output, the token table in the Keystone database occupies the largest disk space.

  14. Run the following command to query the size of the token table:

    select pg_total_relation_size('table_name')::bigint/1024/1024 MB;

    table_name indicates the name of the token table obtained in the previous step.

    Figure 7-10 Command output

  15. Run the following command to exit the database:

    \q

  16. Run the following command to switch to the root user:

    exit

  17. Run the following command to query the size of the available disk space:

    df -h

    Check whether the available disk space size for the database partition is greater than the space size obtained in 14.

    • If yes, go to 19.
    • If no, go to 18.

      For example, if the available disk space size for the database partition is 116 MB, which is less than 18615 MB obtained in 14, expand the disk space.

  18. Repeat 2 to 5 to expand the disk space.

    Ensure that the disk space size after expansion is greater than the size obtained in 14.

  19. Run the following commands to log in to the Keystone database:

    su gaussdba

    gsql keystone

    Enter the database username and password as prompted. By default, the password is FusionSphere123. If you have changed the password, enter the new one.

  20. Run the following command to clear the token table.

    vacuum full token

    NOTE:

    This operation will consume system I/O resources and block service operations for this table. Therefore, perform this operation in off-peak hours. Clearing a token table consuming 13 GB space takes about 10 minutes. The operation is complete if VACUUM is displayed in the command output.

  21. Run the following command to exit the database:

    \q

  22. Run the following command to switch to the root user:

    exit

  23. Run the following command to stop the GaussDB database:

    cps host-template-instance-operate --service gaussdb gaussdb --action stop

    This operation will stop the GaussDB database. Ensure that services are not delivered during the period.

  24. Run the following command repeatedly to query the status of the GaussDB database until the status of all GaussDB databases is fault:

    cps template-instance-list --service gaussdb gaussdb

    +------------+---------------+---------+--------------------------------------+-------------+
    | instanceid | componenttype | status  | runsonhost                           | omip        |
    +------------+---------------+---------+--------------------------------------+-------------+
    | 0          | gaussdb       | fault   | 42309170-0261-512F-DF77-0507589D4C3D | 128.6.35.15 |
    | 1          | gaussdb       | fault   | 4230A856-E289-AAB3-15F3-82B46C8EFDCC | 128.6.35.78 |
    +------------+---------------+---------+--------------------------------------+-------------+

  25. Run the following command to start the GaussDB database.

    cps host-template-instance-operate --service gaussdb gaussdb --action start

  26. Run the following command repeatedly to query the status of the GaussDB database until the statuses of GaussDB databases are active and standby:

    cps template-instance-list --service gaussdb gaussdb

    +------------+---------------+---------+--------------------------------------+-------------+
    | instanceid | componenttype | status  | runsonhost                           | omip        |
    +------------+---------------+---------+--------------------------------------+-------------+
    | 0          | gaussdb       | active  | 42309170-0261-512F-DF77-0507589D4C3D | 128.6.35.15 |
    | 1          | gaussdb       | standby | 4230A856-E289-AAB3-15F3-82B46C8EFDCC | 128.6.35.78 |
    +------------+---------------+---------+--------------------------------------+-------------+
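
For quick reference, the following is a minimal consolidated sketch of the key commands in 2 to 20. The values used here (control_group_1 as the hostcfg name, cpsVG as the volume group, and the 50G size format) are examples and assumptions; replace them with the values queried in your environment, and note that the vacuum blocks operations on the token table.

# Expand the database logical volume (example values; adjust to your environment).
vgs                                                      # confirm free space in the volume group
cps hostcfg-item-update --item logical-volume --lvname database --size 50G --volumegroup cpsVG --type storage control_group_1
cps commit

# Identify the largest space consumer as the database user.
su - gaussdba
cd $GAUSSDATA && du -sh *                                # the base directory is usually the largest
cd base && du -sh *                                      # the largest folder name is the database oid

# Reclaim token table space in the Keystone database.
gsql keystone                                            # enter the database password when prompted
# Inside gsql, run:  vacuum full token;  then \q to exit.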

Incorrect Permissions on Database Files
  1. Run the cd /opt/fusionplatform/data/gaussdb_data/data command to switch to the directory.
  2. Check whether there are non-database files in the directory.

    All files in the directory are displayed as follows. Delete all non-database files.

    -rw-------  1 gaussdba dbgrp   200 Jan 19 21:18 backup_label.old 
    drwx------ 18 gaussdba dbgrp  4096 Jan 19 21:18 base 
    -rw-------  1 gaussdba dbgrp     0 Jan 19 21:18 build_completed.done 
    -rw-------  1 gaussdba dbgrp  1448 Jan 19 21:18 ca.crt 
    -rw-------  1 gaussdba dbgrp   922 Jan 19 21:18 dblink.conf 
    -rw-------  1 gaussdba dbgrp    16 Jan 22 22:15 gaussdb.state 
    drwx------  2 gaussdba dbgrp  4096 Jan 22 22:49 global 
    -rw-------  1 gaussdba dbgrp     1 Jan 19 21:18 gs_build.pid 
    drwx------  2 gaussdba dbgrp  4096 Jan 23 00:00 pg_audit 
    drwx------  2 gaussdba dbgrp  4096 Jan 21 22:03 pg_blackbox 
    drwx------  2 gaussdba dbgrp  4096 Jan 22 01:16 pg_clog 
    drwx------  2 gaussdba dbgrp  4096 Jan 22 22:15 pg_confile_backup 
    -rw-------  1 gaussdba dbgrp     0 Jan 22 22:15 pg_ctl.lock 
    -rw-------  1 gaussdba dbgrp  4733 Jan 22 22:15 pg_hba.conf 
    -rw-------  1 gaussdba dbgrp  1024 Jan 19 21:18 pg_hba.conf.lock 
    -rw-------  1 gaussdba dbgrp  1636 Jan 19 21:18 pg_ident.conf 
    drwx------  2 gaussdba dbgrp  4096 Jan 19 21:18 pg_log 
    drwx------  4 gaussdba dbgrp  4096 Jan 19 21:18 pg_multixact 
    drwx------  2 gaussdba dbgrp  4096 Jan 21 22:03 pg_notify 
    drwx------  2 gaussdba dbgrp  4096 Jan 19 21:18 pg_serial 
    drwx------  2 gaussdba dbgrp  4096 Jan 19 21:18 pg_snapshots 
    drwx------  2 gaussdba dbgrp  4096 Jan 23 00:01 pg_stat_tmp 
    drwx------  2 gaussdba dbgrp  4096 Jan 22 20:49 pg_subtrans 
    drwx------  2 gaussdba dbgrp  4096 Jan 19 21:18 pg_tblspc 
    drwx------  2 gaussdba dbgrp  4096 Jan 19 21:18 pg_twophase 
    -rw-------  1 gaussdba dbgrp     4 Jan 19 21:18 PG_VERSION 
    drwx------  2 gaussdba dbgrp  4096 Jan 19 21:18 pg_wallet 
    drwx------  3 gaussdba dbgrp  4096 Jan 22 23:45 pg_xlog 
    -rw-------  1 gaussdba dbgrp 27486 Jan 22 22:15 postgresql.conf 
    -rw-------  1 gaussdba dbgrp  1024 Jan 17 01:04 postgresql.conf.lock 
    -rw-------  1 gaussdba dbgrp    84 Jan 22 22:15 postmaster.opts 
    -rw-------  1 gaussdba dbgrp    97 Jan 22 22:15 postmaster.pid 
    -rw-------  1 gaussdba dbgrp  1241 Jan 17 01:04 server.crt 
    -rw-------  1 gaussdba dbgrp  1766 Jan 17 01:04 server.key 
    -r--------  1 gaussdba dbgrp  1064 Jan 17 01:04 server.key.cipher 
    -r--------  1 gaussdba dbgrp    24 Jan 17 01:04 server.key.rand

  3. Check whether the owner and group of all files are gaussdba:dbgrp.

    If not, run the following command to correct the file ownership:

    chown gaussdba:dbgrp filename
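
As a quick check, the following sketch lists files in the data directory whose owner or group differs from gaussdba:dbgrp and then corrects an individual file; the find invocation is an illustrative addition, not part of the original procedure.

cd /opt/fusionplatform/data/gaussdb_data/data
# List files whose owner is not gaussdba or whose group is not dbgrp.
find . \( ! -user gaussdba -o ! -group dbgrp \) -ls
# Correct the ownership of a single file.
chown gaussdba:dbgrp filename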

Insufficient System Resources

Check whether information similar to the following is displayed in the log on the database server.

gaussmaster 8620  FATAL:  could not create shared memory segment: Invalid argument
gaussmaster 8620  DETAIL:  Failed system call was shmget(key=5432016, size=1206730752, 03600).
gaussmaster 8620  HINT:  
The PostgreSQL documentation contains more information about shared memory configuration.

Or

gaussmaster 20059  FATAL:  could not create semaphores: No space left on device
The PostgreSQL documentation contains more information about configuring your system for PostgreSQL.

If yes, check the CPU usage, system memory, and system I/O status. Any exception in these resources may cause slow GaussDB responses or even a database fault.
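
The shared memory and semaphore errors shown above typically mean that the kernel limits are lower than what the database requests (the shmget size in the log). A minimal sketch of the checks, assuming standard Linux tools:

# Kernel limits for shared memory and semaphores.
sysctl kernel.shmmax kernel.shmall kernel.sem
# Shared memory segments and semaphore arrays currently in use.
ipcs -m
ipcs -s
# CPU, memory, and I/O consumption snapshots.
top -b -n 1 | head -20
iostat -x 1 3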

Fault processing method:

  1. Log in to the faulty host and run the top and iostat commands to query the consumption of the shared memory, CPU usage, and system I/O status.

  2. Query the messages log. For details, see Obtaining Logs as Required. Check whether the GaussDB process is stopped due to resource isolation. That is, check whether the following information is contained in the log:

    Memory cgroup out of memory: Kill *****gaussdb sacrifice child

  3. Query the UVP log. For details, see Obtaining Logs as Required. Check whether the resources are abnormal when a fault occurs.
  4. Contact technical support for assistance in locating the fault based on the information collected in the preceding steps.
Active Server Is Powered Off When a Standby Server Is Being Rebuilt

The status of the active and standby servers where the databases reside is abnormal, and the active server is powered off or stopped. The database control script log on the standby server shows that the standby server is required to switch to the active role, but a message indicating that the build operation is not complete is displayed, and the standby server cannot be started as the active server.

Log in to the standby server and run the following commands:

su gaussdba

gs_ctl querybuild

Information similar to the following is displayed. The value of DETAIL_INFORMATION is BuildFailed, indicating that the standby server is faulty and needs to be rebuilt, but the rebuild has also failed.

 Ha state:
 LOCAL_ROLE                     : Unknown
 STATIC_CONNECTIONS             : 2
 DB_STATE                       : NeedRepair
 DETAIL_INFORMATION             : BuildFailed

Fault processing method:

  1. Power on the active server.

    If the active server is abnormal, restore it based on other fault processing methods in this document.

  2. After the standby server is rebuilt, run the gs_ctl querybuild command as the gaussdba user to query the service status.

    The service on the standby server is normal if the value of DB_STATE is normal.
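
The following sketch, run as the gaussdba user on the standby server, simply repeats gs_ctl querybuild until DB_STATE leaves the faulty state; the 30-second interval and the string match are assumptions based on the output shown above.

# Run as the gaussdba user (su - gaussdba).
while true; do
    state=$(gs_ctl querybuild | awk -F':' '/DB_STATE/ {gsub(/ /, "", $2); print $2}')
    echo "DB_STATE: ${state}"
    case "${state}" in
        [Nn]ormal) break ;;                 # the service on the standby server is normal
    esac
    sleep 30
done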

A Database Component Fault Occurs When the Host Is Forcibly Powered Off and then Powered On
  1. Log in to the host where the faulty database resides. For details, see Logging In to a Host Running the Database Service.
  2. Run the following command to open the server log of the faulty database in the vi editor:

    vi /var/log/fusionsphere/component/gaussdb/gaussdb.log

    If information similar to the following is displayed, such a fault occurs.

    [2015-11-10 08:48:38.171 CST]  startup 14076  LOG:  open dataxlogchk file success "global/pg_dataxlogchk"  
    [2015-11-10 08:48:38.175 CST]  startup 14076  LOG:  at least one crc passed, minRecoveryPoint=[75/821CC000] 
    [2015-11-10 08:48:38.175 CST]  startup 14076  FATAL:  end of error record LSN [75/82000108] LE minRecoveryPoint [75/821CC000] 
    [2015-11-10 08:48:38.198 CST]  gaussmaster 14066  LOG:  startup process (PID 14076) exited with exit code 1 
    [2015-11-10 08:48:38.198 CST]  gaussmaster 14066  LOG:  aborting startup due to startup process failure 
    [2015-11-10 08:48:38.268 CST]  gaussmaster 14066  LOG:  StreamDoUnlink sockpath =/tmp/.s.PGSQL.2345 
    [2015-11-10 08:48:38.268 CST]  gaussmaster 14066  LOG:  UnlinkLockFile socket_lock_file =/tmp/.s.PGSQL.2345.lock     

  3. Run the following command to obtain the path where the GaussDB database is mounted:

    df -h

    Information similar to the following is displayed.

  4. Run the following command to switch to the path where the GaussDB database is mounted:

    cd /opt/fusionplatform/data/gaussdb_data

  5. Run the following command to move the check file out of the data directory.

    The file will be created again when the database is running.

    mv data/global/pg_dataxlogchk /home/fsp

  6. After one minute, run the following command to check the GaussDB status:

    cps template-instance-list --service gaussdb gaussdb

    Set gaussdb based on the actual database deployment. If Nova and Neutron use their own databases, set gaussdb to gaussdb_neutron or gaussdb_nova.

    If status is active or standby in the command output, the database is running properly.

    +------------+---------------+---------+--------------------------------------+--------------+ 
    | instanceid | componenttype | status  | runsonhost                           | omip         | 
    +------------+---------------+---------+--------------------------------------+--------------+ 
    | 0          | gaussdb       | active  | 200BC798-E58C-0000-1000-1DD200001220 | 128.6.35.105 | 
    | 1          | gaussdb       | standby | CCCC8177-46D4-0000-1000-1DD200009B80 | 128.6.35.37  | 
    +------------+---------------+---------+--------------------------------------+--------------+

    If the database fault persists, contact technical support for assistance.
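
A compact sketch of 3 to 6 under the default mount point shown above; the check file is moved aside rather than deleted so that it can be restored if necessary.

df -h | grep gaussdb_data                                # confirm the GaussDB mount point
cd /opt/fusionplatform/data/gaussdb_data
mv data/global/pg_dataxlogchk /home/fsp                  # the file is re-created when the database runs
sleep 60
cps template-instance-list --service gaussdb gaussdb     # expect active and standby states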

Troubleshooting Data Component Access Failures
IP Listening
  1. Run the following command to query the hosts accommodating database services:

    cps template-instance-list --service gaussdb gaussdb

    In the command output, the host whose status is active accommodates the active database service, and the host whose status is standby accommodates the standby database service.

  2. Log in to the hosts accommodating the databases, and run the following command to query the IP addresses:

    ip addr show | grep gau

    The host accommodating the active database service has two IP addresses assigned. One is used to provide services, and the other is used to synchronize data between the active and standby databases.

    The host accommodating the standby database service has one IP address assigned, and the IP address is used to synchronize data between the active and standby databases.

  3. Check whether the host accommodating the active database listens to the floating IP address used to provide services.

    Run the comp-ip-get all command to obtain the floating IP addresses of all databases.

  4. Run the following command to check whether the floating IP addresses are bound to the databases:

    netstat -anp | grep Floating IP address

    If yes, IP listening is normal. Check other possible causes.

    If no, contact technical support to check whether the database IP addresses are used by unauthorized users.
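
A short sketch of the checks in 2 to 4 on the active database host; FLOATING_IP is a placeholder for one of the addresses returned by comp-ip-get all.

ip addr show | grep gau                     # two addresses expected on the active host, one on the standby host
comp-ip-get all                             # floating IP addresses of all databases
FLOATING_IP=<floating IP address>           # placeholder: substitute a value from the previous command
netstat -anp | grep "${FLOATING_IP}"        # the active database should be listening on this address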

Number of Database Connections Exceeding the Threshold

Check whether an alarm indicating that the number of database connections exceeds the threshold is generated.

If yes, clear the alarm by performing the operations provided in "ALM-73405 Number of GaussDB Connections Exceeds the Threshold" in the FusionCloud 6.3.1.1 Alarm and Event Reference.
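
Before following the alarm handling procedure, you can confirm the current connection count directly in the database. The query below uses the standard pg_stat_activity view; passing a statement to gsql on the command line is an assumption modeled on psql usage.

su - gaussdba
gsql postgres -c "select count(*) from pg_stat_activity;"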

Data Inconsistency Between Active and Standby Databases

Check whether an alarm indicating that data is inconsistent between active and standby databases is generated.

If yes, clear the alarm by performing operations provided in "ALM-73403 Data Inconsistency Between Active and Standby GaussDB Databases" in the FusionCloud 6.3.1.1 Alarm and Event Reference.

Troubleshooting Host OS Failures of a Single Database Node
Failure of Active or Standby Database Host

Log in to the FusionSphere OpenStack web client and check whether one of the hosts accommodating the active and standby GaussDB services, that is, the host with the database_nova, database_neutron, database_cinder, or database_keystone role assigned, is faulty.

If yes, rectify the fault by referring to Host OS Failure.

Failure of Active and Standby Database Hosts

Log in to the FusionSphere OpenStack web client and check whether both the hosts accommodating the active and standby GaussDB services, that is, the hosts with the same database role assigned, such as database_nova, database_neutron, database_cinder, or database_keystone, are faulty.

If yes, rectify the fault by referring to Troubleshooting Host OS Failures of Active and Standby Database Nodes.

Troubleshooting Failures in Service Switchover from the Active Database Node to the Standby Database Node
Symptom

The active GaussDB node is faulty and is not running. In this case, the standby GaussDB node needs to become the active GaussDB node. However, the standby GaussDB node fails to synchronize data from the active GaussDB node due to a network fault or power-off, and cannot take over services from the active GaussDB node. You can run the cps template-instance-list --service gaussdb gaussdb command to check whether the takeover is successful.

Fault Locating

Log in to the standby GaussDB node to query the log file /var/log/fusionsphere/component/gaussdbControl/gaussdbControl_error.log. If the log file contains NotAllowActiveException: time diff too big or NotAllowActiveException: ha_status is not normal, the standby GaussDB node cannot become the active GaussDB node due to a data synchronization error.
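
A quick way to confirm this signature on the standby GaussDB node, using the log path and messages quoted above:

grep -E "NotAllowActiveException: (time diff too big|ha_status is not normal)" /var/log/fusionsphere/component/gaussdbControl/gaussdbControl_error.log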

Procedure
  1. Use PuTTY to log in to the first host in the AZ through the IP address of the External OM plane.

    The username is fsp and the default password is Huawei@CLOUD8.
    NOTE:
    • The system supports login authentication using a password or a private-public key pair. If a private-public key pair is used for login authentication, see Using PuTTY to Log In to a Node in Key Pair Authentication Mode.
    • For details about the IP address of the External OM plane, see the LLD generated by FCD sheet of the xxx_export_all.xlsm file exported from FusionCloud Deploy during software installation, and search for the IP addresses corresponding to VMs and nodes. The parameter names in different scenarios are as follows:
      • Cascading layer in the Region Type I scenario: Cascading-ExternalOM-Reverse-Proxy. Cascaded layer: Cascaded-ExternalOM-Reverse-Proxy.
      • Region Type II and Type III scenarios: ExternalOM-Reverse-Proxy.

  2. Run the following command to switch to the root user, and enter the root password as prompted:

    su - root

    The default password of the root user is Huawei@CLOUD8!.

  3. Run the TMOUT=0 command to disable user logout upon system timeout.
  4. Import environment variables. For details, see Importing Environment Variables.
  5. Run the following commands to set ha_strategy to normal (that is, the system does not check whether the standby GaussDB node successfully synchronizes data from the active GaussDB node):

    cps template-params-update --service gaussdb gaussdb --parameter ha_strategy=normal

    cps commit

  6. After 1 to 3 minutes, run the following command to check whether the standby GaussDB node becomes the active GaussDB node:

    cps template-instance-list --service gaussdb gaussdb

    Information similar to the following is displayed:

    564D28B3-FCD4-EE16-01F5-0FC0B8A9AD69:/var/log/fusionsphere/component/gaussdbControl # cps template-instance-list --service gaussdb gaussdb 
    +------------+---------------+---------+--------------------------------------+------------+ 
    | instanceid | componenttype | status  | runsonhost                           | omip       | 
    +------------+---------------+---------+--------------------------------------+------------+ 
    | 0          | gaussdb       | standby | 564D34D7-0F0A-F8FE-70F2-BA0AA0FB7E41 | 10.9.1.96  | 
    | 1          | gaussdb       | active  | 564D28B3-FCD4-EE16-01F5-0FC0B8A9AD69 | 10.9.2.169 | 
    +------------+---------------+---------+--------------------------------------+------------+

    If status of the standby GaussDB node changes to active in the command output, the standby GaussDB node becomes the active GaussDB node. Otherwise, contact technical support for assistance.

  7. Determine whether to restore backup data according to the duration when data synchronization from the active GaussDB node to the standby GaussDB node is abnormal.

    If the duration exceeds one day, you are advised to restore the backup data. For details, see the FusionCloud 6.3.1.1 Backup and Restoration Guide.

  8. After the active and standby GaussDB nodes are both restored, run the following commands to set ha_strategy to advance:

    cps template-params-update --service gaussdb gaussdb --parameter ha_strategy=advance

    cps commit
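
A minimal sketch combining 5, 6, and 8: relax ha_strategy, poll the instance list until a node reports active, and then restore the strategy once both nodes are healthy. The 30-second interval is arbitrary.

cps template-params-update --service gaussdb gaussdb --parameter ha_strategy=normal
cps commit
# Poll until the former standby node reports active.
until cps template-instance-list --service gaussdb gaussdb | grep -q "| active "; do
    sleep 30
done
cps template-instance-list --service gaussdb gaussdb
# After both GaussDB nodes are restored, revert the strategy:
# cps template-params-update --service gaussdb gaussdb --parameter ha_strategy=advance
# cps commit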

Troubleshooting Host OS Failures of Active and Standby Database Nodes
Symptom

The dedicated hosts running the active and standby database services become faulty at the same time. VMs cannot be created, Keystone authentication fails, and the databases are inaccessible.

Possible Causes
  • The hardware of the GaussDB hosts is faulty.
  • The OSs of the GaussDB hosts are faulty.
Procedure
  1. If the hardware, including a host or a hard disk, is faulty, replace the faulty host, or power off the target host and replace the faulty hard disk.

    The new hardware must have the same specifications and model as the faulty hardware. For details, see "Replacing Hosts and Accessories" in the FusionCloud 6.3.1.1 Parts Replacement.

    NOTE:

    Some RAID controller cards, for example, LSI 1078, LSI 2208, and LSI 3108, require that even a single hard disk be added to RAID 0. Otherwise, the hard disk cannot be identified by the host OS.

    Therefore, after you replace the hard disk on a host using such a RAID controller card, first configure the RAID array for the new hard disk.

    For details about RAID controller card requirements and RAID array configurations, see the product documentation of the corresponding hardware.

    If you are unclear about the RAID controller card requirements, create RAID 0 for each independent hard disk on the host.

  2. Use PuTTY to log in to any host in the AZ.

    Ensure that the IP address of the External OM network and username fsp are used to establish the connection.

    The default password of the fsp user is Huawei@CLOUD8.

    NOTE:
    • The system supports login authentication using a password or a private-public key pair. If a private-public key pair is used for login authentication, see Using PuTTY to Log In to a Node in Key Pair Authentication Mode.
    • For details about the IP address of the External OM plane, see the LLD generated by FCD sheet of the xxx_export_all.xlsm file exported from FusionCloud Deploy during software installation, and search for the IP addresses corresponding to VMs and nodes. The parameter names in different scenarios are as follows:
      • Cascading layer in the Region Type I scenario: Cascading-ExternalOM-Reverse-Proxy. Cascaded layer: Cascaded-ExternalOM-Reverse-Proxy.
      • Region Type II and Type III scenarios: ExternalOM-Reverse-Proxy.

  3. Run the following command and enter the password of user root to switch to user root:

    su - root

    The default password of the root user is Huawei@CLOUD8!.

  4. Run the following command to disable user logout upon timeout:

    TMOUT=0

  5. Import environment variables. For details, see Importing Environment Variables.
  6. Perform either of the following operations as required.

    • If the host or the hard disk is faulty, go to 7.
    • If the host OS is faulty, go to 8.

  7. Run the following command to query the active/standby status of the database component:

    cps template-instance-list --service gaussdb_service gaussdb_template

    gaussdb_service and gaussdb_template specify the service and template of the faulty database component, respectively.

    Query the status repeatedly until the result shows that the database component is running properly. Then go to 9.

  8. Restore two faulty database hosts in sequence by referring to Troubleshooting Host OS Failures of a Single Database Node.
  9. Run the following command to query the active node of the backup service:

    cps template-instance-list --service backup backup-server

    In the command output, the node in the active state is the active node of the backup service. Take note of its host ID.

  10. Run the following command to query the management IP address of the active node of the backup service:

    cps host-list | grep Active host ID of the backup service

  11. Obtain available backup packages.

    The system automatically backs up data every day and stores the backup packages in the /opt/backup/backupfile directory on the active backup-server host. If a third-party backup server is configured, the backup packages are also stored on this server in a directory that is specified during the backup policy configuration.

    GaussDB backup packages are named in the format of gaussdb-YYYY-MM-DD-HH-MM-SS-COUNT-TYPE.tar.gz.

    In database sharding scenarios, the GaussDB backup packages are subdivided into gaussdb_neutron, gaussdb_nova, gaussdb_keystone, and gaussdb_cinder packages, named in the format of gaussdb_neutron-YYYY-MM-DD-HH-MM-SS-COUNT-TYPE.tar.gz.

    • COUNT specifies the sequence number of the backup package.
    • TYPE specifies the backup package type, which can be single or all.
      • single indicates that the package contains data backup for a single service.
      • all indicates that the package contains data backup for all the services.
    1. Use PuTTY to log in to the active backup service node using its management IP address as the fsp user and then run the su - root command to switch to the root user.

      The default password of the fsp user is Huawei@CLOUD8.

      The default password of the root user is Huawei@CLOUD8!.

      NOTE:

      In the Region Type I scenario, the system supports the login authentication using a password or private-public key pair. If you use a private-public key pair to authenticate the login, see Using PuTTY to Log In to a Node in Key Pair Authentication Mode.

    2. Copy the backup package stored in the /opt/backup/backupfile directory to the local PC.

  12. Upload the obtained GaussDB backup package to the active node of the backup service.

    Use the winSCP tool to log in to the active node of the backup service as the root user through the management IP address of the node, and upload the backup package to the /opt/backup directory on the node.

  13. Log in to the active node of the backup service as the root user and run the following command to change the owner of the backup package to the backup user:

    chown backup:cps /opt/backup/Backup package

  14. Run the following command to manually restore the GaussDB service:

    restore execute --file Backup package name --path Directory for saving backup packages --service gaussdb

    In database sharding scenarios, you also need to specify the component of the GaussDB backup packages in the command.

    restore execute --file Backup package name --path Directory for saving backup packages --service gaussdb_Component

    In this command:

    • Backup package name specifies the name of the backup package, for example, gaussdb-2015-10-17-14-26-11-87-all.tar.gz.

      In database sharding scenarios, backup packages are named like gaussdb_neutron-2015-10-17-14-26-11-87-all.tar.gz.

    • Directory for saving backup packages specifies the directory for saving backup packages, for example, /opt/backup.

    You can run the restore progress-get --service gaussdb command to query the restoration progress. (gaussdb specifies the GaussDB name of a service in database sharding scenarios, for example, gaussdb_neutron.)

    • If the result is SUCCESS, the restoration is complete. Go to 15.
    • If the result is FAILD, run the restoration command after 15 minutes. If the fault persists, contact technical support for assistance.
    • If the result is DOING, the GaussDB service is being restored. Query the progress 5 minutes later.

      Example:

      If the GaussDB database is not shared, run the following command to restore the GaussDB service:

      restore execute --file gaussdb-2015-10-17-14-26-11-87-all.tar.gz --path /home/fsp --service gaussdb

      If the GaussDB database is shared, run the following command to restore the GaussDB service of the Neutron service:

      restore execute --file gaussdb_neutron-2015-10-17-14-26-11-87-all.tar.gz --path /home/fsp --service gaussdb_neutron

    NOTE:

    If you use Keystone authentication during GaussDB restoration, the progress query commands will be unavailable for a short period of time. After the GaussDB service is restored, the commands will become available again.

  15. Run the following command to delete the backup package uploaded in 12:

    rm /opt/backup/Name of the backup package
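
For reference, the following consolidated sketch covers 12 to 15 on the active backup node. The package name is the example used in this document, the path /opt/backup matches the upload directory in 12, and the polling loop around restore progress-get is an added convenience; in database sharding scenarios, append the component name (for example, gaussdb_neutron) to the --service value.

PKG=gaussdb-2015-10-17-14-26-11-87-all.tar.gz            # example package name from this document
chown backup:cps /opt/backup/${PKG}
restore execute --file ${PKG} --path /opt/backup --service gaussdb
# Poll until the restoration completes (SUCCESS) or fails (FAILD).
while true; do
    result=$(restore progress-get --service gaussdb)
    echo "${result}"
    echo "${result}" | grep -Eq "SUCCESS|FAILD" && break
    sleep 300
done
rm /opt/backup/${PKG}                                    # remove the uploaded package after the restoration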

MongoDB Database

How Do I Resolve Database Status Exceptions?

Before you perform the steps provided in this section to restore the MongoDB service, you can enable the automatic recovery function by performing the steps provided in How Do I Enable Automatic MongoDB Recovery?. Before you enable this function, read through the scenario and prerequisites.

Symptom

An exception in the MongoDB service is detected on FusionSphere OpenStack.

The symptom is as follows:

The MongoDB service is in fault state. For details about how to check the status of the MongoDB service, see Query the MongoDB service status.

Possible Causes

The MongoDB process exits, or MongoDB data is inconsistent due to unexpected host power outages or other system faults. Alternatively, the data becomes inconsistent due to network or storage interruptions.

Handling Method

If FusionSphere OpenStack is deployed in cluster mode, you can clear the data on the error node and then restart the MongoDB service to automatically synchronize the data from another node.

Handling Suggestion

Query the MongoDB service status.

  1. Log in to the faulty node. For details, see Logging In to a Host Running the Database Service.
  2. Run the following command to query the MongoDB service status:

    cps template-instance-list --service mongodb mongodb

    Information similar to the following is displayed:

Use the self-copy function of the cluster to repair the database.
NOTE:
  • This method is available only when FusionSphere OpenStack is deployed with three controller nodes in cluster mode.
  • The self-copy function will increase I/O load on functional nodes during the restoration.
  1. Log in to the faulty node.
  2. Import environment variables. For details, see Importing Environment Variables.
  3. Check the fault status.
    • If the number of MongoDB nodes in the fault state is less than 3, go to 4.
    • If all three MongoDB nodes are in the fault state and only alarm data needs to be retained, go to 5.
    • If all three MongoDB nodes are in the fault state and both monitoring data and alarm data need to be retained, you are advised to contact technical support for assistance.
  4. Log in to the host where the MongoDB is faulty and perform the following operations:
    1. Run the following command to restore the MongoDB service:

      python /etc/mongodb/mongodb_recovery.py

      Information similar to the following is displayed:

      please choose the following action which you want to do: 
      (1)query mongodb inner status 
      (2)mongodb recovery 
      (q)quit mongo recovery tool 
       please choose:[1|2|q]     
    2. Enter 2 and wait until the restoration is complete.
    3. After the restoration is complete, perform 6 to verify the restoration.
  5. Log in to the host where the MongoDB is faulty and perform the following operations:
    1. Run the following command to stop the MongoDB cluster:

      cps host-template-instance-operate --action stop --service mongodb mongodb

    2. Run the following command on each MongoDB host to delete the data:

      rm -r /var/ceilometer/*

    3. Run the following command on each MongoDB host to delete the MongoDB configuration file:

      rm /etc/mongodb/run_info.json

      If a message is displayed indicating that /etc/mongodb/run_info.json does not exist, go to the next substep.

    4. Run the following command to start the MongoDB cluster:

      cps host-template-instance-operate --action start --service mongodb mongodb

    5. Run the following command to check the cluster status:

      cps template-instance-list --service mongodb mongodb

      Wait until the cluster state changes to active, and then go to the next step.

    6. Run the following commands to restore the alarm data:

      backup package-get --service mongodb

      restore execute --service mongodb --file mongodb-2016-01-20-04-53-12-1-single.tar.gz

      restore progress-get --service mongodb

      In the preceding command, mongodb-2016-01-20-04-53-12-1-single.tar.gz is an example of alarm backup data. In actual operations, you need to obtain the latest alarm backup data from the /opt/backup/backupfile/mongodb directory on the host where MongoDB is located.

  6. Verify that the MongoDB is restored.
    NOTE:
    • Normal MongoDB states are PRIMARY, SECONDARY, SECONDARY:noprimary, and ARBITER.
    • Abnormal MongoDB states are UNKNOWN, DOWN, REMOVED, ROLLBACK, FATAL, and others.
    1. Run the following command to query the MongoDB status of each node:

      python /etc/mongodb/mongodb_recovery.py

      Information similar to the following is displayed:
       please choose the following action which you want to do: 
      (1)query mongodb inner status 
      (2)mongodb recovery 
      (q)quit mongo recovery tool 
       please choose:[1|2|q]     
    2. Enter 1.

      Information similar to the following is displayed:

      If each node is in the normal state and the optimeDate values are consistent, the restoration is complete.

      • If all three MongoDB nodes were in the fault state and were restored by performing 5, and Service OM has been connected, you need to reconnect the system to Service OM to perform alarm configurations and to ensure that alarms can be properly reported to Service OM. For details, see "Alarm Interconnection" in the FusionCloud 6.3.1.1 O&M Guide.
      • After alarm interconnection, the alarm differences are updated to the latest state within 30 minutes.

      If the MongoDB service is not restored, contact technical support for assistance.
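
A condensed sketch of the full-cluster path in 5 (all three MongoDB nodes faulty, only alarm data retained). The backup file name is the example from this document; in practice, use the newest package in /opt/backup/backupfile/mongodb, and run the data deletion on every MongoDB host.

cps host-template-instance-operate --action stop --service mongodb mongodb
# On each MongoDB host:
rm -r /var/ceilometer/*
rm -f /etc/mongodb/run_info.json                          # ignore the error if the file does not exist
# Back on the operating node:
cps host-template-instance-operate --action start --service mongodb mongodb
cps template-instance-list --service mongodb mongodb      # wait until the state is active
backup package-get --service mongodb
restore execute --service mongodb --file mongodb-2016-01-20-04-53-12-1-single.tar.gz
restore progress-get --service mongodb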

VM HA Failure Caused by a MongoDB Exception
Symptom

HA fails to trigger on faulty VMs when the following conditions are met:

  • The MongoDB data partition uses remote storage.
  • The MongoDB service is unavailable.
  • The HA function is unavailable.
  • The Nova service fails to restart the VMs on its host.
Possible Causes
  • The MongoDB service malfunctions and data cannot be read or written in MongoDB.
  • When the Nova service restarts the VMs on the host, the storage device is not ready.
Handling Method
  1. Restore the MongoDB service.
  2. Manually restore the VM.
Procedure

Restore the MongoDB service.

  1. Ensure that the storage device is ready and FusionSphere OpenStack hosts are running properly.
  2. Use PuTTY to log in to the first host of FusionSphere OpenStack through the IP address of the External OM plane.

    The username is fsp and the default password is Huawei@CLOUD8.
    NOTE:
    • The system supports login authentication using a password or a private-public key pair. If a private-public key pair is used for login authentication, see Using PuTTY to Log In to a Node in Key Pair Authentication Mode.
    • For details about the IP address of the External OM plane, see the LLD generated by FCD sheet of the xxx_export_all.xlsm file exported from FusionCloud Deploy during software installation, and search for the IP addresses corresponding to VMs and nodes. The parameter names in different scenarios are as follows:
      • Cascading layer in the Region Type I scenario: Cascading-ExternalOM-Reverse-Proxy. Cascaded layer: Cascaded-ExternalOM-Reverse-Proxy.
      • Region Type II and Type III scenarios: ExternalOM-Reverse-Proxy.

  3. Run the following command to switch to the root user, and enter the root password as prompted:

    su - root

    The default password of the root user is Huawei@CLOUD8!.

  4. Run the TMOUT=0 command to disable user logout upon system timeout.
  5. Import environment variables. For details, see Importing Environment Variables.
  6. Restore the MongoDB service. For details, see How Do I Resolve Database Status Exceptions?.
  7. On the host you have logged in to, run the following command to check whether the MongoDB service is successfully restored:

    cps template-instance-list --service mongodb mongodb

    If the MongoDB service status changes to active on the faulty node, the MongoDB service is successfully restored.

    • If yes, no further action is required. The VMs will be automatically restarted.
    • If no, manually restore the VMs as described below.

Manually restore the VM.

  1. On the host you have logged in to, run the following command to stop a VM to be restored:

    nova stop 4d6ac51c-81d1-4f6c-8063-cc949b57b255

    4d6ac51c-81d1-4f6c-8063-cc949b57b255 is an example of the ID of the VM to be restored. Set the VM ID based on the site requirements.

  2. Run the following command to start the VM:

    nova start 4d6ac51c-81d1-4f6c-8063-cc949b57b255

  3. Repeat 1 to 2 to restore all the other VMs to be restored.
  4. Check whether all VMs are restored.

    • If yes, all VMs are restored. No further action is required.
    • If no, contact technical support for assistance.
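
When several VMs need a manual restart, the two nova commands can be wrapped in a loop. The VM ID below is the example from this document; list the IDs of all VMs to be restored, and allow each VM to reach the SHUTOFF state before starting it (the sleep is a placeholder for that check).

VM_IDS="4d6ac51c-81d1-4f6c-8063-cc949b57b255"             # space-separated IDs of the VMs to be restored
for vm in ${VM_IDS}; do
    nova stop "${vm}"
    sleep 30                                              # placeholder: wait until the VM is SHUTOFF
    nova start "${vm}"
done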
