HUAWEI CLOUD Stack 6.5.0 Alarm and Event Reference 04

Product Monitoring

ALM-151 The CPU Usage Is High

Description

The management plane continuously samples the CPU usage of the server. This alarm is generated when every CPU usage sample in a sampling period is greater than or equal to the alarm generation threshold. This alarm is automatically cleared when any CPU usage sample in a sampling period is less than the alarm generation threshold.

NOTE:

The sampling period is the number of overload times multiplied by the sampling interval. For example, if the number of overload times is 40 and the sampling interval is 15 seconds, the sampling period is 40 x 15 = 600 seconds. The sampling interval cannot be customized.
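The arithmetic in the note and the generation and clearing rule can be sketched as follows. This is a minimal illustration, not the management plane's implementation: the function and class names are hypothetical, and the defaults of 85 and 40 are the values the procedure lists for Alarm Generation Threshold and Threshold-crossing times (treating Threshold-crossing times as the number of overload times is an assumption).

```python
from collections import deque


def sampling_period_seconds(overload_times: int, interval_seconds: int) -> int:
    """Sampling period = number of overload times x sampling interval."""
    return overload_times * interval_seconds


class CpuAlarm:
    """Hypothetical model of the ALM-151 rule described above."""

    def __init__(self, threshold: float = 85.0, overload_times: int = 40):
        self.threshold = threshold
        # Keep only the most recent samples, that is, one sampling period.
        self.samples = deque(maxlen=overload_times)
        self.active = False

    def add_sample(self, cpu_usage: float) -> bool:
        self.samples.append(cpu_usage)
        window_full = len(self.samples) == self.samples.maxlen
        if window_full and all(s >= self.threshold for s in self.samples):
            self.active = True   # every sample in the period reached the threshold
        elif cpu_usage < self.threshold:
            self.active = False  # a single sample below the threshold clears the alarm
        return self.active


print(sampling_period_seconds(40, 15))  # 600
```

With the defaults, the alarm can only be raised after 40 consecutive samples at or above the threshold, which is why a short CPU spike does not trigger it.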

Attribute

Alarm ID    Alarm Severity    Alarm Type
151         Major             Over limit

Parameters

Name                   Meaning
Host                   Name of the node for which the alarm is generated
Operating System       Operating system of the server
Site Name              Name of the site for which the alarm is generated
Threshold              Threshold for generating an alarm
Clearance threshold    Threshold for clearing an alarm
CPU Usage              CPU usage of the server

Impact on the System

  • The management plane responds slowly.
  • Real-time performance data and alarm reporting are delayed, so information cannot be obtained promptly.
  • The system processes services slowly. As a result, messages may be accumulated.

Possible Causes

The possible causes of this alarm are as follows:

  • The management plane is busy temporarily.
  • The alarm generation threshold for the CPU usage of the management plane server is set improperly.
  • The management plane server is performing an operation that occupies many system resources.
  • The hardware performance of the management plane server is low. Therefore, the management plane cannot run properly.

Procedure

  1. Use a browser to log in to ManageOne Deployment Portal.

    URL: https://<Floating IP address of ManageOne Deployment Portal>:31945, for example, https://192.168.0.1:31945

    Default account: admin; default password: Huawei12#$

  2. Check whether multiple tasks are being executed on the management plane.

    On the management plane main menu, choose System > Task > Task Information List, wait until all tasks are complete, and then check whether the alarm is cleared.

    • If yes, no further action is required.
    • If no, go to 3.

  3. Check whether the threshold of the CPU usage is set properly.

    1. On the management plane, choose Application > Service Management > System Monitoring from the main menu.
    2. On the System Monitoring page, click the icon on the right of the Nodes tab page to check whether the Alarm Generation Threshold and Threshold-crossing times of the CPU usage are set properly.
      • If yes, go to 4.
      • If no, reset the Alarm Generation Threshold and Threshold-crossing times of the CPU usage (the default values are "85" and "40", respectively). No further action is required.

  4. Check whether the CPU usage of applications exceeds the alarm generation threshold.

    1. On the Nodes tab page, find the node for which the alarm is generated.
    2. Check whether the CPU usage exceeds the alarm generation threshold.
      • If yes, the CPU is exhausted due to the applications. Wait until the related service processing is complete, and then go to 7. If the services are not completed for a long time, collect the alarm handling information and contact technical support for troubleshooting assistance.
      • If no, go to 5.

  5. Check whether the CPU usage of non-applications exceeds the alarm generation threshold.

    1. Use PuTTY to log in to the IP address corresponding to the Host parameter in the alarm parameters as the ossadm user in SSH mode.
    2. Run the top command and check the CPU usage of the processes in the %CPU column.
      • If the CPU usage exceeds the threshold, contact technical support. No further action is required.
      • If the CPU usage does not exceed the threshold, go to 6.

  6. Check whether the hardware performance of the server is too low to support the management plane.

    Low hardware performance shows either of the following symptoms:

    • The hardware requirements corresponding to the management scope of ManageOne are beyond the actual hardware capability of the server.
    • The alarm is generated persistently or frequently.

    Check whether either of the two symptoms appears.

    • If yes, contact technical support. No further action is required.
    • If no, go to 7.

  7. Wait for one minute and check whether the alarm is cleared.

    • If yes, no further action is required.
    • If no, collect the alarm handling information and contact technical support.

Clearing

When the fault is rectified, the system will automatically clear the alarm. Manual clearing is not required.

The alarm cannot be automatically cleared and needs to be manually cleared in the following scenarios:

  • The name of the host for which the alarm is generated is changed.
  • The server for which the alarm is generated is not monitored.

Related Information

None

ALM-152 The OSS Service Is Terminated Abnormally

Description

This alarm is generated when the management plane detects (detection is performed every 30 seconds) that a service process has exited unexpectedly and has failed to restart 10 consecutive times. This alarm is automatically cleared when the management plane detects that the process has started.

Attribute

Alarm ID    Alarm Severity    Alarm Type
152         Major             Processing error alarm

Parameters

Name           Meaning
Server name    Name of the node for which the alarm is generated
SvcAgent       Name of the process that generates the alarm
SvcName        Name of the service for which the alarm is generated
Site Name      Name of the site for which the alarm is generated

Impact on the System

Related service functions are unavailable, and so are the services that depend on them.

Possible Causes

The possible causes of this alarm are as follows:

  • The issue is caused by manual operations. For example, a process is stopped manually.
  • The system resources are insufficient.

Procedure

  1. Check whether the node specified by the Server name parameter in the alarm parameters belongs to CloudSOP-UniEP.

    1. Use a browser to log in to ManageOne Deployment Portal.

      URL: https://<Floating IP address of ManageOne Deployment Portal>:31945, for example, https://192.168.0.1:31945

      Default account: admin; default password: Huawei12#$

    2. On the main menu, choose Application > Service Management > System Monitoring.
    3. In the upper left corner of the System Monitoring page, move the pointer to the icon and select CloudSOP-UniEP.
    4. Check whether the node specified by the Server name parameter in the alarm parameters exists.
      • If yes, go to 2.
      • If no, go to 5.

  2. Use PuTTY to log in to the deployment node as the sopuser user in SSH mode.

    The default password of the sopuser user is D4I$awOD7k.

  3. Run the following command to switch to the ossadm user:

    su - ossadm

    The default password of the ossadm user is Changeme_123.

  4. Run the following commands to start the management plane service, and then go to 11:

    > cd /opt/oss/manager/agent/bin

    > bash ipmc_adm -cmd startapp -tenant manager

    If information similar to the following is displayed and success is displayed for all processes, the management plane services are restarted successfully. Otherwise, contact technical support for troubleshooting assistance.

    Starting process backupwebsite-0-0 ... success  
    Starting process smapp-0-0 ... success  
    Starting process cron-0-0 ... success  
    ...

  5. Use a browser to log in to ManageOne Deployment Portal.

    URL: https://<Floating IP address of ManageOne Deployment Portal>:31945, for example, https://192.168.0.1:31945

    Default account: admin; default password: Huawei12#$

  6. On the main menu, choose Application > Service Management > System Monitoring.
  7. In the upper left corner of the System Monitoring page, move the pointer to the icon and select the product to which the node specified by the Server name parameter in the alarm parameters belongs.
  8. On the Nodes tab page, click the name of the node for which the alarm is generated.
  9. On the Process tab page, check whether the service process is in the Running state.

    • If yes, go to 11.
    • If no, go to 10.

  10. Select the process to be started and click Start to start the process.
  11. Check whether the alarm is cleared.

    • If yes, no further action is required.
    • If no, contact technical support for assistance.
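Step 4 asks you to confirm that success is displayed for every process. When the output is long, that check can be scripted. The sketch below assumes the output lines look exactly like the sample in step 4 ("Starting process <name> ... <status>"). The function name is hypothetical, and any status other than success is simply treated as a failure.

```python
import re

# Matches lines such as "Starting process smapp-0-0 ... success".
LINE = re.compile(r"^\s*Starting process (\S+) \.\.\. (\S+)\s*$")


def failed_processes(output: str) -> list[str]:
    """Return the processes whose start status is anything other than 'success'."""
    failed = []
    for line in output.splitlines():
        match = LINE.match(line)
        if match and match.group(2) != "success":
            failed.append(match.group(1))
    return failed


sample = """Starting process backupwebsite-0-0 ... success
Starting process smapp-0-0 ... success
Starting process cron-0-0 ... success"""
print(failed_processes(sample))  # []
```

An empty list means that every listed process reported success. Any name in the list is a reason to contact technical support, as step 4 describes.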

Related Information

None

ALM-101208 Node Status Is Abnormal

Description

This alarm is generated when the management plane detects that the node is unreachable 4 consecutive times (the detection interval is 180 seconds). This alarm is automatically cleared when the node recovers.
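The rule above is a consecutive-failure counter: a single successful check resets the count. The following sketch models that behavior under the assumption that one reachability check runs per detection interval. The class and method names are hypothetical and do not come from the product.

```python
class NodeReachabilityAlarm:
    """Hypothetical model: alarm after N consecutive failed checks, clear on recovery."""

    def __init__(self, failures_to_alarm: int = 4):
        self.failures_to_alarm = failures_to_alarm
        self.consecutive_failures = 0
        self.active = False

    def record_check(self, reachable: bool) -> bool:
        if reachable:
            # Any successful check resets the counter and clears the alarm.
            self.consecutive_failures = 0
            self.active = False
        else:
            self.consecutive_failures += 1
            if self.consecutive_failures >= self.failures_to_alarm:
                self.active = True
        return self.active


alarm = NodeReachabilityAlarm()
for reachable in (False, False, False, True, False):
    alarm.record_check(reachable)
print(alarm.active)  # False
```

In the example, three failures followed by a success do not raise the alarm, because the failures are not consecutive up to the fourth check.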

Attribute

Alarm ID    Alarm Severity    Alarm Type
101208      Major             Communications alarm

Parameters

Name         Meaning
Host         Name of the node for which the alarm is generated
Site Name    Name of the site for which the alarm is generated

Impact on the System

You cannot log in to the node, or an error may occur when you perform operations on the node.

Possible Causes

  • The OS of the node cannot be logged in to, or no response is returned.
  • The VM is powered off, or the VM network connection is abnormal.
  • ProductMonitorAgent on the node is abnormal.
  • The IR certificate of the node expires, and the internal communication is abnormal.
  • If a database exists on the node, the replication status of the database instances may be abnormal. As a result, the node is abnormal.

Procedure

  1. Use a browser to log in to ManageOne Deployment Portal.

    URL: https://<Floating IP address of ManageOne Deployment Portal>:31945, for example, https://192.168.0.1:31945

    Default account: admin; default password: Huawei12#$

  2. Execute the check items and check methods in Table 4-4 in sequence and rectify the fault according to the corresponding troubleshooting methods.

    NOTE:

    This section provides only the basic troubleshooting method. If the fault persists after troubleshooting using this method, collect the information about alarm handling, and contact technical support.

    Table 4-4 Product node troubleshooting

    No. 1 Check item: Network connection

      Check method: Contact the administrator to check whether the network connection is normal.

      Troubleshooting method: Contact the administrator to rectify the network fault.

    No. 2 Check item: VM running status

      Check method: Contact the administrator to check whether the VM is abnormal, for example, whether the VM is powered off or deleted.

      Troubleshooting method: Contact the administrator to restart and restore the VM.

    No. 3 Check item: Operating system running status

      Check method: Restart the VM and use PuTTY to log in to the faulty node as the sopuser user in SSH mode.

      Troubleshooting method: If the login fails or no response is returned, the OS of the faulty node is abnormal. In this situation, contact technical support for assistance.

    No. 4 Check item: Running status of ProductMonitorAgent

      Check method:

      1. Use PuTTY to log in to the faulty node as the sopuser user in SSH mode.

        The default password of the sopuser user is D4I$awOD7k.

      2. Run the following command to switch to the ossadm user:

        su - ossadm

        The default password of the ossadm user is Changeme_123.

      3. Run the following command to check whether ProductMonitorAgent is running properly:

        > ps -ef | grep ProductMonitorAgent

        If information similar to the following is displayed, ProductMonitorAgent is running:

        ossadm    21501      1  2 16:47 ?        00:01:18 /opt/oss/envs/ProductMonitorAgent/service/rtsp/python/bin/python /opt/oss/envs/ProductMonitorAgent/service/tools/pyscript/icAgent.pyc -DNFW=productmonitoragent-0-0

      Troubleshooting method: If ProductMonitorAgent is not running, run the following commands to start it:

        > . /opt/oss/manager/bin/engr_profile.sh

        > ipmc_adm -cmd startapp -app ProductMonitorAgent -tenant manager

        If the following information is displayed, ProductMonitorAgent is started successfully. Otherwise, contact technical support.

        Starting process productmonitoragent-0-0 ... success

    No. 5 Check item: IR certificate

      Check method:

      1. Use PuTTY to log in to the deployment node as the sopuser user in SSH mode.

        The default password of the sopuser user is D4I$awOD7k.

      2. Run the following command to switch to the ossadm user:

        su - ossadm

        The default password of the ossadm user is Changeme_123.

      3. Run the following commands to check the IR certificate validity:

        > cd /opt/oss/manager/etc/ssl/internal

        > openssl x509 -in server.cer -noout -dates

        If information similar to the following is displayed, the time displayed on the right of notAfter is the expiration time of the IR certificate:

        notBefore=Oct 18 00:00:00 2018 GMT
        notAfter=Oct 13 00:00:00 2038 GMT

        • If the IR certificate has expired, update the CA certificate.
        • If the IR certificate is valid, the fault is not caused by certificate expiration.

      Troubleshooting method: Update the CA certificate. For details, see "Certificate Management > Replacing Type B and Type C Certificates > Manually Replacing Other Certificates > Replacing the CA Certificate of ManageOne" in HUAWEI CLOUD Stack 6.5.0 Security Management Guide.

    No. 6 Check item: Database replication status

      Check method: For details, see ALM-101210 The database local copy status is abnormal.

      Troubleshooting method: For details, see ALM-101210 The database local copy status is abnormal.

  3. Log in to the management plane again and check the node status.

    • If the restored node is in the Normal state, this alarm is cleared.
    • If the restored node is not in the Normal state, contact technical support.
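The IR certificate check in Table 4-4 compares the notAfter value against the current time. The sketch below parses the two-line output of openssl x509 -noout -dates shown in the table. It assumes the date format of that sample and an English (C) locale for month names. The function names are hypothetical.

```python
from datetime import datetime, timezone


def parse_not_after(openssl_output: str) -> datetime:
    """Extract the notAfter timestamp from `openssl x509 -noout -dates` output."""
    for line in openssl_output.splitlines():
        line = line.strip()
        if line.startswith("notAfter="):
            value = line.split("=", 1)[1]
            parsed = datetime.strptime(value, "%b %d %H:%M:%S %Y GMT")
            return parsed.replace(tzinfo=timezone.utc)
    raise ValueError("no notAfter line found")


def is_expired(openssl_output: str, now: datetime) -> bool:
    """True if the certificate's expiration time is not later than `now`."""
    return now >= parse_not_after(openssl_output)


sample = """notBefore=Oct 18 00:00:00 2018 GMT
notAfter=Oct 13 00:00:00 2038 GMT"""
print(parse_not_after(sample).isoformat())  # 2038-10-13T00:00:00+00:00
```

To test against the current time, pass datetime.now(timezone.utc) as the now argument.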

Clearing

When the fault is rectified, the system will automatically clear the alarm. Manual clearing is not required.

The alarm cannot be automatically cleared and needs to be manually cleared in the following scenarios:

  • The name of the host for which the alarm is generated is changed.
  • The server for which the alarm is generated is not monitored.

Related Information

None

ALM-154 The Memory Usage Is Too High

Description

This alarm is generated when the management plane detects (detection is performed every 15 seconds) that the physical memory usage is greater than or equal to the alarm generation threshold. This alarm is automatically cleared when the physical memory usage is smaller than or equal to the alarm clearance threshold.
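Because the generation and clearance thresholds differ, the alarm behaves with hysteresis: usage between the two thresholds leaves the alarm state unchanged. The following sketch illustrates this rule only. The class name is hypothetical, and the defaults of 85 and 80 are the default Alarm Generation Threshold and Alarm Clearance Threshold given in the procedure for physical memory usage.

```python
class HysteresisAlarm:
    """Hypothetical model of a generate/clear threshold pair."""

    def __init__(self, generation_threshold: float = 85.0,
                 clearance_threshold: float = 80.0):
        if clearance_threshold > generation_threshold:
            raise ValueError("clearance threshold must not exceed generation threshold")
        self.generation_threshold = generation_threshold
        self.clearance_threshold = clearance_threshold
        self.active = False

    def update(self, usage_percent: float) -> bool:
        if usage_percent >= self.generation_threshold:
            self.active = True
        elif usage_percent <= self.clearance_threshold:
            self.active = False
        # Between the two thresholds the previous state is kept.
        return self.active


alarm = HysteresisAlarm()
print([alarm.update(u) for u in (84, 85, 82, 80)])  # [False, True, True, False]
```

The gap between the thresholds keeps the alarm from flapping when usage hovers near a single limit.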

Attribute

Alarm ID    Alarm Severity    Alarm Type
154         Major             Over limit

Parameters

Name                   Meaning
Host                   Name of the node for which the alarm is generated
Operating System       Operating system of the server
Site Name              Name of the site for which the alarm is generated
Threshold              Threshold for generating an alarm
Clearance threshold    Threshold for clearing an alarm
Memory Usage           Memory usage

Impact on the System

The management plane responds slowly.

Possible Causes

The possible causes of this alarm are as follows:

  • The alarm generation threshold for the memory usage of the management plane server is set improperly.
  • The management plane server is performing an operation that occupies many system resources or takes a long time.
  • Services are busy; therefore, the memory usage increases.
  • A program processing exception occurs.

Procedure

  1. Use a browser to log in to ManageOne Deployment Portal.

    URL: https://<Floating IP address of ManageOne Deployment Portal>:31945, for example, https://192.168.0.1:31945

    Default account: admin; default password: Huawei12#$

  2. On the management plane, choose Application > Service Management > System Monitoring from the main menu.
  3. On the System Monitoring page, click the icon on the right of the Nodes tab page to check whether the Alarm Generation Threshold and Alarm Clearance Threshold of the physical memory usage are set properly.

    • If yes, go to 4.
    • If no, reset the Alarm Generation Threshold and Alarm Clearance Threshold for the physical memory usage (the default values are "85" and "80", respectively). No further action is required.

  4. Check the physical memory usage of the application.

    1. On the Nodes tab page, find the node for which the alarm is generated.
    2. Check whether the physical memory usage exceeds the alarm generation threshold.
      • If yes, the physical memory resources are exhausted due to the applications. Wait until the related service processing is complete, and then go to 7. If the services are not completed for a long time, collect the alarm handling information and contact technical support for troubleshooting assistance.
      • If no, go to 5.

  5. Check the non-application process with the highest physical memory usage.

    1. Use PuTTY to log in to the IP address corresponding to the Host parameter in the alarm parameters as the ossadm user in SSH mode.
    2. Run the following command and check, in the %MEM column, whether the physical memory usage of the process exceeds the alarm generation threshold:
      > top
      ...
      PID USER      PR  NI    VIRT    RES    SHR S   %CPU  %MEM     TIME+ COMMAND
      164860 ossadm    20   0  832312 496564  18164 S 20.199 1.539   1480:48 java
      • If yes, contact technical support. No further action is required.
      • If no, go to 6.

  6. If physical memory is sufficient, memory is not reclaimed after services are processed, so the alarm is not cleared. In this case, check whether the physical memory usage keeps increasing:

    1. On the Nodes tab page, find the node for which the alarm is generated.
    2. Check whether the physical memory usage of the corresponding process increases continuously.
      • If yes, go to 7.
      • If no, no further action is required.

  7. After 1 minute, check whether the alarm is cleared.

    • If yes, no further action is required.
    • If no, collect the alarm handling information and contact technical support.
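Step 5 reads the %MEM column of the top output by eye. The sketch below does the same programmatically on captured output, for example from top -b -n 1. It assumes the column header shown in step 5 and locates columns by header name rather than position. The function names are hypothetical.

```python
def parse_top_rows(top_output: str) -> list[dict]:
    """Turn the process table of `top` output into a list of column-name dicts."""
    lines = [ln.split() for ln in top_output.splitlines() if ln.strip()]
    # The process table starts at the header line that contains %MEM.
    header_index = next(i for i, cols in enumerate(lines) if "%MEM" in cols)
    header = lines[header_index]
    rows = []
    for cols in lines[header_index + 1:]:
        if len(cols) < len(header):
            continue
        # COMMAND may contain spaces, so join everything after the fixed columns.
        fields = cols[:len(header) - 1] + [" ".join(cols[len(header) - 1:])]
        rows.append(dict(zip(header, fields)))
    return rows


def top_memory_process(top_output: str) -> dict:
    """Return the row with the highest %MEM value."""
    return max(parse_top_rows(top_output), key=lambda row: float(row["%MEM"]))


SAMPLE = """PID USER      PR  NI    VIRT    RES    SHR S   %CPU  %MEM     TIME+ COMMAND
164860 ossadm    20   0  832312 496564  18164 S 20.199 1.539   1480:48 java"""
print(top_memory_process(SAMPLE)["%MEM"])  # 1.539
```

The returned %MEM value can then be compared with the alarm generation threshold, as step 5 requires.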

Clearing

When the fault is rectified, the system will automatically clear the alarm. Manual clearing is not required.

The alarm cannot be automatically cleared and needs to be manually cleared in the following scenarios:

  • The name of the host for which the alarm is generated is changed.
  • The server for which the alarm is generated is not monitored.

Related Information

None

ALM-36 The Disk Usage Is Too High

Description

This alarm is generated when the management plane detects (detection is performed every 15 seconds) that the disk or partition usage is greater than or equal to the alarm generation threshold. This alarm is automatically cleared if the disk or partition usage is less than or equal to the alarm clearance threshold.

Attribute

Alarm ID    Alarm Severity    Alarm Type
36          Major             Over limit

Parameters

Name                   Meaning
Host                   Name of the node for which the alarm is generated
Operating System       Operating system of the server
Disk                   Name of the server disk for which the alarm is generated
Site Name              Name of the site for which the alarm is generated
Threshold              Threshold for generating an alarm
Clearance threshold    Threshold for clearing an alarm
Capacity               Disk capacity
Usage                  Disk space usage

Impact on the System

Write operations of management plane services may fail, and database exceptions may occur.

Possible Causes

  • The alarm generation threshold for the disk usage of the management plane server is set improperly.
  • The disk contains too many unnecessary files.
    • The recycle bin is not cleared.
    • The management plane server has received a large amount of data, including NE alarms, events, and logs. The data is exported from the database to disk files in a short time.
    • There are too many temporary data files and backup files.

Procedure

  1. Use a browser to log in to ManageOne Deployment Portal.

    URL: https://<Floating IP address of ManageOne Deployment Portal>:31945, for example, https://192.168.0.1:31945

    Default account: admin; default password: Huawei12#$

  2. On the management plane, choose Application > Service Management > System Monitoring from the main menu.
  3. On the System Monitoring page, click the icon on the right of the Nodes tab page to check whether Alarm Generation Threshold and Alarm Clearance Threshold in the Hard Disk area are set properly (the default values are "80" and "75", respectively).

    • If yes, go to 4.
    • If no, reset the thresholds to appropriate values. If the alarm is cleared, no further action is required. Otherwise, go to 4.

  4. Delete the unnecessary files from the disk.

    1. Use PuTTY to log in to the node for which this alarm is generated in SSH mode as the sopuser user.

      The default password of the sopuser user is D4I$awOD7k.

    2. Run the following command to switch to the ossadm user:

      su - ossadm

      The default password of the ossadm user is Changeme_123.

    3. Run the following command to switch to the root user:

      > su - root

      Password: Password of the root user
    4. Run the following command to identify which disks have high usage.

      # df -k

      If disks not listed in the Disk alarm parameter also have high usage that is still below the alarm generation threshold, you can clear unneeded files from those disks as well.

    5. Run the following commands to enter the directory of the disk with high usage, query the files and subdirectories in the disk, sort the files and subdirectories by size, and write the files and subdirectories to the du_k.txt file:

      # cd filepath

      # du -k | sort -nr > /tmp/du_k.txt

    6. Run the following command to check the du_k.txt file and identify the subdirectory of a larger size:

      # more /tmp/du_k.txt

    7. Run the following commands to enter the subdirectory of a larger size, query the files and subdirectories in the subdirectory, sort the files and subdirectories by size, and write the files and subdirectories to the ls_l.txt file:

      # cd filepath

      # ls -l | sort -k5,5nr > /tmp/ls_l.txt

    8. Run the following command to check the ls_l.txt file and identify the subdirectory or file of a larger size:

      # more /tmp/ls_l.txt

    9. Repeat substeps 5 to 8 of step 4 to locate the files that cause the high file system usage, determine which of them are unnecessary, and delete those files.

      You are advised to preferentially delete the installation packages, patch packages, installation packages of the adaptation layer, backup files during installation, core files, and the files of the management plane and database that can be deleted. After you delete the files, go to 5.

    NOTE:

    If you are not sure whether a file can be deleted, contact technical support.

  5. Wait 1 minute. Check whether the alarm is cleared.

    • If yes, no further action is required.
    • If no, collect the alarm handling information and contact technical support.
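The manual search in step 4 (df, then du and ls sorted by size) can also be approximated in one pass. The sketch below is an illustration using only the Python standard library. The function names are hypothetical, and the percentage it computes may differ slightly from the figure df reports, because df accounts for space reserved for the root user.

```python
import os
import shutil


def usage_percent(path: str) -> float:
    """Used space on the file system holding `path`, as a percentage of its total size."""
    total, used, _free = shutil.disk_usage(path)
    return used / total * 100


def largest_files(root: str, limit: int = 10) -> list[tuple[int, str]]:
    """Return (size_in_bytes, path) for the largest regular files under `root`."""
    sizes = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            try:
                if not os.path.islink(full):
                    sizes.append((os.path.getsize(full), full))
            except OSError:
                continue  # the file vanished or is unreadable; skip it
    return sorted(sizes, reverse=True)[:limit]


# Example call (the directory is only an illustration):
# for size, path in largest_files("/opt", limit=5):
#     print(size, path)
```

The script only lists candidates. Whether a file can be deleted is still decided as described in step 4, and technical support should be contacted in case of doubt.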

Clearing

When the fault is rectified, the system will automatically clear the alarm. Manual clearing is not required.

The alarm cannot be automatically cleared and needs to be manually cleared in the following scenarios:

  • The name of the host for which the alarm is generated is changed.
  • The disk mounting point that generates the alarm is changed.
  • The server for which the alarm is generated is not monitored.

Related Information

None

ALM-38 The Database Process Is Abnormal

Description

This alarm is generated when the GaussDB 100 V1, GaussDB 100 V3, or Redis database service process does not exist or the password for logging in to the database is incorrect. This alarm is automatically cleared when the GaussDB 100 V1, GaussDB 100 V3, or Redis database service process is running again or the password is correct.

Attribute

Alarm ID    Alarm Severity    Alarm Type
38          Major             Processing error alarm

Parameters

Table 4-5

Name                Description
Host                Name of the node for which the alarm is generated.
Operating System    Operating system of the server.
Database service    Name of the database service that generates the alarm.
Site Name           Name of the site for which the alarm is generated.

Impact on the System

The services fail to access the database. The fault and security menus that need to read data from the database are unavailable. Only the topology menu items are available. If the fault lasts for a long time, alarm information may be lost or ManageOne functions may become unavailable.

Possible Causes

  • The server is powered off, ending the database process.
  • The database process is ended manually.
  • The password of the database is incorrect.

Procedure

  • Handling a GaussDB 100 or Redis database process exception:
    1. Use PuTTY to log in to the node for which the alarm is generated as the ossadm user in SSH mode.
    2. Run the following commands to check the status of the database instance process:

      > . /opt/oss/manager/agent/bin/engr_profile.sh

      > ipmc_adm -cmd statusdc -tenant tenantname

      NOTE:

      The value of -tenant (tenantname) is manager or the product name.

      • If the following information is displayed, the database is running. Go to 4.

        GaussDB 100 V1 database:

        dbuser   251405      1  0 14:53 ?        00:00:03 <GaussDB 100 V1 installation directory>/app/bin/gaussdb -D /data/managedbsvr-0-999

        GaussDB 100 V3 database:

        dbuser   11889     1  9 00:35 ?        01:44:23 <GaussDB 100 V3 installation directory>/app/bin/zengine nomount -D <GaussDB 100 V3 installation directory>/data/cloudsopdbsvr-1-0

        Redis database:

        dbuser   253283      1  0 14:54 ?        00:00:08 <Redis installation directory>/bin/redis-server 10.93.61.59:26521
      • If no information is displayed, the database is not started. In this case, go to 3 to start the database manually.
    3. Run the following commands to manually start the database.

      > . /opt/oss/manager/agent/bin/engr_profile.sh

      > ipmc_adm -cmd startdc -tenant tenantname

      After the command is executed successfully, run the commands in 2 again to query the status of the database instance process and check the command output.

      • If the database process is running, the database is started successfully. If the alarm is cleared, no further action is required. If the alarm persists, go to 4.
      • If the database process is not running, the database fails to be started. In this case, contact Huawei technical support.
    4. Run the following command to switch to the user dbuser.

      > su - dbuser

      Password: Password of dbuser
    5. Run the following commands to log in to the database.
      • GaussDB 100 V1 database:

        > . ~/appgsdb.bashrc

        > cd <GaussDB 100 V1 installation directory>/app/bin

        > ./gsql -h 10.93.61.128 -d postgres -p 32080 -U dbuser

        Password for user dbuser: Password of dbuser for logging in GaussDB 100 V1 database

        Parameter    Description
        -h           IP address of the GaussDB 100 V1 database node
        -d           GaussDB 100 V1 database name
        -p           Port number of the GaussDB 100 V1 database instance
        -U           User name of the GaussDB 100 V1 database

        If the following information is displayed, the login is successful. If the alarm is then cleared, no further action is required. Otherwise, contact Huawei technical support.

        gsql (9.2.1)
        SSL connection (cipher: DHE-RSA-AES256-SHA, bits: 256)
        Type "help" for help.
        
        POSTGRES=#

        Enter "\q" to log out.

      • GaussDB 100 V3 database:

        > cd <GaussDB 100 V3 installation directory>/app/bin/

        > ./zsql sys@IP address of database node:port number of the database instance

        Please enter password: 
        Password of dbuser for logging in GaussDB 100 V3 database

        If the following information is displayed, the login is successful. If the alarm is then cleared, no further action is required. Otherwise, contact Huawei technical support.

        connected.
        
        SQL>

        Enter "exit" to log out.

      • Redis database:

        > cd <Redis installation directory>/bin

        > ./redis-cli -h 10.93.59.218 -p 26521 -cipherdir <Redis installation directory>/etc/cipher/

        > auth productmonitorrdb@dbuser@Password of dbuser for logging in Redis database

        Parameter     Description
        -h            IP address of the Redis database node
        -p            Port number of the Redis database instance
        -cipherdir    Encryption path for the Redis database

        NOTE:

        productmonitorrdb indicates the Redis database name.

        If the following information is displayed, the login is successful. If the alarm is then cleared, no further action is required. Otherwise, contact Huawei technical support.

        OK
        10.93.59.218:26521>

        Enter "exit" or "quit" to log out.

    6. Collect the alarm information from the preceding steps and contact Huawei technical support for a solution.
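Step 2 identifies a running database by the binary name in each status line (gaussdb, zengine, or redis-server). The sketch below applies that rule to captured output. It assumes the ps-style column layout of the sample lines in step 2, with the command path in the eighth field. The function name is hypothetical, and the installation path in the sample is a made-up placeholder, not a real product path.

```python
import os

# Binary names taken from the sample output in step 2.
DB_BINARIES = {
    "gaussdb": "GaussDB 100 V1",
    "zengine": "GaussDB 100 V3",
    "redis-server": "Redis",
}


def running_databases(status_output: str) -> set[str]:
    """Return the database types that have a process in the status output."""
    found = set()
    for line in status_output.splitlines():
        fields = line.split()
        # ps-style columns: UID PID PPID C STIME TTY TIME CMD ...
        if len(fields) < 8:
            continue
        binary = os.path.basename(fields[7])
        if binary in DB_BINARIES:
            found.add(DB_BINARIES[binary])
    return found


# Hypothetical path /opt/example stands in for the installation directory.
sample = "dbuser   251405      1  0 14:53 ?        00:00:03 /opt/example/app/bin/gaussdb -D /data/managedbsvr-0-999"
print(running_databases(sample))
```

An empty set corresponds to the "no information is displayed" branch of step 2, in which the database must be started manually.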

Clearing

When the fault is rectified, the system automatically clears the alarm. Manual clearing is not required.

The alarm cannot be automatically cleared and must be manually cleared in the following scenarios:

  • The name of the node for which this alarm is generated is changed.
  • The name of the site for which this alarm is generated is changed.
  • The server that generates the alarm is not monitored.

Related Information

None

ALM-44 The Database Usage Is Too High

Description

This alarm is generated when the management plane detects (detection is performed every 10 seconds) that the database tablespace usage is greater than or equal to the Alarm Generation Threshold, or that the memory usage of a Redis database is greater than or equal to the Alarm Generation Threshold. This alarm is automatically cleared when the tablespace or memory usage is less than or equal to the Alarm Clearance Threshold.

Attribute

Alarm ID    Alarm Severity    Alarm Type
44          Major             Over limit

Parameters

Name                Meaning
Host                Name of the node for which the alarm is generated
Database service    Name of the database service that generates the alarm
Database            Name of the database for which the alarm is generated
Site Name           Name of the site for which the alarm is generated
Size                Database tablespace size or Redis database memory size
Threshold           Alarm Generation Threshold and Alarm Clearance Threshold are provided.
Usage               Relational database tablespace usage or Redis database memory usage

Impact on the System

The operations on the management plane associated with the database tablespace usage of the relational databases or the memory of Redis databases may fail, for example, alarm information may fail to be saved to the database. This alarm may cause exceptions on the management plane if it is not cleared in a timely manner.

Possible Causes

The possible causes of this alarm are as follows:

  • The Alarm Generation Threshold for the relational database tablespace or Redis database memory of the management plane server is set improperly.
  • The relational database tablespace or Redis database memory is not released in a timely manner after the data in the database is exported or dumped.

Procedure

  1. Use a browser to log in to ManageOne Deployment Portal.

    URL: https://Floating IP address of ManageOne Deployment Portal:31945, for example, https://192.168.0.1:31945

    Default account: admin; default password: Huawei12#$

  2. On the management plane, choose Application > Service Management > System Monitoring from the main menu.
  3. Perform related operations based on the database type.

    • Redis Database: On the System Monitoring page, click the icon on the right of the Redis Databases tab page and check whether the Alarm Generation Threshold and Alarm Clearance Threshold of the Redis database memory usage are set properly.
      • If yes, collect the alarm information and contact technical support.
      • If no, reset the Alarm Generation Threshold and Alarm Clearance Threshold for the Redis database memory usage (the default values are "80" and "70", respectively).
    • Relational Database: On the System Monitoring page, click the icon on the right of the Relational Databases tab page and check whether the Alarm Generation Threshold and Alarm Clearance Threshold of the database tablespace usage are set properly.
      • If yes, collect the alarm information and contact technical support.
      • If no, reset the Alarm Generation Threshold and Alarm Clearance Threshold for the database tablespace usage (the default values are "95" and "85", respectively).
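
The threshold check in step 3 can be made concrete with a small sketch. The reference does not define when a threshold pair is set properly, so the two rules used here (each value within 0 to 100, clearance lower than generation) are assumptions for illustration, as is the function name threshold_problems.

```python
from typing import List


def threshold_problems(generation: float, clearance: float) -> List[str]:
    """Return a list of problems; an empty list means the pair looks sane.

    The rules below are assumptions, not criteria stated by the product.
    """
    problems = []
    for name, value in (("generation", generation), ("clearance", clearance)):
        if not 0 < value <= 100:
            problems.append(f"{name} threshold {value} is outside (0, 100]")
    if clearance >= generation:
        # Without a gap between the two values the alarm could flap.
        problems.append("clearance threshold must be lower than generation threshold")
    return problems
```

Under these assumptions, the documented defaults (80 and 70 for Redis memory, 95 and 85 for tablespace) both return an empty list.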

Clearing

When the fault is rectified, the system will automatically clear the alarm. Manual clearing is not required.

The alarm cannot be automatically cleared and needs to be manually cleared in the following scenarios:

  • A switchover occurs between the master and standby servers.
  • The name of the host for which the alarm is generated is changed.
  • The server for which the alarm is generated is not monitored.

Related Information

None

ALM-47 Memory Usage of Service Is Too High

Description

The system checks the memory usage of a service every 15 seconds. This alarm is generated when the memory usage of a service reaches or exceeds the preset threshold 40 consecutive times. This alarm is automatically cleared when the memory usage of the service falls below the preset threshold once.
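
The detection rule can be sketched as a small state machine. This is an illustrative model only, not the product's implementation: the class name ConsecutiveSampleAlarm and any threshold value passed to it are assumptions, while the 40 samples and the 15-second interval are taken from the description.

```python
from dataclasses import dataclass


@dataclass
class ConsecutiveSampleAlarm:
    """Hypothetical model of the ALM-47 rule; not the product's code."""

    threshold: float          # preset threshold in percent (deployment specific)
    required_count: int = 40  # consecutive over-threshold samples (from the description)
    interval_s: int = 15      # sampling interval in seconds (from the description)
    streak: int = 0
    active: bool = False

    def observe(self, usage: float) -> bool:
        """Feed one sample and return whether the alarm is active afterwards."""
        if usage >= self.threshold:
            self.streak += 1
            if self.streak >= self.required_count:
                self.active = True
        else:
            # One sample below the threshold resets the streak and clears the alarm.
            self.streak = 0
            self.active = False
        return self.active

    @property
    def sampling_period_s(self) -> int:
        """Number of required samples multiplied by the sampling interval."""
        return self.required_count * self.interval_s
```

With the values from the description, the sampling period is 40 × 15 = 600 seconds, so a short memory spike does not raise the alarm.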

Attribute

Alarm ID    Alarm Severity    Alarm Type
47          Major             Over limit

Parameters

Table 4-6

Name                Description
Host                Name of the node for which the alarm is generated.
Operating System    Operating system of the server.
Service             Service process name of the host that generated the alarm.
Site Name           Name of the site for which the alarm is generated.

Impact on the System

The service plane server responds slowly.

Possible Causes

  • The service plane is busy processing services, which increases the memory usage.
  • The threshold for generating a high memory usage alarm is set too low.
  • A program error occurs.

Procedure

  1. Log in to the management plane.

    1. Access https://client IP address of the management plane:31945.
    2. Enter the username admin and its password, and click Log In.

  2. On the management plane, choose Application > Service Management > System Monitoring from the main menu.
  3. On the Nodes tab page, click the name of the node for which the alarm is generated.
  4. On the Processes tab page, select the process for which the alarm is generated, and click Stop. After the status of the process becomes Not Running, click Start.

    NOTE:

    You can obtain the name of the process for which the alarm is generated in the location information of the alarm.

  5. After the process is started, wait for 5 minutes and check whether the alarm is cleared.

    • If yes, no further action is required.
    • If no, contact Huawei technical support.

Clearing

When the fault is eliminated, the system will auto-clear the alarm. Manual clearing is not required.

This alarm cannot be automatically cleared in the following situations and requires manual clearance:

  • The name of the node for which this alarm is generated is changed.
  • The name of the site for which this alarm is generated is changed.
  • The server that generates the alarm is not monitored.

Related Information

None

ALM-54 The swap Usage Is High

Description

This alarm is generated when the management plane detects (detection is performed every 15 seconds) that the virtual memory usage is greater than or equal to the alarm generation threshold. This alarm is automatically cleared when the virtual memory usage is less than or equal to the alarm clearance threshold.
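
Because generation and clearance use different thresholds, the alarm behaves as a hysteresis loop. The sketch below illustrates this under stated assumptions: the function name next_alarm_state is hypothetical, the default values 85 and 80 are the documented defaults for the virtual memory usage thresholds, and keeping the previous state between the two thresholds is inferred from the description rather than stated by it.

```python
def next_alarm_state(active: bool, usage: float,
                     generation: float = 85.0, clearance: float = 80.0) -> bool:
    """Return the alarm state after one detection cycle (every 15 seconds)."""
    if usage >= generation:
        return True    # raise the alarm, or keep it raised
    if usage <= clearance:
        return False   # clear the alarm
    return active      # between the thresholds: assumed to stay unchanged


# Trace a series of usage samples through the rule.
state = False
trace = []
for sample in [70, 86, 83, 81, 80, 84]:
    state = next_alarm_state(state, sample)
    trace.append(state)
# trace == [False, True, True, True, False, False]
```

The gap between the two thresholds prevents the alarm from toggling when the usage hovers around a single value.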

Attribute

Alarm ID    Alarm Severity    Alarm Type
54          Critical          Over limit

Parameters

Name                    Meaning
Host                    Name of the node for which the alarm is generated
Operating System        Operating system of the server
Site Name               Name of the site for which the alarm is generated
Threshold               Threshold for generating an alarm
Clearance threshold     Threshold for clearing an alarm
Virtual Memory Usage    Virtual memory usage of a server that generates an alarm

Impact on the System

  • The available memory on the management plane is reduced, the response is slow, and operations are delayed.
  • An error may occur when processes are running. The service processing is slow, causing message accumulation or system breakdown.
  • The swap space is used frequently and the management plane performance decreases. Real-time performance data and alarm reports are delayed, and information cannot be obtained promptly.

Possible Causes

The possible causes of this alarm are as follows:

  • The alarm generation threshold for the virtual memory usage on the management plane is improper.
  • The management plane server is performing an operation that occupies many system resources.

Procedure

  1. Use a browser to log in to ManageOne Deployment Portal.

    URL: https://Floating IP address of ManageOne Deployment Portal:31945, for example, https://192.168.0.1:31945

    Default account: admin; default password: Huawei12#$

  2. On the management plane, choose Application > Service Management > System Monitoring from the main menu.
  3. On the System Monitoring page, click the icon on the right of the Nodes tab page and check whether the Alarm Generation Threshold and Alarm Clearance Threshold of the virtual memory usage are set properly.

    • If yes, go to 4.
    • If no, reset the Alarm Generation Threshold and Alarm Clearance Threshold for the virtual memory usage (the default values are "85" and "80", respectively). No further action is required.

  4. Check whether the virtual memory usage of applications exceeds the alarm generation threshold.

    1. On the Nodes tab page, find the node for which the alarm is generated.
    2. Check whether the virtual memory usage exceeds the alarm generation threshold.
      • If yes, the virtual memory resources are exhausted due to the applications. Wait until the related service processing is complete, and then go to 7. If the services are not completed for a long time, collect the alarm handling information and contact technical support for troubleshooting assistance.
      • If no, go to 5.

  5. Check whether the non-application virtual memory exceeds the alarm generation threshold of the virtual memory.

    1. Use PuTTY to log in to the node for which this alarm is generated in SSH mode as the sopuser user.

      The default password of the sopuser user is D4I$awOD7k.

    2. Run the following command to switch to the ossadm user:

      su - ossadm

      The default password of the ossadm user is Changeme_123.

    3. Run the following command and check the memory usage of each process in the %MEM column to determine whether non-application processes exceed the alarm generation threshold:

      > top

      Information similar to the following is displayed:

      ...
      PID USER      PR  NI    VIRT    RES    SHR S   %CPU  %MEM     TIME+ COMMAND
      164860 ossadm    20   0  832312 496564  18164 S 20.199 1.539   1480:48 java

      • If yes, contact technical support. No further action is required.
      • If no, go to 6.

  6. If the virtual memory is sufficient, the virtual memory will not be reclaimed after services are processed. Therefore, the alarm will not be cleared. In this case, check whether the virtual memory usage continuously increases. The specific steps are as follows:

    1. On the Nodes tab page, find the node for which the alarm is generated.
    2. Check whether the virtual memory usage keeps increasing.
      • If yes, collect information generated during alarm handling and contact technical support.
      • If no, go to 7.

  7. After 1 minute, check whether the alarm is cleared.

    • If yes, no further action is required.
    • If no, collect the alarm handling information and contact technical support.
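
When working on the node over SSH, the swap usage can also be read directly from the operating system. The following is a minimal sketch, assuming a Linux node and assuming that the virtual memory usage reported by the alarm corresponds to used swap divided by total swap. The reference does not state how the management plane computes the value, and the function name swap_usage_percent is hypothetical.

```python
def swap_usage_percent(meminfo_text: str) -> float:
    """Compute swap usage in percent from the text of /proc/meminfo."""
    fields = {}
    for line in meminfo_text.splitlines():
        key, _, rest = line.partition(":")
        if key in ("SwapTotal", "SwapFree"):
            fields[key] = int(rest.split()[0])  # values are reported in kB
    if "SwapTotal" not in fields or "SwapFree" not in fields:
        raise ValueError("SwapTotal or SwapFree not found")
    total = fields["SwapTotal"]
    if total == 0:
        return 0.0  # no swap space is configured
    return 100.0 * (total - fields["SwapFree"]) / total
```

On the node, the function can be called with the contents of /proc/meminfo, for example swap_usage_percent(open("/proc/meminfo").read()), and the result compared with the alarm generation threshold.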

Clearing

When the fault is rectified, the system will automatically clear the alarm. Manual clearing is not required.

The alarm cannot be automatically cleared and needs to be manually cleared in the following scenarios:

  • The name of the host for which the alarm is generated is changed.
  • The server for which the alarm is generated is not monitored.

Related Information

None

ALM-101206 SSH management channel is faulty

Description

This alarm is generated when the management plane detects (detection is performed every 60 seconds) that the SSH connection between the deployment node and the product node is abnormal for 4 consecutive times. This alarm is automatically cleared when the SSH connection between the deployment node and the product node recovers.

Attribute

Alarm ID    Alarm Severity    Alarm Type
101206      Critical          Processing error alarm

Parameters

Name         Meaning
Host         Name of the node for which the alarm is generated
Site Name    Name of the site for which the alarm is generated

Impact on the System

The management plane cannot manage the corresponding nodes, which affects the system monitoring and backup and restore functions of the corresponding nodes.

Possible Causes

The possible causes of this alarm are as follows:

  • The node status is abnormal.
  • The network connection between the deployment node and a product node is abnormal.
  • The password for the ossadm user of a product node has expired.
  • The SSH trust relationship between the deployment node and a product node is damaged.

Procedure

  1. Check whether the alarm 101208 Node status is Abnormal has been generated on the node specified by the Host parameter in the alarm parameters.

  2. Use PuTTY to log in to the deployment node as the sopuser user in SSH mode.

    The default password of the sopuser user is D4I$awOD7k.

  3. Run the following command to switch to the ossadm user:

    su - ossadm

    The default password of the ossadm user is Changeme_123.

  4. Run the following command to test the SSH connectivity between the deployment node and the product node:

    > ssh IP address of the node in the alarm parameters

    NOTE:

    If the node specified by the Host parameter in the alarm parameters is the deployment node, you can log in to any management node and connect to any product node to test the SSH connectivity.

    • If you can log in to the node without entering the password, the SSH connection is normal. Go to 5.
    • If the following information is displayed, the password for the ossadm user of the node has expired. Update the password and test the SSH connection between the deployment node and the product node again.
      WARNING: Your password has expired.
      NOTE:

      The password for the ossadm user of the product node must be the same as the password for the ossadm user of the deployment node.

    • If the password for the ossadm user is required, the SSH channel is abnormal. Perform the following operations to restore the SSH trust relationship:
      1. Press Ctrl+C to end the current operation.
      2. Run the following command to open the id_rsa.pub file of the deployment node:

        > vi /home/ossadm/.ssh/id_rsa.pub

      3. Copy the contents in the id_rsa.pub file to the local PC. After the copying is complete, press Esc and enter :q! to close the id_rsa.pub file.
      4. Use PuTTY to log in to the node whose SSH trust relationship is to be restored as the sopuser user in SSH mode.

        The default password of the sopuser user is D4I$awOD7k.

      5. Run the following command to switch to the ossadm user:

        su - ossadm

        The default password of the ossadm user is Changeme_123.

      6. Run the following command to open the authorized_keys file on the node with the SSH trust relationship to be recovered:

        > vi /home/ossadm/.ssh/authorized_keys

        The preceding command opens the vi editor. Press i and paste the contents copied in sub-step 3 at the end of the authorized_keys file.

      7. Press Esc and enter :wq! to save the authorized_keys file. Test the SSH connection between the deployment node and the product node again.

  5. Check whether the alarm is cleared.

    • If yes, no further action is required.
    • If no, contact technical support.
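
The connectivity test in step 4 can be scripted so that it never waits for a password prompt. The sketch below is an illustration, not a tool shipped with the product. It relies on the standard OpenSSH client options BatchMode and ConnectTimeout. The function names and the text matching in classify_probe are assumptions and only approximate the three cases described in step 4.

```python
import subprocess
from typing import List


def build_probe_command(host: str, timeout_s: int = 5) -> List[str]:
    """Build a non-interactive ssh command that only tests the login."""
    return [
        "ssh",
        "-o", "BatchMode=yes",              # fail instead of prompting for a password
        "-o", f"ConnectTimeout={timeout_s}",
        host,
        "true",                             # harmless remote command
    ]


def classify_probe(returncode: int, output: str) -> str:
    """Map a probe result to one of the cases in step 4 (heuristic)."""
    if returncode == 0:
        return "ok"
    text = output.lower()
    if "password has expired" in text:
        return "password_expired"
    if "permission denied" in text:
        return "trust_broken"
    return "unreachable"


def probe(host: str) -> str:
    """Run the probe; intended to run as the ossadm user on the deployment node."""
    result = subprocess.run(build_probe_command(host),
                            capture_output=True, text=True)
    return classify_probe(result.returncode, result.stdout + result.stderr)
```

A result of trust_broken corresponds to the case in which a password is requested, and password_expired to the case in which the expiry warning is displayed.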

Clearing

When the fault is rectified, the system will automatically clear the alarm. Manual clearing is not required.

Related Information

None

ALM-53080 Resource Usage of the Management Plane Service Process Is Abnormal

Description

This alarm is generated when the CPU usage of the management plane service process reaches or exceeds the alarm threshold (default value: 90%) for 20 consecutive times (the check interval is 30 seconds). This alarm is automatically cleared when the CPU usage of the management plane service process is less than the alarm threshold.

Attribute

Alarm ID    Alarm Severity    Alarm Type
53080       Major             Over limit

Parameters

Name                Meaning
Host                Name of the node for which the alarm is generated
Operating System    Operating system of the server
Process name        Name of the process for which the alarm is generated
Site Name           Name of the site for which the alarm is generated

Impact on the System

The service processing is slow, causing message accumulation or system breakdown.

Possible Causes

The possible causes of this alarm are as follows:
  • Unnecessary processes are not released in a timely manner.
  • Processes occupying many system resources are running.

Procedure

  1. Use a browser to log in to ManageOne Deployment Portal.

    URL: https://Floating IP address of ManageOne Deployment Portal:31945, for example, https://192.168.0.1:31945

    Default account: admin; default password: Huawei12#$

  2. Choose System > Task Manager > Task Information List from the main menu. On the page that is displayed, check whether any task is running.

    • If any task is running, wait for 30 minutes after the task is complete, and then check whether the alarm is cleared.
      • If yes, no further action is required.
      • If no, go to 3.
    • If no task is running, go to 3.

  3. Determine whether this alarm is frequently generated.

    • If this alarm is occasionally generated and can be automatically cleared, ignore it, and no further operation is required.
    • If this alarm is generated more than five times within one hour and cannot be automatically cleared, contact technical support.
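
The frequency criterion in step 3 is a sliding-window count. A minimal sketch follows, assuming that the times at which the alarm was generated are available as timestamps in seconds. The function name is_frequent and the way the timestamps are obtained are hypothetical.

```python
from typing import Iterable


def is_frequent(occurrences: Iterable[float], now: float,
                window_s: float = 3600.0, limit: int = 5) -> bool:
    """Return True if more than `limit` alarms were generated in the last window."""
    recent = [t for t in occurrences if now - window_s < t <= now]
    return len(recent) > limit
```

For example, six occurrences within the last hour return True, whereas five return False. Whether the alarm can be cleared automatically still has to be checked separately.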

ALM-101210 The database local copy status is abnormal

Description

This alarm is generated when the replication status of the database is abnormal. If the replication status of the database service is normal, the alarm is automatically cleared.

Attribute

Alarm ID    Alarm Severity    Alarm Type
101210      Major             Processing error alarm

Parameters

Name                Meaning
Host                Name of the node for which the alarm is generated
Operating System    Operating system of the server
Database service    Name of the database service that generates the alarm
DB type             Type of the database for which the alarm is generated
Site Name           Name of the site for which the alarm is generated

Impact on the System

Abnormal replication between the master and slave databases causes inconsistency between the databases. If the abnormal status persists for a long period, services related to the databases will become unavailable.

In a DR scenario, DR operations may be affected.

Possible Causes

  • The master database instance is not running properly.
  • The slave database instance is not running properly.
  • The nodes where the master and slave database instances reside are abnormal.
  • The communication between the nodes where the master and slave database instances reside is abnormal.

Procedure

  1. Use PuTTY to log in to the deployment node as the sopuser user in SSH mode.

    The default password of the sopuser user is D4I$awOD7k.

  2. Run the following command to switch to the ossadm user:

    su - ossadm

    The default password of the ossadm user is Changeme_123.

  3. Run the following commands to query the database instance replication status:

    > cd /opt/oss/manager/apps/DBAgent/bin/

    > bash dbsvc_adm -cmd query-db-instance

    Information similar to the following is displayed:

    DBInstanceId                             ... IP           Port  ... Role  Rpl Status     ...
    apmdbsvr-10_90_73_163-3@10_90_73_164-3   ... 10.90.73.164 32082 ... Slave Normal         ...
    apmdbsvr-10_90_73_178-21@10_90_73_179-21 ... 10.90.73.179 32080 ... Slave Abnormal (101) ...
    apmdbsvr-10_90_73_178-21@10_90_73_179-21 ... 10.90.73.179 32080 ... Slave Abnormal (103)    ...
    ...
    • If the value of Rpl Status is --, the database instance is a single instance. Go to 5.
    • If the value of Rpl Status is Normal, the database instance replication status is normal. Go to 5.
    • If the value of Rpl Status is Abnormal, the database replication status is abnormal. Record the error code in the brackets next to Abnormal. Go to 4.
    • If the value of Rpl Status is Delay, the slave database is synchronizing data from the master database. Record the error code in the brackets next to Delay. Go to 4.

  4. Based on the error code in 3, check the causes and rectify the fault by referring to the following table.

    Table 4-7 Database replication status error codes

    Error Code: 101
    Description: The node where the database instance resides is stopped or the database instance is stopped.
    Possible Cause:
    • The node where the database instance resides is not started.
    • The database instance is not started.
    • The disk space of the node where the database instance resides is insufficient.
    • The communication between the active and standby nodes is abnormal.
    Troubleshooting Method:
    1. Use a browser to log in to ManageOne Deployment Portal.

      URL: https://Floating IP address of ManageOne Deployment Portal:31945, for example, https://192.168.0.1:31945

      Default account: admin; default password: Huawei12#$

    2. On the main menu of the management plane, choose Application > Service Management > System Monitoring.
    3. On the Nodes tab page, check whether Connection Status and DB Status of all nodes are Normal.
      • If yes, wait for 2 minutes. If the fault persists, contact technical support.
      • If no, select stopped nodes, and click Start on the Nodes tab page to start the nodes.

    Error Code: 102
    Description: The roles for both the master and slave database instances become master.
    Possible Cause: The nodes where the master and slave instances reside are manually set to ignore nodes.
    Troubleshooting Method:
    1. Use PuTTY to log in to the deployment node as the sopuser user in SSH mode.

      The default password of the sopuser user is D4I$awOD7k.

    2. Run the following command to switch to the ossadm user:

      su - ossadm

      The default password of the ossadm user is Changeme_123.

    3. Run the following commands to check whether ignore nodes have been set:

      > cd /opt/oss/manager/apps/DBHASwitchService/bin

      > ./switchtool.sh -cmd get-ignore-nodes

      Information similar to the following is displayed:
      ignore-nodes:2; dbtype:Gauss

      In the preceding information, the value of ignore-nodes is the ID of the ignored node. If the value is None, no ignored node is set. dbtype indicates the type of the ignored database.

    4. If the nodes where the master and slave databases reside have been manually set as ignored nodes, run the following commands to cancel the setting:

      > cd /opt/oss/manager/apps/DBHASwitchService/bin

      > ./switchtool.sh -cmd del-ignore-nodes

    Error Code: 103
    Description: The roles for both the master and slave database instances become slave.

    Error Code: 104
    Description: The roles for both the master and slave database instances are inconsistent with that on the distributed management service.

    Table 4-8 GaussDB 100 V1 database error codes

    Error Code: 301
    Description: Data synchronization to the slave instance is delayed.
    Possible Cause: A large number of write operations in the database in a short period cause replication delay.
    Troubleshooting Method: Wait for 2 minutes. If the problem persists, contact technical support.

    Error Code: 302
    Description: The slave database instance is being started.
    Possible Cause: Intermediate state

    Error Code: 303
    Description: The slave database needs to be manually rebuilt.
    Possible Cause:
    • The node where the database instance resides is not started.
    • The database instance is not started.
    • The network connection between the active and standby nodes is abnormal.
    • The versions of the active and slave instances are inconsistent.
    • The connection between the active and slave database instances is abnormal.

    Error Code: 304
    Description: The active database instance is being set to the slave instance.
    Possible Cause: Intermediate state

    Error Code: 305
    Description: The standby instance is being set to the cascaded slave instance.
    Possible Cause: Intermediate state

    Error Code: 306
    Description: The standby instance is being set to the active instance.
    Possible Cause: Intermediate state

    Error Code: 307
    Description: Unknown error
    Possible Cause: Unknown error

    Error Code: 310
    Description: The slave instance needs to be rebuilt, and will be automatically restored.
    Possible Cause:
    • The data to be synchronized to the slave instance does not exist in the active instance.
    • The database directories of the active and slave database instances are not created by the same database.
    • The data time of the active and slave instances is not consistent.
    Table 4-9 GaussDB 100 V3 database error codes

    Error Code: 401
    Description: Data synchronization to the slave instance is delayed.
    Possible Cause: A large number of write operations in the database in a short period cause replication delay.
    Troubleshooting Method: Wait 2 to 3 minutes. If the fault persists, contact the database administrator to locate the fault.

    Error Code: 402
    Description: The slave database instance is being started.
    Possible Cause: Intermediate state

    Error Code: 403
    Description: The slave instance needs to be rebuilt, and will be automatically restored.
    Possible Cause:
    • The data to be synchronized to the slave instance does not exist in the active instance.
    • The database directories of the active and slave database instances are not created by the same database.
    • The data time of the active and slave instances is not consistent.

    Error Code: 404
    Description: The active database instance is being set to the slave instance.
    Possible Cause: Intermediate state

    Error Code: 405
    Description: The slave instance is being set to the cascaded instance.
    Possible Cause: Intermediate state

    Error Code: 406
    Description: After the manual switchover, the slave instance is being set to the active instance.
    Possible Cause: After the manual switchover, the slave instance is being set to the active instance.

    Error Code: 407
    Description: After the switchover, the slave instance is being set to the active instance.
    Possible Cause: After the switchover, the slave instance is being set to the active instance.

    Error Code: 408
    Description: The slave instance is being rebuilt.
    Possible Cause: Intermediate state

    Error Code: 409
    Description: No action is required.
    Possible Cause: No action is required.

    Error Code: 410
    Description: The slave instance is disconnected from the active instance.
    Possible Cause:
    • The node where the active instance resides is not started.
    • The active instance is not started.
    • The network connection between the active and slave nodes is abnormal.
    Troubleshooting Method:
    1. Use a browser to log in to ManageOne Deployment Portal.

      URL: https://Floating IP address of ManageOne Deployment Portal:31945, for example, https://192.168.0.1:31945

      Default account: admin; default password: Huawei12#$

    2. On the main menu of the management plane, choose Application > Service Management > System Monitoring.
    3. On the Nodes tab page, check whether Connection Status and DB Status of all nodes are Normal.
      • If yes, wait for 2 minutes. If the fault persists, contact technical support.
      • If no, select stopped nodes, and click Start on the Nodes tab page to start the nodes.

    Error Code: 411
    Description: Unknown error
    Possible Cause: Unknown error
    Troubleshooting Method: Wait 2 to 3 minutes. If the fault persists, contact the database administrator to locate the fault.

  5. Wait for 2 minutes and check whether this alarm is cleared.

    • If yes, no further action is required.
    • If no, collect the alarm handling information and contact technical support.
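
The branching in step 3 can be expressed as a small helper that takes one Rpl Status value and returns the next step together with the error code, if any. This is a sketch under stated assumptions: the function and dictionary names are hypothetical, the helper expects only the Rpl Status cell (for example, Abnormal (101)) because the full column layout of the command output is elided in this reference, and the dictionary covers only the codes of Table 4-7.

```python
import re
from typing import Optional, Tuple

# Descriptions copied from Table 4-7; other tables can be added the same way.
REPLICATION_ERRORS = {
    101: "The node where the database instance resides is stopped or the database instance is stopped.",
    102: "The roles for both the master and slave database instances become master.",
    103: "The roles for both the master and slave database instances become slave.",
    104: "The roles for both the master and slave database instances are inconsistent with that on the distributed management service.",
}

_STATUS_RE = re.compile(r"^(--|Normal|Abnormal|Delay)\s*(?:\((\d+)\))?$")


def next_step(rpl_status: str) -> Tuple[int, Optional[int]]:
    """Return (step to go to, error code or None) for one Rpl Status value."""
    match = _STATUS_RE.match(rpl_status.strip())
    if match is None:
        raise ValueError(f"unrecognized Rpl Status: {rpl_status!r}")
    state, code = match.group(1), match.group(2)
    if state in ("--", "Normal"):
        return 5, None  # single instance or normal replication
    # Abnormal or Delay: record the error code and continue with step 4.
    return 4, int(code) if code else None
```

For the sample output in step 3, next_step("Abnormal (101)") returns (4, 101), and the description can then be looked up in REPLICATION_ERRORS.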

Clearing

When the fault is rectified, the system will automatically clear the alarm. Manual clearing is not required.

The alarm cannot be automatically cleared and needs to be manually cleared in the following scenarios:
  • A switchover occurs between the master and standby servers.
  • The name of the server for which this alarm is generated is changed.
  • The OS is upgraded or an OS patch is installed after the alarm is generated.
Updated: 2019-08-30

Document ID: EDOC1100062365
