HUAWEI CLOUD Stack 6.5.0 Alarm and Event Reference 04

ALM-73401 Faulty RabbitMQ Service


Description

This alarm is generated when an exception occurs in the RabbitMQ service. The RabbitMQ process fails to provide services although it is still running.

Attribute

Alarm ID: 73401

Alarm Severity: Critical

Auto Clear: Yes

Parameters

Fault Location Info

  • instance_name: name of the instance that hosts the service for which the alarm is generated.

Additional Info

  • hostname: name of the host for which the alarm is generated.
  • host_id: ID of the host for which the alarm is generated.
  • Details: detailed information about the alarm.

Impact on the System

Some services become unavailable.

Possible Causes

  • Active and standby RabbitMQ nodes are restarted multiple times within a short period of time.
  • Frequent network reconnections cause a queue to fail to send messages.

Procedure

  1. Check whether the configuration of the memory watermark is correct by performing operations provided in "Adjusting RabbitMQ Memory Watermark" in HUAWEI CLOUD Stack 6.5.0 Capacity Expansion Guide.

    • If yes, go to 8.
    • If no, adjust the memory watermark and go to 2.

  2. After 10 minutes, check whether the alarm is cleared.

    • If yes, no further action is required.
    • If no, go to 3.

  3. Check whether the additional alarm information contains "Failed to publish message to topic XXX." XXX indicates the name of the RabbitMQ message queue.

    • If yes, record the alarm location information, for example, happen on (['xxx.xxx.xxx.xxx']), and go to 4.

      xxx.xxx.xxx.xxx indicates the IP address of the RabbitMQ node.

    • If no, go to 8.
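The check above can be scripted when the alarm details are available as text. The following is a minimal sketch, assuming the details follow the "happen on (['xxx.xxx.xxx.xxx'])" pattern shown in the example. The function names has_publish_failure and extract_node_ips are hypothetical, and grep -o is assumed to be available (GNU or BSD grep).

```shell
#!/bin/sh
# Hypothetical helpers: the input format is inferred from the
# "happen on (['xxx.xxx.xxx.xxx'])" example, not from a real alarm record.

# Succeeds if the alarm details report a publish failure.
has_publish_failure() {
    printf '%s\n' "$1" | grep -q 'Failed to publish message to topic'
}

# Prints each IPv4-looking address found inside "happen on ([...])",
# one per line. Prints nothing if the pattern is absent.
extract_node_ips() {
    printf '%s\n' "$1" \
        | sed -n 's/.*happen on (\[\([^]]*\)]).*/\1/p' \
        | grep -oE '[0-9]{1,3}(\.[0-9]{1,3}){3}'
}
```

Pass the text of the additional alarm information as the single argument. If has_publish_failure succeeds, the addresses printed by extract_node_ips are the candidates to record for the login steps that follow.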

  4. Use PuTTY to log in to the first FusionSphere OpenStack node through the IP address of the External OM plane.

    The default user name is fsp. The default password is Huawei@CLOUD8.

    The system supports both password authentication and public-private key pair authentication. If you log in with a key pair, see Using PuTTY to Log In to a Node in Key Pair Authentication Mode.

    NOTE:
    To obtain the IP address of the External OM plane, search for the required parameter on the Tool-generated IP Parameters sheet of the xxx_export_all.xlsm file exported from HUAWEI CLOUD Stack Deploy during software installation. The parameter names in different scenarios are as follows:
    • Region Type I scenario:

      Cascading system: Cascading-ExternalOM-Reverse-Proxy

      Cascaded system: Cascaded-ExternalOM-Reverse-Proxy

    • Region Type II and Region Type III scenarios: ExternalOM-Reverse-Proxy

  5. Run the following command to switch to the node where RabbitMQ is located:

    ssh fsp@xxx.xxx.xxx.xxx

    xxx.xxx.xxx.xxx indicates the IP address of the RabbitMQ node in 3.

  6. Run the following command and enter the password of user root to switch to user root:

    su - root

    The default password of user root is Huawei@CLOUD8!.

  7. Query the number of suspicious processes.

    /usr/local/lib/rabbitmq/sbin/rabbitmqctl eval 'rabbit_diagnostics:maybe_stuck().' | grep XXXXXX

    XXXXXX indicates the name of the RabbitMQ message queue in 3.

    Check the command output.

    Information similar to the following is displayed:

          <<"q-agent-notifier-update_fanout_af5c781c060d4c2dbcd16abbf38857c4">>}}},

    • If XXXXXX is displayed in the command output, go to 15.
    • If no result is displayed, go to 8.
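When scripting the check above, it may help to match the queue name as a literal string, because queue names can contain characters that grep would otherwise interpret as a regular expression. A minimal sketch, assuming the diagnostic output is piped in on standard input (the function name queue_reported_stuck is hypothetical):

```shell
#!/bin/sh
# Hypothetical helper: reads diagnostic output on stdin and succeeds
# if the given queue name appears anywhere in it.
# -F treats the name as a fixed string, -q suppresses output,
# and -- protects against names that start with a dash.
queue_reported_stuck() {
    grep -qF -- "$1"
}
```

The exit status is zero when the queue name is found and nonzero when it is not, which maps onto the two branches of this step.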

  8. Use PuTTY to log in to the first FusionSphere OpenStack node through the IP address of the External OM plane.

    The default user name is fsp. The default password is Huawei@CLOUD8.

    The system supports both password authentication and public-private key pair authentication. If you log in with a key pair, see Using PuTTY to Log In to a Node in Key Pair Authentication Mode.

    NOTE:
    To obtain the IP address of the External OM plane, search for the required parameter on the Tool-generated IP Parameters sheet of the xxx_export_all.xlsm file exported from HUAWEI CLOUD Stack Deploy during software installation. The parameter names in different scenarios are as follows:
    • Region Type I scenario:

      Cascading system: Cascading-ExternalOM-Reverse-Proxy

      Cascaded system: Cascaded-ExternalOM-Reverse-Proxy

    • Region Type II and Region Type III scenarios: ExternalOM-Reverse-Proxy

  9. Run the following command and enter the password of user root to switch to user root:

    su - root

    The default password of user root is Huawei@CLOUD8!.

  10. Run the following command to disable user logout upon system timeout:

    TMOUT=0

  11. Import environment variables. For details, see Importing Environment Variables.
  12. Obtain component information.

    • If the alarm is displayed on the alarm console, obtain the name of the host for which the alarm is generated in the alarm additional information, for example, host name=XXX.
    • If the alarm is displayed on the FusionSphere OpenStack web client, choose O&M > System Check, check the status of the RabbitMQ service and obtain the name of the faulty host in the check result. For example, if 'location':{'XXX'} is displayed in the check result, the host name is XXX.

      Then run the following command:

      cps host-template-instance-list XXX | grep rabbitmq

      AAA.BBB in the command output specifies the component information, for example, rabbitmq.rabbitmq.

  13. Query the number of suspicious processes.

    Repeat 8 to 10 to log in to the host queried in 12.

    Run the /usr/local/lib/rabbitmq/sbin/rabbitmqctl eval 'rabbit_diagnostics:maybe_stuck().' command and check the returned result.

    Information similar to the following is displayed:

    There are 1122 processes.
    Investigated 0 processes this round, 5000ms to go.
    ...
    Investigated 0 processes this round, 500ms to go.
    Found 0 suspicious processes.
    ok

    • If information similar to "Found XXX suspicious processes." is displayed in the command output, repeat the command three times and check whether the information is displayed each time.
      • If yes, go to 14.
      • If no, go to 15.
    • If the command does not finish within 1 minute, or information similar to "Found XXX suspicious processes." is not displayed, go to 15.
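The repeated check in this step can be sketched as a small script. This is an illustration under stated assumptions rather than a supported tool: it assumes the timeout utility from GNU coreutils is present on the node, and the function names suspicious_count and check_three_times are hypothetical.

```shell
#!/bin/sh
# Hypothetical helpers for the repeated diagnostic check.
# Assumes GNU coreutils "timeout" is installed on the node.

# Reads command output on stdin and prints N from the first
# "Found N suspicious processes." line. Prints nothing if no such line exists.
suspicious_count() {
    sed -n 's/^[[:space:]]*Found \([0-9][0-9]*\) suspicious processes\..*/\1/p' \
        | head -n 1
}

# Runs the given command three times, each run limited to 60 seconds.
# Returns 0 only if every run finished in time and printed the "Found ..." line.
# Note: POSIX sh has no local variables, so i, out, and count are global.
check_three_times() {
    i=0
    while [ "$i" -lt 3 ]; do
        out=$(timeout 60 "$@" 2>/dev/null) || return 1
        count=$(printf '%s\n' "$out" | suspicious_count)
        [ -n "$count" ] || return 1
        echo "run $((i + 1)): $count suspicious processes"
        i=$((i + 1))
    done
    return 0
}
```

On the RabbitMQ host, it could be invoked as check_three_times /usr/local/lib/rabbitmq/sbin/rabbitmqctl eval 'rabbit_diagnostics:maybe_stuck().'. A zero exit status corresponds to the branch that waits and rechecks the alarm. A nonzero status, including a run that exceeds 1 minute, corresponds to the branch that contacts technical support.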

  14. After 10 minutes, check whether the alarm is cleared.

    • If yes, no further action is required.
    • If no, go to 15.

  15. Contact technical support for assistance.

Related Information

None

Updated: 2019-08-30

Document ID: EDOC1100062365
