HUAWEI CLOUD Stack 6.5.0 Alarm and Event Reference 04

ALM-6017 Faulty Host

Description

FusionSphere OpenStack periodically (default interval: 60s) checks the statuses of all hosts. This alarm is generated when a host is in an abnormal state.

Attribute

| Alarm ID | Alarm Severity | Auto Clear |
| 6017     | Major          | Yes        |

Parameters

| Name                | Meaning                                                                              |
| Fault Location Info | host_id: specifies the ID of the host for which the alarm is generated.              |
| Additional Info     | host_id: specifies the ID of the host for which the alarm is generated.              |
|                     | hostname: specifies the name of the host for which the alarm is generated.           |
|                     | Local_address: specifies the address of the node that reports the alarm.             |
|                     | Peer_address: specifies the IP address of the node for which the alarm is generated. |
|                     | Fault_reason: specifies the cause of the fault.                                      |

Impact on the System

The host cannot provide services.

Possible Causes

  • A piece of hardware of the host is faulty.
  • A power failure occurs on the host hardware, causing a write I/O operation loss or data written to a wrong sector.
  • A disk on the host is damaged, or the RAID controller card does not have a battery.
  • An incorrect host is PXE-booted due to network plan errors.

Procedure

  1. Use PuTTY to log in to the first FusionSphere OpenStack node through the IP address of the External OM plane.

    The default user name is fsp. The default password is Huawei@CLOUD8.

    The system supports both password authentication and public-private key pair authentication. If a public-private key pair is used for login, see Using PuTTY to Log In to a Node in Key Pair Authentication Mode.

    NOTE:
    To obtain the IP address of the External OM plane, search for the required parameter on the Tool-generated IP Parameters sheet of the xxx_export_all.xlsm file exported from HUAWEI CLOUD Stack Deploy during software installation. The parameter names in different scenarios are as follows:
    • Region Type I scenario:

      Cascading system: Cascading-ExternalOM-Reverse-Proxy

      Cascaded system: Cascaded-ExternalOM-Reverse-Proxy

    • Region Type II and Region Type III scenarios: ExternalOM-Reverse-Proxy

  2. Run the following command and enter the password of user root to switch to user root:

    su - root

    The default password of user root is Huawei@CLOUD8!.

  3. Run the following command to disable user logout upon system timeout:

    TMOUT=0

  4. Run the following command to import environment variables:

    source set_env

    Information similar to the following is displayed:

      please choose environment variable which you want to import: 
      (1) openstack environment variable (keystone v3) 
      (2) cps environment variable 
      (3) openstack environment variable legacy (keystone v2) 
      (4) openstack environment variable of cloud_admin (keystone v3) 
      please choose:[1|2|3|4] 

  5. Enter 1 to enable Keystone V3 authentication and enter the password of OS_USERNAME as prompted.

    Default account format: DCname_admin; default password: FusionSphere123.

  6. Query information about the host for which the alarm is generated.

    cps host-show Host ID

    The host ID can be obtained from Alarm Object.

    • If the host information is displayed, go to 7.
    • If the host information is not displayed, go to 20.

  7. In the obtained host information, query the ipmiip value in metadata.

    An example of the metadata is as follows:

    | metadata           | physical_cpu_num:XX                                                     | 
    |                    | ipmiip:XX.XX.XX.XX                                                      | 
    |                    | Product Name:XXX                                                        |
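
    The ipmiip lookup in this step can be scripted. The following is a minimal sketch, assuming the cps host-show output contains a table cell of the form ipmiip:<value> as in the example above. The helper name extract_ipmiip and the HOST_ID placeholder are illustrative and not part of the product.

```shell
# extract_ipmiip (hypothetical helper): reads "cps host-show" style table
# output on stdin and prints the value that follows "ipmiip:".
extract_ipmiip() {
    # Capture everything after "ipmiip:" up to the next space or pipe,
    # and print only the first match.
    sed -n 's/.*ipmiip:\([^ |]*\).*/\1/p' | head -n 1
}

# Usage on a node where the cps CLI is available (HOST_ID is a placeholder):
#   cps host-show "$HOST_ID" | extract_ipmiip
```

    The printed value can then be compared with the BMC IP address shown on the Service OM web client.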

  8. Check whether the host ipmiip value is the same as the BMC IP address of the bare metal server.

    Log in to the Service OM web client, choose Services > Computing > Bare Metal Servers, and check whether the BMC IP address is the same as the host ipmiip value.
    • If yes, log in to the FusionSphere OpenStack web client, choose O&M > Capacity Expansion or switch to the Summary page, select the host, and click Delete to delete the host. Then, manually clear the alarm.
    • If no, go to 9.

  9. Run the following command to check whether a partition attachment failure occurs on the faulty host:

    cps-env-check --mount-status

    • If information similar to the following is displayed, the partition is successfully attached. In this case, go to 12.
      linux-XJgKSf:~ # cps-env-check --mount-status
      =========== [BEGIN]check device mounting status ========================
      
      Checking /dev/cpsVG/backup mount on /opt/backup success
      Checking /dev/cpsVG/repo mount on /etc/huawei/fusionsphere/repo success
      Checking /dev/cpsVG/zookeeper mount on /opt/fusionplatform/data/zookeeper success
      Checking /dev/cpsVG/upgrade mount on /opt/fusionplatform/data/upgrade success
      Checking /dev/cpsVG/ceilometer-data mount on /var/ceilometer success
      Checking /dev/cpsVG/image-cache mount on /opt/HUAWEI/image_cache success
      Checking /dev/cpsVG/swift mount on /opt/HUAWEI/swift success
      Checking /dev/cpsVG/rabbitmq mount on /opt/fusionplatform/data/rabbitmq success
      Checking /dev/cpsVG/database mount on /opt/fusionplatform/data/gaussdb_data success
      Checking /dev/cpsVG/image mount on /opt/HUAWEI/image success
      Checking /dev/mapper/cpsVG-bak_rootfs mount on /opt/HUAWEI/bak_rootfs success
      
      =========== [END]check device mounting status ==========================
    • If information similar to the following is displayed, a partition attachment failure occurs. Record the partition name and attachment directory and go to 10.
      linux-XJgKSf:~ # cps-env-check --mount-status
      =========== [BEGIN]check device mounting status ========================
      
      Checking /dev/cpsVG/backup mount on /opt/backup success
      Checking /dev/cpsVG/repo mount on /etc/huawei/fusionsphere/repo success
      Checking /dev/cpsVG/zookeeper mount on /opt/fusionplatform/data/zookeeper success
      Checking /dev/cpsVG/upgrade mount on /opt/fusionplatform/data/upgrade success
      Checking /dev/cpsVG/ceilometer-data mount on /var/ceilometer success
      Checking /dev/cpsVG/image-cache mount on /opt/HUAWEI/image_cache success
      Checking /dev/cpsVG/swift mount on /opt/HUAWEI/swift success
      Checking /dev/cpsVG/rabbitmq mount on /opt/fusionplatform/data/rabbitmq success
      Checking /dev/cpsVG/database mount on /opt/fusionplatform/data/gaussdb_data success
      Checking /dev/cpsVG/image mount on /opt/HUAWEI/image success
      Checking /dev/mapper/cpsVG-bak_rootfs mount on /opt/HUAWEI/bak_rootfs failed
      
      =========== [END]check device mounting status ==========================
      
    • If the following information is displayed, no partition mounting information is returned. In this case, contact technical support for assistance.
      linux-XJgKSf:~ # cps-env-check --mount-status
      =========== [BEGIN]check device mounting status ========================
      
      
      =========== [END]check device mounting status ==========================
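
    The failed entries in this step can be picked out of the output automatically. The following is a minimal sketch, assuming each result line has the form "Checking <partition> mount on <directory> <result>" as in the examples above. The helper name list_failed_mounts is illustrative.

```shell
# list_failed_mounts (hypothetical helper): reads mount-status output on stdin
# and prints "<partition> <directory>" for every check that ended in "failed".
list_failed_mounts() {
    awk '$1 == "Checking" && $3 == "mount" && $4 == "on" && $NF == "failed" { print $2, $5 }'
}

# Usage on the faulty host:
#   cps-env-check --mount-status | list_failed_mounts
```

    Empty output from this helper means only that no line ended in failed. It does not distinguish a fully successful check from the case where no partition mounting information is returned, so the raw output still needs to be inspected.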
      

  10. The partition cannot be attached to the system if the system is running. Run the following command to manually attach the partition and check the result:

    mount xxx /mnt

    xxx indicates the partition name. /mnt indicates the attachment directory. If information similar to the following is displayed, the partition file system is damaged. In this case, go to 11 to restore the file system.

  11. Perform the following operations to restore the file system:

    1. Run the following command to query the file system type:

      df -T

      linux-XJgKSf:~ # df -T
      Filesystem                         Type     1K-blocks    Used Available Use% Mounted on
      /dev/mapper/cpsVG-rootfs           ext4       8191416 4159476   3596128  54% /
      devtmpfs                           devtmpfs  65828552       0  65828552   0% /dev
      tmpfs                              tmpfs     65837592      92  65837500   1% /dev/shm
      tmpfs                              tmpfs     65837592   83800  65753792   1% /run
      tmpfs                              tmpfs     65837592       0  65837592   0% /sys/fs/cgroup
      /dev/mapper/cpsVG-data             ext4        499656    6012    456948   2% /opt/fusionplatform/data
      /dev/mapper/cpsVG-fsp              ext4        999320   16516    913992   2% /home/fsp
      /dev/sda1                          ext4        480660   73043    378280  17% /boot
      /dev/mapper/cpsVG-log              ext4      20511312  279076  19167276   2% /var/log

      If the preceding information is displayed, the ext4 file system type is used for the partition.

    2. Run the following command to restore the file system:

      fsck.ext4 xxx

      xxx indicates the partition that cannot be attached. If the partition uses a file system type other than ext4, run the corresponding fsck command instead. For example, run the fsck.ext3 command to restore an ext3 file system.

      • If the restoration is successful, go to 12.
      • If the restoration fails, contact technical support for assistance.
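
    The mapping from file system type to fsck command in this step can be sketched as follows. This is an illustration only: the helper name fsck_command_for is hypothetical, and it reads the Type column of df -T for a device that is currently mounted. A partition that failed to attach does not appear in df -T, so the sketch assumes, as the step above does, that it uses the same type as the mounted partitions on the host.

```shell
# fsck_command_for (hypothetical helper): reads "df -T" output on stdin and
# prints the fsck.<type> command name for the device given as $1.
fsck_command_for() {
    # Column 1 is the device, column 2 is the file system type.
    awk -v dev="$1" '$1 == dev { print "fsck." $2; exit }'
}

# Usage (the device name is an example taken from the df -T output above):
#   df -T | fsck_command_for /dev/mapper/cpsVG-rootfs
```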

  12. Check whether the host network is faulty.

    1. Run the following command to obtain the value of manageip based on the ID of the host for which the alarm is generated:

      cps host-list

    2. Run the ping command on the first node that you logged in to and check whether the node can communicate with the IP address defined in manageip.

      ping manageip

      Information similar to the following is displayed:

      65BA1CEB-573F-574C-B9F0-EE7F9AFC6ECE:~ # ping 172.28.0.8
      PING 172.28.0.8 (172.28.0.8) 56(84) bytes of data.
      From 172.28.0.2 icmp_seq=1 Destination Host Unreachable
      From 172.28.0.2 icmp_seq=2 Destination Host Unreachable
      From 172.28.0.2 icmp_seq=3 Destination Host Unreachable
      From 172.28.0.2 icmp_seq=4 Destination Host Unreachable
      From 172.28.0.2 icmp_seq=5 Destination Host Unreachable
      From 172.28.0.2 icmp_seq=6 Destination Host Unreachable
      From 172.28.0.2 icmp_seq=7 Destination Host Unreachable
      From 172.28.0.2 icmp_seq=8 Destination Host Unreachable
      ^C
      --- 172.28.0.8 ping statistics ---
      9 packets transmitted, 0 received, +8 errors, 100% packet loss, time 8002ms

      If the preceding information is displayed, the IP address defined in manageip cannot be pinged. In this case, restore the network connection of the host.

      Then, check whether the alarm is cleared.
      • If yes, no further action is required.
      • If no, go to 13.
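
    The connectivity check in this step can be reduced to reading the packet loss figure from the ping summary line. The following is a minimal sketch, assuming the summary line has the "N% packet loss" form shown above. The helper name packet_loss_percent and the MANAGE_IP placeholder are illustrative.

```shell
# packet_loss_percent (hypothetical helper): reads ping output on stdin and
# prints the packet-loss percentage taken from the summary line.
packet_loss_percent() {
    sed -n 's/.* \([0-9][0-9.]*\)% packet loss.*/\1/p'
}

# Usage (MANAGE_IP is a placeholder for the manageip value from cps host-list);
# -c 5 limits the run to five echo requests so that ping ends on its own:
#   ping -c 5 "$MANAGE_IP" | packet_loss_percent
```

    A value of 100 corresponds to the unreachable case shown above. A value of 0 suggests that the management network is reachable and the fault lies elsewhere.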

  13. Log in to the host BMC system and check whether the host is in maintenance mode.

    Figure 3-1 shows that a host is in maintenance mode.

    Figure 3-1 Maintenance mode

    • If yes, go to 14.
    • If no, go to 15.

  14. Enter the password of user root in the BMC system and use fsck commands to manually restore the last partition that has an error message reported (for example, /dev/mapper/cpsVG-log in Figure 3-1).

    The following describes how to use fsck commands to restore a partition (/dev/sda5 is used as an example):

    1. Detach the partition.

      For example, run the following command:

      umount /dev/sda5

    2. Restore the partition.

      For example, run the following command:

      fsck /dev/sda5

      If information similar to the following is displayed, run the preceding command again to check whether the partition is successfully restored:

      # fsck /dev/sda5
      fsck from util-linux 2.19.1
      e2fsck 1.41.9 (22-Aug-2009)
      /dev/sda5: recovering journal
      /dev/sda5 contains a file system with errors, check forced.
      Pass 1: Checking inodes, blocks, and sizes
      Pass 2: Checking directory structure
      Directory inode 8193, block 0, offset 24: directory corrupted
      Salvage<y>? yes
      Pass 3: Checking directory connectivity
      Pass 4: Checking reference counts
      Pass 5: Checking group summary information
      /dev/sda5: ***** FILE SYSTEM WAS MODIFIED *****
      /dev/sda5: 13/131072 files (0.0% non-contiguous), 17206/524112 blocks
      # fsck /dev/sda5
      fsck from util-linux 2.19.1
      e2fsck 1.41.9 (22-Aug-2009)
      /dev/sda5: clean, 13/131072 files, 17206/524112 blocks
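
    Besides reading the output, the result of each fsck run can be judged from its exit status, which is a bitmask documented in the fsck man page (0: no errors, 1: errors corrected, 2: system should be rebooted, 4: errors left uncorrected, 8: operational error). The following is a minimal sketch that decodes this value. The helper name describe_fsck_status and its output labels are illustrative.

```shell
# describe_fsck_status (hypothetical helper): decodes the fsck exit status
# bitmask passed as $1 into space-separated labels.
describe_fsck_status() {
    status=$1
    if [ "$status" -eq 0 ]; then
        echo "no-errors"
        return 0
    fi
    msg=""
    if [ $((status & 1)) -ne 0 ]; then msg="$msg errors-corrected"; fi
    if [ $((status & 2)) -ne 0 ]; then msg="$msg reboot-recommended"; fi
    if [ $((status & 4)) -ne 0 ]; then msg="$msg errors-left-uncorrected"; fi
    if [ $((status & 8)) -ne 0 ]; then msg="$msg operational-error"; fi
    # Bits 16, 32, 64 and 128 cover usage errors, cancellation and library errors.
    if [ $((status & 240)) -ne 0 ]; then msg="$msg other"; fi
    echo "${msg# }"
}

# Usage (the device name is the example used in this step):
#   fsck /dev/sda5; describe_fsck_status $?
```

    In the example output above, the first run would be expected to report corrected errors and the second run a clean file system.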

  15. Restart the faulty host.

    Check whether the host is properly started.

    • If yes, go to 18.
    • If no, go to 16.

  16. Check whether the host hardware is faulty and replace any faulty hardware. For details, see "Replacing Hosts and Accessories" in HUAWEI CLOUD Stack 6.5.0 Parts Replacement.

    • Check whether any disk on the host is faulty. If any disk is faulty, replace it.
    • If any other hardware is faulty, replace the hardware or the whole server.

      After the host hardware fault is rectified, go to 17.

  17. Restart the faulty host.

    Check whether the host is properly started.

    • If yes, go to 18.
    • If no, go to 20.

  18. Run the following command to check whether the host hard disk has bad sectors:

    badblocks -b block_size(Byte) -s disk

    block_size(Byte) indicates the disk block size in bytes, and disk indicates the disk device name. To query the device name of a hard disk, run the fdisk -l command.

    For example, run the following command:

    badblocks -b 4096 -s /dev/sda

    If the hard disk is faulty, replace the hard disk by referring to "Replacing Hosts and Accessories" in HUAWEI CLOUD Stack 6.5.0 Parts Replacement.
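
    To decide whether the disk is faulty, the bad blocks reported by the scan can be counted. The following is a minimal sketch, assuming badblocks prints one bad block number per line on standard output when no output file is specified, as described in its man page. The helper name count_bad_blocks is illustrative.

```shell
# count_bad_blocks (hypothetical helper): reads badblocks output on stdin and
# prints how many lines consist solely of a block number.
count_bad_blocks() {
    awk '/^[0-9]+$/ { n++ } END { print n + 0 }'
}

# Usage (read-only scan of the example disk; progress messages go to stderr):
#   badblocks -b 4096 -s /dev/sda | count_bad_blocks
```

    A count greater than 0 means that the scan found unreadable blocks.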

  19. One minute after the host restarts, check whether the alarm is cleared.

    • If yes, no further action is required.
    • If no, go to 20.

  20. Reinstall an OS and rebuild VMs for the faulty host. If the fault cannot be rectified, contact technical support for assistance.

Related Information

None

Updated: 2019-08-30

Document ID: EDOC1100062365
