FusionCloud 6.3.1.1 Troubleshooting Guide 02

Node Faults

Host OS Failure

Symptom

After a host OS fails, the OS must be reinstalled and all VMs on the host must be rebuilt. Either of the following conditions indicates a host failure:

  • If two or more hosts fail, host login-related services become unavailable.
  • The cps host-show host_id command shows that the host is in the fault state.
Possible Causes

One or more host disks are faulty.

Procedure

Operations in cascading scenarios

  1. On the FusionSphere OpenStack web client, choose O&M > Capacity Expansion and set PXE Boot Hosts to ON.
  2. If a host at the cascading layer fails, perform the following operations:

    1. Use PuTTY to log in to any host in the cascading system as user fsp through the IP address of the External OM plane.
      The username is fsp and the default password is Huawei@CLOUD8.
      NOTE:
      • The system supports login authentication using a password or a private-public key pair. If a private-public key pair is used for login authentication, see Using PuTTY to Log In to a Node in Key Pair Authentication Mode.
      • For details about the IP address of the External OM plane, see the LLD generated by FCD sheet of the xxx_export_all.xlsm file exported from FusionCloud Deploy during software installation, and search for the IP addresses corresponding to the VMs and nodes. The parameter names in different scenarios are as follows:
        • Cascading layer in the Region Type I scenario: Cascading-ExternalOM-Reverse-Proxy; cascaded layer: Cascaded-ExternalOM-Reverse-Proxy.
        • Region Type II and Type III scenarios: ExternalOM-Reverse-Proxy.
    2. Run the following command to switch to user root, and enter the password of user root as prompted:

      su - root

      The default password of user root is Huawei@CLOUD8!.

    3. Run the TMOUT=0 command to disable user logout upon system timeout.
    4. Import environment variables. For details, see Importing Environment Variables.

  3. If a network node (a vrouter or nat-server node at the cascading layer) fails, stop the CPU core binding of the network node.

    1. Use PuTTY to log in to the cascading or cascaded FusionSphere OpenStack system that contains the network node. Switch to the root user and import environment variables.
    2. Run the following commands to stop the CPU core binding of the network node.
      • For a vrouter node, run the following commands:

        cps template-ext-params-update --parameter vrouter01.DATA_PLANE.enable_auto_irq=False --service neutron neutron-vrouter01

        cps commit

      • For a nat-server node, run the following commands:

        cps template-ext-params-update --parameter neutron_l3_nat_agent01.DEFAULT.enable_auto_irq=False --service neutron neutron-l3-nat-agent01

        cps commit

  4. Manually start the faulty host, set its boot device to network during the starting process, and reinstall the host OS.

    To set the boot device to network, you can use the remote control function of the server BMC system or use a KVM to connect to the server.

    The installation takes about 10 to 15 minutes.

    NOTE:

    The installation will fail (the OS is installed but services are unavailable) if the disk drive letters change after the installation. Such changes may be caused by the following operations and can be handled according to How Do I Handle Drive Letter Changes?:

    • Adjusting the host RAID arrays, including reconstructing, recreating, and deleting RAID arrays.
    • Adjusting the FusionSphere OpenStack system disk or disk partitions, including expanding the disk capacity and adding partitions.
    • Attaching the SSD card used by the MongoDB service to the host.

    If the reinstallation fails and none of the preceding operations has been performed, contact technical support for assistance.

    After the host OS is reinstalled, the passwords of OS accounts, including users root and fsp, will be reset to the default ones.

  5. Use PuTTY and the External OM plane IP address to log in to the host to be restored.

    The username is fsp and the default password is Huawei@CLOUD8.

    NOTE:
    • The system supports login authentication using a password or a private-public key pair. If a private-public key pair is used for login authentication, see Using PuTTY to Log In to a Node in Key Pair Authentication Mode.
    • For details about the IP address of the External OM plane, see the LLD generated by FCD sheet of the xxx_export_all.xlsm file exported from FusionCloud Deploy during software installation, and search for the IP addresses corresponding to the VMs and nodes. The parameter names in different scenarios are as follows:
      • Cascading layer in the Region Type I scenario: Cascading-ExternalOM-Reverse-Proxy; cascaded layer: Cascaded-ExternalOM-Reverse-Proxy.
      • Region Type II and Type III scenarios: ExternalOM-Reverse-Proxy.

  6. Import environment variables. For details, see Importing Environment Variables.
  7. Run the following command to check whether the MongoDB component on the node being restored is in the fault state:

    cps template-instance-list --service mongodb mongodb

    • If yes, go to 8.
    • If no, go to 11.

  8. The controller nodes are interconnected with FusionStorage, and MongoDB is deployed on remote storage. Until the storage is recovered, MongoDB cannot be recovered and the progress cannot reach 100%. Therefore, recover the storage first. If the faulty KVM host has been connected to FusionStorage Block, recover the storage by following the instructions in "Server OS Faults" in the FusionStorage V100R006C20 Product Documentation.
  9. On the FusionSphere OpenStack web client, choose Configuration > Disk. On the displayed page, select the faulty host and check whether error message "Configuration failed. Click the host state to view details." is displayed.

    • If yes, click the icon on the right of Expand Storage Capacity and go to 10.
    • If no, go to 12.

  10. Wait for 5 minutes and run the following command to check whether the MongoDB component is restored:

    cps template-instance-list --service mongodb mongodb

    • If yes, go to 12.
    • If no, contact technical support for assistance.

  11. View the installation progress on the Summary page on the FusionSphere OpenStack web client and check whether the installation is complete.

    The installation is complete if the progress reaches 100%.

    NOTE:

    If the reinstalled host is the first host in the FusionSphere OpenStack system, choose Management > Capacity Expansion to query the host installation status, because the host status displayed on the Summary page is always faulty. After the host reinstallation progress reaches 70%, the actual installation progress will be updated on the Summary page.

    • If it is, go to 12.
    • If it is not, contact technical support for assistance.

  12. For a network node (a vrouter or nat-server node at the cascading layer), restore the CPU core binding of the network node after the installation is complete.

    1. Use PuTTY to log in to the cascading or cascaded FusionSphere OpenStack system that contains the network node. Switch to the root user and import environment variables.
    2. Run the following commands to restore the CPU core binding of the network node.
      • For a vrouter node, run the following commands:

        cps template-ext-params-update --parameter vrouter01.DATA_PLANE.enable_auto_irq=True --service neutron neutron-vrouter01

        cps commit

      • For a nat-server node, run the following commands:

        cps template-ext-params-update --parameter neutron_l3_nat_agent01.DEFAULT.enable_auto_irq=True --service neutron neutron-l3-nat-agent01

        cps commit

  13. Select the new host and click Reboot.

    NOTE:

    The host automatically synchronizes system configuration after it is successfully installed. Some advanced functions, such as resource isolation, will take effect only after a host restarts.

    To ensure that all configurations can take effect, restart hosts after the reinstallation.

  14. Use PuTTY to log in to the controller host in the AZ through the External OM plane IP address.

    The username is fsp and the default password is Huawei@CLOUD8.
    NOTE:
    • The system supports login authentication using a password or a private-public key pair. If a private-public key pair is used for login authentication, see Using PuTTY to Log In to a Node in Key Pair Authentication Mode.
    • For details about the IP address of the External OM plane, see the LLD generated by FCD sheet of the xxx_export_all.xlsm file exported from FusionCloud Deploy during software installation, and search for the IP addresses corresponding to the VMs and nodes. The parameter names in different scenarios are as follows:
      • Cascading layer in the Region Type I scenario: Cascading-ExternalOM-Reverse-Proxy; cascaded layer: Cascaded-ExternalOM-Reverse-Proxy.
      • Region Type II and Type III scenarios: ExternalOM-Reverse-Proxy.

  15. Run the following command to switch to user root, and enter the password of user root as prompted:

    su - root

    The default password of user root is Huawei@CLOUD8!.

  16. Run the TMOUT=0 command to disable user logout upon system timeout.
  17. Import environment variables. For details, see Importing Environment Variables.
  18. Perform the following operations to query information about VMs on the host:

    1. Run the following command to enter the secure operation mode:

      runsafe

      Information similar to the following is displayed:

      Input command:
    2. Run the following command to query information about VMs on the host:

      nova list --all-t --host <host_id>

      • If the host accommodates VMs (including the service VM), take a note of the service VM ID and go to 19.
      • If the host accommodates only management VMs, such as VRM VMs, FSM VM, and Service OM VMs, go to 27.
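
      For reference, the command output uses the standard nova list table layout; the IDs, names, and addresses below are illustrative only and are not taken from this procedure:

      +--------------------------------------+---------------+--------+------------+-------------+------------------------+
      | ID                                   | Name          | Status | Task State | Power State | Networks               |
      +--------------------------------------+---------------+--------+------------+-------------+------------------------+
      | 1f2e3d4c-0000-4000-8000-000000000001 | service-vm-01 | ACTIVE | -          | Running     | external_api=10.0.0.11 |
      | 2a3b4c5d-0000-4000-8000-000000000002 | VRM01         | ACTIVE | -          | Running     | external_om=10.0.1.12  |
      +--------------------------------------+---------------+--------+------------+-------------+------------------------+

      In this illustrative output, service-vm-01 would be a service VM whose ID needs to be noted, and VRM01 would be a management VM.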

  19. Perform the following operations to rebuild service VMs:

    1. Rebuild the service VMs by following the instructions in Rebuilding VMs on Other Hosts.
    2. Check whether any service VMs have not been rebuilt.
      • If yes, select a service VM and go to 19.a.
      • If no, go to 27.

  20. If the node to be restored is a controller node that accommodates cloud service VMs, you also need to restore the VMs that are not cloud service VMs (FusionCompute VRM, Service OM, and FSM VMs). Perform 21 to 26.
  21. Use PuTTY and the External OM plane IP address to log in to the host to be restored.

    The username is fsp and the default password is Huawei@CLOUD8.

  22. Import environment variables. For details, see Importing Environment Variables.
  23. Run the following command to obtain the UUID of the host:

    cat /etc/uuid

  24. Run the following command to check whether VMs on the host are in the ERROR state, in which uuid indicates the host UUID obtained in 23:

    nova list --host uuid

    • If yes, go to 25.
    • If no, go to 26.

  25. Run the following commands to stop and start all VMs except local disk management VMs (FusionCompute VRM, Service OM, and FSM VMs), in which uuid indicates the VM ID:

    nova stop uuid

    nova start uuid
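
    If many VMs are affected, the following optional sketch (a convenience helper, not part of the original procedure) lists the IDs of VMs in the ERROR state on the host, in which host_uuid indicates the host UUID obtained in 23; management VMs must still be excluded manually before stopping and starting:

    nova list --host host_uuid | awk -F '|' '$4 ~ /ERROR/ {gsub(/ /, "", $2); print $2}'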

  26. Run the following command to check whether all VMs on the host are in the ACTIVE state:

    nova list --host uuid

    • If yes, go to 27.
    • If no, contact technical support for assistance.

  27. If the faulty host accommodates the VRM VM, restore the VRM VM by following the instructions provided in "OS Fault of A Single VRM VM" in the FusionSphere V100R006C10SPC600 Product Documentation (Server Virtualization, FusionCompute V100R006C10SPH105).
  28. If the faulty host accommodates the FSM VM, restore the VM by following the instructions in "Fault Management" > "Troubleshooting" in the FusionStorage V100R006C20 Product Documentation.
  29. If the Service OM VM is deployed on the faulty host, perform operations provided in A Single Service OM VM Faulty in Active/Standby Deployment Mode to rebuild the Service OM VM.
  30. On the Service OM web client, choose Services > Service OM > Centralized O&M > Alarm > Alarm List. On the displayed page, check whether there is any alarm about the installed host.

    • If yes, go to 31.
    • If no, go to 32.

  31. Locate the row that contains the alarm, click Clear in the Operation column, select the alarm to be deleted, and click Clear in the upper left corner.
  32. Log in to the FusionSphere OpenStack web client.

    For details, see Logging In to the FusionSphere OpenStack Web Client (ManageOne Mode).

  33. On the FusionSphere OpenStack web client, choose Configuration > Disk. On the displayed page, check whether the management host group containing the faulty host uses remote storage devices.

    • If it does, go to 34.
    • If it does not, no further action is required.

  34. Use PuTTY to log in to the controller node through the reverse proxy IP address of the External OM plane.

    The default account is fsp and the default password is Huawei@CLOUD8.

  35. Run the following command to obtain the hash ID of the faulty host:

    python -c 'print(hash("hostid"))'

    hostid indicates the ID of the faulty host. The command output is the hash ID of the faulty host.
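
    For example, if the faulty host ID were 564D4412-43F7-893F-1220-2D8F9D09B788 (an illustrative value only), the command would be as follows. The printed signed integer is the hash ID; its exact value depends on the Python runtime on the node:

    python -c 'print(hash("564D4412-43F7-893F-1220-2D8F9D09B788"))'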

  36. Run the following command to obtain the WWN list of the remote LUN reconnected to the board:

    cat /etc/huawei/fusionsphere/osConfig.storage/cfg/osConfig.storage.cfg.effect | python -mjson.tool | grep lun_wwn

    Information similar to the following is displayed:

     "lun_wwn": "6487b6b1004bc09524031c1e0000003d"

  37. Log in to the DeviceManager. On the home page, choose Provisioning > LUN. On the displayed page, search for LUNs by the host hash ID obtained in 35, and check each LUN one by one. If a LUN's WWN is not in the WWN list obtained in 36, the LUN needs to be cleared. Record the to-be-cleared LUNs and then delete them from the corresponding LUN groups on the DeviceManager. For details, see the product documentation for the corresponding OceanStor model.

Operations in non-cascading scenarios

  1. On the FusionSphere OpenStack web client, choose O&M > Capacity Expansion and set PXE Boot Hosts to ON.
  2. Manually start the faulty host, set its boot device to network during the starting process, and reinstall the host OS.

    To set the boot device to network, you can use the remote control function of the server BMC system or use a KVM to connect to the server.

    The installation takes about 10 to 15 minutes.

    NOTE:

    The installation will fail (the OS is installed but services are unavailable) if the disk drive letters change after the installation. Such changes may be caused by the following operations and can be handled according to How Do I Handle Drive Letter Changes?:

    • Adjusting the host RAID arrays, including reconstructing, recreating, and deleting RAID arrays.
    • Adjusting the FusionSphere OpenStack system disk or disk partitions, including expanding the disk capacity and adding partitions.
    • Attaching the SSD card used by the MongoDB service to the host.

    If the reinstallation fails and none of the preceding operations has been performed, contact technical support for assistance.

    After the host OS is reinstalled, the passwords of OS accounts, including users root and fsp, will be reset to the default ones.

  3. Use PuTTY and the External OM plane IP address to log in to the host to be restored.

    The username is fsp and the default password is Huawei@CLOUD8.

    NOTE:
    • The system supports login authentication using a password or a private-public key pair. If a private-public key pair is used for login authentication, see Using PuTTY to Log In to a Node in Key Pair Authentication Mode.
    • For details about the IP address of the External OM plane, see the LLD generated by FCD sheet of the xxx_export_all.xlsm file exported from FusionCloud Deploy during software installation, and search for the IP addresses corresponding to the VMs and nodes. The parameter names in different scenarios are as follows:
      • Cascading layer in the Region Type I scenario: Cascading-ExternalOM-Reverse-Proxy; cascaded layer: Cascaded-ExternalOM-Reverse-Proxy.
      • Region Type II and Type III scenarios: ExternalOM-Reverse-Proxy.

  4. Run the following command to switch to user root, and enter the password of user root as prompted:

    su - root

    The default password of user root is Huawei@CLOUD8!.

  5. Run the TMOUT=0 command to disable user logout upon system timeout.
  6. Import environment variables. For details, see Importing Environment Variables.
  7. Run the following command to check whether the MongoDB component on the node being restored is in the fault state:

    cps template-instance-list --service mongodb mongodb

    • If yes, go to 8.
    • If no, go to 12.

  8. The controller nodes are interconnected with FusionStorage, and MongoDB is deployed on remote storage. Until the storage is recovered, MongoDB cannot be recovered and the progress cannot reach 100%. Therefore, recover the storage first. If the faulty KVM host has been connected to FusionStorage Block, recover the storage by following the instructions in "Server OS Faults" in the FusionStorage V100R006C20 Product Documentation.
  9. In the Type II scenario, you need to connect FusionSphere to AC again. For details, see "Manual Configuration" in the Agile Controller-DCN V300R003C00 Product Documentation and perform steps 7 to 9 and step 13 in sequence to configure the AC.
  10. On the FusionSphere OpenStack web client, choose Configuration > Disk. On the displayed page, select the faulty host and check whether error message "Configuration failed. Click the host state to view details." is displayed.

    • If yes, click the icon on the right of Expand Storage Capacity and go to 11.
    • If no, go to 13.

  11. Wait for 5 minutes and run the following command to check whether the MongoDB component is restored:

    cps template-instance-list --service mongodb mongodb

    • If yes, go to 13.
    • If no, contact technical support for assistance.

  12. View the installation progress on the Summary page on the FusionSphere OpenStack web client and check whether the installation is complete.

    The installation is complete if the progress reaches 100%.

    NOTE:

    If the reinstalled host is the first host in the FusionSphere OpenStack system, choose Management > Capacity Expansion to query the host installation status, because the host status displayed on the Summary page is always faulty. After the host reinstallation progress reaches 70%, the actual installation progress will be updated on the Summary page.

    • If it is, go to 13.
    • If it is not, contact technical support for assistance.

  13. Select the new host and click Reboot.

    NOTE:

    The host automatically synchronizes system configuration after it is successfully installed. Some advanced functions, such as resource isolation, will take effect only after a host restarts.

    To ensure that all configurations can take effect, restart hosts after the reinstallation.

  14. Use PuTTY to log in to the controller host in the AZ through the External OM plane IP address.

    The username is fsp and the default password is Huawei@CLOUD8.
    NOTE:
    • The system supports login authentication using a password or a private-public key pair. If a private-public key pair is used for login authentication, see Using PuTTY to Log In to a Node in Key Pair Authentication Mode.
    • For details about the IP address of the External OM plane, see the LLD generated by FCD sheet of the xxx_export_all.xlsm file exported from FusionCloud Deploy during software installation, and search for the IP addresses corresponding to the VMs and nodes. The parameter names in different scenarios are as follows:
      • Cascading layer in the Region Type I scenario: Cascading-ExternalOM-Reverse-Proxy; cascaded layer: Cascaded-ExternalOM-Reverse-Proxy.
      • Region Type II and Type III scenarios: ExternalOM-Reverse-Proxy.

  15. Run the following command to switch to user root, and enter the password of user root as prompted:

    su - root

    The default password of user root is Huawei@CLOUD8!.

  16. Run the TMOUT=0 command to disable user logout upon system timeout.
  17. Import environment variables. For details, see Importing Environment Variables.
  18. If a hardware SDN is used, reconnect the restored node to the AC.
  19. Perform the following operations to query information about VMs on the host:

    1. Run the following command to enter the secure operation mode:

      runsafe

      Information similar to the following is displayed:

      Input command:
    2. Run the following command to query information about VMs on the host:

      nova list --all-t --host <host_id>

      • If the host accommodates VMs (including the service VM), take a note of the service VM ID and go to 20.
      • If the host accommodates only management VMs, such as VRM VMs, FSM VMs, and Service OM VMs, go to 28.
      • If the host does not accommodate any VMs, go to 31.

  20. Perform the following operations to rebuild service VMs:

    1. Run the runsafe command to enter the secure operation mode, and run the following command to query detailed information about a VM:

      nova show <vm_id>

      <vm_id> specifies the ID of the service VM queried in 19.

      os-extended-volumes:volumes_attached in the command output specifies the volume used by the VM.

      The VM is booted from an image if information similar to the following is displayed:

      | os-extended-volumes:volumes_attached | []                                                |

      If information similar to the following is displayed, run the cinder show volumes_id command:

      | os-extended-volumes:volumes_attached | [{"id": "a596caae-a79c-4d61-9abe-db17f01fb7c0"}]  |

      In the command, volumes_id specifies the id value in the os-extended-volumes:volumes_attached field. If multiple IDs are available, run the cinder show volumes_id command for each ID.

      If the bootable value in the output of each command is false, the VM is booted from an image. Otherwise, the VM is booted from a volume.

      Information similar to the following is displayed:

      | bootable| false  |
      • If the VM is booted from a volume, go to 20.b.
      • If the VM is booted from an image, go to 20.c.
    2. Rebuild the VMs on other hosts. For details, see Rebuilding VMs on Other Hosts.

      After this step is complete, go to 20.e.

    3. Run the runsafe command to enter the secure operation mode, and run the following command to rebuild a service VM:

      nova rebuild <vm_id> <image_id>

    4. Run the runsafe command to enter the secure operation mode, and run the following command to check whether the VM is successfully rebuilt:

      nova show <vm_id>

      Check the status value in the command output:

      • If the value is REBUILD, the VM is being rebuilt. Query the VM again 1 minute later.
      • If the status is ACTIVE, the VM is successfully rebuilt. Go to 20.e.
      • If the status is neither of the above, the VM fails to be rebuilt. Contact technical support for assistance.
    5. Check whether any service VMs have not been rebuilt.
      • If yes, select a service VM and go to 20.a.
      • If no, go to 28.

  21. If the node to be restored is a controller node that accommodates cloud service VMs, you also need to restore the VMs that are not cloud service VMs (FusionCompute VRM, Service OM, or FSM VMs). Perform 22 to 27.
  22. Use PuTTY and the External OM plane IP address to log in to the host to be restored.

    The username is fsp and the default password is Huawei@CLOUD8.

    NOTE:
    • The system supports login authentication using a password or a private-public key pair. If a private-public key pair is used for login authentication, see Using PuTTY to Log In to a Node in Key Pair Authentication Mode.
    • For details about the IP address of the External OM plane, see the LLD generated by FCD sheet of the xxx_export_all.xlsm file exported from FusionCloud Deploy during software installation, and search for the IP addresses corresponding to the VMs and nodes. The parameter names in different scenarios are as follows:
      • Cascading layer in the Region Type I scenario: Cascading-ExternalOM-Reverse-Proxy; cascaded layer: Cascaded-ExternalOM-Reverse-Proxy.
      • Region Type II and Type III scenarios: ExternalOM-Reverse-Proxy.

  23. Import environment variables. For details, see Importing Environment Variables.
  24. Run the following command to obtain the UUID of the host:

    cat /etc/uuid

  25. Run the following command to check whether VMs on the host are in the ERROR state, in which uuid indicates the host UUID obtained in 24:

    nova list --host uuid

    • If yes, go to 26.
    • If no, go to 27.

  26. Run the following commands to stop and start all VMs except local disk management VMs (FusionCompute VRM, FusionSphere OpenStack OM, and FSM VMs), in which uuid indicates the VM ID:

    nova stop uuid

    nova start uuid

  27. Run the following command to check whether all VMs on the host are in the ACTIVE state:

    nova list --host uuid

    • If yes, go to 28.
    • If no, contact technical support for assistance.

  28. If the faulty host accommodates the VRM VM, restore the VRM VM by following the instructions provided in "OS Fault of A Single VRM VM" in the FusionSphere V100R006C10SPC600 Product Documentation (Server Virtualization, FusionCompute V100R006C10SPH105).
  29. If the faulty host accommodates the FSM VM, see "Fault Management" > "Troubleshooting" in the FusionStorage V100R006C20 Product Documentation and restore the VM.
  30. If the Service OM VM is deployed on the faulty host, perform operations provided in A Single Service OM VM Faulty in Active/Standby Deployment Mode to rebuild the Service OM VM.
  31. If the faulty KVM host has been connected to FusionStorage Block, recover storage by following the instructions in "Server OS Faults" in the FusionStorage V100R006C20 Product Documentation.

    NOTE:

    If the blockstorage-driver, blockstorage-driver-vrmxxx, or blockstorage-driver-kvmxxx role (xxx can be 001 or 002) is assigned, install an eBackup driver by performing step 1 in Installing Disaster Recovery Services > Installation and Initial Configuration (CSBS) > Installing and Configuring OceanStor BCManager eBackup > Interconnecting with FusionSphere OpenStack.

  32. On the Service OM web client, choose Services > Service OM > Centralized O&M > Alarm > Alarm List. On the displayed page, check whether there is any alarm about the installed host.

    • If there is, go to 33.
    • If there is not, no further action is required.

  33. Locate the row that contains the alarm, click Clear in the Operation column, select the alarm to be deleted, and click Clear in the upper left corner.
  34. Log in to the FusionSphere OpenStack web client.

    For details, see Logging In to the FusionSphere OpenStack Web Client (ManageOne Mode).

  35. On the FusionSphere OpenStack web client, choose Configuration > Disk. On the displayed page, check whether the management host group containing the faulty host uses remote storage devices.

    • If it does, go to 36.
    • If it does not, no further action is required.

  36. Use PuTTY to log in to the controller node through the reverse proxy IP address of the External OM plane.

    The default account is fsp and the default password is Huawei@CLOUD8.

  37. Run the following command to obtain the hash ID of the faulty host:

    python -c 'print(hash("hostid"))'

    hostid indicates the ID of the faulty host. The command output is the hash ID of the faulty host.

  38. Run the following command to obtain the WWN list of the remote LUN reconnected to the board:

    cat /etc/huawei/fusionsphere/osConfig.storage/cfg/osConfig.storage.cfg.effect | python -mjson.tool | grep lun_wwn

    Information similar to the following is displayed:

     "lun_wwn": "6487b6b1004bc09524031c1e0000003d"

  39. Log in to the DeviceManager. On the home page, choose Provisioning > LUN. On the displayed page, search for LUNs by the host hash ID obtained in 37, and check each LUN one by one. If a LUN's WWN is not in the WWN list obtained in 38, the LUN needs to be cleared. Record the to-be-cleared LUNs and then delete them from the corresponding LUN groups on the DeviceManager. For details, see the product documentation for the corresponding OceanStor model.

Rectifying the Fault of the MongoDB-Dedicated SSD Disk

Perform the steps in this section to rectify the single point of failure (SPOF) condition of the ceilometer-data disk.

Symptom

MongoDB is deployed on a dedicated SSD. The MongoDB component is in the fault state, the /var/ceilometer directory of the MongoDB node is mounted on an independent SSD, and the MongoDB host fails to start.

Possible Causes

The MongoDB-dedicated SSD is faulty, causing disk configuration and system startup to fail.

Procedure

Locate the fault.

  1. Log in to the remote console of the faulty MongoDB host using its BMC IP address.
  2. Run the following command and enter the password of the root user to switch to the root user:

    su - root

  3. Run the following command to disable user logout upon timeout:

    TMOUT=0

  4. Run the following command to check whether the MongoDB hard disk is faulty:

    lvs
    • If the command output does not contain the logical volume (LV) of the ceilometer-data disk, the hard disk is considered faulty. In this case, perform the subsequent steps.
    • If other faults occur, contact technical support for assistance.

Rectify the fault.

  1. On the MongoDB host you have logged in to, run the following command to create a temporary MongoDB LV:

    lvcreate -L 5g cpsVG -n ceilometer-data

  2. Run the following command to format the LV:

    mkfs.ext3 -K -E resize=16380g /dev/cpsVG/ceilometer-data

    /dev/cpsVG/ceilometer-data indicates the created logical volume.

  3. Back up the disk configuration file.

    cp -a /opt/fusionplatform/data/fusionsphere/osConfig.storage/cfg/osConfig.storage.cfg.effect /opt/fusionplatform/data/fusionsphere/osConfig.storage/cfg/osConfig.storage.cfg.effect.bak

  4. Modify and save the MongoDB configuration in the disk configuration file.

    Use the vi editor to open the configuration file.

    vim /opt/fusionplatform/data/fusionsphere/osConfig.storage/cfg/osConfig.storage.cfg.effect

    Change the values of the size and vg parameters in the logical-volume area where the name value is ceilometer-data.

    { 
                "backendtype": "local", 
                "format": "ext3", 
                "io-error-policy": "continue", 
                "name": "ceilometer-data", 
                "path": "/var/ceilometer", 
                "role": "mongodb", 
                "size": "5g", 
                "vg": "cpsVG" 
            }     

  5. Run the following command repeatedly and check the process ID of python /usr/local/bin/cps-client/cps_client/cpsclient.py. If the process ID does not change within 3 minutes, the system has started properly.

    ps -ef |grep cps-client

    NOTE:

    The preceding operations can be used only to rectify system startup exceptions caused by the MongoDB disk faults in emergency scenarios. For any service exceptions caused by the MongoDB disk faults after the system starts, replace the faulty parts and rectify MongoDB faults accordingly.
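
    The check in 5 can also be scripted. The following is a minimal sketch, assuming pgrep is available on the node (it is not part of the original procedure), that records the process ID, waits 3 minutes, and compares it again:

    # Record the PID, wait 3 minutes, then compare; an unchanged PID indicates a proper startup.
    pid1=$(pgrep -f 'cps-client/cps_client/cpsclient.py' | head -n 1)
    sleep 180
    pid2=$(pgrep -f 'cps-client/cps_client/cpsclient.py' | head -n 1)
    if [ -n "$pid1" ] && [ "$pid1" = "$pid2" ]; then echo "PID stable: startup complete"; else echo "PID changed or missing: keep waiting"; fi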

Rectifying the Disk Partition Fault on Compute Nodes

Symptom
  • VMs fail to be started by tenant users.
  • The standby database fails to be switched to the active database.
Possible Causes

The image partition is read-only, the file system is damaged, or the data is abnormal. As a result, VMs cannot be started normally.

Procedure
  • If remote storage is used, rectify the fault according to Remote Storage Is Faulty.
  • If local storage is used, perform the steps in this section to rectify the fault.
  1. Use PuTTY and the External OM plane IP address to log in to the faulty host.

    The username is fsp and the default password is Huawei@CLOUD8.
    NOTE:
    • The system supports login authentication using a password or a private-public key pair. If a private-public key pair is used for login authentication, see Using PuTTY to Log In to a Node in Key Pair Authentication Mode.
    • For details about the IP address of the External OM plane, see the LLD generated by FCD sheet of the xxx_export_all.xlsm file exported from FusionCloud Deploy during software installation, and search for the IP addresses corresponding to the VMs and nodes. The parameter names in different scenarios are as follows:
      • Cascading layer in the Region Type I scenario: Cascading-ExternalOM-Reverse-Proxy; cascaded layer: Cascaded-ExternalOM-Reverse-Proxy.
      • Region Type II and Type III scenarios: ExternalOM-Reverse-Proxy.

    Run the following command to switch to user root, and enter the password of user root as prompted:

    su - root

    The default password of user root is Huawei@CLOUD8!.

  2. Run the TMOUT=0 command to disable user logout upon system timeout.
  3. Import environment variables. For details, see Importing Environment Variables.
  4. Run the following command to check whether the partition is read-only:

    cat /proc/mounts | grep -w "ro," | grep /dev/mapper/cpsVG-image
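
    Information similar to the following is displayed when the partition is mounted read-only (the mount options shown are illustrative):

    /dev/mapper/cpsVG-image /opt/HUAWEI/image ext3 ro,relatime,data=ordered 0 0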

    If ro is displayed, the partition is read-only.
    • If yes, go to 5.
    • If no, contact technical support for assistance.

  5. Run the following command to unmount the read-only partition:

    umount /dev/mapper/cpsVG-image

    If the partition fails to be unmounted, and information similar to Figure 7-1 is displayed, the partition is being used by some programs. In this case, contact technical support for assistance.

    Figure 7-1 Command output

  6. Run the following command to restore the file system:

    fsck.ext3 /dev/mapper/cpsVG-image

    NOTE:

    If the partition uses a file system other than ext3, use the corresponding fsck command. For example, use fsck.ext2 for an ext2 file system.

    If no error information is displayed in the command output, the restoration is successful.

    • If the automatic restoration is successful, go to 7.
    • If the automatic restoration fails and the system partition is damaged, contact technical support for assistance.
      NOTE:

      If the restoration fails, you can also run the fsck -y /dev/mapper/cpsVG-image command to forcibly restore the file system. The forcible restoration may fail or cause data loss or corruption.

  7. Run the following command to mount the partition:

    mount /dev/mapper/cpsVG-image /opt/HUAWEI/image

    /opt/HUAWEI/image indicates the directory to which the partition is mounted, and you can specify the directory based on requirements.
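
    Optionally, re-run the check from 4 to confirm that the partition is now mounted read-write. This verification is a suggestion and is not part of the original procedure:

    cat /proc/mounts | grep /dev/mapper/cpsVG-image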

Handling a Single Faulty Controller Node VM in a Cascaded FusionSphere OpenStack System (Region Type I)

Symptom

The VM OS cannot start, and the fault cannot be recovered using other methods.

Possible Causes
  • The OS is faulty.
  • VM files are corrupted or resources are abnormal.
Procedure
NOTE:

This operation applies to the scenarios where only one of the three controller node VMs in the cascaded FusionSphere OpenStack system is faulty.

  1. Log in to any controller node of the cascading FusionSphere OpenStack system and run the following command to query the subnet of the corresponding cascaded Internal Base plane:

    neutron net-list | grep cascaded_internal_base
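
    The grep output is a single row of the standard neutron net-list table, in which the third column contains the subnet ID (followed by its CIDR) used as subnet_id in 2 and 3. The IDs and CIDR below are illustrative only:

    | 7c1d2e3f-0000-4000-8000-00000000000a | cascaded_internal_base | 9f8e7d6c-0000-4000-8000-00000000000b 172.28.0.0/24 |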

  2. Run the following command to check whether DHCP is enabled (true: yes, false: no):

    neutron subnet-show subnet_id | grep dhcp

    NOTE:

    In the preceding command, subnet_id indicates the subnet ID obtained in 1.

    • If yes, go to 3.
    • If no, go to 4.

  3. Run the following command to disable DHCP:

    neutron subnet-update --disable-dhcp subnet_id

    NOTE:

    In the preceding command, subnet_id indicates the subnet ID obtained in 1.

  4. On the web client of the cascaded FusionSphere OpenStack system that you have logged in to, choose O&M > Capacity Expansion. On the Capacity Expansion page, disable PXE Boot Hosts.
  5. On the web client of the cascaded FusionSphere OpenStack system that you have logged in to, choose Cloud Deploy > Environment Management. Click the name of the virtualization deployment environment of the cascaded FusionSphere OpenStack system and query names of components with the type of CascadedVM.

  6. Use PuTTY to log in to the first node of the cascading FusionSphere OpenStack system through the External OM plane IP address.

    The username is fsp and the default password is Huawei@CLOUD8.
    NOTE:
    • The system supports login authentication using a password or a private-public key pair. If a private-public key pair is used for login authentication, see Using PuTTY to Log In to a Node in Key Pair Authentication Mode.
    • For details about the IP address of the External OM plane, see the LLD generated by FCD sheet of the xxx_export_all.xlsm file exported from FusionCloud Deploy during software installation, and search for the IP addresses corresponding to the VMs and nodes.

  7. Run the following command to switch to user root, and enter the password of user root as prompted:

    su - root

    The default password of user root is Huawei@CLOUD8!.

  8. Run the TMOUT=0 command to disable user logout upon system timeout.
  9. Import environment variables. For details, see Importing Environment Variables.
  10. Run the following command to query the status of each of the controller node VMs in 5 and make a note of the ID of the faulty VM:

    nova list |grep cascadedvm

    cascadedvm indicates the name of the component obtained in 5.

    If component cascadedvm003 obtained in 5 is used as an example, run the following command:

    nova list |grep cascadedvm003
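
    The grep output is a single row of the standard nova list table, in which the second column is the VM ID to note. All values in the row below, including the status, are illustrative and reuse the example ID from the following steps:

    | dd7d75af-c446-49bf-be0e-f3fe8fa3b356 | cascadedvm003 | ERROR | - | NOSTATE | internal_base=172.28.0.4 |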

  11. Run the following command to query the flavor name of the faulty VM based on the faulty VM ID queried in 10:

    nova show cascadedvm_id |grep flavor

    cascadedvm_id indicates the faulty VM ID queried in 10.

    If the faulty VM ID queried in 10 is dd7d75af-c446-49bf-be0e-f3fe8fa3b356, run the following command:

    nova show dd7d75af-c446-49bf-be0e-f3fe8fa3b356 |grep flavor

  12. Run the following command to query the UUID of the host that houses the faulty VM based on the ID queried in 10:

    nova show cascadedvm_id |grep hypervisor_hostname

    cascadedvm_id indicates the ID queried in 10.

    If the faulty VM ID queried in 10 is dd7d75af-c446-49bf-be0e-f3fe8fa3b356, run the following command:

    nova show dd7d75af-c446-49bf-be0e-f3fe8fa3b356 |grep hypervisor_hostname

  13. Run the following command to check whether a data volume is attached to the faulty VM based on the faulty VM ID queried in 10:

    cinder list |grep cascadedvm_id

    cascadedvm_id indicates the faulty VM ID queried in 10.

    If the faulty VM ID queried in 10 is dd7d75af-c446-49bf-be0e-f3fe8fa3b356, run the following command:

    cinder list |grep dd7d75af-c446-49bf-be0e-f3fe8fa3b356

    If the command output shown in Figure 7-2 is displayed, only the boot volume is attached to the faulty VM, and the SSD passthrough is configured for the faulty VM.

    Figure 7-2 Command output

    If the command output shown in Figure 7-3 is displayed, the boot and data volumes are attached to the faulty VM. Make a note of the data volume size.

    Figure 7-3 Command output

  14. On the web client of the cascaded FusionSphere OpenStack system, delete components of the faulty VM and click Deploy This Environment.

  15. After the deployment is successful, switch to the Environment Deployment page on the web client of the cascading FusionSphere OpenStack system, and create a VM instance for the cascaded FusionSphere OpenStack system. For the host ID, enter the UUID obtained from 12.

    NOTE:

    Ensure that the specifications and configuration of the new VM are consistent with those of the faulty VM.

    1. Drag Cascaded VM from the application component to the Drop Components here area to create a VM instance.

    2. Configure the VM instance.

      • Name: specifies the VM instance name, which must be unique. The value must start with an uppercase or lowercase letter. The name contains only letters, digits, hyphens (-), and underscores (_).
      • Nova Availability Zone: specifies the AZ where the VM is to be deployed. Set it to manage-az.
      • Cinder Availability Zone: specifies the AZ where the VM volume is to be created. Set it to manage-az.
      • Boot volume size (G): specifies the size of the volume from which the VM is booted. Set it to 300 GB.
      • Data volume size (G): specifies the data volume size of the VM. If SSD passthrough is configured for the VM as queried in 13, leave this parameter blank. If a data volume is attached to the VM as queried in 13, enter the original data volume size.
      • Host: specifies the ID of the host accommodating the VM. One VM maps one compute node in the cascading FusionSphere OpenStack system. You can query the host ID on the Summary page on the web client of the cascading FusionSphere OpenStack system or run the cps host-list command. You are advised to select compute nodes in the cascading FusionSphere OpenStack system.
      • Image: specifies the image name. Set it to cascaded_image.
      • Flavor: specifies the VM specifications. Select the cascaded_vm-related specifications. Select the flavor name of the faulty VM queried in 11.
      • Volume Backend Name: specifies the name of the backend storage configured for the KVM resource pool. You can query the backend storage name by choosing Resource Pool > Configure Storage Cluster on the web client of the cascading FusionSphere OpenStack system.

      • Delay Deploy Time (minute): specifies whether to delay the deployment. The default value is 0. When the cascading and cascaded FusionSphere OpenStack systems share one Windows PC, the Windows PC is first used by the cascading FusionSphere OpenStack system. Therefore, it needs to be allocated to the network planes of the cascaded FusionSphere OpenStack system temporarily to PXE-boot the cascaded FusionSphere OpenStack system. In this case, set this parameter, enabling the Windows PC to PXE-boot hosts in the cascaded FusionSphere OpenStack system within the time specified by this parameter.
    3. Select the network used by the cascaded VM.

      NOTE:

      Cascaded provision (optional) and Cascaded BMC Base (optional) are the network planes in bare metal scenarios. The default network plane is used in KVM scenarios.

    4. Deploy the VM instance.

      After the VM instance is created, click Deploy This Environment. If the status changes to Ready, the VM instance is successfully deployed.

  16. On the cascading PuTTY page, run the following command and make a note of the MAC address of the new VM:

    neutron port-list|grep cascadedvm_new

    cascadedvm_new indicates the name of the VM instance created in 15.

    If VM instance name cascadedvm004 created in 15 is used as an example, run the following command:

    neutron port-list|grep cascadedvm004

  17. Use PuTTY to log in to the first node in the cascaded FusionSphere OpenStack system through the External OM IP address.

    The username is fsp and the default password is Huawei@CLOUD8.
    NOTE:
    • The system supports login authentication using a password or a private-public key pair. If a private-public key pair is used for login authentication, see Using PuTTY to Log In to a Node in Key Pair Authentication Mode.
    • For details about the IP address of the External OM plane, see the LLD generated by FCD sheet of the xxx_export_all.xlsm file exported from FusionCloud Deploy during software installation, and search for the IP addresses corresponding to the VMs and nodes.

  18. Run the following command to switch to user root, and enter the password of user root as prompted:

    su - root

    The default password of user root is Huawei@CLOUD8!.

  19. Run the TMOUT=0 command to disable user logout upon system timeout.
  20. Import environment variables. For details, see Importing Environment Variables.
  21. On the web client of the cascaded FusionSphere OpenStack system, click Summary and query the ID of the faulty controller node.
  22. Run the following commands to configure the mapping between the ID of the faulty host and the MAC address of the new host:

    cps hostid-mac-add --mac MAC address of the new host --hostid Original host ID

    cps commit

    In the command, Original host ID indicates the ID of the faulty controller node queried in 21, and MAC address of the new host indicates the MAC address of the newly-created VM queried in 16.

    Figure 7-4 Command output

    As shown in Figure 7-4, if the command output contains "hostid already defined!", run the following commands to update the mapping between the host ID and the MAC address of the new VM:

    cps hostid-mac-update --mac MAC address of the new host --hostid Original host ID

    cps commit

    Run the following command to check whether the mapping between the faulty host ID and the MAC address of the new host is configured correctly:

    cps hostid-mac-list
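
    For example, with an illustrative MAC address of fa:16:3e:12:34:56 and an illustrative original host ID of 564D4412-43F7-893F-1220-2D8F9D09B788, the commands would be:

    cps hostid-mac-add --mac fa:16:3e:12:34:56 --hostid 564D4412-43F7-893F-1220-2D8F9D09B788

    cps commit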

  23. On the web client of the cascaded FusionSphere OpenStack system that you have logged in to, choose O&M > Capacity Expansion. The record Wait for PXE Boot is displayed on the Capacity Expansion page. After confirming the MAC address queried in 16, click the icon next to the record.

  24. Click Summary and query the installation progress of the faulty host. If the blockstorage role is deployed on the faulty host, the installation is paused when the installation progress reaches 96%. For details, see "Server OS Faults" in the FusionStorage V100R006C20 Product Documentation and perform operations provided in "Install FSA for the server using the CLI" and "Restore storage resources" in sequence.
  25. On the web client of the cascaded FusionSphere OpenStack system that you have logged in to, choose O&M > Capacity Expansion. On the Capacity Expansion page, disable PXE Boot Hosts.

NAT Gateway Service Provisioning Failure (Region Type I)

Symptom

The ManageOne operation plane cannot provision new NAT gateway services.

Possible Causes

The data plane of the NAT gateway is deployed in active/standby mode. When the control component neutron-nat-gw-data-agent on the active NAT gateway node becomes faulty, the data plane of the NAT gateway cannot receive new services.

Fault Diagnosis

Check whether the node hosting the faulty neutron-nat-gw-data-agent component is the active NAT gateway node.

  • If yes, perform an active/standby switchover.
  • If no, contact technical support for assistance.
Procedure
  1. Use PuTTY to log in to any host in the AZ through the IP address of the External OM plane.

    The default account is fsp and the default password is Huawei@CLOUD8.
    NOTE:
    • The system supports login authentication using a password or a private-public key pair. If a private-public key pair is used for login authentication, see Using PuTTY to Log In to a Node in Key Pair Authentication Mode.
    • For details about the IP address of the External OM plane, see the LLD generated by FCD sheet of the xxx_export_all.xlsm file exported from FusionCloud Deploy during software installation, and search for the IP addresses corresponding to the VMs and nodes.

  2. Run the following command to switch to user root and enter the password of user root as prompted:

    su - root

    The default password of user root is Huawei@CLOUD8!.

  3. Run the TMOUT=0 command to disable user logout upon system timeout.
  4. Import the environment variables.

    For details, see Importing Environment Variables.

  5. Run the following command to query the node where the faulty neutron-nat-gw-data-agent component is located:

    cps template-instance-list --service nat-gateway neutron-nat-gw-data-agent

    +----------------+---------------------------+--------+--------------------------------------+-------------+
    | instanceid     | componenttype             | status | runsonhost                           | omip        |
    +----------------+---------------------------+--------+--------------------------------------+-------------+
    | agt_0000000001 | neutron-nat-gw-data-agent | active | 53034D15-B603-084B-82D6-AAB15E9F3503 | 4.20.43.169 |
    | agt_0000000000 | neutron-nat-gw-data-agent | fault  | 45EBDB95-711D-7640-8B5C-EE3D98561285 | 4.20.43.158 |
    +----------------+---------------------------+--------+--------------------------------------+-------------+

  6. Use PuTTY to log in to the NAT gateway node using the omip IP address of the faulty instance obtained in 5.

    The default account is fsp and the default password is Huawei@CLOUD8.
    NOTE:

    The system supports login authentication using a password or private-public key pair. If a private-public key pair is used for login authentication, see Using PuTTY to Log In to a Node in Key Pair Authentication Mode.

  7. Run the following command to switch to user root and enter the password of user root as prompted:

    su - root

    The default password of user root is Huawei@CLOUD8!.

  8. Run the TMOUT=0 command to disable user logout upon system timeout.
  9. Import the environment variables.

    For details, see Importing Environment Variables.

  10. Run the following command to check whether this node is the active NAT gateway node. The node is the active node if status is MASTER in the command output.

    /usr/sbin/ugw_shell vrrp show all

    ===============================================
    vrrpid   enable   status    ha-link   
    -----------------------------------------------
    1        yes      MASTER     eth0.2038      
    ===============================================
    • If it is, go to 11.
    • If it is not, contact technical support for assistance.

  11. Run the following command to perform an active/standby switchover:

    cps host-template-instance-operate --service nat-gateway neutron-nat-gw-dataplane --action stop --host hostid

    +--------------------------+--------------------------------------+--------+---------+
    | template                 | runsonhost                           | action | result  |
    +--------------------------+--------------------------------------+--------+---------+
    | neutron-nat-gw-dataplane | 45EBDB95-711D-7640-8B5C-EE3D98561285 | stop   | success |
    +--------------------------+--------------------------------------+--------+---------+

    cps host-template-instance-operate --service nat-gateway neutron-nat-gw-dataplane --action start --host hostid

    +--------------------------+--------------------------------------+--------+---------+
    | template                 | runsonhost                           | action | result  |
    +--------------------------+--------------------------------------+--------+---------+
    | neutron-nat-gw-dataplane | 45EBDB95-711D-7640-8B5C-EE3D98561285 | start  | success |
    +--------------------------+--------------------------------------+--------+---------+

    hostid is the host ID obtained in 5. If success is displayed in the command output, the active/standby switchover is successful.
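
    As an optional verification (a suggestion, not part of the original procedure), re-run the command from 10 on this node to confirm that status is no longer MASTER, and re-run the query from 5 to check the component status:

    /usr/sbin/ugw_shell vrrp show all

    cps template-instance-list --service nat-gateway neutron-nat-gw-data-agent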

Handling Hosts Erroneously Connected to an AZ

Symptom

If a started host that belongs to another availability zone (AZ) is connected directly to the network of this AZ without being reinstalled, the host is considered a mistakenly added host. To avoid unnecessary alarms, delete such a host in a timely manner.

Hosts that do not belong to an AZ are connected to this AZ by mistake, as illustrated in Figure 7-5.

Figure 7-5 Hosts are erroneously connected to an AZ
Procedure
  1. Use PuTTY to log in to any host in the local AZ.

    Ensure that the IP address of the External OM network plane and username fsp are used to establish the connection.

    You can obtain the IP address of the external OM plane on the Summary page on the FusionSphere OpenStack web client.

    The system supports the login authentication using a password or private-public key pair. If you use a private-public key pair to authenticate the login, see Using PuTTY to Log In to a Node in Key Pair Authentication Mode.

  2. Run the following command and enter the password of user root to switch to user root:

    su - root

  3. Run the following command to disable logout on timeout:

    TMOUT=0

  4. Import environment variables. For details, see Importing Environment Variables.
  5. Run the following command to query the host list:

    cps host-list
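
    If the list is long, the following optional filter (a convenience sketch, not part of the original step) narrows the output to hosts in the fault state:

    cps host-list | grep -w fault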

  6. Use the reverse proxy IP address to log in to the FusionSphere OpenStack web client and query the host list on the Summary page.
  7. Compare the two host lists.

    If the host list queried using the command contains a host in the fault state but the host list displayed on the web client does not contain such a host, this host is considered as a mistakenly added one. In this case, take note of its hostid value.

    NOTE:

    If multiple such hosts with the same IP address exist, you need to determine whether the hosts are mistakenly added or correctly installed but faulty ones.

  8. Run the following command to query the host configuration file hostcfg:

    cps hostcfg-list

    Information similar to the following is displayed:

    +---------------+----------------+-----------------------------------------------------------+
    | type          | name           | hosts                                                     |
    +---------------+----------------+-----------------------------------------------------------+
    | site          | default        | hostid:                                                   |
    |               |                |                                                           |
    | storage       | default        | default:all                                               |
    |               |                |                                                           |
    | storage       | default_sdi    | capability:{boardtype: SDI}                               |
    |               |                |                                                           |
    | storage       | control_group0 | hostid:564D0074-B2AA-B928-B748-B32F52C5B8B5               |
    |               |                |                                                           |
    | storage       | control_group1 | hostid:564D159D-F177-F8B6-32DC-D7E034FF0182               |
    |               |                |                                                           |
    | storage       | control_group2 | hostid:564D4412-43F7-893F-1220-2D8F9D09B788               |
    |               |                |                                                           |
    | kernel        | default        | default:all                                               |
    |               |                |                                                           |
    | kernel        | default_sdi    | capability:{boardtype: SDI}                               |
    |               |                |                                                           |
    | network       | default        | default:all                                               |
    |               |                |                                                           |
    | resgrp-define | server         | capability:{role: sys-server}                             |
    |               |                |                                                           |
    | resgrp-define | default        | default:all                                               |
    |               |                |                                                           |
    | resgrp-define | default_sdi    | capability:{boardtype: SDI}                               |
    |               |                |                                                           |
    | resgrp-define | control_group1 | hostid:564D0074-B2AA-B928-B748-B32F52C5B8B5, 564D4412-43F |
    |               |                | 7-893F-1220-2D8F9D09B788, 564D159D-F177-F8B6-32DC-        |
    |               |                | D7E034FF0182                                              |
    |               |                |                                                           |
    +---------------+----------------+-----------------------------------------------------------+

  9. Run the following command to delete information about the incorrectly connected host from the hostcfg list:

    cps hostcfg-host-delete --host hostid=hostid --type type name

    hostid indicates the ID of the host that is incorrectly connected.

    type and name indicate the values of the corresponding fields of the host that is incorrectly connected in the hostcfg list.
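    For example, if the mistakenly added host 564D0074-B2AA-B928-B748-B32F52C5B8B5 appeared in the storage group control_group0 of the sample output above, the entry would be removed as follows (a minimal sketch; the values are illustrative):

    cps hostcfg-host-delete --host hostid=564D0074-B2AA-B928-B748-B32F52C5B8B5 --type storage control_group0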

  10. Run the following command to query the role of the host:

    cps host-role-list hostid

  11. Run the following command to delete the role for the host:

    cps role-host-delete --host hostid role_name

    hostid is that recorded in 7, and role_name is the role name obtained in 10.

  12. Run the following commands to delete the host connected by mistake:

    cps host-delete --host hostid

    cps commit
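    Taken together, steps 10 to 12 for the same example host might look as follows (a minimal sketch; the role name compute is hypothetical and must be replaced with the value actually returned by cps host-role-list):

    cps host-role-list 564D0074-B2AA-B928-B748-B32F52C5B8B5

    cps role-host-delete --host 564D0074-B2AA-B928-B748-B32F52C5B8B5 compute

    cps host-delete --host 564D0074-B2AA-B928-B748-B32F52C5B8B5

    cps commit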

Network Node OS Faults (Region Type I)

Symptom

When a network node OS is faulty (network nodes are the nodes where vrouter and nat-server are located in the cascading FusionSphere OpenStack system), you need to reinstall the OS on the host and deploy and configure the components again. If the host status in the command output of cps host-show <host_id> is fault, the network node is faulty. Network nodes can be deployed on physical servers or VMs. Therefore, the following scenarios are involved:

  • Network nodes deployed on physical servers: The OS of the host on which the network node resides is faulty.
  • Network nodes deployed on VMs:
    • The OS of the host (physical server) on which the network node resides is faulty.
    • The host OS of the VM on which network nodes vrouter and nat-server reside is faulty.
    • The host OS of the VM on which network node elb resides is faulty.
    • The host OS of the VM on which network node nat gateway resides is faulty.
Possible Causes
  • Network nodes deployed on physical servers: The OS of the host on which the network node resides is damaged by high-risk or abnormal operations.
  • Network nodes deployed on VMs:
    • The OS of the host (physical server) on which the network node resides is damaged by high-risk or abnormal operations.
    • The host OS of the VM on which network nodes vrouter and nat-server reside is damaged by high-risk or abnormal operations.
    • The host OS of the VM on which network node elb resides is damaged by high-risk or abnormal operations.
    • The host OS of the VM on which network node nat gateway resides is damaged by high-risk or abnormal operations.
Prerequisites
  • Except for network nodes with OS faults, all other nodes and functions are normal.
  • You have obtained the host ID or VM name of the faulty node.
  • You can log in to the ManageOne OM plane and FusionSphere OpenStack web client.
  • You can log in to one node of the cascading FusionSphere OpenStack system through the terminal CLI.
  • The following procedure applies to the scenario where the OS of a host is faulty. If OSs of multiple hosts are faulty, perform the following steps on each faulty host.
Procedure
  • The network node is deployed on a physical server, and the OS of the host on which the network node resides is faulty.
  1. Perform the following operations to log in to any normal node in the cascading system. Unless otherwise specified, perform the subsequent operations on this node:

    1. Use PuTTY to log in to any host in the cascading system as user fsp through the IP address of the External OM plane.

      The username is fsp and the default password is Huawei@CLOUD8.

      NOTE:
      • The system supports login authentication using a password or private-public key pair. If a private-public key pair is used for login authentication, see Using PuTTY to Log In to a Node in Key Pair Authentication Mode.
      • For details about the IP address of the External OM plane, see the LLD generated by FCD sheet of the xxx_export_all.xlsm file exported from FusionCloud Deploy during software installation, and search for the IP addresses corresponding to VMs and nodes.
    2. Run the following command to switch to user root, and enter the password of user root as prompted:

    su - root

    The default password of user root is Huawei@CLOUD8!.

    1. Run the TMOUT=0 command to disable user logout upon system timeout.
    2. Import environment variables. For details, see Importing Environment Variables.

  2. For example, if the ID of the faulty host is 2E88A028-D21D-B211-81DD-0018E1C5D866, run the following commands to assign a value to the variable:

    host_id=2E88A028-D21D-B211-81DD-0018E1C5D866  # Faulty host ID

    nat_role=`cps host-role-list $host_id | grep nat-server | awk '{print $2}'`

    flag=`echo $nat_role | awk '{print substr($1,11)}'`  # Network exit flag

    network_grp=`cps hostcfg-list --type network | grep $host_id | awk '{print $4}'`  # Network group
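    You can verify the derived values before continuing (a minimal sketch; the output values shown are illustrative):

    echo $flag  # e.g. 01

    echo $network_grp  # e.g. default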

  3. Run the following commands to stop the CPU core bound to the network node (if the faulty network node does not use an Intel 82599 optical network adapter, skip this step; in the standard environment, the optical network adapter is used):

    cps template-ext-params-update --parameter vrouter$flag.DATA_PLANE.enable_auto_irq=False --service vrouter neutron-vrouter$flag

    cps commit

    cps template-ext-params-update --parameter neutron_l3_nat_agent$flag.DEFAULT.enable_auto_irq=False --service nat-server neutron-l3-nat-agent$flag

    cps commit

  4. On the FusionSphere OpenStack web client, choose O&M > Capacity Expansion and set PXE Boot Hosts to ON.
  5. To use the remote control function of the BMC system of the faulty server, perform the following operations:

    Choose Power > Power Control and click Power Off.

    Choose Configure > Boot Option, set Take effect to One-time and Boot Media to PXE, and click Save.

    Choose Remote Control > Integrated Remote Console and click Remote Virtual Console (Shared Mode).

    On the Remote Virtual Console page, click Power On to start the faulty host in PXE mode. It takes about 10 to 15 minutes to reinstall the host.

    Alternatively, use the keyboard and monitor to directly connect to the server. During the server startup, manually select the network boot mode.

    NOTE:

    The installation will fail (services are unavailable although the OS is installed) if the disk drive letter changes after the installation, which may be caused by the following operations and can be handled according to How Do I Handle Drive Letter Changes?:

    • Adjust the host RAID arrays, including reconstructing, recreating, and deleting RAID arrays.
    • Adjust the FusionSphere OpenStack system disk or disk partitions, including expanding the disk capacity and adding partitions.
    • Attach the SSD card used by the MongoDB service to the host. This issue must be handled according to How Do I Handle Drive Letter Changes?.

    If the reinstallation fails with none of the operations above performed, contact technical support for assistance.

    After the host OS is reinstalled, the passwords of OS accounts, including users root and fsp, will be reset to the default ones.

  6. After the OS is reinstalled, click Summary on the FusionSphere OpenStack web client. Wait until the configuration progress of the faulty board reaches 100%.

    Choose O&M > Capacity Expansion and set PXE Boot Hosts to OFF.

  7. Configure the network adapter list of the data plane. If the faulty network node does not use an Intel 82599 optical network adapter, skip this step. In the standard environment, the optical network adapter is used.

    Run the following command to check the PCISLOT of the data plane network adapter:

    cps hostcfg-show $network_grp --type network | grep PCISLOT | awk '{print substr($4,0,12)}'

    0000:81:00.0

    0000:81:00.1

    ...

    Run the following command to obtain the OM IP address of the faulty network node:

    cps host-list | grep $host_id | awk '{print $12}'

    8.18.63.73

    Log in to the faulty network node based on 1, and run the following command on the network node for each PCISLOT obtained above:

    ll /sys/class/net/ | grep eth | grep 0000:81:00.0 | awk '{print substr($11,length($11)-3)}'

    eth2

    ll /sys/class/net/ | grep eth | grep 0000:81:00.1 | awk '{print substr($11,length($11)-3)}'

    eth3

    ...

    For example, if the obtained data plane network adapter list is eth2 eth3 eth5 eth7 (usually 4 or 6 network adapters), run the following command (separate multiple network adapter names by spaces):

    python2.7 /etc/neutron/networking-cascading/set_irq.py set eth2 eth3 eth5 eth7
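    The per-PCISLOT lookups above can also be combined into a single loop (a minimal sketch under the same assumptions; ls -l replaces the interactive ll alias so that the command also works in scripts):

    for slot in `cps hostcfg-show $network_grp --type network | grep PCISLOT | awk '{print substr($4,0,12)}'`; do ls -l /sys/class/net/ | grep eth | grep $slot | awk '{print substr($11,length($11)-3)}'; done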

  8. Run the following commands to start the CPU core bound to the network node (if the faulty network node does not use an Intel 82599 optical network adapter, skip this step; in the standard environment, the optical network adapter is used):

    cps template-ext-params-update --parameter vrouter$flag.DATA_PLANE.enable_auto_irq=True --service vrouter neutron-vrouter$flag

    cps commit

    cps template-ext-params-update --parameter neutron_l3_nat_agent$flag.DEFAULT.enable_auto_irq=True --service nat-server neutron-l3-nat-agent$flag

    cps commit

  9. On the FusionSphere OpenStack web client, click Summary, select the newly installed host, and click Restart.

    NOTE:

    The host automatically synchronizes system configuration after it is successfully installed. Some advanced functions, such as resource isolation, will take effect only after a host restarts.

    To ensure that all configurations can take effect, restart hosts after the reinstallation.

  10. Run the following commands to check whether network nodes vrouter and nat-server are normal:

    cps template-instance-list --service vrouter neutron-vrouter$flag

    cps template-instance-list --service nat-server neutron-l3-nat-agent$flag

    • If status in the command output is active, log in to the ManageOne OM plane and choose Alarms > Current Alarms to manually clear all alarms of the faulty host.
    • If status in the command output is fault, contact technical support for assistance.

  • The network node is deployed on a VM, and the OS of the host on which the network node resides is faulty.
NOTE:

In this deployment mode, if the host OS of the VM where the network node resides is faulty, you need to reinstall the host and the VM on the host.

  1. Before reinstalling the OS of the faulty host, handle the VMs on the faulty host. If the ID of the faulty host is 62A5E267-2107-E811-8F9A-B4FBF9AD8203, run the following commands to query the information about the faulty VMs:

    phy_host_id=62A5E267-2107-E811-8F9A-B4FBF9AD8203  # Faulty host ID

    vms=`nova list | grep netcluster_ | awk '{print $4}'`

    function get_host { nova show $x | grep hypervisor_hostname | awk '{print $4}'; }

    for x in $vms; { if [[ `get_host` == $phy_host_id ]]; then echo $x; fi; }

    The VMs displayed in the command output are those to be cleared. For example:

    netcluster_elb_api_vm_2_01

    netcluster_elb_db_vm_2_01

    netcluster_elb_lvs_vm_2_01

    netcluster_elb_nginx_vm_2_01

    netcluster_vrouter_nat_vm_2_01 # vrouter_nat VM name

    netcluster_nat_gateway_vm_2_01 # nat gateway VM name

    For the vrouter_nat VM, run the following commands to query the host ID of the VM:

    flag=`echo $x | awk '{print substr($1,length($1)-1)}'`

    cps template-ext-params-show --service vrouter neutron-vrouter$flag
    +--------------------------------------------------+----------------------------------------------------+
    | Property                                         | Value                                              |
    +--------------------------------------------------+----------------------------------------------------+
    | neutron_vrouter01.DEFAULT.hostid_for_virt_deploy | 3FF0190C-B8B5-9D4A-ADF8-34FDD5B4697F:62A5E267-2107 |
    |                                                  | -E811-8F9A-B4FBF9AD8203,F35E8E5E-99E7-0747-B2EB-DF |
    |                                                  | 9E1117FAC2:5E2C1AA8-E506-E811-9D39-B4FBF9AD81F1    |
    | neutron_vrouter01.agent.report_interval          | 120                                                |
    | vrouter01.DEFAULT.pod_fip_cidr                   | 100.64.0.0/10                                      |
    +--------------------------------------------------+----------------------------------------------------+
    In the command output, 3FF0190C-B8B5-9D4A-ADF8-34FDD5B4697F, which is mapped to phy_host_id (62A5E267-2107-E811-8F9A-B4FBF9AD8203) in the hostid_for_virt_deploy value, is the host ID of the vrouter_nat VM to be cleared.

  2. Open six PuTTY sessions to process the six types of VMs on the faulty host so that the variables of the VMs can be configured. For details about the VMs to be cleared, see 1. Perform the following steps:

    Clear the vrouter_nat VM. For details, see 1 to 3 in The network is deployed on a VM, and the host OS of the VM where network nodes vrouter and nat-server reside is faulty.

    Clear the elb_lvs VM. For details, see 1 to 3 in The network node is deployed on a VM, and the OS of the VM on which the network node elb resides is faulty.

    Delete the elb_nginx VM. For details, see 1 to 3 in The network node is deployed on a VM, and the OS of the VM on which the network node elb resides is faulty.

    Clear the elb_db VM. For details, see 1 to 3 in The network node is deployed on a VM, and the OS of the VM on which the network node elb resides is faulty.

    Clear the elb_api VM. For details, see 1 to 3 in The network node is deployed on a VM, and the OS of the VM on which the network node elb resides is faulty.

    Delete the NAT gateway VM. For details, see 1 to 3 in The network is deployed on a VM, and the host OS of the VM where network node nat gateway resides is faulty.

  3. On the FusionSphere OpenStack web client, choose O&M > Capacity Expansion and set PXE Boot Hosts to ON.
  4. To use the remote control function of the BMC system of the faulty server, perform the following operations:

    Choose Power > Power Control and click Power Off.

    Choose Configure > Boot Option, set Take effect to One-time and Boot Media to PXE, and click Save.

    Choose Remote Control > Integrated Remote Console and click Remote Virtual Console (Shared Mode).

    On the Remote Virtual Console page, click Power On to start the faulty host in PXE mode. It takes about 10 to 15 minutes to reinstall the host.

    Alternatively, use the keyboard and monitor to directly connect to the server. During the server startup, manually select the network boot mode.

    NOTE:

    The installation will fail (services are unavailable although the OS is installed) if the disk drive letter changes after the installation, which may be caused by the following operations and can be handled according to How Do I Handle Drive Letter Changes?:

    • Adjust the host RAID arrays, including reconstructing, recreating, and deleting RAID arrays.
    • Adjust the FusionSphere OpenStack system disk or disk partitions, including expanding the disk capacity and adding partitions.
    • Attach the SSD card used by the MongoDB service to the host. This issue must be handled according to How Do I Handle Drive Letter Changes?.

    If the reinstallation fails with none of the operations above performed, contact technical support for assistance.

    After the host OS is reinstalled, the passwords of OS accounts, including users root and fsp, will be reset to the default ones.

  5. After the OS is reinstalled, click Summary on the FusionSphere OpenStack web client. Wait until the configuration progress of the faulty board reaches 100%.

    Choose O&M > Capacity Expansion and set PXE Boot Hosts to OFF.

  6. Perform the following steps to restore the six types of VMs on the network node on the basis of 2:

    Restore the vrouter_nat VM. For details about how to rectify the fault, see 4 to 19 in The network is deployed on a VM, and the host OS of the VM where network nodes vrouter and nat-server reside is faulty.

    Restore the elb_lvs VM. For details, see 4 to 12 in The network node is deployed on a VM, and the OS of the VM on which the network node elb resides is faulty.

    Restore the elb_nginx VM. For details about how to rectify the fault, see 4 to 12 in The network node is deployed on a VM, and the OS of the VM on which the network node elb resides is faulty.

    Restore the elb_db VM. For details about how to rectify the fault, see 4 to 12 in The network node is deployed on a VM, and the OS of the VM on which the network node elb resides is faulty.

    Restore the elb_api VM. For details, see 4 to 12 in The network node is deployed on a VM, and the OS of the VM on which the network node elb resides is faulty.

    Restore the NAT VM. For details about how to rectify the fault, see 4 to 19 in The network is deployed on a VM, and the host OS of the VM where network node nat gateway resides is faulty.

  7. On the ManageOne OM plane, choose Alarms > Current Alarms to manually clear all alarms of the faulty host.
  • The network is deployed on a VM, and the host OS of the VM where network nodes vrouter and nat-server reside is faulty.
NOTE:

In this deployment mode, network nodes include the VM (netcluster_vrouter_nat_vm) where vrouter and nat-server reside and VMs netcluster_elb_lvs_vm, netcluster_elb_nginx_vm, netcluster_elb_db_vm, and netcluster_elb_api_vm.

In the following steps, the VM (netcluster_vrouter_nat_vm) where vrouter and nat-server reside is faulty. The current VM cannot be restarted in PXE mode and needs to be deleted and recreated.

  1. Perform the following operations to log in to any normal node in the cascading system. Unless otherwise specified, perform the subsequent operations on this node:

    1. Use PuTTY to log in to any host in the cascading system as user fsp through the IP address of the External OM plane.

      The username is fsp and the default password is Huawei@CLOUD8.

      NOTE:
      • The system supports login authentication using a password or private-public key pair. If a private-public key pair is used for login authentication, see Using PuTTY to Log In to a Node in Key Pair Authentication Mode.
      • For details about the IP address of the External OM plane, see the LLD generated by FCD sheet of the xxx_export_all.xlsm file exported from FusionCloud Deploy during software installation, and search for the IP addresses corresponding to VMs and nodes.
    2. Run the following command to switch to user root, and enter the password of user root as prompted:

    su - root

    The default password of user root is Huawei@CLOUD8!.

    1. Run the TMOUT=0 command to disable user logout upon system timeout.
    2. Import environment variables. For details, see Importing Environment Variables.

  2. Determine the information about the faulty VM.

    The VM (netcluster_vrouter_nat_vm) where vrouter and nat-server reside is faulty. For example, the ID of the faulty VM is 3FF0190C-B8B5-9D4A-ADF8-34FDD5B4697F.

    Run the following commands to assign the variable values:

    vm_host_id=3FF0190C-B8B5-9D4A-ADF8-34FDD5B4697F  #VM ID

    nat_role=`cps host-role-list $vm_host_id | grep nat-server | awk '{print $2}'`

    flag=`echo $nat_role| awk '{print substr($1,11)}'`  #Network exit flag

    Run the following commands:

    echo neutron-vrouter$flag

    neutron-vrouter01

    cps template-ext-params-show --service vrouter neutron-vrouter$flag

    +--------------------------------------------------+----------------------------------------------------+
    | Property                                         | Value                                              |
    +--------------------------------------------------+----------------------------------------------------+
    | neutron_vrouter01.DEFAULT.hostid_for_virt_deploy | 3FF0190C-B8B5-9D4A-ADF8-34FDD5B4697F:62A5E267-2107 |
    |                                                  | -E811-8F9A-B4FBF9AD8203,F35E8E5E-99E7-0747-B2EB-DF |
    |                                                  | 9E1117FAC2:5E2C1AA8-E506-E811-9D39-B4FBF9AD81F1    |
    | neutron_vrouter01.agent.report_interval          | 120                                                |
    | vrouter01.DEFAULT.pod_fip_cidr                   | 100.64.0.0/10                                      |
    +--------------------------------------------------+----------------------------------------------------+

    In the command output, 62A5E267-2107-E811-8F9A-B4FBF9AD8203, which is mapped to vm_host_id in the hostid_for_virt_deploy value, is the ID of the host where the VM is located. Run the following command to assign a value to the variable:

    phy_host_id=62A5E267-2107-E811-8F9A-B4FBF9AD8203

    Run the following commands to obtain the VM name in Nova mapping the VM host:

    mac=`cps host-show $vm_host_id | grep slave1 -A 1 | awk '{if (NR==2) print substr($3,5)}'`

    vm_name=`neutron port-list | grep vrouter_port | grep -i $mac | awk '{print substr($4,16,30)}'`

  3. Delete the faulty VM host. Run the following command:

    echo $flag

    On the FusionSphere OpenStack web client, choose Cloud Deploy > Environment Management, select the environment whose prefix is netcluster_env_ and suffix is 01, and click Component Manage.

    Run the following command to query the name of the VM to be deleted:

    echo $vm_name

    netcluster_vrouter_nat_vm_2_01

    On the FusionSphere OpenStack web client, click Delete of netcluster_vrouter_nat_vm_2_01, and then click Deploy This Environment. Wait until the Status value changes to Ready.

  4. Run the following command to obtain the image required by the vrouter_nat VM:

    image=`nova image-list | grep vrouter_nat_image | awk '{if (NR==1) print $2}'`

  5. Run the following commands to obtain the AZ of the faulty network node:

    echo $phy_host_id  # Checking the host ID

    62A5E267-2107-E811-8F9A-B4FBF9AD8203

    nova availability-zone-list

    +-----------------------------------------+----------------------------------------+
    | Name                                    | Status                                 |
    +-----------------------------------------+----------------------------------------+
    | az0.dc0                                 | available                              |
    | |- 5E2C1AA8-E506-E811-9D39-B4FBF9AD81F1 |                                        |
    | | |- nova-compute                     | enabled :-) 2018-06-07T21:53:10.536249 |
    | |- 62A5E267-2107-E811-8F9A-B4FBF9AD8203 |                                        |
    | | |- nova-compute                       | enabled :-) 2018-06-07T21:53:07.675847 |
    +-----------------------------------------+----------------------------------------+

    The Name value under which the host ID is listed is the AZ to be used for the new VM. Run the following command:

    az=az0.dc0

  6. Run the following command to obtain the VLAN of the External OM plane:

    om_vlan=`neutron net-show external_om | grep provider:segmentation_id | awk '{print $4}'`

  7. On the FusionSphere OpenStack web client, choose O&M > Capacity Expansion and set PXE Boot Hosts to ON.
  8. On the FusionSphere OpenStack web client, choose Cloud Deploy > Environment Management, and click Component Manage in the same environment as in 3. Drag the NetCluster VM application component and drop it to the box, and deploy the network node VM.

    In the displayed dialog box, set the following parameters:

    • echo $vm_name #Name

      netcluster_vrouter_nat_vm_2_01

    • VM PF PortNum: Enter 2:2. For a non-standard environment, set this parameter based on the configuration item description.
    • echo $az #Nova Availability Zone

      az0.dc0

    • echo $flag #Outlet

      01

    • echo $phy_host_id #Host

      62A5E267-2107-E811-8F9A-B4FBF9AD8203

    • echo $image # VM Image

      00b944d7-a84a-428f-b908-b685c5f4da5c

    • VM Flavor: Select flavor_vrouter_nat_normal corresponding to Outlet.
    • Delay Deploy Time (minute): Enter 0.

    Click Next. Configure parameters as follows:

    • netcluster internal base: Select internal_base.
    • external om: Retain the default value.
    • echo $om_vlan #external om vlan

      2043

    Click Create. After the VM configuration is complete, click Deploy This Environment. Wait until the VM status changes to Ready, which takes about 10 minutes.

  9. On the FusionSphere OpenStack web client, choose O&M > Change Board and click Start.

  10. Ensure that the PXE progress of the board reaches 100%. Then, set PXE Boot Hosts to OFF. In the host list, select the new host and click Expand. Run the following command and take note of the new host ID:

    new_vm_host_id=4F52C59B-D048-B5C0-433680D884D6 #New host ID

  11. After the capacity expansion progress of the new host reaches 100%, select the faulty host in the host list and click Next.

  12. In the left pane of the Synchronize Configuration page, select the configuration items in sequence and click Automatic Synchronize in the right pane.

    In the displayed dialog box, click Confirm. After the synchronization is complete, check whether automatic synchronization is successful in the displayed dialog box and click Confirm.

    NOTE:
    • If the ID of the faulty host is the same as the host name, and the host ID has never been changed, a dialog box is displayed during automatic synchronization. In this case, you do not need to synchronize the host name.
    • If the system prompts you to restart the host after automatic synchronization is complete, continue to perform the synchronization. After the process is complete, restart the host.

  13. After the synchronization is complete, click End Process. In the displayed dialog box, click Confirm.
  14. On the FusionSphere OpenStack web client, choose Configuration, select Resource Isolation and Kernel Option, and check whether the configuration status of the new host is Effective After Restart.

    • If yes, go to 15.
    • If no, go to 17.

  15. On the Summary tab of the FusionSphere OpenStack web client, select the host and click Reboot.
  16. On the FusionSphere OpenStack web client, choose Configuration, select Resource Isolation and Kernel Option, and check whether the configuration status of the new host is Effective.

    • If yes, go to 19.
    • If no, contact technical support for assistance.

  17. On the Summary tab of the FusionSphere OpenStack web client, if the faulty host is still in the list, select the faulty host and click Delete to delete the faulty host.
  18. Modify the network component configuration. If the host_id value of the newly created host is different from that configured before, perform the following operations:

    echo $vm_host_id  # Original faulty host ID

    3FF0190C-B8B5-9D4A-ADF8-34FDD5B4697F

    echo $new_vm_host_id  # New host ID

    4F52C59B-D048-B5C0-433680D884D6

    Update the fip_mappings, hm_ip_mappings, lb_mappings, and vrouter_mappings configuration items of neutron-l3-service-agent.

    • Run the following command to query the configuration items of neutron-l3-service-agent:

      cps template-params-show --service neutron neutron-l3-service-agent$flag

    • Update the configuration items of neutron-l3-service-agent and replace $vm_host_id with $new_vm_host_id (see the worked sketch after this step).

      cps template-params-update --service neutron neutron-l3-service-agent$flag --parameter fip_mappings='host-id1:1,host-id2:2' hm_ip_mappings='host-id1:cidr1 cidr2,host-id2:cidr3,cidr4' lb_mappings='host-id1:cidr1,host-id2:cidr2' vrouter_mappings='host-id1:1,host-id2:2'

      cps commit

    Update configuration item ecmp_elbv2_ip_mappings of the vrouter component.

    • Run the following command to query the configuration item of vrouter:

      cps template-params-show --service vrouter neutron-vrouter$flag

    • Run the following commands to update the configuration item and replace $vm_host_id with $new_vm_host_id:

      cps template-params-update --service vrouter neutron-vrouter$flag --parameter ecmp_elbv2_ip_mappings='host-id1:cidr1,host-id2:cidr2'

      cps commit

    Update the extended configuration item neutron_vrouter$flag.DEFAULT.hostid_for_virt_deploy of the vrouter component.

    • Run the following command to query the configuration item of vrouter:

      cps template-ext-params-show --service vrouter neutron-vrouter$flag

    • Run the following commands to update the configuration item and replace $vm_host_id with $new_vm_host_id:

      cps template-ext-params-update --service vrouter neutron-vrouter$flag --parameter neutron_vrouter$flag.DEFAULT.hostid_for_virt_deploy='vm_host1:phy_host1,vm_host2:phy_host2'

      cps commit

    Update configuration item host_ip_mapping of the neutron-l3-nat-agent component.

    • Run the following command to query the configuration item of the neutron-l3-nat-agent component:

      cps template-params-show --service nat-server neutron-l3-nat-agent$flag

    • Run the following commands to update the configuration item and replace $vm_host_id with $new_vm_host_id:

      cps template-params-update --service nat-server neutron-l3-nat-agent$flag --parameter host_ip_mapping='host-id1:cidr1,host-id2:cidr2'

      cps commit
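    As a worked sketch of these updates, the new value for any of the mappings can be derived from the queried value by replacing the old VM host ID with the new one in the shell (the fip_mappings value below is illustrative):

    old_value='3FF0190C-B8B5-9D4A-ADF8-34FDD5B4697F:1,F35E8E5E-99E7-0747-B2EB-DF9E1117FAC2:2'

    new_value=${old_value//$vm_host_id/$new_vm_host_id}

    echo $new_value

    4F52C59B-D048-B5C0-433680D884D6:1,F35E8E5E-99E7-0747-B2EB-DF9E1117FAC2:2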

  19. Run the following commands to check whether network nodes vrouter and nat-server are normal:

    cps template-instance-list --service vrouter neutron-vrouter$flag

    cps template-instance-list --service nat-server neutron-l3-nat-agent$flag

    • If status in the command output is active, log in to the ManageOne OM plane and choose Alarms > Current Alarms to manually clear all alarms of the faulty host.
    • If status in the command output is fault, contact technical support for assistance.

  • The network node is deployed on a VM, and the OS of the VM on which the network node elb resides is faulty.
NOTE:

In this deployment mode, network nodes include the VM (netcluster_vrouter_nat_vm) where vrouter and nat-server reside and VMs netcluster_elb_lvs_vm, netcluster_elb_nginx_vm, netcluster_elb_db_vm, and netcluster_elb_api_vm.

  • netcluster_elb_lvs_vm: LVS node
  • netcluster_elb_nginx_vm: Nginx node
  • netcluster_elb_db_vm: MySQL database node
  • netcluster_elb_api_vm: ELB management node

The following steps are performed for the VM (netcluster_elb_lvs_vm, netcluster_elb_nginx_vm, netcluster_elb_db_vm, or netcluster_elb_api_vm) that encounters a fault. Perform the following steps based on the type of the faulty VM and skip the steps of other VM types.

The faulty VM name, for example, netcluster_elb_lvs_vm_2_01, must be obtained. Operations can be performed on only one VM at a time.

  1. Perform the following operations to log in to any normal node in the cascading system. Unless otherwise specified, perform the subsequent operations on this node:

    1. Use PuTTY to log in to any host in the cascading system as user fsp through the IP address of the External OM plane.

      The username is fsp and the default password is Huawei@CLOUD8.

      NOTE:
      • The system supports login authentication using a password or private-public key pair. If a private-public key pair is used for login authentication, see Using PuTTY to Log In to a Node in Key Pair Authentication Mode.
      • For details about the IP address of the External OM plane, see the LLD generated by FCD sheet of the xxx_export_all.xlsm file exported from FusionCloud Deploy during software installation, and search for the IP addresses corresponding to VMs and nodes.
    2. Run the following command to switch to user root, and enter the password of user root as prompted:

    su - root

    The default password of user root is Huawei@CLOUD8!.

    1. Run the TMOUT=0 command to disable user logout upon system timeout.
    2. Import environment variables. For details, see Importing Environment Variables.

  2. Determine the information about the faulty VM. For example, if the name of the faulty VM is netcluster_elb_lvs_vm_2_01, run the following commands:

    vm_name=netcluster_elb_lvs_vm_2_01  # Name of the faulty VM

    flag=`echo $vm_name | awk '{print substr($1,length($1)-1)}'`  # Network egress flag

    phy_host_id=`nova show $vm_name | grep hypervisor_hostname | awk '{print $4}'`  # Physical host ID

    If netcluster_elb_api_vm or netcluster_elb_db_vm is faulty, run the following command to ensure that the External OM IP address of the new VM is the same as that of the original VM:

    om_ip=`nova list | grep $vm_name | awk '{print substr($12,13)}'`

  3. Run the following command to delete the faulty VM:

    echo $flag

    01

    On the FusionSphere OpenStack web client, choose Cloud Deploy > Environment Management, select the environment whose prefix is netcluster_env_ and suffix is 01, and click Component Manage.

    Run the following command to query the name of the VM to be deleted:

    echo $vm_name

    netcluster_elb_lvs_vm_2_01

    On the FusionSphere OpenStack web client, click Delete of netcluster_elb_lvs_vm_2_01, and then click Deploy This Environment. Wait until the Status value changes to Ready.

  4. Run the following command to obtain the image required by the ELB VM:

    image=`nova image-list | grep image-kvm-euler | awk '{if (NR==1) print $2}'`

  5. Run the following commands to obtain the AZ of the faulty network node:

    echo $phy_host_id  # Checking the host ID

    62A5E267-2107-E811-8F9A-B4FBF9AD8203

    nova availability-zone-list

    +-----------------------------------------+----------------------------------------+
    | Name                                    | Status                                 |
    +-----------------------------------------+----------------------------------------+
    | az0.dc0                                 | available                              |
    | |- 5E2C1AA8-E506-E811-9D39-B4FBF9AD81F1 |                                        |
    | | |- nova-compute                       | enabled :-) 2018-06-07T21:53:10.536249 |
    | |- 62A5E267-2107-E811-8F9A-B4FBF9AD8203 |                                        |
    | | |- nova-compute                       | enabled :-) 2018-06-07T21:53:07.675847 |
    +-----------------------------------------+----------------------------------------+

    The Name value under which the host ID is listed is the AZ to be used for the new VM. Run the following command:

    az=az0.dc0

  6. Run the following command to obtain the VLAN of the External OM plane:

    om_vlan=`neutron net-show external_om | grep provider:segmentation_id | awk '{print $4}'`

  7. On the FusionSphere OpenStack web client, choose Cloud Deploy > Environment Management to enter the page in 3, drag the NetCluster VM application component and drop it to the box, and deploy the network node VM.

    In the displayed dialog box, set the following parameters:

    • echo $vm_name #Name

      netcluster_elb_lvs_vm_2_01

    • Set VM PF PortNum as follows:
      • netcluster_elb_lvs_vm and netcluster_elb_nginx_vm: Enter 2.
      • netcluster_elb_db_vm and netcluster_elb_api_vm: Enter 0.
    • echo $az #Nova Availability Zone

      az0.dc0

    • echo $flag #Outlet

      01

    • echo $phy_host_id #Host

      62A5E267-2107-E811-8F9A-B4FBF9AD8203

    • echo $image # VM Image

      8a89148e-0b40-46b1-80be-c60b33211c7e

    • Set VM Flavor as follows:
      • netcluster_elb_lvs_vm and netcluster_elb_nginx_vm: Enter flavor_elb_lvs_nginx_normal corresponding to Outlet.
      • netcluster_elb_db_vm and netcluster_elb_api_vm: Enter flavor_elb_db_api_normal corresponding to Outlet.
    • Delay Deploy Time (minute): Enter 0.

    Click Next. Configure parameters as follows:

    • netcluster internal base: Select internal_base.
    • external om: Retain the default value.
    • echo $om_vlan #external om vlan

      2043

    Click Create. After the VM configuration is complete, click Deploy This Environment. Wait until the VM status changes to Ready, which takes about 15 to 20 minutes.

  8. To ensure normal interconnection between VMs, run the following commands to ensure that the External OM IP address of the new VM is the same as that of the original VM:

    new_om_ip=`nova list | grep $vm_name | awk '{print substr($12,13)}'`

    new_port_id=`nova interface-list $vm_name | grep "$new_om_ip" | awk '{print $4}'`

    neutron port-update $new_port_id --fixed-ip ip_address=$om_ip

    nova reboot $vm_name

    After the VM is restarted successfully, the value of status is ACTIVE in the nova show $vm_name command output.

  9. Partition the ELB VM disk. For details, see "Disk partition of the ELB VM" in the FusionCloud 6.3.1.1 Parts Replacement. Run the following command to obtain the IP address of the ELB VM node:

    nova list | grep elb | grep -v etcd | grep ${vm_name:0-4}

    | e1934e8b-92de-4f74-84c1-99dd32ea5cc0 | netcluster_elb_api_vm_2_01     | ACTIVE | -          | Running     | external_om=4.20.43.76                       |
    | c019d55d-9a65-40b1-9140-9f79a4c87a9b | netcluster_elb_db_vm_2_01      | ACTIVE | -          | Running     | external_om=4.20.43.183                      |
    | 018fa2b2-e636-4aae-80b4-3db5d7b78ea0 | netcluster_elb_lvs_vm_2_01     | ACTIVE | -          | Running     | external_om=4.20.43.140                      |
    | 97583d6a-7bf8-49de-84d-6abea4eb8015 | netcluster_elb_nginx_vm_2_01   | ACTIVE | -          | Running     | external_om=4.20.43.92                       |
    NOTE:

    If only one VM is faulty, you can enter the information about the four VMs in the configuration file of the DMK node. Typically, the previous configuration information is still available. You only need to change the IP address of the faulty VM (new VM).

    In the configuration file, mysql corresponds to the elb_db VM.

  10. Manually configure the ELB VM NIC. For details, see "Manually Configuring NICs for the ELB VM" in the FusionCloud 6.3.1.1 Parts Replacement. Run the following command to obtain the OM IP address corresponding to the normal ELB VM node:

    nova list | grep elb | grep -v etcd | grep -v ${vm_name:0-4} | grep ${vm_name:0:${#vm_name}-4}

    | 3f06d89c-eb4a-4e3a-89a2-5c19ac69c845 | netcluster_elb_lvs_vm_1_01     | ACTIVE | - | Running | external_om=4.20.43.59 |

    The OM IP of the faulty VM (new VM) is as follows:

    nova list | grep $vm_name

    | 018fa2b2-e636-4aae-80b4-3db5d7b78ea0 | netcluster_elb_lvs_vm_2_01     | ACTIVE | - | Running | external_om=4.20.43.140 |

    NOTE:

    Perform only the steps related to the faulty VM and skip the steps related to other types of VMs. For example, perform only the steps related to the elb_lvs VM.

  11. Deploy the created ELB VM. For details, see "Manually Deploying the ELB VM" in the FusionCloud 6.3.1.1 Parts Replacement. Perform the following steps based on the type of the faulty VM and skip the steps of other VM types.

    • netcluster_elb_lvs_vm: includes common and LVS operations, excluding Nginx, DB, API, and etcd operations.
    • netcluster_elb_nginx_vm: includes common and Nginx operations, excluding LVS, DB, API, and etcd operations.
    • netcluster_elb_db_vm: includes common, DB, and MySQL operations, excluding LVS, Nginx, API, and etcd operations.
    • netcluster_elb_api_vm: includes common, API, and management node operations, excluding LVS, Nginx, DB, and etcd operations.
    NOTE:

    When using DMK to deploy a faulty ELB VM, you need to copy the configuration file and node configuration file from the previous record. Details are described as follows.

    On the Task Board page:

    • Select ELB for ALL TASK.
    • Select Succeeded for ALL Status.
    • Set Action to Deploy.

    If there are multiple records, check whether the configuration of the task can be inherited from top to bottom. To be specific, click Details on the right of a record. If the OM IP address of the faulty VM (newly created VM) is in the node configuration file, copy the content on the left of the configuration file and node configuration file to the required location of the new task. Otherwise, check the records in sequence until the configuration file and node configuration file are found.

    In the node configuration file of the new task, retain the configuration section and OM IP address of the faulty VM, and delete the configuration sections and OM IP addresses of other normal VMs. In addition, if the elb_db VM is deployed, set the IP address of the faulty VM to slave in the configuration file and set the IP address of the normal VM to master. For example:

    database_ips:

    elb_database_host_master: "99.99.12.108"  # IP address of the master mysql node (normal VM IP address)

    elb_database_host_slave: "99.99.12.94"  # IP address of the slave mysql node (faulty VM IP address)

  12. On the ManageOne OM plane, choose Alarms > Current Alarms to manually clear all alarms of the faulty host.
  • The network is deployed on a VM, and the host OS of the VM where network node nat gateway resides is faulty.
NOTE:

In this deployment mode, network nodes include the VM (netcluster_vrouter_nat_vm) where vrouter and nat-server reside and VMs netcluster_elb_lvs_vm, netcluster_elb_nginx_vm, netcluster_elb_db_vm, netcluster_elb_api_vm, and netcluster_nat_gateway_vm.

In the following steps, the VM (netcluster_nat_gateway_vm) where the nat gateway resides is faulty. The current VM cannot be restarted in PXE mode and needs to be deleted and recreated.

  1. Perform the following operations to log in to any normal node in the cascading system. Unless otherwise specified, perform the subsequent operations on this node:

    1. Use PuTTY to log in to any host in the cascading system as user fsp through the IP address of the External OM plane.

      The username is fsp and the default password is Huawei@CLOUD8.

      NOTE:
      • The system supports login authentication using a password or private-public key pair. If a private-public key pair is used for login authentication, see Using PuTTY to Log In to a Node in Key Pair Authentication Mode.
      • For details about the IP address of the External OM plane, see the LLD generated by FCD sheet of the xxx_export_all.xlsm file exported from FusionCloud Deploy during software installation, and search for the IP addresses corresponding to VMs and nodes.
    2. Run the following command to switch to user root, and enter the password of user root as prompted:

    su - root

    The default password of user root is Huawei@CLOUD8!.

    1. Run the TMOUT=0 command to disable user logout upon system timeout.
    2. Import environment variables. For details, see Importing Environment Variables.

  2. Determine the information about the faulty VM.

    The VM (netcluster_nat_gateway_vm) where nat-gateway-data resides is faulty. For example, the ID of the faulty VM is 3FF0190C-B8B5-9D4A-ADF8-34FDD5B4697F.

    Run the following commands to assign the variable values:

    vm_host_id=3FF0190C-B8B5-9D4A-ADF8-34FDD5B4697F  # VM ID

    nat_cluster=`cps cluster-list | grep $vm_host_id | awk '{print $2}'`

    flag=`echo ${nat_cluster:6:2}`  # Network exit flag

    Run the following command:

    cps template-ext-params-show --service nat-gateway neutron-nat-gw-dataplane --cluster $nat_cluster

    +----------------------------------------------------+----------------------------------------------------+
    | Property                                           | Value                                              |
    +----------------------------------------------------+----------------------------------------------------+
    | neutron_nat_gateway_dataplane.DEFAULT.hostid_for_v | 3FF0190C-B8B5-9D4A-ADF8-34FDD5B4697F:62A5E267-2107 |
    | irt_deploy                                         | -E811-8F9A-B4FBF9AD8203                            |
    +----------------------------------------------------+----------------------------------------------------+

    In the command output, 62A5E267-2107-E811-8F9A-B4FBF9AD8203 corresponding to vm_host_id is the ID of the host where the VM is located. Run the following command to assign a value to the variable:

    phy_host_id=62A5E267-2107-E811-8F9A-B4FBF9AD8203

    Run the following commands to obtain the VM name in Nova mapping the VM host:

    mac=`cps host-show $vm_host_id | grep slave1 -A 1 | awk '{if (NR==2) print substr($3,5)}'`

    vm_name=`neutron port-list | grep vrouter_port | grep -i $mac | awk '{print substr($4,16,30)}'`

  3. Run the following command to delete the faulty VM host:

    echo $flag

    01

    On the FusionSphere OpenStack web client, choose Cloud Deploy > Environment Management, select the environment whose prefix is netcluster_env_ and suffix is 01, and click Component Manage.

    Run the following command to query the name of the VM to be deleted:

    echo $vm_name

    netcluster_nat_gateway_vm_1_01

    On the FusionSphere OpenStack web client, click Delete of netcluster_nat_gateway_vm_1_01, and then click Deploy This Environment. Wait until the Status value changes to Ready.

  4. Run the following command to obtain the image required by the NAT gateway VM:

    image=`nova image-list | grep snat_image | awk '{if (NR==1) print $2}'`

  5. Run the following commands to obtain the AZ of the faulty network node:

    echo $phy_host_id  # Checking the host ID

    62A5E267-2107-E811-8F9A-B4FBF9AD8203

    nova availability-zone-list

    +-----------------------------------------+----------------------------------------+
    | Name                                    | Status                                 |
    +-----------------------------------------+----------------------------------------+
    | az0.dc0                                 | available                              |
    | |- 5E2C1AA8-E506-E811-9D39-B4FBF9AD81F1 |                                        |
    | | |- nova-compute                       | enabled :-) 2018-06-07T21:53:10.536249 |
    | |- 62A5E267-2107-E811-8F9A-B4FBF9AD8203 |                                        |
    | | |- nova-compute                       | enabled :-) 2018-06-07T21:53:07.675847 |
    +-----------------------------------------+----------------------------------------+

    The Name value under which the host ID is listed is the AZ to be used for the new VM. Run the following command:

    az=az0.dc0

  6. Run the following command to obtain the VLAN of the External OM plane:

    om_vlan=`neutron net-show external_om | grep provider:segmentation_id | awk '{print $4}'`

  7. On the FusionSphere OpenStack web client, choose O&M > Capacity Expansion and set PXE Boot Hosts to ON.
  8. On the FusionSphere OpenStack web client, choose Cloud Deploy > Environment Management and click Component Manage in the same environment as in 3. Drag the NetCluster VM application component and drop it to the box, and deploy the network node VM.

    In the displayed dialog box, set the following parameters:

    • echo $vm_name #Name

      netcluster_nat_gateway_vm_1_01
    • VM PF PortNum: Enter 2:2. For a non-standard environment, set this parameter based on the configuration item description.
    • echo $az #Nova Availability Zone

      az0.dc0

    • echo $flag #Outlet

      01

    • echo $phy_host_id #Host

      62A5E267-2107-E811-8F9A-B4FBF9AD8203

    • echo $image # VM Image

      00b944d7-a84a-428f-b908-b685c5f4da5c

    • VM Flavor: Select flavor_snat_normal corresponding to Outlet.
    • Delay Deploy Time (minute): Enter 0.

    Click Next. Configure parameters as follows:

    • netcluster internal base: Select internal_base.
    • external om: Retain the default value.
    • echo $om_vlan #external om vlan

      2043

    Click Create. After the VM configuration is complete, click Deploy This Environment. Wait until the VM status changes to Ready, which takes about 10 minutes.

  9. On the FusionSphere OpenStack web client, choose O&M > Change Board and click Start.

  10. Ensure that the PXE progress of the board reaches 100%. Then, set PXE Boot Hosts to OFF. In the host list, select the new host and click Expand. Run the following command and take note of the new host ID:

    new_vm_host_id=4F52C59B-D048-B5C0-433680D884D6 #New host ID

  11. After the capacity expansion progress of the new host reaches 100%, select the faulty host in the host list and click Next.

  12. In the left pane of the Synchronize Configuration page, select the configuration items in sequence and click Automatic Synchronize in the right pane.

    In the displayed dialog box, click Confirm. After the synchronization is complete, check whether automatic synchronization is successful in the displayed dialog box and click Confirm.

    NOTE:
    • If the ID of the faulty host is the same as the host name, and the host ID has never been changed, a dialog box is displayed during automatic synchronization. In this case, you do not need to synchronize the host name.
    • If the system prompts you to restart the host after automatic synchronization is complete, continue to perform the synchronization. After the process is complete, restart the host.

  13. After the synchronization is complete, click End Process. In the displayed dialog box, click Confirm.
  14. On the FusionSphere OpenStack web client, choose Configuration, select Resource Isolation and Kernel Option, and check whether the configuration status of the new host is Effective After Restart.

    • If yes, go to 15.
    • If no, go to 17.

  15. On the Summary tab of the FusionSphere OpenStack web client, select the host and click Reboot.
  16. On the FusionSphere OpenStack web client, choose Configuration, select Resource Isolation and Kernel Option, and check whether the configuration status of the new host is Effective.

    • If yes, go to 19.
    • If no, contact technical support for assistance.

  17. On the Summary tab of the FusionSphere OpenStack web client, if the faulty host is still in the list, select the faulty host and click Delete to delete the faulty host.
  18. Modify the network component configuration. If the host_id value of the newly created host is different from that configured before, perform the following operations:

    echo $vm_host_id  # Original faulty host ID

    3FF0190C-B8B5-9D4A-ADF8-34FDD5B4697F

    echo $new_vm_host_id  # New host ID

    4F52C59B-D048-B5C0-433680D884D6

    Update configuration item hostid_for_virt_deploy of neutron-nat-gw-dataplane.

    • Run the following command to query the configuration items of neutron-nat-gw-dataplane:

      cps template-ext-params-show --service nat-gateway neutron-nat-gw-dataplane --cluster $nat_cluster

    • Update the configuration items of neutron-nat-gw-dataplane and replace $vm_host_id with $new_vm_host_id.

      cps template-ext-params-update --service nat-gateway neutron-nat-gw-dataplane --parameter neutron_nat_gateway_dataplane.DEFAULT.hostid_for_virt_deploy="$new_vm_host_id:$phy_host_id" --cluster $nat_cluster

      cps commit

  19. Run the following commands to check whether nat-gateway is normal:

    cps template-instance-list --service nat-gateway neutron-nat-gw-dataplane --host $new_vm_host_id

    cps template-instance-list --service nat-gateway neutron-nat-gw-data-agent --host $new_vm_host_id

    • If status in the command output is active, log in to the ManageOne OM plane and choose Alarms > Current Alarms to manually clear all alarms of the faulty host.
    • If status in the command output is fault, contact technical support for assistance.

A Fault Occurs If the System Disk of a VM Is Detached

Symptom

After the system disk is detached from the VM, the VM becomes faulty, and no new system disk can be attached to the VM.

Possible Causes
  • The host is powered off or fails.
  • The VM status is abnormal.
Procedure
  1. Use PuTTY to log in to FusionSphere OpenStack through the IP address of the External OM plane.

    The username is fsp and the default password is Huawei@CLOUD8.
    NOTE:
    • The system supports login authentication using a password or private-public key pair. If a private-public key pair is used for login authentication, see Using PuTTY to Log In to a Node in Key Pair Authentication Mode.
    • For details about the IP address of the External OM plane, see the LLD generated by FCD sheet of the xxx_export_all.xlsm file exported from FusionCloud Deploy during software installation, and search for the IP addresses corresponding to VMs and nodes.The parameter names in different scenarios are as follows:
      • Cascading layer in the Region Type I scenario : Cascading-ExternalOM-Reverse-Proxy, Cascaded layer : Cascaded-ExternalOM-Reverse-Proxy.
      • Region Type II and Type III scenarios : ExternalOM-Reverse-Proxy.

  2. Run the following command to switch to the root user, and enter the root password as prompted:

    su - root

    The default password of the root user is Huawei@CLOUD8!.

  3. Run the TMOUT=0 command to disable user logout upon system timeout.
  4. Import environment variables. For details, see Importing Environment Variables.
  5. Run the following command to query the host accommodating the faulty VM:

    nova show Faulty VM ID | grep host

    The faulty VM ID can be obtained from FusionSphere OpenStack Management Console.

    Record the value of OS-EXT-SRV-ATTR:host in the command output. This value is the ID of the host accommodating the faulty VM.

  6. Run the following command to check whether the host is running properly:

    cps host-list | grep Host ID

    The host ID is the value obtained in the preceding command.

    If the host status in the command output is normal, the host is running properly.

    • If yes, go to 9.
    • If no, go to 7.

      If the host status is abnormal, record the internal management IP address of the host.

  7. Perform either of the following operations to restore the host:

    1. Run the su - fsp command to switch to the fsp user, and run the ssh fsp@Host IP address command to log in to the host as the fsp user and then switch to the root user.

      Enter the private key password as prompted. The default password is Huawei@CLOUD8!. If you have successfully replaced the public and private key files, enter the new private key password. Alternatively, press Enter and type the password of the fsp user to log in.

    2. Run the reboot command to restart the host.

      If the connection to the host cannot be set up due to host faults, use another method, such as the host baseboard management controller (BMC), to restart the host.

      You can query the host status by performing 6. The host has been restored if its status changes to normal within 10 minutes.

  8. Check whether the host is restored.

    • If yes, go to 9.
    • If no, go to 13 to rebuild the faulty VM.
    NOTE:

    This rebuilding method will change the VM ID and network information. In addition, if the VM uses local disks, the disk information will be lost after VM rebuilding.

  9. Run the following command to check the VM status:

    nova show Faulty VM ID | grep status

    • If the VM is in the SHUTOFF state, go to 12.
    • If the VM is in another state, go to 10.

  10. Run the following command to stop the VM:

    nova stop Faulty VM ID

  11. Perform 9 to check the VM status again.

    • If the VM is in the SHUTOFF state, go to 12.
    • If the VM is in another state, contact technical support for assistance.

  12. Attach a new system disk to the VM.

    • If the attaching is successful, no further action is required.
    • If the attaching fails, contact technical support for assistance.
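    If the system disk is attached from the CLI rather than from the console, the command typically looks like the following (a minimal sketch; the volume ID and the device name /dev/vda are illustrative, and the attachment method in your environment may differ):

    nova volume-attach Faulty VM ID 8458dbff-1acd-4445-a3ea-751b6c4a8d80 /dev/vda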

  13. Run the following command to check the VM status and take note of the VM attributes:

    nova show Faulty VM ID

    Table 7-1 lists the VM attributes.

    Table 7-1 VM attributes

    os-extended-volumes:volumes_attached
      Description: ID of the disk attached to the VM. Multiple disks can be attached to a VM.
      Example value: 8458dbff-1acd-4445-a3ea-751b6c4a8d80

    tenant_id
      Description: ID of the tenant who owns the VM.
      Example value: bca6f4e8b2034d3eb93e7c94e897d619

    user_id
      Description: ID of the user who created the VM.
      Example value: 3cf46a44f05642149c1c9273913429cc

    config_drive
      Description: Whether the config_drive disk is used for file injection.
      Example value: True or left blank

    flavor
      Description: Flavor used by the VM.
      Example value: m1.tiny (1)

    metadata
      Description: Metadata specified for the VM.
      Example value: {}

    name
      Description: Name of the VM.
      Example value: Test01

    security_groups
      Description: Security group information.
      Example value: default

    tags
      Description: Tag information.
      Example value: []
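    If you only need the attributes used later in this procedure, the following commands are a convenience sketch (Faulty VM ID is a placeholder; the attribute names are those listed in Table 7-1):

    nova show Faulty VM ID | grep tenant_id
    nova show Faulty VM ID | grep user_id
    nova show Faulty VM ID | grep os-extended-volumes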

  14. Run the following command to import the environment variable, using the tenant_id value recorded in 13:

    export tenant_id=tenant_id
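    For example, using the example tenant_id from Table 7-1 (replace it with the actual value recorded in 13):

    export tenant_id=bca6f4e8b2034d3eb93e7c94e897d619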

  15. Use the visual interface (vi) editor to create file d.json and write the following content into the file:

    { 
        "auth": { 
            "identity": { 
                "methods": [ 
                    "password" 
                ], 
                "password": { 
                    "user": { 
                        "domain": { 
                            "name": "vdc_name" 
                        }, 
                        "name": "username", 
                        "password": "password" 
                    } 
                } 
            }, 
            "scope": { 
                "project": { 
                    "domain": { 
                        "name": "vdc_name" 
                    }, 
                    "name": "vpc_name" 
                } 
            } 
        } 
    }     

    In the preceding content,

    • username: Enter the username based on the user_id of the VM. To obtain the username, log in to the FusionSphere OpenStack Management Console web client as the cloud_admin user, choose System > User Management, and locate the user who has the same ID as the user_id of the VM in the ID column.
    • password: Enter the password of the user.
    • vdc_name: Contact the administrator to check whether the username belongs to a VDC by using the service provisioning tool in use. If the username belongs to a VDC, enter the VDC name. If it does not belong to any VDC, enter Default.
    • vpc_name: Contact the administrator to check the ID of each VPC in the service provisioning tool and find the VPC whose tenant_id is the same as that of the VM. Then, enter the VPC name. If no matching VPC is found, run the openstack project list | grep tenant_id command in the FusionSphere OpenStack system that you have logged in to. In the command output, the second field displays the target VPC name.
    NOTE:

    If the preceding parameters cannot be queried, contact technical support for assistance.
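    Optionally, before continuing, you can check that d.json is syntactically valid JSON. This quick check is not part of the original procedure and assumes Python is available on the node:

    python -m json.tool d.json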

  16. Run the following command to import the token environment variables:

    export TOKEN=$(curl -ki -d @d.json -H "Content-type: application/json" https://identity.localdomain.com:8023/identity/v3/auth/tokens | awk '/X-Subject-Token/ {print $2}' | tr -d '\r')

    NOTE:

    To ensure account security, delete the d.json file created in 15 after this step is complete.
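    Before deleting d.json, you can confirm that the token variable was populated (a quick sanity check, not part of the original procedure). A result of 0 means the authentication request failed and should be retried:

    echo ${#TOKEN}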

  17. Run the following command to query VM network information and record the network ID:

    nova interface-list VM ID

    Information similar to the following is displayed:

     
    +------------+--------------------------------------+--------------------------------------+--------------+-------------------+  
    | Port State | Port ID                              | Net ID                               | IP addresses | MAC Addr          |  
    +------------+--------------------------------------+--------------------------------------+--------------+-------------------+  
    | ACTIVE     | 60b5961f-8afb-4735-9f3c-368d414857d2 | 04c20a58-b01f-4d18-b3e9-6c49ce78a22c | 192.168.211.6| fa:16:3e:56:d3:71 |  
    +------------+--------------------------------------+--------------------------------------+--------------+-------------------+

    If the VM uses multiple networks, record all network IDs.

  18. Run the following command to create a port on the network used by the faulty VM and record the port ID:

    curl -i --insecure -X POST https://network.localdomain.com:8020/v2.0/ports.json -H "User-Agent: python-neutronclient" -H "Content-Type: application/json" -H "Accept: application/json" -H "X-Auth-Token: $TOKEN" -d '{"port":{"network_id": "Net_ID", "tenant_id": "'tenant_id'", "binding:vnic_type": "vNIC_type", "name": "port_name", "admin_state_up": true}}'

    In this command, the value of vNIC_type can be direct (PCI passthrough) or normal (EVS or OVS).

    Example:

    curl -i --insecure -X POST https://network.localdomain.com:8020/v2.0/ports.json -H "User-Agent: python-neutronclient" -H "Content-Type: application/json" -H "Accept: application/json" -H "X-Auth-Token: $TOKEN" -d '{"port": {"network_id": "04c20a58-b01f-4d18-b3e9-6c49ce78a22c", "tenant_id": "'$tenant_id'", "binding:vnic_type": "normal", "name": "lt-test", "admin_state_up": true}}'

    Record the port ID, for example, 3f7ebd45-9a96-474c-88e6-5e3bf6e018cc.

  19. Run the following command to create a snapshot for the disk on the faulty VM:

    curl -i --insecure -X POST https://volume.localdomain.com:8776/v2/${tenant_id}/snapshots -H "User-Agent: python-cinderclient" -H "Content-Type: application/json" -H "Accept: application/json" -H "X-Auth-Token: $TOKEN" -d '{"snapshot": {"description": description, "metadata": {metadata}, "force": "True", "name": "snapshot_name", "volume_id": "volume_id"}}'

    Example:

    curl -i --insecure -X POST https://volume.localdomain.com:8776/v2/${tenant_id}/snapshots -H "User-Agent: python-cinderclient" -H "Content-Type: application/json" -H "Accept: application/json" -H "X-Auth-Token: $TOKEN" -d '{"snapshot": {"description": null, "metadata": {}, "force": "True", "name": "data_snap01", "volume_id": "8458dbff-1acd-4445-a3ea-751b6c4a8d80"}}'

    Record the snapshot ID, for example, 1e5b7681-faa1-4de0-9d80-ee3e2a606eeb.

    Run the following command repeatedly to query the snapshot status:

    cinder snapshot-show Snapshot ID

    The snapshot is successfully created if its status changes to available. If the VM has multiple disks, create a snapshot for each disk.
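    If you prefer not to re-run the query manually, the following loop is a minimal sketch that polls the snapshot until it becomes available (the snapshot ID is the example value above; the 10-second interval is arbitrary):

    snapshot_id=1e5b7681-faa1-4de0-9d80-ee3e2a606eeb
    while true; do
        status=$(cinder snapshot-show ${snapshot_id} | awk '/ status / {print $4}')
        echo "Snapshot status: ${status}"
        [ "${status}" = "available" ] && break
        sleep 10
    done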

  20. Run the following command to create a disk using the snapshot of the original disk:

    curl -i --insecure -X POST https://volume.localdomain.com:8776/v2/${tenant_id}/volumes -H "User-Agent: python-cinderclient" -H "Content-Type: application/json" -H "Accept: application/json" -H "X-Auth-Token: $TOKEN" -d '{"volume": {"name": "new_volume_name", "availability_zone": AZname,"metadata": {metadata},"snapshot_id": "snapshot_id"}}'

    In this command, AZname specifies the AZ to which the disk belongs. You can run the cinder show Disk ID command to query the AZ of the original VM disk. The AZ of the new disk must be the same as that of the original disk.

    Example:

    curl -i --insecure -X POST https://volume.localdomain.com:8776/v2/${tenant_id}/volumes -H "User-Agent: python-cinderclient" -H "Content-Type: application/json" -H "Accept: application/json" -H "X-Auth-Token: $TOKEN" -d '{"volume": {"name": "data01_volume", "availability_zone": null,"metadata": {},"snapshot_id": "1e5b7681-faa1-4de0-9d80-ee3e2a606eeb"}}'

    Record the ID of the new disk, for example, 2ffa5677-0a13-4ea6-bcdc-1575b40780a8.

    If multiple snapshots have been created, create a disk using each of them.
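    As a convenience for the AZ query mentioned in this step, the following sketch extracts the availability zone of the original disk (the disk ID is the example value from Table 7-1):

    cinder show 8458dbff-1acd-4445-a3ea-751b6c4a8d80 | awk '/ availability_zone / {print $4}'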

  21. In the service provisioning system, create a VM.

    The disks attached to the VM must be virtual disks, and the attributes and network information of the new VM must be the same as those of the faulty VM.

    If the VM creation fails in the service provisioning system, run the following command:

    curl -i --insecure 'https://compute.localdomain.com:8001/v2/'$tenant_id'/os-volumes_boot' -X POST -H "Accept: application/json" -H "Content-Type: application/json" -H "User-Agent: python-novaclient" -H "X-Auth-Token: $TOKEN" -d '{"server": {"name": "vm_name", "imageRef": "", "block_device_mapping_v2": [{"source_type": "source_type", "destination_type": "destination_type", "boot_index": "boot_index", "uuid": "source_id", "volume_size": "volume_size"}], "flavorRef": "flavorRef", "max_count": max_count, "min_count": min_count, "networks": [{"port": "port_id"}], "config_drive": false}}'

    Example:

    curl -i --insecure 'https://compute.localdomain.com:8001/v2/'$tenant_id'/os-volumes_boot' -X POST -H "Accept: application/json" -H "Content-Type: application/json" -H "User-Agent: python-novaclient" -H "X-Auth-Token: $TOKEN" -d '{"server": {"name": "new_vm_02", "imageRef": "", "block_device_mapping_v2": [{"source_type": "image", "destination_type": "volume", "boot_index": "0", "uuid": "5f6a4b1d-1815-4c33-9c50-710fac909cc0", "volume_size": "1"}], "flavorRef": "1", "max_count": 1, "min_count": 1, "networks": [{"port": "3f7ebd45-9a96-474c-88e6-5e3bf6e018cc"}], "config_drive": false}}'

    After the VM is created, record the VM ID. Run the following command repeatedly to query the VM status:

    nova show VM ID
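    Rather than re-running the command by hand, you can watch the status field until it changes (a convenience sketch, assuming the watch utility is available on the node; VM ID is a placeholder and the 10-second interval is arbitrary):

    watch -n 10 "nova show VM ID | grep status"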

  22. After the VM status changes to active (the VM has been successfully created), attach the disks created in 20 to the new VM in the service provisioning system.

    If the disk attaching fails in the service provisioning system, perform the following operations:

    1. Run the following command to import the environment variables of the VM:

      export uuid=vm_id

    2. Run the following command to attach disks to the VM:

      curl -i --insecure "https://compute.localdomain.com:8001/v2/$TENANT_ID/servers/${uuid}/os-volume_attachments" -X POST -H "X-Auth-Project-Id: service" -H "X-Auth-Token: $TOKEN_ID" -H "Content-Type: application/json" -H "Accept: application/json" -H "User-Agent: python-novaclient" -d '{"volumeAttachment": {"volumeId": "volumeId"}}'

      Example:

      curl -i --insecure "https://compute.localdomain.com:8001/v2/$TENANT_ID/servers/${uuid}/os-volume_attachments" -X POST -H "X-Auth-Project-Id: service" -H "X-Auth-Token: $TOKEN_ID" -H "Content-Type: application/json" -H "Accept: application/json" -H "User-Agent: python-novaclient" -d '{"volumeAttachment": {"volumeId": "3f28cb04-c9f9-4ae4-b350-6dd6da80f41d"}}'

  23. Delete the faulty VM after the new VM runs properly.

    You can also delete the original disks and the disk snapshots created in 19 if they are no longer needed.

Handling the Issue that the OS of a Controller Node VM in a Cascaded FusionSphere OpenStack System Is Faulty (Region Type I)

If the OS of the VM in the cascaded system is faulty, you need to reinstall the OS for the VM.

Symptom

The output of the cps host-show host_id command shows that the host is in the fault state.

Prerequisites
  • You have obtained the ID of the faulty controller node in the cascaded system. This section uses 55c6f891-ae56-4cd5-b348-287d3a433ffc as an example.
  • You have logged in to the cascading and cascaded FusionSphere OpenStack web clients.
Procedure
  1. On the web client of the cascading FusionSphere OpenStack system that you have logged in to, click Virtual Deploy to switch to the Environment Management page.

  2. Click the name of the virtualization deployment environment of the cascaded FusionSphere OpenStack system and query names of components with the type of CascadedVM.

  3. Use PuTTY to log in to the first node in the cascading FusionSphere OpenStack system.

    Ensure that the reverse proxy IP address of the External OM plane and username fsp are used to establish the connection. The default password of user fsp is Huawei@CLOUD8.

  4. Run the following command to switch to the root user:

    su - root

    The default password of the root user is Huawei@CLOUD8!.

    NOTE:
    • The system supports login authentication using a password or private-public key pair. If a private-public key pair is used for login authentication, see Using PuTTY to Log In to a Node in Key Pair Authentication Mode.
    • For details about the IP address of the External OM plane, see the LLD generated by FCD sheet of the xxx_export_all.xlsm file exported from FusionCloud Deploy during software installation, and search for the IP addresses corresponding to VMs and nodes.

  5. Run the TMOUT=0 command to disable user logout upon system timeout.
  6. Import environment variables.

    For details, see Importing Environment Variables.

  7. Run the following command to query the ID of the faulty controller node VM in the cascaded system:

    nova list |grep cascaded_vm

    cascaded_vm indicates the component name queried in 2.

    Use cascade_vm_az1_dc1_0 as an example and run the following command:

    nova list |grep cascade_vm_az1_dc1_0

    NOTE:

    In the command output, 55c6f891-ae56-4cd5-b348-287d3a433ffc is the example ID of the faulty controller node VM in the cascaded system.
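    If you want to capture the VM ID directly from the command output, the following sketch uses the example component name; the ID is the second pipe-delimited field of the nova list output:

    nova list | grep cascade_vm_az1_dc1_0 | awk -F'|' '{print $2}' | tr -d ' '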

  8. Run the following command to obtain the VNC URL for the faulty controller node VM in the cascaded system:

    nova get-vnc-console 55c6f891-ae56-4cd5-b348-287d3a433ffc novnc

    +-------+-----------------------------------------------------------------------------------------------------------------------+
    | Type  | Url                                                                                                                   |
    +-------+-----------------------------------------------------------------------------------------------------------------------+
    | novnc | https://nova-novncproxy.az1.dc1.domainname.com:8002/vnc_auto.html?token=559e04af-8036-4649-9cd2-04941ac64ba2&lang=EN  |
    +-------+-----------------------------------------------------------------------------------------------------------------------+

    55c6f891-ae56-4cd5-b348-287d3a433ffc indicates the VM ID queried in 7.

  9. Replace nova-novncproxy.az1.dc1.domainname.com in the Url column in the command output with the reverse proxy IP address of FusionSphere OpenStack. For example, https://192.168.55.21:8002/vnc_auto.html?token=559e04af-8036-4649-9cd2-04941ac64ba2&lang=EN
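    If you prefer to perform the substitution on the command line, the following sketch is one way to do it (the IP address is the example reverse proxy address from this step; the domain name is the one shown in the output in 8):

    nova get-vnc-console 55c6f891-ae56-4cd5-b348-287d3a433ffc novnc | awk '/novnc/ {print $4}' | sed 's/nova-novncproxy.az1.dc1.domainname.com/192.168.55.21/'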
  10. Use the replaced URL to log in to the faulty VM in any browser.
  11. Run the following command to change the startup mode of the faulty VM to PXE:

    nova meta 55c6f891-ae56-4cd5-b348-287d3a433ffc set __bootDev=network,hd

  12. On the web client of the cascaded FusionSphere OpenStack system that you have logged in to, choose O&M > Capacity Expansion. On the Capacity Expansion page, disable PXE Boot Hosts.
  13. Run the following command to restart the faulty VM:

    nova reboot 55c6f891-ae56-4cd5-b348-287d3a433ffc

  14. Log in to the faulty VM using the VNC URL of the VM queried in 8.
  15. Check the VNC window. If the PXE boot screen is displayed, the VM has been booted from PXE.

    If the VM fails to be booted from PXE, contact technical support for assistance.

  16. Run the following command to change the startup mode of the faulty VM to volume:

    nova meta 55c6f891-ae56-4cd5-b348-287d3a433ffc set __bootDev=hd,network

  17. After the faulty host is restored, push agents, including Zabbix agent and eSight agent, to the new controller node VM in the cascaded FusionSphere OpenStack system.