
HUAWEI CLOUD Stack 6.5.0 Alarm and Event Reference 04

Using KVM for Virtualization


Audit Overview

Scenarios

The system audit is required for the OpenStack-based FusionSphere system when data inconsistency occurs in the following scenarios:

  • A system exception occurs while a service-related operation is being performed. For example, a host process restarts while you are creating a VM, causing the operation to fail. In this case, residual data may remain in the system or resources may become unavailable.
  • A service-related operation is performed after the system database is backed up and before the database is restored. In this case, residual data may remain in the system or resources may become unavailable after the database is restored using the backup.

The system audit is used to help administrators detect and handle data inconsistency. Therefore, conduct a system audit when:

  • An alarm is generated indicating that data inconsistency verification fails.
  • The system database is restored using a data backup.
  • Routine system maintenance is performed.
NOTE:
  • You are advised to conduct a system audit when the system is running stably. Do not use audit results when a large number of service-related operations are in progress.
  • If service-related operations (for example, creating a VM or expanding the system capacity) are performed or any system exception occurs (for example, a host becomes faulty) during the audit, the audit result may be distorted. In this case, conduct the system audit again after the system recovers. The system provides instructions for confirming the detected problems.

Audit Mechanism

The system audit consists of audit and post log analysis.

The following illustrates how a system audit works:

  • The system obtains service data from databases, hosts, and storage devices, compares the data, and generates an audit report.
  • The system also provides this audit guide and Command Line Interface (CLI) commands for users to locate and handle the data inconsistency problems listed in the audit report.

You can conduct a system audit using either of the following methods:

  • The system automatically starts auditing at 04:00 every day, and reports an alarm and generates an audit report if it detects any data inconsistency. You can log in to the web client and choose Configuration > System > System Audit to change the start time and period of the system audit. If an alarm has been generated and has not been cleared, the system does not generate the alarm again. If no data inconsistency is detected but a data inconsistency alarm exists, the system automatically clears the alarm.
  • Log in to the FusionSphere OpenStack system and run the infocollect audit command to start the audit.
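
    For example (illustrative only), running the command without any options audits all items synchronously, because unspecified parameters fall back to their defaults:

    infocollect audit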

Post log analysis is used after the system database is restored using a data backup. It analyzes historical logs and generates an audit report that sorts the records of tenants' operations on resources (such as VMs and volumes) within a specified time period.

Audit Process

If any audit alarm is generated, conduct an audit based on the process shown in Figure 18-5.

Figure 18-5 Audit process

Manual Audit

Scenarios

Conduct the manual audit when:

  • The system database is restored using a data backup.
  • The inconsistency problems have been automatically handled and you need to verify that they have been rectified.

Prerequisites

Services in the system are running properly.

Procedure

  1. Log in to the first controller node in the AZ. For details, see Using SSH to Log In to a Host.
  2. Import environment variables to the node. For details, see Importing Environment Variables.

    Enter 1 when prompted to enable Keystone V3 authentication with the built-in DC administrator account.

  3. Perform the following operations to start a manual audit:

    1. Enter the security mode. For details, see Command Execution Methods.
    2. Run the following command to conduct a manual audit:

      infocollect audit --item ITEM --parameter PARAMETER --type TYPE

      Table 18-146 describes parameters in the command.

      If you do not specify the audit item, an audit alarm will be triggered when an audit problem is detected. However, if the audit item is specified, no audit alarm will be triggered when an audit problem is detected.

      Table 18-146 Parameter description

      item (Optional)

      Specifies a specific audit item. If the audit item is not specified, an audit alarm is reported when an audit problem is detected; if a specific audit item is specified, no audit alarm is reported. Values:

      • 1001: indicates that a VM is audited. The following audit reports are generated after an audit is complete:
        • orphan_vm.csv: Audit report about orphan VMs
        • invalid_vm.csv: Audit report about invalid VMs
        • host_changed_vm.csv: Audit report about VM location inconsistency
        • stucking_vm.csv: Audit report about stuck VMs
        • diff_property_vm.csv: Not available in NFV scenarios
        • diff_state_vm.csv: Not available in NFV scenarios
        • host_invalid_migration.csv: Audit report about abnormal hosts that adversely affect cold migrated VMs
      • 1002: indicates that an image is audited. The following audit report is generated after an audit is complete:

        stucking_images.csv: Audit report about stuck images

      • 1003: indicates that a zombie process is audited. The following audit report is generated after an audit is complete:

        zombie_process_hosts.csv: Audit report about zombie processes

      • 1004: indicates that the residual nova-compute service is audited. This item is required when a role is deleted using CPS. The following audit report is generated after an audit is complete:

        nova_service_cleaned.csv: Audit report about the residual nova-compute service

      • 1005: indicates that the records of migrated databases are audited. The following audit reports are generated after an audit is complete:
        • cold_cleaned.csv: Audit report about residual data after cold migration
        • live_cleaned.csv: Audit report about residual data after live migration
        • cold_stuck.csv: Audit report about stuck databases of the cold migration
      • 1007: indicates that the Nova database has not submitted events for auditing for more than one hour. The following audit report is generated after an audit is complete:

        nova_idle_transactions.csv: Audit report about the Nova database not submitting events for auditing for more than one hour

      • 1102: indicates that a redundant Neutron namespace (DHCP and router namespaces) is audited. The following audit report is generated after an audit is complete:
        • redundant_namespaces.csv: Audit report about redundant Neutron namespaces
      • 1103: indicates that an orphan Neutron port is audited. An orphan Neutron port is one that Neutron records as being used by a VM, although the VM does not actually exist. The following audit report is generated after an audit is complete:
        • neutron_wild_ports.csv: Audit report about orphan Neutron ports
      • 1201: indicates that the invalid volume, orphan volume, volume attachment status and stuck volume are audited. The following audit reports are generated after an audit is complete:
        • fakeVolumeAudit.csv: Audit report about invalid volumes
        • wildVolumeAudit.csv: Audit report about orphan volumes
        • VolumeAttachmentAudit.csv: Audit report about the volume attachment status
        • VolumeStatusAudit.csv: Audit report about stuck volumes
        • FrontEndQosAudit.csv: Audit report about front-end QoS
        • VolumeQosAudit.csv: Audit report about volume QoS
      • 1204: indicates that the invalid snapshot, orphan snapshot, and stuck snapshot are audited. The following audit reports are generated after an audit is complete:
        • fakeSnapshotAudit.csv: Audit report about invalid snapshots
        • wildSnapshotAudit.csv: Audit report about orphan snapshots
        • SnapshotStatusAudit.csv: Audit report about stuck snapshots
        • wildInstanceSnapshotAudit.csv: Audit report about residual orphan child snapshots
      • 1301: indicates that a bare metal server (BMS) is audited. The following audit report is generated after an audit is complete:
        • invalid_ironic_nodes.csv: Audit report about unavailable BMSs
        • invalid_ironic_instances.csv: Audit report about BMS consistency
        • stucking_ironic_instances.csv: Audit report about stuck BMSs
      • 1501: indicates that an orphan replication pair is audited. The following audit report is generated after an audit is complete:

        wildReplicationAudit.csv: Audit report about orphan replication pairs

      • 1502: indicates that an invalid replication pair is audited. The following audit report is generated after an audit is complete:

        fakeReplicationAudit.csv: Audit report about invalid replication pairs

      • 1504: indicates that replication pair statuses are audited. The following audit report is generated after an audit is complete:

        statusReplicationAudit.csv: Audit report about replication pair statuses

      • 1505: indicates that an orphan consistency replication group is audited. The following audit report is generated after an audit is complete:

        wildReplicationcgAudit.csv: Audit report about an orphan consistency replication group

      • 1506: indicates that an invalid consistency replication group is audited. The following audit report is generated after an audit is complete:

        fakeReplicationcgAudit.csv: Audit report about an invalid consistency replication group

      • 1508: indicates that consistency replication group statuses are audited. The following audit report is generated after an audit is complete:

        statusReplicationcgAudit.csv: Audit report about consistency replication group statuses

      • 1509: indicates that a consistency replication pair in the replication group is audited. The following audit report is generated after an audit is complete:

        contentReplicationcgAudit.csv: Audit report about consistency replication group content

      • 1601: indicates that an orphan HyperMetro pair is audited. The following audit report is generated after an audit is complete:

        wildHypermetroAudit.csv: Audit report about an orphan HyperMetro pair

      • 1602: indicates that an invalid HyperMetro pair is audited. The following audit report is generated after an audit is complete:

        fakeHypermetroAudit.csv: Audit report about an invalid HyperMetro pair

      • 1604: indicates that an orphan HyperMetro consistency group is audited. The following audit report is generated after an audit is complete:

        wildHypermetrocgAudit.csv: Audit report about an orphan HyperMetro consistency group

      • 1605: indicates that an invalid HyperMetro consistency group is audited. The following audit report is generated after an audit is complete:

        fakeHypermetrocgAudit.csv: Audit report about an invalid HyperMetro consistency group

      • 1702: indicates that an ECS snapshot is audited. The following audit report is generated after an audit is complete:

        images_vm_snapshots.csv: audit report about residual ECS snapshots

      If item is not specified, all audit items are audited by default.

      parameter (Optional; can be specified only after the audit item is specified)

      Specifies an additional parameter. You can specify only one value, and it must match the specified audit item.

      • If item is set to 1001, you can set the value of vm_stucking_timeout which indicates the timeout threshold in seconds for VMs in an intermediate state. The default value is 14400. The value affects the audit report about stuck VMs. You can also set the value of host_invalid_timeout which indicates the heartbeat timeout threshold in seconds for abnormal hosts. The default value is 14400. The value affects the audit report about abnormal hosts that adversely affect cold migrated VMs.
      • If item is set to 1002, you can set the value of image_stucking_timeout which indicates the timeout period in seconds for transient images. The default value is 86400. The value affects the audit report about stuck images.
      • If item is set to 1005, you can set the value of migration_stucking_timeout which indicates the timeout period in seconds. The default value is 14400. The migration_stucking_timeout parameter affects the audit report about intermediate state of the cold migration.
      • If item is set to other values, no additional parameter is required.

      Example: --parameter vm_stucking_timeout=3600

      type (Optional)

      Specifies an additional parameter that indicates whether the audit is synchronous or asynchronous. If this parameter is not specified, the audit is synchronous. The values are:

      • sync: specifies a synchronous audit. For details, see the following command.
      • async: specifies an asynchronous audit. For details, see Asynchronous Audit. The audit progress and audit result status of an asynchronous audit can be obtained by invoking the interface for querying the task status.
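
      For example (illustrative), the VM audit item could be started asynchronously as follows; the progress is then obtained through the task status query interface mentioned above:

      infocollect audit --item 1001 --type async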

      Run the following command to detect VMs that have been in an intermediate state for 3600 seconds or longer:

      infocollect audit --item 1001 --parameter vm_stucking_timeout=3600

      Information similar to the following is displayed:

      +--------------------------------------+----------------------------------+ 
      | Hostname                             | Path                             | 
      +--------------------------------------+----------------------------------+ 
      | CCCC8175-8EAC-0000-1000-1DD2000011D0 | /var/log/audit/2015-04-22_020324 | 
      +--------------------------------------+----------------------------------+

      In the command output, Hostname indicates the ID of the host for which the audit report is generated, and Path indicates the directory containing the audit report.

      Log in to the host first and then view the audit reports. For details, see Collecting Audit Reports.

Collecting Audit Reports

Scenarios

Collect audit reports when:

  • Alarms about the volume audit, VM audit, snapshot audit, and image audit are generated.
  • Routine maintenance is performed.

Prerequisites

A local PC running the Windows operating system is available.

Procedure

  1. Log in to the first host in an availability zone (AZ).

    For details, see Using SSH to Log In to a Host.

  2. Import environment variables to the host.

    For details, see Importing Environment Variables.

    Enter 1 when prompted to enable Keystone V3 authentication with the built-in DC administrator account.

  3. Perform the following operations to obtain the IDs of the hosts where the active and standby audit services are deployed:

    1. Enter the secure operation mode.

      For details, see Command Execution Methods.

    2. Run the following command to obtain the IDs of the hosts where the active and standby audit services are deployed:

      cps template-instance-list --service collect info-collect-server

      The following information is displayed:

      +------------+---------------------+---------+------------+----------------+ 
      | instanceid | componenttype       | status  | runsonhost | omip           | 
      +------------+---------------------+---------+------------+----------------+ 
      | 1          | info-collect-server | active  | 152-slot9  | 192.168.153.69 | 
      | 0          | info-collect-server | standby | 152-slot10 | 192.168.153.98 | 
      +------------+---------------------+---------+------------+----------------+

      The values of runsonhost indicate the host IDs.

  4. Enter secure operation mode based on Command Execution Methods and run the following command to obtain the management IP address of the host where the active audit service is deployed:

    cps host-show Host ID | grep manageip

    NOTE:
    • Select the host in the active state from the hosts.
    • If the host you have logged in to is the one for which the management IP address is to be obtained, go to 6.
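
    For example (illustrative, using the sample output in 3 where the active instance runs on host 152-slot9):

    cps host-show 152-slot9 | grep manageip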

  5. Run the following commands to log in to the host where the active audit service is deployed:

    su fsp

    ssh fsp@Management IP address of host

    su - root

  6. Run the following command to query the time for the last audit conducted on the host:

    ls /var/log/audit -Ftr | grep /$ | tail -1

    Information similar to the following is displayed:

    2014-09-20_033137/
    NOTE:
    • The directory name indicates the audit time. For example, 2014-09-20_033137 indicates 3:31:37 on September 20th, 2014.
    • If no result is returned, no audit report is available on the host.

  7. Run the following command to create a temporary directory used for storing audit reports:

    mkdir -p /home/fsp/last_audit_result

  8. Run the following command to copy the latest audit report to the temporary directory:

    cp -r /var/log/audit/`ls /var/log/audit -Ftr | grep /$ | tail -1` /home/fsp/last_audit_result

  9. Run the following command to modify the permissions of files in the temporary directory:

    chmod 777 /home/fsp/last_audit_result/ -R

  10. Switch to user fsp and copy the temporary directory to the first host in the AZ.

    Run the following command to switch to user fsp:

    exit

    Run the following command to copy the temporary directory:

    scp -r /home/fsp/last_audit_result fsp@host_ip:/home/fsp

    In the command, the value of host_ip is the management IP address of the first host. If the value is 172.28.0.2, run the following command:

    scp -r /home/fsp/last_audit_result fsp@172.28.0.2:/home/fsp

    During the copy process, the password of user root is required. The default password of user root is Huawei@CLOUD8!.

  11. Run the following command to delete the temporary directory from the host where the latest audit report is saved:

    rm -r /home/fsp/last_audit_result

  12. Log in to the first host in the AZ. For details, see Using SSH to Log In to a Host.
  13. Use WinSCP or other tools to copy the folder /home/fsp/last_audit_result to the local PC.
  14. Run the following command to delete the temporary folder from the first host:

    NOTE:

    This audit directory will be used for subsequent audit operations. Perform this step only after relevant audit operations are completed.

    rm -r /home/fsp/last_audit_result

Obtaining the Operation Report

Scenarios

After the system database is restored using a data backup and is audited, any audit alarm that is generated indicates data inconsistency. In this scenario, obtain the operation reports to locate the inconsistent data, and use the operation replay tool to analyze the operation logs and find the cause of the inconsistency.

Operation Replay Tool Function

The operation replay tool is used to collate and analyze OpenStack component operation logs, and generate an operation report, which records operations performed on system resources in the specified time period. Then, users can check these operations.

This tool can analyze operation logs of the following components: Nova, Cinder, and Glance.

The system resources that can be analyzed include VMs, images, volumes, and snapshots.

Report Format

The report generated by the operation replay tool is a .csv file.

The file name format is Component name-Start time_End time.csv, for example, nova-2014:09:10-10:00:00_2014:09:11-10:00:00.csv.

Table 18-147 describes the parameters in the operation report. The description of each parameter is provided for reference only; the actual report does not contain it.

Table 18-147 Parameter description

  • tenant: Specifies the tenant ID. Example: 94e010f2246f435ca7f13652e64ff0fb
  • res_id: Specifies the resource ID. Example: 8ff25fba9-61cd-424f-a64a-c4a07b372d51
  • res_type: Specifies the resource type. Example: volumes
  • time: Specifies the time when the operation was performed. Example: 18/Sep/2014:12:37:15
  • host: Specifies the host ID. Example: CCCC8171-7958-0000-1000-1DD40000CAD0
  • action: Provides detailed information about the operation, including the HTTP request, method, URL, request body, and result code. Example: POST https://volume.az1.dc1.vodafone.com:8776/v2/94e010f2246f435ca7f13652e64ff0fb/volumes {"volume": {"status": "creating" "description": null "availability_zone": null "source_volid": null "snapshot_id": null "size": 1 "user_id": null "name": "ooooo" "imageRef": null "attach_status": "detached" "volume_type": null "shareable": false "project_id": null "metadata": {}}} 8ff25fba9-61cd-424f-a64a-c4a07b372d51 202

Command Format

The format of the command used on this tool is as follows:

operate-replay analyse --path <log-path> [--dest <dest>] [--start <time>] [--end <time>]

This command can be executed on any host. Table 18-148 describes parameters in the command.

Table 18-148 Parameter description

  • --path (Mandatory): Specifies the directory containing the operation logs. The directory structure is as follows:

    .../log-path/host_id_1/nova-api

    .../log-path/host_id_1/cinder-api

  • --dest (Optional): Specifies the analysis objects. The values can be nova, cinder, and glance. Separate multiple objects with commas (,) and no spaces. If this parameter is not specified, the tool analyzes all three components.
  • --start (Optional): Specifies the start time. The value format is YYYY/MM/DD-HH:MM:SS, for example, 2014/05/06-10:12:15. The default value is unlimit, indicating the earliest time of log generation. If both the start time and end time are specified, the difference between the two values must be less than or equal to 48 hours.
  • --end (Optional): Specifies the end time. The value format is YYYY/MM/DD-HH:MM:SS, for example, 2014/05/06-12:12:15. The default value is unlimit, indicating the current system time.
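
For example (illustrative, reusing the sample values from Table 18-148 and a hypothetical log directory), the following command analyzes only the Nova and Cinder operation logs within a two-hour window:

operate-replay analyse --path /home/fsp/op_log --dest nova,cinder --start 2014/05/06-10:12:15 --end 2014/05/06-12:12:15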

Procedure

  1. Log in to the first host in an availability zone (AZ).

    For details, see Using SSH to Log In to a Host.

  2. Import environment variables to the host.

    For details, see Importing Environment Variables.

  3. Perform the following operations to locate the controller hosts in the AZ:

    1. Enter the secure operation mode.

      For details, see Command Execution Methods.

    2. Run the following command to query the IDs and management IP addresses of all controller hosts:

      cps host-list

      The host whose roles value contains controller indicates a controller host. Make a note of the IDs and management IP addresses of all controller hosts, and the management IP address of the first host.
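
      For example (illustrative), the controller hosts can be filtered from the output as follows:

      cps host-list | grep controller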

  4. Log in to the first host using the omip address obtained in 3, and run the su fsp command to switch to user fsp.
  5. In the /home/fsp directory, create a directory for storing log files.

    If the directory to be created is op_log, run the following command:

    mkdir -p /home/fsp/op_log

  6. In the directory created in 5, run the following command to create a subdirectory for each controller host queried in 3, and name the subdirectories after the IDs of the controller hosts:

    mkdir -p /home/fsp/op_log/controller_host_id

    If the controller host ID is controller-node-1, run the following command:

    mkdir -p /home/fsp/op_log/controller-node-1
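
    If there are several controller hosts, a small loop like the following can be used instead (illustrative; the host IDs are examples and must be replaced with those recorded in 3):

    for host_id in controller-node-1 controller-node-2; do mkdir -p /home/fsp/op_log/${host_id}; done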

  7. Perform the following operations on each controller node obtained in 3 to copy its operation logs to the corresponding subdirectory created in 6:

    1. Run the following commands to log in to each controller node using the management IP addresses obtained in 3 and switch to user root:

      ssh fsp@Management IP address

      su - root

    2. Run the following command to create a temporary directory:

      mkdir /home/fsp/op_tmp_log

    3. Run the following command to switch to the directory containing the operation logs:

      cd /var/log/fusionsphere/operate

    4. Run the following commands to copy controller node operation logs to the temporary directory /home/fsp/op_tmp_log:

      cp -r ./nova-api /home/fsp/op_tmp_log

      cp -r ./cinder-api /home/fsp/op_tmp_log

      cp -r ./glance-api /home/fsp/op_tmp_log

    5. Run the following command to modify the permissions of the temporary directory /home/fsp/op_tmp_log:

      chmod 777 /home/fsp/op_tmp_log -R

    6. Run the following commands to copy the temporary directory /home/fsp/op_tmp_log to the first host:

      su fsp

      scp -r /home/fsp/op_tmp_log/* fsp@first_node_ip:/home/fsp/op_log/controller_host_id/

      In the commands, first_node_ip indicates the management IP address of the first host, and controller_host_id indicates the subdirectory created in 6 for the host.
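
      For example (illustrative), if the management IP address of the first host is 172.28.0.2 and the subdirectory created in 6 is controller-node-1:

      scp -r /home/fsp/op_tmp_log/* fsp@172.28.0.2:/home/fsp/op_log/controller-node-1/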

      NOTE:

      If the logs belong to the first controller node itself, run the following command instead:

      cp -r /home/fsp/op_tmp_log/* /home/fsp/op_log/controller_host_id/

    7. Run the following command to delete the temporary directory:

      rm /home/fsp/op_tmp_log -r

    After the operation logs of all controller nodes are copied to the subdirectories, log in to the first host.

  8. Run the following command to generate an operation report:

    su - root

    operate-replay analyse --dest nova,cinder,glance --path /home/fsp/op_log [--start <time>] [--end <time>]

    You can set the time range and analysis objects as required.

    The value format of <time> is YYYY/MM/DD-HH:MM:SS, for example, 2015/12/11-18:00:00.

    A .csv report file is generated in the specified directory.

  9. Use WinSCP or other tools to copy the audit report on the first host to the local PC.

    If you no longer need the files in this directory after the report is successfully copied, delete the directory:

    rm /home/fsp/op_log -r

  10. Use Excel to open the report.

Analyzing Audit Results

Scenarios

Analyze the audit results in the following scenarios:

  • When receiving audit-related alarms, such as volume, VM, snapshot, and image audit alarms, log in to the system, obtain the audit reports, and rectify the faults accordingly.
  • After enabling the backup and restoration feature, log in to the system and perform a consistency audit. Then obtain the audit reports and rectify the fault accordingly.
  • To perform routine maintenance for the system, log in to the system and perform an audit. Then obtain the audit reports and rectify the fault accordingly.

Prerequisites

Procedure

  1. Determine the audit report name.

    If the alarm is an audit alarm, choose Additional Info > Details and select the required audit report from displayed audit reports.

  2. Check the audit report name.

    • VM Audit Alarm
      • If the report name is orphan_vm.csv and the report is not empty, rectify the fault based on Orphan VMs. Otherwise, residual resources may exist.
      • If the report name is invalid_vm.csv and the report is not empty, rectify the fault based on Invalid VMs. Otherwise, unavailable VMs may be visible to users.
      • If the report name is host_changed_vm.csv and the report is not empty, rectify the fault based on VM Location Inconsistency. Otherwise, VMs may become unavailable.
      • If the report name is stucking_vm.csv and the report is not empty, rectify the fault based on Stuck VMs. Otherwise, VMs may become unavailable.
      • If the report name is cold_stuck.csv and the report is not empty, rectify the fault based on Intermediate State of the Cold Migration. Otherwise, the affected VMs may fail to be maintained.
      • If the report name is host_invalid_migration.csv and the report is not empty, rectify the fault based on Cold Migrated VMs That Are Adversely Affected by Abnormal Hosts. Otherwise, the affected VMs may fail to be maintained.
      • If the report name is nova_service_cleaned.csv and the report is not empty, rectify the fault based on Handling nova-compute Service Residuals. Otherwise, user experience is affected.
      • If the report name is nova_idle_transactions.csv and the report is not empty, rectify the fault based on Nova Database Not Submitting Events for Auditing. Otherwise, the number of available Nova database connections may decrease.
    • Volume Audit
      • If the report name is wildVolumeAudit.csv and the report is not empty, rectify the fault based on Orphan Volumes. Otherwise, residual resources may exist.
      • If the report name is fakeVolumeAudit.csv and the report is not empty, rectify the fault based on Invalid Volumes. Otherwise, unavailable volumes may be visible to users.
      • If the report name is VolumeStatusAudit.csv and the report is not empty, rectify the fault based on Stuck Volumes. Otherwise, volumes may become unavailable.
      • If the report name is VolumeAttachmentAudit.csv and the report is not empty, rectify the fault based on Inconsistent Volume Attachment Information. Otherwise, residual resources may exist.
      • If the report name is FrontEndQosAudit.csv and the report is not empty, rectify the fault based on Frontend QoS. Otherwise, residual resources may exist.
      • If the report name is VolumeQosAudit.csv and the report is not empty, rectify the fault based on Volume QoS. Otherwise, unavailable volumes may be visible to users.
    • Snapshot Audit
      • If the report name is wildSnapshotAudit.csv and the report is not empty, rectify the fault based on Orphan Volume Snapshots. Otherwise, residual resources may exist.
      • If the report name is fakeSnapshotAudit.csv and the report is not empty, rectify the fault based on Invalid Volume Snapshots. Otherwise, unavailable volume snapshots may be visible to users.
      • If the report name is SnapshotStatusAudit.csv and the report is not empty, rectify the fault based on Stuck Volume Snapshots. Otherwise, volume snapshots may be unavailable.
      • If the report name is wildInstanceSnapshotAudit.csv, rectify the fault based on Residual Orphan Child Snapshots. Otherwise, residual volume snapshot resources exist, occupying system resources.
    • Image Audit
      • If the report name is stucking_images.csv and the report is not empty, rectify the fault based on Stuck Images. Otherwise, residual resources may exist.
    • Virtual Network Resource Audit
      • If the report name is redundant_namespaces.csv and the report is not empty, rectify the fault based on Redundant Neutron Namespaces. Otherwise, residual namespace may exist and fail to be maintained.
      • If the report name is neutron_wild_ports.csv and the report is not empty, rectify the fault based on Orphan Ports. Otherwise, ports cannot be used by VMs and cannot be maintained.
    • Bare Metal Server Audit
      • If the report name is invalid_ironic_nodes.csv and the report is not empty, rectify the fault based on Unavailable Bare Metal Servers. Otherwise, BMSs may be unavailable.
      • If the report name is invalid_ironic_instances.csv and the report is not empty, rectify the fault based on Bare Metal Server Audit Consistency. Otherwise, unavailable physical servers may be visible to users, or residual resources may exist in the environment.
      • If the report name is stucking_ironic_instances.csv and the report is not empty, rectify the fault based on Handling Bare Metal Servers in an Intermediate State. Otherwise, BMSs may be unavailable.
    • Other Audit
    • If the report name is zombie_process_hosts.csv and the report is not empty, zombie processes have been generated in the nova-novncproxy service and have been automatically processed. For details, see Nova novncproxy Zombie Process.
    • If the report name is cold_cleaned.csv and the report is not empty, residual cold migration records exist in the environment and have been automatically processed. For details, see Detecting and Deleting Residual Cold Migration Data.
    • If the report name is live_cleaned.csv and the report is not empty, residual live migration records exist in the environment and have been automatically processed. For details, see Detecting and Deleting Residual Live Migration Data.
    • If the report name is fakeHypermetroAudit.csv and the report is not empty, rectify the fault based on Invalid HyperMetro Pairs. Otherwise, unavailable HyperMetro pairs may be visible to users.
    • If the report name is fakeHypermetrocgAudit.csv and the report is not empty, rectify the fault based on Invalid HyperMetro Consistency Groups. Otherwise, unavailable HyperMetro consistency groups may be visible to users.
    • If the report name is fakeReplicationAudit.csv and the report is not empty, rectify the fault based on Invalid Replication Pairs. Otherwise, unavailable replication pairs may be visible to users.
    • If the report name is fakeReplicationcgAudit.csv and the report is not empty, rectify the fault based on Invalid Remote Replication Consistency Group. Otherwise, unavailable consistency replication groups may be visible to users.
    • If the report name is statusReplicationAudit.csv and the report is not empty, rectify the fault based on Replication Pair with Inconsistent Statuses. Otherwise, replication pairs may be unavailable.
    • If the report name is statusReplicationcgAudit.csv and the report is not empty, rectify the fault based on Remote Replication Consistency Groups with Inconsistent States. Otherwise, consistency replication groups may be unavailable.
    • If the report name is wildHypermetroAudit.csv and the report is not empty, rectify the fault based on Orphan HyperMetro Pairs. Otherwise, residual resources may exist.
    • If the report name is wildHypermetrocgAudit.csv and the report is not empty, rectify the fault based on Orphan HyperMetro Consistency Groups. Otherwise, residual resources may exist.
    • If the report name is wildReplicationAudit.csv and the report is not empty, rectify the fault based on Orphan Replication Pair. Otherwise, residual resources may exist.
    • If the report name is wildReplicationcgAudit.csv and the report is not empty, rectify the fault based on Orphan Remote Replication Consistency Groups. Otherwise, residual resources may exist.
    • If the report name is contentReplicationcgAudit.csv and the report is not empty, rectify the fault based on Remote Replication Consistency Groups with Inconsistent Replication Pairs. Otherwise, consistency replication groups may be unavailable.
    • If the report name is images_vm_snapshots.csv, rectify the fault based on Residual ECS Snapshots. Otherwise, residual ECS snapshot resources exist, occupying system resources.

    If multiple faults in the audit report are displayed, the faults must be rectified based on the sequence listed in 2.

  3. (Optional) After the faults are rectified, check the operation logs and perform the operations provided in Inconsistency Between VM HA Flags and Startup Flags to prevent inconsistency between the VM HA flag and the startup flag.

    NOTE:

    Perform this step based on whether the VM HA flag bit or startup mode was changed after the management data was backed up and before the system database was restored.

Handling Audit Results

Orphan VMs

Context

A VM is orphaned in either of the following scenarios:

  • Common orphan VM: The VM is present on a host but does not exist in the system database or is in the deleted state in the database.
  • Orphan VM caused by an HA exception: The VM exists in the database, but two copies of the VM are present in the system, one in the paused state and the other in the running state.

Parameter Description

The name of the audit report for an orphan VM is orphan_vm.csv. Table 18-149 describes parameters in the report.

Table 18-149 Parameter description

  • uuid: Specifies the VM universally unique identifier (UUID).
  • hyper_vm_name: Specifies the VM name registered in the hypervisor.
  • host_id: Specifies the ID of the host accommodating the VM.

Possible Causes

  • The database is reverted using a data backup to the state at which the backup was created. However, after the backup was created, one or more VMs were created. After the database is restored, records of these VMs are deleted from the database, but the VMs remain on their hosts and become orphan VMs.
  • Some VMs were manually created on hosts using system commands.
  • The system was not stable (for example, VMs were being live migrated) when the audit was conducted.
  • During the VM HA rescheduling process, a network or system exception occurred, causing the system to fail to clear the VM resources on the source host. Therefore, VM information remains on the source host. After the VM is rebuilt on the destination host, the VM location recorded in the database is changed to the destination host. In this case, the VM exists on both the source and destination hosts, and the copy on the source host is reported in the system audit report.
  • During the VM live migration, cold migration, resize, resize-revert, or resize-confirm process, the network or system is unstable, or storage encounters an exception, resulting in a VM fault. After the faulty VM is deleted, residual data may exist on the source or destination host.

Impact on the System

  • VMs orphaned by database restoration are invisible to tenants.
  • Residual resources remain in the system.

Procedure

  1. Collect audit reports and copy them to the first controller node in an AZ. For details, see Collecting Audit Reports.
  2. Log in to the first controller node in an AZ. For details, see Using SSH to Log In to a Host.
  3. Import environment variables. For details, see Importing Environment Variables.
  4. Run the following command to query detailed information about the orphan VM, including the information about the host accommodating the VM and the value of instance_name:

    1. Run the following command to enter the secure mode:

      runsafe

      Information similar to the following is displayed.

      Input command:
    2. Run the following command to check whether the orphan VM exists in the database:

      python /etc/nova/nova-util/invalid_vm_result.py orphan_vm status audit_path

      audit_path indicates the path to which the audit reports were copied in 1, for example:

      /home/fsp/last_audit_result/date or /home/fsp/last_audit_result

      where date must be replaced with the actual date in the audit result, for example:

      /home/fsp/last_audit_result/2018-03-26_142612-02/

    3. Run the following command, and check the output:

      ls audit_path

      Check whether the output is as follows:

      audit  data_collect.zip
    4. Check whether a VM ID list (starting from the second row of the output in 4.b) is displayed.

      • If yes, go to the next step.
      • If no, the fault is falsely reported due to time differences. In this case, no further action is required.

  5. Log in to the host accommodating the orphan VM. For details, see Using SSH to Log In to a Host. The management IP address of the host is the OM IP address obtained in 4.
  6. Run the following command to check whether an orphan VM runs on the host:

    python /etc/nova/nova-util/invalid_vm_result.py orphan_vm result "hyper_vm_str"

    hyper_vm_str indicates the VM ID list displayed in 4 in the "vm_id1,hyper_vm_id1;vm_id2,hyper_vm_id2..." format.
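
    For example (illustrative, with a hypothetical VM UUID and hypervisor VM name):

    python /etc/nova/nova-util/invalid_vm_result.py orphan_vm result "8ff25fba9-61cd-424f-a64a-c4a07b372d51,instance-00000064"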

    Check whether a VM ID list (starting from the second row of the output) is displayed.

    • If yes, go to 7.
    • If no, the VM is not an orphan VM, and the fault is falsely reported due to time differences. No further action is required.

  7. Determine whether to delete the orphan VM.

    • If yes, go to 8.
    • If no, contact the user for further processing, and no further action is required.

  8. Run the following command to delete the orphan VM:

    python /etc/nova/nova-util/invalid_vm_result.py orphan_vm clean "hyper_vm_str"

    hyper_vm_str indicates the VM ID list displayed in 4 in the "vm_id1,hyper_vm_id1;vm_id2,hyper_vm_id2..." format. Replace hyper_vm_str with the ID of orphan VM to be deleted in the required format.

    The orphan VM is successfully handled.

Invalid VMs

Context

An invalid VM is one that exists in the system database and is in a normal state in the database but is not present in the hypervisor.

For an invalid VM, ask the tenant to determine whether the VM is useful. If the VM is not useful, delete the VM record from the database.

Parameter Description

The name of the audit report is invalid_vm.csv. Table 18-150 describes parameters in the report.

Table 18-150 Parameter description

  • uuid: Specifies the VM UUID.
  • tenant_id: Specifies the tenant ID.
  • hyper_vm_name: Specifies the VM name on the host, for example, instance_xxx.
  • updated_at: Specifies the last time when the VM status was updated.
  • status: Specifies the current VM state.
  • task_status: Specifies the current VM task state.

Impact on the System

Users can query the VM using the Nova APIs, but the VM does not exist on the host.

Possible Causes

  • The database is reverted using a data backup to the state at which the backup was created. However, after the backup was created, one or more VMs were deleted. After the database is restored, records of these VMs are present in the database, but the VMs themselves have been deleted.
  • The system was not stable (for example, VMs were being live migrated) when the audit was conducted.
  • Some hosts are abnormal, causing VMs on these hosts to be incorrectly reported as invalid VMs. In this case, conduct the system audit again after the system recovers.

Procedure

  1. Log in to the first controller node in the AZ.

    For details, see Using SSH to Log In to a Host.

  2. Import environment variables. For details, see Importing Environment Variables.
  3. Perform the following operations to query the management IP addresses of controller nodes:

    1. Run the following command to enter the secure operation mode:

      runsafe

      The following information is displayed:

      Input command:
    2. Run the following command to obtain information about the host housing the invalid VM and instance_name:

      nova show uuid

      The VM UUID can be obtained from the audit report.

    Check whether the VM details are displayed.

    • If yes, go to the next step.
    • If no, the fault is falsely reported due to time differences. In this case, no further action is required.

  4. Run the following command to query the ID of the host (host_id) accommodating the invalid VM:

    nova show uuid | grep host

    The output is as follows. host_id of the host accommodating the invalid VM is displayed at the right side of the row that contains OS-EXT-SRV-ATTR:host.

  5. Perform the following operations to query the management IP address of the host accommodating the invalid VM:

    Run the runsafe command to enter the secure operation mode, enter the user password, and run the following command as prompted:

    cps host-show host_id | grep manageip | awk -F '|' '{print $3}'

    Where, host_id  is the host ID obtained in 4.

  6. Run the following commands to log in to the controller node:

    su fsp

    ssh fsp@Management IP address

    su - root

  7. Import environment variables. For details, see 2.
  8. Verify the invalid VM.

    1. Run the following command to check whether the host contains this VM.

      nova_virsh_cmd virsh-list-name

      The VM runs on the host if information about the VM is displayed in the command output. For example:

      Id    Name                           State  
      ----------------------------------------------------  
       3     instance-00000064              running
    2. Run nova show uuid | grep name. In the command output, record instance_name at the right side of the row that contains OS-EXT-SRV-ATTR:instance_name.

    3. Check whether instance_name obtained in 8.b is also displayed in the output obtained in 8.a.
      • If yes, no further action is required.
      • If no, go to the next step.

  9. Run the following command to determine the boot device of the VM.

    nova show uuid | grep image

    • If the output shows an image ID, the VM boots from an image. Go to 10.

    • If the output shows that the VM boots from a volume (no image ID is displayed), proceed to 13.

  10. Determine whether to delete the VM.

    • If yes, go to 19.
    • If no, go to the next step.

  11. Run the runsafe command to enter the secure operation mode, and run the following command to rebuild a service VM:

    nova rebuild <vm_id> <image_id>

    In the preceding command, vm_id is obtained from 3 and image_id is obtained from output in 9.

    NOTE:

    If the VM boots from an image, rebuilding the VM may cause data loss on the system volume. Therefore, contact technical support for assistance to verify information correctness before rebuilding the VM.

  12. Run the runsafe command to enter the secure operation mode, and run the following command to check whether the VM is successfully rebuilt:

    nova show <vm_id>

    Check the status value in the command output:

    • If the status is REBUILD, the VM is being rebuilt. Query the VM again 1 minute later.
    • If the status is ACTIVE, the VM is successfully rebuilt, no further action is required.
    • If the status is neither of the above, the VM fails to rebuild. Contact technical support for assistance.

  13. Run the runsafe command to enter the secure operation mode, enter the user password as prompted, and run the following command to query the VM volumes.

    cinder list | grep uuid

    The VM UUID can be obtained from the audit report.

    The VM volume IDs are displayed in the first row of the command output.

    If the preceding operation is abnormal, contact technical support for assistance.

  14. Log in to the node storing the audit report. For details, see Using SSH to Log In to a Host. Check whether all the volume IDs obtained in 13 are listed in the invalid volume audit report.

    • If yes, the volumes are invalid volumes. Go to 19 to delete the VM. If you need to handle the invalid volumes, contact technical support for assistance.
    • If no, the volumes are valid. Go to 15.

  15. Obtain the operation report (for details, see Obtaining the Operation Report) and check whether the VM was deleted after the management data was backed up and before the system database was restored.

    If the VM was deleted after the management data was backed up and before the system database was restored, all of the following conditions in Table 18-151 must be met:

    Table 18-151 Data in the operation report

    • res_id: The value is that of uuid in the audit report.
    • res_type: servers
    • time: The value is a time within the period after the management data was backed up and before the database was restored.
    • action: The HTTP request method is POST, and the HTTP request URL is /v2/tenant_id/servers, where tenant_id is the value of tenant in the audit report.

    • If yes, go to the next step.
    • If no, contact technical support for assistance.

  16. Determine whether to delete the VM or restore the VM.

    • To restore the VM, go to 17.
    • To delete the VM, go to 19.

  17. Perform the following operations to rebuild the VM:

    Run the runsafe command to enter the secure operation mode, enter the user password, and run the following command as prompted:

    /opt/cloud/services/nova/venv/bin/python2.7 /etc/nova/nova-util/reschedule_vm.py vm_uuid

  18. Perform the following operations to query the VM status:

    Run the runsafe command to enter the secure operation mode, enter the user password, and run the following command as prompted:

    nova show vm_uuid

    In the command output, check the value of status.

    • If the value is REBUILD, the VM is rebuilding. Query the VM status again after 1 minute.
    • If the value is ACTIVE, the VM is restored. No further action is required.
    • If other values are displayed, the VM fails to be restored. Contact technical support for assistance.

  19. Log in to the host accommodating the active GaussDB node. For details, see Logging In to the Active GaussDB Node.
  20. Run the following script to clear information about the fake VM from the database:

    sh /usr/bin/info-collect-script/audit_resume/FakeVMCleanup.sh vm_uuid

    You are required to enter the password of database account gaussdba during the script execution. The default password is FusionSphere123. Determine whether the operation is successful.

    • If yes, no further action is required.
    • If no, contact technical support for assistance.
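
    For example (illustrative, with a hypothetical VM UUID), the cleanup script in 20 would be invoked as follows:

    sh /usr/bin/info-collect-script/audit_resume/FakeVMCleanup.sh 8ff25fba9-61cd-424f-a64a-c4a07b372d51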

VM Location Inconsistency

Context

The host accommodating a VM recorded in the system database is inconsistent with the actual host.

If the fault is confirmed, correct the actual VM location information (host ID) in the database.

Parameter Description

The name of the audit report is host_changed_vm.csv. Table 18-152 describes parameters in the report.

Table 18-152 Parameter description

  • uuid: Specifies the VM UUID.
  • tenant_id: Specifies the tenant ID.
  • hyper_vm_name: Specifies the VM name registered in the hypervisor.
  • updated_at: Specifies the last time when the VM status was updated.
  • status: Specifies the VM state.
  • task_status: Specifies the VM task state.
  • host_id: Specifies the ID of the host accommodating the VM as recorded in the database.
  • hyper_host_id: Specifies the ID of the host actually accommodating the VM.
  • hypervisor_hostname: Reserved for connecting to VRM; left blank in KVM scenarios.
  • hyper_hypervisor_hostname: Reserved for connecting to VRM; left blank in KVM scenarios.

Possible Causes

The database is reverted using a data backup to the state at which the backup was created. However, after the backup was created, one or more VMs were migrated. After the database is restored, the location records of these VMs in the database are inconsistent with the actual VM locations.

Impact on the System

The VM becomes unavailable if the VM location recorded in the database is inconsistent with the actual host accommodating the VM.

Procedure

  1. Log in to the first controller node in the AZ.

    For details, see Using SSH to Log In to a Host.

  2. Import environment variables. For details, see Importing Environment Variables.
  3. Run the following command to obtain the management IP address of the actual host accommodating the VM:

    cps host-show hyper_host_id | grep manageip | awk -F '|' '{print $3}'

    The value of hyper_host_id can be obtained from the audit report.

  4. Log in to the host accommodating the VM and run the following command to check whether the VM runs on the host:

    nova_virsh_cmd virsh-list-name | grep hyper_vm_name

    The value of hyper_vm_name can be obtained from the audit report.
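
    For example (illustrative), if hyper_vm_name in the audit report is instance-00000064:

    nova_virsh_cmd virsh-list-name | grep instance-00000064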

    Check whether the command output contains information about this VM.

    • If yes, go to 5.
    • If no, contact technical support for assistance.

  5. Log in to the active GaussDB node. For details, see Logging In to the Active GaussDB Node.
  6. Run the following commands to correct the information about the host accommodating the VM recorded in the database. The password of the gaussdba account is required during the command execution process. The default password of user gaussdba is FusionSphere123.

    sh /usr/bin/info-collect-script/audit_resume/host_changed_handle.sh uuid hyper_host_id

    The VM UUID can be obtained from the audit report.

    The value of hyper_host_id can be obtained from the audit report.
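
    For example (illustrative, with a hypothetical VM UUID and host ID taken from the audit report):

    sh /usr/bin/info-collect-script/audit_resume/host_changed_handle.sh 8ff25fba9-61cd-424f-a64a-c4a07b372d51 CCCC8175-8EAC-0000-1000-1DD2000011D0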

    Check whether the command is successfully executed based on the command output.

    • If yes, go to 7.
    • If no, contact technical support for assistance.

  7. Run the following command to stop the VM:

    nova stop uuid

    After a few seconds, go to 8.

    Check whether the value of status in the command output is SHUTOFF.

    • If yes, go to 9.
    • If no, contact technical support for assistance.

  8. Run the following command to query the VM status:

    nova show uuid | grep status

  9. Run the following command to migrate the VM:

    nova migrate uuid

    After a few seconds, go to 8.

    Check whether the value of status in the command output is VERIFY_RESIZE.

    • If yes, go to 10.
    • If no, contact technical support for assistance.

  10. Run the following command to check whether the VM is successfully migrated:

    nova resize-confirm uuid

    After a few seconds, go to 8.

    Check whether the value of status in the command output is SHUTOFF.

    • If yes, go to 11.
    • If no, contact technical support for assistance.

  11. Run the following command to restart the VM.

    nova start uuid

    Check whether the VM is started.

    • If yes, no further action is required.
    • If no, contact technical support for assistance.

Stuck VMs

Context

A stuck VM is one that remains in a transition state for a long time and cannot be automatically restored if a system exception (for example, a host restart) occurs during a VM service process (for example, starting a VM).

Manually restore the VM based on the VM status and the task status.

Parameter Description

The name of the audit report is stucking_vm.csv. Table 18-153 describes parameters in the report.

Table 18-153 Parameter description

  • uuid: Specifies the VM UUID.
  • tenant_id: Specifies the tenant ID.
  • hyper_vm_name: Specifies the VM name registered in the hypervisor.
  • updated_at: Specifies the last time when the VM status was updated.
  • status: Specifies the VM state.
  • task_status: Specifies the VM task state.

Possible Causes

A system exception occurred when a VM service operation was in process.

NOTE:

If a host fault caused the VM to be stuck in a transient state, the VM can be restored using the HA mechanism. For details about the VM HA feature, see the product documentation.

Impact on the System

The VM becomes unavailable and occupies system resources.

Procedure

Restore the VM based on the VM statuses and task statuses listed in Table 18-154. For other situations, contact technical support for assistance.

Table 18-154 VM restoration methods

Each entry is in the format "VM status / task status (possible scenario): restoration method".

  • building / scheduling (Creating a VM): Method 2
  • building / None (Creating a VM): Method 2
  • building / block_device_mapping (Creating a VM): Method 2
  • building / networking (Creating a VM): Method 2
  • building / spawning (Creating a VM): Method 2
  • N/A / image_snapshot_pending (Exporting a snapshot): Set the VM state to active. For details, see Method 1.
  • N/A / image_snapshot (Exporting a snapshot): Set the VM state to active. For details, see Method 1.
  • N/A / image_pending_upload (Exporting a snapshot): Method 4
  • N/A / image_uploading (Exporting a snapshot): Set the VM state to active. For details, see Method 1.
  • N/A / image_backup (Creating a VM backup): Set the VM state to active. For details, see Method 1.
  • N/A / resize_prep (Migrating a VM or modifying VM attributes): Set the VM state to active. For details, see Method 1.
  • N/A / resize_migrating (Migrating a VM or modifying VM attributes): Method 4
  • N/A / resize_migrated (Migrating a VM or modifying VM attributes): Method 4
  • N/A / resize_finish (Migrating a VM or modifying VM attributes): Method 4
  • N/A / resize_reverting (Migrating a VM or modifying VM attributes): Method 4
  • N/A / rebooting (Restarting a VM): Set the VM state to active. For details, see Method 3.
  • N/A / reboot_pending (Restarting a VM): Set the VM state to active. For details, see Method 3.
  • N/A / reboot_started (Restarting a VM): Set the VM state to active. For details, see Method 3.
  • N/A / rebooting_hard (Restarting a VM): Set the VM state to active. For details, see Method 3.
  • N/A / reboot_pending_hard (Restarting a VM): Set the VM state to active. For details, see Method 3.
  • N/A / reboot_started_hard (Restarting a VM): Set the VM state to active. For details, see Method 3.
  • N/A / pausing (Pausing a VM): Set the VM state to active. For details, see Method 1.
  • N/A / unpausing (Unpausing a VM): Set the VM state to paused. For details, see Method 1.
  • N/A / suspending (Suspending a VM): Set the VM state to active. For details, see Method 1.
  • N/A / resuming (Resuming a VM): Set the VM state to suspended. For details, see Method 1.
  • N/A / powering_off (Stopping a VM): Set the VM state to active. For details, see Method 1.
  • N/A / powering_on (Starting a VM): Set the VM state to stopped. For details, see Method 1.
  • N/A / rebuilding (Rebuilding a VM): Method 6
  • N/A / rebuild_block_device_mapping (Rebuilding a VM): Method 6
  • N/A / rebuild_spawning (Rebuilding a VM): Method 6
  • N/A / migrating (Live migrating a VM): Method 4
  • N/A / rescheduling (Rescheduling a VM): Method 6
  • N/A / deleting (Deleting a VM): Method 5

Method 1

Reset the VM status based on its stuck state, restart the VM, and have the tenant confirm the recovery. A consolidated command example is provided after the steps.

  1. Reset the VM status based on the stuck state.

    Based on the VM status, reset the VM status. For details, see Setting the VM State. Then stop and start the VM as described in the following steps.

  2. Log in to the first controller node in the AZ.

    For details, see Using SSH to Log In to a Host.

  3. Import environment variables. For details, see Importing Environment Variables.
  4. Perform the following operations to query the VM attributes:

    1. Run the following command to enter the secure operation mode:

      runsafe

      The following information is displayed:

      Input command:
    2. Run the following command to query the VM attributes:

      nova show uuid

      The VM UUID can be obtained from the audit report.

  5. Run the following commands to stop the VM first and then query the VM state to verify that the VM is in the stopped state:

    nova stop uuid

    nova show uuid

    The VM UUID can be obtained from the audit report.

    After the VM is stopped, check whether any exception occurred during the preceding operations.

    • If yes, go to Method 4.
    • If no, go to the next step.

  6. Run the following commands to start the VM and then query the VM state to verify that the VM is in the active state:

    nova start uuid

    nova show uuid

    Check whether any exception occurs when you perform the preceding operations.

    • If yes, go to Method 4.
    • If no, go to the next step.

  7. Have the tenant log in to the VM and check whether any exception occurs.

    • If yes, go to Method 4.
    • If no, no further action is required.
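
The following is a minimal sketch of 5 and 6, assuming the hypothetical VM UUID 11111111-2222-3333-4444-555555555555 (replace it with the UUID from the audit report). Run each command at the Input command: prompt of the secure operation mode; the # lines are annotations.

  # Stop the VM and verify that it reaches the stopped (SHUTOFF) state
  nova stop 11111111-2222-3333-4444-555555555555
  nova show 11111111-2222-3333-4444-555555555555 | grep status
  # Start the VM and verify that it reaches the active state
  nova start 11111111-2222-3333-4444-555555555555
  nova show 11111111-2222-3333-4444-555555555555 | grep status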

Method 2

If a VM fails to be created, ask the tenant whether the VM is still needed. If yes, trigger VM high availability (HA) to restore the VM. If no, delete the VM and create a new one. An example of the HA check is provided after the steps.

NOTE:

Only an HA-enabled VM can be restored. A non-HA-enabled VM can only be deleted in this case.

For details about the HA function, see the product documentation.

  1. Ask the tenant whether the VM is still needed and must be restored.

    • If yes, go to the next step.
    • If no, have the tenant delete the VM. No further action is required.
    NOTE:

    If the VM is stuck in the deleting state for a long time, go to Method 5.

  2. Log in to the first controller node in the AZ.

    For details, see Using SSH to Log In to a Host.

  3. Import environment variables. For details, see Importing Environment Variables.
  4. Perform the following operations to query the VM attributes:

    1. Run the following command to enter the secure operation mode:

      runsafe

      The following information is displayed:

      Input command:
    2. Run the following command to query the VM attributes:

      nova show uuid

      The VM UUID can be obtained from the audit report.

  5. Run the following command to query whether the VM has HA enabled:

    nova show uuid

    Check whether there is a {'_ha_policy_type':'close'} key-value pair in the VM metadata field:

    • If yes, the VM does not have HA enabled.
    • If no, the VM has HA enabled.

  6. Determine the subsequent operation based on whether the VM has HA enabled.

    • If the VM has HA enabled, set the VM state to error. For details, see Setting the VM State. No further action is required. The VM will be automatically rebuilt at the preset HA triggering time.
    • If the VM does not have HA enabled, have the tenant delete the failed VM.
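
The following is a minimal sketch of the HA check in 5 and 6, assuming the hypothetical VM UUID 11111111-2222-3333-4444-555555555555. The grep filter only narrows the nova show output to the metadata field; the # lines are annotations.

  # If the metadata field contains {'_ha_policy_type':'close'}, HA is disabled and the tenant should delete the VM;
  # otherwise HA is enabled and the VM state can be set to error to trigger an automatic rebuild
  nova show 11111111-2222-3333-4444-555555555555 | grep metadata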

Method 3

If a VM fails to restart, have the tenant perform the steps provided in Setting the VM State to reset the VM status and then restart the VM. If the fault persists, detach the volumes from the VM and create the VM again.

  1. Have the tenant reset the VM status and then restart the VM. Check whether the VM is successfully restarted.

    • If yes, no further action is required.
    • If no, detach the volumes from the VM and create the VM again. For details, see Detaching Volumes from a VM and Creating a VM Again.

Method 4

Handle the failure based on the boot device of the VM.

  1. Log in to the first controller node in the AZ.

    For details, see Using SSH to Log In to a Host.

  2. Import environment variables. For details, see Importing Environment Variables.
  3. Determine the boot device of the VM.

    1. Run the following command to enter the secure operation mode:

      runsafe

      The following information is displayed:

      Input command:
    2. Run the following command to query the VM attributes:

      nova show vm_uuid

      vm_uuid specifies the UUID of the VM in the intermediate state in the audit report.

      In the command output, the image parameter provides information about the image used by the VM.

      The VM boots from a volume if the following information is displayed:

      | image                                | Attempt to boot from volume - no image supplied|

      The VM boots from an image if information similar to the following is displayed (d0bd0551-07f2-45f6-8516-f481e0152715 specifies the image ID):

      | image                                | cirros (d0bd0551-07f2-45f6-8516-f481e0152715)|
      • If the VM boots from an image, check whether the VM task is in the resize_migrating, resize_migrated, resize_finish, or resize_reverting state.
        • If yes, go to the next step.
        • If no, set the VM status to error according to Setting the VM State and then go to 8.
      • If the VM boots from a volume, handle the VM according to Detaching Volumes from a VM and Creating a VM Again.

  4. Execute the rollback script for VM cold migration/resize to migrate the VM to the source host. Then run the runsafe command to enter the secure operation mode and run the following command:

    python /etc/nova/nova-util/revert_migrate_vm.py vm_uuid

    • If no command output is displayed, the VM is successfully rolled back. Go to the next step.
    • If "WARNNING:xxx vm cannot be reverted." or other exception message is displayed, contact technical support for assistance.

  5. Execute the rollback script for the VM cold migration/VM image file resize.

    1. Run the runsafe command to enter the secure operation mode and run the following command to query the host accommodating the VM:

      nova show vm_uuid

      The OS-EXT-SRV-ATTR:host parameter in the command output specifies the ID of the host accommodating the VM.

    2. Log in to the FusionSphere OpenStack web client and query the IP address of the host's External OM plane on the Summary page based on the host ID.
    3. Log in to the host as user root and execute the rollback script for the VM cold migration/VM image file resize.

      sh /etc/nova/nova-util/revert_migrate_vm_file.sh vm_uuid

      • If no command output is displayed, the image file is successfully rolled back. Go to the next step.
      • If "WARNNING:xxx image file need not revert." is displayed and the VM task status is resize_migrating, go to the next step. In other situations, contact technical support for assistance.

  6. On the controller node, perform the cold migration operation. For details, see 1.

    The cold migration is mandatory. Otherwise, a resource collection error may occur on the host. Before performing the cold migration, ensure that the other hosts have sufficient resources to accommodate the migrated VM.

    1. Run the runsafe command to enter the secure operation mode and run the following command:

      nova migrate vm_uuid

      If no command output is displayed, the cold migration can be performed. If "No valid host" is displayed, release host resources first.

    2. Run the nova show vm_uuid command multiple times to check whether the VM status is VERIFY_RESIZE and task_state is -.
      • If yes, the operation is successful. Go to the next step.
      • If no, contact technical support for assistance.
    3. Run the nova resize-confirm vm_uuid command to confirm the cold migration.
    4. Run the nova show vm_uuid command multiple times to check whether the VM status is SHUTOFF and task_state is -.
      • If yes, the cold migration is complete. Go to the next step.
      • If no, contact technical support for assistance.

  7. Start the VM.

    1. Run the runsafe command to enter the secure operation mode and run the following command:

      nova start vm_uuid

    2. Run the nova show vm_uuid command multiple times to check whether the VM status is ACTIVE and task_state is -.
      • If yes, the VM is successfully restored. No further action is required.
      • If no, go to the next step.

  8. Rebuild the VM.

    If the VM boots from an image, rebuilding the VM may cause data loss on the system volume. Therefore, contact technical support for assistance before rebuilding the VM.

    1. Run the runsafe command to enter the secure operation mode and run the following command:

      nova show vm_uuid

      In the command output, the image parameter specifies the image ID.

      | image                                | cirros (d0bd0551-07f2-45f6-8516-f481e0152715)|

      For example, d0bd0551-07f2-45f6-8516-f481e0152715 is the image ID.

    2. Run the following command to rebuild the VM:

      nova rebuild vm_uuid Image ID

    3. Run the nova show vm_uuid command multiple times to check whether the VM status is ACTIVE and task_state is -.
      • If yes, the VM is successfully restored. No further action is required.
      • If no, contact technical support for assistance.
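
The following is a minimal end-to-end sketch of 4 to 7 for an image-booted VM stuck in a resize state, assuming the hypothetical VM UUID 11111111-2222-3333-4444-555555555555. The first command runs on the controller node, the second on the host accommodating the VM (as user root), and the rest on the controller node at the Input command: prompt; the # lines are annotations.

  # Roll back the cold migration/resize to the source host (no output means success)
  python /etc/nova/nova-util/revert_migrate_vm.py 11111111-2222-3333-4444-555555555555
  # Roll back the VM image file on the host accommodating the VM
  sh /etc/nova/nova-util/revert_migrate_vm_file.sh 11111111-2222-3333-4444-555555555555
  # Perform the mandatory cold migration and confirm it (status must reach VERIFY_RESIZE, then SHUTOFF)
  nova migrate 11111111-2222-3333-4444-555555555555
  nova resize-confirm 11111111-2222-3333-4444-555555555555
  # Start the VM and verify that status changes to ACTIVE
  nova start 11111111-2222-3333-4444-555555555555
  nova show 11111111-2222-3333-4444-555555555555 | grep status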

Method 5

If a VM is stuck in the deleting state and cannot be restored, manually delete the VM. A consolidated command example is provided after the steps.

  1. Set the VM to the stopped state. For details, see Setting the VM State.

  2. Log in to the first controller node in the AZ.

    For details, see Using SSH to Log In to a Host.

  3. Import environment variables. For details, see Importing Environment Variables.
  4. Perform the following operations to check whether the VM exists in the database:

    NOTE:

    An orphan VM does not exist in the system database or is in the deleted state in the database.

    1. Run the following command to enter the secure operation mode:

      runsafe

      The following information is displayed:

      Input command:
    2. Run the following command to check whether the VM exists in the database:

      nova show vm_uuid

      Check whether the command is correctly executed and whether the VM information is displayed:

      • If yes, make a note of the value of OS-EXT-SRV-ATTR:instance_name in the command output, which indicates the VM name (hyper_vm_name) registered in the hypervisor, and the value of OS-EXT-SRV-ATTR:host, which indicates the host ID (host_id), and go to the next step.
      • If no, contact technical support for assistance.

  5. Enter the secure operation mode according to 3 and obtain the management IP address of the host running the VM.

    Run the runsafe command to enter the secure operation mode, enter the user password as prompted, and run the following command:

    cps host-show host_id | grep manageip | awk -F '|' '{print $3}'

  6. Log in to the host accommodating the VM and run the following command to check whether the VM runs on the host:

    nova_virsh_cmd virsh-instance-state hyper_vm_name

    hyper_vm_name is the value you obtained in 4. Check whether Non-Active is displayed in the command output:

    • If yes, contact technical support for assistance.
    • If no, go to the next step.

  7. Run the following command to stop the VM:

    nova_virsh_cmd virsh-instance-shutdown hyper_vm_name

  8. Log in to the controller node (for details, see 2) and run the following command in the database to check whether the VM has volumes attached:

    nova show vm_uuid

    Check whether the os-extended-volumes:volumes_attached field contains any value.

    • If yes, the VM has volumes attached. Make a note of the UUIDs of each volume and go to 9.
    • If no, no volumes are attached to the VM. Go to 10.

  9. On the controller node, run the following command for each volume attached to the VM to detach the volumes one by one:

    nova volume-detach vm_uuid volume_uuid

    NOTE:

    Do not detach the root device volume.

    Check whether any exception occurs when you perform the preceding operations.

    • If yes, contact technical support for assistance.
    • If no, go to 10.

  10. Determine whether this VM can be deleted.

    • If yes, go to 11.
    • If no, contact technical support for assistance. No further action is required.

  11. Run the following command to delete the VM:

    nova delete vm_uuid

    Run the following command to check whether the VM is successfully deleted:

    nova show vm_uuid

    • If yes, no further action is required.
    • If no, contact technical support for assistance.
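
The following is a minimal sketch of 8 to 11, assuming the hypothetical VM UUID 11111111-2222-3333-4444-555555555555 and the hypothetical attached volume UUID aaaaaaaa-bbbb-cccc-dddd-eeeeeeeeeeee (take the real values from the nova show output). Run the commands on the controller node at the Input command: prompt; the # lines are annotations.

  # List the volumes attached to the VM
  nova show 11111111-2222-3333-4444-555555555555 | grep volumes_attached
  # Detach each attached volume except the root device volume
  nova volume-detach 11111111-2222-3333-4444-555555555555 aaaaaaaa-bbbb-cccc-dddd-eeeeeeeeeeee
  # Delete the VM and verify that it no longer exists
  nova delete 11111111-2222-3333-4444-555555555555
  nova show 11111111-2222-3333-4444-555555555555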

Method 6

Reset the VM task status and recreate the VM.

  1. Log in to the first controller node in the AZ.

    For details, see Using SSH to Log In to a Host.

  2. Import environment variables. For details, see Importing Environment Variables.
  3. Perform the following operations to reset the VM task status:

    1. Run the following command to enter the secure operation mode:

      runsafe

      The following information is displayed:

      Input command:
    2. Run the following command to reset the VM task status:

      nova reset-state vm_uuid

  4. Check whether the VM has HA enabled.

    • If yes, the VM is automatically recreated after 5 to 10 minutes. After 10 minutes, go to 6.
    • If no, go to 5.

  5. Rebuild the VM according to 3.

    If the VM boots from an image, rebuilding the VM may cause data loss on the system volume. Therefore, contact technical support for assistance before rebuilding the VM.

    1. Run the runsafe command to enter the secure operation mode, enter the user password as prompted, and run the following command:

      nova show vm_uuid

      In the command output, check the image value. If information similar to the following is displayed, the VM boots from an image:

      | image | cirros (d0bd0551-07f2-45f6-8516-f481e0152715)|
    2. Run the following command to rebuild the VM:

      nova rebuild vm_uuid image_uuid

  6. Perform the following operations to query the VM status. For details, see 3.

    Run the runsafe command to enter the secure operation mode, enter the user password as prompted, and run the following command:

    nova show vm_uuid

    In the command output, check the value of status.

    • If the value is REBUILD, the VM is being recreated. Query the VM status again after 1 minute.
    • If the value is ACTIVE, the VM is restored. No further action is required.
    • If other values are displayed, the VM fails to be restored. Contact technical support for assistance.
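
The following is a minimal sketch of 3 to 6 for a VM without HA enabled, assuming the hypothetical VM UUID 11111111-2222-3333-4444-555555555555 and the example image ID d0bd0551-07f2-45f6-8516-f481e0152715 shown above. Run the commands at the Input command: prompt; the # lines are annotations.

  # Reset the VM task status
  nova reset-state 11111111-2222-3333-4444-555555555555
  # Rebuild the VM from its image (this may cause data loss on the system volume)
  nova rebuild 11111111-2222-3333-4444-555555555555 d0bd0551-07f2-45f6-8516-f481e0152715
  # Query the status until it changes from REBUILD to ACTIVE
  nova show 11111111-2222-3333-4444-555555555555 | grep status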

Stuck Images

Context

An image in the active state is available for use, whereas an image stuck in the queued or saving state is unavailable. If an image is kept stuck in the queued or saving state, delete the image.

Description

The name of the audit report is stucking_images.csv. Table 18-155 describes parameters in the report.

Table 18-155 Parameter description

Parameter

Description

id

Specifies the image ID.

status

Specifies the image status.

updated_at

Specifies the last time when the image was updated.

owner

Specifies the ID of the tenant who created the image.

Impact on the System

  • An image in the queued state does not occupy system resources, but the image is unavailable.
  • An image in the saving state has residual image files that occupy the storage space.

Possible Causes

  • The image creation process is not complete: The image was not uploaded to the image server within 24 hours after it was created. In this case, the image is kept in the queued state.
  • During the image creation process, an exception (for example, intermittent network disconnection) occurred when the image was being uploaded. In this case, the image is kept in the queued state.
  • When an image was being uploaded, the Glance service failed. In this case, the image is kept in the saving state.

Procedure

Delete the image that is kept stuck in the queued or saving state and create another one.

  1. Use PuTTY to log in to the first host in the FusionSphere OpenStack system through the Reverse-Proxy.

    The default username is fsp, and the default password is Huawei@CLOUD8.

  2. Run the following command and enter the password Huawei@CLOUD8! of user root to switch to user root:

    su - root

  3. Import environment variables. For details, see Importing Environment Variables.
  4. Perform the following operations to delete the stuck image:

    1. Run the following command to enter the secure operation mode:

      runsafe

      The following information is displayed:

      Input command:
    2. Run the following CLI command to delete the image and check whether the command is successfully executed:

      glance image-delete id

      NOTE:

      You can also have the tenant delete the image.

      The image ID can be obtained from the id field in the audit report.

      • If yes, no further action is required.
      • If no, contact technical support for assistance.
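
The following is a minimal sketch of 4, assuming the hypothetical image ID 22222222-3333-4444-5555-666666666666 taken from the id column of stucking_images.csv. Run the command at the Input command: prompt of the secure operation mode; the # line is an annotation.

  # Delete the image that is stuck in the queued or saving state
  glance image-delete 22222222-3333-4444-5555-666666666666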

Orphan Volumes

Context

An orphan volume is one that exists on a storage device but is not recorded in the Cinder database.

If a volume is orphaned and the management data is lost due to backup-based system restoration, use the orphan volume to create another volume and notify the tenant to use the new volume.

NOTE:

Do not delete the orphan VM volumes handled in Orphan VMs.

Parameter Description

The name of the audit report is wildVolumeAudit.csv. Table 18-156 describes parameters in the report.

Table 18-156 Parameter description

Parameter

Description

volume_name

Specifies the volume name on the storage device.

volume_type

Specifies the volume type, including san, dsware (FusionStorage), and v3.

NOTE:

In this section, san indicates Huawei 5500 T series storage devices, and v3 indicates V3 series storage devices (including Dorado and 18000).

Impact on the System

An orphan volume is unavailable in the Cinder service but occupies the storage space.

Possible Causes

  • The database is reverted using a data backup to the state when the backup was created. However, after the backup was created, one or more volumes were created. After the database is restored, records of these volumes are deleted from the database, but the volumes reside on their storage devices and become orphan volumes.
  • The storage system is shared by multiple FusionSphere systems.
  • Volumes on the storage device are not created using the Cinder service.
NOTE:

When you design system deployment for a site, do not allow multiple hypervisors to share one storage system, and create volumes on a storage device only by using the Cinder service. Otherwise, false audit reports may be generated.

Procedure

  1. Log in to the first host in an AZ.

    For details, see Using SSH to Log In to a Host.

  2. Import environment variables to the host.

    For details, see Importing Environment Variables.

  3. Search for the volume on the backend storage based on the volume_name of each record in the report.

    • If a volume is not mapped to any VM, observe it for 72 hours. If the volume is still not mapped to a VM, ask the tenant whether the orphan volume needs to be restored.
    • If a volume is mapped to a VM, see Inconsistent Volume Attachment Information.
    NOTE:

    During the 72-hour observation period, work with the tenant to check whether any VM becomes faulty, because a VM will become faulty if it uses a volume as its disk but the volume is not mapped to the VM. In addition, check whether FusionSphere OpenStack is running properly. If the target volume was created on the disk array, FusionSphere OpenStack may generate an alarm.

  4. Obtain the operation report for the volume. For details, see Obtaining the Operation Report.
  5. Run the following command to query the mapping between the volume ID and the volume name and make a note of the volume ID:

    python /usr/bin/info-collect-script/audit_resume/storage_name_relations.py -ia Path to the audit report for the orphan volume -io Path to the operation log report -o Path to the execution result file -vt volume

    NOTE:

    Ensure that the audit report and the operation log report have been copied to the current host.

    The following is an example:

    python /usr/bin/info-collect-script/audit_resume/storage_name_relations.py -ia /var/log/audit/2014-09-23_070554/audit/wildVolumeAudit.csv -io /tmp/op_log/cinder-2014\:09\:22-00\:00\:00_unlimit.csv -o /tmp/result.csv -vt volume

    The command is successfully executed if the following information is displayed:

    Successful!

    Check whether the command is successfully executed.

    • If yes, go to 6.
    • If no, contact technical support for assistance.

  6. Run the following command to view the execution result file:

    cat Execution result file name

    The following is an example:

    cat /tmp/result.csv

    NOTE:

    /tmp/result.csv is a file storing execution results. If the file content is empty, the volume does not exist.

    Information similar to the following is displayed:

    volume_name,volume_id,tenants_id  
    volume-044e14af-9d11-4ee9-9b5a-0dcbcd5033aa,044e14af-9d11-4ee9-9b5a-0dcbcd5033aa,5c5e1c868a184035a84b3aaa61e32993  
    volume-18ff2024-07d1-427c-924d-dd8207f9af99,18ff2024-07d1-427c-924d-dd8207f9af99,5c5e1c868a184035a84b3aaa61e32993  
    volume-bcda8a1b-cb15-4bb8-8b55-0cb7c763a85a,bcda8a1b-cb15-4bb8-8b55-0cb7c763a85a,5c5e1c868a184035a84b3aaa61e32993

    Check whether the command output contains the orphan volume.

    • If yes, go to the next step.
    • If no, contact technical support for assistance.

  7. Obtain the volume attributes. For details, see Querying Volume Attributes.
  8. Use the orphan volume to create another volume and replicate the original data to the new volume.

    For details, see Restoring Volume Data.

    Check whether any exception occurred when you performed the preceding operations.

    • If yes, contact technical support for assistance.
    • If no, no further action is required.
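
The following is a minimal sketch of 5 and 6, reusing the example paths and the example volume name volume-044e14af-9d11-4ee9-9b5a-0dcbcd5033aa shown above; the grep filter only narrows the execution result file to the orphan volume being handled. The # lines are annotations.

  # Map storage-side volume names to Cinder volume IDs and tenant IDs
  python /usr/bin/info-collect-script/audit_resume/storage_name_relations.py -ia /var/log/audit/2014-09-23_070554/audit/wildVolumeAudit.csv -io /tmp/op_log/cinder-2014\:09\:22-00\:00\:00_unlimit.csv -o /tmp/result.csv -vt volume
  # Look up the record for the orphan volume from the audit report
  grep volume-044e14af-9d11-4ee9-9b5a-0dcbcd5033aa /tmp/result.csv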

Invalid Volumes

Context

An invalid volume is one that is recorded in the Cinder database but does not exist on any storage device.

Delete the invalid volume from the Cinder database.

Parameter Description

The name of the audit report is fakeVolumeAudit.csv. Table 18-157 describes parameters in the report.

Table 18-157 Parameter description

Parameter

Description

volume_id

Specifies the volume ID.

volume_displayname

Specifies the name of the volume created by a tenant.

volume_name

Specifies the volume name on the storage device.

volume_type

Specifies the volume type, including san, dsware (FusionStorage), and v3.

location

Specifies the volume location.

NOTE:

In this section, san indicates Huawei 5500 T series storage devices, and v3 indicates V3 series storage devices (including Dorado and 18000).

Impact on the System

The volume can be queried using the Cinder command but it does not exist on any storage device and is unavailable for use.

Possible Causes

The database is reverted using a data backup to the state when the backup was created. However, after the backup was created, one or more volumes were deleted. After the database is restored, records of these volumes remain in the database and become invalid volumes.

Procedure

  1. Log in to the first host in an AZ.

    For details, see Using SSH to Log In to a Host.

  2. Import environment variables to the host.

    For details, see Importing Environment Variables.

  3. Perform the following operations to check whether the volume exists in the Cinder service:

    1. Run the following command to enter the secure operation mode:

      runsafe

      The following information is displayed:

      Input command:
    2. Run the following command to check whether the volume exists in the Cinder service:

      cinder show Volume ID

      The volume ID can be obtained from the audit report. The following is an example:

      cinder show 044e14af-9d11-4ee9-9b5a-0dcbcd5033aa

      Check whether the command output contains ERROR, which indicates that the volume does not exist in the Cinder service.

      • If yes, the volume does not exist in the Cinder service. Contact technical support for assistance.
      • If no, the volume exists in the Cinder service. Go to 4.

  4. Perform the following operations to query the host list:

    1. Enter the secure operation mode.

      For details, see Command Execution Methods.

    2. Run the following command to query the management IP address of any controller node:

      cps host-list

      Information similar to the following is displayed:

      +--------------------------------------+-----------+----------------------+--------+------------+ 
      | id                                   | boardtype | roles                | status | manageip | 
      +--------------------------------------+-----------+----------------------+--------+------------+ 
      | 778F416E-C3BB-11A0-8567-000000821800 | BC11BTSA  | auth,                | normal | 172.29.6.1 | 
      |                                      |           | blockstorage-driver, |        |            | 
      |                                      |           | compute,             |        |            | 
      |                                      |           | controller,          |        |            | 
      |                                      |           | image                |        |            | 
      | AE0CCD20-C1CF-1179-8567-000000821800 | BC11BTSA  | auth,                | normal | 172.29.0.2 | 
      |                                      |           | blockstorage-driver, |        |            | 
      |                                      |           | compute,             |        |            | 
      |                                      |           | controller,          |        |            | 
      |                                      |           | image,               |        |            | 
      |                                      |           | loadbalancer,        |        |            | 
      |                                      |           | router,              |        |            | 
      |                                      |           | sys-server           |        |            | 
      | 404ECF92-DBCF-11E4-8567-000000821800 | BC11BTSA  | auth,                | normal | 172.29.6.3 | 
      |                                      |           | blockstorage-driver, |        |            | 
      |                                      |           | compute,             |        |            | 
      |                                      |           | controller,          |        |            | 
      |                                      |           | image                |        |            | 
      +--------------------------------------+-----------+----------------------+--------+------------+

      The value of manageip indicates the management IP address of the controller node.

  5. Run the following command to log in to a host whose roles value is blockstorage-driver:

    su fsp

    ssh fsp@Management IP address

    The following is an example:

    ssh fsp@172.29.6.3

    After you log in to a host as user fsp, run the su - root command to switch to user root.

    • The default password of user fsp is Huawei@CLOUD8.
    • The default password of user root is Huawei@CLOUD8!.

  6. Run the following command to query the storage type:

    python /usr/bin/info-collect-script/audit_resume/get_host_storage_info.py

    Determine the storage type based on the command output.

    NOTE:

    If the storage type displayed in the command output is inconsistent with the volume type, go to 5 to log in to another host with blockstorage-driver assigned and perform the subsequent operations.

    • If the following information is displayed:
      storage_type=dsware 
      addition info is : 
                manage_ip=172.28.0.231 
                vbs_url=172.28.6.1,172.28.6.0,172.28.0.2

      The volume storage type is dsware. Go to 7. The value of manage_ip indicates the FusionStorage Manager node IP address, and the value of vbs_url indicates the compute node management IP address.

    • If the following information is displayed:
      storage_type=san 
      addition info is : 
      ControllerIP0 is 192.168.172.40 
      ControllerIP1 is 192.168.172.41

      The volume storage type is san. Go to 8. The values of ControllerIP0 and ControllerIP1 indicate the SAN storage device management IP addresses.

      NOTE:

      If the values of ControllerIP0 and ControllerIP1 are x.x.x.x or 127.0.0.1 in the command output, the volume storage type is v3. Go to 8.

  7. Run the following command to query information about the volume:

    fsc_cli --ip Compute node management IP address --manage_ip FusionStorage Manager node IP address --port 10519 --op queryVolume --volName Volume name on the storage device

    The following is an example:

    fsc_cli --ip 172.29.6.6 --manage_ip 172.29.0.231 --port 10519 --op queryVolume --volName volume-6f2282f1-22b3-41f1-8b3f-d15aa9790388

    The volume exists on the storage device if information similar to the following is displayed:

    result=0  
    vol_name=volume-6f2282f1-22b3-41f1-8b3f-d15aa9790388,father_name=,status=0,vol_size=1024,real_size=-1,pool_id=0,create_time=2014-09-18 07:23:21

    Check whether the volume exists on the storage device attached to the host whose roles value is blockstorage-driver in the AZ.

    • If yes, contact technical support for assistance.
    • If no, go to 10.

  8. Log in to the OceanStor DeviceManager of the IP SAN device, choose Storage Resource > LUN (for v3 storage devices, choose Provisioning > LUN), search for volumes, and check whether the volume exists on the storage device attached to the host whose roles value is blockstorage-driver in the AZ.

    • If yes, contact technical support for assistance.
    • If no, go to 10.

  9. Import environment variables to the host.

    For details, see Importing Environment Variables.

  10. Enter the secure operation mode (for details, see 3) and run the following command to delete the invalid volume:

    cinder delete --cascade Volume ID

    Run the following command to query the status of the deleted volume:

    cinder show Volume ID

    Check whether the deleted volume still exists.

    • If yes, contact technical support for assistance.
    • If no, no further action is required.
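
The following is a minimal sketch of 3 and 10, reusing the example volume ID 044e14af-9d11-4ee9-9b5a-0dcbcd5033aa shown above. Run the commands at the Input command: prompt of the secure operation mode; the # lines are annotations.

  # Confirm that the volume record still exists in the Cinder database
  cinder show 044e14af-9d11-4ee9-9b5a-0dcbcd5033aa
  # Delete the invalid volume record, then verify that the record is gone
  cinder delete --cascade 044e14af-9d11-4ee9-9b5a-0dcbcd5033aa
  cinder show 044e14af-9d11-4ee9-9b5a-0dcbcd5033aa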

Orphan Volume Snapshots

Context

An orphan volume snapshot is one that exists on a storage device but is not recorded in the Cinder database.

Delete the orphan volume snapshot from the storage device.

Parameter Description

The name of the audit report is wildSnapshotAudit.csv. Table 18-158 describes parameters in the report.

Table 18-158 Parameter description

Parameter

Description

snap_name

Specifies the volume snapshot name on the storage device.

snap_type

Specifies the snapshot type, including san, dsware (FusionStorage), and v3.

NOTE:

In this section, san indicates Huawei 5500 T series storage devices, and v3 indicates V3 series storage devices (including Dorado and 18000).

Impact on the System

An orphan volume snapshot occupies the storage space.

Possible Causes

The database is reverted using a data backup to the state when the backup was created. However, after the backup was created, one or more volume snapshots were created. After the database is restored, records of these snapshots are deleted from the database, but the snapshots reside on their storage devices and become orphan volume snapshots.

Procedure

  1. Obtain the operation report for the volume snapshot. For details, see Obtaining the Operation Report.

    If the snapshot was created after the management data was backed up and before the system database was restored, all of the following conditions must be met:

    The value of res_id is that of uuid in the audit report. 
    The value of res_type is snapshots. 
    The value of time is the time after the management data was backed up and before the system database was restored. 
    In the values of the action field, the value of the HTTP request method is POST, the HTTP request URL is /v2/tenant_id/snapshots, and the value of tenant_id is the value of tenant in the audit report.

  2. Log in to the first host in an AZ.

    For details, see Using SSH to Log In to a Host.

  3. Import environment variables to the host.

    For details, see Importing Environment Variables.

  4. Run the following command to query the mapping between the snapshot ID and the snapshot name and make a note of the snapshot ID:

    python /usr/bin/info-collect-script/audit_resume/storage_name_relations.py -ia Path to the audit report for the orphan snapshot -io Path to the operation log report -o Path to the result file -vt snapshot

    NOTE:

    Ensure that the audit report and the operation log report have been copied to the current host.

    The following is an example:

    python /usr/bin/info-collect-script/audit_resume/storage_name_relations.py -ia /var/log/audit/2014-09-23_070554/audit/wildSnapshotAudit.csv -io /tmp/op_log/cinder-2014\:09\:22-00\:00\:00_unlimit.csv -o /tmp/result.csv -vt snapshot

    NOTE:

    /tmp/result.csv is a file storing execution results. If the file content is empty, the snapshot does not exist.

    The command is successfully executed if the following information is displayed:

    Successful!

    Check whether the command is successfully executed.

    • If yes, go to 5.
    • If no, contact technical support for assistance.

  5. Run the following command to view the execution result file:

    cat Execution result file name

    The following is an example:

    cat /tmp/result.csv

    Information similar to the following is displayed:

    snap_name,snap_id,tenants_id  
    _snapshot-d57ecea2-5408-4976-b944-3b6d948c398b,d57ecea2-5408-4976-b944-3b6d948c398b,5c5e1c868a184035a84b3aaa61e32993

    Check whether the command output contains the snapshot name.

    • If yes, the snapshot ID is the value of snap_id of the snapshot. Go to the next step.
    • If no, contact technical support for assistance.

  6. Perform the following operations to check whether the snapshot exists in the Cinder service:

    1. Run the following command to enter the secure operation mode:

      runsafe

      The following information is displayed:

      Input command:
    2. Run the cinder snapshot-show Snapshot ID command.

      The following is an example:

      cinder snapshot-show 1cd5c6eb-e729-4773-b846-e9f1d3467c56

      Check whether the command output contains ERROR. If ERROR is contained, the snapshot does not exist in the Cinder service.

      ERROR: No snapshot with a name or ID of '1cd5c6eb-e729-4773-b846-e9f1d3467c56' exists.

      Check whether the snapshot exists in the Cinder service.

      • If yes, contact technical support for assistance.
      • If no, go to 7.

  7. Perform the following operations to query the host list:

    1. Enter the secure operation mode.

      For details, see Command Execution Methods.

    2. Run the following command to query the management IP address of any controller node:

      cps host-list

      Information similar to the following is displayed:

      +--------------------------------------+-----------+----------------------+--------+------------+  
      | id                                   | boardtype | roles                | status | manageip |  
      +--------------------------------------+-----------+----------------------+--------+------------+  
      | 778F416E-C3BB-11A0-8567-000000821800 | BC11BTSA  | auth,                | normal | 172.29.6.1 |  
      |                                      |           | blockstorage-driver, |        |            |  
      |                                      |           | compute,             |        |            |  
      |                                      |           | controller,          |        |            |  
      |                                      |           | image                |        |            |  
      | AE0CCD20-C1CF-1179-8567-000000821800 | BC11BTSA  | auth,                | normal | 172.29.0.2 |  
      |                                      |           | blockstorage-driver, |        |            |  
      |                                      |           | compute,             |        |            |  
      |                                      |           | controller,          |        |            |  
      |                                      |           | image,             |        |            |  
      |                                      |           | loadbalancer,        |        |            |  
      |                                      |           | router,              |        |            |  
      |                                      |           | sys-server           |        |            |  
      | 404ECF92-DBCF-11E4-8567-000000821800 | BC11BTSA  | auth,                | normal | 172.29.6.3 |  
      |                                      |           | blockstorage-driver, |        |            |  
      |                                      |           | compute,             |        |            |  
      |                                      |           | controller,          |        |            |  
      |                                      |           | image                |        |            |  
      +--------------------------------------+-----------+----------------------+--------+------------+

      The value of manageip indicates the management IP address of the controller node.

  8. Run the following command to log in to a host whose roles value is blockstorage-driver:

    su fsp

    ssh fsp@Management IP address

    The following is an example:

    ssh fsp@172.29.6.3

    After you log in to a host as user fsp, run the su - root command to switch to user root.

    • The default password of user fsp is Huawei@CLOUD8.
    • The default password of user root is Huawei@CLOUD8!.

  9. Run the following command to query the storage type:

    python /usr/bin/info-collect-script/audit_resume/get_host_storage_info.py

    Determine the storage type based on the command output.

    NOTE:

    If the storage type displayed in the command output is inconsistent with the snapshot type, go to 8 to log in to another host with blockstorage-driver assigned and perform the subsequent operations.

    • If the following information is displayed:
      storage_type=dsware 
      addition info is : 
                manage_ip=172.28.0.231 
                vbs_url=172.28.6.1,172.28.6.0,172.28.0.2

      The snapshot storage type is dsware. Go to 10. The value of manage_ip indicates the FusionStorage Manager node IP address, and the value of vbs_url indicates the compute node management IP address.

    • If the following information is displayed:
      storage_type=san 
      addition info is : 
                ControllerIP0 is 192.168.172.40 
                ControllerIP1 is 192.168.172.41

      The snapshot storage type is san. Go to 11. The values of ControllerIP0 and ControllerIP1 indicate the SAN storage device management IP addresses.

      NOTE:

      If the values of ControllerIP0 and ControllerIP1 are x.x.x.x or 127.0.0.1 in the command output, the snapshot storage type is v3. Go to 11.

  10. Run the following command to query the snapshot information:

    fsc_cli --ip Compute node management IP address --manage_ip FusionStorage Manager node IP address --port 10519 --op querySnapshot --snapName Snapshot name on the storage device

    The following is an example:

    fsc_cli --ip 172.29.6.6 --manage_ip 172.29.0.231 --port 10519 --op querySnapshot --snapName snapshot-6f2282f1-22b3-41f1-8b3f-d15aa9790388

    The snapshot exists on the storage device if information similar to the following is displayed:

    result=0  
    snap_name=snapshot-6f2282f1-22b3-41f1-8b3f-d15aa9790388,father_name=,status=0,snap_size=1024,real_size=-1,pool_id=0,create_time=2014-09-18 07:23:21

    Check whether the snapshot exists on the storage device attached to the host whose roles value is blockstorage-driver in the AZ.

    • If yes, go to 12.
    • If no, contact technical support for assistance.

  11. Log in to the OceanStor DeviceManager of the IP SAN device, choose Storage Resource > LUN (for V3 storage devices, choose Data Protection > Snapshots and check whether snapshots with the specified names exist on each LUN), search for snapshots, and check whether the snapshot exists on the storage device attached to the host whose roles value is blockstorage-driver in the AZ.

    • If yes, go to 12.
    • If no, contact technical support for assistance.

  12. Obtain the operation report for the volume snapshot. For details, see Obtaining the Operation Report. Then determine whether to delete the snapshot.

    • If yes, go to 13.
    • If no, no further action is required.

  13. Check the snapshot storage type and run its corresponding command to delete the snapshot.

    • If the storage type is san, log in to the OceanStor DeviceManager system and delete the target snapshot (for V3 storage devices, choose Data Protection > Snapshots to search for target snapshots and delete them).
      NOTE:

      Before you delete the snapshot, confirm its running status. If the snapshot is activated, click more and then click Cancel in the drop-down menu before you delete it.

    • If the storage type is dsware, run the following command to delete the snapshot:

      fsc_cli --ip Compute node management IP address --manage_ip FusionStorage Manager node IP address --port 10519 --op deleteSnapshot --snapName Snapshot name on the storage device

      The following is an example:

      fsc_cli --ip 172.29.6.6 --manage_ip 172.29.0.231 --port 10519 --op deleteSnapshot --snapName snapshot-6f2282f1-22b3-41f1-8b3f-d15aa9790388

    Check whether the snapshot is successfully deleted.

    • If yes, no further action is required.
    • If no, contact technical support for assistance.
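
The following is a minimal sketch of 10 and 13 for dsware (FusionStorage) storage, reusing the example IP addresses and snapshot name shown above (the compute node and FusionStorage Manager IP addresses come from the get_host_storage_info.py output). The # lines are annotations.

  # Confirm that the orphan snapshot exists on the storage device
  fsc_cli --ip 172.29.6.6 --manage_ip 172.29.0.231 --port 10519 --op querySnapshot --snapName snapshot-6f2282f1-22b3-41f1-8b3f-d15aa9790388
  # Delete the orphan snapshot from the storage device
  fsc_cli --ip 172.29.6.6 --manage_ip 172.29.0.231 --port 10519 --op deleteSnapshot --snapName snapshot-6f2282f1-22b3-41f1-8b3f-d15aa9790388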

Invalid Volume Snapshots

Context

An invalid volume snapshot is one that is recorded in the Cinder database but does not exist on any storage device.

Delete the invalid volume snapshot from the Cinder database.

Parameter Description

The name of the audit report is fakeSnapshotAudit.csv. Table 18-159 describes parameters in the report.

Table 18-159 Parameter description

Parameter

Description

snap_id

Specifies the snapshot ID.

snap_name

Specifies the volume snapshot name on the storage device.

volume_id

Specifies the base volume ID.

snap_type

Specifies the snapshot type, including san, dsware (FusionStorage), and v3.

location

Specifies the snapshot location.

NOTE:

In this section, san indicates Huawei 5500 T series storage devices, and v3 indicates V3 series storage devices (including Dorado and 18000).

Impact on the System

The snapshot can be queried using the Cinder command but it does not exist on any storage device and is unavailable for use.

Possible Causes

The database is reverted using a data backup to the state when the backup was created. However, after the backup was created, one or more volume snapshots were deleted. After the database is restored, records of these volume snapshots remain in the database and become invalid volume snapshots.

Procedure

  1. Log in to the first host in an AZ.

    For details, see Using SSH to Log In to a Host.

  2. Import environment variables to the host.

    For details, see Importing Environment Variables.

  3. Perform the following operations to check whether the snapshot exists in the Cinder service:

    1. Run the following command to enter the secure operation mode:

      runsafe

      The following information is displayed:

      Input command:
    2. Run the cinder snapshot-show Snapshot ID command.

      The following is an example:

      cinder snapshot-show 1cd5c6eb-e729-4773-b846-e9f1d3467c56

      Check whether the command output contains ERROR. If ERROR is contained, the snapshot does not exist in the Cinder service.

      ERROR: No snapshot with a name or ID of '1cd5c6eb-e729-4773-b846-e9f1d3467c56' exists.

      Check whether the snapshot exists in the Cinder service.

      • If yes, go to 4.
      • If no, contact technical support for assistance.

  4. Perform the following operations to query the host list:

    1. Enter the secure operation mode. For details, see Command Execution Methods.
    2. Run the following command to query the management IP address of any controller node:

      cps host-list

      Information similar to the following is displayed:

      +--------------------------------------+-----------+----------------------+--------+------------+ 
      | id                                   | boardtype | roles                | status | manageip | 
      +--------------------------------------+-----------+----------------------+--------+------------+ 
      | 778F416E-C3BB-11A0-8567-000000821800 | BC11BTSA  | auth,                | normal | 172.29.6.1 | 
      |                                      |           | blockstorage-driver, |        |            | 
      |                                      |           | compute,             |        |            | 
      |                                      |           | controller,          |        |            | 
      |                                      |           | image                |        |            | 
      | AE0CCD20-C1CF-1179-8567-000000821800 | BC11BTSA  | auth,                | normal | 172.29.0.2 | 
      |                                      |           | blockstorage-driver, |        |            | 
      |                                      |           | compute,             |        |            | 
      |                                      |           | controller,          |      |            | 
      |                                      |           | image,               |        |            | 
      |                                      |           | loadbalancer,        |        |            | 
      |                                      |           | router,              |        |            | 
      |                                      |           | sys-server           |        |            | 
      | 404ECF92-DBCF-11E4-8567-000000821800 | BC11BTSA  | auth,                | normal | 172.29.6.3 | 
      |                                      |           | blockstorage-driver, |        |            | 
      |                                      |           | compute,             |        |            | 
      |                                      |           | controller,          |        |            | 
      |                                      |           | image                |        |            | 
      +--------------------------------------+-----------+----------------------+--------+------------+

      The value of manageip indicates the management IP address of the controller node.

  5. Run the following command to log in to a host whose roles value is blockstorage-driver:

    su fsp

    ssh fsp@Management IP address

    The following is an example:

    ssh fsp@172.29.6.3

    After you log in to a host as user fsp, run the su - root command to switch to user root.

    • The default password of user fsp is Huawei@CLOUD8.
    • The default password of user root is Huawei@CLOUD8!.

  6. Run the following command to query the storage type:

    python /usr/bin/info-collect-script/audit_resume/get_host_storage_info.py

    Determine the storage type based on the command output.

    NOTE:

    If the storage type displayed in the command output is inconsistent with the snapshot type, go to 5 to log in to another host with blockstorage-driver assigned and perform the subsequent operations.

    • If the following information is displayed:
      storage_type=dsware 
      addition info is : 
                manage_ip=172.28.0.231 
                vbs_url=172.28.6.1,172.28.6.0,172.28.0.2

      The snapshot storage type is dsware. Go to 7. The value of manage_ip indicates the FusionStorage Manager node IP address, and the value of vbs_url indicates the compute node management IP address.

    • If the following information is displayed:
      storage_type=san 
      addition info is : 
                ControllerIP0 is 192.168.172.40 
                ControllerIP1 is 192.168.172.41

      The snapshot storage type is san. Go to 8. The values of ControllerIP0 and ControllerIP1 indicate the SAN storage device management IP addresses.

      NOTE:

      If the values of ControllerIP0 and ControllerIP1 are x.x.x.x or 127.0.0.1 in the command output, the snapshot storage type is v3. Go to 8.

  7. Run the following command to query the snapshot information:

    fsc_cli --ip Compute node management IP address --manage_ip FusionStorage Manager node IP address --port 10519 --op querySnapshot --snap Snapshot name on the storage device

    The following is an example:

    fsc_cli --ip 172.29.6.6 --manage_ip 172.29.0.231 --port 10519 --op querySnapshot --snap snapshot-6f2282f1-22b3-41f1-8b3f-d15aa9790388

    The snapshot exists on the storage device if information similar to the following is displayed:

    result=0  
    snap_name=snapshot-6f2282f1-22b3-41f1-8b3f-d15aa9790388,father_name=,status=0,snap_size=1024,real_size=-1,pool_id=0,create_time=2014-09-18 07:23:21

    Check whether the snapshot exists on the storage device attached to the host whose roles value is blockstorage-driver in the AZ.

    • If yes, contact technical support for assistance.
    • If no, go to 9.

  8. Log in to the OceanStor DeviceManager of the IP SAN device, choose SAN Services > Snapshots (for v3 storage devices, choose Data Protection > Snapshots), search for snap_name, and check whether the snapshot exists on the storage device attached to the host whose roles value is blockstorage-driver in the AZ.

    • If yes, contact technical support for assistance.
    • If no, go to 9.

  9. Obtain the operation report for the volume snapshot. For details, see Obtaining the Operation Report. Confirm that the value of "time" in operation report is between management data backup time and recovery time.

    • If yes, go to the next step.
    • If no, contact technical support for assistance.

  10. Determine whether to delete the snapshot.

    • If yes, go to 12.
    • If no, contact technical support for assistance.

  11. Import environment variables. For details, see Importing Environment Variables.
  12. Enter the secure operation mode (for details, see 3) and run the following command to delete the snapshot:

    cinder snapshot-delete Snapshot ID

    The following is an example:

    cinder snapshot-delete 1cd5c6eb-e729-4773-b846-e9f1d3467c56

    Check whether the snapshot is successfully deleted.

    If the command output contains ERROR, the snapshot failed to be deleted.

    • If yes, no further action is required.
    • If no, contact technical support for assistance.
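
The following is a minimal sketch of 3 and 12, reusing the example snapshot ID 1cd5c6eb-e729-4773-b846-e9f1d3467c56 shown above. Run the commands at the Input command: prompt of the secure operation mode; the # lines are annotations.

  # Confirm that the snapshot record exists in the Cinder database
  cinder snapshot-show 1cd5c6eb-e729-4773-b846-e9f1d3467c56
  # Delete the invalid snapshot record, then verify that the record is gone
  cinder snapshot-delete 1cd5c6eb-e729-4773-b846-e9f1d3467c56
  cinder snapshot-show 1cd5c6eb-e729-4773-b846-e9f1d3467c56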

Inconsistency Between VM HA Flags and Startup Flags

Context

After the database management data is restored using a backup, VM HA flags and startup flags are also restored. In this case, flag inconsistency may occur.

Possible Causes

The database is reverted using a data backup to the state when the backup was created. However, after the backup was created, the HA flags or startup flags for VMs were changed. After the database is restored, the flags in the database are inconsistent with the actual flags.

Impact on the System

  • The VM rescheduling time and type are inconsistent with those set before the restoration.
  • The VM startup mode is inconsistent with that set before the restoration.

Procedure

  1. Obtain the operation report (for details, see Obtaining the Operation Report) and check whether the VM HA flag or startup flag was changed after the management data was backed up and before the system database was restored.

    If the VM HA flag or startup flag was changed after the management data was backed up and before the system database was restored, all of the following conditions must be met:

    1. The value of tenant is the UUID of the tenant who performed the operation.
    2. The value of res_id is the value of the VM UUID.
    3. The value of res_type is servers.
    4. The value of time is after the management data was backed up and before the system database was restored.
    5. The action field matches one of the following cases. For changed metadata, the HTTP request method is POST, the HTTP request URL is /v2/tenant_id/servers/instance_id/metadata, the value of tenant_id is the tenant UUID, the value of instance_id is the VM UUID, and the metadata field contains one or more of _ha_policy_time, _ha_policy_type, and __bootDev. For deleted metadata, the HTTP request method is DELETE, the HTTP request URL is /v2/tenant_id/servers/instance_id/metadata/key, the value of tenant_id is the tenant UUID, the value of instance_id is the VM UUID, and the value of key is _ha_policy_time, _ha_policy_type, or __bootDev.
    • If yes, go to 2.
    • If no, no further action is required.

  2. Use PuTTY to log in to the first host in the FusionSphere OpenStack system through the Reverse-Proxy.

    The default username is fsp, and the default password is Huawei@CLOUD8.

  3. Run the following command and enter the password Huawei@CLOUD8! of user root to switch to user root:

    su - root

  4. Import environment variables. For details, see Importing Environment Variables.
  5. Run the following command to query the management IP addresses of controller nodes:

    cps host-list

    The node whose roles value is controller indicates a controller node. The value of manageip indicates the management IP address.

  6. Run the following commands to log in to the controller node:

    su fsp

    ssh fsp@Management IP address

    su - root

  7. Import environment variables. For details, see 4.
  8. If the metadata was changed, run the following command to set the VM HA flag and startup flag:

    nova meta instance_id set _ha_policy_time=time _ha_policy_type=type __bootDev=dev

    The value of instance_id is the value of instance_id in the action field in 1. Add the values of _ha_policy_time, _ha_policy_type, and __bootDev in the action field as key values to the command line. The values of time, type, and dev are the changed values of _ha_policy_time, _ha_policy_type, and __bootDev, respectively.

    NOTE:

    The preceding three items may not be present concurrently. Add the item value only when the corresponding field is present. For example, if only the value of __bootDev is changed, set only the value of __bootDev. In this case, the command is:

    nova meta instance_id set __bootDev=dev

  9. If the metadata was deleted, run the following command to delete the VM HA flag and startup flag:

    nova meta instance_id delete _ha_policy_time _ha_policy_type __bootDev

    The value of instance_id is the value of instance_id in the action field in 1. If the metadata for the VM is deleted in the operation report, add the deleted metadata to the command line.

    NOTE:

    The three metadata items may not all be present. Delete an item only when the corresponding field appears in the operation record. For example, if only the value of __bootDev was deleted, specify only __bootDev. In this case, the command is:

    nova meta instance_id delete __bootDev

Stuck Volumes

Context

A volume in the available or in-use state is an available volume. A volume in a transient state (creating, downloading, deleting, error_deleting, error_attaching, error_detaching, attaching, detaching, uploading, retyping, reserved, or maintenance) is an unavailable volume. If a volume stays in a transient state for more than 24 hours, restore it based on site conditions.
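To quickly list volumes that are currently in a transient state, you can filter the volume list. The following is a minimal sketch; it assumes that environment variables have been imported and that the command is run in the secure operation mode described in the methods below:

cinder list --all-tenants | grep -E "creating|downloading|deleting|error_deleting|error_attaching|error_detaching|attaching|detaching|uploading|retyping|reserved|maintenance"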

Parameter Description

The name of the audit report is VolumeStatusAudit.csv. Table 18-160 describes parameters in the report.

Table 18-160 Parameters in the audit report

Parameter

Description

volume_id

Specifies the volume ID.

volume_displayname

Specifies the name of the volume created by a user.

volume_name

Specifies the volume name on the storage device only when the volume type is dsware (FusionStorage).

volume_type

Specifies the volume type, such as san, dsware (FusionStorage), and v3.

location

Specifies the volume location.

status

Specifies the volume status.

last_update_time

Specifies the last time when the volume was updated.

NOTE:

san indicates Huawei 5500 T series storage devices, and v3 indicates OceanStor V3 or later series storage devices (including Dorado and 18000).

Possible Causes

  • An exception occurred during a volume service operation, delaying the update of the volume status.
  • The database is rolled back to the state at which the management data backup was created. However, the states of one or more volumes were changed after the backup was created, so the restored database records these volumes in their former states.

Impacts on the System

The stuck volume becomes unavailable but consumes system resources.

Procedure

Handle the volume based on the volume states listed in Table 18-161. For other situations, contact technical support for assistance.

Table 18-161 Stuck volume handling methods

Volume Status | Transient State or Not | Description | Scenario | Handling Method
creating | Y | Creating | Creating a volume | For details, see Method 1.
downloading | Y | Downloading | Creating a volume from an image | For details, see Method 2.
deleting | Y | Deleting | Deleting a volume | Forcibly delete the volume. For details, see Method 3.
error_deleting | N | Deletion failed. | Volume deletion failed | Forcibly delete the volume. For details, see Method 3.
error_attaching | N | Attachment failed. | Volume attachment failed | Set the volume state to available or in-use. For details, see Method 4.
error_detaching | N | Detachment failed. | Volume detachment failed | Set the volume state to available or in-use. For details, see Method 4.
attaching | Y | Attaching | Attaching a volume | Set the volume state to available or in-use. For details, see Method 4. If the volume is a DR placeholder volume, no action is required.
detaching | Y | Detaching | Detaching a volume | Set the volume state to available or in-use. For details, see Method 4.
uploading | Y | Uploading | Creating an image from a volume | For details, see Method 5.
retyping | Y | Migrating | A system exception occurs during storage migration. | For details, see Method 6.
reserved | N | Reserving | After VM live migration, the original volume is reserved. After the user confirms that the service is normal, the original volume is deleted. | For details, see Method 7.
maintenance | Y | Maintaining | A process exception occurs during data copy. | Contact technical support for assistance.

Method 1

  1. Log in to the first controller node in the AZ. For details, see Using SSH to Log In to a Host.
  2. Import environment variables to the node. For details, see Importing Environment Variables.
  3. Perform the following operations on the node to query information about the volume:

    1. Run the following command to enter the secure operation mode:

      runsafe

      Information similar to the following is displayed:

      Input command:
    2. Run the following command to query information about the volume status:

      cinder show Volume ID

      Check whether the value of status in the command output is consistent with the volume state in the audit report.

      • If yes, go to 4.
      • If no, no further action is required.

  4. View the value of last_update_time in the audit report and check whether the time difference between that value and the current time exceeds 24 hours (a quick way to compute the difference is sketched below).

    • If yes, go to 5.
    • If no, contact technical support for assistance.
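    The time difference can be computed as follows. This is a minimal sketch that assumes GNU date is available; the timestamp value is illustrative only and must be replaced with the last_update_time value from the audit report:

    last_update="2014-10-09 17:52:24"
    age_hours=$(( ( $(date +%s) - $(date -d "$last_update" +%s) ) / 3600 ))
    [ "$age_hours" -ge 24 ] && echo "older than 24 hours" || echo "within 24 hours"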

  5. Use the operation log replay tool (for details, see Obtaining the Operation Report) to obtain volume creation information. Volume creation information is similar to the following:

    tenant,res_id,res_type,time,host,action 
    4b14e24ab18e4e58a96405ba0ea94e6d,8ce350f0-ebcd-4fce-8d14-2e797128ac74,volumes,09/Oct/2014:17:52:24,host_id_68B81E2C-08BB-1170-8567-000000821800,POST https://volume.az1.dc1.domainname.com:8776/v2/4b14e24ab18e4e58a96405ba0ea94e6d/volumes {"volume": {"status": "creating"  "description": null  "availability_zone": null  "source_volid": null  "snapshot_id": null  "size": 1  "user_id": null  "name": "heyan"  "imageRef": null  "attach_status": "detached"  "volume_type": null  "shareable": false  "project_id": null  "metadata": {}}} 8ce350f0-ebcd-4fce-8d14-2e797128ac74 202
    • If source_volid is displayed, the volume is created from the source volume whose ID is the value of source_volid. In this case, go to 6.
    • If snapshot_id is displayed, the volume is created from the snapshot whose ID is the value of snapshot_id. In this case, go to 6.
    • If imageRef is displayed, the volume is created from the image whose ref is the value of imageRef. In this case, go to 10.
    • For other scenarios, go to 6.

  6. Query the volume storage type, host, and management information about the storage device. For details, see Querying Volume Attributes.

    • If the storage type is san, go to 7.
    • If the storage type is dsware, go to 8.

  7. Log in to the OceanStor DeviceManager system of the IP SAN device.

    1. Choose Storage Resource > LUN. On the displayed page, search for the volume name on the storage device and verify that the health status is Normal and the running status is Online.
    2. Choose SAN Service > LUN copy. On the displayed page, search for the volume name on the storage device to check the volume copy status.
      • If no copy status data is displayed, go to 9.
      • For other situations, go to 11.
      NOTE:

      You can obtain the volume name on the storage device by following Querying Volume Attributes.

  8. Log in to the host and run the following command to query information about the volume:

    fsc_cli --ip Management IP address of the compute node --manage_ip IP address of the DSWare management node --port 10519 --op queryVolume --volName Volume name on the storage device

    For example, run the following command:

    fsc_cli --ip 172.29.6.6 --manage_ip 172.29.0.231 --port 10519 --op queryVolume --volName volume-6f2282f1-22b3-41f1-8b3f-d15aa9790388

    Information similar to the following is displayed:

    result=0 
    vol_name=volume-6f2282f1-22b3-41f1-8b3f-d15aa9790388,father_name=,status=0,vol_size=1024,real_size=-1,pool_id=0,create_time=2014-09-18 07:23:21

    Check the value of status in the command output.

    • If the value is 0, go to 9.
    • If the value is not 0, go to 11.

  9. Enter the secure operation mode based on 3 and run the following command to set the volume status to available:

    cinder reset-state --state available Volume ID

    For example, run the following command:

    cinder reset-state --state available 5f27b4fd-ef1b-4726-8252-c7c95b714f29

    After the command is executed, the operation is complete.

  10. Run the following command to set the volume status to error based on the qemu-img process:

    ps -ef | grep qemu-img | grep -v grep > /dev/null && cinder reset-state --state error Volume ID

    For example, run the following command:

    ps -ef | grep qemu-img | grep -v grep > /dev/null && cinder reset-state --state error 910f26ad-54a2-4fe3-a65b-8de80bb81174

    After the command is executed, the operation is complete.

  11. Enter the secure operation mode based on 3 and run the following command to delete the volume:

    cinder force-delete Volume ID

    For example, run the following command:

    cinder force-delete 1cd5c6eb-e729-4773-b846-e9f1d3467c56

    If the command output contains ERROR, the volume failed to be deleted. Check whether the volume has been deleted.

    • If yes, no further action is required.
    • If no, contact technical support for assistance.

Method 2

  1. Log in to the first controller node in the AZ. For details, see Using SSH to Log In to a Host.
  2. Import environment variables to the node. For details, see Importing Environment Variables.
  3. Perform the following operations on the node to query information about the volume:

    1. Run the following command to enter the secure operation mode:

      runsafe

      Information similar to the following is displayed:

      Input command:
    2. Run the following command to query information about the volume status:

      cinder show Volume ID

      Check whether the value of status in the command output is consistent with the volume state in the audit report.

      • If yes, go to 4.
      • If no, no further action is required.

  4. View the value of last_update_time in the audit report and check whether the time difference between the value and the current time exceeds 24 hours.

    • If yes, go to 5.
    • If no, contact technical support for assistance.

  5. Check whether the audit is conducted after the database is restored using a backup.

    • If yes, go to 6.
    • If no, go to 8.

  6. Use the operation log replay tool (for details, see Obtaining the Operation Report) to obtain the volume operation logs. Check whether any action type is POST.

    • If yes, go to 7.
    • If no, go to 9.

  7. Enter the secure operation mode based on 3 and run the following command to set the volume status to available:

    cinder reset-state --state available Volume ID

    For example, run the following command:

    cinder reset-state --state available 5f27b4fd-ef1b-4726-8252-c7c95b714f29

    After resetting the volume status, run the following command to query the volume status:

    cinder show Volume ID

    In the command output, check whether the value of status is available.

    • If yes, no further action is required.
    • If no, contact technical support for assistance.

  8. Run the following command to check whether the qemu-img process exists:

    ps -ef | grep qemu-img | grep -v grep

    • If yes, no further action is required.
    • If no, go to 9.

  9. Enter the secure operation mode based on 3 and run the following command to delete the volume:

    cinder force-delete Volume ID

    For example, run the following command:

    cinder force-delete 1cd5c6eb-e729-4773-b846-e9f1d3467c56

    If the command output contains ERROR, the volume failed to be deleted. Check whether the volume has been deleted.

    • If yes, no further action is required.
    • If no, contact technical support for assistance.

Method 3

  1. Log in to the first controller node in the AZ. For details, see Using SSH to Log In to a Host.
  2. Import environment variables to the node. For details, see Importing Environment Variables.
  3. Perform the following operations to query the volume status:

    1. Run the following command to enter the secure operation mode:

      runsafe

      Information similar to the following is displayed:

      Input command:
    2. Run the following command and view the command output:

      cinder show Volume ID | grep status | grep deleting && echo "OK."

      If the command output does not contain "OK", no further action is required.

  4. View the value of last_update_time in the audit report and check whether the time difference between the value and the current time exceeds 24 hours.

    • If yes, run the following command:

      cinder force-delete Volume ID

      For example, run the following command:

      cinder force-delete 1cd5c6eb-e729-4773-b846-e9f1d3467c56

      Check whether the command output contains ERROR.

      If yes, the deletion fails. In this case, contact technical support for assistance.

      If no, no further action is required.

    • If no, contact technical support for assistance.

Method 4

  1. Log in to the first controller node in the AZ. For details, see Using SSH to Log In to a Host.
  2. Import environment variables to the node. For details, see Importing Environment Variables.
  3. Perform the following operations on the node to query information about the volume:

    1. Run the following command to enter the secure operation mode:

      runsafe

      Information similar to the following is displayed:

      Input command:
    2. Run the following command to query the volume status:

      cinder show Volume ID

      Check whether the value of status in the command output is consistent with the volume state in the audit report.

      • If yes, go to 4.
      • If no, no further action is required.

  4. Check whether the volume is in the error_attaching or error_detaching state.

    • If yes, go to 6.
    • If no, go to 5.

  5. View the value of last_update_time in the audit report and check whether the time difference between the value and the current time exceeds 24 hours.

    • If yes, go to 6.
    • If no, contact technical support for assistance.

  6. Enter the secure operation mode based on 3 and run the following command to query the volume status:

    cinder show Volume ID

    In the command output, check whether the value of attachments is left blank.

    • If yes, go to 7.
    • If no, go to 9.

  7. Enter the secure operation mode based on 3 and run the following command to set the volume status to available:

    cinder reset-state --state available Volume ID

    For example, run the following command:

    cinder reset-state --state available 5f27b4fd-ef1b-4726-8252-c7c95b714f29

  8. Log in to the node where the active GaussDB process resides by performing the operations provided in Setting the VM State in Common Operations. Then run the following script to delete the residual volume attachment information from the VM:

    sh /usr/bin/info-collect-script/audit_resume/delete_bdm.sh VM ID Volume ID

    Enter the database password twice as prompted in the script. If the input parameters are correct, the message "... doesn't exist in the block_device_mapping table" can be ignored. Then check whether the command output contains "...block_device_mapping failed".

    • If yes, contact technical support for assistance.
    • If no, go to 10.

  9. Enter the secure operation mode based on 3 and run the following command to set the volume status to in-use:

    cinder reset-state --state in-use Volume ID

    For example, run the following command:

    cinder reset-state --state in-use 5f27b4fd-ef1b-4726-8252-c7c95b714f29

    No further action is required.

  10. Use the operation log replay tool (for details, see Obtaining the Operation Report) to check whether the operation records contain attachment information about the volume.

    tenant,res_id,res_type,time,host,action 
    4b14e24ab18e4e58a96405ba0ea94e6d,8ce350f0-ebcd-4fce-8d14-2e797128ac74,volumes,09/Oct/2014:17:56:29,host_id_4709A23A-9340-1185-8567-000000821800,POST https://volume.az1.dc1.domainname.com:8776/v2/4b14e24ab18e4e58a96405ba0ea94e6d/volumes/8ce350f0-ebcd-4fce-8d14-2e797128ac74/action {"os-attach": {"instance_uuid": "9d5f3cb2-690b-4725-a4e0-cfe96640fb37"  "mountpoint": "/dev/vdb"  "mode": "rw"}} - 202

    instance_uuid specifies the VM ID, and mountpoint specifies the mount point.

    • If yes, go to 11.
    • If no, no further action is required.

  11. Enter the secure operation mode based on 3 and run the following command to attach the volume:

    nova volume-attach VM ID Volume ID Mount point

    For example, run the following command:

    nova volume-attach 9d5f3cb2-690b-4725-a4e0-cfe96640fb37 8ce350f0-ebcd-4fce-8d14-2e797128ac74

    Check whether the command output contains ERROR.

    • If the command output does not contain ERROR, the volume is successfully attached. In this case, go to 12.
    • If "ERROR (CommandError): No server with a name or ID of 'XXXXX' exists." is displayed, the VM does not exist, and no further action is required. For other situations, contact technical support for assistance.

  12. Enter the secure operation mode based on 3 and run the following command to query the volume status:

    cinder show Volume ID

    In the command output, check whether the value of status is in-use.

    • If yes, no further action is required.
    • If no, go to 13.

  13. Perform a manual audit without specifying the audit item and check whether any volume-related audit result exists. For details, see Manual Audit.

Method 5

  1. Log in to the first controller node in the AZ. For details, see Using SSH to Log In to a Host.
  2. Import environment variables to the node. For details, see Importing Environment Variables.
  3. Perform the following operations on the node to query information about the volume:

    1. Run the following command to enter the secure operation mode:

      runsafe

      Information similar to the following is displayed:

      Input command:
    2. Run the following command to check whether the volume is still in the uploading state:

      cinder show Volume ID | grep status | grep uploading && echo "OK."

      If the command output does not contain "OK", the operation is complete.

  4. View the value of last_update_time in the audit report and check whether the time difference between the value and the current time exceeds 24 hours.

    • If yes, go to 5.
    • If no, contact technical support for assistance.

  5. Run the following command to set the volume status (it sets the status to in-use if the volume has attachment information, or to available otherwise):

    status=available; cinder show Volume ID | grep attachments | grep attachment_id > /dev/null && status="in-use"; cinder reset-state --state $status Volume ID

    For example, run the following command:

    status=available; cinder show 813497d6-5aa8-4c4f-9158-2ab598c62bb7 | grep attachments | grep attachment_id > /dev/null && status="in-use"; cinder reset-state --state $status 813497d6-5aa8-4c4f-9158-2ab598c62bb7

Method 6

  1. Log in to any controller node in the AZ. For details, see Using SSH to Log In to a Host.
  2. Run the following command to query information about the volume status:

    cinder show Volume UUID

    NOTE:

    Volume UUID is the volume_id value in the audit report.

    Check whether the value of status in the command output is consistent with the volume state in the audit report.

    • If yes, go to 3.
    • If no, contact technical support for assistance.

  3. View the value of last_update_time in the audit report and check whether the time difference between the value and the current time exceeds 24 hours.

    • If yes, go to 4.
    • If no, contact technical support for assistance.

  4. Confirm with the tenant whether the volume status is to be changed.

    • If yes, go to 5.
    • If no, no further action is required.

  5. Run the following command to query the volume status:

    cinder show Volume UUID

    NOTE:

    Volume UUID is the volume_id value in the audit report.

    In the command output, check whether the value of attachments is left blank.

    • If yes, go to 6.
    • If no, go to 8.

  6. Run the following command to set the volume status to available:

    cinder reset-state --state available --reset-migration-status --attach-status detached Volume UUID

    NOTE:

    Volume UUID is the volume_id value in the audit report.

  7. Run the following command to query the volume status:

    cinder show Volume UUID

    NOTE:

    Volume UUID is the volume_id value in the audit report.

    In the command output, check whether the value of status is available.

    • If yes, go to 10.
    • If no, contact technical support for assistance.

  8. Run the following command to set the volume status to in-use:

    cinder reset-state --state in-use --reset-migration-status --attach-status attached Volume UUID

    NOTE:

    Volume UUID is the volume_id value in the audit report.

  9. Run the following command to query the volume status:

    cinder show Volume UUID

    NOTE:

    Volume UUID indicates the UUID of the volume whose status needs to be reset.

    In the command output, check whether the value of status is in-use.

    • If yes, go to 10.
    • If no, contact technical support for assistance.

  10. Contact the user and ask the user to run the retype command again to change the disk type of the volume.

Method 7

  1. Log in to any controller node in the AZ. For details, see Using SSH to Log In to a Host.
  2. Run the following command to query information about the volume status:

    cinder show Volume UUID

    NOTE:

    Volume UUID is the volume_id value in the audit report.

    Check whether the value of status in the command output is consistent with the volume state in the audit report.

    • If yes, go to 3.
    • If no, contact technical support for assistance.

  3. View the value of last_update_time in the audit report and check whether the time difference between the value and the current time exceeds 24 hours.

    • If yes, go to 4.
    • If no, contact technical support for assistance.

  4. The reserved volume is a copy of an original volume. Obtain the original volume and check whether services on the VM where the original volume resides are normal.

    Run the following command to obtain the ID of the original volume:

    cinder show uuid | grep description

    Check whether the command output contains "migration src for Original volume ID".

    • If yes, confirm with the user whether services on the VM where the original volume resides are normal. If VM services are normal, submit the migration task on the Service OM migration task page.
    • If no, contact technical support for assistance.

Inconsistent Volume Attachment Information

Context

Volume attachment information includes the following:

  • Attachment status of volumes recorded in Cinder management data
  • Attachment status of volumes recorded in Nova management data
  • Attachment between hosts and volumes recorded on storage devices
  • Attachment between VMs and volumes recorded in hypervisors

The system audits the consistency between the preceding volume attachment information.

NOTE:

Volumes that have already been handled as described in Invalid Volumes do not need to be handled in this section.

Parameter Description

The name of the audit report is VolumeAttachmentAudit.csv. Table 18-162 describes parameters in the report.

Table 18-162 Parameter description

Parameter

Description

volume_id

Specifies the volume ID (uuid).

volume_displayname

Specifies the name of the volume created by a tenant.

volume_type

Specifies the volume type, including san, dsware(FusionStorage), and v3.

NOTE:

In this section, san indicates the Huawei 5500 T series storage devices, and v3 indicates V3 series storage devices (including Dorado and 18000).

location

Specifies the detailed volume attachment information recorded in the Cinder service, Nova service, hypervisors, or storage devices.

Values:

  • ATTACH_TO: Volume attachment information recorded in the Cinder management data. The following is an example:

    'ATTACH_TO': [{'instance_id': u'e32d3e98-2d61-4652-b805-afccb7fbc592'}]

    The value of instance_id indicates the VM UUID.

  • BELONG_TO: Information about the host to which the volume belongs.
  • HYPER_USE: Information recorded in the hypervisor about the VM to which the volume is attached. The following is an example:

    'HYPER_USE': [{'instance_name': u'instance-00000003', 'location': u'4709A23A-9340-1185-8567-000000821800'}]

    The value of instance_name indicates the VM name, and the value of location indicates the host accommodating the VM.

  • MAP_TO: Information, recorded on the storage device, about the host to which the volume is mapped. The following is an example:

    'MAP_TO': [{'location': u'68B81E2C-08BB-1170-8567-000000821800'}]

  • NOVA_USE: Information, recorded in the Nova management data, about the VM to which the volume is attached. The following is an example:

    'NOVA_USE': [{'instance_name': u'instance-00000004', 'instance_id': u'e32d3e98-2d61-4652-b805-afccb7fbc592'}]

attach_status

Specifies the volume attachment status.

Values:

  • management_status: Comparison result between attachment information in the Cinder service and the Nova service. match indicates that the information is consistent, and not_match indicates that information is inconsistent.
  • cinder_status: Comparison result between attachment information in the Cinder service and the storage device. match indicates that the information is consistent, and not_match indicates that information is inconsistent.
  • hyper_status: Comparison result between attachment information in the hypervisor and the storage device. match indicates that the information is consistent, and not_match indicates that information is inconsistent.

Impact on the System

  • Residual volume attachment information may reside on hosts.
  • Volume-related services may be affected. For example, if a volume has inconsistent attachment information recorded, FusionStorage may fail to create snapshots for the volume.

Possible Causes

  • The database is rolled back to the state at which the backup was created. However, one or more volumes were attached to VMs after the backup was created. After the database is restored, the attachment records are missing from the database but still reside on the storage devices.
  • A service operation failed and was rolled back, but the rollback of the volume-related information failed.

Procedure

  1. Restore the volume attachment information based on the volume statuses listed in Table 18-163. For other situations, contact technical support for assistance.

    Table 18-163 Volume attachment information restoration methods

    management_status | cinder_status | hyper_status | Possible Scenario | Restoration Method
    not_match | not_match | not_match | N/A | See Method 9.
    not_match | not_match | match | The volume is not recorded as attached in the Cinder service, but is recorded in the Nova service and on the VM. | See Method 6.
    not_match | match | not_match | The volume is recorded as attached in the Cinder service, but is not recorded in the Nova service or on the VM. | For details, see Method 1.
    not_match | match | match | 1. The volume is recorded as attached in the Cinder service and on the VM, but is not recorded in the Nova service. 2. The volume is not recorded as attached in the Cinder service or on the VM, but is recorded in the Nova service. 3. The volume is recorded as attached in the Cinder service, but the VM to which it was attached (as recorded in the Nova service and the hypervisor) has been deleted. | 1. For details, see Method 2. 2. For details, see Method 5. 3. For details, see Method 8.
    match | not_match | not_match | 1. The volume is attached to an orphan VM. 2. The volume is recorded as attached in the Cinder service, the Nova service, and on the VM, but is not recorded on the host. 3. Residual mapping between the disk array and hosts exists, and the number of hosts in the MAP_TO field is greater than that in HYPER_USE. | 1. Handle the orphan VM. For details, see Orphan VMs. 2. For details, see Method 3. 3. For details, see Method 7.
    match | not_match | match | The volume is recorded as attached on the host, but is not recorded in the Cinder service. | For details, see Method 3.
    match | match | not_match | The volume is recorded as attached in the Cinder service and the Nova service, but is not recorded on the VM. | For details, see Method 4.

Method 1

  1. Log in to the first host in an AZ.

    For details, see Using SSH to Log In to a Host.

  2. Import environment variables to the host.

    For details, see Importing Environment Variables.

  3. Perform the following operations to query the volume attributes:

    1. Run the following command to enter the secure operation mode:

      runsafe

      The following information is displayed:

      Input command:
    2. Run the following command to query the volume attachment information:

      cinder show uuid

      In the command output, the value of attachments may contain multiple records. Locate the record in which the value of server_id is the faulty VM ID and obtain the information about attachments_id, which is the attachment information required for detaching volumes.

  4. Run the following command to clear the attachment information about the volume:

    token=`openstack token issue | awk '/id/{print $4}' | awk '/id/{print $1}'`

    TENANT_ID=`openstack project list | grep $OS_PROJECT_NAME | awk '{print $2}'`

    sh /usr/bin/info-collect-script/audit_resume/cinder_cmd.sh --operate detach --vol_id uuid --attachment_id attachment_id --os-token "$token" --tenant_id $TENANT_ID

    Check whether the command output contains SUCCESS.

    • If yes, no further action is required.
    • If no, contact technical support for assistance.

Method 2

  1. Log in to the first host in an AZ.

    For details, see Using SSH to Log In to a Host.

  2. Import environment variables to the host.

    For details, see Importing Environment Variables.

  3. Perform the following operations to query the volume attributes:

    1. Run the following command to enter the secure operation mode:

      runsafe

      The following information is displayed:

      Input command:
    2. Run the following command to query the attachment information about the volume:

      cinder show uuid

      In the attachment information:

      Locate the record with server_id being the ID of the faulty VM to obtain the attachment information (attachments_id) needed for volume detaching.

  4. Run the following command to clear the attachment information about the volume:

    token=`openstack token issue | awk '/id/{print $4}' | awk '/id/{print $1}'`

    TENANT_ID=`openstack project list | grep $OS_PROJECT_NAME | awk '{print $2}'`

    sh /usr/bin/info-collect-script/audit_resume/cinder_cmd.sh --operate detach --vol_id uuid --attachment_id attachment_id --os-token "$token" --tenant_id $TENANT_ID

    Check whether the command output contains SUCCESS.

    • If yes, go to 5.
    • If no, contact technical support for assistance.

  5. Run the runsafe command to enter the secure operation mode, enter the user password as prompted, and run the following command to attach the volume to a VM:

    nova volume-attach vm-uuid uuid

    Check whether the volume is successfully attached:

    • If yes, no further action is required.
    • If no, contact technical support for assistance.

Method 3

  1. Obtain the volume attributes. For details, see Querying Volume Attributes.
  2. Check whether the host records attachment information of the volume based on MAP_TO in the audit results.

    • If yes, go to 3.
    • If no, go to 4.

  3. Detach the volume based on its status.

    Run the following command to check whether the volume is in the in-use state:

    cinder show uuid

    • If yes, contact technical support for assistance.
    • If no, perform the following operation.
      • dsware: run the following command to detach the volume.

        vbs_cli -c detachwithip -v Volume name on the storage device -i dsware_manage_ip -p 0

      • san or v3: perform the following steps to detach the volume from the host.
        1. Run the following command to perform the hash operation on the location field in HYPER_USE:

        python -c "print hash('2046B2B1-4D27-E711-8084-A08CF81DF81E')"

        Information similar to the following is displayed:

        8697324651793178011

        Run the cinder show volume_id|grep lun command to obtain the value of lun_id.

        Information similar to the following is displayed:

        |os-volume-replication:driver_data|{"ip": "10.31.4.54", "ESN": "fa5a871550301236", "vol_name":"volume-705dfb73-aeb4-4d95-b2de-f662112527f5", "pool": 0, "lun_id": "150"} 
        2. Log in to OceanStor DeviceManager, choose Provisioning > Host, and select the host identified in 1. Then, click Properties and choose Owning Host Group to find the name of the host group to which the host belongs.

        3. Choose Provisioning > Host > Host Group, select the host group obtained in 2, and click Properties to view the mapping view of the host group.

        4. Choose Provisioning > Mapping View, select the mapping view obtained in 3, and view the corresponding LUN group.

        5. Choose Provisioning > LUN > LUN Group, select the LUN group obtained in 4, view its LUNs, and search for the LUN ID obtained in 1 (you may need to use the setting option in the upper right corner to choose the information items to be displayed so that the LUN ID column appears in the list). Then, select the target LUN and click Remove to delete the mapping between the LUN and the host.

  4. Attach the volume based on its status.

    • dsware: run the following command to attach the volume.

      vbs_cli -c attachwithip -v Volume name on the storage device -i dsware_manage_ip -p 0

    • san or v3: perform the following steps to attach the volume to the host.
      1. Run the following command to perform the hash operation on the location field in HYPER_USE:

      python -c "print hash('2046B2B1-4D27-E711-8084-A08CF81DF81E')"

      Information similar to the following is displayed:

      8697324651793178011

      Run the cinder show volume_id|grep lun command to obtain the value of lun_id.

      Information similar to the following is displayed:

      |os-volume-replication:driver_data|{"ip": "10.31.4.54", "ESN": "fa5a871550301236", "vol_name":"volume-705dfb73-aeb4-4d95-b2de-f662112527f5", "pool": 0, "lun_id": "150"} 
      2. Log in to OceanStor DeviceManager, choose Provisioning > Host, and select the host identified in 1. Then, click Properties and choose Owning Host Group to find the name of the host group to which the host belongs.

      3. Choose Provisioning > Host > Host Group, select the host group obtained in 2, and click Properties to view the mapping view of the host group.

      4. Choose Provisioning > Mapping View, select the mapping view obtained in 3, and view the corresponding LUN group.

      5. Choose Provisioning > LUN > LUN Group, select the target LUN group obtained in 4, click Add Object, and add the LUN to the LUN group based on the LUN ID obtained in 1 to complete the mapping between the LUN and the host.

      6. Log in to the FusionSphere OpenStack host and run the hot_add command to scan for disks to ensure that the host can use the volume.

Method 4

  1. Log in to the first host in an AZ.

    For details, see Using SSH to Log In to a Host.

  2. Import environment variables to the host.

    For details, see Importing Environment Variables.

  3. Contact technical support engineers to check whether the volume can be detached from the VM.

    If there is no risk, perform the following operations to detach the volume from the VM:
    1. Run the following command to enter the secure operation mode:

      runsafe

      The following information is displayed:

      Input command:
    2. Run the following command to detach the volume from the VM:

      nova volume-detach vm-uuid uuid

      Check whether the volume is successfully detached.

      • If yes, no further action is required.
      • If no, contact technical support for assistance.

Method 5

  1. Log in to the first host in an AZ.

    For details, see Using SSH to Log In to a Host.

  2. Import environment variables to the host.

    For details, see Importing Environment Variables.

  3. Run the following command to query the VM status:

    nova show VM ID | grep OS-SRV-USG:launched_at

    Check whether time information is displayed in the command output.

    • If no, the VM is an invalid VM. In this case, ensure that the invalid VM has been properly deleted. For details, see Invalid VMs.
    • If yes, go to 4.

  4. Log in to the host housing the active GaussDB node based on section "Setting the VM State." Then run the following script to clear the residual attachment information of the volume from the VM:

    sh /usr/bin/info-collect-script/audit_resume/delete_bdm.sh VM ID Volume ID

    Enter the database password twice as prompted in the script and check whether success is included in the command output.

    • If yes, no further action is required.
    • If no, contact technical support for assistance.

Method 6

  1. Log in to any host in the AZ. For details, see Using SSH to Log In to a Host.
  2. Import environment variables to the host.

    For details, see Importing Environment Variables.

  3. Run the following command to query the host accommodating the VM:

    nova show VM ID | grep OS-EXT-SRV-ATTR:host

    • If no host information is displayed, contact technical support for assistance.
    • If the host information is displayed, go to 4.

  4. Run the following command to query the VM instance name:

    nova show VM ID | grep OS-EXT-SRV-ATTR:instance_name

    • If no instance name is displayed, contact technical support for assistance.
    • If the instance name is displayed, go to 5.

  5. Run the following command to query the device information of the VM volumes:

    nova volume-attachments VM ID

    • If no device information is displayed, contact technical support for assistance.
    • If the device information is displayed, go to 6.

  6. Run the following command to check whether the attachment information of the volumes attached to the VM contains the VM information. Based on the output of the previous step, record any volume that is not attached to the VM together with its device information.

    cinder list --all-t

    • If all the attached volumes have the VM information, no further action is required.
    • If a residual volume exists, go to 7.

  7. Log in to the host accommodating the VM and run the following command to clear the residual information at the virtualization layer:

    nova_virsh_cmd virsh-detach-disk VM instance name

    Check whether the command output contains success.
    • If yes, go to 8.
    • If no, contact technical support for assistance.

  8. Run the following command to clear the residual volume attachment information of the VM:

    sh /usr/bin/info-collect-script/audit_resume/delete_bdm.sh VM ID volume ID

    Enter the database password twice as prompted in the script and check whether success is included in the command output.

    • If yes, go to 9.
    • If no, contact technical support for assistance.

  9. Run the following command to attach the volume to the VM:

    nova volume-attach vm-uuid volume-uuid

    Check whether the volume is successfully attached:

    • If yes, no further action is required.
    • If no, contact technical support for assistance.

Method 7

NOTE:

If inconsistencies are found during a system audit, wait for a processing period (10 minutes) after the alarms generated for the Nova and Cinder components are cleared and the disk array has recovered, and then perform a manual audit or wait for the next round of system audit to complete. If the audit alarms are still not cleared, perform the steps in this section.

  1. Log in to any host in the AZ. For details, see Using SSH to Log In to a Host.
  2. Import environment variables to the host.

    For details, see Importing Environment Variables.

  3. Based on the hosts displayed in HYPER_USE, identify the redundant host in MAP_TO and thereby identify the correct host. Example:

    volume_id,volume_displayname,volume_type,location,attach_status
    6120ca56-12e0-4835-a0fb-1fab336bfd8f,bl_sys,v3,"{'ATTACH_TO': [{'instance_id': u'910f2975-f2f9-4376-8809-ce1d56527dba'}], 'BELONG_TO': u'cinder@IPSAN_V3', 'HYPER_USE': [{'instance_name': u'instance-00000002', 'location': u'8B13124C-7C15-11CF-8567-000000821800'}], 'MAP_TO': [{'location': u'7913694372495436396'}, {'location': u'8031912861936646597'}], 'NOVA_USE': [{'instance_name': u'instance-00000002', 'instance_id': u'910f2975-f2f9-4376-8809-ce1d56527dba'}]}","{'management_status': 'match', 'cinder_status': 'not_match', 'hyper_status': 'not_match'}"

    Run the following command to obtain the hash of the location field in HYPER_USE.

    python -c "print hash('8B13124C-7C15-11CF-8567-000000821800')",

    The result is as follows:

    7913694372495436396

    Compare the hash result with the locations in MAP_TO: the correct host is 7913694372495436396, and the redundant one is 8031912861936646597.
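    The comparison can also be done in a single command. The following is a minimal sketch that reuses the values from this example (Python 2 syntax, as used elsewhere in this section); substitute the HYPER_USE and MAP_TO values from your own audit report:

    python -c "hyper_use='8B13124C-7C15-11CF-8567-000000821800'; map_to=['7913694372495436396','8031912861936646597']; h=str(hash(hyper_use)); print 'correct MAP_TO entry:', h; print 'redundant MAP_TO entries:', [m for m in map_to if m != h]"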

  4. Run the following command to check whether the host matches the location recorded in the audit report:

    nova show uuid

    uuid indicates instance_id in the audit report. For example,

    nova show 910f2975-f2f9-4376-8809-ce1d56527dba

    The result is as follows:

    +---------------------------------+---------------------------------------------------+
    |  Property                       |   Value                                           |
    +---------------------------------+---------------------------------------------------+
    |  OS-DCF:diskConfig              |  MANUAL                                           |
    |  OS-EXT-Az:availability_zone    |  az1.dc1                                          |
    |  OS-EXT-SRV-ATTR:host           |  8B13124C-7C15-11CF-8567-000000821800           |
    |  OS-EXT-SRV-ATTR:hostname       |  bl-vm     

    Check whether the Value of OS-EXT-SRV-ATTR:host is the same as the location in HYPER_USE in 3.

    • If yes, go to 5.
    • If no, contact technical support for assistance.

  5. Log in to the disk array management page and locate the LUN ID of the volume.

    Run the following command on controller nodes to convert the volume ID to LUN ID based on the volume type:

    1. volume_type: v3
      python -c "from cinder.volume.drivers.huawei.huawei_driver import huawei_utils;print huawei_utils.encode_name('volume_id')"
    2. volume_type: san
      python -c "print hash('volume-volume_id')"

  6. Check whether the V3 disk array is used.

    • If yes, go to 7.
    • If no, the host in 3 is the correct host recorded in MAP_TO. Log in to the disk array and remove the mapping between the extra host recorded in MAP_TO and the LUN. No further action is required.

  7. Log in to the disk array and query the residual host ID.

    Log in to the v3 disk array and choose Provisioning > Host. Locate the name of the host group to which the extra host belongs.

  8. Identify the mapping between the host group and LUN group.

    For example, if the host group obtained in 7 is OpenStack_HostGroup_7, choose Provisioning > Host > Host Group and check its mapping view; the mapping view that corresponds to the host group is OpenStack_Mapping_View_7. Then choose Provisioning > Mapping View and check the mapping; the LUN group that corresponds to OpenStack_Mapping_View_7 is OpenStack_LunGroup_7.

  9. Remove the mapping between the extra host and LUN.

    Choose Provisioning > LUN > LUN Group and locate the LUN group obtained in 8. Select the LUN group and click Remove Object. In the search box, enter the volume name obtained in 5 and click Search. Select the obtained LUN and remove it.

Method 8

  1. Log in to the host accommodating the first controller node in the AZ. For details, see Using SSH to Log In to a Host.
  2. Import environment variables to the host.

    For details, see Importing Environment Variables.

  3. Run the following command to query volume attachment information.

    cinder show volume_id

    • If the volume status is in-use and shareable is False, perform 5.
    • If the volume status is in-use and shareable is True, perform 4.

  4. In the output obtained in 3, attachments may contain multiple server_id values.

    Check whether there is only one server_id (a quick way to count the entries is sketched below).

    • If yes, go to 5.
    • If no, go to 7.
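    A minimal counting sketch (volume_id is the ID of the volume being handled):

    cinder show volume_id | grep attachments | grep -o "server_id" | wc -l

    If the result is greater than 1, the volume has multiple attachments.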

  5. Run the following command to clear volume attachment information.

    cinder reset-state --attach-status detached volume_id

  6. Run the following command to query volume attachment information.

    cinder show volume_id

    In the output, check whether the volume status is available.

    • If yes, no further action is required.
    • If no, contact technical support for assistance.

  7. Log in to the node that hosts the database. For details, see Logging In to a Host Running the Database Service in HUAWEI CLOUD Stack 6.5.0 Troubleshooting Guide.
  8. Perform the following steps to log in to the Cinder database.

    1. Run the following command to switch to the database account.

      su gaussdba

    2. Run the following command to log in to the Cinder database.

      gsql cinder

      Default password: FusionSphere123.

  9. Run the following command to clear volume attachment information.

    UPDATE VOLUME_ATTACHMENT SET ATTACH_STATUS='detached' WHERE INSTANCE_UUID='server_id' AND VOLUME_ID='volume_id' AND ATTACH_STATUS='attached';

    In the command, replace server_id with the ID of the deleted VM and volume_id with the ID of the volume. (A verification query is sketched below.)
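    To verify the change in the same gsql session, you can run a query such as the following minimal sketch (identifiers as used in the UPDATE statement above):

    SELECT INSTANCE_UUID, ATTACH_STATUS FROM VOLUME_ATTACHMENT WHERE VOLUME_ID='volume_id';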

  10. Go back to the controller node, and run the following command to query volume attachment information.

    cinder show volume_id

    In the output, check whether attachments contain server_id of the deleted VM.

    • If yes, contact technical support for assistance.
    • If no, no further action is required.

Method 9

  1. Log in to the host accommodating the first controller node in the AZ. For details, see Using SSH to Log In to a Host.
  2. Import environment variables to the host.

    For details, see Importing Environment Variables.

  3. Run the following command to query the volume details:

    cinder show uuid

    Information similar to the following is displayed:

    GNC-RN01-SRN01-HOST01:~ # cinder show 608a83de-7749-4014-a6e9-a683df2bd849
    | attachments       | [{'server_id': 'f2891596-36d3-48b3-8b9e-635bbf33b796', 'attachment_id': '54820174-57cb-4291-8968-9f0c68d70ad6', 'attached_at': '2018-12-13T10:03:28.569558', 'host_name': None, 'volume_id': '608a83de-7749-4014-a6e9-a683df2bd849', 'device': '/dev/vda', 'id': '608a83de-7749-4014-a6e9-a683df2bd849'}] |
    | availability_zone | az1.gncdc1 |
    | bootable          | true |

    Check the value of bootable.

    • If the value is true, the volume is a system disk of the VM.
    • If the value is false, the volume is a data disk of the VM.

  4. Based on server_id, obtained in 3, of the VM to which the volume is attached, run the following command to check whether the VM has been deleted:

    nova instance-action-list server_id

    Information similar to the following is displayed:

    GNC-RN01-SRN01-HOST01:~ # nova instance-action-list f2891596-36d3-48b3-8b9e-635bbf33b796
    +------------+------------------------------------------+---------+----------------------------+
    | Action     | Request_ID                               | Message | Start_Time                 |
    +------------+------------------------------------------+---------+----------------------------+
    | create     | req-6aed42ea-69d7-484b-9060-7b3db2c74ee4 | -       | 2018-12-13T10:03:20.221384 |
    | reboot     | req-613bca14-59c0-4dc6-92cb-7c17d89418fc | -       | 2019-01-17T06:51:32.423814 |
    | reboot     | req-e50dde15-2d8a-493b-a3dc-e782188d1101 | -       | 2019-01-18T04:14:31.036272 |
    | reschedule | req-62d463a0-b21d-4d16-9998-3a036885e6e9 | -       | 2019-01-18T04:18:38.157223 |
    | reschedule | req-a82ec634-f484-4f6a-8dd7-b4d6858e79ec | -       | 2019-01-18T04:48:48.389963 |
    | reschedule | req-b221f108-c2ca-4aa3-90cb-fc2aef677d13 | -       | 2019-01-18T05:18:47.891362 |
    | reschedule | req-788d648b-f3b8-4460-a9d7-442fc5a9b5cc | -       | 2019-01-18T05:43:27.167151 |
    | reschedule | req-75230bb1-f00f-46bd-b8da-e0d2f53b7c80 | -       | 2019-01-18T06:13:18.024269 |
    | reschedule | req-751b95b2-88e6-4882-b8be-0bc57a670452 | -       | 2019-01-18T06:38:40.423816 |
    | reschedule | req-a0c99c91-d01b-4112-8f09-90b686a5e30c | -       | 2019-01-18T07:03:31.775610 |
    | stop       | req-2bbbf4df-460a-49b8-97d5-3c808f23cc88 | -       | 2019-01-18T07:31:09.810141 |
    | reschedule | req-cffe4c6d-91ab-4939-8a74-4d798ab443d4 | -       | 2019-01-18T07:34:32.334318 |
    | reschedule | req-278d4de1-af3a-4ed0-bf12-916ed82a0a3f | -       | 2019-01-18T07:54:47.257292 |
    | reschedule | req-e103d7dc-59cd-4563-9520-39effd410015 | -       | 2019-01-18T08:19:30.707745 |
    | reschedule | req-e270f850-7d1a-4983-9c42-0c3ab7dcc818 | -       | 2019-01-18T08:49:21.560663 |
    | reschedule | req-da522b9e-a1ba-40f6-a6e2-ea7262c6a22c | -       | 2019-01-18T09:09:20.646063 |
    | delete     | req-66c48553-3338-478e-8965-7d878c08ac03 | -       | 2019-01-18T09:34:49.506238 |
    +------------+------------------------------------------+---------+----------------------------+
    • If the last operation record of the VM is the deletion operation and the residual volume is the system disk of the VM, the volume can be deleted.
    • If the residual volume is the data disk of the VM, confirm with the user whether the volume can be deleted.

  5. Run the following commands to reset the volume status to available and delete the volume:

    cinder reset-state --state available --attach-status detached volume_id

    cinder delete volume_id

    Information similar to the following is displayed:

    GNC-RN01-SRN01-HOST01:~ # cinder reset-state --state available --attach-status detached 608a83de-7749-4014-a6e9-a683df2bd849
    GNC-RN01-SRN01-HOST01:~ # cinder list --all-t | grep 608
    | 608a83de-7749-4014-a6e9-a683df2bd849 | f18d48fa0063461e8cac98a7231308b6 | available | vnfm_volume_4c3b8544-6c23-41c0-851b-de1b2f89cc1a | 5 | VolumeService01 | true | False | |
    GNC-RN01-SRN01-HOST01:~ # cinder delete 608a83de-7749-4014-a6e9-a683df2bd849
    Request to delete volume 608a83de-7749-4014-a6e9-a683df2bd849 has been accepted

    Check whether the command output contains "Request to delete volume volume_id has been accepted".

    • If yes, the volume has been deleted. No further action is required.
    • If no, contact technical support for assistance.

Nova novncproxy Zombie Process

Context

The Nova novncproxy service may generate zombie processes due to defects in the websockify module or the Python version. However, the probability of this issue is very low. To improve system stability, the system also audits and automatically clears these zombie processes.

Parameter Description

The audit configuration item is max_zombie_process_num, which is stored in the /etc/info-collect.conf file on the novncproxy-deployed node. The configuration item specifies the threshold for automatically clearing zombie processes. The default value is 10.

  • The system automatically clears these zombie processes only when the number of zombie processes on a compute node exceeds the threshold.
  • If the threshold is set to -1, the system does not clear zombie processes.
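A quick manual check of the configured threshold and the current zombie process count on a novncproxy node can be done as follows. This is a minimal sketch; it counts all zombie processes on the node, which may differ slightly from the count used by the audit:

grep max_zombie_process_num /etc/info-collect.conf
ps -eo stat | grep -c '^Z'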

The name of the audit report is zombie_process_hosts.csv. Table 18-164 describes parameters in the report.

Table 18-164 Parameter description

Parameter

Description

host

Specifies the compute node name.

zombieprocess

Specifies the number of zombie processes detected on the node.

is restart

Specifies whether any automatic zombie process deletion is conducted. The default value is True.

Impact on the System

  • Excessive zombie processes may deteriorate the system performance.
  • After a zombie process is deleted, the nova-novncproxy service restarts, which interrupts in-use novnc services.

Possible Causes

  • The websockify module used by the nova-novncproxy service is defective.
  • Python 2.6 is defective.

Procedure

No operation is required. The system automatically clears excessive zombie processes based on the specified threshold.

NOTE:

Before the system automatically clears a zombie process, the zombie process is reparented to process 1. Therefore, the clearing does not take effect immediately.

Detecting and Deleting Residual Cold Migration Data

Context

FusionSphere OpenStack stores VM cold migration information in the database and will automatically delete it after the migration confirmation or rollback. However, if an exception occurs, residual information is not deleted from the database.

Parameters

The name of the audit report is cold_cleaned.csv. Table 18-165 describes parameters in the report.

Table 18-165 Parameters in the audit report

Parameter

Description

instance_uuid

Specifies the universally unique identifier (UUID) of the VM that is cold migrated.

Impact on the System

  • This issue incurs a high quota usage.
  • This issue adversely affects the execution and resource usage of subsequent VM cold migrations.

Possible Causes

  • The nova-compute service is restarted during the migration.
  • The VM status is reset after the migration.

Procedure

No operations are required.

Detecting and Deleting Residual Live Migration Data

Context

FusionSphere OpenStack stores VM live migration information in the database and will automatically delete it after the migration confirmation or rollback. However, if an exception occurs, residual information is not deleted from the database.

Parameters

The name of the audit report is live_cleaned.csv. Table 18-166 describes parameters in the report.

Table 18-166 Parameters in the audit report

Parameter

Description

instance_uuid

Specifies the universally unique identifier (UUID) of the VM that is live migrated.

Impact on the System

  • This issue adversely affects the resource usage of subsequent VM live migrations.

Possible Causes

  • The nova-compute service is restarted during the migration.

Procedure

No operations are required.

Intermediate State of the Cold Migration

Context

FusionSphere OpenStack stores VM cold migration information in the database. If the source node is restarted during the migration confirmation, the cold migration may be stuck in the intermediate state.

Parameters

The name of the audit report is cold_stuck.csv. Table 18-167 describes parameters in the report.

Table 18-167 Parameters in the audit report

Parameter

Description

instance_uuid

Specifies the universally unique identifier (UUID) of the VM that is cold migrated.

migration_id

Specifies the ID of the cold migration record.

migration_updated

Specifies the time when the migration is confirmed.

instance_updated

Specifies the time when the VM information is updated.

Impact on the System

Maintenance operations cannot be performed on the VM.

Possible Causes

  • The nova-compute service on the source node is restarted during the cold migration.
  • Network exceptions cause packet loss.

Procedure

  1. Use PuTTY to log in to the first host in the FusionSphere OpenStack system through the reverse proxy IP address.

    The default username is fsp, and the default password is Huawei@CLOUD8.

  2. Run the following command and enter the password Huawei@CLOUD8! of user root to switch to user root:

    su - root

  3. Import environment variables. For details, see Importing Environment Variables.
  4. Perform the following operations to query the management IP addresses of controller nodes:

    cps host-list

    The node whose roles value contains controller is a controller node, and its manageip value is the management IP address.

  5. Run the following commands to log in to a controller node:

    su fsp

    ssh fsp@Management IP address

    su - root

  6. Import environment variables. For details, see 3.
  7. Perform the following operations to set the VM status:

    1. Run the following command to enter the secure operation mode:

      runsafe

      The following information is displayed:

      Input command:
    2. Check the migration_status value in the audit report. If the value is reverting, perform 7.3 to 7.5. Otherwise, perform 7.3 and then go to 7.5.
    3. Run the following command to clear the intermediate state of the cold migration:

      python /usr/bin/info-collect-script/audit_resume/clean_stuck_migration.py instance_uuid migration_id

      instance_uuid and migration_id can be obtained from the audit report.

    4. Run the following command to migrate the VM that is stuck in the reverting state back to the original host:

      nova resize-revert instance_uuid

    5. Run the following command to check whether the VM changes to the active state:

      nova show uuid

      Check whether the VM is running properly.

      • If yes, no further action is required.
      • If no, contact technical support for assistance.
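
For reference, the following Python sketch ties steps 7.3 to 7.5 together for a single entry of the audit report. It is an illustration only: it assumes that environment variables have been imported and that the commands can be invoked directly, whereas this guide enters them in the secure operation mode (runsafe).

    #!/usr/bin/env python
    # Illustrative wrapper: clear the intermediate state of one stuck cold migration.
    # Usage: python clear_stuck.py instance_uuid migration_id migration_status
    import subprocess
    import sys

    CLEAN_SCRIPT = "/usr/bin/info-collect-script/audit_resume/clean_stuck_migration.py"

    def run(cmd):
        print("+ " + " ".join(cmd))
        return subprocess.call(cmd)

    def clear_stuck(instance_uuid, migration_id, migration_status):
        # Step 7.3: clear the intermediate state recorded in the database.
        run(["python", CLEAN_SCRIPT, instance_uuid, migration_id])
        # Step 7.4: only a VM stuck in the reverting state is migrated back.
        if migration_status == "reverting":
            run(["nova", "resize-revert", instance_uuid])
        # Step 7.5: check whether the VM returns to the active state.
        run(["nova", "show", instance_uuid])

    if __name__ == "__main__":
        clear_stuck(sys.argv[1], sys.argv[2], sys.argv[3])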

Cold Migrated VMs That Are Adversely Affected by Abnormal Hosts

Context

A VM is running on a host. If the host becomes faulty, VM services will be interrupted. In addition, if the source host becomes faulty during a VM cold migration, the cold migration will be adversely affected. Perform an audit to detect the cold migrated VMs that are adversely affected by faulty hosts in the system.

Parameters

The name of the audit report is host_invalid_migration.csv. Table 18-168 describes parameters in the report.

Table 18-168 Parameters in the audit report

Parameter

Description

id

Specifies the ID of the cold migration record.

instance_uuid

Specifies the universally unique identifier (UUID) of the VM that is cold migrated.

source_compute

Specifies the source host in the cold migration.

source_host_state

Specifies the status of the source host.

Impact on the System

Maintenance operations cannot be performed on the VM.

Possible Causes

  • The source host is powered off.
  • The compute role of the source host is deleted.
  • The nova-compute service on the source host runs improperly.

Procedure

  1. Use PuTTY to log in to the first host in the FusionSphere OpenStack system through the reverse proxy IP address.

    The default username is fsp, and the default password is Huawei@CLOUD8.

  2. Run the following command and enter the password Huawei@CLOUD8! of user root to switch to user root:

    su - root

  3. Import environment variables. For details, see Importing Environment Variables.
  4. Perform the following operations to query the management IP addresses of controller nodes:

    cps host-list

    The node whose roles value contains controller is a controller node, and its manageip value is the management IP address.

  5. Run the following commands to log in to a controller node:

    su fsp

    ssh fsp@Management IP address

    su - root

  6. Import environment variables. For details, see 3.
  7. Perform the following operations to restore the VM:

    1. Run the following command to enter the secure operation mode:

      runsafe

      The following information is displayed:

      Input command:
    2. Run the following command to query hosts:

      cps host-list

      Check whether the command output contains a host whose ID is the same as the source_compute value in the audit report. (A helper sketch after this procedure shows one way to script this check.)

      • If yes, go to the next step.
      • If no, go to 8.
    3. Locate the host whose status value is fault in the command output and check whether its services can be restored.
      • If yes, restore the services and perform the audit again.
      • If no, go to 8.
    4. Run the following command to query hosts:

      cps host-list

      Locate the host whose ID is the same as the source_compute value in the audit report and check whether the host has the compute role assigned.

      • If yes, the nova-compute service is running properly but the VM still cannot be maintained. Contact technical support for assistance.
      • If no, go to 8.

  8. Perform the following operations to clear unmaintainable cold migration information:

    1. Run the following command to delete residual cold migration data:

      python /usr/bin/info-collect-script/audit_resume/clean_stuck_migration.py instance_uuid id

      instance_uuid and id can be obtained from the audit report.

    2. Run the following command to check whether the VM changes to the active state:

      nova show uuid

      • If yes, no further action is required.
      • If no, contact technical support for assistance.
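
As noted in the procedure above, the following Python sketch shows one way to script the host existence check: it reads host_invalid_migration.csv and reports whether each source_compute value still appears in the cps host-list output. The report path and the substring-based matching are assumptions, and the remaining manual checks (host status and compute role) still apply.

    #!/usr/bin/env python
    # Illustrative check: does the source host of each affected cold migration
    # still appear in "cps host-list"?
    import csv
    import subprocess
    import sys

    def cps_host_list():
        proc = subprocess.Popen(["cps", "host-list"], stdout=subprocess.PIPE)
        return proc.communicate()[0].decode("utf-8", "replace")

    def main(report_path):
        hosts = cps_host_list()
        with open(report_path) as report:
            for row in csv.DictReader(report):
                listed = row["source_compute"] in hosts
                print("%s: source host %s listed: %s"
                      % (row["instance_uuid"], row["source_compute"], listed))

    if __name__ == "__main__":
        main(sys.argv[1])  # pass the path to host_invalid_migration.csv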

Redundant Neutron Namespaces

Context

A network has been deleted, but its DHCP namespace still exists. This namespace is a redundant one.

After the user confirms that a DHCP namespace is redundant, restart the neutron-dhcp-agent service to delete the namespace.

In centralized router scenarios, a router has been deleted, but the router namespace still exists. This namespace is a redundant one. In distributed router scenarios, the router namespace on a node is redundant if the router namespace exists on the node but VMs on all the subnets connected to the router do not exist.

Parameter Description

The name of the audit report is redundant_namespaces.csv. Parameters in the report are described in the following table.

Table 18-169 Parameter description

Parameter

Description

host_id

Specifies the universally unique identifier (UUID) of the node accommodating redundant namespaces.

namespace_list

Specifies the list of the redundant namespaces.

Possible Causes

  • When networks are deleted in batches, the RPC messages consumed by dhcp-agent are processed serially, which can cause a message backlog in the message queue. If dhcp-agent is disconnected from RabbitMQ at this time, the RPC broadcast messages are lost, and the DHCP namespaces of some networks fail to be deleted.

Impact on the System

  • The system contains residual Neutron (DHCP or router) namespaces.

Procedure

  1. Log in to any controller host in an AZ.

    For details, see Using SSH to Log In to a Host.

  2. Import environment variables to the host.

    For details, see Importing Environment Variables.

    Perform the following operations to query the DHCP or router deployment mode:
    1. Run the runsafe command to enter the secure operation mode. The following information is displayed:
      Input command:
    2. Query the DHCP or router deployment mode.
      1. To query the DHCP deployment mode

        Run the following command:

        cps template-params-show --service neutron neutron-server|grep dhcp_distributed

        If True is displayed in the command output, distributed DHCP is used. Otherwise, centralized DHCP is used.

      2. To query the router deployment mode

        Run the following command:

        cps template-params-show --service neutron neutron-openvswitch-agent|grep enable_distributed_routing

        If True is displayed in the command output, distributed router is used. Otherwise, centralized router is used.

        • In centralized DHCP scenarios, perform 3 to 9.
        • In distributed DHCP scenarios, perform 9 to 12.
        • In centralized router scenarios, perform 12 to 17.
        • In distributed router scenarios, perform 18 to 20.

  3. Determine the node containing a redundant Neutron namespace based on host_id in the audit report. Then log in to the node and import environment variables (for details, see 1 and 2).
  4. Enter secure operation mode by referring to Command Execution Methods and check whether the redundant DHCP namespaces exist on the node.

    1. The following information is displayed:
      Input command:
    2. Run the following command to check whether the redundant DHCP namespace exists on the node:

      ip netns | grep namespace_id

      NOTE:

      namespace_id specifies the ID of each namespace in the namespace_list field of the audit report.

      Check whether the redundant namespace exists on the host.

      • If yes, go to the next step.
      • If no, the namespace is not redundant. No further action is required.

  5. Run the following command in the secure operation mode and check whether the networks that correspond to the redundant namespaces exist in the system:

    neutron net-show network_id

    network_id specifies the network ID that corresponds to namespace_id in 4.

    For example:

    If the value of namespace_id is qdhcp-9c4c4872-af61-4fe0-9148-04324233a5e9, the value of network_id is 9c4c4872-af61-4fe0-9148-04324233a5e9. (A helper sketch after this procedure shows this derivation.)

    Check whether the network that corresponds to the redundant namespace exists in the system.

    • If yes, the namespace is not redundant. No further action is required.
    • If no, go to the next step.

  6. Run the following command in the secure operation mode and check whether any port device resides on the redundant DHCP namespace:

    ip netns exec namespace_id ip address

    The value of namespace_id can be obtained in 4.

    In OVS networking mode, the residual port device in the command output is a tap device, for example, tapf2b974ce-ba.

    In EVS networking mode, the residual port device is an hnic device, for example, hnic1.3958.

    Check whether any port device resides on the namespace.

    • If yes, go to 7 when the port device name contains tap and go to 8 when the name contains hnic.
    • If no, go to 9.

  7. Run the following command in the secure operation mode to delete the tap port device:

    ovs-vsctl del-port tap_id

    The value of tap_id can be obtained in 6.

    For example, the value of tap_id is tapf2b974ce-ba.

    Then go to 9.

  8. Run the following command in the secure operation mode to delete the hnic port device:

    Delete the port device on the namespace.

    ip netns exec namespace_id ip link delete hnic_id

    The value of namespace_id can be obtained in 4.

    The value of hnic_id can be obtained in 6, for example, hnic1.3958.

    Run the following command to delete the residual port device from the network bridge:

    ovs-vsctl -- --if-exists del-port hnic_port

    The value of hnic_port is the network bridge port that corresponds to the hnic port, with the period (.) replaced by a hyphen (-). For example, if the hnic port is hnic1.3958, the value of hnic_port is hnic1-3958.

    Then go to 9.

  9. Run the following command in the secure operation mode to delete the redundant DHCP namespace from the node:

    ip netns del namespace_id

    The value of namespace_id can be obtained in 4.

  10. Enter the secure operation mode by referring to Command Execution Methods and check whether the network port that corresponds to the redundant namespace exists on the node accommodating the namespace.

    1. The following information is displayed:
      Input command:
    2. Run the following command to check whether the network port information exists on the node:

      neutron port-list --network_id network_id --binding:host_id host_id

      NOTE:

      network_id can be obtained in 5, and the host_id value is the host_id value in the audit report.

      In the command output:

      • If only one distributed_dhcp_port record is displayed, this node does not contain other network ports. Go to the next step.
      • If multiple distributed_dhcp_port records are displayed, the namespace is not redundant. No further action is required.

  11. Perform 3 and 4 to log in to the node containing the redundant DHCP namespace and check whether the redundant DHCP namespace exists.
  12. If the redundant namespace exists, perform 5 to 9.
  13. Enter secure operation mode by referring to Command Execution Methods and check whether the redundant router namespaces exist on the node.

    1. The following information is displayed:
      Input command:
    2. Run the following command to check whether the router namespace exists on the node:

      ip netns | grep namespace_id

      NOTE:

      namespace_id specifies the ID of each namespace in the namespace_list field of the audit report.

      Check whether the redundant namespace exists on the host.

      • If yes, go to the next step.
      • If no, the namespace is not redundant. No further action is required.

  14. Run the following command in the secure operation mode and check whether the router that corresponds to the redundant namespace exists in the system:

    neutron router-show router_id

    The value of router_id is the router ID corresponding to the value of namespace_id in 13.

    For example:

    If the value of namespace_id is qrouter-af15306f-2ccd-4f1e-932d-9007f31c7f6f, the value of router_id is af15306f-2ccd-4f1e-932d-9007f31c7f6f.

    Check whether the router corresponding to the redundant namespace exists in the system.

    • If yes, the namespace is not redundant. No further action is required.
    • If no, go to the next step.

  15. Run the following command in the secure operation mode and check whether any port device resides on the redundant router namespace:

    ip netns exec namespace_id ip address

    The namespace_id value can be obtained in 13.

    In the command output, the residual port device is a qr device, for example, qr-2250925a-8e.

    Check whether any port device resides on the namespace.

    • If a port device resides on the namespace, go to 16.
    • If no, go to 17.

  16. Run the following command in the secure operation mode to delete the qr port device:

    ovs-vsctl del-port qr_id

    The value of qr_id can be obtained in 15.

    For example, the value of qr_id is qr-2250925a-8e.Then go to 17.

  17. Run the following command in the secure operation mode to delete the redundant router namespace from the node:

    ip netns del namespace_id

    The namespace_id value can be obtained in 13.

  18. Enter the secure operation mode by referring to Command Execution Methods and check whether the network port that corresponds to the redundant namespace exists on the node accommodating the namespace.

    1. The following information is displayed:
      Input command:
    2. Run the following commands to check whether the router exists and to obtain the IDs of the networks to which all subnets connected to the router belong:

      Obtain the value of port_id of the router:

      neutron router-port-list router_id

      Obtain the value of network_id of the router port:

      neutron port-show router_port_id -c network_id

      If the router exists, the namespace is not redundant. Otherwise, the namespace is redundant. If the namespace is redundant, go to the next step.

    3. Run the following command to check whether the network port information exists on the host:

      neutron port-list --network_id network_id --binding:host_id host_id

      NOTE:

      The value of host_id can be obtained from the audit report.

      In the command output:

      • If all networks connected to the router have no VM ports, the router namespace is redundant. Go to the next step.
      • If the networks connected to the router have VM ports, the router namespace is not redundant. No further action is required.

  19. Perform 3 and 13 to log in to the node accommodating the redundant namespace and check whether the redundant namespace exists.
  20. If the redundant namespace exists, perform 15 to 17.
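
The following Python sketch illustrates the derivation used in steps 5 and 14: the network or router ID is the namespace ID with its qdhcp- or qrouter- prefix removed, and neutron net-show or neutron router-show then tells whether the resource still exists. It is an illustration only; it assumes environment variables have been imported, whereas this guide enters the neutron commands in the secure operation mode.

    #!/usr/bin/env python
    # Illustrative helper: derive the resource ID from a namespace ID and check
    # whether the corresponding network or router still exists.
    import subprocess
    import sys

    def resource_exists(namespace_id):
        if namespace_id.startswith("qdhcp-"):
            cmd = ["neutron", "net-show", namespace_id[len("qdhcp-"):]]
        elif namespace_id.startswith("qrouter-"):
            cmd = ["neutron", "router-show", namespace_id[len("qrouter-"):]]
        else:
            raise ValueError("unexpected namespace ID: %s" % namespace_id)
        # A zero exit code means the network or router still exists, so the
        # namespace is not redundant.
        proc = subprocess.Popen(cmd, stdout=subprocess.PIPE, stderr=subprocess.STDOUT)
        proc.communicate()
        return proc.returncode == 0

    if __name__ == "__main__":
        for namespace in sys.argv[1:]:
            print("%s -> resource exists: %s" % (namespace, resource_exists(namespace)))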

Stuck Volume Snapshots

Context

A stuck volume snapshot is one that remains in a transition state (including creating, deleting, and error_deleting) and is unavailable for use. If a volume snapshot is stuck in such a state for more than 24 hours, restore the volume snapshot based on site conditions.

Parameter Description

The name of the audit report is SnapshotStatusAudit.csv. The following table describes parameters in the report.

Table 18-170 Parameter description

Parameter

Description

snap_id

Specifies the snapshot ID.

snap_name

Specifies the volume snapshot name on the storage device.

snap_type

Specifies the snapshot type, including san, dsware (FusionStorage), and v3.

status

Specifies the snapshot status.

last_update_time

Specifies the last time when the snapshot was updated.

NOTE:

In this section, san indicates Huawei 5500 T series storage devices, and v3 indicates V3 series storage devices (including Dorado and 18000).

Impact on the System

The stuck volume snapshot becomes unavailable and occupies system resources.

Possible Causes

  • A system exception occurred when a service operation on the volume snapshot was in process.
  • A database backup is created for future restoration. After the backup is created, the statuses of one or more volume snapshots change. When the database is restored from this backup, the snapshot status records in the database revert to the statuses captured at backup time.

Procedure

Restore the volume snapshot based on the volume snapshot statuses listed in the following table. For other situations, contact technical support for assistance.

Table 18-171 Volume snapshot restoration methods

Volume Snapshot Status: creating
In Transition Mode: Y
Description: The volume snapshot is being created.
Possible Scenario: Creating a volume snapshot
Restoration Method: For details, see Method 1.

Volume Snapshot Status: deleting
In Transition Mode: Y
Description: The volume snapshot is being deleted.
Possible Scenario: Deleting a volume snapshot
Restoration Method: For details, see Method 2.

Volume Snapshot Status: error_deleting
In Transition Mode: N
Description: The volume snapshot fails to be deleted.
Possible Scenario: Failed to delete a volume snapshot
Restoration Method: For details, see Method 2.

Method 1

  1. Log in to the first controller host in an AZ.

    For details, see Using SSH to Log In to a Host.

  2. Import environment variables to the host.

    For details, see Importing Environment Variables.

  3. Perform the following operations to query the volume snapshot status on the host:

    1. Run the following command to enter the secure operation mode:

      runsafe

      The following information is displayed:

      Input command:
    2. Run the following command to query the volume snapshot status:

      cinder snapshot-show Snapshot ID

      Check whether the value of status in the command output is consistent with the volume snapshot status in the audit report.

      • If yes, go to 4.
      • If no, contact technical support for assistance.

  4. Query the value of last_update_time in the audit report and check whether the time difference between the value and the current time exceeds 24 hours. (A helper sketch after this method shows one way to script this check.)

    • If yes, go to 5.
    • If no, contact technical support for assistance.

  5. Run the following command to set the snapshot status to error:

    Enter the secure operation mode based on 3 and run the following command:

    cinder snapshot-reset-state Snapshot ID --state error

  6. Run the following command to query the snapshot status:

    Enter the secure operation mode based on 3 and run the following command:

    cinder snapshot-show Snapshot ID

    In the command output, check whether the value of status is error.

    • If yes, no further action is required.
    • If no, contact technical support for assistance.

  7. Run the following command to delete the volume snapshot:

    cinder snapshot-delete Snapshot UUID

    NOTE:

    The Snapshot UUID value is the value of snap_id obtained from the audit report.

  8. Run the following command to check whether the volume snapshot is deleted:

    cinder snapshot-show Snapshot UUID

    NOTE:

    The Snapshot UUID value is the value of snap_id obtained from the audit report.

    If information similar to the following is displayed, the volume snapshot is deleted:

    ERROR: No snapshot with a name or ID of 'e318e16e-5a1c-471f-89c2-5c76719aa346' exists.

    If the value of status in the command output is error_deleting, the volume snapshot fails to be deleted.

    If the value of status in the command output is deleting, the volume snapshot is being deleted. Wait for about one minute and perform 8 again until the volume snapshot is deleted or fails to be deleted.

    Check whether the volume snapshot is successfully deleted.

    • If yes, no further action is required.
    • If no, contact technical support for assistance.
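
The 24-hour check in step 4 (also used in Method 2) can be scripted as shown in the following sketch. It is an illustration only; the timestamp format of last_update_time is an assumption and must be adjusted to match the report, as is the use of UTC.

    #!/usr/bin/env python
    # Illustrative check: is last_update_time in SnapshotStatusAudit.csv more
    # than 24 hours old?
    import csv
    import sys
    from datetime import datetime, timedelta

    TIME_FORMAT = "%Y-%m-%d %H:%M:%S"  # assumed format; adjust to the report

    def older_than_24h(last_update_time):
        updated = datetime.strptime(last_update_time, TIME_FORMAT)
        return datetime.utcnow() - updated > timedelta(hours=24)

    if __name__ == "__main__":
        with open(sys.argv[1]) as report:  # pass the path to SnapshotStatusAudit.csv
            for row in csv.DictReader(report):
                print("%s (%s): older than 24 hours: %s"
                      % (row["snap_id"], row["status"],
                         older_than_24h(row["last_update_time"])))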

Method 2

  1. Log in to the first controller host in an AZ.

    For details, see Using SSH to Log In to a Host.

  2. Import environment variables to the host.

    For details, see Importing Environment Variables.

  3. Run the following command to query the volume snapshot status on the host:

    1. Run the following command to enter the secure operation mode:

      runsafe

      The following information is displayed:

      Input command:
    2. Run the following command to query the volume snapshot status:

      cinder snapshot-show Snapshot ID

      Check whether the value of status in the command output is consistent with the volume snapshot status in the audit report.

      • If the snapshot statuses are consistent, perform different operations based on the volume snapshot status:
        • If the volume snapshot status is deleting, go to 4.
        • If the volume snapshot status is error_deleting, go to 5.
      • If no, no further action is required.

  4. Query the value of last_update_time in the audit report and check whether the time difference between the value and the current time exceeds 24 hours.

    • If yes, go to 5.
    • If no, contact technical support for assistance.

  5. Perform the following operations to set the volume snapshot status to available:

    Enter the secure operation mode based on 3 and run the following command:

    cinder snapshot-reset-state Snapshot ID --state available

  6. Delete the volume snapshot.

    Enter the secure operation mode based on 3 and run the following command:

    cinder snapshot-delete Snapshot ID

  7. Perform the following operations to check whether the volume snapshot is deleted:

    Enter the secure operation mode based on 3 and run the following command:

    cinder snapshot-show Snapshot ID

    If information similar to the following is displayed, the volume snapshot is deleted:

    ERROR: No snapshot with a name or ID of 'e318e16e-5a1c-471f-89c2-5c76719aa346' exists.

    If the value of status in the command output is error_deleting, the volume snapshot fails to be deleted.

    If the value of status in the command output is deleting, the volume snapshot is being deleted. Wait for about one minute and perform this step again until the volume snapshot is deleted or fails to be deleted.

    Check whether the volume snapshot is successfully deleted.

    • If yes, no further action is required.
    • If no, contact technical support for assistance.
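
For reference, the reset-delete-verify sequence of Method 2 (steps 5 to 7) can be wrapped as shown in the following sketch. It is an illustration only: it assumes the cinder commands can be invoked directly (this guide enters them in the secure operation mode) and that polling once per minute, up to ten times, is acceptable.

    #!/usr/bin/env python
    # Illustrative wrapper for Method 2: reset the snapshot to available, delete
    # it, and poll until it is gone or the deletion fails.
    import subprocess
    import sys
    import time

    def run(cmd):
        proc = subprocess.Popen(cmd, stdout=subprocess.PIPE, stderr=subprocess.STDOUT)
        return proc.communicate()[0].decode("utf-8", "replace")

    def delete_snapshot(snapshot_id):
        run(["cinder", "snapshot-reset-state", snapshot_id, "--state", "available"])
        run(["cinder", "snapshot-delete", snapshot_id])
        for _ in range(10):
            output = run(["cinder", "snapshot-show", snapshot_id])
            if "No snapshot with a name or ID" in output:
                return True   # the snapshot has been deleted
            if "error_deleting" in output:
                return False  # deletion failed; contact technical support
            time.sleep(60)    # still deleting; wait and check again
        return False

    if __name__ == "__main__":
        print("deleted: %s" % delete_snapshot(sys.argv[1]))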

Orphan Ports

Context

If Neutron determines that a port is used by a VM but the VM does not exist, this port is an orphan port. This can occur, for example, when a user unbinds a port from a VM but Neutron does not receive the corresponding request from Nova. As a result, data in the Nova and Neutron databases becomes inconsistent. A VM cannot use an orphan port.

Parameters

The name of the audit report for orphan ports is neutron_wild_ports.csv. The following table describes parameters in the report.

Table 18-172 Parameters in the audit report

Parameter

Description

port_id

Specifies the universally unique identifier (UUID) of the orphan port.

device_id

Specifies the UUID of the VM for the orphan port.

Possible Causes

When Nova binds a NIC to a VM, it updates port attributes such as device_id. If Nova detects a problem, it enters the rollback process and clears the port attributes. However, if the clearing call fails, these attributes remain as residual data in the Neutron database.

Impact on the System

The port cannot be used by a VM.
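
As a cross-check before the procedure, the following Python sketch reads neutron_wild_ports.csv and verifies, for each device_id, whether the VM still exists in Nova (the check performed in step 3 below). The report path is an assumption, and the error string matched is the one shown in the procedure.

    #!/usr/bin/env python
    # Illustrative check: does the VM recorded in device_id still exist in Nova?
    import csv
    import subprocess
    import sys

    def vm_exists(device_id):
        proc = subprocess.Popen(["nova", "show", device_id],
                                stdout=subprocess.PIPE, stderr=subprocess.STDOUT)
        output = proc.communicate()[0].decode("utf-8", "replace")
        # Nova prints "ERROR (CommandError): No server with a name or ID of ..."
        # when the VM does not exist.
        return "No server with a name or ID" not in output

    if __name__ == "__main__":
        with open(sys.argv[1]) as report:  # pass the path to neutron_wild_ports.csv
            for row in csv.DictReader(report):
                print("port %s -> VM %s exists: %s"
                      % (row["port_id"], row["device_id"], vm_exists(row["device_id"])))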

Procedure

  1. Log in to any host in the availability zone (AZ). For details, see Using SSH to Log In to a Host.
  2. Import environment variables to the host.

    For details, see Importing Environment Variables.

  3. Check whether device_id in the audit report exists.

    1. Run the following command to enter the secure operation mode:

      runsafe

      The following information is displayed:

      Input command:
    2. Run the nova show device_id command.

      For example, run the nova show 1cd5c6eb-e729-4773-b846-e9f1d3467c56 command.

      If information similar to the following is displayed, the VM does not exist in Nova:

      ERROR (CommandError): No server with a name or ID of '1cd5c6eb-e729-4773-b846-e9f1d3467c56' exists.

      Check whether the VM exists in Nova.

      • If yes, go to 4.
      • If no, go to 5.

  4. Check whether the VM has the NIC:

    1. Run the following command to query the host where the VM is located and the instance name of the VM:

      nova show uuid

      The VM UUID can be obtained from the audit report.

    2. Run the following command to query the IP address of the host:

      cps host-list

    3. Log in to the host where the VM is located by referring to Using SSH to Log In to a Host. The host IP address is the one obtained in 4.b.
    4. Import environment variables to the host.

      For details, see Importing Environment Variables.

    5. Run the following command to query the MAC address of the VM:

      nova interface-list device_id

    6. Run the following command to check whether the NIC exists: