HUAWEI CLOUD Stack 6.5.0 Alarm and Event Reference 04

Using FusionCompute for Virtualization

Overview

When the FusionSphere OpenStack cloud platform is used, unexpected system failures (such as host reboots or process restarts) or backup restoration can leave residual resources or make resources unavailable, causing service failures. In this case, a resource pool consistency audit is performed to ensure data consistency in the resource pool so that services can run properly.

Scenarios

A system audit is required for the OpenStack-based FusionSphere system because data inconsistency may occur in the following scenarios:

  • A system exception occurs while a service-related operation is being performed. For example, a host process restarts while you are creating a VM, causing the operation to fail. In this case, residual data may remain in the system or resources may become unavailable.
  • A service-related operation is performed after the system database is backed up but before the database is restored. In this case, residual data may remain in the system or resources may become unavailable after the database is restored using the backup.

The system audit is used to help administrators detect and handle data inconsistency.

Therefore, conduct a system audit when:

  • An audit alarm, such as that of volumes, VMs, snapshots, or images, is generated.
  • The system database is restored using a data backup.
  • Routine system maintenance is performed.

If any of the preceding events occurs, log in to the first host in the OpenStack system to obtain the audit report, and locate and handle data inconsistency.

NOTE:

You are advised to conduct a system audit when the system is running stably. Do not use audit results when a large number of service-related operations are in progress.

During the audit process, if service-related operations (for example, provisioning a VM or expanding the system capacity) are performed or any system exception occurs, the audit result may be distorted. In this case, conduct the system audit again after the system recovers. In addition, confirm the detected problems again based on the audit result processing procedure.

Audit Mechanism

The system audit consists of audit and post log analysis.

The following illustrates how a system audit works:

  • The system obtains service data from databases, hosts, and storage devices, compares the data, and generates an audit report.
  • The audit guide and Command Line Interface (CLI) commands are provided for users to locate and handle the data inconsistency problems listed in the audit report.

You can conduct a system audit in either automatic or manual mode:

  • Automatic: The system automatically starts an audit at 04:00 every day. Users can log in to the FusionSphere OpenStack web client and choose Configuration > System > System Audit to change the start time and interval of automatic audits. The system reports an alarm and generates an audit report if it detects any data inconsistency; if the alarm has already been generated, the system does not generate a second one. If no data inconsistency is detected but a data inconsistency alarm exists, the system automatically clears the alarm.
  • Manual: Log in to FusionSphere OpenStack and run the required command to start an audit.

Post log analysis is used after the system database is restored using a data backup. It analyzes historical logs and generates an audit report that collates records of tenants' operations on resources (such as VMs and volumes) within a specified time period. Based on the report and this audit guide, the administrator can locate and handle the problems listed in the audit reports.

Audit Process

If any audit alarm is generated, conduct an audit based on the process shown in Figure 18-4.

Figure 18-4 Audit process

Manual Audit

Scenarios

  • The system database is restored using a data backup.
  • Inconsistency problems are handled. The manual audit is used to verify that the problems are rectified.

Prerequisites

Services in the system are running properly.

Procedure

  1. Log in to any controller node in the AZ. For details, see Using SSH to Log In to a Host.
  2. Import environment variables to the node. For details, see Importing Environment Variables.

    Enter 1 to enable Keystone V3 authentication with the built-in DC administrator.

  3. Run the following command in secure mode to perform a manual audit (for details, see Command Execution Methods):

    infocollect audit --item ITEM --parameter PARAMETER --type TYPE

    Table 18-83 describes the parameter in the command.

    If you do not specify the audit item, an audit alarm will be triggered when an audit problem is detected. However, if the audit item is specified, no audit alarm will be triggered when an audit problem is detected.

    Table 18-83 Parameter description

    Parameter

    Mandatory or Optional

    Description

    item

    Optional

    Specifies a specific audit item. If you do not specify the audit item, an audit alarm will be reported when an audit problem is detected. However, if the audit item is specified, no audit alarm will be reported when an audit problem is detected. Values:

    • 1001: indicates that a VM is audited. The following audit reports are generated after an audit is complete:
      • orphan_vm.csv: Audit report about orphan VMs
      • invalid_vm.csv: Audit report about invalid VMs
      • host_changed_vm.csv: Audit report about VM location inconsistency
      • stucking_vm.csv: Audit report about stuck VMs
      • diff_property_vm.csv: Audit report about VM attribute inconsistency
      • diff_state_vm.csv: Audit report about VM status inconsistency
      • host_invalid_migration.csv: Audit report about abnormal hosts that adversely affect cold migrated VMs
    • 1002: indicates that an image is audited. The following audit report is generated after an audit is complete:

      stucking_images.csv: Audit report about stuck images

    • 1003: indicates that a zombie process is audited. The following audit report is generated after an audit is complete:

      zombie_process_hosts.csv: Audit report about zombie processes

    • 1004: This item is required only for the KVM virtualization platform but not for FusionCompute.
    • 1005: indicates that the records of migrated databases are audited. The following audit reports are generated after an audit is complete:
      • cold_cleaned.csv: Audit report about residual data after cold migration
      • live_cleaned.csv: This report is generated only in KVM scenarios and not generated in FusionCompute scenarios.
      • cold_stuck.csv: Audit report about cold migration records stuck in an intermediate state
    • 1007: indicates that transactions that have not been submitted in the Nova database for more than one hour are audited. The following audit report is generated after an audit is complete:

      nova_idle_transactions.csv: Audit report about transactions that have not been submitted in the Nova database for more than one hour

    • 1102: indicates that the Neutron namespace is audited. The following audit report is generated after an audit is complete:

      redundant_namespaces.csv: Audit report about redundant Neutron namespaces

    • 1201: indicates that the invalid volume, orphan volume, volume attachment status and stuck volume are audited. The following audit reports are generated after an audit is complete:
      • fakeVolumeAudit.csv: Audit report about invalid volumes
      • wildVolumeAudit.csv: Audit report about orphan volumes
      • VolumeAttachmentAudit.csv: Audit report about the volume attachment status
      • VolumeStatusAudit.csv: Audit report about stuck volumes
      • FrontEndQosAudit.csv: Audit report about front-end QoS
      • VolumeQosAudit.csv: Audit report about volume QoS
    • 1204: indicates that the invalid snapshot, orphan snapshot, and stuck snapshot are audited. The following audit reports are generated after an audit is complete:
      • fakeSnapshotAudit.csv: Audit report about invalid snapshots
      • wildSnapshotAudit.csv: Audit report about orphan snapshots
      • SnapshotStatusAudit.csv: Audit report about stuck snapshots
      • wildInstanceSnapshotAudit.csv: Audit report about residual orphan child snapshots
    • 1205: indicates that an orphan snapshot is audited. The following audit report is generated after an audit is complete:

      wildSnapshotAudit.csv: Audit report about orphan snapshots

    • 1206: indicates that the volume attachment status is audited. The following audit report is generated after an audit is complete:

      VolumeAttachmentAudit.csv: Audit report about the volume attachment status

    • 1207: indicates that a stuck volume is audited. The following audit report is generated after an audit is complete:

      VolumeStatusAudit.csv: Audit report about stuck volumes

    • 1208: indicates that a stuck snapshot is audited. The following audit report is generated after an audit is complete:

      SnapshotStatusAudit.csv: Audit report about stuck snapshots

    • 1301: indicates that a bare metal server (BMS) is audited. The following audit reports are generated after an audit is complete:
      • invalid_ironic_nodes.csv: Audit report about unavailable BMSs
      • invalid_ironic_instances.csv: Audit report about BMS consistency
      • stucking_ironic_instances.csv: Audit report about stuck BMSs
    • 1501: indicates that an orphan replication pair is audited. The following audit report is generated after an audit is complete:

      wildReplicationAudit.csv: Audit report about orphan replication pairs

    • 1502: indicates that an invalid replication pair is audited. The following audit report is generated after an audit is complete:

      fakeReplicationAudit.csv: Audit report about invalid replication pairs

    • 1503: indicates that a stuck replication pair is audited. The following audit report is generated after an audit is complete:

      ReplicationMidStatusAudit.csv: Audit report about stuck replication pairs

    • 1504: indicates that replication pair statuses are audited. The following audit report is generated after an audit is complete:

      statusReplicationAudit.csv: Audit report about replication pair statuses

    • 1505: indicates that an orphan consistency replication group is audited. The following audit report is generated after an audit is complete:

      wildReplicationcgAudit.csv: Audit report about an orphan consistency replication group

    • 1506: indicates that an invalid consistency replication group is audited. The following audit report is generated after an audit is complete:

      fakeReplicationcgAudit.csv: Audit report about an invalid consistency replication group

    • 1507: indicates that a stuck consistency replication group is audited. The following audit report is generated after an audit is complete:

      ReplicationcgMidStatusAudit.csv: Audit report about stuck consistency replication groups

    • 1508: indicates that consistency replication group statuses are audited. The following audit report is generated after an audit is complete:

      statusReplicationcgAudit.csv: Audit report about consistency replication group statuses

    • 1509: indicates that a consistency replication pair in the replication group is audited. The following audit report is generated after an audit is complete:

      contentReplicationcgAudit.csv: Audit report about consistency replication group content

    • 1601: indicates that an orphan HyperMetro pair is audited. The following audit report is generated after an audit is complete:

      wildHypermetroAudit.csv: Audit report about an orphan HyperMetro pair

    • 1602: indicates that an invalid HyperMetro pair is audited. The following audit report is generated after an audit is complete:

      fakeHypermetroAudit.csv: Audit report about an invalid HyperMetro pair

    • 1603: indicates that a stuck HyperMetro pair is audited. The following audit report is generated after an audit is complete:

      HypermetroMidStatusAudit.csv: Audit report about a stuck HyperMetro pair

    • 1604: indicates that an orphan HyperMetro consistency group is audited. The following audit report is generated after an audit is complete:

      wildHypermetrocgAudit.csv: Audit report about an orphan HyperMetro consistency group

    • 1605: indicates that an invalid HyperMetro consistency group is audited. The following audit report is generated after an audit is complete:

      fakeHypermetrocgAudit.csv: Audit report about an invalid HyperMetro consistency group

    • 1606: indicates that a stuck HyperMetro consistency group is audited. The following audit report is generated after an audit is complete:

      HypermetrocgMidStatusAudit.csv: Audit report about stuck HyperMetro consistency groups

    • 1702: indicates that an ECS snapshot is audited. The following audit report is generated after an audit is complete:

      images_vm_snapshots.csv: Audit report about residual ECS snapshots

    If the parameter is not specified, all the audit items are performed by default.

    parameter

    Optional. This parameter can be specified only after the audit item is specified.

    Specifies an additional parameter. You can specify only one value which needs to match the item.

    • If item is set to 1001, you can set the value of vm_stucking_timeout which indicates the timeout threshold in seconds for VMs in an intermediate state. The default value is 14400. The value affects the audit report about stuck VMs. You can also set the value of host_invalid_timeout which indicates the heartbeat timeout threshold in seconds for abnormal hosts. The default value is 14400. The value affects the audit report about abnormal hosts that adversely affect cold migrated VMs.
    • If item is set to 1002, you can set the value of image_stucking_timeout which indicates the timeout period in seconds for transient images. The default value is 86400. The value affects the audit report about stuck images.
    • If item is set to 1005, you can set the value of migration_stucking_timeout, which indicates the timeout period in seconds. The default value is 14400. This parameter affects the audit report about cold migration records stuck in an intermediate state.
    • If item is set to other values, no additional parameter is required.

    Example: --parameter vm_stucking_timeout=3600

    type

    Optional

    Specifies whether the audit is synchronous or asynchronous. If this parameter is not specified, the audit is synchronous. The values are:

    • sync: specifies a synchronous audit. For details, see the following command.
    • async: specifies an asynchronous audit. For details, see Asynchronous Audit. The audit progress and audit result status of an asynchronous audit can be obtained by invoking the interface for querying the task status.

    For example, run the following command to audit VMs and detect those that have been in an intermediate state for 3600 seconds or longer:

    infocollect audit --item 1001 --parameter vm_stucking_timeout=3600

    Information similar to the following is displayed:

    +--------------------------------------+----------------------------------+
    | Hostname                             | Path                             |
    +--------------------------------------+----------------------------------+
    | CCCC8175-8EAC-0000-1000-1DD2000011D0 | /var/log/audit/2015-04-22_020324 |
    +--------------------------------------+----------------------------------+

    In the command output, Hostname indicates the ID of the host for which the audit report is generated, and Path indicates the directory containing the audit report.

    Log in to that host first, and then view the audit reports as described in Collecting Audit Reports.
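    For example, to audit only images and treat images stuck for longer than 12 hours as abnormal, or to run the volume audit asynchronously, commands of the following form could be used (the timeout value of 43200 seconds is an illustrative choice; the item codes are taken from Table 18-83):

    infocollect audit --item 1002 --parameter image_stucking_timeout=43200
    infocollect audit --item 1201 --type async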

Collecting Audit Reports

Scenarios

  • An audit alarm, such as that of volumes, VMs, snapshots, or images, is generated.
  • Routine maintenance is performed.

Prerequisites

A local PC running the Windows operating system is available.

Procedure

  1. Log in to any controller host in an AZ. For details, see Using SSH to Log In to a Host.
  2. Import environment variables to the host.

    For details, see Importing Environment Variables.

    Enter 1 to enable Keystone V3 authentication with the built-in DC administrator.

  3. Run the following command to obtain the External OM plane IP address of a host where the audit service is deployed. For details, see Command Execution Methods.

    cps template-instance-list --service collect info-collect-server

    Information similar to the following is displayed:

  4. Log in to the info-collect-server-assigned host using the OM plane IP address. For details, see Logging In to a Host with a Role Deployed.
  5. Run the following command to query the time for the last audit conducted on the host:

    ls /var/log/audit -Ftr | grep /$ | tail -1

    Information similar to the following is displayed:

    2014-09-20_033137/
    NOTE:
    • The command output indicates the audit time. For example, 2014-09-20_033137 indicates 03:31:37 on September 20th, 2014.
    • If no result is returned, no audit report is available on the host.

  6. Run the following command to create a temporary directory where you can store audit reports:

    mkdir -p /home/fsp/last_audit_result

  7. Run the following command to copy the latest audit report to the temporary directory:

    cp -r /var/log/audit/`ls /var/log/audit -Ftr | grep /$ | tail -1` /home/fsp/last_audit_result

  8. Run the following command to modify the permissions of the temporary directory and file:

    chmod 777 /home/fsp/last_audit_result/ -R

  9. Use WinSCP or other tools to copy the folder /home/fsp/last_audit_result to the local PC.
  10. Run the following command to delete the temporary folder from the host:

    rm -r /home/fsp/last_audit_result
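The commands in steps 5 through 8 can also be combined into one sequence. The following is a minimal sketch, assuming it is run on the info-collect-server-assigned host with read access to /var/log/audit:

    # Identify the most recent audit directory (empty output means no report exists).
    latest=$(ls /var/log/audit -Ftr | grep /$ | tail -1)
    # Copy it to a temporary directory and open the permissions for download.
    mkdir -p /home/fsp/last_audit_result
    cp -r /var/log/audit/${latest} /home/fsp/last_audit_result
    chmod 777 /home/fsp/last_audit_result/ -R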

Obtaining the Operation Report

Scenarios

After the system database is restored using a data backup and then audited, an audit alarm indicates that data inconsistency has occurred. In this scenario, obtain the operation reports to locate the inconsistent data, and use the operation replay tool to obtain and analyze the operation logs to find the cause of the inconsistency.

Operation Replay Tool Function

The operation replay tool is used to collect and analyze OpenStack component operation logs, and generate an operation report, which records operations performed on system resources in the specified time period. Then, users can check these operations.

This tool can analyze operation logs of the following components: Nova, Cinder, and Glance.

The system resources that can be analyzed include VMs, images, volumes, and snapshots.

Report Format

The report generated by the operation replay tool is a .csv file.

The file name format is Component name-Start time_End time.csv, for example, nova-2014:09:10-10:00:00_2014:09:11-10:00:00.csv.

Table 18-84 describes parameters in the operation report.

Table 18-84 Report format description

Parameter

Description

Example Value

tenant

Specifies the tenant ID.

94e010f2246f435ca7f13652e64ff0fb

res_id

Specifies the resource ID.

8ff25fba9-61cd-424f-a64a-c4a07b372d51

res_type

Specifies the resource type.

volumes

time

Specifies the time when the operation was performed.

18/Sep/2014:12:37:15

host

Specifies the host ID.

CCCC8171-7958-0000-1000-1DD40000CAD0

action

Specifies detailed information about the operation.

POST https://volume.az1.dc1.vodafone.com:8776/v2/94e010f2246f435ca7f13652e64ff0fb/volumes {"volume": {"status": "creating", "description": null, "availability_zone": null, "source_volid": null, "snapshot_id": null, "size": 1, "user_id": null, "name": "ooooo", "imageRef": null, "attach_status": "detached", "volume_type": null, "shareable": false, "project_id": null, "metadata": {}}} 8ff25fba9-61cd-424f-a64a-c4a07b372d51 202
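To trace the operations performed on a specific resource, you can filter the report by its resource ID. A minimal sketch, using the example file name and res_id shown above:

grep 8ff25fba9-61cd-424f-a64a-c4a07b372d51 nova-2014:09:10-10:00:00_2014:09:11-10:00:00.csv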

Command Format

The format of the command used by this tool is as follows:

operate-replay analyse --path <log-path> [--dest <dest>] [--start <time>] [--end <time>]

This command can be executed on any host.

Table 18-85 describes parameters in the command.

Table 18-85 Parameter description

Parameter

Mandatory or Optional

Description

--path

Mandatory

Specifies the directory containing the operation logs. The directory structure is as follows:

.../log-path/host_id_1/nova-api

.../log-path/host_id_1/cinder-api

--dest

Optional

Specifies the analysis object. The value can be nova, cinder, or glance. Separate multiple objects using commas (,), such as --dest nova,cinder.

If the parameter is not specified, the tool analyzes all three objects.

--start

Optional

Specifies the start time. The value format is YYYY/MM/DD-HH:MM:SS, for example, 2014/05/06-10:12:15.

The default value is unlimit, indicating the earliest time of log generation.

If both the start time and end time are specified, the difference between the two values must be less than or equal to 48 hours.

--end

Optional

Specifies the end time. The value format is YYYY/MM/DD-HH:MM:SS, for example, 2014/05/06-12:12:15.

The default value is unlimit, indicating the current time of the system.
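For example, to analyze only the Nova operation logs within a two-hour window, a command of the following form could be used (the path and times below are illustrative):

operate-replay analyse --path /home/fsp/op_log --dest nova --start 2014/09/18-10:00:00 --end 2014/09/18-12:00:00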

Procedure

  1. Log in to a host in an AZ.

    For details, see Using SSH to Log In to a Host.

  2. Import environment variables to the host.

    For details, see Importing Environment Variables.

  3. Enter the secure operation mode based on Command Execution Methods and locate the controller hosts in the AZ.

    After you enter the secure operation mode, the following information is displayed:

    Input command:

    Run the following command:

    cps host-list

    In the command output, a node whose roles value contains controller indicates a controller node. Take a note of the IDs and OM plane IP addresses of all controller nodes, as well as the OM plane IP address of the first host in the AZ.

  4. Log in to the first host using the OM plane IP address obtained in 3, and run the su fsp command to switch to user fsp.
  5. In the /home/fsp directory, create a directory for storing log files.

    If the directory to be created is op_log, run the following command:

    mkdir -p /home/fsp/op_log

  6. In the directory created in 5, run the following command to create a subdirectory for each controller host and name them with the IDs of the controller hosts:

    mkdir -p /home/fsp/op_log/controller_host_id

    If the controller host ID is controller-node-1, run the following command:

    mkdir -p /home/fsp/op_log/controller-node-1

  7. Perform the following operations on each controller host obtained in 3 to copy controller host operation logs to their corresponding subdirectories created in 6:

    • Run the following commands to log in to each controller host using the OM plane IP addresses obtained in 3 and switch to user root:

      ssh fsp@OM plane IP address

      su - root

    • Run the following command to create a temporary directory:

      mkdir /home/fsp/op_tmp_log

    • Run the following command to switch to the directory containing the operation logs:

      cd /var/log/fusionsphere/operate

    • Run the following commands to copy the operation logs to the temporary directory /home/fsp/op_tmp_log:

      cp -r ./nova-api /home/fsp/op_tmp_log

      cp -r ./cinder-api /home/fsp/op_tmp_log

      cp -r ./glance-api /home/fsp/op_tmp_log

    • Run the following command to modify the permission of the temporary directory /home/fsp/op_tmp_log:

      chmod 777 /home/fsp/op_tmp_log -R

    • Run the following commands to copy the temporary directory /home/fsp/op_tmp_log to the first host:

      su fsp

      scp -r /home/fsp/op_tmp_log/* fsp@first_node_ip:/home/fsp/op_log/controller_host_id/

      In the command, first_node_ip indicates the OM plane IP address of the first host, and controller_host_id indicates the subdirectory created in 6 for the host.

    • Run the following command to delete the temporary directory:

      rm /home/fsp/op_tmp_log -r

    After the operation logs of all controller hosts are copied to the subdirectories, log in to the first host.

  8. Run the following command to generate an operation report:

    su - root

    operate-replay analyse --dest nova,cinder,glance --path /home/fsp/op_log [--start <time>] [--end <time>]

    You can set the time range and analysis objects as required.

    A .csv report file is generated in the specified directory.

  9. Run the following command to change the permission for directory /home/fsp/op_log:

    chmod 777 /home/fsp/op_log -R

  10. Use WinSCP or other tools to copy the operation report from the first host to the local PC.

    If you no longer need the files in the /home/fsp/op_log directory after the report is successfully copied, delete the directory:

    rm /home/fsp/op_log -r

  11. Use Excel to open the report.
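The per-controller operations in 7 can also be run as one sequence on each controller host. The following is a minimal sketch, to be executed as user root; first_node_ip and controller_host_id are placeholders for the values noted in 3 and 6:

    # Collect the operation logs into a temporary directory.
    mkdir /home/fsp/op_tmp_log
    cd /var/log/fusionsphere/operate
    cp -r ./nova-api ./cinder-api ./glance-api /home/fsp/op_tmp_log
    chmod 777 /home/fsp/op_tmp_log -R
    # Copy the logs to the first host as user fsp, then remove the temporary directory.
    su fsp -c "scp -r /home/fsp/op_tmp_log/* fsp@first_node_ip:/home/fsp/op_log/controller_host_id/"
    rm /home/fsp/op_tmp_log -r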

Analyzing Audit Results

Scenarios

Analyze the audit results when:

  • An audit-related alarm, such as a volume, VM, snapshot, or image audit alarm, is received. Log in to the system, obtain the audit reports, and rectify the faults accordingly.
  • The backup and restoration feature has been enabled. Log in to the system, perform a consistency audit, obtain the audit reports, and rectify the faults accordingly.
  • Routine system maintenance is performed. Log in to the system, perform an audit, obtain the audit reports, and rectify the faults accordingly.

Prerequisites

Procedure

  1. Determine the audit report name.

    If the alarm is an audit alarm, choose Additional Info > Details and select the required audit report from displayed audit reports.

  2. Check the audit report name.

    • VM Audit
      • If the report name is orphan_vm.csv, rectify the fault based on Orphan VMs. Otherwise, residual resources may exist.
      • If the report name is invalid_vm.csv, rectify the fault based on Invalid VMs. Otherwise, unavailable VMs may be visible to users.
      • If the report name is host_changed_vm.csv, rectify the fault based on VM Location Inconsistency. Otherwise, VMs may become unavailable.
      • If the report name is stucking_vm.csv, rectify the fault based on Stuck VMs. Otherwise, VMs may become unavailable.
      • If the report name is diff_state_vm.csv, rectify the fault based on VM Status Inconsistency. Otherwise, tenant operations on VMs may be restricted.
      • If the report name is diff_property_vm.csv, rectify the fault based on VM Attribute Inconsistency. Otherwise, system data may become inconsistent.
      • If the report name is cold_stuck.csv and the report is not empty, rectify the fault based on Intermediate State of the Cold Migration. Otherwise, the affected VMs may fail to be maintained.
      • If the report name is host_invalid_migration.csv and the report is not empty, rectify the fault based on Cold Migrated VMs That Are Adversely Affected by Abnormal Hosts. Otherwise, the affected VMs may fail to be maintained.
      • If the report name is nova_service_cleaned.csv and the report is not empty, rectify the fault based on Handling nova-compute Service Residuals. Otherwise, user experience is affected.
      • If the report name is nova_idle_transactions.csv and the report is not empty, rectify the fault based on Handling Transactions That Are Not Submitted in the Nova Database. Otherwise, the number of available Nova database connections may decrease.
    • Volume Audit
      • If the report name is wildVolumeAudit.csv, rectify the fault based on Orphan Volumes. Otherwise, volumes may be unavailable in the Cinder service but occupy the storage space.
      • If the report name is fakeVolumeAudit.csv, rectify the fault based on Invalid Volumes. Otherwise, unavailable volumes may be visible to users.
      • If the report name is VolumeStatusAudit.csv, rectify the fault based on Stuck Volumes. Otherwise, volumes may become unavailable.
      • If the report name is VolumeAttachmentAudit.csv, rectify the fault based on Inconsistent Volume Attachment Information. Otherwise, residual resources may exist.
      • If the report name is FrontEndQosAudit.csv and the report is not empty, rectify the fault based on Frontend Qos. Otherwise, residual resources may exist.
      • If the report name is VolumeQosAudit.csv and the report is not empty, rectify the fault based on Volume Qos. Otherwise, unavailable volumes may be visible to users.
    • Snapshot Audit
      • If the report name is wildSnapshotAudit.csv, rectify the fault based on Orphan Volume Snapshots. Otherwise, residual resources may exist.
      • If the report name is fakeSnapshotAudit.csv, rectify the fault based on Invalid Volume Snapshots. Otherwise, unavailable volume snapshots may be visible to users.
      • If the report name is SnapshotStatusAudit.csv and the report is not empty, rectify the fault based on Stuck Volume Snapshots. Otherwise, volume snapshots may be unavailable.
      • If the report name is wildInstanceSnapshotAudit.csv, rectify the fault based on Residual Orphan Child Snapshots. Otherwise, residual volume snapshot resources exist, occupying system resources.
    • Image Audit
      • If the report name is stucking_images.csv, rectify the fault based on Stuck Images. Otherwise, the affected VMs may fail to be maintained.
    • Virtual Network Resource Audit
      • If the report name is redundant_namespaces.csv and the report is not empty, rectify the fault based on Handling Redundant Neutron Namespaces. Otherwise, residual namespaces may exist and fail to be maintained.
    • Other Audit
      • If the report name is zombie_process_hosts.csv, zombie processes have been generated in the nova-novncproxy service and have been automatically processed. For details, see Nova novncproxy Zombie Process.
      • If the report name is cold_cleaned.csv and the report is not empty, residual cold migration records exist in the environment and have been automatically processed. For details, see Residual Cold Migration Data.
      • If the report name is wildReplicationAudit.csv and the report is not empty, rectify the fault based on Orphan Replication Pair. Otherwise, residual resources may exist.
      • If the report name is fakeReplicationAudit.csv and the report is not empty, rectify the fault based on Invalid Replication Pairs. Otherwise, unavailable replication pairs may be visible to users.
      • If the report name is ReplicationMidStatusAudit.csv and the report is not empty, rectify the fault based on Stuck Replication Pairs. Otherwise, replication pairs may be unavailable.
      • If the report name is statusReplicationAudit.csv and the report is not empty, rectify the fault based on Replication Pair with Inconsistent Statuses. Otherwise, replication pairs may be unavailable.
      • If the report name is wildReplicationcgAudit.csv and the report is not empty, rectify the fault based on Orphan Remote Replication Consistency Groups. Otherwise, residual resources may exist.
      • If the report name is fakeReplicationcgAudit.csv and the report is not empty, rectify the fault based on Invalid Remote Replication Consistency Group. Otherwise, unavailable consistency replication groups may be visible to users.
      • If the report name is ReplicationcgMidStatusAudit.csv and the report is not empty, rectify the fault based on Stuck Remote Replication Consistency Groups. Otherwise, consistency replication groups may be unavailable.
      • If the report name is statusReplicationcgAudit.csv and the report is not empty, rectify the fault based on Remote Replication Consistency Groups with Inconsistent States. Otherwise, consistency replication groups may be unavailable.
      • If the report name is contentReplicationcgAudit.csv and the report is not empty, rectify the fault based on Remote Replication Consistency Groups with Inconsistent Replication Pairs. Otherwise, consistency replication groups may be unavailable.
      • If the report name is wildHypermetroAudit.csv and the report is not empty, rectify the fault based on Orphan HyperMetro Pairs. Otherwise, residual resources may exist.
      • If the report name is fakeHypermetroAudit.csv and the report is not empty, rectify the fault based on Invalid HyperMetro Pairs. Otherwise, unavailable HyperMetro pairs may be visible to users.
      • If the report name is HypermetroMidStatusAudit.csv and the report is not empty, rectify the fault based on Stuck HyperMetro Pairs. Otherwise, HyperMetro pairs may be unavailable.
      • If the report name is wildHypermetrocgAudit.csv and the report is not empty, rectify the fault based on Orphan HyperMetro Consistency Groups. Otherwise, residual resources may exist.
      • If the report name is fakeHypermetrocgAudit.csv and the report is not empty, rectify the fault based on Invalid HyperMetro Consistency Groups. Otherwise, unavailable HyperMetro consistency groups may be visible to users.
      • If the report name is HypermetrocgMidStatusAudit.csv and the report is not empty, rectify the fault based on Stuck HyperMetro Consistency Groups. Otherwise, HyperMetro consistency groups may be unavailable.
      • If the report name is images_vm_snapshots.csv, rectify the fault based on Residual ECS Snapshots. Otherwise, residual ECS snapshot resources exist, occupying system resources.

    If multiple faults are displayed in the audit report, rectify them in the sequence listed in 2.

Handling Audit Results

In this part, import environment variables before running Nova (nova xxx), Cinder (cinder xxx), Neutron (neutron xxx), Glance (glance xxx), CPS (cps xxx), or cpssafe commands in OpenStack. For details, see Importing Environment Variables.

You can run commands in OpenStack in either secure mode or insecure mode. For details, see Command Execution Methods.

Orphan VMs

Context

An orphan VM is a VM that is present on a host but does not exist in the system database or is in the deleted state in the database.

If an orphan VM was not created intentionally by a tenant, it is recommended that the tenant delete it to release computing and network resources.

Parameter Description

The name of the audit report for an orphan VM is orphan_vm.csv. Table 18-86 describes parameters in the report.

Table 18-86 Parameter description

Parameter

Description

uuid

Specifies the VM universally unique identifier (UUID).

hyper_vm_name

Specifies the VM Uniform Resource Name (URN) registered in FusionCompute.

host_id

Specifies the ID of the host accommodating the VM.

If the compute node is deployed in active/standby mode, the value of host_id is the logical ID of the compute node.
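If the report lists many orphan VMs, the uuid column can be extracted in one pass for further checks. A minimal sketch, assuming uuid is the first column of orphan_vm.csv and the file contains a header row:

awk -F, 'NR>1 {print $1}' orphan_vm.csv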

Possible Causes

  • A database is backed up for future restoration. However, after the backup is created, one or more VMs are created. After the database is restored, records of these VMs are missing from the database, but the VMs still reside on their hosts and become orphan VMs.
  • A VM is created using FusionCompute.
  • A VM is deleted when the fc-nova-compute component is in the fault state.

Impact on the System

  • VMs orphaned by database restoration are invisible to tenants.
  • System resources are leaked.

Procedure

  1. Log in to any controller host in an AZ.

    For details, see Using SSH to Log In to a Host.

  2. Import environment variables. For details, see Importing Environment Variables.
  3. Log in to any info-collect-server-assigned host using the OM plane IP address.

    For details, see Logging In to a Host with a Role Deployed.

    NOTE:

    Run the following command to obtain the info-collect-server-assigned host name:

    cps template-instance-list --service collect info-collect-server

    If the host you have logged in to is the one for which the OM plane IP address is to be obtained, proceed to the next operation.

  4. Run the following script to check whether the VM is an orphan VM:

    python /usr/local/bin/info-collect-server/server/audit_script.py cascaded_vm_wild_confirm fc-nova-computeXXX uuid

    NOTE:

    fc-nova-computeXXX: The value of fc-nova-computeXXX is that of host_id obtained from the audit report.

    uuid: The value of uuid can be obtained from the audit report.

    If the command output displays "This VM is wild.", the VM is an orphan VM. If the command output displays "This VM is not wild.", the VM is not an orphan VM. If the command output contains ERROR, contact technical support for assistance.

    • If yes, go to 5.
    • If no, the VM is not an orphan VM, and the fault may be falsely reported due to time differences. No further action is required.

  5. Query details of the VM in FusionCompute using uuid based on Querying Information About a VM in FusionCompute and check whether the orphan VM is created by an administrator on the FusionCompute web client.

    NOTE:

    uuid: The value of uuid can be obtained from the audit report.

    • If yes, confirm with the tenant whether to reserve the orphan VM.
      • If the VM needs to be reserved, no further action is required.
      • If the VM is not reserved, delete it on the FusionCompute web client.
    NOTE:

    If the VM to be reserved is the one created when users create a template, convert the VM to a template.

    In this case, no further action is required.

    • If no, go to the next step.

  6. Switch to the VM details page and check whether the VM is in the Stopped status.

    • If yes, go to 8.
    • If no, go to 7.

  7. On the VM details page, click Operation and select Stop from the drop-down list.
  8. Choose Hardware > Disk. The VM disk list page is displayed.
  9. In the VM disk list, click More and select Detach from the drop-down list to detach all disks from the VM.
  10. Confirm with the tenant whether to delete the orphan VM.

    • If yes, go to 11.
    • If no, contact technical support for assistance.

  11. On the VM details page, click Operation and select Delete from the drop-down list to delete the VM.

Invalid VMs

Context

An invalid VM is one that is recorded as normal in the system database but is not present in FusionCompute.

For an invalid VM, confirm with the tenant whether the VM is useful. If the VM is not useful, delete the VM records from the database.

Parameter Description

The name of the audit report is invalid_vm.csv. Table 18-87 describes parameters in the report.

Table 18-87 Parameter description

Parameter

Description

uuid

Specifies the VM UUID.

tenant_id

Specifies the tenant ID.

hyper_vm_name

Specifies the VM name on the host, for example, instance_xxx.

updated_at

Specifies the last time when the VM status was updated.

status

Specifies the current VM status.

task_status

Specifies the current VM task status.

host_id

Specifies the ID of the host accommodating the VM.

If the compute node is deployed in active/standby mode, the value of host_id is the logical ID of the compute node.

Impact on the System

Users can query the VM using the Nova APIs, but the VM does not exist on the host.

Possible Causes

  • A database is backed up for future restoration. However, after the backup is created, one or more VMs are deleted. When the database is restored using the backup, records of these VMs are present in the restored database even though the VMs have been deleted.
  • The VM creation fails due to FusionCompute exceptions or network faults during the VM creation or rebuilding process, resulting in residual VM data in the database.

Procedure

  1. Log in to any controller host in an AZ.

    For details, see Using SSH to Log In to a Host.

  2. Import environment variables. For details, see Importing Environment Variables.
  3. Log in to any info-collect-server-assigned host using the OM plane IP address.

    For details, see Logging In to a Host with a Role Deployed.

    NOTE:

    Run the following command to obtain the info-collect-server-assigned host name:

    cps template-instance-list --service collect info-collect-server

    If the host you have logged in to is the one for which the OM plane IP address is to be obtained, proceed to the next operation.

  4. Run the following script to check whether the VM is an invalid VM:

    python /usr/local/bin/info-collect-server/server/audit_script.py cascaded_vm_fake_confirm fc-nova-computeXXX uuid

    NOTE:

    fc-nova-computeXXX: The value of fc-nova-computeXXX is that of host_id obtained from the audit report.

    uuid: The value of uuid can be obtained from the audit report.

    If the command output displays "This VM is fake.", the VM is an invalid VM. If the command output displays "This VM is not fake.", the VM is not an invalid VM. If the command output contains ERROR, contact technical support for assistance.

    • If yes, go to 5.
    • If no, the VM is not an invalid VM, and the fault may be falsely reported due to time differences. In this case, no action is required.

  5. On the controller node, run the following command to check whether the last operation performed on the VM is the rebuilding operation:

    nova instance-action-list uuid

    NOTE:

    uuid: The value of uuid can be obtained from the audit report.

    Check whether the value of Action in the last row of the command output is rebuild.

    • If yes, go to 6.
    • If no, go to 7.

  6. Confirm with the tenant whether to rebuild the VM.

    • If yes, perform the operations provided in Rebuilding a VM. In this case, no further action is required.
    • If no, go to 7.

  7. Confirm with the tenant whether to delete the invalid VM.

    • If yes, go to 8.
    • If no, contact technical support for assistance.

  8. Run the following command to delete the VM:

    python /usr/local/bin/info-collect-server/server/audit_script.py cascaded_vm_fake_clean fc-nova-computeXXX uuid

    NOTE:

    fc-nova-computeXXX: The value of fc-nova-computeXXX is that of host_id obtained from the audit report.

    uuid: The value of uuid can be obtained from the audit report.

    If the command output displays "SUCCESS: Clean vm information succeed.", the VM is successfully deleted.

    Check whether the VM is successfully deleted.

    • If yes, no further action is required.
    • If no, contact technical support for assistance.

VM Location Inconsistency

Context

The host and hypervisor accommodating a VM recorded in the database are inconsistent with the actual host and hypervisor.

If the fault is confirmed, correct the actual VM location information (host ID) in the database.

Parameter Description

The name of the audit report is host_changed_vm.csv. Table 18-88 describes parameters in the report.

Table 18-88 Parameter description

Parameter

Description

uuid

Specifies the VM UUID.

tenant_id

Specifies the tenant ID.

hyper_vm_name

Specifies the VM name registered in the hypervisor.

updated_at

Specifies the last time when the VM status was updated.

status

Specifies the VM status.

task_status

Specifies the VM task status.

host_id

Specifies the ID of the host accommodating a VM recorded in the database.

If the compute node is deployed in active/standby mode, the value of host_id is the logical ID of the compute node.

hyper_host_id

Specifies the ID of the actual host accommodating the VM.

If the compute node is deployed in active/standby mode, the value of hyper_host_id is the logical ID of the compute node.

hypervisor_hostname

Specifies the hypervisor name of the VM recorded in the database.

hyper_hypervisor_hostname

Specifies the name of the hypervisor in which the VM is running.

Possible Causes

  • A database is backed up for future restoration. However, after the backup is created, the VM specifications are adjusted or one or more VMs are cold migrated. After the database is restored, the location records of these VMs in the database are inconsistent with the actual VM locations.
  • Users manually migrate a VM to another cluster in FusionCompute, resulting in inconsistent VM location information.

Impact on the System

The VM becomes unavailable if the VM location recorded in the database is inconsistent with the host accommodating the VM.

Procedure

  1. Log in to any controller host in an AZ. For details, see Using SSH to Log In to a Host.
  2. Import environment variables. For details, see Importing Environment Variables.
  3. Log in to any info-collect-server-assigned host using the OM plane IP address.

    For details, see Logging In to a Host with a Role Deployed.

    NOTE:

    Run the following command to obtain the info-collect-server-assigned host name:

    cps template-instance-list --service collect info-collect-server

    If the host you have logged in to is the one for which the OM plane IP address is to be obtained, proceed to the next operation.

  4. Run the following command to check whether the VM is in a cluster connected to fc-nova-computeXXX:

    python /usr/local/bin/info-collect-server/server/audit_script.py cascaded_vm_host_changed_confirm fc-nova-computeXXX uuid

    NOTE:

    uuid: The value of uuid can be obtained from the audit report.

    fc-nova-computeXXX: The value of fc-nova-computeXXX is that of hyper_host_id obtained from the audit report.

    If "The VM is in cluster of fc-nova-computeXXX." is displayed in the command output, the VM is in a cluster connected to fc-nova-computeXXX.

    • If yes, go to 5.
    • If no, contact technical support for assistance.

  5. Log in to the host accommodating the active GaussDB service and run the following command to modify the information about the host accommodating the VM recorded in the database:

    sh /usr/bin/info-collect-script/audit_resume/host_changed_handle_without_hyper_name.sh uuid hyper_host_id

    For details, see Logging In to the Active GaussDB Node.

    NOTE:

    The password of the gaussdba account is required during the command execution process. The default password of user gaussdba is FusionSphere123.

    uuid: The value of uuid can be obtained from the audit report.

    hyper_host_id: The value of hyper_host_id is that in the audit report.

    Check whether the command is successfully executed based on the command output.

    • If yes, go to 6.
    • If no, contact technical support for assistance.

  6. Run the following command to modify the name of the hypervisor accommodating the VM recorded in the database:

    sh /usr/bin/info-collect-script/audit_resume/host_changed_handle_hypervisor_name.sh uuid hyper_hypervisor_hostname

    NOTE:

    The password of the gaussdba account is required during the command execution process. The default password of user gaussdba is FusionSphere123.

    uuid: The value of uuid can be obtained from the audit report.

    hyper_hypervisor_hostname: The value of hyper_hypervisor_hostname is that in the audit report.

    Check whether the command is successfully executed based on the command output.

    • If yes, no further action is required.
    • If no, contact technical support for assistance.

Stuck VMs

Context

A stuck VM is one that has been kept in a transition state for more than 24 hours and cannot automatically recover because a system exception (for example, a FusionCompute exception) occurred during a VM service operation (for example, starting a VM).

Manually restore the VM based on the VM status and the task status.

Parameter Description

The name of the audit report is stucking_vm.csv. Table 18-89 describes parameters in the report.

Table 18-89 Parameter description

Parameter

Description

uuid

Specifies the VM UUID.

tenant_id

Specifies the tenant ID.

hyper_vm_name

Specifies the VM name registered in the hypervisor.

updated_at

Specifies the last time when the VM status was updated.

status

Specifies the VM status.

task_status

Specifies the VM task status.

host_id

Specifies the ID of the host accommodating the VM.

If the compute node is deployed in active/standby mode, the value of host_id is the logical ID of the compute node.

Possible Causes

A system exception occurs when a VM service operation is in process.

Impact on the System

The VM becomes unavailable and occupies system resources.

Procedure

Restore the VM based on the VM statuses and task statuses listed in the following table. For other situations, contact technical support for assistance.

Table 18-90 VM restoration methods

VM Status

Task Status

Possible Scenario

Restoration Method

building

scheduling

Creating a VM

For details, see Method 1.

building

None

Creating a VM

For details, see Method 1.

building

block_device_mapping

Creating a VM

For details, see Method 1.

building

networking

Creating a VM

For details, see Method 1.

N/A

rebooting

Restarting a VM

For details, see Method 2.

N/A

rebooting_hard

Restarting a VM

For details, see Method 2.

N/A

pausing

Pausing a VM

For details, see Method 2.

N/A

unpausing

Unpausing a VM

For details, see Method 2.

N/A

suspending

Suspending a VM

For details, see Method 2.

N/A

resuming

Resuming a VM

For details, see Method 2.

N/A

powering_off

Stopping a VM

For details, see Method 2.

N/A

powering_on

Starting a VM

For details, see Method 2.

N/A

migrating

Live migrating a VM

For details, see Method 2.

N/A

deleting

Deleting a VM

For details, see Method 2.

N/A

resize_prep

Modifying the VM attributes

For details, see Method 3.

Method 1

  1. Confirm with the tenant whether to delete the VM.

    • If yes, go to the next step.
    • If no, contact technical support for assistance.

  2. Set the VM status in the FusionSphere OpenStack system to error based on Setting the VM Status.
  3. Run the following command to delete the VM:

    nova delete uuid

    NOTE:

    uuid: The value of uuid can be obtained from the audit report.
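In a standard OpenStack environment, steps 2 and 3 typically map to the following nova client commands (nova reset-state sets a VM to the error state by default); the exact procedure for this system is described in Setting the VM Status, and uuid is the value obtained from the audit report:

    nova reset-state uuid
    nova delete uuid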

Method 2

  1. Log in to any controller host in an AZ. For details, see Using SSH to Log In to a Host.
  2. Import environment variables. For details, see Importing Environment Variables.
  3. Log in to any info-collect-server-assigned host using the OM plane IP address.

    For details, see Logging In to a Host with a Role Deployed.

    NOTE:

    Run the following command to obtain the info-collect-server-assigned host name:

    cps template-instance-list --service collect info-collect-server

    If the host you have logged in to is the one for which the OM plane IP address is to be obtained, proceed to the next operation.

  4. Run the following command to set the VM status in the FusionSphere OpenStack system to ensure that the VM status is consistent with that in the FusionCompute system:

    python /usr/local/bin/info-collect-server/server/audit_script.py cascaded_vm_status_reset fc-nova-computeXXX uuid

    NOTE:

    fc-nova-computeXXX: The value of fc-nova-computeXXX is that of host_id obtained from the audit report.

    uuid: The value of uuid can be obtained from the audit report.

    If "SUCCESS: This vm's status is successfully reset." is displayed in the command output, the VM status is successfully set.

    Check whether the VM status is set successfully.

    • If yes, no further action is required.
    • If no, contact technical support for assistance.

Method 3

  1. Log in to any controller host in an AZ.

    For details, see Using SSH to Log In to a Host.

  2. Import environment variables. For details, see Importing Environment Variables.
  3. Log in to any info-collect-server-assigned host using the OM plane IP address.

    For details, see Logging In to a Host with a Role Deployed.

    NOTE:

    Run the following command to obtain the info-collect-server-assigned host name:

    cps template-instance-list --service collect info-collect-server

    If the host you have logged in to is the one for which the OM plane IP address is to be obtained, proceed to the next operation.

  4. Run the following command to check whether VM specifications obtained from the FusionSphere OpenStack system and those obtained from the FusionCompute system are consistent:

    python /usr/local/bin/info-collect-server/server/audit_script.py cascaded_vm_flavor_confirm fc-nova-computeXXX uuid

    NOTE:

    fc-nova-computeXXX: The value of fc-nova-computeXXX is that of host_id obtained from the audit report.

    uuid: The value of uuid can be obtained from the audit report.

    If "The flavor of VM is same between Openstack and FusionCompute." is displayed in the command output, the VM specifications in the FusionSphere OpenStack system are consistent with those in the FusionCompute system.

    • If yes, go to 5.
    • If no, log in to the FusionCompute web client, click Hardware, and query details of the VM CPU, memory, and CPU QoS using the VM UUID based on Querying Information About a VM in FusionCompute. Then modify the VM specifications on the FusionCompute web client based on the command output to ensure that the specifications are consistent with those in the FusionSphere OpenStack system. Then go to 5.

  5. Update the VM status in the FusionSphere OpenStack system based on Method 2.

VM Status Inconsistency

Context

The VM status recorded in the database is inconsistent with the VM status in FusionCompute.

Parameter Description

The name of the audit report for the VM status inconsistency is diff_state_vm.csv. Table 18-91 describes parameters in the report.

Table 18-91 Parameter description

Parameter

Description

uuid

Specifies the VM UUID.

tenant_id

Specifies the tenant ID.

hyper_vm_name

Specifies the VM name registered in the hypervisor.

updated_at

Specifies the last time when the VM status was updated.

status

Specifies the VM status.

task_status

Specifies the VM task status.

power_status

Specifies the VM power supply status.

host_id

Specifies the ID of the host accommodating a VM recorded in the database.

If the compute node is deployed in active/standby mode, the value of host_id is the logical ID of the compute node.

hyper_status

Specifies the VM status in FusionCompute.

Possible Causes

  • A database is backed up for future restoration. However, after the backup is created, the VM is started or stopped. When the database is restored using the backup, the status record of the VM in the database may be inconsistent with the actual VM status.
  • If an exception occurs in the FusionCompute system or management network, services in the FusionSphere OpenStack system are interrupted or fail, resulting in inconsistent VM status.
  • Other unknown errors result in inconsistent VM status.

Impact on the System

  • System data is inconsistent.
  • Tenants' operation rights on the VM are restricted.

Procedure

Handle the fault based on the processing methods applied to different VM statuses and scenarios listed in the following table. For other situations, contact technical support for assistance.

Table 18-92 Processing methods

OpenStack VM Status

FusionCompute VM Status

Possible Scenario

Processing Method

error

running

FusionCompute error during VM creation

Restoring or deleting VMs when the FusionCompute system fails

For details, see Method 1.

error

stopped

FusionCompute error during VM creation

FusionCompute error during VM adjustment

For details, see Method 1.

error

hibernated

Backing up the management data for further restoration

For details, see Method 2.

error

paused

Backing up the management data for further restoration

For details, see Method 2.

active

hibernated

FusionCompute error during the VM suspending (hibernating) process

See Method 2.

active

paused

Backing up the management data for further restoration

For details, see Method 2.

suspended

running

FusionCompute error when a suspended VM is restored

Restoring a suspended VM on the FusionCompute web client

For details, see Method 2.

suspended

stopped

Stopping a suspended VM on the FusionCompute web client

For details, see Method 2.

suspended

paused

Backing up the management data for further restoration

For details, see Method 2.

paused

running

FusionCompute error when a paused VM is restored

Restoring a paused VM on the FusionCompute web client

For details, see Method 2.

paused

stopped

Stopping a paused VM on the FusionCompute web client

See Method 2.

paused

hibernated

Stopping, starting, and hibernating a paused VM on the FusionCompute web client

See Method 2.

-

unknown

FusionCompute system error

Contact technical support for assistance.

NOTE:

If VM service operations are performed or the VM statuses in the FusionSphere OpenStack and FusionCompute systems are automatically restored during the VM audit, the audit report may be inaccurate due to timing differences. Therefore, check whether the VM status recorded in the database is consistent with the actual VM status. If the statuses are consistent, no further action is required.

Method 1

  1. Check whether the VM has been properly loaded or started based on the operations provided in section Querying Whether a VM Was Properly Started.

    NOTE:

    If OS-SRV-USG:launched_at is left blank, the VM became abnormal during the creation process.

    • If yes, go to 2.
    • If no, ask the tenant to delete the VM. If the tenant fails to delete the VM, contact technical support for assistance.

  2. Confirm with the tenant whether the VM is to be used.

    • If yes, go to 3.
    • If no, ask the tenant to delete the VM. If the tenant fails to delete the VM, contact technical support for assistance.

  3. Log in to any controller host in an AZ.

    For details, see Using SSH to Log In to a Host.

  4. Import environment variables. For details, see Importing Environment Variables.
  5. Run the following command to query the name of the host accommodating the VM:

    nova show uuid
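
    Command example, using a hypothetical VM UUID taken from the audit report (appending grep displays only the host field):

    nova show 2d9f8c3a-4b1e-4c6d-9a7f-1e2b3c4d5e6f | grep OS-EXT-SRV-ATTR:host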

    NOTE:

    uuid: The value of uuid can be obtained from the audit report.

    The host name is the value of OS-EXT-SRV-ATTR:host in the command output.

  6. Log in to any info-collect-server-assigned host using the OM plane IP address.

    For details, see Logging In to a Host with a Role Deployed.

    NOTE:

    Run the following command to obtain the info-collect-server-assigned host name:

    cps template-instance-list --service collect info-collect-server

    If the host you have logged in to is the one for which the OM plane IP address is to be obtained, proceed to the next operation.

  7. Run the following command to check whether VM specifications obtained from the FusionSphere OpenStack system and those obtained from the FusionCompute system are consistent:

    python /usr/local/bin/info-collect-server/server/audit_script.py cascaded_vm_flavor_confirm fc-nova-computeXXX uuid
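
    Command example, assuming a hypothetical host name fc-nova-compute001 obtained in 5 and a hypothetical VM UUID from the audit report:

    python /usr/local/bin/info-collect-server/server/audit_script.py cascaded_vm_flavor_confirm fc-nova-compute001 2d9f8c3a-4b1e-4c6d-9a7f-1e2b3c4d5e6f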

    NOTE:

    fc-nova-computeXXX: The value of fc-nova-computeXXX can be obtained from 5.

    uuid: The value of uuid can be obtained from the audit report.

    If "The flavor of VM is same between Openstack and FusionCompute." is displayed in the command output, the VM specifications in the FusionSphere OpenStack system are consistent with those in the FusionCompute system.

    • If yes, go to 8.
    • If no, click Hardware, query the VM CPU, memory, and CPU QoS information using the VM UUID based on Querying Information About a VM in FusionCompute, and modify the VM specifications on the FusionCompute web client based on the command output to ensure that they are consistent with those in the FusionSphere OpenStack system. Then go to 8.

  8. Update the VM status in the FusionSphere OpenStack system based on Method 2.

Method 2

  1. Log in to any controller host in an AZ.

    For details, see Using SSH to Log In to a Host.

  2. Import environment variables. For details, see Importing Environment Variables.
  3. Run the following command to query the name of the host accommodating the VM:

    nova show uuid

    NOTE:

    uuid: The value of uuid can be obtained from the audit report.

    The host name is the value of OS-EXT-SRV-ATTR:host in the command output.

  4. Log in to any info-collect-server-assigned host using the OM plane IP address.

    For details, see Logging In to a Host with a Role Deployed.

    NOTE:

    Run the following command to obtain the info-collect-server-assigned host name:

    cps template-instance-list --service collect info-collect-server

    If the host you have logged in to is the one for which the OM plane IP address is to be obtained, proceed to the next operation.

  5. Run the following command to set the VM status in the FusionSphere OpenStack system to ensure that the VM status is consistent with that in the FusionCompute system:

    python /usr/local/bin/info-collect-server/server/audit_script.py cascaded_vm_status_reset fc-nova-computeXXX uuid
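
    Command example, assuming a hypothetical host name fc-nova-compute001 obtained in 3 and a hypothetical VM UUID from the audit report:

    python /usr/local/bin/info-collect-server/server/audit_script.py cascaded_vm_status_reset fc-nova-compute001 2d9f8c3a-4b1e-4c6d-9a7f-1e2b3c4d5e6f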

    NOTE:

    fc-nova-computeXXX: The value of fc-nova-computeXXX can be obtained from 3.

    uuid: The value of uuid can be obtained from the audit report.

    If "SUCCESS: This vm's status is successfully reset." is displayed in the command output, the VM status is successfully set.

    Check whether the VM status is successfully set.

    • If yes, no further action is required.
    • If no, contact technical support for assistance.

VM Attribute Inconsistency

Context

The VM attributes recorded in the database are inconsistent with those recorded in FusionCompute.

This version supports consistency auditing of the following attributes:

  • VM boot device
  • VM NIC
  • VM ID

Parameter Description

The name of the audit report for the VM attribute inconsistency is diff_property_vm.csv. Table 18-93 describes parameters in the report.

Table 18-93 Parameter description

Parameter

Description

uuid

Specifies the VM UUID.

tenant_id

Specifies the tenant ID.

hyper_vm_name

Specifies the VM name registered in the hypervisor.

updated_at

Specifies the last time when the VM status was updated.

status

Specifies the VM status.

task_status

Specifies the VM task status.

host_id

Specifies the ID of the host accommodating a VM recorded in the database.

If the compute node is deployed in active/standby mode, the value of host_id is the logical ID of the compute node.

property_name

Specifies the VM attributes, including the following:

bootDev: Boot devices of a VM, which include:

  • hd: indicates that the VM boots from hard disks.
  • network: indicates that the VM boots from the network.

nic: VM NICs (MAC addresses separated by slashes)

internal_id: the VM ID generated by FusionCompute when the VM is created

property

Specifies the VM attribute values in the FusionSphere OpenStack system.

hyper_property

Specifies the VM attributes in FusionCompute.

Possible Causes

  • A database is backed up for future restoration. However, after the backup is created, the VM boot device is changed or NICs are added or deleted. When the database is restored using the backup, the attribute records of the VM in the database are rebuilt and therefore inconsistent with the actual VM attributes.
  • A FusionCompute data backup is created for future restoration. However, after the backup is created, the VM is rebuilt. When the database is restored using the backup, the VM ID in the database is inconsistent with the actual VM ID.
  • The VM boot device or NIC data recorded in the database is inconsistent with the actual one due to the system fault (FusionCompute exceptions) in the service process.

Impact on the System

  • System data is inconsistent.
  • The VM becomes unavailable.

Procedure

Handle the fault according to different attribute names recorded in the audit report.

  • If the attribute name is the boot device (bootDev) of a VM, handle the fault according to the operations provided in VM Boot Device Processing.
  • If the attribute name is VM NICs (nic), handle the fault according to the operations provided in VM NIC Processing.
  • If the attribute name is the VM ID (internal_id), handle the fault according to the operations provided in VM ID Processing.

VM Boot Device Processing

  1. Log in to any controller host in an AZ.

    For details, see Using SSH to Log In to a Host.

  2. Import environment variables. For details, see Importing Environment Variables.
  3. Run the following command to query the name of the host accommodating the VM:

    nova show uuid

    NOTE:

    uuid: The value of uuid can be obtained from the audit report.

    The host name is the value of OS-EXT-SRV-ATTR:host in the command output.

  4. Log in to any info-collect-server-assigned host using the OM plane IP address.

    For details, see Logging In to a Host with a Role Deployed.

    NOTE:

    Run the following command to obtain the info-collect-server-assigned host name:

    cps template-instance-list --service collect info-collect-server

    If the host you have logged in to is the one for which the OM plane IP address is to be obtained, proceed to the next operation.

  5. Run the following command to set the VM boot mode in the FusionCompute system to ensure that the VM boot mode is consistent with that in the FusionSphere OpenStack system:

    python /usr/local/bin/info-collect-server/server/audit_script.py cascaded_vm_boot_dev_reset fc-nova-computeXXX uuid
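
    Command example, assuming a hypothetical host name fc-nova-compute001 obtained in 3 and a hypothetical VM UUID from the audit report:

    python /usr/local/bin/info-collect-server/server/audit_script.py cascaded_vm_boot_dev_reset fc-nova-compute001 2d9f8c3a-4b1e-4c6d-9a7f-1e2b3c4d5e6f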

    NOTE:

    fc-nova-computeXXX: The value of fc-nova-computeXXX can be obtained from 3.

    uuid: The value of uuid can be obtained from the audit report.

    If "SUCCESS: Reset vm boot dev success." is displayed in the command output, the VM boot mode is successfully set.

    Check whether the VM boot mode is successfully set.

    • If yes, no further action is required.
    • If no, contact technical support for assistance.

VM NIC Processing

  1. Query the audit report to check whether a NIC exists in FusionSphere OpenStack or FusionCompute.

    • If the property attribute is configured, go to 2.
    • If the hyper_property attribute is configured, go to 6.
    NOTE:

    The audit report records different MAC addresses for a NIC of a VM in FusionSphere OpenStack and FusionCompute.

    If the property attribute is configured, the NIC exists only in the FusionSphere OpenStack system.

    If the hyper_property attribute is configured, the NIC exists only in the FusionCompute system.

    If different information about multiple NICs is recorded in FusionSphere OpenStack and FusionCompute, handle the fault based on the system in which the NICs are deployed.

  2. Log in to any controller host in an AZ.

    For details, see Using SSH to Log In to a Host.

  3. Import environment variables. For details, see Importing Environment Variables.
  4. Run the following command to query the VM port information and make a note of the port ID corresponding to the MAC address in the property attribute:

    neutron port-list --device_id uuid

    NOTE:

    uuid: The value of uuid can be obtained from the audit report.

  5. Run the following command to delete residual data:

    neutron port-delete port_id

    NOTE:

    port_id: If there are multiple port IDs corresponding to the MAC address obtained in 4, repeat 5 to delete all of them.
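
    The following is an example of 4 and 5, using a hypothetical VM UUID from the audit report and a hypothetical port ID taken from the port list output:

    neutron port-list --device_id 2d9f8c3a-4b1e-4c6d-9a7f-1e2b3c4d5e6f

    neutron port-delete 7a1b2c3d-8e9f-4a5b-9c0d-1e2f3a4b5c6d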

  6. Confirm with the tenant whether the NIC that resides in the FusionCompute system is still in use in a VM.

    • If yes, no further action is required.
    • If no, go to 7.

  7. Log in to the FusionCompute web client to query the VM information using the VM UUID and switch to the VM details page. For details, see Querying Information About a VM in FusionCompute.

    NOTE:

    uuid: The value of uuid can be obtained from the audit report.

  8. On the VM details page, choose Hardware > NIC and locate the NIC to be deleted.

    NOTE:

    You can determine the MAC address of the NIC based on hyper_property in the audit report and delete the NIC with the MAC address.

  9. Click More and select Delete from the drop-down list.

VM ID Processing

  1. Check whether the VM ID is left blank in the property attribute.

    • If yes, go to 2.
    • If no, go to 5.

  2. Log in to any controller host in an AZ. For details, see Using SSH to Log In to a Host.
  3. Import environment variables. For details, see Importing Environment Variables.
  4. Run the following command to check whether the VM is deleted:

    nova instance-action-list uuid
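
    Command example, using a hypothetical VM UUID from the audit report:

    nova instance-action-list 2d9f8c3a-4b1e-4c6d-9a7f-1e2b3c4d5e6f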

    NOTE:

    uuid: The value of uuid can be obtained from the audit report.

    Check whether the value of Action in the last row of the command output is delete.

    • If yes, go to 9.
    • If no, contact technical support for assistance.

  5. Log in to the alarm page of the FusionCompute web client based on Querying Information About a VM in FusionCompute, and check whether the alarm Uncontrolled VMs Detected is displayed.

    • If yes, go to 6.
    • If no, go to 8.

  6. Query the alarm object URN and obtain the ID of the uncontrolled VM.

    NOTE:

    For example, if the alarm object URN is urn:sites:3B5E0684:vms:i-00000014, the ID of the uncontrolled VM is i-00000014.

  7. Check whether the VM ID recorded in the property attribute is consistent with that obtained from 6.

    • If yes, clear the alarm according to the FusionCompute alarm information, and then go to 8.
    • If no, go to 8.

  8. Rebuild the VM based on the operations provided in Rebuilding a VM. After this operation is performed, no further action is required.
  9. Run the following command to delete the VM:

    nova delete uuid
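
    Command example, using a hypothetical VM UUID from the audit report:

    nova delete 2d9f8c3a-4b1e-4c6d-9a7f-1e2b3c4d5e6f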

    NOTE:

    uuid: The value of uuid can be obtained from the audit report.

  10. Check whether the VM is successfully deleted.

    • If yes, no further action is required.
    • If no, contact technical support for assistance.

Stuck Images

Context

An image in the active state is available for use. If an image is stuck in the queued or saving state, the image is unavailable. If an image is kept stuck in a transition state for a long time (24 hours by default), delete the image.

Description

The name of the audit report is stucking_images.csv. Table 18-94 describes parameters in the report.

Table 18-94 Parameter description

Parameter

Description

id

Specifies the image ID.

status

Specifies the image status.

updated_at

Specifies the last time when the image was updated.

owner

Specifies the ID of the tenant who created the image.

Impact on the System

  • An image in the queued state does not occupy system resources, but the image is unavailable.
  • An image in the saving state has residual image files that occupy the storage space.

Possible Causes

  • The image creation process is not complete: The image was not uploaded to the image server within 24 hours after it was created. In this case, the image is kept in the queued state.
  • During the image creation process, an exception (for example, intermittent network disconnection) occurred when the image was being uploaded. In this case, the image is kept in the queued state.
  • When an image was being uploaded, the Glance service failed. In this case, the image is kept in the saving state.

Procedure

Delete the image that is kept stuck in the queued or saving state and create another one.

  1. Log in to any controller host in an AZ.

    For details, see Using SSH to Log In to a Host.

  2. Import environment variables. For details, see Importing Environment Variables.
  3. Run the following command to delete the image and check whether the command is successfully executed:

    glance image-delete Image ID
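
    Command example, using a hypothetical image ID from the audit report:

    glance image-delete 9c0d1e2f-3a4b-5c6d-7e8f-9a0b1c2d3e4f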

    NOTE:

    The Image ID value is the value of id in the audit report.

    You can also contact the tenant and ask the tenant to delete the image.

    • If yes, no further action is required.
    • If no, contact technical support for assistance.

Orphan Volumes

Context

An orphan volume is a volume that exists on a VRM node or a storage device but is not recorded in the Cinder database, or a volume whose status is error in the Cinder database but that is not bound to a VM on a VRM node or a storage device.

If the management data is lost due to backup-based system restoration and tenants need to restore the volume data, contact technical support for assistance. Then delete the volume whose status is error in the Cinder database.

Parameter Description

The name of the audit report is wildVolumeAudit.csv. Table 18-95 describes parameters in the report.

Table 18-95 Parameter description

Parameter

Description

volume_name

Specifies the unique volume ID on the VRM node, prefixed with volume-, or the volume name on the storage device.

volume_type

Specifies the volume type, such as san, dsware, vrm, or v3.

Impact on the System

An orphan volume is unavailable in the Cinder service but occupies the storage space.

Possible Causes

  • A database is backed up for future restoration. However, after the backup is created, one or more volumes are created. When the database is restored using the backup, the records of these volumes are deleted from the database, but the volumes still reside on the storage devices or the VRM node.
  • Volumes on the VRM node or the storage device are not created using the Cinder service.
  • The volume status is faulty in the Cinder database but normal on the VRM node or the storage device.
NOTE:

When you design system deployment for a site, do not create volumes through the VRM interface. Otherwise, false audit reports may be generated.

Procedure

Determine the methods for handling the volume based on Table 18-96.

Table 18-96 Determining the method for handling the volume based on the volume type

volume_type

Restoration Method

vrm

For details, see Method 1.

san, dsware, or v3

For details, see Method 2.

Method 1

  1. Open the audit report and query the volume information.
  2. Log in to any controller host in an AZ.

    For details, see Using SSH to Log In to a Host.

  3. Import environment variables. For details, see Importing Environment Variables.
  4. Run the following command to check whether the volume exists in the Cinder service:

    cinder show Volume UUID

    NOTE:

    The Volume UUID value is the value of name in the parameter volume_name in the audit report. If "ERROR: No volume with a name or ID of 'XXX (Volume UUID)' exists" is displayed in the command output, the volume does not exist.

    • If yes, go to 5.
    • If no, go to step 9.

  5. Check whether the volume status is error.

    NOTE:

    The volume status can be obtained from the status attribute.

    • If yes, go to 6.
    • If no, contact technical support for assistance.

  6. Confirm with the tenant whether to delete the volume.

    • If yes, go to 7.
    • If no, contact technical support for assistance.

  7. Run the following command to delete the volume:

    cinder force-delete Volume UUID
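
    Command example, using a hypothetical volume UUID from the audit report:

    cinder force-delete 044e14af-9d11-4ee9-9b5a-0dcbcd5033aa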

    NOTE:

    The Volume UUID value is the value of name in the parameter volume_name in the audit report.

  8. Run the following command to check whether the volume exists:

    cinder show Volume UUID

    NOTE:

    The Volume UUID value is the value of name in the parameter volume_name in the audit report.

    • If yes, contact technical support for assistance.
    • If no, no further action is required.

  9. Confirm with the tenant whether to delete the orphan volume.

    • If yes, go to 10.
    • If no, contact technical support for assistance.

  10. Run the following command to obtain the OM plane IP address of the blockstorage-driver-vrmXXX-assigned host:

    cps host-list

    NOTE:

    To obtain the OM plane IP address, locate the host whose roles value contains blockstorage-driver-vrmXXX in the command output and take a note of the OM plane IP address.

  11. Log in to the blockstorage-driver-vrmXXX-assigned host based on Using SSH to Log In to a Host, and run the following command to query the volume details:

    python /usr/bin/info-collect-script/audit_resume/get_vrm_volume.py -qi Volume UUID -cf `echo "blockstorage-driver-vrmXXX" | awk -F '-' '{print "cinder-volume-"$3}'`
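
    In this command, the back-quoted expression converts the blockstorage-driver-vrmXXX role name into the corresponding cinder-volume-XXX name. Command example, assuming the hypothetical role blockstorage-driver-vrm001 obtained in 10 and a hypothetical volume UUID (the expression then evaluates to cinder-volume-vrm001):

    python /usr/bin/info-collect-script/audit_resume/get_vrm_volume.py -qi 044e14af-9d11-4ee9-9b5a-0dcbcd5033aa -cf cinder-volume-vrm001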

    NOTE:

    The Volume UUID value is the value of name in the parameter volume_name in the audit report.

    The roles value of blockstorage-driver-vrmXXX can be obtained from 10.

  12. Obtain the data store and volume UUID information based on the command output in 11.

    NOTE:

    The data store is the value of the volume_datastore_name field returned in the command output.

    The volume UUID is the value of the volume_uuid field returned in the command output.

  13. Query the volume details in the FusionCompute system using the volume UUID based on Querying Information About a Volume in FusionCompute and locate the volume based on the volume data store information obtained in 12.
  14. Check whether the value of Attach VM is Bound.

  15. Click More, select Delete from the More drop-down list, and check whether the volume is deleted.

    • If yes, no further action is required.
    • If no, contact technical support for assistance.

Method 2

  1. Log in to the first controller node in an AZ.

    For details, see Using SSH to Log In to a Host.

  2. Import environment variables to the host.

    For details, see Importing Environment Variables.

    When prompted, enter 1 to enable Keystone V3 authentication using the built-in DC administrator.

  3. Obtain the volume operation report using the operation replay function.

    For details, see Obtaining the Operation Report.

  4. Run the following command to query the mapping between the volume ID and the volume name and make a note of the volume ID:

    python /usr/bin/info-collect-script/audit_resume/storage_name_relations.py -ia Directory storing the audit report of the orphan volume -io Directory storing the operation log report -o Directory storing the execution result file -vt volume

    NOTE:

    Ensure that the audit report and the operation log report have been copied to the current node.

    python /usr/bin/info-collect-script/audit_resume/storage_name_relations.py -ia /var/log/audit/2014-09-23_070554/audit/wildVolumeAudit.csv -io /tmp/op_log/cinder-2014\:09\:22-00\:00\:00_unlimit.csv -o /tmp/result.csv -vt volume

    The command is successfully executed if the following information is displayed:

    Successful!

    Check whether the command is successfully executed.

    • If yes, go to 5.
    • If no, contact technical support for assistance.

  5. Run the following command to view the execution result file:

    cat Execution result file name

    Command example:

    cat /tmp/result.csv

    NOTE:

    /tmp/result.csv is a file storing execution results. If the file content is empty, the volume does not exist.

    Information similar to the following is displayed:

    volume_name,volume_id,tenants_id 
    volume-044e14af-9d11-4ee9-9b5a-0dcbcd5033aa,044e14af-9d11-4ee9-9b5a-0dcbcd5033aa,5c5e1c868a184035a84b3aaa61e32993 
    volume-18ff2024-07d1-427c-924d-dd8207f9af99,18ff2024-07d1-427c-924d-dd8207f9af99,5c5e1c868a184035a84b3aaa61e32993 
    volume-bcda8a1b-cb15-4bb8-8b55-0cb7c763a85a,bcda8a1b-cb15-4bb8-8b55-0cb7c763a85a,5c5e1c868a184035a84b3aaa61e32993

    Check whether the command output contains the orphan volume.

    • If yes, go to 6.
    • If no, contact technical support for assistance.

  6. Obtain the volume attributes. For details, see Querying Volume Attributes.
  7. Ask the tenant whether to restore the orphan volume.

    • If yes, go to 8.
    • If no, go to 9.

  8. Use the orphan volume to create another volume and restore the original data to the new volume.

    For details, see Restoring Volume Data.

    Check whether any exception occurs during the volume data restoration process.

    • If yes, contact technical support for assistance.
    • If no, perform 9 to delete the orphan volume.

  9. Delete the orphan volume by referring to Deleting an Orphan Volume.

Invalid Volumes

Context

An invalid volume is a volume that is recorded in the Cinder database but does not exist on a VRM node or the storage device. It can also be a volume whose status is error in the Cinder database but that is attached to a VM on the VRM node.

Delete the invalid volume from the Cinder database.

NOTE:

Rectify the fault in the orphan VM audit report before deleting an invalid volume. For details, see Orphan VMs.

Parameter Description

The name of the audit report is fakeVolumeAudit.csv. Table 18-97 describes parameters in the report.

Table 18-97 Parameter description

Parameter

Description

volume_id

Specifies the volume ID.

volume_displayname

Specifies the name of the volume created by a tenant.

volume_name

Specifies the unique volume ID on the VRM node, prefixed with volume-, or the volume name on the storage device.

volume_type

Specifies the volume type, such as san, dsware, vrm, or v3.

location

Specifies the volume location.

Impact on the System

The volume can be queried using the Cinder command but cannot be used.

Possible Causes

  • A database is backed up for future restoration. However, after the backup is created, one or more volumes are deleted. When the database is restored using the backup, records of these volumes reside in the database and become invalid volumes.
  • Volumes fail to be created from images, resulting in residual volume records in the database.

Procedure

Determine the method for handling the volume based on Table 18-98.

Table 18-98 Determining the method for handling the volume based on the volume type

volume_type

Restoration Method

vrm

For details, see Method 1.

san, dsware, or v3

For details, see Method 2.

Method 1

  1. Open the audit report and query the volume information.
  2. Log in to the blockstorage-driver-vrmXXX-assigned host based on Logging In to the blockstorage-driver-vrmXXX-Assigned Host to Which a Volume Is Attached and run the following command to query the volume details:

    python /usr/bin/info-collect-script/audit_resume/get_vrm_volume.py -qi uuid -cf `echo "blockstorage-driver-vrmXXX" | awk -F '-' '{print "cinder-volume-"$3}'`

    NOTE:

    The value of uuid is that of volume_id obtained from the audit report.

    The value of blockstorage-driver-vrmXXX is obtained from the data of blockstorage-driver-vrmXXX in Logging In to the blockstorage-driver-vrmXXX-Assigned Host to Which a Volume Is Attached.

    Check whether the command output contains "can not find this volume".

  3. Confirm with the tenant whether to delete the invalid volume.

    • If yes, go to 4.
    • If no, contact technical support for assistance.

  4. Run the following command to check whether the volume has any snapshots:

    cinder snapshot-list --all-t --volume-id Volume UUID

    NOTE:

    The Volume UUID value is the value of volume_id obtained from the audit report.

    If the command output is left blank, the volume does not have any snapshots.

    • If the volume has snapshots, go to 5.
    • If the volume does not have any snapshots, go to 6.

  5. Run the following command to delete all the snapshots displayed in the command output in 4:

    cinder snapshot-delete Volume snapshot UUID

    NOTE:

    The Volume snapshot UUID value can be obtained from the ID column in the command output in 4.
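
    The following is an example of 4 and 5, using a hypothetical volume UUID from the audit report and a hypothetical snapshot UUID taken from the snapshot list output:

    cinder snapshot-list --all-t --volume-id 044e14af-9d11-4ee9-9b5a-0dcbcd5033aa

    cinder snapshot-delete d57ecea2-5408-4976-b944-3b6d948c398b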

  6. Log in to the active GaussDB node based on Logging In to the Active GaussDB Node and run the following command to delete the volume:

    python /usr/bin/info-collect-script/audit_resume/delete_specify_volume.py Volume UUID
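
    Command example, using a hypothetical volume UUID from the audit report:

    python /usr/bin/info-collect-script/audit_resume/delete_specify_volume.py 044e14af-9d11-4ee9-9b5a-0dcbcd5033aa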

    NOTE:

    The Volume UUID value is the value of volume_id obtained from the audit report.

Method 2

  1. Log in to the first controller node in an AZ.

    For details, see Using SSH to Log In to a Host.

  2. Import environment variables to the host.

    For details, see Importing Environment Variables.

  3. Perform the following operations to check whether the volume exists in the Cinder service:

    1. Run the following command to enter the secure mode:

      runsafe

      Information similar to the following is displayed:

      Input command:
    2. Run the following command to check whether the volume exists in the Cinder service:

      cinder show Volume ID

      The volume ID is obtained from the audit report. Command example:

      cinder show 044e14af-9d11-4ee9-9b5a-0dcbcd5033aa

      If ERROR is displayed in the command output, the volume does not exist in the Cinder service.

      Check whether the command output contains ERROR.

      • If yes, contact technical support for assistance.
      • If no, go to 4.

  4. Perform the following operations to query the node list:

    1. Enter the secure mode.

      For details, see Command Execution Methods.

    2. Run the following command to query the management IP address of a controller node:

      cps host-list

      Information similar to the following is displayed:

      +--------------------------------------+-----------+----------------------+--------+------------+------+ 
      | id                                   | boardtype | roles                | status | manageip | omip | 
      +--------------------------------------+-----------+----------------------+--------+------------+------+ 
      | 778F416E-C3BB-11A0-8567-000000821800 | BC11BTSA  | auth,                | normal | 172.29.6.1 |      | 
      |                                      |           | blockstorage-driver, |        |            |      | 
      |                                      |           | compute,             |        |            |      | 
      |                                      |           | controller,          |        |            |      | 
      |                                      |           | image                |        |            |      | 
      | AE0CCD20-C1CF-1179-8567-000000821800 | BC11BTSA  | auth,                | normal | 172.29.0.2 |      | 
      |                                      |           | blockstorage-driver, |        |            |      | 
      |                                      |           | compute,             |        |            |      | 
      |                                      |           | controller,          |        |            |      | 
      |                                      |           | image,               |        |            |      | 
      |                                      |           | loadbalancer,        |        |            |      | 
      |                                      |           | router,              |        |            |      | 
      |                                      |           | sys-server           |        |            |      | 
      | 404ECF92-DBCF-11E4-8567-000000821800 | BC11BTSA  | auth,                | normal | 172.29.6.3 |      | 
      |                                      |           | blockstorage-driver, |        |            |      | 
      |                                      |           | compute,             |        |            |      | 
      |                                      |           | controller,          |        |            |      | 
      |                                      |           | image                |        |            |      | 
      +--------------------------------------+-----------+----------------------+--------+------------+------+

      The value of manageip indicates the management IP address.

  5. Run the following commands to log in to blockstorage-driver-assigned hosts one by one:

    su fsp

    ssh fsp@Management IP address

    Command example:

    ssh fsp@172.29.6.3

    Ensure that user fsp is used to establish the connection. After the login is successful, run the su - root command to switch to user root.

    • The default password of user fsp is Huawei@CLOUD8.
    • The default password of user root is Huawei@CLOUD8!.

  6. Run the following command to query the storage type:

    python /usr/bin/info-collect-script/audit_resume/get_host_storage_info.py

    Check the storage type based on the command output.

    NOTE:

    If the storage type displayed in the command output is inconsistent with the volume type, go to 5 to log in to another blockstorage-driver-assigned host and perform the subsequent operations.

    • Information similar to the following is displayed:
      storage_type=dsware 
      addition info is : 
                manage_ip=172.28.0.231 
                vbs_url=172.28.6.1,172.28.6.0,172.28.0.2

      The storage type is dsware. Go to 7. The value of manage_ip indicates the FusionStorage Manager node IP address, and the value of vbs_url indicates the compute node management IP addresses.

    • Information similar to the following is displayed:
      storage_type=san 
      addition info is : 
                ControllerIP0 is 192.168.172.40 
                ControllerIP1 is 192.168.172.41

      The storage type is san. Go to 8. The value of ControllerIP indicates the SAN storage device management IP address.

    NOTE:

    If the values of both ControllerIP0 and ControllerIP1 are x.x.x.x or 127.0.0.1, the storage type is v3. Go to 8.

  7. Run the following command to query the volume information:

    fsc_cli --ip Compute node management IP address --manage_ip FusionStorage Manager node IP address --port 10519 --op query_vol --volume Volume name on the storage device

    Command example:

    fsc_cli --ip 172.29.6.6 --manage_ip 172.29.0.231 --port 10519 --op query_vol --volume volume-6f2282f1-22b3-41f1-8b3f-d15aa9790388

    The volume exists on the storage device if information similar to the following is displayed:

    result=0 
    vol_name=volume-6f2282f1-22b3-41f1-8b3f-d15aa9790388,father_name=,status=0,vol_size=1024,real_size=-1,pool_id=0,create_time=2014-09-18 07:23:21

    Check whether the volume exists on the storage device attached to the host whose roles value is blockstorage-driver in the AZ.

    • If yes, contact technical support for assistance.
    • If no, go to 9.

  8. Log in to the OceanStor DeviceManager page of the IP SAN device, choose Storage Resources > LUN, search for volumes (choose Provisioning > LUN if the storage device belongs to V3 series), and check whether the volume exists on the storage device attached to the host whose role is blockstorage-driver in the AZ.

    • If yes, contact technical support for assistance.
    • If no, go to 9.

  9. Log in to each host that has the controller role assigned and run the following command to check whether the volume has any snapshots:

    cinder snapshot-list --all-t --volume-id Volume ID

    Check whether the volume has any snapshots based on the command output.

    • If yes, take note of the snapshot IDs, run the following command to delete the snapshots, and go to the next step:

      cinder snapshot-delete Snapshot ID

    • If no, go to the next step.

    Run the following command to delete the volume:

    python /usr/bin/info-collect-script/audit_resume/delete_specify_volume.py Volume ID

    The volume is successfully deleted if information similar to the following is displayed:

    INFO: delete success.

  10. Enter the secure operation mode (for details, see 3) and run the following command to query the volume status:

    cinder show Volume ID

    Check whether the volume still exists.

    • If yes, contact technical support for assistance.
    • If no, no further action is required.

Orphan Volume Snapshots

Context

An orphan volume snapshot is a snapshot that exists on the VRM node or the storage device but is not recorded in the Cinder database. It can also be a snapshot that exists in the Cinder database and on the VRM node but has been unavailable for more than 24 hours.

Delete the orphan volume snapshot from the storage device.

Parameter Description

The name of the audit report is wildSnapshotAudit.csv. Table 18-99 describes parameters in the report.

Table 18-99 Parameter description

Parameter

Description

snap_name

Specifies the volume snapshot UUID on the storage device.

snap_type

Specifies the snapshot type, such as san, dsware, vrm, or v3.

Impact on the System

An orphan volume snapshot occupies the storage space.

Possible Causes

A database is backed up for future restoration. However, after the backup is created, one or more volume snapshots are created. When the database is restored using the backup, records of these snapshots are deleted from the database, but these snapshots reside on their storage devices and become orphan volume snapshots. Alternatively, system errors occur during the service process.

Procedure

Determine the method for handling the snapshot based on Table 18-100.

Table 18-100 Determining the method for handling the snapshot based on the volume type

volume_type

Restoration Method

vrm

For details, see Method 1.

san, dsware, or v3

For details, see Method 2.

Method 1

  1. Open the audit report and view the snapshot information.
  2. Log in to any controller host in an AZ.

    For details, see Using SSH to Log In to a Host.

  3. Import environment variables. For details, see Importing Environment Variables.
  4. Run the following command to check whether the snapshot exists in the Cinder service:

    cinder snapshot-show Snapshot UUID

    NOTE:

    The Snapshot UUID value is the value of snap_name obtained from the audit report. If the message "ERROR: No snapshot with a name or ID of 'XXX(snapshot UUID)' exists." is displayed, the snapshot does not exist.

    • If yes, go to 9.
    • If no, go to 5.

  5. Run the following command to obtain the OM plane IP address of the blockstorage-driver-vrmXXX-assigned host:

    cps host-list

    NOTE:

    To obtain the OM plane IP address, locate the host whose roles value contains blockstorage-driver-vrmXXX in the command output and take a note of the OM plane IP address.

  6. Log in to the blockstorage-driver-vrmXXX-assigned host based on Using SSH to Log In to a Host, and run the following command to query the snapshot details:

    python /usr/bin/info-collect-script/audit_resume/get_vrm_snapshot.py -qi Snapshot UUID -cf `echo "blockstorage-driver-vrmXXX" | awk -F '-' '{print "cinder-volume-"$3}'`

    NOTE:

    The Snapshot UUID value is the value of snap_name obtained from the audit report.

    The roles information about blockstorage-driver-vrmXXX can be obtained from 5.

    Check whether the command output contains "no this snapshot".

    • If yes, contact technical support for assistance.
    • If no, go to 7.

  7. Confirm with the tenant whether to delete the snapshot.

    • If yes, go to 8.
    • If no, contact technical support for assistance.

  8. Log in to the blockstorage-driver-vrmXXX-assigned host based on Using SSH to Log In to a Host and run the following command to delete the snapshot:

    python /usr/bin/info-collect-script/audit_resume/delete_vrm_snapshot.py -di Snapshot UUID -cf `echo "blockstorage-driver-vrmXXX" | awk -F '-' '{print "cinder-volume-"$3}'`

    NOTE:

    The Snapshot UUID value is the value of snap_name obtained from the audit report.

    The roles information about blockstorage-driver-vrmXXX can be obtained from 5.

    No further action is required.

  9. Confirm with the tenant whether to delete the snapshot.

    • If yes, go to 10.
    • If no, contact technical support for assistance.

  10. Check whether the snapshot status is error.

    • If yes, go to 13.
    • If no, go to 11.

  11. Log in to the host whose roles value is controller and run the following command to change the snapshot status to error:

    cinder snapshot-reset-state --state error Snapshot UUID
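
    Command example, using a hypothetical snapshot UUID from the audit report:

    cinder snapshot-reset-state --state error d57ecea2-5408-4976-b944-3b6d948c398b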

    NOTE:

    The Snapshot UUID value is the value of snap_name obtained from the audit report.

  12. Run the following command to query the snapshot status:

    cinder snapshot-show Snapshot UUID

    NOTE:

    The Snapshot UUID value is the value of snap_name obtained from the audit report.

    Check whether the snapshot status is error.

    • If yes, go to 13.
    • If no, contact technical support for assistance.

  13. Log in to the host whose roles value is controller and run the following command to delete the snapshot:

    cinder snapshot-delete Snapshot UUID

    NOTE:

    The Snapshot UUID value is the value of snap_name obtained from the audit report.

Method 2

  1. Obtain the operation report for the volume snapshot.

    For details, see Obtaining the Operation Report.

    Check whether the operation report contains the data records meeting all the following conditions:

    The value of res_id is that of uuid in the audit report. 
    The value of res_type is snapshots. 
    time: specifies the time when the operation was performed. This time is within the period after the management data was backed up and before the data was restored. 
    action: the HTTP request method is POST, and the HTTP request URL is /v2/tenant_id/snapshots, where tenant_id is the tenant in the operation audit.

  2. Log in to the first controller node in an AZ.

    For details, see Using SSH to Log In to a Host.

  3. Import environment variables to the host.

    For details, see Importing Environment Variables.

  4. Run the following command to query the mapping between the snapshot ID and the snapshot name and make a note of the snapshot ID:

    python /usr/bin/info-collect-script/audit_resume/storage_name_relations.py -ia Directory storing the audit report for the orphan snapshot -io Directory storing the operation log report -o Directory storing the result file -vt snapshot

    NOTE:

    Ensure that the audit report and the operation log report have been copied to the current node.

    The following is an example:

    python /usr/bin/info-collect-script/audit_resume/storage_name_relations.py -ia /var/log/audit/2014-09-23_070554/audit/wildSnapshotAudit.csv -io /tmp/op_log/cinder-2014\:09\:22-00\:00\:00_unlimit.csv -o /tmp/result.csv -vt snapshot

    NOTE:

    /tmp/result.csv is a file storing execution results. If the file content is empty, the snapshot does not exist.

    The command is successfully executed if the following information is displayed:

    Successful!

    Check whether the command is successfully executed.

    • If yes, go to 5.
    • If no, contact technical support for assistance.

  5. Run the following command to view the execution result file:

    cat Execution result file name

    Command example:

    cat /tmp/result.csv

    Information similar to the following is displayed:

    snap_name,snap_id,tenants_id 
    _snapshot-d57ecea2-5408-4976-b944-3b6d948c398b,d57ecea2-5408-4976-b944-3b6d948c398b,5c5e1c868a184035a84b3aaa61e32993

    Check whether the command output contains the snapshot name.

    • If yes, go to 6.
    • If no, contact technical support for assistance.

  6. Run the following command to check whether the snapshot exists in the Cinder service:

    1. Run the following command to enter the secure mode:

      runsafe

      Information similar to the following is displayed:

      Input command:
    2. cinder snapshot-show Snapshot ID

      Command example:

      cinder snapshot-show 1cd5c6eb-e729-4773-b846-e9f1d3467c56

      If information similar to the following is displayed, the snapshot does not exist in the Cinder service:

      ERROR: No snapshot with a name or ID of '1cd5c6eb-e729-4773-b846-e9f1d3467c56' exists.

      Check whether the snapshot exists in the Cinder service.

      • If yes, contact technical support for assistance.
      • If no, go to 7.

  7. Perform the following operations to query the node list:

    1. Enter the secure mode.

      For details, see Command Execution Methods.

    2. Run the following command to query the management IP address of a controller node:

      cps host-list

      Information similar to the following is displayed:

      +--------------------------------------+-----------+----------------------+--------+------------+------+ 
      | id                                   | boardtype | roles                | status | manageip | omip | 
      +--------------------------------------+-----------+----------------------+--------+------------+------+ 
      | 778F416E-C3BB-11A0-8567-000000821800 | BC11BTSA  | auth,                | normal | 172.29.6.1 |      | 
      |                                      |           | blockstorage-driver, |        |            |      | 
      |                                      |           | compute,             |        |            |      | 
      |                                      |           | controller,          |        |            |      | 
      |                                      |           | image                |        |            |      | 
      | AE0CCD20-C1CF-1179-8567-000000821800 | BC11BTSA  | auth,                | normal | 172.29.0.2 |      | 
      |                                      |           | blockstorage-driver, |        |            |      | 
      |                                      |           | compute,             |        |            |      | 
      |                                      |           | controller,          |        |            |      | 
      |                                      |           | image,               |        |            |      | 
      |                                      |           | loadbalancer,        |        |            |      | 
      |                                      |           | router,              |        |            |      | 
      |                                      |           | sys-server           |        |            |      | 
      | 404ECF92-DBCF-11E4-8567-000000821800 | BC11BTSA  | auth,                | normal | 172.29.6.3 |      | 
      |                                      |           | blockstorage-driver, |        |            |      | 
      |                                      |           | compute,             |        |            |      | 
      |                                      |           | controller,          |        |            |      | 
      |                                      |           | image                |        |            |      | 
      +--------------------------------------+-----------+----------------------+--------+------------+------+

      The value of manageip indicates the management IP address.

  8. Run the following commands to log in to blockstorage-driver-assigned hosts one by one:

    su fsp

    ssh fsp@Management IP address

    Command example:

    ssh fsp@172.29.6.3

    Ensure that user fsp is used to establish the connection. After the login is successful, run the su - root command to switch to user root.

    • The default password of user fsp is Huawei@CLOUD8.
    • The default password of user root is Huawei@CLOUD8!.

  9. Run the following command to query the storage type:

    python /usr/bin/info-collect-script/audit_resume/get_host_storage_info.py

    Check the storage type based on the command output.

    NOTE:

    If the storage type displayed in the command output is inconsistent with the snapshot type, go to 8 to log in to another blockstorage-driver-assigned host and perform the subsequent operations.

    • Information similar to the following is displayed:
      storage_type=dsware 
      addition info is : 
                manage_ip=172.28.0.231 
                vbs_url=172.28.6.1,172.28.6.0,172.28.0.2

      The storage type is dsware. Go to 10. The value of manage_ip indicates the FusionStorage Manager node IP address, and the value of vbs_url indicates the compute node management IP addresses.

    • Information similar to the following is displayed:
      storage_type=san 
      addition info is : 
                ControllerIP0 is 192.168.172.40 
                ControllerIP1 is 192.168.172.41

      The storage type is san. Go to 11. The value of ControllerIP indicates the SAN storage device management IP address.

      NOTE:

      If the values of both ControllerIP0 and ControllerIP1 are 127.0.0.1 or x.x.x.x, the storage type is v3. Go to 11.

  10. Run the following command to query the snapshot information:

    fsc_cli --ip Compute node management IP address --manage_ip FusionStorage Manager node IP address --port 10519 --op query_snap --snap Snapshot name on the storage device

    Command example:

    fsc_cli --ip 172.29.6.6 --manage_ip 172.29.0.231 --port 10519 --op query_snap --snap snapshot-6f2282f1-22b3-41f1-8b3f-d15aa9790388

    The snapshot exists on the storage device if information similar to the following is displayed:

    result=0 
    snap_name=snapshot-6f2282f1-22b3-41f1-8b3f-d15aa9790388,father_name=,status=0,snap_size=1024,real_size=-1,pool_id=0,create_time=2014-09-18 07:23:21

    Check whether the snapshot exists on the storage device attached to the host whose role is blockstorage-driver.

    • If yes, go to 12.
    • If no, contact technical support for assistance.

  11. Log in to the OceanStor DeviceManager page of the IP SAN device, choose SAN Services > Snapshots (choose Data Protection > Snapshots if the storage device belongs to V3 series), search for snapshot names, and check whether the snapshot exists on the storage device attached to the host whose role is blockstorage-driver in the AZ.

    • If yes, go to 12.
    • If no, contact technical support for assistance.

  12. Obtain the operation report for the volume snapshot.

    For details, see Obtaining the Operation Report.

    Determine whether to delete the snapshot.

    • If yes, go to 13.
    • If no, no further action is required.

  13. Log in to the host to which the snapshot belongs.

    Ensure that user fsp is used to establish the connection. After the login is successful, run the su - root command to switch to user root.

    • The default password of user fsp is Huawei@CLOUD8.
    • The default password of user root is Huawei@CLOUD8!.

  14. Run the following command to delete the snapshot:

    • If the storage type is san, log in to the OceanStor DeviceManager system and delete the target snapshot.
      NOTE:

      Before deleting the snapshot, check its running status. If the snapshot is activated, click More, click Cancel in the drop-down menu, and then delete the snapshot.

    • If the storage type is dsware, run the following command to delete the snapshot:

      fsc_cli --ip Compute node management IP address --manage_ip FusionStorage Manager node IP address --port 10519 --op del_snap --snap Snapshot name on the storage device

      Command example:

      fsc_cli --ip 172.29.6.6 --manage_ip 172.29.0.231 --port 10519 --op del_snap --snap snapshot-6f2282f1-22b3-41f1-8b3f-d15aa9790388

    Check whether the snapshot is successfully deleted.

    • If yes, no further action is required.
    • If no, contact technical support for assistance.

Invalid Volume Snapshots

Context

An invalid volume snapshot is a snapshot that is recorded in the Cinder database but does not exist on the VRM node or the storage device.

Delete the invalid volume snapshot from the Cinder database.

Parameter Description

The name of the audit report is fakeSnapshotAudit.csv. Table 18-101 describes parameters in the report.

Table 18-101 Parameter description

Parameter

Description

snap_id

Specifies the snapshot ID.

snap_name

Specifies the snapshot UUID on the VRM node, prefixed with snapshot-, or the volume name on the storage device.

volume_id

Specifies the base volume ID.

snap_type

Specifies the snapshot type, such as san, dsware, vrm, or v3.

location

Specifies the snapshot location.

Impact on the System

The invalid snapshot can be queried using the Cinder command but is unavailable to the system.

Possible Causes

A database is backed up for future restoration. However, after the backup is created, one or more volume snapshots are deleted. When the database and storage devices are restored using the backup, records of these volume snapshots reside in the database and become invalid volume snapshots. Alternatively, system errors occur during the service process.

Procedure

Determine the method for handling the snapshot based on Table 18-102.

Table 18-102 Determining the method for handling the snapshot based on the volume type

volume_type

Restoration Method

vrm

For details, see Method 1.

san, dsware, or v3

For details, see Method 2.

Method 1

  1. Open the audit report and view the snapshot information.
  2. Log in to any controller host in an AZ.

    For details, see Using SSH to Log In to a Host.

  3. Import environment variables. For details, see Importing Environment Variables.
  4. Run the following command to check whether the snapshot exists in the Cinder service:

    cinder snapshot-show Snapshot UUID

    NOTE:

    The Snapshot UUID value is the value of snap_id obtained from the audit report. If the message "ERROR: No snapshot with a name or ID of 'XXX(snapshot UUID)' exists." is displayed, the snapshot does not exist.

    • If yes, go to 5.
    • If no, contact technical support for assistance.

  5. Run the following command to query the UUID of the volume to which the snapshot belongs:

    cinder snapshot-show Snapshot UUID | grep volume_id | awk -F '|' '{ print $3}'
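
    Command example, using a hypothetical snapshot UUID from the audit report; the awk expression extracts the third column (the volume_id value) from the command output:

    cinder snapshot-show d57ecea2-5408-4976-b944-3b6d948c398b | grep volume_id | awk -F '|' '{ print $3}'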

    NOTE:

    The Snapshot UUID value is the value of snap_id obtained from the audit report.

  6. Log in to the blockstorage-driver-vrmXXX-assigned host to which the volume obtained in 5 is attached based on Logging In to the blockstorage-driver-vrmXXX-Assigned Host to Which a Volume Is Attached, run the following command to query the snapshot information, and check whether the snapshot is available in FusionCompute:

    python /usr/bin/info-collect-script/audit_resume/get_vrm_snapshot.py -qi Snapshot UUID -cf `echo "blockstorage-driver-vrmXXX" | awk -F '-' '{print "cinder-volume-"$3}'`

    NOTE:

    The Snapshot UUID value is the value of snap_id obtained from the audit report.

    The value of blockstorage-driver-vrmXXX is obtained from the data of blockstorage-driver-vrmXXX in Logging In to the blockstorage-driver-vrmXXX-Assigned Host to Which a Volume Is Attached.

    Check whether the command output contains "no this snapshot".

    • If yes, go to 7.
    • If no, contact technical support for assistance.

  7. Log in to the host whose roles value is controller and run the following command to query the snapshot status:

    cinder snapshot-show Snapshot UUID

    NOTE:

    The Snapshot UUID value is the value of snap_id obtained from the audit report.

    Check whether the snapshot status is available or error.

    • If yes, go to 10.
    • If no, go to 8.

  8. Run the following command to set the snapshot status to error:

    cinder snapshot-reset-state --state error Snapshot UUID

    NOTE:

    The Snapshot UUID value is the value of snap_id obtained from the audit report.

  9. Run the following command to query the snapshot status:

    cinder snapshot-show Snapshot UUID

    NOTE:

    The Snapshot UUID value is the value of snap_id obtained from the audit report.

    Check whether the snapshot status is error.

    • If yes, go to 10.
    • If no, contact technical support for assistance.

  10. Run the following command to delete the snapshot:

    cinder snapshot-delete Snapshot UUID

    NOTE:

    The Snapshot UUID value is the value of snap_id obtained from the audit report.

Method 2

  1. Log in to the first controller node in an AZ.

    For details, see Using SSH to Log In to a Host.

  2. Import environment variables to the host.

    For details, see Importing Environment Variables.

  3. Run the following command to check whether the snapshot exists in the Cinder service:

    1. Run the following command to enter the secure mode:

      runsafe

      Information similar to the following is displayed:

      Input command:
    2. cinder snapshot-show Snapshot ID

      Command example:

      cinder snapshot-show 1cd5c6eb-e729-4773-b846-e9f1d3467c56

      If information similar to the following is displayed, the snapshot does not exist in the Cinder service:

      ERROR: No snapshot with a name or ID of '1cd5c6eb-e729-4773-b846-e9f1d3467c56' exists.

      Check whether the snapshot exists in the Cinder service.

      • If yes, go to 4.
      • If no, contact technical support for assistance.

  4. Perform the following operations to query the node list:

    1. Enter the secure mode.

      For details, see Command Execution Methods.

    2. Run the following command to query the management IP address of a controller node:

      cps host-list

      Information similar to the following is displayed:

      +--------------------------------------+-----------+----------------------+--------+------------+------+ 
      | id                                   | boardtype | roles                | status | manageip | omip | 
      +--------------------------------------+-----------+----------------------+--------+------------+------+ 
      | 778F416E-C3BB-11A0-8567-000000821800 | BC11BTSA  | auth,                | normal | 172.29.6.1 |      | 
      |                                      |           | blockstorage-driver, |        |            |      | 
      |                                      |           | compute,             |        |            |      | 
      |                                      |           | controller,          |        |            |      | 
      |                                      |           | image                |        |            |      | 
      | AE0CCD20-C1CF-1179-8567-000000821800 | BC11BTSA  | auth,                | normal | 172.29.0.2 |      | 
      |                                      |           | blockstorage-driver, |        |            |      | 
      |                                      |           | compute,             |        |            |      | 
      |                                      |           | controller,          |        |            |      | 
      |                                      |           | image,               |        |            |      | 
      |                                      |           | loadbalancer,        |        |            |      | 
      |                                      |           | router,              |        |            |      | 
      |                                      |           | sys-server           |        |            |      | 
      | 404ECF92-DBCF-11E4-8567-000000821800 | BC11BTSA  | auth,                | normal | 172.29.6.3 |      | 
      |                                      |           | blockstorage-driver, |        |            |      | 
      |                                      |           | compute,             |        |            |      | 
|                                      |           | controller,          |        |            |      | 
      |                                      |           | image                |        |            |      | 
      +--------------------------------------+-----------+----------------------+--------+------------+------+

      The value of manageip indicates the management IP address.

  5. Run the following commands to log in to blockstorage-driver-assigned hosts one by one:

    su fsp

    ssh fsp@Management IP address

    Command example:

    ssh fsp@172.29.6.3

    Ensure that user fsp is used to establish the connection. After the login is successful, run the su - root command to switch to user root.

    • The default password of user fsp is Huawei@CLOUD8.
    • The default password of user root is Huawei@CLOUD8!.

  6. Run the following command to query the storage type:

    python /usr/bin/info-collect-script/audit_resume/get_host_storage_info.py

    Check the storage type based on the command output.

    NOTE:

    If the storage type displayed in the command output is inconsistent with the snapshot type, go to 5 to log in to another blockstorage-driver-assigned host and perform the subsequent operations.

    • Information similar to the following is displayed:
      storage_type=dsware 
      addition info is : 
                manage_ip=172.28.0.231 
                vbs_url=172.28.6.1,172.28.6.0,172.28.0.2

      The storage type is dsware. Go to 7. The value of manage_ip indicates the FusionStorage Manager node IP address, and the value of vbs_url indicates the management IP addresses of the compute nodes.

    • Information similar to the following is displayed:
      storage_type=san 
      addition info is : 
                ControllerIP0 is 192.168.172.40 
                ControllerIP1 is 192.168.172.41

      The storage type is san. Go to 8. The value of ControllerIP indicates the SAN storage device management IP address.

      NOTE:

      If the values of both ControllerIP0 and ControllerIP1 are 127.0.0.1 or x.x.x.x, the storage type is v3. Go to 8.

  7. Run the following command to query the snapshot information:

    fsc_cli --ip Compute node management IP address --manage_ip FusionStorage Manager node IP address --port 10519 --op query_snap --snap Snapshot name on the storage device

    Command example:

    fsc_cli --ip 172.29.6.6 --manage_ip 172.29.0.231 --port 10519 --op query_snap --snap snapshot-6f2282f1-22b3-41f1-8b3f-d15aa9790388

    The snapshot exists on the storage device if information similar to the following is displayed:

    result=0 
    snap_name=snapshot-6f2282f1-22b3-41f1-8b3f-d15aa9790388,father_name=,status=0,snap_size=1024,real_size=-1,pool_id=0,create_time=2014-09-18 07:23:21

    Check whether the snapshot exists on the storage device attached to the host whose role is blockstorage-driver.

    • If yes, contact technical support for assistance.
    • If no, go to 9.

  8. Log in to the OceanStor DeviceManager page of the IP SAN device, choose SAN Services > Snapshots (choose Data Protection > Snapshots if the storage device belongs to V3 series), search for snap_name, and check whether the snapshot exists on the storage device attached to the host whose role is blockstorage-driver in the AZ.

    • If yes, contact technical support for assistance.
    • If no, go to 9.

  9. Obtain the operation report for the volume snapshot.

    For details, see Obtaining the Operation Report.

    Determine whether to delete the snapshot.

    • If yes, go to 10.
    • If no, no further action is required.

  10. Enter the secure mode (for details, see 3) and run the following command to delete the snapshot:

    cinder snapshot-delete Snapshot ID

    Command example:

    cinder snapshot-delete 1cd5c6eb-e729-4773-b846-e9f1d3467c56

    If ERROR is displayed in the command output, the snapshot is not deleted. Check whether the snapshot is deleted.

    • If yes, no further action is required.
    • If no, contact technical support for assistance.
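The storage type reported in 6 determines whether the snapshot check continues on FusionStorage (7) or on the SAN device (8). The following is a minimal sketch (Python 3) that parses get_host_storage_info.py output such as the examples shown above and reports the applicable step; the field names are taken from those examples, and the parsing is illustrative only.

  # Minimal sketch: decide the next step from the get_host_storage_info.py output.
  def next_step(output):
      fields = {}
      for line in output.splitlines():
          if "=" in line:
              key, value = line.strip().split("=", 1)
              fields[key] = value
      storage_type = fields.get("storage_type")
      if storage_type == "dsware":
          # FusionStorage: query the snapshot with fsc_cli (step 7).
          return "step 7 (fsc_cli), manage_ip=" + fields.get("manage_ip", "")
      if storage_type == "san":
          # SAN or V3 storage: check the snapshot on OceanStor DeviceManager (step 8).
          return "step 8 (OceanStor DeviceManager)"
      # The storage type does not match the snapshot type: try another host (step 5).
      return "log in to another blockstorage-driver-assigned host (step 5)"

  sample = "storage_type=dsware\naddition info is :\n  manage_ip=172.28.0.231\n  vbs_url=172.28.6.1"
  print(next_step(sample))  # -> step 7 (fsc_cli), manage_ip=172.28.0.231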

Stuck Volumes

Context

A stuck volume is one that is kept in a transition state (including creating, downloading, deleting, extending, error_extending, error_deleting, error_attaching, error_detaching, attaching, detaching, uploading, retyping, error_restoring, backing-up, restoring-backup) and is unavailable for use.

Only volumes in the available or in-use state can be used. If a volume is kept stuck in a transition state for more than 24 hours, restore the volume based on site conditions.
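The 24-hour threshold used throughout the following methods can be checked mechanically against the last_update_time column of the audit report. A minimal sketch (Python 3), assuming the timestamp uses the YYYY-MM-DD HH:MM:SS format; adjust the format string if your report differs.

  from datetime import datetime, timedelta

  # Minimal sketch: report whether a volume has been stuck for more than 24 hours.
  def stuck_longer_than_24h(last_update_time, fmt="%Y-%m-%d %H:%M:%S"):
      updated = datetime.strptime(last_update_time, fmt)
      return datetime.now() - updated > timedelta(hours=24)

  print(stuck_longer_than_24h("2014-09-18 07:23:21"))  # True for an old timestamp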

NOTE:

Audit the orphan VM before deleting a stuck volume.

Description

The name of the audit report is VolumeStatusAudit.csv. Table 18-103 describes parameters in the report.

Table 18-103 Parameter description

Parameter

Description

volume_id

Specifies the volume ID.

volume_displayname

Specifies the name of the volume created by a tenant.

volume_name

Specifies the unique volume identifier on the VRM node, which is prefixed with volume-.

volume_type

Specifies the volume type, such as san, dsware, or vrm.

location

Specifies the volume location.

status

Specifies the status of the volume.

last_update_time

Specifies the last time when the volume was updated.

Possible Causes

  • A system exception occurs when a volume service operation is in process.
  • A database is backed up for future restoration. However, after the backup is created, the statuses of one or more volumes change. When the database is restored using the backup, the status records of these volumes in the database revert to their pre-backup values.

Impact on the System

The stuck volume becomes unavailable and occupies system resources.

Procedure

Restore the volume based on the volume statuses listed in the following table. For other situations, contact technical support for assistance.

Table 18-104 Volume restoration methods

Volume Status

Status Description

Possible Scenario

Restoration Method

creating

The volume is being created.

A system exception occurs during the volume creation process.

For details, see Method 1.

error_restoring

The volume failed to be restored.

The volume data failed to be restored.

For details, see Method 2.

backing-up

The volume data is being backed up.

An exception occurs in the system when the volume data is being backed up.

For details, see Method 2.

restoring-backup

The volume data is being restored.

An exception occurs in the system when the volume data is being restored.

For details, see Method 2.

downloading

The image for creating the volume is being downloaded.

A system exception occurs when a volume is being created from an image.

For details, see Method 2.

deleting

The volume is being deleted.

A system exception occurs when the volume is being deleted.

Forcibly delete the volume. For details, see Method 3.

error_deleting

Deletion failed.

A system exception occurs when the volume is being deleted, resulting in the deletion failure.

Forcibly delete the volume. For details, see Method 3.

error_attaching

Attachment failed.

A system exception occurs when the volume fails to attach to a VM.

Set the volume status to available or in-use. For details, see Method 4.

error_detaching

Detachment failed.

A system exception occurs when the volume fails to detach from a VM.

Set the volume status to available or in-use. For details, see Method 4.

attaching

The volume is being attached to a VM.

A system exception occurs during the volume attachment process.

If the volume is a DR placeholder volume, no further action is required. Otherwise, set the volume status to available or in-use. For details, see Method 4.

detaching

The volume is being detached from a VM.

A system exception occurs during the volume detachment process.

Set the volume status to available or in-use. For details, see Method 4.

uploading

The image is being uploaded.

A system exception occurs when an image is being created using the volume.

For details, see Method 5.

retyping

The volume is being migrated.

A system exception occurs during the storage migration process.

For details, see Method 6.

extending

The volume is being expanded.

A system exception occurs during the volume expansion process.

For details, see Method 7.

error_extending

Expansion failed.

Volume expansion failed.

For details, see Method 7.
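When the audit report lists many stuck volumes, the mapping in Table 18-104 can be applied programmatically. A minimal sketch (Python 3); the status strings follow the table above, and volumes in any other status should be escalated to technical support.

  # Minimal sketch: map a stuck volume status from VolumeStatusAudit.csv to the
  # restoration method described in Table 18-104.
  METHOD_BY_STATUS = {
      "creating": "Method 1",
      "error_restoring": "Method 2", "backing-up": "Method 2",
      "restoring-backup": "Method 2", "downloading": "Method 2",
      "deleting": "Method 3", "error_deleting": "Method 3",
      # For an attaching DR placeholder volume, no action is required (see the table).
      "error_attaching": "Method 4", "error_detaching": "Method 4",
      "attaching": "Method 4", "detaching": "Method 4",
      "uploading": "Method 5",
      "retyping": "Method 6",
      "extending": "Method 7", "error_extending": "Method 7",
  }

  def restoration_method(status):
      return METHOD_BY_STATUS.get(status, "contact technical support")

  print(restoration_method("error_deleting"))  # -> Method 3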

Method 1

  1. Log in to any controller host in an AZ.

    For details, see Using SSH to Log In to a Host.

  2. Import environment variables. For details, see Importing Environment Variables.
  3. Run the following command to query the volume status:

    cinder show Volume UUID

    NOTE:

    The Volume UUID value is the value of volume_id obtained from the audit report.

    Check whether the value of status in the command output is consistent with the volume status in the audit report.

    • If yes, go to 4.
    • If no, no further action is required.

  4. Query the value of last_update_time in the audit report and check whether the time difference between the value and the current time exceeds 24 hours.

    • If yes, go to 5.
    • If no, contact technical support for assistance.

  5. Log in to the blockstorage-driver-vrmXXX-assigned host based on Logging In to the blockstorage-driver-vrmXXX-Assigned Host to Which a Volume Is Attached and run the following command to query the volume details:

    python /usr/bin/info-collect-script/audit_resume/get_vrm_volume.py -qi uuid -cf `echo "blockstorage-driver-vrmXXX" | awk -F '-' '{print "cinder-volume-"$3}'`

    NOTE:

    The value of blockstorage-driver-vrmXXX is obtained from the data of blockstorage-driver-vrmXXX in Logging In to the blockstorage-driver-vrmXXX-Assigned Host to Which a Volume Is Attached.

    Check whether the command output contains "can not find this volume".

    • If yes, go to 7.
    • If no, go to 6.

  6. Reset the volume status using the volume UUID based on Resetting the Volume Status.
  7. Run the following command to delete the volume:

    cinder force-delete Volume UUID

Method 2

  1. Log in to any controller host in an AZ.

    For details, see Using SSH to Log In to a Host.

  2. Import environment variables. For details, see Importing Environment Variables.
  3. Run the following command to query the volume status:

    cinder show Volume UUID

    NOTE:

    The Volume UUID value is the value of volume_id obtained from the audit report.

    Check whether the value of status in the command output is consistent with the volume status in the audit report.

    • If yes, go to 4.
    • If no, no further action is required.

  4. Query the value of last_update_time in the audit report and check whether the time difference between the value and the current time exceeds 24 hours.

    • If yes, go to 5.
    • If no, contact technical support for assistance.

  5. Reset the volume status using uuid based on Resetting the Volume Status.

Method 3

  1. Log in to any controller host in an AZ.

    For details, see Using SSH to Log In to a Host.

  2. Import environment variables. For details, see Importing Environment Variables.
  3. Run the following command to query the volume status:

    cinder show Volume UUID

    NOTE:

    The Volume UUID value is the value of volume_id obtained from the audit report.

    Check whether the value of status in the command output is consistent with the volume status in the audit report.

    • If yes, go to 4.
    • If no, no further action is required.

  4. Query the value of last_update_time in the audit report and check whether the time difference between the value and the current time exceeds 24 hours.

    • If yes, go to 5.
    • If no, contact technical support for assistance.

  5. Run the following command to delete the volume:

    cinder force-delete Volume UUID

Method 4

  1. Log in to any controller host in an AZ.

    For details, see Using SSH to Log In to a Host.

  2. Import environment variables. For details, see Importing Environment Variables.
  3. Run the following command to query the volume status:

    cinder show Volume UUID

    NOTE:

    The Volume UUID value is the value of volume_id obtained from the audit report.

    Check whether the value of status in the command output is consistent with the volume status in the audit report.

    • If yes, go to 4.
    • If no, no further action is required.

  4. Query the value of last_update_time in the audit report and check whether the time difference between the value and the current time exceeds 24 hours.

    • If yes, go to 5.
    • If no, contact technical support for assistance.

  5. Run the following command to query the volume status:

    cinder show Volume UUID

    In the command output, check whether the value of attachments is left blank.

    • If yes, go to 6.
    • If no, go to 8.

  6. Run the following command to set the volume status to available:

    cinder reset-state --state available Volume UUID

  7. Run the following command to query the volume status:

    cinder show Volume UUID

    In the command output, check whether the value of status is available.

    • If yes, go to 11.
    • If no, contact technical support for assistance.

  8. Run the following command to set the volume status to in-use:

    cinder reset-state --state in-use Volume UUID

  9. Run the following command to query the volume status:

    cinder show Volume UUID

    In the command output, check whether the value of status is in-use.

    • If yes, go to 10.
    • If no, contact technical support for assistance.

  10. Query details of the volume in FusionCompute using uuid and check whether Attach VM is Bound.

    For details, see Querying Information About a Volume in FusionCompute.

  11. Query details of the volume in FusionCompute using uuid and check whether Attach VM is Not bound.

    For details, see Querying Information About a Volume in FusionCompute.
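The core decision in 5 to 9 of Method 4 depends only on whether the attachments field returned by cinder show is empty. The following minimal sketch (Python 3) expresses that decision; it only builds the reset command and does not run it.

  # Minimal sketch of the Method 4 decision: an empty attachments list means the
  # volume should be reset to available; otherwise it should be reset to in-use.
  def reset_state_command(volume_uuid, attachments):
      state = "available" if not attachments else "in-use"
      return "cinder reset-state --state %s %s" % (state, volume_uuid)

  # attachments is the value shown by "cinder show <Volume UUID>"; [] means empty.
  print(reset_state_command("1cd5c6eb-e729-4773-b846-e9f1d3467c56", []))
  # -> cinder reset-state --state available 1cd5c6eb-e729-4773-b846-e9f1d3467c56

After resetting the state, confirm the result in FusionCompute as described in 10 and 11.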

Method 5

  1. Log in to any controller host in an AZ.

    For details, see Using SSH to Log In to a Host.

  2. Import environment variables. For details, see Importing Environment Variables.
  3. Run the following command to query the volume status:

    cinder show Volume UUID

    NOTE:

    The Volume UUID value is the value of volume_id obtained from the audit report.

    Check whether the value of status in the command output is consistent with the volume status in the audit report.

    • If yes, go to 4.
    • If no, contact technical support for assistance.

  4. Query the value of last_update_time in the audit report and check whether the time difference between the value and the current time exceeds 24 hours.

    • If yes, go to 5.
    • If no, contact technical support for assistance.

  5. Query details of the volume in FusionCompute using uuid and check whether the volume exists in the disk list of the storage pool.

    For details, see Querying Information About a Volume in FusionCompute.

    NOTE:

    The volume attachment status can be obtained from the Attach VM attribute.

    • If yes, take a note of the volume attachment status (Bound or Not bound) and go to 6.
    • If no, contact technical support for assistance.

  6. Determine whether to attach the volume to a VM.

    • If yes, go to 7.
    • If no, go to 8.

  7. Check whether a VM is being exported in Task Center based on the information of the VM to which the volume is to be attached.

    • If yes, no further action is required.
    • If no, go to 8.

  8. Switch to the controller host you have logged in to and run the following command to query the volume status:

    cinder show Volume UUID

    In the command output, check whether the value of attachments is left blank.

    • If yes, go to 9.
    • If no, go to 12.

  9. Run the following command to set the volume status to available:

    cinder reset-state --state available Volume UUID

  10. Run the following command to query the volume status:

    cinder show Volume UUID

    In the command output, check whether the value of status is available.

    • If yes, go to 11.
    • If no, contact technical support for assistance.

  11. Check whether the attachment status of the volume obtained in 5 is Not bound.

  12. Run the following command to set the volume status to in-use:

    cinder reset-state --state in-use Volume UUID

  13. Check whether the attachment status of the volume obtained in 5 is Bound.

Method 6

  1. Log in to any controller host in an AZ.

    For details, see Using SSH to Log In to a Host.

  2. Import environment variables. For details, see Importing Environment Variables.
  3. Run the following command to query the volume status:

    cinder show Volume UUID

    NOTE:

    The Volume UUID value is the value of volume_id obtained from the audit report.

    Check whether the value of status in the command output is consistent with the volume status in the audit report.

    • If yes, go to 4.
    • If no, contact technical support for assistance.

  4. Query the value of last_update_time in the audit report and check whether the time difference between the value and the current time exceeds 24 hours.

    • If yes, go to 5.
    • If no, contact technical support for assistance.

  5. Confirm with the tenant whether to change the volume status.

    • If yes, go to 6.
    • If no, no further action is required.

  6. Query details of the volume in FusionCompute using uuid and check whether the volume exists in the disk list of the storage pool.

    For details, see Querying Information About a Volume in FusionCompute.

    NOTE:

    The volume attachment status can be obtained from the Attach VM attribute.

    • If yes, take a note of the volume attachment status (Bound or Not bound) and go to 7.
    • If no, contact technical support for assistance.

  7. Run the following command to query the volume status:

    cinder show Volume UUID

    In the command output, check whether the value of attachments is left blank.

    • If yes, go to 8.
    • If no, go to 11.

  8. Run the following command to set the volume status to available:

    cinder reset-state --state available Volume UUID

  9. Run the following command to query the volume status:

    cinder show Volume UUID

    In the command output, check whether the value of status is available.

    • If yes, go to 10.
    • If no, contact technical support for assistance.

  10. Check whether the attachment status of the volume obtained in 6 is Not bound.

  11. Run the following command to set the volume status to in-use:

    cinder reset-state --state in-use Volume UUID

  12. Check whether the attachment status of the volume obtained in 6 is Bound.

Method 7

  1. Log in to any controller host in an AZ.

    For details, see Using SSH to Log In to a Host.

  2. Import environment variables. For details, see Importing Environment Variables.
  3. Run the following command to query the volume status:

    cinder show Volume UUID

    NOTE:

    The Volume UUID value is the value of volume_id obtained from the audit report.

    Check whether the value of status in the command output is consistent with the volume status in the audit report.

    • If yes, go to 4.
    • If no, no further action is required.

  4. Check whether the volume status is extending or error_extending. If the status is extending, also check whether the time difference between the value of last_update_time and the current time exceeds 24 hours.

    • If yes, go to 5.
    • If no, contact technical support for assistance.

  5. Query details of the volume in FusionCompute using uuid based on Querying Information About a Volume in FusionCompute and locate the volume.

    Check whether the volume status is available.

    • If yes, go to 6.
    • If no, contact technical support for assistance.

  6. Log in to the host with the blockstorage-driver-vrmXXX role assigned based on Logging In to the blockstorage-driver-vrmXXX-Assigned Host to Which a Volume Is Attached and run the following script to restore the volume that fails to expand:

    python /usr/bin/info-collect-script/audit_resume/resume_extend_volume.py -id uuid -cf `echo "blockstorage-driver-vrmXXX" | awk -F '-' '{print "cinder-volume-"$3}'`

    NOTE:

    The value of blockstorage-driver-vrmXXX is obtained from the data of blockstorage-driver-vrmXXX in Logging In to the blockstorage-driver-vrmXXX-Assigned Host to Which a Volume Is Attached.

    After the command is executed, check whether the command output is free of error information.

    • If yes, go to 7.
    • If no, contact technical support for assistance.

  7. Reset the volume status using the value of uuid based on Resetting the Volume Status.

Inconsistent Volume Attachment Information

Context

Volume attachment information includes the following:

  • Attachment status of volumes recorded in Cinder management data
  • Attachment status of volumes recorded in Nova management data
  • Information about volumes recorded in VRM

The system audits the consistency between the preceding volume attachment information.

Parameter Description

The name of the audit report is VolumeAttachmentAudit.csv. Table 18-105 describes parameters in the report.

Table 18-105 Parameter description

Parameter

Description

volume_id

Specifies the volume ID.

volume_displayname

Specifies the name of the volume created by a tenant.

volume_type

Specifies the volume type, such as san, dsware, or vrm.

location

Specifies details about the volume.

attach_status

Specifies the volume attachment status.

Impact on the System

  • Residual volume attachment information may reside on hosts.
  • Volume-related services may be affected.

Possible Causes

  • A database is backed up for future restoration. However, after the backup is created, one or more volumes are attached to VMs. When the database and storage devices are restored using the backup, record of the volume attachment information is deleted from the database, but the information resides on the storage devices.
  • If a service operation fails and is rolled back, volume information rollback fails.

Procedure

  1. Open the audit report and view the volume attachment information.

    The volume attachment information includes:

    • location: Specifies the detailed volume attachment information recorded in the Cinder service, Nova service, and storage devices. Values:
      • ATTACH_TO: Indicates the volume attachment information recorded in the Cinder management data.

        For example:

        'ATTACH_TO': [{'instance_id': u' e3fd74ba-389e-4b51-afe0-531e25978264'}]

        The value of instance_id indicates the VM UUID.

      • BELONG_TO: Indicates information about the host to which the volume belongs.
      • HYPER_USE: If the volume type is vrm, this field is left blank.
      • MAP_TO: [{'location': ' vrm'}]: Indicates that the volume is on the VRM.
      • NOVA_USE: Indicates information about the VM to which the volume is attached and recorded in the Nova management data.

        For example:

        'NOVA_USE': [{'instance_name': u'instance-00000002', 'instance_id': u'e3fd74ba-389e-4b51-afe0-531e25978264'}]

    • attach_status: Specifies the volume attachment status. Values:
      • management_status: Indicates the result of comparison between attachment information in the Cinder service and the Nova service. match indicates that the information is consistent, and not_match indicates that information is inconsistent.
      • cinder_status: Indicates the result of comparison between attachment information in the Cinder service and the storage device when the FusionCompute data store is used. match indicates that the attachment status recorded in the Cinder service is consistent with that recorded in FusionCompute, and not_match indicates that information is inconsistent.

  2. Restore the volume attachment information based on the volume statuses listed in the following table. For other situations, contact technical support for assistance.

    Table 18-106 Restoration methods of volume attachment information

    management_status

    cinder_status

    Possible Scenario

    Restoration Method

    not_match

    not_match

    The volume is recorded as attached in the Cinder service but is not recorded as attached in the Nova service or in FusionCompute.

    See Method 1.

    not_match

    match

    The volume is recorded as attached in the Nova service but is not recorded as attached in the Cinder service or in FusionCompute.

    See Method 2.

    match

    not_match

    The volume is recorded as attached in the Cinder service and Nova service but is not recorded as attached in FusionCompute. Alternatively, the volume is recorded as attached in FusionCompute but is not recorded as attached in the Cinder service or the Nova service.

    See Method 3.
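The decision in Table 18-106 depends only on the two comparison results. A minimal sketch (Python 3); how management_status and cinder_status are extracted from the attach_status column depends on the exact report format, so the extraction itself is not shown.

  # Minimal sketch: map the comparison results described in Table 18-106 to the
  # restoration method.
  def restoration_method(management_status, cinder_status):
      if management_status == "not_match" and cinder_status == "not_match":
          return "Method 1"
      if management_status == "not_match" and cinder_status == "match":
          return "Method 2"
      if management_status == "match" and cinder_status == "not_match":
          return "Method 3"
      return "no action required, or contact technical support"

  print(restoration_method("not_match", "match"))  # -> Method 2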

Method 1

  1. In the audit report, check whether the VM information is available in ATTACH_TO of location.

    • If yes, go to 2.
    • If no, contact technical support for assistance.

  2. Log in to any controller host in an AZ.

    For details, see Using SSH to Log In to a Host.

  3. Import environment variables. For details, see Importing Environment Variables.
  4. Run the following command to query the volume attachment information of the VM:

    nova show vm-uuid

    NOTE:

    The value of vm-uuid is the instance_id value in ATTACH_TO of location in the audit report.

    In the command output, os-extended-volumes:volumes_attached indicates the volume attachment information of the VM.

  5. Check whether the volume attachment information obtained in 4 contains the volume UUID in the audit report.

    NOTE:

    The volume UUID is the value of volume_id obtained from the audit report.

    • If yes, the fault is falsely reported due to time differences. In this case, no further action is required.
    • If no, go to 6.

  6. In the audit report, check whether the VM information is available in NOVA_USE of location.

    • If yes, contact technical support for assistance.
    • If no, go to 7.

  7. Query the details of the volume in FusionCompute using the value of uuid based on Querying Information About a Volume in FusionCompute. In the disk list of the storage pool, check whether the volume is attached to a VM.

    NOTE:

    Volume attachment status indicates the data displayed in the Attach VM column (Bound or Not bound) of the volume list.

    • If yes, contact technical support for assistance.
    • If no, go to 8.

  8. Perform the following operations to change the volume status and clear the volume attachment status:

    • Run the following command to set the volume status to available:

      cinder reset-state --state available Volume UUID

    • Run the following command on a controller node to clear the attachment status of the volume:

      python /usr/bin/info-collect-script/audit_resume/clear_attachment_info.py Volume UUID

  9. Run the following command to attach the volume to a VM:

    nova volume-attach vm-uuid Volume UUID auto

    NOTE:

    The Volume UUID value is the value of volume_id obtained from the audit report.

    The value of vm-uuid is the instance_id value in ATTACH_TO of location in the audit report.
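Steps 8 and 9 must be performed in the order shown: reset the volume state, clear the residual attachment record, and then re-attach the volume. The following minimal sketch (Python 3) only assembles the three command lines from the audit report values; it does not execute them.

  # Minimal sketch: assemble the Method 1 recovery commands in the required order.
  # volume_uuid is volume_id from the audit report; vm_uuid is the instance_id
  # value in ATTACH_TO of location.
  def method1_commands(volume_uuid, vm_uuid):
      return [
          "cinder reset-state --state available %s" % volume_uuid,
          "python /usr/bin/info-collect-script/audit_resume/clear_attachment_info.py %s" % volume_uuid,
          "nova volume-attach %s %s auto" % (vm_uuid, volume_uuid),
      ]

  for cmd in method1_commands("1cd5c6eb-e729-4773-b846-e9f1d3467c56",
                              "e3fd74ba-389e-4b51-afe0-531e25978264"):
      print(cmd)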

Method 2

  1. Log in to any controller host in an AZ.

    For details, see Using SSH to Log In to a Host.

  2. Import environment variables. For details, see Importing Environment Variables.
  3. Run the following command to query the volume attachment information of the VM:

    nova show vm-uuid

    NOTE:

    The value of vm-uuid is the instance_id value in NOVA_USE of location in the audit report.

    In the command output, os-extended-volumes:volumes_attached indicates the volume attachment information of the VM.

  4. Check whether the volume attachment information obtained in 3 contains the volume UUID in the audit report.

    NOTE:

    The volume UUID is the value of volume_id obtained from the audit report.

    • If yes, go to the next step.
    • If no, the fault is falsely reported due to time differences. In this case, no further action is required.

  5. Run the following command to query the VM status:

    nova show VM UUID | grep OS-SRV-USG:launched_at

    NOTE:

    The value of VM UUID is the instance_id value in NOVA_USE of location in the audit report.

    If OS-SRV-USG:launched_at is left blank, the VM became abnormal during the creation process.

    Check whether time information is displayed in the command output.

    • If yes, go to 6.
    • If no, the VM is an invalid VM. In this case, ensure that the invalid VM has been properly deleted based on Invalid VMs.

  6. Log in to the active GaussDB node and the Nova database node based on Logging In to the Active GaussDB Node. Then run the following script to delete the residual volume attachment information from the VM:

    sh /usr/bin/info-collect-script/audit_resume/delete_bdm.sh VM UUID Volume UUID

    NOTE:

    The value of VM UUID is the instance_id value in NOVA_USE of location in the audit report.

    The volume UUID is the value of volume_id obtained from the audit report.

Method 3

  1. In the audit report, check whether the VM information is available in ATTACH_TO of location.

    • If yes, go to 2.
    • If no, go to 5.

  2. Query details of the volume in FusionCompute using the value of uuid based on Querying Information About a Volume in FusionCompute and check whether the volume is attached to the VM.

    NOTE:

    The value of uuid is that of volume_id obtained from the audit report.

    • If yes, contact technical support for assistance.
    • If no, go to 3.

  3. Query details of the VM in FusionCompute using the value of uuid based on Querying Information About a VM in FusionCompute and locate the VM.

    NOTE:

    The value of uuid is the instance_id value in ATTACH_TO of location in the audit report.

  4. Attach the volume to the VM.

    NOTE:

    The value of Name to be entered on the page is the value of volume_id in the audit report.

    No further action is required.

  5. Query details of the volume in FusionCompute using the value of uuid based on Querying Information About a Volume in FusionCompute and check whether the volume is attached to the VM.

    NOTE:

    The value of uuid is that of volume_id obtained from the audit report.

    • If yes, go to 6.
    • If no, contact technical support for assistance.

  6. In the disk list, click Name of a disk to view details of the VM to which the disk is attached.
  7. Check whether the VM status is Stopped.

    • If yes, take a note of the VM UUID and go to 10.
    • If no, take a note of the VM UUID and go to 8.
    NOTE:

    The VM status can be obtained from the status attribute on the basic information page.

    The VM UUID can be obtained from the VM UUID attribute on the basic information page.

  8. Confirm with the tenant whether the VM can be stopped.

    • If yes, go to 9.
    • If no, no further action is required.

  9. Stop the VM.
  10. On the VM details page, click Hardware, select Disks, locate the target volume, click More, and select Detach to detach the volume from the VM.
  11. Log in to any controller host in an AZ.

    For details, see Using SSH to Log In to a Host.

  12. Import environment variables. For details, see Importing Environment Variables.
  13. Run the following command to attach the volume to the VM:

    nova volume-attach vm-uuid Volume UUID auto

    NOTE:

    The Volume UUID value is the value of volume_id obtained from the audit report.

    The value of vm-uuid is the VM UUID recorded in 7.

Nova novncproxy Zombie Process

Context

The Nova novncproxy service may generate zombie processes due to defects in the websockify module or the Python version in use. However, the probability of this issue occurring is very low. To improve system stability, the system also audits and automatically clears these zombie processes.

Parameter Description

The audit configuration item is max_zombie_process_num, which is stored in the /etc/info-collect.conf file on the novncproxy-deployed node. The configuration item specifies the threshold for automatically clearing zombie processes. The default value is 10. The value is explained as follows:

  • The system automatically clears the zombie processes on a compute node only when the number of zombie processes on the node exceeds the threshold.
  • If the threshold is set to -1, the system does not clear zombie processes.

The name of the audit report is zombie_process_hosts.csv. Table 18-107 describes parameters in the report.

Table 18-107 Parameter description

Parameter

Description

host

Specifies the compute node name.

zombieprocess

Specifies the number of zombie processes detected on the node.

is restart

Specifies whether any automatic zombie process deletion is conducted. The default value is True.

Impact on the System

  • Excessive zombie processes may deteriorate the system performance.
  • After a zombie process is deleted, the nova-novncproxy service restarts, which interrupts in-use novnc services.

Possible Causes

  • The websockify module used by the nova-novncproxy service is defective.
  • Python 2.6 is defective.

Procedure

No operation is required. The system automatically clears excessive zombie processes based on the specified threshold.

NOTE:

Before the system can automatically clear a zombie process, the zombie process must first be attached to process 1. Therefore, the clearing does not take effect immediately.
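Although no manual handling is required, the zombie process count on a node can be verified independently of the audit. A minimal sketch (Python 3), assuming a Linux /proc filesystem; the audit script's own counting logic may differ.

  import os

  # Minimal sketch: count zombie (state Z) processes on the node by reading
  # /proc/<pid>/stat, which is what the zombieprocess column reflects.
  def count_zombies():
      count = 0
      for pid in os.listdir("/proc"):
          if not pid.isdigit():
              continue
          try:
              with open("/proc/%s/stat" % pid) as f:
                  data = f.read()
          except (IOError, OSError):
              continue  # the process exited while we were reading
          # The state character follows the closing parenthesis of the command name.
          state = data[data.rfind(")") + 2:data.rfind(")") + 3]
          if state == "Z":
              count += 1
      return count

  print("zombie processes on this node: %d" % count_zombies())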

Residual Cold Migration Data

Context

FusionSphere OpenStack stores VM cold migration information in the database and will automatically delete it after the migration confirmation or rollback. However, if an exception occurs, residual information is not deleted from the database.

Parameters

The name of the audit report is cold_cleaned.csv. Table 18-108 describes parameters in the report.

Table 18-108 Parameters in the audit report

Parameter

Description

instance_uuid

Specifies the UUID of the VM that is cold migrated.

Impact on the System

  • This issue incurs a higher quota usage than the actual usage.
  • This issue adversely affects the code implementation and resource usages of subsequent VM cold migrations.

Possible Causes

  • The fc-nova-compute service is restarted during the migration.
  • The VM status is reset after the migration.

Procedure

No operations are required.

Intermediate State of the Cold Migration

Context

FusionSphere OpenStack stores VM cold migration information in the database. If the source node is restarted during the migration confirmation, the cold migration may be stuck in the intermediate state.

Parameters

The name of the audit report is cold_stuck.csv. Table 18-109 describes parameters in the report.

Table 18-109 Parameters in the audit report

Parameter

Description

instance_uuid

Specifies the UUID of the VM that is cold migrated.

migration_id

Specifies the ID of the cold migration record.

migration_updated

Specifies the time when the migration is confirmed.

instance_updated

Specifies the time when the VM information is updated.

Impact on the System

Maintenance operations cannot be performed on the VM.

Possible Causes

  • The fc-nova-compute service on the source node is restarted during the cold migration.
  • Network exceptions cause packet loss.

Procedure

  1. Log in to any controller host in an AZ. For details, see Using SSH to Log In to a Host.
  2. Import environment variables. For details, see Importing Environment Variables.
  3. Log in to the active GaussDB node based on Logging In to the Active GaussDB Node and run the following command to clear the intermediate state of the VM:

    python /usr/bin/info-collect-script/audit_resume/clean_stuck_migration.py instance_uuid migration_id

    NOTE:

    The value of instance_uuid can be obtained from the audit report.

    The value of migration_id can be obtained from the audit report.

  4. Check whether the VM is running properly.

    • If yes, no further action is required.
    • If no, contact technical support for assistance.
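If cold_stuck.csv lists several VMs, the cleanup command in 3 can be generated for each row. A minimal sketch (Python 3), assuming the report is a plain CSV file with the column names listed in Table 18-109; review each generated command before running it on the active GaussDB node.

  import csv

  # Minimal sketch: print the clean_stuck_migration.py command for every row of
  # the cold_stuck.csv audit report.
  with open("cold_stuck.csv") as report:
      for row in csv.DictReader(report):
          print("python /usr/bin/info-collect-script/audit_resume/"
                "clean_stuck_migration.py %s %s"
                % (row["instance_uuid"], row["migration_id"]))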

Cold Migrated VMs That Are Adversely Affected by Abnormal Hosts

Context

If the source host becomes faulty during a VM cold migration, the cold migration will be adversely affected. Perform an audit to detect the cold migrated VMs that are adversely affected by faulty hosts in the system.

Parameters

The name of the audit report is host_invalid_migration.csv. Table 18-110 describes parameters in the report.

Table 18-110 Parameters in the audit report

Parameter

Description

id

Specifies the ID of the cold migration record.

instance_uuid

Specifies the UUID of the VM that is cold migrated.

source_compute

Specifies the source host in the cold migration.

source_host_state

Specifies the status of the source host.

Impact on the System

Maintenance operations cannot be performed on the VM.

Possible Causes

  • The source host is powered off.
  • The fc-nova-compute role on the source host is deleted.
  • The fc-nova-compute service on the source host runs improperly.

Before handling the audit result, ensure that no service exception alarm has been generated in the system. If any host becomes faulty, replace the host by performing operations provided in section Replacing Hosts and Accessories from HUAWEI CLOUD Stack 6.5.0 Parts Replacement. To delete a host, perform operations provided in section Deleting a Host from an AZ from HUAWEI CLOUD Stack 6.5.0 O&M Guide.

Procedure

  1. Log in to any controller host in an AZ.

    For details, see Using SSH to Log In to a Host.

  2. Import environment variables. For details, see Importing Environment Variables.
  3. Perform the following operations on the source host:

    1. Run the following command to check whether the source host exists:

      cps host-list

      Check whether the command output contains a host whose ID is the same as the source_compute value in the audit report.

      • If no, go to 4.
      • If yes, proceed to the next operation.
    2. Check whether the status value of the host is fault in the command output in the previous operation and whether the services on the host cannot be restored.
      • If yes, go to 4.
      • If no, restore the services and perform the audit again.
    3. Run the following command to check whether the host is running properly:

      cps host-list

      Locate the host whose ID is the same as the source_compute value in the audit report and check whether the host has the fc-nova-compute role assigned.

      • If no, go to 4.
      • If yes, run the following command to verify that the fc-nova-compute service is running properly:

        cps template-instance-list --service nova fc-nova-computeXXX

      • If the fc-nova-compute service is running properly, that is, the value of status is active or standby, but no operation can be performed on the VM, contact technical support for assistance.

  4. Run the following command to clear the residual cold migration record and reset the VM status:

    python /usr/bin/info-collect-script/audit_resume/clean_stuck_migration.py instance_uuid id

    NOTE:

    The value of instance_uuid can be obtained from the audit report.

    The value of id can be obtained from the audit report.

Handling Redundant Neutron Namespaces

Context

In centralized DHCP scenarios, if a network has been deleted but its DHCP namespace still exists, the namespace is redundant. In distributed DHCP scenarios, the namespace of a network on a node is redundant if the node does not contain any port for that network.

After the user confirms that a DHCP namespace is redundant, restart the neutron-dhcp-agent service to delete the namespace.

In centralized router scenarios, if a router has been deleted but its router namespace still exists, the namespace is redundant. In distributed router scenarios, a router namespace on a node is redundant if the namespace exists on the node but no VMs exist on any of the subnets connected to the router.
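Every check in the procedure below starts by deriving the Neutron resource ID from the namespace name: the qdhcp- prefix maps to a network ID and the qrouter- prefix maps to a router ID, and the corresponding neutron show command tells you whether the resource still exists. A minimal sketch (Python 3) of that derivation; it only prints the command to run.

  # Minimal sketch: derive the Neutron resource ID from a namespace name listed in
  # namespace_list and build the corresponding existence check.
  def existence_check(namespace_id):
      if namespace_id.startswith("qdhcp-"):
          return "neutron net-show " + namespace_id[len("qdhcp-"):]
      if namespace_id.startswith("qrouter-"):
          return "neutron router-show " + namespace_id[len("qrouter-"):]
      raise ValueError("unexpected namespace name: " + namespace_id)

  print(existence_check("qdhcp-9c4c4872-af61-4fe0-9148-04324233a5e9"))
  # -> neutron net-show 9c4c4872-af61-4fe0-9148-04324233a5e9
  print(existence_check("qrouter-af15306f-2ccd-4f1e-932d-9007f31c7f6f"))
  # -> neutron router-show af15306f-2ccd-4f1e-932d-9007f31c7f6f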

Parameters

The name of the audit report is redundant_namespaces.csv. Table 18-111 describes parameters in the report.

Table 18-111 Parameters in the audit report

Parameter

Description

host_id

Specifies the universally unique identifier (UUID) of the node accommodating redundant namespaces.

namespace_list

Specifies the list of redundant namespaces.

Possible Causes

When networks are deleted in batches, the RPC messages consumed by dhcp-agent are processed serially and therefore tend to accumulate in the message queue. If dhcp-agent is disconnected from RabbitMQ at this time, the RPC broadcast messages are lost, and the DHCP namespaces of some networks fail to be deleted.

Impact on the System

The system contains residual DHCP namespaces.

Procedure

  1. Log in to any controller host in an AZ.

    For details, see Using SSH to Log In to a Host.

  2. Import environment variables to the host.

    For details, see Importing Environment Variables.

    Perform the following operations to query the DHCP or router deployment mode:
    1. Enter the secure operation mode. The following information is displayed:
      Input command:
    2. Query the DHCP or router deployment mode.
      1. To query the DHCP deployment mode

        Run the following command:

        cps template-params-show --service neutron neutron-server|grep dhcp_distributed

        If True is displayed in the command output, distributed DHCP is used. Otherwise, centralized DHCP is used.

      2. To query the router deployment mode

        Run the following command:

        cps template-params-show --service neutron neutron-openvswitch-agent|grep enable_distributed_routing

        If True is displayed in the command output, distributed router is used. Otherwise, centralized router is used.

        • In centralized DHCP scenarios, perform 3 to 6.
        • In distributed DHCP scenarios, perform 7 to 9.
        • If routers are deployed in a centralized manner, perform 10 to 12.
        • If routers are deployed in a distributed mode, perform 13 to 15.

  3. Locate the node containing a redundant DHCP namespace based on host_id in the audit report. Then log in to the node and import environment variables (for details, see 1 and 2).
  4. Enter the secure operation mode (for details, see Command Execution Methods) and check whether the redundant namespace exists on the host:

    1. After you enter the secure operation mode, the following information is displayed:
      Input command:
    2. Run the following command to check whether the redundant DHCP namespace exists on the host:

      ip netns | grep namespace_id

      NOTE:

      namespace_id specifies the ID of each namespace in the namespace_list field of the audit report.

      Check whether the redundant DHCP namespace exists on the host.

      • If yes, go to the next step.
      • If no, the namespace is not redundant. No further action is required.

  5. Run the following command in the secure operation mode and check whether the network of the redundant namespace exists in the system:

    neutron net-show network_id

    network_id specifies the network ID for namespace_id in 4.

    An example is provided as follows:

    If namespace_id is qdhcp-9c4c4872-af61-4fe0-9148-04324233a5e9, then network_id is 9c4c4872-af61-4fe0-9148-04324233a5e9.

    Check whether the network of the redundant namespace exists in the system.

    • If yes, the namespace is not redundant. No further action is required.
    • If no, go to the next step.

  6. Run the following command in secure operation mode to delete the redundant DHCP namespace from the node:

    ip netns del namespace_id

    For details about how to obtain the value of namespace_id, see 4.

  7. Enter the secure operation mode (for details, see Command Execution Methods) and perform the following operations to check whether the node accommodating the redundant namespace has a network port:

    1. After you enter the secure operation mode, the following information is displayed:
      Input command:
    2. Run the following command to check whether the node has a network port:

      neutron port-list --network_id network_id --binding:host_id host_id

      NOTE:

      network_id can be obtained in 5, and host_id is the host_id value in the audit report.

      In the command output:

      • If only one distributed_dhcp_port record is displayed, this node does not contain other network ports. Go to the next step.
      • If multiple distributed_dhcp_port records are displayed, the namespace is not redundant. No further action is required.

  8. Perform the operations provided in 3 and 4 to log in to the node containing the redundant DHCP namespace and to check whether the redundant DHCP namespace exists on it.
  9. If the node has a redundant namespace, go to 6.
  10. Enter the secure operation mode based on Command Execution Methods and check whether the redundant router namespace exists on the host.

    1. After you enter the secure operation mode, the following information is displayed:
      Input command:
    2. Run the following command to check whether the redundant router namespace exists on the host:

      ip netns | grep namespace_id

      NOTE:

      namespace_id specifies the ID of each namespace in the namespace_list field of the audit report.

      Check whether the redundant router namespace exists on the host.

      • If yes, go to the next step.
      • If no, the namespace is not redundant. No further action is required.

  11. Run the following command in the secure operation mode and check whether the router of the redundant namespace exists in the system:

    neutron router-show router_id

    The value of router_id is the router ID corresponding to the value of namespace_id in 10. For example, if namespace_id is qrouter-af15306f-2ccd-4f1e-932d-9007f31c7f6f, then router_id is af15306f-2ccd-4f1e-932d-9007f31c7f6f.

    Check whether the router of the redundant namespace exists in the system.

    • If yes, the namespace is not redundant. No further action is required.
    • If no, go to the next step.

  12. Run the following command in the secure operation mode to delete the redundant router namespace from the node:

    ip netns del namespace_id

    The namespace_id value can be obtained in 10.

  13. Enter the secure operation mode based on Command Execution Methods and check whether the node accommodating the redundant namespace has a network port.

    1. The following information is displayed in the secure operation mode:
      Input command:
    2. Run the following commands to query the networks to which the subnets connected to the router belong:

      Check whether the router exists and obtain the IDs of the networks to which all subnets connected to the router belong.

      Obtain the value of port_id of the router:

      neutron router-port-list router_id

      Obtain the value of network_id of the router port:

      neutron port-show router_port_id -c network_id

      If the router exists, the namespace is not redundant. Otherwise, the namespace is redundant. If the namespace is redundant, go to the next step.

    3. Run the following command to check whether the node contains a network port:

      neutron port-list --subnet_id subnet_id --binding:host_id host_id

      NOTE:

      The value of host_id can be obtained from the audit report.

      In the command output, check whether the namespace is redundant.

      • If all the subnets on the router have no VM ports, the router namespace is redundant. Go to the next step.
      • If the subnets on the router have VM ports, the router namespace is not redundant. No further action is required.

  14. Perform the operations provided in 3 and 4 to log in to the node containing the redundant router namespace and check whether the redundant namespace exists.
  15. If the redundant router namespace exists, go to 12.

Orphan Replication Pair

Context

An orphan replication pair is one that is present on a storage device but is not recorded in the drextend database.

If a replication pair is orphaned and the management data is lost due to backup-based system restoration, ask the administrator to delete the replication pair.

Parameter Description

The name of the audit report is wildReplicationAudit.csv. Table 18-112 describes parameters in the report.

Table 18-112 Parameter description

Parameter

Description

pair_id

Specifies the pair_id value of the replication pair in the array.

NOTE:

Only the Huawei OceanStor V3 series arrays are supported.

Impact on the System

The replication pair does not exist in the drextend database, and the volumes in the replication pair cannot be used to create replication pairs.

Possible Causes

  • A database is backed up for future restoration. However, after the backup is created, one or more replication pairs are created. After the database is restored, records of these replication pairs are deleted from the database, but these replication pairs reside on their storage devices and become orphan replication pairs.
  • The storage system is shared by multiple FusionSphere systems.
  • Replication pairs on the storage device are not created using the drextend component.
NOTE:

When you design system deployment for a site, do not make multiple hypervisors share one storage system, and do not use components other than drextend to create replication pairs on a storage device. Otherwise, false audit reports may be generated.

Procedure

  1. Log in to any controller host in an AZ.

    For details, see Using SSH to Log In to a Host.

  2. Import environment variables. For details, see Importing Environment Variables.
  3. Run the following command to check whether a replication pair is orphaned:

    python /usr/bin/info-collect-script/audit_resume/confire_replication.py --action show --pair_id pair_id

    The value of pair_id can be obtained from the audit report.

    Go to 4 if information similar to the following is displayed, which indicates that the replication pair is an orphan one:

    [] 
    INFO:success.

    If the square brackets contain any information, contact technical support for assistance.

  4. Log in to the disk array to query information about the orphan replication pair.

    1. Log in to the disk array and choose Data Protection > Remote Replication > Remote Replication Pair.
    2. Select pair_id.

    3. Check the role, health status, and running status of the orphan replication pair, and information about the remote replication consistency group.

      If the pair health status is normal, go to 5. Otherwise, check the data link status between the local disk array and remote disk array and contact technical support for assistance.

  5. Table 18-113 describes how to handle the fault based on the replication pair status.

    Table 18-113 Orphan replication pair restoration methods

    Role

    Consistency Group

    Pair Running Status

    Restoration Method

    Primary

    Belongs to a consistency group.

    Normal

    See Method 1.

    Primary

    Belongs to a consistency group.

    Split

    See Method 1.

    Primary

    Does not belong to a consistency group.

    Normal

    See Method 2

    Primary

    Does not belong to a consistency group.

    Split

    See Method 2.

    Secondary

    Belongs to a consistency group.

    Normal

    See Method 3.

    Secondary

    Belongs to a consistency group.

    Split

    See Method 3.

    Secondary

    Does not belong to a consistency group.

    Normal

    See Method 3.

    Secondary

    Does not belong to a consistency group.

    Split

    See Method 3.
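Table 18-113 reduces to a simple rule: secondary pairs are always handled with Method 3, primary pairs that belong to a consistency group with Method 1, and primary pairs that do not belong to a consistency group with Method 2; the running status (Normal or Split) does not change the method. A minimal sketch (Python 3) of that lookup.

  # Minimal sketch: map the combinations in Table 18-113 to the restoration method.
  def replication_method(role, in_consistency_group):
      if role == "Secondary":
          return "Method 3"
      if in_consistency_group:
          return "Method 1"  # primary pair that belongs to a consistency group
      return "Method 2"      # primary pair that does not belong to a consistency group

  print(replication_method("Primary", True))     # -> Method 1
  print(replication_method("Secondary", False))  # -> Method 3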

Method 1

  1. Locate the remote replication consistency group to which the replication pair belongs. Locate the remote replication pair based on 4 in Procedure.

    Make a note of the value of the consistency group and go to the next step.

  2. Determine the remote replication consistency group to which the replication pair belongs. Click the Consistency Group tab to view the consistency groups.

  3. Query the consistency group. Enter the consistency group name obtained in 1 in the search text box and click Search.

    If the record of the consistency group exists, go to 4. Otherwise, contact technical support for assistance.

  4. Remove the orphan replication pair from the consistency group.

    1. If the health status of the consistency group is Normal, make a note of the running status of the consistency group. Then perform splitting operations on the consistency group as follows: Select the consistency group, right-click the record, and select Split from the shortcut menu to set the consistency group running status to Split.
    2. Select the replication pair which belongs to the consistency group in the lower part of the page and click Remove.

    3. If the original running status of the consistency group is Normal, right-click the consistency group and select Synchronize to restore the running status to Normal (if the running status has become Normal, skip this operation).

      If any exception occurs during the operation, contact technical support for assistance.

  5. Delete the replication pair.

    Switch to the Remote Replication Pair tab, select the orphan replication pair to be processed, and click Delete.

Method 2

  1. Locate the orphan replication pair that has been audited (the remote replication pair obtained in 4 in Procedure).

  2. Delete the replication pair.

    • If the running status of the replication pair is Split, delete it.
    • If the running status of the replication pair is Normal, select the orphan replication pair, right-click it, select Split from the shortcut menu, and delete the replication pair.

Method 3

  1. Check whether the primary replication pair of the orphan replication pair exists on the remote disk array.

    Make a note of the local resource name and remote resource name of the replication pair on the local disk array. Log in to the remote disk array, choose Data Protection > Remote Replication > Remote Replication Pair, and check whether the primary replication pair exists based on the recorded local resource name and remote resource name.

    • If the remote disk array contains the primary replication pair, no further action is required.
    • If the remote array does not contain the primary replication pair, go to 2.

  2. Check the status of the orphan replication pair on the local disk array.

    • If the replication pair belongs to a consistency group, contact technical support for assistance.
    • If the replication pair does not belong to a consistency group, go to 3.

  3. Delete the replication pair.

    • If the running status of the replication pair is Split, delete it.
    • If the running status of the replication pair is Normal, select the orphan replication pair, right-click it, select Split from the shortcut menu, and delete the replication pair.

Invalid Replication Pairs

Context

An invalid replication pair is one that is recorded in the drextend database but does not exist on the storage device.

Parameter Description

The name of the audit report is fakeReplicationAudit.csv. Table 18-114 describes parameters in the report.

Table 18-114 Parameter description

Parameter           Description
replication_id      Specifies the universally unique identifier (UUID) of the replication pair in the drextend database.
pair_id             Specifies the pair_id value of the replication pair in the array.

Impact on the System

Residual replication pair records reside in the drextend database.

Possible Causes

A database is backed up for future restoration, but one or more replication pairs are deleted after the backup is created. When the database is restored using the backup, the records of these replication pairs remain in the database and become invalid replication pairs.

Procedure

  1. Log in to any controller host in an AZ.

    For details, see Using SSH to Log In to a Host.

  2. Import environment variables. For details, see Importing Environment Variables.
  3. Run the following command to check whether the replication pair exists in the drextend database:

    python /usr/bin/info-collect-script/audit_resume/confire_replication.py --action show --replication_id replication_id

    In the command, the value of replication_id can be obtained from the fakeReplicationAudit.csv report. (If the report lists many entries, see the batch-check sketch provided after this procedure.)

    If information similar to the following is displayed and the replication pair status is available, the replication pair exists in the drextend database:

    [(u'9b5db50c-43dc-4516-b701-adbec0c4b74a', u'available', u'{"pair_id": "4fe8deac7c40005"}')] 
    INFO:success.
    • If the value of pair_id in the audit report is empty, go to 5.
    • If the value of pair_id in the audit report is not empty, go to 4.

  4. Log in to the IP SAN V3 disk array by entering its address (for example, https://172.20.0.1:8088) in the address box of a browser, and check whether the replication pair exists in the disk array.

    1. Log in to the disk array and choose Data Protection > Remote Replication > Remote Replication Pair.
    2. Click the down arrow below the Search button and select Pair ID.

    3. Select Pair ID as the keyword, enter the value of Pair ID obtained in the audit report in the search box, and click Search.

      If no search result is displayed, the replication pair is an invalid one. Go to 5. If a search result is displayed, contact technical support for assistance.

  5. Run the following command to set the status of the invalid replication pair to error:

    python /usr/bin/info-collect-script/audit_resume/confire_replication.py --action set --replication_id replication_id --status error

    The status is successfully changed if the following information is displayed:

    INFO:success.
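
If fakeReplicationAudit.csv lists many entries, the checks in 3 and 5 can be scripted. The following is a minimal sketch only, not a product tool: it assumes that the report has a header row with replication_id and pair_id columns and that the environment variables in 2 have already been imported. The sketch only prints which command applies; setting a status to error, and verifying pairs on the disk array as described in 4, remain manual decisions.

    #!/usr/bin/env python
    # Minimal helper sketch (not part of the product) that batch-runs the "show"
    # command from 3 for every entry in fakeReplicationAudit.csv.
    # Assumptions: the report has a header row with replication_id and pair_id
    # columns, and the environment variables from 2 have been imported.
    import csv
    import subprocess

    SCRIPT = "/usr/bin/info-collect-script/audit_resume/confire_replication.py"

    def show_replication(replication_id):
        # Print the drextend record of one replication pair (the show command in 3).
        return subprocess.call(["python", SCRIPT, "--action", "show",
                                "--replication_id", replication_id])

    with open("fakeReplicationAudit.csv") as report:
        for row in csv.DictReader(report):
            rep_id = (row.get("replication_id") or "").strip()
            pair_id = (row.get("pair_id") or "").strip()
            if not rep_id:
                continue
            print("== replication_id %s (pair_id %r)" % (rep_id, pair_id))
            show_replication(rep_id)
            if not pair_id:
                # Empty pair_id: 5 applies; changing the status is left to the operator.
                print("-> candidate for: --action set --status error")
            else:
                print("-> verify the pair on the disk array first (see 4)")

Run the sketch from the directory that contains the audit report; for entries whose pair_id is not empty, always verify the pair on the disk array as described in 4 before changing any status.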

Stuck Replication Pairs

Context

A stuck replication pair is one that remains in a transition state (creating, deleting, or error_deleting) and is unavailable for use. If a replication pair remains in a transition state for a long period of time (24 hours by default), restore the replication pair based on site requirements.
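
The 24-hour check used in Method 1 and Method 2 below can also be computed programmatically. The following is a minimal sketch, not a product tool; it assumes that the last_update_time value in the audit report uses the "YYYY-MM-DD HH:MM:SS" format, so adjust the format string if your report differs.

    # Minimal sketch of the 24-hour threshold check, assuming last_update_time
    # in ReplicationMidStatusAudit.csv is formatted as "YYYY-MM-DD HH:MM:SS"
    # (adjust the format string if your report differs).
    from datetime import datetime, timedelta

    TRANSITION_STATES = ("creating", "deleting", "error_deleting")
    STUCK_THRESHOLD = timedelta(hours=24)

    def is_stuck(status, last_update_time):
        # True if the pair has stayed in a transition state for 24 hours or more.
        if status not in TRANSITION_STATES:
            return False
        updated = datetime.strptime(last_update_time, "%Y-%m-%d %H:%M:%S")
        return datetime.now() - updated >= STUCK_THRESHOLD

    # Example: a pair reported as creating and last updated well over 24 hours ago.
    print(is_stuck("creating", "2019-01-01 04:00:00"))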

Parameter Description

The name of the audit report is ReplicationMidStatusAudit.csv. Table 18-115 describes parameters in the report.

Table 18-115 Parameter description

Parameter           Description
replication_id      Specifies the universally unique identifier (UUID) of the replication pair in the drextend database.
name                Specifies the name of the replication pair.
status              Specifies the status of the replication pair.
last_update_time    Specifies the last time when the replication pair was updated.
pair_id             Specifies the pair_id value of the replication pair in the array.

Possible Causes

  • A system exception occurs while a service operation on the replication pair is in progress. As a result, the replication pair status is not updated.
  • A database is backed up for future restoration, but operations performed on one or more replication pairs after the backup is created change their statuses. After the database is restored using the backup, the replication pair statuses in the database revert to their former values.

Impact on the System

The replication pair becomes unavailable.

Procedure

Determine the method for rectifying the stuck replication pair based on Table 18-116. For other scenarios, contact technical support for assistance.

Table 18-116 Methods for restoring the stuck replication pair

Status            In Transition State    Description                             Possible Scenario                        Restoration Method
creating          Yes                    A replication pair is being created.    Creating a replication pair              See Method 1.
deleting          Yes                    A replication pair is being deleted.    Deleting a replication pair              See Method 2.
error_deleting    No                     Deletion failed.                        Failing to delete a replication pair     See Method 2.

Method 1

  1. Log in to any controller host in an AZ.

    For details, see Using SSH to Log In to a Host.

  2. Import environment variables. For details, see Importing Environment Variables.
  3. Check the transition state of the replication pair in the audit report.

    If the replication pair status is creating and the time elapsed since the value of last_update_time is 24 hours or more, go to 4. Otherwise, contact technical support for assistance. (A sketch that summarizes the decisions in 4 to 7 is provided after this method.)

  4. Check the status of the replication pair in the drextend database.

    Log in to a controller node and run the python /usr/bin/info-collect-script/audit_resume/confire_replication.py --action show --replication_id replication_id command. Information similar to the following is displayed, which includes the values of replication_id, status, and replication_driver_data:

    [(u'67472457-64b9-4c2c-bf9c-37553f9faa23', u'creating', u'{"pair_id": "4fe8deac7c40005"}')] 
    INFO:success.
    • If the status value is not creating, contact technical support for assistance.
    • If the value of pair_id in replication_driver_data is not empty, go to the next step. Otherwise, go to 7.

  5. Check whether the replication pair exists in the disk array. Log in to the IP SAN V3 disk array by entering its address (for example, https://172.20.0.1:8088) in the address box of a browser.

    1. Log in to the disk array and choose Data Protection > Remote Replication > Remote Replication Pair.
    2. Click the down arrow below the Search button and select Pair ID.

    3. Select Pair ID as the keyword, enter the value of pair_id in replication_driver_data obtained in 4 in the search box, and click Search.

      If no search result is displayed, the replication pair does not exist in the disk array. In this case, go to 7. Otherwise, go to the next step.

  6. Restore the replication pair status to available.

    Run the python /usr/bin/info-collect-script/audit_resume/confire_replication.py --action set --replication_id replication_id --status available command.

    If the following information is displayed, the status is successfully changed. No further action is required.

    INFO:success.

  7. Change the replication pair status to error.

    Run the python /usr/bin/info-collect-script/audit_resume/confire_replication.py --action set --replication_id replication_id --status error command.

    If the following information is displayed, the status is successfully changed. No further action is required.

    INFO:success.
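
The decision in 4 to 7 of Method 1 can be summarized as follows: if pair_id is present in replication_driver_data and the pair is found on the disk array, the status is restored to available; in all other creating cases it is set to error. The following is a minimal sketch under that reading, not a product tool: it only prints the command to run, because checking the pair on the disk array (5) remains a manual step, and the function name and its arguments (including the pair_found_on_array flag) are illustrative.

    # Minimal sketch (not a product tool) of the decision flow in 4 to 7 of Method 1.
    # db_status and driver_data come from the show output in 4; pair_found_on_array
    # is the result of the manual disk array check in 5. The sketch prints the
    # command to run instead of executing it.
    import json

    SCRIPT = "/usr/bin/info-collect-script/audit_resume/confire_replication.py"

    def next_command(replication_id, db_status, driver_data, pair_found_on_array):
        if db_status != "creating":
            return "status is not creating: contact technical support"
        pair_id = json.loads(driver_data or "{}").get("pair_id")
        # 6: restore to available; 7: set to error.
        target = "available" if (pair_id and pair_found_on_array) else "error"
        return ("python %s --action set --replication_id %s --status %s"
                % (SCRIPT, replication_id, target))

    # Example with the values from the sample output in 4:
    print(next_command("67472457-64b9-4c2c-bf9c-37553f9faa23", "creating",
                       '{"pair_id": "4fe8deac7c40005"}', True))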

Method 2

  1. Log in to any controller host in an AZ.

    For details, see Using SSH to Log In to a Host.

  2. Import environment variables. For details, see Importing Environment Variables.
  3. Check the transition state of the replication pair in the audit report.

    • If the replication pair status is deleting and the time elapsed since the value of last_update_time is 24 hours or more, go to 4. Otherwise, contact technical support for assistance.
    • If the replication pair status is error_deleting, go to 4.

  4. Check the status of the replication pair in the drextend database.

    Log in to a controller node and run the python /usr/bin/info-collect-script/audit_resume/confire_replication.py --action show --replication_id replication_id command. Information similar to the following is displayed, which includes the values of replication_id, status, and replication_driver_data:

    [(u'67472457-64b9-4c2c-bf9c-37553f9faa23', u'deleting', u'{"pair_id": "4fe8deac7c40005"}')] 
    INFO:success.

    If the value of status is deleting or error_deleting, go to 5. Otherwise, go to 6.

  5. Change the replication pair status to error.

    Log in to a controller node and run the python /usr/bin/info-collect-script/audit_resume/confire_replication.py --action set --replication_id replication_id --status error command.

    If the following information is displayed, the status is successfully changed. No further action is required.

    INFO:success.

  6. Contact technical support for assistance.

Replication Pair with Inconsistent Statuses

Context

The status of the replication pair recorded in the drextend database is inconsistent with that in the disk array.

Parameter Description

The name of the audit report is statusReplicationAudit.csv.