FusionStorage OBS 7.0 Parts Replacement 05

Replacing an NVMe SSD

NOTE:

This chapter applies to 2288H V5 12-slot nodes and 5288 V5 36-slot nodes.

SSDs can be used as the cache of the storage system. NVMe SSDs support orderly hot swap.

Impact on the System

While an SSD is being replaced, system performance is degraded. Therefore, replace SSDs during off-peak hours.

Prerequisites

  • A spare SSD is ready.
  • The faulty SSD has been located.
  • The storage pool to which the faulty SSD belongs is in the normal state, and no data reconstruction task is running.
NOTE:

For details about the slot numbers of NVMe SSDs, see Slot Numbers.

Precautions

  • Wait until the removal of an SSD is complete before you remove another one.
  • Wait until the insertion of an SSD is complete before you install another one.
  • NVMe SSDs support only orderly hot swap. Before removing an NVMe SSD, stop all services accessing the NVMe SSD.
  • When replacing an SSD, wait for 30 seconds after it is removed before you install a new one.

Tools and Materials

  • ESD gloves
  • ESD wrist straps
  • ESD bags
  • Labels

Replacement Process

Replace an NVMe SSD by following the process shown in Figure 9-1.

Figure 9-1 NVMe SSD replacement process

Procedure

  1. Log in to the CLI of the primary management node as user dfvmanager and run the sh /opt/dfv/oam/oam-p/client/bin/dswareTool.sh --op setServerStorageMode -ip Storage plane floating IP address of the faulty node -mode 1 command to switch the faulty node to maintenance mode. When prompted, enter the user name and password of the CLI super administrator account admin. An illustrative invocation is shown below.
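
    For reference, the following is a minimal sketch of this invocation. The IP address 192.168.10.11 is a hypothetical example; replace it with the storage plane floating IP address of your faulty node.

    # 192.168.10.11 is a hypothetical storage plane floating IP address of the faulty node.
    sh /opt/dfv/oam/oam-p/client/bin/dswareTool.sh --op setServerStorageMode -ip 192.168.10.11 -mode 1
    # The tool then prompts for the CLI super administrator credentials:
    # Enter User Name:admin
    # Enter Password :
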
  2. Obtain the drive letter and ESN of the faulty NVMe SSD.

    Log in to the node where the faulty NVMe SSD resides as user dfvmanager and run the cat /proc/smio_host command to obtain the drive letter and ESN based on the slot ID of the faulty NVMe SSD. For example, the drive letter of the NVMe SSD whose slot ID is 124 is nvme0n1, and the ESN is 031YSVFSJ7000600.
    [dfvmanager@node0101 ~]$ cat /proc/smio_host
    |  ScsiId|    Name| DevId|Capacity|Location|              Esn|                          VMR|    Type|Identify|   State|IoCnt| Event|HW_status|ALM_TYPE| ALM_LEV|       FALT_INFO|SECTOR_SIZE|
    |0:0:-1:0| nvme0n1| 259:0| 1490(G)|   0:124| 031YSVFSJ7000600| HUAWEI,HWE32P43016M000N,3.10|NVME_SSD|   0x0:0|  online|    0|   0x0|        0|       0|       0|            NULL|        512|
    | 0:0:3:0|     sde|  8:64| 3726(G)|     0:3|         Z1ZBC6FD|    ATA,ST4000NM0033-9ZM,SNC6|SATA_HDD|   0x0:0|  online|   11|   0x0|        0|       0|       0|            NULL|        512|
    | 0:0:2:0|     sdd|  8:48| 3726(G)|     0:2|         Z1Z0C0BM|    ATA,ST4000NM0033-9ZM,SN06|SATA_HDD|   0x0:0|  online|    4|   0x0|        0|       0|       0|            NULL|        512|
    | 0:0:1:0|     sdc|  8:32| 3726(G)|     0:1|         Z1Z0CE2R|    ATA,ST4000NM0033-9ZM,SN06|SATA_HDD|   0x0:0|  online|    9|   0x0|        0|       0|       0|            NULL|        512|
    | 0:0:0:0|     sdb|  8:16| 3726(G)|     0:0|         Z1Z1LWNJ|    ATA,ST4000NM0033-9ZM,SN06|SATA_HDD|   0x0:0|  online|    0|   0x0|        0|       0|       0|            NULL|        512|
    NOTE:

    A slot ID is a slot number plus 80. For example, if the slot number of an NVMe SSD is 44, its slot ID is 124.

  3. Set a kernel parameter.

    1. Run the su - root command and enter the password of user root to switch to user root.
    2. Run the vim /etc/default/grub command and press i to enter the editing mode.
    3. Locate GRUB_CMDLINE_LINUX="crashkernel=256M rd.lvm.lv=dfvos/root" and append pciehp.pciehp_force=1 pci=pcie_bus_perf to it, inside the quotation marks.

      Separate the appended parameters from the existing content with a single space; do not insert a line break.

      [root@node0101 ~]# vim /etc/default/grub
      GRUB_TIMEOUT=5
      GRUB_DISTRIBUTOR="$(sed 's, release .*$,,g' /etc/system-release)"
      GRUB_DEFAULT=saved
      GRUB_DISABLE_SUBMENU=true
      GRUB_TERMINAL_OUTPUT="console"
      GRUB_CMDLINE_LINUX="crashkernel=256M rd.lvm.lv=dfvos/root pciehp.pciehp_force=1 pci=pcie_bus_perf"
      GRUB_DISABLE_RECOVERY="true"
    4. Press Esc to exit the editing mode, enter :wq, and press Enter.
    5. Run the command that matches the node's boot mode so that the modification takes effect. Nodes boot in UEFI mode by default; a quick way to check the boot mode is shown after the two commands.
      • Legacy mode: grub2-mkconfig -o /boot/grub2/grub.cfg
      • UEFI mode: grub2-mkconfig -o /boot/efi/EFI/euleros/grub.cfg
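
      If you are unsure which boot mode the node uses, the following check is a minimal sketch based on the standard Linux EFI sysfs directory (it is not taken from this guide):

      # The /sys/firmware/efi directory exists only when the system booted in UEFI mode.
      [ -d /sys/firmware/efi ] && echo "UEFI mode" || echo "Legacy mode"
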
    6. Restart the faulty node.

      Log in to DeviceManager and choose Cluster > Hardware. In the faulty node area, click and select Restart.

  4. Change the value of the a8 register. Otherwise, the NVMe SSD does not support the orderly hot swap function.

    1. Use SSH to log in to the storage node as user dfvmanager.
    2. Run the su - root command and enter the password of user root to switch to user root.
    3. Run the lspci -s bdf -xxx command to check the value of the a8 register (the ninth byte in the a0 row of the hex dump, that is, configuration-space offset 0xa8) and record the value.
      [root@node0101 ~]# lspci -s d7:00.0 -xxx
      d7:00.0 PCI bridge: Intel Corporation Sky Lake-E PCI Express Root Port A (rev 04)
      00: 86 80 30 20 47 05 10 00 04 00 04 06 10 00 01 00
      10: 00 00 00 00 00 00 00 00 d7 d8 d8 00 f0 00 00 20
      20: 80 ee 80 ee 01 00 11 00 f0 03 00 00 f0 03 00 00
      30: 00 00 00 00 40 00 00 00 00 00 00 00 ff 01 03 00
      40: 0d 60 00 00 86 80 00 00 00 00 00 00 00 00 00 00
      50: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
      60: 05 90 03 01 38 00 e0 fe 00 00 00 00 02 00 00 00
      70: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
      80: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
      90: 10 e0 42 01 21 80 00 00 27 01 00 00 43 30 7a 09
      a0: 00 00 43 70 5b 00 e0 03 f1 11 40 00 1f 00 01 00
      b0: 00 00 00 00 be 13 00 00 29 00 00 00 0e 00 00 00
      c0: 03 00 1e 00 00 00 00 00 00 00 00 00 00 00 00 00
      d0: 00 00 00 00 10 00 00 00 00 00 00 00 00 00 00 00
      e0: 01 00 03 c8 08 00 00 00 00 00 00 00 00 00 00 00
      f0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00

      bdf indicates the value of ROOT PORT (B/D/F) for the NVMe SSD. This step uses the NVMe SSD in slot 44 as an example. Table 9-1 lists the values of ROOT PORT (B/D/F) for NVMe SSDs.

      Table 9-1 B/D/F mapping

      Configuration              Slot Number    ROOT PORT (B/D/F)    Device (B/D/F)
      4 x 2.5-inch rear disks    44             d7:00.0              d8:00.0
                                 45             d7:01.0              d9:00.0
                                 46             d7:02.0              da:00.0
                                 47             d7:03.0              db:00.0

      NOTE:

      If the default value of the a8 register is not f1, contact Huawei technical support.
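
      As an alternative to scanning the full hex dump, setpci can read the single byte at offset a8 directly. The following is a minimal sketch using the slot 44 root port d7:00.0 from Table 9-1 as the example bdf:

      # Reads the byte at configuration-space offset 0xa8; the expected default value is f1.
      setpci -s d7:00.0 a8.B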

    4. Run the setpci -s bdf a8.B=e1 command to change the value of the a8 register to e1.
    5. Run the lspci -s bdf -xxx command to verify the modification.
      [root@node0101 ~]# lspci -s d7:00.0 -xxx
      d7:00.0 PCI bridge: Intel Corporation Sky Lake-E PCI Express Root Port A (rev 04)
      00: 86 80 30 20 47 05 10 00 04 00 04 06 10 00 01 00
      10: 00 00 00 00 00 00 00 00 d7 d8 d8 00 f0 00 00 20
      20: 80 ee 80 ee 01 00 11 00 f0 03 00 00 f0 03 00 00
      30: 00 00 00 00 40 00 00 00 00 00 00 00 ff 01 03 00
      40: 0d 60 00 00 86 80 00 00 00 00 00 00 00 00 00 00
      50: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
      60: 05 90 03 01 38 00 e0 fe 00 00 00 00 02 00 00 00
      70: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
      80: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
      90: 10 e0 42 01 21 80 00 00 27 01 00 00 43 30 7a 09
      a0: 00 00 43 70 5b 00 e0 03 e1 11 40 00 1f 00 01 00
      b0: 00 00 00 00 be 13 00 00 29 00 00 00 0e 00 00 00
      c0: 03 00 1e 00 00 00 00 00 00 00 00 00 00 00 00 00
      d0: 00 00 00 00 10 00 00 00 00 00 00 00 00 00 00 00
      e0: 01 00 03 c8 08 00 00 00 00 00 00 00 00 00 00 00
      f0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00

  5. Remove the faulty NVMe SSD.

    1. If a file system is mounted to the faulty NVMe SSD, run the following command to unmount the file system from it:

      umount /dev/NVMe drive letter
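
      The following is a minimal sketch using the example drive letter nvme0n1 obtained in step 2; findmnt is a standard Linux utility and is not mentioned elsewhere in this guide:

      # List any mount points backed by the faulty SSD; no output means nothing is mounted.
      findmnt --source /dev/nvme0n1
      # Unmount the file system if a mount point was reported.
      umount /dev/nvme0n1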

    2. Find the slot ID of the NVMe SSD in the operating system according to Table 9-2.
      Table 9-2 Mapping between server slot numbers and slot IDs in the operating system

      Configuration              Slot Number    Slot ID
      4 x 2.5-inch rear disks    44 to 47       124 to 127

    3. Run the following command to remove the disk securely:

      echo 0 > /sys/bus/pci/slots/Slot ID/power

      For example, to hot remove the NVMe SSD in slot 44, run the following command:

      echo 0 > /sys/bus/pci/slots/124/power
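
      To confirm from the operating system that the slot has powered off before touching the hardware, you can read the same sysfs attribute back. This is a sketch using the slot ID 124 example; a value of 0 indicates that the slot is powered off:

      cat /sys/bus/pci/slots/124/power
      # Expected output after a successful orderly removal:
      # 0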

    4. Observe the NVMe SSD indicators. When the green indicator is off and the yellow indicator blinks at 0.5 Hz, you can remove the NVMe SSD.
      Perform the following steps to remove the faulty NVMe SSD:
      1. Press the button that secures the disk module ejector lever, as shown in step 1 in Figure 9-2.

        The ejector lever automatically ejects.

        Figure 9-2 Removing a disk module
      2. Hold the ejector lever and pull the disk module out approximately 3 cm, as shown in step 2 in Figure 9-2.
      3. Wait at least 30 seconds until the disk stops spinning, and slowly pull out the disk module, as shown in step 3 in Figure 9-2.
    5. Place the removed NVMe SSD in an ESD bag.

  6. Install the spare NVMe SSD.

    1. After the orderly hot removal, run the setpci -s bdf a8.B=f1 command to restore the value of the a8 register to f1.
      NOTE:

      If the value of the register is not restored, the hot insertion of the NVMe SSD may be abnormal.

      Run the lspci -s bdf -xxx command to verify the modification.
      [root@node0101 ~]# lspci -s d7:00.0 -xxx
      d7:00.0 PCI bridge: Intel Corporation Sky Lake-E PCI Express Root Port A (rev 04)
      00: 86 80 30 20 47 05 10 00 04 00 04 06 10 00 01 00
      10: 00 00 00 00 00 00 00 00 d7 d8 d8 00 f0 00 00 20
      20: 80 ee 80 ee 01 00 11 00 f0 03 00 00 f0 03 00 00
      30: 00 00 00 00 40 00 00 00 00 00 00 00 ff 01 03 00
      40: 0d 60 00 00 86 80 00 00 00 00 00 00 00 00 00 00
      50: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
      60: 05 90 03 01 38 00 e0 fe 00 00 00 00 02 00 00 00
      70: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
      80: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
      90: 10 e0 42 01 21 80 00 00 27 01 00 00 43 30 7a 09
      a0: 00 00 43 70 5b 00 e0 03 f1 11 40 00 1f 00 01 00
      b0: 00 00 00 00 be 13 00 00 29 00 00 00 0e 00 00 00
      c0: 03 00 1e 00 00 00 00 00 00 00 00 00 00 00 00 00
      d0: 00 00 00 00 10 00 00 00 00 00 00 00 00 00 00 00
      e0: 01 00 03 c8 08 00 00 00 00 00 00 00 00 00 00 00
      f0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
    2. Take the spare NVMe SSD out of its ESD bag.
    3. Perform the following steps to install the spare NVMe SSD:
      1. Raise the ejector lever and push the disk module in along the guide rails as far as it will go, as shown in step 1 in Figure 9-3.
        Figure 9-3 Installing a disk module
      2. Ensure that the ejector lever is fastened to the beam, and lower the ejector lever to completely insert the disk module into the slot, as shown in step 2 in Figure 9-3.
    4. The green indicator of the NVMe SSD will be off, and the yellow indicator will blink at 2 Hz. Then, both the indicators will turn off for about 30 seconds. After the green indicator becomes steady on, log in to the iBMC of the storage node. Choose Information > System Info > Storage, and check whether the new NVMe SSD runs properly.
      NOTE:

      The NVMe SSD power-on duration varies by vendor.
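
      You can also confirm from the operating system that the new SSD has been enumerated. The following sketch uses the slot 44 device B/D/F d8:00.0 from Table 9-1 and the drive listing command used earlier in this procedure:

      # The device should be listed again once the hot insertion is complete.
      lspci -s d8:00.0
      # The new SSD should also appear in the drive list with State "online".
      cat /proc/smio_host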

    5. Run the service irqbalance status command. If the command output contains active (running), the irqbalance service is running and is used to balance CPU interrupts.
      [root@node0101 ~]# service irqbalance status
      Redirecting to /bin/systemctl status irqbalance.service
      ● irqbalance.service - irqbalance daemon
      Loaded: loaded (/usr/lib/systemd/system/irqbalance.service; enabled; vendor preset: enabled)
      Active: active (running) since Mon 2018-12-10 12:06:59 CST; 1 weeks 1 days ago
      Main PID: 1347 (irqbalance)
      Memory: 32.0K
      CGroup: /system.slice/system-hostos.slice/irqbalance.service
      └─1347 /usr/sbin/irqbalance --foreground --policyscript=/etc/sysconfig/irqbalance.rules --hintpolicy=subset

      In this case, run the following command to restart the irqbalance service after the hot insertion:

      systemctl restart irqbalance.service

  7. Add the spare NVMe SSD into the storage pool.

    1. Obtain the ESN of the spare NVMe SSD. Log in to the node where the faulty NVMe SSD resided as user dfvmanager and run the cat /proc/smio_host command to obtain the ESN based on the slot ID of the faulty NVMe SSD. For example, the ESN of the NVMe SSD whose slot ID is 124 is 032JLF10H7000227.
      [dfvmanager@node0101 ~]$ cat /proc/smio_host
      |  ScsiId|    Name| DevId|Capacity|Location|              Esn|                          VMR|    Type|Identify|   State|IoCnt| Event|HW_status|ALM_TYPE| ALM_LEV|       FALT_INFO|SECTOR_SIZE|
      |0:0:-1:0| nvme0n1| 259:0| 1490(G)|   0:124| 032JLF10H7000227| HUAWEI,HWE32P43016M000N,3.10|NVME_SSD|   0x0:0|  online|    0|   0x0|        0|       0|       0|            NULL|        512|
      | 0:0:3:0|     sde|  8:64| 3726(G)|     0:3|         Z1ZBC6FD|    ATA,ST4000NM0033-9ZM,SNC6|SATA_HDD|   0x0:0|  online|   11|   0x0|        0|       0|       0|            NULL|        512|
      | 0:0:2:0|     sdd|  8:48| 3726(G)|     0:2|         Z1Z0C0BM|    ATA,ST4000NM0033-9ZM,SN06|SATA_HDD|   0x0:0|  online|    4|   0x0|        0|       0|       0|            NULL|        512|
      | 0:0:1:0|     sdc|  8:32| 3726(G)|     0:1|         Z1Z0CE2R|    ATA,ST4000NM0033-9ZM,SN06|SATA_HDD|   0x0:0|  online|    9|   0x0|        0|       0|       0|            NULL|        512|
      | 0:0:0:0|     sdb|  8:16| 3726(G)|     0:0|         Z1Z1LWNJ|    ATA,ST4000NM0033-9ZM,SN06|SATA_HDD|   0x0:0|  online|    0|   0x0|        0|       0|       0|            NULL|        512|
    2. Log in to the CLI of the primary management node as user dfvmanager, and run the following command to obtain the storage pool ID. To run this command, you need to enter the user name and password of CLI super administrator account admin.

      sh /opt/dfv/oam/oam-p/client/bin/dswareTool.sh --op queryStoragePool

      Information about all storage pools is displayed in the command output. The poolId column (the leftmost column) lists the IDs of all storage pools.

      [dfvmanager@node0101 ~]$ sh /opt/dfv/oam/oam-p/client/bin/dswareTool.sh --op queryStoragePool
      [Thu Dec 20 09:44:35 CST 2018] DswareTool operation start.
      Enter User Name:admin
      Enter Password :
      Operation finish successfully. Result Code:0
      Dec 20, 2018 9:44:44 AM com.huawei.dfv.persistent.oam.client.cmd.QueryStoragePool handleSuccessResult
      INFO:
      poolId poolName totalCapacity phy(MB) usedCapacity phy(MB) freeCapacity logic(MB) thinRate thinThreshold azProperty protectMode routingMode slowIoSwitch replicationFactor poolGroupName poolGroupId
      --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
      0 pool_sata_ec 130351795 449696 86601399 0 70 inner ec online null 0 default 0
      
      When there is no dsware client or failed to query storage pool capacity, the totalCapacity/usedCapacity will be set as 0.
      [Thu Dec 20 09:44:45 CST 2018] DswareTool operation end.
    3. Run the sh /opt/dfv/oam/oam-p/client/bin/dswareTool.sh --op queryStorageNodeInfo -id poolId command to query the ID of the storage pool to which the faulty NVMe SSD belongs. To run this command, you need to enter the user name and password of CLI super administrator account admin.
      If cacheInfo contains the ESN of the faulty NVMe SSD, poolId in the command is the ID of the storage pool to which the faulty NVMe SSD belongs.
      [dfvmanager@node0101 ~]$ sh /opt/dfv/oam/oam-p/client/bin/dswareTool.sh --op queryStorageNodeInfo -id 0
      [Wed Dec 19 11:36:38 CST  2018] DswareTool operation start.
      Enter User Name:admin
      Enter Password  :
      Operation finish successfully. Result Code:0
      The result as  follow:
      nodeMgrIp:8.44.124.6 poolId:0 nodeType:0 rack:2  subRack:null
      diskInfo:
      diskSn:K4KDYJRB diskType:5 diskSize:3726 diskUse:1  diskStatus:0 diskSlot:0
      diskSn:K4KDL7JB diskType:5 diskSize:3726 diskUse:1  diskStatus:0 diskSlot:1
      diskSn:K3GEU03B diskType:5 diskSize:3726 diskUse:1  diskStatus:0 diskSlot:2
      diskSn:K4J7KZRB diskType:5 diskSize:3726 diskUse:1  diskStatus:0 diskSlot:6
      diskSn:K4KEBS2B diskType:5 diskSize:3726 diskUse:1  diskStatus:0 diskSlot:7
      diskSn:K4KDL46B diskType:5 diskSize:3726 diskUse:1  diskStatus:0 diskSlot:8
      diskSn:K4JK3K1B diskType:5 diskSize:3726 diskUse:1  diskStatus:0 diskSlot:12
      diskSn:K4J7TX2B diskType:5 diskSize:3726 diskUse:1  diskStatus:0 diskSlot:13
      diskSn:K4J7X04B diskType:5 diskSize:3726 diskUse:1  diskStatus:0 diskSlot:14
      diskSn:K4JK3JJB diskType:5 diskSize:3726 diskUse:1  diskStatus:0 diskSlot:18
      diskSn:K4J0DZBB diskType:5 diskSize:3726 diskUse:1  diskStatus:0 diskSlot:19
      diskSn:K4J7NBNB diskType:5 diskSize:3726 diskUse:1  diskStatus:0 diskSlot:20
      cacheInfo:
      cacheEsn:031YSVFSJ7000600 cacheType:3  cacheSize:1600 usedSize:1596 cacheStatus:3
      cacheEsn:031YSVFSJ6000940 cacheType:3  cacheSize:1600 usedSize:1596 cacheStatus:3
      ......
    4. Run the following command to add the spare NVMe SSD into the storage pool. To run this command, you need to enter the user name and password of CLI super administrator account admin.

      sh /opt/dfv/oam/oam-p/client/bin/dswareTool.sh --op forceReplaceSSD -id ID of the storage pool to which the faulty NVMe SSD belongs -oldEsn ESN of the faulty NVMe SSD -newEsn ESN of the spare NVMe SSD -nodeMgrIp Storage plane floating IP address of the faulty node -type cache -ignoreMediaFault true
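
      For reference, the following sketch uses the example values from the preceding steps (storage pool ID 0, faulty ESN 031YSVFSJ7000600, spare ESN 032JLF10H7000227). It assumes that the nodeMgrIp value 8.44.124.6 reported by queryStorageNodeInfo is the storage plane floating IP address of the faulty node; substitute the values for your own environment.

      sh /opt/dfv/oam/oam-p/client/bin/dswareTool.sh --op forceReplaceSSD -id 0 -oldEsn 031YSVFSJ7000600 -newEsn 032JLF10H7000227 -nodeMgrIp 8.44.124.6 -type cache -ignoreMediaFault true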

  8. Log in to the CLI of the primary management node as user dfvmanager, and run the sh /opt/dfv/oam/oam-p/client/bin/dswareTool.sh --op setServerStorageMode -ip Storage plane floating IP address of the faulty node -mode 0 command to switch to the normal mode. To run this command, you need to enter the user name and password of CLI super administrator account admin.
  9. Check the firmware version.

    Log in to the node where the faulty NVMe SSD resided as user dfvmanager. Run the su - root command to switch to user root, and run the following command to check the firmware version (nvme0 is used as an example):

    hioadm updatefw -d nvme0

    [root@node0101 ~]# hioadm updatefw -d nvme0
    slot  version   activation
    1     3.10       
    2     3.10      current
    3     3.10 

    The version in the row marked current is the running firmware version. If the current firmware version is not 3.10 or later, contact Huawei technical support.

  10. Check the system status.

    On SmartKit, choose Home > Storage > Routine Maintenance > More > Inspection and check the system status.
    • If all inspection items pass the inspection, the inspection is successful.
    • If some inspection items fail, the inspection fails. Rectify the faults by taking recommended actions in the inspection reports. Perform inspection again after fault rectification. If the inspection still fails, contact Huawei technical support.

    For details, see the FusionStorage OBS Administrator Guide.

Follow-up Procedure

Label the replaced SSD to facilitate subsequent operations.
