Huawei Server Maintenance Manual 09

Oracle

RAC Breaks Down Due to Accidental Erase of the ASM Disk

Problem Description
Table 5-267 Basic information

Item

Information

Source of the Problem

FusionCube troubleshooting

Intended Product

All Oracle versions

Release Date

2018-05-28

Keyword

Accidental erase

Symptom

The ASM disk is erased accidentally.

Key Process and Cause Analysis

Problem analysis

After the ASM disk group is created, the ASM disk is accidentally overwritten with fio or dd. As a result, after the database is installed, the tablespace cannot be created and the database breaks down unexpectedly.

Log analysis

kfed analysis of block 86 in file 2147483666 shows that the endian_kfbh field is corrupted: kfc.c line 29297 reports that the stored value should be 1, but the actual value is 0.

Conclusion and Solution

Conclusion

Oracle RAC breaks down because the ASM disk is accidentally erased.

Solution

  1. Use the backup to rectify the fault.

    The ASM automatically backs up the au=0 metadata area. If the damaged area is in the au=0 metadata area, it can be restored from the backup (see the sketch after this procedure).

    kfed repair asmdisk*

  2. Back up and restore services.

    If the damaged area is in the database data area, first delete the ASM disk group and create a new one, and then use the RMAN tool to restore the data.

    If the ASM disk group is damaged, data consistency in the database may be compromised, and restoring the data will be difficult. To prevent such incidents from happening again, do not erase the ASM disks or format a file system on them at the OS layer.
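A hedged sketch of the check and repair in step 1: the device path /dev/asmdisk1 is an assumption and must be replaced with the actual ASM disk path.

    kfed read /dev/asmdisk1 | grep kfbh    # a healthy header shows kfbh.type: KFBTYP_DISKHEAD; a damaged one typically shows KFBTYP_INVALID
    kfed repair /dev/asmdisk1              # restores the damaged au=0 disk header from its automatic backup copy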

Experience

None

Collecting Logs in the Oracle Database Scenario of FusionCube

Problem Description
Table 5-268 Basic information

Item

Information

Source of the Problem

FusionCube troubleshooting

Intended Product

FusionCube Oracle database scenario

Release Date

2018-05-28

Keyword

Log collection

Symptom

Log collection involves logs of the OS, database, FusionStorage, and IB.

Key Process and Cause Analysis

Key process

  1. Collect FusionStorage logs.
    1. Collect logs of versions earlier than FusionCube C60.

      The directory for storing FusionStorage logs is /var/log/dsware. Use WinSCP to download logs generated within 12 hours before and after the fault occurs.

      Because the storage node is security hardened, the root user cannot directly download the logs. Copy the logs to the /tmp directory, run the chown command to change the owner of the log files to user fc2 or dsware, and then download the logs as that user.

    2. Collect logs of FusionCube C60 and later versions.

      Log in to the FusionCube Center management portal, choose System > System Maintenance > Log Collection, select the node whose logs need to be collected, adjust the collection range, and click Collect Log.

      It takes a long time to collect logs if the collection time range is longer than one day. After the collection is complete, click Download to save the logs to the local PC.

  2. Collect OS logs.

    For the SUSE OS:

    Run the supportconfig command to collect system logs.

    For the Red Hat and Oracle OSs:

    Run the sosreport command and press Enter multiple times when prompted.

  3. Collect database logs.
    1. One-click collection.

      The tfactl tool collects Oracle logs with a single command.

      [root@db01 db01]# find / -name tfactl

      /u01/app/11.2.0/grid/tfa/bin/tfactl

      /u01/app/11.2.0/grid/tfa/db01/tfa_home/bin/tfactl

      [root@db01 trace]# /u01/app/11.2.0/grid/tfa/bin/tfactl diagcollect -from "Mar/5/2013 09:00 " -to "Mar/5/2013 21:00:00" //Collect logs of 12 hours on all nodes of the database on March 5, 2013.

      If the message "TFA is not running" is displayed when you run the log collection command, you need to manually start the tfactl log collection process.

      [root@db01 db01]# /u01/app/11.2.0/grid/tfa/bin/tfactl diagcollect -from "Mar/5/2013 09:00 " -to "Mar/5/2013 21:00:00"

      TFA-00002: TFA is not running

      [root@db01 db01]# /u01/app/11.2.0/grid/tfa/bin/tfactl start //Manually start the tfactl log collection process.

    2. Manual collection.

      For versions earlier than Oracle 12.2:

      [root@db01 app]#su - grid

      [grid@db01 ~]#ls -l $ORACLE_HOME/log/dbn01 //View the alertdbn01.log file first.

      Download related logs, such as the cssd and crsd logs, based on the information displayed in the alert log.

      For Oracle ASM:

      [root@db01 trace]# cd /u01/app/grid/diag/asm/+asm/+ASM1/trace //View the alert_+ASM1.log file first.

      For Oracle Database:

      [root@db01 trace]# cd /u01/app/oracle/diag/rdbms/db0/db01/trace/ //View the alert_db01.log file first (db01 is the Oracle instance name).

      For versions later than Oracle 12.2:

      Grid logs of Oracle 12c and later versions are stored in the /u01/app/grid/diag directory.

      [root@db01 app]#su - grid

      [grid@db01 ~]# ls /u01/app/grid/diag/crs/rac01/crs/trace/alert.log //View the alert.log file first.

      Download related logs, such as the cssd and crsd logs, based on the information displayed in the alert log.

      For Oracle ASM:

      [root@db01 trace]# ls -l /u01/app/grid/diag/asm/+asm/+ASM1/trace/alert_+ASM1.log //View the alert_+ASM1.log file of ASM logs first.

      For Oracle Database:

      [root@db01 trace]# ls -l /u01/app/oracle/diag/rdbms/rac/rac01/trace/alert_rac01.log //View the alert_rac01.log file of Oracle database logs first (rac01 is the Oracle instance name).

    3. Log collection of versions earlier than Oracle 12.2.

      Switch to the oracle user and run the sqlplus command to collect the AWR report of the database.

      [oracle@db01 ~]$ cd $ORACLE_HOME

      [oracle@db01 ~]$ cd rdbms/admin/

      [oracle@db01 admin]$ sqlplus / as sysdba

      As shown in the following figure, collect the logs of the 01 node to obtain the database snapshot data generated within one day.

      The database generates a snapshot every hour by default.

      Manually create a snapshot of the current database.

      Log collection of versions later than Oracle 12.2.

      Switch to the oracle user and run the sqlplus command to collect the AWR report of the database.

      [oracle@db01 ~]$ cd $ORACLE_HOME

      [oracle@db01 ~]$ cd rdbms/admin/

      [oracle@db01 admin]$ sqlplus / as sysdba

      As shown in the following figure, collect the logs of the 01 node to obtain the database snapshot data generated within one day.

      The database generates a snapshot every hour by default.

      Manually create a snapshot at the current point in time (a sketch of generating the AWR report from the snapshots follows this procedure).

      SQL> exec dbms_workload_repository.create_snapshot;

      PL/SQL procedure successfully completed.

  4. Collect IB logs.

    Log in to any compute node and run a command to obtain the following information:

    1. Network topology information

      Run the iblinkinfo > iblinkinfo.info command to collect the iblinkinfo.info file.

    2. Network diagnosis information

      Run the ibdiagnet -r -pc --pm_pause_time 1200 -P all=1 command to collect all files in the /var/tmp/ibdiagnet2/ directory.
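The AWR collection described in step 3 can be completed as follows. This is a minimal sketch run as the oracle user in sqlplus; the snapshot range and report file name are chosen interactively by the standard script.

    SQL> exec dbms_workload_repository.create_snapshot;    -- create a snapshot at the current point in time
    SQL> @?/rdbms/admin/awrrpt.sql                          -- generate the AWR report; the script prompts for the report type, snapshot range, and output file name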

Conclusion and Solution

None

Experience

None

Adjusting the Volume Attachment Mode of the FusionCube OS

Problem Description
Table 5-269 Basic information

Item

Information

Source of the Problem

FusionCube troubleshooting

Intended Product

FusionStorage Block V100R003C30 or later

Release Date

2018-05-28

Keyword

Max CQs reached

Symptom

After the host of the compute node is restarted, the host cannot attach the file system in the fstab file due to the delayed mounting of the storage volume. As a result, the host enters the maintenance mode.

Key Process and Cause Analysis

After the kernel starts, the Linux init process starts the user-mode services of the OS in sequence. The default start priority of the FusionStorage dsware_agent and dsware_attach_volume services is 39, and the start priority of applications is about 90, so the FusionStorage services start first. The dsware_agent service then starts the VBS server, and after the VBS service is running, the dsware_attach_volume service automatically attaches the volumes.

Conclusion and Solution

Solution

  1. Adjust the boot sequence of the dsware.

    vi /etc/init.d/dswareStopVBS

    vi /etc/init.d/dswareAgent

    vi /etc/init.d/dsware_attach_volume

    In the dswareStopVBS file, change the third field of the chkconfig line (the start priority) from 38 to 12, and ensure that the value is greater than the start priority of the network service shown in the preceding figure (a sample chkconfig header line is shown after this procedure).

    In the dswareAgent and dsware_attach_volume files, change the same field from 39 to 13, and ensure that the value is greater than the value in the dswareStopVBS file.

    Reconfigure the dswareStopVBS and dswareAgent services. The dsware_attach_volume service adjusts automatically after the dswareStopVBS file is adjusted. No manual operation is required.

    chkconfig --del dswareStopVBS

    chkconfig --add dswareStopVBS

    chkconfig --level 2345 dswareStopVBS on

    chkconfig --del dswareAgent

    chkconfig --add dswareAgent

    chkconfig --level 2345 dswareAgent on

    Check whether the dsware-related services have been adjusted.

  2. Adjust the volume mounting parameters of the fstab file.

    Change the mount options of the volume in the fstab file to include _netdev (a sample entry is shown after this procedure).

  3. Restart the server to check whether the volume mounting is normal.

    Run the df -h command to check whether the volume is mounted properly.
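The following sketch shows what step 1 and step 2 look like in the files themselves. The stop priority (87), device name, mount point, and file system type are assumptions and must be adapted to the actual environment.

    # chkconfig header in /etc/init.d/dswareAgent (runlevels, start priority, stop priority):
    # chkconfig: 2345 13 87
    # fstab entry with _netdev so that the mount waits until the network and the attached volume are available:
    /dev/sdb    /oradata    ext4    defaults,_netdev    0 0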

Experience

None

RDS Driver Cannot Be Loaded Due to IB Parameter File Modification

Problem Description
Table 5-270 Basic information

Item

Information

Source of the Problem

FusionCube troubleshooting

Intended Product

Oracle 11g or later

Release Date

2018-05-28

Keyword

RDS

Symptom

After the private network IPOIB of the database node is changed to RDS, the database ASM instance fails to be started. The message "Invalid protocol requested (2) or protocol not loaded" is displayed, indicating that the RDS driver cannot be loaded.

Key Process and Cause Analysis

None

Conclusion and Solution

Solution

This problem is caused by user-defined IB network settings in the modprobe.conf file and by CIS.conf security hardening, which make RDS access abnormal. You need to manually remove these parameter files.

  1. The /etc/modprobe.conf file is a user-defined file. Move it to another directory.
  2. Run ls -l /etc/modprobe.d/ to list the directory, compare it with the same directory on a normal node, and move the configuration files that exist only on the faulty node to another directory (see the sketch after this procedure).

    NOTE:

    Do not manually edit the IB network adapter parameter file. Any modification to the IB network parameters must be confirmed with Huawei R&D engineers.
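A hedged sketch of the two steps above; the backup directory /root/modprobe_bak and the file name placeholder are assumptions.

    mkdir -p /root/modprobe_bak
    mv /etc/modprobe.conf /root/modprobe_bak/    # step 1: move the user-defined file out of the way
    ls -l /etc/modprobe.d/                       # step 2: list the directory and compare it with the same directory on a normal node
    mv /etc/modprobe.d/<newly_added_file>.conf /root/modprobe_bak/    # move only the files that exist solely on the faulty node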

Experience

None

Oracle RAC Breakdown Caused by I/O Timeout

Problem Description
Table 5-271 Basic information

Item

Information

Source of the Problem

FusionCube troubleshooting

Intended Product

All FusionStorage versions

Release Date

2018-05-28

Keyword

Oracle, I/O

Symptom

During the stable running of the database, the VBS process is faulty. As a result, the Agent restarts the VBS. After the restart, the database breaks down.

Key Process and Cause Analysis

Key process

  1. Analyze database alert logs.
    Figure 5-332 Analysis of database alert logs

  2. The Oracle database has the following requirements on the OCR disk I/O latency:

    If the ASM PST heartbeats to the disks in a normal or high redundancy disk group are delayed beyond the heartbeat wait time, the ASM instance dismounts the disk group. The default wait time is 15 seconds.

    The OCR disk group is created in normal or high redundancy mode. Therefore, the OCR disk latency must be less than 15 seconds.

    https://support.oracle.com/epmos/faces/DocumentDisplay?_afrLoop=272356784595958&id=1581684.1&_adf.ctrl-state=kuranodbd_847

Cause analysis

The Oracle database has specific requirements on the OCR disk I/O latency.

Conclusion and Solution

Conclusion

Oracle RAC crashes due to I/O timeout.

Solution

  1. Check whether the OS and storage parameters can be modified to reduce the storage I/O latency and ensure that the OCR disk I/O latency is less than 15 seconds.
  2. Change the default value.

    Change the value of the hidden parameter _asm_hbeatiowait (a sketch for checking the current value follows this procedure).

    alter system set "_asm_hbeatiowait"=120 scope=spfile sid='*';

    For versions later than 12.1.0.2, the default value of this parameter is changed to 120s.
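A sketch for checking the current value of the hidden parameter before and after the change. The query against x$ksppi/x$ksppcv is a common technique rather than a documented interface and must be run as SYSASM on the ASM instance.

    SQL> select a.ksppinm name, b.ksppstvl value
           from x$ksppi a, x$ksppcv b
          where a.indx = b.indx and a.ksppinm = '_asm_hbeatiowait';
    SQL> alter system set "_asm_hbeatiowait"=120 scope=spfile sid='*';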

Experience

This problem also occurs in a SAN storage scenario. Because the SAN storage controller is abnormal after being configured with multiple fibre channels, I/O timeout is very likely to occur. If storage I/O latency cannot be reduced, you can change the hidden parameter.


OCR Disk Group Failure and Replacement

Problem Description
Table 5-272 Basic information

Item

Information

Source of the Problem

FusionCube troubleshooting

Intended Product

Oracle 11g or later

Release Date

2018-05-28

Keyword

OCR restoration

Symptom

When the OCR is stored in the ASM and is damaged, the OCR cannot be restored directly. The ocrconfig -restore command can be executed only when the ASM is running. However, since the OCR is lost, the ASM cannot be started.

Key Process and Cause Analysis

None

Conclusion and Solution

Solution

  1. If the OCR is restored to the new disk group, modify the /etc/oracle/ocr.loc file.
  2. Check the OCR backup information.

    The automatic OCR backup set may be located on any node in the cluster. You need to search for the backup set one by one.

    #ocrconfig -showbackup (The backup is performed automatically by default.)

  3. Disable all GI services.

    # crsctl stop crs -f

  4. Start the CRS service.

    # crsctl start crs -excl -nocrs

  5. Log in to the ASM to create a disk group and import the latest OCR for backup.

    SQL> create diskgroup CRS normal redundancy

    disk '/dev/oracleasm/disks/asm-disk01', '/dev/oracleasm/disks/asm-disk02', '/dev/oracleasm/disks/asm-disk03'

    attribute 'COMPATIBLE.ASM' = '11.2.0.0.0';

    # ocrconfig -restore backup00.ocr

  6. Re-create the voting disk.

    # crsctl replace votedisk +CRS

  7. Stop and restart the CRS.

    # crsctl stop crs -f

  8. Start the CRS.

    # crsctl start crs

    Note

    If the disk path needs to be changed, the path of the spfile file also needs to be changed.

    Perform the following steps:

    # su - grid

    #sqlplus / as sysasm

    SQL> alter system set asm_diskstring='/dev/oracleasm/disks/*' scope=memory;

    SQL> create spfile='+CRS' from memory;

    SQL> startup force;

    SQL> show parameter spfile

    SQL> exit

    Restart the cluster and replace or repair the OCR disk (a verification sketch follows).
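After the cluster restarts, the following commands (run as root or grid) can be used to confirm that the OCR and voting disks are healthy. This is a hedged verification sketch, not part of the original procedure.

    # crsctl check crs               //Check that CRS, CSS, and EVM are online.
    # crsctl query css votedisk      //Confirm that the voting disks are located in the new disk group.
    # ocrcheck                       //Verify the OCR integrity and its location.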

Experience

None

Database Startup Failure Due to RAC Cluster Parameter Modification

Problem Description
Table 5-273 Basic information

Item

Information

Source of the Problem

FusionCube troubleshooting

Intended Product

Oracle RAC 11g R2 or later

Release Date

2018-05-28

Keyword

Cluster parameter

Symptom

After the ASM disk path parameters are modified, the RAC cluster cannot be started.

Key Process and Cause Analysis

The following describes the startup parameter file for the Oracle RAC cluster:

The gpnp profile is an XML file that stores the bootstrap information, that is, the basic information needed to build the cluster: the cluster name, cluster GUID, ASM discovery string, public network information, and private network information. When a node in the cluster starts, it reads this file (the default path is <gi_home>/gpnp/<node_name>/profiles/peer/profile.xml) to obtain this information. Because the file describes the entire cluster, it must be identical on all nodes, and a dedicated daemon process, gpnpd.bin (resource name: ora.gpnpd), maintains it.

For example, in a three-node cluster, node 3 is not started due to some problems. During this period, the private network configuration of the cluster changes, and then node 3 starts. During the startup process, the gpnpd process of node 3 needs to communicate with the gpnpd process of other nodes to obtain the latest gpnp profile.

The fault occurs as follows:

  1. The asm_diskstring parameter of the ASM instance is modified.

    su - grid

    sqlplus / as sysasm

    create pfile='/tmp/spfile.bak' from spfile;

    alter system set asm_diskstring='/dev/asmdisk/*' scope=spfile;

  2. After the parameter is modified, the database cluster server is powered off unexpectedly. As a result, the disk path parameter cannot be synchronized to the gpnp file.

    [root@host01 peer]# gpnptool get (Part of the profile output is as follows.)

    ProfileSequence="7" ClusterUId="14cddaccc0464f92bfc703ec1004a386" ClusterName="cluster01" PALocation=""><gpnp:Network-Profile><gpnp:HostNetwork id="gen" HostName="*"><gpnp:Network id="net1" IP="192.9.201.0" Adapter="eth0" Use="public"/><gpnp:Network id="net2" IP="10.0.0.0" Adapter="eth1" Use="cluster_interconnect"/></gpnp:HostNetwork></gpnp:Network-Profile><orcl:CSS-Profile id="css" DiscoveryString="+asm" LeaseDuration="400"/><orcl:ASM-Profile id="asm" DiscoveryString="/dev/asm*" SPFile="+DATA/cluster01/asmparameterfile/registry.253.783619911"/>

Conclusion and Solution

Solution

Use the gpnptool tool to manually modify the cluster parameters and manually edit the profile.xml file. The modification method is as follows:

#crsctl start crs    //Start the gpnp process.
#gpnptool get -o=/u01/app/11.2.0/grid/gpnp/fcdb1/profile/profile.xml
#cp /u01/app/11.2.0/grid/gpnp/fcdb1/profile/profile.xml /u01/app/11.2.0/grid/gpnp/fcdb1/profile/profile.xml.bak    //Back up the configuration file.
#gpnptool getpval -p=/u01/app/11.2.0/grid/gpnp/fcdb1/profile/profile.xml -prf_sq -o-    //Obtain the current sequence number (each time the profile is changed and written back to CRS, a new sequence number is used as the identifier).
#gpnptool edit -p=/u01/app/11.2.0/grid/gpnp/fcdb1/profile/profile.xml -o=/u01/app/11.2.0/grid/gpnp/fcdb1/profile/profile.xml -ovr -prf_sq=2    //Change the sequence number in the configuration file (add 1 to the original sequence number).
#gpnptool getpval -p=/u01/app/11.2.0/grid/gpnp/fcdb1/profile/profile.xml -prf_sq -o-    //Verify that the sequence number has been changed.
#gpnptool sign -p=/u01/app/11.2.0/grid/gpnp/fcdb1/profile/profile.xml -o=/u01/app/11.2.0/grid/gpnp/fcdb1/profile/profile.xml -ovr -w=cw-fs:peer    //Re-sign the configuration file with the private key.
#gpnptool put -p=/u01/app/11.2.0/grid/gpnp/fcdb1/profile/profile.xml    //Write the configuration to the parameter file.
Modify the parameter file of each node in the cluster in sequence.
#crsctl stop crs -f    //Shut down the cluster.
#crsctl start crs    //Start the cluster for the new parameters to take effect.
Experience

None

ACFS File of the Oracle Database Cannot Be Disabled Due to SMIO

Problem Description
Table 5-274 Basic information

Item

Information

Source of the Problem

FusionCube troubleshooting

Intended Product

Oracle RAC 11g R2 or later

Release Date

2018-05-28

Keyword

ACFS

Symptom

After the cluster is stopped, the ACFS volume cannot be disabled.

Key Process and Cause Analysis

Two Oracle RAC clusters are deployed in the production environment, and both use the ACFS file system to store archived data. The cluster that uses SAN storage runs properly, but in the cluster that uses FusionStorage the ACFS volume cannot be disabled. The ACFS volume can be released only after the crsctl stop crs command is executed and the host is restarted.

Error information on disabling the ACFS volume is as follows.

Analyze the alter logs generated when the cluster is shut down.

Analyze the ASM logs generated when the cluster is started.

Analyze the OS logs.

Conclusion and Solution

Conclusion

The ACFS volume cannot be disabled because the smio process keeps the ASM disk devices open.

Solution

Based on the foregoing analysis, the smio process is enabled on the compute node by default and keeps the disk devices in the /dev/ directory open. As a result, the ASM disks cannot be released when the cluster is stopped.
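To confirm that the smio process is indeed holding the devices open, a hedged check such as the following can be run before performing the steps below; the device paths are assumptions and depend on how the ASM disks are presented on the node.

    lsof /dev/oracleasm/disks/* 2>/dev/null    # lists the processes that keep the ASM disk devices open
    fuser -v /dev/sdb                          # alternative check for a single block device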

To avoid this problem, perform the following steps:

  1. Stop the smio process.

    Log in to the compute node and run the following commands:

    cd /opt/dsware/osd/ko/$(uname -r)/smio/

    ./smio_stop

  2. Rename the smio directory.

    mv /opt/dsware/osd/ko/$(uname -r)/smio/ /opt/dsware/osd/ko/$(uname -r)/smio_bak/

  3. Run the lsmod | grep smio command to confirm that the smio module is no longer loaded.

Experience

None

VBS Volume Fails to Be Attached Due to the udev Policy Configuration on the Deletion of the Original Device and Generation of a New Device

Problem Description
Table 5-275 Basic information

Item

Information

Source of the Problem

FusionCube troubleshooting

Intended Product

FusionStorage R3C00, R3C02, R3C30

Release Date

2018-05-28

Keyword

Volume attachment, udev

Symptom

The VBS volume attachment process is as follows:

  1. The VSC is instructed to add a disk. A kernel hot swap event is triggered. SCSI disk device file /dev/sd* is generated by udev.
  2. Check whether a corresponding device exists in the /sys/bus/scsi directory according to the SCSI quadruplet.

    A processing policy for the hot swap event is configured in the /etc/udev/rules.d/ directory, for example:

    KERNEL=="sdb", NAME="asmdisk1_ocr1", OWNER="grid", GROUP="asmadmin", MODE="0660". This rule means that when the sdb device appears, the original device node is removed and a new device node named asmdisk1_ocr1 (which is the sdb device) is created in its place.

    The VBS volume fails to be attached because the VBS cannot find the corresponding sdb device as udev has deleted it after the kernel hot swap event is triggered.

Key Process and Cause Analysis

Perform the following steps:

  1. Analyze database alert logs.
    Figure 5-333 Analysis of database alert logs

  2. The Oracle database has the following requirements on the OCR disk I/O latency:

    If the ASM PST heartbeats to the disks in a normal or high redundancy disk group are delayed beyond the heartbeat wait time, the ASM instance dismounts the disk group. The default wait time is 15 seconds.

    The OCR disk group is created in normal or high redundancy mode. Therefore, the OCR disk latency must be less than 15 seconds.

    https://support.oracle.com/epmos/faces/DocumentDisplay?_afrLoop=272356784595958&id=1581684.1&_adf.ctrl-state=kuranodbd_847

Conclusion and Solution

None

Experience

Solution

If the database cannot be started after the server is restarted, perform the following steps to restore the database:

As the root user, rename the Oracle udev rules file.

mv /etc/udev/rules.d/99-oracle*.rules /etc/udev/rules.d/99-oracle.rules.bak

Run the following commands as the root user:

[root@fcdb1]#udevadm control --reload-rules
[root@fcdb1]#udevadm trigger

After the disks are identified, rename the Oracle udev rules file back to its original name.

After the renaming is complete, run the following commands again:

[root@fcdb1]#udevadm control --reload-rules
[root@fcdb1]#udevadm trigger

After confirming that all disks are identified, start the Oracle database.

After the database is started, perform the following steps to modify the udev rules so that the server can identify the disk when the database is restarted.

Because the NAME key in the udev rules does not support the += operator, change NAME= to SYMLINK+=. This solves the problem: the original device is preserved and a symbolic link file is generated instead.

When NAME="asmdisk1_ocr1" is changed to SYMLINK+="asmdisk/asmdisk1_ocr1", the directory used for the ASM disks must be changed to asmdisk.

KERNEL=="sdb", SYMLINK+="asmdisk/asmdisk1_ocr1", OWNER="grid", GROUP="asmadmin", MODE="0660"

Modify the spfile parameter of the ASM database and change asm_diskstring to the new path of the udev disk. The procedure is as follows:

[root@fcdb1 ~]#su - grid                          //Switch to the grid user.
[grid@fcdb1 ~]$sqlplus  / as sysasm                  //Log in to the ASM database.
SQL> show parameter asm_                      //View the current ASM disk path information.
SQL> create pfile='/tmp/asm_spfile.ora' from spfile;    //Back up spfile configuration information.
SQL> alter system set asm_diskstring= '/dev/asmdisk/*' scope=spfile;  //Modify ASM disk path parameters.
SQL> exit

To shut down all clusters, run the following commands:

[root@fcdb1 ~]#su - grid                          //Switch to the grid user.
[grid@fcdb1 ~]$echo $ORACLE_HOME            //Obtain the cluster path.
[grid@fcdb1 ~]$exit 
[root@dbn05 install]# XXXX (enter the cluster path) /bin/crsctl stop cluster -all //Stop the RAC cluster (stop all RAC nodes).

If modification is to be made on each node one by one, run the following commands:

[root@fcdb1 ~]#su - grid                          //Switch to the grid user.
[grid@fcdb1 ~]$echo $ORACLE_HOME            //Obtain the cluster path.
[grid@fcdb1 ~]$exit 
[root@dbn05 install]# XXXX (enter the cluster path) /bin/crsctl stop crs // Stop the RAC single-node cluster.

Edit the udev file and import the new udev rules.

[root@fcdb1 ~]#vi /etc/udev/rules.d/99-oracle*.rules
[root@fcdb1 ~]# udevadm control --reload-rules
[root@fcdb1 ~]#udevadm trigger

Check whether the udev rules take effect.

[root@fcdb1 ~]#ls -l /dev/asmdisk/*

Restart the OS for the modification to take effect, and then run the following commands to verify it:

[grid@fcdb1 ~]$sqlplus / as sysasm                  //Log in to the ASM database.
SQL> show parameter asm_                 //Check whether the ASM disk path information is modified.
SQL> exit

Oracle Database Breakdown Due to VBS Shutdown Exception

Problem Description
Table 5-276 Basic information

Item

Information

Source of the Problem

FusionCube troubleshooting

Intended Product

Oracle 11.2.0.3 to 12.1.0.2

Release Date

2018-05-28

Keyword

I/O latency

Symptom

After the VBS process is restarted, the database breaks down (occasionally).

Key Process and Cause Analysis
  1. Analyze database alert logs.
    Figure 5-334 Analysis of database alert logs (1)
    Figure 5-335 Analysis of database alert logs (2)
  2. The Oracle database has the following requirements on the OCR disk I/O latency:

    If the ASM PST heartbeats to the disks in a normal or high redundancy disk group are delayed beyond the heartbeat wait time, the ASM instance dismounts the disk group. The default wait time is 15 seconds.

    The OCR disk group is created in normal or high redundancy mode. Therefore, the OCR disk latency must be less than 15 seconds.

    https://support.oracle.com/epmos/faces/DocumentDisplay?_afrLoop=272356784595958&id=1581684.1&_adf.ctrl-state=kuranodbd_847

Conclusion and Solution

Solution

  1. Check whether the OS and storage parameters can be modified to reduce the storage I/O latency and ensure that the OCR disk I/O latency is less than 15 seconds.
  2. Change the default value.

    Change the value of the hidden parameter _asm_hbeatiowait.

    alter system set "_asm_hbeatiowait"=120 scope=spfile sid='*';

    For versions later than 12.1.0.2, the default value of this parameter is changed to 120s.

Experience

None

Handling High CPU Usage Caused by udevd Processes on the Server

Problem Description
Table 5-277 Basic information

Item

Information

Source of the Problem

FusionCube troubleshooting

Intended Product

FusionCube database scenario

Release Date

2018-05-28

Keyword

udevd, CPU, Oracle

Symptom

On May 11, 2018, Huawei field engineers reported that the CPU usage of the udevd process was high from 15:00 to 18:00, after which the problem disappeared on its own.

Key Process and Cause Analysis

Key process

  1. Analyze the database server logs.
    1. Analyze OS top logs.

      Check the top logs of the jecn31 node at about 15:26 on May 11, 2018. It is found that the udevd process usage is close to 100% and that there are multiple udevd processes.

      The udevd CPU usage does not decrease until 16:09.

    2. Analyze database logs.

      Check the database instance logs at 14:00 to 18:00 on May 11, 2018. No exception log is found.

    3. Analyze storage VBS logs.

      Check the VBS logs at about 15:26 on May 11, 2018. No exception log is found.

  2. Replicate the problem in a test environment.
    1. Create 200 volumes. Each of them is 1 GB and is attached to the compute node.
    2. Create the ASM disk group +TEST in udev mode. (OPTIONS:= 'nowatch' is not added.)
    3. Use the HammerDB tool to create data on +TEST, perform a pressure test on the database, and run the top command to check the CPU usage of udevd processes. After a period of time, the CPU usage of udevd processes is close to 100%. The problem recurs.
    4. Add OPTIONS:="nowatch" to the /etc/udev/rules.d/99-oracle-asmdevices.rules file. Restart the system to make the udev rules take effect. Then, use the HammerDB tool to perform a pressure test and run the top command to check the CPU usage of udevd processes.

Cause analysis

The cause of the high CPU usage of udevd processes is described as follows.

According to the Red Hat case, when the number of disks is large and the database load is high, the CPU usage of udevd processes may occasionally become excessively high because the udev rules created for ASM disks are loaded repeatedly. After tuned is installed, the tuned-mpath-iosched rule in /lib/udev is triggered at specific time points, which also raises system CPU usage. When an Oracle process opens a device for writing and then closes it, a change event is generated and the udev rule ACTION=="add|change" is reloaded. As a result, the CPU usage is high. The problem is more likely to occur when the number of disks is large and the database load is high.

The udev rules have the following control parameters:

watch:

Watch a device node with inotify. When the node is closed after being opened for writing, a change event is generated.

nowatch:

Do not watch a device node with inotify.

A related high CPU usage case caused by udev is documented on the Red Hat website:

https://access.redhat.com/solutions/1465913

Conclusion and Solution

Solution

  1. Add the following line to the end of /etc/udev/rules.d/99-oracle-asmdevices.rules:

    ACTION=="add|change", KERNEL=="sd*", OPTIONS:="nowatch"

  2. After the preceding line is added, edit and save the configuration, and then run the following commands for the modification to take effect:

    [root@dbn01 u01]# /sbin/udevadm control --reload-rules

    [root@dbn01 u01]# /sbin/udevadm trigger --type=devices --action=change

Do not run the /sbin/start_udev command to make the modification take effect while services are running. A quick check of the result is shown below.
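A quick, hedged way to confirm the result after the rules are reloaded is to check the udevd CPU usage:

    top -b -n 1 | grep -i udevd          # udevd should no longer be close to 100% CPU
    ps -eo pid,pcpu,comm | grep udevd    # per-process CPU usage of all udevd instances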

Experience

None

Note

It is found through continuous observation that the database and VBS processes are normal, and the CPU usage of udevd processes is no longer high.

Oracle Database Optimization for Sinosafe Insurance

Problem Description
Table 5-278 Basic information

Item

Information

Source of the Problem

FusionCube troubleshooting

Intended Product

Non-NA customers and non-VIP customers

Release Date

2018-05-28

Keyword

Oracle, SQL

Symptom

In the database scenario, a large number of insert operations are performed when customer services are stored. The execution takes 37 minutes, and the optimization effect is not obvious.

Key Process and Cause Analysis

Key process

The optimization procedure is as follows:

Run SQL statements concurrently. The execution efficiency is greatly improved. The execution time of customer service scripts is reduced from 37 minutes to 12 minutes.

Create a trigger as user System or Sys to run SQL statements concurrently.

CREATE OR REPLACE TRIGGER TRG_DW

AFTER LOGON ON DATABASE

declare

v_username varchar2(100);

begin

select username into v_username

from v$session where AUDSID = SYS_CONTEXT('USERENV', 'SESSIONID') and rownum <= 1;

if upper(v_username)='DW' then

execute immediate('alter session force parallel ddl parallel 4');

execute immediate('alter session force parallel dml parallel 4');

execute immediate('alter session force parallel query parallel 4');

end if;

end;

/

Cause analysis

The database parameter parallel_force_local is set to true, but the SQL statements in the script are not invoked concurrently. The execution efficiency is low.
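A hedged sketch for confirming whether a session actually ran in parallel once the trigger is in place; DW is the service user from the example above.

    SQL> show parameter parallel_force_local
    SQL> select * from v$pq_sesstat where statistic in ('Queries Parallelized', 'DML Parallelized');
    -- Non-zero LAST_QUERY or SESSION_TOTAL values indicate that the session's statements were parallelized.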

Conclusion and Solution

Conclusion

The SQL statements in the service script are not executed in parallel, so execution is inefficient; forcing parallel execution greatly shortens the execution time.

Solution

  1. Force the SQL statements in the service script to run in parallel by creating the logon trigger shown in the key process above, so that sessions of the service user execute DDL, DML, and query operations in parallel.

  2. Adjust the degree of parallelism (parallel 4 in the example) to the actual environment; a larger value is not necessarily better.

Experience

None

Note

DW is a service test user. Change the value parallel 4 according to the actual situation. A larger value is not necessarily better.

Memory Exhaustion Caused by the udisks-daemon Process of the Oracle Linux OS

Problem Description
Table 5-279 Basic information

Item

Information

Source of the Problem

FusionCube troubleshooting

Intended Product

FusionCube Oracle database scenario

Release Date

2018-05-28

Keyword

Low performance

Symptom

The udisks-daemon process of the Oracle Linux OS causes memory exhaustion.

Key Process and Cause Analysis

Key process

After the Oracle Linux OS runs for a period of time, it is found that the available memory of the OS decreases gradually. The top logs show that the udisks-daemon process consumes a large amount of memory.

Conclusion and Solution

Solution

This problem is caused by a bug in the udisks package of Oracle Linux 6.9. To avoid this problem, run the kill command to stop the udisks-daemon process. To resolve this problem permanently, upgrade udisks to 1.0.1-11.el6 according to the suggestions on Oracle's website.

Official link: https://support.oracle.com/epmos/faces/SearchDocDisplay?_adf.ctrl-state=18pgju7erv_9&_afrLoop=338637318225942

According to the analysis of the OS logs, when a user logs in to the OS in GNOME GUI mode, the org.freedesktop.UDisks service is invoked and the dbus-daemon process starts the udisks-daemon process.

The udisks service provides interfaces to enumerate and perform operations on disks and storage devices. Any application (including non-privileged ones) can access udisksd through the name org.freedesktop.UDisks2 on the system message bus. The udisksd daemon relates only indirectly to the devices and objects displayed on the user interface, which are managed by graphical applications such as gnome-disks(1).
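A hedged sketch of the temporary workaround and the permanent fix described above; the PID is a placeholder and the package name follows Oracle's advisory.

    ps -eo pid,rss,comm --sort=-rss | head    # identify the udisks-daemon process and its resident memory
    kill <pid_of_udisks-daemon>               # temporary workaround: stop the process to release the memory
    yum update udisks                         # permanent fix: upgrade to udisks 1.0.1-11.el6 or later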

Experience

None

A Node Is Abnormal Due to the Default Setting of ipfrag and Is Removed from the Cluster in Oracle RAC on RHEL 6.6 or Later Versions

Problem Description
Table 5-280 Basic information

Item

Information

Source of the Problem

FusionCube troubleshooting

Intended Product

Oracle RAC scenario

Release Date

2018-05-28

Keyword

Abnormal, removed

Symptom

When the Oracle database runs on Red Hat 6.6 or a later version, or the Oracle database OS is upgraded to Red Hat 6.6 or a later version, the RAC node is abnormal and is removed from the cluster when the error message "IPC Send timeout" is displayed in the database logs.

Key Process and Cause Analysis

Key process

The following alarm information is displayed in database alarm logs:

IPC Send timeout detected. Receiver ospid xxxxx

According to the database cluster logs, the Oracle database is restarted.

Check whether the OS version is Red Hat 6.6.

Cause analysis

This problem is related to the default ipfrag settings in RHEL 6.6 and later versions, which affect the reassembly of fragmented IPC packets on the RAC private network.

Conclusion and Solution

Conclusion

The RAC node is removed from the cluster because of the default ipfrag setting in RHEL 6.6 and later versions.

Solution

According to the Oracle case (ID 2008933.1), this problem is caused by the default setting of ipfrag in the RHEL 6.6 or later versions.

  1. Modify the ipfrag parameter in the sysctl.conf file of the OS.

    net.ipv4.ipfrag_high_thresh = 16777216

    net.ipv4.ipfrag_low_thresh = 15728640

  2. Modify database parameters.

    Check whether the value of the parallel_force_local parameter in the database is false. If the value is false, change it to true.

  3. Restart all database nodes for the parameters to take effect so that abnormal nodes can rejoin the cluster (a sketch for applying and verifying the sysctl change follows this list).
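A minimal sketch of applying and verifying the sysctl change from step 1; the restart in step 3 is still required for the cluster nodes.

    sysctl -p                                                        # reload /etc/sysctl.conf
    sysctl net.ipv4.ipfrag_high_thresh net.ipv4.ipfrag_low_thresh    # confirm the new values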
Experience

None

Configuring a PBR Route on the Database Node for Ports Through Which Network Packets Enter and Exit

Problem Description
Table 5-281 Basic information

Item

Information

Source of the Problem

FusionCube troubleshooting

Intended Product

FusionCube Oracle database scenario

Release Date

2018-05-28

Keyword

Policy routing

Symptom

Configure a PBR route on the database node for ports through which network packets enter and exit.

Key Process and Cause Analysis

Key process

A compute node has multiple NICs and carries at least a management network and a service network, and customers can access the compute node from either network. Usually, static routes to the destination address segments are used. However, if the management and service network segments cannot be determined, or there are too many segments, static routes become cumbersome. To solve this problem, configure policy-based routing (PBR) on the interface that does not use the default gateway, so that network packets leave through the same port on which they entered.

Conclusion and Solution

Solution

  1. Configure a routing policy. The following policy indicates that network packets enter and exit from eth0. This policy uses routing table 1.

    cat /etc/sysconfig/network-scripts/rule-eth0

    iif eth0 table 1

    from <ip of eth0> table 1

  2. Configure the routes in routing table 1 in the route-eth0 file as follows (a verification sketch follows this list).

    # cat /etc/sysconfig/network-scripts/route-eth0

    <network/prefix> dev eth0 table 1

    default via <gateway address> dev eth0 table 1

    #to add additional static routes

    #<network address> via <gateway address> dev eth0 table 1
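After the interface configuration is reloaded, the policy routing can be verified as follows; this is a hedged check and not part of the referenced Red Hat solution.

    ip rule show            # the iif/from rules configured in rule-eth0 should be listed here
    ip route show table 1   # the routes configured in route-eth0 should appear in table 1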

Reference:

https://access.redhat.com/solutions/288823

Experience

None

Manually Adding a Single NFS Disk as the OCR Shared Disk

Problem Description
Table 5-282 Basic information

Item

Information

Source of the Problem

FusionCube troubleshooting

Intended Product

Oracle 11g or later

Release Date

2018-05-28

Keyword

NFS disk

Symptom

A storage image is used as the voting disk, which cannot ensure service continuity.

Key Process and Cause Analysis

Key process

The Oracle 11g OCR disk group requires three failure groups. However, when storage mirroring is used to build an active-active data center, there is no suitable third storage device on which to place the third failure group. If the storage device that holds two of the failure groups fails, fewer than half of the voting disks remain available and the entire cluster cannot run properly.

Conclusion and Solution

Solution

Use the NFS technology to attach a new disk, and then load the disk as a quorum failure group. In this way, three failure groups are created.

A special type of failure group has been introduced in ASM 11.2, and it is called a quorum failure group. This type of failure group is used in the context of extended distance clusters, when the voting files are deployed in ASM. A quorum failure group does not contain user data, and it does not count for disk group redundancy requirements.

With stretched clusters, we need at least three voting files, one on each site plus a third that is normally deployed via NFS. Unfortunately, the OUI does not allow such a complex setup, so when configuring the extended distance cluster, you should start with a normal configuration - two voting disks on site A and one for site B - and add the third voting disk after the installation finishes.

Begin by exporting a file system on the NFS appliance, containing a zero-padded file to serve as the voting disk later. The file should be the same size as the ASM disks already used to store the voting files.

This NFS export should then be concurrently mounted by the two sites. Continue by adding the new zero-padded file as an ASM disk, marked as a quorum failure group—for this you have to change the disk discovery string. Once the new disk is added to the disk group, ASM will automatically readjust the voting files and include the NFS disk!

The following figure shows the architecture.

To attach an NFS disk, perform the following steps:

  1. Install Grid based on the best practice document.
  2. Create an OCR shared directory on the NFS server.
  3. Attach the NFS directory to all database RAC nodes.
    # mkdir -p /u01/ocrdata
    # vi /etc/fstab
    nfs_nas_server:/vol/DATA/oradata  /u01/oradata     nfs  rw,bg,hard,nointr,tcp,nfsvers=3,timeo=600,rsize=32768,wsize=32768,actimeo=0
  4. Run the dd command on one node to generate the OCR disk file.
    # dd if=/dev/zero of=/u01/ocrdata/disk1 bs=1024k count=10000 oflag=direct
    # chown -R grid:asmadmin  /u01/ocrdata/
    # chmod 660 /u01/ocrdata/*
  5. In an ASM instance, change the ASM disk search path.
    # sqlplus / as sysasm
    SQL>alter system set asm_diskstring='/dev/asmdisk/*','/u01/ocrdisk/disk1';
  6. Check whether an NFS disk can be identified by the ASM.
    SQL>col path format A40
    SQL>select group_number,name,path,mount_status,header_status,state,REDUNDANCY,FAILGROUP,voting_file from v$asm_disk;
  7. Add an NFS disk to the OCR disk group.
    1. Add an NFS disk to the original OCR disk group.
      SQL>alter diskgroup OCR add quorum failgroup FGQ DISK '/u01/ocrdisk/disk1';
    2. If the NFS disk cannot be added to the original OCR disk group, create a disk group.

      Replace the voting disk as the grid user.

      [grid@dbn1 ~]$ crsctl replace votedisk +CRS
       Successful addition of voting disk 58c1ac72dff94f25bffc8e649a36c883.
       Successful addition of voting disk 076f0b3e9b0a4f5cbf26841c540211a7.
       Successful addition of voting disk 84cf735c784e4f74bf5d55fc99e98422.
       Successful deletion of voting disk 73fb4a797e624fa9bf382f841340dfa8.
       Successfully replaced voting disk group with +CRS.
  8. Check the voting disk.
    [grid@dbn1 ~]$ crsctl query css votedisk
     ##  STATE    File Universal Id                File Name Disk group
     --  -----    -----------------                --------- ---------
      1. ONLINE   58c1ac72dff94f25bffc8e649a36c883 (/dev/asmdisk/asm-diske) [CRS]
      2. ONLINE   076f0b3e9b0a4f5cbf26841c540211a7 (/dev/asmdisk/asm-diskf) [CRS]
      3. ONLINE   84cf735c784e4f74bf5d55fc99e98422 (/u01/ocrdisk/disk1) [CRS]
     Located 3 voting disk(s).
Experience

None

Disks on Compute Nodes Are Mounted Repeatedly in the Database Scenario

Problem Description
Table 5-283 Basic information

Item

Information

Source of the Problem

FusionCube troubleshooting

Intended Product

Non-NA customers and non-VIP customers

Release Date

2018-05-28

Keyword

Oracle, mount

Symptom

In the database scenario, a large number of insert operations are performed when customer services are stored. The execution takes 37 minutes, and the optimization effect is not obvious.

Key Process and Cause Analysis

Key process

The optimization procedure is as follows:

Run SQL statements concurrently. The execution efficiency is greatly improved. The execution time of customer service scripts is reduced from 37 minutes to 12 minutes.

Create a trigger as user System or Sys to run SQL statements concurrently.

CREATE OR REPLACE TRIGGER TRG_DW

AFTER LOGON ON DATABASE

declare

v_username varchar2(100);

begin

select username into v_username

from v$session where AUDSID = SYS_CONTEXT('USERENV', 'SESSIONID') and rownum <= 1;

if upper(v_username)='DW' then

execute immediate('alter session force parallel ddl parallel 4');

execute immediate('alter session force parallel dml parallel 4');

execute immediate('alter session force parallel query parallel 4');

end if;

end;

/

Cause analysis

The database parameter parallel_force_local is set to true, but the SQL statements in the script are not invoked concurrently. The execution efficiency is low.

Conclusion and Solution

Solution

  1. Check whether the OS and storage parameters can be modified to reduce the storage I/O latency and ensure that the OCR disk I/O latency is less than 15 seconds.
  2. Change the default value.

    Change the value of the hidden parameter _asm_hbeatiowait.

    alter system set "_asm_hbeatiowait"=120 scope=spfile sid='*';

    For versions later than 12.1.0.2, the default value of this parameter is changed to 120s.

Experience

None

Note

DW is a service test user. Change the value parallel 4 according to the actual situation. A larger value is not necessarily better.

Configuring udev Rules When Disks Are Uninstalled in the Database Scenario

Problem Description
Table 5-284 Basic information

Item

Information

Source of the Problem

FusionCube troubleshooting

Intended Product

All FusionStorage versions

Release Date

2018-05-28

Keyword

udev, Oracle

Symptom

After a udev rule is set for a block device, you need to use udev to release the reference to the device before uninstalling the block device.

Key Process and Cause Analysis

Key process

Standard actions are required when a block device needs to be uninstalled in a running Oracle database environment.

Conclusion and Solution

Conclusion

Standard udev operations are required to release references to a block device before it is uninstalled from a running Oracle database environment.

Solution

  1. Ensure that the Oracle database and other applications, file systems, and LVM do not use this device.
  2. Delete information on the to-be-uninstalled block device from the udev rules.
  3. Run the udevadm control --reload-rules command to reload the udev rule configuration.
  4. Run the udevadm trigger command to trigger a hot swap event. The device node generated by the deleted udev rule then disappears (a hedged sketch of the full sequence follows this list).
  5. Uninstall the block device from the storage device.
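A hedged sketch of the sequence, assuming the device to be uninstalled is sdb and its rule lives in /etc/udev/rules.d/99-oracle-asmdevices.rules; adjust both to the actual environment.

    sed -i.bak '/KERNEL=="sdb"/d' /etc/udev/rules.d/99-oracle-asmdevices.rules    # step 2: remove the rule for the device (a backup copy is kept)
    udevadm control --reload-rules    # step 3: reload the udev rule configuration
    udevadm trigger                   # step 4: trigger events; the device created by the old rule disappears
    ls -l /dev/asmdisk/               # confirm the device entry is gone before detaching the volume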
Experience

None

Updated: 2019-02-25

Document ID: EDOC1000041338