
Huawei Server Maintenance Manual 09

Faults

The Virtual CD-ROM Drive Cannot Be Used in Windows Server 2008

Problem Description
Table 5-300 Basic information

Source of the Problem: RH2285
Intended Product: RH2285 and E6000
Release Date: 2011-06-11
Keyword: Windows Server 2008 SP2, Internet Explorer 8, BMC, virtual CD-ROM drive
Author: Li Weijia (employee ID: 00176591)

Symptom

Software configuration

  • Operating system (OS): Windows Server Enterprise 2008 SP2 (64-bit)
  • Browser version: Internet Explorer 8
  • Baseboard management controller (BMC) version: 2.06

Symptom

On an RH2285, the remote keyboard, video, and mouse (KVM) function is enabled in the BMC web UI, and the virtual CD-ROM drive is used to load an image file for OS installation. To load an image file, perform the following steps:

  1. Select an image file, and click Open, as shown in Figure 5-347.
    Figure 5-347 Selecting an image file

  2. Click Connect in the virtual CD-ROM drive interface.
  3. The Message dialog box is displayed with the prompt "CD-ROM State: The path of the image file does not exist or the image file is used by other program", and the image file cannot be opened, as shown in Figure 5-348.
    Figure 5-348 Failure in loading the image file
Key Process and Cause Analysis

Key process

  1. UVP C01 and Windows Server 2003 image files are available on site and are intact. When either image file is mounted, the system prompts that the virtual CD-ROM drive cannot be opened.
  2. Move the UVP C01 and Windows Server 2003 image files to another folder and remount them. The same prompt appears, indicating that the image files are not locked by another user or program.
  3. Restart the BMC and remount the image files. The same prompt appears, indicating that the virtual CD-ROM drive is not occupied by another user.
  4. Close Internet Explorer and reopen it as an administrator (right-click Internet Explorer and choose Run as administrator), and mount the image files by using the virtual CD-ROM drive of the KVM. The system displays a message indicating that the mounting succeeds.
  5. After multiple tests, the virtual CD-ROM drive can be connected and disconnected properly.
Conclusion and Solution

Conclusion

Windows Server 2008 restricts the permissions with which Internet Explorer runs. Without administrator rights, the virtual CD-ROM drive of the BMC cannot be used.

Solution

Run Internet Explorer as an administrator in Windows Server 2008, log in to the BMC Web UI, and use the virtual CD-ROM drive for the KVM to mount image files.

Experience

None

Note

None

Windows Server 2003 Cannot Use the Remote Control KVM Function of an SMM

Problem Description
Table 5-301 Basic information

Source of the Problem: E6000
Intended Product: E6000
Release Date: 2011-12-23
Keyword: Windows Server 2003, SMM, remote control
Author: Li Weijia (employee ID: 00176591)

Symptom

Software configuration

Client operating system (OS): Windows Server 2003 server-x86.

Symptom

  • Symptom 1: Enable the keyboard, video, and mouse (KVM) function in Direct KVM and VMM mode of Remote Control; however, the KVM window cannot be opened, and only a blank page is displayed.
  • Symptom 2: Enable the KVM function in KVM via MM mode of Remote Control. The system displays "The window of KVM via MM has been opened, please close it firstly."

Key Process and Cause Analysis

Cause analysis

Windows Server 2003 is a multi-user OS, but two users on the same computer cannot simultaneously use the remote control KVM of a shelf management module (SMM). Therefore, when one user (for example, administrator) logs in to the system and enables the remote control KVM of an SMM, another user (for example, pxe) who logs in from the same computer cannot enable the remote control KVM.

After the administrator exits the Internet Explorer process, the pxe user can use the remote control KVM of the SMM.

Conclusion and Solution

Conclusion

Two users on the same computer cannot simultaneously enable the remote control KVM of an SMM.

Solution

Close the opened remote control KVM window (or end the Internet Explorer process in the task manager), and restart the remote control KVM of the SMM.

Experience

None

Note

None

An Unactivated Windows Server 2008 R2 Frequently Shuts Down

Problem Description
Table 5-302 Basic information

Source of the Problem: E6000
Intended Product: RH1280, RH2280, RH1285, RH2285, E6000, X6000, and RH5485
Release Date: 2011-12-23
Keyword: Windows Server 2008 R2, WLMS
Author: Li Weijia (employee ID: 00176591)

Symptom

A board in use automatically shuts down. You need to manually power on the board to use it.

Key Process and Cause Analysis

Key process

  1. Check the baseboard management controller (BMC) system status of the board (focusing on power alarms). No hardware alarm is found.
  2. View the system event log. No hardware alarm is found; however, the board is found to be powered off periodically, at an interval of about one hour.
  3. The Oracle database service runs on the board. Check whether a scheduled task of the service is triggered: exit the Oracle software so that no service runs on the board. An hour later, the board still shuts down automatically. Therefore, the Oracle service has no impact on the board.
  4. Check whether the board shuts down due to abnormal basic input output system (BIOS) energy saving parameters and power options of the operating system (OS). In the BIOS, set Advanced > CPU Configuration > Intel(R) TurboMode tech and Advanced > Intel(R) SpeedStep(tm) tech to Disabled, and set Advanced > ACPI Configuration > ACPI Version Features to ACPI v2.0. After running for an hour, the board shuts down. Therefore, the energy saving function has no impact on the board.
  5. View the time difference between the BMC and the board OS, and obtain the board OS time based on the time that the system reboot log is generated.
  6. Based on the obtained board OS time, view Windows event logs including the System and Application logs in the level sequence of Error, Warning, and Information.
  7. In the System log, no log that triggers system shutdown is found, as shown in Figure 5-349.
    Figure 5-349 System logs

  8. In the Application log, valid information is captured, as shown in Figure 5-350.
    Figure 5-350 Application logs

  9. Query the Windows Licensing Monitoring Service (WLMS) in the OS services. The WLMS monitors whether the serial number of the current system is valid. If the serial number is invalid, the system shuts down every hour. See Figure 5-351.
    Figure 5-351 WLMS service

  10. For a genuine Windows Server 2008, the WLMS service cannot be manually ended; if the service is ended, the system immediately shuts down. Therefore, buy a valid serial number to reactivate the system.
Conclusion and Solution

Conclusion

After the serial number of a Windows Server 2008 system expires, the WLMS service shuts the system down every hour.

Solution

Buy a genuine Windows Server 2008 and the genuine serial number of the corresponding version.

Experience

To diagnose a Windows system fault, first check the BMC log for hardware alarms. If no hardware alarm occurs, check the Windows event logs (including the System and Application logs) in the severity order of Error, Warning, and Information, which facilitates fault diagnosis.

Note

To view the BMC time, do as follows (an example session follows the steps):

  1. Enter telnet BMCIP on the CMD CLI at the client, and press Enter. BMCIP indicates the IP address of the BMC.
  2. Enter the BMC user name and password based on the prompt to log in to the system (the default user name and password are root).
  3. Run the ipmcget -d ipmctime command to view the BMC time.
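A minimal example session, assuming the BMC IP address is 192.168.2.100 and the default root credentials; the prompt format may differ between BMC versions:

    C:\> telnet 192.168.2.100
    login: root
    password: ******
    # ipmcget -d ipmctime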

To view the Windows Server 2008 system log, do as follows (a command-line alternative follows the steps):

  1. Choose Start > Computer > Manage. The Server Manager UI is displayed.
  2. Choose Diagnostics > Event Viewer > Windows Logs in the Server Manager UI. In the displayed log interface, the logs include the Application, Security, Setup, System, and Forwarded Events logs.
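Alternatively, the same logs can be read from the command line with the built-in wevtutil tool, which is convenient for scripted collection (a sketch; /c sets how many of the newest events to return):

    C:\> wevtutil qe System /c:20 /rd:true /f:text
    C:\> wevtutil qe Application /c:20 /rd:true /f:text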

An RH2285 OS Cannot Be Started

Problem Description
Table 5-303 Basic information

Source of the Problem: RH2285
Intended Product: RH2285
Release Date: 2011-12-12
Keyword: RH2285, sda6
Author: Han Hui (employee ID: 179477)

Symptom

Hardware configuration

RH2285 server, 12 hard drives, and 1068E controller card

Software configuration

SUSE11 sp1 (64-bit)

Symptom

After the server OS enters kdump, the system displays "Waiting for device /dev/sda6 to appear", as shown in the red box in Figure 5-352.

Figure 5-352 Failure in finding the sda6 root partition

Key Process and Cause Analysis
  1. Run the echo c > /proc/sysrq-trigger command to crash the system. The errors shown in Figure 5-352 may then occur.
  2. The dmesg information shows that the problem occurs because loading the mptsas driver takes about 10 minutes (30s by default).
  3. After the system crashes and enters the kdump process, the process scans for storage devices. The input/output (I/O) Advanced Programmable Interrupt Controller (APIC) receives interrupts from I/O devices, generates interrupt messages, and delivers them to the local APIC of a processor. After the processor handles the interrupt, the driver can access the device. The kdump process then reads the partition information to identify the storage devices in the partitions and mounts the root partition. See Figure 5-353.
    Figure 5-353 Process for handling system crash

  4. In kdump, the crash kernel runs from reserved memory, and only one processor remains online. When the I/O APIC distributes interrupt messages, it may deliver them to the local APIC of a processor that is no longer online, so the interrupt distribution fails. The driver retries access to the sda6 root partition in a loop, stops when the access times out, and a message indicating that interrupts were sent to a nonexistent processor is displayed.
Conclusion and Solution

Conclusion

The root cause is a SUSE 11 bug. The fix is to stop using the I/O APIC for interrupt distribution in the kdump process: interrupts are delivered directly to the active processor, which avoids the faulty distribution, allows the storage devices to be found, and lets the crash dump complete.

Solution

To resolve the problem, do as follows (a sketch follows the steps):

  1. Run vim /etc/sysconfig/kdump.
  2. Add noapic to KDUMP_COMMANDLINE_APPEND.
  3. Recreate the initrd file of the kdump process.
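A minimal sketch of the change on SLES 11. Restarting the boot.kdump service is assumed to rebuild the kdump initrd on this release; verify the rebuild step against your system:

    vi /etc/sysconfig/kdump
    # append noapic, keeping any options already present:
    KDUMP_COMMANDLINE_APPEND="noapic"
    # assumed to rebuild the kdump initrd on SLES 11; verify on your release:
    service boot.kdump restart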
Experience

None

Note

None

RHEL 6.5 Port Bonding Bugs Result in Network Performance Deterioration

Problem Description
Table 5-304 Basic information

Source of the Problem: Problem on the live network
Intended Product: Rack servers and E9000 blade servers
Release Date: 2015-07-16
Keyword: RHEL, bonding, TSO, NIC performance

Symptom
  • After the 10GE NIC ports on a server running Red Hat Enterprise Linux (RHEL) 6.5 are bonded in active/standby mode, the NIC throughput tested by iPerf is only 2.5 Gbit/s, much lower than 10 Gbit/s. The throughput on the server running SUSE Linux Enterprise Server (SLES) 11 SP3 at the peer end is normal, as shown in Figure 5-354.
    Figure 5-354 NIC performance tested by iPerf
  • Server configuration: All software is selected during the installation of RHEL 6.5.

The NIC uses the Intel® 82599 chip and the following driver:

driver: ixgbe

version: 3.15.1-k

firmware-version: 0x800003e2

  • RHEL 6.5 is installed on an E9000 CH121 in the lab, with the same port configuration on a different 10GE NIC (such as be3). The problem is reproduced, as shown in Figure 5-355.
Figure 5-355 Problem reproduction

If physical ports (eth0 and eth1) are not bonded, the throughput tested by iPerf is greater than 9 Gbit/s.

Figure 5-356 Problem reproduction

Key Process and Cause Analysis
  • RHEL optimization:
    • Delete intel_iommu=on from /boot/grub/grub.conf.
    • Disable VT-d in the BIOS.

      The problem persists.

  • NIC interrupt distribution analysis

    Measure the NIC interrupt queue distribution with one iPerf process in both the bonding and non-bonding scenarios; the distributions differ little. In both scenarios, data is received and transmitted in two queues. The CPU usage is higher in the bonding scenario.

  • Tool analysis

    Run multiple iPerf processes at the same time. The total throughput increases, but only to about 6 Gbit/s, lower than the expected 8 Gbit/s. A netperf test also shows TCP throughput in the bonding scenario far below expectations. The test tools are therefore not the cause of the problem.

  • TSO and GSO analysis
    1. Based on the preceding verification, the poor performance of the bonded port has little to do with the NIC and much to do with RHEL 6.5.
    2. TCP Segmentation Offload (TSO) and Generic Segmentation Offload (GSO) are disabled for the bonded port in RHEL 6.5, but are enabled for both the physical ports and the bonded port in SLES 11 SP3. These offloads improve data transfer performance: TSO, UDP Fragmentation Offload (UFO), and GSO improve transmitting performance, and Large Receive Offload (LRO) and Generic Receive Offload (GRO) improve receiving performance.

      TSO is a technique that reduces CPU workloads by using the NIC to split TCP packets. This technique is also called Large Segment Offload (LSO). To support TSO, a network device must support checksum and Scatter-Gather.

      GSO is more generally applicable than TSO. GSO delays packet segmentation until the packets reach the NIC driver. If the NIC supports segmentation offload (TSO or UFO), the large packets are handed to the NIC driver as-is; otherwise, they are segmented just before being passed to the driver. Either way, a packet traverses the protocol stack only once, improving efficiency.

    3. Figure 5-357 shows the default settings in RHEL 6.5.
      Figure 5-357 Default settings in RHEL 6.5

      Figure 5-358 shows the default settings in SLES 11 SP3.

      Figure 5-358 Default settings in SLES 11 SP3

    4. Run the following commands to enable TSO and GSO for the bonded port and observe the NIC performance (a verification sketch follows this list).

      ethtool -K bond0 tso on
      ethtool -K bond0 gso on

      After GSO is enabled, the throughput of the bonded port increases to about 6 Gbit/s using one iPerf process.

      After TSO is enabled for the bonded port and the physical ports, errors are reported, as shown in the dmesg logs in Figure 5-359.

      Figure 5-359 dmesg logs

      lo: Dropping TSO features since no CSUM feature.

      (null): Dropping TSO features since no CSUM feature.

      (null): Dropping TSO6 features since no CSUM feature.

      rhevm: Dropping TSO features since no CSUM feature.

    5. According to the following two cases in Figure 5-360, if the NIC supports TSO and GSO and has them enabled, the RHEL bonding module also supports TSO and GSO. However, TSO cannot be enabled for bonded ports due to bugs in RHEL 6.5. The OS kernel needs to be upgraded.

      https://access.redhat.com/solutions/781503

      https://access.redhat.com/solutions/631123

      Figure 5-360 Two cases

    6. After the kernel is upgraded, TSO is enabled by default for the bonded port and the throughput reaches 9.6 Gbit/s, as shown in Figure 5-361.
      Figure 5-361 TSO setting and NIC performance
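After toggling the offloads as in step 4, the per-port offload state can be confirmed with ethtool. A minimal sketch; the flag names follow standard ethtool -k output, and the values shown are the expected result after the kernel upgrade:

    ethtool -k bond0 | egrep 'tcp-segmentation-offload|generic-segmentation-offload'
    # expected output after the kernel upgrade:
    # tcp-segmentation-offload: on
    # generic-segmentation-offload: on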
Conclusion and Solution

Conclusion:

TSO cannot be enabled for the bonded ports due to a bug in RHEL 6.5, resulting in poor network performance. The OS kernel needs to be upgraded.

Solution:

Upgrade the kernel to 2.6.32-431.11.2.el6.

Experience

The problem may occur in other Linux OSs. If the problem occurs, check TSO and GSO settings and resolve it by referring to this case.

Note

None

"Out of SW-IOMMU space" Is Displayed for Linux

Problem Description
Table 5-305 Basic information

Source of the Problem: RH2285
Intended Product: RH2285, SR100
Release Date: 2011-12-14
Keyword: 1068E, DMA, out of sw-iommu space

Symptom

Hardware configuration

An RH2285 server configured with eight or twelve hard drives and an LSI SAS1068E controller card

Software configuration

SUSE10 SP2 x86_64

Symptom

The messages for Linux are as follows:

  • DMA: Out of SW-IOMMU space for 16384 bytes at device 08:00.0
  • DMA: Out of SW-IOMMU space for 16384 bytes at device 08:00.0
  • DMA: Out of SW-IOMMU space for 16384 bytes at device 08:00.0
Key Process and Cause Analysis

Key process

The lspci command output shows that the device at 08:00.0 is the LSI SAS1068E PCIe device, as shown in Figure 5-362.

Figure 5-362 lspci command output

Cause analysis

An input/output memory management unit (IOMMU) is a hardware unit that translates device-visible I/O addresses into physical machine addresses, which enables a device to reach physical memory beyond its own addressing range. The swiotlb mechanism implements the IOMMU function in software by using "bounce buffers". If the IOMMU space is insufficient but the system has spare memory, you can set the swiotlb parameter to a larger value to enlarge the IOMMU space.

The default value of swiotlb is 64 MB on Red Hat systems and 16 MB on SUSE systems. If a Peripheral Component Interconnect Express (PCIe) network interface card (NIC) is under heavy load and swiotlb is set to a small value, the PCIe NIC cannot work properly.

Conclusion and Solution

Conclusion

If a PCIe NIC is under heavy load and swiotlb is set to a small value, the PCIe NIC cannot work properly.

Solution

Set swiotlb to a larger value by appending swiotlb=128 or swiotlb=256 to the kernel line in /boot/grub/menu.lst, as shown in Figure 5-363 (a sketch follows the figures). Restart the system and run the cat /proc/cmdline command on the command-line interface (CLI) as the root user to check whether the modification (for example, swiotlb=128) has taken effect, as shown in Figure 5-364.

Figure 5-363 Setting the swiotlb parameter
Figure 5-364 Setting results
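A minimal sketch of the resulting kernel line in /boot/grub/menu.lst; the kernel image name and root device below are placeholders, and only the appended swiotlb=128 matters:

    # kernel line in /boot/grub/menu.lst with swiotlb appended:
    kernel /boot/vmlinuz root=/dev/sda2 splash=silent swiotlb=128
    # after rebooting, verify as root:
    cat /proc/cmdline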

Experience

According to SUSE engineers, SUSE kernels of version 2.6.16.60-0.60 or later detect and resolve this problem automatically.

Note

None

The BIOS Screen Cannot Be Properly Displayed After the Login over SOL

Problem Description
Table 5-306 Basic information

Source of the Problem: RH5485
Intended Product: RH5485, E6000, RH2285, and X6000
Release Date: 2012-03-21
Keyword: RH5485, SOL, BIOS

Symptom

Hardware configuration

An RH5485

Software configuration

Red Hat Enterprise Linux Server release 6.1

Basic input/output system (BIOS) V009

Symptom

If both the BIOS and local terminal types are set to VT_100+, some operation options on the BIOS screen are not displayed when the user logs in to the BIOS over Serial over LAN (SOL) after the OS on the service side restarts, as shown in Figure 5-365. LAN is short for local area network.

Figure 5-365 Missing operation options

Figure 5-366 shows the correct display of the operation options on the BIOS screen.

Figure 5-366 Correct display of operation options
Key Process and Cause Analysis

Cause analysis

The standard DOS display mode is 80 x 25. If you use the 80 x 24 HyperTerminal to log in, the second-last line displayed on the HyperTerminal will be overwritten by the last line.

Conclusion and Solution

Conclusion

After the display mode is adjusted to 80 x 25, the second-last line of operation options on the BIOS screen is displayed properly. The HyperTerminal provided by Windows is fixed at 80 x 24 and cannot be adjusted, so the last line of operation options cannot be displayed there. However, that line only describes shortcut keys, so normal operations are not affected.

Experience

None

Note

None

An Incorrect ping Command Is Used

Problem Description
Table 5-307 Basic information

Source of the Problem: Problem in Online Devices
Intended Product: Linux
Release Date: 2012-03-22
Keyword: ping

Symptom

Hardware configuration

Twelve 1 TB hard drives, one LSI SAS1068E controller card, and hard drives in slots 0 and 1 configured with RAID 1

Software configuration

SUSE11 SP1 64-bit

Symptom

After the ping 192.168.3.100 -i 192.168.3.10 command is executed on the server operating system (OS), a response packet is received only after about 3 minutes, as shown in Figure 5-367.

Figure 5-367 Ping responses are received only after a long delay

Key Process and Cause Analysis

Key process

Syntax: ping [-dfnqrRv] [-c <count>] [-i <interval in seconds>] [-I <interface or source address>] [-l <preload>] [-p <pattern>] [-s <packet size>] [-t <TTL value>] [host name or IP address]

NOTE:

The ping command uses the Internet Control Message Protocol (ICMP) to send a message that is echoed back as a response. If the remote host sends back the response, the host is running normally.

Parameters in the command are described as follows:

-d: uses the socket SO_DEBUG option.

-c <count>: sets the number of echo requests to send.

-f: sends packets as fast as possible (flood ping).

-i <interval in seconds>: sets the interval between sending packets.

-I <interface or source address>: sends packets from the specified network interface or source address.

-l <preload>: sends the specified number of packets before entering normal sending mode.

-n: outputs numeric values only (no name resolution).

-p <pattern>: sets the pattern used to fill the packet.

-q: displays only summary lines at startup time and when finished.

-r: disregards normal routing tables and sends packets directly to a host on an attached network.

-R: records the route.

-s <packet size>: sets the packet size.

-t <TTL value>: sets the time to live (TTL) value.

-v: displays the command execution process in detail.

In conclusion, the customer should use the ping 192.168.3.100 -I 192.168.3.10 command for testing. Figure 5-368 shows the command output.

Figure 5-368 ping packet testing
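A side-by-side sketch of the two commands, using the addresses from this case:

    # incorrect: -i expects a send interval in seconds, so the address is misparsed
    ping 192.168.3.100 -i 192.168.3.10
    # correct: -I specifies the source interface or address
    ping 192.168.3.100 -I 192.168.3.10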

Cause analysis

A parameter in the ping command is incorrect: -i sets the packet-sending interval in seconds, so the address 192.168.3.10 is presumably interpreted as an interval of roughly 192 seconds, which matches the 3-minute delay before a response is seen. The ping 192.168.3.100 -I 192.168.3.10 command must be used instead of the ping 192.168.3.100 -i 192.168.3.10 command.

Conclusion and Solution

Conclusion

A parameter in the ping command is incorrect.

Solution

Run the ping 192.168.3.100 -I 192.168.3.10 command.

Experience

None

Note

None

The Cluster Service Fails to Start on Windows Server 2003

Problem Description
Table 5-308 Basic information

Source of the Problem: X6000
Intended Product: RH2285, E6000, and X6000
Release Date: 2012-05-04
Keyword: Windows Server 2003, cluster service, NIC, switch

Symptom

Hardware configuration

An X6000

Software configuration

Windows Server 2003

Symptom

After the server is restarted, the cluster service cannot restart, and a failover cannot occur for the database clusters that rely on the cluster service when a fault occurs. The event logs show that the cluster service fails to connect to domain China, as shown in Figure 5-369 and Figure 5-370.

Figure 5-369 Event log for the cluster service startup failure

Figure 5-370 Event log for the failure in connecting to domain China

Key Process and Cause Analysis

Key process

  1. Create the database clusters again and check configuration parameters. The problem persists.
  2. According to the event log shown in Figure 5-371, the NETLOGON service cannot find a domain controller in domain China, and an event log with the ID 5719 is generated.
    Figure 5-371 Event log for the failure in finding a domain controller in domain China

  3. Manually start the cluster service. The service is started successfully.
  4. On the Recovery tab, set the cluster service properties to Restart the Service, as shown in Figure 5-372. Restart the server. The cluster service successfully restarts.
    Figure 5-372 Setting cluster service properties

  5. In conclusion, network component initialization takes a long time, so services that rely on the domain account cannot start properly during system startup. Network component initialization covers the NIC, network parameters, and external network settings, which are closely tied to the hardware and the network environment.
  6. Uninstall the NIC driver and install a new driver. The problem persists.
  7. On the S5328 switch connected to the X6000, set carrier down-hold-time to 2000 ms (a port down delay of 2s) so that the switch does not react immediately to port up/down changes. The problem persists. By default, the port up delay is 2000 ms and the down delay is 0 ms. Set carrier down-hold-time to 3000 ms, as shown in Figure 5-373 (a CLI sketch follows this list). The problem is solved, and the cluster service starts properly.
    Figure 5-373 Setting carrier down-hold-time to 3000 ms

  8. Microsoft confirms that the delayed start option can be set for a service on Windows Server 2008 and later products. Therefore, the cluster service can be set to start after all other services are ready, which avoids the startup failure, as shown in Figure 5-374.
    Figure 5-374 Setting the cluster service to start later
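A hedged CLI sketch of the switch-side change from step 7. The interface name is an example, and command availability depends on the S5328 software release; verify against the switch documentation:

    <S5328> system-view
    [S5328] interface GigabitEthernet 0/0/1
    [S5328-GigabitEthernet0/0/1] carrier down-hold-time 3000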

Cause analysis

During OS startup, the port state goes down-up-down-up because the OS starts before the NIC finishes initializing. When the NIC ports flap between the up and down states, a down event that is not recovered within the down hold time leaves the network not ready, and the OS services cannot connect to the Active Directory (AD) for authentication. As a result, the NETLOGON service fails to start, and the cluster service cannot start.

Conclusion and Solution

Conclusion

If the physical signals for a NIC are not stable, the ports on the NIC frequently change between the up and down states (the latency for a switch to report a down event is 0 ms by default). As a result, the OS services cannot connect to the AD for authentication, and the cluster service fails to start.

Solution

  • (Recommended) Refer to The Cluster Service Fails to Start on Windows Server 2003.
  • Use a startup script to start the cluster service after all other services are ready. This prevents an error event log from being generated for a cluster service start failure.
  • Set the cluster service to start later on Windows Server 2008 and later.
  • Set carrier down-hold-time to 3000 ms for the switch that is connected to the server service planes.
Experience

Select the first or second solution for Windows Server 2003 and the third solution for Windows Server 2008.

Note

None

The RH2488 V2 with the Kylin 3.0 Installed Breaks Down

Problem Description
Table 5-309 Basic information

Source of the Problem: RH2488 V2
Intended Product: RH2488 V2
Release Date: 2013-04-16
Keyword: Kylin 3.0, C-STATE

Symptom

Hardware configuration

RH2488 V2

Software configuration

Kylin 3.0

Symptom

Faults frequently occur on an RH2488 V2 during startup: for example, a blank screen is displayed, the keyboard does not respond, or the service network port cannot be accessed. Similar faults occasionally occur on other servers while they are running.

Key Process and Cause Analysis

Key process

  1. Log in to the operating system (OS) of the server and query the system version, as shown in Figure 5-375.
    Figure 5-375 Querying the system version
  2. Exclude application factors because the fault still occurs when the system is idle. Install RHEL 5.7 and RHEL 5.8 on two servers onsite and test them for one day. The servers running RHEL 5.7 and RHEL 5.8 work properly, while the servers running Kylin 3.0 malfunction. Therefore, the fault is caused by the OS.
  3. Modify the basic input/output system (BIOS) settings (P/C/T-state options) on the servers running Kylin 3.0. No exception is then found, which indicates that the BIOS P/C/T-state settings are related to the fault.
  4. Use the RHEL5.7 kernel to replace the Kylin 3.0 kernel. The fault does not occur. This indicates that the fault is caused by the Kylin 3.0 kernel. After the Kylin 3.0 kernel is upgraded, the fault is rectified.

Cause analysis

The Kylin 3.0 kernel is of an early version and does not fully support Intel CPUs of the RH2488 V2. As a result, a fault occurs. The server hardware is normal.

Conclusion and Solution

Conclusion

The Kylin 3.0 kernel is of an early version and does not fully support Intel CPUs of the RH2488 V2. As a result, a fault occurs. The server hardware is normal.

Solution

Method 1:

  1. Restart the RH2488 V2 and press Delete during the power on self-test (POST) to enter the BIOS.
  2. On the Advanced tab, choose Processor and Clock options.
  3. Set Intel(R) SpeedStep(tm) tech to Disable.
  4. Set Intel(R) C-STATE tech to Disable.
  5. Set ACPI T State to Disable.

    Figure 5-376 shows the new settings.

    Figure 5-376 New settings

  6. Save the settings and exit the BIOS.

Method 1 is a workaround at the hardware level.

Method 2:

Seek help from Kylin and upgrade the Kylin 3.0 kernel to support Westmere EX systems. Method 2 is the solution to this issue.

Experience

Check OS compatibility first for similar cases.

Note

None

The System Breaks Down After the SUSE11SP1 OS Is Continuously Running for More than 208 Days

Problem Description
Table 5-310 Basic information

Source of the Problem: RH2285
Intended Product: R1 series servers
Release Date: 2013-04-08
Keyword: SUSE11SP1, DIVIDED_BY_ZERO bug

Symptom

Hardware configuration

RH2285

Software configuration

SUSE11SP1 64-bit

Symptom

Servers with the SUSE11SP1 operating system (OS) installed break down or restart after the OS is continuously running for more than 208 days. Information similar to the following is displayed in the dmesg or /var/log/messages file.

------------[ cut here ]------------ 
WARNING: at /usr/src/packages/BUILD/kernel-default-2.6.32.29/linux-2.6.32/kernel/sched.c:3847 update_cpu_power+0x151/0x160() 
[...] 
Call Trace:  
 [<ffffffff810061dc>] dump_trace+0x6c/0x2d0 
 [<ffffffff813974e8>] dump_stack+0x69/0x71 
 [<ffffffff8104d754>] warn_slowpath_common+0x74/0xd0 
 [<ffffffff8103d6e1>] update_cpu_power+0x151/0x160 
 [<ffffffff8103e323>] find_busiest_group+0xa83/0xce0 
 [<ffffffff8104604d>] load_balance_newidle+0xcd/0x380 
 [<ffffffff813982db>] thread_return+0x2a7/0x34c 
 [<ffffffff813992fd>] do_nanosleep+0x8d/0xc0 
 [<ffffffff81068628>] hrtimer_nanosleep+0xa8/0x140 
 [<ffffffff81068730>] sys_nanosleep+0x70/0x80 
 [<ffffffff81002f7b>] system_call_fastpath+0x16/0x1b 
 [<00007f77d8469da0>] 0x7f77d8469da0 
---[ end trace 63f382152a7c7034 ]---

Alternatively, information similar to the following is displayed.

PID: 24290  TASK: ffff880064340140  CPU: 0 COMMAND: "blkback.5.hda" 
#0 [ffff880064b19910] crash_kexec at ffffffff80071e20 
#1 [ffff880064b199e0] oops_end at ffffffff80353958 
#2 [ffff880064b19a00] do_divide_error at ffffffff8000886e 
#3 [ffff880064b19aa0] divide_error at ffffffff80007c05 
#4 [ffff880064b19b28] find_busiest_group at ffffffff800300f4 
#5 [ffff880064b19cb8] load_balance_newidle at ffffffff80036cda 
#6 [ffff880064b19d38] thread_return at ffffffff803500c1 
#7 [ffff880064b19dc8] dm_table_unplug_all at ffffffffa0424fec 
#8 [ffff880064b19e48] blkif_schedule at ffffffffa0537734 
#9 [ffff880064b19ee8] kthread at ffffffff80056816 
#10 [ffff880064b19f48] kernel_thread at ffffffff80007f0a
Key Process and Cause Analysis

Key process

The DIVIDED_BY_ZERO bug is randomly triggered in the kernel after the SUSE11SP1 OS has been running continuously for more than 208 days. The bug is described at the SUSE official website:

http://www.novell.com/support/kb/doc.php?id=7009834

The bug affects a server host that meets all of the following conditions (a quick check sketch follows the list):

  1. CPUs are provided by Intel.
  2. The CPU flags in /proc/cpuinfo contain constant_tsc and nonstop_tsc fields.
  3. The dmesg and /var/log/boot.msg do not contain Marking TSC unstable.
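A minimal shell sketch for checking the three conditions on a host:

    # condition 2: both TSC flags must appear in the CPU flags
    grep -o 'constant_tsc\|nonstop_tsc' /proc/cpuinfo | sort -u
    # condition 3: both checks must print nothing
    dmesg | grep -i 'Marking TSC unstable'
    grep -i 'Marking TSC unstable' /var/log/boot.msg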
Conclusion and Solution

Conclusion

The DIVIDED_BY_ZERO bug is randomly triggered in the kernel after the SUSE11SP1 OS is continuously running for more than 208 days. As a result, the system breaks down or restarts.

Solution

Workaround:

Manually restart the OS before it has been running continuously for 208 days.

Run the uptime command to query the continuous running time of the OS. In the command output, pay attention to the value before days.

# uptime 
23:48:44 up 3 days, 23:48,  1 user,  load average: 0.02, 0.05, 0.00

Solution:

Upgrade the SUSE11SP1 kernel to the latest version 2.6.32.59-0.7.1 (determine the default or Xen kernel according to actual situations).

The following uses 2.6.32.59-0.7.1-default as an example to describe how to upgrade the kernel:

  1. Dial the SUSE hotline 4008106500 to obtain the .rpm package of the kernel and upload the package to the server for which the kernel is to be upgraded.
  2. Run the following command to check whether the upgrade package can be installed:
    # rpm -ivh --test --force kernel-default-2.6.32.59-0.7.1.x86_64.rpm kernel-default-base-2.6.32.59-0.7.1.x86_64.rpm
  3. If no error message is displayed in step 2, run the following command to install the package:
    # rpm -ivh --force kernel-default-2.6.32.59-0.7.1.x86_64.rpm kernel-default-base-2.6.32.59-0.7.1.x86_64.rpm
  4. Check that the startup kernel in the /boot/grub/menu.lst is the new kernel.
    # cat /boot/grub/menu.lst 
    # Modified by YaST2. Last modification on Tue Dec 11 13:44:59 EST 2012 
    default 0 (0 indicates the default startup kernel, which is specified in the first title in the following.) 
    timeout 8 
    ##YaST - generic_mbr 
    gfxmenu (hd0,0)/boot/message 
    ##YaST - activate 
     
    ###Don't change this comment - YaST2 identifier: Original name: linux### 
    title SUSE Linux Enterprise Server 11 SP1 - 2.6.32.59-0.7 (default) 
    root (hd0,0) 
    kernel /boot/vmlinuz-2.6.32.59-0.7-default root=/dev/disk/by-id/scsi-36286ed494c1a7000184757f207d309cc-part1 resume=/dev/disk/by-id/scsi-36286ed494c1a700003f742e20b1b0ea1-part2 splash=silent crashkernel=256M-:128M showopts vga=0x317 
    initrd /boot/initrd-2.6.32.59-0.7-default 
     
    ###Don't change this comment - YaST2 identifier: Original name: failsafe### 
    title Failsafe -- SUSE Linux Enterprise Server 11 SP1 - 2.6.32.59-0.7 
    root (hd0,0) 
    kernel /boot/vmlinuz-2.6.32.59-0.7-default root=/dev/disk/by-id/scsi-36286ed494c1a7000184757f207d309cc-part1 showopts ide=nodma apm=off noresume edd=off powersaved=off nohz=off highres=off processor.max_cstate=1 nomodeset x11failsafe vga=0x317 
    initrd /boot/initrd-2.6.32.59-0.7-default     
  5. Restart the system to make the new kernel take effect.
  6. Run the following command to check that the kernel version is the target version:
    # uname -a
Experience

None

Note

This case applies only to Huawei R1 series servers (with the SUSE11SP1 standard OS installed).

Artifacts Appear on the GUI of the RHEL 6U2 and RHEL 6U3 OSs

Problem Description
Table 5-311 Basic information

Source of the Problem: BH620
Intended Product: R1 series servers
Release Date: 2013-04-11
Keyword: RHEL 6U2/6U3, artifacts

Symptom

Hardware configuration

BH620

Software configuration

RHEL 6U2 64-bit

Symptom

At a site, when the customer installs the RHEL 6U2 64-bit operating system (OS) on the BH620, artifacts appear on the screen. After the installation is complete, artifacts still appear on the desktop, as shown in Figure 5-377.

Figure 5-377 Desktop artifacts

Key Process and Cause Analysis

Cause analysis

The XGI driver provided by the RHEL 6U2 and RHEL 6U3 OSs does not properly support Z9s video cards, so artifacts appear on the graphical user interface (GUI).

Conclusion and Solution

Conclusion

The XGI driver provided by the RHEL 6U2 and RHEL 6U3 OSs does not properly support Z9s video cards, so artifacts appear on the GUI.

Solution

Solution 1 (the OS is being installed):

Select Install system with basic video driver during OS installation, as shown in Figure 5-378. No artifacts appear during or after the installation.

Figure 5-378 Selecting an Installation mode

Solution 2 (the OS is installed):

Use the VESA video card driver to replace the XGI video card driver as follows:

  1. Run vi /etc/X11/xorg.conf on the terminal to create the xorg.conf file, as shown in Figure 5-379.
    Figure 5-379 Creating a file

  2. In the text editor, press Insert and add the following information.
    Section "Device" 
                    Identifier "Videocard 0" 
                    Driver "vesa" 
    EndSection     

    Press Esc, enter :wq, and press Enter to save the configurations and exit, as shown in Figure 5-380.

    Figure 5-380 Text editor

  3. Run init 3 on the terminal to switch to the text UI, as shown in Figure 5-381 and Figure 5-382.
    Figure 5-381 Switching to the text UI

    Figure 5-382 Text UI

  4. Log in to the system from the text UI and run the init 5 command to switch back to the graphical runlevel. The GUI is restored, as shown in Figure 5-383 and Figure 5-384.
    Figure 5-383 Restarting the GUI
    Figure 5-384 GUI
Experience

None

Note

None

The Virtual DVD-ROM Drive Cannot Be Connected After Internet Explorer Stops Responding

Problem Description
Table 5-312 Basic information

Source of the Problem: E9000
Intended Product: E9000 servers
Release Date: 2014-01-13
Keyword: Internet Explorer, iMana, virtual DVD-ROM drive

Symptom

Hardware configuration

E9000

Software configuration

Windows

Symptom

  1. Use Internet Explorer 8 to log in to iMana.
  2. Connect compute nodes and the virtual DVD-ROM drive by using remote login on iMana.
  3. Internet Explorer 8 unexpectedly stops responding.
  4. Forcibly close Internet Explorer 8.
  5. Use Internet Explorer 8 to log in to iMana again and connect compute nodes by using remote login on iMana. The virtual DVD-ROM drive cannot be connected, and the system displays a message, indicating the virtual media is being used.
Key Process and Cause Analysis

Key process

After the virtual DVD-ROM drive is connected, Internet Explorer 8 unexpectedly stops responding, which is a Windows problem. After Internet Explorer 8 is forcibly closed, restart it and connect the virtual DVD-ROM drive again.

Cause analysis

After the unresponsive Internet Explorer 8 is forcibly closed, the virtual DVD-ROM drive remains connected and the Java virtual machine (JVM) process keeps running. Therefore, the system displays the message indicating that the virtual DVD-ROM drive is in use.

Conclusion and Solution

Solution

  • Method 1: Open the Task Manager on Windows and forcibly end the java.exe process (see the sketch after this list).
  • Method 2: Restart the local PC.
  • Method 3: Restart iMana.
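For method 1, the stuck process can also be ended from the command line; a minimal sketch, assuming the process image name is java.exe:

    C:\> tasklist | findstr /i java
    C:\> taskkill /F /IM java.exe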
Experience

None

Note

None

V2 Server Intel 82576 NIC Port Intermittence in Windows Server 2008

Problem Description
Table 5-313 Basic information

Source of the Problem: RH2288 V2
Intended Product: RH2288 V2
Release Date: 2013-12-15
Keyword: Windows Server 2008, abnormal network port status

Symptom

Hardware configuration

RH2288 V2 with an LSI SAS2208 controller card

Symptom

In Windows Server 2008 R2 SP1, a network port continuously switches between the Network cable unplugged and Enabled states, and the network connection is intermittent. See Figure 5-385.

Figure 5-385 Network port intermittence

Key Process and Cause Analysis

Cause analysis

In the BIOS, PCI-E Port Max Payload Size is set to 128B. In Windows, if the function 0 network port of the NIC is disabled, the maximum payload size (MPS) of the root port (256B) differs from the MPS on the device side (128B), leading to a PCIe link exception.

Conclusion and Solution

Conclusion

In the BIOS, PCI-E Port Max Payload Size is set to 128B. In Windows, if the function 0 network port of the NIC is disabled, the maximum payload size (MPS) of the root port (256B) differs from the MPS on the device side (128B), leading to a PCIe link exception.

Solution

There are two solutions:

Solution 1: Enable the function 0 network port for the NIC.

NOTE:

For details about how to distinguish the function 0 and function 1 network ports, see Note.

Solution 2: Enter the BIOS hidden mode and change the value of PCI-E Port Max.Payload Size Request to 256B. The procedure is as follows:

  1. During server startup, press Ctrl+Alt+1. A few minutes later, the screen shown in Figure 5-386 is displayed.
    Figure 5-386 Startup options

  2. Press Esc and choose Setup Utility to enter the BIOS hidden mode.
    Figure 5-387 BIOS hidden mode

  3. Select Advanced and select CPU IIO.
    Figure 5-388 Selecting CPU IIO

  4. Change the value of PCI-E Port Max. Payload Request to 256B under CPU IIO0 Configuration and CPU IIO1 Configuration for the PCIe slots where the corresponding NICs reside. If the PCIe slots where the expansion cards reside are uncertain, change the values of PCI-E Port Max. Payload Request for all five PCIe slots.
    Figure 5-389 PCIe slots

    Figure 5-390 Selecting "CPU IIO0 Configuration" and "CPU IIO1 Configuration"

  5. Select the corresponding option PCI Express Port XX.
    Figure 5-391 Selecting "PCI Express Port XX"

  6. Press Enter and change the value of PCI-E Port Max. Payload Request to 256B.
    Figure 5-392 Changing the value of "PCI-E Port Max. Payload Request"

  7. Repeat steps 4-6 to change the value of PCI-E Port Max. Payload Request under PCI Express Port XX for the other slots. Press F10 to save the settings. Then restart the server for the changes to take effect.
Experience

None

Note

Taking Windows Server 2008 as an example, the method of distinguishing the function 0, function 1, and function X network ports on the same NIC is as follows:

  1. Go to the Network and Sharing Center page and click Change adapter settings in the navigation area.
    Figure 5-393 Clicking "Change adapter settings"

  2. Click Device Name. The system automatically sorts data by NIC model.
    Figure 5-394 Clicking "Device Name"

  3. Select the device with the smallest number in Device Name (no suffix, then #2, #3, and #4 in sequence), check its properties, and select Configure.
    Figure 5-395 Properties item

  4. Check the location information (Location: PCI bus 135, device 0, function 0 in this example) on the General tab to determine whether the current network port is the one that needs to be enabled.
    Figure 5-396 General tab

Unable to Delete a Partition in Windows Server 2008 R2

Symptom

In Windows Server 2008 R2 Disk Management, a partition cannot be deleted. Taking drive D as an example: right-click drive D in Disk Management. Delete Volume is grayed out, although the status of drive D is displayed as normal.

Key Process and Cause Analysis

Drive D was set as the drive that holds the virtual memory paging file. To confirm this:

Right-click My Computer (or Computer), select Properties, and choose the Advanced tab > Performance Options > Advanced > Virtual Memory to view the virtual memory setting on drive D. A hidden pagefile.sys file is also visible on drive D.

Conclusion and Solution

Solution:

Select either of the following two options.

  • If the physical memory is sufficient to run the services, disable the virtual memory on drive D as follows:

    On the Virtual Memory tab, select No paging file, click Set, and then click OK. Then delete the drive D partition.

  • Move the virtual memory to another partition, and then delete the drive D partition after its paging file is removed (see the sketch after this list).
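The paging file configuration can also be inspected and removed from the command line when the paging file is not automatically managed; a hedged sketch using the built-in wmic tool:

    C:\> wmic pagefileset list brief
    C:\> wmic pagefileset where "name='D:\\pagefile.sys'" delete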
Experience

None.

Note

None.

Built-in Intel 82576 Driver of Windows Server 2008 R2 SP1 Leads to CAT ERROR

Problem Description
Table 5-314 Basic information

Source of the Problem: BH620/BH621 V2
Intended Product: BH620/BH621 V2
Release Date: 2014-05-09
Keyword: Windows Server 2008 R2, LSI SAS2308, blue screen of death (BSOD), automatic restart

Symptom

Symptom:

The server iMana management system reported a Machine Check Exception (MCE) event indicating a CPU error, as shown in Figure 5-397.

Figure 5-397 iMana alarm

There were no other logs after the Windows Server 2008 R2 SP1 operating system (OS) log reported the WHEA-Logger event. The WHEA-Logger event in the OS log points to the Intel 82576 (bus numbers 4:0:0 and 3:0:0 respectively, device ID 8086:10c9).

After the OS was restarted, the bugcheck event was STOP: 0x00000124, and the MEMORY.DMP file (the Windows crash dump) was generated.

Key Process and Cause Analysis

Key process:

  1. Microsoft interpreted that STOP: 0x00000124 is an unrecoverable hardware failure, which is generally related to PCIe device hardware and its drivers.

    http://social.technet.microsoft.com/wiki/contents/articles/6302.windows-bugcheck-analysis.aspx#STOP_124

    Stop 0x00000124 (WHEA_UNCORRECTABLE_ERROR)

    The Stop 0x00000124 message occurs when Windows has a problem handling a PCIe device. Most often, this occurs when adding or removing a hot-pluggable PCIe card; however, it can occur with driver- or hardware-related problems for PCI-Express cards.

    Figure 5-398 STOP-124 interpretation segment

  2. Parse the MEMORY.DMP file generated by the OS. The file is incomplete, so the specific PCIe slot (slot 4 or slot 5) cannot be located; however, the parsing shows that the system crash is caused by a PCIe error.
    Figure 5-399 Parsing result of MEMORY.DMP

    Figure 5-400 Parsing result of WHEA-Logger

    Figure 5-401 Parsing result of PCI-E device address space

    Figure 5-402 Parsing result of normal DUMP log

  3. Combining this with the PCIe events frequently logged by the OS, confirm with Microsoft that the Intel 82576 NIC expansion card caused the system crash.
    Figure 5-403 Reply email from Microsoft

  4. The Intel NIC driver comes with the OS. Upgrade the driver to eliminate the failure.
Conclusion and Solution

Conclusion:

An exception in the Intel 82576 NIC driver bundled with Windows Server 2008 R2 SP1 triggered a Machine Check Exception (MCE) event, resulting in an OS blue screen of death (BSOD) crash.

Solution:

Install the NIC driver released by Intel.

Experience

None.

Note

The Windows Hardware Error Architecture (WHEA) provides a common basis for handling hardware errors on the Windows platform. WHEA is intended to provide richer error reports and reduce the mean time to recover from fatal hardware errors. It allows Windows OSs to take advantage of existing and future hardware error standards such as the processor Machine Check Architecture (MCA) and PCI Express Advanced Error Reporting (AER).

https://msdn.microsoft.com/en-us/library/windows/hardware/ff560515(v=vs.85).aspx

Secondary Pool Tag AfdP Causing High Nonpaged Pool Usage in Windows Server 2008 R2

Problem Description
Table 5-315 Basic information

Source of the Problem: E6000
Intended Product: All servers
Release Date: 2015-01-08
Keyword: Nonpaged pool, high memory usage

Symptom

Hardware configuration

BH622 V2 with four 8 GB DIMMs

Software configuration

Windows Server 2008 R2 64-bit

Symptom

After Windows Server 2008 R2 64-bit runs on a BH622 V2 server blade for a period of time, the memory usage reaches 94%, as shown in the Windows Task Manager.

Key Process and Cause Analysis

Cause analysis

Use the Microsoft RamMap.exe tool to check the memory usage. The nonpaged pool occupies 22 GB of memory, only about 10 GB of Windows system memory remains available for user processes, and the client's processes currently use about 9.5 GB. As a result, the memory usage reaches 94%.

Use the PoolMon.exe tool to inspect the nonpaged pool. The AfdP tag consumes a large share of the nonpaged pool.

A search in the Microsoft Knowledge Base (MKB) shows that this problem matches the MKB article "System shows a high nonpaged pool utilization from pool tag AfdP." For details, see http://support.microsoft.com/kb/2935389/zh-cn.
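A quick way to reproduce this check is the Poolmon tool from the Windows Driver Kit, sorted by byte usage, followed by a literal search for the tag in the driver binaries (a hedged sketch of the documented technique; a tag is not always embedded as a literal string):

    C:\> poolmon -b
    C:\> findstr /m /l AfdP %SystemRoot%\System32\drivers\*.sys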

The collected SDP logs indicate that the patch KB2935389 is not installed. To resolve this problem, you are advised to back up the server and then install the following patch to upgrade the related drivers to the latest versions:

Package: Afd.sys 

----------------------------------------------------------- 

KB Article Number (s) : 2935389    

Language: All (Global)    

Platform: x64    

Location: ( http://hotfixv4.microsoft.com/Windows%207/Windows%20Server2008%20R2%20SP1/sp2/Fix512781/7600/free/477198_intl_x64_zip.exe  ) 

----------------------------------------------------------- 

KB Article Number (s) : 2935389    

Language: All (Global)    

Platform: i386    

Location: ( http://hotfixv4.microsoft.com/Windows%207/Windows%20Server2008%20R2%20SP1/sp2/Fix512781/7600/free/477199_intl_i386_zip.exe  )

The problem is resolved after the patch KB2935389 is installed.

Conclusion and Solution

Conclusion

This problem occurs because of a nonpaged pool leak in the afd.sys driver. This leak exhausts the nonpaged pool and causes the server to stop responding. The problem can occur if one application thread posts a Winsock Select function on a set of sockets while a second application thread is deleting one or more of the sockets.

Solution

Install the patch KB2935389.

Package: Afd.sys 

----------------------------------------------------------- 

KB Article Number (s) : 2935389    

Language: All (Global)    

Platform: x64    

Location: ( http://hotfixv4.microsoft.com/Windows%207/Windows%20Server2008%20R2%20SP1/sp2/Fix512781/7600/free/477198_intl_x64_zip.exe  ) 

----------------------------------------------------------- 

KB Article Number (s) : 2935389    

Language: All (Global)    

Platform: i386    

Location: ( http://hotfixv4.microsoft.com/Windows%207/Windows%20Server2008%20R2%20SP1/sp2/Fix512781/7600/free/477199_intl_i386_zip.exe  )
Experience

None

Note

Use the Poolmon tool to check the usage of the nonpaged pool and paged pool. Related syntax parameters are as follows:

Poolmon Syntax:

http://technet.microsoft.com/en-us/library/cc775774(v=WS.10).aspx

Blue Screen and Error Message Being Displayed When Windows Server 2012 R2 Is Restarted After Installation

Problem Description
Table 5-316 Basic information

Source of the Problem: V2 servers
Intended Product: V2 servers
Release Date: 2015-01-08
Keyword: Windows Server 2012 R2, computer problem

Symptom

Symptom:

After a user installs Update Rollup 2919355 (April 2014) for Windows RT 8.1, Windows 8.1, and Windows Server 2012 R2, or after a user installs any of these systems from media that include this update rollup, the user cannot restart the computer because the computer experiences a restart loop.

Update Rollup 2919355 may have been installed automatically through Windows Update (WU) or through Windows Server Update Services (WSUS) in the environment. Computers that start from certain Serial Attached SCSI (SAS) storage controllers are affected by this problem. This includes, but is not limited to, the following controller drivers:

  • Dell H200 PERC controller
  • IBM x240 with on-board LSI SAS2004 ROC controllers
  • LSI SAS2308 on-board controllers
  • LSI 9211-4i controllers
  • LSI 9211-8i controllers
  • LSI 9211 SAS
  • Supermicro X10SL7-F mainboard

If the "Automatically restart" option to set the computer behavior after a failure is disabled, you receive the following "Stop" error message during startup:

Stop 0x7B INACCESSIBLE_BOOT_DEVICE

Figure 5-404 Error message

Key Process and Cause Analysis

Cause analysis:

This problem occurs if the storage controller receives a memory allocation that starts on a 4 GB boundary. In this situation, the storage driver does not load. Therefore, the system does not detect the boot drive and returns the "Stop" error message that is mentioned in the "Symptom" section.

Conclusion and Solution

Solution:

There are two solutions:

  1. Try to restart the computer several times. You may occasionally be able to boot to the OS desktop by trying the process multiple times.

    Upgrade the Microsoft patch (KB2966870) and LSI SAS2308 RAID controller card driver (driver version: 2.00.72.02) on the OS desktop.

  2. If the OS desktop cannot be displayed after you restart the computer several times, perform the following operations:

    Based on Microsoft's corresponding recommendation and method, customize a Windows Server 2012 R2 64-bit system. Integrate the Microsoft patch KB2966870 and LSI SAS2308 RAID controller card driver (driver version: 2.00.72.02) with the Windows Server 2012 R2 64-bit system. For details, see Figure 5-405 at https://support.microsoft.com/kb/2966870.

Figure 5-405 Microsoft recommendation and method

For any questions about the Windows Server 2012 R2 64-bit system, contact Microsoft technical support at 800-820-3800.

Experience

None

Abnormal Restart with Bugcheck: 0x000000ca Occurred On a BH640 V2 that Runs Windows Server 2012 R2

Problem Description
Table 5-317 Basic information

Source of the Problem: BH640 V2
Intended Product: Huawei servers
Release Date: 2015-08-18
Keyword: Windows Server 2012 R2, bugcheck 0x000000ca

Symptom

Hardware configuration:

BH640 V2 + MU220 + NX220; OS: Windows Server 2012 R2; BIOS version: V060; BMC: 5.88

Symptom:

A customer reported that on August 6, the server restarted abnormally.

Key Process and Cause Analysis

Key process:

Windows OS log analysis shows that a bugcheck (0x000000ca) occurred in the system and caused the restart.

Contact Microsoft technical support and analyze the memory.dmp log. It is confirmed that mpio.sys caused the system exception. Upgrading mpio solves the problem.

Figure 5-407 Analysis on memory.dmp

Root cause analysis:

A problem in the Windows mpio driver led to the abnormal restart.

Conclusion and Solution

Conclusion:

A problem in the Windows mpio driver led to the abnormal restart.

Solution:

  1. Upgrade the mpio driver to the latest version.

    -----------------------------

    KB Article Number (s): 3036614

    Language: All (Global)

    Platform: x64

    Location: ( http://hotfixv4.microsoft.com/Windows%208.1/Windows%20Server%202012%20R2/sp1/Fix528809/9600/free/482050_intl_x64_zip.exe)

  2. If any third-party DSM is used, please contact the vendor to upgrade the DSM driver to the latest version.
Experience

None.

Note

None.

CPU and DIMM Alarms on Device Manager of Windows Server 2008 R2 (Standard Edition)

Problem Description
Table 5-318 Basic information

Source of the Problem: RH5885 V3
Intended Product: All servers
Release Date: 2015-01-29
Keyword: Windows Server 2008 R2 (standard edition), device manager, alarm

Symptom

The RH5885 V3 indicators were normal, and there was no alarm information on the BMC. After the Windows Server 2008 (standard edition) operating system (OS) was installed, the device manager of the OS displayed the following error messages for the CPUs and DIMMs:

  • CPU: All CPU icons had exclamation marks, indicating that the device was waiting for another device or device group to start (code 51).
  • DIMM: Only four DIMM icons appeared, all with exclamation marks. The properties message indicated that the device cannot start (code 10).
Key Process and Cause Analysis

Windows Server 2008 R2 (Standard Edition) supports a maximum memory capacity of 32 GB. If more than 32 GB of memory is configured, the excess physical memory cannot be used, and compatibility problems such as the one in this case occasionally occur.

Conclusion and Solution

Conclusion:

Memory compatibility problem of Windows Server 2008 R2 (standard edition)

Solution:

Install an edition that supports larger memory capacities, such as Windows Server 2008 R2 Enterprise or Datacenter.

Experience

The customer is advised to install the OS in strict accordance with the Compatibility List.

Note

None.

High Memory Usage When NUMA Is Enabled in a Windows Server 2008 R2 and SQL Server Environment

Problem Description
Table 5-319 Basic information

Item

Information

Source of the Problem

RH5885 V3

Intended Product

All servers

Release Date

2015-04-02

Keyword

SQL Server 2012, memory usage, NUMA

Symptom

Hardware configuration:

RH5885 V3

OS:

Windows Server 2008 R2

Software version:

SQL Server 2012

Symptom:

The RH5885 V3 was installed with Windows Server 2008 R2 and SQL Server 2012 to run database services. With NUMA enabled on the server by default, the memory usage kept increasing over time and could reach 40% or more.

Key Process and Cause Analysis
  1. The customer disabled NUMA by referring to the Windows forum. The memory usage dropped below 10%.
  2. The investigation confirmed that this is an SQL Server bug, which has been fixed in the latest SQL Server 2008/2012 patch package, as shown in Figure 5-408.

    http://support.microsoft.com/en-us/kb/2819662

    Figure 5-408 Description of the SQL patch
Conclusion and Solution

Conclusion:

SQL Server software bug.

Solution:

Download and install the latest patch for the corresponding version of SQL Server: http://support.microsoft.com/en-us/kb/2819662, as shown in Figure 5-409.

Figure 5-409 SQL patch package

Workaround:

Disable the server NUMA function in the server BIOS menu.

Experience

None.

Note

None.

Incorrect Processor Count Being Displayed in the Windows Server 2012 Task Manager

Problem Description
Table 5-320 Basic information

Item

Information

Source of the Problem

RH5885 V3

Intended Product

All servers

Release Date

2014-12-18

Keyword

Windows Server 2012, task manager, host logical processor

Symptom

Hardware configuration

RH5885 V3 with four 15-core E7-4890 v2 processors

Software configuration

Windows Server 2012

Symptom

A customer reported that, in the task manager in Windows Server 2012, the processor information column displayed 60 cores, 120 logical processors, and 64 host logical processors (the number of host logical processors conflicts with the actual configuration and the preceding two numbers).

The system information page in the OS shows that each processor is divided into 16 logical processors.

As a result, the customer doubted the RH5885 V3 server.

Key Process and Cause Analysis

Key process

  1. This problem does not recur on the lab server RH5885 V3 where Windows Server 2012 is installed.
  2. According to a Microsoft FAE, the following information was found in the Microsoft Knowledge Base on the official Microsoft website.

    http://blogs.technet.com/b/askcore/archive/2013/03/28/logical-processor-count-changes-after-enabling-hyper-v-role-on-windows-server-2012.aspx

    Figure 5-410 Related information in Microsoft Knowledge Base

    The information found in the Microsoft Knowledge Base indicates that this symptom is normal after the Hyper-V function is enabled in Windows Server 2012. Host logical processors indicates the maximum number of host logical processors that each Hyper-V-based virtual machine can use. Each physical processor contains 15 cores, or 30 logical processors after hyper-threading is enabled. However, because of the restriction imposed by the Hyper-V function, a maximum of 16 virtual processors can be provided for each virtual machine.

  3. The problem recurs on the lab server running Windows Server 2012 after the Hyper-V function is enabled.

Cause analysis

This is a normal symptom after the Hyper-V function is enabled for Windows Server 2012.

Conclusion and Solution

Conclusion

This is a normal symptom after the Hyper-V function is enabled for Windows Server 2012.

Solution

This is not a problem, and no workaround is required.

Experience

For Windows Server 2012 or later versions, if the Hyper-V function is disabled, Host logical processors will not be displayed. Before you use this case, ensure that the Hyper-V function is enabled at the customer site.

Note

None

Server BSOD Caused by BDSafeBrowser.sys

Problem Description
Table 5-321 Basic information

Item

Information

Source of the Problem

V2 servers

Intended Product

All servers

Release Date

2014-12-18

Keyword

BDSafeBrowser.sys, blue screen of death (BSOD)

Symptom

Hardware configuration

RH5885 V3 with four 15-core E7-4890 v2 processors

Software configuration

Windows Server 2012

Symptom

After the server running Windows Server 2008 operates for a period of time, a BSOD reading "BDSafeBrowser.sys DRIVER_UNLOAD_WITHOUT_CANCELLING_PENDING_OPERATIONS" is displayed. See Figure 5-411.

Figure 5-411 BSOD message

Key Process and Cause Analysis

Cause analysis

BDSafeBrowser.sys is the Baidu browser security component driver. It contains a bug that causes BSOD.

Conclusion and Solution

Solution

  1. Restart the server and press F8 to enter the safe mode.
  2. Uninstall all Baidu-related application components: find and run the uninstallation program uninst.exe in the C:\Program Files\Common Files\Baidu\BaiduProtect\Version directory.
  3. In C:\WINDOWS\system32\drivers\, delete the files BDSafeBrowser.sys, BDDefense.sys, BDMWrench.sys, bd0001.sys, bd0002.sys, and bd0004.sys, and manually delete any related registry entries. (A command-line sketch follows this list.)
  4. Restart the server.
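A command-line sketch of step 3, run from the safe-mode command prompt; the drivers path follows the standard Windows layout:

    rem Remove the Baidu driver files from the drivers directory
    cd /d C:\WINDOWS\system32\drivers
    del BDSafeBrowser.sys BDDefense.sys BDMWrench.sys bd0001.sys bd0002.sys bd0004.sys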
Experience

If a BSOD is displayed, search for cases on the Internet or at http://3ms.huawei.com/hi/group/1004825 based on the error message displayed on the BSOD.

Note

None

VMware vCenter Reporting DIMM Alarms

Problem Description
Table 5-322 Basic information

Item

Information

Source of the Problem

CH121

Intended Product

All servers

Release Date

2013-11-28

Keyword

DIMM, vCenter, alarm

Symptom

Hardware configuration

Two E5-2603 CPUs, sixteen 8 GB DIMMs, and LSI SAS2308 RAID controller card

Symptom

After VMware ESXi 5.1 is installed on the CH121 server, vCenter reports DIMM alarms.

Key Process and Cause Analysis

Key process

  1. Use the VMware vSphere Client to log in to the ESXi host; the DIMMs are in the healthy state.
  2. Analyze the BMC logs of the server; no alarm is generated.

Cause analysis

According to the vCenter alarm logic, DIMM Assert and Deassert events reported by the BMC are treated as alarms. After the vCenter hardware status is updated, vCenter reports DIMM alarms whenever such Assert and Deassert events exist.

Conclusion and Solution

Conclusion

The false DIMM alarm is reported by the VMware vCenter, and you can ignore it.

This problem has been resolved in VMware ESXi 5.1 patch. For details, see:

http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=2070667

Figure 5-412 VMware ESXi 5.1 patch description

Experience

Strictly follow the method provided in this case to determine whether a DIMM fault occurs.

Note

This problem exists in VMware ESXi 5.0/5.1/5.5. VMware has confirmed that this problem is caused by VMware and has resolved it in the released VMware ESXi 5.1 patch. The upcoming VMware ESXi 5.5 u2 will resolve this problem for VMware ESXi 5.5. For VMware ESXi 5.0, there is currently no release plan; consult VMware technical support if a problem occurs.

BH620/BH621 V2 Windows Server 2008 R2 Crash

Problem Description
Table 5-323 Basic information

Item

Information

Source of the Problem

BH620/BH621 V2

Intended Product

BH620/BH621 V2

Release Date

2014-05-09

Keyword

Windows Server 2008 R2, LSI SAS2308, blue screen, automatic restart

Symptom

Symptom

A BH620/BH621 V2 configured with an LSI SAS2308 RAID controller card and running Windows Server 2008 R2 frequently encounters blue screen of death (BSOD) or automatic restart. The BMC reports the CAT ERROR alarm.

PCIe recoverable errors are frequently displayed in the OS logs. The device ID is 0x8086:0x3c08.

Key Process and Cause Analysis

Cause

After the ASPM function is enabled in Windows Server 2008 R2, the PCIe device may fail occasionally. As a result, the system occasionally restarts or BSOD occurs.

Conclusion and Solution

Solution:

Back up service data and perform the following steps to upgrade Windows Server 2008 R2 to Windows Server 2008 R2 SP1:

  1. Open the following link in a browser and click Download.

    https://www.microsoft.com/en-us/download/details.aspx?id=5842

  2. Select the windows6.1-KB976932-X64.exe file.

  3. Install the downloaded patch.
  4. After the upgrade is complete, choose Start > Run, enter "winver", and press Enter to view the OS version. If the following information is displayed, the upgrade is successful.

Experience

None

Note

None

Error Message "BIOS needs update for CPU frequency support" Is Displayed During RHEL/CentOS 6 Series Startup

Problem Description
Table 5-324 Basic information

Item

Information

Source of the Problem

V2 servers

Intended Product

All servers

Release Date

2013-11-28

Keyword

RHEL/CentOS 6, display

Symptom

Symptom

During the startup of RHEL/CentOS 6 series, the message "Firmware Bug BIOS needs update for CPU frequency support" is displayed, as shown in Figure 5-413.

Figure 5-413 Startup error

Key Process and Cause Analysis

Cause analysis

The CPU frequency scaling service is enabled for RHEL/CentOS 6 series, as shown in Figure 5-414. If the CPU SpeedStep state is disabled in BIOS, the message "BIOS needs update for CPU frequency support" is displayed during startup.

Figure 5-414 CPU frequency scaling service being enabled

Conclusion and Solution

Solution

  1. If the CPU frequency scaling function is required, enable the CPU SpeedStep state in the server BIOS.
    1. R1 servers: In BIOS, choose Advanced > CPU Configuration and set Intel(R) SpeedStep(tm) tech to enable.
    2. V2 servers: In BIOS, choose Advanced > Advanced Processor and set EIST Support to enable. If CPU Turbo Boost is required, enable Turbo Mode.

      For BH620/621 V2, choose Advanced > Socket 1/2 information and set EIST to enable.

  2. If the CPU frequency scaling function is not required, ignore the message displayed.
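To verify the state on the OS side, the frequency scaling service and the active cpufreq driver can be inspected. A minimal sketch for RHEL/CentOS 6; the cpuspeed service name and sysfs paths apply to this release and are stated here as assumptions:

    # Check whether the CPU frequency scaling service is running (RHEL/CentOS 6)
    service cpuspeed status
    # Inspect the active frequency driver and governor; if SpeedStep (EIST) is
    # disabled in the BIOS, the cpufreq sysfs entries are typically absent
    cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_driver
    cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor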
Experience

None

Note

The following supplements the related Windows OS logs. When the SpeedStep parameter is disabled, the Windows OS logs contain messages similar to the following:

"Due to firmware problem, the performance power management function on processor 31 in group 0 is disabled. Contact the computer manufacturer to obtain the latest firmware."

When the SpeedStep parameter is enabled, the OS log is shown in Figure 5-415.

Figure 5-415 Log 2

Failure to Select High Resolution on the RHEL NVS315 Configuration Screen

Problem Description
Table 5-325 Basic information

Item

Information

Source of the Problem

V2 servers

Intended Product

All servers

Release Date

2013-11-28

Keyword

RHEL, NVS315

Symptom

Symptom

The server runs RHEL 6.2 and is configured with one NVS315 video card. After the video card driver is configured, run the nvidia-settings command to invoke the NVS315 video card configuration screen. The screen offers only two resolution options: 840 x 480 and 640 x 480. No higher resolution is available. See Figure 5-416.

Figure 5-416 No higher resolution available

Key Process and Cause Analysis

Key process

  1. The video card configuration screen indicates that the server monitor is CRT-1. In normal cases, the monitor model is displayed instead.
    Figure 5-417 Display information in the Xorg.log file

  2. Update the video card driver. However, the problem persists.
  3. Replace the monitor with one of another model. However, the problem persists.
  4. The server on site uses physical KVM cables instead of the cables that come with the video card. After the physical KVM cables are replaced with the cables that come with the video card, the physical monitor model is correctly identified, and higher resolutions become available. (A quick EDID check is sketched after this list.)
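As a quick check of whether the monitor's EDID reaches the video card over the cable in use, the Xorg log and the X server mode list can be inspected. A minimal sketch, assuming the default Xorg log path:

    # Look for EDID entries; a missing or invalid EDID usually limits the resolutions
    grep -i edid /var/log/Xorg.0.log
    # List the resolutions the X server currently offers
    xrandr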
Conclusion and Solution

Solution

Use the cables that come with the video card.

Experience

The failure to identify the physical monitor model is usually caused by one of the following; locate the fault accordingly:

  1. The video card driver is not installed, or the driver version is earlier than the expected one.
  2. The video card cable, including the connector, is of poor quality.
  3. The display is of poor quality.
Note

None

Tboot Error Is Reported During RHEL 6 Startup

Problem Description
Table 5-326 Basic information

Item

Information

Source of the Problem

V2 servers

Intended Product

All servers

Release Date

2013-12-15

Keyword

RHEL 6, tboot

Symptom

Symptom

RHEL 6.3 or later is installed on the server. A tboot error message is displayed during startup, but the server starts properly.

The tboot error information for RHEL 6.3 is as follows:

Invalid magic number: 0

Error 13: Invalid or unsupported executable format

Press any key to continue...

  • The tboot error information for RHEL 6.5 is shown in Figure 5-418.
    Figure 5-418 RHEL 6.5 error message
Key Process and Cause Analysis

Cause analysis

If all software is installed during OS installation, or the tboot software package is selected, the system boots the tboot.gz kernel by default. The built-in tboot module of RHEL has defects; as a result, tboot error information is displayed during startup. For details, go to the following link:

https://access.redhat.com/articles/186583

Figure 5-419 Kernel tboot.gz

Conclusion and Solution

Solution

  • Scenario 1

    The OS has been installed. In the OS, uninstall the tboot software package that is installed.

    Figure 5-420 Uninstalling the tboot software package

Modify the menu.lst file in /boot/grub.

Figure 5-421 shows the original menu.lst file.

Figure 5-421 Original menu.lst file

Figure 5-422 shows the menu.lst file after modification. Restart the server after the modification. (An illustrative before/after sketch follows this list.)

Figure 5-422 Modified menu.lst file

  • Scenario 2

    The OS is not installed. Choose System Base > Base, deselect the tboot package, and install the OS.

    Figure 5-423 Deselecting the tboot package
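For reference, the kind of menu.lst change shown in Figure 5-421 and Figure 5-422 typically looks as follows. This is an illustrative sketch only; the kernel version, root device, and kernel parameters are placeholders for the actual entries:

    # Before: the entry boots the tboot kernel (illustrative)
    title Red Hat Enterprise Linux (2.6.32-431.el6.x86_64)
            root (hd0,0)
            kernel /tboot.gz logging=serial,vga,memory
            module /vmlinuz-2.6.32-431.el6.x86_64 ro root=/dev/mapper/vg_root rhgb quiet
            module /initramfs-2.6.32-431.el6.x86_64.img

    # After: the entry boots the Linux kernel directly (illustrative)
    title Red Hat Enterprise Linux (2.6.32-431.el6.x86_64)
            root (hd0,0)
            kernel /vmlinuz-2.6.32-431.el6.x86_64 ro root=/dev/mapper/vg_root rhgb quiet
            initrd /initramfs-2.6.32-431.el6.x86_64.img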
Experience

None

Note

None

Windows Server 2012 BSOD with a 0x0000009E Error Code

Problem Description
Table 5-327 Basic information

Item

Information

Source of the Problem

CH121

Intended Product

Huawei servers

Release Date

2015-04-15

Keyword

0x0000009E, BSOD

Symptom

Hardware configuration:

CH121 with 16 x 8 GB DIMMs and two E5-2620 CPUs

Software version:

MM910: 2.20; blade server BIOS: V378; BMC: 5.11

OS:

Windows Server 2012

Symptom:

A customer reported that BSODs as shown in Figure 5-424 occurred on multiple servers, accompanied by a 0x0000009E error code.

Figure 5-424 BSOD accompanied by a 0x0000009E error code

Key Process and Cause Analysis

Key process:

Analysis by Microsoft engineers: Microsoft analyzed the C:\Windows\MEMORY.DMP log provided by the customer and, combined with the 0x0000009E error code, confirmed that the fault is similar to a known Microsoft problem. Visit the Microsoft Knowledge Base at:

http://support.microsoft.com/en-us/kb/2876391

Figure 5-425 Symptom of 0x0000009E error code

Root cause analysis:

This problem occurs because the logical unit number (LUN) is deleted and locked twice but released only once. Therefore, the Plug and Play (PnP) manager cannot remove the device, and the node crashes.

Conclusion and Solution

Conclusion:

This problem occurs because the logical unit number (LUN) is deleted and locked twice but released only once. Therefore, the Plug and Play (PnP) manager cannot remove the device, and the node crashes.

Solution:

Contact Microsoft to obtain the latest patch for Windows Server 2012. The following link is the KB description on the official Microsoft website:

https://support.microsoft.com/en-us/kb/2876391/zh-cn

Experience

The bug exists in Windows Server 2012 and Windows Server 2008. If the 0x0000009E error code is displayed, directly contact Microsoft to obtain a solution.

Note

None.

OS Startup Failure on a Server with Dual RAID Controller Cards If Enable controller BIOS Is Disabled for the Huawei-developed LSI SAS3108 RAID Controller Card

Problem Description
Table 5-328 Basic information

Item

Information

Source of the Problem

RH2288 V3

Intended Product

V3 servers

Release Date

2016-03-11

Keyword

LSI SAS3108, Enable controller BIOS, disable, dual RAID controller cards, OS startup failure

Symptom

A server is equipped with dual RAID controller cards. One is a Huawei-developed LSI SAS3108 RAID controller card, and the other is a standard LSI SAS3108 RAID controller card. A RAID 1 array is created using two hard drives and the Huawei-developed LSI SAS3108 RAID controller card. A RAID 5 array is created using four hard drives and the standard LSI SAS3108 RAID controller card. After SUSE Linux Enterprise Server (SLES) is installed on the RAID 1 array, the OS restarts. However, the OS cannot be accessed and the error message "GRUB loading, please wait …" and error code 17 are displayed, as shown in Figure 5-426.

Figure 5-426 OS access failure
Key Process and Cause Analysis
  1. If a server is equipped with two RAID controller cards, the OS will start from a hard drive connected to a specified RAID controller card. Choose Boot > Legacy > Hard Drives, and view the RAID controller card under Hard Disk Drive. It was found that only the standard RAID controller card existed under Hard Disk Drive.
    Figure 5-427 RAID controller card failed to be managed by the BIOS
  2. Restart the server. In the power-on self-test (POST) process, the RAID controller card self-check screen is displayed. The screen shows that two RAID arrays are found under the RAID controller cards but neither of them is managed by the BIOS. The message "Press <Ctrl><R> to Enable BIOS" is also displayed.
    Figure 5-428 RAID controller card failed to be managed by the BIOS
  3. Press Ctrl+R. The RAID controller card configuration screen is displayed, showing two RAID controller cards.
    Figure 5-429 RAID configuration screen
  4. On the screen, select the Huawei-developed RAID controller card (Controller 0:SAS3108) and press Enter. On the displayed screen, check the status of the RAID array. Then use the same method to check the status of the RAID array created using the standard RAID controller card (Controller 1:LSI MegaRAID SAS).
    Figure 5-430 Status of the RAID array under the Huawei-developed RAID controller card
    Figure 5-431 Status of the RAID array under the standard RAID controller card
  5. Compare the Ctrl Mgmt screen of the Huawei-developed RAID controller card with that of the standard RAID controller card. The comparison shows that Enable controller BIOS is disabled on the Huawei-developed card. After the controller BIOS is enabled, the OS starts properly.
    Figure 5-432 Enable controller BIOS disabled
  6. An LSI SAS3108 RAID controller card consumes many OPROM resources. If a server is equipped with a Huawei-developed LSI SAS3108 RAID controller card and a standard LSI SAS3108 RAID controller card, only a hard drive under the Huawei-developed LSI SAS3108 RAID controller card can be used as the boot drive. That is, an OS can be installed only on a hard drive connected to the Huawei-developed LSI SAS3108 RAID controller card.
Conclusion and Solution

Conclusion:

Enable controller BIOS is disabled on the Huawei-developed LSI SAS3108 RAID controller card. As a result, the BIOS cannot find the hard drives under the Huawei-developed LSI SAS3108 RAID controller card and cannot start the OS.

Solution:

Enable Enable controller BIOS (this parameter is enabled by default on delivery) for the Huawei-developed LSI SAS3108 RAID controller card as follows: Select Enable controller BIOS on the Ctrl Mgmt screen, select APPLY, and press Enter.

Figure 5-433 Enable controller BIOS enabled
Experience

None

Note

None

PSOD Occurs Because the Built-in MZ910 Driver of VMware ESXi 5.5 Has Bugs

Problem Description
Table 5-329 Basic information

Item

Information

Source of the Problem

E9000

Intended Product

E9000

Release Date

2014-07-25

Keyword

VMware ESXi 5.5 MZ910 drivers, Purple Screen of Death (PSOD)

Symptom

Hardware configuration: E9000 equipped with the MZ910 driver.

Software configuration: VMware ESXi 5.5

Symptom: PSOD occurs on VMware ESXi 5.5, as shown in Figure 5-434.

Figure 5-434 PSOD

Key Process and Cause Analysis

Key process:

Contact VMware to analyze the VMware logs and coredump files.

PSOD occurs because VMware ESXi 5.5 comes with the built-in MZ910 driver whose version is 4.6.100.0v.


                  vmnic0  be2net       10df:e220  10df:e264  4.6.100.0v      1.1.43.24 
                  vmnic1  be2net       10df:e220  10df:e264  4.6.100.0v      1.1.43.24 
                  vmnic2  be2net       10df:e220  10df:e264  4.6.100.0v      1.1.43.24 
                  vmnic3  be2net       10df:e220  10df:e264  4.6.100.0v      1.1.43.24 

#0  0x0000418015c8e218 in be_cq_create_v2 (pfob=0x410ae8948218, rd=, length=, solicited_eventable=, no_delay=, cqe_dma_coalescing=, eq_object=0x120, cq_object=0x410ae8948778) at vmkdrivers/src_9/private_drivers/ServerEngines/be2net/hwlib/cq.c:269
#1  0x0000418015c796de in be_mcc_create (adapter=0x410ae89481c0) at vmkdrivers/src_9/private_drivers/ServerEngines/be2net/be_init.c:2523
#2  mcc_setup (adapter=0x410ae89481c0) at vmkdrivers/src_9/private_drivers/ServerEngines/be2net/be_init.c:2908
#3  0x0000418015c7d41c in pf_reset (adapter=0x410ae89481c0, tx_timeo_ctxt=8 'b') at vmkdrivers/src_9/private_drivers/ServerEngines/be2net/be_init.c:3989
#4  0x0000418015ae5516 in vmklnx_workqueue_callout (data=) at vmkdrivers/src_92/vmklinux_92/vmware/linux_workqueue.c:696
#5  0x000041801546165a in helpFunc (data=) at bora/vmkernel/main/helper.c:3251
#6  0x0000418015653532 in CpuSched_StartWorld (destWorld=, previous=) at bora/vmkernel/sched/cpusched.c:10052
#7  0x0000000000000000 in ?? ()

Before breakdown, information about vmnic0 is shown as follows:

2014-07-02T15:29:12.477Z cpu12:33361)VMK_PCI: 720: device 0000:02:00.0 allocated 8 interrupts (intrType 3)
2014-07-02T15:29:12.477Z cpu12:33361)MSIX enabled for dev 0000:02:00.0
2014-07-02T15:29:12.480Z cpu12:33361)TX queue creation failed
2014-07-02T15:29:12.480Z cpu12:33361)Rings creation of ring set 2 failed
2014-07-02T15:29:12.483Z cpu12:33361)pf_reset: ring_sets_setup Failed
2014-07-02T15:29:12.483Z cpu12:33361)World: 8773: PRDA 0x418043000000 ss 0x0 ds 0x10b es 0x10b fs 0x10b gs 0x0
2014-07-02T15:29:12.483Z cpu12:33361)World: 8775: TR 0x4020 GDT 0x4123c9461000 (0x402f) IDT 0x4180154f3000 (0xfff)
2014-07-02T15:29:12.483Z cpu12:33361)World: 8776: CR0 0x80010031 CR3 0x1686a54000 CR4 0x42768

Cause analysis:

PSOD occurs because VMware ESXi 5.5 comes with the built-in MZ910 driver whose version is 4.6.100.0v.

Conclusion and Solution

Conclusion:

PSOD occurs because VMware ESXi 5.5 comes with the built-in MZ910 driver whose version is 4.6.100.0v.

Solution:

  1. Download the MZ910 driver and MZ910 firmware upgrade package. For details, see Huawei Server OS Installation Guide.
  2. Mount the MZ910 firmware using a virtual CD-ROM drive.
  3. Upgrade the MZ910 driver of VMware ESXi 5.5.
    1. Upload the MZ910 driver to a directory in VMware ESXi 5.5.
    2. In the directory, run the sh install.sh command.
    3. Enter 1 for automatic installation.
  4. Restart VMware ESXi 5.5, press F11, and select the virtual CD-ROM drive as the boot device. For details about how to upgrade the MZ910 firmware, see the preceding MZ910 firmware upgrade guide.
  5. After the MZ910 firmware is successfully upgraded, restart to enter VMware ESXi 5.5, and run the following three commands to confirm version information:
    1. esxcli software vib list |grep lpfc
    2. esxcli software vib list |grep elxnet
    3. esxcli network nic get -n vmnicx (x indicates the NIC number)
Experience

Install a specified MZ910 driver based on the corresponding MZ910 driver version mapping list released at the Huawei technical support website.

Note

None

Black-and-White Screen Is Displayed on VMware OS

Problem Description
Table 5-330 Basic information

Item

Information

Source of the Problem

Huawei servers

Intended Product

Huawei servers

Release Date

2015-03-04

Keyword

VMware, black-and-white screen

Symptom

A black-and-white screen is displayed while the VMware OS is running normally, as shown in Figure 5-435.

Figure 5-435 Black-and-white screen

Figure 5-436 shows the VMware installation welcome screen.

Figure 5-436 VMware installation welcome screen

Key Process and Cause Analysis

Cause analysis:

The black-and-white screen is displayed because the user pressed the F4 key.

Conclusion and Solution

Solution:

Press F4 again to restore the screen.

For more operations, see http://www.vmware.com/help.html.

Experience

None

Note

None

Insufficient Heap Memory Causes VMware OS PSOD

Problem Description
Table 5-331 Basic information

Item

Information

Source of the Problem

CH121

Intended Product

Huawei servers

Release Date

2015-04-15

Keyword

Heap, VMware ESXi 5.1, DLM_free, PSOD

Symptom

Hardware configuration: CH121 + 16 x 8 GB DIMMs + 2 x E5-2620

Software configuration:

  • MM910: 2.20
  • Blade server BIOS: V378
  • BMC: 5.11

OS: VMware ESXi 5.1

Symptom: At a customer site, after third-party security software (dvfilter-dsa module) provided by Trend Micro is installed on servers, PSOD shown in Figure 5-437 occurs.

Figure 5-437 PSOD caused by improper invoking of the dvfilter-dsa module

Key Process and Cause Analysis

Key process:

  1. Cause analysis from VMware engineers:

    Similar error messages, which belong to the same type of faults, are displayed when PSOD occurs.

    "dlmalloc.c" indicates a code segment related to memory allocation.

    The PSOD alarm information is displayed as follows:

    ---------------- 
     
    PSOD (HZ504K0601) 
     
    ---------------- 
     
    2015-03-29T13:48:07.558Z cpu8:4711920)@BlueScreen: PANIC bora/vmkernel/main/dlmalloc.c:4827 - Usage error in dlmalloc
    2015-03-29T13:48:07.558Z cpu8:4711920)Code start: 0x41802d800000 VMK uptime: 110:08:30:31.303
    2015-03-29T13:48:07.559Z cpu8:4711920)0x4122d7c1b588:[0x41802d87b31a]PanicvPanicInt@vmkernel#nover+0x61 stack: 0x3000000008
    2015-03-29T13:48:07.560Z cpu8:4711920)0x4122d7c1b668:[0x41802d87bb1b]Panic@vmkernel#nover+0xae stack: 0x4100344b04a0
     
    ---------------- 
     
    PSOD(HZ504K0603) 
     
    ---------------- 
     
    2015-03-31T02:10:17.680Z cpu45:34771)@BlueScreen: PANIC bora/vmkernel/main/dlmalloc.c:4827 - Usage error in dlmalloc
    2015-03-31T02:10:17.680Z cpu45:34771)Code start: 0x418005c00000 VMK uptime: 0:18:46:14.697
    2015-03-31T02:10:17.681Z cpu45:34771)0x41225f4db348:[0x418005c7b31a]PanicvPanicInt@vmkernel#nover+0x61 stack: 0x3000000008
    2015-03-31T02:10:17.682Z cpu45:34771)0x41225f4db428:[0x418005c7bb1b]Panic@vmkernel#nover+0xae stack: 0x410058699b20
         

    Check system kernel logs.

    The following alarm information is displayed when PSOD occurs:

    ---------------- 
     
    (HZ504K0601) 
     
    ---------------- 
     
    2015-03-29T13:46:18.395Z cpu26:103997)WARNING: Heap: 3058: Heap_Align(dvfilter-dsa, 80/80 bytes, 64 align) failed.  caller: 0x41802d819aa7
    2015-03-29T13:46:18.734Z cpu31:6538979)WARNING: Heap: 2677: Heap dvfilter-dsa already at its maximum size. Cannot expand.
     
    2015-03-29T13:46:19.120Z cpu29:16413)WARNING: Heap: 2677: Heap dvfilter-dsa already at its maximum size. Cannot expand. 
     
    ----------------- 
     
    (HZ504K0603) 
     
    ---------------- 
     
    2015-03-31T02:10:17.021Z cpu37:34772)WARNING: Heap: 2677: Heap dvfilter-dsa already at its maximum size. Cannot expand. 
     
    2015-03-31T02:10:17.295Z cpu39:34771)WARNING: Heap: 3058: Heap_Align(dvfilter-dsa, 880/880 bytes, 8 align) failed.  caller: 0x4180064742f2
    2015-03-31T02:10:17.512Z cpu45:16429)WARNING: Heap: 3058: Heap_Align(dvfilter-dsa, 392/392 bytes, 8 align) failed.  caller: 0x418006488890
         
  2. Cause analysis from Trend Micro:

    Improper invoking of the dvfilter-dsa module and insufficient heap memory may cause VMware OS PSOD. For details, see http://esupport.trendmicro.com/solution/en-us/1095995.aspx.

    Figure 5-438 Trend Micro official statements

Cause analysis:

Insufficient heap memory may cause VMware OS PSOD.

Conclusion and Solution

Conclusion:

Insufficient heap memory may cause VMware OS PSOD.

Solution:

Optimize heap memory allocated to filter drivers. Refer to official suggestions of Trend Micro or contact Trend Micro for assistance.

http://esupport.trendmicro.com/solution/en-us/1095995.aspx

Experience

Insufficient heap memory can take the Deep Security Virtual Appliance (DSVA) out of service. The PSOD can be resolved by optimizing the heap memory allocated to filter drivers.

Note

None

Alarms Related to LSI_SAS Are Displayed on VMs Running Windows Server 2008 R2 After VMware ESXi 5.0 Is Installed

Problem Description
Table 5-332 Basic information

Item

Information

Source of the Problem

RH5885 V2

Intended Product

All servers

Release Date

2015-02-04

Keyword

VMware ESXi 5.0, Windows Server 2008 R2, VM, LSI_SAS

Symptom

Hardware configuration: RH5885 V2

Software configuration: VMware ESXi 5.0 and Windows Server 2008 R2

Symptom: At a customer site, after VMs running Windows Server 2008 R2 are created on an RH5885 V2 server equipped with VMware ESXi 5.0 and run for a period, alarms are displayed.

Key Process and Cause Analysis

Key process:

  1. Check the BMC; no alarms are found. Because LSI and RAID port information is displayed in the alarms, a RAID controller card failure is suspected. Obtain RAID controller card logs using a detection tool and analyze them; no alarms are found.
  2. Contact VMware after-sales engineers for the following reference information:

    https://kb.vmware.com/s/article/2063346#q=2096638

    Figure 5-439 Replies from VMware after-sales engineers

The LSI_SAS shown in the preceding alarms indicates a mode in which VM drive (or storage) resources are accessed. Other modes, such as LSI Logic Parallel and VMware Paravirtual SCSI (PVSCSI), are also provided.

Cause analysis:

A defect exists in VMware ESXi 5.0.

Conclusion and Solution

Conclusion:

A defect exists in VMware ESXi 5.0.

Solution:

  1. Upgrade the VF driver at the upper layer (LSI_SAS) of the Windows Server 2008 R2 VM to 1.32.01 or later.
  2. If the preceding solution cannot solve the problem, perform the following procedure to change LSI_SAS of VM SCSI controllers to VMware PVSCSI:
    1. Power off the VM.
    2. Log in to a vSphere Client, right-click the VM, choose Edit Settings > Add. In the displayed list, select SCSI Controller, and then click Next.
    3. Unfold the SCSI controller added, and change its default type from LSI Logic SAS to VMware PVSCSI.
    4. Power on the VM, and log in to the OS device manager. On the device manager page, check whether the storage controller is set to VMware PVSCSI Controller and runs normally.
    5. Power off the VM, and change the original SCSI controller type from LSI SAS to VMware PVSCSI.
    6. Power on the VM.

If you only directly change the original SCSI controller type from LSI SAS to VMware PVSCSI, a blue screen occurs on a VM running the Windows guest OS, because the boot loader for the guest OS cannot identify loaded device types.

Experience

None

Note

None

Migrating VMware VMs Across CPU Platforms

Problem Description
Table 5-333 Basic information

Item

Information

Source of the Problem

RH5885 V3

Intended Product

All servers

Release Date

2015-04-25

Keyword

Enhanced vMotion Compatibility (EVC)

Symptom

Hardware configuration: HP ProLiant DL580 G7 (Sandy Bridge) and RH5885 V3 servers

Symptom: When VMs are migrated from HP ProLiant DL580 G7 servers running VMware ESXi 5.0 to RH5885 V3 servers running VMware ESXi 5.5, the following error is reported on the vCenter management page.

Figure 5-440 vCenter error message
Key Process and Cause Analysis

Cause analysis:

By default, VMs cannot be migrated between hosts of different architectures unless the EVC function is enabled.

Conclusion and Solution

Conclusion:

By default, VMs cannot be migrated across CPU platforms because of VMware restrictions. To migrate VMs in this manner, enable the EVC function in the cluster.

Solution:

Enable the EVC function in the cluster.

Experience

None

Note

None

VMware ESXi Identifies SSDs as Non-SSDs

Problem Description
Table 5-334 Basic information

Item

Information

Source of the Problem

RH5885 V3

Intended Product

All servers

Release Date

2015-04-30

Keyword

VMware, SSD

Symptom

Hardware configuration: RH5885 V3 servers, LSI SAS2208 RAID controller cards, and SSDs

Symptom: VMware ESXi identifies SSDs that are configured as RAID 0 as non-SSDs by default, as shown in Figure 5-441.

Figure 5-441 SSDs are identified as non-SSDs

Key Process and Cause Analysis

Cause analysis:

VMware ESXi accurately identifies only raw drives. If SSDs are configured as a RAID array, VMware ESXi identifies the RAID array as a non-SSD by default.

VMware provides a change method for this case: use PSA SATP claim rules to tag devices that are not detected automatically.

The identification result can be changed in this way, as shown in Figure 5-442.

Figure 5-442 Change methods
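The tagging procedure mentioned above can be performed with esxcli PSA SATP claim rules. A minimal sketch, following the approach VMware documents for tagging SSD devices; naa.xxxx is a placeholder for the actual device identifier:

    # List storage devices to find the identifier of the RAID 0 logical drive
    esxcli storage core device list
    # Add a SATP claim rule that tags the device as an SSD
    esxcli storage nmp satp rule add --satp VMW_SATP_LOCAL --device naa.xxxx --option "enable_ssd"
    # Reclaim the device so that the new rule takes effect
    esxcli storage core claiming reclaim -d naa.xxxx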

Conclusion and Solution

Conclusion:

The SSD identification mechanism provided by VMware can be manually changed.

Experience

VMware ESXi 6.0 supports all-flash vSAN; VMware ESXi 5.5 does not. Each drive must be configured as a single-drive RAID 0.

  1. Only SSDs and HDDs that are listed in the vSAN Compatibility List can be used for vSAN.
  2. SSDs can be used for cache or capacity. An SSD used for capacity cannot be used for cache.
Note

None

MAC Addresses Conflict Between Blade Servers Running the VMware OS in the E9000 Chassis

Problem Description
Table 5-335 Basic information

Item

Information

Source of the Problem

RH5885 V3

Intended Product

All servers

Release Date

2015-04-30

Keyword

VMware, MAC address conflict

Symptom

Hardware configuration: CH242 V3, 8 HDDs, MZ510 NICs, and E9000 with the NIC stateless computing function enabled

OS: VMware ESXi 5.5 Update 2

Symptom: After blade servers in slots 1 and 2 of the HW06 chassis are moved to slots 3 and 4 of the HW04 chassis and new blade servers are inserted in slots 1 and 2, packets are dropped on the network. It is confirmed that the MAC addresses of blade servers in the HW04 chassis conflict with those in the HW06 chassis.

Figure 5-443 MAC address conflicts

Key Process and Cause Analysis

Cause analysis:

After NICs are replaced, or when there are duplicate MAC addresses between VMkernel interfaces and NICs, the MAC address of the vmk0 management interface is not updated. This is related to the VMware OS mechanism. For details, see http://www.vmware.com/help.html.

Conclusion and Solution

Conclusion:

After NICs are replaced, or when there are duplicate MAC addresses between VMkernel interfaces and NICs, the MAC address of the vmk0 management interface is not updated.

Solution:

  1. Log in to the VMware CLI.
  2. Run the esxcfg-advcfg -s 1 /Net/FollowHardwareMac command.
  3. Restart the OS.
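The setting can be verified before the restart. A minimal sketch; the -g option of esxcfg-advcfg reads back the current value:

    # Make vmk0 follow the MAC address of the physical NIC
    esxcfg-advcfg -s 1 /Net/FollowHardwareMac
    # Read the value back to confirm that it is set to 1
    esxcfg-advcfg -g /Net/FollowHardwareMac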
Experience

None

Note

None

Server Hardware Information Cannot Be Displayed on vSphere Web Client 6.0

Problem Description
Table 5-336 Basic information

Item

Information

Source of the Problem

RH5885 V3

Intended Product

All servers

Release Date

2015-04-30

Keyword

VMware, vSphere Web Client, hardware status

Symptom

Hardware configuration: CH242 V3 with 8 HDDs

OS: VMware ESXi 6.0.0

Symptom:

The hardware information of a server cannot be displayed on vSphere Web Client, as shown in Figure 5-444.

Figure 5-444 Server hardware information cannot be displayed

Key Process and Cause Analysis

Check the hardware logs of the server and find that no error exists.

This problem is a known issue on vSphere Web Client 6.0. For details, refer to the official knowledge base of VMware: "No host data available" reported in Hardware status tab (2112847)

Conclusion and Solution

Conclusion:

The hardware of the server is normal. The problem is a known issue on vSphere Web Client 6.0.

Solution:

Allocate global permission for user management accounts and groups on vSphere Web Client according to the official knowledge base of VMware.

Experience

None

Note

None

VMs Cannot Be Opened by vCenter on an RH8100 with ESXi 5.5

Problem Description
Table 5-337 Basic information

Item

Information

Source of the Problem

RH8100 V3

Intended Product

All servers

Release Date

2015-10-21

Keyword

Unable to connect to the MKS: Could not connect to pipe \\.\pipe\vmware-authdpipe within retry period

Symptom

Hardware configuration: RH8100 V3 server

OS: VMware ESXi 5.5

Symptom:

Four RH8100 V3 servers are deployed on a site and the same version of ESXi 5.5 is installed on the four nodes. The VMs and services are running properly. However, on one of the nodes, the consoles of all VMs under the node host cannot be opened on the vCenter server and the following information is displayed in the console window: Unable to connect to the MKS: Could not connect to pipe \\.\pipe\vmware-authdpipe within retry period.

Key Process and Cause Analysis
  • Migrate a VM to another host. The console of the VM can be connected by using the vCenter server, indicating that the VM itself is not faulty.
  • Search for the error in the knowledge base on the official VMware website. A similar error and its solution are found. The root cause is that the customer reconfigured a vCenter server and migrated the host to the new vCenter server without setting the host to the maintenance mode during the migration. The problem can be resolved by shutting down the VMs, setting the VMware host to the maintenance mode, and restarting the VMware host.

Refer to the solution in the official knowledge base of VMware: "Unable to connect to MKS" error in vSphere Web Client (2115126)

Conclusion and Solution

Conclusion:

This problem is a known issue. You can resolve this problem by setting the host to the maintenance mode and restarting the host.

Solution:

Rectify the fault by using the following solution:

  1. Power off all VMs under VMware.
  2. Set the ESXi host to the maintenance mode.
  3. Restart the ESXi host.
  4. Connect to the VMs on the vCenter server to open the VM console.
Experience

None

Note

None

High Temperature Alarm Generated on a SATA Drive on VMware

Problem Description
Table 5-338 Basic information

Item

Information

Source of the Problem

RH2288H V2

Intended Product

All servers

Release Date

2015-11-25

Keyword

VMware, SATA drive, high temperature alarm, LSI SAS2308

Symptom

Hardware configuration: Universal server equipped with the LSI SAS2308 RAID controller card and SATA drives

OS: VMware

Symptom:

  1. On a customer site, VMware 5.5u2 was installed on an RH2288H V2 server. In OS logs, a drive high temperature alarm was generated, as shown in Figure 5-445.
  2. Use a command to check the hard drive. The query result showed that the drive temperature value was 171 (the value varied with the actual temperature), as shown in Figure 5-446.
  3. The IBM VIX VM installed in VMware determined that the hard drive was overheating, and removed it from the shared file system.
    Figure 5-445 Overheating alarm of a drive

    Figure 5-446 Checking the drive problem
Key Process and Cause Analysis

1. Analysis of the drive temperature reading of 171:

The SMART definition in the SATA protocol specification is shown in Figure 5-447. ATA8-ACS specifies SMART, but the protocol provides only the SMART framework; the specific content is defined by each hard drive manufacturer and therefore varies.

In other words, the SMART protocol only specifies the architecture, and gives the range of the address space. Each hard drive manufacturer customizes the content written in this range.

Figure 5-447 SMART

SMART information reading: The read content and print format may differ according to tools or OSs.

  • Figure 5-448 shows the SMART information read on VMware. Only three items of the information are included.
  • Figure 5-449 shows the SMART information read on SUSE. More detailed information is provided.

    Raw value is the actual temperature of the drive, and Value is a calculated temperature value. The meaning varies with each manufacturer's definition: Hitachi products display the calculated value, whereas Seagate products display the raw value.

    Figure 5-448 SMART information read on VMware

    Figure 5-449 SMART information read on SUSE

2. Server hardware detection:

  1. Check the hardware of the RH2288H V2 server. No hardware fault was found.

  2. Use the Huawei Toolkit detection system (SUSE 11.3 kernel) to read the drive temperature. The actual drive temperature is 28 degrees, the lowest historical temperature is 20 degrees, and the highest historical temperature is 41 degrees, which are all within the normal range.

    Therefore, no actual overheating problem exists on the hard drive, but the OS reads temperature information abnormally.

3. Analysis of temperature alarm of VMware drive:

  1. Description of SMART specification for SATA drives

    The hard drive SMART specifications explain Value, Worst, and Threshold in detail for SATA drives. Under the item "The problems with S.M.A.R.T.", the case of a threshold value of 0 is explained:

    In SMART information, Threshold = 0 means that the drive is determined to be faulty only when Value or Worst drops below 0. In other words, hard drive manufacturers do not want this threshold to take effect; Threshold = 0 can never be reached.

    It is also mentioned that hard drive temperature is generally monitored by external hardware. On Huawei servers, sensors monitor the drive temperature.

    Threshold (byte): the (failure) limit value for the attribute.

    Value (byte): the current relative "health" of the attribute. This number is calculated by the algorithm, using the raw data (see above). On a new hard drive, this number is high (a theoretical maximum, for example 100, 200 or 253) and it decreases during the lifetime of the drive.

    Worst (byte): the worst (smallest) value ever found in the previous lifetime of the hard drive.

    Moreover, the threshold value is 0 for many critical attributes. Because the Value cannot be decreased below 0, these attributes will never indicate any sign of failure - even if they "want" to do this. So S.M.A.R.T. will never alert.

  2. Cause of VMware error:

    The knowledge base on the official VMware website explains this problem. The VMware tool determines that a fault exists when Value is greater than Threshold. However, the actual SMART specification determines that a fault exists when Value is smaller than Threshold.

    That is, the determination method of the SMART specification for SATA hard drives differs from that of the VMware tool, resulting in the current alarm.

  3. Cause of no error reported by SAS drives:

    The SMART specification for SAS drives differs from that for SATA drives. For SAS drives, Threshold = NA indicates that no threshold is set. Therefore, no drive error is reported regardless of the determination method.

    Link for detailed explanation:

Conclusion and Solution

Conclusion:

The determination method of the SMART specification for SATA hard drives differs from that of the VMware tool, resulting in the current alarm.

Solution:

This alarm does not affect service applications. At present, VMware is developing a patch to resolve this problem. Afterwards, the official VMware website will provide a corresponding explanation in the knowledge base.

This problem appears when an LSI SAS2308 RAID controller card is used and the hard drive is used as a passthrough one. If a RAID array is configured or the LSI SAS2208 RAID controller card is used, VMware is unable to directly obtain the SMART information of the hard drive, and therefore this alarm is not generated.

Experience

None

Note

None

A VMware Purple Screen Problem Occurs Due to the LSI SAS3108 RAID Controller Card Firmware and Driver Problems

Problem Description
Table 5-339 Basic information

Item

Information

Source of the Problem

RH5885 V3

Intended Product

Servers using the LSI SAS3108 RAID controller card

Release Date

2015-12-28

Keyword

LSI SAS3108 RAID controller card, firmware, driver, VMware purple screen

Symptom

Hardware configuration: RH5885 V3 server equipped with the LSI SAS3108 RAID controller card

OS: VMware 6.0

Symptom:

  1. The OS breaks down and a VMware purple screen problem occurs, as shown in Figure 5-450.
    Figure 5-450 VMware purple screen information
  2. A RAID controller card error message is displayed in BMC logs, as shown in Figure 5-451.
    Figure 5-451 BMC logs
  3. The alarm persists after the RAID controller card is replaced.
  4. The problem is resolved after the RAID controller card firmware and VMware driver are upgraded.
Key Process and Cause Analysis

Cause analysis: The driver provided by the OS does not match the LSI SAS3108 RAID controller card firmware. Install the driver package obtained from the official Huawei website and upgrade the firmware using the script contained in the driver package. A detailed upgrade scheme is provided in the solution.

Conclusion and Solution

Download the software before the upgrade.

Firmware upgrade guidance

Upgrade the firmware.

  1. Mount the downloaded firmware image to the KVM virtual DVD-ROM drive page, and power on the server.

  2. Press F11 to go to the boot option screen, select the virtual DVD-ROM drive from which you want to boot, and press Enter.

  3. Select Toolkit-V101.

  4. Press C to go to the CLI mode.

  5. Enter the user name and password.

    The user name is root, and the default password is Huawei12#$.

  6. Run the cd /home/Project/FTK/upgrade/raid/tool/ command to go to the folder.

    Run the ./FwUpgrade.py FwUpgrade.XML command and press Enter to start the upgrade. After the upgrade, restart the server for the upgrade to take effect.

  7. View the LSI SAS3108 RAID controller card firmware version. The version is 4.270.00.4382.

Upgrade the driver as follows:

  8. Upload the files corresponding to the VMware version to the tmp folder on the OS using tools such as SSH.
  9. Run the cd tmp/vmware5.5 command to go to the folder containing the driver.
  10. Run the sh install_driver.sh command.
  11. Select 1 to perform a full upgrade.
  12. Verify that the driver version is 6.606.06.00.
  13. After the upgrade, restart the OS for the upgrade to take effect.

Experience

None

Note

None

VMware EVC Feature Enabling Error

Problem Description
Table 5-340 Basic information

Item

Information

Source of the Problem

RH1288 V3

Intended Product

V3 servers

Release Date

2015-08-27

Keyword

EVC, ESXi

Symptom

Hardware configuration: RH1288 V3

OS: VMware

Symptom:

When the EVC feature is enabled on the ESXi host, the following error is reported: "The CPU hardware of the host should support the current enhanced vMotion compatibility mode for the cluster, but the host is now lacking some of the necessary CPU functions. Check the BIOS configuration of the host to ensure that required functions (such as Intel's XD, VT, AES or PCLMULQDQ, or AMD's NX) are not disabled. For more information, see knowledge base articles 1003212 and 1034926."

Key Process and Cause Analysis

When the EVC feature is enabled on VMware, the Monitor/Mwait feature must be enabled on the BIOS; otherwise, an error is reported.

Conclusion and Solution

Enable the Monitor/Mwait feature on the BIOS. The path is: Advanced > Intel RC Group > Processor Configuration, as shown in Figure 5-452.

Figure 5-452 Enabling the Monitor/Mwait feature

Experience

None

Note

None

SLES 10 SP4 32-Bit Failed to Start After Installation Due to Excessive Memory Capacity

Problem Description
Table 5-341 Basic information

Item

Information

Source of the Problem

RH2288H V2

Intended Product

Server

Release Date

2014-05-16

Keyword

SuSE Linux Enterprise Server (SLES) 10 SP4 32-bit, 48 GB memory capacity

Symptom

Figure 5-453 shows the hardware configuration of the RH2288H V2. During the OS installation, a black screen is displayed before automatic restart, and the subsequent configuration operations cannot be performed to complete the OS installation, as shown in Figure 5-454.

Figure 5-453 Hardware configuration of the RH2288H V2
Figure 5-454 Black screen
Key Process and Cause Analysis

SLES 10 SP4 32-bit theoretically supports a maximum of 64 GB of memory, but in actual use it supports a maximum of only 48 GB. For details, visit:

http://www.novell.com/coolsolutions/tip/16262.html

Conclusion and Solution

Ultimate solution: Remove some DIMMs.

Temporary solution: During the boot process, add mem=XXG (a value less than 48 GB) after the boot option. After entering the OS, add mem=XXG (less than 48 GB) to the end of the kernel line in the /boot/grub/menu.lst file so that the modification takes effect permanently. (An illustrative kernel line follows.)
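An illustrative menu.lst kernel line with the memory cap applied is shown below; the kernel image name and root device are placeholders for the actual entry:

    # /boot/grub/menu.lst (illustrative; limit usable memory to 40 GB)
    kernel /boot/vmlinuz-2.6.16.60-0.85.1-bigsmp root=/dev/sda2 splash=silent mem=40G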

Experience

None.

Note

None.

Garbled Characters Exist on the RHEL CLI

Problem Description
Table 5-342 Basic information

Item

Information

Source of the Problem

RH2288H V2

Intended Product

RH22XXH V2 and RH1288 V2

Release Date

2014-09-01

Keyword

Garbled characters

Symptom

Hardware configuration: RH2288H V2

  • Symptom 1:

    On an RH2288H V2 that runs RHEL 6.5, when a text file is opened on the command-line interface (CLI), garbled characters are displayed during paging up and down.

  • Symptom 2:

    When a user presses Ctrl+Alt+F2 and Ctrl+Alt+F3 to switch between different CLIs, the active CLI overlaps with the information of the previous CLI.

Key Process and Cause Analysis

Root cause analysis: The resolution of the text screen is low. Configure a higher resolution to resolve the problem: add vga=0x317 to the end of the kernel line in the /boot/grub/menu.lst (or grub.conf) file, and then restart the system, as shown in Figure 5-455.

Figure 5-455 Modifying grub.conf

Conclusion and Solution

Conclusion: The resolution of the text screen is low. Configure a higher resolution to resolve the problem.

Solution: Add vga=0x317 to the end of the kernel line in the /boot/grub/menu.lst (or grub.conf) file, and then restart the system. (An illustrative kernel line follows.)
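An illustrative kernel line with the resolution parameter applied is shown below; the kernel version and root device are placeholders for the actual entry:

    # /boot/grub/menu.lst or grub.conf (illustrative; 0x317 = 1024 x 768, 16-bit color)
    kernel /vmlinuz-2.6.32-431.el6.x86_64 ro root=/dev/mapper/vg_root rhgb quiet vga=0x317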

Experience

None.

Note

None.

Memory Allocation Failure (Page Allocation Failure) in Heavy Load

Problem Description
Table 5-343 Basic information

Item

Information

Source of the Problem

CH121

Intended Product

Huawei servers

Release Date

2015-04-15

Keyword

Page allocation failure

Symptom

Hardware configuration: CH121 + 16 x 8 GB memory + 2 x E5-2620

Software version:

  • MM910: 2.20
  • Compute node BIOS: V378
  • Compute node BMC: 5.11

Operating system (OS): SUSE 11 SP1 (64-bit)

Symptom:

Two servers running database services were down, and the service IP address could not be pinged.

Key Process and Cause Analysis

Key process:

The OS log file /var/log/message contains a large number of page allocation failure errors, as shown in Figure 5-456.

Figure 5-456 Page allocation failure

The preceding logs indicate that a memory request (order: 0) failed in the system, that is, a 4 KB page request failed. This means that the system could not allocate any memory at the time; no free memory was available. When this happens, the system may be suspended.

Root cause analysis:

The page allocation failure leads to OS breakdown.

Conclusion and Solution

Conclusion:

Memory allocation failures under heavy database service load lead to OS breakdown or process suspension.

Solution:

Increase the value of /proc/sys/vm/min_free_kbytes. For the specific value, you are advised to contact SUSE support for help.
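A minimal sketch of checking and adjusting the value; the 131072 KB (128 MB) figure is illustrative only and should be confirmed with SUSE support:

    # Check the current reserve (in KB)
    cat /proc/sys/vm/min_free_kbytes
    # Increase it at runtime (illustrative value: 128 MB)
    sysctl -w vm.min_free_kbytes=131072
    # Persist the setting across reboots
    echo "vm.min_free_kbytes = 131072" >> /etc/sysctl.conf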

Experience

None.

Note

According to SUSE engineers, the /proc/sys/vm/min_free_kbytes parameter specifies the minimum memory reserved for the system. If it is too small, the system may break down because no memory can be allocated for key operations. If it is too large, the system reclaims memory too early, leaving a large amount of memory idle and unusable. In general, this parameter should be configured according to the memory capacity (more than 128 KB, usually between 1 MB and 128 MB), at a maximum of 2% of the system physical memory.

Linux GUI Reports "Could not update ICEauthority file /var/lib/gdm/.ICEauthority"

Problem Description
Table 5-344 Basic information

Item

Information

Source of the Problem

RH2288H V2

Intended Product

Tecal servers

Release Date

2015-04-22

Keyword

gdm ICEauthority

Symptom

Hardware configuration:

RH2288H V2 server

Symptom:

During the startup, the RHEL 6.4 GUI reports "Could not update ICEauthority file /var/lib/gdm/.ICEauthority", and the login fails.

Key Process and Cause Analysis

Root cause analysis:

The permissions of the /var/lib/gdm/.ICEauthority file were incorrectly modified, as shown in Figure 5-457.

Figure 5-457 After modification

Figure 5-458 shows the permissions of /var/lib/gdm/ before the incorrect modification:

Figure 5-458 Before modification

Conclusion and Solution

Solution:

Log in to the OS from the command-line interface (CLI) and run the chown gdm:gdm /var/lib/gdm command. (A short sketch follows.)
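A short sketch of the repair and verification; the recursive form is an assumption for cases where files inside the directory (see Figure 5-459) were also modified:

    # Restore ownership of the gdm state directory and its contents
    chown -R gdm:gdm /var/lib/gdm
    # Verify the ownership
    ls -ld /var/lib/gdm /var/lib/gdm/.ICEauthority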

Experience

None.

Note
  • Figure 5-459 shows the permissions of all files in the /var/lib/gdm directory.
    Figure 5-459 Permissions of all files in the /var/lib/gdm directory

Linux OS Prints "kernel:ERST:Error Record Serialization Table (ERST) support is initialized" During Startup

Problem Description
Table 5-345 Basic information

Item

Information

Source of the Problem

RH2288H V2

Intended Product

Tecal servers

Release Date

2015-04-22

Keyword

ERST, Error Record Serialization Table

Symptom

During the startup of Linux operating system, the log prints "kernel: ERST: Error Record Serialization Table (ERST) support is initialized."

Key Process and Cause Analysis

Root cause analysis:

The log information only indicates that the Error Record Serialization Table (ERST) module is initialized, and it is not fault information. The module is provided by the ACPI Platform Error Interface (APEI) and is used to detect and save hardware error records.

Conclusion and Solution

Solution:

Ignore the information.

Experience

None

Note

For details about the description on the official RHEL website, see ERST: Error Record Serialization Table (ERST) support is initialized.

Figure 5-461 Description on the official RHEL website

Failed to Start the HAL Service on an RH8100 Running RHEL 6

Problem Description
Table 5-346 Basic information

Item

Information

Source of the Problem

RH8100 V3

Intended Product

All servers

Release Date

2015-10-21

Keyword

HAL daemon FAILED

Symptom

Hardware configuration:

An RH8100 V3 server running RHEL 6.4

Symptom:

The server restarts after a period of running. The "Starting HAL daemon: FAILED" error is reported during OS loading, indicating that the HAL service process fails to start, as shown in the following figure.

Figure 5-462 Error message of HAL service start failure

Key Process and Cause Analysis

Root cause analysis:

After online research, it is confirmed that this is a known problem in RHEL 5 and RHEL 6. You can resolve it by updating the HAL service installation package and setting an appropriate value for the timeout parameter of the haldaemon process. You can refer to haldaemon fails to start on system with a large number of disks in RHEL 5 and RHEL 6.

The symptom is described on the official Red Hat website, as shown in Figure 5-463:

Figure 5-463 Symptom description on the Red Hat website

On the official Red Hat website, the fault locating procedure and the root cause are shown in Figure 5-464:

Figure 5-464 Root cause and diagnosis procedure provided on the official Red Hat website

Hald is a hardware abstraction layer daemon provided by Linux that implements auto-mount functionality for hard drive partitions and other peripherals. By default, it sets a 250-second timeout to wait for all child processes to mount all the devices. However, a timeout error occurs when all devices fail to be mounted within 250 seconds. This problem often occurs in systems configured with a large number of hard drives.

After manually starting the haldaemon service, calculate the difference between the start timestamp and the exit timestamp printed in the OS log to measure how long the hald child processes take to mount all LUNs, and determine whether the root cause is as described above.
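A minimal sketch of that measurement, assuming a standard RHEL 6 init script and the default syslog path; the commands are illustrative additions:

    # Time the manual start directly
    time service haldaemon start
    # Or compare the first and last hald timestamps recorded in the OS log
    grep -i hald /var/log/messages | head -1
    grep -i hald /var/log/messages | tail -1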

Conclusion and Solution

Conclusion:

This is a known issue and you can resolve it by updating the hal service process installation package and changing the value of the timeout parameter.

Solution:

The Red Hat official resolution is shown in Figure 5-465:

Figure 5-465 Red Hat official resolution
Experience

None.

Note

For the Wikipedia description of HAL, see HAL (software).

Traditionally, the operating system kernel was responsible for providing an abstract interface to the hardware the system ran on. Applications used the system call interface, or performed file I/O on device nodes in order to communicate with hardware through these abstractions. This sufficed for the simple hardware of early desktop computing.

Computer hardware, however, has increased in complexity, and the abstractions provided by Unix kernels have not kept pace with the proliferating number of device and peripheral types now common on both server and desktop computers. Most modern buses have also become hotplug-capable and can have non-trivial topologies. As a result, devices are discovered or change state in ways which can be difficult to track through the system call interface or Unix IPC. The complexity of doing so forces application authors to re-implement hardware support logic.

In other words, HAL is a technology that solves the problem that the original OS kernel/driver system-call interface can no longer manage peripherals as their number and types increase on servers and computers, and it adds hot-swap support.

Implementations and obsolescence

On Linux, HAL uses /sys (a virtual file system for Linux systems) to discover hardware and listen for kernel hotplug events. Some Linux distributions also provide a udev rule to allow the udev daemon to notify HAL whenever new device nodes appear.

In some of the latest Linux distributions, such as Ubuntu, Debian, and Fedora, HAL functionality is incorporated into the udev daemon and related kernel modules.

Deprecated

As of 2011, Linux distributions such as Ubuntu, Debian, and Fedora, and projects such as KDE, GNOME, and X.org were in the process of deprecating HAL as it had "become a large monolithic unmaintainable mess". The process is largely complete: Debian squeeze (February 2011) and Ubuntu 10.04 removed HAL from the basic system and boot process, though some use of HAL remains.

In Linux, it is in the process of being merged into udev (main udev, libudev, and udev-extras) and existing udev and kernel functionality. No specific replacement for non-Linux systems has been identified.

Initially a new daemon DeviceKit was planned to replace certain aspects of HAL, but in March 2009, DeviceKit was deprecated in favor of adding the same code to udev as a package: udev-extras, and some functions have now moved to udev proper.

Abnormal Restart of RHEL 6.5

Problem Description
Table 5-347 Basic information

Source of the Problem: RH2288H V2

Intended Product: All servers

Release Date: 2015-10-22

Keyword: Qdisk cycle took more than X seconds to complete, abnormal restart

Symptom

Hardware configuration:

RH2288H V2 server running Red Hat Enterprise Linux (RHEL) 6.5

Symptom:

Two RH2288H V2 servers at a customer site were occasionally powered off after running for a period of time. On one server, hardware alarms for the CD-ROM drive, memory, and mainboard were generated, and the alarms were cleared after hardware replacement. On the other server, no software or hardware abnormality could be found in the BMC logs or OS logs. The BMC log showed that the power-off was initiated from the host side, but the specific cause was unknown, so the customer was advised to contact Red Hat for analysis.

More than a month later, the problem occurred again. The customer stated that they had not purchased Red Hat maintenance services and that the cluster had been built by a software integrator, who gave no effective recommendations after analyzing the logs. On the OS side, serial port redirection and Kdump were configured, and the OS print level was raised to the highest level. After about a month of running, the system was abnormally powered off again. This time, the BMC SEL log of one of the RH2288H V2 servers recorded a power-off operation triggered through the network channel by the IPMI protocol, after which the Fence cluster management software isolated the server from the cluster.

The oplog operation log contained the power-off commands issued from the IP address of the corresponding cluster control server:

May 15 04:57:36 iMana bmcipmi.out[358]: Operation,root(::ffff:10.2.11.105),Host,Set chassis control(power down) (sessionid=00, sessiontime=05-15 04:57:36) success,EvtCode:0x20100020
May 15 04:57:38 iMana bmcipmi.out[358]: Operation,root(::ffff:10.2.11.105),Host,Set chassis control(power down) (sessionid=00, sessiontime=05-15 04:57:38) success,EvtCode:0x20100020
May 15 04:57:40 iMana bmcipmi.out[358]: Operation,root(::ffff:10.2.11.105),Host,Set chassis control(power down) (sessionid=00, sessiontime=05-15 04:57:40) success,EvtCode:0x20100020

However, there is still no valid information in the OS log. In the OS log, the following print information is displayed for several times of abnormal restart including this time:

Aug  8 04:40:02 S65 rgmanager[205923]: [lvm] Getting status
Aug  8 04:40:02 S65 rgmanager[205961]: [lvm] Getting status
Aug  8 04:40:03 S65 rgmanager[206049]: [lvm] Getting status
Aug  8 05:06:03 S65 kernel: imklog 5.8.10, log source = /proc/kmsg started.
Aug  8 05:06:03 S65 rsyslogd: [origin software="rsyslogd" swVersion="5.8.10" x-pid="3307" x-info="http://www.rsyslog.com"] start

The messages show that rgmanager was repeatedly querying the logical volume status right before the restart; therefore, the abnormal restart might be related to the hard drives. However, after analysis of a full set of RAID controller card logs, no link error or reset event was found.

No problem occurred for a few months after the BMC and BIOS were upgraded, but the abnormal restart then recurred on the two servers. The OS print information on the two servers was as follows:

  1. Before the restart of S65 at 01:00 on September 14, the following alarm information related to the qdiskd process (Cluster Quorum Disk Daemon; for details, see http://linux.die.net/man/8/qdiskd) was displayed in the log:
    Sep 14 00:02:25 S65 qdiskd[3478]: qdisk cycle took more than 1 second to complete (1.190000)
    Sep 14 00:12:56 S65 qdiskd[3478]: qdisk cycle took more than 1 second to complete (1.080000)
    Sep 14 01:00:49 S65 rgmanager[179305]: [lvm] Getting status
    Sep 14 01:00:50 S65 rgmanager[179417]: [lvm] Getting status
    Sep 14 01:00:50 S65 rgmanager[179457]: [lvm] Getting status
    Sep 14 01:00:50 S65 rgmanager[179629]: [lvm] Getting status
    Sep 14 01:00:51 S65 rgmanager[179717]: [lvm] Getting status
    Sep 14 01:00:59 S65 rgmanager[180144]: [lvm] Getting status
    Sep 14 01:00:59 S65 rgmanager[180309]: [lvm] Getting status
    Sep 14 01:01:00 S65 rgmanager[180528]: [lvm] Getting status
    Sep 14 01:01:00 S65 rgmanager[180611]: [lvm] Getting status
    Sep 14 01:01:00 S65 rgmanager[180693]: [lvm] Getting status
    Sep 14 01:01:00 S65 rgmanager[180731]: [lvm] Getting status
    Sep 14 01:26:19 S65 kernel: imklog 5.8.10, log source = /proc/kmsg started. //The first log of normal OS loading after restart, same hereinafter
  2. Similar information was displayed in the log for the abnormal restart of S66 at 15:45 on September 10:
    Sep  9 23:04:55 S66 qdiskd[3583]: qdisk cycle took more than 1 second to complete (1.040000)
    Sep  9 23:28:16 S66 qdiskd[3583]: qdisk cycle took more than 1 second to complete (1.010000)
    Sep 10 00:32:41 S66 qdiskd[3583]: qdisk cycle took more than 1 second to complete (1.010000)
    Sep 10 03:07:01 S66 rhsmd: In order for Subscription Manager to provide your system with updates, your system must be registered with the Customer Portal. Please enter your Red Hat login to ensure your system is up-to-date.
    Sep 10 06:40:36 S66 qdiskd[3583]: qdisk cycle took more than 1 second to complete (1.010000)
    Sep 10 11:56:32 S66 qdiskd[3583]: qdisk cycle took more than 1 second to complete (1.040000)
    Sep 10 15:45:45 S66 kernel: imklog 5.8.10, log source = /proc/kmsg started.
  3. Similar information was displayed in the log for the abnormal restart of S66 at around 01:33 on September 14:
    Sep 13 23:24:48 S66 qdiskd[3575]: qdisk cycle took more than 1 second to complete (1.500000)
    Sep 13 23:25:43 S66 qdiskd[3575]: qdisk cycle took more than 1 second to complete (1.010000)
    Sep 13 23:29:02 S66 qdiskd[3575]: qdisk cycle took more than 1 second to complete (1.450000)
    Sep 13 23:29:56 S66 qdiskd[3575]: qdisk cycle took more than 1 second to complete (1.500000)
    Sep 13 23:30:20 S66 qdiskd[3575]: qdisk cycle took more than 1 second to complete (1.500000)
    Sep 13 23:31:53 S66 qdiskd[3575]: qdisk cycle took more than 1 second to complete (1.020000)
    Sep 13 23:32:24 S66 qdiskd[3575]: qdisk cycle took more than 1 second to complete (1.000000)
    Sep 14 00:11:04 S66 qdiskd[3575]: qdisk cycle took more than 1 second to complete (1.250000)
    Sep 14 00:21:34 S66 qdiskd[3575]: qdisk cycle took more than 1 second to complete (1.010000)
    Sep 14 01:33:48 S66 kernel: imklog 5.8.10, log source = /proc/kmsg started.
Key Process and Cause Analysis

Key process:

Huawei engineers found a case on the official Red Hat website: https://access.redhat.com/solutions/19495#. Because a Red Hat account is required to read the full content, the screenshot is shown in Figure 5-466.

Figure 5-466 Problem

Problem diagnosis procedure is shown in Figure 5-467.

Figure 5-467 Diagnosis procedure

Root cause analysis:

Figure 5-468 Root Cause Analysis

In this case, the qdiskd cluster quorum disk service measures the duration of each I/O cycle. If the duration exceeds the value of the interval parameter, qdiskd reports the above information. If paranoid="1" or io_timeout="1" is configured for the quorum disk, and an I/O cycle exceeds tko × interval, the cluster receives a command from qdiskd to power off the host (an illustrative configuration excerpt follows the list below). Two solutions are available:

  1. Check whether the duration (for example, 1.040000) reported in the log exceeds tko × interval. If it does, increase the value of the interval parameter so that the alarm and the power-off flow are no longer triggered.
  2. Check whether other faults cause the I/O cycle to time out, and rectify them. The RAID controller card logs were collected on site, and it was confirmed again that there is no hardware failure on the RAID controller card or hard drives at the customer site. At present, the customer has contacted the software integrator to adjust the software configuration.
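For orientation, the parameters discussed above live on the quorumd element in /etc/cluster/cluster.conf; the excerpt below is an illustrative sketch, not the configuration from this case:

    <!-- Illustrative excerpt from /etc/cluster/cluster.conf -->
    <quorumd label="qdisk" interval="2" tko="10" votes="1">
        <heuristic program="ping -c1 10.2.11.1" score="1" interval="2" tko="3"/>
    </quorumd>
    <!-- With interval="2" and tko="10", qdiskd tolerates an I/O cycle of up to
         2 x 10 = 20 seconds before the power-off flow is triggered -->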
Conclusion and Solution

Conclusion:

A known problem of the Red Hat operating system. It can be solved through the cluster parameter.

Solution:

The Red Hat official resolution is shown in Figure 5-469:

Figure 5-469 Resolution

1. Set the values of tko and interval according to the methods recorded in the resolution, as shown in Figure 5-470.

For the settings of tko and interval, refer to the following links:

https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_Linux/5/html/Cluster_Administration/s1-start-clustertool-CA.html

https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_Linux/5/html/Cluster_Administration/s1-start-luci-ricci-conga-CA.html

Figure 5-470 Creating a new cluster

2. Collect a full set of the logs of the RAID controller card to determine whether there are potential hardware faults that lead to slow I/O.

Experience
  1. If the BMC SEL log of the server shows that the cause of the restart is as described in Figure 5-471,
    Figure 5-471 Restart cause

    and at the same time, the oplog operation log contains the information about the power-off command issued by the IP address of the corresponding cluster control server:

    May 15 04:57:36 iMana bmcipmi.out[358]: Operation,root(::ffff:10.2.11.105),Host,Set chassis control(power down) (sessionid=00, sessiontime=05-15 04:57:36) success,EvtCode:0x20100020

    it is preliminarily determined that the power-off failure is triggered by the cluster software.

  2. Prompt the customer to deploy serial port redirection, Kdump, and the highest OS log print level before services go live; this plays a decisive role in identifying and locating hardware and software faults.
Note

None.

A Server Running SUSE 11 SP2&SP3 Breaks Down

Problem Description
Table 5-348 Basic information

Source of the Problem: RH5885 V3

Intended Product: Servers on which SUSE 11 SP2&SP3 are running

Release Date: 2015-10-26

Keyword: SUSE11SP2&SP3, rcu_bh_state

Symptom

On an RH5885 V3 server where SUSE 11 SP2 or SUSE 11 SP3 is installed, a black screen and system breakdown occur during operation. No hardware alarm is reported, and the server recovers after a restart.

Key Process and Cause Analysis

The "rcu_bh_state detected stall on CPU" error is found in the message log of the OS. The screenshot is shown in Figure 5-472.

Figure 5-472 Message log

According to the SUSE kernel website, this problem is caused by a stale "jiffies" timestamp left on one CPU core while a multi-core CPU is processing services, which makes the kernel conclude that processing has crashed. The symptom is not common; it occurs only occasionally when the same process is scheduled across CPU cores. The patch resolves the problem by recording the valid lifetime of the timestamp and comparing it with the time of the core switch, confirming that the data is correct and informing the kernel that processing is running properly, so that the kernel does not trigger the crash mechanism.

http://kernel.suse.com/cgit/kernel-source/commit/?id=98873d8d7315851d03f9e6424744c1fcb259c069

This is a kernel bug. Novell has provided a repair patch; the problem number is bnc#834204. Both SLES 11 SP2 and SP3 have this problem. For details, visit:

  • SP2: https://download.suse.com/Download?buildid=KSc16YTnEso~
  • SP3: https://download.suse.com/Download?buildid=5mdF-RmZMJ4~
Conclusion and Solution

Conclusion:

A bug in the OS kernel leads to system breakdown.

Solution:

Upgrade the OS kernel. The kernel download address is as follows:

  • SP2: https://download.suse.com/Download?buildid=KSc16YTnEso~
  • SP3: https://download.suse.com/Download?buildid=5mdF-RmZMJ4~
Experience

The key information for identifying this kernel bug is "rcu_bh_state detected stall on CPU".

To resolve OS breakdown problems, you are advised to deploy kdump and serial port redirection in advance so that information can be collected for analysis at the time of the breakdown, because the system cannot record exception logs after a kernel hang occurs.

Note

Configure kdump for SUSE 11 (user root).

  • Configure the parameters.
    vi /etc/sysconfig/kdump

    Modify the parameter configuration as follows:

    KDUMP_IMMEDIATE_REBOOT="yes"    //Whether to restart immediately 
    KDUMP_SAVEDIR="file:///var/crash"          //Directory for storing dump files 
    KDUMP_COPY_KERNEL="yes"             //Whether to copy the kernel when dump files are generated 
    KDUMP_KEEP_OLD_DUMPS="2"      //Maximum number of dump files to be saved  
    KDUMP_DUMPFORMAT="compressed"    //Dump format  
    KDUMP_DUMPLEVEL="31"     //Log level     
  • Configure the GRUB.
    • Change Description

      Configure the GRUB by adding the following content to a configuration file. During the boot process, grub passes the parameter to the standard kernel. The parameter indicates the memory space to be reserved for the crash kernel. Modify the following options:

      crashkernel=512M
    • File to be Modified
      /boot/grub/menu.lst
    • Example
      root (hd0,0) 
      kernel /boot/vmlinuz-2.6.32.45-0.3-default 
      root=/dev/disk/by-id/scsi-3600508e0000000008fdb0976c18e7c01-part1  
      resume=/dev/disk/by-id/scsi-3600508e0000000008fdb0976c18e7c01-part2 splash=silent  
      crashkernel=512M showopts vga=0x317 console=tty0 console=ttyS0,115200 
      initrd /boot/initrd-2.6.32.45-0.3-default     

Start the kdump service.

  • Restarting the kdump service can generate an initrd-(sys)-kdump file.
    linux:~ # rm /boot/initrd-2.6.32.12-0.7-default-kdump  
    linux:~ # rckdump restart 
    Unloading kdump                                                     done 
    Loading kdump 
    Regenerating kdump initrd ...                                       done 
    linux:~ # ll /boot/initrd-2.6.32.12-0.7-default-kdump  
    -rw------- 1 root root 16556311 Nov 18 11:52 /boot/initrd-2.6.32.12-0.7-default-kdump 
    linux:~ # reboot     
  • Restart the OS for the final configurations to take effect.

    Run the reboot command to restart the OS.

  • Verify whether the configuration is successful (the operation will lead to an OS restart).

    Run the echo c > /proc/sysrq-trigger command to force the system to crash. The system will restart the kdump kernel and enter the kdump process. Check whether a vmcore file in /var/crash is generated.

This verification will result in a system restart. Ensure that no service is running before the operation.

The default directory for storing the vmcore file is /var/crash/%HOST-%DATE/, which stores two files: vmcore and vmcore-dmesg.txt. You need to copy both files.
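As a quick illustrative check (the host/date directory name varies by system):

    # Verify that the dump was written; both files are required for analysis
    ls -lh /var/crash/*/
    # Expected files: vmcore and vmcore-dmesg.txt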

Configure the serial port redirection.

The SUSE Linux serial port redirection involves the modification of the following files:

  • /boot/grub/menu.lst
  • /etc/inittab
  • /etc/securetty
  • Configure the GRUB.

    Change Description

Configure the GRUB to use the serial port. Comment out the graphical menu configuration items (color and gfxmenu in the example below), and add the configuration items serial and terminal.

    serial --unit=0 --speed=115200 
    terminal --timeout=15 serial console

    File to be Modified

    /boot/grub/menu.lst

    Example

    # Modified by YaST2. Last modification on Wed Aug 29 02:37:33 2007 
    #color white/blue black/light-gray
    default 0 
    timeout 8 
    #gfxmenu (hd0,1)/boot/message
    serial --unit=0 --speed=115200 
    terminal --timeout=15 serial console     
NOTE:

In the example above, the lines beginning with # are commented out, and the serial and terminal lines are newly added.

  • Configure the kernel.

    Change Description

    Configure the kernel in the GRUB boot menu, and add the following items:

    console=ttyS0,115200  console=tty0

    File to be Modified

    /boot/grub/menu.lst

    Example

    ###Don't change this comment - YaST2 identifier: Original name: linux### 
    title Linux 
             kernel (hd0,1)/boot/vmlinuz root=/dev/sda2 selinux=0 resume=/dev/sda1 splash=silent  
    elevator=cfq showopts console=ttyS0,115200 console=tty0
             initrd (hd0,1)/boot/initrd     
NOTE:

The content in bold indicates the content that is added.

  • Configure the inittab.

    Change Description

    Configure the inittab so that you can log in through the serial port, and add the following line:

S0:12345:respawn:/sbin/agetty -L 115200 ttyS0 ansi

    File to be Modified

    /etc/inittab

    Example

    # getty-programs for the normal runlevels 
    #<id>:<runlevels>:<action>:<process> 
    # The "id" field  MUST be the same as the last 
    # characters of the device (after "tty"). 
    1:2345:respawn:/sbin/mingetty --noclear tty1 
    2:2345:respawn:/sbin/mingetty tty2 
    3:2345:respawn:/sbin/mingetty tty3 
    4:2345:respawn:/sbin/mingetty tty4 
    5:2345:respawn:/sbin/mingetty tty5 
    6:2345:respawn:/sbin/mingetty tty6 
    # 
    #S0:12345:respawn:/sbin/agetty -L 9600 ttyS0 vt102 
    S0:12345:respawn:/sbin/agetty -L 115200 ttyS0 ansi
NOTE:

In the example above, the last S0 line is newly added; it replaces the commented-out S0 line with a new baud rate and terminal type.

  • Configure the securetty.

    Change Description

    Configure the port as a secure port so that you can log in as user root. Add the following line to the /etc/securetty file:

    ttyS0

    File to be Modified

    /etc/securetty

    Example

    tty6 
    ttyS0
    # for devfs: 
    vc/1 
    
    
NOTE:

In the example above, the ttyS0 line is newly added.

CentOS 7.0 or RHEL 7.0 Restarts Automatically Due to an OS Kernel Bug

Problem Description
Table 5-349 Basic information

Source of the Problem: RH5885 V3

Intended Product: Universal servers

Release Date: 2015-12-26

Keyword: CentOS 7.0, Red Hat Enterprise Linux (RHEL) 7.0, kernel bug

Symptom

Hardware configuration: RH5885 V3 server

Software configuration: CentOS 7.0 or RHEL 7.0 with kernel 3.10.0-123.el7.x86_64

Symptom: The server restarts automatically but no fault information is recorded in BMC logs.

Key Process and Cause Analysis
  1. According to BMC logs, the OS on the RH5885 V3 server restarted automatically on October 22 and 29, 2015. However, the BMC did not report hardware fault alarms.

  2. Vmcore analysis of the OS dump logs showed that kernel bug information was recorded in the logs on October 22 and 29, 2015. This problem has been explained on the Red Hat website: it is caused by an OS kernel bug in kernel version 3.10.0-123.el7.x86_64.

Solution: Upgrade the version to kernel-3.10.0-123.20.1.el7.x86_64.

A case was found on the official Red Hat website: https://access.redhat.com/solutions/1364873. You can view the full content only after logging in to the website with a Red Hat account.

Conclusion and Solution

Upgrade the kernel to kernel-3.10.0-123.20.1.el7.x86_64.
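A minimal sketch of the check and upgrade, assuming the fixed kernel package is available in the configured yum repositories:

    # Confirm whether the affected kernel is running
    uname -r                                   # 3.10.0-123.el7.x86_64 is affected
    # Install the fixed kernel and reboot into it
    yum install kernel-3.10.0-123.20.1.el7.x86_64
    reboot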

Experience

None

Note

None

Java Processes DataNode and NodeManager Did Not Respond on RHEL 6.4

Problem Description
Table 5-350 Basic information

Source of the Problem: Red Hat 6.4, CentOS

Intended Product: Haswell CPUs and PowerPC CPUs

Release Date: 2016-01-05

Keyword: Hadoop JVM Eden Space

Symptom

Java processes such as DataNode and NodeManager did not respond.

# jmap -heap 41017

Attaching to process ID 41017, please wait...

Concurrent Mark-Sweep GC

Eden Space:

capacity = 429522944 (409.625MB)

used = 429522944 (409.625MB)

free = 0 (0.0MB)

100.0% used

# dmesg -c

# echo l > /proc/sysrq-trigger

# dmesg

The following stack trace is displayed:

/proc/1234/task/12345=[<ffffffff810b226a>] futex_wait_queue_me+0xba/0xf0 
[<ffffffff810b33a0>] futex_wait+0x1c0/0x310 
[<ffffffff810b4c91>] do_futex+0x121/0xae0 
[<ffffffff810b56cb>] sys_futex+0x7b/0x170 
[<ffffffff8100b072>] system_call_fastpath+0x16/0x1b 
[<ffffffffffffffff>] 0xffffffffffffffff     
Key Process and Cause Analysis

Method 1: Optimize the JVM configuration.

No hardware error was found based on the customer feedback. The Hadoop JVM Eden space was full, resulting in abnormally long ParNew collector pauses.

The JVM memory application process is as follows:

  1. The JVM attempts to initialize a memory area for the new Java object in Eden.
  2. If the Eden space is sufficient, the memory application ends. Otherwise, the process enters the next step.
  3. The JVM tries to release all inactive objects in Eden (a level-1 or higher garbage collection); if the Eden space is still insufficient for new objects after the release, the JVM tries to move some of the active objects in Eden into the Survivor or OLD area.
  4. The Survivor area is used as an intermediate exchange area between Eden and OLD. If the OLD area is sufficient, objects in the Survivor area are moved to the OLD area; otherwise, they are retained in the Survivor area.
  5. If the OLD area is insufficient, the JVM performs a full garbage collection (level 0) in the OLD area.
  6. After the full garbage collection, if the Survivor and OLD areas still cannot store some of the objects copied from Eden and the JVM cannot create a memory area for the new object in Eden, an "out of memory" error occurs.

The log information fed back by the customer confirms that the above process occurred.

Eden Space:

capacity = 429522944 (409.625MB)

used = 429522944 (409.625MB)

free = 0 (0.0MB)

100.0% used

2015-12-29T18:34:52.972+0800: 21956.396: [GC2015-12-29T18:34:52.972+0800: 21956.396: [ParNew: 446031K->25181K(471872K), 8799.4720130 secs] 482319K->61493K(1995584K), 8799.4722900 secs] [Times: user=315058.59 sys=1155.15, real=8798.13 secs]

2015-12-29T21:01:33.370+0800: 30756.794: [GC2015-12-29T21:01:33.370+0800: 30756.794: [ParNew: 444637K->14777K(471872K), 0.0064570 secs] 480949K->51225K(1995584K), 0.0066160 secs] [Times: user=0.14 sys=0.01, real=0.01 secs]

A number of JVM tuning approaches can mitigate this problem. Advise the customer to verify the optimization in a non-production environment first and then implement it, and coordinate with Java experts from the big data development department to provide optimization advice.

The customer's current heap configuration is as follows:

MinHeapFreeRatio = 40

MaxHeapFreeRatio = 70

MaxHeapSize = 4194304000 (4000.0MB)

NewSize = 536870912 (512.0MB)

MaxNewSize = 536870912 (512.0MB)

OldSize = 5439488 (5.1875MB)

NewRatio = 2

SurvivorRatio = 8 //SurvivorRatio: the ratio of the Eden space to each Survivor space in the young generation (8 means Eden:Survivor = 8:1)

PermSize = 134217728 (128.0MB)

MaxPermSize = 268435456 (256.0MB)

G1HeapRegionSize = 0 (0.0MB)
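For reference, the following hypothetical launch command would produce roughly the heap layout shown above; the flag values mirror the configuration, and the class name is a placeholder:

    # Hypothetical command line matching the configuration above (CMS collector,
    # 4000 MB heap, 512 MB young generation, SurvivorRatio 8, 128/256 MB permgen)
    java -Xmx4000m -XX:NewSize=512m -XX:MaxNewSize=512m \
         -XX:SurvivorRatio=8 -XX:PermSize=128m -XX:MaxPermSize=256m \
         -XX:+UseConcMarkSweepGC -XX:+PrintGCDetails -XX:+PrintGCDateStamps \
         MainClass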

Method 2: Upgrade the kernel to a patched version.

Softlockup with pThreads, Mutexes on Haswell CPUs and PowerPC CPUs (but may not be limited to just these)

Conclusion and Solution

None

Experience

None

Note

URL: https://access.redhat.com/solutions/1386323

Abstract: Upgrading to RHEL 6.6, 7.0, or 7.1 may result in an application that uses futexes appearing to stall in futex_wait(). Specifically, upgrading to Red Hat Enterprise Linux 6.6 kernels 2.6.32-504 up to and including 2.6.32-504.12.2 may result in an application hang.

URL: https://bugs.centos.org/view.php?id=8371

Abstract: 0008371: futex waiter counter causing hangs in 2.6.32-504.8.1.el6.x86_64 on Haswell. kernel-2.6.32-504.16.2.el6, just released, should have the fix.

Keyboard and Mouse Do Not Respond During Server OS Running

Problem Description
Table 5-351 Basic information

Source of the Problem: RH2485 V2/E9000

Intended Product: All servers

Release Date: 2016-01-07

Keyword: Red Hat, haldaemon

Symptom

A black screen occurred while a server running Red Hat Enterprise Linux 6.4 was in operation; the local keyboard and mouse did not respond, and the BMC KVM could not be used. The OS could be accessed through SSH, and the services were running properly.

Key Process and Cause Analysis
  1. Log in to the system through SSH. The system is running properly, the usage of CPU, memory, hard drive, NIC, and other resources is normal, and no service is abnormal.
  2. Restart the server. The keyboard and mouse work in both the POST phase and the OS GRUB phase, so a hardware fault is excluded.
  3. The OS loading process is very slow: it takes more than ten minutes to enter the OS, during which the haldaemon service fails to load. The service is described as follows:

    The description shows that a haldaemon service exception affects the use of USB removable devices.

  4. After the OS is accessed, the haldaemon service cannot be started manually.

  5. Handle the haldaemon startup failure based on the knowledge base article on the official Red Hat website. However, the haldaemon service still fails to start.
    Log in to graphical environment fails due to haldaemon is not running on RHEL 6.

    https://access.redhat.com/solutions/1585843

  6. Based on OS logs, it is found that many storage drives are mounted on the server and hundreds of drive letters are assigned. A similar case exists on the official Red Hat website: haldaemon fails to start on system with a large number of drives in RHEL 5 and RHEL 6.

    https://access.redhat.com/solutions/27571

  7. A test is performed, and the problem is resolved after the haldaemon service timeout interval is changed to 4800s.
Conclusion and Solution

Conclusion:

So many storage drives are mounted that a timeout error occurs when the haldaemon service scans devices (default timeout interval: 250s). As a result, the haldaemon service fails to start, which affects the automatic mounting of removable devices.

Solution:

Modify the timeout interval of the haldaemon service, as outlined in the sketch below.
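A rough outline of the change; the file and the exact location of the timeout are assumptions here, so consult the Red Hat solution linked above for the authoritative steps:

    # Hypothetical outline; see https://access.redhat.com/solutions/27571 for
    # the exact edit. The timeout location below is an assumption.
    vi /etc/rc.d/init.d/haldaemon      # init script that starts hald on RHEL 6
    # Raise the startup timeout from the default 250 seconds to 4800 seconds,
    # save the file, and restart the service:
    service haldaemon restart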

Experience

None

Note

None

Eight-Hour Time Difference During Linux Startup

Problem Description
Table 5-352 Basic information

Source of the Problem: RH2485 V2/E9000

Intended Product: All servers

Release Date: 2016-01-07

Keyword: NTP, time difference

Symptom

An eight-hour time difference occurred during Linux startup; however, the iBMC time was consistent with the local time.

Figure 5-473 OS log segments
Key Process and Cause Analysis

Root Cause:

NTP time synchronization takes place during OS startup. After synchronizing with the NTP server, the OS time is shifted forward by 28800.668190 seconds (28800 seconds = 8 hours).

Figure 5-474 Time offset of OS logs

Conclusion and Solution

Solution:

Check the NTP server time (especially the time zone) to ensure time synchronization in the network environment.

Experience

An eight-hour time difference on Linux is generally introduced by inconsistent time zone configurations. You can use the hwclock command (for example, hwclock --show, hwclock --systohc, and hwclock --localtime) to compare and adjust the system time and the hardware clock, and thereby confirm whether the time difference comes from the OS or from the hardware clock, as in the check sequence below.
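A minimal check sequence, assuming a RHEL-style system with the ntp package installed:

    # Compare the OS time, hardware clock, and configured time zone
    date
    hwclock --show
    cat /etc/sysconfig/clock           # ZONE / UTC settings on RHEL-style systems
    # List the NTP servers the OS synchronizes with and their current offsets
    ntpq -p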

Note

None
