The Server Does Not Reset When the Watchdog Times Out

Publication Date: 2015-06-19
Issue Description
Hardware configuration:
An RH2285

Software configuration:
SUSE11.1

Symptom:

After the operating system (OS) on the RH2285 starts, the ipmitool raw 0x06 0x24 0x44 0x01 0x00 0x10 0x70 0x17 command is run to initialize the watchdog, and the ipmitool raw 0x06 0x22 command is run every 2 minutes to reset the watchdog timer. If the OS stops responding, the reset command is no longer executed, and the hardware watchdog is expected to trigger a server reset. In practice, the server does not reset when the watchdog times out.
Handling Process
1.  Check whether the BMC has generated a watchdog timeout log. The log exists, as shown in Figure 1.

Figure 1 Watchdog timeout log



In the log, timer use:SMS/OS indicates that the watchdog timer that expired is the one used by the OS. The log severity and the Timer expired text in the description show that no action was executed on timeout and that the BMC did not generate an alarm. As a result, the server did not reset.

2.  The server starts properly after a manual reset. The watchdog status then shows that the watchdog timer is running on the OS and that the timeout action is Hard Reset, as shown in Figure 2.

Figure 2 Watchdog status



3.  To find out why the watchdog triggered no action, run the history | grep "ipmitool" command on Linux to check whether any watchdog command was run manually. The history contains a command that stops the watchdog, as shown in Figure 3.

Figure 3 Watchdog operation history



4.  Figure 4 shows the watchdog status after the stop command is run: the watchdog timer is set for OS use but is stopped, the timeout action is No action, and Initial Countdown is 300 seconds. The system continues to run the ipmitool raw 0x06 0x22 command every 2 minutes. After that command is executed, however, Watchdog Timer Actions is still No action and Initial Countdown is still 300 seconds, as shown in Figure 5. As a result, the hardware watchdog performs no action when the OS stops responding, and the server does not reset. Only the watchdog timeout log shown in Figure 1 is generated.

Figure 4 Stopped watchdog



Figure 5 Resetting the watchdog timer after the watchdog is stopped

Root Cause
The watchdog was manually stopped while the OS was running, which set the timeout action to No action. The periodic ipmitool raw 0x06 0x22 command re-enables the watchdog and resets Present Countdown, but it does not change the timeout action, which remains No action. As a result, the server does not reset when the watchdog times out.
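
The root cause can be illustrated with a small model. The following Python sketch is a toy state machine, not BMC firmware: the class WatchdogModel and its method names are hypothetical, and the effect of the stop operation (timeout action set to No action, Initial Countdown set to 300 seconds) is simply taken from the status described for Figure 4. It only shows that resetting the timer reloads the countdown and leaves the timeout action unchanged.

```python
from dataclasses import dataclass


@dataclass
class WatchdogModel:
    """Toy model of the watchdog state (hypothetical, for illustration only)."""
    running: bool = False
    action: str = "No action"
    initial_countdown_s: int = 0
    present_countdown_s: int = 0

    def set_timer(self, action: str, countdown_s: int) -> None:
        # Models "ipmitool raw 0x06 0x24 ...": stores the action and countdown.
        self.action = action
        self.initial_countdown_s = countdown_s
        self.present_countdown_s = countdown_s

    def stop(self) -> None:
        # Models the manual stop, using the values reported for Figure 4.
        self.running = False
        self.set_timer("No action", 300)

    def reset_timer(self) -> None:
        # Models "ipmitool raw 0x06 0x22": starts the timer and reloads the
        # countdown. The timeout action is deliberately left untouched.
        self.running = True
        self.present_countdown_s = self.initial_countdown_s

    def expire(self) -> str:
        # Returns the action taken when the countdown reaches zero.
        self.running = False
        return self.action


# Normal case: initialize, reset periodically, then the OS hangs.
wd = WatchdogModel()
wd.set_timer("Hard Reset", 600)
wd.reset_timer()
normal_outcome = wd.expire()

# Faulty case: the watchdog is stopped manually between two periodic resets.
wd.set_timer("Hard Reset", 600)
wd.reset_timer()
wd.stop()
wd.reset_timer()
faulty_outcome = wd.expire()

print(normal_outcome, "/", faulty_outcome)
```

In this model the first timeout ends in Hard Reset, while the timeout after the manual stop ends in No action even though the timer was reset and running again.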

Solution
If the watchdog has been disabled while the OS is running, run the ipmitool raw 0x06 0x24 0x44 0x01 0x00 0x10 0x70 0x17 command again to reinitialize the watchdog, and then run the ipmitool raw 0x06 0x22 command to activate the watchdog and reset the watchdog timer.

Suggestions
1.  A watchdog timeout is recorded in one of the following logs:
  • Log shown in Figure 1 (the timeout action is No action)
  • Log shown in Figure 6 (the timeout action is Hard Reset)
Figure 6 Watchdog timeout alarms



The log shown in Figure 6 indicates that the watchdog running on the OS timed out with the timeout action set to Hard Reset. The server reset, and a major alarm log was generated. The alarm was generated at 18:21:56 and cleared at 18:21:57.

2.  The bytes in the ipmitool raw 0x06 0x24 0x44 0x01 0x00 0x10 0x70 0x17 command are described as follows:
  • 0x06: indicates the network function code.
  • 0x24: indicates the command code (Set Watchdog Timer).
  • 0x44: indicates SMS/OS as the timer use. 0x42 indicates BIOS/POST. 0x43 indicates OS Load.
  • 0x01: indicates Hard Reset as the timeout action. 0x00 indicates No action. 0x02 indicates Power Down. 0x03 indicates Power Cycle.
  • 0x00: indicates the pre-timeout interval, in seconds.
  • 0x10: clears the SMS/OS timer use expiration flag.
  • 0x70 0x17: indicates the initial countdown value, low byte first. The hexadecimal number 0x1770 equals the decimal number 6000. The unit is 100 ms, so the initial countdown is 6000 x 100 ms = 600 s.
3.  The ipmitool raw 0x06 0x22 command is described as follows:

The ipmitool raw 0x06 0x22 command provides the same function as the Reset watchdog and ipmitool mc watchdog reset commands: it enables the watchdog and resets the present countdown timer without changing the value of Watchdog Timer Actions. For example, if the countdown timer is initialized to 600 seconds, the present countdown is reset to 600 seconds.
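
The byte layout described in suggestion 2 can be checked with a short decoder. The following Python sketch is illustrative only: the function name decode_set_watchdog and the lookup tables are hypothetical, and it assumes that the timer use and the timeout action are carried in the low three bits of their bytes, which matches the values listed above (0x44, 0x42, 0x43 and 0x00 to 0x03).

```python
# Values taken from the command description above.
TIMER_USE = {0x02: "BIOS/POST", 0x03: "OS Load", 0x04: "SMS/OS"}
TIMEOUT_ACTION = {
    0x00: "No action",
    0x01: "Hard Reset",
    0x02: "Power Down",
    0x03: "Power Cycle",
}


def decode_set_watchdog(data):
    """Decode the six data bytes that follow 0x06 0x24 in the raw command."""
    timer_use, action, pre_timeout_s, flags_clear, lsb, msb = data
    ticks = (msb << 8) | lsb  # countdown is sent low byte first
    return {
        # Assumption: the low three bits select the timer use and the action.
        "timer_use": TIMER_USE.get(timer_use & 0x07, "unknown"),
        "action": TIMEOUT_ACTION.get(action & 0x07, "unknown"),
        "pre_timeout_s": pre_timeout_s,
        "countdown_ticks": ticks,
        "countdown_s": ticks / 10,  # one tick is 100 ms
    }


decoded = decode_set_watchdog([0x44, 0x01, 0x00, 0x10, 0x70, 0x17])
print(decoded)
```

For the command used in this case, the decoder reports SMS/OS, Hard Reset, 6000 ticks, and a countdown of 600 seconds, which agrees with the manual calculation.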

Preventive Measure:

Before each reset of the watchdog timer, check whether Watchdog Timer Actions is set to Hard Reset. If it is not, have the software automatically run the ipmitool raw 0x06 0x24 0x44 0x01 0x00 0x10 0x70 0x17 command to reinitialize the watchdog. Then run the ipmitool raw 0x06 0x22 command as usual every 2 minutes.
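
One possible way to automate this check is sketched below in Python. It is a minimal sketch under stated assumptions, not a tested production script: the function names are hypothetical, the status is read with ipmitool mc watchdog get, and the parser assumes that the output contains a line starting with "Watchdog Timer Actions", as in the status field named in this case. The exact output format may differ between ipmitool versions and should be verified on the target server.

```python
import subprocess

SET_CMD = ["ipmitool", "raw", "0x06", "0x24",
           "0x44", "0x01", "0x00", "0x10", "0x70", "0x17"]
RESET_CMD = ["ipmitool", "raw", "0x06", "0x22"]
GET_CMD = ["ipmitool", "mc", "watchdog", "get"]


def run(cmd):
    """Run a command and return its standard output; raises on failure."""
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return result.stdout


def action_is_hard_reset(status_text):
    """Return True if the reported timeout action is Hard Reset.

    Assumption: the status contains a line such as
    'Watchdog Timer Actions: Hard Reset (0x01)'.
    """
    for line in status_text.splitlines():
        if line.strip().startswith("Watchdog Timer Actions"):
            return "Hard Reset" in line
    return False


def pet_watchdog(runner=run):
    """Perform one periodic cycle and return the commands that were issued."""
    issued = []
    if not action_is_hard_reset(runner(GET_CMD)):
        # The action was changed (for example by a manual stop): reinitialize.
        runner(SET_CMD)
        issued.append(SET_CMD)
    runner(RESET_CMD)
    issued.append(RESET_CMD)
    return issued
```

Calling pet_watchdog() once every 2 minutes, for example from a scheduler or a simple loop, replaces the plain reset command. The runner parameter exists only so that the logic can be exercised without a BMC.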
