The System Breaks Down After the SUSE11SP1 OS Is Continuously Running for More than 208 Days

Publication Date: 2015-06-19
Issue Description
Hardware configuration:
RH2285

Software configuration:
SUSE11SP1 64-bit

Symptom:

Servers running the SUSE11SP1 operating system (OS) crash or restart after the OS has been running continuously for more than 208 days. Messages similar to the following appear in the dmesg output or in the /var/log/messages file.

------------[ cut here ]------------
WARNING: at /usr/src/packages/BUILD/kernel-default-2.6.32.29/linux-2.6.32/kernel/sched.c:3847 update_cpu_power+0x151/0x160()
[...]
Call Trace:
[<ffffffff810061dc>] dump_trace+0x6c/0x2d0
[<ffffffff813974e8>] dump_stack+0x69/0x71
[<ffffffff8104d754>] warn_slowpath_common+0x74/0xd0
[<ffffffff8103d6e1>] update_cpu_power+0x151/0x160
[<ffffffff8103e323>] find_busiest_group+0xa83/0xce0
[<ffffffff8104604d>] load_balance_newidle+0xcd/0x380
[<ffffffff813982db>] thread_return+0x2a7/0x34c
[<ffffffff813992fd>] do_nanosleep+0x8d/0xc0
[<ffffffff81068628>] hrtimer_nanosleep+0xa8/0x140
[<ffffffff81068730>] sys_nanosleep+0x70/0x80
[<ffffffff81002f7b>] system_call_fastpath+0x16/0x1b
[<00007f77d8469da0>] 0x7f77d8469da0
---[ end trace 63f382152a7c7034 ]---


Alternatively, information similar to the following is displayed.

PID: 24290  TASK: ffff880064340140  CPU: 0   COMMAND: "blkback.5.hda"
#0 [ffff880064b19910] crash_kexec at ffffffff80071e20
#1 [ffff880064b199e0] oops_end at ffffffff80353958
#2 [ffff880064b19a00] do_divide_error at ffffffff8000886e
#3 [ffff880064b19aa0] divide_error at ffffffff80007c05
#4 [ffff880064b19b28] find_busiest_group at ffffffff800300f4
#5 [ffff880064b19cb8] load_balance_newidle at ffffffff80036cda
#6 [ffff880064b19d38] thread_return at ffffffff803500c1
#7 [ffff880064b19dc8] dm_table_unplug_all at ffffffffa0424fec
#8 [ffff880064b19e48] blkif_schedule at ffffffffa0537734
#9 [ffff880064b19ee8] kthread at ffffffff80056816
#10 [ffff880064b19f48] kernel_thread at ffffffff80007f0a
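
To check whether a server has already logged this symptom, the system log can be searched for the signatures shown above. The following is a minimal sketch: the function name has_symptom is hypothetical, the pattern "update_cpu_power" is taken from the WARNING trace above, and the pattern "divide error" is an assumption about how the kernel logs the second kind of failure.

```shell
#!/bin/sh
# has_symptom FILE...: succeed if any readable file contains a symptom signature.
has_symptom() {
    # "update_cpu_power" appears in the WARNING trace shown above.
    # "divide error" is an assumed log string for the divide_error oops.
    grep -E -q 'update_cpu_power|divide error' "$@" 2>/dev/null
}

if has_symptom /var/log/messages; then
    echo "symptom signature found in /var/log/messages"
else
    echo "no symptom signature found in /var/log/messages"
fi
```

A match only indicates that the log deserves a closer look; compare the full call trace with the examples above before drawing a conclusion.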
Handling Process
After the SUSE11SP1 OS has been running continuously for more than 208 days, a divide-by-zero (DIVIDED_BY_ZERO) bug in the kernel can be triggered at a random time. SUSE describes the bug at the following link:

http://www.novell.com/support/kb/doc.php?id=7009834

The bug affects only server hosts that meet all of the following conditions:
  1. The CPUs are provided by Intel.
  2. The CPU flags in /proc/cpuinfo include constant_tsc and nonstop_tsc.
  3. Neither the dmesg output nor /var/log/boot.msg contains the message "Marking TSC unstable".
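
The three conditions above can be checked with a short script. This is a sketch under stated assumptions: the function names are hypothetical, and it assumes that an Intel CPU is reported as "GenuineIntel" in the vendor_id field of /proc/cpuinfo.

```shell
#!/bin/sh
# is_intel [FILE]: succeed if the cpuinfo file reports an Intel CPU.
is_intel() {
    grep -q 'GenuineIntel' "${1:-/proc/cpuinfo}" 2>/dev/null
}

# has_cpu_flag FLAG [FILE]: succeed if FLAG is a whole word on a "flags" line.
has_cpu_flag() {
    grep -E '^flags' "${2:-/proc/cpuinfo}" 2>/dev/null | grep -q -w "$1"
}

# tsc_marked_unstable: succeed if the boot log or the kernel ring buffer
# contains the "Marking TSC unstable" message.
tsc_marked_unstable() {
    grep -q 'Marking TSC unstable' /var/log/boot.msg 2>/dev/null && return 0
    dmesg 2>/dev/null | grep -q 'Marking TSC unstable'
}

if is_intel && has_cpu_flag constant_tsc && has_cpu_flag nonstop_tsc \
        && ! tsc_marked_unstable; then
    echo "host meets all three conditions"
else
    echo "host does not meet all three conditions"
fi
```

The kernel ring buffer may no longer hold boot-time messages on a long-running host, which is why the sketch also reads /var/log/boot.msg.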
Root Cause
A divide-by-zero (DIVIDED_BY_ZERO) bug in the SUSE11SP1 kernel can be triggered at a random time once the OS has been running continuously for more than 208 days. When the bug is triggered, the system crashes or restarts.
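
This article does not explain where the 208-day figure comes from. One commonly cited explanation, which is an assumption here and not a statement from this article, is that a nanosecond counter used by the kernel scheduler overflows at 2^54 ns. The arithmetic below only shows that this value corresponds to about 208 days.

```shell
#!/bin/sh
# Assumption (not stated in this article): the counter overflows at 2^54 ns.
overflow_ns=$(( 1 << 54 ))

# Convert nanoseconds to seconds, then to whole days (integer division).
overflow_days=$(( overflow_ns / 1000000000 / 86400 ))

echo "2^54 ns is about ${overflow_days} days"
```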

Solution
Workaround:

Manually restart the OS before its continuous running time reaches 208 days.

Run the uptime command to check how long the OS has been running continuously. In the command output, note the number before "days".

# uptime
23:48:44 up 3 days, 23:48,  1 user,  load average: 0.02, 0.05, 0.00
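
The check can also be scripted so that it warns before the limit is reached. The following is a minimal sketch: it reads the uptime in seconds from /proc/uptime, and the function names and the 14-day safety margin are hypothetical values chosen for illustration.

```shell
#!/bin/sh
# uptime_days [FILE]: print the whole days of uptime, taken from the first
# field (seconds since boot) of /proc/uptime.
uptime_days() {
    awk '{ print int($1 / 86400) }' "${1:-/proc/uptime}"
}

LIMIT_DAYS=208   # limit described in this article
MARGIN_DAYS=14   # hypothetical safety margin

# needs_restart DAYS: succeed if DAYS is within the margin of the limit.
needs_restart() {
    [ "$1" -ge $(( LIMIT_DAYS - MARGIN_DAYS )) ]
}

days=$(uptime_days 2>/dev/null)
if [ -n "$days" ] && needs_restart "$days"; then
    echo "uptime is ${days} days: schedule a restart before day ${LIMIT_DAYS}"
else
    echo "uptime is ${days:-unknown} days"
fi
```

Running such a script from a daily cron job is one way to avoid missing the restart window.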


Solution:

Upgrade the SUSE11SP1 kernel to version 2.6.32.59-0.7.1. Choose the default or Xen kernel package according to the kernel that the server actually runs.

The following uses 2.6.32.59-0.7.1-default as an example to describe how to upgrade the kernel:

1.  Call the SUSE hotline (4008106500) to obtain the kernel .rpm packages, and upload the packages to the server whose kernel is to be upgraded.

2.  Run the following command to check whether the upgrade package can be installed:

# rpm -ivh --test --force kernel-default-2.6.32.59-0.7.1.x86_64.rpm kernel-default-base-2.6.32.59-0.7.1.x86_64.rpm

3.  If no error message is displayed in step 2, run the following command to install the package:

# rpm -ivh --force kernel-default-2.6.32.59-0.7.1.x86_64.rpm kernel-default-base-2.6.32.59-0.7.1.x86_64.rpm

4.  Check that the default startup entry in the /boot/grub/menu.lst file points to the new kernel.

# cat /boot/grub/menu.lst
# Modified by YaST2. Last modification on Tue Dec 11 13:44:59 EST 2012
default 0   (0 selects the first title entry below as the default startup kernel.)
timeout 8
##YaST - generic_mbr
gfxmenu (hd0,0)/boot/message
##YaST - activate

###Don't change this comment - YaST2 identifier: Original name: linux###
title SUSE Linux Enterprise Server 11 SP1 - 2.6.32.59-0.7 (default)
root (hd0,0)
kernel /boot/vmlinuz-2.6.32.59-0.7-default root=/dev/disk/by-id/scsi-36286ed494c1a7000184757f207d309cc-part1 resume=/dev/disk/by-id/scsi-36286ed494c1a700003f742e20b1b0ea1-part2 splash=silent crashkernel=256M-:128M showopts vga=0x317
initrd /boot/initrd-2.6.32.59-0.7-default

###Don't change this comment - YaST2 identifier: Original name: failsafe###
title Failsafe -- SUSE Linux Enterprise Server 11 SP1 - 2.6.32.59-0.7
root (hd0,0)
kernel /boot/vmlinuz-2.6.32.59-0.7-default root=/dev/disk/by-id/scsi-36286ed494c1a7000184757f207d309cc-part1 showopts ide=nodma apm=off noresume edd=off powersaved=off nohz=off highres=off processor.max_cstate=1 nomodeset x11failsafe vga=0x317
initrd /boot/initrd-2.6.32.59-0.7-default


5.  Restart the system to make the new kernel take effect.

6.  Run the following command to check that the kernel version is the target version:

# uname -a
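
Steps 4 and 6 can be combined into one verification script. This is a sketch, not a definitive procedure: the function names are hypothetical, and the release string 2.6.32.59-0.7-default is an assumption inferred from the vmlinuz file name in the menu.lst example above.

```shell
#!/bin/sh
# Assumed release string, inferred from the vmlinuz file name shown above.
TARGET="2.6.32.59-0.7-default"

# default_kernel_line [FILE]: print the "kernel" line of the title entry
# selected by the "default N" setting (entries are counted from 0).
default_kernel_line() {
    awk '
        $1 == "default" { want = $2 }
        $1 == "title"   { idx++ }
        $1 == "kernel" && idx - 1 == want { print; exit }
    ' "${1:-/boot/grub/menu.lst}"
}

# kernel_matches TEXT EXPECTED: succeed if TEXT contains EXPECTED.
kernel_matches() {
    case "$1" in
        *"$2"*) return 0 ;;
        *)      return 1 ;;
    esac
}

line=$(default_kernel_line 2>/dev/null)
if kernel_matches "$line" "vmlinuz-$TARGET"; then
    echo "menu.lst default entry boots $TARGET"
else
    echo "menu.lst default entry does not boot $TARGET"
fi

if kernel_matches "$(uname -r)" "$TARGET"; then
    echo "running kernel is $TARGET"
else
    echo "running kernel is $(uname -r), not $TARGET"
fi
```

The first check is useful before the restart in step 5, and the second check after it.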
Suggestions
Note:
This case applies only to Huawei R1 series servers (with the SUSE11SP1 standard OS installed).
