Common Troubleshooting
Illegal Stack Access
- Compile code to simulate illegal stack access.
    #include <linux/kernel.h>
    #include <linux/module.h>
    #include <linux/string.h>

    static int __init a_init(void)
    {
        char msg[10] = {0};

        printk("a init\n");
        /* Deliberate bug: the string is far longer than msg[10], so
         * strcpy() writes past the buffer and corrupts the kernel stack. */
        strcpy(msg, "this modules is testing module, aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa");
        return 0;
    }

    static void __exit a_exit(void)
    {
        printk("a exit\n");
    }

    module_init(a_init);
    module_exit(a_exit);
    MODULE_LICENSE("GPL");
- Perform the following diagnostic steps:
Run the bt command to display the stack trace of the current context. The output shows that the stack of the affected process cannot be parsed.
Use the RSP value (ffff8801fddabf20) to locate the stack memory, then examine the stack contents to find where the faulty code left its mark.
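One way to spot the overwritten region: kernel stack words are normally kernel addresses or small integers, whereas the sample module above floods the stack with the letter a (byte 0x61). The following user-space sketch is not part of any debugging tool. It scans 64-bit words copied from a stack dump and reports the first word made up solely of printable ASCII. The function names are illustrative assumptions, and any values passed to it would have to be taken from a real dump.

```c
#include <stddef.h>
#include <stdint.h>

/* Return 1 if all 8 bytes of the word are printable ASCII (0x20..0x7e). */
static int is_text_word(uint64_t word)
{
    for (int i = 0; i < 8; i++) {
        unsigned char c = (unsigned char)(word >> (8 * i));
        if (c < 0x20 || c > 0x7e)
            return 0;
    }
    return 1;
}

/* Return the index of the first text-like word in a stack dump,
 * or -1 if every word looks like ordinary binary data. */
long first_text_word(const uint64_t *words, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        if (is_text_word(words[i]))
            return (long)i;
    }
    return -1;
}
```

A word such as 0x6161616161616161 is flagged as text, while a typical kernel address is not, because its upper bytes are 0xff. The index returned marks where the string data begins, which is a reasonable starting point for finding the frame that was overrun.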
Illegal Access to Page Tables
- Compile code to simulate illegal access to page tables.
    #include <linux/kernel.h>
    #include <linux/module.h>
    #include <linux/kprobes.h>
    #include <linux/kallsyms.h>
    #include <linux/syscalls.h>
    #include <linux/slab.h>
    #include <linux/kdebug.h>
    #include <asm/apic.h>
    #include <asm/pgalloc.h>

    static int __init jprobe_init(void)
    {
        unsigned long address;
        pgd_t *pgd;
        pud_t *pud;
        pmd_t *pmd;

        printk(KERN_INFO "zk--- in \n");
        address = (unsigned long)kmalloc(512, GFP_KERNEL);

        /* Walk the page tables down to the PMD entry that maps address. */
        pgd = pgd_offset(current->active_mm, address);
        pud = pud_offset(pgd, address);
        pmd = pmd_offset(pud, address);

        /* Deliberate bug: zero the PMD entry so the mapping is destroyed. */
        memset(pmd, 0, 16);

        /* Read the address through the now-broken mapping. */
        printk("test: %x \n", *(int *)address);
        return 0;
    }

    static void __exit jprobe_exit(void)
    {
        printk(KERN_INFO "zk--- out \n");
    }

    module_init(jprobe_init);
    module_exit(jprobe_exit);
    MODULE_LICENSE("GPL");
- Perform the following diagnostic steps:
Analyze the faulting stack and find that the crash is caused by an address error.
Analyze the instruction at RIP: ffffffff8108af68 and find that it accesses the address (RAX+0x10).
Analyze the page table that maps this address and find that its PMD entry has been incorrectly set to 0.
Determine the suspected module based on the content of the illegally accessed memory or on the logs generated around the time of the crash.
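To check a PMD entry by hand, it helps to know which slot of each table a virtual address selects. The sketch below assumes x86-64 with 4-level paging and 4 KiB pages, where each level is indexed by 9 bits of the address (bits 47:39, 38:30, 29:21 and 20:12). The struct and function names are hypothetical and are not kernel APIs.

```c
#include <stdint.h>

/* Table indexes selected by one virtual address
 * (assumes x86-64, 4-level paging, 4 KiB pages). */
struct pt_index {
    unsigned pgd;     /* bits 47:39 */
    unsigned pud;     /* bits 38:30 */
    unsigned pmd;     /* bits 29:21 */
    unsigned pte;     /* bits 20:12 */
    unsigned offset;  /* bits 11:0, byte offset inside the page */
};

struct pt_index split_vaddr(uint64_t vaddr)
{
    struct pt_index ix;

    ix.pgd    = (unsigned)((vaddr >> 39) & 0x1ff);
    ix.pud    = (unsigned)((vaddr >> 30) & 0x1ff);
    ix.pmd    = (unsigned)((vaddr >> 21) & 0x1ff);
    ix.pte    = (unsigned)((vaddr >> 12) & 0x1ff);
    ix.offset = (unsigned)(vaddr & 0xfff);
    return ix;
}
```

Taking 0xffff8801fddabf20 purely as a sample value, split_vaddr() yields PGD index 272, PUD index 7, PMD index 494, PTE index 427 and page offset 0xf20. Multiplying an index by 8 gives the byte offset of the entry inside its table, which is the location to read when checking whether the entry is zero.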
Overwriting of Static Variables
Analysis Method
Run the sym -l command to list kernel symbols. Use the symbol list to determine the address of each overwritten static variable, then look at the static variables located at neighboring addresses. Most likely one of those neighbors overflowed and overwrote the variables next to it.
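The neighbor lookup can be sketched as a search over an address-sorted symbol table. The code below is a user-space illustration only: the ksym struct and the function name are hypothetical, and the table would have to be filled from the real symbol list. It relies on the assumption that an overflow usually runs from a lower address toward higher ones, so the symbol that starts at or just below the corrupted address is the first suspect.

```c
#include <stddef.h>
#include <stdint.h>

/* One entry of an address-sorted symbol table (hypothetical layout). */
struct ksym {
    uint64_t addr;
    const char *name;
};

/* Return the index of the last symbol whose address is <= target,
 * or -1 if target lies below every symbol.
 * The table must be sorted by ascending address. */
long find_preceding_symbol(const struct ksym *tab, size_t n, uint64_t target)
{
    size_t lo = 0, hi = n;  /* search window is [lo, hi) */

    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;

        if (tab[mid].addr <= target)
            lo = mid + 1;
        else
            hi = mid;
    }
    /* lo now counts the symbols at or below target. */
    return (long)lo - 1;
}
```

The entry returned is the variable that contains the corrupted address. The entry before it in the table is its lower neighbor, which is worth checking for an array or buffer that could have overrun.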
Illegal Memory Access in DMA
- Symptom
There is no regular stack trace, and the code area is filled with the repeating byte pattern fe 0b ad ca.
    [ 3051.054204] Call Trace:
    [ 3051.057035] [<ffffffff8115faa9>] path_openat+0xd9/0x420
    [ 3051.063054] [<ffffffff8115ff2c>] do_filp_open+0x4c/0xc0
    [ 3051.069103] [<ffffffff81150b71>] do_sys_open+0x171/0x1f0
    [ 3051.075223] [<ffffffff8144fc53>] ia32_do_call+0x13/0x13
    [ 3051.081261] Code: fe 0b ad ca fe 0b ad ca fe 0b ad ca fe 0b ad ca fe 0b ad ca fe 0b ad ca fe 0b ad ca fe 0b ad ca fe 0b ad ca fe 0b ad ca fe 0b ad <ca> fe 0b ad ca fe 0b ad ca fe 0b ad ca fe 0b ad ca fe 0b ad ca
- Diagnosis
The kernel code area is read-only and cannot be modified through a linear address, so a physical address must have been used to modify it. Both the direct memory access (DMA) controller and the basic input/output system (BIOS) can operate on physical memory. Because the content written to the target address is unrelated to the DMA controller, the BIOS is the suspected cause.
Analysis of the BIOS code reveals an error in a driver. Because of this error, the BIOS is overwritten during DMA and then illegally accesses kernel memory while it is running.
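The symptom itself, a code area made up of one short repeating pattern, is easy to test for mechanically. The following user-space sketch checks whether a buffer of dumped code bytes repeats with a given period; the function name is a hypothetical choice, and the bytes would have to be copied from the Code line of the real trace.

```c
#include <stddef.h>
#include <string.h>

/* Return 1 if buf is made up entirely of repetitions of its first
 * `period` bytes, 0 otherwise. At least two repetitions are required. */
int is_periodic(const unsigned char *buf, size_t len, size_t period)
{
    if (period == 0 || len < 2 * period)
        return 0;
    /* buf[i] == buf[i + period] for every valid i means the
     * buffer repeats with the given period. */
    return memcmp(buf, buf + period, len - period) == 0;
}
```

Genuine machine code does not repeat every 4 bytes, so a positive result for period 4 on the dumped bytes supports the conclusion that the code area was overwritten with a fill pattern rather than executed incorrectly.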
Deadlock
- Diagnosis
The key to diagnosing a deadlock is the stack of each CPU at the moment the deadlock occurs. Run the bt -a command to print the stacks of all CPUs in the system. By parsing the data structure of the lock, you can find out which thread holds it.
- Common types of deadlock
- A non-atomic operation, such as sleeping or scheduling, is executed inside a spinlock-protected section.
- The logic inside a spinlock-protected section takes too long to execute.
- AB-BA deadlock
- AA deadlock
- Ring lock, in which modules in a complex architecture wait for each other in a loop.
- Mismatched lock and unlock calls, resulting in changes to the lock pointer.
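Once the stacks and lock structures show which thread holds each lock and which lock each thread is waiting for, the AA, AB-BA and ring cases can all be confirmed the same way: follow the chain of "waits for lock, lock held by thread" until it ends or loops back. The sketch below is a user-space illustration with a hypothetical array layout. Threads and locks are represented by small indexes that you would assign yourself from the dump, and -1 means "none".

```c
#include <stddef.h>

/* lock_owner[l] : thread that holds lock l, or -1 if the lock is free
 * waits_for[t]  : lock that thread t is blocked on, or -1 if it is running
 * Every non-negative value must be a valid index into the other array.
 *
 * Return 1 if following the wait chain from `start` leads back to
 * `start`, which means the thread is part of a deadlock cycle. */
int in_deadlock_cycle(const int *lock_owner, const int *waits_for,
                      int n_threads, int start)
{
    int thread = start;

    /* A cycle through `start` has at most n_threads links. */
    for (int step = 0; step < n_threads; step++) {
        int lock = waits_for[thread];

        if (lock < 0)
            return 0;           /* chain ends at a running thread */
        thread = lock_owner[lock];
        if (thread < 0)
            return 0;           /* chain ends at a free lock */
        if (thread == start)
            return 1;           /* back at the start: deadlock */
    }
    return 0;
}
```

An AA deadlock shows up as a cycle of length one (a thread waiting for a lock it already holds), an AB-BA deadlock as a cycle of length two, and a ring lock as a longer cycle.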