Linux Driver Debugging: Modifying the System Timer Interrupt to Locate a Hang - 土豆和地瓜 - ChinaUnix Blog

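I once ran into a bug like this: while talking to a development board over a serial console, I ran a program and the whole system hung. It stopped accepting input and the only way out was a reboot.
It turned out to be stuck in a piece of code, and tracking it down took a lot of effort. Today I'll go through how to debug this kind of system hang.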

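First, a few words about the Linux timer interrupt.
The Linux timer interrupt is a hardware interrupt: a counter produces output pulses that are sent to the CPU and trigger the interrupt. What makes it special is that it keeps track of system time,
firing once every fixed interval. It is like a real clock that ticks once per second to record time, and our sense of time is based on that tick.
Likewise, all timekeeping in the kernel is based on the timer interrupt, and one interrupt can be thought of as one unit of time. It is the kernel's heartbeat: if it stops, the kernel is certainly dead.
The system uses the timer interrupt to maintain system time and to drive context switches and process scheduling.

Linux uses HZ to denote the number of timer interrupts generated per second, and jiffies to denote how many timer interrupts have occurred since the system booted.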
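
As a rough illustration of how HZ and a tick count relate, here is a small user-space sketch. The helper names ticks_to_seconds and seconds_to_ticks are hypothetical (they are not kernel APIs), and the tick rate is passed in as a parameter because the real HZ depends on the kernel configuration.

```c
/* User-space sketch only: helper names and the hz parameter are
 * assumptions for illustration, not kernel interfaces. */

/* Whole seconds represented by `ticks` timer interrupts at `hz` ticks per second. */
unsigned long ticks_to_seconds(unsigned long ticks, unsigned long hz)
{
    return ticks / hz;
}

/* Number of timer interrupts that occur in `seconds` seconds. */
unsigned long seconds_to_ticks(unsigned long seconds, unsigned long hz)
{
    return seconds * hz;
}
```

For example, if HZ were 200, 10 seconds would correspond to 2000 ticks; this is the same arithmetic as the 10*HZ threshold that appears in the kernel patch in this post.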

Now to the main topic.

First, write a test program that hangs the system.

system_dead.c

#include <linux/module.h>
#include <linux/kernel.h>
#include <linux/fs.h>
#include <linux/init.h>
#include <linux/delay.h>
#include <asm/uaccess.h>
#include <asm/irq.h>
#include <asm/io.h>
#include <linux/device.h>

static struct class *sysdead_class;
static struct device *sysdead_class_dev;
static int major;

/* open() handler: deliberately never returns, to simulate a hang */
static int sysdead_test_open(struct inode *inode, struct file *file)
{
    int i = 0;
    int j = 0;
    int k = 0;

    while (1) {
        i = i + 1;
        j = i + 1;
        k = j + 1;
        if (i > 100)
            i = 0;
        if (j > 100)
            j = 0;
        if (k > 100)
            k = 0;
    }
    //printk("sysdead_test_open success!\n");
    return 0;
}

static struct file_operations sysdead_test_fops = {
    .owner = THIS_MODULE,
    .open  = sysdead_test_open,
};

static int sysdead_drv_init(void)
{
    /* major 0 asks the kernel to allocate a major number */
    major = register_chrdev(0, "sysdead_test", &sysdead_test_fops);
    sysdead_class = class_create(THIS_MODULE, "sysdead_test");
    /* creates the /dev/sysdead node */
    sysdead_class_dev = device_create(sysdead_class, NULL, MKDEV(major, 0), NULL, "sysdead");
    printk("sysdead_drv_init success!\n");
    return 0;
}

static void sysdead_drv_exit(void)
{
    device_destroy(sysdead_class, MKDEV(major, 0));
    class_destroy(sysdead_class);
    unregister_chrdev(major, "sysdead_test");
}

module_init(sysdead_drv_init);
module_exit(sysdead_drv_exit);
MODULE_LICENSE("GPL");

test.c

#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>   /* close() */

int main(int argc, char **argv)
{
    int fd;

    /* open() ends up in sysdead_test_open and never returns */
    fd = open("/dev/sysdead", O_RDWR);
    if (fd < 0)
    {
        printf("can't open!\n");
        return -1;
    }
    close(fd);
    return 0;
}

# insmod system_dead.ko
sysdead_drv_init success!
# ./test

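The program hangs and the console stops responding to input, because we put an infinite loop in sysdead_test_open: when the application calls open,
execution enters the loop and never comes out.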

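The idea behind this debugging method: if the kernel has hung, we assume it is stuck inside some piece of code and that this is the only process running.
So if we see one process run continuously for more than 10s, we conclude it is stuck in an infinite loop and print its pid (which may be of little use, since
the system has to be rebooted after the hang and the pid will differ on the next run) and the PC value, then use the PC value to locate the code being executed.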
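
The counting logic behind this idea can be sketched in plain user-space C before touching the kernel. This is only a simulation under assumed names: struct hang_detector and hang_detector_tick are hypothetical, and the caller is assumed to invoke the function once per timer tick with the pid that was running at that moment.

```c
#include <stdbool.h>

/* State kept between ticks; it plays the role that static variables
 * would play inside a real interrupt handler. Hypothetical names. */
struct hang_detector {
    int last_pid;         /* pid seen on the previous tick */
    unsigned long count;  /* consecutive ticks with the same pid */
};

/*
 * Call once per simulated timer tick with the pid that was running.
 * Returns true when the same pid has been seen for `threshold`
 * consecutive ticks after it was first recorded (for example 10 * HZ),
 * then restarts the count.
 */
bool hang_detector_tick(struct hang_detector *d, int pid, unsigned long threshold)
{
    if (pid == d->last_pid) {
        d->count++;
    } else {
        /* a different process was running: start counting again */
        d->count = 0;
        d->last_pid = pid;
    }
    if (d->count == threshold) {
        d->count = 0;
        return true;
    }
    return false;
}
```

Any context switch resets the counter, so the function only reports a pid that was found running on every single tick for the whole interval.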

So how do we modify the kernel to implement this idea?

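As explained above, the kernel always has a pulse: the system timer interrupt. It keeps firing even during a hang, because it is a
hardware interrupt and the CPU still services interrupts while stuck in the loop (provided the stuck code has not disabled interrupts). So we add some code to the interrupt handling path to implement the idea above.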

In linux-2.6.30.4\arch\arm\kernel\irq.c, find the asm_do_IRQ function and modify it as follows:

asmlinkage void __exception asm_do_IRQ(unsigned int irq, struct pt_regs *regs)
{
    struct pt_regs *old_regs = set_irq_regs(regs);

    //add by llz in 2015.4.1
    static pid_t pre_pid = 0;   /* pid seen on the previous timer tick */
    static int cnt = 0;         /* consecutive ticks with the same pid */

    if (30 == irq)              /* 30 is the timer interrupt */
    {
        if (pre_pid == current->pid)
            cnt++;
        else
        {
            cnt = 0;
            pre_pid = current->pid;
        }
        if (cnt == 10*HZ)       /* same process for 10s of ticks */
        {
            cnt = 0;
            printk("asm_do_IRQ -> s3c2410_timer_irq : pid = %d , task_name = %s , ", current->pid, current->comm);
            printk("pc = %08x\n", (unsigned int)regs->ARM_pc);
        }
    }
    //addition ends here

    irq_enter();

    /*
     * Some hardware gives randomly wrong interrupts. Rather
     * than crashing, do something sensible.
     */
    if (irq >= NR_IRQS)
        handle_bad_irq(irq, &bad_irq_desc);
    else
        generic_handle_irq(irq);

    /* AT91 specific workaround */
    irq_finish(irq);

    irq_exit();
    set_irq_regs(old_regs);
}

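The added part sits between the two marker comments (//add by llz in 2015.4.1 and //addition ends here). Note that pre_pid and cnt must be declared static, so that they are initialized only once and keep their values from one interrupt to the next; otherwise the check would be wrong.
IRQ number 30 is the timer interrupt; see linux-2.6.30.4\arch\arm\mach-s3c2410\include\mach\irqs.h.
current is a pointer to the struct task_struct of the currently running process. struct pt_regs holds the registers saved when the interrupt was taken, so regs->ARM_pc is the address that was executing when the interrupt fired.
current->pid, current->comm and regs->ARM_pc are used to print the current pid, the process name and the PC value respectively.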

Next, build the kernel: make uImage
Then power up the board, enter the U-Boot prompt, download the uImage over tftp and boot it:
> tftp 0x30007fc0 uImage
> bootm 0x30007fc0

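(This kernel is only for debugging, and flashing every build to NAND flash is too much trouble. For booting a kernel downloaded over tftp, see another post:
http://blog.chinaunix.net/uid-29401328-id-4930747.html)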

Now run the test again:
# insmod system_dead.ko
sysdead_drv_init success!
# ./test // hangs; after about 10s the following is printed
asm_do_IRQ -> s3c2410_timer_irq : pid = 635 , task_name = test , pc = bf0d700c
asm_do_IRQ -> s3c2410_timer_irq : pid = 635 , task_name = test , pc = bf0d700c

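We can tell at a glance that the problem is in the test process, but not exactly where. For that we analyze the PC value, using the same approach as in earlier posts.
Here it is once more: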

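1. Find the function that contains bf0d700c.
The system is hung, so we cannot go any further and have to reboot. One thing to watch here: the kernel used after the reboot must be the same one that was running when the hang occurred.
If the kernel changes, it is hard to reproduce the state before the hang, and pc = bf0d700c may correspond to different code in the new kernel.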

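Boot the same kernel over tftp and insert the module system_dead.ko.
First check the System.map file in the kernel source tree to see whether the PC address falls within it. Here it does not (System.map lists kernel functions, whose addresses all start with c).
Then dump the module addresses on the board: cat /proc/kallsyms > kall.txt
Open kall.txt and search for addresses close to the PC value (you may find an exact match, or the PC may fall between two addresses). Here we find: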

00000000 a system_dead.c [system_dead]
bf0d7000 t $a [system_dead]
bf0d7000 t sysdead_test_open [system_dead]
bf0d7010 t sysdead_drv_exit [system_dead]

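So pc = bf0d700c lies inside sysdead_test_open, which starts at bf0d7000 while the next symbol, sysdead_drv_exit, starts at bf0d7010. Next we analyze this function.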
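
The manual search through kall.txt amounts to finding the symbol with the highest start address that does not exceed the PC. A minimal sketch of that lookup follows; struct symbol and find_symbol are hypothetical names, and the table is assumed to have been filled in from the kallsyms dump by some other code. It does not know where a symbol ends, so a PC far beyond the last symbol would still be attributed to it.

```c
#include <stddef.h>

/* One entry of a symbol table, e.g. parsed from a kallsyms dump.
 * Hypothetical type for illustration. */
struct symbol {
    unsigned long addr;   /* start address of the symbol */
    const char *name;
};

/*
 * Return the symbol with the highest start address that is <= pc,
 * or NULL if pc lies below every symbol. The table need not be sorted.
 */
const struct symbol *find_symbol(const struct symbol *tab, size_t n, unsigned long pc)
{
    const struct symbol *best = NULL;
    size_t i;

    for (i = 0; i < n; i++) {
        if (tab[i].addr <= pc && (best == NULL || tab[i].addr > best->addr))
            best = &tab[i];
    }
    return best;
}
```

Given the two function entries listed above (bf0d7000 and bf0d7010), a lookup of bf0d700c selects the entry for sysdead_test_open.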

2. Analyze the faulty function.
Our code here is short, so the problem is easy to spot, but with long code you may need to read the assembly. Here is how.

Disassemble system_dead.ko, the module that contains sysdead_test_open:
arm-none-linux-gnueabi-objdump system_dead.ko -D > system_dead.dis

Open system_dead.dis (only the part that matters to us is shown):
00000000 <sysdead_test_open>:
0: e1a0c00d mov ip, sp
4: e92dd800 push {fp, ip, lr, pc}
8: e24cb004 sub fp, ip, #4 ; 0x4
c: eafffffe b c <sysdead_test_open+0xc>

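At this point the analysis has to be done in assembly. The problem here is obvious: the instruction at offset 0xc branches to itself (the compiler has reduced the whole while(1) loop to a single branch), so execution never leaves it. Offset 0xc from the function's load address bf0d7000 is exactly the pc = bf0d700c that was printed.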
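
To double-check that the word eafffffe really is a branch to itself, the target can be computed by hand: for an ARM B/BL instruction the target is the instruction address plus 8, plus the sign-extended 24-bit immediate shifted left by 2. The sketch below does that arithmetic; arm_branch_target is a hypothetical helper name, and it assumes the caller has already established that the word is a B or BL instruction in ARM (not Thumb) state.

```c
#include <stdint.h>

/*
 * Compute the target of an ARM (A32) B/BL instruction.
 *   addr: address (or offset) of the instruction itself
 *   insn: the 32-bit instruction word
 * Target = addr + 8 + (sign-extended imm24 << 2).
 * Assumes insn has already been identified as B or BL.
 */
uint32_t arm_branch_target(uint32_t addr, uint32_t insn)
{
    int32_t imm24 = (int32_t)(insn & 0x00ffffffu);

    if (imm24 & 0x00800000)   /* sign-extend from 24 bits */
        imm24 -= 0x01000000;
    return addr + 8u + (uint32_t)(imm24 * 4);
}
```

For eafffffe the immediate is -2, so the target is addr + 8 - 8 = addr: at offset 0xc this gives 0xc again, which matches the "b c" shown by objdump.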

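Note: the PC value may differ between two hangs, even when the hang is in the same code. If the infinite loop spans several instructions,
execution may be at any one of them when the timer interrupt samples the PC.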