Analyzing a panic bug (Part 1)

2012-05-15 android_bsp
Category: LINUX

This post walks through a bug I ran into at work, and uses it to show how panic-type bugs in general can be analyzed:

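Environment: Linux 3.0, ARM chip platform (note that the Oops banner in the log below reports 2.6.35.7-eng)
First, the backtrace captured when the problem occurred: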
<6>[ 10.298767] regulator_init_complete: disabling vrfref 
<6>[ 10.299163] regulator_init_complete: disabling vrf2 
<1>[ 10.300750] Unable to handle kernel paging request at virtual address fc4ba1c0 
<1>[ 10.300781] pgd = c0004000 
<1>[ 10.300781] [fc4ba1c0] *pgd=00000000 
<0>[ 10.300811] Internal error: Oops: 5 [#1] PREEMPT SMP 
<0>[ 10.300842] last sysfs file: 
<4>[ 10.300842] Modules linked in: 
<4>[ 10.300872] CPU: 0 Tainted: G W (2.6.35.7-eng-gddc8274 #1) 
<4>[ 10.300933] PC is at strcmp+0x4/0x34 
<4>[ 10.300964] LR is at platform_match+0x5c/0x68 
<4>[ 10.300994] pc : [<c0284da8>] lr : [<c032d6e0>] psr: 60000013 
<4>[ 10.301025] sp : dc455f30 ip : 00000070 fp : 00000000 
<4>[ 10.301055] r10: 000000a0 r9 : 00000000 r8 : c032ac98 
<4>[ 10.301055] r7 : 00000000 r6 : c093a020 r5 : dc4a62a8 r4 : 00000000 
<4>[ 10.301086] r3 : 00000000 r2 : 00000001 r1 : c07c29fa r0 : fc4ba1c0 
<4>[ 10.301116] Flags: nZCv IRQs on FIQs on Mode SVC_32 ISA ARM Segment kernel 
<4>[ 10.301147] Control: 10c53c7f Table: 8000404a DAC: 00000015 
<4>[ 10.301177] 
<4>[ 10.301177] PC: 0xc0284d28: 
<4>[ 10.301208] 4d28 e2822001 e5d23000 e3530000 1afffffb e7d1c003 e35c0000 e7c2c003 e2833001 
<4>[ 10.301239] 4d48 1afffffa e12fff1e e3520000 e92d4010 11a0c000 1a000001 e8bd8010 e28cc001 
<4>[ 10.301300] 4d68 e5dc3000 e3530000 1afffffb ea000004 e1520003 1a000002 e3a03000 e7cc3002 
<4>[ 10.301361] 4d88 e8bd8010 e7d14003 e3540000 e7cc4003 e2833001 1afffff5 e8bd8010 e3a03000 
<4>[ 10.301391] 4da8 e7d02003 e7d1c003 e2833001 e152000c 0a000002 23a00001 33e00000 e12fff1e 
<4>[ 10.301452] 4dc8 e3520000 1afffff5 e1a00002 e12fff1e e92d4010 e3a03000 ea000009 e7d0c003 
<4>[ 10.301483] 4de8 e7d14003 e2833001 e15c0004 0a000002 23a00001 33e00000 e8bd8010 e35c0000 
<4>[ 10.301544] 4e08 0a000002 e3520000 e2422001 1afffff2 e3a00000 e8bd8010 e6ef2072 e3a03000 
<4>[ 10.301605] 
<4>[ 10.301605] LR: 0xc032d660: 
<4>[ 10.301635] d660 e3530000 05103008 e1a00001 e59f100c ebfd4f2b e3a00000 e8bd8010 c082324e 
<4>[ 10.301666] d680 c0823240 e92d4070 e5914038 e2405008 e3540000 1a000006 ea00000d e5950000 
<4>[ 10.301727] d6a0 ebfd5dbf e3500000 05854168 0a000005 e2844018 e5d43000 e1a01004 e3530000 
<4>[ 10.301757] d6c0 1afffff5 e1a04003 e2540000 13a00001 e8bd8070 e5100008 e5911000 ebfd5db0 
<4>[ 10.301818] d6e0 e2700001 33a00000 e8bd8070 e92d47f0 e3a04000 e1a05000 e1a06001 e1a0a002 
<4>[ 10.301879] d700 e1a08004 ea00000a e5957164 e0877004 e284401c e597300c e2033c1f e1560003 
<4>[ 10.301910] d720 1a000003 e5970008 ebfd5d9d e3500000 0a000005 e5953160 e1a0100a e1580003 
<4>[ 10.301971] d740 e2888001 3affffef e3a07000 e1a00007 e8bd87f0 e1a02001 e3a01b01 e92d4010 
<4>[ 10.302032] 
<4>[ 10.302032] SP: 0xdc455eb0: 
<4>[ 10.302032] 5eb0 00000000 c01a5e98 dc455ee8 dbbfb608 dc455ee0 dbbfc7e0 00000001 dc4209b0 
<4>[ 10.302093] 5ed0 ffffffff dc455f1c c093a020 00000000 c032ac98 c0671eec fc4ba1c0 c07c29fa 
<4>[ 10.302124] 5ef0 00000001 00000000 00000000 dc4a62a8 c093a020 00000000 c032ac98 00000000 
<4>[ 10.302185] 5f10 000000a0 00000000 00000070 dc455f30 c032d6e0 c0284da8 60000013 ffffffff 
<4>[ 10.302246] 5f30 dc4a62b0 dc455f50 c093a020 c032c280 c093a020 dc455f50 c032c260 c032b5a8 
<4>[ 10.302276] 5f50 dc447940 dc4a4378 c093a020 c093a020 dbbfc7e0 c093a6c8 00000000 c032adbc 
<4>[ 10.302337] 5f70 c07c29fa c07c29fa c093a020 c004be14 00000001 00000002 00000000 00000000 
<4>[ 10.302368] 5f90 00000000 c032c5ac c0031b64 c004be14 00000001 00000002 00000000 c0057598 
<4>[ 10.302429] 
<4>[ 10.302429] R0: 0xfc4ba140: 
<4>[ 10.302459] a140 ******** ******** ******** ******** ******** ******** ******** ******** 
<4>[ 10.302520] a160 ******** ******** ******** ******** ******** ******** ******** ******** 
<4>[ 10.302551] a180 ******** ******** ******** ******** ******** ******** ******** ******** 
<4>[ 10.302612] a1a0 ******** ******** ******** ******** ******** ******** ******** ******** 
<4>[ 10.302673] a1c0 ******** ******** ******** ******** ******** ******** ******** ******** 
<4>[ 10.302734] a1e0 ******** ******** ******** ******** ******** ******** ******** ******** 
<4>[ 10.302764] a200 ******** ******** ******** ******** ******** ******** ******** ******** 
<4>[ 10.302825] a220 ******** ******** ******** ******** ******** ******** ******** ******** 
<4>[ 10.302886] 
<4>[ 10.302886] R1: 0xc07c297a: 
<4>[ 10.302917] 2978 32332d70 79732d6b 742d636e 72656d69 696d6400 6f632d63 00636564 70616d6f 
<4>[ 10.302947] 2998 6562612d 6961642d 616d6f00 62612d70 78762d65 2d636572 00696164 70616d6f 
<4>[ 10.303009] 29b8 62636d2d 642d7073 6f006961 2d70616d 2d6d6370 69647561 6d6f006f 2d327061 
<4>[ 10.303039] 29d8 6c69616d 00786f62 70616d6f 756f765f 6d6f0074 775f7061 6d6f0062 666c7061 
<4>[ 10.303100] 29f8 76700062 76727372 6f006d6b 6470616d 68007373 00696d64 0064636c 3c007525 
<4>[ 10.303161] 2a18 73253e33 3a732520 656c7320 745f7065 6f656d69 735f7475 65726f74 6e49203a 
<4>[ 10.303192] 2a38 696c6176 61762064 0a65756c 0a752500 6f6c2f00 726c6163 2f6f7065 7162636d 
<4>[ 10.303253] 2a58 772f3338 736b726f 65636170 616c702f 726f6674 656b2f6d 6c656e72 616d6f2f 
<4>[ 10.303314] 2a78 612f3470 2f686372 2f6d7261 6863616d 616d6f2d 732f3270 61697265 00632e6c 
<4>[ 10.303344] 
<4>[ 10.303344] R5: 0xdc4a6228: 
<4>[ 10.303375] 6228 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 
<4>[ 10.303405] 6248 00000000 00000000 00000000 00000000 00000000 00000000 165f8171 57515555 
<4>[ 10.303466] 6268 5d555755 06055d55 635688c0 d84156c5 c755d51d c008eacc 564d4157 577de515 
<4>[ 10.303527] 6288 d415d35d 4d254555 75861555 5755d977 635688c0 d84156c5 f00dcafe 00000000 
<4>[ 10.303558] 62a8 fc4ba1c0 00000002 c093a570 dc4a4340 dc4ba160 1c4a607c dc4a64fc c093a578 
<4>[ 10.303619] 62c8 dc437e40 c093a3f0 dc4aabf0 00000005 00000007 00000000 00000000 00000001 
<4>[ 10.303649] 62e8 00000000 00000000 dc4a62f0 dc4a62f0 00000000 00000000 dc4a62e4 c093a6c8 
<4>[ 10.303710] 6308 c097ebec dc4a9f80 00000000 00000000 00000001 dc4a60dc dc4a655c 7fffffff 
<4>[ 10.303771] 
<4>[ 10.303771] R6: 0xc0939fa0: 
<4>[ 10.303771] 9fa0 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 
<4>[ 10.303833] 9fc0 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 
<4>[ 10.303863] 9fe0 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 
<4>[ 10.303924] a000 00000000 00000000 00000000 c02f92a0 c02f9264 c02f9050 c02f9024 c02f8ff8 
<4>[ 10.303985] a020 c07c29fa c093a6c8 00000000 00000000 00000000 00000000 c032d450 c032d474 
<4>[ 10.304016] a040 c032d490 00000000 00000000 00000000 00000000 dbbfc7e0 00000000 c093a05c 
<4>[ 10.304077] a060 c093a05c c02fc340 c02f95ec 00000001 c02fc368 c02fc348 c02fcb68 00000000 
<4>[ 10.304107] a080 c02fc544 c02fce68 c029fdc8 c02fc744 c02e3418 c02e4348 00000000 c02e2fc4 
<4>[ 10.304168] 
<4>[ 10.304168] R8: 0xc032ac18: 
<4>[ 10.304199] ac18 e5940030 e3500000 0a000001 e2800010 ebfd56fa e1a00004 e8bd8010 e92d47f3 
<4>[ 10.304229] ac38 e1a04000 e5900004 ebfffff1 e2506000 03e07015 0a0000ad e59f32cc e3a01401 
<4>[ 10.304290] ac58 e1c380d0 e0093001 e3530000 0a00000b e59f32b8 e3a01000 e3a00008 e1c380d0 
<4>[ 10.304321] ac78 e0082000 e0093001 e1921003 0a000003 e59f329c e5d33047 e3530000 1a00009d 
<4>[ 10.304382] ac98 e59f3290 e30810d0 e5937014 e1a00007 ebf88563 e59f8280 e1a05000 e1a00007 
<4>[ 10.304443] acb8 ebf87aee e59f2274 e5d23004 e3530000 e1a0a000 0a00001e e1a0100d e3c13d7f 
<4>[ 10.304473] acd8 e3c3303f e5931004 e2811001 e5831004 e5927010 e3570000 0a00000b e30890d0 
<4>[ 10.304534] acf8 e58da000 e3a03054 e58d9004 e1a01008 e5970004 e1a02005 e1a0e00f e597f000 
<0>[ 10.304595] Process swapper (pid: 1, stack limit = 0xdc4542f8) 
<0>[ 10.304626] Stack: (0xdc455f30 to 0xdc456000) 
<0>[ 10.304656] 5f20: dc4a62b0 dc455f50 c093a020 c032c280 
<0>[ 10.304687] 5f40: c093a020 dc455f50 c032c260 c032b5a8 dc447940 dc4a4378 c093a020 c093a020 
<0>[ 10.304718] 5f60: dbbfc7e0 c093a6c8 00000000 c032adbc c07c29fa c07c29fa c093a020 c004be14 
<0>[ 10.304779] 5f80: 00000001 00000002 00000000 00000000 00000000 c032c5ac c0031b64 c004be14 
<0>[ 10.304809] 5fa0: 00000001 00000002 00000000 c0057598 c07dd98d 00000170 c09220d8 c09f553c 
<0>[ 10.304840] 5fc0: 00000002 c004bdd0 c004be14 c09f553c 00000002 00000000 00000000 c000859c 
<0>[ 10.304901] 5fe0: 00000000 c000843c c005873c 00000013 00000000 c005873c 74545575 55575d54 
<4>[ 10.304962] [] (strcmp+0x4/0x34) from [] (platform_match+0x5c/0x68) 
<4>[ 10.305023] [] (platform_match+0x5c/0x68) from [] (__driver_attach+0x20/0x84) 
<4>[ 10.305084] [] (__driver_attach+0x20/0x84) from [] (bus_for_each_dev+0x48/0x84) 
<4>[ 10.305114] [] (bus_for_each_dev+0x48/0x84) from [] (bus_add_driver+0x188/0x330) 
<4>[ 10.305175] [] (bus_add_driver+0x188/0x330) from [] (driver_register+0xa8/0x134) 
<4>[ 10.305236] [] (driver_register+0xa8/0x134) from [] (do_one_initcall+0x5c/0x1b8) 
<4>[ 10.305297] [] (do_one_initcall+0x5c/0x1b8) from [] (kernel_init+0x160/0x22c) 
<4>[ 10.305328] unwind: Unknown symbol address c000859c 
<4>[ 10.305328] unwind: Index not found c000859c 
<0>[ 10.305358] Code: e2833001 1afffff5 e8bd8010 e3a03000 (e7d02003) 
<4>[ 10.305450] ---[ end trace 1b75b31a2719ed25 ]--- 
<0>[ 10.305511] Kernel panic - not syncing: Attempted to kill init! 
<0>[ 10.305541] Die id:59ae000600000001 
<3>[ 10.305572] mbm_version=0x00000a6c 
<3>[ 10.305572] mbm_loader_version=0x00000a6c 
<0>[ 10.477203] Timestamp = 10.796 
<0>[ 10.477233] Current Time = 01-01 00:00:10.796, Uptime = 10.971 seconds

First, look at the state at the moment of the panic:
<1>[ 10.300750] Unable to handle kernel paging request at virtual address fc4ba1c0 
<1>[ 10.300781] pgd = c0004000 
<1>[ 10.300781] [fc4ba1c0] *pgd=00000000 
<0>[ 10.300811] Internal error: Oops: 5 [#1] PREEMPT SMP 
<0>[ 10.300842] last sysfs file: 
<4>[ 10.300842] Modules linked in: 
<4>[ 10.300872] CPU: 0 Tainted: G W (2.6.35.7-eng-gddc8274 #1) 
<4>[ 10.300933] PC is at strcmp+0x4/0x34 
<4>[ 10.300964] LR is at platform_match+0x5c/0x68 
<4>[ 10.300994] pc : [<c0284da8>] lr : [<c032d6e0>] psr: 60000013 

<4>[ 10.301025] sp : dc455f30 ip : 00000070 fp : 00000000 
<4>[ 10.301055] r10: 000000a0 r9 : 00000000 r8 : c032ac98 
<4>[ 10.301055] r7 : 00000000 r6 : c093a020 r5 : dc4a62a8 r4 : 00000000 
<4>[ 10.301086] r3 : 00000000 r2 : 00000001 r1 : c07c29fa r0 : fc4ba1c0 
<4>[ 10.301116] Flags: nZCv IRQs on FIQs on Mode SVC_32 ISA ARM Segment kernel 
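Note the PC and LR: the kernel panicked inside strcmp, which was called from platform_match. The next step is to produce a mixed C and assembly listing with objdump. To do that, check out the source code at the commit that panicked, build the matching vmlinux, and disassemble it with objdump. The relevant fragment is: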
c0284da4 <strcmp>:
int strcmp(const char *cs, const char *ct)
{
        unsigned char c1, c2;

        while (1) {
                c1 = *cs++;
c0284da4:       e3a03000        mov     r3, #0  ; 0x0
c0284da8:       e7d02003        ldrb    r2, [r0, r3]       <<== at the panic, pc is c0284da8; r3 is 0 and r0 is fc4ba1c0, so the load from virtual address fc4ba1c0 faults. This matches the error message above (Unable to handle kernel paging request at virtual address fc4ba1c0).
                c2 = *ct++;
c0284dac:       e7d1c003        ldrb    ip, [r1, r3]   
                if (c1 != c2)
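
The faulting word that the oops prints in its `Code:` line, `(e7d02003)`, can also be decoded by hand to confirm which registers form the address. The sketch below assumes the classic A32 single data transfer encoding with an unshifted register offset; the struct and function names are made up for illustration and are not kernel APIs.

```c
#include <stdint.h>

/* Decoded fields of an A32 "LDR/STR{B} Rd, [Rn, +/-Rm]" instruction. */
struct ldst_reg {
    unsigned rn;      /* base register number */
    unsigned rd;      /* destination (load) or source (store) register */
    unsigned rm;      /* offset register number */
    int is_load;      /* L bit (bit 20) */
    int is_byte;      /* B bit (bit 22) */
    int add_offset;   /* U bit (bit 23): 1 means base + offset */
};

/*
 * Decode only the simple form used by the faulting instruction:
 * register offset, no shift, offset addressing (P = 1, W = 0).
 * Returns 1 on success, 0 if the word has some other form.
 */
static int decode_ldst_reg(uint32_t insn, struct ldst_reg *out)
{
    if (((insn >> 25) & 0x7u) != 0x3u)          /* bits 27..25 must be 011 */
        return 0;
    if ((insn >> 4) & 0xffu)                    /* bits 11..4: no shift */
        return 0;
    if (((insn >> 24) & 1u) == 0 || ((insn >> 21) & 1u) != 0)
        return 0;                               /* need P = 1, W = 0 */

    out->rn = (insn >> 16) & 0xfu;
    out->rd = (insn >> 12) & 0xfu;
    out->rm = insn & 0xfu;
    out->is_load = (int)((insn >> 20) & 1u);
    out->is_byte = (int)((insn >> 22) & 1u);
    out->add_offset = (int)((insn >> 23) & 1u);
    return 1;
}

/* Address accessed by the decoded instruction, given r0..r15. */
static uint32_t effective_address(const struct ldst_reg *f,
                                  const uint32_t regs[16])
{
    uint32_t base = regs[f->rn];
    uint32_t off = regs[f->rm];

    return f->add_offset ? base + off : base - off;
}
```

Feeding it 0xe7d02003 together with r0 = 0xfc4ba1c0 and r3 = 0 from the register dump gives a byte load through r0 + r3, that is, from 0xfc4ba1c0.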

Because LR points into platform_match, we can find the corresponding C code (drivers/base/platform.c):
static int platform_match(struct device *dev, struct device_driver *drv) 
{
struct platform_device *pdev = to_platform_device(dev);
struct platform_driver *pdrv = to_platform_driver(drv);

/* Attempt an OF style match first */
if (of_driver_match_device(dev, drv))
return 1;

/* Then try to match against the id table */
if (pdrv->id_table)
return platform_match_id(pdrv->id_table, pdev) != NULL;

/* fall-back to driver name match */
return (strcmp(pdev->name, drv->name) == 0);  <<=== the panic happens here. From the assembly above, r0 holds the first argument, const char *cs, which here is pdev->name. So pdev->name is an invalid address, and that triggers the panic
}
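
The PC/LR arithmetic behind this conclusion can be checked mechanically. The sketch below is only an illustration with hypothetical helper names. It uses the function start addresses from the disassembly in this post (strcmp at c0284da4, platform_match at c032d684) and the `bl` word ebfd5db0 found at c032d6dc in the LR dump, and it assumes the standard A32 `BL` encoding (24-bit signed word offset relative to the instruction address plus 8).

```c
#include <stdint.h>

/* Address denoted by "symbol+0xNN" in an oops, given the symbol's start. */
static uint32_t symbol_plus_offset(uint32_t symbol_start, uint32_t offset)
{
    return symbol_start + offset;
}

/* Branch target of an A32 BL instruction located at insn_addr. */
static uint32_t bl_target(uint32_t insn, uint32_t insn_addr)
{
    uint32_t offset = (insn & 0x00ffffffu) << 2;  /* imm24 * 4 */

    if (offset & 0x02000000u)                     /* sign-extend from bit 25 */
        offset |= 0xfc000000u;
    /* In ARM state the PC reads as the instruction address + 8. */
    return insn_addr + 8u + offset;
}

/* Value BL leaves in LR: the address of the following instruction. */
static uint32_t bl_return_address(uint32_t insn_addr)
{
    return insn_addr + 4u;
}
```

With these values, strcmp+0x4 is c0284da8 (the PC), platform_match+0x5c is c032d6e0 (the LR), and the `bl` at c032d6dc both targets c0284da4 and returns to c032d6e0, so the faulting strcmp call is the one in the fall-back name match.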

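Given that, the next question is: pdev is the platform device structure, so is the pdev structure as a whole wrong, or has the memory holding one of its members been changed? See the analysis below.
Here is the mixed assembly and C listing from the entry of platform_match up to the call to strcmp: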
static int platform_match(struct device *dev, struct device_driver *drv)
{
c032d684:       e92d4070        push    {r4, r5, r6, lr}
        struct platform_device *pdev = to_platform_device(dev);
        struct platform_driver *pdrv = to_platform_driver(drv);

        /* match against the id table first */
        if (pdrv->id_table)
c032d688:       e5914038        ldr     r4, [r1, #56]
 * and compare it against the name of the driver. Return whether they match
 * or not.
 */
static int platform_match(struct device *dev, struct device_driver *drv)
{
        struct platform_device *pdev = to_platform_device(dev);
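c032d68c:       e2405008        sub     r5, r0, #8      ; 0x8   <== r0 is the first argument of platform_match (dev). to_platform_device() is a container_of(): it subtracts the 8-byte offset of the embedded dev member from dev to get the enclosing platform_device and puts the result in r5, so r5 is the pdev pointer. r5 is not modified again before the panic, so its value in the panic register dump is still pdev.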
        struct platform_driver *pdrv = to_platform_driver(drv);

        /* match against the id table first */
        if (pdrv->id_table)
c032d690:       e3540000        cmp     r4, #0  ; 0x0
c032d694:       1a000006        bne     c032d6b4
c032d698:       ea00000d        b       c032d6d4
static const struct platform_device_id *platform_match_id(
                        const struct platform_device_id *id,
                        struct platform_device *pdev)
{
        while (id->name[0]) {
                if (strcmp(pdev->name, id->name) == 0) {
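c032d69c:       e5950000        ldr     r0, [r5] <== note that this loads the word at [r5] into r0, which is again pdev->name. This is an earlier use of the same memory location; if the whole structure were bad, the code would already have failed here, so the first possibility is ruled out. (In the faulting call itself r4, that is pdrv->id_table, is 0, so this branch is skipped; the argument holds only if an earlier match attempt on this device took this path.)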
c032d6a0:       ebfd5dbf        bl      c0284da4
c032d6a4:       e3500000        cmp     r0, #0  ; 0x0
                        pdev->id_entry = id;
c032d6a8:       05854168        streq   r4, [r5, #360]
c032d6ac:       0a000005        beq     c032d6c8
                        return id;
                }
                id++;
c032d6b0:       e2844018        add     r4, r4, #24     ; 0x18

static const struct platform_device_id *platform_match_id(
                        const struct platform_device_id *id,
                        struct platform_device *pdev)
{
        while (id->name[0]) {
c032d6b4:       e5d43000        ldrb    r3, [r4]
                if (strcmp(pdev->name, id->name) == 0) {
c032d6b8:       e1a01004        mov     r1, r4

static const struct platform_device_id *platform_match_id(
                        const struct platform_device_id *id,
                        struct platform_device *pdev)
{
        while (id->name[0]) {
c032d6bc:       e3530000        cmp     r3, #0  ; 0x0
c032d6c0:       1afffff5        bne     c032d69c
                        pdev->id_entry = id;
                        return id;
                }
                id++;
        }
        return NULL;
c032d6c4:       e1a04003        mov     r4, r3
        struct platform_device *pdev = to_platform_device(dev);
        struct platform_driver *pdrv = to_platform_driver(drv);

        /* match against the id table first */
        if (pdrv->id_table)
                return platform_match_id(pdrv->id_table, pdev) != NULL;
c032d6c8:       e2540000        subs    r0, r4, #0      ; 0x0
c032d6cc:       13a00001        movne   r0, #1  ; 0x1
c032d6d0:       e8bd8070        pop     {r4, r5, r6, pc}

        /* fall-back to driver name match */
        return (strcmp(pdev->name, drv->name) == 0);
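c032d6d4:       e5100008        ldr     r0, [r0, #-8] <== loads the word at address r0-8 into r0 as the first argument for strcmp. r0-8 is the address of the pdev structure (the same value as r5), and as the definition below shows, name is its first member, so the word at that address is pdev->name. Recall the two hypotheses above: either the pdev structure itself is bad, or one of its members has been corrupted. Let's keep tracing.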
struct platform_device {
const char * name;
int id;
struct device dev;
u32 num_resources;
struct resource * resource;
..........
};
c032d6d8:       e5911000        ldr     r1, [r1]
c032d6dc:       ebfd5db0        bl      c0284da4 <== branches to strcmp
c032d6e0:       e2700001        rsbs    r0, r0, #1      ; 0x1  <== c032d6e0 is exactly the LR at the time of the panic
c032d6e4:       33a00000        movcc   r0, #0  ; 0x0
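
The `sub r5, r0, #8` step in the listing is what a container_of() conversion compiles to. As a minimal sketch, the mock types below (hypothetical, not the kernel definitions) model the 32-bit ARM layout with fixed-width integers, so the offsets are the same on any host, and show where the constant 8 comes from.

```c
#include <stddef.h>
#include <stdint.h>

/* Mock of the first members of struct device on a 32-bit target. */
struct mock_device {
    uint32_t parent;     /* struct device *parent */
    uint32_t p;          /* struct device_private *p */
    uint32_t kobj_name;  /* kobj.name, first member of struct kobject */
};

/* Mock of the first members of struct platform_device. */
struct mock_platform_device {
    uint32_t name;           /* const char *name */
    int32_t id;              /* int id */
    struct mock_device dev;  /* embedded, not a pointer */
};

/* Generic form: step back from a member to its enclosing structure. */
#define sketch_container_of(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

/* Same arithmetic on raw 32-bit target addresses, as in "sub r5, r0, #8". */
static uint32_t mock_to_platform_device(uint32_t dev_addr)
{
    return dev_addr - (uint32_t)offsetof(struct mock_platform_device, dev);
}
```

Going the other way, r5 = 0xdc4a62a8 implies dev = 0xdc4a62b0, a value that also shows up at the top of the stack dump.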


Using the value of r5, we can look up the memory around the address held in r5 in the panic dump:
R5: 0xdc4a6228: 
6228 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 
6248 00000000 00000000 00000000 00000000 00000000 00000000 165f8171 57515555 
6268 5d555755 06055d55 635688c0 d84156c5 c755d51d c008eacc 564d4157 577de515 
6288 d415d35d 4d254555 75861555 5755d977 635688c0 d84156c5 f00dcafe 00000000 
62a8 fc4ba1c0 00000002 c093a570 dc4a4340 dc4ba160 1c4a607c dc4a64fc c093a578 
62c8 dc437e40 c093a3f0 dc4aabf0 00000005 00000007 00000000 00000000 00000001 
62e8 00000000 00000000 dc4a62f0 dc4a62f0 00000000 00000000 dc4a62e4 c093a6c8 
6308 c097ebec dc4a9f80 00000000 00000000 00000001 dc4a60dc dc4a655c 7fffffff 
At the panic, r5 is 0xdc4a62a8, so the pdev structure at that moment begins with these words:
fc4ba1c0 00000002 c093a570 dc4a4340 dc4ba160 1c4a607c dc4a64fc c093a578 
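
Finding the right row in such a dump is simple arithmetic. The sketch below assumes what the log in this post shows: each register dump starts 0x80 bytes below the register value and prints eight 32-bit words per row, labelled with the low 16 bits of the row address. The helper names are hypothetical.

```c
#include <stdint.h>

/* Position of an address inside a register memory dump. */
struct dump_pos {
    unsigned row;   /* 0-based row index */
    unsigned word;  /* 0-based word index within the row */
};

/* Locate addr in a dump that starts at dump_start (8 words per row). */
static struct dump_pos locate_in_dump(uint32_t dump_start, uint32_t addr)
{
    uint32_t off = addr - dump_start;
    struct dump_pos pos;

    pos.row = off / 32u;
    pos.word = (off % 32u) / 4u;
    return pos;
}

/* Four-hex-digit label printed in front of a given row. */
static uint32_t row_label(uint32_t dump_start, unsigned row)
{
    return (dump_start + row * 32u) & 0xffffu;
}
```

For the R5 dump, which starts at 0xdc4a6228, the address 0xdc4a62a8 is the first word of the fifth row, the one labelled 62a8.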
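Following the structure layout above, the first word, fc4ba1c0, is pdev->name and the second is pdev->id = 00000002. The third word is where the embedded struct device begins:
struct device {
struct device *parent;
struct device_private *p;
struct kobject kobj;
...
};
struct kobject {
const char *name;
....
};
So the third word is pdev->dev.parent = c093a570, the fourth is pdev->dev.p = dc4a4340, and the fifth is pdev->dev.kobj.name = dc4ba160. In principle this name should correspond to the first one, since both identify the same device. Looking closely at fc4ba1c0 and dc4ba160 in binary, they differ in only three bit positions (fc4ba1c0 XOR dc4ba160 = 200000a0).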
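
The field mapping can be written down as a small sketch. It covers only the first five dumped words and uses fixed-width integers to stand in for 32-bit ARM pointers; the type and function names are hypothetical and exist only for this illustration.

```c
#include <stdint.h>

/* First five 32-bit words of a platform_device as seen in the dump. */
struct dumped_pdev_prefix {
    uint32_t name;           /* pdev->name */
    uint32_t id;             /* pdev->id */
    uint32_t dev_parent;     /* pdev->dev.parent */
    uint32_t dev_p;          /* pdev->dev.p */
    uint32_t dev_kobj_name;  /* pdev->dev.kobj.name */
};

/* Map the raw dump words, in memory order, onto named fields. */
static struct dumped_pdev_prefix parse_pdev_words(const uint32_t words[5])
{
    struct dumped_pdev_prefix out;

    out.name = words[0];
    out.id = words[1];
    out.dev_parent = words[2];
    out.dev_p = words[3];
    out.dev_kobj_name = words[4];
    return out;
}
```

Applied to the row labelled 62a8, this yields name = 0xfc4ba1c0, id = 2 and dev.kobj.name = 0xdc4ba160.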
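In cases like this the most likely cause is a hardware problem. The bug was only ever reproduced once, and in my experience that also points strongly to hardware, so the analysis stops here. Still, the method and steps, and the way the memory dump is used to check the bug, are a useful reference.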
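
To make the bit-level comparison of the two name pointers concrete, the sketch below XORs two 32-bit values and counts the set bits. The function name is made up for illustration. For fc4ba1c0 and dc4ba160 the XOR is 0x200000a0, which has three bits set (bits 29, 7 and 5).

```c
#include <stdint.h>

/* Number of bit positions in which a and b differ. */
static unsigned count_differing_bits(uint32_t a, uint32_t b)
{
    uint32_t x = a ^ b;
    unsigned n = 0;

    while (x) {
        x &= x - 1u;  /* clear the lowest set bit */
        n++;
    }
    return n;
}
```

A small count like this is what makes a flipped-bit explanation plausible, although on its own it does not prove a hardware fault.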



