Linux Coredump Series, Part 1: How Coredumps Are Produced and Why - Garfield

Series overview:

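For a C/C++ programmer, one of the most familiar sentences is probably "program XX has coredumped again, fix it now". When a program coredumps, its service is interrupted for some time, and in a production environment that can mean serious losses. This series works from the basics to the details: when coredumps are produced, how they are produced, how to debug them, and how to use a coredump to analyze the program's memory.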

1. What is a coredump file

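A coredump, also called a core, is a mechanism of Linux/Unix operating systems. When a running program meets certain conditions, the system produces a coredump file. Although a coredump means the program stops serving for a while, the file preserves first-hand evidence of the process at that moment: the program's memory, register state, stack pointer, memory management information, and the function call stacks, all of which support later debugging. Sometimes, though, the coredump file is incomplete or its stack information is already corrupted; a later article in this series covers that case.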

2. Causes of a coredump and common cases

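A process usually coredumps because it received a segmentation fault signal, SIGSEGV. In program output and logs this typically appears as signal 11, which is SIGSEGV. Other signals, such as SIGABRT, SIGBUS, SIGFPE, and SIGILL, also produce a coredump by default.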

The possible causes include:

2.1 Out-of-bounds memory access

a) An array index out of range, which also includes out-of-range indexing of STL containers.

b) Walking a string and relying on its terminator to find the end when the string is not properly terminated.

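c) Using string functions such as strcpy, strcat, sprintf, strcmp, and strcasecmp, which can read or write past the end of a buffer. Prefer the bounded variants strncpy, strlcpy, strncat, strlcat, snprintf, strncmp, and strncasecmp. Note that strncpy does not NUL-terminate the destination when the source is too long, and that strlcpy and strlcat are not available in every C library.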

2.2 A multithreaded program uses functions that are not thread-safe.

Use the reentrant, thread-safe variants listed below instead; they are easy to misuse: asctime_r(3c), gethostbyname_r(3n), getservbyname_r(3n), ctermid_r(3s), gethostent_r(3n), getservbyport_r(3n), ctime_r(3c), getlogin_r(3c), getservent_r(3n), fgetgrent_r(3c), getnetbyaddr_r(3n), getspent_r(3c), fgetpwent_r(3c), getnetbyname_r(3n), getspnam_r(3c), fgetspent_r(3c), getnetent_r(3n), gmtime_r(3c), gamma_r(3m), getnetgrent_r(3n), lgamma_r(3m), getauclassent_r(3), getprotobyname_r(3n), localtime_r(3c), getauclassnam_r(3), getprotobynumber_r(3n), nis_sperror_r(3n), getauevent_r(3), getprotoent_r(3n), rand_r(3c), getauevnam_r(3), getpwent_r(3c), readdir_r(3c), getauevnum_r(3), getpwnam_r(3c), strtok_r(3c), getgrent_r(3c), getpwuid_r(3c), tmpnam_r(3s), getgrgid_r(3c), getrpcbyname_r(3n), ttyname_r(3c), getgrnam_r(3c), getrpcbynumber_r(3n), gethostbyaddr_r(3n), getrpcent_r(3n). This list comes from a document of thread-safe functions found online.

2.3 Data that multiple threads read and write is not protected by a lock, for example global variables and static variables.

Global data that several threads access concurrently should be protected by a lock; otherwise a coredump is quite likely.

2.4 Invalid pointers, including null pointers, illegal pointer casts, wild pointers, and structure alignment problems in cross-platform products.

2.5 Stack overflow. Embedded systems in particular have small stacks that overflow easily.

2.6 For completeness, a few situations in which no coredump is produced:

The core file will not be generated if

(a) the process was set-user-ID and the current user is not the owner of the program file, or

(b) the process was set-group-ID and the current user is not the group owner of the file,

(c) the user does not have permission to write in the current working directory,

(d) the file already exists and the user does not have permission to write to it, or

(e) the file is too big (recall the RLIMIT_CORE limit in Section 7.11). The permissions of the core file (assuming that the file doesn't already exist) are usually user-read and user-write, although Mac OS X sets only user-read.

3. Where coredumps are stored:

On SuSE Linux, first add the following line to the current user's .bash_profile: ulimit -S -c unlimited

Then run source /root/.bash_profile; of course, the user I am working as is root.

Then enter the following commands:

echo '/home/coredump/core-%e-%p'> /proc/sys/kernel/core_pattern

echo 1 > /proc/sys/kernel/core_uses_pid

ulimit -c unlimited

Coredumps can now be produced, and they are written to the /home/coredump directory, which must already exist and be writable.

Now let's look at the format specifiers that core_pattern accepts:

%p - insert pid into filename

%u - insert current uid into filename

%g - insert current gid into filename

%s - insert signal that caused the coredump into the filename

%t - insert UNIX time that the coredump occurred into filename

%h - insert hostname where the coredump happened into filename

%e - insert coredumping executable name into filename

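The coredump location can also be set from within the program by calling chdir. This works when core_pattern is a relative path, because the core file is then written to the process's current working directory.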

4. Identifying a coredump file

Like the program binaries that make produces, a coredump file is an ELF file, which the file command confirms.

Mine is an ELF file produced on a 32-bit system. On AIX, the output also names the program that produced the coredump.

Once we know it is an ELF file, readelf can parse it; readelf -h prints the file header.

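Looking at Type first, it is a Core file. The Magic bytes also show that this core was produced on a 32-bit system: the fifth byte is 01 for 32-bit and 02 for 64-bit. For details on parsing ELF files, see the readelf help (readelf --help or man readelf).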

Coming next: Part 2, how to debug a coredump