Problem description: When Lisa tried to run the little tool `runmem` on her test Appliance domain 0, after a few repetitions the Java Tomcat server process was killed by the system, even though each runmem process could still obtain its 200 MB of memory. At the critical point there was still nearly 600 MB of free memory.
- runmem is a little tool I developed for testing; it consumes 200 MB of RAM in one process, and another 200 MB each time you start an additional runmem process …
- The test Appliance domain 0 system has its own 4 GB of physical memory but no swap.
- The "critical point" means the system is fine now, but once you start one more runmem process, the Java Tomcat process will be killed.
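The runmem tool itself is not shown, but its behaviour can be sketched: allocate a block of memory and touch every page so the RAM is actually committed rather than merely reserved. This is a minimal Python sketch under that assumption (the function name `grab` is mine, not from the original tool):

```python
MB = 1024 * 1024

def grab(n_mb):
    """Allocate n_mb megabytes and touch every page so the RAM is really committed."""
    buf = bytearray(n_mb * MB)
    for i in range(0, len(buf), 4096):  # one write per 4 KB page
        buf[i] = 1
    return buf
```

Calling `grab(200)` in each of several separate processes, and keeping the returned buffer alive, reproduces the 200-MB-per-process behaviour described above.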
Findings: After some research I found that Linux has a self-protection mechanism named the oom-killer (out-of-memory killer): it kills a process when Linux cannot satisfy some other process's request for new memory. But there were 600 MB of free memory in this case, so why did the oom-killer kill the Java Tomcat process? (We can see that the oom-killer ran from /var/log/messages.)
Using /proc/zoneinfo we can get more detail about memory. There are three zones here: the DMA zone, the Normal zone, and the HighMem zone (see below). User-space applications use HighMem zone memory, while kernel-space allocations use Normal zone memory.
[root@lisa1043 mem and swap]# cat /proc/zoneinfo
Node 0, zone DMA
pages free 3160
min 17
low 21
high 25
active 0
inactive 0
scanned 0 (a: 17 i: 17)
spanned 4096
present 4096
nr_anon_pages 0
nr_mapped 1
nr_file_pages 0
nr_slab 0
nr_page_table_pages 0
nr_dirty 0
nr_writeback 0
nr_unstable 0
nr_bounce 0
protection: (0, 0, 851, 24149)
pagesets
all_unreclaimable: 1
prev_priority: 12
start_pfn: 0
Node 0, zone Normal
pages free 137757
min 924
low 1155
high 1386
active 48
inactive 35
scanned 97 (a: 5 i: 7)
spanned 218110
present 218110
nr_anon_pages 0
nr_mapped 1
nr_file_pages 80
nr_slab 4052
nr_page_table_pages 1827
nr_dirty 0
nr_writeback 0
nr_unstable 0
nr_bounce 0
protection: (0, 0, 0, 186383)
pagesets
cpu: 0 pcp: 0
count: 9
high: 186
batch: 31
cpu: 0 pcp: 1
count: 61
high: 62
batch: 15
vm stats threshold: 24
cpu: 1 pcp: 0
count: 46
high: 186
batch: 31
cpu: 1 pcp: 1
count: 59
high: 62
batch: 15
vm stats threshold: 24
cpu: 2 pcp: 0
count: 60
high: 186
batch: 31
cpu: 2 pcp: 1
count: 51
high: 62
batch: 15
vm stats threshold: 24
cpu: 3 pcp: 0
count: 121
high: 186
batch: 31
cpu: 3 pcp: 1
count: 53
high: 62
batch: 15
vm stats threshold: 24
all_unreclaimable: 0
prev_priority: 2
start_pfn: 4096
Node 0, zone HighMem
pages free 11114
min 128
low 6449
high 12770
active 795251
inactive 381
scanned 297953 (a: 0 i: 20)
spanned 5964270
present 5964270
nr_anon_pages 793116
nr_mapped 2155
nr_file_pages 2494
nr_slab 0
nr_page_table_pages 0
nr_dirty 38
nr_writeback 0
nr_unstable 0
nr_bounce 0
protection: (0, 0, 0, 0)
pagesets
cpu: 0 pcp: 0
count: 152
high: 186
batch: 31
cpu: 0 pcp: 1
count: 0
high: 62
batch: 15
vm stats threshold: 54
cpu: 1 pcp: 0
count: 184
high: 186
batch: 31
cpu: 1 pcp: 1
count: 5
high: 62
batch: 15
vm stats threshold: 54
cpu: 2 pcp: 0
count: 71
high: 186
batch: 31
cpu: 2 pcp: 1
count: 3
high: 62
batch: 15
vm stats threshold: 54
cpu: 3 pcp: 0
count: 22
high: 186
batch: 31
cpu: 3 pcp: 1
count: 5
high: 62
batch: 15
vm stats threshold: 54
all_unreclaimable: 0
prev_priority: 2
start_pfn: 222206
In this case the HighMem zone has 11114 free pages (11114 × 4 / 1024 ≈ 43 MB), but runmem requests 200 MB at a time, so the allocation falls back to the Normal zone. Here we should notice `pages free 137757` and `protection: (0, 0, 0, 186383)` in the Normal zone: since 137757 < 186383, the Normal zone refuses the user-space request, the oom-killer is awakened to find a process to kill, and it picks the Java Tomcat process.
Note: the protected memory can still be allocated by kernel-space requests.
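The decision described above can be captured in a few lines. This sketch (my own simplified model, not kernel code) checks both halves of the argument: HighMem's free pages are far short of 200 MB, and the Normal zone's free pages are below its protection value for HighMem-class allocations:

```python
PAGE_KB = 4  # page size on this x86 system

def pages_to_mb(pages):
    return pages * PAGE_KB // 1024

def zone_allows_user_alloc(free_pages, lowmem_reserve_pages):
    # A zone rejects a fallback allocation from a higher zone class
    # unless its free pages exceed the reserve ("protection") value.
    return free_pages > lowmem_reserve_pages

# HighMem: 11114 free pages is only about 43 MB, too small for a 200 MB request
assert pages_to_mb(11114) == 43
# Normal zone: 137757 free < 186383 reserved, so the fallback is refused
assert not zone_allows_user_alloc(137757, 186383)
```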
It seems we have found the cause. Is there any way to fix or optimize this, that is, to make more memory available to user-space applications?
One way is to adjust /proc/sys/vm/lowmem_reserve_ratio: the larger the values written here, the smaller the protection values become. You can check this like:
# echo "1024 1024 256" > /proc/sys/vm/lowmem_reserve_ratio
# cat /proc/zoneinfo | grep protection
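Why does a bigger ratio give a smaller protection value? Each zone's reserve against higher-zone allocations appears to be the total pages of the zones above it divided by that zone's ratio (judging from the numbers in the zoneinfo dump, the default ratios here are 256 for DMA and 32 for Normal; this is my reconstruction, not quoted kernel source). The protection values shown above can be reproduced exactly:

```python
def lowmem_reserve(higher_zone_pages, ratio):
    # reserve ("protection") = total pages of all higher zones // ratio
    return sum(higher_zone_pages) // ratio

# Normal zone protecting itself from HighMem-class allocations (ratio 32):
assert lowmem_reserve([5964270], 32) == 186383
# DMA zone protecting itself from Normal and HighMem allocations (ratio 256):
assert lowmem_reserve([218110], 256) == 851
assert lowmem_reserve([218110, 5964270], 256) == 24149
```

So raising the ratio (the divisor) directly shrinks the reserve, which is why the echo command above frees up more Normal-zone pages for user-space fallbacks.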
I think a better approach would be for the *alloc functions to return NULL to the user application when there is no more memory to allocate, because most user applications can handle that failure. For now I have not figured out how to do this; I will keep working on it, and your interest is welcome.
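For what it's worth, Linux does offer a knob in this direction: strict overcommit accounting (`echo 2 > /proc/sys/vm/overcommit_memory`), under which allocations beyond the commit limit fail up front instead of succeeding and triggering the oom-killer later. Whatever the mechanism, the application side of the idea is simply to check for allocation failure and degrade gracefully. A sketch in Python (where the failure surfaces as MemoryError rather than a NULL return from malloc):

```python
def try_alloc(n_mb):
    """Return an n_mb-megabyte buffer, or None if the allocation fails.

    The C equivalent is checking malloc()'s return value for NULL;
    Python signals the same failure with MemoryError.
    """
    try:
        return bytearray(n_mb * 1024 * 1024)
    except MemoryError:
        return None

buf = try_alloc(1)
if buf is None:
    # back off instead of being picked off by the oom-killer later
    print("allocation failed, degrading gracefully")
```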