Issue
A customer reported that Ubuntu 22.04 consumes more memory than Ubuntu 20.04.
free -lh

Ubuntu 22.04 / 5.15.0:
total used free shared buff/cache available
Mem: 376Gi 5.0Gi 368Gi 10Mi 2.5Gi 368Gi
Low: 376Gi 7.5Gi 368Gi
High: 0B 0B 0B
Swap: 0B 0B 0B

Ubuntu 20.04 / 5.4.0:
total used free shared buff/cache available
Mem: 376Gi 2.3Gi 371Gi 10Mi 3.0Gi 371Gi
Low: 376Gi 5.2Gi 371Gi
High: 0B 0B 0B
Swap: 0B 0B 0B
free shows 'used' memory of 2.3 GiB on Ubuntu 20.04, but about 5 GiB on Ubuntu 22.04.
Why does 22.04 use roughly 2.6 GiB more memory?
Analysis
Phase 1: how 'used' is calculated
Judging from the procps source, 'used' is computed by reading and parsing the contents of /proc/meminfo:
# cat /proc/meminfo
MemTotal: 394594036 kB
MemFree: 389106200 kB
MemAvailable: 389952084 kB
Buffers: 4276 kB
Cached: 2817564 kB
SwapCached: 0 kB
SReclaimable: 281992 kB
The calculation subtracts free, buffers, page cache, and reclaimable slab from the total:
kb_main_cached = kb_page_cache + kb_slab_reclaimable;
mem_used = kb_main_total - kb_main_free - kb_main_cached - kb_main_buffers;
Equivalently:
used = MemTotal - MemFree - Cached - SReclaimable - Buffers
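As a sanity check, the same arithmetic can be reproduced straight from /proc/meminfo; a minimal awk sketch of the formula above (not the actual procps code):
awk '/^MemTotal:/{t=$2} /^MemFree:/{f=$2} /^Buffers:/{b=$2} /^Cached:/{c=$2} /^SReclaimable:/{s=$2}
     END{u=t-f-c-s-b; printf "used: %d kB (%.1f GiB)\n", u, u/1024/1024}' /proc/meminfo
The result should match the 'used' column printed by free.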
At this point it is still not clear what is consuming the 'used' memory.
Phase 2: /proc/meminfo
'/proc/meminfo' reports the system's memory usage; the meaning of each field is documented at the link below:
https://github.com/torvalds/linux/blob/master/Documentation/filesystems/proc.rst#meminfo
cat /proc/meminfo
MemTotal: 394594036 kB
MemFree: 389105524 kB
MemAvailable: 389951424 kB
Buffers: 4276 kB
Cached: 2817564 kB
SwapCached: 0 kB
Active: 687244 kB
Inactive: 2337940 kB
Active(anon): 219916 kB
Inactive(anon): 8600 kB
Active(file): 467328 kB
Inactive(file): 2329340 kB
Unevictable: 17612 kB
Mlocked: 17612 kB
SwapTotal: 0 kB
SwapFree: 0 kB
Dirty: 480 kB
Writeback: 0 kB
AnonPages: 221548 kB
Mapped: 243640 kB
Shmem: 10760 kB
KReclaimable: 282024 kB
Slab: 727528 kB
SReclaimable: 282024 kB
SUnreclaim: 445504 kB
KernelStack: 16432 kB
PageTables: 4552 kB
NFS_Unstable: 0 kB
Bounce: 0 kB
WritebackTmp: 0 kB
CommitLimit: 197297016 kB
Committed_AS: 2100600 kB
VmallocTotal: 34359738367 kB
VmallocUsed: 1302520 kB
VmallocChunk: 0 kB
Percpu: 61760 kB
HardwareCorrupted: 0 kB
AnonHugePages: 0 kB
ShmemHugePages: 0 kB
ShmemPmdMapped: 0 kB
FileHugePages: 0 kB
FilePmdMapped: 0 kB
CmaTotal: 0 kB
CmaFree: 0 kB
HugePages_Total: 0
HugePages_Free: 0
HugePages_Rsvd: 0
HugePages_Surp: 0
Hugepagesize: 2048 kB
Hugetlb: 0 kB
DirectMap4k: 517584 kB
DirectMap2M: 8556544 kB
DirectMap1G: 394264576 kB
Memory used by the system falls into two categories: user-space memory and kernel-space memory.
The fields in /proc/meminfo allow a rough breakdown of where memory goes.
Memory consumed by user space:
(Cached + AnonPages + Buffers) + (HugePages_Total * Hugepagesize)
or
(Active + Inactive + Unevictable) + (HugePages_Total * Hugepagesize)
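Both user-space estimates can be computed in one pass over /proc/meminfo; a minimal sketch of the two formulas above (the hugepage term is omitted because HugePages_Total is 0 on this system):
awk '/^Cached:/{c=$2} /^AnonPages:/{a=$2} /^Buffers:/{b=$2}
     /^Active:/{act=$2} /^Inactive:/{ina=$2} /^Unevictable:/{u=$2}
     END{printf "user (Cached+AnonPages+Buffers): %d kB\n", c+a+b
         printf "user (Active+Inactive+Unevictable): %d kB\n", act+ina+u}' /proc/meminfo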
Mapped onto the 'free' output, this user-space consumption roughly corresponds to the buff/cache portion.
Memory consumed by the kernel:
Slab+ VmallocUsed + PageTables + KernelStack + HardwareCorrupted + Bounce + X
· X stands for memory allocated directly through alloc_pages/__get_free_page; /proc/meminfo has no counter for this class, so it is a memory black hole.
· vmalloc: detailed vmalloc information is recorded in /proc/vmallocinfo and can be summed with the command below:
cat /proc/vmallocinfo | grep vmalloc | awk '{ total += $2 } END { printf "Total vmalloc: %.02f MB\n", total/1024/1024 }'
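The known kernel-side items can be summed the same way; a sketch that covers everything in the formula except the untracked X part:
awk '/^Slab:/{s=$2} /^VmallocUsed:/{v=$2} /^PageTables:/{p=$2} /^KernelStack:/{k=$2}
     /^HardwareCorrupted:/{h=$2} /^Bounce:/{b=$2}
     END{printf "kernel (known parts): %d kB\n", s+v+p+k+h+b}' /proc/meminfo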
The main difference between the two releases is in kernel-space memory, yet summing the known counters above shows no significant gap.
Suspicion therefore falls on the X part, but as a memory black hole it leaves no trace to follow.
Could MemAvailable provide a clue?
No: MemAvailable is derived from each zone's free pages plus some memory that can be reclaimed and reused,
so it likewise cannot tell who allocated the pages.
available = vm_zone_stat[NR_FREE_PAGES] + pagecache + reclaimable
For the implementation, see si_mem_available().
Phase 3: /proc/zoneinfo
This file exposes per-NUMA-node / per-memory-zone page metrics.
Its contents look very similar to /proc/meminfo; in fact, meminfo is computed from the zoneinfo counters.
zoneinfo still cannot answer the question at hand,
but it does help clarify the relationship between memory, nodes, and zones.
spanned_pages is the total pages spanned by the zone;
present_pages is physical pages existing within the zone;
reserved_pages includes pages allocated by the bootmem allocator;
managed_pages is present pages managed by the buddy system;
present_pages = spanned_pages - absent_pages(pages in holes);
managed_pages = present_pages - reserved_pages;
The system's total usable memory is the sum of the managed pages of every zone:
total_memory = node0_zone_DMA[managed] + node0_zone_DMA32[managed] + node0_zone_Normal[managed] + node1_zone_Normal[managed]
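This can be cross-checked by summing the managed counter of every zone in /proc/zoneinfo and comparing the result with MemTotal (a sketch, assuming 4 KB pages):
awk '/^ *managed/ {pages += $2} END {printf "managed: %d pages = %d kB\n", pages, pages*4}' /proc/zoneinfo
grep MemTotal /proc/meminfo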
Node 0, zone DMA
per-node stats
nr_inactive_anon 2117
nr_active_anon 38986
nr_inactive_file 545121
nr_active_file 98412
nr_unevictable 3141
nr_slab_reclaimable 59505
nr_slab_unreclaimable 84004
nr_isolated_anon 0
nr_isolated_file 0
workingset_nodes 0
workingset_refault 0
workingset_activate 0
workingset_restore 0
workingset_nodereclaim 0
nr_anon_pages 39335
nr_mapped 55358
nr_file_pages 648505
nr_dirty 136
nr_writeback 0
nr_writeback_temp 0
nr_shmem 2630
nr_shmem_hugepages 0
nr_shmem_pmdmapped 0
nr_file_hugepages 0
nr_file_pmdmapped 0
nr_anon_transparent_hugepages 0
nr_unstable 0
nr_vmscan_write 0
nr_vmscan_immediate_reclaim 0
nr_dirtied 663519
nr_written 576275
nr_kernel_misc_reclaimable 0
pages free 3840
min 0
low 3
high 6
spanned 4095
present 3993
managed 3840
protection: (0, 1325, 191764, 191764, 191764)
Node 0, zone DMA32
pages free 347116
min 56
low 395
high 734
spanned 1044480
present 429428
managed 347508
protection: (0, 0, 190439, 190439, 190439)
Node 0, zone Normal
pages free 47545626
min 8097
low 56849
high 105601
spanned 49545216
present 49545216
managed 48754582
protection: (0, 0, 0, 0, 0)
nr_free_pages 47545626
nr_zone_inactive_anon 2117
nr_zone_active_anon 38986
nr_zone_inactive_file 545121
nr_zone_active_file 98412
nr_zone_unevictable 3141
nr_zone_write_pending 34
nr_mlock 3141
nr_page_table_pages 872
nr_kernel_stack 10056
nr_bounce 0
nr_zspages 0
nr_free_cma 0
numa_hit 2365759
numa_miss 0
numa_foreign 0
numa_interleave 43664
numa_local 2365148
numa_other 611
Node 1, zone Normal
per-node stats
nr_inactive_anon 33
nr_active_anon 16139
nr_inactive_file 37211
nr_active_file 18441
nr_unevictable 1262
nr_slab_reclaimable 11198
nr_slab_unreclaimable 27613
nr_isolated_anon 0
nr_isolated_file 0
workingset_nodes 0
workingset_refault 0
workingset_activate 0
workingset_restore 0
workingset_nodereclaim 0
nr_anon_pages 16213
nr_mapped 5952
nr_file_pages 56974
nr_dirty 0
nr_writeback 0
nr_writeback_temp 0
nr_shmem 60
nr_shmem_hugepages 0
nr_shmem_pmdmapped 0
nr_file_hugepages 0
nr_file_pmdmapped 0
nr_anon_transparent_hugepages 0
nr_unstable 0
nr_vmscan_write 0
nr_vmscan_immediate_reclaim 0
nr_dirtied 59535
nr_written 37528
nr_kernel_misc_reclaimable 0
pages free 49379629
min 8229
low 57771
high 107313
spanned 50331648
present 50331648
managed 49542579
protection: (0, 0, 0, 0, 0)
Phase 4: could it be reserved memory?
At what stage is memory reserved?
During early boot, before the regular memory-management subsystem is enabled, physical memory is handed out by the special Bootmem allocator, whose lifetime starts at setup_arch() and ends at mem_init().
Memory reservation is completed during this boot phase, and the details can be seen in the dmesg log:
[ 2.938694] Memory: 394552504K/401241140K available (14339K kernel code, 2390K rwdata, 8352K rodata, 2728K init, 4988K bss, 6688636K reserved, 0K cma-reserved)
401241140K == the sum of present_pages over all zones in zoneinfo
394552504K ≈ the sum of managed_pages over all zones in zoneinfo (close, but not an exact match)
The memory at issue is allocated from the buddy system, which rules out the possibility that it was reserved.
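The reservation can also be cross-checked against /proc/zoneinfo: the gap between the present and managed sums should roughly match the reserved figure in the dmesg line above (a sketch, assuming 4 KB pages):
awk '/^ *present/ {p += $2} /^ *managed/ {m += $2}
     END {printf "present: %d kB  managed: %d kB  difference: %d kB\n", p*4, m*4, (p-m)*4}' /proc/zoneinfo
dmesg | grep -m1 'Memory:'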
Phase 5: CONFIG_PAGE_OWNER
The readily available statistics do not help locate the consumer.
Reading the page-allocation path shows that enabling CONFIG_PAGE_OWNER records the owner of each allocated page, including the stack trace at allocation time, so the allocation context can be traced back:
__alloc_pages
-> get_page_from_freelist
-> prep_new_page
-> post_alloc_hook
-> set_page_owner
->
Reference:
https://git.nju.edu.cn/nju/linux/-/blob/master/Documentation/mm/page_owner.rst
Dump the page-owner information at the current moment (the kernel must be built with CONFIG_PAGE_OWNER=y and booted with page_owner=on):
cat /sys/kernel/debug/page_owner > page_owner_full.txt
./page_owner_sort page_owner_full.txt sorted_page_owner.txt
This produces output like the following:
253 times:
Page allocated via order 0, mask 0x2a22(GFP_ATOMIC|__GFP_HIGHMEM|__GFP_NOWARN), pid 1, ts 4292808510 ns, free_ts 0 ns
prep_new_page+0xa6/0xe0
get_page_from_freelist+0x2f8/0x450
__alloc_pages+0x178/0x330
alloc_page_interleave+0x19/0x90
alloc_pages+0xef/0x110
__vmalloc_area_node.constprop.0+0x105/0x280
__vmalloc_node_range+0x74/0xe0
__vmalloc_node+0x4e/0x70
__vmalloc+0x1e/0x20
alloc_large_system_hash+0x264/0x356
futex_init+0x87/0x131
do_one_initcall+0x46/0x1d0
kernel_init_freeable+0x289/0x2f2
kernel_init+0x1b/0x150
ret_from_fork+0x1f/0x30
……
1 times:
Page allocated via order 9, mask 0xcc0(GFP_KERNEL), pid 707, ts 9593710865 ns, free_ts 0 ns
prep_new_page+0xa6/0xe0
get_page_from_freelist+0x2f8/0x450
__alloc_pages+0x178/0x330
__dma_direct_alloc_pages+0x8e/0x120
dma_direct_alloc+0x66/0x2b0
dma_alloc_attrs+0x3e/0x50
irdma_puda_qp_create.constprop.0+0x76/0x4e0 [irdma]
irdma_puda_create_rsrc+0x26d/0x560 [irdma]
irdma_initialize_ieq+0xae/0xe0 [irdma]
irdma_rt_init_hw+0x2a3/0x580 [irdma]
i40iw_open+0x1c3/0x320 [irdma]
i40e_client_subtask+0xc3/0x140 [i40e]
i40e_service_task+0x2af/0x680 [i40e]
process_one_work+0x228/0x3d0
worker_thread+0x4d/0x3f0
kthread+0x127/0x150
It is easy to see, for each allocation context, how many times pages were allocated, the page order, the pid, and so on.
Aggregating the allocations by order (as sketched below) shows that on 5.15.0 the order-9 allocations (512 * 4 KB pages each) stand out; the corresponding stacks are almost all related to irdma.
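The per-order summary below can be generated from the sorted dump with a short script; a sketch (the exact text format emitted by page_owner_sort may vary between kernel versions):
awk '/times:/ { n = $1 }
     /Page allocated via order/ {
         split($5, a, ",")            # $5 is "<order>," -> strip the trailing comma
         o = a[1]
         count[o] += n                # n allocations were recorded for this stack
         pages[o] += n * 2^o          # each allocation is 2^o pages
     }
     END {
         for (o = 0; o <= 10; o++)
             printf "order: %d, times: %d, memory: %d KB\n", o, count[o], pages[o] * 4
     }' sorted_page_owner.txt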
5.15.0:
order: 0, times: 2107310, memory: 8429240 KB, 8231 MB
order: 1, times: 8110, memory: 64880 KB, 63 MB
order: 2, times: 1515, memory: 24240 KB, 23 MB
order: 3, times: 9671, memory: 309472 KB, 302 MB
order: 4, times: 101, memory: 6464 KB, 6 MB
order: 5, times: 33, memory: 4224 KB, 4 MB
order: 6, times: 5, memory: 1280 KB, 1 MB
order: 7, times: 9, memory: 4608 KB, 4 MB
order: 8, times: 3, memory: 3072 KB, 3 MB
order: 9, times: 1426, memory: 2920448 KB, 2852 MB
order: 10, times: 3, memory: 12288 KB, 12 MB
all memory: 11780216 KB 11 GB

5.4.0:
order: 0, times: 1218829, memory: 4875316 KB, 4761 MB
order: 1, times: 12370, memory: 98960 KB, 96 MB
order: 2, times: 1825, memory: 29200 KB, 28 MB
order: 3, times: 6834, memory: 218688 KB, 213 MB
order: 4, times: 110, memory: 7040 KB, 6 MB
order: 5, times: 17, memory: 2176 KB, 2 MB
order: 6, times: 0, memory: 0 KB, 0 MB
order: 7, times: 2, memory: 1024 KB, 1 MB
order: 8, times: 0, memory: 0 KB, 0 MB
order: 9, times: 0, memory: 0 KB, 0 MB
order: 10, times: 0, memory: 0 KB, 0 MB
all memory: 5232404 KB 4 GB
Fix
As far as we know, no current workload uses the RDMA functionality, so irdma.ko does not need to be loaded at system initialization. Searching the OS for the keyword shows no explicit instruction that loads irdma;
only a few aliases defined for the irdma module were found:
alias i40iw irdma
alias auxiliary:ice.roce irdma
alias auxiliary:ice.iwarp irdma
alias auxiliary:i40e.iwarp irdma
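These lines come from modules.alias, which is generated from the module's MODULE_ALIAS declarations; they can be inspected with, for example:
# modinfo irdma | grep alias
# grep -w irdma /lib/modules/$(uname -r)/modules.alias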
How does irdma get loaded automatically?
It turns out that manually loading the i40e NIC driver pulls in irdma as well:
# modprobe -v i40e
insmod /lib/modules/5.15.0-26-generic/kernel/drivers/net/ethernet/intel/i40e/i40e.ko
Which program brings irdma up?
Use tracepoints to monitor the kernel's module events:
# cd /sys/kernel/debug/tracing
# tree -d events/module/
events/module/
├── module_free
├── module_get
├── module_load
├── module_put
└── module_request
5 directories
# echo 1 > tracing_on
# echo 1 > ./events/module/enable
# modprobe i40e
# cat trace
……
modprobe-11202 [068] ..... 24358.270849: module_put: i40e call_site=do_init_module refcnt=1
systemd-udevd-11444 [035] ..... 24358.275116: module_get: ib_uverbs call_site=resolve_symbol refcnt=2
systemd-udevd-11444 [035] ..... 24358.275130: module_get: ib_core call_site=resolve_symbol refcnt=3
systemd-udevd-11444 [035] ..... 24358.275185: module_get: ice call_site=resolve_symbol refcnt=2
systemd-udevd-11444 [035] ..... 24358.275247: module_get: i40e call_site=resolve_symbol refcnt=2
systemd-udevd-11444 [009] ..... 24358.295650: module_load: irdma
systemd-udevd-11444 [009] .N... 24358.295730: module_put: irdma call_site=do_init_module refcnt=1
……
The trace log shows that the systemd-udevd service loaded irdma; running the daemon in debug mode gives its detailed logs:
/lib/systemd/systemd-udevd --debug
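Alternatively, the log level of the already-running daemon can be raised and followed without restarting it (a suggestion, not part of the original investigation):
# udevadm control --log-priority=debug
# journalctl -u systemd-udevd -f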
Why does systemd-udevd load irdma?
We need to go back to the kernel and look at the i40e probe flow.
This involves the concept of the Auxiliary Bus.
When the i40e driver is loaded, it creates the auxiliary device i40e.iwarp and sends a uevent to notify user space.
systemd-udevd receives the event and loads the module matching i40e.iwarp; because i40e.iwarp is an alias of irdma, the module actually loaded is irdma.
i40e_probe
->i40e_lan_add_device()
->i40e_client_add_instance(pf);
->i40e_register_auxiliary_dev(&cdev->lan_info, "iwarp")
->auxiliary_device_add(aux_dev);
->dev_set_name(dev, "%s.%s.%d", modname, auxdev->name, auxdev->id); //i40e.iwarp.0, i40e.iwarp.1
->device_add(dev);
->kobject_uevent(&dev->kobj, KOBJ_ADD);
->kobject_uevent_env(kobj, action, NULL); // send an uevent with environmental data
->bus_probe_device(dev);
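The resulting ADD uevent can also be observed from user space while the driver loads, for example with udevadm:
# udevadm monitor --kernel --property &
# modprobe i40e
The add event for the auxiliary device should carry MODALIAS=auxiliary:i40e.iwarp, which is the alias systemd-udevd hands to modprobe.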
How can irdma be prevented from loading automatically?
Add a line to /etc/modprobe.d/blacklist.conf so that irdma is not loaded through alias expansion:
# This file lists those modules which we don't want to be loaded by
# alias expansion, usually so some other driver will be loaded for the
# device instead.
blacklist irdma
This does not affect loading irdma manually with modprobe.
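After updating the configuration (and regenerating the initramfs with update-initramfs -u if i40e is loaded from the initramfs), the effect can be verified by reloading the NIC driver:
# modprobe -r irdma i40e
# modprobe i40e
# lsmod | grep -E 'i40e|irdma'
i40e should now be listed without irdma.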
Summary:
The server uses an Intel X722 NIC, which supports iWARP. When the NIC driver is loaded, the i40e driver registers an auxiliary iwarp device and sends a uevent to user space.
The systemd-udevd service receives the uevent and loads irdma.ko; during initialization irdma sets up DMA mappings and allocates large chunks of memory through __alloc_pages().
Intel merged a patch into the 5.14 kernel that replaced the iWARP driver with irdma and removed the old i40iw driver.
The 5.4.0 kernel does not auto-load the i40iw driver; loading i40iw manually would likewise consume roughly 3 GB of memory.