Case Analysis - 5.15.0 Kernel Memory Issue

2023-04-18 静默梧桐
Category: Linux

Issue

A customer reported that Ubuntu 22.04 consumes more memory than Ubuntu 20.04.

 

free -lh

Ubuntu 22.04 / 5.15.0

                    total        used        free      shared  buff/cache   available

Mem:          376Gi       5.0Gi       368Gi        10Mi       2.5Gi       368Gi

Low:           376Gi       7.5Gi       368Gi

High:             0B          0B          0B

Swap:             0B          0B          0B

Ubuntu 20.04 / 5.4.0

                    total        used        free      shared  buff/cache   available

Mem:        376Gi       2.3Gi       371Gi        10Mi       3.0Gi       371Gi

Low:          376Gi       5.2Gi       371Gi

High:           0B          0B          0B

Swap:          0B          0B          0B

free shows 2.3G of 'used' memory on Ubuntu 20.04 versus roughly 4.9G on Ubuntu 22.04.

Why does 22.04 use an extra 2.6G of memory?

Analysis

Phase 1: How is 'used' calculated?

The procps source shows that 'used' is calculated by reading and parsing /proc/meminfo:

# cat /proc/meminfo

MemTotal:       394594036 kB

MemFree:        389106200 kB

MemAvailable:   389952084 kB

Buffers:            4276 kB

Cached:          2817564 kB

SwapCached:            0 kB

 

SReclaimable:     281992 kB

 

 

The calculation subtracts free, cached (page cache plus reclaimable slab), and buffers from total:

kb_main_cached = kb_page_cache + kb_slab_reclaimable;

mem_used = kb_main_total - kb_main_free - kb_main_cached - kb_main_buffers;

 

Which is equivalent to:

used = MemTotal - MemFree - Cached - SReclaimable - Buffers
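For reference, the same value can be recomputed directly from /proc/meminfo with a quick awk pass (a minimal sketch that mirrors the procps formula above):

awk '/^MemTotal:/     {total=$2}
     /^MemFree:/      {free=$2}
     /^Buffers:/      {buffers=$2}
     /^Cached:/       {cached=$2}
     /^SReclaimable:/ {sreclaimable=$2}
     END {printf "used: %d kB\n", total - free - cached - sreclaimable - buffers}' /proc/meminfo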

 

At this point it is still unclear which program is occupying the 'used' memory.

Phase 2: /proc/meminfo

/proc/meminfo reports the system's memory usage; the meaning of each field is documented at the link below.

https://github.com/torvalds/linux/blob/master/Documentation/filesystems/proc.rst#meminfo

 

cat /proc/meminfo

MemTotal:       394594036 kB

MemFree:        389105524 kB

MemAvailable:   389951424 kB

Buffers:            4276 kB

Cached:          2817564 kB

SwapCached:            0 kB

Active:           687244 kB

Inactive:        2337940 kB

Active(anon):     219916 kB

Inactive(anon):     8600 kB

Active(file):     467328 kB

Inactive(file):  2329340 kB

Unevictable:       17612 kB

Mlocked:           17612 kB

SwapTotal:             0 kB

SwapFree:              0 kB

Dirty:               480 kB

Writeback:             0 kB

AnonPages:        221548 kB

Mapped:           243640 kB

Shmem:             10760 kB

KReclaimable:     282024 kB

Slab:             727528 kB

SReclaimable:     282024 kB

SUnreclaim:       445504 kB

KernelStack:       16432 kB

PageTables:         4552 kB

NFS_Unstable:          0 kB

Bounce:                0 kB

WritebackTmp:          0 kB

CommitLimit:    197297016 kB

Committed_AS:    2100600 kB

VmallocTotal:   34359738367 kB

VmallocUsed:     1302520 kB

VmallocChunk:          0 kB

Percpu:            61760 kB

HardwareCorrupted:     0 kB

AnonHugePages:         0 kB

ShmemHugePages:        0 kB

ShmemPmdMapped:        0 kB

FileHugePages:         0 kB

FilePmdMapped:         0 kB

CmaTotal:              0 kB

CmaFree:               0 kB

HugePages_Total:       0

HugePages_Free:        0

HugePages_Rsvd:        0

HugePages_Surp:        0

Hugepagesize:       2048 kB

Hugetlb:               0 kB

DirectMap4k:      517584 kB

DirectMap2M:     8556544 kB

DirectMap1G:    394264576 kB

 

Memory used by the system falls into two categories: user-space memory and kernel-space memory.

 

From the individual /proc/meminfo fields, a rough breakdown of where memory goes can be worked out:

 

User-space memory consumption:

(Cached + AnonPages + Buffers) + (HugePages_Total * Hugepagesize)

or

(Active + Inactive + Unevictable) + (HugePages_Total * Hugepagesize)

 

Mapped onto the 'free' output, the file-cache part of this (Buffers + Cached) is what shows up in the buff/cache column.
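Both estimates can be cross-checked with a small script over /proc/meminfo (a sketch; the hugepage terms are dropped here because HugePages_Total is 0 on this system):

awk '/^Cached:/      {cached=$2}
     /^Buffers:/     {buffers=$2}
     /^AnonPages:/   {anon=$2}
     /^Active:/      {active=$2}
     /^Inactive:/    {inactive=$2}
     /^Unevictable:/ {unevict=$2}
     END {
       printf "estimate 1: %d kB\n", cached + anon + buffers
       printf "estimate 2: %d kB\n", active + inactive + unevict
     }' /proc/meminfo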

 

Kernel-space memory consumption:

 

Slab + VmallocUsed + PageTables + KernelStack + HardwareCorrupted + Bounce + X

·      X stands for memory allocated directly via alloc_pages/__get_free_page; /proc/meminfo has no counter for this kind of allocation, so it is a memory black hole.

·      vmalloc: detailed vmalloc information is recorded in /proc/vmallocinfo and can be totaled with the command below:

cat /proc/vmallocinfo | grep vmalloc | awk '{ total += $2 } END { printf "Total vmalloc: %.02f MB\n", total/1024/1024 }'
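The kernel-side counters that /proc/meminfo does expose can be summed the same way (a sketch; the X term is, by definition, not included):

awk '/^Slab:/              {slab=$2}
     /^VmallocUsed:/       {vmalloc=$2}
     /^PageTables:/        {pt=$2}
     /^KernelStack:/       {kstack=$2}
     /^HardwareCorrupted:/ {hw=$2}
     /^Bounce:/            {bounce=$2}
     END {printf "known kernel-side: %d kB\n", slab + vmalloc + pt + kstack + hw + bounce}' /proc/meminfo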

 

The main difference in consumption between the two kernels is on the kernel side, yet summing the known counters shows no significant gap.

Suspicion therefore falls on the X part, but being a memory black hole it leaves no trail to follow.



Could MemAvailable provide a clue?

 

The answer is no. MemAvailable is derived from the free pages of each zone, plus some memory that can be reclaimed and reused.

It likewise cannot tell us who allocated the pages.

 

available = vm_zone_stat[NR_FREE_PAGES] + pagecache + reclaimable

 

See si_mem_available() for the implementation.
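A very rough user-space approximation of that formula, ignoring the zone-watermark and reserve terms that si_mem_available() subtracts, could look like this sketch:

awk '/^MemFree:/          {free=$2}
     /^Active\(file\):/   {af=$2}
     /^Inactive\(file\):/ {inf=$2}
     /^SReclaimable:/     {sr=$2}
     END {printf "approx. available: %d kB\n", free + (af + inf)/2 + sr/2}' /proc/meminfo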

Phase 3: /proc/zoneinfo

This file exposes page metrics per NUMA node / per memory zone.

Its contents look very similar to /proc/meminfo; in fact, meminfo is computed from these zoneinfo counters.

zoneinfo still does not solve the problem at hand.

It does, however, help in understanding the relationship between memory nodes and zones.

spanned_pages is the total pages spanned by the zone;

present_pages is physical pages existing within the zone;

reserved_pages includes pages allocated by the bootmem allocator;

managed_pages is present pages managed by the buddy system;

 

present_pages = spanned_pages - absent_pages(pages in holes);

managed_pages = present_pages - reserved_pages;

 

The total usable memory of the system is the sum of managed pages across all zones:

total_memory = node0_zone_DMA[managed] + node0_zone_DMA32[managed] + node0_zone_Normal[managed] + node1_zone_Normal[managed]
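This is easy to verify on a live system (a sketch assuming 4 KiB pages):

# sum managed pages over all zones, convert to kB, and compare with MemTotal
awk '$1 == "managed" {sum += $2} END {printf "managed total: %d kB\n", sum * 4}' /proc/zoneinfo
grep MemTotal /proc/meminfo

On the data below, the managed pages sum to 98648509 pages, i.e. 394594036 kB, which matches MemTotal in /proc/meminfo.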

 

Node 0, zone      DMA

  per-node stats

      nr_inactive_anon 2117

      nr_active_anon 38986

      nr_inactive_file 545121

      nr_active_file 98412

      nr_unevictable 3141

      nr_slab_reclaimable 59505

      nr_slab_unreclaimable 84004

      nr_isolated_anon 0

      nr_isolated_file 0

      workingset_nodes 0

      workingset_refault 0

      workingset_activate 0

      workingset_restore 0

      workingset_nodereclaim 0

      nr_anon_pages 39335

      nr_mapped    55358

      nr_file_pages 648505

      nr_dirty     136

      nr_writeback 0

      nr_writeback_temp 0

      nr_shmem     2630

      nr_shmem_hugepages 0

      nr_shmem_pmdmapped 0

      nr_file_hugepages 0

      nr_file_pmdmapped 0

      nr_anon_transparent_hugepages 0

      nr_unstable  0

      nr_vmscan_write 0

      nr_vmscan_immediate_reclaim 0

      nr_dirtied   663519

      nr_written   576275

      nr_kernel_misc_reclaimable 0

  pages free     3840

        min      0

        low      3

        high     6

        spanned  4095

        present  3993

        managed  3840

        protection: (0, 1325, 191764, 191764, 191764)

Node 0, zone    DMA32

  pages free     347116

        min      56

        low      395

        high     734

        spanned  1044480

        present  429428

        managed  347508

        protection: (0, 0, 190439, 190439, 190439)

 

 

Node 0, zone   Normal

  pages free     47545626

        min      8097

        low      56849

        high     105601

        spanned  49545216

        present  49545216

        managed  48754582

        protection: (0, 0, 0, 0, 0)

      nr_free_pages 47545626

      nr_zone_inactive_anon 2117

      nr_zone_active_anon 38986

      nr_zone_inactive_file 545121

      nr_zone_active_file 98412

      nr_zone_unevictable 3141

      nr_zone_write_pending 34

      nr_mlock     3141

      nr_page_table_pages 872

      nr_kernel_stack 10056

      nr_bounce    0

      nr_zspages   0

      nr_free_cma  0

      numa_hit     2365759

      numa_miss    0

      numa_foreign 0

      numa_interleave 43664

      numa_local   2365148

      numa_other   611

Node 1, zone   Normal

  per-node stats

      nr_inactive_anon 33

      nr_active_anon 16139

      nr_inactive_file 37211

      nr_active_file 18441

      nr_unevictable 1262

      nr_slab_reclaimable 11198

      nr_slab_unreclaimable 27613

      nr_isolated_anon 0

      nr_isolated_file 0

      workingset_nodes 0

      workingset_refault 0

      workingset_activate 0

      workingset_restore 0

      workingset_nodereclaim 0

      nr_anon_pages 16213

      nr_mapped    5952

      nr_file_pages 56974

      nr_dirty     0

      nr_writeback 0

      nr_writeback_temp 0

      nr_shmem     60

      nr_shmem_hugepages 0

      nr_shmem_pmdmapped 0

      nr_file_hugepages 0

      nr_file_pmdmapped 0

      nr_anon_transparent_hugepages 0

      nr_unstable  0

      nr_vmscan_write 0

      nr_vmscan_immediate_reclaim 0

      nr_dirtied   59535

      nr_written   37528

      nr_kernel_misc_reclaimable 0

  pages free     49379629

        min      8229

        low      57771

        high     107313

        spanned  50331648

        present  50331648

        managed  49542579

        protection: (0, 0, 0, 0, 0)

 

Phase 4: Could it be reserved memory?

 

At what stage is memory reserved?

 

During the early boot stage, before the regular memory-management subsystem is enabled, physical-memory allocation is handled by a special allocator, Bootmem. Its lifetime starts in setup_arch() and ends in mem_init().

 

Memory reservation is completed during this boot stage, and the details show up in the dmesg log:

[    2.938694] Memory: 394552504K/401241140K available (14339K kernel code, 2390K rwdata, 8352K rodata, 2728K init, 4988K bss, 6688636K reserved, 0K cma-reserved)

 

401241140K == the sum of present_pages over all zones in zoneinfo

394552504K ≈ the sum of managed_pages over all zones in zoneinfo (the managed total seen at run time is slightly larger, mainly because init memory is freed back to the buddy allocator after this message is printed)
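Both sums can be extracted and placed next to the dmesg line (a sketch assuming 4 KiB pages):

awk '$1 == "present" {p += $2} $1 == "managed" {m += $2}
     END {printf "present: %d kB  managed: %d kB  difference: %d kB\n", p*4, m*4, (p-m)*4}' /proc/zoneinfo
dmesg | grep 'Memory:'

On the data above, the present total works out to 401241140 kB, exactly the second number in the dmesg line.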

 

The memory involved in this issue is allocated from the buddy system, which rules out reserved memory as the cause.

 

Phase 5: CONFIG_PAGE_OWNER

The readily available system statistics do not help in pinpointing the problem.

Reading through the page-allocation path shows that enabling CONFIG_PAGE_OWNER makes the kernel record the owner of every page, including the stack trace at allocation time, so the allocation context can be traced back afterwards.

__alloc_pages

-> get_page_from_freelist

            -> prep_new_page

                        -> post_alloc_hook

                                    -> set_page_owner


Reference:

https://git.nju.edu.cn/nju/linux/-/blob/master/Documentation/mm/page_owner.rst
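Per the documentation above, building with CONFIG_PAGE_OWNER=y is not sufficient on its own; the feature also has to be switched on at boot, and the sort helper is built from the kernel source tree (a sketch of the usual steps for a 5.15 tree):

# 1. kernel built with CONFIG_PAGE_OWNER=y
# 2. boot with the feature enabled by adding to the kernel command line:
#        page_owner=on
# 3. build the sorting helper that ships with the kernel source (tools/vm in 5.15):
cd tools/vm && make page_owner_sort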

 

Dump the page-owner information at the current point in time:

cat /sys/kernel/debug/page_owner > page_owner_full.txt

./page_owner_sort page_owner_full.txt sorted_page_owner.txt

 

This produces output like the following:

253 times:

Page allocated via order 0, mask 0x2a22(GFP_ATOMIC|__GFP_HIGHMEM|__GFP_NOWARN), pid 1, ts 4292808510 ns, free_ts 0 ns

 prep_new_page+0xa6/0xe0

 get_page_from_freelist+0x2f8/0x450

 __alloc_pages+0x178/0x330

 alloc_page_interleave+0x19/0x90

 alloc_pages+0xef/0x110

 __vmalloc_area_node.constprop.0+0x105/0x280

 __vmalloc_node_range+0x74/0xe0

 __vmalloc_node+0x4e/0x70

 __vmalloc+0x1e/0x20

 alloc_large_system_hash+0x264/0x356

 futex_init+0x87/0x131

 do_one_initcall+0x46/0x1d0

 kernel_init_freeable+0x289/0x2f2

 kernel_init+0x1b/0x150

 ret_from_fork+0x1f/0x30

 

……

 

1 times:

Page allocated via order 9, mask 0xcc0(GFP_KERNEL), pid 707, ts 9593710865 ns, free_ts 0 ns

 prep_new_page+0xa6/0xe0

 get_page_from_freelist+0x2f8/0x450

 __alloc_pages+0x178/0x330

 __dma_direct_alloc_pages+0x8e/0x120

 dma_direct_alloc+0x66/0x2b0

 dma_alloc_attrs+0x3e/0x50

 irdma_puda_qp_create.constprop.0+0x76/0x4e0 [irdma]

 irdma_puda_create_rsrc+0x26d/0x560 [irdma]

 irdma_initialize_ieq+0xae/0xe0 [irdma]

 irdma_rt_init_hw+0x2a3/0x580 [irdma]

 i40iw_open+0x1c3/0x320 [irdma]

 i40e_client_subtask+0xc3/0x140 [i40e]

 i40e_service_task+0x2af/0x680 [i40e]

 process_one_work+0x228/0x3d0

 worker_thread+0x4d/0x3f0

 kthread+0x127/0x150

For each allocation context it is easy to see how many times pages were allocated there, at what order, by which pid, and so on.

 

Aggregating the page allocations by order, the order-9 allocations (2^9 pages = 512 × 4K = 2 MB each) on 5.15.0 stand out as abnormal. Looking at the corresponding stacks, almost all of them are related to irdma.

5.15.0:

order: 0, times: 2107310, memory: 8429240 KB, 8231 MB
order: 1, times: 8110, memory: 64880 KB, 63 MB
order: 2, times: 1515, memory: 24240 KB, 23 MB
order: 3, times: 9671, memory: 309472 KB, 302 MB
order: 4, times: 101, memory: 6464 KB, 6 MB
order: 5, times: 33, memory: 4224 KB, 4 MB
order: 6, times: 5, memory: 1280 KB, 1 MB
order: 7, times: 9, memory: 4608 KB, 4 MB
order: 8, times: 3, memory: 3072 KB, 3 MB
order: 9, times: 1426, memory: 2920448 KB, 2852 MB
order: 10, times: 3, memory: 12288 KB, 12 MB
all memory: 11780216 KB 11 GB

5.4.0:

order: 0, times: 1218829, memory: 4875316 KB, 4761 MB
order: 1, times: 12370, memory: 98960 KB, 96 MB
order: 2, times: 1825, memory: 29200 KB, 28 MB
order: 3, times: 6834, memory: 218688 KB, 213 MB
order: 4, times: 110, memory: 7040 KB, 6 MB
order: 5, times: 17, memory: 2176 KB, 2 MB
order: 6, times: 0, memory: 0 KB, 0 MB
order: 7, times: 2, memory: 1024 KB, 1 MB
order: 8, times: 0, memory: 0 KB, 0 MB
order: 9, times: 0, memory: 0 KB, 0 MB
order: 10, times: 0, memory: 0 KB, 0 MB
all memory: 5232404 KB 4 GB
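The per-order totals above can be reproduced with a small awk pass over the raw dump (a sketch; it keys off the "Page allocated via order N, ..." header lines shown earlier and assumes a 4 KiB base page):

awk '/^Page allocated via order/ {
        gsub(",", "", $5)          # strip the trailing comma from the order field
        times[$5]++
        kb[$5] += 4 * 2^$5
     }
     END {
        for (o = 0; o <= 10; o++)
            printf "order: %d, times: %d, memory: %d KB\n", o, times[o], kb[o]
     }' page_owner_full.txt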

Fix

As far as we know, no current workload uses the RDMA feature, so irdma.ko does not need to be loaded at system initialization. Searching the OS for the keyword, however, turns up no obvious directive that loads irdma.

 

The only findings are a few aliases defined for the irdma module:

alias i40iw irdma

alias auxiliary:ice.roce irdma

alias auxiliary:ice.iwarp irdma

alias auxiliary:i40e.iwarp irdma
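These aliases come from the module's metadata and the depmod-generated alias database, which can be inspected roughly like this (a sketch; output formats differ slightly between the two commands):

# aliases declared by the module itself
modinfo irdma | grep '^alias'
# the alias database consulted by modprobe/udev
grep -w irdma /lib/modules/$(uname -r)/modules.alias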

 


How does irdma get loaded automatically?

It turns out that manually loading the i40e NIC driver pulls irdma in as well.

# modprobe -v i40e

insmod /lib/modules/5.15.0-26-generic/kernel/drivers/net/ethernet/intel/i40e/i40e.ko

Which program brings irdma up?

 

Tracepoints can be used to monitor the kernel's behavior:

# cd /sys/kernel/debug/tracing

# tree -d events/module/

events/module/

├── module_free

├── module_get

├── module_load

├── module_put

└── module_request

 

5 directories

 

# echo 1 > tracing_on

# echo 1 > ./events/module/enable

# modprobe i40e

 

# cat trace

……

   modprobe-11202   [068] ..... 24358.270849: module_put: i40e call_site=do_init_module refcnt=1

   systemd-udevd-11444   [035] ..... 24358.275116: module_get: ib_uverbs call_site=resolve_symbol refcnt=2

   systemd-udevd-11444   [035] ..... 24358.275130: module_get: ib_core call_site=resolve_symbol refcnt=3

   systemd-udevd-11444   [035] ..... 24358.275185: module_get: ice call_site=resolve_symbol refcnt=2

   systemd-udevd-11444   [035] ..... 24358.275247: module_get: i40e call_site=resolve_symbol refcnt=2

   systemd-udevd-11444   [009] ..... 24358.295650: module_load: irdma

   systemd-udevd-11444   [009] .N... 24358.295730: module_put: irdma call_site=do_init_module refcnt=1

……


The trace log shows that the systemd-udevd service loaded irdma. Running udevd in debug mode shows its detailed logs:

/lib/systemd/systemd-udevd --debug

 

Why does systemd-udevd load irdma?

To answer that we have to go back to the kernel side and look at the i40e probe flow.

 

This involves a concept called the Auxiliary Bus.

 

Loading the i40e driver creates the auxiliary device i40e.iwarp and sends a uevent to notify user space.

When systemd-udevd receives the event it loads the module matching i40e.iwarp; since i40e.iwarp is an alias of irdma, the module actually loaded is irdma.

i40e_probe

->i40e_lan_add_device()

    ->i40e_client_add_instance(pf);

        ->i40e_register_auxiliary_dev(&cdev->lan_info, "iwarp")

            ->auxiliary_device_add(aux_dev);

                ->dev_set_name(dev, "%s.%s.%d", modname, auxdev->name, auxdev->id);    //i40e.iwarp.0, i40e.iwarp.1

                ->device_add(dev);

                    ->kobject_uevent(&dev->kobj, KOBJ_ADD);

                        ->kobject_uevent_env(kobj, action, NULL); // send an uevent with environmental data

                    ->bus_probe_device(dev);
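The uevent and the resulting auxiliary device can also be observed from user space (a sketch; the exact device names, e.g. i40e.iwarp.0, depend on the adapter):

# watch kernel uevents while loading the driver
udevadm monitor --kernel --property &
modprobe i40e
# auxiliary devices created by the driver show up on the auxiliary bus
ls /sys/bus/auxiliary/devices/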


How can automatic loading of irdma be prevented?

 

Add a line to /etc/modprobe.d/blacklist.conf to stop irdma from being loaded through alias expansion:

# This file lists those modules which we don't want to be loaded by

# alias expansion, usually so some other driver will be loaded for the

# device instead.

 

blacklist irdma

This does not affect loading irdma manually with modprobe.
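A quick sanity check (a sketch): the blacklist entry only suppresses alias-based loading, so the module can still be loaded explicitly by its real name.

# the blacklist entry appears in the merged modprobe configuration
modprobe -c | grep -w irdma
# explicit loading by the real module name is unaffected
modprobe -v irdma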

Summary:

The NIC used on the server is an Intel X722, which supports iWARP. When the NIC driver is loaded, the i40e driver registers an auxiliary iwarp device and sends a uevent to user space.

After receiving the uevent, the systemd-udevd service loads irdma.ko; during initialization irdma sets up DMA mappings and allocates large chunks of memory via __alloc_pages().

 

Intel merged a patch into the 5.14 kernel that replaced the iWARP driver with irdma and removed the old i40iw driver.
The 5.4.0 kernel does not load the i40iw driver automatically; loading i40iw there manually consumes roughly 3G of memory as well.
