Scenario

Compare the network performance of kernel 5.4.0-96 and kernel 5.15.0-26.

Test tool and command

netperf -H 10.15.198.37 -t TCP_RR -v 2 -- -r 1024 1024

Test results

The TCP_RR results show that 5.15.0 performs 3% worse than 5.4.0.
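As a small illustration of how such a gap can be computed, the sketch below takes two TCP_RR transaction rates (the "Trans. Rate per sec" column that netperf prints) and returns the relative difference in percent. The function name and the sample rates are hypothetical; the original measurements are not recorded here.

```python
def relative_gap_percent(baseline_rate, candidate_rate):
    """Relative difference of candidate vs. baseline, in percent.

    A negative result means the candidate kernel completed fewer
    transactions per second than the baseline kernel.
    """
    if baseline_rate <= 0:
        raise ValueError("baseline_rate must be positive")
    return (candidate_rate - baseline_rate) * 100.0 / baseline_rate


# Hypothetical transaction rates, for illustration only.
EXAMPLE_BASELINE = 10000.0   # e.g. the 5.4.0 node
EXAMPLE_CANDIDATE = 9700.0   # e.g. the 5.15.0 node
```

Averaging several runs per kernel before comparing would make a gap of a few percent more trustworthy, since single TCP_RR runs can vary.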

Comparing the NIC parameters on the two nodes shows that the channel and ring settings are identical.
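The notes do not say how the settings were compared. One possible approach, assuming the values come from `ethtool -l <dev>` (channels) and `ethtool -g <dev>` (rings), is to parse the "Current hardware settings" block on each node and diff the results. The helper names and the sample output below are illustrative, not taken from the original nodes.

```python
def current_settings(ethtool_output):
    """Parse the 'Current hardware settings' block of `ethtool -l` or
    `ethtool -g` output into a dict of setting name -> value string."""
    settings = {}
    in_current = False
    for line in ethtool_output.splitlines():
        stripped = line.strip()
        if stripped.startswith("Current hardware settings"):
            in_current = True
            continue
        if in_current and ":" in stripped:
            key, _, value = stripped.partition(":")
            settings[key.strip()] = value.strip()
    return settings


def differing_keys(node_a, node_b):
    """Return the sorted setting names whose values differ between two nodes."""
    keys = sorted(set(node_a) | set(node_b))
    return [k for k in keys if node_a.get(k) != node_b.get(k)]


# Hypothetical `ethtool -l eth0` output, for illustration only.
SAMPLE_CHANNELS = (
    "Channel parameters for eth0:\n"
    "Pre-set maximums:\n"
    "RX:\t\t0\n"
    "TX:\t\t0\n"
    "Other:\t\t1\n"
    "Combined:\t64\n"
    "Current hardware settings:\n"
    "RX:\t\t0\n"
    "TX:\t\t0\n"
    "Other:\t\t1\n"
    "Combined:\t8\n"
)
```

An empty list from `differing_keys` corresponds to the "identical" result described above.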

Perf analysis

Capture a profile with perf to take a closer look:

perf record -F 99 -ag netperf -H 10.15.198.98 -t TCP_RR -v 2 -- -r 1024 1024

perf report -i perf.data --no-children

Generate a flame graph:

perf script -i perf.data > perf.unfold

FlameGraph/stackcollapse-perf.pl perf.unfold > perf.folded

FlameGraph/flamegraph.pl perf.folded > perf.svg
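Besides reading the SVG by eye, the folded file can be queried directly. The sketch below assumes the usual folded-stack format produced by stackcollapse-perf.pl (semicolon-separated frames, then a space and a sample count) and computes the fraction of samples whose stack contains a given substring. The function name and the sample stacks are made up for illustration.

```python
def frame_share(folded_text, needle):
    """Fraction of samples whose stack has a frame containing `needle`.

    Each folded line looks like 'comm;frame1;frame2 COUNT'.
    """
    total = 0
    hit = 0
    for line in folded_text.splitlines():
        line = line.strip()
        if not line:
            continue
        stack, _, count = line.rpartition(" ")
        if not count.isdigit():
            continue  # skip anything that is not a folded-stack line
        samples = int(count)
        total += samples
        if any(needle in frame for frame in stack.split(";")):
            hit += samples
    return hit / total if total else 0.0


# Hypothetical folded stacks, for illustration only.
SAMPLE_FOLDED = (
    "netperf;main;send_tcp_rr;tcp_sendmsg 6\n"
    "swapper;i40e_napi_poll;i40e_clean_tx_irq;iommu_dma_unmap_page 1\n"
    "swapper;i40e_napi_poll;i40e_clean_rx_irq 3\n"
)
```

Running the same query (for example with `i40e_napi_poll` or `iommu` as the needle) against the folded files from both kernels gives a number to compare instead of a visual impression.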

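The flame graph shows considerable time spent in i40e_napi_poll, with OVS taking a large share of it.

Listing the ports on each node shows that the slower 5.15.0 node has an OVS interface, while the 5.4.0 node has only the eth0 interface.

After deleting the OVS interface, run the test again.

Performance on 5.15.0 improves, but a 1.5% gap still remains.

Capture another profile with perf.

A quick look at perf report turns up something odd: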

perf report -i perf.data --no-children

There are some IOMMU hotspots, whereas the 5.4 profile shows no trace of IOMMU at all:

0.53% netperf [kernel.kallsyms] [k] intel_iommu_iotlb_sync_map
    ---intel_iommu_iotlb_sync_map
       _iommu_map
       iommu_map_atomic
       __iommu_dma_map
       __iommu_dma_map_swiotlb.constprop.0
       iommu_dma_map_page
       dma_map_page_attrs
       i40e_xmit_frame_ring
       i40e_lan_xmit_frame
       dev_hard_start_xmit
       sch_direct_xmit
       __dev_queue_xmit
       dev_queue_xmit
       ip_finish_output2
       __ip_finish_output
       ip_finish_output
       ip_output
       ip_local_out
       __ip_queue_xmit
       ip_queue_xmit
       __tcp_transmit_skb
       tcp_write_xmit
       __tcp_push_pending_frames
       tcp_push
       tcp_sendmsg_locked
       tcp_sendmsg
       inet_sendmsg
       sock_sendmsg
       __sys_sendto
       __x64_sys_sendto
       do_syscall_64
       entry_SYSCALL_64_after_hwframe
       0x7f2fd35c4a60
       send_omni_inner
       send_tcp_rr
       main
       0x7f2fd34c6d90
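Spotting symbols that exist in only one profile can also be automated. The sketch below assumes the text output of `perf report --no-children`, where each entry line has the form "overhead% command shared-object [level] symbol", and returns the symbols present in the new report but absent from the old one. The function names are hypothetical, and apart from the intel_iommu_iotlb_sync_map line the sample data is invented.

```python
import re

# Matches entry lines such as:
#   0.53%  netperf  [kernel.kallsyms]  [k] intel_iommu_iotlb_sync_map
_ENTRY = re.compile(
    r"^\s*(\d+(?:\.\d+)?)%\s+(\S+)\s+(\S+)\s+\[(\S)\]\s+(\S+)\s*$"
)


def parse_report(text):
    """Map symbol name -> summed overhead percent; call-chain lines are ignored."""
    symbols = {}
    for line in text.splitlines():
        match = _ENTRY.match(line)
        if match:
            overhead, _comm, _dso, _level, symbol = match.groups()
            symbols[symbol] = symbols.get(symbol, 0.0) + float(overhead)
    return symbols


def only_in_new(old_report, new_report, needle=""):
    """Symbols that appear in new_report but not in old_report,
    optionally restricted to names containing `needle`."""
    old = parse_report(old_report)
    new = parse_report(new_report)
    return {s: p for s, p in new.items() if s not in old and needle in s}


# Hypothetical report excerpts, for illustration only.
SAMPLE_OLD = "     1.20%  netperf  [kernel.kallsyms]  [k] tcp_sendmsg_locked\n"
SAMPLE_NEW = (
    "     1.10%  netperf  [kernel.kallsyms]  [k] tcp_sendmsg_locked\n"
    "     0.53%  netperf  [kernel.kallsyms]  [k] intel_iommu_iotlb_sync_map\n"
    "            ---intel_iommu_iotlb_sync_map\n"
)
```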

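Generate the flame graph again and focus on the i40e_napi_poll part.

A small IOMMU column now shows up above i40e_clean_tx_irq.

Reading the code of dma_unmap_page_attrs shows that the function has two different paths: direct DMA access and the IOMMU.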

void dma_unmap_page_attrs(struct device *dev, dma_addr_t addr, size_t size,
		enum dma_data_direction dir, unsigned long attrs)
{
	const struct dma_map_ops *ops = get_dma_ops(dev);

	BUG_ON(!valid_dma_direction(dir));
	/* Direct path: the device uses direct DMA, no translation involved. */
	if (dma_map_direct(dev, ops) ||
	    arch_dma_unmap_page_direct(dev, addr + size))
		dma_direct_unmap_page(dev, addr, size, dir, attrs);
	/* Otherwise dispatch to the device's dma_map_ops (the IOMMU path here). */
	else if (ops->unmap_page)
		ops->unmap_page(dev, addr, size, dir, attrs);
	debug_dma_unmap_page(dev, addr, size, dir);
}

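Clearly, in the case at hand the IOMMU is enabled on 5.15.0, whereas no IOMMU activity appears on 5.4.0, which probably accesses memory through direct DMA.

How can this guess be confirmed? The most direct way is to look at the kernel config: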

5.4.0-96-generic:

CONFIG_INTEL_IOMMU=y
CONFIG_INTEL_IOMMU_SVM=y
# CONFIG_INTEL_IOMMU_DEFAULT_ON is not set
CONFIG_INTEL_IOMMU_FLOPPY_WA=y

5.15.0-26-generic:

CONFIG_INTEL_IOMMU=y
CONFIG_INTEL_IOMMU_SVM=y
CONFIG_INTEL_IOMMU_DEFAULT_ON=y
CONFIG_INTEL_IOMMU_FLOPPY_WA=y
CONFIG_INTEL_IOMMU_SCALABLE_MODE_DEFAULT_ON=y
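The same comparison can be scripted. The sketch below parses two kernel config texts (for example the contents of /boot/config-<release>, an assumed location) and reports the options under a prefix whose values differ. An option recorded as "is not set" is represented as "n", and an option missing from the file as None; both the function names and that representation are choices made for this example.

```python
import re

_SET = re.compile(r"^(CONFIG_\w+)=(.*)$")
_UNSET = re.compile(r"^# (CONFIG_\w+) is not set$")


def parse_kconfig(text):
    """Map option name -> value; '# CONFIG_X is not set' becomes 'n'."""
    options = {}
    for raw in text.splitlines():
        line = raw.strip()
        match = _SET.match(line)
        if match:
            options[match.group(1)] = match.group(2)
            continue
        match = _UNSET.match(line)
        if match:
            options[match.group(1)] = "n"
    return options


def diff_kconfig(old_text, new_text, prefix="CONFIG_"):
    """Options starting with `prefix` whose values differ, as
    name -> (old value, new value); None means absent from the file."""
    old, new = parse_kconfig(old_text), parse_kconfig(new_text)
    names = sorted(n for n in set(old) | set(new) if n.startswith(prefix))
    return {n: (old.get(n), new.get(n)) for n in names if old.get(n) != new.get(n)}


# The IOMMU-related options of the two kernels compared in this case.
CONFIG_5_4 = """\
CONFIG_INTEL_IOMMU=y
CONFIG_INTEL_IOMMU_SVM=y
# CONFIG_INTEL_IOMMU_DEFAULT_ON is not set
CONFIG_INTEL_IOMMU_FLOPPY_WA=y
"""
CONFIG_5_15 = """\
CONFIG_INTEL_IOMMU=y
CONFIG_INTEL_IOMMU_SVM=y
CONFIG_INTEL_IOMMU_DEFAULT_ON=y
CONFIG_INTEL_IOMMU_FLOPPY_WA=y
CONFIG_INTEL_IOMMU_SCALABLE_MODE_DEFAULT_ON=y
"""
```

With these inputs, `diff_kconfig(CONFIG_5_4, CONFIG_5_15, "CONFIG_INTEL_IOMMU")` leaves exactly the two DEFAULT_ON options, which are the ones that differ between the kernels.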

After disabling CONFIG_INTEL_IOMMU_DEFAULT_ON and CONFIG_INTEL_IOMMU_SCALABLE_MODE_DEFAULT_ON, the problem was resolved.
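To double-check the effect on a running system, one option is to look at the boot command line and at the IOMMU groups exposed in sysfs. The sketch below is not part of the original investigation. It assumes the `intel_iommu=` and `iommu=` boot parameters, which can override the config defaults, and the /sys/kernel/iommu_groups directory, which is normally populated only when an IOMMU is active. The function names are hypothetical.

```python
import os


def iommu_cmdline_flags(cmdline):
    """Extract IOMMU-related boot parameters from a kernel command line,
    e.g. the contents of /proc/cmdline."""
    flags = {}
    for token in cmdline.split():
        key, sep, value = token.partition("=")
        if sep and key in ("intel_iommu", "iommu"):
            flags[key] = value
    return flags


def list_iommu_groups(sysfs_root="/sys"):
    """Return the IOMMU group names found under sysfs, or an empty list
    when the directory is missing or unreadable."""
    path = os.path.join(sysfs_root, "kernel", "iommu_groups")
    try:
        return sorted(os.listdir(path))
    except OSError:
        return []
```

An empty group list on the rebuilt 5.15.0 kernel, together with the IOMMU symbols disappearing from perf report, would be consistent with the fix described above.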

IOMMU

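Compared with direct physical memory access:

Advantages:

Large allocations do not need physically contiguous memory; the IOMMU maps scattered physical memory segments to a contiguous range of virtual addresses.

A device whose addressing width cannot cover all of physical memory can still reach all of it through the IOMMU.

It protects against malicious DMA attacks.

Disadvantages:

Address translation and mapping management cost performance.

The additional I/O page tables consume physical memory.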
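The trade-offs above can be illustrated with a toy model. The sketch below is a deliberately simplified, single-level page table written for this note; it is not how the Intel IOMMU driver works. It maps scattered physical pages to one contiguous range of I/O virtual addresses, rejects accesses to unmapped addresses, and keeps one table entry per mapped page, which hints at both the lookup cost and the memory cost. The class name, page size and base address are all assumptions.

```python
PAGE_SIZE = 4096  # assumed page size for the toy model


class ToyIommu:
    """Single-level I/O page table: IOVA page number -> physical page number."""

    def __init__(self, iova_base=0x10000000):
        self._next_iova = iova_base
        self._table = {}

    def map_pages(self, phys_pages):
        """Map physical page numbers (in any order) to one contiguous
        IOVA range and return the base IOVA of that range."""
        base = self._next_iova
        first_vpn = base // PAGE_SIZE
        for index, ppn in enumerate(phys_pages):
            self._table[first_vpn + index] = ppn
        self._next_iova += len(phys_pages) * PAGE_SIZE
        return base

    def translate(self, iova):
        """Translate a device-visible address to a physical address.
        Unmapped addresses are rejected, which models DMA protection."""
        vpn, offset = divmod(iova, PAGE_SIZE)
        if vpn not in self._table:
            raise PermissionError("DMA to unmapped IOVA 0x%x" % iova)
        return self._table[vpn] * PAGE_SIZE + offset

    def table_entries(self):
        """Number of page-table entries, a stand-in for table memory use."""
        return len(self._table)
```

For example, mapping the physical pages 7, 3 and 42 yields three back-to-back IOVA pages, so the device sees one contiguous 12 KiB buffer even though the underlying pages are scattered.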
