Linux Performance Tunning: IPIP Tunnel(1)

860阅读 0评论2020-11-21 静默梧桐
分类:LINUX

问题描述:
在使用iperf测试容器网络性能时发现IPIP隧道性能非常的低,同样在VXLAN模式下网络性能明显高于IPIP隧道;这个问题是在Thunderx2-02 与Thunderx2-04 服务发现。针对这个问题,对内核进行tunning。

1.  将iperf client绑定在特定的CPU上,利用perf工具获取程序运行过程中cache,cycle,context-switches 的一些信息;

perf stat -a -e cache-references,cache-misses,cycles,instructions,context-switches,irq:irq_handler_entry,irq:softirq_raise,irq:softirq_entry -C 8,9,10,11 --  taskset -c 8,9,10,11 iperf -c 172.200.50.102 -t 10 -l 8k -i 1 -w 100K -P 4


这个过程中发现 04上下文切换特别频繁;

thunderx2-04:

Performance counter stats for 'system wide':


     2,507,215,365      cache-references

       111,415,469      cache-misses              #    4.444 % of all cache refs

    67,086,874,844      cycles

           795,370      context-switches


      10.047327675 seconds time elapsed


thunderx2-02:

Performance counter stats for 'system wide':


     8,643,300,648      cache-references

       704,501,629      cache-misses              #    8.151 % of all cache refs

    88,676,115,363      cycles

            31,310      context-switches


      10.050243036 seconds time elapsed


但是还不清楚有哪些进程跑在这些cpu上,利用下面工具:

可以看到几个内核线程switch频繁,但此时我还不知道这个线程是做什么用的。

perf sched record -C 8,9,10,11  -- taskset -c 8,9,10,11 iperf -c 172.200.50.102 -t 10 -l 8k -i 1 -w 100K -P 4

perf sched latency

-----------------------------------------------------------------------------------------------------------------

  Task                  |   Runtime ms  | Switches | Average delay ms | Maximum delay ms | Maximum delay at       |

 -----------------------------------------------------------------------------------------------------------------

  kworker/u449:3-:202481 |      0.029 ms |        2 | avg: 2184.930 ms | max: 4016.295 ms | max at: 910804.377561 s

  iperf:(6)             |   2574.831 ms |    27683 | avg:    0.027 ms | max:   31.598 ms | max at: 910809.489586 s

  migration/11:78       |      0.000 ms |        3 | avg:    0.005 ms | max:    0.006 ms | max at: 910805.141550 s

  kworker/9:1H-kb:3907  |      0.029 ms |        2 | avg:    0.004 ms | max:    0.005 ms | max at: 910802.574529 s

  kworker/9:1-eve:211650 |   2643.633 ms |    53689 | avg:    0.003 ms | max:    2.335 ms | max at: 910801.817809 s

  kworker/11:0-ev:212453 |   2013.207 ms |    82402 | avg:    0.003 ms | max:   13.222 ms | max at: 910810.381545 s

  kworker/8:0-eve:182175 |   2395.868 ms |    84278 | avg:    0.003 ms | max:    6.304 ms | max at: 910809.171858 s

  kworker/10:2-ev:179435 |   2914.797 ms |    90490 | avg:    0.003 ms | max:    6.503 ms | max at: 910804.112214 s

  kworker/9:2-eve:217678 |     62.932 ms |     7036 | avg:    0.001 ms | max:    0.004 ms | max at: 910806.190321 s

  kworker/11:2-ev:216856 |     31.314 ms |     3509 | avg:    0.001 ms | max:    0.002 ms | max at: 910807.091390 s

  kworker/8:1:217889    |     26.993 ms |     3381 | avg:    0.001 ms | max:    0.002 ms | max at: 910803.190522 s

  kworker/10:0-ev:217348 |     20.548 ms |     2776 | avg:    0.001 ms | max:    0.004 ms | max at: 910804.786117 s

  perf:218630           |     15.906 ms |        1 | avg:    0.000 ms | max:    0.000 ms | max at:      0.000000 s

  sshd:213979           |      0.932 ms |        1 | avg:    0.000 ms | max:    0.000 ms | max at:      0.000000 s

  jenkins.util.Ti:5749  |      0.029 ms |        1 | avg:    0.000 ms | max:    0.000 ms | max at:      0.000000 s

 -----------------------------------------------------------------------------------------------------------------

  TOTAL:                |  12774.563 ms |   355893 |

 --------------------------------------------------- ---------------------------------------------------



那就用火焰图来分析吧

sudo perf record -g -e branch-misses,cycles,instructions,cache-misses -C 8,9,10,11 -o iperf-11-6 taskset -c 8,9,10,11 iperf -c 172.200.50.102 -t 10 -l 8k -i 1 -w 100K -P 4 

sudo perf report --no-children -i iperf-11-6


perf script -i iperf-11-6 &> perf.unfold

code/FlameGraph/stackcollapse-perf.pl  perf.unfold &> perf.folded

code/FlameGraph/flamegraph.pl perf.folded > perf-thunderx2-04-iperf-11-6.svg

可以看到4个kworker,每一个的栈都是类似的,发现了od_dbs_update;这个函数是在kernle driver/cpufreq目录下; 没接触过这个模块,查看Documentation\cpu-freq\, 了解关键信息“save battery power”,





google下相关资料还有同事给的指令,可以将这个thread disable;

echo performance > /sys/devices/system/cpu/cpu8/cpufreq/scaling_governor

echo performance > /sys/devices/system/cpu/cpu9/cpufreq/scaling_governor

echo performance > /sys/devices/system/cpu/cpu10/cpufreq/scaling_governor

echo performance > /sys/devices/system/cpu/cpu11/cpufreq/scaling_governor


之后的perf结果为:

perf stat -a -e cache-references,cache-misses,cycles,instructions,context-switches,irq:irq_handler_entry,irq:softirq_raise,irq:softirq_entry -C 8,9,10,11 --  taskset -c 8,9,10,11 iperf -c 172.200.50.102 -t 10 -l 8k -i 1 -w 100K -P 4

Performance counter stats for 'system wide':

     1,934,218,498      cache-references

       162,761,564      cache-misses              #    8.415 % of all cache refs

   111,003,587,975      cycles

     5,523,437,453      instructions              #    0.05  insn per cycle

            43,713      context-switches

             5,498      irq:irq_handler_entry

             8,700      irq:softirq_raise

             8,698      irq:softirq_entry

root@net-arm-thunderx2-04:/home/jianlin# perf sched latency

 -----------------------------------------------------------------------------------------------------------------

  Task                  |   Runtime ms  | Switches | Average delay ms | Maximum delay ms | Maximum delay at       |

 -----------------------------------------------------------------------------------------------------------------

  :12373:12373          |     39.073 ms |        3 | avg:   20.733 ms | max:   62.070 ms | max at:  89536.961252 s

  :12394:12394          |      0.701 ms |        2 | avg:   15.095 ms | max:   30.186 ms | max at:  89536.991686 s

  kworker/8:1-mm_:124291 |      0.657 ms |       16 | avg:    0.016 ms | max:    0.160 ms | max at:  89536.982505 s

  kworker/10:1:12361    |      0.004 ms |        1 | avg:    0.016 ms | max:    0.016 ms | max at:  89529.641059 s

  migration/10:72       |      0.000 ms |        3 | avg:    0.008 ms | max:    0.012 ms | max at:  89537.848265 s

  kworker/10:2-ev:132535 |      0.976 ms |       15 | avg:    0.007 ms | max:    0.021 ms | max at:  89532.884291 s

  ksoftirqd/10:73       |      0.014 ms |        1 | avg:    0.006 ms | max:    0.006 ms | max at:  89531.128267 s

  kworker/9:0-mm_:11957 |      0.361 ms |       12 | avg:    0.005 ms | max:    0.012 ms | max at:  89534.076292 s

  kworker/11:5-mm:219074 |      0.894 ms |       19 | avg:    0.005 ms | max:    0.014 ms | max at:  89538.968274 s

  iperf:(6)             |   2116.851 ms |    16204 | avg:    0.004 ms | max:    0.258 ms | max at:  89529.628820 s

  migration/11:78       |      0.000 ms |        3 | avg:    0.003 ms | max:    0.003 ms | max at:  89529.912260 s

  migration/9:66        |      0.000 ms |        3 | avg:    0.003 ms | max:    0.004 ms | max at:  89529.784259 s

  :12388:12388          |      1.713 ms |        1 | avg:    0.003 ms | max:    0.003 ms | max at:  89536.976836 s

  migration/8:60        |      0.000 ms |        3 | avg:    0.003 ms | max:    0.003 ms | max at:  89537.720259 s

  kworker/10:0:4978     |      0.052 ms |        1 | avg:    0.002 ms | max:    0.002 ms | max at:  89529.640951 s

  :12397:12397          |      2.157 ms |        1 | avg:    0.000 ms | max:    0.000 ms | max at:      0.000000 s

 -----------------------------------------------------------------------------------------------------------------

  TOTAL:                |   2163.453 ms |    16288 |

 ---------------------------------------------------


没有了内核thread 的频繁切换,但是相比02来说,04的iperf switches 还是很高;

root@net-arm-thunderx2-02:/home/jianlin# perf sched latency

 -----------------------------------------------------------------------------------------------------------------

  Task                  |   Runtime ms  | Switches | Average delay ms | Maximum delay ms | Maximum delay at       |

 -----------------------------------------------------------------------------------------------------------------

  iperf:(6)             |  21872.641 ms |       44 | avg:    0.031 ms | max:    0.180 ms | max at: 348025.791558 s

  jbd2/sda2-8:1934      |      2.566 ms |        2 | avg:    0.003 ms | max:    0.006 ms | max at: 348035.269940 s


为什么? 难道对端响应不及时导致的阻塞吗?

Brendan Gregg有blog提及如何使用perf来抓取off-CPU信息;

http://www.brendangregg.com/blog/2015-02-26/linux-perf-off-cpu-flame-graph.html

参照这位神的指令,抓取了04上iperf client off-CPU时间火焰图,

这里的宽度代表了代码路径睡眠的总持续时间。

in this graph the width spans the total duration that a code path was sleeping. 

消息发送时等待memory( tcp_sendmsg_locked),导致了thread被换出;


在02上,tcp_sendmsg_locked上睡眠时间仅仅是04上1/3;



Server端是什么情况呐; 02做server时,tcp_recvmsg代码路径上睡眠时间是12,571ms;



对应04上代码路径上睡眠时间是3,453


看样子04 send数据也不给力;
关键点也许就是tcp_sendmsg_locked,数据发送的过程中被阻塞了,线程被切换出去;等待资源的释放。 
问题还没定位,后续继续进行tunning.



上一篇:ebpf-summit 2020 随笔
下一篇:snort3 for arm64