
Analysis of a procd watchdog starvation problem on OpenWrt
Author: Yuan Jianpeng  Email: yuanjp89@163.com
Published: 2023-08-29  Site: Inside Linux Development

    Recently, during a stability soak test, a single device with 70+ wireless clients playing live streams rebooted from a watchdog bite after running for two days.

[20230826_20:52:09_155]: [263114.367975] Watchdog bark! Now = 263114.367962
[20230826_20:52:09_155]: [263114.367999] Causing a watchdog bite!
[20230826_20:52:09_162]: [263114.371303] Configuring Watchdog Timer
[20230826_20:52:09_163]: [263114.375118] Waiting for Reboot
[20230826_20:52:10_476]: [263114.379769] Wdog - CTL: 0x1, BARK TIME: 0xe2900, BITE TIME: 0x1

The OpenWrt watchdog mechanism

   On OpenWrt, the watchdog is maintained by PID 1, the procd process. Source: procd/watchdog.c.


    Watchdog parameters can be changed, and the watchdog stopped, via ubus.

# ubus -vv list system

'system' @e32153d7

        "board":{}

        "info":{}

        "reboot":{}

        "watchdog":{"frequency":"Integer","timeout":"Integer","magicclose":"Boolean","stop":"Boolean"}

        "signal":{"pid":"Integer","signum":"Integer"}

        "sysupgrade":{"path":"String","prefix":"String","command":"String"}


# ubus call system watchdog

{

        "status": "running",

        "timeout": 30,

        "frequency": 5,

        "magicclose": false

}


    As shown, the watchdog timeout is 30 s and the feed interval is 5 s.


The OpenWrt boot flow

    The procd package builds the init, procd, askfirst and reload_config binaries, among others.


    /sbin/init is built from:

initd/init.c
initd/early.c
initd/preinit.c
initd/mkdev.c
sysupgrade.c
watchdog.c
utils/utils.c


    procd is built from:

procd.c
signal.c
state.c
inittab.c
rcS.c
ubus.c
system.c
sysupgrade.c
service/service.c
service/instance.c
service/validate.c
service/trigger.c
service/watch.c
utils/utils.c
watchdog.c
plug/coldplug.c
plug/hotplug.c


init's tasks

    The first process the kernel starts is /sbin/init; its main() is in initd/init.c:

        -> early(): mounts the required filesystems (/dev, /tmp, /sys, /proc, etc.)

        -> watchdog_init(1): initializes the watchdog

        runs kmodloader /etc/modules-boot.d/ to load boot-stage kernel modules

        -> preinit()

            forks procd -h /etc/hotplug-preinit.json into the background

            forks /etc/preinit into the background; once it finishes, calls spawn_procd(), which replaces init with procd

    spawn_procd()

        saves the watchdog fd into the WDTFD environment variable, so the open fd is inherited by procd.


procd's tasks

    procd rotates through a set of states; when one state's work completes, procd_state_next() is called to enter the next:

enum {
        STATE_NONE = 0,
        STATE_EARLY,
        STATE_UBUS,
        STATE_INIT,
        STATE_RUNNING,
        STATE_SHUTDOWN,
        STATE_HALT,
        __STATE_MAX,
};

        

    STATE_EARLY

            Initializes the watchdog.

            Initializes hotplug via hotplug("/etc/hotplug.json"): creates a socket to receive kernel messages, then executes the rules specified in /etc/hotplug.json.

            Calls procd_coldplug(), which runs "udevtrigger"; once that completes, moves to the next state.


    STATE_UBUS

        Connects to ubus; procd only advances to the next state once the connection succeeds. It registers two objects: system and service.


Debugging procd

    To enable procd debug output:

        procd reads the DBGLVL environment variable and the -d command-line option.


    The init process sets the DBGLVL environment variable when its debug level is greater than 0.

    check_dbglvl() reads /tmp/debug_level and sets the debug level from it.


    lib/preinit/30_failsafe_wait

    During boot, pressing 1-4 (then Enter) writes the chosen debug level to /tmp/debug_level.


Press the [f] key and hit [enter] to enter failsafe mode

Press the [1], [2], [3] or [4] key and hit [enter] to select the debug level



    procd's logs then appear in syslog:

2023-05-05T10:30:34+08:00 XiaoQiang procd: Launched hotplug exec instance, pid=1027

2023-05-05T10:30:34+08:00 XiaoQiang procd: Command: exec

2023-05-05T10:30:34+08:00 XiaoQiang procd:  /sbin/hotplug-call

2023-05-05T10:30:34+08:00 XiaoQiang procd:  button

2023-05-05T10:30:34+08:00 XiaoQiang procd: Message:

2023-05-05T10:30:34+08:00 XiaoQiang procd:  PATH=/sbin:/bin:/usr/sbin:/usr/bin

2023-05-05T10:30:34+08:00 XiaoQiang procd:  SUBSYSTEM=button

2023-05-05T10:30:34+08:00 XiaoQiang procd:  ACTION=released

2023-05-05T10:30:34+08:00 XiaoQiang procd:  BUTTON=mesh

2023-05-05T10:30:34+08:00 XiaoQiang procd:  SEEN=0

2023-05-05T10:30:34+08:00 XiaoQiang procd:  SEQNUM=1011

2023-05-05T10:30:34+08:00 XiaoQiang procd: Finished hotplug exec instance, pid=1027

2023-05-05T10:30:34+08:00 XiaoQiang procd: Launched hotplug exec instance, pid=1028

2023-05-05T10:30:34+08:00 XiaoQiang procd: Finished hotplug exec instance, pid=1028

2023-05-05T10:30:35+08:00 XiaoQiang procd: Ping

2023-05-05T10:30:40+08:00 XiaoQiang procd: Ping


        Each Ping line is a watchdog feed.


Reproducing the problem

The initial suspicion was that high CPU load kept procd from being scheduled, so a script started several busy loops in the background:

        $ while true; do : ; done &

Even with the CPU fully saturated, procd's log still showed the watchdog being fed steadily and precisely. The kernel scheduler distributes CPU fairly even when the system is very busy; no process gets starved, let alone starved for 30 s. The scheduler was not the problem.


The next suspicion was a procd bug: some other task blocking the whole process and, with it, the feeding. But procd's recent commit history contains no such fix, and procd forks before executing hotplug handlers and similar work, so this was unlikely too.


Then a similar earlier bug came to mind: after 5 hours of soak, a controller's LED turned amber, the serial console froze for input (output still worked), and a watchdog reboot followed. That was the clue. A notorious kernel problem surfaced: This is a bug known since 2007 - see System freeze on high memory usage.


When free memory drops below the watermarks, the kernel effectively freezes: the serial console responds extremely slowly and processes barely get scheduled while the kernel busies itself reclaiming buffer and page-cache memory. The OOM killer triggers extremely late; before it fires, the watchdog barks and the device reboots.


So the next experiment was to create memory pressure deliberately:


1 Raise the memory watermark

 echo 16000 > /proc/sys/vm/min_free_kbytes

This raises min_free_kbytes to 16 MB; it was 4 MB before.


2 Write large files into /tmp

dd if=/dev/zero of=8M bs=1024 count=8192


Large files were generated in /tmp repeatedly until the serial console started to stall. Under memory pressure the problem reproduced quickly: watchdog bark, reboot.


After the reboot, the watchdog was stopped:

ubus call system watchdog '{"stop":true,"magicclose":true}'


With the watermark raised again and memory pressure recreated, the serial console became extremely sluggish, but the system did not reboot; only much later did the OOM killer fire:

[ 2054.807162] startscene_cron invoked oom-killer: gfp_mask=0x2420848, order=0, oom_score_adj=0

[ 2054.807195] startscene_cron cpuset=/ mems_allowed=0

[ 2054.818481] CPU: 0 PID: 3574 Comm: startscene_cron Tainted: P                4.4.60 #0

[ 2054.819522] Hardware name: Generic DT based system

[ 2054.827268] [<8121e540>] (unwind_backtrace) from [<8121b6a0>] (show_stack+0x10/0x14)

[ 2054.832035] [<8121b6a0>] (show_stack) from [<81394920>] (dump_stack+0x7c/0x9c)

[ 2054.839933] [<81394920>] (dump_stack) from [<81293578>] (dump_header+0x4c/0x19c)

[ 2054.846962] [<81293578>] (dump_header) from [<81293a78>] (oom_kill_process+0x80/0x41c)

[ 2054.854510] [<81293a78>] (oom_kill_process) from [<812940e8>] (out_of_memory+0x26c/0x300)

[ 2054.862237] [<812940e8>] (out_of_memory) from [<81297af4>] (__alloc_pages_nodemask+0x6bc/0x774)

[ 2054.870484] [<81297af4>] (__alloc_pages_nodemask) from [<81290a54>] (pagecache_get_page+0x120/0x1c0)

[ 2054.878994] [<81290a54>] (pagecache_get_page) from [<812f3040>] (__getblk_slow+0xac/0x24c)

[ 2054.888372] [<812f3040>] (__getblk_slow) from [<813164e4>] (squashfs_read_data+0x174/0xb9c)

[ 2054.896441] [<813164e4>] (squashfs_read_data) from [<813170b0>] (squashfs_cache_get+0x1a4/0x34c)

[ 2054.904687] [<813170b0>] (squashfs_cache_get) from [<81318810>] (squashfs_readpage+0x700/0x950)

[ 2054.913718] [<81318810>] (squashfs_readpage) from [<8129b7b0>] (__do_page_cache_readahead+0x1bc/0x1fc)

[ 2054.922134] [<8129b7b0>] (__do_page_cache_readahead) from [<81292a84>] (filemap_fault+0x194/0x440)

[ 2054.931508] [<81292a84>] (filemap_fault) from [<812ad9d8>] (__do_fault+0x3c/0x94)

[ 2054.940450] [<812ad9d8>] (__do_fault) from [<812b0328>] (handle_mm_fault+0x598/0xfbc)

[ 2054.948001] [<812b0328>] (handle_mm_fault) from [<8121ffcc>] (do_page_fault+0x12c/0x288)

[ 2054.955813] [<8121ffcc>] (do_page_fault) from [<812092c4>] (do_PrefetchAbort+0x34/0x98)

[ 2054.963972] [<812092c4>] (do_PrefetchAbort) from [<8120a95c>] (ret_from_exception+0x0/0x24)

[ 2054.971693] Exception stack(0x853cdfb0 to 0x853cdff8)

[ 2054.980024] dfa0:                                     00000001 7ecf9f65 00000000 7ecf9f64

[ 2054.985238] dfc0: 7ecf9f63 76f331e0 00000002 7ecf9e44 00000003 00000001 7ecf9df4 00000002

[ 2054.993397] dfe0: 7ecf9f68 7ecf9b60 76e89c68 76e89c68 60000010 ffffffff

[ 2055.015883] Mem-Info:

[ 2055.015932] active_anon:2533 inactive_anon:12553 isolated_anon:0

[ 2055.015932]  active_file:336 inactive_file:343 isolated_file:0

[ 2055.015932]  unevictable:721 dirty:0 writeback:0 unstable:0

[ 2055.015932]  slab_reclaimable:669 slab_unreclaimable:13325

[ 2055.015932]  mapped:86 shmem:12557 pagetables:329 bounce:0

[ 2055.015932]  free:5098 free_pcp:97 free_cma:0

[ 2055.028813] Normal free:20392kB min:16000kB low:20000kB high:24000kB active_anon:10132kB inactive_anon:50212kB active_file:1380kB inactive_file:1408kB unevictable:2884kB isolated(anon):0kB isolated(file):0kB present:196608kB managed:187952kB mlocked:0kB dirty:0kB writeback:0kB mapped:344kB shmem:50228kB slab_reclaimable:2676kB slab_unreclaimable:53300kB kernel_stack:1464kB pagetables:1316kB unstable:0kB bounce:0kB free_pcp:328kB local_pcp:264kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no

[ 2055.074293] lowmem_reserve[]: 0 0

[ 2055.098642] Normal: 146*4kB (UMEH) 85*8kB (UME) 50*16kB (UMEH) 24*32kB (UMEH) 6*64kB (UME) 39*128kB (UMH) 2*256kB (M) 3*512kB (UME) 2*1024kB (UE) 2*2048kB (ME) 1*4096kB (M) = 20496kB

[ 2055.113490] 13991 total pagecache pages

[ 2055.114545] 49152 pages RAM

[ 2055.118777] 0 pages HighMem/MovableOnly

[ 2055.121138] 2164 pages reserved

[ 2055.125131] [ pid ]   uid  tgid total_vm      rss nr_ptes nr_pmds swapents oom_score_adj name

[ 2055.128681] [  350]     0   350      261       19       4       0        0             0 ubusd

[ 2055.137793] [  351]     0   351      321       12       3       0        0             0 ash

[ 2055.145934] [  389]     0   389      182        9       3       0        0         -1000 qrtr-ns

[ 2055.154450] [  660]     0   660      296       10       5       0        0             0 led_srv

[ 2055.163069] [  719]     0   719      327       12       4       0        0             0 rpcd


[ 2055.875997] Out of memory: Kill process 1777 (messagingagent) score 13 or sacrifice child

[ 2055.884469] Killed process 1777 (messagingagent) total-vm:8044kB, anon-rss:2544kB, file-rss:0kB


The system kept running, though. After a struggle to delete the large files over the laggy console, it recovered.


Why the system freezes under memory pressure

Reference [5] explains this. A process's executable code is loaded into memory via file mappings:

Active(file):      13896 kB

Inactive(file):     1552 kB


When memory is tight, the kernel reclaims active file pages, so the process's executable code pages get evicted and must fault back in constantly; the process therefore runs extremely slowly, while the kernel itself is unaffected. That is why the WiFi driver's printouts and the OOM output scrolled smoothly over serial while interactive console input was laggy: the interactive side is a user-space shell process.


Reference [5] also provides a patch that keeps executable code pages from being reclaimed; it works reasonably well.

References

[1] https://superuser.com/questions/1281118/linux-hangs-when-it-runs-on-out-of-memory
[2] https://github.com/hakavlad/nohang
[3] https://github.com/rfjakob/earlyoom
[4] https://serverfault.com/questions/390623/how-do-i-prevent-linux-from-freezing-when-out-of-memory
[5] https://stackoverflow.com/questions/52067753/how-to-keep-executable-code-in-memory-even-under-memory-pressure-in-linux



Copyright © linuxdev.cc 2017-2024. Some Rights Reserved.