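I recently ran into a problem with a Lua script that goes online to check whether new firmware is available. The script is triggered on WAN up. During an automated test that periodically unplugs and replugs the WAN link, the script hung, so instances piled up until the device ran out of memory (OOM) and rebooted.
Analyzing the process with strace showed that it was polling a TCP fd the whole time, and netstat showed that this TCP connection was the one opened by the curl request. curl was stuck.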
# ls /proc/701/fd -l
lr-x------ 1 root root 64 May 16 11:37 0 -> /dev/null
l-wx------ 1 root root 64 May 16 11:37 1 -> /dev/null
l-wx------ 1 root root 64 May 16 11:37 2 -> /dev/null
lrwx------ 1 root root 64 May 16 11:37 3 -> socket:[2663893]
lrwx------ 1 root root 64 May 16 11:37 4 -> socket:[2663894]
lrwx------ 1 root root 64 May 16 11:37 5 -> socket:[2663897]
# netstat -anp | grep 701
tcp 0 0 192.168.25.184:39723 220.181.106.182:443 ESTABLISHED 0 0 701/lua
unix 3 [ ] STREAM CONNECTED 2663894 701/lua
unix 3 [ ] STREAM CONNECTED 2663893 701/lua
At this point the connection tracking table (conntrack) no longer has an entry for this TCP connection (220.181.106.182:443):
# cat /proc/net/nf_conntrack
ipv4 2 tcp 6 3591 ESTABLISHED src=192.168.25.184 dst=123.6.50.100 sport=27225 dport=1890 packets=154 bytes=14583 src=123.6.50.100 dst=192.168.25.184 sport=1890 dport=27225 packets=80 bytes=9124 [ASSURED] mark=0 zone=0 use=2
ipv4 2 unknown 2 558 src=192.168.25.184 dst=224.0.0.22 packets=3 bytes=120 [UNREPLIED] src=224.0.0.22 dst=192.168.25.184 packets=0 bytes=0 mark=0 zone=0 use=2
ipv4 2 unknown 2 591 src=192.168.31.1 dst=224.0.0.22 packets=376 bytes=18048 [UNREPLIED] src=224.0.0.22 dst=192.168.31.1 packets=0 bytes=0 mark=0 zone=0 use=2
ipv4 2 tcp 6 3593 ESTABLISHED src=127.0.0.1 dst=127.0.0.1 sport=64013 dport=784 packets=1148 bytes=59696 src=127.0.0.1 dst=127.0.0.1 sport=784 dport=64013 packets=1148 bytes=59696 [ASSURED] mark=0 zone=0 use=2
ipv4 2 tcp 6 3582 ESTABLISHED src=192.168.25.184 dst=58.83.177.195 sport=11275 dport=8887 packets=179 bytes=15717 src=58.83.177.195 dst=192.168.25.184 sport=8887 dport=11275 packets=158 bytes=8397 [ASSURED] mark=0 zone=0 use=2
ipv4 2 tcp 6 3593 ESTABLISHED src=127.0.0.1 dst=127.0.0.1 sport=45795 dport=784 packets=1148 bytes=59696 src=127.0.0.1 dst=127.0.0.1 sport=784 dport=45795 packets=1148 bytes=59696 [ASSURED] mark=0 zone=0 use=2
ipv4 2 unknown 2 594 src=192.168.31.1 dst=224.0.0.1 packets=504 bytes=18144 [UNREPLIED] src=224.0.0.1 dst=192.168.31.1 packets=0 bytes=0 mark=0 zone=0 use=2
ipv4 2 tcp 6 3595 ESTABLISHED src=192.168.31.1 dst=192.168.31.8 sport=23 dport=54978 packets=1106 bytes=163855 src=192.168.31.8 dst=192.168.31.1 sport=54978 dport=23 packets=1581 bytes=83164 [ASSURED] mark=0 zone=0 use=2
ipv4 2 unknown 2 587 src=192.168.31.8 dst=224.0.0.22 packets=379 bytes=15160 [UNREPLIED] src=224.0.0.22 dst=192.168.31.8 packets=0 bytes=0 mark=0 zone=0 use=2
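The manual check above can also be scripted. The following is a minimal sketch, assuming the table has the text format shown above; `find_conntrack_entries` is a hypothetical helper name, not part of any existing tool. It matches only the original-direction tuple, that is, the first `dst=` and `dport=` fields on each line.

```python
def find_conntrack_entries(table_text, dst_ip, dst_port):
    """Return the conntrack lines whose original-direction destination matches."""
    matches = []
    for line in table_text.splitlines():
        fields = {}
        for token in line.split():
            key, sep, value = token.partition("=")
            # Keep only the first occurrence of each key: the first
            # src/dst/sport/dport group describes the original direction.
            if sep and key not in fields:
                fields[key] = value
        if fields.get("dst") == dst_ip and fields.get("dport") == str(dst_port):
            matches.append(line)
    return matches


if __name__ == "__main__":
    # Reading this file normally requires root.
    with open("/proc/net/nf_conntrack") as f:
        print(find_conntrack_entries(f.read(), "220.181.106.182", 443))
```

An empty result for 220.181.106.182:443 corresponds to the situation described here: the socket is still ESTABLISHED in netstat, but conntrack has no entry for it.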
Quoting reference 1:
TCP connections can be totally without traffic in either direction when they are not used. A totally idle connection can therefore not be clearly separated from a connection that has gone completely stale because of network or server issues.
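My understanding:
A TCP connection can stay completely silent indefinitely.
Test method:
1 On the router, create a socket that listens on port 8080.
2 On the PC, create a TCP socket, connect to port 8080 on the router, then call recv.
3 After the router accepts the connection, cut the router's power.
The PC's recv now blocks forever. By the time the router's accept returns, the three-way handshake has completed, so as long as neither side sends any data, the connection stays silent forever.
recv never receives any data either.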
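The idle half of this test can be tried on a single machine. The sketch below is only an approximation under stated assumptions: a local thread stands in for the router, it accepts one connection and then sends nothing, and the power-off itself is not simulated. The names `start_silent_server` and `recv_or_timeout` are made up for this example, and the receive timeout exists only so the demo terminates instead of hanging like the real case.

```python
import socket
import threading


def start_silent_server(host="127.0.0.1"):
    """Listen on an ephemeral port, accept one client, then stay silent."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind((host, 0))  # port 0: let the OS pick a free port
    listener.listen(1)
    stop = threading.Event()

    def serve():
        conn, _ = listener.accept()  # take the established connection
        stop.wait()                  # hold it open and never send anything
        conn.close()
        listener.close()

    threading.Thread(target=serve, daemon=True).start()
    return listener.getsockname()[1], stop


def recv_or_timeout(host, port, timeout_s):
    """Connect, call recv, and return None if nothing arrives in time."""
    with socket.create_connection((host, port), timeout=timeout_s) as sock:
        try:
            return sock.recv(1024)
        except socket.timeout:
            return None


if __name__ == "__main__":
    port, stop = start_silent_server()
    print(recv_or_timeout("127.0.0.1", port, timeout_s=2.0))
    stop.set()
```

Without the timeout, the `recv` call would wait for as long as the peer stays silent, which is the behaviour seen in the hung process.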
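TCP keepalive is designed for exactly this case: it periodically probes an idle connection, so a peer that has gone away is detected and the blocked recv returns with an error. The Lua process hung forever because its recv kept waiting, and enabling keepalive fixes the bug.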
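At the socket level, enabling keepalive looks roughly like the sketch below. It is illustrative only: the source does not show how the Lua script invokes curl, `enable_keepalive` is a hypothetical helper, and the timing values are arbitrary. `TCP_KEEPIDLE`, `TCP_KEEPINTVL` and `TCP_KEEPCNT` are per-socket options on Linux, so they are set only when the platform exposes them. With curl, the corresponding knobs are `--keepalive-time` on the command line and `CURLOPT_TCP_KEEPALIVE`, `CURLOPT_TCP_KEEPIDLE` and `CURLOPT_TCP_KEEPINTVL` in libcurl.

```python
import socket


def enable_keepalive(sock, idle_s=60, interval_s=10, probes=3):
    """Turn on TCP keepalive and, where supported, tune its timing."""
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # The options below are not available on every platform.
    if hasattr(socket, "TCP_KEEPIDLE"):
        # Seconds of idleness before the first probe is sent.
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, idle_s)
    if hasattr(socket, "TCP_KEEPINTVL"):
        # Seconds between successive probes.
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, interval_s)
    if hasattr(socket, "TCP_KEEPCNT"):
        # Unanswered probes before the connection is declared dead.
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, probes)


if __name__ == "__main__":
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    enable_keepalive(s)
    print(s.getsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE) != 0)
    s.close()
```

With these example values, a dead peer should be detected after roughly 60 + 10 × 3 = 90 seconds of silence, at which point a blocked recv fails instead of waiting forever. Setting only `SO_KEEPALIVE` leaves the timing at the system defaults, which on Linux means a first probe after 7200 seconds.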
References
https://everything.curl.dev/usingcurl/connections/keepalive.html
https://tldp.org/HOWTO/TCP-Keepalive-HOWTO/overview.html