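This post shows how to use the Code: line of a kernel oops log to locate the faulting module and the corresponding source code.
After an upgrade, an mt7621 board rebooted abnormally five times in a row. Every reboot was caused by a kernel panic: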
[0918_16:27:20][ 52.383539] CPU 2 Unable to handle kernel paging request at virtual address d0c17558, epc == 8ce01110, ra == 8ce010d4
[0918_16:28:19][ 47.723370] CPU 3 Unable to handle kernel paging request at virtual address d0c77558, epc == 8ce61110, ra == 8ce610d4
[0918_16:29:20][ 50.585248] CPU 0 Unable to handle kernel paging request at virtual address d0c77558, epc == 8ce61110, ra == 8ce610d4
[0918_16:30:21][ 50.653888] CPU 2 Unable to handle kernel paging request at virtual address d04d7558, epc == 8c6c1110, ra == 8c6c10d4
[0918_16:27:20][ 52.383539] CPU 2 Unable to handle kernel paging request at virtual address d0c17558, epc == 8ce01110, ra == 8ce010d4
[0918_16:27:20][ 52.394360] Oops[#1]:
[0918_16:27:20][ 52.396750] CPU: 2 PID: 6392 Comm: trafficd Not tainted 4.4.198 #0
[0918_16:27:20][ 52.402971] task: 8fedc860 task.stack: 8b0ac000
[0918_16:27:20][ 52.407528] $ 0 : 00000000 7705f098 00000001 d0c17548
[0918_16:27:20][ 52.412868] $ 4 : 0a1f0337 8ce09c40 00000000 00000000
[0918_16:27:20][ 52.418200] $ 8 : 00000000 ffffffff 00000000 0000003c
[0918_16:27:20][ 52.423530] $12 : 00000001 00000000 00000000 00000000
[0918_16:27:20][ 52.428862] $16 : 8ce10e68 8ce00000 8ce00000 8ce10000
[0918_16:27:20][ 52.434197] $20 : 8ce10000 8ce00000 00000000 8ce01c20
[0918_16:27:20][ 52.439529] $24 : 00000000 8105596c
[0918_16:27:20][ 52.444861] $28 : 8b0ac000 8fc0fe88 8184a884 8ce010d4
[0918_16:27:20][ 52.450200] Hi : 00000000
[0918_16:27:20][ 52.453114] Lo : 0000bc00
[0918_16:27:20][ 52.456102] epc : 8ce01110 0x8ce01110
[0918_16:27:20][ 52.460002] ra : 8ce010d4 0x8ce010d4
[0918_16:27:20][ 52.463866] Status: 11007c03 KERNEL EXL IE
[0918_16:27:20][ 52.468137] Cause : 40800008 (ExcCode 02)
[0918_16:27:20][ 52.472176] BadVA : d0c17558
[0918_16:27:20][ 52.475100] PrId : 0001992f (MIPS 1004Kc)
[0918_16:27:20][ 52.479220] Modules linked in:
[0918_16:27:20][ 52.653669] Process trafficd (pid: 6392, threadinfo=8b0ac000, task=8fedc860, tls=7707adc0)
[0918_16:27:20][ 52.661967] Stack : 00000002 8184a884 00000005 8fc4a804 81850000 8fc4a430 81740000 8104d6c0
[0918_16:27:20] 304d9b80 81ec2120 30e63200 0000000c 00000001 81c90000 00000100 81850000
[0918_16:27:20] 8ce009ec 81ec1940 81ec1a40 81ec1340 00200000 00000200 00100000 8107a9a4
[0918_16:27:20] 818591fc 81c22ec0 00000008 00000001 81853c20 8184a884 81ec1320 81850000
[0918_16:27:20] 81ec1840 8107ac4c 0000003a 8106c000 00000001 00000001 00000001 81ec1740
[0918_16:27:20] ...
[0918_16:27:20][ 52.698394] Call Trace:
[0918_16:27:20][ 52.700922] [<8ce01110>] 0x8ce01110
[0918_16:27:20][ 52.704465]
[0918_16:27:20][ 52.705983]
[0918_16:27:21]Code: 00041940 02031821 8cab0008 <8c660010> 8c6c0014 8cad000c 00cb5821 018d6021 0166302b
[0918_16:27:21][ 52.716856] ---[ end trace eb3bf2a8ccaf6627 ]---
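As the log shows, there is no usable kernel call trace: the only entry is the unresolved address 0x8ce01110. The faulting instruction address (epc) is 8ce01110 once, 8ce61110 twice and 8c6c1110 once, and the faulting virtual address (BadVA) changes as well (d0c17558, d0c77558, d04d7558). The low 16 bits never change, though: 0x1110 for epc and 0x7558 for BadVA. Every crash therefore hits the same instruction, and only the load base differs.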
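To compare the four panic lines without eyeballing them, they can be parsed and reduced to their low 16 bits. This is only a sketch: the regular expression and the helper names (`parse_panics`, `low16`) are made up for illustration, and the input is the four summary lines from the log above with the timestamps dropped.

```python
import re

# The four panic summary lines from the log (timestamps removed).
LOG = """\
CPU 2 Unable to handle kernel paging request at virtual address d0c17558, epc == 8ce01110, ra == 8ce010d4
CPU 3 Unable to handle kernel paging request at virtual address d0c77558, epc == 8ce61110, ra == 8ce610d4
CPU 0 Unable to handle kernel paging request at virtual address d0c77558, epc == 8ce61110, ra == 8ce610d4
CPU 2 Unable to handle kernel paging request at virtual address d04d7558, epc == 8c6c1110, ra == 8c6c10d4
"""

PANIC_RE = re.compile(
    r"virtual address ([0-9a-f]{8}), epc == ([0-9a-f]{8}), ra == ([0-9a-f]{8})"
)


def parse_panics(text):
    """Return one (bad_va, epc, ra) tuple of integers per panic line."""
    return [tuple(int(x, 16) for x in m.groups()) for m in PANIC_RE.finditer(text)]


def low16(value):
    """Keep only the low 16 bits of an address."""
    return value & 0xFFFF


panics = parse_panics(LOG)
for bad_va, epc, ra in panics:
    print(
        f"epc={epc:08x} (low16 {low16(epc):04x})  "
        f"BadVA={bad_va:08x} (low16 {low16(bad_va):04x})  "
        f"BadVA-epc={bad_va - epc:08x}"
    )
```

If the low bits and the BadVA-to-epc distance come out identical on every line, the crashes are the same bug at a different load address rather than four unrelated faults.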
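The kernel symbol table contains no symbol covering 8ce01110 or 8ce61110.
These addresses look like they belong to an external kernel module, so the next place to look is /proc/modules. An entry there looks like this, using nf_nat_sip as an example: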
# grep sip /proc/modules
nf_nat_sip 7216 0 - Live 0x8df16000
The address after Live is the module's load (start) address, and 7216 is the module's size in bytes.
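Given a /proc/modules snapshot, mapping an address to a module is a simple range check. The sketch below assumes the usual six-column layout (name, size, refcount, dependencies, state, address) and treats the size as a plain upper bound; `parse_proc_modules` and `find_module` are hypothetical helper names. It is only meaningful when the snapshot comes from the same boot as the crash.

```python
def parse_proc_modules(text):
    """Parse /proc/modules text into (name, start, size) tuples."""
    modules = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 6:
            continue  # skip blank or malformed lines
        name = fields[0]
        size = int(fields[1])
        start = int(fields[5], 16)  # address column, e.g. 0x8df16000
        modules.append((name, start, size))
    return modules


def find_module(modules, addr):
    """Return (name, offset) for the module containing addr, else None."""
    for name, start, size in modules:
        if start <= addr < start + size:
            return name, addr - start
    return None


# The single entry quoted above, read after the reboot.
SNAPSHOT = "nf_nat_sip 7216 0 - Live 0x8df16000\n"
modules = parse_proc_modules(SNAPSHOT)

print(find_module(modules, 0x8DF16010))  # an address inside nf_nat_sip
print(find_module(modules, 0x8CE01110))  # the epc from the first crash
```

The crash epc does not fall inside the nf_nat_sip range shown above, so this entry on its own does not explain the fault.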
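Comparing two boots shows that every module's load address changes from boot to boot, so an address from a crashed boot cannot be matched against /proc/modules read after the reboot. A closer look at the log, however, shows that the kernel printed the instructions immediately before and after the faulting one: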
Code: 00041940 02031821 8cab0008 <8c660010> 8c6c0014 8cad000c 00cb5821 018d6021 0166302b
The word in angle brackets is the faulting instruction.
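The bracketed word can be decoded by hand to cross-check it against the register dump. The sketch below splits the Code: line into words and decodes the faulting word as a MIPS32 I-type instruction (opcode 0x23 is `lw`); the helper names are made up for illustration, and the value of `$3` (v1) is copied from the register dump in the log.

```python
CODE_LINE = (
    "Code: 00041940 02031821 8cab0008 <8c660010> 8c6c0014 "
    "8cad000c 00cb5821 018d6021 0166302b"
)


def parse_code_line(line):
    """Return (words, index of the faulting word) from an oops Code: line."""
    tokens = line.split("Code:", 1)[1].split()
    fault_index = next(i for i, t in enumerate(tokens) if t.startswith("<"))
    words = [int(t.strip("<>"), 16) for t in tokens]
    return words, fault_index


def decode_i_type(word):
    """Split a MIPS32 I-type word into (opcode, rs, rt, signed immediate)."""
    opcode = word >> 26
    rs = (word >> 21) & 0x1F
    rt = (word >> 16) & 0x1F
    imm = word & 0xFFFF
    if imm & 0x8000:  # sign-extend the 16-bit immediate
        imm -= 0x10000
    return opcode, rs, rt, imm


words, fault_index = parse_code_line(CODE_LINE)
opcode, rs, rt, imm = decode_i_type(words[fault_index])

V1 = 0xD0C17548  # $3 (v1) from the register dump
fault_addr = (V1 + imm) & 0xFFFFFFFF

print(f"opcode=0x{opcode:02x} rs=${rs} rt=${rt} imm={imm}")
print(f"effective address = {fault_addr:08x}")
```

The decoded instruction is a load from `16($3)`, and `$3 + 16` equals the BadVA reported in the oops, so the bracketed word really is the access that faulted.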
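That makes things easy: search all kernel modules for the one that contains these instruction words. Using 8cab as the search key, dump each module with hexdump. By default hexdump prints 16-bit words in host byte order, so on this little-endian target the word 8cab0008 shows up as "0008 8cab".
# find . -name '*.ko' | while read -r f ; do hexdump "$f" | grep 8cab && echo "$f" ; done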
0001140 2024 0089 1940 0004 1821 0203 0008 8cab
0001170 0014 ac66 0010 8cab 0018 8c66 001c 8c6c
00011a0 0018 ac6b 001c ac66 0018 8cab 0000 8c86
./lib/modules/4.4.198/ip_account.ko
0000f70 0001 240d 001f 1580 0010 8cab 0000 8ca3
./lib/modules/4.4.198/nf_nat.ko
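Grepping for a single 16-bit word produces chance hits, so a stricter variant is to search each .ko for the whole byte sequence from the Code: line. This is a rough sketch, assuming a little-endian (mipsel) target and that none of the nine words carries a relocation (instructions such as `jal` can differ between the file and memory); `code_signature` and `find_signature` are hypothetical helper names.

```python
import struct
from pathlib import Path

# The nine words from the Code: line, in order.
CODE_WORDS = [
    0x00041940, 0x02031821, 0x8CAB0008, 0x8C660010, 0x8C6C0014,
    0x8CAD000C, 0x00CB5821, 0x018D6021, 0x0166302B,
]


def code_signature(words, byteorder="<"):
    """Pack 32-bit instruction words into bytes ('<' little, '>' big endian)."""
    return struct.pack(f"{byteorder}{len(words)}I", *words)


def find_signature(root, signature, pattern="*.ko"):
    """Return (path, file offset) for every file under root containing signature."""
    hits = []
    for path in sorted(Path(root).rglob(pattern)):
        offset = path.read_bytes().find(signature)
        if offset != -1:
            hits.append((path, offset))
    return hits


# Usage, run from the root of the extracted firmware:
#   for path, offset in find_signature(".", code_signature(CODE_WORDS)):
#       print(f"{path}: file offset 0x{offset:x}")
```

If the full sequence finds nothing, shortening the signature to the few words around the faulting instruction is a reasonable fallback.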
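Of the two hits, ip_account.ko is the real match: the words next to 8cab (1940 0004 1821 0203 0008 8cab) correspond to 00041940 02031821 8cab0008 in the Code: line, whereas nf_nat.ko merely contains 8cab by chance. Disassemble ip_account.ko and search the output for the faulting instruction: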
# mipsel-openwrt-linux-objdump -S ip_account.ko > /work/ip_account.ko.objdump
table->hnat_stats[slot].src_bytes += node[i].src_bytes;
1104: 00041940 sll v1,a0,0x5
1108: 02031821 addu v1,s0,v1
110c: 8cab0008 lw t3,8(a1)
1110: 8c660010 lw a2,16(v1)
1114: 8c6c0014 lw t4,20(v1)
1118: 8cad000c lw t5,12(a1)
111c: 00cb5821 addu t3,a2,t3
1120: 018d6021 addu t4,t4,t5
1124: 0166302b sltu a2,t3,a2
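The instructions before and after 8c660010 match the Code: line exactly. The offset 0x1110 in the disassembly also equals the low 16 bits of every epc in the log (8ce01110, 8ce61110, 8c6c1110), which is consistent with the module's text being loaded at 8ce00000 in the first crash. The Code: line alone was therefore enough to identify both the faulting module and the exact instruction, and through it the source line.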
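As a further cross-check, the address computation in the disassembly can be replayed with the values from the register dump. The sketch below assumes that a0 ($4) and s0 ($16) still hold, at the time of the oops, the values they had when `sll` and `addu` executed; none of the intervening instructions writes to them. Reading a0 as the `slot` index is an interpretation based on the source line shown by objdump, not something the log states.

```python
MASK32 = 0xFFFFFFFF

# Values copied from the register dump of the first oops.
A0 = 0x0A1F0337         # $4  (a0), assumed to be the slot index
S0 = 0x8CE10E68         # $16 (s0), assumed to be the base of hnat_stats
V1_DUMPED = 0xD0C17548  # $3  (v1) as printed in the oops
BAD_VA = 0xD0C17558     # BadVA as printed in the oops

v1 = (A0 << 5) & MASK32          # 1104: sll  v1,a0,0x5  (slot * 32 bytes)
v1 = (S0 + v1) & MASK32          # 1108: addu v1,s0,v1   (&hnat_stats[slot])
fault_addr = (v1 + 16) & MASK32  # 1110: lw   a2,16(v1)

print(f"computed v1   = {v1:08x}, dumped v1 = {V1_DUMPED:08x}")
print(f"computed addr = {fault_addr:08x}, BadVA = {BAD_VA:08x}")
print(f"index in a0   = {A0} (0x{A0:08x})")
```

Both computed values agree with the dump. Under the assumptions above, the index in a0 is far too large for any plausible table, which suggests that an out-of-range `slot` rather than a corrupted table pointer caused the bad access.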