记录一次C语言调用go生成的动态库的踩坑过程
问题现象
由于某些特殊原因,需要在C语言中调用go语言生成的so,本来挺顺利,一切都运行的很好。突然某一天,不知道怎么回事,再一个新程序中无法正常运行了,看到的现象是程序无任何响应,类似于直接卡死了。使用gdb查看进程当前的信息,看到如下调用栈:
(gdb) source /root/go/src/runtime/runtime-gdb.py
Loading Go Runtime support.
(gdb) bt
#0 runtime.futex () at /usr/local/go/src/runtime/sys_linux_amd64.s:520
#1 0x00007fff23c550f6 in runtime.futexsleep (addr=0xfffffffffffffe00, val=0, ns=140733793710979) at /usr/local/go/src/runtime/os_linux.go:44
#2 0x00007fff23c35667 in runtime.notetsleep_internal (n=0x7fff23dc66e8 <runtime.work+232>, ns=-1) at /usr/local/go/src/runtime/lock_futex.go:183
#3 0x00007fff23c35785 in runtime.notetsleepg (n=0x7fff23dc66e8 <runtime.work+232>, ns=-1) at /usr/local/go/src/runtime/lock_futex.go:237
#4 0x00007fff23c41338 in runtime.gcBgMarkStartWorkers () at /usr/local/go/src/runtime/mgc.go:1126
#5 0x00007fff23c3fd92 in runtime.gcStart (trigger=...) at /usr/local/go/src/runtime/mgc.go:637
#6 0x00007fff23c36e3d in runtime.mallocgc (size=565248, typ=0x7fff23cfd160, needzero=true) at /usr/local/go/src/runtime/malloc.go:1174
#7 0x00007fff23c6dfaa in runtime.growslice (et=, old=..., cap=) at /usr/local/go/src/runtime/slice.go:267
#8 0x00007fff23c918ef in regexp/syntax.(*compiler).inst (c=0xc000044bf8, op=) at /usr/local/go/src/regexp/syntax/compile.go:164
#9 regexp/syntax.(*compiler).rune (c=0xc000044bf8, r=[]int32 = {...}, flags=212) at /usr/local/go/src/regexp/syntax/compile.go:273
#10 0x00007fff23c90f1e in regexp/syntax.(*compiler).compile (c=0xc000044bf8, re=0xc000190000) at /usr/local/go/src/regexp/syntax/compile.go:101
#11 0x00007fff23c8f805 in regexp/syntax.Compile (re=0xc000190000) at /usr/local/go/src/regexp/syntax/compile.go:74
#12 0x00007fff23ca4a9a in regexp.compile (expr=,
mode=, longest=false) at /usr/local/go/src/regexp/regexp.go:178
#13 0x00007fff23ca5591 in regexp.Compile (expr="") at /usr/local/go/src/regexp/regexp.go:133
#14 regexp.MustCompile (str="") at /usr/local/go/src/regexp/regexp.go:309
#15 0x00007fff23ca8e0a in main.CheckPostAddress (buf=, bufLen=) at /root/codes/di_rechk/rechk.go:540
#16 0x00007fff23ca9a88 in _cgoexp_67df0785bef2_CheckPostAddress (a=0x7fffffffc740) at _cgo_gotypes.go:162
#17 0x00007fff23c2e21a in runtime.cgocallbackg1 (fn=0x7fff23ca9a60 <_cgoexp_67df0785bef2_CheckPostAddress>, frame=0x7fffffffc740, ctxt=0)
at /usr/local/go/src/runtime/cgocall.go:306
#18 0x00007fff23c2dee9 in runtime.cgocallbackg (fn=0x7fff23ca9a60 <_cgoexp_67df0785bef2_CheckPostAddress>, frame=0x7fffffffc740, ctxt=0)
at /usr/local/go/src/runtime/cgocall.go:232
#19 0x00007fff23c83791 in runtime.cgocallbackg (fn=0x7fff23ca9a60 <_cgoexp_67df0785bef2_CheckPostAddress>, frame=0x7fffffffc740, ctxt=0)
at :1
#20 0x00007fff23c813f3 in runtime.cgocallback () at /usr/local/go/src/runtime/asm_amd64.s:915
#21 0x00007fff23c81601 in runtime.goexit () at /usr/local/go/src/runtime/asm_amd64.s:1581
#22 0x0000000000000000 in ?? ()
(gdb) info goroutines
17 syscall runtime.notetsleepg
2 waiting runtime.gopark
3 waiting runtime.gopark
4 waiting runtime.gopark
19 runnable runtime.gcBgMarkWorker
(gdb) goroutine 17 bt
#0 runtime.notetsleepg (n=0x7fff23dc66e8 <runtime.work+232>, ns=-1) at /usr/local/go/src/runtime/lock_futex.go:237
#1 0x00007fff23c41338 in runtime.gcBgMarkStartWorkers () at /usr/local/go/src/runtime/mgc.go:1126
#2 0x00007fff23c3fd92 in runtime.gcStart (trigger=...) at /usr/local/go/src/runtime/mgc.go:637
#3 0x00007fff23c36e3d in runtime.mallocgc (size=565248, typ=0x7fff23cfd160, needzero=true) at /usr/local/go/src/runtime/malloc.go:1174
#4 0x00007fff23c6dfaa in runtime.growslice (et=, old=..., cap=) at /usr/local/go/src/runtime/slice.go:267
#5 0x00007fff23c918ef in regexp/syntax.(*compiler).inst (c=0xc000044bf8, op=) at /usr/local/go/src/regexp/syntax/compile.go:164
#6 regexp/syntax.(*compiler).rune (c=0xc000044bf8, r=[]int32 = {...}, flags=212) at /usr/local/go/src/regexp/syntax/compile.go:273
#7 0x00007fff23c90f1e in regexp/syntax.(*compiler).compile (c=0xc000044bf8, re=0xc00022a000) at /usr/local/go/src/regexp/syntax/compile.go:101
#8 0x00007fff23c8f805 in regexp/syntax.Compile (re=0xc00022a000) at /usr/local/go/src/regexp/syntax/compile.go:74
#9 0x00007fff23ca4a9a in regexp.compile (expr=,
mode=, longest=false) at /usr/local/go/src/regexp/regexp.go:178
#10 0x00007fff23ca5591 in regexp.Compile (expr=""") at /usr/local/go/src/regexp/regexp.go:133
#11 regexp.MustCompile (str=""") at /usr/local/go/src/regexp/regexp.go:309
#12 0x00007fff23ca8e0a in main.CheckPostAddress (buf=, bufLen=) at /root/codes/di_rechk/rechk.go:540
#13 0x00007fff23ca9a88 in _cgoexp_67df0785bef2_CheckPostAddress (a=0x7fffffffc740) at _cgo_gotypes.go:162
#14 0x00007fff23c2e21a in runtime.cgocallbackg1 (fn=0x7fff23ca9a60 <_cgoexp_67df0785bef2_CheckPostAddress>, frame=0x7fffffffc740, ctxt=0)
at /usr/local/go/src/runtime/cgocall.go:306
#15 0x00007fff23c2dee9 in runtime.cgocallbackg (fn=0x7fff23ca9a60 <_cgoexp_67df0785bef2_CheckPostAddress>, frame=0x7fffffffc740, ctxt=0)
at /usr/local/go/src/runtime/cgocall.go:232
#16 0x00007fff23c83791 in runtime.cgocallbackg (fn=0x7fff23ca9a60 <_cgoexp_67df0785bef2_CheckPostAddress>, frame=0x7fffffffc740, ctxt=0)
at :1
#17 0x00007fff23c813f3 in runtime.cgocallback () at /usr/local/go/src/runtime/asm_amd64.s:915
#18 0x00007fff23c81601 in runtime.goexit () at /usr/local/go/src/runtime/asm_amd64.s:1581
#19 0x0000000000000000 in ?? ()
(gdb) goroutine 2 bt
#0 runtime.gopark (unlockf=, lock=, reason=, traceEv=, traceskip=)
at /usr/local/go/src/runtime/proc.go:367
#1 0x00007fff23c5b86d in runtime.goparkunlock (reason=, traceEv=, traceskip=, lock=)
at /usr/local/go/src/runtime/proc.go:372
#2 runtime.forcegchelper () at /usr/local/go/src/runtime/proc.go:306
#3 0x00007fff23c81601 in runtime.goexit () at /usr/local/go/src/runtime/asm_amd64.s:1581
#4 0x0000000000000000 in ?? ()
(gdb) goroutine 3 bt
#0 runtime.gopark (unlockf=, lock=, reason=, traceEv=, traceskip=)
at /usr/local/go/src/runtime/proc.go:367
#1 0x00007fff23c48ee8 in runtime.goparkunlock (reason=, traceEv=, traceskip=, lock=)
at /usr/local/go/src/runtime/proc.go:372
#2 runtime.bgsweep () at /usr/local/go/src/runtime/mgcsweep.go:163
#3 0x00007fff23c81601 in runtime.goexit () at /usr/local/go/src/runtime/asm_amd64.s:1581
#4 0x0000000000000000 in ?? ()
(gdb) goroutine 4 bt
#0 runtime.gopark (unlockf=, lock=, reason=, traceEv=, traceskip=)
at /usr/local/go/src/runtime/proc.go:367
#1 0x00007fff23c46fed in runtime.goparkunlock (reason=, traceEv=, traceskip=, lock=)
at /usr/local/go/src/runtime/proc.go:372
#2 runtime.bgscavenge () at /usr/local/go/src/runtime/mgcscavenge.go:265
#3 0x00007fff23c81601 in runtime.goexit () at /usr/local/go/src/runtime/asm_amd64.s:1581
#4 0x0000000000000000 in ?? ()
(gdb) goroutine 19 bt
#0 runtime.gcBgMarkWorker () at /usr/local/go/src/runtime/mgc.go:1166
#1 0x00007fff23c81601 in runtime.goexit () at /usr/local/go/src/runtime/asm_amd64.s:1581
#2 0x0000000000000000 in ?? ()
(gdb) info threads
Id Target Id Frame
1 Thread 0x7ffff7966b80 (LWP 2024707) "nginx" runtime.futex () at /usr/local/go/src/runtime/sys_linux_amd64.s:520
2 Thread 0x7ffef1ffb700 (LWP 2024717) "nginx" 0x00007ffff7a483bf in __GI___clock_nanosleep (clock_id=clock_id@entry=0, flags=flags@entry=0,
req=req@entry=0x7ffef1ffae80, rem=rem@entry=0x0) at ../sysdeps/unix/sysv/linux/clock_nanosleep.c:78
3 Thread 0x7ffef27fc700 (LWP 2024718) "ZMQbg/Reaper" 0x00007ffff7a8a5ce in epoll_wait (epfd=42, events=events@entry=0x7ffef27fb200,
maxevents=maxevents@entry=256, timeout=timeout@entry=-1) at ../sysdeps/unix/sysv/linux/epoll_wait.c:30
4 Thread 0x7ffef2ffd700 (LWP 2024719) "ZMQbg/IO/0" 0x00007ffff7a8a5ce in epoll_wait (epfd=44, events=events@entry=0x7ffef2ffc200,
maxevents=maxevents@entry=256, timeout=timeout@entry=-1) at ../sysdeps/unix/sysv/linux/epoll_wait.c:30
问题解决过程
由于对go不太熟悉,所以不知道出了什么问题。go本身的代码也比较简单,没有太复杂的操作。而且之前的程序调用这些go代码生成的so,也能整成运行。从网上查资料也是一头雾水,不知道该怎么查。经过一段时间的搜索,终于在stackoverflow中看到了一个类似问题。按照这个原因进行修改,果然修改好了。
问题原因总结
按照stackoverflow中的总结。原因是在程序启动的时候,如果调用了go生成的so,那么go会调用一些函数建立go运行时环境,会创建一些线程。而我的新程序中,具体调用go函数的地方,是在子进程中调用的,而父进程fork子进程的时候,不会把go建立的那几个运行时线程复制到子进程中,即便在子进程中重新dlopen go生成的so也不会重新创建那几个运行时线程,所以导致go中那些涉及到一些需要运行时程序干预的代码(如gc等)无法正常运行。而我的go代码中刚好触发了GC,导致go无法运行,一直处在死循环之中。
解决方法是:在fork之后运行dlopen打开go生成的so,这样就可以了。
修改之后,程序运行的信息如下:
(gdb) info threads
Id Target Id Frame
* 1 Thread 0x7ffff7966b80 (LWP 2366880) "nginx" 0x00007ffff7a8a5ce in epoll_wait (epfd=31, events=0x555555d0af90, maxevents=512,
timeout=timeout@entry=-1) at ../sysdeps/unix/sysv/linux/epoll_wait.c:30
2 Thread 0x7fff481fd700 (LWP 2366893) "nginx" 0x00007ffff7a483bf in __GI___clock_nanosleep (clock_id=clock_id@entry=0, flags=flags@entry=0,
req=req@entry=0x7fff481fce80, rem=rem@entry=0x0) at ../sysdeps/unix/sysv/linux/clock_nanosleep.c:78
3 Thread 0x7fff489fe700 (LWP 2366894) "ZMQbg/Reaper" 0x00007ffff7a8a5ce in epoll_wait (epfd=42, events=events@entry=0x7fff489fd200,
maxevents=maxevents@entry=256, timeout=timeout@entry=-1) at ../sysdeps/unix/sysv/linux/epoll_wait.c:30
4 Thread 0x7fff491ff700 (LWP 2366895) "ZMQbg/IO/0" 0x00007ffff7a8a5ce in epoll_wait (epfd=44, events=events@entry=0x7fff491fe200,
maxevents=maxevents@entry=256, timeout=timeout@entry=-1) at ../sysdeps/unix/sysv/linux/epoll_wait.c:30
5 Thread 0x7fff245cb700 (LWP 2368456) "nginx" runtime.futex () at /root/go/src/runtime/sys_linux_amd64.s:560
6 Thread 0x7fff20c34700 (LWP 2368457) "nginx" runtime.futex () at /root/go/src/runtime/sys_linux_amd64.s:560
7 Thread 0x7fff091a1700 (LWP 2368458) "nginx" runtime.futex () at /root/go/src/runtime/sys_linux_amd64.s:560
8 Thread 0x7fff089a0700 (LWP 2368459) "nginx" runtime.futex () at /root/go/src/runtime/sys_linux_amd64.s:560
9 Thread 0x7ffed7fff700 (LWP 2368460) "nginx" runtime.futex () at /root/go/src/runtime/sys_linux_amd64.s:560
10 Thread 0x7ffed77fe700 (LWP 2368461) "nginx" runtime.futex () at /root/go/src/runtime/sys_linux_amd64.s:560
11 Thread 0x7ffed6ffd700 (LWP 2368462) "nginx" runtime.futex () at /root/go/src/runtime/sys_linux_amd64.s:560
12 Thread 0x7ffed67fc700 (LWP 2368463) "nginx" runtime.futex () at /root/go/src/runtime/sys_linux_amd64.s:560
和有问题时的info threads信息对比,发现多了好几个runtime.futex的线程,这些线程即是运行go程序需要的线程。之前的程序之所以没出问题,是因为之前的程序没有调用fork,所以不会出现此问题。
额外问题
在有问题的时候,为什么触发GC的时候导致go程序无法运行,一直处在死循环中?这个需要了解go的垃圾回收机制才能知道原因,我本身对go不太熟悉,所以只是简单了解了一点,还无法详细解释清楚。