Seems a pretty good bet that this is due to https://sourceware.org/git/?p=glibc.git;a=commit;h=27761a1042daf01987e7d79636d0c41511c6df3c
#0 0x4ec8eb4c in __lll_lock_wait_private () from /lib/libc.so.6
#1 0x4ec8efc6 in __unregister_atfork () from /lib/libc.so.6
#2 0x4ebbd5a9 in __cxa_finalize () from /lib/libc.so.6
#3 0x4dc1a3b1 in __do_global_dtors_aux () from /usr/lib/libpcsclite.so.1
#4 0x4f05ddca in _dl_close_worker () from /lib/ld-linux.so.2
#5 0x4f05e9e2 in _dl_close () from /lib/ld-linux.so.2
#6 0x4ecbdda9 in _dl_catch_exception () from /lib/libc.so.6
#7 0x4ecbde50 in _dl_catch_error () from /lib/libc.so.6
#8 0x4ed3d5e4 in _dlerror_run () from /lib/libdl.so.2
#9 0x4ed3ce9f in dlclose () from /lib/libdl.so.2
#10 0x4ea70664 in sc_dlclose () from /usr/lib/libopensc.so.6
#11 0x4e98d3fb in pcsc_finish () from /usr/lib/libopensc.so.6
#12 0x4e95b259 in sc_release_context () from /usr/lib/libopensc.so.6
#13 0x4eb3ebf2 in C_Finalize () from /usr/lib/opensc-pkcs11.so
#14 0x4eb3ec72 in C_Initialize () from /usr/lib/opensc-pkcs11.so
#15 0x4efe6a1b in __pkcs11h_threading_atfork_child ()
#16 0x4ec8f1ed in __run_fork_handlers () from /lib/libc.so.6
#17 0x4ec43aa8 in fork () from /lib/libc.so.6
#18 0x4f00c540 in fork_compat () from /lib/libpthread.so.0
#19 0x125511d5 in openvpn_execve ()
#20 0x12551355 in openvpn_execve_check ()
Some additional observations:
The prohibition against downgrading glibc is not there "just in case" - downgrading glibc to 2.27 caused any binary built against 2.28 to fail to run.
There was a commit to glibc after 2.28 relating to a deadlock in atfork handlers (https://sourceware.org/git/?p=glibc.git;a=commit;h=669ff911e2571f74a2668493e326ac9a505776bd). Applying this as a user patch did not resolve this deadlock.
I created a patch reverting 27761a1042daf01987e7d79636d0c41511c6df3c, applied that as a user patch, and that did resolve the deadlock, confirming that this is the commit which caused the regression.
I am not entirely sure that glibc is "in the wrong" here, as some sources online suggest that atfork handlers are restricted to async-signal-safe functions, of which I don't believe dlclose is a member. However, other sources do not have such a restriction, and besides, a signal handler is not involved here.
Created attachment 575484
Patch which reverts 27761a1042daf01987e7d79636d0c41511c6df3c
Here's the patch I generated to confirm that this commit was the cause of the deadlock.
Can you describe how to reproduce the deadlock and a few details about it?
I could try to bisect glibc to find the first offender and try to reduce an example to something to show upstream.
Can you post your `emerge --info`?
Sorry I have been busy and have not gotten back to this as soon as I would have liked. Also, I'm sorry I didn't include as much information as I thought I did in the original report.
I am not including emerge --info as it is not relevant - once I tracked down the cause, it was obvious it would happen every time, and does not depend on environment.
The specific use case is kind of complicated. I am using OpenVPN with a pkcs11 "smartcard" (actually GnuK). The software stack involved at the time consists of
Note that both glibc and opensc have newer versions now, but I have not gotten the chance to test with those.
The situation is that pkcs11-helper installs an atfork handler, whose purpose is to deinitialize the smartcard in the child process, to avoid inadvertently allowing it to inherit an open connection to the smartcard. As part of opensc's "Finalize", it dlclose()s its backend module, which in this case is pcsc-lite. It appears that pcsc-lite also has an atfork handler installed (I did not investigate that one, but presumably it is also for the purpose of closing any open smartcard in the child). Glibc registers a mechanism, apparently the same way that C++ destructors are registered, to remove any atfork handlers that are registered in a module when it is being unloaded. Now that there is a (non-recursive) lock around the list of atfork handlers, and the handlers are called while that lock is held, attempting to unregister an atfork handler from within an atfork handler callback results in a deadlock.
There is no need for you to bisect - I already tracked down the commit in question - 27761a1042daf01987e7d79636d0c41511c6df3c - and confirmed that reverting this solves my deadlock. If you want a simple test case to reproduce this, I think the simplest incarnation would be to have an executable and a shared library. The executable would register with pthread_atfork(), dlopen() the shared library, and call something in it which also registers an atfork handler with pthread_atfork(). The executable would then fork(), and in its atfork child handler dlclose() the shared library. This should deadlock with glibc 2.28 (and 2.29, though I have not yet confirmed this), but work fine with 2.27 and older.
I'd like the bug fixed upstream (if it's still in glibc-master). Feel free to file it yourself on the upstream tracker at https://www.gnu.org/software/libc/bugs.html
I'd prefer an upstream fix backport and a confirmation it's a glibc bug and not application API misuse.
After having upgraded to sys-libs/glibc-2.29-r2 and dev-libs/opensc-0.19.0-r2 I am no longer seeing the hang/deadlock. The only thing that seems like it should have helped is the 669ff911e2571f74a2668493e326ac9a505776bd commit disabling locking in the single-threaded case, but I was sure I tried that. Either I screwed that up, or some change in opensc also helped.
(In reply to Jeremy Drake from comment #6)
> After having upgraded to sys-libs/glibc-2.29-r2 and
> dev-libs/opensc-0.19.0-r2 I am no longer seeing the hang/deadlock. The only
> thing that seems like it should have helped is the
> 669ff911e2571f74a2668493e326ac9a505776bd commit disabling locking in the
> single-threaded case, but I was sure I tried that. Either I screwed that
> up, or some change in opensc also helped.
I downgraded opensc to 0.18.0 and the hang/deadlock is back. I also confirmed that there was a second thread in the parent.