Summary: | sys-libs/glibc-2.28-r6 regression: deadlock in openvpn using pkcs11-helper in atfork child handler | ||
---|---|---|---|
Product: | Gentoo Linux | Reporter: | Jeremy Drake <gentoo-bugzilla> |
Component: | Current packages | Assignee: | Gentoo Toolchain Maintainers <toolchain> |
Status: | RESOLVED UPSTREAM | ||
Severity: | normal | ||
Priority: | Normal | ||
Version: | unspecified | ||
Hardware: | All | ||
OS: | Linux | ||
See Also: | https://sourceware.org/bugzilla/show_bug.cgi?id=24595 | ||
Whiteboard: | |||
Package list: | Runtime testing required: | --- | |
Attachments: | Patch which reverts 27761a1042daf01987e7d79636d0c41511c6df3c |
Description
Jeremy Drake
2019-05-04 05:25:23 UTC
Some additional observations:

The prohibition against downgrading glibc is not there "just in case" - downgrading glibc to 2.27 caused any binary built against 2.28 to fail to run.

There was a commit to glibc after 2.28 relating to a deadlock in atfork handlers (https://sourceware.org/git/?p=glibc.git;a=commit;h=669ff911e2571f74a2668493e326ac9a505776bd). Applying this as a user patch did not resolve this deadlock. I created a patch reverting 27761a1042daf01987e7d79636d0c41511c6df3c, applied that as a user patch, and that did resolve the deadlock, confirming that this is the commit which caused the regression.

I am not entirely sure that glibc is "in the wrong" here, as some sources online suggest that atfork handlers are restricted to async-signal-safe functions, and I don't believe dlclose is one of them. However, other sources do not mention such a restriction, and besides, no signal handler is involved here.

Created attachment 575484
Patch which reverts 27761a1042daf01987e7d79636d0c41511c6df3c
Here's the patch I generated to confirm that this commit was the cause of the deadlock.
Can you describe how to reproduce the deadlock and a few details about it? I could try to bisect glibc to find the first offender and try to reduce an example to something to show upstream. Can you post your `emerge --info`?

Sorry, I have been busy and have not gotten back to this as soon as I would have liked. Also, I'm sorry I didn't include as much information as I thought I did in the original report. I am not including emerge --info as it is not relevant - once I tracked down the cause, it was obvious it would happen every time, and does not depend on environment.

The specific use case is kind of complicated. I am using OpenVPN with a pkcs11 "smartcard" (actually GnuK). The software stack involved at the time consists of
sys-libs/glibc-2.28-r6
sys-apps/pcsc-lite-1.8.24
dev-libs/opensc-0.18.0
dev-libs/pkcs11-helper-1.25.1
net-vpn/openvpn-2.4.6

Note that both glibc and opensc have newer versions now, but I have not gotten the chance to test with those.

The situation is that pkcs11-helper installs an atfork handler, whose purpose is to deinitialize the smartcard in the child process, to avoid inadvertently allowing it to inherit an open connection to the smartcard. As part of opensc's "Finalize", it dlclose()s its backend module, which in this case is pcsc-lite. It appears that pcsc-lite also has an atfork handler installed (I did not investigate that one, but presumably it is also for the purpose of closing any open smartcard in the child). Glibc registers a mechanism, apparently the same way that C++ destructors are registered, to remove any atfork handlers that a module registered when that module is unloaded. Now that there is a (non-recursive) lock around the list of atfork handlers, and the handlers are called while that lock is held, attempting to unregister an atfork handler from within an atfork handler callback results in a deadlock.
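The locking problem described above can be modeled in a few lines. This is only a toy sketch in Python, not glibc's code: the class `AtforkRegistry`, its methods, and `simulate()` are hypothetical names made up for illustration, and the owner strings merely stand in for the modules involved. It assumes nothing beyond what the comment states: one non-recursive lock guards the handler list, handlers run while that lock is held, and unloading a module tries to take the same lock to remove that module's handlers.

```python
import threading


class AtforkRegistry:
    """Toy stand-in for a handler list guarded by one non-recursive lock."""

    def __init__(self):
        self._lock = threading.Lock()  # non-recursive: cannot be re-acquired by its holder
        self._handlers = []            # list of (owner, handler) pairs

    def register(self, owner, handler):
        with self._lock:
            self._handlers.append((owner, handler))

    def unregister_owner(self, owner):
        """Remove all handlers of `owner`; return False if the lock is already held."""
        # A blocking acquire would never return when called from inside a
        # handler; the non-blocking form lets the sketch report that instead.
        if not self._lock.acquire(blocking=False):
            return False
        try:
            self._handlers = [(o, h) for o, h in self._handlers if o != owner]
            return True
        finally:
            self._lock.release()

    def run_child_handlers(self):
        # Handlers are invoked while the list lock is held.
        with self._lock:
            for _owner, handler in list(self._handlers):
                handler()


def simulate():
    """Return whether unregistering from inside a handler could proceed."""
    registry = AtforkRegistry()
    outcome = {}

    def helper_child_handler():
        # Models: Finalize -> dlclose(backend) -> remove the backend's handlers.
        outcome["unregistered"] = registry.unregister_owner("pcsc-lite")

    registry.register("pkcs11-helper", helper_child_handler)
    registry.register("pcsc-lite", lambda: None)
    registry.run_child_handlers()
    return outcome["unregistered"]


if __name__ == "__main__":
    print(simulate())
```

In this model `simulate()` returns `False`: the unregister step cannot take the lock because the code running the handlers already holds it. With a blocking acquire in its place, the call would wait forever, which is the hang seen in the forked child.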
There is no need for you to bisect - I already tracked down the commit in question - 27761a1042daf01987e7d79636d0c41511c6df3c - and confirmed that reverting this solves my deadlock.

If you want a simple test case to reproduce this, I think the simplest incarnation would be to have an executable and a shared library. The executable would register with pthread_atfork(), dlopen() the shared library, and call something in it which also registers an atfork handler with pthread_atfork(). The executable would then fork(), and in its atfork child handler dlclose() the shared library. This should deadlock with glibc 2.28 (and 2.29, though I have not yet confirmed this), but work fine with 2.27 and older.

I'd like the bug fixed upstream (if it's still in glibc-master). Feel free to file it yourself in the upstream tracker at https://www.gnu.org/software/libc/bugs.html
I'd prefer a backport of an upstream fix, and confirmation that it's a glibc bug and not application API misuse.

After having upgraded to sys-libs/glibc-2.29-r2 and dev-libs/opensc-0.19.0-r2 I am no longer seeing the hang/deadlock. The only thing that seems like it should have helped is the 669ff911e2571f74a2668493e326ac9a505776bd commit disabling locking in the single-threaded case, but I was sure I tried that. Either I screwed that up, or some change in opensc also helped.

(In reply to Jeremy Drake from comment #6)
> After having upgraded to sys-libs/glibc-2.29-r2 and
> dev-libs/opensc-0.19.0-r2 I am no longer seeing the hang/deadlock. The only
> thing that seems like it should have helped is the
> 669ff911e2571f74a2668493e326ac9a505776bd commit disabling locking in the
> single-threaded case, but I was sure I tried that. Either I screwed that
> up, or some change in opensc also helped.

I downgraded opensc to 0.18.0 and the hang/deadlock is back. I also confirmed that there was a second thread in the parent.

Let's leave it to upstream to decide on the final solution.