Newer kernels, starting with gentoo-sources-6.1.12, misbehave on my system: arbitrary processes hang, and multiple kworker/events threads hogging CPU are created.

Reproducible: Always

Steps to Reproduce:
(These steps always triggered the problem for me and are roughly what I did after upgrading to affected / non-affected kernels anyway. There are other ways to possibly trigger this issue, see below. For affected / non-affected versions, also see below.)
1. emerge a gentoo-sources-6.x release
2. eselect kernel that release
3. cd /usr/src/linux
4. gunzip < /proc/config.gz > .config
5. make oldconfig
6. [build and install the new kernel]
7. reboot to new kernel
8. emerge -bavt -1 @module-rebuild
9. [while this is running] play a round of HoloCure (under wine-7.0.1/wine-8.0)

Actual Results:
HoloCure or other processes (librewolf web browser, keepassxc password manager, shutter screenshot tool, atop, ...) may hang hard, for good or at least for multiple minutes. Hung processes do not even react to kill -9. Simultaneously, multiple kworker/x:y+events threads consuming 100% CPU (or as much as they can get) show up. Those kworker threads never go away again, even after the process hangs resolve (if/when those do resolve).

Expected Results:
As in unaffected versions, processes should not hang randomly, and kworker threads continuously eating up lots of CPU should not spawn.

The problem first happened for me with gentoo-sources-6.1.12. Before that, I'd run 5.15.74 without any issues. When the problem first occurred, I upgraded to 6.2.1, which worked without any issues as well. Recently, I upgraded to the now-stable 6.1.19 and could immediately see the hallmarks of this issue again. So I upgraded to 6.2.11, but this version is affected as well.

So it looks like some change in newer kernel versions makes my system prone to this misbehavior. Since I can't find a deluge of similar reports on Google, I suspect it may be some idiosyncrasy of my setup.
I'd like to address this before finding a kernel version that's stable on my system becomes even more difficult.

At the time of first experiencing these issues in 6.1.12, I wrote a question on superuser.com with some more details:
https://superuser.com/questions/1771030/arbitrary-other-processes-hang-when-i-pause-one-how-can-i-pause-processes-b

When these symptoms happened again today after upgrading first to 6.1.19 and then to 6.2.11, I used this recipe to capture what the kworkers might be doing:
https://stackoverflow.com/questions/58161086/verifying-where-kworker-nn-in-ps-aux-is-invoked-from
which results in 11721 logged events over ~ 18 seconds, see here:
https://gist.github.com/jmbreuer/a12aa78f1031aff4ce6e28c7259dc99e

I'd appreciate any advice on how to proceed from here.
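For anyone trying to confirm the kworker symptom from the report on their own machine, here is a minimal sketch that filters ps output down to busy kworker threads. The helper name busy_kworkers and the threshold of 50 are made up for illustration. Note that the pcpu column of ps is averaged over the lifetime of the thread, so a freshly spawned hog may show up more clearly in top or atop.

```shell
#!/bin/sh
# Filter `ps -eo pid,pcpu,comm` style input (PID, %CPU, command name)
# down to kworker threads whose %CPU is at least the given threshold.
# busy_kworkers is a hypothetical helper name, not an existing tool.
busy_kworkers() {
    awk -v min="$1" '$3 ~ /^kworker\// && ($2 + 0) >= (min + 0) { print $1, $2, $3 }'
}

# Example invocation on a live system (the threshold 50 is arbitrary):
#   ps -eo pid,pcpu,comm | busy_kworkers 50
```

The header line printed by ps is dropped automatically, because its third column does not start with "kworker/".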
I dug around a bit further after this and I think I'm onto something:

My issue appears to be related to CONFIG_SCHED_BMQ - all affected kernels on my system have (had) this enabled, whereas it doesn't even exist as an option on the ones that are unaffected.

I tried gentoo-sources-6.1.19 and gentoo-sources-6.2.11 with this option turned off, as well as vanilla-sources-6.2.11, which does not have this option, and all of those run without issues.

While running a kernel with BMQ enabled, at one point during boot I saw a kernel diagnostic complaining about an inconsistent task state. Digging after those in journalctl, it seems they only show up with affected kernels:

psi: inconsistent task state! task=2366:udevd cpu=5 psi_flags=0 clear=1 set=4
psi: inconsistent task state! task=9:kworker/u16:0 cpu=1 psi_flags=0 clear=1 set=4
psi: inconsistent task state! task=9:kworker/u16:0 cpu=0 psi_flags=0 clear=1 set=4

I definitely also had hangs and kworker CPU hogging issues without those messages, but the other way around it's clear: every kernel printing this diagnostic during boot would hang within 10 minutes of uptime, at most.

I'll take this over to Alfred Chen / the BMQ scheduler project: https://gitlab.com/alfredchen/linux-prjc
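The two checks described above (is BMQ enabled, and did the psi diagnostic show up) can be sketched as a pair of small shell helpers. The function names check_config and count_psi_warnings are made up for illustration, and the sketch assumes the kernel config is available as a plain-text file.

```shell
#!/bin/sh
# Print the state of the two options discussed above for a plain-text
# kernel config file. check_config is a hypothetical helper name.
check_config() {
    for opt in CONFIG_SCHED_BMQ CONFIG_PSI; do
        if grep -q "^${opt}=y" "$1"; then
            echo "${opt}=y"
        else
            echo "${opt} not enabled"
        fi
    done
}

# Count "psi: inconsistent task state!" lines read from stdin.
# grep -c exits non-zero when the count is 0, hence the "|| true".
count_psi_warnings() {
    grep -c 'psi: inconsistent task state!' || true
}

# Example invocations on a live system:
#   gunzip < /proc/config.gz > /tmp/running.config
#   check_config /tmp/running.config
#   journalctl -k | count_psi_warnings
```

A non-zero count together with CONFIG_SCHED_BMQ=y would match the pattern I saw on affected kernels; a count of zero does not rule the problem out, since hangs also occurred without the message.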
... how do I figure out which commit of https://gitlab.com/alfredchen/linux-prjc/ the respective genpatches-6.x-yy sets correspond to?
I've created a corresponding issue with ProjectC (which encompasses the BMQ scheduler) here: https://gitlab.com/alfredchen/linux-prjc/-/issues/79
(In reply to Joe Breuer from comment #2)
> ... how do I figure out which commit of
> https://gitlab.com/alfredchen/linux-prjc/ the respective genpatches-6.x-yy
> sets correspond to?

Our patches are at https://gitweb.gentoo.org/proj/linux-patches.git/ and you can see the tags & history there, as well as maybe grab it from git history for gentoo-sources.

But I wouldn't expect any of them to be causing this.
Let's see what upstream thinks about this. If they decide to update the patch, we will follow.
I found a "works for me" kind of workaround consisting of both a not-yet-merged patch to ProjectC and turning off psi, see https://gitlab.com/alfredchen/linux-prjc/-/issues/79#note_1360907509

There definitely are situations where ProjectC (BMQ/PDS) does not play nice with psi, plus at least one other issue that causes hangs for me on gentoo-sources-6.1.19 out of the box with BMQ enabled but psi disabled.

In those hangs without the patch applied, now that I already knew what I was looking for, I typically could not get to a shell, though the system appeared to be still alive to some degree (music kept playing, pings were answered). But there was no ssh; the mouse cursor was movable, but focus did not track it; there was no reaction to the keyboard or to num-lock toggling. Sometimes there was no reaction at all to Alt+SysRq sequences; sometimes those would print their headers, but not the result/success lines, and would not "do their thing". I do not know how to get useful diagnostics out of a system in that state.
Could you test whether the behavior is the same with gentoo-sources-6.2.13 and 6.1.26? If it is solved in a more recent kernel, we can close this as solved upstream.
(In reply to Alice Ferrazzi from comment #7)
> could you test if the behavior is same with gentoo-sources-6.2.13 and 6.1.26?
> because if is solved on more updated kernel we can close this as solved
> upstream.

I've tested both 6.1.26 and 6.2.13, and observed identical behavior with both of them.

With BMQ and psi enabled, they will hang processes in short order (triggered within tens of seconds of the reproduction scenario detailed above; one time I couldn't even get through the menus of HoloCure into the game proper).

Disabling psi / booting with 'psi=0' on the command line alleviates this; so far my system appears stable with both 6.1.26 and 6.2.13.

This is a change/improvement from 6.1.19, which needed a patch to be stable for me (in addition to psi being disabled). Interestingly, though, none of the changes from that patch are applied in current gentoo-sources-6.1.26/6.2.13, so the improved behavior must be due to some other change or a second-order effect.

I'll certainly continue to keep an eye on it; being able to have psi again at some point would also be "nice to have".
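To double-check that a boot really picked up the 'psi=0' workaround mentioned above, one can look for that token on the kernel command line. This is a minimal sketch: the function name psi_disabled_on_cmdline is made up for illustration, and it only matches the literal spelling psi=0 used in this report.

```shell
#!/bin/sh
# Succeed if the given kernel command line string contains the exact
# token "psi=0". psi_disabled_on_cmdline is a hypothetical helper name.
# $1 is intentionally left unquoted in the loop so that it is split
# into whitespace-separated arguments.
psi_disabled_on_cmdline() {
    for arg in $1; do
        [ "$arg" = "psi=0" ] && return 0
    done
    return 1
}

# Example invocation on a live system:
#   if psi_disabled_on_cmdline "$(cat /proc/cmdline)"; then
#       echo "psi disabled at boot"
#   fi
```

This only confirms the boot parameter; it says nothing about whether CONFIG_PSI was compiled out of the kernel entirely.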
... just saw, latest ProjectC/BMQ patches from Alfred Chen are mutually exclusive with psi for now: https://gitlab.com/alfredchen/linux-prjc/-/commit/542887ccaeadc65843ec171bccc87f8aa8bbca95 That patch is not yet in gentoo-sources-6.1.26 or 6.2.13.
The bug has been referenced in the following commit(s):

https://gitweb.gentoo.org/repo/gentoo.git/commit/?id=6febcb5b9366ea8425956ec72d35073b650f1b13

commit 6febcb5b9366ea8425956ec72d35073b650f1b13
Author:     Mike Pagano <mpagano@gentoo.org>
AuthorDate: 2023-05-10 18:53:16 +0000
Commit:     Mike Pagano <mpagano@gentoo.org>
CommitDate: 2023-05-10 18:53:58 +0000

    sys-kernel/gentoo-sources: netfltr patch for CVE-2023-32233, BMQ Patch

    netfilter: nf_tables: deactivate anonymous set from preparation phase
    sched/alt: Remove psi support

    Bug: https://bugs.gentoo.org/906064
    Bug: https://bugs.gentoo.org/904514
    Signed-off-by: Mike Pagano <mpagano@gentoo.org>

 sys-kernel/gentoo-sources/Manifest | 3 +++
 .../gentoo-sources/gentoo-sources-6.2.14-r1.ebuild | 28 ++++++++++++++++++++++
 2 files changed, 31 insertions(+)

https://gitweb.gentoo.org/repo/gentoo.git/commit/?id=a6053524af4e316e45c59dc66243f8ce52facaef

commit a6053524af4e316e45c59dc66243f8ce52facaef
Author:     Mike Pagano <mpagano@gentoo.org>
AuthorDate: 2023-05-10 18:51:40 +0000
Commit:     Mike Pagano <mpagano@gentoo.org>
CommitDate: 2023-05-10 18:53:58 +0000

    sys-kernel/gentoo-sources: netfltr patch for CVE-2023-32233, BMQ Patch

    netfilter: nf_tables: deactivate anonymous set from preparation phase
    sched/alt: Remove psi support

    Bug: https://bugs.gentoo.org/906064
    Bug: https://bugs.gentoo.org/904514
    Signed-off-by: Mike Pagano <mpagano@gentoo.org>

 sys-kernel/gentoo-sources/Manifest | 3 +++
 .../gentoo-sources/gentoo-sources-6.1.27-r1.ebuild | 28 ++++++++++++++++++++++
 2 files changed, 31 insertions(+)