Bug 904514 - sys-kernel/gentoo-sources-6.1 regressions - kworker/events live locks, processes hanging
Status: RESOLVED FIXED
Alias: None
Product: Gentoo Linux
Classification: Unclassified
Component: Current packages
Hardware: All Linux
Importance: Normal normal
Assignee: Gentoo Kernel Bug Wranglers and Kernel Maintainers
URL: https://gitlab.com/alfredchen/linux-p...
Whiteboard: 6.1.28, 6.2.15
Keywords: InVCS
Depends on:
Blocks:
 
Reported: 2023-04-18 15:01 UTC by Joe Breuer
Modified: 2023-05-11 12:15 UTC
CC List: 0 users

See Also:
Package list:
Runtime testing required: ---

Description Joe Breuer 2023-04-18 15:01:55 UTC
Newer kernels, starting with gentoo-sources-6.1.12, misbehave for me: arbitrary processes hang, and multiple kworker/events threads are spawned that hog the CPU.

Reproducible: Always

Steps to Reproduce:
(Those steps always brought out the problem for me and are kinda what I did after upgrading to affected / non-affected kernels anyway. There are other ways to possibly trigger this issue, see below. For affected/non-affected versions, also see below.)

1. emerge a gentoo-sources-6.x release
2. eselect kernel that release
3. cd /usr/src/linux
4. gunzip < /proc/config.gz > .config
5. make oldconfig
6. [build and install the new kernel]
7. reboot to new kernel
8. emerge -bavt -1 @module-rebuild
9. [while this is running] play a round of HoloCure (under wine-7.0.1/wine-8.0)
Actual Results:  
HoloCure or other processes (librewolf web browser, keepassxc password manager, shutter screenshot tool, atop, ...) may hang hard, for good or at least for multiple minutes.
Hung processes do not even react to kill -9.
Simultaneously, multiple kworker/x:y+events threads consuming 100% CPU (or as much as they can get) show up. Those kworker threads never go away again, even after process hangs resolve (if/when those do resolve).

Expected Results:  
As in unaffected versions, processes should not hang randomly and kworker threads continuously eating up lots of CPU should not spawn.

The problem first happened for me with gentoo-sources-6.1.12. Before, I'd run 5.15.74 without any issues. At the time when that problem occurred, I upgraded to 6.2.1 which worked without any issues as well.

Recently, I upgraded to the now-stable 6.1.19 and immediately saw the hallmarks of this issue again. So I upgraded to 6.2.11, but that version is affected as well.

So it looks like some change in newer kernel versions makes my system prone to this kind of breakage. Since I can't find a deluge of similar reports on Google, I suspect it may be some idiosyncrasy of my setup. I'd like to address this before finding a kernel version that's stable on my system becomes even more difficult.

At the time of first experiencing these issues in 6.1.12, I wrote a question on superuser.com with some more details:

https://superuser.com/questions/1771030/arbitrary-other-processes-hang-when-i-pause-one-how-can-i-pause-processes-b

When these symptoms happened again today, after upgrading first to 6.1.19 and then to 6.2.11, I used this recipe to capture what the kworkers might be doing:

https://stackoverflow.com/questions/58161086/verifying-where-kworker-nn-in-ps-aux-is-invoked-from

which resulted in 11721 logged events over ~18 seconds; see here:

https://gist.github.com/jmbreuer/a12aa78f1031aff4ce6e28c7259dc99e
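
A capture of that size is easier to read once it is aggregated per work function. The following is only a minimal sketch, assuming the log consists of ftrace workqueue_queue_work event lines that carry a function=NAME field; the helper name summarize_kworker_trace and the file name trace.log are illustrative and not part of the linked recipe.

```shell
#!/bin/sh
# Hypothetical helper: count queued work items per callback function.
# Reads ftrace text on stdin; assumes each event line contains "function=NAME".
# Prints "COUNT NAME" lines, most frequent first.
summarize_kworker_trace() {
    grep -o 'function=[^ ]*' \
        | sed 's/^function=//' \
        | sort \
        | uniq -c \
        | sort -rn \
        | awk '{print $1, $2}'
}
```

Typical use would be something like `summarize_kworker_trace < trace.log | head`, which shows the functions that were queued most often during the capture window.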

I'd appreciate any advice on how to proceed from here.
Comment 1 Joe Breuer 2023-04-19 07:37:13 UTC
I dug around after this a bit further and I think I'm onto something:

My issue appears to be related to CONFIG_SCHED_BMQ - all affected kernels on my system have (had) this enabled, whereas it doesn't even exist as an option on the ones that are unaffected.

I tried gentoo-sources-6.1.19, gentoo-sources-6.2.11 with this option turned off, as well as vanilla-sources-6.2.11 which does not have this option, and all of those run without issues.
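
To compare kernels quickly, it can help to check whether a given option is enabled in a config. This is a rough sketch under the assumption that the config is available either as a plain .config file or as a gzip-compressed file such as /proc/config.gz (as used in the reproduction steps); the function name config_enabled is made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: succeed if OPTION is set to y or m in a kernel config.
# Usage: config_enabled OPTION [FILE]
# FILE defaults to /proc/config.gz; names ending in .gz are decompressed first.
config_enabled() {
    opt=$1
    file=${2:-/proc/config.gz}
    case $file in
        *.gz) gunzip -c "$file" ;;
        *)    cat "$file" ;;
    esac | grep -Eq "^${opt}=(y|m)\$"
}
```

For example, `config_enabled CONFIG_SCHED_BMQ && echo "BMQ is enabled"` checks the running kernel, and `config_enabled CONFIG_PSI /usr/src/linux/.config` checks a source tree before building.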

While running a kernel with BMQ enabled, at one point during boot I saw a kernel diagnostic complaining about an inconsistent task state - digging after those in journalctl, it seems those only show up with affected kernels:

psi: inconsistent task state! task=2366:udevd cpu=5 psi_flags=0 clear=1 set=4
psi: inconsistent task state! task=9:kworker/u16:0 cpu=1 psi_flags=0 clear=1 set=4
psi: inconsistent task state! task=9:kworker/u16:0 cpu=0 psi_flags=0 clear=1 set=4

I definitely had hangs and kworker CPU hogging issues also without those messages, but the other way around it's clear: every kernel printing this diagnostic during boot would hang within 10 minutes of uptime, at most.
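
To see at a glance which tasks trigger the diagnostic, the log lines can be reduced to their task= field. This is a sketch only, assuming the message format shown above; the function name psi_inconsistent_tasks is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: read kernel log text on stdin and print "COUNT PID:COMM"
# for every task named in a "psi: inconsistent task state!" message.
# Assumes the task name contains no spaces.
psi_inconsistent_tasks() {
    sed -n 's/.*psi: inconsistent task state! task=\([^ ]*\) .*/\1/p' \
        | sort \
        | uniq -c \
        | awk '{print $1, $2}'
}
```

It could be fed from the journal, for instance with `journalctl -k -b | psi_inconsistent_tasks`; empty output means the current boot logged no such message.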


I'll take this over to Alfred Chen / the BMQ scheduler project:

https://gitlab.com/alfredchen/linux-prjc
Comment 2 Joe Breuer 2023-04-19 07:41:01 UTC
... how do I figure out which commit of https://gitlab.com/alfredchen/linux-prjc/ the respective genpatches-6.x-yy sets correspond to?
Comment 3 Joe Breuer 2023-04-19 13:48:43 UTC
I've created a corresponding issue with ProjectC (which encompasses the BMQ scheduler) here: https://gitlab.com/alfredchen/linux-prjc/-/issues/79
Comment 4 Sam James archtester Gentoo Infrastructure gentoo-dev Security 2023-04-19 14:14:19 UTC
(In reply to Joe Breuer from comment #2)
> ... how do I figure out which commit of
> https://gitlab.com/alfredchen/linux-prjc/ the respective genpatches-6.x-yy
> sets correspond to?

Our patches are at https://gitweb.gentoo.org/proj/linux-patches.git/ and you can see the tags & history there, as well as maybe grab it from git history for gentoo-sources. But I wouldn't expect any of them to be causing this.
Comment 5 Alice Ferrazzi Gentoo Infrastructure gentoo-dev 2023-04-20 12:56:21 UTC
Let's see upstream's opinion on this.
If they decide to update the patch, we will follow.
Comment 6 Joe Breuer 2023-04-21 13:57:10 UTC
I found a "works for me" kind of workaround, consisting of both a not-yet-merged patch to ProjectC and turning off psi; see

https://gitlab.com/alfredchen/linux-prjc/-/issues/79#note_1360907509

There definitely are situations where ProjectC (BMQ/PDS) does not play nice with psi; plus at least one other issue that causes hangs for me on gentoo-sources-6.1.19 out-of-the-box with BMQ enabled but psi disabled.

In those hangs without the patch applied, now that I knew what I was looking for, I typically could not get to a shell, though the system still appeared to be alive to some degree (music kept playing, it responded to pings). But there was no ssh; the mouse cursor moved, but focus did not track it; there was no reaction to the keyboard or to Num Lock toggling; and Alt+SysRq sequences sometimes got no reaction at all and sometimes printed their headers but not the result/success lines, without actually doing their thing. I do not know how to get useful diagnostics out of a system in that state.
Comment 7 Alice Ferrazzi Gentoo Infrastructure gentoo-dev 2023-04-27 04:49:21 UTC
Could you test whether the behavior is the same with gentoo-sources-6.2.13 and 6.1.26?
If it is solved on a more recent kernel, we can close this as solved upstream.
Comment 8 Joe Breuer 2023-04-28 14:57:21 UTC
(In reply to Alice Ferrazzi from comment #7)
> Could you test whether the behavior is the same with gentoo-sources-6.2.13
> and 6.1.26?
> If it is solved on a more recent kernel, we can close this as solved
> upstream.

I've tested both 6.1.26 and 6.2.13, and observed identical behavior with both of them.

With BMQ and psi enabled, they will hang processes in short order (within tens of seconds of starting the reproduction scenario detailed above; one time I couldn't even get through the menus of HoloCure into the game proper).

Disabling psi / booting with 'psi=0' on the command line alleviates this; so far, my system appears stable with both 6.1.26 and 6.2.13.
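
When switching between kernels and boot entries, it is easy to lose track of whether the workaround is actually active. The following sketch checks a kernel command line for the literal psi=0 token; the function name psi_disabled_on_cmdline is hypothetical, and the check deliberately ignores other ways of writing a false value.

```shell
#!/bin/sh
# Hypothetical helper: succeed if the kernel command line contains "psi=0".
# Usage: psi_disabled_on_cmdline ["command line string"]
# Without an argument, the running kernel's /proc/cmdline is inspected.
psi_disabled_on_cmdline() {
    cmdline=${1-$(cat /proc/cmdline)}
    # Pad with spaces so the token also matches at the start or end of the line.
    case " $cmdline " in
        *" psi=0 "*) return 0 ;;
        *)           return 1 ;;
    esac
}
```

Calling `psi_disabled_on_cmdline && echo "psi is disabled for this boot"` after a reboot confirms that the parameter reached the kernel.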

This is a change/improvement from 6.1.19, which needed a patch to be stable for me (and also psi disabled).

Interestingly, though, none of the changes of that patch are applied in current gentoo-sources-6.1.26/6.2.13; so the improved behavior must be due to some other change or a second order effect.

I'll certainly continue to keep an eye on it; also, being able to have psi again at some point would be "nice to have".
Comment 9 Joe Breuer 2023-04-28 14:59:15 UTC
... just saw, latest ProjectC/BMQ patches from Alfred Chen are mutually exclusive with psi for now:

https://gitlab.com/alfredchen/linux-prjc/-/commit/542887ccaeadc65843ec171bccc87f8aa8bbca95

That patch is not yet in gentoo-sources-6.1.26 or 6.2.13.
Comment 10 Larry the Git Cow gentoo-dev 2023-05-10 18:54:09 UTC
The bug has been referenced in the following commit(s):

https://gitweb.gentoo.org/repo/gentoo.git/commit/?id=6febcb5b9366ea8425956ec72d35073b650f1b13

commit 6febcb5b9366ea8425956ec72d35073b650f1b13
Author:     Mike Pagano <mpagano@gentoo.org>
AuthorDate: 2023-05-10 18:53:16 +0000
Commit:     Mike Pagano <mpagano@gentoo.org>
CommitDate: 2023-05-10 18:53:58 +0000

    sys-kernel/gentoo-sources: netfltr patch for CVE-2023-32233, BMQ Patch
    
    netfilter: nf_tables: deactivate anonymous set from preparation phase
    sched/alt: Remove psi support
    
    Bug: https://bugs.gentoo.org/906064
    Bug: https://bugs.gentoo.org/904514
    
    Signed-off-by: Mike Pagano <mpagano@gentoo.org>

 sys-kernel/gentoo-sources/Manifest                 |  3 +++
 .../gentoo-sources/gentoo-sources-6.2.14-r1.ebuild | 28 ++++++++++++++++++++++
 2 files changed, 31 insertions(+)

https://gitweb.gentoo.org/repo/gentoo.git/commit/?id=a6053524af4e316e45c59dc66243f8ce52facaef

commit a6053524af4e316e45c59dc66243f8ce52facaef
Author:     Mike Pagano <mpagano@gentoo.org>
AuthorDate: 2023-05-10 18:51:40 +0000
Commit:     Mike Pagano <mpagano@gentoo.org>
CommitDate: 2023-05-10 18:53:58 +0000

    sys-kernel/gentoo-sources: netfltr patch for CVE-2023-32233, BMQ Patch
    
    netfilter: nf_tables: deactivate anonymous set from preparation phase
    sched/alt: Remove psi support
    
    Bug: https://bugs.gentoo.org/906064
    Bug: https://bugs.gentoo.org/904514
    
    Signed-off-by: Mike Pagano <mpagano@gentoo.org>

 sys-kernel/gentoo-sources/Manifest                 |  3 +++
 .../gentoo-sources/gentoo-sources-6.1.27-r1.ebuild | 28 ++++++++++++++++++++++
 2 files changed, 31 insertions(+)