837500 – sys-kernel/gentoo-sources-5.16 UI hangs on resume for a few seconds starting sometime in kernel 5.16

Bug 837500 - sys-kernel/gentoo-sources-5.16 UI hangs on resume for a few seconds starting sometime in kernel 5.16

Summary: sys-kernel/gentoo-sources-5.16 UI hangs on resume for a few seconds starting ...

Status:	RESOLVED UPSTREAM

Alias:	None

Product:	Gentoo Linux
Classification:	Unclassified
Component:	Current packages (show other bugs)
Hardware:	AMD64 Linux

Importance:	Normal normal
Assignee:	Gentoo Kernel Bug Wranglers and Kernel Maintainers

URL:	https://bugzilla.kernel.org/show_bug....
Whiteboard:
Keywords:

Depends on:
Blocks:

Reported:	2022-04-09 15:28 UTC by sunox9
Modified:	2022-05-21 22:18 UTC (History)
CC List:	1 user (show)

See Also:
Package list:
Runtime testing required:	---

Attachments
Good dmesg (dmesg.good,5.12 KB, text/plain) 2022-04-12 19:26 UTC, Mike Pagano	Details
Bad dmesg (dmesg.bad,5.30 KB, text/plain) 2022-04-12 19:26 UTC, Mike Pagano	Details
View All Add an attachment (proposed patch, testcase, etc.)

Note You need to log in before you can comment on or make changes to this bug.

Description sunox9 2022-04-09 15:28:04 UTC

When resuming, mouse and keyboard inputs are not registered for 5 seconds or so.

The problem began when I upgraded to kernel 5.16.16 a few weeks ago. If I revert back to the 5.15.23 kernel the problem goes away. It is not fixed by upgrading further to 5.17.2.

At the end of this message I paste a few lines of the first thing dmesg starts outputting on resume, which says that the system is still suspending. 

I found a related post (https://askubuntu.com/questions/828486/kernel-suspends-too-quickly-upon-resume-continues-suspend-tasks) and someone there says it is an instance of a kernel bug https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1574125. These seem to involve laptops only as far as I can tell, though.

I'm using OpenRC and suspending with elogind. 

I asked the people in IRC for advice and they suggested I file a bug report and then do some digging with git bisect. I will probably give git bisect a try tomorrow or so. If I can provide anything else please let me know.

DMESG OUTPUT:
[ 3562.179515] elogind-daemon[1276]: Suspending system...
[ 3562.179519] PM: suspend entry (deep)
[ 3562.183286] Filesystems sync: 0.003 seconds
[ 3562.183388] Freezing user space processes ... (elapsed 0.001 seconds) done.
[ 3562.184612] OOM killer disabled.
[ 3562.184612] Freezing remaining freezable tasks ... (elapsed 0.001 seconds) done.
[ 3562.185727] printk: Suspending console(s) (use no_console_suspend to debug)
[ 3562.186194] serial 00:04: disabled
[ 3562.321510] sd 5:0:0:0: [sdb] Synchronizing SCSI cache
[ 3562.321523] sd 1:0:0:0: [sda] Synchronizing SCSI cache
[ 3562.322125] sd 5:0:0:0: [sdb] Stopping disk
[ 3562.324637] sd 1:0:0:0: [sda] Stopping disk
[ 3562.705213] amdgpu 0000:0f:00.0: amdgpu: BACO reset
[ 3563.486182] ACPI: EC: interrupt blocked
[ 3563.611512] ACPI: PM: Preparing to enter system sleep state S3
[ 3563.941720] ACPI: EC: event blocked
[ 3563.941721] ACPI: EC: EC stopped
[ 3563.941721] ACPI: PM: Saving platform NVS memory
[ 3563.941763] Disabling non-boot CPUs ...
[ 3563.943482] smpboot: CPU 1 is now offline
[ 3563.945230] smpboot: CPU 2 is now offline
[ 3563.946943] smpboot: CPU 3 is now offline
[ 3563.948449] smpboot: CPU 4 is now offline
[ 3563.949861] smpboot: CPU 5 is now offline
[ 3563.951271] smpboot: CPU 6 is now offline
[ 3563.952726] smpboot: CPU 7 is now offline
[ 3563.954137] smpboot: CPU 8 is now offline
[ 3563.955539] smpboot: CPU 9 is now offline
[ 3563.956794] smpboot: CPU 10 is now offline
[ 3563.958062] smpboot: CPU 11 is now offline
[ 3563.958419] ACPI: PM: Low-level resume complete
[ 3563.958437] ACPI: EC: EC started

Comment 1 Mike Pagano gentoo-dev

2022-04-10 22:43:50 UTC

Those people in IRC were wise to suggest the git bisect.

Let us know the outcome.

Comment 2 sunox9 2022-04-11 23:19:22 UTC

Haha hello! Thanks again for the tip.

So I got around to trying git bisect today.

So to test I would follow the following procedure:

1. boot
2. launch sway
3. `loginctl suspend'
4. resume with key press
5. mash keyboard to see how long it took for characters to start appearing
6. `git bisect good/bad'
7. make and install new kernel
8. reboot

I did close to 10 of these when I realized that the amount of time between suspend and resume seemed to be another factor that determined whether the system would freeze or not upon resume. If I try to resume immediately after suspending the system won't hang. I wasn't consistent in how long I waited - sometimes I would try to resume as soon as it suspended, and other times I would wait - so I decided to start over.

My second round of testing lasted exactly 2 iterations (one `git bisect bad' and one `good') before it seems to have tracked the bug down, which is suspiciously short. This is the message I get:

The merge base 8bb7eca972ad531c9b149c0a51ab43a417385813 is bad.
This means the bug has been fixed between 8bb7eca972ad531c9b149c0a51ab43a417385813 and [8993e6067f263765fd26edabf3e3012e3ec4d81e 8bb7eca972ad531c9b149c0a51ab43a417385813].

Also, one time when booted into a kernel for which I was getting the system hang issue, I tested to see if I would get it from the tty (outside sway), and I did not get it. Maybe sway or the dbus-run-session that I start alongside sway are contributors?

Anyway, that is what I have to report so far. I am a little spooked by how quickly it claims to have found the bug in my second go-through given how far apart the initial `good' and `bad' kernels were (5.15.23 and 5.16.16).

Comment 3 sunox9 2022-04-12 03:20:35 UTC

I just tested 5.15.33 (the oldest post-5.15.23 kernel currently in gentoo-sources) and I do experience the bug. I may try git bisect again tomorrow with 5.15.33 as the new initial 'bad' version and see if I get different results.

Comment 4 sunox9 2022-04-12 05:39:45 UTC

Sorry for the spam, but I started to feel like I was really doing something wrong, so I decided to give it one more shot. This was the first time in 5 or so attempts that it didn't feel like something was off. Throughout the 14 or so rounds of tests I got about an equal number of 'good' and 'bad', which I take to be a good sign. This wasn't the case in prior tests, which might be because I booted into the wrong kernel and gave a wrong 'good' or 'bad' call as a result. 

Below is the full log of this last round of git-bisect tests. It identifies a commit that at least seems relevant to the issue:

'''
    scsi: core: pm: Rely on the device driver core for async power management

    Instead of implementing asynchronous resume support in the SCSI core, rely
    on the device driver core for resuming SCSI devices asynchronously.
    Instead of only supporting asynchronous resumes, also support asynchronous
    suspends."
'''

I also tried 5.15.33 again since my last update, and I wasn't able to reproduce the issue. I'm confused as to why I got it in the first place.

==============================
FULL LOG:

Your branch is up to date with 'origin/master'.
Bisecting: a merge base must be tested
[8bb7eca972ad531c9b149c0a51ab43a417385813] Linux 5.15
Bisecting: 9484 revisions left to test after this (roughly 13 steps)
[5cd4dc44b8a0f656100e3b6916cf73b1623299eb] Merge tag 'staging-5.16-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/staging
Bisecting: 4741 revisions left to test after this (roughly 12 steps)
[4ec6afd628665aa2cffeddf0c9ca6ab6b984fc03] Merge tag 'pm-5.16-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
Bisecting: 2382 revisions left to test after this (roughly 11 steps)
[3e28850cbd359bed841b832200f9fc208a9ef040] Merge tag 'for-5.16/block-2021-11-09' of git://git.kernel.dk/linux-block
Bisecting: 1181 revisions left to test after this (roughly 10 steps)
[7fd982f394c42f25a73fe9dfbf1e6b11fa26b40a] module: change to print useful messages from elf_validity_check()
Bisecting: 561 revisions left to test after this (roughly 9 steps)
[abfecb39092029c42c79bacac3d1c96a133ff231] Merge tag 'tty-5.16-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty
Bisecting: 281 revisions left to test after this (roughly 8 steps)
[83c3a7beaef7fd261c190b69f6be6337f251bf16] scsi: lpfc: Update lpfc version to 14.0.0.3
Bisecting: 139 revisions left to test after this (roughly 7 steps)
[e6ed928effb6d516272fe47b2bf0b4ba9e6a0143] scsi: dpt_i2o: Call scsi_done() directly
Bisecting: 69 revisions left to test after this (roughly 6 steps)
[edc0596cc04bf0ac3a69c66e994d3ff8b650ff71] scsi: ufs: core: Stop clearing UNIT ATTENTIONS
Bisecting: 34 revisions left to test after this (roughly 5 steps)
[046ab7d0f5943dd74c351e1f3a771dea785fe25d] scsi: hisi_sas: Wait for phyup in hisi_sas_control_phy()
Bisecting: 17 revisions left to test after this (roughly 4 steps)
[656f26ade03ad98fe9cb3dab11d2111d1a60cb4f] scsi: 53c700: Call scsi_done() directly
Bisecting: 8 revisions left to test after this (roughly 3 steps)
[a710eacb9d13cb5d9eb5341ebc6fc8f7b96f8c6f] scsi: core: Rename scsi_mq_done() into scsi_done() and export it
Bisecting: 3 revisions left to test after this (roughly 2 steps)
[a19a93e4c6a98c9c0f2f5a6db76846f10d7d1f85] scsi: core: pm: Rely on the device driver core for async power management
Bisecting: 1 revision left to test after this (roughly 1 step)
[21c7e972475e6a975fbe97f8974c96fe4713077c] scsi: hisi_sas: Disable SATA disk phy for severe I_T nexus reset failure
Bisecting: 0 revisions left to test after this (roughly 0 steps)
[a4bcbf71914b0cc44151a5238e4a8afa8e3319cd] scsi: Documentation: Fix typo in sysfs-driver-ufs
a19a93e4c6a98c9c0f2f5a6db76846f10d7d1f85 is the first bad commit
commit a19a93e4c6a98c9c0f2f5a6db76846f10d7d1f85
Author: Bart Van Assche <bvanassche@acm.org>
Date:   Wed Oct 6 14:54:51 2021 -0700

    scsi: core: pm: Rely on the device driver core for async power management
    
    Instead of implementing asynchronous resume support in the SCSI core, rely
    on the device driver core for resuming SCSI devices asynchronously.
    Instead of only supporting asynchronous resumes, also support asynchronous
    suspends.
    
    Link: https://lore.kernel.org/r/20211006215453.3318929-2-bvanassche@acm.org
    Cc: Alan Stern <stern@rowland.harvard.edu>
    Cc: Dan Williams <dan.j.williams@intel.com>
    Cc: Hannes Reinecke <hare@suse.com>
    Cc: Adrian Hunter <adrian.hunter@intel.com>
    Cc: Martin Kepplinger <martin.kepplinger@puri.sm>
    Signed-off-by: Bart Van Assche <bvanassche@acm.org>
    Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>

 drivers/scsi/hosts.c      |  1 +
 drivers/scsi/scsi.c       |  8 --------
 drivers/scsi/scsi_pm.c    | 44 ++------------------------------------------
 drivers/scsi/scsi_priv.h  |  4 +---
 drivers/scsi/scsi_scan.c  | 17 +++++++++++++++++
 drivers/scsi/scsi_sysfs.c |  1 +
 drivers/scsi/sd.c         |  1 -
 7 files changed, 22 insertions(+), 54 deletions(-)

Comment 5 Mike Pagano gentoo-dev

2022-04-12 16:39:35 UTC

Would it be possible to get full dmesg from both the working kernel (suspend, resume) and the same test with a non-working kernel.

Also, are you using the i8042 driver ?

Comment 6 sunox9 2022-04-12 18:25:27 UTC

(In reply to Mike Pagano from comment #5)
> Would it be possible to get full dmesg from both the working kernel
> (suspend, resume) and the same test with a non-working kernel.
> 
> Also, are you using the i8042 driver ?

Yes I am using the i8042 driver.

dmesg for suspend -> resume in 'good' case (5.15.33):
https://dpaste.org/S90gh

dmesg for suspend -> resume in 'bad' case (5.17.2):
https://dpaste.org/9KQ6C

Just a note that some of the stuff that appears before "System resumed" in the 'bad' case appears after "System resumed" in the 'good' case (e.g. "ata2: link is slow to respond, please be patient (ready=0)") . It's not 'extra' stuff in the 'bad' case, but just stuff that doesn't have time to appear in the 'good' case. I can pause a little longer before capturing the dmesg output in the 'good' case if you think it would help.

Comment 7 Mike Pagano gentoo-dev

2022-04-12 19:26:18 UTC

Created attachment 770450 [details]
Good dmesg

Comment 8 Mike Pagano gentoo-dev

2022-04-12 19:26:35 UTC

Created attachment 770453 [details]
Bad dmesg

Comment 9 Mike Pagano gentoo-dev

2022-04-12 19:48:39 UTC

Can you test with :

i8042.kbdreset=1

in your kernel cmdline on a non-working kernel?

Comment 10 sunox9 2022-04-12 21:00:50 UTC

(In reply to Mike Pagano from comment #9)
> Can you test with :
> 
> i8042.kbdreset=1
> 
> in your kernel cmdline on a non-working kernel?

I added "i8042.kbdreset=1" to GRUB_CMDLINE_LINUX, regenerated the grub config and rebooted. Still experiencing the issue. "kbdreset" doesn't have an entry in /sys/modules/i8042/parameters so as far as I know I can't double check its state.

Comment 11 sunox9 2022-04-12 21:47:24 UTC

I just noticed something on a Debian machine that is maybe irrelevant but I thought I should mention anyway. 

I just got this USB headphone amp which seems to have trouble coming back from sleep for whatever reason, both on Gentoo and Debian Sid. However I just noticed that I get the same temporary system-hang behaviour on Debian whenever I try to resume from sleep with this USB audio device plugged in, and for about the same length of time as well (maybe 5-6 seconds). According to dmesg, it seems that during the time that it's hanging it is trying to resuscitate the USB device. It eventually gives up, and the system resume process continues.

Anyway possibly not relevant at all, but it was too similar to the other issue for me not to mention.

Comment 12 Mike Pagano gentoo-dev

2022-04-14 16:24:48 UTC

(In reply to sunox9 from comment #10)
> (In reply to Mike Pagano from comment #9)
> > Can you test with :
> > 
> > i8042.kbdreset=1
> > 
> > in your kernel cmdline on a non-working kernel?
> 
> I added "i8042.kbdreset=1" to GRUB_CMDLINE_LINUX, regenerated the grub
> config and rebooted. Still experiencing the issue. "kbdreset" doesn't have
> an entry in /sys/modules/i8042/parameters so as far as I know I can't double
> check its state.

Take a look in:

/usr/src/linux/Documentation/admin-guide/kernel-parameters.txt

Might be worth trying:
i8042.debug and:

        i8042.reset     [HW] Reset the controller during init, cleanup and
                             suspend-to-ram transitions, only during s2r
                             transitions, or never reset
                        Format: { 1 | Y | y | 0 | N | n }
                        1, Y, y: always reset controller
                        0, N, n: don't ever reset controller
                        Default: only on s2r transitions on x86; most other
                        architectures force reset to be always executed


The keyboard one is also listed there:
        i8042.kbdreset  [HW] Reset device connected to KBD port

Comment 13 sunox9 2022-04-14 19:50:44 UTC

Thanks. I tried enabling all of those kernel parameters and it seems like nothing has changed. I also enabled ACPI debugging in the kernel and set acpi.debug_level to "verbose info", but I'm not seeing anything extra in dmesg.

I'll take a closer look at other available kernel-parameters over the next few days. 

I also tried: loading the microcode for my CPU, disabling the blk-mq io scheduler, unbinding my graphics card and its audio device from vfio-pci driver. Just taking shots in the dark. 

I should maybe note that it's not just mouse and keyboard that freezes. When I suspend from sway the entire GUI hangs on resume. If I back out of sway into a tty and suspend from there, the 'loginctl suspend' command hangs for what seems like about the usual amount of time (5-6 seconds), but this time my key presses are registered in the console. So maybe the keyboard is never locked-up at all.

I'm wondering if it has anything to do with this message:

[   21.572831] xhci_hcd 0000:08:00.0: xHC error in resume, USBSTS 0x401, Reinit
[   21.572834] usb usb1: root hub lost power or was reset
[   21.572836] usb usb2: root hub lost power or was reset

I get this message in 'good' kernels as well, however.

Comment 14 Mike Pagano gentoo-dev

2022-04-15 13:01:32 UTC

Can you give me the make/model of your system and is there a BIOS upgrade available?

Comment 15 sunox9 2022-04-15 14:01:59 UTC

AMD Ryzen 3600
AMD Radeon RX580 
AMD Radeon 270X (usually bound to vfio-pci, otherwise amdgpu)
ASUS Strix ROG B450-F (latest BIOS)
Seasonic Prime GX-750 (brand new)
I also have an M.2 NVME drive (boot), a SATA SSD, and one HDD.

I also tried gentoo-kernel-bin 5.16.20 and still getting the issue. I think that means I can stop trying out different options with my .config.

Comment 16 Mike Pagano gentoo-dev

2022-04-21 14:24:58 UTC

(In reply to sunox9 from comment #15)
> AMD Ryzen 3600
> AMD Radeon RX580 
> AMD Radeon 270X (usually bound to vfio-pci, otherwise amdgpu)
> ASUS Strix ROG B450-F (latest BIOS)
> Seasonic Prime GX-750 (brand new)
> I also have an M.2 NVME drive (boot), a SATA SSD, and one HDD.
> 
> I also tried gentoo-kernel-bin 5.16.20 and still getting the issue. I think
> that means I can stop trying out different options with my .config.

Let's bring this upstream at https://bugzilla.kernel.org.

Can you let us know the bug number once it's submitted?

Comment 17 sunox9 2022-04-21 18:40:39 UTC

Ok, I will make some time to do this in the next few days and will report back when it's done. Thanks again for the help!

Comment 18 sunox9 2022-04-24 18:33:41 UTC

Here is a link to the kernel bug report: 

https://bugzilla.kernel.org/show_bug.cgi?id=215880

Comment 19 Mike Pagano gentoo-dev

2022-05-21 22:18:50 UTC

(In reply to sunox9 from comment #18)
> Here is a link to the kernel bug report: 
> 
> https://bugzilla.kernel.org/show_bug.cgi?id=215880

Looks like great progress on the upstream bug.  We'll keep an eye on it and backport any patch the developer creates to help this issue