Created attachment 765212 [details]
5.15.23 kernel config

I've been having issues with kernels 5.15 (originally reported here: https://forums.gentoo.org/viewtopic-t-1146747.html)

I'm seeing the same problem in 5.15.23:

---
Jan 12 08:54:10 nana kernel: BUG: kernel NULL pointer dereference, address: 0000000000000110
Jan 12 08:54:10 nana kernel: #PF: supervisor read access in kernel mode
Jan 12 08:54:10 nana kernel: #PF: error_code(0x0000) - not-present page
Jan 12 08:54:10 nana kernel: PGD 0 P4D 0
Jan 12 08:54:10 nana kernel: Oops: 0000 [#1] SMP NOPTI
Jan 12 08:54:10 nana kernel: CPU: 1 PID: 2864 Comm: lockd Not tainted 5.15.11-gentoo-x86_64 #2
Jan 12 08:54:10 nana kernel: Hardware name: HP ProLiant MicroServer, BIOS O41 07/29/2011
Jan 12 08:54:10 nana kernel: RIP: 0010:vfs_lock_file+0x5/0x30
Jan 12 08:54:10 nana kernel: Code: a3 fe ff ff 4d 89 e1 e9 a4 fd ff ff 66 0f 1f 84 00 00 00 00 00 e8 2b 0d d7 ff 48 8b 7f 20 e9 f2 f5 ff ff 66 90 e8 1b 0d d7 ff <48> 8b 47 28 49 89 d0 48 8b 80 98 00 00 00 48 85 c0 74 05 e9 43 b8
Jan 12 08:54:10 nana kernel: RSP: 0018:ffff9d3640997c80 EFLAGS: 00010246
Jan 12 08:54:10 nana kernel: RAX: 7fffffffffffffff RBX: 00000000000000e8 RCX: 0000000000000000
Jan 12 08:54:10 nana kernel: RDX: ffff9d3640997c88 RSI: 0000000000000006 RDI: 00000000000000e8
Jan 12 08:54:10 nana kernel: RBP: ffff8b754767b400 R08: ffff8b7549dcf000 R09: ffff8b754bef1a00
Jan 12 08:54:10 nana kernel: R10: 0000000000000000 R11: 000000000000f000 R12: ffffffff9c34bfd0
Jan 12 08:54:10 nana kernel: R13: ffff8b76a518e7a8 R14: ffff8b7549d60c10 R15: ffff8b754767b400
Jan 12 08:54:10 nana kernel: FS: 0000000000000000(0000) GS:ffff8b7860500000(0000) knlGS:0000000000000000
Jan 12 08:54:10 nana kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Jan 12 08:54:10 nana kernel: CR2: 0000000000000110 CR3: 000000010ffd4000 CR4: 00000000000006e0
Jan 12 08:54:10 nana kernel: Call Trace:
Jan 12 08:54:10 nana kernel: <TASK>
Jan 12 08:54:10 nana kernel: nlm_unlock_files+0x6e/0xb0
Jan 12 08:54:10 nana kernel: ? _raw_spin_lock+0x5/0x20
Jan 12 08:54:10 nana kernel: ? trace_hardirqs_on+0x35/0xd0
Jan 12 08:54:10 nana kernel: ? __local_bh_enable_ip+0x44/0x80
Jan 12 08:54:10 nana kernel: ? trace_hardirqs_on+0x35/0xd0
Jan 12 08:54:10 nana kernel: ? mutex_lock+0x5/0x20
Jan 12 08:54:10 nana kernel: ? nlmsvc_traverse_blocks+0x36/0x120
Jan 12 08:54:10 nana kernel: nlm_traverse_files+0x14d/0x280
Jan 12 08:54:10 nana kernel: nlmsvc_free_host_resources+0x17/0x30
Jan 12 08:54:10 nana kernel: nlm_host_rebooted+0x23/0x90
Jan 12 08:54:10 nana kernel: nlmsvc_proc_sm_notify+0xa1/0x110
Jan 12 08:54:10 nana kernel: ? trace_hardirqs_on+0x35/0xd0
Jan 12 08:54:10 nana kernel: ? nlmsvc_decode_reboot+0x95/0xc0
Jan 12 08:54:10 nana kernel: nlmsvc_dispatch+0x89/0x180
Jan 12 08:54:10 nana kernel: svc_process_common+0x399/0x640
Jan 12 08:54:10 nana kernel: ? lockd_inet6addr_event+0xf0/0xf0
Jan 12 08:54:10 nana kernel: ? set_grace_period+0xb0/0xb0
Jan 12 08:54:10 nana kernel: svc_process+0xca/0xe0
Jan 12 08:54:10 nana kernel: lockd+0x8f/0x130
Jan 12 08:54:10 nana kernel: ? set_grace_period+0xb0/0xb0
Jan 12 08:54:10 nana kernel: kthread+0x10e/0x130
Jan 12 08:54:10 nana kernel: ? set_kthread_struct+0x40/0x40
Jan 12 08:54:10 nana kernel: ret_from_fork+0x22/0x30
Jan 12 08:54:10 nana kernel: </TASK>
Jan 12 08:54:10 nana kernel: Modules linked in: ecb xts dm_crypt dm_mod tun bridge stp llc ipt_REJECT nf_reject_ipv4 xt_conntrack nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 iptable_filter iptable_mangle iptable_raw ip_tables radeon i2c_algo_bit drm_ttm_helper ttm kvm_amd drm_kms_helper cfbfillrect syscopyarea cfbimgblt sysfillrect sysimgblt w83795 kvm fb_sys_fops uas cfbcopyarea drm usb_storage irqbypass tg3 drm_panel_orientation_quirks pata_atiixp libphy pcspkr i2c_piix4
Jan 12 08:54:10 nana kernel: CR2: 0000000000000110
Jan 12 08:54:10 nana kernel: ---[ end trace 6ac413c9433d0bd8 ]---
Jan 12 08:54:10 nana kernel: RIP: 0010:vfs_lock_file+0x5/0x30
Jan 12 08:54:10 nana kernel: Code: a3 fe ff ff 4d 89 e1 e9 a4 fd ff ff 66 0f 1f 84 00 00 00 00 00 e8 2b 0d d7 ff 48 8b 7f 20 e9 f2 f5 ff ff 66 90 e8 1b 0d d7 ff <48> 8b 47 28 49 89 d0 48 8b 80 98 00 00 00 48 85 c0 74 05 e9 43 b8
Jan 12 08:54:10 nana kernel: RSP: 0018:ffff9d3640997c80 EFLAGS: 00010246
Jan 12 08:54:10 nana kernel: RAX: 7fffffffffffffff RBX: 00000000000000e8 RCX: 0000000000000000
Jan 12 08:54:10 nana kernel: RDX: ffff9d3640997c88 RSI: 0000000000000006 RDI: 00000000000000e8
Jan 12 08:54:10 nana kernel: RBP: ffff8b754767b400 R08: ffff8b7549dcf000 R09: ffff8b754bef1a00
Jan 12 08:54:10 nana kernel: R10: 0000000000000000 R11: 000000000000f000 R12: ffffffff9c34bfd0
Jan 12 08:54:10 nana kernel: R13: ffff8b76a518e7a8 R14: ffff8b7549d60c10 R15: ffff8b754767b400
Jan 12 08:54:10 nana kernel: FS: 0000000000000000(0000) GS:ffff8b7860500000(0000) knlGS:0000000000000000
Jan 12 08:54:10 nana kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Jan 12 08:54:10 nana kernel: CR2: 0000000000000110 CR3: 000000010ffd4000 CR4: 00000000000006e0
---

System becomes unresponsive, CPU lockups:

---
Feb 15 19:10:14 nana kernel: watchdog: BUG: soft lockup - CPU#1 stuck for 45s! [lockd:2868]
Feb 15 19:10:14 nana kernel: Modules linked in: ecb xts dm_crypt dm_mod tun bridge stp llc ipt_REJECT nf_reject_ipv4 xt_conntrack nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 iptable_filter iptable_mangle iptable_raw ip_tables radeon i2c_algo_bit drm_ttm_helper ttm w83795 drm_kms_helper kvm_amd cfbfillrect syscopyarea kvm cfbimgblt uas sysfillrect sysimgblt usb_storage fb_sys_fops irqbypass cfbcopyarea pcspkr drm tg3 libphy pata_atiixp drm_panel_orientation_quirks i2c_piix4
Feb 15 19:10:14 nana kernel: CPU: 1 PID: 2868 Comm: lockd Not tainted 5.15.23-gentoo-x86_64 #1
Feb 15 19:10:14 nana kernel: Hardware name: HP ProLiant MicroServer, BIOS O41 07/29/2011
Feb 15 19:10:14 nana kernel: RIP: 0010:locks_init_lock+0x3a/0x80
Feb 15 19:10:14 nana kernel: Code: c7 87 d0 00 00 00 00 00 00 00 48 8d 7f 08 48 89 d1 31 c0 48 c7 c6 4c c6 4c 88 48 83 e7 f8 48 29 f9 81 c1 d8 00 00 00 c1 e9 03 <f3> 48 ab 48 8d 42 08 48 8d 7a 60 48 89 42 08 48 89 42 10 48 8d 42
Feb 15 19:10:14 nana kernel: RSP: 0018:ffffb1fd80f83c88 EFLAGS: 00000212
Feb 15 19:10:14 nana kernel: RAX: 0000000000000000 RBX: ffff8e3ca421c900 RCX: 0000000000000018
Feb 15 19:10:14 nana kernel: RDX: ffffb1fd80f83c90 RSI: ffffffff884cc64c RDI: ffffb1fd80f83ca8
Feb 15 19:10:14 nana kernel: RBP: ffff8e3c6bcd6c00 R08: ffffb1fd80f83c90 R09: 0000000000000000
Feb 15 19:10:14 nana kernel: R10: 0000000000000000 R11: ffff8e3b80369f84 R12: ffffffff8774c800
Feb 15 19:10:14 nana kernel: R13: ffff8e3b8d55fe38 R14: ffff8e3b81ef2040 R15: ffff8e3ca421c900
Feb 15 19:10:14 nana kernel: FS: 0000000000000000(0000) GS:ffff8e3ea0500000(0000) knlGS:0000000000000000
Feb 15 19:10:14 nana kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Feb 15 19:10:14 nana kernel: CR2: 000006c9dd32b018 CR3: 000000011318e000 CR4: 00000000000006e0
Feb 15 19:10:14 nana kernel: Call Trace:
Feb 15 19:10:14 nana kernel: <TASK>
Feb 15 19:10:14 nana kernel: nlm_unlock_files+0x32/0xd0
Feb 15 19:10:14 nana kernel: nlm_traverse_files+0x14d/0x280
Feb 15 19:10:14 nana kernel: nlmsvc_free_host_resources+0x17/0x30
Feb 15 19:10:14 nana kernel: nlm_host_rebooted+0x23/0x90
Feb 15 19:10:14 nana kernel: nlmsvc_proc_sm_notify+0xa1/0x110
Feb 15 19:10:14 nana kernel: ? trace_hardirqs_on+0x35/0xd0
Feb 15 19:10:14 nana kernel: ? nlmsvc_decode_reboot+0x95/0xc0
Feb 15 19:10:14 nana kernel: nlmsvc_dispatch+0x89/0x180
Feb 15 19:10:14 nana kernel: svc_process_common+0x399/0x640
Feb 15 19:10:14 nana kernel: ? lockd_inet6addr_event+0xf0/0xf0
Feb 15 19:10:14 nana kernel: ? set_grace_period+0xb0/0xb0
Feb 15 19:10:14 nana kernel: svc_process+0xca/0xe0
Feb 15 19:10:14 nana kernel: lockd+0x8f/0x130
Feb 15 19:10:14 nana kernel: ? set_grace_period+0xb0/0xb0
Feb 15 19:10:14 nana kernel: kthread+0x10e/0x130
Feb 15 19:10:14 nana kernel: ? set_kthread_struct+0x40/0x40
Feb 15 19:10:14 nana kernel: ret_from_fork+0x22/0x30
Feb 15 19:10:14 nana kernel: </TASK>
---

emerge --info (older kernel, but basically same system): https://cloud.gagv.org.uk/s/TnfytcBARAifdkx

Any ideas, please let me know - I'll revert to 5.10 as I need the system for work.
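As a quick sanity check on the first trace (the NULL pointer dereference), the faulting address in CR2 can be reproduced from the register dump and the instruction bytes marked with <...> in the "Code:" line. The sketch below is only my reading of those bytes (48 8b 47 28 decoded as a 64-bit load from RDI plus an 8-bit displacement); the helper names are made up for illustration. It suggests the pointer passed in RDI was 0xe8 rather than a real kernel address, which would fit a structure member being read through a NULL-derived pointer, though that interpretation is a guess.

```python
# Values copied from the Jan 12 oops; the opcode decoding below is an
# assumption on my part, not something the log states.
RDI = 0x00000000000000E8  # first-argument register at the time of the fault
CR2 = 0x0000000000000110  # faulting address reported in the oops

# Bytes at the marked position in the "Code:" line: <48> 8b 47 28
FAULTING_INSN = bytes([0x48, 0x8B, 0x47, 0x28])


def decode_mov_disp8(insn: bytes) -> int:
    """Return the signed 8-bit displacement of a `mov r64, [rdi+disp8]` encoding."""
    rex, opcode, modrm, disp = insn
    if rex != 0x48 or opcode != 0x8B:
        raise ValueError("not a REX.W MOV r64, r/m64 instruction")
    if modrm >> 6 != 0b01 or modrm & 0b111 != 0b111:
        raise ValueError("not a [rdi + disp8] memory operand")
    # disp8 is sign-extended by the CPU
    return disp - 256 if disp >= 128 else disp


def fault_address(base: int, insn: bytes) -> int:
    """Address the instruction would read, given the base register value."""
    return base + decode_mov_disp8(insn)


if __name__ == "__main__":
    addr = fault_address(RDI, FAULTING_INSN)
    print(hex(addr), addr == CR2)  # prints "0x110 True"
```

The computed address matching CR2 at least confirms that the marked instruction and the RDI value in the dump belong together.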
5.15.24 was just released and has lots of NFS fixes.
OK, .24 is looking promising. I'll run some more tests this week and will mark this fixed if the issue doesn't present itself again.
I spoke too soon:

[Feb21 10:31] watchdog: BUG: soft lockup - CPU#0 stuck for 7205s! [lockd:2861]
[ +0.001971] Modules linked in: ecb xts dm_crypt dm_mod tun bridge stp llc ipt_REJECT nf_reject_ipv4 xt_conntrack nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 iptable_filter iptable_mangle iptable_raw ip_tables radeon i2c_algo_bit drm_ttm_helper ttm w83795 drm_kms_helper cfbfillrect kvm_amd syscopyarea cfbimgblt sysfillrect kvm sysimgblt fb_sys_fops cfbcopyarea drm tg3 uas irqbypass pcspkr usb_storage drm_panel_orientation_quirks libphy pata_atiixp i2c_piix4
[ +0.000079] CPU: 0 PID: 2861 Comm: lockd Tainted: G L 5.15.24-gentoo-x86_64 #1
[ +0.000008] Hardware name: HP ProLiant MicroServer, BIOS O41 07/29/2011
[ +0.000004] RIP: 0010:_raw_spin_lock+0x10/0x20
[ +0.000014] Code: 5d c3 48 89 ef 5d e9 cf 3f 6e ff 66 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 40 00 e8 ab db 66 ff 31 c0 ba 01 00 00 00 f0 0f b1 17 <75> 02 f3 c3 89 c6 e9 d5 3d 6e ff 0f 1f 44 00 00 e8 8b db 66 ff 48
[ +0.000005] RSP: 0018:ffffa2d600df3d78 EFLAGS: 00000246
[ +0.000006] RAX: 0000000000000000 RBX: ffff89cccb548018 RCX: ffff89cccb548018
[ +0.000004] RDX: 0000000000000001 RSI: ffff89cee6818c00 RDI: ffff89cccb548000
[ +0.000004] RBP: ffff89cd7d372800 R08: ffffa2d600df3c90 R09: 0000000000000000
[ +0.000004] R10: 0000000000000000 R11: ffff89ccf0047f84 R12: ffffffff9314c8c0
[ +0.000004] R13: ffff89cccb548000 R14: ffff89cccc95be98 R15: ffff89ced6f66200
[ +0.000004] FS: 0000000000000000(0000) GS:ffff89cfe0400000(0000) knlGS:0000000000000000
[ +0.000005] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ +0.000004] CR2: 000008c71bbaa018 CR3: 000000011427a000 CR4: 00000000000006f0
[ +0.000005] Call Trace:
[ +0.000004] <TASK>
[ +0.000003] nlm_traverse_files+0xf5/0x280
[ +0.000012] nlmsvc_free_host_resources+0x17/0x30
[ +0.000008] nlm_host_rebooted+0x23/0x90
[ +0.000008] nlmsvc_proc_sm_notify+0xa1/0x110
[ +0.000007] ? trace_hardirqs_on+0x35/0xd0
[ +0.000008] ? nlmsvc_decode_reboot+0x95/0xc0
[ +0.000007] nlmsvc_dispatch+0x89/0x180
[ +0.000008] svc_process_common+0x399/0x640
[ +0.000010] ? lockd_inet6addr_event+0xf0/0xf0
[ +0.000009] ? set_grace_period+0xb0/0xb0
[ +0.000005] svc_process+0xca/0xe0
[ +0.000007] lockd+0x8f/0x130
[ +0.000006] ? set_grace_period+0xb0/0xb0
[ +0.000004] kthread+0x10e/0x130
[ +0.000007] ? set_kthread_struct+0x40/0x40
[ +0.000007] ret_from_fork+0x22/0x30
[ +0.000009] </TASK>
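For anyone comparing lockups across kernel versions, the watchdog lines can be pulled out of a saved log with a few lines of Python. This is only a sketch: the regular expression is written against the message format seen in this report, the function name is invented, and the two sample lines are copied from the traces in this thread.

```python
import re

# Matches the watchdog message format as it appears in this report.
SOFT_LOCKUP = re.compile(
    r"soft lockup - CPU#(?P<cpu>\d+) stuck for (?P<secs>\d+)s! "
    r"\[(?P<comm>[^:\]]+):(?P<pid>\d+)\]"
)


def parse_soft_lockups(log_text: str) -> list:
    """Return one dict per soft-lockup message found in the log text."""
    return [
        {
            "cpu": int(m["cpu"]),
            "secs": int(m["secs"]),
            "comm": m["comm"],
            "pid": int(m["pid"]),
        }
        for m in SOFT_LOCKUP.finditer(log_text)
    ]


# Two lines copied from the traces posted in this thread.
SAMPLE = (
    "Feb 15 19:10:14 nana kernel: watchdog: BUG: soft lockup - CPU#1 stuck for 45s! [lockd:2868]\n"
    "[Feb21 10:31] watchdog: BUG: soft lockup - CPU#0 stuck for 7205s! [lockd:2861]\n"
)

if __name__ == "__main__":
    for event in parse_soft_lockups(SAMPLE):
        print(event)
```

In both samples the stuck task is lockd, and the 5.15.24 report shows it stuck for 7205 seconds, so the thread never recovered on its own.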
You may have to take that upstream.
I thought the URL referenced your problem, but judging by your title it does not look like it. Can you test with the latest 5.15.x and, if it recurs, post the dmesg and .config from that exact kernel?
If you mean the forum URL I posted, it is indeed this problem, but reported some time ago. However, the symptom is the same. If you meant the email link - yes, patch 1/2 doesn't seem to be my problem, but patch 2/2 rings some bells [without looking at any code] and also raises suspicion, as it contains the word "dubious" :)

I have a couple of things to do:
- Open this upstream, which I was delaying because
- ...5.15.26 seemed to work quite nicely for over a week (!!!) until today (boo)
- Test gentoo-kernel-bin to rule out my config
- Test latest if above fails
Quick update: trying gentoo-kernel-bin 5.15.25, and while I don't get the "soft lockup" error, NFS started to fail, and I'm having trouble restarting it. Observed in dmesg when trying to restart: "lockd: couldn't shutdown host module for net f0000098!" I'll post the kernel config diff later, but there isn't much difference besides nfsd being a module (in -kernel-bin) vs compiled-in in my config.
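A config comparison like the one mentioned above can be sketched in a few lines of Python. Note the assumptions: the function names are invented, and the two sample configs are illustrative stand-ins (built-in vs. module NFS server options), not the actual configs from this system. The kernel tree also ships scripts/diffconfig, which serves the same purpose.

```python
def parse_kconfig(text: str) -> dict:
    """Map CONFIG_* symbols to their value; unset symbols are recorded as 'n'."""
    options = {}
    suffix = " is not set"
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("# CONFIG_") and line.endswith(suffix):
            options[line[2:-len(suffix)]] = "n"
        elif line.startswith("CONFIG_") and "=" in line:
            name, value = line.split("=", 1)
            options[name] = value
    return options


def diff_kconfig(a: dict, b: dict) -> dict:
    """Return {symbol: (value_in_a, value_in_b)} for symbols that differ."""
    return {
        key: (a.get(key, "n"), b.get(key, "n"))
        for key in sorted(set(a) | set(b))
        if a.get(key, "n") != b.get(key, "n")
    }


# Illustrative samples only, NOT the reporter's real configs.
CUSTOM = "CONFIG_NFSD=y\nCONFIG_LOCKD=y\nCONFIG_TIGON3=y\n# CONFIG_NFSD_V4 is not set\n"
DISTRO = "CONFIG_NFSD=m\nCONFIG_LOCKD=m\nCONFIG_TIGON3=y\nCONFIG_NFSD_V4=y\n"

if __name__ == "__main__":
    for symbol, (mine, theirs) in diff_kconfig(
        parse_kconfig(CUSTOM), parse_kconfig(DISTRO)
    ).items():
        print(f"{symbol}: {mine} -> {theirs}")
```

Treating a missing symbol the same as "is not set" keeps the output focused on options that would actually change the built kernel.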
Another update: 5.15.29 has been running for over 9 full days and the issue hasn't presented itself - so hopefully no kernel config debugging required. I did switch off a couple of options that supported extremely old+deprecated syscalls, but I doubt that was the issue.
(In reply to Gabriel from comment #8)
> Another update: 5.15.29 has been running for over 9 full days and the issue
> hasn't presented itself - so hopefully no kernel config debugging required.
> I did switch off a couple of options that supported extremely old+deprecated
> syscalls, but I doubt that was the issue.

Thanks for the update. We'll leave this open for a few days; maybe when we hit two full weeks, we can close this.
Reopening this, as the issue reappeared with latest stable kernel (5.15.41). To be 110% honest, it didn't go away completely, but it rarely appeared. This new kernel, however, causes the problem within minutes of being up and running. I'm still considering just changing hardware, as it appears that I'm the only one with this issue :/
Can you report this upstream, please?
@gabriel any information on this? If you sent this upstream, could you post the upstream link here? Thanks.
Hi, Alice - apologies that this is taking me so long to action (I've never opened a kernel bug, so I have to go through the priming process). On a side note, I upgraded to kernel 5.15.49 (gentoo-sources) and the problem now occurs only rarely. I'll certainly link the kernel.org bug when I get to open it.
Please let us know the link when you report this upstream. We will follow the bug and backport any fixes identified.
Apologies again for the delay in providing feedback; _hopefully_, this is fixed/worked around now in 5.15.69. Throughout all this time I managed to reproduce the bug consistently, and in version 5.15.69 of the kernel I only get a:

lockd: couldn't shutdown host module for net f0000000!

...rather than a CPU soft lockup that brings the system to its proverbial knees. There are still some user space issues, such as having to restart NFS, otherwise the Mac client will refuse to connect, but that I can live with.

<guess>If this was related to a HW combination involving an old AMD CPU and the tg3 network driver, I would never have guessed. But here we are. Hopefully this can remain closed and will serve as a reference for another poor soul on the internet.</guess>