Bug 833438 - =sys-kernel/gentoo-sources-5.15.23 nfsd/lockd issues
Summary: =sys-kernel/gentoo-sources-5.15.23 nfsd/lockd issues
Status: RESOLVED NEEDINFO
Alias: None
Product: Gentoo Linux
Classification: Unclassified
Component: Current packages
Hardware: All Linux
Importance: Normal normal
Assignee: Gentoo Kernel Bug Wranglers and Kernel Maintainers
URL: https://www.spinics.net/lists/linux-n...
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2022-02-15 22:21 UTC by Gabriel
Modified: 2022-10-18 12:25 UTC

See Also:
Package list:
Runtime testing required: ---


Attachments
5.15.23 kernel config (config,135.36 KB, text/plain)
2022-02-15 22:21 UTC, Gabriel

Description Gabriel 2022-02-15 22:21:49 UTC
Created attachment 765212 [details]
5.15.23 kernel config

I've been having issues with 5.15 kernels (originally reported here: https://forums.gentoo.org/viewtopic-t-1146747.html).
I'm seeing the same problem in 5.15.23:
---
Jan 12 08:54:10 nana kernel: BUG: kernel NULL pointer dereference, address: 0000000000000110
Jan 12 08:54:10 nana kernel: #PF: supervisor read access in kernel mode
Jan 12 08:54:10 nana kernel: #PF: error_code(0x0000) - not-present page
Jan 12 08:54:10 nana kernel: PGD 0 P4D 0
Jan 12 08:54:10 nana kernel: Oops: 0000 [#1] SMP NOPTI
Jan 12 08:54:10 nana kernel: CPU: 1 PID: 2864 Comm: lockd Not tainted 5.15.11-gentoo-x86_64 #2
Jan 12 08:54:10 nana kernel: Hardware name: HP ProLiant MicroServer, BIOS O41     07/29/2011
Jan 12 08:54:10 nana kernel: RIP: 0010:vfs_lock_file+0x5/0x30
Jan 12 08:54:10 nana kernel: Code: a3 fe ff ff 4d 89 e1 e9 a4 fd ff ff 66 0f 1f 84 00 00 00 00 00 e8 2b 0d d7 ff 48 8b 7f 20 e9 f2 f5 ff ff 66 90 e8 1b 0d d7 ff <48> 8b 47 28 49 89 d0 48 8b 80 98 00 00 00 48 85 c0 74 05 e9 43 b8
Jan 12 08:54:10 nana kernel: RSP: 0018:ffff9d3640997c80 EFLAGS: 00010246
Jan 12 08:54:10 nana kernel: RAX: 7fffffffffffffff RBX: 00000000000000e8 RCX: 0000000000000000
Jan 12 08:54:10 nana kernel: RDX: ffff9d3640997c88 RSI: 0000000000000006 RDI: 00000000000000e8
Jan 12 08:54:10 nana kernel: RBP: ffff8b754767b400 R08: ffff8b7549dcf000 R09: ffff8b754bef1a00
Jan 12 08:54:10 nana kernel: R10: 0000000000000000 R11: 000000000000f000 R12: ffffffff9c34bfd0
Jan 12 08:54:10 nana kernel: R13: ffff8b76a518e7a8 R14: ffff8b7549d60c10 R15: ffff8b754767b400
Jan 12 08:54:10 nana kernel: FS:  0000000000000000(0000) GS:ffff8b7860500000(0000) knlGS:0000000000000000
Jan 12 08:54:10 nana kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Jan 12 08:54:10 nana kernel: CR2: 0000000000000110 CR3: 000000010ffd4000 CR4: 00000000000006e0
Jan 12 08:54:10 nana kernel: Call Trace:
Jan 12 08:54:10 nana kernel:  <TASK>
Jan 12 08:54:10 nana kernel:  nlm_unlock_files+0x6e/0xb0
Jan 12 08:54:10 nana kernel:  ? _raw_spin_lock+0x5/0x20
Jan 12 08:54:10 nana kernel:  ? trace_hardirqs_on+0x35/0xd0
Jan 12 08:54:10 nana kernel:  ? __local_bh_enable_ip+0x44/0x80
Jan 12 08:54:10 nana kernel:  ? trace_hardirqs_on+0x35/0xd0
Jan 12 08:54:10 nana kernel:  ? mutex_lock+0x5/0x20
Jan 12 08:54:10 nana kernel:  ? nlmsvc_traverse_blocks+0x36/0x120
Jan 12 08:54:10 nana kernel:  nlm_traverse_files+0x14d/0x280
Jan 12 08:54:10 nana kernel:  nlmsvc_free_host_resources+0x17/0x30
Jan 12 08:54:10 nana kernel:  nlm_host_rebooted+0x23/0x90
Jan 12 08:54:10 nana kernel:  nlmsvc_proc_sm_notify+0xa1/0x110
Jan 12 08:54:10 nana kernel:  ? trace_hardirqs_on+0x35/0xd0
Jan 12 08:54:10 nana kernel:  ? nlmsvc_decode_reboot+0x95/0xc0
Jan 12 08:54:10 nana kernel:  nlmsvc_dispatch+0x89/0x180
Jan 12 08:54:10 nana kernel:  svc_process_common+0x399/0x640
Jan 12 08:54:10 nana kernel:  ? lockd_inet6addr_event+0xf0/0xf0
Jan 12 08:54:10 nana kernel:  ? set_grace_period+0xb0/0xb0
Jan 12 08:54:10 nana kernel:  svc_process+0xca/0xe0
Jan 12 08:54:10 nana kernel:  lockd+0x8f/0x130
Jan 12 08:54:10 nana kernel:  ? set_grace_period+0xb0/0xb0
Jan 12 08:54:10 nana kernel:  kthread+0x10e/0x130
Jan 12 08:54:10 nana kernel:  ? set_kthread_struct+0x40/0x40
Jan 12 08:54:10 nana kernel:  ret_from_fork+0x22/0x30
Jan 12 08:54:10 nana kernel:  </TASK>
Jan 12 08:54:10 nana kernel: Modules linked in: ecb xts dm_crypt dm_mod tun bridge stp llc ipt_REJECT nf_reject_ipv4 xt_conntrack nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 iptable_filter iptable_mangle iptable_raw ip_tables radeon i2c_algo_bit drm_ttm_helper ttm kvm_amd drm_kms_helper cfbfillrect syscopyarea cfbimgblt sysfillrect sysimgblt w83795 kvm fb_sys_fops uas cfbcopyarea drm usb_storage irqbypass tg3 drm_panel_orientation_quirks pata_atiixp libphy pcspkr i2c_piix4
Jan 12 08:54:10 nana kernel: CR2: 0000000000000110
Jan 12 08:54:10 nana kernel: ---[ end trace 6ac413c9433d0bd8 ]---
Jan 12 08:54:10 nana kernel: RIP: 0010:vfs_lock_file+0x5/0x30
Jan 12 08:54:10 nana kernel: Code: a3 fe ff ff 4d 89 e1 e9 a4 fd ff ff 66 0f 1f 84 00 00 00 00 00 e8 2b 0d d7 ff 48 8b 7f 20 e9 f2 f5 ff ff 66 90 e8 1b 0d d7 ff <48> 8b 47 28 49 89 d0 48 8b 80 98 00 00 00 48 85 c0 74 05 e9 43 b8
Jan 12 08:54:10 nana kernel: RSP: 0018:ffff9d3640997c80 EFLAGS: 00010246
Jan 12 08:54:10 nana kernel: RAX: 7fffffffffffffff RBX: 00000000000000e8 RCX: 0000000000000000
Jan 12 08:54:10 nana kernel: RDX: ffff9d3640997c88 RSI: 0000000000000006 RDI: 00000000000000e8
Jan 12 08:54:10 nana kernel: RBP: ffff8b754767b400 R08: ffff8b7549dcf000 R09: ffff8b754bef1a00
Jan 12 08:54:10 nana kernel: R10: 0000000000000000 R11: 000000000000f000 R12: ffffffff9c34bfd0
Jan 12 08:54:10 nana kernel: R13: ffff8b76a518e7a8 R14: ffff8b7549d60c10 R15: ffff8b754767b400
Jan 12 08:54:10 nana kernel: FS:  0000000000000000(0000) GS:ffff8b7860500000(0000) knlGS:0000000000000000
Jan 12 08:54:10 nana kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Jan 12 08:54:10 nana kernel: CR2: 0000000000000110 CR3: 000000010ffd4000 CR4: 00000000000006e0 
---
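Reading the first oops by hand: the faulting instruction is the byte sequence marked `<48> 8b 47 28` in the Code: line. Below is a minimal sketch of decoding it, assuming only that single x86-64 encoding. The helper name `decode_mov_disp8` is made up for illustration and is not part of any kernel tool; the register and CR2 values are copied from the log.

```python
# Decode the faulting instruction of the first oops and check it against CR2.
# decode_mov_disp8 is a hypothetical helper written for this note only.

REGS = ["rax", "rcx", "rdx", "rbx", "rsp", "rbp", "rsi", "rdi"]


def decode_mov_disp8(code: bytes):
    """Decode 'REX.W 8B /r' with a [base + disp8] operand.

    Returns (dest_reg, base_reg, displacement). Only this one encoding
    is handled: no SIB byte and no REX.R/REX.B register extension.
    """
    rex, opcode, modrm, disp = code[0], code[1], code[2], code[3]
    if rex != 0x48 or opcode != 0x8B:
        raise ValueError("not a 64-bit 'mov r64, r/m64'")
    mod, reg, rm = modrm >> 6, (modrm >> 3) & 7, modrm & 7
    if mod != 0b01 or rm == 0b100:
        raise ValueError("expected [base + disp8] without a SIB byte")
    if disp >= 0x80:  # disp8 is sign-extended
        disp -= 0x100
    return REGS[reg], REGS[rm], disp


FAULT_BYTES = bytes.fromhex("488b4728")  # "<48> 8b 47 28" from the Code: line
RDI = 0xE8    # RDI: 00000000000000e8
CR2 = 0x110   # CR2: 0000000000000110

dest, base, disp = decode_mov_disp8(FAULT_BYTES)
fault_addr = RDI + disp
print(f"mov {dest}, [{base}+{disp:#x}] -> {fault_addr:#x}")
# prints "mov rax, [rdi+0x28] -> 0x110"
```

The computed address equals CR2. Since the fault is at vfs_lock_file+0x5 and RDI carries the first argument under the x86-64 calling convention, this is consistent with vfs_lock_file being handed a small bogus pointer (0xe8) instead of a valid file pointer. The log alone does not establish where that value came from.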
The system becomes unresponsive, with CPU soft lockups:
---
Feb 15 19:10:14 nana kernel: watchdog: BUG: soft lockup - CPU#1 stuck for 45s! [lockd:2868]
Feb 15 19:10:14 nana kernel: Modules linked in: ecb xts dm_crypt dm_mod tun bridge stp llc ipt_REJECT nf_reject_ipv4 xt_conntrack nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 iptable_filter iptable_mangle iptable_raw ip_tables radeon i2c_algo_bit drm_ttm_helper ttm w83795 drm_kms_helper kvm_amd cfbfillrect syscopyarea kvm cfbimgblt uas sysfillrect sysimgblt usb_storage fb_sys_fops irqbypass cfbcopyarea pcspkr drm tg3 libphy pata_atiixp drm_panel_orientation_quirks i2c_piix4
Feb 15 19:10:14 nana kernel: CPU: 1 PID: 2868 Comm: lockd Not tainted 5.15.23-gentoo-x86_64 #1
Feb 15 19:10:14 nana kernel: Hardware name: HP ProLiant MicroServer, BIOS O41     07/29/2011
Feb 15 19:10:14 nana kernel: RIP: 0010:locks_init_lock+0x3a/0x80
Feb 15 19:10:14 nana kernel: Code: c7 87 d0 00 00 00 00 00 00 00 48 8d 7f 08 48 89 d1 31 c0 48 c7 c6 4c c6 4c 88 48 83 e7 f8 48 29 f9 81 c1 d8 00 00 00 c1 e9 03 <f3> 48 ab 48 8d 42 08 48 8d 7a 60 48 89 42 08 48 89 42 10 48 8d 42
Feb 15 19:10:14 nana kernel: RSP: 0018:ffffb1fd80f83c88 EFLAGS: 00000212
Feb 15 19:10:14 nana kernel: RAX: 0000000000000000 RBX: ffff8e3ca421c900 RCX: 0000000000000018
Feb 15 19:10:14 nana kernel: RDX: ffffb1fd80f83c90 RSI: ffffffff884cc64c RDI: ffffb1fd80f83ca8
Feb 15 19:10:14 nana kernel: RBP: ffff8e3c6bcd6c00 R08: ffffb1fd80f83c90 R09: 0000000000000000
Feb 15 19:10:14 nana kernel: R10: 0000000000000000 R11: ffff8e3b80369f84 R12: ffffffff8774c800
Feb 15 19:10:14 nana kernel: R13: ffff8e3b8d55fe38 R14: ffff8e3b81ef2040 R15: ffff8e3ca421c900
Feb 15 19:10:14 nana kernel: FS:  0000000000000000(0000) GS:ffff8e3ea0500000(0000) knlGS:0000000000000000
Feb 15 19:10:14 nana kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Feb 15 19:10:14 nana kernel: CR2: 000006c9dd32b018 CR3: 000000011318e000 CR4: 00000000000006e0
Feb 15 19:10:14 nana kernel: Call Trace:
Feb 15 19:10:14 nana kernel:  <TASK>
Feb 15 19:10:14 nana kernel:  nlm_unlock_files+0x32/0xd0
Feb 15 19:10:14 nana kernel:  nlm_traverse_files+0x14d/0x280
Feb 15 19:10:14 nana kernel:  nlmsvc_free_host_resources+0x17/0x30
Feb 15 19:10:14 nana kernel:  nlm_host_rebooted+0x23/0x90
Feb 15 19:10:14 nana kernel:  nlmsvc_proc_sm_notify+0xa1/0x110
Feb 15 19:10:14 nana kernel:  ? trace_hardirqs_on+0x35/0xd0
Feb 15 19:10:14 nana kernel:  ? nlmsvc_decode_reboot+0x95/0xc0
Feb 15 19:10:14 nana kernel:  nlmsvc_dispatch+0x89/0x180
Feb 15 19:10:14 nana kernel:  svc_process_common+0x399/0x640
Feb 15 19:10:14 nana kernel:  ? lockd_inet6addr_event+0xf0/0xf0
Feb 15 19:10:14 nana kernel:  ? set_grace_period+0xb0/0xb0
Feb 15 19:10:14 nana kernel:  svc_process+0xca/0xe0
Feb 15 19:10:14 nana kernel:  lockd+0x8f/0x130
Feb 15 19:10:14 nana kernel:  ? set_grace_period+0xb0/0xb0
Feb 15 19:10:14 nana kernel:  kthread+0x10e/0x130
Feb 15 19:10:14 nana kernel:  ? set_kthread_struct+0x40/0x40
Feb 15 19:10:14 nana kernel:  ret_from_fork+0x22/0x30
Feb 15 19:10:14 nana kernel:  </TASK>
---

emerge --info (older kernel, but basically same system): https://cloud.gagv.org.uk/s/TnfytcBARAifdkx

If you have any ideas, please let me know. I'll revert to 5.10 in the meantime, as I need the system for work.
Comment 1 Joakim Tjernlund 2022-02-16 16:52:18 UTC
5.15.24 was just released and has lots of NFS fixes.
Comment 2 Gabriel 2022-02-19 21:22:29 UTC
OK, .24 is looking promising. I'll run some more tests this week and will mark this fixed if the issue doesn't present itself again.
Comment 3 Gabriel 2022-02-21 11:42:42 UTC
I spoke too soon:
[Feb21 10:31] watchdog: BUG: soft lockup - CPU#0 stuck for 7205s! [lockd:2861]
[  +0.001971] Modules linked in: ecb xts dm_crypt dm_mod tun bridge stp llc ipt_REJECT nf_reject_ipv4 xt_conntrack nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 iptable_filter iptable_mangle iptable_raw ip_tables radeon i2c_algo_bit drm_ttm_helper ttm w83795 drm_kms_helper cfbfillrect kvm_amd syscopyarea cfbimgblt sysfillrect kvm sysimgblt fb_sys_fops cfbcopyarea drm tg3 uas irqbypass pcspkr usb_storage drm_panel_orientation_quirks libphy pata_atiixp i2c_piix4
[  +0.000079] CPU: 0 PID: 2861 Comm: lockd Tainted: G             L    5.15.24-gentoo-x86_64 #1
[  +0.000008] Hardware name: HP ProLiant MicroServer, BIOS O41     07/29/2011
[  +0.000004] RIP: 0010:_raw_spin_lock+0x10/0x20
[  +0.000014] Code: 5d c3 48 89 ef 5d e9 cf 3f 6e ff 66 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 40 00 e8 ab db 66 ff 31 c0 ba 01 00 00 00 f0 0f b1 17 <75> 02 f3 c3 89 c6 e9 d5 3d 6e ff 0f 1f 44 00 00 e8 8b db 66 ff 48
[  +0.000005] RSP: 0018:ffffa2d600df3d78 EFLAGS: 00000246
[  +0.000006] RAX: 0000000000000000 RBX: ffff89cccb548018 RCX: ffff89cccb548018
[  +0.000004] RDX: 0000000000000001 RSI: ffff89cee6818c00 RDI: ffff89cccb548000
[  +0.000004] RBP: ffff89cd7d372800 R08: ffffa2d600df3c90 R09: 0000000000000000
[  +0.000004] R10: 0000000000000000 R11: ffff89ccf0047f84 R12: ffffffff9314c8c0
[  +0.000004] R13: ffff89cccb548000 R14: ffff89cccc95be98 R15: ffff89ced6f66200
[  +0.000004] FS:  0000000000000000(0000) GS:ffff89cfe0400000(0000) knlGS:0000000000000000
[  +0.000005] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  +0.000004] CR2: 000008c71bbaa018 CR3: 000000011427a000 CR4: 00000000000006f0
[  +0.000005] Call Trace:
[  +0.000004]  <TASK>
[  +0.000003]  nlm_traverse_files+0xf5/0x280
[  +0.000012]  nlmsvc_free_host_resources+0x17/0x30
[  +0.000008]  nlm_host_rebooted+0x23/0x90
[  +0.000008]  nlmsvc_proc_sm_notify+0xa1/0x110
[  +0.000007]  ? trace_hardirqs_on+0x35/0xd0
[  +0.000008]  ? nlmsvc_decode_reboot+0x95/0xc0
[  +0.000007]  nlmsvc_dispatch+0x89/0x180
[  +0.000008]  svc_process_common+0x399/0x640
[  +0.000010]  ? lockd_inet6addr_event+0xf0/0xf0
[  +0.000009]  ? set_grace_period+0xb0/0xb0
[  +0.000005]  svc_process+0xca/0xe0
[  +0.000007]  lockd+0x8f/0x130
[  +0.000006]  ? set_grace_period+0xb0/0xb0
[  +0.000004]  kthread+0x10e/0x130
[  +0.000007]  ? set_kthread_struct+0x40/0x40
[  +0.000007]  ret_from_fork+0x22/0x30
[  +0.000009]  </TASK>
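All three traces posted so far go through the same lockd call path. A minimal sketch for extracting the reliable frames (those the kernel does not prefix with `?`) and listing the ones every trace shares is shown below. The trace excerpts are shortened copies of the logs in this report; the helper names are illustrative and not part of any existing tool.

```python
import re

# One stack frame: optional "? " marker, then symbol+0xoff/0xsize.
FRAME_RE = re.compile(r"^\s*(\?\s+)?([A-Za-z_][\w.]*)\+0x[0-9a-f]+/0x[0-9a-f]+\s*$")


def reliable_frames(trace_lines):
    """Return the symbol names of frames that are not marked with '?'."""
    frames = []
    for line in trace_lines:
        # Drop the syslog ("... kernel:") or dmesg ("[  +0.000003]") prefix.
        body = re.sub(r"^(?:.*kernel:|\[[^\]]*\])", "", line)
        match = FRAME_RE.match(body)
        if match and not match.group(1):
            frames.append(match.group(2))
    return frames


TRACE_5_15_11 = """\
Jan 12 08:54:10 nana kernel:  nlm_unlock_files+0x6e/0xb0
Jan 12 08:54:10 nana kernel:  ? _raw_spin_lock+0x5/0x20
Jan 12 08:54:10 nana kernel:  nlm_traverse_files+0x14d/0x280
Jan 12 08:54:10 nana kernel:  nlmsvc_free_host_resources+0x17/0x30
Jan 12 08:54:10 nana kernel:  nlm_host_rebooted+0x23/0x90
Jan 12 08:54:10 nana kernel:  nlmsvc_proc_sm_notify+0xa1/0x110
""".splitlines()

TRACE_5_15_23 = """\
Feb 15 19:10:14 nana kernel:  nlm_unlock_files+0x32/0xd0
Feb 15 19:10:14 nana kernel:  nlm_traverse_files+0x14d/0x280
Feb 15 19:10:14 nana kernel:  nlmsvc_free_host_resources+0x17/0x30
Feb 15 19:10:14 nana kernel:  nlm_host_rebooted+0x23/0x90
Feb 15 19:10:14 nana kernel:  nlmsvc_proc_sm_notify+0xa1/0x110
""".splitlines()

TRACE_5_15_24 = """\
[  +0.000003]  nlm_traverse_files+0xf5/0x280
[  +0.000012]  nlmsvc_free_host_resources+0x17/0x30
[  +0.000008]  nlm_host_rebooted+0x23/0x90
[  +0.000008]  nlmsvc_proc_sm_notify+0xa1/0x110
[  +0.000007]  ? trace_hardirqs_on+0x35/0xd0
""".splitlines()

traces = [reliable_frames(t) for t in (TRACE_5_15_11, TRACE_5_15_23, TRACE_5_15_24)]
first, rest = traces[0], traces[1:]
shared = [name for name in first if all(name in other for other in rest)]
print(shared)
```

Judging by the function names, the shared path (nlmsvc_proc_sm_notify, nlm_host_rebooted, nlmsvc_free_host_resources, nlm_traverse_files) is lockd releasing a client's locks after being notified that the client rebooted. That reading is an inference from the traces, not something confirmed in this report.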
Comment 4 Joakim Tjernlund 2022-02-21 11:59:19 UTC
You may have to take that upstream.
Comment 5 Mike Pagano gentoo-dev 2022-03-10 17:59:24 UTC
So I thought the URL referenced your problem, but judging by your title it does not look like it.

Can you test with the latest 5.15.x and, if it reoccurs, please post the dmesg and .config from that exact kernel?
Comment 6 Gabriel 2022-03-11 17:32:56 UTC
If you mean the forum URL I posted, it is indeed this problem, but reported some time ago. However, the symptom is the same.
If you meant the email link: yes, patch 1/2 doesn't seem to be my problem. Patch 2/2 rings some bells [without looking at any code], but it also raises suspicion, as it contains the word "dubious" :)

I have a couple of things to do:
- Open this upstream, which I was delaying because
- ...5.15.26 seemed to work quite nicely for over a week (!!!) until today (boo)
- Test gentoo-kernel-bin to rule out my config
- Test latest if above fails
Comment 7 Gabriel 2022-03-17 08:55:47 UTC
Quick update: trying gentoo-kernel-bin 5.15.25, and while I don't get the "soft lockup" error, NFS started to fail, and I'm having trouble restarting it. Observed in dmesg when trying to restart: "lockd: couldn't shutdown host module for net f0000098!"
I'll post the kernel config diff later, but there isn't much difference besides nfsd being a module (in -kernel-bin) vs compiled-in in my config.
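For the config comparison mentioned above, a rough sketch of diffing two kernel .config files follows. The two sample snippets are hypothetical stand-ins: only the built-in versus module difference for nfsd comes from this comment, and the other option lines are assumptions chosen for illustration. The function names are also made up for this sketch.

```python
def parse_kconfig(text):
    """Map CONFIG_* option names to values; '# ... is not set' becomes 'n'."""
    suffix = " is not set"
    options = {}
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("CONFIG_") and "=" in line:
            name, value = line.split("=", 1)
            options[name] = value
        elif line.startswith("# CONFIG_") and line.endswith(suffix):
            options[line[2:-len(suffix)]] = "n"
    return options


def diff_kconfig(old, new):
    """Return {option: (old_value, new_value)} for options that differ.

    An option missing from one side is treated as 'n'.
    """
    changed = {}
    for name in sorted(set(old) | set(new)):
        a, b = old.get(name, "n"), new.get(name, "n")
        if a != b:
            changed[name] = (a, b)
    return changed


# Hypothetical excerpts; real files would be read from disk instead.
CUSTOM_CONFIG = """\
CONFIG_NFSD=y
CONFIG_LOCKD=y
CONFIG_TIGON3=m
"""

DIST_CONFIG = """\
CONFIG_NFSD=m
CONFIG_LOCKD=m
CONFIG_TIGON3=m
"""

for name, (old, new) in diff_kconfig(
    parse_kconfig(CUSTOM_CONFIG), parse_kconfig(DIST_CONFIG)
).items():
    print(f"{name}: {old} -> {new}")
```

The kernel source tree also ships scripts/diffconfig, which serves the same purpose for real config files.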
Comment 8 Gabriel 2022-03-27 09:22:15 UTC
Another update: 5.15.29 has been running for over 9 full days and the issue hasn't presented itself - so hopefully no kernel config debugging required. I did switch off a couple of options that supported extremely old+deprecated syscalls, but I doubt that was the issue.
Comment 9 Mike Pagano gentoo-dev 2022-03-27 17:09:21 UTC
(In reply to Gabriel from comment #8)
> Another update: 5.15.29 has been running for over 9 full days and the issue
> hasn't presented itself - so hopefully no kernel config debugging required.
> I did switch off a couple of options that supported extremely old+deprecated
> syscalls, but I doubt that was the issue.

Thanks for the update. We'll leave this open for a few more days; maybe when we hit two full weeks we can close it.
Comment 10 Gabriel 2022-05-24 12:07:37 UTC
Reopening this, as the issue reappeared with the latest stable kernel (5.15.41).
To be 110% honest, it didn't go away completely, but it rarely appeared. This new kernel, however, causes the problem within minutes of being up and running.

I'm still considering just changing hardware, as it appears that I'm the only one with this issue :/
Comment 11 Mike Pagano gentoo-dev 2022-05-24 13:50:42 UTC
Can you report this upstream, please?
Comment 12 Alice Ferrazzi Gentoo Infrastructure gentoo-dev 2022-07-04 10:20:36 UTC
@gabriel, any information on this? If you sent this upstream, could you post the upstream link here? Thanks.
Comment 13 Gabriel 2022-07-08 10:02:26 UTC
Hi, Alice - apologies that this is taking me so long to action (I've never opened a kernel bug, so have to go through the priming process).
On a side note, I upgraded to kernel 5.15.49 (gentoo-sources) and the problem now occurs only rarely.
I'll certainly link the kernel.org bug when I get to open it.
Comment 14 Mike Pagano gentoo-dev 2022-09-23 16:56:55 UTC
Please let us know the link when you report this upstream; we will follow the bug and backport any fixes identified.
Comment 15 Gabriel 2022-10-18 12:25:22 UTC
Apologies again for the delay in providing feedback. _Hopefully_, this is fixed/worked around now in 5.15.69: throughout all this time I managed to reproduce the bug consistently, and in version 5.15.69 of the kernel I only get a:
lockd: couldn't shutdown host module for net f0000000!
...rather than a CPU soft lockup that brings the system to its proverbial knees.
There are still some user space issues, such as having to restart NFS, otherwise the Mac client will refuse to connect, but I can live with that.

<guess>If this was related to a HW combination involving an old AMD CPU and a tg3 network driver, I would have never guessed. But here we are. Hopefully this can remain closed and will serve as a reference to another poor soul on the internet.</guess>