| Summary: | =sys-kernel/gentoo-sources-5.15.23 nfsd/lockd issues | ||
|---|---|---|---|
| Product: | Gentoo Linux | Reporter: | Gabriel <gabriel> |
| Component: | Current packages | Assignee: | Gentoo Kernel Bug Wranglers and Kernel Maintainers <kernel> |
| Status: | RESOLVED NEEDINFO | ||
| Severity: | normal | CC: | joakim.tjernlund |
| Priority: | Normal | ||
| Version: | unspecified | ||
| Hardware: | All | ||
| OS: | Linux | ||
| URL: | https://www.spinics.net/lists/linux-nfs/msg88535.html | ||
| Whiteboard: | |||
| Package list: | Runtime testing required: | --- | |
| Attachments: | 5.15.23 kernel config | ||
|
Description
Gabriel
2022-02-15 22:21:49 UTC
5.15.24 just came and has lots of NFS fixes

OK, .24 is looking promising, I'll run some more tests this week, will mark fixed if the issue doesn't present itself again.

I spoke too soon:

[Feb21 10:31] watchdog: BUG: soft lockup - CPU#0 stuck for 7205s! [lockd:2861]
[ +0.001971] Modules linked in: ecb xts dm_crypt dm_mod tun bridge stp llc ipt_REJECT nf_reject_ipv4 xt_conntrack nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 iptable_filter iptable_mangle iptable_raw ip_tables radeon i2c_algo_bit drm_ttm_helper ttm w83795 drm_kms_helper cfbfillrect kvm_amd syscopyarea cfbimgblt sysfillrect kvm sysimgblt fb_sys_fops cfbcopyarea drm tg3 uas irqbypass pcspkr usb_storage drm_panel_orientation_quirks libphy pata_atiixp i2c_piix4
[ +0.000079] CPU: 0 PID: 2861 Comm: lockd Tainted: G L 5.15.24-gentoo-x86_64 #1
[ +0.000008] Hardware name: HP ProLiant MicroServer, BIOS O41 07/29/2011
[ +0.000004] RIP: 0010:_raw_spin_lock+0x10/0x20
[ +0.000014] Code: 5d c3 48 89 ef 5d e9 cf 3f 6e ff 66 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 40 00 e8 ab db 66 ff 31 c0 ba 01 00 00 00 f0 0f b1 17 <75> 02 f3 c3 89 c6 e9 d5 3d 6e ff 0f 1f 44 00 00 e8 8b db 66 ff 48
[ +0.000005] RSP: 0018:ffffa2d600df3d78 EFLAGS: 00000246
[ +0.000006] RAX: 0000000000000000 RBX: ffff89cccb548018 RCX: ffff89cccb548018
[ +0.000004] RDX: 0000000000000001 RSI: ffff89cee6818c00 RDI: ffff89cccb548000
[ +0.000004] RBP: ffff89cd7d372800 R08: ffffa2d600df3c90 R09: 0000000000000000
[ +0.000004] R10: 0000000000000000 R11: ffff89ccf0047f84 R12: ffffffff9314c8c0
[ +0.000004] R13: ffff89cccb548000 R14: ffff89cccc95be98 R15: ffff89ced6f66200
[ +0.000004] FS: 0000000000000000(0000) GS:ffff89cfe0400000(0000) knlGS:0000000000000000
[ +0.000005] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ +0.000004] CR2: 000008c71bbaa018 CR3: 000000011427a000 CR4: 00000000000006f0
[ +0.000005] Call Trace:
[ +0.000004] <TASK>
[ +0.000003] nlm_traverse_files+0xf5/0x280
[ +0.000012] nlmsvc_free_host_resources+0x17/0x30
[ +0.000008] nlm_host_rebooted+0x23/0x90
[ +0.000008] nlmsvc_proc_sm_notify+0xa1/0x110
[ +0.000007] ? trace_hardirqs_on+0x35/0xd0
[ +0.000008] ? nlmsvc_decode_reboot+0x95/0xc0
[ +0.000007] nlmsvc_dispatch+0x89/0x180
[ +0.000008] svc_process_common+0x399/0x640
[ +0.000010] ? lockd_inet6addr_event+0xf0/0xf0
[ +0.000009] ? set_grace_period+0xb0/0xb0
[ +0.000005] svc_process+0xca/0xe0
[ +0.000007] lockd+0x8f/0x130
[ +0.000006] ? set_grace_period+0xb0/0xb0
[ +0.000004] kthread+0x10e/0x130
[ +0.000007] ? set_kthread_struct+0x40/0x40
[ +0.000007] ret_from_fork+0x22/0x30
[ +0.000009] </TASK>

You may have to take that upstream.

so I thought the URL references your problem but it does not look like it according to your title. Can you test with the latest 5.15.X and then if it reoccurs please post the dmesg and .config from that exact kernel.

If you mean the forum URL I posted, it is indeed this problem, but reported some time ago. However, the symptom is the same. If you meant the email link - yes, patch 1/2 doesn't seem to be my problem, but patch 2/2 rings some bells [without looking at any code], but raises suspicion, as it contains the word "dubious" :)

I have a couple of things to do:
- Open this upstream, which I was delaying because
- ...5.15.26 seemed to work quite nicely for over a week (!!!) until today (boo)
- Test gentoo-kernel-bin to rule out my config
- Test latest if above fails

Quick update: trying gentoo-kernel-bin 5.15.25, and while I don't get the "soft lockup" error, NFS started to fail, and I'm having trouble restarting it. Observed in dmesg when trying to restart:

"lockd: couldn't shutdown host module for net f0000098!"

I'll post the kernel config diff later, but there isn't much difference besides nfsd being a module (in -kernel-bin) vs compiled-in in my config.

Another update: 5.15.29 has been running for over 9 full days and the issue hasn't presented itself - so hopefully no kernel config debugging required.
I did switch off a couple of options that supported extremely old+deprecated syscalls, but I doubt that was the issue.

(In reply to Gabriel from comment #8)
> Another update: 5.15.29 has been running for over 9 full days and the issue
> hasn't presented itself - so hopefully no kernel config debugging required.
> I did switch off a couple of options that supported extremely old+deprecated
> syscalls, but I doubt that was the issue.

Thanks for the update. We'll leave this open for a few days; maybe when we hit two full weeks, we can close this.

Reopening this, as the issue reappeared with the latest stable kernel (5.15.41). To be 110% honest, it didn't go away completely, but it rarely appeared. This new kernel, however, causes the problem within minutes of being up and running. I'm still considering just changing hardware, as it appears that I'm the only one with this issue :/

Can you report this upstream, please?

@gabriel any information on this? If you sent this upstream, could you link the upstream report here? Thanks.

Hi, Alice - apologies that this is taking me so long to action (I've never opened a kernel bug, so have to go through the priming process). On a side note, I upgraded to kernel 5.15.49 (gentoo-sources) and the problem now occurs only rarely. I'll certainly link the kernel.org bug when I get to open it.

Please let us know the link when you report this upstream; we will follow the bug and backport any fixes identified.

Apologies again for the delay in providing feedback; _hopefully_, this is fixed/worked around now in 5.15.69 - throughout all this time, I managed to reproduce the bug consistently, and in version 5.15.69 of the kernel I only get a:

lockd: couldn't shutdown host module for net f0000000!

...rather than a CPU soft lockup that brings the system to its proverbial knees. There are still some user space issues, such as having to restart nfs, otherwise the mac client will refuse to connect, but that I can live with.
<guess>If this was related to a HW combination involving an old AMD CPU and a tg3 network driver, I would never have guessed. But here we are. Hopefully this can remain closed and will serve as a reference for another poor soul on the internet.</guess> |
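
For anyone tracking whether this recurs across kernel upgrades, here is a minimal sketch that scans saved kernel log text for the two messages quoted in this report: the `lockd` soft lockup and the "couldn't shutdown host module" line. The names `scan_log`, `SOFT_LOCKUP` and `LOCKD_SHUTDOWN` are hypothetical and not part of any tool mentioned here, and the patterns are assumptions derived only from the message text above, so they may need adjusting for other kernel versions.

```python
import re

# Patterns derived from the two messages quoted in this report (assumption:
# other kernel versions may word these messages differently).
SOFT_LOCKUP = re.compile(
    r"watchdog: BUG: soft lockup - CPU#(?P<cpu>\d+) stuck for (?P<secs>\d+)s! "
    r"\[(?P<comm>[^:\]]+):(?P<pid>\d+)\]"
)
LOCKD_SHUTDOWN = re.compile(
    r"lockd: couldn't shutdown host module for net (?P<net>[0-9a-f]+)!"
)


def scan_log(text):
    """Return (kind, fields) tuples for each matching log line, in order."""
    events = []
    for line in text.splitlines():
        m = SOFT_LOCKUP.search(line)
        if m:
            events.append((
                "soft_lockup",
                {
                    "cpu": int(m.group("cpu")),
                    "secs": int(m.group("secs")),
                    "comm": m.group("comm"),
                    "pid": int(m.group("pid")),
                },
            ))
            continue
        m = LOCKD_SHUTDOWN.search(line)
        if m:
            events.append(("lockd_shutdown", {"net": m.group("net")}))
    return events
```

Passing the text of a saved `dmesg` dump to `scan_log` gives a quick per-boot summary, which makes it easier to compare kernel versions than reading the full log by eye.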