909066 – sys-kernel/gentoo-sources-6.3.8: "rcu: INFO: rcu_preempt self-detected stall on CPU"

Status:	RESOLVED FIXED

Alias:	None

Product:	Gentoo Linux
Classification:	Unclassified
Component:	Current packages
Hardware:	AMD64 Linux

Importance:	Normal critical
Assignee:	Gentoo Kernel Bug Wranglers and Kernel Maintainers

URL:
Whiteboard:	6.3.12
Keywords:

Depends on:
Blocks:

Reported:	2023-06-24 08:12 UTC by satmd
Modified:	2023-07-05 21:56 UTC
CC List:	2 users

See Also:	https://bugzilla.kernel.org/show_bug.cgi?id=217620
Package list:
Runtime testing required:	---

Attachments
rcu stall message 1 (rcu-1.txt, 3.59 KB, text/plain) 2023-06-24 08:14 UTC, satmd
rcu stall message 1 complete, the previous one got cut off early (rcu-1.txt, 5.84 KB, text/plain) 2023-06-24 08:15 UTC, satmd
Another RCU stall of the same boot (rcu-2.txt, 6.57 KB, text/plain) 2023-06-24 08:19 UTC, satmd
Another RCU stall of the same boot (rcu-3.txt, 6.57 KB, text/plain) 2023-06-24 08:19 UTC, satmd
Additional kernel messages of potential interest between attachment rcu stall 3 and 4 (kernel-1.txt, 4.90 KB, text/plain) 2023-06-24 08:19 UTC, satmd
Another RCU stall of the same boot (rcu-4.txt, 9.96 KB, text/plain) 2023-06-24 08:20 UTC, satmd
The .config for 6.12.6 (config-working, 123.90 KB, text/plain) 2023-06-30 20:22 UTC, satmd
RCU stall with 6.4.0 with delayed loading of wireguard.ko (rcu-5.txt, 4.65 KB, text/plain) 2023-07-01 12:22 UTC, satmd

Description satmd 2023-06-24 08:12:17 UTC

Since upgrading from gentoo-sources 6.2.8 to 6.3.8, I can no longer boot my system.

Reproducible: Always

Steps to Reproduce:
1. Build 6.3.8 with the .config from 6.2.8 + make oldconfig, default choices for prompts
2. reboot
Actual Results:  
System gets stuck once my wireguard interface goes up

Expected Results:  
System just boots :tm:

My setup is not trivial and I have done some extensive tests already.

- gcc 12 VS gcc 13 does not make a difference (6.2 was on gcc 12 and my first 6.3 was on gcc 13, I've then tried downgrading gcc to 12 for the new kernel)
- The code of wireguard did not change at all between 6.2.8 and 6.3.8
- 6.3.4 exhibits the same problem
- If I rename wireguard.ko to wireguard.ko.disabled and I don't load the module, the reboot is successful
- dmesg timing always shows the rcu stall a few seconds after loading wireguard and setting up the interface
- the stall happens when I put an interface up and it presumably receives/sends traffic
- the wireguard interface is operating on top of a bonding (lacp) interface made of two ports of an intel network card (igb).
- Unfortunately I cannot rebuild the 6.2.8 kernel, because its sources already have been uninstalled and aren't available in portage anymore (there's only 6.1.x and 6.3.x)
- The affected machine is a Dell R620 model (ancient intel cpu), so maybe microcode?
- The machine is remote with no physical access, but I have SSH-to-serial access to it.

There are some things that I want to try and document:
- Try 6.3.0 too
- Replace igb of 6.3.8 with the code of 6.2.x
- Replace bonding of 6.3.8 with the code of 6.2.x
- I'd like to eliminate buggy firmware as a possible source of trouble, but I haven't found a way to build 6.3.x with older firmware yet, and I don't know which version was used with the 6.2.8 kernel

I'll attach all rcu stalls to this bug report, as well as the diff of the kernel configs.

That's what I came up with now, I'll add more information on request.

Comment 1 satmd 2023-06-24 08:14:10 UTC

Created attachment 864546 [details]
rcu stall message 1

The first RCU stall message, indicating the network path of wireguard/bonding/igb

Comment 2 satmd 2023-06-24 08:15:24 UTC

Created attachment 864547 [details]
rcu stall message 1 complete, the previous one got cut off early

Comment 3 satmd 2023-06-24 08:19:16 UTC

Created attachment 864548 [details]
Another RCU stall of the same boot

Comment 4 satmd 2023-06-24 08:19:28 UTC

Created attachment 864549 [details]
Another RCU stall of the same boot

Comment 5 satmd 2023-06-24 08:19:58 UTC

Created attachment 864550 [details]
Additional kernel messages of potential interest between attachment rcu stall 3 and 4

Comment 6 satmd 2023-06-24 08:20:12 UTC

Created attachment 864551 [details]
Another RCU stall of the same boot

Comment 7 Sam James archtester 2023-06-24 08:20:33 UTC

> - Unfortunately I cannot rebuild the 6.2.8 kernel, because its sources
> already have been uninstalled and aren't available in portage anymore
> (there's only 6.1.x and 6.3.x)

You should be able to grab it from git history, see https://wiki.gentoo.org/wiki/Downgrading_a_package_to_removed_version. Let us know if you need more help with that bit.

Ideally, would grab 6.2.16 as well and then that gives you a smaller range to bisect between if that works (or doesn't, even).

Comment 8 satmd 2023-06-24 08:26:22 UTC

Kernel cmdline:
> root=UUID=... ro console=ttyS1,115200 rd.auto quiet loglevel=1  cgroup_enable=memory swapaccount=1 msr.allow_writes=on

Please note that I also tried booting with
> root=UUID=... ro console=ttyS1,115200 rd.auto swapaccount=1

At least those options do not seem to make a difference.

Comment 9 satmd 2023-06-24 08:31:43 UTC

Config diff for the kernel between 6.2.8 and 6.3.8:

# diff -uNr /boot/config-6.2.8-gentoo .config | grep "^[+-][^#+-]"
-CONFIG_CC_VERSION_TEXT="gcc (Gentoo Hardened 12.2.1_p20230304 p13) 12.2.1 20230304"
+CONFIG_CC_VERSION_TEXT="gcc (Gentoo Hardened 13.1.1_p20230527 p3) 13.1.1 20230527"
-CONFIG_GCC_VERSION=120201
+CONFIG_GCC_VERSION=130101
-CONFIG_GCC12_NO_ARRAY_BOUNDS=y
+CONFIG_SCHED_MM_CID=y
+CONFIG_KVM_GENERIC_HARDWARE_ENABLING=y
+CONFIG_AS_GFNI=y
-CONFIG_BLOCK_COMPAT=y
-CONFIG_NETFILTER_FAMILY_ARP=y
-CONFIG_IP_NF_TARGET_CLUSTERIP=m
+CONFIG_NET_SCH_MQPRIO_LIB=m
+CONFIG_SERIAL_8250_PCILIB=y
+CONFIG_THERMAL_ACPI=y
+CONFIG_INTEL_TCC=y
+CONFIG_HID_SUPPORT=y
+CONFIG_I2C_HID=y
+CONFIG_LEGACY_DIRECT_IO=y
+CONFIG_RCU_CPU_STALL_CPUTIME=y

This is the current state of .config with GCC 13. I tried GCC 12 with the same outcome, only CONFIG_CC_VERSION_TEXT and CONFIG_GCC_VERSION would be different.

I have added CONFIG_RCU_CPU_STALL_CPUTIME only after the first problems in order to get more diagnostic output, other options have been left as defaults when doing `make oldconfig`.

I *do* note that cpuidle code has been problematic on this machine some years ago and cpuidle is now enabled as a dependency together with ACPI code it seems. I do not have much more knowledge about the differences' influence.
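
The `grep '^[+-][^#+-]'` filter in the diff command above keeps only lines that diff marks as added or removed and whose next character is not `#`, `+` or `-`. That drops the `---`/`+++` file headers and every `# CONFIG_... is not set` comment line, so an option that was switched off shows up only as a removed line. A small sketch with two made-up config fragments (the option names and the `config_changes` helper are hypothetical, not taken from the attached configs):

```shell
#!/bin/sh
# Hypothetical helper: print only the option lines that differ between
# two kernel config files, using the same filter as in the comment above.
config_changes() {
    # grep exits non-zero when nothing matches; "|| true" keeps that from
    # being treated as an error when the configs are identical.
    diff -uN "$1" "$2" | grep '^[+-][^#+-]' || true
}

# Made-up example data, not the real 6.2.8 / 6.3.8 configs.
old=$(mktemp)
new=$(mktemp)
printf '%s\n' 'CONFIG_A=y' 'CONFIG_B=y' '# CONFIG_C is not set' > "$old"
printf '%s\n' 'CONFIG_A=y' '# CONFIG_B is not set' 'CONFIG_C=m' > "$new"

config_changes "$old" "$new"
rm -f "$old" "$new"
```

With these two fragments the output is `-CONFIG_B=y` followed by `+CONFIG_C=m`; the `+# CONFIG_B is not set` and `-# CONFIG_C is not set` lines are filtered out.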

Comment 10 satmd 2023-06-24 08:56:41 UTC

Building and testing 6.2.16 and 6.3.0 with gcc 13 and current firmwares now.

Comment 11 satmd 2023-06-24 09:58:19 UTC

Results with 6.2.16 and 6.3.0:

- 6.3.0 does not boot
- 6.2.16 does boot

Since I've used the current firmware, microcode and gcc 13, I think I can eliminate those from the equation.

That leads me to think that this really is a bug in the kernel.

Anything else I should try?

Comment 12 Mike Pagano gentoo-dev 2023-06-24 22:21:43 UTC

How about a git bisect between the last working version and the first non-working version?

Comment 13 satmd 2023-06-25 14:34:52 UTC

I will now start bisecting using instructions from https://wiki.gentoo.org/wiki/Kernel_git-bisect

good: v6.2.16
bad: v6.3

Since my hardware is older and is in production use, this will probably take me some days.
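
For readers unfamiliar with the procedure, the bisect loop can be tried on a throwaway repository first. The sketch below is not the kernel bisect from this report: the repository, the file names and the stand-in regression (a file called `state`) are made up, and `bisect_demo` is a hypothetical helper name. It only shows how `git bisect start <bad> <good>` and `git bisect run` narrow a range down to the first bad commit.

```shell
#!/bin/sh
# Toy demonstration of "git bisect run". Everything here is invented for
# illustration: 8 small commits, with commit 6 introducing the "bug".
bisect_demo() (
    repo=$(mktemp -d) || exit 1
    cd "$repo" || exit 1
    git init -q .
    git config user.name "bisect demo"
    git config user.email "demo@example.invalid"
    git config commit.gpgsign false

    n=1
    while [ "$n" -le 8 ]; do
        echo "change $n" >> history.txt
        # Commit 6 adds the regression marker; it stays in all later commits.
        if [ "$n" -eq 6 ]; then echo "regression" > state; fi
        git add -A
        git commit -q -m "commit $n"
        n=$((n + 1))
    done

    good=$(git rev-list --max-parents=0 HEAD)    # commit 1, known good
    git bisect start HEAD "$good" >/dev/null     # bad revision first, then good
    # The test command exits 0 for a good commit and 1 for a bad one.
    git bisect run sh -c '! test -f state' >/dev/null 2>&1
    git log -1 --format=%s refs/bisect/bad       # subject of the first bad commit
    cd / && rm -rf "$repo"
)

bisect_demo
```

On the real kernel tree the endpoints are `v6.3` (bad) and `v6.2.16` (good) as listed above, and each step needs a kernel build and a boot test by hand rather than `git bisect run`.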

Comment 14 satmd 2023-06-25 19:20:02 UTC

I've progressed a bit:
- The bug also appears in the upstream kernel, so it is not caused by the Gentoo patchset
- It seems that the bug was introduced after 6.3.0-rc5
- It seems that the bug was introduced before 6.3.0-rc7

I'm still running bisect.

Comment 15 satmd 2023-06-25 20:57:11 UTC

Bisect came up with this:

```
fed8d8773b8ea68ad99d9eee8c8343bef9da2c2c is the first bad commit
commit fed8d8773b8ea68ad99d9eee8c8343bef9da2c2c
Author: Eric DeVolder <eric.devolder@oracle.com>
Date:   Mon Mar 27 15:10:26 2023 -0400

    x86/acpi/boot: Correct acpi_is_processor_usable() check

    The logic in acpi_is_processor_usable() requires the online capable
    bit be set for hotpluggable CPUs.  The online capable bit has been
    introduced in ACPI 6.3.

    However, for ACPI revisions < 6.3 which do not support that bit, CPUs
    should be reported as usable, not the other way around.

    Reverse the check.

      [ bp: Rewrite commit message. ]

    Fixes: e2869bd7af60 ("x86/acpi/boot: Do not register processors that cannot be onlined for x2APIC")
    Suggested-by: Miguel Luis <miguel.luis@oracle.com>
    Suggested-by: Boris Ostrovsky <boris.ovstrosky@oracle.com>
    Signed-off-by: Eric DeVolder <eric.devolder@oracle.com>
    Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
    Tested-by: David R <david@unsolicited.net>
    Cc: <stable@kernel.org>
    Link: https://lore.kernel.org/r/20230327191026.3454-2-eric.devolder@oracle.com

 arch/x86/kernel/acpi/boot.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)
```

I will try to toggle CPU hotplugging. Aside from that, I currently have no further ideas.

Comment 16 satmd 2023-06-26 07:36:01 UTC

I concluded that replacing boot.c by the one of 6.2.16 might be producing a working kernel, but it wasn't. Now I'm thinking that the rcu stall isn't happening consistently with every boot.

I'm doubting the results of git bisect too now.

Comment 17 Sam James archtester 2023-06-26 12:13:01 UTC

(In reply to satmd from comment #16)
> I concluded that replacing boot.c by the one of 6.2.16 might be producing a
> working kernel, but it wasn't. Now I'm thinking that the rcu stall isn't
> happening consistently with every boot.
> 
> I'm doubting the results of git bisect too now.

I suggest re-running the bisect but doing two steps per boot test instead of one before marking it as good/bad.

Comment 18 satmd 2023-06-26 12:53:40 UTC

(In reply to Sam James from comment #17)
> (In reply to satmd from comment #16)
> > I concluded that replacing boot.c by the one of 6.2.16 might be producing a
> > working kernel, but it wasn't. Now I'm thinking that the rcu stall isn't
> > happening consistently with every boot.
> > 
> > I'm doubting the results of git bisect too now.
> 
> I suggest re-running the bisect but doing two steps per boot test instead of
> one before marking it as good/bad.

What do you mean by this?

Reboot twice for each bisect step?
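
One way to read the suggestion from comment 17, given that the stall may not show up on every boot, is to test each bisect step more than once and only call it good if every attempt passes. A minimal sketch of that rule follows. `retest` is a hypothetical helper, and the command passed to it is a placeholder for whatever confirms a clean boot; on this machine that was a manual reboot watched over the serial console, which cannot be scripted this simply.

```shell
#!/bin/sh
# Hypothetical helper: run a check several times and fail as soon as one
# attempt fails. Usage: retest ATTEMPTS COMMAND [ARGS...]
retest() {
    attempts=$1
    shift
    i=1
    while [ "$i" -le "$attempts" ]; do
        if ! "$@"; then
            echo "attempt $i failed: treat this commit as bad"
            return 1
        fi
        i=$((i + 1))
    done
    echo "all $attempts attempts passed: treat this commit as good"
    return 0
}

# Placeholder check; "true" stands in for a real boot test.
retest 3 true
```

A single failed attempt is enough to mark a step bad, while a good verdict needs every attempt to pass. For an intermittent bug this biases mistakes towards extra testing instead of towards a wrong "good" that would send the bisect down the wrong half.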

Comment 19 satmd 2023-06-27 08:03:30 UTC

After running the git bisect process another time with more reboot attempts, I still come up with:

```
fed8d8773b8ea68ad99d9eee8c8343bef9da2c2c is the first bad commit
commit fed8d8773b8ea68ad99d9eee8c8343bef9da2c2c
Author: Eric DeVolder <eric.devolder@oracle.com>
Date:   Mon Mar 27 15:10:26 2023 -0400

    x86/acpi/boot: Correct acpi_is_processor_usable() check

    The logic in acpi_is_processor_usable() requires the online capable
    bit be set for hotpluggable CPUs.  The online capable bit has been
    introduced in ACPI 6.3.

    However, for ACPI revisions < 6.3 which do not support that bit, CPUs
    should be reported as usable, not the other way around.

    Reverse the check.

      [ bp: Rewrite commit message. ]

    Fixes: e2869bd7af60 ("x86/acpi/boot: Do not register processors that cannot be onlined for x2APIC")
    Suggested-by: Miguel Luis <miguel.luis@oracle.com>
    Suggested-by: Boris Ostrovsky <boris.ovstrosky@oracle.com>
    Signed-off-by: Eric DeVolder <eric.devolder@oracle.com>
    Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
    Tested-by: David R <david@unsolicited.net>
    Cc: <stable@kernel.org>
    Link: https://lore.kernel.org/r/20230327191026.3454-2-eric.devolder@oracle.com

 arch/x86/kernel/acpi/boot.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)
```

Comment 20 satmd 2023-06-29 21:37:27 UTC

I'm at loss now

Comment 21 Mike Pagano gentoo-dev 2023-06-30 11:27:30 UTC

Can you remove wireguard and test?

Can I see your .config ?

Also, please test with 6.4.0 with and without wireguard.

Thanks.

Comment 22 Sam James archtester 2023-06-30 11:28:34 UTC

(In reply to satmd from comment #20)
> I'm at loss now

Once you've tried Mike's suggestion, at this point, report it upstream (CC the commit author), linking to this bug and explaining what you've tried.

Comment 23 Sam James archtester 2023-06-30 11:28:47 UTC

(In reply to Sam James from comment #22)
> (In reply to satmd from comment #20)
> > I'm at loss now
> 
> Once you've tried Mike's suggestion, at this point, report it upstream (CC
> the commit author), linking to this bug and explaining what you've tried.

Also: does a revert of that commit help? If not, it implies it might be a bogus result.

Comment 24 satmd 2023-06-30 20:22:53 UTC

Created attachment 864896 [details]
The .config for 6.12.6

This .config was also used for upgrades using "make oldconfig", confirming all questions with the default (just pressing enter).

Comment 25 satmd 2023-06-30 20:36:22 UTC

Preparing 6.4.0 with wireguard now.

Those are my suggestions, please comment on it if you like me to change it. I'm writing it down so I can proceed a bit. I won't receive mails during testing, because the system also handles all my mail. Unfortunately, I do not have a fallback system nor a spare copy to do this on.

That kernel will also use a newer linux-firmware package. I mention it here just in case and ignore it for now. I may look into that if 6.4.0 + wireguard unexpectedly don't experience the bug - else I just expect it to not make a difference, but who knows, I'm keeping this door open.

When initially testing 6.3.x - before using git bisect - I was able to boot the system successfully as long as I did not load the wireguard.ko kernel module, albeit I'm not sure I gave it enough time to really produce a RCU stall.

So, right now I'll test 6.4.0 with and without wireguard.
I will first disable wireguard by hiding wireguard.ko on the filesystem and if that still triggers the bug, I'll disable wireguard support on the .config as well.

If 6.4.0 boots well with wireguard, I guess that's the fix. I do not want to waste time debugging superseded kernels unless you still want me to.

If 6.4.0 does not boot with wireguard, I'll try the methods from above and also try 6.4.0 and latest 6.3.x with *that* commit reverted.

That revert will be applied to the git sources of the kernel, right?
I also could try `git diff | patch -R` from the git-based kernel sources onto the gentoo sources. I didn't do that yet because I can produce the bug with vanilla.

Please feel free and encouraged to make changes to my plan. :)

Comment 26 Mike Pagano gentoo-dev 2023-06-30 22:32:43 UTC

(In reply to satmd from comment #25)
> Preparing 6.4.0 with wireguard now.
> 
> Those are my suggestions, please comment on it if you like me to change it.
> I'm writing it down so I can proceed a bit. I won't receive mails during
> testing, because the system also handles all my mail. Unfortunately, I do
> not have a fallback system nor a spare copy to do this on.
> 
> That kernel will also use a newer linux-firmware package. I mention it here
> just in case and ignore it for now. I may look into that if 6.4.0 +
> wireguard unexpectedly don't experience the bug - else I just expect it to
> not make a difference, but who knows, I'm keeping this door open.
> 
> When initially testing 6.3.x - before using git bisect - I was able to boot
> the system successfully as long as I did not load the wireguard.ko kernel
> module, albeit I'm not sure I gave it enough time to really produce a RCU
> stall.
> 
> So, right now I'll test 6.4.0 with and without wireguard.
> I will first disable wireguard by hiding wireguard.ko on the filesystem and
> if that still triggers the bug, I'll disable wireguard support on the
> .config as well.
> 
> If 6.4.0 boots well with wireguard, I guess that's the fix. I do not want to
> waste time debugging superseded kernels unless you still want me to.
> 
> If 6.4.0 does not boot with wireguard, I'll try the methods from above and
> also try 6.4.0 and latest 6.3.x with *that* commit reverted.
> 
> That revert will be applied to the git sources of the kernel, right?
> I also could try `git diff | patch -R` from the git-based kernel sources
> onto the gentoo sources. I didn't do that yet because I can produce the bug
> with vanilla.
> 
> Please feel free and encouraged to make changes to my plan. :)


Sorry this is such a pain. The reason I asked you to test without wireguard is because the BUG output references a function in the wireguard module.

Good luck !

Comment 27 satmd 2023-07-01 12:22:20 UTC

Created attachment 864948 [details]
RCU stall with 6.4.0 with delayed loading of wireguard.ko

New results!
(1) 6.4.0 with wireguard hangs
(2) 6.4.0 with wireguard.ko renamed to wireguard.ko.disabled hangs after loading the module *AND* configuring an interface on it

The attached file is with option (2).

Comment 28 satmd 2023-07-01 12:40:39 UTC

Upstream bug: https://bugzilla.kernel.org/show_bug.cgi?id=217620

Comment 29 Sam James archtester 2023-07-01 12:44:12 UTC

(In reply to satmd from comment #25)
> When initially testing 6.3.x - before using git bisect - I was able to boot
> the system successfully as long as I did not load the wireguard.ko kernel
> module, albeit I'm not sure I gave it enough time to really produce a RCU
> stall.
> 

Interesting!

> So, right now I'll test 6.4.0 with and without wireguard.
> I will first disable wireguard by hiding wireguard.ko on the filesystem and
> if that still triggers the bug, I'll disable wireguard support on the
> .config as well.
> 
> If 6.4.0 boots well with wireguard, I guess that's the fix. I do not want to
> waste time debugging superseded kernels unless you still want me to.
> 

Yeah, that's fine with me - kernel people may or may not want you to try to find the fix, but I don't see the need really given it's about 6.3 vs 6.4.

> That revert will be applied to the git sources of the kernel, right?
> I also could try `git diff | patch -R` from the git-based kernel sources
> onto the gentoo sources. I didn't do that yet because I can produce the bug
> with vanilla.

yes, just git sources are fine

Comment 30 satmd 2023-07-01 12:45:49 UTC

# git bisect log
git bisect start
# status: waiting for both good and bad commits
# good: [46df6964c1a9eb72027710f626cb1c6bfb5d58c9] Linux 6.2.16
git bisect good 46df6964c1a9eb72027710f626cb1c6bfb5d58c9
# status: waiting for bad commit, 1 good commit known
# bad: [457391b0380335d5e9a5babdec90ac53928b23b4] Linux 6.3
git bisect bad 457391b0380335d5e9a5babdec90ac53928b23b4
# good: [c9c3395d5e3dcc6daee66c6908354d47bf98cb0c] Linux 6.2
git bisect good c9c3395d5e3dcc6daee66c6908354d47bf98cb0c
# good: [a5c95ca18a98d742d0a4a04063c32556b5b66378] Merge tag 'drm-next-2023-02-23' of git://anongit.freedesktop.org/drm/drm
git bisect good a5c95ca18a98d742d0a4a04063c32556b5b66378
# good: [1ec35eadc3b448c91a6b763371a7073444e95f9d] Merge tag 'clk-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/clk/linux
git bisect good 1ec35eadc3b448c91a6b763371a7073444e95f9d
# good: [3b11717f95b1880b9cab4b90bbaf61268e6bda2b] Merge tag 'vfs.misc.v6.3-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/idmapping
git bisect good 3b11717f95b1880b9cab4b90bbaf61268e6bda2b
# good: [fb5015bc8b733323b58f015b88e4f316010ec856] docs: kvm: x86: Fix broken field list
git bisect good fb5015bc8b733323b58f015b88e4f316010ec856
# good: [aa46fe36bbac623d58817eb12ed0222d88fe6b16] Merge tag 'tty-6.3-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty
git bisect good aa46fe36bbac623d58817eb12ed0222d88fe6b16
# bad: [9772f14f557de9d4056212c84a0a4f64b7b09f31] Merge tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi
git bisect bad 9772f14f557de9d4056212c84a0a4f64b7b09f31
# bad: [4413ad01e27eb989f4b19bb5b038328c220a383d] Merge tag 'devicetree-fixes-for-6.2-3' of git://git.kernel.org/pub/scm/linux/kernel/git/robh/linux
git bisect bad 4413ad01e27eb989f4b19bb5b038328c220a383d
# bad: [fffb0b52d5258554c645c966c6cbef7de50b851d] fbcon: set_con2fb_map needs to set con2fb_map!
git bisect bad fffb0b52d5258554c645c966c6cbef7de50b851d
# good: [cdc9718d5e590d6905361800b938b93f2b66818e] Merge tag '6.3-rc5-smb3-cifs-client-fixes' of git://git.samba.org/sfrench/cifs-2.6
git bisect good cdc9718d5e590d6905361800b938b93f2b66818e
# good: [c08cfd6716a170c549c1140f1d4a0e749c888a79] Merge tag 'cxl-fixes-6.3-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/cxl/cxl
git bisect good c08cfd6716a170c549c1140f1d4a0e749c888a79
# bad: [09a9639e56c01c7a00d6c0ca63f4c7c41abe075d] Linux 6.3-rc6
git bisect bad 09a9639e56c01c7a00d6c0ca63f4c7c41abe075d
# bad: [4ba115e2694dc9a10abfe94766d70b64ae9479c7] Merge tag 'x86_urgent_for_v6.3_rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
git bisect bad 4ba115e2694dc9a10abfe94766d70b64ae9479c7
# bad: [fed8d8773b8ea68ad99d9eee8c8343bef9da2c2c] x86/acpi/boot: Correct acpi_is_processor_usable() check
git bisect bad fed8d8773b8ea68ad99d9eee8c8343bef9da2c2c
# good: [a74fabfbd1b7013045afc8cc541e6cab3360ccb5] x86/ACPI/boot: Use FADT version to check support for online capable
git bisect good a74fabfbd1b7013045afc8cc541e6cab3360ccb5
# first bad commit: [fed8d8773b8ea68ad99d9eee8c8343bef9da2c2c] x86/acpi/boot: Correct acpi_is_processor_usable() check

Comment 31 satmd 2023-07-01 12:53:51 UTC

I skipped the test with removing wireguard from .config, since the "ok state" is reachable if I just do not create an interface with it. It's safe to assume that not building the module does not improve the situation.

How is the communication going on from here since we have a bug here and upstream?

I'd instinctively only post here important updates and continue all further debugging over there.

Comment 32 Sam James archtester 2023-07-01 12:56:41 UTC

(In reply to satmd from comment #31)
> I skipped the test with removing wireguard from .config, since the "ok
> state" is reachable if I just do not create an interface with it. It's safe
> to assume that not building the module does not improve the situation.
> 
> How is the communication going on from here since we have a bug here and
> upstream?
> 
> I'd instinctively only post here important updates and continue all further
> debugging over there.

Sure, that's fine. But see https://www.kernel.org/doc/html/v6.4/admin-guide/reporting-issues.html. Bugzilla may not be the right place(!)

Comment 33 satmd 2023-07-01 13:20:28 UTC

(In reply to Sam James from comment #32)
> (In reply to satmd from comment #31)
> > I skipped the test with removing wireguard from .config, since the "ok
> > state" is reachable if I just do not create an interface with it. It's safe
> > to assume that not building the module does not improve the situation.
> > 
> > How is the communication going on from here since we have a bug here and
> > upstream?
> > 
> > I'd instinctively only post here important updates and continue all further
> > debugging over there.
> 
> Sure, that's fine. But see
> https://www.kernel.org/doc/html/v6.4/admin-guide/reporting-issues.html.
> Bugzilla may not be the right place(!)

Well, since the bug is open, let's see.
Doing a web search for bug reporting comes up with *several* similar documents for kernel.org (even for the same version) and they obviously have the matching category for kernel bugs and it seems that this isn't actually my first report there. :D

Right now, I'm testing 6.4.0 + reverted commit and it does not crash yet.

Comment 34 satmd 2023-07-01 14:14:09 UTC

(In reply to satmd from comment #33)
> (In reply to Sam James from comment #32)
> > (In reply to satmd from comment #31)
> > > I skipped the test with removing wireguard from .config, since the "ok
> > > state" is reachable if I just do not create an interface with it. It's safe
> > > to assume that not building the module does not improve the situation.
> > > 
> > > How is the communication going on from here since we have a bug here and
> > > upstream?
> > > 
> > > I'd instinctively only post here important updates and continue all further
> > > debugging over there.
> > 
> > Sure, that's fine. But see
> > https://www.kernel.org/doc/html/v6.4/admin-guide/reporting-issues.html.
> > Bugzilla may not be the right place(!)
> 
> Well, since the bug is open, let's see.
> Doing a web search for bug reporting comes up with *several* similar
> documents for kernel.org (even for the same version) and they obviously have
> the matching category for kernel bugs and it seems that this isn't actually
> my first report there. :D
> 
> Right now, I'm testing 6.4.0 + reverted commit and it does not crash yet.

My 6.4.0 frankenkernel seems to be running stable.

Comment 35 Jason A. Donenfeld gentoo-dev 2023-07-02 13:48:49 UTC

Can you tell me if https://git.zx2c4.com/wireguard-linux/patch/?id=54d5e4329efe0d1dba8b4a58720d29493926bed0 fixes the issue?

Comment 36 satmd 2023-07-02 20:13:27 UTC

(In reply to Jason A. Donenfeld from comment #35)
> Can you tell me if
> https://git.zx2c4.com/wireguard-linux/patch/
> ?id=54d5e4329efe0d1dba8b4a58720d29493926bed0 fixes the issue?

This patch allows me to successfully boot 6.4! :)

Comment 37 satmd 2023-07-02 20:18:19 UTC

(In reply to satmd from comment #36)
> (In reply to Jason A. Donenfeld from comment #35)
> > Can you tell me if
> > https://git.zx2c4.com/wireguard-linux/patch/
> > ?id=54d5e4329efe0d1dba8b4a58720d29493926bed0 fixes the issue?
> 
> This patch allows me to successfully boot 6.4! :)

Your patch works for me. Tested-by: Manuel Leiner <manuel.leiner@gmx.de>

Comment 38 Jason A. Donenfeld gentoo-dev 2023-07-03 11:47:23 UTC

https://git.kernel.org/pub/scm/linux/kernel/git/netdev/net.git/commit/?id=7387943fa35516f6f8017a3b0e9ce48a3bef9faa

The fix hit the net tree. Will be in the next stable.

Comment 39 Mike Pagano gentoo-dev 2023-07-03 12:20:31 UTC

(In reply to Jason A. Donenfeld from comment #38)
> https://git.kernel.org/pub/scm/linux/kernel/git/netdev/net.git/commit/
> ?id=7387943fa35516f6f8017a3b0e9ce48a3bef9faa
> 
> The fix hit the net tree. Will be in the next stable.

I'll add this to genpatches.  Keeping open until we do a release with this one included.

Thanks, Jason.

Comment 40 Larry the Git Cow gentoo-dev 2023-07-04 14:06:25 UTC

The bug has been referenced in the following commit(s):

https://gitweb.gentoo.org/repo/gentoo.git/commit/?id=9b820c2c4ab55020ece10ea3123255daa5b8ace6

commit 9b820c2c4ab55020ece10ea3123255daa5b8ace6
Author:     Mike Pagano <mpagano@gentoo.org>
AuthorDate: 2023-07-04 14:04:00 +0000
Commit:     Mike Pagano <mpagano@gentoo.org>
CommitDate: 2023-07-04 14:06:16 +0000

    sys-kernel/gentoo-sources: multiple fixes, see below
    
    execve: always mark stack as growing down during early stack setup
    
    wireguard: queueing: use saner cpu selection wrapping
    Bug: https://bugs.gentoo.org/909066
    
    Disable CONFIG_PER_VMA_LOCK by default until its fixed
    See: https://bugzilla.kernel.org/show_bug.cgi?id=217624
    
    Signed-off-by: Mike Pagano <mpagano@gentoo.org>

 sys-kernel/gentoo-sources/Manifest                 |  3 +++
 .../gentoo-sources/gentoo-sources-6.4.1-r1.ebuild  | 28 ++++++++++++++++++++++
 2 files changed, 31 insertions(+)

Comment 41 Larry the Git Cow gentoo-dev 2023-07-05 21:56:32 UTC

The bug has been closed via the following commit(s):

https://gitweb.gentoo.org/repo/gentoo.git/commit/?id=65f59d62c29e1488aaae92d45eacd666f0851f73

commit 65f59d62c29e1488aaae92d45eacd666f0851f73
Author:     Mike Pagano <mpagano@gentoo.org>
AuthorDate: 2023-07-05 21:54:15 +0000
Commit:     Mike Pagano <mpagano@gentoo.org>
CommitDate: 2023-07-05 21:54:15 +0000

    sys-kernel/gentoo-sources: add 6.3.12, addl patches
    
    Removed:
    1800_mm-execve-mark-stack-as-growing-down.patch
    
    wireguard: queueing: use saner cpu selection wrapping
    Closes: https://bugs.gentoo.org/909066
    
    Signed-off-by: Mike Pagano <mpagano@gentoo.org>

 sys-kernel/gentoo-sources/Manifest                 |  3 +++
 .../gentoo-sources/gentoo-sources-6.3.12.ebuild    | 28 ++++++++++++++++++++++
 2 files changed, 31 insertions(+)