935827 – sys-kernel/gentoo-sources-6.6.30 arm64 bug/panic in mvpp2_rx/skb_put

Status:	RESOLVED WORKSFORME

Alias:	None

Product:	Gentoo Linux
Classification:	Unclassified
Component:	Current packages
Hardware:	ARM64 Linux

Importance:	Normal normal
Assignee:	Gentoo Kernel Bug Wranglers and Kernel Maintainers

Reported:	2024-07-10 15:50 UTC by Chris Henhawke
Modified:	2024-08-21 15:08 UTC
CC List:	1 user

Attachments
kernel panic log (panic.txt, 2.66 KB, text/plain) 2024-07-10 15:50 UTC, Chris Henhawke
kernel config (config.txt, 140.94 KB, text/plain) 2024-07-10 15:58 UTC, Chris Henhawke
6.6.21 kernel dmesg (dmesg.txt, 32.19 KB, text/plain) 2024-07-11 13:28 UTC, Chris Henhawke

Description Chris Henhawke 2024-07-10 15:50:12 UTC

Created attachment 897402
kernel panic log

Hi, I'm trying to troubleshoot an issue that cropped up on some ARM boards that I have. They run 6.6.21 just fine, but after upgrading to 6.6.30 they hit a bug/panic after running for a few days.

skb_panic+0x6c/0x78
skb_put+0xa4/0xb0
mvpp2_rx+0x604/0xbe8
mvpp2_poll+0x100/0x220

From what I can tell, the code hasn't changed in 3+ years, and 6.6.x is only a year old, and both kernels were compiled with the same gcc (13.2.1_p20240210). Even if I were somehow receiving malformed packets that weren't being caught by iptables/etc., I'm at a loss as to why that would panic 6.6.30 and not 6.6.21.

I was wondering if I could get a second set of eyes before I try reporting it upstream...  I'll attach the full crash, since I was able to pull it off the serial console.

Thanks in advance
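
The frames in the trace above can be mapped back to source lines before the report goes upstream. A minimal sketch, assuming the kernel tree that produced the crashing image is still available, that its vmlinux was built with debug info, and that mvpp2 is built in (for a module build, the .ko file would be passed instead). `resolve_frames` is a hypothetical helper name; `scripts/faddr2line` is the script shipped in the kernel source tree. The helper only prints the commands so they can be reviewed first:

```shell
#!/bin/sh
# Sketch: print one faddr2line command per stack frame.
# Assumptions: run from the top of the kernel source tree, and the object
# file (vmlinux here) carries debug info for the crashing build.
resolve_frames() {
    obj=$1
    shift
    for frame in "$@"; do
        # Printed rather than executed, so nothing runs until copied by hand.
        printf 'scripts/faddr2line %s %s\n' "$obj" "$frame"
    done
}

# Frames taken from the panic in this report.
resolve_frames vmlinux 'skb_put+0xa4/0xb0' 'mvpp2_rx+0x604/0xbe8'
```

Running the printed commands should point at the `skb_put()` call inside `mvpp2_rx()` that tripped the check, which is the detail an upstream report would most likely be asked for.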

Comment 1 Chris Henhawke 2024-07-10 15:58:56 UTC

Created attachment 897403
kernel config

Comment 2 Mike Pagano gentoo-dev 2024-07-11 12:43:31 UTC

Hello, can you try a few things, all without any proprietary modules loaded (your panic indicates that some were)?

1. Try the latest 6.6.x to see whether it is already fixed (6.6.38 as of this writing)
2. Do a git bisect from 6.6.21 to 6.6.30 to see if there is an offending commit
3. Increase your logging: you can add ignore_loglevel to your kernel parameters or

If you get a more verbose panic, can you attach the full dmesg?
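
The bisect in step 2 could look roughly like the following. This is a sketch under assumptions: it expects to run inside a clone of the upstream stable tree, which carries the `v6.6.21` and `v6.6.30` tags, whereas gentoo-sources applies genpatches on top of those releases. `run` and `bisect_start` are hypothetical helper names, and the script is a dry run unless `DRY_RUN=0` is set:

```shell
#!/bin/sh
# Dry run by default: commands are printed instead of executed.
run() {
    if [ "${DRY_RUN:-1}" = 1 ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

bisect_start() {
    # $1 = last known good tag, $2 = first known bad tag.
    # git bisect start takes the bad revision first, then the good one.
    run git bisect start "$2" "$1"
}

bisect_start v6.6.21 v6.6.30

# After building and testing each commit that git checks out:
#   git bisect good    # no panic during the test period
#   git bisect bad     # panic reproduced
#   git bisect reset   # when finished, return to the original checkout
```

Because the panic only shows up after days of uptime, every step costs a full soak period, so it is worth being sure a kernel is really good before marking it.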

Comment 3 Chris Henhawke 2024-07-11 13:28:01 UTC

Hi, unfortunately I don't have the means to move these systems back to ext4, but I'll try walking up the 6.6.x tree and see where things start going wrong.  I'll also try 6.6.38 (or 6.6.39 since I just saw that go up).  

In the meantime, I'll give you the full dmesg from a running 6.6.21 system, although it looks pretty normal to me.  

(A little backstory, I have three of these boards, one is my router, one is my mail server, and the last one is my web server...  I did have some memory issues on the web server (bug 907766), but it was resolved by reseating the memory.  I only saw this panic on the router and the mail server before I rolled all three back to 6.6.21.)

Comment 4 Chris Henhawke 2024-07-11 13:28:32 UTC

Created attachment 897445
6.6.21 kernel dmesg

Comment 5 Chris Henhawke 2024-07-18 15:23:00 UTC

6.6.22/23/24 seem to have lasted a week, moving up to 6.6.25/26/27...

Comment 6 Chris Henhawke 2024-07-30 12:29:15 UTC

Moving up to 6.6.28/29... I'll try 6.6.30 again later, then 6.6.38 and whatever the latest 6.6.x kernel winds up being after next week... Starting to wish I had started at 29 and walked backwards instead.

Comment 7 Chris Henhawke 2024-08-17 08:03:27 UTC

I can't replicate this anymore on 6.6.30.

Comment 8 Chris Henhawke 2024-08-21 15:08:45 UTC

A small addendum... While I haven't seen the crashes from a month ago again, I just saw this message on one of these ARM boards:

[1113648.270846] TCP: eth0: Driver has suspect GRO implementation, TCP performance may be compromised.

If I had to guess, the crashes might have been caused by random garbage packets that got merged and offloaded by GRO.  I'm going to disable it across all interfaces and see where that gets me.

Cheers
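
Disabling GRO on every interface, as described above, can be scripted along these lines. A minimal sketch, assuming ethtool is installed and that the interfaces appear under /sys/class/net. `gro_off_cmds` is a hypothetical helper name; it skips the loopback device and only prints the commands rather than running them:

```shell
#!/bin/sh
# Sketch: print an "ethtool -K <iface> gro off" line for each interface.
# The directory argument defaults to /sys/class/net and exists mainly so the
# helper can be tried against a scratch directory.
gro_off_cmds() {
    netdir=${1:-/sys/class/net}
    for path in "$netdir"/*; do
        [ -e "$path" ] || continue        # empty or missing directory
        iface=${path##*/}
        [ "$iface" = lo ] && continue     # leave loopback alone
        printf 'ethtool -K %s gro off\n' "$iface"
    done
}

gro_off_cmds
```

The current state can be checked with `ethtool -k eth0`, which lists `generic-receive-offload` as on or off. Settings made with ethtool do not survive a reboot, so they would need to be reapplied from the network configuration or an init script.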