Bug 716286 - sys-kernel/gentoo-sources-5.4.28: r8169 is unstable.
Status: CONFIRMED
Alias: None
Product: Gentoo Linux
Classification: Unclassified
Component: Current packages
Hardware: All Linux
Importance: Normal normal
Assignee: Gentoo Kernel Bug Wranglers and Kernel Maintainers
URL: https://git.kernel.org/pub/scm/linux/...
Whiteboard: Unknown hardware erratum, Linux 5.8
Keywords: UPSTREAM
Depends on:
Blocks:
 
Reported: 2020-04-05 09:34 UTC by crocket
Modified: 2020-09-29 13:20 UTC (History)
0 users

See Also:
Package list:
Runtime testing required: ---


Attachments
On 4.19.97, r8169 works fine. (dmesg-4.19.97-gentoo.log,66.10 KB, text/x-log)
2020-04-09 11:54 UTC, crocket
On 5.4, r8169 stops working after a while. (dmesg-5.4.31-gentoo.log,66.57 KB, text/plain)
2020-04-09 11:54 UTC, crocket

Description crocket 2020-04-05 09:34:10 UTC
I upgraded from 4.19.97 to 5.4.28.
With 5.4.28, r8169 is unstable. It stops working after a few minutes.
Comment 1 Mike Pagano gentoo-dev 2020-04-05 11:29:38 UTC
Please attach dmesg output from both the working and the non-working kernel.
Please also attach your .config.

Try the latest 5.4.x, which is 5.4.30 as of this writing.
Try the latest 5.5.x, which is 5.5.15 as of this writing.
Comment 2 Maxim P. Dementiev 2020-04-07 16:05:05 UTC
Stable sys-kernel/gentoo-sources-5.4.28 contains this issue:
https://gitlab.freedesktop.org/drm/intel/issues/827
The system became unstable.
Comment 3 crocket 2020-04-09 11:53:25 UTC
I couldn't test linux 5.5 because zfs-kmod and virtualbox-modules don't support linux 5.5. Instead, I have test results from 4.19 and 5.4.
Comment 4 crocket 2020-04-09 11:54:02 UTC
Created attachment 631596 [details]
On 4.19.97, r8169 works fine.
Comment 5 crocket 2020-04-09 11:54:43 UTC
Created attachment 631598 [details]
On 5.4, r8169 stops working after a while.
Comment 6 Mike Pagano gentoo-dev 2020-05-21 17:16:46 UTC
There are a lot of r8169 patches queued up in the net-next.git tree, around 10 or so. I am not sure whether adding all of those will fix your issues or create new ones.

I do expect these patches to make it into a future kernel.

Do you want to grab all 10, apply them to 5.6.x, and test?

If so, apply this one and all newer ones

"r8169: add helper r8168g_wait_ll_share_fifo_ready"

Link in URL field.
Comment 7 crocket 2020-05-22 05:37:50 UTC
zfs-kmod breaks after 5.4
Comment 8 Mike Pagano gentoo-dev 2020-05-22 11:49:12 UTC
(In reply to crocket from comment #7)
> zfs-kmod breaks after 5.4

Is there an upstream bug on that?
Comment 9 crocket 2020-05-22 12:06:43 UTC
I don't know, but zfs-kmod ebuild says it is only compatible with linux 5.4 and below.
Comment 10 Mike Pagano gentoo-dev 2020-05-22 22:19:34 UTC
(In reply to crocket from comment #9)
> I don't know, but zfs-kmod ebuild says it is only compatible with linux 5.4
> and below.

You can try applying those to 5.4. I am running out of ideas for you.
Comment 11 crocket 2020-05-23 12:44:54 UTC
I think I will let https://bugzilla.kernel.org/show_bug.cgi?id=207205 fix it.
Comment 12 Richard Yao gentoo-dev 2020-05-24 02:24:46 UTC
(In reply to crocket from comment #9)
> I don't know, but zfs-kmod ebuild says it is only compatible with linux 5.4
> and below.

Use 0.8.4 from ~amd64 to get Linux 5.6 support. It will likely be stabilized in the near future. Alternatively, you can use the 9999 ebuild if you are running a bleeding edge kernel. It has no version check.
Comment 13 crocket 2020-05-24 07:07:15 UTC
The new patches will become available on 5.8. I will wait for a version of zfs-kmod that works with 5.8.
Comment 14 Richard Yao gentoo-dev 2020-05-25 00:12:20 UTC
Today I installed sys-kernel/gentoo-sources-5.4.38 on a system with an Asrock B450 Pro4 motherboard that has the following NIC:

09:00.0 Ethernet controller: Realtek Semiconductor Co., Ltd. RTL8111/8168/8411 PCI Express Gigabit Ethernet Controller (rev 15)
        Subsystem: ASRock Incorporation Motherboard (one of many)
        Flags: bus master, fast devsel, latency 0, IRQ 24
        I/O ports at d000 [size=256]
        Memory at f7504000 (64-bit, non-prefetchable) [size=4K]
        Memory at f7500000 (64-bit, non-prefetchable) [size=16K]
        Capabilities: <access denied>
        Kernel driver in use: r8169
        Kernel modules: r8169

It has been a few hours. So far, I am unable to reproduce this issue. I will try using Google Stadia to do further testing after a new chromium build finishes.
Comment 15 Richard Yao gentoo-dev 2020-05-25 00:15:08 UTC
By the way, it looks like Linux 5.4.32 included a fix for this issue:

commit 74107d56d1e8e6ac5a061059941b7e2d03522df6
Author: Heiner Kallweit <hkallweit1@gmail.com>
Date:   Sat Apr 4 23:48:45 2020 +0200

    r8169: change back SG and TSO to be disabled by default
    
    [ Upstream commit 95099c569a9fdbe186a27447dfa8a5a0562d4b7f ]
    
    There has been a number of reports that using SG/TSO on different chip
    versions results in tx timeouts. However for a lot of people SG/TSO
    works fine. Therefore disable both features by default, but allow users
    to enable them. Use at own risk!
    
    Fixes: 93681cd7d94f ("r8169: enable HW csum and TSO")
    Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

https://cdn.kernel.org/pub/linux/kernel/v5.x/ChangeLog-5.4.32
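
The commit message above says SG and TSO are off by default but can still be enabled by users. A minimal sketch of how that toggle could be scripted, assuming ethtool is installed and the interface is named eth0 (the interface name and the helper names are assumptions, not taken from this bug). On a kernel older than 5.4.32, switching both features off by hand is one way to check whether SG/TSO is behind the tx timeouts.

```python
import subprocess


def offload_command(interface, enable):
    """Build the ethtool argv that switches scatter-gather and TSO on or off."""
    state = "on" if enable else "off"
    return ["ethtool", "-K", interface, "sg", state, "tso", state]


def set_offloads(interface, enable):
    """Run the command; needs root and raises CalledProcessError on failure."""
    subprocess.run(offload_command(interface, enable), check=True)


if __name__ == "__main__":
    # "eth0" is a placeholder interface name; only print the command here.
    print(" ".join(offload_command("eth0", False)))
```

The setting does not survive a reboot or a driver reload, so it has to be reapplied each time the interface comes up.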
Comment 16 Richard Yao gentoo-dev 2020-05-25 03:19:41 UTC
Looking at the upstream issue, it seems that crocket is still affected by this even on Linux 5.4.40, so the patch in 5.4.32 is not a solution for his issue.

It seems like our hardware is slightly different. From dmesg, he has the RTL8168f/8111f while I have the RTL8168h/8111h. My guess is that the hardware revisions fixed at least one erratum that a change to the kernel driver triggered.

There are roughly 13 months' worth of changes to r8169, which makes this particularly difficult to track down, as there appear to be more than 100 changes (with a refactoring of the kernel source tree midway) to consider.

If crocket would be willing to spend some time building and testing kernels to help narrow this down, it would be really useful if he could try bisecting this. A normal git bisect should give us the precise patch where things went wrong, but I won't ask an end user to do that on their system, so I will suggest an alternative.

It is possible to manually get the tarballs for older kernels, extract them to /usr/src, use eselect to change the /usr/src/linux symlink and then build either manually or with genkernel:

https://cdn.kernel.org/pub/linux/kernel/v5.x/linux-5.3.18.tar.xz
https://cdn.kernel.org/pub/linux/kernel/v5.x/linux-5.2.21.tar.xz
https://cdn.kernel.org/pub/linux/kernel/v5.x/linux-5.1.21.tar.xz
https://cdn.kernel.org/pub/linux/kernel/v5.x/linux-5.0.21.tar.xz
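
The tarball links follow a regular pattern, so the fetch, extract, and symlink steps can be generated per version. A rough sketch, assuming wget and tar are available and that eselect accepts the extracted directory name (linux-<version>) as a target; the helper names are made up for illustration, and the build step itself is left out.

```python
def tarball_url(version):
    """Return the cdn.kernel.org tarball URL for a release such as '5.3.18'."""
    major = version.split(".")[0]
    return (
        f"https://cdn.kernel.org/pub/linux/kernel/v{major}.x/"
        f"linux-{version}.tar.xz"
    )


def setup_commands(version):
    """List the shell commands that fetch and unpack one kernel (run as root)."""
    return [
        f"wget {tarball_url(version)}",
        f"tar -xf linux-{version}.tar.xz -C /usr/src",
        # Assumption: the target name matches the extracted directory name.
        f"eselect kernel set linux-{version}",
    ]


if __name__ == "__main__":
    for command in setup_commands("5.1.21"):
        print(command)
```

If the name is not accepted, `eselect kernel list` shows the available targets and their numbers.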

The idea is to build Linux 5.1.y and see if it is affected. If it is affected, try Linux 5.0.y: if 5.0.y is also affected, the bug arrived in 5.0.y or earlier; if not, it arrived in 5.1.y. If 5.1.y is not affected, try Linux 5.2.y: if it is affected, we know that 5.2.y is where the bug started; if it is not, then we need a test of 5.3.y.
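
The testing order described here is a binary search over release series. A small sketch of that search, assuming the candidates are listed oldest to newest and that every series after the first affected one is also affected; first_affected and the is_affected callback (a manual build-and-test step in practice) are hypothetical names.

```python
def first_affected(versions, is_affected):
    """Return the oldest affected candidate, or None if no candidate is affected.

    versions must be ordered oldest to newest. is_affected(version) reports
    whether that kernel shows the bug; in practice that means building it and
    watching the NIC for a while.
    """
    lo, hi = 0, len(versions)
    while lo < hi:
        # Bias toward the older half so four candidates start at the second one.
        mid = lo + (hi - lo - 1) // 2
        if is_affected(versions[mid]):
            hi = mid
        else:
            lo = mid + 1
    return versions[lo] if lo < len(versions) else None


if __name__ == "__main__":
    # Pretend the bug first appeared in 5.2.y, purely for illustration.
    candidates = ["5.0", "5.1", "5.2", "5.3"]
    print(first_affected(candidates, lambda v: v in ("5.2", "5.3")))
```

With those four candidates the search tests 5.1.y first and needs at most three builds. A result of None means none of them is affected, which points at 5.4.y.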

Anyway, if we know which major kernel version introduced the issue, it would greatly narrow down what we need to consider in terms of finding the bug. Bisecting Linus' tree would end up giving us the precise commit, but I won't suggest that an end user do that.

By the way, one possible workaround would be to switch to the vendor driver:

https://packages.gentoo.org/packages/net-misc/r8168

It presumably isn't affected by this. Note that the comment in the ebuild says that it is only up to 4.15, but that is outdated because the upstream source says that it supports up to 5.6:

https://www.realtek.com/en/component/zoo/category/network-interface-controllers-10-100-1000m-gigabit-ethernet-pci-express-software

I will do a commit to portage to fix the comment, although that could take up to a day before it is reflected in what end users see on their machines.
Comment 17 Richard Yao gentoo-dev 2020-05-25 14:30:58 UTC
I forgot to include a tarball link for Linux 4.20.y last night:

https://cdn.kernel.org/pub/linux/kernel/v4.x/linux-4.20.17.tar.xz
Comment 18 crocket 2020-05-26 05:41:03 UTC
I already bisected once.

Look at https://bugzilla.kernel.org/show_bug.cgi?id=207205#c9
Comment 19 crocket 2020-09-29 13:20:57 UTC
This issue was still not fixed on gentoo-sources-5.8.7.
However, it is now somehow fixed on gentoo-sources-5.8.12.

A maintainer of the r8169 driver says he didn't change anything between 5.8.7 and 5.8.12.

I suspect Gentoo's own patches to the Linux kernel introduced the issue.