Bug 698228 - sys-kernel/linux-firmware-20191008 creates unusable kernel with RX Vega 56/64
Status: RESOLVED OBSOLETE
Alias: None
Product: Gentoo Linux
Classification: Unclassified
Component: Current packages
Hardware: AMD64 Linux
Importance: Normal normal
Assignee: Chí-Thanh Christopher Nguyễn
Reported: 2019-10-21 20:58 UTC by Mark
Modified: 2021-05-13 00:00 UTC
Description Mark 2019-10-21 20:58:03 UTC
I found my system in an unbootable state this weekend when I rebooted after making several changes to my kernel config. Hours of debugging later I figured out that the root cause had nothing to do with my kernel config changes and everything to do with upgrading sys-kernel/linux-firmware to version 20191008.

SYSTEM INFORMATION/CONFIGURATION:
    Kernel Package: sys-kernel/gentoo-sources-4.19.72.ebuild
    Kernel Config:  https://pastebin.com/raw/KSLZiGEk
    lshw Output:    https://pastebin.com/raw/f3a5P5rY
    lspci (kinda):  https://pastebin.com/raw/P3KvqjH9
Relevant Hardware:
    AMD GPU:        04:00.0 VGA compatible controller [0300]: Advanced Micro Devices, Inc. [AMD/ATI] Vega 10 XL/XT [Radeon RX Vega 56/64] [1002:687f]
    NVIDIA GPU:     01:00.0 VGA compatible controller [0300]: NVIDIA Corporation TU106 [GeForce RTX 2070 Rev. A] [10de:1f07] (rev a1)
Kernel Boot Parameters:
    GRUB_CMDLINE_LINUX_DEFAULT="quiet splash intel_iommu=on acpi_enforce_resources=lax iommu=pt vfio-pci.ids=10de:1f07,10de:10f9,10de:1ada,10de:1adb pcie_acs_override=downstream pci=routeirq isolcpus=5-7,13-15 nohz_full=5-7,13-15"
For completeness, the kernel is patched with the ACS override patches:
    https://github.com/feniksa/gentoo_ACS_override_patch/tree/master/sys-kernel/gentoo-sources-4.19.72

Sorry I don't have the standard lspci output because I'm away from my computer right now. I can post it if needed.

KERNEL BOOT MESSAGES:
    Unfortunately I don't have a serial console so I couldn't capture the boot output in text. While I was debugging I did take a video of the screen as it was booting so I could try to pause and read the screen. I realize this isn't ideal but (unless someone has a way to capture this output) this is the best I've got:
    https://www.youtube.com/watch?v=p4_L27kf3vg

OTHER IMPORTANT INFORMATION:
    * Kernel boots just fine with sys-kernel/linux-firmware-20190904 & sys-kernel/linux-firmware-20190923
    * Before I realized what had caused this, I tried compiling as a module with the following results:
        - I was dumped to initramfs shell due to being unable to load tty
        - I copied the amdgpu module files and relevant files from lib/firmware to initramfs so I could load the module from initramfs
        - Immediately after entering "modprobe amdgpu" I lost the ability to type in the console and saw messages similar to those at the end of the video:
            timekeeping watchdog on CPU[#] (something about the CPU being unreachable or spinning forever; I don't quite remember, sorry)
    * If it's not clear from my config, the NVIDIA GPU is passed through to a VM while the AMD GPU is used on the host (Gentoo) box.
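
As a sanity check on the passthrough setup described above, the following minimal Python sketch extracts the vfio-pci.ids list from the boot parameters quoted in this report and checks which of the two GPUs it covers. The helper name vfio_ids is hypothetical and exists only for illustration; the parameter string and PCI IDs are copied from the report.

```python
# Kernel boot parameters as quoted in the report (GRUB_CMDLINE_LINUX_DEFAULT).
CMDLINE = (
    "quiet splash intel_iommu=on acpi_enforce_resources=lax iommu=pt "
    "vfio-pci.ids=10de:1f07,10de:10f9,10de:1ada,10de:1adb "
    "pcie_acs_override=downstream pci=routeirq "
    "isolcpus=5-7,13-15 nohz_full=5-7,13-15"
)

AMD_GPU = "1002:687f"     # Radeon RX Vega 56/64, used on the host
NVIDIA_GPU = "10de:1f07"  # GeForce RTX 2070, passed through to a VM


def vfio_ids(cmdline):
    """Return the vendor:device IDs listed in vfio-pci.ids (hypothetical helper)."""
    for token in cmdline.split():
        key, sep, value = token.partition("=")
        if key == "vfio-pci.ids" and sep:
            return value.split(",")
    return []


ids = vfio_ids(CMDLINE)
print("NVIDIA GPU bound to vfio-pci:", NVIDIA_GPU in ids)
print("AMD GPU bound to vfio-pci:", AMD_GPU in ids)
```

With the parameters as posted, every listed ID carries the NVIDIA vendor prefix 10de, so the Vega is left to the host and amdgpu.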
Comment 1 Thomas Deutschmann (RETIRED) gentoo-dev 2019-10-22 21:30:54 UTC
sys-kernel/linux-firmware is a complicated and dangerous package: it's possible that you are experiencing a bug, but it's also possible that your system (the kernel and driver in use) is simply incompatible with the latest firmware.

The problem is: you are on your own and we cannot help you. Otherwise, tell us what you expect us to do.

We don't have your hardware. We don't really know what the problem is.

Keep in mind that Gentoo is a rolling distribution. We don't know which kernel a user is running or will run on the next boot. We don't know which driver you are using. Even if we knew that "AMDGPU firmware from 2019-10-08 requires at least linux-5.3.5 and dev-libs/amdgpu-pro-opencl-19.30", we could not block this upgrade for you, because we don't know whether you plan to upgrade the kernel/driver next and will therefore require the latest firmware, which is incompatible with the currently running kernel/driver.
Comment 2 Daniel Nilsson 2019-12-27 20:23:11 UTC
The workaround/solution is to upgrade to the newest kernel.

I can confirm the problem with gentoo-sources-4.19.86 in combination with a version of linux-firmware containing vega10 ucode 19.30 (i.e. >=sys-kernel/linux-firmware-20191008). I have a Radeon RX Vega 64 GPU, and with that combination the computer freezes with no monitor output as soon as the amdgpu driver initializes (the driver and firmware are built into the kernel). The version of the userspace drivers is irrelevant because the kernel won't boot that far.

I didn't want to downgrade to linux-firmware 20190923 because then I would miss the newest AMD CPU microcode. So I tried the latest gentoo-sources (5.4.3), and that works with the latest linux-firmware.

So this is a reason to get 5.4.x kernel stable.
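
The combinations reported so far in this bug can be collected in a small lookup, shown here as a Python sketch. It encodes only what the thread states and is not a general compatibility table. The function name reported_status is hypothetical, and treating "latest linux-firmware" on 5.4.3 as any snapshot from 20191008 onward is an assumption.

```python
# First linux-firmware snapshot reported to hang on the affected kernels.
FIRST_BAD_FIRMWARE = 20191008

# Kernels (gentoo-sources) for which a hang was reported with that firmware.
HANGING_KERNELS = {"4.19.72", "4.19.86"}

# Firmware snapshots the original reporter found to boot fine on 4.19.72.
GOOD_FIRMWARE_ON_4_19_72 = {20190904, 20190923}


def reported_status(kernel, firmware):
    """Return what this bug reports for a kernel/firmware pair (hypothetical helper).

    firmware is the linux-firmware snapshot date as an integer (YYYYMMDD).
    """
    if kernel in HANGING_KERNELS and firmware >= FIRST_BAD_FIRMWARE:
        return "hang reported"
    if kernel == "4.19.72" and firmware in GOOD_FIRMWARE_ON_4_19_72:
        return "boots"
    # Assumption: "latest linux-firmware" means a snapshot >= 20191008.
    if kernel == "5.4.3" and firmware >= FIRST_BAD_FIRMWARE:
        return "boots"
    return "not reported"


for kernel, firmware in [("4.19.72", 20190923), ("4.19.86", 20191008), ("5.4.3", 20191008)]:
    print(kernel, firmware, reported_status(kernel, firmware))
```

Any pair that nobody in the thread tested comes back as "not reported" rather than being guessed.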
Comment 3 Thomas Deutschmann (RETIRED) gentoo-dev 2021-05-13 00:00:12 UTC
=sys-kernel/linux-firmware-2019100 was removed a long time ago.