648724 – sys-kernel/vanilla-sources-4.15.7: firmware load failures on amdgpu

Status:	RESOLVED UPSTREAM

Alias:	None

Product:	Gentoo Linux
Classification:	Unclassified
Component:	Current packages
Hardware:	All Linux

Importance:	Normal normal
Assignee:	Gentoo Kernel Bug Wranglers and Kernel Maintainers

URL:	https://bugs.freedesktop.org/show_bug...

Reported:	2018-02-24 19:25 UTC by José Pekkarinen
Modified:	2018-07-23 19:42 UTC
CC List:	1 user

Attachments
Udevadm executed during modprobe amdgpu. (udevadm.txt, 8.92 KB, text/plain) 2018-02-24 19:25 UTC, José Pekkarinen
emerge --info (einfo.txt, 11.37 KB, text/plain) 2018-03-07 07:24 UTC, José Pekkarinen
kernel config. (kern.conf, 142.40 KB, text/x-mpsub) 2018-03-07 07:26 UTC, José Pekkarinen

Description José Pekkarinen 2018-02-24 19:25:49 UTC

Created attachment 520946
Udevadm executed during modprobe amdgpu.

Hi,

I'm having trouble with firmware loading on my new laptop.
The laptop is a Lenovo ideapad 510S-14IKB (Kaby Lake + Topaz), and the only
way to make it work is to disable the amdgpu driver: any time it is loaded,
modprobe ends up killed and the following backtrace appears in
/var/log/messages:

Feb 24 15:41:48 bee kernel: [ 1689.500738] amdgpu: [powerplay] smu not running, upload firmware again 
Feb 24 15:41:54 bee kernel: [ 1689.502041] BUG: unable to handle kernel paging request at ffffc91c008d0fec
Feb 24 15:41:54 bee kernel: [ 1689.502089] IP: smu7_populate_single_firmware_entry.isra.5+0x68/0xc0 [amdgpu]
Feb 24 15:41:54 bee kernel: [ 1689.502091] PGD 35d10e067 P4D 35d10e067 PUD 0 
Feb 24 15:41:54 bee kernel: [ 1689.502093] Oops: 0002 [#1] SMP PTI
Feb 24 15:41:54 bee kernel: [ 1689.502096] Dumping ftrace buffer:
Feb 24 15:41:54 bee kernel: [ 1689.502098]    (ftrace buffer empty)
Feb 24 15:41:54 bee kernel: [ 1689.502099] Modules linked in: amdgpu(+) chash ttm ctr ccm af_packet ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter ip_tables nf_conntrack_ipv6 nf_defrag_ipv6 nft_ct nf_tables_ipv6 nft_masq_ipv4 nf_nat_masquerade_ipv4 nft_masq nft_meta nft_chain_nat_ipv4 nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack nf_tables_ipv4 nf_tables nfnetlink tun bridge stp llc zram zsmalloc binfmt_misc vfio_pci vfio_virqfd udl loop bfq uvcvideo videobuf2_vmalloc videobuf2_memops videobuf2_v4l2 videobuf2_core videodev media snd_hda_codec_hdmi snd_hda_codec_realtek snd_hda_codec_generic btusb btrtl btbcm btintel bluetooth x86_pkg_temp_thermal ecdh_generic intel_powerclamp kvmgt coretemp arc4 vfio_mdev mdev vfio_iommu_type1 vfio i915 iwlmvm kvm_intel kvm mac80211 joydev mousedev rtsx_pci_ms
Feb 24 15:41:54 bee kernel: [ 1689.502129]  irqbypass rtsx_pci_sdmmc mmc_core memstick crc32c_intel snd_hda_intel snd_hda_codec iwlwifi i2c_algo_bit ghash_clmulni_intel snd_hwdep intel_cstate wmi_bmof drm_kms_helper efi_pstore snd_hda_core drm intel_uncore cfg80211 intel_rapl_perf psmouse ideapad_laptop rtsx_pci efivars evdev input_leds snd_pcm mfd_core serio_raw sparse_keymap wmi syscopyarea rfkill video snd_timer snd sysfillrect sysimgblt fb_sys_fops ac battery thermal backlight fan i2c_i801 tpm_crb acpi_pad button soundcore efivarfs unix dm_zero dm_thin_pool dm_persistent_data dm_bio_prison dm_service_time dm_round_robin dm_queue_length dm_multipath dm_log_userspace cn dm_flakey dm_delay xts aesni_intel crypto_simd cryptd glue_helper aes_x86_64 cbc sha256_generic scsi_transport_iscsi r8169 mii fuse xfs nfs lockd grace sunrpc
Feb 24 15:41:54 bee kernel: [ 1689.502161]  fscache jfs reiserfs ext4 mbcache jbd2 fscrypto multipath linear raid10 raid1 raid0 dm_raid raid456 md_mod async_raid6_recov async_memcpy async_pq async_xor xor async_tx raid6_pq libcrc32c dm_snapshot dm_bufio dm_crypt dm_mirror dm_region_hash dm_log dm_mod dax hid_generic usbhid xhci_pci xhci_hcd ohci_hcd uhci_hcd usb_storage ehci_pci ehci_hcd usbcore usb_common scsi_transport_fc sr_mod cdrom sg sd_mod ata_piix ahci libahci sata_sx4 pata_oldpiix
Feb 24 15:41:54 bee kernel: [ 1689.502184] CPU: 0 PID: 20686 Comm: modprobe Not tainted 4.15.4+ #1
Feb 24 15:41:54 bee kernel: [ 1689.502186] Hardware name: LENOVO 80UV/Lenovo ideapad 510S-14IKB, BIOS 2SCN26WW(V2.06) 07/12/2017
Feb 24 15:41:54 bee kernel: [ 1689.502226] RIP: 0010:smu7_populate_single_firmware_entry.isra.5+0x68/0xc0 [amdgpu]
Feb 24 15:41:54 bee kernel: [ 1689.502228] RSP: 0018:ffffc90002eeba38 EFLAGS: 00010246
Feb 24 15:41:54 bee kernel: [ 1689.502229] RAX: 000000000000007e RBX: 0000000000000003 RCX: 000001000f53d000
Feb 24 15:41:54 bee kernel: [ 1689.502230] RDX: 0000000000000004 RSI: 0000000000000000 RDI: ffff8801f780b020
Feb 24 15:41:54 bee kernel: [ 1689.502231] RBP: ffffc91c008d0fec R08: 0000000000000001 R09: 00000000000004d4
Feb 24 15:41:54 bee kernel: [ 1689.502232] R10: 0000000000000000 R11: 0000000000000000 R12: ffff8801fc972010
Feb 24 15:41:54 bee kernel: [ 1689.502233] R13: ffff88027adba000 R14: 00000000000005fe R15: ffff8801edb90000
Feb 24 15:41:54 bee kernel: [ 1689.502234] FS:  00007f4bd2284700(0000) GS:ffff88046ec00000(0000) knlGS:0000000000000000
Feb 24 15:41:54 bee kernel: [ 1689.502235] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Feb 24 15:41:54 bee kernel: [ 1689.502236] CR2: ffffc91c008d0fec CR3: 000000035a3b8001 CR4: 00000000003606f0
Feb 24 15:41:54 bee kernel: [ 1689.502237] Call Trace:
Feb 24 15:41:54 bee kernel: [ 1689.502273]  smu7_request_smu_load_fw+0x97/0x310 [amdgpu]
Feb 24 15:41:54 bee kernel: [ 1689.502310]  pp_hw_init+0x48/0xc0 [amdgpu]
Feb 24 15:41:54 bee kernel: [ 1689.502338]  amdgpu_device_init+0xde9/0x1610 [amdgpu]
Feb 24 15:41:54 bee kernel: [ 1689.502342]  ? cache_alloc_debugcheck_after.isra.22+0x195/0x1e0
Feb 24 15:41:54 bee kernel: [ 1689.502344]  ? kmem_cache_alloc_trace+0x1f6/0x230
Feb 24 15:41:54 bee kernel: [ 1689.502371]  ? amdgpu_driver_load_kms+0x25/0x240 [amdgpu]
Feb 24 15:41:54 bee kernel: [ 1689.502397]  amdgpu_driver_load_kms+0x58/0x240 [amdgpu]
Feb 24 15:41:54 bee kernel: [ 1689.502408]  drm_dev_register+0x12f/0x1c0 [drm]
Feb 24 15:41:54 bee kernel: [ 1689.502435]  amdgpu_pci_probe+0x10f/0x140 [amdgpu]
Feb 24 15:41:54 bee kernel: [ 1689.502438]  pci_device_probe+0xc8/0x140
Feb 24 15:41:54 bee kernel: [ 1689.502441]  driver_probe_device+0x2a8/0x490
Feb 24 15:41:54 bee kernel: [ 1689.502443]  __driver_attach+0xda/0xe0
Feb 24 15:41:54 bee kernel: [ 1689.502444]  ? driver_probe_device+0x490/0x490
Feb 24 15:41:54 bee kernel: [ 1689.502446]  bus_for_each_dev+0x5a/0x90
Feb 24 15:41:54 bee kernel: [ 1689.502448]  bus_add_driver+0x16a/0x260
Feb 24 15:41:54 bee kernel: [ 1689.502450]  driver_register+0x57/0xc0
Feb 24 15:41:54 bee kernel: [ 1689.502452]  ? 0xffffffffa1467000
Feb 24 15:41:54 bee kernel: [ 1689.502453]  do_one_initcall+0x4e/0x190
Feb 24 15:41:54 bee kernel: [ 1689.502455]  ? kmem_cache_alloc_trace+0x1f6/0x230
Feb 24 15:41:54 bee kernel: [ 1689.502457]  ? do_init_module+0x22/0x20b
Feb 24 15:41:54 bee kernel: [ 1689.502459]  do_init_module+0x5b/0x20b
Feb 24 15:41:54 bee kernel: [ 1689.502461]  load_module+0x1511/0x1740
Feb 24 15:41:54 bee kernel: [ 1689.502464]  ? SyS_finit_module+0xaa/0xe0
Feb 24 15:41:54 bee kernel: [ 1689.502465]  SyS_finit_module+0xaa/0xe0
Feb 24 15:41:54 bee kernel: [ 1689.502468]  do_syscall_64+0x6e/0x190
Feb 24 15:41:54 bee kernel: [ 1689.502471]  entry_SYSCALL_64_after_hwframe+0x26/0x9b
Feb 24 15:41:54 bee kernel: [ 1689.502472] RIP: 0033:0x7f4bd1be68c9
Feb 24 15:41:54 bee kernel: [ 1689.502474] RSP: 002b:00007fff18f47a48 EFLAGS: 00000246 ORIG_RAX: 0000000000000139
Feb 24 15:41:54 bee kernel: [ 1689.502475] RAX: ffffffffffffffda RBX: 00005594a4eb6540 RCX: 00007f4bd1be68c9
Feb 24 15:41:54 bee kernel: [ 1689.502476] RDX: 0000000000000000 RSI: 00005594a4be2e2a RDI: 0000000000000005
Feb 24 15:41:54 bee kernel: [ 1689.502477] RBP: 00005594a4be2e2a R08: 0000000000000000 R09: 0000000000000000
Feb 24 15:41:54 bee kernel: [ 1689.502478] R10: 0000000000000005 R11: 0000000000000246 R12: 0000000000000000
Feb 24 15:41:54 bee kernel: [ 1689.502479] R13: 00005594a4eb7e70 R14: 0000000000040000 R15: 0000000000000000
Feb 24 15:41:54 bee kernel: [ 1689.502481] Code: b6 b1 40 5c 36 a1 ff d0 85 c0 74 1a 83 eb 06 31 c0 83 e3 fb 0f 94 c0 66 89 45 18 48 83 c4 30 31 c0 5b 5d 41 5c c3 0f b7 44 24 02 <66> 89 5d 00 c7 45 0c 00 00 00 00 c7 45 10 00 00 00 00 66 89 45 
Feb 24 15:41:54 bee kernel: [ 1689.502538] RIP: smu7_populate_single_firmware_entry.isra.5+0x68/0xc0 [amdgpu] RSP: ffffc90002eeba38
Feb 24 15:41:54 bee kernel: [ 1689.502539] CR2: ffffc91c008d0fec
Feb 24 15:41:54 bee kernel: [ 1689.502540] ---[ end trace 8684bbc443b97149 ]---

The trace points to the following lines in the smu7_request_smu_load_fw
function of drivers/gpu/drm/amd/powerplay/smumgr/smu7_smumgr.c:

...
        PP_ASSERT_WITH_CODE(0 == smu7_populate_single_firmware_entry(hwmgr,
                                UCODE_ID_RLC_G, &toc->entry[toc->num_entries++]),
                                "Failed to Get Firmware Entry.", return -EINVAL);
        PP_ASSERT_WITH_CODE(0 == smu7_populate_single_firmware_entry(hwmgr,
                                UCODE_ID_CP_CE, &toc->entry[toc->num_entries++]),
                                "Failed to Get Firmware Entry.", return -EINVAL);
        PP_ASSERT_WITH_CODE(0 == smu7_populate_single_firmware_entry(hwmgr,
                                UCODE_ID_CP_PFP, &toc->entry[toc->num_entries++]),
                                "Failed to Get Firmware Entry.", return -EINVAL);
...

I am attaching udevadm output captured during the modprobe, to show whether
the event is seen by udev.

The kernel is a vanilla 4.15.4 with some bugfixes provided by the AMD DRM team.
The firmware-related config is:

CONFIG_PREVENT_FIRMWARE_BUILD=y
CONFIG_FW_LOADER=y
CONFIG_FIRMWARE_IN_KERNEL=y
CONFIG_EXTRA_FIRMWARE="i915/kbl_dmc_ver1_04.bin i915/kbl_guc_ver9_39.bin i915/kbl_huc_ver02_00_1810.bin amdgpu/topaz_ce.bin amdgpu/topaz_k_smc.bin amdgpu/topaz_mc.bin amdgpu/topaz_me.bin amdgpu/topaz_mec2.bin amdgpu/topaz_mec.bin amdgpu/topaz_pfp.bin amdgpu/topaz_rlc.bin amdgpu/topaz_sdma1.bin amdgpu/topaz_sdma.bin amdgpu/topaz_smc.bin"
CONFIG_EXTRA_FIRMWARE_DIR="/lib/firmware"
# CONFIG_FW_LOADER_USER_HELPER_FALLBACK is not set

Any help/suggestions appreciated.

Thanks!

José.
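
Editor's note: the RIP and call-trace lines in the oops above use the kernel's symbol+0xOFFSET/0xSIZE [module] notation, so the fault sits 0x68 bytes into a 0xc0-byte function in the amdgpu module. The following is a minimal sketch of pulling those fields out of a log line for triage. The names Frame and parse_frame are hypothetical helpers invented for this illustration, not part of any existing tool.

```python
import re
from typing import NamedTuple, Optional


class Frame(NamedTuple):
    symbol: str            # function name, may carry a compiler suffix such as .isra.5
    offset: int            # byte offset of the instruction inside the function
    size: int              # total size of the function in bytes
    module: Optional[str]  # module name, or None when no [module] tag is present


# Matches "symbol+0xOFF/0xSIZE" with an optional trailing "[module]".
_FRAME_RE = re.compile(r"([\w.]+)\+0x([0-9a-f]+)/0x([0-9a-f]+)(?:\s+\[(\w+)\])?")


def parse_frame(line: str) -> Optional[Frame]:
    """Return the first symbol+offset/size frame found in a log line, if any."""
    m = _FRAME_RE.search(line)
    if m is None:
        return None
    return Frame(m.group(1), int(m.group(2), 16), int(m.group(3), 16), m.group(4))
```

Lines without a resolved symbol, such as the user-space "RIP: 0033:0x7f4bd1be68c9" entry, simply yield None.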

Comment 1 José Pekkarinen 2018-02-25 09:59:02 UTC

I forgot to add that the firmware is obviously in place:

# ls -la /lib/firmware/amdgpu/topaz*
-rw-r--r-- 1 root root   8832 Jan  7 10:52 /lib/firmware/amdgpu/topaz_ce.bin
-rw-r--r-- 1 root root  80544 Jan  7 10:52 /lib/firmware/amdgpu/topaz_k_smc.bin
-rw-r--r-- 1 root root  32100 Jan  7 10:52 /lib/firmware/amdgpu/topaz_mc.bin
-rw-r--r-- 1 root root  17024 Jan  7 10:52 /lib/firmware/amdgpu/topaz_me.bin
-rw-r--r-- 1 root root 262784 Jan  7 10:52 /lib/firmware/amdgpu/topaz_mec2.bin
-rw-r--r-- 1 root root 262784 Jan  7 10:52 /lib/firmware/amdgpu/topaz_mec.bin
-rw-r--r-- 1 root root  17024 Jan  7 10:52 /lib/firmware/amdgpu/topaz_pfp.bin
-rw-r--r-- 1 root root   8448 Jan  7 10:52 /lib/firmware/amdgpu/topaz_rlc.bin
-rw-r--r-- 1 root root   8576 Jan  7 10:52 /lib/firmware/amdgpu/topaz_sdma1.bin
-rw-r--r-- 1 root root   8576 Jan  7 10:52 /lib/firmware/amdgpu/topaz_sdma.bin
-rw-r--r-- 1 root root  80544 Jan  7 10:52 /lib/firmware/amdgpu/topaz_smc.bin

Thanks!

José.
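
Editor's note: the manual listing above could also be cross-checked against the kernel config, by confirming that every blob named in CONFIG_EXTRA_FIRMWARE exists under CONFIG_EXTRA_FIRMWARE_DIR. This is only a sketch under that assumption; parse_extra_firmware and missing_blobs are hypothetical names made up for illustration. A clean result only shows the files are present, it says nothing about the driver crash itself.

```python
import os
import re


def parse_extra_firmware(config_text):
    """Return (firmware_dir, [relative blob paths]) from kernel .config text."""
    values = {}
    for line in config_text.splitlines():
        m = re.match(r'^(CONFIG_EXTRA_FIRMWARE(?:_DIR)?)="(.*)"\s*$', line.strip())
        if m:
            values[m.group(1)] = m.group(2)
    # The blob list is a single space-separated string in the config.
    blobs = values.get("CONFIG_EXTRA_FIRMWARE", "").split()
    return values.get("CONFIG_EXTRA_FIRMWARE_DIR", ""), blobs


def missing_blobs(config_text, firmware_dir=None):
    """List configured blobs that are not regular files under the firmware dir.

    firmware_dir overrides the directory from the config (handy for testing).
    """
    cfg_dir, blobs = parse_extra_firmware(config_text)
    base = firmware_dir or cfg_dir
    return [b for b in blobs if not os.path.isfile(os.path.join(base, b))]
```

With the config from the description and the files listed above, missing_blobs would be expected to return an empty list for the topaz entries.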

Comment 2 tt_1 2018-03-07 07:21:59 UTC

Could you please attach your emerge --info, the kernel config file, and the additional patches you mentioned? I ran into this as well, but I'm not entirely sure it is the same bug, because I'm using eudev on an experimental musl-amd64 profile. Apart from that, it is the same problem: modprobe amdgpu kills the system.

I don't have physical access to the machine at the moment, will attach my emerge --info and kernel config later on.

Comment 3 José Pekkarinen 2018-03-07 07:24:51 UTC

Created attachment 522642
emerge --info

Comment 4 José Pekkarinen 2018-03-07 07:26:02 UTC

Created attachment 522644
kernel config.

Comment 5 José Pekkarinen 2018-03-07 07:29:43 UTC

I can reproduce it on 4.15.7, which no longer needs the patches I was talking about, but basically it was this:

diff --git a/drivers/gpu/drm/amd/amdgpu/vi.c b/drivers/gpu/drm/amd/amdgpu/vi.c
index 3a4c2fa7e36d..252b1c0df5ae 100644
--- a/drivers/gpu/drm/amd/amdgpu/vi.c
+++ b/drivers/gpu/drm/amd/amdgpu/vi.c
@@ -449,14 +449,18 @@ static bool vi_read_bios_from_rom(struct amdgpu_device *adev,
 
 static void vi_detect_hw_virtualization(struct amdgpu_device *adev)
 {
-       uint32_t reg = RREG32(mmBIF_IOV_FUNC_IDENTIFIER);
-       /* bit0: 0 means pf and 1 means vf */
-       /* bit31: 0 means disable IOV and 1 means enable */
-       if (reg & 1)
-               adev->virt.caps |= AMDGPU_SRIOV_CAPS_IS_VF;
-
-       if (reg & 0x80000000)
-               adev->virt.caps |= AMDGPU_SRIOV_CAPS_ENABLE_IOV;
+       uint32_t reg = 0;
+
+       if (adev->asic_type == CHIP_TONGA ||
+           adev->asic_type == CHIP_FIJI) {
+              reg = RREG32(mmBIF_IOV_FUNC_IDENTIFIER);
+              /* bit0: 0 means pf and 1 means vf */
+              if (REG_GET_FIELD(reg, BIF_IOV_FUNC_IDENTIFIER, FUNC_IDENTIFIER))
+                      adev->virt.caps |= AMDGPU_SRIOV_CAPS_IS_VF;
+              /* bit31: 0 means disable IOV and 1 means enable */
+              if (REG_GET_FIELD(reg, BIF_IOV_FUNC_IDENTIFIER, IOV_ENABLE))
+                      adev->virt.caps |= AMDGPU_SRIOV_CAPS_ENABLE_IOV;
+       }
 
        if (reg == 0) {
                if (is_virtual_machine()) /* passthrough mode exclus sr-iov mode */

It fixes a problem where several GPUs were detected as virtualization-capable when they are not.
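
Editor's note: the logic of the patch can be modelled outside the kernel as a small sketch. Per the comments in the diff, bit 0 of the register marks a virtual function and bit 31 marks IOV as enabled, and the patched code reads the register only on Tonga and Fiji. The flag values and the read_reg callback below are assumptions made for illustration; they are not the kernel's actual constants or register accessors.

```python
# Illustrative flag values only; not taken from the kernel headers.
CAPS_IS_VF = 1 << 0
CAPS_ENABLE_IOV = 1 << 1

# ASICs on which the patched code reads the IOV identifier register.
IOV_CAPABLE_ASICS = ("CHIP_TONGA", "CHIP_FIJI")


def decode_iov_caps(reg):
    """Translate the raw 32-bit register value into capability flags."""
    caps = 0
    if reg & 0x00000001:   # bit0: 0 means PF, 1 means VF
        caps |= CAPS_IS_VF
    if reg & 0x80000000:   # bit31: 0 means IOV disabled, 1 means enabled
        caps |= CAPS_ENABLE_IOV
    return caps


def detect_virt_caps(asic_type, read_reg):
    """Mirror the patched flow: skip the register read on other ASICs."""
    reg = 0
    if asic_type in IOV_CAPABLE_ASICS:
        reg = read_reg()
    return decode_iov_caps(reg)
```

In this model a Topaz part never reports SR-IOV capabilities regardless of what the register would return, which matches the stated intent of the patch.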

Comment 6 Anthony Basile gentoo-dev 2018-03-07 07:39:03 UTC

This is not a problem with eudev but the kernel driver.

Comment 7 José Pekkarinen 2018-03-07 07:42:53 UTC

Yeah, I was thinking the same, as it has been years now since the kernel
started loading firmware directly itself and user-space firmware loaders became deprecated.

It has been reported to the kernel developers as well, but unfortunately there has been no response so far.

Comment 8 tt_1 2018-03-07 07:46:39 UTC

So as a fix you take the patch from #5 and apply it with -R to the kernel sources?

Comment 9 José Pekkarinen 2018-03-07 07:51:09 UTC

No, that patch does not fix this issue; it addresses a different one.

Upstream report just for completeness: https://bugs.freedesktop.org/show_bug.cgi?id=104854

Comment 10 Mike Pagano gentoo-dev 2018-06-18 18:27:06 UTC

This is supposed to be fixed with:

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/drivers/base/power/main.c?id=c62ec4610c40bcc44f2d3d5ed1c312737279e2f3

Which is in kernels >=4.17

Can you test with 4.17.2 or let us know if you are already running this kernel version and still experience the same issue?


Reference: https://bugzilla.kernel.org/show_bug.cgi?id=199693

Comment 11 tt_1 2018-06-19 06:41:16 UTC

Just did a quick and dirty test with gentoo-sources-4.17.2; there were no problems with the amdgpu module.

Comment 12 José Pekkarinen 2018-06-19 06:47:26 UTC

I'll test this later this morning. Please bear with me.

José.

Comment 13 José Pekkarinen 2018-06-19 09:07:59 UTC

I did a test with a vanilla kernel and it shows the backtraces again. I'm updating the upstream bug with the output.

Comment 14 Mike Pagano gentoo-dev 2018-06-19 10:36:51 UTC

(In reply to José Pekkarinen from comment #13)
> I did a test with a vanilla kernel and it shows the backtraces again. I'm
> updating the upstream bug with the output.

Thanks, José

Comment 15 tt_1 2018-06-19 20:38:44 UTC

Have you tried to not include the firmware files into the kernel but use external files, and compile amdgpu as a module instead? I have a rx 550, it didn't work for me with firmware built into the kernel and including amdgpu as well. Setting both to external works just fine.

Comment 16 José Pekkarinen 2018-06-20 07:25:22 UTC

(In reply to tt_1 from comment #15)
> Have you tried to not include the firmware files into the kernel but use
> external files, and compile amdgpu as a module instead? I have a rx 550, it
> didn't work for me with firmware built into the kernel and including amdgpu
> as well. Setting both to external works just fine.

I'm afraid amdgpu was always built as a module, and removing the topaz firmware
from CONFIG_EXTRA_FIRMWARE doesn't help. I do have another system with an
AMD APU + Fiji that works nicely, and some time ago I used a Polaris card like
yours that didn't show this kind of issue. Of course, those were discrete desktop GPUs.

Comment 17 Mike Pagano gentoo-dev 2018-07-23 19:42:29 UTC

We'll watch the upstream bug and backport any fixes identified.