Created attachment 520946: udevadm output captured during modprobe amdgpu.

Hi,

I'm having trouble with firmware loading on my new laptop, a Lenovo ideapad 510S-14IKB (Kaby Lake + Topaz). The only way to make it work is to disable the amdgpu driver: whenever it is loaded, modprobe ends up killed and the following backtrace appears in /var/log/messages:

Feb 24 15:41:48 bee kernel: [ 1689.500738] amdgpu: [powerplay] smu not running, upload firmware again
Feb 24 15:41:54 bee kernel: [ 1689.502041] BUG: unable to handle kernel paging request at ffffc91c008d0fec
Feb 24 15:41:54 bee kernel: [ 1689.502089] IP: smu7_populate_single_firmware_entry.isra.5+0x68/0xc0 [amdgpu]
Feb 24 15:41:54 bee kernel: [ 1689.502091] PGD 35d10e067 P4D 35d10e067 PUD 0
Feb 24 15:41:54 bee kernel: [ 1689.502093] Oops: 0002 [#1] SMP PTI
Feb 24 15:41:54 bee kernel: [ 1689.502096] Dumping ftrace buffer:
Feb 24 15:41:54 bee kernel: [ 1689.502098] (ftrace buffer empty)
Feb 24 15:41:54 bee kernel: [ 1689.502099] Modules linked in: amdgpu(+) chash ttm ctr ccm af_packet ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter ip_tables nf_conntrack_ipv6 nf_defrag_ipv6 nft_ct nf_tables_ipv6 nft_masq_ipv4 nf_nat_masquerade_ipv4 nft_masq nft_meta nft_chain_nat_ipv4 nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack nf_tables_ipv4 nf_tables nfnetlink tun bridge stp llc zram zsmalloc binfmt_misc vfio_pci vfio_virqfd udl loop bfq uvcvideo videobuf2_vmalloc videobuf2_memops videobuf2_v4l2 videobuf2_core videodev media snd_hda_codec_hdmi snd_hda_codec_realtek snd_hda_codec_generic btusb btrtl btbcm btintel bluetooth x86_pkg_temp_thermal ecdh_generic intel_powerclamp kvmgt coretemp arc4 vfio_mdev mdev vfio_iommu_type1 vfio i915 iwlmvm kvm_intel kvm mac80211 joydev mousedev rtsx_pci_ms
Feb 24 15:41:54 bee kernel: [ 1689.502129] irqbypass rtsx_pci_sdmmc mmc_core memstick crc32c_intel snd_hda_intel snd_hda_codec iwlwifi i2c_algo_bit ghash_clmulni_intel snd_hwdep intel_cstate wmi_bmof drm_kms_helper efi_pstore snd_hda_core drm intel_uncore cfg80211 intel_rapl_perf psmouse ideapad_laptop rtsx_pci efivars evdev input_leds snd_pcm mfd_core serio_raw sparse_keymap wmi syscopyarea rfkill video snd_timer snd sysfillrect sysimgblt fb_sys_fops ac battery thermal backlight fan i2c_i801 tpm_crb acpi_pad button soundcore efivarfs unix dm_zero dm_thin_pool dm_persistent_data dm_bio_prison dm_service_time dm_round_robin dm_queue_length dm_multipath dm_log_userspace cn dm_flakey dm_delay xts aesni_intel crypto_simd cryptd glue_helper aes_x86_64 cbc sha256_generic scsi_transport_iscsi r8169 mii fuse xfs nfs lockd grace sunrpc
Feb 24 15:41:54 bee kernel: [ 1689.502161] fscache jfs reiserfs ext4 mbcache jbd2 fscrypto multipath linear raid10 raid1 raid0 dm_raid raid456 md_mod async_raid6_recov async_memcpy async_pq async_xor xor async_tx raid6_pq libcrc32c dm_snapshot dm_bufio dm_crypt dm_mirror dm_region_hash dm_log dm_mod dax hid_generic usbhid xhci_pci xhci_hcd ohci_hcd uhci_hcd usb_storage ehci_pci ehci_hcd usbcore usb_common scsi_transport_fc sr_mod cdrom sg sd_mod ata_piix ahci libahci sata_sx4 pata_oldpiix
Feb 24 15:41:54 bee kernel: [ 1689.502184] CPU: 0 PID: 20686 Comm: modprobe Not tainted 4.15.4+ #1
Feb 24 15:41:54 bee kernel: [ 1689.502186] Hardware name: LENOVO 80UV/Lenovo ideapad 510S-14IKB, BIOS 2SCN26WW(V2.06) 07/12/2017
Feb 24 15:41:54 bee kernel: [ 1689.502226] RIP: 0010:smu7_populate_single_firmware_entry.isra.5+0x68/0xc0 [amdgpu]
Feb 24 15:41:54 bee kernel: [ 1689.502228] RSP: 0018:ffffc90002eeba38 EFLAGS: 00010246
Feb 24 15:41:54 bee kernel: [ 1689.502229] RAX: 000000000000007e RBX: 0000000000000003 RCX: 000001000f53d000
Feb 24 15:41:54 bee kernel: [ 1689.502230] RDX: 0000000000000004 RSI: 0000000000000000 RDI: ffff8801f780b020
Feb 24 15:41:54 bee kernel: [ 1689.502231] RBP: ffffc91c008d0fec R08: 0000000000000001 R09: 00000000000004d4
Feb 24 15:41:54 bee kernel: [ 1689.502232] R10: 0000000000000000 R11: 0000000000000000 R12: ffff8801fc972010
Feb 24 15:41:54 bee kernel: [ 1689.502233] R13: ffff88027adba000 R14: 00000000000005fe R15: ffff8801edb90000
Feb 24 15:41:54 bee kernel: [ 1689.502234] FS: 00007f4bd2284700(0000) GS:ffff88046ec00000(0000) knlGS:0000000000000000
Feb 24 15:41:54 bee kernel: [ 1689.502235] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Feb 24 15:41:54 bee kernel: [ 1689.502236] CR2: ffffc91c008d0fec CR3: 000000035a3b8001 CR4: 00000000003606f0
Feb 24 15:41:54 bee kernel: [ 1689.502237] Call Trace:
Feb 24 15:41:54 bee kernel: [ 1689.502273] smu7_request_smu_load_fw+0x97/0x310 [amdgpu]
Feb 24 15:41:54 bee kernel: [ 1689.502310] pp_hw_init+0x48/0xc0 [amdgpu]
Feb 24 15:41:54 bee kernel: [ 1689.502338] amdgpu_device_init+0xde9/0x1610 [amdgpu]
Feb 24 15:41:54 bee kernel: [ 1689.502342] ? cache_alloc_debugcheck_after.isra.22+0x195/0x1e0
Feb 24 15:41:54 bee kernel: [ 1689.502344] ? kmem_cache_alloc_trace+0x1f6/0x230
Feb 24 15:41:54 bee kernel: [ 1689.502371] ? amdgpu_driver_load_kms+0x25/0x240 [amdgpu]
Feb 24 15:41:54 bee kernel: [ 1689.502397] amdgpu_driver_load_kms+0x58/0x240 [amdgpu]
Feb 24 15:41:54 bee kernel: [ 1689.502408] drm_dev_register+0x12f/0x1c0 [drm]
Feb 24 15:41:54 bee kernel: [ 1689.502435] amdgpu_pci_probe+0x10f/0x140 [amdgpu]
Feb 24 15:41:54 bee kernel: [ 1689.502438] pci_device_probe+0xc8/0x140
Feb 24 15:41:54 bee kernel: [ 1689.502441] driver_probe_device+0x2a8/0x490
Feb 24 15:41:54 bee kernel: [ 1689.502443] __driver_attach+0xda/0xe0
Feb 24 15:41:54 bee kernel: [ 1689.502444] ? driver_probe_device+0x490/0x490
Feb 24 15:41:54 bee kernel: [ 1689.502446] bus_for_each_dev+0x5a/0x90
Feb 24 15:41:54 bee kernel: [ 1689.502448] bus_add_driver+0x16a/0x260
Feb 24 15:41:54 bee kernel: [ 1689.502450] driver_register+0x57/0xc0
Feb 24 15:41:54 bee kernel: [ 1689.502452] ? 0xffffffffa1467000
Feb 24 15:41:54 bee kernel: [ 1689.502453] do_one_initcall+0x4e/0x190
Feb 24 15:41:54 bee kernel: [ 1689.502455] ? kmem_cache_alloc_trace+0x1f6/0x230
Feb 24 15:41:54 bee kernel: [ 1689.502457] ? do_init_module+0x22/0x20b
Feb 24 15:41:54 bee kernel: [ 1689.502459] do_init_module+0x5b/0x20b
Feb 24 15:41:54 bee kernel: [ 1689.502461] load_module+0x1511/0x1740
Feb 24 15:41:54 bee kernel: [ 1689.502464] ? SyS_finit_module+0xaa/0xe0
Feb 24 15:41:54 bee kernel: [ 1689.502465] SyS_finit_module+0xaa/0xe0
Feb 24 15:41:54 bee kernel: [ 1689.502468] do_syscall_64+0x6e/0x190
Feb 24 15:41:54 bee kernel: [ 1689.502471] entry_SYSCALL_64_after_hwframe+0x26/0x9b
Feb 24 15:41:54 bee kernel: [ 1689.502472] RIP: 0033:0x7f4bd1be68c9
Feb 24 15:41:54 bee kernel: [ 1689.502474] RSP: 002b:00007fff18f47a48 EFLAGS: 00000246 ORIG_RAX: 0000000000000139
Feb 24 15:41:54 bee kernel: [ 1689.502475] RAX: ffffffffffffffda RBX: 00005594a4eb6540 RCX: 00007f4bd1be68c9
Feb 24 15:41:54 bee kernel: [ 1689.502476] RDX: 0000000000000000 RSI: 00005594a4be2e2a RDI: 0000000000000005
Feb 24 15:41:54 bee kernel: [ 1689.502477] RBP: 00005594a4be2e2a R08: 0000000000000000 R09: 0000000000000000
Feb 24 15:41:54 bee kernel: [ 1689.502478] R10: 0000000000000005 R11: 0000000000000246 R12: 0000000000000000
Feb 24 15:41:54 bee kernel: [ 1689.502479] R13: 00005594a4eb7e70 R14: 0000000000040000 R15: 0000000000000000
Feb 24 15:41:54 bee kernel: [ 1689.502481] Code: b6 b1 40 5c 36 a1 ff d0 85 c0 74 1a 83 eb 06 31 c0 83 e3 fb 0f 94 c0 66 89 45 18 48 83 c4 30 31 c0 5b 5d 41 5c c3 0f b7 44 24 02 <66> 89 5d 00 c7 45 0c 00 00 00 00 c7 45 10 00 00 00 00 66 89 45
Feb 24 15:41:54 bee kernel: [ 1689.502538] RIP: smu7_populate_single_firmware_entry.isra.5+0x68/0xc0 [amdgpu] RSP: ffffc90002eeba38
Feb 24 15:41:54 bee kernel: [ 1689.502539] CR2: ffffc91c008d0fec
Feb 24 15:41:54 bee kernel: [ 1689.502540] ---[ end trace 8684bbc443b97149 ]---

The trace points to the following lines in the smu7_request_smu_load_fw function in drivers/gpu/drm/amd/powerplay/smumgr/smu7_smumgr.c:

...
	PP_ASSERT_WITH_CODE(0 == smu7_populate_single_firmware_entry(hwmgr,
			UCODE_ID_RLC_G, &toc->entry[toc->num_entries++]),
			"Failed to Get Firmware Entry.", return -EINVAL);
	PP_ASSERT_WITH_CODE(0 == smu7_populate_single_firmware_entry(hwmgr,
			UCODE_ID_CP_CE, &toc->entry[toc->num_entries++]),
			"Failed to Get Firmware Entry.", return -EINVAL);
	PP_ASSERT_WITH_CODE(0 == smu7_populate_single_firmware_entry(hwmgr,
			UCODE_ID_CP_PFP, &toc->entry[toc->num_entries++]),
			"Failed to Get Firmware Entry.", return -EINVAL);
...

I have attached the output of udevadm run during the modprobe, to show whether udev sees the event.

The kernel is a vanilla 4.15.4 with some bug fixes provided by the AMD DRM team. The firmware-related config is:

CONFIG_PREVENT_FIRMWARE_BUILD=y
CONFIG_FW_LOADER=y
CONFIG_FIRMWARE_IN_KERNEL=y
CONFIG_EXTRA_FIRMWARE="i915/kbl_dmc_ver1_04.bin i915/kbl_guc_ver9_39.bin i915/kbl_huc_ver02_00_1810.bin amdgpu/topaz_ce.bin amdgpu/topaz_k_smc.bin amdgpu/topaz_mc.bin amdgpu/topaz_me.bin amdgpu/topaz_mec2.bin amdgpu/topaz_mec.bin amdgpu/topaz_pfp.bin amdgpu/topaz_rlc.bin amdgpu/topaz_sdma1.bin amdgpu/topaz_sdma.bin amdgpu/topaz_smc.bin"
CONFIG_EXTRA_FIRMWARE_DIR="/lib/firmware"
# CONFIG_FW_LOADER_USER_HELPER_FALLBACK is not set

Any help or suggestions appreciated.

Thanks!

José.
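For readers following the trace, here is a minimal sketch of how the "Oops: 0002" value can be read, assuming the standard x86 page-fault error-code bit layout. The function name decode_oops is made up for this illustration, and the two addresses are copied from the CR2 and RBP lines of the log above.

```python
# x86 page-fault error-code bits, as printed after "Oops:" in the trace.
PF_PROT = 1 << 0   # 0: page not present, 1: protection violation
PF_WRITE = 1 << 1  # 0: read access,      1: write access
PF_USER = 1 << 2   # 0: kernel mode,      1: user mode
PF_RSVD = 1 << 3   # reserved bit set in a page-table entry
PF_INSTR = 1 << 4  # fault happened on an instruction fetch


def decode_oops(code_hex):
    """Decode the hex error code from an 'Oops: XXXX' line (hypothetical helper)."""
    code = int(code_hex, 16)
    return {
        "page_present": bool(code & PF_PROT),
        "write": bool(code & PF_WRITE),
        "user_mode": bool(code & PF_USER),
        "reserved_bit": bool(code & PF_RSVD),
        "instruction_fetch": bool(code & PF_INSTR),
    }


info = decode_oops("0002")

# Values copied from the log: the faulting address (CR2) and RBP.
cr2 = int("ffffc91c008d0fec", 16)
rbp = int("ffffc91c008d0fec", 16)
fault_address_is_rbp = cr2 == rbp
```

Read this way, the oops looks like a kernel-mode write to a page that is not mapped, at the address held in RBP. That would be consistent with the entry pointer passed into smu7_populate_single_firmware_entry pointing at unmapped memory, but that reading is an inference from the log, not a confirmed diagnosis.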
I forgot to add that the firmware is in place:

# ls -la /lib/firmware/amdgpu/topaz*
-rw-r--r-- 1 root root   8832 Jan 7 10:52 /lib/firmware/amdgpu/topaz_ce.bin
-rw-r--r-- 1 root root  80544 Jan 7 10:52 /lib/firmware/amdgpu/topaz_k_smc.bin
-rw-r--r-- 1 root root  32100 Jan 7 10:52 /lib/firmware/amdgpu/topaz_mc.bin
-rw-r--r-- 1 root root  17024 Jan 7 10:52 /lib/firmware/amdgpu/topaz_me.bin
-rw-r--r-- 1 root root 262784 Jan 7 10:52 /lib/firmware/amdgpu/topaz_mec2.bin
-rw-r--r-- 1 root root 262784 Jan 7 10:52 /lib/firmware/amdgpu/topaz_mec.bin
-rw-r--r-- 1 root root  17024 Jan 7 10:52 /lib/firmware/amdgpu/topaz_pfp.bin
-rw-r--r-- 1 root root   8448 Jan 7 10:52 /lib/firmware/amdgpu/topaz_rlc.bin
-rw-r--r-- 1 root root   8576 Jan 7 10:52 /lib/firmware/amdgpu/topaz_sdma1.bin
-rw-r--r-- 1 root root   8576 Jan 7 10:52 /lib/firmware/amdgpu/topaz_sdma.bin
-rw-r--r-- 1 root root  80544 Jan 7 10:52 /lib/firmware/amdgpu/topaz_smc.bin

Thanks!

José.
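The same check can be done against the kernel config instead of by eye. The following is only a sketch: parse_kconfig and missing_firmware are hypothetical helper names, the "/lib/firmware" fallback is an assumption for when CONFIG_EXTRA_FIRMWARE_DIR is absent, and the exists parameter is there so the logic can be tried without touching the real filesystem.

```python
import os


def parse_kconfig(text):
    """Parse KEY=value lines from a kernel .config, skipping comments."""
    opts = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        opts[key] = value.strip().strip('"')
    return opts


def missing_firmware(config_text, exists=os.path.exists):
    """Return the CONFIG_EXTRA_FIRMWARE blobs not found under the firmware dir."""
    opts = parse_kconfig(config_text)
    # Assumed fallback directory when the option is not set in the config.
    fw_dir = opts.get("CONFIG_EXTRA_FIRMWARE_DIR", "/lib/firmware")
    blobs = opts.get("CONFIG_EXTRA_FIRMWARE", "").split()
    paths = [os.path.join(fw_dir, blob) for blob in blobs]
    return [path for path in paths if not exists(path)]


# Shortened example config; the real one lists more blobs.
example_config = '''
CONFIG_FW_LOADER=y
CONFIG_EXTRA_FIRMWARE="amdgpu/topaz_ce.bin amdgpu/topaz_smc.bin"
CONFIG_EXTRA_FIRMWARE_DIR="/lib/firmware"
# CONFIG_FW_LOADER_USER_HELPER_FALLBACK is not set
'''
```

On a real system one would pass the contents of the kernel's .config and keep the default exists; an empty result means every listed blob is present on disk. It says nothing about whether the driver can actually load them.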
Could you please attach your emerge --info, the kernel config file, and the additional patches you mentioned?

I ran into this as well, but I'm not entirely sure it is the same bug, because I'm using eudev on an experimental musl-amd64 profile. Apart from that, it is the same problem: modprobe amdgpu kills the system. I don't have physical access to the machine at the moment; I will attach my emerge --info and kernel config later on.
Created attachment 522642: emerge --info
Created attachment 522644: kernel config.
I can reproduce it on 4.15.7, which no longer needs the patches I mentioned, but basically it was this:

diff --git a/drivers/gpu/drm/amd/amdgpu/vi.c b/drivers/gpu/drm/amd/amdgpu/vi.c
index 3a4c2fa7e36d..252b1c0df5ae 100644
--- a/drivers/gpu/drm/amd/amdgpu/vi.c
+++ b/drivers/gpu/drm/amd/amdgpu/vi.c
@@ -449,14 +449,18 @@ static bool vi_read_bios_from_rom(struct amdgpu_device *adev,
 
 static void vi_detect_hw_virtualization(struct amdgpu_device *adev)
 {
-	uint32_t reg = RREG32(mmBIF_IOV_FUNC_IDENTIFIER);
-	/* bit0: 0 means pf and 1 means vf */
-	/* bit31: 0 means disable IOV and 1 means enable */
-	if (reg & 1)
-		adev->virt.caps |= AMDGPU_SRIOV_CAPS_IS_VF;
-
-	if (reg & 0x80000000)
-		adev->virt.caps |= AMDGPU_SRIOV_CAPS_ENABLE_IOV;
+	uint32_t reg = 0;
+
+	if (adev->asic_type == CHIP_TONGA ||
+	    adev->asic_type == CHIP_FIJI) {
+		reg = RREG32(mmBIF_IOV_FUNC_IDENTIFIER);
+		/* bit0: 0 means pf and 1 means vf */
+		if (REG_GET_FIELD(reg, BIF_IOV_FUNC_IDENTIFIER, FUNC_IDENTIFIER))
+			adev->virt.caps |= AMDGPU_SRIOV_CAPS_IS_VF;
+		/* bit31: 0 means disable IOV and 1 means enable */
+		if (REG_GET_FIELD(reg, BIF_IOV_FUNC_IDENTIFIER, IOV_ENABLE))
+			adev->virt.caps |= AMDGPU_SRIOV_CAPS_ENABLE_IOV;
+	}
 
 	if (reg == 0) {
 		if (is_virtual_machine()) /* passthrough mode exclus sr-iov mode */

It fixes a problem where several GPUs were detected as virtualization-capable when they are not.
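To make the effect of that patch easier to see, here is a small model of the patched logic. It is an illustration only: the bit positions are taken from the comments in the diff (bit0 and bit31), not from the register headers, the ASIC names are plain strings standing in for the kernel enum, and detect_caps with its read_reg callback is a hypothetical stand-in for the driver function and RREG32.

```python
FUNC_IDENTIFIER_MASK = 1 << 0  # bit0 per the patch comment: 1 means VF
IOV_ENABLE_MASK = 1 << 31      # bit31 per the patch comment: 1 means IOV enabled

# ASICs for which the patched code reads the register at all.
SRIOV_CAPABLE_ASICS = ("CHIP_TONGA", "CHIP_FIJI")


def decode_iov_func_identifier(reg):
    """Split the register value into the two flags the driver cares about."""
    return {
        "is_vf": bool(reg & FUNC_IDENTIFIER_MASK),
        "iov_enabled": bool(reg & IOV_ENABLE_MASK),
    }


def detect_caps(asic_type, read_reg):
    """Model of the patched flow: other ASICs never touch the register."""
    reg = read_reg() if asic_type in SRIOV_CAPABLE_ASICS else 0
    return decode_iov_func_identifier(reg)


# A Topaz part whose register happens to read back all ones is no longer
# reported as a virtual function with IOV enabled.
topaz_caps = detect_caps("CHIP_TOPAZ", lambda: 0xFFFFFFFF)
fiji_caps = detect_caps("CHIP_FIJI", lambda: 0x80000001)
```

The point of the model is the guard: before the patch, whatever the register returned on any ASIC was interpreted as SR-IOV capability bits.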
This is not a problem with eudev but the kernel driver.
Yeah, I was thinking the same. I believe it has been years now since the kernel started loading firmware itself first and user-space firmware loaders became deprecated. It has been reported to the kernel upstream as well, with no response so far, unfortunately.
So as a fix you take the patch from #5 and apply it with -R to the kernel sources?
No, that patch does not fix this issue; it fixes a different one.

Upstream report, just for completeness: https://bugs.freedesktop.org/show_bug.cgi?id=104854
This is supposed to be fixed by:

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/drivers/base/power/main.c?id=c62ec4610c40bcc44f2d3d5ed1c312737279e2f3

which is in kernels >= 4.17.

Can you test with 4.17.2, or let us know if you are already running this kernel version and still experience the same issue?

Reference: https://bugzilla.kernel.org/show_bug.cgi?id=199693
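A quick way to tell whether a given kernel release string falls in the ">= 4.17" range is sketched below. kernel_at_least is a hypothetical helper, and the sample release strings are examples in the usual uname -r style, not output taken from the reporter's machine.

```python
import re


def kernel_at_least(release, minimum=(4, 17)):
    """Compare the major.minor part of a kernel release string to a minimum."""
    match = re.match(r"(\d+)\.(\d+)", release)
    if match is None:
        raise ValueError("unrecognised kernel release: %r" % release)
    major, minor = int(match.group(1)), int(match.group(2))
    # Tuple comparison orders by major first, then minor.
    return (major, minor) >= minimum


# On a live system: kernel_at_least(platform.release()) after "import platform".
checks = {
    "4.15.4+": kernel_at_least("4.15.4+"),
    "4.15.7": kernel_at_least("4.15.7"),
    "4.17.2-gentoo": kernel_at_least("4.17.2-gentoo"),
}
```

This only compares version numbers; distribution kernels may carry backports, so an older version string does not prove the commit is missing.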
Just did a quick and dirty test with gentoo-sources-4.17.2, there were no problems with the amdgpu module.
I'll test this later this morning. Please bear with me.

José.
I did a test with a vanilla kernel and it shows the backtraces again. I'm updating the upstream bug with the output.
(In reply to José Pekkarinen from comment #13)
> I did a test with vanilla kernel and it shows the backtraces again. I'm
> updating the upstream bug with the output.

Thanks, José
Have you tried not including the firmware files in the kernel, using external files instead, and compiling amdgpu as a module? I have an RX 550, and it didn't work for me with both the firmware and amdgpu built into the kernel. Setting both to external works just fine.
(In reply to tt_1 from comment #15)
> Have you tried to not include the firmware files into the kernel but use
> external files, and compile amdgpu as a module instead? I have a rx 550, it
> didn't work for me with firmware built into the kernel and including amdgpu
> as well. Setting both to external works just fine.

I'm afraid amdgpu was always a module, and removing the topaz firmware from CONFIG_EXTRA_FIRMWARE doesn't help. I do have another system with an AMD APU + Fiji that works nicely, and some time ago I used a Polaris like yours that didn't show this kind of issue. Of course, those were discrete desktop GPUs.
We'll watch the upstream bug and backport any fixes identified.