Bug 829209 - gentoo-sources: power9 le: amdgpu: *ERROR* hw_init of IP block <psp> failed -22
Summary: gentoo-sources: power9 le: amdgpu: *ERROR* hw_init of IP block <psp> failed -22
Status: RESOLVED UPSTREAM
Alias: None
Product: Gentoo Linux
Classification: Unclassified
Component: Current packages
Hardware: PPC64 Linux
Importance: Normal normal
Assignee: Gentoo Kernel Bug Wranglers and Kernel Maintainers
URL: https://bugzilla.kernel.org/show_bug....
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2021-12-14 21:03 UTC by R030t1
Modified: 2022-03-31 11:33 UTC
0 users

See Also:
Package list:
Runtime testing required: ---


Attachments
.config (config_bb,124.58 KB, text/plain)
2021-12-15 02:47 UTC, R030t1
dmesg (dmesg_bb,78.40 KB, text/plain)
2021-12-15 02:49 UTC, R030t1

Description R030t1 2021-12-14 21:03:22 UTC
[   54.313046] [drm] Found VCN firmware Version ENC: 1.16 DEC: 2 VEP: 0 Revision: 1
[   54.313054] amdgpu 0000:03:00.0: amdgpu: Will use PSP to load VCN firmware
[   54.314153] amdgpu 0000:03:00.0: enabling bus mastering
[   54.570938] [drm:psp_hw_start [amdgpu]] *ERROR* PSP create ring failed!
[   54.571061] [drm:psp_hw_init [amdgpu]] *ERROR* PSP firmware loading failed
[   54.571175] [drm:amdgpu_device_fw_loading [amdgpu]] *ERROR* hw_init of IP block <psp> failed -22
[   54.571277] amdgpu 0000:03:00.0: amdgpu: amdgpu_device_ip_init failed
[   54.571279] amdgpu 0000:03:00.0: amdgpu: Fatal error during GPU init
[   54.571282] amdgpu 0000:03:00.0: amdgpu: amdgpu: finishing device.
[   54.572789] amdgpu: probe of 0000:03:00.0 failed with error -22

However, this may be related to BAR code in the kernel.
Previous regressions: https://gitlab.freedesktop.org/drm/amd/-/issues/1519
Kernel bug: https://bugzilla.kernel.org/show_bug.cgi?id=215285

Current .config seems to work with other AMD cards, like 6800XT.
(If not up when you see this please give it a second, have to get the system back up.)
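For triage, the failing PSP lines can be pulled out of a saved dmesg with a filter along these lines. This is only a sketch: the function name psp_errors and the file name in the usage comment are made up for illustration and are not part of any tool.

```shell
# Hypothetical helper: print the "*ERROR*" lines that mention the PSP
# from a saved dmesg file given as the first argument.
psp_errors() {
    # First keep only driver error lines, then narrow to PSP (case-insensitive,
    # so both "PSP create ring failed" and "<psp> failed -22" match).
    grep -E '\*ERROR\*' "$1" | grep -i 'psp'
}

# Usage (file name is an assumption):
#   psp_errors dmesg_bb.txt
```

The -22 in the last error line is the negated errno value EINVAL (invalid argument).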
Comment 1 Mike Pagano gentoo-dev 2021-12-14 21:10:09 UTC
What gentoo-sources version ?
Comment 2 R030t1 2021-12-15 02:46:53 UTC
5.15.6-gentoo

Linux 5.15.6-gentoo #3 SMP Tue Dec 7 17:35:12 EST 2021 ppc64le POWER9, altivec supported PowerNV C1P9S01 REV 1.01 GNU/Linux
Comment 3 R030t1 2021-12-15 02:47:39 UTC
Created attachment 759020 [details]
.config

Kernel configuration.
Comment 4 R030t1 2021-12-15 02:49:00 UTC
Created attachment 759021 [details]
dmesg

Full dmesg output.
Comment 5 Joshua McDonagh 2021-12-15 04:14:07 UTC
CONFIG_DRM_AMDGPU should be enabled in the kernel config, if I'm not mistaken
Comment 6 Joshua McDonagh 2021-12-15 04:24:11 UTC
reasoning for the comment above regarding CONFIG_DRM_AMDGPU:

[   69.802528] [drm:psp_hw_start [amdgpu]] *ERROR* PSP create ring failed!
[   69.802671] [drm:psp_hw_init [amdgpu]] *ERROR* PSP firmware loading failed
[   69.802784] [drm:amdgpu_device_fw_loading [amdgpu]] *ERROR* hw_init of IP block <psp> failed -22
Comment 7 Joshua McDonagh 2021-12-15 04:40:25 UTC
also: [   71.950763] platform regulatory.0: Direct firmware load for regulatory.db failed with error -2
Comment 8 Joshua McDonagh 2021-12-15 04:42:28 UTC
ignore previous comment
Comment 9 Joshua McDonagh 2021-12-15 04:54:36 UTC
BUG: 736994

"I hit the same issue. I think it is because you have CONFIG_DRM_AMDGPU=y, which probably requires CONFIG_AMD_IOMMU_V2=y, but this isn't enforced for some reason. I strongly suspect this is an upstream issue. (Both as modules also works.)"
Comment 10 R030t1 2021-12-15 16:36:22 UTC
CONFIG_DRM_AMDGPU is set to m, this should be sufficient. I'll check the flag you've mentioned to see if it works with AMDGPU as module.

I also tried with it built in and with the firmware files built into the kernel, but then the build itself fails. I tried the newest kernel as well and got the same error. I tried to fix it, but it doesn't seem to be the usual tab/space issue I see with this message.

# make -j32
  SYNC    include/config/auto.conf.cmd
  HOSTCC  scripts/kconfig/conf.o
  HOSTLD  scripts/kconfig/conf
*
* Restart config...
*
*
* Firmware loader
*
Firmware loading facility (FW_LOADER) [Y/?] y
  Build named firmware blobs into the kernel binary (EXTRA_FIRMWARE) [amdgpu/dimgrey_cavefish_ce.bin;amdgpu/dimgrey_cavefish_rlc.bin;amdgpu/dimgrey_cavefish_dmcub.bin;amdgpu/dimgrey_cavefish_sdma.bin;amdgpu/dimgrey_cavefish_me.bin;amdgpu/dimgrey_cavefish_smc.bin;amdgpu/dimgrey_cavefish_mec.bin;amdgpu/dimgrey_cavefish_sos.bin;amdgpu/dimgrey_cavefish_mec2.bin;amdgpu/dimgrey_cavefish_ta.bin;amdgpu/dimgrey_cavefish_pfp.bin;amdgpu/dimgrey_cavefish_vcn.bin] amdgpu/dimgrey_cavefish_ce.bin;amdgpu/dimgrey_cavefish_rlc.bin;amdgpu/dimgrey_cavefish_dmcub.bin;amdgpu/dimgrey_cavefish_sdma.bin;amdgpu/dimgrey_cavefish_me.bin;amdgpu/dimgrey_cavefish_smc.bin;amdgpu/dimgrey_cavefish_mec.bin;amdgpu/dimgrey_cavefish_sos.bin;amdgpu/dimgrey_cavefish_mec2.bin;amdgpu/dimgrey_cavefish_ta.bin;amdgpu/dimgrey_cavefish_pfp.bin;amdgpu/dimgrey_cavefish_vcn.bin
    Firmware blobs root directory (EXTRA_FIRMWARE_DIR) [/lib/firmware] (NEW) 
  Enable the firmware sysfs fallback mechanism (FW_LOADER_USER_HELPER) [N/y/?] n
  Enable compressed firmware support (FW_LOADER_COMPRESS) [Y/n/?] y
  CALL    scripts/atomic/check-atomics.sh
  CALL    scripts/checksyscalls.sh
  CHK     include/generated/compile.h
drivers/base/firmware_loader/builtin/Makefile:37: *** missing separator.  Stop.
make[3]: *** [scripts/Makefile.build:540: drivers/base/firmware_loader/builtin] Error 2
make[2]: *** [scripts/Makefile.build:540: drivers/base/firmware_loader] Error 2
make[2]: *** Waiting for unfinished jobs....
make[1]: *** [scripts/Makefile.build:540: drivers/base] Error 2
make[1]: *** Waiting for unfinished jobs....
  UPD     kernel/config_data
  GZIP    kernel/config_data.gz
  CC      kernel/configs.o
  AR      kernel/built-in.a
make: *** [Makefile:1868: drivers] Error 2
Comment 11 Mike Pagano gentoo-dev 2021-12-15 17:45:50 UTC
space-separated

Build named firmware blobs into the kernel binary (CONFIG_EXTRA_FIRMWARE): This option is a string and takes the (space-separated) names of firmware files to be built into the kernel.
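As a sketch of the fix, a semicolon-separated list like the one in comment 10 can be converted to the space-separated form before it goes into the config. The helper name to_space_separated is hypothetical, and the list here is shortened to three of the files from that comment.

```shell
# Hypothetical helper: turn a semicolon-separated firmware list into the
# space-separated form that CONFIG_EXTRA_FIRMWARE expects.
to_space_separated() {
    printf '%s\n' "$1" | tr ';' ' '
}

# Shortened list taken from the EXTRA_FIRMWARE prompt in comment 10.
fw='amdgpu/dimgrey_cavefish_ce.bin;amdgpu/dimgrey_cavefish_rlc.bin;amdgpu/dimgrey_cavefish_sos.bin'

# Print the line as it would appear in .config.
printf 'CONFIG_EXTRA_FIRMWARE="%s"\n' "$(to_space_separated "$fw")"
```

The paths stay relative to the firmware root directory (EXTRA_FIRMWARE_DIR, /lib/firmware by default in the prompt above).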
Comment 12 R030t1 2021-12-16 03:42:46 UTC
(In reply to Joshua McDonagh from comment #9)
> BUG: 736994
> 
> "I hit the same issue. I think it is because you have CONFIG_DRM_AMDGPU=y,
> which probably requires CONFIG_AMD_IOMMU_V2=y, but this isn't enforced for
> some reason. I strongly suspect this is an upstream issue. (Both as modules
> also works.)"

How did you resolve this?

This option is impossible to select. It is implied by a series of conditionals that ends with a dangling "&& X86_64" (with ARM || X86_64 || PPC64 [=y] appearing before that) giving false. I conclude this is in error and edit it out.

But there may be a hard dep on X86_64? I can not tell. Looking at CONFIG_AMD_IOMMU it is masked because HAVE_CMPXCHG_DOUBLE=n. CONFIG_ACPI is also unset and I could not track down what should set it. I edit both of these out.

Forging boldly ahead gives:
# make
  CALL    scripts/checksyscalls.sh
  CALL    scripts/atomic/check-atomics.sh
  CHK     include/generated/compile.h
  CC      drivers/iommu/amd/iommu.o
drivers/iommu/amd/iommu.c:34:10: fatal error: asm/irq_remapping.h: No such file or directory
   34 | #include <asm/irq_remapping.h>
      |          ^~~~~~~~~~~~~~~~~~~~~
compilation terminated.
make[3]: *** [scripts/Makefile.build:277: drivers/iommu/amd/iommu.o] Error 1
make[2]: *** [scripts/Makefile.build:540: drivers/iommu/amd] Error 2
make[1]: *** [scripts/Makefile.build:540: drivers/iommu] Error 2
make: *** [Makefile:1868: drivers] Error 2



Thanks for the correction, Mike, I kept looking for the separator in the docs but didn't find it.
Comment 13 R030t1 2021-12-16 03:48:03 UTC
The edits are to line 14 of drivers/iommu/amd/Kconfig and should be obvious.
Comment 14 Georgy Yakovlev archtester gentoo-dev 2021-12-16 05:49:20 UTC
I dumped some info on IRC, copying here for historic purposes:

R0b0t1: for your attempt to build fw into kernel I think you should not separate them by semicolons, just spaces.
R0b0t1: maybe try disabling CONFIG_DRM_AMD_SECURE_DISPLAY ?
"PSP create ring failed" comes from it
R0b0t1: ok I actually booted to 5.15.8 =)
I could not before.
now I'm trying to find what exactly I did to make it work.
Linux cerberus 5.15.8 #1 SMP Wed Dec 15 21:16:39 PST 2021 ppc64le POWER9, altivec supported PowerNV C1P9S01 REV 1.01 GNU/Linux
R0b0t1: https://gist.github.com/46de29bb2785f4df7797e8e76ea7cc2d my config for 5.15
I had to disable gcc-plugins (some hardening) due to build failure (need to send patch upstream, will figure out later)
writing from plasma session on blackbird with 
0000:03:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Navi 10 [Radeon RX 5600 OEM/5600 XT / 5700/5700 XT] (rev c4)
even works with iommu=nobypass
here are my kernel args
earlycon=hvc0 console=hvc0 console=tty0 iommu=nobypass pci=realloc crashkernel=2G-4G:320M,4G-32G:512M,32G-64G:1024M,64G-128G:2048M,128G-:4096M video=offb:off audit=1 quiet loglevel=3 rd.udev.log-priority=3 splash amdgpu.aspm=0
Comment 15 Georgy Yakovlev archtester gentoo-dev 2021-12-16 05:50:01 UTC
basically either try using my config, or at least disable CONFIG_DRM_AMD_SECURE_DISPLAY, that's the one that gives you PSP ring error
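To confirm how a symbol such as DRM_AMD_SECURE_DISPLAY ended up in a given .config, a small check like the following may help. It is a sketch only: the function name config_state is hypothetical, and it handles just the y/m/unset cases.

```shell
# Hypothetical helper: report how a Kconfig symbol is set in a .config file.
# Prints "y" or "m" when enabled, and "n" when the symbol is absent or
# listed as "# CONFIG_FOO is not set".
config_state() {
    sym=$1
    file=${2:-.config}
    val=$(sed -n "s/^CONFIG_${sym}=\([ym]\)\$/\1/p" "$file" | head -n 1)
    printf '%s\n' "${val:-n}"
}

# Usage (path is an assumption):
#   config_state DRM_AMD_SECURE_DISPLAY /usr/src/linux/.config
```

In the kernel source tree, scripts/config --disable DRM_AMD_SECURE_DISPLAY is one way to turn the option off before rebuilding.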
Comment 16 R030t1 2021-12-17 06:08:07 UTC
I used your config, modified so that I can compile dm-crypt into the kernel (this requires building in kernel keyrings). In that case, oddly, amdgpu is not loaded automatically. Probing it manually causes the ASPEED BMC output to lock up, although it is still valid and detected by the monitor. The initramfs was generated with --ramdisk-modules, so everything necessary should be there.

I tried with/without the secure video flags, both failed. I will see if I can avoid the crash to test that again.
Comment 17 Georgy Yakovlev archtester gentoo-dev 2021-12-17 07:17:26 UTC
--ramdisk-modules is genkernel arg, right?

genkernel is ... I don't like it, and it's known not to load amdgpu, so users usually build it in.

here's what it preloads:
https://gitweb.gentoo.org/proj/genkernel.git/tree/defaults/modules_load
rest it omits

try specifying
AMODULES_MISC="amdgpu" in genkernel.conf

^ I think this is the syntax to load custom module at boot.

or just use dracut =)

gentoo-kernel with savedconfig will use dracut.
https://wiki.gentoo.org/wiki/Project:Distribution_Kernel#Using_savedconfig

you can also use dracut with gentoo-sources with no problems, but you'll have to build kernel manually.


gentoo-kernel-bin (we provide ppc64le prebuilt kernels) is also an option if you are ok with prebuilt.
those are built on our power9 machines at OSUOSL.

question, do you use BOOTKERNFW on blackbird?
https://wiki.raptorcs.com/wiki/Add_GPU_Firmware_To_BOOTKERNFW
^ if you do, don't: it messes up graphics drivers.
basically the petitboot kernel is old, and once it initializes amdgpu, the real host kernel can't, and fails with really weird messages.
Comment 18 Georgy Yakovlev archtester gentoo-dev 2021-12-17 07:28:40 UTC
maybe even AMODULES_MISC="amdgpu drm drm_ttm_helper ttm gpu_sched i2c_algo_bit drm_kms_helper"

just in case.
those are all modules amdgpu uses on my bb.
Comment 19 Georgy Yakovlev archtester gentoo-dev 2021-12-17 07:30:53 UTC
and sometimes genkernel does not copy modules to the initrd.

also try adding --all-ramdisk-modules, it will bloat the initrd but the modules will be there 100%
Comment 20 Georgy Yakovlev archtester gentoo-dev 2022-01-21 10:06:47 UTC
is it still relevant?
there's really not much we can do in the bug I think.

I use 5.15.x kernel nowadays on power9 with navi, vega64, and polaris cards and it works. so do multiple people I talk to daily. I'd know if there was a recent amdgpu kernel bug.

just don't forget amdgpu.aspm=0 parameter, it's still needed.

otherwise looks like configuration issue or very specific hardware quirk.
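A quick way to verify that the parameter actually reached the running kernel is to look for it in /proc/cmdline. This is a minimal sketch; the helper name has_karg is hypothetical.

```shell
# Hypothetical helper: succeed if the kernel command line in $2 contains
# the exact argument $1 as a whole space-separated word.
has_karg() {
    case " $2 " in
        *" $1 "*) return 0 ;;
        *) return 1 ;;
    esac
}

# Check the running kernel; /proc/cmdline holds the boot arguments.
if has_karg amdgpu.aspm=0 "$(cat /proc/cmdline 2>/dev/null)"; then
    echo "amdgpu.aspm=0 is set"
else
    echo "amdgpu.aspm=0 is missing"
fi
```

Matching whole words avoids a false positive on a value such as amdgpu.aspm=01.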
Comment 21 Alice Ferrazzi Gentoo Infrastructure gentoo-dev 2022-01-28 03:48:44 UTC
Closing after 2022/02/04 if there are no more relevant comments.
Comment 22 R030t1 2022-01-28 21:44:00 UTC
Hi guys, I want to make sure you know I've appreciated the comments. I've run through most of them but still have problems with the firmware loading. It just barfs in the way outlined in the original post despite doing everything right. I had to table it for a week or so but I'll start trying the most recent kernels and give a second look to the commands I'm running.

Funnily enough I had similar issues with a 5700 when they were first released. I think I just need to wait a little bit, I returned the 5700 at the time and went back to console mode. Should have just waited :)
Comment 23 Alice Ferrazzi Gentoo Infrastructure gentoo-dev 2022-02-11 10:27:05 UTC
@R030t1 thanks, please keep us updated with the results so that we can deal with this issue.
Comment 24 Mike Pagano gentoo-dev 2022-03-10 18:45:17 UTC
Any news on progress here ?
Comment 25 Mike Pagano gentoo-dev 2022-03-31 11:33:28 UTC
(In reply to Mike Pagano from comment #24)
> Any news on progress here ?

We'll keep an eye on the upstream bug and backport any fixes identified.