Bug 692454 - dev-libs/rocm-opencl-runtime-2.6.0-r1 fails to find any devices
Summary: dev-libs/rocm-opencl-runtime-2.6.0-r1 fails to find any devices
Status: RESOLVED OBSOLETE
Alias: None
Product: Gentoo Linux
Classification: Unclassified
Component: Current packages
Hardware: AMD64 Linux
Importance: Normal normal
Assignee: Craig Andrews
URL: https://github.com/RadeonOpenCompute/...
Whiteboard:
Keywords: PATCH
Depends on:
Blocks:
 
Reported: 2019-08-18 16:20 UTC by ernsteiswuerfel
Modified: 2020-12-19 18:39 UTC
CC List: 4 users

See Also:
Package list:
Runtime testing required: ---


Attachments
clinfo out (clinfo.out,16.35 KB, text/plain)
2019-09-02 21:38 UTC, Hubert Kowalski
kernel .config (5.2.13) (config_5213_opt,99.95 KB, text/plain)
2019-09-08 21:38 UTC, ernsteiswuerfel

Description ernsteiswuerfel archtester 2019-08-18 16:20:14 UTC
First of all, thanks that dev-libs/rocm-opencl-runtime finally found its way into the Gentoo tree!

However, dev-libs/rocm-opencl-runtime-2.6.0-r1::gentoo does not find my card, whereas dev-libs/rocm-opencl-runtime-2.6.0::rocm did, running on kernel 5.2.8. I removed the ::rocm versions and emerged the ::gentoo version. I did notice that it has different dependencies, however.

# clinfo 
Number of platforms                               1
  Platform Name                                   AMD Accelerated Parallel Processing
  Platform Vendor                                 Advanced Micro Devices, Inc.
  Platform Version                                OpenCL 2.0 AMD-APP.internal (2924.0)
  Platform Profile                                FULL_PROFILE
  Platform Extensions                             cl_khr_icd cl_amd_object_metadata cl_amd_event_callback 
  Platform Max metadata object keys (AMD)         8
  Platform Extensions function suffix             AMD

  Platform Name                                   AMD Accelerated Parallel Processing
Number of devices                                 0

NULL platform behavior
  clGetPlatformInfo(NULL, CL_PLATFORM_NAME, ...)  AMD Accelerated Parallel Processing
  clGetDeviceIDs(NULL, CL_DEVICE_TYPE_ALL, ...)   
  clCreateContext(NULL, ...) [default]            No devices found in platform
  clCreateContextFromType(NULL, CL_DEVICE_TYPE_DEFAULT)  No devices found in platform
  clCreateContextFromType(NULL, CL_DEVICE_TYPE_CPU)  No devices found in platform
  clCreateContextFromType(NULL, CL_DEVICE_TYPE_GPU)  No devices found in platform
  clCreateContextFromType(NULL, CL_DEVICE_TYPE_ACCELERATOR)  No devices found in platform
  clCreateContextFromType(NULL, CL_DEVICE_TYPE_CUSTOM)  No devices found in platform
  clCreateContextFromType(NULL, CL_DEVICE_TYPE_ALL)  No devices found in platform

ICD loader properties
  ICD loader Name                                 OpenCL ICD Loader
  ICD loader Vendor                               OCL Icd free software
  ICD loader Version                              2.2.12
  ICD loader Profile                              OpenCL 2.2

# rocminfo 
ROCm initialization failed
hsa api call failure at: /var/tmp/portage/dev-util/rocminfo-2.6.0/work/rocminfo-roc-2.6.0/rocminfo.cc:1068
Call returned HSA_STATUS_ERROR_OUT_OF_RESOURCES: The runtime failed to allocate the necessary resources. This error may also occur when the core runtime library needs to spawn threads or create internal OS-specific events.

I understand that rocm-opencl-runtime-2.6.0-r1::gentoo no longer installs the 'rocm' OpenCL provider that could be chosen with eselect opencl, and uses ocl-icd now. But the transition from the overlay is not without glitches, it seems...
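
The symptom in the clinfo output above is the "Number of devices 0" line under the AMD platform. As a minimal sketch for spotting this automatically, the following Python extracts every device count from clinfo's human-readable output. The function name device_counts is hypothetical, and the regular expression assumes the two-column layout pasted above; none of this is part of clinfo or ROCm.

```python
import re
import shutil
import subprocess


def device_counts(clinfo_text):
    """Return the value of every 'Number of devices' line, in order of appearance."""
    # Assumes the "label, spaces, value" layout of clinfo's default output.
    pattern = r"^[ \t]*Number of devices[ \t]+(\d+)[ \t]*$"
    return [int(n) for n in re.findall(pattern, clinfo_text, flags=re.MULTILINE)]


if __name__ == "__main__":
    # Only call clinfo if it is actually installed on this machine.
    if shutil.which("clinfo") is None:
        print("clinfo not found in PATH")
    else:
        result = subprocess.run(["clinfo"], capture_output=True, text=True, check=False)
        counts = device_counts(result.stdout)
        print("devices per platform:", counts)
        if not counts or sum(counts) == 0:
            print("no OpenCL devices were reported")
```

Fed the output from this report, device_counts would return a single zero, which matches the "No devices found in platform" messages that follow.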
Comment 1 Jeroen Roovers (RETIRED) gentoo-dev 2019-08-22 08:16:46 UTC
(In reply to ernsteiswuerfel from comment #0)
> First of all, thanks that dev-libs/rocm-opencl-runtime finally found its
> way into the Gentoo tree!
> 
> But I have the issue of dev-libs/rocm-opencl-runtime-2.6.0-r1::gentoo not
> finding my card, whereas dev-libs/rocm-opencl-runtime-2.6.0::rocm did,

I cannot find that "rocm" overlay on https://overlays.gentoo.org/, so to help everyone else find it, please mention the overlay's URL in this bug report. The [URL] field should be a good place.
Comment 2 Craig Andrews gentoo-dev 2019-08-22 13:29:45 UTC
Gentoo doesn't support that overlay. If you can reproduce this issue using only packages from Gentoo (I cannot), then I can help, but with a mix of Gentoo and overlay packages I'm sorry to say that you'll need to get help elsewhere.
Comment 3 Hubert Kowalski 2019-09-02 21:37:42 UTC
Hi,

I have a similar problem to ernsteiswuerfel's: my system should theoretically be compatible with OpenCL on an AMD GPU, but the rocminfo output is:

==/
$ rocminfo 
ROCk module is loaded
johnny is member of video group
hsa api call failure at: /var/tmp/portage/dev-util/rocminfo-2.7.0/work/rocminfo-roc-2.7.0/rocminfo.cc:1102
Call returned HSA_STATUS_ERROR_OUT_OF_RESOURCES: The runtime failed to allocate the necessary resources. This error may also occur when the core runtime library needs to spawn threads or create internal OS-specific events.
==\

I'll attach clinfo next
Comment 4 Hubert Kowalski 2019-09-02 21:38:11 UTC
Created attachment 588854 [details]
clinfo out
Comment 5 Craig Andrews gentoo-dev 2019-09-03 01:10:17 UTC
Do you have these kernel configuration options set?

HSA_AMD
HMM_MIRROR
ZONE_DEVICE

If not, you must set them - please do that then try again. dev-libs/roct-thunk-interface would have warned you if these were not set.
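
As a hedged sketch, the three options named above can be checked against a kernel configuration with a few lines of Python. The function name missing_options is hypothetical. Reading /proc/config.gz is an assumption: that file only exists when the running kernel was built with CONFIG_IKCONFIG_PROC. Treating both =y and =m as "set" is also an assumption made for illustration.

```python
import gzip
import re
from pathlib import Path

# Option names taken from the comment above, without the CONFIG_ prefix.
REQUIRED_OPTIONS = ("HSA_AMD", "HMM_MIRROR", "ZONE_DEVICE")


def missing_options(config_text, required=REQUIRED_OPTIONS):
    """Return the required options that are not set to y or m in config_text."""
    # Match whole option names so that CONFIG_ARCH_HAS_HMM_MIRROR does not
    # count as CONFIG_HMM_MIRROR.
    enabled = set(re.findall(r"^CONFIG_(\w+)=[ym]$", config_text, flags=re.MULTILINE))
    return [name for name in required if name not in enabled]


if __name__ == "__main__":
    config_path = Path("/proc/config.gz")  # assumed location of the running kernel's config
    if not config_path.exists():
        print(f"{config_path} not available; check your kernel .config by hand")
    else:
        with gzip.open(config_path, "rt") as handle:
            missing = missing_options(handle.read())
        print("missing:", ", ".join(missing) if missing else "none")
```

Matching the full option name matters here because a plain substring search for HMM_MIRROR or ZONE_DEVICE also hits the CONFIG_ARCH_HAS_* helper symbols, which do not by themselves mean the feature is enabled.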
Comment 6 Hubert Kowalski 2019-09-03 05:49:57 UTC
(In reply to Craig Andrews from comment #5)
> Do you have these kernel configuration options set?
> 
> HSA_AMD
> HMM_MIRROR
> ZONE_DEVICE
> 
> If not, you must set them - please do that then try again.
> dev-libs/roct-thunk-interface would have warned you if these were not set.

I do have those :) They weren't easy to find:

==/
# zgrep "HSA_AMD" /proc/config.gz
CONFIG_HSA_AMD=y
# zgrep "HMM_MIRROR" /proc/config.gz
CONFIG_ARCH_HAS_HMM_MIRROR=y
CONFIG_HMM_MIRROR=y
# zgrep "ZONE_DEVICE" /proc/config.gz
CONFIG_ARCH_HAS_ZONE_DEVICE=y
CONFIG_ZONE_DEVICE=y
# uname -a
Linux inspiron17 5.2.9-gentoo #1 SMP Sun Aug 25 17:07:22 CEST 2019 x86_64 Intel(R) Core(TM) i5-8250U CPU @ 1.60GHz GenuineIntel GNU/Linux
==\

Should it work? The AMD GPU is the secondary unit, but the module loads and I don't see errors in dmesg...
Comment 7 Craig Andrews gentoo-dev 2019-09-03 14:44:48 UTC
Are you using all packages from Gentoo? I'm not sure if this could be caused by mixing old rocm overlay packages with Gentoo ones or not.

Also, some quick web searching for that error (HSA_STATUS_ERROR_OUT_OF_RESOURCES) indicates that you may need to reboot to resolve it. I'm curious to learn if you've tried that and if it changed anything.
Comment 8 Hubert Kowalski 2019-09-03 14:51:08 UTC
(In reply to Craig Andrews from comment #7)
> Are you using all packages from Gentoo? I'm not sure if this could be caused
> by mixing old rocm overlay packages with Gentoo ones or not.
> 
> Also, some quick web searching for that error
> (HSA_STATUS_ERROR_OUT_OF_RESOURCES) indicates that you may need to reboot to
> resolve it. I'm curious to learn if you've tried that and if it changed
> anything.

Rebooted frequently, no change in behaviour. Also, all packages are from Gentoo; there have never been any overlays on this system.
Comment 9 justXi 2019-09-08 18:52:37 UTC
> I understand that rocm-opencl-runtime-2.6.0-r1::gentoo does not install the
> 'rocm' OpenCL provider any longer to choose with eselect opencl but uses
> ocl-icd now. But the transition from the overlay is not without glitches it
> seems...

Did it work before with the OpenCL libs from the old ebuild?
Comment 10 ernsteiswuerfel archtester 2019-09-08 21:33:34 UTC
(In reply to justXi from comment #9)
> Did it work before with the OpenCL libs from the old ebuild?
Yes, with the libs from your rocm overlay it works. It also works with dev-libs/amdgpu-pro-opencl.
Comment 11 ernsteiswuerfel archtester 2019-09-08 21:38:38 UTC
Created attachment 589484 [details]
kernel .config (5.2.13)

Also the kernel options roct-thunk-interface requests are correctly set.

# rocminfo 
ROCm initialization failed
hsa api call failure at: /var/tmp/portage/dev-util/rocminfo-2.6.0/work/rocminfo-roc-2.6.0/rocminfo.cc:1068
Call returned HSA_STATUS_ERROR_OUT_OF_RESOURCES: The runtime failed to allocate the necessary resources. This error may also occur when the core runtime library needs to spawn threads or create internal OS-specific events.

rocminfo shows the output above. But I also got that output from it back when OpenCL worked with the rocm overlay.
Comment 12 Craig Andrews gentoo-dev 2019-09-30 16:13:31 UTC
(In reply to ernsteiswuerfel from comment #11)
Can you please try with dev-libs/roct-thunk-interface-2.8.0? If rocminfo doesn't show any devices with that version, can you please report the issue upstream at https://github.com/RadeonOpenCompute/ROCT-Thunk-Interface/issues and link to that issue report here? (I suspect they'll need to ask you some questions and to gather additional information)
Comment 13 Martin 2019-10-03 02:10:34 UTC
See:

https://github.com/RadeonOpenCompute/ROCT-Thunk-Interface/issues/44

That thread reports:


A change to ROCT 2.8.0 now requires the kernel config "CONFIG_NUMA=y" to be set, even for non-NUMA systems.

Also, the kernel config "CONFIG_CPU_SUP_* =y" (as appropriate for your CPU) should be set. For example:

"CONFIG_CPU_SUP_INTEL=y"
"CONFIG_CPU_SUP_AMD=y"


Also note from that thread:

 ascollard commented Oct 1, 2019 (Contributor)

"In the past ROCT uses cpuid instruction to get CPU cache information. This was causing problems when new CPUs were introduced to the market with new cpuid operations required. Using sysfs removes this limitation.

For now please make your kernel with CONFIG_NUMA=y. In the future ROCT release we can add the fallback when NUMA is not enabled in the system."
Comment 14 Craig Andrews gentoo-dev 2019-10-03 02:26:07 UTC
(In reply to Martin from comment #13)
All of that information is relevant when dev-libs/roct-thunk-interface-2.8.0 is installed; however, that version wasn't added to Gentoo until September 24 (commit 60b6c127957561b46198428eb401c8b04e5644ea), which is well after this bug report was created. So this bug report describes a different problem.
Comment 15 Martin 2019-10-03 02:49:58 UTC
Thanks for that and indeed so.

I've just recompiled my kernel with "CONFIG_NUMA=y", rebooted into that kernel, and... No change seen. For example I still get:

clinfo:
[...]
  Platform Name                                   AMD Accelerated Parallel Processing
Number of devices                                 0
[...]


Checking for NUMA, there is now correctly a "/sys/devices/system/node/node0" on my system, showing that the NUMA code is in place...

Sorry for the noise.
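
The sysfs check described above can be sketched in Python as follows. The function name numa_nodes is hypothetical, and the only assumption is the /sys/devices/system/node/nodeN layout mentioned in this comment; the sketch says nothing about whether ROCT itself is satisfied.

```python
import re
from pathlib import Path


def numa_nodes(sysfs_root="/sys/devices/system/node"):
    """Return the NUMA node directory names (node0, node1, ...) under sysfs_root."""
    root = Path(sysfs_root)
    if not root.is_dir():
        # No node directory at all, for example on a kernel built without NUMA.
        return []
    names = [
        entry.name
        for entry in root.iterdir()
        if entry.is_dir() and re.fullmatch(r"node\d+", entry.name)
    ]
    # Sort numerically so that node10 comes after node2.
    return sorted(names, key=lambda name: int(name[4:]))


if __name__ == "__main__":
    nodes = numa_nodes()
    print("NUMA nodes:", ", ".join(nodes) if nodes else "none found")
```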
Comment 16 Craig Andrews gentoo-dev 2019-10-10 15:05:41 UTC
Can you please try again with ROC 2.8 (currently in Gentoo)? And also, what card do you have?
Comment 17 ernsteiswuerfel archtester 2019-10-10 21:53:34 UTC
(In reply to Craig Andrews from comment #16)
> Can you please try again with ROC 2.8 (currently in Gentoo)? And also, what
> card do you have?
Card is a Radeon RX 590.
# inxi -b
System:    Host: supah Kernel: 5.3.5-gentoo x86_64 bits: 64 Desktop: MATE 1.22.0 Distro: Gentoo Base System release 2.6 
Machine:   Type: Server Mobo: Supermicro model: H8SGL v: 1234567890 serial: OM1BS70566 BIOS: American Megatrends v: 3.5b 
           date: 03/18/2016 
CPU:       8-Core: AMD Opteron 6380 type: MT MCP speed: 1399 MHz min/max: 1400/2500 MHz 
Graphics:  Device-1: Advanced Micro Devices [AMD/ATI] Ellesmere [Radeon RX 470/480/570/570X/580/580X/590] driver: amdgpu 
           v: kernel 
           Device-2: Matrox Systems MGA G200eW WPCM450 driver: N/A 
           Display: x11 server: X.Org 1.20.5 driver: amdgpu,ati unloaded: modesetting,radeon resolution: 1920x1080~60Hz 
           OpenGL: renderer: Radeon RX 590 Series (POLARIS10 DRM 3.33.0 5.3.5-gentoo LLVM 8.0.1) v: 4.5 Mesa 19.1.7 

There is definitely a change with kernel 5.3.x and ROCm 2.8. Both rocminfo and clinfo now give me a kernel crash when I invoke them:
[...]
[  277.314209] BUG: kernel NULL pointer dereference, address: 00000000000001ec
[  277.314214] #PF: supervisor write access in kernel mode
[  277.314216] #PF: error_code(0x0002) - not-present page
[  277.314218] PGD 0 P4D 0 
[  277.314222] Oops: 0002 [#1] SMP NOPTI
[  277.314226] CPU: 11 PID: 1664 Comm: rocminfo Not tainted 5.3.5-gentoo #2
[  277.314228] Hardware name: Supermicro H8SGL/H8SGL, BIOS 3.5b       03/18/2016
[  277.314231] RIP: 0010:0xffffffffc0cc8aa0
[...]

I have yet to report this crash upstream. Funnily enough, amdgpu-pro-opencl still works without any problem.
Comment 18 ernsteiswuerfel archtester 2019-10-24 21:27:59 UTC
Finally found some time for an upstream report: https://github.com/RadeonOpenCompute/ROCT-Thunk-Interface/issues/45
Sorry it took me that long!

I did not mention the crash from comment #17, as it no longer happens with ROC 2.9 and kernel 5.4-rc4.
Comment 19 ernsteiswuerfel archtester 2020-12-19 18:39:49 UTC
2.6.0-r1 no longer in tree.