Created attachment 577830 [details]
emerge --info

After installing dev-libs/amdgpu-pro-opencl-19.10.785425, sci-misc/boinc-7.14.2 reports "Missing coprocessor for task [...]". This is repeatable. Reverting to the previous version 18.20.684755 restores normal operation.
19.10.785425 is unusable for me as well (18.20.684755 works without problems), both on Oland (HD 8600) and Ellesmere (RX 570):

clinfo[3537]: segfault at 0 ip 0000000000000000 sp 00007ffc83a9be48 error 14 in clinfo[55cf968b4000+2000]
Code: Bad RIP value.

Could it be that the new binary package, which is built for Ubuntu 18.04 rather than 16.04 as 18.20.684755 was, is now incompatible with Gentoo?
I do not think this is a problem with Ubuntu binaries; on my own Polaris10 system everything works fine: clinfo, LuxMark, BOINC, Darktable... I've even downloaded ethminer and that works without problems as well.

Peter: what hardware do you try to run BOINC on? What, if anything, do you get when you run clinfo?

Bernd: Could you try running clinfo via strace and see what syscalls it tries to invoke just before the segfault?
(In reply to Marek Szuba from comment #2)
> Peter: what hardware do you try to run BOINC on? What, if anything, do you
> get when you run clinfo?

I'll attach it as it's 229 lines.
Created attachment 577926 [details] clinfo
(In reply to Marek Szuba from comment #2)
> Peter: what hardware do you try to run BOINC on?

It's an Intel 2x8-core i7 with 32GB RAM and 256GB NVMe. The GPU is an AMD/ATI Ellesmere Radeon Pro WX 5100. Do you need anything else that's not in the clinfo output?
Created attachment 577956 [details]
strace clinfo

I attach the output of strace clinfo. This is on the Ellesmere (RX 570). I'm running current ~amd64, Kernel 5.1.5 with CONFIG_DRM_AMDGPU_SI=y and CONFIG_HSA_AMD=y.

After reverting to 18.20.684755, clinfo output is:

Platform #0
  Name:                   Clover
  Version:                OpenCL 1.1 Mesa 19.1.0-rc3

  Device #0
    Name:                 Radeon RX 570 Series (POLARIS10, DRM 3.30.0, 5.1.5-gentoo, LLVM 8.0.0)
    Type:                 GPU
    Version:              OpenCL 1.1 Mesa 19.1.0-rc3
    Global memory size:   4 GB
    Local memory size:    32 kB
    Max work group size:  256
    Max work item sizes:  (256, 256, 256)

Platform #1
  Name:                   AMD Accelerated Parallel Processing
  Version:                OpenCL 2.1 AMD-APP (2639.3)

  Device #0
    Name:                 Ellesmere
    Type:                 GPU
    Version:              OpenCL 1.2 AMD-APP (2639.3)
    Global memory size:   3 GB 720 MB 740 kB
    Local memory size:    32 kB
    Max work group size:  256
    Max work item sizes:  (1024, 1024, 1024)
Prompted by Comment 6, I made a couple of kernel changes: CONFIG_DRM_AMDGPU=m (was y) and CONFIG_DRM_AMDGPU_SI=y (was n). Following the kernel help for the latter, I added this to /etc/conf.d/modules:

modules="amdgpu"
module_amdgpu_args="radeon.si_support=0 amdgpu.si_support=1"

Then I installed =amdgpu-pro-opencl-19.10.785425 and rebooted. I got the same fault condition as I originally reported.
Peter: Well, you have a Polaris GPU so it isn't really unexpected that flags related to Southern Island GPUs haven't had any effect. That said, you could try using "amdgpu.dc=1" instead - I think this has since been made the default but it wouldn't hurt to try.
Bernd: Interesting - looks like you have got two OpenCL platforms pointing to the same GPU. Could you please try the following:
 - upgrade dev-libs/amdgpu-pro-opencl again
 - go to /etc/OpenCL/vendors, remove all ICD files not pointing to /opt/amdgpu, then try reproducing the segfault;
 - if it helps, let me know to which package(s) the removed files belonged to (my bet is on media-libs/mesa but I would like to make sure).
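The ICD check suggested above could be scripted roughly as follows. This is only a sketch: the function name list_icds is made up for illustration, and it assumes each .icd file holds a single line naming the vendor library. It only prints what it finds and removes nothing.

```shell
#!/bin/sh
# Hypothetical helper (not part of any package): classify each ICD file in a
# vendors directory by whether the library it names lives under a prefix.
# Usage: list_icds [vendors_dir] [keep_prefix]
list_icds() {
    dir="${1:-/etc/OpenCL/vendors}"
    keep="${2:-/opt/amdgpu}"
    for icd in "$dir"/*.icd; do
        [ -e "$icd" ] || continue      # directory has no .icd files
        # Assumption: the first line is the vendor library name or path.
        IFS= read -r lib < "$icd" || true
        case "$lib" in
            "$keep"/*) printf 'keep    %s -> %s\n' "$icd" "$lib" ;;
            *)         printf 'foreign %s -> %s\n' "$icd" "$lib" ;;
        esac
    done
}
```

Running list_icds with no arguments inspects /etc/OpenCL/vendors. Any file reported as foreign can then be passed to "equery belongs" (from app-portage/gentoolkit) to find the package that owns it before moving it aside.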
(In reply to Marek Szuba from comment #8)
> Peter: Well, you have a Polaris GPU so it isn't really unexpected that flags
> related to Southern Island GPUs haven't had any effect.

Ah, so that's what SI stands for.

> That said, you could
> try using "amdgpu.dc=1" instead - I think this has since been made the
> default but it wouldn't hurt to try.

You're right; it made no difference.

But now, is this make.conf entry correct for my setup?

VIDEO_CARDS="amdgpu radeonsi"

I got that from a wiki page, I think.
(In reply to Marek Szuba from comment #9)
> Bernd: Interesting - looks like you have got two OpenCL platforms pointing
> to the same GPU. Could you please try the following:
> - upgrade dev-libs/amdgpu-pro-opencl again
> - go to /etc/OpenCL/vendors, remove all ICD files not pointing to
> /opt/amdgpu, then try reproducing the segfault;
> - if it helps, let me know to which package(s) the removed files belonged
> to (my bet is on media-libs/mesa but I would like to make sure).

Yes, there's mesa.icd from media-libs/mesa[opencl]. I thought it was by design that multiple opencl platforms could be present in parallel. Up to now it wasn't a problem.

Moving mesa.icd away, opencl initialization works again without segfault! I also checked with Tesseract using opencl, no problems.
The bug has been closed via the following commit(s):

https://gitweb.gentoo.org/repo/gentoo.git/commit/?id=5b8c1208f08af0fa8b870e3e677abf8cdc7c1bc3

commit 5b8c1208f08af0fa8b870e3e677abf8cdc7c1bc3
Author:     Marek Szuba <marecki@gentoo.org>
AuthorDate: 2019-05-29 10:30:24 +0000
Commit:     Marek Szuba <marecki@gentoo.org>
CommitDate: 2019-05-29 10:30:58 +0000

    dev-libs/amdgpu-pro-opencl: conflict with media-libs/mesa[opencl]

    With both this and the Mesa OpenCL state tracker enabled, using
    dev-libs/ocl-icd one ends up with two OpenCL platforms pointing to the
    same hardware - and more importantly causes segmentation faults in all
    applications attempting to use OpenCL.

    Closes: https://bugs.gentoo.org/686790
    Signed-off-by: Marek Szuba <marecki@gentoo.org>
    Package-Manager: Portage-2.3.66, Repoman-2.3.11

 .../amdgpu-pro-opencl-19.10.785425-r1.ebuild | 100 +++++++++++++++++++++
 1 file changed, 100 insertions(+)
Having had a look at Peter's clinfo as well as having successfully reproduced the segfaults by emerging media-libs/mesa with USE=opencl, I think it is safe to assume the two OpenCL runtimes can no longer co-exist. I have just pushed a new revision of dev-libs/amdgpu-pro-opencl which explicitly conflicts with media-libs/mesa[opencl]; that ought to do it.

BTW, Peter, I think your VIDEO_CARDS setting is correct for Polaris. Or at least it was at the time I configured my own system in exactly the same way :-)
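For readers unfamiliar with how such a conflict is expressed, a blocker in an ebuild might look roughly like the fragment below. This is an assumed form, not a copy of the -r1 ebuild, whose contents are not shown in this report; a real ebuild would list its other runtime dependencies in the same variable.

```shell
# Sketch only (assumed form, not taken from the actual -r1 ebuild).
# A leading "!" marks a blocker; the [opencl] USE dependency restricts the
# block to media-libs/mesa built with the opencl USE flag enabled.
RDEPEND="!media-libs/mesa[opencl]"
```

With a blocker of this kind, Portage refuses to have both packages installed at once, so the user has to either rebuild Mesa with USE=-opencl or stay with the Mesa OpenCL implementation.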
Good news Marek. Thanks. As to radeonsi, I'm left wondering "when is Southern Islands not Southern Islands?" :)
You need to update virtual/opencl as well; otherwise the amdgpu package cannot be installed while this virtual is installed.
I've uninstalled the virtual, installed the latest amdgpu-opencl and recompiled mesa with USE=-opencl, but when I try to update, emerge wants to re-emerge the virtual and mesa with USE=opencl.
(In reply to Luke A. Guest from comment #16)
> I've uninstalled the virtual, installed the latest amdgpu-opencl and
> recompiled mesa with USE=-opencl, but when I try to update, emerge wants to
> re-emerge the virtual and mesa with USE=opencl.

What has pulled the virtual in? It isn't installed here. Can you remove it?
It's already removed, but trying to upgrade wants to pull it in again. It seems it's clinfo:

# required by virtual/opencl-0-r6::gentoo
# required by dev-util/clinfo-2.2.18.04.06::gentoo
# required by @selected
# required by @world (argument)
>=media-libs/mesa-9999 opencl
Removed clinfo, now it's imagemagick.

# required by virtual/opencl-0-r6::gentoo
# required by media-gfx/imagemagick-7.0.8.45::gentoo
# required by @selected
# required by @world (argument)
>=media-libs/mesa-9999 opencl

...

(dependency required by "virtual/opencl-0-r6::gentoo" [ebuild])
(dependency required by "app-emulation/wine-d3d9-4.1::gentoo[opencl]" [installed])
(dependency required by "@selected" [set])
(dependency required by "@world" [argument])
Just cannot uninstall it from Mesa.

# cat /etc/portage/package.use/mesa.use
media-libs/mesa d3d9 gles1 gles2 -opencl openmax -xa
x11-apps/mesa-progs gles2

# emerge -av mesa

These are the packages that would be merged, in order:

Calculating dependencies... done!
[ebuild R *] media-libs/mesa-9999::gentoo USE="classic d3d9 dri3 egl gallium gbm gles1 gles2 llvm lm_sensors vaapi vdpau vulkan wayland -debug (-libglvnd) -opencl* -osmesa -pax_kernel -pic (-selinux) -test -unwind -valgrind -vulkan-overlay -xa -xvmc" ABI_X86="32 (64) (-x32)" VIDEO_CARDS="radeon radeonsi (-freedreno) -i915 -i965 -intel -iris -nouveau -r100 -r200 -r300 -r600 (-vc4) -virgl (-vivante) -vmware" 0 KiB

Total: 1 package (1 reinstall), Size of downloads: 0 KiB

!!! Multiple package instances within a single package slot have been pulled
!!! into the dependency graph, resulting in a slot conflict:

media-libs/mesa:0

  (media-libs/mesa-9999:0/0::gentoo, ebuild scheduled for merge) pulled in by
    media-libs/mesa (Argument)

  (media-libs/mesa-9999:0/0::gentoo, installed) pulled in by
    >=media-libs/mesa-9.1.6[opencl,abi_x86_32(-)?,abi_x86_64(-)?,abi_x86_x32(-)?,abi_mips_n32(-)?,abi_mips_n64(-)?,abi_mips_o32(-)?,abi_riscv_lp64d(-)?,abi_riscv_lp64(-)?,abi_s390_32(-)?,abi_s390_64(-)?] required by (virtual/opencl-0-r6:0/0::gentoo, installed)
                            ^^^^^^

It might be possible to solve this slot collision
by applying all of the following changes:
   - media-libs/mesa-9999 (Change USE: +opencl)

# eix opencl
[I] dev-libs/amdgpu-pro-opencl
     Available versions:  (~)18.20.684755^fms (~)18.30.641594^fms[1] (~)19.10.785425^fms [m](~)19.10.785425-r1^fms {ABI_X86="32 64"}
     Installed versions:  19.10.785425^fms(00:56:06 12/06/19)(ABI_X86="32 64")
     Homepage:            https://www.amd.com/en/support/kb/release-notes/rn-rad-lin-19-10-unified
     Description:         Proprietary OpenCL implementation for AMD GPUs

[I] virtual/opencl
     Available versions:  0-r5 0-r6 {ABI_MIPS="n32 n64 o32" ABI_RISCV="lp64 lp64d" ABI_S390="32 64" ABI_X86="32 64 x32" VIDEO_CARDS="amdgpu i965 nvidia"}
     Installed versions:  0-r6(08:15:28 22/06/19)(ABI_MIPS="-n32 -n64 -o32" ABI_RISCV="-lp64 -lp64d" ABI_S390="-32 -64" ABI_X86="32 64 -x32" VIDEO_CARDS="amdgpu -i965 -nvidia")
     Description:         Virtual for OpenCL implementations
https://bugs.gentoo.org/686964