Bug 686790 - dev-libs/amdgpu-pro-opencl-19.10.785425 conflicts at run time with dev-libs/mesa[opencl]
Status: RESOLVED FIXED
Alias: None
Product: Gentoo Linux
Classification: Unclassified
Component: Current packages
Hardware: All Linux
Importance: Normal normal
Assignee: Marek Szuba (RETIRED)
URL:
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2019-05-26 11:05 UTC by peter@prh.myzen.co.uk
Modified: 2019-06-25 10:30 UTC

See Also:
Package list:
Runtime testing required: ---


Attachments
emerge --info (emerge.info,6.30 KB, application/x-info)
2019-05-26 11:05 UTC, peter@prh.myzen.co.uk
clinfo (clinfo,13.97 KB, text/plain)
2019-05-28 08:14 UTC, peter@prh.myzen.co.uk
strace clinfo (strace_clinfo.txt,214.36 KB, text/plain)
2019-05-28 19:51 UTC, Bernd Feige

Description peter@prh.myzen.co.uk 2019-05-26 11:05:13 UTC
Created attachment 577830 [details]
emerge --info

After installing dev-libs/amdgpu-pro-opencl-19.10.785425, sci-misc/boinc-7.14.2 reports "Missing coprocessor for task [...]". This is repeatable.

Reverting to the previous version 18.20.684755 restores normal operation.
Comment 1 Bernd Feige 2019-05-27 07:53:26 UTC
19.10.785425 is unusable for me as well (18.20.684755 works without problems), both on Oland (HD 8600) and Ellesmere (RX 570):

clinfo[3537]: segfault at 0 ip 0000000000000000 sp 00007ffc83a9be48 error 14 in clinfo[55cf968b4000+2000]
Code: Bad RIP value.
 
Could it be that the new binary package, which targets Ubuntu 18.04 rather than 16.04 as 18.20.684755 did, is now incompatible with Gentoo?
Comment 2 Marek Szuba (RETIRED) archtester gentoo-dev 2019-05-28 07:48:25 UTC
I do not think this is a problem with Ubuntu binaries. On my own Polaris10 system everything works fine: clinfo, LuxMark, BOINC, Darktable... I've even downloaded ethminer and that works without problems as well.

Peter: what hardware do you try to run BOINC on? What, if anything, do you get when you run clinfo?

Bernd: Could you try running clinfo via strace and see what syscalls it tries to invoke just before the segfault?
Comment 3 peter@prh.myzen.co.uk 2019-05-28 08:14:07 UTC
(In reply to Marek Szuba from comment #2)

> Peter: what hardware do you try to run BOINC on? What, if anything, do you
> get when you run clinfo?

I'll attach it as it's 229 lines.
Comment 4 peter@prh.myzen.co.uk 2019-05-28 08:14:49 UTC
Created attachment 577926 [details]
clinfo
Comment 5 peter@prh.myzen.co.uk 2019-05-28 08:20:43 UTC
(In reply to Marek Szuba from comment #2)

> Peter: what hardware do you try to run BOINC on?

It's an Intel 2x8-core i7 with 32GB RAM and 256GB NVMe. The GPU is an AMD/ATI Ellesmere Radeon Pro WX 5100. Do you need anything else that's not in the clinfo output?
Comment 6 Bernd Feige 2019-05-28 19:51:43 UTC
Created attachment 577956 [details]
strace clinfo

I attach the output of strace clinfo.
This is on the Ellesmere (RX 570). I'm running current ~amd64, Kernel 5.1.5 with CONFIG_DRM_AMDGPU_SI=y and CONFIG_HSA_AMD=y.

After reverting to 18.20.684755, clinfo output is:

Platform #0
  Name:                                  Clover
  Version:                               OpenCL 1.1 Mesa 19.1.0-rc3

  Device #0
    Name:                                Radeon RX 570 Series (POLARIS10, DRM 3.30.0, 5.1.5-gentoo, LLVM 8.0.0)
    Type:                                GPU
    Version:                             OpenCL 1.1 Mesa 19.1.0-rc3
    Global memory size:                  4 GB 
    Local memory size:                   32 kB 
    Max work group size:                 256
    Max work item sizes:                 (256, 256, 256)

Platform #1
  Name:                                  AMD Accelerated Parallel Processing
  Version:                               OpenCL 2.1 AMD-APP (2639.3)

  Device #0
    Name:                                Ellesmere
    Type:                                GPU
    Version:                             OpenCL 1.2 AMD-APP (2639.3)
    Global memory size:                  3 GB 720 MB 740 kB 
    Local memory size:                   32 kB 
    Max work group size:                 256
    Max work item sizes:                 (1024, 1024, 1024)
Comment 7 peter@prh.myzen.co.uk 2019-05-29 08:50:26 UTC
Prompted by Comment 6, I made a couple of kernel changes: CONFIG_DRM_AMDGPU = m (was y), CONFIG_DRM_AMDGPU_SI = y (was n). Following the kernel help for the latter, I added this to /etc/conf.d/modules:

modules="amdgpu"
module_amdgpu_args="radeon.si_support=0 amdgpu.si_support=1"

Then I installed =amdgpu-pro-opencl-19.10.785425 and rebooted. I got the same fault condition as I reported.
Comment 8 Marek Szuba (RETIRED) archtester gentoo-dev 2019-05-29 08:59:23 UTC
Peter: Well, you have a Polaris GPU, so it isn't really unexpected that flags related to Southern Islands GPUs haven't had any effect. That said, you could try using "amdgpu.dc=1" instead - I think this has since been made the default, but it wouldn't hurt to try.
Comment 9 Marek Szuba (RETIRED) archtester gentoo-dev 2019-05-29 09:05:42 UTC
Bernd: Interesting - looks like you have got two OpenCL platforms pointing to the same GPU. Could you please try the following:
 - upgrade dev-libs/amdgpu-pro-opencl again;
 - go to /etc/OpenCL/vendors, remove all ICD files not pointing to /opt/amdgpu, then try reproducing the segfault;
 - if it helps, let me know which package(s) the removed files belonged to (my bet is on media-libs/mesa but I would like to make sure).
Comment 10 peter@prh.myzen.co.uk 2019-05-29 09:15:01 UTC
(In reply to Marek Szuba from comment #8)
> Peter: Well, you have a Polaris GPU so it isn't really unexpected that flags
> related to Southern Island GPUs haven't had any effect.

Ah, so that's what SI stands for.

> That said, you could
> try using "amdgpu.dc=1" instead - I think this has since been made the
> default but it wouldn't hurt to try.
You're right; it made no difference.

But now, is this make.conf entry correct for my setup?
   VIDEO_CARDS="amdgpu radeonsi"
I got that from a wiki page, I think.
Comment 11 Bernd Feige 2019-05-29 09:42:10 UTC
(In reply to Marek Szuba from comment #9)
> Bernd: Interesting - looks like you have got two OpenCL platforms pointing
> to the same GPU. Could you please try the following:
>  - upgrade dev-libs/amdgpu-pro-opencl again
>  - go to /etc/OpenCL/vendors, remove all ICD files not pointing to
> /opt/amdgpu, then try reproducing the segfault;
>  - if it helps, let me know to which package(s) the removed files belonged
> to (my bet is on media-libs/mesa but I would like to make sure).

Yes, there's mesa.icd from media-libs/mesa[opencl]. I thought it was by design that multiple OpenCL platforms could be present in parallel. Up to now it wasn't a problem.

With mesa.icd moved away, OpenCL initialization works again without a segfault! I also checked with Tesseract using OpenCL and found no problems.
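
For anyone repeating the check from comment 9, here is a minimal shell sketch that lists the ICD files which do not point below /opt/amdgpu. It only prints candidates and removes nothing. The function name is made up for illustration, and it assumes each ICD file holds the library name or path on its first line.

```shell
#!/bin/sh
# Hypothetical helper: print ICD files whose library entry does not
# live below the given prefix.
# Usage: list_foreign_icds [vendors_dir] [prefix]
list_foreign_icds() {
    dir=${1:-/etc/OpenCL/vendors}
    prefix=${2:-/opt/amdgpu}
    for icd in "$dir"/*.icd; do
        # Skip the unexpanded glob when the directory has no ICD files
        [ -e "$icd" ] || continue
        # Assumption: the first line names the library to load
        lib=$(head -n 1 "$icd")
        case $lib in
            "$prefix"/*) ;;                 # points into the prefix: keep
            *) printf '%s\n' "$icd" ;;      # anything else: report it
        esac
    done
}
```

Called without arguments, it checks /etc/OpenCL/vendors against /opt/amdgpu. Each reported file can then be moved aside rather than deleted, and its owning package looked up with a tool such as qfile (assuming app-portage/portage-utils is installed).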
Comment 12 Larry the Git Cow gentoo-dev 2019-05-29 10:31:08 UTC
The bug has been closed via the following commit(s):

https://gitweb.gentoo.org/repo/gentoo.git/commit/?id=5b8c1208f08af0fa8b870e3e677abf8cdc7c1bc3

commit 5b8c1208f08af0fa8b870e3e677abf8cdc7c1bc3
Author:     Marek Szuba <marecki@gentoo.org>
AuthorDate: 2019-05-29 10:30:24 +0000
Commit:     Marek Szuba <marecki@gentoo.org>
CommitDate: 2019-05-29 10:30:58 +0000

    dev-libs/amdgpu-pro-opencl: conflict with media-libs/mesa[opencl]
    
    With both this and the Mesa OpenCL state tracker enabled, using
    dev-libs/ocl-icd one ends up with two OpenCL platforms pointing
    to the same hardware - and more importantly causes segmentation faults
    in all applications attempting to use OpenCL.
    
    Closes: https://bugs.gentoo.org/686790
    Signed-off-by: Marek Szuba <marecki@gentoo.org>
    Package-Manager: Portage-2.3.66, Repoman-2.3.11

 .../amdgpu-pro-opencl-19.10.785425-r1.ebuild       | 100 +++++++++++++++++++++
 1 file changed, 100 insertions(+)
Comment 13 Marek Szuba (RETIRED) archtester gentoo-dev 2019-05-29 10:35:51 UTC
Having had a look at Peter's clinfo as well as having successfully reproduced the segfaults by emerging media-libs/mesa with USE=opencl, I think it is safe to assume the two OpenCL runtimes can no longer co-exist. I have just pushed a new revision of dev-libs/amdgpu-pro-opencl which explicitly conflicts with media-libs/mesa[opencl]; that ought to do it.

BTW, Peter, I think your VIDEO_CARDS setting is correct for Polaris. Or at least it was at the time I configured my own system in exactly the same way :-)
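
For readers unfamiliar with how an ebuild expresses such a conflict, the general Portage syntax is a blocker on a USE-conditional atom. The fragment below is only a sketch of that syntax and is not copied from the committed -r1 ebuild, whose actual dependency list may differ.

```shell
# Illustrative ebuild fragment (an assumption, not the real commit).
# "!atom[flag]" is a blocker: this package and media-libs/mesa built
# with USE=opencl cannot be installed at the same time.
RDEPEND="!media-libs/mesa[opencl]"
```

With a blocker like this in place, Portage should refuse to merge the package until mesa is rebuilt with USE=-opencl, instead of leaving both runtimes registered.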
Comment 14 peter@prh.myzen.co.uk 2019-05-29 16:47:56 UTC
Good news Marek. Thanks.

As to radeonsi, I'm left wondering "when is Southern Islands not Southern Islands?"  :)
Comment 15 Luke A. Guest 2019-06-11 15:10:13 UTC
You need to update virtual/opencl as well; otherwise the amdgpu package cannot be installed while this virtual is installed.
Comment 16 Luke A. Guest 2019-06-12 12:14:05 UTC
I've uninstalled the virtual, installed the latest amdgpu-opencl and recompiled mesa with USE=-opencl, but when I try to update, emerge wants to re-emerge the virtual and mesa with USE=opencl.
Comment 17 peter@prh.myzen.co.uk 2019-06-12 14:45:20 UTC
(In reply to Luke A. Guest from comment #16)
> I've uninstalled the virtual, installed the latest amdgpu-opencl and
> recompiled mesa with USE=-opencl, but when I try to update, emerge wants to
> re-emerge the virtual and mesa with USE=opencl.

What has pulled the virtual in? It isn't installed here. Can you remove it?
Comment 18 Luke A. Guest 2019-06-12 19:21:27 UTC
It's already removed, but trying to upgrade wants to pull it back in. It seems to be clinfo:

# required by virtual/opencl-0-r6::gentoo
# required by dev-util/clinfo-2.2.18.04.06::gentoo
# required by @selected
# required by @world (argument)
>=media-libs/mesa-9999 opencl
Comment 19 Luke A. Guest 2019-06-12 19:27:27 UTC
Removed clinfo, now it's imagemagick.

# required by virtual/opencl-0-r6::gentoo
# required by media-gfx/imagemagick-7.0.8.45::gentoo
# required by @selected
# required by @world (argument)
>=media-libs/mesa-9999 opencl

...

(dependency required by "virtual/opencl-0-r6::gentoo" [ebuild])
(dependency required by "app-emulation/wine-d3d9-4.1::gentoo[opencl]" [installed])
(dependency required by "@selected" [set])
(dependency required by "@world" [argument])
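
The "# required by" chains in comments 18 and 19 are what identify the packages dragging virtual/opencl back in. A minimal sketch for extracting just the package atoms from saved emerge output follows. The function name is made up for illustration, and it assumes the output uses the same "# required by" line format as quoted above.

```shell
#!/bin/sh
# Hypothetical helper: read emerge output on stdin and print the
# category/package atoms named in "# required by" lines.
# Set names such as @selected and @world contain no slash and are skipped.
required_by() {
    sed -n 's|^# required by \([^ ]*/[^ ]*\).*$|\1|p'
}
```

Fed the block from comment 18, it prints virtual/opencl-0-r6::gentoo followed by dev-util/clinfo-2.2.18.04.06::gentoo. A direct query such as equery depends virtual/opencl (assuming app-portage/gentoolkit is installed) is another way to list the installed packages that depend on the virtual.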
Comment 20 Luke A. Guest 2019-06-25 10:16:58 UTC
I just cannot remove OpenCL support from Mesa.

# cat /etc/portage/package.use/mesa.use
media-libs/mesa d3d9 gles1 gles2 -opencl openmax -xa
x11-apps/mesa-progs gles2

# emerge -av mesa

These are the packages that would be merged, in order:

Calculating dependencies... done!
[ebuild   R   *] media-libs/mesa-9999::gentoo  USE="classic d3d9 dri3 egl gallium gbm gles1 gles2 llvm lm_sensors vaapi vdpau vulkan wayland -debug (-libglvnd) -opencl* -osmesa -pax_kernel -pic (-selinux) -test -unwind -valgrind -vulkan-overlay -xa -xvmc" ABI_X86="32 (64) (-x32)" VIDEO_CARDS="radeon radeonsi (-freedreno) -i915 -i965 -intel -iris -nouveau -r100 -r200 -r300 -r600 (-vc4) -virgl (-vivante) -vmware" 0 KiB

Total: 1 package (1 reinstall), Size of downloads: 0 KiB

!!! Multiple package instances within a single package slot have been pulled
!!! into the dependency graph, resulting in a slot conflict:

media-libs/mesa:0

  (media-libs/mesa-9999:0/0::gentoo, ebuild scheduled for merge) pulled in by
    media-libs/mesa (Argument)

  (media-libs/mesa-9999:0/0::gentoo, installed) pulled in by
    >=media-libs/mesa-9.1.6[opencl,abi_x86_32(-)?,abi_x86_64(-)?,abi_x86_x32(-)?,abi_mips_n32(-)?,abi_mips_n64(-)?,abi_mips_o32(-)?,abi_riscv_lp64d(-)?,abi_riscv_lp64(-)?,abi_s390_32(-)?,abi_s390_64(-)?] required by (virtual/opencl-0-r6:0/0::gentoo, installed)
                            ^^^^^^                                                                                                                                                                                                                                                                      


It might be possible to solve this slot collision
by applying all of the following changes:
   - media-libs/mesa-9999 (Change USE: +opencl)

# eix opencl

[I] dev-libs/amdgpu-pro-opencl
     Available versions:  (~)18.20.684755^fms (~)18.30.641594^fms[1] (~)19.10.785425^fms [m](~)19.10.785425-r1^fms {ABI_X86="32 64"}
     Installed versions:  19.10.785425^fms(00:56:06 12/06/19)(ABI_X86="32 64")
     Homepage:            https://www.amd.com/en/support/kb/release-notes/rn-rad-lin-19-10-unified
     Description:         Proprietary OpenCL implementation for AMD GPUs

[I] virtual/opencl
     Available versions:  0-r5 0-r6 {ABI_MIPS="n32 n64 o32" ABI_RISCV="lp64 lp64d" ABI_S390="32 64" ABI_X86="32 64 x32" VIDEO_CARDS="amdgpu i965 nvidia"}
     Installed versions:  0-r6(08:15:28 22/06/19)(ABI_MIPS="-n32 -n64 -o32" ABI_RISCV="-lp64 -lp64d" ABI_S390="-32 -64" ABI_X86="32 64 -x32" VIDEO_CARDS="amdgpu -i965 -nvidia")
     Description:         Virtual for OpenCL implementations
Comment 21 Marek Szuba (RETIRED) archtester gentoo-dev 2019-06-25 10:30:41 UTC
https://bugs.gentoo.org/686964