854345 – Incorrect driver dependencies for nvidia-cuda-toolkit (was: =dev-util/nvidia-cuda-toolkit-11.7.0-r1 is not compatible with x11-drivers/nvidia-drivers-510.73.05-r1)

Status:	UNCONFIRMED

Alias:	None

Product:	Gentoo Linux
Classification:	Unclassified
Component:	Current packages
Hardware:	All Linux

Importance:	Normal normal
Assignee:	Gentoo Science Related Packages

URL:
Whiteboard:
Keywords:	PullRequest

Depends on:
Blocks:

Reported:	2022-06-26 05:12 UTC by Jan Vesely
Modified:	2024-03-21 13:36 UTC
CC List:	1 user

See Also:	https://github.com/gentoo/gentoo/pull/28838 916976
Package list:
Runtime testing required:	---

Description Jan Vesely 2022-06-26 05:12:55 UTC

CUDA 11.7 emits a new PTX ISA version (7.7 [0]). However, the
510.73.05-r1 driver supports PTX only up to version 7.6.

This can lead to runtime failures in CUDA programs that require the PTX JIT compiler, e.g.:
nvcc -arch compute_50 example.cu
./a.out:
CUDA ERROR: 222, the provided PTX was compiled with an unsupported toolchain.

Trying lower level APIs gives a more readable error, e.g.:

cuLinkAddData failed: the provided PTX was compiled with an unsupported toolchain. - ptxas application ptx input, line 9; fatal   : Unsupported .version 7.7; current version is '7.6'



[0] https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#ptx-isa-version-7-7

Reproducible: Always
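
The mismatch described above can be checked before running anything on the GPU. The following is a minimal sketch, assuming the PTX text is available (for example from `nvcc -ptx`) and that the newest PTX ISA version the installed driver accepts is already known; the helper names `ptx_version` and `driver_can_jit` and the sample header are hypothetical, and only the 7.7/7.6 pair comes from this report.

```python
import re

# A PTX module declares its ISA version with a `.version major.minor` directive.
_VERSION_RE = re.compile(r"^\s*\.version\s+(\d+)\.(\d+)", re.MULTILINE)


def ptx_version(ptx_text):
    """Return the (major, minor) PTX ISA version declared in a PTX module."""
    match = _VERSION_RE.search(ptx_text)
    if match is None:
        raise ValueError("no .version directive found")
    return int(match.group(1)), int(match.group(2))


def driver_can_jit(ptx_text, driver_max_ptx):
    """True if the driver's newest supported PTX ISA covers the module's version.

    driver_max_ptx is an assumption supplied by the caller, e.g. (7, 6).
    """
    return ptx_version(ptx_text) <= driver_max_ptx


# Illustrative header only, not output captured from a real build.
sample = """//
// Generated by NVIDIA NVVM Compiler
//
.version 7.7
.target sm_50
.address_size 64
"""

print(ptx_version(sample))             # (7, 7)
print(driver_can_jit(sample, (7, 6)))  # False: the 11.7 toolkit vs. 510.73.05 case
```

Programs that ship only cubins/SASS for the target GPU never reach the JIT path, so a check like this matters only when PTX is actually loaded at runtime.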

Comment 1 Jan Vesely 2022-06-26 06:05:25 UTC

Commit ee4f5450b77850798de40cdd73107dd2955ac8e9 ("dev-util/nvidia-cuda-toolkit: relax driver bounds for sm_35/sm_37") changed the driver constraints,
but this can lead to issues [0]:

"""
* Limited feature set
Sometimes features introduced in a CUDA Toolkit version may actually span both
the toolkit and the driver. In such cases an application that relies on features introduced in a newer version of the toolkit and driver may return the following error on older drivers: cudaErrorCallRequiresNewerDriver. As mentioned earlier, admins should then upgrade the installed driver also.

Application developers can avoid running into this problem by having the application explicitly check for the availability of features. Refer to the CUDA Compatibility Developers Guide for more details.

* Applications using PTX will see runtime issues
Applications that compile device code to PTX will not work on older drivers. If the application requires PTX then admins have to upgrade the installed driver.

PTX Developers should refer to the CUDA Compatibility Developers Guide and PTX programming guide in the CUDA C++ Programming Guide for details on this limitation.
"""

The "forward compatibility" feature of CUDA (running a newer CUDA on older drivers, basically by copying a few runtime libraries from the newer driver) only works on a few datacenter SKUs. Everybody else will get
error 804 "forward compatibility was attempted on non supported HW".

It'd be nice if dev-util/nvidia-cuda-toolkit:
* pointed to the above warning about issues with PTX
* installed the "cuda forward compatibility" package. "forward compatibility not supported" is a more readable error than "the provided PTX was compiled with an unsupported toolchain."

[0] https://docs.nvidia.com/deploy/cuda-compatibility/index.html#application-considerations

Comment 2 David Seifert gentoo-dev 2022-10-26 11:30:46 UTC

(In reply to Jan Vesely from comment #1)
> commit ee4f5450b77850798de40cdd73107dd2955ac8e9
> ("dev-util/nvidia-cuda-toolkit: relax driver bounds for sm_35/sm_37") change
> the driver constraints,
> but this can lead to issues[0]:
> 
> """
> * Limited feature set
> Sometimes features introduced in a CUDA Toolkit version may actually span
> both 
> the toolkit and the driver. In such cases an application that relies on
> features introduced in a newer version of the toolkit and driver may return
> the following error on older drivers: cudaErrorCallRequiresNewerDriver. As
> mentioned earlier, admins should then upgrade the installed driver also.
> 
> Application developers can avoid running into this problem by having the
> application explicitly check for the availability of features. Refer to the
> CUDA Compatibility Developers Guide for more details.
> 
> * Applications using PTX will see runtime issues
> Applications that compile device code to PTX will not work on older drivers.
> If the application requires PTX then admins have to upgrade the installed
> driver.
> 
> PTX Developers should refer to the CUDA Compatibility Developers Guide and
> PTX programming guide in the CUDA C++ Programming Guide for details on this
> limitation.
> """
> 
> The "forward compatibility" feature of CUDA (running newer CUDA on older
> drivers, basically copying few runtime libraries from the newer driver) only
> works for a few datacenter SKUs. Everybody else will get
> error 804 "forward compatibility was attempted on non supported HW".
> 
> It'd be nice if dev-utils/nvidia-cuda-toolkit;
> * pointed to the above warning about issues with PTX
> * applied "cuda forward compatibility" package. "forward compatibility not
> supported" is more readable error than "the provided PTX was compiled with
> an unsupported toolchain."
> 
> [0]
> https://docs.nvidia.com/deploy/cuda-compatibility/index.html#application-
> considerations

What do you suggest? If we tighten the bound, you'll lose Kepler support.

Comment 3 Jan Vesely 2022-10-31 14:25:37 UTC

(In reply to David Seifert from comment #2)
> What do you suggest? If we tighten the bound, you'll lose Kepler support.

Kepler support is not really there, or at least not on par with later GPU families.
I think having a USE flag would help.

iiuc, there are three "required driver" versions for each CUDA release:
1.) The CUDA release driver. Offers full features on all supported GPUs.

2.) The major version release driver (Y.0). This should be enough for Kepler support on 11.x (the 11.0 driver is r450), but the setup can run into the PTX and/or feature issues mentioned above [0].

3.) Forward compatibility. Needs installation of a "compat package" (basically the userspace CUDA driver from a later release) [1]. Should provide full features on select GPUs.


I'm not sure 3. is needed, since it only works on a handful of datacenter SKUs. If it is, it can be handled by a new "cuda-compat" package that overrides the userspace libraries installed by "nvidia-drivers".


Having 2. guarded by a USE flag (e.g. "cuda-minor-compat") would be helpful.
The expected behaviour:

* nvidia-cuda-toolkit[+cuda-minor-compat]: basically the current behaviour, plus a post-install message warning about potential issues with a link to [0]
* nvidia-cuda-toolkit[-cuda-minor-compat]: strict dependency on the release driver version (e.g. 520 for 11.8).

Even if the USE flag is default-on, it would let users decide whether to risk PTX incompatibility, which is fine if their workloads use kernels compiled to cubins/SASS, or whether they need PTX JIT to work and therefore need a newer driver.


[0] https://docs.nvidia.com/deploy/cuda-compatibility/index.html#application-considerations
[1] https://docs.nvidia.com/deploy/cuda-compatibility/index.html#forward-compatibility-title
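
The proposed flag behaviour can be sketched as a small lookup. This is only an illustration of the selection logic, not ebuild code: the tables and the `driver_dep` function are hypothetical, and the only values filled in are the ones mentioned in this comment (r450 as the 11.x baseline, 520 as the release driver for 11.8).

```python
# Hypothetical table: release driver branch per toolkit version.
# Only the 11.8 entry is taken from this comment; others are left out on purpose.
RELEASE_DRIVER = {"11.8": "520"}

# Hypothetical table: baseline (Y.0) driver branch per CUDA major version.
MAJOR_BASELINE = {"11": "450"}


def driver_dep(toolkit_version, cuda_minor_compat):
    """Return the driver atom a toolkit version would depend on.

    With cuda_minor_compat enabled, only the major-version baseline driver is
    required (PTX JIT may fail); without it, the release driver is required.
    """
    major = toolkit_version.split(".")[0]
    if cuda_minor_compat:
        minimum = MAJOR_BASELINE[major]
    else:
        minimum = RELEASE_DRIVER[toolkit_version]
    return f">=x11-drivers/nvidia-drivers-{minimum}"


print(driver_dep("11.8", cuda_minor_compat=True))   # >=x11-drivers/nvidia-drivers-450
print(driver_dep("11.8", cuda_minor_compat=False))  # >=x11-drivers/nvidia-drivers-520
```

In an actual ebuild the same choice would presumably be expressed as a USE-conditional dependency; the sketch only shows which bound each flag state selects.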

Comment 4 Andrew Udvare 2022-11-08 05:02:08 UTC

I think I've run into a similar issue since updating yesterday, which is that nvidia-drivers 520.56.06 is not compatible with 11.7.0 or 11.8.0 from what I can tell.

I used to get information from clinfo but now it shows there are no platforms. It tries to open /dev/nvidia-uvm but fails:

openat(AT_FDCWD, "/dev/nvidia-uvm", O_RDWR|O_CLOEXEC) = -1 EIO (Input/output error)
openat(AT_FDCWD, "/dev/nvidia-uvm", O_RDWR) = -1 EIO (Input/output error)

 $ ls -la /dev/nvidia*
crw-rw---- 1 root video 195,   0 Nov  7 00:36 /dev/nvidia0
crw-rw---- 1 root video 195, 255 Nov  7 00:36 /dev/nvidiactl
crw-rw---- 1 root video 195, 254 Nov  7 00:36 /dev/nvidia-modeset
crw-rw-rw- 1 root root  239,   0 Nov  7 17:15 /dev/nvidia-uvm
crw-rw-rw- 1 root root  239,   1 Nov  7 17:15 /dev/nvidia-uvm-tools

/dev/nvidia-caps:
total 0
drwxr-xr-x  2 root tatsh     80 Nov  7 00:36 .
drwxr-xr-x 23 root root    4300 Nov  7 22:44 ..
cr--------  1 root root  242, 1 Nov  7 00:36 nvidia-cap1
cr--r--r--  1 root root  242, 2 Nov  7 00:36 nvidia-cap2

Comment 5 Jan Vesely 2022-11-09 19:17:53 UTC

(In reply to Andrew Udvare from comment #4)
> I think I've run into a similar issue since updating yesterday, which is
> that nvidia-drivers 520.56.06 is not compatible with 11.7.0 or 11.8.0 from
> what I can tell.
> 
> I used to get information from clinfo but now it shows there are no
> platforms. It tries to open /dev/nvidia-uvm but fails:
> 
> openat(AT_FDCWD, "/dev/nvidia-uvm", O_RDWR|O_CLOEXEC) = -1 EIO (Input/output
> error)
> openat(AT_FDCWD, "/dev/nvidia-uvm", O_RDWR) = -1 EIO (Input/output error)
> 
>  $ ls -la /dev/nvidia*
> crw-rw---- 1 root video 195,   0 Nov  7 00:36 /dev/nvidia0
> crw-rw---- 1 root video 195, 255 Nov  7 00:36 /dev/nvidiactl
> crw-rw---- 1 root video 195, 254 Nov  7 00:36 /dev/nvidia-modeset
> crw-rw-rw- 1 root root  239,   0 Nov  7 17:15 /dev/nvidia-uvm
> crw-rw-rw- 1 root root  239,   1 Nov  7 17:15 /dev/nvidia-uvm-tools
> 
> /dev/nvidia-caps:
> total 0
> drwxr-xr-x  2 root tatsh     80 Nov  7 00:36 .
> drwxr-xr-x 23 root root    4300 Nov  7 22:44 ..
> cr--------  1 root root  242, 1 Nov  7 00:36 nvidia-cap1
> cr--r--r--  1 root root  242, 2 Nov  7 00:36 nvidia-cap2

Sounds like a different issue.
Drivers should always be backward compatible (i.e. a newer driver should always work with an older CUDA toolkit).
The release driver for 11.8 is 520.61.05. 11.8 can potentially run into issues with 520.56.06, but 11.7 should work.
Since the signature is different, it sounds like a general bug in 520.56.06.
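
The comparison between an installed driver and a toolkit's release driver comes down to ordering dotted version strings numerically. A minimal sketch, assuming plain dotted versions with no revision suffix; the function names are hypothetical and the two version strings are the ones quoted above.

```python
def parse_driver(version):
    """Split a dotted driver version such as '520.56.06' into an integer tuple."""
    return tuple(int(part) for part in version.split("."))


def meets_release_driver(installed, release):
    """True if the installed driver is at least the toolkit's release driver."""
    return parse_driver(installed) >= parse_driver(release)


# 11.8's release driver is 520.61.05, so 520.56.06 falls short of it.
print(meets_release_driver("520.56.06", "520.61.05"))  # False
print(meets_release_driver("520.61.05", "520.61.05"))  # True
```

Comparing the strings directly would happen to work for these two values, but integer tuples stay correct when component widths differ.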

Comment 6 Andrew Udvare 2022-11-12 08:03:34 UTC

The issue seems to be OpenCL-related only. It seems everything else works fine with the GPU though I didn't try something like PyTorch. The issue doesn't involve nvidia-cuda-toolkit so I will maybe file another bug.

Comment 7 Anton Bolshakov 2022-11-30 14:02:58 UTC

Please align the CUDA dependencies with Table 3:
https://docs.nvidia.com/cuda/cuda-toolkit-release-notes/index.html#abstract

nvidia-cuda-toolkit-11.8.0 should use >=x11-drivers/nvidia-drivers-520.61.05


Without it, I'm getting the following error message:
hashcat -m 1500 XXXX --force

cuLinkAddData(): the provided PTX was compiled with an unsupported toolchain.

* Device #1: Kernel /usr/share/hashcat/OpenCL/shared.cl link failed. Error Log:

ptxas application ptx input, line 9; fatal   : Unsupported .version 7.8; current version is '7.7'



* Device #1: Kernel /usr/share/hashcat/OpenCL/shared.cl build failed.

Comment 8 Jan Vesely 2022-12-09 17:54:12 UTC

(In reply to Andrew Udvare from comment #6)
> The issue seems to be OpenCL-related only. It seems everything else works
> fine with the GPU though I didn't try something like PyTorch. The issue
> doesn't involve nvidia-cuda-toolkit so I will maybe file another bug.

Failure to open /dev/nvidia-uvm should be accompanied by messages in the kernel log. You can also try loading the uvm module with "uvm_debug_prints=1 uvm_release_asserts=1" to get more information. Either way, the issue is different from the one in this bug.

Comment 9 Patrick Strateman 2022-12-27 05:09:15 UTC

(In reply to Jan Vesely from comment #3)
> (In reply to David Seifert from comment #2)
> 
> > 
> > What do you suggest? If we tighten the bound, you'll lose Kepler support.
> 
> The kepler support is not there, or at least not on par with later GPU
> families
> I think having a useflag would help.
> 
> iiuc, there are 3 "required driver" versions for each cuda release:
> 1.) the cuda release driver. Offers full features on all supported GPUs
> 
> 2.) the major version release driver (Y.0); should be enough for kepler
> support for 11.x (11.0 driver is r450). This setup has possible issues with
> PTX and/or some features mentioned above [0]
> 
> 3.) forward compatibility; needs installation of "compat package" (basically
> uspace cuda driver from later release) [1]. Should provide full features for
> select GPUs.
> 
> 
> I'm not sure if 3. is needed, since it only works for a handful of
> datacenter SKUs. If needed, it can be handled by having a new "cuda-compat"
> package to override uspace libraries installed by "nvidia-drivers".
> 
> 
> Having 2. guarded by a useflag (e.g. "cuda-minor-compat") would be helpful.
> The expected behaviour:
> 
> * nvidia-cuda-toolkit[+cuda-minor-compat]: basically the current behaviour
> with a post-install message warning about potential issues with a link to [0]
> * nvidia-cuda-toolkit[-cuda-minor-compat]: strict depend on release driver
> version (e.g. 520 for 11.8).
> 
> Even if the useflag is 'default-on', it would allow users to decide whether
> to risk PTX incompatibility, which is fine if the workloads use cubins/SASS
> compiled kernels. Or they need PTX JIT to work and therefore need newer
> driver.
> 
> 
> [0]
> https://docs.nvidia.com/deploy/cuda-compatibility/index.html#application-
> considerations
> [1]
> https://docs.nvidia.com/deploy/cuda-compatibility/index.html#forward-
> compatibility-title

I've created a pull request that implements cuda-minor-compat

https://github.com/gentoo/gentoo/pull/28838

Comment 10 David Seifert gentoo-dev 2023-06-27 13:49:44 UTC

With CUDA 12 in the tree now, I think this bug is obsolete.

Comment 11 Jan Vesely 2023-07-05 13:25:21 UTC

The specific instance for 11.7 and r510 might be obsolete, but the generic problem remains:

Depending on the internal versioning of PTX asm, CUDA applications that need the PTX runtime compiler will fail on a driver older than the release driver of a CUDA toolkit release.

Comment 12 Pacho Ramos gentoo-dev 2024-03-21 13:35:05 UTC

(In reply to David Seifert from comment #10)
> With CUDA 12 in the tree now, I think this bug is obsolete.

I just hit this bug as VMD was failing to detect CUDA with a combination of x11-drivers/nvidia-drivers-535.161.07 and dev-util/nvidia-cuda-toolkit-12.3.2:
CUDA error: the provided PTX was compiled with an unsupported toolchain., CUDAClearDevice.cu line 54