CUDA 11.7 uses a new PTX version/format (7.7 [0]). However, the 510.73.05-r1 driver supports only PTX version 7.6. This can lead to runtime failures in CUDA programs that require the PTX JIT compiler.

E.g.

  nvcc -arch compute_50 example.cu
  ./a.out: CUDA ERROR: 222, the provided PTX was compiled with an unsupported toolchain.

Trying lower level APIs gives a more readable error, e.g.:

  cuLinkAddData failed: the provided PTX was compiled with an unsupported toolchain. - ptxas application ptx input, line 9; fatal : Unsupported .version 7.7; current version is '7.6'

[0] https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#ptx-isa-version-7-7

Reproducible: Always
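For anyone triaging similar reports, the ptxas message quoted above can be picked apart mechanically. The following is only a minimal sketch: the function name `parse_ptx_mismatch` is made up for illustration, and the regular expression assumes the message keeps the wording shown in this report.

```python
import re

# Matches the ptxas wording quoted in this report. Whitespace is matched
# loosely because the spacing around "fatal" and ":" varies between logs.
_PTX_MISMATCH = re.compile(
    r"Unsupported\s+\.version\s+(\d+)\.(\d+);"
    r"\s*current version is\s+'(\d+)\.(\d+)'"
)


def parse_ptx_mismatch(log):
    """Return ((requested major, minor), (supported major, minor)) or None.

    Hypothetical helper: "requested" is the PTX version emitted by the
    toolkit, "supported" is the newest version the driver's JIT accepts.
    """
    match = _PTX_MISMATCH.search(log)
    if match is None:
        return None
    req_major, req_minor, drv_major, drv_minor = (int(g) for g in match.groups())
    return (req_major, req_minor), (drv_major, drv_minor)


log = (
    "cuLinkAddData failed: the provided PTX was compiled with an "
    "unsupported toolchain. - ptxas application ptx input, line 9; "
    "fatal : Unsupported .version 7.7; current version is '7.6'"
)
print(parse_ptx_mismatch(log))
```

For the log above this yields the pair (7, 7) and (7, 6), i.e. the toolkit emitted PTX 7.7 while the driver only accepts up to 7.6.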
commit ee4f5450b77850798de40cdd73107dd2955ac8e9 ("dev-util/nvidia-cuda-toolkit: relax driver bounds for sm_35/sm_37") changed the driver constraints, but this can lead to issues [0]:

"""
* Limited feature set
Sometimes features introduced in a CUDA Toolkit version may actually span both the toolkit and the driver. In such cases an application that relies on features introduced in a newer version of the toolkit and driver may return the following error on older drivers: cudaErrorCallRequiresNewerDriver. As mentioned earlier, admins should then upgrade the installed driver also.

Application developers can avoid running into this problem by having the application explicitly check for the availability of features. Refer to the CUDA Compatibility Developers Guide for more details.

* Applications using PTX will see runtime issues
Applications that compile device code to PTX will not work on older drivers. If the application requires PTX then admins have to upgrade the installed driver.

PTX Developers should refer to the CUDA Compatibility Developers Guide and PTX programming guide in the CUDA C++ Programming Guide for details on this limitation.
"""

The "forward compatibility" feature of CUDA (running newer CUDA on older drivers, basically copying a few runtime libraries from the newer driver) only works for a few datacenter SKUs. Everybody else will get error 804 "forward compatibility was attempted on non supported HW".

It'd be nice if dev-util/nvidia-cuda-toolkit:
* pointed to the above warning about issues with PTX
* applied the "cuda forward compatibility" package. "forward compatibility not supported" is a more readable error than "the provided PTX was compiled with an unsupported toolchain."

[0] https://docs.nvidia.com/deploy/cuda-compatibility/index.html#application-considerations
(In reply to Jan Vesely from comment #1)
> It'd be nice if dev-util/nvidia-cuda-toolkit:
> * pointed to the above warning about issues with PTX
> * applied the "cuda forward compatibility" package.

What do you suggest?
If we tighten the bound, you'll lose Kepler support.
(In reply to David Seifert from comment #2)
> What do you suggest? If we tighten the bound, you'll lose Kepler support.

The Kepler support is not there, or at least not on par with later GPU families.
I think having a useflag would help.

iiuc, there are 3 "required driver" versions for each CUDA release:
1.) the CUDA release driver. Offers full features on all supported GPUs.

2.) the major version release driver (Y.0); should be enough for Kepler support for 11.x (the 11.0 driver is r450). This setup has possible issues with PTX and/or some of the features mentioned above [0].

3.) forward compatibility; needs installation of the "compat package" (basically the userspace CUDA driver from a later release) [1]. Should provide full features for select GPUs.

I'm not sure if 3. is needed, since it only works for a handful of datacenter SKUs. If needed, it can be handled by having a new "cuda-compat" package to override the userspace libraries installed by "nvidia-drivers".

Having 2. guarded by a useflag (e.g. "cuda-minor-compat") would be helpful. The expected behaviour:

* nvidia-cuda-toolkit[+cuda-minor-compat]: basically the current behaviour, with a post-install message warning about potential issues and a link to [0]
* nvidia-cuda-toolkit[-cuda-minor-compat]: strict dependency on the release driver version (e.g. 520 for 11.8).

Even if the useflag is 'default-on', it would allow users to decide whether to risk PTX incompatibility, which is fine if the workloads use cubins/SASS compiled kernels, or whether they need PTX JIT to work and therefore need a newer driver.

[0] https://docs.nvidia.com/deploy/cuda-compatibility/index.html#application-considerations
[1] https://docs.nvidia.com/deploy/cuda-compatibility/index.html#forward-compatibility-title
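To make the proposed useflag behaviour concrete, here is a rough sketch of the driver bound it would select. It is not ebuild code and not how Portage evaluates dependencies; the table contents are limited to the versions named in this thread (r450 for 11.0, 520.61.05 as the 11.8 release driver), and every function and variable name is hypothetical.

```python
# Assumption: only the versions mentioned in this thread are listed.
RELEASE_DRIVER = {"11.8": "520.61.05"}  # release driver per toolkit version
MAJOR_BASE_DRIVER = {"11": "450"}       # driver of the Y.0 release (11.0 is r450)


def version_tuple(version):
    """Turn '520.61.05' into (520, 61, 5) so versions compare numerically."""
    return tuple(int(part) for part in version.split("."))


def min_driver(toolkit, cuda_minor_compat):
    """Lowest acceptable driver for a toolkit under the proposed useflag."""
    if cuda_minor_compat:
        # Relaxed bound: any driver from the same major release family.
        return MAJOR_BASE_DRIVER[toolkit.split(".")[0]]
    # Strict bound: the driver released together with this toolkit.
    return RELEASE_DRIVER[toolkit]


def driver_ok(installed, toolkit, cuda_minor_compat):
    """True if the installed driver satisfies the selected bound."""
    required = min_driver(toolkit, cuda_minor_compat)
    return version_tuple(installed) >= version_tuple(required)


# Driver 520.56.06 with toolkit 11.8: accepted only with the flag enabled.
print(driver_ok("520.56.06", "11.8", cuda_minor_compat=True))
print(driver_ok("520.56.06", "11.8", cuda_minor_compat=False))
```

With the flag enabled the bound only guarantees that the toolkit installs; PTX JIT can still fail at runtime, which is why the proposal pairs that setting with a post-install warning.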
I think I've run into a similar issue since updating yesterday, which is that nvidia-drivers 520.56.06 is not compatible with 11.7.0 or 11.8.0 from what I can tell.

I used to get information from clinfo but now it shows there are no platforms. It tries to open /dev/nvidia-uvm but fails:

openat(AT_FDCWD, "/dev/nvidia-uvm", O_RDWR|O_CLOEXEC) = -1 EIO (Input/output error)
openat(AT_FDCWD, "/dev/nvidia-uvm", O_RDWR) = -1 EIO (Input/output error)

$ ls -la /dev/nvidia*
crw-rw---- 1 root video 195,   0 Nov  7 00:36 /dev/nvidia0
crw-rw---- 1 root video 195, 255 Nov  7 00:36 /dev/nvidiactl
crw-rw---- 1 root video 195, 254 Nov  7 00:36 /dev/nvidia-modeset
crw-rw-rw- 1 root root  239,   0 Nov  7 17:15 /dev/nvidia-uvm
crw-rw-rw- 1 root root  239,   1 Nov  7 17:15 /dev/nvidia-uvm-tools

/dev/nvidia-caps:
total 0
drwxr-xr-x  2 root tatsh   80 Nov  7 00:36 .
drwxr-xr-x 23 root root  4300 Nov  7 22:44 ..
cr--------  1 root root 242, 1 Nov  7 00:36 nvidia-cap1
cr--r--r--  1 root root 242, 2 Nov  7 00:36 nvidia-cap2
(In reply to Andrew Udvare from comment #4)
> I think I've run into a similar issue since updating yesterday, which is
> that nvidia-drivers 520.56.06 is not compatible with 11.7.0 or 11.8.0 from
> what I can tell.
>
> I used to get information from clinfo but now it shows there are no
> platforms. It tries to open /dev/nvidia-uvm but fails:
> [...]

Sounds like a different issue. Drivers should always be backward compatible (i.e. a newer driver should always work with an older CUDA toolkit).
The release driver for 11.8 is 520.61.05. 11.8 can potentially run into issues with 520.56.06, but 11.7 should work.
Since the signature is different, it sounds like a general bug in 520.56.06.
The issue seems to be OpenCL-related only. It seems everything else works fine with the GPU though I didn't try something like PyTorch. The issue doesn't involve nvidia-cuda-toolkit so I will maybe file another bug.
Please align the CUDA dependencies with Table 3: https://docs.nvidia.com/cuda/cuda-toolkit-release-notes/index.html#abstract

nvidia-cuda-toolkit-11.8.0 should use >=x11-drivers/nvidia-drivers-520.61.05

Without it, I'm getting the following error message:

hashcat -m 1500 XXXX --force

cuLinkAddData(): the provided PTX was compiled with an unsupported toolchain.

* Device #1: Kernel /usr/share/hashcat/OpenCL/shared.cl link failed. Error Log:

ptxas application ptx input, line 9; fatal : Unsupported .version 7.8; current version is '7.7'

* Device #1: Kernel /usr/share/hashcat/OpenCL/shared.cl build failed.
(In reply to Andrew Udvare from comment #6)
> The issue seems to be OpenCL-related only. It seems everything else works
> fine with the GPU though I didn't try something like PyTorch. The issue
> doesn't involve nvidia-cuda-toolkit so I will maybe file another bug.

Failure to open /dev/nvidia-uvm should be accompanied by messages in the kernel log. You can also try loading the uvm module with "uvm_debug_prints=1 uvm_release_asserts=1" to get more information.
Either way, the issue is different from the one in this bug.
(In reply to Jan Vesely from comment #3)
> Having 2. guarded by a useflag (e.g. "cuda-minor-compat") would be helpful.
> The expected behaviour:
>
> * nvidia-cuda-toolkit[+cuda-minor-compat]: basically the current behaviour
> with a post-install message warning about potential issues with a link to [0]
> * nvidia-cuda-toolkit[-cuda-minor-compat]: strict depend on release driver
> version (e.g. 520 for 11.8).
> [...]

I've created a pull request that implements cuda-minor-compat:
https://github.com/gentoo/gentoo/pull/28838
With CUDA 12 in the tree now, I think this bug is obsolete.
The specific instance for 11.7 and r510 might be obsolete, but the generic problem remains: depending on the internal versioning of PTX asm, CUDA applications that need the PTX runtime compiler will fail on a driver older than the release driver of a CUDA toolkit release.
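The generic condition can be written down as a small predicate. This is a deliberately simplified model, not NVIDIA's actual loader logic: it assumes a kernel loads if a matching cubin/SASS image is embedded, and otherwise only if the driver's JIT accepts the PTX version the toolkit emitted. The `Kernel` class and `will_load` function are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Kernel:
    ptx_version: tuple  # PTX ISA version emitted by the toolkit, e.g. (7, 7)
    has_sass: bool      # True if a cubin for the target GPU is embedded


def will_load(kernel, driver_max_ptx):
    """Simplified model of whether the driver can load the kernel."""
    if kernel.has_sass:
        # Precompiled machine code needs no JIT, so the PTX version is moot.
        return True
    # PTX-only: the driver's JIT must understand the emitted PTX version.
    return kernel.ptx_version <= driver_max_ptx


# The original report: a PTX-only build with CUDA 11.7 (PTX 7.7) on a
# driver whose JIT accepts at most PTX 7.6.
print(will_load(Kernel(ptx_version=(7, 7), has_sass=False), (7, 6)))
```

Under this model the original report fails, while the same build with an embedded cubin for the GPU, or with a driver accepting PTX 7.7, would load.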
(In reply to David Seifert from comment #10)
> With CUDA 12 in the tree now, I think this bug is obsolete.

I just hit this bug as VMD was failing to detect CUDA with a combination of x11-drivers/nvidia-drivers-535.161.07 and dev-util/nvidia-cuda-toolkit-12.3.2:

CUDA error: the provided PTX was compiled with an unsupported toolchain., CUDAClearDevice.cu line 54