Bug 663234 - >=x11-drivers/nvidia-drivers-396 should check for CONFIG_NUMA to enable CUDA library usage
Status: RESOLVED OBSOLETE
Alias: None
Product: Gentoo Linux
Classification: Unclassified
Component: Current packages
Hardware: All Linux
Importance: Normal normal
Assignee: David Seifert
Reported: 2018-08-09 14:45 UTC by Timo Rothenpieler
Modified: 2021-03-06 08:01 UTC
CC List: 9 users

Description Timo Rothenpieler 2018-08-09 14:45:01 UTC
I discovered this today after quite some digging.
The CUDA library tries to access entries under /sys/devices/system/node, a directory that only exists if the kernel is built with CONFIG_NUMA (and possibly other options such as CONFIG_ACPI_NUMA; I'm not sure about the exact requirements).

If /sys/devices/system/node does not exist, it just bails out early during cuInit, rendering all CUDA applications broken.

I'm not sure how to enforce this, as there is no USE flag for CUDA.
Maybe it's time to introduce one and move the cuda/cuvid libraries to it? Right now they are bound to the X USE flag, which is technically incorrect, as CUDA does not need X at all to work.
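
For illustration, a minimal sketch of the check in question (this is not the library's internal code, only assuming it needs /sys/devices/system/node to exist; it just reports whether that directory is there on a given kernel):

#include <stdio.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <unistd.h>

int main(void) {
    struct stat st;
    /* CONFIG_NUMA is what makes the kernel expose this directory */
    if (stat("/sys/devices/system/node", &st) == 0 && S_ISDIR(st.st_mode))
        printf("NUMA sysfs topology present\n");
    else
        printf("/sys/devices/system/node is missing; cuInit() will likely fail\n");
    return 0;
}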
Comment 1 Risto A. Paju 2018-08-11 16:21:07 UTC
I can confirm the need for CONFIG_NUMA to enable CUDA in nvidia-drivers-396.51, using Linux 4.17.14. But in addition to NUMA, I also found that it needs some CGROUPS configuration; for example, these work:

CONFIG_CGROUPS=y
CONFIG_CGROUP_SCHED=y
CONFIG_CGROUP_DEVICE=y
CONFIG_CGROUP_CPUACCT=y

As with CONFIG_NUMA, I'm not sure if all of these are necessary, but CUDA would not work with CONFIG_CGROUPS=n. (This set of CGROUPS config comes from Gentoo livedvd-amd64-multilib-20160704.)

The same CUDA issue already appeared with nvidia-drivers-396.45, but I have not tested these options with that driver, as the latest one now works.

nvidia-drivers-396.24-r1 was the last one that worked for me without this extra kernel config.
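
For anyone checking which of these options their running kernel actually has: assuming the kernel was built with CONFIG_IKCONFIG_PROC so that /proc/config.gz exists, a minimal C sketch (build with -lz) that filters the relevant lines might look like this; otherwise grepping /usr/src/linux/.config works just as well:

#include <stdio.h>
#include <string.h>
#include <zlib.h>

int main(void) {
    /* Options mentioned so far in this bug; adjust as needed. */
    const char *wanted[] = { "CONFIG_NUMA=", "CONFIG_CGROUPS=",
                             "CONFIG_CGROUP_SCHED=", "CONFIG_CGROUP_DEVICE=",
                             "CONFIG_CGROUP_CPUACCT=", NULL };
    char line[512];
    gzFile f = gzopen("/proc/config.gz", "rb");
    if (!f) {
        fprintf(stderr, "cannot open /proc/config.gz (CONFIG_IKCONFIG_PROC not set?)\n");
        return 1;
    }
    while (gzgets(f, line, (int)sizeof line)) {
        for (int i = 0; wanted[i]; i++)
            if (strncmp(line, wanted[i], strlen(wanted[i])) == 0)
                fputs(line, stdout);
    }
    gzclose(f);
    return 0;
}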
Comment 2 Chiitoo gentoo-dev 2018-08-12 23:32:37 UTC
I wonder if 'nvenc' being broken is related to this.

Starting with 'x11-drivers/nvidia-drivers-396.45', I'm getting this:

[h264_nvenc @ 0x55ce5cafe4c0] Cannot init CUDA
warning: [NVENC encoder: 'simple_h264_stream'] Failed to open NVENC codec: Unknown error occurred

I tried

CONFIG_NUMA=y
CONFIG_ACPI_NUMA=y
CONFIG_CGROUPS=y
CONFIG_CGROUP_SCHED=y
CONFIG_CGROUP_DEVICE=y
CONFIG_CGROUP_CPUACCT=y

but that did not help in my case.

I noticed that 'nvidia_uvm' isn't being loaded automatically like it does with 396.24-r1, but loading it manually didn't change anything either as far as I can tell.

Also found this: http://www.ffmpeg-archive.org/Nvenc-Fails-with-Cannot-Init-CUDA-td4683921.html
Comment 3 Timo Rothenpieler 2018-08-12 23:39:06 UTC
That's not an nvenc issue. It's the same cuInit failure this bug is about, as nvenc runs on top of CUDA.
It's working fine for me on 396.51 after I enabled NUMA in my kernel.

To figure out what the issue was, I compiled a minimal C program:

int main() { printf("%d\n", cuInit(0)); return 0; }

If it doesn't print 0, it failed. Then I ran it under strace to see what it was doing, and saw it trying to access NUMA entries in /sys that weren't there.
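
For reference, a complete, compilable version of that one-liner (assuming the CUDA driver API header from the CUDA toolkit is installed; build with something like gcc test.c -o test -lcuda):

#include <stdio.h>
#include <cuda.h>   /* CUDA driver API header */

int main(void) {
    /* cuInit() returns CUDA_SUCCESS (0) when the driver initialises;
       on the affected kernels it returns a non-zero error code instead. */
    CUresult r = cuInit(0);
    printf("%d\n", (int)r);
    return (r == CUDA_SUCCESS) ? 0 : 1;
}

Running it under strace, as described above, shows the accesses under /sys/devices/system/node.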
Comment 4 LP 2018-08-26 23:29:22 UTC
(In reply to Timo Rothenpieler from comment #3)
> That's not an nvenc issue. It's the same cuInit failure this bug is about,
> as nvenc runs on top of CUDA.
> It's working fine for me on 396.51 after I enabled NUMA in my kernel.
> 

Nvidia seems to be aware of the problem, but how and when they will fix it is uncertain.

Here is the relevant link for the Nvidia forum:

https://devtalk.nvidia.com/default/topic/1038207/please-lift-numa-dependency-of-cuda-or-provide-a-test-for-it-in-the-installer-and-kernel-module
Comment 5 Adam Jones 2018-08-31 07:56:48 UTC
I've enabled the various NUMA, ZONE_DEVICE, HMM and CGROUPS options in my kernel config, and Timo's cuInit() test program still returns error 999 on my system.

Looking at the strace logs, the only obvious errors seem to be:

openat(AT_FDCWD, "/dev/shm/cuda_injection_path_shm", O_RDWR|O_NOFOLLOW|O_CLOEXEC) = -1 ENOENT (No such file or directory)

connect(3, {sa_family=AF_UNIX, sun_path="/tmp/nvidia-mps/control"}, 26) = -1 ENOENT (No such file or directory)

and, later:

ioctl(-1, _IOC(0, 0, 0x2, 0x3000), 0)   = -1 EBADF (Bad file descriptor)

(which rather suggests it's trying an ioctl on a file descriptor that it hasn't checked for errors...)

I suspect the MPS file is ignorable, as it's just checking to see if that daemon is running.
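
For what it's worth, the EBADF line is consistent with the following pattern (purely illustrative; the device path below is a placeholder, since the trace snippet doesn't show what the library actually tried to open):

#include <stdio.h>
#include <string.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/ioctl.h>

int main(void) {
    /* On a typical system this open() fails and returns -1. */
    int fd = open("/dev/some-missing-device", O_RDWR);
    /* If the open() result is never checked, the next ioctl() is issued on
       fd == -1 and fails with EBADF, matching the trace above. 0x3000 is
       just a placeholder request value. */
    if (ioctl(fd, 0x3000UL, 0) == -1)
        printf("ioctl: %s (fd=%d)\n", strerror(errno), fd);
    return 0;
}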
Comment 6 Timo Rothenpieler 2018-08-31 08:42:12 UTC
Both these errors are also in my strace log of a successful init.
Did you rebuild the nvidia driver after re-building your kernel with those changes? I suspect that might be required for some reason.
Comment 7 Adam Jones 2018-08-31 08:47:11 UTC
Pretty sure I've rebuilt the driver, yes, as I've gone up a few point releases since changing the config.

I do notice that modprobing the nvidia-uvm driver doesn't create any device nodes (I have to use the nvidia-modprobe script to do that), and I just get one line of output from it:

nvidia-uvm: Loaded the UVM driver in 8 mode, major device number 243

Not sure if it's normally meant to report more?

My dmesg at the point of loading the nvidia modules looks like:

[   15.444207] nvidia: loading out-of-tree module taints kernel.
[   15.445293] nvidia: module license 'NVIDIA' taints kernel.
[   15.446298] Disabling lock debugging due to kernel taint
[   15.458265] nvidia-nvlink: Nvlink Core is being initialized, major device number 244
[   15.459617] nvidia 0000:01:00.0: vgaarb: changed VGA decodes: olddecodes=io+mem,decodes=none:owns=io+mem
[   15.460741] NVRM: loading NVIDIA UNIX x86_64 Kernel Module  396.54  Tue Aug 14 19:02:34 PDT 2018 (using threaded interrupts)
[   15.483862] nvidia-uvm: Loaded the UVM driver in 8 mode, major device number 243
[   15.491466] nvidia-modeset: Loading NVIDIA Kernel Mode Setting Driver for UNIX platforms  396.54  Tue Aug 14 23:08:44 PDT 2018
[   15.494982] [drm] [nvidia-drm] [GPU ID 0x00000100] Loading driver
[   15.494984] [drm] Initialized nvidia-drm 0.0.0 20160202 for 0000:01:00.0 on minor 0
Comment 8 Xepher 2018-09-08 04:01:10 UTC
I just spent about 6 hours tracking this down, manually bisecting kernel configs between working and non-working ones. It appears that, at the very least, CONFIG_NUMA and CONFIG_CPUSETS are now required to get CUDA working with >396.24.

CONFIG_ACPI_NUMA is NOT required for my (Intel) system, nor is the AMD_NUMA option. It is possible that other CGROUP options are required as well, but those weren't in my initial bisection.


/dev/shm/cuda_injection_path_shm is a red herring. It "always" fails, even on working systems, as does the /tmp/nvidia-mps/control socket open.


Currently working, kernel 4.18.6 with 396.54 driver. Using ffmpeg to do NVENC and CUVID.
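
To summarise the kernel options reported in this bug so far (NUMA and CPUSETS appear to be the minimal set per the bisection above; the CGROUP entries are from comment #1 and may not all be strictly required, and CONFIG_CPUSETS itself depends on CONFIG_CGROUPS):

CONFIG_NUMA=y
CONFIG_CPUSETS=y
CONFIG_CGROUPS=y
CONFIG_CGROUP_SCHED=y
CONFIG_CGROUP_DEVICE=y
CONFIG_CGROUP_CPUACCT=y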
Comment 9 Thomas Albers 2018-09-11 14:35:11 UTC
Responding to the description:
> there is no USE flag for CUDA

In a way there is: for CUDA to work, the driver must be built with the UVM USE flag.
Comment 10 Adam Jones 2018-09-12 21:19:09 UTC
As per https://devtalk.nvidia.com/default/topic/1037521/linux/cuda-broken-in-396-24-02-and-396-24-10-vulkan-beta-drivers-on-linux/3 and Xepher's suggestion above, enabling CONFIG_CPUSETS seems to have fixed things for me.
Comment 11 Ionen Wolkens gentoo-dev 2021-03-06 08:01:23 UTC
I think(?) this is obsolete: without NUMA or CPUSETS, NVENC seems to work fine, and NVIDIA mentioned this dependency was not intentional.

I imagine they changed this on their end.