Bug 667362

Summary:	x11-drivers/nvidia-drivers-410.57 - timeouts when loading modules
Product:	Gentoo Linux	Reporter:	Tomasz Golinski <tomaszg>
Component:	Current packages	Assignee:	David Seifert <soap>
Status:	RESOLVED DUPLICATE
Severity:	normal	CC:	b4b1, boris.bigott, brainkiller_01, eXt, gaboroszkar, gentoo-bugzilla, gentoo-bugzilla, info, ionen, jasmin+gentoo, jazzvoid, josef64, kevin, kuba.iluvatar, limanski, luke, lweinberger42, mateubruno, netbox253, particleflux, regboxemg, rzubaly, simon.haegler, tsebrenko, viper, zbox, zephyrus.271
Priority:	Normal
Version:	unspecified
Hardware:	All
OS:	Linux
See Also:	https://bugs.gentoo.org/show_bug.cgi?id=670340
Whiteboard:
Package list:		Runtime testing required:	---
Attachments:	Patch to workaround problems with udev. Xorg log from venus server. dmesg | grep nvidia

Description Tomasz Golinski 2018-09-30 12:03:31 UTC

After updating to x11-drivers/nvidia-drivers-410.57 my system won't boot correctly anymore. Kernel modules won't load properly and all I get is a black screen. I can't modprobe or rmmod any modules after that, and the system won't even reboot or power down cleanly.

Here's what I see in dmesg (two versions):

-------------------------
Sep 21 13:21:03 wafel kernel: udevd[563]: worker [631] /module/nvidia is taking a long time
Sep 21 13:21:03 wafel kernel: udevd[563]: worker [657] /devices/pci0000:00/0000:00:02.0/0000:01:00.0 is taking a long time

[...]

Sep 21 13:23:01 wafel kernel: udevd[631]: timeout 'nvidia-udev.sh add'
Sep 21 13:23:01 wafel kernel: udevd[631]: slow: 'nvidia-udev.sh add' [825]
Sep 21 13:23:02 wafel kernel: udevd[631]: timeout: killing 'nvidia-udev.sh add' [825]
Sep 21 13:23:02 wafel kernel: udevd[631]: slow: 'nvidia-udev.sh add' [825]
Sep 21 13:23:02 wafel kernel: udevd[631]: 'nvidia-udev.sh add' [825] terminated by signal 9 (Killed)
Sep 21 13:23:05 wafel kernel: udevd[563]: worker [657] /devices/pci0000:00/0000:00:02.0/0000:01:00.0 timeout; kill it
Sep 21 13:23:05 wafel kernel: udevd[563]: seq 975 '/devices/pci0000:00/0000:00:02.0/0000:01:00.0' killed
-------------------------
Sep 30 12:13:01 wafel kernel: udevd[572]: worker [666] /devices/pci0000:00/0000:00:02.0/0000:01:00.0 is taking a long time
Sep 30 12:13:03 wafel /etc/init.d/local[2088]: local: timed out waiting for netmount
Sep 30 12:13:04 wafel kernel: udevd[572]: worker [656] /module/nvidia is taking a long time
Sep 30 12:15:01 wafel kernel: udevd[572]: worker [656] /module/nvidia timeout; kill it
Sep 30 12:15:01 wafel kernel: udevd[572]: seq 1268 '/module/nvidia' killed
Sep 30 12:15:01 wafel kernel: udevd[572]: worker [666] /devices/pci0000:00/0000:00:02.0/0000:01:00.0 timeout; kill it
Sep 30 12:15:01 wafel kernel: udevd[572]: seq 959 '/devices/pci0000:00/0000:00:02.0/0000:01:00.0' killed
Sep 30 12:15:01 wafel kernel: udevd[572]: worker [656] terminated by signal 9 (Killed)
Sep 30 12:15:01 wafel kernel: udevd[572]: worker [656] failed while handling '/module/nvidia'
-------------------------

I was able to work around this problem by manually blacklisting the nvidia module in /etc/modprobe.d/blacklist.conf. Somehow the modules still get loaded and the system works as expected. Thus I suspect it may be a bug in eudev.
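
A minimal sketch of the blacklist step, for anyone who wants to try it. It assumes that a single "blacklist nvidia" line is sufficient; the function name blacklist_nvidia is only illustrative, and the default path is the file mentioned above.

```shell
#!/bin/sh
# Sketch only: append "blacklist nvidia" to a modprobe.d file if it is not
# already there. The function name is illustrative; the default path is the
# one mentioned in this report.
blacklist_nvidia() {
    conf="${1:-/etc/modprobe.d/blacklist.conf}"
    # -x matches the whole line, so a commented-out entry does not count.
    if ! grep -qx 'blacklist nvidia' "$conf" 2>/dev/null; then
        printf 'blacklist nvidia\n' >> "$conf"
    fi
}
```

Run blacklist_nvidia as root without arguments to edit the default file, then reboot. Calling it twice does not add a second line.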

Here's the list of versions of udev-related packages:

equery l *udev*
 * Searching for *udev* ...
[IP-] [  ] dev-libs/libgudev-232:0/0
[IP-] [  ] sys-fs/eudev-3.2.5:0
[IP-] [  ] sys-fs/udev-init-scripts-32:0
[IP-] [  ] virtual/libgudev-232:0/0
[IP-] [  ] virtual/libudev-232:0/1
[IP-] [  ] virtual/udev-217:0

Comment 1 Martijn Schmidt 2018-10-07 14:06:55 UTC

Similar issue for me here. Apart from the nvidia-udev.sh messages, I also got the below:

[  912.061874] udevd[617]: specified group 'render' unknown
[  915.068419] udevd[617]: specified group 'render' unknown

Downgrading from nvidia-drivers-410.57-r1 to nvidia-drivers-396.54 allowed me to boot normally again. Haven't tried blacklisting the module as of yet.

Comment 2 Tomasz Golinski 2018-10-07 15:45:26 UTC

I also get similar messages about groups:

udevd[547]: specified group 'colord' unknown

However I had them before installing nvidia-drivers as well, so I don't think they are related.

Comment 3 Jeroen Roovers (RETIRED) gentoo-dev

2018-10-07 16:28:08 UTC

Thu Nov 23 19:45:07 2017 >>> sys-fs/eudev-3.2.5
 (Currently runing =x11-drivers/nvidia-drivers-410.57-r1, kernel 4.18.9)


If the eudev version mattered I would have noticed a long time ago.

Comment 4 Jeroen Roovers (RETIRED) gentoo-dev

2018-10-07 16:29:40 UTC

(In reply to Jeroen Roovers from comment #3)
> Thu Nov 23 19:45:07 2017 >>> sys-fs/eudev-3.2.5
>  (Currently runing =x11-drivers/nvidia-drivers-410.57-r1, kernel 4.18.9)

*running

Also, I have been running sys-fs/eudev-3.2.6 since before 410 came out:

Tue Sep 18 17:50:07 2018 >>> sys-fs/eudev-3.2.6

Comment 5 Boris Bigott 2018-10-07 20:10:54 UTC

I have similar problems, with the nvidia drivers using sys-fs/udev. Actually, hiccups already started with the previous nvidia drivers and kernel 4.17. With kernel 4.18 it got worse.

My current workaround is to start from a dracut generated initramfs and a small sleep before calling nvidia-smi in /lib/udev/nvidia-udev.sh. Somehow this makes it work.

Comment 6 Christoph Böhmwalder 2018-10-16 09:06:12 UTC

I have the same problems, with the workaround from Boris only working partially. I have to reboot 3 to 5 times before the driver works correctly.

Comment 7 Tomasz Golinski 2018-10-16 09:31:23 UTC

Did you try my workaround? Seems to work each time. On the other hand, I'm on kernel 4.14. 

I don't understand why the eudev reference was removed from the title, as it is definitely tied to eudev. Without it, modules load fine. The problem comes from eudev workers taking too much time.

Comment 8 Boris Bigott 2018-10-16 19:14:15 UTC

Created attachment 551586 [details]
Patch to workaround problems with udev.

@Christoph: Did you add a long enough sleep? See the patch for what I am using. If the sleep is too short, nvidia-smi very likely goes into a 100% busy state. (And if I boot without an initramfs, udev will go into a 100% busy state.) With the workarounds, X starts reliably. The only annoying thing is that I have to move the mouse a little or type something on the keyboard for the login screen to appear faster. If anyone has a nicer workaround, I would be delighted to hear about it.

Comment 9 Jeroen Roovers (RETIRED) gentoo-dev

2018-10-16 22:09:53 UTC

(In reply to Tomasz Golinski from comment #7)
> Did you try my workaround? Seems to work each time. On the other hand, I'm
> on kernel 4.14. 
> 
> I don't understand why the eudev reference was removed from the title, as it is
> definitely tied to eudev. Without it, modules load fine. The problem comes from
> eudev workers taking too much time.

https://bugs.gentoo.org/667362#c4

Because I've been running several eudev versions before and after, and that can obviously not be the problem if it doesn't manifest for me but does manifest for you. So look elsewhere.

Comment 10 Boris Bigott 2018-10-17 08:09:30 UTC

I experimented a bit more. Using Tomasz Golinski's workaround of just blacklisting the nvidia module works for me too. Apparently, the nvidia module will still be loaded when xdm starts, which does not cause hangs at this (later?) point.

Comment 11 Stefan Linke 2018-10-23 18:17:06 UTC

Same problem with:

  [IP-] [  ] sys-fs/eudev-3.2.5:0
  [IP-] [  ] sys-kernel/gentoo-sources-4.18.14:4.18.14
  [IP-] [  ] x11-drivers/nvidia-drivers-410.66:0/410

System boots *sometimes*, but mostly the screen goes black and is completely unresponsive (SysRq works though). The one time I got into a tty it showed nvidia-udev.sh at 100% CPU.

I tried to work around it with that sleep fix, but it does not change much. It seems to boot a little bit more often than without the sleep.

Blacklisting the nvidia module seems to work as a workaround though.

Comment 12 Emanuele A. Bagnaschi (Zephyrus) 2018-10-27 09:33:41 UTC

I can confirm seeing problematic behaviours on my laptops (Quadro K2000M and Quadro M620 Mobile). Interestingly enough, I have two slightly different issues depending on whether I am using the 39x or 41x driver series.

With 39x (and I think even older drivers, I have been having this problem since last March/April - see also the forums), I see the issue only on the K2000M-equipped laptop and it is nvidia-smi that is hanging with 100% CPU usage. I can work around the problem by adding a sleep statement in nvidia-udev.sh, before nvidia-smi is executed (as in the patch from Boris Bigott).

With 41x I have the issue on both laptops and it is not nvidia-smi but rather udevd hanging with 100% CPU usage, as reported by Tomasz Golinski. Adding the sleep command to nvidia-udev.sh does not have any effect, while blacklisting the nvidia* modules and letting Xorg load them seems to work.

Cheers

Comment 13 Henny Coenen 2018-10-29 00:41:06 UTC

Problematic behaviour on lenovo thinkpad p51 as well:

I notice some different symptoms though.

With 39x and kernel <4.19.0 all is well; I experience no problems.

With 41x I get a black screen, loud fan spin (100% utilisation?) and the laptop is totally unresponsive. This happens on all >4.18 kernels.

Blacklisting nvidia modules: black screen, loud fan spin, unresponsive laptop.
nvidia-udev.sh patch: black screen, loud fan spin, unresponsive laptop.

dmesg shows no errors or warnings, X log shows no abnormalities.

However, if the laptop is booted into single-user mode, then right after the "populating /dev" status the system asks for the root password or CTRL+D to resume normal boot. When I press CTRL+D after waiting two seconds, the system works like a charm every single time.

If I press CTRL+D immediately after I see the message: black screen, loud fans and unresponsive laptop.

Comment 14 daichan2017 2018-11-07 11:52:19 UTC

> However..... IF the laptop is booted into single user mode, and right after
> the populating /dev status the system asks for the root password or CTRL+D
> to resume normal boot. When i press CTRL+D after two seconds, the system
> works like a charm every single time.

Having the same issue on a GTX1070 desktop after updating to 410.xx driver (black screen, 100% system load, no response to input), and the single user trick makes it work for me. The patch however does not.

Comment 15 Ivan 2018-11-09 18:19:22 UTC

Please check my workaround here: https://bugs.gentoo.org/670340#c8
It's simple and stupid since my knowledge about udev and gentoo is quite limited, but that worked for me.

Comment 16 Ivan 2018-11-09 18:21:31 UTC

^ Basically it's the same idea as Tomasz proposed (thanks, Tomasz), but it works for me only if I add 'modprobe nvidia_drm' to the local script.
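
A rough sketch of what such a local script could look like on OpenRC. The exact script from bug 670340 is not reproduced in this thread, so the file name nvidia.start and the function name install_late_modprobe are made-up examples; it is meant to be combined with the blacklist entry.

```shell
#!/bin/sh
# Sketch only: create an OpenRC local.d start script that loads nvidia_drm
# late in boot. The file name "nvidia.start" is a made-up example.
install_late_modprobe() {
    dir="${1:-/etc/local.d}"
    script="$dir/nvidia.start"
    printf '#!/bin/sh\nmodprobe nvidia_drm\n' > "$script"
    # The local service only runs *.start files that are executable.
    chmod +x "$script"
}
```

Run install_late_modprobe as root without arguments to write to /etc/local.d. The local service must be enabled in the default runlevel for the script to run.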

Comment 17 Tomasz Golinski 2018-11-27 11:48:21 UTC

I had another problem along these lines. I upgraded the kernel to 4.19.4. Since I intend to migrate to an AMD GPU in the coming days, I enabled a bunch of related kernel options. The result was that my trick of blacklisting nvidia ceased to work. Namely, the nvidia module did load, but X wouldn't start, complaining about missing modules. Indeed, the other nvidia_* modules didn't load and modprobing them would fail (stall for a time, then be killed by udev).

My dirty solution to the problem was to delete the whole /lib/modules/***/drivers/gpu directory. After a reboot (not a clean one, regretfully) the system started normally. Apparently I removed too much: the nvidia_drm module didn't load because it had picked up a dependency on the drm kernel module (which it didn't have in my old 4.14 kernel, as the drm module was not built). The system seems to work fine without it, and I can even reinstall the missing drm modules and modprobe nvidia_drm after X starts. It pulls in a bunch of other modules:

nvidia_drm             40960  0
drm_kms_helper        159744  1 nvidia_drm
syscopyarea            16384  1 drm_kms_helper
sysfillrect            16384  1 drm_kms_helper
sysimgblt              16384  1 drm_kms_helper
fb_sys_fops            16384  1 drm_kms_helper
drm                   352256  3 drm_kms_helper,nvidia_drm
drm_panel_orientation_quirks    16384  1 drm
nvidia_modeset        995328  12 nvidia_drm
nvidia              16691200  530 nvidia_modeset

Comment 18 Ronny Perinke 2018-12-04 22:03:26 UTC

Similar issue here (high cpu usage, module won't unload, no shutdown etc.) with nvidia-drivers 410.78 and 415.18 and gentoo-sources-4.18.x and sys-fs/udev-239. I had to go back to nvidia-drivers 396.54.

Comment 19 Marko Steinberger 2018-12-15 12:13:28 UTC

Confirming the issue for a recently set up Gentoo system.

nvidia-drivers-396.54 or lower versions work. Higher ones give me a black screen. SysRq and remote login are working, but I cannot switch locally to any TTY.

Hardware is brand new. Graphics card is GTX-1050Ti. I think those higher driver versions should get masked again immediately.

Dmesg output:

[ 1000.893204] udevd[3245]: timeout 'nvidia-udev.sh add'
[ 1000.893214] udevd[3245]: slow: 'nvidia-udev.sh add' [3422]
[ 1001.758534] udevd[3207]: worker [3245] /module/nvidia timeout; kill it
[ 1001.758545] udevd[3207]: seq 2321 '/module/nvidia' killed
[ 1001.758549] udevd[3207]: worker [3271] /devices/pci0000:00/0000:00:03.2/0000:1d:00.0 timeout; kill it
[ 1001.758553] udevd[3207]: seq 1934 '/devices/pci0000:00/0000:00:03.2/0000:1d:00.0' killed
[ 1001.758737] udevd[3207]: worker [3245] terminated by signal 9 (Killed)
[ 1001.758739] udevd[3207]: worker [3245] failed while handling '/module/nvidia'

Comment 20 Henny Coenen 2018-12-16 23:20:03 UTC

Taken from Bug 670340:

I was able to run my X server with nvidia-drivers-415.18 just by commenting out the last line, "#options nvidia NVreg_DeviceFileMode=432 NVreg_DeviceFileUID=0 NVreg_DeviceFileGID=27 NVreg_ModifyDeviceFiles=1", in /etc/modprobe.d/nvidia.conf

This at least boots my system into X instead of a black screen and I don't need to use a crippled single user mode boot workaround.
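
A small sketch of that edit as a shell function, in case someone wants to apply it repeatably. The function name disable_nvreg_options is illustrative, and it assumes GNU sed. For reference, mode 432 is decimal for octal 0660.

```shell
#!/bin/sh
# Sketch only: comment out the "options nvidia ... NVreg_ModifyDeviceFiles"
# line in a modprobe.d file. Uses GNU sed's -i (in-place) option.
disable_nvreg_options() {
    conf="${1:-/etc/modprobe.d/nvidia.conf}"
    # Only lines that start with "options nvidia" are touched, so lines
    # that are already commented out stay unchanged.
    sed -i -e '/^options nvidia .*NVreg_ModifyDeviceFiles/ s/^/#/' "$conf"
}
```

Run disable_nvreg_options as root and reboot. Whether the device nodes then get usable permissions still has to be checked separately, for example with ls -l /dev/nvidia*.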


That said... maybe it's time to get this bug the confirmed status, get some packages masked or get upstream involved?

Comment 21 Jeroen Roovers (RETIRED) gentoo-dev

2018-12-17 00:31:32 UTC


*** This bug has been marked as a duplicate of bug 670340 ***

Comment 22 Jeroen Roovers (RETIRED) gentoo-dev

2018-12-17 00:32:08 UTC

*** Bug 670340 has been marked as a duplicate of this bug. ***

Comment 23 Jeroen Roovers (RETIRED) gentoo-dev

2018-12-17 00:34:48 UTC

(In reply to Henny Coenen from comment #20)
> Taken from Bug 670340:
> 
> I was able to run my X server with nvidia-drivers-415.18 just comenting one
> last string "#options nvidia NVreg_DeviceFileMode=432 NVreg_DeviceFileUID=0
> NVreg_DeviceFileGID=27 NVreg_ModifyDeviceFiles=1" in
> /etc/modprobe.d/nvidia.conf
> 
> This at least boots my system into X instead of a black screen and i don't
> need to use a crippled single user mode boot workaround.

Do the device nodes get proper permissions set when you do that? And then could we find a way to better generalise this for distribution?

Comment 24 Fredrik Lingvall 2018-12-19 06:59:17 UTC

Commenting the line:

options nvidia NVreg_DeviceFileMode=432 NVreg_DeviceFileUID=0 NVreg_DeviceFileGID=27 NVreg_ModifyDeviceFiles=1

in /etc/modprobe.d/nvidia.conf did not work on two of my boxes. I still get:

$ dmesg

-snip-

[  184.810576] udevd[2196]: timeout: killing 'nvidia-udev.sh add' [2281]
[  184.810586] udevd[2196]: slow: 'nvidia-udev.sh add' [2281]
[  184.810700] udevd[2196]: 'nvidia-udev.sh add' [2281] terminated by signal 9 (Killed)

-snip-

I had to switch back to the 4.18.16-gentoo kernel, and now also unmask the nvidia-drivers-396.54 driver, to get X running. The machines also do not reboot cleanly.

Comment 25 Fredrik Lingvall 2018-12-19 07:01:46 UTC

This was using:

x11-drivers/nvidia-drivers-415.23
linux-4.19.10-gentoo

/F

Comment 26 Mike Limansky 2018-12-30 14:32:33 UTC

Same issue here. Reproducible with kernel 4.14.83 and both the 410 and 415 drivers. I was able to start X using the workaround from bug 670340, but since the local service is started last, I have to restart xdm manually after each boot.

Comment 27 Alexander Polozov 2019-01-09 05:51:44 UTC

`Curiouser and curiouser!' cried Alice (c)
I updated my nvidia-drivers to 415.25, and after "emerge" + "reboot" I again saw no X, only the console login prompt.
I logged in, looked at Xorg.log (nothing about the nvidia module), changed no config at all, but repeated "emerge" and "reboot", and voilà, I see X (KDE).
Maybe the first "emerge" installed the kernel module incorrectly?

Comment 28 Marko Steinberger 2019-01-10 19:11:30 UTC

Gave it a try after Alexander's comment - no luck. Downgraded again.

Comment 29 Alexander Polozov 2019-01-11 17:56:06 UTC

I repent, I was wrong. After several reboots I found out that my Xorg only starts from time to time.
When it starts, I see in the kernel log:
[drm] Initialized nvidia-drm 0.0.0 20160202 for 0000:01:00.0 on minor 0
[drm] [nvidia-drm] [GPU ID 0x00000100] Loading driver
nvidia 0000:01:00.0: vgaarb: changed VGA decodes: olddecodes=io+mem,decodes=none:owns=io+mem
nvidia-modeset: Loading NVIDIA Kernel Mode Setting Driver for UNIX platforms  415.25  Wed Dec 12 10:02:42 CST 2018
nvidia: module license 'NVIDIA' taints kernel.
NVRM: loading NVIDIA UNIX x86_64 Kernel Module  415.25  Wed Dec 12 10:22:08 CST 2018 (using threaded interrupts)

When it does not start, I see in the kernel log:
nvidia 0000:01:00.0: vgaarb: changed VGA decodes: olddecodes=io+mem,decodes=none:owns=io+mem
nvidia: module license 'NVIDIA' taints kernel.
nvidia-nvlink: Nvlink Core is being initialized, major device number 247
:)

Comment 30 Gottfried Munda 2019-01-16 17:20:38 UTC

I can confirm the nvidia-udev.sh hang/timeout for nvidia-drivers-410 and 415. As others have reported, udev is unable to load the nvidia module and leaves the system in a broken (no clean reboot possible) state.
Kernel: 4.14.83

Fyi: problem occurs with GTX660 as well as with GTX1080 Ti. 

390.87 and 396.54 work fine.

Blacklisting nvidia modules as per Bug 670340 resolves the issue.

I think this needs some attention: Using CUDA with gcc-7 requires at least cuda-9.2, which in turn requires >nvidia-drivers-396.24. However, all ebuilds satisfying this are either masked for removal (396.54) or do not work because of the udev issue (410, 415).
This bug basically prevents people from using CUDA, at least without reverting to gcc-6.

Comment 31 Marko Steinberger 2019-05-10 19:05:16 UTC

Problem persists in 418.56 and 430.09.

Comment 32 Valeriy Malov 2019-06-10 05:57:35 UTC

Seems like I'm able to boot without the modprobe blacklist workaround on 430.14
In fact, using the workaround seems to break systemd 241+ detection of CanGraphical flag for the seat now

Comment 33 H. Peter Pfeufer 2019-06-10 09:23:29 UTC

(In reply to Valeriy Malov from comment #32)
> Seems like I'm able to boot without the modprobe blacklist workaround on
> 430.14
> In fact, using the workaround seems to break systemd 241+ detection of
> CanGraphical flag for the seat now

Does not work for me, udev still hangs itself ...
Blacklisting is still needed.

It would be really nice if this bug were tackled soon.

Comment 34 Sander 2019-07-01 20:35:38 UTC

I tried switching from OpenRC to systemd and that solved the issue for me. Now x11-drivers/nvidia-drivers-430.26 works fine for me.

Comment 35 josef.95 2019-07-05 04:16:45 UTC

Have you tried it with sys-fs/udev instead of sys-fs/eudev ?

Comment 36 Rafal Kupiec 2019-07-23 17:56:55 UTC

wow... almost year-old bug and still not fixed.
I wasted a day for this.

@Jeroen Roovers: any progress on fixing this?

Comment 37 josef.95 2019-07-23 18:13:42 UTC

(In reply to Rafal Kupiec from comment #36)
> wow... almost year-old bug and still not fixed.
> I wasted a day for this.

With which eudev version?

Comment 38 H. Peter Pfeufer 2019-07-23 18:15:05 UTC

(In reply to josef.95 from comment #37)
> (In reply to Rafal Kupiec from comment #36)
> > wow... almost year-old bug and still not fixed.
> > I wasted a day for this.
> 
> With which eudev version?

3.2.5

Comment 39 josef.95 2019-07-23 18:25:56 UTC

(In reply to H.-Peter Pfeufer from comment #38)
 > 3.2.5

Can you please try it with >=sys-fs/eudev-3.2.8
or with latest stable sys-fs/udev ?

Comment 40 H. Peter Pfeufer 2019-07-23 18:27:57 UTC

(In reply to josef.95 from comment #39)
> (In reply to H.-Peter Pfeufer from comment #38)
>  > 3.2.5
> 
> Can you please try it with >=sys-fs/eudev-3.2.8
> or with latest stable sys-fs/udev ?

It's the same with the latest stable sys-fs/udev (which I have on my other machine)

Comment 41 josef.95 2019-07-23 18:37:35 UTC

Sorry no idea then :-/
(I can not reproduce this error on two different machines)

Comment 42 Rafal Kupiec 2019-07-23 18:40:06 UTC

└──> ~ # equery l eudev
 * Searching for eudev ...
[IP-] [  ] sys-fs/eudev-3.2.8:0

Comment 43 Rafal Kupiec 2019-07-23 20:05:06 UTC

Same with sys-fs/udev-242

Comment 44 Rafal Kupiec 2019-07-23 20:20:37 UTC

Brilliant!

Now it is not working for me even with this patch applied...

Comment 45 Vladimir 2019-07-23 22:01:01 UTC

just to confirm that the bug is still here.
100% reproducible, with 2 different video cards:

NVIDIA GPU GeForce 8600 GTS (G84) at PCI:1:0:0 (GPU-0)
NVIDIA GPU GeForce GT 1030 (GP108-A) at PCI:1:0:0 (GPU-0)

Initially I thought that these were issues with newer kernels and nvidia drivers,
so I had ">x11-drivers/nvidia-drivers-396.54" in package.mask and  was
using 4.9 kernel. This combination works.

With blacklisting trick I've got system running fine with 5.2.1 kernel
and nvidia-drivers-430.26

I'm running openrc-0.41.2 and udev from sys-apps/systemd-241-r4

Comment 46 Rafal Kupiec 2019-08-04 18:45:20 UTC

(In reply to Vladimir from comment #45)
> just to confirm that the bug is still here.
> 100% reproducible, with 2 different video cards:
> 
> NVIDIA GPU GeForce 8600 GTS (G84) at PCI:1:0:0 (GPU-0)
> NVIDIA GPU GeForce GT 1030 (GP108-A) at PCI:1:0:0 (GPU-0)
> 
> Initially I thought that this are issues with newer kernels and nvidia
> drivers,
> so I had ">x11-drivers/nvidia-drivers-396.54" in package.mask and  was
> using 4.9 kernel. This combination works.
> 
> With blacklisting trick I've got system running fine with 5.2.1 kernel
> and nvidia-drivers-430.26
> 
> I'm running openrc-0.41.2 and udev from sys-apps/systemd-241-r4

Doesn't blacklisting and loading the modules from local prevent the xdm init script from starting X?

Comment 47 Zentoo 2019-10-31 10:34:21 UTC

I ran into the same problem on a friend's box yesterday while migrating from the initial nouveau driver setup to the nvidia driver, and it drove me crazy.

Before I found your posts on this bugzilla I investigated the problem on the box and found that:
- udev (eudev) launches "nvidia-udev.sh add" and waits forever on it, as it is waiting for nvidia-smi
- the nvidia module is loaded
- nvidia_modeset and nvidia_drm are NOT LOADED
- /dev/nvidiactl exists
- /dev/nvidia0 and /dev/nvidia-modeset DON'T EXIST
- no nvidia in /proc/interrupts !!!
- X was launched properly (no problem at all in Xorg.0.log !!!)
- the screen was black with a fixed "_" VT cursor in the upper right corner
- ssh to the box was OK
- no trace of a framebuffer at all in dmesg (checking for simple framebuffer)

So I decided to unload the nvidia module and then try to load all nvidia modules manually:
- impossible to unload the nvidia module (modprobe says that it is in use)
- lsof | grep nvidia shows that a udevd process was using nvidia.ko

So I decided to stop udev:
- /etc/init.d/udev stop hangs
- impossible to manually kill the faulty udevd process that was using nvidia.ko

Finally, my interpretation of the problem is:
- a udev process hangs forever when loading nvidia.ko, so the device /dev/nvidia0 doesn't exist
- udev launches 'nvidia-udev.sh add', which launches nvidia-smi, which hangs because it finds /dev/nvidiactl but not /dev/nvidia0

So there is a problem when udev creates the nvidia device.

I don't have the problem on another box that uses a simple framebuffer on a 4-core Intel CPU with PCIe 3, while the faulty box uses a 12-core AMD CPU with PCIe 4. I wonder if there is a race condition at boot between the framebuffer and the loading of nvidia-kms, which should be loaded early in the boot process.

Is your framebuffer loaded at boot when the problem occurs?
Which one do you use?
check: dmesg | grep -Ei "fb|frame"
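
To make the module and device node checks above easier to repeat over ssh on an affected box, here is a rough sketch. The function name nvidia_state and its two optional arguments are my own invention for illustration; by default it reads /proc/modules and looks in /dev.

```shell
#!/bin/sh
# Sketch only: report which nvidia modules and device nodes are present.
# Arguments default to the live system; pass other paths to test the logic.
nvidia_state() {
    modules="${1:-/proc/modules}"
    devdir="${2:-/dev}"
    for mod in nvidia nvidia_modeset nvidia_drm; do
        # Each line of /proc/modules starts with the module name and a space.
        if grep -q "^$mod " "$modules" 2>/dev/null; then
            echo "module $mod: loaded"
        else
            echo "module $mod: NOT loaded"
        fi
    done
    for node in nvidiactl nvidia0 nvidia-modeset; do
        if [ -e "$devdir/$node" ]; then
            echo "node $node: present"
        else
            echo "node $node: MISSING"
        fi
    done
}
```

In the hung state described above, this should report only the nvidia module as loaded and only nvidiactl as present.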

Comment 48 Zentoo 2019-10-31 10:40:07 UTC

On a working box at boot you have:

$ dmesg | grep -Ei 'nvidia|frame|fb'
[    0.650063] simple-framebuffer simple-framebuffer.0: framebuffer at 0xc0000000, 0x1680000 bytes, mapped to 0x00000000c5f617d5
[    0.650064] simple-framebuffer simple-framebuffer.0: format=a8r8g8b8, mode=2560x1440x32, linelength=16384
[    0.650077] fbcon: Deferring console take-over
[    0.650078] simple-framebuffer simple-framebuffer.0: fb0: simplefb registered!
[    3.486577] input: HDA NVidia HDMI/DP,pcm=3 as /devices/pci0000:00/0000:00:01.0/0000:01:00.1/sound/card0/input20
[    3.486795] input: HDA NVidia HDMI/DP,pcm=7 as /devices/pci0000:00/0000:00:01.0/0000:01:00.1/sound/card0/input25
[    3.486912] input: HDA NVidia HDMI/DP,pcm=8 as /devices/pci0000:00/0000:00:01.0/0000:01:00.1/sound/card0/input26
[    3.487018] input: HDA NVidia HDMI/DP,pcm=9 as /devices/pci0000:00/0000:00:01.0/0000:01:00.1/sound/card0/input27
[    3.639028]   #0: HDA NVidia at 0xdf080000 irq 17
[    6.526341] fbcon: Taking over console
[    6.526370] Console: switching to colour frame buffer device 320x90
[    7.373957] nvidia: module license 'NVIDIA' taints kernel.
[    7.380844] nvidia-nvlink: Nvlink Core is being initialized, major device number 238
[    7.381093] nvidia 0000:01:00.0: vgaarb: changed VGA decodes: olddecodes=io+mem,decodes=none:owns=io+mem
[    7.481589] NVRM: loading NVIDIA UNIX x86_64 Kernel Module  440.26  Sun Oct 13 18:00:57 UTC 2019
[    7.483988] nvidia-modeset: Loading NVIDIA Kernel Mode Setting Driver for UNIX platforms  440.26  Sun Oct 13 17:39:54 UTC 2019
[    7.484670] [drm] [nvidia-drm] [GPU ID 0x00000100] Loading driver
[    7.484671] [drm] Initialized nvidia-drm 0.0.0 20160202 for 0000:01:00.0 on minor 0

Comment 49 simon 2019-11-25 19:57:11 UTC

I can also confirm that this bug is still there with
openrc-0.41.2
nvidia-drivers-440.31 (I think also 435.21, which I replaced with the latest during investigation)
kernel-5.3.9 and 5.4.0 (gentoo-sources)

I just switched the system from an old Core2Quad 9550 to a Ryzen 3700x while keeping the installation (a system with a generic kernel that supports both) and the same graphics card (GTX660).
As it never happened on the old system, I take this as a strong indicator of a timing issue.

If there are logs or other information I can supply or test, let me know so I can try to help resolve this.

Comment 50 Simone Scanzoni 2019-12-15 04:10:12 UTC

I have the same symptoms here. With a GTX 750 everything worked up to version 396.x; with anything newer I always got the black screen with "_" and no way to get out of that from the keyboard, but it seemed SysRq worked because the filesystems were clean after rebooting with the reset button. IIRC ssh worked too. I have the same problem with a GTX 770, but there I only tested versions 390.x and 440.x. I didn't investigate before today, and I found the same things in dmesg as in the opening post.
I worked around the problem with the GTX 770 (I don't have the 750 at the moment) by blacklisting the nvidia module (I didn't try other solutions). Thanks Tomasz!

Comment 51 Simone Scanzoni 2019-12-15 04:17:44 UTC

(In reply to Simone Scanzoni from comment #50)

I meant that 390.x worked on both GTX 750 and GTX 770, to confirm that this problem arises after series 396.x here too

Comment 52 Marko Steinberger 2020-01-04 18:40:02 UTC

Just tried out the latest nvidia-drivers without any luck. 

I have also given the nouveau driver another chance after a year. It works reliably with the XRender setting in KDE Plasma. 3D effects are limited, however.

Removing myself from CC, as I replaced my graphics board with an AMD Vega.

Good luck all!

Comment 53 Rafal Kupiec 2020-01-04 18:43:21 UTC

What's the progress here? Guys?! When are you going to resolve this issue?!

Comment 54 simon 2020-01-08 17:40:08 UTC

I ran dmesg with and without the blacklist, as suggested by Zentoo (dmesg | grep -Ei "nvidia|fb|frame"):

Failing without blacklisting nvidia modules:
[    0.000000] BIOS-e820: [mem 0x00000000f8000000-0x00000000fbffffff] reserved
[    0.000000] PM: Registered nosave memory: [mem 0xf8000000-0xfbffffff]
[    0.000001] clocksource: tsc-early: mask: 0xffffffffffffffff max_cycles: 0x33cb4391fb6, max_idle_ns: 440795213593 ns
[    0.002016] LSM: Security Framework initializing
[    0.123340] PCI: MMCONFIG for domain 0000 [bus 00-3f] at [mem 0xf8000000-0xfbffffff] (base 0xf8000000)
[    0.123340] PCI: MMCONFIG at [mem 0xf8000000-0xfbffffff] reserved in E820
[    0.356440] pci 0000:09:00.0: BAR 3: assigned to efifb
[    0.453275] system 00:00: [mem 0xf8000000-0xfbffffff] has been reserved
[    0.787009] efifb: probing for efifb
[    0.787018] efifb: framebuffer at 0xf1000000, using 9024k, total 9024k
[    0.787020] efifb: mode is 1920x1200x32, linelength=7680, pages=1
[    0.787021] efifb: scrolling: redraw
[    0.787022] efifb: Truecolor: size=8:8:8:8, shift=24:16:8:0
[    0.790195] Console: switching to colour frame buffer device 240x75
[    0.793295] fb0: EFI VGA frame buffer device
[    0.797358] ahci 0000:07:00.0: flags: 64bit ncq sntf ilck pm led clo only pmp fbs pio slum part sxs 
[    0.797965] ahci 0000:08:00.0: flags: 64bit ncq sntf ilck pm led clo only pmp fbs pio slum part sxs 
[    0.798687] ahci 0000:0c:00.0: flags: 64bit ncq sntf ilck pm led clo only pmp fbs pio slum part 
[    0.800051] ahci 0000:0d:00.0: flags: 64bit ncq sntf ilck pm led clo only pmp fbs pio slum part 
[    2.124399] nvidia: loading out-of-tree module taints kernel.
[    2.124407] nvidia: module license 'NVIDIA' taints kernel.
[    2.132877] nvidia-nvlink: Nvlink Core is being initialized, major device number 245
[    2.137218] nvidia 0000:09:00.0: vgaarb: changed VGA decodes: olddecodes=io+mem,decodes=none:owns=io+mem
[    2.387289] input: HDA NVidia HDMI/DP,pcm=3 as /devices/pci0000:00/0000:00:03.1/0000:09:00.1/sound/card0/input20
[    2.416122] input: HDA NVidia HDMI/DP,pcm=7 as /devices/pci0000:00/0000:00:03.1/0000:09:00.1/sound/card0/input21
[    2.416194] input: HDA NVidia HDMI/DP,pcm=8 as /devices/pci0000:00/0000:00:03.1/0000:09:00.1/sound/card0/input22
[    2.416262] input: HDA NVidia HDMI/DP,pcm=9 as /devices/pci0000:00/0000:00:03.1/0000:09:00.1/sound/card0/input23


When using the blacklisting workaround and successful startup:

[    0.123346] PCI: MMCONFIG for domain 0000 [bus 00-3f] at [mem 0xf8000000-0xfbffffff] (base 0xf8000000)
[    0.123346] PCI: MMCONFIG at [mem 0xf8000000-0xfbffffff] reserved in E820
[    0.356687] pci 0000:09:00.0: BAR 3: assigned to efifb
[    0.454214] system 00:00: [mem 0xf8000000-0xfbffffff] has been reserved
[    0.788525] efifb: probing for efifb
[    0.788533] efifb: framebuffer at 0xf1000000, using 9024k, total 9024k
[    0.788535] efifb: mode is 1920x1200x32, linelength=7680, pages=1
[    0.788536] efifb: scrolling: redraw
[    0.788537] efifb: Truecolor: size=8:8:8:8, shift=24:16:8:0
[    0.791706] Console: switching to colour frame buffer device 240x75
[    0.794802] fb0: EFI VGA frame buffer device
[    0.798895] ahci 0000:07:00.0: flags: 64bit ncq sntf ilck pm led clo only pmp fbs pio slum part sxs 
[    0.799511] ahci 0000:08:00.0: flags: 64bit ncq sntf ilck pm led clo only pmp fbs pio slum part sxs 
[    0.800249] ahci 0000:0c:00.0: flags: 64bit ncq sntf ilck pm led clo only pmp fbs pio slum part 
[    0.801614] ahci 0000:0d:00.0: flags: 64bit ncq sntf ilck pm led clo only pmp fbs pio slum part 
[    2.400370] input: HDA NVidia HDMI/DP,pcm=3 as /devices/pci0000:00/0000:00:03.1/0000:09:00.1/sound/card0/input15
[    2.449119] input: HDA NVidia HDMI/DP,pcm=7 as /devices/pci0000:00/0000:00:03.1/0000:09:00.1/sound/card0/input16
[    2.449178] input: HDA NVidia HDMI/DP,pcm=8 as /devices/pci0000:00/0000:00:03.1/0000:09:00.1/sound/card0/input17
[    2.449235] input: HDA NVidia HDMI/DP,pcm=9 as /devices/pci0000:00/0000:00:03.1/0000:09:00.1/sound/card0/input19
[   11.319173] nvidia: loading out-of-tree module taints kernel.
[   11.319181] nvidia: module license 'NVIDIA' taints kernel.
[   11.327534] nvidia-nvlink: Nvlink Core is being initialized, major device number 245
[   11.327766] nvidia 0000:09:00.0: vgaarb: changed VGA decodes: olddecodes=io+mem,decodes=none:owns=io+mem
[   11.533066] NVRM: loading NVIDIA UNIX x86_64 Kernel Module  440.44  Sun Dec  8 03:38:56 UTC 2019
[   11.759132] caller _nv000908rm+0x1bf/0x1f0 [nvidia] mapping multiple BARs
[   12.334309] nvidia-modeset: Loading NVIDIA Kernel Mode Setting Driver for UNIX platforms  440.44  Sun Dec  8 03:29:48 UTC 2019

I also tried deactivating EFIFB, but the only change was that I had no console output during boot; it still got stuck.

Comment 55 G Msx 2020-01-11 20:54:39 UTC

I was able to get it to work on 440.44-r1 by disabling rc_parallel in /etc/rc.conf

No other workarounds (such as module blacklisting and manually loading them) were applied.

Comment 56 Zentoo 2020-01-18 12:55:18 UTC

(In reply to G Msx from comment #55)
> I was able to get it to work on 440.44-r1 by disabling rc_parallel in
> /etc/rc.conf
> 
> No other workarounds (such as module blacklisting and manually loading them)
> were applied.

That confirms it's a timing issue or race condition at boot.
Disabling rc_parallel is a common solution for this kind of problem.

So maybe it would be possible to fix it by adjusting the daemon init order, runlevels, rc_need, ...
I think the goal is to start nvidia-smi only once the nvidia devices have been correctly populated.

I can't experiment any further on my side, since this problem is not on my system but on a friend's.

Comment 57 Fab 2020-01-18 14:33:40 UTC

(In reply to Zentoo from comment #56)
> (In reply to G Msx from comment #55)
> > I was able to get it to work on 440.44-r1 by disabling rc_parallel in
> > /etc/rc.conf
> > 
> > No other workarounds (such as module blacklisting and manually loading them)
> > were applied.
> 
> That confirms it's a timing issue or race condition at boot.
> Disabling rc_parallel is common solution for this kind of problem.
> 

I'm sorry, but what do you mean by «disabling rc_parallel»?
Here, on two different systems with different driver versions, rc_parallel is and has always been commented out:
> #rc_parallel="NO"

The only working workaround on both systems is blacklisting the nvidia drivers:
> $ cat /etc/modprobe.d/blacklist.conf 
> blacklist nvidia
> blacklist nvidia_drm
> blacklist nvidia_modeset
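
A commented-out rc_parallel line means the default applies, which later comments in this thread report as NO. As a rough sketch (not an OpenRC tool; the function name and the sample text are made up for illustration), the effective value could be checked like this:

```python
import re

def effective_rc_parallel(rc_conf_text, default="NO"):
    """Return the last uncommented rc_parallel value, or the default.

    Assumption: rc.conf uses plain shell-style assignments and the
    default is NO, as reported elsewhere in this thread.
    """
    value = default
    for line in rc_conf_text.splitlines():
        stripped = line.strip()
        # Commented-out lines such as '#rc_parallel="NO"' have no effect.
        if not stripped or stripped.startswith("#"):
            continue
        match = re.match(r"""rc_parallel=["']?([^"'#\s]*)""", stripped)
        if match:
            value = match.group(1)
    return value.upper()

# Hypothetical rc.conf excerpt with the setting commented out.
sample = '#rc_parallel="NO"\nrc_logger="YES"\n'
print(effective_rc_parallel(sample))
```

On a real system the text would come from /etc/rc.conf; a result of NO means there is nothing left to "disable".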

Comment 58 I am 2020-02-26 07:51:53 UTC

I have had the same problem for months!

Neither RC_PARALLEL=NO nor blacklisting the nvidia modules helps.

I have to restart xdm after logging in on the console. Otherwise I get a blank screen with a blinking cursor.

Comment 59 gletonai 2020-03-03 21:27:59 UTC

Has anyone contacted NVIDIA about this?

Comment 60 gentoo-bugzilla 2020-04-04 20:03:36 UTC

I also had this problem for a while and worked around it by staying on the 390 series of the nvidia drivers. But this was no longer an option, so I also dug into this problem a bit.

It's definitely a race condition. Loading the driver after boot works perfectly.

The problem was solved for me by adding nvidia and nvidia-drm to

/etc/conf.d/modules
modules="atlantic nvidia nvidia-drm"

(atlantic is my network card driver -> unrelated).

... which changed the order of execution.

As always with race conditions, this reduced the race probability for my system sufficiently. But this may be different for other systems.
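
To double-check the order given in that file, something like the following sketch could be used. It assumes that the modules service loads the names in the order they are listed, which I have not verified; the helper names are made up for illustration.

```python
import shlex

def modules_from_conf(conf_text):
    """Return the module names from the last uncommented modules="..." line."""
    modules = []
    for line in conf_text.splitlines():
        line = line.strip()
        if line.startswith("modules="):
            # shlex strips the quotes and any trailing '# comment'.
            tokens = shlex.split(line[len("modules="):], comments=True)
            modules = " ".join(tokens).split()
    return modules

def loads_before(modules, first, second):
    """True if both modules are listed and 'first' comes before 'second'."""
    return (first in modules and second in modules
            and modules.index(first) < modules.index(second))

# The line from my /etc/conf.d/modules, as quoted above.
conf = 'modules="atlantic nvidia nvidia-drm"\n'
listed = modules_from_conf(conf)
print(listed, loads_before(listed, "nvidia", "nvidia-drm"))
```

This only inspects the configured order; it says nothing about whether the race is actually avoided on a given system.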

Comment 61 I am 2020-04-19 11:22:03 UTC

For now, I've fixed this on my system with:

nvidia-drivers-440.82
gentoo-sources-5.6.5
mesa-20.0.4-1 (with libglvnd)

I removed everything from "modprobe.d/blacklist.conf" and set rc_parallel="NO" in /etc/rc.conf.

Now the module gets inserted at boot and xdm is able to start X11 without a segmentation fault.

Comment 62 Ladislav Zitka 2020-06-01 15:26:42 UTC

Created attachment 643042 [details]
Xorg log from venus server.

This is the Xorg.0.log from my server, with the component versions listed below. I first tried the stable tree, which failed as well; I have now tried ~amd64 with the same result. I applied the blacklist; no rc.conf change is needed, as rc_parallel defaults to NO.

Here are the versions:
media-libs/mesa=19.3.5
x11-base/xorg-server-1.20.7
x11-drivers/nvidia-drivers-440.82-r3

Comment 63 Ladislav Zitka 2020-06-01 15:29:49 UTC

Created attachment 643044 [details]
dmesg |grep nvidia

This is the output of dmesg |grep nvidia, which might also be useful for identifying the issue.

Comment 64 Constantin Runge 2020-11-15 13:04:08 UTC

I have a similar problem with nvidia-smi, and I'm not entirely sure whether it's the same one.

My system has
- Two graphics cards: an nvidia gt 730 and an amd radeon rx 580 (or rx 590)
- The display is connected to the amd one, the nvidia one just provides CUDA
- x11-drivers/nvidia-drivers-455.38
- sys-fs/udev-246-r1

I observed that
- the nvidia-smi command (executed by /etc/X11/xinit/xinitrc.d/95-nvidia-settings) hung and prevented X from starting properly
- when I commented out the nvidia-smi call in the script, X was able to start, but I still saw one nvidia-smi process running in htop (process state D)
- I also saw one thread of /lib/systemd/systemd-udevd which
  - used 100 % of one cpu (process state R)
  - was not killable using SIGTERM or SIGKILL
  - was not attachable using strace -p (strace just said it would attach to the process and then hung; it no longer responded to Ctrl+C and had to be killed with kill -9 <PID>)
  - was not killable by killing its parent systemd-udevd (the parent got killed, all other children got killed, but the specific child hogging the cpu lived on)
- after some time, I noticed that the cpu-intensive systemd-udevd process was gone. At roughly the same time, the nvidia-smi process also exited, and the message 'NVRM: loading NVIDIA UNIX x86_64 Kernel Module  455.38  Thu Oct 22 06:06:59 UTC 2020' appeared in dmesg

I am wondering whether this behavior is caused by the same problem or whether I should create a new bug for it.
I'd be glad for any hints.
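
In case it helps others compare symptoms: instead of watching htop, processes stuck in state D can be listed with a small script. This is only a sketch, assuming the usual Linux /proc/<pid>/stat layout ("pid (comm) state ..."); the function names are made up.

```python
import os

def parse_stat(stat_line):
    """Split a /proc/<pid>/stat line into (pid, comm, state).

    comm may itself contain spaces or parentheses, so split on the
    last ')' rather than on whitespace.
    """
    left, _, rest = stat_line.rpartition(")")
    pid_text, _, comm = left.partition("(")
    return int(pid_text), comm, rest.split()[0]

def processes_in_state(wanted="D", proc_root="/proc"):
    """Return (pid, comm) for every process currently in the wanted state."""
    found = []
    if not os.path.isdir(proc_root):
        return found
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue
        try:
            with open(os.path.join(proc_root, entry, "stat")) as handle:
                pid, comm, state = parse_stat(handle.read())
        except (OSError, ValueError, IndexError):
            continue  # process exited in the meantime or is unreadable
        if state == wanted:
            found.append((pid, comm))
    return found

# D = uninterruptible sleep, the state I saw for nvidia-smi.
for pid, comm in processes_in_state("D"):
    print(pid, comm)
```

Calling processes_in_state("R") instead lists the running processes, which is where the spinning systemd-udevd worker showed up for me.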

Comment 65 Rahil Bhimjiani 2020-11-24 14:48:45 UTC

(In reply to Constantin Runge from comment #64)
> I have a similar problem with nvidia-smi and I'm not entirely sure, if it's
> the same one.
> 
> My system has
> - Two graphics cards: an nvidia gt 730 and an amd radeon rx 580 (or rx 590)
> - The display is connected to the amd one, the nvidia one just provides CUDA
> - x11-driver/nvidia-drivers-455.38
> - sys-fs/udev-246-r1
> 
> I observed, that
> - the nvidia-smi command (executed by
> /etc/X11/xinit/xinitrc.d/95-nvidia-settings) hanged and prevented X to start
> properly
> - When commenting out the nvidia-smi call from the script, X was able to
> start, but I still saw one nvidia-smi process running in htop (process state
> D)
> - I also saw one thread of /lib/systemd/systemd-udevd which
>   - used 100 % of one cpu (process state R)
>   - was not killable using SIGTERM or SIGKILL
>   - was not attachable using strace -p (strace just said, it will attach to
> the process and then hang up. strace didn't respond to Ctrl+C anymore and
> had to be killed with kill -9 <PID>)
>   - was not killable by killing its parent systemd-udevd (the parent got
> killed, all other children got killed, the specific child clogging the cpu
> lived on)
> - After some time, I noticed that the cpu intensive systemd-udevd process
> was gone. At roughly the same time, also the nvidia-smi process exited. Also
> at roughly the same time the message 'NVRM: loading NVIDIA UNIX x86_64
> Kernel Module  455.38  Thu Oct 22 06:06:59 UTC 2020' appeared on dmesg
> 
> I am wondering, if this behavior is caused by the same problem or if I
> should create a new bug for this.
> I'd be glad for any hints.

I can confirm it's happening to me too. I've noticed that when this occurs, only the nvidia module gets loaded; nvidia-drm and nvidia-modeset are not loaded, and nvidia-smi doesn't show any output. I thought I was the only one, since I had started messing with my kernel config.

Comment 66 kartebi 2021-02-20 03:48:43 UTC

I think it's -fomit-frame-pointer.
I deleted /lib/modules and rebuilt the kernel and nvidia-drivers without it; everything is back to normal...

Comment 67 kartebi 2021-02-24 12:39:58 UTC

(In reply to kartebi from comment #66)
> I think its -fomit-frame-pointer
> deleted /lib/modules, rebuilding kernel and nvidia-drivers without it
> everything back to normal...

An update on the situation: I think I found the problem.
Look here:
https://bugs.gentoo.org/454740#c31

Comment 68 Ionen Wolkens gentoo-dev 2021-03-04 21:27:26 UTC

The likely removal of nvidia-udev.sh will hopefully solve those.

*** This bug has been marked as a duplicate of bug 454740 ***