After updating to x11-drivers/nvidia-drivers-410.57 my system won't boot correctly anymore. Kernel modules won't load properly and all I get is a black screen. I can't modprobe or rmmod any modules after that, and the system won't even reboot or power down cleanly.

Here's what I see in dmesg (two versions):
-------------------------
Sep 21 13:21:03 wafel kernel: udevd[563]: worker [631] /module/nvidia is taking a long time
Sep 21 13:21:03 wafel kernel: udevd[563]: worker [657] /devices/pci0000:00/0000:00:02.0/0000:01:00.0 is taking a long time
[...]
Sep 21 13:23:01 wafel kernel: udevd[631]: timeout 'nvidia-udev.sh add'
Sep 21 13:23:01 wafel kernel: udevd[631]: slow: 'nvidia-udev.sh add' [825]
Sep 21 13:23:02 wafel kernel: udevd[631]: timeout: killing 'nvidia-udev.sh add' [825]
Sep 21 13:23:02 wafel kernel: udevd[631]: slow: 'nvidia-udev.sh add' [825]
Sep 21 13:23:02 wafel kernel: udevd[631]: 'nvidia-udev.sh add' [825] terminated by signal 9 (Killed)
Sep 21 13:23:05 wafel kernel: udevd[563]: worker [657] /devices/pci0000:00/0000:00:02.0/0000:01:00.0 timeout; kill it
Sep 21 13:23:05 wafel kernel: udevd[563]: seq 975 '/devices/pci0000:00/0000:00:02.0/0000:01:00.0' killed
-------------------------
Sep 30 12:13:01 wafel kernel: udevd[572]: worker [666] /devices/pci0000:00/0000:00:02.0/0000:01:00.0 is taking a long time
Sep 30 12:13:03 wafel /etc/init.d/local[2088]: local: timed out waiting for netmount
Sep 30 12:13:04 wafel kernel: udevd[572]: worker [656] /module/nvidia is taking a long time
Sep 30 12:15:01 wafel kernel: udevd[572]: worker [656] /module/nvidia timeout; kill it
Sep 30 12:15:01 wafel kernel: udevd[572]: seq 1268 '/module/nvidia' killed
Sep 30 12:15:01 wafel kernel: udevd[572]: worker [666] /devices/pci0000:00/0000:00:02.0/0000:01:00.0 timeout; kill it
Sep 30 12:15:01 wafel kernel: udevd[572]: seq 959 '/devices/pci0000:00/0000:00:02.0/0000:01:00.0' killed
Sep 30 12:15:01 wafel kernel: udevd[572]: worker [656] terminated by signal 9 (Killed)
Sep 30 12:15:01 wafel kernel: udevd[572]: worker [656] failed while handling '/module/nvidia'
-------------------------

I was able to work around this problem by manually blacklisting the nvidia module in /etc/modprobe.d/blacklist.conf. Somehow the modules still got loaded and the system works as expected. Thus I suspect it may be a bug in eudev.

Here's the list of versions of udev-related packages:

equery l *udev*
 * Searching for *udev* ...
[IP-] [ ] dev-libs/libgudev-232:0/0
[IP-] [ ] sys-fs/eudev-3.2.5:0
[IP-] [ ] sys-fs/udev-init-scripts-32:0
[IP-] [ ] virtual/libgudev-232:0/0
[IP-] [ ] virtual/libudev-232:0/1
[IP-] [ ] virtual/udev-217:0
Similar issue for me here. Apart from the nvidia-udev.sh messages, I also got the below:

[ 912.061874] udevd[617]: specified group 'render' unknown
[ 915.068419] udevd[617]: specified group 'render' unknown

Downgrading from nvidia-drivers-410.57-r1 to nvidia-drivers-396.54 allowed me to boot normally again. I haven't tried blacklisting the module yet.
I also get similar messages about groups: udevd[547]: specified group 'colord' unknown However I had them before installing nvidia-drivers as well, so I don't think they are related.
Thu Nov 23 19:45:07 2017 >>> sys-fs/eudev-3.2.5 (Currently runing =x11-drivers/nvidia-drivers-410.57-r1, kernel 4.18.9) If the eudev version mattered I would have noticed a long time ago.
(In reply to Jeroen Roovers from comment #3) > Thu Nov 23 19:45:07 2017 >>> sys-fs/eudev-3.2.5 > (Currently runing =x11-drivers/nvidia-drivers-410.57-r1, kernel 4.18.9) *running Also, I have been running sys-fs/eudev-3.2.6 since before 410 came out: Tue Sep 18 17:50:07 2018 >>> sys-fs/eudev-3.2.6
I have similar problems, with the nvidia drivers using sys-fs/udev. Actually, hiccups already started with the previous nvidia drivers and kernel 4.17. With kernel 4.18 it got worse. My current workaround is to start from a dracut generated initramfs and a small sleep before calling nvidia-smi in /lib/udev/nvidia-udev.sh. Somehow this makes it work.
I have the same problems, with the workaround from Boris only working partially. I have to reboot 3 to 5 times before the driver works correctly.
Did you try my workaround? It seems to work each time. On the other hand, I'm on kernel 4.14. I don't understand why the eudev reference was removed from the title, as the problem is definitely tied to eudev. Without it, modules load fine. The problem comes from eudev workers taking too much time.
Created attachment 551586 [details] Patch to workaround problems with udev. @Christoph: Did you add a long enough sleep? See the patch for what I am using. If the sleep is too short, nvidia-smi very likely goes into a 100% busy state. (And if I boot without an initramfs, udev will go into a 100% busy state.) With the workarounds, X starts reliably. The only annoying thing is that I have to move the mouse a little or type something on the keyboard for the login screen to appear faster. If anyone has a nicer workaround, I would be delighted to hear about it.
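A fixed sleep has to be tuned per machine, as the comment above notes. One alternative is to poll until the kernel reports the module as fully initialised before nvidia-smi is called. The following is only a sketch under assumptions: `wait_module_live` is a hypothetical helper, not part of the attached patch or of the shipped nvidia-udev.sh, and whether the "Live" state in /proc/modules is the right condition to wait for has not been confirmed in this bug.

```sh
#!/bin/sh
# Hypothetical helper (not from the attached patch): poll until a module is
# reported as "Live" in /proc/modules, or give up after TIMEOUT seconds.
# Usage: wait_module_live NAME TIMEOUT_SECONDS [MODULES_FILE]
wait_module_live() {
    name=$1
    timeout=$2
    file=${3:-/proc/modules}
    elapsed=0
    while [ "$elapsed" -le "$timeout" ]; do
        # /proc/modules columns: name size refcount users state address
        if awk -v m="$name" '$1 == m && $5 == "Live" { found = 1 } END { exit !found }' "$file" 2>/dev/null; then
            return 0
        fi
        if [ "$elapsed" -eq "$timeout" ]; then
            break
        fi
        sleep 1
        elapsed=$((elapsed + 1))
    done
    return 1
}
```

In the udev helper script this could replace the plain sleep, for example `wait_module_live nvidia 10` placed before the nvidia-smi call. The 10-second limit is an arbitrary choice.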
(In reply to Tomasz Golinski from comment #7) > Did you try my workaround? Seems to work each time. On the other hand, I'm > on kernel 4.14. > > I don't understand why eudev reference was related from the title, as it is > definitely tied to eudev. Without it, modules load fine. Problem comes from > eudev workers taking too much time. https://bugs.gentoo.org/667362#c4 Because I've been running several eudev versions before and after, and that can obviously not be the problem if it doesn't manifest for me but does manifest for you. So look elsewhere.
I experimented a bit more. Using Tomasz Golinski's workaround of just blacklisting the nvidia module works for me too. Apparently, the nvidia module will still be loaded when xdm starts, which does not cause hangs at this (later?) point.
Same problem with:

[IP-] [ ] sys-fs/eudev-3.2.5:0
[IP-] [ ] sys-kernel/gentoo-sources-4.18.14:4.18.14
[IP-] [ ] x11-drivers/nvidia-drivers-410.66:0/410

The system boots *sometimes*, but mostly the screen goes black and is completely unresponsive (SysRq works though). The one time I got into a tty, nvidia-udev.sh was at 100% CPU.

I tried to work around it with the sleep fix, but it does not change much. It seems to boot a little more often than without the sleep. Blacklisting the nvidia module seems to work as a workaround though.
I can confirm seeing problematic behaviour on my laptops (Quadro K2000M and Quadro M620 Mobile). Interestingly enough, I have two slightly different issues depending on whether I am using the 39x or 41x driver series. With 39x (and I think even older drivers; I have been having this problem since last March/April, see also the forums), I see the issue only on the K2000M-equipped laptop, and it is nvidia-smi that hangs with 100% CPU usage. I can work around the problem by adding a sleep statement in nvidia-udev.sh before nvidia-smi is executed (as in the patch from Boris Bigott). With 41x I have the issue on both laptops, and it is not nvidia-smi but rather udevd hanging with 100% CPU usage, as reported by Tomasz Golinski. Adding the sleep command to nvidia-udev.sh does not have any effect, while blacklisting the nvidia* modules and letting Xorg load them seems to work. Cheers
Problematic behaviour on a Lenovo ThinkPad P51 as well. I notice some different symptoms, though.

With 39x and kernel <4.19.0 all is well; I experience no problems.
With 41x I get a black screen, loud fan spin (100% utilisation?) and the laptop is totally unresponsive. This happens on all >4.18 kernels.

Blacklisting nvidia modules: black screen, loud fan spin, unresponsive laptop.
nvidia-udev.sh patch: black screen, loud fan spin, unresponsive laptop.

dmesg shows no errors or warnings, and the X log shows no abnormalities.

However, if the laptop is booted into single user mode, then right after the "populating /dev" status the system asks for the root password or CTRL+D to resume normal boot. When I press CTRL+D after two seconds, the system works like a charm every single time. If I press CTRL+D immediately after I see the message: black screen, loud fans and unresponsive laptop.
> However..... IF the laptop is booted into single user mode, and right after > the populating /dev status the system asks for the root password or CTRL+D > to resume normal boot. When i press CTRL+D after two seconds, the system > works like a charm every single time. Having the same issue on a GTX1070 desktop after updating to 410.xx driver (black screen, 100% system load, no response to input), and the single user trick makes it work for me. The patch however does not.
Please check my workaround here: https://bugs.gentoo.org/670340#c8 It's simple and stupid since my knowledge about udev and gentoo is quite limited, but that worked for me.
^ Basically it's the same idea as Tomasz proposed (thanks, Tomasz), but it works for me only if I add 'modprobe nvidia_drm' to the local script.
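For readers who have not followed the link, the blacklist-plus-local-script idea can be sketched roughly as below. This is an illustration, not a copy of the script from bug 670340: the file name /etc/local.d/nvidia.start, the `load_nvidia_modules` function and the `MODPROBE` override are all assumptions made for this example. It relies on the fact that a `blacklist` entry in /etc/modprobe.d only stops alias-based autoloading, so an explicit modprobe by module name still works.

```sh
#!/bin/sh
# Sketch of a local.d start script (hypothetical name: /etc/local.d/nvidia.start).
# Assumes the nvidia modules are blacklisted in /etc/modprobe.d/blacklist.conf,
# so udev does not load them during early boot.
# MODPROBE can be overridden for a dry run, e.g. MODPROBE=echo.
MODPROBE=${MODPROBE:-modprobe}

load_nvidia_modules() {
    # Load in dependency order; stop at the first failure.
    for mod in nvidia nvidia_modeset nvidia_drm; do
        "$MODPROBE" "$mod" || return 1
    done
}

# In the real script the last line would be the call itself:
# load_nvidia_modules
```

Because the local service starts late, xdm may already have tried to start X by then, which matches the reports in this bug of having to restart xdm manually.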
I had another problem along these lines. I upgraded the kernel to 4.19.4. Since I intend to migrate to an AMD GPU in the coming days, I enabled a bunch of related kernel options. The result was that my trick of blacklisting nvidia ceased to work. Namely, the nvidia module did load, but X wouldn't start, complaining about missing modules. Indeed, the other nvidia_* modules didn't load, and modprobing them would fail (stall for a time, then be killed by udev).

My dirty solution to the problem was to delete the whole /lib/modules/***/drivers/gpu directory. After a reboot (regretfully not a clean one) the system started normally. Apparently I removed too much: the nvidia_drm module didn't load, since it had picked up a dependency on the drm kernel module (which it didn't have on my old 4.14 kernel, as the drm module was not built there). The system seems to work fine without it, and I can even reinstall the missing drm modules and modprobe nvidia_drm after X starts. It pulls in a bunch of other modules:

nvidia_drm 40960 0
drm_kms_helper 159744 1 nvidia_drm
syscopyarea 16384 1 drm_kms_helper
sysfillrect 16384 1 drm_kms_helper
sysimgblt 16384 1 drm_kms_helper
fb_sys_fops 16384 1 drm_kms_helper
drm 352256 3 drm_kms_helper,nvidia_drm
drm_panel_orientation_quirks 16384 1 drm
nvidia_modeset 995328 12 nvidia_drm
nvidia 16691200 530 nvidia_modeset
Similar issue here (high cpu usage, module won't unload, no shutdown etc.) with nvidia-drivers 410.78 and 415.18 and gentoo-sources-4.18.x and sys-fs/udev-239. I had to go back to nvidia-drivers 396.54.
Confirming the issue for a recently set up Gentoo system. nvidia-drivers-396.54 or lower versions work. Higher ones give me a black screen. SysRq and remote login are working, but I cannot switch locally to any TTY. Hardware is brand new. Graphics card is a GTX 1050 Ti. I think those higher driver versions should get masked again immediately.

Dmesg output:
[ 1000.893204] udevd[3245]: timeout 'nvidia-udev.sh add'
[ 1000.893214] udevd[3245]: slow: 'nvidia-udev.sh add' [3422]
[ 1001.758534] udevd[3207]: worker [3245] /module/nvidia timeout; kill it
[ 1001.758545] udevd[3207]: seq 2321 '/module/nvidia' killed
[ 1001.758549] udevd[3207]: worker [3271] /devices/pci0000:00/0000:00:03.2/0000:1d:00.0 timeout; kill it
[ 1001.758553] udevd[3207]: seq 1934 '/devices/pci0000:00/0000:00:03.2/0000:1d:00.0' killed
[ 1001.758737] udevd[3207]: worker [3245] terminated by signal 9 (Killed)
[ 1001.758739] udevd[3207]: worker [3245] failed while handling '/module/nvidia'
Taken from Bug 670340:

I was able to run my X server with nvidia-drivers-415.18 by just commenting out the last line, "#options nvidia NVreg_DeviceFileMode=432 NVreg_DeviceFileUID=0 NVreg_DeviceFileGID=27 NVreg_ModifyDeviceFiles=1", in /etc/modprobe.d/nvidia.conf

This at least boots my system into X instead of a black screen, and I don't need to use the crippled single user mode boot workaround. That said... maybe it's time to give this bug the confirmed status, get some packages masked, or get upstream involved?
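For anyone wondering what the commented-out options actually request: the mode is written in decimal, so 432 corresponds to octal 0660 (read/write for owner and group). A small sketch to decode it follows; `mode_to_octal` is a name made up for this example.

```sh
#!/bin/sh
# NVreg_DeviceFileMode=432 is a decimal number; print it as an octal
# permission mode. mode_to_octal is a hypothetical helper for illustration.
mode_to_octal() {
    printf '%04o\n' "$1"
}

mode_to_octal 432   # prints 0660
```

NVreg_DeviceFileGID=27 is assumed here to be the video group, as on a default Gentoo install; `getent group 27` shows what it maps to on a given system. To see which permissions the device nodes ended up with once the line is commented out, `stat -c '%a %U:%G %n' /dev/nvidia*` (GNU coreutils) is one way to check.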
*** This bug has been marked as a duplicate of bug 670340 ***
*** Bug 670340 has been marked as a duplicate of this bug. ***
(In reply to Henny Coenen from comment #20) > Taken from Bug 670340: > > I was able to run my X server with nvidia-drivers-415.18 just comenting one > last string "#options nvidia NVreg_DeviceFileMode=432 NVreg_DeviceFileUID=0 > NVreg_DeviceFileGID=27 NVreg_ModifyDeviceFiles=1" in > /etc/modprobe.d/nvidia.conf > > This at least boots my system into X instead of a black screen and i don't > need to use a crippled single user mode boot workaround. Do the device nodes get proper permissions set when you do that? And then could we find a way to better generalise this for distribution?
Commenting out the line:

options nvidia NVreg_DeviceFileMode=432 NVreg_DeviceFileUID=0 NVreg_DeviceFileGID=27 NVreg_ModifyDeviceFiles=1

in /etc/modprobe.d/nvidia.conf did not work on two of my boxes. I still get:

$ dmesg
-snip-
[ 184.810576] udevd[2196]: timeout: killing 'nvidia-udev.sh add' [2281]
[ 184.810586] udevd[2196]: slow: 'nvidia-udev.sh add' [2281]
[ 184.810700] udevd[2196]: 'nvidia-udev.sh add' [2281] terminated by signal 9 (Killed)
-snip-

I had to switch back to the 4.18.16-gentoo kernel, and now also unmask the nvidia-drivers-396.54 driver, to get X running. The machines also do not reboot cleanly.
This was using: x11-drivers/nvidia-drivers-415.23 linux-4.19.10-gentoo /F
Same issue here. Reproducible with kernel 4.14.83 and both the 410 and 415 drivers. I was able to start X using the workaround from bug 670340, but since the local service wants to be started last, I have to restart xdm manually after each boot.
`Curiouser and curiouser!' cried Alice (c) I updated my nvidia-drivers to 415.25, and after "emerge" + "reboot" I again saw no X, only a console login prompt. I logged in and looked at Xorg.log (nothing about the nvidia module). I changed no config, but repeated "emerge" and "reboot", and voilà, I see X (KDE). Maybe the first "emerge" installed the kernel module incorrectly?
Gave it a try after Alexander's comment - no luck. Downgraded again.
I repent. I was wrong. After several reboots I found out that my Xorg only starts from time to time.

When it starts, I see in the kernel log:
[drm] Initialized nvidia-drm 0.0.0 20160202 for 0000:01:00.0 on minor 0
[drm] [nvidia-drm] [GPU ID 0x00000100] Loading driver
nvidia 0000:01:00.0: vgaarb: changed VGA decodes: olddecodes=io+mem,decodes=none:owns=io+mem
nvidia-modeset: Loading NVIDIA Kernel Mode Setting Driver for UNIX platforms 415.25 Wed Dec 12 10:02:42 CST 2018
nvidia: module license 'NVIDIA' taints kernel.
NVRM: loading NVIDIA UNIX x86_64 Kernel Module 415.25 Wed Dec 12 10:22:08 CST 2018 (using threaded interrupts)

When it does not start, I see in the kernel log:
nvidia 0000:01:00.0: vgaarb: changed VGA decodes: olddecodes=io+mem,decodes=none:owns=io+mem
nvidia: module license 'NVIDIA' taints kernel.
nvidia-nvlink: Nvlink Core is being initialized, major device number 247
I can confirm the nvidia-udev.sh hang/timeout for nvidia-drivers-410 and 415. As others have reported, udev is unable to load the nvidia module and leaves the system in a broken (no clean reboot possible) state. Kernel: 4.14.83 Fyi: problem occurs with GTX660 as well as with GTX1080 Ti. 390.87 and 396.54 work fine. Blacklisting nvidia modules as per Bug 670340 resolves the issue. I think this needs some attention: Using CUDA with gcc-7 requires at least cuda-9.2, which in turn requires >nvidia-drivers-396.24. However, all ebuilds satisfying this are either masked for removal (396.54) or do not work because of the udev issue (410, 415). This bug basically prevents people from using CUDA, at least without reverting to gcc-6.
Problem persists in 418.56 and 430.09.
Seems like I'm able to boot without the modprobe blacklist workaround on 430.14 In fact, using the workaround seems to break systemd 241+ detection of CanGraphical flag for the seat now
(In reply to Valeriy Malov from comment #32) > Seems like I'm able to boot without the modprobe blacklist workaround on > 430.14 > In fact, using the workaround seems to break systemd 241+ detection of > CanGraphical flag for the seat now Does not work for me; udev still hangs itself. Blacklisting is still needed. It would be really nice if this bug were tackled soon.
I tried switching from OpenRC to systemd and that solved the issue for me. Now x11-drivers/nvidia-drivers-430.26 works fine for me.
Have you tried it with sys-fs/udev instead of sys-fs/eudev ?
wow... almost year-old bug and still not fixed. I wasted a day for this. @Jeroen Roovers: any progress on fixing this?
(In reply to Rafal Kupiec from comment #36) > wow... almost year-old bug and still not fixed. > I wasted a day for this. With which eudev version?
(In reply to josef.95 from comment #37) > (In reply to Rafal Kupiec from comment #36) > > wow... almost year-old bug and still not fixed. > > I wasted a day for this. > > With which eudev version? 3.2.5
(In reply to H.-Peter Pfeufer from comment #38) > 3.2.5 Can you please try it with >=sys-fs/eudev-3.2.8 or with latest stable sys-fs/udev ?
(In reply to josef.95 from comment #39) > (In reply to H.-Peter Pfeufer from comment #38) > > 3.2.5 > > Can you please try it with >=sys-fs/eudev-3.2.8 > or with latest stable sys-fs/udev ? It's the same with the latest stable sys-fs/udev (which I have on my other machine)
Sorry no idea then :-/ (I can not reproduce this error on two different machines)
└──> ~ # equery l eudev * Searching for eudev ... [IP-] [ ] sys-fs/eudev-3.2.8:0
Same with sys-fs/udev-242
Brilliant! Now it is not working for me even with this patch applied...
Just to confirm that the bug is still here. 100% reproducible, with 2 different video cards:

NVIDIA GPU GeForce 8600 GTS (G84) at PCI:1:0:0 (GPU-0)
NVIDIA GPU GeForce GT 1030 (GP108-A) at PCI:1:0:0 (GPU-0)

Initially I thought that these were issues with newer kernels and nvidia drivers, so I had ">x11-drivers/nvidia-drivers-396.54" in package.mask and was using a 4.9 kernel. This combination works.

With the blacklisting trick I've got the system running fine with the 5.2.1 kernel and nvidia-drivers-430.26.

I'm running openrc-0.41.2 and udev from sys-apps/systemd-241-r4
(In reply to Vladimir from comment #45) > just to confirm that the bug is still here. > 100% reproducible, with 2 different video cards: > > NVIDIA GPU GeForce 8600 GTS (G84) at PCI:1:0:0 (GPU-0) > NVIDIA GPU GeForce GT 1030 (GP108-A) at PCI:1:0:0 (GPU-0) > > Initially I thought that this are issues with newer kernels and nvidia > drivers, > so I had ">x11-drivers/nvidia-drivers-396.54" in package.mask and was > using 4.9 kernel. This combination works. > > With blacklisting trick I've got system running fine with 5.2.1 kernel > and nvidia-drivers-430.26 > > I'm running openrc-0.41.2 and udev from sys-apps/systemd-241-r4 Doesn't blacklisting and loading the modules from local prevent the xdm init script from starting X?
I got the same problem on a friend's box yesterday while migrating from the initial nouveau driver setup to the nvidia driver, and it drove me crazy. Before I found your posts on this bugzilla I investigated the problem on the box and found that:
- udev (eudev) launches "nvidia-udev.sh add" and waits forever on it, while it in turn waits for nvidia-smi
- the nvidia module is loaded
- nvidia_modeset and nvidia_drm are NOT LOADED
- /dev/nvidiactl exists
- /dev/nvidia0 and /dev/nvidia-modeset DON'T EXIST
- no nvidia in /proc/interrupts !!!
- X was launched properly (no problem at all in Xorg.0.log !!!)
- the screen was black with a fixed "_" VT cursor in the upper right corner
- ssh to the box was OK
- no trace of a framebuffer at all in dmesg (checking for simple framebuffer)

So I decided to unload the nvidia module, to try loading all nvidia modules manually afterwards:
- impossible to unload the nvidia module (modprobe says that it is in use)
- lsof | grep nvidia shows that a udevd process was using nvidia.ko

So I decided to stop udev:
- /etc/init.d/udev stop hangs
- impossible to manually kill the faulty udevd process that was using nvidia.ko

Finally, my interpretation of the problem is:
- a udev process hangs forever when loading nvidia.ko, so the device /dev/nvidia0 doesn't exist
- udev launches 'nvidia-udev.sh add', which launches nvidia-smi, which hangs because it finds /dev/nvidiactl but not /dev/nvidia0

So there is a problem when udev creates the nvidia device. I don't have the problem on another box that uses a simple framebuffer on a 4-core Intel CPU with PCIe 3, while the faulty box uses a 12-core AMD CPU with PCIe 4. I wonder if there is a race condition at boot between the framebuffer and the loading of nvidia-kms, which should be loaded early in the boot process.

Is your framebuffer loaded at boot when the problem occurs? Which one do you use? Check:
dmesg | grep -Ei "fb|frame"
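The module and device-node checks from the list above can be collected into a small script to run over ssh on a hung box. This is just a convenience sketch: the function names and the overridable `MODULES_FILE` and `DEV_DIR` variables are made up for this example, and the set of modules and nodes is taken from the observations in this comment rather than from any NVIDIA documentation.

```sh
#!/bin/sh
# Report which nvidia modules and device nodes are present.
# MODULES_FILE and DEV_DIR are overridable only so the checks can be dry-run
# against sample data; both names are assumptions of this sketch.
MODULES_FILE=${MODULES_FILE:-/proc/modules}
DEV_DIR=${DEV_DIR:-/dev}

module_loaded() {
    # The first column of /proc/modules is the module name.
    awk -v m="$1" '$1 == m { found = 1 } END { exit !found }' "$MODULES_FILE" 2>/dev/null
}

report() {
    for mod in nvidia nvidia_modeset nvidia_drm; do
        if module_loaded "$mod"; then
            echo "module $mod: loaded"
        else
            echo "module $mod: NOT loaded"
        fi
    done
    for node in nvidiactl nvidia0 nvidia-modeset; do
        if [ -e "$DEV_DIR/$node" ]; then
            echo "node $node: present"
        else
            echo "node $node: MISSING"
        fi
    done
}

report
```

On the failing box described here, the expected picture would be nvidia loaded, the other two modules not loaded, and only nvidiactl present.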
On a working box at boot you have:

$ dmesg | grep -Ei 'nvidia|frame|fb'
[ 0.650063] simple-framebuffer simple-framebuffer.0: framebuffer at 0xc0000000, 0x1680000 bytes, mapped to 0x00000000c5f617d5
[ 0.650064] simple-framebuffer simple-framebuffer.0: format=a8r8g8b8, mode=2560x1440x32, linelength=16384
[ 0.650077] fbcon: Deferring console take-over
[ 0.650078] simple-framebuffer simple-framebuffer.0: fb0: simplefb registered!
[ 3.486577] input: HDA NVidia HDMI/DP,pcm=3 as /devices/pci0000:00/0000:00:01.0/0000:01:00.1/sound/card0/input20
[ 3.486795] input: HDA NVidia HDMI/DP,pcm=7 as /devices/pci0000:00/0000:00:01.0/0000:01:00.1/sound/card0/input25
[ 3.486912] input: HDA NVidia HDMI/DP,pcm=8 as /devices/pci0000:00/0000:00:01.0/0000:01:00.1/sound/card0/input26
[ 3.487018] input: HDA NVidia HDMI/DP,pcm=9 as /devices/pci0000:00/0000:00:01.0/0000:01:00.1/sound/card0/input27
[ 3.639028] #0: HDA NVidia at 0xdf080000 irq 17
[ 6.526341] fbcon: Taking over console
[ 6.526370] Console: switching to colour frame buffer device 320x90
[ 7.373957] nvidia: module license 'NVIDIA' taints kernel.
[ 7.380844] nvidia-nvlink: Nvlink Core is being initialized, major device number 238
[ 7.381093] nvidia 0000:01:00.0: vgaarb: changed VGA decodes: olddecodes=io+mem,decodes=none:owns=io+mem
[ 7.481589] NVRM: loading NVIDIA UNIX x86_64 Kernel Module 440.26 Sun Oct 13 18:00:57 UTC 2019
[ 7.483988] nvidia-modeset: Loading NVIDIA Kernel Mode Setting Driver for UNIX platforms 440.26 Sun Oct 13 17:39:54 UTC 2019
[ 7.484670] [drm] [nvidia-drm] [GPU ID 0x00000100] Loading driver
[ 7.484671] [drm] Initialized nvidia-drm 0.0.0 20160202 for 0000:01:00.0 on minor 0
I can also confirm this bug is still there with:

openrc-0.41.2
nvidia-drivers-440.31 (I think also 435.21, which I replaced with the latest during investigation)
kernel 5.3.9 and 5.4.0 (gentoo-sources)

I just switched the system from an old Core2Quad 9550 to a Ryzen 3700X while keeping the installation (a system with a generic kernel and support for both) and the same graphics card (GTX 660). As it never happened on the old system, I take it as a strong indicator of a timing issue. If there are logs or other information I can supply or test, let me know so I can try to help resolve this.
I have the same symptoms here. With a GTX 750 everything worked up to version 396.x; with anything newer I always got the black screen with "_" and no way to get out of it from the keyboard, but it seemed SysRq worked, because the filesystems were clean after rebooting with the reset button. IIRC ssh worked too. I have the same problem with a GTX 770, but there I only tested versions 390.x and 440.x. I didn't investigate before today, and I found the same things in dmesg as in the opening post. I worked around the problem on the GTX 770 (I don't have the 750 at the moment) by blacklisting the nvidia module (I didn't try other solutions). Thanks Tomasz!
(In reply to Simone Scanzoni from comment #50) I meant that 390.x worked on both GTX 750 and GTX 770, to confirm that this problem arises after series 396.x here too
Just tried out the latest nvidia-drivers without any luck. I have also given the nouveau driver another chance after a year. It works reliably with the XRender setting in KDE Plasma. 3D effects are limited, however. Removing myself from CC, as I replaced my graphics board with an AMD Vega. Good luck all!
What's the progress here? Guys?! When are you going to resolve this issue?!
I ran dmesg with and without the blacklist, as suggested by Zentoo (dmesg | grep -Ei "nvidia|fb|frame").

Failing, without blacklisting the nvidia modules:
[ 0.000000] BIOS-e820: [mem 0x00000000f8000000-0x00000000fbffffff] reserved
[ 0.000000] PM: Registered nosave memory: [mem 0xf8000000-0xfbffffff]
[ 0.000001] clocksource: tsc-early: mask: 0xffffffffffffffff max_cycles: 0x33cb4391fb6, max_idle_ns: 440795213593 ns
[ 0.002016] LSM: Security Framework initializing
[ 0.123340] PCI: MMCONFIG for domain 0000 [bus 00-3f] at [mem 0xf8000000-0xfbffffff] (base 0xf8000000)
[ 0.123340] PCI: MMCONFIG at [mem 0xf8000000-0xfbffffff] reserved in E820
[ 0.356440] pci 0000:09:00.0: BAR 3: assigned to efifb
[ 0.453275] system 00:00: [mem 0xf8000000-0xfbffffff] has been reserved
[ 0.787009] efifb: probing for efifb
[ 0.787018] efifb: framebuffer at 0xf1000000, using 9024k, total 9024k
[ 0.787020] efifb: mode is 1920x1200x32, linelength=7680, pages=1
[ 0.787021] efifb: scrolling: redraw
[ 0.787022] efifb: Truecolor: size=8:8:8:8, shift=24:16:8:0
[ 0.790195] Console: switching to colour frame buffer device 240x75
[ 0.793295] fb0: EFI VGA frame buffer device
[ 0.797358] ahci 0000:07:00.0: flags: 64bit ncq sntf ilck pm led clo only pmp fbs pio slum part sxs
[ 0.797965] ahci 0000:08:00.0: flags: 64bit ncq sntf ilck pm led clo only pmp fbs pio slum part sxs
[ 0.798687] ahci 0000:0c:00.0: flags: 64bit ncq sntf ilck pm led clo only pmp fbs pio slum part
[ 0.800051] ahci 0000:0d:00.0: flags: 64bit ncq sntf ilck pm led clo only pmp fbs pio slum part
[ 2.124399] nvidia: loading out-of-tree module taints kernel.
[ 2.124407] nvidia: module license 'NVIDIA' taints kernel.
[ 2.132877] nvidia-nvlink: Nvlink Core is being initialized, major device number 245
[ 2.137218] nvidia 0000:09:00.0: vgaarb: changed VGA decodes: olddecodes=io+mem,decodes=none:owns=io+mem
[ 2.387289] input: HDA NVidia HDMI/DP,pcm=3 as /devices/pci0000:00/0000:00:03.1/0000:09:00.1/sound/card0/input20
[ 2.416122] input: HDA NVidia HDMI/DP,pcm=7 as /devices/pci0000:00/0000:00:03.1/0000:09:00.1/sound/card0/input21
[ 2.416194] input: HDA NVidia HDMI/DP,pcm=8 as /devices/pci0000:00/0000:00:03.1/0000:09:00.1/sound/card0/input22
[ 2.416262] input: HDA NVidia HDMI/DP,pcm=9 as /devices/pci0000:00/0000:00:03.1/0000:09:00.1/sound/card0/input23

Successful startup, using the blacklisting workaround:
[ 0.123346] PCI: MMCONFIG for domain 0000 [bus 00-3f] at [mem 0xf8000000-0xfbffffff] (base 0xf8000000)
[ 0.123346] PCI: MMCONFIG at [mem 0xf8000000-0xfbffffff] reserved in E820
[ 0.356687] pci 0000:09:00.0: BAR 3: assigned to efifb
[ 0.454214] system 00:00: [mem 0xf8000000-0xfbffffff] has been reserved
[ 0.788525] efifb: probing for efifb
[ 0.788533] efifb: framebuffer at 0xf1000000, using 9024k, total 9024k
[ 0.788535] efifb: mode is 1920x1200x32, linelength=7680, pages=1
[ 0.788536] efifb: scrolling: redraw
[ 0.788537] efifb: Truecolor: size=8:8:8:8, shift=24:16:8:0
[ 0.791706] Console: switching to colour frame buffer device 240x75
[ 0.794802] fb0: EFI VGA frame buffer device
[ 0.798895] ahci 0000:07:00.0: flags: 64bit ncq sntf ilck pm led clo only pmp fbs pio slum part sxs
[ 0.799511] ahci 0000:08:00.0: flags: 64bit ncq sntf ilck pm led clo only pmp fbs pio slum part sxs
[ 0.800249] ahci 0000:0c:00.0: flags: 64bit ncq sntf ilck pm led clo only pmp fbs pio slum part
[ 0.801614] ahci 0000:0d:00.0: flags: 64bit ncq sntf ilck pm led clo only pmp fbs pio slum part
[ 2.400370] input: HDA NVidia HDMI/DP,pcm=3 as /devices/pci0000:00/0000:00:03.1/0000:09:00.1/sound/card0/input15
[ 2.449119] input: HDA NVidia HDMI/DP,pcm=7 as /devices/pci0000:00/0000:00:03.1/0000:09:00.1/sound/card0/input16
[ 2.449178] input: HDA NVidia HDMI/DP,pcm=8 as /devices/pci0000:00/0000:00:03.1/0000:09:00.1/sound/card0/input17
[ 2.449235] input: HDA NVidia HDMI/DP,pcm=9 as /devices/pci0000:00/0000:00:03.1/0000:09:00.1/sound/card0/input19
[ 11.319173] nvidia: loading out-of-tree module taints kernel.
[ 11.319181] nvidia: module license 'NVIDIA' taints kernel.
[ 11.327534] nvidia-nvlink: Nvlink Core is being initialized, major device number 245
[ 11.327766] nvidia 0000:09:00.0: vgaarb: changed VGA decodes: olddecodes=io+mem,decodes=none:owns=io+mem
[ 11.533066] NVRM: loading NVIDIA UNIX x86_64 Kernel Module 440.44 Sun Dec 8 03:38:56 UTC 2019
[ 11.759132] caller _nv000908rm+0x1bf/0x1f0 [nvidia] mapping multiple BARs
[ 12.334309] nvidia-modeset: Loading NVIDIA Kernel Mode Setting Driver for UNIX platforms 440.44 Sun Dec 8 03:29:48 UTC 2019

I also tried deactivating EFIFB, but the only change was that I had no console output during boot; it got stuck as well.
I was able to get it to work on 440.44-r1 by disabling rc_parallel in /etc/rc.conf No other workarounds (such as module blacklisting and manually loading them) were applied.
(In reply to G Msx from comment #55) > I was able to get it to work on 440.44-r1 by disabling rc_parallel in > /etc/rc.conf > > No other workarounds (such as module blacklisting and manually loading them) > were applied. That suggests it's a timing issue or race condition at boot. Disabling rc_parallel is a common solution for this kind of problem. So maybe it would be possible to fix it by playing with the daemon init order, runlevels, rc_need, ... I think the goal is to start nvidia-smi only once the nvidia devices have been correctly populated. I can't experiment any more on my side, since this problem is not on my system but on a friend's.
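To be explicit about what "disabling rc_parallel" means in practice: /etc/rc.conf is a file of plain shell variable assignments, so the value it sets can be read by sourcing it. The helper below is a sketch with a made-up name (`rc_parallel_value`); it reports "unset" when the line is absent or commented out, in which case OpenRC falls back to its built-in default (reported later in this bug to be NO).

```sh
#!/bin/sh
# Print the rc_parallel value that a given rc.conf would set, or "unset" if
# the line is absent or commented out. Hypothetical helper for illustration.
# Pass a path containing a slash (e.g. /etc/rc.conf), because "." searches
# PATH for bare file names.
rc_parallel_value() {
    if [ ! -r "$1" ]; then
        echo unset
        return 0
    fi
    (
        unset rc_parallel
        # rc.conf consists of shell variable assignments, so it can be sourced.
        . "$1"
        echo "${rc_parallel:-unset}"
    )
}
```

Running `rc_parallel_value /etc/rc.conf` on an affected system shows whether parallel startup was explicitly enabled. If it prints "unset" or "NO", rc_parallel is unlikely to be the variable that differs between working and failing machines.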
(In reply to Zentoo from comment #56) > (In reply to G Msx from comment #55) > > I was able to get it to work on 440.44-r1 by disabling rc_parallel in > > /etc/rc.conf > > > > No other workarounds (such as module blacklisting and manually loading them) > > were applied. > > That confirms it's a timing issue or race condition at boot. > Disabling rc_parallel is common solution for this kind of problem. > I'm sorry but, what do you mean when you are saying «disabling rc_parallel». Here, on two different systems with different drivers versions, rc_parallel is and has always been commented : > #rc_parallel="NO" The only working workaround on both systems is blacklisting nvidia drivers : > $ cat /etc/modprobe.d/blacklist.conf > blacklist nvidia > blacklist nvidia_drm > blacklist nvidia_modeset
I have had the same problem for months! Neither RC_PARALLEL=NO nor blacklisting the nvidia modules helps. I have to restart xdm after I log in on the console. Otherwise I get a blank screen with a blinking cursor.
Anyone contacted NVIDIA with this?
I also had this problem for a while and worked around it by staying on the nvidia-drivers 390 series. But this was no longer an option, so I also messed around with this problem a bit. It's definitely a race condition. Loading the driver after boot works perfectly. The problem was solved for me by adding nvidia and nvidia-drm to /etc/conf.d/modules:

modules="atlantic nvidia nvidia-drm"

(atlantic is my network card driver, so it is unrelated.) This changed the order of execution. As always with race conditions, this reduced the probability of the race sufficiently on my system, but this may be different for other systems.
For now, I've fixed this for me: nvidia-drivers-440.82 gentoo-sources-5.6.5 mesa-20.0.4-1 (with libglvnd) removed everything from "modprobe.d/blacklist.conf" rc-parallel = no Now the module gets inserted at boot and xdm is able to start X11 without a Segmentation fault
Created attachment 643042 [details] Xorg log from venus server. This is Xorg.0.log from my server with following versions of components, I tried first stable tree, which failed as well, now tried with ~amd64, same result, I applied blacklist and rc.conf is not needed as rc_parallel is default NO. Here are the versions: media-libs/mesa=19.3.5 x11-base/xorg-server-1.20.7 x11-drivers/nvidia-drivers-440.82-r3
Created attachment 643044 [details] dmesg |grep nvidia This is dmesg |grep nvidia log which might be also useful to recognize the issue.
I have a similar problem with nvidia-smi and I'm not entirely sure if it's the same one.

My system has
- Two graphics cards: an nvidia gt 730 and an amd radeon rx 580 (or rx 590)
- The display is connected to the amd one; the nvidia one just provides CUDA
- x11-drivers/nvidia-drivers-455.38
- sys-fs/udev-246-r1

I observed that
- the nvidia-smi command (executed by /etc/X11/xinit/xinitrc.d/95-nvidia-settings) hung and prevented X from starting properly
- When commenting out the nvidia-smi call in the script, X was able to start, but I still saw one nvidia-smi process running in htop (process state D)
- I also saw one thread of /lib/systemd/systemd-udevd which
  - used 100 % of one cpu (process state R)
  - was not killable using SIGTERM or SIGKILL
  - was not attachable using strace -p (strace just said it would attach to the process and then hung. strace didn't respond to Ctrl+C anymore and had to be killed with kill -9 <PID>)
  - was not killable by killing its parent systemd-udevd (the parent got killed, all other children got killed, the specific child clogging the cpu lived on)
- After some time, I noticed that the cpu intensive systemd-udevd process was gone. At roughly the same time, the nvidia-smi process also exited. Also at roughly the same time the message 'NVRM: loading NVIDIA UNIX x86_64 Kernel Module 455.38 Thu Oct 22 06:06:59 UTC 2020' appeared in dmesg

I am wondering if this behavior is caused by the same problem or if I should create a new bug for this. I'd be glad for any hints.
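For others trying to confirm the same state without htop, the stuck processes described above can be picked out of ordinary ps output. The filter below is only a sketch: `filter_stuck` is a made-up name, and matching on state D or R plus the command names nvidia-smi and udevd is an assumption drawn from the observations in this comment.

```sh
#!/bin/sh
# Keep only processes in uninterruptible sleep (state D) or running (state R)
# whose command name looks like nvidia-smi or a udevd variant.
# Reads "PID STAT COMMAND" lines on stdin; hypothetical helper for illustration.
filter_stuck() {
    awk '$2 ~ /^[DR]/ && $3 ~ /nvidia-smi|udevd/ { print $1, $2, $3 }'
}
```

Typical use would be `ps -eo pid,stat,comm | filter_stuck`, run a few times some seconds apart. A process that stays listed across runs matches the hang described here.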
(In reply to Constantin Runge from comment #64) > I have a similar problem with nvidia-smi and I'm not entirely sure, if it's > the same one. > > My system has > - Two graphics cards: an nvidia gt 730 and an amd radeon rx 580 (or rx 590) > - The display is connected to the amd one, the nvidia one just provides CUDA > - x11-driver/nvidia-drivers-455.38 > - sys-fs/udev-246-r1 > > I observed, that > - the nvidia-smi command (executed by > /etc/X11/xinit/xinitrc.d/95-nvidia-settings) hanged and prevented X to start > properly > - When commenting out the nvidia-smi call from the script, X was able to > start, but I still saw one nvidia-smi process running in htop (process state > D) > - I also saw one thread of /lib/systemd/systemd-udevd which > - used 100 % of one cpu (process state R) > - was not killable using SIGTERM or SIGKILL > - was not attachable using strace -p (strace just said, it will attach to > the process and then hang up. strace didn't respond to Ctrl+C anymore and > had to be killed with kill -9 <PID>) > - was not killable by killing its parent systemd-udevd (the parent got > killed, all other children got killed, the specific child clogging the cpu > lived on) > - After some time, I noticed that the cpu intensive systemd-udevd process > was gone. At roughly the same time, also the nvidia-smi process exited. Also > at roughly the same time the message 'NVRM: loading NVIDIA UNIX x86_64 > Kernel Module 455.38 Thu Oct 22 06:06:59 UTC 2020' appeared on dmesg > > I am wondering, if this behavior is caused by the same problem or if I > should create a new bug for this. > I'd be glad for any hints. Can confirm it's happening to me too. I've noticed when this occurs it loads only nvidia module. Does not load nvidia-drm nvidia-modesettings and nvidia-smi doesn't show any output. I thought I was the only one since I started messing with kernel config.
I think it's -fomit-frame-pointer. I deleted /lib/modules and rebuilt the kernel and nvidia-drivers without it, and everything is back to normal...
(In reply to kartebi from comment #66) > I think its -fomit-frame-pointer > deleted /lib/modules, rebuilding kernel and nvidia-drivers without it > everything back to normal... Update to the situation, i think i found the problem look here https://bugs.gentoo.org/454740#c31
The likely removal of nvidia-udev.sh will hopefully solve those. *** This bug has been marked as a duplicate of bug 454740 ***