We use package sets to run the same package selection on several computers. Some have nvidia cards, some have very old nvidia cards, and some have no nvidia card at all. The nvidia-drivers package installs /lib/udev/rules.d/99-nvidia.rules, which unconditionally loads nvidia.ko and does not react to failure; that is, modprobe exits with 1 when no card is found or when the module no longer supports the (old) card. Udev just keeps trying and trying to load the module, so modprobe is running constantly even though the module will never load. pstree looks like this:

|-udevd---nvidia-udev.sh---nvidia-smi---modprobe

Reproducible: Always

Steps to Reproduce:
1. Install nvidia-drivers on a machine without an nvidia card

Actual Results:
`pstree | grep nvidia` will show the output above.

Expected Results:
I would expect the udev rule to only act on supported PCI IDs.
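For illustration, a rule restricted to NVIDIA hardware might look like the sketch below. The vendor ID 0x10de and class 0x030000 (VGA display controller) are standard PCI identifiers; the match keys and RUN target are assumptions for the sketch, not the rule the package actually ships:

```
# Hypothetical udev rule: only react to PCI devices with NVIDIA's
# vendor ID (0x10de) and a display-controller class, instead of
# unconditionally loading nvidia.ko for every machine.
ACTION=="add", SUBSYSTEM=="pci", ATTR{vendor}=="0x10de", \
  ATTR{class}=="0x030000", RUN+="/lib/udev/nvidia-udev.sh add"
```

This would still not handle cards that the installed driver version no longer supports, but it would at least do nothing on machines with no NVIDIA hardware at all.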
jer: ping? It is 346.35 now, but the problem still persists and hits laptops (I'd say "Optimus ones", but I have the issue even on an nvidia-only one) very hard! Can you just get rid of calling the blob (nvidia-smi) in the script called by the udev rule, and just mknod there instead? Really, that crap just eats 100% CPU and drops the X server into D state from time to time (blocking the nvidia card and kernel module too). And it ignores every attempt to kill it with any signal.
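The "just mknod" idea could be sketched roughly like this. It is a dry-run sketch only: major 195 is the conventional major number for the NVIDIA character devices, but the node list, mode, and group here are assumptions, not the contents of nvidia-udev.sh:

```shell
#!/bin/sh
# Dry-run sketch: create the NVIDIA device nodes directly instead of
# invoking nvidia-smi. Prints the commands rather than executing them,
# since mknod requires root.
NV_MAJOR=195  # conventional major for /dev/nvidia* char devices

make_node() {
    # $1 = device path, $2 = minor number
    echo "mknod -m 0660 $1 c $NV_MAJOR $2 && chown root:video $1"
}

make_node /dev/nvidia0 0
make_node /dev/nvidiactl 255
```

Creating the nodes directly never touches the kernel module, so it cannot trigger the load-unload loop described below.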
the udev rule is this:

ACTION=="add", DEVPATH=="/module/nvidia", SUBSYSTEM=="module", RUN+="nvidia-udev.sh $env{ACTION}"

This means that nothing should happen unless udev (or you) loads the nvidia kernel module. I'm open to a fix for this; however, as far as I know, this should do nothing if the nvidia kernel module isn't loaded. So I would need to know what is loading the kernel module, and how to teach udev not to run this when the modprobe fails (as far as I know that should be the default, but you seem to suggest it runs no matter what).
(In reply to Rick Farina (Zero_Chaos) from comment #2)
> the udev rule is this:
>
> ACTION=="add", DEVPATH=="/module/nvidia", SUBSYSTEM=="module",
> RUN+="nvidia-udev.sh $env{ACTION}"

Yes, Rick, the udev rule is this. But the script itself runs `/opt/bin/nvidia-smi` on the "add" event, which, in turn, can notice that the module has already been unloaded by the time it runs. And what does it do then? It loads the nvidia module once again. And the module unloads again before the second copy of the blob runs. Once again. And again. And so on, in an infinite cycle. I once ended up with about 20k duplicate nvidia records in sysfs ;)

> this means, that nothing should happen unless udev (or you) load the nvidia
> kernel module.

Almost. For some reason, I hit this bug while working with ImageMagick's convert and PostScript-related files. Somehow convert calls that blob (and is fine if the blob just exits with 1, so I don't know why the hell it calls it at all), and the blob loads the module. The module triggers the udev rule and unloads. The udev rule calls the blob, the blob loads the module: infinite loop.

> I'm open to a fix for this

Drop that blob, maybe? I remove it by hand after every nvidia-drivers rebuild and everything works perfectly fine. I don't know why the hell it would be needed.

> as far as I know, this should do nothing if the nvidia kernel module isn't loaded

Yeah, except for loading the module (and failing to find the temporarily disabled card, thus producing an infinite load-unload loop with heavy syslog spamming) ;)

> I would need to know what is loading the kernel module

It is the suid blob called /opt/bin/nvidia-smi

> and how to teach udev not to run this when the modprobe fails

How about removing the blob and fixing /lib/udev/nvidia-udev.sh to not run it on the "add" event? :)

> (as far as I know that should be default but you seem to suggest that it is running no matter what)

Not exactly: modprobe doesn't fail. It loads the nvidia module just fine.
The nvidia module triggers the udev rule, but then either it unloads itself, or the parent nvidia-smi blob unloads it, because it could not detect any working, compatible nvidia card in the system (on laptops this is mostly because of a temporarily disabled nvidia card; on a desktop I experienced it after updating the drivers to a version that was "too new for this card"). But by the time the module has complained in syslog (or even dmesg) and unloaded, a new copy of the nvidia-smi blob has appeared (triggered by the script, which is triggered by the udev rule). The new copy of the blob sees that the module is unloaded and... {infinite loop}
(In reply to Vadim A. Misbakh-Soloviov (mva) from comment #3) Okay, so the problem appears to be that "something" causes the nvidia driver to load, but the hardware isn't supported, so it unloads. This happens before the udev script is run, so udev detects that the module was loaded and runs the script, which triggers the load again, and so on in an infinite loop. I'll find a sane way to check whether the module is loaded before running nvidia-smi, and simply not run it if the module isn't loaded, to avoid the loop. I'll try to have this done today.
Please modify /lib/udev/nvidia-udev.sh like this and tell me if it resolves the issue:

if lsmod | grep -iq nvidia; then
    /opt/bin/nvidia-smi > /dev/null
fi
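A side note on this check (not part of the proposed fix): `grep -iq nvidia` also matches related modules such as nvidia_uvm or nvidia_drm, so it can report the core module as loaded when only a helper module is. Anchoring the match to the module-name field avoids that. The lsmod output below is an invented sample to illustrate the difference:

```shell
#!/bin/sh
# Simulated `lsmod` output (invented sample): helper modules present,
# core "nvidia" module absent.
sample='nvidia_drm 61440 2
nvidia_uvm 1134592 0
snd_hda_intel 53248 4'

# Loose check (as in the proposed fix): matches any line containing
# "nvidia", so the helper modules alone trigger it.
printf '%s\n' "$sample" | grep -iq nvidia && echo "loose: matched"

# Strict check: only matches the literal module name "nvidia" as the
# first field, so it does not trigger here.
printf '%s\n' "$sample" | grep -q '^nvidia ' || echo "strict: no match"
```

Testing `[ -d /sys/module/nvidia ]` is another unambiguous alternative, since each loaded module gets its own directory under /sys/module.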
Yeah, it seems that modification fixes the infinite loop issue (at least for me)
committed with revbump, hopefully that squashes the issue
You "fixed" it for one of 5 branches?
And went straight to stable on the revbump?
And BEFORE you do anything else, Rick, just DON'T.
Fixed for all applicable branches.
(In reply to Jeroen Roovers from comment #10) > And BEFORE you do anything else, Rick, just DON'T. If you feel the need to flame me, save your typing. If you have a grievance go through proper channels.
> please actually stable things before claiming that it was fixed.

For a script change like this there is no need to load the arch teams, and as I've tested this fix on both amd64 and x86, I stabled it. To avoid a childish revert war here, I'll let you stable it yourself, but seriously, this is a minor script change: stabilize the package and stop hurting the users.
Hi community. I ran into this issue in the last weeks. Sometimes X is unable to get past the login manager (which works fine), because /opt/bin/nvidia-smi is blocking. My "workaround" is commenting out the respective line in /lib/udev/nvidia-udev.sh.

My machine has a hybrid graphics setup. Normally I am running the Intel card; I'm using optirun/bumblebee.

$ sudo lspci | grep -i nvidia
01:00.0 VGA compatible controller: NVIDIA Corporation GF108M [GeForce GT 540M] (rev a1)

This hanging (or looping?) happens once every few weeks. I haven't tested it, but it might have always occurred during boot after I had used the Nvidia graphics card via optirun. Please let me know if you need more system details. Happy to help and test.
Frustrating bug, same problem here. After starting lightdm I only get a black screen with a non-blinking cursor, and nvidia-smi generates 100% CPU load. Removing the line "/opt/bin/nvidia-smi > /dev/null" helps me.

x11-drivers/nvidia-drivers-390.67 (also older versions)
sys-kernel/gentoo-sources-4.17.5 (also 4.16.x)
Same problem on a desktop machine... None of the workarounds mentioned in topics about this issue help. I've tried both the sleep 1 and sleep 5 workarounds. I've tried commenting out the line calling nvidia-smi. None of these help. The only thing that helps is renaming the nvidia modules and rebooting. After the reboot I log in, rename the modules back to their original names, and then I can start X. I guess what actually helps is manually loading the nvidia modules. Mighty annoying. Please fix. I'm running

> Linux xlad 4.19.4-gentoo #1 SMP PREEMPT Tue Nov 27 22:11:43 EET 2018 x86_64 AMD Ryzen 7 1700 Eight-Core Processor AuthenticAMD GNU/Linux
> nvidia gtx 1050

I've tried drivers 410.78 and 415.18. This problem started happening recently; before that I hadn't seen it.
This problem is still happening, and it is quite annoying. Can someone try to resolve it? I've upgraded to kernel 5.0.3-gentoo and nvidia-drivers-418.56, but it doesn't help. The only thing that helps is making sure that nvidia.ko cannot be loaded. As I've said, it is so annoying because I have to restart twice to get a working X: the first boot hits the infinite loop, so I have to rename nvidia.ko to something else, restart via the hard-reset button (because of course udev is stuck and cannot finish), wait for it to come up, log in, rename nvidia.ko back so it can be loaded, and start X. So annoying...
Same problem here for some weeks now. The cause of the problem seems to be in the shutdown process. Starting from a working Gentoo with nvidia-drivers:

reboot -> not working, 100% udevd
shutdown and restart computer -> not working, 100% udevd
pressing reset key -> no problems
reboot into Fedora (nouveau drivers), then reboot into Gentoo -> not working, 100% udevd
reboot into Antergos (nvidia-drivers), then reboot into Gentoo -> no problems

I tried several distros with nouveau and nvidia drivers; the result is always the same: Gentoo does not work correctly after using a distro with nouveau drivers, but always works after using a distro with nvidia-drivers. Using sys-fs/udev or sys-fs/eudev does not change anything. BTW, x11-drivers/nvidia-drivers-418.56, kernel 5.0.7.
This ticket was reported 7 years ago and assigned to David Seifert. David has never answered any message here. Something is wrong in the process. Could anybody from the Gentoo team have a look at it, please? Thank you.
(In reply to John Blbec from comment #19)
> This ticket was reported 7 years ago and assigned to David Seifert. David
> has never answered any message here. Something is wrong in the process.
> Could anybody from the Gentoo team have a look at it, please? Thank you.

Maybe because I have a ton of stuff on my plate and 90% of nvidia stuff is beyond my control?
@David, I've already answered your response on Gentoo's forum. I'm really glad you're alive. I understand you may have a ton of other stuff, but no one could know that, because there has been no reaction from you here. I can provide any information you need to investigate the issue, because I'm able to reproduce it, and I don't think I'm alone. If the issue is beyond your control, please say so and let everybody here know, because the issue is annoying and I'm not the only one who would like to see it finally solved.
About the nvidia module and 100% udev stuff (I have commented on other similar bugs with no response so far): I think I found the problem: -fomit-frame-pointer. I removed it from my flags, deleted /lib/modules, rebuilt the kernel and nvidia-drivers without it, and got a working system.
@kartebi, I'll have to test it as soon as I'm home, but I think I don't use -fomit-frame-pointer directly. I use CFLAGS="-march=native -O2 -pipe"
It looks like -Ox enables it by default.
Hmm, interesting. It looks like -fomit-frame-pointer is enabled by default from -O1 upwards ( https://gcc.gnu.org/onlinedocs/gcc/Optimize-Options.html ). I have a Gentoo default build on another drive as a base, with the default "-O2 -pipe", and everything works fine there. On another drive I experiment with things like -O3, LTO, etc., and recently I came across this nvidia/udev bug there. I had to rebuild several times and walk my flags back until I was down to "-O2 -pipe -fomit-frame-pointer" and still got this nvidia bug. After removing -fomit-frame-pointer from my make.conf, the nvidia bug was gone...
@kartebi, it seems it is not a solution.
@John Blbec Sad to hear. For me the issue is gone (unable to remove the nvidia module, udev at 100% CPU on one core, X not starting, unable to reboot). One small detail that might help: I rebuilt both the kernel and nvidia-drivers with "-O2 -pipe" only, while booted into a kernel with no nvidia support (no nvidia module in /lib/modules).
Created attachment 688239 [details] working ck config
Created attachment 688242 [details] working gentoo config
Created attachment 688245 [details] problematic gentoo config
Please disregard my comments above; I was being stupid. Long story short: if you build nvidia-drivers against a kernel that is configured to use a 1000 Hz timer frequency, shit hits the fan :) 100 Hz and 300 Hz work fine. Attached are a working ck config, a working gentoo config, and a buggy gentoo config. The only difference between the gentoo configs is the timer frequency.

Longer story: I was using ck-sources with a 100 Hz timer because Con Kolivas says it works best with his scheduler. Then I tried out pf-sources and the problem reappeared. I brute-forced my way to finding this: I rebuilt kernels and nvidia-drivers about 30 times and hard-reset my system about as many times.

Reminder: you must rebuild nvidia-drivers every time you change the timer frequency. I tried using nvidia-drivers built against a 1000 Hz kernel while booted into a 100 Hz kernel, and the problem remained. Hope the devs figure this out.
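For anyone testing this, the configured timer frequency can be checked with a simple grep over the kernel config. The config fragment below is an invented sample; point the same grep at your real .config (or use zgrep on /proc/config.gz if CONFIG_IKCONFIG_PROC is enabled, as shown in a later comment):

```shell
#!/bin/sh
# Write a sample kernel-config fragment (invented values) and grep out
# the timer-frequency options, the same way you would on a real .config.
cat > /tmp/sample-config <<'EOF'
# CONFIG_HZ_100 is not set
# CONFIG_HZ_300 is not set
CONFIG_HZ_1000=y
CONFIG_HZ=1000
EOF

# Only matches set options, not the "# ... is not set" comment lines.
grep -E '^CONFIG_HZ(_[0-9]+)?=' /tmp/sample-config
```

On the problematic configs described here, this would print CONFIG_HZ_1000=y and CONFIG_HZ=1000; on the working ones, the 100 or 300 variants.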
Sorry, I can't find an edit button here; I thought I should add the USE flags I have:

sys-kernel/gentoo-sources-5.11.1:5.11.1::gentoo USE="experimental -build -symlink"
x11-drivers/nvidia-drivers-460.39-r1:0/460::gentoo USE="X driver kms multilib static-libs tools uvm -compat -dist-kernel -wayland" ABI_X86="32 (64) (-x32)"
$ less /proc/config.gz | grep HZ_1000
CONFIG_HZ_1000=y
$ uname -a
Linux 5.8.14-gentoo #1 SMP PREEMPT Sun Oct 11 01:04:34 EEST 2020 x86_64 AMD Ryzen 7 1700 Eight-Core Processor AuthenticAMD GNU/Linux

I'll update the kernel soon and see if anything changes.
Building drivers against a CONFIG_HZ_1000=y kernel doesn't seem to change anything for me. But even though I've never been able to reproduce this, I'll still review the relevance of the udev rule and how the script handles things when I get to it (don't expect this all that soon). Any more hints regarding this would be helpful, as it's hard to fix something I can't reproduce.
(In reply to Ionen Wolkens from comment #34)
> Building drivers against a CONFIG_HZ_1000=y kernel doesn't seem to change
> anything for me.
>
> But even though I've never been able to reproduce this, I'll still review
> the relevance of the udev rule and how the script handles things when I
> get to it (don't expect this all that soon).
>
> Any more hints regarding this would be helpful, as it's hard to fix
> something I can't reproduce.

On my system I can reproduce it 100% of the time if I change to 1000 Hz without changing anything else in the kernel config (tried it again with a newer kernel and nvidia-drivers, still the same). I have a fairly minimal config; maybe something I disabled or enabled in combination with the 1000 Hz setting produces the bug. I have added my configs as attachments.
Another thing that may not be obvious (from the earlier post where I listed my USE flags): because I have steam installed from steam-overlay, I have several packages built with abi_x86_32, one of them being nvidia-drivers:

app-arch/zstd abi_x86_32
dev-libs/expat abi_x86_32
dev-libs/wayland abi_x86_32
dev-libs/libgcrypt abi_x86_32
dev-libs/libgpg-error abi_x86_32
dev-util/wayland-scanner abi_x86_32
media-libs/mesa abi_x86_32
media-libs/libpng-compat abi_x86_32
sys-apps/lm-sensors abi_x86_32
sys-devel/llvm abi_x86_32
virtual/opengl abi_x86_32
virtual/libintl abi_x86_32
x11-libs/libXfixes abi_x86_32
x11-libs/libXrandr abi_x86_32
x11-libs/libXrender abi_x86_32
x11-libs/libXxf86vm abi_x86_32
x11-libs/libdrm abi_x86_32
x11-libs/libxshmfence abi_x86_32
x11-drivers/nvidia-drivers abi_x86_32
media-libs/libglvnd abi_x86_32
x11-libs/libvdpau abi_x86_32
x11-libs/libX11 abi_x86_32
x11-libs/libXext abi_x86_32
sys-libs/zlib abi_x86_32
x11-libs/libxcb abi_x86_32
x11-libs/libXau abi_x86_32
x11-libs/libXdmcp abi_x86_32
x11-base/xcb-proto abi_x86_32
dev-libs/libffi abi_x86_32
sys-libs/gpm abi_x86_32
sys-libs/ncurses abi_x86_32
dev-libs/libxml2 abi_x86_32
dev-libs/icu abi_x86_32
app-arch/xz-utils abi_x86_32
Well, figuring this out may not be necessary. I had a look at why nvidia-smi is being run (bug #376527, to create devices for no-X CUDA users), and the irony is that we also have:

1. bug #505092 allows nvidia libraries to auto-run nvidia-modprobe, which creates any missing devices on top of loading modules
2. there's another udev rule that creates /dev/nvidia-uvm without nvidia-smi

The devices were already created with the video group, so simply removing the nvidia-smi call shouldn't disrupt most users, as they should already be in the video group to call nvidia-modprobe. So the plan is now to just remove nvidia-udev.sh (it won't be right away from me, but expect it sooner or later). If possible I'll see if I can remove the udev rules entirely, although I still need to check some misbehavior of nvidia-modprobe.
thank you @ionen I appreciate your help.
*** Bug 667362 has been marked as a duplicate of this bug. ***
*** Bug 504326 has been marked as a duplicate of this bug. ***
*** Bug 752018 has been marked as a duplicate of this bug. ***
Ran into an annoying issue: while the devices (which this was intended to fix) aren't a problem, if nvidia-drm is not loaded early, Xorg will only work if you have a custom config with nvidia in it, i.e. it can't auto-detect anymore. But nvidia-modprobe -m may be a better solution than nvidia-smi; hopefully it won't hang for anyone.
(In reply to Ionen Wolkens from comment #42) > Ran into an annoying issue, [...] Actually figured why this was happening, I shouldn't speak too soon on bugzilla. Sorry for the noise :) Still no need for udev.
The bug has been closed via the following commit(s):

https://gitweb.gentoo.org/repo/gentoo.git/commit/?id=26146d1510fd678538b7d02400c1eb8e66e20212

commit 26146d1510fd678538b7d02400c1eb8e66e20212
Author:     Ionen Wolkens <sudinave@gmail.com>
AuthorDate: 2021-03-21 15:52:10 +0000
Commit:     David Seifert <soap@gentoo.org>
CommitDate: 2021-03-21 15:52:10 +0000

x11-drivers/nvidia-drivers: bump to 460.67 with refactored ebuild

ebuild carries a lot of history and, rather than cleanups, it needed
something closer to a rewrite.

Bugfixes:
- Removed all udev rules to solve long standing issues (bug #454740)
- Install libraries with no X11 dependencies with USE=-X, notably for
  headless OpenCL/CUDA (bug #561706)
- Install systemd unit for persistenced + nvpd user (bug #591638)
- Add custom error message for DRM_KMS_HELPER and ensure driver doesn't
  attempt building DRM support without it (bug #603818)
- Warn about AMD SME if enabled by default (bug #652408)
- Distribute extra sources to lift RESTRICT="bindist mirror", the
  nvidia-driver.eclass is no longer used (bug #732702)
- Build modprobe and persistenced from source (bug #747145)
- Use system locations for vulkan icd/layers (bug #749600)

Others:
- Dropped IUSE=compat/multilib/kms/uvm/wayland
  > compat: was for non-GLVND variants and currently a no-op
  > multilib: obsolete, abi_x86_32 does all that's needed
  > kms/uvm: modules are loaded by nvidia-modprobe as-needed and there's
    not much sense in skipping installation. Will also save OpenCL/CUDA
    packages from having to depend on [uvm]
  > wayland: library is provided by gui-libs/egl-wayland instead which
    now also provides pkgconfig files and can be a newer version.
    optfeature warning was added for awareness.
- Dropped REQUIRED_USE, all USE can now be used independently, e.g. now
  possible to get libXNVCtrl.a (static-libs) without the deps-heavy
  USE=tools
- Dropped locale patch, the offending code it was meant to fix is gone.
- Dropped linker patch, uses right linker even with -native-symlinks.
- Added modprobe.d .conf to blacklist nouveau by default.
- Patched nvidia-modprobe to respect nvidia.conf's permissions when
  creating uvm devices, was previously created as world read-write.
- No longer installing libOpenCL.so loader (not needed to use OpenCL,
  was used by the no longer available eselect-opencl).
- nvidia-persistenced init script simplified and updated for nvpd user.
- nvidia-smi init script removed (all it did was query cards every 300
  seconds), mentioned behavior is no longer observable (fan scales
  normally without X) and it wasn't intended for this purpose.
- Removed I2C_NVIDIA_GPU check as it caused unnecessary noise for
  gentoo-kernel-bin users (built as module), and being a bad thing even
  if loaded is questionable.
- Attempt to reduce message noise. The only fatal CONFIG_CHECK is
  fairly rare so there's little reason to check twice with pkg_pretend.
- ... but added new conditional messages to explain important things
  often seen as common sense but that a new user likely won't know.
- Replaced the nvidia-driver.eclass legacy test with a compact version
  that reads supported-gpus.json (usable on >450).
- More strict deps, some may sound strange but nvidia-settings only use
  headers for some of these (dbus/Xrandr/Xv/vdpau).
  > X? libs kept separate as it's the only one needing multilib deps.
  > pax-utils now unconditional for scanelf as libraries are always
    installed. Alternatively could've generated those, but prefer to
    leave it easier to maintain for future generations.
  > virtual/opencl removed, no sense in the drivers depending on this
    and it's instead applications using opencl that should.
  > Added MODULES_OPTIONAL_USE="driver" to handle linux-mod deps
- Added MIT license for persistenced
- Added ZLIB license for supported-gpus.json
- NV_KERNEL_MAX (previously NV_KV_MAX_PLUS) set to be <=5.11 form
  rather than <5.12 given that often confused users thinking it meant
  5.12 support from quick looks.
- arm64 support "should" work but runtime untested
- And a long list of cleanups that "hopefully" won't cause new issues.

Closes: https://bugs.gentoo.org/454740
Closes: https://bugs.gentoo.org/561706
Closes: https://bugs.gentoo.org/591638
Closes: https://bugs.gentoo.org/603818
Closes: https://bugs.gentoo.org/652408
Closes: https://bugs.gentoo.org/732702
Closes: https://bugs.gentoo.org/747145
Closes: https://bugs.gentoo.org/749600
Signed-off-by: Ionen Wolkens <sudinave@gmail.com>
Signed-off-by: David Seifert <soap@gentoo.org>

 x11-drivers/nvidia-drivers/Manifest                     |   7 +
 .../files/nvidia-blacklist-nouveau.conf                 |   3 +
 .../files/nvidia-modprobe-390.141-uvm-perms.patch       |  12 +
 .../nvidia-drivers/files/nvidia-persistenced.confd      |   7 +
 .../nvidia-drivers/files/nvidia-persistenced.initd      |  12 +
 .../nvidia-drivers/nvidia-drivers-460.67.ebuild         | 391 +++++++++++++++++++++
 6 files changed, 432 insertions(+)
wau :o looking forward to test it. thank you.
Hope it's all good and doesn't introduce different issues. At the very least, it shouldn't be possible for udev to do anything, given nvidia-drivers doesn't use udev at all anymore (no rules, and udev/kernel don't really know about nvidia devices either).