Bug 454740

Summary: x11-drivers/nvidia-drivers should not include udev rule to unconditionally load nvidia.ko (even if the system doesn't support it)
Product: Gentoo Linux Reporter: David Mohr <bugs>
Component: Current packages Assignee: Jeroen Roovers <jer>
Status: UNCONFIRMED ---    
Severity: normal CC: fuscated, help, john.blbec, mva, viper, xarthisius, zerochaos, zsojka
Priority: Normal    
Version: unspecified   
Hardware: All   
OS: Linux   
Whiteboard:
Package list:
Runtime testing required: ---
Bug Depends on:    
Bug Blocks: 504326    

Description David Mohr 2013-01-30 21:06:04 UTC
We use package sets to run the same package selection on several computers. Some have nvidia cards, some have very old nvidia cards, and some have no nvidia cards at all.

The nvidia-drivers package installs /lib/udev/rules.d/99-nvidia.rules, which unconditionally loads nvidia.ko and does not react to failure: modprobe exits with 1 when no card is found or when the module no longer supports the (old) card. Udev just keeps trying to load the module, so modprobe is constantly running even though the module will never load.

pstree looks like this:
     |       `-udevd---nvidia-udev.sh---nvidia-smi---modprobe

Reproducible: Always

Steps to Reproduce:
1. Install nvidia-drivers on a machine without an nvidia card
Actual Results:  
`pstree | grep nvidia` will show the output above.

Expected Results:  
I would expect the udev rule to only act on supported PCI ids.
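
A minimal sketch of the kind of guard this expectation implies, not the package's actual script. It only checks whether any PCI device reports NVIDIA's vendor ID (0x10de) in sysfs, which is coarser than matching supported PCI ids: it would still pass for an old card the driver no longer supports. The function name has_nvidia_pci and its optional directory argument are hypothetical, introduced here for illustration.

```shell
#!/bin/sh
# Hypothetical helper (not part of nvidia-drivers): succeed only if at
# least one PCI device reports NVIDIA's vendor ID, 0x10de.
# $1 is an optional sysfs PCI device directory, overridable for testing.
has_nvidia_pci() {
    pci_root="${1:-/sys/bus/pci/devices}"
    for vendor_file in "$pci_root"/*/vendor; do
        # Skip the unexpanded glob (no devices) and unreadable entries.
        [ -r "$vendor_file" ] || continue
        vendor=$(cat "$vendor_file" 2>/dev/null)
        [ "$vendor" = "0x10de" ] && return 0
    done
    return 1
}
```

A script such as nvidia-udev.sh could call `has_nvidia_pci || exit 0` before doing anything else, so that machines without an NVIDIA device never reach the module-loading step.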
Comment 1 Vadim A. Misbakh-Soloviov (mva) gentoo-dev 2015-03-05 19:12:21 UTC
jer: ping?

It is 346.35 now, but the problem still persists and hits laptops very hard (I'd say "Optimus ones", but I have the issue even on an nv-only one)!

Can you just get rid of the call to the blob (nvidia-smi) in the script run by the udev rule, and just mknod there instead?

Really, that shitty crap just eats 100% CPU and puts the X server into D state from time to time (blocking the nvidia card and kernel module too). And it ignores any attempt to kill it with any signal.
Comment 2 Rick Farina (Zero_Chaos) gentoo-dev 2015-09-18 18:01:40 UTC
the udev rule is this:

ACTION=="add", DEVPATH=="/module/nvidia", SUBSYSTEM=="module", RUN+="nvidia-udev.sh $env{ACTION}"

This means that nothing should happen unless udev (or you) loads the nvidia kernel module. I'm open to a fix for this. However, as far as I know, this should do nothing if the nvidia kernel module isn't loaded, so I would need to know what is loading the kernel module and how to teach udev not to run this when the modprobe fails (as far as I know that should be the default, but you seem to suggest that it runs no matter what).
Comment 3 Vadim A. Misbakh-Soloviov (mva) gentoo-dev 2015-09-18 18:43:35 UTC
(In reply to Rick Farina (Zero_Chaos) from comment #2)
> the udev rule is this:
> 
> ACTION=="add", DEVPATH=="/module/nvidia", SUBSYSTEM=="module",
> RUN+="nvidia-udev.sh $env{ACTION}"

Yes, Rick, the udev rule is this. But the script itself runs `/opt/bin/nvidia-smi` on the "add" event, which, in turn, may find that the module has already been unloaded by the time it runs. And what does it do then? It loads the nvidia module once again. And the module unloads again before the second copy of the blob runs. Once again. And again. And so on in an infinite cycle. So I once got about 20k duplicate nvidia records in sysfs ;)

> this means, that nothing should happen unless udev (or you) load the nvidia
> kernel module.

Almost. For some reason, I hit this bug while working with ImageMagick's convert and PostScript-related files. For some reason convert calls that blob (and is fine if the blob just exits with 1, so I dunno why the hell it calls it at all), and the blob loads the module. The module triggers the udev rule and unloads. The udev rule calls the blob, the blob loads the module: infinite loop.

> I'm open to a fix for this

Drop that blob, maybe? I remove it by hand after every nvidia-drivers rebuild and everything works pretty fine. I don't know why the hell it would be needed.

> as far as I know, this should do nothing if the nvidia kernel module isn't loaded

Yeah, except for loading the module (and failing to find the temporarily disabled card, thus producing an infinite load-unload loop with heavy syslog spamming) ;)

> I would need to know what is loading the kernel module

It is a suid blob called /opt/bin/nvidia-smi

> and how to teach udev not to run this when the modprobe fails

How about removing the blob and fixing /lib/udev/nvidia-udev.sh to not run it on the "add" event? :)

> (as far as I know that should be default but you seem to suggest that it is running no matter what)

Not exactly: modprobe doesn't fail. It loads the nvidia module just fine. The nvidia module triggers the udev rule, but then the module is unloaded (either it unloads itself, or the parent nvidia-smi blob unloads it) because it could not detect any working compatible nvidia card in the system (on laptops, mostly because the nvidia card is temporarily disabled; on a desktop I experienced it after updating the drivers to a version that is "too new for this card"). But by the time the module has complained in syslog (or even dmesg) and unloaded, a new copy of the nvidia-smi blob appears (triggered by the script, which is triggered by the udev rule). The new copy of the blob sees that the module is unloaded and... {infinite loop}
Comment 4 Rick Farina (Zero_Chaos) gentoo-dev 2015-09-19 14:36:31 UTC
(In reply to Vadim A. Misbakh-Soloviov (mva) from comment #3)
Okay, so the problem appears to be that "something" causes the nvidia driver to load, but the hardware isn't supported, so it unloads. This happens before the udev script is run; udev still sees that the module was loaded and runs the script, which triggers the load again, and so on in an infinite loop.

I'll find a sane way to check if the module is loaded before running nvidia-smi and just not run it if the module isn't loaded to avoid the loop.

I'll try to have this done today.
Comment 5 Rick Farina (Zero_Chaos) gentoo-dev 2015-09-19 22:29:15 UTC
Please modify /lib/udev/nvidia-udev.sh like this and tell me if it resolves the issue:

if lsmod | grep -iq nvidia; then
  /opt/bin/nvidia-smi > /dev/null
fi
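
One caveat with the check above: `grep -iq nvidia` matches any loaded module whose name contains "nvidia", not only the module named nvidia. A sketch of an exact-name variant follows, assuming that a loaded module shows up as a directory under /sys/module. The function name nvidia_module_loaded and its optional directory argument are hypothetical, and this is not the change that was committed.

```shell
#!/bin/sh
# Hypothetical helper: succeed only if a module named exactly "nvidia"
# is present. $1 is an optional module directory, overridable for testing.
nvidia_module_loaded() {
    module_root="${1:-/sys/module}"
    [ -d "$module_root/nvidia" ]
}

# Intended use inside the udev helper script (left as a comment so that
# sourcing this sketch never runs the binary):
#
#   if nvidia_module_loaded; then
#       /opt/bin/nvidia-smi > /dev/null
#   fi
```

Like the lsmod version, this only narrows the window: the module can still unload between the check and the nvidia-smi call.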
Comment 6 Vadim A. Misbakh-Soloviov (mva) gentoo-dev 2015-09-20 21:19:39 UTC
Yeah, it seems that modification fixes the infinite loop issue (at least for me).
Comment 7 Rick Farina (Zero_Chaos) gentoo-dev 2015-09-20 21:30:21 UTC
committed with revbump, hopefully that squashes the issue
Comment 8 Jeroen Roovers gentoo-dev 2015-09-21 05:04:50 UTC
You "fixed" it for one of 5 branches?
Comment 9 Jeroen Roovers gentoo-dev 2015-09-21 05:05:04 UTC
And went straight to stable on the revbump?
Comment 10 Jeroen Roovers gentoo-dev 2015-09-21 05:23:47 UTC
And BEFORE you do anything else, Rick, just DON'T.
Comment 11 Jeroen Roovers gentoo-dev 2015-09-23 05:50:02 UTC
Fixed for all applicable branches.
Comment 12 Rick Farina (Zero_Chaos) gentoo-dev 2015-09-23 19:21:33 UTC
(In reply to Jeroen Roovers from comment #10)
> And BEFORE you do anything else, Rick, just DON'T.

If you feel the need to flame me, save your typing.  If you have a grievance go through proper channels.
Comment 13 Rick Farina (Zero_Chaos) gentoo-dev 2015-09-23 19:29:13 UTC
please actually stable things before claiming that it was fixed.

For a script change like this there is no need to load the arch teams, and as I've tested this fix on both amd64 and x86 I stabled it.

To avoid a childish revert war here I'll let you stable it yourself, but seriously, this is a minor script change: stabilize the package and stop hurting the users.
Comment 14 oberwipf 2018-04-18 09:18:39 UTC
Hi community. I ran into this issue in the last few weeks. Sometimes X is unable to get past the login manager (which itself works fine), because /opt/bin/nvidia-smi is blocking. My "workaround" is commenting out the respective line in /lib/udev/nvidia-udev.sh.

My machine has a hybrid graphics card. Normally I am running the Intel card. I'm using optirun/bumblebee.

$ sudo lspci | grep -i nvidia
01:00.0 VGA compatible controller: NVIDIA Corporation GF108M [GeForce GT 540M] (rev a1)

This hanging (or looping?) happens once every few weeks. I haven't tested this, but it may always have occurred during boot after I had used the Nvidia graphics card via optirun.

Please let me know if you need more system details. Happy to help and test.
Comment 15 MZ 2018-07-13 21:56:49 UTC
Frustrating bug, same problem.

After starting lightdm, I only get a black screen with a non-blinking cursor.

nvidia-smi generates 100% CPU load.

Removing the line "/opt/bin/nvidia-smi > /dev/null" helps me.

x11-drivers/nvidia-drivers-390.67 (also old versions)
sys-kernel/gentoo-sources-4.17.5 (also 4.16.x)
Comment 16 Teodor Petrov 2018-11-27 23:26:03 UTC
Same problem on a desktop machine...
None of the workarounds mentioned in topics about this issue helps.

I've tried both the sleep 1 and the sleep 5 workarounds.
I've tried commenting out the line calling nvidia-smi.
None of these help.

The only thing that helps is if I rename the nvidia modules and reboot. Then after the reboot I log in, rename the modules back to their original names, and now I can start X. I guess the thing that helps is manually loading the nvidia modules.

Mighty annoying. Please fix.

I'm running
> Linux xlad 4.19.4-gentoo #1 SMP PREEMPT Tue Nov 27 22:11:43 EET 2018 x86_64 AMD Ryzen 7 1700 Eight-Core Processor AuthenticAMD GNU/Linux
> nvidia gtx 1050

I've tried drivers 410.78, 415.18. This problem started happening recently. Before that I've not seen it.
Comment 17 Teodor Petrov 2019-03-23 17:03:17 UTC
This problem is still happening. And it is quite annoying. Can someone try to resolve it?

I've upgraded to kernel 5.0.3-gentoo and nvidia-drivers-418.56, but it doesn't help. The only thing that helps is making sure that the nvidia.ko cannot be loaded.

As I've said, it is so annoying because I have to restart twice to get a working X. The first boot is the infinite-loop one; then I have to rename nvidia.ko to something else, restart by pressing the hard-reset button (because of course udev is stuck and cannot finish), wait for the system to start, log in, rename nvidia.ko back so it can be loaded, and start X.

So, annoying...
Comment 18 Ingo Kemper 2019-04-10 00:54:17 UTC
Same problem here for some weeks now. The cause of the problem seems to be in the shutdown process.

From a working Gentoo with nvidia-drivers:

reboot -> not working, 100% udevd
shutdown and restart computer -> not working, 100% udevd
pressing reset key -> no problems
reboot into Fedora (nouveau-drivers), then reboot into Gentoo -> not working, 100% udevd
reboot into Antergos (nvidia-drivers), then reboot into Gentoo -> no problems

Tried several distros with nouveau- and nvidia-drivers; the result is always the same: Gentoo does not work correctly after using a distro with nouveau-drivers but always works after using a distro with nvidia-drivers. Using sys-fs/udev or sys-fs/eudev does not change anything.

BTW, x11-drivers/nvidia-drivers-418.56, kernel 5.0.7.