Bug 454740 - x11-drivers/nvidia-drivers should not include udev rule to unconditionally load nvidia.ko (even if the system doesn't support it)
Status: UNCONFIRMED
Alias: None
Product: Gentoo Linux
Classification: Unclassified
Component: Current packages
Hardware: All Linux
Importance: Normal normal
Assignee: David Seifert
URL:
Whiteboard:
Keywords:
Duplicates: 504326 667362 752018
Depends on:
Blocks:
 
Reported: 2013-01-30 21:06 UTC by David Mohr
Modified: 2021-03-06 18:00 UTC
CC List: 13 users

See Also:
Package list:
Runtime testing required: ---


Attachments
working ck config (config-5.11.0-ck,119.50 KB, text/plain)
2021-02-24 12:28 UTC, kartebi
working gentoo config (config-5.11.1-gentoo,119.44 KB, text/plain)
2021-02-24 12:28 UTC, kartebi
problematic gentoo config (config-5.11.1-gentoo.old(buggy),119.43 KB, text/plain)
2021-02-24 12:29 UTC, kartebi

Description David Mohr 2013-01-30 21:06:04 UTC
We use package sets to run the same package selection on several computers. Some have nvidia cards, some have very old nvidia cards, and some have no nvidia cards at all.

The nvidia-drivers package installs /lib/udev/rules.d/99-nvidia.rules, which unconditionally loads nvidia.ko and does not react to failure: modprobe exits with 1 when no card is found or when the module no longer supports the (old) card. Udev just keeps trying to load the module, so modprobe is constantly running even though the module will never load.

pstree looks like this:
     |       `-udevd---nvidia-udev.sh---nvidia-smi---modprobe

Reproducible: Always

Steps to Reproduce:
1. Install nvidia-drivers on a machine without an nvidia card
Actual Results:  
`pstree | grep nvidia` will show the output above.

Expected Results:  
I would expect the udev rule to only act on supported PCI ids.
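
A minimal sketch of the kind of guard the expected behaviour implies, assuming it would be called from the helper script before anything tries to load the module. It only checks whether any PCI device carries NVIDIA's vendor ID (0x10de); it does not know which cards a given driver version still supports. The function name has_nvidia_pci and its optional directory argument are made up for this illustration and are not part of nvidia-drivers.

```shell
#!/bin/sh
# Hypothetical helper: succeed if any PCI device reports NVIDIA's vendor ID.
# The optional argument overrides the sysfs directory (useful for testing).
has_nvidia_pci() {
    sysfs_pci="${1:-/sys/bus/pci/devices}"
    for vendor_file in "$sysfs_pci"/*/vendor; do
        # Skip the unexpanded glob when the directory is empty or missing.
        [ -r "$vendor_file" ] || continue
        vendor=$(cat "$vendor_file")
        # 0x10de is the PCI vendor ID assigned to NVIDIA.
        [ "$vendor" = "0x10de" ] && return 0
    done
    return 1
}
```

A caller could then bail out early, for example with `has_nvidia_pci || exit 0`, on machines that have no NVIDIA hardware at all.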
Comment 1 Vadim A. Misbakh-Soloviov (mva) gentoo-dev 2015-03-05 19:12:21 UTC
jer: ping?

The driver is at 346.35 now, but the problem still persists and hits laptops very hard (I'd say "Optimus ones", but I have the issue even on an nv-only one)!

Can you just stop calling the blob (nvidia-smi) in the script that the udev rule runs, and just mknod there instead?

Really, that shitty crap just eats 100% CPU and puts the X server into D state from time to time (blocking the nvidia card and kernel module too). And it ignores any attempt to kill it with any signal.
Comment 2 Rick Farina (Zero_Chaos) gentoo-dev 2015-09-18 18:01:40 UTC
the udev rule is this:

ACTION=="add", DEVPATH=="/module/nvidia", SUBSYSTEM=="module", RUN+="nvidia-udev.sh $env{ACTION}"

This means that nothing should happen unless udev (or you) load the nvidia kernel module. I'm open to a fix for this. However, as far as I know, this should do nothing if the nvidia kernel module isn't loaded, so I would need to know what is loading the kernel module and how to teach udev not to run this when the modprobe fails (as far as I know that should be the default, but you seem to suggest that it is running no matter what).
Comment 3 Vadim A. Misbakh-Soloviov (mva) gentoo-dev 2015-09-18 18:43:35 UTC
(In reply to Rick Farina (Zero_Chaos) from comment #2)
> the udev rule is this:
> 
> ACTION=="add", DEVPATH=="/module/nvidia", SUBSYSTEM=="module",
> RUN+="nvidia-udev.sh $env{ACTION}"

Yes, Rick, the udev rule is this. But the script itself runs `/opt/bin/nvidia-smi` on the "add" event, and nvidia-smi, in turn, can find that the module is already unloaded by the time it runs. And what does it do then? It loads the nvidia module once again, which again unloads before the second copy of the blob runs. Once again. And again. And so on in an infinite cycle. I once got about 20k duplicate nvidia records in sysfs ;)

> this means, that nothing should happen unless udev (or you) load the nvidia
> kernel module.

Almost. For some reason, I hit this bug while working with imagemagick's convert on postscript-related files. For some reason convert calls that blob (and it is fine if the blob just exits with 1, so I don't know why the hell it calls it at all), and the blob loads the module. The module triggers the udev rule and unloads. The udev rule calls the blob, the blob loads the module: infinite loop.

> I'm open to a fix for this

Drop that blob, maybe? I remove it by hand after every nvidia-drivers rebuild and everything works pretty fine. I don't know why the hell it would be needed.

> as far as I know, this should do nothing if the nvidia kernel module isn't loaded

Yeah, except loading the module (and failing to find the temporarily disabled card, thus producing an infinite load-unload loop with heavy syslog spamming) ;)

> I would need to know what is loading the kernel module

It is the suid blob called /opt/bin/nvidia-smi

> and how to teach udev not to run this when the modprobe fails

How about removing the blob and fixing /lib/udev/nvidia-udev.sh to not run it on the "add" event? :)

> (as far as I know that should be default but you seem to suggest that it is running no matter what)

Not exactly: modprobe doesn't fail. It loads the nvidia module just fine. The nvidia module triggers the udev rule, but then the module is unloaded (either it unloads itself, or the parent nvidia-smi blob unloads it) because it could not detect any working compatible nvidia card in the system (on laptops, mostly because the nvidia card is temporarily disabled; on desktop I experienced it after updating the drivers to a version that is "too new for this card"). But by the time the module has complained in syslog (or even dmesg) and unloaded, a new copy of the nvidia-smi blob appears (triggered by the script, which is triggered by the udev rule). The new copy of the blob sees that the module is unloaded and... {infinite loop}
Comment 4 Rick Farina (Zero_Chaos) gentoo-dev 2015-09-19 14:36:31 UTC
(In reply to Vadim A. Misbakh-Soloviov (mva) from comment #3)
Okay, so the problem appears to be that "something" causes the nvidia driver to load, but the hardware isn't supported so it unloads. This happens before the udev script is run, so udev detects that the module was loaded and runs the script, which again triggers the load, and so on in an infinite loop.

I'll find a sane way to check if the module is loaded before running nvidia-smi and just not run it if the module isn't loaded to avoid the loop.

I'll try to have this done today.
Comment 5 Rick Farina (Zero_Chaos) gentoo-dev 2015-09-19 22:29:15 UTC
Please modify /lib/udev/nvidia-udev.sh like this and tell me if it resolves the issue:

if lsmod | grep -iq nvidia; then
  /opt/bin/nvidia-smi > /dev/null
fi
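
One observation on the check above: `grep -iq nvidia` matches any lsmod line that mentions nvidia, including related modules such as nvidia_drm. A stricter variant, sketched here as an assumption and not as what was committed, anchors the match to the module name at the start of a /proc/modules line. The function name nvidia_module_loaded and its optional file argument are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: succeed only if a module named exactly "nvidia" is loaded.
# The optional argument overrides the modules file (useful for testing).
nvidia_module_loaded() {
    modules_file="${1:-/proc/modules}"
    # Each line of /proc/modules starts with the module name followed by a
    # space, so '^nvidia ' does not match nvidia_drm, nvidia_uvm, or a
    # mention of nvidia in another module's dependency column.
    grep -q '^nvidia ' "$modules_file"
}

# Intended use inside the udev helper script (left commented out here):
# if nvidia_module_loaded; then
#     /opt/bin/nvidia-smi > /dev/null
# fi
```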
Comment 6 Vadim A. Misbakh-Soloviov (mva) gentoo-dev 2015-09-20 21:19:39 UTC
Yeah, it seems that modification fixes the infinite loop issue (at least for me)
Comment 7 Rick Farina (Zero_Chaos) gentoo-dev 2015-09-20 21:30:21 UTC
committed with revbump, hopefully that squashes the issue
Comment 8 Jeroen Roovers (RETIRED) gentoo-dev 2015-09-21 05:04:50 UTC
You "fixed" it for one of 5 branches?
Comment 9 Jeroen Roovers (RETIRED) gentoo-dev 2015-09-21 05:05:04 UTC
And went straight to stable on the revbump?
Comment 10 Jeroen Roovers (RETIRED) gentoo-dev 2015-09-21 05:23:47 UTC
And BEFORE you do anything else, Rick, just DON'T.
Comment 11 Jeroen Roovers (RETIRED) gentoo-dev 2015-09-23 05:50:02 UTC
Fixed for all applicable branches.
Comment 12 Rick Farina (Zero_Chaos) gentoo-dev 2015-09-23 19:21:33 UTC
(In reply to Jeroen Roovers from comment #10)
> And BEFORE you do anything else, Rick, just DON'T.

If you feel the need to flame me, save your typing.  If you have a grievance go through proper channels.
Comment 13 Rick Farina (Zero_Chaos) gentoo-dev 2015-09-23 19:29:13 UTC
please actually stable things before claiming that it was fixed.

For a script change like this there is no need to load the arch teams, and as I've tested this fix on both amd64 and x86 I stabled it.

To avoid a childish revert war here I'll let you stable it yourself, but seriously, this is a minor script change, stabilize the package and stop hurting the users.
Comment 14 oberwipf 2018-04-18 09:18:39 UTC
Hi community. I ran into this issue in the last weeks. Sometimes X is unable to go past the login manager (which works fine), because /opt/bin/nvidia-smi is blocking. My "workaround" is commenting out the respective line in /lib/udev/nvidia-udev.sh.

My machine has a hybrid graphics card. Normally I am running the Intel card. I'm using optirun/bumblebee.

$ sudo lspci | grep -i nvidia
01:00.0 VGA compatible controller: NVIDIA Corporation GF108M [GeForce GT 540M] (rev a1)

This hanging (or looping?) happens once every few weeks. I haven't tested it, but it may always have occurred during boot, after I had used the Nvidia graphics card via optirun.

Please let me know if you need more system details. Happy to help and test.
Comment 15 MZ 2018-07-13 21:56:49 UTC
Frustrating bug, same problem.

After starting lightdm, only a black screen and a non-blinking cursor are shown.

nvidia-smi generates 100% CPU load.

Removing the line "/opt/bin/nvidia-smi > /dev/null" helps me.

x11-drivers/nvidia-drivers-390.67 (also old versions)
sys-kernel/gentoo-sources-4.17.5 (also 4.16.x)
Comment 16 Teodor Petrov 2018-11-27 23:26:03 UTC
Same problem on a desktop machine...
None of the workarounds mentioned in topics about this issue help.

I've tried both the sleep 1 and the sleep 5 workarounds.
I've tried commenting out the line calling nvidia-smi.
None of these help.

The only thing that helps is renaming the nvidia modules and rebooting. After the reboot I log in, rename the modules back to their original names, and then I can start X. I guess the thing that helps is loading the nvidia modules manually.

Mighty annoying. Please fix.

I'm running
> Linux xlad 4.19.4-gentoo #1 SMP PREEMPT Tue Nov 27 22:11:43 EET 2018 x86_64 AMD Ryzen 7 1700 Eight-Core Processor AuthenticAMD GNU/Linux
> nvidia gtx 1050

I've tried drivers 410.78, 415.18. This problem started happening recently. Before that I've not seen it.
Comment 17 Teodor Petrov 2019-03-23 17:03:17 UTC
This problem is still happening. And it is quite annoying. Can someone try to resolve it?

I've upgraded to kernel 5.0.3-gentoo and nvidia-drivers-418.56, but it doesn't help. The only thing that helps is making sure that the nvidia.ko cannot be loaded.

As I've said, it is so annoying because I have to restart twice to get a working X. The first boot is the infinite loop one, so I have to rename nvidia.ko to something else. Then I restart, press the hard-reset button (because of course udev is stuck and cannot finish), wait for the system to start, log in, rename nvidia.ko back so it can be loaded, and start X.

So, annoying...
Comment 18 Ingo Kemper 2019-04-10 00:54:17 UTC
Same problem here for a few weeks now. The cause of the problem seems to be in the shutdown process.

From a working Gentoo with nvidia-drivers:

reboot -> not working, 100% udevd
shutdown and restart computer -> not working, 100% udevd
pressing reset key -> no problems
reboot into Fedora (nouveau-drivers), then reboot into Gentoo -> not working, 100% udevd
reboot into Antergos (nvidia-drivers), then reboot into Gentoo -> no problems

Tried several distros with nouveau- and nvidia-drivers; the result is always the same: Gentoo does not work correctly after using a distro with nouveau-drivers but always works after using a distro with nvidia-drivers. Using sys-fs/udev or sys-fs/eudev does not change anything.

BTW, x11-drivers/nvidia-drivers-418.56, kernel 5.0.7.
Comment 19 John Blbec 2021-02-20 17:54:01 UTC
this ticket has been reported 7 years ago and assigned to David Seifert. David has never answered any message here. there is something wrong in the process. could anybody from gentoo team have a look at it, please? thank you.
Comment 20 David Seifert gentoo-dev 2021-02-20 19:13:54 UTC
(In reply to John Blbec from comment #19)
> this ticket has been reported 7 years ago and assigned to David Seifert.
> David has never answered any message here. there is something wrong in the
> process. could anybody from gentoo team have a look at it, please? thank you.

Maybe because I have a ton of stuff on my plate and 90% of nvidia stuff is beyond my control?
Comment 21 John Blbec 2021-02-22 17:09:29 UTC
@David,

I've already answered your response in Gentoo's forum. I'm really glad you're alive. I understand you may have tons of other stuff, but no one could know that because there has been no reaction from you here.

I can provide any information you need to investigate the issue, because I'm able to reproduce it and I think I'm not alone. If the issue is beyond your control, please say so and let everybody here know, because the issue is annoying and I'm not the only one who would like to see it finally solved.
Comment 22 kartebi 2021-02-23 06:51:45 UTC
About the nvidia module and 100% udev stuff (I have commented on other similar bugs with no response so far), I think I found the problem: -fomit-frame-pointer
I removed it from my flags, deleted /lib/modules, rebuilt the kernel and nvidia-drivers without it, and got a working system.
Comment 23 John Blbec 2021-02-23 08:12:04 UTC
@kartebi,

i'll have to test it as soon as i'm home but i think i don't use -fomit-frame-pointer directly. i guess i use CFLAGS="-march=native -O2 -pipe"
Comment 24 John Blbec 2021-02-23 08:15:43 UTC
it looks like -Ox enables it by default
Comment 25 kartebi 2021-02-23 10:30:46 UTC
Hmm interesting
Looks like -fomit-frame-pointer is enabled by default at -O1 and higher
( https://gcc.gnu.org/onlinedocs/gcc/Optimize-Options.html )
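
Whether a given optimization level turns the flag on for the local compiler can be checked directly, since `gcc -Q --help=optimizers` prints each optimizer flag with its state. The small filter below is only a sketch for picking one flag out of that report; the name flag_state is made up for this example.

```shell
#!/bin/sh
# Hypothetical helper: read a `gcc -Q --help=optimizers` report on stdin and
# print the state column (e.g. [enabled]) for the flag named in $1.
flag_state() {
    awk -v flag="$1" '$1 == flag { print $2 }'
}

# Example invocation (requires gcc, so it is left commented out):
# gcc -Q --help=optimizers -O2 | flag_state -fomit-frame-pointer
```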

I have a default Gentoo build on another drive as a base, with the default "-O2 -pipe", and everything works fine there.

I experiment on another drive with stuff like -O3, ltoize etc., and recently came across this nvidia/udev bug there. I had to rebuild several times and walk my flags back until I had "-O2 -pipe -fomit-frame-pointer", and I still got this nvidia bug. After removing -fomit-frame-pointer from my make.conf, the nvidia bug was gone....
Comment 26 John Blbec 2021-02-23 11:22:32 UTC
@kartebi, it seems it is not a solution.
Comment 27 kartebi 2021-02-23 12:02:51 UTC
@John Blbec

Sad to hear. For me the issue is gone (unable to remove the nvidia module, udev at 100% CPU on 1 core, X not starting, unable to reboot)

Just a little detail that might help: I rebuilt both the kernel and nvidia-drivers with "-O2 -pipe" only, while booted into a kernel with no nvidia support (no nvidia module in /lib/modules)
Comment 28 kartebi 2021-02-24 12:28:02 UTC
Created attachment 688239
working ck config
Comment 29 kartebi 2021-02-24 12:28:50 UTC
Created attachment 688242
working gentoo config
Comment 30 kartebi 2021-02-24 12:29:26 UTC
Created attachment 688245
problematic gentoo config
Comment 31 kartebi 2021-02-24 12:37:13 UTC
Plz disregard my above comments, i was stupid

Long story short,

If you build nvidia-drivers against a kernel that is configured to use a 1000 Hz timer frequency, shit hits the fan :)
100 Hz and 300 Hz work fine

Attached working ck config and gentoo config, and a buggy gentoo config
Only difference in the gentoo configs is the timer frequency
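
For anyone wanting to compare their own setup, the timer frequency can be read from the CONFIG_HZ= line of a kernel config. The helper below is a sketch only; the name config_hz is made up, and reading the running kernel's config assumes /proc/config.gz is available (CONFIG_IKCONFIG_PROC).

```shell
#!/bin/sh
# Hypothetical helper: print the numeric timer frequency from a plain-text
# kernel config. Reads the file named in $1, or stdin when no file is given.
config_hz() {
    # Match only the exact CONFIG_HZ=<number> line, not CONFIG_HZ_1000=y etc.
    sed -n 's/^CONFIG_HZ=\([0-9][0-9]*\)$/\1/p' "$@"
}

# Example invocations (left commented out):
# config_hz config-5.11.1-gentoo
# zcat /proc/config.gz | config_hz
```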

Longer story,

I was using ck-sources and a 100 Hz timer because Con Kolivas says it works best with his scheduler.
Then I thought I'd try out pf-sources, and the problem reappeared.
I brute-forced my way to finding this problem: I rebuilt kernels and nvidia-drivers about 30 times and hard-reset my system about as many times.

Reminder: you must rebuild nvidia-drivers every time you change the timer frequency.
I tried using nvidia-drivers built against a 1000 Hz kernel while booted into a 100 Hz kernel, and the problem remained.

Hope the devs figure this out
Comment 32 kartebi 2021-02-24 12:46:01 UTC
Sorry cant find any edit button here, thought i should add the use flags i have

sys-kernel/gentoo-sources-5.11.1:5.11.1::gentoo  USE="experimental -build -symlink"
x11-drivers/nvidia-drivers-460.39-r1:0/460::gentoo  USE="X driver kms multilib static-libs tools uvm -compat -dist-kernel -wayland" ABI_X86="32 (64) (-x32)"
Comment 33 Teodor Petrov 2021-02-24 13:59:27 UTC
$ less /proc/config.gz |grep HZ_1000
CONFIG_HZ_1000=y

$ uname -a
Linux 5.8.14-gentoo #1 SMP PREEMPT Sun Oct 11 01:04:34 EEST 2020 x86_64 AMD Ryzen 7 1700 Eight-Core Processor AuthenticAMD GNU/Linux

I'll update kernel soon and I'll see if anything changes.
Comment 34 Ionen Wolkens 2021-03-02 21:14:00 UTC
Building drivers against a CONFIG_HZ_1000=y kernel doesn't seem to change anything for me.

But even if I've never been able to reproduce it, I'll still review the relevance of the udev rule and how the script handles things when I get to this (don't expect this all that soon).

Any more hints regarding this would be helpful as it's hard to fix something I can't reproduce.
Comment 35 kartebi 2021-03-03 10:11:07 UTC
(In reply to Ionen Wolkens from comment #34)
> Building drivers against a CONFIG_HZ_1000=y kernel doesn't seem to change
> anything for me.
> 
> But even if I've never been able to reproduce it, I'll still review the
> relevance of the udev rule and how the script handles things when I get to
> this (don't expect this all that soon).
> 
> Any more hints regarding this would be helpful as it's hard to fix something
> I can't reproduce.

On my system I can reproduce it 100% of the time if I change the timer to 1000 Hz without changing anything else in the kernel config. (I tried it again with a newer kernel and nvidia-drivers, still the same.)
I have a fairly minimal config, so maybe something I disabled or enabled, in combination with 1000 Hz, produces the bug.
I have added my configs as attachments
Comment 36 kartebi 2021-03-03 12:17:42 UTC
Another thing that may not be obvious (from an earlier post where I listed my use flags): because I have installed steam from steam-overlay, I have several packages built with abi_x86_32 (32-bit), and one of them is nvidia-drivers

app-arch/zstd abi_x86_32
dev-libs/expat abi_x86_32
dev-libs/wayland abi_x86_32
dev-libs/libgcrypt abi_x86_32
dev-libs/libgpg-error abi_x86_32
dev-util/wayland-scanner abi_x86_32
media-libs/mesa abi_x86_32
media-libs/libpng-compat abi_x86_32
sys-apps/lm-sensors abi_x86_32
sys-devel/llvm abi_x86_32
virtual/opengl abi_x86_32
virtual/libintl abi_x86_32
x11-libs/libXfixes abi_x86_32
x11-libs/libXrandr abi_x86_32
x11-libs/libXrender abi_x86_32
x11-libs/libXxf86vm abi_x86_32
x11-libs/libdrm abi_x86_32
x11-libs/libxshmfence abi_x86_32
x11-drivers/nvidia-drivers abi_x86_32
media-libs/libglvnd abi_x86_32
x11-libs/libvdpau abi_x86_32
x11-libs/libX11 abi_x86_32
x11-libs/libXext abi_x86_32
sys-libs/zlib abi_x86_32
x11-libs/libxcb abi_x86_32
x11-libs/libXau abi_x86_32
x11-libs/libXdmcp abi_x86_32
x11-base/xcb-proto abi_x86_32
dev-libs/libffi abi_x86_32
sys-libs/gpm abi_x86_32
sys-libs/ncurses abi_x86_32
dev-libs/libxml2 abi_x86_32
dev-libs/icu abi_x86_32
app-arch/xz-utils abi_x86_32
Comment 37 Ionen Wolkens 2021-03-04 08:23:57 UTC
Well, figuring this out may not be necessary.

Had a look as to why nvidia-smi is being run (bug #376527, to create devices for no-X cuda users), and the irony is that we also have:
1. bug #505092 allows nvidia libraries to auto-run nvidia-modprobe which creates any missing devices on top of loading modules
2. there's another udev rule that creates /dev/nvidia-uvm without nvidia-smi

The devices were already created with the video group, so simply removing the nvidia-smi call shouldn't disrupt most users, as they should already be in the video group to call nvidia-modprobe.
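
As a quick way for users to confirm the group assumption above, here is a sketch that checks group membership from the output of `id -nG`. The helper name in_group and its optional second argument are hypothetical and only exist for this illustration.

```shell
#!/bin/sh
# Hypothetical helper: succeed if group $1 appears in the space-separated
# group list $2 (defaults to the calling user's groups from `id -nG`).
in_group() {
    group_list="${2:-$(id -nG)}"
    for g in $group_list; do
        # Compare whole names so "videos" does not count as "video".
        [ "$g" = "$1" ] && return 0
    done
    return 1
}

# Example invocation (left commented out):
# in_group video || echo "current user is not in the video group"
```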

So plan is now to just remove nvidia-udev.sh (won't be right away from me, but expect it sooner or later). If possible I'll see if I can remove udev rules entirely, although I still need to check some misbehavior of nvidia-modprobe.
Comment 38 John Blbec 2021-03-04 08:46:53 UTC
thank you @ionen I appreciate your help.
Comment 39 Ionen Wolkens 2021-03-04 21:27:26 UTC
*** Bug 667362 has been marked as a duplicate of this bug. ***
Comment 40 Ionen Wolkens 2021-03-04 21:33:06 UTC
*** Bug 504326 has been marked as a duplicate of this bug. ***
Comment 41 Ionen Wolkens 2021-03-06 08:16:24 UTC
*** Bug 752018 has been marked as a duplicate of this bug. ***
Comment 42 Ionen Wolkens 2021-03-06 16:11:36 UTC
Ran into an annoying issue: while the devices (which this was intended to fix) aren't a problem, if nvidia-drm is not loaded early, Xorg will only work with a custom config that has nvidia in it, i.e. it can't auto-detect anymore.

But nvidia-modprobe -m may be a better solution than nvidia-smi, hopefully it won't hang for anyone.
Comment 43 Ionen Wolkens 2021-03-06 18:00:17 UTC
(In reply to Ionen Wolkens from comment #42)
> Ran into an annoying issue, [...]
Actually figured out why this was happening, I shouldn't speak too soon on bugzilla. Sorry for the noise :) Still no need for udev.