670340 – x11-drivers/nvidia-drivers-410.xx do not work [found workaround]

Status:	RESOLVED DUPLICATE of bug 667362

Alias:	None

Product:	Gentoo Linux
Classification:	Unclassified
Component:	Current packages
Hardware:	AMD64 Linux

Importance:	Normal normal
Assignee:	Jeroen Roovers (RETIRED)

URL:
Whiteboard:
Keywords:

Depends on:
Blocks:

Reported:	2018-11-05 02:17 UTC by Ivan
Modified:	2021-02-20 03:45 UTC
CC List:	10 users

See Also:	667362
Package list:
Runtime testing required:	---

Attachments
emerge --info (emerge--info,6.85 KB, text/plain) 2018-11-05 02:17 UTC, Ivan
dmesg (dmesg.log,68.03 KB, text/x-log) 2018-11-05 02:18 UTC, Ivan
/var/log/messages (messages,6.84 KB, text/plain) 2018-11-05 02:26 UTC, Ivan
Xorg.0.log (Xorg.0.log,30.20 KB, text/plain) 2018-11-05 02:30 UTC, Ivan
.config - 4.18.16-gentoo Kernel Configuration (kernel_config.txt,108.65 KB, text/plain) 2018-11-05 02:32 UTC, Ivan

Description Ivan 2018-11-05 02:17:05 UTC

Created attachment 554108 [details]
emerge --info

* What I did:

1. Emerged the latest nvidia-drivers.
2. Rebooted after that.
3. Noticed that the OS itself booted fine.
4. Found that the X server failed to start and I was left at the tty1 prompt.

* What I expected:

I expected to see X, sddm and KDE loading successfully.

* What I tried to do:

1. I tried every 410.xx driver available from Portage with 4.18.xx kernels, and also tried 410.73 with 4.18.xx and 4.19.0. No luck.
2. I tried running nvidia-xconfig with the new driver (from tty1 after the first boot with the new driver) and rebooting afterwards. It didn't work.
3. I also tried to blacklist the nvidia modules. It didn't help.
4. I tried to build nvidia-drivers-410.73 with the patch from here: https://devtalk.nvidia.com/default/topic/1043346/nvidia-driver-v410-73-fails-to-build-functional-modules/ (with modified paths according to Chris Torske at https://bugs.gentoo.org/669902#c1). Builds fine, but doesn't solve my problem.


* What I noticed:

1. Usually, after X, the DE, etc. launch successfully, I get the following output:
$ lsmod | grep nv
nvidia_drm             40960  7
nvidia_modeset       1060864  19 nvidia_drm
nvidia              13549568  943 nvidia_modeset

And when loading fails with the new driver, I get only the "nvidia" module.

According to dmesg with 410.73:
ivan@pc ~ $ cat dmesg.log | grep nvid
[    1.692049] nvidia: loading out-of-tree module taints kernel.
[    1.692054] nvidia: module license 'NVIDIA' taints kernel.
[    1.703574] nvidia-nvlink: Nvlink Core is being initialized, major device number 244
[    1.703858] nvidia 0000:03:00.0: vgaarb: changed VGA decodes: olddecodes=io+mem,decodes=none:owns=io+mem


And successful load with 396.54:
ivan@pc ~ $ dmesg | grep nvid
[    2.045808] nvidia: loading out-of-tree module taints kernel.
[    2.045813] nvidia: module license 'NVIDIA' taints kernel.
[    2.055768] nvidia-nvlink: Nvlink Core is being initialized, major device number 244
[    2.055979] nvidia 0000:03:00.0: vgaarb: changed VGA decodes: olddecodes=io+mem,decodes=none:owns=io+mem
[    2.063402] nvidia-modeset: Loading NVIDIA Kernel Mode Setting Driver for UNIX platforms  396.54  Tue Aug 14 23:08:44 PDT 2018
[    2.065804] [drm] [nvidia-drm] [GPU ID 0x00000300] Loading driver
[    2.065806] [drm] Initialized nvidia-drm 0.0.0 20160202 for 0000:03:00.0 on minor 0
[    2.493934] caller _nv001112rm+0xe3/0x1d0 [nvidia] mapping multiple BARs
[    8.716686] caller _nv001112rm+0xe3/0x1d0 [nvidia] mapping multiple BARs


2. Sometimes I got messages:
timeout 'nvidia-udev.sh add'
slow: 'nvidia-udev.sh add'
timeout: killing 'nvidia-udev.sh add'
slow: 'nvidia-udev.sh add'

Which is similar to https://bugs.gentoo.org/667362#c0

3. I also sometimes (again, only sometimes) receive the following messages in /var/log/messages:
NVRM: API mismatch: the client has the version 410.73, but\x0aNVRM: this kernel module has the version 396.54.  Please\x0aNVRM: make sure that this kernel module and all NVIDIA driver\x0aNVRM: components have the same version.

However, I made sure that I built 410.73 against the kernel that I load. Checked multiple times. 
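
The two versions named in an API mismatch message like the one above can be pulled out of the log for a quick comparison. This is only a minimal sketch: it assumes the message keeps the exact wording shown above, and the function name nvrm_versions is made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: read log lines on stdin and print
# "<client version> <kernel module version>" for each NVRM API mismatch line.
# Lines that do not contain the mismatch message produce no output.
nvrm_versions() {
    sed -n 's/.*client has the version \([0-9.]*\), but.*kernel module has the version \([0-9.]*\)\..*/\1 \2/p'
}

# Example: nvrm_versions < /var/log/messages
```

If the two numbers differ, the module that was loaded at that moment did not match the userspace driver. 'modinfo -F version nvidia' shows the version of the module installed for the running kernel, which may help to tell a stale loaded module from a stale build.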

I also think that this might be some kind of hardware problem, because I got a kernel panic when I ran 'emerge @module-rebuild' to re-emerge nvidia-drivers-410.73 after I tried to boot with 410.73. But I am not sure.

Comment 1 Ivan 2018-11-05 02:18:05 UTC

Created attachment 554110 [details]
dmesg

Comment 2 Ivan 2018-11-05 02:25:21 UTC

Found out that 

> NVRM: API mismatch: the client has the version 410.73, but\x0aNVRM: this kernel module has the version 396.54.  Please\x0aNVRM: make sure that this kernel module and all NVIDIA driver\x0aNVRM: components have the same version.

took place BEFORE I actually rebooted, so that is not relevant.

Comment 3 Ivan 2018-11-05 02:26:41 UTC

Created attachment 554114 [details]
/var/log/messages

Comment 4 Ivan 2018-11-05 02:30:52 UTC

Created attachment 554116 [details]
Xorg.0.log

Comment 5 Ivan 2018-11-05 02:32:48 UTC

Created attachment 554118 [details]
.config - 4.18.16-gentoo Kernel Configuration

Comment 6 Jeroen Roovers (RETIRED) gentoo-dev 2018-11-05 09:13:10 UTC

Comment on attachment 554116 [details]
Xorg.0.log

>[     7.968] (--) Log file renamed from "/var/log/Xorg.pid-3852.log" to "/var/log/Xorg.0.log"

...

>[    10.477] (II) NVIDIA(0): Setting mode "DVI-D-0: nvidia-auto-select @1920x1080 +0+0 {ViewPortIn=1920x1080, ViewPortOut=1920x1080+0+0, ForceCompositionPipeline=On, ForceFullCompositionPipeline=On}"
>[   891.841] (II) config/udev: Adding input device Plantronics Plantronics GameCom 780 (/dev/input/event7)

...

>[   891.860] (II) event7  - Plantronics Plantronics GameCom 780: device is a keyboard
>[  6250.366] (II) config/udev: removing device Plantronics Plantronics GameCom 780

...

>[  6282.424] (II) NVIDIA(GPU-0): Deleting GPU-0
>[  6282.426] (II) Server terminated successfully (0). Closing log file.

Looks like it worked just fine.

Comment 7 Ivan 2018-11-06 01:56:28 UTC

Tried with the new kernel 4.19.1. As expected, it doesn't work either.

Dmesg says:
[  182.952613] udevd[2153]: timeout 'nvidia-udev.sh add'
[  182.952626] udevd[2153]: slow: 'nvidia-udev.sh add' [2305]
[  183.953608] udevd[2153]: timeout: killing 'nvidia-udev.sh add' [2305]
[  183.953622] udevd[2153]: slow: 'nvidia-udev.sh add' [2305]
[  183.953717] udevd[2153]: 'nvidia-udev.sh add' [2305] terminated by signal 9 (Killed)

Comment 8 Ivan 2018-11-09 18:16:01 UTC

So it appears that I was able to circumvent the issue by rethinking all the comments from wise people about blacklisting modules.

At last I noticed that IF I blacklist all the modules to prevent them from being loaded (by udev, as far as I know), I can actually load the modules manually via modprobe, and somehow that works perfectly. NOTE: I couldn't load or remove the nvidia modules if I hadn't blacklisted them.

After that the solution was simple. It's probably not the best way, maybe it's a plain dumb way, but it works for me.

So here's what I did:

1. Added the following lines to /etc/modprobe.d/blacklist.conf:
blacklist nvidia
blacklist nvidia_drm
blacklist nvidia_modeset

(basically, I just blacklisted all nvidia modules that usually are loaded, which you can see by typing 'lsmod | grep -i nvidia' when your DE works)

2. Created file /etc/local.d/nvidia-udev-workaround.start
Added the following lines in it:
#!/bin/sh

echo "NVIDIA WORKAROUND IN PROGRESS";
modprobe nvidia_drm;

3. Made that script executable by:
chmod +x /etc/local.d/nvidia-udev-workaround.start

4. Made sure that local appears in default runlevel:
rc-update show default

If "local" is not listed, add it with 'rc-update add local default' before trying this workaround.

Then reboot.

Works for me with 410.73 and 415.13 nvidia-drivers, with 4.18.17 and 4.19.1 kernels.
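
Steps 1 to 3 above can be collected into one small script. This is a sketch rather than a tested installer: the function name install_nvidia_workaround and its "root" argument are illustrative additions that are not part of the original steps. Passing a scratch directory lets you preview the generated files before touching the real /etc.

```shell
#!/bin/sh
# Sketch of steps 1-3. The function name and the "root" argument are
# illustrative: pass "" to write to the real /etc (as root), or a scratch
# directory to preview the files first.
install_nvidia_workaround() {
    root=$1
    mkdir -p "$root/etc/modprobe.d" "$root/etc/local.d"

    # Step 1: keep the modules from being loaded automatically.
    printf 'blacklist %s\n' nvidia nvidia_drm nvidia_modeset \
        >> "$root/etc/modprobe.d/blacklist.conf"

    # Step 2: load the modules from the local service instead.
    script="$root/etc/local.d/nvidia-udev-workaround.start"
    printf '%s\n' '#!/bin/sh' \
        'echo "NVIDIA WORKAROUND IN PROGRESS"' \
        'modprobe nvidia_drm' > "$script"

    # Step 3: the script has to be executable to be run.
    chmod +x "$script"
}
```

Run as root, 'install_nvidia_workaround ""' writes to the real paths. Step 4 (checking that "local" is in the default runlevel) still has to be done by hand.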

Comment 9 alpir 2018-12-01 06:51:53 UTC

I can confirm this bug with kernel 4.19.3 and nvidia-drivers-415.18.

Comment 10 Valeriy Malov 2018-12-15 14:09:52 UTC

Related to bug #667362?

I can reproduce it with GTX 660.
Maybe it's worth changing keywords on 410.x from stable to unstable.

Comment 11 Alexander Polozov 2018-12-16 11:05:32 UTC

(In reply to alpir from comment #9)
> Confirm this bug with kernel 4.19.3 and nvidia-drivers-415.18.

I was able to run my X server with nvidia-drivers-415.18 just by commenting out the last line, "#options nvidia NVreg_DeviceFileMode=432 NVreg_DeviceFileUID=0 NVreg_DeviceFileGID=27 NVreg_ModifyDeviceFiles=1", in /etc/modprobe.d/nvidia.conf
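
The same edit can be made non-interactively. A minimal sketch, assuming the only active "options nvidia ..." line in the file is the one quoted above; the function name comment_out_nvidia_options is hypothetical, and note that it comments out every matching line, not just the last one.

```shell
#!/bin/sh
# Hypothetical helper: comment out every active "options nvidia ..." line in
# the given file and keep the previous contents in a backup with a .bak suffix.
# The trailing space in the pattern leaves lines for other modules untouched.
comment_out_nvidia_options() {
    sed -i.bak 's/^options nvidia /#&/' "$1"
}

# Example (as root): comment_out_nvidia_options /etc/modprobe.d/nvidia.conf
```

The change only takes effect the next time the nvidia module is loaded, for example after a reboot.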

Comment 12 Jeroen Roovers (RETIRED) gentoo-dev 2018-12-17 00:31:33 UTC

*** Bug 667362 has been marked as a duplicate of this bug. ***

Comment 13 Jeroen Roovers (RETIRED) gentoo-dev 2018-12-17 00:32:08 UTC

*** This bug has been marked as a duplicate of bug 667362 ***

Comment 14 David Bařina 2019-11-07 13:27:55 UTC

(In reply to Ivan from comment #8)
> So it appears that I was able to circumwent that issue by rethinking all
> comments about blacklisting modules coming from wise people.
> 
> At last I noticed that IF I blacklist all modules to prevent them from
> loading (by udev, from what I know), I can actually load modules manually
> via modprobe and somehow that works perfectly. NOTE: I couldn't load or
> remove nvidia modules if I haven't blacklisted them. 
> 
> After that the solution was simple. Probably it's not the best way, maybe
> it's plain dumb way, but it works for me.
> 
> So here's what I did:
> 
> 1. Added the in /etc/modprobe.d/blacklist.conf following lines:
> blacklist nvidia
> blacklist nvidia_drm
> blacklist nvidia_modeset
> 
> (basically, I just blacklisted all nvidia modules that usually are loaded,
> which you can see by typing 'lsmod | grep -i nvidia' when your DE works)
> 
> 2. Created file /etc/local.d/nvidia-udev-workaround.start
> Added the following lines in it:
> #!/bin/sh
> 
> echo "NVIDIA WORKAROUND IN PROGRESS";
> modprobe nvidia_drm;
> 
> 3. Made that script executable by:
> chmod +x /etc/local.d/nvidia-udev-workaround.start
> 
> 4. Made sure that local appears in default runlevel:
> rc-update show default
> 
> If there's no "local", in order to try that workaround, you should add it by
> rc-update add local default
> 
> Then reboot.
> 
> Works for me with 410.73 and 415.13 nvidia-drivers, with 4.18.17 and 4.19.1
> kernels.

Same problem here. Blacklisting the modules helped me (it prevents eudev from loading the nvidia module). However, the /etc/local.d/nvidia-udev-workaround.start trick is not really necessary; /etc/modules-load.d/ is a better place to do this.
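
A sketch of what that could look like. The file name nvidia.conf, the function name and the "root" argument are illustrative choices. It also assumes that the boot-time module loader reads /etc/modules-load.d/*.conf and loads the listed names explicitly even though they are blacklisted, which is what this comment reports but has not been checked on other setups.

```shell
#!/bin/sh
# Sketch: list nvidia_drm in a modules-load.d file so that it is loaded at
# boot without a local.d script. Pass "" as root to write to the real /etc
# (as root), or a scratch directory to preview the file.
write_nvidia_modules_load() {
    root=$1
    mkdir -p "$root/etc/modules-load.d"
    # One module name per line. nvidia_drm depends on nvidia_modeset and
    # nvidia (see the lsmod output in the description), so one entry is enough.
    printf '%s\n' nvidia_drm > "$root/etc/modules-load.d/nvidia.conf"
}
```

The blacklist entries in /etc/modprobe.d/blacklist.conf stay in place; only the local.d script is replaced by this file.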

Comment 15 gletonai 2020-07-02 19:26:14 UTC

(In reply to Ivan from comment #8)
(In reply to David Bařina from comment #14)
This worked.

Comment 16 kartebi 2021-02-20 03:45:55 UTC

I think it's -fomit-frame-pointer.
I deleted /lib/modules and rebuilt the kernel and nvidia-drivers without it, and everything is back to normal...