Bug 513380

Summary:	x11-drivers/nvidia-drivers-337.25 - at boot, nvidia.ko silently fails to initialise
Product:	Gentoo Linux	Reporter:	Eric Siskonen <esiskonen>
Component:	Current packages	Assignee:	David Seifert <soap>
Status:	RESOLVED WONTFIX
Severity:	normal	CC:	esiskonen, kernel, kjackie, s7mon
Priority:	Highest
Version:	unspecified
Hardware:	AMD64
OS:	Linux
Whiteboard:
Package list:		Runtime testing required:	---
Attachments:	emerge --info; dmesg; Dmesg after boot with nvidia-drivers-337.25; nvidia-bug-report_error.log.gz; nvidia-kerneloops-33725.tar.gz; dmesg with boot error; nvidia bug report (340.24 and kernel 3.12.21-r1); comment 39: dmesg; comment 39: emerge --info

Description Eric Siskonen 2014-06-15 21:02:11 UTC

Created attachment 378968 [details]
emerge --info

Today a new ebuild was made stable (x11-drivers/nvidia-drivers-337.25) which I attempted to upgrade on 2 separate systems. Both machines now throw null pointer dereference errors on boot once it reaches the "waiting for uevents to be processed" portion of the init process. I attempted to downgrade to the previous nvidia-drivers-334.21-r3 ebuild only to notice that it has now been magically removed. I then attempted to unmask the latest version available in portage (nvidia-drivers-340.17) and the issue persists. This is crippling 2 machines that I desperately need to actually do work with. I have attached an emerge --info and also my dmesg file. PLEASE HELP!!!

Comment 1 Eric Siskonen 2014-06-15 21:02:37 UTC

Created attachment 378970 [details]
dmesg

Comment 2 Eric Siskonen 2014-06-15 23:03:55 UTC

I just reverted to the next oldest driver and it is now working properly again. Currently using nvidia-drivers-331.79.ebuild. I'm still not clear why the 334.21-r3 ebuild was removed from portage considering it was the stable to revert to in the event that a new ebuild had issues.

Comment 3 Jeroen Roovers (RETIRED) gentoo-dev 2014-06-15 23:54:36 UTC

(In reply to Eric Siskonen from comment #2)
> I just reverted to the next oldest driver and it is now working properly
> again. Currently using nvidia-drivers-331.79.ebuild.

Nice.

> I'm still not clear why
> the 334.21-r3 ebuild was removed from portage considering it was the stable
> to revert to in the event that a new ebuild had issues.

No, stable is what this[1] says. 337.25 is the stable successor to the 334 branch.

[1] http://www.nvidia.com/object/unix.html

Comment 4 Eric Siskonen 2014-06-15 23:56:56 UTC

Well in Gentoo as of yesterday stable was 334. Today 337 is broken for me on several different machines with no way to revert to the 334 driver which was working. I'm now fighting to get 331 working because of libEGL.

Comment 5 Eric Siskonen 2014-06-16 00:03:41 UTC

After reverting to the old drivers there were issues with libEGL. I had to eselect opengl set xorg-x11 then delete everything in /usr/lib32/opengl/nvidia/lib and /usr/lib64/opengl/nvidia/lib. After doing that I had to emerge =x11-drivers/nvidia-drivers-331.79 and eselect opengl set nvidia. Then I had to emerge -1 mesa after everything was completed. Now I am able to run everything exactly like it was running on 334 prior to upgrading to 337 or 340. Any idea why 337 and 340 are throwing null pointer dereference? It happens at boot when udev runs /opt/bin/nvidia-smi I believe.
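
The recovery steps described above could be sketched as a small script. This is only an illustration of the sequence in this comment, not something the reporter posted: the `run` helper, the `revert_to_331` function name and the `DRY_RUN` switch are made up here, and `find ... -delete` is one assumed way to "delete everything" in the two directories. By default it only prints the commands.

```shell
#!/bin/sh
# Dry run by default: print each command instead of executing it.
# Set DRY_RUN=0 and run as root to actually execute the steps.
DRY_RUN="${DRY_RUN:-1}"

# run CMD...: hypothetical helper that echoes or executes a command.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# revert_to_331: the sequence from the comment, in the same order.
revert_to_331() {
    # Switch away from the NVIDIA OpenGL implementation first
    run eselect opengl set xorg-x11
    # Remove stale libraries left behind by the newer driver
    run find /usr/lib32/opengl/nvidia/lib /usr/lib64/opengl/nvidia/lib -mindepth 1 -delete
    # Reinstall the older driver and switch back to it
    run emerge "=x11-drivers/nvidia-drivers-331.79"
    run eselect opengl set nvidia
    # Rebuild mesa once everything else is in place
    run emerge -1 mesa
}

revert_to_331
```

Running it unchanged prints the five commands so they can be reviewed before anything is touched.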

Comment 6 Jeroen Roovers (RETIRED) gentoo-dev 2014-06-16 00:05:40 UTC

If that's your complete dmesg output, then at no point does the nvidia kernel module appear to be loaded, or at least it doesn't show any output, which isn't normal.

The PCI subsystem output shows that a 10de:0e0a ("GK104 HDMI Audio Controller") and 10de:119f ("GK104M [GeForce GTX 780M]") are present, but the canonical dmesg output from the nvidia module ("NVRM: loading NVIDIA UNIX ...") is missing.
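
A quick way to check a saved dmesg for that banner is sketched below. The function name `check_nvrm` and the file path in the usage note are only illustrative; the banner string is the one quoted above.

```shell
#!/bin/sh
# check_nvrm FILE: report whether the nvidia module's load banner
# appears in a saved dmesg file. (The function name is hypothetical.)
check_nvrm() {
    if grep -q 'NVRM: loading NVIDIA UNIX' "$1"; then
        echo "banner present"
    else
        echo "banner missing"
    fi
}
```

For example, `dmesg > /tmp/dmesg.txt` followed by `check_nvrm /tmp/dmesg.txt`. Separately, `lspci -nn -d 10de:` lists the NVIDIA PCI devices together with their numeric IDs.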

Comment 7 Eric Siskonen 2014-06-16 00:07:11 UTC

It manages to boot all the way into the OS when it has the problem. It will just not run xorg. I ran lsmod after it booted and indeed the nvidia module is loaded in the kernel.

Comment 8 Jeroen Roovers (RETIRED) gentoo-dev 2014-06-16 00:08:08 UTC

You might be hitting bug #508196 here. That means your BIOS should set up the device properly or you should upgrade the BIOS.

Comment 9 Eric Siskonen 2014-06-16 00:20:26 UTC

After reading that post I'm not sure we are having the same problem. My system does not have optimus and I am not running bumblebee. This machine has 2 Nvidia GTX 780m cards in SLI so optimus would be unsupported even if they made it optional. I'm running nvidia-drivers as the only driver for xorg. I will investigate updating the BIOS but unfortunately I'm not optimistic this is the solution.

Comment 10 Jeroen Roovers (RETIRED) gentoo-dev 2014-06-16 00:26:01 UTC

(In reply to Eric Siskonen from comment #9)
> After reading that post I'm not sure we are having the same problem. My
> system does not have optimus and I am not running bumblebee.

Right.

> I will investigate updating the BIOS but unfortunately I'm not
> optimistic this is the solution.

I'm willing to put back 334 for now but I would urge you to contact both Alienware (because you paid them a lot of money) and Nvidia (because indirectly, same) with a full bug report (run nvidia-bug-report.sh).

Comment 11 Eric Siskonen 2014-06-16 00:37:09 UTC

Thank you. I will update this ticket if Nvidia gives me any further information. Alienware does not directly support Linux whatsoever, so trying to get support for it from them will be pointless. If the problem were to surface in Windows I'm sure they would.

Comment 12 Jeroen Roovers (RETIRED) gentoo-dev 2014-06-16 00:40:10 UTC

You might want to try disabling PNP support in the kernel.

Comment 13 Eric Siskonen 2014-06-16 02:58:42 UTC

Updating to the latest BIOS which was released only a month ago has no effect on this bug. I have reverted to nvidia-drivers-334.21-r3 and everything is functioning as normal. I will run nvidia-bug-report.sh and follow up with Nvidia. Strangely this bug also seems to be affecting a completely different Dell machine with a single GF108GLM NVS 5200M card in the same fashion.

Comment 14 Paul Bredbury 2014-06-16 05:49:26 UTC

> null pointer dereference errors

See https://devtalk.nvidia.com/default/topic/685307/linux/340-17-337-334-kernel-bug-when-closing-vdpau-applications/

The crummy workaround is to add in bootloader linux cmdline:

intel_iommu=off
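
To confirm after a reboot that the parameter actually reached the running kernel, something like the following sketch could be used. The helper name `has_kernel_param` is made up for illustration; `/proc/cmdline` is where Linux exposes the boot command line. How the parameter gets added depends on the bootloader and is not shown here.

```shell
#!/bin/sh
# has_kernel_param PARAM [FILE]: succeed if PARAM appears as a whole,
# space-separated word in the kernel command line.
# FILE defaults to /proc/cmdline. (The helper name is hypothetical.)
has_kernel_param() {
    tr ' ' '\n' < "${2:-/proc/cmdline}" | grep -qxF -- "$1"
}
```

Usage: `has_kernel_param intel_iommu=off && echo "workaround active"`.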

Comment 15 Eric Siskonen 2014-06-16 13:14:39 UTC

I actually did try this yesterday and unfortunately it had no effect on the issue.

Comment 16 simon 2014-06-16 18:17:25 UTC

Created attachment 379054 [details]
Dmesg after boot with nvidia-drivers-337.25

Same thing on my machine.
After I uninstalled nvidia-drivers and booted into a system without them, everything came up fine.
If I install 337.25, modprobe the module and restart X, this also works OK.
After a reboot, startup fails again with the attached dmesg.

I reverted to 331.79 and have no issues now.
There were no such issues with 334.21-r3 before (I'll sync and try this again).

Comment 17 Jeroen Roovers (RETIRED) gentoo-dev 2014-06-16 18:48:10 UTC

(In reply to Paul Bredbury from comment #14)
> https://devtalk.nvidia.com/default/topic/685307/linux/340-17-337-334-kernel-
> bug-when-closing-vdpau-applications/
> 
> The crummy workaround is to add in bootloader linux cmdline:
> 
> intel_iommu=off

That's not even remotely related.

Comment 18 Jeroen Roovers (RETIRED) gentoo-dev 2014-06-16 18:51:14 UTC

What seems to be happening is that nvidia.ko loads fine, but hangs during initialisation. It doesn't produce any of the normal output, so it's probably waiting for something else to happen.

Comment 19 simon 2014-06-16 19:17:38 UTC

Created attachment 379058 [details]
nvidia-bug-report_error.log.gz

FYI, in case the info helps: this is the bug report file with the 337.25 driver loaded.
Most probably this should go to nvidia, correct?

Comment 20 Jeroen Roovers (RETIRED) gentoo-dev 2014-06-16 19:26:26 UTC

(In reply to simon from comment #19)
> Most probably this should go to nvidia, correct?

Yes, I have little use for it.

Comment 21 Stephan Karacson 2014-06-16 20:38:04 UTC

I need the 334.21-r3 too as the 337.25 and the 340.17 give me a kernel bug and a complete boot-stop when loading (not always but mostly on reboot) at the time loading the nvidia-module. (so mine might be not so silent)
I have an Alienware Gf 580M GTX.

Comment 22 Jeroen Roovers (RETIRED) gentoo-dev 2014-06-16 21:05:45 UTC

(In reply to Stephan Karacson from comment #21)
> I need the 334.21-r3 too as the 337.25 and the 340.17 give me a kernel bug

334.21-r3 has returned to the tree as promised. Sync your portage tree and try again.

> and a complete boot-stop

What does "a complete boot-stop" mean? Please don't invent technical-sounding phrases, just use the ones we already have.

> when loading (not always but mostly on reboot) at
> the time loading the nvidia-module. (so mine might be not so silent)

If it isn't silent during initialisation, then you could show some output and we could figure out if you're actually seeing the same problem or a different problem.

> I have an Alienware Gf 580M GTX.

That might be a simple coincidence. We haven't as yet established you're seeing the same issue.

Comment 23 Stephan Karacson 2014-06-19 17:01:00 UTC

I installed nvidia-drivers 337.25 on a second machine, which is not Alienware or Dell at all (local vendor, AMD 5200+ cpu, Geforce GTS 450 GF106 gpu, M2N-MX ASUSTeK Computer INC nforce2 motherboard).
The day after the update the startup-stop appeared again, leaving no message in the logs I can find.

So I made some screenshots with my camera...

It's a "NULL pointer dereference at        (null)" kernel Oops,

only reproducible if I install nvidia-drivers 337.25 or 340.17.
Sometimes startup goes well, mostly not.
The kernel on both machines is 3.12.21-gentoo-r1.

nvidia-drivers 334.21-r3 works trouble-free.

Is it useful to upload the jpegs to this bug?

Comment 24 Stephan Karacson 2014-06-19 17:23:57 UTC

Created attachment 379266 [details]
nvidia-kerneloops-33725.tar.gz

Kernel oops screenshots. I made them small; feel free to remove them if not suitable for bugzilla.
Kernel: 3.12.21-gentoo-r1
nvidia-drivers: 337.25
pc: AMD5200+ ASUSTeK Gf-GTS 450

Comment 25 Jeroen Roovers (RETIRED) gentoo-dev 2014-06-19 19:43:29 UTC

(In reply to Stephan Karacson from comment #24)
> Created attachment 379266 [details]

Please send your nvidia-bug-report.sh output to Nvidia.

Comment 26 Stephan Karacson 2014-06-21 22:08:28 UTC

Sorry. I'm not a developer, nor do I have an IT job, so I don't know all the words we already have.
I was also late in testing the newest gentoo-sources 3.15.1, where the kernel oops does not occur (tested with 10-20 boots on each machine with nvidia-drivers 337.25; booting the old kernel 3.12.21-r1 gave me the kernel oops again on the first try).

Anyway, there is a problem with the current stable 3.12.21, so I studied SysRq and was able to make an nvidia-bug-report during the failed boot and send it to Nvidia, asking for a hint as to which patch might fix the bug for stable kernel 3.12.

Comment 27 Eric Siskonen 2014-06-23 03:09:49 UTC

I can confirm that after unmasking the latest kernel in portage (sys-kernel/gentoo-sources-3.15.1:3.15.1) this bug goes away. I am currently using the latest unmasked nvidia-drivers in portage (x11-drivers/nvidia-drivers-340.17) without issue. The only answer I was given from Nvidia was to update my kernel.

Comment 28 Jeroen Roovers (RETIRED) gentoo-dev 2014-06-23 12:59:20 UTC

(In reply to Eric Siskonen from comment #27)
> I can confirm that after unmasking the latest kernel in portage
> (sys-kernel/gentoo-sources-3.15.1:3.15.1) this bug goes away. I am currently
> using the latest unmasked nvidia-drivers in portage
> (x11-drivers/nvidia-drivers-340.17) without issue. The only answer I was
> given from Nvidia was to update my kernel.

That sounds like this:

https://devtalk.nvidia.com/default/topic/751903/linux/kernel-3-15-and-nv-drivers-337-340-failed-to-initialize-the-nvidia-kernel-module-gtx-550-ti-/

but since you're experiencing the problem with a much earlier kernel, I now wonder for how long that kernel bug has existed.

Comment 29 Stephan Karacson 2014-06-23 17:07:58 UTC

I came across this bug too.
But there are some points that don't fit:
blablo complains about a failing xorg-start, my boot doesn't even get to this point.
He gets a WARNING of a cpufreq_update_policy, I get a BUG and a null pointer kernel oops.
His patch is for 3.15.1, which works fine for me without the patch.

Comment 30 Jeroen Roovers (RETIRED) gentoo-dev 2014-06-23 20:19:41 UTC

(In reply to Stephan Karacson from comment #29)
> But there are some points that don't fit:
> blablo complains about a failing xorg-start, my boot doesn't even get to
> this point.

Exactly, so that isn't the same issue.

Comment 31 Jeroen Roovers (RETIRED) gentoo-dev 2014-07-09 14:07:03 UTC

Nvidia says 340.24 is the next stable, replacing all of the 331*, 334* and 337* branches. How does it fare?

Comment 32 Waldemar Szostak 2014-07-12 11:53:39 UTC

I also experienced booting problems with nvidia-drivers-337.25. 
After downgrading, nvidia-drivers-334.21-r3 works fine.

gentoo-sources: 3.12.21-r1, 3.10.41-r1
graphics: NVIDIA Corporation GT215 [GeForce GT 240]

Comment 33 Jeroen Roovers (RETIRED) gentoo-dev 2014-07-12 11:59:34 UTC

(In reply to Waldemar Szostak from comment #32)
> I also experienced booting problems with nvidia-drivers-337.25. 
> After downgrading, nvidia-drivers-334.21-r3 works fine.

Yes, but in comment #31 I asked about 340.24 which is the next stable.

Comment 34 Marco 2014-07-14 07:37:25 UTC

Same problem using x11-drivers/nvidia-drivers-340.24 with sys-kernel/gentoo-sources-3.12.21-r1

Comment 35 Marco 2014-07-14 07:40:03 UTC

Created attachment 380696 [details]
dmesg with boot error

Comment 36 Jeroen Roovers (RETIRED) gentoo-dev 2014-07-14 17:19:10 UTC

That's odd. Did anyone here actually send a bug report to Nvidia?

Comment 37 simon 2014-07-16 17:07:10 UTC

created one today (after testing 340.24) 

https://devtalk.nvidia.com/default/topic/761854/linux/340-24-kernel-3-12-21-gentoo-kernel-null-pointer-dereference-with-xorg-not-starting-nvidia-ko-sil/

Comment 38 simon 2014-07-16 17:42:13 UTC

Created attachment 380834 [details]
nvidia bug report (340.24 and kernel 3.12.21-r1)

My issue is gone with gentoo-sources-3.15.5 and 340.24.

FYI: the attached bug report is from 340.24 with kernel 3.12.21-r1.

Comment 39 Kai Wüstermann 2014-09-01 14:31:01 UTC

I can confirm that bug for x11-drivers/nvidia-drivers-340.32 with kernel 3.12.21-r1. For me it is still present with kernel 3.14.14.

I will attach dmesg and emerge --info.

Comment 40 Kai Wüstermann 2014-09-01 14:34:20 UTC

Created attachment 384056 [details]
comment 39: dmesg

Comment 41 Kai Wüstermann 2014-09-01 14:34:59 UTC

Created attachment 384058 [details]
comment 39: emerge --info

Comment 42 Jeroen Roovers (RETIRED) gentoo-dev 2014-09-20 08:40:46 UTC

(In reply to simon from comment #37)
> created one today (after testing 340.24) 
> 
> https://devtalk.nvidia.com/default/topic/761854/linux/340-24-kernel-3-12-21-gentoo-kernel-null-pointer-dereference-with-xorg-not-starting-nvidia-ko-sil/

"[edit] Issue can not be observed with newer kernel: 3.15.5"

Never saw that edit before.


How are we doing here?