Summary: | x11-drivers/nvidia-drivers-337.25 - at boot, nvidia.ko silently fails to initialise | | |
---|---|---|---|
Product: | Gentoo Linux | Reporter: | Eric Siskonen <esiskonen> |
Component: | Current packages | Assignee: | David Seifert <soap> |
Status: | RESOLVED WONTFIX | | |
Severity: | normal | CC: | esiskonen, kernel, kjackie, s7mon |
Priority: | Highest | | |
Version: | unspecified | | |
Hardware: | AMD64 | | |
OS: | Linux | | |
Whiteboard: | | | |
Package list: | | Runtime testing required: | --- |
Attachments:
emerge --info
dmesg
Dmesg after boot with nvidia-drivers-337.25
nvidia-bug-report_error.log.gz
nvidia-kerneloops-33725.tar.gz
dmesg with boot error
nvidia bug report (340.24 and kernel 3.12.21-r1)
comment 39: dmesg
comment 39: emerge --info

Created attachment 378970 [details]
dmesg

I just reverted to the next oldest driver and it is now working properly again. Currently using nvidia-drivers-331.79.ebuild. I'm still not clear why the 334.21-r3 ebuild was removed from portage considering it was the stable to revert to in the event that a new ebuild had issues.

(In reply to Eric Siskonen from comment #2)
> I just reverted to the next oldest driver and it is now working properly
> again. Currently using nvidia-drivers-331.79.ebuild.

Nice.

> I'm still not clear why
> the 334.21-r3 ebuild was removed from portage considering it was the stable
> to revert to in the event that a new ebuild had issues.

No, stable is what this[1] says. 337.25 is the stable successor to the 334 branch.

[1] http://www.nvidia.com/object/unix.html

Well, in Gentoo as of yesterday stable was 334. Today 337 is broken for me on several different machines, with no way to revert to the 334 driver which was working. I'm now fighting to get 331 working because of libEGL.

After reverting to the old drivers there were issues with libEGL. I had to eselect opengl set xorg-x11, then delete everything in /usr/lib32/opengl/nvidia/lib and /usr/lib64/opengl/nvidia/lib. After doing that I had to emerge =x11-drivers/nvidia-drivers-331.79 and eselect opengl set nvidia. Then I had to emerge -1 mesa after everything was completed. Now I am able to run everything exactly like it was running on 334 prior to upgrading to 337 or 340.

Any idea why 337 and 340 are throwing null pointer dereferences? It happens at boot when udev runs /opt/bin/nvidia-smi, I believe.

If that's your complete dmesg output, then at no point does the nvidia kernel module appear to be loaded, or at least it doesn't show any output, which isn't normal. The PCI subsystem output shows that a 10de:0e0a ("GK104 HDMI Audio Controller") and a 10de:119f ("GK104M [GeForce GTX 780M]") are present, but the canonical dmesg output from the nvidia module ("NVRM: loading NVIDIA UNIX ...") is missing.
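
The dmesg check described above (NVIDIA PCI devices enumerated, but no "NVRM: loading NVIDIA UNIX" banner) can be sketched as a small script. This is only an illustration: the function name `summarise_dmesg` and the sample lines are made up for the example and are not taken from the attached dmesg, and it assumes the usual `[vendor:device]` form that the kernel prints when enumerating PCI devices.

```python
import re

NVIDIA_VENDOR_ID = "10de"
# Start of the banner the nvidia module normally prints, as quoted in this thread.
NVRM_BANNER = "NVRM: loading NVIDIA UNIX"


def summarise_dmesg(text):
    """Return the NVIDIA PCI IDs seen and whether the NVRM banner appeared."""
    # PCI enumeration lines carry the IDs as "[vendor:device]" in lowercase hex.
    ids = re.findall(r"\[([0-9a-f]{4}):([0-9a-f]{4})\]", text)
    nvidia_devices = sorted(
        {f"{vendor}:{device}" for vendor, device in ids if vendor == NVIDIA_VENDOR_ID}
    )
    return {
        "nvidia_devices": nvidia_devices,
        "nvrm_banner": NVRM_BANNER in text,
    }


# Illustrative input only (hypothetical lines, not the reporter's log).
SAMPLE = """\
pci 0000:01:00.0: [10de:119f] type 00 class 0x030000
pci 0000:01:00.1: [10de:0e0a] type 00 class 0x040300
"""

if __name__ == "__main__":
    print(summarise_dmesg(SAMPLE))
```

A result with NVIDIA devices listed but `nvrm_banner` set to `False` matches the situation described in this comment; to try it on a real log, pass the text of a saved dmesg file to `summarise_dmesg`.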

It manages to boot all the way into the OS when it has the problem. It will just not run xorg. I ran lsmod after it booted and indeed the nvidia module is loaded in the kernel.

You might be hitting bug #508196 here. That means your BIOS should set up the device properly or you should upgrade the BIOS.

After reading that post I'm not sure we are having the same problem. My system does not have optimus and I am not running bumblebee. This machine has 2 Nvidia GTX 780m cards in SLI, so optimus would be unsupported even if they made it optional. I'm running nvidia-drivers as the only driver for xorg. I will investigate updating the BIOS but unfortunately I'm not optimistic this is the solution.

(In reply to Eric Siskonen from comment #9)
> After reading that post I'm not sure we are having the same problem. My
> system does not have optimus and I am not running bumblebee.

Right.

> I will investigate updating the BIOS but unfortunately I'm not
> optimistic this is the solution.

I'm willing to put back 334 for now, but I would urge you to contact both Alienware (because you paid them a lot of money) and Nvidia (because indirectly, same) with a full bug report (run nvidia-bug-report.sh).

Thank you. I will update this ticket if Nvidia gives me any further information. Alienware does not directly support Linux whatsoever, so trying to get support for it from them will be pointless. If the problem were to surface in Windows I'm sure they would.

You might want to try disabling PNP support in the kernel.

Updating to the latest BIOS, which was released only a month ago, has no effect on this bug. I have reverted to nvidia-drivers-334.21-r3 and everything is functioning as normal. I will run nvidia-bug-report.sh and follow up with Nvidia. Strangely, this bug also seems to be affecting a completely different Dell machine with a single GF108GLM NVS 5200M card in the same fashion.

> null pointer dereference errors

See https://devtalk.nvidia.com/default/topic/685307/linux/340-17-337-334-kernel-bug-when-closing-vdpau-applications/

The crummy workaround is to add in bootloader linux cmdline:

intel_iommu=off

I actually did try this yesterday and unfortunately it had no effect on the issue.

Created attachment 379054 [details]
Dmesg after boot with nvidia-drivers-337.25
Same thing on my machine.
After I uninstalled the nvidia-drivers and booted into a system without nvidia-drivers, everything came up fine.
If I install 337.25, modprobe the module and restart X, this also works OK.
After a reboot, startup fails again with the dmesg attached.
I reverted to 331.79 and have no issues now.
No such issues with 334.21-r3 before (I'll sync and try this again).

(In reply to Paul Bredbury from comment #14)
> https://devtalk.nvidia.com/default/topic/685307/linux/340-17-337-334-kernel-bug-when-closing-vdpau-applications/
>
> The crummy workaround is to add in bootloader linux cmdline:
>
> intel_iommu=off

That's not even remotely related. What seems to be happening is that nvidia.ko loads fine, but hangs during initialisation. It doesn't produce any of the normal output, so it's probably waiting for something else to happen.

Created attachment 379058 [details]
nvidia-bug-report_error.log.gz
FYI, in case the info helps: the bug report file with the 337.25 driver loaded.
Most probably this should go to nvidia, correct?

(In reply to simon from comment #19)
> Most probably this should go to nvidia, correct?

Yes, I have little use for it.

I need the 334.21-r3 too as the 337.25 and the 340.17 give me a kernel bug and a complete boot-stop when loading (not always but mostly on reboot) at the time loading the nvidia-module. (so mine might not be so silent) I have an Alienware Gf 580M GTX.

(In reply to Stephan Karacson from comment #21)
> I need the 334.21-r3 too as the 337.25 and the 340.17 give me a kernel bug

334.21-r3 returned to the tree as promised. Sync your portage tree and try again.

> and a complete boot-stop

What does "a complete boot-stop" mean? Please don't invent technical-sounding phrases, just use the ones we already have.

> when loading (not always but mostly on reboot) at
> the time loading the nvidia-module. (so mine might not be so silent)

If it isn't silent during initialisation, then you could show some output and we could figure out if you're actually seeing the same problem or a different problem.

> I have an Alienware Gf 580M GTX.

That might be a simple coincidence. We haven't as yet established you're seeing the same issue.

I installed nvidia-drivers 337.25 on a second machine which is not Alienware or Dell at all (local vendor, AMD 5200+ cpu, Geforce GTS 450 GF106 gpu, M2N-MX ASUSTeK Computer INC nforce2 motherboard). The day after the update the startup-stop appeared again, leaving no message in the logs that I can find. So I made some screenshots with my camera... It's a "NULL pointer dereference at (null)" kernel Oops, only reproducible if I install nvidia-drivers 337.25 or 340.17. Sometimes startup goes well, mostly not. The kernel used on both machines is 3.12.21-gentoo-r1. nvidia-drivers 334.21-r3 works trouble-free. Is it useful to upload the jpegs to this bug?

Created attachment 379266 [details]
nvidia-kerneloops-33725.tar.gz
Kernel oops screenshots. I made them small; feel free to remove them if they are not suitable for bugzilla.
Kernel: 3.12.21-gentoo-r1
nvidia-drivers: 337.25
pc: AMD5200+ ASUSTEC Gf-GTS 450

(In reply to Stephan Karacson from comment #24)
> Created attachment 379266 [details]

Please send your nvidia-bug-report.sh output to Nvidia.

Sorry. I'm not a developer nor do I have an IT job, so I don't know all the words we already have. I was also late in testing the newest gentoo-sources 3.15.1, where the kernel oops does not occur (tested with 10-20 boots on each machine with nvidia-drivers 337.25; booting the old kernel 3.12.21-r1 gave me the kernel oops again on the first try). Anyway, there is a problem with the current stable 3.12.21, so I studied Sysrq and was able to make an nvidia-bug-report during the failed boot and send it to nvidia, asking for a hint as to which patch might clear the bug for stable kernel 3.12.

I can confirm that after unmasking the latest kernel in portage (sys-kernel/gentoo-sources-3.15.1:3.15.1) this bug goes away. I am currently using the latest unmasked nvidia-drivers in portage (x11-drivers/nvidia-drivers-340.17) without issue. The only answer I was given from Nvidia was to update my kernel.

(In reply to Eric Siskonen from comment #27)
> I can confirm that after unmasking the latest kernel in portage
> (sys-kernel/gentoo-sources-3.15.1:3.15.1) this bug goes away. I am currently
> using the latest unmasked nvidia-drivers in portage
> (x11-drivers/nvidia-drivers-340.17) without issue. The only answer I was
> given from Nvidia was to update my kernel.

That sounds like this:

https://devtalk.nvidia.com/default/topic/751903/linux/kernel-3-15-and-nv-drivers-337-340-failed-to-initialize-the-nvidia-kernel-module-gtx-550-ti-/

but since you're experiencing the problem with a much earlier kernel, I now wonder for how long that kernel bug has existed.

I came across this bug too. But there are some points that don't fit: blablo complains about a failing xorg-start, my boot doesn't even get to this point. He gets a WARNING from cpufreq_update_policy, I get a BUG and a null pointer kernel oops. His patch is for 3.15.1, which works fine for me without the patch.

(In reply to Stephan Karacson from comment #29)
> But there are some points that don't fit:
> blablo complains about a failing xorg-start, my boot doesn't even get to
> this point.

Exactly, so that isn't the same issue.

Nvidia says 340.24 is the next stable, replacing all of the 331*, 334* and 337* branches. How does it fare?

I also experienced booting problems with nvidia-drivers-337.25. After downgrading to nvidia-drivers-334.21-r3 works fine.
gentoo-sources: 3.12.21-r1, 3.10.41-r1
graphics: NVIDIA Corporation GT215 [GeForce GT 240]

(In reply to Waldemar Szostak from comment #32)
> I also experienced booting problems with nvidia-drivers-337.25.
> After downgrading to nvidia-drivers-334.21-r3 works fine.

Yes, but in comment #31 I asked about 340.24 which is the next stable.

Same problem using x11-drivers/nvidia-drivers-340.24 with sys-kernel/gentoo-sources-3.12.21-r1

Created attachment 380696 [details]
dmesg with boot error

That's odd. Did anyone here actually send a bug report to Nvidia?

created one today (after testing 340.24)

https://devtalk.nvidia.com/default/topic/761854/linux/340-24-kernel-3-12-21-gentoo-kernel-null-pointer-dereference-with-xorg-not-starting-nvidia-ko-sil/

Created attachment 380834 [details]
nvidia bug report (340.24 and kernel 3.12.21-r1)
My issue is gone with gentoo-sources-3.15.5 and 340.24.
FYI: the bug report with 340.24 and kernel 3.12.21-r1.

I can confirm that bug for x11-drivers/nvidia-drivers-340.32 with kernel 3.12.21-r1. For me it is still present with kernel 3.14.14. I will attach dmesg and emerge --info.

Created attachment 384056 [details]
comment 39: dmesg

Created attachment 384058 [details]
comment 39: emerge --info

(In reply to simon from comment #37)
> created one today (after testing 340.24)
>
> https://devtalk.nvidia.com/default/topic/761854/linux/340-24-kernel-3-12-21-gentoo-kernel-null-pointer-dereference-with-xorg-not-starting-nvidia-ko-sil/

"[edit] Issue can not be observed with newer kernel: 3.15.5"

Never saw that edit before.

How are we doing here?
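
The driver and kernel combinations reported in this thread can be tallied to see where the failures stop. The sketch below is only a reading of the comments above, not an analysis of the underlying kernel bug: the `REPORTS` list is transcribed by hand, the pairing of 340.32 with kernel 3.14.14 is an assumption (that comment does not restate the driver version), and the report that lists two kernels at once is left out because it does not say which one was booted. The helper names `kernel_key` and `boundary` are made up for the example.

```python
# (driver, kernel, boots_ok) as reported in the comments above.
REPORTS = [
    ("334.21-r3", "3.12.21-r1", True),
    ("337.25", "3.12.21-r1", False),
    ("340.17", "3.12.21-r1", False),
    ("340.24", "3.12.21-r1", False),
    ("340.32", "3.12.21-r1", False),
    ("340.32", "3.14.14", False),  # assumption: same driver as that reporter's 3.12 test
    ("337.25", "3.15.1", True),
    ("340.17", "3.15.1", True),
    ("340.24", "3.15.5", True),
]


def kernel_key(version):
    """Turn '3.12.21-r1' into (3, 12, 21); the Gentoo '-rN' revision is ignored."""
    base = version.split("-", 1)[0]
    return tuple(int(part) for part in base.split("."))


def boundary(reports, skip_drivers=("334.21-r3",)):
    """Return (newest failing kernel, oldest working kernel) for the affected drivers."""
    failing = [k for d, k, ok in reports if d not in skip_drivers and not ok]
    working = [k for d, k, ok in reports if d not in skip_drivers and ok]
    return max(failing, key=kernel_key), min(working, key=kernel_key)


if __name__ == "__main__":
    newest_bad, oldest_good = boundary(REPORTS)
    print(f"newest failing kernel: {newest_bad}, oldest working kernel: {oldest_good}")
```

With these entries, the newest failing kernel is 3.14.14 and the oldest working one is 3.15.1, which is consistent with the "update your kernel" answer but rests on only a handful of machines.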

Created attachment 378968 [details]
emerge --info
Today a new ebuild was made stable (x11-drivers/nvidia-drivers-337.25), which I attempted to upgrade to on 2 separate systems. Both machines now throw null pointer dereference errors on boot once they reach the "waiting for uevents to be processed" portion of the init process. I attempted to downgrade to the previous nvidia-drivers-334.21-r3 ebuild, only to notice that it has now been magically removed. I then attempted to unmask the latest version available in portage (nvidia-drivers-340.17) and the issue persists. This is crippling 2 machines that I desperately need to actually do work with. I have attached an emerge --info and also my dmesg file. PLEASE HELP!!!