Summary: | x11-drivers/nvidia-drivers-450.66 hangs system | | |
---|---|---|---|
Product: | Gentoo Linux | Reporter: | Alex Efros <powerman-asdf> |
Component: | Current packages | Assignee: | David Seifert <soap> |
Status: | RESOLVED OBSOLETE | | |
Severity: | normal | CC: | harrisl, ionen, stig |
Priority: | Normal | | |
Version: | unspecified | | |
Hardware: | All | | |
OS: | Linux | | |
Whiteboard: | | | |
Package list: | | Runtime testing required: | --- |
Attachments: | nvidia-drivers-450.66.build.log.bz2 | | |
 | emerge nvidia-drivers-455.38 build log against gentoo-sources 5.4.72 | | |
Description
Alex Efros
2020-08-29 11:40:17 UTC
Observing that the entire system apparently stops responding, it should come as no surprise that the output to the Xorg log stops as well. Maybe the kernel panicked? Anything useful in dmesg?

(In reply to Jeroen Roovers from comment #1)
> Observing that the entire system apparently stops responding, it should come
> as no surprise that the output to the Xorg log stops as well. Maybe the
> kernel panicked? Anything useful in dmesg?
Nope. Kernel log is just interrupted without any error/panic at the end, both versions of nvidia-drivers write nearly the same in the kernel log.

Please attach the build log for x11-drivers/nvidia-drivers-450.66.

Created attachment 657458 [details]
nvidia-drivers-450.66.build.log.bz2
(In reply to Alex Efros from comment #4)
> Created attachment 657458 [details]
> nvidia-drivers-450.66.build.log.bz2
Thanks.

Similar results for me: keyboard and mouse are dead, network interface is dead (can't ssh into the box), both on nvidia-drivers 450.66 and 440.100-r2, only on gentoo-sources 5.4.60. At 5.4.48 with the exact same config file both driver versions run okay, good performance in games etc. However, the kernel is not panicking, I can see my logon screen with the clock still increasing the seconds counter. I managed to get some dmesg output through syslog:

Sep 7 00:10:40 box kernel: [ 26.071627] nvidia: module license 'NVIDIA' taints kernel.
Sep 7 00:10:40 box kernel: [ 26.071629] Disabling lock debugging due to kernel taint
Sep 7 00:10:40 box kernel: [ 26.086162] nvidia-nvlink: Nvlink Core is being initialized, major device number 246
Sep 7 00:10:40 box kernel: [ 26.086611] nvidia 0000:01:00.0: vgaarb: changed VGA decodes: olddecodes=io+mem,decodes=none:owns=io+mem
Sep 7 00:10:40 box kernel: [ 26.285961] NVRM: loading NVIDIA UNIX x86_64 Kernel Module 440.100 Fri May 29 08:45:51 UTC 2020
Sep 7 00:10:40 box kernel: [ 26.366579] EXT4-fs (dm-0): re-mounted. Opts: (null)
Sep 7 00:10:40 box kernel: [ 26.502723] resource sanity check: requesting [mem 0x000c0000-0x000fffff], which spans more than PCI Bus 0000:00 [mem 0x000d0000-0x000d3fff window]
Sep 7 00:10:40 box kernel: [ 26.502819] caller _nv000908rm+0x1bf/0x1f0 [nvidia] mapping multiple BARs
Sep 7 00:10:40 box kernel: [ 26.980927] nvidia-modeset: Loading NVIDIA Kernel Mode Setting Driver for UNIX platforms 440.100 Fri May 29 08:14:04 UTC 2020
Sep 7 00:10:40 box kernel: [ 27.006523] [drm] [nvidia-drm] [GPU ID 0x00000100] Loading driver
Sep 7 00:10:40 box kernel: [ 27.006525] [drm] Initialized nvidia-drm 0.0.0 20160202 for 0000:01:00.0 on minor 0
Sep 7 00:10:40 box kernel: [ 27.174497] nvidia-smi (4147) used greatest stack depth: 12560 bytes left
Sep 7 00:10:40 box kernel: [ 31.362410] ip (5358) used greatest stack depth: 12192 bytes left
^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@

Now, if I'm not mistaken, the current ebuild only supports gentoo-sources <5.4 so we should probably just wait for new proprietary driver or stay with 4.19.141. Still, it's interesting to know what is wrong.

same here but ssh still working. kernel-5.8.10. X showing 100% cpu and stack trace of nvidia-driver crash in logs. Only recovery is cold boot. system starts gnome and runs for a while but then locks up with no mouse, no keyboard and will not reboot with systemctl reboot from ssh.

(In reply to Harris Landgarten from comment #7)
> same here but ssh still working.
Perhaps you can access the accurate log files then? I'd say dmesg and Xorg. I'm afraid that my system goes down before it can write these on disk. I have a serial port too but I'm not as desperate to use it for this investigation. Also, please vote on this bug to get attention.

I am having this issue with 455.23.04. This is the kernel oops it causes:

NVRM: GPU at PCI:0000:04:00: GPU-11502392-bb1d-6042-b964-805668887312
Sep 20 10:19:38 harrisl-desktop.landgarten.local kernel: NVRM: Xid (PCI:0000:04:00): 31, pid=8208, Ch 00000068, intr 10000000. MMU Fault: ENGINE MSPDEC HUBCLIENT_MSPDEC faulted >
Sep 20 10:19:38 harrisl-desktop.landgarten.local kernel: BUG: kernel NULL pointer dereference, address: 00000000000003a8
Sep 20 10:19:38 harrisl-desktop.landgarten.local kernel: #PF: supervisor read access in kernel mode
Sep 20 10:19:39 harrisl-desktop.landgarten.local kernel: #PF: error_code(0x0000) - not-present page
Sep 20 10:19:39 harrisl-desktop.landgarten.local kernel: PGD 0 P4D 0
Sep 20 10:19:39 harrisl-desktop.landgarten.local kernel: Oops: 0000 [#1] SMP PTI
Sep 20 10:19:39 harrisl-desktop.landgarten.local kernel: CPU: 7 PID: 631 Comm: irq/50-nvidia Tainted: P IO T 5.8.10-gentoo #1
Sep 20 10:19:39 harrisl-desktop.landgarten.local kernel: Hardware name: /DX58SO2, BIOS SOX5820J.86A.0920.2013.0729.0042 07/29/2013
Sep 20 10:19:39 harrisl-desktop.landgarten.local kernel: RIP: 0010:_nv018304rm+0x0/0x20 [nvidia]
Sep 20 10:19:39 harrisl-desktop.landgarten.local kernel: Code: 48 89 ca 44 8b 44 24 10 48 8d 4d 0c 48 8b 87 f8 03 00 00 e8 e2 b6 46 e1 48 83 c4 08 48 83 c5 10 c3 66 0f 1f 84 00 >
Sep 20 10:19:39 harrisl-desktop.landgarten.local kernel: RSP: 0018:ffffc90000b43bd0 EFLAGS: 00010246
Sep 20 10:19:39 harrisl-desktop.landgarten.local kernel: RAX: ffffffffa09955f0 RBX: ffff8885f36c8008 RCX: 0000000000000000
Sep 20 10:19:39 harrisl-desktop.landgarten.local kernel: RDX: ffff8885f3a02bb8 RSI: ffff8885fcb08008 RDI: ffff8885f36c8008
Sep 20 10:19:39 harrisl-desktop.landgarten.local kernel: RBP: ffff8885f3a02b50 R08: 0000000000000000 R09: 00000000718b3d00
Sep 20 10:19:39 harrisl-desktop.landgarten.local kernel: R10: 0000000000000001 R11: ffffffffffffffff R12: ffff8885fcb08008
Sep 20 10:19:39 harrisl-desktop.landgarten.local kernel: R13: 0000000000000000 R14: 00000000718b3d00 R15: ffff8885f39a0808
Sep 20 10:19:39 harrisl-desktop.landgarten.local kernel: FS: 0000000000000000(0000) GS:ffff888617bc0000(0000) knlGS:0000000000000000
Sep 20 10:19:39 harrisl-desktop.landgarten.local kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Sep 20 10:19:39 harrisl-desktop.landgarten.local kernel: CR2: 00000000000003a8 CR3: 00000004316fa001 CR4: 00000000000206e0
Sep 20 10:19:39 harrisl-desktop.landgarten.local kernel: Call Trace:
Sep 20 10:19:39 harrisl-desktop.landgarten.local kernel: ? _nv029707rm+0x207/0x860 [nvidia]
Sep 20 10:19:39 harrisl-desktop.landgarten.local kernel: ? _nv035039rm+0x296/0x530 [nvidia]
Sep 20 10:19:39 harrisl-desktop.landgarten.local kernel: ? _nv034992rm+0x6ea/0xf20 [nvidia]
Sep 20 10:19:39 harrisl-desktop.landgarten.local kernel: ? _nv034993rm+0xd52/0xd90 [nvidia]
Sep 20 10:19:39 harrisl-desktop.landgarten.local kernel: ? _nv018254rm+0x219/0x3e0 [nvidia]
Sep 20 10:19:39 harrisl-desktop.landgarten.local kernel: ? _nv018315rm+0x46a/0x6b0 [nvidia]
Sep 20 10:19:39 harrisl-desktop.landgarten.local kernel: ? _nv018072rm+0x1a2/0x1d0 [nvidia]
Sep 20 10:19:39 harrisl-desktop.landgarten.local kernel: ? _nv026016rm+0x10/0x10 [nvidia]
Sep 20 10:19:39 harrisl-desktop.landgarten.local kernel: ? _nv018321rm+0x1f2/0x2d0 [nvidia]
Sep 20 10:19:39 harrisl-desktop.landgarten.local kernel: ? _nv026016rm+0x10/0x10 [nvidia]
Sep 20 10:19:39 harrisl-desktop.landgarten.local kernel: ? _nv018354rm+0xac/0xe0 [nvidia]
Sep 20 10:19:39 harrisl-desktop.landgarten.local kernel: ? _nv027674rm+0x820/0xdc0 [nvidia]
Sep 20 10:19:39 harrisl-desktop.landgarten.local kernel: ? _nv007560rm+0x155/0x270 [nvidia]
Sep 20 10:19:39 harrisl-desktop.landgarten.local kernel: ? _nv027682rm+0x8d/0x180 [nvidia]
Sep 20 10:19:39 harrisl-desktop.landgarten.local kernel: ? _nv000711rm+0xa9/0x200 [nvidia]
Sep 20 10:19:39 harrisl-desktop.landgarten.local kernel: ? disable_irq_nosync+0x10/0x10
Sep 20 10:19:39 harrisl-desktop.landgarten.local kernel: ? rm_isr_bh+0x1c/0x60 [nvidia]
Sep 20 10:19:39 harrisl-desktop.landgarten.local kernel: ? nvidia_isr_kthread_bh+0x1b/0x40 [nvidia]
Sep 20 10:19:39 harrisl-desktop.landgarten.local kernel: ? irq_thread_fn+0x20/0x60
Sep 20 10:19:39 harrisl-desktop.landgarten.local kernel: ? irq_thread+0xdb/0x180
Sep 20 10:19:39 harrisl-desktop.landgarten.local kernel: ? irq_thread_check_affinity+0x80/0x80
Sep 20 10:19:39 harrisl-desktop.landgarten.local kernel: ? irq_forced_thread_fn+0x80/0x80
Sep 20 10:19:39 harrisl-desktop.landgarten.local kernel: ? kthread+0x11b/0x140
Sep 20 10:19:39 harrisl-desktop.landgarten.local kernel: ? kthread_create_worker_on_cpu+0x70/0x70
Sep 20 10:19:39 harrisl-desktop.landgarten.local kernel: ? ret_from_fork+0x22/0x30

Similar results for kernels 5.4.66 and 5.4.72. Meanwhile I'm using nvidia 455.38 with 5.4.48 just fine. Some random info: I don't use systemd, I use multiple monitors, I use genkernel to build the image. Attaching build log for nvidia-drivers 455.38 against gentoo-sources 5.4.72, maybe some of the warnings help.

‘__builtin_strncpy’ specified bound depends on the length of the source argument [-Wstringop-overflow=]
‘GTimeVal’ is deprecated
‘GTypeDebugFlags’ is deprecated [-Wdeprecated-declarations]
this statement may fall through [-Wimplicit-fallthrough=]
#warning "Update libvdpau to version x.x" [-Wcpp]

I also have virtually identical hardware on another box and plan the updates above 5.4.38 there, will report if the results are different. I'll also try downgrading libvdpau to 1.2 (currently using 1.3) as the last warning seems curious.

Created attachment 670307 [details]
emerge nvidia-drivers-455.38 build log against gentoo-sources 5.4.72
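
The `NVRM: Xid` line in the oops quoted above carries the PCI address, the Xid code and the pid in a fixed order, so it can be pulled out of a saved kernel log mechanically when comparing reports. The following is only a rough Python sketch: the regular expression is an assumption based on the single Xid line in this thread (other driver versions may print the line differently), and `XidEvent` and `parse_xid` are names made up for illustration.

```python
import re
from typing import NamedTuple, Optional

# Pattern assumed from the one Xid line quoted in this bug; not an official format.
XID_RE = re.compile(r"NVRM: Xid \((?P<pci>[^)]+)\): (?P<xid>\d+), pid=(?P<pid>\d+)")


class XidEvent(NamedTuple):
    pci: str  # PCI address as printed by the driver, e.g. "PCI:0000:04:00"
    xid: int  # Xid error code
    pid: int  # pid reported alongside the fault


def parse_xid(line: str) -> Optional[XidEvent]:
    """Return the Xid details found in one kernel log line, or None if absent."""
    match = XID_RE.search(line)
    if match is None:
        return None
    return XidEvent(match["pci"], int(match["xid"]), int(match["pid"]))


# Sample line copied from the oops in this thread.
sample = (
    "Sep 20 10:19:38 harrisl-desktop.landgarten.local kernel: "
    "NVRM: Xid (PCI:0000:04:00): 31, pid=8208, Ch 00000068, intr 10000000."
)
print(parse_xid(sample))
```

Running it over every line of a saved log and keeping the non-`None` results gives a quick list of Xid codes to quote in a report.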
libvdpau-1.2 doesn't change a thing. I ran X with default config, managed to reboot with SysRq this way and save the logs. Not much interesting stuff there:

kernel.log:
Nov 7 12:08:21 hal2 kernel: [ 168.187150] resource sanity check: requesting [mem 0x000c0000-0x000fffff], which spans more than PCI Bus 0000:00 [mem 0x000d0000-0x000d3fff window]
Nov 7 12:08:21 hal2 kernel: [ 168.187267] caller _nv000709rm+0x1af/0x200 [nvidia] mapping multiple BARs

Xorg.0.log:
[ 167.545] (WW) Warning, couldn't open module glxservernvidia
[ 167.545] (EE) Failed to load module "glxservernvidia" (module does not exist, 0)
[ 168.590] (II) NVIDIA(0): ACPI: failed to connect to the ACPI event daemon; the daemon
[ 168.590] (II) NVIDIA(0): may not be running or the "AcpidSocketPath" X
[ 168.590] (II) NVIDIA(0): configuration option may not be set correctly. When the
[ 168.590] (II) NVIDIA(0): ACPI event daemon is available, the NVIDIA X driver will
[ 168.590] (II) NVIDIA(0): try to use it to receive ACPI event notifications. For
[ 168.590] (II) NVIDIA(0): details, please see the "ConnectToAcpid" and
[ 168.590] (II) NVIDIA(0): "AcpidSocketPath" X configuration options in Appendix B: X
[ 168.590] (II) NVIDIA(0): Config Options in the README.

Same problem here.
Tested drivers: 450.57-450.66
Tested kernels: 5.4.51 - 5.7.10
Frozen as soon as X started.
No kernel panic.
No error in Xorg logs.

The same problem is reported in Ubuntu too: https://bugs.launchpad.net/ubuntu/+source/nvidia-graphics-drivers-450/+bug/1894454 It is confirmed there.

My issue is now resolved, 5.4.72 can work smoothly with nvidia-drivers-455.38. Since nobody helped me since September, I'm not sharing the solution, you can throw away the whole ticket for all I care.

@Pawel I am happy you solved your problem. I am writing a paper about open source and the entitlement people can feel to get their problems solved even if they haven't paid or otherwise contributed to the effort other people have to make to solve this particular problem. Will you allow me to use you as a case?

(In reply to Paweł Metelski from comment #15)
> My issue is now resolved, 5.4.72 can work smoothly with
> nvidia-drivers-455.38. Since nobody helped me since September, I'm not
> sharing the solution, you can throw away the whole ticket for all I care.
Wow, just wow.

I don't understand what purpose these passive-aggressive comments serve. I am a professional software developer and my time costs money. As there is a serious lack of at least bug reviewers, let alone responsible package maintainers or subject matter experts assigned, I feel entitled to complain on the bugzilla service, after all tens of sponsors already paid for it. I considered a donation to the Gentoo project after resolving this issue for me but now I guess I will have to spend it on myself as a consolation for having to look for another, more actively maintained distro after 15 years with Gentoo. This bug has not even been confirmed after 2 months, this SLA is simply unacceptable for core system components maintenance. Kernel 5.4.48 is already out of the stable portage channel, what would I do if I had to reinstall it?

This said, I guess I can share a hint that the crash only occurs after a few seconds of lxdm/lightdm/gdm running, so there is just enough time to press Alt+SysRq+R and then Ctrl+Alt+F1 to access current Xorg and dmesg logs. Also I suppose I can share that running rc-config show --all may help the impacted users with finding out the solution but I'm done with further log collecting, describing my observations etc.

@Pawel, let me just understand this. Your time is precious and costs money. You haven't paid anything to this project. Still, you expect anyone other than you to solve this bug, for free, in their own spare time? You do know how open source projects work, right? If I were you, I would write something along these lines:

Dear all, since September I have worked with this bug (feels like I did it alone) and I finally found the problem. The problem was ... The solution is ... Regards Pawel.

I guarantee you that will give you a much better response next time you need help, and you get that superior feeling of having solved something others didn't know how to. Anyway, I wish you all the best. Regards Stig Nielsen

I feel multiple different issues have been reported in this bug. Some may be related to the null pointer deref issue (supposedly fixed in 460.56, perhaps in .39 too), or page alloc issues (fixed since gentoo added a patch to 455.23.04, later fixed by nvidia too), making it hard to tell. The linked ubuntu one is different though, is that still relevant? (I don't have a quadro card to tell.) Not that we typically can do much about these kinds of issues here until nvidia does something about it, besides keeping older drivers for a bit longer, albeit security vulnerabilities unfortunately prompted a cleanup.

I don't remember when I had such a hang last time, probably 2-3 months ago. Right now I'm using 460.39-r1, so far so good.

(In reply to Alex Efros from comment #21)
> I don't remember when I had such hang last time, probably 2-3 months ago.
> Right now I'm using 460.39-r1, so far so good.
Thanks for reporting, and good to hear (hope it's working out for everyone else too). If it happens again, feel free to re-open.
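
The Alt+SysRq+R hint mentioned earlier in this thread only helps if the running kernel permits that SysRq function. As a minimal sketch (nothing like this was posted in the bug), the Python below decodes `/proc/sys/kernel/sysrq`, assuming the meaning given in the kernel's SysRq documentation: 0 disables SysRq, 1 enables every function, and in larger values bit 0x4 covers keyboard control, which includes unraw (the R key). The helper names are hypothetical.

```python
from pathlib import Path

# Assumed from the kernel SysRq documentation: bit 0x4 = keyboard control (SAK, unraw).
SYSRQ_KEYBOARD_CONTROL = 0x4


def unraw_allowed(sysrq_value: int) -> bool:
    """True if Alt+SysRq+R (unraw) is permitted for this sysrq setting."""
    # A value of exactly 1 means "all functions enabled", not a single bit.
    return sysrq_value == 1 or bool(sysrq_value & SYSRQ_KEYBOARD_CONTROL)


def read_sysrq(path: str = "/proc/sys/kernel/sysrq") -> int:
    """Read the current sysrq setting from procfs."""
    return int(Path(path).read_text().strip())


if __name__ == "__main__":
    try:
        value = read_sysrq()
    except (OSError, ValueError) as exc:
        print(f"could not read the sysrq setting: {exc}")
    else:
        print(f"sysrq={value}, Alt+SysRq+R allowed: {unraw_allowed(value)}")
```

Checking this before reproducing the hang avoids finding out mid-freeze that the key combination is disabled.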