Bug 747028

Summary:	x11-drivers/nvidia-drivers-450.66 - X: page allocation failure: order:5, mode:0x40cc0(GFP_KERNEL|__GFP_COMP), nodemask=(null),cpuset=/,mems_allowed=0
Product:	Gentoo Linux	Reporter:	fatalerrors <fatalerrors>
Component:	Current packages	Assignee:	David Seifert <soap>
Status:	RESOLVED OBSOLETE
Severity:	normal	CC:	axiator, ionen, kajanos, marek.duranik
Priority:	Normal
Version:	unspecified
Hardware:	AMD64
OS:	Linux
See Also:	https://bugs.gentoo.org/show_bug.cgi?id=753629
Whiteboard:
Package list:		Runtime testing required:	---
Attachments:	kernel_59_nvidia_uvm.patch dmesg log from today's crash

Description fatalerrors@geoffray-levasseur.org 2020-10-07 09:36:58 UTC

The video card stops updating the display. The bug occurs in any mode. If the display is in energy-saving mode, it stays in that mode indefinitely. The computer still works normally via ssh. Because of this bug I have to reboot about once a week. Kernel messages show this:

[400053.686418] X: page allocation failure: order:5, mode:0x40cc0(GFP_KERNEL|__GFP_COMP), nodemask=(null),cpuset=/,mems_allowed=0
[400053.686430] CPU: 13 PID: 8889 Comm: X Tainted: P           OE     5.4.66-gentoo-x86_64 #1
[400053.686432] Hardware name: Micro-Star International Co., Ltd. MS-7B79/X470 GAMING PLUS (MS-7B79), BIOS A.E0 09/06/2019
[400053.686433] Call Trace:
[400053.686445]  dump_stack+0x66/0x90
[400053.686450]  warn_alloc.cold+0x7b/0xdf
[400053.686455]  ? _cond_resched+0x15/0x30
[400053.686459]  ? __alloc_pages_direct_compact+0x168/0x170
[400053.686463]  __alloc_pages_slowpath+0xddd/0xe10
[400053.686470]  ? pollwake+0x74/0x90
[400053.686473]  ? prep_new_page+0xc4/0xf0
[400053.686477]  __alloc_pages_nodemask+0x2f2/0x340
[400053.686482]  kmalloc_order+0x1b/0x80
[400053.686485]  kmalloc_order_trace+0x1d/0xa0
[400053.686506]  nvkms_alloc+0x20/0xe0 [nvidia_modeset]
[400053.686530]  _nv002653kms+0x16/0x30 [nvidia_modeset]
[400053.686551]  ? _nv002759kms+0x66/0x1470 [nvidia_modeset]
[400053.686569]  ? nv_kthread_q_stop+0x17e0/0x2970 [nvidia_modeset]
[400053.686577]  ? __alloc_pages_nodemask+0x18e/0x340
[400053.686592]  ? nv_kthread_q_stop+0x1cf1/0x2970 [nvidia_modeset]
[400053.686596]  ? _copy_from_user+0x37/0x60
[400053.686611]  ? nv_kthread_q_stop+0x17e0/0x2970 [nvidia_modeset]
[400053.686626]  ? nvKmsIoctl+0x96/0x1d0 [nvidia_modeset]
[400053.686644]  ? nvkms_ioctl_common+0x3b/0x180 [nvidia_modeset]
[400053.686660]  ? nvkms_ioctl_common+0x143/0x180 [nvidia_modeset]
[400053.686840]  ? nvidia_frontend_unlocked_ioctl+0x37/0x50 [nvidia]
[400053.686843]  ? do_vfs_ioctl+0x40e/0x670
[400053.686846]  ? ksys_ioctl+0x5e/0x90
[400053.686849]  ? __x64_sys_ioctl+0x16/0x20
[400053.686853]  ? do_syscall_64+0x5f/0x200
[400053.686857]  ? entry_SYSCALL_64_after_hwframe+0x44/0xa9
[400053.686859] Mem-Info:
[400053.686869] active_anon:5578662 inactive_anon:485222 isolated_anon:0
                 active_file:225804 inactive_file:1511286 isolated_file:0
                 unevictable:2149 dirty:5457 writeback:0 unstable:0
                 slab_reclaimable:52912 slab_unreclaimable:219213
                 mapped:530293 shmem:409910 pagetables:38599 bounce:0
                 free:943623 free_pcp:62 free_cma:0
[400053.686875] Node 0 active_anon:22314648kB inactive_anon:1940888kB active_file:903216kB inactive_file:6045144kB unevictable:8596kB isolated(anon):0kB isolated(file):0kB mapped:2121172kB dirty:21828kB writeback:0kB shmem:1639640kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 0kB writeback_tmp:0kB unstable:0kB all_unreclaimable? no
[400053.686877] Node 0 DMA free:15900kB min:20kB low:32kB high:44kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB writepending:0kB present:15984kB managed:15900kB mlocked:0kB kernel_stack:0kB pagetables:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
[400053.686883] lowmem_reserve[]: 0 3333 47983 47983 47983
[400053.686887] Node 0 DMA32 free:211908kB min:4692kB low:8104kB high:11516kB active_anon:331700kB inactive_anon:32720kB active_file:44kB inactive_file:0kB unevictable:0kB writepending:0kB present:3613728kB managed:3613728kB mlocked:0kB kernel_stack:0kB pagetables:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
[400053.686893] lowmem_reserve[]: 0 0 44650 44650 44650
[400053.686897] Node 0 Normal free:3546684kB min:62868kB low:108588kB high:154308kB active_anon:21982948kB inactive_anon:1908168kB active_file:903172kB inactive_file:6044888kB unevictable:8596kB writepending:21828kB present:46648832kB managed:45730628kB mlocked:8596kB kernel_stack:37136kB pagetables:154396kB bounce:0kB free_pcp:248kB local_pcp:0kB free_cma:0kB
[400053.686903] lowmem_reserve[]: 0 0 0 0 0
[400053.686906] Node 0 DMA: 1*4kB (U) 1*8kB (U) 1*16kB (U) 0*32kB 2*64kB (U) 1*128kB (U) 1*256kB (U) 0*512kB 1*1024kB (U) 1*2048kB (M) 3*4096kB (M) = 15900kB
[400053.686920] Node 0 DMA32: 8185*4kB (UM) 6012*8kB (UM) 4634*16kB (UM) 1011*32kB (UM) 384*64kB (U) 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 211908kB
[400053.686931] Node 0 Normal: 59521*4kB (UME) 235755*8kB (UME) 74416*16kB (UME) 5285*32kB (UME) 1030*64kB (UME) 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 3549820kB
[400053.686946] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB
[400053.686947] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
[400053.686949] 2148493 total pagecache pages
[400053.686952] 0 pages in swap cache
[400053.686954] Swap cache stats: add 98, delete 98, find 0/0
[400053.686955] Free swap  = 8384508kB
[400053.686955] Total swap = 8388604kB
[400053.686957] 12569636 pages RAM
[400053.686957] 0 pages HighMem/MovableOnly
[400053.686958] 229572 pages reserved
[400053.686959] 0 pages cma reserved
[400053.686960] 0 pages hwpoisoned
[400053.686970] BUG: unable to handle page fault for address: 0000000000007980
[400053.686975] #PF: supervisor read access in kernel mode
[400053.686977] #PF: error_code(0x0000) - not-present page
[400053.686980] PGD 0 P4D 0
[400053.686984] Oops: 0000 [#1] SMP NOPTI
[400053.686989] CPU: 13 PID: 8889 Comm: X Tainted: P           OE     5.4.66-gentoo-x86_64 #1
[400053.686991] Hardware name: Micro-Star International Co., Ltd. MS-7B79/X470 GAMING PLUS (MS-7B79), BIOS A.E0 09/06/2019
[400053.687020] RIP: 0010:_nv002606kms+0x60/0x100 [nvidia_modeset]
[400053.687025] Code: eb 40 0f 1f 84 00 00 00 00 00 48 c7 03 00 00 00 00 c6 43 08 00 41 8b 86 d0 00 00 00 83 c5 01 48 81 c3 28 04 00 00 39 e8 76 18 <48> 8b 3b 48 85 ff 74 ea 80 7b 08 00 75 d2 e8 dd d2 ff ff eb cb 0f
[400053.687028] RSP: 0018:ffffabe603edbc60 EFLAGS: 00010202
[400053.687031] RAX: 0000000000000004 RBX: 0000000000007980 RCX: 0000000000000004
[400053.687034] RDX: ffff9f09d07a6348 RSI: 0000000000007980 RDI: ffff9f09d07a2008
[400053.687036] RBP: 0000000000000000 R08: 0000000000000800 R09: 0000000000000000
[400053.687039] R10: 0000000000000004 R11: 0000000000000004 R12: 0000000000007980
[400053.687041] R13: 0000000000007980 R14: ffff9f09d07a2008 R15: 0000000000000001
[400053.687045] FS:  00007f18923f08c0(0000) GS:ffff9f09eeb40000(0000) knlGS:0000000000000000
[400053.687048] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[400053.687050] CR2: 0000000000007980 CR3: 0000000bcf594000 CR4: 00000000003406e0
[400053.687053] Call Trace:
[400053.687082]  ? _nv002759kms+0x3ca/0x1470 [nvidia_modeset]
[400053.687101]  ? nv_kthread_q_stop+0x1780/0x2970 [nvidia_modeset]
[400053.687106]  ? __alloc_pages_nodemask+0x18e/0x340
[400053.687124]  ? nv_kthread_q_stop+0x1cf1/0x2970 [nvidia_modeset]
[400053.687129]  ? _copy_from_user+0x37/0x60
[400053.687146]  ? nv_kthread_q_stop+0x17e0/0x2970 [nvidia_modeset]
[400053.687164]  ? nvKmsIoctl+0x96/0x1d0 [nvidia_modeset]
[400053.687181]  ? nvkms_ioctl_common+0x3b/0x180 [nvidia_modeset]
[400053.687197]  ? nvkms_ioctl_common+0x143/0x180 [nvidia_modeset]
[400053.687394]  ? nvidia_frontend_unlocked_ioctl+0x37/0x50 [nvidia]
[400053.687398]  ? do_vfs_ioctl+0x40e/0x670
[400053.687402]  ? ksys_ioctl+0x5e/0x90
[400053.687405]  ? __x64_sys_ioctl+0x16/0x20
[400053.687409]  ? do_syscall_64+0x5f/0x200
[400053.687413]  ? entry_SYSCALL_64_after_hwframe+0x44/0xa9
[400053.687417] Modules linked in: fuse tcp_diag inet_diag ceph libceph libcrc32c fscache uinput cfg80211 rfkill 8021q garp mrp stp llc ipv6 crc_ccitt dm_mod snd_hda_codec_hdmi wmi_bmof ppdev nvidia_drm(POE) nvidia_modeset(POE) snd_hda_codec_realtek snd_hda_codec_generic nvidia(POE) uvcvideo ledtrig_audio edac_mce_amd snd_usb_audio kvm_amd videobuf2_vmalloc snd_hda_intel videobuf2_memops videobuf2_v4l2 snd_intel_nhlt snd_usbmidi_lib videobuf2_common snd_hda_codec snd_rawmidi videodev snd_seq_device kvm drm_kms_helper mc joydev snd_hda_core irqbypass snd_hwdep drm snd_pcm crct10dif_pclmul snd_timer sp5100_tco i2c_piix4 snd ghash_clmulni_intel soundcore pcspkr k10temp i2c_core ccp vboxnetadp(OE) vboxnetflt(OE) wmi parport_pc gpio_amdpt parport gpio_generic pinctrl_amd acpi_cpufreq mac_hid vboxdrv(OE) nct6775 hwmon_vid zfs(POE) zunicode(POE) zlua(POE) zavl(POE) icp(POE) zcommon(POE) znvpair(POE) spl(OE) sd_mod uas usb_storage crc32_pclmul crc32c_intel aesni_intel crypto_simd r8169 xhci_pci cryptd
[400053.687478]  glue_helper realtek e1000e xhci_hcd ahci libphy libahci
[400053.687488] CR2: 0000000000007980
[400053.687492] ---[ end trace 8db0bddd53f5641e ]---
[400053.687520] RIP: 0010:_nv002606kms+0x60/0x100 [nvidia_modeset]
[400053.687523] Code: eb 40 0f 1f 84 00 00 00 00 00 48 c7 03 00 00 00 00 c6 43 08 00 41 8b 86 d0 00 00 00 83 c5 01 48 81 c3 28 04 00 00 39 e8 76 18 <48> 8b 3b 48 85 ff 74 ea 80 7b 08 00 75 d2 e8 dd d2 ff ff eb cb 0f
[400053.687526] RSP: 0018:ffffabe603edbc60 EFLAGS: 00010202
[400053.687529] RAX: 0000000000000004 RBX: 0000000000007980 RCX: 0000000000000004
[400053.687531] RDX: ffff9f09d07a6348 RSI: 0000000000007980 RDI: ffff9f09d07a2008
[400053.687533] RBP: 0000000000000000 R08: 0000000000000800 R09: 0000000000000000
[400053.687535] R10: 0000000000000004 R11: 0000000000000004 R12: 0000000000007980
[400053.687537] R13: 0000000000007980 R14: ffff9f09d07a2008 R15: 0000000000000001
[400053.687540] FS:  00007f18923f08c0(0000) GS:ffff9f09eeb40000(0000) knlGS:0000000000000000
[400053.687543] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[400053.687545] CR2: 0000000000007980 CR3: 0000000bcf594000 CR4: 00000000003406e0

X server logs show nothing.
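
A note on reading the allocation failure above: order:5 means 2^5 = 32 physically contiguous pages, which is 128 kB assuming the usual 4 kB page size on x86_64. The per-zone free lists in the log show 0*128kB and larger for DMA32 and Normal, which suggests the request failed because of fragmentation rather than a lack of free memory. The following is a minimal Python sketch that checks this from one dmesg line. The function names are hypothetical helpers, not part of any kernel or driver tooling.

```python
import re

PAGE_KB = 4  # assumption: 4 kB pages, as is usual on x86_64

# Free-list line for the Normal zone, copied from the log above.
NORMAL_LINE = (
    "[400053.686931] Node 0 Normal: 59521*4kB (UME) 235755*8kB (UME) "
    "74416*16kB (UME) 5285*32kB (UME) 1030*64kB (UME) 0*128kB 0*256kB "
    "0*512kB 0*1024kB 0*2048kB 0*4096kB = 3549820kB"
)


def parse_buddy_line(line):
    """Return (zone, {block_size_kb: free_count}) for one free-list line."""
    m = re.search(r"Node \d+ (\w+): (.*?) = \d+kB", line)
    if m is None:
        raise ValueError("not a buddy free-list line")
    zone, body = m.group(1), m.group(2)
    counts = {int(kb): int(n) for n, kb in re.findall(r"(\d+)\*(\d+)kB", body)}
    return zone, counts


def free_blocks_at_least(counts, order):
    """Count free blocks large enough to satisfy an allocation of this order."""
    need_kb = PAGE_KB << order
    return sum(n for kb, n in counts.items() if kb >= need_kb)


zone, counts = parse_buddy_line(NORMAL_LINE)
total_kb = sum(kb * n for kb, n in counts.items())
big = free_blocks_at_least(counts, 5)
print(f"{zone}: {total_kb} kB free, {big} blocks >= {PAGE_KB << 5} kB")
# prints: Normal: 3549820 kB free, 0 blocks >= 128 kB
```

The DMA zone does list a few larger blocks, but that zone is tiny and is normally held back from ordinary GFP_KERNEL requests by the lowmem_reserve values shown in the log.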

Reproducible: Always

Steps to Reproduce:
1. Use nVidia proprietary drivers and KWin with OpenGL enabled and at least 2 screens (might be reproducible with any display manager using OpenGL rendering)
2. Wait a few days (from 2 to 10 days)
3. The crash occurs when entering or leaving energy-saving mode
4. I can connect via ssh to reboot cleanly
Actual Results:  
The display crashes: no display update or switch to text mode is possible. The machine still works via ssh.

Expected Results:  
No display crash

I'm using X11 mode. No idea if Wayland has the same issue. My video card is an NVIDIA Corporation TU117 [GeForce GTX 1650] (rev a1) from Gigabyte, but my former MSI GT-730 had the same problem.

Comment 1 Ionen Wolkens gentoo-dev 2020-10-07 10:47:16 UTC

Which driver version? If it's 455.23.04 then it's something I've also run into (seen three other users do as well).

Using 450.80.02 instead works fine.

I've found it's easiest to trigger while doing heavy/rapid usage of tmpfs with nearly full ram usage, but other things can randomly trigger it as well.
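
To watch for this kind of fragmentation while stress testing, one option is to sample /proc/buddyinfo, which lists free block counts per zone, one column per order starting at order 0. Below is a minimal Python sketch, assuming the standard layout of that file. The helper names are hypothetical and this is not something the driver or Gentoo ships.

```python
def parse_buddyinfo(text):
    """Map (node, zone) -> list of free block counts indexed by order."""
    zones = {}
    for line in text.splitlines():
        parts = line.split()
        # Expected shape: "Node 0, zone Normal c0 c1 ... c10"
        if len(parts) < 5 or parts[0] != "Node" or parts[2] != "zone":
            continue
        node = int(parts[1].rstrip(","))
        zones[(node, parts[3])] = [int(c) for c in parts[4:]]
    return zones


def high_order_free(counts, min_order=5):
    """Number of free blocks of at least min_order (order 5 = 32 pages)."""
    return sum(counts[min_order:])


def report(path="/proc/buddyinfo"):
    """Print, for each zone, how many blocks could satisfy an order-5 request."""
    with open(path) as f:
        zones = parse_buddyinfo(f.read())
    for (node, zone), counts in sorted(zones.items()):
        print(f"node {node} {zone}: {high_order_free(counts)} blocks of order >= 5")
```

Calling report() periodically during a stress run (for example from a loop with time.sleep) would show whether the order-5 and larger columns drop to zero before X hits the allocation failure.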

455.xx is currently only needed for RTX 30xx cards; the nvidia driver page also (currently) points to 450.80.02 if you don't request a 30xx card.

From this end I'd suggest making 450.80.02 the next stable and leave 455.xx in ~testing for a while. Not sure how widespread this issue is to know if it's worth masking current 455.xx (but 30xx users could unmask it as needed).

Similar issue on nvidia forums:
https://forums.developer.nvidia.com/t/455-23-04-page-allocation-failure-in-kernel-module-at-random-points/155250

Comment 2 fatalerrors@geoffray-levasseur.org 2020-10-07 11:48:19 UTC

I'm using the latest stable, which is currently 450.66. I can try unmasking 450.80.02 to see if it happens again.

Comment 3 Ionen Wolkens gentoo-dev 2020-10-07 12:21:09 UTC

(In reply to fatalerrors@geoffray-levasseur.org from comment #2)
> I'm using last stable which is at time 450.66. I can try to unmask 450.80.02
> to see if that happens again.
I see, in that case I'm surprised. Pretty sure 450.66 was fine, although there's another user who I "think" is using 450.66 and getting that error, but I haven't gotten confirmation of the driver version.

I don't think 450.66 and 450.80.02 are very different given the changelog, but there could be more unmentioned changes that help. Do report if it helped.

Not sure gentoo can do much to figure this out though, probably better taken to nvidia.

Comment 4 Ionen Wolkens gentoo-dev 2020-10-07 19:38:40 UTC

(In reply to Ionen Wolkens from comment #1)
> If it's 455.23.04 then it's something I've also run into.
Since 455.28 just came out (thanks for fast version bump as usual), thought I'd give it a stress test to see if fixed. Unfortunately got page allocation failure after ~20m of abuse, no issues with 450.80.02 still (ah well, I'll stick with that until nvidia fix this).

Comment 5 Ionen Wolkens gentoo-dev 2020-10-12 09:15:14 UTC

(In reply to Ionen Wolkens from comment #1)
> From this end I'd suggest making 450.80.02 the next stable and leave 455.xx
> in ~testing for a while.
On a related note, haven't tested runtime but 450.80.02 and 455.28 seem to build fine with kernel 5.9 as-is (or at least with my configuration), while stable 450.66 is failing. I still suggest not to stabilize 0/455 branch yet considering there's also bug #747319 that's concerning.

Not sure if 450.80.02 helps with this page failure issue over 450.66 (given I had the issue with 455.xx), but I still haven't been able to trigger it on this version.

Comment 6 Risto A. Paju 2020-10-14 10:20:36 UTC

(In reply to Ionen Wolkens from comment #5)

> On a related note, haven't tested runtime but 450.80.02 and 455.28 seem to
> build fine with kernel 5.9 as-is (or at least with my configuration)

For me, 455.28 with kernel 5.8.14 seems to have fixed the issue (knocks wood with crossed fingers). But with kernel 5.9, CUDA and OpenCL don't work (though OpenGL does). This seems related to a change in 5.9 in handling non-free modules.

Comment 7 Michal Jakubowski 2020-10-15 06:23:29 UTC

The patch (kernel59.patch) is for kernel/module.c
You can certainly use it on your private machine.
With this patch nvidia_uvm works just fine.

Comment 8 Michal Jakubowski 2020-10-15 06:24:08 UTC

Created attachment 665762 [details, diff]
kernel_59_nvidia_uvm.patch

Comment 9 Opportunist 2020-10-21 12:15:00 UTC

same here with 455.28

Comment 10 Marek Duranik 2020-10-22 16:23:01 UTC

I tried the "kernel_59_nvidia_uvm.patch" patch on kernel 5.9.1, but the problem with "X: page allocation failure...." persists.

kernel 5.9.1-gentoo-x86_64
VGA: GeForce GTX 660 
Driver: nvidia-drivers-455.28

I think the problem could be related to screen saving, because it appears after waking the screen from sleep.
It is possible that the issue is related to the nvidia driver as well, because there was an update on Oct 17 from version 455.23.04-r1 to version 455.28. I had never had this issue with the graphical environment freezing before.

Comment 11 Ionen Wolkens gentoo-dev 2020-10-22 23:53:38 UTC

The nvidia forums link I posted in comment #1 has been seeing a lot of activity. An nvidia rep notably said:
> We’ve made a change that should avoid this problem in the future. It’ll
> be available in a future release.
> It should apply to all memory allocation failures that happen during mode
> setting operations. I’m not 100% sure it applies to the one in that other
> thread, but I think so.
I don't know if it applies to this bug as well, but here's hoping the next version will fix this for everyone.

(the 5.9 patch has nothing to do with this bug, and is also pointless with the default USE=-uvm)

Comment 12 Ionen Wolkens gentoo-dev 2020-10-29 23:29:24 UTC

(In reply to Ionen Wolkens from comment #11)
> I don't know if it applies to this bug as well but there's hoping the next
> version will fix this for everyone.
Or another version down the line anyway; according to an nvidia rep, the fix wasn't included in 455.38.

So, for now, if you're having issues, stick to whichever version works best for your setup. In my case that's 450.80.02, but if using a stable 5.4.x kernel then the still-in-tree 440.100 is probably the safest fallback, given 450/455 introduced a lot of changes+issues.

Comment 13 reTok 2020-12-14 14:33:52 UTC

I can confirm that this issue also happens with Ubuntu, a GTX 960, and three Dell DisplayPort screens. For me it fails fastest (within a week) when running a VMWare win10 virtual machine with 3D acceleration enabled. Without that running, uptimes may be 1-2 months.

I can also confirm that this issue has been there for at least 1.5 years with multiple different kernel and NVidia driver versions. No version combination has made any difference.

Comment 14 reTok 2020-12-14 14:42:29 UTC

Created attachment 678286 [details]
dmesg log from today's crash

In case someone needs it, here is the dmesg log from today's crash. I was able to ssh to the machine, but the reboot command didn't work (it just killed ssh etc). I had to use the magic sysrq key combo to force a reboot. This is the usual case.

Comment 15 Ionen Wolkens gentoo-dev 2020-12-29 13:49:47 UTC

Is anyone still having problems with either 455.45.01 or 460.27.04+?

I believe there may have been two different page allocation issues (one which is related to DPMS that I didn't reproduce, and apparently also happened with 450.xx), but not sure if they're both fixed.

The former carries a patch, and the latter an official fix relating to page alloc failures.

Comment 16 Ionen Wolkens gentoo-dev 2021-03-02 22:35:53 UTC

At least one of the page alloc failures is fixed, but I haven't heard of the other one in a while (supposedly DPMS related). I'm led to believe this issue is obsolete.

Please open a new bug if you still run into these.

Comment 17 Albert Zeyer 2021-11-05 13:36:52 UTC

Just to add some additional information: I'm also seeing hangs with similar stack traces (involving nvidia_frontend_unlocked_ioctl and other nvidia ioctl related things). This is on Ubuntu 20.04 with nvidia 470 and Linux 5.4.0.

Some more details here: https://askubuntu.com/questions/1236721/desktop-hung-up-freeze-gpuwatchdog-segfault-nvidia-frontend-close

I don't really have any solution. This also doesn't really happen too often.