Summary: | x11-drivers/nvidia-drivers-450.66 - X: page allocation failure: order:5, mode:0x40cc0(GFP_KERNEL|__GFP_COMP), nodemask=(null),cpuset=/,mems_allowed=0 | ||
---|---|---|---|
Product: | Gentoo Linux | Reporter: | fatalerrors <fatalerrors> |
Component: | Current packages | Assignee: | David Seifert <soap> |
Status: | RESOLVED OBSOLETE | ||
Severity: | normal | CC: | axiator, ionen, kajanos, marek.duranik |
Priority: | Normal | ||
Version: | unspecified | ||
Hardware: | AMD64 | ||
OS: | Linux | ||
See Also: | https://bugs.gentoo.org/show_bug.cgi?id=753629 | ||
Whiteboard: | |||
Package list: | Runtime testing required: | --- | |
Attachments: | kernel_59_nvidia_uvm.patch; dmesg log from today's crash | ||
Description
fatalerrors@geoffray-levasseur.org
2020-10-07 09:36:58 UTC
Which driver version?

If it's 455.23.04, then it's something I've also run into (and I've seen three other users hit it as well). Using 450.80.02 instead works fine. I've found it's easiest to trigger during heavy/rapid use of tmpfs with nearly full RAM usage, but other things can randomly trigger it as well.

455.xx is currently only needed for RTX 30xx cards; the nvidia driver page also (currently) points to 450.80.02 if you don't request a 30xx card.

From this end I'd suggest making 450.80.02 the next stable and leaving 455.xx in ~testing for a while. I'm not sure how widespread this issue is, so I can't say whether it's worth masking the current 455.xx (30xx users could unmask it as needed).

Similar issue on nvidia forums: https://forums.developer.nvidia.com/t/455-23-04-page-allocation-failure-in-kernel-module-at-random-points/155250

I'm using the latest stable, which is currently 450.66. I can try to unmask 450.80.02 to see if it happens again.

(In reply to fatalerrors@geoffray-levasseur.org from comment #2)
> I'm using last stable which is at time 450.66. I can try to unmask 450.80.02
> to see if that happens again.

I see, in that case I'm surprised. Pretty sure 450.66 was fine, although there's another user that I "think" is using 450.66 and getting that error, but I haven't gotten confirmation of the driver version.

I don't think 450.66 and 450.80.02 are very different given the changelog, but there could be more unmentioned changes that help. Do report if it helped.

Not sure Gentoo can do much to figure this out though; this is probably better taken to nvidia.

(In reply to Ionen Wolkens from comment #1)
> If it's 455.23.04 then it's something I've also run into.

Since 455.28 just came out (thanks for the fast version bump as usual), I thought I'd give it a stress test to see if it's fixed. Unfortunately I got a page allocation failure after ~20m of abuse; still no issues with 450.80.02 (ah well, I'll stick with that until nvidia fixes this).
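As background for the failure named in the bug summary: `order:N` in a page allocation failure means the kernel could not find 2^N physically contiguous pages, which is why the failure tends to show up when RAM is nearly full and fragmented. The following is a minimal sketch that decodes the summary line; it assumes 4 KiB base pages (the usual AMD64 default), and `decode_failure` is a hypothetical helper written for illustration, not part of any kernel or driver tooling.

```python
import re

# Assumption: 4 KiB base pages, the usual default on AMD64.
PAGE_SIZE = 4096

# Matches the "<process>: page allocation failure: order:N, mode:0x...(FLAGS)" prefix.
FAILURE_RE = re.compile(
    r"(?P<comm>\S+): page allocation failure: order:(?P<order>\d+), "
    r"mode:(?P<mode>0x[0-9a-f]+)\((?P<flags>[^)]*)\)"
)


def decode_failure(line):
    """Hypothetical helper: extract process, order, size and GFP flags from one line."""
    m = FAILURE_RE.search(line)
    if m is None:
        return None
    order = int(m.group("order"))
    pages = 1 << order  # an order-N request asks for 2**N contiguous pages
    return {
        "comm": m.group("comm"),
        "order": order,
        "pages": pages,
        "bytes": pages * PAGE_SIZE,
        "flags": m.group("flags").split("|"),
    }


# The message from this bug's summary line.
summary = (
    "X: page allocation failure: order:5, "
    "mode:0x40cc0(GFP_KERNEL|__GFP_COMP), "
    "nodemask=(null),cpuset=/,mems_allowed=0"
)

info = decode_failure(summary)
print(info["comm"], info["pages"], "pages,", info["bytes"] // 1024, "KiB contiguous")
```

Under the 4 KiB assumption, the order-5 request in the summary is 32 contiguous pages, i.e. 128 KiB.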
(In reply to Ionen Wolkens from comment #1)
> From this end I'd suggest making 450.80.02 the next stable and leave 455.xx
> in ~testing for a while.

On a related note, I haven't tested at runtime, but 450.80.02 and 455.28 seem to build fine with kernel 5.9 as-is (or at least with my configuration), while stable 450.66 is failing. I still suggest not stabilizing the 0/455 branch yet, considering there's also bug #747319, which is concerning.

Not sure if 450.80.02 helps with this page failure issue over 450.66 (given I had the issue with 455.xx), but I still haven't been able to trigger it on this version.

(In reply to Ionen Wolkens from comment #5)
> On a related note, haven't tested runtime but 450.80.02 and 455.28 seem to
> build fine with kernel 5.9 as-is (or at least with my configuration)

For me, 455.28 with kernel 5.8.14 seems to have fixed the issue (knocks on wood with crossed fingers). But with kernel 5.9, CUDA and OpenCL don't work (though OpenGL does). This seems related to a change in 5.9 in the handling of non-free modules.

The patch (kernel59.patch) is for kernel/module.c. You can certainly use it on your private machine. With this patch, nvidia_uvm works just fine.

Created attachment 665762
kernel_59_nvidia_uvm.patch
Same here with 455.28.

I tried the "kernel_59_nvidia_uvm.patch" patch with kernel 5.9.1, but the "X: page allocation failure...." problem persists.

Kernel: 5.9.1-gentoo-x86_64
VGA: GeForce GTX 660
Driver: nvidia-drivers-455.28

I think the problem could be related to screen saving, because it appears after waking the screen from sleep. It is possible that the issue is related to the nvidia driver as well, because there was an update on Oct 17 from version 455.23.04-r1 to version 455.28. I never had this issue with the graphical environment freezing before.

The nvidia forums link I posted in comment #1 has been seeing a lot of activity. An nvidia rep notably said:
> We’ve made a change that should avoid this problem in the future. It’ll
> be available in a future release.
> It should apply to all memory allocation failures that happen during mode
> setting operations. I’m not 100% sure it applies to the one in that other
> thread, but I think so.

I don't know if it applies to this bug as well, but here's hoping the next version will fix this for everyone.

(The 5.9 patch has nothing to do with this bug, and is also pointless with the default USE=-uvm.)

(In reply to Ionen Wolkens from comment #11)
> I don't know if it applies to this bug as well but there's hoping the next
> version will fix this for everyone.

Or another version down the line anyway; according to an nvidia rep, the fix wasn't included in 455.38. So, for now, if you're having issues, stick to whichever version works best for your setup. In my case that's 450.80.02, but if you're using a stable 5.4.x kernel, then the still-in-tree 440.100 is probably the safest fallback, given that 450/455 introduced a lot of changes and issues.

I can confirm that this issue also happens on Ubuntu, with a GTX 960 and three Dell DisplayPort screens. For me it fails fastest (within a week) when running a VMware Windows 10 virtual machine with 3D acceleration enabled. Without that running, uptimes may be 1-2 months.
I can also confirm that this issue has been there for at least 1.5 years, with multiple different kernel and NVIDIA driver versions. No version combination has made any difference.

Created attachment 678286
dmesg log from today's crash
In case someone needs it, here is the dmesg log from today's crash. I was able to ssh into the machine, but the reboot command didn't work (it just killed ssh etc.). I had to use the magic SysRq key combo to force a reboot. This is the usual case.
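For anyone going through a dmesg log like the attached one, a small script can help tally which processes hit allocation failures and at which order. This is only a sketch: `count_failures` is a hypothetical helper, and the sample log lines below are illustrative (the timestamps and the USB line are made up; only the failure message format follows this bug's summary).

```python
import re
from collections import Counter

# Captures the process name and the allocation order from a failure line.
FAILURE_RE = re.compile(r"(\S+): page allocation failure: order:(\d+)")


def count_failures(dmesg_text):
    """Hypothetical helper: count failures per (process, order) in dmesg text."""
    counts = Counter()
    for line in dmesg_text.splitlines():
        m = FAILURE_RE.search(line)
        if m:
            counts[(m.group(1), int(m.group(2)))] += 1
    return counts


# Illustrative sample only; timestamps and the unrelated line are invented.
sample = """\
[  101.000000] usb 1-1: new high-speed USB device number 2 using xhci_hcd
[86400.123456] X: page allocation failure: order:5, mode:0x40cc0(GFP_KERNEL|__GFP_COMP), nodemask=(null),cpuset=/,mems_allowed=0
[86999.654321] X: page allocation failure: order:5, mode:0x40cc0(GFP_KERNEL|__GFP_COMP), nodemask=(null),cpuset=/,mems_allowed=0
"""

counts = count_failures(sample)
for (comm, order), n in counts.items():
    print(f"{comm}: {n} failure(s) at order {order}")
```

On a real system the same function could be fed the text of a saved log file instead of the sample string.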
Is anyone still having problems with either 455.45.01 or 460.27.04+?

I believe there may have been two different page allocation issues (one related to DPMS that I didn't reproduce, and which apparently also happened with 450.xx), but I'm not sure if they're both fixed. The former carries a patch, and the latter an official fix relating to page alloc failures.

At least one of the page alloc failures is fixed, but I haven't heard of the other one in a while (supposedly DPMS related). I'm led to believe this issue is obsolete. Please open a new bug if you still run into these.

Just to add some additional information: I'm also seeing hangs with similar stack traces (involving nvidia_frontend_unlocked_ioctl and other nvidia ioctl related things). This is on Ubuntu 20.04 with nvidia 470 and Linux 5.4.0. Some more details here: https://askubuntu.com/questions/1236721/desktop-hung-up-freeze-gpuwatchdog-segfault-nvidia-frontend-close

I don't really have a solution. This also doesn't happen very often.