In every release of the 455 series up through the current one, enabling nvidia's modesetting makes the kernel throw page allocation failures and crash various apps. This prevents me from keeping the system online for a full day. Upstream is tracking it here: https://forums.developer.nvidia.com/t/455-23-04-page-allocation-failure-in-kernel-module-at-random-points/155250 I'm repeating it here because the problem is difficult to trace back to nvidia, and because the only known fix requires an out-of-tree patch posted by an nvidia engineer that apparently won't be included until a future release series. The patch is available here: https://people.freedesktop.org/~aplattner/reduce-kmalloc-limit-455.38.patch Reproducible: Always
See also bug #747028, which has comments about the same thing. It's hard to say whether this is a duplicate, though, given that bug was about 450.66 _also_ having page allocation issues (there are a few reports of those on nvidia's forums too). I haven't run into this with 450.xx myself, while I could easily reproduce it within a few minutes on any 455.xx by stressing memory usage. These bugs have been a worry for people here and there on Gentoo, because they think their hardware is failing when it's not. Personally I'm still using 450.80.02 right now. I've seen the patch before but felt there was no rush for me to move to 455.xx. I do believe 455.xx fixes a few things for KDE users, which is the reason it was stabilized (see bug #749393). The last version that shouldn't have any of the aforementioned issues without patches right now is 440.100, but it needs a 5.4.x kernel.
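For anyone trying to tell these failures apart from failing hardware, the following is a minimal sketch of scanning saved kernel log text for allocation-failure warnings. It assumes the usual kernel wording `<task>: page allocation failure: order:<N>`; the function names are hypothetical, and a task name containing spaces would only be captured partially.

```python
import re

# Assumed message shape, based on typical kernel output, e.g.
#   "Xorg: page allocation failure: order:4, mode:0x40cc0(GFP_KERNEL|__GFP_COMP)"
FAILURE_RE = re.compile(
    r"(?P<task>\S+): page allocation failure: order:(?P<order>\d+)"
)


def find_allocation_failures(log_text):
    """Return a (task, order) tuple for each allocation failure in log_text."""
    return [
        (m.group("task"), int(m.group("order")))
        for m in FAILURE_RE.finditer(log_text)
    ]


def mentions_nvidia(log_text):
    """Return True if any log line mentions nvidia (e.g. in a call trace)."""
    return any("nvidia" in line.lower() for line in log_text.splitlines())
```

Feeding it the saved output of `dmesg` gives a quick list of which tasks hit the failure; a high allocation order together with nvidia frames in the trace points at the driver rather than at the hardware.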
(In reply to Ionen Wolkens from comment #1) > Personally still using 450.80.02 right now. Ditto, I'm on 450.80.02, but with 5.9.x getting new Intel security fixes after 5.8 went EOL, I was hoping to move up. There was a comment about HardDPMS possibly being related to the 450 crashes? I haven't been able to reproduce on 450 yet, but the crashing in 455 is extreme enough that it probably can't hurt to throw in the patch for the new release that added 5.9 nvidia uvm compatibility, since the change is only in the modesetting code anyway.
(In reply to Gregory Beauregard from comment #0) > Available here: > https://people.freedesktop.org/~aplattner/reduce-kmalloc-limit-455.38.patch I'm very uneasy about adding this patch in Gentoo, given how brittle the nvidia driver already is by itself.
(In reply to David Seifert from comment #3) > I'm very uneasy adding this patch in Gentoo, given how brittle the nvidia > driver is already by itself. I don't think it's a big deal either way, since I think we can apply it ourselves with EPATCH_USER. The arguments for adding it are that, as far as I know, modesetting is completely broken in this version of the driver, and that the patch only affects the modesetting part of the driver, which isn't a loaded module if you aren't using modesetting.
i.e., this code is only loaded if you are using the part of the driver that is completely broken otherwise.
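Whether a given machine is exposed depends on modesetting actually being enabled. A minimal sketch of checking that, assuming the `nvidia_drm` module exposes its `modeset` parameter under the standard sysfs module-parameter path (the function name is hypothetical):

```python
from pathlib import Path

# Standard sysfs location for a module parameter; it only exists while
# the nvidia_drm module is loaded.
MODESET_PARAM = Path("/sys/module/nvidia_drm/parameters/modeset")


def modesetting_enabled(param_path=MODESET_PARAM):
    """Return True or False for the modeset parameter, None if unavailable."""
    try:
        value = Path(param_path).read_text().strip()
    except OSError:
        # Module not loaded, or the parameter is not readable.
        return None
    # The kernel prints boolean module parameters as "Y" or "N".
    return value in ("Y", "1")
```

A result of `None` or `False` suggests the modesetting code path discussed here is not in use on that system.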
Created attachment 677137: patch. I've attached the patch in the directory layout needed to use it with EPATCH_USER in Gentoo.
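The directory layout mentioned above can be sketched as follows. This assumes Portage's user-patch convention of `/etc/portage/patches/<category>/<package>[-<version>]/` and that patch files end in `.patch` or `.diff`; the helper names are hypothetical and the attachment's exact layout is not reproduced here.

```python
import shutil
from pathlib import Path


def user_patch_dir(category, package, version=None,
                   root="/etc/portage/patches"):
    """Build the user-patch directory for a package, optionally per version."""
    name = f"{package}-{version}" if version else package
    return Path(root) / category / name


def install_user_patch(patch_file, category, package, version=None,
                       root="/etc/portage/patches"):
    """Copy patch_file into the user-patch directory and return its new path."""
    target = user_patch_dir(category, package, version, root)
    target.mkdir(parents=True, exist_ok=True)
    return Path(shutil.copy(patch_file, target))
```

For example, `user_patch_dir("x11-drivers", "nvidia-drivers", "455.45.01")` yields `/etc/portage/patches/x11-drivers/nvidia-drivers-455.45.01`. Pinning the version keeps the patch from being applied to a later release that already contains the fix.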
The bug has been referenced in the following commit(s):

https://gitweb.gentoo.org/repo/gentoo.git/commit/?id=477b1935411fdc4646c5ef49a1414faeda70058d

commit 477b1935411fdc4646c5ef49a1414faeda70058d
Author:     David Seifert <soap@gentoo.org>
AuthorDate: 2020-12-07 12:57:41 +0000
Commit:     David Seifert <soap@gentoo.org>
CommitDate: 2020-12-07 12:57:41 +0000

    x11-drivers/nvidia-drivers: Add patch for modesetting allocation failures

    Bug: https://bugs.gentoo.org/755497
    Package-Manager: Portage-3.0.12, Repoman-3.0.2
    Suggested-by: Gregory Beauregard <gentoobugs@gably.net>
    Signed-off-by: David Seifert <soap@gentoo.org>

 ...nvidia-drivers-455.45.01-reduce-kmalloc-limit.patch | 18 ++++++++++++++++++
 ...45.01.ebuild => nvidia-drivers-455.45.01-r1.ebuild} | 1 +
 2 files changed, 19 insertions(+)
(In reply to David Seifert from comment #3) > I'm very uneasy adding this patch in Gentoo, given how brittle the nvidia > driver is already by itself. Given that I was able to reproduce this issue mostly on demand, I tried the new driver with the patch and couldn't make it happen so far (works fine). Thanks for looking at this. I do share your uneasiness, but I think this issue was major enough. Hopefully nvidia's thread will give insight on when this patch can be removed. Once this is stable, I'd argue that 450.80.02-r1 through 455.38-r1 should all be removed. 440.100 still has worth for its overall stability (the 450.xx branch introduced a lot of problems, and 455.xx yet more).
reports that new beta driver 460.27.04 is supposed to fix this issue
The bug has been referenced in the following commit(s):

https://gitweb.gentoo.org/repo/gentoo.git/commit/?id=27c4c118947e314725b7bb6246dec1d00c55826d

commit 27c4c118947e314725b7bb6246dec1d00c55826d
Author:     David Seifert <soap@gentoo.org>
AuthorDate: 2020-12-27 10:48:57 +0000
Commit:     David Seifert <soap@gentoo.org>
CommitDate: 2020-12-27 10:48:57 +0000

    x11-drivers/nvidia-drivers: Stable 455.45.01-r1

    Bug: https://bugs.gentoo.org/755497
    Package-Manager: Portage-3.0.12, Repoman-3.0.2
    Signed-off-by: David Seifert <soap@gentoo.org>

 x11-drivers/nvidia-drivers/nvidia-drivers-455.45.01-r1.ebuild | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
With nvidia having also fixed this on their end and old drivers being gone, I'd say we're done here.