Bug 755497 - x11-drivers/nvidia-drivers-455.38-r1: page allocation failure crashes in X11, compositors, etc if modesetting is on
Status: RESOLVED FIXED
Alias: None
Product: Gentoo Linux
Classification: Unclassified
Component: Current packages
Hardware: All Linux
Importance: Normal normal
Assignee: David Seifert
URL:
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2020-11-19 10:05 UTC by Gregory Beauregard
Modified: 2021-03-02 22:16 UTC
CC List: 5 users

See Also:
Package list:
Runtime testing required: ---


Attachments
patch (nvidia-modeset.patch, 703 bytes, patch)
2020-12-07 11:56 UTC, Gregory Beauregard

Description Gregory Beauregard 2020-11-19 10:05:04 UTC
In every 455-series release up to the current one, enabling nvidia's modesetting causes the kernel to throw page allocation failures that crash various apps. This prevents me from keeping the system up for even a day.

See upstream for tracking here: https://forums.developer.nvidia.com/t/455-23-04-page-allocation-failure-in-kernel-module-at-random-points/155250

Repeating here since this is difficult to track down to being nvidia's fault, and because the only known fix requires an out-of-tree patch posted by an nvidia engineer that apparently won't be included until a future release series.

Available here: https://people.freedesktop.org/~aplattner/reduce-kmalloc-limit-455.38.patch

Reproducible: Always
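
Since these crashes are hard to attribute to the driver, it can help to pull the failing process name and allocation order out of the kernel log. The following is a minimal sketch, not part of the original report: the log line layout is assumed from typical kernel output and varies between kernel versions, the sample lines are illustrative rather than taken from this bug, and the names `FAILURE_RE`, `find_allocation_failures`, and `SAMPLE_LOG` are hypothetical.

```python
import re

# Assumed layout of the kernel's "page allocation failure" line; the exact
# format differs between kernel versions, so adjust the pattern as needed.
FAILURE_RE = re.compile(
    r"(?P<comm>\S+): page allocation failure: order:(?P<order>\d+)"
)


def find_allocation_failures(log_text):
    """Return (process name, allocation order) pairs found in kernel log text."""
    results = []
    for line in log_text.splitlines():
        match = FAILURE_RE.search(line)
        if match:
            results.append((match.group("comm"), int(match.group("order"))))
    return results


# Illustrative sample only; these lines are NOT from the bug report.
SAMPLE_LOG = """\
[12345.678901] Xorg: page allocation failure: order:4, mode:0x40cc0(GFP_KERNEL|__GFP_COMP), nodemask=(null)
[12345.678950] usb 1-1: new high-speed USB device number 5 using xhci_hcd
[12400.000001] kwin_x11: page allocation failure: order:5, mode:0x40cc0(GFP_KERNEL|__GFP_COMP), nodemask=(null)
"""

if __name__ == "__main__":
    for comm, order in find_allocation_failures(SAMPLE_LOG):
        # An order-N request asks for 2**N physically contiguous pages
        # (4 KiB pages are assumed here).
        print(f"{comm}: order {order} ({4096 << order} bytes with 4 KiB pages)")
```

Feeding it the output of `dmesg` shows which processes hit the failure and how large the contiguous request was. Process names containing spaces would be cut at the last space by this simple pattern.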
Comment 1 Ionen Wolkens gentoo-dev 2020-11-19 17:26:54 UTC
See also bug #747028 with comments about the same thing.

But it's hard to say whether this is a duplicate, given that bug was about 450.66 _also_ having page allocation issues (there are a few of those on nvidia's forums too). I haven't run into this with 450.xx myself, whereas I could easily reproduce it within a few minutes on any 455.xx by stressing memory usage.

These bugs have been a worry for people here and there on Gentoo, since they think their hardware is failing when it's not.

Personally, I'm still using 450.80.02 right now. I've seen the patch before but felt there was no rush for me to use 455.xx. I believe 455.xx does fix a few things for KDE users, though, which is why 455.xx was stabilized (see bug #749393).

The last version that shouldn't have any of the aforementioned issues without patches right now is 440.100, but it needs a 5.4.x kernel.
Comment 2 Gregory Beauregard 2020-11-19 17:39:42 UTC
(In reply to Ionen Wolkens from comment #1)

Ditto, I'm on 450.80.02, but with 5.9.x carrying new Intel security fixes after 5.8 went EOL, I was hoping to move up.

There was a comment about HardDPMS possibly being related to the 450 crashes? I haven't been able to reproduce on 450 yet, but the crashing in 455 is extreme enough that it probably can't hurt to throw in the patch for the new release that added 5.9 nvidia uvm compatibility, since the change is only in the modesetting code anyway.
Comment 3 David Seifert gentoo-dev 2020-12-07 10:38:04 UTC
(In reply to Gregory Beauregard from comment #0)

I'm very uneasy adding this patch in Gentoo, given how brittle the nvidia driver is already by itself.
Comment 4 Gregory Beauregard 2020-12-07 10:40:50 UTC
(In reply to David Seifert from comment #3)
> I'm very uneasy adding this patch in Gentoo, given how brittle the nvidia
> driver is already by itself.

So, I don't think it's a big deal, since I think we can apply it ourselves with EPATCH_USER. But the arguments for adding it are that, as far as I know, modesetting is completely broken in this version of the driver, and the patch only affects the modesetting part of the driver, a module that isn't loaded if you aren't using modesetting.
Comment 5 Gregory Beauregard 2020-12-07 10:45:03 UTC
i.e., this code isn't loaded if you aren't using a version of the driver that's completely broken otherwise
Comment 6 Gregory Beauregard 2020-12-07 11:56:01 UTC
Created attachment 677137
patch

I've attached the patch with the directory format needed for use with EPATCH_USER in Gentoo.
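
For readers unfamiliar with user patches, here is a rough sketch of where such a patch would go. It assumes the comment refers to Portage's user patch mechanism, which looks under `/etc/portage/patches/<category>/` for directories named after the package. The helper `user_patch_dirs` and the constant `PATCH_ROOT` are hypothetical names for illustration, and the list is simplified (slot-qualified directories are left out).

```python
from pathlib import PurePosixPath

# Root of Portage's user patch tree.
PATCH_ROOT = PurePosixPath("/etc/portage/patches")


def user_patch_dirs(category, name, version, revision=""):
    """List candidate user patch directories, most specific first.

    Simplified illustration: slot-qualified variants are not covered.
    """
    pv = f"{name}-{version}"
    dirs = []
    if revision:
        # Matches exactly one revision, e.g. nvidia-drivers-455.38-r1
        dirs.append(PATCH_ROOT / category / f"{pv}-{revision}")
    # Matches every revision of this version
    dirs.append(PATCH_ROOT / category / pv)
    # Matches every version of the package
    dirs.append(PATCH_ROOT / category / name)
    return dirs


if __name__ == "__main__":
    for directory in user_patch_dirs("x11-drivers", "nvidia-drivers", "455.38", "r1"):
        print(directory)
```

Dropping the patch into the version-specific directory keeps it from being applied to later releases that may already contain the fix.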
Comment 7 Larry the Git Cow gentoo-dev 2020-12-07 12:57:56 UTC
The bug has been referenced in the following commit(s):

https://gitweb.gentoo.org/repo/gentoo.git/commit/?id=477b1935411fdc4646c5ef49a1414faeda70058d

commit 477b1935411fdc4646c5ef49a1414faeda70058d
Author:     David Seifert <soap@gentoo.org>
AuthorDate: 2020-12-07 12:57:41 +0000
Commit:     David Seifert <soap@gentoo.org>
CommitDate: 2020-12-07 12:57:41 +0000

    x11-drivers/nvidia-drivers: Add patch for modesetting allocation failures
    
    Bug: https://bugs.gentoo.org/755497
    Package-Manager: Portage-3.0.12, Repoman-3.0.2
    Suggested-by: Gregory Beauregard <gentoobugs@gably.net>
    Signed-off-by: David Seifert <soap@gentoo.org>

 ...nvidia-drivers-455.45.01-reduce-kmalloc-limit.patch | 18 ++++++++++++++++++
 ...45.01.ebuild => nvidia-drivers-455.45.01-r1.ebuild} |  1 +
 2 files changed, 19 insertions(+)
Comment 8 Ionen Wolkens gentoo-dev 2020-12-07 15:02:43 UTC
(In reply to David Seifert from comment #3)
> I'm very uneasy adding this patch in Gentoo, given how brittle the nvidia
> driver is already by itself.
Given that I was able to reproduce this issue mostly on demand, I tried the new driver with the patch and couldn't make it happen so far (works fine). Thanks for looking at this.

I do share your uneasiness, but I think this issue was major enough. Hopefully nvidia's thread will give insight on when this patch can be removed.

Once this is stable, I'd argue 450.80.02-r1 through 455.38-r1 should all be removed. 440.100 still has worth for its overall stability (the 450.xx branch introduced a lot of problems, and 455.xx yet more).
Comment 9 Harris Landgarten 2020-12-16 19:32:00 UTC
Reportedly, the new beta driver 460.27.04 is supposed to fix this issue.
Comment 10 Larry the Git Cow gentoo-dev 2020-12-27 10:49:05 UTC
The bug has been referenced in the following commit(s):

https://gitweb.gentoo.org/repo/gentoo.git/commit/?id=27c4c118947e314725b7bb6246dec1d00c55826d

commit 27c4c118947e314725b7bb6246dec1d00c55826d
Author:     David Seifert <soap@gentoo.org>
AuthorDate: 2020-12-27 10:48:57 +0000
Commit:     David Seifert <soap@gentoo.org>
CommitDate: 2020-12-27 10:48:57 +0000

    x11-drivers/nvidia-drivers: Stable 455.45.01-r1
    
    Bug: https://bugs.gentoo.org/755497
    Package-Manager: Portage-3.0.12, Repoman-3.0.2
    Signed-off-by: David Seifert <soap@gentoo.org>

 x11-drivers/nvidia-drivers/nvidia-drivers-455.45.01-r1.ebuild | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
Comment 11 Ionen Wolkens gentoo-dev 2021-03-02 22:16:55 UTC
With nvidia having also fixed this on their end and old drivers being gone, I'd say we're done here.