Bug 796329 - >=x11-drivers/nvidia-drivers-465.31 kernel/system freezes when terminating chromium-based applications (when using kernel's slub_debug=P)
Summary: >=x11-drivers/nvidia-drivers-465.31 kernel/system freezes when terminating ch...
Status: RESOLVED UPSTREAM
Alias: None
Product: Gentoo Linux
Classification: Unclassified
Component: Current packages
Hardware: AMD64 Linux
Importance: Normal major
Assignee: Ionen Wolkens
URL: https://forums.developer.nvidia.com/t...
Whiteboard:
Keywords:
Duplicates: 808859
Depends on:
Blocks:
 
Reported: 2021-06-16 09:47 UTC by gertoe
Modified: 2021-09-02 20:11 UTC
CC List: 3 users

See Also:
Package list:
Runtime testing required: ---


Description gertoe 2021-06-16 09:47:27 UTC
Under certain conditions, the 465 series nvidia driver causes the system to freeze/crash reproducibly. The freeze/crash occurs after terminating (exiting/closing) a chromium-based application.

Reproducible: Always

Steps to Reproduce:
1. Install/update to =x11-drivers/nvidia-drivers-465.31
2. Open any installed chromium-based application, e.g. www-client/chromium, www-client/vivaldi or net-im/signal-desktop-bin
3. Terminate the application (close/exit)

Actual Results:  
Case 1: The complete system freezes/crashes: No interaction is possible at all.

Case 2: Partial system freeze/crash: Xorg freezes/crashes partially; only moving the mouse pointer is still possible.

In both cases, neither switching to tty[1-12] nor terminating Xorg with the manual terminate key combination is possible.

In the crashed/frozen state, the system reacts neither to ACPI power-off events nor to Magic SysRq. A reboot can only be performed by a hard power-off/reset.

No logs/kernel dumps/stack traces are written at any time.

Expected Results:  
The application should be closed and the system should behave normally.

With a running x11-misc/picom compositor, the crash/freeze occurs instantly. Without the compositor, the system freezes after a delay (approx. 5–10 s) once the chromium-based application has been terminated.

Similar issues with the nvidia 465 branch were reported at the NVIDIA forums https://forums.developer.nvidia.com/t/bug-report-455-23-04-kernel-panic-due-to-null-pointer-dereference/155506/ and https://forums.developer.nvidia.com/t/465-24-02-page-fault/175782 etc.

Rolling back to the latest 460 series driver (=x11-drivers/nvidia-drivers-460.84) resolves the issue. However, others report that a rollback to <=460.73.01 was required to solve their issue.

The issue appears to be located within the driver itself, so an update from NVIDIA will be required to fix the bug.
Comment 1 gertoe 2021-06-16 09:49:01 UTC
Portage 3.0.18 (python 3.9.5-final-0, default/linux/amd64/17.1/hardened, gcc-10.3.0, glibc-2.33, 5.10.27-gentoo x86_64)
=================================================================
System uname: Linux-5.10.27-gentoo-x86_64-Intel-R-_Core-TM-_i7-6700_CPU_@_3.40GHz-with-glibc2.33
KiB Mem:    65677688 total,  55966436 free
KiB Swap:    4194300 total,   4194300 free
Timestamp of repository gentoo: Wed, 16 Jun 2021 08:00:01 +0000
Head commit of repository gentoo: c09f2e2eb0e996be2497a17e1f8c6dafb9b70297

Head commit of repository gertoe: 79be7c4f79986705e4d2a29e233c73b366e00387

Timestamp of repository poly-c: Tue, 15 Jun 2021 12:19:55 +0000
Head commit of repository poly-c: 336b5adc59e9b8f3740878fe1758a8778910b6d5

Head commit of repository science: 5ddaec4db70483bff1a83cef04af84cee7e5e1c1

sh bash 5.1_p8
ld GNU ld (Gentoo 2.35.2 p1) 2.35.2
ccache version 4.3 [enabled]
app-shells/bash:          5.1_p8::gentoo
dev-java/java-config:     2.3.1::gentoo
dev-lang/perl:            5.32.1::gentoo
dev-lang/python:          2.7.18_p10::gentoo, 3.7.10_p3::gentoo, 3.8.10_p2::gentoo, 3.9.5_p2::gentoo
dev-lang/rust:            1.51.0-r2::gentoo
dev-util/ccache:          4.3::gentoo
dev-util/cmake:           3.18.5::gentoo
dev-util/pkgconfig:       0.29.2::gentoo
sys-apps/baselayout:      2.7::gentoo
sys-apps/openrc:          0.42.1-r1::gentoo
sys-apps/sandbox:         2.24::gentoo
sys-devel/autoconf:       2.13-r1::gentoo, 2.69-r5::gentoo
sys-devel/automake:       1.16.3-r1::gentoo
sys-devel/binutils:       2.35.2::gentoo
sys-devel/gcc:            7.5.0-r1::gentoo, 8.4.0-r2::gentoo, 9.3.0-r2::gentoo, 10.3.0::gentoo
sys-devel/gcc-config:     2.4::gentoo
sys-devel/libtool:        2.4.6-r6::gentoo
sys-devel/make:           4.3::gentoo
sys-kernel/linux-headers: 5.10::gentoo (virtual/os-headers)
sys-libs/glibc:           2.33::gentoo
Repositories:

gentoo
    location: /usr/portage
    sync-type: rsync
    sync-uri: rsync://rsync.gentoo.org/gentoo-portage
    priority: -1000
    sync-rsync-verify-metamanifest: yes
    sync-rsync-extra-opts: 
    sync-rsync-verify-max-age: 24
    sync-rsync-verify-jobs: 1

gertoe
    location: /var/db/repos/gertoe
    sync-type: git
    sync-uri: git@gitlab.com:gertoe/portage-overlay.git
    masters: gentoo

localrepo
    location: /usr/local/portage
    masters: gentoo

poly-c
    location: /var/db/repos/poly-c
    sync-type: git
    sync-uri: https://github.com/gentoo-mirror/poly-c.git
    masters: gentoo

science
    location: /var/db/repos/science
    sync-type: git
    sync-uri: https://anongit.gentoo.org/git/proj/sci.git
    masters: gentoo

ACCEPT_KEYWORDS="amd64"
ACCEPT_LICENSE="@FREE"
CBUILD="x86_64-pc-linux-gnu"
CFLAGS="-march=native -O2 -pipe"
CHOST="x86_64-pc-linux-gnu"
CONFIG_PROTECT="/etc /etc/X11/xinit/xserverrc /etc/init.d/snort /etc/modprobe.d/nvidia.conf /usr/bin/startx /usr/share/config /usr/share/gnupg/qualified.txt"
CONFIG_PROTECT_MASK="/etc/ca-certificates.conf /etc/dconf /etc/env.d /etc/fonts/fonts.conf /etc/gconf /etc/gentoo-release /etc/revdep-rebuild /etc/sandbox.d /etc/terminfo /etc/texmf/language.dat.d /etc/texmf/language.def.d /etc/texmf/updmap.d /etc/texmf/web2c"
CXXFLAGS="-march=native -O2 -pipe"
DISTDIR="/usr/portage/distfiles"
EMERGE_DEFAULT_OPTS="--jobs=8 --load-average=8 --quiet-build=n"
ENV_UNSET="CARGO_HOME DBUS_SESSION_BUS_ADDRESS DISPLAY GOBIN GOPATH PERL5LIB PERL5OPT PERLPREFIX PERL_CORE PERL_MB_OPT PERL_MM_OPT XAUTHORITY XDG_CACHE_HOME XDG_CONFIG_HOME XDG_DATA_HOME XDG_RUNTIME_DIR"
FCFLAGS="-O2 -pipe"
FEATURES="assume-digests binpkg-docompress binpkg-dostrip binpkg-logs ccache config-protect-if-modified distlocks ebuild-locks fixlafiles ipc-sandbox merge-sync multilib-strict network-sandbox news parallel-fetch pid-sandbox preserve-libs protect-owned qa-unresolved-soname-deps sandbox sfperms strict unknown-features-warn unmerge-logs unmerge-orphans userfetch userpriv usersandbox usersync xattr"
FFLAGS="-march=native -O2 -pipe"
GENTOO_MIRRORS="http://distfiles.gentoo.org"
LANG="de_DE.utf8"
LDFLAGS="-Wl,-O1 -Wl,--as-needed"
MAKEOPTS="-j12 -l8"
PKGDIR="/usr/portage/packages"
PORTAGE_CONFIGROOT="/"
PORTAGE_RSYNC_OPTS="--recursive --links --safe-links --perms --times --omit-dir-times --compress --force --whole-file --delete --stats --human-readable --timeout=180 --exclude=/distfiles --exclude=/local --exclude=/packages --exclude=/.git"
PORTAGE_TMPDIR="/var/tmp"
USE="X a52 acl acpi alsa amd64 bluetooth bluray bzip2 cdda cdr clang colord crypt cuda cups dbus dri dts dvd egl elogind eselect-ldso ffmpeg fuse gmp graphite gtk gtk3 gvfs hardened hwaccel iconv icu ipv6 jemalloc jit jumbo-build lcms libglvnd libtirpc lto luajit multilib ncurses nls nptl nvidia opencl opengl openmp pam pcre pic pie policykit printsupport pulseaudio qt5 raw readline seccomp split-usr ssl ssp tcmalloc threads udev udisks unicode v4l vaapi vdpau vkd3d vulkan wayland xattr xinerama xtpax zlib" ABI_X86="64" ADA_TARGET="gnat_2018" ALSA_CARDS="ali5451 als4000 atiixp atiixp-modem bt87x ca0106 cmipci emu10k1x ens1370 ens1371 es1938 es1968 fm801 hda-intel intel8x0 intel8x0m maestro3 trident usb-audio via82xx via82xx-modem ymfpci" APACHE2_MODULES="authn_core authz_core socache_shmcb unixd actions alias auth_basic authn_alias authn_anon authn_dbm authn_default authn_file authz_dbm authz_default authz_groupfile authz_host authz_owner authz_user autoindex cache cgi cgid dav dav_fs dav_lock deflate dir disk_cache env expires ext_filter file_cache filter headers include info log_config logio mem_cache mime mime_magic negotiation rewrite setenvif speling status unique_id userdir usertrack vhost_alias" CALLIGRA_FEATURES="karbon sheets words" CAMERAS="canon fuji p2p pentax" COLLECTD_PLUGINS="df interface irq load memory rrdtool swap syslog" CPU_FLAGS_X86="aes avx avx2 f16c fma3 mmx mmxext pclmul popcnt sse sse2 sse3 sse4_1 sse4_2 ssse3" ELIBC="glibc" GPSD_PROTOCOLS="ashtech aivdm earthmate evermore fv18 garmin garmintxt gpsclock greis isync itrax mtk3301 nmea ntrip navcom oceanserver oldstyle oncore rtcm104v2 rtcm104v3 sirf skytraq superstar2 timing tsip tripmate tnt ublox ubx" INPUT_DEVICES="evdev libinput wacom" KERNEL="linux" L10N="de en" LCD_DEVICES="bayrad cfontz cfontz633 glk hd44780 lb216 lcdm001 mtxorb ncurses text" LIBREOFFICE_EXTENSIONS="presenter-console presenter-minimizer" LLVM_TARGETS="X86 BPF NVPTX" LUA_SINGLE_TARGET="lua5-1" LUA_TARGETS="lua5-1" 
OFFICE_IMPLEMENTATION="libreoffice" PHP_TARGETS="php7-3 php7-4" POSTGRES_TARGETS="postgres10 postgres11" PYTHON_SINGLE_TARGET="python3_9" PYTHON_TARGETS="python3_9" RUBY_TARGETS="ruby26" SANE_BACKENDS="net pixma" USERLAND="GNU" VIDEO_CARDS="nvidia intel i965" XTABLES_ADDONS="quota2 psd pknock lscan length2 ipv4options ipset ipp2p iface geoip fuzzy condition tee tarpit sysrq proto steal rawnat logmark ipmark dhcpmac delude chaos account"
Unset:  CC, CPPFLAGS, CTARGET, CXX, INSTALL_MASK, LC_ALL, LINGUAS, PORTAGE_BINHOST, PORTAGE_BUNZIP2_COMMAND, PORTAGE_COMPRESS, PORTAGE_COMPRESS_FLAGS, PORTAGE_RSYNC_EXTRA_OPTS, RUSTFLAGS
Comment 2 Ionen Wolkens gentoo-dev 2021-06-16 19:31:52 UTC
Haven't been able to reproduce (tried with picom/vivaldi/chromium and various settings), but that may be because I don't have the right hardware or because of some other specifics. I'd assume it's not all that widespread.

As you mention, it's unlikely there is anything we can do until nvidia sends a fix. Please keep using a branch that works for you meanwhile.

For what it's worth, I was never planning to mark the 465.xx branch stable and am waiting for the next Production branch.
Comment 3 Ionen Wolkens gentoo-dev 2021-06-22 19:24:28 UTC
There are reports that NVIDIA has fixed this issue internally, but it is unclear whether the fix landed in 470.42.01 (it may have been too soon, and it's not in the changelog either).
Comment 4 gertoe 2021-06-28 17:06:27 UTC
I have just tried the latest nvidia-drivers-470.42.01, which apparently also has the issue, at least on my machine. Thus, the earlier references may only coincidentally resemble the issue I observe. My guess is that NVIDIA changed some internals after the 460 series that interfere with my hardened kernel configuration. In the meantime, I will stay with the 460 series driver until I have more time to investigate further.
Comment 5 gertoe 2021-07-20 09:36:04 UTC
With the latest "stable"/"recommended" nvidia driver x11-drivers/nvidia-drivers-470.57.02 released on 2021-07-19, the freezing still occurs on my machine.

Also, enabling the kernel lock detectors still does not reveal any stack traces when a freeze occurs.

I will investigate further.
Comment 6 gertoe 2021-07-20 15:43:51 UTC
I have swapped my graphics card for two different cards to rule out failing hardware: another identical GTX 970 taken from another machine, and a Quadro K600. With both cards, the freezing issue occurred in the same manner, so the first card should be working properly.

Removing every hardening option from my kernel configuration, especially the stack protection, did not have any effect on the issue either.

Finally, both sys-kernel/gentoo-kernel{,-bin} (5.10.47) kernels with the pre-configured default options crash with the new 470-series driver on my machine, too.

In the meantime, I will continue to use the 460-series driver. In the long run, this will not be a very satisfying solution, though.
Comment 7 Ionen Wolkens gentoo-dev 2021-07-20 17:48:12 UTC
Thanks for looking into it. I guess this may be a more isolated case, unsure if there's much I can do to help.

I do intend to keep 460 branch in the tree for a very long time (either way) assuming no new major problems with it.
Comment 8 gertoe 2021-07-21 10:36:17 UTC
Finally, I have found a reference to the issue I am facing with the recent nvidia-drivers describing exactly the same observation:

According to https://forums.developer.nvidia.com/t/gpf-when-closing-chrome-with-slub-debug-p-enabled-on-465-19-01-and-470-42-01/182054, the issue coincides with a SLUB hardening (poisoning) option, which is active when CONFIG_SLUB_DEBUG is enabled and the slub_debug=P boot argument is set.

I was not aware of its presence in my GRUB_CMDLINE; I configured it a long time ago and have never touched it since.

However, another hardening option related to page poisoning, i.e., CONFIG_PAGE_POISONING=y together with page_poison=1, can be set without inducing any observable freezes so far.

Thus, I assume that NVIDIA must have changed the driver behaviour within the 465-series somehow which is now conflicting with the slub hardening.

Consequently, slub_debug poisoning as a hardening option should apparently be avoided when using the latest nvidia-drivers at the moment.
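
To check whether a given kernel command line requests SLUB poisoning, something like the following Python sketch could be used. It is only an illustration: the function name is hypothetical, and the parsing rules (a bare slub_debug enables full debugging, flags come before the first comma, blocks are separated by ";", flags are case-insensitive) are assumptions based on my reading of the kernel's SLUB documentation, not something verified against every kernel version.

```python
def slub_poisoning_requested(cmdline: str) -> bool:
    """Return True if the command line appears to request SLUB poisoning.

    Hypothetical helper. Assumed syntax:
    slub_debug[=<flags>[,<slab>...][;<flags>[,<slab>...]]]
    """
    for token in cmdline.split():
        name, sep, value = token.partition("=")
        if name != "slub_debug":
            continue
        if not sep:
            # Bare "slub_debug": assumed to enable full debugging,
            # which includes poisoning.
            return True
        for block in value.split(";"):
            # Only the part before the first comma holds flag letters;
            # the rest are slab cache names.
            flags = block.split(",", 1)[0]
            if "P" in flags.upper():
                return True
    return False


if __name__ == "__main__":
    # /proc/cmdline holds the command line of the running kernel.
    with open("/proc/cmdline") as handle:
        print(slub_poisoning_requested(handle.read()))
```

A command line containing only page_poison=1 is not matched by this check, which is consistent with the observation above that page poisoning alone did not trigger the freeze.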
Comment 9 gertoe 2021-07-21 10:42:30 UTC
I have just realized that marking the issue as "RESOLVED" removes it from the issue tracker listing. Thus, I have reset the status to "UNCONFIRMED" instead, to allow others to browse the issue as a reference.
Comment 10 Ionen Wolkens gentoo-dev 2021-07-21 11:08:30 UTC
Considering this CONFIG option is set even on generic kernels, I guess checking for it in the ebuild wouldn't mean much.

At most could check if it's set on the (current) kernel command line and emit a warning which could be useful before this version is made stable.

Unfortunately rather little control over this.
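
For illustration only, the check described above (look at the current kernel command line, warn if slub_debug is present) might be sketched in Python as follows. This is not the ebuild's code: the function names, the warning text, and the version threshold are all hypothetical, with the threshold taken from the bug title (>=465.31) rather than from any statement by NVIDIA.

```python
import re
from typing import Optional

# Assumption: first affected branch, taken from the bug title.
FIRST_AFFECTED = (465,)


def parse_version(version: str) -> tuple:
    """Turn a dotted version such as "470.57.02" into (470, 57, 2)."""
    return tuple(int(part) for part in version.split("."))


def slub_debug_warning(cmdline: str, driver_version: str) -> Optional[str]:
    """Return a warning if slub_debug is set and the driver may be affected."""
    # Match slub_debug only as a whole parameter name.
    has_slub_debug = re.search(r"(?:^|\s)slub_debug(?:=|\s|$)", cmdline) is not None
    if has_slub_debug and parse_version(driver_version) >= FIRST_AFFECTED:
        return (
            "slub_debug is set on the kernel command line; this is reported "
            "to cause freezes with nvidia-drivers " + driver_version
        )
    return None
```

On a running system the first argument could be the contents of /proc/cmdline. Because the check only sees the current command line, it says nothing about what the next boot will use, which matches the "rather little control" caveat.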
Comment 11 Ionen Wolkens gentoo-dev 2021-07-21 18:01:44 UTC
Have been able to reproduce now, successfully froze my passthrough VM.
Comment 12 Larry the Git Cow gentoo-dev 2021-07-21 19:44:43 UTC
The bug has been referenced in the following commit(s):

https://gitweb.gentoo.org/repo/gentoo.git/commit/?id=69ea157c9d2851d3aa369484fd13fcb1f69f4477

commit 69ea157c9d2851d3aa369484fd13fcb1f69f4477
Author:     Ionen Wolkens <ionen@gentoo.org>
AuthorDate: 2021-07-21 17:29:55 +0000
Commit:     Ionen Wolkens <ionen@gentoo.org>
CommitDate: 2021-07-21 19:42:51 +0000

    x11-drivers/nvidia-drivers: warn about slub_debug issues
    
    May not affect many users but it is hard to diagnose
    without a hint.
    
    Bug: https://bugs.gentoo.org/796329
    Signed-off-by: Ionen Wolkens <ionen@gentoo.org>

 x11-drivers/nvidia-drivers/nvidia-drivers-470.57.02.ebuild | 7 +++++++
 1 file changed, 7 insertions(+)
Comment 13 Ionen Wolkens gentoo-dev 2021-07-21 19:45:59 UTC
Hopefully warning will be enough to inform users, so this bug shouldn't be needed open.

I'll keep an eye out to remove the warning if nvidia does something about it.
Comment 14 Alex Efros 2021-08-19 19:17:44 UTC
(In reply to Ionen Wolkens from comment #13)
> Hopefully warning will be enough to inform users, so this bug shouldn't be
> needed open.
> 
> I'll keep an eye out to remove the warning if nvidia does something about it.

Thanks, but please also add a comment here; otherwise a removed warning is even harder to notice. We need some "ping" to let us know we can re-add this kernel option.
Comment 15 Ionen Wolkens gentoo-dev 2021-08-19 19:25:22 UTC
*** Bug 808859 has been marked as a duplicate of this bug. ***