Bug 487558 - >=x11-drivers/nvidia-drivers-{331.17,319.49} causes processes to wait due to memory corruption
Status: RESOLVED FIXED
Alias: None
Product: Gentoo Linux
Classification: Unclassified
Component: Current packages
Hardware: All Linux
Importance: Normal normal with 5 votes
Assignee: Jeroen Roovers (RETIRED)
Duplicates: 487700 490256 490496 490718 492984 493040 494212 494618
Reported: 2013-10-10 19:10 UTC by Serge Gavrilov
Modified: 2019-03-19 14:19 UTC
CC List: 48 users

Attachments
.config for kernel 3.12.2 (3.12.2.config, 105.71 KB, text/plain), 2013-12-02 13:20 UTC, Alexandre Rostovtsev (RETIRED)
working .config for kernel 3.12.2 (.config, 86.44 KB, text/plain), 2013-12-02 16:28 UTC, Tom Wijsman (TomWij) (RETIRED)
kernel-3.12.1-config-broken (.config, 67.21 KB, text/plain), 2013-12-03 00:35 UTC, Martin Samek
Non-working .config for 3.5.7 with CONFIG_BSD_PROCESS_ACCT=n (.config, 97.04 KB, text/x-mpsub), 2013-12-15 08:55 UTC, Serge Gavrilov
non-working .config with CONFIG_BSD_PROCESS_ACCT disabled (.config, 91.59 KB, text/plain), 2014-01-04 17:37 UTC, Victor Orozco
Working config for 3.12.6 (.config, 77.05 KB, text/plain), 2014-01-04 17:48 UTC, Markus Strobl

Description Serge Gavrilov 2013-10-10 19:10:17 UTC
Hi!

After upgrading my system to gnome-3.8 (and switching from openrc to systemd) I have the following problem.

If I run evolution from gnome-shell, then the first bogofilter child process becomes a zombie. Evolution waits for it indefinitely and stops checking mail for spam. In this case evolution can only be stopped with kill -9.

If I run evolution from a terminal, the bug disappears. The problem can also be worked around by creating /usr/local/bin/evolution, which calls /usr/bin/evolution and redirects its STDERR to /dev/null.
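
The wrapper described here can be sketched as follows. This is only a minimal illustration of the workaround, not a fix for the underlying problem; the helper name quiet_stderr is hypothetical, and the only paths assumed are the two named in the report.

```shell
# Run a command with its standard error discarded.
# (quiet_stderr is a hypothetical helper name used for illustration.)
quiet_stderr() {
    "$@" 2>/dev/null
}

# A wrapper script saved as /usr/local/bin/evolution would contain the
# equivalent one-liner:
#   exec /usr/bin/evolution "$@" 2>/dev/null
```

Because /usr/local/bin usually precedes /usr/bin in PATH, such a wrapper would normally be picked up in place of the real binary; whether the desktop launcher honours PATH is an assumption to check on your own system.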

I have gdm compiled with +systemd USE flag, so that generally STDERR is sent to journald if evolution is called from gnome-shell. So perhaps this problem is due to some magic interaction of bogofilter with journald.

Many thanks for your work!
Comment 1 Pacho Ramos gentoo-dev 2013-10-10 19:20:39 UTC
I cannot reproduce this, and I always use evolution+bogofilter+systemd.

I have:
$ emerge -Opv bogofilter evolution systemd

These are the packages that would be merged, in order:

[ebuild   R    ] mail-filter/bogofilter-1.2.3  USE="berkdb -sqlite -tokyocabinet" 0 kB
[ebuild   R   ~] mail-client/evolution-3.8.5:2.0  USE="bogofilter crypt gnome-online-accounts gstreamer ldap ssl weather -highlight -kerberos -map -spamassassin" 0 kB
[ebuild   R   ~] sys-apps/systemd-208-r2:0/1  USE="acl filecaps firmware-loader gudev http introspection kmod pam policykit tcpd {test} xattr -audit -cryptsetup -doc -gcrypt -lzma -openrc -python -qrcode (-selinux) -vanilla" ABI_X86="(64) -32 (-x32)" PYTHON_SINGLE_TARGET="python2_7" PYTHON_TARGETS="python2_7" 0 kB
Comment 2 Serge Gavrilov 2013-10-10 19:26:07 UTC
Very similar (I have tried different versions of bogofilter and all of them demonstrate this behavior)

[ebuild     U  ] mail-filter/bogofilter-1.2.4::x-portage [1.2.2::gentoo] USE="berkdb sqlite -tokyocabinet" 0 kB
[ebuild   R    ] mail-client/evolution-3.8.5:2.0  USE="bogofilter crypt gnome-online-accounts gstreamer ssl weather -highlight -kerberos -ldap -map -spamassassin" 0 kB
[ebuild     U  ] sys-apps/systemd-208-r2:0/1 [208-r1:0/1] USE="acl audit doc filecaps firmware-loader gudev introspection kmod lzma openrc pam policykit python tcpd xattr -cryptsetup -gcrypt -http -qrcode (-selinux) {-test} -vanilla" ABI_X86="(64) -32 (-x32)" PYTHON_SINGLE_TARGET="python2_7" PYTHON_TARGETS="python2_7" 8 kB
Comment 3 Pacho Ramos gentoo-dev 2013-10-10 19:31:18 UTC
Your bogofilter appears to come from an external overlay (and is also a different version).
Comment 4 Serge Gavrilov 2013-10-10 19:34:34 UTC
(In reply to Pacho Ramos from comment #3)
> Your bogofilter appears to come from an external overlay (and is also a
> different version).

I have tried 1.2.2 and 1.2.3 from portage and 1.2.4 (which is still not in portage), and all of them demonstrate this behavior.
Comment 5 Serge Gavrilov 2013-10-12 10:26:30 UTC
This is not only an evolution problem (I have discovered many zombie processes on my system); it seems to be caused by the upgrade of nvidia-drivers to 319.60.

Downgrading to 319.49 seems to fix the bug (though I must investigate this a little more).
Comment 6 Pacho Ramos gentoo-dev 2013-10-19 07:42:55 UTC
Any news on this?
Comment 7 Serge Gavrilov 2013-10-23 03:06:13 UTC
Fixed by downgrading nvidia-drivers to 319.49.

It seems mysterious.
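
For anyone wanting to hold the driver at the last version reported as working, a Portage mask is one way to do it. The sketch below is an illustration under assumptions: the function name mask_newer_than is hypothetical, the mask file path is passed in by the caller, and 319.49 is simply the version reported as good in this comment.

```shell
# Append a Portage mask entry covering every version of a package newer
# than the given one. (mask_newer_than is a hypothetical helper name.)
mask_newer_than() {
    atom=$1        # category/package, e.g. x11-drivers/nvidia-drivers
    version=$2     # last known-good version, e.g. 319.49
    mask_file=$3   # file to append the mask entry to
    entry=">${atom}-${version}"
    # Do not add the same entry twice.
    if [ -f "$mask_file" ] && grep -qxF -- "$entry" "$mask_file"; then
        return 0
    fi
    printf '%s\n' "$entry" >> "$mask_file"
}

# Example (as root; the file name "nvidia" is an assumption, and
# /etc/portage/package.mask may be a single file on some systems):
#   mask_newer_than x11-drivers/nvidia-drivers 319.49 /etc/portage/package.mask/nvidia
```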
Comment 8 Pacho Ramos gentoo-dev 2013-10-26 15:37:18 UTC
This reminds me of an old bug in nvidia-drivers that caused similar problems (lots of zombie processes), but it was fixed by nvidia in a newer version.
Comment 9 Jeroen Roovers (RETIRED) gentoo-dev 2013-10-26 16:23:02 UTC
*** Bug 487700 has been marked as a duplicate of this bug. ***
Comment 10 Ron 2013-10-28 19:27:07 UTC
(In reply to Pacho Ramos from comment #8)
> This reminds me of an old bug in nvidia-drivers that caused similar problems
> (lots of zombie processes), but it was fixed by nvidia in a newer version.

This issue continues with x11-drivers/nvidia-drivers-331.17, in case it hasn't been mentioned. I rolled back to 325.15 and everything builds fine again. Is this something that needs to be mentioned upstream?
Comment 11 Jeroen Roovers (RETIRED) gentoo-dev 2013-10-28 19:30:59 UTC
(In reply to Ron from comment #10)
> Is this something that needs to be mentioned upstream?

Yes, of course. It's closed source software so we cannot investigate very deeply ourselves. Run `nvidia-bug-report.sh' and send the output to Nvidia as instructed.
Comment 12 Tanktalus 2013-11-02 00:50:14 UTC
FYI, what seems to be happening, if it's the same as on my system, is that the signal mask being propagated by the driver is simply out of whack. Now SIGCHLD signals are being masked, so zombies never get reaped by processes that expect to reap children manually (as opposed to ignoring SIGCHLD).

$ ps -eda -o pid,ppid,blocked | grep -v 00000
  PID  PPID          BLOCKED
 1376     1 fffffffe7ffbfeff
 2721  2715 fffffffe3ffba207
 2722  2715 fffffffe3ffba207
 3097     1 fffffffe7ffbfeff
 3099  3097 fffffffe7ffbfeff
 3991  3827 00007ffe7330cc90
 3994  3991 00007ffe7330c810
 3996     1 00007f4c53e28688
 4007  3994 00007ffe7330c810
 4014     1 00007f4c53e28688
 4077     1 00007f3a0b825688
 4154  3827 00007f2666110418
 4222  3991 00007ffe7330c810
 4223  3991 00007ffe7330c810
 4463     1 00007f4c53e28620

Those first ones, with leading f's, are okay - those are daemons that are purposefully setting their signal mask to make them harder to kill (e.g., DB2).  The ones starting with 00007 are going to be ones with problems.  Normal processes will have all zeros, or close thereto.
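
To see which signals a given mask actually blocks, the hexadecimal value can be decoded bit by bit: on Linux, bit n-1 of the mask stands for signal number n. The helper below is a sketch (the name blocked_signals is hypothetical) that works on any mask copied from the BLOCKED column of ps or from the SigBlk field of /proc/PID/status. It walks the hex digits one at a time so that it does not depend on 64-bit shell arithmetic.

```shell
# Print the signal numbers set in a hexadecimal signal mask.
# Bit n-1 of the mask corresponds to signal n.
# (blocked_signals is a hypothetical helper name.)
blocked_signals() {
    mask=$1
    out=""
    pos=0                               # bit position of the current digit's lowest bit
    while [ -n "$mask" ]; do
        digit=${mask#"${mask%?}"}       # last hex digit of what is left
        mask=${mask%?}                  # drop that digit
        val=$((0x$digit))
        bit=0
        while [ "$bit" -lt 4 ]; do
            if [ $(( ($val >> $bit) & 1 )) -eq 1 ]; then
                out="$out $(($pos + $bit + 1))"
            fi
            bit=$(($bit + 1))
        done
        pos=$(($pos + 4))
    done
    echo "${out# }"
}

# Example:
#   blocked_signals 0000000000010000
```

On x86 Linux signal 17 is SIGCHLD, so a mask of 0000000000010000 blocks SIGCHLD and nothing else. Decoding the masks listed above shows a different set of signals for each address-like value, which fits the idea that the mask contents are effectively random.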

I'll be downgrading my nvidia shortly to see if that resolves the issue here.
Comment 13 Tanktalus 2013-11-02 04:28:59 UTC
After downgrade, the bad signal masks shown earlier have gone away. This is apparently an issue with the nvidia driver.  (I'm back to 325.15 - 331.17 is definitely bad.)
Comment 14 Tanktalus 2013-11-02 04:31:04 UTC
Oh, I should also add: I'm running KDE 4.11.2.  And akonadi has problems due to this as well.  But what has problems should be of passing interest only.  The real cause is the signal mask that nvidia gives to the parent process and gets passed on to everything else.  Arguably, everything else could reset their own signal masks. But they shouldn't have to.
Comment 15 Tom Wijsman (TomWij) (RETIRED) gentoo-dev 2013-11-02 15:10:37 UTC
Could you please post the output of `emerge --info`?
Comment 16 Tanktalus 2013-11-02 15:12:05 UTC
(In reply to Tom Wijsman (TomWij) from comment #15)
> Could you please post the output of `emerge --info`?

Who are you asking this of? Everyone?
Comment 17 Jeroen Roovers (RETIRED) gentoo-dev 2013-11-02 18:12:51 UTC
(In reply to Tanktalus from comment #16)
> (In reply to Tom Wijsman (TomWij) from comment #15)
> > Could you please post the output of `emerge --info`?
> 
> Who are you asking this of? Everyone?

Yes, why not.

Also, send the output of nvidia-bug-report.sh to Nvidia, so they can fix their proprietary software and we can then write an ebuild for the fixed proprietary software version. There isn't really anything else we can do except urge you to send reports upstream.
Comment 18 Tanktalus 2013-11-02 18:21:25 UTC
(In reply to Jeroen Roovers from comment #17)
> (In reply to Tanktalus from comment #16)
> > (In reply to Tom Wijsman (TomWij) from comment #15)
> > > Could you please post the output of `emerge --info`?
> > 
> > Who are you asking this of? Everyone?
> 
> Yes, why not.
> 
> Also, send the output of nvidia-bug-report.sh to Nvidia, so they can fix
> their proprietary software and we can then write an ebuild for the fixed
> proprietary software version. There isn't really anything else we can do
> except urge you to send reports upstream.

nvidia: my plan is, once I have sufficient time available, to re-upgrade nvidia, reproduce the problem, submit the nvidia bug upstream with their tool, and then downgrade again.  I just haven't had time yet :)

Info:

Portage 2.2.7 (default/linux/amd64/13.0, gcc-4.7.3, glibc-2.15-r3, 3.11.6-gentoo x86_64)
=================================================================
System uname: Linux-3.11.6-gentoo-x86_64-Intel-R-_Core-TM-_i7_CPU_930_@_2.80GHz-with-gentoo-2.2
KiB Mem:    12296636 total,    788496 free
KiB Swap:   25165820 total,  25090636 free
Timestamp of tree: Sat, 02 Nov 2013 07:45:01 +0000
ld GNU ld (GNU Binutils) 2.23.1
distcc 3.1 x86_64-pc-linux-gnu [enabled]
app-shells/bash:          4.2_p45
dev-java/java-config:     2.1.12-r1
dev-lang/python:          2.7.5-r3, 3.2.5-r3
dev-util/cmake:           2.8.11.2
dev-util/pkgconfig:       0.28
sys-apps/baselayout:      2.2
sys-apps/openrc:          0.11.8
sys-apps/sandbox:         2.6-r1
sys-devel/autoconf:       2.13, 2.69
sys-devel/automake:       1.4_p6-r1, 1.11.6, 1.12.6, 1.13.4
sys-devel/binutils:       2.23.1
sys-devel/gcc:            4.6.3, 4.7.3-r1
sys-devel/gcc-config:     1.7.3
sys-devel/libtool:        2.4.2
sys-devel/make:           3.82-r4
sys-kernel/linux-headers: 3.11 (virtual/os-headers)
sys-libs/glibc:           2.15-r3
Repositories: gentoo private x11 kde
ACCEPT_KEYWORDS="amd64"
ACCEPT_LICENSE="*"
CBUILD="x86_64-pc-linux-gnu"
CFLAGS="-O3 -pipe -march=core2 -ggdb"
CHOST="x86_64-pc-linux-gnu"
CONFIG_PROTECT="/etc /usr/share/config /usr/share/gnupg/qualified.txt /usr/share/polkit-1/actions /var/bind"
CONFIG_PROTECT_MASK="/etc/ca-certificates.conf /etc/dconf /etc/env.d /etc/fonts/fonts.conf /etc/gconf /etc/gentoo-release /etc/init.d /etc/php/apache2-php5.4/ext-active/ /etc/php/apache2-php5.5/ext-active/ /etc/php/cgi-php5.4/ext-active/ /etc/php/cgi-php5.5/ext-active/ /etc/php/cli-php5.4/ext-active/ /etc/php/cli-php5.5/ext-active/ /etc/revdep-rebuild /etc/sandbox.d /etc/terminfo"
CXXFLAGS="-O3 -pipe -march=core2 -ggdb"
DISTDIR="/usr/portage/distfiles"
FCFLAGS="-O2 -pipe"
FEATURES="assume-digests binpkg-logs collision-protect config-protect-if-modified distcc distlocks ebuild-locks fixlafiles merge-sync multilib-strict news parallel-fetch preserve-libs protect-owned sandbox sfperms splitdebug strict unknown-features-warn unmerge-logs unmerge-orphans userfetch userpriv usersandbox usersync"
FFLAGS="-O2 -pipe"
GENTOO_MIRRORS="http://gentoo.arcticnetwork.ca/ ftp://gentoo.mirrors.tds.net/gentoo http://mirror.datapipe.net/gentoo ftp://mirror.datapipe.net/gentoo ftp://gentoo.arcticnetwork.ca/pub/gentoo/ http://gentoo.llarian.net/ ftp://gentoo.llarian.net/pub/gentoo"
LANG="en_US.utf8"
LC_ALL="en_US.utf8"
LDFLAGS="-Wl,-O1 -Wl,--as-needed"
MAKEOPTS="-j13 -l25"
PKGDIR="/usr/portage/packages"
PORTAGE_CONFIGROOT="/"
PORTAGE_RSYNC_OPTS="--recursive --links --safe-links --perms --times --omit-dir-times --compress --force --whole-file --delete --stats --human-readable --timeout=180 --exclude=/distfiles --exclude=/local --exclude=/packages"
PORTAGE_TMPDIR="/var/tmp"
PORTDIR="/usr/portage"
PORTDIR_OVERLAY="/home/portdir-mine /usr/portage/local/layman/x11 /usr/portage/local/layman/kde"
SYNC="rsync://rsync.ca.gentoo.org/gentoo-portage"
USE="X a52 aac acl acpi alsa amd64 apache2 audiofile avahi avi bash-completion berkdb branding bzip2 cairo cdda cddb cdparanoia cdr cli consolekit cracklib crypt css cups cxx dbus dri dvd dvdr dvdread enca encode exif expat ffmpeg fftw firefox fontconfig fortran gd gdbm gif gimp gmp gnutls gs handbook htmlhandbook iconv imagemagick ipv6 java jbig jpeg jpeg2k kde kipi lcms libnotify lzma lzo mad mjpeg mmx mng modules mp3 mpeg mudflap multilib ncurses nls nptl nsplugin ogg opengl openmp pam pcre perl png policykit python qt4 rdesktop readline scanner sdl semantic-desktop session smp sse sse2 ssl subversion svg tcpd threads tiff truetype udev unicode vaapi vcd vde vorbis win32codecs wmf x264 xcb xcomposite xinerama xml xulrunner xvid zlib" ABI_X86="32 64" ALSA_CARDS="ali5451 als4000 atiixp atiixp-modem bt87x ca0106 cmipci emu10k1x ens1370 ens1371 es1938 es1968 fm801 hda-intel intel8x0 intel8x0m maestro3 trident usb-audio via82xx via82xx-modem ymfpci" APACHE2_MODULES="proxy proxy_http actions alias auth_basic authn_alias authn_anon authn_dbm authn_default authn_file authz_dbm authz_default authz_groupfile authz_host authz_owner authz_user autoindex cache cgi cgid dav dav_fs dav_lock deflate dir disk_cache env expires ext_filter file_cache filter headers include info log_config logio mem_cache mime mime_magic negotiation rewrite setenvif speling status unique_id userdir usertrack vhost_alias" CALLIGRA_FEATURES="kexi words flow plan sheets stage tables krita karbon braindump author" CAMERAS="ptp2" COLLECTD_PLUGINS="df interface irq load memory rrdtool swap syslog" ELIBC="glibc" GPSD_PROTOCOLS="ashtech aivdm earthmate evermore fv18 garmin garmintxt gpsclock itrax mtk3301 nmea ntrip navcom oceanserver oldstyle oncore rtcm104v2 rtcm104v3 sirf superstar2 timing tsip tripmate tnt ubx" INPUT_DEVICES="keyboard mouse evdev" KERNEL="linux" LCD_DEVICES="bayrad cfontz cfontz633 glk hd44780 lb216 lcdm001 mtxorb ncurses text" LIBREOFFICE_EXTENSIONS="presenter-console 
presenter-minimizer" LINGUAS="en" OFFICE_IMPLEMENTATION="libreoffice" PHP_TARGETS="php5-5" PYTHON_SINGLE_TARGET="python2_7" PYTHON_TARGETS="python2_7 python3_2" QEMU_SOFTMMU_TARGETS="i386 x86_64" QEMU_USER_TARGETS="i386 x86_64" RUBY_TARGETS="ruby19 ruby18" USERLAND="GNU" VIDEO_CARDS="nvidia" XTABLES_ADDONS="quota2 psd pknock lscan length2 ipv4options ipset ipp2p iface geoip fuzzy condition tee tarpit sysrq steal rawnat logmark ipmark dhcpmac delude chaos account"
Unset:  CPPFLAGS, CTARGET, EMERGE_DEFAULT_OPTS, INSTALL_MASK, PORTAGE_BUNZIP2_COMMAND, PORTAGE_COMPRESS, PORTAGE_COMPRESS_FLAGS, PORTAGE_RSYNC_EXTRA_OPTS, USE_PYTHON
Comment 19 Jeroen Roovers (RETIRED) gentoo-dev 2013-11-02 18:39:30 UTC
(In reply to Tanktalus from comment #18)
> Portage 2.2.7 (default/linux/amd64/13.0, gcc-4.7.3, glibc-2.15-r3,
> 3.11.6-gentoo x86_64)

Great. An unsupported kernel. "Patched" that yourself? Break it and you get to keep the pieces.

Does anyone here experience this very same problem with an actually supported kernel?
Comment 20 Eric F. GARIOUD 2013-11-07 07:23:19 UTC
(In reply to Jeroen Roovers from comment #19)
> Does anyone here experience this very same problem with an actually
> supported kernel?

I cannot tell if it's the very same problem; however, I can observe things similar to what is reported in comment 12, with the following apparent symptoms:
- dolphin, kmail2 (from kde-4.10.5) take ages to start
- libreoffice-4.1.2.3 takes ages to show the file open / save dialog box

Running nvidia-drivers-319.60 on whatever (ck/g/rt)-sources 2.6.38 - recent 3.4 and 3.8

No problems with nvidia-drivers <= 319.49
Comment 21 Jeroen Roovers (RETIRED) gentoo-dev 2013-11-08 15:23:52 UTC
*** Bug 490718 has been marked as a duplicate of this bug. ***
Comment 22 Christian Loosli 2013-11-08 15:42:05 UTC
(In reply to Jeroen Roovers from comment #19)

> Does anyone here experience this very same problem with an actually
> supported kernel?

Hi, yes. 

Since the newest drivers (331.20), the 3.12 and 3.11 kernels are officially supported. 

Hence the info in 
https://bugs.gentoo.org/show_bug.cgi?id=490718
shows the problem still persists by using a supported combination of kernel and drivers. 

I will see that I can update again, then create and send a bug report to nvidia on Sunday. I won't have a chance to access my nvidia system to do it before then.

Kind regards, 

Christian
Comment 23 Christian Loosli 2013-11-09 11:20:04 UTC
Upstream bug report linking to this one: 
https://devtalk.nvidia.com/default/topic/633706/linux/recent-drivers-cause-applications-to-hang-not-start-at-all-or-compilation-failures/

As mentioned in 487548, the issue seems to happen less often with the most recent driver, but unfortunately it still does happen.
Comment 24 Ville Aakko 2013-11-09 21:20:27 UTC
Hi!

Reporting here as requested in this thread:

http://forums.gentoo.org/viewtopic-t-975106.html?sid=8b4f5670553424affe500aad0b28b764

I had problems with CTRL+C in konsole (but, curiously, not in xterm, uxterm, VT, or in screen sessions started in those - but, vice versa, screen sessions started in Konsole experience the problem when re-attached in a VT). See the discussion for details.
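
If the CTRL+C failure comes from the same stray signal mask discussed earlier in this bug (an assumption; nothing in this comment confirms it), then SIGINT, signal 2, should show up as blocked in the affected shell. The sketch below reads the SigBlk field from a /proc/PID/status style file and tests one signal; the function names sigblk_from and mask_blocks are hypothetical.

```shell
# Print the SigBlk mask recorded in a /proc/PID/status style file.
# (sigblk_from and mask_blocks are hypothetical helper names.)
sigblk_from() {
    sed -n 's/^SigBlk:[[:space:]]*//p' "$1"
}

# Succeed if signal number $2 is set in hexadecimal mask $1.
# Bit n-1 of the mask corresponds to signal n.
mask_blocks() {
    mask=$1
    bit=$(( $2 - 1 ))
    skip=$(( $bit / 4 ))
    # Drop the hex digits below the one that holds the wanted bit.
    while [ "$skip" -gt 0 ] && [ -n "$mask" ]; do
        mask=${mask%?}
        skip=$(( $skip - 1 ))
    done
    [ -n "$mask" ] || return 1
    digit=${mask#"${mask%?}"}
    [ $(( (0x$digit >> ($bit % 4)) & 1 )) -eq 1 ]
}

# Example: is SIGINT (signal 2) blocked in the current shell?
#   mask_blocks "$(sigblk_from /proc/$$/status)" 2 && echo "SIGINT is blocked"
```

Running the example in both Konsole and xterm would show whether the two terminals hand different masks to their child shells.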

Oh, this is on the dreaded 3.12 kernel, but I tried nvidia-drivers-331.20, which should be officially supported, and it still has this problem. The versions between 319.49 and 331.20 (that I have tried) all have this problem, but were patched to run with the kernel I was running at the time.


My info (running on 319.49 - will re-upgrade, run the nvidia patch script and downgrade again later when I have time ):
# emerge --info
Portage 2.2.7 (default/linux/amd64/13.0/desktop/kde, gcc-4.6.3, glibc-2.15-r3, 3.12.0-gentooVillenMyytti-2 x86_64)
=================================================================
System uname: Linux-3.12.0-gentooVillenMyytti-2-x86_64-AMD_Athlon-tm-_Dual_Core_Processor_4850e-with-gentoo-2.2
KiB Mem:     6101848 total,   1219984 free
KiB Swap:   19433468 total,  19432248 free
Timestamp of tree: Sat, 09 Nov 2013 15:30:01 +0000
ld GNU ld-versio (GNU Binutils) 2.23.1
app-shells/bash:          4.2_p45
dev-java/java-config:     2.1.12-r1
dev-lang/python:          2.7.5-r3, 3.2.5-r3
dev-util/cmake:           2.8.11.2
dev-util/pkgconfig:       0.28
sys-apps/baselayout:      2.2
sys-apps/openrc:          0.11.8
sys-apps/sandbox:         2.6-r1
sys-devel/autoconf:       2.13, 2.69
sys-devel/automake:       1.11.6, 1.12.6, 1.13.4
sys-devel/binutils:       2.23.1
sys-devel/gcc:            4.6.3, 4.7.3-r1
sys-devel/gcc-config:     1.7.3
sys-devel/libtool:        2.4.2
sys-devel/make:           3.82-r4
sys-kernel/linux-headers: 3.9 (virtual/os-headers)
sys-libs/glibc:           2.15-r3
Repositories: gentoo gamerlay mythtv x-portage
ACCEPT_KEYWORDS="amd64"
ACCEPT_LICENSE="*"
CBUILD="x86_64-pc-linux-gnu"
CFLAGS="-march=athlon64-sse3 -O2 -pipe -fomit-frame-pointer"
CHOST="x86_64-pc-linux-gnu"
CONFIG_PROTECT="/etc /usr/share/config /usr/share/gnupg/qualified.txt"
CONFIG_PROTECT_MASK="/etc/ca-certificates.conf /etc/env.d /etc/fonts/fonts.conf /etc/gconf /etc/gentoo-release /etc/php/apache2-php5.3/ext-active/ /etc/php/apache2-php5.5/ext-active/ /etc/php/cgi-php5.3/ext-active/ /etc/php/cgi-php5.5/ext-active/ /etc/php/cli-php5.3/ext-active/ /etc/php/cli-php5.5/ext-active/ /etc/revdep-rebuild /etc/sandbox.d /etc/terminfo /etc/texmf/language.dat.d /etc/texmf/language.def.d /etc/texmf/updmap.d /etc/texmf/web2c"
CXXFLAGS="-march=athlon64-sse3 -O2 -pipe -fomit-frame-pointer"
DISTDIR="/usr/portage/distfiles"
EMERGE_DEFAULT_OPTS="--keep-going -j 3 --load-average 1.90"
FCFLAGS="-O2 -pipe"
FEATURES="assume-digests binpkg-logs config-protect-if-modified distlocks ebuild-locks fixlafiles merge-sync metadata-transfer news parallel-fetch preserve-libs protect-owned sandbox sfperms strict unknown-features-warn unmerge-logs unmerge-orphans userfetch userpriv usersandbox usersync"
FFLAGS="-O2 -pipe"
GENTOO_MIRRORS="http://trumpetti.atm.tut.fi/gentoo/ ftp://trumpetti.atm.tut.fi/gentoo/"
LANG="fi_FI.UTF-8"
LDFLAGS="-Wl,-O1 -Wl,--as-needed"
PKGDIR="/usr/portage/packages"
PORTAGE_CONFIGROOT="/"
PORTAGE_RSYNC_OPTS="--recursive --links --safe-links --perms --times --omit-dir-times --compress --force --whole-file --delete --stats --human-readable --timeout=180 --exclude=/distfiles --exclude=/local --exclude=/packages"
PORTAGE_TMPDIR="/var/tmp"
PORTDIR="/usr/portage"
PORTDIR_OVERLAY="/var/lib/layman/gamerlay /usr/local/mythtv_portage/Gentoo /usr/local/portage"
USE="32bit 3dnow 3dnowext S3TC X a52 aac aacplus aacs aalib ace acl acpi alsa amd64 apache2 bash-completion berkdb bittorrent bluetooth bluray branding bzip2 cairo cdda cddb cdio cdr cdrom chardet cli consolekit cracklib crypt css cups cxx dbus declarative dri dts dvb dvd dvdnav dvdr embedded emboss enca encode examples exif fam ffmpeg fftw fi firefox flac floppy fluidsynth fontconfig fortran ftp g3dvl gdbm gif git google-gadgets goom gpm gtk hddtemp iconv icu id3 id3tag ieee1394 imlib ipod ipv6 jack java javascript joystick jpeg jpeg2k kde kdecards kipi latin1 lcd lcms ldap libnotify lirc lm_sensors logrotate mad maildir matroska mbox md5sum midi mikmod mixer mjpeg mmx mmxext mng mod modplug modules mp3 mp4 mpeg mplayer mtp mudflap multilib multiuser musepack music mysql mythtv ncurses nls nodrm nptl nsplugin ntfs ntfsprogs nvcontrol nvidia offensive ogg ogg123 openal opencl opengl openmp pam pango pcre pdf phonon php plasma png policykit ppds projectm projectx pvr qt3support qt4 quicktime rar readline real rpc rtc s3tc scanner sdl semantic-desktop sensord session sftp sid sndfile spell sse sse2 sse3 ssl startup-notification stk stream subtitles svg systray taglib tcpd test-programs tga theora threads tiff timidity tk transcode truetype udev udisks unicode unzip upower usb vaapi vcd vdpau vhosts vim vim-syntax vim-with-x vorbis vst wallpapers wavpack win32 wma wxwidgets x264 xcb xcomposite xinerama xml xrandr xscreensaver xterm xv xvid zip zlib" ABI_X86="64" ALSA_CARDS="hda-intel virmidi" APACHE2_MODULES="authn_core authz_core socache_shmcb unixd actions alias auth_basic authn_alias authn_anon authn_dbm authn_default authn_file authz_dbm authz_default authz_groupfile authz_host authz_owner authz_user autoindex cache cgi cgid dav dav_fs dav_lock deflate dir disk_cache env expires ext_filter file_cache filter headers include info log_config logio mem_cache mime mime_magic negotiation rewrite setenvif speling status unique_id userdir usertrack vhost_alias" 
CALLIGRA_FEATURES="kexi words flow plan sheets stage tables krita karbon braindump author" CAMERAS="ptp2" COLLECTD_PLUGINS="df interface irq load memory rrdtool swap syslog" ELIBC="glibc" GPSD_PROTOCOLS="ashtech aivdm earthmate evermore fv18 garmin garmintxt gpsclock itrax mtk3301 nmea ntrip navcom oceanserver oldstyle oncore rtcm104v2 rtcm104v3 sirf superstar2 timing tsip tripmate tnt ubx" GRUB_PLATFORMS="pc" INPUT_DEVICES="keyboard mouse joystick synaptics evdev" KERNEL="linux" LCD_DEVICES="imon" LIBREOFFICE_EXTENSIONS="presenter-console presenter-minimizer" LINGUAS="fi en_GB en" LIRC_DEVICES="userspace" OFFICE_IMPLEMENTATION="libreoffice" PHP_TARGETS="php5-4 php5-3" PYTHON_SINGLE_TARGET="python2_7" PYTHON_TARGETS="python2_7 python3_2" RUBY_TARGETS="ruby19 ruby18" USERLAND="GNU" VIDEO_CARDS="nvidia" XTABLES_ADDONS="quota2 psd pknock lscan length2 ipv4options ipset ipp2p iface geoip fuzzy condition tee tarpit sysrq steal rawnat logmark ipmark dhcpmac delude chaos account"
Unset:  CPPFLAGS, CTARGET, INSTALL_MASK, LC_ALL, MAKEOPTS, PORTAGE_BUNZIP2_COMMAND, PORTAGE_COMPRESS, PORTAGE_COMPRESS_FLAGS, PORTAGE_RSYNC_EXTRA_OPTS, SYNC, USE_PYTHON
Comment 25 Mike Gilbert gentoo-dev 2013-11-14 22:02:25 UTC
*** Bug 490496 has been marked as a duplicate of this bug. ***
Comment 26 Alexandre Rostovtsev (RETIRED) gentoo-dev 2013-11-30 22:50:16 UTC
FWIW, here is a workaround to build freezing packages like tar, grub, etc. with nvidia-drivers-331.20 and kernel 3.12.2: ssh to localhost.

It seems that nvidia's buggy wrappers don't get loaded as long as you are ssh-ed in without X forwarding, or otherwise are not in an X session.
Comment 27 Lars Wendler (Polynomial-C) (RETIRED) gentoo-dev 2013-12-01 17:43:29 UTC
*** Bug 493040 has been marked as a duplicate of this bug. ***
Comment 28 Lars Wendler (Polynomial-C) (RETIRED) gentoo-dev 2013-12-01 18:52:49 UTC
*** Bug 492984 has been marked as a duplicate of this bug. ***
Comment 29 Andreas K. Hüttel archtester gentoo-dev 2013-12-01 21:41:32 UTC
I've masked the affected nvidia-drivers versions in the KDE profiles, since this hits Akonadi and KWin. 

Global mask requires the package maintainer.
Comment 30 Jeroen Roovers (RETIRED) gentoo-dev 2013-12-02 13:00:54 UTC
(In reply to Andreas K. Hüttel from comment #29)
> I've masked the affected nvidia-drivers versions in the KDE profiles, since
> this hits Akonadi and KWin. 
> 
> Global mask requires the package maintainer.

It also affects GNOME users[1] so I am pretty sure it's an issue between the kernel and nvidia.ko.

Now, if only someone who can reproduce the bug would do some debugging or even perhaps show a kernel .config with which we might reproduce it. I can't believe it has been this long and nobody yet showed up with a proper analysis. It's probably down to a mere CONFIG_* option in the kernel which we could test for and warn against.


[1] https://devtalk.nvidia.com/default/topic/638521/linux/gnome-terminal-problems-ctrl-c-and-exit/
Comment 31 Alexandre Rostovtsev (RETIRED) gentoo-dev 2013-12-02 13:20:25 UTC
Created attachment 364448
.config for kernel 3.12.2

(In reply to Jeroen Roovers from comment #30)

My .config file. I am experiencing this problem in gnome with vanilla 3.12.2 and nvidia-drivers-331.20
Comment 32 scrimekiler 2013-12-02 14:04:52 UTC
> It also affects GNOME users[1] so I am pretty sure it's an issue between the
> kernel and nvidia.ko.

I have this problem (emerge freezing at configure step) with openbox too.

Check my report :
https://bugs.gentoo.org/show_bug.cgi?id=490496
Comment 33 Ulenrich 2013-12-02 16:07:19 UTC
Can someone tell me if I am affected!?

# grep SigBlk /proc/*/status| grep -v 0000000000000000
/proc/1/status:SigBlk:  7be3c0fe28014a03
/proc/1229/status:SigBlk:       7be3c0fe28014a03
/proc/1230/status:SigBlk:       0000000000010000
/proc/1291/status:SigBlk:       00007f4425234e90
/proc/1330/status:SigBlk:       00007ffe15e91cb0
/proc/1333/status:SigBlk:       00007ffe15e91a20
/proc/1334/status:SigBlk:       00007f201480a8c8
/proc/1350/status:SigBlk:       000000000000000a
/proc/1378/status:SigBlk:       0000000000010000
/proc/1482/status:SigBlk:       7be3c0fe28014a03
/proc/1617/status:SigBlk:       fffffffe7ffb9eff
/proc/1620/status:SigBlk:       0000000000010000
/proc/1630/status:SigBlk:       0000000000010000
/proc/2137/status:SigBlk:       0000000000010000
/proc/2283/status:SigBlk:       fffffffe7ffb9eff
/proc/2286/status:SigBlk:       0000000000010000
/proc/2306/status:SigBlk:       0000000000010000
/proc/394/status:SigBlk:        0000000000004002
/proc/67/status:SigBlk: 0000000000004a02
/proc/92/status:SigBlk: fffffffe7ffbfeff
Comment 34 Ulenrich 2013-12-02 16:20:15 UTC
I can kill the process with this SigBlk using signal 9 or 15, either as a regular user or as root:
/proc/1333/status:SigBlk:       00007ffe15e91a20
It is kde-dolphin
Comment 35 Tom Wijsman (TomWij) (RETIRED) gentoo-dev 2013-12-02 16:28:58 UTC
Created attachment 364456
working .config for kernel 3.12.2

(In reply to Alexandre Rostovtsev from comment #26)
> build freezing packages like tar, grub, etc.

(In reply to scrimekiler from comment #32)
> I have this problem (emerge freezing at configure step) with openbox too.

Cannot reproduce this on unpatched 3.12.2-gentoo with unpatched =x11-drivers/nvidia-drivers-331.20. Tried thrice for each. Also have no other processes hanging. Tried reproducing in other ways from what I read on the forums (sleep in a while loop, ^C^C^C^Z, ps). Everything works here.

Attached my working .config.

Whether or not you can reproduce; please upload your .config such that we can determine a common denominator, but make sure you have properly tested first.
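
One way to look for that common denominator is to compare a working and a non-working .config option by option. The sketch below is an assumption about how one might do it, not a tool used in this bug; the function name config_diff is hypothetical. It keeps only the CONFIG_ lines (including the "is not set" comments) and prints the ones that differ.

```shell
# Print kernel options whose setting differs between two .config files.
# Lines found only in the first file are prefixed "< ",
# lines found only in the second file are prefixed "> ".
# (config_diff is a hypothetical helper name.)
config_diff() {
    tmp_a=$(mktemp) || return 1
    tmp_b=$(mktemp) || { rm -f "$tmp_a"; return 1; }
    pattern='^(CONFIG_[A-Za-z0-9_]+=|# CONFIG_[A-Za-z0-9_]+ is not set)'
    grep -E "$pattern" "$1" | LC_ALL=C sort > "$tmp_a"
    grep -E "$pattern" "$2" | LC_ALL=C sort > "$tmp_b"
    LC_ALL=C comm -23 "$tmp_a" "$tmp_b" | sed 's/^/< /'
    LC_ALL=C comm -13 "$tmp_a" "$tmp_b" | sed 's/^/> /'
    rm -f "$tmp_a" "$tmp_b"
}

# Example (file names are placeholders):
#   config_diff working.config broken.config
```

Repeating the comparison across several working and non-working pairs and keeping only the options that differ every time should narrow the candidate list considerably.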
Comment 36 Tom Wijsman (TomWij) (RETIRED) gentoo-dev 2013-12-02 16:37:06 UTC
On a side note, I am also running systemd; GNOME 3.10 though.

(In reply to Ulenrich from comment #33)
> Can someone tell me if I am affected!?
> 
> # grep SigBlk /proc/*/status| grep -v 0000000000000000
> /proc/1291/status:SigBlk:       00007f4425234e90
> /proc/1330/status:SigBlk:       00007ffe15e91cb0
> /proc/1333/status:SigBlk:       00007ffe15e91a20
> /proc/1334/status:SigBlk:       00007f201480a8c8

Possibly, but I guess you'll want to try to reproduce. These look like memory addresses and thus match the description of what might be an indicator; just checked mine as well (I haven't been able to reproduce yet), but I notice I have at least one like this (gnome-shell):

/proc/3925/status:SigBlk:	00007f143178e000

Not sure what it means or whether it is a false positive in my case. :/
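
As a rough first pass, the address-like masks can be picked out of the grep output by their prefix. The filter below is a sketch of the heuristic from comment 12 (masks starting with 00007); the function name suspicious_masks is hypothetical, and as the gnome-shell case here suggests, a match is only a hint and may well be a false positive.

```shell
# Read lines of the form "/proc/PID/status:SigBlk:   MASK" on standard
# input and print "PID MASK" for masks that look like user-space addresses.
# (suspicious_masks is a hypothetical helper name; the 00007 prefix test
# is a heuristic taken from this thread, not a definitive check.)
suspicious_masks() {
    while IFS= read -r line; do
        mask=${line##*[!0-9a-fA-F]}     # trailing run of hex digits
        pid=${line#/proc/}
        pid=${pid%%/*}
        case $mask in
            00007[0-9a-fA-F]*) printf '%s %s\n' "$pid" "$mask" ;;
        esac
    done
}

# Example:
#   grep SigBlk /proc/[0-9]*/status | suspicious_masks
```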
Comment 37 Ulenrich 2013-12-02 16:56:38 UTC
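# List every process whose SigBlk mask does not contain a run of eleven zeros.
for f in /proc/[0-9]*/status; do
  mask=$(sed -n 's/^SigBlk:[[:space:]]*//p' "$f" 2>/dev/null)
  case $mask in ''|*00000000000*) continue ;; esac
  cat "${f%status}cmdline" 2>/dev/null   # arguments are NUL-separated
  printf '\n SigBlk %s \n\n' "$mask"
done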
----
/sbin/systemd
 SigBlk 7be3c0fe28014a03 

/usr/lib/systemd/systemd--user
 SigBlk 7be3c0fe28014a03 

kwin
 SigBlk 00007f4425234e90 

kdeinit4: krunner [kdeinit]
 SigBlk 00007ffe15e91cb0 

su-
 SigBlk fffffffe7ffb9eff 

su-
 SigBlk fffffffe7ffb9eff
----
Comment 38 Tom Wijsman (TomWij) (RETIRED) gentoo-dev 2013-12-02 17:34:23 UTC
Not reproducible on 331.17 either, I have tried that one because it is supposed to happen more often there; I'll probably try to change some of the kernel options. If I find one that allows me to reproduce, I'll let you know.
Comment 39 Ulenrich 2013-12-02 21:52:33 UTC
I just installed x11-drivers/nvidia-drivers-319.72
---
see bug https://bugs.gentoo.org/show_bug.cgi?id=493160
---
but nothing changes (I was never hit by this bug), like above:

/sbin/systemd
 SigBlk 7be3c0fe28014a03 

/usr/lib/systemd/systemd--user
 SigBlk 7be3c0fe28014a03 

/usr/lib/systemd/systemd--user
 SigBlk 7be3c0fe28014a03 

kwin
 SigBlk 00007efe34c32e90 

kdeinit4: krunner [kdeinit]
 SigBlk 0000000001323e50 

kdeinit4: dolphin [kdeinit] --icon system-file-ma
 SigBlk 00007ffe61f8ca30 

/usr/bin/aqualung
 SigBlk 00007fa463d198c8 

/usr/lib/systemd/systemd-udevd
 SigBlk fffffffe7ffbfeff
Comment 40 Martin Samek 2013-12-03 00:34:29 UTC
Hi, I have same/similar issue with 331.20 drivers, an increasing number of defunct processes:

$ ps ax |grep defunct
 2723 ?        Z      0:00 [kwin_opengl_tes] <defunct>
 2749 ?        ZN     0:00 [virtuoso-t] <defunct>
 3103 ?        ZN     0:00 [virtuoso-t] <defunct>
 3105 ?        ZN     0:00 [virtuoso-t] <defunct>
 3141 ?        Z      0:00 [virtuoso-t] <defunct>
 3143 ?        Z      0:00 [virtuoso-t] <defunct>
 3147 ?        Z      0:00 [virtuoso-t] <defunct>
 3148 ?        Z      0:00 [virtuoso-t] <defunct>

My kernel config is in the attachment.
Comment 41 Martin Samek 2013-12-03 00:35:30 UTC
Created attachment 364486 [details]
kernel-3.12.1-config-broken
Comment 42 Jeroen Roovers (RETIRED) gentoo-dev 2013-12-03 12:15:01 UTC
(In reply to Martin Samek from comment #40)
> Hi, I have same/similar issue with 331.20 drivers

We don't need more confirmation, thanks. We do need at the very least someone posting their nvidia-bug-report.sh output on the upstream forum, and ideally someone going through various kernel configuration switches to see which one trips up nvidia.ko.
Comment 43 Christian Loosli 2013-12-03 12:37:26 UTC
(In reply to Jeroen Roovers from comment #42)
> (In reply to Martin Samek from comment #40)
> > Hi, I have same/similar issue with 331.20 drivers
> 
> We don't need more confirmation, thanks. We do need at the very least
> someone posting their nvidia-bug-report.sh output on the upstream forum, 

This has been done multiple times, see 

https://devtalk.nvidia.com/default/topic/633706/linux/recent-drivers-cause-applications-to-hang-not-start-at-all-or-compilation-failures/

and 

https://devtalk.nvidia.com/default/topic/638521/linux/gnome-terminal-problems-ctrl-c-and-exit/

nVidia opened a bug in their internal tracker (bug 1414070), so they are aware of the issue (but apparently struggle with reproducing it) 

Of course more people can add their reports there :) 

As other distributions seem to be affected as well, I can't say whether it really is a kernel option or maybe a specific software version (e.g. gcc or libc) 

Kind regards, 

Christian
Comment 44 Eric F. GARIOUD 2013-12-03 12:53:30 UTC
(In reply to Jeroen Roovers from comment #42)
> ideally someone going through various kernel configuration switches to see
> which one trips up nvidia.ko.

1/ I have no logic yet capable of justifying,
2/ The troubles occur randomly => I might well not have tested enough.
Both tests made under identical hardware + ck-sources-3.4.68 + all drivers statically built + (nvidia-drivers-319.49 (troublefree) || nvidia-drivers-319.60 (misc and random problems already reported above and elsewhere))

- Building the kernel with CONFIG_BSD_PROCESS_ACCT=y and CONFIG_BSD_PROCESS_ACCT_V3=y + nvidia-drivers-319.60 => Troubles!
- Building the kernel with CONFIG_BSD_PROCESS_ACCT and CONFIG_BSD_PROCESS_ACCT_V3 unset => No problem... yet! (including no problem for akonadi registering with dbus)

For what it is worth... that is, for the now... almost nothing.
Comment 45 Ulenrich 2013-12-03 14:02:05 UTC
@Eric, am I affected using now nvidia-drivers-319.76 with both 'y'? Output at: 
https://forums.gentoo.org/viewtopic-p-7453354.html#7453354
... to get some user input in the forum.
Comment 46 Tom Wijsman (TomWij) (RETIRED) gentoo-dev 2013-12-04 15:27:41 UTC
(In reply to Eric F. GARIOUD from comment #44)
> - Building the kernel with CONFIG_BSD_PROCESS_ACCT=y and
> CONFIG_BSD_PROCESS_ACCT_V3=y + nvidia-drivers-319.60 => Troubles!
> - Building the kernel with CONFIG_BSD_PROCESS_ACCT and
> CONFIG_BSD_PROCESS_ACCT_V3 unset => No problem... yet! (including no problem
> for akonadi registering with dbus)

Thank you for sharing this discovery.

Can confirm this to break on the very first emerge that I do after rebuilding it with just those two toggled; both app-arch/tar and sys-boot/grub now show this in their log:

  configure:26161: checking for working re_compile_pattern
  configure:26352: x86_64-pc-linux-gnu-gcc -std=gnu99 -o conftest -O2 -pipe -O2 -pipe -march=native -fomit-frame-pointer  -Wl,-O1 -Wl,--as-needed conftest.c -lacl  >&5
  configure:26352: $? = 0
  configure:26352: ./conftest
  *** Error in `./conftest': malloc(): memory corruption: 0x0000000000604fc0 ***

When checking the signal masks, I see /usr/bin/gnome-shell as before which is normal; but this time I additionally see the following as well, it might or might not be part of the problem:

  /usr/libexec/ibus-x11 --kill-daemon

The builds failing are sufficient proof though.

Updated the bug summary with this config variable; since we also know that 325.15 works and that 331.17 and 331.20 fail, we can update the version as well.

Only thing left to figure out is the kernel version and to be more specific the kernel commit where this behavior was introduced. Unless it applies to all kernel versions...

=================
= TEMPORARY FIX =
=================

Downgrade to ~x11-drivers/nvidia-drivers-325.15 or alternatively set CONFIG_BSD_PROCESS_ACCT=n CONFIG_BSD_PROCESS_ACCT_V3=n in the kernel .config
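
To confirm what a given .config actually contains, a small check like the following may help. It is a sketch only: config_state is a hypothetical helper, and it treats both a missing option and the usual "# CONFIG_... is not set" comment line as "unset", since a disabled option does not appear as "=n" in a generated .config.

```shell
# Report how a kernel .config sets a given option.
# $1 is the option name, $2 the config file.
# Prints the value after "=" (for example "y" or "m"), or "unset".
config_state() {
    opt=$1
    file=$2
    val=$(sed -n "s/^${opt}=\\(.*\\)\$/\\1/p" "$file")
    echo "${val:-unset}"
}
```

Typical use would be `config_state CONFIG_BSD_PROCESS_ACCT /usr/src/linux/.config`. To inspect the kernel that is actually running, the config can first be extracted with `zcat /proc/config.gz > /tmp/running.config`, assuming the kernel was built with CONFIG_IKCONFIG_PROC.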
Comment 47 Jeroen Roovers (RETIRED) gentoo-dev 2013-12-04 15:31:36 UTC
"with kernel ? and CONFIG_BSD_PROCESS_ACCT{,_V3}=y"

I've had those enabled all along and that was never the issue for me.
Comment 48 Eric F. GARIOUD 2013-12-04 15:57:47 UTC
(In reply to Jeroen Roovers from comment #47)
> "with kernel ? and CONFIG_BSD_PROCESS_ACCT{,_V3}=y"
> 
> I've had those enabled all along and that was never the issue for me.

There might well be another CONFIG setting involved.
On my side I can make observations identical to my #44 comment under :

ck-sources-2.6.38-r3 ; 3.4.68 ; 3.8.13 ; 3.10.17
Comment 49 Tom Wijsman (TomWij) (RETIRED) gentoo-dev 2013-12-04 16:31:29 UTC
(In reply to Jeroen Roovers from comment #47)
> "with kernel ? and CONFIG_BSD_PROCESS_ACCT{,_V3}=y"
> 
> I've had those enabled all along and that was never the issue for me.

Hmm, then either this is card specific or it depends on some other config variable as well; feel free to share your .config if you want this investigated further. However, like you, I think this should be investigated further upstream.

Further testing reveals this can also be reproduced using an unpatched 331.17 on 3.9.11 and 3.6.11; so, it is definitely not a recent kernel regression. Will do some further testing later to see if older versions are not affected. It starts to seem like an NVIDIA driver regression where some kernel options and/or graphics cards just serve as a condition to reveal it.

My card is:

02:00.0 VGA compatible controller [0300]: NVIDIA Corporation G92M [GeForce GTX 285M] [10de:060f] (rev a2)

I'll send more details upstream with `nvidia-bug-report.sh` when I revisit this.
Comment 50 Eric F. GARIOUD 2013-12-04 21:22:56 UTC
@Jer about the new summary.

As written in #44, <=319.49 systems are systematically OK

*319.60* is the first release causing troubles.
Comment 51 Jeroen Roovers (RETIRED) gentoo-dev 2013-12-05 02:48:15 UTC
(In reply to Tanktalus from comment #14)
> Oh, I should also add: I'm running KDE 4.11.2.  And akonadi has problems due
> to this as well.  But what has problems should be of passing interest only. 
> The real cause is the signal mask that nvidia gives to the parent process
> and gets passed on to everything else.  Arguably, everything else could
> reset their own signal masks. But they shouldn't have to.

This all sounds like an ABI change. It may not even come up in nvidia.ko vs. the kernel, but in the userland libraries that talk to nvidia.ko. Try this and see how it goes:

1) upgrade sys-kernel/linux-headers to a version approaching your current kernel version.
2) re-emerge sys-libs/glibc
3) re-emerge x11-drivers/nvidia-drivers
4) reboot
5) test
Comment 52 Matthias Dahl 2013-12-05 14:39:39 UTC
A few things I tried on a current ~amd64 system:

+ recompiled glibc & nvidia-drivers, reboot, test
+ same as above but w/ gcc 4.7.3 (as opposed to 4.8.2) and recompiled the kernel as well, reboot, test
+ BSD accounting off

None of it made any change. KDE behaved strangely sometimes and I got zombie processes (due to the "masking corruption" noted earlier by someone else).

For what it's worth: It is a GTX470 and the kernel is 3.11.7 (vanilla flavor).

Downgrading to 325.15 stopped all the madness again and everything runs smoothly.
Comment 53 scrimekiler 2013-12-05 21:55:19 UTC
331.17 in the topic name doesn't seem to exist.

331.20 does exist but doesn't seem to be affected
Comment 54 Tom Wijsman (TomWij) (RETIRED) gentoo-dev 2013-12-06 01:46:15 UTC
(In reply to scrimekiler from comment #53)
> 331.17 in the topic name doesn't seem to exist.

It is a beta version in the ebuild attic that makes it easier to reproduce the problem; please note that the summary ATOM has ">=", which means 331.17 or newer.

(In reply to Matthias Dahl from comment #52)
> + BSD accounting off
> 
> None of it made any change.

How did you turn it off? Did you change the kernel options? Can you check that you actually booted the newly built kernel? If you have this problem and cannot toggle it using the kernel config variable we have found, then it seems that there are other ways to trigger this behavior.
Comment 55 Matthias Dahl 2013-12-06 08:59:06 UTC
(In reply to Tom Wijsman (TomWij) from comment #54)

> How did you turn it off?

Changed the kernel config. Recompiled the kernel. Rebooted. :)

> Can you check if you did boot the newly build kernel?

You can be pretty sure I know what I am doing. :) In fact, if it weren't for the fact that the drivers mostly consist of the crappy blob, I would have already put gdb to good use to figure out what is going on.

> If you have this problem and cannot toggle it using the kernel config
> variable we have found; then, it seems that there are other ways to trigger 
> this behavior.

Yeah. It would have surprised me, honestly, if this had been really a reliable trigger or workaround. :( I also had a look through the configs (mine and the ones posted), but nothing really jumped out.

Maybe it is really a memory corruption happening on the nvidia side due to their recent changes with the unified memory support. Or it is indeed an ABI clash somewhere. With nvidia's history of taking their time to fix things, this could be an issue for quite some time to come, I am afraid. :((

What we know so far:

- signal mask gets corrupted (w/ all its side effects)
- the gfx chip doesn't seem to make any difference
- glibc makes no difference (happens w/ glibc 2.15 and 2.17)
- gcc makes no difference (happens w/ gcc 4.7 and 4.8)
- kernel version seems to make no difference
- drivers > 325.15 are affected
Comment 56 Serge Gavrilov 2013-12-07 05:41:06 UTC
319.60 is definitely affected

319.49 is probably affected too. The fresh X session behaves in a normal way, but after a week I have 

$ ps ax | grep Z
 1358 ?        Z     11:25 [vlc] <defunct>
 6392 ?        Z      0:00 [su] <defunct>
 6396 ?        Z      0:00 [sh] <defunct>
 9357 ?        Z      0:53 [digikam] <defunct>
 9684 ?        Zs     0:00 [ssh-euclid] <defunct>
12906 ?        Z     12:56 [vlc] <defunct>
21672 ?        Z      1:50 [vmplayer] <defunct>
21986 ?        Z      2:19 [darktable] <defunct>
22526 ?        Z      3:24 [vmplayer] <defunct>
23966 ?        Z      0:25 [recoll] <defunct>
Comment 57 Eric F. GARIOUD 2013-12-07 10:19:02 UTC
(In reply to Serge Gavrilov from comment #56)
> 319.60 is definitely affected
> 
> 319.49 is probably affected too.

This is not my opinion. In my opinion :

- 319.49 is *definitely not* concerned by *this* (487558) precise bug.
- 319.60 is the first one being concerned.

- 319.49 is known for being concerned with another bug. The one you are experiencing and reporting about with your list of defunct processes.
- 319.60 tried to address this issue (" Fixed a bug that could cause OpenGL applications to crash during the initialization of new threads." quoted from nvidia-319.60 release highlights)

It is highly probable that the bug we are speaking about here (487558) has been introduced by nvidia as a consequence of the above mentioned bugfix, but I acknowledge I have no means to prove that.
Comment 58 Ulenrich 2013-12-07 14:12:34 UTC
If you diff the non-binary files, only two files differ in more than their version strings. But these two changes:
a) drm_fasync
b) gfp_mask
aren't of any significance in our case, or are they?

--- NVIDIA-Linux-x86_64-319.49/kernel/nv-drm.c
+++ NVIDIA-Linux-x86_64-319.60/kernel/nv-drm.c
@@ -106,7 +106,6 @@
     .unlocked_ioctl = drm_ioctl,
     .mmap = drm_gem_mmap,
     .poll = drm_poll,
-    .fasync = drm_fasync,
     .read = drm_read,
     .llseek = noop_llseek,
 };
--- NVIDIA-Linux-x86_64-319.49/kernel/nv-vm.c
+++ NVIDIA-Linux-x86_64-319.60/kernel/nv-vm.c
@@ -483,6 +483,9 @@
         gfp_mask = NV_GFP_DMA32;
     }
 #endif
+#if defined(__GFP_NORETRY)
+    gfp_mask |= __GFP_NORETRY;
+#endif
 #if defined(__GFP_ZERO)
     if (at->flags & NV_ALLOC_TYPE_ZEROED)
         gfp_mask |= __GFP_ZERO;
@@ -532,7 +535,7 @@
     NV_GET_FREE_PAGES(virt_addr, at->order, (gfp_mask | __GFP_COMP));
     if (virt_addr == 0)
     {
-        nv_printf(NV_DBG_ERRORS,
+        nv_printf(NV_DBG_MEMINFO,
             "NVRM: VM: %s: failed to allocate memory\n", __FUNCTION__);
         return RM_ERR_NO_FREE_MEM;
     }
@@ -700,7 +703,7 @@
         NV_GET_FREE_PAGES(virt_addr, 0, gfp_mask);
         if (virt_addr == 0)
         {
-            nv_printf(NV_DBG_ERRORS,
+            nv_printf(NV_DBG_MEMINFO,
                 "NVRM: VM: %s: failed to allocate memory\n", __FUNCTION__);
             status = RM_ERR_NO_FREE_MEM;
             goto failed;
Comment 59 Constantin Baranov 2013-12-07 21:58:37 UTC
I observe this bug also, on any drivers newer than 325.15 on any kernel tried (3.10, 3.11, 3.12). It appears every time during Xfce session initialization (one of Xfce's root processes gets a wrong sigmask and infects most of the GUI), but I had no luck finding the exact way to provoke the bug in a single isolated process.
Comment 60 Jeroen Roovers (RETIRED) gentoo-dev 2013-12-10 12:38:00 UTC
Interestingly, while re-installing a kernel with some changed .config options, I saw this:

DEPMOD  3.12.2-gentoo-JeR
depmod: WARNING: /lib/modules/3.12.2-gentoo-JeR/video/nvidia.ko needs unknown symbol kmem_cache_alloc_trace
depmod: WARNING: /lib/modules/3.12.2-gentoo-JeR/video/nvidia.ko needs unknown symbol add_preempt_count
depmod: WARNING: /lib/modules/3.12.2-gentoo-JeR/video/nvidia.ko needs unknown symbol debug_smp_processor_id
depmod: WARNING: /lib/modules/3.12.2-gentoo-JeR/video/nvidia.ko needs unknown symbol sub_preempt_count

The symbol references are present in the nvidia.ko built against the previously installed kernel, while apparently nvidia.ko (or probably more precisely nv-kernel.o) hides these symbols. After reinstalling nvidia-drivers, this is magically corrected for.

The main difference in the .config is that I played with enabling/disabling CONFIG_TRACING, but since that enables/disables some other dependent options on its own, I can't be sure which is triggering the behaviour we see. Also, the attached .configs don't agree on CONFIG_TRACING itself.
Comment 61 Tom Wijsman (TomWij) (RETIRED) gentoo-dev 2013-12-14 14:31:38 UTC
*** Bug 494212 has been marked as a duplicate of this bug. ***
Comment 62 Serge Gavrilov 2013-12-15 08:14:57 UTC
With nvidia-drivers-319.76 and quite old kernel 3.5.7 where 

# CONFIG_BSD_PROCESS_ACCT is not set

I reproduce the initial problem related to bogofilter. Thus it seems there is no working driver in portage now.
Comment 63 Serge Gavrilov 2013-12-15 08:34:11 UTC
The same problem with 331.20
Comment 64 Serge Gavrilov 2013-12-15 08:55:43 UTC
Created attachment 365380 [details]
Non-working .config for 3.5.7 with  CONFIG_BSD_PROCESS_ACCT=n
Comment 65 Christian Loosli 2013-12-15 13:54:30 UTC
Could we please get 325.15 back in tree, as this is the last driver that works? 

Removing this leaves KDE users with only non-working drivers ...

Kind regards
Comment 66 Jeroen Roovers (RETIRED) gentoo-dev 2013-12-16 14:13:43 UTC
(In reply to Christian Loosli from comment #65)
> Could we please get 325.15 back in tree, as this is the last driver that
> works? 
> 
> Removing this leaves KDE users with only non-working drivers ...

I am running KDE just fine with newer nvidia-drivers.
Comment 67 Ville Aakko 2013-12-16 15:38:32 UTC
(In reply to Jeroen Roovers from comment #66)
> (In reply to Christian Loosli from comment #65)
> > Could we please get 325.15 back in tree, as this is the last driver that
> > works? 
> > 
> > Removing this leaves KDE users with only non-working drivers ...
> 
> I am running KDE just fine with newer nvidia-drivers.

I'm not. The bug does not occur for everyone (also, it is not KDE specific).

Actually, nothing newer than 319.49, which has also been removed from the tree, is working for me. 

When the bug is fixed in a new driver, I'd presume this bug will be labeled FIXED.  I'd also like to suggest that the last versions that seem to work not be removed from the tree until then, as removing them causes unnecessary hassle for people hit by this bug.

Cheers!
Comment 68 Lars Wendler (Polynomial-C) (RETIRED) gentoo-dev 2013-12-18 08:08:39 UTC
*** Bug 494618 has been marked as a duplicate of this bug. ***
Comment 69 Hans Nieser 2013-12-18 09:31:20 UTC
I have this issue as well, but it appears *very* sporadically, and whether or not I get this bug seems to be decided at boot time (once I have booted and I don't see any curiously blocked signals, I'm good at least until the next time I boot). 4 out of 5 times my machine boots fine, so it's hard for me to really pin this down on any particular kernel option. I initially thought toggling CONFIG_BSD_PROCESS_ACCT helped, but numerous reboots down the road this issue seemed to hit me again.
Comment 70 Marco Diletti 2013-12-18 10:44:54 UTC
The nvidia driver I am using is version 319.49;
if I upgrade to a newer version I suffer these problems:

- VMware Workstation does not start virtual machines, giving the error
"Cannot find a valid peer process to connect to".
- Some wine/crossover applications (e.g. DVDFab) do not start at all.
- SMPlayer/mplayer hangs on quitting the application or, if I am watching
TV, when I change channel.
- Mono applications (e.g. Keepass) do not start at all.

A workaround for these problems is to start the applications from
the terminal (very tedious).

Another issue with drivers > 319.49 is that SLI does not work,
giving the error "trouble accessing pci config space".

I apologize for my English.
Comment 71 Kelly Price 2013-12-21 13:30:33 UTC
Adding myself to the bug list.  Mainly had issues with KDE freezing even after a laptop suspend-to-RAM.
Comment 72 Markus Strobl 2013-12-23 20:28:44 UTC
I had kernel 3.11.6 with CONFIG_BSD_PROCESS_ACCT=y. nvidia-drivers-319.76 worked and 331.20 had all problems already listed in this bug report.

Set CONFIG_BSD_PROCESS_ACCT=n. Still had the same problem with 331.20.

Switched to kernel 3.12.6 (CONFIG_BSD_PROCESS_ACCT=n) and nvidia-drivers-331.20. So far it seems to work. No delay when opening dolphin or the save-as dialog in KDE apps. Emerged grep and it did not hang on recompile-pattern like it did before.
Comment 73 Neil 2013-12-24 07:34:45 UTC
(In reply to Markus Strobl from comment #72)

> 
> Switched to kernel 3.12.6 (CONFIG_BSD_PROCESS_ACCT=n) and
> nvidia-drivers-331.20. So far it seems to work. No delay when opening
> dolphin or the save-as dialog in KDE apps. Emerged grep and it did not hang
> on recompile-pattern like it did before.

I wish I was as lucky...

$ uname -a
Linux kimura 3.12.6-gentoo #1 SMP Sun Dec 22 10:19:31 GMT 2013 x86_64 Intel(R) Core(TM)2 CPU E8400 @ 3.00GHz GenuineIntel GNU/Linux
$ eix -Ic nvidia-drivers
[I] x11-drivers/nvidia-drivers (331.20@12/22/13): NVIDIA X11 driver and GLX libraries
[1] "Personal overlay" /usr/portage/local
$ tail -n 30 /var/tmp/portage/sys-apps/grep-2.15-r1/temp/build.log 
checking whether mbrtowc handles a NULL pwc argument... (cached) yes
checking whether mbrtowc handles a NULL string argument... (cached) yes
checking whether mbrtowc has a correct return value... (cached) yes
checking whether mbrtowc returns 0 when parsing a NUL character... (cached) yes
checking whether mbrtowc handles incomplete characters... (cached) yes
checking whether mbrtowc works as well as mbtowc... (cached) yes
checking whether mbrtowc handles incomplete characters... (cached) yes
checking whether mbrtowc works as well as mbtowc... (cached) yes
checking whether mbsrtowcs works... yes
checking for mempcpy... (cached) yes
checking for memrchr... yes
checking whether YESEXPR works... yes
checking for obstacks... yes
checking whether open recognizes a trailing slash... yes
checking for opendir... yes
checking for perl5.005 or newer... yes
checking whether frexp works... (cached) yes
checking whether ldexp can be used without linking with libm... yes
checking whether frexpl() can be used without linking with libm... (cached) yes
checking whether frexpl works... (cached) yes
checking whether frexpl is declared... (cached) yes
checking whether ldexpl() can be used without linking with libm... yes
checking whether ldexpl works... yes
checking whether ldexpl is declared... (cached) yes
checking whether program_invocation_name is declared... yes
checking whether program_invocation_short_name is declared... yes
checking for readdir... yes
checking for stdlib.h... (cached) yes
checking for GNU libc compatible realloc... yes
checking for working re_compile_pattern... 


...and its just hung there.  grub2 hangs in the same manner.
Comment 74 scrimekiler 2013-12-24 09:06:43 UTC
(In reply to Neil from comment #73)
> I wish I was as lucky...
> ...and its just hung there.  grub2 hangs in the same manner.

You may be interested in the following (duplicate) bug : 
https://bugs.gentoo.org/show_bug.cgi?id=490496

It seems to depend on the ebuild. I only had this problem with grub2, not with other ebuilds.

Neil, maybe you should try with version <=319.32
Comment 75 Neil 2013-12-24 10:05:56 UTC
(In reply to scrimekiler from comment #74)
> (In reply to Neil from comment #73)
> > I wish I was as lucky...
> > ...and its just hung there.  grub2 hangs in the same manner.
> 
> You may be interested in the following (duplicate) bug : 
> https://bugs.gentoo.org/show_bug.cgi?id=490496
> 
> It seems it depends of the ebuilds.I had only this problem with grub2, not
> with  other ebuilds
> 
> Neil, maybe you should try with version <=319.32

Thanks for the suggestion; unfortunately I've now got a blocker from chromium on <=x11-drivers/nvidia-drivers-331.20...

# emerge -a nvidia-drivers
[ebuild     UD] x11-drivers/nvidia-drivers-319.76 [331.20] USE="X acpi (-multilib) -pax_kernel tools" 
[blocks B     ] <x11-drivers/nvidia-drivers-331.20 ("<x11-drivers/nvidia-drivers-331.20" is blocking www-client/chromium-32.0.1700.68)

 * Error: The above package list contains packages which cannot be
 * installed at the same time on the same system.

  (x11-drivers/nvidia-drivers-319.76::gentoo, ebuild scheduled for merge) pulled in by
    x11-drivers/nvidia-drivers required by @selected
    nvidia-drivers
    x11-drivers/nvidia-drivers required by (x11-base/xorg-drivers-1.15::gentoo, installed)

  (www-client/chromium-32.0.1700.68::gentoo, installed) pulled in by
    www-client/chromium required by @selected


319.76 is the current 319.* version in portage....

# eix nvidia-drivers
[D] x11-drivers/nvidia-drivers
     Available versions:  (~)96.43.09^s[1] 96.43.23^msd 173.14.39^msd 304.116^msd (~)304.117^msd 319.76^msd [m]331.20^msd {+X acpi custom-cflags gtk multilib pax_kernel (+)tools KERNEL="FreeBSD linux"}
     Installed versions:  331.20^msd(10:23:37 22/12/13)(X acpi tools -multilib -pax_kernel KERNEL="linux -FreeBSD")
     Homepage:            http://www.nvidia.com/
     Description:         NVIDIA X11 driver and GLX libraries


It's not a game stopper (yet!) and if I've understood this bug report correctly (which may not be the case, as I feel slightly confused) then it's something that needs sorting upstream by nvidia, and I am happy to wait (but also to try out solutions suggested here in the meantime).
Comment 76 Hans Nieser 2013-12-24 15:14:05 UTC
Please ignore everything I said in comment #69, my conclusions were tainted by not having the nvidia opengl libs selected (with 'eselect opengl').

It does seem however that switching to the xorg-x11 option makes the corrupt/weird signal mask problem go away. Not really a usable work-around for those on Gnome since that means no 3D accel, but still better than hanging terminals and crashing/stuck applications and systemd getting stuck when trying to reboot/shutdown.
Comment 77 Hans Nieser 2013-12-24 15:16:24 UTC
And.. I forgot to add: this problem only occurs on a machine with a GTX275, I have another, newer machine that has a GTX780, which does not have this problem at all. When I get some time I will see if swapping the GPUs changes anything and if not, see if I can spot some differences in the kernel config and installed software.
Comment 78 Bradley Broom 2013-12-26 18:29:17 UTC
Nvidia seems to have a fix coming soon. See post 13 in https://devtalk.nvidia.com/default/topic/638521/linux/gnome-terminal-problems-ctrl-c-and-exit/
Comment 79 David Davidson 2013-12-30 17:10:33 UTC
I confirm that I had the same problem on a new system that I just worked on.
For whatever reason, this Portage tree didn't have version 319.49 so I made an overlay for it. I also had to mask 319.76 which also seems to be affected by this problem.
I experienced the same issues with zombie processes including nepomuk, virtuoso-t, and kwin_opengl_test. I also was unable to get gettext to compile; it also got stuck on the sleep step during the configure process.
Thank you for posting this bug - I was going around in circles trying to get these figured out. I wasn't able to find the bug until I came across the gettext compile issue.
Rolling back to v319.49 resolved these. Hopefully NVidia will be able to develop a fix. If I can provide anything that would help then please let me know.
Thank you again.
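
For anyone who wants to pin a known-good version the same way, the mask entry can be written as sketched below. mask_newer_drivers is a hypothetical helper, not part of Portage; the atom syntax (a leading ">" followed by category/package-version) is the standard one, and the target file is an assumption about the local setup.

```shell
# Append a mask for every nvidia-drivers version newer than $2 to the
# package.mask file given as $1.
mask_newer_drivers() {
    file=$1
    keep=$2
    {
        echo "# signal mask corruption, Gentoo bug 487558"
        echo ">x11-drivers/nvidia-drivers-$keep"
    } >> "$file"
}
```

As root this could be called as `mask_newer_drivers /etc/portage/package.mask 319.49`, assuming package.mask is a plain file on that system; where it is a directory, a file inside it would be the target instead.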
Comment 80 Denis Misiurca 2014-01-02 23:12:14 UTC
^c is now working for me after rebuilding kernel without BSD_PROCESS_ACCT.

Also I had some random system freezes (reset button was the only thing that helped), but don't know if it is related to this bug, will see if they occur again.
Comment 81 Denis Misiurca 2014-01-04 02:57:24 UTC
Sorry, it wasn't really true about working ^C.

After a reboot it stopped working again; it seems that it doesn't work when Konsole is started by the KDE session manager. Restarting Konsole made it work again (but it didn't work that way with BSD_PROCESS_ACCT enabled, so it seems that the option has some influence).

I'm using sys-kernel/gentoo-sources-3.11.10 and x11-drivers/nvidia-drivers-331.20
Comment 82 Markus Strobl 2014-01-04 03:45:28 UTC
Try the latest 3.12 kernel. I had the same issue with 3.11. 3.12.6 (CONFIG_BSD_PROCESS_ACCT=n) and nvidia-drivers-331.20 has been OK for me for 2 weeks now.
Comment 83 Victor Orozco 2014-01-04 17:37:18 UTC
Created attachment 366974 [details]
non-working .config with CONFIG_BSD_PROCESS_ACCT disabled

I've tested CONFIG_BSD_PROCESS_ACCT=n with gentoo-sources-3.12.6 and nvidia-drivers-331.20 on an NVIDIA GTX560m; however, it does not fix the issue.
Comment 84 Markus Strobl 2014-01-04 17:48:56 UTC
Created attachment 366976 [details]
Working config for 3.12.6

So far I have not encountered any issues with this config and nvidia-drivers-331.20
Comment 85 Jeroen Roovers (RETIRED) gentoo-dev 2014-01-06 15:06:14 UTC
*** Bug 497000 has been marked as a duplicate of this bug. ***
Comment 86 Andrew Udvare 2014-01-06 21:42:42 UTC
I get the same error with 319.76 (and it does not seem possible for me to use the GeForce 650M with version 304).

$ ps -eda -o pid,ppid,blocked,comm | grep -v 00000
  PID  PPID          BLOCKED COMMAND
  275     1 fffffffe7ffbfeff udevd
 9107     1 00007f847f9b5c00 kded4
 9162     1 00007f847f9b5800 bluedevil-monol
 9196  9104 00007ffe750a90f8 kio_trash
 9214  9206 00007f8c7868f070 kwin
 9226  9223 00007f847f9b5c00 ksysguardd
 9235  9104 00007ffe750a90f8 kio_trash
 9292  9104 00007ffe750a8410 konqueror
 9391  9104 00007ff845816440 pidgin
 9659  9104 00007ffe750a90f8 kio_trash

It is even worse if anything at the kdeinit4 level gets touched. It makes stuff like ^C stop working in Konsole properly.
Comment 87 Andreas K. Hüttel archtester gentoo-dev 2014-01-06 21:53:13 UTC
For the record I extended the mask in the kde profile to cover all 319* and 331* versions.
Comment 88 Andrew Udvare 2014-01-07 02:54:44 UTC
I tried to use 304.* with my GeForce 650M (MacBook Pro machine) and got errors starting up. It does not initialise the card. This might be a compatibility issue.

[    5.500558] NVRM: failed to copy vbios to system memory.
Comment 89 Alexey Korepanov 2014-01-12 18:01:35 UTC
I confirm the problem.

x11-drivers/nvidia-drivers-331.20 was built with the following:
USE="X acpi (multilib) tools -pax_kernel"

sys-kernel/gentoo-sources-3.10.17 was built with the following:
USE="-build -deblob -experimental -symlink"
Comment 90 Victor Mataré 2014-01-13 19:47:52 UTC
BTW, is this really a KDE issue WRT masking? Seems to have been observed on KDE systems at first, but if it's the signal mask on the X server, it really shouldn't be limited to KDE.

Anyways a heads-up: The 331.38 driver just came out:
http://www.nvidia.com/download/driverResults.aspx/72250/en-us

According to https://devtalk.nvidia.com/default/topic/638521/linux/gnome-terminal-problems-ctrl-c-and-exit/ it's supposed to be fixed there. I'm going to try it out right now. Maybe it should be bumped in unstable so more people will test it?
Comment 91 Victor Mataré 2014-01-13 20:27:48 UTC
OK, so:

# modinfo nvidia
filename:       /lib/modules/3.12.6-gentoo_wald/video/nvidia.ko
alias:          char-major-195-*
version:        331.38
[...]

# cat /proc/`pgrep X`/status | grep Sig
SigQ:   2/61376
SigPnd: 0000000000000000
SigBlk: 0000000000000000
SigIgn: 0000000000001000
SigCgt: 00000001d18066cf

Is this it? Looks OK to me. Please bump.
Comment 92 Martin Samek 2014-01-13 21:47:16 UTC
Same for me:

filename:       /lib/modules/3.12.7-gentoo/video/nvidia.ko
alias:          char-major-195-*
version:        331.38
supported:      external
license:        NVIDIA

SigQ:   0/126391
SigPnd: 0000000000000000
SigBlk: 0000000000000000
SigIgn: 0000000000001000
SigCgt: 00000001d18066cf

At this moment without defuncts
Comment 93 Serge Gavrilov 2014-01-14 08:27:30 UTC
It seems that 331.38 fixes initial problem related with bogofilter freezes.
Comment 94 Jeroen Roovers (RETIRED) gentoo-dev 2014-01-14 13:06:57 UTC
(In reply to Serge Gavrilov from comment #93)
> It seems that 331.38 fixes initial problem related with bogofilter freezes.

Perhaps =x11-drivers/nvidia-drivers-319.82 too?
Comment 95 Andreas Hermann 2014-01-14 23:02:19 UTC
I tried almost every version of x11-drivers/nvidia-drivers but
cannot reproduce the problem; the output of "cat /proc/`pgrep X`/status | grep Sig" always shows:

SigQ:   0/31107
SigPnd: 0000000000000000
SigBlk: 0000000000000000
SigIgn: 0000000000001000
SigCgt: 00000001d18066cf

No crashes or application hangs. And I'm using KDE with kmail/akonadi, really strange. Maybe this is somehow hardware dependent? (using an old Thinkpad with an ancient nvidia card here).
Comment 96 Victor Orozco 2014-01-15 02:30:45 UTC
nvidia-drivers-331.38 fixes the problem for me (gentoo-sources-3.12.7, nvidia gtx560m)

# modinfo nvidia
filename:       /lib/modules/3.12.7-gentoo/video/nvidia.ko
alias:          char-major-195-*
version:        331.38
supported:      external
license:        NVIDIA

# cat /proc/`pgrep X`/status | grep Sig
SigQ:	0/128018
SigPnd:	0000000000000000
SigBlk:	0000000000000000
SigIgn:	0000000000381000
SigCgt:	00000001d18066cf


I'm now able to use Ctrl+C and my system is running without any defunct processes.
Comment 97 Constantin Baranov 2014-01-15 08:15:06 UTC
331.38 seems to work fine for me. Note that the exact garbage in the sigmasks may vary, so the observed behaviour may be very different (INT and CHLD blocking is described above; my first experience with the bug was a blocked ALRM, which is funny too). A better way to test the current state is to collect statistics with the command: ps ax --no-headings -o sigmask | sort | uniq -c
Run it with an old nvidia driver or with non-nvidia drivers and compare the result with the current statistics.
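
The same kind of tally can be sketched in Python by reading the SigBlk field of each /proc/PID/status file directly. This is only a rough equivalent of the ps pipeline, assuming a Linux /proc layout; the helper names are hypothetical.

```python
import glob
from collections import Counter


def parse_sigblk(status_text):
    """Return the SigBlk value from the text of a /proc/PID/status file."""
    for line in status_text.splitlines():
        if line.startswith("SigBlk:"):
            return line.split(":", 1)[1].strip()
    return None


def tally_masks(status_texts):
    """Count how many status texts carry each blocked-signal mask."""
    masks = (parse_sigblk(text) for text in status_texts)
    return Counter(mask for mask in masks if mask is not None)


def read_status_files(pattern="/proc/[0-9]*/status"):
    """Yield the contents of every readable status file matching pattern."""
    for path in glob.glob(pattern):
        try:
            with open(path, errors="replace") as handle:
                yield handle.read()
        except OSError:
            continue  # the process exited while we were scanning


if __name__ == "__main__":
    for mask, count in sorted(tally_masks(read_status_files()).items()):
        print(f"{count:7d} {mask}")
```

A healthy system should show most processes with an all-zero mask and only a handful of distinct non-zero values; many different odd-looking masks would be a hint of the corruption discussed here.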
Comment 98 Andrew Udvare 2014-01-16 09:10:56 UTC
331.38 installed.

$ ps -eda -o pid,ppid,blocked,comm | grep -v 00000
  PID  PPID          BLOCKED COMMAND
  275     1 fffffffe7ffbfeff udevd
Comment 99 Constantin Baranov 2014-01-16 09:57:29 UTC
(In reply to Andrew Udvare from comment #98)
> 331.38 installed.
> 
> $ ps -eda -o pid,ppid,blocked,comm | grep -v 00000
>   PID  PPID          BLOCKED COMMAND
>   275     1 fffffffe7ffbfeff udevd

This is OK. At least I see the same value on an Intel-only laptop. I guess it blocks all signals and unblocks only the few that are actually needed.
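
To see which signals such a mask leaves unblocked, one can list the clear bits. A minimal Python sketch, assuming 64 signals and the Linux convention that bit n-1 stands for signal n (the function name is invented for illustration):

```python
def unblocked_signals(mask_hex, highest=64):
    """Return signal numbers 1..highest whose bits are clear in a hex mask."""
    mask = int(mask_hex, 16)
    return [n for n in range(1, highest + 1) if not mask >> (n - 1) & 1]


if __name__ == "__main__":
    print(unblocked_signals("fffffffe7ffbfeff"))
```

For the udevd value this leaves 9, 19, 32 and 33. On Linux, 9 and 19 are SIGKILL and SIGSTOP, which can never be blocked, and 32 and 33 are, as far as I know, reserved by glibc for its threading implementation. So the mask most likely means "everything that can be blocked is blocked" and is not a sign of corruption.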
Comment 100 Tom Wijsman (TomWij) (RETIRED) gentoo-dev 2014-02-02 00:26:19 UTC
*** Bug 490256 has been marked as a duplicate of this bug. ***
Comment 101 Jeroen Roovers (RETIRED) gentoo-dev 2014-02-19 12:38:09 UTC
https://devtalk.nvidia.com/default/topic/690793

 = Linux, Solaris, and FreeBSD driver 331.49 (long-lived branch release) =

[...]
    "Fixed a bug which could sometimes corrupt a newly-created thread's signal
     mask in multi-threaded applications that load libGL."
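
As background for this release note: on POSIX systems a newly created thread is expected to start with a copy of its creator's signal mask, so unexpected bits in a fresh thread's mask point at corruption. Below is a small Python sketch of the expected behaviour only (Unix only; it does not reproduce the driver bug, and the helper name is invented).

```python
import signal
import threading


def mask_seen_by_new_thread():
    """Start a thread and return the signal mask it observes on entry."""
    seen = []

    def worker():
        # SIG_BLOCK with an empty set changes nothing and returns the mask.
        seen.append(signal.pthread_sigmask(signal.SIG_BLOCK, []))

    thread = threading.Thread(target=worker)
    thread.start()
    thread.join()
    return seen[0]


if __name__ == "__main__":
    # Block SIGUSR1 in the creating thread, then look at the child's mask.
    previous = signal.pthread_sigmask(signal.SIG_BLOCK, [signal.SIGUSR1])
    try:
        child_mask = mask_seen_by_new_thread()
    finally:
        signal.pthread_sigmask(signal.SIG_SETMASK, previous)
    print(signal.SIGUSR1 in child_mask)
```

The new thread should report SIGUSR1 as blocked because it inherits the creator's mask. Per-thread masks of a running process can be inspected in /proc/PID/task/TID/status.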
Comment 102 Ville Aakko 2014-02-20 14:19:38 UTC
Hi!

I've been running 334.16-r5 (and some previous versions, including 331.38 and perhaps later ones; I could check my emerge logs) seemingly without problems, though I haven't gotten around to checking the sigmask as described above (I need my computer, so I don't want to install the broken drivers).

Though, if I understand correctly, seemingly correct operation might not be enough to determine whether this is indeed fixed (maybe it's just a coincidence, or different circumstances cause a different kind of corruption that goes unobserved...).

But in any case I can confirm that newer drivers seem to work for me, and if others can confirm too, maybe this bug can be CLOSED FIXED?

Cheers!
Comment 103 Marat Radchenko 2016-10-27 20:31:44 UTC
I believe this has been obsolete for a long time.