Bug 290404 - x11-drivers/nvidia-drivers libGL.so.1 causes segfaults in other software
Summary: x11-drivers/nvidia-drivers libGL.so.1 causes segfaults in other software
Status: RESOLVED FIXED
Alias: None
Product: Gentoo Linux
Classification: Unclassified
Component: [OLD] Unspecified
Hardware: x86 Linux
Importance: High major
Assignee: Doug Goldstein (RETIRED)
URL: http://www.nvnews.net/vbulletin/showt...
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2009-10-24 19:03 UTC by Marcin Marszalek
Modified: 2009-11-04 12:25 UTC

See Also:
Package list:
Runtime testing required: ---


Description Marcin Marszalek 2009-10-24 19:03:43 UTC
It's the strangest bug I've ever encountered. Let me start with a list of observations:

1. If I eselect the opengl impl. to xorg-x11 there are no problems.
2. If I eselect the opengl impl. to nvidia, any program linked to /usr/lib/opengl/xorg-x11/lib/libGL.so.1 will segfault immediately.
3. This means I can't even start X with nvidia opengl - a distorted NVIDIA logo appears and the X process crashes [1].
4. The programs segfault even without X running; this includes nvidia-settings [2] and glxgears [3].
5. All this holds for nvidia-drivers versions 185.18.31, 185.18.36, 185.18.36-r1 and 190.42-r1. The 180.* versions no longer compile due to changes in the kernel (2.6.32-rc5).
6. The kernel module always loads fine. Updating the kernel did not help.
7. I re-emerged all plausible dependencies, including libXext, libXdmcp, libX11, libxcb, libXau, libvdpau and glibc itself. It did not help.

Basically, either libGL itself or the glibc calloc() call made from libGLcore seems to fail:

[1] X[8842] general protection ip:b743a7de sp:bf8aeb1c error:0 in libc-2.9.so[b73ff000+12f000]
[2] nvidia-settings[28414]: segfault at ff0a0000 ip b6c97be6 sp bfed69d8 error 6 in libGL.so.185.18.36[b6c45000+80000]
[3] glxgears[28409] general protection ip:b766fd46 sp:bfb85d08 error:0 in libc-2.9.so[b7601000+13c000]
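
The ip and base values in these kernel messages can be turned into an offset within the named library, which is sometimes easier to compare between runs than the raw address. A minimal sketch, assuming the bracketed part of each message is the mapping's start address and size; fault_offset is a hypothetical helper name, not an existing tool:

```shell
# Offset of a faulting instruction inside its mapped object, taken from a
# kernel log line of the form "ip <ip> ... in <object>[<base>+<size>]".
# Both arguments are hex strings without the 0x prefix.
fault_offset() {
    printf '0x%x\n' $(( 0x$1 - 0x$2 ))
}

# Values from the nvidia-settings line [2] above
fault_offset b6c97be6 b6c45000
```

For line [2] this prints 0x52be6, which lies inside the 0x80000-byte libGL mapping reported in the same message.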


Reproducible: Always

Steps to Reproduce:
1. emerge nvidia-drivers
2. reboot
3. run nvidia-settings or glxgears

Actual Results:  
Sample backtrace for glxgears crash:

(gdb) bt
#0  0xb7689d46 in calloc () from /lib/libc.so.6
#1  0xb6c81e89 in ?? () from /usr/lib/opengl/nvidia/lib/libGLcore.so.1
#2  0x00000001 in ?? ()
#3  0x00000034 in ?? ()
#4  0x00000009 in ?? ()
#5  0x00000002 in ?? ()
#6  0x080640a0 in ?? ()
#7  0x0000029a in ?? ()
#8  0x00007b75 in ?? ()
#9  0xb77df452 in ?? () from /usr/lib/opengl/nvidia/lib/libGL.so.1
#10 0x0000029a in ?? ()
#11 0x00007b75 in ?? ()
#12 0xb6c81e00 in ?? () from /usr/lib/opengl/nvidia/lib/libGLcore.so.1
#13 0x01c81e2c in ?? ()
#14 0x00007b75 in ?? ()
#15 0xb6c81e00 in ?? () from /usr/lib/opengl/nvidia/lib/libGLcore.so.1
#16 0xbfb7c224 in ?? ()
#17 0xb77dfa11 in _init () from /usr/lib/opengl/nvidia/lib/libGL.so.1
#18 0x00007b75 in ?? ()
#19 0x00000001 in ?? ()
#20 0x00000000 in ?? ()

The one for nvidia-settings is pretty trivial:

(gdb) bt
#0  0xb6c92be6 in ?? () from /usr/lib/opengl/nvidia/lib/libGL.so.1
#1  0x00000000 in ?? ()

Suspecting memory corruption, I ran valgrind as well:

overlord ~ # valgrind --tool=memcheck glxgears
==31950== Process terminating with default action of signal 11 (SIGSEGV)
==31950==  General Protection Fault
==31950==    at 0x415DDA6: __strtol_internal (in /lib/libc-2.9.so)
==31950==    by 0x4086C14: (within /usr/lib/opengl/nvidia/lib/libGL.so.185.18.36)
==31950==
==31950== Process terminating with default action of signal 11 (SIGSEGV)
==31950==  General Protection Fault
==31950==    at 0x42396B5: (within /lib/libc-2.9.so)
==31950==    by 0x4239404: (within /lib/libc-2.9.so)
==31950==    by 0x4239C21: (within /lib/libc-2.9.so)
==31950==    by 0x401F482: _vgnU_freeres (in /usr/lib/valgrind/x86-linux/vgpreload_core.so)
==31950==    by 0x5343E07: ???
==31950==    by 0x4086C14: (within /usr/lib/opengl/nvidia/lib/libGL.so.185.18.36)
==31950==
==31950== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 16 from 1)
==31950== malloc/free: in use at exit: 90,971 bytes in 29 blocks.
==31950== malloc/free: 75 allocs, 46 frees, 95,094 bytes allocated.
==31950== For counts of detected errors, rerun with: -v
==31950== searching for pointers to 29 not-freed blocks.
==31950== checked 2,433,948 bytes.
==31950==
Segmentation fault

overlord ~ # valgrind --tool=memcheck nvidia-settings
==31989== Invalid write of size 4
==31989==    at 0x4B93BE6: (within /usr/lib/opengl/nvidia/lib/libGL.so.185.18.36)
==31989==  Address 0xff0a0000 is not stack'd, malloc'd or (recently) free'd
==31989==
==31989== Process terminating with default action of signal 11 (SIGSEGV)
==31989==  Access not within mapped region at address 0xFF0A0000
==31989==    at 0x4B93BE6: (within /usr/lib/opengl/nvidia/lib/libGL.so.185.18.36)
==31989==
==31989== ERROR SUMMARY: 1 errors from 1 contexts (suppressed: 15 from 1)
==31989== malloc/free: in use at exit: 89,989 bytes in 25 blocks.
==31989== malloc/free: 72 allocs, 47 frees, 94,152 bytes allocated.
==31989== For counts of detected errors, rerun with: -v
==31989== searching for pointers to 25 not-freed blocks.
==31989== checked 3,502,132 bytes.
==31989==
Segmentation fault


Expected Results:  
Strangely, 185.18.31 worked for me before, but it has not since Oct 22, when I upgraded to 185.18.36 and everything started crashing.


So this raises the following questions:

1. Is this a glibc bug? I don't think so, because it would be a problem with basic memory management and would affect other software as well. Still, calloc() should probably never segfault, which is very strange.
2. Is this a bug in nvidia-drivers? If so, why did version 185.18.31 run flawlessly before and crash now? Very strange.
3. Is this a problem with xorg-server? That's the only related thing I changed recently, but there is no evidence for it.

Any clues? I'll post some more system info in a while.
Comment 1 Marcin Marszalek 2009-10-24 19:07:09 UTC
It seems to me that bug #281848 could be related, but it's somewhat different and no solution has been suggested there yet.

overlord ~ # emerge --info
Portage 2.1.7.1 (default/linux/x86/10.0, gcc-4.3.4, glibc-2.9_p20081201-r2, 2.6.32-rc5 i686)
=================================================================
System uname: Linux-2.6.32-rc5-i686-Intel-R-_Core-TM-2_Duo_CPU_E6750_@_2.66GHz-with-gentoo-1.12.11.1
Timestamp of tree: Sat, 24 Oct 2009 15:15:01 +0000
ccache version 2.4 [disabled]
app-shells/bash:     4.0_p28
dev-java/java-config: 2.1.9-r1
dev-lang/python:     2.6.2-r1
dev-util/ccache:     2.4-r7
dev-util/cmake:      2.6.4-r3
sys-apps/baselayout: 1.12.11.1
sys-apps/sandbox:    1.6-r2
sys-devel/autoconf:  2.13, 2.63-r1
sys-devel/automake:  1.5, 1.7.9-r1, 1.8.5-r3, 1.9.6-r2, 1.10.2
sys-devel/binutils:  2.18-r3
sys-devel/gcc-config: 1.4.1
sys-devel/libtool:   2.2.6a
virtual/os-headers:  2.6.30-r1
ACCEPT_KEYWORDS="x86"
CBUILD="i486-pc-linux-gnu"
CFLAGS="-march=nocona -O2 -pipe"
CHOST="i486-pc-linux-gnu"
CONFIG_PROTECT="/etc /usr/kde/3.5/env /usr/kde/3.5/share/config /usr/kde/3.5/shutdown /usr/share/config /var/bind /var/lib/hsqldb"
CONFIG_PROTECT_MASK="/etc/ca-certificates.conf /etc/env.d /etc/env.d/java/ /etc/fonts/fonts.conf /etc/gconf /etc/php/apache2-php5/ext-active/ /etc/php/cgi-php5/ext-active/ /etc/php/cli-php5/ext-active/ /etc/revdep-rebuild /etc/sandbox.d /etc/terminfo /etc/texmf/language.dat.d /etc/texmf/language.def.d /etc/texmf/updmap.d /etc/texmf/web2c /etc/udev/rules.d"
CXXFLAGS="-march=nocona -O2 -pipe"
DISTDIR="/usr/portage/distfiles"
FEATURES="assume-digests distlocks fixpackages news parallel-fetch protect-owned sandbox sfperms strict unmerge-logs unmerge-orphans userfetch"
GENTOO_MIRRORS="http://mirror.bytemark.co.uk/gentoo http://distfiles.gentoo.org http://www.ibiblio.org/pub/Linux/distributions/gentoo"
LDFLAGS="-Wl,-O1"
LINGUAS="en pl"
MAKEOPTS="-j4"
PKGDIR="/usr/portage/packages"
PORTAGE_CONFIGROOT="/"
PORTAGE_RSYNC_OPTS="--recursive --links --safe-links --perms --times --compress --force --whole-file --delete --stats --timeout=180 --exclude=/distfiles --exclude=/local --exclude=/packages"
PORTAGE_TMPDIR="/var/tmp"
PORTDIR="/usr/portage"
PORTDIR_OVERLAY="/usr/portage/local/layman/science /usr/local/portage"
SYNC="rsync://rsync.europe.gentoo.org/gentoo-portage"
USE="X a52 aac acl acpi alsa ao apache2 arts audiofile bash-completion berkdb bindist blas bluetooth bzip2 cairo calendar cdr clamav cli cracklib crypt css ctype cups curl curlwrappers cvs cxx dga doc dri dts dv dvb dvd dvdr encode enscript exif expat fam ffmpeg fftw firefox flac fltk foomaticdb fortran ftp gd gdbm geoip gif gimp glut gmp gnuplot gnutls gphoto2 gpm gps graphviz gsl gstreamer gtk gzip handbook hddtemp hdf5 iconv imagemagick imap imlib ipv6 isdnlog jabber jpeg jpeg2k kde kolab kontact lame lapack latex lcms libsamplerate lirc lm_sensors lzo mad matroska mhash mime mmap mmx mng modules mp3 mpeg mpi mplayer mudflap musepack ncurses nls nptl nptlonly nsplugin offensive ofx ogg openal openexr opengl openmp pam pcmcia pcntl pcre pda pdf perl php plotutils png postgres ppds pppd python qt3support qt4 raw readline reflection rss scanner sdl semantic-desktop session sharedmem shorten simplexml sip slang smp sndfile snmp sockets sox spell spl sse sse2 ssl subversion svg sysfs syslog sysvipc szip taglib tcpd theora threads tidy tiff truetype unicode usb v4l v4l2 vnc vorbis wavpack wifi win32codecs wxwidgets x264 x86 xcomposite xine xinerama xml xorg xpm xsl xv xvid zlib" ALSA_CARDS="ali5451 als4000 atiixp atiixp-modem bt87x ca0106 cmipci emu10k1 emu10k1x ens1370 ens1371 es1938 es1968 fm801 hda-intel intel8x0 intel8x0m maestro3 trident usb-audio via82xx via82xx-modem ymfpci" ALSA_PCM_PLUGINS="adpcm alaw asym copy dmix dshare dsnoop empty extplug file hooks iec958 ioplug ladspa lfloat linear meter mmap_emul mulaw multi null plug rate route share shm softvol" APACHE2_MODULES="actions alias auth_basic authn_alias authn_anon authn_dbm authn_default authn_file authz_dbm authz_default authz_groupfile authz_host authz_owner authz_user autoindex cache dav dav_fs dav_lock deflate dir disk_cache env expires ext_filter file_cache filter headers include info log_config logio mem_cache mime mime_magic negotiation rewrite setenvif speling status unique_id userdir usertrack 
vhost_alias" ELIBC="glibc" INPUT_DEVICES="keyboard mouse evdev" KERNEL="linux" LCD_DEVICES="bayrad cfontz cfontz633 glk hd44780 lb216 lcdm001 mtxorb ncurses text" LINGUAS="en pl" USERLAND="GNU" VIDEO_CARDS="fbdev glint intel mach64 mga neomagic nv r128 radeon savage sis tdfx trident vesa vga via vmware voodoo"
Unset:  CPPFLAGS, CTARGET, EMERGE_DEFAULT_OPTS, FFLAGS, INSTALL_MASK, LANG, LC_ALL, PORTAGE_COMPRESS, PORTAGE_COMPRESS_FLAGS, PORTAGE_RSYNC_EXTRA_OPTS
Comment 2 Marcin Marszalek 2009-10-25 22:53:28 UTC
OK, I've spent all day, but it seems that I have found the source of the problem.

For some unknown reason, the nvidia-drivers ebuild links to the non-tls version of nvidia-tls on my system. If I switch the link to the tls version, all my problems disappear. It seems that all ebuilds in nvidia-drivers have the same want_tls() function. Could someone explain to me the logic behind it and why it fails for me? I put my comments after ##:

want_tls() {
        # For uclibc or anything non glibc, return false
        has_version sys-libs/glibc || return 1

## We should reach this point as I have sys-libs/glibc-2.9_p20081201-r2

        # Old versions of glibc were lt/no-tls only
        has_version '<sys-libs/glibc-2.3.2' && return 1

## We should reach this point as well, as long as 9_p20081201 > 3 (I assume so)

        if use x86 ; then
                case ${CHOST/-*} in
                        i486|i586|i686) ;;
                        *) return 1 ;;
                esac
        fi

## My profile is x86 and CHOST in make.conf is i486-pc-linux-gnu.
## So in the above the first pattern matches and we reach this point.

        # 2.3.5 turned off tls for linuxthreads glibc on i486 and i586
        if use x86 && has_version '>=sys-libs/glibc-2.3.5' ; then
                case ${CHOST/-*} in
                        i486|i586) return 1 ;;
                esac
        fi

## And the above would match suggesting that I should not be using tls.
## But my glibc emerge first states "Configuring GLIBC for nptl"
## And then "checking for i386 TLS support... yes"
## And to add to the confusion, actually my uname -m says i686 :-)

        # These versions built linuxthreads version to support tls, too
        has_version '>=sys-libs/glibc-2.3.4.20040619-r2' && return 0

## So it seems to me that I do not really have a chance to reach the above.
## Is there a bug in this logic?

        return 1
}
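
The path through want_tls() traced in the ## comments above can be checked outside Portage with a small stand-in. This is only a sketch: version_ge is a hypothetical replacement for has_version that relies on GNU sort -V and ignores Gentoo suffixes such as _p20081201 or -r2, and tls_decision assumes an x86 profile with glibc installed.

```shell
# Hypothetical stand-in for Portage's has_version: true if $1 >= $2.
# Relies on GNU sort -V and only handles plain dotted versions.
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Prints "tls" or "no-tls", running the same checks in the same order as
# want_tls(). Arguments: CHOST, installed glibc version.
tls_decision() {
    chost_cpu=${1%%-*}
    glibc=$2

    # Old versions of glibc were no-tls only
    version_ge "$glibc" 2.3.2 || { echo no-tls; return; }

    case $chost_cpu in
        i486|i586|i686) ;;
        *) echo no-tls; return ;;
    esac

    # The check questioned above: it matches every glibc >= 2.3.5
    if version_ge "$glibc" 2.3.5; then
        case $chost_cpu in
            i486|i586) echo no-tls; return ;;
        esac
    fi

    # Only reachable for i686, or for glibc older than 2.3.5
    version_ge "$glibc" 2.3.4.20040619 && { echo tls; return; }
    echo no-tls
}

tls_decision i486-pc-linux-gnu 2.9
tls_decision i686-pc-linux-gnu 2.9
```

With glibc 2.9, the i486 CHOST prints no-tls and the i686 CHOST prints tls, which matches the behaviour described in this comment.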
Comment 3 Matthias Krull 2009-10-26 11:43:06 UTC
This is definitely a logical flaw.

Changing

# 2.3.5 turned off tls for linuxthreads glibc on i486 and i586
if use x86 && has_version '>=sys-libs/glibc-2.3.5' ; then
to 
if use x86 && has_version '=sys-libs/glibc-2.3.5' ; then

resolved this issue for me, too.

I'm not sure if this is a good solution, as I don't know whether other versions of glibc also turned off tls.
Comment 4 sf 2009-10-27 10:23:46 UTC
Same bug here: i586-pc-linux-gnu, glibc-2.9_p20081201-r2, nvidia-drivers-96.43.13

Manually symlinking /usr/lib/opengl/nvidia/lib/libnvidia-tls.so* to ../tls/libnvidia-tls.so* works for me.
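
The manual workaround could be scripted roughly as follows. This is a sketch under assumptions: relink_nvidia_tls is a hypothetical helper, the tls directory is taken to sit next to lib under the same root (as the ../tls path implies), and regular files in lib are deliberately left alone so that only symlinks are re-pointed.

```shell
# Re-point every libnvidia-tls.so* symlink in <root>/lib at the file of the
# same name in <root>/tls. <root> defaults to /usr/lib/opengl/nvidia.
relink_nvidia_tls() {
    root=${1:-/usr/lib/opengl/nvidia}
    for target in "$root"/tls/libnvidia-tls.so*; do
        [ -e "$target" ] || continue        # glob matched nothing
        name=${target##*/}
        link=$root/lib/$name
        # Never overwrite a real file, only symlinks or missing entries
        if [ -e "$link" ] && [ ! -L "$link" ]; then
            continue
        fi
        ln -sfn "../tls/$name" "$link"
    done
}
```

Run as root, relink_nvidia_tls with no argument would act on /usr/lib/opengl/nvidia. Re-emerging nvidia-drivers will presumably recreate the original links until the ebuild itself is fixed.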
Comment 5 Doug Goldstein (RETIRED) gentoo-dev 2009-11-03 03:30:46 UTC
This should be fixed with 190.42-r2.
Comment 6 sf 2009-11-04 12:25:03 UTC
Please fix it in 96.43.13, too. TIA