Bug 353224 - sys-libs/glibc-2.12.2: Broken thread local storage (TLS) initialization
Status: RESOLVED FIXED
Alias: None
Product: Gentoo Linux
Classification: Unclassified
Component: [OLD] Core system (show other bugs)
Hardware: All Linux
Importance: High normal
Assignee: Gentoo Toolchain Maintainers
URL: http://sources.redhat.com/bugzilla/sh...
Whiteboard:
Keywords:
Duplicates: 338513
Depends on:
Blocks: 315345 317557 338513 351897 366293 366507
Reported: 2011-01-30 12:26 UTC by Martin von Gagern
Modified: 2011-08-09 05:41 UTC (History)
CC List: 7 users

See Also:
Package list:
Runtime testing required: ---


Attachments
Demo case (glibc-tls-bug.sh,1.81 KB, text/plain)
2011-01-30 14:21 UTC, Martin von Gagern
Do update_slotinfo after add_to_slotinfo (gentoo353224a.patch,2.18 KB, patch)
2011-01-30 15:03 UTC, Martin von Gagern

Description Martin von Gagern 2011-01-30 12:26:21 UTC
Today I've debugged https://github.com/cschwan/sage-on-gentoo/issues/issue/40 one more time, and finally understood what's going wrong there. The problem seems to lie in glibc itself, and bug #338513 is just another manifestation.

The problem manifests itself in the fact that __tls_get_addr returns NULL, causing a SIGSEGV upon access to thread local variables. Obviously this shouldn't happen.

For the code involved in the description below, please refer to dl-tls.c and dl-open.c:
http://sourceware.org/git/?p=glibc.git;a=blob;f=elf/dl-tls.c;h=824adc196d1ab63646c7e8ea4e3546107b7590bd;hb=3a33e487eeb65e2f1f633581c56bee2c60d0ca43
http://sourceware.org/git/?p=glibc.git;a=blob;f=elf/dl-open.c;h=cf8e8cc6715f9f44b3d5d88ee4cc9b709cb37226;hb=3a33e487eeb65e2f1f633581c56bee2c60d0ca43

The problem occurs when the python runtime loads the _gtk.so module, along with all its dependencies. Some of the dependencies make use of thread local storage: libpixman.so, libEGL.so, libstdc++.so, libnvidia-tls.so and libuuid.so. They are assigned module ids 2 through 6, as 1 is for libc.so. _dl_next_tls_modid is a suitable breakpoint here.

Next the modules are loaded to the global slot database using the function _dl_add_to_slotinfo, one after the other. The global generation counter, GL(dl_tls_generation), stays at value 1 the whole time, as it is only incremented after the whole dl_open call for the complete set of libraries is done (dl-open.c line 458). So all new libraries are marked as belonging to the next generation, generation 2. This information is stored in their slots.
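The generation tagging described above can be sketched as a small stand-alone model. This is not glibc code: the struct layout and the names model_loader, model_add_to_slotinfo and model_finish_dlopen are hypothetical, chosen only to mirror the roles of GL(dl_tls_generation) and _dl_add_to_slotinfo as described in this report.

```c
#include <assert.h>
#include <stddef.h>

#define MODEL_MAX_MODULES 8

/* Simplified stand-in for the loader's bookkeeping; not glibc's real layout. */
struct model_loader {
    size_t global_generation;            /* plays the role of GL(dl_tls_generation) */
    size_t slot_gen[MODEL_MAX_MODULES];  /* generation stored per module id, 0 = unused */
};

/* Register one module: its slot is tagged with the *next* generation. */
void model_add_to_slotinfo(struct model_loader *ld, size_t modid)
{
    assert(modid < MODEL_MAX_MODULES);
    ld->slot_gen[modid] = ld->global_generation + 1;
}

/* End of the whole dlopen call: only now does the global counter advance. */
void model_finish_dlopen(struct model_loader *ld)
{
    ld->global_generation++;
}
```

Starting from a global generation of 1 and registering module ids 2 through 6, every slot in this model carries generation 2 while the global counter still reads 1 until model_finish_dlopen runs.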

However, some of the libraries need some special kind of tls initialization. For libEGL.so and libnvidia-tls.so, imap->l_need_tls_init in dl-open.c line 428 will evaluate true, causing an immediate call to _dl_update_slotinfo for module ids 3 and 5, intermixed with the _dl_add_to_slotinfo calls.

This is where things go wrong: when running _dl_update_slotinfo for module 3, this function finds that module 3 is in generation 2. It then updates the dtv (the thread-local vector of thread-local data blocks for the modules) with the data for all generation 2 modules that it knows about at that point, i.e. modules 2 and 3. It then marks the dtv to be up to date with generation 2.

This mark becomes incorrect when subsequent calls to _dl_add_to_slotinfo add more slots to generation 2. _dl_update_slotinfo is executed again for module 4, but in this case the check in line 571 of dl-tls.c skips the actual update, as the dtv seems to be up to date already, as judged by its generation counter.
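To make the stale mark concrete, here is a minimal model of the interleaving, assuming the update logic works as described in this report. All names (model_tls, model_update_slotinfo, model_replay_interleaved, MODEL_UNALLOCATED) are hypothetical, and the replay uses the premature updates for module ids 3 and 5 mentioned earlier. It is a sketch of the bookkeeping only, not the actual dl-tls.c code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MODEL_MAX_MODULES 8
#define MODEL_UNALLOCATED ((void *) (intptr_t) -1)  /* "known, allocate lazily" */

struct model_tls {
    size_t slot_gen[MODEL_MAX_MODULES];  /* 0 = slot unused */
    size_t dtv_gen;                      /* generation the dtv claims to match */
    void  *dtv[MODEL_MAX_MODULES];       /* NULL = never initialized */
};

void model_add_to_slotinfo(struct model_tls *t, size_t modid, size_t global_gen)
{
    assert(modid < MODEL_MAX_MODULES);
    t->slot_gen[modid] = global_gen + 1;
}

/* Returns how many dtv entries were initialized by this call. */
int model_update_slotinfo(struct model_tls *t, size_t req_modid)
{
    assert(req_modid < MODEL_MAX_MODULES);
    size_t new_gen = t->slot_gen[req_modid];
    int initialized = 0;

    if (t->dtv_gen >= new_gen)
        return 0;                        /* dtv looks current: skip everything */

    for (size_t m = 1; m < MODEL_MAX_MODULES; m++) {
        size_t gen = t->slot_gen[m];
        if (gen == 0 || gen > new_gen || gen <= t->dtv_gen)
            continue;                    /* unused, too new, or already handled */
        t->dtv[m] = MODEL_UNALLOCATED;
        initialized++;
    }
    t->dtv_gen = new_gen;                /* the mark that later becomes stale */
    return initialized;
}

/* Replays the interleaved order; returns the entries set by the second update. */
int model_replay_interleaved(struct model_tls *t)
{
    const size_t global_gen = 1;
    model_add_to_slotinfo(t, 2, global_gen);
    model_add_to_slotinfo(t, 3, global_gen);
    model_update_slotinfo(t, 3);         /* premature: sees modules 2 and 3 only */
    model_add_to_slotinfo(t, 4, global_gen);
    model_add_to_slotinfo(t, 5, global_gen);
    int second = model_update_slotinfo(t, 5);
    model_add_to_slotinfo(t, 6, global_gen);
    return second;
}
```

In this model the second update initializes nothing, so the entries for modules 4, 5 and 6 stay NULL even though the dtv claims to be at generation 2.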

Later on, when at runtime some code in module 4 attempts to access its thread local storage, __tls_get_addr determines in line 758 of dl-tls.c that no update to the dtv is required, as its generation still seems up to date. Therefore an uninitialized part of the dtv will be returned, which will usually be a NULL pointer. That's what's causing the application to SIGSEGV.
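The lookup path can be modelled roughly as follows. This is a simplified sketch under the assumptions stated in this report, not the real __tls_get_addr: the names model_thread and model_tls_get_addr, the fixed-size storage array, and the way the slow path resynchronizes are all hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MODEL_MAX_MODULES 8
#define MODEL_BLOCK_SIZE 64
#define MODEL_UNALLOCATED ((void *) (intptr_t) -1)  /* "known, allocate lazily" */

struct model_thread {
    size_t dtv_gen;                                  /* generation the dtv claims */
    void  *dtv[MODEL_MAX_MODULES];                   /* NULL = never initialized */
    char   storage[MODEL_MAX_MODULES][MODEL_BLOCK_SIZE];
};

void *model_tls_get_addr(struct model_thread *th, size_t global_gen,
                         size_t modid, size_t offset)
{
    assert(modid < MODEL_MAX_MODULES && offset < MODEL_BLOCK_SIZE);

    if (th->dtv_gen != global_gen) {
        /* Slow path: the dtv is known to be behind, so resynchronize.
           The model just marks untouched entries for lazy allocation. */
        for (size_t m = 1; m < MODEL_MAX_MODULES; m++)
            if (th->dtv[m] == NULL)
                th->dtv[m] = MODEL_UNALLOCATED;
        th->dtv_gen = global_gen;
    }

    void *p = th->dtv[modid];
    if (p == MODEL_UNALLOCATED) {
        p = th->storage[modid];                      /* lazy allocation */
        th->dtv[modid] = p;
    }

    /* The model has no NULL check here, so an untouched entry is
       handed back as-is when the generation check passed wrongly. */
    return (void *) ((uintptr_t) p + offset);
}
```

With a dtv that already claims generation 2 but has no entry for module 4, the fast path is taken and the model returns NULL for offset 0. With a dtv that is honestly behind, the slow path runs and a usable address comes back.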

I have tried to reproduce this in a small demo setup, but so far I couldn't get my own code to reproduce the case where l_need_tls_init is set. The corresponding code is line 111 in dl-reloc.c:
http://sourceware.org/git/?p=glibc.git;a=blob;f=elf/dl-reloc.c;h=23cb59cbc89cf65c5c3f8c56b0459daf8949383f;hb=3a33e487eeb65e2f1f633581c56bee2c60d0ca43#l111
Comments in functions leading up to this indicate that it has something to do with discouraged practices, so I'm not sure how to willfully enter that branch.

I'm pretty convinced that this whole thing is an UPSTREAM issue, but as http://www.gnu.org/software/libc/bugs.html#what_to_report requests users to report glibc issues with their distros first, that is what I am doing here.
Comment 1 Rafał Mużyło 2011-01-30 13:35:46 UTC
Do you think bug 351897 could be related ?
Comment 2 Martin von Gagern 2011-01-30 14:21:28 UTC
Created attachment 261100 [details]
Demo case

OK, the attached script reproduces the problem for me. Simply execute it in an otherwise empty directory, and you should see this (after some gcc commands):

./demo
&tbaz=(nil)
glibc-tls-bug.sh: line 75:  1752 Segmentation fault      "$@"

So the address of some thread local variable is calculated to be NULL, causing the segmentation fault when its value is accessed.
Comment 3 Martin von Gagern 2011-01-30 14:29:07 UTC
(In reply to comment #1)
> Do you think bug 351897 could be related ?

Very very likely. I can reproduce that issue, and I can tell after a quick gdb that at least __tls_get_addr returns NULL there as well. Breaking in _dl_add_to_slotinfo and _dl_update_slotinfo, I find their invocations to be intermixed again, with premature updates for libEGL.so and libnvidia-tls.so.

By the way, libEGL.so is media-libs/mesa-7.10 here, while libnvidia-tls.so comes from x11-drivers/nvidia-drivers-260.19.36.

As I've only seen the issue on x86_64 so far, I'll mark the bug accordingly for the moment. But please give other architectures a try with the attached script, and report whether you can reproduce it anywhere else as well.
Comment 4 Rafał Mużyło 2011-01-30 14:56:07 UTC
Well, bug 351897 affects at very least x86 too, so if that one is related, it's not arch specific.
Comment 5 Rafał Mużyło 2011-01-30 14:57:28 UTC
Also, I'm on radeon open source, so it's pure mesa here.
Comment 6 Martin von Gagern 2011-01-30 15:03:51 UTC
Created attachment 261104 [details, diff]
Do update_slotinfo after add_to_slotinfo

This patch fixes the issue for me. It should be reasonably safe to apply.
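The idea named in the attachment title, running the updates only after all modules have been added, can be illustrated with a small model. This is not the patch itself: model_tls, model_add, model_update and model_load_deferred are hypothetical names, and the update logic is a simplification based on the description in this report.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MODEL_MAX_MODULES 8
#define MODEL_UNALLOCATED ((void *) (intptr_t) -1)  /* "known, allocate lazily" */

struct model_tls {
    size_t slot_gen[MODEL_MAX_MODULES];  /* 0 = slot unused */
    size_t dtv_gen;                      /* generation the dtv claims to match */
    void  *dtv[MODEL_MAX_MODULES];       /* NULL = never initialized */
};

static void model_add(struct model_tls *t, size_t modid, size_t global_gen)
{
    assert(modid < MODEL_MAX_MODULES);
    t->slot_gen[modid] = global_gen + 1;
}

static void model_update(struct model_tls *t, size_t req_modid)
{
    size_t new_gen = t->slot_gen[req_modid];
    if (t->dtv_gen >= new_gen)
        return;                          /* dtv looks current: nothing to do */
    for (size_t m = 1; m < MODEL_MAX_MODULES; m++) {
        size_t gen = t->slot_gen[m];
        if (gen != 0 && gen <= new_gen && gen > t->dtv_gen)
            t->dtv[m] = MODEL_UNALLOCATED;
    }
    t->dtv_gen = new_gen;
}

/* Deferred ordering: register every new module first, then run the
   updates. Returns how many new modules still have a NULL dtv entry. */
int model_load_deferred(struct model_tls *t, size_t global_gen,
                        const size_t *modids, const int *needs_init, size_t n)
{
    for (size_t i = 0; i < n; i++)
        model_add(t, modids[i], global_gen);
    for (size_t i = 0; i < n; i++)
        if (needs_init[i])
            model_update(t, modids[i]);

    int missing = 0;
    for (size_t i = 0; i < n; i++)
        if (t->dtv[modids[i]] == NULL)
            missing++;
    return missing;
}
```

Called with module ids 2 through 6 and the init flag set for ids 3 and 5, the first update in this model already sees all five slots, so no entry is left NULL and the second update is a harmless no-op.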
Comment 7 Rafał Mużyło 2011-01-30 15:08:06 UTC
And an interesting note: your example segfaults here if compiled as-is,
but passes, if -O0 is changed to -O/-O2.
Comment 8 Martin von Gagern 2011-01-30 15:18:42 UTC
Reported this upstream, as the evidence of the issue is pretty clear in the sources, and the fact that my patch works confirms my interpretation.
Upstream report: http://sources.redhat.com/bugzilla/show_bug.cgi?id=12453
Comment 9 Martin von Gagern 2011-01-30 15:22:48 UTC
(In reply to comment #7)
> And an interesting note: your example segfaults here if compiled as-is,
> but passes, if -O0 is changed to -O/-O2.

The difference probably lies here:
objdump -R libbar.so | grep R_X86_64_DTPMOD64
Although I have to confess I currently fail to understand how the optimized code does its TLS stuff. Never mind, we know things fail in real world scenarios.
Comment 10 Toralf Förster gentoo-dev 2011-02-03 16:30:07 UTC
Today I did a "ebuild sys-libs/glibc-~2.12.2 test" but it exited with : 
...
make[1]: Target `check' not remade because of errors.

The appropriate portage log file contains a lot of lines like :
/var/tmp/portage/sys-libs/glibc-2.12.2/work/build-default-i686-pc-linux-gnu-nptl//nptl/tst-tls5: symbol lookup error: /var/tmp/portage/sys-libs/glibc-2.12.2/work/build-default-i686-pc-linux-gnu-nptl/nptl/tst-tls5modd.so: undefined symbol: tls_registry

Interestingly, /var/log/messages contains these two lines:
2011-02-03T17:09:26.962+01:00 n22 kernel: Maximum lock depth 1024 reached task: ld-linux.so.2 (22161)

2011-02-03T17:18:09.603+01:00 n22 kernel: ld-linux.so.2[30117]: segfault at 0 ip   (null) sp bfa00514 error 4 in next[8048000+2000]

"17:18" is the time stamp of portage log file too. Is this related to the reported issue or should I file a separate bug report for this test failure ?
Comment 11 Martin von Gagern 2011-02-03 18:08:59 UTC
(In reply to comment #10)
> Is this related to the
> reported issue or should I file a separate bug report for this test failure ?

Doesn't look related. To be sure, you could apply my patch and try the selftest after that:
ebuild $(equery w sys-libs/glibc) clean unpack
wget -O- 'https://bugs.gentoo.org/attachment.cgi?id=261104' \
 | patch /var/tmp/portage/sys-libs/glibc-*/work/glibc-*/elf/dl-open.c
ebuild $(equery w sys-libs/glibc) test

If you get the same error from these commands, I'd say it's a different bug.

I tried to reproduce your problem here, but encountered a number of different selftest failures instead, as reported in bug #270997. Perhaps that report is related to your problem, even if tst-tls5 occurs neither in that report nor in my list. I'd say open a new report but mention that as possibly related.
Comment 12 Toralf Förster gentoo-dev 2011-02-04 12:25:28 UTC
(In reply to comment #11)
Yep, I filed a new bug #353224
Comment 13 Martin von Gagern 2011-02-04 12:35:30 UTC
(In reply to comment #12)
> Yep, I filed a new bug #353224

You mean bug #353682, as #353224 is this one here.
Comment 14 SpanKY gentoo-dev 2011-03-01 00:20:35 UTC
Looking at this, it seems like the breakage isn't a regression and is actually in glibc-2.11/older too. Would you agree with that assessment?
Comment 15 Martin von Gagern 2011-03-01 08:42:14 UTC
(In reply to comment #14)
> looking at this, it seems like the breakage isnt a regression and is actually
> in glibc-2.11/older too.  would you agree with that assessment ?

Yes, looks that way. I'd guess the problem was likely introduced in 
http://sourceware.org/git/?p=glibc.git;a=commit;h=9dcafc559763e339d4a79580c333 from 2005, which would mean glibc-2.4 already contains the problem.
Comment 16 David Kirkby 2011-04-03 03:56:12 UTC
I'm seeing the bug mentioned on the Sage on Gentoo site

https://github.com/cschwan/sage-on-gentoo/issues/issue/40

using OpenSolaris, where the Sun C library is used rather than the GNU C library. Hence I'm a bit suspicious that this is really the problem. See more information at


https://github.com/cschwan/sage-on-gentoo/issues/issue/40
http://trac.sagemath.org/sage_trac/ticket/11116

Dave
Comment 17 Marc Joliet 2011-04-03 15:32:48 UTC
I have glibc-2.11.3 and the attached patch fixes bug #338513 for me. I applied it via the instructions found at https://github.com/cschwan/sage-on-gentoo/issues/#issue/40/comment/722867, after which plotting with matplotlib started working again. So far, after rebooting everything seems to be fine.

>> (0) % emerge --info glibc
Portage 2.1.9.42 (default/linux/amd64/10.0, gcc-4.4.5, glibc-2.11.3-r0, 2.6.38-gentoo-r1 x86_64)
=================================================================
                        System Settings
=================================================================
System uname: Linux-2.6.38-gentoo-r1-x86_64-AMD_Athlon-tm-_64_X2_Dual_Core_Processor_4200+-with-gentoo-2.0.2
Timestamp of tree: Sat, 02 Apr 2011 20:15:01 +0000
app-shells/bash:     4.1_p9
dev-lang/python:     2.7.1-r1, 3.1.3-r1
dev-util/cmake:      2.8.4
sys-apps/baselayout: 2.0.2
sys-apps/openrc:     0.8.0
sys-apps/sandbox:    2.4
sys-devel/autoconf:  2.13, 2.65-r1
sys-devel/automake:  1.9.6-r3, 1.10.3, 1.11.1
sys-devel/binutils:  2.20.1-r1
sys-devel/gcc:       4.4.5
sys-devel/gcc-config: 1.4.1
sys-devel/libtool:   2.2.10
sys-devel/make:      3.81-r2
virtual/os-headers:  2.6.36.1 (sys-kernel/linux-headers)
ACCEPT_KEYWORDS="amd64"
ACCEPT_LICENSE="* -@EULA"
CBUILD="x86_64-pc-linux-gnu"
CFLAGS="-O2 -march=native -pipe"
CHOST="x86_64-pc-linux-gnu"
CONFIG_PROTECT="/etc /usr/kde/3.5/share/config/kdm /usr/share/config /usr/share/gnupg/qualified.txt"
CONFIG_PROTECT_MASK="/etc/ca-certificates.conf /etc/env.d /etc/fonts/fonts.conf /etc/games/angband/edit/ /etc/gconf /etc/gentoo-release /etc/revdep-rebuild /etc/sandbox.d /etc/terminfo /etc/texmf/language.dat.d /etc/texmf/language.def.d /etc/texmf/updmap.d /etc/texmf/web2c"
CXXFLAGS="-O2 -march=native -pipe"
DISTDIR="/usr/portage/distfiles"
EMERGE_DEFAULT_OPTS="--with-bdeps=y"
FEATURES="assume-digests binpkg-logs buildpkg collision-protect distlocks fixlafiles fixpackages metadata-transfer news parallel-fetch protect-owned sandbox sfperms strict unknown-features-warn unmerge-logs unmerge-orphans userfetch userpriv usersandbox"
FFLAGS=""
GENTOO_MIRRORS="http://de-mirror.org/distro/gentoo/ ftp://ftp-stud.fht-esslingen.de/pub/Mirrors/gentoo/ ftp://mirror.muntinternet.net/pub/gentoo/"
LANG="de_DE.UTF-8"
LDFLAGS="-Wl,-O1 -Wl,--as-needed"
LINGUAS="en_US en en_GB de"
MAKEOPTS="-s -j3"
PKGDIR="/usr/portage/packages"
PORTAGE_COMPRESS="xz"
PORTAGE_COMPRESS_FLAGS="-9"
PORTAGE_CONFIGROOT="/"
PORTAGE_RSYNC_OPTS="--recursive --links --safe-links --perms --times --compress --force --whole-file --delete --stats --timeout=180 --exclude=/distfiles --exclude=/local --exclude=/packages"
PORTAGE_TMPDIR="/var/tmp"
PORTDIR="/usr/portage"
PORTDIR_OVERLAY="/usr/local/portage/layman/sunrise /usr/local/portage/layman/pcsx2 /usr/local/portage/layman/science /usr/local/portage"
SYNC="rsync://rsync.europe.gentoo.org/gentoo-portage"
USE="3dnow 3dnowext X a52 aac aalib accessibility acl acpi alsa amd64 audiofile avahi bash-completion berkdb blas branding bzip2 cairo caps cdda cdinstall cdr chipcard cjk cli consolekit cracklib crypt css cups cxx dbus dga djvu dri dssi dts dvd dvdnav dvdr dvi encode exif fbcon ffmpeg fftw flac fortran ftp fuse gdbm gif gimp glitz glut gmp gnuplot gnutls gpm gtk hbci iconv idn imlib inotify ipv6 jack jpeg jpeg2k kipi ladspa lapack lash latex lcms libcaca libnotify libsamplerate lm_sensors logrotate lzma mad matroska mikmod mjpeg mmx mmxext mng modplug modules mp3 mp4 mpeg mudflap multilib musepack musicbrainz ncurses nfs nls nntp nptl nptlonly nsplugin ntp offensive ogg openal openexr opengl openmp pam pango pcre pdf plotutils png policykit ppds pppd pulseaudio python qt4 quicktime rar readline rtsp samba sasl schroedinger sdl session sid slang slp smp sndfile speex spell sse sse2 sse3 ssl startup-notification svg sysfs taglib tcpd theora threads tiff timidity truetype unicode usb vaapi vcd vim-syntax vorbis vpx webkit wma x264 xattr xcb xcomposite xface xft xml xmp xorg xpm xscreensaver xulrunner xv xvid xvmc zeroconf zlib zsh-completion" ALSA_CARDS="ice1724 hda-intel usb-audio" ALSA_PCM_PLUGINS="adpcm alaw asym copy dmix dshare dsnoop empty extplug file hooks iec958 ioplug ladspa lfloat linear meter mmap_emul mulaw multi null plug rate route share shm softvol" APACHE2_MODULES="actions alias auth_basic authn_alias authn_anon authn_dbm authn_default authn_file authz_dbm authz_default authz_groupfile authz_host authz_owner authz_user autoindex cache cgi cgid dav dav_fs dav_lock deflate dir disk_cache env expires ext_filter file_cache filter headers include info log_config logio mem_cache mime mime_magic negotiation rewrite setenvif speling status unique_id userdir usertrack vhost_alias" CAMERAS="ptp2" COLLECTD_PLUGINS="df interface irq load memory rrdtool swap syslog" ELIBC="glibc" GPSD_PROTOCOLS="ashtech aivdm earthmate evermore fv18 garmin garmintxt gpsclock 
itrax mtk3301 nmea ntrip navcom oceanserver oldstyle oncore rtcm104v2 rtcm104v3 sirf superstar2 timing tsip tripmate tnt ubx" INPUT_DEVICES="evdev" KERNEL="linux" LCD_DEVICES="bayrad cfontz cfontz633 glk hd44780 lb216 lcdm001 mtxorb ncurses text" LINGUAS="en_US en en_GB de" PHP_TARGETS="php5-3" RUBY_TARGETS="ruby18" USERLAND="GNU" VIDEO_CARDS="radeon vesa" XTABLES_ADDONS="quota2 psd pknock lscan length2 ipv4options ipset ipp2p iface geoip fuzzy condition tee tarpit sysrq steal rawnat logmark ipmark dhcpmac delude chaos account" 
Unset:  CPPFLAGS, CTARGET, INSTALL_MASK, LC_ALL, PORTAGE_BUNZIP2_COMMAND, PORTAGE_RSYNC_EXTRA_OPTS

=================================================================
                        Package Settings
=================================================================

sys-libs/glibc-2.11.3 was built with the following:
USE="(multilib) nls -debug -gd -glibc-omitfp (-hardened) -profile (-selinux) -vanilla"
CFLAGS="-march=native -pipe -O2 -fno-strict-aliasing"
CXXFLAGS="-march=native -pipe -O2 -fno-strict-aliasing"
Comment 18 Martin von Gagern 2011-04-11 14:10:14 UTC
Instructions to include this patch on your system: in order to build and install a patched version of your glibc, execute these commands as root:

# mkdir -p /etc/portage/patches/sys-libs/glibc
# wget -O /etc/portage/patches/sys-libs/glibc/bug353224.patch \
  'http://bugs.gentoo.org/attachment.cgi?id=261104'
# emerge -1 sys-libs/glibc

If you want to remove the patch again, either because it got included in portage or because you encounter troubles and want to know if it is to blame, you can do so by executing the following commands as root:

# rm /etc/portage/patches/sys-libs/glibc/bug353224.patch
# emerge -1 sys-libs/glibc

If there are problems introduced by this patch (i.e. they occur with the patch but not without it), please report them here. If you encounter issues fixed by this patch, mark the corresponding bug reports as depending on this one here, or if there is no dedicated report for it, leave a short comment here.
Comment 19 Steffen Schaumburg 2011-04-13 22:26:26 UTC
The patch has fixed my problem with drawing graphs with fpdb (not in portage), thanks :)
Comment 20 Martin von Gagern 2011-05-15 08:20:43 UTC
Upstream has committed a slightly modified fix:
http://sourceware.org/git/?p=glibc.git;a=commitdiff;h=d26dfc60edc8
Can we get this revbumped into portage before the next glibc release?
Comment 22 Sébastien Fabbro (RETIRED) gentoo-dev 2011-08-09 05:41:58 UTC
*** Bug 338513 has been marked as a duplicate of this bug. ***