There appears to be a threading issue on NPTL-enabled systems. So far it has mainly been demonstrated by Mono applications, but it could potentially affect other threaded apps as well. I'm posting this as a new bug because the original bug focused (incorrectly) on Mono. Here's a fresh start.

Here is a sample program that better demonstrates the issue, submitted by Canal Vorfeed on bug #54603. Here's Canal:

----------------------------------------------------------------------------
OK, I've spent a few hours on this issue and found where the REAL problem lies. Good news: it's NOT Boehm's GC and it's NOT mono. Bad news: it's a problem with glibc itself :-(

I started with the "GC is broken with nptl" sample by "Peter Johanson" and played with it for a few hours. In the end I just removed Boehm's GC completely (just plain old malloc) and... it still deadlocks somewhere. So we should stop playing with mono and try to address the REAL issue: deadlocks somewhere in the nptl library itself :-( Unfortunately I'm not a glibc guru.

Program:

    #include <pthread.h>
    #include <stdio.h>
    #include <stdlib.h>   /* malloc */
    #include <unistd.h>   /* sleep */

    void *thread_function (void *args)
    {
      int j;
      char *str;

      printf("starting thread!\n");
      for (j = 0; j < 24; j++) {
        str = (char *)malloc(240);
        printf("malloc in thread !\n");
      }
      pthread_yield();
      return NULL;
    }

    int main (void)
    {
      int i;
      pthread_t thread;
      char *str;

      for (i=0;i<1000;i++) {
        pthread_create( &thread, NULL, thread_function, (void *)i);
        pthread_yield();
        str = (char *)malloc(240);
        printf("%d threads\n", i);
      }
      sleep(10);
    }

$ ./testpgm | grep 'malloc in thread' | wc -l
9168

Without nptl I get the expected 24000 ...
-------------------------------------------------------------------

Nice work tracking this down. Let's have the toolchain team take a look at this, or try to push it upstream.
*** Bug 54603 has been marked as a duplicate of this bug. ***
*** Bug 63713 has been marked as a duplicate of this bug. ***
Just a side comment: it does not look like a deadlock problem after all. It's just that glibc/nptl cannot create more than 384 threads (even though THREADS_MAX in glibc is 100000 and /proc/sys/kernel/threads-max is 14336). Since LinuxThreads CAN create the requested 1000 threads on the same system (in fact it's not even two systems: I have Gentoo with glibc/LinuxThreads installed in a chroot jail), it's not a kernel issue. Not sure about mono: can it actually hit this limit in real-world applications? If the answer is no, then the mono bug should be reopened...
Oops. Of course, even if it's not a deadlock condition it still CAN affect mono (and a lot of other programs): it's not 384 simultaneous threads but rather 384 threads for the lifetime of the program!

    #include <pthread.h>
    #include <stdio.h>
    #include <time.h>
    #include <unistd.h>   /* sleep */

    void *thread_function (void *args)
    {
      pthread_exit(NULL);
    }

    int main (void)
    {
      int i;
      pthread_t thread;
      char *str;

      for (i=0;i<1000;i++) {
        if (pthread_create( &thread, NULL, thread_function, (void *)i)) {
          perror("testpgm");
        } else {
    //      int status;
    //      pthread_join(thread, (void **)&status);
          pthread_yield();
          printf("%d threads\n", i);
        }
        sleep(1);
      }
    }

$ ./testpgm
1 threads
2 threads
...
380 threads
381 threads
testpgm: Cannot allocate memory
testpgm: Cannot allocate memory
...
testpgm: Cannot allocate memory
testpgm: Cannot allocate memory
$

If you uncomment pthread_join the program works fine, but surely it's OK not to wait for a thread to stop with pthreads, right?

P.S. The sleep is there to make sure no race conditions exist at all: a thread is created and executed, and only then is the next one started. You can remove it from the sample; it does not change anything...
I've tried both of these test programs on my system (which is NPTL enabled, and exhibits the referenced mono 'bug'). Surprisingly, they both work fine, easily reaching the end of the for loop. I tried replacing the for loop with a while( 1 ), and they both sat there quite happily chugging into the tens of thousands of threads created.

Also, it should be mentioned that in all the cases where I've seen this bug exhibited (in the C# thread test posted elsewhere, and in mono's xsp/mod-mono-server) the threads numbered in the tens, and each of them was in one of the various wait functions (pthread_cond_timedwait, wait_sem, etc.). In mono's case at least, the bug can be 'resolved' by disabling the garbage collector (libgc); obviously this isn't a viable solution.

On the mono bugzilla (http://bugs.ximian.com/show_bug.cgi?id=60576), I've mentioned that the freeze occurs when the garbage collector initiates a world stop (stops all threads except the current one) and begins a full garbage collection. However, for some reason the thread performing the garbage collection then also waits. When running the various test programs in gdb, I found that if I put a breakpoint in the libgc code that performs the garbage collection, the breakpoint would be reached repeatedly, but with a period of one or two seconds between each occurrence. This occurred at the point where it would normally freeze completely. I'm not one hundred percent up on my unix signal handling etc., but it seems as if hitting the breakpoint may have altered some of the process's signals, perhaps waking one of the other threads and therefore alleviating the freeze.
Perhaps some details are different enough to cause a difference in behaviour. I've tried recompiling the system with gcc 3.3.4-r1 but it made no difference... Anyway, it looks like something is going on with thread synchronisation: in my case I can avoid the problem with pthread_join; in mono's case the GC tries to freeze all threads so that it can safely do garbage collection. I think the real problem is in nptl resource allocation: from outside it looks like there is something wrong with mutexes or the like, and sometimes nptl just cannot do the right thing. Since "normal" programs (like MySQL or Apache) behave quite predictably with regard to thread communication (threads are created and work on separate tasks, sometimes with something like a "manager" thread, but in "normal" programs some random thread will never try to communicate with some other equally random thread!), they do not trigger the bug in nptl. But the GC can stop all threads to do its work at literally ANY TIME, and thus it triggers the bug from time to time...
For reference:

Portage 2.0.50-r11 (gcc34-x86-2004.2, gcc-3.4.1, glibc-2.3.4.20040808-r0, 2.6.9-rc1-mm4)
=================================================================
System uname: 2.6.9-rc1-mm4 i686 Intel(R) Pentium(R) 4 CPU 3.00GHz
Gentoo Base System version 1.5.3
Autoconf: sys-devel/autoconf-2.59-r4
Automake: sys-devel/automake-1.8.5-r1
ACCEPT_KEYWORDS="x86 ~x86"
AUTOCLEAN="yes"
CFLAGS="-O2 -pipe -march=pentium4 -funroll-loops -ffast-math -fomit-frame-pointer -ffloat-store -fforce-addr -ftracer -mmmx -msse -msse2 -mfpmath=sse"
CHOST="i686-pc-linux-gnu"
COMPILER=""
CONFIG_PROTECT="/etc /usr/X11R6/lib/X11/xkb /usr/kde/2/share/config /usr/kde/3.3/env /usr/kde/3.3/share/config /usr/kde/3.3/shutdown /usr/kde/3/share/config /usr/share/config /usr/share/texmf/dvipdfm/config/ /usr/share/texmf/dvips/config/ /usr/share/texmf/tex/generic/config/ /usr/share/texmf/tex/platex/config/ /usr/share/texmf/xdvi/ /var/qmail/control"
CONFIG_PROTECT_MASK="/etc/afs/C /etc/afs/afsws /etc/gconf /etc/terminfo /etc/env.d"
CXXFLAGS="-O2 -pipe -march=pentium4 -funroll-loops -ffast-math -fomit-frame-pointer -ffloat-store -fforce-addr -ftracer -mmmx -msse -msse2 -mfpmath=sse"
DISTDIR="/usr/portage/distfiles"
FEATURES="autoaddcvs ccache sandbox"
GENTOO_MIRRORS="http://gentoo.osuosl.org http://distro.ibiblio.org/pub/Linux/distributions/gentoo"
MAKEOPTS="-j3"
PKGDIR="/usr/portage/packages"
PORTAGE_TMPDIR="/var/tmp"
PORTDIR="/usr/portage"
PORTDIR_OVERLAY=""
SYNC="rsync://rsync.gentoo.org/gentoo-portage"
USE="X X509 Xaw3d aalib acl afs alsa apm arts avi berkdb bitmap-fonts bzlib calendar caps cjk crypt cups curl dga directfb djbfft doc emacs encode esd f77 fam fbcon fdftk firebird foomaticdb ftp gcj gd gd-external gdbm ggi gif gmp gnome gpm gtk gtk2 guile iconv imap imlib immqt immqt-bc ipv6 jack java javamail javascript jikes jpeg junit kde kerberos krb4 ldap libcaca libg++ libwww mad makecheck memlimit mhash mikmod mime ming mmx motif mozilla mpeg mysql nas ncurses nls nptl objc odbc oggvorbis opengl oss pam pcre pdflib perl php pic png postgres pwdb python qdbm qt quicktime readline ruby samba sasl sdl session skey slang slp snmp socks5 spell sqlite sse sse2 ssl svga tcltk tcpd tetex threads tiff truetype unicode x86 xinerama xml xml2 xmlrpc xmms xpm xprint xsl xv zlib"
This bug is probably a dup/dep of bug #45115
@solar: I fail to see how a compile time bug about g++ not liking one of the pthread.h macros is related to this runtime threading problem. Care to elaborate?
"Care to elaborate?" Sorry, not really. I'm not an nptl fan. If you or others want to try to solve this bug, then I'm simply suggesting you take a peek at that bug, which may or may not be related to the problems you're having here. Either way, if you're using nptl then that bug concerns you.
can someone here please test glibc 2.3.4.20040916?
Muine (a mono app) appears to still lock up under glibc 2.3.4.20040916.
Reaching 102 threads with the latest glibc in portage (0918); adding as much info as I can think of.
Mono apps seem happier. Since it is 1020, I will try to see whether they lock up the way they used to in the past. I compiled mono -r2 with nptl after remerging 0918 with nptl.
OK, here is all the info. First: while I wrote the above, muine must have reached 1020 threads, as it has now locked up in the end.

nullzone ~ # /lib/libc.so.6
GNU C Library 20040918 release version 2.3.4, by Roland McGrath et al.
Copyright (C) 2004 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.
There is NO warranty; not even for MERCHANTABILITY or FITNESS FOR A
PARTICULAR PURPOSE.
Compiled by GNU CC version 3.4.2 (Gentoo Linux 3.4.2-r1, ssp-3.4.1-1, pie-8.7.6.5).
Compiled on a Linux 2.6.7 system on 2004-09-19.
Available extensions:
        GNU libio by Per Bothner
        crypt add-on version 2.1 by Michael Glad and others
        Native POSIX Threads Library by Ulrich Drepper et al
        BIND-8.2.3-T5B
        NIS(YP)/NIS+ NSS modules 0.19 by Thorsten Kukuk
Thread-local storage support included.
For bug reporting instructions, please see:
<http://www.gnu.org/software/libc/bugs.html>.
nullzone ~ #
nullzone ~ # emerge info
Portage 2.0.51_rc1 (default-linux/x86/2004.2/gcc34/2.6, gcc-3.4.2, glibc-2.3.4.20040918-r0, 2.6.9-rc1 i686)
=================================================================
System uname: 2.6.9-rc1 i686 AMD Athlon(tm) XP 3200+
Gentoo Base System version 1.5.3
ccache version 2.3 [enabled]
Autoconf: sys-devel/autoconf-2.59-r4
Automake: sys-devel/automake-1.8.5-r1
Binutils: sys-devel/binutils-2.15.90.0.1.1-r3
Headers: sys-kernel/linux26-headers-2.6.7-r4
Libtools: sys-devel/libtool-1.5.2-r5
ACCEPT_KEYWORDS="x86 ~x86"
AUTOCLEAN="yes"
CFLAGS="-O2 -march=athlon-xp -pipe -fomit-frame-pointer"
CHOST="i686-pc-linux-gnu"
COMPILER=""
CONFIG_PROTECT="/etc /usr/X11R6/lib/X11/xkb /usr/kde/2/share/config /usr/kde/3.2/share/config /usr/kde/3.3/share/config:/usr/kde/3.3/env:/usr/kde/3.3/shutdown /usr/kde/3/share/config /usr/lib/mozilla/defaults/pref /usr/share/config /usr/share/texmf/dvipdfm/config/ /usr/share/texmf/dvips/config/ /usr/share/texmf/tex/generic/config/ /usr/share/texmf/tex/platex/config/ /usr/share/texmf/xdvi/ /var/qmail/control"
CONFIG_PROTECT_MASK="/etc/gconf /etc/terminfo /etc/env.d"
CXXFLAGS="-O2 -march=athlon-xp -pipe -fomit-frame-pointer"
DISTDIR="/usr/portage/distfiles"
FEATURES="autoaddcvs ccache cvs sandbox sfperms"
GENTOO_MIRRORS="http://mirror.tucdemonic.org/gentoo/ http://gentoo.ccccom.com http://gentoo.osuosl.org/ http://mirrors.tds.net/gentoo http://mirror.datapipe.net/gentoo"
MAKEOPTS="-j2"
PKGDIR="/usr/portage/packages"
PORTAGE_TMPDIR="/var/tmp"
PORTDIR="/usr/portage"
PORTDIR_OVERLAY="/overlay"

I know that a longer time until mono apps freeze != a fix, but it goes in the right direction. Just an FYI for all.

1020 threads
starting thread!
GC_ALLOC in thread !
GC_ALLOC in thread !
GC_ALLOC in thread !
GC_ALLOC in thread !
GC_ALLOC in thread !
GC_ALLOC in thread !
GC_ALLOC in thread !
GC_ALLOC in thread !
GC_ALLOC in thread !
GC_ALLOC in thread !
GC_ALLOC in thread !
GC_ALLOC in thread !
GC_ALLOC in thread !
GC_ALLOC in thread !
GC_ALLOC in thread !
GC_ALLOC in thread !
GC_ALLOC in thread !
GC_ALLOC in thread !
GC_ALLOC in thread !
GC_ALLOC in thread !
GC_ALLOC in thread !
GC_ALLOC in thread !
GC_ALLOC in thread !
GC_ALLOC in thread !
1021 threads
1022 threads
1023 threads
1024 threads

then the usual endless list
s/102/1020/ sorry
The more I've looked into this, the more apparent it is that the issue is not a bug in NPTL. Instead, some applications (mono and libgc in particular) were developed expecting certain LinuxThreads behaviour that is non-standard at best, or erroneous.

http://lists.ximian.com/archives/public/mono-devel-list/2004-September/007824.html
http://lists.ximian.com/archives/public/mono-list/2004-May/020061.html
http://www.redhat.com/archives/phil-list/2003-December/msg00023.html
I've straced the program given in comment #4 on my nptl system and got:

$ strace ./testpgm2
...
...
...
30236 write(1, "253 threads\n", 12) = 12
30236 mmap2(NULL, 8392704, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xbf25b000
30236 mprotect(0xbf25b000, 4096, PROT_NONE) = 0
30236 clone(child_stack=0xbfa5bb28, flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID|CLONE_DETACHED, parent_tidptr=0xbfa5bbf8, {entry_number:6, base_addr:0xbfa5bbb0, limit:1048575, seg_32bit:1, contents:0, read_exec_only:0, limit_in_pages:1, seg_not_present:0, useable:1}, child_tidptr=0xbfa5bbf8) = 30491
30236 sched_yield( <unfinished ...>
30491 _exit(0) = ?
30236 <... sched_yield resumed> ) = 0
30236 write(1, "254 threads\n", 12) = 12
30236 mmap2(NULL, 8392704, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = -1 ENOMEM (Cannot allocate memory)
30236 dup(2) = 3
30236 fcntl64(3, F_GETFL) = 0x2 (flags O_RDWR)
30236 fstat64(3, {st_mode=S_IFCHR|0620, st_rdev=makedev(136, 5), ...}) = 0
30236 mmap2(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xbfa5c000
30236 _llseek(3, 0, 0xbffff358, SEEK_CUR) = -1 ESPIPE (Illegal seek)
30236 write(3, "nptl: Cannot allocate memory\n", 29) = 29
30236 close(3) = 0
...
...
...

So the 255th pthread_create seems to fail because the mmap of 8M for the thread stack fails. I therefore tried lowering the maximum stack size to 1M (ulimit -s 1024), and the program terminates correctly, as the kernel is able to allocate all the requested memory for the thread stacks (well, it pretends to, because I don't have 1G of memory :).

Doing a little math, it appears the problem comes from the very large default maximum stack size. With an 8M stack size and a 2G/2G split (which seems to be how I compiled my kernel), a single process cannot map more than 2G of memory, that is 256 * 8M ... The original poster seems to have a kernel compiled with a 3G/1G split, which allows a single process to map 3G (~ 380 * 8M) of memory ...

This is the same problem that affects the original program. By reducing the stack size, I get the expected answer of 24000:

$ ulimit -s 8192
$ ./testpgm | grep 'malloc in thread' | wc -l
6120
$ ulimit -s 512
$ testpgm | grep 'malloc in thread' | wc -l
24000

So the failure of those sample programs is simply caused by memory exhaustion, and this is no bug in glibc (maybe the very large default stack size limit is a problem, but that is another issue). Something else must be causing the problem seen under glibc/NPTL (after reading the different links, it seems to come from a difference in semantics between LinuxThreads and NPTL, but that's just a guess) ...
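For anyone who wants to check the arithmetic above, here is a minimal C sketch. The helper names max_thread_stacks and report, and the choice of KiB units, are made up for illustration and are not anything from glibc. It ignores the guard page and everything else already mapped into the process, so it only gives an upper bound.

```c
#include <stdio.h>

/* Hypothetical helper (not a glibc function): upper bound on how many
 * thread stacks of stack_kib KiB fit into user_space_kib KiB of address
 * space. Ignores guard pages and other mappings, so real limits are lower. */
static unsigned long max_thread_stacks(unsigned long user_space_kib,
                                       unsigned long stack_kib)
{
    if (stack_kib == 0)
        return 0;               /* avoid division by zero */
    return user_space_kib / stack_kib;
}

/* Print the bound for one address-space split and one "ulimit -s" value. */
static void report(const char *split, unsigned long user_space_kib,
                   unsigned long stack_kib)
{
    printf("%s split, %lu KiB stacks: at most %lu threads\n",
           split, stack_kib, max_thread_stacks(user_space_kib, stack_kib));
}
```

Called from a main of your own, report("2G/2G", 2097152UL, 8192UL) gives a bound of 256 and report("3G/1G", 3145728UL, 8192UL) gives 384, which is in line with the counts seen in this bug once the rest of the process's mappings are taken into account. With ulimit -s 1024 the 2G bound rises to 2048, above the 1000 threads the test programs create.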
Yes, the problem is "memory exhaustion", true. But why the hell is memory exhausted in the first place? You are missing a fundamental fact: the program from comment #4 WILL NEVER HAVE MORE THAN TWO THREADS AT ONCE! The sleep is there for a reason: it's a GUARANTEE that this program will never have more than two live threads on any sane system.

If glibc is 100% sure that the stack of every dead thread should be kept around forever, then you are 100% correct. But that's just stupid: why is all this dead memory kept around forever? I can see ABSOLUTELY NO REASON to keep the stack of every thread that ever existed in the program allocated forever. What's more, if you add pthread_join all the stacks are correctly deallocated. To me this looks like some strange error in the nptl locking logic and NOT like normal behaviour.

If you think about this: http://lists.ximian.com/archives/public/mono-devel-list/2004-September/007824.html you'll see it's EXACTLY the same problem: the thread is gone (and there is NO associated kernel process in memory; check with /proc!) but a half-dead zombie is still in memory, so no cleanup...
From the pthread_join(3) man page:

    When a joinable thread terminates, its memory resources (thread descriptor and stack) are not deallocated until another thread performs pthread_join on it. Therefore, pthread_join must be called once for each joinable thread created to avoid memory leaks.

Since pthread_create creates threads in the joinable state by default (which is the case here, as the 'pthread_attr_t *' parameter is NULL) and they are not detached by a call to pthread_detach(3), the behaviour is correct. The bug is in the program posted in comment #4, not in glibc!

If you want the thread's memory (stack, ...) to be freed automatically by the libc when the thread exits, you must create it in the detached state or call pthread_detach(3). Otherwise, you must call pthread_join(3).

I think the rationale for keeping the thread's memory is to be able to get the return value from the thread when calling pthread_join. But in any case, there is no bug, since this is the behaviour POSIX seems to specify.
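To make the two options above concrete, here is a rough C sketch of both ways to avoid leaking thread stacks. The function names spawn_detached and spawn_and_join are hypothetical, chosen only for this example; the pthread calls themselves are the standard POSIX ones. Note that pthread functions return the error number directly rather than setting errno, so the sketch returns that value instead of relying on perror.

```c
#include <pthread.h>
#include <stddef.h>

/* Trivial thread body, similar to the test program from comment #4. */
static void *thread_function(void *arg)
{
    (void)arg;
    return NULL;
}

/* Hypothetical helper: create one thread in the detached state, so its
 * stack and descriptor are released automatically when it exits.
 * Returns 0 on success or the error number from the failing call. */
static int spawn_detached(void)
{
    pthread_attr_t attr;
    pthread_t thread;
    int err;

    err = pthread_attr_init(&attr);
    if (err != 0)
        return err;
    err = pthread_attr_setdetachstate(&attr, PTHREAD_CREATE_DETACHED);
    if (err == 0)
        err = pthread_create(&thread, &attr, thread_function, NULL);
    pthread_attr_destroy(&attr);
    return err;
}

/* Hypothetical helper: create one joinable thread (the default) and
 * reap it with pthread_join, which is what frees its resources. */
static int spawn_and_join(void)
{
    pthread_t thread;
    int err = pthread_create(&thread, NULL, thread_function, NULL);

    if (err != 0)
        return err;
    return pthread_join(thread, NULL);
}
```

Replacing the bare pthread_create in the comment #4 loop with either helper should, as far as I can tell, let it run through all 1000 iterations, because no dead thread's stack stays mapped. Build with -pthread.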
As Defresne Sylvain pointed out this is a red herring. See bug 54603 and upstream http://bugzilla.ximian.com/show_bug.cgi?id=60576 for the fix.
That fix is actually a workaround. It seems to be a gcc/glibc issue: http://gcc.gnu.org/ml/gcc/2004-01/msg01766.html Basically there is a threading/exception problem caused by either of them.
True, I guess, the underlying problem is that gcc -fexceptions is incompatible with glibc+NPTL. However mono doesn't need -fexceptions (it's pure C, -fexceptions is for mixed C/C++ code) so removing it will fix the mono build. If you want to try to fix the gcc/-fexceptions/glibc/nptl incompatibility then I think that would best be done on a separate bug, or upstream.
Okay, I've just committed a change to the package.masked mono-1.0.2-r1 which adds the -fexceptions fix as posted on the ximian bug. It still requires gcc-3.4, unfortunately. I'd like to get some wide testing on that, and then un-package.mask that version so that it is only ~x86. At that point I want to get mono-1.0.2 and friends (gtk-sharp, etc.) marked stable, so that people with NPTL systems will only have to deal with marking a few select ebuilds ~x86 to have it functioning, and everyone else can have a stable 1.0 mono. TEST! (/me gets on hands and knees and begs)
Seems to be ok for me with mono 1.0.2-r1+muine 0.6.3
works fine with muine-from-cvs, glibc-2.3.4.20041021 gcc-3.4.2-r3 mono-1.0.2-r1 gtk-sharp-1.0.2
OK, one last *MAJOR* test request: I've just committed mono-1.0.6 to the tree, but I've added it to package.mask as well. I've done some testing with both muine and BLAM! on this, and I'm 90% sure we're OK with NPTL *without* using gcc-3.4! The new version deps on gcc-3.3.5, which I believe may have been the piece that contained the fix, "may" being the operative word. So I need some testing from people using gcc-3.3.5, because if this is the case, one of the major things holding up a stable-marked mono-1.0.x will disappear. Thanks all for persevering.
1.0.6 actually seems to have some threading issues unrelated to the NPTL issue. I've committed mono-1.0.5-r4, which includes these changes for the 1.0.5 series, and it is working wonderfully here and for one other user. *PLEASE* test this, as it's the most realistic target for stabilization, given the flakiness of 1.0.6. Marking this TEST-REQUEST.
I built mono-1.0.6 with gcc-3.3.5 & glibc-2.3.4.20050125 (nptl-only glibc) and it works fine for me.. muine doesn't hang any more :)

-------------------------------------------------------
cafri ~ # emerge -pv glibc mono =sys-devel/gcc-3.3.5-r1

These are the packages that I would merge, in order:

Calculating dependencies ...done!
[ebuild R ] sys-libs/glibc-2.3.4.20050125 -build -debug -erandom -hardened (-multilib) +nls -nomalloccheck +nptl +nptlonly +pic -userlocales 0 kB
[ebuild R ] dev-dotnet/mono-1.0.6 -debug +nptl 0 kB
[ebuild R ] sys-devel/gcc-3.3.5-r1 -bootstrap -boundschecking -build -debug -fortran -gcj -gtk -hardened (-ip28) (-multilib) -multislot (-n32) (-n64) +nls -nocxx +objc* -static (-uclibc) 0 kB

Total size of downloads: 0 kB

cafri ~ # gcc -v
Reading specs from /usr/lib/gcc-lib/i686-pc-linux-gnu/3.3.5/specs
Configured with: /var/tmp/portage/gcc-3.3.5-r1/work/gcc-3.3.5/configure --enable-version-specific-runtime-libs --prefix=/usr --bindir=/usr/i686-pc-linux-gnu/gcc-bin/3.3.5 --includedir=/usr/lib/gcc-lib/i686-pc-linux-gnu/3.3.5/include --datadir=/usr/share/gcc-data/i686-pc-linux-gnu/3.3.5 --mandir=/usr/share/gcc-data/i686-pc-linux-gnu/3.3.5/man --infodir=/usr/share/gcc-data/i686-pc-linux-gnu/3.3.5/info --with-gxx-include-dir=/usr/lib/gcc-lib/i686-pc-linux-gnu/3.3.5/include/g++-v3 --host=i686-pc-linux-gnu --disable-altivec --enable-nls --without-included-gettext --enable-__cxa_atexit --enable-clocale=gnu --with-system-zlib --disable-checking --disable-werror --disable-libunwind-exceptions --enable-shared --enable-threads=posix --disable-multilib --disable-libgcj --enable-languages=c,c++
Thread model: posix
gcc version 3.3.5 (Gentoo Linux 3.3.5-r1, ssp-3.3.2-3, pie-8.7.7.1)

cafri ~ # gcc-config -l
[1] i686-pc-linux-gnu-3.3.5 *
[2] i686-pc-linux-gnu-3.3.5-hardened
[3] i686-pc-linux-gnu-3.3.5-hardenednopie
[4] i686-pc-linux-gnu-3.3.5-hardenednossp
[5] i686-pc-linux-gnu-3.4.3
[6] i686-pc-linux-gnu-3.4.3-hardened
[7] i686-pc-linux-gnu-3.4.3-hardenednopie
[8] i686-pc-linux-gnu-3.4.3-hardenednossp