A world update failed for me while remerging glibc:

make[2]: *** [/var/tmp/portage/sys-libs/glibc-2.10.1/work/build-default-i686-pc-linux-gnu-nptl/sunrpc/rpcsvc/rusers.stmp] Segmentation fault
make[2]: *** Waiting for unfinished jobs....
...
make[2]: Leaving directory `/var/tmp/portage/sys-libs/glibc-2.10.1/work/glibc-2.10.1/sunrpc'
make[1]: *** [sunrpc/others] Error 2
make[1]: Leaving directory `/var/tmp/portage/sys-libs/glibc-2.10.1/work/glibc-2.10.1'
make: *** [all] Error 2
 *
 * ERROR: sys-libs/glibc-2.10.1 failed.
 * Call stack:
 *   ebuild.sh, line 49: Called src_compile
 *   environment, line 3575: Called eblit-run 'src_compile'
 *   environment, line 1213: Called eblit-glibc-src_compile
 *   src_compile.eblit, line 188: Called toolchain-glibc_src_compile
 *   src_compile.eblit, line 123: Called die
 * The specific snippet of code:

Retrying the emerge with MAKEOPTS=-j1 succeeded, though, so it looks like a parallel build issue.

Looking for "make.*rpcsvc.*Segmentation fault" in this bugzilla lists quite a number of matches: http://tinyurl.com/o7pouq

Most of these seem to be related to weird CHOST settings, though.
I'd consider my own system to be sane enough in that respect:

Portage 2.2_rc33 (default/linux/x86/2008.0/desktop, gcc-4.3.3, glibc-2.10.1-r0, 2.6.29-gentoo-r4 i686)
=================================================================
System uname: Linux-2.6.29-gentoo-r4-i686-Intel-R-_Pentium-R-_4_CPU_3.00GHz-with-gentoo-2.0.0
Timestamp of tree: Mon, 18 May 2009 06:15:03 +0000
distcc 3.1 i686-pc-linux-gnu [disabled]
ccache version 2.4 [enabled]
app-shells/bash: 4.0_p24
dev-java/java-config: 2.1.7
dev-lang/python: 2.5.4-r2, 2.6.2
dev-python/pycrypto: 2.0.1-r8
dev-util/ccache: 2.4-r8
dev-util/cmake: 2.6.4
sys-apps/baselayout: 2.0.0
sys-apps/openrc: 0.4.3-r2
sys-apps/sandbox: 1.9
sys-devel/autoconf: 2.13, 2.63-r1
sys-devel/automake: 1.4_p6, 1.5, 1.6.3, 1.7.9-r1, 1.8.5-r3, 1.9.6-r2, 1.10.2
sys-devel/binutils: 2.19.1-r1
sys-devel/gcc-config: 1.4.1
sys-devel/libtool: 2.2.6a
virtual/os-headers: 2.6.29
ACCEPT_KEYWORDS="x86 ~x86"
CBUILD="i686-pc-linux-gnu"
CFLAGS="-march=prescott -O2 -ggdb"
CHOST="i686-pc-linux-gnu"
CXXFLAGS="-march=prescott -O2 -ggdb"
DISTDIR="/usr/portage/distfiles"
FEATURES="ccache collision-protect distlocks fixpackages parallel-fetch preserve-libs protect-owned sandbox sfperms splitdebug strict unmerge-orphans userfetch"

Also, a broken CHOST should lead to reproducible problems; otherwise it's a bug in the build system.

I have to confess that I've been running a world update with --keep-going, so some other (seemingly unrelated) update might be the reason for the first build failing and the second one succeeding, invalidating my diagnosis of a parallel build issue.

Usually I'd run an emake with a presumed parallel build issue about 20 times in a loop to reproduce it, but given the time building libc requires, I don't think I can do this here; 5 times will have to suffice. Will report back whether I could reproduce the issue.
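For reference, a minimal sketch of the kind of reproduction loop meant here. The `run_build` function is a made-up placeholder standing in for the actual (slow) rebuild command, e.g. an `ebuild ... clean compile` invocation; only the loop logic itself is shown.

```shell
# Sketch of a "retry N times to catch a sporadic parallel-build failure" loop.
# run_build is a stub; in real use it would be the glibc rebuild command.
run_build() {
    true  # placeholder: replace with the actual ebuild/emerge invocation
}

i=1
while [ "$i" -le 5 ]; do
    echo "run $i"
    if ! run_build; then
        echo "reproduced the failure in run $i" >&2
        break
    fi
    i=$((i + 1))
done
echo "finished after $((i - 1)) successful runs"
```

With the stub always succeeding, the loop completes all 5 iterations; with the real build command, it stops at the first failing run so the build tree can be inspected.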
Reproduced in run 4 of my loop, so it definitely is a parallel build issue. This time I had it with a different file, but again it's the rpcgen invocation intended to generate a header file. I'm running my loop again, trying to get a core dump this time around.
OK, I got a core dump, and a backtrace. It seems to be the sandbox!

Core was generated by `/bin/sh -c CPP='i686-pc-linux-gnu-gcc -E -x c-header' /var/tmp/portage/sys-lib'.
(gdb) bt
#0  0x4005d17c in munmap@plt () from /usr/lib/libsandbox.so
#1  0x40062aef in sb_check_exec (
    filename=0x8869468 "/var/tmp/portage/sys-libs/glibc-2.10.1/work/build-default-i686-pc-linux-gnu-nptl/elf/ld-linux.so.2", argv=0x8862008)
    at ../../sandbox-1.9/libsandbox/wrapper-funcs/__wrapper_exec.c:73
#2  0x40062df7 in execve_DEFAULT (
    path=0x8869468 "/var/tmp/portage/sys-libs/glibc-2.10.1/work/build-default-i686-pc-linux-gnu-nptl/elf/ld-linux.so.2", argv=0x8862008, envp=0x8865280)
    at ../../sandbox-1.9/libsandbox/wrapper-funcs/__wrapper_exec.c:214
#3  0x0807106d in shell_execve (
    command=0x8869468 "/var/tmp/portage/sys-libs/glibc-2.10.1/work/build-default-i686-pc-linux-gnu-nptl/elf/ld-linux.so.2", args=0x8862008, env=0x8865280)
    at execute_cmd.c:4676
#4  0x080761e1 in execute_simple_command (simple_command=0x8867ac0, pipe_in=<value optimized out>, pipe_out=<value optimized out>, async=0, fds_to_close=0x8867aa8)
    at execute_cmd.c:4477
#5  0x08072e18 in execute_command_internal (command=0x88623a0, asynchronous=0, pipe_in=-1, pipe_out=-1, fds_to_close=0x8867aa8) at execute_cmd.c:724
#6  0x080ac4d5 in parse_and_execute (
    string=0x8862400 "CPP='i686-pc-linux-gnu-gcc -E -x c-header' /var/tmp/portage/sys-libs/glibc-2.10.1/work/build-default-i686-pc-linux-gnu-nptl/elf/ld-linux.so.2 --library-path /var/tmp/portage/sys-libs/glibc-2.10.1/wor"..., from_file=0x80df3ab "-c", flags=<value optimized out>)
    at evalstring.c:315
#7  0x0806171f in run_one_command (
    command=0xbffdbdc9 "CPP='i686-pc-linux-gnu-gcc -E -x c-header' /var/tmp/portage/sys-libs/glibc-2.10.1/work/build-default-i686-pc-linux-gnu-nptl/elf/ld-linux.so.2 --library-path /var/tmp/portage/sys-libs/glibc-2.10.1/wor"...)
    at shell.c:1302
#8  0x080624cb in main (argc=3, argv=0xbffda154, env=0xbffda164) at shell.c:680
(gdb) frame 1
#1  0x40062aef in sb_check_exec (
    filename=0x8869468 "/var/tmp/portage/sys-libs/glibc-2.10.1/work/build-default-i686-pc-linux-gnu-nptl/elf/ld-linux.so.2", argv=0x8862008)
    at ../../sandbox-1.9/libsandbox/wrapper-funcs/__wrapper_exec.c:73
73          munmap(elf, st.st_size);
(gdb) p elf
$1 = (unsigned char *) 0x405b3000 "\177ELF\001\001\001"

I can see no obvious problem in that frame 1. I still don't understand what frame 0 is. PLT probably stands for Procedure Linkage Table, which I had never heard of before.

(gdb) disassemble 0x4005d17c
Dump of assembler code for function munmap@plt:
0x4005d17c <munmap@plt+0>:  jmp    *0x100(%ebx)
0x4005d182 <munmap@plt+6>:  push   $0x1e8
0x4005d187 <munmap@plt+11>: jmp    0x4005cd9c <_init+24>
End of assembler dump.
(gdb) info registers
eax            0x405b3000  1079717888
ecx            0x9562f     611887
edx            0x9562f     611887
ebx            0x7b        123
esp            0xbffd9d80  0xbffd9d80
ebp            0xbffd9e38  0xbffd9e38
esi            0x3         3
edi            0x8869468   143037544
eip            0x40062aef  0x40062aef <sb_check_exec+367>
eflags         0x210286    [ PF SF IF RF ID ]
cs             0x73        115
ss             0x7b        123
ds             0x7b        123
es             0x7b        123
fs             0x0         0
gs             0x33        51

So yes, this indeed doesn't look too good; 0x100(%ebx) is in fact almost certainly an illegal address.

A disassembly of sb_check_exec shows that it does seem to initialize ebx to some larger value based on its own address:

0x4006298c <sb_check_exec+12>: call 0x40062991 <sb_check_exec+17>
0x40062991 <sb_check_exec+17>: pop  %ebx
0x40062992 <sb_check_exec+18>: add  $0x6663,%ebx

Other than this, it doesn't modify ebx at all. As a previous call to mmap@plt similar to the munmap@plt didn't cause a SIGSEGV, I assume ebx was sane at that time. The only function to get called after that mmap is trace_main, so either it or something it called somehow garbled ebx. Seems like a PIC/PIE issue somewhere. How all of this corresponds to parallel makes, I have not the slightest idea.
But someone involved in the internals of sandbox probably should have a look, and involving upstream seems a bad idea right now.
wonder if the library path given while executing that application includes the local libdl from glibc-2.10.1
(In reply to comment #3)
> wonder if the library path given while executing that application includes the
> local libdl from glibc-2.10.1

"strings core" does list stuff like LD_PRELOAD and GRASS_LD_LIBRARY_PATH, but no LD_LIBRARY_PATH. So I assume the environment is part of the core dump, but LD_LIBRARY_PATH is not set. Is this enough to answer your question with "no"?
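The kind of filtering meant here can be sketched as follows. The core file, its name, and its contents are all invented for illustration: environment strings appear NUL-separated inside a real core dump, and `tr '\0' '\n'` is used as a coreutils-only stand-in for what `strings core | grep` does.

```shell
# Build a stand-in "core" with NUL-separated environment-style strings
# (contents are made up, mirroring what the comment reports seeing).
printf 'LD_PRELOAD=libsandbox.so\0GRASS_LD_LIBRARY_PATH=/opt/grass\0PATH=/bin\0' > fake-core

# Roughly equivalent to: strings fake-core | grep '^LD_LIBRARY_PATH='
tr '\0' '\n' < fake-core | grep '^LD_LIBRARY_PATH=' || echo 'LD_LIBRARY_PATH not set'
```

Note the anchored pattern: an unanchored `grep LD_LIBRARY_PATH` would falsely match GRASS_LD_LIBRARY_PATH, which is exactly the distinction the comment is drawing.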
the env settings won't be relevant. you need to look in the build log to see how rusers.stmp was invoked. iirc, glibc executes all of its local binaries through its freshly compiled ldso/libraries.
(In reply to comment #5)
> the env settings wont be relevant. you need to look in the build log to see
> how rusers.stmp was invoked. iirc, glibc executes all of its local binaries
> through its freshly compiled ldso/libraries.

I just pieced together the full argv[2] from the core dump. Abbreviating $WORKDIR/build-default-i686-pc-linux-gnu-nptl to $B for legibility, it reads thus:

CPP='i686-pc-linux-gnu-gcc -E -x c-header' $B/elf/ld-linux.so.2 --library-path $B:$B/math:$B/elf:$B/dlfcn:$B/nss:$B/nis:$B/rt:$B/resolv:$B/crypt:$B/nptl $B/sunrpc/rpcgen -Y ../scripts -c rpcsvc/klm_prot.x -o $B/sunrpc/xklm_prot.T

This seems to be a shell run by make in order to execute the rpcgen-cmd command from the following rule:

$(rpcsvc:%.x=$(objpfx)x%.c): $(objpfx)x%.c: $(objpfx)x%.stmp
	@:
$(objpfx)x%.stmp: rpcsvc/%.x $(objpfx)rpcgen
	-@rm -f ${@:stmp=T} $@
	$(rpcgen-cmd) -c $< -o ${@:stmp=T}
	$(move-if-change) $(@:stmp=T) $(@:stmp=c)
	touch $@

The command from the log of my first segfaulting run looks slightly different, as it was creating a header, not a C source file:

CPP='i686-pc-linux-gnu-gcc -E -x c-header' $B/elf/ld-linux.so.2 --library-path $B:$B/math:$B/elf:$B/dlfcn:$B/nss:$B/nis:$B/rt:$B/resolv:$B/crypt:$B/nptl $B/sunrpc/rpcgen -Y ../scripts -h rpcsvc/rusers.x -o $B/sunrpc/rpcsvc/rusers

The corresponding Makefile rules:

$(rpcsvc:%.x=$(objpfx)rpcsvc/%.h): $(objpfx)rpcsvc/%.h: $(objpfx)rpcsvc/%.stmp
	@:
$(objpfx)rpcsvc/%.stmp: rpcsvc/%.x $(objpfx)rpcgen
	$(make-target-directory)
	-@rm -f ${@:stmp=T} $@
	$(rpcgen-cmd) -h $< -o ${@:stmp=T}
	$(move-if-change) $(@:stmp=T) $(@:stmp=h)
	touch $@

However, judging from the backtrace I'd say that command wasn't actually executed at all. I believe that the sandboxed shell invoked by make segfaulted before executing that command. Running the invoked ld-linux.so.2 through a sandbox within gdb, I found that sb_check_exec in libsandbox actually decides the executable is static and has to be ptraced.
gdb operating on the core file tells me trace_pid == 0, so it should be the child that crashed, not the parent.

Looking at the assembly code of trace_main, this is how it ends:

0x4005fe88 <trace_main+1560>: call 0x4005f452 <_do_ptrace>
0x4005fe8d <trace_main+1565>: call 0x4005ce0c <getpid@plt>
0x4005fe92 <trace_main+1570>: movl $0x13,0x4(%esp)
0x4005fe9a <trace_main+1578>: mov  %eax,(%esp)
0x4005fe9d <trace_main+1581>: call 0x4005d19c <kill@plt>
0x4005fea2 <trace_main+1586>: add  $0x1ac,%esp
0x4005fea8 <trace_main+1592>: pop  %ebx
0x4005fea9 <trace_main+1593>: pop  %esi
0x4005feaa <trace_main+1594>: pop  %edi
0x4005feab <trace_main+1595>: pop  %ebp
0x4005feac <trace_main+1596>: ret

It then returns here:

0x40062acc <sb_check_exec+332>: call 0x4005f870 <trace_main>
0x40062ad1 <sb_check_exec+337>: mov  -0x3c(%ebp),%edx
0x40062ad4 <sb_check_exec+340>: mov  %edx,-0x94(%ebp)
0x40062ada <sb_check_exec+346>: mov  -0x94(%ebp),%ecx
0x40062ae0 <sb_check_exec+352>: mov  %ecx,0x4(%esp)
0x40062ae4 <sb_check_exec+356>: mov  -0x74(%ebp),%eax
0x40062ae7 <sb_check_exec+359>: mov  %eax,(%esp)
0x40062aea <sb_check_exec+362>: call 0x4005d17c <munmap@plt>

The mmap@plt call in sb_check_exec didn't segfault, so ebx was correct at that point in time. After that, I see no assembly instruction in sb_check_exec that should modify ebx, so when trace_main got invoked, ebx should still have been correct.

trace_main pushes ebx along with several other registers. It then sets up its own ebx, does some stuff including a fork, then returns using the code above. Assuming the return works as expected, we have to assume the stack pointer is sane, so it pops ebx from the place where it pushed it before. Almost immediately after trace_main returns, sb_check_exec calls the munmap@plt which causes the segfault.

This seems to me like some form of stack corruption. I'd guess something or other overwrote the ebx on the stack of trace_main, causing an incorrect value to be popped.
As an alternative, the wrong ebx value might come from some signal handler interfering, some kernel bug concerning context switches, some hardware failure on my local system (though the number of CCs here may speak against that), or something else I haven't thought of.
(In reply to comment #6)
> The mmap@plt call in sb_check_exec didn't segfault, so ebx was correct at that
> point in time.

I guess I was a bit too quick there: mmap@plt should jump to mmap, so when it returns, ebx might be broken.

It seems that the GOT in my core dump is at 0x40062991 + 0x6663 = 0x40068ff4, so that's what the value of ebx should be if everything worked all right:

0x4006298c <sb_check_exec+12>: call 0x40062991 <sb_check_exec+17>
0x40062991 <sb_check_exec+17>: pop  %ebx
0x40062992 <sb_check_exec+18>: add  $0x6663,%ebx

Looking at the GOT at offset 0x20, as indicated in mmap@plt:

0x40068ff4: 0x0000dee8  0x4001c000  0x400145d0  0x4005cdb2
0x40069004: 0x4005cdc2  0x4005cdd2  0x40086e50  0x40063930
0x40069014: 0x4013e490  0x4005ce12  0x4005ce22  0x4005ce32
            ^^^^^^^^^^

(gdb) info symbol 0x4013e490
mmap in section .text
(gdb) info sharedlibrary
From        To          Syms Read   Shared Object Library
0x4005d210  0x40065a8a  Yes         /usr/lib/libsandbox.so
0x45be82e0  0x45c0ebac  Yes         /lib/libncurses.so.5
0x4006ca30  0x4006d9ac  Yes         /lib/libdl.so.2
0x40086840  0x40181128  Yes         /lib/libc.so.6
0x400008d0  0x40017b5f  Yes         /lib/ld-linux.so.2
0x4057ce10  0x40581c88  Yes         /lib/libnss_compat.so.2
0x40588170  0x4059500c  Yes         /lib/libnsl.so.1
0x4059d980  0x405a3a00  Yes         /lib/libnss_nis.so.2
0x405a8a00  0x405afd9c  Yes         /lib/libnss_files.so.2

So the mmap version called should be the one from the standard libc, nothing fancy or specific to sandbox. I'm not sure how likely it is that this is involved in the breakage.
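The address arithmetic in that comment can be double-checked with shell arithmetic; both constants below are taken directly from the disassembly and GOT dump above, nothing new is assumed.

```shell
# ebx after "pop %ebx; add $0x6663,%ebx": the address picked up by pop
# (the instruction after the call) plus the link-time offset to the GOT.
printf '0x%x\n' $((0x40062991 + 0x6663))   # expected GOT base: 0x40068ff4

# The GOT word inspected at offset 0x20 from that base, which holds mmap:
printf '0x%x\n' $((0x40068ff4 + 0x20))     # 0x40069014, the marked dump row
```

Both results line up with the dump: 0x40068ff4 is the base address shown, and 0x40069014 is the row whose first word (0x4013e490) gdb resolves to mmap.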
Created attachment 192984 [details]
1243696130-install-sys-libs_glibc-2.10.1:2.2::gentoo.out.bz2

Fails for me on sunrpc/xkey_prot.stmp:

make[2]: *** [/var/tmp/paludis/sys-libs-glibc-2.10.1/work/build-default-i686-pc-linux-gnu-nptl/sunrpc/xkey_prot.stmp] Error 137
make[2]: *** Waiting for unfinished jobs....
make[1]: *** [sunrpc/others] Error 2
make: *** [all] Error 2
building with -j1 succeeds for me too.
Are you sure this is a parallel build bug and not some hardware failure? If it were a parallel build bug, I'm sure we would have seen a lot more of these failures by this point. I can't reproduce this on any of my machines.
(In reply to comment #10)
> Are you sure this is a parallel build bug and not some hardware failure?

No, I'm not sure it's a parallel build issue. It might as well be some other kind of bug causing sporadic breakage, like e.g. a race condition. If it were a hardware failure on my system, then I doubt you'd have seen even the one reproduction by nico. I can only imagine a full disk partition or something similar as a rare but not unique cause for such behaviour.

> I can't reproduce this on any of my machines.

I'll try to reproduce this with sandbox-2.0, and if that fails, try to reproduce it yet again with sandbox-1.9. I'll have a good look at disk capacity this time around as well. I'll get back to you with the results.

If you had been able to reproduce it on your machines, what would you have done next? Maybe I can do some of that investigation on my system.
(In reply to comment #11)
> I try to reproduce this with sandbox-2.0, and if that fails, try to reproduce
> it yet again with sandbox-1.9.

First result: couldn't reproduce it with sandbox-2.0, tried about 20 times.
(In reply to comment #11)
> and if that fails, try to reproduce it yet again with sandbox-1.9.

Failed as well. I can't reproduce this thing any more. Strange!