A world update failed for me while remerging glibc:

make[2]: *** [/var/tmp/portage/sys-libs/glibc-2.10.1/work/build-default-i686-pc-linux-gnu-nptl/sunrpc/rpcsvc/rusers.stmp] Segmentation fault
make[2]: *** Waiting for unfinished jobs....
...
make[2]: Leaving directory `/var/tmp/portage/sys-libs/glibc-2.10.1/work/glibc-2.10.1/sunrpc'
make[1]: *** [sunrpc/others] Error 2
make[1]: Leaving directory `/var/tmp/portage/sys-libs/glibc-2.10.1/work/glibc-2.10.1'
make: *** [all] Error 2
 *
 * ERROR: sys-libs/glibc-2.10.1 failed.
 * Call stack:
 *   ebuild.sh, line 49: Called src_compile
 *   environment, line 3575: Called eblit-run 'src_compile'
 *   environment, line 1213: Called eblit-glibc-src_compile
 *   src_compile.eblit, line 188: Called toolchain-glibc_src_compile
 *   src_compile.eblit, line 123: Called die
 * The specific snippet of code:

Retrying the emerge with MAKEOPTS=-j1 succeeded, though, so it looks like a parallel build issue.

Looking for "make.*rpcsvc.*Segmentation fault" in this bugzilla lists quite a number of matches: http://tinyurl.com/o7pouq

Most of these seem to be related to weird CHOST settings, though.
I'd consider my own system to be sane enough in that respect:

Portage 2.2_rc33 (default/linux/x86/2008.0/desktop, gcc-4.3.3, glibc-2.10.1-r0, 2.6.29-gentoo-r4 i686)
=================================================================
System uname: Linux-2.6.29-gentoo-r4-i686-Intel-R-_Pentium-R-_4_CPU_3.00GHz-with-gentoo-2.0.0
Timestamp of tree: Mon, 18 May 2009 06:15:03 +0000
distcc 3.1 i686-pc-linux-gnu [disabled]
ccache version 2.4 [enabled]
app-shells/bash: 4.0_p24
dev-java/java-config: 2.1.7
dev-lang/python: 2.5.4-r2, 2.6.2
dev-python/pycrypto: 2.0.1-r8
dev-util/ccache: 2.4-r8
dev-util/cmake: 2.6.4
sys-apps/baselayout: 2.0.0
sys-apps/openrc: 0.4.3-r2
sys-apps/sandbox: 1.9
sys-devel/autoconf: 2.13, 2.63-r1
sys-devel/automake: 1.4_p6, 1.5, 1.6.3, 1.7.9-r1, 1.8.5-r3, 1.9.6-r2, 1.10.2
sys-devel/binutils: 2.19.1-r1
sys-devel/gcc-config: 1.4.1
sys-devel/libtool: 2.2.6a
virtual/os-headers: 2.6.29
ACCEPT_KEYWORDS="x86 ~x86"
CBUILD="i686-pc-linux-gnu"
CFLAGS="-march=prescott -O2 -ggdb"
CHOST="i686-pc-linux-gnu"
CXXFLAGS="-march=prescott -O2 -ggdb"
DISTDIR="/usr/portage/distfiles"
FEATURES="ccache collision-protect distlocks fixpackages parallel-fetch preserve-libs protect-owned sandbox sfperms splitdebug strict unmerge-orphans userfetch"

Also, a broken CHOST should lead to reproducible problems; otherwise it's a bug in the build system.

I have to confess that I've been running a world update with --keep-going, so some other (seemingly unrelated) update might be the reason for the first build failing and the second one succeeding, invalidating my diagnosis of a parallel build issue.

Usually I'd run an emake with a presumed parallel build issue about 20 times in a loop to reproduce it, but given the time building libc requires, I don't think I can do this here; 5 times will have to suffice. Will report back whether I could reproduce the issue.
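For reference, a minimal sketch of the kind of reproduction loop meant here. The `run_build` function is a made-up placeholder standing in for the actual (slow) rebuild command, e.g. an `ebuild ... clean compile` invocation; only the loop logic itself is shown.

```shell
# Sketch of a "retry N times to catch a sporadic parallel-build failure" loop.
# run_build is a stub; in real use it would be the glibc rebuild command.
run_build() {
    true  # placeholder: replace with the actual ebuild/emerge invocation
}

i=1
while [ "$i" -le 5 ]; do
    echo "run $i"
    if ! run_build; then
        echo "reproduced the failure in run $i" >&2
        break
    fi
    i=$((i + 1))
done
echo "finished after $((i - 1)) successful runs"
```

With the stub always succeeding, the loop completes all 5 iterations; with the real build command, it stops at the first failing run so the build tree can be inspected.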
Reproduced in run 4 of my loop, so it definitely is a parallel build issue. This time I had it with a different file, but again it's the rpcgen invocation intended to generate a header file. I'm running my loop again, trying to get a core dump this time around.
OK, I got a core dump, and a backtrace. It seems to be the sandbox!

Core was generated by `/bin/sh -c CPP='i686-pc-linux-gnu-gcc -E -x c-header' /var/tmp/portage/sys-lib'.
(gdb) bt
#0  0x4005d17c in munmap@plt () from /usr/lib/libsandbox.so
#1  0x40062aef in sb_check_exec (
    filename=0x8869468 "/var/tmp/portage/sys-libs/glibc-2.10.1/work/build-default-i686-pc-linux-gnu-nptl/elf/ld-linux.so.2", argv=0x8862008)
    at ../../sandbox-1.9/libsandbox/wrapper-funcs/__wrapper_exec.c:73
#2  0x40062df7 in execve_DEFAULT (
    path=0x8869468 "/var/tmp/portage/sys-libs/glibc-2.10.1/work/build-default-i686-pc-linux-gnu-nptl/elf/ld-linux.so.2", argv=0x8862008, envp=0x8865280)
    at ../../sandbox-1.9/libsandbox/wrapper-funcs/__wrapper_exec.c:214
#3  0x0807106d in shell_execve (
    command=0x8869468 "/var/tmp/portage/sys-libs/glibc-2.10.1/work/build-default-i686-pc-linux-gnu-nptl/elf/ld-linux.so.2", args=0x8862008, env=0x8865280)
    at execute_cmd.c:4676
#4  0x080761e1 in execute_simple_command (simple_command=0x8867ac0, pipe_in=<value optimized out>, pipe_out=<value optimized out>, async=0, fds_to_close=0x8867aa8)
    at execute_cmd.c:4477
#5  0x08072e18 in execute_command_internal (command=0x88623a0, asynchronous=0, pipe_in=-1, pipe_out=-1, fds_to_close=0x8867aa8) at execute_cmd.c:724
#6  0x080ac4d5 in parse_and_execute (
    string=0x8862400 "CPP='i686-pc-linux-gnu-gcc -E -x c-header' /var/tmp/portage/sys-libs/glibc-2.10.1/work/build-default-i686-pc-linux-gnu-nptl/elf/ld-linux.so.2 --library-path /var/tmp/portage/sys-libs/glibc-2.10.1/wor"..., from_file=0x80df3ab "-c", flags=<value optimized out>)
    at evalstring.c:315
#7  0x0806171f in run_one_command (
    command=0xbffdbdc9 "CPP='i686-pc-linux-gnu-gcc -E -x c-header' /var/tmp/portage/sys-libs/glibc-2.10.1/work/build-default-i686-pc-linux-gnu-nptl/elf/ld-linux.so.2 --library-path /var/tmp/portage/sys-libs/glibc-2.10.1/wor"...)
    at shell.c:1302
#8  0x080624cb in main (argc=3, argv=0xbffda154, env=0xbffda164) at shell.c:680
(gdb) frame 1
#1  0x40062aef in sb_check_exec (
    filename=0x8869468 "/var/tmp/portage/sys-libs/glibc-2.10.1/work/build-default-i686-pc-linux-gnu-nptl/elf/ld-linux.so.2", argv=0x8862008)
    at ../../sandbox-1.9/libsandbox/wrapper-funcs/__wrapper_exec.c:73
73          munmap(elf, st.st_size);
(gdb) p elf
$1 = (unsigned char *) 0x405b3000 "\177ELF\001\001\001"

I can see no obvious problem in that frame 1. I still don't understand what frame 0 is. PLT probably stands for Procedure Linkage Table, which I had never heard of before.

(gdb) disassemble 0x4005d17c
Dump of assembler code for function munmap@plt:
0x4005d17c <munmap@plt+0>:  jmp    *0x100(%ebx)
0x4005d182 <munmap@plt+6>:  push   $0x1e8
0x4005d187 <munmap@plt+11>: jmp    0x4005cd9c <_init+24>
End of assembler dump.
(gdb) info registers
eax            0x405b3000  1079717888
ecx            0x9562f     611887
edx            0x9562f     611887
ebx            0x7b        123
esp            0xbffd9d80  0xbffd9d80
ebp            0xbffd9e38  0xbffd9e38
esi            0x3         3
edi            0x8869468   143037544
eip            0x40062aef  0x40062aef <sb_check_exec+367>
eflags         0x210286    [ PF SF IF RF ID ]
cs             0x73        115
ss             0x7b        123
ds             0x7b        123
es             0x7b        123
fs             0x0         0
gs             0x33        51

So yes, this indeed doesn't look too good; 0x100(%ebx) is in fact almost certainly an illegal address.

A disassembly of sb_check_exec shows that it does seem to initialize ebx to some larger value based on its own address:

0x4006298c <sb_check_exec+12>: call 0x40062991 <sb_check_exec+17>
0x40062991 <sb_check_exec+17>: pop  %ebx
0x40062992 <sb_check_exec+18>: add  $0x6663,%ebx

Other than this, it doesn't modify ebx at all. As a previous call to mmap@plt similar to the munmap@plt didn't cause a SIGSEGV, I assume ebx was sane at that time. The only function to get called after that mmap is trace_main, so either it or something it called somehow garbled ebx. Seems like a PIC/PIE issue somewhere. How all of this corresponds to parallel makes, I have not the slightest idea.
But someone involved in the internals of sandbox probably should have a look, and involving upstream seems a bad idea right now.
wonder if the library path given while executing that application includes the local libdl from glibc-2.10.1
(In reply to comment #3)
> wonder if the library path given while executing that application includes the
> local libdl from glibc-2.10.1

"strings core" does list stuff like LD_PRELOAD and GRASS_LD_LIBRARY_PATH, but no LD_LIBRARY_PATH. So I assume the environment is part of the core dump, but LD_LIBRARY_PATH is not set. Is this enough to answer your question with "no"?
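The kind of filtering meant here can be sketched as follows. The core file, its name, and its contents are all invented for illustration: environment strings appear NUL-separated inside a real core dump, and `tr '\0' '\n'` is used as a coreutils-only stand-in for what `strings core | grep` does.

```shell
# Build a stand-in "core" with NUL-separated environment-style strings
# (contents are made up, mirroring what the comment reports seeing).
printf 'LD_PRELOAD=libsandbox.so\0GRASS_LD_LIBRARY_PATH=/opt/grass\0PATH=/bin\0' > fake-core

# Roughly equivalent to: strings fake-core | grep '^LD_LIBRARY_PATH='
tr '\0' '\n' < fake-core | grep '^LD_LIBRARY_PATH=' || echo 'LD_LIBRARY_PATH not set'
```

Note the anchored pattern: an unanchored `grep LD_LIBRARY_PATH` would falsely match GRASS_LD_LIBRARY_PATH, which is exactly the distinction the comment is drawing.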
the env settings won't be relevant. you need to look in the build log to see how rusers.stmp was invoked. iirc, glibc executes all of its local binaries through its freshly compiled ldso/libraries.
(In reply to comment #5)
> the env settings wont be relevant. you need to look in the build log to see
> how rusers.stmp was invoked. iirc, glibc executes all of its local binaries
> through its freshly compiled ldso/libraries.

I just pieced together the full argv[2] from the core dump. Abbreviating $WORKDIR/build-default-i686-pc-linux-gnu-nptl to $B for legibility, it reads thus:

CPP='i686-pc-linux-gnu-gcc -E -x c-header' $B/elf/ld-linux.so.2 --library-path $B:$B/math:$B/elf:$B/dlfcn:$B/nss:$B/nis:$B/rt:$B/resolv:$B/crypt:$B/nptl $B/sunrpc/rpcgen -Y ../scripts -c rpcsvc/klm_prot.x -o $B/sunrpc/xklm_prot.T

This seems to be a shell run by make in order to execute the rpcgen-cmd command from the following rule:

$(rpcsvc:%.x=$(objpfx)x%.c): $(objpfx)x%.c: $(objpfx)x%.stmp
	@:
$(objpfx)x%.stmp: rpcsvc/%.x $(objpfx)rpcgen
	-@rm -f ${@:stmp=T} $@
	$(rpcgen-cmd) -c $< -o ${@:stmp=T}
	$(move-if-change) $(@:stmp=T) $(@:stmp=c)
	touch $@

The command from the log of my first segfaulting run looks slightly different, as it was creating a header, not a C source file:

CPP='i686-pc-linux-gnu-gcc -E -x c-header' $B/elf/ld-linux.so.2 --library-path $B:$B/math:$B/elf:$B/dlfcn:$B/nss:$B/nis:$B/rt:$B/resolv:$B/crypt:$B/nptl $B/sunrpc/rpcgen -Y ../scripts -h rpcsvc/rusers.x -o $B/sunrpc/rpcsvc/rusers

The corresponding Makefile rules:

$(rpcsvc:%.x=$(objpfx)rpcsvc/%.h): $(objpfx)rpcsvc/%.h: $(objpfx)rpcsvc/%.stmp
	@:
$(objpfx)rpcsvc/%.stmp: rpcsvc/%.x $(objpfx)rpcgen
	$(make-target-directory)
	-@rm -f ${@:stmp=T} $@
	$(rpcgen-cmd) -h $< -o ${@:stmp=T}
	$(move-if-change) $(@:stmp=T) $(@:stmp=h)
	touch $@

However, judging from the backtrace I'd say that command wasn't actually executed at all. I believe that the sandboxed shell invoked by make segfaulted before executing that command. Running the invoked ld-linux.so.2 through a sandbox within gdb, I found that sb_check_exec in libsandbox actually decides the executable is static and has to be ptraced.
gdb operating on the core file tells me trace_pid == 0, so it should be the child that crashed, not the parent.

Looking at the assembly code of trace_main, this is how it ends:

0x4005fe88 <trace_main+1560>: call 0x4005f452 <_do_ptrace>
0x4005fe8d <trace_main+1565>: call 0x4005ce0c <getpid@plt>
0x4005fe92 <trace_main+1570>: movl $0x13,0x4(%esp)
0x4005fe9a <trace_main+1578>: mov  %eax,(%esp)
0x4005fe9d <trace_main+1581>: call 0x4005d19c <kill@plt>
0x4005fea2 <trace_main+1586>: add  $0x1ac,%esp
0x4005fea8 <trace_main+1592>: pop  %ebx
0x4005fea9 <trace_main+1593>: pop  %esi
0x4005feaa <trace_main+1594>: pop  %edi
0x4005feab <trace_main+1595>: pop  %ebp
0x4005feac <trace_main+1596>: ret

It then returns here:

0x40062acc <sb_check_exec+332>: call 0x4005f870 <trace_main>
0x40062ad1 <sb_check_exec+337>: mov  -0x3c(%ebp),%edx
0x40062ad4 <sb_check_exec+340>: mov  %edx,-0x94(%ebp)
0x40062ada <sb_check_exec+346>: mov  -0x94(%ebp),%ecx
0x40062ae0 <sb_check_exec+352>: mov  %ecx,0x4(%esp)
0x40062ae4 <sb_check_exec+356>: mov  -0x74(%ebp),%eax
0x40062ae7 <sb_check_exec+359>: mov  %eax,(%esp)
0x40062aea <sb_check_exec+362>: call 0x4005d17c <munmap@plt>

The mmap@plt call in sb_check_exec didn't segfault, so ebx was correct at that point in time. After that, I see no assembly instruction in sb_check_exec that should modify ebx, so when trace_main got invoked, ebx should still have been correct.

trace_main pushes ebx along with several other registers. It then sets up its own ebx, does some stuff including a fork, then returns using the code above. Assuming the return works as expected, we have to assume the stack pointer is sane, so it pops ebx from the place where it pushed it before. Almost immediately after trace_main returns, sb_check_exec calls the munmap@plt which causes the segfault.

This seems to me like some form of stack corruption. I'd guess something or other overwrote the ebx on the stack of trace_main, causing an incorrect value to be popped.
As an alternative, the wrong ebx value might come from some signal handler interfering, some kernel bug concerning context switches, some hardware failure on my local system (though the number of CCs here may speak against that), or something else I haven't thought of.
(In reply to comment #6)
> The mmap@plt call in sb_check_exec didn't segfault, so ebx was correct at that
> point in time.

I guess I was a bit too quick there: mmap@plt should jump to mmap, so when it returns, ebx might be broken.

It seems that the GOT in my core dump is at 0x40062991 + 0x6663 = 0x40068ff4, so that's what the value of ebx should be if everything worked all right:

0x4006298c <sb_check_exec+12>: call 0x40062991 <sb_check_exec+17>
0x40062991 <sb_check_exec+17>: pop  %ebx
0x40062992 <sb_check_exec+18>: add  $0x6663,%ebx

Looking at the GOT at offset 0x20, as indicated in mmap@plt:

0x40068ff4: 0x0000dee8  0x4001c000  0x400145d0  0x4005cdb2
0x40069004: 0x4005cdc2  0x4005cdd2  0x40086e50  0x40063930
0x40069014: 0x4013e490  0x4005ce12  0x4005ce22  0x4005ce32
            ^^^^^^^^^^

(gdb) info symbol 0x4013e490
mmap in section .text
(gdb) info sharedlibrary
From        To          Syms Read   Shared Object Library
0x4005d210  0x40065a8a  Yes         /usr/lib/libsandbox.so
0x45be82e0  0x45c0ebac  Yes         /lib/libncurses.so.5
0x4006ca30  0x4006d9ac  Yes         /lib/libdl.so.2
0x40086840  0x40181128  Yes         /lib/libc.so.6
0x400008d0  0x40017b5f  Yes         /lib/ld-linux.so.2
0x4057ce10  0x40581c88  Yes         /lib/libnss_compat.so.2
0x40588170  0x4059500c  Yes         /lib/libnsl.so.1
0x4059d980  0x405a3a00  Yes         /lib/libnss_nis.so.2
0x405a8a00  0x405afd9c  Yes         /lib/libnss_files.so.2

So the mmap version called should be the one from the standard libc, nothing fancy or specific to sandbox. I'm not sure how likely it is that this is involved in the breakage.
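The address arithmetic in that comment can be double-checked with shell arithmetic; both constants below are taken directly from the disassembly and GOT dump above, nothing new is assumed.

```shell
# ebx after "pop %ebx; add $0x6663,%ebx": the address picked up by pop
# (the instruction after the call) plus the link-time offset to the GOT.
printf '0x%x\n' $((0x40062991 + 0x6663))   # expected GOT base: 0x40068ff4

# The GOT word inspected at offset 0x20 from that base, which holds mmap:
printf '0x%x\n' $((0x40068ff4 + 0x20))     # 0x40069014, the marked dump row
```

Both results line up with the dump: 0x40068ff4 is the base address shown, and 0x40069014 is the row whose first word (0x4013e490) gdb resolves to mmap.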
Created attachment 192984 [details]
1243696130-install-sys-libs_glibc-2.10.1:2.2::gentoo.out.bz2

Fails for me on sunrpc/xkey_prot.stmp:

make[2]: *** [/var/tmp/paludis/sys-libs-glibc-2.10.1/work/build-default-i686-pc-linux-gnu-nptl/sunrpc/xkey_prot.stmp] Error 137
make[2]: *** Waiting for unfinished jobs....
make[1]: *** [sunrpc/others] Error 2
make: *** [all] Error 2
building with -j1 succeeds for me too.
Are you sure this is a parallel build bug and not some hardware failure? If it were a parallel build bug, I'm sure we would have seen a lot more of these failures by this point. I can't reproduce this on any of my machines.
(In reply to comment #10)
> Are you sure this is a parallel build bug and not some hardware failure?

No, I'm not sure it's a parallel build issue. It might as well be some other kind of bug causing sporadic breakage, like e.g. a race condition. If it were a hardware failure on my system, then I doubt you'd have seen even the one reproduction by nico. I can only imagine a full disk partition or something similar as a rare but not unique cause for such behaviour.

> I can't reproduce this on any of my machines.

I'll try to reproduce this with sandbox-2.0, and if that fails, try to reproduce it yet again with sandbox-1.9. I'll have a good look at disk capacity this time around as well. I'll get back to you with the results.

If you had been able to reproduce it on your machines, what would you have done next? Maybe I can do some of that investigation on my system.
(In reply to comment #11)
> I try to reproduce this with sandbox-2.0, and if that fails, try to reproduce
> it yet again with sandbox-1.9.

First result: couldn't reproduce it with sandbox-2.0, tried about 20 times.
(In reply to comment #11)
> and if that fails, try to reproduce it yet again with sandbox-1.9.

Failed as well. I can't reproduce this thing any more. Strange!