ERROR: All 87 tests were run, 1 failed unexpectedly. ## -------------------------- ## ## testsuite.log was created. ## ## -------------------------- ## ------------------------------------------------------------------- This is an unstable amd64 chroot image at a tinderbox (==build bot) name: 17.1_desktop_plasma-test-20190802-201521 ------------------------------------------------------------------- gcc-config -l: [1] x86_64-pc-linux-gnu-9.1.0 * Available Python interpreters, in order of preference: [1] python3.6 [2] python2.7 (fallback) emerge -qpvO sys-apps/sandbox [ebuild U ] sys-apps/sandbox-2.18 [2.13] ABI_X86="(32) (64) (-x32)"
Created attachment 585582 [details] emerge-info.txt
Created attachment 585584 [details] emerge-history.txt
Created attachment 585586 [details] environment
Created attachment 585588 [details] etc.portage.tbz2
Created attachment 585590 [details] logs.tbz2
Created attachment 585592 [details] sys-apps:sandbox-2.18:20190803-042059.log
Created attachment 585594 [details] temp.tbz2
> 41: open_static/1 FAILED (open_static.at:1) Probably a variant of bug #689376
(In reply to Sergei Trofimovich from comment #8) > > 41: open_static/1 FAILED (open_static.at:1) > Probably a variant of bug #689376 Woohoo \o/ It's a flaky test (source of flakiness unknown yet). Running the following in sandbox eventually triggers the same test failure: $ while make check; do echo again; done <a few successful runs> 41: open_static/1 FAILED (open_static.at:1)
Status update on how to reproduce the failure faster: 1. Make open_static helper loop until it fails: --- a/tests/open_static-1.sh +++ b/tests/open_static-1.sh @@ -8,5 +8,7 @@ open_static-0 3 ok O_RDONLY 0666 || exit 1 mkdir deny || exit 1 adddeny $PWD/deny -open_static-0 -1,EPERM deny/not-ok "O_WRONLY|O_CREAT" 0666 || exit 1 +while :; do + open_static-0 -1,EPERM deny/not-ok "O_WRONLY|O_CREAT" 0666 || exit 1 +done test -e sandbox.log 2. Run test-suite without parallelism: sandbox/tests $ './testsuite' AUTOTEST_PATH='src:tests' --jobs=1 --verbose 2>&1 | nl ... 6067 PASS: open("deny/not-ok", 0x41, 666) = -1 (wanted -1); errno = EPERM:1 [Operation not permitted] (wanted EPERM:1 [Operation not permitted]) 6068 PASS: open("deny/not-ok", 0x41, 666) = -1 (wanted -1); errno = EPERM:1 [Operation not permitted] (wanted EPERM:1 [Operation not permitted]) 6069 PASS: open("deny/not-ok", 0x41, 666) = -1 (wanted -1); errno = EPERM:1 [Operation not permitted] (wanted EPERM:1 [Operation not permitted]) 6070 FAIL: open("deny/not-ok", 0x41, 666) = 3 (wanted -1); errno = Success:0 [Success] (wanted EPERM:1 [Operation not permitted]) 6071 ./open_static.at:1: exit code was 1, expected 0 6072 1. open_static.at:1: FAILED (open_static.at:1) It usually takes a few seconds (a few thousands iterations) to fail.
Enabling debug tracing shows it's a bug in ptrace()/child-exec() handoff. Sometimes exec() in the child happens before parent starts tracing. Debug patch: --- a/libsandbox/trace.c +++ b/libsandbox/trace.c @@ -26,7 +26,7 @@ static long _do_ptrace(sb_ptrace_req_t request, const char *srequest, void *addr #ifdef DEBUG # define SBDEBUG 1 #else -# define SBDEBUG 0 +# define SBDEBUG 1 #endif Good: ++ open_static-0 -1,EPERM deny/not-ok 'O_WRONLY|O_CREAT' 0666 * TRACE (pid=5617 ):trace_main*: parent waiting for child (pid=5 6TRACE (pid=1586) to signal 18):trace_main: child setting up ... * TRACE (pid=5617):trace_child_signal: got sig SIGCHLD(17): code:CLD_TRAPPED(4) status:SIGSTOP(19) * TRACE (pid=5617):trace_child_signal: got sig SIGCHLD(17): code:CLD_TRAPPED(4) status:SIGCONT(18) * TRACE (pid=5617):trace_loop: waiting for exec; status: 0x127f * TRACE (pid=5617):trace_child_signal: got sig SIGCHLD(17): code:CLD_TRAPPED(4) status:SIGTRAP(5) * TRACE (pid=5617):trace_loop: waiting for exec; status: 0x857f * TRACE (pid=5617):trace_child_signal: got sig SIGCHLD(17): code:CLD_TRAPPED(4) status:SIGTRAP(5) * TRACE (pid=5617):trace_loop: hit exec! * TRACE (pid=5617):trace_child_signal: got sig SIGCHLD(17): code:CLD_TRAPPED(4) status:SIGTRAP(5) = 0 * TRACE (pid=5617):trace_child_signal: got sig SIGCHLD(17): code:CLD_TRAPPED(4) status:SIGTRAP(5) ... * TRACE (pid=5619):trace_loop: IDK:231(...) * TRACE (pid=5619):trace_child_signal: got sig SIGCHLD(17): code:CLD_EXITED(1) status:SIG???(0) = 0 ++ : Bad: ++ open_static-0 -1,EPERM deny/not-ok 'O_WRONLY|O_CREAT' 0666 * TRACE (pid=*56 2TRACE (pid=35):6trace_main2: parent waiting for child (pid=45):6trace_main2: child setting up ... 4) to signal * TRACE (pid=5623):trace_child_signal: got sig SIGCHLD(17): code:CLD_TRAPPED(4) status:SIG???(0) * TRACE (pid=5623):trace_child_signal: got sig SIGCHLD(17): code:CLD_TRAPPED(4) status:SIGTRAP(5) * TRACE (pid=5623):trace_child_signal: got sig SIGCHLD(17): code:CLD_TRAPPED(4) status:SIGTRAP(5) * TRACE (pid=5623):trace_loop: waiting for exec; status: 0x857f * TRACE (pid=5623):trace_child_signal: got sig SIGCHLD(17): code:CLD_TRAPPED(4) status:SIGTRAP(5) * TRACE (pid=5623):trace_loop: waiting for exec; status: 0x857f * TRACE (pid=5623):trace_child_signal: got sig SIGCHLD(17): code:CLD_TRAPPED(4) status:SIGTRAP(5) ... * TRACE (pid=5623):trace_loop: waiting for exec; status: 0x857f * TRACE (pid=5623):trace_child_signal: got sig SIGCHLD(17): code:CLD_EXITED(1) status:SIGHUP(1) = 1 ++ exit 1. Note: 'hit exec!' was never observed by parent. I guess it already happened before parent processed first event.
At least one problem in sandbox's ptrace state machine is in handling SIGCHLD/CLD_TRAPPED: normally tracer gets this signal when child is stopped by ptrace notification event (by signal). It that case info->si_status contains real signal that stopped tracee. But in our case tracee gets info->si_status = 0: * TRACE (pid=5619):trace_loop: IDK:231(...) * TRACE (pid=5619):trace_child_signal: got sig SIGCHLD(17): code:CLD_EXITED(1) status:SIG???(0) We should not forward back sig=0 to tracee as it will just resume tracee without signal delivery. That causes ptrace state machine to get out of sync across kernel and tracer. The following workaround seems to make things strictly better: --- a/libsandbox/trace.c +++ b/libsandbox/trace.c @@ -244,7 +244,9 @@ static void trace_child_signal(int signo, siginfo_t *info, void *context) * its problem, not ours, so don't whine about it. We just * have to be sure to bubble it back up. #265072 */ - do_ptrace(PTRACE_CONT, NULL, (void *)(long)info->si_status); + if (info->si_status != 0) { + do_ptrace(PTRACE_CONT, NULL, (void *)(long)info->si_status); + } return; } I'll try to find meaning of SIGCHLD/CLD_TRAPPED/info->si_status=0 first.
i rewrote the state machine for 2.26, and leverage PTRACE_EVENT_EXEC more strongly now. hopefully it's better now, but who knows. https://gitweb.gentoo.org/proj/sandbox.git/commit/?id=e9946633344c34b295485130f8f0fceb10fedc96