Rust 1.56 changed the implementation of Command::spawn to use the clone3 syscall on Linux (https://github.com/rust-lang/rust/pull/81825).
When run inside the sandbox, cargo hangs after compiling a few dependencies.
Steps to Reproduce:
1. Write an ebuild for dev-lang/rust-1.56_beta or dev-lang/rust-bin-1.56_beta and install it
2. Select Rust 1.56 via eselect rust
3. emerge exa
Some zombie rustc processes don't exit and cargo hangs.
cargo does not hang with FEATURES=-sandbox.
(Rust 1.56 is not released yet; it is due for release on 2021-10-21.)
Upstream report: https://github.com/rust-lang/rust/issues/89522
> emerge --info
Portage 3.0.26 (python 3.9.7-final-0, default/linux/amd64/17.0/musl, gcc-13.0.0, musl-1.2.2-r3, 5.14.8-zen+ x86_64)
System uname: Linux-5.14.8-zen+-x86_64-Intel-R-_Core-TM-_i9-9900T_CPU_@_2.10GHz-with-libc
KiB Mem: 32645680 total, 11028428 free
KiB Swap: 8388604 total, 7202812 free
Timestamp of repository gentoo: Tue, 05 Oct 2021 04:36:37 +0000
Head commit of repository gentoo: ea4e312c260f3ecbf0f4557a0fc902bfbeb4de43
Timestamp of repository gentoo-zh: Mon, 04 Oct 2021 01:06:16 +0000
Head commit of repository gentoo-zh: 312107f2205ee3543dad93b5539a7d040595f92c
Timestamp of repository guru: Sun, 03 Oct 2021 07:12:26 +0000
Head commit of repository guru: e67b9cf012d1dfd950adc6c4794cc3b28849923c
Timestamp of repository mv: Sat, 02 Oct 2021 09:51:20 +0000
Head commit of repository mv: 252d7490701964c65533e50c316734ab24db4d98
Timestamp of repository pentoo: Tue, 05 Oct 2021 02:36:15 +0000
Head commit of repository pentoo: 02f0b193029e4edf4a3ec70eea71ec984a3f21cb
Timestamp of repository pg_overlay: Tue, 05 Oct 2021 01:06:22 +0000
Head commit of repository pg_overlay: f00613aa0cb8b461326b7ca68f906b29f1687287
Timestamp of repository musl: Sat, 02 Oct 2021 18:21:16 +0000
Head commit of repository musl: 63526da3e64b1b9cb234ce874a471209cd877546
sh bash 5.1_p8
ld LLD 13.0.0 (compatible with GNU linkers)
ccache version 4.4.2 [enabled]
dev-lang/python: 2.7.18_p13::pg_overlay, 3.9.7::pg_overlay
sys-devel/autoconf: 2.13-r1::gentoo, 2.71-r1::gentoo
sys-kernel/linux-headers: 5.14::gentoo (virtual/os-headers)
CFLAGS="-march=skylake -mtune=skylake -O3 -pipe -falign-functions=32 -fomit-frame-pointer -ffunction-sections -fdata-sections -flto=thin"
CONFIG_PROTECT="/etc /usr/lib/libreoffice/program/sofficerc /usr/share/config /usr/share/gnupg/qualified.txt"
CONFIG_PROTECT_MASK="/etc/ca-certificates.conf /etc/dconf /etc/env.d /etc/fonts/fonts.conf /etc/gconf /etc/gentoo-release /etc/revdep-rebuild /etc/sandbox.d /etc/terminfo /etc/texmf/language.dat.d /etc/texmf/language.def.d /etc/texmf/updmap.d /etc/texmf/web2c"
CXXFLAGS="-march=skylake -mtune=skylake -O3 -pipe -falign-functions=32 -fomit-frame-pointer -ffunction-sections -fdata-sections -flto=thin -stdlib=libc++"
ENV_UNSET="CARGO_HOME DBUS_SESSION_BUS_ADDRESS DISPLAY GOBIN GOPATH PERL5LIB PERL5OPT PERLPREFIX PERL_CORE PERL_MB_OPT PERL_MM_OPT XAUTHORITY XDG_CACHE_HOME XDG_CONFIG_HOME XDG_DATA_HOME XDG_RUNTIME_DIR"
FEATURES="assume-digests binpkg-docompress binpkg-dostrip binpkg-logs binpkg-multi-instance candy ccache config-protect-if-modified distlocks downgrade-backup ebuild-locks fixlafiles ipc-sandbox merge-sync network-sandbox news parallel-fetch pid-sandbox preserve-libs protect-owned qa-unresolved-soname-deps sandbox sfperms strict unknown-features-warn unmerge-logs unmerge-orphans userfetch userpriv usersandbox usersync xattr"
FFLAGS="-march=skylake -mtune=skylake -O3 -pipe -falign-functions=32 -fomit-frame-pointer -ffunction-sections -fdata-sections"
LDFLAGS="-fuse-ld=lld -rtlib=compiler-rt -unwindlib=libunwind -flto=thin -Wl,--lto-O3 -Wl,-O3 -Wl,--gc-sections -march=skylake -mtune=skylake"
PORTAGE_RSYNC_OPTS="--recursive --links --safe-links --perms --times --omit-dir-times --compress --force --whole-file --delete --stats --human-readable --timeout=180 --exclude=/distfiles --exclude=/local --exclude=/packages --exclude=/.git"
RUSTFLAGS="-Ctarget-cpu=skylake -Clinker=clang -Clink-arg=-fuse-ld=lld -Ctarget-feature=-crt-static"
USE="X acl alsa amd64 bluetooth bzip2 clang cli compiler-rt crypt cups dbus default-compiler-rt default-libcxx dri elogind eselect-ldso fortran gif iconv ipv6 jemalloc jpeg libcxx libcxxabi libglvnd libtirpc libunwind llvm-libunwind ncurses nls nptl opengl openmp pam pcre png policykit pulseaudio readline seccomp split-usr ssl svg tcpd unicode vaapi vulkan wayland xattr zlib zsh-completion" ABI_X86="64" ADA_TARGET="gnat_2019" ALSA_CARDS="hda-intel usb-audio" APACHE2_MODULES="authn_core authz_core socache_shmcb unixd actions alias auth_basic authn_alias authn_anon authn_dbm authn_default authn_file authz_dbm authz_default authz_groupfile authz_host authz_owner authz_user autoindex cache cgi cgid dav dav_fs dav_lock deflate dir disk_cache env expires ext_filter file_cache filter headers include info log_config logio mem_cache mime mime_magic negotiation rewrite setenvif speling status unique_id userdir usertrack vhost_alias" CALLIGRA_FEATURES="karbon sheets words" COLLECTD_PLUGINS="df interface irq load memory rrdtool swap syslog" CPU_FLAGS_X86="aes avx avx2 f16c fma3 mmx mmxext pclmul popcnt sse sse2 sse3 sse4_1 sse4_2 ssse3" ELIBC="musl" GPSD_PROTOCOLS="ashtech aivdm earthmate evermore fv18 garmin garmintxt gpsclock greis isync itrax mtk3301 nmea ntrip navcom oceanserver oldstyle oncore rtcm104v2 rtcm104v3 sirf skytraq superstar2 timing tsip tripmate tnt ublox ubx" INPUT_DEVICES="evdev libinput" KERNEL="linux" L10N="zh-CN" LCD_DEVICES="bayrad cfontz cfontz633 glk hd44780 lb216 lcdm001 mtxorb ncurses text" LIBREOFFICE_EXTENSIONS="presenter-console presenter-minimizer" LLVM_TARGETS="X86 ARM AArch64 RISCV WebAssembly AMDGPU AVR BPF Hexagon Lanai MSP430 Mips NVPTX PowerPC Sparc SystemZ XCore" LUA_SINGLE_TARGET="lua5-4" LUA_TARGETS="lua5-4 luajit" OFFICE_IMPLEMENTATION="libreoffice" PHP_TARGETS="php7-3 php7-4" POSTGRES_TARGETS="postgres12 postgres13" PYTHON_SINGLE_TARGET="python3_9" PYTHON_TARGETS="python3_9" QEMU_SOFTMMU_TARGETS="arm aarch64 i386 x86_64 riscv32 
riscv64" QEMU_USER_TARGETS="arm aarch64 riscv32 riscv64" RUBY_TARGETS="ruby30" USERLAND="GNU" VIDEO_CARDS="intel i965 iris" XTABLES_ADDONS="quota2 psd pknock lscan length2 ipv4options ipset ipp2p iface geoip fuzzy condition tee tarpit sysrq proto steal rawnat logmark ipmark dhcpmac delude chaos account"
Unset: CPPFLAGS, CTARGET, EMERGE_DEFAULT_OPTS, LINGUAS, PORTAGE_BUNZIP2_COMMAND, PORTAGE_COMPRESS, PORTAGE_COMPRESS_FLAGS, PORTAGE_RSYNC_EXTRA_OPTS
As far as I can tell, this happens because the sandbox tool has a global lock, and hooks fork to acquire that lock before forking and drop it afterwards (so that fork doesn't happen while another thread holds the lock), but does not have a similar hook for clone or clone3. It's possible to create another process using clone or clone3 (not just a thread), if the flags do not include CLONE_VM. I think the right fix is to hook clone and clone3, and if the flags do *not* include CLONE_VM, use the same lock/unlock logic.
(Note that since clone3 is a syscall and not a library call, library interposing wouldn't help here. It may be necessary to do external ptracing on programs that use clone3.)
(In reply to Josh Triplett from comment #1)
yes, sandbox relies on the program not making syscalls directly to fork its own processes. if it does that, all bets are off. considering how painful it has been to use the clone syscall, no one has bothered, especially when portable code already has access to everything it needs via the fork() & pthread C library APIs.
(In reply to Josh Triplett from comment #2)
we currently only ptrace static programs. as painful as interposing is, it's still less flaky than ptrace, especially on some architectures.
while detecting if a particular ELF is static is trivial, detecting if it uses a specific syscall is practically impossible.
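As an illustration of why the static check is cheap, the sketch below uses one common heuristic: a little-endian ELF64 image with no PT_INTERP program header has no dynamic loader and is treated as static. This is an assumption for illustration, not necessarily the test sandbox performs, and the function name `elf64_is_static` is made up. Shared objects also lack PT_INTERP, so the heuristic is only meaningful for executables.

```python
import struct

PT_INTERP = 3  # program header type naming the dynamic loader


def elf64_is_static(image: bytes) -> bool:
    """Heuristic: no PT_INTERP segment means no dynamic loader is used."""
    if image[:4] != b"\x7fELF":
        raise ValueError("not an ELF image")
    if image[4] != 2 or image[5] != 1:
        raise ValueError("this sketch only handles little-endian ELF64")
    # ELF64 header: e_phoff at 0x20, e_phentsize at 0x36, e_phnum at 0x38.
    (e_phoff,) = struct.unpack_from("<Q", image, 0x20)
    e_phentsize, e_phnum = struct.unpack_from("<HH", image, 0x36)
    for i in range(e_phnum):
        # p_type is the first 32-bit field of each program header.
        (p_type,) = struct.unpack_from("<I", image, e_phoff + i * e_phentsize)
        if p_type == PT_INTERP:
            return False
    return True
```

It could be called as `elf64_is_static(open(path, "rb").read())`. Nothing comparable exists for "does this binary ever issue syscall N", since the syscall number can be computed at run time.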
it'd prob help to have someone dig a bit into rust/cargo source to see where the clone3 call is coming from. if it were glibc, our existing symbol interposition should have handled it.
there is a clone C library wrapper we could/should interpose and check the flags, but i suspect that isn't the source of the trouble here. the clone function still hits the clone syscall only, not clone3. the only way glibc itself hits clone3 is via __clone_internal, and that's only used with pthread & spawn APIs.
so if rust itself is hitting clone3 syscall directly via its internal code, we're a bit buggered. we could maybe add an env var hook to sandbox to force it to use its ptrace mode, and rust ebuilds/eclasses could set that.
rust 1.56 is indeed doing a direct clone3 syscall: https://github.com/rust-lang/rust/pull/81825/files#diff-7015a38ee6056bbfa832b33281ffeaad5531c4dbfaff60ddfce0934475e040f4R163
Also, while I'm not too familiar with Rust, this change appears to be in the rust standard library (or whatever they call it) - so it would affect anything built with rust, not just cargo. I've already run into a related issue trying to do profile-guided optimization in firefox (which runs firefox in a virtual X server during build, to generate profile data - it hangs forever, just like cargo).
I did manage to "fix" cargo by patching this line to default "HAS_CLONE3" to false rather than true: https://github.com/rust-lang/rust/blob/5e02151318ddd431aea6d58e23948246c1446044/library/std/src/sys/unix/process/process_unix.rs#L147
Looks like they disabled this feature (in most cases) in 1.56 final, so hopefully it won't be an issue: https://github.com/rust-lang/rust/pull/89924
There's still a lot of discussion in the original bug though ( https://github.com/rust-lang/rust/issues/89522 ), so it's not clear what's going to happen in future versions.
we'll have to rework the sandbox startup logic so we only run in ptrace mode. i'm not sure how much cutting this will take though.
first we'll have to fix a long standing bug where ptrace only works one-level deep. if the traced process forks, we don't keep track of any of those descendants. i have this somewhat working.
FWIW, Rust's "direct" clone3 syscall is actually using libc's syscall(2), so in theory you could interpose that and decode number==SYS_clone3. I don't know if it's worth the effort though, because I think Rust is going to have to stop using clone3 anyway, for all the libc arguments against it.
(In reply to Josh Stone from comment #8)
thanks ... i hadn't been including syscall() here, but you're certainly right we could interpose that interface too. a cursory scan of my system shows that not too many programs go through that, so it shouldn't (hopefully) be too much overhead if we want to go that route.
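A rough sketch of the decode step an interposed syscall() would need, written in Python for brevity rather than as a real LD_PRELOAD wrapper. The function name is hypothetical; SYS_clone3 = 435 is the x86_64 number and is an assumption for other architectures. The layout used is the first 64 bytes of struct clone_args (CLONE_ARGS_SIZE_VER0), whose first field is the flags word.

```python
import struct

SYS_clone3 = 435            # x86_64 value; check <asm/unistd.h> elsewhere
CLONE_VM = 0x00000100       # from <linux/sched.h>
CLONE_ARGS_SIZE_VER0 = 64   # flags, pidfd, child_tid, parent_tid,
                            # exit_signal, stack, stack_size, tls (u64 each)


def clone3_creates_process(number: int, clone_args: bytes) -> bool:
    """True when syscall(number, ...) is a clone3 that makes a new process."""
    if number != SYS_clone3:
        return False  # some other syscall; pass it straight through
    if len(clone_args) < CLONE_ARGS_SIZE_VER0:
        raise ValueError("clone_args shorter than CLONE_ARGS_SIZE_VER0")
    # flags is the first native-endian u64 in struct clone_args.
    (flags,) = struct.unpack_from("=Q", clone_args, 0)
    return (flags & CLONE_VM) == 0
```

Only calls for which this returns True would need the fork-style handling; everything else could be forwarded to the real syscall() untouched, which keeps the overhead small for the few programs that use that interface.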
looks like the workaround made it into 1.56.0
i have seccomp+ptrace working locally, and i'll prob ship it in sandbox-2.28. waiting to hear back on the new 2.27 release which includes NNP enabled (and is a requirement for seccomp usage).
however, ptrace is still only used when running static or set*id programs, it isn't used all the time. will have to do more work to pull the ptrace logic out into the main code path as an option (vs using LD_PRELOAD at all).
we also have the problem that the ptrace support hasn't been ported to all architectures. i did it for a bunch that i had easy access to via Gentoo dev machines, but that's not all of them. so i think the LD_PRELOAD logic will be with us for quite some time.
also need to see what kind of perf hit this takes overall. or maybe it's a win.
(In reply to SpanKY from comment #11)
> i have seccomp+ptrace working locally, and i'll prob ship it in
> sandbox-2.28. waiting to hear back on the new 2.27 release which includes
> NNP enabled (and is a requirement for seccomp usage).
yeah, it's been fine so far, but giving it a bit longer is a good idea.
> we also have the problem that the ptrace support hasn't been ported to all
> architectures. i did it for a bunch that i had easy access to via Gentoo
> dev machines, but that's not all of them. so i think the LD_PRELOAD logic
> will be with us for quite some time.
which are you missing?
sandbox arch list:
Gentoo arch list:
at this point, i'm considering implementing generic layers in the kernel and sending those upstream. linux-5.3 has PTRACE_GET_SYSCALL_INFO which generalizes the reading side of the equation, but there's no generic "set" operation for forcing an error. then i wouldn't have to fight with sandbox always being behind the curve with newer Linux ports.
updated port status:
* i got an arm64 laptop & got that mostly working (i think i hit a kernel bug)
* i got access to the sparc dev box and fixed the sparc32 & sparc64 ports
* i fixed the ppc64 ptrace logic
the big ones we're still missing are mips and maybe riscv. also m68k & sh, but we know those are lesser used.
talking with some folks about getting a common ptrace "set" API in the kernel. optimistic that it'll happen, but it'll be a while before we can rely on it.
relevant ptrace discussion in case people are interested:
but at this point, i don't think it's a blocker for progress with most arches.
sandbox-3.0 is out w/support for tracing of children of static programs. that's going to need a while to bake i think.
rust thing ported to beta branches, so it's not only in 1.56* now.