Created attachment 385246 [details] complete build.log (plus extra after ebuilds) The bug is specific to distcc builds. It is necessary to use these to build llvm on lower memory ARM systems (e.g. a raspberry pi b+, as in this case). Unfortunately the latest upgrade to 3.5.0 results in the compilation of Function.cpp now taking more than 5 minutes on my powerful 64-bit x86 system and this exceeds the default setting for DISTCC_IO_TIMEOUT of 300s. The remote compile fails, a local compile is attempted and, some time after cc1plus has grown to 250MByte of RSS (this on a machine with 512MByte of RAM), the aggressive Linux OOM killer kills it. The problem won't arise on larger systems as the local compile will succeed (hiding the timeout). It probably won't arise on Linices other than 3.12.x because the OOM killer action is totally unnecessary (there is more than enough swap space to do the compile). It's not clear how to change the distcc IO timeout - adding declare -x DISTCC_IO_TIMEOUT=600 to the environment (see attachments) does not help. I'm still investigating a work-round. Note that the attached build.log has a lot of history in it - an emerge followed by several attempts to continue the compile step with "ebuild install". Marked the bug as minor because it is pretty obvious what is going on; it is just difficult to find a work-round.
Created attachment 385248 [details] emerge --info output
Created attachment 385250 [details] environment (including non-working DISTCC_IO_TIMEOUT work round) The only change to the environment from the original emerge generated one was to add DISTCC_IO_TIMEOUT
Created attachment 385252 [details] Command that times out on the distcc server This is the command copied from ps ww
Created attachment 385254 [details] Preprocessed file being compiled This is the file the client machine (where the emerge is running) sent to the compile server. (Compressed because it is so large)
Created attachment 385256 [details] Test program to validate how long the compile actually takes
The attachment cc1plus.sh validates that the compile does actually run to completion and verifies how long it takes - just over the 300s limit. Output is:
+ cp distccd_4404fc4d.ii /tmp/distccd_4404fc4d.ii
+ env -i /usr/libexec/gcc/armv6j-hardfloat-linux-gnueabi/4.8.3/cc1plus -fpreprocessed /tmp/distccd_4404fc4d.ii -quiet -dumpbase distccd_4404fc4d.ii -march=armv6zk -mfpu=vfp -mfloat-abi=hard -march=armv6zk -mfpu=vfp -mfloat-abi=hard -march=armv6zk -mfpu=vfp -mfloat-abi=hard -mtls-dialect=gnu -auxbase-strip /tmp/distccd_4439fc4d.o -g -g -g -Os -Os -Os -Woverloaded-virtual -Wcast-qual -Wpedantic -Wno-long-long -Wall -Wextra -Wno-unused-parameter -Wwrite-strings -Wno-maybe-uninitialized -Wno-missing-field-initializers -std=c++11 -fvisibility-inlines-hidden -fno-exceptions -fPIC -ffunction-sections -fdata-sections -fstack-protector -o /tmp/ccAiCu7t.s
real 5m52.762s
user 5m52.694s
sys 0m0.226s
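For anyone repeating the measurement, the comparison against the timeout can be sketched in a few lines of shell. This is only an illustration: `to_seconds` and `exceeds_timeout` are hypothetical helper names, and the parsing assumes the `NmS.fffs` format printed by bash's `time` with no zero-padded seconds.

```shell
#!/bin/sh
# Convert a bash `time`-style duration such as "5m52.762s" to whole
# seconds (the fractional part is truncated).
to_seconds() {
    d=$1
    min=${d%%m*}      # text before the first "m" is the minutes
    rest=${d#*m}      # text after the "m" is the seconds field
    sec=${rest%%.*}   # drop the fractional part, if any
    sec=${sec%s}      # drop a trailing "s" when there was no fraction
    echo $(( min * 60 + sec ))
}

# Succeeds when the duration ($1) is longer than the limit in seconds ($2).
exceeds_timeout() {
    [ "$(to_seconds "$1")" -gt "$2" ]
}

elapsed="5m52.762s"   # the "real" value from the output above
limit=300             # default distcc IO timeout named in this report

if exceeds_timeout "$elapsed" "$limit"; then
    echo "$elapsed = $(to_seconds "$elapsed")s exceeds ${limit}s"
fi
```

With the figures from this report the check fires, since 5 minutes 52 seconds is 352 whole seconds.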
Created attachment 385258 [details] emerge -pqv '=sys-devel/llvm-3.5.0::gentoo'
It looks like this is a distcc bug. Although some distcc documentation describes 'DISTCC_IO_TIMEOUT', the distcc installed by gentoo (sys-devel/distcc-3.1-r9) does not seem to have support for it. This is why my declaration had no effect; I can demonstrate that it was correct by defining DISTCC_FALLBACK=0 in the same way, and this does prevent the certain-to-fail local compile. I increased the severity of the bug to normal because this is preventing me building llvm 3.5.0. If DISTCC_IO_TIMEOUT worked I could make a work-round and one could be incorporated into the llvm ebuild.
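As a rough sketch of the two settings discussed here, the variables can be exported in the shell that launches the build. The value 600 is the one tried in this report. The timeout only has an effect on a distcc client that actually implements DISTCC_IO_TIMEOUT, and whether a given Portage setup passes these variables through to the compile step is an assumption.

```shell
#!/bin/sh
# Sketch only: set the two distcc client variables discussed above.
# DISTCC_IO_TIMEOUT is assumed to be honoured by the installed client;
# 600 (seconds) is the value tried in this report.
export DISTCC_IO_TIMEOUT=600

# Disable the fallback to a local compile when distribution fails.
export DISTCC_FALLBACK=0

# Show what a child process (e.g. the build) would inherit.
env | grep '^DISTCC_' | sort
```

Setting DISTCC_FALLBACK=0 makes a timed-out job fail outright instead of starting a local compile that the low-memory machine cannot finish.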
Comment on attachment 385246 [details] complete build.log (plus extra after ebuilds)
distcc[8293] (dcc_select_for_read) ERROR: IO timeout
distcc[8293] (dcc_r_token_int) ERROR: read failed while waiting for token "DONE"
distcc[8293] (dcc_r_result_header) ERROR: server provided no answer. Is the server configured to allow access from your IP address? Does the server have the compiler installed? Is the server configured to access the compiler?
distcc[8293] Warning: failed to distribute /var/tmp/portage/sys-devel/llvm-3.5.0/work/llvm-3.5.0.src/lib/IR/Function.cpp to hippopopus.jbowler.com, running locally instead
armv6j-hardfloat-linux-gnueabi-g++: internal compiler error: Killed (program cc1plus)
Please submit a full bug report, with preprocessed source if appropriate.
See <https://bugs.gentoo.org/> for instructions.
distcc[8293] ERROR: compile /var/tmp/portage/sys-devel/llvm-3.5.0/work/llvm-3.5.0.src/lib/IR/Function.cpp on localhost failed with exit code 4
/var/tmp/portage/sys-devel/llvm-3.5.0/work/llvm-3.5.0.src/Makefile.rules:1519: recipe for target '/var/tmp/portage/sys-devel/llvm-3.5.0/work/llvm-3.5.0.src-.arm/lib/IR/Release/Function.o' failed
make[1]: *** [/var/tmp/portage/sys-devel/llvm-3.5.0/work/llvm-3.5.0.src-.arm/lib/IR/Release/Function.o] Error 1
distcc is working fine. You ran out of memory.
If a remote distcc job fails, it simply abandons that job and tries again locally. It is there that it failed, using too much memory and getting killed. You have five make jobs on a system with less than half a gigabyte of memory. Each compiler job (preprocessing, compiling, assembling, linking) might easily take up half a gigabyte. The best thing to do on such a system is to compile nothing locally, and instead prepare packages on a proper workstation that cross-compiles everything.
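The arithmetic behind that advice can be sketched as follows. The helper name `max_local_jobs` is hypothetical, and the per-job figure is the "half a gigabyte" estimate from this comment, not a measurement.

```shell
#!/bin/sh
# Hypothetical helper: how many compiler jobs fit in RAM at once, given
# total memory ($1) and an assumed peak per job ($2), both in MB.
# Integer division rounds down.
max_local_jobs() {
    echo $(( $1 / $2 ))
}

total_mb=512      # RAM of the machine in this report
per_job_mb=512    # "half a gigabyte" per compiler job (an estimate)
make_jobs=5       # number of make jobs mentioned above

echo "jobs that fit in RAM: $(max_local_jobs "$total_mb" "$per_job_mb")"
echo "RAM needed for $make_jobs jobs: $(( make_jobs * per_job_mb )) MB"
```

Under those assumptions at most one job fits, while five simultaneous local jobs would want 2560 MB.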
My fix works with distcc 3.2_rc1 on the *client* side (my server was still 3.1). The issue is that DISTCC_IO_TIMEOUT isn't supported in 3.1 and 3.2 has been out-for-testing for (apparently) 3 years (might be wrong about that; it's based on the 2011 date in the portage testing mask). "distcc works fine for me" isn't really true, is it? It doesn't work - it falls back to the local compile and *that* works (and it takes 5 minutes to find out then probably a lot longer on the local system).
Incidentally, the subject line on the bug is wrong. This bug isn't about cc1plus failing because a rabid Linux OOM killer kills it (that's a separate bug); it's a bug about distcc timing out a compile that takes 5m52s. I listed llvm originally because it happens in the llvm build and, given the built-in default timeout of 5m in distcc, it can be fixed in the project ebuild, but I guess you could say the work-round is client (emerge machine) specific and it can be fixed in make.conf. (So it's nothing to do with llvm, just a distcc issue.)
See: https://bugs.gentoo.org/show_bug.cgi?id=518884 A fix for this bug is blocked by that. BTW, the OOM killer issue seems to be a bug in Linux 3.12.y; the process tree above cc1plus (or the later ld which suffers from it if this bug is fixed) has plenty of swappable processes waiting on the cc/ld; Linux chooses to kill the child rather than swap (or page) the parent. Put status back to 'wontfix', it clearly *can* be fixed (whatever you think the bug is - either the too short distcc timeout or the OOM problem.)