669978 – ebuilds that should disable distcc FEATURE

Status:	CONFIRMED

Alias:	None

Product:	Portage Development
Classification:	Unclassified
Component:	Unclassified
Hardware:	All Linux

Importance:	Normal normal
Assignee:	Portage team

Reported:	2018-10-30 16:37 UTC by soundbastlerlive
Modified:	2019-07-27 08:51 UTC
CC List:	5 users

See Also:	671950 28300 80894 636806 620738 581732 522716 497404 473856 679368

Description soundbastlerlive 2018-10-30 16:37:22 UTC

Some packages cannot be emerged with FEATURES="distcc", e.g. bison, mariadb, cmake, re2c, tdb, talloc, tevent and ironically (because of same authors) samba.
Otherwise the build fails with weird errors that often do not point to distcc being the problem.

I believe/guess that this can be avoided/specified by the ebuild so no manual interaction (FEATURES="-distcc" emerge ...) is required.

At least the doc says to report packages which fail.

Thanks!

Comment 1 Matt Turner gentoo-dev 2018-11-01 18:50:33 UTC

I've got an idea for this.

Comment 2 soundbastlerlive 2018-11-10 09:49:19 UTC

sys-devel/bison also fails with weird errors

Comment 3 soundbastlerlive 2018-11-12 23:32:51 UTC

nano also fails with weird issues like bison:

[...]

checking whether <wchar.h> uses 'inline' correctly... no
configure: error: <wchar.h> cannot be used with this compiler (x86_64-pc-linux-gnu-gcc -O3 -pipe -march=broadwell -mtune=skylake --param l1-cache-size=32 --param l1-cache-line-size=64 --param l2-cache-size=512 -mmmx -mno-3dnow -msse -msse2 -msse3 -mssse3 -mno-sse4a -mcx16 -msahf -mmovbe -maes -mno-sha -mpclmul -mpopcnt -mabm -mno-lwp -mfma -mno-fma4 -mno-xop -mbmi -mno-sgx -mbmi2 -mno-tbm -mavx -mavx2 -msse4.2 -msse4.1 -mlzcnt -mno-rtm -mno-hle -mrdrnd -mf16c -mfsgsbase -mrdseed -mprfchw -madx -mfxsr -mxsave -mxsaveopt -mno-avx512f -mno-avx512er -mno-avx512cd -mno-avx512pf -mno-prefetchwt1 -mno-clflushopt -mno-xsavec -mno-xsaves -mno-avx512dq -mno-avx512bw -mno-avx512vl -mno-avx512ifma -mno-avx512vbmi -mno-avx5124fmaps -mno-avx5124vnniw -mno-clwb -mno-mwaitx -mno-clzero -mno-pku -mno-rdpid ).
This is a known interoperability problem of glibc <= 2.5 with gcc >= 4.3 in
C99 mode. You have four options:
  - Add the flag -fgnu89-inline to CC and reconfigure, or
  - Fix your include files, using parts of
    <https://sourceware.org/git/?p=glibc.git;a=commitdiff;h=b037a293a48718af30d706c2e18c929d0e69a621>, or
  - Use a gcc version older than 4.3, or
  - Don't use the flags -std=c99 or -std=gnu99.
Configuration aborted.

Comment 4 soundbastlerlive 2018-11-21 14:36:55 UTC

net-libs/socket_wrapper also fails

I'm not sure all can be fixed, however I thought there was some ebuild syntax to disable it. Maybe as a first step these ebuilds should set FEATURES="-distcc" or something?

Comment 5 Thomas Deutschmann (RETIRED) gentoo-dev 2018-11-25 22:19:57 UTC

From current status, this looks like a distcc bug and nothing portage should handle.

See bug 665214 for details. In other words: Every package using gnulib will sooner or later get commit 285334ca5ac8f537bc183abd121aa68984e5a515 which will break distcc.

Comment 6 soundbastlerlive 2018-11-26 11:24:16 UTC

Thanks for clarifying! I forgot a few pkgs, because I never bothered to create a bug report, but maybe they too are related to that.

I interpret your findings as "this cannot be fixed and will only get worse", however then I would hope for portage to handle this even more. Sure it's a problem with the code, but if this cannot be fixed shouldn't portage/ebuilds just specify their incompatibility with distcc?

Comment 7 Michał Górny archtester 2018-11-26 16:38:16 UTC

Most of the time, distcc bugs fall into three categories:

1. Use of distcc-pump which is simply broken upstream and (as I've said multiple times) should be removed from FEATURES to prevent the same bugs spawning over and over again.

2. Use of fancy compiler flags that indicates a serious problem with the build system, with the fix being disabling the fancy things instead of disabling distcc.

3. Parallel make problems that get triggered by high --jobs.

In any case, distcc *must never* be silently disabled by Portage or ebuilds.  Using distcc implies using a high --jobs value and relying on the distcc scheduler to prevent killing the system.  By disabling distcc, you're effectively turning the ebuild into a DoS attack on the user's system.

Comment 8 soundbastlerlive 2018-11-29 15:58:38 UTC

Michał, thank you for your input and remarks!
However I respectfully disagree with some of your points, mostly your final remark.

1) distcc-pump works fine, at least for me, on hundreds of different packages on over 100 gentoo machines for years now, but that's only me, so I'm sure you have more data on that
2) obviously -march=native cannot work with distcc on different CPUs; other than that, this (many flags) works fine for me as well
3) that would be a problem of the pkg Makefile and will surface and need to be fixed by the pkg src devs sooner rather than later, with 64 core/128 thread (or more!) systems now becoming increasingly common. I have compiled hundreds of pkgs with -j98 without issues, however this was a non-distcc AMD Epyc system.

I *totally* disagree that portage "should never disable distcc because of high number of processes which will be spawned".

How is this "launching a DoS on the user"? It is the user itself (and a technically very proficient one if (s)he set up a distcc gentoo system) starting this job (emerge). So the user would be DoSing her-/himself in your opinion? To me that does not count as DoS. Certainly not an "attack".

Furthermore, I have done exactly that quite a few times when my distcc system/nodes were down. I was too lazy to lower "-j", so I used e.g. "-j48" on a "poor and weak" dual core system.
Sure, the load average goes way up, but the Linux scheduler handles this perfectly fine, without any lockups etc., as long as you don't run out of RAM. And if you do run out of RAM, the build process will abort anyways because of OOM. In the very worst case the OOM reaper would kill the wrong process (e.g. your browser), but this is an edge case anyway IMHO. (e.g. -j48 on a dual core machine which only has 2GB of RAM and very little swap)

-j48 is not really much different than -j4 on a dual-core/thread machine. Both spawn more processes than available HW threads. It will simply use more RAM and be less efficient overall (total time goes up slightly).

Comment 9 Thomas Deutschmann (RETIRED) gentoo-dev 2018-11-29 18:13:49 UTC

(In reply to soundbastlerlive from comment #8)
> I *totally* disagree that portage "should never disable distcc because of
> high number of processes which will be spawned".
> 
> How is this "launching a DoS on the user"? It is the user itself (and a
> technically very proficient one if (s)he set up a distcc gentoo system)
> starting this job (emerge). So the user would be DoSing her-/himself in your
> opinion? To me that does not count as DoS. Certainly not an "attack".
> 
> [...]
> 
> -j48 is not really much different than -j4 on a dual-core/thread machine.
> Both spawn more processes than available HW threads. It will simply use more
> RAM and be less efficient overall (total time goes up slightly).

No. If you *totally* disagree with the DoS claim, I *totally* disagree with the statement that -j48 vs -j4 on a poorly equipped machine doesn't cause any problems because the scheduler will handle that very well.

I really don't know what scheduler you are using, but most users are running a vanilla kernel without any special scheduler tweaking, and what you described is just not the OOTB experience for most users.

So yes: If we are talking about RESTRICT=distcc we would also need something like MAKEOPTS_FALLBACK, for example, so that the PM would use different MAKEOPTS if distcc isn't available. But this is not enough: Because most packages cannot really benefit from more than 16 threads, you would also have to increase the PM's jobs to compile, let's say, 2 packages with 16 threads each at the same time to fully utilize your build cluster. Not easy to implement.

Patches are welcome. :)

Comment 10 Matt Turner gentoo-dev 2018-11-29 21:37:27 UTC

Sorry I haven't had a chance to work on this.

My idea is to provide a package.env file (in the same format as a package.mask) that can list packages that don't work with distcc. We can keep it in a separate git repo and users can use it if they wish.
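
An editor's sketch of what such a setup could look like, using Portage's existing package.env mechanism. The file name no-distcc.conf, the DEMO_ROOT scratch directory and the -j4 value are illustrative assumptions and are not taken from this bug; the package list only reuses two packages named in the comments above. The script writes into a scratch tree rather than the live /etc/portage:

```shell
#!/bin/sh
# Sketch only: build the layout under a scratch directory, not /etc/portage.
# DEMO_ROOT and the name "no-distcc.conf" are arbitrary, hypothetical choices.
DEMO_ROOT="${DEMO_ROOT:-$(mktemp -d)}"
mkdir -p "$DEMO_ROOT/etc/portage/env"

# Settings that apply only to the packages mapped to this file below.
# FEATURES is incremental, so a leading "-" switches a feature off.
# MAKEOPTS is lowered as well, since the build now runs locally
# (-j4 is a placeholder for a value that suits the local machine).
cat > "$DEMO_ROOT/etc/portage/env/no-distcc.conf" <<'EOF'
FEATURES="-distcc -distcc-pump"
MAKEOPTS="-j4"
EOF

# package.env maps package atoms to files in the env/ directory,
# one "atom file" pair per line.
cat > "$DEMO_ROOT/etc/portage/package.env" <<'EOF'
sys-devel/bison no-distcc.conf
net-libs/socket_wrapper no-distcc.conf
EOF

echo "wrote demo config under $DEMO_ROOT"
```

With the same two files placed under /etc/portage, the listed packages would be built without distcc and with the lower job count, while all other packages keep the global FEATURES and MAKEOPTS.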

The problem of too high a -jX value when distcc doesn't work is very real. There are packages for which distcc works, but where you don't want to use it with a correspondingly high -jX because the package has a lot of compiles that aren't distcc compatible, like guile, which compiles lots of guile code.

I'll see what I can put together.

Comment 11 soundbastlerlive 2018-11-29 23:27:16 UTC

I don't disagree at all that all these more complex solutions presented by you are (more) ideal/desirable, however IMHO emerge should first and foremost work and finish the merge no matter what and not fail/abort/deny for some packages because of some FEATURES and require manual intervention.
That would easily be possible with very little effort if specific ebuilds could be excluded from distcc or some other similar simple mechanism.
After that, the feature can still be optimized and refined of course. All the things you propose are definitely nice to have, but not required to have stuff "just work".

I just tested emerging guile locally with a dual core/4 thread system and -j48. Of course there will be slightly more short delays/stuttering (for a few ms) during emerge as guile easily maxes out those 48 jobs. However I easily was able to continue working.

** But what is the issue with a slightly laggy system during full load and how is that apparently *worse* than completely failing to emerge? **

When a user starts some heavy emerging (s)he should always expect some load and slowdowns. That is not a DoS attack.

It's also not really all that different from -j[CPUs + 1], which you seem to think would be fine. The machine is fully loaded in both cases, just more processes in the ready queue. I use standard gentoo-sources (both latest LTS and stable, currently 4.14/4.19) BTW, just a fully manual config, which shouldn't really improve things in this regard.

Finally it is always advisable to use -l[some load] anyways and then you could have -j999999 and it would never spawn any more tasks above the specified load average. All your feared scenarios could then never manifest themselves no matter what (local or distributed compile).

E.g. on many 4C/8T VMs I use:
EMERGE_DEFAULT_OPTS="[...] --jobs=9 --load-average=4"
MAKEOPTS="-j48 -l7"
FEATURES="[...] distcc distcc-pump"

and it has worked for years without issues except for the few packages mentioned.

To summarize my humble opinion/findings:
*) most importantly, emerge should first and foremost work without failing and requiring manual intervention
*) emerge failing with cryptic messages is not better than having a slightly lagging system during a fully loaded emerge, which can happen without distcc as well
*) someone who uses gentoo and sets up a working distcc cluster can be reasonably expected to simply add an appropriate "-l" MAKEOPTS option to limit the local load, avoiding any (debatable) "DoS attack" scenario
*) the simplest easiest solution as a first step is therefore to exclude specific packages from distcc
*) also simple, low hanging fruit: if falling back to local, automatically add MAKEOPTS -l[CPUs + 1] (or similar) if not already defined anyway

Everything else beyond that is a nice-to-have/bonus feature, complicating things just to improve performance for something that happens very rarely anyway and potentially introducing new bugs.

Comment 12 Michał Górny archtester 2018-11-30 07:38:20 UTC

You still didn't share what miraculous kernel you're using.  Because on our systems, > -j5 with heavy C++ package usually means running out of memory into heavy swapping that makes system *unusable* (to the point of mouse pointer freezing) for minutes.

Before you start claiming that it's our fault to have swap, I should tell you a similar thing happens without swap -- except the system freezes for a little shorter, after which it OOM-kills the compiler and emerge fails.

-l won't help at all if the deadly number of compiler processes is spawned *before* they actually start taking up resources.  And gcc is kinda slow-ish in taking up all the available memory.

Finally, I should point out that some of the packages you've listed work just fine for me and are probably distcc-pump issues which you discarded.  Yes, it's the cause for many cryptic build failures and *silent miscompilations*.

Comment 13 soundbastlerlive 2018-11-30 09:08:08 UTC

(In reply to Michał Górny from comment #12)
> You still didn't share what miraculous kernel you're using.  Because on our
Of course I did (not that it really matters though)... but it seems you couldn't even finish reading half my comment before having the need to get miraculously nonconstructive ;)

> systems, > -j5 with heavy C++ package usually means running out of memory
> into heavy swapping that makes system *unusable* (to the point of mouse
> pointer freezing) for minutes.
So what's your point? Any 4C/8T CPUs with corresponding e.g. -j9 will have the same problem irrespective of distcc usage and with an SSD swapping isn't that bad.
STILL better than emerge just failing. Unlike a real DoS attack it *will* finish (or worst case: abort/fail) at some point.
Who in their right mind expects their system to be unaffected while emerging something that is already too demanding at -j5 on low end hardware but is at the same time technically proficient enough to set up a working distcc cluster?
Especially with apparently completely ancient hardware which cannot even handle -j5 without "exploding", has way too little RAM to do any reasonable compiling/emerging and no SSD to gracefully handle swapping.
Huge gcc processes in my experience use 500MB to maybe at most 2GB of RAM (of course there are exceptions), of which not all is active. No problem for a modern system with 4-32GB RAM and an SSD.

What are your ("our") systems anyway and by that do you mean "THE (i.e. standard/common) gentoo systems", your company or are you speaking for everyone's systems but mine?
Are those single cores with 2GB of RAM? Raspberry PIs? Just seems extremely unlikely to me to cause any harm for those using distcc setups (better than failing).

> Before you start claiming that it's our fault to have swap, I should tell
> you a similar thing happens without swap -- except the system freezes for a
> little shorter, after which it OOM-kills the compiler and emerge fails.
Why would I claim having swap is a problem/your fault? I always set up swap space as well, although usually just a few GB for "emergencies", because sizing RAM correctly in the first place is obviously preferable.
I even wrote myself that worst case, OOM reaper will activate, but it seems you didn't read that either.
STILL better than emerge not supporting distcc blacklists.

> -l won't help at all if the deadly number of compiler processes is spawned
> *before* they actually start taking up resources.  And gcc is kinda slow-ish
> in taking up all the available memory.
You are actually correct about the load average taking some time to go up, but again: is the goal of emerge, initiated by the user, to compile packages for said user, or just to be some magical process that does nothing and uses no resources? Doing stuff requires resources; that's what they are there for anyway. Gentoo has always been the most un-green distro in that regard ;) (which is why I use shared binpkgs for specific groups, instead of compiling everything on ~100 machines, for the last ~15 years I've been using (and loving!) it).
You are also losing context here: all of these unlikely "horror scenarios" would only happen on advanced users' (distcc setup) systems for very few and specific packages which would otherwise fail to emerge anyway. We would not be "endangering" the "helpless masses" of "normal" gentoo users.

> Finally, I should point out that some of the packages you've listed work
> just fine for me and are probably distcc-pump issues which you discarded. 
> Yes, it's the cause for many cryptic build failures and *silent
> miscompilations*.
I did not discard your input regarding distcc-pump (thanks again!). IMHO however the same solution would also be perfect for this: allow blacklisting of distcc-pump as well. E.g. cmake does not emerge with distcc at all (even without distcc-pump).

I really still do not understand why you are so opposed to just blacklisting distcc for some packages. The advantages IMHO far outweigh the very unlikely edge cases. There are a zillion other more significant pitfalls in using gentoo that affect *way* more users than this. This is not a criticism, but a natural consequence of being so incredibly flexible and awesome. I have always loved gentoo for that and am just trying to improve it even more!

Comment 14 Thomas Deutschmann (RETIRED) gentoo-dev 2018-11-30 17:39:15 UTC

(In reply to soundbastlerlive from comment #13)
> I really still do not understand why you are so opposed to just blacklisting
> distcc for some packages. The advantages IMHO far outweigh the very unlikely
> edge cases.

No, it is not an edge case:

As said, when you decide to use distcc in general,

1) you add "distcc" and maybe "distcc-pump" to FEATURES in your make.conf.

2) you increase MAKEOPTS.

But as said, we *know* that there's a chance to shoot yourself in the foot when you just set FEATURES=-distcc but forget about adjusting MAKEOPTS.

I am not saying that we have the ultimate truth but this is open source: You will never see someone implementing something where the person who will actually do the work already knows that it won't work and moreover will cause problems for him/herself.

But again, this is open source: We are more than happy if you or anyone else decides to spend time on this and improve things, as we are always looking to improve Gentoo. But please keep in mind that any patch which just implements 'RESTRICT="distcc"' without dealing with the problem caused by unadjusted MAKEOPTS has no chance of getting accepted.

Comment 15 soundbastlerlive 2018-11-30 18:10:00 UTC

Why are distcc users not allowed to shoot themselves in the foot? ;)
I don't think anyone who does all that would then complain or blame gentoo if their system is slightly laggy during very few compiles and would prefer it to fail instead like it does now.

How is probably less than 0.01% of all packages, and problems which may (!) only occur on very low end hardware, not an edge case?
Why is it better to just fail than have increased load during emerge?

The user asks the computer to do some task (emerge) and expects it to work/finish and not fail because finishing it may or may not cause increased load for some time.

We are not talking about huge packages anyways:
*) re2c (13s)
*) nano (19s)
*) bison (20s)
*) tevent (22s)
*) cmake (80s)
*) samba (160s)

None of which could ever spawn that many gcc processes anyway (a problem which already affects modern high-core-count systems like Epyc, where almost no package compile can use all cores, and even then usually only for 1-5s).

So on super crappy hardware (e.g. single core, 1GB RAM, no SSD) combined with a super crazy distcc setup (-j100, a choice made by the user!), and then still only for packages which are actually big and independent enough to spawn that many tasks (very rare because of inter-file dependencies!), this would make the system lag for at most a few minutes after the user has decided (s)he wants to compile something.
But you seem to think it's better to abort/fail to protect these very rare combinations set up by the user himself from "harming" the user, who deliberately made this choice.
This causes emerge system/world or any other longer emerge which includes one of those packages to fail, and the user has to investigate, try/fool around, hope for the best, spend a lot of time etc.

I still fail to see the advantage to that.

I was also afraid you were going to say that a patch wouldn't even be accepted :(

I also already suggested that if we are so scared of causing high load, RESTRICT=distcc (or whatever) could simply replace any "-jXXX" MAKEOPTS larger than number of threads with just that number (so -j100 would be replaced with -j8 on 8T CPUs), completely alleviating this IMHO unlikely problem. AFAIK gentoo already does this simple text replacement for some CFLAGS etc..
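
The text replacement proposed above could be sketched as a small shell function. This is only an editor's illustration: clamp_jobs is a hypothetical helper, not something Portage provides, and it only handles the attached forms -jN and --jobs=N (a separated "-j 48" passes through unchanged).

```shell
#!/bin/sh
# clamp_jobs MAKEOPTS MAX
# Print MAKEOPTS with every -jN / --jobs=N larger than MAX replaced by MAX.
# Hypothetical helper for illustration; not part of Portage.
clamp_jobs() {
    opts=$1
    max=$2
    out=
    for word in $opts; do
        prefix=
        case $word in
            -j[0-9]*)      prefix=-j ;;
            --jobs=[0-9]*) prefix=--jobs= ;;
        esac
        # Strip the option prefix (if any) to get the candidate job count.
        n=${word#"$prefix"}
        case $n in
            *[!0-9]*) prefix= ;;   # not a plain number: leave the word alone
        esac
        if [ -n "$prefix" ] && [ "$n" -gt "$max" ]; then
            word=$prefix$max
        fi
        # Re-join the words with single spaces.
        out="$out${out:+ }$word"
    done
    printf '%s\n' "$out"
}

clamp_jobs "-j100 -l7" 8    # prints: -j8 -l7
clamp_jobs "-j4 -l7" 8      # prints: -j4 -l7
```

On a real system the limit could come from `nproc`, e.g. `clamp_jobs "$MAKEOPTS" "$(nproc)"`. Note that this caps the job count but does nothing about per-job memory use, which is the concern raised in comment #12.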

Wouldn't that make everyone happy?

Comment 16 Michał Górny archtester 2018-12-02 16:32:06 UTC

Here's evidence on your claim that distcc-pump isn't the issue.  Please particularly note the huge number of duplicates on the cmake issue.

Comment 17 soundbastlerlive 2018-12-02 19:42:40 UTC

What?!
I never once claimed that distcc-pump is never the issue, just that it works fine on hundreds of packages.

My last post specifically thanked you for the info (because you also falsely claimed I discarded that information) and that the same solution could be applied for it (blacklist distcc-pump for specific package). With that your "denial of service computer explosion" argument would never even be possible.

This is getting ridiculous.
You are the "gatekeepers" of gentoo anyway and get to choose what happens with it (like not accepting patches), you don't have to repeatedly lie about what people say or make up claims as well.
"We want it to fail and not work" would have been enough.

So let's just have a lot of stuff fail cryptically to prevent someone from "DoSing" their own system for 10 seconds by emerging stuff like nano...yeah, right...thanks for "protecting" us poor and stupid distcc users with single core 1GB RAM systems with IDE HDDs and no swap!
We do not want emerge world to finish, we want to spend hours to manually look at errors and emerge packages individually, because otherwise emerge world (which apparently should not take any resources and cause no load at all!) may slow down our systems for a few seconds or - god forbid - even minutes!

Comment 18 Larry the Git Cow gentoo-dev 2019-07-22 18:36:05 UTC

The bug has been referenced in the following commit(s):

https://gitweb.gentoo.org/repo/gentoo.git/commit/?id=bbf87b340dc007fba1b4229d5569e8b4ae399436

commit bbf87b340dc007fba1b4229d5569e8b4ae399436
Author:     Matt Turner <mattst88@gentoo.org>
AuthorDate: 2019-07-22 18:22:13 +0000
Commit:     Matt Turner <mattst88@gentoo.org>
CommitDate: 2019-07-22 18:35:46 +0000

    app-portage/no-distcc-env: New package
    
    Bug: https://bugs.gentoo.org/669978
    Signed-off-by: Matt Turner <mattst88@gentoo.org>

 app-portage/no-distcc-env/metadata.xml             |  8 +++++
 .../no-distcc-env/no-distcc-env-9999.ebuild        | 36 ++++++++++++++++++++++
 2 files changed, 44 insertions(+)

Comment 19 Matt Turner gentoo-dev 2019-07-22 18:38:34 UTC

Please give app-portage/no-distcc-env a try. It ships package.env files to disable FEATURES=distcc/distcc-pump per-package.

Hopefully we can crowd-source this and make it possible to enable FEATURES=distcc in make.conf.

Comment 20 Zac Medico gentoo-dev 2019-07-27 08:48:44 UTC

(In reply to soundbastlerlive from comment #15)
> I also already suggested that if we are so scared of causing high load,
> RESTRICT=distcc (or whatever) could simply replace any "-jXXX" MAKEOPTS
> larger than number of threads with just that number (so -j100 would be
> replaced with -j8 on 8T CPUs), completely alleviating this IMHO unlikely
> problem. AFAIK gentoo already does this simple text replacement for some
> CFLAGS etc..
> 
> Wouldn't that make everyone happy?

We should allow the user to specify fallback MAKEOPTS, as suggested in comment #9.

Comment 21 Zac Medico gentoo-dev 2019-07-27 08:51:44 UTC

Maybe add a MAKEOPTS_DISTCC variable for the case where distcc is enabled, and otherwise use the normal MAKEOPTS variable.
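
A rough editor's sketch of the selection logic suggested here. MAKEOPTS_DISTCC is only the variable name proposed in this comment, and effective_makeopts is a hypothetical helper; neither is an existing Portage interface.

```shell
#!/bin/sh
# effective_makeopts FEATURES MAKEOPTS MAKEOPTS_DISTCC
# Print the MAKEOPTS value a build would use under the proposal:
# the distcc-specific value while distcc is enabled, the normal one otherwise.
# Hypothetical helper for illustration; not part of Portage.
effective_makeopts() {
    features=$1
    makeopts=$2
    makeopts_distcc=$3
    # Pad with spaces so that only the whole word "distcc" matches,
    # not "distcc-pump" or a disabled "-distcc".
    case " $features " in
        *" distcc "*)
            # Fall back to plain MAKEOPTS if no distcc-specific value is set.
            printf '%s\n' "${makeopts_distcc:-$makeopts}" ;;
        *)
            printf '%s\n' "$makeopts" ;;
    esac
}

effective_makeopts "sandbox distcc distcc-pump" "-j5" "-j48 -l7"   # prints: -j48 -l7
effective_makeopts "sandbox -distcc" "-j5" "-j48 -l7"              # prints: -j5
```

Keeping the conservative value in plain MAKEOPTS means a package that has distcc switched off, for instance through package.env, automatically gets the local job count instead of the cluster-sized one.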