in CrOS, we install a lot of binpkgs. we'll often install hundreds of them into an empty root when creating a new disk image for a device. on systems with a lot of cores (e.g. 32+), we find that portage has very low parallelism. historically it hasn't been great, but it's gotten much worse in recent versions.

our baseline is (largely) v2.3.75. you can find the full branch here:
https://chromium.googlesource.com/chromiumos/third_party/portage_tool/+/refs/heads/chromeos-2.3.75

we rebased our fork onto 3.0.21 and that's when we detected a huge drop in performance. you can find that branch here:
https://chromium.googlesource.com/chromiumos/third_party/portage_tool/+/refs/heads/chromeos-3.0.21

it's been hard bisecting things down due to the async migrations breaking things, but we've identified at least one place where there was a large regression. reverting that in 3.0.21 seems to bring back a lot, but not all, of the performance.
https://gitweb.gentoo.org/proj/portage.git/commit/?id=d66e9ec0b10522528d62e18b83e012c1ec121787

here's one way of reproducing with vanilla Gentoo. i use -j32 here because that's what our builders tend to have, and my workstation has 36 cores (72 if you count SMT). but this should be reproducible with even just -j4 or more.
* start with new Gentoo install
https://bouncer.gentoo.org/fetch/root/all/releases/amd64/autobuilds/20220821T170533Z/stage3-amd64-openrc-20220821T170533Z.tar.xz
emerge --version
Portage 3.0.30 (python 3.10.5-final-0, default/linux/amd64/17.1, gcc-11.3.0, glibc-2.35-r8, 5.17.11-1rodete2-amd64 x86_64)

* install a few extra packages
emerge -q --sync
emerge -j32 -1 dev-libs/nss dev-vcs/git

* clone a small CrOS repo for accounts
cd /var/db/repos/
git clone --depth=1 https://chromium.googlesource.com/chromiumos/overlays/eclass-overlay

* workaround a bug in CrOS packages
mkdir -p /usr/share/fonts/foo

* setup a sysroot for the board
SYSROOT=/build/amd64-generic
mkdir -p $SYSROOT/etc/portage/profile
cd $SYSROOT/etc/portage
cat <<EOF >make.conf
SYSROOT=$SYSROOT
CHOST=x86_64-cros-linux-gnu
PKGDIR=$SYSROOT/packages
ACCEPT_LICENSE="*"
MAKEOPTS=-j32
PORTDIR_OVERLAY=/var/db/repos/eclass-overlay
PORTAGE_BINHOST=https://storage.googleapis.com/chromeos-prebuilt/test-vapier/amd64-generic/2022.08.22/packages
FEATURES="parallel-fetch parallel-install"
EOF
ln -sfT /var/db/repos/gentoo/profiles/default/linux/amd64/17.1 make.profile
wget https://storage.googleapis.com/chromeos-prebuilt/test-vapier/amd64-generic/2022.08.22/package.provided -P profile/

* try to emerge things and watch load average
emerge --with-bdeps=n -Gv --binpkg-respect-use=n virtual/target-os -j32 --root $SYSROOT --config-root $SYSROOT |& tee /log

even though there are many many places that say "32 running", the load average barely gets above 2. that means portage isn't actually installing anything in parallel.

NB: the overall install will probably fail after installing ~ packages, but that's to be expected due to CrOS specific things not working in vanilla Gentoo. the problem should manifest itself well before that point though so i didn't bother fixing it.
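to put a number on "the load average barely gets above 2", here's a rough sketch of a sampler that could wrap the emerge run. the `sample_load` and `peak_load` helpers are hypothetical names made up for illustration, not part of portage, and this assumes a Linux system where the first field of /proc/loadavg is the 1-minute load average.

```shell
# Hypothetical helper (not part of Portage): run a command in the background
# and append "<unix time> <1-minute load average>" to LOGFILE once per
# INTERVAL seconds for as long as the command is still running.
# Usage: sample_load LOGFILE INTERVAL_SECONDS COMMAND [ARGS...]
sample_load() {
    log=$1
    interval=$2
    shift 2
    "$@" &
    pid=$!
    while kill -0 "$pid" 2>/dev/null; do
        # Assumption: Linux, where the first field is the 1-minute average.
        read -r load1 _ < /proc/loadavg
        printf '%s %s\n' "$(date +%s)" "$load1" >> "$log"
        sleep "$interval"
    done
    # Propagate the exit status of the wrapped command.
    wait "$pid"
}

# Print the highest load average recorded in a log written by sample_load.
peak_load() {
    awk '$2 > max { max = $2 } END { print max + 0 }' "$1"
}
```

usage would be something like `sample_load /tmp/load.log 5 emerge ...` with the same emerge arguments as in the last step, followed by `peak_load /tmp/load.log`. since the 1-minute average lags behind, short bursts of parallelism will be smoothed out, so treat the peak as a coarse indicator only.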
ccing zmedico in case he has any ideas
Sam asked me to update this bug with what I know.

Firstly, the document go/cros-portage-bisection-findings from within Google has a bit more analysis. I'm not sure why all of it didn't find its way here. I also emailed Zac a few months ago.

(In reply to SpanKY from comment #0)
> it's been hard bisecting things down due to the async migrations breaking
> things, but we've identified at least one place where there was a large
> regression. reverting that in 3.0.21 seems to bring back a lot, but not
> all, of the performance.
> https://gitweb.gentoo.org/proj/portage.git/commit/?id=d66e9ec0b10522528d62e18b83e012c1ec121787

This commit is in portage-2.3.90. In a private email, Zac noted that it was reverted by commit https://gitweb.gentoo.org/proj/portage.git/commit/?id=71ae5a58fe72bc32dce030210a73ea5c9eeb4a1c, which is in portage-2.3.97.

From looking at the Google doc, I suspect that vapier meant a different commit (this one isn't mentioned in the doc, and it was already reverted before 3.0.21).

The two commits mentioned in the Google doc are:

1. https://gitweb.gentoo.org/proj/portage.git/commit/?id=9b755b46f9e88f25fecada0a32095ea614a73b57 (in portage-2.3.99), reported to cause a large increase in the time taken to calculate dependencies (+800%). This may have been the issue mitigated by https://gitweb.gentoo.org/proj/portage.git/commit/?id=839ab46be1777e5886da28b98b53a462b992c5bf

2. https://gitweb.gentoo.org/proj/portage.git/commit/?id=50da2e16599202b9ecb3d4494f214a0d30b073d (in portage-2.3.93), reported to keep various processes alive for longer than before, which leads to less parallelization. I believe this is the commit vapier meant to cite in the original report.
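For anyone who wants to double-check which release first shipped one of these commits, here is a minimal sketch using `git tag --contains` in a local clone of the portage repository. The `first_tag_containing` helper is a hypothetical name for illustration. It assumes GNU `sort -V` is available and that sorting tag names by version approximates release order, which may not hold if the repository mixes tag naming schemes.

```shell
# Hypothetical helper: list every tag whose history contains COMMIT and
# print the lowest one according to a version sort.
# Assumption: GNU coreutils `sort -V`; version order approximates release order.
# Usage: first_tag_containing REPO_DIR COMMIT
first_tag_containing() {
    git -C "$1" tag --contains "$2" | sort -V | head -n 1
}
```

For example, `first_tag_containing /path/to/portage d66e9ec0b10522528d62e18b83e012c1ec121787` would be expected to print the 2.3.90 tag if the analysis above is right; the path is a placeholder for wherever the clone lives.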
Zac also noted that the rdep_cache in https://chromium-review.googlesource.com/c/chromiumos/third_party/portage_tool/+/3780786 may help here.
Created attachment 864225
Portage Versions Bisection Findings.pdf

Here's the document.