Gentoo Websites Logo
Go to: Gentoo Home Documentation Forums Lists Bugs Planet Store Wiki Get Gentoo!
Bug 912072 - dev-lang/go-1.21.0: sandbox causes intermittent I/O hang when GOMAXPROCS>1, affecting go-module.eclass
Summary: dev-lang/go-1.21.0: sandbox causes intermittent I/O hang when GOMAXPROCS>1, a...
Status: RESOLVED NEEDINFO
Alias: None
Product: Gentoo Linux
Classification: Unclassified
Component: Current packages (show other bugs)
Hardware: All Linux
: Normal normal (vote)
Assignee: Sandbox Maintainers
URL:
Whiteboard:
Keywords:
Depends on:
Blocks: go-1.21-bugs
  Show dependency tree
 
Reported: 2023-08-11 09:38 UTC by Holger Hoffstätte
Modified: 2024-05-14 12:57 UTC (History)
4 users (show)

See Also:
Package list:
Runtime testing required: ---


Attachments
strace from bash (strace.txt,19.17 KB, text/plain)
2023-12-12 17:15 UTC, Holger Hoffstätte
Details

Note You need to log in before you can comment on or make changes to this bug.
Description Holger Hoffstätte 2023-08-11 09:38:34 UTC
After updating to go-1.21.0 I noticed that go ebuilds with dependency tarballs take a looong time to emerge. The symptom is that during "go mod verify" the go process takes ~120% CPU, while in parallel a bash process takes ~30%. Both noodle around for several minutes. They do eventually finish, but this should not take more than half a second?


Reproducible: Always

Steps to Reproduce:
1. emerge go-1.21.0
2. emerge -v1 --quiet-build=n app-containers/docker
3. ..zzz...
Actual Results:  
Very slow go mod verify phase


I tried to peek inside the bash and go processes and both seem very busy signaling ecah other without much progress. The go process seems to take inordinate amounts of time doing sha26 block checksumming, but takes forever doing so.
Comment 1 Holger Hoffstätte 2023-08-11 09:44:29 UTC
Additional info:

- trying to build node_exporter is a better example
- for some reason it only seems to happen on my AMD Zen2 laptop,
not my SandyBridge Intel boxes where "go mod verify" takes <3 seconds.
Comment 2 Holger Hoffstätte 2023-08-11 15:29:54 UTC
Downgrading to 1.20 immediately fixes the problem, to the point where "go" does not even show up in top; the "go mod verify" call is basically immediate.

On this machine the TSC is broken and so it uses the HPET. I somehow suspect that a scheduler change leads to this stall/livelock, because as soon as I start sysdig to trace the go/bash processes, they start to make progress.
Comment 3 Holger Hoffstätte 2023-08-11 16:01:20 UTC
As suspected setting GOMAXPROCS=1 fixes the problem; all values >1 lead to go/bash stepping on each other's toes, probably with missed signal handling.

diff --git a/eclass/go-module.eclass b/eclass/go-module.eclass
index 6c58d7f26..a006da290 100644
--- a/eclass/go-module.eclass
+++ b/eclass/go-module.eclass
@@ -383,7 +383,7 @@ go-module_src_unpack() {
            cd "${S}"
            local nf
            [[ -n ${NONFATAL_VERIFY} ]] && nf=nonfatal
-           ${nf} ego mod verify
+           GOMAXPROCS=1 ${nf} ego mod verify
        fi
    fi
 }
Comment 4 Holger Hoffstätte 2023-08-11 18:34:01 UTC
For anybody also affected by this: instead of editing the eclass, passing GOMAXPROCS=1 to emerge has the same helpful effect. Compile times do not seem to suffer too much due to multiple parallel compiler jobs; IMHO an acceptable workaround for now.
Comment 5 Sam James archtester Gentoo Infrastructure gentoo-dev Security 2023-08-14 05:12:36 UTC
Holger, could you report this one upstream and link it here? I don't think it's worth one of us relaying it given we can't reproduce it yet and they're likely to want more info. Thanks!
Comment 6 Holger Hoffstätte 2023-08-14 10:01:57 UTC
(In reply to Sam James from comment #5)
> Holger, could you report this one upstream and link it here? I don't think
> it's worth one of us relaying it given we can't reproduce it yet and they're
> likely to want more info. Thanks!

Fair enough, but before I do so I'll have to run a few more experiments to see where this gremlin really hides.
Comment 7 Holger Hoffstätte 2023-08-14 14:04:51 UTC
After trying basically everything under the sun I found that FEATURES='-sandbox -usersandbox' seems to fix the problem right away, reliably and independent of GOMAXPROCS or the number of online cores.

This would explain why I cannot reproduce the problem oustide of ebuild/emerge, i.e. manually in a go package tree, or why everything else works just fine at runtime.

Setting SANDBOX_DEBUG=yes and enabled sandbox I see no violations (everything green) but it hangs on random reads of files in the GOMODCACHE - which exist (in the $WORKDIR/go-mod/cache directories) and are available, i.e. there is no need to download anything. It seems I/O just stops.
Comment 8 Holger Hoffstätte 2023-08-14 20:53:23 UTC
This behaviour also extends to building go itself (e.g. a rebuild of 1.21.0 with itself), where various go-based tasks ("dist", "compile") would show the same symptom, so it is not limited to "mod verify". I guess all I/O is affected.
Comment 9 Mike Gilbert gentoo-dev 2023-08-14 20:54:32 UTC
This sounds like some issue with sandbox's ptrace code, or a related kernel bug.

Given the highly specific reproduction steps that seem to require hardware I do not own, I really have no way to reproduce and debug this.
Comment 10 Holger Hoffstätte 2023-08-14 21:57:15 UTC
I guess I'll have to resign myself to setting GOMAXPROCS=1 in make.conf and move on then. Other than for building a few packages I don't use go and don't want to waste any more time on this either.
Thanks anyway.
Comment 11 Holger Hoffstätte 2023-12-12 17:15:42 UTC
Created attachment 878944 [details]
strace from bash

This is a size-limited strace snapshot of a non-progressing bash, the parent of the started go command with GOMAXPROCS > 1.
Comment 12 Holger Hoffstätte 2023-12-12 18:06:32 UTC
Managed to reproduce this outside of the shell/python/emerge lasagne! \o/
Here in a prepared node_exporter source tree with GOMAXPROCS unset:

holger>while (true) ; do time LD_PRELOAD=/usr/lib64/libsandbox.so sh -c "go mod verify > /dev/null" && sleep 1 ; done
LD_PRELOAD=/usr/lib64/libsandbox.so sh -c "go mod verify > /dev/null"  0.01s user 0.02s system 167% cpu 0.016 total
LD_PRELOAD=/usr/lib64/libsandbox.so sh -c "go mod verify > /dev/null"  0.01s user 0.02s system 152% cpu 0.016 total
LD_PRELOAD=/usr/lib64/libsandbox.so sh -c "go mod verify > /dev/null"  0.71s user 8.38s system 150% cpu 6.043 total
LD_PRELOAD=/usr/lib64/libsandbox.so sh -c "go mod verify > /dev/null"  0.01s user 0.07s system 156% cpu 0.051 total
LD_PRELOAD=/usr/lib64/libsandbox.so sh -c "go mod verify > /dev/null"  0.01s user 0.01s system 145% cpu 0.014 total
LD_PRELOAD=/usr/lib64/libsandbox.so sh -c "go mod verify > /dev/null"  0.02s user 0.00s system 153% cpu 0.016 total
LD_PRELOAD=/usr/lib64/libsandbox.so sh -c "go mod verify > /dev/null"  6.10s user 74.70s system 150% cpu 53.695 total
LD_PRELOAD=/usr/lib64/libsandbox.so sh -c "go mod verify > /dev/null"  0.01s user 0.01s system 140% cpu 0.014 total
LD_PRELOAD=/usr/lib64/libsandbox.so sh -c "go mod verify > /dev/null"  0.01s user 0.01s system 165% cpu 0.016 total
LD_PRELOAD=/usr/lib64/libsandbox.so sh -c "go mod verify > /dev/null"  0.02s user 0.00s system 173% cpu 0.014 total
LD_PRELOAD=/usr/lib64/libsandbox.so sh -c "go mod verify > /dev/null"  5.61s user 62.72s system 150% cpu 45.318 total
LD_PRELOAD=/usr/lib64/libsandbox.so sh -c "go mod verify > /dev/null"  0.01s user 0.01s system 153% cpu 0.015 total
LD_PRELOAD=/usr/lib64/libsandbox.so sh -c "go mod verify > /dev/null"  0.01s user 0.01s system 122% cpu 0.016 total

The outliers (8.38, 74.7, 62.72 s) are the cases where things are stuck for no reason.
Comment 13 Paolo Pedroni 2024-05-14 12:57:24 UTC
It may be fixed in recent versions of dev-lang/go. I tried today on a zen4 machine and it worked fine with sandbox and usersandbox.