Bug 645658 - dev-lang/rust-1.23 compile phase rustc goes into infinite loop
Summary: dev-lang/rust-1.23 compile phase rustc goes into infinite loop
Status: RESOLVED WORKSFORME
Alias: None
Product: Gentoo Linux
Classification: Unclassified
Component: Current packages (show other bugs)
Hardware: All Linux
Importance: Normal normal (vote)
Assignee: Gentoo Rust Project
URL:
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2018-01-25 07:55 UTC by Duncan
Modified: 2018-11-15 10:41 UTC (History)
3 users (show)

See Also:
Package list:
Runtime testing required: ---


Attachments
dev-lang:rust-1.23.0:20180125-061236.log.xz (dev-lang:rust-1.23.0:20180125-061236.log.xz,61.44 KB, application/x-xz)
2018-01-25 08:00 UTC, Duncan
emerge --info rust (emerge.rust.info,7.74 KB, text/plain)
2018-01-25 08:03 UTC, Duncan

Description Duncan 2018-01-25 07:55:40 UTC
rust-1.23's compile phase sends rustc into an apparent infinite loop: it sits for over an hour of used CPU time at 100% on a single core (so over an hour of wall time as well). An attempt to strace that thread (from htop) says it's x32(!!), despite my being on ~amd64/no-multilib. Other threads of that invocation strace to a futex, apparently waiting on the spinning thread, but how can they be traced on amd64/no-multilib, while the one thread triggering the problem claims to be x32 and won't trace?

After over an hour of wall and CPU time, I killed it, which of course killed the merge. I made several tries, with and without USE=jemalloc, and hit the same infinite loop on all of them.

(Additionally, setting MAKEOPTS=-j1 to try to debug possible parallel-build issues seemed to have no effect, but that's a different bug, already filed and duped by others.)

FWIW, rust-1.19 is installed, but IIRC I had a similar issue (infinite looping; that time I let it sit overnight, and a few days later 1.19 merged fine) with an earlier rust, when I first tried to install it as a then-new firefox dep. So it would seem the problem has remained, but 1.19 bypassed it somehow.

USE="-debug -doc -jemalloc (-clang) (-libcxx)"

But enabling jemalloc doesn't help.

clang and llvm both 5.0.0 if it matters.  (5.0.1 won't allow X to start, even after rebuilding xorg-server/mesa/xf86-video-amdgpu, so it's pkg-masked.)

Here's the log.
Comment 1 Duncan 2018-01-25 08:00:22 UTC
Created attachment 516516 [details]
dev-lang:rust-1.23.0:20180125-061236.log.xz
Comment 2 Duncan 2018-01-25 08:03:14 UTC
Created attachment 516518 [details]
emerge --info rust
Comment 3 Dirkjan Ochtman (RETIRED) gentoo-dev 2018-01-26 12:40:52 UTC
I have no clue how I would debug this. Maybe open an issue on the rust-lang/rust bug tracker to ask if they know what could be causing something like this?
Comment 4 tt_1 2018-01-27 17:06:15 UTC
A successful emerge of dev-lang/rust (1.21.0 was the last one I tried) also took a very long time with only one job; I think it was the compile of rustc itself. However, I haven't tried it again since then.

Which is the bug you opened to tackle MAKEOPTS not being respected by the ebuild?
And also, may I ask if you have opened a bug about the breakage of the radeon drivers with llvm-5.0.1? It just went stable, and it would be a pity if it were broken anyway.
Comment 5 Duncan 2018-01-28 05:58:16 UTC
(In reply to tt_1 from comment #4)
> A successful emerge of dev-lang/rust (1.21.0 was the last one I tried)
> also took a very long time with only one job; I think it was the compile
> of rustc itself. However, I haven't tried it again since then.

It takes quite a while with multiple jobs; it would indeed take a /very/ long time with a single job. However, I do run ccache, and additional builds take less time with it. Having built it most of the way with multiple jobs, I think, rebuilding with a single job could use ccache, which should shrink the time, since it wouldn't actually be /building/ much/most of it.

Of course, when it got to the buggy bit that didn't complete, that wouldn't be in ccache, so it'd go from there single-threaded. But that /should/ be just the buggy bit, unless serializing it actually solves the problem, which I suspect it might, in which case it would complete the build single-threaded. And if I'm most of the way through, as I suspect, even that shouldn't be /too/ bad.

> Which is the bug you opened to tackle MAKEOPTS not being respected by the
> ebuild?

Wasn't me, but there are at least three bugs (one open and two dups), from earlier versions, but they still apply: bugs #613794, #626080 and #635696. There's apparently a build-config file already set up by upstream that can be tweaked to set the number of jobs, etc., so it's definitely a Gentoo bug that the ebuild isn't looking at MAKEOPTS and setting the variable in the build file appropriately.

That's why I parenthesized that bit: it's already a known Gentoo/ebuild-specific bug with a known fix; the ebuilds simply haven't been adjusted to incorporate the fix yet.

> And also, may I ask if you have opened a bug about the breakage of the
> radeon drivers with llvm-5.0.1? It just went stable, and it would be a
> pity if it were broken anyway.

I haven't had time to research that properly to file a bug on it. But there's apparently an unregistered dep, which revdep-rebuild doesn't detect either, so updating llvm doesn't trigger a rebuild of whatever package it is, like it should. It has actually happened to me twice recently.

One of those times, before I figured out it was llvm, I was without X for several days (nearly a week), until whatever package it is got updated and thus rebuilt, and I could finally get into X again, on the very day I finally had some time to investigate, thus short-circuiting my investigation.

The second time happened to coincide with one of the big gcc-update and security-option-change triggered rebuilds recently, so I rebuilt /everything/ that time, only to find I couldn't get into X! Again! But fortunately I had a system partition/filesystem backup that was a couple months old, and I run FEATURES=buildpkg (which has saved me a *lot* of trouble over the years!), so I was able to restore from the backup and then update a few packages at a time from the binary packages, rebooting and starting X each time to ensure I still could, until I tracked the problem down to llvm, even after I had rebuilt the direct deps (the amdgpu driver and mesa, plus xorg-server itself and the input drivers, just in case).

But now, knowing that the problem was definitely triggered by the llvm update, I could simply mask it and fall back to the old version that let me get into X, the one that, as I said above, had "magically" started working when whatever mystery package got rebuilt on its own.

So now I'm at a bit of a loss, not knowing what the mystery package is: it apparently has an "automagic dep" on llvm and needs to be rebuilt after llvm updates, but it doesn't have an appropriate slot-dep OR normal dep on llvm, revdep-rebuild doesn't detect it either, and it keeps X from starting. As I said, it's not the amdgpu driver or mesa, the direct deps (other than rust, which is only used by firefox here, so it shouldn't be the problem).

And I got that bad flu earlier this winter, and am now working overtime, so I haven't really had time to look into it further, particularly at the risk of killing X and having to restore from a known-working backup again. But having narrowed it down to something to do with llvm updates, I can at least mask the troublesome update for now, while continuing to update everything else, and that's what I've done.


Meanwhile, back on topic, I see bug #645672, rust-1.23 failing very early in the build on a *32-bit* *amd* machine. He traced it down to either rust or libc (I'm not clear which, or whether it's glibc or the llvm libc, based on a quick read of the bug) being built with -march=native (and/or -mfpmath=sse). The problem seemed to go away with a more generic -march=i686.

Significantly, I'm *amd* also, on a bulldozer-1 (fx6100), with -march=native. But I'm on 64-bit amd64, not 32-bit x86, and the failure behavior is different as well. In his case it failed early in the build with an error. In my case it fails much later, with an apparent infinite loop that doesn't actually produce an error until I kill the looping thread.

But they're both amd with -march=native, and for me, trying to strace the thread that goes into the infinite loop gives an error that it's x32, which would seem to indicate 32-bit code, despite that making no sense at all, because other threads of the same executable are apparently 64-bit and strace just fine (to a futex, as they're apparently waiting on the spinning one).

But that bug could indeed be related. I work Sunday/tomorrow but have Monday off, and hopefully I can take some time to reread that bug and then possibly play with a more generic -march= on rust and/or llvm and/or glibc, and see if it changes the looping behavior at all. <shrug> Maybe if I'm lucky, fixing this will change the broken X behavior as well, and I'll be able to upgrade llvm without breaking X again. It's a compiler; maybe /it's/ breaking on -march=native, triggering all these other seemingly independent bugs that have llvm (on an amd, because that's what I'm on) in common.
Comment 6 Martin Väth 2018-01-30 21:08:41 UTC
Remove -fmerge-all-constants from your CFLAGS
Comment 7 Leonardo Ferraguzzi 2018-02-02 09:49:12 UTC
(In reply to Martin Väth from comment #6)
> Remove -fmerge-all-constants from your CFLAGS

I confirm this works.
Comment 8 tt_1 2018-02-02 15:40:36 UTC
(In reply to Leonardo Ferraguzzi from comment #7)
> (In reply to Martin Väth from comment #6)
> > Remove -fmerge-all-constants from your CFLAGS
> 
> I confirm this works.

Is it possible to blacklist that CFLAG, i.e. strip it if present, for instance during src_compile?
Comment 9 Duncan 2018-02-03 04:46:01 UTC
(In reply to tt_1 from comment #8)
> (In reply to Leonardo Ferraguzzi from comment #7)
> > (In reply to Martin Väth from comment #6)
> > > Remove -fmerge-all-constants from your CFLAGS
> > 
> > I confirm this works.

Confirmed here too (actually by appending -fno-merge-all-constants to the existing CFLAGS using package.env).  Thanks for the suggestion, Martin! =:^)
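For anyone wanting the same workaround, the package.env mechanism looks roughly like this (the env file name is illustrative, not taken from this bug):

```shell
# /etc/portage/env/no-merge-all-constants.conf
CFLAGS="${CFLAGS} -fno-merge-all-constants"
CXXFLAGS="${CXXFLAGS} -fno-merge-all-constants"

# /etc/portage/package.env
dev-lang/rust no-merge-all-constants.conf
```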

So we know the problem now.

> Is it possible to blacklist that CFLAG, i.e. strip it if present, for
> instance during src_compile?

Yes. In fact it's a simple eclass inherit and function call; see flag-o-matic.eclass.

After inheriting the flag-o-matic eclass either filter-flags -fmerge-all-constants or append-flags -fno-merge-all-constants (note the negation) should work.
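A runnable sketch of what filter-flags effectively does (simplified: filter_flags_sketch is a stand-in of my own, and the real flag-o-matic helper also covers CXXFLAGS and friends and accepts glob patterns):

```shell
#!/bin/sh
# Drop one exact flag from CFLAGS, mimicking flag-o-matic's
# filter-flags for the simple whole-word case.
CFLAGS="-O2 -march=native -fmerge-all-constants -pipe"

filter_flags_sketch() {
    out=""
    for f in $CFLAGS; do
        # keep every flag except the one being filtered
        [ "$f" = "$1" ] || out="$out $f"
    done
    CFLAGS=${out# }    # strip the leading space
}

filter_flags_sketch -fmerge-all-constants
printf '%s\n' "$CFLAGS"    # -O2 -march=native -pipe
```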
Comment 10 Martin Väth 2018-02-04 04:10:44 UTC
(In reply to Duncan from comment #9)
> appending -fno-merge-all-constants to the existing CFLAGS using package.env

Unrelated to this bug, you might want to check out portage-bashrc-mv (from the mv overlay) which can remove flags on a per-package basis.
There is also the public configuration https://github.com/vaeth/portage-env-mv/
which already contained +fmerge-all-constants for rust.
Comment 11 Duncan 2018-03-19 06:08:25 UTC
(In reply to tt_1 from comment #4)
> And also, may I ask if you have opened a bug about the breakage of the
> radeon drivers with llvm-5.0.1? It just went stable, and it would be a
> pity if it were broken anyway.

(In reply to Martin Väth from comment #6)
> Remove -fmerge-all-constants from your CFLAGS

Turned out it was -fmerge-all-constants for that one too, and it's llvm itself that's sensitive to it.  Adding -fno-merge-all-constants to llvm's C(XX)FLAGS via package.env worked there too.

Thanks again, MV! =:^)  It had been so long since something had blown up here due to CFLAGS (because of my long-standing package.env settings for already-affected packages) that it would have taken me quite some time to remember to check that on my own, /too/ long, I'm a bit ashamed to say, so you saved me quite a bit of work!

Duncan
Comment 12 Dirkjan Ochtman (RETIRED) gentoo-dev 2018-11-15 10:41:45 UTC
Since this appears to have been due to the specific CFLAG, I'm going to close this.