Bug 249346 - dev-lang/ruby-1.8.6_p287-r3 parallel make fails
Summary: dev-lang/ruby-1.8.6_p287-r3 parallel make fails
Status: RESOLVED NEEDINFO
Alias: None
Product: Gentoo Linux
Classification: Unclassified
Component: Current packages
Hardware: All All
Importance: High normal
Assignee: Gentoo Linux bug wranglers
URL:
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2008-11-30 08:06 UTC by Duncan
Modified: 2008-12-04 05:04 UTC (History)

See Also:
Package list:
Runtime testing required: ---


Attachments
Full build log (dev-lang:ruby-1.8.6_p287-r3:20081130-064105.log,36.47 KB, text/plain)
2008-11-30 08:11 UTC, Duncan

Description Duncan 2008-11-30 08:06:40 UTC
dev-lang/ruby-1.8.6_p287-r3 gave me an emerge error that looked like a parallel make issue to me.  Sure enough, with MAKEOPTS=-j1 set in /etc/portage/env/dev-lang/=ruby-1.8.6_p287-r3, it merged fine.  I normally run MAKEOPTS="-j -l21".

Note that -r2 compiled just fine for me a couple days ago, and no previous version has had the problem either, so chances are the change from -r2 to -r3 ("Fix our expansion"...) is the culprit.  I'm keywording REGRESSION based on this as well.

Here's the last bit of the compile log, showing the problem -- it's trying to create a dir that already exists.  As I said, obvious parallel make error!

make[1]: Entering directory `/tmp/portage/dev-lang/ruby-1.8.6_p287-r3/work/ruby-1.8.6-p287/ext/digest/sha2'
x86_64-pc-linux-gnu-gcc -I. -I../../.. -I../../../. -I../../.././ext/digest/sha2 -I../../.././ext/digest/sha2/.. -DHAVE_CONFIG_H -DHAVE_SYS_CDEFS_H -DHAVE_INTTYPES_H -DHAVE_UNISTD_H -DHAVE_TYPE_UINT64_T   -fPIC -march=opteron-sse3 -pipe -O2 -frename-registers -fweb -fmerge-all-constants -fgcse-sm -fgcse-las -fgcse-after-reload -ftree-vectorize -fdirectives-only -freorder-blocks-and-partition -combine -fno-strict-aliasing  -fPIC  -c sha2.c
x86_64-pc-linux-gnu-gcc -I. -I../../.. -I../../../. -I../../.././ext/digest/sha2 -I../../.././ext/digest/sha2/.. -DHAVE_CONFIG_H -DHAVE_SYS_CDEFS_H -DHAVE_INTTYPES_H -DHAVE_UNISTD_H -DHAVE_TYPE_UINT64_T   -fPIC -march=opteron-sse3 -pipe -O2 -frename-registers -fweb -fmerge-all-constants -fgcse-sm -fgcse-las -fgcse-after-reload -ftree-vectorize -fdirectives-only -freorder-blocks-and-partition -combine -fno-strict-aliasing  -fPIC  -c sha2init.c
mkdir -p ../../../.ext/common/digest
cp ../../.././ext/digest/sha2/lib/sha2.rb ../../../.ext/common/digest
mkdir: cannot create directory `../../../.ext/common/digest': File exists
make[1]: *** [../../../.ext/common/digest] Error 1
make[1]: *** Waiting for unfinished jobs....
make[1]: Leaving directory `/tmp/portage/dev-lang/ruby-1.8.6_p287-r3/work/ruby-1.8.6-p287/ext/digest/sha2'
make: *** [all] Error 1
 *
 * ERROR: dev-lang/ruby-1.8.6_p287-r3 failed.

I don't believe a full emerge --info is necessary here, but ask and I'll attach it.  I'm running amd64.  As above, it fails with MAKEOPTS="-j -l21" and works with MAKEOPTS=-j1.  Here's the emerge -av line with USE flag status, etc:

[ebuild     U ] dev-lang/ruby-1.8.6_p287-r3 [1.8.6_p287-r2] USE="gdbm ssl threads -berkdb -debug -doc -emacs -examples -ipv6 -rubytests -socks5 -tk -xemacs"

I'll attach the full build log, but I expect the fact that the changes for -r3 apparently triggered it will be the most helpful bit, along with the excerpt above.
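
Editorial sketch: the log shows `mkdir -p` failing with "File exists", which is unusual because `mkdir -p` normally tolerates an existing directory. One possible mechanism (an assumption, not something established in this report) is that the `cp` recipe ran before the directory existed and created a regular file named `digest`, so the later directory creation had nothing valid to reuse. The Python below replays that ordering in a temporary tree; `simulate_cp_before_mkdir` and the stand-in file contents are hypothetical names for illustration only.

```python
import os
import shutil
import tempfile


def simulate_cp_before_mkdir(root):
    """Replay the two recipe lines from the log in the 'wrong' order.

    Returns the exception raised by the directory-creation step,
    or None if that step succeeded.
    """
    common = os.path.join(root, ".ext", "common")
    os.makedirs(common)  # the parent directory exists, as in the build tree

    src = os.path.join(root, "sha2.rb")
    with open(src, "w") as fh:
        fh.write("# stand-in for ext/digest/sha2/lib/sha2.rb\n")

    target = os.path.join(common, "digest")
    # Equivalent of 'cp sha2.rb .ext/common/digest' when no directory is
    # there yet: the copy creates a regular FILE named 'digest'.
    shutil.copy(src, target)

    try:
        # Equivalent of 'mkdir -p .ext/common/digest'.
        os.makedirs(target, exist_ok=True)
    except FileExistsError as exc:
        return exc
    return None


with tempfile.TemporaryDirectory() as tmp:
    err = simulate_cp_before_mkdir(tmp)
    print("directory step failed" if err else "directory step succeeded")
```

If this is what happens in the real build, the underlying defect would be a missing ordering dependency between the copy rule and the directory rule, which only shows up when the scheduler happens to run them in the unlucky order.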
Comment 1 Duncan 2008-11-30 08:11:52 UTC
Created attachment 173849 [details]
Full build log
Comment 2 Hans de Graaff gentoo-dev Security 2008-11-30 08:16:50 UTC
I don't think this is a regression. -r3 is exactly the same as -r2, except for a one line edit in a patch that was already being applied and that should not influence the build process in any way.

I guess this might be a race condition that was always there and only now gets triggered for you for some reason. Maybe your load was lower this time and the build got parallelized more aggressively?
Comment 3 Duncan 2008-11-30 16:04:25 UTC
(In reply to comment #2)
> I don't think this is a regression. -r3 is exactly the same as -r2, except
> for a one line edit in a patch that was already being applied and that
> should not influence the build process in any way.
> 
> I guess this might be a race condition that was always there and only now
> gets triggered for you for some reason. Maybe your load was lower now and
> this build got parallelized more aggressively?

If it's a race condition, it's not one that's quite so easily triggered by load.  I just took ccache out of FEATURES to ensure that wasn't changing the results, and commented out the line in the env file I mentioned earlier (making the entire file a no-op).  I then remerged -r2 with no problem, and again -r3, which again failed with the same kind of error:

mkdir: cannot create directory `../../../.ext/common/io': File exists

So it seems repeatable.

Now what /might/ be happening is that the patch triggers some other problem that was there all the time, but was previously masked, for whatever reason.  

FWIW, in my pre-filing bug-search, ALL dev-lang/ruby, I found no previous parallel make error bugs at all, tho there was one bug that had a comment indicating it was suspected at one point, but it apparently turned out to be something else.

Hmm...  With MAKEOPTS="-j" (no job limit, no load limit), it compiles and merges fine (as it does with -j1, one job only).  This despite the fact that one-minute-load-average never goes above 3.5 or so, so I'd /think/ the -l wouldn't trigger, at least when I'm single merging and 3.5 is it.

(Is there even a way to get load averages under a minute?  After all, I believe a modern machine does more in a second than one back when that was standardized did in a minute.  Does the -l option use one-minute-load-average too?)
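
Editorial sketch: the standard interface for load averages only exposes the 1-, 5- and 15-minute figures, which Python surfaces as `os.getloadavg()`. To the best of my knowledge GNU make's `-l` check is built on the same `getloadavg()` call and compares the limit against the one-minute value, but treat that as an assumption to verify against the make source for the version in use. `over_limit` below is a deliberately simplified, hypothetical model of such a check, not make's actual code.

```python
import os


def load_averages():
    """Return the (1-minute, 5-minute, 15-minute) load averages."""
    return os.getloadavg()


def over_limit(limit, load1):
    """Simplified model of a -l check.

    True when the one-minute load has reached the requested limit,
    meaning new jobs would be held back.
    """
    return load1 >= limit


one, five, fifteen = load_averages()
print(f"1 min: {one:.2f}  5 min: {five:.2f}  15 min: {fifteen:.2f}")
print("would a limit of 21 hold back new jobs right now?", over_limit(21, one))
```

Under this simple model a limit of 21 or 100 would never engage at a load of about 3.5, so the failures reported in this bug suggest that merely enabling the load-limited code path changes job ordering in some other way.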

Trying some different -lX numbers now, and -jX where X is other than 1, without -lX.

I've seen some packages that honored -j (with or without a number) but not -lX where X is a number.  Maybe it was working before but is now breaking on -lX?  We'll see as soon as I get a few more test results.
Comment 4 Duncan 2008-11-30 17:43:47 UTC
(In reply to comment #3)
> Trying some different -lX numbers now, and -jX where X is other than 1,
> without -lX.
> 
> I've seen some packages that honored -j (with or without a number) but
> not -lX [.]  Maybe it was working before but is now breaking on -lX? 

(I compile to tmpfs and thankfully this is a pretty small package, so it goes fast...)

Testing MAKEOPTS="-j -lX", where X equals:
1    FAIL
100  FAIL
1000 SUCCESS

So it's NOT as simple as a broken -lX.  Again, the one-minute-load-average never gets above 4 (3.5-ish), so either -l is using a more precise measure, or there's something else funny going on.  Why would -l100 fail but -l1000 work, if the load average it's measuring never gets above 4?

Testing simple MAKEOPTS="-jX", no -l at all, where X equals:
1     SUCCESS (which we knew)
21    SUCCESS (this was to try to match the -l21 I normally use)
10    SUCCESS
3     SUCCESS (trying a multi-job less than the number of cores (4))
1000  SUCCESS (trying something ridiculously big)
2     SUCCESS

I can't seem to make a simple -j fail.  It thus seems to be -l related, but as I said above, it's not as simple as -l not being honored at all.

Testing the long options:

Testing MAKEOPTS="--jobs=X", no --load-average, where X equals:

5  FAIL
3  

Testing MAKEOPTS="--jobs --load-average=X" where X equals:
1     FAIL
21    FAIL
100   SUCCESS This one's different than the above short-option result
1000  SUCCESS

Well, long vs short options isn't quite consistent, at the 100+ load average point.  On the two that succeeded, I also noted a somewhat higher one minute load average max of ~4.5 instead of the 3.5 I had been seeing.  Could long options vs. short options actually be affecting the result?

Well, let's see whether "-j -l100" fails consistently, and whether the load average tips above the previously noted 3.5-ish with short options on any runs that succeed.

5 tries, F/S noted on each, with load average (tho on success the high appears later, I think):
F  2.8
S  4.2
F  4.0
S  3.5
S  4.3

So "-j -l100" seems to be pretty close to the tipping point.
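
Editorial sketch: five runs is a small sample, so it may help to see how wide the plausible range for the true success rate is. The block below computes a Wilson score interval for 3 successes out of 5 tries (the tally above); the helper name `wilson_interval` and the choice of a 95% level (z = 1.96) are assumptions made for illustration.

```python
import math


def wilson_interval(successes, trials, z=1.96):
    """Return the (low, high) Wilson score interval for a success rate."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p_hat = successes / trials
    denom = 1 + z * z / trials
    center = (p_hat + z * z / (2 * trials)) / denom
    half = (z / denom) * math.sqrt(
        p_hat * (1 - p_hat) / trials + z * z / (4 * trials * trials)
    )
    return center - half, center + half


low, high = wilson_interval(3, 5)
print(f"observed 3/5; plausible success rate roughly {low:.2f} to {high:.2f}")
```

The interval is wide, so these five runs are consistent with a tipping point near 50/50 but cannot pin it down; more iterations would narrow it.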

Hmm...  let's try this one:

Testing MAKEOPTS="-j -l" (no numbers attached at all; a bare -l would turn off a previously set load limit, but if there are problems with -l parsing...).  This SHOULD be the same as simply -j.

Two more SUCCESS.  It /does/ seem the same as -j.

So...  It appears -l has /something/ to do with it, as I've not seen a -j without -l fail.  With -lX, it fails with low X, succeeds with high X, with a near 50/50 success rate at X=100.

It's definitely quite consistently repeatable here, with higher allowed load averages increasing the success rate.  The problem does appear to be an ordering race, but with a set number of jobs it seems to always work, while with a load-limited number of jobs, the closer the limit is to unrestricted, the better the chance of success.  The 50/50 point is at about a load limit of 100, despite the one-minute load average never going over ~4.5 (now upped from the 3.5 I had observed earlier).

So what /is/ make measuring for load average?  Surely it's not instantaneously runnable load, as the manpage /clearly/ says /average/?  The make info page on parallel execution (5.4) specifies "current load average", but doesn't detail what that actually /means/.  (FWIW, I'm reading one-minute-load-average off of ksysguard's graphing of same, as I have it configured.  I'd love to know of a more "current" one, say 2-10 seconds, but AFAIK the kernel doesn't supply that info in an easily accessible form.)  But /something/ related to that switch is clearly triggering well before it hits the target on the minute-load-average anyway.
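
Editorial sketch: on Linux, the fourth field of `/proc/loadavg` has the form `runnable/total` and gives the number of currently runnable scheduling entities, which is about as "current" as the kernel reports directly. Sampling it over a few seconds gives a rough short-window figure. This is not the same quantity as the kernel's load average (which on Linux also counts tasks in uninterruptible sleep, and the sampling process itself shows up as runnable), and the function names, window and interval below are all hypothetical choices.

```python
import time


def parse_runnable(loadavg_text):
    """Extract the runnable-task count from /proc/loadavg content.

    The fourth whitespace-separated field looks like '3/512'.
    """
    fields = loadavg_text.split()
    runnable, _total = fields[3].split("/")
    return int(runnable)


def read_runnable(path="/proc/loadavg"):
    """Read the current runnable-task count (Linux only)."""
    with open(path) as fh:
        return parse_runnable(fh.read())


def short_window_load(sample=read_runnable, window=5.0, interval=0.1,
                      sleep=time.sleep):
    """Average the sampled runnable-task count over `window` seconds."""
    count = max(1, int(window / interval))
    total = 0
    for _ in range(count):
        total += sample()
        sleep(interval)
    return total / count
```

Calling `short_window_load(window=2.0)` during a build would give a two-second view that can be compared against the one-minute figure a desktop monitor graphs.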

Meanwhile, looks like a user workaround (if anyone else is running into this bug) would be simply setting only -j (with or without a number, but no -l number) in MAKEOPTS, in the package's env file if necessary.  It doesn't seem to use too many make jobs /or/ too much memory, in any case.
Comment 5 Jeroen Roovers (RETIRED) gentoo-dev 2008-12-01 21:02:40 UTC
(In reply to comment #0)
> dev-lang/ruby-1.8.6_p287-r3 gave me an emerge error that looked like a parallel
> make  issue to me.  Sure enough, with MAKEOPTS=-j1 set in
> /etc/portage/env/dev-lang/=ruby-1.8.6_p287-r3 , it merged fine.

Um, your reasoning is invalid. You did two runs, one failed and the other didn't.

Do two runs of 20 iterations of the emerge and you could start to suggest that there's a parallel make problem.
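
Editorial sketch: the repeated-runs comparison suggested above can be automated with a small harness that runs the same command many times under a given MAKEOPTS and tallies the outcomes. `tally_runs` is a hypothetical helper; it assumes the build command picks up MAKEOPTS from the environment, which may not hold if a per-package env file overrides it.

```python
import os
import subprocess


def tally_runs(command, makeopts, iterations=20, run=subprocess.run):
    """Run `command` repeatedly with MAKEOPTS set.

    Returns a (successes, failures) tuple based on the exit status.
    """
    env = dict(os.environ, MAKEOPTS=makeopts)
    successes = failures = 0
    for _ in range(iterations):
        result = run(command, env=env,
                     stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
        if result.returncode == 0:
            successes += 1
        else:
            failures += 1
    return successes, failures


# Example use (not executed here; needs root on a Gentoo system):
#   cmd = ["emerge", "--oneshot", "=dev-lang/ruby-1.8.6_p287-r3"]
#   print("-j1     ", tally_runs(cmd, "-j1"))
#   print("-j -l21 ", tally_runs(cmd, "-j -l21"))
```

Comparing the two tallies over 20 iterations each gives a much firmer basis for calling the failure a parallel make problem than a single failing and a single passing run.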
Comment 6 Jeroen Roovers (RETIRED) gentoo-dev 2008-12-04 05:04:12 UTC
Either just use -jN or live with it, please. Or explain how make -lN should be fixed upstream - it doesn't appear to be a new problem. :)