The cputimes/cpuload reported from top/ps etc. report extreme loads on kernel processes migration/x. The problem isn't permanent as it takes a while to set off, but once it does it never stops. I am unable to locate what's triggering it, but I did a bisect and found the commit at fault: commit d670ec13178d0fd8680e6742a2bc6e04f28f87d8 - posix-cpu-timers: Cure SMP wobbles. The load reported doesn't seem real, as there's no sign that the system is busy or unresponsive, but it seems like cpu frequency-scaling (ondemand) is kicking in and never scales the frequency down again. Reproducible: Always Steps to Reproduce: 1. Boot normally with kernel >= 3.0.7 2. Wait ~1 hour (unable to figure out what sets it off. Idle vs full load doesn't matter) Actual Results: Cpuload/cputime on some migration/x kernel processes show a constant load between 20-100%. Cpu frequency scaling (ondemand) seems to be affected as it never scales down the cores/threads affected. Expected Results: Report normal load on migration processes. Allow ondemand to scale down.
Created attachment 295605 [details] git bisect log
Created attachment 295607 [details] Kernel config used
Created attachment 295609 [details] emerge --info
I see a discussion regarding this is referenced in the bug now. https://lkml.org/lkml/2011/10/12/382 Is the proposed patch accepted? I tested 3.2_rc5, and the problem persists but I'm unable to determine if the patch is included or not. I find it strange that this problem isn't more widespread. I'm experiencing this on two different machines (AMD64), but not on the third which is X86. Tried different configs including default ones with bare minimum of drivers, but still no change.
I see some changes around this area. Can you test with gentoo-sources-3.2.11 ?
Unfortunately I get the same results using 3.6.11.
(In reply to comment #6) > Unfortunately I get the same results using 3.6.11. As of today, latest from kernel.org is 3.3.6; how did you get 3.6.?? I'm running from 3.2-rcX to now 3.4-rc7; the [migration/0] is using 75% pcpu: [root@archtop ~]# ps fuww -e |awk 'NR <=2 || /migration/' USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND root 2 0.0 0.0 0 0 ? S May10 0:00 [kthreadd] root 6 76.1 0.0 0 0 ? S May10 4649:06 \_ [migration/0] root 982 0.0 0.0 0 0 ? S May14 0:00 \_ [migration/1] root 985 0.3 0.0 0 0 ? S May14 0:39 \_ [migration/2] root 988 0.0 0.0 0 0 ? S May14 0:00 \_ [migration/3]
And: # ps ax -o pid,comm,pcpu | grep "migration" 6 migration/0 76.1 982 migration/1 0.0 985 migration/2 0.3 988 migration/3 0.0
(In reply to comment #7) > (In reply to comment #6) > > Unfortunately I get the same results using 3.6.11. > > As of today, latest from kernel.org is 3.3.6; how did you get 3.6.?? > I was running 3.2.11 at the time. My bad. Recent versions show no improvement on the issue, but I've turned off tickless system (NO_HZ = n), and that "fixed" it. Still hoping for some real fix eventually though...
I tried the NO_NZ=n route and still see unreal time's on the migration kernel threads.. # uname -r -v -m 3.2.12-gentoo #1 SMP Tue Jul 17 09:41:18 EDT 2012 x86_64 # uptime 10:53:28 up 4 days, 23:04, 1 user, load average: 0.14, 0.23, 0.24 # ps uax|grep -e US[E]R -e migra[t]ion USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND root 6 0.0 0.0 0 0 ? S Jul18 0:00 [migration/0] root 7 44.2 0.0 0 0 ? S Jul18 3160:23 [migration/1] root 11 0.0 0.0 0 0 ? S Jul18 0:00 [migration/2] root 14 67.0 0.0 0 0 ? S Jul18 4789:14 [migration/3] root 17 0.0 0.0 0 0 ? S Jul18 0:00 [migration/4] root 20 0.0 0.0 0 0 ? S Jul18 0:00 [migration/5] root 23 0.0 0.0 0 0 ? S Jul18 0:00 [migration/6] root 26 0.0 0.0 0 0 ? S Jul18 0:00 [migration/7] # zcat /proc/config.gz |grep NO_HZ # CONFIG_NO_HZ is not set
Seems the issue is present in 3.4.4 also: # uname -r -v -m ; uptime ; ps uax|grep -e US[E]R -e migra[t]ion ; zcat /proc/config.gz |grep NO_HZ 3.4.4-gentoo #7 SMP Sat Jul 14 15:15:33 EDT 2012 x86_64 13:02:23 up 16 days, 22:22, 9 users, load average: 3.60, 3.95, 4.47 USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND root 6 64.1 0.0 0 0 ? S Jul25 15636:28 [migration/0] root 7 89.8 0.0 0 0 ? S Jul25 21911:32 [migration/1] root 11 95.5 0.0 0 0 ? S Jul25 23304:44 [migration/2] root 14 90.3 0.0 0 0 ? S Jul25 22021:08 [migration/3] root 17 81.6 0.0 0 0 ? S Jul25 19900:55 [migration/4] root 20 95.9 0.0 0 0 ? S Jul25 23387:25 [migration/5] root 23 99.6 0.0 0 0 ? S Jul25 24305:11 [migration/6] root 26 82.2 0.0 0 0 ? S Jul25 20052:41 [migration/7] root 29 51.9 0.0 0 0 ? S Jul25 12666:22 [migration/8] root 32 91.8 0.0 0 0 ? S Jul25 22387:41 [migration/9] root 35 99.3 0.0 0 0 ? S Jul25 24229:26 [migration/10] root 38 94.8 0.0 0 0 ? S Jul25 23136:40 [migration/11] # CONFIG_RCU_FAST_NO_HZ is not set CONFIG_NO_HZ=y Perhaps I should disable tickless to see if it helps ...
While I was staring at 'top -d5' (and sorted by total time T) the migration/15 process on the very top showed exactly 3200% CPU usage for at least one cycle and then it was back to zero. This was suspicious since I have never before witnessed any CPU usage by migration processes during 5-second intervals, but more and more migration processes keep collecting at the top of the listing based on total time. Also, I have 32 processors and a load average above 20, but the load is completely explainable from normal activities (not the migration processes). I have several processes running at 100%, but they are no longer at the top of the list. Hope this helps. ross@wilson:~$ uname -a Linux wilson 3.2.0-3-amd64 #1 SMP Thu Jun 28 09:07:26 UTC 2012 x86_64 GNU/Linux Also, I'm running Debian Wheezy, but didn't see this bug in their tracking, so I attached myself here.
(In reply to comment #12) top doesn't show CPU usage, but ps does: root@wilson:~# uname -r -v -m ; uptime ; ps uax|grep -e US[E]R -e migra[t]ion 3.2.0-3-amd64 #1 SMP Thu Jun 28 09:07:26 UTC 2012 x86_64 13:27:23 up 2 days, 5:53, 33 users, load average: 22.09, 21.61, 21.74 USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND root 6 89.7 0.0 0 0 ? S Aug10 2903:49 [migration/0] root 8 79.5 0.0 0 0 ? S Aug10 2573:33 [migration/1] root 13 0.0 0.0 0 0 ? S Aug10 0:00 [migration/2] root 17 0.0 0.0 0 0 ? S Aug10 0:00 [migration/3] root 21 0.0 0.0 0 0 ? S Aug10 0:00 [migration/4] root 25 0.0 0.0 0 0 ? S Aug10 0:00 [migration/5] root 29 0.0 0.0 0 0 ? S Aug10 0:00 [migration/6] root 33 0.0 0.0 0 0 ? S Aug10 0:00 [migration/7] root 37 0.0 0.0 0 0 ? S Aug10 0:00 [migration/8] root 41 79.0 0.0 0 0 ? S Aug10 2554:48 [migration/9] root 45 0.0 0.0 0 0 ? S Aug10 0:00 [migration/10] root 49 55.4 0.0 0 0 ? S Aug10 1792:50 [migration/11] root 53 0.0 0.0 0 0 ? S Aug10 0:00 [migration/12] root 57 66.5 0.0 0 0 ? S Aug10 2152:52 [migration/13] root 61 74.3 0.0 0 0 ? S Aug10 2405:08 [migration/14] root 65 98.6 0.0 0 0 ? S Aug10 3189:35 [migration/15] root 69 89.7 0.0 0 0 ? S Aug10 2903:31 [migration/16] root 73 0.0 0.0 0 0 ? S Aug10 0:00 [migration/17] root 78 0.0 0.0 0 0 ? S Aug10 0:00 [migration/18] root 89 97.5 0.0 0 0 ? S Aug10 3153:28 [migration/19] root 93 84.3 0.0 0 0 ? S Aug10 2727:58 [migration/20] root 97 0.0 0.0 0 0 ? S Aug10 0:00 [migration/21] root 101 0.0 0.0 0 0 ? S Aug10 0:00 [migration/22] root 105 0.0 0.0 0 0 ? S Aug10 0:00 [migration/23] root 109 94.3 0.0 0 0 ? S Aug10 3050:03 [migration/24] root 113 0.0 0.0 0 0 ? S Aug10 0:00 [migration/25] root 117 78.7 0.0 0 0 ? S Aug10 2545:33 [migration/26] root 121 76.2 0.0 0 0 ? S Aug10 2465:08 [migration/27] root 129 0.0 0.0 0 0 ? S Aug10 0:00 [migration/28] root 139 0.0 0.0 0 0 ? S Aug10 0:00 [migration/29] root 143 0.0 0.0 0 0 ? S Aug10 0:00 [migration/30] root 147 0.0 0.0 0 0 ? S Aug10 0:00 [migration/31]
Disabling tickless seems to have done the trick, which is nice :-) # uname -r -v -m ; uptime ; ps uax|grep -e US[E]R -e migra[t]ion ; zcat /proc/config.gz |grep NO_HZ 3.4.4-gentoo #8 SMP Sat Aug 11 14:19:27 EDT 2012 x86_64 17:49:42 up 3 days, 3:13, 9 users, load average: 4.69, 4.80, 4.97 USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND root 6 0.0 0.0 0 0 ? S Aug11 0:00 [migration/0] root 7 0.0 0.0 0 0 ? S Aug11 0:00 [migration/1] root 11 0.0 0.0 0 0 ? S Aug11 0:00 [migration/2] root 14 0.0 0.0 0 0 ? S Aug11 0:00 [migration/3] root 17 0.0 0.0 0 0 ? S Aug11 0:00 [migration/4] root 20 0.0 0.0 0 0 ? S Aug11 0:00 [migration/5] root 23 0.0 0.0 0 0 ? S Aug11 0:00 [migration/6] root 26 0.0 0.0 0 0 ? S Aug11 0:00 [migration/7] root 29 0.0 0.0 0 0 ? S Aug11 0:00 [migration/8] root 32 0.0 0.0 0 0 ? S Aug11 0:00 [migration/9] root 35 0.0 0.0 0 0 ? S Aug11 0:00 [migration/10] root 38 0.0 0.0 0 0 ? S Aug11 0:00 [migration/11] # CONFIG_NO_HZ is not set
Blast, they're back with 3.4.9: # uname -r -v -m ; uptime ; ps uax|grep -e US[E]R -e migra[t]ion ; zcat /proc/config.gz |grep NO_HZ 3.4.9-gentoo #1 SMP Fri Aug 24 08:38:41 EDT 2012 x86_64 11:40:06 up 2 days, 2:37, 9 users, load average: 3.71, 3.87, 4.12 USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND root 6 0.0 0.0 0 0 ? S Aug24 0:00 [migration/0] root 7 50.5 0.0 0 0 ? S Aug24 1534:49 [migration/1] root 11 0.0 0.0 0 0 ? S Aug24 0:00 [migration/2] root 14 0.0 0.0 0 0 ? S Aug24 0:00 [migration/3] root 17 0.0 0.0 0 0 ? S Aug24 0:00 [migration/4] root 20 0.0 0.0 0 0 ? S Aug24 0:00 [migration/5] root 23 0.0 0.0 0 0 ? S Aug24 0:00 [migration/6] root 26 0.0 0.0 0 0 ? S Aug24 0:00 [migration/7] root 29 65.4 0.0 0 0 ? S Aug24 1986:42 [migration/8] root 32 77.2 0.0 0 0 ? S Aug24 2347:25 [migration/9] root 35 0.0 0.0 0 0 ? S Aug24 0:00 [migration/10] root 38 87.0 0.0 0 0 ? S Aug24 2643:40 [migration/11] # CONFIG_NO_HZ is not set
Can you please bring this issue upstream at http://bugzilla.kernel.org and post the url to the back back here.
I just reported it here: https://bugzilla.kernel.org/show_bug.cgi?id=47341
ok, thanks. Watching the upstream bug, and we'll attempt to backport any fix identified.
Nothing much seems to have happened in the upstream, though 3.6.11 appears to be cured: # uname -r -v -m ; uptime ; ps uax|grep -e US[E]R -e migra[t]ion ; zcat /proc/config.gz |grep NO_HZ 3.6.11-gentoo #1 SMP Mon Dec 24 08:29:21 EST 2012 x86_64 10:33:54 up 13 days, 38 min, 9 users, load average: 5.26, 4.81, 4.26 USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND root 8 0.0 0.0 0 0 ? S 2012 3:03 [migration/0] root 9 0.0 0.0 0 0 ? S 2012 1:32 [migration/1] root 13 0.0 0.0 0 0 ? S 2012 1:30 [migration/2] root 17 0.0 0.0 0 0 ? S 2012 2:26 [migration/3] root 21 0.0 0.0 0 0 ? S 2012 2:08 [migration/4] root 25 0.0 0.0 0 0 ? S 2012 1:44 [migration/5] root 29 0.0 0.0 0 0 ? S 2012 1:09 [migration/6] root 33 0.0 0.0 0 0 ? S 2012 1:03 [migration/7] root 37 0.0 0.0 0 0 ? S 2012 1:09 [migration/8] root 41 0.0 0.0 0 0 ? S 2012 1:06 [migration/9] root 45 0.0 0.0 0 0 ? S 2012 0:54 [migration/10] root 49 0.0 0.0 0 0 ? S 2012 1:08 [migration/11] # CONFIG_NO_HZ is not set I may try re-enabling tickless ...
You're right. >= 3.6.11 has fixed the issue, and I've tried both with and without CONFIG_NO_HZ.