We have 3 identical servers and are trying to implement a 3-node web cluster with shared FC SAN storage. For the cluster filesystem I've tried ocfs2, gfs2 and gfs. After connecting the 2nd node to the cluster I see the following messages in the kernel log (it doesn't matter whether ocfs2, gfs2 or gfs is used).

With gfs:

Dec 7 18:11:56 falcon-cl3 INFO: task apache2:3133 blocked for more than 120 seconds.
Dec 7 18:11:56 falcon-cl3 "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Dec 7 18:11:56 falcon-cl3 apache2 D ffffc20000065000 0 3133 2102
Dec 7 18:11:56 falcon-cl3 ffff88018c9e1ca8 0000000000000086 0000000000000000 ffff88011edde9c0
Dec 7 18:11:56 falcon-cl3 0000000000000292 ffff8801ef9da090 ffff8801e58a3950 ffff8801ef9da2d8
Dec 7 18:11:56 falcon-cl3 ffffffff807bc000 ffffffffa019b2ff ffff8801edd18180 ffffffffa01d2ad0
Dec 7 18:11:56 falcon-cl3 Call Trace:
Dec 7 18:11:56 falcon-cl3 [<ffffffffa019b2ff>] dlm_lock+0x9f/0x1c0 [dlm]
Dec 7 18:11:56 falcon-cl3 [<ffffffffa01d2ad0>] gdlm_hold_lvb+0x170/0x230 [gfs]
Dec 7 18:11:56 falcon-cl3 [<ffffffff805a5095>] schedule_timeout+0x95/0xd0
Dec 7 18:11:56 falcon-cl3 [<ffffffff805a4628>] wait_for_common+0xb8/0x170
Dec 7 18:11:56 falcon-cl3 [<ffffffff8022e850>] default_wake_function+0x0/0x10
Dec 7 18:11:56 falcon-cl3 [<ffffffffa01c8374>] gfs_glock_xmote_th+0xb4/0x270 [gfs]
Dec 7 18:11:56 falcon-cl3 [<ffffffffa01c6775>] gfs_reclaim_glock+0x5b5/0x8a0 [gfs]
Dec 7 18:11:56 falcon-cl3 [<ffffffffa01c6bf0>] gfs_glock_nq+0x190/0x460 [gfs]
Dec 7 18:11:56 falcon-cl3 [<ffffffffa01c6ede>] gfs_glock_nq_init+0x1e/0x40 [gfs]
Dec 7 18:11:56 falcon-cl3 [<ffffffffa01e0ae5>] gfs_removexattr+0x10b5/0x10f0 [gfs]
Dec 7 18:11:56 falcon-cl3 [<ffffffff802994de>] vfs_getattr+0x2e/0xa0
Dec 7 18:11:56 falcon-cl3 [<ffffffff8029979a>] vfs_stat_fd+0x3a/0x60
Dec 7 18:11:56 falcon-cl3 [<ffffffff80299857>] sys_newstat+0x27/0x50
Dec 7 18:11:56 falcon-cl3 [<ffffffff802963da>] vfs_read+0x12a/0x160
Dec 7 18:11:56 falcon-cl3 [<ffffffff80296753>] sys_read+0x53/0x90
Dec 7 18:11:56 falcon-cl3 [<ffffffff8020b71b>] system_call_fastpath+0x16/0x1b
Dec 7 18:11:56 falcon-cl3

With ocfs2:

Dec 1 18:32:48 falcon-cl3 INFO: task apache2:20388 blocked for more than 120 seconds.
Dec 1 18:32:48 falcon-cl3 "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Dec 1 18:32:48 falcon-cl3 apache2 D ffff81000100d780 0 20388 5039
Dec 1 18:32:48 falcon-cl3 ffff8100774fbd48 0000000000000086 ffff8101ed9e6d80 ffff8101ee9cc180
Dec 1 18:32:48 falcon-cl3 ffff8100774fbe48 ffff810125161810 ffff81008bcff810 ffff810125161a50
Dec 1 18:32:48 falcon-cl3 ffff8100774fbd48 ffffffff8028f273 000000000000c344 0000000000033cbc
Dec 1 18:32:48 falcon-cl3 Call Trace:
Dec 1 18:32:48 falcon-cl3 [<ffffffff8028f273>]
Dec 1 18:32:48 falcon-cl3 [<ffffffff880aa20a>]
Dec 1 18:32:48 falcon-cl3 [<ffffffff8023cde0>]
Dec 1 18:32:48 falcon-cl3 [<ffffffff880baceb>]
Dec 1 18:32:48 falcon-cl3 [<ffffffff880b6036>]
Dec 1 18:32:48 falcon-cl3 [<ffffffff802881a3>]
Dec 1 18:32:48 falcon-cl3 [<ffffffff802881f7>]
Dec 1 18:32:48 falcon-cl3 [<ffffffff8020247b>]
Dec 1 18:32:48 falcon-cl3

With gfs2:

Dec 4 08:58:03 falcon-cl3 "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Dec 4 08:58:03 falcon-cl3 apache2 D ffff8801e92182e0 0 16367 5233
Dec 4 08:58:03 falcon-cl3 ffff880052cb3ae8 0000000000000086 ffffe2000477e9b8 0000000000000000
Dec 4 08:58:03 falcon-cl3 ffff8801e85b2760 ffff8801efa04210 ffff8801e85b29b0 ffff880051dd5050
Dec 4 08:58:03 falcon-cl3 ffff880000000000 0000000000000000 ffffffff809323a0 ffff8801ec56fac0
Dec 4 08:58:03 falcon-cl3 Call Trace:
Dec 4 08:58:03 falcon-cl3 [<ffffffff805ac4c5>] 0xffffffff805ac4c5
Dec 4 08:58:03 falcon-cl3 [<ffffffffa00842d4>] 0xffffffffa00842d4
Dec 4 08:58:03 falcon-cl3 [<ffffffff80219899>] 0xffffffff80219899
Dec 4 08:58:03 falcon-cl3 [<ffffffff805aba38>] 0xffffffff805aba38
Dec 4 08:58:03 falcon-cl3 [<ffffffff80225df0>] 0xffffffff80225df0
Dec 4 08:58:03 falcon-cl3 [<ffffffffa00842fd>] 0xffffffffa00842fd
Dec 4 08:58:03 falcon-cl3 [<ffffffffa00854b6>] 0xffffffffa00854b6
Dec 4 08:58:03 falcon-cl3 [<ffffffffa0086437>] 0xffffffffa0086437
Dec 4 08:58:03 falcon-cl3 [<ffffffffa00831b2>] 0xffffffffa00831b2
Dec 4 08:58:03 falcon-cl3 [<ffffffffa0088537>] 0xffffffffa0088537
Dec 4 08:58:03 falcon-cl3 [<ffffffffa0092c46>] 0xffffffffa0092c46
Dec 4 08:58:03 falcon-cl3 [<ffffffff8028c799>] 0xffffffff8028c799
Dec 4 08:58:03 falcon-cl3 [<ffffffff80241d50>] 0xffffffff80241d50
Dec 4 08:58:03 falcon-cl3 [<ffffffff8028d0f5>] 0xffffffff8028d0f5
Dec 4 08:58:03 falcon-cl3 [<ffffffff8028d4d3>] 0xffffffff8028d4d3
Dec 4 08:58:03 falcon-cl3 [<ffffffff8020288b>] 0xffffffff8020288b
Dec 4 08:58:03 falcon-cl3

Only a full node restart helps, and the problem returns within about 30 minutes.

falcon-cl3 ~ # uname -a
Linux falcon-cl3 2.6.27-gentoo-r4 #1 SMP Sat Dec 6 18:28:37 EET 2008 x86_64 Intel(R) Xeon(R) CPU 5110 @ 1.60GHz GenuineIntel GNU/Linux

Using profile /usr/portage/profiles/default/linux/amd64/2008.0/no-multilib
www-servers/apache 2.2.9-r1

On the same 3 servers with openSUSE 11 and OCFS2 everything works smoothly.

Reproducible: Always
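For anyone comparing how often these reports occur across the three filesystems, a minimal Python sketch along the following lines could pull the hung-task headers out of a saved kernel log. The function name find_hung_tasks and the sample string are illustrative assumptions; only the message format itself is taken from the logs in this report.

```python
import re

# Matches the hung-task header seen in the report, e.g.
# "INFO: task apache2:3133 blocked for more than 120 seconds."
HUNG_TASK_RE = re.compile(
    r"INFO: task (?P<comm>\S+):(?P<pid>\d+) "
    r"blocked for more than (?P<secs>\d+) seconds"
)


def find_hung_tasks(log_text):
    """Return a list of (command, pid, seconds) tuples found in log_text."""
    return [
        (m.group("comm"), int(m.group("pid")), int(m.group("secs")))
        for m in HUNG_TASK_RE.finditer(log_text)
    ]


# Two header lines copied from the gfs and ocfs2 logs in this report.
sample = (
    "Dec 7 18:11:56 falcon-cl3 INFO: task apache2:3133 "
    "blocked for more than 120 seconds.\n"
    "Dec 1 18:32:48 falcon-cl3 INFO: task apache2:20388 "
    "blocked for more than 120 seconds.\n"
)

for comm, pid, secs in find_hung_tasks(sample):
    print(comm, pid, secs)
```

The gfs2 excerpt in this report does not include the "INFO: task" header line, so this sketch would not count it; the log would have to be matched on the task state line instead.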
Created attachment 174576 [details] kernel config /proc/config.gz
Generally speaking it looks like http://bugzilla.kernel.org/show_bug.cgi?id=10582 but concerns ocfs2/gfs/gfs2
The same thing happens with vanilla-sources:
Linux falcon-cl3 2.6.27.8 #1 SMP Sun Dec 7 19:11:27 EET 2008 x86_64 Intel(R) Xeon(R) CPU 5110 @ 1.60GHz GenuineIntel GNU/Linux
We've had the same issue at work with heavily loaded machines. It happens when pdflush takes more than 120 seconds to push dirty pages from RAM to disk and usually means that either your hardware is misconfigured or your system load is much too high.
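One way to check whether slow writeback is involved is to watch the Dirty and Writeback fields of /proc/meminfo while the hang develops. The Python sketch below is only an illustration: the helper names parse_meminfo and writeback_backlog_kb and the sample values are assumptions, not output from the affected servers.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a {field: value} dict.

    Values are the leading integer on each line (kB for most fields).
    """
    info = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        if not sep:
            continue  # not a "Field: value" line
        parts = rest.split()
        if parts and parts[0].isdigit():
            info[name.strip()] = int(parts[0])
    return info


def writeback_backlog_kb(info):
    """Dirty pages plus pages currently under writeback, in kB."""
    return info.get("Dirty", 0) + info.get("Writeback", 0)


# Hypothetical sample in the /proc/meminfo format, for illustration only.
sample = (
    "MemTotal:        8168284 kB\n"
    "Dirty:            123456 kB\n"
    "Writeback:           789 kB\n"
)

info = parse_meminfo(sample)
print("backlog kB:", writeback_backlog_kb(info))
```

On a live node the same functions could be fed open("/proc/meminfo").read() in a loop. A backlog that keeps growing before the hung-task message appears would be consistent with the writeback explanation; a small, flat backlog would point elsewhere, for example at the lock manager seen in the gfs trace.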
I don't agree. I reverted those servers to openSUSE 11 with the generic kernel Linux falcon-cl3 2.6.25.18-0.2-default #1 SMP 2008-10-21 16:30:26 +0200 x86_64 x86_64 x86_64 GNU/Linux and the same set of software (nginx+apache/mod_php+ocfs2, all package based), and under the same load everything works as expected (i.e. no hangs even when the load average is above 50).