
Bug 250190

Summary: apache block when using any cluster file system
Product: Gentoo Linux
Reporter: Oleg Gawriloff <barzog>
Component: [OLD] Core system
Assignee: Gentoo Linux bug wranglers <bug-wranglers>
Status: RESOLVED INVALID
Severity: critical
CC: patrick
Priority: High
Version: unspecified
Hardware: All
OS: Linux
Whiteboard:
Package list:
Runtime testing required: ---
Attachments: kernel config /proc/config.gz

Description Oleg Gawriloff 2008-12-07 16:21:38 UTC
We have 3 identical servers and are trying to implement a 3-node web cluster with shared FC SAN storage. For the cluster filesystem I've tried ocfs2, gfs2 and gfs. After connecting the 2nd node to the cluster, I see the following messages in the kernel log (regardless of whether ocfs2, gfs2 or gfs is used).

With gfs:
Dec  7 18:11:56 falcon-cl3 INFO: task apache2:3133 blocked for more than 120 seconds.
Dec  7 18:11:56 falcon-cl3 "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Dec  7 18:11:56 falcon-cl3 apache2       D ffffc20000065000     0  3133   2102
Dec  7 18:11:56 falcon-cl3 ffff88018c9e1ca8 0000000000000086 0000000000000000 ffff88011edde9c0
Dec  7 18:11:56 falcon-cl3 0000000000000292 ffff8801ef9da090 ffff8801e58a3950 ffff8801ef9da2d8
Dec  7 18:11:56 falcon-cl3 ffffffff807bc000 ffffffffa019b2ff ffff8801edd18180 ffffffffa01d2ad0
Dec  7 18:11:56 falcon-cl3 Call Trace:
Dec  7 18:11:56 falcon-cl3 [<ffffffffa019b2ff>] dlm_lock+0x9f/0x1c0 [dlm]
Dec  7 18:11:56 falcon-cl3 [<ffffffffa01d2ad0>] gdlm_hold_lvb+0x170/0x230 [gfs]
Dec  7 18:11:56 falcon-cl3 [<ffffffff805a5095>] schedule_timeout+0x95/0xd0
Dec  7 18:11:56 falcon-cl3 [<ffffffff805a4628>] wait_for_common+0xb8/0x170
Dec  7 18:11:56 falcon-cl3 [<ffffffff8022e850>] default_wake_function+0x0/0x10
Dec  7 18:11:56 falcon-cl3 [<ffffffffa01c8374>] gfs_glock_xmote_th+0xb4/0x270 [gfs]
Dec  7 18:11:56 falcon-cl3 [<ffffffffa01c6775>] gfs_reclaim_glock+0x5b5/0x8a0 [gfs]
Dec  7 18:11:56 falcon-cl3 [<ffffffffa01c6bf0>] gfs_glock_nq+0x190/0x460 [gfs]
Dec  7 18:11:56 falcon-cl3 [<ffffffffa01c6ede>] gfs_glock_nq_init+0x1e/0x40 [gfs]
Dec  7 18:11:56 falcon-cl3 [<ffffffffa01e0ae5>] gfs_removexattr+0x10b5/0x10f0 [gfs]
Dec  7 18:11:56 falcon-cl3 [<ffffffff802994de>] vfs_getattr+0x2e/0xa0
Dec  7 18:11:56 falcon-cl3 [<ffffffff8029979a>] vfs_stat_fd+0x3a/0x60
Dec  7 18:11:56 falcon-cl3 [<ffffffff80299857>] sys_newstat+0x27/0x50
Dec  7 18:11:56 falcon-cl3 [<ffffffff802963da>] vfs_read+0x12a/0x160
Dec  7 18:11:56 falcon-cl3 [<ffffffff80296753>] sys_read+0x53/0x90
Dec  7 18:11:56 falcon-cl3 [<ffffffff8020b71b>] system_call_fastpath+0x16/0x1b
Dec  7 18:11:56 falcon-cl3

With ocfs2:
Dec  1 18:32:48 falcon-cl3 INFO: task apache2:20388 blocked for more than 120 seconds.
Dec  1 18:32:48 falcon-cl3 "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Dec  1 18:32:48 falcon-cl3 apache2       D ffff81000100d780     0 20388   5039
Dec  1 18:32:48 falcon-cl3 ffff8100774fbd48 0000000000000086 ffff8101ed9e6d80 ffff8101ee9cc180
Dec  1 18:32:48 falcon-cl3 ffff8100774fbe48 ffff810125161810 ffff81008bcff810 ffff810125161a50
Dec  1 18:32:48 falcon-cl3 ffff8100774fbd48 ffffffff8028f273 000000000000c344 0000000000033cbc
Dec  1 18:32:48 falcon-cl3 Call Trace:
Dec  1 18:32:48 falcon-cl3 [<ffffffff8028f273>]
Dec  1 18:32:48 falcon-cl3 [<ffffffff880aa20a>]
Dec  1 18:32:48 falcon-cl3 [<ffffffff8023cde0>]
Dec  1 18:32:48 falcon-cl3 [<ffffffff880baceb>]
Dec  1 18:32:48 falcon-cl3 [<ffffffff880b6036>]
Dec  1 18:32:48 falcon-cl3 [<ffffffff802881a3>]
Dec  1 18:32:48 falcon-cl3 [<ffffffff802881f7>]
Dec  1 18:32:48 falcon-cl3 [<ffffffff8020247b>]
Dec  1 18:32:48 falcon-cl3

With gfs2:
Dec  4 08:58:03 falcon-cl3 "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Dec  4 08:58:03 falcon-cl3 apache2       D ffff8801e92182e0     0 16367   5233
Dec  4 08:58:03 falcon-cl3 ffff880052cb3ae8 0000000000000086 ffffe2000477e9b8 0000000000000000
Dec  4 08:58:03 falcon-cl3 ffff8801e85b2760 ffff8801efa04210 ffff8801e85b29b0 ffff880051dd5050
Dec  4 08:58:03 falcon-cl3 ffff880000000000 0000000000000000 ffffffff809323a0 ffff8801ec56fac0
Dec  4 08:58:03 falcon-cl3 Call Trace:
Dec  4 08:58:03 falcon-cl3 [<ffffffff805ac4c5>] 0xffffffff805ac4c5
Dec  4 08:58:03 falcon-cl3 [<ffffffffa00842d4>] 0xffffffffa00842d4
Dec  4 08:58:03 falcon-cl3 [<ffffffff80219899>] 0xffffffff80219899
Dec  4 08:58:03 falcon-cl3 [<ffffffff805aba38>] 0xffffffff805aba38
Dec  4 08:58:03 falcon-cl3 [<ffffffff80225df0>] 0xffffffff80225df0
Dec  4 08:58:03 falcon-cl3 [<ffffffffa00842fd>] 0xffffffffa00842fd
Dec  4 08:58:03 falcon-cl3 [<ffffffffa00854b6>] 0xffffffffa00854b6
Dec  4 08:58:03 falcon-cl3 [<ffffffffa0086437>] 0xffffffffa0086437
Dec  4 08:58:03 falcon-cl3 [<ffffffffa00831b2>] 0xffffffffa00831b2
Dec  4 08:58:03 falcon-cl3 [<ffffffffa0088537>] 0xffffffffa0088537
Dec  4 08:58:03 falcon-cl3 [<ffffffffa0092c46>] 0xffffffffa0092c46
Dec  4 08:58:03 falcon-cl3 [<ffffffff8028c799>] 0xffffffff8028c799
Dec  4 08:58:03 falcon-cl3 [<ffffffff80241d50>] 0xffffffff80241d50
Dec  4 08:58:03 falcon-cl3 [<ffffffff8028d0f5>] 0xffffffff8028d0f5
Dec  4 08:58:03 falcon-cl3 [<ffffffff8028d4d3>] 0xffffffff8028d4d3
Dec  4 08:58:03 falcon-cl3 [<ffffffff8020288b>] 0xffffffff8020288b
Dec  4 08:58:03 falcon-cl3

Only a full node restart helps, and then only for about the next 30 minutes.
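
For triage, hung-task reports like the ones in this report can be summarized by blocked task and by the module each resolved frame belongs to. The following is a minimal Python sketch, not part of the original report: the function name `parse_hung_task` and the `"kernel"` label for frames without a module suffix are assumptions made for illustration. Frames that were logged without symbols (as in the ocfs2 and gfs2 traces) are skipped.

```python
import re

# Matches e.g. "INFO: task apache2:3133 blocked for more than 120 seconds."
TASK_RE = re.compile(r"INFO: task (\S+):(\d+) blocked for more than (\d+) seconds")

# Matches resolved frames such as
#   "[<ffffffffa019b2ff>] dlm_lock+0x9f/0x1c0 [dlm]"
# The trailing "[module]" part is optional; core kernel frames do not have it.
FRAME_RE = re.compile(
    r"\[<[0-9a-f]+>\]\s+(\w+)\+0x[0-9a-f]+/0x[0-9a-f]+(?:\s+\[(\w+)\])?"
)


def parse_hung_task(log_text):
    """Return (tasks, frames) extracted from hung-task kernel log text.

    tasks  -- list of (command, pid, timeout_seconds)
    frames -- list of (symbol, module); module is "kernel" (an illustrative
              label, not something the kernel prints) when no module is shown
    """
    tasks = []
    frames = []
    for line in log_text.splitlines():
        match = TASK_RE.search(line)
        if match:
            tasks.append((match.group(1), int(match.group(2)), int(match.group(3))))
            continue
        match = FRAME_RE.search(line)
        if match:
            frames.append((match.group(1), match.group(2) or "kernel"))
    return tasks, frames


if __name__ == "__main__":
    sample = (
        "Dec  7 18:11:56 falcon-cl3 INFO: task apache2:3133 blocked for more than 120 seconds.\n"
        "Dec  7 18:11:56 falcon-cl3 [<ffffffffa019b2ff>] dlm_lock+0x9f/0x1c0 [dlm]\n"
        "Dec  7 18:11:56 falcon-cl3 [<ffffffff805a5095>] schedule_timeout+0x95/0xd0\n"
    )
    print(parse_hung_task(sample))
```

Grouping the resulting frames by module makes it easier to see whether a task is waiting in the lock manager (`dlm`), in the filesystem module, or in the core kernel.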

falcon-cl3 ~ # uname -a
Linux falcon-cl3 2.6.27-gentoo-r4 #1 SMP Sat Dec 6 18:28:37 EET 2008 x86_64 Intel(R) Xeon(R) CPU 5110 @ 1.60GHz GenuineIntel GNU/Linux

Using profile /usr/portage/profiles/default/linux/amd64/2008.0/no-multilib

www-servers/apache 2.2.9-r1

On the same 3 servers with openSUSE 11 and OCFS2, everything works smoothly.


Reproducible: Always
Comment 1 Oleg Gawriloff 2008-12-07 16:22:46 UTC
Created attachment 174576
kernel config /proc/config.gz
Comment 2 Oleg Gawriloff 2008-12-07 17:18:28 UTC
Generally speaking, this looks like http://bugzilla.kernel.org/show_bug.cgi?id=10582, but it concerns ocfs2/gfs/gfs2.
Comment 3 Oleg Gawriloff 2008-12-07 19:18:39 UTC
The same thing with vanilla-sources.
Linux falcon-cl3 2.6.27.8 #1 SMP Sun Dec 7 19:11:27 EET 2008 x86_64 Intel(R) Xeon(R) CPU 5110 @ 1.60GHz GenuineIntel GNU/Linux
Comment 4 Patrick Lauer gentoo-dev 2008-12-07 20:56:56 UTC
We've had the same issue at work with heavily loaded machines.
It happens when pdflush takes more than 120 seconds to push dirty pages from RAM to disk and usually means that either your hardware is misconfigured or your system load is much too high.
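
One way to check the writeback explanation on an affected node is to watch how much dirty data is waiting to be flushed. The following is a minimal Python sketch, assuming the standard `Dirty` and `Writeback` fields of `/proc/meminfo`; the helper names `parse_meminfo` and `dirty_summary` are hypothetical and chosen only for this example.

```python
def parse_meminfo(text):
    """Return a dict mapping /proc/meminfo field names to integer values.

    Most fields are reported in kB; the unit suffix is dropped here.
    """
    fields = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        if not sep:
            continue  # not a "Name: value" line
        parts = rest.split()
        if parts and parts[0].isdigit():
            fields[name.strip()] = int(parts[0])
    return fields


def dirty_summary(text):
    """Extract the amount of dirty and in-flight writeback memory (kB)."""
    info = parse_meminfo(text)
    return {
        "Dirty": info.get("Dirty", 0),
        "Writeback": info.get("Writeback", 0),
    }


if __name__ == "__main__":
    # Reads the live values; only works on a Linux system.
    with open("/proc/meminfo") as meminfo:
        print(dirty_summary(meminfo.read()))
```

If `Dirty` stays large while `Writeback` barely moves in the minutes before a hang, slow flushing to the shared storage is a plausible cause; if both stay small, the blocked tasks are more likely waiting on something else, such as a cluster lock.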
Comment 5 Oleg Gawriloff 2008-12-09 05:48:04 UTC
I don't agree, because I reverted those servers to openSUSE 11 with the generic kernel (Linux falcon-cl3 2.6.25.18-0.2-default #1 SMP 2008-10-21 16:30:26 +0200 x86_64 x86_64 x86_64 GNU/Linux)
and, with the same set of software (nginx + apache/mod_php + ocfs2, all package based) and the same load, everything works as expected (i.e. no hangs even when LA > 50).