Bug 198215 - gentoo-sources-2.6.22-r9 md hangs
Summary: gentoo-sources-2.6.22-r9 md hangs
Status: RESOLVED NEEDINFO
Alias: None
Product: Gentoo Linux
Classification: Unclassified
Component: [OLD] Core system
Hardware: All Linux
Importance: High critical
Assignee: Gentoo Kernel Bug Wranglers and Kernel Maintainers
URL:
Whiteboard: linux-2.6.25
Keywords:
Depends on:
Blocks:
 
Reported: 2007-11-05 22:07 UTC by Andrej Filipcic
Modified: 2008-05-12 15:07 UTC
2 users

See Also:
Package list:
Runtime testing required: ---


Attachments
sysreq-T on raid5 hang (kern.log,98.63 KB, text/plain)
2007-11-06 00:05 UTC, Andrej Filipcic
Details
fix-misapplied-biofill-op.patch (fix-misapplied-biofill-op.patch,621 bytes, patch)
2007-11-06 01:05 UTC, Mike Pagano
Details | Diff
emerge --info (emerge.info,2.83 KB, text/plain)
2007-11-06 21:28 UTC, Andrej Filipcic
Details
.config (config-2.6.24-rc1-git15,68.69 KB, text/plain)
2007-11-06 21:29 UTC, Andrej Filipcic
Details
clearing of biofill operations patch (fix-clearing-of-biofill-operations.patch,5.64 KB, patch)
2007-11-09 00:11 UTC, Mike Pagano
Details | Diff
dmesg with sysreq-t (dmesg.log,96.44 KB, text/plain)
2007-12-03 17:38 UTC, Andrej Filipcic
Details
dmesg with sysreq-t (dmesg.2.6.24-rc8,117.08 KB, text/plain)
2008-01-24 23:07 UTC, Andrej Filipcic
Details

Description Andrej Filipcic 2007-11-05 22:07:09 UTC
md devices (raid0 and raid5) hang quite regularly, and all processes accessing the corresponding mount points stay in D state until reboot. This has happened on i386/raid0 and amd64/raid5 on different machines. On both machines, 2.6.20-r* worked without any problems. Is this known? Is there a patch for it?

A similar problem was reported against 2.6.23.1:
http://article.gmane.org/gmane.linux.raid/17131
Comment 1 Andrej Filipcic 2007-11-06 00:05:13 UTC
Created attachment 135286 [details]
sysreq-T on raid5 hang
Comment 2 Mike Pagano gentoo-dev 2007-11-06 01:05:00 UTC
The patch referred to on the mailing list was committed tonight (11/5) and is not yet in a git snapshot. I fixed the patch to apply to gentoo-sources-2.6.23-r1.

Could you apply this patch, recompile and install your kernel, and let me know if it fixes your issue?

Apply the patch as follows:
1. go to /usr/src/linux, or wherever your kernel sources reside
2. type: patch -p1 < fix-misapplied-biofill-op.patch
3. rebuild and install your kernel as normal
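The steps above can be illustrated on a toy tree; everything below is a made-up stand-in (in practice you would run `patch -p1` in /usr/src/linux with the attached fix-misapplied-biofill-op.patch):

```shell
# Toy illustration of the "patch -p1" mechanics from the steps above.
# The tree and patch here are made-up stand-ins -- in practice you would
# run this in /usr/src/linux with the attached fix-misapplied-biofill-op.patch.
set -e
tmp=$(mktemp -d)

# Stand-in for the kernel tree, with one file the patch will touch.
mkdir -p "$tmp/linux/drivers/md"
printf 'old line\n' > "$tmp/linux/drivers/md/raid5.c"

# Stand-in for the fix; -p1 strips the leading a/ and b/ path components.
cat > "$tmp/fix.patch" <<'EOF'
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -1 +1 @@
-old line
+new line
EOF

# Step 1: go to the source tree; step 2: apply with -p1.
cd "$tmp/linux"
patch -p1 < ../fix.patch

# The file now carries the patched content.
patched=$(cat drivers/md/raid5.c)
echo "$patched"
```

The `-p1` level matters because the patch paths start with `a/` and `b/`; run from the top of the source tree, one leading component is stripped so `a/drivers/md/raid5.c` resolves to `drivers/md/raid5.c`.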
Comment 3 Mike Pagano gentoo-dev 2007-11-06 01:05:42 UTC
Created attachment 135287 [details, diff]
fix-misapplied-biofill-op.patch
Comment 4 Andrej Filipcic 2007-11-06 08:56:57 UTC
This patch is for 2.6.23.x. I would rather solve the issue in 2.6.22.x, and I do not see how this patch could easily be backported. I also mentioned raid0, but that was a mistake: there is no problem with raid0. Both machines use raid5.

I will test with 2.6.23.x and report.
Comment 5 Andrej Filipcic 2007-11-06 12:22:42 UTC
I have tested 2.6.23-gentoo-r1 with the biofill patch. It does not solve the problem.
Comment 6 Mike Pagano gentoo-dev 2007-11-06 12:51:03 UTC
OK, let's start by determining whether the latest patches have addressed the problem.

I have just committed git-sources-2.6.24_rc1-r15 to the tree. Once it hits the mirrors, can you please test with that kernel?

This snapshot contains the latest raid5 patches.

Comment 7 Andrej Filipcic 2007-11-06 16:37:37 UTC
I have tested git-sources-2.6.24_rc1-r15. This one seems to work, at least after writing 200k files, whereas 2.6.22 and 2.6.23 stopped after a few k files. The md device is also being reconstructed, to put a bit more stress on it. I will continue to run the checks to be sure...
Comment 8 Andrej Filipcic 2007-11-06 17:45:06 UTC
It does not work so well after all. The behavior is now different: after a couple of hours, the md0_raid5 thread is at 100% CPU with plenty of messages (the traces differ):

BUG: soft lockup - CPU#0 stuck for 11s! [md0_raid5:4270]

Pid: 4270, comm: md0_raid5 Not tainted (2.6.24-rc1-git15 #1)
EIP: 0060:[<f88b212e>] EFLAGS: 00000202 CPU: 0
EIP is at xor_sse_5+0x12e/0x3a8 [xor]
EAX: 0000000e EBX: c4f05200 ECX: c4f02200 EDX: c4f07200
ESI: c4f04200 EDI: c4f03200 EBP: c498fcb0 ESP: c498fcac
 DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068
CR0: 80050033 CR2: b7ef7000 CR3: 0056c000 CR4: 000006d0
DR0: 00000000 DR1: 00000000 DR2: 00000000 DR3: 00000000
DR6: ffff0ff0 DR7: 00000400
 [<f88b2a59>] xor_blocks+0x7d/0x85 [xor]
 [<f897e125>] async_xor+0x125/0x1a2 [async_xor]
 [<f897e1f4>] async_xor_zero_sum+0x52/0xba [async_xor]
 [<f9983d75>] ops_run_check+0x92/0xc7 [raid456]
 [<c013dc02>] lock_release_holdtime+0x25/0x43
 [<f9984c29>] handle_stripe5+0xe7f/0x10df [raid456]
 [<f9983e2c>] handle_stripe5+0x82/0x10df [raid456]
 [<c013dc02>] lock_release_holdtime+0x25/0x43
 [<c013e58a>] __lock_acquire+0x3a6/0x609
 [<f998648d>] handle_stripe+0xc08/0xc36 [raid456]
 [<c013e58a>] __lock_acquire+0x3a6/0x609
 [<c013dc02>] lock_release_holdtime+0x25/0x43
 [<c013dc02>] lock_release_holdtime+0x25/0x43
 [<f99867df>] raid5d+0x324/0x346 [raid456]
 [<f99867ec>] raid5d+0x331/0x346 [raid456]
 [<c031aa53>] md_thread+0xb4/0xd6
 [<c03a0c4b>] _spin_lock_irqsave+0x54/0x5d
 [<c031aa5e>] md_thread+0xbf/0xd6
 [<c0136aba>] autoremove_wake_function+0x0/0x33
 [<c031a99f>] md_thread+0x0/0xd6
 [<c0136a05>] kthread+0x38/0x5f
 [<c01369cd>] kthread+0x0/0x5f
 [<c0104ab3>] kernel_thread_helper+0x7/0x10
 =======================
Comment 9 Mike Pagano gentoo-dev 2007-11-06 20:25:14 UTC
Please attach your .config and the output of emerge --info.
Comment 10 Andrej Filipcic 2007-11-06 21:28:45 UTC
Created attachment 135373 [details]
emerge --info
Comment 11 Andrej Filipcic 2007-11-06 21:29:26 UTC
Created attachment 135375 [details]
.config
Comment 12 Mike Pagano gentoo-dev 2007-11-06 22:19:19 UTC
Using the same gcc version and kernel as in the trace, can you perform the following from your kernel source directory and post the results here:

make CONFIG_DEBUG_INFO=y crypto/xor.o
gdb crypto/xor.o
(gdb) list *xor_sse_5+0x12e
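The `xor_sse_5+0x12e` expression handed to gdb comes straight from the `EIP is at ...` line of the soft-lockup trace in comment 8; a small sketch (with that sample line hard-coded) of pulling the expression out of a log:

```shell
# Pull the symbol+offset out of an oops/soft-lockup "EIP is at" line so
# it can be passed to gdb's "list *symbol+offset". The sample line is
# copied from the trace in comment 8.
eip_line='EIP is at xor_sse_5+0x12e/0x3a8 [xor]'

# Keep everything between "at " and the "/<function size>" part.
symbol_offset=$(printf '%s\n' "$eip_line" | sed 's/^EIP is at \([^/]*\)\/.*/\1/')

echo "list *$symbol_offset"
```

The part after the slash (`0x3a8` here) is the total function size, which gdb does not need; only `symbol+offset` identifies the faulting source line.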

Comment 13 Andrej Filipcic 2007-11-06 22:37:26 UTC
(gdb) list *xor_sse_5+0x12e
0x12e is in xor_sse_5 (include/asm/xor_32.h:783).
778                because we modify p4 and p5 there, but we can't mark them
779                as read/write, otherwise we'd overflow the 10-asm-operands
780                limit of GCC < 3.1.  */
781             __asm__ ("" : "+r" (p4), "+r" (p5));
782
783             __asm__ __volatile__ (
784     #undef BLOCK
785     #define BLOCK(i) \
786                     PF1(i)                                  \
787                                     PF1(i+2)                \
Comment 14 Andrej Filipcic 2007-11-06 23:14:44 UTC
I tried writing once more with the git kernel. This time the processes hung as with 2.6.22, for example:

Nov  7 00:00:46 f9pc18 pdflush       D c2c22a98     0   183      2
Nov  7 00:00:46 f9pc18 00155589 00000086 00000145 c2c22a98 00000002 c3fdac78 c0520ed8 c056bb00
Nov  7 00:00:46 f9pc18 c056bb00 c30d0f40 c30d1084 c2c6db00 00000000 f89612bd 00000246 f89612b3
Nov  7 00:00:46 f9pc18 000000ff 00000000 00000000 00000145 c3b9050c c3b90400 c3b904ac c3b904b4
Nov  7 00:00:46 f9pc18 Call Trace:
Nov  7 00:00:46 f9pc18 [<f89612bd>] unplug_slaves+0xe0/0xfb [raid456]
Nov  7 00:00:46 f9pc18 [<f89612b3>] unplug_slaves+0xd6/0xfb [raid456]
Nov  7 00:00:46 f9pc18 [<f896205a>] get_active_stripe+0x1e6/0x432 [raid456]
Nov  7 00:00:46 f9pc18 [<c013dc02>] lock_release_holdtime+0x25/0x43
Nov  7 00:00:46 f9pc18 [<c03a0c4b>] _spin_lock_irqsave+0x54/0x5d

All the other D-state processes hang in the same place (unplug_slaves).

If it helps:

(gdb) list *unplug_slaves+0xe0
0x2bd is in unplug_slaves (drivers/md/raid5.c:3197).
3192                            rdev_dec_pending(rdev, mddev);
3193                            rcu_read_lock();
3194                    }
3195            }
3196            rcu_read_unlock();
3197    }
3198
3199    static void raid5_unplug_device(struct request_queue *q)
3200    {
3201            mddev_t *mddev = q->queuedata;


Comment 15 Mike Pagano gentoo-dev 2007-11-09 00:11:59 UTC
Created attachment 135535 [details, diff]
clearing of biofill operations patch

Some people on the mailing list reported this as fixing the issue. Can you apply it to a clean gentoo-sources-2.6.23-r1 and post the results?
Comment 16 Andrej Filipcic 2007-11-09 11:26:31 UTC
So far, so good. Within 5h of writing there are no problems. I will fill ~600GB (1M files) in a day or so and report if something goes wrong.

If this patch is OK, is there a possibility to backport it to 2.6.22? This kernel is still widely used.
Comment 17 Andrej Filipcic 2007-11-09 13:09:31 UTC
Well, I am always too quick. After 6h, 120k files, and 170GB written, md0_raid5 is at 100% CPU and all the other md-accessing processes are in the D state:

Nov  9 14:00:30 f9pc18 =======================
Nov  9 14:00:30 f9pc18 md0_raid5     R running      0  4126      2
Nov  9 14:00:30 f9pc18 xfsbufd       S c2e48270     0  4760      2
Nov  9 14:00:30 f9pc18 e333bf8c 00000086 00000046 c2e48270 00000001 c059cd10 c050eddc c0559e80
Nov  9 14:00:30 f9pc18 c0559e80 c2e48270 c2e483b0 c2c6de80 00000000 00000046 c059cd00 00000296
Nov  9 14:00:30 f9pc18 c059cd00 f740b360 00000296 e333bf9c c059cd00 c012c76e 00000000 00000296
Nov  9 14:00:30 f9pc18 Call Trace:
Nov  9 14:00:30 f9pc18 [<c012c76e>] __mod_timer+0x92/0x9c
Nov  9 14:00:30 f9pc18 [<c039734f>] schedule_timeout+0x70/0x8d
Nov  9 14:00:30 f9pc18 [<c0211e04>] xfs_buf_delwri_split+0xc5/0xcf
Nov  9 14:00:30 f9pc18 [<c012c587>] process_timeout+0x0/0x5
Nov  9 14:00:30 f9pc18 [<c0211fac>] xfsbufd+0x58/0xec
Nov  9 14:00:30 f9pc18 [<c0211f54>] xfsbufd+0x0/0xec
Nov  9 14:00:30 f9pc18 [<c0134f31>] kthread+0x38/0x5f
Nov  9 14:00:30 f9pc18 [<c0134ef9>] kthread+0x0/0x5f
Nov  9 14:00:30 f9pc18 [<c0104a5f>] kernel_thread_helper+0x7/0x10
Nov  9 14:00:30 f9pc18 =======================
Comment 18 Andrej Filipcic 2007-11-10 09:16:13 UTC
One more annoying thing: after the reset, xfs_repair on the md device oopsed right at the start, in something like get_next_stripe (I did not catch the log). The md device can still be mounted and used, but I would say it is not really safe. There might be a corruption bug in the raid5 or xfs code...
Comment 19 Mike Pagano gentoo-dev 2007-11-23 18:18:58 UTC
The thread at http://marc.info/?l=linux-raid&m=119502458615538&w=2 indicates two upcoming patches to fix a problem which appears to be similar to yours.

They indicate that the problem does not occur in 2.6.22. Not sure if you tested that kernel.

Comment 20 Mike Pagano gentoo-dev 2007-11-23 18:20:12 UTC
Maybe related:
http://bugzilla.kernel.org/show_bug.cgi?id=9419
Comment 21 Andrej Filipcic 2007-12-03 17:36:35 UTC
I found some time today to test gentoo-sources-2.6.23-r3. raid5 still hangs...
Comment 22 Andrej Filipcic 2007-12-03 17:38:18 UTC
Created attachment 137635 [details]
dmesg with sysreq-t
Comment 23 Mike Pagano gentoo-dev 2007-12-19 19:55:54 UTC
The latest vanilla kernel rc release contains two patches which might be related to your problem. One is a biofill patch and the other fixes an unending write sequence.

Could you please test with vanilla-sources-2.6.24_rc5 and post the results?
Comment 24 Mike Pagano gentoo-dev 2008-01-08 15:52:47 UTC
Please reopen when you've had a chance to test with the latest development kernel as requested in comment #23.
Comment 25 Andrej Filipcic 2008-01-24 23:05:39 UTC
I had a chance today to check vanilla-sources-2.6.24_rc8. The problem is still there, but it occurs much later (after 350k files instead of 100k).
Comment 26 Andrej Filipcic 2008-01-24 23:07:07 UTC
Created attachment 141735 [details]
dmesg with sysreq-t
Comment 27 Andrej Filipcic 2008-01-24 23:45:23 UTC
I have found the fix from Neil Brown on
http://thread.gmane.org/gmane.linux.raid/17738
so I will try with that and report the results.
Comment 28 Andrej Filipcic 2008-01-25 18:55:42 UTC
Neil's patches work. No troubles for 1TB, 1M files. I guess we have to wait for 2.6.24.1.
Comment 29 Mike Pagano gentoo-dev 2008-02-07 02:08:00 UTC
It looks like these patches have made it into the mainline tree. They should be in git-sources-2.6.24-r16, which does not exist yet, but as soon as I see a git snapshot, I will commit the ebuild.

So when you have a moment, can you test git-sources-2.6.24-r16 once it's available and post the results?

http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=1ec4a9398dc05061b6258061676fede733458893
http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=c5d79adba7ced41d7ac097c2ab74759d10522dd5
http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=29ac4aa3fc68a86279aca50f20df4d614de2e204
http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=6ed3003c19a96fe18edf8179c4be6fe14abbebbc
Comment 30 Andrej Filipcic 2008-02-07 16:13:31 UTC
I tried (remotely) to boot git-sources-2.6.24-r16, but it panicked. I will not be able to run the tests for two weeks, as I will be away.
Comment 31 Andrej Filipcic 2008-03-05 07:22:42 UTC
I did some tests with various kernels. gentoo-sources-2.6.24-r3 still does not work properly. git-sources-2.6.25_rc3-r4 works OK, and as far as I can see, its md code is the same as in vanilla-sources-2.6.25_rc3. So it seems 2.6.25 will be OK, although it would be nice if the md patches could be backported to 2.6.24, or maybe even 2.6.23.
Comment 32 Mike Pagano gentoo-dev 2008-04-25 00:55:58 UTC
Can you test with the latest gentoo-sources, which is 2.6.25-r1 as of this writing?
Comment 33 Mike Pagano gentoo-dev 2008-05-12 15:03:17 UTC
Please reopen if there is still a problem with the latest 2.6.25 kernel.
Comment 34 Andrej Filipcic 2008-05-12 15:07:11 UTC
Sorry, I forgot to report: 2.6.25 works fine. The heavily loaded server running 2.6.25-gentoo-r1 has been up for 9 days without a single problem.