md devices (raid0 and raid5) hang quite regularly, and all processes accessing the corresponding mount points are stuck in the D state until reboot. This has happened on i386/raid0 and amd64/raid5 on different machines. On both machines, 2.6.20-r* worked without any problems. Is this a known issue? Is there a patch for it? A similar problem was reported against 2.6.23.1: http://article.gmane.org/gmane.linux.raid/17131
Created attachment 135286 [details] sysreq-T on raid5 hang
The patch referred to on the mailing list was committed tonight (11/5) and is not yet in a git snapshot. I fixed the patch to apply to gentoo-sources-2.6.23-r1. Could you apply this patch, recompile and install your kernel, and let me know if it fixes your issue. Apply this patch by:
1. cd to /usr/src/linux or wherever your Linux sources reside
2. run: patch -p1 < fix-misapplied-biofill-op.patch
3. rebuild and install your kernel as normal
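For anyone unfamiliar with `patch -p1`, the steps above can be sketched on a tiny throwaway tree. All paths under /tmp/patch-demo below are hypothetical demo files; in the real case you would cd into /usr/src/linux and feed it fix-misapplied-biofill-op.patch instead.

```shell
# Demo of the patch workflow on a throwaway tree (hypothetical paths).
set -e
rm -rf /tmp/patch-demo
mkdir -p /tmp/patch-demo/a /tmp/patch-demo/b
printf 'old line\n' > /tmp/patch-demo/a/file.c
printf 'new line\n' > /tmp/patch-demo/b/file.c
cd /tmp/patch-demo
# diff exits with status 1 when the files differ, so tolerate that
diff -u a/file.c b/file.c > fix.patch || true
mkdir -p tree
cp a/file.c tree/file.c
cd tree
patch -p1 < ../fix.patch     # -p1 strips the leading "a/" / "b/" component
grep -q 'new line' file.c && echo 'patch applied'
```

The `-p1` level matters: the hunk headers name `a/file.c` and `b/file.c`, and stripping one path component makes them resolve relative to the tree you are standing in, which is why step 1 above says to cd into the source directory first.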
Created attachment 135287 [details, diff] fix-misapplied-biofill-op.patch
This patch is for 2.6.23.x. I would rather solve the issue with 2.6.22.x, and I do not see how this patch could be easily backported. I also mentioned raid0, but that was a mistake: there is no problem with raid0. Both machines use raid5. I will test with 2.6.23.x and report.
I have tested 2.6.23-gentoo-r1 with the biofill patch. It does not solve the problem.
Ok, let's start by determining if the latest patching has addressed the problem. I have just committed git-sources-2.6.24_rc1-r15 to the tree. Once it hits the mirrors, can you please test with that kernel? This snapshot has the latest patches for raid5.
I have tested git-sources-2.6.24_rc1-r15. This one seems to work, at least after writing 200k files, while 2.6.22 and 2.6.23 stalled after a few thousand files. The md device is also being reconstructed to add a bit more stress. I will continue to run the checks to be sure...
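The reproduction workload here (writing very many small files to the md-backed filesystem) can be sketched roughly as follows. The target directory and count are placeholders: the real test pointed at the raid5 mount and wrote hundreds of thousands of files.

```shell
# Rough sketch of the many-small-files stress workload (assumed paths).
# For a real run, point TARGET at the raid5 mount, e.g. /mnt/md0, and
# raise COUNT; /tmp/md-stress and 100 are just for demonstration.
TARGET=${TARGET:-/tmp/md-stress}
COUNT=${COUNT:-100}
mkdir -p "$TARGET"
i=1
while [ "$i" -le "$COUNT" ]; do
    # 16 KiB per file keeps many raid5 stripes in flight per megabyte
    dd if=/dev/zero of="$TARGET/f$i" bs=4k count=4 2>/dev/null
    i=$((i + 1))
done
sync    # force writeback so the raid5 code actually handles the stripes
echo "wrote $COUNT files into $TARGET"
```

The `sync` at the end matters for reproducing hangs like this one: without it, much of the data can sit in the page cache and never exercise the stripe-handling paths where the processes get stuck.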
It does not work so well. Now the behavior is different. After a couple of hours, the md0_raid5 thread is at 100% cpu with plenty of messages (traces are different):

BUG: soft lockup - CPU#0 stuck for 11s! [md0_raid5:4270]
Pid: 4270, comm: md0_raid5 Not tainted (2.6.24-rc1-git15 #1)
EIP: 0060:[<f88b212e>] EFLAGS: 00000202 CPU: 0
EIP is at xor_sse_5+0x12e/0x3a8 [xor]
EAX: 0000000e EBX: c4f05200 ECX: c4f02200 EDX: c4f07200
ESI: c4f04200 EDI: c4f03200 EBP: c498fcb0 ESP: c498fcac
DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068
CR0: 80050033 CR2: b7ef7000 CR3: 0056c000 CR4: 000006d0
DR0: 00000000 DR1: 00000000 DR2: 00000000 DR3: 00000000
DR6: ffff0ff0 DR7: 00000400
 [<f88b2a59>] xor_blocks+0x7d/0x85 [xor]
 [<f897e125>] async_xor+0x125/0x1a2 [async_xor]
 [<f897e1f4>] async_xor_zero_sum+0x52/0xba [async_xor]
 [<f9983d75>] ops_run_check+0x92/0xc7 [raid456]
 [<c013dc02>] lock_release_holdtime+0x25/0x43
 [<f9984c29>] handle_stripe5+0xe7f/0x10df [raid456]
 [<f9983e2c>] handle_stripe5+0x82/0x10df [raid456]
 [<c013dc02>] lock_release_holdtime+0x25/0x43
 [<c013e58a>] __lock_acquire+0x3a6/0x609
 [<f998648d>] handle_stripe+0xc08/0xc36 [raid456]
 [<c013e58a>] __lock_acquire+0x3a6/0x609
 [<c013dc02>] lock_release_holdtime+0x25/0x43
 [<c013dc02>] lock_release_holdtime+0x25/0x43
 [<f99867df>] raid5d+0x324/0x346 [raid456]
 [<f99867ec>] raid5d+0x331/0x346 [raid456]
 [<c031aa53>] md_thread+0xb4/0xd6
 [<c03a0c4b>] _spin_lock_irqsave+0x54/0x5d
 [<c031aa5e>] md_thread+0xbf/0xd6
 [<c0136aba>] autoremove_wake_function+0x0/0x33
 [<c031a99f>] md_thread+0x0/0xd6
 [<c0136a05>] kthread+0x38/0x5f
 [<c01369cd>] kthread+0x0/0x5f
 [<c0104ab3>] kernel_thread_helper+0x7/0x10
=======================
Please attach your .config and the output of emerge --info.
Created attachment 135373 [details] emerge --info
Created attachment 135375 [details] .config
Can you perform the following from your kernel source directory, using the same gcc version and kernel as in the trace, and post the results here:

make CONFIG_DEBUG_INFO=y crypto/xor.o
gdb crypto/xor.o
(gdb) list *xor_sse_5+0x12e
(gdb) list *xor_sse_5+0x12e
0x12e is in xor_sse_5 (include/asm/xor_32.h:783).
778        because we modify p4 and p5 there, but we can't mark them
779        as read/write, otherwise we'd overflow the 10-asm-operands
780        limit of GCC < 3.1. */
781     __asm__ ("" : "+r" (p4), "+r" (p5));
782
783     __asm__ __volatile__ (
784  #undef BLOCK
785  #define BLOCK(i) \
786        PF1(i) \
787        PF1(i+2) \
I have tried once more to write with the git kernel. This time the processes hung like with 2.6.22, for example:

Nov 7 00:00:46 f9pc18 pdflush       D c2c22a98     0   183      2
Nov 7 00:00:46 f9pc18 00155589 00000086 00000145 c2c22a98 00000002 c3fdac78 c0520ed8 c056bb00
Nov 7 00:00:46 f9pc18 c056bb00 c30d0f40 c30d1084 c2c6db00 00000000 f89612bd 00000246 f89612b3
Nov 7 00:00:46 f9pc18 000000ff 00000000 00000000 00000145 c3b9050c c3b90400 c3b904ac c3b904b4
Nov 7 00:00:46 f9pc18 Call Trace:
Nov 7 00:00:46 f9pc18 [<f89612bd>] unplug_slaves+0xe0/0xfb [raid456]
Nov 7 00:00:46 f9pc18 [<f89612b3>] unplug_slaves+0xd6/0xfb [raid456]
Nov 7 00:00:46 f9pc18 [<f896205a>] get_active_stripe+0x1e6/0x432 [raid456]
Nov 7 00:00:46 f9pc18 [<c013dc02>] lock_release_holdtime+0x25/0x43
Nov 7 00:00:46 f9pc18 [<c03a0c4b>] _spin_lock_irqsave+0x54/0x5d

All the other D-state processes hang in the same place (unplug_slaves). If it helps:

(gdb) list *unplug_slaves+0xe0
0x2bd is in unplug_slaves (drivers/md/raid5.c:3197).
3192                    rdev_dec_pending(rdev, mddev);
3193                    rcu_read_lock();
3194            }
3195        }
3196        rcu_read_unlock();
3197  }
3198
3199  static void raid5_unplug_device(struct request_queue *q)
3200  {
3201        mddev_t *mddev = q->queuedata;
Created attachment 135535 [details, diff] clearing of biofill operations patch Someone on the mailing list reported this as fixing the issue. Can you apply it to a clean gentoo-sources-2.6.23-r1 and post the results.
So far, so good. Within 5h of writing there are no problems. I will fill ~600GB (1M files) in a day or so and report if something goes wrong. If this patch is OK, would it be possible to backport it to 2.6.22? That kernel is still widely used.
Well, I am always too quick. After 6h, 120k files, 170GB written, md0_raid5 is at 100% cpu and all the other md-accessing processes are in the D state:

Nov 9 14:00:30 f9pc18 =======================
Nov 9 14:00:30 f9pc18 md0_raid5     R running     0  4126      2
Nov 9 14:00:30 f9pc18 xfsbufd       S c2e48270    0  4760      2
Nov 9 14:00:30 f9pc18 e333bf8c 00000086 00000046 c2e48270 00000001 c059cd10 c050eddc c0559e80
Nov 9 14:00:30 f9pc18 c0559e80 c2e48270 c2e483b0 c2c6de80 00000000 00000046 c059cd00 00000296
Nov 9 14:00:30 f9pc18 c059cd00 f740b360 00000296 e333bf9c c059cd00 c012c76e 00000000 00000296
Nov 9 14:00:30 f9pc18 Call Trace:
Nov 9 14:00:30 f9pc18 [<c012c76e>] __mod_timer+0x92/0x9c
Nov 9 14:00:30 f9pc18 [<c039734f>] schedule_timeout+0x70/0x8d
Nov 9 14:00:30 f9pc18 [<c0211e04>] xfs_buf_delwri_split+0xc5/0xcf
Nov 9 14:00:30 f9pc18 [<c012c587>] process_timeout+0x0/0x5
Nov 9 14:00:30 f9pc18 [<c0211fac>] xfsbufd+0x58/0xec
Nov 9 14:00:30 f9pc18 [<c0211f54>] xfsbufd+0x0/0xec
Nov 9 14:00:30 f9pc18 [<c0134f31>] kthread+0x38/0x5f
Nov 9 14:00:30 f9pc18 [<c0134ef9>] kthread+0x0/0x5f
Nov 9 14:00:30 f9pc18 [<c0104a5f>] kernel_thread_helper+0x7/0x10
Nov 9 14:00:30 f9pc18 =======================
One more annoying thing. After the reset, xfs_repair on the md device oopsed right at the beginning, in something like get_next_stripe (I did not catch the log). The md device can still be mounted and used, but I would say it is not really safe. There might be some corruption bug in the raid5 or xfs code...
The thread at http://marc.info/?l=linux-raid&m=119502458615538&w=2 indicates two upcoming patches to fix a problem which appears to be similar to yours. They indicate that the problem does not occur in 2.6.22. Not sure if you tested that kernel.
maybe related: http://bugzilla.kernel.org/show_bug.cgi?id=9419
I found some time today to test gentoo-sources-2.6.23-r3. raid5 still hangs...
Created attachment 137635 [details] dmesg with sysreq-t
The latest vanilla kernel rc release contains two patches which might be related to your problem. One is a biofill patch and the other fixes an unending write sequence. Could you please test with vanilla-sources-2.6.24_rc5 and post the results.
Please reopen when you've had a chance to test with the latest development kernel as requested in comment #23.
I did have a chance today to check vanilla-sources-2.6.24_rc8. The problem is still there, but it occurs much later (after 350k files instead of 100k).
Created attachment 141735 [details] dmesg with sysreq-t
I have found the fix from Neil Brown at http://thread.gmane.org/gmane.linux.raid/17738 so I will try that and report the results.
Neil's patches work. No troubles for 1TB, 1M files. I guess we have to wait for 2.6.24.1.
It looks like these patches have made the mainline tree. They should be in git-sources-2.6.24-r16, which does not exist yet, but as soon as I see git snapshots I will commit the ebuild. So when you have a moment, can you test git-sources-2.6.24-r16 when it's available and post the results.

http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=1ec4a9398dc05061b6258061676fede733458893
http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=c5d79adba7ced41d7ac097c2ab74759d10522dd5
http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=29ac4aa3fc68a86279aca50f20df4d614de2e204
http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=6ed3003c19a96fe18edf8179c4be6fe14abbebbc
I tried (remotely) to boot git-sources-2.6.24-r16, but it panicked. I will not be able to do the tests for 2 weeks due to absence.
I did some tests with various kernels. gentoo-sources-2.6.24-r3 still does not work properly. git-sources-2.6.25_rc3-r4 works OK, and as far as I have seen, the md code is the same as in vanilla-sources-2.6.25_rc3. So it seems 2.6.25 will be OK, although it would be nice if the md patches could be backported to 2.6.24 or maybe even 2.6.23.
Can you test with the latest gentoo-sources, which is 2.6.25-r1 as of this writing.
Please reopen if there is still a problem with the latest 2.6.25 kernel.
Sorry, I forgot to report. 2.6.25 works fine. The heavily loaded server running 2.6.25-gentoo-r1 has been up for 9 days without a single problem.