I am using a software RAID volume for the root file system on some of my servers. Earlier today I tried to upgrade the kernel to 2.6.10-gentoo-r1 on one of them. At the same time I tried to switch to udev (I know that it is unreasonable to do two such things at the same time, but so far gentoo-sources have worked so well for me that I did not expect any trouble). After rebooting I got a kernel panic message with an explanation that the root volume /dev/md2 was corrupt.

Reproducible: Always

Steps to Reproduce:
1. emerge gentoo-dev-sources
2. cp /boot/config-2.6.9-gentoo-rx /usr/src/linux-2.6.10-gentoo-r1
3. cd /usr/src/linux-2.6.10-gentoo-r1
4. make oldconfig and then menuconfig to get rid of devfs
5. make
6. copy System.map, .config, bzImage to /boot (I skip modules as I use a monolithic kernel on that box)
7. edit grub.conf accordingly
8. reboot

Actual Results:
The system failed to boot with a message that the root volume was corrupt. After that I booted from a live CD and, after starting RAID manually, I did

# reiserfsck --rebuild-tree /dev/md2

I have never tried that before and I must say that I am impressed with reiserfs and reiserfsprogs. I was able to recover all the files without any data loss, even though the picture looked scary at first. The whole file system ended up in a lost+found directory and the subdirectories had names like 2_16779. However, it was easy to move them back into the right places. At this point I was not sure what was wrong - udev or the kernel - so I unmerged udev and rebooted the system with the old 2.6.9-gentoo-r6 kernel without any trouble. Then I recompiled the 2.6.10 kernel with devfs support back in and tried to boot the system with it. Again the boot attempt ended with a kernel panic message due to root filesystem corruption, so most likely udev is innocent here. I had to repeat the system recovery (this time it turned out to be easier, as reiserfsck was able to put the recovered directories back into the right places without my help).
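For reference, the recovery described above can be sketched roughly like this (the component device names are my assumptions, not taken from the report, and --rebuild-tree is a destructive repair, so reiserfsck's own warnings about backing up first apply):

```shell
# Boot a live CD, then assemble the array by hand before repairing it.
# /dev/hda3 and /dev/hdc3 are hypothetical component partitions.
mdadm --assemble /dev/md2 /dev/hda3 /dev/hdc3   # or raidstart /dev/md2 with raidtools
reiserfsck --rebuild-tree /dev/md2
# Afterwards, mount /dev/md2 and check lost+found for recovered
# directories (names like 2_16779) that may need moving back by hand.
```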
Finally I emerged the 2.6.9-gentoo-r13 kernel and left the system running with that. I guess that the problem is with the kernel; however, I am not 100% sure - it could also be some mistake of mine that I am not aware of - please let me know if you see any possibilities. Anyway, I did not try to re-emerge udev. I left the 2.6.9-gentoo-r13 kernel with devfs support, as I did not feel like recovering the system for the third time (and I might not be as lucky with the recovery as I was the first two times).

Expected Results: The system should have booted up as always.
Sounds very much like a kernel bug, since RAID volumes are assembled in kernel space without the need for userspace tools (as long as you use md superblocks or specify the device layouts as kernel parameters)
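For illustration, the two in-kernel assembly paths mentioned here look roughly like this in grub.conf (the kernel image name and partition names are assumptions, not taken from the reporter's actual setup):

```
title Gentoo Linux 2.6.10-gentoo-r1
root (hd0,0)
# With persistent md superblocks on partitions of type 0xfd, the kernel
# autodetects and assembles the array, so root= alone is enough:
kernel /bzImage root=/dev/md2
# Alternatively, spell the layout out explicitly as a kernel parameter:
# kernel /bzImage root=/dev/md2 md=2,/dev/hda3,/dev/hdc3
```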
Evil... we had a similar-ish report about reiserfs corruption in 2.6.9. /me writes to namesys
Similar to bug #73145
I have been trying to investigate this problem for a while today. I am using vmware to simulate a similar configuration, that is, with 2 virtual drives on 2 IDE channels and a RAID 1 volume built on top of that setup. It seems strange, but the 2.6.10-gentoo-r1 kernel works with that setup perfectly well no matter what I try to do with the config - both with udev and devfs. It seems that the problem I encountered yesterday must have something to do with the low-level drivers, not the soft RAID itself. On that system I have a Promise PDC20276 controller but I do not use the native RAID. I disabled that in the BIOS and I am using that controller like a regular IDE device. Possibly the problem is here?
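A test setup like the one described can be reproduced roughly as follows (the device names, and the use of mdadm rather than the older raidtools, are my assumptions):

```shell
# Mirror two partitions that sit on different IDE channels,
# matching the two-drive RAID 1 layout described above.
mdadm --create /dev/md2 --level=1 --raid-devices=2 /dev/hda3 /dev/hdc3
mkreiserfs /dev/md2    # put reiserfs on top, as in the original report
```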
You are definitely using reiserfs within vmware? Also, we really need to confirm that no gentoo patches are causing this; after all, we do currently add multipath support... We really need to get a clean 2.6.10 tested, but I understand the risk of data loss means you might not want to try this.
Now I suspect that the reason may be the plug and play support added recently to the kernel. I turned it on by mistake while doing the oldconfig. Possibly this feature succeeds in turning the Promise RAID on, and then the Promise BIOS tries to rebuild the array simultaneously with the md driver. I will try to check that later, but I have to take that box off the network first and back up everything, in order to be able to restore it fast in case I am wrong. I will let you know.
Thanks, your efforts are much appreciated
It seems that my previous hypothesis was right. I rebuilt the 2.6.10-gentoo-r1 kernel, but this time I took my old config verbatim and then double-checked that the plug & play support was disabled. Now everything works OK (including udev, which I reinstalled earlier). Two RAID controllers working on the same volume explain the fs corruption. I wish I were able to record the kernel messages - that should confirm an attempt to enable the Promise BIOS - but unfortunately I do not see how; there is no way to save anything under those circumstances. Possibly taking a picture with a camera would do the job. So it seems that what I thought was a bug is actually a feature, albeit a bit dangerous one :-)
dmesg and /var/log/messages, for instance, are your friends. Forward this to the linux kernel mailing list?
Sure but only if the system has a chance to save them...
If you can create a boot CD using the bogus kernel, you can bring up a live filesystem alongside the one on the raid array that gets trashed. Or, if you have a null-modem cable, you can redirect console output to another PC (serial port).
dmesg can't be run during a kernel panic, and /var/log/messages won't be written to when the kernel fails to mount the root device. Taking a picture with a camera will be fine, if that is an option. Alternatively, you could set up a serial console, where the kernel messages are printed out over a serial cable connected to another computer which logs them. Or you could just transcribe the bits you find to be relevant. Which exact kernel option did you deduce was the cause of this?
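The serial console suggested here needs only kernel parameters; a rough grub.conf sketch (the image name, root device, and baud rate are assumptions):

```
# Send kernel messages to the first serial port as well as the screen.
kernel /bzImage root=/dev/md2 console=ttyS0,115200n8 console=tty0
```

On the receiving machine, something like `screen /dev/ttyS0 115200` or minicom can then capture the boot output to a file.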
Thank you for your suggestions on how to record the boot messages. I will try to get a serial cable. As for the kernel features that in my opinion caused the fs corruption, they were: Device Drivers --> Plug and Play Support and Device Drivers --> Plug and Play Support --> ACPI Plug and Play support
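For anyone checking their own config, those menu entries should correspond to roughly these .config symbols (names based on 2.6.10-era Kconfig; double-check against your own tree):

```
# CONFIG_PNP is not set
# CONFIG_PNPACPI is not set
```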
I don't know how interesting/vital those messages will be (only you know what you are seeing...), but if it's too much trouble then I wouldn't worry too much right now. Could you please file a bug upstream about this (http://bugzilla.kernel.org)? Although in your situation enabling PnP (and hence enabling hw RAID) would not have produced the desired results, I think the fact that it caused proper filesystem corruption means this can be classed as a bug. After opening the bug, please update the URL field here with the URL of your bug report. Thanks.
In my opinion there is no point in filing this as a bug at kernel.org or anywhere else. The plug and play feature did exactly what it was expected to do, that is, it configured and turned on the hardware RAID driver. I admit that was my fault - I should have expected that and checked my config more carefully. I do not see what could be done about it, unless we want to make Windows out of Linux - ask about 10 times if the user really wants to do what he/she is trying to do, and then do something else anyway :-)