I am using a software RAID volume for the root file system on some of my servers. Earlier today I tried to upgrade the kernel to 2.6.10-gentoo-r1 on one of them. At the same time I tried to switch to udev (I know that it is unreasonable to do two such things at the same time, but so far gentoo-sources have worked so well for me that I did not expect any trouble). After rebooting I got a kernel panic message with an explanation that the root volume /dev/md2 was corrupt.

Reproducible: Always

Steps to Reproduce:
1. emerge gentoo-dev-sources
2. cp /boot/config-2.6.9-gentoo-rx /usr/src/linux-2.6.10-gentoo-r1
3. cd /usr/src/linux-2.6.10-gentoo-r1
4. make oldconfig and then menuconfig to get rid of devfs
5. make
6. copy System.map, .config, bzImage to /boot (I skip modules as I use a monolithic kernel on that box)
7. edit grub.conf accordingly
8. reboot

Actual Results:
The system failed to boot with a message that the root volume was corrupt. After that I booted from a live CD and, after starting RAID manually, I did

# reiserfsck --rebuild-tree /dev/md2

I have never tried that before and I must say that I am impressed with reiserfs and reiserfsprogs. I was able to recover all the files without any data loss, even though the picture looked scary at first. The whole file system ended up in a lost+found directory and the subdirectories had names like 2_16779. However, it was easy to move them back into the right places. At this point I was not sure what was wrong - udev or the kernel - so I unmerged udev and rebooted the system with the old 2.6.9-gentoo-r6 kernel without any trouble. Then I recompiled the 2.6.10 kernel with devfs support back in and tried to boot the system with it. Again the boot attempt ended with a kernel panic message due to root filesystem corruption, so most likely udev is innocent here. I had to repeat the system recovery (this time it turned out to be easier, as reiserfsck was able to put the recovered directories back into the right places without my help).
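For reference, the recovery described above can be sketched roughly like this (the component device names are my assumptions, not taken from the report, and --rebuild-tree is a destructive repair, so reiserfsck's own warnings about backing up first apply):

```shell
# Boot a live CD, then assemble the array by hand before repairing it.
# /dev/hda3 and /dev/hdc3 are hypothetical component partitions.
mdadm --assemble /dev/md2 /dev/hda3 /dev/hdc3   # or raidstart /dev/md2 with raidtools
reiserfsck --rebuild-tree /dev/md2
# Afterwards, mount /dev/md2 and check lost+found for recovered
# directories (names like 2_16779) that may need moving back by hand.
```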
Finally I emerged the 2.6.9-gentoo-r13 kernel and left the system running with that. I guess that the problem is with the kernel; however, I am not 100% sure - it could also be some mistake of mine that I am not aware of - please let me know if you see any possibilities. Anyway, I did not try to re-emerge udev. I left the 2.6.9-gentoo-r13 kernel with devfs support, as I did not feel like recovering the system for the third time (and I might not be as lucky with the recovery as I was the first two times).

Expected Results: The system should have booted up as always.
Sounds very much like a kernel bug, since RAID volumes are assembled in kernel space without the need for userspace tools (as long as you use md superblocks or specify the device layouts as kernel parameters)
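For illustration, the two in-kernel assembly paths mentioned here look roughly like this in grub.conf (the kernel image name and partition names are assumptions, not taken from the reporter's actual setup):

```
title Gentoo Linux 2.6.10-gentoo-r1
root (hd0,0)
# With persistent md superblocks on partitions of type 0xfd, the kernel
# autodetects and assembles the array, so root= alone is enough:
kernel /bzImage root=/dev/md2
# Alternatively, spell the layout out explicitly as a kernel parameter:
# kernel /bzImage root=/dev/md2 md=2,/dev/hda3,/dev/hdc3
```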
Evil... we had a similar-ish report about reiserfs corruption in 2.6.9. /me writes to namesys
Similar to bug #73145
I have been trying to investigate this problem for a while today. I am using vmware to simulate a similar configuration, that is, with 2 virtual drives on 2 IDE channels and a RAID 1 volume built on top of that setup. It seems strange, but the 2.6.10-gentoo-r1 kernel works with that setup perfectly well no matter what I try to do with the config - both with udev and devfs. It seems that the problem I encountered yesterday must have something to do with the low-level drivers, not the soft RAID itself. On that system I have a Promise PDC20276 controller but I do not use the native RAID. I disabled that in the BIOS and I am using that controller like a regular IDE device. Possibly the problem is here?
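A test setup like the one described can be reproduced roughly as follows (the device names, and the use of mdadm rather than the older raidtools, are my assumptions):

```shell
# Mirror two partitions that sit on different IDE channels,
# matching the two-drive RAID 1 layout described above.
mdadm --create /dev/md2 --level=1 --raid-devices=2 /dev/hda3 /dev/hdc3
mkreiserfs /dev/md2    # put reiserfs on top, as in the original report
```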
You are definitely using reiserfs within vmware? Also, we really need to confirm that no gentoo patches are causing this; after all, we do currently add multipath support... We really need to get a clean 2.6.10 tested, but I understand the risk of data loss means you might not want to try this.
Now I suspect that the reason may be the plug and play support added recently to the kernel. I turned it on by mistake while doing the oldconfig. Possibly this feature succeeds in turning the Promise RAID on, and then the Promise BIOS tries to rebuild the array simultaneously with the md driver. I will try to check that later, but I have to take that box off the network first and back up everything, in order to be able to restore it fast in case I am wrong. I will let you know.
Thanks, your efforts are much appreciated
It seems that my previous hypothesis was right. I rebuilt the 2.6.10-gentoo-r1 kernel, but this time I took my old config verbatim and then double-checked that the plug & play support was disabled. Now everything works OK (including udev, which I reinstalled earlier). Two RAID controllers working on the same volume explain the fs corruption. I wish I were able to record the kernel messages - that should confirm an attempt to enable the Promise BIOS - but unfortunately I do not see how; there is no way to save anything under those circumstances. Possibly taking a picture with a camera would do the job. So it seems that what I thought was a bug is actually a feature, albeit a bit dangerous one :-)
dmesg and /var/log/messages, for instance, are your friends. Forward this to the linux kernel mailing list?
Sure but only if the system has a chance to save them...
If you can create a boot CD using the bogus kernel, you can bring up a live filesystem alongside the one on the raid array that gets trashed. Or, if you have a null-modem cable, you can redirect console output to another PC (serial port).
dmesg can't be run during a kernel panic, and /var/log/messages won't be written to when the kernel fails to mount the root device. Taking a picture with a camera will be fine, if that is an option. Alternatively, you could set up a serial console, where the kernel messages are printed out over a serial cable connected to another computer which logs them. Or you could just transcribe the bits you find to be relevant. Which exact kernel option did you deduce was the cause of this?
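The serial console suggested here needs only kernel parameters; a rough grub.conf sketch (the image name, root device, and baud rate are assumptions):

```
# Send kernel messages to the first serial port as well as the screen.
kernel /bzImage root=/dev/md2 console=ttyS0,115200n8 console=tty0
```

On the receiving machine, something like `screen /dev/ttyS0 115200` or minicom can then capture the boot output to a file.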
Thank you for your suggestions on how to record the boot messages. I will try to get a serial cable. As for the kernel features that in my opinion caused the fs corruption, they were: Device Drivers --> Plug and Play Support and Device Drivers --> Plug and Play Support --> ACPI Plug and Play support
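For anyone checking their own config, those menu entries should correspond to roughly these .config symbols (names based on 2.6.10-era Kconfig; double-check against your own tree):

```
# CONFIG_PNP is not set
# CONFIG_PNPACPI is not set
```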
I don't know how interesting/vital those messages will be (only you know what you are seeing...), but if it's too much trouble then I wouldn't worry too much right now. Could you please file a bug upstream about this (http://bugzilla.kernel.org)? Although in your situation enabling PnP (and hence enabling hw RAID) would not have produced the desired results, I think the fact that it caused proper filesystem corruption means this can be classed as a bug. After opening the bug, please update the URL field here with the URL of your bug report. Thanks.
In my opinion there is no point in filing this as a bug at kernel.org or anywhere else. The plug and play feature did exactly what it was expected to do, that is, it configured and turned on the hardware RAID driver. I admit that was my fault - I should have expected that and checked my config more carefully. I do not see what could be done about it, unless we want to make Windows out of Linux - ask about 10 times if the user really wants to do what he/she is trying to do, and then do something else anyway :-)