Summary: | checkfs hangs on boot | |
---|---|---|---
Product: | Gentoo Linux | Reporter: | Roger <rogerx.oss>
Component: | [OLD] baselayout | Assignee: | Gentoo's Team for Core System packages <base-system>
Status: | VERIFIED NEEDINFO | |
Severity: | critical | |
Priority: | High | |
Version: | 2006.0 | |
Hardware: | x86 | |
OS: | Linux | |
Whiteboard: | | |
Package list: | | Runtime testing required: | ---
Description
Roger
2006-03-10 20:08:06 UTC
How big are the partitions? reiserfsck doesn't output a status bar, so it may *look* like it's hung up, but it isn't really.

> Since I'm using reiserfs, I also found it redundant of checkfs to be checking a
> journaled filesystem, so I've commented out the lines within /etc/init.d/checkfs
> invoking fsck, and my problems have completely subsided.

Just because they're journaled doesn't make them perfect.
26G, 112G, and 112G respectively. I routinely used to have to do an fsck manually after booting with a Gentoo install CD. The fsck usually takes no more than 5 minutes; the hangs I'm encountering are endless. One thought that crossed my mind: is it possible more than one fsck process is running at the same time? I've already ruled out SMP and preempt, along with other possibilities. I've never had this issue using reiserfsck manually, just with checkfs on boot (i.e. `fsck -C -T -R -A -a`). I like to be safe too. Who knows, maybe more of the WD drives are dropping.

> The fsck usually takes no more than 5 minutes.

Sure, if reiserfsck detects no problems. If there are problems, or it thinks it should do more than a journal check, it'll take a hell of a lot longer. I know because I've got a 200G reiserfs partition that most of the time pauses only briefly at boot to fsck, but when I had a lockup it did a full fsck, and it took quite a while to finish with no visible clue that it was actually doing anything. AFAIK, reiserfsck offers no progress bar like e2fsck's.

> The hangs I'm encountering are endless.

It's all in your mind :P Try hitting CTRL+C at boot and it should ask you to enter your root password. If you then run the fsck command in the background (`fsck -C -T -R -A -a &`), you should be able to use fuser, lsof, or strace to see what each process is doing.

> One thought that crossed my mind: is it possible more than one fsck
> process is running at the same time?

Yes and no. If you read the fsck/fstab man pages, you'll see that fsck runs in parallel on different disks, but not on partitions of the same disk. So in your case you would get three fscks running at once, since hde/hdf/hdh are different physical drives.

> I've never had this issue using reiserfsck manually, just with checkfs on
> boot (i.e. `fsck -C -T -R -A -a`).

fsck simply executes reiserfsck.

> I like to be safe too. Who knows, maybe more of the WD drives are dropping.

You could always check your SMART status. I used that to detect pre-failures on one of my drives (managed to get all my data off just before it finally bit the big one). For example: `emerge ide-smart && ide-smart /dev/hde`

>> One thought that crossed my mind: is it possible more than one fsck
>> process is running at the same time?
> Yes and no. If you read the fsck/fstab man pages, you'll see that fsck runs
> in parallel on different disks, but not on partitions of the same disk. So in
> your case you would get three fscks running at once, since hde/hdf/hdh are
> different physical drives.

This is where I think I'm having the problem. For some reason, a temp file was written someplace and never deleted, telling checkfs to run fsck on two partitions (omitting the third). The main issue seems to lie in the possibility of the kernel freaking out when trying to run fsck on more than two partitions at the same time.

> This is where I think I'm having the problem. For some reason, a temp file
> was written someplace and never deleted, telling checkfs to run fsck on two
> partitions (omitting the third).

Must be from fsck/reiserfsck itself, as I don't believe the init.d script does anything of that sort.

> The main issue seems to lie in the possibility of the kernel freaking out
> when trying to run fsck on more than two partitions at the same time.

I've never had problems running multiple ext3 checks in parallel, but I only have one reiserfs partition, so I can't vouch for that. It shouldn't be an issue, though.

>> The main issue seems to lie in the possibility of the kernel freaking out
>> when trying to run fsck on more than two partitions at the same time.
> I've never had problems running multiple ext3 checks in parallel, but I only
> have one reiserfs partition, so I can't vouch for that. It shouldn't be an
> issue, though.

I also get a display during checkfs showing both drives supposedly being checked. I'm guessing it's pretty safe to presume that fsck running in parallel is a very good place to start with this, unless anybody else can think of anything. When I get a chance, I'll pull one drive to prevent checkfs from running more than one instance. (Previously, one or both of these drives were ext3 until just recently.)

> I also get a display during checkfs showing both drives supposedly being
> checked. I'm guessing it's pretty safe to presume that fsck running in parallel
> is a very good place to start with this.
And then what? Does your computer lock up, or do you just think the fsck processes hang? I said that running a real fsck on a reiserfs partition (especially ones as big as yours) will take a while, regardless of whether they're running in parallel.
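As a side note on the "fsck runs in parallel on different disks" point above: `fsck -A` reads the sixth field of each /etc/fstab entry (the passno) to decide what to check and in which pass. A minimal sketch of that selection logic, using a throwaway copy of an fstab; the device names and mount points here are hypothetical examples modeled on the hde/hdf/hdh drives from this thread, not the reporter's actual fstab:

```shell
# Write a hypothetical fstab to a temp file so nothing real is touched.
cat > /tmp/fstab.example <<'EOF'
/dev/hde1  /        reiserfs  noatime  0 1
/dev/hdf1  /data1   reiserfs  noatime  0 2
/dev/hdh1  /data2   reiserfs  noatime  0 2
/dev/hdh2  /scratch reiserfs  noatime  0 0
EOF

# Field 6 is the fsck passno: 0 means never checked at boot, 1 is the
# root filesystem (checked first, alone), and 2+ are checked in later
# passes -- in parallel only across different physical disks.
awk '$6 > 0  { print $1, "checked in pass", $6 }
     $6 == 0 { print $1, "skipped" }' /tmp/fstab.example
```

So with three drives at passno 2, fsck would launch three checks at once, which is exactly the parallel situation being debated here.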
Appears to lock up. At times, a kernel oops. No matter, though, because I'll go off and take lunch and still find the system 15-30 minutes later in the same hung state. So what I first started doing is booting with the Gentoo install CD and manually running fsck on the reiserfs partitions, then cleanly unmounting the partitions before trying to boot the system again. On boot, checkfs would still want to fsck those same cleanly unmounted partitions (which leads me to guess the Gentoo baselayout keeps a stale file around marking the partitions as uncleanly unmounted; if I knew what was causing this, I could troubleshoot further). The fsck for reiserfs will print status info to stdout, even under checkfs; I know because it sometimes does make it that far. I'm just reporting this issue after looking over the checkfs script, Google, and the bugs here. Maybe somebody else will stumble upon this. One thing I have yet to try that might resolve all this is giving my computer a nice big hug.

> Appears to lock up.

Yes, but can you CTRL+C it?

> At times, a kernel oops.

Kernel bug then, and I can wash my hands of this :)

> checkfs would still want to fsck those same cleanly unmounted partitions (which
> leads me to guess the Gentoo baselayout keeps a stale file around marking the
> partitions as uncleanly unmounted; if I knew what was causing this, I could
> troubleshoot further).

Gentoo has no such file; we simply run `fsck`. Try setting the 6th field to 0. This will tell fsck to skip checking the partitions. Then, once you log in, see what happens if you run reiserfsck on the partitions yourself. Watch `dmesg` as well; reiserfs is verbose there.

Got it, and will do. I have a couple of other ideas to try here as well. From what I'm seeing now, it's either a bug in the kernel and/or reiserfs-3. It's doubtful these hard drives are failing, because manually running fsck on these reiserfs partitions as individual processes results in no problems. Also to note, the drives are attached to a Promise ATA/100 PCI card, so there's an additional kernel-level driver to look at. Additionally, thank you for your time. :-)

We're going to need more info here.

Resolved by yanking six of my Western Digital hard drives and returning them to WD. (Granted, they refused to honor a replacement and gave a crappy discount on a new drive in exchange.) Got a Seagate SATA2 300GB and have never seen this bug again. From what I'm seeing, one drive was fried, another seems to have bad blocks, and the other four are questionable; or the problem was a known bug with the Promise ATA100 PCI controller card, since using SMART is buggy with Promise cards and results in drive crashes. This bug may have been an additional issue. I'm using these WD drives in an old box as a MythTV unit. I've yet to see this bug there; however, the drives may not be connected in the same sequential order. Probably OK to close this bug, especially since I don't see anybody else with this problem posting here.
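For anyone landing here later: the "set the 6th field to 0" suggestion in this thread refers to the fsck passno column in /etc/fstab. A hedged illustration of the change (device names are hypothetical examples patterned on the drives discussed above, not a known-good config):

```
# /etc/fstab -- the 6th field (passno) controls boot-time fsck
/dev/hde1  /       reiserfs  noatime  0 1   # root fs: checked first (pass 1)
/dev/hdf1  /data1  reiserfs  noatime  0 0   # 0 = skipped by fsck -A at boot
/dev/hdh1  /data2  reiserfs  noatime  0 0   # 0 = skipped by fsck -A at boot
```

With the data partitions set to 0, the boot-time check is skipped entirely, and the partitions can then be checked by hand after login (e.g. `reiserfsck --check` on the unmounted device) while watching `dmesg`, as suggested above.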