Summary: | checkfs hangs on boot | |
---|---|---|---
Product: | Gentoo Linux | Reporter: | Roger <rogerx.oss>
Component: | [OLD] baselayout | Assignee: | Gentoo's Team for Core System packages <base-system>
Status: | VERIFIED NEEDINFO | |
Severity: | critical | |
Priority: | High | |
Version: | 2006.0 | |
Hardware: | x86 | |
OS: | Linux | |
Whiteboard: | | |
Package list: | | Runtime testing required: | ---
Description
Roger
2006-03-10 20:08:06 UTC
How big are the partitions? reiserfsck doesn't output a status bar, so it may *look* like it's hung up, but it isn't really.

> Since I'm using reiserfs, I also found it redundant of checkfs to be checking a
> journaled filesystem, so I've commented out the lines within /etc/init.d/checkfs
> invoking fsck, and my problems have completely subsided.

Just because they're journaled doesn't make them perfect.
26G, 112G, and 112G respectively. I routinely used to have to do an fsck manually after booting with a Gentoo install CD. The fsck usually takes no more than 5 minutes; the hangs I'm encountering are endless. One thought that crossed my mind: is it possible more than one fsck process is running at the same time? I've already ruled out SMP and preempt, along with other possibilities. I've never had this issue using reiserfsck manually, just with checkfs on boot (i.e. `fsck -C -T -R -A -a`). I like to be safe too. Who knows, maybe more of the WD drives are dropping.

> The fsck usually takes no more than 5 minutes.

Sure, if reiserfsck detects no problems. If there are problems, or it thinks it should do more than a journal check, it'll take a hell of a lot longer. I know because I've got a 200G reiserfs partition that most of the time pauses only briefly at boot to fsck, but when I had a lockup it did a full fsck, and it took quite a while to finish with no visible clue that it was actually doing anything. AFAIK, reiserfsck offers no progress bar like e2fsck's.

> The hangs I'm encountering are endless.

It's all in your mind :P Try hitting CTRL+C at boot and it should ask you to enter your root password. If you then run the fsck command in the background (`fsck -C -T -R -A -a &`), you should be able to use fuser, lsof, or strace to see what each process is doing.

> One thought that crossed my mind: is it possible more than one fsck
> process is running at the same time?

Yes and no. If you read the fsck/fstab man pages, you'll see that fsck runs in parallel on different disks, but not on partitions of the same disk. So in your case you would get three fscks running at once, since hde/hdf/hdh are different physical drives.

> I've never had this issue using reiserfsck manually, just with checkfs on
> boot (i.e. `fsck -C -T -R -A -a`).

fsck simply executes reiserfsck.

> I like to be safe too. Who knows, maybe more of the WD drives are dropping.

You could always check your SMART status. I used that to detect pre-failures on one of my drives (managed to get all my data off just before it finally bit the big one). For example: `emerge ide-smart && ide-smart /dev/hde`

>> One thought that crossed my mind: is it possible more than one fsck
>> process is running at the same time?
> Yes and no. If you read the fsck/fstab man pages, you'll see that fsck runs
> in parallel on different disks, but not on partitions of the same disk. So in
> your case you would get three fscks running at once, since hde/hdf/hdh are
> different physical drives.

This is where I think I'm having the problem. For some reason, a temp file was written someplace and never deleted, telling checkfs to run fsck on two partitions (omitting the third). The main issue seems to lie in the possibility of the kernel freaking out when trying to run fsck on more than two partitions at the same time.

> This is where I think I'm having the problem. For some reason, a temp file
> was written someplace and never deleted, telling checkfs to run fsck on two
> partitions (omitting the third).

Must be from fsck/reiserfsck itself, as I don't believe the init.d script does anything of that sort.

> The main issue seems to lie in the possibility of the kernel freaking out
> when trying to run fsck on more than two partitions at the same time.

I've never had problems running multiple ext3 checks in parallel, but I only have one reiserfs partition, so I can't vouch for that. It shouldn't be an issue, though.

>> The main issue seems to lie in the possibility of the kernel freaking out
>> when trying to run fsck on more than two partitions at the same time.
> I've never had problems running multiple ext3 checks in parallel, but I only
> have one reiserfs partition, so I can't vouch for that. It shouldn't be an
> issue, though.

I also get a display during checkfs showing both drives supposedly being checked. I'm guessing it's pretty safe to presume that fsck running in parallel is a very good place to start with this, unless anybody else can think of anything. When I get a chance, I'll pull one drive to prevent checkfs from running more than one instance. (Previously, one or both of these drives were ext3 until just recently.)

> I also get a display during checkfs showing both drives supposedly being
> checked. I'm guessing it's pretty safe to presume that fsck running in parallel
> is a very good place to start with this.
And then what? Does your computer lock up, or do you just think the fsck processes hang? I said that running a real fsck on a reiserfs partition (especially ones as big as yours) will take a while, regardless of whether they're running in parallel.
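As a side note on the "fsck runs in parallel on different disks" point above: `fsck -A` reads the sixth field of each /etc/fstab entry (the passno) to decide what to check and in which pass. A minimal sketch of that selection logic, using a throwaway copy of an fstab; the device names and mount points here are hypothetical examples modeled on the hde/hdf/hdh drives from this thread, not the reporter's actual fstab:

```shell
# Write a hypothetical fstab to a temp file so nothing real is touched.
cat > /tmp/fstab.example <<'EOF'
/dev/hde1  /        reiserfs  noatime  0 1
/dev/hdf1  /data1   reiserfs  noatime  0 2
/dev/hdh1  /data2   reiserfs  noatime  0 2
/dev/hdh2  /scratch reiserfs  noatime  0 0
EOF

# Field 6 is the fsck passno: 0 means never checked at boot, 1 is the
# root filesystem (checked first, alone), and 2+ are checked in later
# passes -- in parallel only across different physical disks.
awk '$6 > 0  { print $1, "checked in pass", $6 }
     $6 == 0 { print $1, "skipped" }' /tmp/fstab.example
```

So with three drives at passno 2, fsck would launch three checks at once, which is exactly the parallel situation being debated here.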
Appears to lock up. At times, a kernel oops. No matter, though, because I'll go off and take lunch and still find the system 15-30 minutes later in the same hung state. So what I first started doing is booting with the Gentoo install CD and manually running fsck on the reiserfs partitions, then cleanly unmounting the partitions before trying to boot the system again. On boot, checkfs would still want to fsck those same cleanly unmounted partitions (which leads me to guess the Gentoo baselayout keeps a stale file around marking the partitions as uncleanly unmounted; if I knew what was causing this, I could troubleshoot further). The fsck for reiserfs will print status info to stdout, even under checkfs; I know because it sometimes does make it that far. I'm just reporting this issue after looking over the checkfs script, Google, and the bugs here. Maybe somebody else will stumble upon this. One thing I have yet to try that might resolve all this is giving my computer a nice big hug.

> Appears to lock up.

Yes, but can you CTRL+C it?

> At times, a kernel oops.

Kernel bug then, and I can wash my hands of this :)

> checkfs would still want to fsck those same cleanly unmounted partitions (which
> leads me to guess the Gentoo baselayout keeps a stale file around marking the
> partitions as uncleanly unmounted; if I knew what was causing this, I could
> troubleshoot further).

Gentoo has no such file; we simply run `fsck`. Try setting the 6th field to 0. This will tell fsck to skip checking the partitions. Then, once you log in, see what happens if you run reiserfsck on the partitions yourself. Watch `dmesg` as well; reiserfs is verbose there.

Got it, and will do. I have a couple of other ideas to try here as well. From what I'm seeing now, it's either a bug in the kernel and/or reiserfs-3. It's doubtful these hard drives are failing, because manually running fsck on these reiserfs partitions as individual processes results in no problems. Also to note, the drives are attached to a Promise ATA/100 PCI card, so there's an additional kernel-level driver to look at. Additionally, thank you for your time. :-)

We're going to need more info here.

Resolved by yanking six of my Western Digital hard drives and returning them to WD. (Granted, they refused to honor a replacement and gave a crappy discount on a new drive in exchange.) Got a Seagate SATA2 300GB and have never seen this bug again. From what I'm seeing, one drive was fried, another seems to have bad blocks, and the other four are questionable; or the problem was a known bug with the Promise ATA100 PCI controller card, since using SMART is buggy with Promise cards and results in drive crashes. This bug may have been an additional issue. I'm using these WD drives in an old box as a MythTV unit. I've yet to see this bug there; however, the drives may not be connected in the same sequential order. Probably OK to close this bug, especially since I don't see anybody else with this problem posting here.
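For anyone landing here later: the "set the 6th field to 0" suggestion in this thread refers to the fsck passno column in /etc/fstab. A hedged illustration of the change (device names are hypothetical examples patterned on the drives discussed above, not a known-good config):

```
# /etc/fstab -- the 6th field (passno) controls boot-time fsck
/dev/hde1  /       reiserfs  noatime  0 1   # root fs: checked first (pass 1)
/dev/hdf1  /data1  reiserfs  noatime  0 0   # 0 = skipped by fsck -A at boot
/dev/hdh1  /data2  reiserfs  noatime  0 0   # 0 = skipped by fsck -A at boot
```

With the data partitions set to 0, the boot-time check is skipped entirely, and the partitions can then be checked by hand after login (e.g. `reiserfsck --check` on the unmounted device) while watching `dmesg`, as suggested above.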