17974 – reiserfs - boot fails with incorrect report of corrupt partition

Summary: reiserfs - boot fails with incorrect report of corrupt partition

Status:	RESOLVED FIXED

Alias:	None

Product:	Gentoo Linux
Classification:	Unclassified
Component:	[OLD] Core system
Hardware:	All Linux

Importance:	High critical
Assignee:	Brandon Low (RETIRED)

URL:
Whiteboard:
Keywords:

Duplicates (1):	18057
Depends on:
Blocks:

Reported:	2003-03-21 23:32 UTC by Phil Almquist
Modified:	2003-04-17 13:32 UTC
CC List:	7 users

See Also:
Package list:
Runtime testing required:	---


Description Phil Almquist 2003-03-21 23:32:52 UTC

The default action of the fsck command used in checkfs with the new reiserfs tools
is not to check the drive automatically but to ask the user whether they are sure
they want to fsck.  The command used in checkfs, fsck -C -R -A -a, prints a
warning, asks the user whether they want to proceed, and then exits immediately
without waiting for an answer.  It is easy to notice from a shell that this
command isn't working, but the init scripts hide the error and only report that
checkfs failed and that the reiserfs partition has errors that need to be fixed
(errors that in fact do not exist but are caused by a faulty script).

Reproducible: Always
Steps to Reproduce:
1. emerge the latest baselayout and reiserfstools
2. boot the computer or run 'fsck -C -R -A'

Actual Results:  
Boot failure and a message telling me the partition is corrupt.

Expected Results:  
A successful boot, because there is no real corruption.

Comment 1 Phil Almquist 2003-03-21 23:49:52 UTC

Correction: 'fsck -C -R -A' asks for confirmation and quits (a separate bug), but 'fsck -C -R -A -a', the command in checkfs, produces different output, which is still wrong and ends in errors.



root@wlan0 k0te # fsck -C -R -A -a
fsck 1.32 (09-Nov-2002)
Reiserfs super block in block 16 on 0x302 of format 3.6 with standard journal
Blocks (total/free): 18900/10147 by 4096 bytes
Filesystem is cleanly umounted
 
flush_buffers: device is not specifed
Warning... fsck.reiserfs for device /dev/hda2 exited with signal 6.
root@wlan0 k0te #




root@wlan0 k0te # cat /etc/fstab
/dev/hda1               /mnt/win        ntfs            noauto,ro               0 0
/dev/hda2               /boot           reiserfs        noauto,noatime,notail   1 1
/dev/hda3               /               reiserfs        noatime                 0 0
/dev/cdroms/cdrom0      /mnt/cdrom      iso9660         noauto,ro               0 0
/dev/loop0              /mnt/crypt      reiserfs        noauto,noatime,loop     0 0
/dev/loop2              none            swap            noauto,sw,loop          0 0
proc                    /proc           proc            defaults                0 0
tmpfs                   /dev/shm        tmpfs           noauto                  0 0
root@wlan0 k0te #
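Only /dev/hda2 appears in the fsck output above because, per fsck(8), fsck -A walks /etc/fstab and skips entries whose sixth field (fs_passno) is 0, while -R additionally skips the root filesystem. Note that noauto does not exempt an entry: the noauto /boot partition was still checked. A rough sketch of that selection in plain sh, assuming only those two rules (the function name fsck_candidates is hypothetical, and real fsck also orders checks by pass number and applies further skip rules):

```shell
#!/bin/sh
# Hypothetical helper approximating which fstab entries `fsck -A -R` checks.
# Reads fstab-format lines on stdin and prints one line per selected entry.
fsck_candidates() {
    while read -r dev mnt fstype opts freq pass; do
        # Skip blank lines and comments.
        case $dev in ''|'#'*) continue ;; esac
        # fs_passno 0 (or missing) means "never check".
        [ "${pass:-0}" -gt 0 ] || continue
        # -R: skip the root filesystem.
        [ "$mnt" = / ] && continue
        echo "$dev ($fstype, pass $pass)"
    done
}

# The fstab from this comment (whitespace condensed):
fsck_candidates <<'EOF'
/dev/hda1           /mnt/win    ntfs     noauto,ro              0 0
/dev/hda2           /boot       reiserfs noauto,noatime,notail  1 1
/dev/hda3           /           reiserfs noatime                0 0
/dev/cdroms/cdrom0  /mnt/cdrom  iso9660  noauto,ro              0 0
/dev/loop0          /mnt/crypt  reiserfs noauto,noatime,loop    0 0
/dev/loop2          none        swap     noauto,sw,loop         0 0
proc                /proc       proc     defaults               0 0
tmpfs               /dev/shm    tmpfs    noauto                 0 0
EOF
```

Run against that fstab, the sketch prints /dev/hda2 (reiserfs, pass 1), the same device fsck.reiserfs was started on.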

Comment 2 Wojciech Milkowski 2003-03-22 05:51:02 UTC

It happened to me after I emerged reiserfsprogs-3.6.5. The older version 3.6.4(-r1) works fine.

Comment 3 Seth Chandler 2003-03-23 03:44:23 UTC

I can confirm this. I don't believe it's a kernel problem, however...

I'm going to cc x86-kernel on it and assign it to Azarah, as I think it has something to do with our init scripts...

It claims the filesystem was unmounted cleanly, but errors out anyway... I think it's because fsck returns a nonzero value (6).

Az, let me know what you think; I haven't had time to look into it more deeply.
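The 6 in Comment 1's "exited with signal 6" is a signal number rather than an exit code: on Linux, signal 6 is SIGABRT, which a program typically raises through abort() (for example from a failed assert()). A minimal sh sketch of how a caller can tell the two cases apart; the child shell that sends SIGABRT to itself merely stands in for a crashing checker and is not how fsck actually runs reiserfsck:

```shell
#!/bin/sh
# Signal 6 is SIGABRT on Linux; `kill -l` maps the number to its name.
kill -l 6

# Stand-in for a crashing checker: the child shell sends SIGABRT to itself.
sh -c 'kill -ABRT $$'
status=$?

# bash and dash report "killed by signal N" as exit status 128 + N.
if [ "$status" -gt 128 ]; then
    sig=$((status - 128))
    echo "child was killed by signal $sig (SIG$(kill -l "$sig"))"
else
    echo "child exited normally with status $status"
fi
```

Under bash or dash the child's status is 134 (128 + 6). Other shells may encode signals differently (ksh93, for instance, uses 256 + N).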

Comment 4 Martin Schlemmer (RETIRED) gentoo-dev 2003-03-23 04:08:15 UTC

Right, but it looks like a reiserfs tools bug, not a baselayout one.  There
are standard error values that all fsck-compatible tools should follow.  So
if only the newer version behaves like this, it is a bug... no?

Passing the buck, as I do not have reiserfs partitions to test.
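For reference, the standard values in question are the fsck(8) exit codes, which form a bit mask. A small sh sketch of decoding them (the function name describe_fsck_status is hypothetical; the values are those listed in the fsck(8) manual page) shows how a script could distinguish a checker that failed to run properly (8, operational error) from genuine uncorrected filesystem damage (4):

```shell
#!/bin/sh
# Hypothetical helper: print the meaning of an fsck(8) exit status.
# The status is a bit mask, so several conditions can be set at once.
describe_fsck_status() {
    status=$1
    if [ "$status" -eq 0 ]; then
        echo "no errors"
        return 0
    fi
    [ $((status & 1)) -ne 0 ] && echo "filesystem errors corrected"
    [ $((status & 2)) -ne 0 ] && echo "system should be rebooted"
    [ $((status & 4)) -ne 0 ] && echo "filesystem errors left uncorrected"
    [ $((status & 8)) -ne 0 ] && echo "operational error"
    [ $((status & 16)) -ne 0 ] && echo "usage or syntax error"
    [ $((status & 128)) -ne 0 ] && echo "shared-library error"
    return 0
}

# Usage sketch: decode the status of a boot-time check, e.g.
#   fsck -C -R -A -a; describe_fsck_status $?
describe_fsck_status 12
```

Here describe_fsck_status 12 prints both the "errors left uncorrected" and the "operational error" lines, since 12 = 4 + 8.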

Comment 5 Seth Chandler 2003-03-23 04:22:23 UTC

Should we mask this in the meantime? -x86 or something?

Comment 6 Martin Schlemmer (RETIRED) gentoo-dev 2003-03-23 05:33:23 UTC

If the 'problems' that the new version fixes are less dramatic than
this issue, then mask it in package.mask.

Comment 7 Seth Chandler 2003-03-23 06:03:58 UTC

The ChangeLog entry is minor, a couple of fringe fixes... I'd rather not have all our reiser
users be prompted to hit Control-D at startup...

I've masked it in package.mask... lolo, take a look at this thing... AFAIK it's an upstream
problem...

And reiser sucks at support sometimes... they ain't got a bugzilla... :/
 
seth

Comment 8 Seth Chandler 2003-03-23 16:33:10 UTC

*** Bug 18057 has been marked as a duplicate of this bug. ***

Comment 9 Edwin Cremer 2003-03-24 07:08:46 UTC

IMHO the problem is that reiserfsck needs a stdin device while running, because of
"flush_buffers: device is not specifed"

I deleted the /sbin/fsck.reiserfs link and created a script in its place that starts reiserfsck with a stdin handle:
---- snip -------
#!/bin/sh
# Replacement for the /sbin/fsck.reiserfs link: run reiserfsck quietly,
# passing through all of fsck's arguments, with stdin attached to /dev/null.
exec /sbin/reiserfsck -q "$@" </dev/null
---- snap -------
This works very well for me, and it still checks the partitions.

Remember,
this is my first bug report and my English is very bad, but I hope I can help.

GenToo rulz !!

Comment 10 Brandon Low (RETIRED) gentoo-dev 2003-03-24 22:02:16 UTC

Azarah: what do you think about this? There is another related issue that just bit me: devfsd doesn't get started because rm -f /dev/.devfsd fails in the init scripts. This causes the same exit 6 from reiserfsprogs... I'm trying to figure out why rm -f /dev/.devfsd doesn't work; I wonder if it relates to having the kernel automount it...

Comment 11 Brandon Low (RETIRED) gentoo-dev 2003-03-24 22:27:42 UTC

Hmm... with reiserfsprogs-3.6.5 I am definitely able to boot as long as everything else (awk, devfsd, etc.) is working correctly... puzzling.

Comment 12 Max Kalika (RETIRED) gentoo-dev 2003-03-25 12:33:53 UTC

This seems to be an awk problem.  The offending line in /sbin/rc is:

awk '($3 == "devfs") { print "yes"; exit 0 }' /proc/mounts

This doesn't seem to work with gawk-3.1.2 (-r1?) installed.  3.1.1-r1 works fine.


BTW: azarah: I noticed on line 113 of /sbin/rc you use the "clear" command which won't be found if /usr is a separate filesystem (clear is /usr/bin/clear) because filesystems other than / haven't been mounted yet.

Comment 13 Max Kalika (RETIRED) gentoo-dev 2003-03-25 14:33:43 UTC

Allow me to clarify the above awk statement. Here's what I did:

valkyrie init.d # emerge -s gawk
Searching...
[ Results for search key : gawk ]
[ Applications found : 1 ]

*  sys-apps/gawk
      Latest version available: 3.1.2-r2
      Latest version installed: 3.1.2-r2
      Size of downloaded files: 1,956 kB
      Homepage:    http://www.gnu.org/software/gawk/gawk.html
      Description: GNU awk pattern-matching language


valkyrie init.d # cat /proc/mounts
rootfs / rootfs rw 0 0
/dev/root / reiserfs rw,noatime 0 0
none /dev devfs rw 0 0
proc /proc proc rw 0 0
tmpfs /mnt/init tmpfs rw 0 0
/dev/hda5 /usr reiserfs rw 0 0
/dev/hda6 /var reiserfs rw,noatime 0 0
/dev/hda7 /tmp reiserfs rw,nosuid 0 0
/dev/hda8 /home reiserfs rw,noatime 0 0
/dev/hdb1 /home/data/media reiserfs rw,noatime 0 0
tmpfs /dev/shm tmpfs rw 0 0
none /proc/sys/fs/binfmt_misc binfmt_misc rw 0 0

valkyrie init.d # awk '($3 == "devfs") { print "yes"; exit 0 }' /proc/mounts
valkyrie init.d #


now if I emerge gawk-3.1.1-r1 ... (emerge --oneshot =gawk-3.1.1-r1)


valkyrie init.d # awk '($3 == "devfs") { print "yes"; exit 0 }' /proc/mounts
yes
valkyrie init.d #


Something seems funky here. :-)
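The /sbin/rc check only asks whether any line of /proc/mounts has devfs as its third field (the filesystem type), so it can also be written without awk. A minimal sketch in plain sh, offered only as an illustration of avoiding an awk dependency during early boot, not as what baselayout actually did; the function name devfs_mounted is hypothetical:

```shell
#!/bin/sh
# Hypothetical helper: succeed if any mounts-format line read on stdin
# has "devfs" as its third field, like the awk one-liner in /sbin/rc.
devfs_mounted() {
    while read -r dev mnt fstype rest; do
        if [ "$fstype" = devfs ]; then
            return 0
        fi
    done
    return 1
}

# Usage, mirroring the /sbin/rc check (prints "yes" only if devfs is mounted):
if [ -r /proc/mounts ] && devfs_mounted < /proc/mounts; then
    echo yes
fi
```

Fed the /proc/mounts listing above, which contains "none /dev devfs rw 0 0", the function succeeds regardless of the installed gawk version.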

Comment 14 herring 2003-03-25 18:48:45 UTC

>This seems to be an awk problem.

Well, when I first encountered the reiserfs "boot-failure" I was using
sys-apps/gawk-3.1.1-r1 (I'm still using it, never been higher/downgraded)

At least the fsck error went away after editing /etc/checkfs to add -s:

fsck -C -R -A -a -s

Comment 15 Martin Schlemmer (RETIRED) gentoo-dev 2003-03-30 15:11:55 UTC

The gawk problem is a totally different problem (fixed in gawk-3.1.2-r3).

Comment 16 Martin Schlemmer (RETIRED) gentoo-dev 2003-03-30 15:15:34 UTC

Phil, are you using a serial console or something?

Comment 17 Phil Almquist 2003-03-31 11:14:21 UTC

No serial console or anything I'd consider abnormal; I'll post the kernel config later when I can get to it. Whatever is causing this seems to be fixed by going back a version of reiserfsprogs; only the masked version is giving any problems.

Comment 18 Robin Johnson archtester 2003-04-01 12:35:55 UTC

Here is a fix for the "flush_buffers: device is not specifed" bug from the reiserfs mailing list.

It is also integrated into reiserfsprogs 3.6.6-pre.
ftp://ftp.namesys.com/pub/reiserfsprogs/reiserfsprogs-3.6.5-flush_buffers-bug.patch

Comment 19 Vitaly Fertman 2003-04-01 12:57:16 UTC

It seems that I have fixed the problem in reiserfsprogs. It turned out that
when reiserfsck is launched from "fsck -A -a", stdin was not opened, so the
first open call returned file descriptor 0. And there was a check that the file
descriptor cannot be 0. Could you try the patch:
ftp.namesys.com/pub/reiserfsprogs/reiserfsprogs-3.6.5-flush_buffers-bug.patch

This thread actually helped me to localise the bug, but it was pointed out to me
only yesterday. Could you cc reiserfs-list@namesys.com next time a
problem concerns reiserfs/reiserfsprogs?

Thank you.
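This explanation rests on the POSIX rule that open() returns the lowest-numbered descriptor not currently in use. A minimal Linux-only illustration (it relies on /proc and ls, nothing from reiserfsprogs): the second listing has one entry fewer than the first, because with stdin closed the directory handle ls opens is itself assigned descriptor 0, the same situation that tripped reiserfsck's "descriptor cannot be 0" check. Redirecting stdin from /dev/null, as in Comment 9's wrapper, keeps descriptor 0 occupied.

```shell
#!/bin/sh
# Linux-only illustration (relies on /proc): descriptors are allocated
# lowest-free-first, so a process started with stdin closed receives
# descriptor 0 from its first open().

echo "stdin open:"
ls /proc/self/fd          # 0 1 2, plus the directory handle ls opened (e.g. 3)

echo "stdin closed:"
ls /proc/self/fd <&-      # the directory handle itself now occupies 0
```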

Comment 20 Brandon Low (RETIRED) gentoo-dev 2003-04-01 13:17:14 UTC

Sorry for not having you on the bug in the first place. I was actually waiting until I could take some time to reproduce this on my own machine before sending you folks at namesys an e-mail about it.

I have added a new ebuild for reiserfsprogs to gentoo's unstable profile with that patch applied, anyone on this bug who has had problems please test that and let us know how it goes.

Comment 21 Sean Groarke 2003-04-04 04:29:34 UTC

This morning I emerged reiserfsprogs-3.6.5-r1.ebuild and the problem is very much still there. Frightened the life out of me...! Fell back to 3.6.4-r1 and all OK again. I think 3.6.5 should be fully masked out until we have a proven fox for this.

Comment 22 Sean Groarke 2003-04-04 04:31:26 UTC

"proven fox". Err, I mean "fix"...!

Comment 23 Robin Johnson archtester 2003-04-04 05:51:34 UTC

Sean:
I have reiserfsprogs-3.6.5-r1.ebuild emerged and I can't reproduce it now on my work testing box.

I even just turned the power off to see whether it came up right, and it did, without any problems or requiring any other input. It worked fine for me. My gawk here is 3.1.1-r2, so that may be the reason.

Comment 24 Phil Almquist 2003-04-04 08:30:33 UTC

After leaving my laptop at school for the night I have it back, and everything seems to be working perfectly with the new patched reiserfsprogs (3.6.5-r1), although I also updated some other things, including gawk, before the reboot.  Fsck appears to be running as it should and I'm not receiving any more errors; it looks resolved from my end.

Comment 25 Sean Groarke 2003-04-04 15:47:35 UTC

Hmmmm. OK, let's re-try it:

- machine boots OK with 3.6.4-r1.
- emerge (again!) 3.6.5-r1
- machine fails to boot - Hey, let's just make sure we're talking about the same fault here! What I see on the console (typed in, not screen captured) is:
.
.
.
* Checking root filesystem...
fsck 1.32 (09-Nov-2002)
reiserfs_open_journal: journal parameters from super block does not match to journal parameters from journal.

Either make journal partition available or use --no-journal-available
If you have the standard journal or if your partition is available
and you specified it correctly, you must run rebuild-sb.
* Filesystem couldn't be fixed :(

Give root passwd for maintenance:
.
.
.

- re-emerge 3.6.4-r1
- boots normally (no errors at all)

What else can I tell you about the machine that exhibits the fault?

- gawk = 3.1.1-r2
- baselayout = 1.8.6.4-r1
- devfsd = 1.3.25-r3

Note that the machine itself is not full 1.4_rcX: the make.profile is default-1.0-gcc3.

The relevant part of /etc/fstab is:
	/dev/hda1	/	reiserfs	noatime	0 0


So the fault is there and fully reproducible, but I'm not clear what other info is pertinent... Let me know if I can supply any more info.

Comment 26 Brandon Low (RETIRED) gentoo-dev 2003-04-05 16:00:55 UTC

Please check that partition using both reiserfsprogs-3.6.4 and 3.6.5 from a maintenance console (that is either a liveCD or init 1).  Post the complete output of BOTH checks.  I believe that what you are seeing here is simply pre-existing corruption of your superblock that was not noticed before, because versions prior to 3.6.5 did NO CHECKING AT ALL at boot time.

Comment 27 Sean Groarke 2003-04-06 10:56:06 UTC

Brandon: I just did a check from the liveCD I had handy (which is actually reiserfs 3.6.3, so not 3.6.4 or .5) and the same partition is as clean as a whistle - nothing at all.

I'm going to do a little more digging on the machine in question, as there is clearly something amiss with the 3.6.5 ebuild, at least on that machine...

Comment 28 David Arias 2003-04-13 20:49:16 UTC

I am having the same problem.

ReiserFS Progs ver. 3.6.5-r1

I just built a new Gentoo system, created all the fs' correctly, /etc/fstab's correct...

On the first boot AFTER installing from one of the latest experimental 1.4_rc4 live CDs [http://cvs.gentoo.org/~livewire/livecd-experimental-rc4-04-10-03.iso], system bootup fails after fsck.reiserfs exits with "signal 6" on "/dev/sdb3", which is my root filesystem... Since this is a new install on previously functional drives, I don't think I have persistent superblock corruption :)

Just finished emerging 3.6.4-r1, rebooted with it, and had the same problem.

Any ideas, or if you need information from me, please let me know. I'm eager to be up and running. :)

Comment 29 Brandon Low (RETIRED) gentoo-dev 2003-04-17 13:32:20 UTC

Sean: Vitaly from reiserfs had this to say:

A new check for journal parameters was added, which helps to
recover filesystems with a relocated journal and in a few other cases. These parameters
were not stored correctly by old kernels, before relocated journals appeared.
This is nothing to worry about, and you can fix it with the help of fsck --rebuild-sb:
when asked, tell it to fix the parameters in the journal using the ones from the super block.

I'm closing this bug on that note. If there are still problems, please open a new bug with a new summary and all that jazz, 'cause this one is getting hard to follow.