Bug 219179 - baselayout2/openrc boot failure after applying git patch for bug #218063
Summary: baselayout2/openrc boot failure after applying git patch for bug #218063
Status: RESOLVED FIXED
Alias: None
Product: Gentoo Linux
Classification: Unclassified
Component: [OLD] baselayout
Hardware: AMD64 Linux
Importance: High critical
Assignee: Gentoo's Team for Core System packages
Reported: 2008-04-24 18:08 UTC by Steve Arnold
Modified: 2008-10-07 14:22 UTC (History)
Attachments
migrated rc.conf (rc.conf,4.55 KB, text/plain)
2008-04-27 18:41 UTC, Steve Arnold
Description Steve Arnold archtester gentoo-dev 2008-04-24 18:08:24 UTC
I was still getting the same errors as bug #218067 and bug #218063 after upgrading to baselayout2/openrc, so I applied the patch Roy mentions in comment #13 of the latter bug, since the shutdown issue results in file system corruption (and I hate corruption, especially of file systems)...

After applying the patch and rebuilding openrc, the affected services start and stop fine, i.e., everything seemed normal, but after rebooting the fun begins...

It won't boot at all.  It tries to create all kinds of directories while root is mounted read-only and fails miserably.  Now it only boots up to a root prompt without /usr, /var, or other important pieces (/ is a raid0 device, while everything else is lvm over raid).  The trouble starts before the boot log kicks in, and prior to entering runlevel 3.  Here's what I see on the console before it boots to (none) login:

Everything looks normal up through "processing uevents..." then the display resets and the next thing I see is:

(red)* waitpid: Interrupted system call

mkdir '/lib64/rc/init.d/starting': Read-only file system
mkdir '/lib64/rc/init.d/started': Read-only file system
mkdir '/lib64/rc/init.d/stopping': Read-only file system

...

mkdir '/lib64/rc/init.d/scheduled': Read-only file system

(green)* Device initiated services

mkdir '/lib64/rc/init.d/starting': Read-only file system
mkdir '/lib64/rc/init.d/started': Read-only file system
mkdir '/lib64/rc/init.d/stopping': Read-only file system

...

mkdir '/lib64/rc/init.d/scheduled': Read-only file system

(red)* rc: failed to create stopping dir '...rc.stopping': Read-only file system

Init Entering runlevel 3

mkdir '/lib64/rc/init.d/starting': Read-only file system
mkdir '/lib64/rc/init.d/started': Read-only file system
mkdir '/lib64/rc/init.d/stopping': Read-only file system

...

mkdir '/lib64/rc/init.d/scheduled': Read-only file system

(red)* rc: failed to create stopping dir '...rc.stopping': Read-only file system


Then the (none) login: prompt shows up, but immediately there's more spew from the file UberLord had me modify (sorry, can't remember which one) with "set -x" and "set +x" around the stuff from bug #218067 (not considered relevant here).
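To confirm from the rescue prompt that the root filesystem really is still read-only when these mkdir errors appear, something like the following shell sketch could be used. It assumes a Linux /proc/mounts in the usual six-field format; the helper name is_readonly_mount is made up for illustration and is not part of OpenRC.

```sh
#!/bin/sh
# Hypothetical helper, not part of OpenRC.
# Usage: is_readonly_mount MOUNTPOINT [MOUNTS_FILE]
# Returns 0 if the last entry for MOUNTPOINT carries the "ro" option,
# 1 otherwise (including when MOUNTPOINT is not listed at all).
is_readonly_mount() {
    local mountpoint="$1" mounts_file="${2:-/proc/mounts}" result=1
    local dev mp fstype opts rest
    while read -r dev mp fstype opts rest; do
        [ "$mp" = "$mountpoint" ] || continue
        # Later entries shadow earlier ones, so the last match wins.
        case ",$opts," in
            *,ro,*) result=0 ;;
            *)      result=1 ;;
        esac
    done < "$mounts_file"
    return "$result"
}
```

For example, `is_readonly_mount / && echo "/ is read-only"` prints the message only while / is mounted read-only.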
Comment 1 Roy Marples 2008-04-24 21:54:30 UTC
waitpid: Interrupted system call

The only thing I can think of is that udev is causing this as the special sysinit runlevel is not mounting /lib/rc/init.d as a memory disk. Did you try running a static dev as I asked on IRC?
Comment 2 Steve Arnold archtester gentoo-dev 2008-04-25 00:00:09 UTC
Not yet, just finished grading exams and now getting ready for class tonight...  I need to get to it soonest, but I have tons of other crap to do, so I'll probably only get to it in small increments before the weekend.

I also have several custom udev rules and hal .fdi files, so I still need to get udev working correctly again.  Do you have any idea which devices I need to have static?  Just my boot block devices and console?  What am I looking for once I get it to boot to a usable state?

Thanks...
Comment 3 Steve Arnold archtester gentoo-dev 2008-04-25 00:23:37 UTC
Okay, once I added some raid devices, it looks like I have a full complement of static devices.  I'll try setting rc_devices="static" and see how far it gets...
Comment 4 Steve Arnold archtester gentoo-dev 2008-04-26 04:37:23 UTC
Nope, setting it to use static devices doesn't work either; it made it up through module loading, but then barfed a bunch more directory errors, but this time continued all the way through runlevel 3, failing each service along the way.  By the time it got to the prompt, all the relevant errors were gone and my usb keyboard was non-functional.  Since neither the boot log nor messages had any new entries, I have nothing useful to post...
Comment 5 Roy Marples 2008-04-26 15:04:57 UTC
Could you try the openrc-9999 git ebuild please?
Also, remove /lib/rc/plugins/splash.so as that's what's clearing the screen even if splash itself is disabled on the kernel command line.
Comment 6 Steve Arnold archtester gentoo-dev 2008-04-26 16:12:13 UTC
This is getting weird...  I was in a chroot last night editing package.mask to go back to the current 0.2.2 ebuild from my patched -r1 but I didn't merge it yet, as I wanted to try once booting with rc_devices="" just to see what happens.  And it booted!  But only once, and then it went back to the previous behavior...

The one time it booted, I saw the "unlink" (cosmetic) errors, but it booted up fine.  When I looked at the console, I saw a bash error about a missing "]" on line 476 of /lib64/rc/sh/runscript.sh (I think that's the file) which only has about 200 lines.  I found a "]" in net.sh without a space in front of it on line 476 so I added one, rebooted and it never booted again.  :(

I just updated to the git ebuild, so I should be back shortly.
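The "missing ]" error described above is what the shell's [ builtin reports when its last argument is not a standalone ]. A small sketch of the difference follows; the function names and the comparison are made up for illustration and are not the actual line from net.sh.

```sh
#!/bin/sh
# Hypothetical example; not the real line 476 of net.sh.

# Correct: "]" is its own word, separated from "up" by a space.
is_up() {
    [ "$1" = "up" ]
}

# Broken: "up"] is parsed as the single word up], so [ never sees a
# closing "]". The shell prints a "missing ]" diagnostic and [ returns 2.
is_up_broken() {
    [ "$1" = "up"]
}
```

Because [ is an ordinary command, the script still parses; the failure only shows up when that line is actually executed.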
Comment 7 Steve Arnold archtester gentoo-dev 2008-04-26 16:57:38 UTC
And it does exactly the same thing as my previous post: I updated to the git ebuild, and it booted up fine, after one set of "mkdir" errors just like my first post:

mkdir '/lib64/rc/init.d/starting': Read-only file system

but without any of the others.  In this case, it continues through the boot process just fine.  Then I reboot and it goes right back to the original error, i.e., it only makes it to the "(none) login:" prompt as described above.

At this point, I'm at a loss as to what the cause of the problem is; about the only thing I can think of is that maybe something is happening too fast in the mount phase?

I'll try rebuilding with "debug" enabled and see what happens...
Comment 8 Steve Arnold archtester gentoo-dev 2008-04-26 17:11:23 UTC
Oh, and I don't have a plugins dir; I don't use any fancy splash stuff, but I do use the radeon framebuffer driver.  I had tried removing the video stuff from the grub config the other day, but it still resets the console when it loads the default font so there must be something else besides the splash plugin.
Comment 9 Roy Marples 2008-04-26 19:35:51 UTC
(In reply to comment #6)
> The one time it booted, I saw the "unlink" (cosmetic) errors, but it booted up
> fine.  When I looked at the console, I saw a bash error about a missing "]" on
> line 476 of /lib64/rc/sh/runscript.sh (I think that's the file) which only has
> about 200 lines.  I found a "]" in net.sh without a space in front of it on
> line 476 so I added one, rebooted and it never booted again.  :(

That sounds like a network script as it's so big. Anyway, I've pushed a fix so that the real script name is reported instead of runscript.sh, which will help you fix that error. Now that I think about it, a bad commit probably caused that, and it has since been fixed.

Still, the root cause is something in init.sh bailing early, and we need to stop the screen from clearing to find the error.
Comment 10 Steve Arnold archtester gentoo-dev 2008-04-27 00:30:16 UTC
Okay, it seems that if I rebuild the openrc package, then it will boot up correctly one time, and then fall back to the original error.  When it boots that first time, I get one set of these for each directory under init.d:

mkdir '/lib64/rc/init.d/starting': Read-only file system

then the mounting of /dev/pts and /dev/shm, then it continues as normal through runlevel 3.  This time it stopped to do a partition check on a large partition, and I saw this right before the above mkdir errors:

rc: failed to exec '/lib64/rc/sh/init.sh'

Not sure yet why it's failing, but I'm about to poke around some more...
Comment 11 Roy Marples 2008-04-27 07:12:15 UTC
If it's trying to mount pts or shm *after* you see those messages, then either waitpid isn't working OR something else is starting rc or a service. With udev this shouldn't happen as the udev scripts create /dev/.rcsysinit which causes rc and services to block until this file is removed.

At this point I cannot help you any further unless you find the cause or get a static dev working - rc_devices="static".
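The blocking behaviour described here can be pictured as a simple poll on the marker file. This is only a sketch, assuming a one-second poll and a timeout; wait_for_removal is a hypothetical name, and OpenRC's real logic is not shown in this thread and may well differ.

```sh
#!/bin/sh
# Hypothetical helper, not OpenRC code.
# Usage: wait_for_removal FILE [TIMEOUT_SECONDS]
# Returns 0 once FILE no longer exists, 1 if it is still there
# after TIMEOUT_SECONDS (default 30).
wait_for_removal() {
    local file="$1" timeout="${2:-30}" waited=0
    while [ -e "$file" ]; do
        [ "$waited" -ge "$timeout" ] && return 1
        sleep 1
        waited=$((waited + 1))
    done
    return 0
}
```

With the path named above this would be called as `wait_for_removal /dev/.rcsysinit`. If nothing ever created the file, the function returns immediately, which would match the symptom of runlevel 3 starting too early.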
Comment 12 Steve Arnold archtester gentoo-dev 2008-04-27 18:21:26 UTC
Nope, static devices won't cut it on this machine (since I already tried it and it doesn't even come close to working correctly).  I also have a ton of work backing up and I need this machine running, so I'll have to go with the workaround I just figured out (see below).

If I fuss with it, such as re-merging the package, I can sometimes get it to boot once; when it boots, /dev/pts and /dev/shm are mounted soon enough for runlevel 3 to continue.

Most of the time, however, it fails to boot properly as shown above; in this case it tries to go into runlevel 3 too soon, and falls to the (none) login: prompt on a read-only root filesystem *before* the above are mounted.  This happens repeatedly (as shown above) where it tries to mount /dev/pts and /dev/shm too late and the boot fails.

Here's the workaround: if I set rc_coldplug="NO" it boots all the way up.  However, with my normal kernel/module/udev config and a USB keyboard and mouse, this results in a full boot with no working keyboard or mouse.

I can get the above to boot every time, with different kernels, and when I tried the genkernel initrd setup I built yesterday, I can actually use it, since the initrd stuff seems to enable the USB hardware even with rc_coldplug="NO".

As I said, I consider this a workaround, since it should work correctly using the original config and kernel.  Something is making those mounts happen too late (or the other stuff is too early) and that should be fixed.  I can probably test a few things along the way, but it's not my code and I really don't have the time right now to do anything more besides catch up on my own work (this machine has been mostly non-functional for a week simply because I decided to upgrade to baselayout2, which from my perspective seems like a premature unmasking).

Let me know if you have any changes you'd like me to try; for now I'm running the 0.2.2 ebuild with the boinc start/stop commands renamed so it will shut down correctly.  Using the gentoo-sources kernel with an initrd, along with the rc_coldplug thing, seems to at least yield a working system (although I can't say everything is working fine, at least the basics are working).

The main boot failure described in this bug happens with all openrc versions I've tried so far; the current ~version, my own 0.2.2-r1 (with the backported patch), and a couple of git versions, since the latter appears to update the sources with each rebuild of the package.
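The workaround from this comment amounts to a one-line change in the migrated rc.conf. The excerpt below is a sketch that assumes the file lives at /etc/rc.conf; both variable names and values are the ones quoted in this thread, and the comments are added for illustration.

```sh
# /etc/rc.conf (excerpt; location assumed)

# Workaround from this comment: the system boots fully, but USB input
# only works when an initrd has already brought the hardware up.
rc_coldplug="NO"

# Tried earlier in this thread without success on this machine:
# rc_devices="static"
```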
Comment 13 Steve Arnold archtester gentoo-dev 2008-04-27 18:41:36 UTC
Created attachment 151164 [details]
migrated rc.conf

I followed the little migration doc, but it didn't really say a lot about which of the old settings from conf.d/rc are valid in the new rc.conf so maybe you can see something obvious in there I'm not seeing...
Comment 14 Roy Marples 2008-04-28 14:52:57 UTC
If prior versions worked, can you identify with certainty the git commit that first breaks things like this?
Comment 15 Roy Marples 2008-04-28 17:39:30 UTC
I've had a report this has been fixed in git version 857b2c9f
Comment 16 SpanKY gentoo-dev 2008-05-05 05:03:45 UTC
openrc-0.2.3 now in the tree too ... 
Comment 17 Steve Arnold archtester gentoo-dev 2008-05-06 01:53:32 UTC
Yeah, I tried both 0.2.3 and the latest git version (as of Sunday afternoon) but neither of them was any different than previous versions, at least with respect to this problem.

I also tried both my regular kernel/udev setup and the gentoo-sources/initrd setup, but both still have the same boot error, i.e., the tmpfs mounting comes too late.  The latest 2.6.25-gentoo-r1 has some problems with my hardware, I guess; at least the NIC driver doesn't seem to load correctly, so I have no network with that kernel (whereas vanilla 2.6.25 worked fine with the old baselayout).  I did get the radeonfb driver to at least load and set the right console resolution (1440x900) with 2.6.24-gentoo-r4 (and most everything else seems to work so far).

So I'm still currently stuck with the workaround of setting rc_coldplug="NO" and using the initrd instead of my normal kernel/udev setup (but at least I can get some work done).
Comment 18 Roy Marples 2008-05-06 11:00:53 UTC
This is starting to sound like bug #219929.
Does building the ATI fb driver into the kernel, instead of as a module, fix it?
Comment 19 Steve Arnold archtester gentoo-dev 2008-05-10 04:22:25 UTC
The radeonfb module loads automatically with my normal kernel/udev setup (the one that doesn't work with openrc) but not with the current setup; I don't usually use genkernel or an initrd image, so I'm not sure why it isn't loaded, but if I add it to the modules file it loads and works as expected.  That doesn't seem to have anything to do with the openrc init failure (it just changes the console res).

It didn't jump out at me from looking at the init scripts, so I'm not sure exactly what needs tweaking, but somehow the tmpfs mounts need to happen sooner when rc_coldplug is set to "YES".
Comment 20 Roy Marples 2008-05-10 10:13:48 UTC
(In reply to comment #19)
> That
> doesn't seem to have anything to do with the openrc init failure (it just
> changes the console res).

That actually was the error: the console resolution change delivered a SIGWINCH from init.sh to rc, which we were not handling correctly. This should now be fixed in git.
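"Interrupted system call" is the message for EINTR, which a blocking wait returns when a handled signal such as SIGWINCH arrives before the child has exited. The fix in rc itself is not shown in this thread; the following is only a shell-level analogy of the usual remedy (retry the wait), with hypothetical names. In a shell, wait returns a status above 128 when a trapped signal interrupts it.

```sh
#!/bin/sh
# Sketch only: a shell-level analogy, not the code in rc.

wait_interruptions=0
interrupted=0
# Without a trap, SIGWINCH is ignored by default and wait is not interrupted.
trap 'interrupted=1' WINCH

# wait_retry PID: wait for a child and return its exit status, waiting
# again whenever the trap fired during the wait and the status is above 128.
wait_retry() {
    local pid="$1" status
    while :; do
        interrupted=0
        wait "$pid"
        status=$?
        if [ "$interrupted" -eq 1 ] && [ "$status" -gt 128 ]; then
            wait_interruptions=$((wait_interruptions + 1))
            continue
        fi
        return "$status"
    done
}
```

A caller that treats the first early return as the child's final status moves on while the child is still running, which is the kind of premature continuation reported in this bug.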
Comment 21 Roy Marples 2008-05-12 09:06:42 UTC
This should be fixed now in openrc-0.2.4
Comment 22 Steve Arnold archtester gentoo-dev 2008-06-24 07:53:37 UTC
Yeah, sys-apps/openrc-0.2.5 seems to be working, more or less; at least it boots and shuts down.  It's successfully hiding the shutdown messages so far, except for some new gdm errors, so I haven't seen anything interesting there.  I'd recommend closing...
Comment 23 Doug Goldstein (RETIRED) gentoo-dev 2008-10-07 14:22:42 UTC
Fixed in 0.3.0.