|Summary:||LVM not starting properly with baselayout-1.11.8|
|Product:||Gentoo Linux||Reporter:||Harris Landgarten <harrisl>|
|Component:||[OLD] baselayout||Assignee:||Greg Kroah-Hartman (RETIRED) <gregkh>|
|Package list:||Runtime testing required:||---|
hotplug environment logger
Description Harris Landgarten 2004-12-22 19:58:43 UTC
I just installed a stage1 on a new ~x86 system. On boot LVM2 volume failed when checked in checkfs and reports cannot open /dev/mapper/control. At the maintenance stop I can run vgchange -ay and then exit to continue booting properly. I veried that /dev/mapper/control was there. Downgrading to baselayout-1.11.7-r2 solved the problem. I believe that checkfs is not waiting long enough after the modprobing of dm-mod for /dev/mapper/control to appear causing a race condition. Reproducible: Always Steps to Reproduce: 1.see above 2. 3. Portage 2.0.51-r8 (default-linux/x86/2004.3, gcc-3.4.3, glibc-220.127.116.1141102-r0, 2.6.9-gentoo-r10 i686) ================================================================= System uname: 2.6.9-gentoo-r10 i686 Intel(R) Pentium(R) 4 CPU 2.40GHz Gentoo Base System version 1.6.7 Python: dev-lang/python-2.3.4 [2.3.4 (#1, Dec 21 2004, 00:54:04)] dev-lang/python: 2.3.4 sys-devel/autoconf: 2.59-r6, 2.13 sys-devel/automake: 1.8.5-r2, 1.5, 1.4_p6, 1.6.3, 1.7.9, 1.9.3 sys-devel/binutils: 18.104.22.168.2-r2 sys-devel/libtool: 1.5.10-r2 virtual/os-headers: 22.214.171.124-r1 ACCEPT_KEYWORDS="x86 ~x86" AUTOCLEAN="yes" CFLAGS="-O2 -march=pentium4 -fomit-frame-pointer -pipe" CHOST="i686-pc-linux-gnu" CONFIG_PROTECT="/etc /usr/X11R6/lib/X11/xkb /usr/kde/2/share/config /usr/kde/3.3/env /usr/kde/3.3/share/config /usr/kde/3.3/shutdown /usr/kde/3/share/config /usr/lib/mozilla/defaults/pref /usr/share/config /var/qmail/control" CONFIG_PROTECT_MASK="/etc/gconf /etc/terminfo /etc/env.d" CXXFLAGS="-O2 -march=pentium4 -fomit-frame-pointer -pipe" DISTDIR="/usr/portage/distfiles" FEATURES="autoaddcvs autoconfig ccache distlocks sandbox sfperms" GENTOO_MIRRORS="http://distfiles.gentoo.org http://distro.ibiblio.org/pub/Linux/distributions/gentoo" MAKEOPTS="-j8" PKGDIR="/usr/portage/packages" PORTAGE_TMPDIR="/var/tmp" PORTDIR="/usr/portage" PORTDIR_OVERLAY="" SYNC="rsync://rsync.gentoo.org/gentoo-portage" USE="x86 X acl alsa apache2 apm arts avi berkdb bitmap-fonts cdr crypt cups dvd encode esd fam foomaticdb fortran gd gdbm gif gnome gpm gstreamer gtk gtk2 guile imlib ipv6 java jpeg kde libwww mad mikmod mmx mmx2 motif mozilla mpeg mysql ncurses nls nptl oggvorbis opengl oss pam pdflib perl png python qt quicktime readline samba sdl spell sse sse2 ssl svga tcpd tiff truetype xml xml2 xmms xv zlib video_cards_radeon"
Comment 1 SpanKY 2004-12-22 20:02:27 UTC
adding calls to 'sleep' is not a valid fix do you use udev ? if so, what version ?
Comment 2 Harris Landgarten 2004-12-22 20:52:55 UTC
Yes I am using udev-050
Comment 3 SpanKY 2004-12-22 21:03:51 UTC
if you edit /sbin/rc and change this if statement: if [ "$(udev_version)" -ge "48" ] ; then read like this: if [ "$(udev_version)" -ge "500" ] ; then and then reboot, does everything work out ?
Comment 4 Harris Landgarten 2004-12-23 05:07:50 UTC
The change to /sbin/rc does not fix the problem.
Comment 5 Harris Landgarten 2004-12-27 08:25:15 UTC
On a hunch, I downgraded to UDEV-045 and the problem stopped. I now believe this problem is related to bug 74786. In addition to fixing the LVM problem it also solved a problem where net.eth0 was showing started before it received an IP address from dhcpcd causing ntp-client to fail to start. I appears that udevsend is broken is more ways than one and is the cause of various race conditions on boot.
Comment 6 Kay Sievers 2004-12-27 08:48:35 UTC
Any chance to try a 2.6.10 kernel? We've had a bug in 2.6.9 which causes holes hotplug sequence numbers: http://linus.bkbits.net:8080/linux-2.5/cset@41798d05sEo-J9zrpeXvVYgctOLw1w
Comment 7 Harris Landgarten 2004-12-27 08:54:37 UTC
I was already running 2.6.10-gentoo
Comment 8 Kay Sievers 2004-12-27 09:08:32 UTC
Created attachment 46975 [details] hotplug environment logger Ah, ok. Any chance to reproduce this while recording the events with the simple attached script?
Comment 9 Harris Landgarten 2004-12-27 09:23:28 UTC
Where should I put the script?
Comment 10 Kay Sievers 2004-12-27 09:33:47 UTC
Sorry, forgot that: /etc/hotplug.d/default/00-log.hotplug
Comment 12 Harris Landgarten 2004-12-27 09:58:09 UTC
The first time I rebooted with the log script LVM started properly but ntp-client did not. I rebooted and LVM hung at the maintenance point. After running vgchange -ay and exiting booting completed and this time ntp-client started. hotplug.log is attached.
Comment 13 Kay Sievers 2004-12-27 10:15:20 UTC
Seems you didn't use /sbin/udevsend as the hotplug helper as there is no MANAGED_EVENT=1 key in your log.
Comment 14 Harris Landgarten 2004-12-27 10:29:55 UTC
I rebooted again and it hung again. I checked /dev/.hotplug.log from the maintenance point and it showed --------924-------- 1104169448 add /module/dm_mod SUBSYSTEM=module DEVPATH=/module/dm_mod PATH=/sbin:/bin:/usr/sbin:/usr/bin ACTION=add _=/bin/env PWD=/ HOME=/ SHLVL=2 SEQNUM=924 --------924-------- --------925-------- 1104169448 add /module/dm_mod/sections SUBSYSTEM=module DEVPATH=/module/dm_mod/sections PATH=/sbin:/bin:/usr/sbin:/usr/bin ACTION=add _=/bin/env PWD=/ HOME=/ SHLVL=2 SEQNUM=925 --------925-------- --------926-------- 1104169448 add /class/misc/device-mapper SUBSYSTEM=misc DEVPATH=/class/misc/device-mapper PATH=/sbin:/bin:/usr/sbin:/usr/bin ACTION=add _=/bin/env PWD=/ HOME=/ SHLVL=2 SEQNUM=926 --------926-------- Downgrading to udev-045 definitely solved the problem so what else could it be?
Comment 15 Kay Sievers 2004-12-27 10:53:32 UTC
Don't know what is causing the problem here. It's definitely not the udevsend problem. We need to narrow the possibilities. Do the rules change when you up/downgrade udev? You may try the 045 udev with the 050 rules, or the other way around.
Comment 16 Harris Landgarten 2004-12-27 16:40:19 UTC
To begin with, I wanted to verify my race condition prop. I added a sleep 5 to checkfs before the lvm section. Rebooting failed. I changed the sleep 5 to a sleep 10 and rebooting suceeded many times. This proves that checkfs is running before /dev/mapper/control can be established. This only started happening with baselayout-1.11.8 and udev > 045. I doubt it is rule based because udev-050 works fine with baselayout-1.11.7-r2. Something changed with these updates that slowed down the post modprobe processing of dm_mod using udev. Whatever is causing this may be the cause of other problems. Aside from that, checkfs needs a more robust method than testing for the existence of /dev/mapper/control to handle when dm_mod is in process but not finished. BTW, should I post the net.eth0 problems as a separate bug?
Comment 17 Kay Sievers 2004-12-27 17:52:44 UTC
> I changed the sleep 5 to a sleep 10 and rebooting suceeded many times. You can see in your logs that the kernel emits the events with 20 seconds time difference: --------926-------- 1104169448 add /class/misc/device-mapper --------927-------- 1104169468 add /block/dm-0 Any idea why this happens? What is triggering the creation of the dm-0 device at bootup?
Comment 18 Harris Landgarten 2004-12-27 20:18:59 UTC
I captured the hotplug_d.log and ls -l /dev/mapper before the LVM stuff in checkfs which is right after dm_mod is modprobed hotplug_d.log contains: --------924-------- 1104206320 add /module/dm_mod SUBSYSTEM=module DEVPATH=/module/dm_mod PATH=/sbin:/bin:/usr/sbin:/usr/bin ACTION=add _=/bin/env PWD=/ HOME=/ SHLVL=2 SEQNUM=924 --------924-------- --------925-------- 1104206321 add /module/dm_mod/sections SUBSYSTEM=module DEVPATH=/module/dm_mod/sections PATH=/sbin:/bin:/usr/sbin:/usr/bin ACTION=add _=/bin/env PWD=/ HOME=/ SHLVL=2 SEQNUM=925 --------925-------- --------926-------- 1104206321 add /class/misc/device-mapper SUBSYSTEM=misc DEVPATH=/class/misc/device-mapper PATH=/sbin:/bin:/usr/sbin:/usr/bin ACTION=add _=/bin/env PWD=/ HOME=/ SHLVL=2 SEQNUM=926 --------926-------- ls -l /dev/mapper ls: /dev/mapper: No such file or directory At the maintenance point /dev/.hotplug_d.log has not changed but ls -l /dev/mapper gives: crw-rw---- 1 root root 10, 61 Dec 27 22:58 control dm-0 etc are produced after vgchange -ay runs which explains the delay since it is being input manually. It seems that it is simple case of /dev/mapper/control not appearing in time for checkfs and I assume this is the responibility of udev in response to dm_mod loading. It looks like event 926 triggers this but why is it taking over 5s? Could it have anything to do with the fact that it is the last event? Is udev waiting for something else?
Comment 19 Harris Landgarten 2004-12-27 21:02:41 UTC
/etc/hotplug.d/default/10-udev.hotplug is symlinked to /sbin/udevsend. Doesn't this mean that udevsend is being used as a helper agent for all hotplug events? If not udevsend -> udevd, what is processing kernel event 926 and creating /dev/mapper/control?
Comment 20 Kay Sievers 2004-12-28 07:44:25 UTC
> It seems that it is simple case of /dev/mapper/control not appearing > in time for checkfs and I assume this is the responibility of udev > in response to dm_mod loading. Yes, udev creates the node with the event. > /etc/hotplug.d/default/10-udev.hotplug is symlinked to /sbin/udevsend. > Doesn't this mean that udevsend is being used as a helper agent for all > hotplug events? If not udevsend -> udevd, what is processing kernel > event 926 and creating /dev/mapper/control? Yes, the unpredictability of events by forked helper preocesses makes it necessary for udev to reorder the events. It's that way from the early days of udev on. In the future, we want to take over the _whole_ hotplug event with udevsend by letting the kernel fork /sbin/udevsend instead of /sbin/hotplug, but that is disabled on your box, according to your logs. > Could it have anything to do with the fact that it is the last event? > Is udev waiting for something else? Not that I can think of anything. > It looks like event 926 triggers this but why is it taking over 5s? Good question. Please put the same script from /hotplug.d/ in: /etc/dev.d/default/00-log.dev and let it write to another file. This way we get the time udev has created the node.
Comment 21 Harris Landgarten 2004-12-31 12:46:11 UTC
Sorry for the delay. My experimental machine was busy recovering a failed stripe 0 for a customer. The following was recorded at the maintenance point. From hotplug_d.log --------926-------- 1104519006 add /class/misc/device-mapper SUBSYSTEM=misc DEVPATH=/class/misc/device-mapper PATH=/sbin:/bin:/usr/sbin:/usr/bin ACTION=add _=/bin/env PWD=/ HOME=/ SHLVL=2 SEQNUM=926 --------926-------- From dev_d.log --------926-------- 1104519016 add /class/misc/device-mapper SUBSYSTEM=misc DEVPATH=/class/misc/device-mapper PATH=/sbin:/bin:/usr/sbin:/usr/bin ACTION=add _=/bin/env PWD=/ MANAGED_EVENT=1 DEVNAME=/dev/mapper/control HOME=/ SHLVL=2 SEQNUM=926 --------926-------- As you can see it is taking 10s for udev to respond to the 926 request. BTW, how do I configure the system so that udevsend takes over all hotplug activity?
Comment 22 Kay Sievers 2005-01-01 11:40:04 UTC
Happy New Year! > As you can see it is taking 10s for udev to respond to the 926 request. That's bad. How does the complete sequence look like? Are more udev events delayed like this one? Or is it only this single one. I can't reproduce this on any of my boxes, both events happen in the same second after loading the module. > how do I configure the system so that udevsend takes over all hotplug activity? By echo "/sbin/udevsend > /proc/sys/kernel/hotplug". This way the kernel notifies udevd and the event is handled by the forked udev instance after the node creation and waiting for sysfs to be ready. But we should find the error first in your current setup.
Comment 23 Harris Landgarten 2005-01-01 12:34:07 UTC
I looked at the .dev_d.log file at the maintance point. At this point hotplug_d.log only has 3 entries as show above. dev_d.log was a 256K file but the 926 entry was the only one with a sequence number and was the last entry in the file. The only other issue I have is that net.eth0 is shown as started before it receives an ip from dhcpcd on boot. An ifconfig immediately after root signon sometimes shows eth0 without an IP address with subsequent tries showing the IP address. I don't know if this is somehow related but it is baselayout-1.11.8 issue I think we should put some debugging statements into udevsend so that we can see what is causing the delay. udevsend is probably timing out of the code that starts on line 179. It is also possible that udev is slow in starting. Would 926 be the sequency that intially causes udevsend to load udevd? Harris
Comment 24 Kay Sievers 2005-01-01 13:14:56 UTC
Just an idea: You may try to remove the udevd kill from /sbin/rc:91: killall -q udevd This is bad. It will delay all further events for 10 seconds on a current udev. I've made a patch to prevent the initial delay, so killing it should work sometimes in the future, but doing it today isn't a good idea: http://bugs.gentoo.org/attachment.cgi?id=46753&action=view Maybe your network problems are caused by the changes according to: http://bugs.gentoo.org/show_bug.cgi?id=73327
Comment 25 Harris Landgarten 2005-01-01 14:06:21 UTC
I tried inserting a usb stick to see if there were any delays in udev creating a dev for it. Both .hotplug_d.log and .dev_d.log show the same time stamp in processing this event so it seems like all is working as it should once things get started and settle down. Since I have a totally reproducable boot related bug on a new 2004.3 installation I would like to systematically track it down. We know that there is a 10s delay in creating /dev/mapper/control. Assuming this delay is happening in the loop that starts at line 179 of udevsend.c (my first guess), could you create a version of that file with some debugging output that will show the status of that loop and whether it is waiting on udevd or something else is preventing its return.
Comment 26 Kay Sievers 2005-01-01 16:02:31 UTC
> seems like all is working as it should once things get started and settle down. Yes, seems so. Loading all basic modules _before_ the run of udevstart should solve all these kinds of problems too. That's the way other distributions are doing this and we don't get reports from them. > Since I have a totally reproducable boot related bug on a new 2004.3 installation > I would like to systematically track it down. We know that there is a 10s delay in > creating /dev/mapper/control. Did you try the patch, that reduces the startup delay? http://bugs.gentoo.org/attachment.cgi?id=46753&action=view > Assuming this delay is happening in the loop that starts at line 179 of > udevsend.c (my first guess), That's very unlikely. I don't expect any problem here. > could you create a version of that file with some > debugging output that will show the status of that loop and whether it is waiting > on udevd or something else is preventing its return. Just pass a "DEBUG=true" to the make command and udev will print that to syslog.
Comment 27 Harris Landgarten 2005-01-04 04:32:46 UTC
I applied the udevd-initial-timout-fix-01.patch to udev-050 but it did not make any difference. I am still getting a 10s delay.
Comment 28 Kay Sievers 2005-01-04 06:43:23 UTC
Hmm, no idea. Any chance to pass DEBUG=true to the make command. It will write a lot of debug to syslog.
Comment 29 Harris Landgarten 2005-01-04 11:45:20 UTC
Will DEBUG=true emerge udev do what you want. Also remember /var is not mounted when the error occurs.
Comment 30 Kay Sievers 2005-01-04 12:08:32 UTC
> Will DEBUG=true emerge udev do what you want. Sorry, I don't know anything about gentoo. > Also remember /var is not mounted when the error occurs. Ah, I see. It should print on the console then, but it may be too much log to follow it, don't know... Apart from the possible udev bug, that we looking for and it would be nice to find, gentoo should consider loading the dm modules before it runs udevstart.
Comment 31 Harris Landgarten 2005-01-29 04:52:30 UTC
I just added a lvm home to my notebook and now it has the same problem.
Comment 32 Matthew Harris 2005-05-04 02:05:54 UTC
Another data point: I'm also seeing the problem after upgrading to baselayout-1.11.11-r3, but this isn't a ~x86 system: udev-045 lvm2-2.00.33-r1 device-mapper-1.00.19-r2 gentoo-sources-2.6.11-r6 Possibly just coincidental: when I boot, I also get /dev/usb created as a character device rather than a directory, so apcupsd fails until I recreate it as a directory and reattach my UPS. (Just like <https://www.redhat.com/archives/fedora-list/2005-January/msg07909.html>.)
Comment 33 Eric Edgar (RETIRED) 2006-04-21 13:00:25 UTC
reopen this bug if this is still an issue. This should be fixed with newer baselayouts and newer lvm2 packages