Gentoo Websites Logo
Go to: Gentoo Home Documentation Forums Lists Bugs Planet Store Wiki Get Gentoo!
Bug 154670 - sys-apps/baselayout starting ntp-client before network is ready
Summary: sys-apps/baselayout starting ntp-client before network is ready
Status: RESOLVED FIXED
Alias: None
Product: Gentoo Linux
Classification: Unclassified
Component: [OLD] baselayout (show other bugs)
Hardware: All Linux
: High normal (vote)
Assignee: Gentoo's Team for Core System packages
URL:
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2006-11-10 06:13 UTC by Daniel Drake (RETIRED)
Modified: 2007-01-17 17:02 UTC (History)
2 users (show)

See Also:
Package list:
Runtime testing required: ---


Attachments
Test on exit -1 instead of service_inactive (x,2.50 KB, patch)
2006-11-13 16:40 UTC, Roy Marples (RETIRED)
Details | Diff
race illustration (test.html,16.22 KB, text/html)
2006-11-15 11:34 UTC, Daniel Drake (RETIRED)
Details

Note You need to log in before you can comment on or make changes to this bug.
Description Daniel Drake (RETIRED) gentoo-dev 2006-11-10 06:13:23 UTC
Occasionally baselayout is starting ntp-client before net.ath0 has obtained an IP using dhcpcd. This is a madwifi interface using WEP over wpa_supplicant, and using parallel startup.

By inserting debug iwconfig/ifconfig calls into the ntp-client init script, I can see that when this problem does happen, wpa_supplicant has associated but dhcpcd has not yet obtained an IP.

Reproduced on 1.12.5 and 1.13.0_alpha6

I plan to look into fixing this over the next few days.
Comment 1 Daniel Drake (RETIRED) gentoo-dev 2006-11-10 06:14:07 UTC
This is what is happening:

INIT entering runlevel 3
Service binary-modules starting
Service net.ath0 starting
Service netmount starting
Service ntp-client starting
Service sshd starting
Service uploader starting
Service vixie-cron starting
Service local starting
Service uploader started
Service binary-modules started
Service net.ath0 started
Service netmount started
Error : Temporary failure in name resolution
9 Nov 13:10:07 ntpdate[6607]: can't find host pool.ntp.org
 
9 Nov 13:10:07 ntpdate[6607]: no servers can be used, exiting
Falied to set clock [!!]
Service sshd started
Service vixie-cron started
Service local started
Comment 2 Daniel Drake (RETIRED) gentoo-dev 2006-11-10 06:20:19 UTC
Also added some "ps ax | grep dhcpcd" debug output into the ntp-client init script.

Judging by the fact that the dhcpcd PID listed there is different to the one listed when I login after boot, it looks like dhcpcd hasn't even forked off at the point of failure.

Also, the process is marked in state D (uninterruptible sleep) at the failure point, not sure if this is of relevance.
Comment 3 Daniel Drake (RETIRED) gentoo-dev 2006-11-10 06:32:52 UTC
The plot thickens, just saw this happen with 1.13:

Error : Temporary failure in name resolution
9 Nov 13:10:07 ntpdate[6607]: can't find host pool.ntp.org
9 Nov 13:10:07 ntpdate[6607]: no servers can be used, exiting
Setting clock via the NTP client 'ntpdate'
Falied to set clock [!!]
ERROR: ntp-client failed to start
Starting sshd
WARNING: ntp-client is schedule to start when net.ath0 has started
Comment 4 Roy Marples (RETIRED) gentoo-dev 2006-11-10 09:24:55 UTC
put this in /etc/conf.d/net

postup() {
   if [[ ${IFACE} == "ath0" ]] ; then
      echo "Pinging pool.ntp.org"
      print pool.ntp.org
   fi
}

Does that work?
Comment 5 Roy Marples (RETIRED) gentoo-dev 2006-11-10 09:29:01 UTC
(In reply to comment #4)
>       print pool.ntp.org

Obviously that should be ping, not print ;)
Comment 6 Daniel Drake (RETIRED) gentoo-dev 2006-11-13 06:55:09 UTC
That output didn't seem to hit the console so I used tee to save it to a file. It looks like postup was executed at the right time, *after* ntp-client was started: the ping succeeded.
Comment 7 Daniel Drake (RETIRED) gentoo-dev 2006-11-13 07:04:03 UTC
Confirmed that by adding "date +%s.%N" calls to /etc/init.d/ntp-client and postup
Comment 8 Daniel Drake (RETIRED) gentoo-dev 2006-11-13 13:09:55 UTC
Also happens on non-parallel startup.

Unsurprisingly this seems to be a race condition. This is the sequence of events:

> net.ath0 starts
> net.ath0 starts wpa_supplicant and wpa_cli
> wpa_supplicant starts association, drops into background, and exits
[some time passes]
> wpa_supplicant associates
> wpa_cli calls /etc/init.d/net.ath0 start
> net.ath0 detects wpa supplicant is already started, so invokes dhcpcd
[dhcpcd starts, the boot continues on another processor]
< ntp-client starts, realises it has a dependency on net.ath0
< runscript.sh calls service_started on net.ath0, decides that net.ath0 has started, so it is safe to continue
< ntp-client init script launches ntpdate
< ntpdate bitches about not having network access
> dhcpcd gets an IP and forks into the background
system continues to boot

In short the problem is that "service_started net.ath0" succeeds before dhcpcd has forked.
Comment 9 Daniel Drake (RETIRED) gentoo-dev 2006-11-13 13:29:19 UTC
net.ath0 starts (for the first time, as opposed to wpa_cli starting it for the 2nd time), this happens:

 * Starting ath0
 *  Starting wpa_supplicant on ath0
 *  Starting wpa_cli on ath0
 *    Backgrounding

Now, line 519 in runscript.sh is reached. Incase this is different in my copy, it's in the svc_start() function and is in this context:

	else
		svcstarted=0
		mark_service_started "${SVCNAME}" ## <-------- LINE 519
		veinfo $"Service" "${SVCNAME}" $"started"
	fi

	# Flush the ebuffer
	if [[ ${RC_PARALLEL_STARTUP} == "yes" && ${RC_QUIET} != "yes" ]] ; then
		eflush
		ebuffer ""
	fi

The service is marked started despite no indication that wpa_cli has done its work and started net.ath0
Comment 10 Roy Marples (RETIRED) gentoo-dev 2006-11-13 16:39:22 UTC
(In reply to comment #9)
> Now, line 519 in runscript.sh is reached.

Line 519 is only reached on my boxes after an IP address has been successfully configured, which is correct.

If you look at the lines above, you'll see that we bail out immediately after based on service status. So I suppose there is a tiny window to race through. By default we always exit 0 when backgrounding in the net scripts. 1 is always our error code, so maybe we should return -1 and test for that instead of using service status. Of course the downside of that is we have to guarantee that other init scripts don't return -1 - heh.
Comment 11 Roy Marples (RETIRED) gentoo-dev 2006-11-13 16:40:29 UTC
Created attachment 101886 [details, diff]
Test on exit -1 instead of service_inactive

Totally untested as I'm going to bed :P
Comment 12 Daniel Drake (RETIRED) gentoo-dev 2006-11-14 07:28:57 UTC
After this patch, it appears that dhcpcd is never invoked after wpa_supplicant brings the interface up.

I added some more debug output to get exact sequences. With the patch:

STARTING
INACTIVE
STOPPED
[ntp-client starts]
[wpa_cli invokes init script]

Without the patch, here is the exact sequence of a buggy boot:

STARTING
INACTIVE
[wpa_cli invokes init script]
STARTING
STARTED <-- line 519 in runscript.sh did this
[ntp-client starts]
[dhcpcd starts]
[dhcpcd receives IP and forks]
STARTED

Another buggy boot:

STARTING
INACTIVE
[wpa_cli invokes init script]
STARTING
STARTED <-- line 519 in runscript.sh did this
[dhcpcd starts]
[ntp-client starts]
[dhcpcd receives IP and forks]
STARTED
STARTED <-- line 519 in runscript.sh did this

Successful boot:

STARTING
INACTIVE
[wpa_cli invokes init script]
STARTING
[dhcpcd starts]
[dhcpcd receives IP and forks]
STARTED
STARTED
(this one is missing the position of the ntp-client start, i'm having real trouble getting another successful boot so I can add this info...)

Another successful boot:

STARTING
INACTIVE
[wpa_cli invokes init script]
STARTING
[dhcpcd starts]
[dhcpcd obtains IP and forks]
STARTED
[ntp-client starts]
STARTED

The on-screen output was interesting in the above case, as ntp-client was started in the foreground despite net.ath0 being backgrounded/inactive:

* Starting ath0
 * Backgrounding
* WARNING: net.ath0 has started but is inactive
* Mounting network filesystems
* Setting clock via the NTP client ntpdate

(normally it moans about ntp-client being scheduled to start when net.ath0 has started)
Comment 13 Daniel Drake (RETIRED) gentoo-dev 2006-11-15 07:57:11 UTC
The patch doesn't work because exit codes are unsigned. I changed -1 to 255 and it is closer but not quite there:

Wed Nov 15 10:46:21 EST 2006 STARTING
Wed Nov 15 10:46:22 EST 2006 INACTIVE
Wed Nov 15 10:46:22 EST 2006 wpa_cli running init script
Wed Nov 15 10:46:22 EST 2006 STARTING
Wed Nov 15 10:46:22 EST 2006 running dhcpcd
Wed Nov 15 10:46:22 EST 2006 ntp-client start
Wed Nov 15 10:46:23 EST 2006 dhcpcd complete
Wed Nov 15 10:46:23 EST 2006 STARTED
Wed Nov 15 10:46:23 EST 2006 STARTED

Rebooted again and got different results:

Wed Nov 15 10:49:47 EST 2006 STARTING
Wed Nov 15 10:49:48 EST 2006 INACTIVE
Wed Nov 15 10:49:50 EST 2006 wpa_cli running init script
Wed Nov 15 10:49:50 EST 2006 STARTING
Wed Nov 15 10:49:50 EST 2006 running dhcpcd
Wed Nov 15 10:49:51 EST 2006 dhcpcd complete
Wed Nov 15 10:49:51 EST 2006 STARTED
Wed Nov 15 10:49:53 EST 2006 STARTED

ntp-client (and other net-dependent services) were not started at all
Comment 14 Daniel Drake (RETIRED) gentoo-dev 2006-11-15 11:34:40 UTC
Created attachment 101997 [details]
race illustration

here's a pretty illustration of the sequence of events and the race window.
the race window does look small, but the race is happening most of the time, probably due to the fact that the scheduler is likely change process upon subshell process termination.
Comment 15 Daniel Drake (RETIRED) gentoo-dev 2007-01-08 18:32:28 UTC
Roy has checked in a workaround for this, and it seems to work here (survived 20 reboots with no races, previously it was happening almost every time).
Comment 16 Roy Marples (RETIRED) gentoo-dev 2007-01-10 12:19:43 UTC
This should be fixed in baselayout-1.13.0_alpha11. Re-open if you disagree.
Comment 17 Petteri Räty (RETIRED) gentoo-dev 2007-01-10 14:38:33 UTC
(In reply to comment #16)
> This should be fixed in baselayout-1.13.0_alpha11. Re-open if you disagree.
> 

I disagree :)
Comment 18 Daniel Drake (RETIRED) gentoo-dev 2007-01-10 15:35:32 UTC
Why do you disagree? Do you see the same bug? Is it unfixed for you?
Comment 19 Roy Marples (RETIRED) gentoo-dev 2007-01-14 18:56:56 UTC
OK, I think I have this sussed now. svc_in_control is a good function that works fine and does block the link, but we need to change from linking to the actual script to just touching a file by the same name in $svcdir.

This is because -nt and test de-reference the link which makes it pointless. Should be fixed in alpha12 when that comes out.
Comment 20 Daniel Drake (RETIRED) gentoo-dev 2007-01-17 17:02:46 UTC
seems to be fixed in alpha12, thanks!