122795 – ppp0 without an IP address after loss of connection in persistent mode

Status:	RESOLVED FIXED

Alias:	None

Product:	Gentoo Linux
Classification:	Unclassified
Component:	[OLD] baselayout
Hardware:	All Linux

Importance:	High normal
Assignee:	Gentoo Dialup Developers

Reported:	2006-02-14 07:03 UTC by Martin von Gagern
Modified:	2006-05-09 05:25 UTC

Attachments
syslog (ppp_noip.log, 25.14 KB, text/plain) 2006-02-14 07:22 UTC, Martin von Gagern
Proposed ip-up script (ip-up, 1.69 KB, text/plain) 2006-02-17 09:51 UTC, Alin Năstac (RETIRED)
Patch that forces pppd to wait for its childs before trying to reconnect (ppp-2.4.3-child-wait.patch, 1.11 KB, patch) 2006-02-17 12:53 UTC, Alin Năstac (RETIRED)
syslog with ppp-2.4.3-r14 (bug122795-syslog2.txt, 5.39 KB, text/plain) 2006-05-06 04:01 UTC, Martin von Gagern

Description Martin von Gagern 2006-02-14 07:03:39 UTC

I have a box running Gentoo Linux with baselayout-1.12.0_pre15-r1 and ppp-2.4.3-r10, connected to the web using PPPoE over an ADSL link.
This link is supposed to be always up, but apparently there was some hassle with it the other day, and my connection got reset several times because LCP echoes were not answered. Annoying, but no big deal so far, and probably the ISP's fault.

Anyway, after the fifth such reconnect in a short period of time, I could no longer reach this box from outside. I had to drive here just to see what happened to the link. Apparently the ppp0 device was up but without an IP address. Neither was there a valid default route.

Closer examination of the syslog told me that all network-dependent services were stopped when the link was lost and restarted afterwards. This seems a bit of an overkill, as I'd expect the link to be reestablished any moment. OK, I have RC_NET_STRICT_CHECKING="yes" because otherwise I'd not get things booted or shut down in the correct order, at least that's what happened in the past.
One problem of restarting all services is that e.g. ntp would apparently lose some calibration information, according to http://ntp.isc.org/bugs/51#c23

Apparently pppd was busy reestablishing the connection while ip-down was still running. In the very last such cycle, the termination of ip-down and the start of ip-up are logged in the very same second. This makes me think of some kind of race condition. Could it be possible that pppd was already reestablishing the device while ip-down and the net.ppp0 init script called from it were still busy bringing the interface down? Could the result of such a situation be a device that is only partially configured?

I don't know how this should be fixed. One option would be to patch pppd to wait for ip-down to finish before setting up a new connection. Another option would be for ip-down not to call net.ppp0 stop at all and to do nothing instead. A third alternative would be to somehow shut down all services depending on net.ppp0, but leave the interface itself alone, as a terminating pppd would take care of that.

Here are some commands I executed while the link was still unavailable:

# ifconfig ppp0
ppp0      Link encap:Point-to-Point Protocol
          UP POINTOPOINT RUNNING NOARP MULTICAST  MTU:1492  Metric:1
          RX packets:1478 errors:0 dropped:0 overruns:0 frame:0
          TX packets:3 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:3
          RX bytes:94135 (91.9 Kb)  TX bytes:54 (54.0 b)

# route -n
Kernel IP routing table
Destination     Gateway         Genmask         Flags Metric Ref    Use Iface
192.168.70.2    0.0.0.0         255.255.255.255 UH    0      0        0 tun0
192.168.71.0    0.0.0.0         255.255.255.0   U     0      0        0 eth0
192.168.70.0    192.168.70.2    255.255.255.0   UG    0      0        0 tun0
127.0.0.0       0.0.0.0         255.0.0.0       U     0      0        0 lo

# ps -C pppd -o cmd | cat
CMD
/usr/sbin/pppd unit 0 persist maxfail 0 remotename ppp0 user X0056843@mdsl.mnet-online.de linkname ppp0 updetach debug noauth defaultroute usepeerdns ipcp-accept-remote ipcp-accept-local noipdefault holdoff 3 lcp-echo-interval 15 lcp-echo-failure 3 connect true plugin rp-pppoe.so eth1

# cat /etc/resolv.conf
# Generated by pppd for interface up
nameserver 212.18.3.5
nameserver 212.18.0.5

# cat /etc/ppp/resolv.conf
nameserver 212.18.3.5
nameserver 212.18.0.5
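
The symptom shown by the ifconfig output above (ppp0 flagged UP and RUNNING, but with no "inet addr:" line) can be checked for mechanically. The following is only a minimal sketch and not part of the original report: the function name has_inet_addr is hypothetical, and it assumes the net-tools ifconfig output format seen in this description.

```shell
#!/bin/sh
# Hypothetical helper (not from the report): reads the output of
# "ifconfig <iface>" on stdin and succeeds only if it contains an IPv4
# address line. Old net-tools prints "inet addr:a.b.c.d"; newer versions
# print "inet a.b.c.d". "inet6" lines are deliberately not matched.
has_inet_addr() {
    grep -Eq '^[[:space:]]*inet (addr:)?[0-9]'
}

# Possible use, for example from cron on the affected box (assumption):
#   ifconfig ppp0 | has_inet_addr || logger "ppp0 is up but has no IP address"
```

Such a check only detects the broken state; it does nothing about the race that leads to it.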

Comment 1 Martin von Gagern 2006-02-14 07:22:39 UTC

Created attachment 79759 [details]
syslog

This is a stripped down version of my syslog. It should contain all the messages from pppd as well as all the services started or stopped in the process. I believe this might be useful, but it's still 393 lines...

Comment 2 Roy Marples (RETIRED) gentoo-dev 2006-02-14 08:25:10 UTC

RC_NET_STRICT_CHECKING="yes" is doing its job taking down the services. This will not change.

If pppd is doing this:
link down -> /etc/init.d/net.ppp0 stop without checking if net.ppp0 is starting
link up -> /etc/init.d/net.ppp0 start without checking if net.ppp0 is stopping

then they won't work, as we don't allow stopping when starting and vice versa.
As you summarised, it's racing, and pppd or its ip-{up,down} scripts will need to be patched to stop this.

As such, the bug goes to net-dialup.

Comment 3 Alin Năstac (RETIRED) gentoo-dev 2006-02-16 04:08:10 UTC

My net connection at home is down. Maybe it will be restored today, but I cannot promise that.

Comment 4 Alin Năstac (RETIRED) gentoo-dev 2006-02-17 09:51:59 UTC

Created attachment 80019 [details]
Proposed ip-up script

IMO the Gentoo version of the ip-up and ip-down scripts should block each other by using some kind of "semaphore".

For instance, this file could be installed as ip-up script. Roy, what do you think? Do you see a more elegant way of doing it?
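
As an illustration of the "semaphore" idea only: the following is a minimal sketch, not the attached ip-up script. The lock path and the function names acquire_lock and release_lock are assumptions made up for this example. It relies on mkdir being atomic, so only one of the two scripts can hold the lock at a time.

```shell
#!/bin/sh
# Hypothetical lock directory (assumption, not taken from the attachment).
LOCKDIR="${LOCKDIR:-/var/run/ppp-ipupdown.lock}"

# Try to take the lock, retrying once per second.
# $1 is the maximum number of attempts (default 30).
acquire_lock() {
    tries="${1:-30}"
    while ! mkdir "$LOCKDIR" 2>/dev/null; do
        tries=$((tries - 1))
        # Give up once the attempts are used up.
        [ "$tries" -le 0 ] && return 1
        sleep 1
    done
    return 0
}

# Release the lock so the other script may proceed.
release_lock() {
    rmdir "$LOCKDIR" 2>/dev/null
}
```

Under this sketch, ip-up and ip-down would each call acquire_lock before touching net.ppp0 and release_lock when done. A stale lock left behind by a killed script is not handled here.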

Comment 5 Martin von Gagern 2006-02-17 09:58:27 UTC

(In reply to comment #4)
> IMO the Gentoo version of the ip-up and ip-down scripts should block each other by
> using some kind of "semaphore".

I don't think this solves the issue. As far as I can tell, pppd does take care not to run the scripts concurrently. At least the log looks that way. But the problem is that the device is not set up by ip-up, but rather by pppd itself. So it is rather a race of ip-down against the internal workings of pppd.

Comment 6 Alin Năstac (RETIRED) gentoo-dev 2006-02-17 12:53:29 UTC

Created attachment 80039 [details, diff]
Patch that forces pppd to wait for its childs before trying to reconnect

You're right and I'm wrong. 
Please try the attached patch and see if it solves the problem.

Comment 7 Alin Năstac (RETIRED) gentoo-dev 2006-02-21 14:49:43 UTC

Fixed in 2.4.3-r11.

Now pppd waits for its children when the PPP session is closed. This fix has an effect only when pppd is running in persist mode (in normal mode pppd was already set to wait for them before exiting).

Comment 8 Martin von Gagern 2006-05-06 03:58:32 UTC

Yesterday I experienced this bug again, with net-dialup/ppp-2.4.3-r14.
Reopening.

Comment 9 Martin von Gagern 2006-05-06 04:01:27 UTC

Created attachment 86250 [details]
syslog with ppp-2.4.3-r14

Comment 10 Alin Năstac (RETIRED) gentoo-dev 2006-05-06 04:17:17 UTC

This time it isn't pppd's fault.
As you can see, ip-up is launched after ip-down finishes doing its job.

Comment 11 Martin von Gagern 2006-05-06 06:31:25 UTC

(In reply to comment #10)

You are right. But it is still the scenario described in the summary:
I had the ppp0 interface up but without an IP address assigned.

The following two lines are in my opinion caused by the shutdown, although ip-down has already terminated:
> rc-scripts: Failed to set noip addresses to 0.0.0.0, error 255
> squid[7372]: Squid Parent: child process 7374 exited with status 0

And the following line indicates that some kind of concurrency probably still lies at the heart of the problem:
> rc-scripts: ERROR:  net.ppp0 is already stopping.

Looking at /etc/ppp/ip-down, I see the setting
> export IN_BACKGROUND="true"
before stopping the device. This suggests that the net script backgrounds some of its actions, which is probably a bad idea in any case (as we run ip-down.local concurrently), and most likely the cause of this instance of this bug.

Is this file under some kind of version control, to find out who put this line there and what the idea behind it was in the first place?

Comment 12 Alin Năstac (RETIRED) gentoo-dev 2006-05-06 09:56:13 UTC

I think the name IN_BACKGROUND misled you, but Roy (aka uberlord) is the maintainer of baselayout, so I'll leave the rebuttal to him.

The cvs web interface is available here: http://www.gentoo.org/cgi-bin/viewcvs.cgi/net-dialup/ppp/ .

Roy, can you understand what happened?

Comment 13 Roy Marples (RETIRED) gentoo-dev 2006-05-07 09:47:20 UTC

IN_BACKGROUND is set true when a background process calls us - it does not fork anything in the background that it would not otherwise do.

I don't see a baselayout error here.

Comment 14 Alin Năstac (RETIRED) gentoo-dev 2006-05-07 09:52:56 UTC

The baselayout errors are:
May  5 17:15:32 server rc-scripts: Failed to set noip addresses to 0.0.0.0, error 255
...
May  5 17:15:35 server rc-scripts: ERROR:  net.ppp0 is already stopping.

The first one doesn't even result from the ip-up/ip-down executions!?!

Comment 15 Roy Marples (RETIRED) gentoo-dev 2006-05-07 10:24:21 UTC

(In reply to comment #14)
> The baselayout errors are:
> May  5 17:15:32 server rc-scripts: Failed to set noip addresses to 0.0.0.0, error
> 255
> ...
> May  5 17:15:35 server rc-scripts: ERROR:  net.ppp0 is already stopping.

baselayout does not produce the first error message. Even though it's logged as "rc-scripts", any script that sources /sbin/functions.sh will show rc-scripts when it uses the einfo/ewarn/eerror functions. It could be produced when pppd calls net.ppp0 stop and then removes the ppp0 interface during this. We do check if the interface actually exists before setting its address to 0.0.0.0, so I'm 99% sure it's not baselayout in error here.

The 2nd error message is showing that something else - maybe pppd - called net.ppp0 stop in error. Only one thing can stop it at once, so this is not a baselayout error.

Comment 16 Martin von Gagern 2006-05-07 10:48:10 UTC

(In reply to comment #15)
> baselayout does not produce the first error message.

That one is created by the noip daemon from net-dns/noip-updater which manages my dynamic dns record. As noip depends on net and RC_NET_STRICT_CHECKING="yes" is set, stopping net.ppp0 causes noip to be stopped. Upon stopping that service, however, it tries to reset the registered IP address, which fails because the interface is already disconnected.

So it is OK that this is executed (at least if, according to comment #2, it is intentional to bring all services down even for a temporary connection loss), and even the error message is to be expected. But this script should have terminated by the time ip-down returns, which seems not to be the case.

> The 2nd error message is showing that something else - maybe pppd - called
> net.ppp0 stop in error. Only one thing can stop it at once, so this is not a
> baselayout error.

Could it be that pppd called "net.ppp0 start" while some part of "net.ppp0 stop" was still running? Or would that cause another error message?

Comment 17 Alin Năstac (RETIRED) gentoo-dev 2006-05-07 11:44:29 UTC

what /etc/ppp/ip-down script do you have?

Comment 18 Martin von Gagern 2006-05-07 13:50:56 UTC

(In reply to comment #17)
> what /etc/ppp/ip-down script do you have?

It is the ip-down script from net-dialup/ppp-2.4.3-r14, with MD5 sum d224887368bd3b15d96d5dc976bf58bf, which matches that from net-dialup/ppp/files/ip-down.baselayout in the current revision, 1.3.

Comment 19 Alin Năstac (RETIRED) gentoo-dev 2006-05-07 15:58:40 UTC

ah, I see ip-down script has been killed! do you use "child-timeout 15" by any chance?
ip-down should have completed /etc/init.d/net.ppp0 --quiet stop by then, but who knows...:-\

Comment 20 Martin von Gagern 2006-05-08 02:23:50 UTC

(In reply to comment #19)
> ah, I see ip-down script has been killed! do you use "child-timeout 15" by any
> chance?

I see ip-down running for 8 seconds, from 17:15:23 to 17:15:31. Where did you get the idea of timeout 15?

I got the command line from the pppd that was running after the bug occurred:

/usr/sbin/pppd unit 0 remotename ppp0 user <MyUserName> linkname ppp0 updetach debug noauth defaultroute usepeerdns ipcp-accept-remote ipcp-accept-local noipdefault holdoff 3 lcp-echo-interval 15 lcp-echo-failure 3 maxfail 0 persist connect true plugin rp-pppoe.so eth1

I see no child-timeout there. grep -r timeout /etc/ppp did not return any results, so I guess there is no such setting in any of my config files either.
Unless the setting is compiled in as default or inherited from lcp-echo-interval, I can see no way how this could be set.
I can see no information about the default value of child-timeout.

> ip-down should have completed /etc/init.d/net.ppp0 --quiet stop by then,
> but who knows...:-\

I often experience long waits when shutting down squid. So far I could not reproduce it well enough to file a bug report about it. But if this happened here, it might well have caused a longer execution time. As you can see, the squid termination message is the last message related to this bug in my syslog, 13 seconds after the initial call to ip-down.

Comment 21 Alin Năstac (RETIRED) gentoo-dev 2006-05-09 05:01:07 UTC

My mistake about those 15 seconds.

I've looked in the sources of ppp. The default child-timeout is 5 seconds, so you should either set it to a big enough value or set it to 0 (no timeout).
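
To make the numbers concrete: comment 20 reports ip-down running from 17:15:23 to 17:15:31, which is longer than the 5-second default mentioned here. The following minimal sketch just does that arithmetic on syslog-style timestamps. The function names to_secs and ran_longer_than are hypothetical, and the sketch assumes both timestamps fall on the same day.

```shell
#!/bin/sh
# Convert an HH:MM:SS timestamp into seconds since midnight.
to_secs() {
    # Split the fields on ":"; one leading zero is stripped from each
    # so that 08 and 09 are not rejected as invalid octal numbers.
    IFS=: read -r h m s <<EOF
$1
EOF
    echo $(( ${h#0} * 3600 + ${m#0} * 60 + ${s#0} ))
}

# Succeed if the run time exceeds the timeout.
# Arguments: start time, end time, timeout in seconds.
ran_longer_than() {
    [ $(( $(to_secs "$2") - $(to_secs "$1") )) -gt "$3" ]
}

# Times taken from comment 20, timeout from the ppp default of 5 seconds.
if ran_longer_than 17:15:23 17:15:31 5; then
    echo "ip-down ran longer than child-timeout"
fi
```

If that reading of the log is right, child-timeout would have to be larger than the longest expected ip-down run, or 0 to disable the timeout altogether.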

Squid stop operation is known to be a quite lengthy operation. Nothing we can do about it.

Why don't you set RC_NET_STRICT_CHECKING=lo in /etc/conf.d/rc? Theoretically there may be services that need restarting just because a new network interface popped up, but I don't know any such service. Surely squid isn't one of them (I'm the maintainer of squid).

Conclusion: baselayout failed to set net.ppp0 status to inactive because pppd has killed the ip-down script.
Solution: either increase child-timeout or reduce the list of things to do when the link goes up/down.

bug closed as FIXED mostly because the bug was FIXED before it was REOPENED. 
the part between REOPENED and this comment should be considered as INVALID.

Comment 22 Martin von Gagern 2006-05-09 05:25:27 UTC

(In reply to comment #21)
> Why don't you set RC_NET_STRICT_CHECKING=lo in /etc/conf.d/rc?

I once had RC_NET_STRICT_CHECKING=no, and as a result services were started in a strange order. I had ntp-date started before my internet connection was up, just because eth0 was up. I don't know if current behaviour is still like that; if so, I'll probably file another bug about that. Anyway, that was the original reason why I changed to RC_NET_STRICT_CHECKING=yes.

> bug closed as FIXED mostly because the bug was FIXED before it was REOPENED. 
> the part between REOPENED and this comment should be considered as INVALID.

OK, I agree. Thanks anyway.