481970 – net-misc/netifrc - IPv6 binding will fail, because IPv6 is not yet ready

Summary: net-misc/netifrc - IPv6 binding will fail, because IPv6 is not yet ready

Status:	RESOLVED FIXED

Alias:	None

Product:	Gentoo Hosted Projects
Classification:	Unclassified
Component:	netifrc
Hardware:	All Linux

Importance:	Normal normal
Assignee:	netifrc Team

URL:
Whiteboard:
Keywords:

Depends on:
Blocks:

Reported:	2013-08-21 14:10 UTC by Thomas Deutschmann (RETIRED)
Modified:	2013-08-24 01:22 UTC
CC List:	0 users

See Also:
Package list:
Runtime testing required:	---

Description Thomas Deutschmann (RETIRED) gentoo-dev

2013-08-21 14:10:22 UTC

Hi,

I think this is OpenRC/netifrc related and not an nginx-specific issue. nginx is just an example service.

In my nginx configuration I have a server block which binds to an IPv6 address:

server {
	listen          192.168.0.22:80;
	listen		[2001:db8:0:0:8d3::22]:80;
	
	# ...
}

(Note: These are dummy addresses from Wikipedia)

Because I am binding nginx to specific addresses, I set rc_need="!net net.eth0" in "/etc/conf.d/nginx" to make sure that these addresses are available when nginx starts.

But this doesn't work. On boot, from rc.log:

rc default logging started at Wed Aug 21 15:35:33 2013

 * Bringing up interface eth0
 *   192.168.0.22/24 ...
 [ ok ]
 *   2001:db8:0:0:8d3::22/80 ...
 [ ok ]
 *   Adding routes
 *     default via 192.168.0.1 ...
 [ ok ]
 *     default via 2001:db8:0:0:8d3::1 ...
 [ ok ]
 * Starting rsyslogd ...
 [ ok ]
  * Starting fcron ...
 [ ok ]
 * Mounting network filesystems ...
 [ ok ]
 * Checking nginx' configuration ...
nginx: [emerg] bind() to [2001:db8:0:0:8d3::22]:80 failed (99: Cannot assign requested address)
nginx: configuration file /etc/nginx/nginx.conf test failed
nginx: the configuration file /etc/nginx/nginx.conf syntax is ok
nginx: [emerg] bind() to [2001:db8:0:0:8d3::22]:80 failed (99: Cannot assign requested address)
nginx: configuration file /etc/nginx/nginx.conf test failed
 * failed, please correct errors above
 [ !! ]
 * ERROR: nginx failed to start
 
 ...

If I run "/etc/init.d/nginx start" immediately once the system is up, nginx is able to start. So I guess there is nothing wrong with my configuration.

The same happens when I restart net.eth0:

OpenRC stops nginx and the other services depending on net.eth0, restarts net.eth0 and then tries to start the previously stopped dependent services again (including nginx). But nginx fails again. If I wait a few seconds, I can start nginx manually without a problem.


So the problem is that the IPv6 address isn't ready yet when OpenRC considers net.eth0 to be up...

Software involved:
- sys-apps/openrc-0.12
- net-misc/netifrc-0.1
- www-servers/nginx-1.4.1-r5



# emerge --info
Portage 2.2.0 (default/linux/amd64/13.0, gcc-4.7.3, glibc-2.17, 3.10.9 x86_64)
=================================================================
System uname: Linux-3.10.9-x86_64-Intel-R-_Xeon-R-_CPU_E5405_@_2.00GHz-with-gentoo-2.2
KiB Mem:    16435688 total,  16002372 free
KiB Swap:    4194300 total,   4194300 free
Timestamp of tree: Wed, 21 Aug 2013 11:45:01 +0000
ld GNU ld (GNU Binutils) 2.23.2
distcc 3.1 x86_64-pc-linux-gnu [disabled]
app-shells/bash:          4.2_p45
dev-lang/python:          2.7.5-r2, 3.2.5-r2, 3.3.2-r2
dev-util/pkgconfig:       0.28
sys-apps/baselayout:      2.2
sys-apps/openrc:          0.12
sys-apps/sandbox:         2.6-r1
sys-devel/autoconf:       2.69
sys-devel/automake:       1.10.3, 1.13.4, 1.14
sys-devel/binutils:       2.23.2
sys-devel/gcc:            4.7.3
sys-devel/gcc-config:     1.8
sys-devel/libtool:        2.4.2
sys-devel/make:           3.82-r4
sys-kernel/linux-headers: 3.10 (virtual/os-headers)
sys-libs/glibc:           2.17

ABI="amd64"
ABI_X86="64"
ACCEPT_KEYWORDS="amd64 ~amd64"

Reproducible: Always

Comment 1 Thomas Deutschmann (RETIRED) gentoo-dev

2013-08-21 14:35:29 UTC

OK, it has something to do with the tentative state:

I added some debugging code to _iproute2_ipv6_tentative() so that I can see when this function is called. It got called once but didn't do the tentative check, because the _has_carrier check prevents it.

If I remove the _has_carrier check, I see 6 _iproute2_ipv6_tentative calls and after that, everything is working for me.

So now I have to check why _has_carrier will prevent this check on my system.
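For reference, the tentative state can be looked at by hand. While duplicate address detection is still running for an IPv6 address, "ip addr show" flags it as tentative, and a bind() to it fails with errno 99 as in the nginx log above. A minimal sketch, using the same grep pattern as iproute2.sh; the function name has_tentative_inet6 and the sample lines are made up for illustration:

```shell
# Reads `ip addr show` output on stdin and succeeds if any inet6 address
# is still flagged tentative. The function name is hypothetical.
has_tentative_inet6()
{
	grep -q "^[[:space:]]*inet6 .* tentative"
}

# Sample lines in the shape `ip addr show` prints them (dummy addresses).
sample_tentative='    inet6 2001:db8:0:0:8d3::22/80 scope global tentative
       valid_lft forever preferred_lft forever'
sample_ready='    inet6 2001:db8:0:0:8d3::22/80 scope global
       valid_lft forever preferred_lft forever'

printf '%s\n' "$sample_tentative" | has_tentative_inet6 && echo "still tentative"
printf '%s\n' "$sample_ready" | has_tentative_inet6 || echo "ready"
```

On a live system the input would come from something like: LC_ALL=C ip addr show dev eth0 | has_tentative_inet6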

Comment 2 Andreas Steinmetz 2013-08-21 15:17:48 UTC

The problem is /lib64/netifrc/net/iproute2.sh:


_iproute2_ipv6_tentative()
{
        # Only check tentative when we have a carrier.
        _has_carrier || return 1
                        ^^^^^^^^ <= BUG!    
        LC_ALL=C ip addr show dev "${IFACE}" | \
                grep -q "^[[:space:]]*inet6 .* tentative"
}

It must return 0 instead of 1 if "no carrier", otherwise it will never wait for IPv6 addresses to become ready on all modern systems. There is some time between IF up and link activation and thus this script happily assumes that there is no link and carries on.
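A sketch of what that change would look like. The _has_carrier and ip functions below are stand-ins (driven by the made-up variables CARRIER and IP_OUTPUT) so the snippet runs outside netifrc; only the "return 0" line is the change proposed here:

```shell
# Hypothetical stand-ins so the sketch runs on its own; in netifrc,
# _has_carrier comes from the interface module and ip is iproute2.
CARRIER=${CARRIER:-no}
IP_OUTPUT=${IP_OUTPUT:-}
_has_carrier() { [ "$CARRIER" = yes ]; }
ip() { printf '%s\n' "$IP_OUTPUT"; }

_iproute2_ipv6_tentative()
{
	# No carrier yet: answer "still tentative" (0) so the caller keeps waiting.
	_has_carrier || return 0
	LC_ALL=C ip addr show dev "${IFACE}" | \
		grep -q "^[[:space:]]*inet6 .* tentative"
}
```

With this variant the caller's wait loop keeps polling while the link is down, so how long it blocks without a cable depends entirely on that loop's own counter.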

There is more problematic code in this script. _add_route() doesn't wait for IPv6 addresses either, so adding any route like "fc00::/7 via <address> metric <value> dev <device> src <interface-address>" will simply fail, even if the first bug is fixed.

Actually I do wonder if this script was ever tested...

Comment 3 Ian Stakenvicius (RETIRED) gentoo-dev

2013-08-21 17:59:14 UTC

This seems like it might relate....

http://forums.gentoo.org/viewtopic-t-891602-start-0.html


..adding a second part to this issue, in that the 'global scope' might matter for whether or not we need to wait for tentative status to clear?

Comment 4 Thomas Deutschmann (RETIRED) gentoo-dev

2013-08-21 18:06:26 UTC

Full configuration and logs as requested in #gentoo:

# cat /etc/conf.d/net
dns_servers="
	127.0.0.1
	::1
"

modules_eth0="iproute2"
config_eth0="
        192.168.0.22/24
"

modules_eth1="iproute2"
config_eth1="
	89.xxx.xxx.100/32
	2a00:xxxx:xxxx:xxxx::100/80
"
routes_eth1="
	89.xxx.xxx.96/27 dev eth1 src 89.xxx.xxx.100
	default via 89.xxx.xxx.97
	2a00:xxxx:xxxx:xxxx::/56
	default via 2a00:xxxx:xxxx:xxxx::1
"


# uptime
 19:29:19 up  3:54,  4 users,  load average: 0,02, 0,04, 0,05
# /etc/init.d/nginx status
 * status: started
# /etc/init.d/net.eth1 restart
 * Unmounting network filesystems ...                                     [ ok ]
 * Stopping nginx ...                                                     [ ok ]
 * Bringing down interface eth1
 * Bringing up interface eth1
 *   89.xxx.xxx.100/32 ...                                                [ ok ]
 *   2a00:xxxx:xxxx:xxxx::100/80 ...                                      [ ok ]
 *   Adding routes
 *     89.xxx.xxx.96/27 dev eth1 src 89.xxx.xxx.100 ...                   [ ok ]
 *     default via 89.xxx.xxx.97 ...                                      [ ok ]
 *     2a00:xxxx:xxxx:xxxx::/56 ...                                       [ ok ]
 *     default via 2a00:xxxx:xxxx:xxxx::1 ...                             [ ok ]
#  * Mounting network filesystems ...                                     [ ok ]
 * Checking nginx' configuration ...
nginx: [emerg] bind() to [2a00:xxxx:xxxx:xxxx::100]:80 failed (99: Cannot assign requested address)
nginx: configuration file /etc/nginx/nginx.conf test failed
nginx: the configuration file /etc/nginx/nginx.conf syntax is ok
nginx: [emerg] bind() to [2a00:xxxx:xxxx:xxxx::100]:80 failed (99: Cannot assign requested address)
nginx: configuration file /etc/nginx/nginx.conf test failed
 * failed, please correct errors above                                    [ !! ]
 * ERROR: nginx failed to start
# /etc/init.d/nginx start
 * Checking nginx' configuration ...                                      [ ok ]
 * Starting nginx ...                                                     [ ok ]


# dmesg

[...]

[14067.740063] bnx2 0000:1a:00.0: irq 68 for MSI/MSI-X
[14067.860021] bnx2 0000:1a:00.0 eth1: using MSI
[14067.860061] IPv6: ADDRCONF(NETDEV_UP): eth1: link is not ready
[14070.904924] bnx2 0000:1a:00.0 eth1: NIC Copper Link is Up, 1000 Mbps full duplex

[14070.905012] IPv6: ADDRCONF(NETDEV_CHANGE): eth1: link becomes ready


You really have to use a physical system to reproduce this problem, because a VM seems to be too fast.
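To put a number on it: the two ADDRCONF lines in the dmesg excerpt above bracket the window in which the interface is up but the link is not ready. A small sketch that computes the gap from those two timestamps (the helper name link_gap is made up):

```shell
# Timestamps in seconds since boot, copied from the dmesg lines above.
not_ready=14067.860061   # ADDRCONF(NETDEV_UP): link is not ready
ready=14070.905012       # ADDRCONF(NETDEV_CHANGE): link becomes ready

# Hypothetical helper: prints the difference $2 - $1 with millisecond precision.
link_gap()
{
	LC_ALL=C awk -v a="$1" -v b="$2" 'BEGIN { printf "%.3f\n", b - a }'
}

link_gap "$not_ready" "$ready"   # prints 3.045
```

So on this machine the link needs roughly three seconds to come up, which is much longer than a single carrier check takes.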

I hacked around a little bit (just for testing) and changed _iproute2_ipv6_tentative() to the following (basically I copied the loop from :

_iproute2_ipv6_tentative()
{
        local n=5
        
        einfo "Running tentative test..."
        # Only check tentative when we have a carrier.
        ebegin "Waiting for carrier"
        while [ $n -ge 0 ]; do
                _has_carrier || break
                einfo "No carrier yet..."
                sleep 1
                : $(( n -= 1 ))
        done
        [ $n -ge 0 ]
        eend $?
        LC_ALL=C ip addr show dev "${IFACE}" | \
                grep -q "^[[:space:]]*inet6 .* tentative"
}

This will result in the following output:

# /etc/init.d/net.eth1 restart
 * Unmounting network filesystems ...                                     [ ok ]
 * Stopping nginx ...                                                     [ ok ]
 * Bringing down interface eth1
 *   Caching network module dependencies
need firewalld
 * Bringing up interface eth1
 *   89.xxx.xxx.100/32 ...                                                [ ok ]
 *   2a00:xxxx:xxxx:xxxx::100/80 ...                                      [ ok ]
 *   Adding routes
 *     89.xxx.xxx.96/27 dev eth1 src 89.xxx.xxx.100 ...                   [ ok ]
 *     default via 89.xxx.xxx.97 ...                                      [ ok ]
 *     2a00:xxxx:xxxx:xxxx::/56 ...                                       [ ok ]
 *     default via 2a00:xxxx:xxxx:xxxx::1 ...                             [ ok ]
 *   Running tentative test...
 *   Waiting for carrier ...                                              [ ok ]
 *   Waiting for IPv6 addresses ...
 *   Running tentative test...
 *   Waiting for carrier ...                                              [ ok ]
 *   Running tentative test...
 *   Waiting for carrier ...                                              [ ok ]
 *   Running tentative test...
 *   Waiting for carrier ...                                              [ ok ]
 *   Running tentative test...
 *   Waiting for carrier ...
 *   No carrier yet...
 *   No carrier yet...
 *   No carrier yet...
 *   No carrier yet...
 *   No carrier yet...
 *   No carrier yet...                                                    [ ok ]
 * Mounting network filesystems ...                                       [ ok ]
 * Checking nginx' configuration ...                                      [ ok ]
 * Starting nginx ...                                                     [ ok ]

Ian in #gentoo came up with

diff --git a/net/iproute2.sh b/net/iproute2.sh
index 3bab7b7..ac30bd6 100644
--- a/net/iproute2.sh
+++ b/net/iproute2.sh
@@ -326,7 +326,7 @@ iproute2_post_start()
 		ip -6 route flush table cache dev "${IFACE}"
 	fi
 
-	if _iproute2_ipv6_tentative; then
+	if _wait_for_carrier; then
 		ebegin "Waiting for IPv6 addresses"
 		while [ $n -ge 0 ]; do
 			_iproute2_ipv6_tentative || break

This seems to be a better hack, but in the end we agreed that using the 'inactive' state would be a better solution.

In summary:
_iproute2_ipv6_tentative() doesn't do its check because of _has_carrier, which we only test once. Because the carrier takes some time to come up, it should be checked repeatedly (in a loop, as we do for the tentative state; don't forget a timeout or we will get a regression like http://dev.gentoo.org/~vapier/openrc/projects/openrc/ticket/195.html).
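A rough sketch of such a bounded carrier loop. This is not the _wait_for_carrier used in the diff above (its body isn't shown here); wait_for_carrier_sketch, the counting _has_carrier stub and CARRIER_AFTER are all made up for illustration:

```shell
# Stub for illustration only: reports a carrier once more than
# CARRIER_AFTER polls have happened. netifrc has its own _has_carrier.
polls=0
_has_carrier()
{
	polls=$(( polls + 1 ))
	[ "$polls" -gt "${CARRIER_AFTER:-3}" ]
}

# Hypothetical helper: poll once per second, give up after $1 seconds
# (default 5) so a missing cable cannot block the boot forever.
wait_for_carrier_sketch()
{
	local n=${1:-5}
	while [ "$n" -gt 0 ]; do
		_has_carrier && return 0
		sleep 1
		n=$(( n - 1 ))
	done
	# Last check after the final sleep; non-zero means the timeout expired.
	_has_carrier
}
```

The timeout is the important part: without it an unplugged interface would hang the start of every dependent service.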

Comment 5 Ian Stakenvicius (RETIRED) gentoo-dev

2013-08-22 15:45:28 UTC

Upon a fair bit of discussion with, and reviewing research (and there was a lot more than is listed above) dug up by Thomas D, I think the best route forward here is to drop the _has_carrier check from _iproute2_ipv6_tentative() (since it doesn't relate to what I think we want it to be, which is a check for whether a live cable is attached), and institute instead a dad_timeout option that can be configured per-interface (and defaults to 5s).

We could use carrier_timeout for this instead of a specific dad_timeout, but I think carrier_timeout has other uses as well (for instance, it would impact more than just ipv6 tentative if a user overrode it).  

Also, implementing the inactive state for this is probably overkill -- those that want the faster startups but keep dad_timeout will rc_parallel, and those that don't want to rc_parallel will probably set 'nodad' in the ipv6 portion of their config.
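As a sketch of how a per-interface dad_timeout with a 5s default could drive the wait loop: the variable naming dad_timeout_<iface>, the helper names and the countdown stub below are assumptions for illustration, not the code that went into netifrc.

```shell
# Hypothetical stand-in for netifrc's function of the same name:
# answers "tentative" (0) for the next TENTATIVE_POLLS calls, then 1.
_iproute2_ipv6_tentative()
{
	[ "${TENTATIVE_POLLS:=0}" -gt 0 ] || return 1
	TENTATIVE_POLLS=$(( TENTATIVE_POLLS - 1 ))
}

# Look up dad_timeout_<iface>, falling back to 5 seconds.
# The variable name is an assumption based on the proposal above.
dad_timeout_for()
{
	local ifvar=$1 value
	eval value=\${dad_timeout_${ifvar}:-5}
	echo "$value"
}

# Wait until no address is tentative any more, or until the timeout expires.
wait_for_dad_sketch()
{
	local n
	n=$(dad_timeout_for "$1")
	while [ "$n" -gt 0 ]; do
		_iproute2_ipv6_tentative || return 0
		sleep 1
		n=$(( n - 1 ))
	done
	! _iproute2_ipv6_tentative
}
```

Under that assumption, setting dad_timeout_eth1="10" in /etc/conf.d/net would give slow links more time, and since carrier state no longer plays a role, the wait is bounded by the timeout alone.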

Comment 6 Robin Johnson archtester

2013-08-24 01:22:06 UTC

InGit.