489398 – net-misc/netifrc - Unable to (re)start interface when failed

Status:	RESOLVED FIXED

Alias:	None

Product:	Gentoo Hosted Projects
Classification:	Unclassified
Component:	netifrc
Hardware:	All Linux

Importance:	Normal normal
Assignee:	netifrc Team

Reported:	2013-10-25 19:47 UTC by Thomas Deutschmann (RETIRED)
Modified:	2016-10-24 19:47 UTC

Description Thomas Deutschmann (RETIRED) gentoo-dev

2013-10-25 19:47:52 UTC

Hi,

On boot, while net.eth0 was starting, I ran into a dad_timeout. As a result, OpenRC marked the service as failed (ERROR: net.eth0 failed to start) and blocked dependent services from starting ("ERROR: Cannot start sshd as net.eth0 would not start"; I think that's OK).

When I now try to (re)start the service, it fails with

  RTNETLINK answers: File exists

because net.eth0 tries to set routes/gateways that were already set by the first (failed) attempt, the one that ran into the dad_timeout.

I cannot stop the service to clear the configuration, because it is marked as FAILED (= not started), and services that are not started cannot be stopped:

  * WARNING: net.eth0 is already stopped

=> I end up in a situation where I have to reset the interface manually (or reboot) if I want to use eth0.


Two ideas to fix that:
1) The network script should always reset the NIC it is trying to start on start.

2) Add an extra command "reset" to the initscript which will reset the NIC so that I can try to start the service again.
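
Idea 2 could be sketched roughly as follows. OpenRC service scripts can declare additional commands in the extra_commands variable and implement each one as a shell function of the same name. The reset() body is hypothetical and not part of netifrc; it assumes the script has the interface name in IFACE, and the IP variable exists only so the commands can be dry-run. It deliberately leaves policy rules and additional routing tables alone.

```shell
# Hypothetical addition to an interface service script; not part of netifrc.
# OpenRC would run reset() for "rc-service net.eth0 reset".
extra_commands="reset"

# Assumption: IP may be overridden (e.g. IP="echo ip") for a dry run.
: "${IP:=ip}"

reset()
{
	# Assumption: IFACE holds the interface name (e.g. eth0).
	[ -n "${IFACE:-}" ] || return 1

	# Drop routes and global addresses left behind by the failed start,
	# then take the link down so the next "start" begins from scratch.
	${IP} -4 route flush dev "${IFACE}"
	${IP} -6 route flush dev "${IFACE}"
	${IP} addr flush dev "${IFACE}" scope global
	${IP} link set dev "${IFACE}" down
}
```

Because the command is listed in extra_commands rather than extra_started_commands, it would remain available while the service is in the failed/stopped state, which is exactly the situation described above.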

Comment 1 Ian Stakenvicius (RETIRED) gentoo-dev

2013-10-28 12:57:20 UTC

What do you propose the solution should be?  I see two possibilities off the top of my head:

#1 - if there's a failure, somehow detect this and roll back all changes.  

#2 - on startup, detect if something has already been applied (ie routes, etc) and do not attempt to re-apply.

I think #2 would be easier, but #1 would seem to be more 'proper' to me.

(We could also filter out certain errors and make them non-fatal, so the interface would still start up; not sure if that would be an answer in this case or not?)
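
The "filter out certain errors" option could look something like the sketch below. run_ignore_exists is a hypothetical helper name; it matches on the "File exists" text quoted in this report, which is an assumption that breaks under a different locale or if the tool words the error differently.

```shell
# Hypothetical helper: run a command and treat the kernel's EEXIST reply
# ("RTNETLINK answers: File exists") as success. Any other failure is
# passed through with its original exit status.
run_ignore_exists()
{
	_rc=0
	# Capture stderr only; stdout of the command is discarded.
	_err="$("$@" 2>&1 >/dev/null)" || _rc=$?
	[ "${_rc}" -eq 0 ] && return 0

	case "${_err}" in
		*"File exists"*)
			return 0
			;;
	esac

	# Re-emit the captured error text for the caller's log.
	printf '%s\n' "${_err}" >&2
	return "${_rc}"
}
```

Usage would be along the lines of "run_ignore_exists ip route add default via 192.0.2.1 dev eth0" (the gateway address is a placeholder). Note that this only hides the error; it does not verify that the existing route is the one the configuration asked for.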

Comment 2 Thomas Deutschmann (RETIRED) gentoo-dev

2013-10-28 14:26:15 UTC

Hi,

(In reply to Ian Stakenvicius from comment #1)
> #1 - if there's a failure, somehow detect this and roll back all changes.

I agree with you that this looks like the proper solution, but I doubt that this is possible at all:

1) I am not sure whether it is a feature that netifrc doesn't reset a network interface when it starts managing it. E.g. when you say "This interface should be controlled by XYZ", I would normally expect that you are OK with XYZ bringing the interface into a well-known start state so that any configuration will apply (IMHO NetworkManager does something like that). But if you want to support and/or keep any previous state when telling netifrc to take control, well... sometimes you may run into problems, because the configuration that is already set isn't compatible.

2) Rolling back what we have set sounds nice, but imagine a situation where you change something manually on an already started interface without using netifrc. If you now restart the interface, this may fail. But if netifrc only rolls back its own changes, the manually set changes, which are incompatible, are still there and will block netifrc. Not sure how that will help, so I still see a need for a "reset" command.


> #2 - on startup, detect if something has already been applied (ie routes,
> etc) and do not attempt to re-apply.

Mh... are we really able to do that without missing something? For example, when I start my interface, I expect that, if the start was successful, the interface will be in a specific state. If it were possible that we skip setting something up because we think it is already configured, but it isn't or doesn't work... 

 
> (We could also filter out certain errors and make them non-fatal, so the
> interface would still start up; not sure if that would be an answer in this
> case or not?)

No, as said before, I expect a specific state when I start my interface and it reports back "Yup, I am now started like you said". For example, if a route is missing because RTNETLINK doesn't answer, this could be a problem. How would you solve that? a) Restarting the interface b) Applying manual changes (but because our stop script will only undo what start has set... we would end up with an 'orphan' route, and BTW, that's how our stop script currently works).


If we want to keep the existing configuration when netifrc takes control, I think there is no way to improve the situation. Then this is something the user has to fix by hand, and all we can do is provide a "reset" command to help with that.

If we don't care about the current state, we could bring the NIC into a known start state (e.g. by deleting everything that may be a problem). No need for a complex rollback mechanism or anything else.

But even if we don't care, I still see the need for a "reset" command that brings the entire network stack back into a known state, including policy rules for example. Currently, if you end up with a broken network, the only way to reset everything is to restart (e.g. you cannot flush policy rules, because this will also flush the default rules. Restoring the default rules can be a pain if you don't remember them; I can delete link-local addresses, change scopes, MTU... do I remember everything?) :)

Just an excerpt of what has to be done:

	ip link set dev eth0 down
	ip link set dev eth0 up
	ip addr flush dev eth0 scope global
	ip addr flush dev eth0 scope site
	ip addr flush dev eth0 scope host
	ip rule flush
	ip -4 rule add lookup main priority 32766
	ip -4 rule add lookup default priority 32767
	ip -6 rule add lookup main priority 32766
	ip -6 rule add lookup default priority 32767
	ip -4 -s -s route flush table main
	ip -6 -s -s route flush table main
	ip -4 route flush table cache dev eth0
	ip -6 route flush table cache dev eth0

But you could have multiple tables, so we would have to loop through the tables... as said, just an example.
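
Looping through the tables might be sketched like this. Both function names are hypothetical. The sketch assumes that "ip route show table all" tags every route outside the main table with a "table NAME" token, and it skips the "local" table because that one holds kernel-managed entries. Like the excerpt above, it flushes whole tables, so routes of other interfaces are removed too.

```shell
# Print the distinct routing tables named in "ip route show table all"
# output read from stdin. Routes in the main table carry no "table" token
# in that listing, so "main" is always emitted.
list_route_tables()
{
	{
		echo main
		awk '{ for (i = 1; i < NF; i++) if ($i == "table") print $(i + 1) }'
	} | sort -u
}

# Flush every table found, for IPv4 and IPv6, except "local".
# Assumption: IP may be overridden with a stub for a dry run.
flush_all_tables()
{
	for _family in -4 -6; do
		${IP:-ip} "${_family}" route show table all | list_route_tables |
		while read -r _table; do
			[ "${_table}" = "local" ] && continue
			${IP:-ip} "${_family}" route flush table "${_table}"
		done
	done
}
```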

Comment 3 Robin Johnson archtester

2016-10-24 19:47:44 UTC

I implemented most of #2 by checking whether an address/route already exists and simply ignoring the error when it is applied again.
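
For illustration only, a pre-check of this kind for addresses could be sketched as below. This is not the code from the commit; addr_in_listing and add_addr_once are hypothetical names. The comparison is purely textual, so an IPv6 address written differently from how "ip" prints it would not be recognised.

```shell
# Return 0 if address $1 (e.g. 192.0.2.10/24) appears after an "inet" or
# "inet6" token in "ip -o addr show" output read from stdin, else 1.
addr_in_listing()
{
	awk -v want="$1" '
		{
			for (i = 1; i < NF; i++)
				if (($i == "inet" || $i == "inet6") && $(i + 1) == want)
					found = 1
		}
		END { exit(found ? 0 : 1) }'
}

# Add address $2 to interface $1 unless it is already configured there.
# Assumption: IP may be overridden with a stub for a dry run.
add_addr_once()
{
	if ${IP:-ip} -o addr show dev "$1" | addr_in_listing "$2"; then
		return 0
	fi
	${IP:-ip} addr add "$2" dev "$1"
}
```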

It's in place as of commit bf3ee524b605e6a78f5abeb0e6577ae8a9b16f0c, 2016/10/23, which is unreleased so far.

Thanks to jmbsvicetto for reminding me of this; he hit it on his server, which brought me back to this bug.