Hi,

On boot, while net.eth0 was starting, I ran into a dad_timeout. As a result, OpenRC marked the service as failed ("ERROR: net.eth0 failed to start") and blocked dependent services from starting ("ERROR: Cannot start sshd as net.eth0 would not start"; I think that's OK).

When I now try to (re)start the service, it fails with

RTNETLINK answers: File exists

because net.eth0 tries to set routes/gateways that were already set by the first (failed) attempt, the one that ran into the dad_timeout.

I cannot stop the service to clear the configuration, because it is marked as FAILED (= not started), and services that are not started cannot be stopped:

 * WARNING: net.eth0 is already stopped

=> I end up in a situation where I have to reset the interface manually (or reboot) if I want to use eth0.

Two ideas to fix that:

1) The network script should always reset the NIC before it tries to start it.
2) Add an extra "reset" command to the init script that resets the NIC, so that I can try to start the service again.
What do you propose the solution should be? I see two possibilities off the top of my head:

#1 - if there's a failure, somehow detect this and roll back all changes.
#2 - on startup, detect if something has already been applied (ie routes, etc) and do not attempt to re-apply.

I think #2 would be easier, but #1 would seem to be more 'proper' to me.

(We could also filter out certain errors and make them non-fatal, so the interface would still start up; not sure if that would be an answer in this case or not?)
Hi,

(In reply to Ian Stakenvicius from comment #1)
> #1 - if there's a failure, somehow detect this and roll back all changes.

I agree that this looks like the proper solution, but I doubt that it is possible at all:

1) I am not sure whether it is a deliberate feature that netifrc doesn't reset a network interface when it starts managing it. Normally, when you say "This interface should be controlled by XYZ", I would expect that you are OK with XYZ bringing the interface into a well-known start state so that any configuration will apply (IMHO NetworkManager does something like that). But if you want to support and/or keep any previous state when telling netifrc to take control, well... sometimes you may run into problems, because the configuration that is already set isn't compatible.

2) Rolling back what we have set sounds nice, but imagine a situation where you change something manually on an already started interface without using netifrc. If you now restart the interface, this may fail. And if netifrc only rolls back its own changes, the manually applied, incompatible changes are still there and will block netifrc. I am not sure how a rollback would help there, so I still see a need for a "reset" command.

> #2 - on startup, detect if something has already been applied (ie routes,
> etc) and do not attempt to re-apply.

Hmm... are we really able to do that without missing something? For example, when I start my interface and the start is reported as successful, I expect the interface to be in a specific state. What if we skip setting something up because we think it is already configured, while in fact it isn't or doesn't work...

> (We could also filter out certain errors and make them non-fatal, so the
> interface would still start up; not sure if that would be an answer in this
> case or not?)

No. As said before, I expect a specific state when I start my interface and it reports back "Yup, I am now started like you said".
For example, if a route is missing because RTNETLINK doesn't answer, this could be a problem. How would you solve that?

a) Restarting the interface
b) Applying manual changes (but because our stop script only undoes what start has set, we would end up with an 'orphan' route; BTW, that's how our stop script currently works).

If we want to keep the existing configuration when netifrc takes control, I think there is no way to improve the situation. Then this is something the user has to fix by hand, and all we can do is provide a "reset" command to help them.

If we don't care about the current state, we could bring the NIC into a known start state (e.g. by deleting everything that may be a problem). No need for a complex rollback mechanism or anything else.

But even if we don't care, I still see the need for a "reset" command that brings the entire network stack back into a known state, including policy rules for example. Currently, if you end up with a broken network, the only way to reset everything is to restart (e.g. you cannot simply flush the policy rules, because this also flushes the default rules. Restoring the default rules can be a pain if you don't remember them; I can delete link-local addresses, change scopes, MTU... do I remember everything?) :)

Just an excerpt of what has to be done:

ip link set dev eth0 down
ip link set dev eth0 up
ip addr flush dev eth0 scope global
ip addr flush dev eth0 scope site
ip addr flush dev eth0 scope host
ip rule flush
ip -4 rule add lookup main priority 32766
ip -4 rule add lookup default priority 32767
ip -6 rule add lookup main priority 32766
ip -6 rule add lookup default priority 32767
ip -4 -s -s route flush table main
ip -6 -s -s route flush table main
ip -4 route flush table cache dev eth0
ip -6 route flush table cache dev eth0

But you could have multiple tables, so we would have to loop through the tables... As said, just an example.
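To make the excerpt above easier to review per interface, here is a rough dry-run sketch. It only prints the same commands for a given interface instead of running them. The function name print_reset_cmds is made up for illustration and is not part of netifrc or OpenRC; the commands are taken unchanged from the excerpt, and additional routing tables are still not handled.

```shell
#!/bin/sh
# Hypothetical helper (not part of netifrc): print the reset commands from
# the excerpt for one interface. Nothing is executed here.
print_reset_cmds() {
    iface=$1
    if [ -z "$iface" ]; then
        echo "usage: print_reset_cmds <interface>" >&2
        return 1
    fi

    # Bounce the link.
    echo "ip link set dev $iface down"
    echo "ip link set dev $iface up"

    # Flush addresses for the scopes named in the excerpt.
    for scope in global site host; do
        echo "ip addr flush dev $iface scope $scope"
    done

    # Flush policy rules, then re-add the main/default lookups.
    echo "ip rule flush"
    for family in -4 -6; do
        echo "ip $family rule add lookup main priority 32766"
        echo "ip $family rule add lookup default priority 32767"
    done

    # Flush the main routing table; other tables are NOT covered.
    for family in -4 -6; do
        echo "ip $family -s -s route flush table main"
    done

    # Flush the per-interface route cache entries.
    for family in -4 -6; do
        echo "ip $family route flush table cache dev $iface"
    done
}
```

Calling print_reset_cmds eth0 lists the commands for review. Piping that output into a root shell would actually run them, with all the side effects discussed above (the rule and route flushes are not limited to one interface).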
I implemented most of #2 by checking whether an address/route already exists and simply ignoring the error when it is applied again. It's in place as of commit bf3ee524b605e6a78f5abeb0e6577ae8a9b16f0c (2016/10/23), which is unreleased so far.

Thanks to jmbsvicetto for reminding me of this: he hit it on his server, which brought me back to this bug.
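For readers who want to see the general idea, here is a minimal sketch of "ignore the error if the address/route already exists". It is not the code from the commit above, and the helper name add_idempotent is invented for this example. It assumes that the failing command reports the duplicate with a message containing "File exists", as in the RTNETLINK error quoted in the original report.

```shell
#!/bin/sh
# Hypothetical wrapper (not the netifrc implementation): run a command and
# treat a "File exists" error as success, since the address/route is
# already in place. Any other failure is passed through.
add_idempotent() {
    rc=0
    # Capture stderr only; stdout of the wrapped command is discarded.
    err=$("$@" 2>&1 >/dev/null) || rc=$?

    if [ "$rc" -eq 0 ]; then
        return 0
    fi

    case $err in
        *"File exists"*)
            # Already configured: not fatal.
            return 0
            ;;
    esac

    # Real failure: show the original message and keep the exit status.
    printf '%s\n' "$err" >&2
    return "$rc"
}
```

Usage would look like add_idempotent ip route add default via 192.0.2.1 dev eth0, where 192.0.2.1 is only a documentation address. The concern raised earlier in this bug still applies: an existing entry is accepted without checking that it matches what the configuration asked for.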