After large amounts (approx. 2GB) of traffic from one vpn leaf to another the openvpn hub server stops responding on the network but remains running otherwise. No apparent indication in the logs of what the cause was.

Reproducible: Always

Steps to Reproduce:
I have tried it 2 times so far. I've been copying multiple gigabytes of data via sshfs-fuse (mounted on one leaf from another so not involving the hub on other than the udp/vpn level) at a ~3Mbps rate from one vpn leaf through the hub to another vpn leaf.

Actual Results:
Both times after approx. 2GB transferred the openvpn hub server simply stopped responding on the network and remained this way, unreachable via ssh or anything else. It was otherwise running well and I found no explanation in the logs. There was the following in the OpenVPN hub log (VPN traffic is on a tap device and a >1024 UDP port) referring to it not able to reach the other VPN peers and various similar errors in the syslog resulting from no reachable nameservers, etc.:

Tue May 20 04:51:59 2008 read UDPv4 [EHOSTUNREACH|EHOSTUNREACH]: No route to host (code=113)
Tue May 20 04:52:09 2008 read UDPv4 [EHOSTUNREACH|EHOSTUNREACH]: No route to host (code=113)
Tue May 20 04:52:19 2008 read UDPv4 [EHOSTUNREACH|EHOSTUNREACH]: No route to host (code=113)
Tue May 20 04:52:29 2008 read UDPv4 [EHOSTUNREACH|EHOSTUNREACH]: No route to host (code=113)
Tue May 20 04:52:39 2008 read UDPv4 [EHOSTUNREACH|EHOSTUNREACH]: No route to host (code=113)
Tue May 20 04:52:49 2008 read UDPv4 [EHOSTUNREACH|EHOSTUNREACH]: No route to host (code=113)
Tue May 20 04:52:59 2008 read UDPv4 [EHOSTUNREACH|EHOSTUNREACH]: No route to host (code=113)
Tue May 20 04:53:09 2008 read UDPv4 [EHOSTUNREACH|EHOSTUNREACH]: No route to host (code=113)

Second time:

Wed May 21 03:00:24 2008 read UDPv4 [EHOSTUNREACH|EHOSTUNREACH]: No route to host (code=113)
Wed May 21 03:00:34 2008 read UDPv4 [EHOSTUNREACH|EHOSTUNREACH]: No route to host (code=113)
Wed May 21 03:00:44 2008 read UDPv4 [EHOSTUNREACH|EHOSTUNREACH]: No route to host (code=113)
Wed May 21 03:00:54 2008 read UDPv4 [EHOSTUNREACH|EHOSTUNREACH]: No route to host (code=113)
Wed May 21 03:01:04 2008 read UDPv4 [EHOSTUNREACH|EHOSTUNREACH]: No route to host (code=113)
Wed May 21 03:01:12 2008 read UDPv4 [EHOSTUNREACH]: No route to host (code=113)
Wed May 21 03:01:15 2008 read UDPv4 [EHOSTUNREACH]: No route to host (code=113)

Both times, after a graceful reboot everything worked well again. The hub server is remote (approx. 250km) so I could not log in to the console to check the interfaces, routing, etc. I will try to set up some facilities to gather some more useful data on this as it will probably happen again if I trigger it.

I've been asking the ISP personnel and they claimed that there was no event on the switch port the hub server is on and they are not doing firewalling or anything like that. The openvpn port (and many other services) is restricted to a few IP ranges with iptables. From the symptoms I don't think it has an external cause like DoS or something.

hardware: HP DL360-G4 (Tigon3 GbE)
kernel versions tested: 2.6.22-gentoo-r9, 2.6.23-gentoo-r9
openvpn versions tested: 2.0.6-r2, 2.0.7
iptables version: 1.3.8-r3
Did you report it to upstream?
> Did you report it to upstream?

No, not yet... I'm trying to reproduce it again and gather some more useful data about what exactly is causing this. I don't know whether this is a kernel (networking or driver) bug, an openvpn bug, or something else. In fact, a few hours ago my VMware Remote Console traffic triggered the bug again after some hours of remote desktop work, but I had not yet prepared anything to capture diagnostic information because I'm very overloaded with my job today. Guess how happy I was when it happened again in the middle of a rush... I hope that later this evening (I'm GMT+2) I can make these preparations and capture data at the next "crash" or whatever it is.
Last night I wrote a script that logs various system parameters to a text file every 10 minutes, and then artificially triggered the bug by transferring lots of data through the VPN. Again, the bug was triggered after around 2GB had been transferred.

Here is what the script does:
- pings the first hop gateway both normally and by "ping -r"
- does "ifconfig -a"
- does "netstat -rn"
- does "netstat -lpn"
- does "iptables -t filter -L -n -v -x" and the same for "-t nat"
- does "ps -eFlL"
- does "df -k"

I did manage to catch the parameters above at the time of the "crash":
- no ping response from any host, not even with "ping -r"
- ifconfig says interface is up, error counters at 0, packet counters normal
- routing table is unaltered, all necessary routes are in place
- no change in listening sockets
- iptables rules are intact, packet counters increased normally
- processes seem normal
- mounts in place, no major change in free space, no filled up fs

So basically I haven't learned much... What else should I check for? Please note that this is a production server (dns, web, mail, vpn) of our company, which is why I haven't disclosed the above logs as-is. It also means that I don't want to trigger this bug too many more times, so it is essential to work out a method that gathers useful diagnostic information the next time.
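The collection steps listed above can be sketched as a small shell script. This is only an illustration of the idea, not the script actually used: the gateway address 192.0.2.1, the log path /var/log/netdiag.log, and the function names run_logged and snapshot are all assumptions made for the example.

```shell
#!/bin/sh
# Hypothetical settings: replace with the real first hop gateway and log path.
GATEWAY="${GATEWAY:-192.0.2.1}"
LOGFILE="${LOGFILE:-/var/log/netdiag.log}"

# Run one command and print a header, its output (stdout and stderr),
# and its exit status, so failures are visible in the log.
run_logged() {
    rl_status=0
    printf '=== %s ===\n' "$*"
    "$@" 2>&1 || rl_status=$?
    printf 'exit status: %s\n' "$rl_status"
}

# Take one snapshot of the parameters listed in the comment above.
snapshot() {
    printf '##### snapshot at %s #####\n' "$(date '+%Y-%m-%d %H:%M:%S')"
    run_logged ping -c 3 -W 2 "$GATEWAY"      # normal ping
    run_logged ping -r -c 3 -W 2 "$GATEWAY"   # bypass the routing table
    run_logged ifconfig -a
    run_logged netstat -rn
    run_logged netstat -lpn
    run_logged iptables -t filter -L -n -v -x
    run_logged iptables -t nat -L -n -v -x
    run_logged ps -eFlL
    run_logged df -k
}

# Only collect when explicitly asked, so the file can also be sourced.
if [ "${1:-}" = "--run" ]; then
    snapshot >> "$LOGFILE"
fi
```

Assuming the script is saved as /usr/local/sbin/netdiag.sh (a hypothetical path), a crontab entry such as "*/10 * * * * /usr/local/sbin/netdiag.sh --run" would take a snapshot every 10 minutes. Recording each command's exit status makes it easier to tell an empty result from a command that failed.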
Does it work again when you restart the openvpn daemon?
My bet is no... but a couple of days ago I modified the diagnostic script to restart openvpn when the first hop gateway becomes unreachable. I haven't artificially triggered the bug because we had to do meaningful work on the server, but when it happens again we will probably learn the answer. If anybody has more ideas about what to check for, please let me know ASAP! (/proc or /sys dumps, or whatever else is available from a shell script run from crontab)
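A minimal sketch of such a watchdog, under stated assumptions: the gateway address 192.0.2.1, the output directory /var/log/netdiag, the restart command "/etc/init.d/openvpn restart", and all function names are hypothetical. The /proc files it copies (interface counters, ARP table, protocol counters, softnet statistics, interrupts) are only a guess at what might be worth capturing, not a confirmed list of what reveals this bug.

```shell
#!/bin/sh
# Hypothetical settings: adjust to the real gateway, paths and init script.
GATEWAY="${GATEWAY:-192.0.2.1}"
OUTDIR="${OUTDIR:-/var/log/netdiag}"
RESTART_CMD="${RESTART_CMD:-/etc/init.d/openvpn restart}"

# Succeeds if the given host answers at least one of three pings.
gateway_reachable() {
    ping -c 3 -W 2 "$1" >/dev/null 2>&1
}

# Turn a path like /proc/net/dev into a flat file name (_proc_net_dev).
proc_to_name() {
    echo "$1" | tr / _
}

# Copy kernel networking state and the kernel log into directory $1.
dump_state() {
    mkdir -p "$1" || return 1
    for f in /proc/net/dev /proc/net/arp /proc/net/snmp \
             /proc/net/softnet_stat /proc/interrupts; do
        if [ -r "$f" ]; then
            cat "$f" > "$1/$(proc_to_name "$f")"
        fi
    done
    dmesg > "$1/dmesg" 2>&1 || true
    return 0
}

# If the gateway is unreachable, save state first, then restart openvpn.
watchdog() {
    if gateway_reachable "$GATEWAY"; then
        return 0
    fi
    dump_state "$OUTDIR/$(date '+%Y%m%d-%H%M%S')"
    $RESTART_CMD
}

# Only act when explicitly asked, so the file can also be sourced.
if [ "${1:-}" = "--run" ]; then
    watchdog
fi
```

The state is dumped before the restart so that the files reflect the failed condition rather than whatever the restart changes. The ARP table is included because a "No route to host" error can also come from failed neighbour resolution, which the routing table alone would not show.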
The bug was triggered again... and NO, restarting OpenVPN does not make any difference; the server remains unreachable. To me, this suggests that it is more likely a kernel bug than an OpenVPN bug...
I have the same problem. We are working behind NAT. If my colleague and I try to connect to the openvpn server (over UDP) simultaneously, we get an error:

Fri Feb 13 17:39:45 2009 LZO compression initialized
Fri Feb 13 17:39:45 2009 UDPv4 link local (bound): [undef]:1194
Fri Feb 13 17:39:45 2009 UDPv4 link remote: 80.90.119.6:1194
Fri Feb 13 17:39:45 2009 read UDPv4 [EHOSTUNREACH]: No route to host (code=113)
Fri Feb 13 17:39:48 2009 read UDPv4 [EHOSTUNREACH]: No route to host (code=113)
^CFri Feb 13 17:39:50 2009 event_wait : Interrupted system call (code=4)
Fri Feb 13 17:39:50 2009 SIGINT[hard,] received, process exiting
The same problem here. The affected host is a client of an OpenVPN network. All other hosts seem to work fine; at least I haven't noticed any problems so far.

Some more information about the affected host:
sys-kernel/gentoo-sources-2.6.28-r2
net-misc/openvpn-2.0.7-r2
The host is running as a KVM VM using the virtio network drivers.

The log messages from openvpn.log:

Sun Feb 22 16:11:05 2009 read UDPv4 [EHOSTUNREACH]: No route to host (code=113)
Sun Feb 22 16:11:15 2009 read UDPv4 [EHOSTUNREACH]: No route to host (code=113)
Sun Feb 22 16:11:25 2009 read UDPv4 [EHOSTUNREACH]: No route to host (code=113)
Sun Feb 22 16:11:36 2009 read UDPv4 [EHOSTUNREACH]: No route to host (code=113)
Sun Feb 22 16:12:57 2009 read UDPv4 [EHOSTUNREACH]: No route to host (code=113)
Sun Feb 22 16:13:05 2009 read UDPv4 [EHOSTUNREACH|EHOSTUNREACH|EHOSTUNREACH]: No route to host (code=113)
Sun Feb 22 16:13:08 2009 read UDPv4 [EHOSTUNREACH|EHOSTUNREACH|EHOSTUNREACH]: No route to host (code=113)
Sun Feb 22 16:13:11 2009 read UDPv4 [EHOSTUNREACH|EHOSTUNREACH|EHOSTUNREACH]: No route to host (code=113)

The OpenVPN config file of the affected host:

remote base001.acme.loc
remote base002.acme.loc
client
tls-client
port 1194
proto udp
dev tapmanagement
ca management/keys/ca.crt
cert management/keys/evsrv002-20090220@1450.crt
key management/keys/evsrv002-20090220@1450.key
dh management/keys/dh1024.pem
comp-lzo
user nobody
group nobody
persist-key
persist-tun
status /var/log/openvpn/management/status.log
log /var/log/openvpn/management/openvpn.log
log-append /var/log/openvpn/management/openvpn.log
verb 3
ns-cert-type server

I haven't been able to find out yet what causes this, as everything else works fine when accessing the host directly without the VPN.
I've just upgraded to the latest unstable (~amd64) release, net-misc/openvpn-2.1_rc15, on the affected client, and the problem has not recurred so far. Could you check whether this works for you too? If it does, we could request that 2.1_rc15 go stable.
Unfortunately, the problem occurred again with 2.1_rc15 after a few minutes. So upgrading to 2.1_rc15 does not resolve this issue ;-(
Has any of you solved this problem yet? Any news here? I'm still fighting with it on 2 VMs.
I recommend you try to figure this out with upstream. If it turns out to be a distribution issue, we can fix that later. Closing for now.