223033 – net-misc/openvpn - VPN traffic disrupts networking in a strange way


Summary: net-misc/openvpn - VPN traffic disrupts networking in a strange way

Status:	RESOLVED UPSTREAM

Alias:	None

Product:	Gentoo Linux
Classification:	Unclassified
Component:	[OLD] Server
Hardware:	x86 Linux

Importance:	High normal
Assignee:	Cédric Krier


Reported:	2008-05-21 08:04 UTC by Rumi Szabolcs
Modified:	2010-08-23 13:20 UTC
CC List:	2 users


Description Rumi Szabolcs 2008-05-21 08:04:27 UTC

After large amounts (approx. 2GB) of traffic from one vpn leaf to another
the openvpn hub server stops responding on the network but remains running
otherwise. No apparent indication in the logs of what the cause was.

Reproducible: Always

Steps to Reproduce:
I have tried it twice so far. I copied multiple gigabytes of data via
sshfs-fuse (mounted on one leaf from another, so the hub is involved only at
the UDP/VPN level) at a rate of ~3Mbps from one VPN leaf through the hub to
another VPN leaf.

Actual Results:
Both times, after approx. 2GB had been transferred, the openvpn hub server
simply stopped responding on the network and stayed that way, unreachable
via ssh or anything else. It was otherwise running well and I found no
explanation in the logs. The OpenVPN hub log (VPN traffic is on a tap
device and a UDP port above 1024) contained the following entries, showing
that it could not reach the other VPN peers, and the syslog had various
similar errors caused by unreachable nameservers, etc.:

Tue May 20 04:51:59 2008 read UDPv4 [EHOSTUNREACH|EHOSTUNREACH]: No route to host (code=113)
Tue May 20 04:52:09 2008 read UDPv4 [EHOSTUNREACH|EHOSTUNREACH]: No route to host (code=113)
Tue May 20 04:52:19 2008 read UDPv4 [EHOSTUNREACH|EHOSTUNREACH]: No route to host (code=113)
Tue May 20 04:52:29 2008 read UDPv4 [EHOSTUNREACH|EHOSTUNREACH]: No route to host (code=113)
Tue May 20 04:52:39 2008 read UDPv4 [EHOSTUNREACH|EHOSTUNREACH]: No route to host (code=113)
Tue May 20 04:52:49 2008 read UDPv4 [EHOSTUNREACH|EHOSTUNREACH]: No route to host (code=113)
Tue May 20 04:52:59 2008 read UDPv4 [EHOSTUNREACH|EHOSTUNREACH]: No route to host (code=113)
Tue May 20 04:53:09 2008 read UDPv4 [EHOSTUNREACH|EHOSTUNREACH]: No route to host (code=113)

Second time:

Wed May 21 03:00:24 2008 read UDPv4 [EHOSTUNREACH|EHOSTUNREACH]: No route to host (code=113)
Wed May 21 03:00:34 2008 read UDPv4 [EHOSTUNREACH|EHOSTUNREACH]: No route to host (code=113)
Wed May 21 03:00:44 2008 read UDPv4 [EHOSTUNREACH|EHOSTUNREACH]: No route to host (code=113)
Wed May 21 03:00:54 2008 read UDPv4 [EHOSTUNREACH|EHOSTUNREACH]: No route to host (code=113)
Wed May 21 03:01:04 2008 read UDPv4 [EHOSTUNREACH|EHOSTUNREACH]: No route to host (code=113)
Wed May 21 03:01:12 2008 read UDPv4 [EHOSTUNREACH]: No route to host (code=113)
Wed May 21 03:01:15 2008 read UDPv4 [EHOSTUNREACH]: No route to host (code=113)
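
Log excerpts like the two above can be summarized mechanically. The following is only a minimal sketch, assuming the log keeps the timestamp layout shown in this report and that Python runs with English day and month names; the function names are illustrative and are not part of OpenVPN.

```python
from datetime import datetime

# Timestamp layout of the OpenVPN log lines quoted in this report,
# e.g. "Tue May 20 04:51:59 2008". Parsing %a/%b assumes an English locale.
STAMP_FORMAT = "%a %b %d %H:%M:%S %Y"


def parse_unreachable(line):
    """Return the timestamp of an EHOSTUNREACH log line, or None."""
    if "EHOSTUNREACH" not in line:
        return None
    # The timestamp is the first five whitespace-separated fields.
    stamp = " ".join(line.split()[:5])
    return datetime.strptime(stamp, STAMP_FORMAT)


def gaps_in_seconds(lines):
    """Seconds between consecutive EHOSTUNREACH entries."""
    stamps = [t for t in map(parse_unreachable, lines) if t is not None]
    return [int((b - a).total_seconds()) for a, b in zip(stamps, stamps[1:])]


if __name__ == "__main__":
    sample = [
        "Wed May 21 03:00:54 2008 read UDPv4 [EHOSTUNREACH|EHOSTUNREACH]: No route to host (code=113)",
        "Wed May 21 03:01:04 2008 read UDPv4 [EHOSTUNREACH|EHOSTUNREACH]: No route to host (code=113)",
        "Wed May 21 03:01:12 2008 read UDPv4 [EHOSTUNREACH]: No route to host (code=113)",
        "Wed May 21 03:01:15 2008 read UDPv4 [EHOSTUNREACH]: No route to host (code=113)",
    ]
    print(gaps_in_seconds(sample))  # prints [10, 8, 3]
```

The gaps only describe how often the error was logged; they say nothing about its cause.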

Both times, after a graceful reboot everything worked well again.


The hub server is remote (approx. 250km) so I could not log in to the console
to check the interfaces, routing, etc. I will try to set up some facilities
to gather some more useful data on this as it will probably happen again
if I trigger it.

I asked the ISP personnel and they said that there was no
event on the switch port the hub server is connected to and that they do
no firewalling or anything like that. The openvpn port (and many other
services) is restricted to a few IP ranges with iptables. From the
symptoms I don't think it has an external cause such as a DoS attack.

hardware: HP DL360-G4 (Tigon3 GbE)
kernel versions tested: 2.6.22-gentoo-r9, 2.6.23-gentoo-r9
openvpn versions tested: 2.0.6-r2, 2.0.7
iptables version: 1.3.8-r3

Comment 1 Cédric Krier gentoo-dev 2008-05-21 15:02:25 UTC

Did you report it to upstream?

Comment 2 Rumi Szabolcs 2008-05-21 16:04:12 UTC

> Did you report it to upstream?

No, not yet... I'm trying to reproduce it again and gather some
more useful data about what exactly is causing this. I don't
know whether this is a kernel (networking, or driver) bug,
an openvpn bug, or what...

In fact, a few hours ago my VMware Remote Console traffic
triggered the bug again after some hours of remote desktop work,
but I had not yet prepared anything to capture diagnostic information
because I'm overloaded at work today... Guess how
happy I was when it happened again in the middle of a rush...

I hope later in the evening (I'm GMT+2) I can make these
preparations and catch data at the next "crash" or whatever it is.

Comment 3 Rumi Szabolcs 2008-05-22 09:24:55 UTC

Last night I wrote a script that logs various system
parameters to a text file every 10 minutes and then artificially
triggered the bug by transferring lots of data through the VPN.
Again, the bug was triggered after around 2GB had been transferred.

Here is what the script does:

- pings the first hop gateway both normally and by "ping -r"
- does "ifconfig -a"
- does "netstat -rn"
- does "netstat -lpn"
- does "iptables -t filter -L -n -v -x" and the same for "-t nat"
- does "ps -eFlL"
- does "df -k"
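
The reporter's script was not posted. The following is only a minimal sketch of a snapshot collector that runs the commands listed above. The gateway address, the ping count of 3, the log path in the usage note and all function names are assumptions made for illustration.

```python
import subprocess
from datetime import datetime, timezone

# Hypothetical first-hop gateway (documentation address range), not from the report.
GATEWAY = "192.0.2.1"


def build_commands(gateway):
    """The commands listed in the comment, as argv lists."""
    return [
        ["ping", "-c", "3", gateway],        # "-c 3" is an assumption
        ["ping", "-r", "-c", "3", gateway],  # -r: bypass the routing table
        ["ifconfig", "-a"],
        ["netstat", "-rn"],
        ["netstat", "-lpn"],
        ["iptables", "-t", "filter", "-L", "-n", "-v", "-x"],
        ["iptables", "-t", "nat", "-L", "-n", "-v", "-x"],
        ["ps", "-eFlL"],
        ["df", "-k"],
    ]


def run_command(argv, timeout=30):
    """Run one command and return its combined output without raising."""
    try:
        done = subprocess.run(argv, stdout=subprocess.PIPE,
                              stderr=subprocess.STDOUT,
                              universal_newlines=True, timeout=timeout)
        return done.stdout
    except (OSError, subprocess.TimeoutExpired) as exc:
        return "failed: {}\n".format(exc)


def snapshot(gateway, runner=run_command):
    """Build one timestamped text snapshot of all command outputs."""
    now = datetime.now(timezone.utc).isoformat()
    parts = ["=== snapshot {} ===\n".format(now)]
    for argv in build_commands(gateway):
        parts.append("--- {} ---\n{}".format(" ".join(argv), runner(argv)))
    return "".join(parts)


def append_snapshot(path, gateway, runner=run_command):
    """Append one snapshot to a log file."""
    with open(path, "a") as log:
        log.write(snapshot(gateway, runner))
```

Calling `append_snapshot("/var/tmp/netdiag.log", GATEWAY)` from a cron job every 10 minutes would roughly match the schedule described above; the log path is a placeholder.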

I did manage to catch the parameters above at the time of the "crash":

- no ping response from any host, not even with "ping -r"
- ifconfig says interface is up, error counters at 0, packet counters normal
- routing table is unaltered, all necessary routes are in place
- no change in listening sockets
- iptables rules are intact, packet counters increased normally
- processes seem normal
- mounts in place, no major change in free space, no filled up fs

So basically I haven't learned too much... What else should I check for?

Please note that this is a production server (dns, web, mail, vpn) of our
company so this is why I haven't disclosed the above logs as-is and this
also means that I don't want to trigger this bug too many times in the
future so it would be essential to contrive some method to gather useful
diagnostic information the next time...

Comment 4 Cédric Krier gentoo-dev 2008-05-24 13:27:13 UTC

Does it work again when you restart the openvpn daemon?

Comment 5 Rumi Szabolcs 2008-05-30 19:46:26 UTC

My bet is no... but a couple of days ago I modified the diagnostic
script to restart openvpn when the first hop gateway becomes unreachable.
I haven't artificially triggered the bug because we had to do meaningful
work on the server, but when it happens again we will probably learn
more about this.
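
The modified script itself was not posted. The following is only a minimal sketch of such a watchdog step, assuming the Gentoo init script lives at /etc/init.d/openvpn and using a hypothetical gateway address; the function names and ping options are illustrative choices.

```python
import subprocess

# Hypothetical first-hop gateway (documentation address range), not from the report.
GATEWAY = "192.0.2.1"
# Assumed location of the OpenVPN init script on the hub.
RESTART_CMD = ["/etc/init.d/openvpn", "restart"]


def gateway_reachable(gateway, attempts=3):
    """True if at least one ping to the gateway gets a reply."""
    for _ in range(attempts):
        # -c 1: send one echo request, -W 2: wait up to 2 seconds for the reply
        result = subprocess.run(["ping", "-c", "1", "-W", "2", gateway],
                                stdout=subprocess.DEVNULL,
                                stderr=subprocess.DEVNULL)
        if result.returncode == 0:
            return True
    return False


def restart_openvpn():
    """Restart the OpenVPN service through its init script."""
    subprocess.run(RESTART_CMD)


def check_and_restart(gateway, reachable=gateway_reachable,
                      restart=restart_openvpn):
    """Restart OpenVPN only when the gateway is unreachable.

    Returns "ok" if the gateway answered, "restarted" otherwise.
    """
    if reachable(gateway):
        return "ok"
    restart()
    return "restarted"
```

The reachability check and the restart action are passed in as parameters so that the decision logic can be exercised without touching a production server; the returned value can be written to the diagnostic log.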

If anybody has more ideas about what to check for, please let me know ASAP!
(/proc or /sys dumps or whatever else is available to a shell script run
from crontab)

Comment 6 Rumi Szabolcs 2008-06-13 07:18:25 UTC

The bug triggered again... and NO, restarting OpenVPN does not make any
difference, the server remains unreachable.

To me, this suggests that this is more likely a kernel bug than an OpenVPN bug...

Comment 7 Stanislav 2009-02-13 14:53:10 UTC

I have the same problem.
We are working behind NAT. If my colleague and I try to connect to the openvpn server (via UDP) simultaneously, we get an error:
Fri Feb 13 17:39:45 2009 LZO compression initialized
Fri Feb 13 17:39:45 2009 UDPv4 link local (bound): [undef]:1194
Fri Feb 13 17:39:45 2009 UDPv4 link remote: 80.90.119.6:1194
Fri Feb 13 17:39:45 2009 read UDPv4 [EHOSTUNREACH]: No route to host (code=113)
Fri Feb 13 17:39:48 2009 read UDPv4 [EHOSTUNREACH]: No route to host (code=113)
^CFri Feb 13 17:39:50 2009 event_wait : Interrupted system call (code=4)
Fri Feb 13 17:39:50 2009 SIGINT[hard,] received, process exiting

Comment 8 Elias Probst 2009-02-22 16:18:18 UTC

The same problem here:

The affected host is a client of an OpenVPN network. All other hosts seem to work fine; at least I haven't noticed any problems so far.

Some more information about the affected host:
sys-kernel/gentoo-sources-2.6.28-r2
net-misc/openvpn-2.0.7-r2
The host is running as a KVM VM using the virtio network drivers.

The log messages from openvpn.log:
Sun Feb 22 16:11:05 2009 read UDPv4 [EHOSTUNREACH]: No route to host (code=113)
Sun Feb 22 16:11:15 2009 read UDPv4 [EHOSTUNREACH]: No route to host (code=113)
Sun Feb 22 16:11:25 2009 read UDPv4 [EHOSTUNREACH]: No route to host (code=113)
Sun Feb 22 16:11:36 2009 read UDPv4 [EHOSTUNREACH]: No route to host (code=113)
Sun Feb 22 16:12:57 2009 read UDPv4 [EHOSTUNREACH]: No route to host (code=113)
Sun Feb 22 16:13:05 2009 read UDPv4 [EHOSTUNREACH|EHOSTUNREACH|EHOSTUNREACH]: No route to host (code=113)
Sun Feb 22 16:13:08 2009 read UDPv4 [EHOSTUNREACH|EHOSTUNREACH|EHOSTUNREACH]: No route to host (code=113)
Sun Feb 22 16:13:11 2009 read UDPv4 [EHOSTUNREACH|EHOSTUNREACH|EHOSTUNREACH]: No route to host (code=113)

The OpenVPN config file of the affected host:
remote base001.acme.loc
remote base002.acme.loc
client
tls-client
port 1194
proto udp
dev tapmanagement
ca management/keys/ca.crt
cert management/keys/evsrv002-20090220@1450.crt
key management/keys/evsrv002-20090220@1450.key
dh management/keys/dh1024.pem
comp-lzo
user nobody
group nobody
persist-key
persist-tun
status /var/log/openvpn/management/status.log
log /var/log/openvpn/management/openvpn.log
log-append /var/log/openvpn/management/openvpn.log
verb 3
ns-cert-type server

I haven't yet found what causes this, as everything else works fine when accessing the host directly without the VPN.

Comment 9 Elias Probst 2009-02-24 14:04:34 UTC

I've just upgraded to the latest unstable (~amd64) release, net-misc/openvpn-2.1_rc15, on the affected client and the problem hasn't occurred again so far.

Could you check whether this works for you too? If so, we could request that 2.1_rc15 go stable.

Comment 10 Elias Probst 2009-02-24 17:42:49 UTC

Unfortunately, the problem occurred again with 2.1_rc15 after a few minutes.
So upgrading to 2.1_rc15 doesn't resolve this issue ;-(

Comment 11 Elias Probst 2009-08-25 15:25:36 UTC

Has any of you solved this problem yet? Any news here? I'm still fighting with it on 2 VMs.

Comment 12 Dirkjan Ochtman (RETIRED) gentoo-dev 2010-08-23 13:20:29 UTC

I recommend you try to figure this out with upstream. If it turns out to be a distribution issue, we can fix that later. Closing for now.