|Summary:||Odd behavior from ICMP apps(ping/tracepath) when MTU is 9000|
|Product:||Gentoo Linux||Reporter:||John Stewart <js>|
|Component:||[OLD] Core system||Assignee:||x86-kernel (DEPRECATED) <x86-kernel>|
|Package list:||Runtime testing required:||---|
Description John Stewart 2004-03-29 11:06:08 UTC
Summary: I've recently switched my network to gigabit. All the hardware is 9k jumbo frame clean. I can switch all the machines to a 9000 MTU and everything works great. But if I run tracepath on my Linux machine, it decides the path MTU should be 1492. However, if I set the MTU to 8996, or to any MTU of the form 8x+4, then tracepath works fine. I can trigger similar behavior using ping with the don't-fragment flag.

Details:

MTU is set to 9000 via:

ifconfig eth0 mtu 9000

I run ttcp to send a few 32768-byte buffers and check the size of the on-wire packets on the receiving end with ethereal. They're 9014 bytes, just like they're supposed to be.

Now I run tracepath and get the following output:

tracepath 192.168.1.100
1: 192.168.1.103 (192.168.1.103) 0.099ms pmtu 9000
1: 192.168.1.103 (192.168.1.103) 0.018ms pmtu 8166
1: 192.168.1.103 (192.168.1.103) 0.015ms pmtu 4352
1: 192.168.1.103 (192.168.1.103) 0.016ms pmtu 2002
1: 192.168.1.103 (192.168.1.103) 0.013ms pmtu 1492
1: 192.168.1.100 (192.168.1.100) 0.347ms reached
Resume: pmtu 1492 hops 1 back 1

For some reason the sending machine seems to have decided it has to fragment the packets, and from the output it appears that it is doing so before they reach the wire. From this point on the ttcp+ethereal test shows the on-wire packet size as 1506. This behavior persists until I change the MTU to some other number and then back to 9000, i.e. I think it is remembering the "discovered" MTU for that route and using it for all subsequent connections.

But if I set the MTU to 8996, or for that matter to any size between, say, 2k and 9k that is of the form size=8*x+4 (8996=8*1124+4), then tracepath works fine.
Example:

ifconfig eth0 mtu 8996
tracepath 192.168.1.100
1: 192.168.1.103 (192.168.1.103) 0.114ms pmtu 8996
1: 192.168.1.100 (192.168.1.100) 0.778ms reached
Resume: pmtu 8996 hops 1 back 1

Both before and after running tracepath, the ttcp+ethereal test shows the on-wire received packet size to be 9010.

Some info on the machines in question:

Sending machine:
motherboard: Tyan Tiger
processors: dual Opteron 240
NIC: onboard Intel gigabit
Running: Gentoo Linux with kernel 2.6.4-r1 for AMD64, using the e1000 driver that came with the kernel, with default settings.

Receiving machine:
motherboard: MSI Pro266-TD Master-LR
processors: dual Intel Tualatin PIIIs at 1.26 GHz
NIC: Intel PRO/1000 MT desktop adapter
Running: Windows 2000, using the 6.01.03 drivers from Intel.

Switch: SMC 8508T

And here are the results of additional testing using ping with the don't-fragment flag set.

Immediately after setting an MTU of 9000 I do the following:

ping -c 4 -M do -s 8968 192.168.1.100

which succeeds:

8976 bytes from 192.168.1.100: icmp_seq=1 ttl=128 time=1.09 ms

and then this:

ping -c 4 -M do -s 8970 192.168.1.100

which fails:

From 192.168.1.103 icmp_seq=1 Frag needed and DF set (mtu = 8166)

At this point the ttcp+ethereal test shows an on-wire packet size of 8180 bytes.

Trying again with the MTU at 8996:

ping -c 4 -M do -s 8968 192.168.1.100

succeeds:

8976 bytes from 192.168.1.100: icmp_seq=1 ttl=128 time=1.12 ms

while

ping -c 4 -M do -s 8970 192.168.1.100

fails:

From 192.168.1.103 icmp_seq=1 Frag needed and DF set (mtu = 8996)

But notice it hasn't changed the MTU.

Since the problem occurs with both ping and tracepath, I'd guess it's not happening at the application level, which leaves the network stack and the NIC driver as possible culprits.

Reproducible: Always
Steps to Reproduce: See Details.
Actual Results: See Details.
Expected Results: See Details.
Portage 2.0.50-r1 (default-amd64-2004.0, gcc-3.3.2, glibc-2.3.2-r9, 2.6.4-gentoo-r1)
=================================================================
System uname: 2.6.4-gentoo-r1 x86_64 5
Gentoo Base System version 126.96.36.199
Autoconf: sys-devel/autoconf-2.58
Automake: sys-devel/automake-1.8.2
ACCEPT_KEYWORDS="amd64"
AUTOCLEAN="yes"
CFLAGS="-O2 -pipe"
CHOST="x86_64-pc-linux-gnu"
COMPILER="gcc3"
CONFIG_PROTECT="/etc /usr/X11R6/lib/X11/xkb /usr/kde/2/share/config /usr/kde/3.1/share/config /usr/kde/3.2/share/config /usr/kde/3/share/config /usr/share/config /usr/share/texmf/dvipdfm/config/ /usr/share/texmf/dvips/config/ /usr/share/texmf/tex/generic/config/ /usr/share/texmf/tex/platex/config/ /usr/share/texmf/xdvi/ /var/qmail/control"
CONFIG_PROTECT_MASK="/etc/gconf /etc/env.d"
CXXFLAGS="-O2 -pipe"
DISTDIR="/usr/portage/distfiles"
FEATURES="autoaddcvs ccache sandbox"
GENTOO_MIRRORS="http://gentoo.oregonstate.edu http://distro.ibiblio.org/pub/Linux/distributions/gentoo"
MAKEOPTS="-j3"
PKGDIR="/usr/portage/packages"
PORTAGE_TMPDIR="/var/tmp"
PORTDIR="/usr/portage"
PORTDIR_OVERLAY=""
SYNC="rsync://rsync.gentoo.org/gentoo-portage"
USE="X aalib alsa amd64 apm arts avi berkdb cdr crypt cups doc dvd encode esd fbcon foomaticdb gdbm gif gnome gpm gtk gtk2 imlib jpeg kde libg++ libwww mad mikmod motif mozilla mpeg ncurses nls oggvorbis opengl oss pam pdflib perl png python qt quicktime readline samba sdl slang spell ssl tcltk tcpd tetex truetype usb videos wxwindows xinerama xml xml2 xmms xv zlib"
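The packet sizes quoted in the description can be cross-checked with a little arithmetic. The sketch below is only a sanity check, not part of the original report. It assumes a 14-byte Ethernet header (consistent with 9000 showing up as 9014 on the wire), a 20-byte IPv4 header without options, and an 8-byte ICMP echo header (consistent with `-s 8968` being answered as "8976 bytes"). The function names are made up for illustration.

```python
ETH_HEADER = 14   # Ethernet header, no FCS (assumption; matches 9000 -> 9014 in the report)
IP_HEADER = 20    # IPv4 header without options (assumption)
ICMP_HEADER = 8   # ICMP echo header (matches -s 8968 -> "8976 bytes from ...")


def on_wire_size(mtu):
    """Frame size a capture tool would show for a full-MTU packet."""
    return mtu + ETH_HEADER


def ping_ip_size(payload):
    """Total IP packet size produced by `ping -s payload`."""
    return payload + ICMP_HEADER + IP_HEADER


for mtu in (9000, 8996, 8166, 1492):
    print(f"mtu {mtu} -> on-wire {on_wire_size(mtu)}")

for payload in (8968, 8970):
    total = ping_ip_size(payload)
    print(f"-s {payload} -> {total} byte IP packet, "
          f"fits 9000: {total <= 9000}, fits 8996: {total <= 8996}")
```

Under these assumptions `-s 8970` produces an 8998-byte packet, which is smaller than 9000, so the "Frag needed" error at MTU 9000 is the anomalous result, while the same error at MTU 8996 is expected.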
Comment 1 John Stewart 2004-03-30 16:02:43 UTC
More data on this problem: I can also trigger similar behavior with MTUs below 1500.

mtu 1500:
1: 192.168.1.103 (192.168.1.103) 0.191ms pmtu 1500
1: 192.168.1.100 (192.168.1.100) 0.434ms reached
Resume: pmtu 1500 hops 1 back 1

mtu 1498, 1496 and 1494 all result in something like:
1: 192.168.1.103 (192.168.1.103) 0.160ms pmtu 1494
1: 192.168.1.103 (192.168.1.103) 0.015ms pmtu 1492
1: 192.168.1.100 (192.168.1.100) 0.212ms reached
Resume: pmtu 1492 hops 1 back 1

1492 succeeds.

1490, 1488 and 1486 all go into an infinite loop:
1: 192.168.1.103 (192.168.1.103) 0.155ms pmtu 1490
1: 192.168.1.103 (192.168.1.103) 0.014ms pmtu 576
1: 192.168.1.103 (192.168.1.103) 0.013ms pmtu 552
1: 192.168.1.103 (192.168.1.103) 0.011ms pmtu 552
This last line repeats over and over again.

1484 succeeds.

1484, 1492 and 1500 all fit the previously discovered 8x+4 pattern, being equal to 185, 186 and 187 * 8 + 4 respectively.
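As a quick check of the 8x+4 pattern against every MTU tested so far, here is a small sketch. The helper name `fits_8x_plus_4` is made up for illustration. One observation that may or may not be relevant: with a 20-byte IPv4 header, "MTU of the form 8x+4" is the same as "MTU minus the IP header is a multiple of 8", and 8 bytes is the unit in which IP fragment offsets are expressed. Whether that is the actual cause is an assumption, not something this report establishes.

```python
IP_HEADER = 20  # IPv4 header without options (assumption)


def fits_8x_plus_4(mtu):
    """True if mtu can be written as 8*x + 4 for some integer x."""
    return mtu % 8 == 4


# MTUs from the report, grouped by whether tracepath kept the configured value.
kept = [8996, 1500, 1492, 1484]
lowered = [9000, 1498, 1496, 1494, 1490, 1488, 1486]

for mtu in kept + lowered:
    payload_aligned = (mtu - IP_HEADER) % 8 == 0
    print(f"{mtu}: 8x+4={fits_8x_plus_4(mtu)}, "
          f"(mtu-20) multiple of 8={payload_aligned}")
```

Every MTU that tracepath left alone satisfies the pattern and every MTU it lowered does not, which is at least consistent with the hypothesis.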
Comment 2 John Stewart 2004-04-10 15:42:02 UTC
Additional testing shows that the weird PMTU behavior likely did not exist with the 2.4 kernel. I just did some testing with Knoppix, using a 2.4.22 or similar kernel.

Results:

root@ttyp0[home2]# ./ping -c 2 -M do -s 8974 192.168.1.110
PING 192.168.1.110 (192.168.1.110) 8974(9002) bytes of data.
ping: local error: Message too long, mtu=9000
ping: local error: Message too long, mtu=9000

It couldn't get through, but it didn't change the MTU in the process.

root@ttyp0[home2]# ./tracepath 192.168.1.110
1?: [LOCALHOST] pmtu 9000
1: 192.168.1.110 (192.168.1.110) 0.652ms reached
Resume: pmtu 9000 hops 1 back 1

9000 works just fine with tracepath rather than becoming 1492.

The above binaries were from a Gentoo filesystem, as Knoppix's ping doesn't have the don't-fragment switches and it doesn't appear to include tracepath at all. The NIC was another e1000; the driver was whatever came with Knoppix.
Comment 3 Jason Cox (RETIRED) 2004-04-16 18:09:16 UTC
I don't know what you're asking with this bug. Is this still a bug even?
Comment 4 John Stewart 2004-04-16 20:08:40 UTC
Yes, I'd characterize it as a bug. It has the potential to play havoc with the reliability/throughput of gigabit+ networks, especially since an MTU of 9000 seems to be evolving into a de facto standard for such networks for the time being, and 9000 in particular triggers this behavior. I'm not sure how serious it is, though, as so far the only way I've found to trigger it is to manually run programs that send ICMP packets (tracepath/ping). So far I'd say it's not a huge problem on a single-user machine, but on a multi-user machine any single unprivileged user can degrade network performance to any specific host simply by running tracepath.

I've played around with the kernel code a bit. The PMTU adjustment (and the source of the 8166, 4352, etc.) occurs in the function guess_mtu(), which is called by ip_rt_frag_needed() in net/ipv4/route.c. This in turn appears to be called by icmp_unreach() in net/ipv4/icmp.c. But it appears that these functions are just helpers for whatever function is making the actual decision to reduce the path MTU, and so far I haven't been able to figure out exactly where icmp_unreach() is being called from.

Here's how I made the above observations: I added a printk to the icmp_unreach() function, immediately following the call to ip_rt_frag_needed(), that prints the value passed to and the value received from ip_rt_frag_needed().

If you ifconfig an MTU of the form 8x+4, for instance 8996, and then run tracepath, you see the following:

Apr 16 22:08:40 independent icmp_unreach: info=8996, mtu=8996

I.e. it passes in 8996, receives back 8996, is happy with it and never calls again.
However, if you ifconfig an MTU that is not of the form 8x+4, for example 9000, and then run tracepath, my printk shows:

Apr 11 06:56:34 independent icmp_unreach: info=9000, mtu=9000
Apr 11 06:56:34 independent icmp_unreach: info=8166, mtu=9000
Apr 11 06:56:34 independent icmp_unreach: info=4352, mtu=8166
Apr 11 06:56:34 independent icmp_unreach: info=2002, mtu=4352
Apr 11 06:56:34 independent icmp_unreach: info=1492, mtu=2002

I.e. something is calling icmp_unreach() repeatedly until it is finally happy when a value of 1492 is returned. 1492 is the first number in the table used by guess_mtu() that fits the form 8x+4. If I add 8996 or 1500 to guess_mtu()'s table, then it stops when it reaches either 8996 or 1500.

And to repeat an earlier observation: I've run the same tracepath binary on a 2.6.x kernel and a 2.4.x kernel, and this behavior only occurs on the 2.6.x kernel, so I don't think it is tracepath that is making this decision.

From what I've learned so far I would suggest:

1. Add 9000 and 1500 to the table used in guess_mtu(). This is not to fix this problem in particular, but it should make PMTU discovery better in general.
2. Find and fix the code that is only accepting MTUs of the form 8x+4 back from whatever mechanism is causing icmp_unreach()/ip_rt_frag_needed() to be called.

I'm still working on it. I've ordered two SMC gigabit cards, each of which has a different network chipset (one Marvell and the other Broadcom, I think). This should allow me to determine whether this behavior is specific to the e1000 driver.

I'll probably go dig around in the kernel some more. Does anyone have a hint or two about what I'd need to do to trace my way from running tracepath into whatever kernel functions it causes to be called?
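The observations in this comment can be modelled in a few lines of Python. This is a toy model of the behavior described, not the kernel code. The plateau list holds only the values that appear in this report, the 552 floor is the value the tracepath output settles on in Comment 1, and `accepted()` encodes the 8x+4 hypothesis. The name `guess_mtu` mirrors the kernel function, but its body here is an assumption about the logic (pick the next table entry below the current value).

```python
REPORTED_PLATEAUS = [8166, 4352, 2002, 1492, 576]  # values seen in this report only
FLOOR = 552  # value the output gets stuck on in Comment 1 (treated as a minimum here)


def guess_mtu(old_mtu, plateaus):
    """Assumed logic: return the largest table entry strictly below old_mtu."""
    for plateau in sorted(plateaus, reverse=True):
        if old_mtu > plateau:
            return plateau
    return FLOOR


def accepted(mtu):
    """Hypothesis from the report: only MTUs of the form 8x+4 are accepted."""
    return mtu % 8 == 4


def walk(start, plateaus=REPORTED_PLATEAUS, max_steps=16):
    """Follow the PMTU downwards until a value is accepted or no progress is made."""
    path = [start]
    for _ in range(max_steps):
        current = path[-1]
        if accepted(current):
            break
        nxt = guess_mtu(current, plateaus)
        if nxt == current:  # stuck: the same value would repeat forever
            break
        path.append(nxt)
    return path


print(walk(9000))
print(walk(8996))
print(walk(1490))
print(walk(9000, REPORTED_PLATEAUS + [8996]))
print(walk(9000, REPORTED_PLATEAUS + [1500]))
```

With these assumptions the model reproduces the sequences in the report: 9000 walks down to 1492, 8996 is left alone, 1490 ends up stuck at 552, and adding 8996 or 1500 to the table makes the walk stop there.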
Comment 5 Jason Cox (RETIRED) 2004-04-28 13:44:52 UTC
This is something that needs to head upstream. The behavior is really odd. Could you head to http://bugme.osdl.org and start a thread there? You've done lots of footwork already; be sure to link back to this thread.