Under heavy networking load the kernel panics. In a server/client configuration, a test program that pushes lots of data across many connections causes a kernel panic on almost every run when it uses more threads than there are CPUs.

Reproducible: Always

Steps to Reproduce:
(server and client code added in attachment)
1. Start the server on one machine.
2. Run the client on the other.
3. Monitor the console on both; one or the other will panic reliably.

Actual Results:
- On the server machine (as root) ran:
cl34 crashTest # time ./crashSvr
[never came back as the machine crashed after the next step...]
- On the client machine (as root) ran:
cl33 crashTest # time ./crashClnt cl34 3311 -t 16 -n 1000 -a 10000
Executing thread 0Executing thread 1
Executing thread 2
Executing thread 3
Executing thread 4
Executing thread 5
Executing thread 6
Executing thread 7
Executing thread 8
Executing thread 9
Executing thread 10
Executing thread 11
Executing thread 12
Executing thread 13
Executing thread 14
Executing thread 15
[seemed to be hung, so...]
^C
real 0m54.099s
user 0m0.136s
sys 0m0.792s
cl33 crashTest #

Got this on cl34's (server) console:
[...]
Unable to handle kernel NULL pointer dereference at 0000000000000084 RIP:
[<ffffffff804f8892>] nv_rx_process_optimized+0xd2/0x3c0
PGD 4058a7067 PUD 3f1833067 PMD 0
Oops: 0000 [1] PREEMPT SMP
CPU 3
Modules linked in:
Pid: 0, comm: swapper Not tainted 2.6.22-gentoo-r1 #1
RIP: 0010:[<ffffffff804f8892>] [<ffffffff804f8892>] nv_rx_process_optimized+0xd2/0x3c0
RSP: 0018:ffff810418573eb8 EFLAGS: 00010246
RAX: 0000000014000000 RBX: 0000000000000000 RCX: 0000000004000000
RDX: 0000000000000670 RSI: 0000000412a41810 RDI: ffff810217566070
RBP: 0000000034020042 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000000 R12: ffff810217b01740
R13: 0000000000000042 R14: ffff810217b01000 R15: 0000000000000001
FS: 000000005f83e940(0000) GS:ffff810418554140(0000) knlGS:0000000000000000
CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b
CR2: 0000000000000084 CR3: 00000003f1ebb000 CR4: 00000000000006e0
Process swapper (pid: 0, threadinfo ffff810217804000, task ffff810418555080)
Stack: 0000004017b01740 0000000000000010 ffff810217b01740 ffff810217b01000
ffffc20004d48000 0000000000000000 ffff810217b019e8 ffffffff804f8d7b
0000000000000001 ffff8102150e86c0 0000000000000000 0000000000000000
[nothing more till after a reboot]

Expected Results:
Both client and server reporting successful runs.
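For readers unfamiliar with the oops format: `nv_rx_process_optimized+0xd2/0x3c0` means the faulting instruction sits 0xd2 bytes into a function that is 0x3c0 bytes long, and on x86 CR2 holds the address whose access faulted. A fault address as small as 0x84 is consistent with reading a structure member through a NULL pointer, though the trace alone does not confirm which structure. A small sketch that pulls these fields out of the trace follows; the helper name `parse_oops` and the shortened excerpt are illustrative only, not part of any kernel tooling.

```python
import re

# Three lines copied from the trace in this report; the rest is omitted for brevity.
OOPS_EXCERPT = """\
Unable to handle kernel NULL pointer dereference at 0000000000000084 RIP:
[<ffffffff804f8892>] nv_rx_process_optimized+0xd2/0x3c0
CR2: 0000000000000084 CR3: 00000003f1ebb000 CR4: 00000000000006e0
"""


def parse_oops(text):
    """Extract the faulting function, offsets and fault address (hypothetical helper)."""
    rip = re.search(r"\[<([0-9a-f]+)>\]\s+(\w+)\+0x([0-9a-f]+)/0x([0-9a-f]+)", text)
    cr2 = re.search(r"CR2:\s+([0-9a-f]+)", text)
    if rip is None or cr2 is None:
        raise ValueError("text does not look like an x86-64 oops")
    fault_ip = int(rip.group(1), 16)
    offset = int(rip.group(3), 16)
    return {
        "function": rip.group(2),
        "fault_ip": fault_ip,
        "offset": offset,                      # bytes into the function
        "function_size": int(rip.group(4), 16),
        "function_start": fault_ip - offset,   # address of the function's first byte
        "fault_address": int(cr2.group(1), 16),
    }


if __name__ == "__main__":
    for key, value in parse_oops(OOPS_EXCERPT).items():
        print(key, hex(value) if isinstance(value, int) else value)
```

With a kernel built with debug info, the function name plus offset can then be mapped to a source line, for example in gdb against the matching vmlinux.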
# emerge info *** Deprecated use of action 'info', use '--info' instead Portage 2.1.2.9 (default-linux/amd64/2007.0, gcc-4.1.2, glibc-2.5-r3, 2.6.22-gentoo-r1 x86_64) ================================================================= System uname: 2.6.22-gentoo-r1 x86_64 Dual-Core AMD Opteron(tm) Processor 2212 Gentoo Base System release 1.12.9 Timestamp of tree: Mon, 13 Aug 2007 10:30:10 +0000 ccache version 2.4 [disabled] dev-java/java-config: 1.3.7, 2.0.33-r1 dev-lang/python: 2.4.4-r4 dev-python/pycrypto: 2.0.1-r5 dev-util/ccache: 2.4-r7 sys-apps/sandbox: 1.2.17 sys-devel/autoconf: 2.13, 2.61 sys-devel/automake: 1.4_p6, 1.5, 1.6.3, 1.7.9-r1, 1.9.6-r2, 1.10 sys-devel/binutils: 2.17 sys-devel/gcc-config: 1.3.16 sys-devel/libtool: 1.5.23b virtual/os-headers: 2.6.17-r2 ACCEPT_KEYWORDS="amd64" AUTOCLEAN="yes" CBUILD="x86_64-pc-linux-gnu" CFLAGS="-O2 -pipe" CHOST="x86_64-pc-linux-gnu" CONFIG_PROTECT="/etc /usr/kde/3.5/env /usr/kde/3.5/share/config /usr/kde/3.5/shutdown /usr/share/config /var/bind" CONFIG_PROTECT_MASK="/etc/env.d /etc/env.d/java/ /etc/gconf /etc/php/apache2-php5/ext-active/ /etc/php/cgi-php5/ext-active/ /etc/php/cli-php5/ext-active/ /etc/revdep-rebuild /etc/terminfo /etc/texmf/web2c" CXXFLAGS="-O2 -pipe" DISTDIR="/usr/portage/distfiles" FEATURES="distlocks metadata-transfer sandbox sfperms strict" GENTOO_MIRRORS="http://ftp.ucsb.edu/pub/mirrors/linux/gentoo/ http://gentoo.llarian.net/ http://gentoo.arcticnetwork.ca/ ftp://ftp.ucsb.edu/pub/mirrors/linux/gentoo/ http://gentoo.mirrors.easynews.com/linux/gentoo/" MAKEOPTS="-j8" PKGDIR="/usr/portage/packages" PORTAGE_RSYNC_OPTS="--recursive --links --safe-links --perms --times --compress --force --whole-file --delete --delete-after --stats --timeout=180 --exclude=/distfiles --exclude=/local --exclude=/packages --filter=H_**/files/digest-*" PORTAGE_TMPDIR="/var/tmp" PORTDIR="/usr/portage" PORTDIR_OVERLAY="/usr/local/portage" SYNC="rsync://cfproxy/gentoo-portage" USE="acl alsa amd64 apache2 berkdb bitmap-fonts 
cdr cjk cli cracklib crypt cups curl djbfft doc dri dvd ffmpeg font-server fortran gd gdbm gpm iconv imagemagick immqt-bc innodb ipv6 isdnlog ithreads kde logrotate math midi mmx mudflap mysql ncurses nls nptl nptlonly openmp pam pcre perl php pppd python qt readline reflection session snmp spl sse sse2 ssl tcpd truetype-fonts type1-fonts unicode vhosts xorg zlib" ALSA_CARDS="ali5451 als4000 atiixp atiixp-modem bt87x ca0106 cmipci emu10k1x ens1370 ens1371 es1938 es1968 fm801 hda-intel intel8x0 intel8x0m maestro3 trident usb-audio via82xx via82xx-modem ymfpci" ALSA_PCM_PLUGINS="adpcm alaw asym copy dmix dshare dsnoop empty extplug file hooks iec958 ioplug ladspa lfloat linear meter mulaw multi null plug rate route share shm softvol" ELIBC="glibc" INPUT_DEVICES="keyboard mouse evdev" KERNEL="linux" LCD_DEVICES="bayrad cfontz cfontz633 glk hd44780 lb216 lcdm001 mtxorb ncurses text" USERLAND="GNU" VIDEO_CARDS="apm ark chips cirrus cyrix dummy fbdev glint i128 i810 mach64 mga neomagic nv r128 radeon rendition s3 s3virge savage siliconmotion sis sisusb tdfx tga trident tseng v4l vesa vga via vmware voodoo" Unset: CTARGET, EMERGE_DEFAULT_OPTS, INSTALL_MASK, LANG, LC_ALL, LDFLAGS, LINGUAS, PORTAGE_COMPRESS, PORTAGE_COMPRESS_FLAGS, PORTAGE_RSYNC_EXTRA_OPTS
Created attachment 128002 [details] Code to reproduce the panic and system info

Output of (a la kernel.org suggestion):
- # cat /proc/version
- # cat /proc/cpuinfo
- # cat /proc/modules
- # cat /proc/ioports
- # cat /proc/iomem
- # lspci -vvv
- # cat /proc/scsi/scsi
- related patch that doesn't fix it: http://bugzilla.kernel.org/show_bug.cgi?id=8058
- source code for server and client programs that demonstrate the panic.
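The attached client and server are not reproduced in this thread. As a rough illustration of the kind of load involved, here is a minimal sketch of a multi-threaded echo test: each client thread opens its own connection and exchanges a fixed number of messages with the server. Everything in it is an assumption made for illustration (the function names, the 4-byte length prefix, the echo behaviour, running over loopback); it is not the attached crashSvr/crashClnt code, and over loopback it does not exercise a NIC driver at all.

```python
import socket
import struct
import threading


def recv_exact(sock, n):
    """Read exactly n bytes or raise if the peer closes early."""
    buf = bytearray()
    while len(buf) < n:
        chunk = sock.recv(n - len(buf))
        if not chunk:
            raise ConnectionError("peer closed the connection")
        buf.extend(chunk)
    return bytes(buf)


def handle(conn):
    """Echo length-prefixed messages back until the client disconnects."""
    with conn:
        while True:
            try:
                header = recv_exact(conn, 4)
            except ConnectionError:
                return
            (length,) = struct.unpack("!I", header)
            conn.sendall(header + recv_exact(conn, length))


def serve(listener):
    """Accept connections forever, one handler thread per connection."""
    while True:
        try:
            conn, _ = listener.accept()
        except OSError:
            return  # listener was closed
        threading.Thread(target=handle, args=(conn,), daemon=True).start()


def client_worker(host, port, n_msgs, size, totals, index):
    """Send n_msgs messages of `size` bytes and wait for each echo."""
    frame = struct.pack("!I", size) + b"x" * size
    moved = 0
    with socket.create_connection((host, port)) as sock:
        for _ in range(n_msgs):
            sock.sendall(frame)
            (length,) = struct.unpack("!I", recv_exact(sock, 4))
            reply = recv_exact(sock, length)
            moved += len(frame) + 4 + len(reply)  # bytes sent plus bytes received
    totals[index] = moved


def run_load(host, port, threads, n_msgs, size):
    """Run `threads` concurrent clients and return the total bytes moved."""
    totals = [0] * threads
    workers = [
        threading.Thread(target=client_worker,
                         args=(host, port, n_msgs, size, totals, i))
        for i in range(threads)
    ]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    return sum(totals)


def demo(threads=4, n_msgs=50, size=1000):
    """Start a loopback server on an ephemeral port and run the clients against it."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))
    listener.listen()
    port = listener.getsockname()[1]
    threading.Thread(target=serve, args=(listener,), daemon=True).start()
    try:
        return run_load("127.0.0.1", port, threads, n_msgs, size)
    finally:
        listener.close()


if __name__ == "__main__":
    print("total bytes moved:", demo())
```

To stress a real network path, `serve` would run on one machine and `run_load` on another, with the thread count raised above the number of CPUs as described in the report.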
A couple of questions:
- What was the last kernel version that worked for you (i.e. didn't panic)?
- Can you post your kernel .config and dmesg output?
- Could you test the latest kernel prepatch, i.e. vanilla-sources-2.6.23_rc3?
- If you haven't already, please also test with CONFIG_FORCEDETH_NAPI=y.

Thanks.
Created attachment 128722 [details] dmesg output from the kernel that panics in nv_rx_optimized on SMP box
Created attachment 128723 [details] the .config from kernel that panics in nv_rx_optimized on SMP box
mbresser asks:
- What was the last kernel version that worked for you (i.e. didn't panic)?
- Can you post your kernel .config and dmesg output?
- Could you test the latest kernel prepatch, i.e. vanilla-sources-2.6.23_rc3?
- If you haven't already, please also test with CONFIG_FORCEDETH_NAPI=y.

These are new machines for me, so I have never had them "not panic". ;-) I could back out to an old kernel with some trouble... any hints which might be a good one to go back to?

I have attached the .config and dmesg output.

I'll try to test the latest kernel and post the results. Likewise with CONFIG_FORCEDETH_NAPI=y (it was not set).

Thanks - ;peter
Created attachment 128724 [details] text version of the .config from kernel that panics in nv_rx_optimized on SMP box Sorry, first file I attached was the un-expanded version from /proc/config.gz ;;peter
Well, setting CONFIG_FORCEDETH_NAPI=y on my otherwise problematic Linux cl34 2.6.22-gentoo-r1 #2 SMP PREEMPT kernel seems to have helped considerably. I was able to bump up the test parameters to:
time ./crashClnt cl34 3311 -t 80 -n 10000 -a 10000
i.e. 80 threads sending 10K messages of 10K each.

It still hung the server machine if I bumped it to:
time ./crashClnt cl34 3311 -t 80 -n 10000 -a 25000
but it didn't seem to crash, nor leave any trace in the logs; it was just dead as a doornail, unresponsive to pings etc. (Any tricks for getting more info out of this state?)

Will now try to get the latest kernel prepatch, i.e. vanilla-sources-2.6.23_rc3. Might be a bit, since I'm new to using "raw" kernels. ;-)

;;peter
Grabbed the vanilla kernel: http://www.kernel.org/pub/linux/kernel/v2.6/testing/linux-2.6.23-rc3.tar.bz2

This ran the 16 thread version, but seemed to hang on the 80 thread one, i.e.:
ran: time ./crashClnt cl34 3311 -t 16 -n 10000 -a 10000
ran: time ./crashClnt cl34 3311 -t 40 -n 10000 -a 10000
hung sometimes: time ./crashClnt cl34 3311 -t 80 -n 10000 -a 10000

Left one 80 thread version running overnight; it was still hung in the morning, but after a bounce there were these two entries in /var/log/messages:
Aug 20 20:36:42 cl34 eth0: too many iterations (6) in nv_nic_irq.
Aug 20 20:37:12 cl34 eth0: too many iterations (6) in nv_nic_irq.

Subsequent tries with 80 threads seem to work today, but I do get lots (like 30 every 10 minutes) of these entries in the log file during the run:
# time ./crashClnt cl34 3311 -t 80 -n 100000 -a 1000
Executing thread 0
Executing thread 1
Executing thread 2
[...]
Executing thread 77
Executing thread 78
Executing thread 79
Server is: cl34:3311
Sent & received 100000 msgs of avg. size 1000 with 80 threads
Grand total: 16064000000 bytes, or 128512000000 bits
real 98m46.622s
user 0m2.316s
sys 2m35.470s

It does seem very sensitive to the count:size ratio. Here the same total data volume was transferred in a bit over 2 minutes as opposed to 1.6 hours:
# time ./crashClnt cl34 3311 -t 80 -n 1000 -a 100000
Executing thread 0
Executing thread 1
Executing thread 2
[...]
Executing thread 78
Executing thread 79
Server is: cl34:3311
Sent & received 1000 msgs of avg. size 100000 with 80 threads
Grand total: 16000640000 bytes, or 128005120000 bits
real 2m12.389s
user 0m15.093s
sys 1m39.678s

So it seems the vanilla kernel is the best choice I have at the moment; hopefully the one hang was a fluke...

Any other ideas of things to try welcomed - ;;peter
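To put numbers on the count:size sensitivity, the two 80-thread runs above can be turned into throughput figures from the reported grand totals and `real` times. A short sketch of that arithmetic follows; the helper names are made up for illustration, and the only inputs are values quoted in this comment. It works out to about 22 Mbit/s for the small-message run against about 967 Mbit/s for the large-message run. The totals also divide evenly into 2 x (size + 4) bytes per message per thread, which would fit a 4-byte header on each message sent and received, but that is an inference from the numbers, not something the thread states.

```python
def parse_real(real):
    """Convert a `time` value such as '98m46.622s' to seconds."""
    minutes, seconds = real.rstrip("s").split("m")
    return int(minutes) * 60 + float(seconds)


def throughput_mbit(total_bytes, real):
    """Average throughput in Mbit/s (10**6 bits per second)."""
    return total_bytes * 8 / parse_real(real) / 1e6


def bytes_per_message(total_bytes, threads, msgs_per_thread):
    """Bytes moved per message, assuming -n is a per-thread count."""
    return total_bytes / (threads * msgs_per_thread)


# (grand total in bytes, wall-clock time, threads, messages, average size)
RUNS = [
    (16064000000, "98m46.622s", 80, 100000, 1000),
    (16000640000, "2m12.389s", 80, 1000, 100000),
]

if __name__ == "__main__":
    for total, real, threads, msgs, size in RUNS:
        print(f"-n {msgs} -a {size}: "
              f"{throughput_mbit(total, real):.1f} Mbit/s, "
              f"{bytes_per_message(total, threads, msgs):.0f} bytes per message")
```

The roughly 45x gap for nearly identical data volume suggests the small-message case is limited by per-message overhead rather than by bandwidth.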
Try using SysRq-t to get a stack trace after it hangs. You can read the instructions for it in Documentation/sysrq.txt in your kernel directory. I'd suggest first trying a sequence like SysRq-t, SysRq-s, SysRq-u, SysRq-b to dump the trace, sync your disks, remount your filesystems read-only, then reboot. That should leave you with the stack traces in your system log after you reboot. Please attach the trace from the relevant process(es) here.

If that doesn't work, another option is to set up a serial console, as described in Documentation/serial-console.txt. You can also use netconsole (Documentation/networking/netconsole.txt) to capture log messages.
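On a machine that is already hung, the SysRq keys have to come from the keyboard or the serial console, since no userspace program will get to run. While the system is still responsive, though, the same commands can be issued by writing the key letter to /proc/sysrq-trigger as root, which is a handy way to confirm beforehand that SysRq is compiled in and that the task dump actually reaches the log. A minimal sketch follows; the helper name `send_sysrq` and its parameters are illustrative.

```python
import time

# Kernel interface that accepts one SysRq command letter per write (root only).
SYSRQ_TRIGGER = "/proc/sysrq-trigger"


def send_sysrq(keys, trigger=SYSRQ_TRIGGER, delay=0.0):
    """Write each SysRq command letter in `keys` to the trigger file, in order.

    Returns the list of keys written. `delay` is an optional pause in seconds
    between commands, e.g. to give a sync time to finish.
    """
    sent = []
    for key in keys:
        with open(trigger, "w") as f:
            f.write(key)
        sent.append(key)
        time.sleep(delay)
    return sent


if __name__ == "__main__":
    # 't' only dumps task states to the kernel log; it is safe to try.
    # Do not add 'b' unless an immediate reboot is really intended.
    send_sysrq("t")
```

After running it, the task dump should show up in dmesg and on the serial console if one is configured.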
Thanks for the pointer to SysRq, I'll give it a try. I do already have a serial console set up, and am actually running these tests from those consoles. Now that the new kernel (linux-2.6.23-rc3) is not panicking, I see nothing output once things hang. Perhaps SysRq will provide some clues.
Created attachment 130282 [details] upstream forcedeth.c
Can you try the latest forcedeth that I am attaching? I believe the following change could have fixed your issue as well: http://git.kernel.org/?p=linux/kernel/git/jgarzik/netdev-2.6.git;a=commitdiff;h=1a2b73302aacddf2543f9d7a25936e4323fa1486
Closing this bug. Please reopen when you have tested Ayaz's patch.
Was this patch applied? I seem to be having a similar problem - although I don't see any evidence of a kernel panic, my box does seem to lock up when there is high load with many connections.
It is in the latest vanilla stable release (currently 2.6.27.7). Could you please test with that and see if it fixes the problem for you?
Sure - I can check that. But is the patch in gentoo-sources-2.6.27-r4? If so then I'm already testing it....
Also, I just bought a cheap ethernet card to rule out causes other than the forcedeth driver... I'll update with results.
(In reply to comment #16)
> Sure - I can check that. But is the patch in gentoo-sources-2.6.27-r4? If so
> then I'm already testing it....
>

Seems so. gentoo-sources-2.6.27-r4 uses K_GENPATCHES_VER="6", which is based on 2.6.27.7.
I've been using the new ethernet card for a while (it's a card that uses the via-rhine module) with no problems even at high loads...