188767 – oops - forcedeth kernel panic in nv_rx_process_optimized in SMP multithreaded environment

Status:	RESOLVED NEEDINFO

Alias:	None

Product:	Gentoo Linux
Classification:	Unclassified
Component:	[OLD] Core system
Hardware:	AMD64 Linux

Importance:	High critical
Assignee:	Gentoo Kernel Bug Wranglers and Kernel Maintainers

Reported:	2007-08-13 23:25 UTC by slowfood
Modified:	2009-01-11 20:48 UTC
CC List:	4 users

Attachments
Code to reproduce the panic and system info (forcedethPanic_testcode_2007-08-13.txt, 51.70 KB, text/plain) 2007-08-13 23:38 UTC, slowfood
dmesg output from the kernel that panics in nv_rx_optimized on SMP box (dmesg_panicKernel_2.6.22-gentoo-r1.txt, 27.60 KB, text/plain) 2007-08-20 21:38 UTC, slowfood
the .config from kernel that panics in nv_rx_optimized on SMP box (oops_config.gz_2.6.22-gentoo-r1.txt, 9.90 KB, application/x-gzip) 2007-08-20 21:40 UTC, slowfood
text version of the .config from kernel that panics in nv_rx_optimized on SMP box (oops_config.gz_2.6.22-gentoo-r1.txt, 41.35 KB, text/plain) 2007-08-20 21:48 UTC, slowfood
upstream forcedeth.c (forcedeth.c, 173.18 KB, text/plain) 2007-09-07 17:26 UTC, Ayaz Abdulla

Description slowfood 2007-08-13 23:25:17 UTC

Under heavy networking load the kernel panics.
In a server/client configuration, a test program that pushes lots of data
across many connections causes a kernel panic on almost every run
when it uses more threads than there are CPUs.

Reproducible: Always

Steps to Reproduce:
(server and client code added in attachment)
1. Start server on one machine
2. Run client on the other
3. Monitor console on both; one or the other will panic reliably.
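
For readers without the attachment, the kind of load described here can be sketched as follows. This is not the attached crashClnt source; it is a minimal Python sketch that assumes the server simply echoes every message back, and the names `send_messages` and `run_client` are made up for illustration. The three numeric parameters mirror the client's -t (threads), -n (messages per thread) and -a (average message size) options, except that the sketch uses one fixed size rather than an average.

```python
import socket
import threading


def send_messages(host, port, n_msgs, msg_size, totals, idx):
    """One worker: open its own connection, send n_msgs payloads, read each echo."""
    payload = b"x" * msg_size
    moved = 0
    with socket.create_connection((host, port)) as sock:
        for _ in range(n_msgs):
            # Sending the whole payload before reading assumes it fits in
            # the socket buffers; very large messages would need interleaving.
            sock.sendall(payload)
            remaining = msg_size
            while remaining:
                chunk = sock.recv(min(remaining, 65536))
                if not chunk:
                    raise ConnectionError("server closed the connection early")
                remaining -= len(chunk)
            moved += 2 * msg_size  # bytes sent plus bytes received
    totals[idx] = moved


def run_client(host, port, n_threads, n_msgs, msg_size):
    """Start n_threads workers in parallel and return the total bytes moved."""
    totals = [0] * n_threads
    threads = [
        threading.Thread(
            target=send_messages,
            args=(host, port, n_msgs, msg_size, totals, i),
        )
        for i in range(n_threads)
    ]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return sum(totals)


# Hypothetical usage, matching the 16-thread run shown under Actual Results:
# run_client("cl34", 3311, n_threads=16, n_msgs=1000, msg_size=10000)
```

The point of the sketch is only the shape of the load: many connections, each driven by its own thread, with more threads than CPUs.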

Actual Results:  
- On server machine (as root) ran:

   cl34 crashTest # time ./crashSvr
   [never came back as the machine crashed after the next step...]

- On client machine (as root) ran:

   cl33 crashTest # time ./crashClnt cl34 3311 -t 16 -n 1000 -a 10000
   Executing thread 0Executing thread 1
   Executing thread 2
   Executing thread 3
   Executing thread 4
   Executing thread 5
   Executing thread 6
   Executing thread 7
   Executing thread 8
   Executing thread 9
   Executing thread 10
   Executing thread 11
   Executing thread 12
   Executing thread 13
   Executing thread 14
   Executing thread 15
 [seemed to be hung, so...]
   ^C
   real    0m54.099s
   user    0m0.136s
   sys     0m0.792s
   cl33 crashTest #


Got this on cl34's (server) console:

 [...]
Unable to handle kernel NULL pointer dereference at 0000000000000084 RIP:
 [<ffffffff804f8892>] nv_rx_process_optimized+0xd2/0x3c0
PGD 4058a7067 PUD 3f1833067 PMD 0
Oops: 0000 [1] PREEMPT SMP
CPU 3
Modules linked in:
Pid: 0, comm: swapper Not tainted 2.6.22-gentoo-r1 #1
RIP: 0010:[<ffffffff804f8892>]  [<ffffffff804f8892>] nv_rx_process_optimized+0xd2/0x3c0
RSP: 0018:ffff810418573eb8  EFLAGS: 00010246
RAX: 0000000014000000 RBX: 0000000000000000 RCX: 0000000004000000
RDX: 0000000000000670 RSI: 0000000412a41810 RDI: ffff810217566070
RBP: 0000000034020042 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000000 R12: ffff810217b01740
R13: 0000000000000042 R14: ffff810217b01000 R15: 0000000000000001
FS:  000000005f83e940(0000) GS:ffff810418554140(0000) knlGS:0000000000000000
CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
CR2: 0000000000000084 CR3: 00000003f1ebb000 CR4: 00000000000006e0
Process swapper (pid: 0, threadinfo ffff810217804000, task ffff810418555080)
Stack:  0000004017b01740 0000000000000010 ffff810217b01740 ffff810217b01000
 ffffc20004d48000 0000000000000000 ffff810217b019e8 ffffffff804f8d7b
 0000000000000001 ffff8102150e86c0 0000000000000000 0000000000000000
 [nothing more till after a reboot]
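
The two most useful fields in the console output above are the RIP line and CR2. As a hedged reading aid (the helper `parse_rip` is made up for this note and is not part of any kernel tooling), the sketch below splits the RIP entry into symbol, offset into the function, and function size, and converts the faulting address from CR2 to decimal. A fault address this small is consistent with reading a structure member through a NULL pointer, but which pointer was NULL cannot be determined from this output alone.

```python
import re

# Matches entries such as "[<ffffffff804f8892>] nv_rx_process_optimized+0xd2/0x3c0"
RIP_RE = re.compile(
    r"\]\s+(?P<symbol>[A-Za-z_][\w.]*)\+0x(?P<offset>[0-9a-f]+)/0x(?P<size>[0-9a-f]+)"
)


def parse_rip(line):
    """Return (symbol, offset_in_bytes, function_size_in_bytes) or None."""
    match = RIP_RE.search(line)
    if match is None:
        return None
    return (
        match.group("symbol"),
        int(match.group("offset"), 16),
        int(match.group("size"), 16),
    )


rip_line = " [<ffffffff804f8892>] nv_rx_process_optimized+0xd2/0x3c0"
symbol, offset, size = parse_rip(rip_line)

# CR2 holds the address whose access faulted.
fault_address = int("0000000000000084", 16)

print(f"{symbol}: fault {offset} bytes into a {size}-byte function")
print(f"faulting address: {fault_address} (0x{fault_address:x})")
```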


Expected Results:  
both client and server reporting successful runs.

# emerge info
*** Deprecated use of action 'info', use '--info' instead
Portage 2.1.2.9 (default-linux/amd64/2007.0, gcc-4.1.2, glibc-2.5-r3, 2.6.22-gentoo-r1 x86_64)
=================================================================
System uname: 2.6.22-gentoo-r1 x86_64 Dual-Core AMD Opteron(tm) Processor 2212
Gentoo Base System release 1.12.9
Timestamp of tree: Mon, 13 Aug 2007 10:30:10 +0000
ccache version 2.4 [disabled]
dev-java/java-config: 1.3.7, 2.0.33-r1
dev-lang/python:     2.4.4-r4
dev-python/pycrypto: 2.0.1-r5
dev-util/ccache:     2.4-r7
sys-apps/sandbox:    1.2.17
sys-devel/autoconf:  2.13, 2.61
sys-devel/automake:  1.4_p6, 1.5, 1.6.3, 1.7.9-r1, 1.9.6-r2, 1.10
sys-devel/binutils:  2.17
sys-devel/gcc-config: 1.3.16
sys-devel/libtool:   1.5.23b
virtual/os-headers:  2.6.17-r2
ACCEPT_KEYWORDS="amd64"
AUTOCLEAN="yes"
CBUILD="x86_64-pc-linux-gnu"
CFLAGS="-O2 -pipe"
CHOST="x86_64-pc-linux-gnu"
CONFIG_PROTECT="/etc /usr/kde/3.5/env /usr/kde/3.5/share/config /usr/kde/3.5/shutdown /usr/share/config /var/bind"
CONFIG_PROTECT_MASK="/etc/env.d /etc/env.d/java/ /etc/gconf /etc/php/apache2-php5/ext-active/ /etc/php/cgi-php5/ext-active/ /etc/php/cli-php5/ext-active/ /etc/revdep-rebuild /etc/terminfo /etc/texmf/web2c"
CXXFLAGS="-O2 -pipe"
DISTDIR="/usr/portage/distfiles"
FEATURES="distlocks metadata-transfer sandbox sfperms strict"
GENTOO_MIRRORS="http://ftp.ucsb.edu/pub/mirrors/linux/gentoo/ http://gentoo.llarian.net/ http://gentoo.arcticnetwork.ca/ ftp://ftp.ucsb.edu/pub/mirrors/linux/gentoo/ http://gentoo.mirrors.easynews.com/linux/gentoo/"
MAKEOPTS="-j8"
PKGDIR="/usr/portage/packages"
PORTAGE_RSYNC_OPTS="--recursive --links --safe-links --perms --times --compress --force --whole-file --delete --delete-after --stats --timeout=180 --exclude=/distfiles --exclude=/local --exclude=/packages --filter=H_**/files/digest-*"
PORTAGE_TMPDIR="/var/tmp"
PORTDIR="/usr/portage"
PORTDIR_OVERLAY="/usr/local/portage"
SYNC="rsync://cfproxy/gentoo-portage"
USE="acl alsa amd64 apache2 berkdb bitmap-fonts cdr cjk cli cracklib crypt cups curl djbfft doc dri dvd ffmpeg font-server fortran gd gdbm gpm iconv imagemagick immqt-bc innodb ipv6 isdnlog ithreads kde logrotate math midi mmx mudflap mysql ncurses nls nptl nptlonly openmp pam pcre perl php pppd python qt readline reflection session snmp spl sse sse2 ssl tcpd truetype-fonts type1-fonts unicode vhosts xorg zlib" ALSA_CARDS="ali5451 als4000 atiixp atiixp-modem bt87x ca0106 cmipci emu10k1x ens1370 ens1371 es1938 es1968 fm801 hda-intel intel8x0 intel8x0m maestro3 trident usb-audio via82xx via82xx-modem ymfpci" ALSA_PCM_PLUGINS="adpcm alaw asym copy dmix dshare dsnoop empty extplug file hooks iec958 ioplug ladspa lfloat linear meter mulaw multi null plug rate route share shm softvol" ELIBC="glibc" INPUT_DEVICES="keyboard mouse evdev" KERNEL="linux" LCD_DEVICES="bayrad cfontz cfontz633 glk hd44780 lb216 lcdm001 mtxorb ncurses text" USERLAND="GNU" VIDEO_CARDS="apm ark chips cirrus cyrix dummy fbdev glint i128 i810 mach64 mga neomagic nv r128 radeon rendition s3 s3virge savage siliconmotion sis sisusb tdfx tga trident tseng v4l vesa vga via vmware voodoo"
Unset:  CTARGET, EMERGE_DEFAULT_OPTS, INSTALL_MASK, LANG, LC_ALL, LDFLAGS, LINGUAS, PORTAGE_COMPRESS, PORTAGE_COMPRESS_FLAGS, PORTAGE_RSYNC_EXTRA_OPTS

Comment 1 slowfood 2007-08-13 23:38:42 UTC

Created attachment 128002
Code to reproduce the panic and system info

Output of (a la kernel.org suggestion)
- # cat /proc/version
- # cat /proc/cpuinfo
- # cat /proc/modules
- # cat /proc/ioports
- # cat /proc/iomem
- # lspci -vvv
- # cat /proc/scsi/scsi

- related patch that doesn't fix it:
   http://bugzilla.kernel.org/show_bug.cgi?id=8058

- source code for server and client programs that demonstrate the panic.

Comment 2 Maarten Bressers (RETIRED) gentoo-dev 2007-08-18 21:48:42 UTC

A couple of questions:

- What was the last kernel version that worked for you (ie. didn't panic)?
- Can you post your kernel .config and dmesg output?
- Could you test the latest kernel prepatch, ie. vanilla-sources-2.6.23_rc3?
- If you haven't already, please also test with CONFIG_FORCEDETH_NAPI=y.

Thanks.

Comment 3 slowfood 2007-08-20 21:38:32 UTC

Created attachment 128722
dmesg output from the kernel that panics in nv_rx_optimized on SMP box

Comment 4 slowfood 2007-08-20 21:40:48 UTC

Created attachment 128723
the .config from kernel that panics in nv_rx_optimized on SMP box

Comment 5 slowfood 2007-08-20 21:42:37 UTC

mbresser asks:
 - What was the last kernel version that worked for you (ie. didn't panic)?
 - Can you post your kernel .config and dmesg output?
 - Could you test the latest kernel prepatch, ie. vanilla-sources-2.6.23_rc3?
 - If you haven't already, please also test with CONFIG_FORCEDETH_NAPI=y.

These are new machines for me, so have never had them "not panic". ;-)
I could back out to an old kernel with some trouble... any hints which might be
a good one to go back to?

I have attached .config and dmesg output

I'll try and test the latest kernel, post the results.
Likewise with CONFIG_FORCEDETH_NAPI=y (was not set)

Thanks -
;peter

Comment 6 slowfood 2007-08-20 21:48:58 UTC

Created attachment 128724
text version of the .config from kernel that panics in nv_rx_optimized on SMP box

Sorry, first file I attached was the un-expanded version from /proc/config.gz
;;peter

Comment 7 slowfood 2007-08-21 00:21:43 UTC

Well, setting CONFIG_FORCEDETH_NAPI=y on my otherwise problematic
Linux cl34 2.6.22-gentoo-r1 #2 SMP PREEMPT
kernel seems to have helped considerably.

I was able to bump up the test parameters to:
   time ./crashClnt cl34 3311 -t 80 -n 10000 -a 10000
i.e. 80 threads sending 10K messages of 10K each.

It still hung the server machine if I bumped it to:
   time ./crashClnt cl34 3311 -t 80 -n 10000 -a 25000
but didn't seem to crash, nor leave any trace in the logs,
just dead as a doornail, unresponsive to pings etc.
(Any tricks for getting more info out of this state?)

Will now try to get the latest kernel prepatch, ie. vanilla-sources-2.6.23_rc3
Might be a bit, since I'm new to using "raw" kernels. ;-)

;;peter

Comment 8 slowfood 2007-08-21 21:30:50 UTC

Grabbed the vanilla kernel:
   http://www.kernel.org/pub/linux/kernel/v2.6/testing/linux-2.6.23-rc3.tar.bz2
This ran the 16 thread version, but seemed to hang on the 80 thread one.
i.e.
 ran: time ./crashClnt cl34 3311 -t 16 -n 10000 -a 10000
 ran: time ./crashClnt cl34 3311 -t 40 -n 10000 -a 10000
hung sometimes:
      time ./crashClnt cl34 3311 -t 80 -n 10000 -a 10000

Left one 80 thread version running overnight, was still hung
in the morning, but after a bounce had these two
entries in /var/log/messages :
Aug 20 20:36:42 cl34 eth0: too many iterations (6) in nv_nic_irq.
Aug 20 20:37:12 cl34 eth0: too many iterations (6) in nv_nic_irq.

Subsequent tries with 80 threads seem to work today, but do get lots
(like 30 every 10 minutes) of these entries in the log file during the
run:
# time ./crashClnt cl34 3311 -t 80 -n 100000 -a 1000
Executing thread 0
Executing thread 1
Executing thread 2
 [...]
Executing thread 77
Executing thread 78
Executing thread 79

Server is: cl34:3311
Sent & received 100000 msgs of avg. size 1000 with 80 threads
Grand total: 16064000000 bytes, or 128512000000 bits

real    98m46.622s
user    0m2.316s
sys     2m35.470s

It does seem very sensitive to the count:size ratio - here the same
total data volume was transferred in a bit over 2 minutes as opposed to 1.6 hours:
# time ./crashClnt cl34 3311 -t 80 -n 1000 -a 100000
Executing thread 0
Executing thread 1
Executing thread 2
  [...]
Executing thread 78
Executing thread 79
Server is: cl34:3311
Sent & received 1000 msgs of avg. size 100000 with 80 threads
Grand total: 16000640000 bytes, or 128005120000 bits

real    2m12.389s
user    0m15.093s
sys     1m39.678s
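
To put numbers on the difference between the two runs, the reported grand totals can be divided by the reported wall-clock times. The sketch below does only that arithmetic; the helper names are made up for illustration, and it assumes that "Grand total" counts bytes sent plus bytes received, which is how the two totals line up with 80 threads x messages x average size x 2.

```python
def parse_real(real):
    """Convert a `time` value such as '98m46.622s' into seconds."""
    minutes, seconds = real.rstrip("s").split("m")
    return int(minutes) * 60 + float(seconds)


def throughput_mb_per_s(total_bytes, real):
    """Aggregate throughput in MB/s (1 MB = 10**6 bytes)."""
    return total_bytes / parse_real(real) / 1e6


# (client options, reported grand total in bytes, reported "real" time)
runs = [
    ("-t 80 -n 100000 -a 1000", 16064000000, "98m46.622s"),
    ("-t 80 -n 1000 -a 100000", 16000640000, "2m12.389s"),
]

for options, total_bytes, real in runs:
    rate = throughput_mb_per_s(total_bytes, real)
    print(f"{options}: {parse_real(real):.0f} s, {rate:.2f} MB/s")
```

By this measure the small-message run moved data at roughly 2.7 MB/s and the large-message run at roughly 121 MB/s, so the slowdown tracks the number of messages rather than the number of bytes.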

So seems the vanilla kernel is best choice I have at the moment, hopefully
the one hang was a fluke...
Any other ideas of things to try welcomed -

;;peter

Comment 9 Duane Griffin 2007-08-22 10:34:15 UTC

Try using SysRq-t to get a stack trace after it hangs. You can read instructions for it in Documentation/sysrq.txt in your kernel directory.

I'd suggest first trying a sequence like SysRq-t, SysRq-s, SysRq-u, SysRq-b to dump the trace, sync your disks, mount your filesystems read-only, then reboot. That should leave you with the stack traces in your system log after you reboot. Please attach the trace from the relevant process(es) here.

If that doesn't work another option is to set up a serial console, as described in Documentation/serial-console.txt. You can also use netconsole (Documentation/networking/netconsole.txt) to capture log messages.
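
The SysRq sequence suggested above can be rehearsed from software while the machine is still responsive, because the kernel accepts the same keys through /proc/sysrq-trigger (as root, on a kernel built with CONFIG_MAGIC_SYSRQ). The sketch below is only a dry-run aid under those assumptions, and `fire_sysrq` is a made-up helper name. On a box that is already hung, user-space code will not run, so the keys have to come from the keyboard or from a BREAK on the serial console.

```python
import time

# Keys in the order suggested above, with what each one asks the kernel to do.
SYSRQ_SEQUENCE = [
    ("t", "dump the stack trace of every task to the kernel log"),
    ("s", "sync all mounted filesystems"),
    ("u", "remount all filesystems read-only"),
    ("b", "reboot immediately, without syncing or unmounting"),
]


def fire_sysrq(keys, trigger_path="/proc/sysrq-trigger", delay=2.0):
    """Write each SysRq key to trigger_path in order and return the keys sent."""
    known = {key for key, _ in SYSRQ_SEQUENCE}
    sent = []
    for key in keys:
        if key not in known:
            raise ValueError(f"unexpected SysRq key: {key!r}")
        with open(trigger_path, "w") as trigger:
            trigger.write(key)
        sent.append(key)
        time.sleep(delay)  # give the kernel time to finish before the next key
    return sent


# Hypothetical usage (as root; the final "b" reboots the machine at once):
# fire_sysrq("tsub")
```

Leaving out the final "b" (for example `fire_sysrq("t")`) only dumps the task list, which is a safe way to confirm that the traces actually reach the system log or serial console.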

Comment 10 slowfood 2007-08-22 19:27:09 UTC

Thanks for the pointer to SysRq, I'll give it a try.
I do already have a serial console set up, and am actually running these tests
from those consoles.
Now that the new kernel (linux-2.6.23-rc3) is not panicking, I see nothing output
once things hang.
Perhaps SysRq will provide some clues.

Comment 11 Ayaz Abdulla 2007-09-07 17:26:33 UTC

Created attachment 130282
upstream forcedeth.c

Comment 12 Ayaz Abdulla 2007-09-07 17:27:11 UTC

Can you try the latest forcedeth that I am attaching? I believe the following change could have fixed your issue as well:

http://git.kernel.org/?p=linux/kernel/git/jgarzik/netdev-2.6.git;a=commitdiff;h=1a2b73302aacddf2543f9d7a25936e4323fa1486

Comment 13 Maarten Bressers (RETIRED) gentoo-dev 2007-09-21 16:04:31 UTC

Closing this bug. Please reopen when you have tested Ayaz's patch.

Comment 14 Paul Sorensen 2008-12-01 20:55:03 UTC

Was this patch applied? I seem to be having a similar problem - although I don't see any evidence of a kernel panic, my box does seem to lock up when there is high load with many connections.

Comment 15 Duane Griffin 2008-12-04 12:51:50 UTC

It is in the latest vanilla stable release (currently 2.6.27.7). Could you please test with that and see if it fixes the problem for you?

Comment 16 Paul Sorensen 2008-12-07 18:18:44 UTC

Sure - I can check that.  But is the patch in gentoo-sources-2.6.27-r4? If so then I'm already testing it....

Comment 17 Paul Sorensen 2008-12-07 18:20:35 UTC

Also, I just bought a cheap ethernet card to verify that it's not something else other than the forcedeth driver...I'll update with results.

Comment 18 Axel Dyks 2008-12-07 18:29:00 UTC

(In reply to comment #16)
> Sure - I can check that.  But is the patch in gentoo-sources-2.6.27-r4? If so
> then I'm already testing it....
> 
Seems so. gentoo-sources-2.6.27-r4 uses K_GENPATCHES_VER="6"
which is based on 2.6.27.7.

Comment 19 Paul Sorensen 2009-01-11 20:48:10 UTC

I've been using the new ethernet card for a while (it's a card that uses the via-rhine module) with no problems even at high loads...