| Summary: | oops - forcedeth kernel panic in nv_rx_process_optimized in SMP multithreaded environment | ||
|---|---|---|---|
| Product: | Gentoo Linux | Reporter: | slowfood <peter> |
| Component: | [OLD] Core system | Assignee: | Gentoo Kernel Bug Wranglers and Kernel Maintainers <kernel> |
| Status: | RESOLVED NEEDINFO | ||
| Severity: | critical | CC: | aabdulla, duaneg, Paul.Sorensen, peter |
| Priority: | High | ||
| Version: | unspecified | ||
| Hardware: | AMD64 | ||
| OS: | Linux | ||
| Whiteboard: | |||
| Package list: | Runtime testing required: | --- | |
| Attachments: | Code to reproduce the panic and system info; dmesg output from the kernel that panics in nv_rx_optimized on SMP box; the .config from kernel that panics in nv_rx_optimized on SMP box; text version of the .config from kernel that panics in nv_rx_optimized on SMP box; upstream forcedeth.c |
||
|
Description
slowfood
2007-08-13 23:25:17 UTC
Created attachment 128002 [details]
Code to reproduce the panic and system info

Output of (a la kernel.org suggestion):
- # cat /proc/version
- # cat /proc/cpuinfo
- # cat /proc/modules
- # cat /proc/ioports
- # cat /proc/iomem
- # lspci -vvv
- # cat /proc/scsi/scsi
- related patch that doesn't fix it: http://bugzilla.kernel.org/show_bug.cgi?id=8058
- source code for server and client programs that demonstrate the panic.

A couple of questions:
- What was the last kernel version that worked for you (ie. didn't panic)?
- Can you post your kernel .config and dmesg output?
- Could you test the latest kernel prepatch, ie. vanilla-sources-2.6.23_rc3?
- If you haven't already, please also test with CONFIG_FORCEDETH_NAPI=y.

Thanks.

Created attachment 128722 [details]
dmesg output from the kernel that panics in nv_rx_optimized on SMP box
Created attachment 128723 [details]
the .config from kernel that panics in nv_rx_optimized on SMP box
mbresser asks:
- What was the last kernel version that worked for you (ie. didn't panic)?
- Can you post your kernel .config and dmesg output?
- Could you test the latest kernel prepatch, ie. vanilla-sources-2.6.23_rc3?
- If you haven't already, please also test with CONFIG_FORCEDETH_NAPI=y.

These are new machines for me, so I have never had them "not panic". ;-)
I could back out to an old kernel with some trouble... any hints which might be a good one to go back to?

I have attached the .config and dmesg output.
I'll try to test the latest kernel and post the results. Likewise with CONFIG_FORCEDETH_NAPI=y (was not set).

Thanks - ;peter

Created attachment 128724 [details]
text version of the .config from kernel that panics in nv_rx_optimized on SMP box
Sorry, first file I attached was the un-expanded version from /proc/config.gz
;;peter
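
For anyone reproducing this, one quick way to see how the forcedeth options are set on a running kernel is to expand /proc/config.gz and filter it. This is a minimal sketch, assuming the kernel exposes its config there (CONFIG_IKCONFIG_PROC); the helper name `forcedeth_config` is made up for illustration and is not part of any tool mentioned in this report.

```shell
#!/bin/sh
# Print the forcedeth-related options from a kernel config.
# Handles both the gzip-compressed /proc/config.gz and a plain-text .config.
# NOTE: forcedeth_config is a hypothetical helper name used only in this sketch.
forcedeth_config() {
    cfg="${1:-/proc/config.gz}"
    case "$cfg" in
        *.gz) zcat "$cfg" ;;   # compressed config, as exported by the kernel
        *)    cat "$cfg" ;;    # already-expanded text config
    esac | grep 'FORCEDETH'
}
```

Called with no argument it reads /proc/config.gz; pass a path such as /usr/src/linux/.config to check a source tree instead. A line reading `# CONFIG_FORCEDETH_NAPI is not set` corresponds to the "was not set" state mentioned above.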
Well, setting CONFIG_FORCEDETH_NAPI=y on my otherwise problematic Linux cl34 2.6.22-gentoo-r1 #2 SMP PREEMPT kernel seems to have helped considerably. I was able to bump up the test parameters to:

time ./crashClnt cl34 3311 -t 80 -n 10000 -a 10000

i.e. 80 threads sending 10K messages of 10K each. It still hung the server machine if I bumped it to:

time ./crashClnt cl34 3311 -t 80 -n 10000 -a 25000

but it didn't seem to crash or leave any trace in the logs; the machine was just dead as a doornail, unresponsive to pings etc. (Any tricks for getting more info out of this state?)

Will now try to get the latest kernel prepatch, ie. vanilla-sources-2.6.23_rc3. Might be a bit, since I'm new to using "raw" kernels. ;-)

;;peter

Grabbed the vanilla kernel:
http://www.kernel.org/pub/linux/kernel/v2.6/testing/linux-2.6.23-rc3.tar.bz2

This ran the 16 thread version, but seemed to hang on the 80 thread one, i.e.

ran: time ./crashClnt cl34 3311 -t 16 -n 10000 -a 10000
ran: time ./crashClnt cl34 3311 -t 40 -n 10000 -a 10000
hung sometimes: time ./crashClnt cl34 3311 -t 80 -n 10000 -a 10000

I left one 80 thread run going overnight. It was still hung in the morning, but after a bounce there were these two entries in /var/log/messages:

Aug 20 20:36:42 cl34 eth0: too many iterations (6) in nv_nic_irq.
Aug 20 20:37:12 cl34 eth0: too many iterations (6) in nv_nic_irq.

Subsequent tries with 80 threads seem to work today, but I do get lots (like 30 every 10 minutes) of these entries in the log file during the run:

# time ./crashClnt cl34 3311 -t 80 -n 100000 -a 1000
Executing thread 0
Executing thread 1
Executing thread 2
[...]
Executing thread 77
Executing thread 78
Executing thread 79
Server is: cl34:3311
Sent & received 100000 msgs of avg. size 1000 with 80 threads
Grand total: 16064000000 bytes, or 128512000000 bits

real 98m46.622s
user 0m2.316s
sys 2m35.470s

It does seem very sensitive to the count:size ratio - here the same total data volume was transferred in a bit over 2 minutes as opposed to 1.6 hours:

# time ./crashClnt cl34 3311 -t 80 -n 1000 -a 100000
Executing thread 0
Executing thread 1
Executing thread 2
[...]
Executing thread 78
Executing thread 79
Server is: cl34:3311
Sent & received 1000 msgs of avg. size 100000 with 80 threads
Grand total: 16000640000 bytes, or 128005120000 bits

real 2m12.389s
user 0m15.093s
sys 1m39.678s

So it seems the vanilla kernel is the best choice I have at the moment; hopefully the one hang was a fluke...

Any other ideas of things to try welcomed - ;;peter

Try using SysRq-t to get a stack trace after it hangs. You can read instructions for it in Documentation/sysrq.txt in your kernel directory. I'd suggest first trying a sequence like SysRq-t, SysRq-s, SysRq-u, SysRq-b to dump the trace, sync your disks, remount your filesystems read-only, then reboot. That should leave you with the stack traces in your system log after you reboot. Please attach the trace from the relevant process(es) here.

If that doesn't work, another option is to set up a serial console, as described in Documentation/serial-console.txt. You can also use netconsole (Documentation/networking/netconsole.txt) to capture log messages.

Thanks for the pointer to SysRq, I'll give it a try. I do already have a serial console set up, and am actually running these tests from those consoles. Now that the new kernel (linux-2.6.23-rc3) is not panicking, I see nothing output once things hang. Perhaps SysRq will provide some clues.

Created attachment 130282 [details]
upstream forcedeth.c
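
As a rough check on the count:size sensitivity reported earlier in this thread, the two 80-thread runs on 2.6.23-rc3 can be turned into throughput figures from their "Grand total" byte counts and "real" times. This is only a back-of-the-envelope sketch: the helper name `throughput_mbs` is hypothetical, and 1 MB is assumed to mean 1,000,000 bytes.

```shell
#!/bin/sh
# Throughput in MB/s (1 MB = 1,000,000 bytes, an assumption of this sketch)
# from a byte count and elapsed wall-clock seconds.
# NOTE: throughput_mbs is a hypothetical helper, not part of crashClnt.
throughput_mbs() {
    LC_ALL=C awk -v bytes="$1" -v secs="$2" \
        'BEGIN { printf "%.2f\n", bytes / secs / 1000000 }'
}

# -n 100000 -a 1000: real 98m46.622s = 98*60 + 46.622 = 5926.622 s
throughput_mbs 16064000000 5926.622

# -n 1000 -a 100000: real 2m12.389s = 2*60 + 12.389 = 132.389 s
throughput_mbs 16000640000 132.389
```

That works out to about 2.7 MB/s for the small-message run against about 121 MB/s for the large-message run, roughly a 45-fold difference for nearly the same data volume.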
Can you try the latest forcedeth that I am attaching? I believe the following change could have fixed your issue as well:
http://git.kernel.org/?p=linux/kernel/git/jgarzik/netdev-2.6.git;a=commitdiff;h=1a2b73302aacddf2543f9d7a25936e4323fa1486

Closing this bug. Please reopen when you have tested Ayaz's patch.

Was this patch applied? I seem to be having a similar problem - although I don't see any evidence of a kernel panic, my box does seem to lock up when there is high load with many connections.

It is in the latest vanilla stable release (currently 2.6.27.7). Could you please test with that and see if it fixes the problem for you?

Sure - I can check that. But is the patch in gentoo-sources-2.6.27-r4? If so then I'm already testing it.... Also, I just bought a cheap ethernet card to verify that it's not something other than the forcedeth driver... I'll update with results.

(In reply to comment #16)
> Sure - I can check that. But is the patch in gentoo-sources-2.6.27-r4? If so
> then I'm already testing it....
>

Seems so. gentoo-sources-2.6.27-r4 uses K_GENPATCHES_VER="6", which is based on 2.6.27.7.

I've been using the new ethernet card for a while (it's a card that uses the via-rhine module) with no problems even at high loads... |