I am using a Xircom CEM33 PCMCIA NIC which has occasional hardware problems. If the netdev watchdog detects a transmit timeout, do_reset is called, which calls msleep. As dev_watchdog holds a spinlock, this msleep is fatal:

NETDEV WATCHDOG: eth0: transmit timed out
eth0: transmit timed out
scheduling while atomic: sh/0x00000100/27135
 [<c02c30e6>] schedule+0x536/0x610
 [<c01314df>] __do_IRQ+0x8f/0xa0
 [<c01314c4>] __do_IRQ+0x74/0xa0
 [<c02c3855>] schedule_timeout+0x45/0xa0
 [<c011c7f0>] process_timeout+0x0/0x10
 [<c4c51bdf>] hardreset+0x5f/0x70 [xirc2ps_cs]
 [<c011ca66>] msleep+0x26/0x30
 [<c4c51c22>] do_reset+0x32/0x3a0 [xirc2ps_cs]
 [<c4c5170c>] do_tx_timeout+0x2c/0x80 [xirc2ps_cs]
 [<c02811c0>] dev_watchdog+0x0/0x90
 [<c0281245>] dev_watchdog+0x85/0x90
 [<c011c554>] run_timer_softirq+0xb4/0x180
 [<c0118cb2>] __do_softirq+0x42/0xa0
 [<c0118d36>] do_softirq+0x26/0x30
 [<c010421f>] do_IRQ+0x1f/0x30
 [<c0102afa>] common_interrupt+0x1a/0x20

I looked at other NIC drivers and created a patch using schedule_work which solves this problem. I do not think this is caused by any Gentoo patches, as I found the same buggy code in vanilla kernel 2.6.15.
Created attachment 93003: patch against 2.6.17-gentoo-r4
Looks good, but the work structure should be initialised during xirc2ps_probe rather than when the interface is brought up. Can you correct the patch?
Also, the ugly (void (*)(void *)) casts are not needed.
Created attachment 94275: corrected patch against 2.6.17-gentoo-r4

Of course we could get rid of the #ifdef HAVE_TX_TIMEOUT too.
Why have you added the extra changes? do_tx_timeout is not used unless HAVE_TX_TIMEOUT is defined, so no need to provide the "alternate version". As it is static, the compiler will remove it if HAVE_TX_TIMEOUT is not defined (but it always will be). Just provide xirc2ps_tx_timeout_task in similar fashion. Also, tx_errors should be incremented immediately, not when the workqueue gets around to scheduling the task.
Created attachment 94565: patch against 2.6.17-gentoo-r4

You are right, the additional conditional stuff is crap. The forward declaration of xirc2ps_tx_timeout_task is necessary because INIT_WORK has moved. I changed the argument of xirc2ps_tx_timeout_task to what it is supposed to be, to avoid a compile warning at INIT_WORK. As a consequence I inserted the local variable *dev in xirc2ps_tx_timeout_task to avoid two type casts. I hope it's OK now.
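
For readers following along without the attachment, here is a rough userspace sketch of the pattern discussed above. It is not the attached patch. The struct work_struct, INIT_WORK and schedule_work definitions below are small stand-ins that only mimic the three-argument INIT_WORK shape of 2.6.17-era kernels. Everything prefixed mock_, the struct layouts, the resets counter and the simplified signatures of do_reset and xirc2ps_probe are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* --- Userspace stand-ins, NOT the kernel implementation --- */
struct work_struct {
    void (*func)(void *);
    void *data;
    int pending;
};

/* Same three-argument shape as INIT_WORK in 2.6.17-era kernels. */
#define INIT_WORK(w, f, d) \
    do { (w)->func = (f); (w)->data = (d); (w)->pending = 0; } while (0)

static struct work_struct *mock_queue;  /* one-slot queue, enough for a demo */
static int mock_atomic;                 /* 1 while "holding the spinlock" */

static int schedule_work(struct work_struct *w)
{
    if (w->pending)
        return 0;                       /* already queued */
    w->pending = 1;
    mock_queue = w;
    return 1;
}

/* Plays the role of the worker thread: process context, may sleep. */
static void mock_run_pending_work(void)
{
    struct work_struct *w = mock_queue;

    if (!w)
        return;
    mock_queue = NULL;
    w->pending = 0;
    w->func(w->data);
}

static void mock_msleep(unsigned int ms)
{
    (void)ms;                           /* duration is irrelevant here */
    assert(!mock_atomic);               /* sleeping in atomic context is the bug */
}

/* --- Driver-shaped code; struct layouts and signatures are hypothetical --- */
struct net_device {
    void *priv;
};

struct local_info {
    unsigned long tx_errors;
    unsigned long resets;               /* demo-only counter */
    struct work_struct tx_timeout_task;
};

static void do_reset(struct net_device *dev)
{
    struct local_info *lp = dev->priv;

    mock_msleep(40);                    /* stands in for the msleep in do_reset */
    lp->resets++;
}

/* Takes void * so INIT_WORK needs no function-pointer cast. */
static void xirc2ps_tx_timeout_task(void *data)
{
    struct net_device *dev = data;      /* one implicit conversion, no casts */

    do_reset(dev);
}

/* Called by the watchdog in atomic context: count the error, defer the reset. */
static void do_tx_timeout(struct net_device *dev)
{
    struct local_info *lp = dev->priv;

    lp->tx_errors++;
    schedule_work(&lp->tx_timeout_task);
}

/* Initialise the work item once, at probe time. */
static void xirc2ps_probe(struct net_device *dev, struct local_info *lp)
{
    lp->tx_errors = 0;
    lp->resets = 0;
    dev->priv = lp;
    INIT_WORK(&lp->tx_timeout_task, xirc2ps_tx_timeout_task, dev);
}

/* Mock watchdog: invokes the timeout handler under the "spinlock". */
static void mock_dev_watchdog(struct net_device *dev)
{
    mock_atomic = 1;
    do_tx_timeout(dev);
    mock_atomic = 0;
}
```

Calling xirc2ps_probe and then mock_dev_watchdog leaves tx_errors at 1 and resets at 0. The reset itself only happens once mock_run_pending_work runs outside the mock atomic section, which mirrors the two review points: the error is counted immediately, and the sleeping reset is deferred to a context that is allowed to sleep.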
thanks, sent upstream
This is now merged into Linus' tree for 2.6.18-rc6 or 2.6.18 (depending on which comes next!)
Fixed in gentoo-sources-2.6.17-r8 (genpatches-2.6.17-10). Thanks for the patch.