Summary: | =sys-kernel/gentoo-sources-2.6.27 - NETDEV WATCHDOG: eth0 (r8169): transmit timed out | ||
---|---|---|---|
Product: | Gentoo Linux | Reporter: | Gregor Kopka <mailfrom-bugs.gentoo.org> |
Component: | [OLD] Core system | Assignee: | Gentoo Kernel Bug Wranglers and Kernel Maintainers <kernel> |
Status: | RESOLVED NEEDINFO | ||
Severity: | normal | CC: | ankit.chaturvedi, Arsen.Shnurkov, mark |
Priority: | Normal | ||
Version: | 2008.0 | ||
Hardware: | AMD64 | ||
OS: | Linux | ||
URL: | http://bugzilla.kernel.org/show_bug.cgi?id=12411 | ||
Whiteboard: | |||
Package list: | Runtime testing required: | --- |
Description
Gregor Kopka
2009-02-26 20:49:25 UTC
Can you generate this oops with a kernel compiled with CONFIG_KALLSYMS enabled to get full backtracing support. Even while hoping that the error will not surface again (since it's a production machine, and so far it hasn't resurfaced) i'm currently compiling in that kernel flag. Will have to see when i'm able to reboot the system though, could be some days before there is a window for it. What i did notice is: |eth0 Link encap:Ethernet HWaddr | inet addr:192.168.0.2 Bcast:192.168.0.255 Mask:255.255.255.0 | UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1 | RX packets:173067298 errors:0 dropped:0 overruns:0 frame:0 | TX packets:275714586 errors:0 dropped:46 overruns:0 carrier:0 | collisions:0 txqueuelen:1000 | RX bytes:23876716449 (22.2 GiB) TX bytes:310566145856 (289.2 GiB) | Interrupt:125 Base address:0x4000 ==> dropped:46 could it be the switch to which the box is connected (1GBit 8 port) ran amok and through this caused the problems? Another network hickup, this time only thing in /var/log/messages was: Mar 10 13:35:52 server r8169: eth0: link up According to ifconfig the amount of dropped packages doubled from 46 to 92. I managed to schedule a maintenance offline time for the system tomorrow to install the kernel with CONFIG_KALLSYMS. Any progress here? Yes, actually there is - sorry for the delay. This is the first (except the usual "r8169: eth0: link up" messages when the network had a hickup) hopefully helpful thing i found in the log: Mar 31 14:29:09 server ------------[ cut here ]------------ Mar 31 14:29:09 server WARNING: at net/sched/sch_generic.c:219 dev_watchdog+0x11f/0x1ac() Mar 31 14:29:09 server NETDEV WATCHDOG: eth0 (r8169): transmit timed out Mar 31 14:29:09 server Modules linked in: xt_tcpudp aes_x86_64 aes_generic lrw gf128mul xt_state ipt_MASQUERADE iptable_nat nf_nat nf_conntrack_ipv4 nf_conntrack iptable_filter ip_tables x_tables p ppoe pppox pcspkr k8temp via_rhine i2c_piix4 i2c_core r8169 thermal tg3 e1000 sl811_hcd usbhid ohci_hcd uhci_hcd usb_storage ehci_hcd usbcore lpfc qla2xxx firmware_class megaraid_sas megaraid_mbox megaraid_mm megaraid aacraid 3w_9xxx 3w_xxxx mptsas scsi_transport_sas mptfc scsi_transport_fc scsi_tgt mptspi mptscsih mptbase atp870u dc395x qla1280 dmx3191d sym53c8xx gdth advansys initio BusLog ic arcmsr aic7xxx aic79xx scsi_transport_spi sg pdc_adma sata_inic162x sata_mv ata_piix ahci sata_qstor sata_vsc sata_uli sata_sis sata_sx4 sata_nv sata_via sata_svw sata_sil24 sata_sil sata_promis e scsi_wait_scan pata_sl82c105 pata_cs5530 pata_cs5520 pata_via pata_jmicron pata_marvell pata_sis pata_netcell pata_sc1200 pata_pdc202xx_old pata_triflex pata_atiixp pata_opti pata_amd pata_ali pa ta_it8213 pata_ns87415 pata_ns87410 pata_serverworks pata_platform pata_artop pata_it821x pata_optidma pata_hpt3x2n pata_hpt3x3 pata_hpt37x pata_hpt366 pata_cmd64x pata_efar pata_rz1000 pata_sil680 pata_radisys pata_pdc2027x pata_mpiix Mar 31 14:29:09 server Pid: 0, comm: swapper Not tainted 2.6.27-gentoo-r8 #1 Mar 31 14:29:09 server Mar 31 14:29:09 server Call Trace: Mar 31 14:29:09 server <IRQ> [<ffffffff8023160c>] warn_slowpath+0xb4/0xdd Mar 31 14:29:09 server [<ffffffff802280a3>] source_load+0x2a/0x4f Mar 31 14:29:09 server [<ffffffff80360821>] __next_cpu+0x19/0x26 Mar 31 14:29:09 server [<ffffffff80229ab5>] find_busiest_group+0x256/0x71c Mar 31 14:29:09 server [<ffffffff80229035>] enqueue_task_fair+0xd6/0xf1 Mar 31 14:29:09 server [<ffffffff80227ec0>] enqueue_task+0x13/0x1e Mar 31 14:29:09 server [<ffffffff8022b10e>] try_to_wake_up+0x167/0x179 Mar 31 14:29:09 server [<ffffffff802462a7>] getnstimeofday+0x53/0xb3 Mar 31 14:29:09 server [<ffffffff80238fd0>] lock_timer_base+0x26/0x4b Mar 31 14:29:09 server [<ffffffff80239115>] __mod_timer+0xb0/0xbf Mar 31 14:29:09 server [<ffffffff8046ad36>] dev_watchdog+0x11f/0x1ac Mar 31 14:29:09 server [<ffffffff80228c39>] __wake_up+0x38/0x4e Mar 31 14:29:09 server [<ffffffff8022edb8>] scheduler_tick+0xca/0x1af Mar 31 14:29:09 server [<ffffffff8046ac17>] dev_watchdog+0x0/0x1ac Mar 31 14:29:09 server [<ffffffff80238ae6>] run_timer_softirq+0x12c/0x193 Mar 31 14:29:09 server [<ffffffff802358ce>] __do_softirq+0x5e/0xcd Mar 31 14:29:09 server [<ffffffff8020d0cc>] call_softirq+0x1c/0x28 Mar 31 14:29:09 server [<ffffffff8020efae>] do_softirq+0x2c/0x68 Mar 31 14:29:09 server [<ffffffff8023582a>] irq_exit+0x3f/0x85 Mar 31 14:29:09 server [<ffffffff8021b7a2>] smp_apic_timer_interrupt+0x8e/0xa8 Mar 31 14:29:09 server [<ffffffff8020cb16>] apic_timer_interrupt+0x66/0x70 Mar 31 14:29:09 server <EOI> [<ffffffff8021b41a>] lapic_next_event+0x0/0xa Mar 31 14:29:09 server [<ffffffff80212c29>] default_idle+0x27/0x3b Mar 31 14:29:09 server [<ffffffff80212e23>] c1e_idle+0xe5/0xe9 Mar 31 14:29:09 server [<ffffffff8020abde>] cpu_idle+0x88/0xae Mar 31 14:29:09 server Mar 31 14:29:09 server ---[ end trace d716dd2492feb500 ]--- Mar 31 14:29:09 server r8169: eth0: link up I hope this contains the information needed to track this one down. Day after this message i plugged in an e1000 card (and ignore the onboard r8169 since then) - it didn't help regarding the random network malefunctions (which continued to happen at about the same rate of one every few days - at least according to the customer) but at least it got rid of any indication in /var/log/messages that they happen at all... Additinal info (no clue if it helps in any way): Because of the continued network hickups i today replaced the existing switches in the network with new ones. This included one 24 port HP 100MBit switch which only had low-volume seldom used windows clients on it (most traffic of them is from interaction with internet, so customer didn't bother to have them on a fast network so far). Before: /--------\ +----------+ +----------+ | Server | --- | 1 GBit | -------- | 1 GBit | ------ (Clients) \--------/ | 8 Port | -- | 16 Port | +----------+ \ +----------+ \ +----------+ -- | 100 MBit | ------ (Clients) | 24 Port | +----------+ Now: /--------\ +----------+ +----------+ | Server | --- | 1 GBit | -------- | 1 GBit | -----< (Clients) \--------/ | 24 Port | -- | 16 Port | +----------+ \ +----------+ \ ----< (Clients) The old setup ran without problems (with the old gentoo based server, had GBit interface too) for over 2 years. I'll report how it's looking now after eastern. Thanks for the support. There is a similar upstream bug at http://bugzilla.kernel.org/show_bug.cgi?id=12411 User's there report the same issues with both 2.6.27 and 2.6.28. A request to test 2.6.29 was made, but no one has responded at this time. I know yours is a production server, which makes that nearly impossible I'm sure. > A request to test 2.6.29 was made, but no one has responded at this time.
I have similar bug in 2.6.29-r1:
[69062.698027] ------------[ cut here ]------------
[69062.698031] WARNING: at net/sched/sch_generic.c:226 dev_watchdog+0xe7/0x148()
[69062.698033] Hardware name: EP35-DS3R
[69062.698035] NETDEV WATCHDOG: eth0 (r8169): transmit timed out
[69062.698037] Modules linked in: cifs r8169 8139too ipg dm_snapshot scsi_wait_s
can [last unloaded: i2c_i801]
[69062.698046] Pid: 0, comm: swapper Tainted: G W 2.6.29-gentoo-r1 #6
[69062.698048] Call Trace:
[69062.698050] <IRQ> [<ffffffff8105761b>] warn_slowpath+0xd3/0xf2
[69062.698058] [<ffffffff814f4e86>] ? br_nf_pre_routing+0x4b2/0x4d4
[69062.698062] [<ffffffff812be22b>] ? cpumask_next+0x19/0x1b
[69062.698065] [<ffffffff812be2cd>] ? cpumask_next_and+0x20/0x33
[69062.698070] [<ffffffff8104e46b>] ? find_busiest_group+0x25a/0x784
[69062.698074] [<ffffffff81071035>] ? clocksource_read+0x7/0x9
[69062.698076] [<ffffffff81071672>] ? getnstimeofday+0x5a/0xbb
[69062.698080] [<ffffffff8152ccfe>] ? _spin_lock+0x9/0xc
[69062.698083] [<ffffffff8144a245>] ? __netif_tx_lock+0x16/0x1f
[69062.698086] [<ffffffff8144a375>] ? netif_tx_lock+0x56/0x70
[69062.698089] [<ffffffff8144a3fa>] ? dev_watchdog+0x0/0x148
[69062.698091] [<ffffffff8144a4e1>] dev_watchdog+0xe7/0x148
[69062.698095] [<ffffffff8102bcd9>] ? read_tsc+0xd/0x24
[69062.698098] [<ffffffff810603ea>] ? cascade+0x4c/0x69
[69062.698101] [<ffffffff81060554>] run_timer_softirq+0x14d/0x1d0
[69062.698104] [<ffffffff8105c626>] __do_softirq+0x7e/0x12d
[69062.698108] [<ffffffff8102755c>] call_softirq+0x1c/0x28
[69062.698111] [<ffffffff810287ea>] do_softirq+0x44/0x92
[69062.698114] [<ffffffff8105c510>] irq_exit+0x3f/0x79
[69062.698118] [<ffffffff810372ee>] smp_apic_timer_interrupt+0x77/0x84
[69062.698121] [<ffffffff81026f33>] apic_timer_interrupt+0x13/0x20
[69062.698123] <EOI> [<ffffffff81097cba>] ? rcu_pending+0x5d/0x69
[69062.698129] [<ffffffff8102cbb2>] ? mwait_idle+0x60/0x6e
[69062.698131] [<ffffffff8102cba5>] ? mwait_idle+0x53/0x6e
[69062.698134] [<ffffffff81024b7b>] ? cpu_idle+0x59/0x88
[69062.698138] [<ffffffff81527624>] ? start_secondary+0x191/0x195
[69062.698141] ---[ end trace f33d537881e0b97f ]---
(In reply to comment #7) > > A request to test 2.6.29 was made, but no one has responded at this time. > > I have similar bug in 2.6.29-r1: > FYI, this bug has been around for a long time now. I've seen the same thing happening as far back as 2.6.15, on mips based platforms. I don't have the hardware now, but I remember that link autoneg failed randomly on r8169 and always on the first boot. I couldn't find any fix for it, and it was generally accepted to be a h/w bug in RTL8169. Gregor, is it possible to use ethtool on the affected system? If this is the same bug, forcing any mode with ethtool (100M Full/ 10M Half etc) should fail at first, showinf 10M Half mode, and then automagically the link comes up and there's no problem till next boot. This used to work every time on mips, for reasons unknown. Sorry, i will not mess with the live system of my customer for testing. I hope you can understand that. But in case i get another mainboard with same hardware i'll deploy the image of the server and give it a good beating in my testing environment (no clue if that helps though, because the error didn't show up when i tested the machine for the customer for some days before i shipped it). OK, thanks for reporting anyhow, sorry that there is no immediate fix available... will keep watching the upstream bugs. Reflecting upstream status, thus more information is necessary... If someone still experiences this then a git bisect needs to be done. See upstream bug for more information. http://wiki.gentoo.org/wiki/Kernel_git-bisect |