Linux version 2.6.27-gentoo-r8 (root@server) (gcc version 4.1.2 (Gentoo 4.1.2 p1.0.2)) #3 SMP Tue Feb 10 08:38:22 CET 2009 Error in syslog accompanied by intermittend network connectivity loss leading to interrupted samba connections with data loss on application level. Feb 26 18:18:03 server ------------[ cut here ]------------ Feb 26 18:18:03 server WARNING: at net/sched/sch_generic.c:219 () Feb 26 18:18:03 server NETDEV WATCHDOG: eth0 (r8169): transmit timed out Feb 26 18:18:03 server Modules linked in: xt_tcpudp xt_state ipt_MASQUERADE iptable_nat nf_nat nf_conntrack_ipv4 nf_conntrack iptable_filter ip_tables x_tables aes_x86_64 aes_generic lrw gf128mul pppoe pppox pcspkr k8temp via_rhine i2c_piix4 i2c_core r8169 thermal tg3 e1000 sl811_hcd usbhid ohci_hcd uhci_hcd usb_storage ehci_hcd usbcore lpfc qla2xxx firmware_class megaraid_sas megaraid_mbox megaraid_mm megaraid aacraid 3w_9xxx 3w_xxxx mptsas scsi_transport_sas mptfc scsi_transport_fc scsi_tgt mptspi mptscsih mptbase atp870u dc395x qla1280 dmx3191d sym53c8xx gdth advansys initio BusLogic arcmsr aic7xxx aic79xx scsi_transport_spi sg pdc_adma sata_inic162x sata_mv ata_piix ahci sata_qstor sata_vsc sata_uli sata_sis sata_sx4 sata_nv sata_via sata_svw sata_sil24 sata_sil sata_promise scsi_wait_scan pata_sl82c105 pata_cs5530 pata_cs5520 pata_via pata_jmicron pata_marvell pata_sis pata_netcell pata_sc1200 pata_pdc202xx_old pata_triflex pata_atiixp pata_opti pata_amd pata_ali pata_it8213 pata_ns87415 pata_ns87410 pata_serverworks pata_platform pata_artop pata_it821x pata_optidma pata_hpt3x2n pata_hpt3x3 pata_hpt37x pata_hpt366 pata_cmd64x pata_efar pata_rz1000 pata_sil680 pata_radisys pata_pdc2027x pata_mpiix Feb 26 18:18:03 server Pid: 0, comm: swapper Not tainted 2.6.27-gentoo-r8 #3 Feb 26 18:18:03 server Feb 26 18:18:03 server Call Trace: Feb 26 18:18:03 server <IRQ> [<ffffffff8023158a>] 0xffffffff8023158a Feb 26 18:18:03 server [<ffffffff8022b0ad>] 0xffffffff8022b0ad Feb 26 18:18:03 server [<ffffffffa007d302>] 0xffffffffa007d302 Feb 26 18:18:03 server [<ffffffff80228033>] 0xffffffff80228033 Feb 26 18:18:03 server [<ffffffff8035f4b1>] 0xffffffff8035f4b1 Feb 26 18:18:03 server [<ffffffff80229a45>] 0xffffffff80229a45 Feb 26 18:18:03 server [<ffffffff80228fc5>] 0xffffffff80228fc5 Feb 26 18:18:03 server [<ffffffff80227e50>] 0xffffffff80227e50 Feb 26 18:18:03 server [<ffffffff8022b0ad>] 0xffffffff8022b0ad Feb 26 18:18:03 server [<ffffffff80241ce1>] 0xffffffff80241ce1 Feb 26 18:18:03 server [<ffffffff802282bb>] 0xffffffff802282bb Feb 26 18:18:03 server [<ffffffff804698a2>] 0xffffffff804698a2 Feb 26 18:18:03 server [<ffffffff80228bc9>] 0xffffffff80228bc9 Feb 26 18:18:03 server [<ffffffff8022ed57>] 0xffffffff8022ed57 Feb 26 18:18:03 server [<ffffffff80469783>] 0xffffffff80469783 Feb 26 18:18:03 server [<ffffffff80238a5e>] 0xffffffff80238a5e Feb 26 18:18:03 server [<ffffffff80235846>] 0xffffffff80235846 Feb 26 18:18:03 server [<ffffffff8020d0cc>] 0xffffffff8020d0cc Feb 26 18:18:03 server [<ffffffff8020ef9e>] 0xffffffff8020ef9e Feb 26 18:18:03 server [<ffffffff802357a2>] 0xffffffff802357a2 Feb 26 18:18:03 server [<ffffffff8021b6f2>] 0xffffffff8021b6f2 Feb 26 18:18:03 server [<ffffffff8020cb16>] 0xffffffff8020cb16 Feb 26 18:18:03 server <EOI> [<ffffffff8021b36a>] 0xffffffff8021b36a Feb 26 18:18:03 server [<ffffffff80212b8d>] 0xffffffff80212b8d Feb 26 18:18:03 server [<ffffffff80212d87>] 0xffffffff80212d87 Feb 26 18:18:03 server [<ffffffff8020abde>] 0xffffffff8020abde Feb 26 18:18:03 server Feb 26 18:18:03 server ---[ end trace 03a283dd89dbb16a ]--- Reproducible: Didn't try Steps to Reproduce: No idea how to reproduce it. System is a quite new (2 weeks) production system (encrypted samba domain server for ~15 clients) and seems to run fine, except it got a handfull of 'smbd invoked oom-killer' some days before (no swapspace on system at that time), example: Feb 19 13:23:23 server smbd invoked oom-killer: gfp_mask=0x1201d2, order=0, oomkilladj=0 Feb 19 13:23:23 server Pid: 11565, comm: smbd Not tainted 2.6.27-gentoo-r8 #3 Feb 19 13:23:23 server Feb 19 13:23:23 server Call Trace: Feb 19 13:23:23 server [<ffffffff80261c47>] 0xffffffff80261c47 Feb 19 13:23:23 server [<ffffffff8024620b>] 0xffffffff8024620b Feb 19 13:23:23 server [<ffffffff80261f54>] 0xffffffff80261f54 Feb 19 13:23:23 server [<ffffffff80262126>] 0xffffffff80262126 Feb 19 13:23:23 server [<ffffffff802641ab>] 0xffffffff802641ab Feb 19 13:23:23 server [<ffffffff80265df5>] 0xffffffff80265df5 Feb 19 13:23:23 server [<ffffffff80261458>] 0xffffffff80261458 Feb 19 13:23:23 server [<ffffffff8026b2d6>] 0xffffffff8026b2d6 Feb 19 13:23:23 server [<ffffffff8026cd52>] 0xffffffff8026cd52 Feb 19 13:23:23 server [<ffffffff802444bb>] 0xffffffff802444bb Feb 19 13:23:23 server [<ffffffff802225b2>] 0xffffffff802225b2 Feb 19 13:23:23 server [<ffffffff8024411c>] 0xffffffff8024411c Feb 19 13:23:23 server [<ffffffff804cc1f9>] 0xffffffff804cc1f9 Feb 19 13:23:23 server Feb 19 13:23:23 server Mem-Info: Feb 19 13:23:23 server DMA per-cpu: Feb 19 13:23:23 server CPU 0: hi: 0, btch: 1 usd: 0 Feb 19 13:23:23 server CPU 1: hi: 0, btch: 1 usd: 0 Feb 19 13:23:23 server DMA32 per-cpu: Feb 19 13:23:23 server CPU 0: hi: 186, btch: 31 usd: 156 Feb 19 13:23:23 server CPU 1: hi: 186, btch: 31 usd: 127 Feb 19 13:23:23 server Active:464564 inactive:543 dirty:0 writeback:0 unstable:0 Feb 19 13:23:23 server free:3263 slab:6133 mapped:38 pagetables:2280 bounce:0 Feb 19 13:23:23 server DMA free:7512kB min:28kB low:32kB high:40kB active:3732kB inactive:0kB present:10520kB pages_scanned:5566 all_unreclaimable? yes Feb 19 13:23:23 server lowmem_reserve[]: 0 1875 1875 1875 Feb 19 13:23:23 server DMA32 free:5540kB min:5524kB low:6904kB high:8284kB active:1854524kB inactive:2172kB present:1920892kB pages_scanned:194265 all_unreclaimable? no Feb 19 13:23:23 server lowmem_reserve[]: 0 0 0 0 Feb 19 13:23:23 server DMA: 140*4kB 147*8kB 105*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 1*4096kB = 7512kB Feb 19 13:23:23 server DMA32: 32*4kB 9*8kB 4*16kB 2*32kB 8*64kB 7*128kB 1*256kB 1*512kB 1*1024kB 1*2048kB 0*4096kB = 5576kB Feb 19 13:23:23 server 794 total pagecache pages Feb 19 13:23:23 server 0 pages in swap cache Feb 19 13:23:23 server Swap cache stats: add 0, delete 0, find 0/0 Feb 19 13:23:23 server Free swap = 0kB Feb 19 13:23:23 server Total swap = 0kB Feb 19 13:23:23 server 490976 pages RAM Feb 19 13:23:23 server 9257 pages reserved Feb 19 13:23:23 server 7739 pages shared Feb 19 13:23:23 server 475991 pages non-shared Feb 19 13:23:23 server Out of memory: kill process 10974 (smbd) score 14654 or a child Feb 19 13:23:23 server Killed process 11565 (smbd) Errors like these went away after a reboot of the system at which i added a 512MB swap partition. I have no idea if they are related, but now i guess the oom-killer didn't start because of main memory since: Feb 19 13:23:23 server free:3263 looks to me like there was free main memory available. Only strange thing i notice now is this from last reboot: Feb 19 13:29:18 server r8169 Gigabit Ethernet driver 2.3LK-NAPI loaded Feb 19 13:29:18 server r8169 0000:02:00.0: PCI INT A -> GSI 18 (level, low) -> IRQ 18 Feb 19 13:29:18 server r8169 0000:02:00.0: setting latency timer to 64 Feb 19 13:29:18 server eth0: RTL8168c/8111c at 0xffffc20000034000, 00:1f:d0:98:f0:ea, XID 3c4000c0 IRQ 381 which seems odd since at one point it talks about IRQ 18 and the other about IRQ 381 /proc/interrupts for these: 18: 0 0 IO-APIC-fasteoi ohci_hcd:usb5, ohci_hcd:usb6, ohci_hcd:usb7 381: 56659 123875893 PCI-MSI-edge eth0 Please advise me what kind of further information you may need.
Can you generate this oops with a kernel compiled with CONFIG_KALLSYMS enabled to get full backtracing support.
Even while hoping that the error will not surface again (since it's a production machine, and so far it hasn't resurfaced) i'm currently compiling in that kernel flag. Will have to see when i'm able to reboot the system though, could be some days before there is a window for it. What i did notice is: |eth0 Link encap:Ethernet HWaddr | inet addr:192.168.0.2 Bcast:192.168.0.255 Mask:255.255.255.0 | UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1 | RX packets:173067298 errors:0 dropped:0 overruns:0 frame:0 | TX packets:275714586 errors:0 dropped:46 overruns:0 carrier:0 | collisions:0 txqueuelen:1000 | RX bytes:23876716449 (22.2 GiB) TX bytes:310566145856 (289.2 GiB) | Interrupt:125 Base address:0x4000 ==> dropped:46 could it be the switch to which the box is connected (1GBit 8 port) ran amok and through this caused the problems?
Another network hickup, this time only thing in /var/log/messages was: Mar 10 13:35:52 server r8169: eth0: link up According to ifconfig the amount of dropped packages doubled from 46 to 92. I managed to schedule a maintenance offline time for the system tomorrow to install the kernel with CONFIG_KALLSYMS.
Any progress here?
Yes, actually there is - sorry for the delay. This is the first (except the usual "r8169: eth0: link up" messages when the network had a hickup) hopefully helpful thing i found in the log: Mar 31 14:29:09 server ------------[ cut here ]------------ Mar 31 14:29:09 server WARNING: at net/sched/sch_generic.c:219 dev_watchdog+0x11f/0x1ac() Mar 31 14:29:09 server NETDEV WATCHDOG: eth0 (r8169): transmit timed out Mar 31 14:29:09 server Modules linked in: xt_tcpudp aes_x86_64 aes_generic lrw gf128mul xt_state ipt_MASQUERADE iptable_nat nf_nat nf_conntrack_ipv4 nf_conntrack iptable_filter ip_tables x_tables p ppoe pppox pcspkr k8temp via_rhine i2c_piix4 i2c_core r8169 thermal tg3 e1000 sl811_hcd usbhid ohci_hcd uhci_hcd usb_storage ehci_hcd usbcore lpfc qla2xxx firmware_class megaraid_sas megaraid_mbox megaraid_mm megaraid aacraid 3w_9xxx 3w_xxxx mptsas scsi_transport_sas mptfc scsi_transport_fc scsi_tgt mptspi mptscsih mptbase atp870u dc395x qla1280 dmx3191d sym53c8xx gdth advansys initio BusLog ic arcmsr aic7xxx aic79xx scsi_transport_spi sg pdc_adma sata_inic162x sata_mv ata_piix ahci sata_qstor sata_vsc sata_uli sata_sis sata_sx4 sata_nv sata_via sata_svw sata_sil24 sata_sil sata_promis e scsi_wait_scan pata_sl82c105 pata_cs5530 pata_cs5520 pata_via pata_jmicron pata_marvell pata_sis pata_netcell pata_sc1200 pata_pdc202xx_old pata_triflex pata_atiixp pata_opti pata_amd pata_ali pa ta_it8213 pata_ns87415 pata_ns87410 pata_serverworks pata_platform pata_artop pata_it821x pata_optidma pata_hpt3x2n pata_hpt3x3 pata_hpt37x pata_hpt366 pata_cmd64x pata_efar pata_rz1000 pata_sil680 pata_radisys pata_pdc2027x pata_mpiix Mar 31 14:29:09 server Pid: 0, comm: swapper Not tainted 2.6.27-gentoo-r8 #1 Mar 31 14:29:09 server Mar 31 14:29:09 server Call Trace: Mar 31 14:29:09 server <IRQ> [<ffffffff8023160c>] warn_slowpath+0xb4/0xdd Mar 31 14:29:09 server [<ffffffff802280a3>] source_load+0x2a/0x4f Mar 31 14:29:09 server [<ffffffff80360821>] __next_cpu+0x19/0x26 Mar 31 14:29:09 server [<ffffffff80229ab5>] find_busiest_group+0x256/0x71c Mar 31 14:29:09 server [<ffffffff80229035>] enqueue_task_fair+0xd6/0xf1 Mar 31 14:29:09 server [<ffffffff80227ec0>] enqueue_task+0x13/0x1e Mar 31 14:29:09 server [<ffffffff8022b10e>] try_to_wake_up+0x167/0x179 Mar 31 14:29:09 server [<ffffffff802462a7>] getnstimeofday+0x53/0xb3 Mar 31 14:29:09 server [<ffffffff80238fd0>] lock_timer_base+0x26/0x4b Mar 31 14:29:09 server [<ffffffff80239115>] __mod_timer+0xb0/0xbf Mar 31 14:29:09 server [<ffffffff8046ad36>] dev_watchdog+0x11f/0x1ac Mar 31 14:29:09 server [<ffffffff80228c39>] __wake_up+0x38/0x4e Mar 31 14:29:09 server [<ffffffff8022edb8>] scheduler_tick+0xca/0x1af Mar 31 14:29:09 server [<ffffffff8046ac17>] dev_watchdog+0x0/0x1ac Mar 31 14:29:09 server [<ffffffff80238ae6>] run_timer_softirq+0x12c/0x193 Mar 31 14:29:09 server [<ffffffff802358ce>] __do_softirq+0x5e/0xcd Mar 31 14:29:09 server [<ffffffff8020d0cc>] call_softirq+0x1c/0x28 Mar 31 14:29:09 server [<ffffffff8020efae>] do_softirq+0x2c/0x68 Mar 31 14:29:09 server [<ffffffff8023582a>] irq_exit+0x3f/0x85 Mar 31 14:29:09 server [<ffffffff8021b7a2>] smp_apic_timer_interrupt+0x8e/0xa8 Mar 31 14:29:09 server [<ffffffff8020cb16>] apic_timer_interrupt+0x66/0x70 Mar 31 14:29:09 server <EOI> [<ffffffff8021b41a>] lapic_next_event+0x0/0xa Mar 31 14:29:09 server [<ffffffff80212c29>] default_idle+0x27/0x3b Mar 31 14:29:09 server [<ffffffff80212e23>] c1e_idle+0xe5/0xe9 Mar 31 14:29:09 server [<ffffffff8020abde>] cpu_idle+0x88/0xae Mar 31 14:29:09 server Mar 31 14:29:09 server ---[ end trace d716dd2492feb500 ]--- Mar 31 14:29:09 server r8169: eth0: link up I hope this contains the information needed to track this one down. Day after this message i plugged in an e1000 card (and ignore the onboard r8169 since then) - it didn't help regarding the random network malefunctions (which continued to happen at about the same rate of one every few days - at least according to the customer) but at least it got rid of any indication in /var/log/messages that they happen at all... Additinal info (no clue if it helps in any way): Because of the continued network hickups i today replaced the existing switches in the network with new ones. This included one 24 port HP 100MBit switch which only had low-volume seldom used windows clients on it (most traffic of them is from interaction with internet, so customer didn't bother to have them on a fast network so far). Before: /--------\ +----------+ +----------+ | Server | --- | 1 GBit | -------- | 1 GBit | ------ (Clients) \--------/ | 8 Port | -- | 16 Port | +----------+ \ +----------+ \ +----------+ -- | 100 MBit | ------ (Clients) | 24 Port | +----------+ Now: /--------\ +----------+ +----------+ | Server | --- | 1 GBit | -------- | 1 GBit | -----< (Clients) \--------/ | 24 Port | -- | 16 Port | +----------+ \ +----------+ \ ----< (Clients) The old setup ran without problems (with the old gentoo based server, had GBit interface too) for over 2 years. I'll report how it's looking now after eastern. Thanks for the support.
There is a similar upstream bug at http://bugzilla.kernel.org/show_bug.cgi?id=12411 User's there report the same issues with both 2.6.27 and 2.6.28. A request to test 2.6.29 was made, but no one has responded at this time. I know yours is a production server, which makes that nearly impossible I'm sure.
> A request to test 2.6.29 was made, but no one has responded at this time. I have similar bug in 2.6.29-r1: [69062.698027] ------------[ cut here ]------------ [69062.698031] WARNING: at net/sched/sch_generic.c:226 dev_watchdog+0xe7/0x148() [69062.698033] Hardware name: EP35-DS3R [69062.698035] NETDEV WATCHDOG: eth0 (r8169): transmit timed out [69062.698037] Modules linked in: cifs r8169 8139too ipg dm_snapshot scsi_wait_s can [last unloaded: i2c_i801] [69062.698046] Pid: 0, comm: swapper Tainted: G W 2.6.29-gentoo-r1 #6 [69062.698048] Call Trace: [69062.698050] <IRQ> [<ffffffff8105761b>] warn_slowpath+0xd3/0xf2 [69062.698058] [<ffffffff814f4e86>] ? br_nf_pre_routing+0x4b2/0x4d4 [69062.698062] [<ffffffff812be22b>] ? cpumask_next+0x19/0x1b [69062.698065] [<ffffffff812be2cd>] ? cpumask_next_and+0x20/0x33 [69062.698070] [<ffffffff8104e46b>] ? find_busiest_group+0x25a/0x784 [69062.698074] [<ffffffff81071035>] ? clocksource_read+0x7/0x9 [69062.698076] [<ffffffff81071672>] ? getnstimeofday+0x5a/0xbb [69062.698080] [<ffffffff8152ccfe>] ? _spin_lock+0x9/0xc [69062.698083] [<ffffffff8144a245>] ? __netif_tx_lock+0x16/0x1f [69062.698086] [<ffffffff8144a375>] ? netif_tx_lock+0x56/0x70 [69062.698089] [<ffffffff8144a3fa>] ? dev_watchdog+0x0/0x148 [69062.698091] [<ffffffff8144a4e1>] dev_watchdog+0xe7/0x148 [69062.698095] [<ffffffff8102bcd9>] ? read_tsc+0xd/0x24 [69062.698098] [<ffffffff810603ea>] ? cascade+0x4c/0x69 [69062.698101] [<ffffffff81060554>] run_timer_softirq+0x14d/0x1d0 [69062.698104] [<ffffffff8105c626>] __do_softirq+0x7e/0x12d [69062.698108] [<ffffffff8102755c>] call_softirq+0x1c/0x28 [69062.698111] [<ffffffff810287ea>] do_softirq+0x44/0x92 [69062.698114] [<ffffffff8105c510>] irq_exit+0x3f/0x79 [69062.698118] [<ffffffff810372ee>] smp_apic_timer_interrupt+0x77/0x84 [69062.698121] [<ffffffff81026f33>] apic_timer_interrupt+0x13/0x20 [69062.698123] <EOI> [<ffffffff81097cba>] ? rcu_pending+0x5d/0x69 [69062.698129] [<ffffffff8102cbb2>] ? mwait_idle+0x60/0x6e [69062.698131] [<ffffffff8102cba5>] ? mwait_idle+0x53/0x6e [69062.698134] [<ffffffff81024b7b>] ? cpu_idle+0x59/0x88 [69062.698138] [<ffffffff81527624>] ? start_secondary+0x191/0x195 [69062.698141] ---[ end trace f33d537881e0b97f ]---
(In reply to comment #7) > > A request to test 2.6.29 was made, but no one has responded at this time. > > I have similar bug in 2.6.29-r1: > FYI, this bug has been around for a long time now. I've seen the same thing happening as far back as 2.6.15, on mips based platforms. I don't have the hardware now, but I remember that link autoneg failed randomly on r8169 and always on the first boot. I couldn't find any fix for it, and it was generally accepted to be a h/w bug in RTL8169. Gregor, is it possible to use ethtool on the affected system? If this is the same bug, forcing any mode with ethtool (100M Full/ 10M Half etc) should fail at first, showinf 10M Half mode, and then automagically the link comes up and there's no problem till next boot. This used to work every time on mips, for reasons unknown.
Sorry, i will not mess with the live system of my customer for testing. I hope you can understand that. But in case i get another mainboard with same hardware i'll deploy the image of the server and give it a good beating in my testing environment (no clue if that helps though, because the error didn't show up when i tested the machine for the customer for some days before i shipped it).
OK, thanks for reporting anyhow, sorry that there is no immediate fix available... will keep watching the upstream bugs.
Reflecting upstream status, thus more information is necessary... If someone still experiences this then a git bisect needs to be done. See upstream bug for more information. http://wiki.gentoo.org/wiki/Kernel_git-bisect