A machine running gentoo-sources-2.6.15-r1 crashed. Here is the crash dump:
[17372366.136000] Call Trace:
[17372366.136000] [<f8839624>] tg3_rx+0x184/0x358 [tg3]
[17372366.136000] [<f883986e>] tg3_poll+0x76/0x129 [tg3]
[17372366.136000] [<c028ab1d>] net_rx_action+0x69/0xf0
[17372366.136000] [<c011a43d>] __do_softirq+0x55/0xbd
[17372366.136000] [<c011a4d2>] do_softirq+0x2d/0x31
[17372366.136000] [<c0104447>] do_IRQ+0x47/0x4f
[17372366.136000] [<c0102e52>] common_interrupt+0x1a/0x20
[17372366.136000] [<c010088c>] default_idle+0x0/0x55
[17372366.136000] [<c01008b8>] default_idle+0x2c/0x55
[17372366.136000] [<c010094f>] cpu_idle+0x5a/0x6f
[17372366.136000] [<c03dc795>] start_kernel+0x14d/0x14f
[17372366.136000] Code: 83 7c 24 20 00 74 1e 0f b6 43 6d c7 83 30 01 00 00 01 0
[17372366.136000] <0>Kernel panic - not syncing: Fatal exception in interrupt
[17372366.328000]
Here is the missing part of the crash message on the console:
[17372366.136000] SMP
[17372366.136000] Modules linked in: af_packet autofs4 parport_pc lp parport md3
[17372366.136000] CPU: 0
[17372366.136000] EIP: 0060:[<c028596b>] Not tainted VLI
[17372366.136000] EFLAGS: 00010246 (2.6.15-gentoo-r1)
[17372366.136000] EIP is at __alloc_skb+0xc5/0x130
[17372366.136000] eax: d98a8e80 ebx: d703d180 ecx: 00000000 edx: d98a8e00
[17372366.136000] esi: 00000080 edi: d703d200 ebp: 00000020 esp: c03dbf00
[17372366.136000] ds: 007b es: 007b ss: 0068
[17372366.136000] Process swapper (pid: 0, threadinfo=c03da000 task=c0347b20)
[17372366.136000] Stack: f7b947a8 00010000 f5a19e80 00000122 00000042 f8839624
[17372366.136000] 00000000 f7f703e0 f6daa440 00000000 01230000 00000122
[17372366.136000] f7f70380 f7f70000 c03dbf74 f883986e f7f70380 00000040
[17372366.136000] Call Trace:
Please post "emerge --info" output to every bug that you file. How often does the crash occur? Is there a way to reliably reproduce it?
It happened after probably two days of running. We fell back to 2.4.32 since this is our central fileserver. Since we don't know what happened beforehand (although I guess the backup had just started), we can't say whether we can reproduce this.
Just a bit of extra info (taken from the 2.4.32 kernel running now):
ethtool -i eth0
driver: tg3
version: 3.26
firmware-version:
bus-info: 01:00.0
In case this is any help.
I can report the oops, but it probably wouldn't get much attention without more details on the reproducibility. We would also need to demonstrate that the latest development kernel (currently vanilla-sources-2.6.16_rc5) is affected. Would you be able to perform further testing on that kernel?
If it weren't our production server I would be. But since this is a mission-critical system I can't play with it. Unfortunately I don't have another box with the same NIC that we could run gentoo-2.6.15-r1 on.
Have there been any changes in this particular driver after 2.6.15-r1?
Yes, but it's hard to say whether they would affect your problem. It is also hard to say whether this problem would reappear in days, months, or even years, even on 2.6.15. If it is *that* rare there is always a chance that 2.4.32 is also affected.
I will file a report upstream, but first I need a little more information. You can safely get this info while running 2.4.32. Please follow this sequence:
emerge -n --oneshot gdb
cd /usr/src/linux-2.6.15-gentoo-r1
rm drivers/net/tg3.o
make CONFIG_DEBUG_INFO=y drivers/net/tg3.o
gdb drivers/net/tg3.o
(at gdb prompt:) list *tg3_rx+0x184
Please paste the gdb output here.
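The gdb step above takes a symbol+offset pair straight from the call trace. As a rough sketch only (the helper name gdb_list_commands and the regular expression are illustrative assumptions, not part of gdb or the kernel), something like this could turn call-trace entries from a 2.6-style oops into the matching gdb commands:

```python
import re

# Matches call-trace entries such as "[<f8839624>] tg3_rx+0x184/0x358 [tg3]".
# The trailing "[module]" part is optional (absent for built-in symbols).
TRACE_ENTRY = re.compile(
    r"\[<(?P<addr>[0-9a-f]+)>\]\s+"
    r"(?P<symbol>[A-Za-z_]\w*)\+(?P<offset>0x[0-9a-f]+)/0x[0-9a-f]+"
    r"(?:\s+\[(?P<module>\w+)\])?"
)


def gdb_list_commands(oops_text, module=None):
    """Return gdb 'list' commands for trace entries, optionally for one module.

    Hypothetical helper for illustration; it only reformats text that is
    already present in the oops.
    """
    commands = []
    for match in TRACE_ENTRY.finditer(oops_text):
        if module is not None and match.group("module") != module:
            continue
        commands.append(f"list *{match.group('symbol')}+{match.group('offset')}")
    return commands


# Three entries copied from the call trace in this report.
sample = (
    "[17372366.136000] [<f8839624>] tg3_rx+0x184/0x358 [tg3] "
    "[17372366.136000] [<f883986e>] tg3_poll+0x76/0x129 [tg3] "
    "[17372366.136000] [<c028ab1d>] net_rx_action+0x69/0xf0"
)

for command in gdb_list_commands(sample, module="tg3"):
    print(command)
```

Filtering on the module name keeps only the entries that can be resolved against drivers/net/tg3.o; entries from the core kernel would need a vmlinux built with debug info instead.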
Also, you are missing the very start of the oops report. Here's an example of the kind of thing you'd expect at the top:
Unable to handle kernel paging request at virtual address 40000010
printing eip:
c022d0b9
*pde = 00000000
Oops: 0000 [#1]
PREEMPT SMP
Modules linked in: etc
Also, please post "emerge --info" output.
First the gdb output:
(gdb) list *tg3_rx+0x184
0x3624 is in tg3_rx (skbuff.h:314).
309 extern struct sk_buff *__alloc_skb(unsigned int size,
310 gfp_t priority, int fclone);
311 static inline struct sk_buff *alloc_skb(unsigned int size,
312 gfp_t priority)
313 {
314 return __alloc_skb(size, priority, 0);
315 }
316
317 static inline struct sk_buff *alloc_skb_fclone(unsigned int size,
318 gfp_t priority)

And emerge --info:
Portage 2.0.54 (default-linux/x86/2006.0, gcc-3.4.5, glibc-2.3.5-r2, 2.4.32 i686)
=================================================================
System uname: 2.4.32 i686 Intel(R) Pentium(R) 4 CPU 2.40GHz
Gentoo Base System version 1.6.14
distcc 2.18.3 i686-pc-linux-gnu (protocols 1 and 2) (default port 3632) [disabled]
ccache version 2.3 [disabled]
dev-lang/python: 2.3.5-r2, 2.4.2
sys-apps/sandbox: 1.2.12
sys-devel/autoconf: 2.13, 2.59-r6
sys-devel/automake: 1.4_p6, 1.5, 1.6.3, 1.7.9-r1, 1.8.5-r3, 1.9.6-r1
sys-devel/binutils: 2.16.1
sys-devel/libtool: 1.5.22
virtual/os-headers: 2.6.11-r2
ACCEPT_KEYWORDS="x86"
AUTOCLEAN="yes"
CBUILD="i686-pc-linux-gnu"
CFLAGS="-O3 -march=pentium3 -fprefetch-loop-arrays -funroll-loops -pipe"
CHOST="i686-pc-linux-gnu"
CONFIG_PROTECT="/etc /opt/tomcat/conf /usr/kde/2/share/config /usr/kde/3.3/env /usr/kde/3.3/share/config /usr/kde/3.3/shutdown /usr/kde/3.4/env /usr/kde/3.4/share/config /usr/kde/3.4/shutdown /usr/kde/3/share/config /usr/lib/X11/xkb /usr/share/config /usr/share/texmf/dvipdfm/config/ /usr/share/texmf/dvips/config/ /usr/share/texmf/tex/generic/config/ /usr/share/texmf/tex/platex/config/ /usr/share/texmf/xdvi/ /var/bind /var/qmail/control"
CONFIG_PROTECT_MASK="/etc/gconf /etc/terminfo /etc/env.d"
CXXFLAGS="-O3 -march=pentium3 -fprefetch-loop-arrays -funroll-loops -pipe"
DISTDIR="/usr/portage/distfiles"
FEATURES="autoconfig distlocks sandbox sfperms strict"
GENTOO_MIRRORS="http://distfiles.gentoo.org http://distro.ibiblio.org/pub/linux/distributions/gentoo"
PKGDIR="/usr/portage/packages"
PORTAGE_TMPDIR="/export/netshare/portagetmp"
PORTDIR="/usr/portage"
SYNC="rsync://rsync.gentoo.org/gentoo-portage"
USE="x86 X acl apache2 apm arts audiofile avi berkdb bitmap-fonts bzip2 crypt cups curl eds emboss encode esd ethereal expat fam foomaticdb fortran gd gdbm gif glut gnome gpm gstreamer gtk2 idn imap imlib ipv6 java jpeg junit kde lcms ldap libg++ libwww mad mbox mikmod mmx mng motif mp3 mpeg ncurses nls nptl ogg opengl oss pam pcre pdflib perl png postgres python qt quicktime readline samba sdl slang snmp spell sse ssl svga tcpd tetex tiff truetype truetype-fonts type1-fonts udev usb vorbis xml xml2 xmms xv zlib userland_GNU kernel_linux elibc_glibc"
Unset: ASFLAGS, CTARGET, LANG, LC_ALL, LDFLAGS, LINGUAS, MAKEOPTS, PORTDIR_OVERLAY
Is it possible to see the very start of the oops report?
This is all I found in the buffer:
This is crom.netage.de (Linux i686 2.4.32) 08:12:33
crom login: Oops: 0000
CPU: 0
EIP: 0010:[<c0117466>] Not tainted
EFLAGS: 00010206
eax: 00000013 ebx: 1f800000 ecx: c0324554 edx: 00003268
esi: 00000000 edi: ee5b6000 ebp: 00000011 esp: ee5b7d68
ds: 0018 es: 0018 ss: 0018
Process irc (pid: 2561, stackpage=ee5b7000)
Stack: c02dd5c3 1f800163 ee5b7dac 00000001 f0fd0018 d7c20018 ffffff10 f557069c
00000010 00000286 f46c49a0 00030001 00000286 00000001 f46c499c f46c4000
f07c8811 f07c8000 c02158af f46c4000 0000001d 00020001 f07c8000 00000000
Call Trace: [<c02158af>] [<c028a673>] [<c02803e1>] [<c0116f50>] [<c01073b0>] [<c0263ee5>] [<c0263f77>] [<c0139049>] [<c02640e6>] [<c028b686>] [<c02ac2ba>] [<c02601bb>] [<c026031b>] [<c0141470>] [<c01072bf>]
Code: 8b 9c ab 00 00 00 c0 c7 04 24 d9 d5 2d c0 89 5c 24 04 e8 83
That looks like a totally separate oops - one that occurred under 2.4.32. To make any sense of it you need to run it through ksymoops (you can find this in portage).
Here is the complete oops:
[17372366.136000] Oops: 0003 [#1]
[17372366.136000] SMP
[17372366.136000] Modules linked in: af_packet autofs4 parport_pc lp parport md3
[17372366.136000] CPU: 0
[17372366.136000] EIP: 0060:[<c028596b>] Not tainted VLI
[17372366.136000] EFLAGS: 00010246 (2.6.15-gentoo-r1)
[17372366.136000] EIP is at __alloc_skb+0xc5/0x130
[17372366.136000] eax: d98a8e80 ebx: d703d180 ecx: 00000000 edx: d98a8e00
[17372366.136000] esi: 00000080 edi: d703d200 ebp: 00000020 esp: c03dbf00
[17372366.136000] ds: 007b es: 007b ss: 0068
[17372366.136000] Process swapper (pid: 0, threadinfo=c03da000 task=c0347b20)
[17372366.136000] Stack: f7b947a8 00010000 f5a19e80 00000122 00000042 f8839624
[17372366.136000] 00000000 f7f703e0 f6daa440 00000000 01230000 00000122
[17372366.136000] f7f70380 f7f70000 c03dbf74 f883986e f7f70380 00000040
[17372366.136000] Call Trace:
[17372366.136000] [<f8839624>] tg3_rx+0x184/0x358 [tg3]
[17372366.136000] [<f883986e>] tg3_poll+0x76/0x129 [tg3]
[17372366.136000] [<c028ab1d>] net_rx_action+0x69/0xf0
[17372366.136000] [<c011a43d>] __do_softirq+0x55/0xbd
[17372366.136000] [<c011a4d2>] do_softirq+0x2d/0x31
[17372366.136000] [<c0104447>] do_IRQ+0x47/0x4f
[17372366.136000] [<c0102e52>] common_interrupt+0x1a/0x20
[17372366.136000] [<c010088c>] default_idle+0x0/0x55
[17372366.136000] [<c01008b8>] default_idle+0x2c/0x55
[17372366.136000] [<c010094f>] cpu_idle+0x5a/0x6f
[17372366.136000] [<c03dc795>] start_kernel+0x14d/0x14f
[17372366.136000] Code: 83 7c 24 20 00 74 1e 0f b6 43 6d c7 83 30 01 00 00 01 0
[17372366.136000] <0>Kernel panic - not syncing: Fatal exception in interrupt
[17372366.328000]
Thanks, that is looking better. You are still missing a few lines from the very top though. I'll paste the sample again for reference:
Unable to handle kernel paging request at virtual address 40000010
printing eip:
c022d0b9
*pde = 00000000
Oops: 0000 [#1] <--- your log starts here
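As a side note on the "Oops: NNNN" line: when an oops on 32-bit x86 is raised by a page fault, that number is the processor's page-fault error code, and its low three bits say what kind of access failed. The sketch below decodes those bits. It assumes the oops really was a page fault, which cannot be confirmed here because the top lines are missing, and the function name decode_page_fault_code is made up for illustration.

```python
def decode_page_fault_code(code):
    """Decode the low three bits of an x86 page-fault error code.

    Hypothetical helper; only meaningful if the oops was caused by a
    page fault (an assumption that the log in this report cannot confirm).
    """
    return {
        # bit 0: 0 = page not present, 1 = protection violation
        "cause": "protection violation" if code & 0x1 else "page not present",
        # bit 1: 0 = read access, 1 = write access
        "access": "write" if code & 0x2 else "read",
        # bit 2: 0 = kernel mode, 1 = user mode
        "mode": "user" if code & 0x4 else "kernel",
    }


# "Oops: 0003" is the value from the 2.6.15 report; "Oops: 0000" is the sample.
for value in (0x0003, 0x0000):
    print(f"{value:04x}", decode_page_fault_code(value))
```

Under that assumption, 0003 would read as a kernel-mode write that hit a protection violation, whereas the 0000 in the sample is a kernel-mode read of a page that was not present.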
I am afraid that's all the serial console gave me.
Konstantin
Filed a bug upstream. I'm not sure if anything can be done without further testing on your side. Either way, thanks for the report.
Another question. The crash message shows that you have an "md3" module loaded. Where did this come from?
Upstream bug marked invalid as it is not clear where md3 comes from.