User attempted to copy a large file over NFS from one machine to the machine that is experiencing this problem. The filesystem in question is ext3; however, similar panics happen even when it's reiserfs. (We originally thought it was reiserfs, so we converted the filesystem over to ext3.) The problem still occurs quite regularly. Any time there is a high amount of disk I/O on the system, the kernel panics and the disk becomes unusable. A reboot is needed after that. The system is a dual Opteron (AMD64) using SMP, kernel 2.6.12-gentoo-r9.

Reproducible: Always

Steps to Reproduce:
1. Do anything that does huge file writes to the filesystem.
2. Watch the kernel panic.

Actual Results:
Assertion failure in __journal_remove_journal_head() at fs/jbd/journal.c:1799: "jh->b_jcount >= 0"
----------- [cut here ] --------- [please bite here ] ---------
Kernel BUG at "fs/jbd/journal.c":1799
invalid operand: 0000 [1] SMP
CPU 0
Modules linked in: nfs nfsd exportfs lockd sunrpc usbcore ext3 jbd mbcache raid0 md sd_mod aic79xx
Pid: 5891, comm: kjournald Not tainted 2.6.12-gentoo-r9
RIP: 0010:[<ffffffff88064c64>] <ffffffff88064c64>{:jbd:__journal_remove_journal_head+68}
RSP: 0018:ffff8103fdd81d58 EFLAGS: 00010296
RAX: 0000000000000069 RBX: ffff8100d0820a80 RCX: ffffffff80358148
RDX: ffffffff80358148 RSI: 0000000000000292 RDI: ffffffff80358140
RBP: ffff81034a841e00 R08: 0000000000000000 R09: 0000000000000001
R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
R13: ffff8103f7b855c0 R14: ffff8100fbc06a00 R15: 0000000000000000
FS: 0000000000536fa0(0000) GS:ffffffff803e2340(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 00002aaaaac4cad0 CR3: 00000003f7c56000 CR4: 00000000000006a0
Process kjournald (pid: 5891, threadinfo ffff8103fdd80000, task ffff8100081da960)
Stack: ffff81032a841e00 ffff8100d0820a80 ffff81032a841e00 ffffffff88064e14
       ffff8100d0820a80 ffffffff8805fe6d ffff8100fbc06a24 ffff8100fbc06b5c
       0000000000000000 0000000000000000
Call Trace: <ffffffff88064e14>{:jbd:journal_remove_journal_head+36}
       <ffffffff8805fe6d>{:jbd:journal_commit_transaction+1197}
       <ffffffff802f3f39>{thread_return+89} <ffffffff8012e663>{__wake_up+67}
       <ffffffff88062b34>{:jbd:kjournald+276} <ffffffff80148a00>{autoremove_wake_function+0}
       <ffffffff80148a00>{autoremove_wake_function+0} <ffffffff88062a00>{:jbd:commit_timeout+0}
       <ffffffff8010e61b>{child_rip+8} <ffffffff88062a20>{:jbd:kjournald+0}
       <ffffffff8010e613>{child_rip+0}

Code: 0f 0b 54 6a 06 88 ff ff ff ff 07 07 f0 ff 43 18 44 8b 5d 08
RIP <ffffffff88064c64>{:jbd:__journal_remove_journal_head+68} RSP <ffff8103fdd81d58>

Expected Results:
The kernel should not have panicked!
Portage 2.0.51.22-r2 (default-linux/amd64/2004.3, gcc-3.4.4, glibc-2.3.5-r1, 2.6.12-gentoo-r9 x86_64)
=================================================================
System uname: 2.6.12-gentoo-r9 x86_64 AMD Opteron(tm) Processor 248
Gentoo Base System version 1.6.13
distcc 2.18.3 x86_64-pc-linux-gnu (protocols 1 and 2) (default port 3632) [disabled]
dev-lang/python:     2.3.5
sys-apps/sandbox:    1.2.9
sys-devel/autoconf:  2.13, 2.59-r6
sys-devel/automake:  1.4_p6, 1.5, 1.6.3, 1.7.9-r1, 1.8.5-r3, 1.9.5
sys-devel/binutils:  2.15.92.0.2-r10
sys-devel/libtool:   1.5.18-r1
virtual/os-headers:  2.6.11-r2
ACCEPT_KEYWORDS="amd64"
AUTOCLEAN="yes"
CBUILD="x86_64-pc-linux-gnu"
CFLAGS="-O3 -pipe -fomit-frame-pointer"
CHOST="x86_64-pc-linux-gnu"
CONFIG_PROTECT="/etc /usr/X11R6/lib/X11/xkb /usr/kde/2/share/config /usr/kde/3/share/config /usr/share/config /usr/share/texmf/dvipdfm/config/ /usr/share/texmf/dvips/config/ /usr/share/texmf/tex/generic/config/ /usr/share/texmf/tex/platex/config/ /usr/share/texmf/xdvi/ /var/qmail/control"
CONFIG_PROTECT_MASK="/etc/gconf /etc/terminfo /etc/env.d"
CXXFLAGS="-O3 -pipe -fomit-frame-pointer"
DISTDIR="/usr/portage/distfiles"
FEATURES="autoconfig distlocks sandbox sfperms strict"
GENTOO_MIRRORS="http://mirror.datapipe.net/gentoo http://mirror.datapipe.net/gentoo http://gentoo.mirrors.pair.com/ http://mirrors.acm.cs.rpi.edu/gentoo/"
MAKEOPTS="-j3"
PKGDIR="/usr/portage/packages"
PORTAGE_TMPDIR="/var/tmp"
PORTDIR="/usr/portage"
PORTDIR_OVERLAY="/usr/local/portage"
SYNC="rsync://rsync.gentoo.org/gentoo-portage"
USE="amd64 alsa avi berkdb bitmap-fonts cdr crypt cups eds encode esd fam foomaticdb fortran freetds gd gdbm gif gpm gstreamer imlib ipv6 java jpeg junit ldap libwww lzw lzw-tiff mp3 mpeg mysql ncurses nls opengl pam pdflib perl png postgres python quicktime readline samba sdl slang snmp spell ssl tcltk tcpd tetex tiff truetype-fonts type1-fonts usb userlocales xml2 xpm xv zlib userland_GNU kernel_linux elibc_glibc"
Unset: ASFLAGS, CTARGET, LANG, LC_ALL, LDFLAGS, LINGUAS
Is this reproducible on gentoo-sources-2.6.13?
(In reply to comment #1) 2.6.13 does not appear to be available even in ~amd64. Shall I try a stock 2.6.13 kernel?
It is now ~amd64
Well, 2.6.13 no longer kernel panics, but now, after a while, I can't log in with ssh. The machine responds on the network fine, but when logging in with ssh you never get to a prompt. I also noticed that the system load went up to about 6 while copying a file. Someone managed to copy a 22 gig file successfully, but this morning when I tried to log in, it was just hanging.
Odd. When you say "responds on the network", you mean that other network services are working fine, just ssh is unresponsive? Are you able to get local access to the machine? Another option is using a netconsole to hopefully capture an error when things start going wrong.
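If it helps, here is a rough sketch of a netconsole setup; all of the IP addresses, ports, the interface name, and the MAC address below are placeholders you would substitute with your own values, and it assumes netconsole was built as a module:

  # On the machine showing the problem:
  modprobe netconsole netconsole=6665@192.168.1.10/eth0,6666@192.168.1.20/00:11:22:33:44:55
  #   6665@192.168.1.10  = local UDP port and IP of this box
  #   eth0               = interface the messages are sent out of
  #   6666@192.168.1.20  = port and IP of the machine that will collect the logs
  #   00:11:22:33:44:55  = MAC address of that machine (or of the gateway)

  # On the collecting machine, just listen for the UDP packets:
  nc -u -l -p 6666 | tee netconsole.log

That way, any oops printed while ssh is dead should still land on the collecting machine.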
It pings, and the gkrellmd service still works, but logging in with ssh just hangs. You also can't log in on the actual console. (It's headless; we're using a serial console to connect.) I am running a test: we backtracked to an older kernel (2.6.8) and were able to scp that same 22G file, and we're also testing cp over NFS. If those work, we'll go back to 2.6.13 and scp again to see whether it's NFS that somehow broke between 2.6.9 and 2.6.13. (I don't recall exactly when it started having problems, some time last year, and they got worse and worse.) Was there perhaps a large amount of work done on the NFS bits of the kernel between those revisions, particularly with 64-bit architectures?
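Since the box is already on a serial console, it may also be worth making sure the kernel itself logs to it, so the next panic gets captured even when ssh and the login prompt are dead. A minimal sketch; the port and baud rate are assumptions, adjust to your wiring:

  # In the bootloader (e.g. grub.conf), append to the kernel line:
  #   console=ttyS0,115200n8 console=tty0
  # so oopses and panics are written to the first serial port as well as the local console.

  # On the machine at the other end of the serial cable, capture everything to a file:
  minicom -C serial-capture.log
  # or, without minicom:
  # screen /dev/ttyS0 115200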
Yes. Linux development is moving really fast, and broad things such as x86_64 and NFS support usually undergo fairly major changes every release.
At this point we're leaning towards it being an NFS bug. We did some testing, and both scp'ing and copying a file over NFS were fine in 2.6.8. In 2.6.13, the problems occurred with NFS, but scp'ing the file was fine. I'll do some more testing to confirm this, but my belief is that something in NFS changed for the worse between .8 and .13. Should I also try .9 and .10 to narrow the scope?
If you have time to do that sort of testing then yes, that would certainly be useful. Thanks!
I'm not so sure this is NFS-related now. We had someone run a sort job (using the standard UNIX sort utility) on a huge file. The load jumped to around 11, and kswapd0's CPU usage jumped to near 100%. Then sort segfaulted. I'm not sure whether it's because sort wasn't designed for that sort of thing, or whether it blew out the memory (though the machine has 16 gigs of RAM and something like 32 gigs of swap space). Any ideas?
This may be the evil x86_64 bug floating around. For some light reading, see http://bugzilla.kernel.org/show_bug.cgi?id=4851
Haha! Light reading indeed! It appears that doing an "echo 0 > /proc/sys/kernel/randomize_va_space" has solved the problem in this case. So we added: kernel.randomize_va_space=0 to /etc/sysctl.conf. We were able to both run the sort job and do a cp of a huge file over NFS without any problems. What exactly does randomize_va_space do? It seems to have royally screwed up a lot of people. :-/
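For anyone else hitting this, the runtime change and the persistent one can be applied and checked like so (nothing here is specific to our box, it's just the standard sysctl mechanics):

  # Turn it off immediately (same effect as: echo 0 > /proc/sys/kernel/randomize_va_space)
  sysctl -w kernel.randomize_va_space=0

  # Make it stick across reboots:
  echo "kernel.randomize_va_space = 0" >> /etc/sysctl.conf

  # Verify the current value:
  sysctl kernel.randomize_va_space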
It has something to do with randomizing a process's memory layout to make it much harder for an attacker to read "secret" data by examining the system's memory space. The upstream bug suggests that turning randomize_va_space off only reduces the chance of the bug happening, but doesn't eliminate it, so it wouldn't surprise me if you see this again. If you have time and can spare the downtime, it would be helpful if you could help find out when the bug was introduced. It's suspected that 2.6.11 is fine and 2.6.12 is where it breaks, so after confirming that 2.6.11 is ok you could test e.g. 2.6.12-rc1, 2.6.12-rc2, 2.6.12-rc3, 2.6.12-rc4, 2.6.12-rc5 ... until you meet the problem.
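Roughly, the -rc testing could go something like this; the URL is illustrative (grab the tarball and the -rc patches from your nearest kernel.org mirror), and it assumes the usual convention that a mainline -rc patch applies on top of the previous stable release, so patch-2.6.12-rc1 goes on top of linux-2.6.11:

  # Start from the last known-good base and apply one -rc patch at a time.
  wget http://www.kernel.org/pub/linux/kernel/v2.6/linux-2.6.11.tar.bz2
  tar xjf linux-2.6.11.tar.bz2 && cd linux-2.6.11

  bzcat ../patch-2.6.12-rc1.bz2 | patch -p1
  make oldconfig && make && make modules_install install

  # If -rc1 survives the NFS copy / sort workload, back the patch out and try the next one:
  # bzcat ../patch-2.6.12-rc1.bz2 | patch -p1 -R
  # bzcat ../patch-2.6.12-rc2.bz2 | patch -p1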
I'll do some additional testing. One recent large sort job did die with a segfault:

sort[11073]: segfault at 00002aaa7b322d1f rip 00002aaaaac300d9 rsp 00007fffffffdf78 error 4

As far as kernel versions go, we had been having problems since 2.6.7, pretty much off and on. I can try 2.6.11 again (is it 2.6.11.11?). I'll have to use a stock kernel, since it doesn't appear to be in gentoo-sources anymore.
One of our users gave us some new information: the segfaults started happening in June, which was the last time we upgraded coreutils (the package that provides 'sort'). So we downgraded coreutils from 5.2.1-r6 to 5.2.1-r2, and that seems to have stopped the segfaults. We are still running tests.
Hmm, scratch that; we're still getting segfaults from sort.
Going to track the upstream bug report as I'm pretty sure this is the problem you have run into.
We have a fix upstream
*** Bug 106486 has been marked as a duplicate of this bug. ***
It's been stable so far with gentoo-sources-2.6.13-r2. I've re-enabled randomize_va_space, and we've been copying over a huge file to run a sort on, to see whether it segfaults. So far, the machine hasn't kernel panicked or segfaulted.
I did a little more testing: 2.6.13-r2 (with the patch), with randomize_va_space turned ON (set to 1). Got this:

kernel: CPU 1
kernel: Modules linked in: nfs nfsd exportfs lockd sunrpc usbcore ext3 jbd mbcache raid0 md_mod sd_mod aic79xx
kernel: Pid: 31173, comm: pdflush Not tainted 2.6.13-gentoo-r2
kernel: RIP: 0010:[_end+130361672/2132660224] <ffffffff88075948>{:ext3:walk_page_buffers+56}
kernel: RIP: 0010:[<ffffffff88075948>] <ffffffff88075948>{:ext3:walk_page_buffers+56}
kernel: RSP: 0018:ffff810179037b58 EFLAGS: 00010206
kernel: RAX: 0000000000000001 RBX: 0000000000003000 RCX: 0000000000002000
kernel: RDX: 0000000000000001 RSI: 302e323131373931 RDI: 0000000000000000
kernel: RBP: 302e323131373931 R08: 0000000000000000 R09: ffffffff88075ef0
kernel: R10: ffff810004d5f8a8 R11: 0000000000000000 R12: ffff8100f0923978
kernel: R13: 0000000000001000 R14: 0000000000000000 R15: 0000000000001000
kernel: FS: 000000000061eae0(0000) GS:ffffffff803e8880(0000) knlGS:0000000000000000
kernel: CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b
kernel: CR2: 00002aaaaaac6000 CR3: 00000001c8904000 CR4: 00000000000006a0
kernel: Process pdflush (pid: 31173, threadinfo ffff810179036000, task ffff8100082d80b0)
kernel: Stack: ffffffff88075ef0 000000000000000a ffff81015b734958 ffff81000524f428
kernel:        000000005b734958 ffff8100f0923978 ffff81015b734958 ffff810179037e78
kernel:        0000000000000000 ffffffff88076030
kernel: Call Trace:<ffffffff88075ef0>{:ext3:bget_one+0} <ffffffff88076030>{:ext3:ext3_ordered_writepage+256}
kernel:        <ffffffff8019ede7>{mpage_writepages+455} <ffffffff880b19da>{:sunrpc:rpc_sleep_on+58}
kernel:        <ffffffff88075f30>{:ext3:ext3_ordered_writepage+0} <ffffffff8019d53b>{__writeback_single_inode+491}
kernel:        <ffffffff8013cd6f>{try_to_del_timer_sync+79} <ffffffff8013cd9b>{del_timer_sync+27}
kernel:        <ffffffff802f474c>{schedule_timeout+156} <ffffffff8013d6c0>{process_timeout+0}
kernel:        <ffffffff8019d9f1>{writeback_release+1} <ffffffff8019dc1c>{sync_sb_inodes+508}
kernel:        <ffffffff801495c0>{keventd_create_kthread+0} <ffffffff8019df50>{writeback_inodes+144}
kernel:        <ffffffff8015bc46>{background_writeout+118} <ffffffff8015c883>{pdflush+323}
kernel:        <ffffffff8015bbd0>{background_writeout+0} <ffffffff8015c740>{pdflush+0}
kernel:        <ffffffff80149579>{kthread+217} <ffffffff8010e816>{child_rip+8}
kernel:        <ffffffff801495c0>{keventd_create_kthread+0} <ffffffff801494a0>{kthread+0}
kernel:        <ffffffff8010e80e>{child_rip+0}
kernel: Code: 48 8b 6e 08 0f 96 c2 44 39 f9 0f 93 c0 09 d0 a8 01 74 17 4d

I turned randomize_va_space back off again and will continue to do more testing.
In addition to the above, I was also getting a lot of:

kernel: nfs: server <some NFS server> not responding, still trying
kernel: nfs: server <some NFS server> OK

over and over. I wonder if the bandwidth was being saturated (100 megabits, full duplex). You could do an ls on a directory and it wouldn't be there, and a few seconds later it would be.
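Those "not responding, still trying" messages generally mean RPC requests are timing out. If the link really is getting saturated, it might be worth trying a TCP mount and/or a longer RPC timeout; the server name, export path, and mount point below are placeholders, and these are just generic Linux NFS client options, not something we've verified fixes this:

  # /etc/fstab entry, switching from the default UDP transport to TCP and
  # raising the RPC timeout (timeo is in tenths of a second):
  server:/export  /mnt/data  nfs  tcp,hard,intr,timeo=600,retrans=5,rsize=32768,wsize=32768  0 0

  # or unmount and remount by hand to test:
  umount /mnt/data
  mount -t nfs -o tcp,hard,intr,timeo=600 server:/export /mnt/data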
Getting some more kernel dumps:

general protection fault: 0000 [1] SMP
CPU 0
Modules linked in: nfs nfsd exportfs lockd sunrpc usbcore ext3 jbd mbcache raid0 md_mod sd_mod aic79xx
Pid: 121, comm: kswapd0 Not tainted 2.6.13-gentoo-r2
RIP: 0010:[<ffffffff80193aa5>] <ffffffff80193aa5>{iput+53}
RSP: 0018:ffff8100100c7d88 EFLAGS: 00010202
RAX: 0a676e6f6809676e RBX: ffff8100f0a21d68 RCX: 0000000000000000
RDX: ffff8100f0a21f48 RSI: 0000000000000400 RDI: ffff8100f0a21d68
RBP: ffff8103e0851c90 R08: 0000000000000003 R09: 0000000000000000
R10: ffffffff80421aa0 R11: ffffffff80194ca0 R12: ffff8103e0851c98
R13: 0000000000000034 R14: 00000000001ca8ee R15: 0000000000000000
FS: 000000000061eae0(0000) GS:ffffffff803e8800(0000) knlGS:0000000000000000
CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b
CR2: 00007fffffffe7e6 CR3: 000000026c256000 CR4: 00000000000006a0
Process kswapd0 (pid: 121, threadinfo ffff8100100c6000, task ffff8103fffa00b0)
Stack: ffff8100f0a21d68 ffffffff80191b55 ffff8100078b0600 0000000000058548
       ffff8101ffff8f40 0000000000000085 00000000000000d0 ffffffff80192107
       00000000001ca8ed ffffffff8016100b
Call Trace:<ffffffff80191b55>{prune_dcache+405} <ffffffff80192107>{shrink_dcache_memory+23}
       <ffffffff8016100b>{shrink_slab+219} <ffffffff8016250b>{balance_pgdat+619}
       <ffffffff80162797>{kswapd+311} <ffffffff80149ba0>{autoremove_wake_function+0}
       <ffffffff80149ba0>{autoremove_wake_function+0} <ffffffff8012e040>{schedule_tail+64}
       <ffffffff8010e816>{child_rip+8} <ffffffff8011aff0>{flat_send_IPI_mask+0}
       <ffffffff80162660>{kswapd+0} <ffffffff8010e80e>{child_rip+0}

Code: 48 8b 40 28 48 85 c0 74 05 48 89 df ff d0 48 8d 7b 48 48 c7
RIP <ffffffff80193aa5>{iput+53} RSP <ffff8100100c7d88>

<0>general protection fault: 0000 [2] SMP
CPU 1
Modules linked in: nfs nfsd exportfs lockd sunrpc usbcore ext3 jbd mbcache raid0 md_mod sd_mod aic79xx
Pid: 120, comm: kswapd1 Not tainted 2.6.13-gentoo-r2
RIP: 0010:[<ffffffff801a11f3>] <ffffffff801a11f3>{inotify_inode_queue_event+323}
RSP: 0018:ffff8103ffeb1d38 EFLAGS: 00010246
RAX: ffff810060a21c40 RBX: 61096c61696e6f5c RCX: 0000000000000000
RDX: ffff8100f0a21a50 RSI: 0000000000000400 RDI: ffff8100f0a21c50
RBP: ffff8103e0851d70 R08: 0000000000000003 R09: 0000000000000000
R10: ffffffff80421aa0 R11: 0000000000000008 R12: ffff8103e0851d78
R13: 0000000074617274 R14: ffff8100f0a21c40 R15: 61096c61696e6f5c
FS: 0000000040110ae0(0000) GS:ffffffff803e8880(0000) knlGS:0000000000000000
CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b
CR2: 00002aaaaaac6000 CR3: 00000003a4b19000 CR4: 00000000000006a0
Process kswapd1 (pid: 120, threadinfo ffff8103ffeb0000, task ffff8103fffa0770)
Stack: ffff810205539258 ffff8100f0a21c50 0000000000000000 0000040000000000
       ffff8100f0a21a50 ffff8100f0a21a50 ffff8103e0851d70 ffff8103e0851d78
       0000000000000080 00000000001d0527
Call Trace:<ffffffff80191b24>{prune_dcache+356} <ffffffff80192107>{shrink_dcache_memory+23}
       <ffffffff8016100b>{shrink_slab+219} <ffffffff8016250b>{balance_pgdat+619}
       <ffffffff80162797>{kswapd+311} <ffffffff80149ba0>{autoremove_wake_function+0}
       <ffffffff80149ba0>{autoremove_wake_function+0} <ffffffff8012e040>{schedule_tail+64}
       <ffffffff8010e816>{child_rip+8} <ffffffff8011aff0>{flat_send_IPI_mask+0}
       <ffffffff80162660>{kswapd+0} <ffffffff8010e80e>{child_rip+0}
RIP <ffffffff801a11f3>{inotify_inode_queue_event+323} RSP <ffff8103ffeb1d38>

<0>general protection fault: 0000 [3] SMP
CPU 0
Modules linked in: nfs nfsd exportfs lockd sunrpc usbcore ext3 jbd mbcache raid0 md_mod sd_mod aic79xx
Pid: 13541, comm: countCooccurs.p Not tainted 2.6.13-gentoo-r2
RIP: 0010:[<ffffffff80193aa5>] <ffffffff80193aa5>{iput+53}
RSP: 0018:ffff8103e74d79f8 EFLAGS: 00010202
RAX: 0a676e6f6809676e RBX: ffff8100f0a21738 RCX: 0000000000000000
RDX: ffff8100f0a21918 RSI: 0000000000000400 RDI: ffff8100f0a21738
RBP: ffff8103e0851e50 R08: 0000000000000003 R09: 0000000000000000
R10: ffffffff80421aa0 R11: 0000000000000008 R12: ffff8103e0851e58
R13: 0000000000000080 R14: 000000000039b4d3 R15: 0000000000000000
FS: 000000000061eae0(0063) GS:ffffffff803e8800(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007fffffffe7e6 CR3: 000000026c256000 CR4: 00000000000006a0
Process countCooccurs.p (pid: 13541, threadinfo ffff8103e74d6000, task ffff8101ffcb01f0)
Stack: ffff8100f0a21738 ffffffff80191b55 ffff81000786a478 0000000000058548
       ffff8101ffff8f40 0000000000000084 00000000000200d2 ffffffff80192107
       000000000039b4d2 ffffffff8016100b
Call Trace:<ffffffff80191b55>{prune_dcache+405} <ffffffff80192107>{shrink_dcache_memory+23}
       <ffffffff8016100b>{shrink_slab+219} <ffffffff801621b3>{try_to_free_pages+355}
       <ffffffff88077032>{:ext3:ext3_mark_iloc_dirty+834} <ffffffff8015a1d2>{__alloc_pages+594}
       <ffffffff88077173>{:ext3:ext3_mark_inode_dirty+51} <ffffffff801571f7>{generic_file_buffered_write+423}
       <ffffffff80138f15>{current_fs_time+85} <ffffffff8015638e>{do_generic_mapping_read+1374}
       <ffffffff80193c0e>{inode_update_time+62} <ffffffff80157c1a>{__generic_file_aio_write_nolock+938}
       <ffffffff80157cd1>{generic_file_aio_write+129} <ffffffff880741e3>{:ext3:ext3_file_write+35}
       <ffffffff801798c3>{do_sync_write+211} <ffffffff80183279>{cp_new_stat+233}
       <ffffffff80149ba0>{autoremove_wake_function+0} <ffffffff801799c8>{vfs_write+200}
       <ffffffff80179b63>{sys_write+83} <ffffffff8010dc66>{system_call+126}

Code: 48 8b 40 28 48 85 c0 74 05 48 89 df ff d0 48 8d 7b 48 48 c7
RIP <ffffffff80193aa5>{iput+53} RSP <ffff8103e74d79f8>

<0>general protection fault: 0000 [4] SMP
CPU 0
Modules linked in: nfs nfsd exportfs lockd sunrpc usbcore ext3 jbd mbcache raid0 md_mod sd_mod aic79xx
Pid: 13716, comm: java Not tainted 2.6.13-gentoo-r2
RIP: 0010:[<ffffffff801a11f3>] <ffffffff801a11f3>{inotify_inode_queue_event+323}
RSP: 0018:ffff810366e05988 EFLAGS: 00010246
RAX: ffff810060a21610 RBX: 756f630a73656962 RCX: 0000000000000000
RDX: ffff8100f0a21420 RSI: 0000000000000400 RDI: ffff8100f0a21620
RBP: ffff8103e06b00d0 R08: 0000000000000003 R09: 0000000000000000
R10: ffffffff80421aa0 R11: 0000000000000008 R12: ffff8103e06b00d8
R13: 000000006c6c6109 R14: ffff8100f0a21610 R15: 756f630a73656962
FS: 0000000040110ae0(0063) GS:ffffffff803e8800(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00002aab4af79000 CR3: 00000003a4b19000 CR4: 00000000000006a0
Process java (pid: 13716, threadinfo ffff810366e04000, task ffff8101ffcb1630)
Stack: ffff8102053b80d0 ffff8100f0a21620 0000000000000000 0000040000000000
       ffff8100f0a21420 ffff8100f0a21420 ffff8103e06b00d0 ffff8103e06b00d8
       0000000000000080 000000000039b4ec
Call Trace:<ffffffff80191b24>{prune_dcache+356} <ffffffff80192107>{shrink_dcache_memory+23}
       <ffffffff8016100b>{shrink_slab+219} <ffffffff801621b3>{try_to_free_pages+355}
       <ffffffff881061a8>{:nfs:nfs_unlock_request+72} <ffffffff8015a1d2>{__alloc_pages+594}
       <ffffffff801571f7>{generic_file_buffered_write+423} <ffffffff80156e68>{remove_suid+8}
       <ffffffff80157c1a>{__generic_file_aio_write_nolock+938} <ffffffff80157cd1>{generic_file_aio_write+129}
       <ffffffff88101b7f>{:nfs:nfs_file_write+191} <ffffffff801798c3>{do_sync_write+211}
       <ffffffff801143b4>{save_i387+164} <ffffffff80114283>{init_fpu+83}
       <ffffffff801105d3>{math_state_restore+35} <ffffffff80149ba0>{autoremove_wake_function+0}
       <ffffffff801799c8>{vfs_write+200} <ffffffff80179b63>{sys_write+83}
       <ffffffff8010dc66>{system_call+126}

Code: 4d 8b 7f 10 48 8d 43 10 49 83 ef 10 e9 2d ff ff ff 48 8b 7c
RIP <ffffffff801a11f3>{inotify_inode_queue_event+323} RSP <ffff810366e05988>

This time the machine stayed up and running. I wonder if it's a kswapd bug. Seems to have something to do with virtual memory.
I would like to add that prior to a crash of this nature, kswapd tends to peg the CPU at or near 100%.
Odd, they look fairly random. If randomize_va_space is 0, do the crashes disappear? Could you test vanilla-sources-2.6.14_rc2? Also, if you can spare the downtime, it may be worthwhile testing your RAM with memtest86+.
Whether randomize_va_space is 0 or 1 makes absolutely no difference. My colleague who has been working extensively with the system still believes it to be some kind of NFS problem.
Please test with gentoo-sources-2.6.14 and let us know if problems still remain.
We can no longer work on this issue as the machine is now running FreeBSD 5.4. It is my hope this issue does get resolved soon, however. Best of luck to you all.
emerge fails really often:

Nov 3 23:31:03 sng emerge[12807] general protection rip:2aaaabfefbbc rsp:7ffffff572d0 error:0
Nov 4 08:58:35 sng emerge[30609]: segfault at 00002aaa0000001b rip 00002aaaabfefbbc rsp 00007ffffff0a050 error 4
Nov 4 19:40:44 sng emerge[18724] general protection rip:2aaaaac8d2c8 rsp:7fffff807b00 error:0

Kernel debug mode is on, but there is no backtrace.

# cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
ondemand
# cat /proc/sys/kernel/randomize_va_space
1

# emerge --info
Portage 2.0.51.22-r3 (default-linux/amd64/2005.1, gcc-3.4.4, glibc-2.3.5-r2, 2.6.14-gentoo x86_64)
=================================================================
System uname: 2.6.14-gentoo x86_64 AMD Athlon(tm) 64 Processor 3500+
Gentoo Base System version 1.6.13
sys-devel/binutils:  2.15.92.0.2-r10
sys-devel/libtool:   1.5.20
virtual/os-headers:  2.6.11-r2
ACCEPT_KEYWORDS="amd64"
AUTOCLEAN="yes"
CBUILD="x86_64-pc-linux-gnu"
CFLAGS="-march=athlon64 -O3 -pipe -fomit-frame-pointer"
CHOST="x86_64-pc-linux-gnu"
CXXFLAGS="-march=athlon64 -O3 -pipe -fomit-frame-pointer"

I'll try changing the CPU governor and disabling VA randomization.
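For the record, the two changes I'm about to test look roughly like this (cpu0 shown, repeat for each CPU; it assumes the "performance" governor is compiled into the kernel):

  # Switch the frequency governor from ondemand to performance:
  echo performance > /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor

  # Disable address space randomization:
  echo 0 > /proc/sys/kernel/randomize_va_space

  # Check both took effect:
  cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor /proc/sys/kernel/randomize_va_space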