User attempted to copy a large file over NFS from one machine to the machine that is experiencing this problem. The filesystem in question is ext3; however, similar panics happen even when it's reiserfs. (We originally thought it was reiserfs, so we converted the filesystem over to ext3.) The problem still occurs quite regularly. Any time there is a high amount of disk I/O on the system, the kernel panics and the disk becomes unusable. A reboot is needed after that. The system is a dual Opteron (AMD64) using SMP, kernel 2.6.12-gentoo-r9.

Reproducible: Always

Steps to Reproduce:
1. Do anything that does huge file writes to the filesystem.
2. Watch the kernel panic.

Actual Results:
Assertion failure in __journal_remove_journal_head() at fs/jbd/journal.c:1799: "jh->b_jcount >= 0"
----------- [cut here ] --------- [please bite here ] ---------
Kernel BUG at "fs/jbd/journal.c":1799
invalid operand: 0000 [1] SMP
CPU 0
Modules linked in: nfs nfsd exportfs lockd sunrpc usbcore ext3 jbd mbcache raid0 md sd_mod aic79xx
Pid: 5891, comm: kjournald Not tainted 2.6.12-gentoo-r9
RIP: 0010:[<ffffffff88064c64>] <ffffffff88064c64>{:jbd:__journal_remove_journal_head+68}
RSP: 0018:ffff8103fdd81d58 EFLAGS: 00010296
RAX: 0000000000000069 RBX: ffff8100d0820a80 RCX: ffffffff80358148
RDX: ffffffff80358148 RSI: 0000000000000292 RDI: ffffffff80358140
RBP: ffff81034a841e00 R08: 0000000000000000 R09: 0000000000000001
R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
R13: ffff8103f7b855c0 R14: ffff8100fbc06a00 R15: 0000000000000000
FS: 0000000000536fa0(0000) GS:ffffffff803e2340(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 00002aaaaac4cad0 CR3: 00000003f7c56000 CR4: 00000000000006a0
Process kjournald (pid: 5891, threadinfo ffff8103fdd80000, task ffff8100081da960)
Stack: ffff81032a841e00 ffff8100d0820a80 ffff81032a841e00 ffffffff88064e14
       ffff8100d0820a80 ffffffff8805fe6d ffff8100fbc06a24 ffff8100fbc06b5c
       0000000000000000 0000000000000000
Call Trace: <ffffffff88064e14>{:jbd:journal_remove_journal_head+36}
       <ffffffff8805fe6d>{:jbd:journal_commit_transaction+1197}
       <ffffffff802f3f39>{thread_return+89} <ffffffff8012e663>{__wake_up+67}
       <ffffffff88062b34>{:jbd:kjournald+276} <ffffffff80148a00>{autoremove_wake_function+0}
       <ffffffff80148a00>{autoremove_wake_function+0} <ffffffff88062a00>{:jbd:commit_timeout+0}
       <ffffffff8010e61b>{child_rip+8} <ffffffff88062a20>{:jbd:kjournald+0}
       <ffffffff8010e613>{child_rip+0}

Code: 0f 0b 54 6a 06 88 ff ff ff ff 07 07 f0 ff 43 18 44 8b 5d 08
RIP <ffffffff88064c64>{:jbd:__journal_remove_journal_head+68} RSP <ffff8103fdd81d58>

Expected Results:
The kernel should not have panicked!
Portage 2.0.51.22-r2 (default-linux/amd64/2004.3, gcc-3.4.4, glibc-2.3.5-r1, 2.6.12-gentoo-r9 x86_64)
=================================================================
System uname: 2.6.12-gentoo-r9 x86_64 AMD Opteron(tm) Processor 248
Gentoo Base System version 1.6.13
distcc 2.18.3 x86_64-pc-linux-gnu (protocols 1 and 2) (default port 3632) [disabled]
dev-lang/python:     2.3.5
sys-apps/sandbox:    1.2.9
sys-devel/autoconf:  2.13, 2.59-r6
sys-devel/automake:  1.4_p6, 1.5, 1.6.3, 1.7.9-r1, 1.8.5-r3, 1.9.5
sys-devel/binutils:  2.15.92.0.2-r10
sys-devel/libtool:   1.5.18-r1
virtual/os-headers:  2.6.11-r2
ACCEPT_KEYWORDS="amd64"
AUTOCLEAN="yes"
CBUILD="x86_64-pc-linux-gnu"
CFLAGS="-O3 -pipe -fomit-frame-pointer"
CHOST="x86_64-pc-linux-gnu"
CONFIG_PROTECT="/etc /usr/X11R6/lib/X11/xkb /usr/kde/2/share/config /usr/kde/3/share/config /usr/share/config /usr/share/texmf/dvipdfm/config/ /usr/share/texmf/dvips/config/ /usr/share/texmf/tex/generic/config/ /usr/share/texmf/tex/platex/config/ /usr/share/texmf/xdvi/ /var/qmail/control"
CONFIG_PROTECT_MASK="/etc/gconf /etc/terminfo /etc/env.d"
CXXFLAGS="-O3 -pipe -fomit-frame-pointer"
DISTDIR="/usr/portage/distfiles"
FEATURES="autoconfig distlocks sandbox sfperms strict"
GENTOO_MIRRORS="http://mirror.datapipe.net/gentoo http://mirror.datapipe.net/gentoo http://gentoo.mirrors.pair.com/ http://mirrors.acm.cs.rpi.edu/gentoo/"
MAKEOPTS="-j3"
PKGDIR="/usr/portage/packages"
PORTAGE_TMPDIR="/var/tmp"
PORTDIR="/usr/portage"
PORTDIR_OVERLAY="/usr/local/portage"
SYNC="rsync://rsync.gentoo.org/gentoo-portage"
USE="amd64 alsa avi berkdb bitmap-fonts cdr crypt cups eds encode esd fam foomaticdb fortran freetds gd gdbm gif gpm gstreamer imlib ipv6 java jpeg junit ldap libwww lzw lzw-tiff mp3 mpeg mysql ncurses nls opengl pam pdflib perl png postgres python quicktime readline samba sdl slang snmp spell ssl tcltk tcpd tetex tiff truetype-fonts type1-fonts usb userlocales xml2 xpm xv zlib userland_GNU kernel_linux elibc_glibc"
Unset: ASFLAGS, CTARGET, LANG, LC_ALL, LDFLAGS, LINGUAS
Is this reproducible on gentoo-sources-2.6.13?
(In reply to comment #1) 2.6.13 does not appear to be available even in ~amd64. Shall I try a stock 2.6.13 kernel?
It is now ~amd64
Well, 2.6.13 no longer kernel panics, but now, after a while, I can't log in with ssh. The machine responds on the network fine, but when logging in with ssh you never get to a prompt. I also noticed that the system load went up to about 6 while copying a file. Someone managed to copy a 22 gig file successfully, but this morning when I tried to log in, it was just hanging.
Odd. When you say "responds on the network", you mean that other network services are working fine, just ssh is unresponsive? Are you able to get local access to the machine? Another option is using a netconsole to hopefully capture an error when things start going wrong.
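If it helps, here is a rough sketch of a netconsole setup; all of the IP addresses, ports, the interface name, and the MAC address below are placeholders you would substitute with your own values, and it assumes netconsole was built as a module:

  # On the machine showing the problem:
  modprobe netconsole netconsole=6665@192.168.1.10/eth0,6666@192.168.1.20/00:11:22:33:44:55
  #   6665@192.168.1.10  = local UDP port and IP of this box
  #   eth0               = interface the messages are sent out of
  #   6666@192.168.1.20  = port and IP of the machine that will collect the logs
  #   00:11:22:33:44:55  = MAC address of that machine (or of the gateway)

  # On the collecting machine, just listen for the UDP packets:
  nc -u -l -p 6666 | tee netconsole.log

That way, any oops printed while ssh is dead should still land on the collecting machine.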
It pings, and the gkrellmd service still works, but logging in with ssh just hangs. You also can't log in on the actual console. (It's headless; we're using a serial console to connect.) I am running a test: we backtracked to an older kernel (2.6.8) and were able to scp that same 22G file, and we're also testing cp over NFS. If those work, we'll go back to 2.6.13 and scp again to see whether it's NFS that somehow broke between 2.6.9 and 2.6.13. (I don't recall exactly when it started having problems, some time last year, and they got worse and worse.) Was there perhaps a large amount of work done on the NFS bits of the kernel between those revisions, particularly with 64-bit architectures?
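Since the box is already on a serial console, it may also be worth making sure the kernel itself logs to it, so the next panic gets captured even when ssh and the login prompt are dead. A minimal sketch; the port and baud rate are assumptions, adjust to your wiring:

  # In the bootloader (e.g. grub.conf), append to the kernel line:
  #   console=ttyS0,115200n8 console=tty0
  # so oopses and panics are written to the first serial port as well as the local console.

  # On the machine at the other end of the serial cable, capture everything to a file:
  minicom -C serial-capture.log
  # or, without minicom:
  # screen /dev/ttyS0 115200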
Yes. Linux development is moving really fast, and broad things such as x86_64 and NFS support usually undergo fairly major changes every release.
At this point we're leaning towards it being an NFS bug. We did some testing, and both scp'ing and copying a file over NFS were fine in 2.6.8. In 2.6.13, the problems occurred with NFS, but scp'ing the file was fine. I'll do some more testing to confirm this, but my belief is that something in NFS changed for the worse between .8 and .13. Should I also try .9 and .10 to narrow the scope?
If you have time to do that sort of testing then yes, that would certainly be useful. Thanks!
I'm not so sure this is NFS-related now. We had someone run a sort job (using the standard UNIX sort utility) on a huge file. The load jumped to around 11, and kswapd0's CPU usage jumped to near 100%. Then sort segfaulted. I'm not sure whether it's because sort wasn't designed for that sort of thing, or whether it blew out the memory (though the machine has 16 gigs of RAM and something like 32 gigs of swap space). Any ideas?
This may be the evil x86_64 bug floating around. For some light reading, see http://bugzilla.kernel.org/show_bug.cgi?id=4851
Haha! Light reading indeed! It appears that doing an "echo 0 > /proc/sys/kernel/randomize_va_space" has solved the problem in this case. So we added: kernel.randomize_va_space=0 to /etc/sysctl.conf. We were able to both run the sort job and do a cp of a huge file over NFS without any problems. What exactly does randomize_va_space do? It seems to have royally screwed up a lot of people. :-/
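For anyone else hitting this, the runtime change and the persistent one can be applied and checked like so (nothing here is specific to our box, it's just the standard sysctl mechanics):

  # Turn it off immediately (same effect as: echo 0 > /proc/sys/kernel/randomize_va_space)
  sysctl -w kernel.randomize_va_space=0

  # Make it stick across reboots:
  echo "kernel.randomize_va_space = 0" >> /etc/sysctl.conf

  # Verify the current value:
  sysctl kernel.randomize_va_space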
It has something to do with randomizing a process's memory layout to make it much harder for an attacker to read "secret" data by examining the system's memory space. The upstream bug suggests that turning randomize_va_space off only reduces the chance of the bug happening, but doesn't eliminate it, so it wouldn't surprise me if you see this again. If you have time and can spare the downtime, it would be helpful if you could help find out when the bug was introduced. It's suspected that 2.6.11 is fine and 2.6.12 is where it breaks, so after confirming that 2.6.11 is ok you could test e.g. 2.6.12-rc1, 2.6.12-rc2, 2.6.12-rc3, 2.6.12-rc4, 2.6.12-rc5 ... until you meet the problem.
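Roughly, the -rc testing could go something like this; the URL is illustrative (grab the tarball and the -rc patches from your nearest kernel.org mirror), and it assumes the usual convention that a mainline -rc patch applies on top of the previous stable release, so patch-2.6.12-rc1 goes on top of linux-2.6.11:

  # Start from the last known-good base and apply one -rc patch at a time.
  wget http://www.kernel.org/pub/linux/kernel/v2.6/linux-2.6.11.tar.bz2
  tar xjf linux-2.6.11.tar.bz2 && cd linux-2.6.11

  bzcat ../patch-2.6.12-rc1.bz2 | patch -p1
  make oldconfig && make && make modules_install install

  # If -rc1 survives the NFS copy / sort workload, back the patch out and try the next one:
  # bzcat ../patch-2.6.12-rc1.bz2 | patch -p1 -R
  # bzcat ../patch-2.6.12-rc2.bz2 | patch -p1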
I'll do some additional testing. One recent large sort job did die with a segfault:

sort[11073]: segfault at 00002aaa7b322d1f rip 00002aaaaac300d9 rsp 00007fffffffdf78 error 4

As far as kernel versions go, we had been having problems since 2.6.7, pretty much off and on. I can try 2.6.11 again (is it 2.6.11.11?). I'll have to use a stock kernel, since it doesn't appear to be in gentoo-sources anymore.
One of our users gave us some new information: the segfaults started happening in June, which was the last time we upgraded coreutils (the package that provides 'sort'). So we downgraded coreutils from 5.2.1-r6 to 5.2.1-r2, and that seems to have stopped the segfaults. We are still running tests.
Hmm, scratch that; we're still getting segfaults from sort.
Going to track the upstream bug report as I'm pretty sure this is the problem you have run into.
We have a fix upstream
*** Bug 106486 has been marked as a duplicate of this bug. ***
It's been stable so far with gentoo-sources-2.6.13-r2. I've re-enabled randomize_va_space, and we've been copying over a huge file to run a sort on, to see whether it segfaults. So far, the machine hasn't kernel panicked or segfaulted.
I did a little more testing: 2.6.13-r2 (with the patch), with randomize_va_space turned ON (set to 1). Got this:

kernel: CPU 1
kernel: Modules linked in: nfs nfsd exportfs lockd sunrpc usbcore ext3 jbd mbcache raid0 md_mod sd_mod aic79xx
kernel: Pid: 31173, comm: pdflush Not tainted 2.6.13-gentoo-r2
kernel: RIP: 0010:[_end+130361672/2132660224] <ffffffff88075948>{:ext3:walk_page_buffers+56}
kernel: RIP: 0010:[<ffffffff88075948>] <ffffffff88075948>{:ext3:walk_page_buffers+56}
kernel: RSP: 0018:ffff810179037b58 EFLAGS: 00010206
kernel: RAX: 0000000000000001 RBX: 0000000000003000 RCX: 0000000000002000
kernel: RDX: 0000000000000001 RSI: 302e323131373931 RDI: 0000000000000000
kernel: RBP: 302e323131373931 R08: 0000000000000000 R09: ffffffff88075ef0
kernel: R10: ffff810004d5f8a8 R11: 0000000000000000 R12: ffff8100f0923978
kernel: R13: 0000000000001000 R14: 0000000000000000 R15: 0000000000001000
kernel: FS: 000000000061eae0(0000) GS:ffffffff803e8880(0000) knlGS:0000000000000000
kernel: CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b
kernel: CR2: 00002aaaaaac6000 CR3: 00000001c8904000 CR4: 00000000000006a0
kernel: Process pdflush (pid: 31173, threadinfo ffff810179036000, task ffff8100082d80b0)
kernel: Stack: ffffffff88075ef0 000000000000000a ffff81015b734958 ffff81000524f428
kernel:        000000005b734958 ffff8100f0923978 ffff81015b734958 ffff810179037e78
kernel:        0000000000000000 ffffffff88076030
kernel: Call Trace:<ffffffff88075ef0>{:ext3:bget_one+0} <ffffffff88076030>{:ext3:ext3_ordered_writepage+256}
kernel:        <ffffffff8019ede7>{mpage_writepages+455} <ffffffff880b19da>{:sunrpc:rpc_sleep_on+58}
kernel:        <ffffffff88075f30>{:ext3:ext3_ordered_writepage+0} <ffffffff8019d53b>{__writeback_single_inode+491}
kernel:        <ffffffff8013cd6f>{try_to_del_timer_sync+79} <ffffffff8013cd9b>{del_timer_sync+27}
kernel:        <ffffffff802f474c>{schedule_timeout+156} <ffffffff8013d6c0>{process_timeout+0}
kernel:        <ffffffff8019d9f1>{writeback_release+1} <ffffffff8019dc1c>{sync_sb_inodes+508}
kernel:        <ffffffff801495c0>{keventd_create_kthread+0} <ffffffff8019df50>{writeback_inodes+144}
kernel:        <ffffffff8015bc46>{background_writeout+118} <ffffffff8015c883>{pdflush+323}
kernel:        <ffffffff8015bbd0>{background_writeout+0} <ffffffff8015c740>{pdflush+0}
kernel:        <ffffffff80149579>{kthread+217} <ffffffff8010e816>{child_rip+8}
kernel:        <ffffffff801495c0>{keventd_create_kthread+0} <ffffffff801494a0>{kthread+0}
kernel:        <ffffffff8010e80e>{child_rip+0}
kernel: Code: 48 8b 6e 08 0f 96 c2 44 39 f9 0f 93 c0 09 d0 a8 01 74 17 4d

I turned randomize_va_space back off again and will continue to do more testing.
In addition to the above, I was also getting a lot of:

kernel: nfs: server <some NFS server> not responding, still trying
kernel: nfs: server <some NFS server> OK

over and over. I wonder if the bandwidth was being saturated (100 megabits, full duplex). You could do an ls on a directory and it wouldn't be there, and a few seconds later it would be.
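Those "not responding, still trying" messages generally mean RPC requests are timing out. If the link really is getting saturated, it might be worth trying a TCP mount and/or a longer RPC timeout; the server name, export path, and mount point below are placeholders, and these are just generic Linux NFS client options, not something we've verified fixes this:

  # /etc/fstab entry, switching from the default UDP transport to TCP and
  # raising the RPC timeout (timeo is in tenths of a second):
  server:/export  /mnt/data  nfs  tcp,hard,intr,timeo=600,retrans=5,rsize=32768,wsize=32768  0 0

  # or unmount and remount by hand to test:
  umount /mnt/data
  mount -t nfs -o tcp,hard,intr,timeo=600 server:/export /mnt/data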
Getting some more kernel dumps:

general protection fault: 0000 [1] SMP
CPU 0
Modules linked in: nfs nfsd exportfs lockd sunrpc usbcore ext3 jbd mbcache raid0 md_mod sd_mod aic79xx
Pid: 121, comm: kswapd0 Not tainted 2.6.13-gentoo-r2
RIP: 0010:[<ffffffff80193aa5>] <ffffffff80193aa5>{iput+53}
RSP: 0018:ffff8100100c7d88 EFLAGS: 00010202
RAX: 0a676e6f6809676e RBX: ffff8100f0a21d68 RCX: 0000000000000000
RDX: ffff8100f0a21f48 RSI: 0000000000000400 RDI: ffff8100f0a21d68
RBP: ffff8103e0851c90 R08: 0000000000000003 R09: 0000000000000000
R10: ffffffff80421aa0 R11: ffffffff80194ca0 R12: ffff8103e0851c98
R13: 0000000000000034 R14: 00000000001ca8ee R15: 0000000000000000
FS: 000000000061eae0(0000) GS:ffffffff803e8800(0000) knlGS:0000000000000000
CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b
CR2: 00007fffffffe7e6 CR3: 000000026c256000 CR4: 00000000000006a0
Process kswapd0 (pid: 121, threadinfo ffff8100100c6000, task ffff8103fffa00b0)
Stack: ffff8100f0a21d68 ffffffff80191b55 ffff8100078b0600 0000000000058548
       ffff8101ffff8f40 0000000000000085 00000000000000d0 ffffffff80192107
       00000000001ca8ed ffffffff8016100b
Call Trace:<ffffffff80191b55>{prune_dcache+405} <ffffffff80192107>{shrink_dcache_memory+23}
       <ffffffff8016100b>{shrink_slab+219} <ffffffff8016250b>{balance_pgdat+619}
       <ffffffff80162797>{kswapd+311} <ffffffff80149ba0>{autoremove_wake_function+0}
       <ffffffff80149ba0>{autoremove_wake_function+0} <ffffffff8012e040>{schedule_tail+64}
       <ffffffff8010e816>{child_rip+8} <ffffffff8011aff0>{flat_send_IPI_mask+0}
       <ffffffff80162660>{kswapd+0} <ffffffff8010e80e>{child_rip+0}

Code: 48 8b 40 28 48 85 c0 74 05 48 89 df ff d0 48 8d 7b 48 48 c7
RIP <ffffffff80193aa5>{iput+53} RSP <ffff8100100c7d88>

<0>general protection fault: 0000 [2] SMP
CPU 1
Modules linked in: nfs nfsd exportfs lockd sunrpc usbcore ext3 jbd mbcache raid0 md_mod sd_mod aic79xx
Pid: 120, comm: kswapd1 Not tainted 2.6.13-gentoo-r2
RIP: 0010:[<ffffffff801a11f3>] <ffffffff801a11f3>{inotify_inode_queue_event+323}
RSP: 0018:ffff8103ffeb1d38 EFLAGS: 00010246
RAX: ffff810060a21c40 RBX: 61096c61696e6f5c RCX: 0000000000000000
RDX: ffff8100f0a21a50 RSI: 0000000000000400 RDI: ffff8100f0a21c50
RBP: ffff8103e0851d70 R08: 0000000000000003 R09: 0000000000000000
R10: ffffffff80421aa0 R11: 0000000000000008 R12: ffff8103e0851d78
R13: 0000000074617274 R14: ffff8100f0a21c40 R15: 61096c61696e6f5c
FS: 0000000040110ae0(0000) GS:ffffffff803e8880(0000) knlGS:0000000000000000
CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b
CR2: 00002aaaaaac6000 CR3: 00000003a4b19000 CR4: 00000000000006a0
Process kswapd1 (pid: 120, threadinfo ffff8103ffeb0000, task ffff8103fffa0770)
Stack: ffff810205539258 ffff8100f0a21c50 0000000000000000 0000040000000000
       ffff8100f0a21a50 ffff8100f0a21a50 ffff8103e0851d70 ffff8103e0851d78
       0000000000000080 00000000001d0527
Call Trace:<ffffffff80191b24>{prune_dcache+356} <ffffffff80192107>{shrink_dcache_memory+23}
       <ffffffff8016100b>{shrink_slab+219} <ffffffff8016250b>{balance_pgdat+619}
       <ffffffff80162797>{kswapd+311} <ffffffff80149ba0>{autoremove_wake_function+0}
       <ffffffff80149ba0>{autoremove_wake_function+0} <ffffffff8012e040>{schedule_tail+64}
       <ffffffff8010e816>{child_rip+8} <ffffffff8011aff0>{flat_send_IPI_mask+0}
       <ffffffff80162660>{kswapd+0} <ffffffff8010e80e>{child_rip+0}
RIP <ffffffff801a11f3>{inotify_inode_queue_event+323} RSP <ffff8103ffeb1d38>

<0>general protection fault: 0000 [3] SMP
CPU 0
Modules linked in: nfs nfsd exportfs lockd sunrpc usbcore ext3 jbd mbcache raid0 md_mod sd_mod aic79xx
Pid: 13541, comm: countCooccurs.p Not tainted 2.6.13-gentoo-r2
RIP: 0010:[<ffffffff80193aa5>] <ffffffff80193aa5>{iput+53}
RSP: 0018:ffff8103e74d79f8 EFLAGS: 00010202
RAX: 0a676e6f6809676e RBX: ffff8100f0a21738 RCX: 0000000000000000
RDX: ffff8100f0a21918 RSI: 0000000000000400 RDI: ffff8100f0a21738
RBP: ffff8103e0851e50 R08: 0000000000000003 R09: 0000000000000000
R10: ffffffff80421aa0 R11: 0000000000000008 R12: ffff8103e0851e58
R13: 0000000000000080 R14: 000000000039b4d3 R15: 0000000000000000
FS: 000000000061eae0(0063) GS:ffffffff803e8800(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007fffffffe7e6 CR3: 000000026c256000 CR4: 00000000000006a0
Process countCooccurs.p (pid: 13541, threadinfo ffff8103e74d6000, task ffff8101ffcb01f0)
Stack: ffff8100f0a21738 ffffffff80191b55 ffff81000786a478 0000000000058548
       ffff8101ffff8f40 0000000000000084 00000000000200d2 ffffffff80192107
       000000000039b4d2 ffffffff8016100b
Call Trace:<ffffffff80191b55>{prune_dcache+405} <ffffffff80192107>{shrink_dcache_memory+23}
       <ffffffff8016100b>{shrink_slab+219} <ffffffff801621b3>{try_to_free_pages+355}
       <ffffffff88077032>{:ext3:ext3_mark_iloc_dirty+834} <ffffffff8015a1d2>{__alloc_pages+594}
       <ffffffff88077173>{:ext3:ext3_mark_inode_dirty+51} <ffffffff801571f7>{generic_file_buffered_write+423}
       <ffffffff80138f15>{current_fs_time+85} <ffffffff8015638e>{do_generic_mapping_read+1374}
       <ffffffff80193c0e>{inode_update_time+62} <ffffffff80157c1a>{__generic_file_aio_write_nolock+938}
       <ffffffff80157cd1>{generic_file_aio_write+129} <ffffffff880741e3>{:ext3:ext3_file_write+35}
       <ffffffff801798c3>{do_sync_write+211} <ffffffff80183279>{cp_new_stat+233}
       <ffffffff80149ba0>{autoremove_wake_function+0} <ffffffff801799c8>{vfs_write+200}
       <ffffffff80179b63>{sys_write+83} <ffffffff8010dc66>{system_call+126}

Code: 48 8b 40 28 48 85 c0 74 05 48 89 df ff d0 48 8d 7b 48 48 c7
RIP <ffffffff80193aa5>{iput+53} RSP <ffff8103e74d79f8>

<0>general protection fault: 0000 [4] SMP
CPU 0
Modules linked in: nfs nfsd exportfs lockd sunrpc usbcore ext3 jbd mbcache raid0 md_mod sd_mod aic79xx
Pid: 13716, comm: java Not tainted 2.6.13-gentoo-r2
RIP: 0010:[<ffffffff801a11f3>] <ffffffff801a11f3>{inotify_inode_queue_event+323}
RSP: 0018:ffff810366e05988 EFLAGS: 00010246
RAX: ffff810060a21610 RBX: 756f630a73656962 RCX: 0000000000000000
RDX: ffff8100f0a21420 RSI: 0000000000000400 RDI: ffff8100f0a21620
RBP: ffff8103e06b00d0 R08: 0000000000000003 R09: 0000000000000000
R10: ffffffff80421aa0 R11: 0000000000000008 R12: ffff8103e06b00d8
R13: 000000006c6c6109 R14: ffff8100f0a21610 R15: 756f630a73656962
FS: 0000000040110ae0(0063) GS:ffffffff803e8800(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00002aab4af79000 CR3: 00000003a4b19000 CR4: 00000000000006a0
Process java (pid: 13716, threadinfo ffff810366e04000, task ffff8101ffcb1630)
Stack: ffff8102053b80d0 ffff8100f0a21620 0000000000000000 0000040000000000
       ffff8100f0a21420 ffff8100f0a21420 ffff8103e06b00d0 ffff8103e06b00d8
       0000000000000080 000000000039b4ec
Call Trace:<ffffffff80191b24>{prune_dcache+356} <ffffffff80192107>{shrink_dcache_memory+23}
       <ffffffff8016100b>{shrink_slab+219} <ffffffff801621b3>{try_to_free_pages+355}
       <ffffffff881061a8>{:nfs:nfs_unlock_request+72} <ffffffff8015a1d2>{__alloc_pages+594}
       <ffffffff801571f7>{generic_file_buffered_write+423} <ffffffff80156e68>{remove_suid+8}
       <ffffffff80157c1a>{__generic_file_aio_write_nolock+938} <ffffffff80157cd1>{generic_file_aio_write+129}
       <ffffffff88101b7f>{:nfs:nfs_file_write+191} <ffffffff801798c3>{do_sync_write+211}
       <ffffffff801143b4>{save_i387+164} <ffffffff80114283>{init_fpu+83}
       <ffffffff801105d3>{math_state_restore+35} <ffffffff80149ba0>{autoremove_wake_function+0}
       <ffffffff801799c8>{vfs_write+200} <ffffffff80179b63>{sys_write+83}
       <ffffffff8010dc66>{system_call+126}

Code: 4d 8b 7f 10 48 8d 43 10 49 83 ef 10 e9 2d ff ff ff 48 8b 7c
RIP <ffffffff801a11f3>{inotify_inode_queue_event+323} RSP <ffff810366e05988>

This time the machine stayed up and running. I wonder if it's a kswapd bug. Seems to have something to do with virtual memory.
I would like to add that prior to a crash of this nature, kswapd tends to peg the CPU at or near 100%.
Odd, they look fairly random. If randomize_va_space is 0, do the crashes disappear? Could you test vanilla-sources-2.6.14_rc2? Also, if you can spare the downtime, it may be worthwhile testing your RAM with memtest86+.
Whether randomize_va_space is 0 or 1 makes absolutely no difference. My colleague who has been working extensively with the system still believes it to be some kind of NFS problem.
Please test with gentoo-sources-2.6.14 and let us know if problems still remain.
We can no longer work on this issue as the machine is now running FreeBSD 5.4. It is my hope this issue does get resolved soon, however. Best of luck to you all.
emerge fails really often:

Nov 3 23:31:03 sng emerge[12807] general protection rip:2aaaabfefbbc rsp:7ffffff572d0 error:0
Nov 4 08:58:35 sng emerge[30609]: segfault at 00002aaa0000001b rip 00002aaaabfefbbc rsp 00007ffffff0a050 error 4
Nov 4 19:40:44 sng emerge[18724] general protection rip:2aaaaac8d2c8 rsp:7fffff807b00 error:0

Kernel debug mode is on, but there is no backtrace.

# cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
ondemand
# cat /proc/sys/kernel/randomize_va_space
1

# emerge --info
Portage 2.0.51.22-r3 (default-linux/amd64/2005.1, gcc-3.4.4, glibc-2.3.5-r2, 2.6.14-gentoo x86_64)
=================================================================
System uname: 2.6.14-gentoo x86_64 AMD Athlon(tm) 64 Processor 3500+
Gentoo Base System version 1.6.13
sys-devel/binutils:  2.15.92.0.2-r10
sys-devel/libtool:   1.5.20
virtual/os-headers:  2.6.11-r2
ACCEPT_KEYWORDS="amd64"
AUTOCLEAN="yes"
CBUILD="x86_64-pc-linux-gnu"
CFLAGS="-march=athlon64 -O3 -pipe -fomit-frame-pointer"
CHOST="x86_64-pc-linux-gnu"
CXXFLAGS="-march=athlon64 -O3 -pipe -fomit-frame-pointer"

I'll try changing the CPU governor and disabling VA randomization.
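For the record, the two changes I'm about to test look roughly like this (cpu0 shown, repeat for each CPU; it assumes the "performance" governor is compiled into the kernel):

  # Switch the frequency governor from ondemand to performance:
  echo performance > /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor

  # Disable address space randomization:
  echo 0 > /proc/sys/kernel/randomize_va_space

  # Check both took effect:
  cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor /proc/sys/kernel/randomize_va_space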