If a process accesses a large number of files over an NFS-mounted filesystem (in my case via rsync), it can hang in "disk sleep" (D) state indefinitely. The process cannot be terminated with kill -9, the system reports a minimum load average of 1.00, and the NFS-mounted filesystem cannot be cleanly unmounted, so the system must be rebooted uncleanly.

Reproducible: Always

Steps to Reproduce:
1. Mount an NFS filesystem containing hundreds of thousands of files.
2. rsync-copy these files elsewhere.
3. The rsync process will deadlock until the system is rebooted (a rough command sketch follows at the end of this report).

Actual Results: the process is not killable with signal 9, the load average stays at 1.00, and the filesystem cannot be unmounted.

Kernel version is 2.6.11-gentoo-r8; this was not a problem in any of the gentoo-sources 2.4 kernels I ran.

NFS mount from /proc/$pid/mounts:
rover:/home/server/public /mnt/public nfs ro,v3,rsize=8192,wsize=32768,hard,udp,lock,addr=rover 0 0

Portage 2.0.51.19 (default-linux/x86/2005.0, gcc-3.3.5-20050130, glibc-2.3.4.20041102-r1, 2.6.11-gentoo-r8 i686)
=================================================================
System uname: 2.6.11-gentoo-r8 i686 Pentium III (Coppermine)
Gentoo Base System version 1.4.16
Python:              dev-lang/python-2.3.5, dev-lang/python-2.2.3-r5 [2.3.5 (#1, May 3 2005, 01:35:55)]
distcc 2.16 i686-pc-linux-gnu (protocols 1 and 2) (default port 3632) [enabled]
dev-lang/python:     2.3.5, 2.2.3-r5
sys-apps/sandbox:    [Not Present]
sys-devel/autoconf:  2.13, 2.59-r6
sys-devel/automake:  1.7.9-r1, 1.5, 1.9.5, 1.4_p6, 1.6.3, 1.8.5-r3
sys-devel/binutils:  2.15.92.0.2-r7
sys-devel/libtool:   1.5.16
virtual/os-headers:  2.4.19-r1, 2.6.8.1-r2
ACCEPT_KEYWORDS="x86"
AUTOCLEAN="yes"
CFLAGS="-O3 -mcpu=pentium3 -funroll-loops -pipe"
CHOST="i686-pc-linux-gnu"
CONFIG_PROTECT="/etc /usr/kde/2/share/config /usr/kde/3.3/env /usr/kde/3.3/share/config /usr/kde/3.3/shutdown /usr/kde/3/share/config /usr/lib/X11/xkb /usr/share/config /var/bind /var/qmail/control"
CONFIG_PROTECT_MASK="/etc/gconf /etc/terminfo /etc/env.d"
CXXFLAGS="-O3 -mcpu=pentium3 -funroll-loops -pipe"
DISTDIR="/usr/portage/distfiles"
FEATURES="autoaddcvs autoconfig ccache distcc distlocks sandbox sfperms strict"
GENTOO_MIRRORS="http://distfiles.gentoo.org http://distro.ibiblio.org/pub/Linux/distributions/gentoo"
MAKEOPTS="-j6"
PKGDIR="/usr/portage/packages"
PORTAGE_TMPDIR="/var/tmp"
PORTDIR="/usr/portage"
SYNC="rsync://rsync.gentoo.org/gentoo-portage"
USE="x86 alsa apache2 apm arts avi berkdb bitmap-fonts crypt cups curl emboss encode foomaticdb fortran gd gif gpm imap imlib ipv6 jpeg libg++ libwww mad mbox mcal mikmod milter motif mp3 mpeg mysql ncurses nls ogg oggvorbis opengl oss pam pdflib perl png python quicktime readline sasl sdl session slang spell ssl svga tcltk tcpd tiff truetype truetype-fonts type1-fonts vorbis xml2 xmms xv zlib userland_GNU kernel_linux elibc_glibc"
Unset: ASFLAGS, CBUILD, CTARGET, LANG, LC_ALL, LDFLAGS, LINGUAS, PORTDIR_OVERLAY
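For reference, a minimal sketch of the reproduction, assuming the mount options listed in /proc/$pid/mounts above; the destination path /backup/public and the rsync flags are illustrative placeholders, not specific to the bug:

  # mount the export read-only over NFSv3/UDP, hard mount (options as reported above)
  mount -t nfs -o ro,nfsvers=3,rsize=8192,wsize=32768,hard,udp,lock rover:/home/server/public /mnt/public

  # recursively copy the whole tree (hundreds of thousands of files) somewhere local
  rsync -a /mnt/public/ /backup/public/

  # once it hangs, rsync sits in uninterruptible sleep (STAT "D") and ignores SIGKILL
  ps -o pid,stat,wchan,cmd -C rsync
  kill -9 <pid-of-rsync>    # has no effect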
Is this reproducible in vanilla-sources-2.6.12_rc4?
It isn't reproducible in vanilla 2.6.11.7 (which I assume is the non-Gentoo version of the Gentoo kernel giving me this problem). Is this sufficient information, or would it really help to run a release candidate of the vanilla sources? I'm hesitant to do that because this is a production server in a DMZ.
Don't worry about testing 2.6.12-rc then. I'm thinking this might be related to bug 87403. I'll ask around to see if anyone has ideas what might be causing this.
I just reproduced the problem in vanilla 2.6.11.7. It is significantly harder to trigger in the vanilla kernel, but it is probably not Gentoo-specific after all.
Please try to reproduce this in vanilla-sources-2.6.12_rc4.
(In reply to comment #5)
> Please try to reproduce this in vanilla-sources-2.6.12_rc4.

SCSI performance in this kernel is atrocious. For starters, it won't properly negotiate full speed with my U160 devices. Too painful to run on a production server.
2.6.12 is out now. Any chance you could try and reproduce there?
Just installed it. I will have to let it run for a few days before I can say for certain. I will update this bug with what happens by the end of this week.
So far it is behaving itself!
Great, please reopen if it happens again.