I am seeing file corruption following appends on NFS mounts shared by multiple machines. Some inconsistency is expected, even with file locking, when writes are simultaneous, but in this case they are not.

This behavior is not seen in our previously tested kernel, 2.6.17, but is seen on all versions from 2.6.21 through the current 2.6.22-r6. Other versions may be affected, but I have not tested them all. Using gentoo-sources.

Mount options:
rw,proto=tcp,vers=3,rsize=32768,wsize=32768,timeo=600,retrans=2,hard,intr,bg,noatime

testlinux ~ # rpcinfo -p
   program vers proto   port
    100000    2   tcp    111  portmapper
    100000    2   udp    111  portmapper
    100003    2   udp   2049  nfs
    100003    3   udp   2049  nfs
    100003    4   udp   2049  nfs
    100021    1   udp  32770  nlockmgr
    100021    3   udp  32770  nlockmgr
    100021    4   udp  32770  nlockmgr
    100003    2   tcp   2049  nfs
    100003    3   tcp   2049  nfs
    100003    4   tcp   2049  nfs
    100021    1   tcp  51346  nlockmgr
    100021    3   tcp  51346  nlockmgr
    100021    4   tcp  51346  nlockmgr
    100024    1   udp  32771  status
    100024    1   tcp  40714  status
    100005    1   udp    876  mountd
    100005    1   tcp    879  mountd
    100005    2   udp    876  mountd
    100005    2   tcp    879  mountd
    100005    3   udp    876  mountd
    100005    3   tcp    879  mountd

Reproducible: Always

Steps to Reproduce:
1. Run uname -a >> testfile alternately on two or more machines sharing a filesystem over NFS. Not at the same time, but alternately, with a short delay in between.
2. Inspect the results; the error is visible in vi or a hexdump.

Actual Results:
The output is appended, but with a padding of null characters (value 0000).

Expected Results:
The machine should append the output of "uname -a" to the end of the file.

#
# Network File Systems
#
CONFIG_NFS_FS=y
CONFIG_NFS_V3=y
CONFIG_NFS_V3_ACL=y
CONFIG_NFS_V4=y
# CONFIG_NFS_DIRECTIO is not set
CONFIG_NFSD=y
CONFIG_NFSD_V2_ACL=y
CONFIG_NFSD_V3=y
CONFIG_NFSD_V3_ACL=y
CONFIG_NFSD_V4=y
CONFIG_NFSD_TCP=y
CONFIG_LOCKD=y
CONFIG_LOCKD_V4=y
CONFIG_EXPORTFS=y
CONFIG_NFS_ACL_SUPPORT=y
CONFIG_NFS_COMMON=y
CONFIG_SUNRPC=y
CONFIG_SUNRPC_GSS=y
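For anyone who wants to automate the inspection in step 2 rather than eyeball a hexdump, here is a minimal sketch. It assumes Python is available on the clients; the function names append_line and find_nul_runs are made up for this example and are not part of any existing tool. append_line opens the file in append mode, which is meant to mimic the shell's >> redirection, and find_nul_runs reports where runs of null bytes sit in the file.

```python
import os


def append_line(path, text):
    """Append one line to path in append mode (hypothetical helper).

    Opening with "ab" sets O_APPEND, similar to the shell's >> redirection.
    """
    with open(path, "ab") as f:
        f.write(text.encode() + b"\n")
        f.flush()
        os.fsync(f.fileno())  # push the data out before the next client appends


def find_nul_runs(path):
    """Return a list of (offset, length) tuples, one per run of NUL bytes."""
    with open(path, "rb") as f:
        data = f.read()
    runs = []
    start = None
    for i, byte in enumerate(data):
        if byte == 0:
            if start is None:
                start = i  # a new run of NUL bytes begins here
        elif start is not None:
            runs.append((start, i - start))
            start = None
    if start is not None:
        # the file ends in the middle of a run
        runs.append((start, len(data) - start))
    return runs
```

Calling append_line("testfile", "some text") alternately on each client and then find_nul_runs("testfile") on any of them should return an empty list on an unaffected kernel. Any tuples returned give the byte offset and length of each null padding region.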
Created attachment 130037: Kernel Configuration used...
Created attachment 130038: Example of flawed output
testlinux ~ # emerge --info
Portage 2.1.2.9 (default-linux/x86/2006.1, gcc-3.4.4, glibc-2.5-r4, 2.6.22-gentoo-r6-vmware i686)
=================================================================
System uname: 2.6.22-gentoo-r6-vmware i686 Intel(R) Xeon(R) CPU E5335 @ 2.00GHz
Gentoo Base System release 1.12.9
Timestamp of tree: Tue, 04 Sep 2007 16:00:01 +0000
app-shells/bash:     3.2_p15-r1
dev-java/java-config: 1.2.11-r1
dev-lang/python:     2.4.4-r4
dev-python/pycrypto: 2.0.1-r5
sys-apps/baselayout: 1.12.9-r2
sys-apps/sandbox:    1.2.17
sys-devel/autoconf:  2.13, 2.61
sys-devel/automake:  1.4_p6, 1.5, 1.6.3, 1.7.9-r1, 1.8.5-r3, 1.9.6-r1, 1.10
sys-devel/binutils:  2.17
sys-devel/gcc-config: 1.3.12-r6
sys-devel/libtool:   1.5.23b
virtual/os-headers:  2.6.11-r2
ACCEPT_KEYWORDS="x86"
CBUILD="i686-pc-linux-gnu"
CFLAGS="-O2 -mtune=i686 -pipe"
CHOST="i686-pc-linux-gnu"
CONFIG_PROTECT="/etc /usr/lib/X11/xkb"
CONFIG_PROTECT_MASK="/etc/env.d /etc/gconf /etc/revdep-rebuild /etc/terminfo /etc/texmf/web2c"
CXXFLAGS="-O2 -mtune=i686 -pipe"
DISTDIR="/usr/portage/distfiles"
FEATURES="distlocks metadata-transfer sandbox sfperms strict"
GENTOO_MIRRORS="http://distfiles.gentoo.org http://distro.ibiblio.org/pub/linux/distributions/gentoo"
PKGDIR="/usr/portage/packages"
PORTAGE_RSYNC_OPTS="--recursive --links --safe-links --perms --times --compress --force --whole-file --delete --delete-after --stats --timeout=180 --exclude='/distfiles' --exclude='/local' --exclude='/packages'"
PORTAGE_TMPDIR="/var/tmp"
PORTDIR="/usr/portage"
SYNC="rsync://rsync.gentoo.org/gentoo-portage"
USE="berkdb bitmap-fonts cli cracklib crypt cups dri fortran gdbm gpm iconv ipv6 isdnlog midi mudflap ncurses nls nptl nptlonly openmp pam pcre perl ppds pppd python readline reflection session spl ssl tcpd truetype-fonts type1-fonts unicode x86 xorg zlib" ALSA_CARDS="ali5451 als4000 atiixp atiixp-modem bt87x ca0106 cmipci emu10k1 emu10k1x ens1370 ens1371 es1938 es1968 fm801 hda-intel intel8x0 intel8x0m maestro3 trident usb-audio via82xx via82xx-modem ymfpci" ALSA_PCM_PLUGINS="adpcm alaw asym copy dmix dshare dsnoop empty extplug file hooks iec958 ioplug ladspa lfloat linear meter mulaw multi null plug rate route share shm softvol" ELIBC="glibc" INPUT_DEVICES="keyboard mouse evdev" KERNEL="linux" LCD_DEVICES="bayrad cfontz cfontz633 glk hd44780 lb216 lcdm001 mtxorb ncurses text" USERLAND="GNU" VIDEO_CARDS="apm ark chips cirrus cyrix dummy fbdev glint i128 i740 i810 imstt mach64 mga neomagic nsc nv r128 radeon rendition s3 s3virge savage siliconmotion sis sisusb tdfx tga trident tseng v4l vesa vga via vmware voodoo"
Unset:  CTARGET, EMERGE_DEFAULT_OPTS, INSTALL_MASK, LANG, LC_ALL, LDFLAGS, LINGUAS, MAKEOPTS, PORTAGE_COMPRESS, PORTAGE_COMPRESS_FLAGS, PORTAGE_RSYNC_EXTRA_OPTS, PORTDIR_OVERLAY
Can you please test this with the latest development kernel (2.6.23-rc5 as of this writing)? Also, can you include the dmesg output for both your test machines, including the logging during your test? Thanks.
I've tested 2.6.23-rc5 and 2.6.23-rc6. The problem was fixed in 2.6.23-rc6. I suggest backporting this; it is a serious bug that can lead to file corruption when multiple Gentoo boxes are accessing the same files over NFS.
commit 1b3b4a1a2deb7d3e5d66063bd76304d840c966b3
Author: Trond Myklebust <Trond.Myklebust@netapp.com>
Date:   Tue Aug 28 10:29:36 2007 -0400

    NFS: Fix a write request leak in nfs_invalidate_page()

    Ryusuke Konishi says:

    The recent truncate_complete_page() clears the dirty flag from a page
    before calling a_ops->invalidatepage(),
    ^^^^^^
    static void
    truncate_complete_page(struct address_space *mapping, struct page *page)
    {
            ...
            cancel_dirty_page(page, PAGE_CACHE_SIZE);  <--- Inserted here at kernel 2.6.20

            if (PagePrivate(page))
                    do_invalidatepage(page, 0);  ---> will call a_ops->invalidatepage()
            ...
    }

    and this is disturbing nfs_wb_page_priority() from calling
    nfs_writepage_locked() that is expected to handle the pending
    request (=nfs_page) associated with the page.

    int nfs_wb_page_priority(struct inode *inode, struct page *page, int how)
    {
            ...
            if (clear_page_dirty_for_io(page)) {
                    ret = nfs_writepage_locked(page, &wbc);
                    if (ret < 0)
                            goto out;
            }
            ...
    }

    Since truncate_complete_page() will get rid of the page after
    a_ops->invalidatepage() returns, the request (=nfs_page) associated
    with the page becomes a garbage in nfs_inode->nfs_page_tree.

    ------------------------

    Fix this by ensuring that nfs_wb_page_priority() recognises that it may also
    need to clear out non-dirty pages that have an nfs_page associated with them.

    Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
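To make the ordering problem in that commit message easier to follow, here is a toy model in Python. It is not kernel code and does not reproduce the actual patch. The Page class, the pending set (standing in for nfs_inode->nfs_page_tree), and the two wb_page_* functions are all invented for illustration, and the only thing modelled is the sequence the commit message describes: the dirty flag is cleared first, so a write-back routine that only acts on dirty pages never releases the pending request.

```python
class Page:
    """Toy stand-in for a page cache page (hypothetical, not a kernel type)."""

    def __init__(self):
        self.dirty = True


def wb_page_dirty_only(page, pending):
    """Behaviour as described before the fix: only dirty pages are handled."""
    if page.dirty:
        page.dirty = False
        pending.discard(page)


def wb_page_any_request(page, pending):
    """Idea of the fix: also clear out non-dirty pages that carry a request."""
    if page.dirty or page in pending:
        page.dirty = False
        pending.discard(page)


def truncate(page, pending, wb_page):
    """Model of the truncate path; returns the number of leaked requests."""
    page.dirty = False           # the dirty flag is cleared first
    if page in pending:          # stands in for the PagePrivate(page) check
        wb_page(page, pending)   # stands in for the invalidatepage callback
    # the page is discarded at this point; anything left in pending is leaked
    return len(pending)
```

With a single page whose request is in pending, truncate(page, pending, wb_page_dirty_only) leaves one request behind, while truncate(page, pending, wb_page_any_request) leaves none. Whether this leak is related to the null padding reported here is a separate question that the model does not address.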
Never mind, I was wrong; this is still broken in 2.6.23-rc6.
Thanks for testing. Please file an upstream bug report at http://bugzilla.kernel.org and post the new bug URL here.
Closing after no response. Harvey, if this is still an issue, please reopen this bug after you have tested the latest development kernel (currently v2.6.28-rc7) and filed a bug report upstream.