Bug 197264 - [2.6.23 regression] NFS data corruption
Summary: [2.6.23 regression] NFS data corruption
Status: RESOLVED FIXED
Alias: None
Product: Gentoo Linux
Classification: Unclassified
Component: [OLD] Core system
Hardware: All Linux
Importance: High critical
Assignee: Gentoo Kernel Bug Wranglers and Kernel Maintainers
URL:
Whiteboard: linux-2.6.23-regression
Keywords: InVCS
Depends on:
Blocks:
 
Reported: 2007-10-28 04:35 UTC by Henry Wertz
Modified: 2007-11-02 17:29 UTC
CC List: 4 users

See Also:
Package list:
Runtime testing required: ---


Attachments
possible fix (nfs.patch,2.38 KB, patch)
2007-11-01 11:01 UTC, Daniel Drake (RETIRED)

Description Henry Wertz 2007-10-28 04:35:12 UTC
With kernel 2.6.23, I've found that NFS writes are sometimes corrupted -- mainly large writes.

Reproducible: Always

Steps to Reproduce:
1. rsync some files that are typically 3MB+ (6MB+ more reliably).
2. rsync them back someplace, WITH -c (checksum) flag.
3. rsync them again with the -c (checksum) option -- the files resend because they are corrupted!




     I found avi or pdf files seem to corrupt most reliably.  I ran something similar to 
rsync -avPS -c beta:/mnt/1TB/*.avi .
rsync -avPS -c *.avi beta:/mnt/1TB/putthemback/
rsync -avPS -c *.avi beta:/mnt/1TB/putthemback/
        ^^^^^^^^------------- rsync sends some changes to these files, checksum fails and a few blocks must be resent.  If I rsync yet again, the same files resend again.

     /mnt/1TB is an NFS mount, mounted defaults,noauto,intr,tcp,timeo=300 
     (the timeo=300 was from when I had some wireless links in the mix, but now it is all 100mbit ethernet..)
     If I mount these NFS shares with the "sync" option, data corruption does not occur.
     To make sure this wasn't some disk fault or bad RAM or something, I tried NFS'ing around in other directions and had the same faults.  I didn't check if the data corruption was identical but it wouldn't surprise me.  Again, "sync" avoids problems.
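
A minimal sketch of the checksum comparison described above, assuming the original files and their NFS copy sit in two directories passed as arguments. The function name compare_sums and the MISMATCH output format are hypothetical; they are not part of rsync or of anything the reporter ran.

```shell
#!/bin/sh
# Hypothetical helper: list files whose MD5 sum differs between a source
# directory and a copy of it (for example on the NFS mount).
compare_sums() {
    src=$1
    dst=$2
    for f in "$src"/*; do
        [ -f "$f" ] || continue
        name=$(basename "$f")
        # md5sum prints "<hash>  -" when reading stdin; keep only the hash.
        a=$(md5sum < "$f" | cut -d' ' -f1)
        b=$(md5sum < "$dst/$name" | cut -d' ' -f1)
        [ "$a" = "$b" ] || echo "MISMATCH $name"
    done
}
```

For example, compare_sums . /mnt/1TB/putthemback would list the files whose copy differs. A read on the NFS client may be answered from the client's cache, so running md5sum on the server side as well is the safer check.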
Comment 1 Henry Wertz 2007-10-28 04:41:25 UTC
     Oh yeah, here's my emerge --info on one machine: 
Portage 2.1.3.16 (default-linux/x86/2007.0/desktop, gcc-4.2.2, glibc-2.6.1-r0, 2.6.23-gentoo i686)
=================================================================
System uname: 2.6.23-gentoo i686 AMD Athlon(tm) XP 2100+
Timestamp of tree: Sun, 28 Oct 2007 02:50:01 +0000
distcc 2.18.3 i686-pc-linux-gnu (protocols 1 and 2) (default port 3632) [enabled]
ccache version 2.4 [enabled]
app-shells/bash:     3.2_p17-r1
dev-java/java-config: 1.3.7, 2.1.2-r1
dev-lang/python:     2.4.4-r4, 2.5.1-r3
dev-python/pycrypto: 2.0.1-r6
dev-util/ccache:     2.4-r7
sys-apps/baselayout: 1.12.10-r5
sys-apps/sandbox:    1.2.18.1-r2
sys-devel/autoconf:  2.13, 2.61-r1
sys-devel/automake:  1.4_p6, 1.5, 1.6.3, 1.7.9-r1, 1.8.5-r3, 1.9.6-r2, 1.10
sys-devel/binutils:  2.18-r1
sys-devel/gcc-config: 1.4.0-r4
sys-devel/libtool:   1.5.24
virtual/os-headers:  2.6.23
ACCEPT_KEYWORDS="x86 ~x86"
CBUILD="i686-pc-linux-gnu"
CFLAGS="-O2 -march=athlon-xp -fomit-frame-pointer"
CHOST="i686-pc-linux-gnu"
CONFIG_PROTECT="/etc /usr/kde/3.5/env /usr/kde/3.5/share/config /usr/kde/3.5/shutdown /usr/share/config /var/bind"
CONFIG_PROTECT_MASK="/etc/env.d /etc/env.d/java/ /etc/gconf /etc/php/apache2-php5/ext-active/ /etc/php/cgi-php5/ext-active/ /etc/php/cli-php5/ext-active/ /etc/revdep-rebuild /etc/terminfo /etc/texmf/web2c /etc/udev/rules.d"
CXXFLAGS="-O2 -march=athlon-xp -fomit-frame-pointer"
DISTDIR="/usr/portage/distfiles"
FEATURES="ccache distcc distlocks fixpackages metadata-transfer sandbox sfperms strict unmerge-orphans userfetch"
GENTOO_MIRRORS="http://gentoo.mirrors.tds.net/gentoo"
MAKEOPTS="-j4"
PKGDIR="/usr/portage/packages"
PORTAGE_RSYNC_OPTS="--recursive --links --safe-links --perms --times --compress --force --whole-file --delete --delete-after --stats --timeout=180 --exclude=/distfiles --exclude=/local --exclude=/packages --filter=H_**/files/digest-*"
PORTAGE_TMPDIR="/var/rtmp/"
PORTDIR="/usr/portage"
PORTDIR_OVERLAY="/usr/local/portage"
SYNC="rsync://delta/gentoo-portage"
USE="3dnow 3dnowext X a52 aac acl acpi alsa apache2 arts asf audiofile berkdb bitmap-fonts bonobo browserplugin bzip2 cairo cdr cjk cli cracklib crosscompile crypt css cups curl dbus divx4linux dri dts dv dvd dvdr dvdread eds emboss encode esd evo expat fam fame ffmpeg firefox flac foomaticdb fortran gdbm gif gimp gimpprint glut gmp gnome gpm gstreamer gtk gtk2 gtkhtml hal iconv idn ieee1394 imagemagick insecure-savers ipv6 isdnlog java jpeg junit kde kdeenablefinal kerberos kqemu latin1 lcms ldap lua mad matroska midi mikmod mjpeg mmx mng mozsvg mp3 mpeg mudflap mysql mythtv nas ncurses nls nptl nptlonly nsplugin offensive ogg opengl openmp oss pam pcre pdf perl pic plugin png posix ppds pppd python qt3 qt3support qt4 quicktime readline reflection samba scanner sdl seamonkey session slang spell spl sse ssl subtitles svg tcltk tcpd threads threadsafe tiff truetype truetype-fonts type1-fonts unicode usb v4l v4l2 vorbis win32codecs wma wmf wmp x86 xanim xine xml xorg xv xvid zlib" ALSA_CARDS="ali5451 als4000 atiixp atiixp-modem bt87x ca0106 cmipci emu10k1 emu10k1x ens1370 ens1371 es1938 es1968 fm801 hda-intel intel8x0 intel8x0m maestro3 trident usb-audio via82xx via82xx-modem ymfpci" ALSA_PCM_PLUGINS="adpcm alaw asym copy dmix dshare dsnoop empty extplug file hooks iec958 ioplug ladspa lfloat linear meter mulaw multi null plug rate route share shm softvol" ELIBC="glibc" INPUT_DEVICES="keyboard mouse evdev" KERNEL="linux" LCD_DEVICES="bayrad cfontz cfontz633 glk hd44780 lb216 lcdm001 mtxorb ncurses text" USERLAND="GNU" VIDEO_CARDS="apm ark chips cirrus cyrix dummy fbdev glint i128 i740 i810 imstt mach64 mga neomagic nsc nv r128 radeon rendition s3 s3virge savage siliconmotion sis sisusb tdfx tga trident tseng v4l vesa vga via vmware voodoo"
Unset:  CTARGET, EMERGE_DEFAULT_OPTS, INSTALL_MASK, LANG, LC_ALL, LDFLAGS, LINGUAS, PORTAGE_COMPRESS, PORTAGE_COMPRESS_FLAGS, PORTAGE_RSYNC_EXTRA_OPTS


     Additionally, if I throw some Ubuntu 7.10 boxes into the mix as the machine running the NFS server, the machine running rsync, or both, this also results in data corruption, as long as there's a machine with 2.6.23 doing NFS writes.  (This eliminates the NFS server, or some rsync bug, as likely causes.)
Comment 2 Robin Johnson archtester Gentoo Infrastructure gentoo-dev Security 2007-10-28 05:54:26 UTC
Could you try between other machines, not your beta box, to exclude that machine? 

In all your examples, one of the sides was beta, and you were going over SSH to beta, and then on beta you claimed that /mnt/1TB was an NFS mount.
Eg:
server (NFS) -> beta -> (rsync over ssh) -> target box

Could you try NFS mounting directly to the target machine, or using non-SSH rsync?
Comment 3 seraph@xs4all.nl 2007-10-28 08:55:13 UTC
I've been having similar issues. For me, it happens consistently when having GIMP write a multi-layered .XCF file to a NFS-mounted directory.

I already downgraded the NFS server to a 2.6.22.9 kernel because of an oops in firewire-ohci. I'll try downgrading the client and see if that solves the corruption problem.
Comment 5 Henry Wertz 2007-10-29 05:55:12 UTC
     I tried that patch, it doesn't affect my problem.  Mounting -o sync does fix it (but slows down NFS writes.)

     To be clear on what I was doing, 
rsync -avPS -c beta:/mnt/1TB/*.avi .
rsync -avPS -c *.avi beta:/mnt/1TB/putthemback/
rsync -avPS -c *.avi beta:/mnt/1TB/putthemback/

      First one went  
delta ---NFS--> beta ---rsync via ssh---> qbert
Then the others go qbert--->beta--->delta .  After each run of rsync, the md5sums of the avis on the final destination are different.

     Note, I tried doing NFS writes from voltron (where I noticed the problem first), beta, and delta to make sure I did not just have a hardware fault.  I also wrote to both 2.6.23-gentoo and 2.6.22-14-generic (Ubuntu 7.10) NFS servers.  The NFS server's not the problem.  I tried what you suggested: rsyncing avis in the form of rsync -avPS *.avi /mnt/fattony/tmp/ (/mnt/fattony/ is an NFS mount) results in avis on /mnt/fattony/tmp/ with bad md5sums.  They are the same bad sums on every run though.  I just tried md5summing after the rsync via ssh and realized its corruption is deterministic too (same first set of bad md5sums, then running again will yield a deterministic second set of bad md5sums).  Copying files with cp doesn't corrupt, and neither does tarring them up, cd'ing to /mnt/fattony/tmp/ and untarring them.  And, mounting -o sync results in lower write speeds but again no corruption.

     In addition, I tried the patch above.  It patches in cleanly except one line in the Makefile (which was going to rename the kernel 2.6.23-CITI-NFS4-ALL-1.)  But it doesn't fix this nfs corruption issue...

     One last data point, the corrupted avis do play but with artifacts and small audio dropouts periodically.  corrupted PDFs load but with faulty images.  Almost every faulty PDF said it had an xref table error.. so, the files aren't being totally scrambled.  Based on rsync over ssh's behavior, I would guess the files are over 95% intact.
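
To put a number on how intact a file is, the differing bytes between a known-good copy and the corrupted copy can be counted with cmp -l. This is a small sketch with a hypothetical helper name, count_diff_bytes; it assumes both files have the same length, since cmp stops at the end of the shorter file.

```shell
#!/bin/sh
# Hypothetical helper: print the number of byte positions at which two
# files differ. cmp -l emits one line per differing byte.
count_diff_bytes() {
    cmp -l "$1" "$2" 2>/dev/null | wc -l | tr -d ' '
}
```

Dividing the result by the file size gives the damaged fraction of the file.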
Comment 6 seraph@xs4all.nl 2007-10-29 06:47:12 UTC
And corrupted .xcf files load only the first layer, dropping all additional ones. Reverting to 2.6.22.9 fixed the problem for me. Perhaps this one should be reported upstream to the kernel maintainers?
Comment 7 Daniel Drake (RETIRED) gentoo-dev 2007-10-29 15:37:23 UTC
So, you're using the system in question as an NFS client, and without making any changes to the server, rebooting between 2.6.22 and 2.6.23 makes this data corruption come and go?

Assuming my understanding is correct:
Can you reproduce this on the latest development kernel, currently 2.6.24-rc1?

Also, if you have time and patience, you can use a bisection process to find the exact commit between 2.6.22 and 2.6.23 which introduced this bug:
http://www.reactivated.net/weblog/archives/2006/01/using-git-bisect-to-find-buggy-kernel-patches/

It will require you to test about 13 kernels, but it is very likely to find the exact problematic commit. (use v2.6.22 as good and v2.6.23 as bad)

If you don't have time to do so, it's fine, we can explore other options. Also, don't do this unless you have confirmed 2.6.24-rc1 is also affected.
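
As a rough illustration of what bisection does, the sketch below builds a throwaway repository of 16 commits in which commit 11 introduces a fake "corruption", then lets git bisect run find it. Everything in it (the repository, the tag names good-release and bad-release, the file state.txt) is made up for the demo. On a real kernel tree the session would begin with git bisect start v2.6.23 v2.6.22, and each step needs a manual build, reboot and NFS test followed by git bisect good or git bisect bad.

```shell
#!/bin/sh
# Toy demonstration of git bisect on a throwaway repository (not a kernel tree).
repo=$(mktemp -d)
cd "$repo" || exit 1
git init -q
git config user.email "bisect@example.invalid"
git config user.name "bisect demo"

# Create 16 commits; from commit 11 onward state.txt says "corrupt".
i=1
while [ "$i" -le 16 ]; do
    if [ "$i" -ge 11 ]; then
        echo corrupt > state.txt
    else
        echo ok > state.txt
    fi
    echo "$i" > counter.txt
    git add state.txt counter.txt
    git commit -q -m "commit $i"
    if [ "$i" -eq 1 ]; then
        git tag good-release
    fi
    i=$((i + 1))
done
git tag bad-release

# Bad revision first, then the known-good one.
git bisect start bad-release good-release >/dev/null 2>&1
# The test command exits 0 for a good revision and non-zero for a bad one.
git bisect run sh -c 'grep -q ok state.txt' >/dev/null 2>&1

# After the run, refs/bisect/bad points at the first bad commit.
first_bad_subject=$(git log -1 --format=%s refs/bisect/bad)
echo "first bad commit: $first_bad_subject"
git bisect reset >/dev/null 2>&1
```

With manual testing, git bisect log prints the good/bad decisions made so far, which is useful to post if the process gets stuck on a kernel that does not build or boot.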
Comment 8 seraph@xs4all.nl 2007-10-29 20:42:14 UTC
Yes, you're right. My setup is as follows:

NFS server: Sun Blade 100, UltraSPARC 2e with kernel 2.6.22.9
NFS client: Asus P5B (ICH8), Intel Core 2 Duo with kernel 2.6.22.9/2.6.23.1

Without changing anything on the server, I can get the bug to appear and disappear by booting the client with 2.6.23.1 (has bug) or 2.6.22.9 (doesn't have bug) respectively.

As for testing with a 2.6.24-rc1 kernel, well I tried, but it breaks my network interface in such a horrible way that a full power cycle is needed to get it working again. So I won't be touching that one with an eleven-foot pole. (The NIC breakage is already on bugzilla.kernel.org, ID 9257)
Comment 9 Daniel Drake (RETIRED) gentoo-dev 2007-10-29 21:57:41 UTC
OK. Do you have time to try the 2.6.22-2.6.23 bisection regardless?
Comment 10 Henry Wertz 2007-10-30 04:26:04 UTC
(In reply to comment #7)
> So, you're using the system in question as an NFS client, and without making
> any changes to the server, rebooting between 2.6.22 and 2.6.23 makes this data
> corruption come and go?
     Yes exactly.  I ran several 2.6.22 series kernels no problem, 2.6.23 has it.
> 
> Assuming my understanding is correct:
> Can you reproduce this on the latest development kernel, currently 2.6.24-rc1?
> 
> Also, if you have time and patience, you can use a bisection process to find
> the exact commit between 2.6.22 and 2.6.23 which introduced this bug:
> http://www.reactivated.net/weblog/archives/2006/01/using-git-bisect-to-find-buggy-kernel-patches/
> 
> It will require you to test about 13 kernels, but it is very likely to find the
> exact problematic commit. (use v2.6.22 as good and v2.6.23 as bad)
> 
> If you don't have time to do so, it's fine, we can explore other options. Also,
> don't do this unless you have confirmed 2.6.24-rc1 is also affected.
> 
      Wow, that bisection description is some craziness, that is like RIDICULOUSLY slick to help track down the regressions that's for sure.  

     I'll try out 2.6.24-rc1 and let you know -- luckily (?), my computers are older: voltron has a Via KT400 chipset, Athlon XP 2100+, and RhineII; delta is a Shuttle SFF system with a Via KM400, Sempron 2500+ and RTL-8139.  beta is an MSI SFF system with nforce2 chipset and Athlon XP 2200+ (and nforce2 ethernet). I won't be doing the kernel switcheroo on beta though, it's a mythtv box 8-).    The RhineII and RTL8139 have been stable for years so hopefully 2.6.24-rc1 doesn't blow them up 8-).  I'll post back in a bit with some info.
Comment 11 seraph@xs4all.nl 2007-10-30 06:05:11 UTC
Well, I tried the bisecting, but that only resulted in kernels that wouldn't boot, or even compile. So no big help, I'm afraid. :-(
Comment 12 Henry Wertz 2007-10-30 06:16:00 UTC
>      I'll try out 2.6.24-rc1 and let you know 
     Yeah, 2.6.24-rc1 is fine, I reran rsync tests and no data corruption.  I
rsync'd  voltron--NFS-->delta, and qbert--ssh-->voltron--NFS-->delta (I did
remember to mount async, since sync avoided the bug).  Both worked fine.  I ran
md5sum on voltron and on delta, just to make sure voltron didn't keep a clean
copy in cache but send bad data to delta 8-).

Per some usenet post, to avoid a build failure at kernel link time (undefined
reference to `genapic') I had to also go into linux/arch/x86/kernel/crash.c and
do 3 lines of:
-#ifdef X86_32
+#ifdef CONFIG_X86_32

     (they had a diff to do it, but I suppose google news mangled the
formatting, because 2 out of 3 lines failed.)
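
The three-line edit described above could also be applied with sed rather than a mangled diff. This is only a sketch: the helper name fix_ifdef is hypothetical, and it assumes the affected lines read exactly "#ifdef X86_32", as in the -/+ lines quoted above.

```shell
#!/bin/sh
# Hypothetical helper: rewrite every line that is exactly "#ifdef X86_32"
# to "#ifdef CONFIG_X86_32" in the given file, leaving other lines alone.
fix_ifdef() {
    file=$1
    sed 's/^#ifdef X86_32$/#ifdef CONFIG_X86_32/' "$file" > "$file.new" &&
        mv "$file.new" "$file"
}
```

From the top of the kernel tree this would be called as fix_ifdef arch/x86/kernel/crash.c.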

     So... yeah.  2.6.24-rc1 fixes it 8-).  If it looks like the 2.6.23-gentoo
series will be out for a while before 2.6.24-gentoo comes out, I can try out
any and all nfs patches anyone suggests, or for that matter, I have ccache
installed so probably messing about with git-bisect wouldn't be *too* slow 8-).
Comment 13 Daniel Drake (RETIRED) gentoo-dev 2007-10-30 10:27:03 UTC
Henry, if you could attempt the bisection it would be useful. You may run into similar problems as Jos though. If that is the case, assuming you've marked at least one kernel good or bad, please post the output of "git bisect log"
Comment 14 Henry Wertz 2007-11-01 02:05:32 UTC
(In reply to comment #13)
> Henry, if you could attempt the bisection it would be useful. You may run into
> similar problems as Jos though. If that is the case, assuming you've marked at
> least one kernel good or bad, please post the output of "git bisect log"
> 

     Oh man, that took a while 8-).  I got through the first 10 or so bisects last night, and did the last ~6 or so tonight.  One or two of the patches must have changed all the headers, making ccache completely ineffective for several builds; luckily, it worked for the rest, speeding builds up greatly 8-). Additionally, on about the second bisect I had to disable the cciss block driver (Compaq SmartArray) due to a build failure; I do not have that card though, so no problem (I've used a SmartArray controller at work; what a piece of overkill 8-) ).  Here's the result:

voltron linux # git bisect good
44dd151d5c21234cc534c47d7382f5c28c3143cd is first bad commit
commit 44dd151d5c21234cc534c47d7382f5c28c3143cd
Author: Trond Myklebust <Trond.Myklebust@netapp.com>
Date:   Sat May 19 11:58:03 2007 -0400

    NFS: Don't mark a written page as uptodate until it is on disk

    The write may fail, so we should not mark the page as uptodate until we are
    certain that the data has been accepted and written to disk by the server.

    Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>

:040000 040000 2c795d415e29b7ce96d161c7d746f5a770ed85b3 421123c7d93457ca338bcf967236d74928d98788 M      fs

     Must be a logic error, based on the description it sure sounds like that patch would increase data safety rather than hosing it 8-).


Comment 15 Daniel Drake (RETIRED) gentoo-dev 2007-11-01 11:01:33 UTC
Created attachment 134877 [details, diff]
possible fix

Perfect! Thanks a lot for doing that. The bisection result led me right to this, which does sound like it's the fix in question. Please apply it to 2.6.23 and see if the problem goes away.
Comment 16 Henry Wertz 2007-11-02 03:18:03 UTC
(In reply to comment #15)
> Created an attachment (id=134877) [edit]
> possible fix
> 
> Perfect! Thanks a lot for doing that. The bisection result led me right to
> this, which does sound like it's the fix in question. Please apply it to 2.6.23
> and see if the problem goes away.
> 

     Yup that worked for me.  Thanks much.
Comment 17 seraph@xs4all.nl 2007-11-02 06:54:28 UTC
Seems to have done the trick here as well. Thank you very much.
Comment 18 Mike Pagano gentoo-dev 2007-11-02 17:29:59 UTC
Please note that genpatches-2.6.23-2 has been released with this patch included; gentoo-sources-2.6.23-r1 uses it.