1. Install the versions of the packages mentioned in the description, i.e. the 0.6.2-r1 or 0.6.2-r2 versions of the ZFS-related packages.
2. Reboot.
3. Download some large files with ktorrent.
4. Wait for ktorrent to hang.
5. Reboot.
6. Run a data check for the torrents in ktorrent.

Every downloaded chunk is reported as failed and must be redownloaded, which of course does not work since the same thing happens again.

I first observed this with the r1 versions of the packages and immediately went back to 0.6.2, where I had no problem. I later tried the r2 versions and once more ran into the same behavior. I have had zero problems with the 0.6.2 versions of the packages.

My current zfs mask:

# cat /etc/portage/package.mask/zfs-debugging
=sys-kernel/spl-0.6.2-r1
=sys-fs/zfs-kmod-0.6.2-r1
=sys-fs/zfs-0.6.2-r1
=sys-kernel/spl-0.6.2-r2
=sys-fs/zfs-kmod-0.6.2-r2
=sys-fs/zfs-0.6.2-r2

Reproducible: Always

# zpool status -v
  pool: tank
 state: ONLINE
status: The pool is formatted using a legacy on-disk format. The pool can still be used, but some features are unavailable.
action: Upgrade the pool using 'zpool upgrade'. Once this is done, the pool will no longer be accessible on software that does not support feature flags.
  scan: scrub canceled on Sat Oct 12 04:02:05 2013
config:

        NAME        STATE     READ WRITE CKSUM
        tank        ONLINE       0     0     0
          mirror-0  ONLINE       0     0     0
            sdd     ONLINE       0     0     0
            sdb     ONLINE       0     0     0
          mirror-1  ONLINE       0     0     0
            sde     ONLINE       0     0     0
            sdf     ONLINE       0     0     0
          mirror-2  ONLINE       0     0     0
            sdg     ONLINE       0     0     0
            sdc     ONLINE       0     0     0

errors: No known data errors

# emerge --info
Portage 2.2.7 (default/linux/amd64/13.0/desktop/kde, gcc-4.7.3, glibc-2.17, 3.11.5-gentoo x86_64)
=================================================================
System uname: Linux-3.11.5-gentoo-x86_64-Intel-R-_Core-TM-_i7-3930K_CPU_@_3.20GHz-with-gentoo-2.2
KiB Mem: 65957832 total, 41608788 free
KiB Swap: 0 total, 0 free
Timestamp of tree: Wed, 23 Oct 2013 23:30:01 +0000
ld GNU ld (GNU Binutils) 2.23.2
app-shells/bash: 4.2_p45
dev-java/java-config: 2.2.0
dev-lang/python: 2.7.5-r3, 3.2.5-r3, 3.3.2-r2
dev-util/cmake: 2.8.12
dev-util/pkgconfig: 0.28
sys-apps/baselayout: 2.2
sys-apps/openrc: 0.12.3
sys-apps/sandbox: 2.6-r1
sys-devel/autoconf: 2.13, 2.69
sys-devel/automake: 1.11.6, 1.12.6, 1.14
sys-devel/binutils: 2.23.2
sys-devel/gcc: 4.7.3-r1, 4.8.1-r1
sys-devel/gcc-config: 1.8
sys-devel/libtool: 2.4.2
sys-devel/make: 3.82-r4
sys-kernel/linux-headers: 3.11 (virtual/os-headers)
sys-libs/glibc: 2.17
Repositories: gentoo sunrise multimedia sabayon steam-overlay dotnet pipelight magnus_local
ACCEPT_KEYWORDS="amd64 ~amd64"
ACCEPT_LICENSE="* -@EULA"
CBUILD="x86_64-pc-linux-gnu"
CFLAGS="-march=native -O2 -pipe"
CHOST="x86_64-pc-linux-gnu"
CONFIG_PROTECT="/etc /usr/share/config /usr/share/gnupg/qualified.txt /usr/share/themes/oxygen-gtk/gtk-2.0"
CONFIG_PROTECT_MASK="/etc/ca-certificates.conf /etc/dconf /etc/env.d /etc/fonts/fonts.conf /etc/gconf /etc/gentoo-release /etc/revdep-rebuild /etc/sandbox.d /etc/terminfo /etc/texmf/language.dat.d /etc/texmf/language.def.d /etc/texmf/updmap.d /etc/texmf/web2c"
CXXFLAGS="-march=native -O2 -pipe"
DISTDIR="/usr/portage/distfiles"
FCFLAGS="-O2 -pipe"
FEATURES="assume-digests binpkg-logs config-protect-if-modified distlocks ebuild-locks fixlafiles merge-sync news parallel-fetch preserve-libs protect-owned sandbox sfperms strict unknown-features-warn unmerge-logs unmerge-orphans userfetch userpriv usersandbox usersync"
FFLAGS="-O2 -pipe"
GENTOO_MIRRORS="http://distfiles.gentoo.org"
LANG="en_US.UTF-8"
LDFLAGS="-Wl,-O1 -Wl,--as-needed"
MAKEOPTS="-j10"
PKGDIR="/usr/portage/packages"
PORTAGE_CONFIGROOT="/"
PORTAGE_RSYNC_OPTS="--recursive --links --safe-links --perms --times --omit-dir-times --compress --force --whole-file --delete --stats --human-readable --timeout=180 --exclude=/distfiles --exclude=/local --exclude=/packages"
PORTAGE_TMPDIR="/var/tmp"
PORTDIR="/usr/portage"
PORTDIR_OVERLAY="/var/lib/layman/sunrise /var/lib/layman/multimedia /var/lib/layman/sabayon /var/lib/layman/steam /var/lib/layman/dotnet /var/lib/layman/pipelight /home/malm/portage"
SYNC="rsync://rsync.gentoo.org/gentoo-portage"
USE="X a52 aac acl acpi alsa amd64 avahi avx avx256 bash-completion berkdb bluetooth bluray branding bzip2 cairo cdda cdr cli consolekit cracklib crypt css cups cxx dbus declarative dri dts dvd dvdr emboss encode exif fam firefox flac fortran gdbm gif gnome gpm gtk iconv icu inotify ipv6 jpeg kde kipi lcms ldap libnotify lm_sensors mad mmx mng modules mp3 mp4 mpeg mtp mudflap multilib ncurses networkmanager nls nptl ogg opengl openmp pam pango pcre pdf phonon plasma png policykit ppds qt3support qt4 rdp readline samba scanner sdl semantic-desktop session spell sse sse2 sse3 sse4 sse4_1 sse4_2 sse4a ssl ssse3 startup-notification svg tcpd tiff truetype udev udisks unicode upower usb vaapi vdpau vorbis wxwidgets x264 xcb xcomposite xinerama xml xscreensaver xv xvid xvmc zeroconf zlib zsh-completion"
ABI_X86="64"
ALSA_CARDS="ali5451 als4000 atiixp atiixp-modem bt87x ca0106 cmipci emu10k1x ens1370 ens1371 es1938 es1968 fm801 hda-intel intel8x0 intel8x0m maestro3 trident usb-audio via82xx via82xx-modem ymfpci"
APACHE2_MODULES="authn_core authz_core socache_shmcb unixd actions alias auth_basic authn_alias authn_anon authn_dbm authn_default authn_file authz_dbm authz_default authz_groupfile authz_host authz_owner authz_user autoindex cache cgi cgid dav dav_fs dav_lock deflate dir disk_cache env expires ext_filter file_cache filter headers include info log_config logio mem_cache mime mime_magic negotiation rewrite setenvif speling status unique_id userdir usertrack vhost_alias"
CALLIGRA_FEATURES="kexi words flow plan sheets stage tables krita karbon braindump author"
CAMERAS="ptp2"
COLLECTD_PLUGINS="df interface irq load memory rrdtool swap syslog"
ELIBC="glibc"
GPSD_PROTOCOLS="ashtech aivdm earthmate evermore fv18 garmin garmintxt gpsclock itrax mtk3301 nmea ntrip navcom oceanserver oldstyle oncore rtcm104v2 rtcm104v3 sirf superstar2 timing tsip tripmate tnt ubx"
INPUT_DEVICES="keyboard mouse evdev"
KERNEL="linux"
LCD_DEVICES="bayrad cfontz cfontz633 glk hd44780 lb216 lcdm001 mtxorb ncurses text"
LIBREOFFICE_EXTENSIONS="presenter-console presenter-minimizer"
LINGUAS="en_US en_GB en sv sv_SE ja ja_JP"
OFFICE_IMPLEMENTATION="libreoffice"
PHP_TARGETS="php5-5"
PYTHON_SINGLE_TARGET="python2_7"
PYTHON_TARGETS="python2_7 python3_2"
RUBY_TARGETS="ruby19 ruby18"
USERLAND="GNU"
VIDEO_CARDS="radeon r600 modesetting fbdev vesa"
XTABLES_ADDONS="quota2 psd pknock lscan length2 ipv4options ipset ipp2p iface geoip fuzzy condition tee tarpit sysrq steal rawnat logmark ipmark dhcpmac delude chaos account"
Unset: CPPFLAGS, CTARGET, EMERGE_DEFAULT_OPTS, INSTALL_MASK, LC_ALL, PORTAGE_BUNZIP2_COMMAND, PORTAGE_COMPRESS, PORTAGE_COMPRESS_FLAGS, PORTAGE_RSYNC_EXTRA_OPTS, USE_PYTHO
I should add that I saw nothing in /var/log/messages that seemed relevant to me.
(In reply to Magnus Lidbom from comment #0)
> 4.Wait for ktorrent to hang
> 5.Reboot

How does that work? Why do you reboot? Did you sync disks first?
(In reply to Jeroen Roovers from comment #2)
> (In reply to Magnus Lidbom from comment #0)
> > 4.Wait for ktorrent to hang
> > 5.Reboot
>
> How does that work? Why do you reboot? Did you sync disks first?

I reboot because the computer is not really functional anymore. Ktorrent hanging is the example I used because it is easily reproducible (for me at least), probably simply because it writes a large amount of data. But xbmc hangs as well, and so on.

I did not do anything manually to sync the drives. I just rebooted as usual.

I can't swear to this (and I am not really eager to try it out again, but I will try to find the time if you need me to), but I believe that all the data appeared to be correct until after the reboot.

Wild speculation: It is possible that no changes are ever flushed to permanent storage when I encounter this problem. The computer in question has 64GB of RAM, so it could cache large amounts of writes without running out of RAM. Possibly that is exactly what happens, and I encounter the hangs when zfs refuses to cache more writes...
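To make the "flushed to permanent storage" question concrete: the sketch below shows how a program can explicitly ask the kernel to push a file to stable storage instead of leaving it in the write cache. It is only an illustration of the concept; `write_durably` is a made-up helper name, and nothing here is taken from ktorrent or ZFS.

```python
import os


def write_durably(path: str, data: bytes) -> None:
    """Write data to path and request a flush to stable storage.

    Hypothetical helper for illustration only (assumes a POSIX system).
    """
    with open(path, "wb") as f:
        f.write(data)
        f.flush()             # move Python's userspace buffer into the kernel
        os.fsync(f.fileno())  # ask the kernel to flush the file to the device

    # Flush the containing directory as well, so the new directory
    # entry itself is requested to be made durable.
    dir_fd = os.open(os.path.dirname(os.path.abspath(path)), os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)
```

Data written without such a call may sit in memory for a while, so losing the most recent writes on a hard reset would not be surprising by itself. Losing everything written over many hours would be.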
I've completed a scrub now with no errors detected on the pool.
We can scratch the specific versions of ZFS. Apparently, having only seen the problem with those versions was a coincidence. It just occurred with 0.6.2. I took some time to take down notes on what I observed this time (on another computer). Here they are:

First saw a hang in ktorrent.
Opened Chrome to check email. Chrome hung.
Konsole (opened for debugging) hung.
Tried to kill those processes when they were detected as unresponsive, but then plasma/kde hung.

ctrl+alt+f1
iotop hangs the console completely. ctrl+c does not work.

ctrl+alt+f2
# tail -f -n 1000 /var/log/messages
Keeps writing about new snapshots being created by zfs-auto-snapshot.
No errors or anything that looks relevant in /var/log/messages that I can find.
# dmesg
Shows nothing new or of relevance.

Created created_during_zfs_hang.txt in my home folder (in the zfs pool) with a single line of text.
Waited for the next frequent snapshot to occur.

Reboot:
Long wait at "Stopping VMWare Authentication Daemon".
Waited for more than 2 minutes and then hit the hardware reset button.
Booted without problem.

Checked data on the 4 ktorrent downloads that had completed since I last checked. 100% of all chunks that were supposed to be downloaded came back as "Failed", giving 0% downloaded as a result. The combined size of those downloads is about 4GB.

created_during_zfs_hang.txt is present with the correct text in it.
The snapshot that was created after the "hang" also contained created_during_zfs_hang.txt with the correct data.

I'm no longer even sure this is related to zfs.
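For anyone who wants to check files independently of ktorrent, here is a minimal sketch of a per-chunk check. It assumes fixed-size chunks hashed with SHA-1, as BitTorrent pieces are; the function names and the idea of recording your own hash list before a reboot are illustrative and are not how ktorrent is implemented.

```python
import hashlib


def chunk_hashes(path, chunk_size):
    """Return the SHA-1 hex digest of each fixed-size chunk of a file."""
    digests = []
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            digests.append(hashlib.sha1(chunk).hexdigest())
    return digests


def fraction_valid(path, expected, chunk_size):
    """Return the fraction of expected chunk hashes the file still matches."""
    if not expected:
        return 1.0
    actual = chunk_hashes(path, chunk_size)
    valid = sum(
        1
        for index, want in enumerate(expected)
        if index < len(actual) and actual[index] == want
    )
    return valid / len(expected)
```

Recording `chunk_hashes(...)` right after a download completes and calling `fraction_valid(...)` after the next reboot would give a number comparable to the percentage ktorrent reports, without depending on ktorrent's own state.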
Some more findings/clues: I have ktorrent configured to "Check data when download is finished". So the downloaded 4GB were checked and deemed correct when the downloads completed. Despite this not a single chunk passed check after reboot. I compared the contents of one of the files with the same file in several old snapshots. All versions of that file were identical including the latest. I redownloaded that file and of course it now differed compared to all the old versions. I'll reboot now and see if the changes persist.
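The comparison against old snapshots can be scripted. ZFS exposes snapshots read-only under `.zfs/snapshot/<name>/` at the root of each dataset; the sketch below hashes one file in the live dataset and in every snapshot that contains it. The dataset root and relative path are placeholders for whatever the actual layout is, and the function names are made up for this example.

```python
import hashlib
import os


def sha256_of(path):
    """Return the SHA-256 hex digest of a file, read in 1 MiB blocks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(1 << 20), b""):
            digest.update(block)
    return digest.hexdigest()


def compare_with_snapshots(dataset_root, relative_path):
    """Map each snapshot name to True if its copy matches the live file.

    Snapshots that do not contain the file are left out of the result.
    """
    live = sha256_of(os.path.join(dataset_root, relative_path))
    snapshot_dir = os.path.join(dataset_root, ".zfs", "snapshot")
    result = {}
    for name in sorted(os.listdir(snapshot_dir)):
        candidate = os.path.join(snapshot_dir, name, relative_path)
        if os.path.isfile(candidate):
            result[name] = sha256_of(candidate) == live
    return result
```

If every snapshot maps to `True`, the file has been byte-identical since the oldest snapshot that contains it, which is the situation described above.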
The changes to that file persisted across a reboot. I also know for a fact that that file was non-corrupt as far as xbmc was concerned before the last hang/reboot. It is a video that I had watched from start to end. I'll redownload another of the files and use the hard reset button to restart this time and see if the changes persist.
I hit the reset button the moment the download was verified. After reboot some 70% of the chunks were valid. Seems reasonable with non-sync flushing of writes and a fast download. It is now complete again and has been so while I'm typing this message. Will hit reset again after posting it. I assume everything should be flushed by now unless something is broken.
Yes. The file was entirely consistent after this hard reset. So even with hard resets the behavior is sane in the standard case. The file is consistent within seconds of write. Not utterly broken the next day and in multiple snapshots like I've been seeing after the hangs. Off to do more digging.
Random note: I use spl-0.6.2 + zfs-0.6.2-r3 + kernel.org raw sources 3.11.2 daily without any major problems. (However, I have had some really poor performance on certain operations, rm -Rf type stuff on big trees, where ext4 beats it by 30x on the same operation on a different partition on the same physical disk. I'm not clear on the reason for that, but it's good enough for my workload.)
Just checking in to report that I've been running the latest ~x86 versions of the packages in this bug report for months now with no problems. I'm almost completely convinced that the behavior I observed was a critical bug in the zfs code I was running at the time. On the other hand, keeping an open bug around for an issue that apparently no one is currently encountering seems debatable at best...
I usually take care of non-packaging ZFS-related bugs at the upstream tracker, as it gets more eyeballs on problems. An unfortunate consequence of that is that this slipped through the deluge of bug mail that I receive. I am usually available and responsive on IRC, which tends to make matters worse as users cut into the front of the line by pinging me there.

That being said, I believe that I resolved this bug in late October of last year at upstream when I ported the following fix:

https://github.com/zfsonlinux/zfs/commit/a117a6d66e5cf1e9d4f173bccc786a169e9a8e04

This came into Gentoo in early November of last year, when I backported it in the 0.6.2-r3 ebuilds, if I recall correctly. As such, I am closing this as obsolete.