Bug 489210 - sys-kernel/spl-0.6.2-r1,2 sys-fs/zfs-kmod-0.6.2-r1,2 =sys-fs/zfs-0.6.2-r1 - system hangs, filesystem corrupt after reboot
Status: RESOLVED OBSOLETE
Alias: None
Product: Gentoo Linux
Classification: Unclassified
Component: [OLD] Core system
Hardware: AMD64 Linux
Importance: Highest critical
Assignee: Richard Yao (RETIRED)
URL:
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2013-10-23 23:57 UTC by Magnus Lidbom
Modified: 2014-06-24 16:40 UTC
0 users

See Also:
Package list:
Runtime testing required: ---



Description Magnus Lidbom 2013-10-23 23:57:16 UTC
1.Install the 0.6.2-r1 or 0.6.2-r2 versions of the zfs related packages named in the summary.
2.Reboot
3.Download some large files with ktorrent
4.Wait for ktorrent to hang
5.Reboot
6.Run a data check on the torrents in ktorrent. Every downloaded chunk fails the check and must be redownloaded, which of course does not help since the same thing happens again.
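
For readers unfamiliar with what the data check in step 6 does: BitTorrent verifies a download piece by piece, comparing the SHA-1 of each fixed-size piece against the digest stored in the .torrent file. The sketch below shows that idea only; it is not ktorrent's code, and the function names, the piece size and the expected digest are hypothetical inputs that would have to come from the torrent metadata.

```shell
# Print the SHA-1 of piece number <index> (0-based) of <file>, where each
# piece is <piece_size> bytes long (the last piece may be shorter).
# Usage: piece_sha1 <file> <piece_size> <index>
piece_sha1() {
    dd if="$1" bs="$2" skip="$3" count=1 2>/dev/null | sha1sum | awk '{print $1}'
}

# Succeed only if the piece matches the expected hex digest.
# Usage: check_piece <file> <piece_size> <index> <expected_sha1_hex>
check_piece() {
    [ "$(piece_sha1 "$1" "$2" "$3")" = "$4" ]
}
```

If the file content on disk is not what was written during the download, every piece hashes to a different digest and the whole torrent shows as failed, which matches the symptom described above.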

I observed this first with the r1 versions of the packages and immediately went back to 0.6.2 and had no problem.
I later tried the r2 version and once more ran into the same behavior. 
I have had 0 problems with the 0.6.2 versions of the packages. 

My current zfs mask:
# cat /etc/portage/package.mask/zfs-debugging 
=sys-kernel/spl-0.6.2-r1
=sys-fs/zfs-kmod-0.6.2-r1
=sys-fs/zfs-0.6.2-r1

=sys-kernel/spl-0.6.2-r2
=sys-fs/zfs-kmod-0.6.2-r2
=sys-fs/zfs-0.6.2-r2

Reproducible: Always




# zpool status -v
  pool: tank
 state: ONLINE
status: The pool is formatted using a legacy on-disk format.  The pool can
        still be used, but some features are unavailable.
action: Upgrade the pool using 'zpool upgrade'.  Once this is done, the
        pool will no longer be accessible on software that does not support
        feature flags.
  scan: scrub canceled on Sat Oct 12 04:02:05 2013
config:

        NAME        STATE     READ WRITE CKSUM
        tank        ONLINE       0     0     0
          mirror-0  ONLINE       0     0     0
            sdd     ONLINE       0     0     0
            sdb     ONLINE       0     0     0
          mirror-1  ONLINE       0     0     0
            sde     ONLINE       0     0     0
            sdf     ONLINE       0     0     0
          mirror-2  ONLINE       0     0     0
            sdg     ONLINE       0     0     0
            sdc     ONLINE       0     0     0

errors: No known data errors


#emerge --info

Portage 2.2.7 (default/linux/amd64/13.0/desktop/kde, gcc-4.7.3, glibc-2.17, 3.11.5-gentoo x86_64)
=================================================================
System uname: Linux-3.11.5-gentoo-x86_64-Intel-R-_Core-TM-_i7-3930K_CPU_@_3.20GHz-with-gentoo-2.2
KiB Mem:    65957832 total,  41608788 free
KiB Swap:          0 total,         0 free
Timestamp of tree: Wed, 23 Oct 2013 23:30:01 +0000
ld GNU ld (GNU Binutils) 2.23.2
app-shells/bash:          4.2_p45
dev-java/java-config:     2.2.0
dev-lang/python:          2.7.5-r3, 3.2.5-r3, 3.3.2-r2
dev-util/cmake:           2.8.12
dev-util/pkgconfig:       0.28
sys-apps/baselayout:      2.2
sys-apps/openrc:          0.12.3
sys-apps/sandbox:         2.6-r1
sys-devel/autoconf:       2.13, 2.69
sys-devel/automake:       1.11.6, 1.12.6, 1.14
sys-devel/binutils:       2.23.2
sys-devel/gcc:            4.7.3-r1, 4.8.1-r1
sys-devel/gcc-config:     1.8
sys-devel/libtool:        2.4.2
sys-devel/make:           3.82-r4
sys-kernel/linux-headers: 3.11 (virtual/os-headers)
sys-libs/glibc:           2.17
Repositories: gentoo sunrise multimedia sabayon steam-overlay dotnet pipelight magnus_local
ACCEPT_KEYWORDS="amd64 ~amd64"
ACCEPT_LICENSE="* -@EULA"
CBUILD="x86_64-pc-linux-gnu"
CFLAGS="-march=native -O2 -pipe"
CHOST="x86_64-pc-linux-gnu"
CONFIG_PROTECT="/etc /usr/share/config /usr/share/gnupg/qualified.txt /usr/share/themes/oxygen-gtk/gtk-2.0"
CONFIG_PROTECT_MASK="/etc/ca-certificates.conf /etc/dconf /etc/env.d /etc/fonts/fonts.conf /etc/gconf /etc/gentoo-release /etc/revdep-rebuild /etc/sandbox.d /etc/terminfo /etc/texmf/language.dat.d /etc/texmf/language.def.d /etc/texmf/updmap.d /etc/texmf/web2c"
CXXFLAGS="-march=native -O2 -pipe"
DISTDIR="/usr/portage/distfiles"
FCFLAGS="-O2 -pipe"
FEATURES="assume-digests binpkg-logs config-protect-if-modified distlocks ebuild-locks fixlafiles merge-sync news parallel-fetch preserve-libs protect-owned sandbox sfperms strict unknown-features-warn unmerge-logs unmerge-orphans userfetch userpriv usersandbox usersync"
FFLAGS="-O2 -pipe"
GENTOO_MIRRORS="http://distfiles.gentoo.org"
LANG="en_US.UTF-8"
LDFLAGS="-Wl,-O1 -Wl,--as-needed"
MAKEOPTS="-j10"
PKGDIR="/usr/portage/packages"
PORTAGE_CONFIGROOT="/"
PORTAGE_RSYNC_OPTS="--recursive --links --safe-links --perms --times --omit-dir-times --compress --force --whole-file --delete --stats --human-readable --timeout=180 --exclude=/distfiles --exclude=/local --exclude=/packages"
PORTAGE_TMPDIR="/var/tmp"
PORTDIR="/usr/portage"
PORTDIR_OVERLAY="/var/lib/layman/sunrise /var/lib/layman/multimedia /var/lib/layman/sabayon /var/lib/layman/steam /var/lib/layman/dotnet /var/lib/layman/pipelight /home/malm/portage"
SYNC="rsync://rsync.gentoo.org/gentoo-portage"
USE="X a52 aac acl acpi alsa amd64 avahi avx avx256 bash-completion berkdb bluetooth bluray branding bzip2 cairo cdda cdr cli consolekit cracklib crypt css cups cxx dbus declarative dri dts dvd dvdr emboss encode exif fam firefox flac fortran gdbm gif gnome gpm gtk iconv icu inotify ipv6 jpeg kde kipi lcms ldap libnotify lm_sensors mad mmx mng modules mp3 mp4 mpeg mtp mudflap multilib ncurses networkmanager nls nptl ogg opengl openmp pam pango pcre pdf phonon plasma png policykit ppds qt3support qt4 rdp readline samba scanner sdl semantic-desktop session spell sse sse2 sse3 sse4 sse4_1 sse4_2 sse4a ssl ssse3 startup-notification svg tcpd tiff truetype udev udisks unicode upower usb vaapi vdpau vorbis wxwidgets x264 xcb xcomposite xinerama xml xscreensaver xv xvid xvmc zeroconf zlib zsh-completion" ABI_X86="64" ALSA_CARDS="ali5451 als4000 atiixp atiixp-modem bt87x ca0106 cmipci emu10k1x ens1370 ens1371 es1938 es1968 fm801 hda-intel intel8x0 intel8x0m maestro3 trident usb-audio via82xx via82xx-modem ymfpci" APACHE2_MODULES="authn_core authz_core socache_shmcb unixd actions alias auth_basic authn_alias authn_anon authn_dbm authn_default authn_file authz_dbm authz_default authz_groupfile authz_host authz_owner authz_user autoindex cache cgi cgid dav dav_fs dav_lock deflate dir disk_cache env expires ext_filter file_cache filter headers include info log_config logio mem_cache mime mime_magic negotiation rewrite setenvif speling status unique_id userdir usertrack vhost_alias" CALLIGRA_FEATURES="kexi words flow plan sheets stage tables krita karbon braindump author" CAMERAS="ptp2" COLLECTD_PLUGINS="df interface irq load memory rrdtool swap syslog" ELIBC="glibc" GPSD_PROTOCOLS="ashtech aivdm earthmate evermore fv18 garmin garmintxt gpsclock itrax mtk3301 nmea ntrip navcom oceanserver oldstyle oncore rtcm104v2 rtcm104v3 sirf superstar2 timing tsip tripmate tnt ubx" INPUT_DEVICES="keyboard mouse evdev" KERNEL="linux" LCD_DEVICES="bayrad cfontz cfontz633 glk hd44780 lb216 
lcdm001 mtxorb ncurses text" LIBREOFFICE_EXTENSIONS="presenter-console presenter-minimizer" LINGUAS="en_US en_GB en sv sv_SE ja ja_JP" OFFICE_IMPLEMENTATION="libreoffice" PHP_TARGETS="php5-5" PYTHON_SINGLE_TARGET="python2_7" PYTHON_TARGETS="python2_7 python3_2" RUBY_TARGETS="ruby19 ruby18" USERLAND="GNU" VIDEO_CARDS="radeon r600 modesetting fbdev vesa" XTABLES_ADDONS="quota2 psd pknock lscan length2 ipv4options ipset ipp2p iface geoip fuzzy condition tee tarpit sysrq steal rawnat logmark ipmark dhcpmac delude chaos account"
Unset:  CPPFLAGS, CTARGET, EMERGE_DEFAULT_OPTS, INSTALL_MASK, LC_ALL, PORTAGE_BUNZIP2_COMMAND, PORTAGE_COMPRESS, PORTAGE_COMPRESS_FLAGS, PORTAGE_RSYNC_EXTRA_OPTS, USE_PYTHO
Comment 1 Magnus Lidbom 2013-10-23 23:59:06 UTC
I should add that I saw no output in messages that seemed relevant to me.
Comment 2 Jeroen Roovers (RETIRED) gentoo-dev 2013-10-24 14:20:03 UTC
(In reply to Magnus Lidbom from comment #0)
> 4.Wait for ktorrent to hang
> 5.Reboot

How does that work? Why do you reboot? Did you sync disks first?
Comment 3 Magnus Lidbom 2013-10-24 15:09:02 UTC
(In reply to Jeroen Roovers from comment #2)
> (In reply to Magnus Lidbom from comment #0)
> > 4.Wait for ktorrent to hang
> > 5.Reboot
> 
> How does that work? Why do you reboot? Did you sync disks first?

I reboot because the computer is not really functional anymore. The ktorrent hang is the example I used because it is easily reproducible (for me at least), probably simply because it writes a large amount of data. But xbmc hangs as well, and so on.

I did nothing manually to sync the drives. I just rebooted as usual.


I can't swear to this (and I am not really eager to try it out again, though I will try to find the time if you need me to), but I believe that all the data appeared to be correct until after the reboot.

Wild speculation: 
It is possible that no changes are ever flushed to permanent storage when I encounter this problem. The computer in question has 64GB of RAM, so it could cache large amounts of writes without running out of RAM. Possibly that is exactly what happens, and I encounter the hangs when zfs refuses to cache more writes...
Comment 4 Magnus Lidbom 2013-10-24 15:54:56 UTC
I've completed a scrub now with no errors detected on the pool.
Comment 5 Magnus Lidbom 2013-10-24 16:41:02 UTC
We can scratch the specific versions of ZFS. Apparently having only seen the problem with those versions was a coincidence. It just occurred with 0.6.2.

I took some time to write down notes on what I observed this time (on another computer). Here they are:

first saw a hang in ktorrent.
opened chrome to check email. Chrome hung.
Konsole for debugging hung.
Tried to kill those processes when they were detected as unresponsive, but then plasma/kde hung.

ctrl+alt+f1 
iotop hangs console completely. ctrl+c does not work

ctrl+alt+f2  
#tail -f -n 1000 /var/log/messages 
keeps writing about new snapshots being created by zfs-auto-snapshot
no errors or anything that looks relevant in /var/log/messages that I can find

#dmesg 
shows nothing new/of relevance.

create created_during_zfs_hang.txt in home folder (in the zfs pool) with a single line of text.
Wait for next frequent snapshot to occur.

reboot:
Long wait at Stopping VMWare Authentication Daemon
Waited for more than 2 minutes and then hit the hardware reset button.

Booted without problem.
Checked the data on the 4 ktorrent downloads that had completed since I last checked: 100% of the chunks that were supposed to be downloaded came back as "Failed", leaving 0% downloaded.
The combined size of those downloads is about 4GB.
created_during_zfs_hang.txt is present with the correct text in it.
The snapshot that was created after the "hang" also contained created_during_zfs_hang.txt with the correct data.


I'm no longer even sure this is related to zfs.
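
For anyone who wants to repeat the marker-file test from these notes, here is a minimal sketch. The file name created_during_zfs_hang.txt is the one used above; the function names and the separate checksum file are assumptions added for illustration. The checksum file is meant to live on a different filesystem than the pool, so that a rollback of the pool cannot silently roll back the checksum too.

```shell
# Write a one-line marker file into <dir_on_pool> and record its SHA-256
# in <checksum_file> (ideally on another filesystem).
# Deliberately no explicit sync: the point of the test is to see whether
# ordinary asynchronous writes reach the disks on their own.
# Usage: write_marker <dir_on_pool> <checksum_file>
write_marker() {
    marker="$1/created_during_zfs_hang.txt"
    sumfile=$2
    printf 'marker written %s\n' "$(date -u +%Y-%m-%dT%H:%M:%SZ)" > "$marker"
    sha256sum "$marker" > "$sumfile"
}

# After the reboot: succeed only if the marker still has the recorded content.
# Usage: verify_marker <checksum_file>
verify_marker() {
    sha256sum --status -c "$1"
}
```

Pass an absolute path for the pool directory, since the checksum file stores the path exactly as given. Run write_marker during the hang and verify_marker after the reboot; a non-zero exit status means the marker is missing or its content changed.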
Comment 6 Magnus Lidbom 2013-10-24 16:57:48 UTC
Some more findings/clues:
I have ktorrent configured to "Check data when download is finished". So the downloaded 4GB were checked and deemed correct when the downloads completed.
Despite this not a single chunk passed check after reboot.

I compared the contents of one of the files with the same file in several old snapshots. All versions of that file were identical including the latest.
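
A comparison like this can be scripted. The sketch below relies on ZFS exposing each snapshot of a mounted dataset under <mountpoint>/.zfs/snapshot/<name>/; the function names are hypothetical, and the mountpoint and relative path are whatever applies on your system.

```shell
# Print one checksum line per version of a file: the live copy first, then
# every snapshot copy found under <mountpoint>/.zfs/snapshot/<name>/.
# Usage: file_versions <mountpoint> <path_relative_to_mountpoint>
file_versions() {
    mnt=$1
    rel=$2
    for f in "$mnt/$rel" "$mnt"/.zfs/snapshot/*/"$rel"; do
        # Skip snapshots that do not contain the file (or an unmatched glob).
        [ -f "$f" ] || continue
        sha256sum "$f"
    done
}

# Print how many distinct contents exist across all versions of the file.
# Usage: distinct_versions <mountpoint> <path_relative_to_mountpoint>
distinct_versions() {
    file_versions "$1" "$2" | awk '{print $1}' | sort -u | wc -l | tr -d ' '
}
```

A result of 1 from distinct_versions means the live file and all of its snapshot copies are byte-identical, which is what I saw here.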

I redownloaded that file and of course it now differed compared to all the old versions. 
I'll reboot now and see if the changes persist.
Comment 7 Magnus Lidbom 2013-10-24 17:05:04 UTC
The changes to that file persisted across a reboot.
I also know for a fact that that file was non-corrupt as far as xbmc was concerned before the last hang/reboot. It is a video that I had watched from start to end.

I'll redownload another of the files and use the hard reset button to restart this time and see if the changes persist.
Comment 8 Magnus Lidbom 2013-10-24 17:12:13 UTC
I hit the reset button the moment the download was verified. 
After reboot some 70% of the chunks were valid. Seems reasonable with non-sync flushing of writes and a fast download. It is now complete again and has been so while I'm typing this message. Will hit reset again after posting it. I assume everything should be flushed by now unless something is broken.
Comment 9 Magnus Lidbom 2013-10-24 17:17:41 UTC
Yes. The file was entirely consistent after this hard reset. So even with hard resets the behavior is sane in the standard case. The file is consistent within seconds of write. Not utterly broken the next day and in multiple snapshots like I've been seeing after the hangs. Off to do more digging.
Comment 10 Walter 2014-01-08 23:47:34 UTC
Random note: I use spl-0.6.2 + zfs0.6.2-r3 + kernel.org raw sources 3.11.2 daily without any major problems.

(However, I have had some really poor performance on certain operations, rm -Rf type stuff on big trees, where ext4 beats it by 30x on the same operation on a different partition on the same physical disk. Not clear on the reasoning there, it's good enough for my workload.)
Comment 11 Magnus Lidbom 2014-04-30 17:08:17 UTC
Just checking in to report that I've been running the latest ~x86 versions of the packages in this bug report for months now with no problems.

I'm almost completely convinced that the behavior I observed was a critical bug in the zfs code I was running at the time.

On the other hand, keeping an open bug around for an issue that apparently no one is currently encountering seems debatable at best...
Comment 12 Richard Yao (RETIRED) gentoo-dev 2014-06-24 16:40:14 UTC
I usually take care of non-packaging ZFS-related bugs at the upstream tracker, as that gets more eyeballs on problems. An unfortunate consequence is that this one slipped through the deluge of bug mail that I receive. I am usually available and responsive on IRC, which tends to make matters worse, as users cut to the front of the line by pinging me there.

That being said, I believe that I resolved this bug in late October of last year at upstream when I ported the following fix:

https://github.com/zfsonlinux/zfs/commit/a117a6d66e5cf1e9d4f173bccc786a169e9a8e04

If I recall correctly, this came into Gentoo in early November of last year, when I backported it to the 0.6.2-r3 ebuilds. As such, I am closing this as obsolete.