Bug 529639 - sys-kernel/gentoo-sources-3.16.5 - secondary server fails to mount OCFS2 with DRBD volume
Summary: sys-kernel/gentoo-sources-3.16.5 - secondary server fails to mount OCFS2 with...
Status: RESOLVED TEST-REQUEST
Alias: None
Product: Gentoo Linux
Classification: Unclassified
Component: [OLD] Core system
Hardware: AMD64 Linux
Importance: Normal normal
Assignee: Gentoo Kernel Bug Wranglers and Kernel Maintainers
URL:
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2014-11-17 19:40 UTC by Adam Randall
Modified: 2014-12-23 22:37 UTC

See Also:
Package list:
Runtime testing required: ---


Attachments
Kernel configuration for 3.10.25-gentoo (config.3.10.25.txt,83.42 KB, text/plain)
2014-11-17 19:42 UTC, Adam Randall
Kernel configuration for 3.12.21-gentoo-r1 (config.3.12.21-r1.txt,86.60 KB, text/plain)
2014-11-17 19:43 UTC, Adam Randall
Kernel configuration for 3.16.5-gentoo (config.3.16.5.txt,89.55 KB, text/plain)
2014-11-17 19:43 UTC, Adam Randall

Description Adam Randall 2014-11-17 19:40:19 UTC
I recently upgraded 13 of my servers (three virtualized) from gentoo-sources kernels 3.12.21-r1 and 3.10.25. Four of those servers run OCFS2 clusters (one per pair) with a DRBD backend.

When I upgraded to 3.16.5 the following things occurred:

1) Once the primary server had mounted the OCFS2 volume, the secondary server could not mount it. Each attempt produced a number of timeout messages in dmesg like the ones below:

[16449.796297] o2net: Connection to node node1 (num 0) at 10.10.254.12:7777 shutdown, state 8
[16449.796345] o2net: No longer connected to node node1 (num 0) at 10.10.254.12:7777
[16449.796376] (mount.ocfs2,6190,4):dlm_request_join:1470 ERROR: Error -112 when sending message 510 (key 0x666c6172) to node 0
[16449.796379] (mount.ocfs2,6190,4):dlm_try_to_join_domain:1649 ERROR: status = -112
[16449.796383] (mount.ocfs2,6190,4):dlm_join_domain:1951 ERROR: status = -112
[16449.796480] (mount.ocfs2,6190,4):dlm_register_domain:2209 ERROR: status = -112
[16449.796506] (mount.ocfs2,6190,4):o2cb_cluster_connect:368 ERROR: status = -112
[16449.796510] (mount.ocfs2,6190,4):ocfs2_dlm_init:3001 ERROR: status = -112
[16449.796533] (mount.ocfs2,6190,4):ocfs2_mount_volume:1860 ERROR: status = -112
[16449.796574] ocfs2: Unmounting device (147,1) on (node 0)
[16449.796582] (mount.ocfs2,6190,4):ocfs2_fill_super:1234 ERROR: status = -112

2) Trying to mount the OCFS2 volume multiple times on the secondary server caused the OCFS2 volume on the primary server to lock up, and both machines rebooted automatically, though not at the same time. Unfortunately, the kernel messages that were being dumped to the console were lost in the reboots.

3) DRBD synchronization has been very slow between servers, to the point where 20 MiB was taking hours to complete.

4) Directory listings took an extremely long time on the OCFS2 volumes. For example, on the server running 3.12.21-r1, a find operation on a directory containing in excess of 500k files took 1.5 minutes. On the same server running 3.16.5, it took several hours.
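
For reference, status -112 in the dmesg excerpt under item 1 corresponds to EHOSTDOWN ("Host is down") in the generic Linux errno table, which fits the o2net disconnect logged just before it. Below is a minimal Python sketch for pulling the failing call chain out of such lines. The regular expression is an assumption derived only from the lines quoted above (including the guess that the third field in parentheses is the CPU number), and the errno table is a hardcoded subset, not part of any OCFS2 tool.

```python
import re

# Subset of the generic Linux errno table (asm-generic/errno.h), hardcoded
# so the sketch does not depend on the platform it runs on.
LINUX_ERRNO = {
    110: "ETIMEDOUT",
    111: "ECONNREFUSED",
    112: "EHOSTDOWN",
    113: "EHOSTUNREACH",
}

# Assumed line shape, taken from the excerpt in item 1:
#   [ts] (process,pid,cpu):function:line ERROR: status = -112
#   [ts] (process,pid,cpu):function:line ERROR: Error -112 when sending ...
LINE_RE = re.compile(
    r"\((?P<proc>[^,]+),(?P<pid>\d+),(?P<cpu>\d+)\):"
    r"(?P<func>\w+):(?P<line>\d+) ERROR: "
    r"(?:status = |Error )(?P<status>-?\d+)"
)


def failing_calls(dmesg_text):
    """Return (function, errno name) pairs in the order they were logged."""
    calls = []
    for match in LINE_RE.finditer(dmesg_text):
        code = abs(int(match.group("status")))
        # Fall back to the bare number for codes missing from the subset.
        calls.append((match.group("func"), LINUX_ERRNO.get(code, str(code))))
    return calls


# Three lines copied from the excerpt in item 1.
sample = """\
[16449.796376] (mount.ocfs2,6190,4):dlm_request_join:1470 ERROR: Error -112 when sending message 510 (key 0x666c6172) to node 0
[16449.796379] (mount.ocfs2,6190,4):dlm_try_to_join_domain:1649 ERROR: status = -112
[16449.796582] (mount.ocfs2,6190,4):ocfs2_fill_super:1234 ERROR: status = -112
"""

for func, name in failing_calls(sample):
    print(f"{func}: {name}")
```

Read in log order, the pairs trace the error from the DLM join request up to ocfs2_fill_super, so the mount failure is a symptom of the failed join rather than a separate error.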

Moving back to 3.12.21-r1 on one pair of servers, and 3.10.25 on the other pair of servers solved all connection and speed issues.
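
To put numbers on the listing slowdown from item 4 when comparing kernels, a timing harness along these lines may help. It is only a sketch: it walks a tree with os.scandir and counts entries, which exercises a similar directory-reading pattern to find but is not the same program, and the function names are made up for illustration. The demo runs against a throwaway temporary directory; the real OCFS2 mount point would be passed to timed_count instead.

```python
import os
import tempfile
import time


def count_entries(root):
    """Walk root iteratively and count every file and directory below it."""
    total = 0
    stack = [root]
    while stack:
        current = stack.pop()
        with os.scandir(current) as entries:
            for entry in entries:
                total += 1
                # Symlinked directories are counted but not followed,
                # which avoids loops.
                if entry.is_dir(follow_symlinks=False):
                    stack.append(entry.path)
    return total


def timed_count(root):
    """Return (entry count, elapsed seconds) for one traversal of root."""
    start = time.perf_counter()
    total = count_entries(root)
    return total, time.perf_counter() - start


# Demo on a small temporary tree; replace `demo` with the mount point
# of the volume under test when measuring for real.
with tempfile.TemporaryDirectory() as demo:
    for i in range(3):
        open(os.path.join(demo, f"file{i}"), "w").close()
    os.mkdir(os.path.join(demo, "sub"))
    entries, seconds = timed_count(demo)
    print(f"{entries} entries in {seconds:.4f} s")
```

Running the same traversal on the same directory under each kernel, ideally after a fresh mount so caching is comparable, gives figures that can be quoted directly in the report.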

I will attach my kernel configurations for 3.10.25, 3.12.21-r1 and 3.16.5 (so you can all laugh at me probably). In the future, I will be testing out 3.14.21 to see if the problem exists there as well, or not.
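
Since the three configurations differ in many unrelated options, a filtered comparison is probably easier to read than a raw diff. The sketch below parses kernel .config text and reports only the symbols that changed, optionally restricted to a prefix. The function names are hypothetical, and the two inline samples are invented placeholders, not the contents of the attached files.

```python
def parse_kconfig(text):
    """Map CONFIG_* symbols to values; '# CONFIG_X is not set' maps to 'n'."""
    suffix = " is not set"
    options = {}
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("# CONFIG_") and line.endswith(suffix):
            # Drop the leading "# " and the trailing " is not set".
            options[line[2:-len(suffix)]] = "n"
        elif line.startswith("CONFIG_") and "=" in line:
            name, value = line.split("=", 1)
            options[name] = value
    return options


def diff_kconfig(old, new, prefixes=()):
    """Return {symbol: (old value, new value)} for symbols that differ."""
    changed = {}
    for name in sorted(set(old) | set(new)):
        if prefixes and not name.startswith(tuple(prefixes)):
            continue
        before = old.get(name, "absent")
        after = new.get(name, "absent")
        if before != after:
            changed[name] = (before, after)
    return changed


# Invented placeholder fragments, for illustration only.
old_cfg = """\
CONFIG_OCFS2_FS=m
# CONFIG_OCFS2_DEBUG_FS is not set
CONFIG_BLK_DEV_DRBD=m
"""
new_cfg = """\
CONFIG_OCFS2_FS=y
# CONFIG_OCFS2_DEBUG_FS is not set
CONFIG_BLK_DEV_DRBD=m
"""

changes = diff_kconfig(parse_kconfig(old_cfg), parse_kconfig(new_cfg),
                       prefixes=("CONFIG_OCFS2", "CONFIG_BLK_DEV_DRBD"))
for name, (before, after) in changes.items():
    print(f"{name}: {before} -> {after}")
```

With the attachments saved locally, the same two functions can be fed the text of config.3.12.21-r1.txt and config.3.16.5.txt to narrow down which OCFS2, DRBD or networking options changed between the working and failing kernels.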

As these are production machines, I'm not going to be able to take them down and do testing on them arbitrarily.

Additionally, enabling or disabling iptables did not affect the reliability of any of the above.

Reproducible: Always

Steps to Reproduce:
1) Configure gentoo-sources-3.16.5 to have OCFS2 enabled
2) Reboot into the new kernel
3) Create, or use an existing, OCFS2 cluster volume with DRBD backend synchronization
4) Attempt to bring both servers up in primary/primary mode, and mount the OCFS2 volumes on both
Actual Results:  
Chaos

Expected Results:  
Harmony

Portage 2.2.8-r2 (default/linux/amd64/13.0/no-multilib, gcc-4.8.3, glibc-2.19-r1, 3.12.21-gentoo-r1 x86_64)
=================================================================
System uname: Linux-3.12.21-gentoo-r1-x86_64-Intel-R-_Xeon-R-_CPU_X5650_@_2.67GHz-with-gentoo-2.2
KiB Mem:    32930584 total,  27261376 free
KiB Swap:    2097148 total,   2097148 free
Timestamp of tree: Mon, 17 Nov 2014 05:45:01 +0000
ld GNU ld (Gentoo 2.24 p1.4) 2.24
app-shells/bash:          4.2_p53
dev-lang/perl:            5.18.2-r2
dev-lang/python:          2.7.7, 3.3.5-r1, 3.4.1
dev-util/cmake:           2.8.12.2-r1
dev-util/pkgconfig:       0.28-r1
sys-apps/baselayout:      2.2
sys-apps/openrc:          0.12.4
sys-apps/sandbox:         2.6-r1
sys-devel/autoconf:       2.69
sys-devel/automake:       1.11.6, 1.13.4
sys-devel/binutils:       2.24-r3
sys-devel/gcc:            4.8.3
sys-devel/gcc-config:     1.7.3
sys-devel/libtool:        2.4.2-r1
sys-devel/make:           4.0-r1
sys-kernel/linux-headers: 3.16 (virtual/os-headers)
sys-libs/glibc:           2.19-r1
Repositories: gentoo SSIS
ACCEPT_KEYWORDS="amd64"
ACCEPT_LICENSE="* -@EULA"
CBUILD="x86_64-pc-linux-gnu"
CFLAGS="-march=native -O2 -pipe"
CHOST="x86_64-pc-linux-gnu"
CONFIG_PROTECT="/etc"
CONFIG_PROTECT_MASK="/etc/ca-certificates.conf /etc/env.d /etc/fonts/fonts.conf /etc/gconf /etc/gentoo-release /etc/php/apache2-php5.3/ext-active/ /etc/php/apache2-php5.5/ext-active/ /etc/php/cgi-php5.3/ext-active/ /etc/php/cgi-php5.5/ext-active/ /etc/php/cli-php5.3/ext-active/ /etc/php/cli-php5.5/ext-active/ /etc/revdep-rebuild /etc/sandbox.d /etc/terminfo"
CXXFLAGS="-march=native -O2 -pipe"
DISTDIR="/usr/portage/distfiles"
EMERGE_DEFAULT_OPTS="-q --with-bdeps y"
FCFLAGS="-O2 -pipe"
FEATURES="assume-digests binpkg-logs config-protect-if-modified distlocks ebuild-locks fixlafiles merge-sync news parallel-fetch preserve-libs protect-owned sandbox sfperms strict unknown-features-warn unmerge-logs unmerge-orphans userfetch userpriv usersandbox usersync"
FFLAGS="-O2 -pipe"
GENTOO_MIRRORS="http://gentoo.osuosl.org/ http://gentoo.cs.uni.edu/ http://mirror.usu.edu/mirrors/gentoo/"
LANG="en_US.utf8"
LDFLAGS="-Wl,-O1 -Wl,--as-needed"
MAKEOPTS="-j25"
PKGDIR="/usr/portage/packages"
PORTAGE_CONFIGROOT="/"
PORTAGE_RSYNC_OPTS="--recursive --links --safe-links --perms --times --omit-dir-times --compress --force --whole-file --delete --stats --human-readable --timeout=180 --exclude=/distfiles --exclude=/local --exclude=/packages"
PORTAGE_TMPDIR="/var/tmp"
PORTDIR="/usr/portage"
PORTDIR_OVERLAY="/usr/local/portage"
SYNC="rsync://192.168.0.156/gentoo-portage"
USE="acl amd64 apache2 bash-completion berkdb bzip2 cli corefonts cracklib crypt ctype curl cxx djvu dri filter fontconfig fortran fpx ftp gcj gd gdbm gif gnutls graphviz gs hash hdri iconv imagemagick ipv6 jbig jpeg jpeg2k lcms ldap-sasl logrotate lzma mmx modules ncurses nls nptl nptlonly openexr openmp openssl pam pcntl pcre pdf pdo png posix python readline samba sasl session sharedmem simplexml smtp snmp soap sockets sse sse2 ssh ssl svg syslog tcpd tiff truetype unicode vim-syntax webp wmf xml zip zlib" ABI_X86="64" ALSA_CARDS="ali5451 als4000 atiixp atiixp-modem bt87x ca0106 cmipci emu10k1x ens1370 ens1371 es1938 es1968 fm801 hda-intel intel8x0 intel8x0m maestro3 trident usb-audio via82xx via82xx-modem ymfpci" APACHE2_MODULES="actions alias auth_basic auth_digest authn_anon authn_dbd authn_dbm authn_default authn_file authz_dbm authz_default authz_groupfile authz_host authz_owner authz_user autoindex cache dav dav_fs dav_lock dbd deflate dir disk_cache env expires ext_filter file_cache filter headers ident imagemap include info log_config logio mem_cache mime mime_magic negotiation proxy proxy_ajp proxy_balancer proxy_connect proxy_http rewrite setenvif so speling status unique_id userdir usertrack vhost_alias" CALLIGRA_FEATURES="kexi words flow plan sheets stage tables krita karbon braindump author" CAMERAS="ptp2" COLLECTD_PLUGINS="df interface irq load memory rrdtool swap syslog" ELIBC="glibc" GPSD_PROTOCOLS="ashtech aivdm earthmate evermore fv18 garmin garmintxt gpsclock itrax mtk3301 nmea ntrip navcom oceanserver oldstyle oncore rtcm104v2 rtcm104v3 sirf superstar2 timing tsip tripmate tnt ublox ubx" INPUT_DEVICES="keyboard mouse evdev" KERNEL="linux" LCD_DEVICES="bayrad cfontz cfontz633 glk hd44780 lb216 lcdm001 mtxorb ncurses text" LIBREOFFICE_EXTENSIONS="presenter-console presenter-minimizer" OFFICE_IMPLEMENTATION="libreoffice" PHP_TARGETS="php5-3" PYTHON_SINGLE_TARGET="python2_7" PYTHON_TARGETS="python2_7 python3_3" RUBY_TARGETS="ruby19 ruby20" 
USERLAND="GNU" VIDEO_CARDS="fbdev glint intel mach64 mga nouveau nv r128 radeon savage sis tdfx trident vesa via vmware dummy v4l" XTABLES_ADDONS="quota2 psd pknock lscan length2 ipv4options ipset ipp2p iface geoip fuzzy condition tee tarpit sysrq steal rawnat logmark ipmark dhcpmac delude chaos account"
Unset:  CPPFLAGS, CTARGET, INSTALL_MASK, LC_ALL, PORTAGE_BUNZIP2_COMMAND, PORTAGE_COMPRESS, PORTAGE_COMPRESS_FLAGS, PORTAGE_RSYNC_EXTRA_OPTS, USE_PYTHON
Comment 1 Adam Randall 2014-11-17 19:42:45 UTC
Created attachment 389589
Kernel configuration for 3.10.25-gentoo
Comment 2 Adam Randall 2014-11-17 19:43:07 UTC
Created attachment 389591
Kernel configuration for 3.12.21-gentoo-r1
Comment 3 Adam Randall 2014-11-17 19:43:23 UTC
Created attachment 389593
Kernel configuration for 3.16.5-gentoo
Comment 4 Adam Randall 2014-11-17 19:45:58 UTC
If it matters, here is my DRBD configuration:

resource r0 {
        disk {
                al-extents 3389;
                disk-barrier no;
                disk-flushes no;
        }

        startup {
                wfc-timeout  15;
                degr-wfc-timeout 60;
                become-primary-on both;
        }

        net {
                allow-two-primaries;
                after-sb-0pri discard-zero-changes;
                after-sb-1pri discard-secondary;
                after-sb-2pri disconnect;
                max-buffers 8000;
                max-epoch-size 8000;
                sndbuf-size 512k;
        }

        on node1 {
                device    /dev/drbd1;
                disk      /dev/sda1;
                address   10.10.254.10:7789;
                meta-disk internal;
        }

        on node2 {
                device    /dev/drbd1;
                disk      /dev/sda1;
                address   10.10.254.11:7789;
                meta-disk internal;
        }
}
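
A quick sanity check of the dual-primary settings in a resource file like the one above could be sketched as follows. This is a rough regex-based reader, not a real DRBD configuration parser: it assumes the flat layout shown here, ignores comments and includes, and treats a bare "allow-two-primaries;" as enabled. The function name is hypothetical.

```python
import re


def drbd_summary(conf_text):
    """Extract per-host addresses and the dual-primary flags from conf_text."""
    hosts = {}
    # "on <host> { ... }" sections; assumes no nested braces inside them.
    for name, body in re.findall(r"^\s*on\s+(\S+)\s*\{([^{}]*)\}",
                                 conf_text, flags=re.MULTILINE):
        addr = re.search(r"\baddress\s+([^;\s]+)\s*;", body)
        hosts[name] = addr.group(1) if addr else None

    # Accept both "allow-two-primaries;" and "allow-two-primaries yes;".
    two = re.search(r"\ballow-two-primaries(?:\s+(\w+))?\s*;", conf_text)
    both = re.search(r"\bbecome-primary-on\s+both\s*;", conf_text)
    return {
        "hosts": hosts,
        "allow_two_primaries": two is not None and two.group(1) != "no",
        "become_primary_on_both": both is not None,
    }


# Abbreviated copy of the resource definition above.
sample = """
resource r0 {
        startup {
                become-primary-on both;
        }
        net {
                allow-two-primaries;
        }
        on node1 {
                device    /dev/drbd1;
                address   10.10.254.10:7789;
        }
        on node2 {
                device    /dev/drbd1;
                address   10.10.254.11:7789;
        }
}
"""

summary = drbd_summary(sample)
print(summary["hosts"])
print(summary["allow_two_primaries"], summary["become_primary_on_both"])
```

Both flags need to be true on each node for the primary/primary setup in the reproduction steps, so running this against the file on each server is a cheap way to rule out configuration drift before blaming the kernel.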
Comment 5 Adam Randall 2014-11-17 19:47:40 UTC
And here is the OCFS2 configuration:

cluster:
        heartbeat_mode = global
        node_count = 2
        name = c1

node:
        number = 1
        cluster = c1
        ip_port = 7777
        ip_address = 10.10.254.10
        name = node1

node:
        number = 2
        cluster = c1
        ip_port = 7777
        ip_address = 10.10.254.11
        name = node2

heartbeat:
        cluster = c1
        region = 14DF63D68F504B188E4370E0C31523C3
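
The stanza layout above is simple enough to check mechanically. The sketch below reads it into a list of stanzas and verifies that node_count matches the number of node stanzas for the cluster and that no address is listed twice. It assumes only the "name:" header plus indented "key = value" layout shown here; the function names are hypothetical and this is not how the o2cb tools parse the file.

```python
def parse_o2cb(text):
    """Parse stanzas into a list of (stanza type, {key: value}) tuples."""
    stanzas = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        if line.endswith(":") and "=" not in line:
            # Stanza header such as "cluster:" or "node:".
            stanzas.append((line[:-1], {}))
        elif "=" in line and stanzas:
            key, value = (part.strip() for part in line.split("=", 1))
            stanzas[-1][1][key] = value
    return stanzas


def check_cluster(stanzas):
    """Return a list of human-readable problems (empty if none were found)."""
    problems = []
    nodes = [fields for kind, fields in stanzas if kind == "node"]
    for kind, fields in stanzas:
        if kind != "cluster":
            continue
        declared = int(fields.get("node_count", "0"))
        members = [n for n in nodes if n.get("cluster") == fields.get("name")]
        if declared != len(members):
            problems.append(
                f"{fields.get('name')}: node_count = {declared}, "
                f"but {len(members)} node stanzas found"
            )
    addresses = [n.get("ip_address") for n in nodes]
    if len(set(addresses)) != len(addresses):
        problems.append("duplicate ip_address among node stanzas")
    return problems


# Copy of the configuration above.
sample = """
cluster:
        heartbeat_mode = global
        node_count = 2
        name = c1

node:
        number = 1
        cluster = c1
        ip_port = 7777
        ip_address = 10.10.254.10
        name = node1

node:
        number = 2
        cluster = c1
        ip_port = 7777
        ip_address = 10.10.254.11
        name = node2

heartbeat:
        cluster = c1
        region = 14DF63D68F504B188E4370E0C31523C3
"""

stanzas = parse_o2cb(sample)
print(check_cluster(stanzas) or "no problems found")
```

The file has to be identical on every node of a cluster, so comparing the parsed stanzas from both servers is another inexpensive check.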
Comment 6 Mike Pagano gentoo-dev 2014-12-23 16:45:38 UTC
You have a couple of choices here.

You can do a git bisect from the last working kernel to the first non-working one.

Or you can upgrade to the latest kernel (3.18.1 as of this writing) and see if this issue has been addressed.

I do see some things out there that say this was fixed in 3.16.7 but I cannot locate a commit that fixed it.
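
For anyone unfamiliar with bisecting, the idea is a binary search between a known-good and a known-bad point, so the number of kernels to build and boot grows only with the logarithm of the number of candidates. The Python sketch below illustrates that halving on a plain ordered list. It is not how git implements bisect (git works on the commit graph, not a list), it assumes every candidate after the first bad one is also bad, and the candidate numbers and the culprit are arbitrary placeholders.

```python
def first_bad(candidates, is_bad):
    """Find the first bad candidate by halving the search range.

    candidates is ordered oldest to newest; the first element is assumed
    to be known good and the last one known bad.
    Returns (first bad candidate, number of tests performed).
    """
    low, high = 0, len(candidates) - 1  # low: known good, high: known bad
    steps = 0
    while high - low > 1:
        mid = (low + high) // 2
        steps += 1
        if is_bad(candidates[mid]):
            high = mid  # the regression is at mid or earlier
        else:
            low = mid   # the regression is after mid
    return candidates[high], steps


# Placeholder data: 16 numbered candidates with an arbitrary culprit at 11.
# In practice is_bad would mean "build, boot, and try the OCFS2 mount".
candidates = list(range(16))
culprit, tests = first_bad(candidates, lambda c: c >= 11)
print(f"first bad candidate: {culprit}, found in {tests} tests")
```

Each test here stands for a full build, reboot and mount attempt on a two-node cluster, which is why a spare pair of machines or virtual machines is the practical way to run a bisect like this.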
Comment 7 Adam Randall 2014-12-23 17:12:39 UTC
To be honest, I never thought I'd hear back on this report since it's so niche. Still, thank you very much for the feedback.

It will be somewhat time-consuming for me to bring a pair of my servers up to 3.18.1, so confirmation that it's fixed will be a while.

Happy holidays!
Comment 8 Mike Pagano gentoo-dev 2014-12-23 22:37:24 UTC
OK, I'll close this as test-request for now. If you do get the time to test, please let me know the results.