|Summary:||sys-cluster/glusterfs-3.12.3[libtirpc]: glusterd crashes with segmentation fault|
|Product:||Gentoo Linux||Reporter:||Erik Zscheile <erik.zscheile.ytrizja>|
|Component:||Current packages||Assignee:||Andreas K. Hüttel <dilfridge>|
|Severity:||normal||CC:||benjamin.beier, bugs, bugs, chewi, cluster, manwe|
|Package list:||Runtime testing required:||---|
|Bug Depends on:||653796|
Description Erik Zscheile 2017-12-04 20:19:23 UTC
reference: https://bugs.gentoo.org/635172#c16

When sys-cluster/glusterfs-3.12.3 is compiled with libtirpc, glusterd (and glusterfsd) crash with a segmentation fault, possibly because libtirpc is incompatible with glusterfs.

Reproducible: Always

Steps to Reproduce:
1. USE="libtirpc" emerge -v1 '=sys-cluster/glusterfs-3.12.3'
2. /etc/init.d/glusterd restart

Actual Results:
glusterd crashes with a segmentation fault inside XDR procedures

Expected Results:
glusterd spawns and keeps running

The libtirpc USE flag is there because glibc-2.26 isn't bundled with RPC stuff anymore in Gentoo.
glusterfs-3.12.2 with libtirpc doesn't work.
glusterfs-3.12.2 without libtirpc works.
Comment 1 James Le Cuirot 2017-12-04 20:34:04 UTC
dilfridge, hope you don't mind me assigning this one to you.
Comment 2 Erik Zscheile 2017-12-05 15:26:17 UTC
I received an answer from Thorsten Kukuk (maintainer of libtirpc).
----
Hi,

your build environment is messed up. You compile against the header files of glibc, but link against libtirpc. This will never work. If you link against libtirpc, you have to compile against libtirpc header files, too.

Thorsten

On Fri, Dec 01, Erik Kai Alain Zscheile wrote:
> Hello Steve Dickson and Thorsten Kukuk, upstream maintainers of libtirpc,
>
> I've noticed a failure in GlusterFS on Gentoo Linux when it's compiled against libtirpc. This leads to a segmentation fault, but I can't figure out where exactly the bug comes from. The bug appears to be in the XDR part of libtirpc and probably produces a null pointer dereference inside (or in something called inside) xdr_u_int64_t(xdrs, ullp) while xdrs and ullp aren't null.
>
> references:
> https://bugzilla.redhat.com/show_bug.cgi?id=1519315
> https://bugs.gentoo.org/635172
>
> Erik Zscheile

--
Thorsten Kukuk firstname.lastname@example.org
http://www.linux-nis.org|http://osm.thkukuk.de|http://www.thkukuk.de
--------------------------------------------------------------------
Key fingerprint = A368 676B 5E1B 3E46 CFCE 2D97 F8FD 4E23 56C6 FB4B
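The rule stated in the reply (libtirpc headers and the libtirpc library must be used together) can be checked mechanically on a set of build flags. What follows is a minimal sketch, not something taken from the ebuild or from upstream: `check_rpc_flags` is a hypothetical helper, and it assumes the libtirpc headers are found through an include path ending in `/tirpc` and the library through `-ltirpc`. It only inspects flags, so it cannot detect redefinitions or symbol-resolution problems.

```sh
# Hypothetical helper (not part of glusterfs, libtirpc or Portage).
# $1 = compiler flags, $2 = linker flags
# Prints "consistent" when both or neither refer to libtirpc.
check_rpc_flags() {
    has_hdr=no
    has_lib=no
    # Assumption: the libtirpc headers are selected by an -I path ending in /tirpc.
    for flag in $1; do
        case $flag in
            -I*/tirpc) has_hdr=yes ;;
        esac
    done
    # Assumption: the libtirpc library is selected by -ltirpc.
    for flag in $2; do
        case $flag in
            -ltirpc) has_lib=yes ;;
        esac
    done
    if [ "$has_hdr" = "$has_lib" ]; then
        echo "consistent"
    else
        echo "mismatch: headers=$has_hdr library=$has_lib"
        return 1
    fi
}
```

One way to use it, assuming libtirpc's pkg-config file is installed, would be `check_rpc_flags "$(pkg-config --cflags libtirpc)" "$(pkg-config --libs libtirpc)"`, or to feed it the flags copied from a build log.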
Comment 3 James Le Cuirot 2017-12-05 15:36:39 UTC
That might be my fault then as I wrote the patch, though it's also concerning as they've just merged it upstream. Please post your config.log and build log then.
Comment 4 Erik Zscheile 2017-12-05 15:42:31 UTC
backlink: https://bugzilla.redhat.com/show_bug.cgi?id=1521004
Comment 5 James Le Cuirot 2017-12-05 15:48:21 UTC
(In reply to Erik Zscheile from comment #4)
> backlink: https://bugzilla.redhat.com/show_bug.cgi?id=1521004

Err, please hold fire there, the patch might be fine. I want to see what's going on in your logs first.
Comment 6 Erik Zscheile 2017-12-05 16:10:46 UTC
workdir + logfiles: http://ezscheile.bplaced.net/glusterd-ptd-pack.tar.xz
Comment 7 James Le Cuirot 2017-12-05 16:33:11 UTC
I don't know why they think you're building against glibc's headers as I can clearly see -I/usr/include/tirpc in config.log and build.log. There are some redefinition warnings though so I'll take a closer look later. It's worth adding that even though I contributed the patch to use libtirpc upstream, it was already being used in master when some IPv6-related configure flag was enabled. I need to check whether they made other changes for this.
Comment 8 Erik Zscheile 2017-12-05 19:33:08 UTC
I think the redefinitions are there because (in libglusterfs) compat.h is included before the RPC and XDR headers.
Comment 9 James Le Cuirot 2017-12-05 23:41:21 UTC
Created attachment 508446
glusterfs-3.12.3-libtirpc.patch

I wasn't able to reproduce your issue (does it segfault immediately?) but I have found the cause of the redefinitions and updated the patch accordingly. Please replace the existing patch, rebuild, and report back.
Comment 10 Erik Zscheile 2017-12-06 19:26:55 UTC
It doesn't segfault immediately, but if you execute:

strace -f /usr/sbin/glusterd

it throws a segmentation fault after a while. Note that my configuration has some gluster peers configured that are currently disconnected. I applied the new patch and the error is still there.
Comment 11 James Le Cuirot 2017-12-06 22:25:41 UTC
Okay, I was able to reproduce the issue after starting the other peer. Luckily I still had it installed even though I'm not using it for the foreseeable future.
Comment 12 James Le Cuirot 2017-12-07 22:13:04 UTC
I'm really struggling here and I'm concerned because my upstream patches have already been merged. I wonder if this is not a problem in the latest master but running that without actually installing it is proving tricky.
Comment 13 James Le Cuirot 2017-12-08 23:07:54 UTC
I made a 9999 ebuild so that I could install from master. I was able to reproduce the problem there but when I pass --with-ipv6-default, the problem goes away, which is obviously very interesting! My recent patch should have taken account of this but evidently I'm missing something.
Comment 14 James Le Cuirot 2017-12-09 13:34:04 UTC
Created attachment 509026
glusterfs-3.12.3-libtirpc.patch

Here's a new patch that seems to work. Trouble is I don't really understand why. If it gets past the initial crash for you then please test it more thoroughly.
Comment 15 Larry the Git Cow 2017-12-09 18:50:56 UTC
The bug has been referenced in the following commit(s):

https://gitweb.gentoo.org/repo/gentoo.git/commit/?id=22c2b5002a396b340e12df805ca051ac5743d70d

commit 22c2b5002a396b340e12df805ca051ac5743d70d
Author:     James Le Cuirot <email@example.com>
AuthorDate: 2017-12-09 18:50:20 +0000
Commit:     James Le Cuirot <firstname.lastname@example.org>
CommitDate: 2017-12-09 18:50:48 +0000

    sys-cluster/glusterfs: Bump to 3.13.0 and add 9999 live ebuild

    I have forced --with-ipv6-default when the libtirpc flag is enabled to
    avoid bug #639838. This bug affects 3.12.3 too but the aforementioned
    flag is not available in that version. I'll try to get this issue
    resolved properly and then push for stabilisation.

    Bug: https://bugs.gentoo.org/639838
    Package-Manager: Portage-2.3.17, Repoman-2.3.6

 sys-cluster/glusterfs/Manifest                |   1 +
 sys-cluster/glusterfs/glusterfs-3.13.0.ebuild | 224 ++++++++++++++++++++++++++
 sys-cluster/glusterfs/glusterfs-9999.ebuild   | 224 ++++++++++++++++++++++++++
 3 files changed, 449 insertions(+)
Comment 16 Erik Zscheile 2017-12-09 20:16:19 UTC
ok, tried the new patch, segfault happens again, but with a different backtrace:

#0  __GI_xdr_uint64_t (xdrs=0x7fe3880ffc10, uip=0x7fe3880ffcb0) at xdr_intXX_t.c:71
#1  0x00007fe39111ba71 in xdr_gf_dump_rsp (xdrs=0x7fe3880ffc10, objp=0x7fe3880ffcb0) at rpc-common-xdr.c:203
#2  0x00007fe391344aa3 in xdr_sizeof () from /lib64/libtirpc.so.3
#3  0x00007fe39156270b in rpcsvc_dump (req=0x7fe37c002cb0) at rpcsvc.c:2088
#4  0x00007fe391562f72 in rpcsvc_handle_rpc_call (svc=0x1fc6100, trans=trans@entry=0x7fe37c0019f0, msg=msg@entry=0x7fe37c002b60) at rpcsvc.c:711
#5  0x00007fe3915631d6 in rpcsvc_notify (trans=0x7fe37c0019f0, mydata=<optimized out>, event=<optimized out>, data=0x7fe37c002b60) at rpcsvc.c:805
#6  0x00007fe391565143 in rpc_transport_notify (this=this@entry=0x7fe37c0019f0, event=event@entry=RPC_TRANSPORT_MSG_RECEIVED, data=<optimized out>) at rpc-transport.c:538
#7  0x00007fe38858e382 in socket_event_poll_in (this=this@entry=0x7fe37c0019f0, notify_handled=_gf_true) at socket.c:2315
#8  0x00007fe38858e557 in socket_event_handler (fd=fd@entry=6, idx=idx@entry=2, gen=gen@entry=1, data=data@entry=0x7fe37c0019f0, poll_in=1, poll_out=0, poll_err=0) at socket.c:2467
#9  0x00007fe3917f73da in event_dispatch_epoll_handler (event=0x7fe388101e7c, event_pool=0x1fb7770) at event-epoll.c:583
#10 event_dispatch_epoll_worker (data=0x2037300) at event-epoll.c:659
#11 0x00007fe390cec839 in start_thread (arg=0x7fe388102700) at pthread_create.c:456
#12 0x00007fe390a2aadf in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:97
Comment 17 James Le Cuirot 2017-12-09 20:24:14 UTC
As I suspected, it may be necessary to enable more of the --with-ipv6-default code. I have pushed 3.13.0 to the tree, which happens to include that flag, so I have enabled it with the libtirpc USE flag. Please try this version and report back.
Comment 18 Erik Zscheile 2017-12-10 18:12:40 UTC
glusterfs version 3.13.0 from the tree works.
Comment 19 James Le Cuirot 2018-01-25 22:54:16 UTC
It looks like this will get fixed properly in the next release.
Comment 20 manwe 2018-02-28 21:50:02 UTC
I have to put my 5 cents against +libtirpc USE flag that enables --with-ipv6-default. It should be named ipv6 or (ipv6-default that depends on ipv6).

My system has ipv4 only (no support in kernel for v6, USE="-ipv6" in make.conf). Today I've upgraded gluster from 3.12.x to 3.13.0 and it stopped working with:

[2018-02-28 21:29:05.628887] E [name.c:267:af_inet_client_get_remote_sockaddr] 0-management: DNS resolution failed on host [...host1...]
[2018-02-28 21:29:05.629014] E [name.c:267:af_inet_client_get_remote_sockaddr] 0-management: DNS resolution failed on host [...host2...]
[2018-02-28 21:29:08.629373] E [name.c:267:af_inet_client_get_remote_sockaddr] 0-management: DNS resolution failed on host [...host1...]
[2018-02-28 21:29:08.629516] E [name.c:267:af_inet_client_get_remote_sockaddr] 0-management: DNS resolution failed on host [...host2...]
[2018-02-28 21:29:11.629874] E [name.c:267:af_inet_client_get_remote_sockaddr] 0-management: DNS resolution failed on host [...host1...]
[2018-02-28 21:29:11.630020] E [name.c:267:af_inet_client_get_remote_sockaddr] 0-management: DNS resolution failed on host [...host2...]
[2018-02-28 21:29:14.630366] E [name.c:267:af_inet_client_get_remote_sockaddr] 0-management: DNS resolution failed on host [...host1...]
[2018-02-28 21:29:14.630514] E [name.c:267:af_inet_client_get_remote_sockaddr] 0-management: DNS resolution failed on host [...host2...]

and so on... glusterd didn't start glusterfs and glusterfsd processes. After emerge without libtirpc everything is fine.
Comment 21 James Le Cuirot 2018-02-28 22:01:56 UTC
(In reply to manwe from comment #20)
> I have to put my 5 cents against +libtirpc USE flag that enables
> --with-ipv6-default. It should be named ipv6 or (ipv6-default that depends
> on ipv6).
>
> My system has ipv4 only (no support in kernel for v6, USE="-ipv6" in
> make.conf). Today I've upgraded gluster from 3.12.x to 3.13.0 and it stopped
> working with:
>
> and so on... glusterd didn't start glusterfs and glusterfsd processes. After
> emerge without libtirpc everything is fine.

Fair enough but at the same time, that would crash horribly with systems that do enable IPv6. I could try and patch things further but the real fix is just around the corner so let's just sit tight for now. They release often so I expect the next one soon.
Comment 22 manwe 2018-03-01 09:33:45 UTC
OK, so let's keep it turned on by default, but change the name from libtirpc, which says nothing, to something like ipv6-default. This way people like me, without ipv6, can spot it during -uDN @world and disable it.
Comment 23 James Le Cuirot 2018-03-08 23:07:56 UTC
4.0.0 has just been released and that should deal with all the libtirpc issues properly so I've pushed it up. I'll close this when it goes stable.
Comment 24 James Le Cuirot 2018-03-08 23:29:04 UTC
I suddenly remembered I needed to add an ipv6 USE flag to control --with-ipv6-default but when I tried it, I found that configure.ac has mixed the clauses up:

AC_ARG_WITH([ipv6-default],
            AC_HELP_STRING([--with-ipv6-default], [Set IPv6 as default.]),
            [with_ipv6_default=$with_libtirpc],
            [with_ipv6_default=no])

This should probably be:

AC_ARG_WITH([ipv6-default],
            AC_HELP_STRING([--with-ipv6-default], [Set IPv6 as default.]),
            ,
            [with_ipv6_default=$with_libtirpc])

I'll ask upstream tomorrow.
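To see why the clause order matters: AC_ARG_WITH runs its third argument when the option is given on the command line and its fourth when it is not. The sketch below emulates that branch in plain shell for both variants quoted above. It is an illustration of the semantics, not the generated configure script; the function names `buggy` and `fixed` and their calling convention are made up for this example.

```sh
# $1 = value of with_libtirpc ("yes" or "no")
# $2 = "yes" if --with-ipv6-default was passed, "no" if --without-ipv6-default
#      was passed, empty if the option was not given at all.

# Variant currently in configure.ac.
buggy() {
    with_libtirpc=$1
    unset with_ipv6_default
    if [ -n "${2:-}" ]; then with_ipv6_default=$2; fi
    if [ "${with_ipv6_default+set}" = set ]; then
        # action-if-given: overwrites whatever the user asked for
        with_ipv6_default=$with_libtirpc
    else
        # action-if-not-given
        with_ipv6_default=no
    fi
    echo "$with_ipv6_default"
}

# Variant proposed in the comment above.
fixed() {
    with_libtirpc=$1
    unset with_ipv6_default
    if [ -n "${2:-}" ]; then with_ipv6_default=$2; fi
    if [ "${with_ipv6_default+set}" = set ]; then
        # action-if-given is empty: keep the user's value
        :
    else
        # action-if-not-given: default to the libtirpc setting
        with_ipv6_default=$with_libtirpc
    fi
    echo "$with_ipv6_default"
}
```

With libtirpc enabled, `buggy yes no` still prints `yes`, so passing --without-ipv6-default has no effect, while `fixed yes no` prints `no`. When the option is omitted, `buggy yes ""` prints `no` and `fixed yes ""` prints `yes`.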
Comment 25 Larry the Git Cow 2018-03-09 23:34:58 UTC
The bug has been referenced in the following commit(s):

https://gitweb.gentoo.org/repo/gentoo.git/commit/?id=c4217a516c1e8dc5b04400198f9b0e20a37d0bd0

commit c4217a516c1e8dc5b04400198f9b0e20a37d0bd0
Author:     James Le Cuirot <email@example.com>
AuthorDate: 2018-03-09 23:33:30 +0000
Commit:     James Le Cuirot <firstname.lastname@example.org>
CommitDate: 2018-03-09 23:34:49 +0000

    sys-cluster/glusterfs: Add ipv6 USE flag to control ipv6-default

    This is important because ipv6-default breaks Gluster for systems that
    have IPv6 disabled. A couple of patches were required because
    --without-ipv6-default is broken and the configure summary is
    sometimes misleading.

    Bug: https://bugs.gentoo.org/639838
    Package-Manager: Portage-2.3.24, Repoman-2.3.6

 .../files/glusterfs-TIRPC-config-summary.patch     | 48 ++++++++++++++++++++++
 .../files/glusterfs-without-ipv6-default.patch     | 38 +++++++++++++++++
 ...erfs-4.0.0.ebuild => glusterfs-4.0.0-r1.ebuild} |  9 ++--
 sys-cluster/glusterfs/glusterfs-9999.ebuild        |  9 ++--
 sys-cluster/glusterfs/metadata.xml                 |  1 +
 5 files changed, 99 insertions(+), 6 deletions(-)
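For readers unfamiliar with how a USE flag ends up as a configure option, here is a rough sketch of the mapping this commit describes. It is not copied from the ebuild: in a real ebuild `use` and `use_with` are supplied by the package manager, so the versions below are stand-ins that only mimic the two-argument form of `use_with`, and `ACTIVE_USE` and `configure_args` are hypothetical names used to make the example run outside Portage.

```sh
# Hypothetical stand-in for the set of enabled USE flags.
ACTIVE_USE="libtirpc"

# Stand-in for Portage's `use`: succeeds if the flag is enabled.
use() {
    case " $ACTIVE_USE " in
        *" $1 "*) return 0 ;;
        *) return 1 ;;
    esac
}

# Stand-in for Portage's `use_with <flag> [<name>]`:
# prints --with-<name> if the flag is enabled, --without-<name> otherwise.
use_with() {
    if use "$1"; then
        echo "--with-${2:-$1}"
    else
        echo "--without-${2:-$1}"
    fi
}

# The ipv6 flag, not libtirpc, now decides the ipv6-default option.
configure_args() {
    use_with ipv6 ipv6-default
}
```

With `ACTIVE_USE="libtirpc"` this yields `--without-ipv6-default`, which is the case the IPv4-only report in comment 20 needs; adding `ipv6` to the list yields `--with-ipv6-default`. As the commit message notes, --without-ipv6-default only takes effect once the configure script itself is patched.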
Comment 26 Larry the Git Cow 2018-05-26 11:07:38 UTC
The bug has been closed via the following commit(s):

https://gitweb.gentoo.org/repo/gentoo.git/commit/?id=726354281e331b53c25f336cf5b8b6a4b8e9beba

commit 726354281e331b53c25f336cf5b8b6a4b8e9beba
Author:     James Le Cuirot <email@example.com>
AuthorDate: 2018-05-26 11:06:51 +0000
Commit:     James Le Cuirot <firstname.lastname@example.org>
CommitDate: 2018-05-26 11:07:30 +0000

    sys-cluster/glusterfs: Drop old 3.12.3

    Closes: https://bugs.gentoo.org/639838
    Package-Manager: Portage-2.3.40, Repoman-2.3.9

 sys-cluster/glusterfs/Manifest                     |   1 -
 .../files/glusterfs-3.12.3-libtirpc.patch          |  45 -----
 sys-cluster/glusterfs/glusterfs-3.12.3.ebuild      | 218 ---------------------
 3 files changed, 264 deletions(-)