There has been some discussion and an attempt to troubleshoot this bug on #gentoo, but to my knowledge we have failed to identify any clear cause.

The situation is as follows: a Gentoo profile 23.0 system with glibc-2.38-r10 runs ejabberd-24.02 with no issues, and has been doing so for years. If NO OTHER CHANGE is made but to upgrade glibc to 2.38-r11 and then reboot, ejabberd will die on start, claiming that required network port 5223 is already in use. netstat -an shows port 5223 unused. So does lsof -i. nc -vlp 5223 starts correctly and works, proving that nothing else is using the port. Try to start ejabberd again, and it will die again, asserting again that port 5223 is already open.

Per log:

2024-04-07 12:14:09.083713-04:00 [error] <0.408.0>@ejabberd_listener:report_socket_error/3:552 Failed to open socket at [::]:5223 for ejabberd_c2s: address already in use
[error] <0.406.0>@supervisor:start_children/2:398 SUPERVISOR REPORT:
    supervisor: {local,ejabberd_listener}
    reason: eaddrinuse
    offender: [{pid,undefined},
               {id,{5223,{0,0,0,0,0,0,0,0},tcp}},
               {mfargs,
                   {ejabberd_listener,start,
                       [{5223,{0,0,0,0,0,0,0,0},tcp},
                        ejabberd_c2s,
                        #{access => c2s,zlib => false,send_timeout => 15000,
                          ip => {0,0,0,0,0,0,0,0},
                          supervisor => true,backlog => 128,
                          dhfile => undefined,max_fsm_queue => 10000,
                          unix_socket => #{},ciphers => undefined,
                          cafile => undefined,shaper => c2s_shaper,
                          protocol_options => undefined,
                          starttls_required => false,
                          max_stanza_size => 262144,tls => true,
                          transport => tcp,starttls => false,
                          accept_interval => 0,use_proxy_protocol => false,
                          tls_compression => false,tls_verify => false}]}},
               {restart_type,transient},
               {significant,false},
               {shutdown,brutal_kill},
               {child_type,worker}]
2024-04-07 12:14:09.085641-04:00 [error] <0.380.0>@supervisor:start_children/2:398 SUPERVISOR REPORT:
    supervisor: {local,ejabberd_sup}
    errorContext: start_error
    reason: {shutdown,
                {failed_to_start_child,
                    {5223,{0,0,0,0,0,0,0,0},tcp},
                    eaddrinuse}}
    offender: [{pid,undefined},
               {id,ejabberd_listener},
               {mfargs,{ejabberd_listener,start_link,[]}},
               {restart_type,permanent},
               {significant,false},
               {shutdown,infinity},
               {child_type,supervisor}]
2024-04-07 12:14:09.094764-04:00 [critical] <0.127.0>@ejabberd_app:start/2:68 Failed to start ejabberd application:
    {error,
        {shutdown,
            {failed_to_start_child,
                ejabberd_listener,
                {shutdown,
                    {failed_to_start_child,
                        {5223,{0,0,0,0,0,0,0,0},tcp},
                        eaddrinuse}}}}}

After receiving that failure (repeatable exactly across multiple attempts), I did NOTHING MORE THAN downgrade glibc back to 2.38-r10 and reboot again, and suddenly ejabberd works perfectly again.

AS FAR AS I HAVE YET FOUND, no other application EXCEPT net-im/ejabberd is affected. I have tested with all ejabberd versions from 22.10 up to 24.02-r1, and the failure is exactly the same with any ejabberd version.

The failure is EXACTLY reproducible on two different machines, *except* that on the second machine, the failure occurs (equally consistently) while trying to open port 3478. Once again it can be demonstrated that NOTHING else already has 3478 open.

dev-lang/erlang is version 26.2.2 (unstable) on machine 1 (updated to try to troubleshoot the problem; the update had no effect), and 26.2.1 (stable) on machine 2.

I will attach the configuration, emerge --info, and an strace log of the failure in a moment.

Reproducible: Always
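For completeness, the port check (shown here for 5223; substitute 3478 on the second machine) boils down to something like this, sketching the commands described above:

    netstat -an | grep 5223    # shows nothing listening on 5223
    lsof -i | grep 5223        # no output
    nc -vlp 5223               # binds and listens without complaint

(ss -tln from iproute2 would be an equivalent cross-check, for anyone who prefers it over netstat.)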
Created attachment 890511 [details] ejabberd configuration (ejabberd.yml)
Created attachment 890512 [details] emerge --info from machine 1 ('minbar')
Created attachment 890513 [details] emerge --info from machine 2 ('narn')
Created attachment 890514 [details] ejabberd log from machine 2 containing entire log output of a single startup failure
Created attachment 890515 [details]
gzipped strace log

Precise command that generated this log:

    narn:root:/var/log/ejabberd:18 # strace -f ejabberdctl start > strace.out 2>&1
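For anyone digging through that trace, the interesting part should be the bind() call on the failing port; something like this pulls out the relevant lines (adjust the filename to whatever the attachment is saved as):

    zcat strace.out.gz | grep -E 'bind\(|EADDRINUSE'

strace normally reports the failing call as bind(...) = -1 EADDRINUSE (Address already in use), so the error name is a convenient thing to grep for.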
SO FAR I have attempted to reproduce this only on amd64, and it does reproduce there, which is puzzling since most of the changes between sys-libs/glibc-2.38-r10 and -r11 are for the SPARC and ARM architectures. I do not currently have any other architecture available to test on. Both test machines are Dell R610s with 24GB of RAM and two Xeon E5620 4-core processors. Both are running kernel 6.7.9.
Additional clarification:

- Machine 1 (minbar) STARTED OUT on dev-lang/erlang-26.2.1, the same as machine 2 (narn), and was updated to 26.2.2 during testing as a troubleshooting step. It had no effect, so I did not try dev-lang/erlang-26.2.2 on machine 2.
- In all of my testing, minbar 100% consistently fails on port 5223, while narn 100% consistently fails on port 3478. I have absolutely no idea why this is.
- I can consistently make either machine succeed or fail on demand by ONLY upgrading/downgrading sys-libs/glibc between 2.38-r10 and 2.38-r11 and rebooting (the exact cycle is sketched below). Nothing else (erlang, ejabberd, anything) has any effect on the failure.
- Rebuilding erlang and ejabberd also has no effect. The glibc revision number is the ONLY so-far-known factor.
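For clarity, the full break/unbreak cycle is nothing more than this (a rough sketch; it assumes both glibc ebuilds are still available in the tree, and portage will of course warn about the downgrade step):

    # break it:
    emerge --oneshot =sys-libs/glibc-2.38-r11
    reboot
    ejabberdctl start    # dies with eaddrinuse on 5223 (minbar) / 3478 (narn)

    # un-break it:
    emerge --oneshot =sys-libs/glibc-2.38-r10
    reboot
    ejabberdctl start    # starts and runs normally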
Phil, I'm sorry I've not had more time yet to try to debug it with you.

In the meantime, do you think you could try bisecting using glibc-9999 configured to use the 2.38 branch?
(In reply to Sam James from comment #8)
> Phil, I'm sorry I've not had more time yet to try to debug it with you.
>
> In the meantime, do you think you could try bisecting using glibc-9999
> configured to use the 2.38 branch?

Um ... I'm sorry, I don't know how to do that. But if you can explain the procedure, I'll give it a try.
(In reply to Phil Stracchino (Unix Ronin) from comment #9)
> (In reply to Sam James from comment #8)
> > Phil, I'm sorry I've not had more time yet to try to debug it with you.
> >
> > In the meantime, do you think you could try bisecting using glibc-9999
> > configured to use the 2.38 branch?
>
> Um ... I'm sorry, I don't know how to do that. But if you can explain the
> procedure, I'll give it a try.

I *think* Sam is asking you to use the glibc-9999 ebuild to build specific commits of glibc's release/2.38/master branch and determine which commit introduces the regression via bisection. See https://wiki.gentoo.org/wiki/Bisecting_with_live_ebuilds for more information.

That said, -r11 introduces the following patches compared to -r10:

        [PATCH 56/65] i386: Use generic memrchr in libc (bug 31316)
        [PATCH 57/65] Mitigation for "clone on sparc might fail with -EFAULT"
30e546d [PATCH 58/65] x86_64: Optimize ffsll function code size.
18876c0 [PATCH 59/65] S390: Fix building with --disable-mutli-arch
6f68075 [PATCH 60/65] sparc: Fix broken memset for sparc32 [BZ #31068]
0e383d2 [PATCH 61/65] sparc64: Remove unwind information from signal return
aac57fa [PATCH 62/65] sparc: Fix sparc64 memmove length comparison (BZ 31266)
0c5e5ba [PATCH 63/65] sparc: Remove unwind information from signal return
b09073e [PATCH 64/65] arm: Remove wrong ldr from _dl_start_user (BZ 31339)
506e47d [PATCH 65/65] malloc: Use __get_nprocs on arena_get2 (BZ 30945)

However, as far as I can tell, #56 and #57 are not included in glibc's 2.28 branch. So if bisection returns no bad commit, it could very well be one of those two.

Furthermore, I was unable to reproduce this on an amd64 systemd system, running (now) glibc-2.28-r11 and ejabberd-23.04. That does by no means mean that your report is invalid. It mostly means that
1. There is probably another unknown factor at play
2. A reproducer is needed more than ever

If you don't feel able to bisect this, then you could still create a new VM, install ejabberd and glibc-2.38-r11, and see if you can reproduce the behavior in that VM.
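Roughly, and going by that wiki page (so treat the exact variable names and paths below as a sketch rather than as gospel), one test iteration with glibc-9999 would look something like this:

    # keyword the live ebuild (package.accept_keywords may be a file or a directory on your system)
    echo 'sys-libs/glibc **' >> /etc/portage/package.accept_keywords/glibc

    # build glibc from one specific commit of the release/2.38/master branch
    EGIT_OVERRIDE_BRANCH_GLIBC="release/2.38/master" \
    EGIT_OVERRIDE_COMMIT_GLIBC="<commit-to-test>" \
        emerge --oneshot =sys-libs/glibc-9999

    # then reboot, try to start ejabberd, and note whether that commit is good or bad

with <commit-to-test> supplied by git bisect on a local clone of the glibc repository.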
(In reply to Florian Schmaus from comment #10)
> However, as far as I can tell, #56 and #57 are not included in glibc's 2.28
> branch. So if bisection returns no bad commit, it could very well be one of
> those two.
>
> Furthermore, I was unable to reproduce this on an amd64 systemd system,
> running (now) glibc-2.28-r11 and ejabberd-23.04. That does by no means mean
> that your report is invalid. It mostly means that
> 1. There is probably another unknown factor at play
> 2. A reproducer is needed more than ever

Florian, I'm assuming 2.28 is a typo for 2.38 here.

Beyond that, do you have any hypotheses about *what* that additional factor might possibly be? I can provide a list of packages in common between both known-affected systems if that might help.
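Generating and comparing those lists should be straightforward; a minimal sketch (qlist is from app-portage/portage-utils, and the file names are just placeholders):

    # on each machine:
    qlist -I | sort > /tmp/$(hostname).pkgs

    # then, with both files copied to one machine:
    comm -12 /tmp/minbar.pkgs /tmp/narn.pkgs > common-packages.txt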
Additional information: Verified that the problem persists in sys-libs/glibc-2.38-r12 and is also reproducible with sys-libs/glibc-2.39-r3. ejabberd continues to work as expected with sys-libs/glibc-2.38-r10.
(In reply to Florian Schmaus from comment #10)
> I *think* Sam is asking you to use the glibc-9999 ebuild to build specific
> commits of glibc's release/2.38/master branch and determine which commit
> introduces the regression via bisection. See
> https://wiki.gentoo.org/wiki/Bisecting_with_live_ebuilds for more
> information.
>
> That said, -r11 introduces the following patches compared to -r10:
>
>         [PATCH 56/65] i386: Use generic memrchr in libc (bug 31316)
>         [PATCH 57/65] Mitigation for "clone on sparc might fail with -EFAULT"
> 30e546d [PATCH 58/65] x86_64: Optimize ffsll function code size.
> 18876c0 [PATCH 59/65] S390: Fix building with --disable-mutli-arch
> 6f68075 [PATCH 60/65] sparc: Fix broken memset for sparc32 [BZ #31068]
> 0e383d2 [PATCH 61/65] sparc64: Remove unwind information from signal return
> aac57fa [PATCH 62/65] sparc: Fix sparc64 memmove length comparison (BZ 31266)
> 0c5e5ba [PATCH 63/65] sparc: Remove unwind information from signal return
> b09073e [PATCH 64/65] arm: Remove wrong ldr from _dl_start_user (BZ 31339)
> 506e47d [PATCH 65/65] malloc: Use __get_nprocs on arena_get2 (BZ 30945)
>
> However, as far as I can tell, #56 and #57 are not included in glibc's 2.28
> branch. So if bisection returns no bad commit, it could very well be one of
> those two.

Florian, Sam, I'm trying to figure out what I need to do to accomplish this, but I seem to be missing some pieces. In particular, I'm not sure which branch to pull from https://sourceware.org/git/glibc.git to bisect between 2.38-r10 and 2.38-r11. Can you suggest a starting point?
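If it is release/2.38/master as comment #10 suggests, then I'm guessing the git side of the bisection looks roughly like this, with the good/bad commit IDs as placeholders I still need to identify (i.e. the last upstream commit included in the -r10 patchset and the newest one pulled in by -r11):

    git clone https://sourceware.org/git/glibc.git
    cd glibc
    git checkout release/2.38/master

    git bisect start <bad-commit> <good-commit>
    # at each step: build the commit git suggests via glibc-9999
    # (EGIT_OVERRIDE_COMMIT_GLIBC as above), reboot, test ejabberd, then
    git bisect good    # or: git bisect bad

Please correct me if that's not what you had in mind.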
Not having figured out bisection yet, I've been testing by hand. After a bunch of time wasted today on procedural mistakes (I was forgetting to stop one ill-behaved service that DOES conflict), I have determined that, up to and including glibc-2.38-r12, I can no longer reproduce the problem on 'narn', the non-production machine. And retesting now on the production machine, 'minbar', I can't reproduce it THERE either.

I can only conclude that there was an additional, unknown factor which has since changed or been resolved. I will probably never figure out what it was now.

This should probably be marked as CANNOT REPRODUCE.
(In reply to Phil Stracchino (Unix Ronin) from comment #14)
> Not having figured out bisection yet, I've been testing by hand. After a
> bunch of time wasted today on procedural mistakes (I was forgetting to stop
> one ill-behaved service that DOES conflict), I have determined that, up to
> and including glibc-2.38-r12, I can no longer reproduce the problem on
> 'narn', the non-production machine. And retesting now on the production
> machine, 'minbar', I can't reproduce it THERE either.
>
> I can only conclude that there was an additional, unknown factor which has
> since changed or been resolved. I will probably never figure out what it
> was now.
>
> This should probably be marked as CANNOT REPRODUCE.

Bleh. I'll regrettably call it WORKSFORME. Thank you for trying, and I'm sorry we didn't get to the bottom of it. I will let you know if I see anything which sounds like it could've been it...