929226 – sys-libs/glibc-2.38-r11 : Unknown regression causes net-im/ejabberd (all versions from 22.10 through 24.02) to fail to start


Status:	RESOLVED WORKSFORME

Alias:	None

Product:	Gentoo Linux
Classification:	Unclassified
Component:	Current packages
Hardware:	All Linux

Importance:	Normal normal
Assignee:	Gentoo Toolchain Maintainers

URL:
Whiteboard:
Keywords:

Depends on:
Blocks:

Reported:	2024-04-12 16:35 UTC by Phil Stracchino (Unix Ronin)
Modified:	2024-05-04 12:51 UTC
CC List:	5 users

See Also:
Package list:
Runtime testing required:	---

Attachments
ejabberd configuration (ejabberd.yml) (ejabberd.yml, 6.37 KB, application/yaml) 2024-04-12 16:42 UTC, Phil Stracchino (Unix Ronin)
emerge --info from machine 1 ('minbar') (emerge-info.minbar, 6.36 KB, text/plain) 2024-04-12 16:43 UTC, Phil Stracchino (Unix Ronin)
emerge --info from machine 2 ('narn') (emerge-info.narn, 5.95 KB, text/plain) 2024-04-12 16:43 UTC, Phil Stracchino (Unix Ronin)
ejabberd log from machine 2 containing entire log output of a single startup failure (ejabberd.log, 3.13 KB, text/x-log) 2024-04-12 16:44 UTC, Phil Stracchino (Unix Ronin)
gzipped strace log (strace.out.gz, 738.66 KB, application/x-gzip-compressed) 2024-04-12 16:45 UTC, Phil Stracchino (Unix Ronin)

Description Phil Stracchino (Unix Ronin) 2024-04-12 16:35:40 UTC

There has been some discussion and attempt to troubleshoot this bug on #gentoo, but to my knowledge we have failed to identify any clear cause.  The situation is as follows:

A Gentoo profile 23.0 system with glibc-2.38-r10 runs ejabberd-24.02 with no issues and has been doing so for years.  If NO OTHER CHANGE is made but to upgrade glibc to 2.38-r11 and then reboot, ejabberd will die on start, claiming that required network port 5223 is already in use.

netstat -an will show port 5223 unused.  So will lsof -i.  nc -vlp 5223 will start correctly and work, proving that nothing else is using the port.  Try to start ejabberd again, and it will die again, asserting again that port 5223 is already open.
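The same check can be made from a script that binds the exact wildcard address named in the log ([::]) rather than whatever address family nc happens to pick. This is only a minimal sketch: probe_bind is a hypothetical helper, not part of ejabberd or Erlang, and setting SO_REUSEADDR is an assumption about how the listener is configured.

```python
import errno
import socket


def probe_bind(port, host="::", family=socket.AF_INET6):
    """Try to bind and listen on (host, port).

    Returns None on success, or the symbolic errno name (for example
    "EADDRINUSE") if the kernel refuses the bind.
    """
    sock = socket.socket(family, socket.SOCK_STREAM)
    try:
        # Assumption: the real listener also sets SO_REUSEADDR.
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        sock.bind((host, port))
        sock.listen(1)
        return None
    except OSError as exc:
        return errno.errorcode.get(exc.errno, str(exc.errno))
    finally:
        sock.close()


# Example (run while ejabberd is stopped):
#   print(probe_bind(5223))                                   # IPv6 wildcard
#   print(probe_bind(5223, "0.0.0.0", socket.AF_INET))        # IPv4 wildcard
```

If the IPv6 probe reports EADDRINUSE while the IPv4 one succeeds (or the reverse), that would point at an address-family conflict rather than at the port number as such.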

Per log:

2024-04-07 12:14:09.083713-04:00 [error] <0.408.0>@ejabberd_listener:report_socket_error/3:552 Failed to open socket at [::]:5223 for ejabberd_c2s: address already in use

[...]00 [error] <0.406.0>@supervisor:start_children/2:398 SUPERVISOR REPORT:
    supervisor: {local,ejabberd_listener}
    reason: eaddrinuse
    offender: [{pid,undefined},
               {id,{5223,{0,0,0,0,0,0,0,0},tcp}},
               {mfargs,
                   {ejabberd_listener,start,
                       [{5223,{0,0,0,0,0,0,0,0},tcp},
                        ejabberd_c2s,
                        #{access => c2s,zlib => false,send_timeout => 15000,
                          ip => {0,0,0,0,0,0,0,0},
                          supervisor => true,backlog => 128,
                          dhfile => undefined,max_fsm_queue => 10000,
                          unix_socket => #{},ciphers => undefined,
                          cafile => undefined,shaper => c2s_shaper,
                          protocol_options => undefined,
                          starttls_required => false,
                          max_stanza_size => 262144,tls => true,
                          transport => tcp,starttls => false,
                          accept_interval => 0,use_proxy_protocol => false,
                          tls_compression => false,tls_verify => false}]}},
               {restart_type,transient},
               {significant,false},
               {shutdown,brutal_kill},
               {child_type,worker}]

2024-04-07 12:14:09.085641-04:00 [error] <0.380.0>@supervisor:start_children/2:398 SUPERVISOR REPORT:
    supervisor: {local,ejabberd_sup}
    errorContext: start_error
    reason: {shutdown,
                {failed_to_start_child,
                    {5223,{0,0,0,0,0,0,0,0},tcp},
                    eaddrinuse}}
    offender: [{pid,undefined},
               {id,ejabberd_listener},
               {mfargs,{ejabberd_listener,start_link,[]}},
               {restart_type,permanent},
               {significant,false},
               {shutdown,infinity},
               {child_type,supervisor}]

2024-04-07 12:14:09.094764-04:00 [critical] <0.127.0>@ejabberd_app:start/2:68 Failed to start ejabberd application: {error,
                                       {shutdown,
                                        {failed_to_start_child,
                                         ejabberd_listener,
                                         {shutdown,
                                          {failed_to_start_child,
                                           {5223,{0,0,0,0,0,0,0,0},tcp},
                                           eaddrinuse}}}}}


After receiving that failure (repeatable exactly across multiple attempts), do NOTHING MORE THAN downgrade glibc back to 2.38-r10 and reboot again, and suddenly ejabberd works perfectly again.

AS FAR AS I HAVE YET FOUND, no other application EXCEPT net-im/ejabberd is affected.  I have tested with all ejabberd versions from 22.10 up to 24.02-r1 and the failure is exactly the same with any ejabberd version.

The failure is EXACTLY reproducible on two different machines, *except* that on the second machine, the failure occurs (equally consistently) while trying to open port 3478.  Once again it can be demonstrated that NOTHING else already has 3478 open.

dev-lang/erlang is version 26.2.2 (unstable) on machine 1 (updated to try to troubleshoot the problem; the update had no effect), 26.2.1 (stable) on machine 2.

I will attach the configuration, emerge --info, and an strace log of the failure in a moment.


Reproducible: Always

Comment 1 Phil Stracchino (Unix Ronin) 2024-04-12 16:42:29 UTC

Created attachment 890511 [details]
ejabberd configuration (ejabberd.yml)

Comment 2 Phil Stracchino (Unix Ronin) 2024-04-12 16:43:13 UTC

Created attachment 890512 [details]
emerge --info from machine 1 ('minbar')

Comment 3 Phil Stracchino (Unix Ronin) 2024-04-12 16:43:41 UTC

Created attachment 890513 [details]
emerge --info from machine 2 ('narn')

Comment 4 Phil Stracchino (Unix Ronin) 2024-04-12 16:44:22 UTC

Created attachment 890514 [details]
ejabberd log from machine 2 containing entire log output of a single startup failure

Comment 5 Phil Stracchino (Unix Ronin) 2024-04-12 16:45:43 UTC

Created attachment 890515 [details]
gzipped strace log

Precise command that generated this log:

narn:root:/var/log/ejabberd:18 # strace -f ejabberdctl start > strace.out 2>&1
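A log of this size is easier to read once it is reduced to the bind() calls that failed. The following is a rough sketch under the assumption that the log uses strace's usual one-line-per-syscall layout (for example `bind(23, {... sin6_port=htons(5223) ...}, 28) = -1 EADDRINUSE (...)`); failed_binds is a hypothetical helper, and the regular expressions may need adjusting to the actual attachment.

```python
import re

# Matches a bind() call that returned -1, capturing the sockaddr and errno name.
BIND_FAILURE = re.compile(
    r"\bbind\((?P<fd>\d+), \{(?P<addr>.*)\}, \d+\)\s+= -1 (?P<err>E[A-Z]+)"
)
# Matches the port in either an IPv4 (sin_port) or IPv6 (sin6_port) sockaddr.
PORT = re.compile(r"sin6?_port=htons\((\d+)\)")


def failed_binds(lines):
    """Return a list of (port, errno_name, sockaddr_text) for each failed bind()."""
    results = []
    for line in lines:
        match = BIND_FAILURE.search(line)
        if match is None:
            continue
        port = PORT.search(match.group("addr"))
        results.append(
            (int(port.group(1)) if port else None,
             match.group("err"),
             match.group("addr"))
        )
    return results


# Example, reading the gzipped attachment directly:
#   import gzip
#   with gzip.open("strace.out.gz", "rt", errors="replace") as log:
#       for port, err, addr in failed_binds(log):
#           print(port, err, addr)
```

The sockaddr text shows which address and family the failing call used, which can then be compared against what netstat and lsof report.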

Comment 6 Phil Stracchino (Unix Ronin) 2024-04-12 16:49:28 UTC

SO FAR I have attempted to reproduce this only on amd64.  That it fails there is puzzling, since most of the changes between sys-libs/glibc-2.38-r10 and -r11 are for SPARC and ARM architectures.  I do not currently have any other architecture available to test on.

Both test machines are Dell R610s with 24GB of RAM and two Xeon E5620 4-core processors.  Both are running kernel 6.7.9.

Comment 7 Phil Stracchino (Unix Ronin) 2024-04-12 18:02:05 UTC

Additional clarification:
— Machine 1 (minbar) STARTED OUT on dev-lang/erlang-26.2.1 the same as machine 2 (narn), and was updated to 26.2.2 during testing as a troubleshooting step.  It had no effect, so I did not try dev-lang/erlang-26.2.2 on machine 2.

— In all of my testing, minbar 100% consistently fails on port 5223, while narn 100% consistently fails on port 3478.  I have absolutely no idea why this is.

— I can consistently make either machine succeed or fail on demand by ONLY upgrading/downgrading sys-libs/glibc between 2.38-r10 and 2.38-r11 and rebooting.  Nothing else, erlang, ejabberd, anything, has any effect on the failure.

— Rebuilding erlang and ejabberd also has no effect.  The glibc revision number is the ONLY so-far-known factor.

Comment 8 Sam James archtester 2024-04-12 18:21:21 UTC

Phil, I'm sorry I've not had more time yet to try debug it with you.

In the meantime, do you think you could try bisect using glibc-9999 configured to use the 2.38 branch?

Comment 9 Phil Stracchino (Unix Ronin) 2024-04-12 18:23:32 UTC

(In reply to Sam James from comment #8)
> Phil, I'm sorry I've not had more time yet to try debug it with you.
> 
> In the meantime, do you think you could try bisect using glibc-9999
> configured to use the 2.38 branch?

Um ...   I'm sorry, I don't know how to do that.  But if you can explain the procedure I'll give it a try.

Comment 10 Florian Schmaus gentoo-dev 2024-04-14 14:36:29 UTC

(In reply to Phil Stracchino (Unix Ronin) from comment #9)
> (In reply to Sam James from comment #8)
> > Phil, I'm sorry I've not had more time yet to try debug it with you.
> > 
> > In the meantime, do you think you could try bisect using glibc-9999
> > configured to use the 2.38 branch?
> 
> Um ...   I'm sorry, I don't know how to do that.  But if you can explain the
> procedure I'll give it a try.

I *think* Sam asks you to use the glibc-9999 ebuild to build specific commits of glibc's release/2.38/master branch and determine which commit introduces the regression via bisection. See https://wiki.gentoo.org/wiki/Bisecting_with_live_ebuilds for more information.

That said, -r11 introduces the following patches compared to -r10:

[PATCH 56/65] i386: Use generic memrchr in libc (bug 31316)
[PATCH 57/65] Mitigation for "clone on sparc might fail with -EFAULT
30e546d [PATCH 58/65] x86_64: Optimize ffsll function code size.
18876c0 [PATCH 59/65] S390: Fix building with --disable-mutli-arch
6f68075 [PATCH 60/65] sparc: Fix broken memset for sparc32 [BZ #31068]
0e383d2 [PATCH 61/65] sparc64: Remove unwind information from signal return
aac57fa [PATCH 62/65] sparc: Fix sparc64 memmove length comparison (BZ 31266)
0c5e5ba [PATCH 63/65] sparc: Remove unwind information from signal return
b09073e [PATCH 64/65] arm: Remove wrong ldr from _dl_start_user (BZ 31339)
506e47d [PATCH 65/65] malloc: Use __get_nprocs on arena_get2 (BZ 30945)

However, as far as I can tell, #56 and #57 are not included in glibc's 2.28 branch. So when bisection returns no bad commit, it could very well be one of those two.

Furthermore, I was unable to reproduce that on an amd64 systemd system, running (now) glibc-2.28-r11 and ejabberd-23.04. That does by no means mean that your report is invalid. It mostly means that
1. There is probably another unknown factor at play
2. A reproducer is needed more than ever

If you don't feel able to bisect this, you could still create a new VM, install ejabberd and glibc-2.38-r11, and see if you can reproduce the behavior in that VM.
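For anyone unfamiliar with the idea, bisection is just a binary search over the ordered commits between a known-good and a known-bad state. Here is a minimal sketch, using the abbreviated hashes of patches 58 to 65 from the list above as sample data; first_bad and the is_bad callback are hypothetical names, and in practice is_bad would mean "build glibc at that commit, reboot, and check whether ejabberd fails to start". It assumes that once a commit is bad, every later commit is bad too.

```python
def first_bad(commits, is_bad):
    """Binary-search an ordered commit list for the first bad commit.

    Assumes commits[0] is known good and commits[-1] is known bad.
    """
    lo, hi = 0, len(commits) - 1  # lo: newest known-good, hi: oldest known-bad
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if is_bad(commits[mid]):
            hi = mid
        else:
            lo = mid
    return commits[hi]


# Sample data: "r10-baseline" stands for the last state known to work,
# followed by the hashes of patches 58..65 in the order listed above.
COMMITS = ["r10-baseline", "30e546d", "18876c0", "6f68075", "0e383d2",
           "aac57fa", "0c5e5ba", "b09073e", "506e47d"]
```

With eight candidate commits this needs about three builds instead of eight. If the last hashed commit is tested first and turns out to be good, the search has nothing to find, which is the situation where the two unhashed patches would become the suspects.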

Comment 11 Phil Stracchino (Unix Ronin) 2024-04-14 16:45:25 UTC

(In reply to Florian Schmaus from comment #10)

> However, as far as I can tell, #56 and #57 are not included in glibc's 2.28
> branch. So when bisection returns no bad commit, it could very well be one
> of those two.
> 
> Furthermore, I was unable to reproduce that on an amd64 systemd system,
> running (now) glibc-2.28-r11 and ejabberd-23.04. That does by no means mean
> that your report is invalid. It mostly means that
> 1. There is probably another unknown factor at play
> 2. A reproducer is needed more than ever


Florian, I'm assuming 2.28 is a typo for 2.38 here.

Beyond that, do you have any hypotheses about *what* that additional factor might possibly be?  I can provide a list of packages in common between both known-affected systems if that might help.

Comment 12 Phil Stracchino (Unix Ronin) 2024-04-21 16:59:43 UTC

Additional information:

Verified that the problem persists in sys-libs/glibc-2.38-r12 and is also reproducible with sys-libs/glibc-2.39-r3.  ejabberd continues to work as expected with sys-libs/glibc-2.38-r10.

Comment 13 Phil Stracchino (Unix Ronin) 2024-04-21 20:16:17 UTC

(In reply to Florian Schmaus from comment #10)
> I *think* Sam asks you to use the glibc-9999 ebuild to build specific
> commits of glibc's release/2.38/master branch and determine which
> commit introduces the regression via bisection. See
> https://wiki.gentoo.org/wiki/Bisecting_with_live_ebuilds for more
> information.
> 
> That said, -r11 introduces the following patches compared to -r10:
> 
> [PATCH 56/65] i386: Use generic memrchr in libc (bug 31316)
> [PATCH 57/65] Mitigation for "clone on sparc might fail with -EFAULT
> 30e546d [PATCH 58/65] x86_64: Optimize ffsll function code size.
> 18876c0 [PATCH 59/65] S390: Fix building with --disable-mutli-arch
> 6f68075 [PATCH 60/65] sparc: Fix broken memset for sparc32 [BZ #31068]
> 0e383d2 [PATCH 61/65] sparc64: Remove unwind information from signal return
> aac57fa [PATCH 62/65] sparc: Fix sparc64 memmove length comparison (BZ 31266)
> 0c5e5ba [PATCH 63/65] sparc: Remove unwind information from signal return
> b09073e [PATCH 64/65] arm: Remove wrong ldr from _dl_start_user (BZ 31339)
> 506e47d [PATCH 65/65] malloc: Use __get_nprocs on arena_get2 (BZ 30945)
> 
> However, as far as I can tell, #56 and #57 are not included in glibc's 2.28
> branch. So when bisection returns no bad commit, it could very well be one
> of those two.

Florian, Sam, I'm trying to figure out what I need to do to accomplish this but I seem to be missing some pieces.  In particular I'm not sure which branch to pull from https://sourceware.org/git/glibc.git to bisect between 2.38-r10 and 2.38-r11.  Can you suggest a starting point?

Comment 14 Phil Stracchino (Unix Ronin) 2024-04-24 18:41:42 UTC

Not having bisect figured out yet, I've been testing by hand.  After a bunch of time wasted today by procedural mistakes (I was forgetting to stop one ill-behaved service that DOES conflict), I have determined that up to glibc-2.38-r12, I can no longer reproduce the problem on 'narn', the non-production machine.  And retesting now on the production machine, 'minbar', I can't reproduce it THERE either.
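A conflicting service of this kind can be caught before each test run by listing the TCP ports that are already in the LISTEN state. The following is a small sketch that parses the text of /proc/net/tcp or /proc/net/tcp6 on Linux; listening_ports is a hypothetical helper, it covers TCP only, and it reports ports rather than the owning process.

```python
TCP_LISTEN = "0A"  # socket state code for LISTEN in /proc/net/tcp{,6}


def listening_ports(proc_net_text):
    """Return the set of local TCP ports in LISTEN state.

    proc_net_text is the full content of /proc/net/tcp or /proc/net/tcp6.
    """
    ports = set()
    for line in proc_net_text.splitlines()[1:]:  # first line is the header
        fields = line.split()
        if len(fields) < 4:
            continue
        local_address, state = fields[1], fields[3]
        if state == TCP_LISTEN:
            # local_address is "<hex ip>:<hex port>"
            ports.add(int(local_address.rsplit(":", 1)[1], 16))
    return ports


# Example:
#   busy = set()
#   for path in ("/proc/net/tcp", "/proc/net/tcp6"):
#       with open(path) as table:
#           busy |= listening_ports(table.read())
#   print(sorted(busy & {5223, 3478}))
```

An empty result for the ports ejabberd wants means nothing was listening on them over TCP at that moment.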


I can only conclude that there was an additional, unknown factor which has since changed/been resolved.  I will probably never now figure out what it was.


This should probably be marked as CANNOT REPRODUCE.

Comment 15 Sam James archtester 2024-05-04 12:51:56 UTC

(In reply to Phil Stracchino (Unix Ronin) from comment #14)
> Not having bisect figured out yet, I've been testing by hand.  After a bunch
> of time wasted today by procedural mistakes (I was forgetting to stop one
> ill-behaved service that DOES conflict), I have determined that up to
> glibc-2.38-r12, I can no longer reproduce the problem on 'narn', the
> non-production machine.  And retesting now on the production machine,
> 'minbar', I can't reproduce it THERE either.
> 
> 
> I can only conclude that there was an additional, unknown factor which has
> since changed/been resolved.  I will probably never now figure out what it
> was.
> 
> 
> This should probably be marked as CANNOT REPRODUCE.

Bleh. I'll regrettably call it WORKSFORME. Thank you for trying and I'm sorry we didn't get to the bottom of it. I will let you know if I see anything which sounds like it could've been it...