703222 – dev-db/mariadb-10.2.29 systemd service fails to start in galera cluster

Status:	UNCONFIRMED

Alias:	None

Product:	Gentoo Linux
Classification:	Unclassified
Component:	Current packages
Hardware:	All Linux

Importance:	Normal normal with 1 vote
Assignee:	Gentoo Linux MySQL bugs team

Reported:	2019-12-17 17:46 UTC by Edoardo Liverani
Modified:	2020-07-03 19:07 UTC
CC List:	2 users

Description Edoardo Liverani 2019-12-17 17:46:01 UTC

After updating one of the Galera cluster nodes from mariadb-10.2.22 to mariadb-10.2.29, the systemd service fails to start. 
Here's what I see in the logs:

sh[4624]: /usr/bin/galera_recovery: line 71: /tmp/wsrep_recovery.Y7g0X2: Permission denied
systemd[1]: mariadb.service: Control process exited, code=exited, status=1/FAILURE
systemd[1]: mariadb.service: Failed with result 'exit-code'.
systemd[1]: Failed to start MariaDB 10.2.29 database server.

I diffed the service file and the galera_recovery script from the previous version and there don't seem to be any relevant differences. (Maybe this bug comes from upstream?)

If MariaDB is not configured as a Galera cluster, it starts without problems.

/tmp is 777, managed by systemd-tmpfiles.

I was able to reproduce this bug in 2 different clusters (different Gentoo installations), so it should be reproducible with this configuration.
After re-emerging version 10.2.22, it starts correctly.
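
The description gives the mode of /tmp as 777 but does not say whether the sticky bit is set. A minimal sketch for checking this, assuming GNU coreutils stat is available; the function names dir_mode and is_sticky_world_writable are hypothetical and not part of any MariaDB or Gentoo tooling:

```shell
#!/bin/sh
# Hypothetical helper: print the octal mode of a directory, including any
# special bits such as the sticky bit. Assumes GNU stat (coreutils).
dir_mode() {
    stat -c '%a' "$1"
}

# Succeeds only if the directory is world-writable with the sticky bit set,
# which is the conventional mode (1777) for a shared temp directory.
is_sticky_world_writable() {
    [ "$(dir_mode "$1")" = "1777" ]
}
```

For example, `dir_mode /tmp` prints the full mode, and `is_sticky_world_writable /tmp || echo "unexpected mode"` flags anything other than 1777.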

Comment 1 Tomáš Mózes 2019-12-18 03:57:53 UTC

Can you please try 10.2.30?

Comment 2 Edoardo Liverani 2019-12-18 10:21:40 UTC

(In reply to Tomáš Mózes from comment #1)
> Can you please try 10.2.30?

Hi Tomáš, there is no such version in portage. Are you suggesting a version bump?

Comment 3 Tomáš Mózes 2019-12-18 10:59:24 UTC

It's probably enough to copy the 10.2.29 ebuild to your local overlay as 10.2.30.

Comment 4 Edoardo Liverani 2019-12-18 11:32:18 UTC

(In reply to Tomáš Mózes from comment #3)
> It's probably enough to copy the 10.2.29 ebuild to your local overlay as
> 10.2.30.

I made the version bump and nothing changed.
So I tried to mangle the galera_recovery script until I got it working.
First I commented out the lines with chown and chmod on the tmp file (105-106):
#  [ "$euid" = "0" ] && chown $user $log_file
#      chmod 700 $log_file

Then I got a different error:
WSREP: Failed to start mysqld for wsrep recovery: '/usr/bin/galera_recovery: line 71: ./sbin/mysqld: No such file or directory'

So I changed ./sbin/mysqld to mysqld on that line.

That way it finally started correctly. (Though probably with some security problems around the file permissions?)

As I said before, the diff against the script from version 10.2.22 shows nothing like the changes I made, so I still don't get why the new version doesn't work.

BTW I think you should be able to test the script locally with a single-node cluster using the following my.cnf options:
wsrep_on=ON
wsrep_provider=/usr/lib/galera/libgalera_smm.so

(or refer to the Gentoo wiki Galera cluster page)
I'm not sure if a single-node cluster still triggers the script. I'll try shortly.
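
The two options above can be written into a scratch config fragment for a local test. This is only a sketch: the helper name write_galera_test_cnf is hypothetical, and placing the options under a [mysqld] section is an assumption not stated in this comment.

```shell
#!/bin/sh
# Hypothetical helper: write the two wsrep options quoted in this comment
# into a config fragment at the given path.
# Assumption: the options belong under a [mysqld] section.
write_galera_test_cnf() {
    target=$1
    cat > "$target" <<'EOF'
[mysqld]
wsrep_on=ON
wsrep_provider=/usr/lib/galera/libgalera_smm.so
EOF
}
```

Calling `write_galera_test_cnf ./galera-test.cnf` produces a file that can be reviewed and merged into my.cnf by hand.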

Comment 5 Tomáš Mózes 2019-12-18 11:52:30 UTC

I currently use OpenRC, so cannot test right away :-/

Comment 6 Edoardo Liverani 2019-12-18 14:20:52 UTC

I found out that what seemed to me a trivial difference between the scripts actually matters. Changing
print_defaults="/usr/libexec/mariadb/my_print_defaults"
back to
print_defaults="/usr/bin/my_print_defaults"

in the galera_recovery script made things work again.
I don't know the difference between these two binaries; I just know the first comes from the dev-db/mariadb package and the second comes from dev-db/mysql-connector-c.

I leave this as a workaround for those who encounter this same problem, but I think the maintainer should check it.

BTW I confirm what I said in my previous comment: just adding those two lines to the my.cnf configuration lets you test the script even on a single node.
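
One way to see how the two binaries differ is to run both with the same option groups and compare what they print. A minimal sketch; compare_print_defaults is a hypothetical helper, and the default group name mysqld is an assumption about what the recovery script asks for:

```shell
#!/bin/sh
# Hypothetical helper: run two my_print_defaults binaries with the same
# option groups and report whether their output differs.
# Returns 0 if identical, 1 if different, 2 if a binary is not executable.
compare_print_defaults() {
    bin_a=$1
    bin_b=$2
    shift 2
    # Assumption: fall back to the "mysqld" group when none is given.
    if [ "$#" -eq 0 ]; then
        set -- mysqld
    fi

    for bin in "$bin_a" "$bin_b"; do
        if [ ! -x "$bin" ]; then
            echo "not executable: $bin" >&2
            return 2
        fi
    done

    # Capture stderr too, so error messages show up in the comparison.
    out_a=$("$bin_a" "$@" 2>&1)
    out_b=$("$bin_b" "$@" 2>&1)

    if [ "$out_a" = "$out_b" ]; then
        echo "identical output for groups: $*"
        return 0
    fi
    echo "--- $bin_a"
    printf '%s\n' "$out_a"
    echo "--- $bin_b"
    printf '%s\n' "$out_b"
    return 1
}
```

With the paths from this comment, that would be `compare_print_defaults /usr/libexec/mariadb/my_print_defaults /usr/bin/my_print_defaults mysqld`.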

Comment 7 Edoardo Liverani 2020-04-07 09:31:02 UTC

This problem persists in dev-db/mariadb-10.2.31

Comment 8 Iain Price 2020-04-23 10:23:25 UTC

(thanks for the previous fix that kept things rolling for the last 4 months :D)


No idea what changed in the latest patch, but the "fix" of changing my_print_defaults doesn't seem to be working (it now gives an error about the --mysql parameter being bad).

There's something broken about the mktemp call here; it seems to be mktemp itself that's returning the permission denied(??).

My current bodge (probably insecure on a shared machine, but I need my cluster to work) is to just change line 28, and not do the previous stuff about changing my_print_defaults any more:

log_file=/tmp/wsrep_recovery ; cd /usr

It complains that mktemp failed (some check I didn't bother to decode) but at least it seems to start back up and rejoin the cluster now.

It'd be nice to figure out what's actually broken here so I don't have to bodge my database startup every time a new release gets emerged, but I'm a little busy so I'm just dumping information here for now.

Also, before anyone panics: wsrep_cluster_size drops to 0 after the upgrade from 10.2 to 10.4, but eventually returns to whatever it should be for your cluster size. There are various other unanswered observations about this, and even while wsrep_cluster_size is zero, changes seem to be replicating between the nodes. Just not the sort of additional panic I needed on top of already bodging things :P
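
To tell whether mktemp itself fails or whether the later write to the file fails, the steps can be run one at a time outside the script. This is a sketch only: probe_tmp_log is a hypothetical helper, the file name template is made up, and the chown/chmod step merely imitates the safety check quoted elsewhere in this bug.

```shell
#!/bin/sh
# Hypothetical probe: separate "mktemp fails" from "writing to the file fails".
# Usage: probe_tmp_log [DIR] [OWNER]
# Returns 1 if mktemp fails, 2 if chown/chmod fails, 3 if the write fails.
probe_tmp_log() {
    dir=${1:-/tmp}
    owner=${2:-}

    # Step 1: does mktemp itself succeed in this directory?
    log_file=$(mktemp "$dir/wsrep_probe.XXXXXX") || {
        echo "mktemp failed in $dir"
        return 1
    }

    # Step 2: imitate the safety check. Hand the file to OWNER only when
    # running as root and an owner was given, then tighten the mode.
    if [ -n "$owner" ] && [ "$(id -u)" = "0" ]; then
        chown "$owner" "$log_file" || {
            echo "chown failed on $log_file"
            rm -f "$log_file"
            return 2
        }
    fi
    chmod 600 "$log_file" || {
        echo "chmod failed on $log_file"
        rm -f "$log_file"
        return 2
    }

    # Step 3: reopen the file through a shell redirection.
    if ! ( echo probe >> "$log_file" ) 2>/dev/null; then
        echo "redirect failed on $log_file"
        rm -f "$log_file"
        return 3
    fi

    rm -f "$log_file"
    echo "ok: $dir"
}
```

Running `probe_tmp_log /tmp mysql` as the same user the service uses for its pre-start step is likely to give the most relevant result; the user name mysql here is an assumption.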

Comment 9 Thomas Deutschmann (RETIRED) gentoo-dev 2020-04-23 13:30:51 UTC

JFYI: At the moment I am the only active member left in Gentoo's mysql project. I don't use systemd and I don't use Galera cluster. So if you are waiting for "us" to do something, you will probably wait for a long time... sorry about that.

Patches are welcome.

Comment 10 hyrekin 2020-07-03 19:07:13 UTC

Not sure of the implications but this worked for me 

vi /usr/bin/galera_recovery 

#just commented out the safety checks entirely

# Safety checks
#if [ -n "$log_file" -a -f "$log_file" ]; then
#  [ "$euid" = "0" ] && chown $user $log_file
#      chmod 600 $log_file
#else
#  log "WSREP: mktemp failed"
#fi