After updating one of the Galera cluster nodes from mariadb-10.2.22 to mariadb-10.2.29, the systemd service fails to start.
Here's what I see in the logs:
sh: /usr/bin/galera_recovery: line 71: /tmp/wsrep_recovery.Y7g0X2: Permission denied
systemd: mariadb.service: Control process exited, code=exited, status=1/FAILURE
systemd: mariadb.service: Failed with result 'exit-code'.
systemd: Failed to start MariaDB 10.2.29 database server.
I diffed the service file and the galera_recovery script from the previous version and there don't seem to be any relevant differences. (Maybe this bug comes from upstream?)
If MariaDB is not configured as a Galera cluster, it starts without problems.
/tmp is 777, managed by systemd-tmpfiles.
I was able to reproduce this bug in 2 different clusters (different Gentoo installations), so it should be reproducible with this configuration.
After re-emerging version 10.2.22, it starts correctly.
Can you please try 10.2.30?
(In reply to Tomáš Mózes from comment #1)
> Can you please try 10.2.30?
Hi Tomáš, there is no such version in portage. Are you suggesting a version bump?
It's probably enough to copy the 10.2.29 ebuild to your local overlay as 10.2.30.
(In reply to Tomáš Mózes from comment #3)
> It's probably enough to copy the 10.2.29 ebuild to your local overlay as
I made the version bump and nothing changed.
So I tried to mangle the galera_recovery script until I got it working.
First I commented out the lines with chown and chmod on the tmp file (105-106):
# [ "$euid" = "0" ] && chown $user $log_file
# chmod 700 $log_file
Then I got a different error:
WSREP: Failed to start mysqld for wsrep recovery: '/usr/bin/galera_recovery: line 71: ./sbin/mysqld: No such file or directory'
So in that line I changed ./sbin/mysqld to mysqld.
That way it finally started correctly. (Of course with some security problems for the file permissions?)
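For anyone who wants to repeat those two edits without touching the installed file, here is a rough sketch that applies them to a copy. The function name patch_copy and the paths in the usage line are made up for illustration, and the sed patterns only assume the chown/chmod lines and the ./sbin/mysqld call look like the ones quoted in this comment. It is the same insecure workaround, just scripted, not a fix.

```shell
#!/bin/sh
# Hypothetical helper (not part of MariaDB): write a patched COPY of the
# recovery script, leaving the installed file untouched.
#   $1 = path to the original script, $2 = path for the patched copy
patch_copy() {
  src=$1
  dst=$2
  # 1) comment out the chown on the temp log file
  # 2) comment out the chmod on the temp log file
  # 3) call mysqld via PATH instead of ./sbin/mysqld
  sed -e '/chown \$user \$log_file/s/^/# /' \
      -e '/chmod [0-7]* \$log_file/s/^/# /' \
      -e 's#\./sbin/mysqld#mysqld#' \
      "$src" > "$dst"
}
```

Usage would be something like `patch_copy /usr/bin/galera_recovery /root/galera_recovery.patched`, followed by a diff of the two files before deciding whether to use the copy.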
As I said before, the diff from the script from version 10.2.22 shows nothing like the changes I made so I still don't get why the new version doesn't work.
BTW I think you should be able to try the script locally with a single-node cluster using the following my.cnf options:
(or refer to the Gentoo wiki Galera cluster page)
I'm not sure if a single-node cluster still triggers the script. I'll try in a few.
I currently use OpenRC, so cannot test right away :-/
I found out that what seemed to me a trivial difference between the scripts actually matters: changing back
in the galera_recovery script made things work again.
I don't know the difference between these two binaries; I just know the first comes from the dev-db/mariadb package and the second comes from dev-db/mysql-connector-c.
I leave this as a workaround for those who encounter this same problem, but I think the maintainer should have it checked.
BTW I confirm what I said in my previous comment: just adding those two lines to the my.cnf configuration lets you test the script even on a single node.
This problem persists in dev-db/mariadb-10.2.31
(thanks for the previous fix that kept things rolling for the last 4 months :D)
No idea what changed in the latest patch but the "fix" to change the my_print_defaults doesn't seem to be working (now gives an error about --mysql parameter being bad).
There's something "broken" about the mktemp call here; it seems to be mktemp itself that's returning the permission denied(??).
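One way to narrow this down: the original log reports the "Permission denied" at line 71, which is the same line that runs mysqld, so the failure may come from opening the file there rather than from mktemp. Below is a small diagnostic sketch that runs the steps one at a time and reports the first one that fails. The name probe_tmp and the wsrep_probe prefix are made up, it is not part of galera_recovery, and it only tells you something useful if run as the same user and in the same environment as the service's recovery step.

```shell
#!/bin/sh
# Hypothetical diagnostic (not part of galera_recovery): create a temp file,
# chmod it, then write to it through a redirection, reporting the first
# step that fails. $1 = directory to test (defaults to /tmp).
probe_tmp() {
  dir=${1:-/tmp}
  f=$(mktemp "$dir/wsrep_probe.XXXXXX" 2>/dev/null) || {
    echo "mktemp failed"
    return 1
  }
  if ! chmod 600 "$f"; then
    echo "chmod failed"
    rm -f "$f"
    return 1
  fi
  # Subshell so a failed redirection cannot abort the calling shell.
  if ! ( echo probe > "$f" ) 2>/dev/null; then
    echo "write failed"
    rm -f "$f"
    return 1
  fi
  rm -f "$f"
  echo "write ok"
}
```

If this prints "write failed" while mktemp succeeds, the problem is with reopening the file, not with creating it. The chown step from the real script is left out here because it needs root.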
My current (probably insecure on a shared machine) bodge (because I need my cluster to work) to make this work is to just change line 28 (and not do the previous stuff about changing my_print_defaults any more):
log_file=/tmp/wsrep_recovery ; cd /usr
It complains that mktemp failed (some check I didn't bother to decode) but at least it seems to start back up and rejoin the cluster now.
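The "mktemp failed" complaint is probably just the script's safety check (quoted further down in this thread) noticing that the fixed path does not exist yet as a regular file. A minimal sketch of that check, assuming the log message sits in the else branch of the if, and using a stand-in log function since the real one isn't shown here:

```shell
#!/bin/sh
# Stand-in for the script's own log helper (assumption: it prints a message).
log() {
  echo "$*"
}

# Sketch of the safety check: the log file must be a non-empty name that
# points at an existing regular file, otherwise "mktemp failed" is logged.
# The chown from the real script is omitted because it needs root.
safety_check() {
  log_file=$1
  if [ -n "$log_file" ] && [ -f "$log_file" ]; then
    chmod 600 "$log_file"
  else
    log "WSREP: mktemp failed"
  fi
}
```

With a hard-coded log_file=/tmp/wsrep_recovery nothing has created the file before the check runs, so the else branch fires even though mktemp was never called. Under this reading the message is cosmetic.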
Be nice to figure out what's actually broken here so I don't have to bodge my database startup every time a new release gets emerged, but I'm a little busy so just dumping information here for now.
Also, before anyone panics: wsrep_cluster_size drops to 0 after the upgrade from 10.2 to 10.4, but eventually returns to whatever it should be for your cluster size. There are various other unanswered observations about this, and even while wsrep_cluster_size is zero, changes seem to be replicating between the nodes. Just not the sort of additional panic I needed on top of already bodging things :P
JFYI: At the moment I am the only one still active, left in Gentoo's mysql project. I don't use systemd and I don't use galera cluster. So if you are waiting for "us" to do something, you will probably wait for a long time... sorry about that.
Patches are welcome.
Not sure of the implications but this worked for me
# just commented out the safety checks entirely
# Safety checks
#if [ -n "$log_file" -a -f "$log_file" ]; then
# [ "$euid" = "0" ] && chown $user $log_file
# chmod 600 $log_file
#else
# log "WSREP: mktemp failed"
#fi