Bug 629360 - dev-db/repmgr-3.3.2 switchover breaks replication cluster
Summary: dev-db/repmgr-3.3.2 switchover breaks replication cluster
Status: UNCONFIRMED
Alias: None
Product: Gentoo Linux
Classification: Unclassified
Component: Current packages
Hardware: AMD64 Linux
Importance: Normal major
Assignee: Robin Johnson
URL:
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2017-08-30 09:18 UTC by Chris Travers
Modified: 2019-02-12 12:19 UTC
CC List: 2 users

See Also:
Package list:
Runtime testing required: ---


Attachments
patch to symlink correct binary name to path. (repmgr-misname-fix.patch,576 bytes, patch)
2017-08-30 09:18 UTC, Chris Travers

Description Chris Travers 2017-08-30 09:18:14 UTC
Created attachment 491090
patch to symlink correct binary name to path.

Scope of Bug Report:
----------------------
I have also filed an upstream bug with a different scope: it concerns longer-term strategies for hardening the software against this sort of problem. That report aims at avoiding problems when distributions do what is done here, but it requires disruptive changes. In the shorter term this is a distribution problem, and my bug report here aims at fixing the misbehaviour that the ebuild causes by violating unspoken upstream assumptions.

PostgreSQL dependencies used:
-----------------------------
PostgreSQL 10 beta 3

How to reproduce:
------------------
1. Set up a replication cluster (I used three pgdata directories running on different ports on the same box).  Make sure wal_log_hints is enabled so that you can use pg_rewind.
2. Set up repmgr and register a master and different replicas.  I used three different .conf files in the postgres user's home directory for this.
3. Make sure the repmgr user can access the replica systems.  In my case I just set up postgres to have passphraseless access via ssh to its own account on localhost.
4. Execute repmgr standby switchover against one of the replicas.

Expected result:
--------------
Repmgr is supposed to run through safety checks, promote the standby, then connect over ssh, shut down postgres, rewind the master, and reconnect it as a replica.

What happens:
-------------
Repmgr runs through its safety checks, promotes the standby, connects over ssh and tries to demote the master, but fails with an error indicating "repmgr: no such file or directory".

What causes the problem:
------------------------
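Only repmgr10 is on the path, as a symlink to the repmgr file in the PostgreSQL bin directory.  There is no symlink named repmgr, whereas all the other binaries in that directory have a symlink carrying the name of the original file.  When repmgr connects over ssh and tries to run itself under its original name, the command is not found.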
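
To illustrate the lookup failure described above, here is a minimal sketch in Python.  It does not touch a real installation: the directory names (postgresql-10/bin, usr-bin) and the stub script are assumptions made up for the demonstration, and the last step only mimics what the attached patch is described as doing (adding a symlink with the original binary name).

```python
import os
import shutil
import tempfile


def build_layout(root):
    """Create a fake PostgreSQL bin dir and a PATH dir that holds only repmgr10."""
    pg_bin = os.path.join(root, "postgresql-10", "bin")  # hypothetical location
    path_dir = os.path.join(root, "usr-bin")             # stands in for a PATH entry
    os.makedirs(pg_bin)
    os.makedirs(path_dir)

    # Stub executable standing in for the real repmgr binary.
    real = os.path.join(pg_bin, "repmgr")
    with open(real, "w") as fh:
        fh.write("#!/bin/sh\nexit 0\n")
    os.chmod(real, 0o755)

    # The situation from the report: only the versioned name is on the path.
    os.symlink(real, os.path.join(path_dir, "repmgr10"))
    return real, path_dir


root = tempfile.mkdtemp()
real, path_dir = build_layout(root)

# Looking up the original name fails; the versioned name resolves.
found_before = shutil.which("repmgr", path=path_dir)
found_versioned = shutil.which("repmgr10", path=path_dir)

# Mimic the fix: add a symlink that carries the original binary name.
os.symlink(real, os.path.join(path_dir, "repmgr"))
found_after = shutil.which("repmgr", path=path_dir)

print("repmgr before fix:", found_before)
print("repmgr10:", found_versioned)
print("repmgr after fix:", found_after)

shutil.rmtree(root)
```

The same check can be made on a live system by looking up both names on the path of the account that repmgr reaches over ssh.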

Impacts:
---------
Because repmgr cannot find itself, it fails hard and dies.  This leaves the replicated cluster in an extremely inconsistent, possibly split-brain, state.

The solution is to supply the attached patch.
Comment 1 Chris Travers 2017-09-01 10:14:06 UTC
The upstream bug report is worth noting here since a different approach might be to apply the patch.  However discussion and so forth there may be valuable in triaging, evaluating and ultimately deciding what to do here.

https://github.com/2ndQuadrant/repmgr/issues/323
Comment 2 Tomáš Mózes 2019-02-12 12:19:17 UTC
(In reply to Chris Travers from comment #1)
> The upstream bug report is worth noting here since a different approach
> might be to apply the patch.  However discussion and so forth there may be
> valuable in triaging, evaluating and ultimately deciding what to do here.
> 
> https://github.com/2ndQuadrant/repmgr/issues/323

Seems like it's fixed upstream, but we probably need to bump the version?