In a Chef Backend High Availability deployment, the etcd service is very sensitive to disk and/or network latency and can get into a bad state across backend nodes. When this happens, the cluster is often unable to fail over or recover automatically. Initial recovery attempts should follow this general pattern (a combined shell sketch follows the list):
- Take a filesystem-level backup.
- Remove the internal lock file that prevents promotion of what was the last leader. This file is placed automatically by `leaderl` at `/var/opt/chef-backend/leaderl/data/no-start-pgsql` when it demotes a leader.
- Promote what was the most recent leader (run `curl http://127.0.0.1:2379/v2/keys/cb/leader/last_leader` to identify it, then `chef-backend-ctl force-leader` on that node).
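A minimal sketch of those three steps as shell commands, assuming a default installation layout; the backup destination and `tar` invocation here are illustrative, not a Chef-supplied procedure:

```bash
# 1. Filesystem-level backup of the backend state directory (illustrative destination;
#    adjust to your own backup tooling and take it on every backend node)
tar -czf /tmp/chef-backend-$(hostname)-$(date +%F).tar.gz /var/opt/chef-backend

# 2. On the node that was most recently the leader, remove the lock file that leaderl
#    drops when it demotes a leader, so that node is allowed to be promoted again
rm /var/opt/chef-backend/leaderl/data/no-start-pgsql

# 3. Confirm which node etcd recorded as the last leader, then force-promote on that node
curl http://127.0.0.1:2379/v2/keys/cb/leader/last_leader
chef-backend-ctl force-leader
```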
Sometimes, when the followers have been down too long, this procedure is not enough, and you will see the following in each follower's `/var/log/chef-backend/postgresql/VERSION/current` log file:

```
2018-04-25_16:36:29.42242 FATAL: the database system is starting up
2018-04-25_16:36:30.90058 LOG: started streaming WAL from primary at 16F3/2D000000 on timeline 88
2018-04-25_16:36:30.90124 FATAL: could not receive data from WAL stream: ERROR: requested WAL segment 00000058000016F30000002D has already been removed
2018-04-25_16:36:30.90125
```
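To check whether a given follower has hit this condition, you can search its PostgreSQL log for that error (the `VERSION` directory name varies by release, hence the glob):

```bash
# Run on each follower; a match means the WAL segments this follower needs
# have already been rotated away on the leader
grep "has already been removed" /var/log/chef-backend/postgresql/*/current
```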
In this case, because the WAL entries have already rotated away, you will need to resync the followers from the leader using a full basebackup. Issue the following on one follower node at a time, where `LAST_LEADER_IP` is the IP address of the node you promoted above:
```bash
chef-backend-ctl stop leaderl
chef-backend-ctl cluster-status
PSQL_INTERNAL_OK=true chef-backend-ctl pgsql-follow --force-basebackup --verbose LAST_LEADER_IP
chef-backend-ctl start
```
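Once the basebackup completes and services are restarted, it is worth confirming that the follower has rejoined replication before moving on to the next node. A minimal check, using the same tools and log path referenced above:

```bash
# Confirm all services are running and the node reports as a follower again
chef-backend-ctl status
chef-backend-ctl cluster-status

# Confirm PostgreSQL is streaming WAL from the leader again (VERSION varies by release)
tail -n 20 /var/log/chef-backend/postgresql/*/current
```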