Chef Backend Cluster - Full manual recovery

Sean Horn

In a Chef Backend High Availability deployment, the etcd service is extremely sensitive to disk and/or network latency and can get into a bad state across the backend nodes. When this happens, the cluster often cannot fail over or recover automatically. Initial attempts to recover should follow this general pattern (a command-level sketch follows the list):

1. Take a filesystem-level backup
2. Remove the internal lock file that prevents promotion of the node that was most recently the leader. The leaderl service places this file automatically at /var/opt/chef-backend/leaderl/data/no-start-pgsql when it demotes a leader.
3. Promote the node that was most recently the leader: run curl http://127.0.0.1:2379/v2/keys/cb/leader/last_leader to identify it, then run chef-backend-ctl force-leader on that node.
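
A minimal command sketch of those three steps, run on the node that was most recently the leader. The tar destination below is only an example; take the filesystem-level backup with whatever tooling fits your environment:

# 1. Filesystem-level backup (example destination; adjust as needed)
tar -czf /root/chef-backend-backup-$(date +%F).tar.gz /var/opt/chef-backend
# 2. Remove the lock file that blocks promotion of the former leader
rm /var/opt/chef-backend/leaderl/data/no-start-pgsql
# 3. Confirm this node is recorded as the last leader, then promote it
curl http://127.0.0.1:2379/v2/keys/cb/leader/last_leader
chef-backend-ctl force-leader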

Sometimes, when the followers have been down too long, this procedure is not enough, and you will see the following in the followers' /var/log/chef-backend/postgresql/VERSION/current log files:

2018-04-25_16:36:29.42242 FATAL:  the database system is starting up
2018-04-25_16:36:30.90058 LOG:  started streaming WAL from primary at 16F3/2D000000 on timeline 88
2018-04-25_16:36:30.90124 FATAL:  could not receive data from WAL stream: ERROR:  requested WAL segment 00000058000016F30000002D has already been removed
2018-04-25_16:36:30.90125

In this case, you will need to resync the followers from the leader using a full basebackup procedure, because the WAL segments the followers need have already been rotated away on the leader. Issue the following on one follower node at a time:

# Stop the leaderl cluster-management service on this follower
chef-backend-ctl stop leaderl
# Check the current cluster state
chef-backend-ctl cluster-status
# Resync PostgreSQL from the leader with a full basebackup (substitute the leader's IP for LAST_LEADER_IP)
PSQL_INTERNAL_OK=true chef-backend-ctl pgsql-follow --force-basebackup --verbose LAST_LEADER_IP
# Restart services on this follower
chef-backend-ctl start
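
Once the follower has resynced and its services are back up, run chef-backend-ctl cluster-status again and confirm the node looks healthy before repeating the procedure on the next follower.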