Summary
A disruptive network outage in your Chef environment can interrupt communication between cluster components to the point that their shared view of the cluster topology becomes invalid.
This can surface in several ways: 50x error responses from Knife commands or Chef Infra Client runs, or log entries on one of the Chef Backend followers showing that write transactions are being blocked:
/var/log/chef-backend/leaderl/current
2020-08-26_12:44:18.89331 ERROR: cannot execute UPDATE in a read-only transaction
2020-08-26_12:44:18.89333 STATEMENT: UPDATE nodes SET environment= $1, policy_name= $2, policy_group= $3, last_updated_by= $4, updated_at= $5, serialized_object= $6 WHERE id= $7
2020-08-26_12:44:22.63549 LOG: incomplete startup packet
/var/log/opscode/opscode-erchef/current
2020-08-26T09:11:32Z erchef@127.0.0.1 method=PUT; path=/organizations/res/nodes/chef-client.chef.io; status=500; req_id=g3IAA2QAEGVyY2hlZkAxMjcuMC4wLjEDAAFEgQn0AAEAAAAA; org_name=res; msg={error,{<<"25006">>,<<"cannot execute UPDATE in a read-only tra"...>>}}; couchdb_groups=false; couchdb_organizations=false; couchdb_containers=false; couchdb_acls=false; 503_mode=false; couchdb_associations=false; couchdb_association_requests=false; req_time=108; rdbms_time=0; rdbms_count=4; solr_time=44; solr_count=1; authz_time=2; authz_count=1; user=cacl002.prod.res.ldc.tcinfra.com; req_api_version=1;
/var/log/opscode/opscode-erchef/crash.log
2020-08-26 08:23:48 =ERROR REPORT====
{<<"method=DELETE; path=/organizations/res/nodes/chef-client.chef.io; status=500; ">>,{throw,{delete_from_db,{error,{<<"25006">>,<<"cannot execute DELETE in a read-only transaction">>}}},[{oc_chef_object_db,maybe_delete_authz_id_or_error,3,...
Distribution
Product           | Version | Topology
Chef Infra Server | 12.x+   | Frontend
Chef Backend      | 2.x+    | Cluster
Process
Plan
Preparation: N/A
Design: N/A
Configure
Evaluation: N/A
Application: N/A
Troubleshoot
Analysis:
To verify cluster health after a network outage or communication breakdown, first confirm that all the nodes can talk to each other. On a backend cluster node, run:
chef-backend-ctl status
chef-backend-ctl cluster-status
Both commands should report green health, confirming that the cluster has healed and is synchronised. At this point, take note of the IP address of the current leader.
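For reference, healthy chef-backend-ctl cluster-status output looks broadly like the following (the names, IPs, and GUIDs here are illustrative, and exact columns can vary by version); the Role column identifies the leader:

Name       IP              GUID                              Role      PG        ES
backend-1  192.168.33.215  dc0c6ea77a8de0277371f9f943108a8b  leader    leader    not_master
backend-2  192.168.33.216  008782c59d3628b6bb7f43556ac0c66c  follower  follower  master
backend-3  192.168.33.217  1425f10a5ae0771c9afa85bb40bfb1e0  follower  follower  not_master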
Now check the log entries on both follower nodes under /var/log/chef-backend/leaderl/current. If you observe attempted writes such as:
2020-08-26_12:44:18.89331 ERROR: cannot execute UPDATE in a read-only transaction
then you can assume that the frontends still regard that follower as the leader and are incorrectly routing write traffic to it.
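A quick way to scan for these entries on each follower:

grep "read-only transaction" /var/log/chef-backend/leaderl/current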
Remediation:
The remediation is to perform a rolling restart of the services on each frontend:
chef-server-ctl restart-services
This should discard the stale follower IP and update the configuration with the current leader IP. You can then verify that Chef is working again by checking the logs, running a knife command or a Chef Infra Client run, or watching your monitoring as the 50x errors subside.
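For example, running Chef Infra Client on a managed node exercises the same node update (PUT) that was previously failing, while tailing the erchef log on a frontend shows whether 500 responses are still occurring:

On a managed node:
chef-client

On a frontend:
tail -f /var/log/opscode/opscode-erchef/current | grep "status=5"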
Appendix
Related Articles: N/A
Further Reading: N/A