Problem: My Chef Backend HA cluster is returning 500s in the UI and for most knife or chef-client actions.
If I run chef-server-ctl status, I see this:
-------------------
Internal Services
-------------------
run: bookshelf: (pid 1856) 1469280s; run: log: (pid 1849) 1469280s
run: haproxy: (pid 1859) 1469280s; run: log: (pid 1853) 1469280s
run: nginx: (pid 1868) 1469280s; run: log: (pid 1866) 1469280s
run: oc_bifrost: (pid 1858) 1469280s; run: log: (pid 1851) 1469280s
run: oc_id: (pid 1861) 1469280s; run: log: (pid 1850) 1469280s
run: opscode-erchef: (pid 1863) 1469280s; run: log: (pid 1854) 1469280s
run: redis_lb: (pid 1860) 1469280s; run: log: (pid 1852) 1469280s
-------------------
External Services
-------------------
down: elasticsearch: failed to connect to http://127.0.0.1:9200: end of file reached
down: postgresql: failed to connect to 127.0.0.1:5432: server closed the connection unexpectedly
That's bad. This means we are likely running a Chef Backend HA configuration and that the three-node backend cluster is down or inaccessible from this frontend. This could be a firewalling issue: all frontend nodes need access to TCP ports 5432 (PostgreSQL) and 9200 (Elasticsearch) on all three backend nodes.
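To rule out a firewall or network problem, you can test those ports from a frontend node. In the sketch below, backend1, backend2, and backend3 are placeholders for your own backend node names; any TCP connectivity check (nc, telnet, and so on) will do:

    nc -zv backend1 5432
    nc -zv backend1 9200

Repeat for backend2 and backend3. If the connections are refused or time out, there is a network or firewall problem between the frontend and backend nodes.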
The first thing to do in this case is to run chef-backend-ctl gather-logs on all three backend nodes and send the results to Support.
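For example, on each of the three backend nodes run the following, which should collect that node's logs into a single archive you can attach to the ticket:

    sudo chef-backend-ctl gather-logs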
If you are confident that you know which node was last the leader, attempt to force it to become leader again by running chef-backend-ctl force-leader on that node.
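If you are unsure which node was last the leader, chef-backend-ctl cluster-status run on any reachable backend node should report each node's role, which can help you confirm before forcing. A minimal sequence, run on the node you believe was last the leader:

    sudo chef-backend-ctl cluster-status
    sudo chef-backend-ctl force-leader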
The system may recover at this point. If it does not, make sure you get those logs to Support as soon as you can so they can take a look.