All versions, architectures and topologies of Chef Server still running the Reporting and/or Analytics add-ons
If you are still running the Reporting add-on and Chef Client runs fail without an obvious cause, but your logging captures errors like those below, this article is for you. Note that this behaviour can also manifest as high writes per second on your Postgres service/datastore, if your monitoring detects an appreciable rise above the performance baseline.
While Chef Server otherwise operates normally, you may notice that the /var/log/opscode/opscode-erchef/current log reports a steadily increasing queue size:
2019-03-01_17:31:33.21974 [info] Queue Monitor current length = 10891 for VHost "/analytics" and Queue "alaska"
2019-03-01_17:32:03.13027 [info] Queue Monitor current length = 10898 for VHost "/analytics" and Queue "alaska"
2019-03-01_17:32:33.12863 [info] Queue Monitor current length = 10904 for VHost "/analytics" and Queue "alaska"
2019-03-01_17:33:03.21106 [info] Queue Monitor current length = 10907 for VHost "/analytics" and Queue "alaska"
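To watch the trend without reading full log lines, you can extract just the timestamp and queue length from the Queue Monitor entries. A minimal sketch, with two sample lines inlined for illustration; in practice, grep the live log at /var/log/opscode/opscode-erchef/current instead:

```shell
# Sample Queue Monitor lines in the format logged by opscode-erchef; in a real
# diagnosis, replace this with:
#   grep 'Queue Monitor current length' /var/log/opscode/opscode-erchef/current
log='2019-03-01_17:31:33.21974 [info] Queue Monitor current length = 10891 for VHost "/analytics" and Queue "alaska"
2019-03-01_17:32:03.13027 [info] Queue Monitor current length = 10898 for VHost "/analytics" and Queue "alaska"'

# Print timestamp and length only, so growth over time is obvious at a glance.
printf '%s\n' "$log" | awk '/Queue Monitor current length/ {print $1, $8}'
```

A steadily rising second column confirms the queue is backing up rather than merely fluctuating.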
If left unattended, the queue may reach 100% of its available capacity, at which point it starts to drop messages:
2019-03-01_15:42:33.12641 [info] Queue Monitor current length = 10000 for VHost "/analytics" and Queue "alaska"
2019-03-01_15:42:33.12644 [warning] Queue Monitor has detected RabbitMQ for VHost "/analytics", queue "alaska" capacity at 100.0%
2019-03-01_15:42:33.12646 [warning] Queue Monitor has dropped 2 messages for VHost "/analytics", queue "alaska" since last check due to queue limit exceeded
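The capacity figure in the warning is simply the current length divided by the queue's configured maximum; the log above, showing length 10000 at 100.0%, implies a maximum of 10000 messages on this server. A small sketch of that arithmetic:

```shell
# Reproduce the monitor's percentage: current length over configured maximum.
# Values taken from the 100.0% log line above.
length=10000
max=10000
awk -v l="$length" -v m="$max" 'BEGIN {printf "capacity: %.1f%%\n", l/m*100}'
```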
A full alaska queue can indicate that the Analytics endpoint is unreachable or not functioning. This can also have a knock-on impact on the overall performance of Chef Server's Postgres database, because the queue shares the same underlying disk (which suffers under higher concurrent reads/writes); that contention may then cascade into the normal operations supporting Chef Client runs. You may be able to confirm that impact from entries in /var/log/opscode/postgresql/9.2/current:
2019-02-28_21:39:07.45313 STATEMENT: SELECT end_node_run($1, $2, $3, $4, $5, $6, $7)
2019-02-28_21:39:31.00808 ERROR: canceling statement due to statement timeout
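To gauge how often Postgres is cancelling queries, you can count the statement-timeout errors. A minimal sketch with the sample line inlined; in practice, point grep at /var/log/opscode/postgresql/9.2/current:

```shell
# Sample Postgres log line from above; in a real diagnosis, replace this with:
#   grep -c 'canceling statement due to statement timeout' \
#     /var/log/opscode/postgresql/9.2/current
pglog='2019-02-28_21:39:31.00808 ERROR: canceling statement due to statement timeout'

# Count how many statements were cancelled by the timeout.
printf '%s\n' "$pglog" | grep -c 'canceling statement due to statement timeout'
```

A count that grows in step with the queue backlog supports the disk-contention theory above.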
Remove the Reporting add-on
If you are unsure whether you will need the data later, you can stop Reporting and remove its configuration while leaving the data intact:
# Remove the Reporting endpoint from the Chef Clients' view:
# send SIGHUP to the nginx load balancer so it reloads its configuration
chef-server-ctl hup nginx
# stop the opscode-reporting service
chef-server-ctl stop opscode-reporting
The Reporting service becomes unavailable in a way that Chef Clients detect, so they begin ignoring it and continue their runs without it.
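Afterwards, you can confirm the queue pressure is relieved by checking that the Queue Monitor has stopped emitting "has dropped" warnings. A sketch using a hypothetical post-fix sample line; in practice, tail the live log:

```shell
# Hypothetical healthy sample line for illustration; in a real check, use:
#   tail -n 200 /var/log/opscode/opscode-erchef/current
recent='2019-03-01_18:02:33.10001 [info] Queue Monitor current length = 120 for VHost "/analytics" and Queue "alaska"'

# Count dropped-message warnings in the sample; 0 means no messages were lost
# in the window examined. (|| true guards grep's non-zero exit on no matches.)
drops=$(printf '%s\n' "$recent" | grep -c 'has dropped' || true)
echo "dropped-message warnings in sample: $drops"
```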