Environment
All versions, architectures and topologies of Chef Server still running the Reporting and/or Analytics add-ons
Issue
If you are still running the Reporting add-on and observe Chef Client runs failing without an obvious cause, but your logging catches errors like those below, then this article is for you. Note that this behaviour can also manifest as high writes-per-second on your Postgres service/datastore, visible as an appreciable increase above your monitoring's performance baseline.
Whilst Chef Server is otherwise operating normally, you may notice the queue beginning to back up, with an increasing queue length reported in the /var/log/opscode/opscode-erchef/current log:
2019-03-01_17:31:33.21974 [info] Queue Monitor current length = 10891 for VHost "/analytics" and Queue "alaska"
2019-03-01_17:32:03.13027 [info] Queue Monitor current length = 10898 for VHost "/analytics" and Queue "alaska"
2019-03-01_17:32:33.12863 [info] Queue Monitor current length = 10904 for VHost "/analytics" and Queue "alaska"
2019-03-01_17:33:03.21106 [info] Queue Monitor current length = 10907 for VHost "/analytics" and Queue "alaska"
If left unattended, the queue may reach 100% of its available capacity, at which point it will start to drop messages:
2019-03-01_15:42:33.12641 [info] Queue Monitor current length = 10000 for VHost "/analytics" and Queue "alaska"
2019-03-01_15:42:33.12644 [warning] Queue Monitor has detected RabbitMQ for VHost "/analytics", queue "alaska" capacity at 100.0%
2019-03-01_15:42:33.12646 [warning] Queue Monitor has dropped 2 messages for VHost "/analytics", queue "alaska" since last check due to queue limit exceeded
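To confirm how far the backlog has progressed, you can search the erchef log (the default path shown above) for the Queue Monitor entries:

grep "Queue Monitor" /var/log/opscode/opscode-erchef/current | tail -n 20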
Cause
The alaska queue becoming full can indicate that the Analytics endpoint is not addressable or not functioning. This can have an upstream impact on the overall performance of Chef Server's Postgres database, because the queue shares the same underlying disk (which suffers under higher concurrent reads/writes), and that contention may begin to cascade into the normal operations supporting Chef Client runs. You may be able to see that impact as entries in /var/log/opscode/postgresql/9.2/current:
2019-02-28_21:39:07.45313 STATEMENT: SELECT end_node_run($1, $2, $3, $4, $5, $6, $7)
2019-02-28_21:39:31.00808 ERROR: canceling statement due to statement timeout
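A quick way to count these timeouts, assuming the default Postgres log location shown above, is:

grep -c "canceling statement due to statement timeout" /var/log/opscode/postgresql/9.2/current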
Resolution
Remove the Reporting add-on
The Reporting add-on is EOL and should be removed. See: How to uninstall the Reporting add-on
If you are unsure whether you will need the data in future, you can stop Reporting and remove its configuration while leaving the data intact:
# Remove the Reporting endpoint from the chef-clients' view
rm /var/opt/opscode/nginx/etc/addon.d/*-reporting_*.conf
# Restart the nginx load balancer
chef-server-ctl hup nginx
# Stop the opscode-reporting service
chef-server-ctl stop opscode-reporting
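To verify, chef-server-ctl status should now show opscode-reporting as down while nginx and the other services remain up:

chef-server-ctl status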
The Reporting endpoint becomes unavailable in a way that Chef Clients detect, so they will stop attempting to send report data to it.
Remove Analytics and RabbitMQ configuration
Analytics is also EOL and should be removed. You should find properties in /etc/opscode/chef-server.rb similar to:
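For reference, a typical Analytics integration adds settings along the lines of the sketch below; the values are placeholders based on the usual Analytics install pattern rather than your actual configuration, so match them against what is in your own chef-server.rb:

dark_launch['actions'] = true                 # enables publishing actions to the /analytics vhost
rabbitmq['vip'] = '10.0.0.50'                 # placeholder address for the RabbitMQ/Analytics host
rabbitmq['node_ip_address'] = '0.0.0.0'
oc_id['applications'] ||= {}
oc_id['applications']['analytics'] = {
  'redirect_uri' => 'https://analytics.example.com/'   # placeholder Analytics FQDN
}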
If this configuration was put in place as part of a Chef Compliance v1 Server setup, you should also comment out the following configuration:
After removing or commenting out those lines, stop all of the services, remove the RabbitMQ database, reconfigure Chef Server, and then start the services, as shown below.
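A minimal sequence for those steps, assuming the embedded RabbitMQ database lives at its default path of /var/opt/opscode/rabbitmq/db, looks like this:

# Stop all Chef Server services
chef-server-ctl stop
# Remove the embedded RabbitMQ database (default path; adjust if yours differs)
rm -rf /var/opt/opscode/rabbitmq/db
# Pick up the edited /etc/opscode/chef-server.rb
chef-server-ctl reconfigure
# Bring the services back up
chef-server-ctl start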
At this point you will have stopped Chef Server writing to the alaska queue. The queue can then be removed using rabbitmqctl, or simply left alone.
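If you choose to remove it, one option is to delete the /analytics vhost, which takes the alaska queue with it. The example below assumes the embedded rabbitmqctl that ships with Chef Server under /opt/opscode/embedded/bin:

# Run as root on the Chef Server; deleting the vhost also deletes its queues
/opt/opscode/embedded/bin/rabbitmqctl delete_vhost /analytics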
Restart Chef Application
Lastly, restart the Chef Server services on the instance:
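For a standard install that is a single command:

chef-server-ctl restart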