Chef Server Older Than 12.3.0 Unable to Start With RabbitMQ Reindexing

Sean Horn

Summary:

In releases of Chef Server older than 12.3.0, the system shipped with the /analytics queue uncapped. This means that if the linked Analytics server stops consuming messages, the /analytics queue on the Chef Server can grow without bound. RabbitMQ handles this scenario by first using memory up to its configured high-memory watermark, then spilling unconsumed messages to disk. By default, there is no upper bound on the disk space used.
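One way to check whether the /analytics queue is already accumulating a backlog is to query it with the embedded rabbitmqctl. This is a sketch: /opt/opscode/embedded/bin/rabbitmqctl is the usual location on a standalone install, and the command may need to run as root so rabbitmqctl can read the Erlang cookie.

# List every queue in the /analytics vhost with its message backlog and memory use
/opt/opscode/embedded/bin/rabbitmqctl list_queues -p /analytics name messages memory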

In a standalone Chef Server install, this disk location is /var/opt/opscode/rabbitmq/db/rabbit@localhost.
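To gauge how large the on-disk backlog has grown, check the size of that directory (assuming the standalone path above):

# Show the total on-disk size of the RabbitMQ message store
du -sh /var/opt/opscode/rabbitmq/db/rabbit@localhost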

A Chef Server with a very large RabbitMQ queue (tens of GBs) will never be able to start back up and run correctly, because opscode-erchef depends on a clean and fast start of RabbitMQ.

If the RabbitMQ system is busy reindexing messages and has not started up yet, the RabbitMQ logfile at /var/opt/opscode/rabbitmq/log/rabbit@localhost.log will look like the following. Compare this to the next example and notice that the broker never reaches the point of starting a listening TCP socket; it is stuck rebuilding indices.

=INFO REPORT==== 7-May-2015::17:26:16 ===
msg_store_transient: using rabbit_msg_store_ets_index to provide index

=INFO REPORT==== 7-May-2015::17:26:16 ===
msg_store_persistent: using rabbit_msg_store_ets_index to provide index

=WARNING REPORT==== 7-May-2015::17:26:16 ===
msg_store_persistent: rebuilding indices from scratch

By contrast, if the RabbitMQ system is ready to accept connections from the rest of the system and is not stuck reindexing multiple GBs of old unconsumed messages from disk, the log should look like the following. Notice how the reindexing portion of this startup was very quick, and the system immediately started a listening socket.


=WARNING REPORT==== 28-Apr-2016::00:27:22 ===
msg_store_persistent: rebuilding indices from scratch
=INFO REPORT==== 28-Apr-2016::00:27:22 ===
started TCP Listener on 0.0.0.0:5672

 

Solution:

# Stop all Chef Server services
chef-server-ctl stop
# Remove the on-disk RabbitMQ database, discarding the unconsumed message backlog
rm -r /var/opt/opscode/rabbitmq/db/rabbit@localhost
# Recreate the RabbitMQ vhosts, users, and queues; also reconfigure the Reporting add-on if installed
chef-server-ctl reconfigure
opscode-reporting-ctl reconfigure
# Bring all services back up
chef-server-ctl start
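To confirm the restart came up cleanly, you can check the service status and look for the TCP listener line shown in the healthy startup example above (a quick check, not a required part of the procedure):

# All services should report "run"
chef-server-ctl status
# RabbitMQ should reach its listening socket almost immediately after the index rebuild
grep "started TCP Listener" /var/opt/opscode/rabbitmq/log/rabbit@localhost.log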

You may need the redis_lb password if redis_lb has been configured to require authenticated connections. Use the following command to retrieve it.

grep -A1 redis_lb /etc/opscode/private-chef-secrets.json

Because search-index updates on these releases also flow through RabbitMQ, removing its database discards any queued index updates, so the affected organizations need to be reindexed. Enable 503 mode for the Chef Server API while the reindex runs.

/opt/opscode/embedded/bin/redis-cli -a REDIS_PASSWORD -p 16379 HSET dl_default 503_mode true
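You can verify the flag took effect with the corresponding HGET (same redis-cli binary and port; REDIS_PASSWORD is the value retrieved above):

# Should return "true" while 503 mode is enabled
/opt/opscode/embedded/bin/redis-cli -a REDIS_PASSWORD -p 16379 HGET dl_default 503_mode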

If you need to reindex all organizations rather than a single one, a full reindex example can be found in the How Can I Reindex All Orgs In My Chef Server article; it can be used in place of the single-org reindex below (see also the sketch after the command).

chef-server-ctl reindex ORGNAME
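As a minimal inline sketch of the all-orgs case (assuming chef-server-ctl org-list is available on your release and lists every organization name):

# Reindex every organization on the Chef Server, one at a time
for org in $(chef-server-ctl org-list); do
  chef-server-ctl reindex "$org"
done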

Disable 503 mode for the Chef Server API.

/opt/opscode/embedded/bin/redis-cli -a REDIS_PASSWORD -p 16379 HDEL dl_default 503_mode

Finally, to prevent this issue from happening again until you upgrade to Chef Server 12.3.0 or later, you will want to cap your Chef Server's /analytics queue using the set_policy command described in the following article: Cap the /analytics queue
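A minimal sketch of such a policy is shown below. The policy name, queue pattern, and 10,000-message limit are illustrative values, not necessarily the exact policy that ships in Chef Server 12.3.0; check the linked article for the recommended settings, and note that the command typically needs to run as root so the embedded rabbitmqctl can read the Erlang cookie.

# Cap every queue in the /analytics vhost at 10,000 messages; the oldest messages are dropped once the cap is hit
/opt/opscode/embedded/bin/rabbitmqctl set_policy -p /analytics max_length ".*" '{"max-length": 10000}' --apply-to queues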

 

You can find the release announcement for Chef Server 12.3.0 below. That release added a default cap on the size of the /analytics queue. Additionally, from that release on, there is automated backpressure on the /analytics queue: if the monitoring daemon notices that the /analytics queue is not being consumed, events are no longer published to the queue, and it will not grow beyond a reasonable upper bound.

https://www.chef.io/blog/2015/11/12/chef-server-12-3-0-release-announcement/

 

 
