Scaling Automate servers above hundreds of clients on 0.7.85 or 0.8.5

Sean Horn -


In a standard Automate install with 4 CPUs and 16GB RAM, we assume that the default config can handle a few hundred chef-clients checking in at 30-minute intervals, possibly while also running the audit cookbook and saving some other data types, which add load to the indexing system.

To scale beyond that, possibly up to several thousand chef-client systems checking in at 30-minute intervals, we assume that an external 3-node Elasticsearch cluster (version 2.3.1 or higher) running on local SSD disks is installed and already accessible from the Automate 0.7.85 or 0.8.5 install. Each member of the ES cluster should already have a 4GB JVM heap configured for its Elasticsearch JVM process. The Automate system should have a minimum of 16 CPUs and 32GB RAM available.

We will implement a tuning loop from here on. A tuning loop consists of changing one thing, monitoring the effects by picking a set of metrics and watching them closely, then adjusting the tuning to account for any new findings. Before any tuning occurs, gather a baseline: define helpful metrics, implement them in a monitoring system, confirm the monitoring system works by reading back the stored data to determine the baseline, and only then begin the tuning effort.
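
As one way to read back stored data into a baseline, here is a minimal sketch. It assumes the samples for a single metric have been exported to a plain text file with one number per line; the function name `baseline_summary` and that file layout are illustrative assumptions, not part of Automate or of any particular monitoring system.

```shell
#!/usr/bin/env bash
# Hypothetical helper: summarize the samples of one metric.
# Input: a file with one numeric sample per line (assumed layout).
# Output: a single line with count, min, mean, and max.
baseline_summary() {
  local file="$1"
  awk '
    NF == 0 { next }                          # skip blank lines
    n == 0  { min = $1 + 0; max = $1 + 0 }    # seed min/max from first sample
    {
      n++
      sum += $1
      if ($1 + 0 < min) min = $1 + 0
      if ($1 + 0 > max) max = $1 + 0
    }
    END {
      if (n == 0) { print "count=0"; exit }
      printf "count=%d min=%s mean=%.2f max=%s\n", n, min, sum / n, max
    }
  ' "$file"
}

# Example (hypothetical file name):
#   baseline_summary data-collector-queue.samples
```

Recording this summary before the first change gives each later pass of the loop something concrete to compare against.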


To determine the overall health of the combined Automate and external ES cluster, we assume and require that, at minimum, the following metrics are continuously monitored, in addition to the standard CPU, memory, disk usage, and IO stats on all involved systems:

  1. Chef Client Runs per Minute (CCR/M) from whatever Chef Servers are proxying to this Automate install
  2. Elasticsearch POST times, from the output of `tail /var/log/delivery/nginx/es_proxy.access.log | grep POST` on the Automate system
  3. data-collector queue length on the Automate system

       export PATH=$PATH:/opt/delivery/embedded/bin
       rabbitmqctl list_queues -p /insights | grep data-collector

  4. Disk queue length on the ES cluster members. This can be gathered with `iostat -x 2 -d sdb` on each ES cluster member, if the disk that member uses is named "sdb"
  5. ES JVM heap utilization, gathered with `curl -XGET 'http://localhost:9200/_nodes/stats/jvm?pretty'`. To discover the current heap usage of a single JVM process, use the following with that process's PID (55932 in this example). The second number on the Total line is the current heap usage in bytes.


su delivery -c '/opt/delivery/embedded/jre/bin/jcmd 55932 GC.class_histogram' | grep Total
Total     243932380    10484054440
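
The byte count on the Total line is hard to read at a glance. The sketch below converts it to GiB so it can be compared against the configured heap size. The function name `heap_gib_from_histogram` is an illustrative assumption; it only relies on the Total line having the shape shown above (the word Total, an instance count, then a byte count).

```shell
#!/usr/bin/env bash
# Hypothetical helper: read class-histogram text on stdin and print
# the total heap usage in GiB, taken from the third field of the
# line whose first field is "Total".
heap_gib_from_histogram() {
  awk '$1 == "Total" { printf "%.2f\n", $3 / (1024 * 1024 * 1024) }'
}

# Example, reusing the command above (the PID is a placeholder):
#   su delivery -c '/opt/delivery/embedded/jre/bin/jcmd 55932 GC.class_histogram' \
#     | heap_gib_from_histogram
```

For the sample output above, 10484054440 bytes works out to roughly 9.76 GiB.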


Scaling Logstash on the Automate system

1. Set the heap for the logstash processes to the following and reconfigure Automate

logstash['heap'] = "2g"

2. Adjust /opt/delivery/sv/logstash/run to reflect the following settings on an Automate system with 16 CPUs and 32GB RAM

-w 12
-b 512

3. Run the following script to create three new logstash processes.

for ii in $(seq 2 4); do
  # Copy the service directory and the logstash config directory
  cp -r /opt/delivery/sv/logstash/ /opt/delivery/sv/logstash$ii
  cp -r /opt/delivery/embedded/etc/logstash/conf.d /opt/delivery/embedded/etc/logstash/conf.d$ii
  # Remove the websocket output from the copied config
  rm /opt/delivery/embedded/etc/logstash/conf.d$ii/10-websocket-output.conf
  # Point the new service at its own config directory
  sed -i "s/conf\.d/conf.d$ii/" /opt/delivery/sv/logstash$ii/run
  # Register the new service
  ln -s /opt/delivery/sv/logstash$ii /opt/delivery/service/logstash$ii
  ln -s /opt/delivery/embedded/bin/sv /opt/delivery/init/logstash$ii
  # Give the new service its own log directory
  mkdir -p /var/log/delivery/logstash$ii
  sed -i "s/logstash/logstash$ii/" /opt/delivery/sv/logstash$ii/log/run
  delivery-ctl start logstash$ii
done
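
Before relying on the new processes, it may help to confirm that the script left the expected files in place. The following is a minimal sketch of such a check, using only the paths that appear in the script above. The function name `verify_logstash_copies` and its optional root-prefix argument (handy for trying it against a scratch directory) are illustrative assumptions.

```shell
#!/usr/bin/env bash
# Hypothetical helper: check the on-disk layout for logstash2..logstash4.
# Optional first argument: a root prefix (default is empty, meaning "/").
# Returns 0 if everything looks as expected, 1 otherwise.
verify_logstash_copies() {
  local root="${1:-}" ii path bad=0
  for ii in 2 3 4; do
    for path in \
      "$root/opt/delivery/sv/logstash$ii/run" \
      "$root/opt/delivery/embedded/etc/logstash/conf.d$ii" \
      "$root/var/log/delivery/logstash$ii"; do
      if [ ! -e "$path" ]; then
        echo "missing: $path"
        bad=1
      fi
    done
    # The script removes this file from each copied config directory.
    path="$root/opt/delivery/embedded/etc/logstash/conf.d$ii/10-websocket-output.conf"
    if [ -e "$path" ]; then
      echo "unexpected: $path"
      bad=1
    fi
  done
  return "$bad"
}

# Example: verify_logstash_copies && echo "layout looks complete"
```

This only inspects files; it does not tell you whether the services are running or processing events, which is what the metrics are for.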

Refer back to metrics

After the above tuning, watch all of the Automate-specific metrics above to determine whether the system can sustain the increased load or whether further tuning is necessary. This completes one pass of the tuning loop mentioned earlier.
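
One simple way to make that decision repeatable is to compare a post-tuning reading against the baseline with an agreed tolerance. The sketch below does only that arithmetic. The function name `within_baseline` and the idea of a percentage tolerance are illustrative assumptions; what counts as acceptable for each metric is for you to decide.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeed (exit 0) if the current value is no more
# than pct percent above the baseline value, otherwise fail (exit 1).
# Usage: within_baseline CURRENT BASELINE PCT
within_baseline() {
  local current="$1" baseline="$2" pct="$3"
  awk -v c="$current" -v b="$baseline" -v p="$pct" \
    'BEGIN { exit !(c <= b * (1 + p / 100)) }'
}

# Example: treat a queue length up to 20% above baseline as acceptable.
#   within_baseline 110 100 20 && echo "ok" || echo "needs another pass"
```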
