Summary
If you have not configured a Data Lifecycle policy, or have not configured one appropriately, the amount of historical data in a Chef Automate 2 instance will grow to a point that will, at minimum, seriously degrade performance and is likely to cause a disk-full event, bringing all services down.
Distribution
Product | Version | Topology |
Chef Automate | 2 | All |
Process
Plan
Preparation: Evaluate your host instance(s) sizing needs according to the published minimum system requirements as well as our scaling documentation (Chef Automate: Deployment Planning and Performance tuning, transcribed from 'Scaling Chef Automate Beyond 100,000 nodes').
Configure monitoring and alerting on disk-usage thresholds that will provide ample time to request more disk space and/or rotate logs, and consider whether adequate local disk space is in place for data migrations during upgrades. We recommend keeping as much as 50% of the disk free if backups are directed to a remote share/location, or 70% free if a local backup location is part of the implementation (a minimal check is sketched at the end of this Plan section).
Design: If your needs seem greater than a standalone Chef Automate instance can sustain, multi-node topologies should also be considered. Contact your account team for further information on deploying a clustered instance of Chef Automate.
Once you have established the amount of storage you require to cover both updates and data retention, visit https://automate.chef.io/docs/data-lifecycle/ to assess your data lifecycle policies and whether they meet your compliance requirements. This will ensure that you are in a good position to configure the policy in the next section.
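As a minimal sketch of the kind of disk check referred to above (the 70% threshold, the /hab mount point, and the use of logger are assumptions; adapt them to your own monitoring and alerting stack), something like the following could be run from cron:
USED=$(df --output=pcent /hab | tail -n 1 | tr -dc '0-9')
if [ "$USED" -ge 70 ]; then
  echo "WARNING: /hab is ${USED}% full on $(hostname)" | logger -t disk-check
fi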
Configure
Evaluation: A decision will have to be made, per your business and organizational needs, about what kind of data you need to retain and for how long; this might involve multiple teams and stakeholders. Also understand what kind of retention is currently being applied in your configuration.
Application: Data Lifecycle policies and configuration are covered in our documentation here: https://automate.chef.io/docs/data-lifecycle/
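As a quick sketch of how to inspect the currently applied retention settings (the admin API token and automate.example.com FQDN are placeholders, and payload shapes can differ between Automate releases, so treat the linked documentation as authoritative):
# export TOKEN=<ADMIN_API_TOKEN>
# curl -s -H "api-token: $TOKEN" https://automate.example.com/api/v0/data-lifecycle/status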
Troubleshoot
Analysis: If appropriate planning and configuration were not followed, you will end up with too many open Elasticsearch indices, which will slow the system down or bring it to a complete halt. This issue can present itself in a variety of ways. When checking the status of Chef Automate, you might see:
# chef-automate status
automate-elasticsearch running CRITICAL
compliance-service running CRITICAL
Chef Infra Client runs could also be failing at the end when trying to send up audit reports:
ERROR: Server returned error 504 for https://CHEF_SERVER_FQDN/organizations/ORGNAME/data-collector, retrying 1/5 in 4s
ERROR: Server returned error 504 for https://CHEF_SERVER_FQDN/organizations/ORGNAME/data-collector, retrying 2/5 in 6s
ERROR: Server returned error 500 for https://CHEF_SERVER_FQDN/organizations/ORGNAME/data-collector, retrying 3/5 in 9s
ERROR: Server returned error 504 for https://CHEF_SERVER_FQDN/organizations/ORGNAME/data-collector, retrying 4/5 in 18s
ERROR: Server returned error 504 for https://CHEF_SERVER_FQDN/organizations/ORGNAME/data-collector, retrying 5/5 in 33s
INFO: HTTP Request Returned 504 Gateway Time-out:
INFO: Error while reporting run start to Data Collector. URL: https://CHEF_SERVER_FQDN/organizations/ORGNAME/data-collector Exception: 504 -- 504 "Gateway Time-out" (This is normal if you do not have Chef Automate)
In this case, comment out the data-collector configuration found in /etc/opscode/chef-server.rb on your Chef Server and run chef-server-ctl reconfigure to prevent the data-collector failures. Afterwards, your chef-client runs should begin succeeding. You will need to undo this step after the Automate 2 system has been returned to service:
# In /etc/opscode/chef-server.rb, comment out the data-collector and profiles settings:
# data_collector['root_url'] = 'https://automate.example.com/data-collector/v0/'
# data_collector['proxy'] = true
# profiles['root_url'] = 'https://automate.example.com'
# Then, reconfigure:
chef-server-ctl reconfigure
Or in the Chef Automate logs:
# chef-automate system-logs
automate-load-balancer.default(O): 2020/02/12 14:14:27 [alert] 94694#0: *899387 1024 worker_connections are not enough while connecting to upstream, client: ::ffff:10.16.78.113, server: AUTOMATE_FQDN, request: "GET /data-collector/v0/ HTTP/1.1", upstream: "https://10.16.78.116:2000/events/data-collector/", host: "AUTOMATE_FQDN"
automate-load-balancer.default(O): 2020/02/12 19:34:26 [error] 11775#0: *1341 upstream timed out (110: Connection timed out) while reading response header from upstream, client: ::ffff:10.192.44.51, server: AUTOMATE_FQDN, request: "POST /data-collector/v0/ HTTP/1.0", upstream: "https://10.192.44.50:2000/events/data-collector/", host: "data-collector"
automate-elasticsearch.default(O): org.elasticsearch.action.UnavailableShardsException: [comp-3-profiles][4] primary shard is not active Timeout: [1m], request: [BulkShardRequest [[comp-3-profiles][4]] containing [index {[comp-3-profiles][_doc][47ffc32c7255a1fab872bb15696817386f0b2496e627972ca42b91f96b8b9354], source[n/a, actual length: [345kb], max length: 2kb]}] and a refresh]
automate-elasticsearch.default(O): org.elasticsearch.action.search.SearchPhaseExecutionException: all shards failed
es-sidecar-service.default(O): time="2020-01-30T02:02:32+01:00" level=error msg="Disk free below critical threshold" avail_bytes=73728 host=127.0.0.1 mount="/hab (/dev/mapper/rhel_rhel7-lv_hab)" threshold_bytes=536870912 total_bytes=107344822272
ingest-service.default(O): time="2018-05-16T00:10:09Z" level=error msg="Message failure" error="rpc error: code = Internal desc = elastic: Error 403 (Forbidden): blocked by: [FORBIDDEN/12/index read-only / allow delete (api)]; [type=cluster_block_exception] elastic: Error 403 (Forbidden): blocked by: [FORBIDDEN/12/index read-only / allow delete (api)]; [type=cluster_block_exception]"
Remediation: To resolve this issue, we need to clear out many of these old indices (exactly how many depends on the system), but there are some prerequisite steps to take first.
If some essential system tunings have not been applied, those need to be set up as well. This is covered in the following article:
Essential System Tunings and Pre-Remediation checks for Chef Automate 2
If you are (or were) low on or out of disk space on the partition where /hab resides, grow it to at least three times what you think you reasonably need, since a considerable amount of disk space is temporarily consumed when backups are taken (an example of growing an LVM-backed volume is sketched below, after the read-only reset). Afterwards, set the Elasticsearch indices back to read/write:
# curl -s -XPUT 'localhost:10144/_all/_settings' -H 'Content-Type: application/json' -d'
{
"index.blocks.read_only_allow_delete": null
}
'
(See also https://docs.chef.io/automate/troubleshooting/#recovering-from-low-disk-conditions)
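If you do need to grow the /hab filesystem as described above, a minimal sketch for an LVM-backed volume follows; the device name is taken from the example log output earlier, and the size, device, and filesystem layout are assumptions to adjust for your environment (-r also grows the filesystem):
# lvextend -r -L +200G /dev/mapper/rhel_rhel7-lv_hab
# df -h /hab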
Next, raise the file handle limit for the Chef Automate services:
# systemctl stop chef-automate
# mkdir -p /etc/systemd/system/chef-automate.service.d
# cat <<EOF >> /etc/systemd/system/chef-automate.service.d/custom.conf
[Service]
LimitNOFILE = 128000
EOF
# systemctl daemon-reload
# systemctl start chef-automate
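To confirm the override took effect, check the unit's effective limit:
# systemctl show chef-automate --property=LimitNOFILE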
Now, we can do the real work of mitigating this issue, which is to delete old data/indices. We'll first need to tell Elasticsearch to allow us to use wildcards in deletion operations:
# curl -s -XPUT 'localhost:10144/_cluster/settings?pretty' -H 'Content-Type: application/json' -d'
{
"transient" : {
"action.destructive_requires_name" : false
}
}
'
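You can optionally confirm that the transient setting was applied:
# curl -s 'localhost:10144/_cluster/settings?pretty'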
Then, get an idea of your indices with something like:
# curl -s -XGET 'localhost:10144/_cat/indices?v' | tail -n +2 | awk '{print $3}' | sort
You should see a lot of dated comp- and converge- indices. Work out which previous months fall entirely outside of the current 60-day period and start deleting the corresponding indices; for example, this will delete all daily compliance and converge indices from November 2019:
# curl -s -XDELETE 'localhost:10144/*2019.11*'
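If several months fall outside that window, a short loop can save repetition (the months listed here are only examples; double-check the list before running destructive deletes):
# for month in 2019.09 2019.10 2019.11; do curl -s -XDELETE "localhost:10144/*${month}*"; done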
After you've made a first pass at deletion, tell Elasticsearch to attempt to reassign all shards with the following:
# curl -s -XPOST 'localhost:10144/_cluster/reroute?retry_failed&pretty'
And watch the allocation happen:
# watch -n 5 curl -s -XGET 'localhost:10144/_cat/allocation?v'
The numbers stacked under the leftmost shards column should be changing. Once they're approximately equal, your system should be healthy again. If allocation halts or becomes prohibitively slow, try deleting more historical indices and then restarting the services:
# systemctl stop chef-automate
# systemctl start chef-automate
Then re-issue the cluster reroute and watch the allocation once again. These procedures resolve the vast majority of Chef Automate partial and full outages.
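As an additional sanity check (assuming Elasticsearch is listening on the same localhost:10144 port used throughout this article), you can poll the cluster health and wait for the status to leave red:
# curl -s 'localhost:10144/_cluster/health?pretty'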
Next Steps
Once you have plenty of disk space and Chef Automate is running again, go ahead and set up data retention (see the Configure section above) so that the amount of data saved by Automate can be controlled.
Appendix
Related Articles:
Essential System Tunings and Pre-Remediation checks for Chef Automate 2
Likewise, if your Elasticsearch heap size is too low, that could block this mitigation as well. That is covered in the following article:
Further Reading:
A good rule of thumb to follow if you want to have an extremely responsive system is to allocate something like 3GB of RAM to the Elasticsearch cluster per day of history you'd like to retain.
Elastic recommends roughly 20 shards per GB of Java Heap. Automate creates 30 or more shards per day, at 10 per index. We have seen customers push past this recommendation, but performance degrades across various services unpredictably as allocation approaches 750MB per day of history in a busy system. If your allocation is approaching 1 GB per day of history and you're experiencing performance issues, it's time to reduce the amount of history you retain or increase the amount of RAM assigned to your Elasticsearch installation.
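As a rough worked example combining these figures: retaining 10 days of history at roughly 30 shards per day is about 300 shards, which at Elastic's 20-shards-per-GB guideline calls for roughly 15 GB of Java heap; since Elastic recommends giving the heap no more than about half of the RAM available to Elasticsearch, that works out to around 30 GB of RAM, consistent with the 3 GB-per-day rule of thumb above.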