Abstract
A common failure state for a Chef Automate deployment, especially one running a non-HA, single-node Elasticsearch cluster, is that the single Elasticsearch node is asked to persist more data than it can handle while remaining performant. The exact threshold is difficult to pin down, as it depends on individual system specifications, but if your standalone Automate instance is approaching more than 75 days' worth of historical compliance/converge data without data lifecycle policies in place, it will almost certainly hit this failure condition.
When this happens, Elasticsearch's memory footprint grows, excessive file handles are consumed, and other services are blocked from starting. Typically, these symptoms are observed after a restart of the system or its services, because the issue is exposed when Elasticsearch must scan through and assign the entirety of its shards/data at startup; this is distinct from normal operation, where the cluster is already up and ingesting new data piecemeal.
This can manifest in a number of ways; some common ones follow.
Checking status of Chef Automate:
# chef-automate status
automate-elasticsearch running CRITICAL
compliance-service running CRITICAL
Chef Infra Client runs failing at the end when trying to send up audit reports:
ERROR: Server returned error 504 for https://CHEF_SERVER_FQDN/organizations/ORGNAME/data-collector, retrying 1/5 in 4s
ERROR: Server returned error 504 for https://CHEF_SERVER_FQDN/organizations/ORGNAME/data-collector, retrying 2/5 in 6s
ERROR: Server returned error 500 for https://CHEF_SERVER_FQDN/organizations/ORGNAME/data-collector, retrying 3/5 in 9s
ERROR: Server returned error 504 for https://CHEF_SERVER_FQDN/organizations/ORGNAME/data-collector, retrying 4/5 in 18s
ERROR: Server returned error 504 for https://CHEF_SERVER_FQDN/organizations/ORGNAME/data-collector, retrying 5/5 in 33s
INFO: HTTP Request Returned 504 Gateway Time-out:
INFO: Error while reporting run start to Data Collector. URL: https://CHEF_SERVER_FQDN/organizations/ORGNAME/data-collector Exception: 504 -- 504 "Gateway Time-out" (This is normal if you do not have Chef Automate)
Any of the following occurring in Chef Automate logs:
# chef-automate system-logs
automate-load-balancer.default(O): 2020/02/12 14:14:27 [alert] 94694#0: *899387 1024 worker_connections are not enough while connecting to upstream, client: ::ffff:10.16.78.113, server: AUTOMATE_FQDN, request: "GET /data-collector/v0/ HTTP/1.1", upstream: "https://10.16.78.116:2000/events/data-collector/", host: "AUTOMATE_FQDN"
automate-load-balancer.default(O): 2020/02/12 19:34:26 [error] 11775#0: *1341 upstream timed out (110: Connection timed out) while reading response header from upstream, client: ::ffff:10.192.44.51, server: AUTOMATE_FQDN, request: "POST /data-collector/v0/ HTTP/1.0", upstream: "https://10.192.44.50:2000/events/data-collector/", host: "data-collector"
automate-elasticsearch.default(O): org.elasticsearch.action.UnavailableShardsException: [comp-3-profiles][4] primary shard is not active Timeout: [1m], request: [BulkShardRequest [[comp-3-profiles][4]] containing [index {[comp-3-profiles][_doc][47ffc32c7255a1fab872bb15696817386f0b2496e627972ca42b91f96b8b9354], source[n/a, actual length: [345kb], max length: 2kb]}] and a refresh]
automate-elasticsearch.default(O): org.elasticsearch.action.search.SearchPhaseExecutionException: all shards failed
You may also see 500 errors returned from the web UI.
Additionally, as part of this type of failure, it is likely that a low-disk event will co-occur, triggering Elasticsearch to set indices to read-only:
es-sidecar-service.default(O): time="2020-01-30T02:02:32+01:00" level=error msg="Disk free below critical threshold" avail_bytes=73728 host=127.0.0.1 mount="/hab (/dev/mapper/rhel_rhel7-lv_hab)" threshold_bytes=536870912 total_bytes=107344822272
ingest-service.default(O): time="2018-05-16T00:10:09Z" level=error msg="Message failure" error="rpc error: code = Internal desc = elastic: Error 403 (Forbidden): blocked by: [FORBIDDEN/12/index read-only / allow delete (api)]; [type=cluster_block_exception] elastic: Error 403 (Forbidden): blocked by: [FORBIDDEN/12/index read-only / allow delete (api)]; [type=cluster_block_exception]"
It's also likely that Elasticsearch's Java environment will have run up against its heap size limit:
automate-elasticsearch.default(O): [2020-01-29T08:19:59,400][INFO ][o.e.m.j.JvmGcMonitorService] [7WTg1Hc] [gc][old][295][16] duration [27.9s], collections [3]/[30.4s], total [27.9s]/[29.1s], memory [3.7gb]->[3.7gb]/[3.8gb], all_pools {[young] [195.6mb]->[227.1mb]/[266.2mb]}{[survivor] [30.9mb]->[0b]/[33.2mb]}{[old] [3.5gb]->[3.5gb]/[3.5gb]}
automate-elasticsearch.default(O): java.lang.OutOfMemoryError: Java heap space
Out of memory: Kill process 9103 (java) score 595 or sacrifice child
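To confirm that shard assignment is indeed the bottleneck, you can ask Elasticsearch for its cluster health (this assumes the default Automate-internal Elasticsearch port of 10144, which is used throughout the commands below):
# curl -s -XGET 'localhost:10144/_cluster/health?pretty'
A "status" of red or yellow combined with a large "unassigned_shards" count is consistent with the failure described above.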
Remediation
To mitigate this issue and stabilize Automate, take the following actions.
System Requirements
If you have not met or exceeded the Chef Automate system requirements, address that first: https://automate.chef.io/docs/system-requirements/
Revert indices to read-write
If you are, or were, low on or out of disk space on the partition where /hab resides, grow it to at least threefold what you think you reasonably need; when backups are taken, a considerable amount of disk space is temporarily consumed. Afterwards, set the Elasticsearch indices back to read/write:
# curl -s -XPUT 'localhost:10144/_all/_settings' -H 'Content-Type: application/json' -d'
{
"index.blocks.read_only_allow_delete": null
}
'
(See also https://docs.chef.io/automate/troubleshooting/#recovering-from-low-disk-conditions)
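One way to verify that the read-only block has been lifted is to filter the index settings; if the command below prints nothing, no index still carries the flag:
# curl -s -XGET 'localhost:10144/_all/_settings?pretty' | grep read_only_allow_delete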
Essential system tunings
Very importantly, ensure that the following system tunings are in place:
# sysctl vm.max_map_count
vm.max_map_count = 262144
# sysctl vm.dirty_expire_centisecs
vm.dirty_expire_centisecs = 20000
If they are not, set them:
# sysctl -w vm.max_map_count=262144
# sysctl -w vm.dirty_expire_centisecs=20000
And add them to /etc/sysctl.conf so that they persist through system reboots:
# tail -n 2 /etc/sysctl.conf
vm.max_map_count=262144
vm.dirty_expire_centisecs=20000
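If you add the lines to /etc/sysctl.conf without also running sysctl -w, they can be loaded immediately, without a reboot, with:
# sysctl -p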
Increase file descriptors
Next, raise the file handle limit for the Automate services:
# systemctl stop chef-automate
# mkdir -p /etc/systemd/system/chef-automate.service.d
# cat <<EOF >> /etc/systemd/system/chef-automate.service.d/custom.conf
[Service]
LimitNOFILE = 128000
EOF
# systemctl daemon-reload
# systemctl start chef-automate
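To confirm the override took effect, query the unit's effective limit; it should report the value set above:
# systemctl show chef-automate --property=LimitNOFILE
LimitNOFILE=128000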
Elasticsearch heap space
Then, increase the heap space available to Elasticsearch. It should be set to 50% of system memory, not to exceed 26GB (see https://www.elastic.co/guide/en/elasticsearch/reference/current/heap-size.html); the 16g below is an example for a host with 32GB of RAM, so adjust it to your system:
# cat <<EOF >> ~/heapsize-patch.toml
[elasticsearch.v1.sys.runtime]
heapsize = "16g"
EOF
# chef-automate config patch ~/heapsize-patch.toml
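Once the patch has been applied and Elasticsearch has restarted, one way to confirm the new heap size is the _cat/nodes API; heap.max should reflect the configured value:
# curl -s -XGET 'localhost:10144/_cat/nodes?v&h=name,heap.current,heap.percent,heap.max'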
Prune old data
Up until now, we've merely been giving Elasticsearch room to breathe in terms of system resources so that we can do the real work of mitigating this issue, which is to delete old data/indices. We'll first need to tell Elasticsearch to allow us to use wildcards in deletion operations:
# curl -s -XPUT 'localhost:10144/_cluster/settings?pretty' -H 'Content-Type: application/json' -d'
{
"transient" : {
"action.destructive_requires_name" : false
}
}
'
Then, get an idea of your indices with something like:
# curl -s -XGET 'localhost:10144/_cat/indices?v' | tail -n +2 | awk '{print $3}' | sort
You should see a lot of dated comp- and converge- indices. Work out roughly which previous months fall outside of your desired retention window (for example, the most recent 60 days) and start deleting the corresponding indices. For instance, this deletes all daily compliance and converge indices from November 2019:
# curl -s -XDELETE 'localhost:10144/*2019.11*'
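If you want to prioritize deletions by how much disk they will reclaim, the _cat/indices API can also sort by on-disk size:
# curl -s -XGET 'localhost:10144/_cat/indices?v&s=store.size:desc'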
After you've made a first pass at deletion, tell Elasticsearch to attempt to reassign all shards with the following:
# curl -s -XPOST 'localhost:10144/_cluster/reroute?retry_failed&pretty'
And watch the allocation happen:
# watch -n 5 curl -s -XGET 'localhost:10144/_cat/allocation?v'
The numbers in the leftmost shards column should be changing; once they are approximately equal, your system should be healthy again. If allocation halts or becomes prohibitively slow, try deleting more historical indices and restart the services with
# systemctl stop chef-automate
# systemctl start chef-automate
then re-issue the cluster reroute and watch allocation once again. These procedures resolve the vast majority of Automate partial and full outages.
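If you set action.destructive_requires_name to false earlier, note that it was applied as a transient setting and will not survive a full restart of Elasticsearch; you can also clear it explicitly once cleanup is complete:
# curl -s -XPUT 'localhost:10144/_cluster/settings?pretty' -H 'Content-Type: application/json' -d'
{
"transient" : {
"action.destructive_requires_name" : null
}
}
'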
Once your system is stabilized, carefully review and implement sound data lifecycle policies to avoid this situation in the future: https://docs.chef.io/automate/data_lifecycle/
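As a rough sketch only (the exact endpoints and payload shape depend on your Automate version, so treat the linked documentation as the source of truth), retention can be configured through Automate's data lifecycle API with an admin API token; the retention values below (30 days of converge history, 90 days of compliance reports) are illustrative placeholders, not recommendations:
# export TOKEN=<your admin API token>
# curl -s -H "api-token: $TOKEN" -H "Content-Type: application/json" -XPUT 'https://AUTOMATE_FQDN/api/v0/data-lifecycle/config' -d'
{
"infra": {
"job_settings": [
{ "name": "periodic_purge_timeseries", "disabled": false,
"recurrence": "FREQ=DAILY;DTSTART=20200101T000000Z;INTERVAL=1",
"purge_policies": { "elasticsearch": [
{ "policy_name": "converge-history", "older_than_days": 30, "disabled": false },
{ "policy_name": "actions", "older_than_days": 30, "disabled": false }
] } }
]
},
"compliance": {
"job_settings": [
{ "name": "periodic_purge", "disabled": false,
"recurrence": "FREQ=DAILY;DTSTART=20200101T000000Z;INTERVAL=1",
"purge_policies": { "elasticsearch": [
{ "policy_name": "compliance-reports", "older_than_days": 90, "disabled": false },
{ "policy_name": "compliance-scans", "older_than_days": 90, "disabled": false }
] } }
]
}
}
'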