During an auto-update, or as part of a planned upgrade, you may find that the system upgrade takes a long time to complete or hangs. When you run `chef-automate upgrade status`, the system appears to be upgrading but is taking a long time.
You may observe messages like the following in the Automate logs:

```
2020-07-29T15:45:26.688022-05:00 chef-automate hab: ingest-service.default(O): time="2020-07-29T15:45:26-05:00" level=error msg="Failed initializing elasticsearch" error="Error creating index node-state-7 with error: elastic: Error 403 (Forbidden): blocked by: [FORBIDDEN/12/index read-only / allow delete (api)]; [type=cluster_block_exception]"
```
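To confirm that this read-only block is in place, you can query the embedded Elasticsearch directly. The port shown (10141) is Automate's default for its internal Elasticsearch; adjust it if your deployment differs:

```shell
# Show any index-level blocks; read_only_allow_delete appears when the
# flood-stage disk watermark has been exceeded.
curl -s 'localhost:10141/_all/_settings?filter_path=*.settings.index.blocks'

# Show per-node disk usage as Elasticsearch sees it.
curl -s 'localhost:10141/_cat/allocation?v'
```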
You may see references to schema migrations or data transfers within automate-elasticsearch like so:

```
2020-07-29T16:09:32.097993-05:00 chef-automate hab: authz-service.default(O): time="2020-07-29T16:09:32-05:00" level=info msg="Checking for data migrations..."
2020-07-29T16:09:32.121193-05:00 chef-automate hab: authz-service.default(O): time="2020-07-29T16:09:32-05:00" level=info msg="Checking for remaining schema migrations..."
2020-07-29T16:09:32.123784-05:00 chef-automate hab: authz-service.default(O): time="2020-07-29T16:09:32-05:00" level=info msg="DB initialization complete at version 77"
```
These messages indicate that the system is undertaking a data migration as part of an update. It is likely that the system has insufficient disk or RAM to perform the migration: the upgrade fails either because the disk fills up while data is replicated to an adjacent location, or because the migration consumes excessive resources.
2. Planning your upgrade:
If you want to ensure you are never in the position of receiving an unplanned upgrade that impacts your production environment, disable auto-updates; see https://automate.chef.io/docs/install/#disable-automatic-upgrades. We recommend maintaining a test environment into which you can side-load some backup data and which you leave to upgrade routinely. It does not need nodes actively reporting to it, but it should be similar to your production environment in data retention and resourcing. This will help you assess the downtime and impact of a potential upgrade.
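A minimal sketch of disabling auto-updates via a configuration patch (assuming the standard `chef-automate config patch` workflow described in the link above):

```shell
# Write a patch that sets the deployment upgrade strategy to "none",
# then apply it. Run as root on the Automate node.
cat > disable_auto_upgrades.toml <<'EOF'
[deployment.v1.svc]
  upgrade_strategy = "none"
EOF

chef-automate config patch disable_auto_upgrades.toml
```

With this in place, upgrades only occur when you run `chef-automate upgrade run` yourself.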
It is important to ensure you are notified of, and can test, any new update. You can subscribe to update announcements at https://discourse.chef.io/c/chef-release/9.
To date, Automate has shipped two releases that require attention if you are migrating from a version before either of these releases to a version after it:
- Compliance data schema migration: https://automate.chef.io/release-notes/?v=20190410001346
- Chef Actions event service migration (TBC but landing in a version in 202008)
Ensure that, in the event you need to extend your disk, you can reach your infrastructure/Unix team to do so.
If you are running an external Elasticsearch datastore, or are unsure how much disk you will actually use but have approximations of node count, report frequency, and average report size, speak to your Chef account team or ask in the Chef community about sizing.
Once you have a retention policy in place, evaluate your disk usage by running `du -sh /hab` and noting what /hab currently consumes. This will give you an indication of how much additional disk space you are likely to need for a migration.
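As a quick check of the remaining headroom on the partition holding /hab, the following sketch computes the free-space percentage with `df` (the fallback to `/` is only there so the script also runs on machines without a /hab directory):

```shell
#!/bin/sh
# Report the free-space percentage on the partition holding /hab.
# Falls back to / when /hab does not exist (e.g. on a workstation).
target=/hab
[ -e "$target" ] || target=/

# Column 5 of POSIX df output is "Capacity" (used %), so free = 100 - used.
free_pct=$(df -P "$target" | awk 'NR==2 { gsub("%", "", $5); print 100 - $5 }')
echo "Partition holding ${target} has ${free_pct}% free"

# Per the guidance below, aim for at least 50% free before a migration.
if [ "$free_pct" -lt 50 ]; then
  echo "WARNING: less than 50% free; consider extending the disk before upgrading."
fi
```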
As part of implementing your capacity planning for data retention, ensure that your Chef Automate systems have ample disk. We recommend keeping 50%-70% of the partition on which /hab is located free (depending on how your backups are configured) so that migrations can happen without issue. This means your disk usage should never exceed 30%-50%.
We also recommend ensuring the system meets all the requirements listed in Essential System Tunings and Prerequisite Checks for Chef Automate 2, as these may be tested during a data migration.
In addition to the above we recommend taking the following steps to ensure a painless experience:
- Ensure that your system has an appropriate amount of heap memory assigned to Elasticsearch: https://automate.chef.io/docs/configuration/#setting-elasticsearch-heap
- Schedule the upgrade as close to 00:01 UTC as possible to reduce the amount of data in the current day.
- Test the upgrade in a non-production environment prior to upgrading if you have more than a few GBs of data. Monitor your resource consumption to ensure you have enough throughput and, if necessary, allocate more resources to minimize the impact to your system.
- Disable other resource-intensive processes (such as backups, re-indexing, etc.) during the upgrade, or schedule them to run at a different time before or after the upgrade.
Test and time your upgrades on a non-production instance, and allow for this plus enough time to troubleshoot (2-3 hours) in a scheduled maintenance window.
When you run `chef-automate upgrade status`, the system appears to be upgrading but is taking a long time. If you have not accounted for extensive service downtime, or if this is an automatic upgrade, it is important to note that a data migration will render ingestion of new data and/or historical searches unusable until it is complete.
Using that command in conjunction with the output of both `chef-automate system-logs` and `curl localhost:10141/_cat/indices?v` should give you an indication of whether the system is currently migrating indices and how quickly it is working through the migration.
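For example, to poll the index listing on an interval and watch progress (again assuming the embedded Elasticsearch on its default port, 10141):

```shell
# Re-list indices every 30 seconds; document counts and store sizes on the
# new (e.g. re-versioned) indices should grow as the migration proceeds.
watch -n 30 "curl -s 'localhost:10141/_cat/indices?v' | sort"
```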
In the event that the upgrade errors out, you may check `chef-automate status` and notice that the ingest/event/compliance services have a short uptime and are unhealthy or repeatedly restarting:
```
Service Name         Process State   Health Check   Uptime (s)   PID
event-feed-service   running         CRITICAL       6            42797
ingest-service       running         unknown        6            42773

UnhealthyStatusError: System status is unhealthy: One or more services are unhealthy
```
If you run out of disk space during a migration, it is likely that the Elasticsearch indices have surpassed the 85% disk threshold, which triggers the FORBIDDEN read-only status on indices and prevents the system from writing further. See Chef Automate 2 degraded performance or disk full due to too much historical data for the fix.
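The article above covers the fix in detail; at a high level, once you have freed or added disk space, the block can be lifted with the standard Elasticsearch settings API (port 10141 is Automate's embedded default):

```shell
# After freeing disk space below the flood-stage watermark, remove the
# read_only_allow_delete block from all indices so writes can resume.
curl -XPUT 'localhost:10141/_all/_settings' \
  -H 'Content-Type: application/json' \
  -d '{"index.blocks.read_only_allow_delete": null}'
```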
If the system was unable to complete the upgrade because it is under-provisioned or misconfigured, you will need to adjust its heap size to cope with the overall load. See Chef Automate services down, Elasticsearch and Compliance service in CRITICAL state, web UI and client runs failing with 5xx errors due to insufficient Elasticsearch heap size.
If an auto-update is occurring on a virtualised Automate instance, you may wish to revert to an image snapshot and immediately disable auto-updates so that you can plan the upgrade. Remember that no new data should be ingested during the migration, so you will not lose additional data through this procedure.