All versions, topologies and architectures of Chef Automate.
When making a backup or initiating a restore in Chef Automate, you notice an error:
BackupError: An issue occurred when creating or restoring a backup: Streaming backup events failed: Unable to handle backup event: Backup failed. Check the Chef Automate logs for more information.
You check the logs and see errors like the following:
automate.chef.io hab: deployment-service.default(O): time="2020-05-10T20:00:00+01:00" level=error msg="Backup failed" backup_id=20200510100000 error="Failed to set status object to pending: Failed to commit .status: blob (key \"20200510100000/.status\") (code=Unknown): RequestCanceled: request context canceled\ncaused by: context deadline exceeded"
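To spot this failure across a large log, the signature in the line above can be matched programmatically. The following is a minimal sketch, not an official Chef tool: the function name `timed_out_backups` is hypothetical, and the matching strings are taken only from the example log line, so other failure modes will not be caught.

```python
import re

# Strings taken from the example log line above. This is an assumption about
# what the failure looks like; differently worded errors will not match.
BACKUP_FAILED = 'msg="Backup failed"'
TIMEOUT_HINT = "context deadline exceeded"
BACKUP_ID = re.compile(r"backup_id=(\d+)")


def timed_out_backups(lines):
    """Return the backup IDs of failed backups that mention a context deadline."""
    ids = []
    for line in lines:
        if BACKUP_FAILED in line and TIMEOUT_HINT in line:
            match = BACKUP_ID.search(line)
            # Keep the hit even if the ID field is missing from the line.
            ids.append(match.group(1) if match else "unknown")
    return ids
```

The function accepts any iterable of log lines, such as an open file handle or `sys.stdin`, so saved log output can be passed to it directly.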
When checking with your IT/security team, you find that an overly aggressive antivirus or security agent configuration is inspecting every connection and write that the Automate application makes. This slows the backup to the point that it may never complete or, if it does complete, the resulting backup may not be usable for a restore. An antivirus agent can also disrupt or corrupt a backup by attempting to read it while it is being written.
You can confirm that such an agent is running by listing the processes, for example with ps auxf:
root 1637 0.0 0.0 99964 7288 ? S May20 0:00 /opt/ds_agent/ds_agent -w /var/opt/ds_agent -b -i -e /opt/ds_agent/ext
root 1638 0.0 1.2 1498848 207008 ? Sl May20 1:44 \_ /opt/ds_agent/ds_agent -w /var/opt/ds_agent -b -i -e /opt/ds_agent/ext
root 2797 0.0 0.0 46600 2060 ? S May20 0:00 /opt/ds_agent/ds_am -g ../diag -v 5 -d /var/opt/ds_agent/am -P 1 -R
root 2821 0.0 0.8 2221704 135016 ? Sl May20 0:37 \_ /opt/ds_agent/ds_am -g ../diag -v 5 -d /var/opt/ds_agent/am -P 1 -R
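The same check can be scripted when several servers need to be inspected. This is a rough sketch under stated assumptions: the function name `find_agent_processes` is hypothetical, the agent names `ds_agent` and `ds_am` are simply those visible in the listing above, and the parser takes the first absolute path on each line to be the executable, which is a heuristic rather than a guarantee.

```python
import os

# Executable names seen in the process listing above. Extend this set for
# other agents; these two are the only ones confirmed by the example.
AGENT_NAMES = {"ds_agent", "ds_am"}


def find_agent_processes(ps_lines, agent_names=AGENT_NAMES):
    """Return (pid, executable) pairs for lines whose executable is a known agent."""
    hits = []
    for line in ps_lines:
        fields = line.split()
        # Skip the header row and anything without a numeric PID in column two.
        if len(fields) < 2 or not fields[1].isdigit():
            continue
        # Heuristic: treat the first absolute path after the PID as the executable.
        exe = next((f for f in fields[2:] if f.startswith("/")), None)
        if exe and os.path.basename(exe) in agent_names:
            hits.append((int(fields[1]), exe))
    return hits
```

The parser expects the PID in the second column, which holds for both `ps -ef` and `ps aux` style output, so the lines of either listing can be passed in as a list of strings.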
It is important to talk to your infrastructure and security teams about Chef Automate's requirements as an application, and to ensure that the security agent is deployed with a suitable configuration so that it does not interfere with the server's operations.
If you have an IPS (Intrusion Prevention System) endpoint configured, ensure that it is aware of the valid IP ranges in which your nodes reside. We have observed an IPS dropping packets (visible in dmesg logs) from services that previously used a loopback/internal load balancer, and recommend that the primary IP address and 127.0.0.1 are whitelisted.
If you utilise an antivirus agent (Cylance, Symantec, TrendMicro) that monitors/logs access to directory and file paths, we would recommend whitelisting the following: