When using Chef Backend HA, you may observe an increase in CPU and memory consumption over time due to high ingest traffic or an increase in API calls (for example, knife search) as a result of cookbook changes:
05:37:29 up 35 days, 23:05, 0 users, load average: 10.40, 9.87, 9.29
2020-08-04_10:29:50.46850 [2020-08-04T05:29:50,468][INFO ][o.e.m.j.JvmGcMonitorService] [42e6ba4893112e1f78fa5c07298795de] [gc] overhead, spent [267ms] collecting in the last [1s]
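A quick way to confirm this pattern is to watch system load alongside the Elasticsearch GC messages. The sketch below is illustrative only; the log path assumes the default Chef Backend runit layout (`/var/log/chef-backend/<service>/current`) and may differ on your installation.

```bash
# Check load average and free memory on each backend node
uptime
free -m

# Count Elasticsearch GC-overhead messages in the current service log
# (log path is an assumption based on the default Chef Backend runit layout)
grep -c 'JvmGcMonitorService' /var/log/chef-backend/elasticsearch/current
```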
Assuming the underlying infrastructure is virtualised and running close to or at capacity, you may observe a failover of the overall cluster leader, or the PostgreSQL leader may fail to write to a client due to a broken pipe, as shown in the log and chef-backend command output below:
2020-08-04_10:30:39.08084 LOG: could not send data to client: Broken pipe
2020-08-04_10:30:39.08087 FATAL: connection to client lost
Name IP GUID Role PG ES Blocked Eligible
chefbe01 10.10.10.1 cd203a6f484bcdfad0d371f46a055d83 leader leader not_master not_blocked true
chefbe02 10.10.10.2 28c39a63d593c62d62ee6555578a98bd waiting_for_leader unknown not_master not_blocked false
chefbe03 10.10.10.3 42e6ba4893112e1f78fa5c07298795de waiting_for_leader unknown master not_blocked false
Service Local Status Time in State Distributed Node Status
leaderl running (pid 1613) 35d 23h 5m 16s leader: 1; waiting: 2; follower: 0; total: 3
epmd running (pid 1612) 35d 23h 5m 16s status: local-only
etcd running (pid 1599) 35d 23h 5m 16s health: green; healthy nodes: 3/3
postgresql running (pid 11542) 2d 7h 36m 19s leader: 1; offline: 2; syncing: 0; synced: 0
elasticsearch running (pid 1603) 35d 23h 5m 18s state: green; nodes online: 3/3
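The two outputs above come from the standard cluster inspection commands; running them on any backend node shows the current topology and per-service state:

```bash
# Show cluster membership, the elected leader, and the PG/ES role of each node
chef-backend-ctl cluster-status

# Show local service health and distributed state for this node
chef-backend-ctl status
```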
| Chef Backend | All versions | Cluster |
Liaise with your infrastructure team to understand how Chef Backend will be deployed. Chef Backend should be treated as a database cluster, with requirements to maintain availability above those of resource balancing. Whilst it is unlikely that you can pin the BE nodes directly to specific CPUs on specific ESX hosts, there are some more viable options:
- DRS migration threshold set to its most conservative setting, a value of 1
- Storage DRS (SDRS) rules that require the underlying VMDKs to remain in a given datastore, or DRS rules that tie a VM to a given subset of hosts
- DRS Virtual Machines to Hosts (VM-Host) rules, whereby the backend cluster nodes may each be tied to different hosts but within the same group.
- Disable DRS for these nodes and instead ask that they only be moved in conjunction with your team in a clearly defined maintenance window.
All of these rules minimise disruption in the event of resource rebalancing activities.
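If your infrastructure team uses `govc` (the vSphere CLI from the govmomi project), the existing DRS rules and groups can be reviewed from the command line before agreeing on any of the options above. This is a sketch only: the cluster name `ChefBackendCluster` and the connection variables are placeholders, and your team may prefer PowerCLI or the vSphere UI instead.

```bash
# Connection details for govc (placeholders; supplied by your VMware team)
export GOVC_URL='https://vcenter.example.com'
export GOVC_USERNAME='readonly@vsphere.local'
export GOVC_PASSWORD='changeme'

# List the existing DRS affinity/anti-affinity and VM-Host rules for the cluster
govc cluster.rule.ls -cluster ChefBackendCluster

# List the VM and host groups those rules reference
govc cluster.group.ls -cluster ChefBackendCluster
```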
Whilst you cannot be responsible for the requirements placed on your infrastructure team as governed by your organisation, you can establish the requirements of the Chef Backend HA cluster and attempt to ensure its stability.
Allow your infrastructure team time to return with a viable plan, and discuss what monitoring, if any, can be made available to understand the role the infrastructure plays in the high availability of the application.
Load testing and HA failover testing at regular intervals will ensure that the application continues to perform as expected. Likewise, proactively moving nodes in a maintenance window as clusters near capacity will allow you to measure migration times, assess performance impacts, and so on. Devise a runbook based on these measurements and distribute the respective responsibilities amongst the infrastructure, NOC, and application teams.
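One way to make such failover tests measurable is to time how long the cluster takes to report an elected leader again. The sketch below is illustrative only; it assumes it is run on a backend node while the planned failover is performed, and the awk column position relies on the cluster-status layout shown earlier.

```bash
#!/bin/bash
# Time how long a planned failover takes before a new leader is reported.
# Column 4 of chef-backend-ctl cluster-status is the Role field
# (leader / waiting_for_leader) in the output format shown above.

start=$(date +%s)
while true; do
  if chef-backend-ctl cluster-status 2>/dev/null | awk '$4 == "leader" {found=1} END {exit !found}'; then
    echo "Leader elected after $(( $(date +%s) - start ))s"
    break
  fi
  sleep 5
done
```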
Ideally your Chef admin team will share a runbook on how the underlying infrastructure functions and what to do in the event that you see issues at the application level. This will allow teams to quickly identify if and when changes need to be made, how those changes should be handled and what information is needed in the event that something unexpected happens.
If you are running an 'at-capacity' VMware cluster or similar virtualised infrastructure, it is plausible that the spike in CPU/memory usage has exceeded a DRS (Distributed Resource Scheduler) threshold, resulting in a vMotion (live VM migration) or Storage vMotion (live migration of the VM's underlying disks) of the virtual machine on which the leader runs.
If using vMotion (compute), the active memory and precise execution state of the virtual machine is rapidly transferred over a high-speed network, allowing the virtual machine to switch almost instantaneously from running on the source ESX host to the destination ESX host. Once the entire memory and system state has been copied to the target ESX host, vMotion suspends the source virtual machine, copies the bitmap to the target ESX host, and resumes the virtual machine there. If the VM is large (more than 8 vCPUs or more than 64 GB of RAM) this can incur momentary downtime of the application (1-2 seconds) and affect its clustering ability.
If using Storage vMotion, it is important to acknowledge that the source and destination storage arrays must be similarly capable (for example, SSD to SSD) so that the migration does not impact processing or degrade the performance of the leader.
If you see a persistent increase in I/O or CPU usage as a result of a Storage vMotion, note that the migration of a database host incurs a CPU performance hit of roughly 5-22%. If a BE node does not have headroom, this will degrade its stability further. The same is true for machines that are vMotioned whilst running.
It is also plausible that during these types of migration, depending on whether the vMotion network is isolated, a momentary spike in network throughput between ESX hosts can occur. This can cause latency on the BE <> BE or FE <> BE links.
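Capturing a short baseline on each backend node before and after a migration makes the CPU, disk, and latency impact concrete for your environment. A minimal sketch using standard Linux tools; `iostat` requires the sysstat package, and the peer IP shown is taken from the example cluster table above.

```bash
# CPU and memory pressure, sampled every 5 seconds for one minute
vmstat 5 12 > vmstat-$(hostname)-$(date +%F-%H%M).log

# Per-device disk latency and utilisation (requires the sysstat package)
iostat -x 5 12 > iostat-$(hostname)-$(date +%F-%H%M).log

# Round-trip latency to a peer backend node (IP from the example cluster above)
ping -c 60 10.10.10.2 > ping-$(hostname)-$(date +%F-%H%M).log
```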
It is important to verify immediately whether the leader has been moved as a result of DRS, and to review the short-term and long-term performance impact.
Identifying and troubleshooting VMware infrastructure issues is beyond the scope of this article; you can see a snippet on how to identify issues (monitoring, esxtop, etc.) at https://medium.com/kenshoos-engineering-blog/performance-degradation-in-production-whos-to-blame-acfbc1648a11
It is plausible that a restart of the PostgreSQL service on each follower, and then on the leader, will remediate these issues. If a leader is demoted and remains demoted, or is corrupted, follow the instructions to rejoin that node to the cluster as per https://docs.chef.io/backend_failure_recovery/
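A hedged outline of that restart order follows, assuming chef-backend-ctl's standard per-service commands; verify the current roles first and treat this as a sketch rather than a substitute for the failure-recovery documentation linked above.

```bash
# Run on each follower node first, then on the leader last
chef-backend-ctl cluster-status          # confirm which node is currently the leader
chef-backend-ctl restart postgresql      # restart the local PostgreSQL service
chef-backend-ctl status                  # confirm postgresql reports running again
```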