When planning a new deployment, or considering an expansion or migration of your existing deployment to a new topology, a frequent consideration is data resilience and reliability. Your data will primarily reside in a combination of PostgreSQL and Elasticsearch databases. In the case of Chef Backend HA or A2 Cluster, these will be clustered databases.
Whether you are on public cloud or on-premises infrastructure, the same thinking applies to how these databases are deployed, maintained, and upgraded. We expect that any externalisation of the databases outside of a Chef product is an acknowledgement that you accept responsibility for the data which resides there, and that Chef Support can provide only limited guidance, as we will not have the necessary visibility of the system.
Chef considers a few of these topologies sustainable in the long term, whilst others remain untested and unsupportable. It is important to draw your attention to what we consider to be reliable configurations, what makes those configurations sustainable, and what to avoid ahead of any decisions you may make.
| Product | Version(s) | Topology |
|---|---|---|
| Chef Automate | 2 | Standalone, Cluster |
| Chef Infra Server | 12, 13-14 | Standalone |
| Chef Backend | 2.x, 3.x | Cluster |
Whether on-premises or in a public cloud, you will need to ensure that a few general considerations are made around the infrastructure associated with the databases. It's important to verify the claims made of provisioned infrastructure and, where possible, obtain reports or benchmarks that back up those claims.
Whether you consume your infrastructure from a cloud provider or an on-premises team, you should be aware that Chef Server/Automate make specific demands and require minimum guarantees in order to maintain stability. If databases are configured to a specification arrived at through your own capacity planning, or through planning completed on your behalf by a Chef engineer, keep the resulting configuration document: it will prove useful when revisiting the topic. Understanding these configurations will help you verify whether you need to scale up (PostgreSQL concurrent connections) or scale out (Elasticsearch data nodes). The more configurables you have to work with, the more finely tuned and performant your cluster may become. If you plan on using DBaaS on a public cloud, you should confirm that whatever you use meets our minimum requirements.
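For example, a quick way to gauge PostgreSQL connection headroom is to compare in-use connections against the configured limit. This is a minimal sketch; the hostname and user below are placeholders for your own environment.

```shell
# Compare current connections against max_connections on the Postgres host.
psql -h pg.example.internal -U chef_admin -d postgres -c \
  "SELECT count(*) AS in_use, current_setting('max_connections') AS max_allowed FROM pg_stat_activity;"
```

If in_use routinely approaches max_allowed, that is a signal to revisit your capacity plan before stability suffers.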
To quote our scaling documentation:
Elastic strongly recommends the use of Flash or SSD storage that is local to the machine running it (not NAS or SAN). In cloud environments, we've found the best performance from machine types with local SSD storage (AWS I3 or D2, Azure Ls series); however, the SSD-based network storage options (AWS provisioned-IOPS EBS and Azure Premium Storage) provided acceptable latency in our tests but allowed much larger volume sizes.
In on-prem VM environments we recommend using direct-attached SSD arrays or SAN systems that can provide guaranteed bandwidth reserved for Elasticsearch. For larger datasets, physical machines optimized for big data workloads may be more economical. The storage must be able to handle at least 1000 sustained IOPs per Elasticsearch node with an average latency of 2ms or less.
The same is true of PostgreSQL, which by default writes to the same storage when not externalised.
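One way to verify that a candidate volume meets the quoted figure of 1000 sustained IOPS at 2 ms or less is a synthetic benchmark such as fio. This is a rough sketch rather than a formal acceptance test; the target directory is a placeholder and the workload shape is an assumption.

```shell
# Mixed 4k random read/write against the intended data directory,
# bypassing the page cache (direct=1) to exercise the device itself.
fio --name=db-volume-check --directory=/var/opt/chef-db \
    --rw=randrw --bs=4k --direct=1 --ioengine=libaio \
    --iodepth=16 --size=2G --runtime=120 --time_based \
    --group_reporting
```

Compare the reported IOPS and completion latencies against the requirements above, and run it on each node, since nominally identical cloud volumes can perform differently.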
With regards to monitoring, we cover much of what is useful and necessary for both database types in Chef Automate: Deployment Planning and Performance Tuning (transcribed from 'Scaling Chef Automate Beyond 100,000 nodes').
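As a starting point, the standard Elasticsearch health and stats endpoints are worth polling from whatever monitoring system you use. A minimal sketch, assuming Elasticsearch listens on its default port on the local node:

```shell
# Overall cluster state: green, yellow, or red, plus shard counts.
curl -s 'http://localhost:9200/_cluster/health?pretty'

# Per-node disk, heap, and thread-pool figures for deeper inspection.
curl -s 'http://localhost:9200/_nodes/stats?pretty'
```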
With regards to the databases' ability to sustain and accommodate high I/O request rates, you should ensure that all parts of the cluster, frontends and backends alike, are timed from the same sources using NTP. You should have multiple NTP clocks available, and the failover between them should be ordered consistently so that instances can tune to the most correct clock. We have observed better and more forgiving timing with chrony when it is available.
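A minimal chrony sketch follows; the server hostnames are placeholders for your own NTP sources, and the remaining directives are common defaults rather than Chef-mandated settings.

```shell
# List several time sources so chrony can select and track the best one.
cat <<'EOF' | sudo tee /etc/chrony.conf
server ntp1.example.internal iburst
server ntp2.example.internal iburst
server ntp3.example.internal iburst
driftfile /var/lib/chrony/drift
makestep 1.0 3
EOF
sudo systemctl restart chronyd

# Confirm which source each node has selected and how far it has drifted.
chronyc sources -v
chronyc tracking
```

Run the verification commands on every frontend and backend node; consistent source selection across the whole cluster is the point of the exercise.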
Geographically speaking, in the case of any clustered topology, closer is better. Neither A2 Cluster nor Chef Backend was designed to perform as a geo-distributed cluster, and they should not be deployed as such. Chef Automate/Infra FE nodes should be co-located with their backend cluster nodes, and standalone Chef Automate/Infra should be co-located with its database components. Nothing should span regions. Using different availability zones for Infra/Automate is acceptable, whilst database clusters should be co-located within a single AZ, using cluster placement groups or the equivalent in a public cloud.
Platform-Specific Considerations:
Each cloud provider will have its own caveats. As an example regarding operational performance, we have observed that database clusters supplied in Azure were unable to provide continuous availability to queries from Chef FE nodes, which appeared in the form of paused, slow responses.
We have also observed Azure Resource Manager (ARM) throttling requests based on monthly subscription limits, rendering a database visibly available but heavily degraded, with only a fraction of reads/writes completing. This cascades into application issues that are difficult to troubleshoot. See https://docs.microsoft.com/en-us/azure/azure-resource-manager/management/request-limits-and-throttling.
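ARM reports remaining request quota in response headers, which you can watch to spot an approaching throttle before it degrades the databases. A hedged sketch, assuming the Azure CLI is installed and authenticated; the subscription ID is a placeholder:

```shell
# Fetch any ARM endpoint and surface the remaining-quota headers.
TOKEN=$(az account get-access-token --query accessToken -o tsv)
curl -sD - -o /dev/null \
  -H "Authorization: Bearer $TOKEN" \
  "https://management.azure.com/subscriptions/<subscription-id>/resourceGroups?api-version=2021-04-01" \
  | grep -i 'x-ms-ratelimit-remaining'
```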
A secondary consideration is the nature and management of image-based snapshots and backup services. These should be carefully managed and should always take the form of read replicas. The specifics are available through each cloud provider, and their configurability differs greatly.
The availability of a database leader is impacted while it is being backed up, and a Chef FE node is not necessarily made aware that this is happening. In this regard Chef Backend has a well-understood pattern, whilst the failover of each external database cluster will require domain expertise within your organisation to perform and troubleshoot.
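For externalised PostgreSQL, one hedged sketch for confirming that a replica is healthy before pointing a backup job at it (the hostname and user are placeholders, and the replay_lag column assumes PostgreSQL 10 or later):

```shell
# On the leader: list attached replicas, their state, and replay lag.
psql -h pg-leader.example.internal -U chef_admin -d postgres -c \
  "SELECT client_addr, state, replay_lag FROM pg_stat_replication;"
```

A replica in the streaming state with minimal replay_lag is a reasonable backup source; a lagging or disconnected replica would yield a stale backup.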
At the time of writing, we would like to ensure that we publish what is and isn't supportable:
| Provider | Function | Chef Server compatible | Automate compatible |
|---|---|---|---|