An advisory document on how to distribute the responsibilities and actions of managing a Chef environment across your organisation's on-call teams.
All Chef Server Products (Automate, Automate Cluster, Chef Infra Server, Chef Habitat Builder), Integrations
SysAdmins, SRE, NOC engineers and Helpdesk staff. Anyone tasked with overseeing the uptime of a Chef Product as part of their day-to-day or on-call work schedule.
In addition to the terms of our SLAs (https://www.chef.io/service-level-agreement/), Chef Support asks that the guidance below be followed, so that the responsibilities, domain-specific expertise, and organisational context necessary to remediate a production issue are shared effectively.
Whilst "How do I write a helpful support ticket?" covers how a ticket is raised, it does not cover how a production issue is responded to. This document provides a foundation on which an organisation can base its expectations, and which it can merge with its own processes, to succeed when facing a production outage. Once read, the guidance below should be reviewed and shared amongst your team members and engineers, noted in the pertinent places of your organisation's documentation, and referenced whenever needed.
Chef Support's priority is to ensure the timely and effective communication of materials necessary to investigate and resolve a production issue. To this end, the following key components are critical:
We advise that a prompt and well documented escalation path be available so that the right person can be reached quickly to solve an issue. Chef Support will reconvene on a call if necessary and when the requested persons are available, or if further troubleshooting is required once any steps have been completed.
Whilst we can commit our fast-acting on-call engineers to scheduling a call to resolve an issue, we cannot commit their availability for extensive periods of hold time. It is at the discretion of the engineer as to how long they will attend a call after identifying a fix that can ONLY be enacted by the customer.
Likewise, we strive to provide the fastest resolution times. These stem from having good external processes, capable and efficient expertise at hand, and the ability to communicate quickly and accurately via our ticketing system when necessary. In order to provide the best support for all of our customers, Chef Support will not engage in video conference calls until we have assessed the provided information and established a reason to do so.
Chef Support will request product- and/or environment-specific documentation, logging, or knowledge, to be forwarded to us in whatever time frame is reasonable for you to obtain it. This will include the output of both application CLI commands and native Linux/Windows commands, if these are required. We expect that you will have the capacity to run these commands, and that any output longer than a few lines will be uploaded to the ticket as an attachment.
In addition, we may need to rule out interference from the underlying or surrounding environment. This may require that you are able to infer blocking configuration from complementary tools (endpoint security/AV software), or that you have access to the monitoring, metrics, and logging of the cloud or on-premise infrastructure on which the Chef product resides. Where necessary and possible, we will ask that dashboards be uploaded as image attachments.
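As an illustration of capturing command output in a form that can be attached to a ticket, the following is a minimal sketch only. It is not a Chef-provided tool: the function name `collect_outputs` and the output file name are hypothetical, and the list of commands to run is whatever Chef Support has asked for in your ticket.

```python
import datetime
import pathlib
import subprocess


def collect_outputs(commands, out_dir="."):
    """Run each command and write its output to one timestamped text file.

    `commands` is a list of argument lists, e.g. [["hostname"], ["uname", "-a"]].
    These are placeholders; substitute the commands requested in the ticket.
    """
    stamp = datetime.datetime.now().strftime("%Y%m%d-%H%M%S")
    out_path = pathlib.Path(out_dir) / f"support-diagnostics-{stamp}.txt"
    with out_path.open("w", encoding="utf-8") as fh:
        for cmd in commands:
            # Record the command itself so the reader can match output to input.
            fh.write(f"$ {' '.join(cmd)}\n")
            try:
                result = subprocess.run(
                    cmd, capture_output=True, text=True, timeout=120
                )
                fh.write(result.stdout)
                fh.write(result.stderr)
                fh.write(f"[exit code {result.returncode}]\n\n")
            except (OSError, subprocess.TimeoutExpired) as exc:
                # A missing binary or a hung command is itself useful information.
                fh.write(f"[could not run: {exc}]\n\n")
    return out_path
```

Calling `collect_outputs([["hostname"], ["uname", "-a"]])` would write a single file in the current directory, which can then be uploaded to the ticket. Review the file for secrets before attaching it.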
It is important to note, for the purposes of solving production issues, that we may need to review the state of the product multiple times. It is good practice to flag this requirement to your team and to have a streamlined process that includes proactively sending the relevant logs when opening a support ticket. See "How do I know which logs to submit to Chef Support?"
It is also important to note that Zendesk serves as our point of documentation: the more traceable a previous issue between our support staff and your organisation is, the greater the likelihood that it contributes to solving issues quickly in future. We will always prefer that information is passed via the ticket rather than shared only on a video conference call.
When resolving an issue, it is imperative that whoever is involved from your organisation possesses the requisite expertise and domain knowledge to complete any task outside of Chef Support's capacity. This includes, but is not limited to:
- DNS entries
- NTP configuration
- Network configuration
- Firewall configuration
- Proxy configuration
- SSL certificate issuance
- Administrative privileges (root level access to systems where necessary)
- Secrets management
- Organisation-specific documented processes
If you are working within a team whose expertise does not encompass all of the above, or you are primarily responsible for the Chef environment but it has external dependencies such as data stores, storage, compute, etc., we suggest that you ensure you have escalation access to these people should an issue be identified.
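Some of the items listed above, such as DNS entries and firewall or network configuration, can be given a quick first check before escalating. The following is a minimal sketch using only the Python standard library. The function names `resolves` and `port_open` are hypothetical, and the hostnames and ports you check depend entirely on your own environment. A passing check does not prove the configuration is correct, and a failing one only indicates where to look.

```python
import socket


def resolves(hostname):
    """Return the sorted IP addresses a hostname resolves to, or [] on failure."""
    try:
        infos = socket.getaddrinfo(hostname, None)
    except socket.gaierror:
        return []
    # Each entry's sockaddr tuple starts with the address string.
    return sorted({info[4][0] for info in infos})


def port_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False
```

For example, `resolves("chef.example.internal")` and `port_open("chef.example.internal", 443)` (a placeholder hostname) would show whether the name resolves and whether the HTTPS port is reachable from the machine running the check. Including the results in the ticket helps the relevant network or DNS owner pick up the issue quickly.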
Chef Support cannot be responsible for changes or events within your environment. Oftentimes, a failing Chef product is an indicator that something within your environment has changed and degraded the performance of the application.
We advise that any customer who is undergoing maintenance, migrating environments, doing disaster recovery testing, or undergoing a security audit share this at the time of raising the issue.
If you are not the person who is familiar with these procedures, we suggest reaching out to someone who is, to ensure that external events are not interfering with the product.
Whilst we have no way to verify these external events conclusively, their acknowledgement and visibility can either rule them out or prevent Chef engineers from misdirecting their attention.
Many of the specifics of these external events can be reviewed in the Severity 1 Checklist.