Monitoring a Chef server is an essential piece of successfully implementing Chef in your workplace. The same could be said of any system you introduce into your critical workflow and control plane. Monitoring all of the pieces of your Chef server can be an involved task, requiring some extra hardware and elbow grease to get a monitoring system into place. Luckily, most companies that need Chef already have monitoring in place. We recommend you talk to the teams that are SMEs in your in-house monitoring tools and work to implement as many of the recommendations we make below as possible.
If you have no in-house monitoring tools, we suggest some of the following:

Monitoring and alerting tool(s):
- Sensu
- Zabbix
- Nagios

Analytical and reporting database (pick one):
- Graphite
- InfluxDB
- OpenTSDB

A log collection and analysis tool (pick one):
- ELK (ElasticSearch, Logstash, Kibana)
- Splunk
- ElasticSearch Filebeat

A visualization plane:
- Grafana
- Kibana
Adopting these tools will take some time to skill up and understand how they work. The gist is that you run a statsd-compatible daemon (e.g. Statsd, or Telegraf with a statsd plugin) and a graphite-compatible daemon (e.g. Graphite, or InfluxDB with a Graphite plugin) on separate infrastructure, and a client on your Chef server emits metrics to the statsd daemon. The statsd server then relays the metrics it receives to your Graphite, InfluxDB or OpenTSDB server, either directly or via Sensu. Sensu alerts on metrics via email, chat, or real people via PagerDuty.
Although this may seem a daunting number of things to learn just to monitor your Chef server, remember that you should be doing this for any and all business-critical applications, and once you've invested in doing it once, it is much easier in the future.
There are several layers of monitoring we’ll need to look at to get a holistic picture of your Chef server landscape. Each section below is a ‘layer’ of monitoring.
Operating System (OS) Monitoring
There are many options available for operating system monitoring. Nagios, Zabbix, a statsd client, and many more tools are perfectly suited for the job. If you already have a statsd server you can use your favourite statsd client; that statsd server can also double as the target for the Chef server's estatsd metrics, as we'll see ahead. The key things you want to monitor (with any system, really) are:
- Disk Space
- Load Average
- Free Memory
- File Handles
- Number of processes
- Network usage
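As a small illustration of one item in the list above, on a Linux host the load averages can be read from /proc/loadavg. This is a hedged sketch, not part of any monitoring product; the file path and field layout are standard on Linux:

```ruby
# Parse the contents of /proc/loadavg (Linux) into the 1-, 5- and
# 15-minute load averages. The remaining fields (process counts and
# last PID) are ignored here.
def load_averages(text)
  text.split.first(3).map(&:to_f)
end

# On a Linux host you would feed it the real file:
# one, five, fifteen = load_averages(File.read('/proc/loadavg'))
```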
Postgres Monitoring
Chef uses Postgres as its back-end database. Use pgBadger for Postgres log analysis and the pg_stat_statements extension to track query performance. The pg_stats view can also glean useful information from the database statistics. Postgres has an article in their wiki dedicated to monitoring:
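As a sketch of what you might pull from pg_stat_statements, the SQL and helper below are illustrative rather than Chef-specific. Note the timing column is named total_exec_time on Postgres 13+ and total_time on older versions:

```ruby
# Illustrative query against the pg_stat_statements extension.
# Column name assumption: total_exec_time (Postgres 13+); use
# total_time on earlier versions.
TOP_QUERIES_SQL = <<~SQL
  SELECT query, calls, total_exec_time
  FROM pg_stat_statements
  ORDER BY total_exec_time DESC
  LIMIT 10;
SQL

# Rank already-fetched rows (hashes keyed by column name, e.g. as
# returned by the pg gem) by total execution time.
def top_queries(rows, limit = 10)
  rows.sort_by { |r| -r['total_exec_time'] }.first(limit)
end
```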
Application Performance Monitoring
To watch the performance of the Chef server application we have two tools immediately at our disposal. The first is estatsd. You can enable estatsd metric emission on your Chef server with the following configuration in your chef-server.rb:
```ruby
###
# Estatsd
###
estatsd['enable'] = true
estatsd['protocol'] = 'statsd'
estatsd['vip'] = '<Your statsd server>'
estatsd['port'] = '<Your statsd server port>'
```
You’ll need to run `chef-server-ctl reconfigure` to start emitting statsd messages to your statsd server. The statsd server can then relay that information to different destinations, such as Graphite or Sensu.
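For reference, statsd messages are simple `name:value|type` strings sent over UDP. This hedged sketch shows the wire format; the metric name, host and port are placeholders, not anything the Chef server guarantees:

```ruby
require 'socket'

# Build a statsd wire-format message: "<name>:<value>|<type>",
# where type is e.g. 'c' (counter), 'g' (gauge) or 'ms' (timer).
def statsd_packet(name, value, type)
  "#{name}:#{value}|#{type}"
end

# Fire-and-forget UDP send, as statsd clients do. Host and port are
# placeholders for your own statsd server.
def send_metric(packet, host: 'localhost', port: 8125)
  UDPSocket.new.send(packet, 0, host, port)
end
```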
NOTE: There is currently an issue in the chef-server statsd protocol that causes a malformed first line to be emitted to statsd. If you have verbose/debug logging on your statsd server this can cause significant log accumulation. We suggest you turn logging to ‘error’ level on your statsd server before pointing Chef server at your statsd server, until https://github.com/chef/chef-server/issues/679 is resolved.
The Chef server also includes a folsom-to-graphite metrics plugin. Most monitoring tools have a plugin available to accept metrics in Graphite’s format; use that to ship these metrics from the Chef server to your monitoring tool of choice.

To configure your Chef server to emit Graphite-based metrics, include the following in your chef-server.rb:
```ruby
folsom_graphite['enabled'] = true
folsom_graphite['host'] = '<graphite.mycompany.com>'
folsom_graphite['port'] = 2003
```
You should observe a baseline of your Chef server’s performance while healthy, and set reasonable default alerts for when things like free pooler connections drop too low, or when application response times rise too high.
Application Health Monitoring
On each Chef server there is a status endpoint located at https://fqdn/_status. It pings the various systems needed for your Chef server to be healthy, and if any return an erroneous response, it returns an HTTP 500 error to the requestor.
The endpoint returns its data in JSON format, including some useful database and connection pooling information. You can parse this JSON and ingest it into your monitoring system to gather metrics on various services within the Chef server.
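As a hedged sketch of consuming that JSON: the helper below assumes the shape reported by recent Chef server versions (a top-level "status" of "pong" plus per-upstream results under "upstreams"), so verify it against your own server's output. The FQDN in the usage comment is hypothetical:

```ruby
require 'json'

# Return true when the parsed _status document reports the overall
# service and every upstream as healthy ("pong").
def healthy?(status_json)
  doc = JSON.parse(status_json)
  doc['status'] == 'pong' &&
    (doc['upstreams'] || {}).values.all? { |v| v == 'pong' }
end

# Hypothetical usage against a real server:
# require 'net/http'
# body = Net::HTTP.get(URI('https://chef.example.com/_status'))
# puts healthy?(body) ? 'OK' : 'DEGRADED'
```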
You should also poll this endpoint from your load balancer to check individual server health in your Chef server infrastructure. When a server responds to the status endpoint with a 500 error, consider using your load balancer’s functionality to remove it from the pool.
HTTP Response Codes
There is a single place you can look to gauge the health of your Chef server infrastructure: the incoming HTTP requests and the response codes emitted by the Chef server, or by the load balancer in front of your Chef front-ends. If you are looking directly at your Chef server(s), look at your Nginx logs. If your Chef server processes fail, Nginx will continue serving requests and responding, usually with an HTTP 5XX response code such as 502 Bad Gateway.
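To illustrate, here is a minimal sketch that tallies response codes from Nginx access-log lines. It assumes the default combined log format, where the status code follows the quoted request field; adjust the pattern to your own log format:

```ruby
# Tally HTTP status codes from nginx access-log lines (combined log
# format assumed: the 3-digit status follows the quoted request).
def response_code_counts(lines)
  lines.each_with_object(Hash.new(0)) do |line, counts|
    counts[Regexp.last_match(1)] += 1 if line =~ /" (\d{3}) /
  end
end

# A slice like counts.select { |code, _| code.start_with?('5') }
# is a good candidate for an alert threshold.
```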
If you have the ability to, monitor responses from your load balancer rather than from the Chef server itself. In the case of hardware or other failure, the load balancer will continue to send responses to incoming requests, which you can monitor and alert on. As well, if the Chef server is taking too long to respond and the load balancer has an aggressive timeout, the load balancer can return a timeout without a response from the Chef server, such as an HTTP 408 Request Timeout.
Regardless of where you monitor your application’s health, you can use tools such as Logstash, Splunk or ElasticSearch’s Filebeat to parse your application logs and transport them to a database.
Log Collection
You should also collect logs for analysis and searching. The two most popular tools for this today are the ELK stack (ElasticSearch, Logstash, Kibana) and Splunk. You should collect and grok the various logs in /var/log/opscode/.
Visualization
There are many graphing solutions available today. Likely the most popular is Grafana. The InfluxDB TICK stack also comes with a visualization product, and Kibana is a popular graphing front-end for the EL(K) stack.
Regardless of the solution you pick, the most important thing is that you pick one. It’s important to see historical trends and the leading indicators of issues in your Chef server, and to be able to correlate them with changes in your infrastructure. This helps you create new alerts that catch issues before they become impacting.
Monitoring & Alerting
You’ll want to use any one of the number of monitoring tools available today; some of the free ones include Sensu, Zabbix and Nagios. One important consideration is what works in your environment - e.g., if a firewall blocks SNMP traffic from your Chef server infrastructure to your Nagios server, you may not want to plan on using Nagios with SNMP traps.
There are many people that cover monitoring, trending and alerting in blogs that are much more eloquent than we can place in a short knowledge base article - we’d encourage you to seek out and read those if this interests you. We think the most important things to consider when setting alerts are:
- Have two types of alerts
- One that does not wake people up, perhaps only sending an email, chat-bot notification, or kanban board item. Examples of these are:
- Far leading indicators that something ‘might’ be wrong
- Supporting systems that can wait until business hours to be fixed
- Anything else that is not SLA based or time sensitive
- One that does wake people up, sending a notification to your NOC or a system such as PagerDuty. Examples of these are:
- System down alerts
- Impending outage indicators that something is imminently wrong
- Anything that is tied to SLA’s or is imminently time sensitive
- Reduce white noise
- Do you have alerts that routinely don’t get actioned by your team? Kill them! If you don’t, people will become numb to alerts and lag on actioning them.
- Automate recovery where possible
- We are automators and makers. Is there an issue that is always resolved by running a specific command? Why waste a human’s time running it manually?
With all this said, this is a large endeavour that will take time and iterations. As with anything in the DevOps landscape, start with a small goal and iterate. If you are a team with access only to the Chef logs and config on a machine, start with what you control in the application space, then iterate and engage other teams at your company. If you have limited time but full access to your infrastructure, start where you see the most value, then iterate to other areas.