Software Resources
- Open Source Chef Server 11.1.4
- RHEL 6.x
Available Hardware Resources
- 8 CPUs
- 16 GB RAM - Mostly page cache
- Lots of disk
Hardware Resource Symptoms
- CPUs pegged by bookshelf (cookbook file GETs visible in the nginx access.log) and erchef processes, while the postgresql worker processes sit ~50% idle
- Load average exceeding the number of CPUs (i.e., over 100% utilization)
- IO quite low, as most memory is taken up by the kernel page cache
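A quick shell check for the load condition above can make the symptom concrete. This is a sketch, not part of Chef tooling; the function name is illustrative, and the threshold is just the one-runnable-task-per-CPU rule of thumb from the bullet list:

```shell
# Sketch: flag the "load average higher than the number of CPUs" condition.
# check_load LOAD CPUS -> prints OVERLOADED or OK
check_load() {
  # Coerce both values to numbers (+0) so awk does a numeric comparison.
  awk -v l="$1" -v c="$2" 'BEGIN { print ((l + 0 > c + 0) ? "OVERLOADED" : "OK") }'
}

# Live values on Linux: 1-minute load from /proc/loadavg, online CPU count.
if [ -r /proc/loadavg ]; then
  check_load "$(cut -d' ' -f1 /proc/loadavg)" "$(getconf _NPROCESSORS_ONLN)"
fi
```

On the hardware above (8 CPUs), a 1-minute load of 12 or more would print OVERLOADED.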
Erchef Logfile Symptoms
2015-07-03_03:35:27.05170 =ERROR REPORT==== 3-Jul-2015::03:35:27 ===
2015-07-03_03:35:27.05171 webmachine error: path="/environments/prod/cookbook_versions"
2015-07-03_03:35:27.05172 {error,
2015-07-03_03:35:27.05172 {error,
2015-07-03_03:35:27.05173 {badrecord,chef_cookbook_version},
2015-07-03_03:42:25.79206
2015-07-03_03:42:25.79208 =ERROR REPORT==== 3-Jul-2015::03:42:25 ===
2015-07-03_03:42:25.79208 webmachine error: path="/roles/prod"
2015-07-03_03:42:25.79209 {error,{case_clause,{error,no_connections}},
2015-07-03_03:42:25.79210 [{chef_db,fetch_requestor,3,[{file,"src/chef_db.erl"},{line,281}]},
2015-07-03_03:42:25.79210 {chef_wm_base,verify_request_signature,2,
2015-07-03_03:42:25.79211 [{file,"src/chef_wm_base.erl"},{line,248}]},
2015-07-03_03:29:55.75733 =ERROR REPORT==== 3-Jul-2015::03:29:55 ===
2015-07-03_03:29:55.75733 webmachine error: path="/environments/_default/cookbook_versions"
2015-07-03_03:29:55.75734 {error,function_clause,
2015-07-03_03:29:55.75734 [{chef_wm_depsolver,forbidden_for_environment,
2015-07-03_03:29:55.75735 [{error,no_connections},
2015-07-03_03:29:55.75735 {wm_reqdata,'POST',http,
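The {error,no_connections} entries above indicate the erchef database connection pool is exhausted. A small sketch for quantifying how often that is happening (the helper name is illustrative, and the log path in the comment is an assumption based on the standard omnibus runit layout, so adjust it for your install):

```shell
# Sketch: count db-pool exhaustion errors in erchef log text read on stdin.
count_no_conn() {
  grep -c 'no_connections'
}

# Typical usage against the live log (path is an assumption for OSC 11):
#   count_no_conn < /var/log/chef-server/erchef/current
```

A count that climbs during the overload window, then stops when load drops, corroborates that the pool, not postgres itself, is the choke point.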
Example Scenarios Leading to Overload On This Hardware
Unlimited Bootstrapping
(Presents with the symptoms detailed above.)
- 1400-1500 standard Chef Clients already converged provide a baseload of 50-60% total CPU usage.
- ~190 standard Chef Clients failing to bootstrap and retrying as quickly as possible, or with only a 1s sleep in their retry cycle, push this machine over the edge into the pathological behavior described above.
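The retry storm in that second bullet can be damped client-side. A minimal sketch of wrapping a bootstrap attempt in exponential backoff with jitter instead of a fixed 1s sleep (the function and its parameters are illustrative, not part of Chef):

```shell
# Sketch: retry a failing command with exponential backoff plus jitter, so
# ~190 failing clients spread their retries out rather than all hitting the
# server every second.
# Usage: retry_with_backoff COMMAND [MAX_TRIES] [BASE_DELAY_SECONDS]
retry_with_backoff() {
  local cmd=$1 max=${2:-6} delay=${3:-2} tries=0
  until $cmd; do
    tries=$((tries + 1))
    [ "$tries" -ge "$max" ] && return 1
    sleep $((delay + RANDOM % (delay + 1)))  # sleep between delay and 2*delay
    delay=$((delay * 2))                     # double the base each attempt
  done
}
```

The jitter matters as much as the backoff: without it, clients that failed together retry together, and the herd arrives at the server in synchronized waves.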
ssh_known_hosts
- Software Symptom: postgresql worker processes spend all of their time decoding JSON
- Hardware Symptoms: High Postgres CPU and IO
- Converge several hundred standard Chef Clients all running just the ssh_known_hosts cookbook. This will give a Chef server of any size a hard time without the cacher recipe introduced in https://github.com/opscode-cookbooks/ssh_known_hosts/commit/e39e243cfa0f0f589bbd8538a986e1e0c363cc56 . Without a cacher node, every managed node requests the serialized JSON blobs of all the other managed nodes on every converge, resulting in a postgres meltdown. This example is distinct from the bootstrapping overload failure above in its logfile symptoms, but will occur on similarly sized hardware.
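A back-of-envelope calculation shows why this melts down: with every node fetching every other node's object on each converge, total fetch volume grows quadratically in fleet size. A sketch (the helper name and fleet sizes are illustrative):

```shell
# Sketch: without a cacher node, each of N nodes fetches the node objects
# of the other N-1 nodes every converge interval, so total JSON blobs
# deserialized by postgres per interval is N * (N - 1).
node_fetches_per_interval() {
  echo $(( $1 * ($1 - 1) ))
}

node_fetches_per_interval 300   # several hundred nodes -> 89700 fetches
node_fetches_per_interval 1500  # fleet size from the scenario above -> 2248500
```

With a cacher node, the same fleet does O(N) work per interval: one node performs the search, and everyone else reads the cached result.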