Overloaded Open Source Chef Server 11 or Chef Server 12 System

Sean Horn

Software Resources

  • Open Source Chef Server 11.1.4
  • RHEL 6.x

Available Hardware Resources

  • 8 CPUs
  • 16 GB RAM (mostly used as kernel page cache)
  • Lots of disk

Hardware Resource Symptoms

  • CPUs pegged by bookshelf (cookbook resource GETs show up in the nginx access.log) and erchef processes, while postgresql worker processes are ~50% idle
  • Load average higher than the number of CPUs
  • Disk IO quite low, as most memory is taken up by the kernel page cache
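
The load average symptom can be checked with a few lines of Python. This is a minimal sketch, not part of the original diagnosis: the function names and the threshold of 1.0 (load equal to the CPU count) are illustrative assumptions.

```python
import os


def load_per_cpu(load_1min, cpu_count):
    """Return the 1-minute load average divided by the number of CPUs."""
    return load_1min / cpu_count


def is_overloaded(load_1min, cpu_count, threshold=1.0):
    """True when the load average exceeds threshold * CPU count.

    The default threshold of 1.0 is an assumption chosen to match the
    "load average higher than the number of CPUs" symptom.
    """
    return load_per_cpu(load_1min, cpu_count) > threshold


if __name__ == "__main__":
    # os.getloadavg() is available on Unix-like systems such as RHEL.
    load_1min, _, _ = os.getloadavg()
    cpus = os.cpu_count() or 1
    print(f"1-minute load: {load_1min:.2f} on {cpus} CPUs "
          f"({load_per_cpu(load_1min, cpus):.2f} per CPU)")
    print("overloaded" if is_overloaded(load_1min, cpus) else "ok")
```

On the 8 CPU machine described here, any sustained 1-minute load above 8 would be reported as overloaded.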

Erchef Logfile Symptoms

2015-07-03_03:35:27.05170 =ERROR REPORT==== 3-Jul-2015::03:35:27 === 
2015-07-03_03:35:27.05171 webmachine error: path="/environments/prod/cookbook_versions" 
2015-07-03_03:35:27.05172 {error, 
2015-07-03_03:35:27.05172 {error, 
2015-07-03_03:35:27.05173 {badrecord,chef_cookbook_version},
2015-07-03_03:42:25.79206 
2015-07-03_03:42:25.79208 =ERROR REPORT==== 3-Jul-2015::03:42:25 === 
2015-07-03_03:42:25.79208 webmachine error: path="/roles/prod" 
2015-07-03_03:42:25.79209 {error,{case_clause,{error,no_connections}}, 
2015-07-03_03:42:25.79210 [{chef_db,fetch_requestor,3,[{file,"src/chef_db.erl"},{line,281}]}, 
2015-07-03_03:42:25.79210 {chef_wm_base,verify_request_signature,2, 
2015-07-03_03:42:25.79211 [{file,"src/chef_wm_base.erl"},{line,248}]},
2015-07-03_03:29:55.75733 =ERROR REPORT==== 3-Jul-2015::03:29:55 === 
2015-07-03_03:29:55.75733 webmachine error: path="/environments/_default/cookbook_versions" 
2015-07-03_03:29:55.75734 {error,function_clause, 
2015-07-03_03:29:55.75734 [{chef_wm_depsolver,forbidden_for_environment, 
2015-07-03_03:29:55.75735 [{error,no_connections}, 
2015-07-03_03:29:55.75735 {wm_reqdata,'POST',http,
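
To gauge how often these errors occur, the erchef log can be scanned for the two markers visible in the excerpts above: the `webmachine error: path="..."` line and the `no_connections` term. The following is a rough sketch; the function name `summarize` is hypothetical, and the log file location is taken from the command line because it depends on the installation.

```python
import re
import sys
from collections import Counter

# Matches the request path in lines such as:
#   webmachine error: path="/roles/prod"
PATH_RE = re.compile(r'webmachine error: path="([^"]+)"')


def summarize(lines):
    """Count webmachine errors per request path and no_connections lines."""
    paths = Counter()
    no_connections = 0
    for line in lines:
        match = PATH_RE.search(line)
        if match:
            paths[match.group(1)] += 1
        if "no_connections" in line:
            no_connections += 1
    return paths, no_connections


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python erchef_errors.py <path to erchef log>
    with open(sys.argv[1], errors="replace") as handle:
        paths, no_connections = summarize(handle)
    for path, count in paths.most_common():
        print(f"{count:6d}  {path}")
    print(f"lines mentioning no_connections: {no_connections}")
```

A high count of `no_connections` lines alongside errors on many different paths would be consistent with the overload described in this article, although the counts alone do not identify the cause.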

Example Scenarios Leading to Overload On This Hardware

  • Unlimited Bootstrapping
    (Presents with the symptoms detailed above)

    • 1400 to 1500 standard Chef Clients that have already converged provide a base load of 50-60% total CPU usage.
    • ~190 standard Chef Clients failing to bootstrap and retrying as quickly as possible, or with only a 1s sleep in their cycle, push this machine over the edge into the pathological behavior described above.
  • ssh_known_hosts

    • Software Symptom: postgresql worker processes spend all of their time decoding JSON
    • Hardware Symptoms: High Postgres CPU and IO
    • Scenario: several hundred standard Chef Clients converge, all running just the ssh_known_hosts cookbook. This gives a Chef server of any size a hard time unless the cacher recipe introduced in https://github.com/opscode-cookbooks/ssh_known_hosts/commit/e39e243cfa0f0f589bbd8538a986e1e0c363cc56 is in use. Without a cacher node, every managed node requests the serialized JSON blobs of all the other managed nodes on this server on every converge, resulting in a postgres meltdown. This example differs from the bootstrapping overload above in its logfile symptoms, but occurs on similarly sized hardware.
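
The ssh_known_hosts scenario grows quadratically with the number of nodes, which a short calculation makes concrete. This is a back-of-the-envelope sketch with hypothetical function names. The cacher figure assumes that only the single cacher node fetches the other nodes' objects and that every other client reads one cached result instead; that is an assumption about how the cacher recipe is used, not something measured here.

```python
def node_fetches_without_cacher(node_count):
    """Node objects fetched per converge round when every node
    requests the JSON of all the other nodes."""
    return node_count * (node_count - 1)


def node_fetches_with_cacher(node_count):
    """Node objects fetched per converge round when only one cacher
    node requests the JSON of all the other nodes (assumption)."""
    return node_count - 1


if __name__ == "__main__":
    for nodes in (100, 300, 1500):
        print(f"{nodes:5d} nodes: "
              f"{node_fetches_without_cacher(nodes):9d} fetches without cacher, "
              f"{node_fetches_with_cacher(nodes):5d} with cacher")
```

With 300 nodes, for example, each round of converges asks the server for 89,700 node objects without a cacher, compared with 299 under the stated assumption.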