Push Jobs Crashing Consistently

Sean Horn -

If you see entries like the following repeatedly in /var/log/opscode/opscode-pushy-server/current on the active Chef Server backend, the most likely cause is that the system times on your opscode-push-jobs-client nodes and your Push Jobs server do not match. The clocks must agree for push jobs runs to succeed.

2016-04-12_22:50:21.86752 18:50:07.232 [error] (pushy_node_state:364) <0.21274.0> Bad timestamp in message: ={"node":"vm-2628-5132.someone.net","client":"vm-2628-5132","org":"someonechef","type":"heartbeat","sequence":142,"timestamp":"Tue, 12 Apr 2016 22:38:44 GMT","incarnation_id":"223c888a-eca6-4d70-8403-7e128b5f9e0b","job_state":"idle","job_id":null}

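In this example, the heartbeat from node vm-2628-5132.someone.net carries a timestamp of 22:38:44 GMT, but the leading timestamp on the log line shows it was logged at about 22:50. The node's system time is therefore roughly 12 minutes behind that of the opscode-pushy-server it is heartbeating with and receiving jobs from.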
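The comparison described above can be sketched in Python. This is a minimal illustration, not part of Push Jobs itself: the function name heartbeat_skew_seconds is hypothetical, and it assumes the leading timestamp on each log line is in UTC and was written shortly after the server received the message, so the result is only an estimate.

```python
import json
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime

# The sample log entry from this article, split across lines for readability.
LOG_LINE = (
    '2016-04-12_22:50:21.86752 18:50:07.232 [error] (pushy_node_state:364) '
    '<0.21274.0> Bad timestamp in message: ='
    '{"node":"vm-2628-5132.someone.net","client":"vm-2628-5132",'
    '"org":"someonechef","type":"heartbeat","sequence":142,'
    '"timestamp":"Tue, 12 Apr 2016 22:38:44 GMT",'
    '"incarnation_id":"223c888a-eca6-4d70-8403-7e128b5f9e0b",'
    '"job_state":"idle","job_id":null}'
)


def heartbeat_skew_seconds(line):
    """Return (node, seconds the heartbeat timestamp lags the log timestamp)."""
    # Leading field of the log line; ASSUMED to be UTC.
    prefix = line.split(" ", 1)[0]
    logged_at = datetime.strptime(prefix, "%Y-%m-%d_%H:%M:%S.%f")
    logged_at = logged_at.replace(tzinfo=timezone.utc)

    # The heartbeat message is the JSON object at the end of the line.
    message = json.loads(line[line.index("{"):])
    sent_at = parsedate_to_datetime(message["timestamp"])

    return message["node"], (logged_at - sent_at).total_seconds()


node, skew = heartbeat_skew_seconds(LOG_LINE)
print(f"{node} is about {skew / 60:.1f} minutes behind")
# prints: vm-2628-5132.someone.net is about 11.6 minutes behind
```

A skew of several minutes on one node suggests a clock problem on that node. Similar skew reported for many nodes at once may point to the server itself.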

This issue can happen in several ways:

1. Push Jobs server and client times are not synchronized.
2. Push Jobs clients are separated from the server by a firewall.
3. Potentially, but not reproduced so far: a push jobs client may have the correct time but be so busy for other reasons that it cannot respond to heartbeats in a timely manner, so its heartbeat responses arrive too late and the server rejects them.
4. Confirmed: a Push Jobs server on a small system can receive such a flood of resigned key renewals that it falls too far behind to process them within their TTL. In the log file, this looks exactly like cause 1 above. An indication that it is not cause 1 is that the Chef Server hosting the Push Jobs server is extremely busy and the Push Jobs server is consuming a lot of CPU time. This situation occurs with thousands of nodes.

For another way to check the health of push jobs client nodes, see https://docs.chef.io/plugin_knife_push_jobs.html#node-status. You need a knife client configured against the desired Chef Server, and that client must have the knife-push gem installed.

If the push jobs client nodes are available and ready to run jobs, the output will look something like this:

knife node status

acceptance-node-1 available
build-node-test-1 available
build-node-test-2 available
build-node-test-3 available
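With many nodes, it can help to filter this output for nodes that are not ready. The following is a small sketch, assuming each output line has the form "node-name status" as in the sample above; the helper name nodes_not_available is hypothetical, and any status other than "available" is treated as a problem.

```python
def nodes_not_available(status_output):
    """Return (node, status) pairs for nodes whose status is not 'available'."""
    problems = []
    for line in status_output.splitlines():
        parts = line.split()
        if len(parts) != 2:
            continue  # skip blank lines or anything not shaped like "node status"
        node, status = parts
        if status != "available":
            problems.append((node, status))
    return problems


# Sample output copied from this article.
SAMPLE = """\
acceptance-node-1 available
build-node-test-1 available
build-node-test-2 available
build-node-test-3 available
"""

print(nodes_not_available(SAMPLE))  # prints []
```

An empty list means every listed node reported itself as available.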