Multiple opscode-erchef Processes On The Loose!

Sean Horn -

If you see messages like those shown in [1] in your /var/log/opscode/opscode-erchef/* logs on a Chef Server 12 installation, you almost certainly have a rogue opscode-erchef process. The messages occur because two Erlang processes are both trying to assert themselves as "erchef@127.0.0.1". This is not a happy state for an Erlang process.
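
A quick way to check for these messages is to search the log directory for the duplicate_name error that appears in [1]. The following is only a minimal sketch: the function name logs_with_duplicate_name is hypothetical, and the default directory is simply the one named above.

```shell
# Sketch: print the names of log files that contain the duplicate_name error.
# The function name is made up for this example.
logs_with_duplicate_name() {
    # Default to the log directory named in this article; pass another to override.
    log_dir=${1:-/var/log/opscode/opscode-erchef}
    # -l prints only the names of matching files; errors for unreadable
    # entries or subdirectories are discarded.
    grep -l 'duplicate_name' "$log_dir"/* 2>/dev/null
}

# Usage:
#   logs_with_duplicate_name
#   logs_with_duplicate_name /some/other/log/dir
```

No output means no file in the directory contains the error string.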

Fix Procedure
---------------

You can confirm this by running ps fauxww | grep opscode-erchef on the system in question.
If the problem is occurring, you will see a process tree like the listing below.

Because PID 30278 is missing the "\_" at the beginning of its command line, you know that it is not associated with a runsv process and should be killed. You can either kill it while the system is running, or stop the Chef Server with chef-server-ctl stop, hunt down the process, and kill it then. Afterwards, start back up with chef-server-ctl start.

On HA systems, be careful with the backends: stop keepalived on the current Secondary first, before doing any maintenance on the Primary. That said, this issue usually occurs on Frontends or Standalone topology systems.
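
The stop, kill, start sequence described above can be sketched as a small shell function. This is an illustration rather than a supported tool: the function names run and kill_rogue_and_restart and the DRY_RUN variable are made up for this example, and the PID must be one you have already confirmed as rogue. The keepalived caution for HA backends is not handled here.

```shell
# Sketch only. With DRY_RUN=1 each command is printed instead of executed.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "would run: $*"
    else
        "$@"
    fi
}

# Stop Chef Server, kill the rogue PID given as $1, then start again.
kill_rogue_and_restart() {
    rogue_pid=$1
    if [ -z "$rogue_pid" ]; then
        echo "usage: kill_rogue_and_restart PID" >&2
        return 2
    fi
    run chef-server-ctl stop
    run kill "$rogue_pid"
    run chef-server-ctl start
}

# Usage (dry run first, then for real):
#   DRY_RUN=1 kill_rogue_and_restart 30278
#   kill_rogue_and_restart 30278
```

Running with DRY_RUN=1 first lets you review the exact commands before anything on the server is touched.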

ps fauxww | grep opscode-erchef
opscode 30278 0.0 0.0 12912 764 ? S 2014 0:02 /opt/opscode/embedded/service/opscode-erchef/erts-5.9.3.1/bin/run_erl /tmp//opt/opscode/embedded/service/opscode-erchef/ /opt/opscode/embedded/service/opscode-erchef/log /opt/opscode/embedded/service/opscode-erchef/bin/oc_erchef console
opscode 30279 1.9 1.6 675480 102416 pts/1 Ssl+ 2014 1902:56 \_ /opt/opscode/embedded/service/opscode-erchef/erts-5.9.3.1/bin/beam.smp -K true -A 5 -- -root /opt/opscode/embedded/service/opscode-erchef -progname oc_erchef -- -home /var/opt/opscode/opscode-erchef -- -boot /opt/opscode/embedded/service/opscode-erchef/releases/0.25.14/oc_erchef -embedded -config /opt/opscode/embedded/service/opscode-erchef/etc/app.config -name erchef@127.0.0.1 -setcookie erchef -pa lib/patches -- console
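
As a rough alternative to reading the tree by eye, the sketch below lists opscode-erchef processes whose parent is PID 1, which is where the Explanation section places the rogue process. It assumes init is PID 1 on your system and that legitimate erchef processes always sit under runsv; the function name find_orphaned_erchef is hypothetical. Treat its output as candidates to inspect, not as a kill list.

```shell
# Sketch: read "PID PPID ARGS" lines on stdin and print the PID of every
# opscode-erchef process whose parent is PID 1 (assumed to be init).
find_orphaned_erchef() {
    awk '$2 == 1 && /opscode-erchef/ { print $1 }'
}

# Usage on a live system (empty headers suppress the ps header line):
#   ps -e -o pid= -o ppid= -o args= | find_orphaned_erchef
```

A process running under runsv has runsv as its parent rather than PID 1, so it is not printed.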

Explanation
-------------

In full, the parentage of the two processes above looks like the following.

init -> Upstart -> runsvdir -> runsv ->
opscode-erchef-1-30279 localhost:8000 # This one won, but complains about the other process trying to use its name

init ->
opscode-erchef-2-30278 localhost:8000 # This one lost

Both processes want to assume the erchef@127.0.0.1 (Erlang) node identifier, but only one wins. In the PID 30278 versus 30279 case above, the correct one (the process under runsv) was winning. We have seen the one outside the runsvdir hierarchy win before, and then the system can't do any work.
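
The name conflict can also be observed through the Erlang port mapper daemon (epmd), where distributed Erlang nodes register their names. A minimal sketch, assuming an epmd binary is on your PATH (this article does not say where Chef Server installs it) and that its -names output uses the usual "name X at port N" lines; erchef_port_from_epmd is a hypothetical helper name.

```shell
# Sketch: read `epmd -names` output on stdin and print the port
# registered for the node name "erchef", if any.
erchef_port_from_epmd() {
    awk '$1 == "name" && $2 == "erchef" && $3 == "at" && $4 == "port" { print $5 }'
}

# Usage:
#   epmd -names | erchef_port_from_epmd
```

If a port is printed while the runsv-managed service is stopped, some other process is probably still holding the erchef name.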

Notes
------

[1] Rogue opscode-erchef logfile messages



2015-02-23_21:13:45.23305 {error_logger,{{2015,2,23},{15,13,45}},"Protocol: ~p: register error: ~p~n",["inet_tcp",{{badmatch,{error,duplicate_name}},[{inet_tcp_dist,listen,1,[{file,"inet_tcp_dist.erl"},{line,70}]},{net_kernel,start_protos,4,[{file,"net_kernel.erl"},{line,1314}]},{net_kernel,start_protos,3,[{file,"net_kernel.erl"},{line,1307}]},{net_kernel,init_node,2,[{file,"net_kernel.erl"},{line,1197}]},{net_kernel,init,1,[{file,"net_kernel.erl"},{line,357}]},{gen_server,init_it,6,[{file,"gen_server.erl"},{line,304}]},{proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,227}]}]}]}

2015-02-23_21:13:45.23315 {error_logger,{{2015,2,23},{15,13,45}},crash_report,[[{initial_call,{net_kernel,init,['Argument__1']}},{pid,<0.20.0>},{registered_name,[]},{error_info,{exit,{error,badarg},[{gen_server,init_it,6,[{file,"gen_server.erl"},{line,320}]},{proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,227}]}]}},{ancestors,[net_sup,kernel_sup,<0.10.0>]},{messages,[]},{links,[#Port<0.251>,<0.17.0>]},{dictionary,[{longnames,true}]},{trap_exit,true},{status,running},{heap_size,987},{stack_size,24},{reductions,545}],[]]}

2015-02-23_21:13:45.23323 {error_logger,{{2015,2,23},{15,13,45}},supervisor_report,[{supervisor,{local,net_sup}},{errorContext,start_error},{reason,{'EXIT',nodistribution}},{offender,[{pid,undefined},{name,net_kernel},{mfargs,{net_kernel,start_link,[['erchef@127.0.0.1',longnames]]}},{restart_type,permanent},{shutdown,2000},{child_type,worker}]}]}

2015-02-23_21:13:45.23404 {error_logger,{{2015,2,23},{15,13,45}},supervisor_report,[{supervisor,{local,kernel_sup}},{errorContext,start_error},{reason,shutdown},{offender,[{pid,undefined},{name,net_sup},{mfargs,{erl_distribution,start_link,[]}},{restart_type,permanent},{shutdown,infinity},{child_type,supervisor}]}]}

2015-02-23_21:13:45.24253 {error_logger,{{2015,2,23},{15,13,45}},std_info,[{application,kernel},{exited,{shutdown,{kernel,start,[normal,[]]}}},{type,permanent}]}