Chef Server 12.15.x to 12.17.x Upgrade Disk Space Failure

Sean Horn

This article covers recovery after a Chef Server turns out not to have enough space in /var for a 12.15.x to 12.17.x upgrade. The upgrade needs a little more than 2x the currently used disk space; cleanup can make the total requirement smaller, as covered below. From 12.15.x to 12.17.x, there are two major actions in the upgrade:

  1. Migrate the database schema from 1.33 to 1.34. This adds a new set of stored procedures to the database, which the new opscode-erchef code references when adding and updating users. The system will function for chef-client runs even if this migration doesn't happen at all.
  2. Upgrade PostgreSQL from 9.2 to 9.6. This action is basically an automated clone, or pg_dump/pg_restore, of whatever databases happen to be there and all their data. In this case, the combined storage usage for Chef Server on this system was about 320GB. We surmised that the great majority of that, about 286GB, might be an opscode_reporting database that was no longer needed. We were correct.
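Because the PostgreSQL upgrade clones the existing data, the free space on the filesystem has to exceed the size of the current data. A minimal sketch of a pre-upgrade check follows. The function names and the 10% headroom are assumptions made for illustration, not figures from Chef documentation, and the example path is the PostgreSQL directory described in this article.

```shell
#!/bin/sh
# Sketch: is there room for a second copy of the PostgreSQL data?
# The 10% headroom is an assumption, not an official Chef requirement.

# has_room_for_clone USED_KB FREE_KB
# Succeeds when FREE_KB is greater than USED_KB plus 10% headroom.
has_room_for_clone() {
  used_kb=$1
  free_kb=$2
  needed_kb=$(( used_kb + used_kb / 10 ))
  [ "$free_kb" -gt "$needed_kb" ]
}

# check_pg_dir DIR
# Measures DIR with du and the free space on its filesystem with df.
check_pg_dir() {
  dir=$1
  used_kb=$(du -sk "$dir" | awk '{print $1}')
  free_kb=$(df -Pk "$dir" | awk 'NR==2 {print $4}')
  if has_room_for_clone "$used_kb" "$free_kb"; then
    echo "ok: ${free_kb}KB free, ${used_kb}KB used"
  else
    echo "insufficient: ${free_kb}KB free, ${used_kb}KB used"
    return 1
  fi
}

# Example, run as root on the Chef Server:
# check_pg_dir /var/opt/opscode/postgresql
```

If the check reports insufficient space, dropping data that is no longer needed before the upgrade shrinks the amount that has to be cloned.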

We took the following steps to first clean up the failed upgrade caused by a lack of disk space, then recover as much filesystem space as we could, and then proceed with a successful PostgreSQL upgrade.

  • Uninstalled the 12.17.x package
  • Reinstalled the 12.15 package, so that we could start up on the old 9.2 dataset
  • Started PostgreSQL on the 9.2 data with /opt/opscode/embedded/bin/chpst -P -U opscode-pgsql -u opscode-pgsql /opt/opscode/embedded/bin/postgres -D /var/opt/opscode/postgresql/9.2/data
  • In another terminal session, logged in with sudo su - opscode-pgsql; psql opscode_chef and checked the estimated size of all databases with \l+
  • Dropped the opscode_reporting database with DROP DATABASE opscode_reporting; and recovered 286GB of filesystem space. This left only an opscode_chef database of about 30GB, which is still very large, but more tractable than waiting 5 hours for 330GB to clone.
  • Unconfigured the Reporting DB/service with rm /var/opt/opscode/nginx/addon.d/*report*. This prevents the system from advertising a functional Reporting service and accepting requests on that endpoint, which would fail because the service no longer has data.
  • Prepared for the next attempt at the 9.6 clone with rm -fr /var/opt/opscode/postgresql/9.6
  • Uninstalled the 12.15 package
  • Reinstalled the 12.17 package to get the binaries that work with 9.6 and the new database functions
  • Ran chef-server-ctl upgrade, which took effect and needed about 30 minutes to clone the data into /var/opt/opscode/postgresql/9.6/data
  • After a quick chef-server-ctl start post-upgrade, the system was up and able to take traffic.
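
The database and filesystem cleanup from the steps above can be recapped as a small dry-run script. This is only a sketch: the run, size_query, and cleanup_steps helpers are hypothetical names, the size query is a plain-SQL stand-in for \l+ using the standard pg_database_size and pg_size_pretty functions, and the package uninstall and reinstall steps are left out because they depend on the platform's package manager. The psql commands assume you are the opscode-pgsql user and the 9.2 postgres process is running.

```shell
#!/bin/sh
# Recap sketch of the cleanup commands from this article.
# DRY_RUN=1 only prints the commands; set it to 0 to execute them.
DRY_RUN=1

# run CMD...: print the command, and execute it only when DRY_RUN=0.
run() {
  echo "+ $*"
  if [ "$DRY_RUN" = "0" ]; then
    "$@"
  fi
}

# SQL that lists every database with its size, largest first.
size_query() {
  echo "SELECT datname, pg_size_pretty(pg_database_size(datname)) FROM pg_database ORDER BY pg_database_size(datname) DESC;"
}

cleanup_steps() {
  # Check database sizes, then drop the Reporting database.
  run psql -d opscode_chef -c "$(size_query)"
  run psql -d opscode_chef -c "DROP DATABASE opscode_reporting;"
  # Stop advertising the Reporting endpoint (sh -c expands the glob).
  run sh -c 'rm /var/opt/opscode/nginx/addon.d/*report*'
  # Remove the partial 9.6 clone left behind by the failed upgrade.
  run rm -fr /var/opt/opscode/postgresql/9.6
}

cleanup_steps
```

Reviewing the printed commands before switching DRY_RUN to 0 is worthwhile, since both DROP DATABASE and rm -fr are irreversible.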