A month ago one of our servers ran out of disk space during a long weekend - because this kind of stuff never happens during working hours, am I right? We caught the issue pretty late (thank you for your patience!), freed up a ton of disk space and did a hard reset. We immediately decided that, to avoid similar issues in the future, we’d need to migrate www.phusionpassenger.com to its own dedicated server.

For all kinds of legacy reasons some of the services we’re running shared a single server. Moving the Phusion Passenger website to a dedicated server was part of a bigger effort to better separate out these services, making sure we’re less dependent on ‘all other things being normal’.

Last week we completed the migration of www.phusionpassenger.com - or, more precisely: the proxy for enterprise packages, our Docs, the naked domain and staging. Services that remain on the ‘old’ server include our blog and the Heroku Status Service tool.

Writing the Ansible playbook

To automate the provisioning of the server we used Ansible, a Red Hat product. We had previously provisioned servers with Chef Cookbooks, but we were dissatisfied with them and decided to give Ansible a chance. Idempotency was an important argument, and Ansible is agentless, so there’s no need for an agent to be installed on the target machine.

All you need is SSH access and Python on the target server. There’s no need to upload cookbooks to the server either: you run the playbook from your local machine, and that playbook is the single source of truth.
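
For a rough idea of what a run looks like (the playbook name website.yml and the remote user are made up for illustration, not our actual setup):

# quick connectivity check: Ansible only needs SSH access and Python on the target
ansible all -i '<NEW_SERVER_IP>,' -u root -m ping

# apply the playbook from the local checkout; nothing is uploaded ahead of time,
# the playbook itself remains the single source of truth
ansible-playbook -i '<NEW_SERVER_IP>,' -u root --diff website.yml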

We requested a Let's Encrypt certificate on the old server and copied it to the new server. We then updated the renewal config file to reflect the new Let's Encrypt account ID.
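
In shell terms that came down to something along these lines (paths assume the standard certbot layout, and the account ID placeholder is whatever account exists under /etc/letsencrypt/accounts/ on the new server):

# copy certificates, keys and renewal configs over (standard certbot layout)
rsync -a /etc/letsencrypt/ root@<NEW_SERVER_IP>:/etc/letsencrypt/

# on the new server: point the renewal config at the account that exists there
sed -i 's/^account = .*/account = <NEW_ACCOUNT_ID>/' \
  /etc/letsencrypt/renewal/www.phusionpassenger.com.conf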

To transfer the database, we wrote a small shell script that creates a dump, scp's it to the new server and imports it there:

#!/usr/bin/env bash

# create dump as the website user
su -c "pg_dump --clean --serializable-deferrable website" website > /home/website/website.psql
# copy dump to new server
scp /home/website/website.psql website@<NEW_SERVER_IP>:/home/website/website.psql
# import the dump on the new server
ssh website@<NEW_SERVER_IP> 'psql website < /home/website/website.psql'
# reload nginx so the proxy configuration pointing at the new server takes effect
nginx -s reload
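
A quick sanity check after the import, simply listing the tables on the new server to confirm the dump actually made it across:

ssh website@<NEW_SERVER_IP> 'psql -d website -c "\dt"'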

We set up a proxy in nginx, forwarding all traffic that still arrives at the old server to the new one. Right before switching that proxy on, we copied the database from the old server to the new server one last time, to minimize data loss:

listen $OLD_IP:443 ssl http2;
server_name www.phusionpassenger.com;
ssl on;
ssl_certificate www.phusionpassenger.com.crt;
ssl_certificate_key www.phusionpassenger.com.key;
include ssl-defaults.conf;
root /var/www/website/current/public;

location / {
  proxy_set_header Host $host;
  proxy_http_version 1.1;
  proxy_buffering off;
  proxy_pass https://$NEW_IP$request_uri;
  # rewrite Location headers pointing at the site back to relative paths
  proxy_redirect https://www.phusionpassenger.com/ /;
}
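
Before touching DNS, the proxy can be tested end to end from any machine by pinning the hostname to the old server's address (the IP is a placeholder):

# resolve the hostname to the old server explicitly and check the response headers
curl -sI --resolve www.phusionpassenger.com:443:<OLD_IP> https://www.phusionpassenger.com/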

Lessons learned

We did not automate the DNS changes, as the migration was a one-time event and automating them would take significantly more time than doing them manually. After the DNS was updated we just had to wait for the changes to take effect. Because of the proxying on the old server, this part of the migration would be seamless.

The DNS records for both the naked domain (phusionpassenger.com) and the website domain (www.phusionpassenger.com) were updated to the new IP address. In due time, once every resolver has refreshed its cached entries for these domains, every request will be directed at the new server.
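
Checking propagation is then just a matter of querying both records until they return the new address:

# both should eventually return the new server's IP address
dig +short phusionpassenger.com A
dig +short www.phusionpassenger.com A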

Issues we ran into include:

  • The ssl_ciphers setting changed and support for TLS 1.0 and TLS 1.1 was dropped. Updating the cipher list fixed it, but not before causing issues for at least one enterprise customer.
  • The APT repo was inaccessible due to an incomplete deployment (more specifically: the packages had been re-signed, but the GPG signing key was missing), as reported by a handful of customers.

Next time around we could prevent the above issues by installing Passenger from the APT repo, for all supported Ubuntu versions, and testing this against staging (which was already migrated several days before production).
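
A rough sketch of such a test against staging (the staging hostname, Ubuntu codename and deployment details here are illustrative, not our actual setup):

# hypothetical staging repo URL; substitute the real staging host
STAGING_REPO=https://staging.phusionpassenger.com/apt/passenger

# the dropped TLS 1.0/1.1 support and changed cipher list show up in the handshake
openssl s_client -connect staging.phusionpassenger.com:443 -tls1_2 < /dev/null

# point APT at the staging repo and do a real install; a missing GPG signing key
# makes apt-get update fail loudly at this point
echo "deb $STAGING_REPO bionic main" > /etc/apt/sources.list.d/passenger-staging.list
apt-get update
apt-get install -y passenger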

Another lesson learned is that, since we don't have (much) data on our customers' environments (for very good reasons), we should probably send an email before undertaking something major like a server migration, even if it is likely to have no effect on the customer side.

Next steps

We will transition to a highly available infrastructure with multiple servers. At the time of the migration traffic wasn't high by any means, and we weren't expecting the APT repo to become a critical part of users' deployment pipelines. We’ve learned that customers do use it as critical infrastructure, in part because our deb/rpm packages have become much more popular than we anticipated.

We did not deem phusionpassenger.com uptime mission critical, so its maintenance and evolution weren't prioritized over product development (remember, we are a bootstrapped company and as such resources are tight). We’ve learned it's time to step up our game and make it highly available, as a service to our customers and users. We're doing this iteratively, and this migration is the first step. In the meantime, if this stuff keeps you up at night, you might want to look into setting up a private mirror.
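
For example, with apt-mirror (a sketch we haven't verified end to end; the codename and paths are just illustrative):

# write an apt-mirror config for the Passenger APT repo
cat > /etc/apt/mirror.list <<'EOF'
set base_path /var/spool/apt-mirror
deb https://oss-binaries.phusionpassenger.com/apt/passenger bionic main
clean https://oss-binaries.phusionpassenger.com/apt/passenger
EOF

# pull the mirror (rerun from cron to keep it fresh), then serve base_path
# over an internal web server and point your machines at that
apt-mirror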