
Duplicity + S3: easy, cheap, encrypted, automated full-disk backups for your servers

By Hongli Lai on November 11th, 2013


Backups are one of those things that are important, but that a lot of people don’t do. The thought of setting up backups always raised a mental barrier for me for a number of reasons:

  • I have to think about where to backup to.
  • I have to remember to run the backup on a periodic basis.
  • I worry about the bandwidth and/or storage costs.

I still remember the days when a 2.5 GB hard disk was considered large, and when I had to spend a few hours splitting MP3 files and putting them on 20 floppy disks to transfer them between computers. Backing up my entire hard disk would have cost me hundreds of dollars and hours of time. Because of this, I tend to worry about the efficiency of my backups. I only want to back up things that need backing up.

I tended to tweak my backup software and rules to be as efficient as possible. However, this made setting up backups a total pain, and made it very easy to procrastinate on backups… until it was too late.

I learned to embrace Moore’s Law

Times have changed. Storage is cheap, very cheap. Time Machine — Apple’s backup software — taught me to stop worrying about efficiency. Backing up everything not only makes backing up a mindless and trivial task, it also makes me feel safe. I don’t have to worry about losing my data anymore. I don’t have to worry that my backup rules missed an important file.

Backing up desktops and laptops is easy and cheap enough. A 2 TB hard disk costs only $100.

What about servers?

  • Most people can’t go to the data center and attach a hard disk. Buying or renting another hard disk from the hosting provider can be expensive. Furthermore, if your backup device resides in the same location as the data center, then destruction of the data center (e.g. a fire) will destroy your backup as well.
  • Backup services provided by the hosting provider can be expensive.
  • Until a few years ago, bandwidth was relatively expensive, making backing up the entire hard disk to a remote storage service an unviable option for those on a tight budget.
  • And finally, do you trust that the storage provider will not read or tamper with your data?

Enter Duplicity and S3

Duplicity is a tool for creating incremental, encrypted backups. “Incremental” means that each backup only stores data that has changed since the last backup run. This is achieved by using the rsync algorithm.

What is rsync? It is a tool for synchronizing files between machines. The cool thing about rsync is that it only transfers changes. If you have a directory with 10 GB of files, and your remote machine has an older version of that directory, then rsync only transfers new files or changed files. Of the changed files, rsync is smart enough to only transfer the parts of the files that have changed!
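
As a minimal illustration (the /data path and the backuphost name below are made-up examples), a plain rsync run between two machines only sends what changed since the previous run:

# The first run copies everything; later runs transfer only new files and changed parts.
rsync -av /data/ user@backuphost:/data/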

At some point, Ben Escoto authored the tool rdiff-backup, an incremental backup tool which uses an rsync-like algorithm to create filesystem backups. Rdiff-backup also saves metadata such as permissions, owner and group IDs, ACLs, etc. Rdiff-backup stores past versions as well and allows easy rollback to a point in time. It even compresses backups. However, rdiff-backup has one drawback: you have to install it on the remote server as well. This makes it impossible to use rdiff-backup to back up to storage services that don’t allow running arbitrary software.

Ben later created Duplicity, which is like rdiff-backup but encrypts everything. Duplicity works without needing special software on the remote machine and supports many storage methods, for example FTP, SSH, and even S3.

On the storage side, Amazon has consistently lowered the prices of S3 over the past few years. The current price for the us-west-2 region is only $0.09 per GB per month.

Bandwidth costs have also dropped tremendously. Many hosting providers these days allow more than 1 TB of traffic per month per server.

This makes Duplicity and S3 the perfect combination for backing up my servers. Using encryption means that I don’t have to trust my service provider. Storing 200 GB only costs $18 per month.

Setting up Duplicity and S3 using Duply

Duplicity by itself is still a relative pain to use. It has many options — too many if you’re just starting out. Luckily, there is a tool which simplifies Duplicity even further: Duply. It keeps your settings in a profile and supports pre- and post-execution scripts.

Let’s install Duplicity and Duply. If you’re on Ubuntu, you should add the Duplicity PPA so that you get the latest version. If not, you can just install an older version of Duplicity from the distribution’s repositories.

# Replace 'precise' with your Ubuntu version's codename.
echo deb http://ppa.launchpad.net/duplicity-team/ppa/ubuntu precise main | \
sudo tee /etc/apt/sources.list.d/duplicity.list
sudo apt-get update

Then:

# python-boto adds S3 support
sudo apt-get install duplicity duply python-boto

Create a profile. Let’s name this profile “test”.

duply test create

This will create a configuration file in $HOME/.duply/test/conf. Open it in your editor. You will be presented with a lot of configuration options, but only a few are really important. Two of them are GPG_KEY and GPG_PW. Duplicity supports asymmetric public-key encryption and symmetric password-only encryption. For the purposes of this tutorial we’re going to use symmetric password-only encryption because it’s the easiest.

Let’s generate a random, secure password:

openssl rand -base64 20

Comment out GPG_KEY and set a password in GPG_PW:

#GPG_KEY='_KEY_ID_'
GPG_PW='<the password you just got from openssl>'

Scroll down and set the TARGET options:

TARGET='s3://s3-<region endpoint name>.amazonaws.com/<bucket name>/<folder name>'
TARGET_USER='<your AWS access key ID>'
TARGET_PASS='<your AWS secret key>'

Replace “region endpoint name” with the host name of the region in which you want to store your S3 bucket. You can find a list of host names on the AWS website. For example, for us-west-2 (Oregon):

TARGET='s3://s3-us-west-2.amazonaws.com/myserver.com-backup/main'

Set the base directory of the backup. We want to backup the entire filesystem:

SOURCE='/'

It is also possible to set a maximum time for keeping old backups. In this tutorial, let’s set it to 6 months:

MAX_AGE=6M

Save and close the configuration file.

There are also some things that we never want to back up, such as /tmp, /dev and log files. So we create an exclusion file $HOME/.duply/test/exclude with the following contents:

- /dev
- /home/*/.cache
- /home/*/.ccache
- /lost+found
- /media
- /mnt
- /proc
- /root/.cache
- /root/.ccache
- /run
- /selinux
- /sys
- /tmp
- /u/apps/*/current/log/*
- /u/apps/*/releases/*/log/*
- /var/cache/*/*
- /var/log
- /var/run
- /var/tmp

This file follows the Duplicity file list syntax. The - sign here means “exclude this directory”. For more information, please refer to the Duplicity man page.

Notice that this file excludes the log files of Capistrano-deployed Ruby web apps. If you’re running Node.js apps on your server, you can exclude their log files in a similar manner, as shown below.
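
For example, if your Node.js apps write their logs under a path like /var/www/<app>/logs (a hypothetical layout; adjust the pattern to wherever your apps actually write logs), you could append a line such as:

- /var/www/*/logs/*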

Finally, go to the Amazon S3 control panel, and create a bucket in the chosen region:

Create a bucket on S3

Enter the bucket name

Initiating the backup

We’re now ready to initiate the backup. This can take a while, so let’s open a screen session so that we can terminate the SSH session and check back later.

sudo apt-get install screen
screen

Initiate the backup:

sudo duply test backup

Press Ctrl-A, then D, to detach the screen session.

Check back a few hours later. Log in to your server and reattach the screen session:

screen -x

You should see something like this, which means that the backup succeeded. Congratulations!

--------------[ Backup Statistics ]--------------
...
Errors 0
-------------------------------------------------

--- Finished state OK at 16:48:16.192 - Runtime 01:17:08.540 ---

--- Start running command POST at 16:48:16.213 ---
Skipping n/a script '/home/admin/.duply/test/post'.
--- Finished state OK at 16:48:16.244 - Runtime 00:00:00.031 ---

Setting up periodic incremental backups with cron

We can use cron, the system’s periodic task scheduler, to set up periodic incremental backups. Edit root’s crontab:

sudo crontab -e

Insert the following:

0 2 * * 7 env HOME=/home/admin duply test backup

This line runs the duply test backup command every Sunday at 2:00 AM. Note that we set the HOME environment variable to /home/admin. Duply runs as root because the cron job belongs to root, but the Duply profile is stored in /home/admin/.duply, which is why HOME needs to point there.

If you want to set up daily backups, replace “0 2 * * 7” with “0 2 * * *”.
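
For instance, the daily variant of the same crontab line would look like this:

0 2 * * * env HOME=/home/admin duply test backup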

Making cron jobs less noisy

Cron has a nice feature: it emails you the output of every job it runs. If you find that this gets annoying after a while, you can make it email you only when something goes wrong. For this, we’ll need the silence-unless-failed tool, part of phusion-server-tools. This tool runs the given command and swallows its output unless the command fails.

Install phusion-server-tools and edit root’s crontab again:

sudo git clone https://github.com/phusion/phusion-server-tools.git /tools
sudo crontab -e

Replace:

env HOME=/home/admin duply test backup

with:

/tools/silence-unless-failed env HOME=/home/admin duply test backup

Restoring a backup

Simple restores

You can restore the latest backup with the Duply restore command. It is important to use sudo because this allows Duplicity to restore the original filesystem metadata.

The following command restores the latest backup to a specific directory. The target directory does not need to exist; Duplicity will create it automatically. After restoration, you can move its contents to the root filesystem using mv.

sudo duply test restore /restored_files

You can’t just run sudo duply test restore / here, because your system files (e.g. bash, libc, etc.) are in use.

Moving the files from /restored_files to / using mv might still not work for you. In that case, consider booting your server from a rescue system and restoring from there.

Restoring a specific file or directory

Use the fetch command to restore a specific file. The following restores the /etc/passwd file from the backup and saves it to /home/admin/passwd. Notice the lack of a leading slash in the etc/passwd argument.

sudo duply test fetch etc/passwd /home/admin/passwd

The fetch command also works on directories:

sudo duply test fetch etc /home/admin/etc

Restoring from a specific date

Every restoration command accepts a date, allowing you to restore from that specific date.

First, use the status command to get an overview of backup dates:

$ duply test status
...
Number of contained backup sets: 2
Total number of contained volumes: 2
 Type of backup set:                            Time:      Num volumes:
                Full         Fri Nov  8 07:38:30 2013                 1
         Incremental         Sat Nov  9 07:43:17 2013                 1
...

In this example, we restore the November 8 backup. Unfortunately, we can’t just copy and paste the time string. Instead, we have to write the time in the w3 datetime format. See also the Time Formats section in the Duplicity man page.

sudo duply test restore /restored_files '2013-11-08T07:38:30'

Safely store your keys or passwords!

Whether you use asymmetric public-key encryption or symmetric password-only encryption, you must store your key or password safely! If you ever lose it, you will lose your data. There is no way to recover encrypted data for which the key or password is lost.

My preferred way of storing secrets is to store them inside 1Password and to replicate the data to my phone and tablet, so that I have redundant encrypted copies. Alternatives to 1Password include LastPass and KeePass, although I have no experience with them.

Conclusion

With Duplicity, Duply and S3, you can set up cheap and secure automated backups in a matter of minutes. For many servers, this combination is a silver bullet.

One thing this tutorial hasn’t dealt with is database backups. Although we are backing up the database’s raw files, relying on those copies isn’t a good idea. If the database files were being written to at the time the backup was made, the backup may contain irrecoverably corrupted database files. Even the database’s journaling file or write-ahead log won’t help, because these mechanisms are designed to protect against power failures, not against concurrent file-level backup processes. Luckily, Duply supports the concept of pre-scripts. In the next part of this article, we’ll cover pre-scripts and database backups.
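
As a rough preview: a Duply pre-script is simply an executable file named pre in the profile directory ($HOME/.duply/test/pre in our case), and Duply runs it before every backup. A minimal sketch for a PostgreSQL server might look like the following; the dump location /var/backups/postgresql is a made-up example, so pick any directory that is not excluded from the backup.

#!/bin/sh
# Hypothetical pre-script: dump all databases to a directory that the backup
# includes, so the backup contains a consistent SQL dump rather than
# potentially inconsistent raw database files.
set -e
DUMP_DIR=/var/backups/postgresql
mkdir -p "$DUMP_DIR"
sudo -u postgres pg_dumpall | gzip > "$DUMP_DIR/all-databases.sql.gz"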

I hope you’ve enjoyed this article. If you have any comments, please don’t hesitate to post them below. We regularly publish news and interesting articles. If you’re interested, please follow us on Twitter, or subscribe to our newsletter.

Discuss on Hacker News.



  • andyjeffries

    DreamObjects (S3 compatible) is only 5c/GB and no API call charges either. Much cheaper than S3 :-)

    http://www.dreamhost.com/cloud/dreamobjects/

  • http://www.phusion.nl/ Hongli Lai

    Cool, good to know that that option exists. The great thing about Duplicity is that it supports any S3-compatible server. Just change the host name in the TARGET option.

  • Ernestas Lukoševičius

    If you’re running full backups, Amazon Glacier is a better option. 0.01 USD/GB month.

  • Mark

    is Amazon Glacier also S3 compatible, so that I can set the TARGET to something like glacier.eu-west-1.amazonaws.com?

  • http://www.phusion.nl/ Hongli Lai

    It isn’t, it uses a different API. Furthermore, retrieving and restoring something from Glacier takes 4 hours. Since Duplicity has to fetch metadata from the server on every backup run, Glacier is not a feasible option.

  • Ernestas Lukoševičius

    Since the topic says full-disk backups, it is a highly feasible option. Running incremental for full-disk – is not. Sorry for not reading the article and just the topic.

  • Gerry

    Depends! Amazon by default offer higher durability on objects and the more you are storing the cheaper their pricing is: http://aws.amazon.com/s3/pricing/

    It really depends on expected usage

  • http://nandovieira.com.br Nando Vieira

    Restoring a backup with duply main restore / gives an error, so I tried to use --force. This brought new errors (mainly Error '[Errno 17] File exists'), stopping the restore process with a segmentation fault.

    So, is there anything one must do to restore the backup properly? Log file: https://gist.github.com/fnando/7415162.

  • http://abevoelker.com/ Abe Voelker

    Did you consider using Tarsnap?

  • http://www.phusion.nl/ Hongli Lai

    Yes. Tarsnap seems to be doing something similar to Duplicity but is more expensive, at 30 cents per GB per month. I didn’t see why I should use Tarsnap and I already had an AWS account so I went with this instead. I could be wrong though and maybe Tarsnap has advantages over Duplicity.

  • http://www.phusion.nl/ Hongli Lai

    You can try restoring to a temporary directory, then moving the files to the root directory using mv.

  • Dan

    How do you prevent your backup from being deleted if somebody owns your server and gets your API key?

  • David Burley

    If you create a lifecycle rule to move content to glacier, and then another to delete the content a year later, you get the benefits of glacier pricing. The only caveat is you need to make full backups periodically and any restore after something has been moved to glacier requires manual effort to move the content back to S3 before restoring along with the noted delay. However, this comes at considerable savings.

  • Lovingdesigns

    I like this question.

    It is basically next to impossible to protect yourself from this; you will need to trust the hoster of the physical hardware.

    You could encrypt the drive and unlock it via a tiny ssh server though.

  • Lovingdesigns

    Nice, it’s about half the price of S3!

  • Dan

    It’s a problem I’m facing now doing backups to the cloud. If somebody gets my Rackspace API key, they can do all sorts of really awful things. I wish the providers would let me set permissions to what a certain API key can do.

  • Anon

    If you don’t have full control over your backups you might as well not backup.

  • http://www.phusion.nl/ Hongli Lai

    There are two ways:

    1. Use Amazon IAM permissions. You can create a user with a new API key, and restrict access for this user to download and upload only. Unfortunately this also prevents Duplicity from deleting old backups.

    2. Enable versioning in your bucket. That way you are protected against all deletions.

  • gosukiwi

    This is nice, I didn’t know about Duplicity, I already have my VM backups on DigitalOcean but I’ll bookmark this just in case :P

  • lzap

    If you have a shell on the remote, I can offer this: an incremental backup solution based on *pure* rsync and ssh (no other tools involved) that uses hardlinks to create a complete “snapshot” every time you run it (but does not consume extra space on the target disk)

    https://gist.github.com/lzap/7418061

  • http://tkware.info TK

    I’d think twice about using Glacier for recovering large amounts of data. The storage and bandwidth on the in side is quite reasonable, but for outbound..

    https://news.ycombinator.com/item?id=4412886

  • Ernestas Lukoševičius

    Not sure how is it relevant here, since pricing for bandwidth is exactly the same as in S3 case. http://aws.amazon.com/glacier/pricing/ http://aws.amazon.com/s3/?navclick=true#pricing

  • Ernestas Lukoševičius

    Then you should run your backups from another server, one that your primary server cannot connect to, but which can connect to the primary.

  • Pizzicato Five Fan

    What about using the lifecycle setting on buckets to migrate data to Glacier after a given period of time? Would that also break the functionality?

  • Scott

    Do you know if duplicity will allow bucket to bucket direct transfers? I’d like to backup data already in an S3 bucket to another bucket without bringing the data down to a local machine first. Thanks!

  • http://nandovieira.com.br Nando Vieira

    Yeah, that’s what I thought. Maybe you should consider updating the article (the sudo duply main restore /) because, well, it doesn’t work. :)

  • http://www.phusion.nl/ Hongli Lai

    I think so. Duplicity has a very specific format for its files so you can’t just move something to Glacier.

  • http://www.phusion.nl/ Hongli Lai

    Yes. Use the copy-paste functionality in the S3 control panel.

  • http://kennydude.me/ Joe Simpson

    Dreamhost is a ball game of whether you get thrown on a decent server or not.

  • andyjeffries

    I’m not advocating them as a general host, but their S3 compatible object store is cheap… Even better I signed up early so got a lifetime 4c/GB price :-)

    P.S. Freaky, one of my best mates is called Joe Simpson, but he wouldn’t have a clue about Dreamhost, S3 and the like :-)

  • Scott

    Thanks, but I meant programmatically.

  • hellvinz

    for chaining backups (database dump + duplicity), there’s https://labs.riseup.net/code/projects/backupninja

  • Neil

    Nice article; thanks!

    One thing I’d be interested to know your opinion on; did you consider any of the more “serous” backup solutions such as Bacula or Amanda?

    And would you consider them if you had to backup a larger number of servers? I’m a little worried about getting a central “view” of the backups if there were 10 (or way more!) servers running duply scripts!

  • Neil

    Obviously “serous” should have read “serious”!

  • http://chromano.in/ Carlos H. Romano

    Not exactly secure, you could truncate the files on the server and the backup server would fetch them… versioning and incremental backups would fix it, but still, recent data won’t be available

  • Gary Rozanc

    Any idea why I keep getting the following error?

    Failed to create bucket (attempt #1) ‘bucket-name-here’ failed (reason: error: [Errno 111] Connection refused)

  • http://Techderp.net/ Moscato SugarRush

    You could always use the s3-compatible dreamobjects at 4 cents per gig stored and 4 cents per gig out with 1 free gig stored if you have google authenticator turned on

    http://www.dreamhost.com/r.cgi?1529236

    Check it out

  • http://Techderp.net/ Moscato SugarRush

    I got the 4 cents per life price 2 days ago

  • http://Techderp.net/ Moscato SugarRush

    this makes me wonder how much outgoing data does it actually use, to run duplicity, since s3 charges 12 cents per gig outgoing

  • JFD

    … or try tklbam (TurnKey Linux Backup And Migration) turnkeylinux.org which uses duplicity + S3 with even more simplicity. It’s now available as a stand alone package using their ppa.

  • Aussie Mike

    Just want to say thank you for a really well-written and useful guide. I went from having the thought “I really should do encrypted backups to S3”, to having the whole thing up and running, in about 15 minutes!

  • http://www.faix.cz/ Jan Faix

    You may use http://s3tools.org/s3cmd cp to copy programmatically.

  • M T

    Isn’t Glacier’s minimum storage charge 3 months? So, if you roll your backups more frequently than once a month, it is not really a money-saver…

  • Ernestas

    Even if you delete your backups the same day, you will not be charged more than 0.03 cents/GB for that month.

  • M T

    Even if you delete your backups the same day, you will not be charged more than 0.03 cents/GB for that month.

    (First of all, that’s 0.03 dollar/GB — or 3 cents per GB.)

    My point was, you will not be charged less than that either. And S3 storage costs slightly less than 3 cents/GB/month. So, if you are storing a GB for less than a month, your cost with Glacier will be 3 cents, and slightly less than that with S3… And you will not need to wait 4 hours for the Glacier-hosted file to be made available either.

  • Cezinha

    Thanks, it helps a lot!