Backups: The Customer, The Developer and The Anecdote

A few years ago, I wrote a hack-and-ship piece of software. I needed a name for the source code repository, so I code-named it "Electric Sheep" in reference to a well-known sci-fi novel. This humble proof of concept let you abstract complex commands behind a Ruby-based DSL and execute them on multiple hosts via SSH. Like most of (my) free-time projects, it was quickly buried beneath layers of serious stuff. Last year, some of my teammates and I decided to resurrect it as a backup tool. Early this year, we started rewriting the software from the ground up and renamed it Electric Sheep IO, to avoid any confusion with the identically named, brilliant and bright Electric Sheep, which we had only recently discovered.

Electric Sheep IO is an orchestrator that allows users to archive remote resources (files, directories, databases) from various projects and apps. All backup tasks are executed from a single process that uses SSH to log on to remote hosts and create archives, then copies and moves them with the aid of several "transports" such as SCP or Amazon S3. Its configuration file format allows users to turn the canonical and, IMHO, ugly:

#!/bin/bash
set -e

DATABASE="war-and-peace"
DUMP="${DATABASE}-$(date +%Y%m%d).sql"
ARCHIVE="${DUMP}.tar.gz"
HOST="db1.example.com"
USER="operator"
DBUSER="backup"
DBPASS="Clear text"

ssh ${USER}@${HOST} /bin/bash << EOF
  mysqldump --user=${DBUSER} --password=${DBPASS} ${DATABASE} > /tmp/${DUMP}
  cd /tmp
  tar cfz ${ARCHIVE} ${DUMP}
  rm -f ${DUMP}
EOF
scp ${USER}@${HOST}:/tmp/${ARCHIVE} /tmp/${ARCHIVE}
ssh ${USER}@${HOST} "rm -f /tmp/${ARCHIVE}"

export AWS_ACCESS_KEY_ID="ABCDEFGHIJKLMN"
export AWS_SECRET_ACCESS_KEY="Clear text"
aws s3 mv /tmp/${ARCHIVE} s3://mybucket/${ARCHIVE}

into this reader-friendly equivalent:

host "db-host", hostname: "db1.example.com"
project "my-database-backup" do
  resource "database", name: "war-and-peace", host: "db-host"
  remotely as: "operator" do
    mysql_dump user: "backup", password: encrypted("XXXXXXXX")
    tar_gz delete_source: true
  end
  move to: "localhost", using: "scp", as: "operator"
  move to: "backup-bucket", using: "s3", access_key_id: "ABCDEFGHIJKLMN", secret_key: encrypted("XXXXXXXX")
end

At the beginning, since we were scratching our own itches, we had an implicit agreement on the values we wanted at the core of the software: central management, convention over configuration, readability, repeatability. We also started working on what we consider must-have features; scheduling, reporting and notifications, alerts, and templates for common stacks are at the top of our to-do list. However, we hit the point where we had to verify we were trying to fix a problem other people were actually experiencing. We had the intuition that some people out there shared our feelings and saw backups as a painful, stressful, unsexy and boring discipline, but we didn't know for sure.

According to the Boston Computing Network's Data Loss Statistics, "60% of companies that lose their data will shut down within 6 months of the disaster". Oddly, while it's easy to find articles or blog posts explaining how to back up this kind of database or that sort of file, we haven't found anyone openly recounting a data loss tragedy they actually lived through. It's probably the most striking observation we've made: backup is a taboo subject. It belongs to the semantic field of disaster, and no one likes to talk about hacked sites, disk crashes, data corruption, errors during database replication, data migration failures or an accidental UPDATE command without a WHERE clause. Full disclosure: I've personally experienced or witnessed all of these situations, plus an unexpected, Nietzschean, stern but magnificent DROP DATABASE statement.

We decided to overcome the taboo and set up an experiment. We made a list of every person we knew who had been involved in the development or hosting of an application or website, and talked to each of them. We wrote a short "interview script" so that we'd get the same bits of context from everyone, but for the most part we went completely off script and listened to people talking about their experiences and sharing anecdotes. Manual backups, scripts failing silently, lack of offsite backups, issues during server migrations, corrupted or unusable archives, confusion between replication and backups: we heard a lot of stories, the worst of them being no backup at all.

The Unspoken Truth™

The Customer

Customers assume backups are handled by the development team, but they usually don't ask. They value new features a lot, so they're (quite logically) not in the mood to spend a cent on something they consider worthless up front. Ultimately, they blame the developers whenever a problem occurs, often a few weeks or months after the app has been rolled out to production.

Sysadmin is a job in its own right, with specific skills, roles and responsibilities. Backup is traditionally one of the sysadmin's many duties. Now, Customer, let me tell you an untold story: in a lot of, if not most, web or software development teams, you won't find a permanent sysadmin (if you find one at all). Servers are configured manually, apps are deployed by hand, monitoring is a second-class citizen, system updates happen haphazardly. If you're lucky, someone may copy a few lines of Bash from a Stack Overflow answer and set up a cron job on a server. No one will ever check that the scripts are actually running, or try to restore one of the snapshots. The development team might have invited a sysadmin or a DevOps engineer to join the party... but you were not willing to pay for this stuff, remember?
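To make the "failing silently" part concrete, here is a minimal sketch (the crontab line and paths are hypothetical) of why such a copy-pasted one-liner can accumulate empty archives without anyone noticing: by default, a shell pipeline's exit status is the status of its last command only.

```shell
#!/bin/bash
# A typical copy-pasted backup cron job might look like this (hypothetical):
#   0 3 * * * mysqldump mydb | gzip > /backups/mydb.sql.gz
#
# The trap: if the dump command fails, gzip still exits 0, so cron sees
# success while you silently collect valid-looking but empty archives.
# Here, `false` stands in for a failing mysqldump.
false | gzip > /tmp/backup.sql.gz
echo "exit status without pipefail: $?"   # 0: the failure is invisible

# With pipefail, the pipeline reports the failure of any command in it.
set -o pipefail
false | gzip > /tmp/backup.sql.gz || status=$?
echo "exit status with pipefail: ${status}"   # 1: the failure is now detectable
```

Even with `pipefail`, of course, nobody is notified unless something reads cron's mail or monitors the job, which is precisely the gap described above.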

The Developer

A developer's job is to design, implement, test and ship (hopefully) working apps. Period. And yet I have embraced the values promoted by DevOps, for a really pragmatic reason: we developers were forced to adopt them well before the name even existed.

Due to increasing pressure on costs and deadlines, and a lack of resources, lots of developers (especially in small companies or startups) have to carry the burden of all the engineering roles that were previously the playground of QA teams, analysts, database admins and sysadmins. Often without being paid for this additional workload.

Human beings can only hold a limited amount of knowledge. It may not be obvious at first glance, but developers fall into the category of human beings. Constantly switching from one role to another is cognitively expensive, even for the smartest kids on the block. In this situation, development teams face many trade-offs and are tempted to postpone the tasks with the least perceived value from the customer's point of view. Such as backups.

This, in my opinion, is the whole point of what we should try to achieve with Electric Sheep IO (and the road ahead is very long): lower the burden by turning backups into something intuitive and easy, and thus help developers get back to writing code.

The Anecdote, at last

During the experiment, I had a chat with an indie developer who had created a mobile game with a REST backend. He explained to me that he was a self-taught developer and that he'd been struggling for months with all the components he'd had to juggle in his development stack. When we came to the matter of backups, he honestly confessed that he had gone two years without a single database backup. Knowing this app had been his sole source of revenue during that period, I asked why he had stayed so long in such a frightening situation. He answered: "I don't know for sure. I think I was not aware of the risk. When I created the application, it was not mentioned in the tutorial." I recently had a look at the "getting started with X", "beginner's guide to Y" and "install Z in 5 minutes" tutorials of several development frameworks, CMSes and blogging engines.

He was right. It was not mentioned in the tutorials.