Backing Up Everything (Again)

This will take a while. Bear with me.

I’m obsessive about backing up my data. I don’t want to take the chance of ever losing anything important. But that doesn’t mean I’m a data hoarder. I like to think I’m pragmatic about it. And I don’t trust anyone else to do it for me.

From around 2006 to 2012, I kept a Mac mini attached to our TV with a Drobo hanging off the back. It had all our downloaded movies on it. And every night it would automatically download the latest releases of our favorite TV shows from Usenet so my wife and I could watch them with Plex the next day. It worked great, and all the media files were stored redundantly across multiple hard drives with tons of storage space. (Would it survive a house fire? No. But files like that weren’t critical.) But with the rise of streaming services and useful pay-to-watch stores like iTunes, now I’d rather just pay someone else to handle all of that for me. So, I don’t keep any media files like that locally any longer.

But my email? My financial and business documents? My family’s photo and home video archive? I’m really obsessive about that.

For most of my computing life, all of that data was small enough to fit on my laptop or desktop’s hard drive. In college, I remember burning a CD (not a DVD) every few months with all of my school work, source code, and photos on it for safekeeping. The internet wasn’t yet fast enough to make backing up to a cloud (were clouds even a thing back then?) feasible, so as my data grew I just cloned everything nightly to a spare drive using SuperDuper and Time Machine. It worked for the most part. Sure, I still worried about my house catching fire and destroying my backups, but there really wasn’t an alternative other than occasionally taking one of the backup drives to work or a friend’s house.

But then the internet got fast, really fast, and syncing everything to the cloud became easy and affordable. I was a beta user of Gmail back in 2004. I was an early paid subscriber of Dropbox, starting around 2008. All of my data was stored in their services and fully available on every computer and – eventually – mobile device. At the time, I thought I had reached peak backup.

I was wrong.

Now we have too much data. My email is around 20GB. My family’s photo library is approaching 500GB. That’s more data than will fit on my laptop’s puny SSD. It will fit on my iMac, but it leaves precious little space available for anything else. I could connect external drives, but that gets messy and further complicates my local backup routine. (Yes, Backblaze is a good potential solution to that.)

Another problem is that most of our data now is either created directly in the cloud (email, Google Docs, etc.) or is immediately sent to it (iPhone photos uploaded to iCloud and/or Google Photos), bypassing my local storage. If you trust Google (or Apple) to keep your data safe and backed up, that’s great. I don’t. I’ve heard too many horror stories about one of Google’s automated AI systems flagging an account and locking the user out. And with no way to contact an actual human, you’re dead in the water along with all your data. Especially if you lose access to your primary email account, which is the key to all your other online accounts.

So, I need a way to back up my newly created cloud data, too. This is getting complicated.

First step: my email. This is easy. Five years ago I set up new email addresses for my personal and business accounts with Fastmail. They’re amazing. I imported my 10+ years’ worth of email from Google (sadly, my pre-2004 college email and personal accounts are lost to the ether), set up a forwarding rule in Gmail, and, with the help of 1Password, changed all of my online services to use my new email. It took about a month to switch everything over, but now the only email coming to my old Gmail address is spam. Fastmail keeps redundant backups of my email. And I have full IMAP copies available on multiple computers in case they don’t. And if something ever goes wrong, unlike Google – where advertisers are the customer and I’m the product – I pay Fastmail every month and can call up a live human to talk to.

Source code. I’m a paying GitHub customer. Everything’s stored and backed up there. But still, what if they screw up? For a while I ran a small, self-hosted GitLab server instead of GitHub and set it to back up all my code nightly to S3. That worked great. But I like GitHub’s UI and feature set better. Plus, it’s one less server I have to manage. So, where do I mirror my code to? (Much of my code is checked out locally on my computer, but not all of it.)

Back in 2006, my boss at the web agency I was working at told me about rsync.net. They provide you with a non-interactive Unix shell account that you can pipe data to over SFTP, rsync, or any other standard Unix tool. You pay by the GB per month, and they scale to petabyte sizes for customers who need that. So, I signed up and used them to back up all of my svn (remember svn?) repos. With the rise of git and my switch to GitHub, I cancelled my account and mostly forgot about them.

But, aha!, I now have new data storage problems. Rsync.net could be a great solution again. So, I re-signed up and set up my primary web server to mirror all of my GitHub repos over to them each night. Here’s roughly how the script works…
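The idea is simple: keep a bare mirror of each repo on the server and rsync the whole directory over to rsync.net. Something along these lines (the GitHub token, paths, and hostname are placeholders – adjust to taste):

    #!/usr/bin/env bash
    # Nightly GitHub -> rsync.net mirror. Assumes a GitHub personal access
    # token in $GITHUB_TOKEN, jq installed, and SSH access to rsync.net.
    set -euo pipefail

    MIRROR_DIR="$HOME/github-mirrors"
    mkdir -p "$MIRROR_DIR"

    # Grab the SSH clone URLs for my repos (first 100; paginate if you have more).
    repos=$(curl -s -H "Authorization: token $GITHUB_TOKEN" \
        "https://api.github.com/user/repos?per_page=100" | jq -r '.[].ssh_url')

    for url in $repos; do
        name=$(basename "$url" .git)
        if [ -d "$MIRROR_DIR/$name.git" ]; then
            git -C "$MIRROR_DIR/$name.git" remote update --prune
        else
            git clone --mirror "$url" "$MIRROR_DIR/$name.git"
        fi
    done

    # Ship the mirrors off-site.
    rsync -az --delete "$MIRROR_DIR/" user@server.com:github-mirrors/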

Next up, important documents. Traditionally, I’ve kept everything that would normally go in my Mac’s “Documents” folder in my Dropbox account. That worked great for a long time. But once I started paying Google for extra storage space for Google Photos (more on that later), it felt silly to keep paying Dropbox as well. So, after 10+ years as a paid subscriber, I downgraded to a free account and moved everything into Google Drive. Sure, it’s not as nice as Dropbox, but it works and saves me $10 a month.

Like I said above, I mostly trust Google, but not entirely. So, let’s sync my Google Drive’s contents to rsync.net, too. Edit your Mac’s crontab to add this line…

30 * * * * /usr/bin/rsync -avz /Users/thall/Google\ Drive/ user@server.com:google-drive

Also, all of the really important paperwork that would otherwise live in a fire safe in my garage goes into a DEVONthink library so I can search the contents of my PDFs. It’s synced automatically with iCloud and available across my mobile devices. But still, better back that up, too.

45 * * * * /usr/bin/rsync -avz /Users/thall/FireSafe.dtBase2 user@server.com:

So, that’s all of my data except for the big one – my family’s photo and home video archives.

For a long time I kept all my family’s archives in Dropbox. I even made an iOS app dedicated to browsing your library. I could have stuck everything in Apple’s Photos.app, where it’s available on my devices via iCloud, but that’s tied to my Apple ID. My wife wouldn’t be able to see those photos. Plus, any photos she took on her phone would get stored in her iCloud account and not synced with the main family archive. So, we used the Dropbox app, signed in to my account, to back up our phones’ photos.

But, like I said earlier, our photo and video library became too big to comfortably fit in Dropbox. Plus, Google Photos had just been released, and it was amazing. Do I like the thought of Google’s AI robots churning through my photos and possibly using that data to sell me advertisements? No. But their machine-learning expertise and big-data solutions make it really hard to resist. So, I spent a week and moved everything out of Dropbox into Google Photos.

Now everything is sorted into albums, by date, and searchable on any device. I can literally type into their search box “all photos of my wife’s grandmother taken in front of the Golden Gate bridge” and Google returns exactly what I’m looking for. It’s wonderful.

My wife’s phone has the Google Photos app installed and signed in to my account, so every photo she takes gets stored in a shared library we can both access and view on all our devices.

But what’s the recurring theme of this blog post? That’s right. I don’t fully trust any cloud provider to be the only source of my data. Someone clever said “the cloud is just someone else’s computer.” That’s exactly correct. If your data isn’t in at least two different places, it’s not really backed up.

But how do I back up my 500GB+ of photos that are already in Google’s cloud? And then how do I keep newly added items synced as well?

As usual, I tried to find a way to make it work with rsync.net. I found a great open-source project called rclone. It’s a command line tool that shuffles your files between cloud providers or any SFTP server with lots of configurable options and granularity.

First off, even if rclone does do what I need, I can’t just run it on my Mac. My internet is too slow for the initial backup. I need to run it on one of my servers so I have a fast data-center-to-data-center connection between Google and rsync.net.

Getting it set up on one of my Ubuntu servers at Linode was a simple bash one-liner. Configuring it to then work with my Google and rsync.net accounts was just a matter of running their easy-to-use configuration wizard.
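For reference, the one-liner is rclone’s own install script and the wizard is rclone config – at least that’s how it works as of this writing:

curl https://rclone.org/install.sh | sudo bash
rclone config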

Note: rclone doesn’t support connecting to Google Photos directly. Instead, you need to log in to Google Drive on the web and enable the “Automatically put your Google Photos into a folder in My Drive” option in Settings. (And also tell your Google Backup & Sync Mac app not to sync that folder locally – unless you have the space available, which I don’t.) Then rclone can access your Google Photos data via a special folder in your Drive account.

With everything configured, I ran a few connection tests and it all worked as expected. So, I naively ran this command thinking it would sync everything if I let it run long enough:

rclone copy -P "GoogleDrive:Google Photos" rsync:GooglePhotos

Things started out fine. But, thanks to Google API rate limits, the transfer was quickly throttled to 300KB/sec. At that rate it would have taken MONTHS to move my data. And the connection stalled out entirely after about an hour. I even configured rclone to use my own private Google OAuth keys, but got the same result. So, I needed a better way to do the initial import.

Google offers their Takeout service. It lets you download an archive of ALL your data from any of their services. I requested an archive of my Google Photos account and eight hours later they emailed me to let me know it was ready. Click the email link to their website, boom. Ten 50GB .tgz files. Now what to do with them?

I can’t download them to my Mac and re-upload them – that’s too slow. Instead, I’ll just grab the download URLs and use curl on my server to get them, extract them, and sync them over.

I don’t have enough room on my primary web server – plus I don’t want to saturate its bandwidth for customers visiting my website. So, spin up a new Linode, attach a 500GB network volume, and we’re in business. Right? Nope.

The download links are protected behind my Google account (that’s great!), so I need a web browser to authenticate. Back on my Mac, fire up Charles Proxy and begin the downloads in Safari. Once they start, cancel them. Go to Charles, find the final GET request, and right-click to copy it as a curl command, including all of the authentication headers and cookies. Paste that command into my server’s Terminal window and watch my 500GB archive download at 150MB/sec(!).
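The pasted command is nothing exotic – just curl with the copied request headers. Its rough shape (the URL, cookie, and header values are placeholders for whatever Charles gives you):

    curl 'https://takeout.google.com/<copied-download-url>' \
        -H 'Cookie: <copied-from-Charles>' \
        -H 'User-Agent: <copied-from-Charles>' \
        -o takeout-001.tgz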

(Turns out, extracting all of those huge .tgz files took longer than actually downloading them.)

Finally, rsync everything over to my backup server.
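Both of those steps are plain shell commands – a tar loop and one more rsync (the mount point and hostname are placeholders):

    # Unpack the Takeout archives onto the attached volume.
    for f in takeout-*.tgz; do
        tar -xzf "$f" -C /mnt/photos
    done

    # Push the extracted library over to rsync.net.
    rsync -az --progress /mnt/photos/ user@server.com:GooglePhotos/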

And that’s where I am right now: waiting on 500GB worth of photos and videos to stream across the internet from Linode in Atlanta to rsync.net in Denver. It looks like I have about six more hours to go. Once that’s done, the initial seed of my Google Photos backup will be complete. Next, I need a way to back up anything that gets added in the future.

Between the two of us, my wife and I take about 5 to 10 photos a day. Mostly of our kids. Holidays and special events may produce a bunch more at once, but that’s sporadic. All I need to do is sync the last 24 hours worth of new data once every night.

rclone is the perfect tool for this job. It supports a --max-age=24h option that will only grab the latest items, so it comfortably fits within Google’s API rate limits. Once again, set up a cron job on my server like so:

0 0 * * * rclone copy --max-age=24h "GoogleDrive:Google Photos" rsync:GooglePhotos

And, that’s it. I think I’m done. Really, this time.

All of my important data – backed up to multiple storage providers – and available on all of my and my family’s devices. At least until the whole situation changes yet again.

A few more notes:

All of my web server configuration files are stored in git. As are all of my websites’ actual files. But I still run an hourly cron job to back up all of “/var/www” and “/etc/apache2/sites-available” to rsync.net, since it’s such a small amount of data. This lets me run one command to re-sync everything in the event I need to move to a new server, without having to clone a ton of individual git repos. (I know I need to learn a proper devops technique with reproducible deployments like Ansible, Puppet, or whatever the cool tech is these days. But everything I do is a standard LAMP stack – no containers, only one or two actual servers – so spinning up a new machine is really just a click in the Linode control panel, a couple of apt-get commands, and dropping my PHP files into a directory.)
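That hourly backup is just one more crontab line on the server, something like this (the hostname and destination folder are placeholders):

0 * * * * /usr/bin/rsync -az --delete /var/www /etc/apache2/sites-available user@server.com:server-backup/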

My databases are mysqldump’d every hour, versioned, and archived in S3.
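A sketch of that hourly job – assuming MySQL credentials live in ~/.my.cnf and using the aws CLI, though any S3 uploader would do (the bucket name and paths are placeholders):

    #!/usr/bin/env bash
    # Hourly, timestamped MySQL dump shipped off to S3.
    STAMP=$(date +%Y%m%d%H%M)
    mysqldump --all-databases | gzip > "/var/backups/mysql-$STAMP.sql.gz"
    aws s3 cp "/var/backups/mysql-$STAMP.sql.gz" s3://my-backup-bucket/mysql/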

All of the source code on my Mac is checked out into a single parent directory in my home folder. It gets rsync’d offsite every hour, just in case. Think of it as a poor man’s Time Machine in case git fails me.
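That one is a single crontab line on my Mac (the local folder name and hostname are placeholders):

0 * * * * /usr/bin/rsync -az --delete /Users/thall/Code/ user@server.com:code-mirror/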

I do a lot of work in The Omni Group’s apps – OmniFocus, OmniOutliner, and OmniGraffle. All of those documents are stored in their free WebDAV sync service and mirrored on my Mac and mobile devices.

All of my music purchases have gone through iTunes since that store debuted however many years ago. I can always re-download my purchases (probably?). Non-iTunes music ripped from CDs long ago, along with my huge collection of live music, is stored in iTunes Match for a yearly fee. A few years ago, when I made the switch to streaming music services and mostly stopped buying new albums, I archived all of my mp3s in Amazon S3 as a backup. I need to set a recurring reminder – once a year or so – to upload any new music I’ve acquired.

Also, I have Backblaze running on my desktop and laptop doing its thing. So yeah. I guess that’s yet another layer of redundancy.

Automatically Compressing Your Amazon S3 Images Using Yahoo!’s Smush.it Service

I’m totally obsessed with website performance. It’s one of those nerd niches that really appeal to me. I’ve blogged a few times previously on the topic. Two years ago (has it really been that long?) I talked about my experiences rebuilding this site following the best practices of YSlow. A few days later I went into detail about how to host and optimize your static content using Amazon S3 as a content delivery network. Later, I took all the techniques I had learned and automated them with a command line tool called s3up. It’s the easiest way to intelligently store your static content in Amazon’s cloud. It sets all the appropriate headers, gzips your data when possible, and even runs your images through Yahoo!’s Smush.it service.

Today I’m pleased to release another part of my deployment tool chain called Autosmush. Think of it as a reverse s3up. Instead of taking local images, smushing them, and then uploading to Amazon, Autosmush scans your S3 bucket, runs each file through Smush.it, and replaces your images with their compressed versions.

This might sound a little bizarre (useless?) at first, but it has done wonders for my workflow and that of one of my freelance clients. This particular client runs a network of very image-heavy sites. Compressing their images has a huge impact on their page load speed and bandwidth costs. The majority of their content comes from a small army of freelance bloggers who submit images along with their posts via WordPress, which then stores them in S3. It would be great if the writers had the technical know-how to optimize their images beforehand, but that’s not reasonable to expect. To fix this, Autosmush scans all the content in their S3 account every night, looking for new, un-smushed images, and compresses them.

Autosmush also allowed me to compress the huge backlog of existing images in my Amazon account that I had uploaded prior to using Smush.it.

If you’re interested in giving Autosmush a try, the full source is available on GitHub. You can even run it in a dry-run mode if you’d just like to see a summary of the space you could be saving.

Also, for those of you with giant S3 image libraries, I should point out that Autosmush appends an x-amz-smushed HTTP header to every image it compresses (and to images that can’t be compressed any further). This lets the script scan extremely quickly through your files, only sending new images to Smush.it and skipping ones it has already processed.

Head on over to the GitHub project page and give Autosmush a try. And please do send in your feedback.

Serving Static Content on Amazon S3 with s3up

I’ve written twice about using Amazon S3 to host your website’s static content. It’s a great solution for small websites without access to a real content delivery network. And now that Amazon has launched CloudFront on top of S3, it’s even better.

But there are still ways we can improve the performance. The trick is to upload our files using custom headers so they’re served back with a proper expiration date and gzipped when possible — i.e., the techniques recommended by YSlow.

I’ve written a command line tool called s3up which simplifies this process. The idea is simple. It uploads a single file to S3, sets a far-future expiration date, gzips the content, and versions the file by combining the filename with a timestamp. Each of these actions is optional and can be controlled via the command line.

The basic syntax:

s3up myS3bucket js/somefile.js somefile.js

would upload a local JavaScript file named somefile.js into your Amazon S3 bucket named myS3bucket inside the js folder.

We can build on this command by adding the -x flag, which tells s3up to set a far-future expiration header. By default, it chooses a date ten years in the future:

s3up -x myS3bucket js/somefile.js somefile.js

This lets the browser know to keep the file cached indefinitely.

Passing the -z flag uploads two copies of the file. One normally, and a second that is gzipped and renamed filename.gz.extension. You can then dynamically serve the compressed version to browsers that support it.

Finally, the -t flag uploads and renames the file filename.YYYYmmddHHmmss.extension. This lets you easily create versioned files when you need to update an existing file. (You need to use a new filename since the browser was told earlier to always load from the cache.)

Combining all three options we get:

s3up -txz myS3bucket js/somefile.js somefile.js

This uploads your file, compresses it, sets the correct expiration date, and versions the filename — all in one easy step.

If you’d prefer to choose your own string for versioning, you can specify it with the --version flag:

s3up -txz --version=v2 myS3bucket js/somefile.js somefile.js

In that example, the file would be stored in S3 as somefilev2.js.

s3up also echoes the new filename so you can quickly paste it into your HTML. If you’re on a Mac, you can automatically copy the new name onto your clipboard by piping s3up to the pbcopy command.

If you’d like to take this a step further and integrate s3up into your existing deploy scripts, you can leave off the filename argument and instead pipe the data to upload via stdin. So, if you’re using another tool like the YUI Compressor or byuic, you can run:

somecommand | s3up -txz myS3bucket js/somefile.js | pbcopy

Uploading Multiple Files

A common task is uploading a whole folder of images. You can do this in one step by appending a slash (/) to the S3 filename and using wildcards to specify multiple local files. Example:

s3up myS3bucket images/ /path/to/your/images/*.jpg

When s3up sees images/ ending with a slash, it treats it like a directory and stores all of your files inside it. Make sure you remember the slash on images/ – that’s what triggers the special directory-upload mode.

Download

You can grab the latest copy of s3up from its GitHub project page.

Amazon S3 Improvements in PHP-AWS

Two and a half years ago I began working with Amazon Web Services — first with S3 and then SQS and EC2. The code was eventually cleaned up and released as an open source project called PHP-AWS. Since then, it has remained relatively unchanged: just bug fixes and the occasional support for new AWS features when users contribute patches. It’s not particularly pretty, but it has stood up well over time. The S3 class was recommended by Raymond Yee in his book Pro Web 2.0 Mashups, and a number of people have emailed examples of production sites where the library is being used.

Still, the biggest drawback (particularly with the S3 class) was that the curl commands were piped to the shell — they didn’t use PHP’s native curl extension. This was done to bypass PHP’s memory limits to allow large file uploads (and also because we didn’t have time to find a workaround). It was a dirty hack, plain and simple.

But, today, the S3 class redeems itself with full support for PHP’s curl extension. Not only do large file uploads work (I just tested a 1.5GB disk image), but there is support for new S3 features like query string authentication and copying objects.

As this is a complete rewrite, I consider it a beta release — user beware. I’ve converted over all of my own code to use the new class, which has given it a good bit of testing. It’s stable enough for my purposes, but I don’t want anyone to be surprised if there’s a bug or three lurking about.

Download the code, kick the tires, and please report any bugs you find.

Using Amazon S3 as a Content Delivery Network

[Update: You might also be interested in s3up for storing static content in Amazon S3.]

Earlier this week I posted about my experience redesigning this site, focusing on optimizing my page load times using YSlow. A large part of that process involved storing static content (images, stylesheets, JavaScript) on Amazon S3 and using it like a poor man’s content delivery network (CDN). I made some hand-waving references to a deploy script I wrote which handles syncing content to S3, adding expiry headers, and gzipping that data. A couple of users emailed asking for more info, so here goes.

Why Amazon S3?

Since its launch, nearly every technical blogger on the net has weighed in on why Amazon S3 is (for lack of a better word) awesome. No need for me to repeat them. I’ll just say quickly that it’s cheap (as in price, not in quality), fast, and easy to use. If you’ve got the deep pockets of a corporation backing you, you could probably find a better deal with another CDN provider, but for bloggers, startups, and small businesses, it’s the best game in town.

Amazon S3 is platform and language agnostic. It’s a massive hard drive in the sky with an open API sitting on top. You can connect to it from any system using just about any programming language. For this tutorial, I’ll be using a slightly modified version of the S3 library I wrote in PHP. I say “slightly modified” because I had to make a few changes to enable setting the expires and gzip headers. These changes will eventually make their way into the official project — I just haven’t done it yet.

The Deploy Process

YSlow recommends hosting static content such as images, stylesheets, and JavaScript files on a CDN to speed up page load times. It’s also best to give each file a far-future expiration header (so the browser doesn’t try to reload the asset on each page view) and to gzip it. On a typical web server like Apache, these are simple changes you can make programmatically through a config file. But Amazon S3 isn’t really a web server. It’s just a “dumb” storage device that happens to be accessible over the web. We’ll need to add the headers ourselves, manually, up front when we upload.

The deploy script will also need to be smart and not re-upload files that are already on S3 and haven’t changed. To accomplish this, we’ll compare each file’s local md5 hash with its ETag value (an md5 hash) on S3. Let’s get started.

Images

Deploying images is straightforward.

  • Loop over every image in our /images/ directory.
  • Calculate the file’s md5 hash and compare to the one in S3.
  • If the file doesn’t exist or has changed, upload it with a far-future Expires header.
  • Repeat for the next image.

Source:

<?PHP
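    // Note: the DOC_ROOT and IMG_PATH constants, plus $s3 (the S3 client),
    // $bucket, and $expires, are assumed to be defined earlier in the script.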
    $files = scandir(DOC_ROOT . IMG_PATH);
    foreach($files as $fn)
    {
        if(!in_array(substr($fn, -3), array('jpg', 'png', 'gif'))) continue;
        $object   = IMG_PATH . $fn;
        $the_file = DOC_ROOT . IMG_PATH . $fn;
        // Only upload if the file is different
        if(!$s3->objectIsSame($bucket, $object, md5_file($the_file)))
        {
            echo "Putting $the_file . . . ";
            if($s3->putObject($bucket, $object, $the_file, true, null, array('Expires' => $expires)))
                echo "OK\n";
            else
                echo "ERROR!\n";
        }
        else
            echo "Skipping $the_file\n";
    }

JavaScript and Stylesheets

The same process applies to JavaScript and stylesheets. The only difference is we need to serve gzip encoded versions to browsers that support it. As I said above, S3 won’t do this natively so we need to fake it by uploading a plaintext and a gzipped version of each file and then use PHP to serve the appropriate one to the user.

In the master config file on my website, I set a variable called $gz like so:

<?PHP
    $gz = (isset($_SERVER['HTTP_ACCEPT_ENCODING']) && strpos($_SERVER['HTTP_ACCEPT_ENCODING'], 'gzip') !== false) ? 'gz.' : '';

That snippet detects if the user’s browser supports gzip encoding and sets the variable appropriately. Then, throughout the site, I link to all of my JavaScript and CSS files like this:

<link rel="stylesheet" href="https://clickontyler-clickonideas.netdna-ssl.com/css/main.<?PHP echo $gz;?>css" type="text/css">

That way, if the $gz variable is set, it adds a “gz.” to the filename. Otherwise, the filename doesn’t change. It’s a quick way to transparently give the right file to the browser.

With that out of the way, here’s how I deploy the gzipped content:

<?PHP
    // List your stylesheets here for concatenation...
    $css  = file_get_contents(DOC_ROOT . CSS_PATH . 'reset-fonts-grids.css') . "\n\n";
    $css .= file_get_contents(DOC_ROOT . CSS_PATH . 'screen.css') . "\n\n";
    $css .= file_get_contents(DOC_ROOT . CSS_PATH . 'jquery.lightbox.css') . "\n\n";
    $css .= file_get_contents(DOC_ROOT . CSS_PATH . 'syntax.css') . "\n\n";
    file_put_contents(DOC_ROOT . CSS_PATH . 'combined.css', $css);
    shell_exec('gzip -c ' . DOC_ROOT . CSS_PATH . 'combined.css > ' . DOC_ROOT . CSS_PATH . 'combined.gz.css');
    if(!$s3->objectIsSame($bucket, CSS_PATH . 'combined.css', md5_file(DOC_ROOT . CSS_PATH . 'combined.css')))
    {
        echo "Putting combined.css...";
        if($s3->putObject($bucket, CSS_PATH . 'combined.css', DOC_ROOT . CSS_PATH . 'combined.css', true, null, array('Expires' => $expires)))
            echo "OK\n";
        else
            echo "ERROR!\n";

        echo "Putting combined.gz.css...";
        if($s3->putObject($bucket, CSS_PATH . 'combined.gz.css', DOC_ROOT . CSS_PATH . 'combined.gz.css', true, null, array('Expires' => $expires, 'Content-Encoding' => 'gzip')))
            echo "OK\n";
        else
            echo "ERROR!\n";
    }
    else
        echo "Skipping combined.css\n";

You’ll notice that the first thing I do is concatenate all of my files into a single file — that’s another YSlow recommendation to speed things up. From there, we compress using gzip and then upload the two versions. Looking at this code, there’s probably a native PHP extension to handle the gzipping instead of exec’ing a shell command, but I haven’t looked into it (yet).

Also, make sure to notice that I’m adding a Content-Encoding: gzip header to each file. If you don’t do this, the browser will crap out on you when it tries to read the file as plaintext.

And We’re Done

So those are the main bits of the script. You can download the full script (and the S3 library) from my Google Code project.