Fixing Broken Backblaze B2 Scripts when Run From cron

Just a quick note for my future self and anyone else who might be running into this problem.

Last week I migrated all of my backups off of Amazon S3 and rsync.net to Backblaze B2. The cost savings are enormous – especially for a small business like mine. And the server-to-server transfer speeds using their b2 Python script, while not as fast as a raw rsync connection, are quite a bit quicker than S3.

Before committing to B2, I gave it a really thorough test by seeding it with 350,000 files totaling 450GB. The whole process took about eight hours coming from my primary Linode server in Atlanta. I was quite pleased.

Anyway, after testing all of my scripts, I put them into cron and ignored them for the next few days assuming they’d “just work”. But when I went back to check on them, I found every one had been failing silently.

At first I thought maybe the b2 command wasn’t found in $PATH when running via cron for some reason, but that wasn’t it. Next I double-checked that b2 was using the correct credentials I had previously authorized it with by hand. Nope.

Turns out, b2 was throwing this Python exception:

Creating a Pipfile for this project...
Creating a virtualenv for this project...
Traceback (most recent call last):
  File "/usr/local/bin/pew", line 7, in <module>
    from pew.pew import pew
  File "/usr/local/lib/python2.7/site-packages/pew/__init__.py", line 11, in <module>
    from . import pew
  File "/usr/local/lib/python2.7/site-packages/pew/pew.py", line 36, in <module>
    from pew._utils import (check_call, invoke, expandpath, own, env_bin_dir,
  File "/usr/local/lib/python2.7/site-packages/pew/_utils.py", line 22, in <module>
    encoding = locale.getlocale()[1] or 'ascii'
  File "/usr/local/Cellar/python/2.7.13/Frameworks/Python.framework/Versions/2.7/lib/python2.7/locale.py", line 564, in getlocale
    return _parse_localename(localename)
  File "/usr/local/Cellar/python/2.7.13/Frameworks/Python.framework/Versions/2.7/lib/python2.7/locale.py", line 477, in _parse_localename
    raise ValueError, 'unknown locale: %s' % localename
ValueError: unknown locale: UTF-8

I’m hardly a Python expert, and I’ve traditionally had nothing but problems anytime I’ve had to do anything with pip, so this didn’t surprise me. What did surprise me was that this error was happening both locally on my Mac (10.14.4) and on my remote Ubuntu 18.04 box.

After some googling I found this bug in pipenv. The solution is to add the following to your b2 scripts that are run by cron:

export LC_ALL=en_US.UTF-8
export LANG=en_US.UTF-8

And that fixed it.
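
For reference, here's the rough shape of one of those cron-driven scripts with the fix in place. The bucket name and paths are placeholders, not my real setup:

#!/bin/sh
# cron runs with a minimal environment, so set the locale explicitly
# to work around the pipenv/locale bug described above.
export LC_ALL=en_US.UTF-8
export LANG=en_US.UTF-8

# Mirror the local backup directory up to B2. "b2 sync" is part of the official CLI.
b2 sync /home/thall/backups b2://my-backup-bucket/backups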

I know macOS ships with a mostly-broken installation of Python, but the latest Ubuntu LTS? Anyway, if this is common Python/pip knowledge, at least now I know, too.

Backing Up Everything (Again)

This will take a while. Bear with me.

I’m obsessive about backing up my data. I don’t want to take the chance of ever losing anything important. But that doesn’t mean I’m a data hoarder. I like to think I’m pragmatic about it. And I don’t trust anyone else to do it for me.

From around 2006 to 2012, I kept a Mac mini attached to our TV with a Drobo hanging off the back. It had all our downloaded movies on it. And every night it would automatically download the latest releases of our favorite TV shows from Usenet so my wife and I could watch them with Plex the next day. It worked great, and all the media files were stored redundantly across multiple hard drives with tons of storage space. (Would it survive a house fire? No. But files like that weren’t critical.) But with the rise of streaming services and useful pay-to-watch stores like iTunes, now I’d rather just pay someone else to handle all of that for me. So, I don’t keep any media files like that locally any longer.

But my email? My financial and business documents? My family’s photo and home video archive? I’m really obsessive about that.

For most of my computing life, all of that data was small enough to fit on my laptop or desktop’s hard drive. In college, I remember burning a CD (not a DVD) every few months with all of my school work, source code, and photos on it for safe keeping. The internet wasn’t yet fast enough to make backing up to a cloud (were clouds even a thing back then?) feasible, so as my data grew I just cloned everything nightly to a spare drive using SuperDuper and Time Machine. It worked for the most part. Sure, I still worried about my house catching fire and destroying my backups, but there really wasn’t an alternative other than occasionally taking one of the backup drives to work or a friend’s house.

But then the internet got fast, really fast, and syncing everything to the cloud became easy and affordable. I was a beta user of Gmail back in 2004. I was an early paid subscriber of Dropbox since around 2008. All of my data was stored in their services and fully available on every computer and – eventually – mobile device. At the time, I thought I had reached peak-backup.

I was wrong.

Now we have too much data. My email is around 20GB. My family’s photo library is approaching 500GB. That’s more data than will fit on my laptop’s puny SSD. It will fit on my iMac, but it leaves precious little space for anything else. I could connect external drives, but that gets messy and further complicates my local backup routine. (Yes, Backblaze is a good, potential solution to that.)

Another problem is that most of our data now is either created directly in the cloud (email, Google Docs, etc) or is immediately sent to it (iPhone photos uploaded to iCloud and/or Google Photos), bypassing my local storage. If you trust Google (or Apple) to keep your data safe and backed up, that’s great. I don’t. I’ve heard too many horror stories about one of Google’s automated AI systems flagging an account and locking out the user. And with no way to contact an actual human, you’re dead in the water along with all your data. Especially if you lose access to your primary email account, which is the key to all your other online accounts.

So, I need a way to back up my newly created cloud data, too. This is getting complicated.

First step. My email. This is easy. Five years ago I set up new email addresses for my personal and business accounts with Fastmail. They’re amazing. I imported my 10+ years worth of email from Google (sadly, my pre-2004 college email and personal accounts are lost to the ether), set up a forwarding rule in Gmail, and with the help of 1Password, changed all of my online services to use my new email. It took about a month to switch everything over, but now the only email coming to my old Gmail address is spam. Fastmail keeps redundant backups of my email. And I have full IMAP copies available on multiple computers in case they don’t. And if something ever goes wrong, unlike Google where their advertisers are the customer – and I’m the product – I pay Fastmail every month and can call up a live human to talk to.

Source code. I’m a paying GitHub customer. Everything’s stored and backed up there. But still, what if they screw up? I ran a small, self-hosted server with GitLab on it for a while instead of GitHub and set it to back up all my code nightly to S3. That worked great. But, I like GitHub’s UI and feature set better. Plus, it’s one less server I have to manage. So, where do I mirror my code to? (Much of my code is checked out locally on my computer, but not all of it.)

Back in 2006, my boss at the web agency I was working at told me about rsync.net. They provide you with a non-interactive Unix shell account that you can pipe data to over SFTP, rsync, or any other standard Unix tool. You pay by the GB/month, and they scale to petabyte sizes for customers who need that. So, I signed up and used them to back up all of my svn (remember svn?) repos. With the rise of git and the switch to GitHub, I cancelled my account and mostly forgot about them.

But, aha!, I now have new data storage problems. Rsync.net could be a great solution again. So, I re-signed up and set up my primary web server to mirror all of my GitHub repos over to them each night. Here’s the script I’m using…
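
A minimal sketch of that kind of nightly job, using the GitHub API for the repo list and plain rsync over SSH to rsync.net (the username, token, host, and paths below are all placeholders), looks something like this:

#!/bin/sh
# Nightly GitHub -> rsync.net mirror. Every name here is a placeholder.
GITHUB_USER="username"
GITHUB_TOKEN="personal-access-token"
MIRROR_DIR="/home/thall/github-mirrors"

mkdir -p "$MIRROR_DIR" && cd "$MIRROR_DIR" || exit 1

# Ask the GitHub API for my repos' SSH URLs (100 per page is the API maximum).
curl -s -u "$GITHUB_USER:$GITHUB_TOKEN" "https://api.github.com/user/repos?per_page=100" \
  | grep -o '"ssh_url": *"[^"]*"' | cut -d'"' -f4 \
  | while read -r repo; do
      name=$(basename "$repo")
      if [ -d "$name" ]; then
        (cd "$name" && git remote update --prune)   # refresh an existing mirror
      else
        git clone --mirror "$repo"                  # first-time bare mirror clone
      fi
    done

# Ship the whole mirror directory off to rsync.net.
rsync -az "$MIRROR_DIR/" user@server.com:github-mirrors/

A bare --mirror clone keeps every branch and tag, and git remote update --prune keeps each copy current without re-cloning.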

Next up, important documents. Traditionally, I’ve kept everything that would normally go in my Mac’s “Documents” folder in my Dropbox account. That worked great for a long time. But once I started paying Google for extra storage space for Google Photos (more on that later), it felt silly to keep paying Dropbox as well. So, after 10+ years as a paid subscriber, I downgraded to a free account and moved everything into Google Drive. Sure, it’s not as nice as Dropbox, but it works and saves me $10 a month.

Like I said above, I mostly trust Google, but not entirely. So, let’s sync my Google Drive’s contents to rsync.net, too. Edit your Mac’s crontab to add this line…

30 * * * * /usr/bin/rsync -avz /Users/thall/Google\ Drive/ user@server.com:google-drive

Also, I keep all of the really important paperwork (the stuff that would normally live in a fire safe in my garage) in a DEVONthink library so I can search the contents of my PDFs. It’s synced automatically with iCloud and available across my mobile devices. But still, better back that up, too.

45 * * * * /usr/bin/rsync -avz /Users/thall/FireSafe.dtBase2 user@server.com:

So, that’s all of my data except for the big one – my family’s photo and home video archives.

For a long time I kept all my family’s archives in Dropbox. I even made an iOS app dedicated to browsing your library. I could have stuck everything in Apple’s Photos.app where it’s available on my devices via iCloud, but that’s tied to my Apple ID. My wife wouldn’t be able to see those photos. Plus, any photos she took on her phone would get stored in her iCloud account and not synced with the main family archive. So, we used the Dropbox app, signed in to my account, to back up our phones’ photos.

But, like I said earlier, our photo and video library became too big to comfortably fit in Dropbox. Plus, Google Photos had just been released and it was amazing. Do I like the thought of Google’s AI robots churning through my photos and possibly using that data to sell me advertisements? No. But, their machine-learning expertise and big-data solutions make it really hard to resist. So, I spent a week and moved everything out of Dropbox into Google Photos.

Now everything is sorted into albums, by date, and searchable on any device. I can literally type into their search box “all photos of my wife’s grandmother taken in front of the Golden Gate bridge” and Google returns exactly what I’m looking for. It’s wonderful.

My wife’s phone has the Google Photos app installed with my account on it so every photo she takes gets stored in a shared account we can both access and view on all our devices.

But what’s the recurring theme of this blog post? That’s right. I don’t fully trust any cloud provider to be the only source of my data. Someone clever said “the cloud is just someone else’s computer.” That’s exactly correct. If your data isn’t in at least two different places, it’s not really backed up.

But how do I back up my 500GB+ of photos that are already in Google’s cloud? And how do I keep newly added items synced as well?

As usual, I tried to find a way to make it work with rsync.net. I found a great open-source project called rclone. It’s a command line tool that shuffles your files between cloud providers or any SFTP server with lots of configurable options and granularity.

First off, even if rclone does do what I need, I can’t just run it on my Mac. My internet is too slow for the initial backup. I need to use it on one of my servers so I have a fast data center to data center connection between Google and rsync.net.

Getting it set up on one of my Ubuntu servers at Linode was a simple bash one-liner. Configuring it to work with my Google and rsync.net accounts was just a matter of running their easy-to-use configuration wizard.
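
If you're curious, the whole thing boils down to two commands. The install one-liner is the script rclone documents on their site (double-check there in case it has moved), and the wizard is interactive:

curl https://rclone.org/install.sh | sudo bash
rclone config   # walks you through adding the Google Drive and SFTP (rsync.net) remotes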

Note: rclone doesn’t support a connection to Google Photos. Instead, you need to login to Google Drive on the web and enable the “Automatically put your Google Photos into a folder in My Drive” option in Settings. (And also tell your Google Backup & Sync Mac app not to sync that folder locally – unless you have the space available – I don’t.) Then, rclone can access your Google Photos data via a special folder in your Drive account.

With everything configured, I ran a few connection tests and it all worked as expected. So, I naively ran this command thinking it would sync everything if I let it run long enough:

rclone copy -P "GoogleDrive:Google Photos" rsync:GooglePhotos

Things started out fine, but due to Google API rate limits the transfer was quickly throttled to 300KB/sec. That would have taken MONTHS to transfer my data. And the connection stalled out entirely after about an hour. I even configured rclone to use my own, private Google OAuth keys, but with the same result. So, I needed a better way to do the initial import.

Google offers their Takeout service. It lets you download an archive of ALL your data from any of their services. I requested an archive of my Google Photos account and eight hours later they emailed me to let me know it was ready. Click the email link to their website, boom. Ten 50GB .tgz files. Now what to do with them?

I can’t download them to my Mac and re-upload them – that’s too slow. Instead, I’ll just grab the download URLs and use curl on my server to get them, extract them, and sync them over.

I don’t have enough room on my primary web server – plus I don’t want to saturate my traffic for any customers visiting my website. So, spin up a new Linode, attach a 500GB network volume, and we’re in business. Right? Nope.

The download links are protected behind my Google account (that’s great!) so I need a web browser to authenticate. Back on my Mac, fire up Charles Proxy and begin the downloads in Safari. Once they start, cancel them. Go to Charles, find the final GET connection, and right-click to copy the request as a curl command including all of the authentication headers and cookies. Paste that command into my server’s Terminal window and watch my 500GB archive download at 150MB(!!)/sec.

(Turns out, extracting all of those huge .tgz files took longer than actually downloading them.)

Finally, rsync everything over to my backup server.
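
That last hop is just one more rsync command, roughly like this (the local mount point and remote directory are placeholders):

rsync -az --progress /mnt/takeout/ user@server.com:GooglePhotos/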

And that’s where I currently am right now. Waiting on 500GB worth of photos and videos to stream across the internet from Linode in Atlanta to rsync.net in Denver. It looks like I have about six more hours to go. Once that’s done, the initial seed of my Google Photos backup will be complete. Next, I need a way to backup anything that gets added in the future.

Between the two of us, my wife and I take about 5 to 10 photos a day. Mostly of our kids. Holidays and special events may produce a bunch more at once, but that’s sporadic. All I need to do is sync the last 24 hours worth of new data once every night.

rclone is the perfect tool for this job. It supports a "--max-age=24h" option that will only grab the latest items, so it will comfortably fit within Google’s API rate limits. Once again, set up a cron job on my server like so:

0 0 * * * rclone copy --max-age=24h "GoogleDrive:Google Photos" rsync:GooglePhotos

And, that’s it. I think I’m done. Really, this time.

All of my important data – backed up to multiple storage providers – and available on all of my and my family’s devices. At least until the whole situation changes yet again.

A few more notes:

All of my web server configuration files are stored in git. As are all of my websites’ actual files. But, I still run an hourly cron job to back up all of “/var/www” and “/etc/apache2/sites-available” to rsync.net since it’s such a small amount of data. This lets me run one command to re-sync everything in the event I need to move to a new server, without having to clone a ton of individual git repos. (I know I need to learn a better devops technique with reproducible deployments like Ansible, Puppet, or whatever the cool tech is these days. But everything I do is just a standard LAMP stack (no containers, only one or two actual servers), so spinning up a new machine is really just a click in the Linode control panel, a couple of apt-get commands, and dropping my PHP files into a directory.)
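
For the curious, that hourly job is a single crontab line along these lines (the server name is a placeholder; rsync’s -R flag preserves the full paths so one command can restore both directories):

0 * * * * /usr/bin/rsync -azR /var/www /etc/apache2/sites-available user@server.com:webserver-backup/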

My databases are mysqldump’d every hour, versioned, and archived in S3.
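
A sketch of that hourly job, assuming the MySQL credentials live in ~/.my.cnf and the bucket name is a placeholder:

#!/bin/sh
# Hourly MySQL dump, gzipped with a timestamp, archived to S3.
STAMP=$(date +%Y%m%d-%H00)
mysqldump --all-databases --single-transaction | gzip > /tmp/db-$STAMP.sql.gz
s3cmd put /tmp/db-$STAMP.sql.gz s3://bucket-name/mysql/
rm /tmp/db-$STAMP.sql.gz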

All of the source code on my Mac is checked out into a single parent directory in my home folder. It gets rsync’d offsite every hour, just in case. Think of it as a poor man’s Time Machine in case git fails me.

I do a lot of work in The Omni Group’s apps – OmniFocus, OmniOutliner, and OmniGraffle. All of those documents are stored in their free WebDAV sync service and mirrored on my Mac and mobile devices.

All of my music purchases have gone through iTunes since that store debuted however many years ago. I can always re-download my purchases (probably?). Non-iTunes music ripped from CDs long ago, and my huge collection of live music, is stored in iTunes Match for a yearly fee. A few years ago when I made the switch to streaming music services and mostly stopped buying new albums, I archived all of my mp3s in Amazon S3 as a backup. I need to set a reminder to upload any new music I’ve acquired as a recurring task once a year or so.

Also, I have Backblaze running on my desktop and laptop doing its thing. So yeah. I guess that’s yet another layer of redundancy.

Switching from GitHub to GitLab

I’ve been a happy paying customer of GitHub since early 2009. But yesterday, for a few different reasons, I deleted all of my private repositories and moved them over to a self-hosted installation of GitLab. I didn’t make that decision lightly, as I’ve been very happy with GitHub for the last five years, but here’s why…

First, I’ve started working on a new Mac app. Every time I start a new project, unless it’s open source, I create a new private repo for it on GitHub. This project happened to be my 21st private repository on GitHub. If you’re familiar with their pricing structure, you’ll know they charge based on how many private projects you have. $22 a month will get you twenty repos. But as soon as you create that twenty-first one, you graduate onto the $50 a month plan. Maybe if I were actually hosting 50 repositories with GitHub I’d be willing to pay that much, but for the foreseeable future I’m going to be in the low twenties, and $50 a month is just too much. It’s a shame they don’t just outright charge you a dollar per month per project.

The second reason is an issue I’ve been mulling over for quite a while. I love the cloud. I love having my data in the cloud. But some of it is so precious, in this case my code, that I want to know exactly how it’s being taken care of and looked after. While I have no reason to doubt GitHub has plenty of backups in place, I have no way of really knowing for sure how safe my code is. Hosting it myself has its inherent risks, too, but at least I can have full ownership of my data and be certain of the backup strategies in place. This also dovetails nicely with the pleasure nerds like myself get in doing a job themselves. Whether that’s hosting your own email (which I’m not crazy enough to do), managing your own web server (yes, please), or automating your own digital backups, there’s a sick pleasure to be had in doing a job yourself and doing it well.

A final reason for switching away from GitHub was the uneasy feeling I got watching the story of Julie Ann Horvath unfold last week. I didn’t like the idea of my money going to a company that seemed so fundamentally broken. Since then, GitHub has taken forceful, actionable steps to correct the issue, but it still worried me.

So those are my three and a half reasons for moving my private repos away from GitHub. If you agree with me, or if you have your own reasons for wanting to move away, what follows is a brain dump of the steps I took towards getting moved over and situated happily on a GitLab installation.

First off, if you’ve never heard of GitLab, go take a look through their website. It’s a Rails app that copies the look, feel, and functionality of GitHub almost shamelessly. Everything from the activity timeline, to pull requests, to user and team access roles, to issue tracking, to shareable git-backed gists. It’s all very nicely implemented. Many open source projects start off strong and later falter when the creators get bored. But I feel fairly confident in GitLab because their community open source edition is based on an enterprise product they sell and support. Quite a few businesses are using GitLab as a GitHub replacement in situations where their code needs to remain on site.

So, where are we going to host it? My initial thought was to boot up a new virtual server with Rackspace, which is where I host all of my business servers. Rackspace is great. A little expensive, but the customer support makes up for it. Their minimum monthly price for a 512MB server, which is all we’ll need, is around $10 a month. I was about to create the server when I decided to finally take a look at DigitalOcean. They’re the new hotness in cloud hosting and have a reputation for being extremely inexpensive. (Bonus points: they offer two-factor authentication on their user accounts, which is something Rackspace still lacks.) Poking around, I found I could get a comparable 512MB server with DigitalOcean for a flat $5 a month. But what really sealed the deal is they offer one-click installs of various server apps – WordPress, etc. I wasn’t looking forward to the fairly intensive setup that GitLab requires, but amazingly, GitLab is one of DigitalOcean’s one-click installs.

True to their word, I had a ready-to-go GitLab server up and running in less than a minute after clicking the “create” button. All that remained was fine tuning everything to my needs.

The first step upon getting a new cloud server is to secure it. I always follow the steps outlined in this guide. It does a good job of locking everything down and only takes about five minutes to follow.

Of note, when you get to the section about enabling ufw (the firewall), DigitalOcean boxes don’t come with everything you need installed. I had to run the following command before setting up ufw…

sudo apt-get install linux-image-$(uname -r)

Another note, and this is just personal preference, I also modify my ssh port to be something non-standard. That can be changed in…

/etc/ssh/sshd_config
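
For example, to move sshd to a made-up port and restart it (just remember to allow the new port through ufw first so you don’t lock yourself out):

sudo ufw allow 2222/tcp
sudo sed -i 's/^#\?Port .*/Port 2222/' /etc/ssh/sshd_config
sudo service ssh restart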

Also, while the user-facing side of GitLab is great, I have no idea how security-conscious they are. I’d hate for an unpatched security hole in their web app to expose any of my private code. One way to mitigate that chance is to lock down web traffic to the specific IP addresses you’ll be accessing it from. Your home, your office, etc. With ufw it’s just a quick…

sudo ufw allow from your-ip-address to any port 80

for each of your IPs.

Once you’ve gotten the security taken care of, you can move on to configuring GitLab. Most of the hard work is already done for you by DigitalOcean. You’ll just need to fill in the appropriate values in…

/home/git/gitlab-shell/config.yml

and

/home/git/gitlab/config/gitlab.yml

Then restart GitLab with…

sudo service gitlab restart

With all that done, the next step is moving your repositories from GitHub to GitLab. (I’m sure there is a better direct git-to-git way of doing what follows, but this was the simplest solution for my needs.) For each of your repos, do a clean mirror to your Desktop to make sure you’ve got everything.

git clone --mirror git@github.com:username/repo-name.git

Then, cd into the repo directory and…

git remote add gitlab ssh://git@servername.com:22/username/repo.git
git push -f --tags gitlab refs/heads/*:refs/heads/*

That final git push with all the refs will push every branch and all of your tags making sure nothing is left behind.

Once done, you can safely delete your repo from GitHub.

The last step is making sure you have rolling backups of your GitLab installation and repositories in place. I looked into piecing together my own backup script until I realized GitLab already has a rake backup task available that stores everything into a single tar file. Perfect. I can then just upload that to S3 for safe keeping. To do that, we’ll be using s3cmd to handle the uploads.

sudo apt-get install s3cmd

Configure it with…

s3cmd --configure

Then, create a script in your git user’s home directory called backup.sh containing…

cd /home/git/gitlab && PATH=/usr/local/bin:/usr/bin:/bin bundle exec rake gitlab:backup:create RAILS_ENV=production
s3cmd put tmp/backups/`ls tmp/backups/ | grep -i -E '\.tar$' | tail -1` s3://bucket-name/git/

Set up cron to run that script once a day and you’re good.
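
Something like this in the git user’s crontab works (the time of day is arbitrary):

0 3 * * * /bin/sh /home/git/backup.sh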