
Monday, 30 April 2012

Using Git as a backup tool

Git is a great tool – not only for version control. Ever wondered what would happen to the local Git repositories you use to store private projects, documents or other arbitrary data after your hard drive crashes? [1] Most probably – if you don’t use any other means of backup – they will be lost. Surprise? No, not really: Git is based on plain files and there’s no reliable way to recover from file corruption. [2] Therefore you should not forget to back up your Git repositories, too.
If you use a Git host like GitHub or Gitorious it’s pretty easy to have at least one additional repository to increase your data security. Once you have added your online repository as a remote and pushed to it, you have a full [3] copy of your repository contents at another location, eliminating your local hard drive as a single point of failure. But what about private repositories you don’t want to (or are not allowed to) host on a third-party server?
The most obvious solution is your own Git host, but setting one up isn’t a simple task. You need the hardware (at least a virtual machine on another physical host) and, more importantly, the knowledge to set up and maintain a Git server. To get the latter you may want to read chapter 4 of Scott Chacon’s excellent book Pro Git. Nevertheless, it’s time-consuming and most probably overkill if you just want a secure backup copy of your repositories.
A much simpler method is adequate for some ordinary backup scenarios – e.g. backing up to an external drive (like a USB drive), a local NAS or another network share (for example your private network drive on your company’s file server). These cases have in common that although the data is stored on another hard drive, there’s no need to use a sophisticated network transport protocol – like Git’s own protocol or HTTP(S). Everything is transferred on top of the file system, which may be something like NFS, CIFS or AFP. So the heavy lifting is done by your operating system (or better, its file system implementation) and Git does not need to do anything special, as the local file protocol is used. Git does not need to know anything about your infrastructure or hardware, so you don’t need to configure anything aside from your remote(s). That’s why setting up Git backups is simple and keeps working when the underlying infrastructure changes.
If you use Git for more than local storage you should be pretty comfortable with remotes. Git provides several options for remotes, like the refs that should be pushed and pulled, but configuring a remote for backup usage is even simpler. A backup repository is configured as a mirror remote:
git remote add --mirror=push backup /Volumes/Storage
That way, you don’t need to worry about what is pushed to this remote repository. Every branch and every tag is always pushed, even if the changes are not fast-forwards, and refs you delete locally are removed from the backup as well. So you never have to worry about having forgotten anything. The following will always result in a fully up-to-date remote repository:
git push backup
This replaces the lengthier explicit form (git push -f backup 'refs/*:refs/*').
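The path given to the remote has to contain a Git repository, ideally a bare one. The following is a minimal sketch of the whole setup, not a definitive recipe: setup_backup is a hypothetical helper name, and the example path in the comment is only an assumption about where your backup volume is mounted.

```sh
#!/bin/sh
# Sketch only: setup_backup is a hypothetical helper, not part of Git.
# Run it inside the repository you want to back up, for example:
#   setup_backup /Volumes/Storage/project.git

setup_backup() {
  target=$1            # directory that holds (or will hold) the backup
  remote=${2:-backup}  # name of the remote, "backup" unless given

  # Create a bare repository at the target unless one already exists there
  if ! git --git-dir="$target" rev-parse --git-dir >/dev/null 2>&1; then
    git init --quiet --bare "$target" || return 1
  fi

  # Register the target as a push mirror unless the remote is already known
  if ! git config --get "remote.$remote.url" >/dev/null; then
    git remote add --mirror=push "$remote" "$target" || return 1
  fi

  # First backup: sends every ref below refs/
  git push --quiet "$remote"
}
```

Calling the helper a second time is harmless: the repository and the remote are left alone and only the push is repeated.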
You see, backups using Git are pretty easy, and it gets even better. Instead of running a command by hand to update your backup(s), you most probably want to do this automatically. A good way to do this periodically is cron on Unix or scheduled tasks on Windows. Just configure the command above to be run inside your repositories at regular intervals.
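If you have more than one repository, a small wrapper script keeps the crontab short. This is a sketch under assumptions: backup_repos is a made-up helper name, the remote is assumed to be called backup in every repository, and the paths in the comments are placeholders.

```sh
#!/bin/sh
# Sketch only: backup_repos is a hypothetical helper, not part of Git.
# Pushes every repository given as an argument to its mirror remote and
# names the ones that failed on stderr, so cron can mail the output.

backup_repos() {
  remote=backup   # assumed name of the mirror remote
  failed=0
  for repo in "$@"; do
    # -C runs Git inside the given repository without changing directory
    if ! git -C "$repo" push --quiet "$remote"; then
      echo "backup failed: $repo" >&2
      failed=$((failed + 1))
    fi
  done
  # Non-zero exit status if any push failed
  [ "$failed" -eq 0 ]
}

# Example: save this as a script that ends with a call like
#   backup_repos "$HOME/projects/foo" "$HOME/documents"
# and run it hourly with a crontab entry such as
#   0 * * * * /path/to/git-backup.sh
```

A failing repository does not stop the loop, so one unreachable network share will not block the backups of the others.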
Another, more sophisticated way to automate backups is using Git’s own hooks. Using e.g. the post-commit hook, you can update your backups every time you commit to your repository. That might be a bit too much: depending on your repository contents and network speed, pushing to your backup repository may take some time. So you might want to tweak your hook so that only some commits trigger a push. You might check the timestamp or the contents of the commit or several other things – Git’s hooks can be a really powerful tool.
Have a look at the following hooks to see some examples of automatically triggering a new backup:
#!/bin/sh
#
# This code is free software; you can redistribute it and/or modify it under
# the terms of the new BSD License.
#
# Copyright (c) 2010, Sebastian Staudt
#
# This is a demonstration hook that can be used to automatically update a
# backup repository.
#
# Save this file as post-commit to the .git/hooks directory of a Git
# repository and make it executable (`chmod u+x`). This will enable
# automatic backups to a remote repository if the previous commit is a
# specific amount of time older than the current one.

# The timeout is the number of seconds that the previous commit has to be
# older than the current one to trigger a new backup
timeout=3600

# The name of the remote repository to back up to. This should be a mirror
# repository, i.e. remote.<remote>.mirror has to be true.
remote=backup

# Committer timestamps (seconds since the epoch) of the current commit and
# of its first parent
current_time=$(git log -1 --pretty=format:%ct HEAD)
last_time=$(git log -1 --pretty=format:%ct HEAD^ 2>/dev/null)

# The very first commit has no parent, so there is nothing to compare
[ -n "$last_time" ] || exit 0

# Seconds that passed between the two commits
diff=$((current_time - last_time))
if [ "$diff" -gt "$timeout" ]
  then git push "$remote"
fi
#!/bin/sh
#
# This code is free software; you can redistribute it and/or modify it under
# the terms of the new BSD License.
#
# Copyright (c) 2010, Sebastian Staudt
#
# This is a demonstration hook that can be used to automatically update a
# backup repository.
#
# Save this file as post-commit to the .git/hooks directory of a Git
# repository and make it executable (`chmod u+x`). This will enable
# automatic backups to a remote repository if the current commit changed a
# specific path inside your repository.

# The path is the directory inside your repository that has to contain
# changes to trigger a new backup
path=lib

# The name of the remote repository to back up to. This should be a mirror
# repository, i.e. remote.<remote>.mirror has to be true.
remote=backup

# Files below $path that were touched by the current commit only; --root
# makes this work for the very first commit, too
changed=$(git diff-tree --root --no-commit-id --name-only -r HEAD -- "$path")
if [ -n "$changed" ]
  then git push "$remote"
fi
As mentioned before, the hooks above are only samples of what can be done. They may not be that useful for your specific use case and infrastructure. But backing up using Git is practical and does not need any special hardware or software (except Git of course). Additionally, this way of backing up is useful even in pretty restricted enterprise architectures. On the other hand it may also fit the most complex infrastructures – I’m thinking of multiple Git repositories here, where one acts as some sort of proxy, automatically triggering backups to multiple other repositories, e.g. using a post-receive hook. You see… boundless possibilities.
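The proxy idea could look roughly like the following post-receive hook. Treat it as a sketch under assumptions: push_to_mirrors is a made-up helper name, and every remote of the proxy repository that has remote.<remote>.mirror set to true is simply assumed to be a backup target.

```sh
#!/bin/sh
# Sketch of a post-receive hook for a repository that acts as a backup
# proxy. Save it as hooks/post-receive in the proxy repository.

# Pushes to every remote configured as a mirror; keeps going if one fails.
push_to_mirrors() {
  status=0
  for remote in $(git remote); do
    # Only remotes with remote.<remote>.mirror=true are treated as backups
    if [ "$(git config --bool --get "remote.$remote.mirror")" = "true" ]; then
      if ! git push --quiet "$remote"; then
        echo "backup to $remote failed" >&2
        status=1
      fi
    fi
  done
  return "$status"
}

# In the real hook, end the file with a call to the function:
#   push_to_mirrors
```

Adding another backup location is then only a matter of adding one more mirror remote to the proxy repository; the hook itself stays untouched.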
So you may find better ways to implement your own Git-based backup system – feel free to share them.
  1. You don’t use Git to store things other than source code? Well, then this post might demonstrate that it isn’t some strange nerd fetish to store everything in a VCS repository. It’s a good way to not only store data, but also store its history… securely.
  2. Git’s SHA hashes help Git to detect corruption of database objects, but there’s no reliable, built-in way to recover from file corruption. There are ways to recover corrupt objects, but they are neither reliable nor straightforward.
  3. That is, if you push every branch and every tag. Pushing to another repository will never include your stash and reflog, though.
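The detection mentioned in the second note is available through git fsck, and it can be pointed at the backup repository just as well. A minimal sketch, assuming a hypothetical helper name verify_backup and a backup path of your choosing:

```sh
#!/bin/sh
# Sketch only: verify_backup is a hypothetical helper, not part of Git.
# Returns a non-zero status if Git finds corrupt or missing objects.

verify_backup() {
  # --git-dir points Git at the backup repository without changing directory;
  # --no-dangling hides the harmless "dangling object" notices
  git --git-dir="$1" fsck --full --no-dangling
}

# Example: verify_backup /Volumes/Storage/project.git || echo "backup is damaged"
```

This only tells you that something is broken, it does not repair anything, which is one more reason to keep more than one copy.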

    from http://stdout.koraktor.de/blog/2010/10/02/using-git-as-a-backup-tool/