Git is a great tool, and not only for version control. Ever wondered what would
happen to the local Git repositories you use to store private projects,
documents or other arbitrary data after your hard drive crashes?[1] Most
probably, if you don’t use any other means of backup, they will be lost.
Surprise? Not really: Git is based on plain files and there’s no built-in way
to recover from file corruption.[2] Therefore you should not forget to back up
your Git repositories, too.
If you use a Git host like GitHub or Gitorious it’s pretty easy to have at least one additional repository to increase your data security. Once you have added your online repository as a remote and pushed to it, you have a full[3] copy of your repository contents at another location, eliminating your local hard drive as a single point of failure. But what about private repositories you don’t want to (or are not allowed to) host on a third-party server?
The most obvious solution is your own Git host, but this isn’t a simple task. You need the hardware (at least a virtual machine on another physical host) and, more importantly, you need the knowledge to set up and maintain a Git server. To get the latter you may want to read chapter 4 of Scott Chacon’s excellent book Pro Git. Nevertheless, it’s time-consuming and most probably overkill if you just want a secure backup copy of your repositories.
A much simpler method is adequate for some ordinary backup scenarios, e.g. backing up to an external drive (like a USB drive), a local NAS or another network share (for example your private network drive on your company’s file server). These cases have in common that although data may be stored on a remote hard drive, there’s no need to use a sophisticated network transport protocol like Git’s own protocol or HTTP(S). Everything is transferred on top of the file system, which may be something like NFS, CIFS or AFP. So the heavy lifting is done by your operating system (or better, its file system implementation) and Git does not need to do anything special, as the local file protocol is used. Therefore Git does not need to know anything about your infrastructure, hardware or whatever, and you don’t need to configure anything aside from your remote(s). That’s why setting up Git backups is simple and even works when the underlying infrastructure changes.
If you use Git for more than local storage you should be pretty comfortable with remotes. Git provides several options for remotes, like the refs that should be pushed and pulled, but configuring a remote for backup usage is even simpler: a backup repository is just configured as a mirror remote.
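As a minimal sketch of such a mirror remote, the snippet below builds a throwaway working repository and a bare repository in a temporary directory and connects the two. The remote name `backup` is the one used in this post; the paths and the choice of `--mirror=push` (newer Git versions distinguish `--mirror=fetch` and `--mirror=push`) are assumptions for illustration.

```shell
# Throwaway demo: every path below is a placeholder inside a temp directory.
set -e
demo=$(mktemp -d)
git init -q "$demo/work"                # stands in for your existing repository
git init -q --bare "$demo/backup.git"   # stands in for a directory on the NAS or share

cd "$demo/work"
# Mark the remote as a push mirror: every push to it mirrors all refs.
git remote add --mirror=push backup "$demo/backup.git"

# The mirror flag ends up in the repository configuration.
git config --get remote.backup.mirror   # prints: true
```

On a real setup the second path would simply be a directory on the mounted drive or share that holds a bare repository.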
That way, you don’t need to worry about what is pushed to this remote
repository. Every branch and every tag is always pushed, even if the changes
are not fast-forward. So you never have to worry about having forgotten
anything. A simple push to that remote will always result in a fully
up-to-date remote repository.
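To illustrate the point, here is a small self-contained sketch of such a push. It recreates the throwaway setup in a temporary directory (all paths, the tag `v0.1`, the branch `experiment` and the demo identity are made up for the example) and then checks which refs arrived in the backup repository.

```shell
# Throwaway demo: paths, tag and branch names are illustrative only.
set -e
demo=$(mktemp -d)
git init -q "$demo/work"
git init -q --bare "$demo/backup.git"
cd "$demo/work"
git remote add --mirror=push backup "$demo/backup.git"

# Create something worth backing up: a commit, a tag and a second branch.
git -c user.name=Demo -c user.email=demo@example.com \
    commit -q --allow-empty -m "first commit"
git tag v0.1
git branch experiment

# No refs named: the mirror setting makes Git push all of them.
git push -q backup

# List what the backup repository now contains.
git --git-dir="$demo/backup.git" for-each-ref --format='%(refname)'
```

The listing should show the tag and both branches, although none of them was mentioned on the command line.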
That is all it takes, instead of the more lengthy alternative (`git push -f backup refs/*`).
You see, backups using Git are pretty easy, but it gets even better. Instead of relying on a manually run command to update your backup(s), you most probably want to do this automatically. A good way to do this periodically is cron on Unix or scheduled tasks on Windows. Just configure the command above to run inside your repositories at regular intervals.
Another, more sophisticated way to automate backups is using Git’s own hooks. Using e.g. the `post-commit` hook, you might update your backups every time you commit to your repository. That might seem a bit too much, and depending on your repository contents and network speed, pushing into your backup repository may really take some time. So you might want to tweak your hook so that not every commit is pushed, but only some of them. You might check the timestamp or the contents of the commit or several other things. Git’s hooks can be a really mighty tool.
Hooks like these are only samples of what can be done, though. They may not be that useful for your specific use case and infrastructure. But backing up using Git is practicable and does not need any special hardware or software (except Git, of course). Additionally, this way of backing up is useful even in pretty restricted enterprise architectures. On the other hand, it may also fit the most complex infrastructures. I’m thinking of multiple Git repositories here, where one acts as some sort of proxy, automatically triggering backups to multiple other repositories, e.g. using a `post-receive` hook. You see: boundless possibilities.
So you may find better ways to implement your own Git-based backup system. Feel free to share them.
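As one possible starting point for the hook-based approach described above, here is a minimal sketch of a `post-commit` hook that pushes to the `backup` remote, but at most once per interval, by checking a timestamp. The marker file `last-backup`, the variable `BACKUP_MIN_INTERVAL` and the one-hour default are hypothetical names and values chosen for this example, not part of Git. The snippet installs the hook in a throwaway repository and makes two commits to show the effect.

```shell
# Throwaway demo: paths, the marker file and the interval are assumptions.
set -e
demo=$(mktemp -d)
git init -q "$demo/work"
git init -q --bare "$demo/backup.git"
cd "$demo/work"
git remote add --mirror=push backup "$demo/backup.git"

# Install the hook; the quoted 'EOF' keeps the script text unexpanded.
mkdir -p .git/hooks
cat > .git/hooks/post-commit <<'EOF'
#!/bin/sh
# Push to the mirror remote "backup" at most once per interval.
interval=${BACKUP_MIN_INTERVAL:-3600}            # seconds (hypothetical knob)
stamp="$(git rev-parse --git-dir)/last-backup"   # hypothetical marker file
now=$(date +%s)                                  # works with GNU and BSD date
last=$(cat "$stamp" 2>/dev/null || echo 0)
if [ $((now - last)) -ge "$interval" ]; then
    # Record the time only if the push succeeded.
    git push --quiet backup && echo "$now" > "$stamp"
fi
exit 0   # a failed backup should not make the hook fail
EOF
chmod +x .git/hooks/post-commit

commit() {
    git -c user.name=Demo -c user.email=demo@example.com \
        commit -q --allow-empty -m "$1"
}

commit "first"     # no marker file yet, so the hook pushes
first=$(git rev-parse HEAD)
commit "second"    # still inside the interval, so the hook skips the push
```

After the second commit the backup repository still points at the first one; it catches up with the next commit made once the interval has passed. The same `git push --quiet backup` line could just as well be run from cron or a scheduled task instead of a hook.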
[1] You don’t use Git to store things other than source code? Well, then
this post might demonstrate that it isn’t some strange nerd fetish to
store everything in a VCS repository. It’s a good way to not only store
data, but also store its history… securely.
[2] Git’s SHA hashes will help Git to detect corruption of database objects,
but there’s no reliable, built-in way to recover from file corruption.
There are ways to recover corrupt objects, though they are neither
reliable nor straightforward.
[3] That is, if you push every branch and every tag. Pushing to another
repository will never include your stash and reflog, though.
from http://stdout.koraktor.de/blog/2010/10/02/using-git-as-a-backup-tool/