Dar Documentation

Good Backup Practice Short Guide

Presentation

This short guide is here to gather important (and somehow obvious) techniques about computer backups. It also explains the risks you take not following these principles. I thought this was obvious and well known by anyone, up to recently when I started getting feedback of people complaining about their lost data because of bad media or other reasons. To the question "have you tested your archive?", I was surprised to get the negative answers.

This guide is not especially linked to Disk ARchive (aka dar) no more than to any other tool, thus, you can take advantage of reading this document if you are not sure of your backup procedure, whatever is the backup software you use.

Notions

In the following we will speak about backup and archive:

With the previous meaning of an archive you can also make a backup of an archive (for example a clone copy of your archive).

Archives

  1. The first think to do just after making an archive is testing it on its definitive medium. There are several reasons that make this testing important:

    Of course the archive testing must be done when the backup has been put on its definitive place (CD-R, floppy, tape, etc.), if you have to move it (copy to another media), then you need to test it again on the new medium. The testing operation, must read/test all the data, not just list the archive contents (-t option instead of -l option for dar). And of course the archive must have a minimum mechanism to detect errors (dar has one without compression, and two when using compression).

  2. As a replacement for testing, a better operation is to compare the files in the archive with those on the original files on the disk (-d option for dar). This makes the same as testing archive readability and coherence, while also checking that the data is really identical whatever the corruption detection
    mechanisms used are. This operation is not suited for a set of data that changes (like a active system backup), but is probably what you need when creating an archive.

  3. Increasing the degree of security, the next thing to try is to restore the archive in a temporary place or better on another computer. This will let you check that from end to end, you have a good usable backup, on which you can rely. Once you have restored, you will need to compare the result, the diff command can help you here, moreover, this is a program that has no link with dar so it would be very improbable to have a common bug to both dar and diff that let you think both original and restored data are identical while they are not!

  4. Unfortunately, many (all) media do alter with time, and an archive that was properly written on a correct media may become unreadable with time and/or bad environment conditions. Thus of course, take care not to store magnetic storages near magnetic sources (like HiFi speakers) or enclosed in metallic boxes, as well as avoid having sun directly lighting your CD-R(W) DVD-R(W), etc. Also mentioned for many media is humidity: respect the acceptable humidity range for each medium (don't store your data in your bathroom, kitchen, cave, ...). Same thing about the temperature. More generally have a look at the safe environmental conditions described in the documentation, even just once for each media type.

    The problem with archive is that usually you need them for a long time, while the media has a limited lifetime. A solution is to make one (or several) copy (i.e.: backup of archive) of the data when the original support has arrived its half expected life.

    Another solution, is to use Parchive, it works in the principle of RAID disk systems, creating beside each file a par file which can be used later to recover missing part or corrupted part of the original file. Of course, Parchive can work on dar's slices. But, it requires more storage, thus you will have to choose smaller slice size to have place to put Parchive data on your CD-R or DVD-R for example. The amount of data generated by Parchive depends on the redundancy level (Parchive's -r option). Check the notes for more informations about using Parchive with dar. When using read-only medium, you will need to copy the corrupted file to a read-write medium for Parchive can repair it. Unfortunately the usual 'cp' command will stop when the first I/O error will be met, making you unable to get the sane data *after* the corruption. In most case you will not have enough sane data for Parchive to repair you file. For that reason the "dar_cp" tool has been created (it is available included in dar's package). It is a cp-like command that skips over the corruptions (replacing it by a field of zeored bytes, which can be repaired afterward by Parchive) and can copy sane data after the corrupted part.

  5. another problem arrives when an archive is often read. Depending on the medium, the fact to read, often degrades the media little by little, and makes the media's lifetime shorter. A possible solution is to have two copies, one for reading and one to keep as backup, copy which should be never read except for making a new copy. Chances are that the often read copy will "die" before the backup copy, you then could be able to make a new backup copy from the original backup copy, which in turn could become the new "often read" medium.

  6. Of course, if you want to have an often read archive and also want to keep it forever, you could combine the two of the previous techniques, making two copies, one for storage and one for backup. Once you have spent a certain time (medium half lifetime for example), you could make a new copy, and keep them beside the original backup copy in case of.

  7. Another problem, is safety of your data. In some case, the archive you have does not need to be kept a very long time nor it needs to be read often, but instead is very "precious". in that case a solution could be to make several copies that you could store in very different locations. This could prevent data lost in case of fire disaster, or other cataclysms.

  8. Yet another aspect is the privacy of your data. An archive may not have to be accessible to anyone. Several directions could be possible to answer this problem:

    For encryption, dar provides strong encryption inside the archive (blowfish, aes, etc.), it does preserve the direct access feature that avoid you having decrypt the whole the whole archive to restore just one file. But you can also use an external encryption mechanism, like GnuPG to encrypt slice by slice for example, the drawback is that you will have to decrypt each slice at a whole to be able to recover a single file in it.

Backup

Backups act a bit like an archive, except that they are a copy of a changing set of data, which is moreover expected to stay on the original location (the system). But, as an archive, it is a good practice to at least test the resulting backups, and once a year if possible to test the overall backup process by doing a restoration of your system into a new virtual machine or a spare computer, checking that the recovered system is fully operational.

The fact that the data is changing introduces two problems:

The backup has also the role of keeping a recent history of changes. For example, you may have deleted a precious data from your system. And it is quite possible that you notice this mistake long ago after deletion. In that case, an old backup stays useful, in spite of many more recent backups.

In consequences, backup need to be done often for having a minimum delta in case of crash disk. But, having new backup do not mean that older can be removed. A usual way of doing that, is to have a set of media, over which you rotate the backups. The new backup is done over the oldest backup of the set. This way you keep a certain history of your system changes. It is your choice to decide how much archive you want to keep, and how often you will make a backup of your system.

Differential / incremental backup

A point that can increase the history while saving media space required by each backup is the differential backup. A differential backup is a backup done only of what have changed since a previous backup (the "backup of reference"). The drawback is that it is not autonomous and cannot be used alone to restore a full system. Thus there is no problem to keep the differential backup on the same medium as the one where is located the backup of reference.

Doing a lot of consecutive differential backup (taking the last backup as reference for the next differential backup, which some are used to call "incremental" backups), will reduce your storage requirement, but will extra timecost at restoration in case of computer accident. You will have to restore the full backup (of reference), then you will have to restore all the many backup you have done up to the last. This implies that you must keep all the differential backups you have done since the backup of reference, if you wish to restore the exact state of the filesystem at the time of the last differential backup.

It is thus up to you to decide how much differential backup you do, and how much often you make a full backup. A common scheme, is to make a full backup once a week and make differential backup each day of the week. The backup done in a week are kept together. You could then have ten sets of full+differential backups, and a new full backup would erase the oldest full backup as well as its associated differential backups, this way you keep a ten week history of backup with a backup every day, but this is just an example.

An interesting protection suggested by George Foot on the dar-support mailing-list: once you make a new full backup, the idea is to make an additional differential backup based on the previous full backup (the one just older than the one we have just built), which would acts as a substitute for the actual full backup in case something does go wrong with it later on.

Decremental Backup

Based on a feature request for dar made by "Yuraukar" on dar-support mailing-list, the decremental backup provides an interesting approach where the disk requirement is optimized as for the incremental backup, while the latest backup is always a full backup (while this is the oldest that is full, in the incremental backup approach). The drawback here is that there is some extra work at each new backup creation to transform the former more recent backup from a full backup to a so called "decremental" backup.

The decremental backup only contains the difference between the state of the current system and the state the system had at a more ancient date (the date of the full backup corresponding the decremental backup was made).

In other words, the building of decremental backups is the following:

This way you always have a full backup as the lastest backup, and decremental backups as the older ones.

You may still have several sets of backup (one for each week, for example, containing at the end of a week a full backup and 6 decremental backups), but you also may just keep one set (a full backup, and a lot of decremental backups), when you will need more space, you will just have to delete the oldest decremental backups, thing you cannot do with the incremental approach, where deleting the oldest backup, means deleting the full backup that all others following incremental backup are based upon.

At the difference of the incremental backup approach, it is very easy to restore a whole system: just restore the latest backup (by opposition to restoring the more recent full backup, then the as many incremental backup that follow). If now you need to recover a file that has been erased by error, just use a the adequate decremental backup. And it is still possible to restore a whole system globally in a state it had long ago before the lastest backup was done: you will for that restore the full backup (latest backup), then in turn each decremental backup up to the one that correspond to the epoch of you wish. The probability that you have to use all decremental backup is thin compared to the probability you have to use all the incremental backups: there is effectively much more probability to restore a system in a recent state than to restore it in a very old state.

There is however several drawbacks:

time
Doing each time a full backup is time consumming and creating a decremental backup from two full backups is even more time consuming...
temporary disk space
Each time you create a new backup, you temporarily need more space than using the incremental backup, you need to keep two full backups during a short period, plus a decremental backup (usually much smaller than a full backup), even if at then end you remove the oldest full backup.

In conclusion, I would not tell that decremental backup is the panacea, however it exists and may be of interest to some of you. More information about dar's implementation of decremental backup can be found here.


Any other trick/idea/improvement/correction/evidences are welcome!

Denis.