Dar Documentation


Dar/Libdar Internals - Notes





Introduction

This page gathers a collection of notes. Each was written after the implementation of a given feature, mainly for further reference but also for user information. The purpose of these notes is, on one side, to record the implementation choices and the arguments that led to them, and on the other side to let users learn about these choices and bring their remarks without having to dig deeply into the code to learn dar's internals.

Contents



EA & differential backup

Brief presentation of EA:

EA stands for Extended Attributes. In a Unix filesystem, a regular file is composed of a set of bytes (the data) and an inode. The inode adds properties to the file, such as owner, group, permissions, and dates (last modification date of the data [mtime], last access date to the data [atime], and last inode change date [ctime]). Last, the name of the file is not contained in the inode, but in the directory(ies) it is linked to. When a file is linked more than once in the directory tree, we speak about "hard links": the same data and associated inode appear several times in the same or different directories. This is not the same as a symbolic link, which is a file that contains the path to another file (which may or may not exist); a symbolic link has its own inode. OK, now let's talk about EA:

Extended attributes are a recent feature of Unix filesystems (at the time of writing, year 2002). They extend the attributes provided by the inode and associated with the data. They are not part of the inode, nor part of the data, nor part of a given directory: they are stored beside the inode as a set of key/value pairs. The owner of the file can define any key and eventually associate data to it, and can also list and remove a particular key. What are they used for? They are a way to associate arbitrary information to a file.
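For illustration, here is a minimal sketch (not libdar code) listing a file's EA keys with the Linux xattr API; the usual two-call pattern first asks for the needed buffer size, then fetches the null-separated key names:

  #include <sys/xattr.h>
  #include <cstring>
  #include <iostream>
  #include <vector>

  int main(int argc, char *argv[])
  {
      if (argc != 2)
          return 1;
      ssize_t len = listxattr(argv[1], nullptr, 0); // size needed to hold all keys
      if (len <= 0)
          return 0;                                 // no EA at all (or an error)
      std::vector<char> names(len);
      len = listxattr(argv[1], names.data(), names.size());
      for (ssize_t i = 0; i < len; i += std::strlen(&names[i]) + 1)
          std::cout << &names[i] << std::endl;      // one key per line
      return 0;
  }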

One particularly interesting use of EA is ACL: Access Control Lists. ACL can be implemented using EA and bring finer grain to file access permissions. For more information on EA and ACL, see the site of Andreas Grunbacher.

EA & Differential Backup

To determine whether an EA has changed, dar looks at the ctime value: if ctime has changed (due to an EA change, but also to a permission or owner change), dar saves the EA. ctime also changes if atime or mtime changes. So if you just access a file or modify its data, dar will consider that the EA have changed too. This is not really fair, I admit.
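A minimal sketch of this ctime-based test (simplified, not the actual libdar code; the saved_ctime parameter stands for the ctime recorded in the archive of reference):

  #include <sys/stat.h>
  #include <ctime>

  bool ea_considered_changed(const char *path, time_t saved_ctime)
  {
      struct stat st;
      if (stat(path, &st) != 0)
          return true;                  // cannot tell, assume a change
      // ctime changes upon EA modification, but also upon permission or
      // owner change, and whenever atime or mtime change: false positives
      return st.st_ctime != saved_ctime;
  }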

Something better would be to compare EA one by one, and only record those that have changed or have been deleted. But to be able to compare all EA and their values, the reference EA must reside in memory. As EA can grow up to 64 KB per file, this could lead to a quick saturation of the virtual memory, which is already solicited enough by the catalogue.

These two schemes imply different patterns for saving EA in the archive. In the first case (no EA in memory except at the time of the operation on it), to avoid skipping within the archive (and asking the user to change disks too often), EA must be stored beside the data of the file (if present). Thus they must be distributed all along the archive (except at the end, which only contains the catalogue).

In the second case (EA are loaded in memory for comparison), EA must reside beside or within the catalogue, in any case at the end of the archive, so that the user does not need all the disks just to take an archive as reference.

As the catalogue already grows fast with the number of files to save (from a few bytes per hard link to around 400 bytes per directory inode), the memory saving option has been adopted.

Thus, EA change detection is based on the ctime change. Unfortunately, no system call permits restoring the ctime. Consequently, restoring a differential backup after its reference has been restored will present the restored inodes as more recent than those in the differential archive, so the -r option would prevent any EA restoration. In consequence, -r has been disabled for EA: it only concerns data contents. If you don't want to restore any EA but just more recent data, you can use the following: -r -u "*"



Archive structure in brief


The Slice Level

A slice is composed of a header, data and a trailer (the trailer appeared with archive format version 8):

+--------+-------------------------------------------+-------+
| header |  Data                                     |Trailer|
|        |                                           |       |
+--------+-------------------------------------------+-------+

The slice header is composed of:
  • a magic number that tells this is a dar slice
  • an internal_name, which is unique to a given archive and shared by all its slices
  • a flag that tells whether the slice is the last of the archive, or whether a trailer is present that contains this information
  • an extension flag, which was used in older archives but is now always set to 'T', telling that a TLV list follows
  • a TLV (Type Length Value) list of items; it contains the slice size and first slice size. The TLV list will receive any future new field related to the slice header.
+-------+----------+------+-----------+-------+
| Magic | internal | flag | extension | TLV   |
| Num.  | name     | byte | byte      | list  |
+-------+----------+------+-----------+-------+

The header is the first thing to be written, and if the current slice is not the last slice (all the data to write could not fit in it), the flag field is updated to indicate that another slice follows. The header is also the first part to be read. Since archive format 8, the flag is set to a specific value indicating that the information telling whether the slice is the last one is located in a slice trailer.

The TLV list may contain several fields:
  • First slice size [type 1]
  • Other slice size / all slice size if no first slice size is present [type 2]
  • data_name [type 3]. This field is detailed below.
A TLV list is of course a list of TLVs:

+-------+----------+------+-----------+- ...-----+-------+
| Number| TLV 1    | TLV 2| TLV 3     |          | TLV n |
| of TLV|          |      |           |          |       |
+-------+----------+------+-----------+--...-----+-------+

Each TLV item is, as usual, a set of three fields:

+---------+-------------------------+-------------------------+
| Type    | Length                  | Value                   |
|(2 bytes)| (arbitrary large value) | (arbitrary large data)  |

+---------+-------------------------+-------------------------+

The 2-byte type is large enough for today's needs (65535 different types, while only three are used); type 65535 is however reserved for future use and will signal a new format for the type field.
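To make this concrete, here is a hedged sketch of reading one TLV item. The real code uses libdar's infinint for the arbitrarily large length field, here simplified to a 32-bit length; generic_file reading is reduced to a plain std::istream and the big-endian byte order is an assumption of the sketch:

  #include <cstdint>
  #include <istream>
  #include <string>

  struct tlv
  {
      uint16_t    type;   // 2 bytes; 65535 reserved for a future type format
      std::string value;  // 'length' bytes of data
  };

  // simplified: real TLV lengths are arbitrarily large (infinint)
  bool read_tlv(std::istream & in, tlv & out)
  {
      unsigned char t[2], l[4];
      if (!in.read(reinterpret_cast<char *>(t), 2)) return false;
      if (!in.read(reinterpret_cast<char *>(l), 4)) return false;
      out.type = (t[0] << 8) | t[1];                        // big-endian (assumed)
      uint32_t len = (uint32_t(l[0]) << 24) | (uint32_t(l[1]) << 16)
                   | (uint32_t(l[2]) << 8)  |  uint32_t(l[3]);
      out.value.resize(len);
      return static_cast<bool>(in.read(&out.value[0], len));
  }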

To know in which slice and at which position to find a particular piece of data, dar needs to know each slice's size. This is the reason why each slice contains the slicing information, in particular the last slice. In older versions, dar had to read the first slice first to get this slicing information; only then could it read the archive contents at the end of the last slice. Today, reading the last slice, dar can fetch the slicing scheme from the slice header (what we just detailed) and fetch the archive contents at the end of this same last slice.
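As an illustration of the arithmetic this implies, here is a small sketch (slice header and trailer overhead ignored for simplicity, names hypothetical) mapping an offset within the archive data to a slice number and an offset within that slice:

  #include <cstdint>

  struct slice_location
  {
      uint64_t slice_num;       // slices are numbered from 1
      uint64_t offset_in_slice;
  };

  slice_location locate(uint64_t pos,        // offset within the whole archive data
                        uint64_t first_size, // data bytes held by the first slice
                        uint64_t other_size) // data bytes held by any other slice
  {
      if (pos < first_size)
          return { 1, pos };                 // inside the first slice
      pos -= first_size;
      return { 2 + pos / other_size, pos % other_size };
  }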

The trailer (which is one byte long) is new since archive format version 8 (released with 2.4.0). It contains the value that was located in the header flag field in older archive formats, telling whether the slice is the last of the archive or not. When writing down a single-sliced archive (no -s option provided), both the header and the trailer tell that the slice is the last of the archive (duplicated information). However, when doing a multi-sliced archive, it is not possible to know whether a slice is the last before reaching the requested amount of data per slice (which depends on the amount of bytes to save, compression ratio, encryption overhead, etc.). Thus the header flag contains a value telling that, to know whether the slice is the last or not, one must read the trailer.

In the older format, it was necessary to seek back and update the header with the correct information when a new slice had to be created. Keeping this behavior, it would not have been possible to compute a digest "on the fly" (see --hash option). The addition of the trailer was required for that feature: computing a md5 or sha1 hash of each slice. This costs one byte per slice, yes.

Data Name

As seen above, the header fields include the three following identifiers:
  • magic number
  • internal name
  • data name
As already said, the magic number is constant and lets dar be (almost) sure a given file is a dar slice file. Also briefly explained, the internal_name is an identifier that lets dar be almost sure that several slices belong to the same archive (a problem can arise if two archives of same basename have their slices mixed together: dar will see that and report it to the user).

The new and not yet described field is the "data_name". The data_name field is also present in the archive catalogue (the table of contents) of each archive. It may have the same value as the one in the slice headers (normal archives) or another value if the archive results from a catalogue isolation process.

Why this field? A new feature of release 2.4.0 is the ability to use an extracted catalogue as a backup of the internal catalogue of a given archive. Comparing the data_name value of the catalogue resulting from the isolation operation to the data_name value present in the slices of the archive to rescue, dar can be (almost) sure that the extracted catalogue matches the data present in the archive the user is trying to use it with.

In brief:

Fields                          Normal     Resliced using   Resulting from   Isolated archive resliced
                                archive    dar_xform        isolation        with dar_xform
------------------------------  ---------  ---------------  ---------------  -------------------------
internal_name (slice header)    A          B                C                D
data_name (slice header)        A          A                C                C
data_name (archive catalogue)   A          A                A                A

Archive Level

The archive level describes the structure of the slices' data fields (stripping the header and trailer of each slice), when they are all stuck together from slice to slice:

+---------+----------------------------+-----------+--------+---------+--------+
| version |   Data                     | catalogue | term 1 | version | term 2 |
| header  |                            |           |        | trailer |        |
+---------+----------------------------+-----------+--------+---------+--------+

The version header is a short version of the version trailer. It is used when reading an archive in sequential mode, to be able to prepare the proper compression layer and to know whether escape sequence marks are present in the archive.

The version trailer (which may still be called "version header" in some parts of the documentation, because it was only located at the beginning of the archive in previous archive formats) is composed of:
  • the edition (format version) of the archive
  • the compression algorithm used
  • the command line used for creating the archive, now known as "user comment"
  • a flag, telling:
    • whether the archive is encrypted,
    • whether it has escape sequence marks,
    • whether the header/trailer contains an encrypted key,
    • whether the header/trailer contains the initial offset field
  • the initial offset (telling where the data starts in the archive; only present in the trailer)
  • the crypto algorithm used (present only if the flag tells the archive is encrypted)
  • the size of the encrypted key that follows (present only if the flag tells an encrypted key is present)
  • the encrypted key (encrypted by means of the GPG asymmetric algorithm; present only if the flag says so)
  • a CRC (Cyclic Redundancy Check) computed on the whole version header or trailer
+---------+------+---------------+------+--------+--------+
| edition | algo | command line  | flag | initial|  CRC   |
|         |      |               |      | offset |        |
+---------+------+---------------+------+--------+--------+

The trailer is used when reading an archive in direct access mode, to build the proper compression layer, escape layer (it is needed, when marks have been inserted in the archive, to un-escape data that could otherwise be taken for an escape sequence mark) and encryption layer.

The data is a sequence of file contents, with EA if present. When tape marks are used, a copy of the CRC is placed after each file's data and each file's EA, to be used when reading the archive in sequential mode. This CRC is also dropped into the catalogue, which takes place at the end of the archive, to be used when reading the archive in direct access mode (the default).

  ....--+---------------------+----+------------+-----------+----+---....
        |  file data          | EA | file data  | file data | EA |
        | (may be compressed) |    | (no EA)    |           |    |
  ....--+---------------------+----+------------+-----------+----+---....

The catalogue contains all inodes, directory structure and hard link information, as well as data and EA CRCs. The directory structure is stored in a simple way: the inode of a directory comes first, then the inodes of the files it contains, then a special entry named "EOD" for End of Directory. Consider the following tree:

 - toto
    | titi
    | tutu
    | tata
    |   | blup
    |   +--
    | boum
    | coucou
    +---

It would generate the following sequence for catalogue storage:

+-------+------+------+------+------+-----+------+--------+-----+
|  toto | titi | tutu | tata | blup | EOD | boum | coucou | EOD |
|       |      |      |      |      |     |      |        |     |
+-------+------+------+------+------+-----+------+--------+-----+

EOD takes one byte, and this way there is no need to store the full path of each file: just the filename is recorded.
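The following sketch (illustrative only, not libdar's catalogue code) shows how full paths can be rebuilt from such a flattened sequence with a simple directory stack; here the hypothetical "DIR:" prefix flags a directory entry and "EOD" closes the current directory:

  #include <iostream>
  #include <string>
  #include <vector>

  int main()
  {
      const std::vector<std::string> seq = { "DIR:toto", "titi", "tutu",
                                             "DIR:tata", "blup", "EOD",
                                             "boum", "coucou", "EOD" };
      std::vector<std::string> stack;    // directories we are currently in

      for (const std::string & e : seq)
      {
          if (e == "EOD")
          {
              stack.pop_back();          // leave the current directory
              continue;
          }
          bool is_dir = (e.rfind("DIR:", 0) == 0);
          std::string name = is_dir ? e.substr(4) : e;
          std::string path;
          for (const std::string & d : stack)
              path += "/" + d;
          std::cout << path + "/" + name << std::endl;
          if (is_dir)
              stack.push_back(name);     // enter the new directory
      }
      return 0;
  }

Run on the tree above, this prints /toto, /toto/titi, /toto/tutu, /toto/tata, /toto/tata/blup, /toto/boum and /toto/coucou.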

The terminator stores the position of the beginning of the catalogue; it is the last thing to be written. Thus dar first reads the terminator, then the catalogue. Well, there are now two terminators, both meant to be read backward. The second terminator points to the beginning of the "version trailer", which is read first in direct access mode. The first terminator points to the start of the catalogue, which is read once the ad hoc compression and encryption layers have been built based on the information found in the "version trailer".

All Together

Here is an example of how data can be structured in a four slice archive:

+--------+--------+------------------------+--+
| slice  | version|  file data + EA        |Tr|
| header | header |                        |  |
+--------+--------+------------------------+--+

The first slice (just above) has been defined smaller using the -S option.

+--------+-----------------------------------------------------------------+--+
| slice  |           file data + EA                                        |Tr|
| header |                                                                 |  |
+--------+-----------------------------------------------------------------+--+

+--------+-----------------------------------------------------------------+--+
| slice  |           file data + EA                                        |Tr|
| header |                                                                 |  |
+--------+-----------------------------------------------------------------+--+

+--------+---------------------+-----------+-------+---------+--------+--+
| slice  |   file data + EA    | catalogue | term 1| version | term 2 |Tr|
| header |                     |           |       | trailer |        |  |
+--------+---------------------+-----------+-------+---------+--------+--+

The last slice is smaller because there was not enough data to fill it.

The archive is written sequentially this way.


Other Levels

Things get a bit more complicated if we consider compression and encryption. The way the problem is addressed in dar's code is a bit like the way networks are designed in computer science: using the notion of layers. Here there is an additional constraint: a given layer may or may not be present (encryption, compression, slicing for example), so all layers must offer the same interface to the layer above them. This interface is defined by the pure virtual class "generic_file", which provides generic methods for reading, writing, skipping, and getting the current offset when writing or reading data to/from a "generic_file".

This way, the compressor class acts like a file that compresses data written to it and writes the compressed data to another "generic_file" below it. The strong encryption and scramble classes act the same, but in place of compressing/uncompressing they encrypt/decrypt the data to/from another generic_file object. The slicing we have seen above follows the same principle: a "sar" object transfers the data written to it to several fichier [=file] objects. The fichier class also inherits from the generic_file class and is a wrapper for the plain filesystem calls.

Some new classes have been added with format 8, in particular the escape class, which inserts escape sequence marks at requested positions and modifies the data written to it so that it never looks like an escape sequence mark. To reduce the amount of context switches when reading the catalogue (which generates a ton of small reads), a cache class is also present: it gathers small writes made to it into larger writes, and pre-reads a large amount of data to answer the many small reads done when building the catalogue in memory from the archive.
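As an illustration, here is a simplified sketch of that interface; it is not the actual libdar declaration (method names and types differ in the real class), just the idea of one common abstract interface that every layer implements while holding a pointer to the generic_file below it:

  #include <cstddef>

  class generic_file
  {
  public:
      virtual ~generic_file() = default;

      // read/write may transform the data (compress, encrypt, escape...)
      // before exchanging it with the layer below
      virtual std::size_t read(char *buf, std::size_t size) = 0;
      virtual void write(const char *buf, std::size_t size) = 0;

      // random access within the layer
      virtual void skip(unsigned long long offset) = 0;
      virtual unsigned long long get_position() const = 0;
  };

A compressor, for example, is such an object built on top of another generic_file: whatever is written to it gets compressed and written to the lower object; stacking a cipher layer below it, and a slicing ("sar") layer below that, yields the structure pictured hereafter.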

Here are now all currently possible layers together:

              +----+--+----+-...........+---------+
archive       |file|EA|file|            |catalogue|
layout        |data|  |data|            |         |
              +----+--+----+-...........+---------+
                |   |    |      |              |
            +-----+ | +-------+ |              |
sparse      |spars| | |sparse | |              |
file        |file | | |file   | |              |
detection   |detec| | |detect.| |              |
layer       +-----+ | +-------+ |              |
(optional)      |   |    |      |              |
                V   V    V      V              V
              +-----------------------------------+
compression   |         (compressed)  data        |
              +-----------------------------------+
                    |                      |
                    |                      |
                    V                      V
              +-----------------------------------+
escape layer  |   escaped data / escape sequences |
(optional)    +-----------------------------------+
                    |                      |          / First Terminateur
                    |                      |          |
                    |                      |          V
elastic  +---+      |                      |       +----+---+
buffers  |EEE|      |                      |       | T1 |EEE|
         +---+      |                      |       +----+---+
           |        |                      |              |           Second
           V        V                      V              V         Terminator
         +--------------------------------------------------+              |
cipher   |        (encrypted) data / cache if no encryption |              |
         +--------------------------------------------------+              V
                    |                         |               +---------+----+
+-------+           |                         |               | trailer | T2 |
| header|           |                         |               +---------+----+
+-------+           |                         |                    |      |
    |               |                         |                    |      |
    V               V                         V                    V      v
+-----------------------------------------------------------------------------+
|                  data                                                       |
+-----------------------------------------------------------------------------+
        |         |  |         |   |        |   |        |  |    |  |        |
slice   |         |  |         |   |        |   |        |  |    |  |        |
headers |         |  |         |   |        |   |        |  |    |  |        |
 |  |   |         |  |         |   |        |   |        |  |    |  |        |
 |  +---|------\  |  |         |   |        |   |        |  |    |  |        |
 V      V      V  V  V         V   V        V   V        V  V    V  V        V
+---------+  +---------+  +---------+  +---------+  +-------+  +-------+  +----+
|HH| data |  |HH| data |  |HH| data |  |HH| data |  |HH|data|  |HH|data|  |HH| |
+---------+  +---------+  +---------+  +---------+  +-------+  +-------+  +----+
  slice 1      slice 2      slice 3      slice 4      slice 5


The elastic buffers are here to prevent a plain text attack, where one knows which data is expected at a given place and tries to guess the cipher by comparing the expected data and the encrypted one. As dar generates structured archives, there would be some possibility to use this attack to crack an archive's encryption. To overcome this problem, elastic buffers have been added at the beginning and at the end of the encrypted data. This way it is not possible to know where a given archive structure is located within the encrypted data. The elastic buffers are made of random data and contain, at a random place, a pattern that tells the overall size of the buffer (a size which is itself randomly chosen at archive creation). The pattern is of the form ">###<" where the hash field (###) contains the elastic buffer size in binary. Small elastic buffers can be "><" for two bytes or "X" for one byte, but as they are encrypted beside the archive data, it is not possible to determine their size for one who does not hold the archive encryption key. Elastic buffers are usually several kilobytes long. Here follows an example of elastic buffer:

972037219>20<8172839


For clarity, the size field between '>' and '<' has been written in decimal instead of binary, as well as the random data inside the elastic buffer. The location of the size field '>20<' is also randomly chosen at creation time.
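Here is a hedged sketch of building such a buffer (illustrative only: the real libdar code differs, and must notably take care that the random filling cannot be confused with the '>' and '<' marks):

  #include <cstdlib>
  #include <vector>

  // build an elastic buffer of 'total' bytes: random filling plus a
  // ">size<" field, with 'size' stored in binary, at a random position
  std::vector<unsigned char> make_elastic(std::size_t total)
  {
      std::vector<unsigned char> buf(total);
      for (unsigned char & b : buf)
          b = std::rand() % 256;                    // random filling

      // encode 'total' in binary, most significant byte first
      std::vector<unsigned char> field;
      for (std::size_t s = total; s > 0; s >>= 8)
          field.insert(field.begin(), static_cast<unsigned char>(s & 0xFF));

      // assumes total is large enough for '>' + field + '<' to fit
      std::size_t pos = std::rand() % (total - field.size() - 1);
      buf[pos] = '>';
      for (std::size_t i = 0; i < field.size(); ++i)
          buf[pos + 1 + i] = field[i];
      buf[pos + 1 + field.size()] = '<';
      return buf;
  }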

A terminateur is a short structure that is intended to be read backward. It gives the absolute position of a given item within the archive: the second terminateur lets dar skip to the beginning of the archive trailer, while the first terminateur (eventually encrypted) lets dar skip to the beginning of the catalogue.





Scrambling


Before strong encryption was implemented, dar had only a very simple and weak encryption mechanism. It remains available in current releases under the "scram" algorithm name. Its main advantage is that it does not rely on any external library; it is completely part of libdar.

How does it work?

Consider the pass phrase as a string, thus a sequence of bytes, thus a sequence of integers each one ranging from 0 to 255 (including 0 and 255). The data to "scramble" is also a sequence of bytes, usually much longer than the pass phrase. The principle is to add, byte by byte, the pass phrase to the data, modulo 256. The pass phrase is repeated all along the archive. Let's take an example:

The pass phrase is "he\220lo" (where \220 is the character whose decimal value is 220). The data is "example".

Taken from the ASCII standard:
h = 104
l = 108
o = 111
e = 101
x = 120
a = 97
m = 109
p = 112

        e       x       a       m       p       l       e
        101     120     97      109     112     108     101

+       h       e       \220    l       o       h       e
        104     101     220     108     111     104     101

---------------------------------------------------------------

        205     221     317     217     223     212     202

---------------------------------------------------------------
modulo
256 :   205     221     61      217     223     212     202
        \205    \221    =       \217    \223    \212    \202


Thus the data "example" will be written in the archive as "\205\221=\217\223\212\202".

This method allows decoding any portion without knowing the rest of the data. It does not consume many resources to compute, but it is terribly weak and easy to crack. Of course, the data is more difficult to retrieve without the key when the key is long. Today dar can also use strong encryption (blowfish and a few others) and, thanks to encryption blocks, can still avoid reading the whole archive to restore any single file.
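In code, the whole "scram" cipher boils down to a few lines. The sketch below (not the libdar implementation) scrambles and unscrambles a buffer; casting to unsigned char gives the modulo 256 for free:

  #include <string>

  // add the pass phrase, byte by byte and modulo 256, to the data;
  // the pass phrase is repeated as many times as necessary
  std::string scramble(const std::string & pass, const std::string & data)
  {
      std::string ret = data;
      for (std::string::size_type i = 0; i < ret.size(); ++i)
          ret[i] = static_cast<unsigned char>(
                     static_cast<unsigned char>(ret[i])
                   + static_cast<unsigned char>(pass[i % pass.size()]));
      return ret;
  }

  // unscrambling subtracts the pass phrase the same way
  std::string unscramble(const std::string & pass, const std::string & data)
  {
      std::string ret = data;
      for (std::string::size_type i = 0; i < ret.size(); ++i)
          ret[i] = static_cast<unsigned char>(
                     static_cast<unsigned char>(ret[i])
                   - static_cast<unsigned char>(pass[i % pass.size()]));
      return ret;
  }

With the pass phrase and data of the example above, scramble() yields the byte values 205, 221, 61, 217, 223, 212, 202, and unscramble() gives "example" back.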



Asymmetrical Encryption and Signature


dar relies on the gpgme library (GPG Made Easy) to provide strong asymmetrical encryption and signing of an archive. Asymmetrical encryption is what you do when you use a public key to cipher data (= encrypt it) and a private key to uncipher it (= decrypt it). However, dar does not encrypt the whole archive that way, nor does it sign it that way.

Instead, dar relies on the symmetrical strong encryption algorithms it has supported for some time (blowfish, twofish, camellia, AES, etc.) to cipher the archive. The key used to cipher the archive is chosen randomly, encrypted and eventually signed using the recipient emails provided from the user's keyring. This encrypted key is then put in the archive header and trailer.

To be more precise about the symmetrical key that is encrypted in the archive: its length is user defined and defaults to 512 bytes (4096 bits). A random variation of +0 to +255 bytes is added by libdar to this size. Then the key value itself is chosen randomly. The random generator used here is the one provided by libgcrypt using the GCRY_STRONG_RANDOM entropy level.

Why do it that way and not use the asymmetrical algorithm to cipher the whole archive?
  • Because it would no longer be possible to extract a single file from the archive or to read the archive's contents without reading the whole archive.
  • Because it would not be possible to quickly verify archive signatures (well, see below).
  • Because it would prevent recovering a corrupted archive past the point of corruption.
  • Because it would cost much more disk space to encrypt an archive for it to be readable by more than one recipient.
Yes, you can provide several recipients from your GPG keyring by giving their email addresses, and also at the same time sign the archive with one of your private keys. The resulting archive will be decryptable only by those recipients, and anyone knowing your public key will be able to verify that the archive has been generated by you. Of course, the verification only validates that the encryption key comes from an archive you have personally generated: a man in the middle could modify the archive data located after the key in the archive. However, it would not be possible to uncipher the tampered data using the signed key, unless the man in the middle could generate encrypted data using the same symmetric key that is encrypted in the archive. This might be possible if a recipient's private key has been compromised. Thus the signature of the key is not sufficient to prove the authenticity of the whole archive. To cope with that risk, a better solution is to activate slice hashing (md5 or sha1) and to sign these small hash files, to be provided beside the archive slices.



Overflow in arithmetic integer operations


Some code explanations about the detection of integer arithmetic operation overflows. We speak about *unsigned* integers, and we only have portable, standard ways to detect overflows when using 32-bit or 64-bit integers in place of infinint.

Developed in binary, a number is a finite sequence of digits (0 or 1). To obtain the original number from the binary representation, we must multiply each digit by a power of two and sum the results. For example, the binary representation "101101" designates the number N where:

N = 2^5 + 2^3 + 2^2 + 2^0

In that context we will say that 5 is the maximum power of N (the power of the highest non-null binary digit).

For the addition "+" operation, the result is computed modulo 2^N (with N the integer bit width). If an overflow occurs, the stored result is a+b-2^N; since b is less than 2^N, this stored result is less than a (and, symmetrically, less than b). Conversely, without overflow the result is greater than or equal to both operands. Overflow detection is thus simple: the addition has overflowed if and only if the result is less than one of the operands.
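In C++ (the language of libdar), this check is a one-liner; a minimal sketch, with an illustrative function name:

  #include <cstdint>
  #include <stdexcept>

  uint64_t checked_add(uint64_t a, uint64_t b)
  {
      uint64_t res = a + b;   // unsigned addition wraps modulo 2^64
      if (res < a)            // equivalent to res < b
          throw std::overflow_error("integer overflow on addition");
      return res;
  }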

For the subtraction "-" operation, if the second operand is greater than the first, there will be an overflow (the result must be unsigned, thus positive); otherwise there will not be any overflow. The detection is thus even simpler.

For the division "/" and modulo "%" operations, there is never an overflow (only the division by zero is illicit).

For the multiplication "*" operation, a heuristic has been chosen to quickly detect overflows; the drawback is that it may trigger false overflows when numbers get near the maximum possible integer value. Here is the heuristic used:

Given A and B, two integers whose maximum powers are m and n respectively, we have:

A < 2^(m+1)
and
B < 2^(n+1)

thus we also have:

A.B < 2^(m+1).2^(n+1)

which is:

A.B < 2^(m+n+2)

By consequence, we know that the maximum power of the product of A by B is at most m+n+1. While m+n+1 is less than or equal to the maximum power of the integer field, there will be no overflow; otherwise we consider there will be an overflow, even if this may not always be the case (this is a heuristic algorithm).
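A sketch of this heuristic for 64-bit unsigned integers (the maximum power of the field being 63); function names are illustrative:

  #include <cstdint>
  #include <stdexcept>

  // position of the highest non-null binary digit ("maximum power")
  unsigned max_power(uint64_t x)   // precondition: x > 0
  {
      unsigned p = 0;
      while (x >>= 1)
          ++p;
      return p;
  }

  uint64_t checked_mult(uint64_t a, uint64_t b)
  {
      if (a == 0 || b == 0)
          return 0;
      // the product's maximum power is at most m+n+1: beyond 63 we assume
      // an overflow, accepting rare false positives near the maximum value
      if (max_power(a) + max_power(b) + 1 > 63)
          throw std::overflow_error("possible integer overflow on multiplication");
      return a * b;
  }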



Strong encryption


Several cyphers are available. Remember that "scrambling" is not a strong encryption cypher; all others are.

To be able to use a strongly encrypted archive, you need to know the three parameters used at creation time:
  • the cypher used (blowfish, ...)
  • the key or password used
  • the encryption block size used
No information about these parameters is stored in the generated archive. If you make an error on just one of them, you will not be able to use your archive. If you forget one of them, nobody can help you: you can just consider the data in this archive as lost. This is the drawback of strong encryption.

How is it implemented?

To not completely break the possibility to directly access files, the archive is not encrypted as a whole (as an external program would do). The encryption is done block of data by block of data. Each block can be decrypted independently, and if you want to read some data somewhere, you need to decrypt the whole block(s) it is in.

In consequence, the larger the block size, the stronger the encryption, but also the longer it takes to recover a given file, in particular when the file to restore is much smaller than the encryption block size used.

An encryption block size can range from 10 bytes to 4 GB.

If encryption is used as well as compression, compression is done first, then encryption is done on compressed data.

An "elastic buffer" is introduced at the beginning and at the end of the archive, to protect against plain text attack.  The elastic buffer size randomly varies and is defined at execution time. It is composed of random (srand()) values. Two marks characters '>' and '<' delimit the size field, which indicate the byte size of the elastic buffer. The size field is randomly placed in the buffer. Last, the buffer is encrypted with the rest of the data. Typical elastic buffer size range from 1 byte to 10 kB, for both initial and terminal elastic buffers.

Elastic buffers are also used inside encryption blocks: the underlying cypher may not be able to encrypt at the requested block size boundary. If necessary, a small elastic buffer is appended to the data before encryption, to be able, at restoration time, to know the amount of real data and the amount of noise around it.

Let's take an example with blowfish. Blowfish encrypts by multiples of 8 bytes (blowfish cipher block chaining). An elastic buffer is always added to the data of an encryption block; its minimal size is 1 byte.

Thus, if you request an encryption block of 3 bytes, these 3 bytes will be padded by an elastic buffer of 5 bytes for these 8 bytes to be encrypted. This makes for very poor space usage, as only 3 bytes out of 8 are significant.

If you request an encryption block of 8 bytes, as there is no room for the minimal elastic buffer of 1 byte, a second 8-byte block is used to hold the elastic buffer, so the real encryption block will be 16 bytes.

Ideally, an encryption block of 7 bytes will use 8 bytes, with 1 byte for the elastic buffer.

This problem tends to disappear as the encryption block size grows, so it should not be a problem in normal conditions. An encryption block of 3 bytes is not a good idea for a strong encryption scheme; for information, the default encryption block size is 10 kB.



libdar and thread-safe requirement


This is for those who plan to use libdar in their own programs.

If you plan to have only one thread using libdar, there is no problem; you will however have to call one of the get_version() functions first, as usual. Things change if you intend to have several concurrent threads using the libdar library.

libdar is thread-safe under certain conditions:

Several 'configure' options have an impact on thread-safe support:

--enable-test-memory is a debug option that prevents libdar from being thread-safe, so don't use it.
--enable-special-alloc (set by default) makes a thread-safe library only if POSIX mutexes are available (pthread_mutex_t type).
--disable-thread-safe avoids looking for mutexes, so unless --disable-special-alloc is also used, the generated library will not be thread-safe.

You can check the thread-safe capability of a library thanks to the get_compile_time_feature(...) call from the API, or use the 'dar -V' command to quickly get the corresponding values, checking with 'ldd' which library has been dynamically linked to dar, if applicable.

IMPORTANT:
More than ever, it is mandatory to call get_version() before any other call; once this call returns, libdar is ready for thread-safe use. Note that even if its prototype did not change, get_version() *may* now throw an exception, so use get_version_noexcept() if you don't want to manage exceptions.

For more information about libdar and its API, check the doc/api_tutorial.html document and the API reference manual under doc/html/index.html


Dar_manager and delete files


This is for further reference and explanations.

In a dar archive, when a file has been deleted since the backup of reference (in case of a differential archive), an entry of a special type (called "detruit") is put in the catalogue of the archive; it only contains the name of the missing file.

In a dar_manager database, to each file found in one of the archives used to build this database corresponds a list of associations. These associations put in relation the mtime (date of modification of the file) with the archive number where the file has been found in that state.

There is thus no way to record "detruit" entries in a dar_manager database, as no date is associated with this type of object. Indeed, in a dar archive, we can only notice a file has been destroyed because it is absent from the filesystem but present in the catalogue of the archive of reference. Thus we know the file has been destroyed between the date the archive of reference was made and the date the current archive is actually made. Unfortunately, no date is recorded in a dar archive telling at which time it was made.

Thus, from dar_manager, inspecting a catalogue, there is no way to give a significant date to a "detruit" entry. In consequence, for a given file which has been removed, then recreated, then removed again along a series of differential backups, it is not possible to order the times when this file has been removed within the series of dates when it existed.

The ultimate consequence is that if the user asks dar_manager to restore a directory in the state just before a given date (-w option), it will not be possible to know whether a given file existed at that time. We can effectively see that it was not present in a given archive, but as we don't know the date of that archive we cannot determine whether it is before or after the date requested by the user. As dar_manager is thus not able to restore the non-existence of a file for a given time, we must use dar directly with the archive that was made at the date we wish.

Note that having a date stored in each dar archive would not solve the problem without some more information. First, we should assume that the date is consistent from host to host and from time to time (what if the user changes the time due to daylight saving, or moves around the Earth, or if two users in two different places share a filesystem --- with rsync, NFS, or other means --- and do backups alternately?). Let's assume the system time is significant, and thus let's imagine what would happen if the date of archive construction were stored in each archive.

Then, when a "detruit" object is met in an archive, it can be given the date the archive was built and thus ordered within the series of dates when the corresponding file was found in other archives. So when the user asks for the restoration of a directory, a given file's state at that date is possible to know, and thus the restoration from the corresponding archive will do what we expect: either remove the file (if the selected backup contains a "detruit" object) or restore the file in the state it had.

Suppose now a dar_manager database built from a series of full backups. There will thus not be any "detruit" objects, but a file may be present or missing in a given archive. The solution is then that once an archive has been integrated into the database, the last step is to scan the whole database for files that have no date associated with this last archive: we can assume these files were not present, and record the date of the archive creation with the information that the file was removed at that time. Moreover, if the last archive adds a file which was not known in the archives already present in the database, we must consider this file was deleted in each of these previous archives, but then we must record the dates of creation of all these previous archives to be able to put this information properly in the database. But in that case we would not be able to make dar remove a file, as no "detruit" object exists (all archives are full backups), and dar_manager would have to remove the entry from the filesystem itself. Beside the fact that it is not dar_manager's role to directly interact with the filesystem, dar_manager would have to record an additional piece of information: whether a file is deleted because a "detruit" object was found in an archive, or because no entry was found for it in a given archive. This is necessary to know whether to rely on dar to remove the file or to make dar_manager do it itself; or maybe better, never rely on dar to remove a file and always let dar_manager do it itself.

Assuming we accept to make dar_manager able to remove entries from the filesystem without relying on dar, we must store the date of the archive creation in each archive, and store these dates for each archive in dar_manager databases. Then, instead of using the mtime of each file, we could do something much simpler in the database: for each file, record whether it was present or not in each archive used to build the database, and beside this, store only the archive creation date of each archive. This way, dar_manager would only have, for each file, to take the last state of the file (deleted or present) before the given date (or the last known state if no date is given) and either restore the file from the corresponding archive or remove it.

But if a user has removed a file by accident and only notices the mistake after several backups, it would become painful to restore this file, as the user would have to find manually at which date it was present to be able to feed dar_manager with the proper -w option. This is worse than looking for the last archive that has the file we look for.

Here we are back to the restoration of a file versus the restoration of a state. By state, I mean the state a directory tree had at a given time, like a photograph. In its original version, dar_manager was aimed at restoring files, whether or not they exist in the last archive added to a database: it only finds the last archive where the file is present. Making dar_manager restore a state, and thus considering files that have been removed at a given date, is no more no less than restoring from a given archive directly with dar. So all this discussion, about the fact that dar_manager is not able to handle files that have been removed, arrives at the conclusion that adding this feature to dar_manager would make it quite useless... sigh. But it was necessary.



Native Language Support / gettext / libintl


Native Language Support (NLS) is the ability of a given program to display its messages in different languages. For dar, this is implemented using the gettext tools. This tool must be installed on the system for dar to be able to display messages in another language than English.

Things are the following:
- On a system without gettext, dar will not use gettext at all. All messages will be in English (OK, maybe better to say Frenglish) ;-)
- On a system with gettext, dar will use the system's gettext, unless you use the --disable-nls option with the configure script.

If NLS is available you just have to set the LANG environment variable to your locale settings to change the language in which dar displays its messages (see ABOUT-NLS for more about the LANG variable).

Just for information, gettext() is the name of the call that translates strings in the program. This call is implemented in the library called 'libintl' (intl for internationalization). By translating strings, gettext makes Native Language Support (NLS) possible; in other words, it lets you have the messages of your preferred programs displayed in your native language, for those who do not have English as their mother tongue.

This was necessary to say, because you may miss the link between "gettext", "libintl" and "NLS".

READ the ABOUT-NLS file at the root of the source package to learn more about the way to display dar's messages in your own language. Note that not all languages are supported yet; it is up to you to send me a translation in your language and/or contact a translating team as explained in ABOUT-NLS.

To know which languages are supported by dar, read the po/LINGUAS file and check out for the presence of the corresponding *.po files in this directory.



Dar Release Process

Development Phase:
Dar receives new features during the development phase; at this stage, sources are modified and tested after each feature addition. The development sources are stored in a GIT repository at sourceforge, which you can access read-only.

Frozen API Phase:
No new feature that would change the API is added. The API shall be documented enough to let API users give their feedback about the design and its implementation. During this time, development continues with whatever is necessary as long as it does not change the API: documentation of the whole project, problem fixes in libdar, new features in the command-line part of the source, and so on.

Pre-release Phase:
Once the documentation and API are stable, comes the pre-release phase. This phase starts and ends with an email to the dar-news mailing-list. During this period, intensive testing is done on the pre-release sources; feedback and information about new pre-release packages are exchanged through the pre-release mailing-list, which lives only during the pre-release phases and is neither archived nor visible through a mail-to-news gateway. Of course, you are welcome to participate in the testing process and report to the pre-release mailing-list any problem you may meet with a given pre-release package.

Release Phase:
Some little time after the pre-release has ended, a first package is released (the last version number is zero) and made available at sourceforge for download. This phase also begins with an email to the dar-news mailing-list. During this phase, users may report bugs/problems about the released software; depending on the amount of bugs found and their importance, a new release will take place that only fixes these bugs (no feature is added): the last number of the version is incremented by one and a new mail is sent to dar-news with the list of problems fixed by the new release. The release phase ends when a new release phase begins; thus, during a release phase, a concurrent development phase takes place, then a frozen API phase, then a pre-release phase, but for a new major version (the first or the second number of the version changes).

Dar's Versions

Package release version

Dar packages are released during the release phase (see above). Each version is identified by three numbers separated by dots, for example version 2.3.0. The last number is incremented between releases that take place within the same release phase (just bugs have been fixed); the middle number is incremented at each pre-release phase. Last, the first number is incremented when a major change in the software structure takes place [version 2.0.0 saw the split of dar's code into one part related to the command-line and the rest put in a library called libdar, which can be accessed through a well defined API even by external software (like kdar for example). Version 2.0.0 also saw the apparition of the configure script and the use of the GNU tools autoconf, automake, libtool and gettext, and thus in particular the possibility to have internationalization].

Note that release versioning is completely different from what is done for the Linux kernel: for dar, all versioned packages are stable released software, and thus stability increases with the last number of the version.

Libdar version

Unfortunately, the release version does not give much information about the compatibility of different libdar versions, from the point of view of an external application, which has not been released with libdar and may face different libdar versions. So libdar has its own version. It is also a three-number version (for example, the current libdar version is 3.1.2), but each number has a different meaning: the last number increases with a new version that only fixes bugs; the middle number increases when new features have been added while staying compatible with the way older features were used; last, the first number changes when the API has been modified in a way that breaks ascendant compatibility for some features.

Other versions


Beside the libdar library, you can find five command-line applications: dar, dar_xform, dar_slave, dar_manager and dar_cp. All except dar have their own version, which is here too made of three numbers. Their meaning is the same as for the package release version: the last number increases upon bug fixes, the middle upon new features, the first upon major architecture changes.

Archive format version

When new features come, it is sometimes necessary to change the structure of the archive. To be able to know the format used in the archive, a field is present in each archive that defines this format. Each dar binary can thus read all archive formats; well, of course, a particular version cannot guess the format of archives that have been defined *after* that dar binary version was released. If you try to open a recent archive with an old dar binary, you will get a warning about the fact that dar is probably not able to read the archive, and dar will ask you whether you want to proceed anyway. Of course, you can try to read it, but this is at your own risk. In particular, depending on the feature used (see the Changelog to know which features required an archive format upgrade), you may succeed reading a recent archive with an old dar binary and get neither error nor warning, but this does not mean that dar did all that was necessary to restore the files properly. So it is advised to avoid using an archive with a version of dar that is too old to handle the archive format properly (and to rather reserve this possibility to cases of necessity).

Cross reference matrix

OK, you may now find that this is a bit complex, so a list of versions is given below. Just remember that there are two points of view: the command-line user and the external application developer.

Date                  release   archive  database  libdar   dar_xform  dar_slave  dar_manager  dar_cp  dar_split
                      (dar)     format   format    version
--------------------  --------  -------  --------  -------  ---------  ---------  -----------  ------  ---------
April 2nd, 2002       1.0.0     01       -----     -----    -----      -----      -----        -----   -----
April 24th, 2002      1.0.1     01       -----     -----    -----      -----      -----        -----   -----
May 8th, 2002         1.0.2     01       -----     -----    -----      -----      -----        -----   -----
May 27th, 2002        1.0.3     01       -----     -----    -----      -----      -----        -----   -----
June 26th, 2002       1.1.0     02       -----     -----    1.0.0      1.0.0      -----        -----   -----
Nov. 4th, 2002        1.2.0     03       01        -----    1.1.0      1.1.0      1.0.0        -----   -----
Jan. 10th, 2003       1.2.1     03       01        -----    1.1.0      1.1.0      1.0.0        -----   -----
May 19th, 2003        1.3.0     03       01        -----    1.1.0      1.1.0      1.1.0        -----   -----
Nov. 2nd, 2003        2.0.0     03       01        1.0.0    1.1.0      1.1.0      1.2.0        1.0.0   -----
Nov. 21st, 2003       2.0.1     03       01        1.0.1    1.1.0      1.1.0      1.2.0        1.0.0   -----
Dec. 7th, 2003        2.0.2     03       01        1.0.2    1.1.0      1.1.0      1.2.0        1.0.0   -----
Dec. 14th, 2003       2.0.3     03       01        1.0.2    1.1.0      1.1.0      1.2.1        1.0.0   -----
Jan. 3rd, 2004        2.0.4     03       01        1.0.2    1.1.0      1.1.0      1.2.1        1.0.0   -----
Feb. 8th, 2004        2.1.0     03       01        2.0.0    1.2.0      1.2.0      1.2.1        1.0.0   -----
March 5th, 2004       2.1.1     03       01        2.0.1    1.2.1      1.2.1      1.2.2        1.0.0   -----
March 12th, 2004      2.1.2     03       01        2.0.2    1.2.1      1.2.1      1.2.2        1.0.0   -----
May 6th, 2004         2.1.3     03       01        2.0.3    1.2.1      1.2.1      1.2.2        1.0.1   -----
July 13th, 2004       2.1.4     03       01        2.0.4    1.2.1      1.2.1      1.2.2        1.0.1   -----
Sept. 12th, 2004      2.1.5     03       01        2.0.5    1.2.1      1.2.1      1.2.2        1.0.1   -----
Jan. 29th, 2005       2.1.6     03       01        2.0.5    1.2.1      1.2.1      1.2.2        1.0.1   -----
Jan. 30th, 2005       2.2.0     04       01        3.0.0    1.3.0      1.3.0      1.3.0        1.0.1   -----
Feb. 20th, 2005       2.2.1     04       01        3.0.1    1.3.1      1.3.1      1.3.1        1.0.1   -----
May 12th, 2005        2.2.2     04       01        3.0.2    1.3.1      1.3.1      1.3.1        1.0.2   -----
Sept. 13th, 2005      2.2.3     04       01        3.1.0    1.3.1      1.3.1      1.3.1        1.0.2   -----
Nov. 5th, 2005        2.2.4     04       01        3.1.1    1.3.1      1.3.1      1.3.1        1.0.2   -----
Dec. 6th, 2005        2.2.5     04       01        3.1.2    1.3.1      1.3.1      1.3.1        1.0.2   -----
Jan. 19th, 2006       2.2.6     04       01        3.1.3    1.3.1      1.3.1      1.3.1        1.0.3   -----
Feb. 24th, 2006       2.2.7     04       01        3.1.4    1.3.1      1.3.1      1.3.1        1.0.3   -----
Feb. 24th, 2006       2.3.0     05       01        4.0.0    1.4.0      1.3.2      1.4.0        1.1.0   -----
June 26th, 2006       2.3.1     05       01        4.0.1    1.4.0      1.3.2      1.4.0        1.1.0   -----
Oct. 30th, 2006       2.3.2     05       01        4.0.2    1.4.0      1.3.2      1.4.0        1.1.0   -----
Feb. 24th, 2007       2.3.3     05       01        4.1.0    1.4.0      1.3.2      1.4.1        1.2.0   -----
June 30th, 2007       2.3.4     06       01        4.3.0    1.4.0      1.3.2      1.4.1        1.2.0   -----
Aug. 28th, 2007       2.3.5     06       01        4.4.0    1.4.1      1.3.3      1.4.2        1.2.1   -----
Sept. 29th, 2007      2.3.6     06       01        4.4.1    1.4.1      1.3.3      1.4.2        1.2.1   -----
Feb. 10th, 2008       2.3.7     06       01        4.4.2    1.4.2      1.3.4      1.4.3        1.2.2   -----
June 20th, 2008       2.3.8     07       01        4.4.3    1.4.2      1.3.4      1.4.3        1.2.2   -----
May 22nd, 2009        2.3.9     07       01        4.4.4    1.4.2      1.3.4      1.4.3        1.2.2   -----
April 9th, 2010       2.3.10    07       01        4.4.5    1.4.2      1.3.4      1.4.3        1.2.2   -----
March 13th, 2011      2.3.11    07       01        4.5.0    1.4.3      1.3.4      1.4.3        1.2.2   -----
February 25th, 2012   2.3.12    07       01        4.5.1    1.4.3      1.3.4      1.4.3        1.2.2   -----
June 2nd, 2011        2.4.0     08       02        5.0.0    1.5.0      1.4.0      1.5.0        1.2.3   -----
July 21st, 2011       2.4.1     08       02        5.1.0    1.5.0      1.4.0      1.6.0        1.2.3   -----
Sept. 5th, 2011       2.4.2     08       02        5.1.1    1.5.0      1.4.0      1.6.0        1.2.3   -----
February 25th, 2012   2.4.3     08       03        5.2.0    1.5.0      1.4.0      1.7.0        1.2.3   -----
March 17th, 2012      2.4.4     08       03        5.2.1    1.5.0      1.4.0      1.7.1        1.2.3   -----
April 15th, 2012      2.4.5     08       03        5.2.2    1.5.1      1.4.1      1.7.2        1.2.4   -----
June 24th, 2012       2.4.6     08       03        5.2.3    1.5.2      1.4.2      1.7.3        1.2.5   -----
July 5th, 2012        2.4.7     08       03        5.2.4    1.5.2      1.4.3      1.7.3        1.2.5   -----
September 9th, 2012   2.4.8     08       03        5.3.0    1.5.3      1.4.4      1.7.4        1.2.6   -----
January 6th, 2013     2.4.9     08       03        5.3.1    1.5.3      1.4.4      1.7.4        1.2.7   -----
March 9th, 2013       2.4.10    08       03        5.3.2    1.5.3      1.4.4      1.7.4        1.2.7   -----
Aug. 26th, 2013       2.4.11    08       03        5.4.0    1.5.4      1.4.5      1.7.5        1.2.8   -----
January 19th, 2014    2.4.12    08       03        5.5.0    1.5.4      1.4.5      1.7.6        1.2.8   -----
April 21st, 2014      2.4.13    08       03        5.6.0    1.5.5      1.4.5      1.7.7        1.2.8   -----
June 15th, 2014       2.4.14    08       03        5.6.1    1.5.5      1.4.5      1.7.7        1.2.8   -----
September 6th, 2014   2.4.15    08       03        5.6.2    1.5.6      1.4.6      1.7.8        1.2.8   -----
January 18th, 2015    2.4.16    08       03        5.6.3    1.5.6      1.4.6      1.7.8        1.2.8   -----
January 31st, 2015    2.4.17    08       03        5.6.4    1.5.6      1.4.6      1.7.8        1.2.8   -----
August 30th, 2015     2.4.18    08.1     03        5.6.5    1.5.6      1.4.6      1.7.8        1.2.8   -----
October 4th, 2015     2.4.19    08.1     03        5.6.6    1.5.6      1.4.6      1.7.8        1.2.8   -----
November 21st, 2015   2.4.20    08.1     03        5.6.7    1.5.8      1.4.8      1.7.10       1.2.10  -----
April 24th, 2016      2.4.21    08.1     03        5.6.8    1.5.9      1.4.9      1.7.11       1.2.10  -----
June 5th, 2016        2.4.22    08.1     03        5.6.9    1.5.9      1.4.9      1.7.11       1.2.10  -----
October 29th, 2016    2.4.23    08.1     03        5.6.9    1.5.9      1.4.9      1.7.11       1.2.10  -----
January 21st, 2017    2.4.24    08.1     03        5.6.10   1.5.9      1.4.9      1.7.11       1.2.10  -----
October 4th, 2015     2.5.0     09       04        5.7.0    1.5.7      1.4.7      1.7.9        1.2.9   1.0.0
October 17th, 2015    2.5.1     09       04        5.7.1    1.5.8      1.4.8      1.7.10       1.2.10  1.0.0
November 21st, 2015   2.5.2     09       04        5.7.2    1.5.8      1.4.8      1.7.10       1.2.10  1.0.0
January 4th, 2016     2.5.3     09       04        5.7.3    1.5.8      1.4.8      1.7.10       1.2.10  1.0.0
April 24th, 2016      2.5.4     09       04        5.8.0    1.5.9      1.4.9      1.7.11       1.2.10  1.0.0
June 5th, 2016        2.5.5     09       04        5.8.1    1.5.9      1.4.9      1.7.11       1.2.10  1.0.0
September 10th, 2016  2.5.6     09       04        5.8.2    1.5.9      1.4.9      1.7.11       1.2.10  1.0.0
October 29th, 2016    2.5.7     09       04        5.8.3    1.5.9      1.4.9      1.7.11       1.2.10  1.0.0
January 2nd, 2017     2.5.8     09       04        5.8.4    1.5.9      1.4.9      1.7.11       1.2.10  1.0.0
January 21st, 2017    2.5.9     09       04        5.9.0    1.5.9      1.4.9      1.7.11       1.2.10  1.0.0
April 4th, 2017       2.5.10    09       04        5.10.0   1.5.9      1.4.9      1.7.11       1.2.10  1.0.0
June 23rd, 2017       2.5.11    09       04        5.11.0   1.5.9      1.4.9      1.7.12       1.2.10  1.0.0



How symmetric encryption is performed in dar/libdar

Symmetric encryption algorithms are those that use the same key/password/passphrase to cipher data and to uncipher it, like for example blowfish, AES, serpent, camellia, twofish and so on.

The user provides:
  • a key in the form of a passphrase or password
  • a cipher algorithm
  • a block size (which defaults to 10 kio)
Data to encrypt/decrypt is sliced into blocks of the given size, and each block is provided to the encryption engine beside its number within the archive. To be able to encrypt 10 kio or more at once, dar uses the CBC mode (Cipher Block Chaining). This mode requires an Initial Vector (IV) to be set for each block. The value of this IV may be known and predictable by an attacker; that's not a problem. Using a different IV for each block avoids that, for a given key, two different blocks containing identical clear data (totally identical or just starting with the same sequence) produce the same encrypted result (totally or just starting with the same sequence).

To prevent a clear text attack (which is when the attacker has an idea of the clear data), each archive starts and ends with an elastic buffer, which is pseudo-randomly generated data of variable size; both elastic buffers are encrypted with the rest of the real data. This way the predictable archive structure is shifted to some random place inside the archive, which avoids the possibility of a clear text attack (or at least makes it much more difficult to achieve).

Thanks to this encryption per block, dar is able to uncipher data at a random place without having to uncipher the whole archive: only the block containing the needed data is unciphered. This is necessary to cope with data corruption as well as to allow fast restoration of a subset of the archive.

Here follows a diagram of the way key, block number, cipher algorithm and Initial Vector (IV) are defined:

           +------------------- [ algo in CBC mode ] -----------------------------> main key handle
algorithm -+                           ^                                                  |
           +---> max key len           |                                                  |
                     |                 |                                                  |
                     |                 |                                                  |
                     v                 |                                                  |
password ------> [ pkcs5 ] --> hashed_password ------------+                              |
                                                           |                              |
                                                           |                              |
                                                           v                              |
                                                    [ SHA1/SHA256 ]                       |
                                                           |                              |
                                                           |                              |
                                                           v                              |
                                                      essiv_password                      |
                                                           |                              |
                                                           |                              |
                                                           v                              |
                                           [ Blowfish/AES256 in ECB mode ]                |
                                                           |                              |
                                                           |                              |
                                                           v                              |
                                                    essiv key handle                      |
                                                           |                              |     Initialization
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . .|. . . . . . . . . . . . . . . | . . . . . . . . . .
                                                           v                              |
block_number_in_archive ----------------------------> [ encrypt ] ------> IV -----+       |
                                                                                  |       |
                                                                                  |       |
                                                                                  v       v
data ------------------------------------------------------------------------> [ encrypt/decrypt ] -----> data
sliced by block
of given size
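As a reading aid for the lower part of the diagram, here is a hedged sketch of the per-block IV computation using libgcrypt (error handling omitted, host byte order assumed for the block number; this is an illustration of the scheme, not libdar's actual code):

  #include <gcrypt.h>
  #include <cstring>
  #include <cstdint>

  // essiv_hd: cipher handle opened in ECB mode and keyed with the
  // essiv key derived as pictured above from the hashed password
  void make_iv(gcry_cipher_hd_t essiv_hd,
               uint64_t block_num,          // block number in the archive
               unsigned char iv[16])        // IV to use for that block (CBC mode)
  {
      unsigned char plain[16];
      std::memset(plain, 0, sizeof plain);
      std::memcpy(plain, &block_num, sizeof block_num); // block number as plaintext
      gcry_cipher_encrypt(essiv_hd, iv, 16, plain, 16); // IV = E_essiv(block number)
  }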






How asymmetric encryption is performed in dar/libdar

Dar does not encrypt the whole archive with a recipient's public key, but rather randomly chooses a password for symmetric encryption (as seen above), encrypts that password with the recipients' public keys (eventually signing it with your own private key) and drops a copy of this ciphered/signed data into the archive. At reading time, dar reads the archive header to find the encrypted password, decrypts it using the user's private key, then uses that password, now in clear, to decrypt the rest of the archive with the adhoc symmetric algorithm.

Why not use asymmetric encryption from end to end?

First, for a given fixed amount of data, the resulting ciphered data size may vary; thus, sticking ciphered blocks together in the archive would not allow easily knowing where a block starts and where its end is located. Second, doing it that way allows an archive to have several different recipients: the password is ciphered for each of them and the archive is readable by any specified recipient, while they do not share any key. Doing it that way has very little impact on archive size.

But then, for a multi-recipient archive, any recipient has access to the signed and encrypted key, and could reuse this same encryption key (which may also be signed by the original sender) to build a new archive with totally different content, then send it to another recipient of the original archive, faking this way the original sender's signature.

For that reason, a checksum of the internal catalogue is also signed during the archive creation process. Any modification of the archive content will lead the CRC stored in the catalogue to mismatch the modified data; if the CRC is also updated inside the catalogue, the CRC of the catalogue will fail; and last, if the CRC of the catalogue is also updated, the signature will no longer match. There is still room to modify data in a way that the CRC does not change, but this limits a lot the possibilities of change, as there is also a constraint on the archive length.