DAR's Documentation

DAR's Limitations

Here follows a description of the known limitations you should consult before creating a bug report for dar:

Fixed Limits

System variable limits

Memory

Dar uses virtual memory (= RAM + swap) to be able to add the list of files saved at the end of each archive. Dar uses its own integer type (called "infinint") which has no size limit (unlike 32 bits or 64 bits integers). This already makes dar able to manage Zettabyte volumes and above, even if today's systems cannot yet manage such file sizes. Nevertheless, this comes with a memory and CPU overhead, in addition to the C++ overhead of the data structures. Altogether, dar needs an average of 650 bytes of virtual memory per saved file with dar-2.1.0 and around 850 bytes with dar-2.4.x (that's the price to pay for the new features). Thus, for example, if you have 110,000 files to save, whatever the total amount of data to save, dar will require around 67 MB of virtual memory.

Now, when doing catalogue extraction or differential backup, dar holds two catalogues in memory, so twice that amount of memory is needed (134 MB in the example). Why? Because for a differential backup, dar starts with the catalogue of the archive of reference, which is needed to know which files to save and which not to save, and on the other hand builds the catalogue of the new archive as the operation progresses. As for catalogue extraction, the process is equivalent to making a differential backup right after a full backup.
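
As a purely illustrative back-of-the-envelope check (the figures are the averages quoted above, real usage depends on path lengths and on the features used), the estimate can be reproduced this way:

    #include <iostream>

    int main()
    {
        // average quoted above: ~650 bytes of virtual memory per saved file (dar-2.1.0)
        const unsigned long long bytes_per_entry = 650;
        const unsigned long long files_to_save = 110000;

        // full backup: one catalogue held in memory
        const unsigned long long full_backup = bytes_per_entry * files_to_save;
        // differential backup or catalogue extraction: two catalogues held in memory
        const unsigned long long diff_backup = 2 * full_backup;

        std::cout << "full backup:         ~" << full_backup / (1024 * 1024) << " MiB\n";
        std::cout << "differential backup: ~" << diff_backup / (1024 * 1024) << " MiB\n";
        return 0;
    }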

As you can guess, merging two archives into a third one requires even more memory (memory to store the first archive to merge, the second archive to merge and the resulting archive being produced).

This memory issue is not a limit by itself, but you need enough virtual memory to be able to save your data (if necessary you can still add swap space, as a partition or as a plain file).

Integers

To overcome the previously explained memory issue, dar can be built in another mode. In this other mode, "infinint" is replaced by 32 bits or 64 bits integers, as selected by the --enable-mode=32 or --enable-mode=64 option given to the configure script. The executables built this way (dar, dar_xform, dar_slave and dar_manager) run faster and use much less memory than the "full" versions using "infinint". But yes, there are drawbacks: slice size, file size, dates, number of files to backup, total archive size (sum of all slices), etc., are bounded by the maximum value of the integer used, which is 4,294,967,296 for 32 bits and 18,446,744,073,709,551,616 for 64 bits integers. In other words, the 32 bits version cannot handle dates after year 2106 nor file sizes over 4 GB, while the 64 bits version cannot handle dates beyond roughly 500 billion years (much longer than the estimated age of the Universe: 15 billion years) nor files larger than around 18 EB (18 exabytes).
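
To see where these date limits come from, here is a minimal sketch (not libdar code): dates being counted in seconds since Jan 1st, 1970, the reachable time span is simply the maximum integer value divided by the number of seconds in a year:

    #include <cstdint>
    #include <iostream>

    int main()
    {
        // dates are stored as seconds elapsed since Jan 1st, 1970
        const double seconds_per_year = 365.2425 * 24 * 3600;   // average Gregorian year

        // 32 bits: about 136 years after 1970, hence around year 2106
        std::cout << "32 bits: 1970 + " << UINT32_MAX / seconds_per_year << " years\n";
        // 64 bits: hundreds of billions of years
        std::cout << "64 bits: 1970 + " << UINT64_MAX / seconds_per_year << " years\n";
        return 0;
    }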

Since version 2.5.4, another parameter depends on the maximum supported integer: the number of entries in a given archive. In other words, you will not be able to have more than 4 giga files (4 billion files) in a single archive when using libdar32, or 18 exa files with libdar64, while there is no such limitation with libdar based on infinint.

What is the behavior when such a limit is reached? For compatibility with the rest of the code, limited-length integers (32 or 64 bits for now) are not used as-is: they are wrapped in a C++ class which reports any overflow occurring in arithmetic operations. Archives generated by the different versions of dar stay compatible with each other, but the 32 bits and 64 bits versions will not be able to read or produce all possible archives. In that case, the dar suite programs will abort with an error message asking you to use the "full" version of dar.
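
The following is only a hedged sketch of that mechanism (the class name, the exception type and the 32 bits width are arbitrary choices for illustration, this is not libdar's actual implementation): a thin wrapper around a fixed-width integer whose arithmetic operators report overflow instead of silently wrapping around:

    #include <cstdint>
    #include <limits>
    #include <stdexcept>

    // hypothetical illustration of a bounded integer that reports overflow
    class bounded_int
    {
    public:
        explicit bounded_int(std::uint32_t v = 0) : val(v) {}

        bounded_int operator+(const bounded_int & other) const
        {
            if (val > std::numeric_limits<std::uint32_t>::max() - other.val)
                throw std::overflow_error("limit reached, please use the \"full\" (infinint) version of dar");
            return bounded_int(val + other.val);
        }

        std::uint32_t value() const { return val; }

    private:
        std::uint32_t val;
    };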

Command line

On several systems, command-line long options are not available, because dar relies on GNU getopt. Systems like FreeBSD do not provide GNU getopt by default; the getopt function of their standard library supports neither long options nor optional arguments. On such systems you will have to use short options only, and to overcome the lack of optional arguments you need to give the argument explicitly: for example, use "-z 9" in place of "-z", and so on (see the "EXPLICIT OPTIONAL ARGUMENTS" section of dar's man page). All of dar's features remain available with FreeBSD's getopt, provided you use short options and explicit arguments.

Alternatively, you can install GNU getopt as a separate library called libgnugetopt. If the <getopt.h> include file is also available, the configure script will detect it and use this library; this way you can have long options on FreeBSD, for example.
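
To illustrate what relying on GNU getopt means (a simplified, hypothetical option table, not dar's actual one): long options and optional arguments come with the getopt_long() interface and its optional_argument flag, which the historical BSD getopt() does not provide:

    #include <getopt.h>   // GNU getopt_long(), not part of every libc
    #include <cstdio>
    #include <cstdlib>

    int main(int argc, char *argv[])
    {
        // hypothetical table: "-z" / "--gzip" with an optional compression level
        static const struct option long_opts[] = {
            {"gzip", optional_argument, nullptr, 'z'},
            {nullptr, 0, nullptr, 0}
        };

        int c;
        while ((c = getopt_long(argc, argv, "z::", long_opts, nullptr)) != -1)
            if (c == 'z')
                std::printf("compression level: %d\n", optarg != nullptr ? std::atoi(optarg) : 9);

        return 0;
    }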

Another point concerns the command-line length limitation. All systems (correct me if I am wrong) limit the size of the command line. If you want to pass more options to dar than your system can afford, you can use the -B option and put all of dar's arguments (or just some of them) in the file given as argument to -B. The -B option can be used several times on the command line and is recursive (a file read through -B can itself contain -B options), as illustrated below.
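
For example (a purely hypothetical illustration, file names and option values are arbitrary), part of the arguments can be moved into a file:

    dar -c my_backup -B common_options.txt

where common_options.txt could contain, with one or several options per line:

    -z 9
    -B more_options.txt

The second line shows the recursion: more_options.txt would itself be read as if its content had been typed on the command line.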

Dates

Unix files have up to four dates:

- atime: the date of last access to the file's data
- mtime: the date of last modification of the file's data
- ctime: the date of last change of the file's metadata (inode)
- birthtime: the creation date of the file, which is not available on all systems

In dar, these dates are stored as integers (the number of seconds elapsed since Jan 1st, 1970), as Unix systems do. Since release 2.5.0, dar can also save, store and restore the microsecond part of these dates on systems that support it. As seen above, the limitation is not due to dar but to the integer type used, so if you use infinint, you should be able to store any date as far in the future as you want. Of course dar cannot store dates before Jan 1st, 1970, but that should not be a big problem as there should be no surviving files older than that epoch ;-)

There is no standard way under Unix to set the ctime of a file, so dar is not able to restore the ctime date of files.

Symlinks

On systems that provide the lutimes() system call, the dates of a symlink can be restored. On systems that do not provide it, if you modify the mtime of an existing symlink, you end up modifying the mtime of the file targeted by that symlink, while the mtime of the symlink itself stays untouched! For that reason, without lutimes() support, dar avoids restoring the mtime of symlinks.
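
In terms of system calls this can be sketched as follows (a minimal illustration: the HAVE_LUTIMES macro, the function name and the path are hypothetical, and error handling is omitted):

    #include <sys/time.h>

    // restore the access and modification times of a symlink (hypothetical path)
    int restore_symlink_dates(const struct timeval tv[2])
    {
    #ifdef HAVE_LUTIMES
        // lutimes() acts on the symlink itself
        return lutimes("/some/symlink", tv);
    #else
        // utimes("/some/symlink", tv) would follow the link and change the dates
        // of the *target*, so dar rather skips restoring the symlink's mtime
        return -1;
    #endif
    }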

SFTP protocol

dar/libdar cannot read a .ssh/known_hosts file containing "ecdsa-sha2-nistp256" lines, even if the host you are connecting to is known by an "ssh-rsa" key: as soon as one host in the known_hosts file is not "ssh-rsa", dar/libdar will fail the host validation and abort. This restriction comes from libssh, which seems to be used by libcurl, on which libdar relies for network related features. A workaround based on the DAR_SFTP_KNOWNHOSTS_FILE environment variable is described in the man page.

As of January 2021, libssh 1.9.0 seems to have fixed this issue, so the problem should disappear once libcurl has integrated this new feature.

Multi-threading performance and memory requirement

Don't expect the execution time to be exactly divided by the number of threads you use, compared with the equivalent single-threaded execution. First, there is an overhead in having several threads working in the same process (protocols, synchronization on shared memory). Second, the nature of the work does not always lend itself to multi-threaded processing.

With release 2.7.0, the two most CPU intensive tasks involved in building and using a dar backup (compression and encryption) have been enhanced to support multiple threads. There remain some lighter tasks (CRC calculation, escape sequences/tape marks, sparse file detection/reconstruction) that are difficult to parallelize; as they require much less CPU than the first two tasks, the multi-threading investment (processing overhead, but also software development) is not worth it for them.

Third, the ciphering/deciphering process may be interrupted when seeking in the archive for a particular file, which slightly reduces the performance of multi-threading. For compression, even when multi-threading is used, compression stays per file (for robustness) and is thus reset for each new file. The gain is very interesting for large files, and you had better not select too small a compression block size: several tens of kilobytes or even a megabyte is preferable. Large compression blocks (which are dispatched among the different threads) are interesting because small files will be handled by a single thread with little overhead, while large files will leverage several threads. Last, your disk access may become the bottleneck that drives the total execution time of the operation; in that condition, adding more threads will not bring any improvement.

The drawback of the multi-threading used within libdar is the memory requirement. Each worker thread works with a block in memory (in fact two: one for the compressed/ciphered data, the other for its decompressed/deciphered counterpart). In addition to the worker threads you specify, two helper threads are needed: one to dispatch the data blocks to the worker threads, the other to gather the resulting blocks. These two are not CPU intensive but may hold memory blocks awaiting treatment while the worker threads also hold theirs.
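
As a purely illustrative order of magnitude (figures chosen arbitrarily): with 4 worker threads and 1 MiB compression blocks, the workers alone hold roughly 4 × 2 × 1 MiB = 8 MiB of block buffers, to which must be added the blocks temporarily held by the two helper threads.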