Learn Linux 18: Archiving And Backup

Introduction

One of the primary tasks of a computer system’s administrator is keeping the system’s data secure. One way this is done is by performing timely backups of the system’s files. Even if you’re not a system administrator, it is often useful to make copies of things and move large collections of files from place to place and from device to device. This chapter will introduce the following commands:

  • gzip - Compress files
  • bzip2 - A block sorting file compressor
  • tar - Tape archiving utility
  • zip - Package and compress files
  • rsync - Remote file and directory synchronization

Compressing Files

Throughout the history of computing, there has been a struggle to get the most data into the smallest available space, whether that space be memory, storage devices, or network bandwidth. Many of the data services that we take for granted today, such as mobile phone service, high-definition television, or broadband Internet, owe their existence to effective data compression techniques.

Data compression is the process of removing redundancy from data. Let’s consider an imaginary example. Suppose we had an entirely black picture file with the dimensions of 100 pixels by 100 pixels. In terms of data storage (assuming 24 bits, or 3 bytes per pixel), the image will occupy 30,000 bytes of storage (100 × 100 × 3 = 30,000).

An image that is all one color contains entirely redundant data. If we were clever, we could encode the data in such a way that we simply describe the fact that we have a block of 10,000 black pixels. So, instead of storing a block of data containing 30,000 zeros (black is usually represented in image files as zero), we could compress the data into the number 10,000, followed by a zero to represent our data. Such a data compression scheme is called run-length encoding and is one of the most rudimentary compression techniques. Today’s techniques are much more advanced and complex, but the basic goal remains the same—get rid of redundant data. Compression algorithms (the mathematical techniques used to carry out the compression) fall into two general categories.

  • Lossless. Lossless compression preserves all the data contained in the original. This means that when a file is restored from a compressed version, the restored file is exactly the same as the original, uncompressed version.
  • Lossy. Lossy compression, on the other hand, removes data as the compression is performed to allow more compression to be applied. When a lossy file is restored, it does not match the original version; rather, it is a close approximation. Examples of lossy compression are JPEG (for images) and MP3 (for music).

In our discussion, we will look exclusively at lossless compression since most data on computers cannot tolerate any data loss.
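The all-black image from the example above can be approximated with a file of 30,000 zero bytes. The following is a minimal sketch, assuming gzip and the common head -c option are available; the file name black.raw is hypothetical. Note that gzip uses the DEFLATE algorithm rather than plain run-length encoding, so this illustrates the general idea of removing redundancy, not the exact scheme described earlier.

```shell
# Simulate the 100 x 100 all-black image: 30,000 zero bytes.
workdir=$(mktemp -d)
head -c 30000 /dev/zero > "$workdir/black.raw"

# Compress to a separate file, keeping the original for comparison.
gzip -c "$workdir/black.raw" > "$workdir/black.raw.gz"

original_size=$(wc -c < "$workdir/black.raw")
compressed_size=$(wc -c < "$workdir/black.raw.gz")
echo "original: $original_size bytes, compressed: $compressed_size bytes"

# Lossless: decompressing yields a byte-for-byte identical copy.
gunzip -c "$workdir/black.raw.gz" | cmp - "$workdir/black.raw" \
    && echo "round trip is identical"
```

Because the input is entirely redundant, the compressed file should be a tiny fraction of the original size, and the cmp check confirms that nothing was lost.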

gzip

The gzip program is used to compress one or more files. When executed, it replaces the original file with a compressed version of the original. The corresponding gunzip program is used to restore compressed files to their original, uncompressed form.

[user@linux ~]$ ls -l /etc > document.txt
[user@linux ~]$ ls -l document.txt
-rw-r--r-- 1 user user 15738 2017-05-09 12:00 document.txt
[user@linux ~]$ gzip document.txt
[user@linux ~]$ ls -l document.txt.gz
-rw-r--r-- 1 user user 3230 2017-05-09 12:00 document.txt.gz
[user@linux ~]$ gunzip document.txt
[user@linux ~]$ ls -l document.txt
-rw-r--r-- 1 user user 15738 2017-05-09 12:00 document.txt
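A few other gzip options are often useful alongside the basic usage shown above. This is a small sketch run in a temporary directory (the workdir variable is just for illustration); it relies only on the standard -c, -t, -v, and -1 through -9 options.

```shell
workdir=$(mktemp -d)
ls -l /etc > "$workdir/document.txt"

# -c writes to standard output, so the original file is left in place.
gzip -c "$workdir/document.txt" > "$workdir/document.txt.gz"

# -t tests the integrity of the compressed file; -v reports the result.
gzip -tv "$workdir/document.txt.gz"

# -1 favors speed, -9 favors size; the default level is -6.
gzip -1 -c "$workdir/document.txt" | wc -c
gzip -9 -c "$workdir/document.txt" | wc -c

# gunzip -c views the contents without restoring the file on disk.
gunzip -c "$workdir/document.txt.gz" | head -n 3
```

Using -c is a simple way to keep the original file when that is preferable to gzip's default of replacing it.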

bzip2

The bzip2 program, by Julian Seward, is similar to gzip but uses a different compression algorithm that achieves higher levels of compression at the cost of compression speed. In most regards, it works in the same fashion as gzip.

[user@linux ~]$ ls -l /etc > document.txt
[user@linux ~]$ ls -l document.txt
-rw-r--r-- 1 user user 15738 2017-05-09 12:00 document.txt
[user@linux ~]$ bzip2 document.txt
[user@linux ~]$ ls -l document.txt.bz2
-rw-r--r-- 1 user user 2792 2017-05-09 12:00 document.txt.bz2
[user@linux ~]$ bunzip2 document.txt.bz2
[user@linux ~]$ ls -l document.txt
-rw-r--r-- 1 user user 15738 2017-05-09 12:00 document.txt
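To see how the two compressors compare on the same input, something like the following sketch can be used. It assumes bzip2 is installed (it is skipped otherwise) and works in a temporary directory; the actual sizes depend on the contents of /etc on a given system, so none are shown here.

```shell
if command -v bzip2 >/dev/null 2>&1; then
    workdir=$(mktemp -d)
    ls -l /etc > "$workdir/document.txt"

    # -c writes to standard output, leaving document.txt untouched.
    gzip  -c "$workdir/document.txt" > "$workdir/document.txt.gz"
    bzip2 -c "$workdir/document.txt" > "$workdir/document.txt.bz2"

    # Compare the sizes of the original and both compressed files, in bytes.
    wc -c "$workdir"/document.txt*

    # -c sends the restored data to standard output for comparison.
    bunzip2 -c "$workdir/document.txt.bz2" | cmp - "$workdir/document.txt" \
        && echo "round trip is identical"
else
    echo "bzip2 is not installed; skipping"
fi
```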

Archiving Files

A common file-management task often used in conjunction with compression is archiving. Archiving is the process of gathering up many files and bundling them together into a single large file. Archiving is often done as part of system backups. It is also used when old data is moved from a system to some type of long-term storage.

tar

In the Unix-like world of software, the tar program is the classic tool for archiving files. Its name, short for tape archive, reveals its roots as a tool for making backup tapes. While it is still used for that traditional task, it is equally adept on other storage devices. We often see filenames that end with the extension .tar or .tgz, which indicate a “plain” tar archive and a gzipped archive, respectively. A tar archive can consist of a group of separate files, one or more directory hierarchies, or a mixture of both.
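A typical tar session creates an archive, lists its contents, and extracts it somewhere else. The sketch below assumes a small hypothetical directory tree named playground built in a temporary directory; the options used (c, t, x, f, z, and -C) are standard in both GNU and BSD tar.

```shell
workdir=$(mktemp -d)
mkdir -p "$workdir/playground/dir-001"
echo "hello" > "$workdir/playground/dir-001/file-A"
echo "world" > "$workdir/playground/file-B"

# c = create, f = name of the archive file.
# -C changes directory first, so the stored paths are relative.
tar -cf "$workdir/playground.tar" -C "$workdir" playground

# t = list the contents of the archive.
tar -tf "$workdir/playground.tar"

# z = filter the archive through gzip, producing a .tgz file.
tar -czf "$workdir/playground.tgz" -C "$workdir" playground

# x = extract, here into a separate restore directory.
mkdir "$workdir/restore"
tar -xzf "$workdir/playground.tgz" -C "$workdir/restore"
diff -r "$workdir/playground" "$workdir/restore/playground" \
    && echo "restored copy matches"
```

Storing relative paths, as -C does here, makes it easy to unpack the archive into any directory later.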

zip

The zip program is both a compression tool and an archiver. The file format used by the program is familiar to Windows users, as it reads and writes .zip files. In Linux, however, gzip is the predominant compression program, with bzip2 being a close second.
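Basic use of zip and its companion unzip might look like the following sketch. It assumes both programs are installed (they are not always present by default, so the block skips itself otherwise) and uses a hypothetical playground directory in a temporary location.

```shell
if command -v zip >/dev/null 2>&1 && command -v unzip >/dev/null 2>&1; then
    workdir=$(mktemp -d)
    mkdir -p "$workdir/playground/dir-001"
    echo "hello" > "$workdir/playground/dir-001/file-A"

    # -r recurses into directories; without it only the directory entry is stored.
    # -q suppresses the per-file progress messages.
    (cd "$workdir" && zip -r -q playground.zip playground)

    # -l lists the archive contents without extracting anything.
    unzip -l "$workdir/playground.zip"

    # -d names the directory to extract into.
    unzip -q "$workdir/playground.zip" -d "$workdir/restore"
    cat "$workdir/restore/playground/dir-001/file-A"
else
    echo "zip and unzip are not both installed; skipping"
fi
```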

Synchronizing Files And Directories

A common strategy for maintaining a backup copy of a system involves keeping one or more directories synchronized with another directory (or directories) located on either the local system (usually a removable storage device of some kind) or a remote system. We might, for example, have a local copy of a website under development and synchronize it from time to time with the “live” copy on a remote web server.

In the Unix-like world, the preferred tool for this task is rsync. This program can synchronize both local and remote directories by using the rsync remote-update protocol, which allows rsync to quickly detect the differences between two directories and perform the minimum amount of copying required to bring them into sync. This makes rsync very fast and economical to use, compared to other kinds of copy programs.

rsync is invoked like this:

rsync options source destination

Where source and destination are one of the following:

  • A local file or directory.
  • A remote file or directory in the form of [user@]host:path.
  • A remote rsync server specified with a URI of rsync://[user@]host[:port]/path.

Note that either the source or the destination must be a local file. Remote-to-remote copying is not supported.

Let’s try rsync on some local files.

[user@linux ~]$ mkdir -p test/source_dir test/destination_dir && cd test

Next, we’ll synchronize the source_dir directory with a corresponding copy in destination_dir.

[user@linux test]$ rsync -av source_dir destination_dir

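If we want to copy only the contents of the source_dir directory, we can append a trailing / or /* to the source directory name. Note that /* is expanded by the shell and skips hidden files, so the trailing / is usually the safer choice.

[user@linux test]$ rsync -av source_dir/ destination_dir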

We’ve included both the -a option (for archiving—causes recursion and preservation of file attributes) and the -v option (verbose output) to make a mirror of the source_dir directory within destination_dir. While the command runs, we will see a list of the files and directories being copied. At the end, we will see a summary message like this indicating the amount of copying performed:

sent 135759 bytes  received 57870 bytes  387258.00 bytes/sec
total size is 3230  speedup is 0.02

If we run the command again, we will see a different result.

[user@linux test]$ rsync -av source_dir destination_dir
building file list ... done

sent 22635 bytes  received 20 bytes  45310.00 bytes/sec
total size is 3230  speedup is 0.14

Notice that there was no listing of files. This is because rsync detected that there were no differences, and therefore it didn’t need to copy anything.

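If we later change a file in source_dir and run the command again, rsync detects the change and copies only the updated file.

There is a subtle but useful feature we can use when we specify an rsync source: the trailing slash shown earlier. Let’s consider two directories.

[user@linux ~]$ ls
source destination

Running rsync source destination copies the directory source itself into destination, so the files end up under destination/source. Running rsync source/ destination copies only the contents of source into destination.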
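Both behaviors, the trailing slash and the copying of only changed files, can be checked with a small experiment. This is a sketch under a few assumptions: rsync is installed (the block skips itself otherwise), and the directory and file names are hypothetical ones created in a temporary directory. The --out-format='%n' option prints the name of each item rsync updates.

```shell
if command -v rsync >/dev/null 2>&1; then
    workdir=$(mktemp -d)
    mkdir -p "$workdir/source" "$workdir/destination" "$workdir/contents_only"
    echo "one" > "$workdir/source/file1"
    echo "two" > "$workdir/source/file2"

    # No trailing slash: the directory itself is created inside the destination.
    rsync -a "$workdir/source" "$workdir/destination"
    ls "$workdir/destination"

    # Trailing slash: only the contents of source are copied.
    rsync -a "$workdir/source/" "$workdir/contents_only"
    ls "$workdir/contents_only"

    # Change one file, then run again and record what rsync updates.
    echo "one, updated" > "$workdir/source/file1"
    changed=$(rsync -a --out-format='%n' "$workdir/source/" "$workdir/contents_only")
    echo "updated on second run: $changed"
else
    echo "rsync is not installed; skipping"
fi
```

On the second run only the modified file should be reported, since the unchanged file has the same size and modification time on both sides.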

Using rsync Over A Network

One of the real beauties of rsync is that it can be used to copy files over a network. After all, the r in rsync stands for “remote.” Remote copying can be done in one of two ways. The first way is with another system that has rsync installed, along with a remote shell program such as ssh. Let’s say we had another system on our local network with a lot of available hard drive space and we wanted to perform our backup operation using the remote system instead of an external drive. Assuming that it already had a directory named /backup where we could deliver our files, we could do this:

[user@linux ~]$ sudo rsync -av --delete --rsh=ssh /etc /home /usr/local remote-computer:/backup

Two parts of this command facilitate the network copy. First, the --rsh=ssh option (short form: -e ssh) instructs rsync to use the ssh program as its remote shell. In this way, we are able to use an ssh-encrypted tunnel to securely transfer the data from the local system to the remote host. Second, we specify the remote host by prefixing its name (in this case the remote host is named remote-computer), followed by a colon, to the destination pathname.

The second way that rsync can be used to synchronize files over a network is by using an rsync server. rsync can be configured to run as a daemon and listen to incoming requests for synchronization. This is often done to allow mirroring of a remote system. For example, Red Hat Software maintains a large repository of software packages under development for its Fedora distribution. It is useful for software testers to mirror this collection during the testing phase of the distribution release cycle. Since files in the repository change frequently (often more than once a day), it is desirable to maintain a local mirror by periodic synchronization, rather than by bulk copying of the repository. One of these repositories is kept at Duke University; we could mirror it using our local copy of rsync and their rsync server like this:

[user@linux ~]$ mkdir fedora-devel
[user@linux ~]$ rsync -av --delete rsync://archive.linux.duke.edu/fedora/linux/development/rawhide/Everything/x86_64/os/ fedora-devel

In this example, we use the URI of the remote rsync server, which consists of a protocol (rsync://), followed by the remote hostname (archive.linux.duke.edu), followed by the pathname of the repository.

Summary

In this chapter we’ve looked at the common compression and archiving programs used on Linux and other Unix-like operating systems. For archiving files, the tar/gzip combination is the preferred method on Unix-like systems, while zip/unzip is used for interoperability with Windows systems. Finally, we looked at the rsync program (a personal favorite), which is very handy for efficient synchronization of files and directories across systems.