Introduction
One of the primary tasks of a computer system’s administrator is keeping the system’s data secure. One way this is done is by performing timely backups of the system’s files. Even if you’re not a system administrator, it is often useful to make copies of things and move large collections of files from place to place and from device to device. This chapter will introduce the following commands:
- gzip: Compress files
- bzip2: A block sorting file compressor
- tar: Tape archiving utility
- zip: Package and compress files
- rsync: Remote file and directory synchronization
Compressing Files
Throughout the history of computing, there has been a struggle to get the most data into the smallest available space, whether that space be memory, storage devices, or network bandwidth. Many of the data services that we take for granted today, such as mobile phone service, high-definition television, or broadband Internet, owe their existence to effective data compression techniques.
Data compression is the process of removing redundancy from data. Let’s consider an imaginary example. Suppose we had an entirely black picture file with the dimensions of 100 pixels by 100 pixels. In terms of data storage (assuming 24 bits, or 3 bytes per pixel), the image will occupy 30,000 bytes of storage:
100 × 100 × 3 = 30,000
An image that is all one color contains entirely redundant data. If we were clever, we could encode the data in such a way that we simply describe the fact that we have a block of 10,000 black pixels. So, instead of storing a block of data containing 30,000 zeros (black is usually represented in image files as zero), we could compress the data into the number 10,000, followed by a zero to represent our data. Such a data compression scheme is called run-length encoding and is one of the most rudimentary compression techniques. Today’s techniques are much more advanced and complex, but the basic goal remains the same—get rid of redundant data. Compression algorithms (the mathematical techniques used to carry out the compression) fall into two general categories.
- Lossless. Lossless compression preserves all the data contained in the original. This means that when a file is restored from a compressed version, the restored file is exactly the same as the original, uncompressed version.
- Lossy. Lossy compression, on the other hand, removes data as the compression is performed to allow more compression to be applied. When a lossy file is restored, it does not match the original version; rather, it is a close approximation. Examples of lossy compression are JPEG (for images) and MP3 (for music).
In our discussion, we will look exclusively at lossless compression since most data on computers cannot tolerate any data loss.
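The run-length idea described above can be tried from the shell. The following is only a rough sketch: it builds a stand-in for the all-black image as 30,000 zero bytes (the file name black.raw is made up for this illustration) and uses uniq -c, which collapses each run of identical lines into a count and a value. It counts bytes rather than pixels, so the single run has length 30,000 instead of 10,000.

```shell
# Stand-in for the 100 x 100 black image: 30,000 bytes, all zero.
# black.raw is a hypothetical name used only for this sketch.
workdir=$(mktemp -d)
head -c 30000 /dev/zero > "$workdir/black.raw"

# od prints the bytes as decimal numbers, tr puts one number on each line,
# grep drops the empty first line, and uniq -c replaces every run of
# identical lines with "count value". awk only tidies the spacing.
runs=$(od -An -v -tu1 "$workdir/black.raw" \
    | tr -s ' ' '\n' \
    | grep -v '^$' \
    | uniq -c \
    | awk '{print $1, $2}')

echo "$runs"    # prints: 30000 0
```

The whole file is described by one count and one value, which is the essence of run-length encoding. Real compressors such as gzip use far more elaborate schemes.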
gzip
The gzip program is used to compress one or more files. When executed, it replaces the original file with a compressed version of the original. The corresponding gunzip program is used to restore compressed files to their original, uncompressed form.
[user@linux ~]$ ls -l /etc > document.txt
[user@linux ~]$ ls -l document.txt
-rw-r--r-- 1 user user 15738 2017-05-09 12:00 document.txt
[user@linux ~]$ gzip document.txt
[user@linux ~]$ ls -l document.txt.gz
-rw-r--r-- 1 user user 3230 2017-05-09 12:00 document.txt.gz
[user@linux ~]$ gunzip document.txt
[user@linux ~]$ ls -l document.txt
-rw-r--r-- 1 user user 15738 2017-05-09 12:00 document.txt
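Because gzip is lossless, the restored file should be byte-for-byte identical to the original. A minimal way to check this, assuming a scratch directory and a reference copy named original.txt (both invented for this sketch), might look like this:

```shell
workdir=$(mktemp -d)
ls -l /etc > "$workdir/document.txt"
cp "$workdir/document.txt" "$workdir/original.txt"   # reference copy to compare against

gzip "$workdir/document.txt"         # document.txt is replaced by document.txt.gz
gzip -t "$workdir/document.txt.gz"   # -t tests the integrity of the compressed file
gunzip "$workdir/document.txt.gz"    # document.txt.gz is replaced by document.txt

# cmp -s compares the two files silently and reports through its exit status.
if cmp -s "$workdir/original.txt" "$workdir/document.txt"; then
    roundtrip=identical
else
    roundtrip=different
fi
echo "$roundtrip"
```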
bzip2
The bzip2 program, by Julian Seward, is similar to gzip but uses a different compression algorithm that achieves higher levels of compression at the cost of compression speed. In most regards, it works in the same fashion as gzip.
[user@linux ~]$ ls -l /etc > document.txt
[user@linux ~]$ ls -l document.txt
-rw-r--r-- 1 user user 15738 2017-05-09 12:00 document.txt
[user@linux ~]$ bzip2 document.txt
[user@linux ~]$ ls -l document.txt.bz2
-rw-r--r-- 1 user user 2792 2017-05-09 12:00 document.txt.bz2
[user@linux ~]$ bunzip2 document.txt.bz2
[user@linux ~]$ ls -l document.txt
-rw-r--r-- 1 user user 15738 2017-05-09 12:00 document.txt
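To compare the two programs on the same input without losing the original, both accept the -c option, which writes the compressed data to standard output and leaves the input file in place. The sketch below assumes both programs are installed and uses a scratch directory; the sizes will differ from system to system, so none are shown here.

```shell
workdir=$(mktemp -d)
ls -l /etc > "$workdir/document.txt"

# -c sends the compressed data to standard output, so document.txt is kept.
gzip  -c "$workdir/document.txt" > "$workdir/document.txt.gz"
bzip2 -c "$workdir/document.txt" > "$workdir/document.txt.bz2"

# Record the three sizes in bytes for comparison.
orig_size=$(wc -c < "$workdir/document.txt")
gz_size=$(wc -c < "$workdir/document.txt.gz")
bz2_size=$(wc -c < "$workdir/document.txt.bz2")

echo "original: $orig_size bytes"
echo "gzip:     $gz_size bytes"
echo "bzip2:    $bz2_size bytes"
```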
Archiving Files
A common file-management task often used in conjunction with compression is archiving. Archiving is the process of gathering up many files and bundling them together into a single large file. Archiving is often done as part of system backups. It is also used when old data is moved from a system to some type of long-term storage.
tar
In the Unix-like world of software, the tar program is the classic tool for archiving files. Its name, short for tape archive, reveals its roots as a tool for making backup tapes. While it is still used for that traditional task, it is equally adept on other storage devices. We often see filenames that end with the extension .tar or .tgz, which indicate a “plain” tar archive and a gzipped archive, respectively. A tar archive can consist of a group of separate files, one or more directory hierarchies, or a mixture of both.
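As a brief illustration of the common tar operations, the sketch below creates a plain archive, lists it, creates a gzipped archive, and extracts it elsewhere. The directory and file names (project, readme.txt, restore, and so on) are invented for this example.

```shell
workdir=$(mktemp -d)
mkdir -p "$workdir/project/docs"
echo "hello" > "$workdir/project/readme.txt"
echo "notes" > "$workdir/project/docs/notes.txt"

# c = create an archive, f = name of the archive file.
# -C changes into $workdir first, so member names start with "project/".
tar -cf "$workdir/project.tar" -C "$workdir" project

# t = list the contents of the archive without extracting anything.
listing=$(tar -tf "$workdir/project.tar")
echo "$listing"

# z = filter the archive through gzip, producing the .tgz form.
tar -czf "$workdir/project.tgz" -C "$workdir" project

# x = extract; here the files are unpacked into a separate directory.
mkdir "$workdir/restore"
tar -xzf "$workdir/project.tgz" -C "$workdir/restore"
```

After extraction, restore/project holds a copy of the original hierarchy.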
zip
The zip program is both a compression tool and an archiver. The file format used by the program is familiar to Windows users, as it reads and writes .zip files. In Linux, however, gzip is the predominant compression program, with bzip2 being a close second.
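A short sketch of typical zip usage follows, assuming the zip and unzip programs are installed (they are separate packages on many distributions). The names project, readme.txt, and restore are invented for this example.

```shell
workdir=$(mktemp -d)
mkdir -p "$workdir/project"
echo "hello" > "$workdir/project/readme.txt"

# -r recurses into directories; -q suppresses the per-file messages.
# Running zip from inside $workdir keeps the stored names relative.
(cd "$workdir" && zip -rq project.zip project)

# unzip -l lists the archive contents without extracting.
unzip -l "$workdir/project.zip"

# -d names the directory to extract into.
unzip -q "$workdir/project.zip" -d "$workdir/restore"
```

Unlike gzip, zip leaves the original files in place and stores many files in a single archive.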
Synchronizing Files And Directories
A common strategy for maintaining a backup copy of a system involves keeping one or more directories synchronized with another directory (or directories) located on either the local system (usually a removable storage device of some kind) or a remote system. We might, for example, have a local copy of a website under development and synchronize it from time to time with the “live” copy on a remote web server.
In the Unix-like world, the preferred tool for this task is rsync. This program can synchronize both local and remote directories by using the rsync remote-update protocol, which allows rsync to quickly detect the differences between two directories and perform the minimum amount of copying required to bring them into sync. This makes rsync very fast and economical to use, compared to other kinds of copy programs.
rsync is invoked like this:
rsync options source destination
where source and destination are each one of the following:
- A local file or directory.
- A remote file or directory in the form of [user@]host:path.
- A remote rsync server specified with a URI of rsync://[user@]host[:port]/path.
Note that either the source or the destination must be a local file. Remote-to-remote copying is not supported.
Let’s try rsync on some local files.
[user@linux ~]$ mkdir -p test/source_dir test/destination_dir && cd test
Next, we’ll synchronize the source_dir directory with a corresponding copy in destination_dir.
[user@linux ~]$ rsync -av source_dir destination_dir
If we want to copy only the contents of the source_dir directory, we can append a trailing / or /* to the source directory name. Note that /* is expanded by the shell and, by default, does not match hidden files.
[user@linux ~]$ rsync -av source_dir/ destination_dir
We’ve included both the -a option (for archiving; it causes recursion and preservation of file attributes) and the -v option (verbose output) to make a mirror of the source_dir directory within destination_dir. While the command runs, we will see a list of the files and directories being copied. At the end, we will see a summary message like this indicating the amount of copying performed:
sent 135759 bytes received 57870 bytes 387258.00 bytes/sec
total size is 3230 speedup is 0.02
If we run the command again, we will see a different result.
[user@linux ~]$ rsync -av source_dir destination_dir
building file list ... done
sent 22635 bytes received 20 bytes 45310.00 bytes/sec
total size is 3230 speedup is 0.14
Notice that there was no listing of files. This is because rsync detected that there were no differences, and therefore it didn’t need to copy anything.
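This behavior can be observed in a scratch directory. The sketch below assumes rsync is installed and uses the -i (--itemize-changes) option, which prints one line for each item rsync updates; the file names are invented for this example.

```shell
workdir=$(mktemp -d)
mkdir -p "$workdir/source_dir" "$workdir/destination_dir"
echo "first version" > "$workdir/source_dir/file1.txt"
echo "another file"  > "$workdir/source_dir/file2.txt"

# First run: everything is new, so every file is listed.
first_run=$(rsync -ai "$workdir/source_dir/" "$workdir/destination_dir")

# Second run: nothing has changed, so nothing is listed.
second_run=$(rsync -ai "$workdir/source_dir/" "$workdir/destination_dir")

# Change one file, then synchronize again: only that file is listed.
echo "an added line" >> "$workdir/source_dir/file1.txt"
third_run=$(rsync -ai "$workdir/source_dir/" "$workdir/destination_dir")

printf 'first run:\n%s\n\n'  "$first_run"
printf 'second run:\n%s\n\n' "$second_run"
printf 'third run:\n%s\n'    "$third_run"
```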
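If we later modify a file in source_dir and run the command again, rsync detects the change and copies only the updated file.
There is a subtle but useful feature we can use when we specify an rsync source. Let’s consider two directories.
[user@linux ~]$ ls
source destination
If we name the source without a trailing slash (source), rsync copies the directory itself, producing destination/source. With a trailing slash (source/), only the contents of source are copied into destination.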
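The effect of the trailing slash on an rsync source can be checked with a small experiment. This sketch assumes rsync is installed; the destination names with_dir and contents_only are invented to keep the two results apart.

```shell
workdir=$(mktemp -d)
mkdir -p "$workdir/source" "$workdir/with_dir" "$workdir/contents_only"
echo "data" > "$workdir/source/file1"

# No trailing slash: the directory itself is copied,
# giving with_dir/source/file1.
rsync -a "$workdir/source" "$workdir/with_dir"

# Trailing slash: only the contents are copied,
# giving contents_only/file1.
rsync -a "$workdir/source/" "$workdir/contents_only"

# Show where file1 ended up in each case.
find "$workdir/with_dir" "$workdir/contents_only" -type f
```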
Using rsync Over A Network
One of the real beauties of rsync is that it can be used to copy files over a network. After all, the r in rsync stands for “remote.” Remote copying can be done in one of two ways. The first way is with another system that has rsync installed, along with a remote shell program such as ssh. Let’s say we had another system on our local network with a lot of available hard drive space and we wanted to perform our backup operation using the remote system instead of an external drive. Assuming that it already had a directory named /backup where we could deliver our files, we could do this:
[user@linux ~]$ sudo rsync -av --delete --rsh=ssh /etc /home /usr/local remote-computer:/backup
Two parts of this command facilitate the network copy. First, the --rsh=ssh option instructs rsync to use the ssh program as its remote shell. In this way, we are able to use an ssh-encrypted tunnel to securely transfer the data from the local system to the remote host. Second, we specify the remote host by prefixing its name (in this case the remote host is named remote-computer) to the destination pathname. The --delete option removes files from the destination that no longer exist in the source, so that the backup remains an exact mirror.
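The --delete option used in the command above removes files from the destination that no longer exist in the source. Its effect can be tried safely on local directories first. The sketch below assumes rsync is installed and uses invented names (backup, keep.txt, old.txt); the -n (--dry-run) option previews the result without changing anything.

```shell
workdir=$(mktemp -d)
mkdir -p "$workdir/source_dir" "$workdir/backup"
echo "keep" > "$workdir/source_dir/keep.txt"
echo "old"  > "$workdir/source_dir/old.txt"

# Initial backup: both files are copied.
rsync -a "$workdir/source_dir/" "$workdir/backup"

# Remove a file from the source. A plain rsync leaves the stale copy behind.
rm "$workdir/source_dir/old.txt"
rsync -a "$workdir/source_dir/" "$workdir/backup"

# Preview what --delete would do (-n makes no changes), then do it.
rsync -avn --delete "$workdir/source_dir/" "$workdir/backup"
rsync -a --delete "$workdir/source_dir/" "$workdir/backup"

ls "$workdir/backup"
```

Because --delete removes data, previewing with -n before the real run is a sensible habit, especially when the destination is a remote system.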
The second way that rsync can be used to synchronize files over a network is by using an rsync server. rsync can be configured to run as a daemon and listen to incoming requests for synchronization. This is often done to allow mirroring of a remote system. For example, Red Hat Software maintains a large repository of software packages under development for its Fedora distribution. It is useful for software testers to mirror this collection during the testing phase of the distribution release cycle. Since files in the repository change frequently (often more than once a day), it is desirable to maintain a local mirror by periodic synchronization, rather than by bulk copying of the repository. One of these repositories is kept at Duke University; we could mirror it using our local copy of rsync and their rsync server like this:
[user@linux ~]$ mkdir fedora-devel
[user@linux ~]$ rsync -av --delete rsync://archive.linux.duke.edu/fedora/linux/development/rawhide/Everything/x86_64/os/ fedora-devel
In this example, we use the URI of the remote rsync server, which consists of a protocol (rsync://), followed by the remote hostname (archive.linux.duke.edu), followed by the pathname of the repository.
Summary
In this chapter we’ve looked at the common compression and archiving programs used on Linux and other Unix-like operating systems. For archiving files, the tar/gzip combination is the preferred method on Unix-like systems, while zip/unzip is used for interoperability with Windows systems. Finally, we looked at the rsync program (a personal favorite), which is very handy for efficient synchronization of files and directories across systems.