Introduction
All Unix-like operating systems rely heavily on text files for data storage. So it makes sense that there are many tools for manipulating text. In this chapter, we will look at programs that are used to “slice and dice” text. This chapter will introduce the following commands:
- cut - Remove sections from each line of files
- paste - Merge lines of files
- join - Join lines of two files on a common field
- comm - Compare two sorted files line by line
- diff - Compare files line by line
- patch - Apply a diff file to an original
- tr - Translate or delete characters
- sed - Stream editor for filtering and transforming text
- aspell - Interactive spellchecker
Slicing And Dicing
The next three programs we will discuss are used to peel columns of text out of files and recombine them in useful ways.
cut
The cut program is used to extract a section of text from a line and output the extracted section to standard output. It can accept multiple file arguments or input from standard input.
Here is a list of cut options:
- -c list - Extract the portion of the line defined by list. The list may consist of one or more comma-separated numerical ranges.
- -f list - Extract one or more fields from the line as defined by list. The list may contain one or more fields or field ranges separated by commas.
- -d delim - When -f is specified, use delim as the field delimiting character. By default, fields must be separated by a single tab character.
- --complement - Extract the entire line of text, except for those portions specified by -c and/or -f.
Let’s take a look at this distros.txt file.
[user@linux ~]$ cat -A distros.txt
SUSE^I10.2^I12/07/2006$
Fedora^I10^I11/25/2008$
SUSE^I11.0^I06/19/2008$
Ubuntu^I8.04^I04/24/2008$
Fedora^I8^I11/08/2007$
SUSE^I10.3^I10/04/2007$
Ubuntu^I6.10^I10/26/2006$
Fedora^I7^I05/31/2007$
Ubuntu^I7.10^I10/18/2007$
Ubuntu^I7.04^I04/19/2007$
SUSE^I10.1^I05/11/2006$
Fedora^I6^I10/24/2006$
Fedora^I9^I05/13/2008$
Ubuntu^I6.06^I06/01/2006$
Ubuntu^I8.10^I10/30/2008$
Fedora^I5^I03/20/2006$
There are no embedded spaces, just single tab characters between the fields. Because the file uses tabs rather than spaces, we’ll use the -f option to extract a field.
[user@linux ~]$ cut -f 3 distros.txt
12/07/2006
11/25/2008
06/19/2008
04/24/2008
11/08/2007
10/04/2007
10/26/2006
05/31/2007
10/18/2007
04/19/2007
05/11/2006
10/24/2006
05/13/2008
06/01/2006
10/30/2008
03/20/2006
Now let’s extract the year from each line.
[user@linux ~]$ cut -f 3 distros.txt | cut -c 7-10
2006
2008
2008
2008
2007
2007
2006
2007
2007
2007
2006
2006
2008
2006
2008
2006
When working with fields, it is possible to specify a field delimiter other than the tab character. Here we will extract the first field from the /etc/passwd file. Using the -d option, we are able to specify the colon character as the field delimiter.
[user@linux ~]$ cut -d ':' -f 1 /etc/passwd | head
root
daemon
bin
sys
sync
games
man
lp
mail
news
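The --complement option and comma-separated character ranges from the option list are not exercised above, so here is a minimal sketch of both. The three sample records are made up in the same tab-separated layout as distros.txt, and --complement is assumed to be available (it is a GNU extension to cut).

```shell
# Three sample records in the same tab-separated layout as distros.txt
sample=$(printf 'SUSE\t10.2\t12/07/2006\nFedora\t10\t11/25/2008\nUbuntu\t8.04\t04/24/2008\n')

# Keep everything except field 2 (the version number)
no_version=$(printf '%s\n' "$sample" | cut -f 2 --complement)
printf '%s\n' "$no_version"

# A list may hold several ranges: characters 1-2 (month) and 7-10 (year)
month_year=$(printf '12/07/2006\n' | cut -c 1-2,7-10)
printf '%s\n' "$month_year"
```

The first command prints each name followed by its date, still separated by a tab. The second prints 122006, because cut outputs the selected ranges back to back.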
paste
The paste command does the opposite of cut. Rather than extracting a column of text from a file, it adds one or more columns of text to a file.
First let’s produce a list of distros sorted by date and store the result in a file called distros-by-date.txt.
Next, we will use cut to extract the first two fields from the file (the distro name and version) and store that result in a file named distros-versions.txt.
The final piece of preparation is to extract the release dates and store them in a file named distros-dates.txt.
[user@linux ~]$ sort -k 3.7nbr -k 3.1nbr -k 3.4nbr distros.txt > distros-by-date.txt
[user@linux ~]$ cut -f 1,2 distros-by-date.txt > distros-versions.txt
[user@linux ~]$ head distros-versions.txt
Fedora 10
Ubuntu 8.10
SUSE 11.0
Fedora 9
Ubuntu 8.04
Fedora 8
Ubuntu 7.10
SUSE 10.3
Fedora 7
Ubuntu 7.04
[user@linux ~]$ cut -f 3 distros-by-date.txt > distros-dates.txt
[user@linux ~]$ head distros-dates.txt
11/25/2008
10/30/2008
06/19/2008
05/13/2008
04/24/2008
11/08/2007
10/18/2007
10/04/2007
05/31/2007
04/19/2007
We now have the parts we need. To complete the process, use paste to put the column of dates ahead of the distro names and versions, thus creating a chronological list.
[user@linux ~]$ paste distros-dates.txt distros-versions.txt
11/25/2008 Fedora 10
10/30/2008 Ubuntu 8.10
06/19/2008 SUSE 11.0
05/13/2008 Fedora 9
04/24/2008 Ubuntu 8.04
11/08/2007 Fedora 8
10/18/2007 Ubuntu 7.10
10/04/2007 SUSE 10.3
05/31/2007 Fedora 7
04/19/2007 Ubuntu 7.04
12/07/2006 SUSE 10.2
10/26/2006 Ubuntu 6.10
10/24/2006 Fedora 6
06/01/2006 Ubuntu 6.06
05/11/2006 SUSE 10.1
03/20/2006 Fedora 5
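The same split-and-recombine steps can be tried without touching the real files. This is a small sketch that works in a temporary directory on two made-up records; the file names by-date.txt, versions.txt, and dates.txt are placeholders chosen for the illustration.

```shell
# Work in a scratch directory so no real files are overwritten
tmp=$(mktemp -d)

# Two sample records: name, version, date (tab-separated)
printf 'Fedora\t10\t11/25/2008\nUbuntu\t8.10\t10/30/2008\n' > "$tmp/by-date.txt"

# Split the columns into two files
cut -f 1,2 "$tmp/by-date.txt" > "$tmp/versions.txt"
cut -f 3 "$tmp/by-date.txt" > "$tmp/dates.txt"

# Recombine them with the date column first; paste joins lines with a tab
pasted=$(paste "$tmp/dates.txt" "$tmp/versions.txt")
printf '%s\n' "$pasted"

rm -r "$tmp"
```

Because paste matches lines purely by position, both input files need to list their records in the same order.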
join
In some ways, join is like paste in that it adds columns to a file, but it uses a unique way to do it. A join is an operation usually associated with relational databases where data from multiple tables with a shared key field is combined to form a desired result. The join program performs the same operation. It joins data from multiple files based on a shared key field.
To demonstrate the join program, we’ll need to make a couple of files with a shared key. To do this, we will use our distros-by-date.txt file. From this file, we will construct two additional files. One contains the release dates (which will be our shared key for this demonstration) and the release names.
[user@linux ~]$ cut -f 1,1 distros-by-date.txt > distros-names.txt
[user@linux ~]$ paste distros-dates.txt distros-names.txt > distros-key-names.txt
[user@linux ~]$ head distros-key-names.txt
11/25/2008 Fedora
10/30/2008 Ubuntu
06/19/2008 SUSE
05/13/2008 Fedora
04/24/2008 Ubuntu
11/08/2007 Fedora
10/18/2007 Ubuntu
10/04/2007 SUSE
05/31/2007 Fedora
04/19/2007 Ubuntu
The second file contains the release dates and the version numbers, as shown here.
[user@linux ~]$ cut -f 2,2 distros-by-date.txt > distros-vernums.txt
[user@linux ~]$ paste distros-dates.txt distros-vernums.txt > distros-key-vernums.txt
[user@linux ~]$ head distros-key-vernums.txt
11/25/2008 10
10/30/2008 8.10
06/19/2008 11.0
05/13/2008 9
04/24/2008 8.04
11/08/2007 8
10/18/2007 7.10
10/04/2007 10.3
05/31/2007 7
04/19/2007 7.04
We now have two files with a shared key (the “release date” field). It is important to point out that the files must be sorted on the key field for join to work properly.
[user@linux ~]$ join distros-key-names.txt distros-key-vernums.txt | head
11/25/2008 Fedora 10
10/30/2008 Ubuntu 8.10
06/19/2008 SUSE 11.0
05/13/2008 Fedora 9
04/24/2008 Ubuntu 8.04
11/08/2007 Fedora 8
10/18/2007 Ubuntu 7.10
10/04/2007 SUSE 10.3
05/31/2007 Fedora 7
04/19/2007 Ubuntu 7.04
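To illustrate the sorting requirement, here is a sketch that starts from two unsorted files, sorts each on its key, and then joins them. The data is made up, and the dates are written in year-month-day form (an assumption for this example only) so that a plain text sort puts them in date order.

```shell
tmp=$(mktemp -d)

# Two unsorted files sharing a date key in the first field
printf '2008-11-25 Fedora\n2008-06-19 SUSE\n2008-10-30 Ubuntu\n' > "$tmp/names.txt"
printf '2008-10-30 8.10\n2008-11-25 10\n2008-06-19 11.0\n' > "$tmp/vernums.txt"

# Sort both files on the key field, using the same locale that join will use
LC_ALL=C sort -k 1,1 "$tmp/names.txt" > "$tmp/names.sorted"
LC_ALL=C sort -k 1,1 "$tmp/vernums.txt" > "$tmp/vernums.sorted"

# Join on the first field (the default key); output fields are space-separated
joined=$(LC_ALL=C join "$tmp/names.sorted" "$tmp/vernums.sorted")
printf '%s\n' "$joined"

rm -r "$tmp"
```

Setting LC_ALL=C for both sort and join keeps the two programs in agreement about what "sorted" means.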
Comparing Text
It is often useful to compare versions of text files. For system administrators and software developers, this is particularly important. A system administrator may, for example, need to compare an existing configuration file to a previous version to diagnose a system problem. Likewise, a programmer frequently needs to see what changes have been made to programs over time.
comm
The comm program compares two text files and displays the lines that are unique to each one and the lines they have in common. To demonstrate, we will create two nearly identical text files using cat.
comm produces three columns of output. The first column contains lines unique to the first file argument, the second column contains the lines unique to the second file argument, and the third column contains the lines shared by both files. comm supports options in the form -n, where n is either 1, 2, or 3. When used, these options specify which columns to suppress. For example, if we wanted to output only the lines shared by both files, we would suppress the output of the first and second columns.
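The three-column output and the suppression options can be seen in a small sketch. The two files and their single-letter contents are made up for the illustration, and they are written with printf here rather than typed in with cat so the example runs unattended.

```shell
tmp=$(mktemp -d)

# Two nearly identical, already sorted files
printf 'a\nb\nc\nd\n' > "$tmp/file1.txt"
printf 'b\nc\nd\ne\n' > "$tmp/file2.txt"

# Full three-column output: unique to file1, unique to file2, shared
all_columns=$(comm "$tmp/file1.txt" "$tmp/file2.txt")
printf '%s\n' "$all_columns"

# Suppress columns 1 and 2, leaving only the shared lines
shared=$(comm -12 "$tmp/file1.txt" "$tmp/file2.txt")
printf '%s\n' "$shared"

rm -r "$tmp"
```

In the full output, each column is indented by one more tab than the one before it, so the shared lines appear after two leading tabs.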
diff
Like the comm program, diff is used to detect the differences between files. However, diff is a much more complex tool, supporting many output formats and the ability to process large collections of text files at once. diff is often used by software developers to examine changes between different versions of program source code and thus has the ability to recursively examine directories of source code, often referred to as source trees. One common use for diff is the creation of diff files or patches that are used by programs such as patch (which we’ll discuss shortly) to convert one version of a file (or files) to another version.
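As a rough illustration of two of those output formats, the sketch below compares a pair of made-up files in the default format and in the unified format (-u). Note that diff exits with status 1 when the files differ, which is not an error; the sketch records that status instead of treating it as a failure.

```shell
tmp=$(mktemp -d)

# Two nearly identical sample files
printf 'a\nb\nc\nd\n' > "$tmp/file1.txt"
printf 'b\nc\nd\ne\n' > "$tmp/file2.txt"

# Default format: change commands such as 1d0 (delete) and 4a4 (add)
diff_status=0
normal=$(diff "$tmp/file1.txt" "$tmp/file2.txt") || diff_status=$?
printf '%s\n' "$normal"

# Unified format: removed lines start with "-", added lines with "+"
unified=$(diff -u "$tmp/file1.txt" "$tmp/file2.txt") || true
printf '%s\n' "$unified"

rm -r "$tmp"
```

The unified output begins with two header lines naming the files and their timestamps, so its exact text varies from run to run.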
patch
The patch program is used to apply changes to text files. It accepts output from diff and is generally used to convert older-version files into newer versions. Let’s consider a famous example. The Linux kernel is developed by a large, loosely organized team of contributors who submit a constant stream of small changes to the source code. The Linux kernel consists of several million lines of code, while the changes that are made by one contributor at one time are quite small. It makes no sense for a contributor to send each developer an entire kernel source tree each time a small change is made. Instead, a diff file is submitted. The diff file contains the change from the previous version of the kernel to the new version with the contributor’s changes. The receiver then uses the patch program to apply the change to their own source tree. Using diff/patch offers two significant advantages.
- The diff file is small, compared to the full size of the source tree.
- The diff file concisely shows the change being made, allowing reviewers of the patch to quickly evaluate it.
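The workflow described above can be sketched on a single file. This assumes the patch program is installed; the file names old.txt, new.txt, and changes.diff are placeholders made up for the example.

```shell
tmp=$(mktemp -d)

# An "old" version and a "new" version of the same file
printf 'a\nb\nc\nd\n' > "$tmp/old.txt"
printf 'b\nc\nd\ne\n' > "$tmp/new.txt"

# Create the diff file; exit status 1 only means the files differ
diff -u "$tmp/old.txt" "$tmp/new.txt" > "$tmp/changes.diff" || [ $? -eq 1 ]

# Apply the diff file to the old version (-s suppresses progress messages)
patch -s "$tmp/old.txt" < "$tmp/changes.diff"

# The old file should now have the new version's contents
patched=$(cat "$tmp/old.txt")
printf '%s\n' "$patched"

rm -r "$tmp"
```

Only changes.diff would need to be sent to someone who already has old.txt, which is the point of the first advantage listed above.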
Editing On The Fly
Our experience with text editors has been largely interactive, meaning that we manually move a cursor around and then type our changes. However, there are non-interactive ways to edit text as well. It’s possible, for example, to apply a set of changes to multiple files with a single command.
tr
The tr program is used to transliterate characters. We can think of this as a sort of character-based search-and-replace operation. Transliteration is the process of changing characters from one alphabet to another. For example, converting characters from lowercase to uppercase is transliteration.
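A minimal sketch of both uses named in the chapter overview, translating and deleting, might look like this. tr reads only from standard input, so the sample text is piped in; the input strings are made up.

```shell
# Translate every lowercase letter to its uppercase counterpart
upper=$(printf 'lowercase letters\n' | tr '[:lower:]' '[:upper:]')
printf '%s\n' "$upper"

# Delete characters instead of translating them: strip carriage returns
no_cr=$(printf 'line one\r\n' | tr -d '\r')
printf '%s\n' "$no_cr"
```

The first command prints LOWERCASE LETTERS. The second is a common way to convert a DOS-style line ending to a Unix one.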
sed
The name sed is short for stream editor. It performs text editing on a stream of text, either a set of specified files or standard input. sed is a powerful and somewhat complex program (there are entire books about it), so we will not cover it completely here.
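Two of the simplest sed commands give a feel for how it edits a stream. This is only a sketch on made-up input: s performs a substitution, and p combined with the -n option prints selected lines.

```shell
# Substitute the first match of "front" on each line with "back"
replaced=$(printf 'front\n' | sed 's/front/back/')
printf '%s\n' "$replaced"

# -n turns off automatic printing; 1,2p prints only lines 1 through 2
first_two=$(printf 'one\ntwo\nthree\n' | sed -n '1,2p')
printf '%s\n' "$first_two"
```

In both cases the input is left untouched and the edited text goes to standard output.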
aspell
aspell is an interactive spelling checker. The aspell program is the successor to an earlier program named ispell and can be used, for the most part, as a drop-in replacement. While the aspell program is mostly used by other programs that require spellchecking capability, it can also be used effectively as a stand-alone tool from the command line. It has the ability to intelligently check various types of text files, including HTML documents, C or C++ programs, email messages, and other kinds of specialized texts.
Summary
In this chapter, we looked at a few of the many command line tools that operate on text.