Learn Linux 19: Regular Expressions

Introduction

Text data plays an important role on all Unix-like systems, such as Linux, and many command line tools exist to manipulate it. But before we can fully appreciate all the features those tools offer, we first have to examine a technology frequently associated with their most sophisticated uses: regular expressions.

What Are Regular Expressions?

Simply put, regular expressions are symbolic notations used to identify patterns in text. In some ways, they resemble the shell’s wildcard method of matching file and pathnames but on a much grander scale. Regular expressions are supported by many command line tools and by most programming languages to facilitate the solution of text manipulation problems. To complicate matters, not all regular expressions are the same; they vary slightly from tool to tool and from one programming language to the next. For our discussion, we will limit ourselves to regular expressions as described in the POSIX standard (which covers most of the command line tools), as opposed to those of many programming languages (most notably Perl), which use slightly larger and richer sets of notations.

grep

The main program we will use to work with regular expressions is our old pal grep. The name grep is actually derived from the phrase “global regular expression print,” so we can see that grep has something to do with regular expressions. In essence, grep searches text files for text matching a specified regular expression and outputs any line containing a match to standard output.
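As a minimal sketch of this behavior, the following pipes a few hypothetical sample lines into grep on standard input instead of reading a real file. The variable names are illustrative only.

```shell
# Hypothetical sample data: three lines of text.
sample='apple pie
banana split
cherry pie'

# Print every line that contains a match for the regular expression "pie".
basic_matches=$(printf '%s\n' "$sample" | grep 'pie')
printf '%s\n' "$basic_matches"
```

Only the first and third lines contain "pie", so only those two lines reach standard output.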

Here is a list of commonly used grep options:

  • -i - Ignore case. Do not distinguish between uppercase and lowercase characters.
  • -v - Invert match. Normally, grep prints lines that contain a match. This option causes grep to print every line that does not contain a match.
  • -c - Print the number of matching lines (or non-matching lines if the -v option is also specified) instead of the lines themselves.
  • -l - Print the name of each file that contains a match instead of the lines themselves.
  • -L - Like the -l option, but print only the names of files that do not contain matches.
  • -n - Prefix each matching line with the number of the line within the file.
  • -h - For multifile searches, suppress the output of filenames.
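Some of the options above can be tried out with a short sketch like this one. The sample data and variable names are assumptions made for illustration; the options themselves are the ones listed above.

```shell
# Hypothetical sample data with mixed case.
fruit='Apple
banana
apple tart
Cherry'

# -i: ignore case, so "Apple" and "apple tart" both match.
opt_i=$(printf '%s\n' "$fruit" | grep -i 'apple')

# -v: invert the match, keeping lines without lowercase "apple".
opt_v=$(printf '%s\n' "$fruit" | grep -v 'apple')

# -c (combined with -i): print the count of matching lines, not the lines.
opt_c=$(printf '%s\n' "$fruit" | grep -ic 'apple')

# -n: prefix each matching line with its line number.
opt_n=$(printf '%s\n' "$fruit" | grep -n 'tart')

printf '%s\n' "$opt_i" "$opt_v" "$opt_c" "$opt_n"
```

Note that -c counts lines, so a line containing the pattern twice still counts once.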

Metacharacters (Escaped With \)

  • ^ [ . $ { * ( \ + ) | ? < >

Anchors

  • ^ - Start of string, or start of line in multi-line pattern
  • \A - Start of string
  • $ - End of string, or end of line in multi-line pattern
  • \Z - End of string
  • \b - Word boundary
  • \B - Not word boundary
  • \< - Start of word
  • \> - End of word
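The two anchors that POSIX tools such as grep reliably support, ^ and $, might be exercised as follows. This is only a sketch over a hypothetical word list; anchors such as \A and \Z belong to Perl-style engines and are left out here.

```shell
# Hypothetical word list, one word per line.
words='cat
concat
catalog
bobcat'

# ^ anchors the match to the start of the line.
starts=$(printf '%s\n' "$words" | grep '^cat')

# $ anchors the match to the end of the line.
ends=$(printf '%s\n' "$words" | grep 'cat$')

# Using both anchors matches only lines that are exactly "cat".
whole=$(printf '%s\n' "$words" | grep '^cat$')

printf '%s\n' "$starts" "$ends" "$whole"
```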

Character Classes

  • \c - Control character
  • \s - White space
  • \S - Not white space
  • \d - Digit
  • \D - Not digit
  • \w - Word
  • \W - Not word
  • \x - Hexadecimal digit
  • \O - Octal digit

POSIX

  • [:upper:] - Upper case letters
  • [:lower:] - Lower case letters
  • [:alpha:] - All letters
  • [:alnum:] - Digits and letters
  • [:digit:] - Digits
  • [:xdigit:] - Hexadecimal digits
  • [:punct:] - Punctuation
  • [:blank:] - Space and tab
  • [:space:] - Whitespace characters (space, tab, newline, and similar)
  • [:cntrl:] - Control characters
  • [:graph:] - Visible characters (printable, excluding space)
  • [:print:] - Printable characters, including space
  • [:word:] - Digits, letters and underscore (an extension, not part of POSIX)
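In grep, a POSIX character class is written inside a bracket expression, which is why the brackets appear doubled, as in [[:digit:]]. A small sketch, assuming hypothetical sample lines:

```shell
# Hypothetical sample data.
lines='hello
HELLO
12345
room 101'

# Lines containing at least one uppercase letter.
has_upper=$(printf '%s\n' "$lines" | grep '[[:upper:]]')

# Lines containing at least one digit.
has_digit=$(printf '%s\n' "$lines" | grep '[[:digit:]]')

# Lines containing at least one whitespace character.
has_space=$(printf '%s\n' "$lines" | grep '[[:space:]]')

printf '%s\n' "$has_upper" "$has_digit" "$has_space"
```

Writing [:digit:] with single brackets would instead be read as a set of the literal characters ":", "d", "i", "g", and "t".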

Quantifiers

  • * - 0 or more
  • + - 1 or more
  • ? - 0 or 1
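The three quantifiers can be compared side by side with a sketch like the one below. It assumes grep's -E option (extended regular expressions), because in basic regular expressions + and ? are ordinary characters. The sample lines are hypothetical.

```shell
# Hypothetical sample data: "a", some number of "b", then "c" (plus one decoy).
seq='ac
abc
abbc
adc'

# b* : zero or more "b" between "a" and "c".
star=$(printf '%s\n' "$seq" | grep -E 'ab*c')

# b+ : one or more "b".
plus=$(printf '%s\n' "$seq" | grep -E 'ab+c')

# b? : zero or one "b".
opt=$(printf '%s\n' "$seq" | grep -E 'ab?c')

printf '%s\n' "$star" "$plus" "$opt"
```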

Groups And Ranges

  • . - Any character except new line (\n)
  • (a|b) - a or b
  • (...) - Group
  • (?:...) - Passive (non-capturing) group
  • [abc] - Set (any one of a, b, or c)
  • [^abc] - Not (a or b or c)
  • [a-q] - Lower case letter from a to q
  • [A-Q] - Upper case letter from A to Q
  • [0-7] - Digit from 0 to 7
  • \x - Group/subpattern number “x”
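Several of these notations can be tried with grep, as in the sketch below. The sample data is hypothetical. Alternation is shown with -E; the back-reference uses the basic syntax \( \) and \1, which is the POSIX form. Non-capturing groups are a Perl-style feature and are not shown.

```shell
# Hypothetical sample data.
data='gray
grey
groy
bat
cat
hat
abab
abcd'

# . matches any single character.
dot=$(printf '%s\n' "$data" | grep 'gr.y')

# (a|e) matches "a" or "e" (extended syntax).
alt=$(printf '%s\n' "$data" | grep -E 'gr(a|e)y')

# [bc] matches one character from the set; [^bc] matches one character not in it.
in_set=$(printf '%s\n' "$data" | grep '[bc]at')
not_set=$(printf '%s\n' "$data" | grep '[^bc]at')

# \1 refers back to whatever the first group matched, here "ab".
backref=$(printf '%s\n' "$data" | grep '\(ab\)\1')

printf '%s\n' "$dot" "$alt" "$in_set" "$not_set" "$backref"
```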

Special Characters

  • \n - New line
  • \r - Carriage return
  • \t - Tab
  • \v - Vertical tab
  • \f - Form feed
  • \xxx - Octal character xxx
  • \xhh - Hex character hh

String Replacement

  • $n - nth non-passive group
  • $2 - “xyz” in /^(abc(xyz))$/
  • $1 - “xyz” in /^(?:abc)(xyz)$/
  • $` - Before matched string
  • $' - After matched string
  • $+ - Last matched string
  • $& - Entire matched string
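The $n and $& notations above are Perl-style. As a rough command line analogue, sed (a tool not otherwise covered here, used purely for illustration) writes group references as \1, \2 and the entire match as &. A sketch on a hypothetical input string:

```shell
# Swap two captured groups: \1 is "abc", \2 is "xyz".
swapped=$(printf '%s\n' 'abcxyz' | sed 's/\(abc\)\(xyz\)/\2\1/')

# & stands for the entire matched string.
wrapped=$(printf '%s\n' 'abcxyz' | sed 's/xyz/[&]/')

printf '%s\n' "$swapped" "$wrapped"
```

sed offers no direct counterpart to the "before match" and "after match" variables, so those are left out of the sketch.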

Assertions

  • ?= - Lookahead assertion
  • ?! - Negative lookahead
  • ?<= - Lookbehind assertion
  • ?<! - Negative lookbehind
  • ?> - Once-only Subexpression
  • ?() - Condition [if then]
  • ?()| - Condition [if then else]
  • ?# - Comment

Summary

In this chapter, we saw a few of the many uses of regular expressions. We can find even more if we use regular expressions to search for additional applications that use them.