Chapter 9. Text processing

Table of Contents

9.1. Simple text manipulation
9.2. Regular expressions
9.3. grep

Text manipulation is one of the things that UNIX excels at, because it forms the heart of the UNIX philosophy, as described in Section 2.4, “The UNIX philosophy”. Most UNIX commands are simple programs that read data from the standard input, perform some operation on the data, and send the result to the standard output. These programs basically act as filters that can be connected in a pipeline. This allows the user to put the UNIX tools to uses that their writers never envisioned. In later chapters we will see how you can build simple filters yourself.

This chapter describes some simple, but important, UNIX commands that can be used to manipulate text. After that, we will dive into regular expressions, a sublanguage that can be used to match text patterns.

9.1. Simple text manipulation

9.1.1. Repeating what is said

The simplest text filter is cat; it does nothing more than send the data from stdin to stdout:

$ echo "hello world" | cat
hello world
      

Another useful feature is that you can let it send the contents of a file to the standard output:

$ cat file.txt
Hello, this is the content of file.txt
      

cat really lives up to its name when multiple files are added as arguments. This will concatenate the files, in the sense that it will send the contents of all files to the standard output, in the same order as they were specified as an argument. The following screen snippet demonstrates this:

$ cat file.txt file1.txt file2.txt
Hello, this is the content of file.txt
Hello, this is the content of file1.txt
Hello, this is the content of file2.txt
      

9.1.2. Text statistics

The wc command provides statistics about a text file or text stream. Without any parameters, it will print the number of lines, the number of words, and the number of bytes respectively. A word is delimited by one white space character, or a sequence of whitespace characters.

The following example shows the number of lines, words, and bytes in the canonical “Hello world!” example:

$ echo "Hello world!" | wc 
       1       2      13
      

If you would like to print just one of these components, you can use one of the -l (lines), -w (words), or -c (bytes) parameters. For instance, adding just the -l parameter will show the number of lines in a file:

$ wc -l /usr/share/dict/words 
  235882 /usr/share/dict/words
      

Or, you can print additional fields by adding a parameter:

$ wc -lc /usr/share/dict/words
 235882 2493082 /usr/share/dict/words
      

Please note that, no matter the order in which the options were specified, the output order will always be the same (lines, words, bytes).

Since -c prints the number of bytes, this parameter may not represent the number of characters that a text holds, because the characters in the character set in use may be wider than one byte. For this reason, the -m parameter has been added, which prints the number of characters in a text, independent of the character set. -c and -m are substitutes, and can never be used at the same time.
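
The difference between -c and -m can be made visible with a word that contains a non-ASCII character. The following is only a sketch that assumes UTF-8 encoded text; the sample word and the variable names are made up for the example. The character é is written with octal escapes, so that the example does not depend on the encoding of the script itself:

```shell
# "héllo" followed by a newline; in UTF-8, é is the two bytes \303\251.
# tr -d ' ' removes the padding that some wc implementations add.
bytes=$(printf 'h\303\251llo\n' | wc -c | tr -d ' ')
chars=$(printf 'h\303\251llo\n' | wc -m | tr -d ' ')

echo "bytes: $bytes"
echo "characters: $chars"
```

The byte count is 7. In a UTF-8 locale the character count should be 6; in the POSIX locale both counts may be equal, because every byte is then treated as a character.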

The statistics that wc provides are more useful than they may seem on the surface. For example, the -l parameter is often used as a counter for the output of a command. This is convenient, because many commands separate logical units by a newline. Suppose that you would like to count the number of files in your home directory having a filename ending with .txt. You could do this by combining find to find the relevant files and wc to count the number of occurrences:

$ find ~ -name '*.txt' -type f | wc -l
      

9.1.3. Manipulating characters

The tr command can be used to do common character operations, like swapping characters, deleting characters, and squeezing character sequences. Depending on the operation, one or two sets of characters should be specified. Besides normal characters, there are some special character sequences that can be used:

\character

This notation is used to specify characters that need escaping, most notably \n (newline), \t (horizontal tab), and \\ (backslash).

character1-character2

Implicitly insert all characters from character1 to character2. This notation should be used with care, because it does not always give the expected result. For instance, the sequence a-d may yield abcd for the POSIX locale (language setting), but this may not be true for other locales.

[:class:]

Match a predefined class of characters. All possible classes are shown in Table 9.1, “tr character classes”.

[character*]

Repeat character until the second set is as long as the first set of characters. This notation can only be used in the second set.

[character*n]

Repeat character n times.

Table 9.1. tr character classes

Class       Meaning
[:alnum:]   All letters and numbers.
[:alpha:]   Letters.
[:blank:]   Horizontal whitespace (e.g. spaces and tabs).
[:cntrl:]   Control characters.
[:digit:]   All digits (0-9).
[:graph:]   All printable characters, except whitespace.
[:lower:]   Lowercase letters.
[:print:]   All printable characters, including horizontal whitespace, but excluding vertical whitespace.
[:punct:]   Punctuation characters.
[:space:]   All whitespace.
[:upper:]   Uppercase letters.
[:xdigit:]  Hexadecimal digits (0-9, a-f, A-F).
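
As a brief illustration of the classes in Table 9.1, the following sketch uses them as the first set of a tr translation. The sample strings and variable names are made up for this example:

```shell
# Translate every lowercase letter to the corresponding uppercase letter.
shout=$(echo 'Hello world!' | tr '[:lower:]' '[:upper:]')
echo "$shout"     # HELLO WORLD!

# Replace every digit with a hash sign; [#*] repeats # as often as needed.
masked=$(echo 'Room 101, floor 3' | tr '[:digit:]' '[#*]')
echo "$masked"    # Room ###, floor #
```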

9.1.3.1. Swapping characters

The default operation of tr is to swap (translate) characters. This means that the n-th character in the first set is replaced with the n-th character in the second set. For example, you can replace all e's with i's and o's with a's with one tr operation:

$ echo 'Hello world!' | tr 'eo' 'ia'
Hilla warld!
	

When the second set is not as large as the first set, the last character in the second set will be repeated. This does not necessarily apply to other UNIX systems, though. So, if you want to use tr in a system-independent manner, explicitly define what character should be repeated. For instance:

$ echo 'Hello world!' | tr 'eaiou' '[@*]'
H@ll@ w@rld!
	

Another particularity is the use of the repetition syntax in the middle of the set. Suppose that set 1 is abcdef, and set 2 is @[-*]!. tr will replace a with @, b, c, d, and e with -, and f with !. Some other UNIX systems, however, replace a with @, and all the remaining characters of the set with -. So, a more correct notation would be the more explicit @[-*4]!, which gives the same results on virtually all UNIX systems:

$ echo 'abcdef' | tr 'abcdef' '@[-*4]!'
@----!
	

9.1.3.2. Squeezing character sequences

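When the -s parameter is used, tr will squeeze all characters that are in the last set that is specified. This means that a sequence of the same characters will be reduced to one character. If only one set is specified, as in the following example, the characters in that set are squeezed. Let's squeeze the character "e":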

$ echo "Let's squeeze this." | tr -s 'e'
Let's squeze this.
	

We can combine this with translation to show a useful example of tr in action. Suppose that we would like to mark all vowels with the at sign (@), with consecutive vowels represented by one at sign. This can easily be done by piping two tr commands:

$ echo "eenie meenie minie moe" | tr 'aeiou' '[@*]' | tr -s '@'
@n@ m@n@ m@n@ m@
	

9.1.3.3. Deleting characters

Finally, tr can be used to delete characters. If the -d parameter is used, all characters from the first set are removed:

$ echo 'Hello world!' | tr -d 'lr'
Heo wod!
	

9.1.4. Cutting and pasting text columns

The cut command is provided by UNIX systems to “cut” one or more columns from a file or stream, printing it to the standard output. It is often useful to selectively pick some information from a text. cut provides three approaches to cutting information from files:

  1. By byte.

  2. By character, which is not the same as cutting by byte on systems that use a character set that is wider than eight bits.

  3. By field, that is delimited by a character.

In all three approaches, you can specify the element to choose by its number starting at 1. You can specify a range by using a dash (-). So, M-N means the Mth to the Nth element. Leaving M out (-N) selects all elements from the first element to the Nth element. Leaving N out (M-) selects the Mth element to the last element. Multiple elements or ranges can be combined by separating them by commas (,). So, for instance, 1,3- selects the first element and the third to the last element.

Data can be cut by field with the -f fields parameter. By default, the horizontal tab is used as a separator. Let's have a look at cut in action with a tiny Dutch to English dictionary:

$ cat dictionary
appel   apple
banaan  banana
peer    pear
      

We can get all English words by selecting the second field:

$ cut -f 2 dictionary
apple
banana
pear
      

That was quite easy. Now let's do the same thing with a file that has a colon as the field separator. We can easily try this by converting the dictionary with the tr command that we have seen earlier, replacing all tabs with colons:

$ tr '\t' ':' < dictionary > dictionary-new
$ cat dictionary-new
appel:apple
banaan:banana
peer:pear
      

If we use the same command as in the previous example, we do not get the correct output:

$ cut -f 2 dictionary-new
appel:apple
banaan:banana
peer:pear
      

What happens here is that the delimiter could not be found. If a line does not contain the delimiter that is being used, the default behavior of cut is to print the complete line. You can prevent this with the -s parameter.

To use a different delimiter than the horizontal tab, add the -d delimiter_char parameter to set the delimiting character. So, in the case of our dictionary-new file, we will ask cut to use the colon as a delimiter:

$ cut -d ':' -f 2 dictionary-new
apple
banana
pear
      

If a field that was specified does not exist in a line, that particular field is not printed.
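
The range notation and the -s parameter can be tried on a few lines typed directly into the shell. This is only a sketch; the sample lines and the variable names are assumptions made for the example:

```shell
line='a:b:c:d:e'

# Fields 2 to 4.
middle=$(echo "$line" | cut -d ':' -f 2-4)
echo "$middle"      # b:c:d

# The first field, and the third field up to the last one.
picked=$(echo "$line" | cut -d ':' -f 1,3-)
echo "$picked"      # a:c:d:e

# Without -s the line lacking a colon would be printed completely;
# with -s it is skipped.
with_s=$(printf 'no delimiter here\nx:y\n' | cut -s -d ':' -f 2)
echo "$with_s"      # y
```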

The -b bytes and -c characters parameters select bytes and characters from the text, respectively. On older systems a character used to be a byte wide. But newer systems can provide character sets that are wider than one byte. So, if you want to be sure to grab complete characters, use the -c parameter. An entertaining example of seeing the -c parameter in action is to find the ten most common sets of the first four characters of a word. Most UNIX systems provide a list of words that are separated by a new line. We can use cut to get the first four characters of the words in the word list, add uniq -c to count identical four-character sequences, and use sort to sort them reverse-numerically (sort is described in Section 9.1.5, “Sorting text”). uniq only combines identical lines that are adjacent, which works here as long as the word list is sorted. Finally, we will use head to get the ten most frequent sequences:

$ cut -c 1-4 /usr/share/dict/words | uniq -c | sort -nr | head
    254 inte
    206 comp
    169 cons
    161 cont
    150 over
    125 tran
    111 comm
    100 disc
     99 conf
     96 reco
      

Having concluded with that nice piece of UNIX commands in action, we will move on to the paste command, which combines files in columns in a single text stream.

Usage of paste is very simple: it will combine all files given as arguments, line by line, separated by a tab. With the list of English and Dutch words, we can generate a tiny dictionary:

$ paste dictionary-en dictionary-nl
apple   appel
banana  banaan
pear    peer
      

You can also combine more than two files:

$ paste dictionary-en dictionary-nl dictionary-de 
apple   appel   Apfel
banana  banaan  Banane
pear    peer    Birne
      

If one of the files is longer, the column order is maintained, and empty entries are used to fill up the entries of the shorter files.
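
What happens with files of unequal length can be seen in a small sketch. The temporary directory, the file contents, and the variable names are assumptions made for this example; the Dutch word list is deliberately one line shorter:

```shell
# Work in a temporary directory so that no existing files are touched.
dir=$(mktemp -d)
printf 'apple\nbanana\npear\n' > "$dir/dictionary-en"
printf 'appel\nbanaan\n' > "$dir/dictionary-nl"

# The third line keeps its tab separator, followed by an empty second column.
# Piping through tr makes the otherwise invisible tab visible as "|".
columns=$(paste "$dir/dictionary-en" "$dir/dictionary-nl" | tr '\t' '|')
echo "$columns"

rm -r "$dir"
```

The last line of the output is pear|, showing that the separator is still printed for the missing entry.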

You can use another delimiter by adding the -d delimiter parameter. For example, we can make a colon-separated dictionary:

$ paste -d ':' dictionary-en dictionary-nl
apple:appel
banana:banaan
pear:peer
      

Normally, paste combines files as different columns. You can also let paste use the lines of each file as columns, and put the columns of each file on a separate line. This is done with the -s parameter:

$ paste -s dictionary-en dictionary-nl dictionary-de
apple   banana  pear
appel   banaan  peer
Apfel   Banane  Birne
      

9.1.5. Sorting text

UNIX offers the sort command to sort text. sort can also check whether a file is in sorted order, and merge two sorted files. sort can sort in dictionary and numerical orders. The default sort order is the dictionary order. This means that text lines are compared character by character, sorted as specified in the current collating sequence (which is specified through the LC_COLLATE environment variable). This has a catch when you are sorting numbers: if you have the numbers 1 to 10 on separate lines, the sequence will be 1, 10, 2, 3, etc. This is caused by the per-character interpretation of the dictionary sort. If you want to sort lines by number, use the numerical sort.

If no additional parameters are specified, sort sorts the input lines in dictionary order. For instance:

$ cat << EOF | sort
orange
apple
banana
EOF
apple
banana
orange
      

As you can see, the input is correctly ordered. Sometimes there are two identical lines. You can merge identical lines by adding the -u parameter. The two samples listed below illustrate this.

$ cat << EOF | sort
orange
apple
banana
banana
EOF
apple
banana
banana
orange
$ cat << EOF | sort -u
orange
apple
banana
banana
EOF
apple
banana
orange
      

There are some additional parameters that can be helpful to modify the results a bit:

  • The -f parameter makes the sort case-insensitive.

  • If -d is added, only blanks and alphanumeric characters are used to determine the order.

  • The -i parameter makes sort ignore non-printable characters.

You can sort files numerically by adding the -n parameter. With this parameter, sort stops reading a field as soon as a non-numeric character is found. A leading minus sign, the radix character (the character that separates the integer part of a number from its fraction, such as the decimal point), thousands separators, and leading blanks can be used as a part of a number. These characters are interpreted where applicable.

The following example shows numerical sort in action, by piping the output of du to sort. This works because du specifies the size of each file as the first field.

$ du -a /bin | sort -n
0       /bin/kernelversion
0       /bin/ksh
0       /bin/lsmod.modutils
0       /bin/lspci
0       /bin/mt
0       /bin/netcat
[...]
      

In this case, the output is probably not useful if you want to read the output in a pager, because the smallest files are listed first. This is where the -r parameter comes in handy. It reverses the sort order.

$ du -a /bin | sort -nr
4692    /bin
1036    /bin/ksh93
668     /bin/bash
416     /bin/busybox
236     /bin/tar
156     /bin/ip
[...]
      

The -r parameter also works with dictionary sorts.
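
The difference between dictionary order and numerical order that was described at the start of this section can be reproduced with a handful of numbers. In this sketch the numbers and variable names are made up, and paste -s is only used to print each result on a single line:

```shell
# Dictionary order compares character by character, so "10" sorts before "2".
dict=$(printf '%s\n' 10 9 2 1 | sort | paste -s -d ' ' -)
echo "$dict"       # 1 10 2 9

# Numerical order compares the values of the numbers.
num=$(printf '%s\n' 10 9 2 1 | sort -n | paste -s -d ' ' -)
echo "$num"        # 1 2 9 10

# Numerical order, reversed.
rev=$(printf '%s\n' 10 9 2 1 | sort -nr | paste -s -d ' ' -)
echo "$rev"        # 10 9 2 1
```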

Quite often, files use a layout with multiple columns, and you may want to sort a file by a different column than the first column. For instance, consider the following score file named scores.txt:

John:US:4
Herman:NL:3
Klaus:DE:5
Heinz:DE:3
      

Suppose that we would like to sort the entries in this file by the two-letter country name. sort allows us to sort a file by a column with the -k col1[,col2] parameter, where col1 up to col2 are used as fields for sorting the input. If col2 is not specified, all fields up to the end of the line are used. So, if you want to use just one column, use -k col1,col1. You can also specify the starting character within a column by adding a period (.) and a character index. For instance, -k 2.3,4.2 means that the second column starting from the third character, the third column, and the fourth column up to (and including) the second character are used.

There is yet another particularity when it comes to sorting by columns: by default, sort uses a blank as the column separator. If you use a different separator character, you will have to use the -t char parameter, that is used to specify the field separator.

With the -t and -k parameters combined, we can sort the scores file by country code:

$ sort -t ':' -k 2,2 scores.txt
Heinz:DE:3
Klaus:DE:5
Herman:NL:3
John:US:4
    

So, how can we sort the file by the score? Obviously, we have to ask sort to use the third column. But sort uses a dictionary sort by default[6]. You could use the -n parameter, but sort also allows a more sophisticated approach. You can append one or more of the letters n, r, f, d, i, or b to the column specifier. These letters represent the sort parameters with the same name. If you specify just the starting column, append the letters to that column; otherwise, append them to the ending column.

The following command sorts the file by score:

$ sort -t ':' -k 3n /home/daniel/scores.txt
Heinz:DE:3
Herman:NL:3
John:US:4
Klaus:DE:5
    

It is good to follow this approach, rather than using the parameter variants, because sort allows you to use more than one -k parameter. Adding these flags to the column specification allows you to sort by different columns in different ways. For example, using sort with the -k 3,3n -k 2,2 parameters will sort all lines numerically by the third column. If some lines have identical numbers in the third column, these lines are sorted further with a dictionary sort of the second column.
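
A variation on this idea is sketched below: the highest score is listed first, and entries with the same score are ordered by country code. The temporary directory and the variable names are assumptions made for the example; the data is the score file shown earlier:

```shell
dir=$(mktemp -d)
printf '%s\n' 'John:US:4' 'Herman:NL:3' 'Klaus:DE:5' 'Heinz:DE:3' > "$dir/scores.txt"

# First key: third column, numeric (n) and reversed (r).
# Second key: second column, in the default dictionary order.
ranked=$(sort -t ':' -k 3,3nr -k 2,2 "$dir/scores.txt")
echo "$ranked"

rm -r "$dir"
```

This prints Klaus:DE:5, John:US:4, Heinz:DE:3, and Herman:NL:3, each on its own line. The two entries with a score of 3 are ordered by DE and NL.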

If you want to check whether a file is already sorted, you can use the -c parameter. If the file was in a sorted order, sort will return the value 0, otherwise 1. We can check this by echoing the value of the ? variable, which holds the return value of the last executed command.

$ sort -c scores.txt ; echo $?
1
$ sort scores.txt | sort -c ; echo $?
0
    

The second command shows that this actually works, by piping the output of the sort of scores.txt to sort.
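
Because sort -c reports its result through the return value, it can be used directly as the condition of an if statement. The function name is_sorted and the sample lists in this sketch are hypothetical:

```shell
# is_sorted succeeds (returns 0) when the lines on its standard input are in
# sorted order. The diagnostic that sort -c may print is discarded.
is_sorted() {
  sort -c 2>/dev/null
}

if printf '%s\n' apple banana orange | is_sorted; then
  first='sorted'
else
  first='not sorted'
fi
echo "first list: $first"      # first list: sorted

if printf '%s\n' orange apple banana | is_sorted; then
  second='sorted'
else
  second='not sorted'
fi
echo "second list: $second"    # second list: not sorted
```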

Finally, you can merge two sorted files with the -m parameter, keeping the correct sort order. This is faster than concatenating both files, and re-sorting them.

$ sort -m scores-sorted.txt scores-sorted2.txt
    

9.1.6. Differences between files

Since text streams and text files are very important in UNIX, it is often useful to show the differences between two text files. The main utilities for working with file differences are diff and patch. diff shows the differences between files. The output of diff can be processed by patch to apply the changes between two files to a file. “Diffs” also form the basis of version/source management systems. The following sections describe diff and patch. To have some material to work with, the following two C source files are used to demonstrate these commands. These files are named hello.c and hello2.c respectively.


#include <stdio.h>

void usage(char *programName);

int main(int argc, char *argv[]) {
  if (argc == 1) {
    usage(argv[0]);
    return 1;
  }

  printf("Hello %s!\n", argv[1]);

  return 0;
}

void usage(char *programName) {
  printf("Usage: %s name\n", programName);
}

      
#include <stdio.h>
#include <time.h>

void usage(char *programName);

int main(int argc, char *argv[]) {
  if (argc == 1) {
    usage(argv[0]);
    return 1;
  }

  printf("Hello %s!\n", argv[1]);

  time_t curTime = time(NULL);
  printf("The date is %s\n", asctime(localtime(&curTime)));

  return 0;
}

void usage(char *programName) {
  printf("Usage: %s name\n", programName);
}
      

9.1.6.1. Listing differences between files

Suppose that you received the program hello.c from a friend, and you modified it to give the user the current date and time. You could just send your friend the updated program. But if a file grows larger, this can become uncomfortable, because the changes are harder to spot. Besides that, your friend may have also received modified program sources from other persons. This is a typical situation where diff comes in handy. diff shows the differences between two files. Its most basic syntax is diff file file2, which shows the differences between file and file2. Let's try this with our source files:

$ diff hello.c hello2.c
1a2  (1)
> #include <time.h>  (2)
12a14,16
>   time_t curTime = time(NULL);
>   printf("The date is %s\n", asctime(localtime(&curTime)));
>


The additions from hello2.c are visible in this output, but the format may look a bit strange. Actually, these are commands that can be interpreted by the ed line editor. We will look at a more comfortable output format after touching the surface of the default output format.

Two different elements can be distilled from this output:

(1)

This is an ed command that specifies that text should be appended (a) after line 1 of the first file. The appended text becomes line 2 of the second file.

(2)

This is the actual text to be appended after the first line. The “>” sign is used to mark lines that are added.

The same elements are used to add the second block of text. What about lines that are removed? We can easily see how they are represented by swapping the two parameters to diff, showing the differences between hello2.c and hello.c:

$ diff hello2.c hello.c
2d1  (1)
< #include <time.h>  (2)
14,16d12
<   time_t curTime = time(NULL);
<   printf("The date is %s\n", asctime(localtime(&curTime)));
<


The following elements can be distinguished:

(1)

This is the ed delete command (d), stating that line 2 should be deleted. The second delete command uses a range (lines 14 to 16).

(2)

The text that is going to be removed is preceded by the “<” sign.

That's enough of the ed-style output. The GNU diff program included in Slackware Linux supports so-called unified diffs. Unified diffs are very readable, and provide context by default. diff can provide unified output with the -u flag:

$ diff -u hello.c hello2.c
--- hello.c     2006-11-26 20:28:55.000000000 +0100  (1)
+++ hello2.c    2006-11-26 21:27:52.000000000 +0100  (2)
@@ -1,4 +1,5 @@  (3)
 #include <stdio.h>  (4)
+#include <time.h>  (5)

 void usage(char *programName);

@@ -10,6 +11,9 @@

   printf("Hello %s!\n", argv[1]);

+  time_t curTime = time(NULL);
+  printf("The date is %s\n", asctime(localtime(&curTime)));
+
   return 0;
 }



The following elements can be found in the output:

(1)

The name of the original file, and the timestamp of the last modification time.

(2)

The name of the changed file, and the timestamp of the last modification time.

(3)

These pairs of numbers show the location and size of the chunk that the text below affects in the original file and in the modified file. In this case, the affected chunk starts at line 1 of the original file and is four lines long. In the modified file the affected chunk starts at line 1, and is five lines long. Every chunk in the diff output starts with such a header.

(4)

A line that is not preceded by a minus (-) or plus (+) sign is unchanged. Unmodified lines are included because they give contextual information, and to keep the number of chunks down. If there are only a few unmodified lines between changes, diff will choose to make only one chunk, rather than two chunks.

(5)

A line that is preceded by a plus sign (+) is an addition to the modified file, compared to the original file.

As with the ed-style diff format, we can see some removals by swapping the file names:

$ diff -u hello2.c hello.c

--- hello2.c    2006-11-26 21:27:52.000000000 +0100
+++ hello.c     2006-11-26 20:28:55.000000000 +0100
@@ -1,5 +1,4 @@
 #include <stdio.h>
-#include <time.h>

 void usage(char *programName);

@@ -11,9 +10,6 @@

   printf("Hello %s!\n", argv[1]);

-  time_t curTime = time(NULL);
-  printf("The date is %s\n", asctime(localtime(&curTime)));
-
   return 0;
 }


	

As you can see from this output, lines that are removed from the modified file, in contrast to the original file, are preceded by the minus (-) sign.

When you are working on larger sets of files, it's often useful to compare whole directories. For instance, if you have the original version of a program source in a directory named hello.orig, and the modified version in a directory named hello, you can use the -r parameter to recursively compare both directories. For instance:

$ diff -ru hello.orig hello
diff -ru hello.orig/hello.c hello/hello.c

--- hello.orig/hello.c  2006-12-04 17:37:14.000000000 +0100
+++ hello/hello.c       2006-12-04 17:37:48.000000000 +0100
@@ -1,4 +1,5 @@
 #include <stdio.h>
+#include <time.h>

 void usage(char *programName);

@@ -10,6 +11,9 @@

   printf("Hello %s!\n", argv[1]);

+  time_t curTime = time(NULL);
+  printf("The date is %s\n", asctime(localtime(&curTime)));
+
   return 0;
 }


	

It should be noted that this will only compare files that are available in both directories. The GNU version of diff, which is used by Slackware Linux, provides the -N parameter. This parameter treats a file that exists in only one of the two directories as if it were an empty file in the other directory. So for instance, if we have added a file named Makefile to the hello directory, using the -N parameter will give the following output:

$ diff -ruN hello.orig hello

diff -ruN hello.orig/hello.c hello/hello.c
--- hello.orig/hello.c  2006-12-04 17:37:14.000000000 +0100
+++ hello/hello.c       2006-12-04 17:37:48.000000000 +0100
@@ -1,4 +1,5 @@
 #include <stdio.h>
+#include <time.h>

 void usage(char *programName);

@@ -10,6 +11,9 @@

   printf("Hello %s!\n", argv[1]);

+  time_t curTime = time(NULL);
+  printf("The date is %s\n", asctime(localtime(&curTime)));
+
   return 0;
 }

diff -ruN hello.orig/Makefile hello/Makefile
--- hello.orig/Makefile 1970-01-01 01:00:00.000000000 +0100
+++ hello/Makefile      2006-12-04 17:39:44.000000000 +0100
@@ -0,0 +1,2 @@
+hello: hello.c
+       gcc -Wall -o $@ $<

	

As you can see the chunk indicator says that the chunk in the original file starts at line 0, and is 0 lines long.

UNIX users often exchange the output of diff, usually called “diffs” or “patches”. The next section will show you how you can handle diffs. But you are now able to create them yourself, by redirecting the output of diff to a file. For example:

$ diff -u hello.c hello2.c > hello_add_date.diff
	

If you have multiple diffs, you can easily combine them to one diff, by concatenating the diffs:

$ cat diff1 diff2 diff3 > combined_diff
	

But make sure that they were created from the same directory if you want to use the patch utility that is covered in the next section.
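
When diffs are created from a script, it can be useful to know whether diff found any differences at all. diff reports this through its return value: 0 when the files are identical, 1 when they differ, and a higher value when an error occurred. The following sketch relies on that; the temporary directory, the file names, and the file contents are made up for the example:

```shell
dir=$(mktemp -d)
printf 'one\ntwo\n' > "$dir/old.txt"
printf 'one\ntwo\nthree\n' > "$dir/new.txt"

# Write the unified diff to a file and remember the return value of diff.
status=0
diff -u "$dir/old.txt" "$dir/new.txt" > "$dir/changes.diff" || status=$?

case $status in
  0) result='identical' ;;
  1) result='different' ;;
  *) result='error' ;;
esac
echo "The files are: $result"   # The files are: different

rm -r "$dir"
```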

9.1.6.2. Modifying files with diff output

Suppose that somebody sends you the output of diff for a file that you have created. It would be tedious to manually incorporate all the changes that were made. Fortunately, patch can do this for you. patch accepts diffs on the standard input, and will try to change the original file, according to the differences that are registered in the diff. So, for instance, if we have the hello.c file, and the patch that we produced previously based on the changes between hello.c and hello2.c, we can patch hello.c to become equal to its counterpart:

$ patch < hello_add_date.diff
patching file hello.c
	

If you have hello2.c, you can check whether the files are identical now:

$ diff -u hello.c hello2.c
	

There is no output, so this is the case. One of the nice features of patch is that it can revert the changes made through a diff, by using the -R parameter:

$ patch -R < hello_add_date.diff
	

In these examples, the original file is patched. Sometimes you may want to apply the patch to a file with a different name. You can do this by providing the name of a file as the last argument:

$ patch helloworld.c < hello_add_date.diff
patching file helloworld.c
	

You can also use patch with diffs that were generated with the -r parameter, but you have to take a bit of care. Suppose that the header of a particular file in the diff is as follows:


--------------------------
|diff -ruN hello.orig/hello.c hello/hello.c
|--- hello.orig/hello.c 2006-12-04 17:37:14.000000000 +0100
|+++ hello/hello.c      2006-12-04 17:37:48.000000000 +0100
--------------------------

If you process this diff with patch, it will attempt to change hello.c. So, the directory that holds this file has to be the active directory. You can use the full pathname with the -p n parameter, where n is the number of pathname components that should be stripped. A value of 0 will use the path as it is specified in the patch, 1 will strip the first pathname component, etc. In this example, stripping the first component will result in patching of hello.c. According to the Single UNIX Specification version 3 standard, the path that is preceded by --- should be used to construct the name of the file that should be patched. The GNU version of patch does not follow the standard here. So, it is best to strip off to the point where both directory names are equal (this is usually the top directory of the tree being changed). In most cases where relative paths are used this can be done by using -p 1. For instance:

$ cd hello.orig
$ patch -p 1 < ../hello.diff
	

Or, you can use the -d parameter to specify in which directory the change has to be applied:

$ patch -p 1 -d hello.orig < hello.diff
patching file hello.c
patching file Makefile
	

If you want to keep a backup when you are changing a file, you can use the -b parameter of patch. This will make a copy of every affected file named filename.orig, before actually changing the file:

$ patch -b < hello_add_date.diff
$ ls -l hello.c*
-rw-r--r-- 1 daniel daniel 382 2006-12-04 21:41 hello.c
-rw-r--r-- 1 daniel daniel 272 2006-12-04 21:12 hello.c.orig
	

Sometimes a file cannot be patched. For instance, if it has already been patched, if it has changed too much to apply the patch cleanly, or if the file does not exist at all. In this case, the chunks that could not be applied are stored in a file with the name filename.rej, where filename is the file that patch tried to modify.
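
The steps of this section can be combined into a small self-check. The function below is only a sketch; the name roundtrip and all variable names are hypothetical. It creates a unified diff of two files, applies it to a copy of the first file, and then reverts it with -R, checking the result of each step with cmp:

```shell
# roundtrip OLD NEW: succeed when a unified diff of OLD and NEW turns a copy
# of OLD into NEW, and patch -R turns that copy back into OLD. All work is
# done in a temporary directory, so OLD and NEW themselves are not modified.
roundtrip() {
  old=$1
  new=$2
  work=$(mktemp -d) || return 1
  cp "$old" "$work/file"

  # diff returns 1 when the files differ; any other value means that there
  # is nothing to apply, or that an error occurred.
  status=0
  diff -u "$old" "$new" > "$work/changes.diff" || status=$?
  if [ "$status" -ne 1 ]; then
    rm -r "$work"
    return 1
  fi

  # The file to patch is named explicitly as the last argument.
  if patch "$work/file" < "$work/changes.diff" > /dev/null &&
     cmp -s "$work/file" "$new" &&
     patch -R "$work/file" < "$work/changes.diff" > /dev/null &&
     cmp -s "$work/file" "$old"; then
    ok=0
  else
    ok=1
  fi

  rm -r "$work"
  return "$ok"
}
```

With the source files used in this chapter, it could be called as roundtrip hello.c hello2.c, followed by a check of the ? variable.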

9.2. Regular expressions

9.2.1. Introduction

In daily life, you will often want to find some text that matches a certain pattern, rather than a literal string. Many UNIX utilities implement a language for matching text patterns: regular expressions (regexps). Over time the regular expression language has grown; there are now basically three regular expression syntaxes:

  • Traditional UNIX regular expressions.

  • POSIX extended regular expressions.

  • Perl-compatible regular expressions (PCRE).

POSIX regexps are mostly a superset of traditional UNIX regexps, and PCREs a superset of POSIX regexps. The syntax that an application supports differs per application, but almost all applications support at least POSIX regexps.

Each syntactical unit in a regexp expresses one of the following things:

  • A character: this is the basis of every regular expression, a character or a set of characters to be matched. For instance, the letter p or the comma (,).

  • Quantification: a quantifier specifies how many times the preceding character or set of characters should be matched.

  • Alternation: alternation is used to match “a or b” in which a and b can be a character or a regexp.

  • Grouping: this is used to group subexpressions, so that quantification or alternation can be applied to the group.

9.2.2. Traditional UNIX regexps

This section describes traditional UNIX regexps. Because of a lack of standardisation, the exact syntax may differ a bit per utility. Usually, the manual page of a command provides more detailed information about the supported basic or traditional regular expressions. It is a good idea to learn traditional regexps, but to use POSIX regexps for your own scripts.

9.2.2.1. Matching characters

Characters are matched by themselves. If a specific character is used as a syntactic character for regexps, you can match that character by adding a backslash. For instance, \+ matches the plus character.

A period (.) matches any character, for instance, the regexp b.g matches bag, big, and blg, but not bit.

The period character often provides too much freedom. You can use square brackets ([]) to specify characters which can be matched. For instance, the regexp b[aei]g matches bag, beg, and big, but nothing else. You can also match any character but the characters in a set by using the square brackets, and using the caret (^) as the first character. For instance, b[^aei]g matches any three character string that starts with b and ends with g, with the exception of bag, beg, and big. It is also possible to match a range of characters with a dash (-). For example, a[0-9] matches a followed by a single digit.

Two special characters, the caret (^) and the dollar sign ($), respectively match the start and end of a line. This is very handy for parsing files. For instance, you can match all lines that start with a hash (#) with the regexp ^#.
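
As a small sketch of the two anchors, the commands below run grep over a few made-up configuration lines. The sample lines and variable names are assumptions for this illustration only.

```shell
# Made-up configuration lines; the third line is empty.
config=$(printf '%s\n' '# a comment' 'name=value' '' 'path=/tmp # not a comment line')

# Lines that start with a hash.
comments=$(printf '%s\n' "$config" | grep '^#')

# Lines that end with the string "value".
ends_with_value=$(printf '%s\n' "$config" | grep 'value$')

# ^$ matches only empty lines; -c counts the matching lines.
empty_count=$(printf '%s\n' "$config" | grep -c '^$')

printf '%s\n' "$comments" "$ends_with_value" "$empty_count"
```

The last line of the sample contains a hash, but not at the start of the line, so ^# does not select it.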

9.2.2.2. Quantification

The simplest quantification sign that traditional regular expressions support is the (Kleene) star (*). It matches zero or more instances of the preceding character. For instance, ba* matches b, ba, baa, etc. You should be aware that a single character followed by a star without any context matches every string, because c* also matches a string that has zero c characters.

More specific repetitions can be specified with backslash-escaped curly braces. \{x,y\} matches the preceding character at least x times, but not more than y times. So, ba\{1,3\} matches ba, baa, and baaa.
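
The difference between the star and a bounded repetition can be sketched as follows. The input lines are made up, and -x is again used to force a match of the whole line.

```shell
# Sample input: b followed by zero to four a characters.
input=$(printf '%s\n' b ba baa baaa baaaa)

# The star accepts any number of a characters, including none.
star=$(printf '%s\n' "$input" | grep -x 'ba*')

# \{1,3\} accepts one, two or three a characters.
bounded=$(printf '%s\n' "$input" | grep -x 'ba\{1,3\}')

# Without -x, c* matches every line, because zero c characters is a match.
every=$(printf '%s\n' "$input" | grep -c 'c*')

printf 'ba*:\n%s\n' "$star"
printf 'ba\\{1,3\\}:\n%s\n' "$bounded"
printf 'lines matched by c*: %s\n' "$every"
```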

9.2.2.3. Grouping

Backslash-escaped parentheses group various characters together, so that you can apply quantification or alternation to a group of characters. For instance, \(ab\)\{1,3\} matches ab, abab, and ababab.

9.2.2.4. Alternation

A backslash-escaped vertical bar (\|) allows you to match either of two expressions. This is not useful for single characters, because a\|b is equivalent to [ab], but it is very useful in conjunction with grouping. Suppose that you would like an expression that matches apple and pear, but nothing else. This can be done easily with the vertical bar: \(apple\)\|\(pear\).
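
Grouping and alternation in the backslash-escaped style can be sketched like this. The sketch assumes a grep that accepts \| in basic regular expressions, as GNU grep does; other implementations may not. The input words are made up.

```shell
# A group repeated one to three times.
group=$(printf '%s\n' ab abab ababab abababab | grep -x '\(ab\)\{1,3\}')

# Alternation between two groups (assumes support for \| in basic regexps).
fruit=$(printf '%s\n' apple pear peach | grep -x '\(apple\)\|\(pear\)')

printf '%s\n' "$group"
printf '%s\n' "$fruit"
```

If a script has to run on a grep without \|, the extended syntax described in the next section is usually the safer choice.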

9.2.3. POSIX extended regular expressions

POSIX extended regular expressions build upon traditional regular expressions, adding some other useful primitives. Another convenient difference is that grouping parentheses, quantification curly braces, and the alternation sign (|) are not backslash-escaped. If they are escaped, they match the literal characters instead, which is the opposite of their behavior in traditional regular expressions. Most people find POSIX extended regular expressions much more comfortable, which is why they are more widely used.

9.2.3.1. Matching characters

Normal character matching has not changed compared to the traditional regular expressions described in Section 9.2.2.1, “Matching characters”.

9.2.3.2. Quantification

Besides the Kleene star (*), which matches the preceding character or group zero or more times, POSIX extended regular expressions add two new simple quantification primitives. The plus sign (+) matches the preceding character or group one or more times. For example, a+ matches a (or any string with more consecutive a's), but does not match zero a's. The question mark character (?) matches the preceding character or group zero or one time. So, ba?d matches bd and bad, but not baad or bed.

Curly braces are used for repetition, as in traditional regular expressions, but the backslashes should be omitted. To match ba and baa, one should use ba{1,2} rather than ba\{1,2\}.
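
The three extended quantifiers can be compared side by side with grep -E. This is a minimal sketch on made-up input; -x makes each pattern match the whole line.

```shell
# Sample words with zero to three a characters, plus one with an e.
input=$(printf '%s\n' bd bad baad baaad bed)

optional=$(printf '%s\n' "$input" | grep -Ex 'ba?d')    # zero or one a
one_or_more=$(printf '%s\n' "$input" | grep -Ex 'ba+d') # at least one a
bounded=$(printf '%s\n' "$input" | grep -Ex 'ba{1,2}d') # one or two a's

printf 'ba?d:\n%s\n' "$optional"
printf 'ba+d:\n%s\n' "$one_or_more"
printf 'ba{1,2}d:\n%s\n' "$bounded"
```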

9.2.3.3. Grouping

Grouping is done in the same manner as in traditional regular expressions, leaving out the escape-backslashes before the parentheses. For example, (ab){1,3} matches ab, abab, and ababab.

9.2.3.4. Alternation

Alternation is done in the same manner as with traditional regular expressions, leaving out the escape-backslashes before the vertical bar. So, (apple)|(pear) matches apple and pear.
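
The following sketch shows extended alternation, and also what happens when the vertical bar is escaped in the extended syntax. The input lines are made up for the illustration.

```shell
# Sample lines; the last one contains a literal vertical bar.
input=$(printf '%s\n' apple pear pineapple 'apple|pear')

# Whole-line match of either alternative.
either=$(printf '%s\n' "$input" | grep -Ex '(apple)|(pear)')

# An escaped bar is a literal character in extended regexps.
literal=$(printf '%s\n' "$input" | grep -Ex 'apple\|pear')

# Without -x the pattern may match anywhere in the line.
anywhere=$(printf '%s\n' "$input" | grep -Ec 'apple|pear')

printf '%s\n' "$either"
printf '%s\n' "$literal"
printf 'lines containing apple or pear: %s\n' "$anywhere"
```

Without anchoring, pineapple is also selected, so use -x or the ^ and $ anchors when nothing else should match.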

9.3. grep

9.3.1. Basic grep usage

We have now arrived at one of the most important utilities of the UNIX system, and the first occasion to try out regular expressions. The grep command searches a text stream or a file for a pattern. This pattern is a regular expression, and can either be a basic regular expression or a POSIX extended regular expression (when the -E parameter is used). By default, grep writes the lines that matched to the standard output. In the most basic syntax, you specify a regular expression as an argument, and grep searches for matches in the text from the standard input. This is a nice way to practice a bit with regular expressions. In the interactive sessions that follow, a typed line appears a second time when grep prints it as a match.

$ grep '^\(ab\)\{2,3\}$'
ab
abab
abab
ababab
ababab
abababab
      

The example listed above shows a basic regular expression in action that matches a line consisting solely of two or three repetitions of the string ab. You can do the same thing with POSIX extended regular expressions, by adding the -E (for extended) parameter:

$ grep -E '^(ab){2,3}$'
ab
abab
abab
ababab
ababab
abababab
      

Since the default behavior of grep is to read from the standard input, you can add it to a pipeline to extract the interesting parts of the output of the preceding commands in the pipeline. For instance, if you would like to search for the string 2006 in the third column of a tab-separated file (file.txt is a placeholder name here), you could combine the cut and grep commands:

$ cut -f 3 file.txt | grep '2006'
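
To make the pipeline concrete, here is a small sketch that feeds made-up tab-separated records through cut and grep. The record contents and variable names are assumptions for this illustration.

```shell
# Made-up records with three tab-separated fields: name, city, year.
records=$(printf 'alice\tParis\t2006\nbob\tOslo\t2005\ncarol\tRome\t2006\n')

# cut keeps only the third field; grep then selects the matching values.
years=$(printf '%s\n' "$records" | cut -f 3 | grep '2006')

# -c reports how many lines matched instead of printing them.
count=$(printf '%s\n' "$records" | cut -f 3 | grep -c '2006')

printf '%s\n' "$years"
printf 'matches: %s\n' "$count"
```

Because cut discards the other columns before grep runs, the output contains only the third field, not the complete records.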
      

9.3.2. grepping files

Naturally, grep can also read a file directly, rather than the standard input. As usual, this is done by adding the files to be read as the last arguments. The following example will print all lines from the /etc/passwd file that start with the string daniel.

$ grep "^daniel" /etc/passwd
daniel:*:1001:1001:Daniel de Kok:/home/daniel:/bin/sh
      

With the -r option, grep will recursively traverse a directory structure, trying to find matches in each file encountered during the traversal. However, in scripts that have to be portable, it is better to combine grep with find and its -exec operand.

$ grep -r 'somepattern' somedir
      

is the non-portable counterpart of the following command, which prints the matching lines of each file, followed by the name of that file:

$ find somedir -type f -exec grep 'somepattern' {} \; -print
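
A variation on the find approach can be sketched as follows. It uses grep -l to print only the names of files that contain a match, and the {} + form of -exec to hand many files to a single grep invocation. The temporary directory and file names are made up, and mktemp -d is assumed to be available on the system.

```shell
# Build a small throwaway directory tree (names are made up for this sketch).
dir=$(mktemp -d)
mkdir "$dir/sub"
printf 'somepattern here\n' > "$dir/a.txt"
printf 'nothing to see\n' > "$dir/sub/b.txt"
printf 'more somepattern\n' > "$dir/sub/c.txt"

# -l lists the names of matching files; {} + batches the file arguments.
found=$(cd "$dir" && find . -type f -exec grep -l 'somepattern' {} + | sort)

printf '%s\n' "$found"

# Clean up the throwaway tree.
rm -rf "$dir"
```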
      

9.3.3. Pattern behavior

grep can also print all lines that do not match the pattern that was used. This is done by adding the -v parameter:

$ grep -Ev '^(ab){2,3}$'
ab
ab
abab
ababab
abababab
abababab
      

If you want to use the pattern in a case-insensitive manner, you can add the -i parameter. For example:

$ grep -i "a"
a
a
A
A
      

You can also match a string literally with the -F parameter:

$ grep -F 'aa*'
a
aa*
aa*
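
The -i, -v, and -F parameters can also be compared on the same input in a script. This is a minimal sketch with made-up input lines.

```shell
# Sample input, one item per line.
input=$(printf '%s\n' a A 'aa*' b)

insensitive=$(printf '%s\n' "$input" | grep -i 'a')   # a and A both count
inverted=$(printf '%s\n' "$input" | grep -v 'a')      # lines without a lowercase a
fixed=$(printf '%s\n' "$input" | grep -F 'aa*')       # the literal string aa*
as_regexp=$(printf '%s\n' "$input" | grep -c 'aa*')   # regexp: one or more a's

printf -- '-i:\n%s\n' "$insensitive"
printf -- '-v:\n%s\n' "$inverted"
printf -- '-F:\n%s\n' "$fixed"
printf 'lines matched by the regexp aa*: %s\n' "$as_regexp"
```

Read as a regular expression, aa* matches both a and aa*, while -F restricts the match to lines that contain the three characters literally.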
      

9.3.4. Using multiple patterns

As we have seen, you can use the alternation character (|) to match either of two or more subpatterns. If the patterns that you would like to match differ a lot, it is often more convenient to write separate patterns. grep allows you to use more than one pattern by separating patterns with a newline character. So, for example, if you would like to print lines that match either the pattern a or the pattern b, you can do so by starting a new line inside the quoted argument:

$ grep 'a
b'
a
a
b
b
c
      

This works because quotes are used, and the shell passes quoted parameters literally, including the newline. Admittedly, this is not very pretty. grep also accepts one or more -e pattern parameters, which makes it possible to specify more than one pattern on one line. The grep invocation in the previous example could be rewritten as:

$ grep -e 'a' -e 'b'
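
That both notations select the same lines can be checked with a short sketch. The input is made up, and the variable names are only for illustration.

```shell
# Sample input, one letter per line.
input=$(printf '%s\n' a b c)

# Two patterns given with separate -e parameters.
with_e=$(printf '%s\n' "$input" | grep -e 'a' -e 'b')

# The same two patterns separated by a newline inside one quoted argument.
with_newline=$(printf '%s\n' "$input" | grep 'a
b')

printf '%s\n' "$with_e"
[ "$with_e" = "$with_newline" ] && echo 'both forms select the same lines'
```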
      


[6] Of course, that will not really matter in this case, because we don't use numbers higher than 9, and virtually all character sets have numbers in a numerical order.