Text manipulation is one of the things that UNIX excels at, because it forms the heart of the UNIX philosophy, as described in Section 2.4, “The UNIX philosophy”. Most UNIX commands are simple programs that read data from the standard input, perform some operation on the data, and send the result to the standard output. These programs basically act as filters that can be connected in a pipeline. This allows the user to put the UNIX tools to uses that their writers never envisioned. In later chapters we will see how you can build simple filters yourself.
This chapter describes some simple, but important, UNIX commands that can be used to manipulate text. After that, we will dive into regular expressions, a sublanguage that can be used to match text patterns.
The simplest text filter is cat. It does nothing more than send the data from stdin to stdout:
$ echo "hello world" | cat
hello world
Another useful feature is that you can let it send the contents of a file to the standard output:
$ cat file.txt
Hello, this is the content of file.txt
cat really lives up to its name when multiple files are added as arguments. This will concatenate the files, in the sense that it will send the contents of all files to the standard output, in the same order as they were specified as an argument. The following screen snippet demonstrates this:
$ cat file.txt file1.txt file2.txt
Hello, this is the content of file.txt
Hello, this is the content of file1.txt
Hello, this is the content of file2.txt
The wc command provides statistics about a text file or text stream. Without any parameters, it will print the number of lines, the number of words, and the number of bytes respectively. A word is delimited by one white space character, or a sequence of whitespace characters.
The following example shows the number of lines, words, and bytes in the canonical “Hello world!” example:
$ echo "Hello world!" | wc
1 2 13
If you would like to print just one of these components, you
can use one of the -l
(lines), -w (words), or
-c (bytes) parameters.
For instance, adding just the -l parameter will show the number
of lines in a file:
$ wc -l /usr/share/dict/words
235882 /usr/share/dict/words
Or, you can print additional fields by adding a parameter:
$ wc -lc /usr/share/dict/words
235882 2493082 /usr/share/dict/words
Please note that, no matter the order in which the options were specified, the output order will always be the same (lines, words, bytes).
Since -c prints the number of bytes, this parameter may not represent the number of characters that a text holds, because characters in the character set in use may be wider than one byte. To this end, the -m parameter has been added, which prints the number of characters in a text, independent of the character set. -c and -m are substitutes, and can never be used at the same time.
The statistics that wc provides are more useful than they may seem on the surface. For example, the -l parameter is often used as a counter for the output of a command. This is convenient, because many commands separate logical units by a newline. Suppose that you would like to count the number of files in your home directory having a filename ending with .txt:
$ find ~ -name '*.txt' -type f | wc -l
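The same counting idiom works for any command that prints one item per line. The sketch below uses made-up sample text fed in through printf rather than real files, so that the result is predictable; the variable names are only illustrative:

```shell
# A small, made-up sample of three lines and six words.
sample='one two
three
four five six'

lines=$(printf '%s\n' "$sample" | wc -l)
words=$(printf '%s\n' "$sample" | wc -w)

# Some wc implementations pad the number with blanks;
# arithmetic expansion strips that padding.
lines=$((lines))
words=$((words))

echo "lines: $lines, words: $words"
```

Capturing the count in a variable like this is a common way to use wc in scripts.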
The tr command can be used to do common character operations, like swapping characters, deleting characters, and squeezing character sequences. Depending on the operation, one or two sets of characters should be specified. Besides normal characters, there are some special character sequences that can be used:
\character
This notation is used to specify characters that need escaping, most notably \n (newline), \t (horizontal tab), and \\ (backslash).
character1-character2
Implicitly insert all characters from character1 to character2. This notation should be used with care, because it does not always give the expected result. For instance, the sequence a-d may yield abcd for the POSIX locale (language setting), but this may not be true for other locales.
[:class:]
Match a predefined class of characters. All possible classes are shown in Table 9.1, “tr character classes”.
[character*]
Repeat character until the second set is as long as the first set of characters. This notation can only be used in the second set.
[character*n]
Repeat character n times.
Table 9.1. tr character classes
| Class | Meaning |
|---|---|
| [:alnum:] | All letters and numbers. |
| [:alpha:] | Letters. |
| [:blank:] | Horizontal whitespace (e.g. spaces and tabs). |
| [:cntrl:] | Control characters. |
| [:digit:] | All digits (0-9). |
| [:graph:] | All printable characters, except whitespace. |
| [:lower:] | Lowercase letters. |
| [:print:] | All printable characters, including horizontal whitespace, but excluding vertical whitespace. |
| [:punct:] | Punctuation characters. |
| [:space:] | All whitespace. |
| [:upper:] | Uppercase letters. |
| [:xdigit:] | Hexadecimal digits (0-9, a-f, A-F). |
The default operation of tr is to swap (translate) characters. This means that the n-th character in the first set is replaced with the n-th character in the second set. For example, you can replace all e's with i's and o's with a's with one tr operation:
$ echo 'Hello world!' | tr 'eo' 'ia'
Hilla warld!
When the second set is shorter than the first set, the last character in the second set is repeated. However, this does not necessarily apply to other UNIX systems. So, if you want to use tr in a system-independent manner, explicitly define which character should be repeated. For instance:
$ echo 'Hello world!' | tr 'eaiou' '[@*]'
H@ll@ w@rld!
Another particularity is the use of the repetition syntax in the middle of the set. Suppose that set 1 is abcdef, and set 2 is @[-*]!. tr will replace a with @, b, c, d, and e with -, and f with !. However, some other UNIX systems replace a with @, and all remaining characters of the set with -. So, a more correct notation would be the more explicit @[-*4]!, which gives the same results on virtually all UNIX systems:
$ echo 'abcdef' | tr 'abcdef' '@[-*4]!'
@----!
When the -s parameter is used, tr will squeeze all characters that are in the last set specified (the second set if you also translate, otherwise the only set). This means that a sequence of the same character will be reduced to one character. Let's squeeze the character "e":
$ echo "Let's squeeze this." | tr -s 'e'
Let's squeze this.
We can combine this with translation to show a useful example of tr in action. Suppose that we would like to mark all vowels with the at sign (@), with consecutive vowels represented by one at sign. This can easily be done by piping two tr commands:
$ echo "eenie meenie minie moe" | tr 'aeiou' '[@*]' | tr -s '@'
@n@ m@n@ m@n@ m@
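The introduction to tr mentions deleting characters, which is done with the -d parameter. The following sketch uses made-up input strings to delete all digits, convert lowercase letters to uppercase with the character classes from Table 9.1, and put each word on its own line by combining translation with -s:

```shell
# Delete every digit from the input (-d takes a single set).
no_digits=$(echo 'r2d2 and c3po' | tr -d '[:digit:]')
echo "$no_digits"

# Translate lowercase letters to uppercase using character classes.
upper=$(echo 'hello world' | tr '[:lower:]' '[:upper:]')
echo "$upper"

# Translate spaces to newlines and squeeze repeated newlines,
# so that every word ends up on a line of its own.
word_count=$(echo 'one   two  three' | tr -s ' ' '\n' | wc -l)
word_count=$((word_count))
echo "words: $word_count"
```

Both sets in the last command are one character long, which avoids the system-dependent padding behavior described earlier.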
The cut command is provided by UNIX systems to “cut” one or more columns from a file or stream, printing it to the standard output. It is often useful to selectively pick some information from a text. cut provides three approaches to cutting information from files:
By byte.
By character, which is not the same as cutting by byte on systems that use a character set that is wider than eight bits.
By field, that is delimited by a character.
In all three approaches, you can specify the element to choose by its number starting at 1. You can specify a range by using a dash (-). So, M-N means the Mth to the Nth element. Leaving M out (-N) selects all elements from the first element to the Nth element. Leaving N out (M-) selects the Mth element to the last element. Multiple elements or ranges can be combined by separating them by commas (,). So, for instance, 1,3- selects the first element and the third to the last element.
Data can be cut by field with the -f fields parameter. By default, the horizontal tab is used as a separator. Let's have a look at cut in action with a tiny Dutch to English dictionary:
$ cat dictionary
appel apple
banaan banana
peer pear
We can get all English words by selecting the second field:
$ cut -f 2 dictionary
apple
banana
pear
That was quite easy. Now let's do the same thing with a file that has a colon as the field separator. We can easily try this by converting the dictionary with the tr command that we have seen earlier, replacing all tabs with colons:
$ tr '\t' ':' < dictionary > dictionary-new
$ cat dictionary-new
appel:apple
banaan:banana
peer:pear
If we use the same command as in the previous example, we do not get the correct output:
$ cut -f 2 dictionary-new
appel:apple
banaan:banana
peer:pear
What happens here is that the delimiter could not be found. If a line does not contain the delimiter that is being used, the default behavior of cut is to print the complete line. You can prevent this with the -s parameter.
To use a different delimiter than the horizontal tab, add the -d delimiter_char parameter to set the delimiting character. So, in the case of our dictionary-new file:
$ cut -d ':' -f 2 dictionary-new
apple
banana
pear
If a field that was specified does not exist in a line, that particular field is not printed.
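The range notation and the -s parameter can be combined with a custom delimiter. The sketch below uses made-up colon-separated records (the field layout is an assumption for illustration only) and shows how a line without the delimiter is treated with and without -s:

```shell
# Made-up records: name, country, score, year.
# The last line deliberately contains no colon.
records='John:US:4:2006
Herman:NL:3:2005
no delimiter here'

# Select field 1 and fields 3 up to the last one.
# The line without a colon is printed unchanged.
with_all=$(printf '%s\n' "$records" | cut -d ':' -f 1,3-)
echo "$with_all"

# With -s, lines that do not contain the delimiter are suppressed.
only_delimited=$(printf '%s\n' "$records" | cut -s -d ':' -f 1,3-)
echo "$only_delimited"
```

Note that cut joins the selected fields with the same delimiter that was used for splitting.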
The -b bytes and
-c characters
respectively select bytes and characters from the text. On
older systems a character used to be a byte wide. But newer
systems can provide character sets that are wider than one
byte. So, if you want to be sure to grab complete characters,
use the -c parameter.
An entertaining example of seeing the -c parameter in action is to find the ten most common sets of the first four characters of a word. Most UNIX systems provide a list of words that are separated by a newline. We can use cut to get the first four characters of the words in the word list, add uniq -c to count identical four-character sequences (uniq only collapses adjacent lines, which works here because the word list is sorted), and use sort to sort them reverse-numerically (sort is described in Section 9.1.5, “Sorting text”). Finally, we will use head to get the ten most frequent sequences:
$ cut -c 1-4 /usr/share/dict/words | uniq -c | sort -nr | head
254 inte
206 comp
169 cons
161 cont
150 over
125 tran
111 comm
100 disc
99 conf
96 reco
Having concluded with that nice piece of UNIX commands in action, we will move on to the paste command, which combines files in columns in a single text stream.
Usage of paste is very simple: it combines the corresponding lines of all files given as arguments, separated by a tab. With the list of English and Dutch words, we can generate a tiny dictionary:
$ paste dictionary-en dictionary-nl
apple appel
banana banaan
pear peer
You can also combine more than two files:
$ paste dictionary-en dictionary-nl dictionary-de
apple appel Apfel
banana banaan Banane
pear peer Birne
If one of the files is longer, the column order is maintained, and empty entries are used to fill up the entries of the shorter files.
You can use another delimiter by adding the -d delimiter parameter. For
example, we can make a colon-separated dictionary:
$ paste -d ':' dictionary-en dictionary-nl
apple:appel
banana:banaan
pear:peer
Normally, paste combines files as different
columns. You can also let paste use the
lines of each file as columns, and put the columns of each
file on a separate line. This is done with the -s parameter:
$ paste -s dictionary-en dictionary-nl dictionary-de
apple banana pear
appel banaan peer
Apfel Banane Birne
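To see what happens with files of unequal length, and how -s and -d work together, consider the following sketch. The file names and contents are made up, and mktemp -d is assumed to be available (it is common, but not required by POSIX):

```shell
# Work in a temporary directory with made-up sample files.
tmpdir=$(mktemp -d)
printf 'apple\nbanana\npear\n' > "$tmpdir/en"
printf 'appel\nbanaan\n' > "$tmpdir/nl"   # one line shorter

# The missing entry of the shorter file becomes an empty field.
columns=$(paste -d ':' "$tmpdir/en" "$tmpdir/nl")
echo "$columns"

# -s joins the lines of one file into a single line,
# here separated by commas instead of tabs.
joined=$(paste -s -d ',' "$tmpdir/en")
echo "$joined"

rm -r "$tmpdir"
```

The last line of the first result ends with a bare colon, which marks the empty entry.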
UNIX offers the sort command to sort text. sort can also check whether a file is in sorted order, and merge two sorted files. sort can sort in dictionary and numerical orders. The default sort order is the dictionary order. This means that text lines are compared character by character, sorted as specified in the current collating sequence (which is specified through the LC_COLLATE environment variable). This has a catch when you are sorting numbers. For instance, if you have the numbers 1 to 10 on different lines, the sequence will be 1, 10, 2, 3, etc. This is caused by the per-character interpretation of the dictionary sort. If you want to sort lines by number, use the numerical sort.
If no additional parameters are specified, sort sorts the input lines in dictionary order. For instance:
$ cat << EOF | sort
orange
apple
banana
EOF
apple
banana
orange
As you can see, the input is correctly ordered. Sometimes
there are two identical lines. You can merge identical lines
by adding the -u
parameter. The two samples listed below illustrate this.
$ cat << EOF | sort
orange
apple
banana
banana
EOF
apple
banana
banana
orange
$ cat << EOF | sort -u
orange
apple
banana
banana
EOF
apple
banana
orange
There are some additional parameters that can be helpful to modify the results a bit:
The -f parameter
makes the sort case-insensitive.
If -d is added,
only blanks and alphanumeric characters are used to
determine the order.
The -i parameter
makes sort ignore non-printable
characters.
You can sort files numerically by adding the -n parameter. With this parameter, sort stops reading an input line at the first character that cannot be part of a number. Leading blanks, a leading minus sign, the radix character (the decimal point in most locales), and the thousands separator can be used as part of a number. These characters are interpreted where applicable.
The following example shows numerical sort in action, by piping the output of du to sort. This works because du specifies the size of each file as the first field.
$ du -a /bin | sort -n
0 /bin/kernelversion
0 /bin/ksh
0 /bin/lsmod.modutils
0 /bin/lspci
0 /bin/mt
0 /bin/netcat
[...]
In this case, the output is probably not useful if you want to read the output in a pager, because the smallest files are listed first. This is where the -r parameter comes in handy. It reverses the sort order.
$ du -a /bin | sort -nr
4692 /bin
1036 /bin/ksh93
668 /bin/bash
416 /bin/busybox
236 /bin/tar
156 /bin/ip
[...]
The -r parameter also
works with dictionary sorts.
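The difference between dictionary, numerical, and reversed order is easy to see on a handful of numbers. In this sketch the numbers are made up, LC_ALL=C is an assumption added to make the dictionary order predictable, and paste -s is only used to show each result on one line:

```shell
numbers=$(printf '%s\n' 10 9 2 1)

# Dictionary order compares character by character,
# so 10 sorts before 2.
dict=$(printf '%s\n' "$numbers" | LC_ALL=C sort | paste -s -d ' ' -)

# Numerical order compares the values of the numbers.
num=$(printf '%s\n' "$numbers" | sort -n | paste -s -d ' ' -)

# -r reverses whichever order is in effect.
rev=$(printf '%s\n' "$numbers" | sort -nr | paste -s -d ' ' -)

echo "dictionary: $dict"
echo "numerical:  $num"
echo "reversed:   $rev"
```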
Quite often, files use a layout with multiple columns, and you may want to sort a file by a different column than the first column. For instance, consider the following score file named scores.txt:
John:US:4
Herman:NL:3
Klaus:DE:5
Heinz:DE:3
Suppose that we would like to sort the entries in this file by the two-letter country code. sort allows us to sort a file by a column with the -k col1[,col2] parameter, where the fields col1 up to and including col2 are used as the sort key. If col2 is not specified, all fields up to the end of the line are used. So, if you want to use just one column, use -k col1,col1. You can also specify the starting character within a column by adding a period (.) and a character index. For instance, -k 2.3,4.2 means that the sort key consists of the second column starting from its third character, the third column, and the fourth column up to (and including) its second character.
There is yet another particularity when it comes to sorting by
columns: by default, sort uses a blank as the
column separator. If you use a different separator character,
you will have to use the -t char
parameter, that is used to specify the field separator.
With the -t and
-k parameters combined,
we can sort the scores file by country code:
$ sort -t ':' -k 2,2 scores.txt
Heinz:DE:3
Klaus:DE:5
Herman:NL:3
John:US:4
So, how can we sort the file by the score? Obviously, we have to ask sort to use the third column. But sort uses a dictionary sort by default[6]. You could use the -n parameter, but sort also allows a more sophisticated approach. You can append one or more of the letters n, r, f, d, i, or b to the column specifier. These letters represent the sort parameters with the same name. If you specify just the starting column, append the letters to that column; otherwise, add them to the ending column.
The following command sorts the file by score:
$ sort -t ':' -k 3n /home/daniel/scores.txt
Heinz:DE:3
Herman:NL:3
John:US:4
Klaus:DE:5
It is good to follow this approach, rather than using the parameter variants, because sort allows you to use more than one -k parameter. Adding these flags to the column specification allows you to sort by different columns in different ways. For example, using sort with the -k 3,3n -k 2,2 parameters will sort all lines numerically by the third column. If some lines have identical numbers in the third column, these lines are sorted further with a dictionary sort of the second column.
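The combination of keys described above can be tried on the scores data. This is a sketch: the data is fed in through printf instead of being read from scores.txt, and LC_ALL=C is an assumption added to make the dictionary order of the second key predictable:

```shell
scores='John:US:4
Herman:NL:3
Klaus:DE:5
Heinz:DE:3'

# Primary key: third column, numeric.
# Secondary key: second column, dictionary order.
by_score=$(printf '%s\n' "$scores" | LC_ALL=C sort -t ':' -k 3,3n -k 2,2)
echo "$by_score"

# The r flag reverses only the key it is attached to:
# highest score first, ties still ordered by country code.
by_score_desc=$(printf '%s\n' "$scores" | LC_ALL=C sort -t ':' -k 3,3nr -k 2,2)
echo "$by_score_desc"
```

In both results Heinz precedes Herman, because DE sorts before NL when the scores are equal.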
If you want to check whether a file is already sorted, you can
use the -c parameter. If
the file was in a sorted order, sort will return the value
0, otherwise 1. We can
check this by echoing the value of the ?
variable, which holds the return value of the last executed
command.
$ sort -c scores.txt ; echo $?
1
$ sort scores.txt | sort -c ; echo $?
0
The second command shows that this actually works, by piping the sorted output of scores.txt to sort -c.
Finally, you can merge two sorted files with the -m parameter, keeping the correct sort order. This is faster than concatenating both files and re-sorting them.
$ sort -m scores-sorted.txt scores-sorted2.txt
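Checking and merging can be combined in a script. The following sketch uses two small, made-up files that are already sorted; the file names, LC_ALL=C, and the use of mktemp -d are assumptions for illustration:

```shell
tmpdir=$(mktemp -d)
printf 'apple\ncherry\n' > "$tmpdir/a"   # already sorted
printf 'banana\ndate\n' > "$tmpdir/b"    # already sorted

# Merge the two sorted files without re-sorting them.
merged=$(LC_ALL=C sort -m "$tmpdir/a" "$tmpdir/b" | paste -s -d ' ' -)
echo "$merged"

# sort -c exits with 0 for sorted input and non-zero otherwise,
# so it can be used directly as a condition.
if printf 'b\na\n' | LC_ALL=C sort -c 2> /dev/null; then
    unsorted_check=sorted
else
    unsorted_check=unsorted
fi
echo "$unsorted_check"

rm -r "$tmpdir"
```

Note that sort -m does not verify its input; merging files that are not sorted gives a result that is not sorted either.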
Since text streams and text files are very important in UNIX, it is often useful to show the differences between two text files. The main utilities for working with file differences are diff and patch. diff shows the differences between files. The output of diff can be processed by patch to apply the changes between two files to a file. “Diffs” also form the basis of version/source management systems. The following sections describe diff and patch.
To have some material to work with, the following two C source files are used to demonstrate these commands. These files are named hello.c and hello2.c.
hello.c:
#include <stdio.h>

void usage(char *programName);

int main(int argc, char *argv[]) {
  if (argc == 1) {
    usage(argv[0]);
    return 1;
  }

  printf("Hello %s!\n", argv[1]);

  return 0;
}

void usage(char *programName) {
  printf("Usage: %s name\n", programName);
}
hello2.c:
#include <stdio.h>
#include <time.h>

void usage(char *programName);

int main(int argc, char *argv[]) {
  if (argc == 1) {
    usage(argv[0]);
    return 1;
  }

  printf("Hello %s!\n", argv[1]);

  time_t curTime = time(NULL);
  printf("The date is %s\n", asctime(localtime(&curTime)));

  return 0;
}

void usage(char *programName) {
  printf("Usage: %s name\n", programName);
}
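Suppose that you received the program hello.c and want to see how hello2.c differs from it. In its most basic form, diff takes two files, file and file2, as its arguments: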
$ diff hello.c hello2.c
1a2
> #include <time.h>
12a14,16
> time_t curTime = time(NULL);
> printf("The date is %s\n", asctime(localtime(&curTime)));
>
The additions from hello2.c are visible in this output. Two different elements can be distilled from it:
1a2
This is an ed command that specifies that text should be appended (a) after line 1 of the first file. The appended text corresponds to line 2 of the second file.
> #include <time.h>
This is the actual text to be appended. The “>” sign is used to mark lines that are added.
The same elements are used to add the second block of text.
What about lines that are removed? We can easily see how they are represented by swapping the two parameters to diff, showing the differences between hello2.c and hello.c:
$ diff hello2.c hello.c
2d1
< #include <time.h>
14,16d12
< time_t curTime = time(NULL);
< printf("The date is %s\n", asctime(localtime(&curTime)));
<
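The following elements can be distinguished:
2d1
This is the ed delete command (d). It states that line 2 of the first file should be deleted, after which both files are in sync at line 1 of the second file.
< #include <time.h>
The text that is going to be removed is preceded by the “<” sign.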
That's enough of the ed-style
output. The GNU diff program included in Slackware Linux
supports so-called unified diffs. Unified diffs are very
readable, and provide context by
default. diff can provide unified output
with the -u flag:
$ diff -u hello.c hello2.c
--- hello.c 2006-11-26 20:28:55.000000000 +0100
+++ hello2.c 2006-11-26 21:27:52.000000000 +0100
@@ -1,4 +1,5 @@
#include <stdio.h>
+#include <time.h>
void usage(char *programName);
@@ -10,6 +11,9 @@
printf("Hello %s!\n", argv[1]);
+ time_t curTime = time(NULL);
+ printf("The date is %s\n", asctime(localtime(&curTime)));
+
return 0;
}
The following elements can be found in the output:
--- hello.c 2006-11-26 20:28:55.000000000 +0100
The name of the original file, and the timestamp of the last modification time.
+++ hello2.c 2006-11-26 21:27:52.000000000 +0100
The name of the changed file, and the timestamp of the last modification time.
@@ -1,4 +1,5 @@
This pair of numbers shows the location and size of the chunk that the text below affects in the original file and the modified file. So, in this case the numbers mean that the affected chunk in the original file starts at line 1, and is four lines long. In the modified file the affected chunk starts at line 1, and is five lines long. Each chunk in diff output starts with such a header.
#include <stdio.h>
A line that is not preceded by a minus (-) or plus (+) sign is unchanged. Unmodified lines are included because they give contextual information, and to avoid creating too many chunks. If there are only a few unmodified lines between changes, diff will choose to make only one chunk, rather than two chunks.
+#include <time.h>
A line that is preceded by a plus sign (+) is an addition to the modified file, compared to the original file.
As with the ed-style diff format, we can see some removals by swapping the file names:
$ diff -u hello2.c hello.c
--- hello2.c 2006-11-26 21:27:52.000000000 +0100
+++ hello.c 2006-11-26 20:28:55.000000000 +0100
@@ -1,5 +1,4 @@
#include <stdio.h>
-#include <time.h>
void usage(char *programName);
@@ -11,9 +10,6 @@
printf("Hello %s!\n", argv[1]);
- time_t curTime = time(NULL);
- printf("The date is %s\n", asctime(localtime(&curTime)));
-
return 0;
}
As you can see from this output, lines that are present in the original file but removed in the modified file are preceded by the minus (-) sign.
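In scripts, the exit status of diff is often as useful as its output: it is 0 when the files are identical and 1 when they differ. The following sketch uses two small, made-up files in a temporary directory (the file names and the use of mktemp -d are assumptions) and also counts the added lines in the unified diff:

```shell
tmpdir=$(mktemp -d)
printf 'one\ntwo\nthree\n' > "$tmpdir/old"
printf 'one\ntwo\n2.5\nthree\n' > "$tmpdir/new"

# diff exits with 0 for identical files and 1 for differing files.
if diff -u "$tmpdir/old" "$tmpdir/new" > "$tmpdir/change.diff"; then
    status=identical
else
    status=different
fi
echo "$status"

# Count added lines: a plus sign followed by another character
# that is not a plus sign, which skips the "+++" header line.
# (An added empty line, a lone "+", is not counted by this pattern.)
added=$(grep -c '^+[^+]' "$tmpdir/change.diff")
echo "added lines: $added"

rm -r "$tmpdir"
```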
When you are working on larger sets of files, it's often useful to compare whole directories. For instance, if you have the original version of a program source in a directory named hello.orig, and a modified version in a directory named hello, you can use the -r parameter to recursively compare both directories. For instance:
$ diff -ru hello.orig hello
diff -ru hello.orig/hello.c hello/hello.c
--- hello.orig/hello.c 2006-12-04 17:37:14.000000000 +0100
+++ hello/hello.c 2006-12-04 17:37:48.000000000 +0100
@@ -1,4 +1,5 @@
#include <stdio.h>
+#include <time.h>
void usage(char *programName);
@@ -10,6 +11,9 @@
printf("Hello %s!\n", argv[1]);
+ time_t curTime = time(NULL);
+ printf("The date is %s\n", asctime(localtime(&curTime)));
+
return 0;
}
It should be noted that this will only compare files that are available in both directories. The GNU version of diff, which is used by Slackware Linux, provides the -N parameter. This parameter treats a file that exists in only one of the two directories as if it were an empty file in the other. So for instance, if we have added a file named Makefile to the hello directory, using the -N parameter will give the following output:
$ diff -ruN hello.orig hello
diff -ruN hello.orig/hello.c hello/hello.c
--- hello.orig/hello.c 2006-12-04 17:37:14.000000000 +0100
+++ hello/hello.c 2006-12-04 17:37:48.000000000 +0100
@@ -1,4 +1,5 @@
#include <stdio.h>
+#include <time.h>
void usage(char *programName);
@@ -10,6 +11,9 @@
printf("Hello %s!\n", argv[1]);
+ time_t curTime = time(NULL);
+ printf("The date is %s\n", asctime(localtime(&curTime)));
+
return 0;
}
diff -ruN hello.orig/Makefile hello/Makefile
--- hello.orig/Makefile 1970-01-01 01:00:00.000000000 +0100
+++ hello/Makefile 2006-12-04 17:39:44.000000000 +0100
@@ -0,0 +1,2 @@
+hello: hello.c
+ gcc -Wall -o $@ $<
As you can see the chunk indicator says that the chunk in the original file starts at line 0, and is 0 lines long.
UNIX users often exchange the output of diff, usually called “diffs” or “patches”. The next section will show you how you can handle diffs. But you are now able to create them yourself, by redirecting the output of diff to a file. For example:
$ diff -u hello.c hello2.c > hello_add_date.diff
If you have multiple diffs, you can easily combine them to one diff, by concatenating the diffs:
$ cat diff1 diff2 diff3 > combined_diff
But make sure that they were created from the same directory if you want to use the patch utility that is covered in the next section.
Suppose that somebody would send you the output of diff for a file that you have created. It would be tedious to manually incorporate all the changes that were made. Fortunately, the patch command can do this for you. patch accepts diffs on the standard input, and will try to change the original file according to the differences that are registered in the diff. So, for instance, if we have the hello.c file and the diff that was created earlier from the changes between hello.c and hello2.c, we can patch hello.c so that it becomes identical to hello2.c:
$ patch < hello_add_date.diff
patching file hello.c
If you have hello2.c, you can check whether both files are now identical:
$ diff -u hello.c hello2.c
There is no output, so this is the case. One of the nice
features of patch is that it can revert
the changes made through a diff, by using the -R parameter:
$ patch -R < hello_add_date.diff
In these examples, the original file is patched. Sometimes you may want to apply the patch to a file with a different name. You can do this by providing the name of a file as the last argument:
$ patch helloworld.c < hello_add_date.diff
patching file helloworld.c
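The whole cycle of creating a diff, applying it, and reverting it can be sketched in a few lines. The files below are made up and live in a temporary directory; mktemp -d and the variable names are assumptions, and the patch steps are skipped when no patch command is installed:

```shell
tmpdir=$(mktemp -d)
printf 'one\ntwo\nthree\n' > "$tmpdir/old"
printf 'one\ntwo\n2.5\nthree\n' > "$tmpdir/new"
cp "$tmpdir/old" "$tmpdir/work"

# diff exits with 1 because the files differ; that is expected here.
diff -u "$tmpdir/old" "$tmpdir/new" > "$tmpdir/change.diff" || true

if command -v patch > /dev/null 2>&1; then
    # Apply the diff to the working copy, naming the target explicitly.
    patch "$tmpdir/work" < "$tmpdir/change.diff" > /dev/null
    if cmp -s "$tmpdir/work" "$tmpdir/new"; then applied=yes; else applied=no; fi

    # Revert the change again with -R.
    patch -R "$tmpdir/work" < "$tmpdir/change.diff" > /dev/null
    if cmp -s "$tmpdir/work" "$tmpdir/old"; then reverted=yes; else reverted=no; fi
else
    applied=skipped
    reverted=skipped
fi
echo "applied: $applied, reverted: $reverted"

rm -r "$tmpdir"
```

cmp -s compares two files silently and reports the result through its exit status, which makes it convenient for this kind of check.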
You can also use patch with diffs that
were generated with the -r parameter, but you have to
take a bit of care. Suppose that the header of a particular
file in the diff is as follows:
--------------------------
|diff -ruN hello.orig/hello.c hello/hello.c
|--- hello.orig/hello.c 2006-12-04 17:37:14.000000000 +0100
|+++ hello/hello.c 2006-12-04 17:37:48.000000000 +0100
--------------------------
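If you process this diff with patch, it will attempt to change hello.c in the current directory, because by default only the base name of the file is used. You can tell patch how much of the path to use with -p n, where n is the number of pathname components that should be stripped. A value of 0 will use the path as it is specified in the patch, 1 will strip the first pathname component, etc. In this example, stripping the first component will result in patching of hello.c, so the patch has to be applied from within the hello.orig directory:
$ cd hello.orig
$ patch -p 1 < ../hello.diff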
Or, you can use the -d parameter to specify in which
directory the change has to be applied:
$ patch -p 1 -d hello.orig < hello.diff
patching file hello.c
patching file Makefile
If you want to keep a backup when you are changing a file, you can use the -b parameter of patch. This will make a copy of every affected file named filename.orig:
$ patch -b < hello_add_date.diff
$ ls -l hello.c*
-rw-r--r-- 1 daniel daniel 382 2006-12-04 21:41 hello.c
-rw-r--r-- 1 daniel daniel 272 2006-12-04 21:12 hello.c.orig
Sometimes a file cannot be patched. For instance, if it has already been patched, if it has changed too much to apply the patch cleanly, or if the file does not exist at all. In this case, the chunks that could not be applied are stored in a file with the name filename.rej.
In daily life, you will often want to find text that matches a certain pattern, rather than a literal string. Many UNIX utilities implement a language for matching text patterns: regular expressions (regexps). Over time the regular expression language has grown, and there are now basically three regular expression syntaxes:
Traditional UNIX regular expressions.
POSIX extended regular expressions.
Perl-compatible regular expressions (PCRE).
POSIX regexps are mostly a superset of traditional UNIX regexps, and PCREs a superset of POSIX regexps. The syntax that an application supports differs per application, but almost all applications support at least POSIX regexps.
Each syntactical unit in a regexp expresses one of the following things:
A character: this is the basis of every regular expression, a character or a set of characters to be matched. For instance, the letter p or the comma sign (,).
Quantification: a quantifier specifies how many times the preceding character or set of characters should be matched.
Alternation: alternation is used to match “a or b” in which a and b can be a character or a regexp.
Grouping: this is used to group subexpressions, so that quantification or alternation can be applied to the group.
This section describes traditional UNIX regexps. Because of a lack of standardisation, the exact syntax may differ a bit per utility. Usually, the manual page of a command provides more detailed information about the supported basic or traditional regular expressions. It is a good idea to learn traditional regexps, but to use POSIX regexps for your own scripts.
Characters are matched by themselves. If a specific character is used as a syntactic character for regexps, you can match that character by adding a backslash. For instance, \+ matches the plus character.
A period (.) matches any character, for instance, the regexp b.g matches bag, big, and blg, but not bit.
The period character often provides too much freedom. You can use square brackets ([]) to specify characters which can be matched. For instance, the regexp b[aei]g matches bag, beg, and big, but nothing else. You can also match any character but the characters in a set by using the square brackets, and using the caret (^) as the first character. For instance, b[^aei]g matches any three-character string that starts with b and ends with g, with the exception of bag, beg, and big. It is also possible to match a range of characters with a dash (-). For example, a[0-9] matches a followed by a single digit.
Two special characters, the caret (^) and the dollar sign ($), respectively match the start and end of a line. This is very handy for parsing files. For instance, you can match all lines that start with a hash (#) with the regexp ^#.
The simplest quantification sign that traditional regular expressions support is the (Kleene) star (*). This matches zero or more instances of the preceding character. For instance, ba* matches b, ba, baa, etc. You should be aware that a single character followed by a star without any context matches every string, because c* also matches a string that has zero c characters.
More specific repetitions can be specified with backslash-escaped curly braces. \{x,y\} matches the preceding character at least x times, but not more than y times. So, ba\{1,3\} matches ba, baa, and baaa.
Backslash-escaped parentheses group various characters together, so that you can apply quantification or alternation to a group of characters. For instance, \(ab\)\{1,3\} matches ab, abab, and ababab.
A backslash-escaped vertical bar (\|) allows you to match either of two expressions. This is not useful for single characters, because a\|b is equivalent to [ab], but it is very useful in conjunction with grouping. Suppose that you would like an expression that matches apple and pear, but nothing else. This can be done easily with the vertical bar: \(apple\)\|\(pear\).
POSIX regular expressions build upon traditional regular expressions, adding some other useful primitives. Another comforting difference is that grouping parentheses, quantification braces, and the alternation sign (|) are not backslash-escaped. If they are escaped, they will match the literal characters instead, which is the opposite of the behavior of traditional regular expressions. Most people find POSIX extended regular expressions much more comfortable, making them more widely used.
Normal character matching has not changed compared to the traditional regular expressions described in Section 9.2.2.1, “Matching characters”.
Besides the Kleene star (*), which matches the preceding character or group zero or more times, POSIX extended regular expressions add two new simple quantification primitives. The plus sign (+) matches the preceding character or group one or more times. For example, a+ matches a (or any string with more consecutive a's), but does not match zero a's. The question mark character (?) matches the preceding character zero or one time. So, ba?d matches bd and bad, but not baad or bed.
Curly braces are used for repetition, as in traditional regular expressions, but the backslashes should be omitted. To match ba and baa, one should use ba{1,2} rather than ba\{1,2\}.
Grouping is done in the same manner as in traditional regular expressions, leaving out the escape backslashes before the parentheses. For example, (ab){1,3} matches ab, abab, and ababab.
We have now arrived at one of the most important utilities of the UNIX System, and the first occasion to try and use regular expressions. The grep command is used to search a text stream or a file for a pattern. This pattern is a regular expression, and can either be a basic regular expression or a POSIX extended regular expression (when the -E parameter is used). By default, grep will write the lines that were matched to the standard output. In the most basic syntax, you can specify a regular expression as an argument, and grep will search for matches in the text from the standard input. This is a nice way to practice a bit with regular expressions.
$ grep '^\(ab\)\{2,3\}$'
ab
abab
abab
ababab
ababab
abababab
The example listed above shows a basic regular expression in
action, that matches a line solely consisting of two or three
times the ab string. You can do the same
thing with POSIX extended regular expressions, by adding the
-E (for extended)
parameter:
$ grep -E '^(ab){2,3}$'
ab
abab
abab
ababab
ababab
abababab
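When grep reads from a terminal, typed input and matched output are interleaved, which can be hard to follow. A non-interactive sketch of the same comparison feeds made-up test strings through printf; paste -s is only used to show each result on one line:

```shell
input=$(printf '%s\n' ab abab ababab abababab)

# Basic regular expression: groups and braces are backslash-escaped.
basic=$(printf '%s\n' "$input" | grep '^\(ab\)\{2,3\}$' | paste -s -d ' ' -)

# Extended regular expression: the same pattern without backslashes.
extended=$(printf '%s\n' "$input" | grep -E '^(ab){2,3}$' | paste -s -d ' ' -)

# The ? quantifier: zero or one occurrence of the preceding character.
optional=$(printf '%s\n' bd bad baad bed | grep -E '^ba?d$' | paste -s -d ' ' -)

echo "basic:    $basic"
echo "extended: $extended"
echo "optional: $optional"
```

The first two commands select the same lines, since the patterns differ only in notation.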
Since the default behavior of grep is to read from the standard input, you can add it to a pipeline to get the interesting parts of the output of the preceding commands in the pipeline. For instance, if you would like to search for the string 2006 in the third column in a file, you could combine the cut and grep command:
$ cut -f 3 | grep '2006'
Naturally, grep can also directly read a file, rather than the standard input. As usual, this is done by adding the files to be read as the last arguments. The following example will print all lines from the /etc/passwd file that start with daniel:
$ grep "^daniel" /etc/passwd
daniel:*:1001:1001:Daniel de Kok:/home/daniel:/bin/sh
With the -r option, grep will recursively traverse a directory structure, trying to find matches in each file that is encountered during the traversal. However, in scripts that have to be portable it is better to combine grep with find and its -exec operand.
$ grep -r 'somepattern' somedir
is the non-portable functional equivalent of
$ find somedir -type f -exec grep 'somepattern' {} \; -print
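A variation on this idea lists only the names of the files that contain a match. The sketch below builds a small, made-up directory tree in a temporary directory (mktemp -d and all file names are assumptions) and uses grep -l together with the {} + form of -exec, which passes many file names to one grep invocation:

```shell
tmpdir=$(mktemp -d)
mkdir "$tmpdir/sub"
printf 'needle\n' > "$tmpdir/a.txt"
printf 'hay\n' > "$tmpdir/sub/b.txt"
printf 'another needle\n' > "$tmpdir/sub/c.txt"

# grep -l prints only the names of files that contain a match.
matches=$(find "$tmpdir" -type f -exec grep -l 'needle' {} + | LC_ALL=C sort)
echo "$matches"

# Count the matching files.
count=$(printf '%s\n' "$matches" | wc -l)
count=$((count))
echo "files with a match: $count"

rm -r "$tmpdir"
```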
grep can also print all lines that do not
match the pattern that was used. This is done by adding the
-v parameter:
$ grep -Ev '^(ab){2,3}$'
ab
ab
abab
ababab
abababab
abababab
If you want to use the pattern in a case-insensitive manner,
you can add the -i
parameter. For example:
$ grep -i "a"
a
a
A
A
You can also match a string literally with the -F parameter:
$ grep -F 'aa*'
a
aa*
aa*
As we have seen, you can use the alternation character (|) to match either of two or more subpatterns. If two patterns that you would like to match differ a lot, it is often more comfortable to make two separate patterns. grep allows you to use more than one pattern by separating patterns with a newline character. So, for example, if you would like to print lines that match either the a or b pattern, this can be done easily by starting a new line:
$ grep 'a
b'
a
a
b
b
c
This works, because quotes are used, and the shell passes quoted parameters literally. It must be admitted, though, that this is not very pretty. grep accepts one or more -e pattern parameters, giving the opportunity to specify more than one pattern on one line. The grep invocation in the previous example could be rewritten as:
$ grep -e 'a' -e 'b'
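Multiple -e parameters and alternation in an extended regular expression select the same lines. The following sketch compares both on a made-up word list and also shows the inverse selection with -v; paste -s is only used to show each result on one line:

```shell
fruit=$(printf '%s\n' apple pear pineapple cherry)

# Two separate patterns, each given with -e.
with_e=$(printf '%s\n' "$fruit" | grep -e '^apple$' -e '^pear$' | paste -s -d ' ' -)

# One extended regular expression using alternation and grouping.
with_alt=$(printf '%s\n' "$fruit" | grep -E '^(apple|pear)$' | paste -s -d ' ' -)

# -v selects the lines that do not match.
others=$(printf '%s\n' "$fruit" | grep -Ev '^(apple|pear)$' | paste -s -d ' ' -)

echo "with -e:          $with_e"
echo "with alternation: $with_alt"
echo "everything else:  $others"
```

The anchors ^ and $ keep pineapple from matching, even though it contains apple.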
[6] Of course, that will not really matter in this case, because we don't use numbers higher than 9, and virtually all character sets have the digits in numerical order.