
Process text faster with Awk and Sed

Make your life easier by using Sed and Awk to work with text from the terminal


Sed and Awk are two classic and distinctive UNIX command line tools. The name Sed comes from “stream editor”, and the tool was developed between 1973 and 1974 by Lee E. McMahon. Sed processes its input line by line and is mainly used for text substitution. The name Awk is derived from the surnames of its three authors – Alfred Aho, Peter Weinberger and Brian Kernighan – and Awk itself is a handy pattern matching programming language that is Turing-complete.

Sed can handle and quickly process arbitrarily long inputs – its processing capabilities are similar to those of the ed editor (including regular expression support). Nevertheless, its abilities are limited because it only makes one pass over the data, cannot go backwards and cannot manipulate numbers. Awk resolves most of the limitations of Sed because it is based more on the C programming language than on the ed text editor. Awk reads its input line by line and automatically splits each line into fields. It allows you to analyse, extract and report data, and it supports arrays, associative arrays and regular expressions. Neither of these two utilities alters its input files, and in tandem they are a powerful pair of tools.


01 Install Awk and Sed

The good news is that every Linux distribution comes with Sed and at least one version of Awk preinstalled, so you do not have to install anything to follow along. Nevertheless, if you want to install a different Awk variant, you may need to do it either manually or by using the package manager of your Linux distribution.
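If you are unsure which implementation you have, the following commands will tell you (the --version flag works with Gawk and recent Mawk but not every variant, hence the redirection; the package names in the comments are assumptions – adjust them for your distribution):

```shell
# Check that sed and awk are present, and which awk variant answers:
command -v sed
command -v awk
awk --version 2>/dev/null | head -n 1

# Installing another variant (package names assumed; adjust per distro):
# sudo apt-get install gawk    # Debian/Ubuntu
# sudo dnf install gawk        # Fedora
```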


02 Using Sed

The following Sed code globally replaces a given string with another in a text file:

$ sed 's/string1/string2/g' textfile

If you want to edit the file in place while saving its original contents to a backup file, you should use the -i option with a suffix as follows:

$ sed -i.bak 's/three/3/g' text

The previous command replaces every occurrence of the word “three” with “3”, saves the output to the file “text” and keeps the original content inside “text.bak”. If you want to perform a case-insensitive global search and replace, you should use the following command:

$ sed -i.bak 's/three/3/gi' text
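As a quick sanity check, here is the command above run against a throwaway file (the file name and contents are made up for illustration; note that the attached-suffix form -i.bak and the case-insensitive “i” flag are GNU sed features):

```shell
# Create a small sample file, edit it in place, and inspect both files:
printf 'one three\nthree four\nTHREE five\n' > text
sed -i.bak 's/three/3/gi' text
cat text        # the edited version: "one 3", "3 four", "3 five"
cat text.bak    # the untouched original content
```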


03 Using Awk

The “Hello World!” program in Awk can be written in two ways: you can either give the code directly on the command line or store it in a separate file. It is better to store your Awk code in a separate file, especially if you are going to reuse it. By default, Awk splits input lines into fields based on whitespace (spaces and tabs). You can change this default behaviour by using the -F command line option and providing another character as the field separator.
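A minimal sketch of both versions, plus the -F option (the file name hello.awk is illustrative):

```shell
# Command-line version:
awk 'BEGIN { print "Hello World!" }'

# Stored in a separate file and run with -f:
echo 'BEGIN { print "Hello World!" }' > hello.awk
awk -f hello.awk

# Changing the field separator with -F, here to ":":
echo 'root:x:0:0' | awk -F: '{ print $1 }'    # prints "root"
```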


04 A Sed example

Sed allows you to make changes to multiple files at once. The following example changes the path of the Perl binary from “/usr/local/bin/perl” to “/usr/bin/perl” in multiple files (add -i if you want the changes written back to the files rather than printed to standard output):

$ sed 's/\/usr\/local\/bin\/perl/\/usr\/bin\/perl/g' *.pl

Note that you have to use the escape character “\” in order to use the “/” character as a normal character inside the substitution. Always remember that once a line is read by Sed, the previous line is gone forever.
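As an aside, Sed accepts almost any character as the delimiter of the s command, which avoids the escaping entirely; here the same substitution uses “|” on a throwaway file (the file name and contents are made up for illustration):

```shell
# Create a sample .pl file whose shebang we want to rewrite:
echo '#!/usr/local/bin/perl' > demo.pl

# With "|" as delimiter, the slashes need no escaping:
sed 's|/usr/local/bin/perl|/usr/bin/perl|g' demo.pl    # prints "#!/usr/bin/perl"
```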


05 An Awk example

If you want to print the first and third columns of a text file, you should run the following Awk command:

$ awk < file '{ print $1, $3 }'

If you want to print the second and third columns in reverse order, just execute the following command:

$ awk < file '{ print $3, $2 }'

To print the last two columns of the input, use:

$ awk '{print $(NF - 1), $NF}'

Here, NF is a variable that holds the number of fields in the current line and $NF is the last field on the current line. The $0 variable represents the current line.
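A quick demonstration of these variables on a made-up four-field line:

```shell
# NF is the field count, $NF the last field, $(NF - 1) the one before it:
echo 'a b c d' | awk '{ print NF, $NF, $(NF - 1) }'   # prints "4 d c"

# $0 holds the whole current line:
echo 'a b c d' | awk '{ print $0 }'                   # prints "a b c d"
```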


06 Awk variables

Awk has built-in variables but also supports user-defined variables. The following Awk code uses the n variable to count the number of lines that contain a given string:

$ awk '/three/ { n++ }; END { print n+0 }'

You might say that if you combine grep and wc you can achieve the same task without writing an Awk program, but Awk allows you to do additional processing of the data. The FILENAME built-in variable holds the name of the file currently being processed, but you cannot use it inside the BEGIN block because it is not defined yet.
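Both points can be demonstrated with a throwaway input file (the file name and contents are made up for illustration):

```shell
# Count the lines that contain "three"; n+0 forces numeric output even
# when no line matched and n was never set:
printf 'one\nthree\nthree four\n' > nums.txt
awk '/three/ { n++ }; END { print n + 0 }' nums.txt      # prints 2

# FILENAME is available while records are being read:
awk 'FNR == 1 { print "now reading", FILENAME }' nums.txt
```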


07 Doing simple things with Awk

The following command finds the number of unique IPs stored in an Apache log file:

$ cat access.log | awk '{print $1}' | sort | uniq | wc -l

The Awk '{print $1}' command prints the first column of each line, which is the IP address of the client machine. The next command finds the total Mbytes transferred by summing up the contents of the tenth field:

$ awk '{ sum += $10 } END { print sum/1024/1024 " Mbytes" }' access.log
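To see both commands in action without a real log file, here are two invented lines in Apache's combined log format, in which field 1 is the client IP and field 10 is the response size in bytes (the IPs, timestamps and sizes are made up):

```shell
cat > sample.log <<'EOF'
1.2.3.4 - - [27/Jul/2014:06:34:09 +0300] "GET / HTTP/1.1" 200 1048576 "-" "curl"
5.6.7.8 - - [27/Jul/2014:06:35:10 +0300] "GET /a HTTP/1.1" 200 1048576 "-" "curl"
EOF

# Two distinct client IPs:
cat sample.log | awk '{print $1}' | sort | uniq | wc -l

# 2 * 1048576 bytes = 2 Mbytes transferred in total:
awk '{ sum += $10 } END { print sum/1024/1024 " Mbytes" }' sample.log
```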


08 Awk built-in functions

Awk has built-in functions for string, number, time and date manipulation. You can easily generate random numbers using Awk, but be warned that Awk’s rand() function returns the same sequence on every run unless you first seed it with srand(). The presented example uses an array named rnd to store the generated random numbers.
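The exact script from this step is not reproduced here, so the following is only a sketch along the lines described: it seeds the generator with srand() and stores five random integers in the rnd array:

```shell
# srand() with no argument seeds from the current time, so each run differs:
awk 'BEGIN {
    srand()
    for (i = 1; i <= 5; i++)
        rnd[i] = int(100 * rand())    # random integers between 0 and 99
    for (i = 1; i <= 5; i++)
        print rnd[i]
}'
```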


09 Pattern matching

The following Awk code identifies whether the input is an integer, a string or an empty line:

/[0-9]+/    { print "This is an integer" }
/[A-Za-z]+/ { print "This is a string" }
/^$/        { print "This is an empty line!" }

Please note that a line of input can match more than one rule. An Awk script can become a standalone program by making the file executable and inserting the following line at its start (similar to a Perl script):

#!/usr/bin/awk -f
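Putting the three rules above into an executable script might look like this (the file name classify.awk is illustrative, and the shebang assumes awk lives in /usr/bin):

```shell
cat > classify.awk <<'EOF'
#!/usr/bin/awk -f
/[0-9]+/    { print "This is an integer" }
/[A-Za-z]+/ { print "This is a string" }
/^$/        { print "This is an empty line!" }
EOF
chmod +x classify.awk

# Feed it one integer, one word and one empty line:
printf '42\nhello\n\n' | ./classify.awk
```

Each of the three input lines triggers exactly one rule here, but a mixed line such as “abc123” would trigger the first two.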


10 Processing Apache Log files using Awk

The next command uses several UNIX command line utilities to calculate the total number of requests per hour found in an Apache log file:

$ cat access.log | cut -d[ -f2 | cut -d] -f1 | awk -F: '{print $2}' | sort -n | uniq -c | awk '{print $2, $1}' > timeN.txt

First, it prints the log file on the standard output (cat access.log) and uses it as the input to the next command (cut -d[ -f2) that deletes everything from the beginning of each line to the first “[“ character. The next command (cut -d] -f1) does something similar: it deletes everything from the first “]” character found in the line to the end of the line.

So, up until now the output looks similar to the following:

27/Jul/2014:06:34:09 +0300

The awk -F: '{print $2}' command defines the “:” character as the field separator and prints the second field, which contains the hour of the request. The “sort -n” command sorts the output numerically rather than alphabetically. The “uniq” command omits repeated lines, but when used with the “-c” option it prefixes each line with its number of occurrences. Finally, awk '{print $2, $1}' swaps the two columns of the output. The “> timeN.txt” part saves the output to a file named timeN.txt. If you sort the output by the second column, you can find the busiest hours of the day.
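That final sorting step, assuming timeN.txt holds “hour count” pairs as produced above, can be sketched as follows (the sample data is invented):

```shell
# Stand-in data for timeN.txt: hour of day, then number of requests:
printf '06 12\n07 99\n08 3\n' > timeN.txt

# Sort numerically on the second column, busiest hour first:
sort -k2 -rn timeN.txt | head    # "07 99" comes out on top
```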


11 Word counting

This is a relatively challenging example that shows some of the advanced capabilities of Awk. The presented Awk script (wordFreq.awk) reads a text file and, with the help of associative arrays, counts how many times each word appears in it. If you want to remove all characters except letters, numbers and spaces, you should first filter your input using Sed:

$ sed 's/[^a-zA-Z0-9 ]//g' text | ./wordFreq.awk | sort -k3rn

Last, if you want to convert all words to lowercase to avoid counting the same word twice, the following Awk loop, run for every input line, will do the job:

for (i = 1; i <= NF; i++) $i = tolower($i)
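The wordFreq.awk script itself is not listed in the text, so here is a minimal sketch of what such a script might contain (its contents are an assumption); it combines the associative-array counting with the tolower() loop, and puts the count in the third output column so that sort -k3rn works as shown above:

```shell
cat > wordFreq.awk <<'EOF'
#!/usr/bin/awk -f
{
    # Fold every field to lowercase, then count it in an associative array.
    for (i = 1; i <= NF; i++) {
        $i = tolower($i)
        count[$i]++
    }
}
END {
    # One line per distinct word, count in column 3.
    for (word in count)
        print word, "appears", count[word]
}
EOF
chmod +x wordFreq.awk

printf 'The quick the Quick fox\n' | ./wordFreq.awk
```

Note that the for (word in count) loop visits the array in an unspecified order, which is exactly why the article pipes the output through sort.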


12 Nawk, Gawk and Mawk

On a Linux system, Nawk is usually a symbolic link to either Gawk or Mawk. The biggest difference between the various variants and the original version of Awk is that the variants support a larger set of built-in functions and variables. Nowadays, every Linux system uses an improved version of the Awk utility.


13 More Apache Log file processing

Another trick for you – to find the total number of unique IP addresses seen on the current day, you should execute the following command:

$ cat access.log | grep `date '+%d/%b/%Y'` | awk '{print $1}' | sort | uniq -c | wc -l

The following command finds the top ten requests on a log file, grouped by the IP address:

$ cat access.log | awk '{print "requests from " $1}' | sort | uniq -c | sort -nr | head

The following command prints the total number of requests grouped by the status code, sorted by the numeric value of the status code:

$ awk '{print $9}' access.log | sort -rn | uniq -c


14 Plan to succeed

Awk may not be a complete replacement for either Perl or Python but, as you can see here, it does its job well and allows you to work efficiently with text files. Try writing the “Word Counting” program or processing Apache log files in C and you will quickly come to appreciate Awk. Awk programs are also easier to debug because they are smaller. On the other hand, if you want to store the results in a MySQL database, Perl or Python would be a smarter choice because they can communicate directly with a MySQL database. The main point here is that learning and using Sed and Awk can be very beneficial for every Linux user or administrator.

 
