
awk is one of the most powerful utilities used in the Unix world. Whenever it comes to text parsing, sed and awk do some unbelievable things. In this first article on awk, we will see the basic usage of awk.

The syntax of awk is:

awk 'pattern{action}' file

where pattern is the condition on which the action is executed, for every line matching the pattern. If no pattern is present, the action is executed for every line of the file. If no action is present, the default action of printing the line is performed. Let us see some examples:

Assume a file, say file1, with the following content:

$ cat file1

Name Domain

Deepak Banking

Neha Telecom

Vijay Finance

Guru Migration

This file has 2 fields. The first field is the name of a person, the second the domain of their expertise; the first line is the header record.

1. To print only the names present in the file:

$ awk '{print $1}' file1

Name

Deepak

Neha

Vijay

Guru
The above awk command does not have any pattern or condition. Hence, the action is executed on every line of the file. The action statement reads "print $1". awk, while reading a file, splits the different columns into $1, $2, $3 and so on; the first column is accessible using $1, the second using $2, etc. Hence the above command prints all the names, which happen to be the first column in the file.

2. Similarly, to print the second column of the file:

$ awk '{print $2}' file1

Domain

Banking

Telecom

Finance

Migration

3. In the first example, the list of names got printed along with the header record. How to omit the
header record and get only the names printed?

$ awk 'NR!=1{print $1}' file1

Deepak

Neha

Vijay

Guru

The above awk command uses a special variable NR. NR denotes the line number, ranging from 1 to the actual line count. The condition 'NR!=1' indicates not to execute the action for the first line of the file, and hence the header record gets skipped.
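As a small sketch (with a made-up input), NR can itself be printed to see the numbering awk assigns to each line:

```shell
# Print NR alongside each line to see awk's line numbering.
printf 'Name\nDeepak\nNeha\n' | awk '{print NR, $0}'
# 1 Name
# 2 Deepak
# 3 Neha
```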

4. How do we print the entire file contents?

$ awk '{print $0}' file1

Name Domain
Deepak Banking

Neha Telecom

Vijay Finance

Guru Migration

$0 stands for the entire line. And hence when we do "print $0", the whole line gets printed.

5. How do we get the entire file content printed in another way?

$ awk '1' file1

Name Domain

Deepak Banking

Neha Telecom

Vijay Finance

Guru Migration

The above awk command has only the pattern or condition part, and no action part. The '1' in the pattern evaluates to "true", which means true for every line. As said above, a missing action part means the default action of printing the line, and hence the entire file contents get printed.
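The two defaults, a missing pattern and a missing action, can be seen side by side in a short sketch (the two-line input here is made up for illustration):

```shell
# A throwaway two-line input.
printf 'hello\nworld\n' > /tmp/demo.txt

# No pattern: the action runs on every line.
awk '{print $1}' /tmp/demo.txt
# hello
# world

# No action: matching lines are printed, which is the default action.
awk '/hello/' /tmp/demo.txt
# hello
```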

Let us now consider a file with a delimiter. The delimiter used here is a comma; a comma-separated file is called a CSV file. Assume the file contents to be:

$ cat file1

Name,Domain,Expertise

Deepak,Banking,MQ Series

Neha,Telecom,Power Builder

Vijay,Finance,CRM Expert

Guru,Migration,Unix
This file contains 3 fields. The new field is the expertise of the respective person.

6. Let us try to print the first column of this csv file using the same method as mentioned in Point
1.

$ awk '{print $1}' file1

Name,Domain,Expertise

Deepak,Banking,MQ

Neha,Telecom,Power

Vijay,Finance,CRM

Guru,Migration,Unix

The output looks weird, doesn't it? We expected only the first column to get printed, but it printed a little more, and not in a predictable way. If you notice carefully, it printed every line up to the first space. awk, by default, uses whitespace as the delimiter, which could be a single space, a tab, or a series of spaces. Hence our file was split into fields based on spaces.

Since our requirement now involves dealing with a file which is comma separated, we need to
specify the delimiter.

$ awk -F"," '{print $1}' file1

Name

Deepak

Neha

Vijay

Guru

awk has a command line option "-F" with which we can specify the delimiter. Once the delimiter is specified, awk splits the file on the basis of that delimiter, and hence we got the names by printing the first column, $1.

7. awk has a special variable called "FS", which stands for field separator. In place of the command line option "-F", we can also use "FS".
$ awk '{print $1,$3}' FS="," file1

Name Expertise

Deepak MQ Series

Neha Power Builder

Vijay CRM Expert

Guru Unix
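FS can also be set in a BEGIN block, which runs before any input is read; a quick sketch of the same first-column extraction on a single made-up record:

```shell
# Setting the field separator in a BEGIN block instead of -F
# or a trailing FS="," assignment.
printf 'Deepak,Banking,MQ Series\n' | awk 'BEGIN{FS=","}{print $1}'
# Deepak
```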

8. Similarly, to print the second column:

$ awk -F, '{print $2}' file1

Domain

Banking

Telecom

Finance

Migration

9. To print the first and third columns, ie., the name and the expertise:

$ awk -F"," '{print $1, $3}' file1

Name Expertise

Deepak MQ Series

Neha Power Builder

Vijay CRM Expert

Guru Unix

10. The output shown above is not easily readable, since the third column has more than one word. It would be better if the displayed fields were separated by a delimiter. Let us use a comma to separate the output, and also discard the header record.

$ awk -F"," 'NR!=1{print $1,$3}' OFS="," file1

Deepak,MQ Series

Neha,Power Builder

Vijay,CRM Expert

Guru,Unix

OFS is another awk special variable. Just like how FS is used to separate the input fields, OFS
(Output field separator) is used to separate the output fields.
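One detail worth noting: OFS is used when fields are printed separated by commas in a print statement, and also when $0 is rebuilt because a field was modified. A sketch with a made-up input:

```shell
# OFS joins the fields listed in print with commas...
printf 'a b c\n' | awk 'BEGIN{OFS="-"}{print $1,$2,$3}'
# a-b-c

# ...and also applies when $0 is rebuilt after touching a field.
printf 'a b c\n' | awk 'BEGIN{OFS="-"}{$1=$1; print}'
# a-b-c
```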

In one of our earlier articles, we saw how to read a file in awk. At times, we might have some
requirements wherein we need to pass some arguments to the awk program or to access a shell
variable or an environment variable inside awk. Let us see in this article how to pass and access
arguments in awk:

Let us take a sample file with contents, and a variable "x":

$ cat file1

24

12

34

45

$ echo $x

3

Now, say we want to add the shell variable x to every value.

1. awk provides a "-v" option to pass arguments. Using this, we can pass the shell variable to it.
$ awk -v val=$x '{print $0+val}' file1

27

15

37

48

As seen above, the shell variable $x is assigned to the awk variable "val". This variable "val" can
directly be accessed in awk.

2. awk provides another way of passing arguments without using -v. Just before specifying the file name to awk, provide the shell-variable assignments to awk variables as shown below:

$ awk '{print $0,val}' OFS=, val=$x file1

24,3

12,3

34,3

45,3

3. How to access environment variables in awk? Unlike shell variables, awk provides a way to access environment variables without passing them as above. awk has a special variable ENVIRON which does the needful.

$ echo $x

3

$ export x

$ awk '{print $0,ENVIRON["x"]}' OFS=, file1

24,3

12,3

34,3
45,3
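ENVIRON picks up any variable present in awk's environment, not just x. As a sketch, a throwaway variable (here called DEMO, purely for illustration) can be exported for a single command:

```shell
# The environment assignment prefix exports DEMO only for this command.
DEMO=42 awk 'BEGIN{print ENVIRON["DEMO"]}'
# 42
```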

Quoting file content:

Sometimes we might have a requirement wherein we have to quote the file contents. Assume you have a file which contains a list of database tables, and for your requirement, you need to quote the file contents:

$ cat file

CUSTOMER

BILL

ACCOUNT

4. To single quote the contents, pass a variable containing the quote character to awk, and print quote, line, quote.

$ awk -v q="'" '{print q $0 q}' file

'CUSTOMER'

'BILL'

'ACCOUNT'

5. Similarly, to double quote the contents, pass the variable within single quotes:

$ awk '{print q $0 q}' q='"' file

"CUSTOMER"

"BILL"

"ACCOUNT"
In one of our earlier articles in the awk series, we saw the basic usage of awk or gawk. In this one, we will see mainly how to search for a pattern in a file with awk: in the entire line, or in a specific column.

Let us consider a CSV file with the following contents. The data in the file is a kind of expense report. Let us see how to use awk to filter data from the file.

$ cat file

Medicine,200

Grocery,500

Rent,900

Grocery,800

Medicine,600

1. To print only the records containing Rent:

$ awk '$0 ~ /Rent/{print}' file

Rent,900

~ is the symbol used for pattern matching. The / / symbols are used to specify the pattern. The above line indicates: if the line ($0) contains (~) the pattern Rent, print the line. The 'print' statement, by default, prints the entire line. This is actually a simulation of the grep command using awk.

2. awk, while doing pattern matching, by default matches against the entire line, and hence $0 can be left out as shown below:

$ awk '/Rent/{print}' file

Rent,900

3. Since awk prints the line by default on a true condition, print statement can also be left off.

$ awk '/Rent/' file


Rent,900

In this example, whenever the line contains Rent, the condition becomes true and the line gets
printed.

4. In the above examples, the pattern matching is done on the entire line; however, the pattern we are looking for is only in the first column. This might lead to incorrect results if the file contains the word Rent elsewhere. To match a pattern only in the first column ($1):

$ awk -F, '$1 ~ /Rent/' file

Rent,900

The -F option in awk is used to specify the delimiter. It is needed here since we are going to work
on the specific columns which can be retrieved only when the delimiter is known.

5. The above pattern match will also match if the first column contains "Rents". To match exactly
for the word "Rent" in the first column:

$ awk -F, '$1=="Rent"' file

Rent,900

6. To print only the 2nd column for all "Medicine" records:

$ awk -F, '$1 == "Medicine"{print $2}' file

200

600

7. To match for patterns "Rent" or "Medicine" in the file:

$ awk '/Rent|Medicine/' file

Medicine,200

Rent,900
Medicine,600

8. Similarly, to match for this above pattern only in the first column:

$ awk -F, '$1 ~ /Rent|Medicine/' file

Medicine,200

Rent,900

Medicine,600

9. What if the first column contains the word "Medicines"? The above example will match it as well. To match exactly for Rent or Medicine only:

$ awk -F, '$1 ~ /^Rent$|^Medicine$/' file

Medicine,200

Rent,900

Medicine,600

The ^ symbol anchors the beginning and $ the end of the string being matched, here the first column. ^Rent$ matches exactly the word Rent in the first column, and the same goes for the word Medicine.

10. To print the lines which do not contain the pattern Medicine:

$ awk '!/Medicine/' file

Grocery,500

Rent,900

Grocery,800

The ! is used to negate the pattern search.

11. To negate the pattern only on the first column alone:


$ awk -F, '$1 !~ /Medicine/' file

Grocery,500

Rent,900

Grocery,800

12. To print all records whose amount is greater than 500:

$ awk -F, '$2>500' file

Rent,900

Grocery,800

Medicine,600

13. To print the Medicine record only if it is the 1st record:

$ awk 'NR==1 && /Medicine/' file

Medicine,200

This is how the logical AND (&&) condition is used in awk. The record is retrieved only if it is the first record (NR==1) and it is a Medicine record.

14. To print all those Medicine records whose amount is greater than 500:

$ awk -F, '/Medicine/ && $2>500' file

Medicine,600

15. To print all the Medicine records and also those records whose amount is greater than 600:

$ awk -F, '/Medicine/ || $2>600' file

Medicine,200
Rent,900

Grocery,800

Medicine,600

This is how the logical OR(||) condition is used in awk.

In one of our earlier articles, we discussed joining all lines in a file and also joining every 2 lines in a file. In this article, we will see how we can join lines based on a pattern, or join lines on encountering a pattern, using awk or gawk.

Let us assume a file with the following contents. There is a line with START in-between. We have to
join all the lines following the pattern START.

$ cat file

START

Unix

Linux

START

Solaris

Aix

SCO

1. Join the lines following the pattern START without any delimiter.

$ awk '/START/{if (NR!=1)print "";next}{printf $0}END{print "";}' file

UnixLinux

SolarisAixSCO
Basically, what we are trying to do is accumulate the lines following a START and print them on encountering the next START. /START/ matches lines containing the pattern START, and the commands within the first {} run only on those lines. A blank line is printed if the line is not the first line (NR!=1); without this condition, a blank line would appear at the very beginning of the output, since the file begins with a START.

The next command prevents the remaining part of the script from executing for the START lines. The second {} block works only on the lines not containing START, and simply prints the line without a terminating newline character (printf). Hence, as a result, we get all the lines after a START on the same line. The END block prints a newline at the end, without which the prompt would appear at the end of the last line of output itself.

2. Join the lines following the pattern START with space as delimiter.

$ awk '/START/{if (NR!=1)print "";next}{printf "%s ",$0}END{print "";}' file

Unix Linux

Solaris Aix SCO

This is the same as the earlier one, except that it uses the format specifier %s to accommodate an additional space, which is the delimiter in this case.

3. Join the lines following the pattern START with comma as delimiter.

$ awk '/START/{if (x)print x;x="";next}{x=(!x)?$0:x","$0;}END{print x;}' file

Unix,Linux

Solaris,Aix,SCO

Here, we build up the complete line in a variable x, and print x whenever a new pattern starts. The command x=(!x)?$0:x","$0 is like the ternary operator in C or Perl: if x is empty, assign the current line ($0) to x; else append a comma and the current line to x. As a result, x contains the lines following a START joined with commas. In the END block, x is printed, since for the last group there is no further START to trigger the printing of the earlier group.
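The accumulate-and-print idiom can be tried in isolation; this sketch joins a hypothetical three-line input with commas:

```shell
# x starts empty, so the first line is assigned as-is;
# every later line is appended with a comma in front.
printf 'a\nb\nc\n' | awk '{x=(!x)?$0:x","$0}END{print x}'
# a,b,c
```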

4. Join the lines following the pattern START with comma as delimiter with also the pattern
matching line.

$ awk '/START/{if (x)print x;x="";}{x=(!x)?$0:x","$0;}END{print x;}' file
START,Unix,Linux

START,Solaris,Aix,SCO

The difference here is the missing next statement. Because next is not there, the commands in the second set of curly braces apply to the START line as well, and hence it also gets concatenated.

5. Join the lines following the pattern START with comma as delimiter with also the pattern
matching line. However, the pattern line should not be joined.

$ awk '/START/{if (x)print x;print;x="";next}{x=(!x)?$0:x","$0;}END{print x;}' file

START

Unix,Linux

START

Solaris,Aix,SCO

In this one, instead of making START part of the variable x, the START line is printed directly. As a result, the START line comes out separately, and the remaining lines get joined.

In this article of the awk series, we will see the different scenarios in which we need to split a file into multiple files using awk. A file can be split into multiple files based on a condition, based on a pattern, or simply because the file is big and needs to be split into smaller files.

Sample File1:
Let us consider a sample file with the following contents:

$ cat file1

Item1,200

Item2,500

Item3,900

Item2,800
Item1,600

1. Split the file into 3 different files, one for each item, i.e., all records pertaining to Item1 into one file, records of Item2 into another, and so on.

$ awk -F, '{print > $1}' file1

The files generated by the above command are as below:

$ cat Item1

Item1,200

Item1,600

$ cat Item3

Item3,900

$ cat Item2

Item2,500

Item2,800

This looks so simple, right? print prints the entire line, and the line is printed to a file whose name is $1, the first field. This means the first record gets written to a file named 'Item1', the second record to 'Item2', the third to 'Item3', the 4th goes to 'Item2', and so on.

2. Split the files by having an extension of .txt to the new file names.

$ awk -F, '{print > $1".txt"}' file1

The only change here from the above is concatenating the string ".txt" to $1, the first field. As a result, the file names get the extension. The files created are below:

$ ls *.txt

Item2.txt Item1.txt Item3.txt


3. Split the files so that each new file contains only the value (the second field), i.e., only the 2nd field in the new files, without the 1st field:

$ awk -F, '{print $2 > $1".txt"}' file1

The print command prints the entire record. Since we want only the second field to go to the output files, we do: print $2.

$ cat Item1.txt

200

600

4. Split the files so that all the items whose value is greater than 500 are in the file "500G.txt",
and the rest in the file "500L.txt".

$ awk -F, '{if($2<=500)print > "500L.txt";else print > "500G.txt"}' file1

The output files created will be as below:

$ cat 500L.txt

Item1,200

Item2,500

$ cat 500G.txt

Item3,900

Item2,800

Item1,600

Check the second field ($2). If it is less than or equal to 500, the record goes to "500L.txt", else to "500G.txt".
Another way to achieve the same thing is to use the ternary operator in awk:
$ awk -F, '{x=($2<=500)?"500L.txt":"500G.txt"; print > x}' file1

The condition of greater or less than 500 is checked, and the appropriate file name is assigned to the variable x. The record is then written to the file named in x.

Sample File2:
Let us consider another file with a different set of contents. This file has a pattern 'START' at
frequent intervals.

$ cat file2

START

Unix

Linux

START

Solaris

Aix

SCO

5. Split the file into multiple files at every occurrence of the pattern START.

$ awk '/START/{x="F"++i;}{print > x;}' file2

This command contains 2 sets of curly braces. The control enters the first set only on encountering a line containing the pattern START. The second set is executed for every line, since it has no condition and is hence always true.
On encountering the pattern START, a new file name is created and stored. When the first START comes, x contains "F1"; the control moves to the second set of braces, the record is written to F1, and the subsequent records go to "F1" until the next START comes. On the next START, x contains "F2", the subsequent lines go to "F2" until the next START, and so it continues.

$ cat F1

START

Unix

Linux

$ cat F2

START

Solaris

Aix

SCO

6. Split the file into multiple files at every occurrence of the pattern START. But the line
containing the pattern should not be in the new files.

$ awk '/START/{x="F"++i;next}{print > x;}' file2

The only difference in this from the above is the inclusion of the next command. Due to next, a line containing START enters the first set of braces and awk then immediately starts reading the next line. As a result, the START lines never reach the second set of braces, and hence START does not appear in the split files.

$ cat F1

Unix

Linux

$ cat F2

Solaris

Aix

SCO

7. Split the file by inserting a header record in every new file.

$ awk '/START/{x="F"++i;print "ANY HEADER" > x;next}{print > x;}' file2
The change here from the earlier one is this: before the next command, we write the header record into the file. This is the right place to write the header record, since this is where the file is first created.

$ cat F1

ANY HEADER

Unix

Linux

$ cat F2

ANY HEADER

Solaris

Aix

SCO

Sample File3:
Let us consider a file with the sample contents:

$ cat file3

Unix

Linux

Solaris

AIX

SCO

8. Split the file into multiple files at every 3rd line, i.e., the first 3 lines into F1, the next 3 lines into F2, and so on.

$ awk 'NR%3==1{x="F"++i;}{print > x}' file3


In other words, this is nothing but splitting the file into equal parts. The condition NR%3==1 does the trick here: NR is the line number of the current record, and NR%3 equals 1 on every 3rd line, such as the 1st, 4th, 7th and so on. At each such line, the file name in the variable x changes, and hence the records are written to the appropriate files.

$ cat F1

Unix

Linux

Solaris

$ cat F2

AIX

SCO
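The cycling of NR%3 can be verified on its own; with a hypothetical 7-line input, the condition fires exactly on lines 1, 4 and 7:

```shell
# Print the line numbers at which a new split file would begin.
printf 'l1\nl2\nl3\nl4\nl5\nl6\nl7\n' | awk 'NR%3==1{print NR}'
# 1
# 4
# 7
```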

Sample File4:
Let us update the above file with a header and trailer:

$ cat file4

HEADER

Unix

Linux

Solaris

AIX

SCO

TRAILER

9. Split the file at every 3rd line without the header and trailer in the new files.

$ sed '1d;$d;' file4 | awk 'NR%3==1{x="F"++i;}{print > x}'


The earlier command does the work for us; the only thing left is to feed it the file without the header and trailer, and sed does that for us: '1d' deletes the 1st line, '$d' deletes the last line.

$ cat F1

Unix

Linux

Solaris

$ cat F2

AIX

SCO

10. Split the file at every 3rd line, retaining the header and trailer in every file.

$ awk 'BEGIN{getline f;}NR%3==2{x="F"++i;a[i]=x;print f>x;}{print > x}END{for(j=1;j<i;j++)print> a[j];}' file4

This one is a little tricky. Before the file is processed, the first line is read into the variable f using getline. NR%3 is compared with 2 instead of 1 as in the earlier case, because since the first line is a header, the files need to be split at the 2nd, 5th, 8th lines, and so on. All the file names are stored in the array "a" for later processing.
Without the END block, all the files would have the header record, but only the last file would have the trailer record. So, the END block is there precisely to write the trailer record to all the files other than the last one.

$ cat F1

HEADER

Unix

Linux

Solaris

TRAILER

$ cat F2

HEADER
AIX

SCO

TRAILER

In this article of the awk series, we will see how to use awk to read or parse text or CSV files containing multiple or repeating delimiters. We will also discuss some peculiar delimiters and how to handle them using awk.

Let us consider a sample file. This colon-separated file contains an item, its purchase year, and a set of prices separated by semicolons.

$ cat file

Item1:2010:10;20;30

Item2:2012:12;29;19

Item3:2014:15;50;61

1. To print the 3rd column which contains the prices:

$ awk -F: '{print $3}' file

10;20;30

12;29;19

15;50;61

This is straightforward. By specifying a colon (:) with the -F option, the 3rd column can be retrieved using the $3 variable.

2. To print the 1st component of $3 alone:

$ awk -F '[:;]' '{print $4}' file

20
29

50

What did we do here? We specified multiple delimiters, one being : and the other ;. How does awk parse the file? It is simple: while reading the line, whenever either delimiter (: or ;) is encountered, the part read so far is stored in $1; on encountering the next delimiter, the part read is stored in $2; and this continues until the end of the line. In this way, $4 contains the first part of the price component above.
Note: always keep in mind that when specifying multiple delimiters, they have to be placed inside square brackets ([;:]).

3. To sum the individual components of the 3rd column and print it:

$ awk -F '[;:]' '{$3=$3+$4+$5;print $1,$2,$3}' OFS=: file

Item1:2010:60

Item2:2012:60

Item3:2014:126

The individual components of the price ($3) column are available in $3, $4 and $5. Simply sum them up, store the result in $3, and print all the variables. OFS (output field separator) is used to specify the delimiter while printing the output.
Note: if we do not set OFS, awk prints the fields using the default output delimiter, which is a space.

4. Un-group or re-group every record depending on the price column:

$ awk -F '[;:]' '{for(i=3;i<=5;i++){print $1,$2,$i;}}' OFS=":" file

Item1:2010:10

Item1:2010:20

Item1:2010:30

Item2:2012:12

Item2:2012:29

Item2:2012:19
Item3:2014:15

Item3:2014:50

Item3:2014:61

The requirement here is that a new record has to be created for every component of the price column. A loop runs over columns 3 to 5, and each time a record is framed using one price component.

5-6. Read file in which the delimiter is square brackets:

$ cat file

123;abc[202];124

125;abc[203];124

127;abc[204];124

5. To print the value present within the brackets:

$ awk -F '[][]' '{print $2}' file

202

203

204

At first sight, the delimiter used in the above command might be confusing, but it is simple: 2 delimiters are used in this case, one being [ and the other ]. Since the delimiters themselves are square brackets, which have to be placed within square brackets, it looks tricky at first.

Note: if square brackets are the delimiters, they must be put this way only, meaning first ] followed by [. Using a delimiter like -F '[[]]' will give a different interpretation altogether.

6. To print the first value, the value within brackets, and the last value:

$ awk -F '[][;]' '{print $1,$3,$5}' OFS=";" file

123;202;124
125;203;124

127;204;124

3 delimiters are used in this case with semi-colon also included.

7-8. Read or parse a file containing a series of delimiters:

$ cat file

123;;;202;;;203

124;;;213;;;203

125;;;222;;;203

The above file contains a series of 3 semi-colons between every 2 values.

7. Using the multiple delimiter method:

$ awk -F'[;;;]' '{print $2}' file

Blank output !!! The above delimiter, though specified as 3 semi-colons, is as good as a single semi-colon (;), since inside square brackets they are all the same character. Due to this, $2 is the value between the first and the second semi-colon, which in our case is blank, and hence there is no output.

8. Using the delimiter without square brackets:

$ awk -F';;;' '{print $2}' file

202

213

222
The expected output !!! No square brackets were used, and we got the output we wanted.

Difference between using square brackets and not using them: when a set of delimiters is specified inside square brackets, it means an OR condition of the delimiters. For example, -F '[;:]' separates the contents on encountering either ':' or ';'. However, when a set of delimiters is specified without square brackets, awk treats them literally as a sequence: -F ':;' separates the contents only on encountering a colon followed by a semi-colon. Hence, in the last example, the file contents are separated only when a run of 3 consecutive semi-colons is encountered.
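The two interpretations can be compared directly; in this sketch both commands print the middle field, but they split their (made-up) inputs in different ways:

```shell
# Inside brackets: ':' OR ';' each separates fields.
printf 'one:two;three\n' | awk -F'[;:]' '{print $2}'
# two

# Without brackets: only the literal sequence ":;" separates fields.
printf 'one:;two:;three\n' | awk -F':;' '{print $2}'
# two
```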

9. Read or parse a file containing a series of delimiters of varying lengths:


In the below file, the 1st and 2nd columns are separated by 3 semi-colons, while the 2nd and 3rd are separated by 4 semi-colons:

$ cat file

123;;;202;;;;203

124;;;213;;;;203

125;;;222;;;;203

$ awk -F';'+ '{print $2,$3}' file

202 203

213 203

222 203

The '+' is a regular-expression quantifier. It indicates one or more of the previous character. ';'+ indicates one or more semi-colons, and hence both the runs of 3 and of 4 semi-colons get matched.

10. Using a word as a delimiter:

$ cat file

123Unix203

124Unix203

125Unix203

Retrieve the numbers before and after the word "Unix" :


$ awk -F'Unix' '{print $1, $2}' file

123 203

124 203

125 203

In this case, we use the word "Unix" as the delimiter, and hence $1 and $2 contain the appropriate values. Keep in mind that it is not just special characters that can be used as delimiters; letters and whole words can be used as delimiters too.

P.S.: We will discuss the awk split function and how to use it on these kinds of multiply-delimited files.

In one of our earlier articles, we discussed how to pass shell variables to awk. In this one, we will see how to access awk variables in the shell, or in other words, how to access awk variables as shell variables. Let us see the different ways in which we can achieve this.

Let us consider a file with the sample contents as below:

$ cat file

Linux 20

Solaris 30

HPUX 40

1. Access the value of the entry "Solaris" in a shell variable, say x:

$ x=`awk '/Solaris/{a=$2;print a}' file`

$ echo $x

30

This approach is fine as long as we want to access only one value. What if we have to access multiple values in the shell?

2. Access the value of "Solaris" in x, and "Linux" in y:


$ z=`awk '{if($1=="Solaris")print "x="$2;if($1=="Linux")print "y="$2}' file`

$ echo "$z"

y=20

x=30

$ eval $z

$ echo $x

30

$ echo $y

20

awk prints the assignments "x=..." and "y=...", which are collected in the shell variable "z". The eval command evaluates the variable, meaning it executes the commands present in it. As a result, "x=30" and "y=20" get executed, and they become shell variables x and y with the appropriate values.

3. Same using the sourcing method:

$ awk '{if($1=="Solaris")print "x="$2;if($1=="Linux")print "y="$2}' file > f1

$ source f1

$ echo $x

30

$ echo $y

20

Here, instead of collecting the output of the awk command in a variable, it is redirected to a temporary file. The file is then sourced, or in other words executed, in the same shell. As a result, "x" and "y" become shell variables.
Note: depending on the shell being used, the appropriate way of sourcing has to be used. The "source" command is used here since the shell is bash.
How to manipulate a text/CSV file using awk/gawk? How to insert or add a column between columns, remove columns, or update a particular column? Let us discuss these in this article.

Consider a CSV file with the following contents:

$ cat file

Unix,10,A

Linux,30,B

Solaris,40,C

Fedora,20,D

Ubuntu,50,E

1. To insert a new column (say serial number) before the 1st column

$ awk -F, '{$1=++i FS $1;}1' OFS=, file

1,Unix,10,A

2,Linux,30,B

3,Solaris,40,C

4,Fedora,20,D

5,Ubuntu,50,E

$1=++i FS $1 => A space is used to concatenate expressions in awk. This expression concatenates a new field (++i) with the 1st field, with the delimiter (FS) in between, and assigns the result back to the 1st field ($1). FS contains the file delimiter.
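The space-concatenation rule can be seen in a one-liner sketch: placing expressions side by side glues them into one string.

```shell
# "Item", FS and the number 1 are concatenated by juxtaposition.
awk 'BEGIN{FS=","; print "Item" FS 1}'
# Item,1
```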

2. To insert a new column after the last column

$ awk -F, '{$(NF+1)=++i;}1' OFS=, file

Unix,10,A,1
Linux,30,B,2

Solaris,40,C,3

Fedora,20,D,4

Ubuntu,50,E,5

$NF indicates the value of the last column. Hence, by assigning something to $(NF+1), a new field is appended at the end automatically.

3. Add 2 columns after the last column:

$ awk -F, '{$(NF+1)=++i FS "X";}1' OFS=, file

Unix,10,A,1,X

Linux,30,B,2,X

Solaris,40,C,3,X

Fedora,20,D,4,X

Ubuntu,50,E,5,X

The explanation given for the above 2 examples holds good here.

4. To insert a column before the 2nd last column

$ awk -F, '{$(NF-1)=++i FS $(NF-1);}1' OFS=, file

Unix,1,10,A

Linux,2,30,B

Solaris,3,40,C

Fedora,4,20,D

Ubuntu,5,50,E

$(NF-1) is the 2nd last column. Hence, concatenating the serial number at the beginning of $(NF-1) ends up inserting a column before the 2nd last.
5. Update 2nd column by adding 10 to the variable:

$ awk -F, '{$2+=10;}1' OFS=, file

Unix,20,A

Linux,40,B

Solaris,50,C

Fedora,30,D

Ubuntu,60,E

$2 is incremented by 10.

6. Convert a specific column (the 1st column) to uppercase in the CSV file:

$ awk -F, '{$1=toupper($1)}1' OFS=, file

UNIX,10,A

LINUX,30,B

SOLARIS,40,C

FEDORA,20,D

UBUNTU,50,E

Using awk's toupper function, the 1st column is converted to uppercase.
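awk also provides tolower as the counterpart. A small sketch lowercasing the 3rd column of the same sample data:

```shell
# Recreate the sample CSV
cat > file <<'EOF'
Unix,10,A
Linux,30,B
Solaris,40,C
Fedora,20,D
Ubuntu,50,E
EOF

# tolower is the counterpart of toupper
awk -F, '{$3=tolower($3)}1' OFS=, file
```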

7. Extract only first 3 characters of a specific column(1st column):

$ awk -F, '{$1=substr($1,1,3)}1' OFS=, file

Uni,10,A

Lin,30,B

Sol,40,C
Fed,20,D

Ubu,50,E

Using awk's substr function, a substring containing only the first few characters can be retrieved.
Note that character positions in awk start at 1, not 0.

8. Empty the value in the 2nd column:

$ awk -F, '{$2="";}1' OFS=, file

Unix,,A

Linux,,B

Solaris,,C

Fedora,,D

Ubuntu,,E

Set the 2nd column ($2) to an empty string (""). Now, when the line is printed, $2 will be blank.

9. Remove/Delete the 2nd column from the CSV file:

$ awk -F, '{for(i=1;i<=NF;i++)if(i!=x)f=f?f FS $i:$i;print f;f=""}' x=2 file

Unix,A

Linux,B

Solaris,C

Fedora,D

Ubuntu,E

Just emptying a particular column leaves the column in place with an empty value. To remove a
column, all the subsequent columns have to be shifted one position ahead. The for loop iterates over
all the fields. Using the ternary operator, every column except the one to be removed is appended to
the variable "f" with FS as the delimiter. At the end, the variable "f", which contains the updated
record, is printed and then reset for the next line. The column to be removed is passed through the
awk variable "x", and hence, just by setting the appropriate number in x, any specific column can be
removed.
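An alternative sketch: in gawk (and most modern awks), assigning a smaller value to NF rebuilds the record, so a column can be removed by shifting the later fields left and dropping the last one. This relies on assignment to NF rebuilding $0, which very old awks may not honour:

```shell
# Recreate the sample CSV
cat > file <<'EOF'
Unix,10,A
Linux,30,B
Solaris,40,C
Fedora,20,D
Ubuntu,50,E
EOF

# Shift every field after column x one position to the left,
# then drop the last field; assigning to NF rebuilds $0 using OFS.
awk -F, -v x=2 '{for(i=x;i<NF;i++)$i=$(i+1);NF=NF-1;}1' OFS=, file
```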
10. Join the 3rd column with the 2nd column using ':' and remove the 3rd column:

$ awk -F, '{$2=$2":"$x;for(i=1;i<=NF;i++)if(i!=x)f=f?f FS $i:$i;print f;f=""}' x=3 file

Unix,10:A

Linux,30:B

Solaris,40:C

Fedora,20:D

Ubuntu,50:E

Almost the same as the last example, except that the 3rd column ($3) is first concatenated with the
2nd column ($2) and then removed.

gawk has 3 functions to calculate date and time:

 systime
 strftime
 mktime
Let us see in this article how to use these functions:

systime:
This function is equivalent to the Unix date command in its epoch form (date +%s). It gives the Unix
time: the total number of seconds elapsed since the epoch (01-01-1970 00:00:00 UTC).

$ echo | awk '{print systime();}'

1358146640

Note: systime function does not take any arguments.
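The echo pipe above only exists to give awk one input line to act on; a BEGIN block avoids it entirely. A sketch (note that these time functions are gawk extensions, though mawk provides them too):

```shell
# A BEGIN block runs before any input is read, so nothing
# needs to be piped into awk just to trigger the action.
awk 'BEGIN{print systime();}'
```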

strftime:
A very common gawk function used to format a Unix timestamp into a calendar format. Using this
function, the year, month, date, hours, minutes and seconds can be extracted from the timestamp.

Syntax:
strftime (<format specifiers>,unix time);
1. Printing current date time using strftime:

$ echo | awk '{print strftime("%d-%m-%y %H-%M-%S",systime());}'

14-01-13 12-37-45

strftime takes format specifiers which are same as the format specifiers available with the date
command. %d for date, %m for month number (1 to 12), %y for the 2 digit year number, %H for the
hour in 24 hour format, %M for minutes and %S for seconds. In this way, strftime converts Unix time
into a date string.

2. Display current date time using strftime without systime:

$ echo | awk '{print strftime("%d-%m-%y %H-%M-%S");}'

14-01-13 12-38-08

Both the arguments of strftime are optional. When the timestamp is not provided, it takes the
systime by default.

3. strftime with no arguments:

$ echo | awk '{print strftime();}'

Mon Jan 14 12:30:05 IST 2013

strftime without the format specifiers produces output in the same default format as the Unix date
command.

mktime:
The mktime function converts a given date-time string into a Unix timestamp, which is of the same
format as the systime output.
Syntax:
mktime(date time string) # where date time string is a string which contains at least 6 components in
the following order: YYYY MM DD HH MM SS

1. Printing timestamp for a specific date time :

$ echo | awk '{print mktime("2012 12 21 0 0 0");}'

1356028200
This gives the Unix time for the date 21-Dec-12.

2. Using strftime with mktime:

$ echo | awk '{print strftime("%d-%m-%Y",mktime("2012 12 21 0 0 0"));}'

21-12-2012

The output of mktime can be validated by formatting the mktime output using the strftime function
as above.

3. Negative date in mktime:

$ echo | awk '{print strftime("%d-%m-%Y",mktime("2012 12 -1 0 0 0"));}'

29-11-2012

mktime can take negative values as well. Day 0 refers to the last day of the previous month, and -1 in
the date position refers to one day before that, which in this case leads to 29th Nov 2012.
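Following the same logic, day 0 resolves to the last day of the previous month, which gives a handy way to find a month's last day without knowing its length. A sketch using an arbitrary example date:

```shell
# Day 0 of March resolves to the last day of February;
# 2012 being a leap year, this prints 29-02-2012.
awk 'BEGIN{print strftime("%d-%m-%Y",mktime("2012 03 0 0 0 0"));}'
```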

4. Negative hour value in mktime:

$ echo | awk '{print strftime("%d-%m-%Y %H-%M-%S",mktime("2012 12 3 -2 0 0"));}'

02-12-2012 22-00-00

-2 in the hours position indicates 2 hours before midnight at the start of the specified date, which in
this case leads to 22:00 hours on 2-Dec-2012.

How to find the time difference between timestamps using gawk?


Let us consider a file where the
1st column is the Process name,
2nd is the start time of the process, and
3rd column is the end time of the process.

The requirement is to find the time consumed by the process which is the difference between the
start and the end times.
1. File in which the date and time component are separated by a space:

$ cat file

P1,2012 12 4 21 36 48,2012 12 4 22 26 53

P2,2012 12 4 20 36 48,2012 12 4 21 21 23

P3,2012 12 4 18 36 48,2012 12 4 20 12 35

Time difference in seconds:

$ awk -F, '{d2=mktime($3);d1=mktime($2);print $1","d2-d1,"secs";}' file

P1,3005 secs

P2,2675 secs

P3,5747 secs

Using the mktime function, the Unix timestamps are calculated for the two date-time strings, and
their difference gives us the time elapsed in seconds.

2. File with the different date format :

$ cat file

P1,2012-12-4 21:36:48,2012-12-4 22:26:53

P2,2012-12-4 20:36:48,2012-12-4 21:21:23

P3,2012-12-4 18:36:48,2012-12-4 20:12:35

Note: This file has the start time and end time in different formats

Difference in seconds:

$ awk -F, '{gsub(/[-:]/," ",$2);gsub(/[-:]/," ",$3);d2=mktime($3);d1=mktime($2);print $1","d2-d1,"secs";}' file
P1,3005 secs

P2,2675 secs

P3,5747 secs

Using the gsub function, the '-' and ':' characters are replaced with spaces. This is done because the
components in a mktime argument must be space separated.
Difference in minutes:

$ awk -F, '{gsub(/[-:]/," ",$2);gsub(/[-:]/," ",$3);d2=mktime($3);d1=mktime($2);print $1","(d2-d1)/60,"mins";}' file

P1,50.0833 mins

P2,44.5833 mins

P3,95.7833 mins

Dividing the difference in seconds by 60 gives us the difference in minutes.

3. File with only date, without time part:

$ cat file

P1,2012-12-4,2012-12-6

P2,2012-12-4,2012-12-8

P3,2012-12-4,2012-12-5

Note: The start and end time has only the date components, no time components

Difference in seconds:

$ awk -F, '{gsub(/-/," ",$2);gsub(/-/," ",$3);$2=$2" 0 0 0";$3=$3" 0 0 0";d2=mktime($3);d1=mktime($2);print $1","d2-d1,"secs";}' file

P1,172800 secs

P2,345600 secs
P3,86400 secs

In addition to replacing the '-' characters with spaces, "0 0 0" is appended to each date field, since
mktime requires all 6 components.
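If the padding were skipped, mktime would receive only three components and would fail, which it signals by returning -1. A quick sketch:

```shell
# mktime expects all 6 components (YYYY MM DD HH MM SS);
# with only 3 of them it fails and returns -1.
awk 'BEGIN{print mktime("2012 12 4");}'
```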

Difference in days:

$ awk -F, '{gsub(/-/," ",$2);gsub(/-/," ",$3);$2=$2" 0 0 0";$3=$3" 0 0 0";d2=mktime($3);d1=mktime($2);print $1","(d2-d1)/86400,"days";}' file

P1,2 days

P2,4 days

P3,1 days

A day has 86400 (24*60*60) seconds, and hence by dividing the duration in seconds by 86400, the
duration in days can be obtained.
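One caveat: if the interval crosses a daylight-saving transition, the difference is not an exact multiple of 86400, so rounding keeps the day count clean. A hedged sketch using printf with %.0f:

```shell
# Recreate the date-only sample file
cat > file <<'EOF'
P1,2012-12-4,2012-12-6
P2,2012-12-4,2012-12-8
P3,2012-12-4,2012-12-5
EOF

# %.0f rounds to the nearest whole day, which absorbs the one-hour
# skew a daylight-saving transition would otherwise introduce.
awk -F, '{gsub(/-/," ",$2);gsub(/-/," ",$3);$2=$2" 0 0 0";$3=$3" 0 0 0";
          printf "%s,%.0f days\n",$1,(mktime($3)-mktime($2))/86400;}' file
```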
