Unit V
The GNU Toolchain includes:
1. GNU Compiler Collection (GCC): a compiler suite that supports many languages, such as C/C++ and Objective-C/C++.
2. GNU Binutils: a suite of binary utility tools, including the linker and assembler.
3. GNU Autotools: a build system including Autoconf, Autoheader, Automake and Libtool.
GCC is portable and runs on many operating platforms. GCC (and the GNU Toolchain) is currently available on all Unixes. It has also been ported to Windows (by Cygwin, MinGW and MinGW-W64). GCC can also be used as a cross-compiler, producing executables for a different platform.
C++ Standard Support
GCC supports different dialects of C++, corresponding to the multiple published ISO standards. Which
standard it implements can be selected using the -std= command-line option.
C++98, C++11, C++14, C++17, C++20 and C++23
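For example, to compile under the C++17 standard (file name illustrative):
$ g++ -std=c++17 -o hello hello.cpp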
Open a command-line terminal and install the C compiler by installing the development package build-essential.
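On Debian or Ubuntu, for example:
$ sudo apt-get install build-essential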
On Mac OS X, first make sure Xcode is installed. If it is not installed, visit the App Store and install Xcode. Then install the command-line tools: Xcode menu > Preferences > Downloads > choose "Command line tools" > click the "Install" button.
i. Version
$ g++ --version
g++ (GCC) 6.4.0
More details can be obtained via the -v option, for example,
$ gcc -v
Using built-in specs.
COLLECT_GCC=gcc
COLLECT_LTO_WRAPPER=/usr/lib/gcc/x86_64-pc-cygwin/6.4.0/lto-wrapper.exe
Target: x86_64-pc-cygwin
Configured with: ......
$ g++ -v
Using built-in specs.
COLLECT_GCC=g++
COLLECT_LTO_WRAPPER=/usr/lib/gcc/x86_64-pc-cygwin/6.4.0/lto-wrapper.exe
Target: x86_64-pc-cygwin
Configured with: ......
Thread model: posix
gcc version 6.4.0 (GCC)
ii. Help
You can get the help manual via the --help option. For example,
$ gcc --help
// or
$ man g++
Reading man pages under CMD or Bash shell can be difficult. You could generate a text file via:
$ man gcc | col -b > gcc.txt
The col utility is needed to strip the backspace. (For Cygwin, it is available in "Utils", "util-linux"
package.)
Alternatively, you could look for an online man pages, e.g., http://linux.die.net/man/1/gcc.
The GCC man pages are kept under "/usr/share/man/man1".
$ whereis gcc
Below is a simple C program, hello.c:
#include <stdio.h>
int main()
{
printf("Hello, world!\n");
return 0;
}
To compile the hello.c:
$ gcc hello.c
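Without -o, the output executable defaults to a.out (a.exe under Cygwin); run it with:
$ ./a.out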
Similarly, a C++ version, hello.cpp:
#include <iostream>
using namespace std;
int main()
{
cout << "Hello, world!" << endl;
return 0;
}
You need to use g++ to compile a C++ program, as follows. We use the -o option to specify the output file name:
$ g++ -o hello hello.cpp
// (Unixes / Mac OS X) In Bash shell
$ ./hello
However, we usually compile each of the source files separately into an object file, and link them together in a later stage. In this case, changes in one file do not require re-compilation of the other files.
$ g++ -c file1.cpp
$ g++ -c file2.cpp
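The object files are then linked into an executable in a separate step (output name illustrative):
$ g++ -o myapp file1.o file2.o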
GCC compiles a C program into an executable in four steps:
1. Pre-processing: The preprocessor (cpp) includes the headers (#include) and expands the macros (#define):
$ cpp hello.c > hello.i
The resultant intermediate file "hello.i" contains the expanded source code.
2. Compilation: The compiler compiles the pre-processed source code into assembly code for a specific processor.
$ gcc -S hello.i
The -S option specifies to produce assembly code, instead of object code. The resultant assembly
file is "hello.s".
3. Assembly: The assembler (as.exe) converts the assembly code into machine code in the object
file "hello.o".
$ as -o hello.o hello.s
4. Linking: Finally, the linker (ld.exe) links the object code with the library code to produce an executable file "hello.exe".
$ ld -o hello.exe hello.o ...libraries...
Shared library code can be shared between multiple programs. Furthermore, most operating systems allow one copy of a shared library in memory to be used by all running programs, thus saving memory. The shared library code can be upgraded without the need to recompile your program.
Because of the advantage of dynamic linking, GCC, by default, links to the shared library if it is
available.
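You can override this and force static linking with the -static flag (output name illustrative):
$ gcc -static -o hello_static hello.c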
You can list the contents of a library via "nm filename".
Searching for Header Files and Libraries (-I, -L and -l)
When compiling the program, the compiler needs the header files to compile the source codes;
the linker needs the libraries to resolve external references from other object files or libraries.
The compiler and linker will not find the headers/libraries unless you set the appropriate options, which is not obvious for first-time users. For each of the headers used in your source
(via #include directives), the compiler searches the so-called include-paths for these headers.
The include-paths are specified via -Idir option (or environment variable CPATH). Since the
header's filename is known (e.g., iostream.h, stdio.h), the compiler only needs the directories.
The linker searches the so-called library-paths for libraries needed to link the program into an
executable. The library-path is specified via -Ldir option (uppercase 'L' followed by the
directory path) (or environment variable LIBRARY_PATH). In addition, you also have to
specify the library name. In Unixes, the library libxxx.a is specified via -lxxx option (lowercase
letter 'l', without the prefix "lib" and ".a" extension). In Windows, provide the full name such
as -lxxx.lib. The linker needs to know both the directories as well as the library names. Hence,
two options need to be specified.
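Putting the options together, a build command might look like this (directory paths illustrative; -lm links the standard math library libm):
$ gcc -I/opt/mylib/include -L/opt/mylib/lib -o prog prog.c -lm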
Default Include-paths, Library-paths and Libraries
Try listing the default include-paths in your system used by the "GNU C Preprocessor" via "cpp -v":
$ cpp -v
......
/usr/lib/gcc/x86_64-pc-cygwin/6.4.0/include
/usr/include
/usr/lib/gcc/x86_64-pc-cygwin/6.4.0/../../../../lib/../include/w32api
Try running the compilation in verbose mode (-v) to study the library-paths (-L) and libraries (-l) used
in your system:
> gcc -v -o hello.exe hello.c
......
-L/usr/lib/gcc/x86_64-pc-cygwin/6.4.0
-L/usr/x86_64-pc-cygwin/lib
-L/usr/lib
-L/lib
-lgcc_s // libgcc_s.a
-lgcc // libgcc.a
-lcygwin // libcygwin.a
-ladvapi32 // libadvapi32.a
-lshell32 // libshell32.a
-luser32 // libuser32.a
-lkernel32 // libkernel32.a
$ vim factorial.c
#include <stdio.h>
int main()
{
    int i, num, j;
    printf("Enter the number: ");
    scanf("%d", &num);
    for (i = 1; i < num; i++)
        j = j * i;
    printf("The factorial of %d is %d\n", num, j);
}
$ ./a.out
Enter the number: 3
The factorial of 3 is 12548672
Let us debug it while reviewing the most useful commands in gdb.
Step 1. Compile the C program with debugging option -g
Compile your C program with the -g option. This allows the compiler to collect the debugging information.
$ gcc -g factorial.c
Note: The above command creates a.out file which will be used for debugging as shown below.
Step 2. Launch gdb
Launch the C debugger (gdb) as shown below.
$ gdb a.out
Step 3. Set up a break point inside C program
Syntax:
break line_number
Other formats:
break [file_name]:line_number
break [file_name]:func_name
This places a breakpoint in the C program where you suspect errors. While executing the program, the debugger will stop at the breakpoint and give you a prompt to debug.
So before starting up the program, let us place the following breakpoint in our program.
(gdb) break 10
Breakpoint 1 at 0x804846f: file factorial.c, line 10.
To inspect variable values at the breakpoint, use the print (p) command. Examples:
print i
print j
print num
(gdb) p i
$1 = 1
(gdb) p j
$2 = 3042592
(gdb) p num
$3 = 3
(gdb)
As you see above, in factorial.c we have not initialized the variable j. So it gets a garbage value, resulting in big numbers as factorial values.
Fix this issue by initializing variable j with 1, compile the C program, and execute it again.
Even after this fix there seems to be some problem in the factorial.c program, as it still gives a wrong factorial value.
So, place the breakpoint on line 10 and continue as explained in the next section.
Commonly used gdb shorthands:
l – list
p – print
c – continue
s – step
ENTER – pressing the Enter key executes the previously executed command again.
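A short illustrative session using these commands might look like this (line numbers and values will differ):
(gdb) break 10
(gdb) run
(gdb) p j
(gdb) c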
Most Unix-like operating systems offer a centralized mechanism for finding and installing software. Software
is usually distributed in the form of packages, kept in repositories. Working with packages is known as
package management. Packages provide the core components of an operating system, along with shared
libraries, applications, services, and documentation.
A package management system does much more than one-time installation of software. It also provides
tools for upgrading already-installed packages. Package repositories help to ensure that code has been
vetted for use on your system, and that the installed versions of software have been approved by
developers and package maintainers.
When configuring servers or development environments, it’s often necessary to look beyond official
repositories. Packages in the stable release of a distribution may be out of date, especially where new or
rapidly-changing software is concerned. Nevertheless, package management is a vital skill for system
administrators and developers, and the wealth of packaged software for major distributions is a
tremendous resource.
This guide is intended as a quick reference for the fundamentals of finding, installing, and upgrading
packages on a variety of distributions, and should help you translate that knowledge between systems.
Package Management Systems: A Brief Overview
Most package systems are built around collections of package files. A package file is usually an archive
which contains compiled applications and other resources used by the software, along with installation
scripts. Packages also contain valuable metadata, including their dependencies, a list of other packages
required to install and run them.
While their functionality and benefits are broadly similar, packaging formats and tools vary by platform:
For Debian / Ubuntu: .deb packages installed by apt and dpkg
For Rocky / Fedora / RHEL: .rpm packages installed by dnf and rpm
For FreeBSD: .txz packages installed by pkg
In Debian and systems based on it, like Ubuntu, Linux Mint, and Raspbian, the package format
is the .deb file. apt, the Advanced Packaging Tool, provides commands used for most common
operations: Searching repositories, installing collections of packages and their dependencies,
and managing upgrades. apt commands operate as a front-end to the lower-level dpkg utility,
which handles the installation of individual .deb files on the local system, and is sometimes
invoked directly.
Recent releases of most Debian-derived distributions include a single apt command, which
offers a concise and unified interface to common operations that have traditionally been handled
by the more-specific apt-get and apt-cache.
Rocky Linux, Fedora, and other members of the Red Hat family use RPM files. These used to
use a package manager called yum. In recent versions of Fedora and its derivatives, yum has
been supplanted by dnf, a modernized fork which retains most of yum’s interface.
FreeBSD’s binary package system is administered with the pkg command. FreeBSD also offers
the Ports Collection, a local directory structure and tools which allow the user to fetch, compile,
and install packages directly from source using Makefiles. It’s usually much more convenient
to use pkg, but occasionally a pre-compiled package is unavailable, or you may need to change
compile-time options.
i. Update Package Lists
Most systems keep a local database of the packages available from remote repositories. It’s best to
update this database before installing or upgrading packages. As a partial exception to this
pattern, dnf will check for updates before performing some operations, but you can ask at any time
whether updates are available.
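The corresponding commands are:
For Debian / Ubuntu: sudo apt update
For Rocky / Fedora / RHEL: sudo dnf check-update
For FreeBSD Packages: sudo pkg update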
ii. Upgrade Installed Packages
Making sure that all of the installed software on a machine stays up to date would be an enormous
undertaking without a package system. You would have to track upstream changes and security alerts
for hundreds of different packages. While a package manager doesn’t solve every problem you’ll
encounter when upgrading software, it does enable you to maintain most system components with a
few commands.
On FreeBSD, upgrading installed ports can introduce breaking changes or require manual configuration
steps. It’s best to read /usr/ports/UPDATING before upgrading with portmaster.
For Debian / Ubuntu: sudo apt upgrade
For Rocky / Fedora / RHEL: sudo dnf upgrade
For FreeBSD Packages: sudo pkg upgrade
iii. Find a Package
Most distributions offer a graphical or menu-driven front end to package collections. These can be a
good way to browse by category and discover new software. Often, however, the quickest and most
effective way to locate a package is to search with command-line tools.
For Debian / Ubuntu: apt search search_string
For Rocky / Fedora / RHEL: dnf search search_string
For FreeBSD Packages: pkg search search_string
Note: On Rocky, Fedora, or RHEL, you can search package titles and descriptions together by using dnf
search all. On FreeBSD, you can search descriptions by using pkg search -D
iv. View Info about a Specific Package
When deciding what to install, it’s often helpful to read detailed descriptions of packages. Along with
human-readable text, these often include metadata like version numbers and a list of the package’s
dependencies.
For Debian / Ubuntu: apt show package
For Rocky / Fedora / RHEL: dnf info package
For FreeBSD Packages: pkg info package
For FreeBSD Ports: cd /usr/ports/category/port && cat pkg-descr
v. Install a Package from Repositories
Once you know the name of a package, you can usually install it and its dependencies with a single
command. In general, you can supply multiple packages to install at once by listing them all.
For Debian / Ubuntu: sudo apt install package
For Rocky / Fedora / RHEL: sudo dnf install package
For FreeBSD Packages: sudo pkg install package
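For example, on Debian or Ubuntu several packages can be installed at once (package names illustrative):
$ sudo apt install git curl tmux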
vi. Install a Package from the Local Filesystem
Sometimes, even though software isn’t officially packaged for a given operating system, a developer or
vendor will offer package files for download. You can usually retrieve these with your web browser, or
via curl on the command line. Once a package is on the target system, it can often be installed with a
single command.
On Debian-derived systems, dpkg handles individual package files. If a package has unmet
dependencies, gdebi can often be used to retrieve them from official repositories.
On Rocky Linux, Fedora, or RHEL, dnf is used to install individual files, and will also handle needed dependencies.
For Debian / Ubuntu: sudo dpkg -i package.deb
For Rocky / Fedora / RHEL: sudo dnf install package.rpm
For FreeBSD Packages: sudo pkg add package.txz
vii. Remove One or More Installed Packages
Since a package manager knows what files are provided by a given package, it can usually remove them
cleanly from a system if the software is no longer needed.
For Debian / Ubuntu: sudo apt remove package
For Rocky / Fedora / RHEL: sudo dnf erase package
For FreeBSD Packages: sudo pkg delete package
viii. Get Help
In addition to web-based documentation, keep in mind that Unix manual pages (usually referred to
as man pages) are available for most commands from the shell. To read a page, use man:
$ man page
In man, you can navigate with the arrow keys. Press / to search for text within the page, and q to quit.
For Debian / Ubuntu: man apt
For Rocky / Fedora / RHEL: man dnf
For FreeBSD Packages: man pkg
For FreeBSD Ports: man ports
Source code management
Source code management (SCM) is used to track modifications to a source code repository. SCM tracks
a running history of changes to a code base and helps resolve conflicts when merging updates from
multiple contributors. SCM is also synonymous with Version control.
As software projects grow in lines of code and contributor head count, the costs of communication
overhead and management complexity also grow. SCM is a critical tool to alleviate the organizational
strain of growing development costs.
When multiple developers work within a shared codebase, it is common for them to edit the same piece of code. Separate developers may be working on seemingly isolated features, yet those features may use a shared code module. Therefore, Developer 1 working on Feature 1 could make some edits and find out later that Developer 2 working on Feature 2 has conflicting edits.
Before the adoption of SCM this was a nightmare scenario. Developers would edit text files directly
and move them around to remote locations using FTP or other protocols. Developer 1 would make edits
and Developer 2 would unknowingly save over Developer 1’s work and wipe out the changes. SCM’s
role as a protection mechanism against this specific scenario is known as Version Control.
SCM brought version control safeguards to prevent loss of work due to conflict overwriting. These
safeguards work by tracking changes from each individual developer and identifying areas of conflict
and preventing overwrites. SCM will then communicate these points of conflict back to the developers
so that they can safely review and address.
This foundational conflict prevention mechanism has the side effect of providing passive
communication for the development team. The team can then monitor and discuss the work in progress
that the SCM is monitoring. The SCM tracks an entire history of changes to the code base. This allows
developers to examine and review edits that may have introduced bugs or regressions.
In addition to version control, SCM provides a suite of other helpful features that make collaborative code development a more user-friendly experience. Once SCM has started tracking all the changes to a project over time, a detailed historical record of the project's life is created. This historical record can then be
used to ‘undo’ changes to the codebase. The SCM can instantly revert the codebase back to a previous
point in time. This is extremely valuable for preventing regressions on updates and undoing mistakes.
The SCM archive of every change over a project's lifetime provides valuable record keeping for a project's release version notes. A clean and maintained SCM history log can be used interchangeably
as release notes. This offers insight and transparency into the progress of a project that can be shared
with end users or non-development teams.
SCM will reduce a team's communication overhead and increase release velocity. Without SCM, development is slower because contributors have to take extra effort to plan a non-overlapping sequence of development for release. With SCM, developers can work independently on separate branches of feature development, eventually merging them together.
Overall SCM is a huge aid to engineering teams that will lower development costs by allowing
engineering resources to execute more efficiently. SCM is a must have in the modern age of software
development. Professional teams use version control and your team should too.
Source code management best practices
Commit often
Commits are cheap and easy to make. They should be made frequently to capture updates to a code
base. Each commit is a snapshot that the codebase can be reverted to if needed. Frequent commits give
many opportunities to revert or undo work. A group of commits can be combined into a single commit
using a rebase to clarify the development log.
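For example, with Git (commit message and commit count illustrative):
$ git commit -m "Fix off-by-one error in pagination"
$ git rebase -i HEAD~3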
Ensure you're working from latest version
SCM enables rapid updates from multiple developers. It’s easy to have a local copy of the codebase fall
behind the global copy. Make sure to git pull or fetch the latest code before making updates. This will
help avoid conflicts at merge time.
Make detailed notes
Each commit has a corresponding log entry. At the time of commit creation, this log entry is populated
with a message. It is important to leave descriptive explanatory commit log messages. These commit
log messages should explain the “why” and “what” that encompass the commits content. These log
messages become the canonical history of the project’s development and leave a trail for future
contributors to review.
Review changes before committing
SCMs offer a 'staging area'. The staging area can be used to collect a group of edits before writing
them to a commit. The staging area can be used to manage and review changes before creating the
commit snapshot. Utilizing the staging area in this manner provides a buffer area to help refine the
contents of the commit.
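With Git, for instance:
$ git add -p
$ git diff --staged
The first command stages selected hunks interactively; the second shows exactly what will be committed.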
Use Branches
Branching is a powerful SCM mechanism that allows developers to create a separate line of
development. Branches should be used frequently as they are quick and inexpensive. Branches enable
multiple developers to work in parallel on separate lines of development. These lines of development
are generally different product features. When development is complete on a branch it is then merged
into the main line of development.
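A typical Git flow might look like this (branch name illustrative):
$ git checkout -b feature-search
$ git commit -am "Implement search"
$ git checkout main
$ git merge feature-search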
Agree on a Workflow
By default SCMs offer very free form methods of contribution. It is important that teams establish
shared patterns of collaboration. SCM workflows establish patterns and processes for merging branches.
If a team doesn't agree on a shared workflow it can lead to inefficient communication overhead when it
comes time to merge branches.
Revision Control System (RCS)
A revision control system (RCS) is an application capable of storing, logging, identifying and merging information related to the revision of software, application documentation, papers or forms.
Most revision control systems store this information with the help of a differential utility for documents.
A revision control system is an essential tool for an organization with multi-developer tasks or projects,
as it is capable of identifying issues and bugs and of retrieving an earlier working version of an
application or document whenever required.
Most revision control systems run as independent standalone applications. There are two types of
revision control systems: centralized and decentralized. Some applications like spreadsheets and word
processors have built-in revision control mechanisms. Designers and developers at times use revision
control for maintaining the documentation along with the configuration files for their developments.
High-quality documentation and products are possible with the proper use of revision control systems.
A revision control system has the following features:
For all documents and document types, up-to-date history can be made available.
It is a simple system and does not require other repository systems.
For every document maintained, check-ins and check-outs can be done.
It has the ability to retrieve and revert to an old version of the document. This is extremely
helpful in case of accidental deletions.
Side features and bugs can be identified and fixed in a streamlined manner using the system. Troubleshooting is also made easier.
Its tag system helps in differentiating between alpha, beta or release versions for different
documents or applications.
Collaboration becomes easier in a multi-person application development project.
CVS is a production quality system in wide use around the world, including many free software projects.
While CVS stores individual file history in the same format as RCS, it offers the following significant
advantages over RCS:
It can run scripts which you can supply to log CVS operations or enforce site-specific policies.
Client/server CVS enables developers scattered by geography or slow modems to function as
a single team. The version history is stored on a single central server and the client machines
have a copy of all the files that the developers are working on. Therefore, the network between
the client and the server must be up to perform CVS operations (such as checkins or updates)
but need not be up to edit or manipulate the current versions of the files. Clients can perform
all the same operations which are available locally.
In cases where several developers or teams want to each maintain their own version of the files,
because of geography and/or policy, CVS's vendor branches can import a version from another
team (even if they don't use CVS), and then CVS can merge the changes from the vendor branch
with the latest files if that is what is desired.
Unreserved checkouts, allowing more than one developer to work on the same files at the
same time.
CVS provides a flexible modules database that provides a symbolic mapping of names to
components of a larger software distribution. It applies names to collections of directories and
files. A single command can manipulate the entire collection.
CVS servers run on most unix variants, and clients for Windows NT/95, OS/2 and VMS are
also available. CVS will also operate in what is sometimes called server mode against local
repositories on Windows 95/NT.
CVS, and the older RCS, offer version control (or revision control), the practice of maintaining
information about a project's evolution so that prior versions may be retrieved, changes tracked, and,
most importantly, the efforts of a team of developers coordinated.
Basic Concepts
RCS (Revision Control System) works within a single directory. To accommodate large projects using
a hierarchy of several directories, CVS creates two new concepts called the repository and the sandbox.
The repository (also called an archive) is the centralized storage area, managed by the version control
system and the repository administrator, which stores the projects' files. The repository contains
information required to reconstruct historical versions of the files in a project. An administrator sets up
and controls the repository using the procedures and commands.
A sandbox (also called a working directory) contains copies of versions of files from the repository.
New development occurs in sandboxes, and any number of sandboxes may be created from a single
repository. The sandboxes are independent of one another and may contain files from different stages
of the development of the same project. Users set up and control sandboxes using the procedures and
commands found in "CVS User Reference".
In a typical interaction with the version control system, a developer checks out the most current code
from the repository, makes changes, tests the results, and then commits those changes back to the
repository when they are deemed satisfactory.
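With CVS, that interaction looks roughly like this (module name illustrative):
$ cvs checkout myproject
$ cvs update
$ cvs commit -m "Describe the change"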
Some systems, including RCS, use a locking model to coordinate the efforts of multiple developers by
serializing file modifications. Before making changes to a file, a developer must not only obtain a copy
of it, but he must also request and obtain a lock on it from the system. This lock serves to prevent (really
dissuade) multiple developers from working on the same file at the same time. When the changes are
committed, the developer unlocks the file, permitting other developers to gain access to it. The locking
model is pessimistic: it assumes that conflicts must be avoided. Serialization of file modifications
through locks prevents conflicts. But it is cumbersome to have to lock files for editing when bug-
hunting. Often, developers will circumvent the lock mechanism to keep working, which is an invitation
to trouble. Unlike RCS and SCCS, CVS uses a merging model which allows everyone to have access
to the files at all times and supports concurrent development. The merging model is optimistic: it
assumes that conflicts are not common and that when they do occur, it usually isn't difficult to resolve
them. CVS is capable of operating under a locking model via the -L and -l options to the admin
command. Also, CVS has special commands (edit and watch) for those who want additional
development coordination support. CVS uses locks internally to prevent corruption when multiple
people are accessing the repository simultaneously, but this is different from the user-visible locks of
the locking model discussed here.
In the event that two developers commit changes to the same version of a file, CVS automatically defers
the commit of the second committer's file. The second developer then issues the cvs update command,
which merges the first developer's changes into the local file. In many cases, the changes will be in
different areas of the file, and the merge is successful. However, if both developers have made changes
to the same area of the file, the second to commit will have to resolve the conflict. This involves
examination of the problematic area(s) of the file and selection among the multiple versions or making
changes that resolve the conflict. CVS only detects textual conflicts, but conflict resolution is concerned
with keeping the project as a whole logically consistent. Therefore, conflict resolution sometimes
involves changing files other than the one about which CVS complained. For example, if one developer
adds a parameter to a function definition, it may be necessary for all the calls to that function to be
modified to pass the additional parameter. This is a logical conflict, so its detection and resolution is
the job of the developers (with support from tools like compilers and debuggers); CVS won't notice the
problem. In any merge situation, whether or not there was a conflict, the second developer to commit
will often want to retest the resulting version of the project because it has changed since the original
commit. Once it passes, the developer will need to recommit the file.
Tagging
CVS tracks file versions by revision number, which can be used to retrieve a particular revision from
the repository. In addition, it is possible to create symbolic tags so that a group of files (or an entire
project) can be referred to by a single identifier even when the revision numbers of the files are not the
same (which is most often the case). This capability is often used to keep track of released versions or
other important project milestones.
For example, the symbolic tag hello-1_0 might refer to revision number 1.3 of hello.c and revision
number 1.1 of Makefile (symbolic tags are created with the tag and rtag commands).
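For example, such a tag could be applied to the files in the current sandbox with:
$ cvs tag hello-1_0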
Branching
The simplest form of development is linear, in which there is a succession of revisions to a file, and
each derived from the prior revision. Many projects can get by with a completely linear development
process, but larger projects (as measured by number of files, number of developers, and/or the size of
the user community) often run into maintenance issues that require additional capabilities. Sometimes,
it is desirable to do some speculative development while the main line of development continues
uninterrupted. Other times, bugs in the currently released version must be fixed while work on the next
version is underway. In both of these cases, the solution is to create a branch (fork) from an appropriate
point in the development of the project. If at a future point some or all of the changes on the branch are
needed back on the main line of development (or elsewhere), they can be merged in (joined). Branches
are forked with the tag -b command; they are joined with the update -j command.
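For example (branch tag name illustrative):
$ cvs tag -b release-1-fixes
$ cvs update -j release-1-fixes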
AWK
The awk command is fundamentally a scripting language and a powerful text manipulation tool in
Linux. It is named after its authors Alfred Aho, Peter Weinberger, and Brian Kernighan. Awk is popular because of its ability to process text (strings) as easily as numbers.
It scans a sequence of input lines, or records, one by one, searching for lines that match the pattern.
When a match is found, an action can be performed. It is a pattern-action language.
Input to awk can come from files, redirection and pipes or directly from standard input.
Terminology
Let’s get on to some basic terms before we dive into the tutorial. This will make it easier for you to
understand the concept better.
FS is the field separator. By default FS is set to whitespace. That means each word is a field.
NF is the Number of Fields in a particular record.
Fields are numbered as:
$0 for the whole line.
$1 for the first field.
$2 for the second field.
$n for the nth field.
$NF for the last field.
$(NF-1) for the second last field.
Standard format of awk
awk can be used to print a message to the terminal based on some pattern in the text. If you run the awk command without any pattern and just a single print action, awk prints the message every time you hit Enter. This happens because awk expects input from the command-line interface.
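The general form of an awk program is:
$ awk 'pattern {action}' input_file
For example, with no input file awk reads from standard input (the message text is illustrative):
$ awk '{print "hello"}'
Each time you type a line and hit Enter, awk prints "hello".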
Processing input from the command line using awk
We saw in the previous example that if no input-source is mentioned then awk simply takes input from
the command line.
Input under awk is seen as a collection of records and each record is further a collection of fields. We
can use this to process input in real-time.
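A command matching the description below (the message text is illustrative) would be:
$ awk '$3 == "linux" {print $1, "is about linux"}'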
This code looks for records where the third word in the line is 'linux'. When a match is found, it prints the message; here we have referenced the first field from the same line. Before moving forward, let's create a text file for use as input.
First 200
Second 300
Third 150
Fourth 300
Fifth 250
Sixth 500
Seventh 100
Eight 50
Ninth 70
Tenth 270
These could be the dues in rupees for different customers named First, Second, and so on.
Input from a file can be printed using awk. We can refer to different fields to print the output in a fancy
manner.
$ awk '{print $1, "owes", $2}' rec.txt
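Given the rec.txt above, the output begins:
First owes 200
Second owes 300
Third owes 150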
$1 and $2 refer to fields one and two respectively; in our input file these are the first and second words in each line. We haven't mentioned any pattern in this command, so awk runs the action on every record. The default pattern for awk is "", which matches every line.
You can notice that by default the print command separates the output fields by a whitespace. This can be changed by changing OFS.
$ awk 'OFS=" owes " {print $1,$2}' rec.txt
The same output is achieved as the previous case. The default output field separator has been changed
from whitespace to ” owes “. This, however, is not the best way to change the OFS. All the separators
should be changed in the BEGIN section of the awk command.
Field separator can be changed by changing the value of FS. By default, FS is set to whitespace. We created another file with the following data. Here the name and the amount are separated by '-':
First-200
Second-300
Third-150
Fourth-300
Fifth-250
Sixth-500
Seventh-100
Eight-50
Ninth-70
Tenth-270
$ awk 'FS="-" {print $1}' rec-sep.txt
You can notice that the first line of the output is wrong: for the first record, awk was not able to separate the fields. This is because we placed the statement that changes the field separator in the action section, and the action section first runs after a record has already been read and split. In this case, First-200 was read and processed with the field separator still set to whitespace.
Correct way:
$ awk 'BEGIN {FS="-"} {print $1}' rec-sep.txt
Now we get the correct output. The first record has been separated successfully. Any statement placed
in the BEGIN section runs before processing the input. BEGIN section is most often used to print a
message before the processing of input.
The third type of separator is record separator. By default record separator is set to newline. Record
separator can be changed by changing the value of RS. Changing RS is useful in case the input is a CSV
(comma-separated value) file.
First-200,Second-300,Third-150,Fourth-300,Fifth-250,Sixth-500,Seventh-100,Eight-50,Ninth-70,Tenth-270
This is the same input as above but in a comma separated format.
$ awk 'BEGIN {FS="-"; RS=","; OFS=" owes Rs. "} {print $1,$2}' rec_2.txt
Boolean operations in awk
Boolean operations can be used as patterns. Different field values can be used to carry out comparisons.
awk works like an if-then command. In our data, we can find customers with more than Rs. 200 due.
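A command matching this description would be along these lines:
$ awk '$2 > 200 {print $0}' rec.txt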
This gives us the list by comparing the second field of each record with 200 and printing the record if the condition is true.
Since awk works with fields we can use this to our benefit. Running ls -l command gives the list of all
the files in the current directory with additional information.
The awk command can be used along with ls -l to find out which files were last modified in the month of May. $6 is the field displaying the month of last modification. We can match this field against the string 'May'.
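A command matching this description would be:
$ ls -l | awk '$6 == "May"'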
User-defined variables in awk
To perform additional operations variables can be defined in awk. For example to calculate the sum in
the list of people with dues greater than 200 we can define a sum variable to calculate the sum.
$ awk 'BEGIN {sum=0} $2>200 {sum=sum+$2; print $1} END{print sum}' rec.txt
The sum variable is initialized in the BEGIN section, updated in the action section, and printed in the
END section. The action section would be used only if the condition mentioned in the pattern section
is true. Since the pattern is checked for each line, the structure works as a loop with an update being
performed each time the condition is met.
The awk command can also be used to count the number of lines, the number of words, and even the
number of characters. Let’s start with counting the number of lines with the awk command.
The number of lines can be printed by printing out the NR variable in the END section. NR is used to
store the current record number. Since the END section is accessed after all the records are processed,
NR in the END section would contain the total number of records.
$ awk 'END { print NR }' rec.txt
Count number of words
To get the number of words, NF can be used. NF is the number of fields in each record, so summing NF over all the records gives the total number of words. In the command below, c counts the words: for each line, the number of fields in that line is added to c, and printing c in the END section gives the total.
$ awk 'BEGIN {c=0} {c=c+NF} END{print c}' rec.txt
Count number of characters
Number of characters for each line can be obtained by using the built-in length function of awk. $0 refers to the entire record, so length($0) gives the number of characters in that record.
$ awk '{ print "number of characters in line", NR, "=" length($0) }' rec.txt