Module2 Session3 Final

This document introduces Linux commands for extracting and analyzing data from files. It covers commands for viewing file statistics (wc), sorting data (sort), removing duplicates (uniq), comparing files (join, diff), searching for patterns (grep), extracting fields (cut), and redirecting command outputs to files. The goal is to learn how to manipulate text data from files and combine multiple commands to analyze biological data in Linux.


Introduction to Bioinformatics online course: IBT

Linux
Extracting information from files

Linux | Amel Ghouila
Learning Objectives

① Learn how to search for patterns in files and how to extract specific data
② Learn how to sort file contents
③ Learn basic commands to compare file contents
④ Learn how to redirect results
⑤ Learn how to combine commands

Learning Outcomes

① Be able to search for patterns in files and extract specific data
② Be able to sort file contents
③ Be able to use some basic commands to compare file contents
④ Know how to write command results into a file
⑤ Be able to combine different commands

Part 1

Basic operations on files and data extraction

Some statistics about your file content: wc command
•  wc prints newline, word, and byte counts for each file
•  Syntax: wc <options> <filename>
•  Some useful options:
•  -c: prints the byte counts
•  -m: prints the character counts
•  -l: prints the newline counts
•  For more info about the different commands use man commandname
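The options above can be tried with a minimal sketch like the one below. The file name seqs.txt and its contents are made up for illustration, and the scratch directory created with mktemp is only there so that no existing file is touched.

```shell
# Work in a scratch directory (illustrative setup, not part of the course material).
cd "$(mktemp -d)"

# Hypothetical input: three short sequences, one per line.
printf 'ATGC\nGGCATT\nTTAA\n' > seqs.txt

wc seqs.txt       # newline, word and byte counts, followed by the file name
wc -l seqs.txt    # newline count only (3 lines here)
wc -c seqs.txt    # byte count (17 here, newlines included)
wc -m seqs.txt    # character count (same as bytes for plain ASCII text)

# Reading from standard input drops the file name; tr removes any padding.
line_count=$(wc -l < seqs.txt | tr -d ' ')
echo "seqs.txt has $line_count lines"
```

Some wc implementations pad the numbers with spaces, which is why the last step strips them before reusing the value.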
Basic operations on files

•  sort: reorders the content of a file “alphabetically”
Syntax: sort <filename>
•  uniq: removes duplicated lines that are adjacent to each other
Syntax: uniq <filename>
•  join: compares the contents of 2 files and outputs the entries they have in common
Syntax: join <filename1> <filename2>
•  diff: compares the contents of 2 files and outputs the differences
Syntax: diff <filename1> <filename2>

Sorting data

•  sort outputs the file content in sorted order, based on a specified sort key (default: the entire line)
•  Syntax: sort <options> <filename>
•  Default field separator: blank
•  Many other commands expect sorted input, so sort is often used in combination with other commands
•  For <options> see man
Sorting data: examples

•  Sort alphabetically (default option): sort <filename>
•  Sort numerically: sort -n <filename>
•  Sort on a specific column (n°4): sort -k 4 <filename> (the key runs from column 4 to the end of the line; use -k 4,4 to restrict it to column 4)
•  Sort based on a tab separator: sort -t $'\t' <filename>
•  ...
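As a rough illustration of these options, the sketch below sorts a small made-up table (genes.tsv and its values are hypothetical). The $'\t' notation from the slide is specific to shells such as bash, so the sketch stores a tab character in a variable instead, which should also work in a plain POSIX shell.

```shell
cd "$(mktemp -d)"   # scratch directory, illustrative setup only

# Hypothetical tab-separated table: gene name, chromosome, read count.
printf 'TP53\tchr17\t120\nBRCA1\tchr17\t9\nEGFR\tchr7\t35\n' > genes.tsv

tab=$(printf '\t')  # portable stand-in for $'\t'

sort genes.tsv                        # alphabetical, whole line as the key
sort -t "$tab" -k 3 genes.tsv         # column 3 compared as text: 120, 35, 9
sort -t "$tab" -k 3,3n genes.tsv      # column 3 compared as numbers: 9, 35, 120
sort -t "$tab" -k 3,3nr genes.tsv     # same key, largest value first
```

Comparing the second and third commands shows why -n matters: as text, "120" sorts before "35" because the comparison goes character by character.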

Extracting data from files

•  grep: searches for occurrences of a specific pattern (regular expression using the wildcards…) in a file
Syntax: grep <pattern> <filename>

•  cut: extracts specific fields from a file
Syntax: cut <options> <filename>

grep command
•  grep (“global regular expression print”) is used to search for occurrences of a specific pattern (regular expression…) in a file
•  grep outputs every whole line that contains the pattern
•  For <options> see man

Example:
Extract lines containing the pattern xxx from a file:
grep xxx <filename>
Extract lines that do not contain the pattern xxx from a file:
grep -v xxx <filename>

grep example

Let’s consider a file named “ghandi.txt”
$ cat ghandi.txt
The difference between what we do
and what we are capable of doing
would suffice to solve
most of the world's problems

$ grep what ghandi.txt
The difference between what we do
and what we are capable of doing

$ grep -v what ghandi.txt
would suffice to solve
most of the world's problems
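The session above can be reproduced with the sketch below, which recreates ghandi.txt in a scratch directory. The last three commands go slightly beyond the slide: -c is a standard grep option that counts matching lines, and the two anchored patterns are small regular expressions chosen only for illustration.

```shell
cd "$(mktemp -d)"   # scratch directory, illustrative setup only

# Recreate the example file from the slide.
printf '%s\n' \
  'The difference between what we do' \
  'and what we are capable of doing' \
  'would suffice to solve' \
  "most of the world's problems" > ghandi.txt

grep what ghandi.txt       # the two lines that contain "what"
grep -v what ghandi.txt    # the two lines that do not
grep -c what ghandi.txt    # count the matching lines instead of printing them: 2
grep '^and' ghandi.txt     # regular expression: lines that start with "and"
grep 'ing$' ghandi.txt     # regular expression: lines that end with "ing"
```

Quoting the pattern keeps the shell from interpreting characters such as $ before grep sees them.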

cut command

•  cut is used to extract specific fields from a file
•  Structure: cut <options> <filename>
•  For <options> see man
•  Important options are
•  -d (field delimiter)
•  -f (field specifier)
Example:
Extract fields 2 and 3 from a file that uses a space as the separator:
cut -d' ' -f2,3 <filename>
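The same two options work with any single-character delimiter. A minimal sketch, assuming a made-up comma-separated file (samples.csv and its columns are hypothetical):

```shell
cd "$(mktemp -d)"   # scratch directory, illustrative setup only

# Hypothetical records: sample ID, organism, sequence length.
printf 'S1,human,1500\nS2,mouse,980\nS3,human,2210\n' > samples.csv

cut -d ',' -f 2 samples.csv      # second field only: human, mouse, human
cut -d ',' -f 1,3 samples.csv    # fields 1 and 3, still separated by commas
cut -d ',' -f 2- samples.csv     # field 2 through the end of each line
```

When -d is omitted, cut splits on tab characters, so tab-separated files need only the -f option.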
uniq command

•  uniq outputs a file with no duplicated lines
•  uniq only compares adjacent lines, so it requires a sorted file as input
•  Syntax: uniq <options> <sorted_filename>
•  For <options> see man
•  A useful option is -c, which outputs each line with its number of repeats
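The need for sorted input is easiest to see on a small case. In the sketch below the file organisms.txt is invented for illustration, and sort's -o option (write the result to the named file) is used to produce the sorted copy.

```shell
cd "$(mktemp -d)"   # scratch directory, illustrative setup only

# Hypothetical list with repeated names that are not next to each other.
printf 'human\nmouse\nhuman\nyeast\nmouse\nhuman\n' > organisms.txt

uniq organisms.txt                            # nothing removed: no identical lines are adjacent
sort -o organisms.sorted.txt organisms.txt    # write a sorted copy
uniq organisms.sorted.txt                     # human, mouse, yeast
uniq -c organisms.sorted.txt                  # counts per name: 3 human, 2 mouse, 1 yeast
```

The counts printed by -c are usually padded with leading spaces, which matters if the output is parsed by another program.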

join command

•  join is used to compare 2 input files based on the entries in a common field (called the “join field”) and outputs a merged file
•  join requires both input files to be sorted on the join field
•  Lines with an identical “join field” are merged into one output line, in which the join field is present only once
•  Structure:
join <options> <filename1> <filename2>
•  For <options> see man
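A minimal sketch of a join on the first field, which is the default join field. Both files and their contents are hypothetical, and they are written already sorted on that field.

```shell
cd "$(mktemp -d)"   # scratch directory, illustrative setup only

# Hypothetical inputs, both sorted on the first column (the join field).
printf 'g1 BRCA1\ng2 TP53\ng3 EGFR\n' > genes.txt
printf 'g1 12.5\ng3 7.1\ng4 3.3\n'   > expr.txt

join genes.txt expr.txt
# g1 BRCA1 12.5
# g3 EGFR 7.1
```

Only g1 and g3 occur in both files, so g2 and g4 are left out, and each identifier is printed once followed by the remaining fields of both lines.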

diff command

•  diff is used to compare 2 input files and displays the entries that differ
•  Can be used to highlight differences between 2 versions of the same file
•  Default output: common lines are not shown; only the differing lines are listed, marked as added (a), deleted (d) or changed (c)
•  Structure: diff <options> <filename1> <filename2>
•  For <options> see man
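The default output format is compact, so a small worked case may help. The two files below are invented versions of the same gene list; the comments show how each part of the output can be read.

```shell
cd "$(mktemp -d)"   # scratch directory, illustrative setup only

# Two hypothetical versions of the same gene list.
printf 'BRCA1\nTP53\nEGFR\n' > old.txt
printf 'BRCA1\nMYC\nEGFR\nKRAS\n' > new.txt

# diff exits with status 1 when the files differ, which is not an error here;
# "|| true" keeps a script that stops on failures from aborting.
diff old.txt new.txt || true
# 2c2      line 2 of old.txt was changed into line 2 of new.txt
# < TP53
# ---
# > MYC
# 3a4      after line 3 of old.txt, line 4 of new.txt was added
# > KRAS
```

Lines starting with < come from the first file and lines starting with > from the second.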

Part 2

Output redirection and combining different commands

Command outputs

•  By default, the standard output of any command appears on the terminal screen.
•  The output can instead be redirected to a file.
•  This is particularly useful for big files
•  Syntax: command options filename.in > filename.out

Output redirection

•  If the file exists, the result will be redirected to it, replacing its previous content
•  If the file does not exist, it will be automatically created and the result redirected to it.

$ cat ghandi.txt
The difference between what we do
and what we are capable of doing
would suffice to solve
most of the world's problems
$ cut -d' ' -f2,3 ghandi.txt
difference between
what we
suffice to
of the

$ cut -d' ' -f2,3 ghandi.txt > ghandi.txt.out
$ cat ghandi.txt.out
difference between
what we
suffice to
of the
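Both cases, a new file and an existing one, can be checked with a short sketch. The file names letters.txt and result.out are hypothetical.

```shell
cd "$(mktemp -d)"   # scratch directory, illustrative setup only

printf 'b\na\nc\n' > letters.txt    # hypothetical input file

sort letters.txt > result.out       # result.out does not exist yet, so it is created
cat result.out                      # a, b, c

wc -l letters.txt > result.out      # result.out exists: its old content is replaced
cat result.out                      # only the wc output is left
```

Because the second redirection silently discards the sorted lines, it is worth checking that an output file name is not already in use.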
Combining commands

•  Each command produces a single standard output
•  As seen previously, this output can be printed on the screen or redirected to a file
•  However, the output of a command can also be redirected to another command
•  This is particularly useful when several operations are needed on a file, with no need to store the intermediate outputs

Combining commands: example

•  Several commands are combined using the “|” (pipe) character

•  Structure:
command1 options1 filename1.in | command2 options2 > filename.out

•  This can be done for as many commands as needed
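As one possible instance of this structure, the sketch below chains cut, sort and uniq to count how often each organism appears in a table, then stores the result in a file. The table samples.tsv and the output name organism_counts.txt are made up for illustration.

```shell
cd "$(mktemp -d)"   # scratch directory, illustrative setup only

# Hypothetical tab-separated table: sample ID, organism.
printf 'S1\thuman\nS2\tmouse\nS3\thuman\nS4\tyeast\nS5\thuman\nS6\tmouse\n' > samples.tsv

# 1. cut keeps the organism column (tab is cut's default delimiter)
# 2. sort groups identical names together, as uniq requires
# 3. uniq -c counts each name
# 4. sort -nr puts the most frequent name first
cut -f 2 samples.tsv | sort | uniq -c | sort -nr > organism_counts.txt

cat organism_counts.txt   # 3 human, 2 mouse, 1 yeast

# The same idea with grep and wc: how many samples are not human?
grep -v human samples.tsv | wc -l
```

No intermediate file is written between the four steps; only the final result reaches organism_counts.txt.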


Thanks
Shaun Aron & Sumir Panji
