Bioinformatics for Beginners 2022
• Course Description: 18
• Module 3: Gene Ontology and Pathway Analysis (Dec 1st - Dec 13th, 2022) 19
• Course requirements: 19
Module1 - Unix/Biowulf
• Lesson Objectives 22
• Why Bioinformatics? 22
• Terms to know 22
• Additional guidance 23
• What is Unix? 23
• What is DNAnexus? 24
• Finding the course and getting started with the GOLD system 25
• What is conda? 29
• Help Session 30
• Quick review 31
• Lesson Objectives 31
• File system 31
• Some useful unix commands to navigate our file system and tell us some things about our files 32
• Getting Started 32
• Anatomy of a command 33
• Where am I? (pwd) 34
• Moving and renaming files and directories, all with one command (mv) 40
• Help! (man) 42
• Additional Resources 43
• Help Session 43
• Lesson 3 Review 45
• Learning Objectives 45
• Use of wildcards 47
• Access your history with the "up" and "down" arrows on your keyboard 48
• Keyboard shortcuts 48
• Working with file content (input <, and output >, append >>) 50
• Combining commands with pipe (|). Where the heck is pipe anyway? 51
• Lesson 4 Review 57
• Lesson Objectives 57
• Working on Biowulf 57
• Batch Jobs 58
• Partitions 60
• Walltime 61
• Swarm-ing on Biowulf 64
• Help Session 66
• Lesson 5 Review: 67
• Learning Objectives: 67
• Using fastq-dump 68
• Using seqkit stat 71
• fasterq-dump 72
• Using prefetch 72
• Where can we get an accession list or run information for a particular project? 72
• E-utilities 75
• Help Session 77
• Lesson Review 78
• Learning Objectives 78
• What is awk? 83
• What is sed? 84
• Help Session 85
Introduction to RNASeq
RNA-SEQ Overview 87
• What is RNASEQ ? 87
• RNASEQ - WorkFlow 87
• Read Choices 89
• Replicates 89
• Sequencing 93
• Computational Prerequisites 97
Data Analysis 98
• Answers 106
• Answers 107
Quantitation 110
• Replicates 114
Visualization 118
Resources 125
• About the Human Brain Reference and Universal Human Reference data 129
Lesson 10: Introducing the FASTQ file and assessing sequencing data quality 145
• What are the files that we need for RNA sequencing analysis? 195
• Alignment of HBR and UHR raw sequencing data with Bowtie2 213
• Creating bigWig files - step 3, actually creating the bigWig files 220
• Visualizing alignment HBR and UHR alignment results with IGV 221
• Normalization 238
Lesson 16: RNA sequencing review and classification based analysis 246
• Review 246
• Objectives 268
• Resources: 276
• Pathways: 295
Help Sessions
• Let's run fastqc, a quality control program, on the files we downloaded from the SRA. 317
• Using sbatch 317
• Objectives 325
• Find out some details about the Golden Snidget genome and transcriptome 328
• Objectives 334
• Objectives 348
• Objectives 360
• Trimming 364
• Pre-trimming 365
• Post-trimming 366
Lesson 13 Practice 367
• Objectives 367
• Review of what we have done so far for the Golden Snidget dataset 367
• Constructing a text file with the Golden Snidget sample IDs 369
• Alignment of Golden Snidget FASTQ files - constructing the HISAT2 command 370
• Alignment of Golden Snidget FASTQ files - converting SAM files to BAM 372
• Objectives 374
• Objectives 383
• Create a folder to store the Golden Snidget differential expression analysis results 383
• Format the Golden Snidget counts table for differential expression analysis 385
• Objectives 387
• Results 394
References
References 399
Additional Resources
• RNA-Seq 403
Logging into Biowulf 404
Installing IGV
• BTEP Bioinformatics for Beginners (September 13th, 2022 - December 13th, 2022) 411
Course Description:
This course was designed to teach the basic skills needed for bioinformatics, including working
on the Unix command line. This course primarily focuses on RNA-Seq analysis. All steps of the
RNA-Seq workflow, from raw data to differential expression and gene ontology analysis, are
covered. However, many of the skills learned are foundational to most bioinformatics analyses
and can be applied to the analysis of other types of next generation sequencing experiments.
Module 3: Gene Ontology and Pathway Analysis (Dec 1st - Dec 13th,
2022)
Lessons focus on gene ontology and pathway analysis.
• Lesson 17: Introduction to gene ontology and pathway analysis (Recording (https://
cbiit.webex.com/cbiit/ldr.php?RCID=5c6a9f415202606a12515f5bbb5c26b1))
• Lesson 18: Functional enrichment with DAVID (Recording (https://cbiit.webex.com/cbiit/
ldr.php?RCID=e40b65d867f24c06718ce1cfd8689f07))
• Lesson 19: Pathway analysis with Qiagen IPA (Recording (https://cbiit.webex.com/cbiit/
ldr.php?RCID=e2fb05d4f0479a491b6fc3fcb13490bd))
• Lesson 20: Review and Course Wrap-up (Recording (https://cbiit.webex.com/cbiit/
ldr.php?RCID=69b4b4550053c2260a37253ad93ca83c))
Course requirements:
Who can take this course?
There are no prerequisites to take this course. This course is open to NCI-CCR researchers
interested in learning bioinformatics skills, especially those relevant to analyzing bulk RNA
sequencing data.
Lesson content and practice questions can be found in these pages. Email [email protected] if
you have any comments, questions, or concerns.
Lesson Objectives
1. Course overview.
2. Introduce Unix and describe how it differs from other operating systems.
3. Introduce the DNAnexus teaching environment used for this course.
4. Discuss ways to use the command line outside of the DNAnexus teaching environment.
5. Introduce conda.
Why Bioinformatics?
1. Analyze your own data
• P-value - "the probability for the observed effect size to be a product of random chance"
(Biostar Handbook). Statistical significance is usually set to values below 0.05 or 0.01.
• Adjusted p-value adjusted for multiple testing / multiple comparisons. When repeating a
test multiple times, the chances of getting a value by random chance increases.
• Statistical power - "reflects the ability of a test to produce the 'correct' prediction" (Biostar
Handbook). This is impacted by sample size, effect size, and the applied statistical
model.
• Confidence interval - the range of values that contains some true value at a defined probability (e.g., 95%).
Additional guidance
• Statistical Models - these should be appropriate for your experimental design. The
methods you are using should be consistent with what's in related scientific literature.
Different tests produce different results. Consult a statistician if possible.
• Outliers - try to identify early on. If you remove data, you should have sound rationale for
doing so.
• P-hacking and HARKing - P-hacking is when you alter some form of the data analysis
pipeline to get different results (e.g., using different statistical tests or including/excluding
only subsets of the data). HARKing is modifying your hypotheses based on results. These
are common in data science, and can be dangerous if (1) you aren't transparent about
why something was done and what you did, (2) you fail to validate results, and (3) you
don't explore alternative explanations.
Do not overinterpret the meaning of a p-value. p-value thresholds are fairly arbitrary.
What is Unix?
1. An operating system, just like Windows or MacOS
2. Sometimes used interchangeably with Linux, which for our purposes, is just a version of Unix
Why learn Unix?
1. It is useful for working with big data, like genomic sequence files
2. It is needed to use the NIH High Performance Cluster (HPC) Biowulf for data analysis
The Bash shell (the Bourne Again SHell) is the most popular Unix shell.
The user has to learn a series of commands for interacting with a Unix system, BUT a few commands, like the ones we will learn over the next several lessons, will allow us to perform a number of bioinformatics tasks:
1. Directory navigation: what the directory tree is, how to navigate and move
around with cd
2. Absolute and relative paths: how to access files located in directories
3. What simple Unix commands do: ls, mv, rm, mkdir, cat, man
4. Getting help: how to find out more on what a unix command does
5. What are “flags”: how to customize typical unix programs ls vs ls -l
6. Shell redirection: what is the standard input and output, how to “pipe” or
redirect the output of one program into the input of the other --- Biostar
Handbook (https://www.biostarhandbook.com/introduction-to-unix.html)
Finding the course and getting started with the GOLD system
Step 1: Login to DNAnexus
Step 2: Once you login, you should see the Projects page. If you have used DNAnexus
previously, you may see more than one project listed. If this is your first time using DNAnexus,
you will only see the project name for this course listed, BioStars. Double click on BioStars.
Step 3: Once you double click on the BioStars project, you will see a project directory
containing multiple subdirectories and files. Select (double click) one of the .html files (e.g.,
Class_LETTER_LETTER.html). We have divided the class into four groups based on name. For
example, if your first name begins with a letter A-E, select Class_A_E.html; if your first name
begins with a letter F-M, select Class_F_M.html. You will need to double-click on the .html file.
Step 4: The Class_LETTER_LETTER.html file will open the GOLD platform application, and you
will see a screen that looks like this:
At the top of the page you will see the instructors' pictures and logins. You will need to find your name (First and Last) in the table below the instructors. Once you find your name, click on the link associated with your name in the login column. The name that you see in the login column
will serve as your username in step 5.
Step 5: The login link will open a terminal with a prompt to login. Login with your username (See
step 4) and password (to be distributed in class).
Step 6: Once you login at the terminal, you will see the following page:
The course documentation is accessible at the top of the page and can be dragged up or
down for viewing. The command line terminal accounts for the rest of the page. You may need
to resize the screen to see the command prompt.
Now you should be logged onto the GOLD platform and ready for class.
Ending your DNAnexus session: if you are finished with the GOLD system for the day, log out using
exit
In addition, we want you to be able to get started analyzing data on your own without having to
use the GOLD teaching environment. Most bioinformatics software will work with unix based
systems (MacOS or Linux). Therefore, if you are working on a Windows operating system, you
will need a work around.
• The default shell starting with macOS Catalina (10.15) is the zsh shell. While this is not really a problem, you can configure your computer to use the bash shell using the following:
chsh -s /bin/bash
xcode-select --install
The Windows Subsystem for Linux (WSL) is a feature of the Windows operating
system that enables you to run a Linux file system, along with Linux command-line
tools and GUI apps, directly on Windows, alongside your traditional Windows
desktop and apps. --- docs.microsoft.com (https://docs.microsoft.com/en-us/
windows/wsl/faq)
To install WSL, you will need to submit a help ticket to service.cancer.gov (https://
service.cancer.gov/ncisp). There are multiple Linux distributions. We recommend new users
install "Ubuntu".
If you do not plan to use your local machine for bioinformatics analyses, you can connect to the
NIH HPC Biowulf using an SSH client. The secure shell (ssh) protocol is commonly used to
connect to remote servers. More on Biowulf later.
You can start an SSH session in your command prompt by executing ssh
user@machine and you will be prompted to enter your password. ---Windows
documentation (https://docs.microsoft.com/en-us/windows/terminal/tutorials/ssh?
source=recommendations)
To find the Command Prompt, type cmd in the search box (lower left), then press Enter to open
the highlighted Command Prompt shortcut.
Note about Command Prompt and PowerShell: Just like the bash shell works effectively with a Linux operating system, Windows also has shells to interact with the Windows operating system. Windows has two shells: the Command Prompt and the PowerShell. However, because most bioinformatics software is Unix-based, these shells will not be useful for bioinformatics scripting.
What is conda?
The Biostar Handbook works with programs installed within a conda environment named
bioinfo. Conda is commonly used for bioinformatics package installations.
It can be difficult to figure out what software is required for any particular research
project. It is often impossible to install different versions of the same software
package at the same time. Updating software required for one project can often
“break” the software installed for another project. --- Pugh and Tocknell,
Introduction to Conda for (Data) Scientists (https://carpentries-incubator.github.io/
introduction-to-conda-for-data-scientists/01-getting-started-with-conda/index.html)
Conda solves these problems by facilitating software installations, making the installation
process far easier. As a package and environment management system, conda also enhances
both the portability and reproducibility of scientific workflows by isolating software and their
dependencies in "environments". These environments do not interact with system wide
programs and therefore do not wreak havoc on your local machine due to software incompatibilities.
Conda runs on Windows, macOS, Linux and z/OS. Conda quickly installs, runs and
updates packages and their dependencies. Conda easily creates, saves, loads
and switches between environments on your local computer. It was created for
Python programs, but it can package and distribute software for any language. ---
docs.conda.io (https://docs.conda.io/en/latest/)
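As a minimal sketch of how this looks in practice (the environment name bioinfo comes from the Handbook; the channels and packages shown here are illustrative, not the Handbook's exact installation list):
# create an environment named bioinfo with a couple of common tools
conda create -n bioinfo -c bioconda -c conda-forge sra-tools seqkit
# activate the environment before running analyses
conda activate bioinfo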
To exit an active conda environment when you are finished working in it, run:
conda deactivate
Help Session
1. Getting set up on DNAnexus
2. Getting everyone access to Biowulf via ssh.
Quick review
• Unix is an operating system
• We use a unix shell (typically bash) to run many bioinformatics programs
• We need to learn unix to use non-GUI based tools and Biowulf
Lesson Objectives
• Learn the basic structure of a unix command
• Learn how to navigate our file system, including absolute vs relative directories
• Learn unix commands related to navigating directories, creating files and removing files
or directories, and getting help
You will need to learn how to troubleshoot error messages. Often this will involve googling the
error in reference to the entered command. There are many forums that post help regarding
specific errors (e.g., stack overflow, program repositories such as github).
File system
We manage files and directories through the operating system's file system. A directory is
synonymous with a "folder", which is used to organize files, other directories, executables, etc.
On a Windows or Mac, we usually open and scroll through our directories and files using a GUI.
For example, Finder is the default file management GUI from which we can access files or
deploy programs on a macbook.
This same file system can be accessed and navigated via command line from the unix shell.
Getting Started
ls
The "ls" command "lists" the contents of the directory you are in. You may see files and other
directories here.
How can you tell the difference between a file and a directory?
ls -lh
will show permissions and indicate directories (d). The -lh are flags. -l refers to listing in long
format, while -h provides human readable file sizes.
Or, many systems offset directories and files using colors (e.g., blue for directories). If you don't
see colorized output, try the -G flag.
We can also label output by adding a marker to indicate files, directories, links, and
executables using the -F flag.
ls -F
a trailing / = directory
a trailing @ = link
a trailing * = executable
Anatomy of a command
Using ls as an example, we can get an idea of the overall structure of a unix command.
The first thing we see is the command line prompt, usually $ or %, which varies by unix system.
The prompt lets us know that the computer is waiting for a command. Next we see the actual
command, in this case, ls, telling the computer to list the files and directories. Most commands
will have various options / flags that can be included to modify the command function. We can
also supply an argument, which in the case of ls is optional. For example, here we supplied an
alternative directory from which we are interested in listing files and directories. We hit enter
after each command, and when the command has finished running, the command prompt will
reappear prompting us to enter more commands.
Where am I? (pwd)
pwd
/home/username
where username is your name. This is your home directory - where you start from when you
open a terminal. This is an example of a "path". The path tells us the location of a file or
directory. Note: while Windows computers use a \ as a path separator, unix systems use a /.
Therefore, the pwd command is very helpful for figuring out where you are in the directory
structure. If you are getting "file not found" errors while trying to run something, it is a good idea
to pwd and see if you are where you think you are. Type the pwd command and make a note of
the directory you are in.
The file system on any computer is hierarchical, with the top level of the file system, or root
directory, being /.
At the top is the root directory that holds everything else. We refer to it using a slash
character, /, on its own; this character is the leading slash in /Users/nelle.
Inside that directory are several other directories: bin (which is where some built-in
programs are stored), data (for miscellaneous data files), Users (where users’
personal directories are located), tmp (for temporary files that don’t need to be
stored long-term), and so on.
We know that our current working directory /Users/nelle is stored inside /Users
because /Users is the first part of its name. Similarly, we know that /Users is stored
inside the root directory / because its name begins with /.
Notice that there are two meanings for the / character. When it appears at the front
of a file or directory name, it refers to the root directory. When it appears inside a
path, it’s just a separator.
Underneath /Users, we find one directory for each user with an account on Nelle’s
machine, her colleagues imhotep and larry.
Like other directories, home directories are sub-directories underneath "/Users" like
"/Users/imhotep", "/Users/larry" or "/Users/nelle"
Typically, when you open a new command prompt, you will be in your home
directory to start. ---swcarpentry/shell-novice: Software Carpentry: the UNIX shell
(https://swcarpentry.github.io/shell-novice/02-filedir/index.html)
The touch command creates an empty file. It is not a command you will use very often, but it is good to know about.
touch file1.txt
touch file2.txt
ls
nano file2.txt
Unix is an operating system, just like Windows or MacOS. Linux is a Unix-like operating system.
Use the underscore (_) where a space would go, like this, to name a directory containing RNA-
Seq data.
my_RNA_Seq_data
brain_rna.fastq
liver_rna.fastq
The first part of the file name provides info about the file, and the extension (.fastq) tells what
kind of file it is. (Examples of file extensions are .csv, .txt, .fastq, .fasta and many more.)
It's important to understand file extensions, to know what kinds of data you are working with.
.txt files are plain text files. These are often, but not always, tab delimited.
.csv are "comma-separated values" - good for importing into MS Excel spreadsheets
.fastq tells you that these are FASTQ files, containing sequence data and quality scores
By adding the -i option, the system will ask if you're sure you want to delete. Generally
speaking, when a file on a Unix system is deleted, it is gone.
You can modify your profile on a Unix system to always ask before deleting; this is a good idea when you're just getting started.
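One common way to set this up (assuming a bash shell; this is an illustration, not part of the original lesson) is to add aliases to your ~/.bashrc:
# in ~/.bashrc (or ~/.bash_profile on some systems)
alias rm='rm -i'   # always prompt before deleting
alias mv='mv -i'   # prompt before overwriting a file when moving
alias cp='cp -i'   # prompt before overwriting a file when copying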
rm -i file1.txt
mkdir RNA_Seq_data
It can be used to go to a specific directory. Let's "go to" the directory we just made, and make
another directory within it.
cd RNA_Seq_data
pwd
mkdir exp_one
ls
cd exp_one
touch myseq.txt
ls
pwd
So, we've moved to the RNA_Seq_data directory, checked our directory with pwd, created a
directory called exp_one, listed the contents of RNA_Seq_data so we can see the directory
we just created, now we go to that directory with cd, create a file with touch, list the contents
with ls and print our working directory.
By itself, the cd command takes you home. Let's try that, and then do a pwd to see where we
are.
cd
pwd
/home/username
How can we go back to the exp_one directory we created? We need to give the "path" to that
directory.
cd RNA_Seq_data/exp_one
pwd
ls
Check where you are with pwd and look at the contents of the directory with ls. What do you
see? It should be the file "myseq.txt".
Here's another way to get around the directory structure using cd.
cd ~/RNA_Seq_data/exp_one
cd RNA_Seq_data/exp_one
The first cd command provides the full path to where you want to go; it is called an "absolute" path.
For the second version, you need to be in the directory that contains RNA_Seq_data, or the command will not work. This is known as a "relative" path.
As a reminder, paths are the sequence of directories that hold your data. In this path...
~/RNA_Seq_data/exp_one
there is a directory named exp_one, within a directory named RNA_Seq_data, within our home
directory.
You will become more comfortable with paths as you build up your directories and data.
Another way to use the cd command is to go up one level in the directory structure, like this.
cd ..
This can be very helpful as you move around the directory tree. There are many more ways to
use the "cd" command.
rmdir exp_one
This fails with a "Directory not empty" error, because exp_one still contains myseq.txt. What should we do? We need to remove the contents of a directory before we can remove the directory. Here's one safe option.
cd exp_one
ls
rm myseq.txt
ls
cd ..
ls
rmdir exp_one
Moving and renaming files and directories, all with one command (mv)
The mv command is a handy way to rename files if you've created them with a typo or decide to
use a more descriptive name. For example:
cd
mv file2.txt README.txt
ls
Be careful when moving files; a mistake in the command can yield unexpected results.
mv README.txt RNA_Seq_data
cd RNA_Seq_data
ls
For example:
mkdir dir1
mkdir dir2
touch dir2/hello.txt
touch hello.txt
mv -i dir2/hello.txt hello.txt
mv dir1 dir2
cd dir2
mv dir1 dir3
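In case it is not obvious what each of these commands does, here is the same sequence with annotations added (the annotations are mine, not part of the original exercise):
mkdir dir1                      # create an empty directory dir1
mkdir dir2                      # create an empty directory dir2
touch dir2/hello.txt            # create an empty file inside dir2
touch hello.txt                 # create a second hello.txt in the current directory
mv -i dir2/hello.txt hello.txt  # -i prompts before overwriting ./hello.txt
mv dir1 dir2                    # dir2 already exists, so dir1 is moved *into* it (becomes dir2/dir1)
cd dir2
mv dir1 dir3                    # dir3 does not exist, so dir1 is simply renamed to dir3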
We can use the less command to view the contents of a file like this.
cd
less /data/sample.fasta
You'll need to type q to get out of less and back to the command line. Before the less
command was available, the more command was commonly used to look at file content. The
less command has more options for scrolling through files, so it is now the preferred
command.
The cp (copy) command is similar to mv but will create an actual copy of a file. You will need to specify what you are copying (the source) and where you want to make the copy (the target).
For example:
touch ~/file_to_copy.txt
cp ~/file_to_copy.txt ./RNA_Seq_data
We can also copy an entire directory using the recursive flag (cp -r).
cp -r RNA_Seq_data RNA_Seq_data_copy
Help! (man)
All Unix commands have a man or "manual" page that describes how to use them. If you need
help remembering how to use the command ls, you would type:
man ls
There are quite a few flags/options that we can use with the ls command, and we can learn all
about them on the man page. My favorite flags for ls are -l and -h. We will use flags often,
and you won't get far in Unix without knowing about them. Try this:
cd
ls -lh
-l (The lowercase letter "ell".) List in long format. (See below). If the output is to a terminal, a total
sum for all the file sizes is output on a line before the long listing.
cd
ls
ls -lh
Additional Resources
Software Carpentry: The Unix Shell (https://swcarpentry.github.io/shell-novice/01-intro/
index.html)
Help Session
Practice navigating the file system and creating files. Instructions are here.
Lesson 3 Review
• Biowulf is the high performance computing cluster at NIH.
• When you apply for a Biowulf account you will be issued two primary storage spaces: 1) /home/$USER and 2) /data/$USER, with 16 GB and 100 GB of default disk space, respectively.
• Hundreds of pre-installed bioinformatics programs are available through the module
system.
• Computational tasks on Biowulf should be submitted as a job (sbatch, swarm) or
through an interactive session sinteractive.
• Do not run computational tasks on the login node.
Learning Objectives
We are going to shift gears back to unix. We will focus on learning concepts that make working
with unix particularly useful including:
ls
ls -S
ls -lh
ls -h (when used with -l option, prints file sizes in a human readable format with the unit
suffixes: Byte, Kilobyte, Megabyte, Gigabyte, Terabyte. This reduces the number of digits
displayed.)
1. file type
2. Content permissions
3. Number of hard links to content
4. Owner
5. Group owner
6. Content size (bytes)
7. Last modified date / time
8. File / directory name
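For reference, here is how a single line of ls -l output maps onto those fields (the file shown here is made up for illustration):
-rw-r--r-- 1 username staff 1062 Sep 13 10:42 myseq.txt
# -rw-r--r--     file type (-) and permissions (fields 1-2)
# 1              number of hard links
# username       owner
# staff          group owner
# 1062           size in bytes
# Sep 13 10:42   last modified date/time
# myseq.txt      file/directory name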
There are many flags you can use with ls. How would we find out what they are?
man ls
Or to see a more user friendly display, google to the rescue. Google "man ls unix" and see what
you get. Here's (https://shapeshed.com/unix-ls/) a useful, readable explanation of the "ls"
command with examples.
ls -lhS
ls -alt
Flags and options add a layer of complexity to unix commands but are necessary to get the
command or program to behave in the way you expect. For example, here is a command line
for running "blastn" an NCBI/BLAST application.
What's going on in this command line? First, the BLAST algorithm is specified, in this case it is blastn, then the -db flag is used to choose the database to search against (nt for nucleotide). The -query flag specifies the input sequence, which is in FASTA format, and the -out flag specifies the name of the output file.
Use of wildcards
Wildcard characters are a handy tool when working at the command line. Want to list all your
FASTA files?
Use *:
ls /data/*.fasta
This example will work as long as all your FASTA files end in .fasta. But sometimes they don't.
ls /data/*.fa
ls /data/*.f*
In addition to the asterisk (*) character, you can use other wildcards on the unix command line,
not limited to the following:
? - matches a single character
{} - brace expansion; matches a comma-separated list of alternatives (e.g., *.{fasta,fastq})
[] - specify a range of characters or numbers. For example, [a-z] would specify all lower
case letters and [0-9] would mean all digits.
To see some more practical examples of using wildcards, see this article (https://
www.tecmint.com/use-wildcards-to-match-filenames-in-linux/) from tecmint.com and this
(https://medium.com/@leedowthwaite/advanced-wildcard-patterns-most-people-dont-
know-52f7fd608cb3) from the medium.com. This second article provides a nice discussion on
how wildcards differ from regular expressions.
touch file.txt
touch file.fasta
touch file.fastq
Start typing one of the file names (for example, fi) and press the Tab key; the shell will complete as much of the name as it can.
The tab complete will save you lots of typing, and also help to figure out if you are where you
think you are in the directory structure.
Keyboard shortcuts
There are also a few handy keyboard shortcuts to make life on the command line easier. For
example:
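The specific shortcuts shown in class are not preserved here, but a few that work in most bash terminals include:
• Ctrl-C - interrupt (kill) the currently running command
• Ctrl-A / Ctrl-E - jump to the beginning / end of the command line
• Ctrl-R - search backwards through your command history
• Ctrl-L - clear the screen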
This command reads the content of sample.fasta and outputs it to standard output (i.e., the screen). This is not helpful for very large files, as the contents scroll past quickly; less is a better option for reading large files.
cat /data/sample.fasta
You can use cat to combine several files into one file, such as:
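For example, using the seq*.fasta files that appear later in this lesson (a reconstruction of the kind of command shown in class):
cat seq1.fasta seq2.fasta seq3.fasta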
Although, this again prints to standard output, the screen. To capture that output, we need to
learn how to redirect output. (Coming up next!)
head
head /data/sample.fasta
tail
tail /data/sample.fasta
You can specify how many lines you would like to see (-n), or you can use the default value,
which is 10.
head -n 20 /data/sample.fasta
cat -n /data/sample.fasta
What does the -n flag do? How could you find out more about "cat"?
man cat
Working with file content (input <, output >, append >>)
Want to put the output from cat, head, or tail into a new file?
What if we want the first 20 lines and the last 20 lines in one file, with the first at the top and the
last at the bottom? Use append, >> to paste the second file to the bottom of the first file. Let's
try it.
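A sketch of how this looks with the > and >> operators (the output file name here is illustrative):
head -n 20 /data/sample.fasta > first_and_last.fasta
tail -n 20 /data/sample.fasta >> first_and_last.fasta
wc first_and_last.fasta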
Keep in mind that if you redirect output into the same file multiple times with >, you overwrite the previous contents. For example, what is the final content of our file covid.fasta?
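The commands in question looked something like this (a reconstruction; the exact command run on each file is not preserved):
cat seq1.fasta > covid.fasta
cat seq2.fasta > covid.fasta
cat seq3.fasta > covid.fasta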
How many lines are now in "covid.fasta"? How can you check?
wc covid.fasta
wc is a very useful function. Without opening a file, we can find out how many lines, words and
characters are in it. Line counts are extremely useful to assess your data output.
What if we created a file where we were expecting there to be 1000 lines of output? The wc command provides a quick way to check.
What happened to all of our content? The final results are from "seq3.fasta" only. The other two
results files have been overwritten.
So, how would you get all three files into covid.fasta? You'll need to use append.
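A sketch of the append version, assuming the same three input files as above:
cat seq1.fasta > covid.fasta
cat seq2.fasta >> covid.fasta
cat seq3.fasta >> covid.fasta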
How could you test to see if the file has the expected number of lines?
wc covid.fasta
The pipe symbol "|" (a.k.a., vertical bar) is way over on the right hand side of your keyboard,
above the backslash \.
Pipe is used to take the output from one command, and use it as input for the next command,
all in one command line. Let's look at some examples.
head -n 20 /data/sample.fasta | wc
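The second command being described below would look like this (a reconstruction from the description; the original is not preserved here):
cat /data/sample.fasta | head -n 20 > output.fasta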
This combines several things we have learned about. The cat command reads the file sample.fasta and sends its contents to standard output. The pipe | takes that output and runs it through the head command, where we only want to see the first 20 lines, and we then redirect (>) those lines into a file called "output.fasta". Let's compare the files. How are they different?
ls -lh
and
less /data/sample.fasta
less output.fasta
As our first example we will look for restriction enzyme (EcoRI) sites in a sequence file
(eco.fasta). The file has four EcoRI sites, but two of them are across the end of the line (and
won't be found).
ls /data/eco.fasta
grep -n GAATTC /data/eco.fasta
We can modify the "eco.fasta" file to remove the line breaks (\n) at the ends of the lines.
-v : This prints out all the lines that do not match the pattern
The unix "tr" (translate) command is used for translating or deleting characters.
Usage:
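A sketch of how tr is used here (the exact command from class may have differed slightly):
cat /data/eco.fasta | tr -d '\n'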
So this part of the command line is finding the line breaks "\n" and removing them.
What if we just wanted to count the occurrence of the EcoRI sites in the sequence?
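One way to do this, assuming the line breaks have been removed as above (grep -o prints each match on its own line so that wc -l can count them):
cat /data/eco.fasta | tr -d '\n' | grep -o GAATTC | wc -l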
Let's create a word file that we can input to grep. We can input multiple restriction enzyme sites
and search for all of them.
cd
nano wordfile.txt
Put in the words (GAATTC, TTTTT). Now we can use that file to find lines.
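With the -f flag, grep reads its search patterns from a file, so the search would look something like this (a sketch):
grep -f wordfile.txt /data/eco.fasta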
man grep
There are also existing programs to find motifs (patterns) in sequence data. For example:
fuzznuc -help
fuzznuc -pattern GAATTC -rformat2 tagseq eco.fasta -outfile /home/username/resu
A for loop runs the same command(s) for each item in a list ("i" can be any variable name); the loop can be written all on one line or on separate lines. These steps can be saved as a file, thereby creating a simple Unix script.
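A reconstruction of the kind of loop being described (file names are based on the files used earlier in this lesson):
for i in *.fasta; do grep ">" $i; done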
This one pulls out all the header ">" lines in the fasta files.
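A second reconstructed loop, restricted to the seq*.fasta files:
for i in seq*.fasta; do grep ">" $i; done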
While this one just pulls out the ones from files named seq*.fasta.
As we have seen, ls -l, provides information about file types, the owner of the file, and other
permissions.
For example:
ls -l /data/sample.fasta
The first character indicates whether this is a directory (d) or not - or some other special file type. The next 3 positions are the owner's/user's permissions. In this example, the owner can "read", "write" and "execute". So they can create files and directories here, read files here, and execute/run programs. The next 3 positions show the permissions for the "group". The last 3 positions show permissions for everyone ("other").
You can modify permissions using chmod. Let's see this in action.
touch example.txt
chmod u-w example.txt
ls -l example.txt
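After removing the user's write permission, the ls -l output should show something like -r--r--r-- for this file (the exact group/other bits depend on your umask). To restore write permission for the user (an extra example, not part of the original exercise):
chmod u+w example.txt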
Help Session
Let's complete a Unix treasure hunt.
Lesson 4 Review
• Flags and command options
• Wildcards (*)
• Tab complete
• Accessing user history with the "up" and "down" arrows
• cat, head, and tail
• Working with file content (input,output, and append)
• Combining commands with the pipe (|)
• grep
• for loop
• File Permissions
Lesson Objectives
• Learn about the slurm system by working on Biowulf: batch jobs, swarm jobs, interactive sessions
• Retrieve data from NCBI through a batch job
• Learn how to troubleshoot failed jobs
NOTE: for this session, you will need to either login to your Biowulf account or use a student
account.
Working on Biowulf
Now that we are becoming more proficient at the command line, we can use these skills to
begin working on Biowulf. Today's lesson will focus on submitting computational jobs on the
Biowulf compute nodes.
Open your Terminal if you are using a mac or the Command prompt if you are using a
Windows machine.
When you log into Biowulf, you are automatically in your home directory (/home/$USER). This directory is very small and not suitable for large data files or analysis.
$ cd /data/$USER
When working on Biowulf, you cannot run computational tools on the "login node". Instead, you need to work on a node or nodes that are sufficient for what you are doing.
To run jobs on Biowulf, you must designate them as interactive, batch or swarm. Failure to do
this may result in a temporary account lockout.
Batch Jobs
Most jobs on Biowulf should be run as batch jobs using the "sbatch" command.
$ sbatch yourscript.sh
Where yourscript.sh is a shell script containing the job commands including input, output,
cpus-per-task, and other steps. Batch scripts always start with #!/bin/bash or similar call.
The sha-bang (#!) tells the computer what command interpreter to use, in this case the Bourne-
again shell.
For example, to submit a job checking sequence quality using fastqc (MORE ON THIS
LATER), you may create a script named fastqc.sh:
nano fastqc.sh
#!/bin/bash
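# the rest of the script is reconstructed here as a sketch;
# seqfile1 ... seqfileN are placeholders for your actual sequence file names
module load fastqc
fastqc seqfile1 ... seqfileN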
where seqfile1 ... seqfileN are the names of the sequence files.
Note: fastqc is available via Biowulf's module system, so the module has to be loaded prior to running the command.
For more information on running batch jobs on Biowulf, please see: https://hpc.nih.gov/docs/
userguide.html (https://hpc.nih.gov/docs/userguide.html)
For multi-threaded jobs, you will need to set --cpus-per-task. You can do this at the
command line or from within your script.
In your script:
#!/bin/bash
#SBATCH --job-name qc
#SBATCH --mail-type BEGIN,END
#SBATCH --cpus-per-task #
Within the script we can use directives denoted by #SBATCH to support command line
arguments such as --cpus-per-task. If included within the script, you will not need to call
these at the command line when submitting the job. You should also pass the environment variable $SLURM_CPUS_PER_TASK to the thread argument of your program. Some other useful directives
include --job-name, where you assign a name to the submitted job, and --mail-type,
which you can use to direct slurm to send you an email when a job begins, ends, or both.
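Putting these pieces together, a complete multi-threaded submission script might look like the sketch below (the -t flag is fastqc's thread option; file names, CPU count, and mail settings are illustrative):
#!/bin/bash
#SBATCH --job-name qc
#SBATCH --mail-type BEGIN,END
#SBATCH --cpus-per-task 6

module load fastqc
fastqc -t $SLURM_CPUS_PER_TASK seqfile1.fastq seqfile2.fastq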
sbatch --help
Once you submit a job, you will need to interact with the slurm system to manage or view
details about submitted jobs.
Partitions
Your job may be in a waiting phase ("Pending" or "PD") depending on available resources. You
can specify a particular node partition using --partition.
Summary of partitions
Walltime
The default walltime, or amount of time allocated to a job, is 2 hours on the norm partition. To
change the walltime, use --time=d-hh:mm:ss.
To see the default and maximum walltimes allowed on each partition, use:
batchlim
We will learn how to pull the files we are interested in directly from SRA at a later date. For now,
we will use the run information stored in sra_files_PRJNA578488.txt.
less sra_files_PRJNA578488.txt
Now, let's build a script downloading a single run, SRR10314042, to a directory called /data/
$USER/testscript.
mkdir /data/$USER/testscript
Open the text editor nano and create a script named filedownload.sh.
nano filedownload.sh
#!/bin/bash
#SBATCH --cpus-per-task=6
#SBATCH --gres=lscratch:10
#load module
module load sratoolkit
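# download the run; the option values below are an illustrative sketch
# (--temp points fasterq-dump's temporary files to local scratch space)
fasterq-dump --threads $SLURM_CPUS_PER_TASK --temp /lscratch/$SLURM_JOB_ID \
    --outdir /data/$USER/testscript SRR10314042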
Remember:
• Default compute allocation = 1 physical core = 2 CPUs
• Default memory per CPU = 2 GB
• Therefore, default memory allocation = 4 GB
sbatch filedownload.sh
squeue -u $USER
Once the job status changes from PD (pending) to R (running), let's check the job status.
sjobs -u $USER
ls -lth
NOTE: There are instructions for running SRA-Toolkit on Biowulf here (https://hpc.nih.gov/apps/
sratoolkit.html).
Swarm-ing on Biowulf
Swarm is for running a group of commands (job array) on Biowulf. swarm reads a list of
command lines and automatically submits them to the system as sub jobs. To create a swarm
file, you can use nano or another text editor and put all of your command lines in a file called
file.swarm. Then you will use the swarm command to execute.
$ swarm -f file.swarm
Swarm creates two output files for each command line, one each for STDOUT (file.o) and
STDERR (file.e). You can look into these files with the "less" command to see any important
messages.
$ less swarm_jobid_subjobid.o
$ less swarm_jobid_subjobid.e
swarm --help
nano set.swarm
#SWARM --threads-per-process 3
#SWARM --gb-per-process 1
#SWARM --gres=lscratch:10
#SWARM --module sratoolkit
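# one command per line follows the #SWARM directives; the accessions below are
# placeholders for runs taken from sra_files_PRJNA578488.txt, and the output
# directory is illustrative
fasterq-dump --temp /lscratch/$SLURM_JOB_ID --outdir /data/$USER/testscript SRR10314042
fasterq-dump --temp /lscratch/$SLURM_JOB_ID --outdir /data/$USER/testscript SRRXXXXXXXX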
There is advice for generating a swarm file using a for loop and echo in the swarm user
guide (https://hpc.nih.gov/apps/swarm.html).
Let's run our swarm file. Because we included our directives within the swarm file, the only
option we need to include is -f for file.
swarm -f set.swarm
To start an interactive node, type "sinteractive" at the command line "$" and press Enter/Return
on your keyboard.
$ sinteractive
You will see something like this printed to your screen. You only need to use the
sinteractive command once per session. If you try to start an interactive node on top of
another interactive node, you will get a message asking why you want to start another node.
[username@biowulf ]$ sinteractive
salloc.exe: Pending job allocation 34516111
salloc.exe: job 34516111 queued and waiting for resources
You can use many of the same options for sinteractive as you can with sbatch. The
default sinteractive allocation is 1 core (2 CPUs) and 4 GB of memory and a walltime of 8 hours.
exit
exit
Help Session
Let's submit some jobs on Biowulf.
Lesson 5 Review:
• The majority of computational tasks on Biowulf should be submitted as jobs: sbatch or
swarm
• the SRA-toolkit can be used to retrieve data from the Sequence Read Archive
Learning Objectives:
1. Download data from the SRA with fastq-dump
◦ split files into forward and reverse reads
◦ download part, not all, of the data
2. Compare fastq-dump to fasterq-dump
3. Introduce prefetch
4. Look at XML-formatted data with sra-stat
5. Grab SRA run info and run accession information
6. Work with csv-formatted data using the cut command to isolate columns
7. Learn to automate data retrieval with the parallel command
8. Learn about alternative download options
For the remainder of this course, we will be working on the GOLD teaching environment in
DNAnexus.
mkdir biostar_class
cd biostar_class
mkdir sra_data
cd sra_data
We will download data from the SRA using the command line package sratoolkit. Note, you
can also download files directly from NCBI via a web browser.
Using fastq-dump
fastq-dump SRR1553607
SRR1553607.fastq
Check the file to make sure it is in fastq format. How would you do this? What is FASTQ format?
Spots are a legacy term referring to locations on the flow cell for Illumina sequencers. All of the
bases for a single location constitute the spot including technical reads (e.g., adapters, primers,
barcodes, etc.) and biological reads (forward, reverse). In general, you can think of a spot as
you do a read. For more information on spots, see the linked discussion (https://
www.biostars.org/p/12047/) on Biostars. When downloading "spots", always split the spots into
the original files using:
--split-files
For paired-end reads, we need to separate the data into two different files like this:
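That is (reconstructed from the flag shown above):
fastq-dump --split-files SRR1553607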
which creates
SRR1553607_1.fastq
SRR1553607_2.fastq
NOTE: There is an additional option --split-3 that will split the reads into forward and reverse
files and a third file with unmatched reads. Since many bioinformatic programs require matching
paired end reads, this is the preferred option.
fastq-dump first downloads the data in SRA format, then converts it to FASTQ. If we want to
work with a subset of the data, for example the first 10,000 reads, we can use -X:
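For example (a sketch; the output directory name is illustrative):
fastq-dump -X 10000 --split-files --outdir sra_subset SRR1553607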
We have additionally included --outdir to save our subsetted data to a different directory, so
that we do not overwrite our previous downloads.
fastq-dump --help
To generate an XML report on the data that shows us the "spot" or read count, and the count of
bases "base_count", including the size of the data file:
where we can see the spot count, the base count, and the number of reads corresponding to
quality scores.
spot_count="203445" base_count="41095890"
Side bar: (XML) is Extensible Markup Language, a document format that is both human and
machine-readable. XML aims for: "simplicity, generality and usability across the Internet".
We can get similar information from our downloaded files using a program called seqkit
(https://bioinf.shenwei.me/seqkit/).
seqkit --help
We can see that seqkit stats provides "simple statistics of FASTA/Q files". Let's try it out.
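For example, run it on the FASTQ files downloaded above (a sketch; adjust the file glob to match your files):
seqkit stats *.fastq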
fasterq-dump
We have already seen fasterq-dump in action (See Lesson 5). fasterq-dump is faster than
fastq-dump because it uses multi-threading (default --threads 6). --split-3 and --
skip-technical are defaults with fasterq-dump, so you do not need to worry about
specifying how you want the files to be split.
However, you cannot grab a subset of the file like you can with fastq-dump, nor can you compress the file during download, so fasterq-dump is not necessarily a replacement for fastq-dump.
Using prefetch
Both fastq-dump and fasterq-dump are faster when following prefetch (https://
github.com/ncbi/sra-tools/wiki/08.-prefetch-and-fasterq-dump), and fasterq-dump paired with
prefetch is the fastest way to pull the files from the SRA. There are alternative options, which
we will mention later.
prefetch will first download all of the necessary files to your working directory. Runs are
downloaded in the SRA format (compressed).
mkdir prefetch
cd prefetch
prefetch SRR1553607
ls -l
fasterq-dump SRR1553607
Where can we get an accession list or run information for a particular project?
Run information can be retrieved without navigating to the NCBI website (see the E-utilities below), or there is the SRA Run Selector (https://
www.ncbi.nlm.nih.gov/Traces/study/). The Run Selector is nice because you can filter the results
for a given BioProject and obtain only the accessions that interest you in a semi user-friendly
format. We can search directly from the SRA Run Selector, or simply go to the main NCBI
website and begin our search from there. Let's see the latter example, as this is likely the
primary way you will attempt to find data in the future.
Step 1: Start from the NCBI homepage. Type the BioProject ID (PRJNA257197) into the search
field.
Step 2: In the search results, select the entries next to the SRA.
Step 3: From the SRA results, select "Send results to Run Selector".
Step 4: From there, you can download the metadata or accession list.
Copy and paste the accession list into a file using nano. Save to runaccessions.txt. Now
use head to grab the first few results.
E-utilities
esearch and efetch, can be used to query and pull information from Entrez (https://
www.ncbi.nlm.nih.gov/Web/Search/entrezfs.html). They aren't the easiest to understand or use,
but you can use them to get the same run info that we grabbed using the Run Selector with the
following one liner:
esearch -db sra -query PRJNA257197 | efetch -format runinfo > runinfo.csv
Here we provide the BioProject ID to esearch, which queries the SRA. That info can be piped
to efetch, which will pull down the run info in comma separated format, which we can save to
file using >.
Then we can use a combo of cat, cut, grep, and head to grab only the accession info we are
interested in:
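Reconstructed from the description that follows, the command would be:
cat runinfo.csv | cut -f 1 -d ',' | grep SRR | head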
Here, we opened the file with cat, piped it to the cut command, piped it to grep for only the
accession IDs (skipping the column header), and get only the first few results.
So what is cut? It is a unix command, that is used to cut out selected portions of the file (man
cut for more info). The -f option specifies which field to cut (the first field), and -d tells the program which delimiter to use. We know this is a comma-separated value file, so we do -d ',' to specify the comma.
GNU parallel executes commands in "parallel", one for each CPU core on your system. It
can serve as a nice replacement of the for loop. See Tool: Gnu Parallel - Parallelize Serial
Command Line Programs Without Changing Them (https://www.biostars.org/p/63816/)
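A sketch of how parallel is used here, assuming the accession list saved earlier as runaccessions.txt and an illustrative subset size:
cat runaccessions.txt | head -10 | parallel fastq-dump -X 10000 --split-files {}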
This gives us twenty files, two files for each run and 10 runs. It takes the place of the multiple
fastq-dump commands we would need to use.
The curly braces {} are parallel's default replacement string, marking where each input item is placed; when the replacement string is omitted, parallel appends the input to the end of the command. So the following gets the same result.
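That is, the same command with the trailing {} omitted:
cat runaccessions.txt | head -10 | parallel fastq-dump -X 10000 --split-files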
The real power of parallel comes when using something between the brackets such as {.},
{/}, and {//}, like this...
nano filelist
/dir/subdir/file.txt
/other/list.csv
/raw/seqs.fastq
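With a file list like the one above, the replacement strings behave as follows (these are standard GNU parallel replacement strings; the echo commands are illustrative):
cat filelist | parallel echo {/}    # {/} strips the directory: file.txt, list.csv, seqs.fastq
cat filelist | parallel echo {.}    # {.} removes the extension: /dir/subdir/file, /other/list, /raw/seqs
cat filelist | parallel echo {//}   # {//} keeps only the directory: /dir/subdir, /other, /raw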
Note: parallel will default to 1 concurrent job per available CPU. On a shared system like
helix with 48 CPUs, where users do run fast(er)q-dump, that can cause an overload. Even when
submitting a job on Biowulf, you may not want to run as many concurrent jobs as there are
allocated CPUs (e.g. if you ran fasterq-dump with 2 threads and you have 12 CPUs allocated --
jobs would be 6). Therefore, it is good practice to always specify the -j flag, which assigns the
number of jobs for the parallel command.
There is also a fantastic web service called SRA Explorer (https://sra-explorer.info/#) that can
be used to easily generate code for file downloads from the SRA, if you are looking to download
a maximum of 500 samples. Use caution, the SRA-explorer is not associated with the ENA or
NCBI, and program support / updates cannot be guaranteed.
Help Session
Let's search for and download some data.
Lesson Review
• pwd (print working directory)
• ls (list)
• touch (creates an empty file)
• nano (basic editor for creating small text files)
• using the rm command to remove files. Be careful!
• mkdir (make a directory) and rmdir (remove a directory, must be empty of all files)
• cd (change directory), by itself will take you home, cd .. (will take you up one directory),
cd /results_dir/exp1 (go directly to this directory)
• mv (for renaming files or moving files)
• less (for viewing files, "more" is the older version of this)
• man command (for viewing the man pages when you need help on a command)
• cp (copy) for copying files
• Flags and command options - making programs do what they do
• Wildcards (e.g., *)
• Tab complete - for less typing
• Accessing user history with the "up" and "down" arrows on the keyboard
• cat, head, and tail - print to screen, print first few lines to the screen, print last few
lines to the screen
• Working with file content (<, >, >>)
• Combining commands with pipe (|). Where the heck is pipe anyway?
• Finding information in files with grep
• Performing repetitive actions with Unix (for loop), GNU parallel
• Permissions (chmod,chown)
• wc - number of lines (-l), words (-w), and bytes (-c, usually one byte per character); for
number of characters use -m.
• grep- search files using regular expressions
• cut - cuts selected portions of a file
• fastq-dump and fasterq-dump - SRA file download
• ssh - secure shell protocol for remote login to Biowulf / Helix
Learning Objectives
• Introduce the RNA-Seq data
The data are paired-end with three replicates from each set (UHR, HBR).
Go to the directory you created for working with class material. If you haven't created a class
directory (biostar_class), do that now.
mkdir biostar_class
cd biostar_class
mkdir -p RNA_Seq/raw_data
What does the -p flag do? Now, go to the raw_data directory you have created.
cd RNA_Seq/raw_data
Now that we're in the correct directory, we will use curl to download some bulk RNA-Seq data.
curl http://data.biostarhandbook.com/rnaseq/projects/griffith/griffith-data.tar
Let's take a look at this Unix command line... The curl command is used to retrieve data from
web sites. A similar command is wget. The Unix system you are working with may have either
curl or wget installed. To see which is active on your system, just type the command at the
command line like this...
wget
curl
If curl is active on the system, you may see something like this...
curl --help
and see information on the usage of curl. So it looks like curl is installed on this system.
Moving on. Let's take a look at this command line. We now know what curl means, but how
about the rest of it. The URL http://data.biostarhandbook.com/rnaseq/projects/
griffith/griffith-data.tar.gz represents the "path" to this data. As we have
discussed, paths are a very important concept in Unix. An incorrect path can result in
frustrating "file not found" errors.
Another way to get to this data file is via your browser. Open a browser window and enter
http://data.biostarhandbook.com/rnaseq/projects/griffith. You will see an
index page listing all the directories at this location.
For example:
If you look closely, you will find a file named griffith-data.tar.gz. What happens if you
click on this link? Does it download? Can you open a tar file in the Mac environment? How about
on PC? How would you do it?
Okay, let's take a look at the file name griffith-data.tar.gz. What does the .tar.gz
extension mean? tar refers to "tape archive" and is used to archive a set of files into a single
file. The tar command can also be used to compress an archive using some form of
compression. The -z flag, for example, compresses the archive using gzip, which results in
the extension .gz. Note: gzip is a command on its own and can be run independently.
How do we deal with tar.gz files? On a Unix system, we untar and unzip the file using tar
with the flags -x, -v, and -f. tar auto-detects the compression type, so nothing specific is
needed to handle the compression type.
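So, for the file downloaded above:
tar -xvf griffith-data.tar.gz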
What does -xvf mean? If we check the man page for tar, we could find out...
man tar
-x means extract files from the (tape) archive.
-v means produce verbose output. When using this flag tar will list each file name as it is read from the tar (tape archive).
-f (file) means read the tar (tape archive) from or to the specified file.
You should see each of the files listed as the tar is decompressed. Two directories were created
in this process: a reads directory and a refs directory. In the reads directory there are 12
fastq files. In the refs directory, there are 4 files, containing genome and annotation
information. Keep in mind that we will be using a subsetted reference file from human
chromosome 22.
The fastq files are unzipped, but you may obtain zipped fastq files in the future. Because many
bioinformatics programs can work directly with fastq.gz files, let's compress these files to
save space.
gzip reads/*.fq
Note the use of the * wildcard. We are using gzip to zip all files ending in .fq in the directory
reads.
To peek inside these files after zipping, you can use zcat or gzcat (on a Mac) paired with head. This works similarly to cat paired with head.
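For example (the file name is illustrative; use any of the .fq.gz files in the reads directory):
zcat reads/HBR_1_R1.fq.gz | head -n 8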
In this case, we are "piping" - with the pipe symbol |, the results of zcat into head and
selecting the top 8 lines of the file (-n 8).
The results should show the top 8 lines of the .fq.gz file.
@HWI-ST718_146963544:7:2201:16660:89809/1
CAAAGAGAGAAAGAAAAGTCAATGATTTTATAGCCAGGCAAAATGACTTTCAAGTAAAAAATATAAAGCACCTTACAAA
+
CCCFFFFFHHHHHJJJJJHIHIJJIJJJJJJJJJJJJIJJJJJJJJJJJJJIJJIIJJJJJJJJJJJJIIJFHHHEFFF
@HWI-ST718_146963544:7:2215:16531:12741/1
CAAAATATTTTTTTTTTCTGTATATGACAAGACACACATCAGATCATAAGCTACAAGAAAACAAACAAAAAAGATATGA
+
@@@DDDDDFFFFFIIII;??::::9?99?G8;)9/8'787.)77;@==D=?;?A>D?@BDC@?CC=?BBBBB?<:4::@
Keep in mind, there are several Unix commands that can be used to look at the contents of files; each has its own flags/options and is used slightly differently. For example:
less
more
cat
head
tail
less, in particular, can also be used to examine zipped files with the help of lesspipe, on
certain unix systems. On Biowulf, for example, you can use less to view compressed /archived
files.
What is awk?
A scripting language that can be used for manipulating data and generating
reports.
Awk is a utility that enables a programmer to write tiny but effective programs in the
form of statements that define text patterns that are to be searched for in each line
of a document and the action that is to be taken when a match is found within a
line. Awk is mostly used for pattern scanning and processing. It searches one or
more files to see if they contain lines that matches with the specified patterns and
then performs the associated actions. ---https://www.geeksforgeeks.org/awk-
command-unixlinux-examples/ (https://www.geeksforgeeks.org/awk-command-
unixlinux-examples/)
For each line, awk tries to match the CONDITION, and if that condition matches, it
performs the ACTIONS. ---Biostar Handbook, The Art of Bioinformatics Scripting
(https://www.biostarhandbook.com/books/scripting/programming-with-awk.html)
Let's see awk in action. Let's return to runinfo.csv from Lesson 6. We can use awk to print
columns of interest.
For example,
cd ../../sra_data
awk -F ',' '{ print $1,$4,$7 }' runinfo.csv | head > awk_example.txt
Here, the action is to simply print the first $1, fourth $4, and seventh $7 columns from
runinfo.csv. Since there is no condition to be met, awk acts on all lines. The -F flag is used
to specify the field separator. In this case, we are looking at a comma separated file, so we use
,. If we also want the output to be comma separated, we need to use the special awk variable
OFS.
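For example, setting OFS on the command line with -v keeps the output comma separated:
awk -F ',' -v OFS=',' '{ print $1,$4,$7 }' runinfo.csv | head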
There are many resources online for getting started with awk. There is a chapter in the Biostar
Handbook, IV Awk Programming (https://www.biostarhandbook.com/books/scripting/
programming-with-awk.html) in the Art of Bioinformatics Scripting. You may also find this article
series (https://catonmat.net/awk-one-liners-explained-part-one) explaining awk one-liners
handy.
What is sed?
sed stands for stream editor. Functions include searching, find and replace, and insertion /
deletion.
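The substitution command being discussed below would look something like this (a reconstruction using the runinfo.csv file from the awk examples):
sed 's/SRR/ACC/' runinfo.csv | head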
Notice the single quotes containing our substitution phrase. The s specifies sed's substitution
command, while the /s separate the search pattern and the replacement string. The first
occurrence in each line will be substituted. To substitute across all occurrences in a line, use the global option: 's/SRR/ACC/g'.
You can pair sed with regular expressions. For example, let's say we want to replace a few of
the run accessions, those ending with a "17", "18", or "19", with "Unknown".
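One way to do this (a sketch using extended regular expressions; the exact command shown in class is not preserved):
sed -E 's/^SRR[0-9]*(17|18|19),/Unknown,/' runinfo.csv | head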
Help Session
For this help session, you will be downloading the Golden Snidget data. Practice materials are
located here.
RNA-SEQ Overview
What is RNASEQ ?
RNA-Seq (RNA sequencing) uses next-generation sequencing (NGS) to reveal the presence and quantity of RNA in a biological sample at a given moment (Wikipedia). Strictly speaking, this could be any type of RNA (mRNA, rRNA, tRNA, snoRNA, miRNA) from any type of biological sample. For the purpose of this talk we will be limiting ourselves to mRNA.
Technically, with a few exceptions, we are not actually sequencing mRNA but rather cDNA.
RNASEQ - WorkFlow
A typical RNASEQ experiment involves several steps, only one of which falls within the realm of bioinformatics: the data analysis step.
• Experimental Design
◦ What question am I asking
We will now examine each of these steps, highlighting the major components of each and touching briefly on some of the more critical steps and pitfalls.
Remember
• RNASEQ looks at steady state mRNA levels which is the sum of transcription and
degradation
• Protein levels are assumed to be driven by mRNA levels
• RNASEQ can measure relative abundance not absolute abundance
• RNASEQ is really all about sequencing cDNA
Read Choices
For any NGS experiment you will have to make choices about the following sequencing options. Unfortunately, there is an inverse relationship between accuracy and cost.
• Read Depth
◦ More depth needed for lowly expressed genes
◦ Detecting low fold differences need more depth
• Read Length
◦ The longer the length the more likely to map uniquely
◦ Paired reads help in mapping and in identifying junctions
• Replicates
◦ Detecting subtle differences in expression needs more replicates
◦ Detecting novel genes or alternate isoforms needs more replicates
Replicates
Technical Replicates
• It’s generally accepted that they are not necessary because of the low technical variation
in RNASeq experiments
Biological Replicates
• Not strictly needed for the identification of novel transcripts and transcriptome assembly.
• Essential for differential expression analysis - must have 3+ for statistical analysis
• Minimum number of replicates needed is variable and difficult to determine:
◦ 3+ for cell lines
◦ 5+ for inbred samples
◦ 20+ for human samples (rarely possible)
• Where will the primary data be stored (fastq)? Where will the processed data be stored
(bam)? Who will do the primary analysis?
• Who will do the secondary analysis?
• Where will the published data be deposited and by whom (and what metadata will they require)?
• Are you doing reproducible science?
If you are not going to analyze the data yourself, talk to the people who will be analyzing your data BEFORE doing the experiment.
For cost estimates, visit the Sequencing Facility pricing page for NGS. For further assistance in planning your RNA-Seq experiment or to discuss specifics of your project, please contact us by email ([email protected]) or visit us during office hours on Fridays, 10am to noon (Bldg 37/Room 3041). For cost and specific information about setting up an RNA-Seq experiment, please visit the Sequencing Facility website or contact Bao Tran.
• Prepare all samples at the same time or as close as possible. The same person should
prepare all samples
• Do not prepare “experiment” and “control” samples on different days or by different
people. (Batch effects).
• Use high quality means to determine sample quality (RNA Integrity Number, RIN > 8), quantity, and size (TapeStation, Qubit, Bioanalyzer)
• Don’t assume everything will work the first time (do pilot experiments) or every time
(prepare extra samples)
Sequencing
Consult with the Sequencing Core as to which option is most appropriate for your experiment. The appropriate selection will be driven by cost, precision, speed, number of samples, and number of reads required.
• Quality Control
◦ Sample quality and consistency
◦ Is Trimming appropriate - quality/adaptors
• Alignment/Mapping
◦ Reference Target (Sequence and annotation) Alignment Program
◦ Alignment Parameters
◦ Mark Duplicates
◦ Post-Alignment Quality Assurance
• Quantification
◦ Counting Method and Parameters
◦ Differential Expression - statistics
• Visualization
◦ Visual inspection - IGV
◦ Data representation - scatter, violin plots, heat-maps
• Biological Meaning
◦ Gene Set Enrichment
◦ Pathway Analysis
There are pre-built workflows that can automate many of the processes involved, and facilitate
reproducibility.
Consider the simplest experiment (two conditions, three replicates): 6-12 FASTQ starting files, 6-12 FASTQ files post trimming of adapters, 6 BAM files, and 6 BAM index files.
It’s for this reason that you should learn enough about the process to make “sensible choices”
and to know when the results are reasonable and correct.
Treating an RNA-Seq (or any NGS) analysis as a black box is a “recipe for disaster” (or at least
bad science). That’s not to say that you need to know the particulars of every algorithm involved
in a workflow, but you should know the steps involved and what assumptions and/or limitations
are built into the whole workflow.
Computational Prerequisites
These are the recommended prerequisites if you are planning on doing all the data analysis yourself.
• High performance Linux computer (multi core, high memory, and plenty of storage)
• Familiarity with the “command line” and at least one programming/scripting language.
• Basic knowledge of how to install software
• Basic knowledge of R and/or statistical programming
• Basic knowledge of Statistics and model building
Data Analysis
Here are two examples of complete RNASEQ workflows:
https://github.com/CCBR/Pipeliner/blob/master/RNASeqDocumentation.pdf
https://nf-co.re/rnaseq
• FASTX-Toolkit is a collection of command line tools for Short-Reads FASTA/ FASTQ files
preprocessing.
• SeqKit is an ultrafast comprehensive toolkit for FASTA/Q processing.
• Trimmomatic is a fast, multithreaded command line tool that can be used to trim and crop
Illumina (FASTQ) data as well as to remove adapters.
• TrimGalore is a wrapper tool around Cutadapt and FastQC to consistently apply quality and adapter trimming to FastQ files, with some extra functionality for MspI-digested RRBS-type (Reduced Representation Bisulfite-Seq) libraries.
• Cutadapt finds and removes adapter sequences, primers, poly-A tails and other types of
unwanted sequence from your high-throughput sequencing reads.
Alignment
Mapping Challenges
• Reads not perfect
• Duplicate molecules (PCR artifacts skew quantitation)
• Multimapped reads - Some regions of the genome are thus classified as unmappable
• Aligners try very hard to align all reads; therefore, the fewest artifacts occur when all possible genomic locations are provided (genome over transcriptome)
The complexity of the problem of accurately mapping millions of reads against large genomes
can be appreciated by looking at a time line of the development of different mapping programs.
Common Aligners
Most alignment algorithms rely on the construction of auxiliary data structures, called indices,
which are made for the sequence reads, the reference genome sequence, or both. Mapping
algorithms can largely be grouped into two categories based on properties of their indices:
algorithms based on hash tables, and algorithms based on the Burrows-Wheeler transform.
-- Nuno A. Fonseca, Johan Rung, Alvis Brazma, John C. Marioni. Tools for mapping high-throughput sequencing data. Bioinformatics, Volume 28, Issue 24, 1 December 2012, Pages 3169–3177, https://doi.org/10.1093/bioinformatics/bts605
Pseudoaligners assign reads to the most appropriate transcript... they can't find novel genes/transcripts or other anomalies. They are generally much faster than aligners and are likely more accurate. (Recent improvements in Salmon have increased its accuracy, at the expense of being somewhat slower than the original.)
Answers
• STAR - (Salmon or Kallisto) - subjective
• Depends! Use the most recent or best annotated
• GENCODE with caveats - know what is being annotated and what is not, and how it affects your results
Answers
Most programs have lots of optional parameters that can tweak the results, but most are set to
defaults that should work in most common situations.
(Don’t touch what you don’t understand - especially if it gets you, your favorite answer)
RSeQC provides both basic and RNA-seq specific quality-control modules. Its basic modules quickly inspect sequence quality, nucleotide composition bias, PCR bias and GC bias, while its "RNA-seq specific modules" investigate sequencing saturation status of both splicing junction detection and expression estimation, mapped reads clipping profile, mapped reads distribution, coverage uniformity over gene body, reproducibility, strand specificity and splice junction annotation.
MultiQC is a modular tool to aggregate results from bioinformatics analyses across many
samples into a single report.
Picard Tools - RNAseqMetrics is a module that produces metrics about the alignment of RNA-
seq reads within a SAM file to genes
Samtools provide various utilities for manipulating alignments in the SAM/BAM format, including
sorting, merging, indexing and generating alignments in a per-position format.
BamTools is a command-line toolkit for reading, writing, and manipulating BAM (genome
alignment) files.
Quantitation
• The reads are mapped to a reference and the number of reads mapped to each gene/
transcript is counted
• Read counts are roughly proportional to gene-length and abundance
• The more reads the better
◦ Artifacts occur because of:
▪ Sequencing Bias
▪ Positional bias along the length of the gene
▪ Gene annotations (overlapping genes)
▪ Alternate splicing
▪ Non-unique genes
▪ Mapping errors
Count Normalization
There are three metrics commonly used to attempt to normalize for sequencing depth and gene
length.
Differential Expression
Differential expression involves the comparison of normalized expression counts of different
samples and the application of statistical measures to identify quantitative changes in gene
expression between the different samples.
• Normalization of counts - the process of ensuring that values are expressed on the same
scale
(e.g. RPKM, FPKM, TPM, TMM). Corrects for variable gene length, read depth.
Replicates
Biological replicates are essential to derive a meaningful result. Don't let the high precision of the technique convince you that biological replicates are not needed.
If technical or biological variability exceeds that of the experimental perturbation, you will get zero DEs.
Remember that not all DE may be directly due to the experimental perturbation; some may be due to cascading effects of other genes.
Note that p-values refer to each gene, whereas an FDR (or q-value) is a statement about a list. So using an FDR cutoff of 0.05 indicates that you can expect 5% false positives in the list of genes with an FDR of 0.05 or less.
Count Matrix
Contrast File
Visualization
Here are a number of visual elements that are typically produced from RNASEQ data.
Normalization plots
Heat Maps
IGV Traces
Resources
• Functional Analysis
◦ Genomic Location
◦ Transcription Factor Enrichment Analysis
◦ miRNA Enrichment Analysis
Software Solutions
CCR staff have access to a number of resources
• Biowulf (Helix) - CIT maintained large cluster with a huge software library (Unix command
line)
• CCBR Pipeliner (Biowulf)
• Partek Flow (Local Web Service)
• DNAnexus (Cloud Solution)
• CLCBio Genomic Workbench (Small genomes)
Utility Programs
• SeqKit
• FastQC, RSeQC, MultiQC
• Cutadapt, Fastp, Trimmomatic, TrimGalore
• STAR, Bowtie, Salmon
• Samtools, Picard, bedtools, bamtools
• R, Python
• IGV
Web-Based Tools
• BioJupies - Many analysis functions - generates Jupyter Notebook of results
(https://amp.pharm.mssm.edu/biojupies/)
• iDEP - an integrated web application for differential expression and pathway analysis of RNA-Seq data
Further Reading
• RNA-seqlopedia - https://rnaseq.uoregon.edu/
• RNA-Seq by Example - https://www.biostarhandbook.com/
Lesson 8 Review
In Lesson 8, we learned about the basics of RNA sequencing, including experimental
considerations and basic ideas behind data analysis. In lessons 9 through 17 we will learn how to analyze RNA sequencing data. We will start with quality assessment, followed by alignment to a reference genome, and finally identification of differentially expressed genes.
Learning Objectives
In this lesson, we will continue to learn about RNA sequencing analysis using the Human Brain
Reference (HBR) and Universal Human Reference (UHR) datasets (https://rnabio.org/
module-01-inputs/0001/05/01/RNAseq_Data/). In particular, we will get acquainted with these datasets and with the reference genome and annotation files needed to analyze them.
The Human Brain Reference data used RNA pooled from 23 human brains; the brains are from both males and females, with ages ranging from 60 to 80 years.
The Universal Human Reference data used RNA from 10 cancer cell lines.
These two experiments used the External RNA Control Consortium (ERCC) spike-in RNAs as
controls. These internal standards provide a known quantity of RNA to evaluate the quality of an RNA sequencing experiment. They can provide information on dynamic range, limits of
detection, and reliability of differential expression results. To learn more about ERCC spike-ins
refer to the following:
• https://www.thermofisher.com/order/catalog/product/4456740 (https://
www.thermofisher.com/order/catalog/product/4456740)
• https://www.nist.gov/programs-projects/external-rna-controls-consortium (https://
www.nist.gov/programs-projects/external-rna-controls-consortium)
• ERCC seminar (https://youtu.be/YVlrzKMJ2uc)
• ERCC publication (https://www.nature.com/articles/ncomms6125)
• ERCC Bioconductor package (https://www.bioconductor.org/packages/release/bioc/html/
erccdashboard.html)
Where is my data?
Here, let's download the HBR and UHR dataset to get acquainted with it.
First, we will use pwd to make sure we are in the home directory.
pwd
If we are in the home directory, we will see the following output displayed in the terminal where
"username" is your username, or student id that you used to sign into the terminal.
/home/username
If not in the home directory use the command below to get back.
cd
Then create a folder called biostar_class and change into this folder.
mkdir biostar_class
cd biostar_class
Let's keep the analysis results of the HBR and UHR dataset in a folder called hbr_uhr, so we need to create this folder and then change into it.
mkdir hbr_uhr
cd hbr_uhr
Now it's time to download the HBR and UHR dataset. What are the two commands that we can
use to download data from the web?
Solution: The two commands we can use are wget and curl.
Here, we will use wget to download the HBR and UHR dataset (remember, we should now be in
the ~/biostar_class/hbr_uhr directory).
Using the wget command we just need to enter the command, which is wget, and then provide
the URL to the file.
wget http://data.biostarhandbook.com/rnaseq/projects/griffith/griffith-data.tar
If wget does not work, try curl. If using curl, we need to make sure to specify an output file
name using the -o option.
curl http://data.biostarhandbook.com/rnaseq/projects/griffith/griffith-data.tar -o griffith-data.tar.gz
If we now list the contents of the ~/biostar_class/hbr_uhr directory (recall that ~ denotes the home directory), we will see the file griffith-data.tar.gz. Recall that this is an archive of a collection of files that has been zipped (or compressed) to save on storage space.
ls
griffith-data.tar.gz
We can use the tar command to unpack the contents of griffith-data.tar.gz. In the tar command
below, we include the following flags
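A typical invocation (a minimal sketch) is shown below, where x tells tar to extract the archive, v prints each file as it is unpacked, and f specifies the archive file name.
tar -xvf griffith-data.tar.gz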
If we now do ls -l (where -l denotes listing the contents of a directory in the long view), we will see two additional folders.
ls -l
The sequencing reads for the HBR and UHR dataset reside in the reads directory. Below, we
will list the content of the reads directory using ls -1 where -1 tells ls to list directory contents,
but with one item per row. We will talk about these fastq or fq files later.
ls -1 reads
HBR_1_R1.fq
HBR_1_R2.fq
HBR_2_R1.fq
HBR_2_R2.fq
HBR_3_R1.fq
HBR_3_R2.fq
UHR_1_R1.fq
UHR_1_R2.fq
UHR_2_R1.fq
UHR_2_R2.fq
UHR_3_R1.fq
UHR_3_R2.fq
In this lesson, the focus is to get to know the reference genome and annotation files.
ls refs
In the refs folder, we have two fasta (fa) files. One is the reference genome for human
chromosome 22 (see file 22.fa) and the other is the reference genome for ERCC spike-ins (see
file ERCC92.fa). We will only be using 22.fa here.
We also have two gtf files that tell us about features that exist in a genome. Again, because we are not working with ERCC, we will only be using the 22.gtf file, which tells us about the features that exist on human chromosome 22.
For the non-R based tools, we can run them at the command line. You can see a full list of non-
R based tools if you listed the contents of the /miniconda3/bin folder although there is no need
to go into this folder to do anything.
ls /miniconda3/bin
The R helper scripts are a bit different. They are located in the folder /usr/local/code.
ls -1 /usr/local/code
combine_genes.r
combine_transcripts.r
compare_results.r
create_heatmap.r
create_pca.r
create_tx2gene.r
deseq2.r
edger.r
filter_counts.r
mission-impossible.mk
parse_featurecounts.r
1. The path to the R helper scripts, /usr/local/code, has been exported as the environmental
variable CODE.
2. If we ls $CODE, we should see the contents of the directory as well.
3. This prevents us from typing a long path when running the R helper scripts.
4. For example, if we wanted to use deseq2.r, we can type in the command line Rscript
$CODE/deseq2.r
(Note we run the R helper scripts by starting off with the Rscript command.)
ls $CODE
cd ~/biostar_class/hbr_uhr/refs
We start off with the reference genome file (the .fa file) used for the HBR and UHR dataset,
where the human chromosome 22 reference is derived from the GRCh38 version of the human
reference genome.
To start, look at the content inside 22.fa by using head. View the first 3 lines (indicated by -3). A
FASTA file can have extensions (.fasta or .fa).
head -3 22.fa
A "fasta" file contains nucleotide sequences. The first line is always a header or definition line
that starts with ">".
>chr22
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
This definition line tells us information about the nucleotide sequence such as which
chromosome is found (22 in our example as denoted by chr22). The first few lines of the
chromosome 22 reference are N, where the nucleotide is unknown.
Note that the definition line can provide more information, depending on how the sequence was
curated. As an example, in the human ADSL transcript nucleotide sequence (https://
www.ncbi.nlm.nih.gov/nuccore/NM_001363840.3?report=fasta), the header line tells us the
accession number or sequence ID (NM_001363840.3), species in which the sequence was
derived (Homo sapiens), name of the gene (ADSL), and that this is a mRNA sequence.
Below, we use grep with the -v option, which excludes any lines in the FASTA file that contain the search pattern (in this case N, the unknown nucleotide). This lets us see some actual sequence in chromosome 22 and confirm that the 22.fa file does contain meaningful sequences.
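A command along these lines (a minimal sketch; head simply limits the output to the first ten lines) should work:
grep -v N 22.fa | head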
>chr22
TAAGATGTCCTATAATTTCTGTTTGGAATATAAAATCAGCAACTAATATGTATTTTCAAA
GCATTATAAATACAGAGTGCTAAGTTACTTCACTGTGAAATGTAGTCATATAAAGAACAT
AATAATTATACTGGATTATTTTTAAATGGGCTGTCTAACATTATATTAAAAGGTTTCATC
AGTAATTCATTATATCAAAATGCTCCAGGCCAGGCGTGGTGGCTTATGCCTGTAATCCCA
GCACTTTGGGAGGTCGAAGTGGGCGGATCACTTGAGGTCAGGAGTTGGAGACTAGCCTGG
CCAACATGATGAAACCCCGTCTCTAATAATAATAATAAAAAAAAATTAGCTGGGTGTGGT
GGTGGGCAACTGTAATCTCAGCTAATCAGGAGGCTGAGGCAGAAGAATTGCTTGAACGTG
GAAGACAGAGTTTACAGTGTGCCAAGATCACACCACCCTACTCCAACTTGGGTGACAGAG
CAAGACTCAGTCTCAAGGAAAAAAAAAAGCTCGAAAAATGTTTGCTTATTTTGGTAAAAT
Here, we find that the human chromosome 22 reference (22.fa) is a file with a header or
definition line with the nucleotide sequence of that entire chromosome. But what is it good for?
22.fa contains the known sequences of human chromosome 22, thus it's a reference that we
can compare other sequences to. For high throughput sequencing, we need the known
sequences so that we can find out where in the genome each of the sequencing reads came
from. The reference genome in a way acts like a template that we can follow to reconstruct the
genome of the unknown. In other words, think of the reference genome as a picture of the
completed puzzle that helps us assemble the actual puzzle, by allowing us to overlap the
pieces to see if they fit the completed version.
A question we might ask about the reference genome is how big is the reference (ie. how many
bases)? To answer this, we can use the tool seqkit and its stats feature.
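For example (a minimal sketch), running seqkit stats on the reference FASTA reports the number of sequences and the total number of bases:
seqkit stats 22.fa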
Prior to the sequencing experiment, the size of the genome will help us determine the number
of reads needed to achieve a certain coverage (https://www.illumina.com/documents/products/
technotes/technote_coverage_calculation.pdf (https://www.illumina.com/documents/products/
technotes/technote_coverage_calculation.pdf)).
After the experiment, we could use the size of our genome along with other information to
decide the computing resources (ie. time and memory) needed for our analysis. Chromosome
22 is the second smallest in humans (https://medlineplus.gov/genetics/chromosome/22/), so it
would be faster to align to this than the entire human genome. For the sake of time in this class,
that's why we chose to align to just this chromosome.
What is in the 22.gtf file? A gtf (Gene Transfer Format) file is essentially a tab delimited file (ie. columns in the file are separated by tabs). It informs us of where different features such as genes, transcripts, exons, and coding sequences are found in a genome.
First, even though we will map the RNA sequencing reads to a genome, we need to know which genomic features (ie. genes) the reads align to in order to generate a metric for expression (counts) and then perform differential expression analysis.
Second, because some of the sequencing reads can map across two exons, if we use a splice-aware aligner, the information in the gtf file can be used to recognize the exon-exon boundaries (see Figure 1).
Otherwise, those reads that map to two exons will not be mapped and we end up losing
information. See https://useast.ensembl.org/info/website/upload/gff.html (https://
useast.ensembl.org/info/website/upload/gff.html) for required information in a gtf file.
The "score" column values represent the confidence in a feature existing in that genomic
position. In this example the score contains a dot ".", which represents a missing value.
The frame column provides reading frame information. Where a "." appears in the gtf table, this
means that the information is not available.
In Table 2, we are looking at gene ENSG00000277248.1, where the first line tells us that the feature is a gene, and the lines below it show the transcript products for the gene as well as the exons associated with that transcript.
Table 2: 22.gtf entries for gene ENSG00000277248.1 (long IDs truncated in the original are shown with "...").

CHROMOSOME  SOURCE   FEATURE     START     END       SCORE  STRAND  FRAME  ATTRIBUTE
chr22       ENSEMBL  gene        10736171  10736283  .      -       .      gene_id "ENSG00000277248.1"; gene_type "snRNA"; gene_status "NOVEL"; gene_name "U2.14"; level 3;
chr22       ENSEMBL  transcript  10736171  10736283  .      -       .      gene_id "ENSG00000277248.1"; transcript_id "ENST00000615..."; gene_type "snRNA"; gene_status "NOVEL"; gene_name "U2.14"; transcript_type "snRNA"; transcript_status "NOVEL"; transcript_name "U2.14-201"; level 3; tag "basic"; transcript_support_level "NA";
chr22       ENSEMBL  exon        10736171  10736283  .      -       .      gene_id "ENSG00000277248.1"; transcript_id "ENST00000615..."; gene_type "snRNA"; gene_status "NOVEL"; gene_name "U2.14"; transcript_type "snRNA"; transcript_status "NOVEL"; transcript_name "U2.14-201"; exon_number 1; exon_id "ENSE00003736..."; level 3; tag "basic"; transcript_support_level "NA";
Figure 1: A sequencing read (red fragments) aligning two exons (e1 and e2). Modified from:
https://training.galaxyproject.org/training-material/topics/transcriptomics/tutorials/rb-rnaseq/
tutorial.html (https://training.galaxyproject.org/training-material/topics/transcriptomics/tutorials/
rb-rnaseq/tutorial.html)
To view the gtf or other tabular data in Unix, we can cat the file and then pipe or send the output to the column command, which prints the columns nicely aligned. The -t option tells column to determine the number of columns in the input; by default, white space is used to separate the columns. Piping to less -S truncates the really long lines, like those found in the attributes column of the gtf file, and allows us to scroll horizontally to view additional columns. Hit q to exit less.
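A sketch of such a command for 22.gtf would be:
cat 22.gtf | column -t | less -S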
We also need to download 22.fa and 22.gtf to our local desktop. To do this copy both into the ~/
public directory.
cp 22.fa ~/public
cp 22.gtf ~/public
From there, download 22.fa and 22.gtf to our local machine. Note where the files were
downloaded.
When we open IGV, we will see the window shown in (Figure 2).
Figure 2
IGV comes preloaded with several genomes, but if we do not see the one we need in the drop
down menu we can always load it from the Genomes drop down on the menu bar where it gives
us some options including loading the genome from local file or from the web or URL (Figure 3).
Compatible file formats include FASTA, JSON, or ".genome".
Figure 3:
To load data into IGV, we will choose one of the options in the File drop down in the menu bar
(see Figure 4). Note that from the File drop down, we can take snap shots of our view and save
as either PNG or SVG images.
Figure 4:
Since IGV loaded hg19 upon startup, we will load the chromosome 22 reference genome using
the Load from File option under the Genomes drop down in the menu bar (Figure 5).
Figure 5:
Next, we will load the 22.gtf file onto IGV so we can see the genes aligned to the reference genome (Figure 6). The blue rectangles represent genes and transcripts. We can zoom in to look at ENSG00000280363.1 by searching for this gene ID in the Go box (we can search by coordinates, ID, or name).
Figure 6
Figure 7
The transcripts are depicted by solid rectangles separated by lines (Figure 8). The solid
rectangles are exons and the lines connecting are introns. If we click on the transcripts, a box
pops up and we get more information regarding the transcript.
Figure 8
Within the exons, narrower solid rectangles represent untranslated regions or UTRs (Figure 9).
Figure 9
Among the many features of IGV, is the ability for users to zoom in close enough to see the
bases (Figure 10).
Figure 10
Lesson 9 Review
In the previous lesson, we explored the reference genomes and genome annotation files that
are needed in our analysis of the Human Brain Reference (HBR) and Universal Human
Reference (UHR) RNA sequencing data.
Learning objectives
In lesson 9, we learned that reference genomes come in the form of FASTA files, which essentially store nucleotide sequences. In this lesson, we will learn about the FASTQ file, which is the file format that we get from our high throughput sequencing experiment. Our goals are to understand the structure of FASTQ files and to assess sequencing data quality using FASTQC.
The skills learned can be applied to your own research and will be used when we learn more
about RNA sequencing in subsequent lessons. In this lesson, we will continue to use the
Human Brain Reference and Universal Human Reference datasets.
The directory in which the HBR and UHR dataset resides is ~/biostar_class/hbr_uhr. Let's go
ahead and change into this folder. Because we are talking about the sequencing data in this
lesson, we will then need to change into the reads folder.
cd ~/biostar_class/hbr_uhr
cd reads
Let's now list (using ls) the contents of the reads folder. We use -1 to make ls list one item per
row.
ls -1
HBR_1_R1.fq
HBR_1_R2.fq
HBR_2_R1.fq
HBR_2_R2.fq
HBR_3_R1.fq
HBR_3_R2.fq
UHR_1_R1.fq
UHR_1_R2.fq
UHR_2_R1.fq
UHR_2_R2.fq
UHR_3_R1.fq
UHR_3_R2.fq
Each FASTQ file is composed of many sequences. We can use the seqkit tool and its stats
function to get statistics on our FASTQ files. In the seqkit command below, we use * to denote
wild card in order to have seqkit run stats for all FASTQ files.
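A sketch of the command, run from inside the reads directory, would be:
seqkit stats *.fq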
Our query of the stats for the FASTQ files generates the results below where we are informed of
things such as the number of sequences (or reads) in a FASTQ file. For the HBR_1 biological
replicates, both files in the pair (R1 and R2) have 118,571 sequences. The average length of
the sequences in these files is 100 bases. Pairs in paired end sequencing should have the
same number of sequences because the reads generated from the pairs came from the same
template.
Recall from the seqkit stat results shown above that each FASTQ file contains many sequencing
reads. The record for each sequencing read is composed of four lines and these include the
following.
• The first line is the sequencing read metadata (ie. instrument used, location on the flow
cell where the read was derived, and whether the read is the first or second of a pair in
paired end sequencing) - in our example,
• "/1" at the end of the metadata line in the reads from HBR_1_R1.fastq.gz denotes the first
read of a pair
• "/2" at the end of the metadata line in the reads from HBR_1_R2.fastq.gz denotes the
second read of a pair
• The second line is the sequence
• The third line is a "+"
• The fourth line contains the quality score of each of the bases in the sequencing read.
This quality score tells us the confidence of the base call
Below we use the head command (where -4 indicates we want only the first 4 lines) to show the
first sequencing read of HBR_1_R1.fq and HBR_1_R2.fq.
head -4 HBR_1_R1.fq
@HWI-ST718_146963544:7:2201:16660:89809/1
CAAAGAGAGAAAGAAAAGTCAATGATTTTATAGCCAGGCAAAATGACTTTCAAGTAAAAAATATAAAGCACCTTACAAA
+
CCCFFFFFHHHHHJJJJJHIHIJJIJJJJJJJJJJJJIJJJJJJJJJJJJJIJJIIJJJJJJJJJJJJIIJFHHHEFFF
head -4 HBR_1_R2.fq
@HWI-ST718_146963544:7:1101:5039:82953/2
GGTGGGGACAGGGTACTTGGCATAAAGTAGGCTCTTAGTACATTTTTTGAATGAATGAATGACTCTGAAAGGTAAATAA
+
@B=DFDFFHHGFHGHJJJJJJIIIJJIIIJIIJIJJJJGHIIJJJJJJJGHGIIJIJIJJJHHHHHHHFFFFFECEEEE
If we view the first 8 lines in HBR_1_R1.fq, we can see that we have the same structure for the
second sequencing read in the file.
head -8 HBR_1_R1.fq
@HWI-ST718_146963544:7:2201:16660:89809/1
CAAAGAGAGAAAGAAAAGTCAATGATTTTATAGCCAGGCAAAATGACTTTCAAGTAAAAAATATAAAGCACCTTACAAA
+
CCCFFFFFHHHHHJJJJJHIHIJJIJJJJJJJJJJJJIJJJJJJJJJJJJJIJJIIJJJJJJJJJJJJIIJFHHHEFFF
@HWI-ST718_146963544:7:2215:16531:12741/1
CAAAATATTTTTTTTTTCTGTATATGACAAGACACACATCAGATCATAAGCTACAAGAAAACAAACAAAAAAGATATGA
+
@@@DDDDDFFFFFIIII;??::::9?99?G8;)9/8'787.)77;@==D=?;?A>D?@BDC@?CC=?BBBBB?<:4::@
Previously, we used seqkit stats to get statistics for HBR_1_R1.fq and HBR_1_R2.fq, such as the number of sequencing reads in the files. In theory, if we grep for the @HWI of the metadata line and then count the number of lines using wc -l (again, we use wc to obtain a word count and -l instructs this command to provide only the number of lines in a file), we should get the same result as that from seqkit stats (warning: this might not work all of the time).
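A sketch of such commands, one per file, would be:
grep @HWI HBR_1_R1.fq | wc -l
grep @HWI HBR_1_R2.fq | wc -l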
118571
118571
Here we have learned how to use an existing application as well as stand alone commands to
get some basic statistics from FASTQ files. However, doing this repeatedly for every FASTQ file
can become cumbersome so we typically turn to tools like FASTQC to generate sequencing
quality metrics.
We can take a look at the instructions for running FASTQC by typing the following at our
command prompt.
fastqc --help
Once the instructions have been pulled up, we see that to run FASTQC we simply do the
following.
fastqc input1
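For example, to process all of the HBR and UHR FASTQ files at once from within the reads folder, we could run something along these lines (a minimal sketch; an output directory can also be specified with the -o option):
fastqc *.fq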
FASTQC will print out the status of the analysis as it runs. Note that this status is printed for every input in the FASTQC command.
Now if we list the files in the hbr_uhr directory, we will see that each FASTQ file has a
corresponding FASTQC html report and zip folder. We can view the html version of the QC
report in a web browser and this is what most people would do. The QC results (text and figures
that appears in the html report) are also available in the zip folder.
ls
unzip HBR_1_R1_fastqc.zip
cd HBR_1_R1_fastqc
ls
In the HBR_1_R1_fastqc folder, the QC summary is available in the summary.txt file. The QC results are in fastqc_data.txt and the figures are in the Images folder.
cd Images
ls
adapter_content.png per_sequence_gc_content.png
duplication_levels.png per_sequence_quality.png
per_base_n_content.png per_tile_quality.png
per_base_quality.png sequence_length_distribution.png
per_base_sequence_content.png
Now, let's change back into the ~/biostar_class/hbr_uhr/reads folder and copy HBR_1_R1_fastqc.html to our ~/public directory, then right click to open it in a new browser tab. In the cp command below, ~ denotes the home directory.
cd ~/biostar_class/hbr_uhr/reads
cp HBR_1_R1_fastqc.html ~/public
If we now listed the contents of the ~/public directory, then we will see HBR_1_R1_fastqc.html.
ls ~/public
In the FASTQC report, the first thing we see is the Summary panel that allows us to navigate to
different portions of the FASTQC report. In this summary panel, we see that we have one failed
module (red circle with x) and three warnings (yellow circle with !). In the main report pane, the
first information is the Basic Statistics (Figure 2), which tells us
• Name of the FASTQ file that we are working with (HBR_1_R1.fq in this case)
• The number of sequences in the file
• Number of sequences flagged with poor quality (0 in the HBR_1_R1.fq file)
• The sequence length (ie. how many bases are in each sequencing read of the FASTQ
file). For HBR_1_R1.fq, each sequencing read is composed of 100 bases and this
concurs with the results from seqkit stats.
Figure 2
Next, we can see the "Per base sequence quality" plot (Figure 3).
This chart plots the error likelihood at each position averaged over all measurements.
• The vertical axis shows the quality scores that you see in row 4 of the sequencing reads in a FASTQ file. These quality scores represent error probabilities, where a score of 10 corresponds to a 1 in 10 chance of an incorrect base call, 20 to 1 in 100, 30 to 1 in 1,000, and 40 to 1 in 10,000.
• The three colored bands (green, yellow, red) illustrate the typical labels assigned to these measures: reliable (green), less reliable (yellow), and error prone (red).
• The yellow boxes contain the middle 50% of the data (the inter-quartile range), and the whiskers indicate the 10% and 90% points.
• The red line inside the yellow boxes is the median quality score for that base.
Figure 3
Next, in Figure 4, we have the "Per tile sequence quality" plot. This graph tells us whether
something is wrong with a tile in the flow cell. Along the horizontal axis are the base positions.
The vertical axis represents the tile number in which the read came from.
This plot compares the average quality score of a tile to the average quality score of all tiles at a
particular base position. -- https://sequencing.qcfail.com/articles/position-specific-failures-of-
flowcells/ (https://sequencing.qcfail.com/articles/position-specific-failures-of-flowcells/)
• Colder colors indicate that the average quality score at a tile is at or above the average
quality score of all tiles at a particular base position. So a plot that is entirely blue is good
(Figure 4).
• Hotter colors indicate that the average quality score at a tile is below the average quality
score of all tiles at a particular base position. So a plot with red indicates a part of the flow
cell has problems (Figure 5).
Figure 4
Next, in Figure 6 we see a distribution of the sequence quality scores in the "Per sequence
quality scores" plot, showing whether the quality of a subset of sequences is poor. Here our
sequences have good quality, where most reads have quality that clusters at around 37.
Figure 6
Figure 7 shows us the sequence make up along the bases of our reads in the "Per base
sequence content" plot. If a library is random, then the percent composition of each nucleotide
base (A,T,C,G) should be the same (~25%).
This module fails for HBR_1_R1.fq because the difference between the percentage of A and the percentage of T is larger than 20 at a particular location, OR the difference between the percentage of C and the percentage of G is larger than 20 at a particular location -- Babraham
bioinformatics Per base sequence content (https://www.bioinformatics.babraham.ac.uk/
projects/fastqc/Help/3%20Analysis%20Modules/
4%20Per%20Base%20Sequence%20Content.html).
In HBR_1_R1.fq, it looks like the difference between the percent composition of A and T at base position 2 is causing the failure. Unfortunately, this type of unevenness in base distribution at the beginning of a read is commonly observed in RNA sequencing due to priming with random hexamers during the library preparation stage.
Figure 7
Figure 8 shows the GC content across each sequence compared to a normal distribution in
what is called the "Per sequence GC content" plot. The GC content in HBR_1_R1.fq is off from
the normal theoretical distribution.
The peak of this theoretical distribution is an estimate of the GC content of the underlying genome. Deviation of the GC content from the theoretical distribution could be caused by contamination or sequencing bias. -- Babraham bioinformatics Per sequence GC content (https://www.bioinformatics.babraham.ac.uk/projects/fastqc/Help/3%20Analysis%20Modules/5%20Per%20Sequence%20GC%20Content.html).
Figure 8
Whether we have any unknown bases in our sequencing reads is shown in Figure 9, which is the "Per base N content" plot.
Figure 9
Figure 10 shows the sequence length distribution of HBR_1_R1.fq in what is known as the
"Sequence Length Distribution" plot.
Figure 10
Figure 11 shows the sequence duplication levels. High levels of duplication may indicate an
enrichment bias such as over-amplification in the PCR step. Otherwise, most sequences will
occur only once.
Figure 11
Figure 12 shows that we have some overrepresented sequences. As an exercise, let's copy one
of the overrepresented sequences and BLAST it to find out what it is. Presence of
overrepresented sequences may indicate enrichment artifact or the sequencing library is not
very diverse. On the other hand, overrepresented sequences could have biological
significance.
Figure 12
Figure 13 tells us whether some of our sequencing reads have adapter content. Adapter sequences should be trimmed prior to alignment.
Figure 13
HBR_1_R1_fastqc.html
HBR_1_R2_fastqc.html
HBR_2_R1_fastqc.html
HBR_2_R2_fastqc.html
HBR_3_R1_fastqc.html
HBR_3_R2_fastqc.html
UHR_1_R1_fastqc.html
UHR_1_R2_fastqc.html
UHR_2_R1_fastqc.html
UHR_2_R2_fastqc.html
UHR_3_R1_fastqc.html
UHR_3_R2_fastqc.html
Lesson 10 Review
In the previous lesson, we learned about the structure of the FASTQ file, which stores our raw
sequencing reads. Next, we learned to use a tool called FASTQC to assess the quality of each
of the FASTQ files in the Human Brain Reference (HBR) and Universal Human Reference (UHR)
dataset.
Learning objectives
As described in the Lesson 10 review above, we generated quality reports for each of the
FASTQ files in the Human Brain Reference and Universal Human Reference dataset using
FASTQC. However, interrogating 12 individual FASTQC reports is cumbersome. In this lesson,
we will focus on the following.
• Merge FASTQC reports using a tool called MultiQC so that we can interrogate one report
rather than multiple.
• Learn to perform quality and adapter trimming on FASTQ files.
The skills learned can be applied to your own research and will be used when we learn more
about RNA sequencing in subsequent lessons. In this lesson, we will continue to work with the
HBR and UHR datasets.
cd ~/biostar_class/hbr_uhr/QC
FASTQC generated 12 html reports, here, we will merge them using the tool MultiQC.
MultiQC searches a given directory for analysis logs and compiles a HTML report.
It's a general use tool, perfect for summarising the output from numerous
bioinformatics tools. -- https://multiqc.info (https://multiqc.info).
MultiQC "knows" the report formats of many existing NGS tools: FastQC, cutadapt,
bowtie2, tophat, STAR, kallisto, HISAT2, samtools, featureCounts, HTSeq, MACS2,
Picard, GATK, etc. -- https://wikis.utexas.edu/display/bioiteam/Using+MultiQC
(https://wikis.utexas.edu/display/bioiteam/Using+MultiQC).
MultiQC can be used to aggregate reports from pre-alignment quality checks as well as metrics from other downstream steps of high throughput sequencing analysis.
See https://multiqc.info/docs/ (https://multiqc.info/docs/) for the tools that MultiQC
can generate aggregate reports for.
Below, we will take a look at the MultiQC documentation to see how to run it.
multiqc --help
MultiQC allows users to input some options but mainly to run this application, we need to
specify the directory that contains our analysis logs.
MultiQC aggregates results from bioinformatics analyses across many samples into
a single report.
It searches a given directory for analysis logs and compiles a HTML report. It's a
general use tool, perfect for summarising the output from numerous
bioinformatics tools.
To run, supply with one or more directory to scan for analysis results. For
example, to run in the current working directory, use 'multiqc .'
While we can specify the full path of the directory containing our FASTQC reports to MultiQC, we are in it already, so we can just give "." as the directory path to tell MultiQC to look for files to aggregate here (in the present working directory). We use the --filename option to specify a name (multiqc_report_hbr_uhr) for the MultiQC report.
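A sketch of the command, using the report name assumed here, would be:
multiqc . --filename multiqc_report_hbr_uhr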
After running MultiQC, we can use ls to list the contents of our folder. We see that we have an html file (multiqc_report_hbr_uhr.html) that we can open to view the quality assessment summary for all of our samples.
ls
Let's copy multiqc_report_hbr_uhr.html to our public directory so we can take a look at the contents of this report in our web browser.
cp multiqc_report_hbr_uhr.html ~/public/multiqc_report_hbr_uhr.html
Upon opening the MultiQC report, we see a navigation panel (similar to what we have with the
individual FASTQC reports) that allows us to quickly move to different sections of the report. We
are also provided with links to get help if we have questions about how to use the MultiQC
reports. To the right of the report page, we have a tool box that allows us to do things like
highlighting different samples in different colors for better visualization, rename samples, and
export each of the individual QC plots as an image for inclusion in presentations and/or
publications. MultiQC reports are interactive.
Figure 1
The next figure (Figure 2) shows some basic statistics about our samples, including percentage
of duplicate reads, GC content, number of bases in the sequencing reads (or read length),
percentage of modules that failed in the FASTQC report for that sample, and the total number of
sequences in a FASTQ file (in Millions of sequences).
We can click on the Configure Columns tab to choose which columns we like to see in this table
(Figure 3) and the plot button to visualize the data in graphical format.
Figure 2
Figure 3
The next plot (Figure 4) shows the break down of unique and duplicate reads for each FASTQ
file. Again, duplication suggests some sort of enrichment bias. The default to this panel is to
show the number of sequences but we can get a percentage breakdown by clicking on the
Percentages tab.
Figure 4
Figure 5 shows us the average quality score of the sequencing reads in FASTQ files along each
base position. If we click on the green rectangle with the number 12 written in it, we can choose
which sample we like to see in our plot (Figure 6). Regarding these boxes at the top of the QC
plots, green means QC passed while orange and red indicate warning and failed, respectively.
Figure 5
Figure 6
In Figure 7, we see the quality score distribution of each of our FASTQ files.
Figure 7
In Figure 8, we get an interactive heatmap of the percent composition of each nucleotide base
(A,T,C,G) along the bases (horizontal axis) for each of the FASTQ files (vertical axis). If we hover
over a tile, we will see the corresponding numbers. If we click on any row in this heatmap, we
will get the base composition plot for just that sample (Figure 9).
Figure 8
Figure 9
The GC distribution for each of the FASTQ files is shown in Figure 10.
Figure 10
Figure 11 tells us that except for a few bases at the beginning of the read, we do not have
unknown bases in our FASTQ files.
Figure 11
Figure 12 tells us that all of the reads in our FASTQ files have 100 bases, so no problems there.
Figure 12
We see the sequence duplication levels, overrepresented sequences, and adapter content
information in Figures 13 through 15.
Figure 13
Figure 14
Figure 15
At the end of the MultiQC report, we see a heatmap of the modules that have a passed, warning, or failed status for each of the FASTQ files.
Figure 16
cd ~/biostar_class
mkdir trimming
cd trimming
Let's now download the FASTQ files for SRR1553606, which was sequenced in the paired end format, so we will need to specify --split-files to separate read 1 and read 2. We specify -X 10000 to retrieve only 10000 reads, otherwise the download will take longer.
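A sketch of the download command (using fastq-dump from the SRA toolkit, as in Lesson 6) would be:
fastq-dump --split-files -X 10000 SRR1553606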
ls
SRR1553606_1.fastq SRR1553606_2.fastq
Let's run FASTQC for both read 1 and read 2 of SRR1553606. We can use the wildcard (*) to
get both files rather than inputting them separately.
fastqc SRR1553606_*.fastq
We can copy the FASTQC reports for SRR1553606 to the ~/public directory to review them.
cp SRR1553606_1_fastqc.html ~/public
cp SRR1553606_2_fastqc.html ~/public
Initial QC of both FASTQ files for SRR1553606 shows failures and warnings. Figures 17 through 20 show the per base sequence quality and adapter content for SRR1553606 read 1 and read 2. Here let's focus on removing adapters and poor quality reads. Adapters in particular will
interfere with the alignment step. At the end of the reads in SRR1553606_2, we see that the 25th
- 75th percentile (the yellow box) of scores span a large range, with many of them in the orange
and red regions, where reliability of the reads would come into question (Figure 20).
Let's use the tool Trimmomatic to clean up the adapters and the poor quality reads for
SRR1553606. For help with Trimmomatic type trimmomatic --help at the command line.
Before getting started with using trimmomatic, let's create a file called nextera.fa which houses
the nextera adapter sequence that we need to remove (from the FASTQC result, we have
Nextera adapter contamination).
The command below will create a file called nextera.fa and open it in the nano editor. We can then copy and paste the sequence, then hit control-x (confirming with y and Enter) to save and exit the editor.
nano nextera.fa
>nextera
CTGTCTCTTATACACATCTCCGAGCCCACGAGAC
We initiate the application by typing trimmomatic at the command line; the parameters are explained below, and a sketch of the full command follows the list.
• PE stands for paired end mode. We are dealing with sequencing data derived from
paired end library preparation so we can use this option. If we have single end
sequencing data then we can replace PE with SE. After specifying paired end (PE) mode
◦ The files for read 1 and read 2 are entered
◦ Following that, we enter the names of the trimmed FASTQ files
(SRR1553606_trimmed_1.fastq and SRR1553606_trimmed_2.fastq). We need to
specify two because we have two input files for paired end sequencing
◦ Note that we also specify a file name for unpaired reads
(SRR1553606_trimmed_1_unpaired.fastq and
SRR1553606_trimmed_2_unpaired.fastq). Sometimes a read in one file may be
successfully processed while the same read will not be successfully processed in
the second file, thus, we place these reads in a separate file.
• The next portion of the trimmomatic command allows us to specify the quality score criteria for trimming. Here we use a sliding window (SLIDINGWINDOW), which scans from the 5' end of the read and cuts the read once the average quality within the window falls below a threshold.
◦ We choose a window size of 4 bases
◦ Quality threshold of 30
◦ The final construction is SLIDINGWINDOW:4:30
• We then use the ILLUMINACLIP flag to specify the file to our adapter sequence where the
numbers (2:30:5) that follows sets the criteria on how Trimmomatic would determine
whether a portion of the read matches the adapter (see the Trimmomatic manual at http://
www.usadellab.org/cms/uploads/supplementary/Trimmomatic/
TrimmomaticManual_V0.32.pdf (http://www.usadellab.org/cms/uploads/supplementary/
Trimmomatic/TrimmomaticManual_V0.32.pdf) for more).
• In the MINLEN argument, we specify 50 and Trimmomatic will remove reads that are less
than 50 bases. We set this threshold because shorter reads will be difficult to map
because they would potentially fall onto multiple regions of the genome.
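Putting the pieces described above together, a sketch of the full command (with the steps in the order described in this lesson) would be:
trimmomatic PE SRR1553606_1.fastq SRR1553606_2.fastq \
SRR1553606_trimmed_1.fastq SRR1553606_trimmed_1_unpaired.fastq \
SRR1553606_trimmed_2.fastq SRR1553606_trimmed_2_unpaired.fastq \
SLIDINGWINDOW:4:30 ILLUMINACLIP:nextera.fa:2:30:5 MINLEN:50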
In the ls command below, we place * (wildcard) around trimmed to tell ls that we want any file
with the word trimmed in it. We use * before and after so that ls will know there could be
characters before trimmed and also after.
ls *trimmed*
SRR1553606_trimmed_1.fastq
SRR1553606_trimmed_1_unpaired.fastq
SRR1553606_trimmed_2.fastq
SRR1553606_trimmed_2_unpaired.fastq
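To generate QC reports for the trimmed data, we can run FASTQC on the paired trimmed files (a minimal sketch):
fastqc SRR1553606_trimmed_1.fastq SRR1553606_trimmed_2.fastq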
Copy FASTQC reports for the SRR1553606 trimmed data to the ~/public folder to view.
cp SRR1553606_trimmed_1_fastqc.html ~/public
cp SRR1553606_trimmed_2_fastqc.html ~/public
Our per base quality looks much better and adapters were removed after trimming using Trimmomatic. Note in the basic statistics portion of the FASTQC report for the trimmed files, we lost around 40% of the reads from the original. Therefore, trimming is a balancing act between removing unwanted reads and keeping as much of the original information as possible to prevent our experiment from becoming a wasted effort.
Figure 21: Per base quality for SRR1553606_1 after Trimmomatic trimming.
Figure 23: Per base quality for SRR1553606_2 after Trimmomatic trimming.
BBDuk is another tool that can be used for adapter and quality trimming. In addition, BBDuk
can be used to filter out contaminations, perform GC filtering, filter for length, etc. (see https://
jgi.doe.gov/data-and-tools/software-tools/bbtools/bb-tools-user-guide/bbduk-guide/ (https://
jgi.doe.gov/data-and-tools/software-tools/bbtools/bb-tools-user-guide/bbduk-guide/)).
Let's run BBDuk to do the same adapter and quality trimming as we did with Trimmomatic for the FASTQ files in SRR1553606; a sketch of the command follows the parameter descriptions below.
• Again, we are working with paired end sequencing so we provide read 1 after the "in="
argument and then read 2 after "in2=". Similarly, we specify the output file names for read
1 and read 2 after the "out=" and "out2=" arguments, respectively.
• The qtrim argument tells BBDuk we want to perform quality trimming. Setting qtrim=r
means that BBDuk will trim from the right side of the sequence. We specify the quality
threshold for trimming to 30 using the trimq argument.
• For adapter trimming, we specify the adapter sequence FASTA file (again, we will be using nextera.fa created earlier). Note that for adapter trimming, we use the ktrim option,
which essentially tells BBDuk to trim based on sequence matching rather than quality.
Here, we set ktrim=r so that BBDuk will trim away bases to the right of the match. The
parameters that follow ktrim are criteria that determine whether a portion of the
sequencing read matches the adapter.
• See BBDuk manual (https://jgi.doe.gov/data-and-tools/software-tools/bbtools/bb-tools-
user-guide/bbduk-guide/) for more about arguments and parameters that can be used
with this program.
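A sketch of the BBDuk command is shown below; the output file names are chosen here for illustration, and the k-mer matching parameters (k, mink, hdist) are typical values from the BBDuk guide rather than course-specified settings.
bbduk.sh in=SRR1553606_1.fastq in2=SRR1553606_2.fastq \
out=SRR1553606_bbduk_1.fastq out2=SRR1553606_bbduk_2.fastq \
ref=nextera.fa ktrim=r k=23 mink=11 hdist=1 \
qtrim=r trimq=30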
As BBDuk is running, we see statistics such as the number of reads that are quality and/or
adapter trimmed. We also see the number of reads that have been removed and the number of
reads that remain.
Input: 20000 reads, 2020000 bases.
QTrimmed: 4137 reads (20.69%), 276896 bases (13.71%)
KTrimmed: 5632 reads (28.16%), 448982 bases (22.23%)
Total Removed: 6356 reads (31.78%), 725878 bases (35.93%)
Result: 13644 reads (68.22%), 1294122 bases (64.07%)
Running FASTQC on the BBDuk trimmed output, we see that BBDuk performs similar to
Trimmomatic.
Learning objectives
Here, we will do a quick review of what we have learned about RNA sequencing in Lessons 8
through 11.
Once you sign into this handbook, you will find that it is composed of several different books
including one for RNA sequencing.
Scroll to the bottom of the page and you will find a button that says Access Your Account. Click
this to sign in.
Because the Biostars handbook subscription is only good for 6 months, we recommend that
you download either the PDF or eBook.
What are the files that we need for RNA sequencing analysis?
Solution: We need the raw sequencing data (FASTQ files), a reference genome (FASTA file), and a genome annotation file (GTF).
Why do we need a reference genome?
Solution:
The reference genome serves as a "known" that guides us in constructing the genome of the
unknown from sequencing data.
What file format is the reference genome in and what information does it contain?
Solution:
The reference genome is in the fasta/fa format. These files will have extension fasta or fa, where
the two extensions are used interchangeably.
A fasta file contains a definition line that starts with ">" followed by nucleotide sequences.
What is a genome annotation (gtf) file and why do we need it?
Solution:
The annotation file lists the features of a genome (ie. genes, transcripts, exons) along with their
coordinates and other information. Annotations files are useful in RNA sequencing because it
informs us of which gene or transcripts the aligned reads are overlapping and thus helps us
generate a table of expression counts for our samples either on per gene or transcript basis.
What is a fastq file?
Solution:
A fastq or fq file is the format for files that contain our sequencing data. Similar to a fasta file,
which contains a header line that starts with ">" followed by sequence, the fastq file also
contains a header line for each sequencing read that starts with "@". The sequencing read
follows the metadata line, which is then followed by a "+" sign and a line that contains the quality
score of each of the bases in a sequencing read.
What tool can we use to assess quality of sequencing data? And how do we aggregate several
FASTQC reports into one.
Solution: FASTQC. To aggregate several FASTQC reports into one, we can use MultiQC.
What type of data clean up can we perform on sequencing data prior to downstream analysis?
Solution:
We can trim away adapters and low quality reads. Trimmomatic is a tool that can be used to do
this.
Lesson 11 Review
In Lesson 11 we learned to aggregate multiple FASTQC reports into one using MultiQC, which
allows us to easily interrogate the quality of sequencing data for multiple samples. We also
learned to trim away adapters and poor quality reads in raw data using Trimmomatic
(instructions for trimming using BBDuk are available in the Lesson 11 content for you to look at
even though we did not go over this).
Learning objectives
In this lesson, we will continue to use the Human Brain Reference (HBR) and Universal Human
Reference (UHR) data and we will
• Learn to align the sequencing data to reference genome using HISAT2, which is a splice
aware aligner
• Look at post alignment QC
• Familiarize ourselves with the contents of alignment output
• Learn to use SAMTOOLS to work with alignment output
• Align sequences with BOWTIE2, which is not splice aware so we can visualize and
compare to results obtained from HISAT2 using the Integrative Genome Viewer (IGV) in
Lesson 14.
The skills learned in this lesson can be applied towards your own research and subsequent
lessons in this course.
cd ~/biostar_class/hbr_uhr/refs
As a review, when we list the contents of this folder, we will see that it contains the reference
genome (22.fa) and annotation (22.gtf) for human chromosome 22.
ls
The first step in alignment is to create an index for the reference genome. Think of an index as a
table of contents in a book. If we are searching for something in a book, we can either search
from beginning to end and depending on the size of the book, this could take a long time.
Alternatively, we could use the table of contents to jump to and search only the relevant
sections. Thus, an index allows the aligner to search more specifically and reduce computation
time.
To align the HBR and UHR raw reads to chromosome 22 we will use a tool called HISAT2 (http://
daehwankimlab.github.io/hisat2/manual/), which is a splice aware aligner used for RNA
sequencing. This aligner will be able to handle the alignment of reads that fall on two exons. We
will use the build feature of HISAT2 to create our index. Other aligners will have their own
algorithm for indexing the reference genome.
To build the index for human chromosome 22, we type hisat2-build at the command prompt,
followed by the name of FASTA file for the reference genome (22.fa in our case) and then the
base name (ie. file name without extension) that we would like to use for our index (here we
choose 22 as the base name).
hisat2-build 22.fa 22
After the index has been generated, we can list the contents of our biostar_class/hbr_uhr/refs
folder to see if anything has changed. We use the -1 option with ls to list one item per row.
ls -1
Note that we now have HISAT2 genome indices, which are the 8 files that have extension "ht2".
22.1.ht2
22.2.ht2
22.3.ht2
22.4.ht2
22.5.ht2
22.6.ht2
22.7.ht2
22.8.ht2
Once the index for the human chromosome 22 reference has been created, change back into
the ~/biostar_class/hbr_uhr folder.
cd ~/biostar_class/hbr_uhr
Then create a new folder called hbr_uhr_hisat2 and change into this. We will keep our
alignment outputs in this folder.
mkdir hbr_uhr_hisat2
cd hbr_uhr_hisat2
The options for the HISAT2 alignment command are described below (a sketch of the full command follows this list).
• The "-x" flag prompts us to enter the base name (ie. without extension) of the genome index. The HISAT2 index is in the ~/biostar_class/hbr_uhr/refs directory, which is one directory back from our present working directory of ~/biostar_class/hbr_uhr/hbr_uhr_hisat2, so we can use ".." to specify going one directory back and then into refs.
• We specify files containing read 1 and read 2 of paired end sequencing after "-1" and "-2"
flags, respectively. The reads are in ~/biostar_class/hbr_uhr/reads, which is one directory
back from our present working directory of ~/biostar_class/hbr_uhr/hbr_uhr_hisat2, so we
can use ".." to specify go one directory back then go into reads.
• Using the "-S" flag, we indicate that we want to save the alignment results in the SAM format, where SAM stands for Sequence Alignment/Map.
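The full HISAT2 call was not shown above; below is a minimal sketch for one sample (HBR_1), assuming the read files are named HBR_1_R1.fq and HBR_1_R2.fq as shown later in this lesson.
# run from ~/biostar_class/hbr_uhr/hbr_uhr_hisat2; read file names are assumptions
hisat2 -x ../refs/22 -1 ../reads/HBR_1_R1.fq -2 ../reads/HBR_1_R2.fq -S HBR_1.sam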
----
56230 pairs aligned concordantly 0 times; of these:
173 (0.31%) aligned discordantly 1 time
----
56057 pairs aligned 0 times concordantly or discordantly; of these:
112114 mates make up the pairs; of these:
112061 (99.95%) aligned 0 times
48 (0.04%) aligned exactly 1 time
5 (0.00%) aligned >1 times
52.75% overall alignment rate
Let's break down the alignment statistics shown above for the sample HBR_1.
The first line of the HISAT2 alignment statistics says that 118571 reads (100.00%) were paired. Recall from FASTQC that the read 1 and read 2 FASTQ files for HBR_1 have 118571 reads each (Figures 1 and 2). So the first line of the HISAT2 alignment statistics is telling us that out of all the reads from the read 1 and read 2 FASTQ files of HBR_1, we have 118571 pairs, which agrees with what we know. This also means we have 118571x2, or 237142, reads for the HBR_1 sample.
Figure 1
Figure 2
Following the first line of the HISAT2 alignment statistics, we will see some terms like concordant and discordant read pairs. See Figure 3 for a visual explanation.
• A concordant read pair is defined as those that align in an expected manner where the
reads are oriented towards one another and the distance between the outer edges is
within expected ranges.
• A discordant read pair is defined as those that do not align as expected such as where
the distance between the outer edges is smaller or larger than expected.
If we sum up the number of reads that mapped in the above break down and divide by the total
number of reads in the two FASTQ files for the HBR_1 sample then we should get an overall
alignment rate of 52.75%.
Figure 3: Source: Benjamin J. Raphael, Chapter 6: Structural Variation and Medical Genomics, PLOS Computational Biology, December 27, 2012 (https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1002821)
If we include the option --summary-file in the HISAT2 command, we can specify a file name to save the alignment statistics.
Before moving further, let's use the parallel and echo commands to create a text file that
contains the IDs for the HBR and UHR samples. This will allow us to align many FASTQ files at
the same time.
To create the list of sample IDs for the HBR and UHR dataset we run the command below:
• parallel allows us to create the list of sample IDs all in one go (ie. multi-task rather than
doing things in series)
• echo will print its arguments to the terminal (ie. echo horse will print horse in the terminal); in this case we want echo to print {1}_{2}, where {1} and {2} can represent anything connected by an underscore ("_"). {1} denotes input 1 and {2} denotes input 2, such that input 1 is printed first and input 2 is printed second.
• the inputs in the command below are HBR and UHR, the sample groups in our dataset; "1
2 3" to denote 1, 2, or 3 replicates for each group (ie. we have samples HBR_1, HBR_2,
HBR_3, UHR_1, UHR_2, UHR_3)
• one of the ways to specify input using parallel is the ":::" notation. here, after {1}_{2}, we
specify the sample groups as the first input (HBR UHR) and then the replicate number (1,
2, or 3)
• we write this to a file called ids.txt in the ~/biostar_class/hbr_uhr/reads folder, where the reads folder is one directory up from our present working directory of ~/biostar_class/hbr_uhr/hbr_uhr_hisat2, so we can denote this using ".." (one directory up) and then specify the reads directory followed by the file name
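A minimal sketch of the parallel/echo construct described above (run from the hbr_uhr_hisat2 folder; --keep-order is an addition here so the IDs are written in the same order as the inputs):
# writes the six sample IDs to ../reads/ids.txt
parallel --keep-order echo {1}_{2} ::: HBR UHR ::: 1 2 3 > ../reads/ids.txt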
cat ../reads/ids.txt
HBR_1
HBR_2
HBR_3
UHR_1
UHR_2
UHR_3
To align all of the HBR and UHR FASTQ files we can take advantage of the parallel command. Let's use the --summary-file option to store the alignment statistics; we will see why this is useful in a bit. A sketch of the command is shown below.
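A sketch of the parallel HISAT2 command, assuming the read file naming used elsewhere in this lesson ({}_R1.fq and {}_R2.fq):
# run from ~/biostar_class/hbr_uhr/hbr_uhr_hisat2
cat ../reads/ids.txt | parallel "hisat2 -x ../refs/22 -1 ../reads/{}_R1.fq -2 ../reads/{}_R2.fq -S {}.sam --summary-file {}_hisat2_summary.txt"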
After alignment with HISAT2, let's list the contents of the hbr_uhr_hisat2 directory to see what
has changed.
ls -1
HBR_1.sam
HBR_1_hisat2_summary.txt
HBR_2.sam
HBR_2_hisat2_summary.txt
HBR_3.sam
HBR_3_hisat2_summary.txt
UHR_1.sam
UHR_1_hisat2_summary.txt
UHR_2.sam
UHR_2_hisat2_summary.txt
UHR_3.sam
UHR_3_hisat2_summary.txt
If we print the alignment summary file for HBR_1 (HBR_1_hisat2_summary.txt) then we should
see the same statistics that we saw earlier.
cat HBR_1_hisat2_summary.txt
Recall from Lesson 11 that we can include summaries from post alignment steps in MultiQC reports. This is why we created the summary files for the HISAT2 alignments of the HBR and UHR data. So let's run MultiQC again to generate a report that includes the HISAT2 alignment statistics. Make sure that we change into the ~/biostar_class/hbr_uhr folder and then construct the multiqc command below, where we specify the output filename using --filename. The output will be written to the QC directory. We use "." to tell multiqc to search ~/biostar_class/hbr_uhr for any QC or log files; it will search not only the present working directory but also sub-directories.
cd ~/biostar_class/hbr_uhr
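A sketch of the multiqc command described above (the output filename matches the report copied to ~/public below):
multiqc . --filename multiqc_hbr_uhr_with_hisat2 -o QC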
Note that multiqc is now adding the alignment statistics in the report.
We can copy multiqc_hbr_uhr_with_hisat2 to ~/public and then take a look at this file.
cp QC/multiqc_hbr_uhr_with_hisat2.html ~/public
At the GOLD landing page, scroll down the student table until you see your name and click the
tab labeled File that is associated with your name.
You will then be taken to a page where you can access the files in your ~/public directory. You can either right click to download the files or, in the case of html files, view them in a separate browser tab.
In the navigation pane of the multiqc report, we now see a link to the hisat2 alignment statistics
(Figure 4).
Figure 4
In the general statistics table, we now have a column indicating the overall alignment rate for
each sample (Figure 5).
Figure 5
This information is the same as what we see in the alignment statistics generated by HISAT2, except now we have combined it with our pre-alignment QC reports. This provides a nice way for us to keep the logs and summary of our analysis in one place so we can share them with colleagues and collaborators.
Figure 6
If we click on the bar for one of the samples, a dialogue box with alignment statistics appears
and we can then see the numbers.
Figure 7
cd ~/biostar_class/hbr_uhr/hbr_uhr_hisat2
We now have our SAM files generated for the HBR and UHR dataset. For your reference, information on the SAM format can be found here (https://samtools.github.io/hts-specs/SAMv1.pdf). A SAM file is tab delimited, so it can be opened in Excel. I am going to copy HBR_1.sam to my ~/public folder so I can open it locally in Excel to go over this SAM file (try to just watch what I do here).
SAM files always start off with metadata information and these lines start with @.
• @HD includes
◦ SAM file format information (in this case version 1.0 as indicated by VN:1.0)
◦ Whether the alignment file has been sorted (in this case no as indicated by
SO:unsorted)
• @SQ provides reference information
◦ SN denotes reference sequence name, which is chr22
◦ LN is the reference length in bases, which is 50818468 as we found in Lesson 9
using seqkit stats
• @PG provides information about the program we used to generate the alignment
◦ ID is the program record identifier
◦ PN is the program name (HISAT2 in our case version 2.2.1 as indicated next to VN)
◦ CL is the command line call used to generate the alignment
After the metadata, each line of the SAM file describes one alignment and contains the following columns.
1. QNAME or query template name, which is essentially the sequencing read that we want mapped onto a reference genome. If we grep the first sequence in HBR_1.sam, then we should retrieve one of the reads in the original FASTQ files (try it on your own).
2. The second column is a FLAG that tells us a bit about the mapping. Here is a good tool to help interpret these FLAGS (https://broadinstitute.github.io/picard/explain-flags.html). These FLAGS can inform us of things like paired end read alignment orientation.
3. Column three contains the name of our reference genome.
4. Column four tells us the left most genomic coordinate where the sequencing read maps.
5. The mapping quality (MAPQ) is provided in the fifth column (the higher the number, the
more confident we are of the map); a value of 255 in this column means that the mapping
quality is not available.
6. Column six presents the CIGAR string, which tells us information about the match/
mismatch, insertion/deletion, etc. of the alignment.
7. Column seven is the reference sequence name of the primary alignment of the NEXT read in the template. We will see an "=" if this name is the same as the current one (which we would expect for paired reads).
8. The alignment position of the next read in the template is provided in column 8. When this
is not available, the value is set to 0.
9. Column nine provides the template length (TLEN), which tells us how many bases of the reference genome the template (read pair) spans.
10. The tenth column is just the sequencing read (some are written as the reverse
complement so be cautious. The FLAGS in column two will tell us whether the sequence
is reverse complemented).
11. The eleventh column is the Phred quality scores of the sequencing read.
12. For definitions of the optional fields, see https://samtools.github.io/hts-specs/ (https://samtools.github.io/hts-specs/SAMtags.pdf).
The SAM file format is human readable so we will need to convert it to a machine readable
format (Binary Alignment Map or BAM) before we can visualize alignment results using IGV and
perform other downstream processes like obtaining read counts. To convert between SAM and BAM, we can use an application called SAMTOOLS (http://www.htslib.org/).
If we type samtools on the command line, we can see that this application has lots of features.
For an alignment file to be of use, we will need to sort it by genomic position using the sort feature of SAMTOOLS. So we type the following command, where the -o flag prompts us to enter the output file name (in this case it will be a BAM file), and the last argument is the input SAM file that we want to sort.
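A minimal sketch for sorting the HBR_1 alignment:
samtools sort -o HBR_1.bam HBR_1.sam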
Now that we have our BAM file for HBR_1 generated, we need to index it. Similar to the idea of
indexing a reference genome, indexing the BAM file will allow the program that uses it to more
efficiently search through it. For this we will use samtools index, where the -b flag tells
SAMTOOLS to create the index from a BAM file. We include the extension ".bai" in the output
file.
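A minimal sketch for indexing the sorted HBR_1 BAM file:
samtools index -b HBR_1.bam HBR_1.bam.bai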
Sorting and indexing our BAM files are perhaps the two necessary steps that allow us to visualize and move forward with our analysis. The above shows how to sort and index one BAM file at a time. Below, let's use the parallel command to take care of these tasks for all of our alignment outputs in one go.
Similar to how we constructed the hisat2 alignment, we use cat to send the sample IDs to
parallel. The samtools command construct is enclosed in double quotes and {} acts as a place
holder for accepting the sample IDs provided by the cat command.
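A sketch of the parallel sort and index commands described above (run from the hbr_uhr_hisat2 folder):
# sort each SAM into a BAM, then index each BAM
cat ../reads/ids.txt | parallel "samtools sort -o {}.bam {}.sam"
cat ../reads/ids.txt | parallel "samtools index -b {}.bam {}.bam.bai"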
To run Bowtie2, we will need to create a Bowtie2 specific index for chromosome 22, so change
into the ~/biostar_class/hbr_uhr/refs folder.
cd ~/biostar_class/hbr_uhr/refs
Then, we use bowtie2-build using 22.fa as input (like we did with HISAT2) and assign to the
index the base name of 22.
bowtie2-build 22.fa 22
Listing the contents of our ~/biostar_class/hbr_uhr/refs folder, we see that the Bowtie2 indices for chromosome 22 have been created; these have the extension ".bt2".
ls -1
22.1.bt2
22.1.ht2
22.2.bt2
22.2.ht2
22.3.bt2
22.3.ht2
22.4.bt2
22.4.ht2
22.5.ht2
22.6.ht2
22.7.ht2
22.8.ht2
22.fa
22.gtf
22.rev.1.bt2
22.rev.2.bt2
ERCC92.fa
ERCC92.gtf
cd ~/biostar_class/hbr_uhr
mkdir hbr_uhr_bowtie2
cd hbr_uhr_bowtie2
The command to align with Bowtie2 is constructed similarly to the HISAT2 command, with the exception that we append "_bowtie2" to the output names so that we know they come from a Bowtie2 alignment (a sketch is shown below). HISAT2 is actually based on Bowtie2 (http://daehwankimlab.github.io/hisat2/manual/).
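A sketch of the Bowtie2 alignment run through parallel (run from the hbr_uhr_bowtie2 folder created above; the "_bowtie2" output names follow the convention described in the text):
cat ../reads/ids.txt | parallel "bowtie2 -x ../refs/22 -1 ../reads/{}_R1.fq -2 ../reads/{}_R2.fq -S {}_bowtie2.sam"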
And now we will sort the Bowtie2 SAM files and convert them to BAM. As with the HISAT2 output, we use cat to send the sample IDs to parallel; the samtools command construct is enclosed in double quotes and {} acts as a place holder for accepting the sample IDs provided by the cat command (only the "_bowtie2" file names differ from the earlier sketch).
Lesson 13 Review
Previously, we used the application HISAT2 to align the raw sequencing data from the Human
Brain Reference (HBR) and Universal Human Reference (UHR) dataset. We created sorted and
indexed alignment output in the form of BAM files that we could use to visualize results in the
Integrative Genome Viewer (IGV). We also used the splice unaware Bowtie2 aligner to map the
HBR and UHR reads to chromosome 22 and will see how these results differ from HISAT2 when
visualizing in IGV.
Learning objectives
In this lesson, we will continue to use the HBR and UHR dataset and focus on learning how to
visualize the alignment outputs (both HISAT2 and Bowtie2) in IGV.
In Lesson 9, we got a short introduction to what IGV can do. It allows us to visualize genomic data such as reference genomes and how features such as genes and transcripts align to them. A common thing to do after aligning raw sequencing reads is to visually inspect the results in IGV, and in this lesson we will do exactly this. In the previous lessons we generated the BAM files; here we will generate an additional file type that is used to visualize sequencing coverage in IGV. This file format is known as bigWig or bw (https://genome.ucsc.edu/goldenPath/help/bigWig.html). bigWig files are binary (so not human readable) and make visualization faster because the computer only needs to keep in memory the content that needs to be displayed. We will use bedtools and bedGraphToBigWig to generate the bigWig files for the HISAT2 and Bowtie2 alignments of the HBR and UHR dataset.
Be sure to change into the ~/biostar_class/hbr_uhr folder for this portion of the class.
cd ~/biostar_class/hbr_uhr
Step 1 for generating bigWig files is to convert the BAM alignment results to a bedGraph (with
extension bg) file that contains coverage along genomic regions.
• BED file - this is also known as Browser Extensible Data format and contains three columns, which are chromosome, start coordinate, and end coordinate -- see Ian's response in this Biostars forum (https://www.biostars.org/p/113452/)
• bedGraph - this has the same first three columns as the BED file but also has a fourth column that could contain anything, such as sequencing coverage -- also see Ian's response in this Biostars forum (https://www.biostars.org/p/113452/)
$ cat A.bed
chr1 10 20
chr1 20 30
chr2 0 500
To generate a bedGraph file from the BAM alignment outputs of the HBR and UHR dataset, we will use an application called bedtools (https://bedtools.readthedocs.io/en/latest/index.html), which can be used for a range of tasks including compiling information on genomic intervals. We will use its genomecov subcommand (https://bedtools.readthedocs.io/en/latest/content/tools/genomecov.html), which calculates coverage over a genomic range, with the following options:
• -split: when applied to RNA sequencing data, do not count coverage over introns -- see this post on why we get reads that map to introns in RNA sequencing (https://www.biostars.org/p/42890/)
• -bg: report sequencing depth along genomic intervals rather than at each nucleotide position (Figure 1 shows the ways we can get bedtools to output coverage information - some options report coverage along an interval, some report it at each base position; bedtools also gives us the ability to fine tune how we count coverage along splice junctions with the -split option)
Below, we take advantage of the parallel command to convert the BAM files from both HISAT2
and Bowtie2 alignments into bedGraph (bg) files for all of the samples in one go.
• cat reads/ids.txt: this will read the HBR and UHR sample IDs that are stored in a file
called ids.txt in the reads folder of the ~/biostar_class/hbr_uhr directory
• |: will pipe or send the output of cat to the next command, which is parallel
• we enclose the command that we want parallel to operate on in double quotes; here we
are using bedtools and its genomecov subcommand, where the parameters -ibam, -split,
and -bg were explained above
• hbr_uhr_hisat2/{}.bam: this is the path to our BAM file generated from aligning the HBR
and UHR FASTQ files to genome
◦ hbr_uhr_hisat2 is the output directory for the HISAT2 alignment (hbr_uhr_bowtie2 is
the output directory for the Bowtie2 alignment)
◦ we use {} as a place holder to receive the HBR and UHR sample IDs from cat (ie. HBR_1, HBR_2, HBR_3, UHR_1, UHR_2, UHR_3) and these IDs are appended with .bam to complete the full path of the input BAM file
• hbr_uhr_hisat2/{}_hisat2.bg:
◦ we write the bedGraph output to the hbr_uhr_hisat2 folder for the HISAT2 alignment (hbr_uhr_bowtie2 for the Bowtie2 alignment)
◦ again, {} acts as a place holder to receive the HBR and UHR sample IDs from cat, and the sample IDs are appended with _hisat2.bg for the HISAT2 alignment output (or _bowtie2.bg for the Bowtie2 alignment output) to complete the full path of the output bedGraph (bg) file
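A sketch of the parallel bedtools genomecov commands described by the bullets above (run from ~/biostar_class/hbr_uhr; the Bowtie2 BAM names are an assumption based on the "_bowtie2" naming used in this lesson):
cat reads/ids.txt | parallel "bedtools genomecov -ibam hbr_uhr_hisat2/{}.bam -split -bg > hbr_uhr_hisat2/{}_hisat2.bg"
cat reads/ids.txt | parallel "bedtools genomecov -ibam hbr_uhr_bowtie2/{}_bowtie2.bam -split -bg > hbr_uhr_bowtie2/{}_bowtie2.bg"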
Below we will look at the content of HBR_1_hisat2.bg sorted by sequencing depth (column 4)
from highest to lowest using the column command.
The first column is the chromosome, followed by the genomic coordinates and the sequencing
depth.
To proceed with converting the bedGraph files to bigWig, we first need to create an index of our genome using SAMTOOLS and its faidx feature, which indexes (and can extract from) FASTA files.
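A minimal sketch of the indexing step (run from ~/biostar_class/hbr_uhr):
samtools faidx refs/22.fa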
Listing the contents of our refs directory, we now see an index of the human chromosome 22
genome named 22.fa.fai.
ls refs
After the index file for the genome has been created, we will use a tool called bedGraphToBigWig (https://www.encodeproject.org/software/bedgraphtobigwig/) to generate bigWig (bw) files from the bedGraph (bg) files.
• cat reads/ids.txt: this will read the HBR and UHR sample IDs that are stored in a file
called ids.txt in the reads folder of the ~/biostar_class/hbr_uhr directory
• |: will pipe or send the output of cat to the next command, which is parallel
• we enclose the command that we want parallel to operate on in double quotes; here we
are using bedGraphToBigWig
• hbr_uhr_hisat2/{}_hisat2.bg: this is the path to our bg (bedGraph) file generated using
bedtools genomecov
◦ hbr_uhr_hisat2 is the output directory for the HISAT2 alignment (hbr_uhr_bowtie2 is
the output directory for the Bowtie2 alignment)
◦ we use {} as a place holder to receive the HBR and UHR sample IDs from cat (ie. HBR_1, HBR_2, HBR_3, UHR_1, UHR_2, UHR_3) and these IDs are appended with _hisat2.bg to complete the full path of the input bedGraph (bg) file
• refs/22.fa.fai: path for index of the human chromosome 22 reference
• hbr_uhr_hisat2/{}_hisat2.bw:
◦ we write the bigWig output to the hbr_uhr_hisat2 for the HISAT2 alignment
(hbr_uhr_bowtie2 for the Bowtie2 alignment)
◦ again, {} acts as a place holder to receive the HBR and UHR sample IDs from cat, and the sample IDs are appended with _hisat2.bw for the HISAT2 alignment output (or _bowtie2.bw for the Bowtie2 alignment output) to complete the full path of the output bigWig (bw) file
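A sketch of the parallel bedGraphToBigWig commands described by the bullets above (run from ~/biostar_class/hbr_uhr):
cat reads/ids.txt | parallel "bedGraphToBigWig hbr_uhr_hisat2/{}_hisat2.bg refs/22.fa.fai hbr_uhr_hisat2/{}_hisat2.bw"
cat reads/ids.txt | parallel "bedGraphToBigWig hbr_uhr_bowtie2/{}_bowtie2.bg refs/22.fa.fai hbr_uhr_bowtie2/{}_bowtie2.bw"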
Figure 1: Click on the BAM_BW.html under All Projects -> BioStars to access the IGV launcher
for the HBR and UHR dataset.
We can also select the reference genome to use in the genome selection drop down menu. But
here, we will stick with hg38 (Figure 3).
Figure 3
We can also load local data into IGV by going to "File" and then "Load from File" (Figure 4). But the alignment output has already been pre-loaded for us, so we will not need to load any local data.
Figure 4
Unfortunately, the entire IGV browser is empty after we loaded the data, with the exception of
some coverage at chromosome 22 on the bigWig tracks from hg38. This is because we aligned
our HBR and UHR raw sequencing data only to chromosome 22.
Figure 5
Once we select chr22 (Figure 6), we will begin to see more information in IGV.
Figure 6
• On the top (Tracks 1) we have the bigWig or bw tracks for HBR_1 aligned with HISAT2
(HBR_1.bw) or aligned with Bowtie2 (HBR_1_bowtie2.bw).
• Tracks 2 - the BAM alignment for HBR_1 aligned with Bowtie2 (alignment and coverage
tracks are separate)
• Tracks 3 - the BAM alignment for HBR_1 aligned with HISAT2 (alignment, splice junction,
and coverage tracks are separate)
• On the bottom, we have the gene model track
Figure 7
When the alignment results have loaded, let's go to chr22:41,377,179-41,390,187 (enter this in
the search box and select Go) (Figure 6). Things to note are
Figure 8
We should also note the difference in the coverage histogram between the HBR_1 HISAT2 and
Bowtie2 alignments on the bigWig track (chr22:41,377,179-41,390,187).
Figure 9
If we zoom in to chr22:41,387,655, we will see that the coverage track is colored partly in blue and partly in red (Figure 10). This shows a potential variant, where the reference is C (look at the sequence right above the TEF gene model) and some of the reads have a T.
Figure 10
If we go back up to the "File" tab in the IGV menu bar, we can select to "Load from Server"
(Figure 11) to bring in other information such as SNPs into IGV (Figure 12).
Figure 11
Figure 12
Once we click ok in the menu shown in Figure 12, we will see a SNP track beneath the bigWig
tracks. Also, we will find that a SNP record aligns to the location where we found our potential
variant.
Figure 13
If we click on the SNP record, a dialog box containing more information about this variant
appears.
Figure 14
Note that the presence of an "I" in the subject genome represents an insertion (click on it to see
more details). In the example in Figure 15, we have a T insertion.
Figure 15
One of the things that we can do with IGV is to color or group reads according to certain
criteria. For example, in paired end sequencing, the orientation of alignment for the pairs can
alert us to potential structural variations so one of the things we can do is to go into our
alignment (let's use the HBR_1 Bowtie2 alignment) by right clicking on the track, then select
"Group alignments by" and then choose "pair orientation" (Figure 16).
Figure 16
Grouping the alignments by pair orientation, we see that the alignment in the
HBR_1_Bowtie2.bam track has been separated into two groups - a track labeled "RL" and one
labeled "LR" (Figure 17). See https://software.broadinstitute.org/software/igv/interpreting_pair_orientations (https://software.broadinstitute.org/software/igv/interpreting_pair_orientations) for the definitions of the read pair orientations shown in our example.
Figure 17
Lesson 14 review
In the previous lesson, we learned to visualize RNA sequencing alignment results in the
Integrative Genome Viewer (IGV).
Learning objectives
In this lesson, we will identify genes that are differentially expressed between conditions. Here,
we want to compare UHR samples to HBR samples, so we will be using existing applications to
determine the ratio of expression of genes in the UHR samples versus the HBR samples (ie.
UHR / HBR) and then we will get a metric to determine whether the expression change is
statistically significant. Below are the two tasks for this lesson.
• Get expression counts for genes in the Human Brain Reference (HBR) and Universal
Human Reference (UHR) dataset
• Complete our analysis of the HBR and UHR data by obtaining genes that are differentially
expressed between the two groups
• Visualize gene expression
Before getting started, make sure that we are in the ~/biostar_class/hbr_uhr directory.
cd ~/biostar_class/hbr_uhr
First let's create a new directory in our ~/biostar_class/hbr_uhr folder to store the counts.
mkdir hbr_uhr_deg_chr22
cd ~/biostar_class/hbr_uhr/hbr_uhr_hisat2
We run featureCounts with the options described below (a sketch of the full command follows this list).
• -p specifies that we want to count in paired end mode since we are working with paired end sequencing
• -a prompts us to provide the annotation file (22.gtf)
• -g prompts us to specify the attribute type in the GTF file (we choose gene_name so that we can get expression by gene)
• -o prompts us to provide the output name
• finally, our input BAM files are provided
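A sketch of the featureCounts command described by the options above (run from the hbr_uhr_hisat2 folder):
# depending on your featureCounts version you may also need --countReadPairs to count fragments
featureCounts -p -a ../refs/22.gtf -g gene_name -o ../hbr_uhr_deg_chr22/hbr_uhr_chr22_counts.txt HBR_1.bam HBR_2.bam HBR_3.bam UHR_1.bam UHR_2.bam UHR_3.bam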
After featureCounts finishes, we can go into the hbr_uhr_deg_chr22 directory to see what we
have.
cd ~/biostar_class/hbr_uhr/hbr_uhr_deg_chr22
ls -1
We have a file hbr_uhr_chr22_counts.txt with the expression counts and a summary. We will
need the expression counts (ie. hbr_uhr_chr22_counts.txt) for differential expression analysis.
hbr_uhr_chr22_counts.txt
hbr_uhr_chr22_counts.txt.summary
If we take a look at the first two lines of hbr_uhr_chr22_counts.txt, we will see that featureCounts saves the program information and command line call in the first line (we need to remove this line). The second line contains the column headers. Of these columns, we do not need Chr, Start, End, Strand, and Length; all we want are the genes and the expression of these genes in our samples.
head -2 hbr_uhr_chr22_counts.txt
To remove the header line, we use the sed command below, where
• the option 1d indicates delete the first line (1 refers to the first line, d denotes delete)
• next we provide the input file
• we use ">" to indicate we want to save the output and then provide a name for the output
(we will append "_no_header" to the original file name)
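A minimal sketch of the sed command described above:
sed '1d' hbr_uhr_chr22_counts.txt > hbr_uhr_chr22_counts_no_header.txt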
head -2 hbr_uhr_chr22_counts_no_header.txt
To remove the columns Chr, Start, End, Strand, and Length, we can use cut
• where -f1,7-12 tells the command that we want field 1 (the gene ids) and then fields 7-12,
the expression counts for the samples
• the output from cut is then piped to tr '\t' ',', which will replace the original tab ('\t')
separated columns with comma (',') separated columns. In other words, rather than using
tabs to separate the columns of our counts table, we use commas, which is required for
the differential expression analysis software. We can do tr ',' '\t' to go from comma
separated columns to tab separated columns
• we save the new counts table as counts.csv
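A minimal sketch of the cut and tr commands described above:
cut -f 1,7-12 hbr_uhr_chr22_counts_no_header.txt | tr '\t' ',' > counts.csv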
Now, we can use the column command to look at the HBR and UHR expression counts table.
Hit q to exit the column command and return to the terminal.
Normalization
Before diving into differential expression analysis, it is important to take a brief look at the
concept of normalization. After obtaining expression counts, we will need to normalize the
counts before performing differential expression analysis. This is an important step because
normalization serves to remove technical variations amongst the samples to ensure that
whatever difference we get in gene expression is due to biology.
Common sources of technical noise include the following and we would like to have them
removed before continuing on with analysis.
• Differences in library size between the samples (ie. the sum of the counts between the
samples are not the same). To understand this a bit better, think of a Western blot, where
we have to load the same amount of protein or starting material (from each sample) into
each lane in a gel so that we can compare protein expression.
• Gene length - the longer the gene the more reads it will get.
• Batch effect - when we process different samples on different days, locations, by different
people etc. we are introducing technical noise.
While there are different approaches for normalization, we will not explore these today as it is a
more advanced topic. Also, some of the differential expression analysis packages such as
DESeq2 (https://bioconductor.org/packages/release/bioc/html/DESeq2.html) and edgeR (https://bioconductor.org/packages/release/bioc/html/edgeR.html) have their own normalization
procedures and users are instructed to input only the raw integer counts when using these
packages.
For now, we will use scripts written by the author of the Biostar Handbook. These scripts can be run on the command line, but they are written in R, so for those interested in learning R, please visit https://btep.ccr.cancer.gov/on-line-classes-2022/ (https://btep.ccr.cancer.gov/on-line-classes-2022/) for our series of introductory R courses.
If we list the contents of the design_file_hisat2 folder, we will see two files.
ls ~/design_file_hisat2
design.csv design.txt
Copy both files into our current directory (the hbr_uhr_deg_chr22 folder).
cp ~/design_file_hisat2/design.csv .
cp ~/design_file_hisat2/design.txt .
The design.csv file and design.txt file contain the same content, except the columns in the csv
file are separated by commas and the txt file are separated by tabs. We can use cat to view
these.
cat design.csv
sample,condition
HBR_1.bam,HBR
HBR_2.bam,HBR
HBR_3.bam,HBR
UHR_1.bam,UHR
UHR_2.bam,UHR
UHR_3.bam,UHR
cat design.txt
sample condition
HBR_1.bam HBR
HBR_2.bam HBR
HBR_3.bam HBR
UHR_1.bam UHR
UHR_2.bam UHR
UHR_3.bam UHR
To get differential expression for genes on chromosome 22, we will need to make sure we are in the ~/biostar_class/hbr_uhr/hbr_uhr_deg_chr22 folder (use pwd to confirm and cd into it if needed) and run the deseq2.r script as shown below.
Rscript $CODE/deseq2.r
While the deseq2.r script is running, it will print some messages. Mainly, it is telling us that it is using the design.csv and counts.csv files as input and that the output is stored in results.csv.
Since the differential analysis results are in csv format, we can open these in Excel. However,
we can also view these in the terminal using the column command. And doing this we find that
there are 13 columns in the differential analysis results table and they are described below. Hit
q to get out of the column command.
We can use the command below to sort by log2 fold change from largest to smallest, where
• column is used to print our tabular data nicely with columns aligned
• -t indicates to separate the columns by a tab
• -s ',' indicates that the columns in our file are originally separated by comma (ie. a csv
file)
• sed 1q will print the first line of our table and then quits, preventing the first line from
being included in the sort
• we use sort to sort things
◦ -k: prompts us to provide column that we want to sort by, here it is column 6 (log2
fold change),
◦ n indicates we want to sort numerically
◦ r indicates in reverse order (from largest to smallest)
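A sketch of the sort described above (the sed -u 1q in the subshell keeps the header line out of the sort; piping to less lets you scroll and press q to exit):
cat results.csv | column -s ',' -t | (sed -u 1q; sort -k 6nr) | less -S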
To visualize how genes cluster by expression across the samples, we run the create_heatmap.r script.
Rscript $CODE/create_heatmap.r
To view the expression heatmap, copy it to ~/public. Go back to the GOLD landing page, click on the "Files" tab associated with your name and you will be taken to an index page. You can click on PDF files and view them in the browser without having to download them.
cp heatmap.pdf ~/public/heatmap_hisat2.pdf
Figure 1 shows the expression heatmap for the chromosome 22 genes in the HBR and UHR samples. At the top of the plot we have a color key, which indicates that down regulated genes have negative values and are colored in shades of green, while up regulated genes have positive values and are colored in shades of red. The horizontal axis of the heatmap is labeled with the sample names, the right vertical axis shows the genes, and the left vertical axis is the dendrogram that shows clustering of genes based on expression. In this expression heatmap, it is clear that there are gene clusters that are up regulated or down regulated in one group but not the other.
Figure 1
Another common visualization in RNA sequencing is the Principal Component Analysis (PCA) plot. This helps us visualize clusters of samples in our data by transforming the data so that we can view it along the axes that capture the largest variation. We will use the create_pca.r script and the commands below to generate our PCA plot. The input for this script is our counts table (counts.csv), but we need a tab separated version of it. We also need design.txt (the tab separated version of the design file) and the number of samples we have (6 in the case of the HBR and UHR dataset).
First use cat and tr to create tab separated counts table from counts.csv. The tr command
replaces ',' with tabs ('\t').
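A minimal sketch:
cat counts.csv | tr ',' '\t' > counts.txt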
Then, we run the R script create_pca.r where the inputs are counts.txt (tab separated
expression counts table), design.txt, and 6 (the number of samples).
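A sketch of the call (the argument order is an assumption; check the script if it errors):
# inputs: tab separated counts table, tab separated design file, number of samples
Rscript $CODE/create_pca.r counts.txt design.txt 6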
cp pca.pdf ~/public/pca_hisat2.pdf
The PCA plot is shown in Figure 2. The HBR samples are plotted as red dots and the UHR samples as green dots. The horizontal axis (PC1) captures the most variance or separation (79%) in our samples. PC2 on the vertical axis captures the second most variance or separation (7%). We see that along PC1 the HBR and UHR samples are clearly separated, so we are likely seeing the biological difference between these samples. Along PC2, while the HBR samples cluster closely, the UHR_2 sample sits off by itself (away from the other two samples in its group). This could indicate some underlying biology of UHR_2, or it may be caused by some technical factor.
Figure 2
Review
In the previous classes, we learned about the steps involved in RNA sequencing analysis. We
started off with assessing quality of raw sequencing data, then we aligned the raw sequencing
data to a genome, and finally we obtained expression counts and conducted differential expression analysis. For quality assessment, we used:
• FASTQC to obtain quality metrics for individual FASTQ files. Recall that FASTQ files
contain our sequencing data and each file has many sequencing reads. Each read is
composed of four lines
◦ Header, that starts with @
◦ Actual sequence
◦ "+"
◦ Quality, which tells us the error likelihood of the base call
• MultiQC to aggregate multiple FASTQC outputs into one
When analyzing high throughput sequencing data, we will need to trim away adapters.
Adapters help anchor the unknown sequencing template to the Illumina flow cell and can
interfere with alignment. We may also want to trim away low quality reads. In this course, we
learned to use Trimmomatic, which is a flexible trimming tool for Illumina data (http://
www.usadellab.org/cms/?page=trimmomatic) to trim away low quality reads and adapters.
Refer to the Trimmomatic manual (http://www.usadellab.org/cms/uploads/supplementary/Trimmomatic/TrimmomaticManual_V0.32.pdf) and the Trimmomatic publication (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4103590/) for details on how to use this tool.
4. Provide names of the output files, including ones for unpaired reads (ie. those that survived the processing in one file but not the other)
5. SLIDINGWINDOW performs sliding window quality trimming; its parameters are
◦ the window size (ie. how many bases the sliding window is composed of), in the example below we use a window size of 4 bases
◦ the average quality score threshold for the window, in the example below we use 30
◦ in the example below, we will scan starting from the 5' end of the read, 4 bases at a time, and trim once the average quality within a 4-base window falls below 30
6. ILLUMINACLIP is used to trim away adapters and other Illumina-specific sequences, the
parameters needed for ILLUMINACLIP are
7. MINLEN is used to specify the minimum length of the trimmed sequence that we want to
keep. In the example below, we set this to 50 bases. If a read falls below 50 bases, then it is
dropped
https://www.genepattern.org/modules/docs/Trimmomatic/#gsc.tab=0 (https://www.genepattern.org/modules/docs/Trimmomatic/#gsc.tab=0)
FYI
Why are adapter sequences trimmed from only the 3' ends of reads? (https://
support.illumina.com/bulletins/2016/04/adapter-trimming-why-are-adapter-sequences-trimmed-
from-only-the--ends-of-reads.html)
We may also run FASTQC again after trimming to make sure that adapters have been removed
and quality is good.
One of the challenges in analyzing high throughput sequencing data is to reconstruct the genome of an unknown sample by using a known one (ie. a reference). The next step in the analysis is therefore to align our sequencing data to a reference genome. We used HISAT2 (splice aware) and visually compared the alignment results with those obtained from Bowtie2 (not splice aware) using the Integrative Genome Viewer (IGV). For RNA sequencing, we should use splice aware aligners to account for reads that map across two exons.
After alignment of the sequencing data to the genome, we will need to count how many reads aligned to which gene. Using the tool featureCounts, we were able to do this. This tool takes as input our BAM alignment files and also an annotation file (gtf/gff) that tells us the location of features (ie. genes, transcripts) in a genome. Note that for paired end sequencing we should include the additional flags below; the meaning of -p differs between featureCounts versions, as the two quoted descriptions show (older versions use -p to request fragment counting, while newer versions use -p to indicate paired end input and require --countReadPairs to count fragments).
-p: If specified, fragments (or templates) will be counted instead of reads. This
option is only applicable for paired-end reads; single-end reads are always
counted as reads.
-p: Specify that input data contain paired-end reads. To perform fragment counting
(ie. counting read pairs), the '--countReadPairs' parameter should also be specified
in addition to this parameter.
--countReadPairs: Count read pairs (fragments) instead of reads. This option is only
applicable for paired-end reads.
"For paired end reads, you should count read pairs (fragments) rather than reads
because counting fragments will give you more accurate counts. There are several
reasons why you cannot get the fragment counts by simply dividing the counts you
got from counting reads by two. One reason is that a fragment with two mapped
reads will give you two counts when you count reads, but a fragment with only one
mapped read will only contribute one count (this fragment should get 1 count in
fragment counting but it ended up with 0.5 count when you count reads instead of
fragments). Another reason is that some reads may be found to be overlapping with
more than one gene and therefore were not counted, but the corresponding
fragments may be counted because the ambiguity was solved by longer fragment
length." -- Wei Shi (https://support.bioconductor.org/p/67534/)
After obtaining the expression counts for each gene, we can run helper R scripts provided by
the author of the Biostar Handbook to obtain differential gene expression results. In our lessons,
we used deseq2.r. We also generated visualizations that show us how genes cluster by
expression (heatmap) and how samples cluster together (PCA).
Learning objectives
An alternative to aligning raw sequencing data to a reference genome is to map them to a
reference transcriptome. In this lesson, we will use the HBR and UHR datasets, and learn about
this approach for analyzing RNA sequencing data and discuss some advantages and
drawbacks. We will do the following
• Construct the human chromosome 22 reference transcriptome using FASTA file for the
chromosome 22 reference genome and gtf file
• Align sequencing reads to the reference transcriptome
• Obtain differential expression
cd ~/biostar_class/hbr_uhr
Since we do not have a reference transcriptome for chromosome 22, we will use a tool called gffread to create one from the chromosome 22 genome (22.fa) that we have used when analyzing the HBR and UHR data via alignment to the genome. The gffread options are described below, and a sketch of the command follows the list.
• -w tells gffread to write a FASTA file with sequences of all exons from a transcript, which is
followed by the output file name (in this example, we are storing the reference
transcriptome, 22_transcriptome.fa in the refs folder so we need to specify that path as
well since we are currently in the ~/biostar_class/hbr_uhr folder)
• -W tells gffread to include the exon coordinates in the header of each sequence (ie. the
sequencing header that starts with ">")
• -g prompts us to enter the reference genome FASTA file (22.fa in our case)
• the gtf (22.gtf) annotation file is provided at the end
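A sketch of the gffread command described by the options above (run from ~/biostar_class/hbr_uhr):
gffread -w refs/22_transcriptome.fa -W -g refs/22.fa refs/22.gtf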
Using head we can view the first few transcripts in human chromosome 22.
head refs/22_transcriptome.fa
Here, we can see that each transcript has a header line that starts with ">" followed by the
actual sequence of the transcript. On the header line we have the transcript ID, which starts
with ENST (they are from Ensembl) and genomic coordinates for the transcripts.
Let's create a folder called salmon to store our alignment outputs from Salmon.
mkdir salmon
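The next paragraph refers to an index that has been generated; a minimal sketch of that salmon indexing step (the index name matches the cmd_info.json output shown later in this lesson):
# run from ~/biostar_class/hbr_uhr
salmon index -t refs/22_transcriptome.fa -i 22_transcriptome.idx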
After the index has been generated, we can use salmon quant with the options below to generate our expression counts table. Below is how we would construct the command for one sample; however, we have 6 samples in the HBR and UHR dataset, so we can turn to the parallel command to quantify all of them at the same time (a sketch is shown after the ids.txt step below).
• -l prompts us to specify the library type and specifying "A" will allow salmon to
automatically infer library type (ie. if the library is paired end)
• --validateMappings helps to make alignment more accurate
• For paired end sequencing, we specify read 1 and read 2 after the flags -1 and -2,
respectively
• -o prompts us to specify a name of the folder where the output will be stored (we want our
output to be stored in the folder salmon, which was created earlier)
Because we have 6 samples, we want to get Salmon to quantify them in one go so we will need
to create a text file called ids.txt that contains the sample names for this dataset.
cd ~/biostar_class/hbr_uhr/reads
cd ..
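A sketch that combines salmon quant with parallel for all six samples (run from ~/biostar_class/hbr_uhr; the paths are assumptions based on the folder layout used in this lesson):
cat reads/ids.txt | parallel "salmon quant -i 22_transcriptome.idx -l A --validateMappings -1 reads/{}_R1.fq -2 reads/{}_R2.fq -o salmon/{}_SALMON"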
Now if we change into the salmon directory and list the content, we should see the folders
below, which contain the transcriptome alignment output for each sample in the HBR and UHR
dataset (Salmon produces an output folder for each sample).
cd salmon
ls -1
HBR_1_SALMON
HBR_2_SALMON
HBR_3_SALMON
UHR_1_SALMON
UHR_2_SALMON
UHR_3_SALMON
Let's take a look at the contents of the HBR_1 alignment results folder.
cd HBR_1_SALMON
ls -1
aux_info
cmd_info.json
libParams
lib_format_counts.json
logs
quant.sf
If we need to recall how we ran the salmon alignment, we can see this in cmd_info.json, where
cmd stands for command line (so this file provides command line information).
cat cmd_info.json
"salmon_version": "1.7.0",
"index": "22_transcriptome.idx",
"libType": "A",
"validateMappings": [],
"mates1": "HBR_1_R1.fq",
"mates2": "HBR_1_R2.fq",
"output": "salmon/HBR_1_SALMON",
"auxDir": "aux_info"
If we look at the salmon_quant.log file in the logs directory, we can get some information such
as overall alignment rate.
cd logs
cat salmon_quant.log
The expression counts are available in the file quant.sf so to take a look at the HBR_1 Salmon
quantifications we need to go up one folder (ie. ~/biostar_class/hbr_uhr/salmon/
HBR_1_SALMON).
cd ..
Below, we use the column command to show the first few lines of column 4 in quant.sf file for
HBR_1, which contains the count data.
• column is used to print our tabular data, which is quant.sf for HBR_1 nicely with columns
aligned
◦ -t finds the number of columns in the data
• we use | to pipe the column output to sed, where
◦ 1q will print the first line of our table and then quit, preventing the first line from
being included in the sort
◦ then we use sort where
▪ -k prompts us to specify the column we like to sort by (column 4 containing
the count data in this case)
▪ we want to sort column 4 numerically so we use n after the 4 to indicate this
▪ we also want to sort column 4 from largest to smallest so we include r, with
the final construct being "-k 4nr" for sorting column 4 numerically from largest
to smallest
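A sketch of the command described above (the sed -u 1q keeps the header out of the sort; press q to exit less):
cat quant.sf | column -t | (sed -u 1q; sort -k 4nr) | less -S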
Hit q to get out of the column command and return to the prompt
The columns in the quant.sf Salmon output file are described below.
cd ~/biostar_class/hbr_uhr/salmon
We will next combine the expression for all of the HBR and UHR samples into one csv file. To do
this, we need a design.csv file like we did with the alignment to genome analysis. As a review,
the design file has two columns, where one column provides the sample names, and the other
informs of the condition with which the samples were treated. Fortunately, the design files have already been created and they should be in the folder design_file_salmon in our home directory. For now, we want design.csv.
ls ~/design_file_salmon
design.csv
design.txt
To work with the design.csv file, we will need to copy it to the ~/biostar_class/hbr_uhr/ folder (so
let's change into this)
cd ~/biostar_class/hbr_uhr/
cp ~/design_file_salmon/design.csv .
cat design.csv
sample,condition
HBR_1_SALMON,HBR
HBR_2_SALMON,HBR
HBR_3_SALMON,HBR
UHR_1_SALMON,UHR
UHR_2_SALMON,UHR
UHR_3_SALMON,UHR
Note that the sample names in the design file match the names of the salmon alignment output folders.
Now that we have the design file, run the combine_transcripts.r script to get a table with the
expression for all samples. This script takes as input the design.csv file and the folder that
contains the salmon output for all of the samples (ie. the salmon directory).
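A minimal sketch, assuming that (like deseq2.r) the script picks up design.csv and the salmon folder from the current directory; the exact invocation may differ:
# run from ~/biostar_class/hbr_uhr; writes counts.csv
Rscript $CODE/combine_transcripts.r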
Below we show the first 4 lines of the merged salmon counts table using the column command. Hit q to get out of the column command and return to the prompt.
Move the counts and design files into the salmon directory.
mv counts.csv salmon/
mv design.csv salmon/
Change back into the salmon directory so we can run differential analysis.
cd salmon
Rscript $CODE/deseq2.r
As we have seen previously, the deseq2.r script writes the differential analysis output to a file called results.csv. To view the differential analysis results, let's use the column command as before (hit q to exit column and return to the prompt), where
◦ -s allows us to specify the column separator, which is a "," because we are working with a csv or comma separated value file, where the columns in the file are separated by ","
First let's sort the results.csv file by transcript ID and save it as results_id_sorted.txt to denote
that this is a sorted version. To obtain the results_id_sorted.txt file, we
• use grep to find ENST, which is the prefix to the transcript ids in the results.csv file
• use | to send the grep results to column to generate a nicely aligned print out - note that
in the column command, even though we use -s to specify that the columns in the input
are comma separated, column will print the results to the terminal as a tab separated
table
• use | to send the column output to sort
• finally, > save the sorted output as results_id_sorted.txt
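A sketch of the command described above (run from the salmon folder):
grep ENST results.csv | column -s ',' -t | sort > results_id_sorted.txt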
cd ~/biostar_class/hbr_uhr
We have the 22.gtf file that tells us the genes, transcripts, exons and other features that reside
on human chromosome 22. There is a tool called gtfToGenePred that can help us extract the transcripts and gene names from the gtf file.
We can pull up the documentation for gtfToGenePred if we just type "gtfToGenePred" at the command prompt.
options:
-genePredExt - create a extended genePred, including frame
information and gene name
-allErrors - skip groups with errors rather than aborting.
Useful for getting information about as many errors as possible.
-ignoreGroupsWithoutExons - skip groups contain no exons rather than
generate an error.
-infoOut=file - write a file with information on each transcript
-sourcePrefix=pre - only process entries where the source name has the
specified prefix. May be repeated.
-impliedStopAfterCds - implied stop codon in after CDS
-simple - just check column validity, not hierarchy, resulting genePred
-geneNameAsName2 - if specified, use gene_name for the name2 field
instead of gene_id.
-includeVersion - if gene_version and/or transcript_version attributes exist, include
the version in the corresponding identifiers.
If you remember in 22.gtf, under the attribute column we have things like gene id, gene name,
etc (see the table below). We will use the option geneNameAsName2 to pull the gene names.
We will also be using the genePredExt option to create the genePred file.
Below are example records from 22.gtf for the U2 snRNA gene on chromosome 22 (one gene, one transcript, and one exon record; attribute values that were truncated in the original table are marked with "…"). The columns are CHROMOSOME, SOURCE, FEATURE, START, END, SCORE, STRAND, FRAME, and ATTRIBUTE.
chr22  ENSEMBL  gene        10736171  10736283  .  -  .  gene_id "ENSG00000277…"; gene_type "snRN…"; gene_status "NO…"; gene_name "U2"; level 3;
chr22  ENSEMBL  transcript  10736171  10736283  .  -  .  gene_id "ENSG00000277…"; transcript_id "ENST00000615…"; gene_type "snRN…"; gene_status "NO…"; gene_name "U2"; transcript_type "snRNA"; transcript_status "NOVEL"; transcript_name "U2.14-201"; level …; tag "basic"; transcript_support… "NA";
chr22  ENSEMBL  exon        10736171  10736283  .  -  .  gene_id "ENSG00000277…"; transcript_id "ENST00000615…"; gene_type "snRN…"; gene_status "NO…"; gene_name "U2"; transcript_type "snRNA"; transcript_status "NOVEL"; transcript_name "U2.14-201"; exon_number 1; exon_id "ENSE00003736…"; level 3; tag "basi…"; transcript_support… "NA";
Here, we will use the -genePredExt and -geneNameAsName2 options to get gene names included in our output.
Note that in the gtfToGenePred command below, we are saving the genePred output to the refs folder.
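A sketch of the gtfToGenePred command (run from ~/biostar_class/hbr_uhr):
gtfToGenePred -genePredExt -geneNameAsName2 refs/22.gtf refs/22_extended.genePred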
Taking a glance at the 22_extended.genePred file, we see that the first column has the transcript IDs, followed by strand and genomic coordinate information. The 12th column has the gene names, so columns 1 and 12 are what we want. The genePred format stands for gene prediction, which is exactly what it describes: information about genes and their transcripts.
Here, we use cut to extract column 1 (transcript ID) and column 12 (gene name) of the
genePred file and save it to 22_transcript_to_gene.txt.
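A minimal sketch (writing the mapping file to the current hbr_uhr folder is an assumption):
cut -f 1,12 refs/22_extended.genePred > 22_transcript_to_gene.txt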
ENST00000615943.1 U2
ENST00000618365.1 CU459211.1
ENST00000623473.1 CU104787.1
ENST00000624155.1 BAGE5
ENST00000422332.1 ACTR3BP6
ENST00000612732.1 5_8S_rRNA
ENST00000614148.1 AC137488.1
ENST00000614087.1 AC137488.2
ENST00000621672.1 CU013544.1
Note that some genes may appear more than once because they can have multiple transcript products; SLC25A17 is an example.
ENST00000263255.10 SLC25A17
ENST00000491545.5 SLC25A17
ENST00000435456.6 SLC25A17
ENST00000544408.5 SLC25A17
ENST00000402844.7 SLC25A17
ENST00000447566.5 SLC25A17
ENST00000420970.5 SLC25A17
ENST00000430221.5 SLC25A17
ENST00000427084.5 SLC25A17
ENST00000458600.5 SLC25A17
ENST00000443810.5 SLC25A17
ENST00000412879.5 SLC25A17
ENST00000426396.5 SLC25A17
ENST00000434193.5 SLC25A17
ENST00000478550.1 SLC25A17
ENST00000449676.5 SLC25A17
ENST00000434185.1 SLC25A17
Now let's sort 22_transcript_to_gene.txt, which contains our transcript ID to gene name
mapping by transcript ID and save it as 22_transcript_to_gene_id_sorted.txt to denote that it is
sorted.
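A minimal sketch:
sort -k 1,1 22_transcript_to_gene.txt > 22_transcript_to_gene_id_sorted.txt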
Finally, we can change back into the salmon output directory (cd salmon) and use the paste command to join the sorted transcript ID to gene name mapping file (22_transcript_to_gene_id_sorted.txt) with the sorted differential expression analysis results (results_id_sorted.txt), saving the output as results_with_gene_names.txt.
cd salmon
The input for the paste command below are the two files that we want to paste together.
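A sketch of the paste command (the path to the mapping file assumes it was written to the hbr_uhr folder one level up; adjust if you saved it elsewhere):
paste ../22_transcript_to_gene_id_sorted.txt results_id_sorted.txt > results_with_gene_names.txt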
Again, to view results_with_gene_names.txt we can use the column command (note that in the
column command below, we did not specify the column separator because they are already
separated by tabs)
We should also insert column headers. To do this, use the sed command where, inside the single quotes, "1i" tells sed to insert the text that follows as the first line. Because results_with_gene_names.txt is tab separated, we need to separate the column headers with tabs using "\t" after each heading.
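A sketch of the sed insert, using bash $'...' quoting so the \t become real tabs; the header names here are placeholders and should be replaced with the actual columns of your table:
# hypothetical headers shown for illustration only
sed $'1i transcript_id\tgene_name\tbaseMean\tlog2FoldChange\tPValue\tFDR' results_with_gene_names.txt > results_with_gene_names_header.txt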
The sed command is known as a stream editor. It has many functions including printing of specific lines, deletion of lines, substitutions, and insertion of lines.
Below, we see the transcripts derived from SLC2A11. This is where classification based
analysis is beneficial - it allows us to look at transcript level expression differences between
conditions: even though we are talking about the same gene, different transcript isoforms may be expressed under different conditions or between tissue types. On the other
hand, in alignment based analysis, we are aligning everything to the genome so we are getting
aggregate gene level expression information.
A drawback to the classification based approach is that we are mapping to a known database
of transcripts. This may prevent us from discovering expression of novel splice isoforms.
As before, we generate an expression heatmap with the create_heatmap.r script and copy it to ~/public for viewing.
Rscript $CODE/create_heatmap.r
cp heatmap.pdf ~/public/hbr_uhr_heatmap_salmon.pdf
To generate the PCA plot, we need the tab separated version of the design file, which we can copy over from the ~/design_file_salmon folder. Since we are in ~/biostar_class/hbr_uhr/salmon, we can use "." to denote copying to here, our present working directory.
cp ~/design_file_salmon/design.txt .
Let's use cat to view design.txt. Again, the difference between design.csv and design.txt is that
the columns are separated by commas in the csv file and by tabs in the txt file.
cat design.txt
sample condition
HBR_1_SALMON HBR
HBR_2_SALMON HBR
HBR_3_SALMON HBR
UHR_1_SALMON UHR
UHR_2_SALMON UHR
UHR_3_SALMON UHR
To generate the PCA plot, we will also need a tab separated version of the counts table, which we can generate with cat and tr as we did for the alignment based analysis. Then, we run the create_pca.r script like we did with the alignment based analysis method.
cp pca.pdf ~/public/hbr_uhr_pca_salmon.pdf
Objectives
1. Determine potential next steps following differential expression analysis.
2. Tour geneontology.org and understand the three main ontologies.
3. Learn about different methods and tools related to functional enrichment and pathway
analysis.
4. Get familiar with databases commonly used by popular functional enrichment tools.
You now have a potentially large list of differentially expressed genes. Now what? If you are like
most biologists, you are interested in understanding these genes within their biological context.
To do that, we can examine gene ontology and perform some type of functional enrichment
analysis or pathway analysis.
These types of analyses exploit the use of gene sets, and not all gene sets represent a pathway. Gene sets are collections of genes "formed on the basis of shared biological or functional properties as defined by a reference knowledge base."
Whereas a pathway is not a simple list of genes, but rather includes an interaction component usually related to a specific mechanism, process, etc.
The Gene Ontology (GO) provides a framework and set of concepts for describing
the functions of gene products from all organisms. --- https://www.ebi.ac.uk/ols/ontologies/go (https://www.ebi.ac.uk/ols/ontologies/go).
There are two parts to the gene ontology: (Check out https://www.youtube.com/watch?v=6Am2VMbyTm4 (https://www.youtube.com/watch?v=6Am2VMbyTm4) for a more detailed overview)
1. the ontology (the GO terms and their hierarchical relationship) - form a directed, acyclic
graph structure (nodes = GO terms, edges = relationships)
2. the annotations (the annotated genes linked to various GO terms)
What is a GO term?
• GO terms provide information about a gene product
• GO terms as a vocabulary are species agnostic, but there are species constraints
• ontology and annotations are updated regularly
• computer readable - suitable for bioinformatics
GO integrates information about gene product function in the context of three domains:
1. molecular function (F) - "the molecular activities of individual gene products" (e.g., kinase)
2. cellular component (C) - "where the gene products are active" (e.g., mitochondria)
3. biological process (P) - "the pathways and larger processes to which that gene product's activity contributes" (e.g., transport)
Check out geneontology.org.
Over representation analysis (ORA) determines which pathways are over or under represented by asking "are there more annotations in the gene list than expected?"
GSEA
Pathway Topology
ORA and FCS discard a large amount of information. These methods use gene sets, and even if
the gene sets represent specific pathways, structural information such as gene product
interactions, positions of genes, and types of genes is completely ignored. Pathway topology
methods seek to rectify this problem.
Some examples:
• Impact analysis (implemented in Advaita's iPathwayGuide) "constructs a mathematical model that captures the entire topology of the pathway and uses it to calculate a perturbation for each gene. Then, these gene perturbations are combined into a total perturbation for the entire pathway and a p-value is calculated by comparing the observed value with what is expected by chance." (https://advaitabio.com/ipathwayguide/more-accurate-pathway-rankings-using-impact-analysis-instead-of-enrichment/)
• Other tools include Pathway-Express, SPIA, NetGSA, etc. (See Nguyen et al. 2019
(https://genomebiology.biomedcentral.com/articles/10.1186/s13059-019-1790-4) for a
review of PT methods.)
Other databases
There are many databases devoted to relating genes and gene products to pathways,
processes, and other phenomena. Again, the following is not meant to be a comprehensive list.
KEGG (https://www.genome.jp/kegg/)
• curated database
• biological pathways
• molecular interaction networks
• Very nice pathway maps
• Restricted licenses
Reactome (https://reactome.org/)
PANTHER (http://www.pantherdb.org/pathway/)
WikiPathways (https://www.wikipathways.org/index.php/WikiPathways)
NDEx (https://home.ndexbio.org/index/)
Pathguide (http://pathguide.org/?
organisms=all&availability=all&standards=all&order=alphabetic&DBID=none)
Note: Genome builds will have differences in the names and coordinates of genomic features,
which will impact gene ID conversions. See this tutorial (https://github.com/hbctraining/Training-
modules/blob/master/DGE-functional-analysis/lessons/Genomic_annotations.md) from the
Harvard Chan Bioinformatics Core.
Resources:
• Functional enrichment and comparison with R (https://alexslemonade.github.io/refinebio-
examples/03-rnaseq/pathway-analysis_rnaseq_01_ora.html) .
• ClusterProfiler, pathview, and good introductory information (https://github.com/
hbctraining/Training-modules/blob/master/DGE-functional-analysis/lessons/
functional_analysis_2019.md)
• Article on the impact of the evolving GO (https://www.nature.com/articles/
s41598-018-23395-2)
• Ten Years of Pathway Analysis: Current Approaches and Outstanding Challenges, PLOS Computational Biology, 2012 (https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1002375)
• Pathway enrichment analysis and visualization of omics data using g:Profiler, GSEA,
Cytoscape and EnrichmentMap (https://www.nature.com/articles/s41596-018-0103-9)
• Toward a gold standard for benchmarking gene set enrichment analysis, Briefings in
Bioinformatics, 2021 (https://academic.oup.com/bib/article/22/1/545/5722384)
• Enrichment analysis resource list from UCSF (https://guides.ucsf.edu/bistats/pathenrich)
Lesson 17 review
In the previous class, we got an overview of functional and pathway analysis, which helps to put RNA sequencing results into biological context by informing us of the biomolecular pathways, biological functions, cellular localization, etc. of the genes in our study. We were also introduced to tools that can help us perform these analyses.
Learning objectives
This lesson will provide an overview of the Database for Annotation, Visualization and Integrated Discovery (DAVID) (https://david.ncifcrf.gov/home.jsp). We will learn how to submit a gene list to DAVID and explore the functional annotation results it returns.
Background on DAVID
This tool was created and is maintained by the Laboratory of Human Retrovirology and
Immunoinformatics (https://frederick.cancer.gov/research/laboratory-human-retrovirology-and-
immunoinformatics) at Frederick National Lab.
DAVID is used for functional analysis. Given an input gene list, DAVID will inform us of the
following.
• Whether genes in an input gene list are associated with diseases and links out to
resources such as NCBI's MedGen (https://www.ncbi.nlm.nih.gov/medgen/C2931781)
• Molecular functions that genes perform
• Biological pathways in which genes participate
• Other annotations (ie. cellular location, tissue expression, etc.) that the genes map to
DAVID compares the overlap of a user provided gene list with an annotation to the overlap of a background gene list with the same annotation. Thus, DAVID uses the Fisher exact test to determine whether the overlap of the user input with a particular annotation is statistically different from what we would observe in the background. See Table 2 in Huang et al., Nature Protocols 2009 (https://www.nature.com/articles/nprot.2008.211/tables/2) for more information on the background gene set; essentially, the default background of genome-wide genes is appropriate for studies that involve a genome-wide survey. DAVID also provides custom background gene sets, and users can specify their own.
Caution
"...make sure that the genes in your list are found in the background set that you have selected
in DAVID otherwise, DAVID will ignore them." -- DAVID FAQ (https://david.ncifcrf.gov/
content.jsp?file=FAQs.html#14)
DAVID performs over representation analysis (ORA) at its core, which aims to find enriched
molecular functions, pathways, or other annotations represented by the input gene list. In other
words, many genes in the list map onto those molecular functions, pathways, or annotations.
With DAVID, we are essentially looking at contingency tables (Figure 1). The example in Figure
1 shows the number of user input genes and background genes (selected from the whole
genome) that fall onto a particular pathway (ie. p53 signaling). However, how certain can we be that the number of user input genes that map to the pathway is not observed simply by random chance? In other words, do user input genes fall onto a pathway more often than expected compared to the background? DAVID uses the Fisher exact test to help us decide whether what we are observing is due to chance.
Figure 1: Contingency table showing the number of user input genes and background genes
from the genome that fall onto a certain pathway. DAVID help documentations (https://
david.ncifcrf.gov/helps/functional_annotation.html#bonfer)
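As a small illustration of the underlying calculation (not DAVID's actual code; the counts below are made up for demonstration), suppose 3 of 300 input genes and 37 of the remaining 29,663 background genes map to a pathway. The Fisher exact test p-value for that 2x2 contingency table can be computed with a one-line R call:
Rscript -e 'fisher.test(matrix(c(3, 297, 37, 29663), nrow = 2))'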
Below are some resources for you to learn about or review this statistical procedure.
Pathway Commons also provides a statistics primer that discusses those methods that are
relevant to pathway analysis.
A problem that arises with enrichment analysis is the need to perform multiple statistical tests across many gene sets. In short, type I errors, or false positives, increase as the number of tests performed increases -- Pathway Commons multiple testing (https://www.pathwaycommons.org/guide/primers/statistics/multiple_testing/). Users can choose either Bonferroni, Benjamini-Hochberg, or False Discovery Rate (FDR) to correct for multiple testing.
Reducing Redundancy
"Due to the redundant nature of annotations, Functional Annotation Chart presents similar/
relevant annotations repeatedly. It dilutes the focus of the biology in the report. To reduce the
redundancy, the Functional Annotation Clustering report groups/displays similar annotations
together which makes the biology clearer and more focused..." -- DAVID help documents
(https://david.ncifcrf.gov/helps/functional_annotation.html#summary)
"The goal of DAVID's design is to be able to efficiently upload and analyze a list consisting of
<=3000 genes. All DAVID tools have been tested with lists in this range and should return
results in a few seconds to no more than a few minutes. If running time is longer than a few
minutes, please contact the DAVID Bioinformatic Team for help. Please note that Functional
Annotation Clustering and Gene Functional Classification have a 3000 gene limit." -- DAVID FAQ
(https://david.ncifcrf.gov/helps/FAQs.html)
Getting help
Click on the Start Analysis button to initiate an analysis; this will take us to the Analysis Wizard.
Supplying input
1. provide our input gene list (either copy paste or upload as a text file)
2. specify gene identifier type (gene identifiers could be gene symbol, Ensembl IDs, Entrez
IDs, Genbank IDs, Refseq IDs, etc.)
3. specify whether we are providing an input gene list or background gene list
4. submit the list for analysis
Here, we are going to provide the genes that are upregulated in the UHR sample with respect to
the HBR samples. These genes were obtained by filtering the differential expression table for
log2 fold change ≥ 1 and false discovery rate of ≤ 0.05. The genes are in the file
hbr_uhr_deg_chr22_up_genes.txt.
Step 1: After attaching hbr_uhr_deg_chr22_up_genes.txt as the input gene list in the DAVID
Analysis Wizard, choose OFFICIAL_GENE_SYMBOL as the identifier type and then specify
Homo sapiens in the Select species box that appears. Next, specify that we are providing an
input gene list and then click on Submit List.
Step 2: After submitting the gene list, DAVID will tell us that we have successfully submitted the
gene list and that we are using the Homo sapiens genome as background. We can then select which analysis tool we would like to run. Notice that there is a Gene ID Conversion Tool. This is used to
convert the input gene list to a set of IDs that are recognized by DAVID in the event that we do
not know or DAVID does not recognize the identifier type in our input.
Gene ID conversion - specify ID type to convert to: We have an option to choose a range of IDs
to convert our gene list to but in this example, we have chosen the ENSEMBL_GENE_ID.
Remember to specify the species where our genes came from (Homo sapiens in this case).
When ready, hit Submit to Conversion Tool.
Gene ID conversion: Once we hit Submit to Conversion Tool, we will be taken to the page below.
We can choose to convert all genes or convert each gene individually. Some of the genes may
not be in the DAVID database, so the Gene ID conversion tool will not be able to convert those.
We would need to do some data wrangling to find identifiers for those genes not in the DAVID
database.
Gene ID conversion - send converted IDs back to input: After conversion, we can send the new
list back to DAVID either as input or background. Here, we will submit as input.
Gene ID conversion - name the converted ID list: DAVID will give us the option to name the converted gene list. We will name it hbr_uhr_deg_chr22_up. Note that the Gene ID Conversion
Tool was opened in a separate tab. After we submit the converted gene list, go back to the
Analysis Wizard tab to continue with the analysis.
Once we have submitted our gene list for analysis, DAVID takes us to the Annotation Summary
Results page. This page confirms the name of our gene list (hbr_uhr_deg_chr22_up) and the
background gene list that we are using (Homo sapiens). We can navigate the Annotation
Summary Results page to obtain different insights to our data.
Disease Annotations
One of the first insights is that DAVID informs us whether our input genes play a role in diseases. DAVID pulls disease annotations from different sources. Clicking on the Chart button will take us to a chart view showing the disease records from a given database to which our input genes map.
• DISGENET (https://www.disgenet.org)
• GAD_DISEASE
• OMIM (https://www.omim.org)
In the chart view, we are presented with the disease(s) found in a particular database that our input genes map to. Clicking on one of the disease terms sends us to NCBI's MedGen.
In the chart view shown above, we clicked on Malignant neoplasm of the breast, and this took
us to the corresponding record in NCBI's MedGen. MedGen is NCBI's database that contains
organized information pertaining to human gene-disease relationships.
If we click on the blue bar next to the chart button, we will be taken to a gene view of the
disease terms.
The gene view lists disease(s) that the gene may play a role in.
The chart view for some annotation databases such as OMIM will not have records. This is
because there were no diseases that met the statistical threshold.
However, we can still click on the gene view to see what disease individual genes may play a
role in.
For diseases that are annotated in OMIM, clicking on the corresponding link in the gene view
will take us to the OMIM record.
Here, we clicked on Meningioma and were taken to the OMIM record for this disease, where we see MN1 as one of the genes involved in the disorder.
Pathways:
We see similar organization of information for other annotations. For instance, DAVID pulls
biomolecular pathway information from several sources such as KEGG.
Clicking on the chart view for KEGG pathways, we can see which pathways within this database our input genes participate in. The column labeled RT denotes related terms (ie. related pathways). Under the Genes column, we can click on the blue bar to view the genes that map to a pathway. The Count column tells us how many genes in our list participate in a particular pathway.
Clicking on a pathway under the Term column in Figure 21 takes us to the pathway record in KEGG or whichever database we are viewing. The input genes that participate in a pathway are labeled with a blinking red star.
Chart Report is an annotation term focused view which lists annotation terms and
their associated genes under study. -- DAVID help documents (https://
david.ncifcrf.gov/helps/functional_annotation.html#summary)
Clicking on the Functional Annotation Chart button, we are taken to a page that lists all of the functional annotations that our input genes map onto. Note that for our gene list (hbr_uhr_deg_chr22_up), we get 133 chart records by default.
Adding/removing records from the Functional Annotation Chart: Remember that there are boxes we can check if we expand the annotation categories. If we check the boxes corresponding to DISGENET, GAD_DISEASE, and GAD_DISEASE_CLASS, and view the Functional Annotation Chart again, we see that we have a few additional annotation records. Thus, what we see in the Functional Annotation Chart is customizable.
Functional annotation clustering works to cluster annotations that share similar genes. If we click
on Functional Annotation Clustering in the Annotation Summary Results page then we can see
the functional annotation clusters that our input genes map to.
We can fine tune how DAVID clusters the annotations using the parameters below.
• "Similarity Threshold (any value between 0 to 1; default = 0.35): The minimum kappa
value to be considered significant. A higher setting will lead to more genes going
unclustered, which leads to a higher quality functional classification result with fewer
groups and fewer gene members. Kappa value of 0.3 starts giving meaningful biology
based on our genome-wide distribution study. Anything below 0.3 has a good chance to
be noise." -- DAVID functional classification documentations (https://david.ncifcrf.gov/
helps/functional_classification.html)
• "Initial Group Members (any value ≥ 2; default = 4): The minimum gene number in a
seeding group, which affects the minimum size of each functional group in the final
cluster. In general, the lower value attempts to include more genes in functional groups,
and may generate a lot of small size groups." -- DAVID functional classification
documentations (https://david.ncifcrf.gov/helps/functional_classification.html)
• "Final Group Members (any value ≥ 2; default = 4): The minimum gene number in one
final group after a 'cleanup' procedure. In general, the lower value attempts to include
more genes in functional groups and may generate a lot of small size groups. It
cofunctions with previous parameters to control the minimum size of functional groups. If
you are interested in functional groups containing only 2 or 3 genes, you need to set it to
a very low value. Otherwise, the small group will not be displayed and the genes will go
unclustered." -- DAVID functional classification documentations (https://david.ncifcrf.gov/
helps/functional_classification.html)
• "Multi-linkage Threshold (any value between 0% to 100%; default = 50%): This parameter
controls how seeding groups merge with each other, i.e. two groups sharing the same
gene members over the percentage will become one group. A higher percentage, in
general, gives sharper separation (i.e. it generates more final functional groups with more
tightly associated genes in each group). In addition, changing the parameter does not
cause additional genes to go unclustered." -- DAVID functional classification
documentations (https://david.ncifcrf.gov/helps/functional_classification.html)
Provides a gene-centric view which lists the genes and their associated annotation
terms... -- DAVID help documents (https://david.ncifcrf.gov/helps/
functional_annotation.html#summary)
DAVID can also generate gene clusters, where the genes that cluster together share common annotations.
We can view the cluster information as a heatmap, where green represents an association between a gene and an annotation and black represents no reported association. On the bottom horizontal axis, we have the annotation names; on the right vertical axis, we have the gene names.
Lesson objective
In this lesson, we will continue to learn about pathway analysis using the Qiagen Ingenuity Pathway Analysis (IPA) package. This class will be taught by a Qiagen field application scientist.
Example data
For the lesson, we will work with the differential expression analysis results from the human brain reference (HBR) and universal human reference (UHR) samples.
For the practice session, we will use the differential expression analysis results from the hcc1395 dataset.
Accessing IPA
To access IPA, see the BTEP Bioinformatics Resources for CCR scientists (https://
btep.ccr.cancer.gov/docs/resources-for-bioinformatics/software/ipa/) website.
Course Wrap-up
This lesson concludes the Bioinformatics for Beginners course series. Please email us any time
at [email protected] for help with your bioinformatics questions or concerns.
Lesson Objectives
1. Short course overview.
2. Review BTEP and course resources.
3. Learn how to login and access the Biostars module.
4. Discuss upcoming BTEP courses
5. Q & A
RNA-Seq overview
Review of resources
BTEP Resources pages
BTEP seeks to inform and empower researchers, so that you can ultimately tackle some data
analyses on your own. We offer a number of resources to accomplish this goal. Take a look at
our resources documentation (https://btep.ccr.cancer.gov/docs/resources-for-bioinformatics/) to
get an idea of the many bioinformatics resources available to help you reach your analysis and
training objectives.
For this course, there are additional resources worthy of note under the Additional Resource
tab, including Further Readings and Tutorials, instructions for Logging into Biowulf, instructions
for Accessing the Biostar Handbook, and instructions on Using the Biostars Module for Biowulf.
As a reminder, when you registered for the course, you also gained a 6 month subscription to
the Biostar Handbook Collection (https://www.biostarhandbook.com/index.html) . This is a
fantastic resource covering a range of bioinformatics topics, not just RNA-Seq. At a minimum, consider downloading the book volumes, available as PDFs, before your subscription expires.
Biostars on Biowulf
For your convenience, we have created a module on Biowulf that includes many of the same
programs in the bioinfo environment from The Biostar Handbook. Instructions for using this
module can be found at Additional Resources. Let's briefly review some key points about
Biowulf and then take a look at using the Biostars module.
All NIH employees in the NIH Enterprise Directory (NED) are eligible for a Biowulf account.
There is a charge of $35 per month associated with each account, which is pretty nominal. The
instructions for obtaining an account are here (https://hpc.nih.gov/docs/accounts.html).
Accessing Biowulf
Once you have your Biowulf account, you can connect remotely to Biowulf using the secure shell (SSH) protocol. If you are on a Mac, open the Terminal application. If you are using Windows 10 or later, you can use ssh from PowerShell or the Command Prompt. If this fails, consider installing PuTTY (http://www.chiark.greenend.org.uk/~sgtatham/putty/download.html) and using PuTTY.
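For example, from a terminal, a typical connection command looks like the following, where username is a placeholder for your NIH username:
ssh [email protected]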
The login node will be used to submit jobs to run on the compute nodes that make up Biowulf or
to request an interactive job on a compute node. It can also be used for editing / compiling
code, file management on a small scale, and file transfer on a small scale.
If this is your first time logging into Biowulf, you will see a warning statement with a yes/no
choice. Type “yes”. Type in your password at the prompt. NOTE: The cursor will not move as
you type your password!
For more information and more detailed training documentation, see hpc.nih.gov/training/
(https://hpc.nih.gov/training/). Also, the hpc user guides are important resources. Most of your
initial questions can be answered simply by referring to and reading the user guides (https://
hpc.nih.gov/docs/user_guides.html).
To see available modules
module avail
module avail [appname|string|regex]
module -d
To load a module
module load appname
To list loaded modules
module list
To unload modules
module unload appname
Note: you may also create and use your own modules.
Let's see how we can use the Biostars module to work with course materials.
First, as we have already logged into Biowulf, we need to get an interactive session.
1. Use sinteractive to work on an interactive node. This will result in 4GB of memory and
2 CPUs.
Note: If you are planning to use the sratoolkit to download data from the SRA, you will
need to allocate local scratch space (sinteractive --gres=lscratch:30).
source /data/classes/BTEP/apps/biostars/1.0/run_biostars.sh
Note: If you want to use the biostars module for other purposes or you want to submit jobs via
sbatch, skip Step 2. You can load the module with the following:
ls -l $DATA
ls -l $CODE
Lesson 2 Practice
The instructions that follow were designed to test the skills you learned in Lesson 2. Thus, the
primary focus will be navigating directories and manipulating files.
1. Let's navigate our files using the command line. Begin in your home directory.
{{Sdet}}
Solution{{Esum}}
cp -r /data/Practice_Sessions/Practice_L2 ~
{{Edet}}
{{Sdet}}
Solution{{Esum}}
cd ./Practice_L2
{{Edet}}
4. List the contents of Practice_L2. If there are files, when were they last modified and
what is the file size?
{{Sdet}}
Solution{{Esum}}
ls -lh
The -l flag lists the directory contents in long format, which includes the date each file was last modified; the -h flag displays the file sizes in human-readable units.
{{Edet}}
{{Sdet}}
Solution{{Esum}}
mv sample_names.txt treatment_groups.txt
{{Edet}}
{{Sdet}}
Solution{{Esum}}
mkdir Analysis
{{Edet}}
{{Sdet}}
Solution{{Esum}}
cp treatment_groups.txt ./Analysis
{{Edet}}
{{Sdet}}
Solution{{Esum}}
ls -l ./Analysis
{{Edet}}
{{Sdet}}
Solution{{Esum}}
cd Analysis
pwd
{{Edet}}
{{Sdet}}
Solution{{Esum}}
cd ../data
{{Edet}}
{{Sdet}}
Solution{{Esum}}
ls
{{Edet}}
{{Sdet}}
Solution{{Esum}}
less A_R1.fastq
{{Edet}}
13. Copy and paste the first line of A_R1.fastq into a new file called Line1.txt using
keyboard shortcuts and nano.
{{Sdet}}
Solution{{Esum}}
mv Line1.txt ..
{{Edet}}
{{Sdet}}
Solution{{Esum}}
cd ..
{{Edet}}
{{Sdet}}
Solution{{Esum}}
rm -i -r Analysis
{{Edet}}
Lesson 4 Practice
For today's practice, we are going to embark on a Unix treasure hunt created by the Sanders
Lab (https://sanderslab.github.io/code/) at the University of California San Francisco. Note: the
treasure hunt materials can be obtained directly from the Sanders lab code repository linked
above.
To begin, create a directory called treasure_hunt in your home directory and run the perl script in /data/Practice_Sessions/Practice_L4 from the treasure_hunt directory.
{{Sdet}}
Solution{{Esum}}
mkdir treasure_hunt
cd treasure_hunt
perl /data/Practice_Sessions/Practice_L4/treasureHunt_v2.pl
ls -l
{{Edet}}
Recommendation: Create an environment variable to store the path to the treasure hunt
directory to facilitate movement through the directory.
{{Sdet}}
Solution{{Esum}}
THUNT=`pwd`
echo $THUNT
{{Edet}}
1. How many words are in the last line of the file containing the treasure?
{{Sdet}}
Solution{{Esum}}
tail -n 1 openTheBox.txt | wc -w
{{Edet}}
2. Save the last line to a new file called finallyfinished.txt without copying and pasting.
{{Sdet}}
Solution{{Esum}}
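One possible command, assuming the treasure file is openTheBox.txt as in the previous solution:
tail -n 1 openTheBox.txt > finallyfinished.txt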
{{Edet}}
3. Now append the first line to the same file to which you just saved the last line.
{{Sdet}}
Solution{{Esum}}
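One possible command, again assuming the treasure file is openTheBox.txt:
head -n 1 openTheBox.txt >> finallyfinished.txt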
{{Edet}}
Congratulations! You have found the treasure and have gained some useful unix practice
throughout your hunt.
Lesson 5 Practice
The following can be used to practice skills learned in Lesson 5.
Login to Biowulf
If you are already logged in, exit the remote connection and reconnect. Remember, you must
be on the NIH network.
{{Sdet}}
Solution{{Esum}}
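One possibility, where username is a placeholder for your NIH username:
exit
ssh [email protected]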
{{Edet}}
Start a new script, named fastqc.sh in the same directory in which you downloaded data
from Lesson 5.
mkdir fastqc
fastqc -o ./fastqc/ -t 4 *.fastq
This command will output the fastqc results to a directory named fastqc inside the current
working directory. It will also run using four threads and will run on all fastq files in your working
directory.
You need to edit this script in order to submit it as a job on Biowulf. What is missing?
{{Sdet}}
Solution{{Esum}}
#!/bin/bash
#SBATCH --cpus-per-task=4
{{Edet}}
{{Sdet}}
Solution{{Esum}}
sbatch fastqc.sh
{{Edet}}
How can we check on our job? What is the job's status? How much memory is it using?
{{Sdet}}
Solution{{Esum}}
squeue -u $USER
sjobs -u $USER
{{Edet}}
How do we cancel the job?
{{Sdet}}
Solution{{Esum}}
scancel job-id
where job-id is the id of the job. Check the output of squeue -u $USER if you are unsure
what the job id is.
{{Edet}}
Let's get an interactive session and see how we can use the module.
sinteractive
Also, we have created a script to set up your shell and load the module; use
source /data/classes/BTEP/apps/biostars/1.0/run_biostars.sh
This creates a $DATA environment variable holding the path to many of the files from class. We
try to update this, but please let us know if something is missing.
ls $DATA
fastqc -h
Lesson 6 Practice
The following was designed to practice skills learned in lesson 6.
{{Sdet}}
Solution{{Esum}}
mkdir Lesson6_practice
cd Lesson6_practice
{{Edet}}
{{Sdet}}
Solution{{Esum}}
You can download the accession list directly from the SRA Run Selector.
esearch -db sra -query PRJEB37445 | efetch -format runinfo | cut -f 1 -d ',' |s
{{Edet}}
Navigate to the ENA. How might you go about downloading the data?
Lesson 7 Practice
In Lesson 7, you learned how to download and work with archived and compressed files. To
practice what you have learned, we will use the ERCC spike in control data, which Istvan Albert,
creator of the Biostar Handbook, has reframed as the "Golden Snidget", "a magical golden bird
with fully rotational wings."
{{Sdet}}
Solution{{Esum}}
mkdir golden
cd golden
{{Edet}}
{{Sdet}}
Solution{{Esum}}
man wget
or
wget --help
-nc stands for "no clobber", which keeps wget from downloading and overwriting an existing
file of the same name.
{{Edet}}
{{Sdet}}
Solution{{Esum}}
{{Edet}}
Just for fun, let's re-archive and compress the data we just unpacked, naming it funtimes.ref.tar.gz. How might we do this using tar? Check the help information. Alternatively, try Google.
{{Sdet}}
Solution{{Esum}}
tar --help
tar -czvf funtimes.ref.tar.gz refs
{{Edet}}
{{Sdet}}
Solution{{Esum}}
{{Edet}}
{{Sdet}}
Solution{{Esum}}
ls -lh
{{Edet}}
{{Sdet}}
Solution{{Esum}}
gzip reads/*.fq
{{Edet}}
Lesson 9 Practice
Objectives
In this practice session, we will apply our knowledge to
• learn about the reference genome and annotation file for the Golden Snidget dataset
• visualize the Golden Snidget genome using the Integrative Genome Viewer (IGV) -
instructor will demo this and you can practice on your own after getting IGV installed.
{{Sdet}}
Solution{{Esum}}
pwd
If you are not in the ~/biostar_class folder, then change into it.
cd ~/biostar_class
{{Edet}}
Next, we will create the directory snidget within the ~/biostar_class folder. Take a moment to see
if you can do this.
{{Sdet}}
solution{{Esum}}
mkdir snidget
{{Edet}}
{{Sdet}}
solution{{Esum}}
cd snidget
{{Edet}}
Where is my data?
The Golden Snidget reference genome is located at http://data.biostarhandbook.com/books/
rnaseq/data/golden.genome.tar.gz. Can you download and extract?
{{Sdet}}
Solution{{Esum}}
Download
wget http://data.biostarhandbook.com/books/rnaseq/data/golden.genome.tar.gz
OR
curl http://data.biostarhandbook.com/books/rnaseq/data/golden.genome.tar.gz -o
Unpack
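One way to unpack the archive:
tar -xzvf golden.genome.tar.gz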
{{Edet}}
{{Sdet}}
Solution{{Esum}}
ls -l
total 60
-rw-rw-r-- 1 joe joe 57462 Feb 5 2020 golden.genome.tar.gz
drwxrwxr-x 1 joe joe 70 Oct 25 00:06 refs
In addition to the golden.genome.tar.gz file, we have a refs folder. The refs folder contains the
reference genome (genome.fa), reference transcriptome (transcriptome.fa), and annotations
(features.gff) for the Golden Snidget.
ls refs
{{Edet}}
{{Sdet}}
Solution{{Esum}}
Download
wget http://data.biostarhandbook.com/books/rnaseq/data/golden.reads.tar.gz
OR
curl http://data.biostarhandbook.com/books/rnaseq/data/golden.reads.tar.gz -o g
Unpack
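One way to unpack the archive:
tar -xzvf golden.reads.tar.gz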
{{Edet}}
{{Sdet}}
Solution{{Esum}}
ls -l
We see the two tar.gz files that were downloaded and a new folder called reads.
total 117384
-rw-rw-r-- 1 joe joe 57462 Feb 5 2020 golden.genome.tar.gz
-rw-rw-r-- 1 joe joe 120138017 Oct 25 00:18 golden.reads.tar.gz
drwxrwxr-x 1 joe joe 336 Oct 25 00:19 reads
drwxrwxr-x 1 joe joe 70 Oct 25 00:06 refs
The reads folder contains the FASTQ (fq) files for this dataset. We will be working with these in
the next lesson.
ls reads
{{Edet}}
{{Sdet}}
Solution{{Esum}}
cd refs
{{Edet}}
How many bases are in the Golden Snidget genome (ie. what is the genome size for the Golden
Snidget)?
{{Sdet}}
Solution{{Esum}}
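One possibility, using seqkit (run from the refs folder); the sum_len column reports the total number of bases in the genome:
seqkit stats genome.fa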
{{Edet}}
{{Sdet}}
Solution{{Esum}}
{{Edet}}
Note: if you grep for > in Unix, be sure to put quotes around it.
Is there an alternative way to get the number of transcripts in the Golden Snidget (ie. without
using seqkit)?
{{Sdet}}
Solution{{Esum}}
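One possible command, counting the FASTA header lines in the transcriptome file:
grep -c ">" transcriptome.fa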
92
{{Edet}}
The goal of the Golden Snidget RNA sequencing experiment is to find genes that are differentially expressed when the Golden Snidget is EXCITED compared to when it is BORED. Looking at the transcript names, what can you tell about a particular transcript in the EXCITED and BORED states? What would you expect the differential gene expression analysis to tell us when we get to this later on in the course? You will need to take a look at the features.gff file for this.
{{Sdet}}
Solution{{Esum}}
less features.gff
Look at the gene or transcript names in the last column of the annotations file (gene names and transcript names are the same in this dataset). Take for example AAA-750000-UP-4: the transcript name is telling us the expected direction of change (UP) and the approximate expected fold change (4) between the two states, which is what the differential expression analysis should recover later in the course.
{{Edet}}
Visualizing the Golden Snidget genome and transcriptome in IGV
Let's open IGV locally on our computer. Then we will copy the Golden Snidget refs folder to our
public directory so we can download and use these locally. Remember the location on your
computer to which the files were downloaded. See if you can remember how to copy the
Golden Snidget reference to the public directory. Hint, it may be easier to do this from the ~/
biostar_class/snidget directory.
{{Sdet}}
Solution{{Esum}}
cd ~/biostar_class/snidget
cp -r refs ~/public
{{Edet}}
After we have successfully copied the refs folder to public, click to open it and right click on
each of the files and click
• "Save link as" (if on Google Chrome or Firefox) - include the appropriate file extension
when saving
• "Download Linked File as" (if on Safari)
The first step in using IGV is to load our reference genome. Take some time to see if you recall
how to do this.
{{Sdet}}
Solution{{Esum}}
In IGV, go to Genomes -> Load Genome from File... and select the genome.fa file that we downloaded to our local computer.
{{Edet}}
After loading the genome, let's view the transcripts in IGV and see how they line up in the
genome. Take a moment to see if you recall how to do this.
{{Sdet}}
Solution{{Esum}}
In IGV, go to File -> Load from File..., then choose the features.gff file; the result will look like the image below.
{{Edet}}
Take some time to explore IGV (zoom in, search for a transcript, pan around, ...).
Lesson 10 Practice
Objectives
In this lesson, we introduced the structure of the FASTQ file and learned to assess quality of raw
sequencing data using FASTQC. Here, we will practice what we learned using the Golden
Snidget dataset.
Where is my data?
Recall that the Golden Snidget data resides in ~/biostar_class/snidget folder. Can you change
into the folder and find where the sequencing reads are (ie. in which folder they are located)?
{{Sdet}}
Solution{{Esum}}
cd ~/biostar_class/snidget
cd reads
{{Edet}}
How many FASTQ files are in the reads folder?
{{Sdet}}
Solution{{Esum}}
ls
12
{{Edet}}
From the names of the FASTQ (fq) files, are these from paired or single end sequencing?
{{Sdet}}
Solution{{Esum}}
Paired
{{Edet}}
Can you find the first sequencing read in the file BORED_1_R1.fq? If you can, can you identify
the sequencing header line and the quality score line?
{{Sdet}}
Solution{{Esum}}
head -4 BORED_1_R1.fq
{{Edet}}
From what you know about the structure of FASTQ files, how many reads are in
BORED_1_R1.fq? There are two ways you can find out.
{{Sdet}}
Solution{{Esum}}
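Two possible approaches (a sketch): run seqkit stats on the file, or count the lines and divide by four, since each read takes up four lines. Either approach gives the read count shown below.
seqkit stats BORED_1_R1.fq
expr $(cat BORED_1_R1.fq | wc -l) / 4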
112193
{{Edet}}
What can we do to get stats such as the number of reads and read length for all of the Golden
Snidget FASTQ files?
{{Sdet}}
Solution{{Esum}}
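One possibility, running seqkit on all of the FASTQ files in the reads folder:
seqkit stats *.fq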
{{Edet}}
How do we visualize quality metrics for the Golden Snidget sequencing reads?
{{Sdet}}
Solution{{Esum}}
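One possibility, running FASTQC on all of the FASTQ files in the reads folder (the reports are written to the current directory unless an output folder is specified with -o):
fastqc *.fq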
{{Edet}}
Look at the quality check report for BORED_1_R1.fq. But first copy it to the ~/public folder.
{{Sdet}}
Solution{{Esum}}
cp BORED_1_R1_fastqc.html ~/public
{{Edet}}
{{Sdet}}
Solution{{Esum}}
{{Edet}}
BORED_1_R2_fastqc.html
BORED_2_R1_fastqc.html
BORED_2_R2_fastqc.html
BORED_3_R1_fastqc.html
BORED_3_R2_fastqc.html
EXCITED_1_R1_fastqc.html
EXCITED_1_R2_fastqc.html
EXCITED_2_R1_fastqc.html
EXCITED_2_R2_fastqc.html
EXCITED_3_R1_fastqc.html
EXCITED_3_R2_fastqc.html
Lesson 11 Practice
Objectives
In this lesson, we learned to aggregate individual FASTQC reports into a single report with MultiQC and to trim adapters and low quality bases from sequencing reads.
{{Sdet}}
Solution{{Esum}}
cd ~/biostar_class/snidget/QC
{{Edet}}
How do we merge the FASTQC results from the Golden Snidget dataset into one?
{{Sdet}}
Solution{{Esum}}
Since we are in the folder that contains our FASTQC results, we can use "." to denote "here in this folder", because MultiQC will look for output logs in the specified folder.
{{Edet}}
Next copy the MultiQC output to the public directory. Do you remember how to do this?
{{Sdet}}
Solution{{Esum}}
cp multiqc_report_snidget.html ~/public/multiqc_report_snidget.html
{{Edet}}
Can you configure the Golden Snidget MultiQC output's General Statistics table to show the
percentage of modules that failed?
In the General Statistics table of the Golden Snidget MultiQC report, can you assign different
colors to distinguish the FASTQ files for the BORED and EXCITED groups?
In the overrepresented sequences plot, how many samples have warnings and how many
failed?
{{Sdet}}
Solutions{{Esum}}
{{Edet}}
Let's go back to the biostar_class directory and create a folder called practice_trimming for this
exercise. How do we do this?
{{Sdet}}
Solution{{Esum}}
This depends on where you currently are (ie. what your present working directory is). From there, go back to the biostar_class folder.
cd ~/biostar_class
mkdir practice_trimming
{{Edet}}
After the "practice_trimming" directory has been created, change into this directory. How do we
do this?
{{Sdet}}
Solution{{Esum}}
cd practice_trimming
{{Edet}}
How many FASTQ files were downloaded? And from the file names, is this from paired or single end sequencing?
{{Sdet}}
Solution{{Esum}}
ls
Two FASTQ files were downloaded and this is paired end sequencing.
{{Edet}}
Let's run FASTQC on these files. Do you recall how to do this?
{{Sdet}}
Solution{{Esum}}
fastqc SRR1553606_*.fastq
{{Edet}}
{{Sdet}}
Solution{{Esum}}
cp SRR1553606_*_fastqc.html ~/public
{{Edet}}
How is the quality, and is there adapter contamination, in the FASTQ files for SRR1553606? If yes, can we trim away the adapters and poor quality reads? FYI, for this exercise our adapter sequence is below (can we create an input file called nextera_adapter.fa with the adapter sequence?).
>nextera
CTGTCTCTTATACACATCTCCGAGCCCACGAGAC
{{Sdet}}
Solution{{Esum}}
The quality for both FASTQ files is not great, and we can remove the poor quality reads and the adapters.
nano nextera_adapter.fa
Copy and paste the adapter sequence into nano, hit control x and save to exit.
>nextera
CTGTCTCTTATACACATCTCCGAGCCCACGAGAC
{{Edet}}
{{Sdet}}
Solution{{Esum}}
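A sketch of one possible Trimmomatic command; the output file names are placeholders, and the quality and length thresholds are illustrative rather than the exact ones used in class:
trimmomatic PE SRR1553606_1.fastq SRR1553606_2.fastq out_1P.fq out_1U.fq out_2P.fq out_2U.fq ILLUMINACLIP:nextera_adapter.fa:2:30:10 SLIDINGWINDOW:4:20 MINLEN:30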
{{Edet}}
What is another tool that we can use to perform quality and adapter trimming on FASTQ files?
{{Sdet}}
Solution{{Esum}}
BBDuk
{{Edet}}
Lesson 12 Practice
Objectives
In this practice session, we will work with something new: a dataset from the Griffith lab RNA sequencing tutorial. Here, we will have a chance to practice what we have learned up to this point of the course series, including data download, quality assessment, and trimming.
Where is my data?
This data set is from the Griffith lab RNA sequencing tutorial (https://rnabio.org) and is kept at http://genomedata.org/rnaseq-tutorial/practical.tar. This dataset is derived from a study that profiled the transcriptomes of the HCC1395 breast cancer cell line and the HCC1395BL lymphoblastoid line, which makes this a tumor versus normal transcriptome comparison -- Griffith lab (https://rnabio.org/module-01-inputs/0001/05/01/RNAseq_Data/).
To begin to work with this dataset, create a directory called hcc1395 in the ~/biostar_class
folder and then change into it. See if you can do this.
{{Sdet}}
Solution{{Esum}}
pwd
cd ~/biostar_class
mkdir hcc1395
cd hcc1395
{{Edet}}
{{Sdet}}
Solution{{Esum}}
wget http://genomedata.org/rnaseq-tutorial/practical.tar
After downloading, list the directory content in the long view to make sure we successfully
downloaded something.
ls -l
total 355132
-rw-rw-r-- 1 joe joe 363653120 Oct 23 2018 practical.tar
We have downloaded an archive (tar) of the practice data. We can use the tar command to
unpack.
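One way to unpack it:
tar -xvf practical.tar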
ls -l
We have the FASTQ files in fastq.gz format (ie. they are gzipped, but we will not be unzipping these as our tools can work with the gzipped versions).
total 710268
-rw-rw-r-- 1 joe joe 25955505 Mar 18 2017 hcc1395_normal_rep1_r1.fastq.gz
-rw-rw-r-- 1 joe joe 30766759 Mar 18 2017 hcc1395_normal_rep1_r2.fastq.gz
-rw-rw-r-- 1 joe joe 25409781 Mar 18 2017 hcc1395_normal_rep2_r1.fastq.gz
-rw-rw-r-- 1 joe joe 30213083 Mar 18 2017 hcc1395_normal_rep2_r2.fastq.gz
-rw-rw-r-- 1 joe joe 25132378 Mar 18 2017 hcc1395_normal_rep3_r1.fastq.gz
-rw-rw-r-- 1 joe joe 30174637 Mar 18 2017 hcc1395_normal_rep3_r2.fastq.gz
-rw-rw-r-- 1 joe joe 30361801 Mar 18 2017 hcc1395_tumor_rep1_r1.fastq.gz
-rw-rw-r-- 1 joe joe 35887220 Mar 18 2017 hcc1395_tumor_rep1_r2.fastq.gz
-rw-rw-r-- 1 joe joe 29769613 Mar 18 2017 hcc1395_tumor_rep2_r1.fastq.gz
-rw-rw-r-- 1 joe joe 35254974 Mar 18 2017 hcc1395_tumor_rep2_r2.fastq.gz
-rw-rw-r-- 1 joe joe 29472281 Mar 18 2017 hcc1395_tumor_rep3_r1.fastq.gz
-rw-rw-r-- 1 joe joe 35241854 Mar 18 2017 hcc1395_tumor_rep3_r2.fastq.gz
-rw-rw-r-- 1 joe joe 363653120 Oct 23 2018 practical.tar
{{Edet}}
{{Sdet}}
Solution{{Esum}}
{{Edet}}
{{Sdet}}
Solution{{Esum}}
331958
{{Edet}}
For this portion of the practice, create a folder called qc within ~/biostar_class/hcc1395 (which
should be the present working directory) to store the FASTQC outputs.
{{Sdet}}
Solution{{Esum}}
mkdir qc
cd qc
{{Edet}}
Can you generate quality reports for the FASTQ files in this dataset?
{{Sdet}}
Solution{{Esum}}
We use -o to specify output directory in the FASTQC command below. Since we want the
FASTQC output to be written in the present working directory (which is ~/biostar_class/hcc1395/
qc) we can just use "." to denote this ("." means here in the current directory).
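A possible command, assuming we are running from the qc folder and the gzipped FASTQ files sit one level up in ~/biostar_class/hcc1395:
fastqc -o . ../*.fastq.gz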
{{Edet}}
{{Sdet}}
Solution{{Esum}}
cp *.html ~/public
{{Edet}}
Examine the FASTQC reports. How is the quality of the sequencing data? Is there any adapter contamination?
{{Sdet}}
Solution{{Esum}}
Some of the sequence quality scores were not great, especially in the read 2 files, and we definitely need to trim for adapters.
{{Edet}}
Can you merge the FASTQC results for this dataset into a MultiQC report?
{{Sdet}}
Solution{{Esum}}
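One possibility, running MultiQC in the folder that contains the FASTQC outputs:
multiqc .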
{{Edet}}
Trimming
For this exercise, go back to the ~/biostar_class/hcc1395 folder and create a new directory
called trimmed_data.
{{Sdet}}
Solution{{Esum}}
cd ~/biostar_class/hcc1395
mkdir trimmed_data
cd trimmed_data
{{Edet}}
{{Sdet}}
Solution{{Esum}}
wget http://genomedata.org/rnaseq-tutorial/illumina_multiplex.fa
or
curl http://genomedata.org/rnaseq-tutorial/illumina_multiplex.fa -o illumina_multiplex.fa
{{Edet}}
Let's trim using the FASTQ files for hcc1395_normal_rep1 using the following thresholds:
{{Sdet}}
Solution{{Esum}}
{{Edet}}
Pre-trimming
Post-trimming
Lesson 13 Practice
Objectives
In this lesson, we learned how to align raw sequencing reads to a reference and to process the alignment results for downstream analysis. Here, we will test our knowledge by continuing with the Golden Snidget dataset.
cd ~/biostar_class/snidget
ls -l
total 117384
drwxrwxr-x 1 joe joe 1188 Oct 31 23:29 QC
-rw-r--r-- 1 joe joe 57462 Oct 27 00:30 golden.genome.tar.gz
-rw-r--r-- 1 joe joe 120138017 Oct 27 00:30 golden.reads.tar.gz
drwxr-xr-x 1 joe joe 336 Oct 27 00:30 reads
drwxr-xr-x 1 joe joe 70 Oct 27 00:30 refs
In the ~/biostar_class/snidget folder, we have the tar.gz files for the Golden Snidget reference
genome and the sequencing data. These were unpacked previously to give the refs and reads
folder. The QC folder contains our FASTQC and MultiQC reports for the unaligned sequencing
data.
ls -1 QC
BORED_1_R1_fastqc.html
BORED_1_R1_fastqc.zip
BORED_1_R2_fastqc.html
BORED_1_R2_fastqc.zip
BORED_2_R1_fastqc.html
BORED_2_R1_fastqc.zip
BORED_2_R2_fastqc.html
BORED_2_R2_fastqc.zip
BORED_3_R1_fastqc.html
BORED_3_R1_fastqc.zip
BORED_3_R2_fastqc.html
BORED_3_R2_fastqc.zip
EXCITED_1_R1_fastqc.html
EXCITED_1_R1_fastqc.zip
EXCITED_1_R2_fastqc.html
EXCITED_1_R2_fastqc.zip
EXCITED_2_R1_fastqc.html
EXCITED_2_R1_fastqc.zip
EXCITED_2_R2_fastqc.html
EXCITED_2_R2_fastqc.zip
EXCITED_3_R1_fastqc.html
EXCITED_3_R1_fastqc.zip
EXCITED_3_R2_fastqc.html
EXCITED_3_R2_fastqc.zip
multiqc_report_snidget.html
multiqc_report_snidget_data
{{Sdet}}
Solution{{Esum}}
{{Edet}}
For today, because we are working with alignment, let's create a folder called snidget_hisat2 to
store the results for the HISAT2 alignment. Take a moment to create this folder.
{{Sdet}}
Solution{{Esum}}
mkdir snidget_hisat2
{{Edet}}
{{Sdet}}
Solution{{Esum}}
cd refs
{{Edet}}
Besides the raw sequencing files and reference genome, what other information do we need to
complete the alignment (hint: it has something to do with the reference genome) and how do we
do this?
{{Sdet}}
Solution{{Esum}}
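We need a HISAT2 index built from the reference genome. A sketch of the command, run from the refs folder, where the index prefix name "genome" is an assumption:
hisat2-build genome.fa genome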
{{Edet}}
cd ~/biostar_class/snidget/reads
Then, how do we use the parallel command to construct the text file with the Golden Snidget
sample IDs? Also, save this file as ids.txt.
{{Sdet}}
Solutions{{Esum}}
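One possible command (a sketch; the -k option keeps the output in order):
parallel -k echo {1}_{2} ::: BORED EXCITED ::: 1 2 3 > ids.txt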
Listing the contents of the reads directory, we see that we have the file ids.txt.
ls
Now, if we print out the contents of ids.txt using cat, then we see the sample IDs for the Golden
Snidget dataset.
cat ids.txt
BORED_1
BORED_2
BORED_3
EXCITED_1
EXCITED_2
EXCITED_3
{{Edet}}
cd ~/biostar_class/snidget/snidget_hisat2
We are going to align all of the FASTQ files for the Golden Snidget in one go. Also, remember as we construct our command that the sequencing reads are paired end. Remember to save the alignment status as a text file so we can view it later (save this as sample-name_hisat2_summary.txt).
{{Sdet}}
Solution{{Esum}}
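A sketch of one possible approach, run from the snidget_hisat2 folder and assuming the HISAT2 index was built with the prefix ../refs/genome as above (the exact command used in class may differ):
cat ../reads/ids.txt | parallel "hisat2 -x ../refs/genome -1 ../reads/{}_R1.fq -2 ../reads/{}_R2.fq 2> {}_hisat2_summary.txt | samtools sort -o {}.bam"
cat ../reads/ids.txt | parallel "samtools index {}.bam"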
{{Edet}}
Now that the alignment is done, what are the overall alignment rates? Were there a lot of
discordant alignments?
{{Sdet}}
Solution{{Esum}}
Overall alignment rate for each sample is 100%. For the most part the reads aligned
concordantly.
{{Edet}}
Can we include the HISAT2 alignment statistics in a MultiQC report? Hint, change back into the
~/biostar_class/snidget directory and save this MultiQC report to the QC folder.
cd ~/biostar_class/snidget
{{Sdet}}
Solution{{Esum}}
Yes
multiqc --filename QC/multiqc_report_snidget_post_alignment .
Now, listing the contents of the QC folder, we will see the MultiQC report that includes the post
alignment statistics (multiqc_report_snidget_post_alignment.html).
ls QC
cp QC/multiqc_report_snidget_post_alignment.html ~/public
{{Edet}}
cd ~/biostar_class/snidget/snidget_hisat2
{{Sdet}}
Solution{{Esum}}
{{Edet}}
Lesson 14 Practice
Objectives
Here, we will practice using the Integrative Genome Viewer (IGV) to visualize the hcc1395 RNA
sequencing alignment results.
Figure 1: Click on the hcc1395_igv.html under All Projects -> BioStars to access the IGV
launcher for the hcc1395 dataset.
Figure 2: Click the launch button to view the alignments for samples hcc1395_normal_rep1 and
hcc1395_tumor_rep2.
{{Sdet}}
Solution{{Esum}}
{{Edet}}
{{Sdet}}
Solution{{Esum}}
{{Edet}}
Upon opening IGV, the BAM tracks are empty. How do we zoom in to start viewing information?
{{Sdet}}
Solution{{Esum}}
We can either select a chromosome, search by gene name or coordinates, or use the zoom feature.
{{Edet}}
Search for the TEF gene. Does there appear to be a difference in gene expression between the normal and tumor samples?
{{Sdet}}
Solution{{Esum}}
It appears that the tumor sample is expressing more TEF than the normal sample.
{{Edet}}
You might have to remove the hcc1395_tumor_rep2 tracks for this, but what is the difference between the HISAT2 and Bowtie2 alignments for the hcc1395_normal_rep1 sample?
{{Sdet}}
Solution{{Esum}}
The HISAT2 alignment has the extra splice junction track, and the reads that map across exons are connected by lines.
{{Edet}}
{{Sdet}}
Solution{{Esum}}
It's telling us that there could be a potential single nucleotide variant, as we observe C's in addition to T's at this position; the reference base is T.
{{Edet}}
Bonus question
Work on this if you choose to, at your own leisure, or if time permits.
How would you confirm that there is a potential single nucleotide variant at position 50,768,105
on chromosome 22 for the hcc1395_normal_rep1 sample?
{{Sdet}}
Solution{{Esum}}
In the Available Datasets box, choose Annotations -> Variation and Repeats -> All Snps
We see that there is indeed a potential SNP here and if we click on it we will see more
information about the potential SNP.
{{Edet}}
Lesson 15 Practice
Objectives
Previously, we performed QC on the Golden Snidget RNA sequencing data, aligned the
sequencing reads to its genome, and obtained expression counts. We can now finally perform
differential expression analysis, to find out which genes are differentially expressed between the
EXCITED and BORED states of the Golden Snidget.
{{Sdet}}
Solution{{Esum}}
ls -1 $CODE | wc -l
{{Edet}}
cd ~/biostar_class/snidget
{{Sdet}}
Solution{{Esum}}
mkdir snidget_deg
{{Edet}}
Creating design.csv
nano design.csv
Copy the text below to the nano editor, hit control-x and save to return to the terminal.
sample,condition
BORED_1.bam,BORED
BORED_2.bam,BORED
BORED_3.bam,BORED
EXCITED_1.bam,EXCITED
EXCITED_2.bam,EXCITED
EXCITED_3.bam,EXCITED
{{Sdet}}
Solution{{Esum}}
cd ~/biostar_class/snidget/snidget_hisat2/
featureCounts -p -a ../refs/features.gff -g gene_name -o ../snidget_deg/counts.
{{Edet}}
cd ~/biostar_class/snidget/snidget_deg
Do you remember how to remove the header line in the counts table?
{{Sdet}}
Solution{{Esum}}
{{Edet}}
Next, how do we remove columns 2 through 6 of the counts table and convert it from tab
delimited to csv?
{{Sdet}}
Solution{{Esum}}
Again, save the counts table without the header; we will need it later.
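A sketch of one possible command; the file names here (counts_noheader.txt for the header-less featureCounts output and counts.csv for the result) are assumptions:
cut -f 1,7- counts_noheader.txt | tr '\t' ',' > counts.csv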
{{Edet}}
Now that the correctly formatted counts table is generated, let's see if we can remember how to run deseq2.r to generate the differential expression results.
{{Sdet}}
Solution{{Esum}}
Rscript $CODE/deseq2.r
{{Edet}}
Take a look at the results.csv file, which contains the differential expression analysis output. Can we sort it by largest to smallest fold change? In the sorted results table, what do you notice? How well do the fold change results match what is expected?
{{Sdet}}
Solution{{Esum}}
{{Edet}}
Visualizing expression
Let's create an expression heatmap. How do we do this? Looking at the heatmap, do the
treatments (ie. BORED and EXCITED) cluster well together?
{{Sdet}}
Solution{{Esum}}
Rscript $CODE/create_heatmap.r
We will need to copy the result, heatmap.pdf, to the public folder to view it. And the BORED and EXCITED groups do cluster together.
{{Edet}}
Lesson 16 Practice
Objectives
In this lesson, we learned about the classification based approach for RNA sequencing
analysis. In this approach, we are aligning our raw sequencing reads to a reference
transcriptome rather than a genome. Here, we will get to practice what we learned using the
Golden Snidget dataset.
cd ~/biostar_class/snidget
mkdir -p classification_based/salmon
{{Sdet}}
Solution{{Esum}}
cd refs
{{Edet}}
{{Sdet}}
Solution{{Esum}}
cd ~/biostar_class/snidget/reads
cd ~/biostar_class/snidget
{{Edet}}
After getting the counts, change into the classification_based folder. How do we do this from the ~/biostar_class/snidget directory?
{{Sdet}}
Solution{{Esum}}
cd classification_based
{{Edet}}
To proceed further with the analysis we need a design.csv file with the sample names, which
essentially are the names of salmon output folders and the treatment condition of the samples.
Take a moment to create the design.csv file.
{{Sdet}}
Solution{{Esum}}
nano design.csv
Copy and paste the text below into the nano editor, hit control x and then save to go back to the
terminal.
sample,condition
BORED_1_SALMON,BORED
BORED_2_SALMON,BORED
BORED_3_SALMON,BORED
EXCITED_1_SALMON,EXCITED
EXCITED_2_SALMON,EXCITED
EXCITED_3_SALMON,EXCITED
{{Edet}}
Remember salmon quant generates one folder for each sample and saves the counts for each
sample in that particular folder. We will need to combine these count files into one. Do you
remember how to do this?
{{Sdet}}
Solution{{Esum}}
{{Edet}}
{{Sdet}}
Solution{{Esum}}
Rscript $CODE/deseq2.r
{{Edet}}
Visualization
Can we generate an expression heatmap?
{{Sdet}}
Solution{{Esum}}
Rscript $CODE/create_heatmap.r
{{Edet}}
Next, let's generate the Principal Components Analysis plot. But first, we need to convert the
counts.csv and design.csv files to their tab delimited counterparts. Remember in Lesson 14 that
we used the tr command to convert a tab delimited text file to a comma separated file. Can you
do the opposite?
{{Sdet}}
Solution{{Esum}}
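One possibility, using tr to replace the commas with tabs (the output file names are assumptions):
cat counts.csv | tr ',' '\t' > counts.txt
cat design.csv | tr ',' '\t' > design.txt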
{{Edet}}
After generating the tab delimited expression counts table and design file, take a moment to
recall how we would generate the Principal Components Analysis plot.
{{Sdet}}
Solution{{Esum}}
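A sketch, assuming the script is run like the other course scripts and reads the tab delimited counts and design files generated above (check the script for its exact usage):
Rscript $CODE/create_pca.r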
{{Edet}}
Take a look at the differential expression results from the classification based approach. Do the gene expression changes reflect what is expected?
{{Sdet}}
Solution{{Esum}}
{{Edet}}
Hint: first copy the results.csv file from the alignment based method in the ~/biostar_class/
snidget/snidget_deg folder to ~/biostar_class/snidget and name it snidget_deg_genome.csv.
Then, copy the results.csv file from the ~/biostar_class/snidget/classfication_based folder to the
~/biostar_class/snidget folder and rename it snidget_deg_transcriptome.csv. Then make use of
the cut, sed, sort, and paste commands to get a merged file. Work in the ~/biostar_class/
snidget directory for this exercise.
{{Sdet}}
Solution{{Esum}}
cd ~/biostar_class/snidget
cp ~/biostar_class/snidget/snidget_deg/results.csv snidget_deg_genome.csv
cp classification_based/results.csv snidget_deg_transcriptome.csv
cut -f1,5 -d ',' snidget_deg_genome.csv | (sed -u 1q; sort) | less -S > snidget
The left half of the table contains fold changes derived from the alignment based analysis. The right half of the table contains fold changes derived from the classification based analysis.
{{Edet}}
Learning objectives
In this practice session, we will practice using DAVID.
The genes that exhibit higher expression in the tumor tissue (ie. log2 fold change ≥ 1 and false
discovery rate ≤ 0.05) have been subsetted into the file hcc1395_deg_chr22_up_genes.txt.
Before uploading the data, open hcc1395_deg_chr22_up_genes.txt to see what the content
looks like. These should resemble gene symbols so select "OFFICIAL_GENE_SYMBOL" as the
identifier type in Step 2. However, as a hint, DAVID might not recognize these and may divert
you to the gene identifier conversion tool. If it does divert you, convert these to
ENSEMBL_GENE_ID. At the gene ID conversion tool, how many IDs are available for
conversion? Thus, out of the 157 genes that we are starting with, how many can we use in
functional annotation analysis?
{{Sdet}}
Solution{{Esum}}
{{Edet}}
Convert the gene IDs and then submit the converted IDs as a gene list called hcc1395_deg_chr22_up.
Results
Using the Gene Functional Classifier to group genes with similar annotations, how many
clusters do we get?
{{Sdet}}
Solution{{Esum}}
{{Edet}}
Potential Diseases
Now, run the Functional Annotation Tool. Look at the gene-wise view for DISGENET: are there any genes in our input that map to cancer?
{{Sdet}}
Solution{{Esum}}
{{Edet}}
Gene Ontology
Go to the Gene Ontology section in the Annotation Summary Results and expand it. Click the chart view for GOTERM_BP_DIRECT to look at some biological processes that the upregulated genes in the dataset map to. What are some of the processes? Does it make sense that genes expressed higher in the tumor samples map to these processes?
{{Sdet}}
Solution{{Esum}}
{{Edet}}
{{Sdet}}
Solution{{Esum}}
{{Edet}}
Look thoroughly through the annotations. Are there any expected annotations, and any unexpected ones, given that this is a tumor versus normal comparison dataset?
References
Content for this course series was adapted / inspired by the following sources:
Apps/Dependencies:
bbtools (https://hpc.nih.gov/apps/bbtools.html)
bcftools (http://samtools.github.io/bcftools/bcftools.html)
bedtools (https://hpc.nih.gov/apps/bedtools.html)
bio (https://www.bioinfo.help/index.html)
bioawk (https://hpc.nih.gov/apps/bioawk.html)
blastn (https://hpc.nih.gov/apps/Blast.html)
bowtie2 (https://hpc.nih.gov/apps/bowtie2.html)
bwa (https://hpc.nih.gov/apps/bwa.html)
cd-hit (https://hpc.nih.gov/apps/cd-hit.html)
csvkit (https://hpc.nih.gov/apps/csvkit.html)
csvtk (https://github.com/shenwei356/csvtk)
cutadapt (https://hpc.nih.gov/apps/cutadapt.html)
datamash (https://www.gnu.org/software/datamash/)
efetch (https://biopython.org/docs/1.75/api/Bio.Entrez.html#)
emboss (http://emboss.open-bio.org)
entrez-direct (https://www.ncbi.nlm.nih.gov/books/NBK179288/)
fastqc (https://hpc.nih.gov/apps/fastqc.html)
featureCounts (https://academic.oup.com/bioinformatics/article/30/7/923/232889)
freebayes (https://hpc.nih.gov/apps/freebayes.html)
gffread (https://github.com/gpertea/gffread)
hisat2 (https://hpc.nih.gov/apps/hisat.html)
kallisto (https://hpc.nih.gov/apps/kallisto.html)
mafft (https://hpc.nih.gov/apps/mafft.html)
minimap2 (https://hpc.nih.gov/apps/minimap2.html)
multiqc (https://hpc.nih.gov/apps/multiqc.html)
parallel (https://hpc.nih.gov/apps/parallel.html)
picard (https://hpc.nih.gov/apps/picard.html)
python (https://www.python.org)
R/Rscript (https://hpc.nih.gov/apps/R.html)
readseq (https://www.ebi.ac.uk/Tools/sfc/readseq/)
salmon (https://hpc.nih.gov/apps/salmon.html)
samtools (https://hpc.nih.gov/apps/samtools.html)
seqkit (https://hpc.nih.gov/apps/seqkit.html)
seqtk (https://hpc.nih.gov/apps/seqtk.html)
snpEff (https://hpc.nih.gov/apps/snpEff.html)
sra-tools (https://github.com/ncbi/sra-tools)
STAR (https://hpc.nih.gov/apps/STAR.html)
subread (https://hpc.nih.gov/apps/subread.html)
trimmomatic (https://hpc.nih.gov/apps/trimmomatic.html)
ucsc-bedgraphtobigwig (https://hpc.nih.gov/apps/Genome_Browser.html)
wget (https://www.gnu.org/software/wget/)
R packages:
biomaRt (https://bioconductor.org/packages/release/bioc/html/biomaRt.html)
DESeq2 (https://bioconductor.org/packages/release/bioc/html/DESeq2.html)
edgeR (https://bioconductor.org/packages/release/bioc/html/edgeR.html)
GenomeInfoDb (https://bioconductor.org/packages/release/bioc/html/GenomeInfoDb.html)
ggplot2 (https://ggplot2.tidyverse.org)
gplots (https://cran.r-project.org/web/packages/gplots/index.html)
pheatmap (https://www.rdocumentation.org/packages/pheatmap/versions/1.0.12/topics/
pheatmap)
RColorBrewer (https://cran.r-project.org/web/packages/RColorBrewer/index.html)
tidyverse (https://bioconductor.org/packages/release/bioc/vignettes/destiny/inst/doc/
tidyverse.html)
tximport (https://bioconductor.org/packages/release/bioc/html/tximport.html)
Additional Resources
RNA-Seq
1. RNA-seq Bioinformatics Course (https://rnabio.org/) from the Griffith lab.
2. If you have a Biowulf account but have not used it in 60 days, please send email to
[email protected] and ask them to unlock your account.
3. If you do not have a Biowulf account: You will be assigned a Student ID and Password for
class help sessions.
4. WARNING: When you log into Unix systems, you will not be able to see your password as
you type it. The cursor does not move, and you may think you are doing something
wrong. This is a safety feature on Unix systems: someone looking over your shoulder cannot see your password. Just type in the password and it will work.
5. If this is the first time you are logging into Unix/Biowulf, you will see a Security Alert as
soon as you log-in. Answer “yes” to the security prompt.
6. Logging into Biowulf from a Mac:
a. Open the Terminal program in Applications/ (Try cmd spacebar to open a search
tool and type Terminal)
b. ssh [email protected] where username is your username or student id
if you are using a student account
7. Logging into Biowulf from a PC. You may need to go to service.cancer.gov to install one of these options if it is not already included.
a. Windows 10/11 has OpenSSH
b. Windows 10 OS has built-in SSH using PowerShell
c. Older OS versions will need PuTTY (https://www.chiark.greenend.org.uk/~sgtatham/putty/latest.html). Under "Alternative binary files", download the 64-bit x86 putty.exe (this will work for most installations).
Your account will give you access to all five volumes of the Biostar Handbook:
1. The Biostar Handbook
2. The Art of Bioinformatics Scripting
3. RNA-Seq by Example
4. Corona Virus Genome Analysis
5. Biostar Workflows
While we will cover some of the topics in these volumes, we only scratch the surface. Feel free
to use your license to learn more about bioinformatics.
If you decide to tackle the concepts in these volumes on Biowulf, consider using the Biostars on
Biowulf module.
Biostars on Biowulf
To complement this course, there is a module available on Biowulf with installed programs
associated with the Biostar Handbook. During class, we will work on the command line on the
GOLD system on DNAnexus. However, this system will not be available outside of class time.
There are two options for catching up on class work or working on practice problems outside of
class. (1) You can install a Biostars conda environment on your local computer (See Biostar
Handbook for instructions (https://www.biostarhandbook.com/computer-setup.html) ). (2) You
can use the Biowulf HPC cluster and the Biostars Biowulf module. For option 2:
1. Obtain a Biowulf account (https://hpc.nih.gov/docs/accounts.html).
2. Connect to Biowulf
3. Use sinteractive to work on an interactive node. This will result in 4GB of memory and
2 CPUs. Alternatively, you may submit jobs by making scripts and submitting jobs via
sbatch (See lesson 5).
Note: If you are planning to use the sratoolkit to download data from the SRA, you will
need to allocate local scratch space (sinteractive --gres=lscratch:30).
4. Set up the Biostars environment by running:
source /data/classes/BTEP/apps/biostars/1.0/run_biostars.sh
5. If you want to use the biostars module for other purposes or you want to submit jobs via sbatch, skip Step 4. You can load the module with the following:
Course files found on DNAnexus will be made accessible on Biowulf. The path to course files
has been assigned to the environment variable $DATA, which is automatically set when you run
Step 4.
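Putting steps 2 through 4 together, a minimal example session could look like the following (the lscratch request is only needed if you plan to use sra-tools, as noted above):
# Start an interactive session, allocating local scratch space for sra-tools if needed
sinteractive --gres=lscratch:30
# Step 4: set up the Biostars environment; this also sets the $DATA variable
source /data/classes/BTEP/apps/biostars/1.0/run_biostars.sh
# List the course files made available on Biowulf
ls $DATA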
Installing IGV
Figure 1
Once you are at the landing page, click on Order a Service and you will be prompted to log in
with your NIH credentials. After logging in you will see a page that lists service categories
(Figure 2). From here, look under the Categories pane and select Software.
Figure 2
Next, you will be taken to the software services page where you will select Install Software Not
Managed by CBIIT.
Figure 3
Figure 4
Note: make sure you can at least open IGV (i.e. test it) before ending the help session with service.cancer.gov.
Below, you will find questions and answers brought up in the course polls for the BTEP
Bioinformatics for Beginners course series that took place from September 13th, 2022 to
December 13th, 2022.
Question 1: Is normalization always needed when analyzing RNA sequencing data?
Answer to Question 1: Normalization is always needed in RNA sequencing because it helps to remove technical effects so that we are comparing only the biological differences between experimental conditions. Technical factors include things like batch effects, differences in sequencing depth (library size) between samples, gene length, and library composition. Various techniques that normalize based on library size and gene length were discussed in class. Refer to the documentation on RNA sequencing quantitation (https://btep.ccr.cancer.gov/docs/b4b/RNASeq_Overview/06.Quantitation/#count-normalization).
Other types of normalization methods include quantile normalization, which has been found to work well for RNA sequencing. Further, some differential expression analysis packages have their own normalization scheme, so all users need to do is provide the raw integer expression counts.
Question 2 (participant comment): “Hi-just a followup comment. The experience of the bioinformatics people that my lab works
with is that normalization in RNAseq is everything and you are wasting your time if you don’t do
it properly. Especially if your treatment leads to in increase in RNA content of the cell-like resting
vs activated T cells. The truth is, most everything goes up under these conditions but most
software adjusts things such that half genes go up and half go down. In fact most of what is
indicated to go down under these conditions are things that just don’t go up as much as other
genes. Anyway, that was a long explanation for why understanding how to do this properly is
likely to be crucial. In my opinion, anyway. Thanks very much”
Answer to Question 2: For example, the most commonly used methods for detecting differential gene expression (e.g. DESeq2 and edgeR) assume that most genes are not differentially expressed and start with a correction for library size. This assumption might not apply to your experimental setup: if you suspect your treatment would result in differences in the total amount of signal (RNA yield) between your two samples (e.g. activated vs. resting T cells, overexpression of cMyc), normalizing by library size wouldn’t be accurate to the biology of your sample, since you’d be removing this real difference. If you want to capture these absolute differences in signal, then you would have to modify your experimental approach (e.g. include a spike-in control - see https://journals.asm.org/doi/full/10.1128/MCB.00970-14).
Question 3: Strategies for batch correction (when to use it, what approaches)?
Answer to Question 3: Batch effects are caused by variation between samples that is not due to your experimental design (i.e. technical variation). The best approach to take regarding batch correction is to design your experiments in such a way that you can avoid it entirely: if possible, isolate and prepare your samples on the same day, and use the same reagents and locations for preparing your samples. If you must have batches and can't prepare your samples on the
same day, make sure they're not confounded: have representatives of your biological groups
(e.g. controls and treatments) in each batch, and keep track of any batches in your
experimental records and metadata so you can identify batch effects if they exist in your data
(see here for a nice overview (https://hbctraining.github.io/Intro-to-rnaseq-fasrc-salmon-flipped/
lessons/02_experimental_planning_considerations.html)).
You can identify batch effects through an initial exploratory analysis of your data (e.g. hierarchical clustering, principal component analysis (PCA)).
An example of how PCA can be used to identify batch effects: when labelled according to sample type (Wildtype +/- drug and Mutant +/- drug), the replicate 3 samples are outliers and make no
sense. However, when knowledge of when the different samples were processed (two
“batches”) is added, it is apparent that replicates 3 were all done on a different day and
represent batch effects that must be factored into the analysis (and it looks like the drug only
affected the mutant).
Tools to adjust your data for batch effects exist: two common methods are ComBat and SVA, both available in the Bioconductor package sva (https://www.bioconductor.org/packages/release/bioc/html/sva.html). Identifying and removing batch effects is reviewed in this article (https://www.nature.com/articles/nrg2825); however, I would recommend consulting a bioinformatician with experience in this before venturing out to do this on your own.
Question 4: When working with large data sets, what are the expectations for time and storage space?
Answer to Question 4: This depends on your project, but FASTQ files can be many gigabytes in size, and if you have FASTQ files for many samples, the space can add up. You can always add
more storage space to your Biowulf account (https://hpc.nih.gov/storage/). The amount of time it
takes to complete your analysis depends on the pipeline that you are using and whether you
have to create your own pipeline. If creating your own analysis pipeline, you will also need to
optimize the run time of your analysis. Reach out to one of our expert analysts to discuss and
plan before starting your study.
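As a practical aside, you can keep an eye on your current allocations and usage with the checkquota utility on Biowulf (a minimal sketch; run it with no arguments):
# Show usage and quota for your /home and /data areas on Biowulf
checkquota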
Question 5: What methods are available for transferring fastq files from my disk to Biowulf?
Answer to Question 5:
See Transferring data to/from the NIH HPC systems (https://hpc.nih.gov/docs/transfer.html) for
more about transferring files to and from Biowulf.
• mounting (https://youtu.be/H8ZksTK3EtE?t=88)
• secure copy or scp (https://youtu.be/H8ZksTK3EtE?t=258)
• Globus (https://youtu.be/mg9-a1OuDqo)
While there are several approaches for transferring files to Biowulf from your local machine, if you are working with many FASTQ files, Globus is the go-to solution.
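If you only have a handful of files, secure copy is often the quickest option. A minimal sketch, run from your local machine, where the username, file name, and destination path are placeholders (helix.nih.gov is the NIH HPC host designated for data transfers):
# Copy a single FASTQ file from your local computer to your Biowulf /data area
scp /path/to/sample_R1.fastq.gz [email protected]:/data/username/project/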
Question 6: What should I do after downloading RNA-seq data from a public website (e.g. TCGA), and how should I check the status of the data?
Answer to Question 6: This depends on the study. For instance, in a TCGA RNA sequencing study you may find BAM files or expression counts, and the data available will determine your starting point. When working with public datasets, it is very important to learn as much about a study as you can prior to working with the data (i.e. do not just download and blindly analyze).
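One concrete first check after any download is file integrity. A minimal sketch, assuming the repository provides an MD5 checksum file (the file name below is a placeholder):
# Verify that the downloaded files match the checksums supplied by the repository
md5sum -c md5_checksums.txt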
Question 7: What are the various data formats and levels of quality of public data (GEO, TCGA, DepMap, UK data, etc.)?
Answer to Question 7: This varies depending on the repository and study. For instance, GEO does not have a standard set of data that investigators need to submit. Thus, among GEO studies, you may find some with only sequencing data (FASTQ files), expression data, differential expression results, or a combination.
Question 8: If I want to merge two GEO studies, what should I check in terms of data quality, format, etc.?
Answer to Question 8: Start with raw data if at all possible, do some of the QC steps that we did
in the class if they’re applicable, and process the data uniformly (same programs, pipelines,
etc).