4 - 7 Genome Assembly To Annotation - Final
General sequencing and assembly workflow
Sequence assembly strategies
Map-based sequencing
By Commins, J., Toft, C., Fares, M. A. - "Computational Biology Methods and Their Application to the
Comparative Genomics of Endocellular Symbiotic Bacteria of Insects." Biol. Procedures Online
(2009). Accessed via SpringerImages, CC BY-SA 2.5,
https://commons.wikimedia.org/w/index.php?curid=17509619
Building a library
Actual situation
Assembling the fragments
Forward-reverse constraints
• The sequenced ends are facing towards each other
• The distance between the two fragments is known
(within certain experimental error)
[Figure: forward (F) and reverse (R) reads I and II at the two ends of an insert in a clone, shown in both orientations]
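The two constraints above can be checked for a pair of placed reads. This is a minimal sketch: the `ReadPlacement` type, the function name and the tolerance parameter are illustrative assumptions, not part of any assembler.

```python
from dataclasses import dataclass


@dataclass
class ReadPlacement:
    start: int    # leftmost coordinate of the read on the contig (0-based)
    length: int   # read length in bases
    strand: str   # "+" for forward, "-" for reverse


def satisfies_constraints(first, second, insert_size, tolerance):
    """Return True if the two mates face each other at roughly the expected distance."""
    left, right = sorted((first, second), key=lambda read: read.start)
    # Orientation: the left mate must be forward and the right mate reverse.
    facing = left.strand == "+" and right.strand == "-"
    # Distance: outer span from the start of the left mate to the end of the right mate.
    span = (right.start + right.length) - left.start
    return facing and abs(span - insert_size) <= tolerance
```

A pair that fails either test suggests a misassembly or a chimeric clone, so scaffolders usually require several consistent pairs before linking two contigs.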
Building Scaffolds
Assembly gaps
• Sequencing gap: we know the order and orientation of the contigs and have at least one clone spanning the gap
• Physical gap: no information is known about the adjacent contigs or about the DNA spanning the gap
Unifying view of assembly
Assembly
Scaffolding
Alignment versus De Novo Assembly
Is a reference genome available?
• Yes: Alignment to Reference
• No: de novo Assembly
Alignment versus De Novo Assembly
Yes
Alignment to Reference
Workflow: QC & Mapping reads
1. Map reads to the reference genome using e.g. BWA or Bowtie2
2. Sort by coordinates using SAMtools sort or PicardTools SortSam
   Output: sorted BAM file (binary SAM, Sequence Alignment/Map)
3. Call variants, structural variation etc.
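The mapping and sorting steps can be sketched as shell commands assembled in Python. The file names and thread count are placeholders, and the sketch assumes `bwa` and `samtools` are installed and on the PATH.

```python
import shlex


def mapping_commands(reference, reads, sorted_bam, threads=4):
    """Build shell commands that index, map with BWA-MEM, coordinate-sort and index."""
    ref, fastq, bam = (shlex.quote(path) for path in (reference, reads, sorted_bam))
    return [
        f"bwa index {ref}",                                               # build the reference index once
        f"bwa mem -t {threads} {ref} {fastq} | samtools sort -o {bam} -",  # map, then sort by coordinate
        f"samtools index {bam}",                                          # index the sorted BAM for viewers and callers
    ]
```

Each string could be run with `subprocess.run(command, shell=True, check=True)`. Piping the mapper straight into the sorter avoids writing the large intermediate SAM file.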
Steps in Alignment/Mapping
1. Get your sequence data
6. Downstream Processing
Public Short Read Repositories
NIH/NCBI
• Sequence Read Archive, formerly Short Read Archive (http://www.ncbi.nlm.nih.gov/sra)
• Gene Expression Omnibus (http://www.ncbi.nlm.nih.gov/geo/)
• 1000 Genomes (ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/)
EMBL-EBI
• European Nucleotide Archive (http://www.ebi.ac.uk/ena/)
fastq-dump SRR036642
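A downloaded run is typically a FASTQ file with four lines per read (header, sequence, separator, quality). The reader below is a minimal sketch that assumes unwrapped four-line records; the function name and the two example reads are made up for illustration.

```python
import io
from typing import Iterator, TextIO, Tuple


def read_fastq(handle: TextIO) -> Iterator[Tuple[str, str, str]]:
    """Yield (name, sequence, quality) for each four-line FASTQ record."""
    while True:
        header = handle.readline().rstrip()
        if not header:          # end of file
            break
        sequence = handle.readline().rstrip()
        handle.readline()       # '+' separator line, ignored
        quality = handle.readline().rstrip()
        if not header.startswith("@") or len(sequence) != len(quality):
            raise ValueError(f"malformed FASTQ record: {header!r}")
        yield header[1:], sequence, quality


# Two invented reads stand in for a real file such as SRR036642.fastq.
example = io.StringIO("@read1\nACGT\n+\nIIII\n@read2\nGG\n+\n!!\n")
records = list(read_fastq(example))
```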
Running FastQC
Open in browser:
fastqc_report.html
Per base sequence quality
• p-value = 0.0001
• p-value = 0.001
• p-value = 0.01
• p-value = 0.05
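Assuming the p-values above are base-call error probabilities, they relate to the Phred scores on the FastQC quality axis by Q = -10 log10(p). The sketch below also assumes Phred+33 quality encoding, which is common but not universal.

```python
import math


def phred_from_error_probability(p):
    """Phred quality score for a base-call error probability p."""
    return -10 * math.log10(p)


def error_probability_from_phred(q):
    """Inverse conversion: error probability for a Phred score q."""
    return 10 ** (-q / 10)


def decode_phred33(quality_string):
    """Decode a FASTQ quality string, assuming Phred+33 ASCII encoding."""
    return [ord(char) - 33 for char in quality_string]


for p in (0.0001, 0.001, 0.01, 0.05):
    print(f"p = {p}: Q = {phred_from_error_probability(p):.0f}")
```

Under these assumptions the four thresholds correspond to roughly Q40, Q30, Q20 and Q13.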
Short Read Alignment Software
• BFAST
• BLASTN
• BLAT
• Bowtie
• Bowtie2
• BWA
• ELAND
• GNUMAP
• GMAP and GSNAP
• MAQ
• mrFAST and mrsFAST
• MOSAIK
• Novoalign
• RUM
• SHRiMP
• SOAP
• SpliceMap
• SSAHA and SSAHA2
• STAR
• TopHat
• ~20 more…
http://en.wikipedia.org/wiki/List_of_sequence_alignment_software
http://tinyurl.com/seqanswers-mapping
Issues of Consideration for Alignment Software
Library types:
• Genomic DNA (for resequencing)
• ChIP DNA (PCR bias)
• RNA-seq cDNA
– mRNA-seq (junction mapping)
– smRNA-seq (adapter trimming)
[Figure: Illumina library, DNA fragment flanked by adapters]
Types of reads
e.g. SRR036642.fastq
• Single-end
Platform differences
• Bases (ACTG)
• Read Length
• 454 (homopolymers)
Software Properties
• Open-source or proprietary ($)
• Accuracy
• Speed of algorithm
• Multi-threaded or single processor
• RAM requirements (2GB vs 50GB for loading index)
• Use of base quality score
• Gapped alignment (indels)
“Mapping reads to the reference” is
finding where their sequence occurs in the genome
• Read can match equally well at more than one location (e.g.
repeats, pseudo-genes)
The CIGAR string reports:
• How much of a read has been matched (and how much is inserted or deleted)
• Where those matches (and insertions/deletions) fall within the alignment
(http://samtools.github.io/hts-specs/SAMv1.pdf)
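A CIGAR string can be unpacked into its operations to answer both questions. This sketch follows the operation letters defined in the SAM specification; the function names and the example string are illustrative.

```python
import re

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")
CONSUMES_QUERY = set("MIS=X")      # operations that use up read bases
CONSUMES_REFERENCE = set("MDN=X")  # operations that use up reference bases


def parse_cigar(cigar):
    """Split a CIGAR string such as '10M2I5M' into (length, operation) pairs."""
    ops = [(int(length), op) for length, op in CIGAR_OP.findall(cigar)]
    # Reject strings with characters the pattern skipped (including '*').
    if "".join(f"{length}{op}" for length, op in ops) != cigar:
        raise ValueError(f"invalid CIGAR string: {cigar!r}")
    return ops


def summarize_cigar(cigar):
    """Report read length, reference span and inserted/deleted base counts."""
    ops = parse_cigar(cigar)
    return {
        "read_length": sum(n for n, op in ops if op in CONSUMES_QUERY),
        "reference_span": sum(n for n, op in ops if op in CONSUMES_REFERENCE),
        "inserted": sum(n for n, op in ops if op == "I"),
        "deleted": sum(n for n, op in ops if op == "D"),
    }


summary = summarize_cigar("3S10M2I5M1D20M")
```

For this example the read is 40 bases long, covers 36 reference bases, and carries a 2-base insertion and a 1-base deletion.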
Visualizing mapping results
Source: www.cs.colostate.edu/~cs680/Slides/lecture3.pdf
Visualization of output in the Integrative Genomics Viewer (IGV)
IGV
• http://www.broadinstitute.org/igv/projects/current/igv_mm.jnlp (Windows)
• http://www.broadinstitute.org/igv/projects/current/igv_lm.jnlp (Mac)
SV/CNV/Variant Calling
• https://datacarpentry.org/wrangling-genomics/
• https://genomics.sschmeier.com/ngs-variantcalling/index.html
• https://learn.gencore.bio.nyu.edu/variant-calling/
De novo genome assembly
Short sequence “reads”
Is a reference genome available?
http://www.ncbi.nlm.nih.gov/sites/genome (“Browse by organism groups”)
• No: de novo assembly
General strategy of assembling a genome de novo
1. Pre-process short reads (trim, quality filter…)
2. Assemble sequences into contigs
3. Order contigs into scaffolds
4. Annotate genome
Pipeline for de novo assembly
1. Choose genome, gather info
2. DNA library preparation
3. Merge overlapping reads
4. ASSEMBLE!
5. Assemble again and again (different tools, kmers)
6. Re-scaffold
7. ANNOTATE!
Best Assembly Advice
1. Greedy
2. Overlap-layout-consensus (OLC)
3. De Bruijn Graph
Greedy
Used in the very early next-gen assemblers (e.g. SSAKE, VCAKE)
1. The highest-scoring overlap is merged first, and the growing contig then takes on the read with the next highest score
2. Paired-end reads are used to generate supercontigs
3. Mate pairs can also be used to determine contig order
[Figure: reads extending an existing contig; an imperfect overlap can give a correct or an incorrect extension]
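The greedy idea can be sketched on a toy read set. This is a simplified illustration, not the SSAKE or VCAKE algorithm: it uses exact suffix-prefix overlaps only, ignores reverse complements and pairing information, and the minimum overlap of 3 is an arbitrary assumption.

```python
def overlap(a, b, min_length=3):
    """Length of the longest suffix of a that equals a prefix of b (0 if too short)."""
    for size in range(min(len(a), len(b)), min_length - 1, -1):
        if a.endswith(b[:size]):
            return size
    return 0


def greedy_assemble(reads, min_length=3):
    """Repeatedly merge the pair of sequences with the longest overlap."""
    contigs = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while True:
        best_size, best_i, best_j = 0, None, None
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    size = overlap(a, b, min_length)
                    if size > best_size:
                        best_size, best_i, best_j = size, i, j
        if best_size == 0:      # nothing left to merge
            return contigs
        merged = contigs[best_i] + contigs[best_j][best_size:]
        contigs = [c for k, c in enumerate(contigs) if k not in (best_i, best_j)]
        contigs.append(merged)


contigs = greedy_assemble(["ATGGCGT", "GCGTGCA", "TGCAATC"])
```

Because each merge is chosen locally, a repeat can make the best-scoring overlap the wrong one, which is the failure mode the figure above points at.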
Overlap Graph or Overlap-layout-consensus (OLC)
De Bruijn Graph
Examples are: Velvet, SOAPdenovo, ALLPATHS-LG, ABySS
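A de Bruijn graph breaks reads into k-mers and links each k-mer's (k-1)-base prefix to its (k-1)-base suffix. The sketch below is a toy construction with a naive walk along unambiguous edges; real assemblers add error correction, reverse complements and graph simplification, none of which is shown. The reads and k = 4 are made-up values.

```python
from collections import Counter, defaultdict


def de_bruijn_graph(reads, k):
    """Map each (k-1)-mer to a Counter of the (k-1)-mers that follow it."""
    graph = defaultdict(Counter)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]][kmer[1:]] += 1   # edge weight = k-mer multiplicity
    return graph


def walk_unambiguous(graph, start):
    """Extend a contig from start while each node has exactly one successor."""
    contig, node, seen = start, start, {start}
    while len(graph.get(node, {})) == 1:
        (node,) = graph[node]                 # the single successor
        if node in seen:                      # stop on a cycle
            break
        seen.add(node)
        contig += node[-1]
    return contig


graph = de_bruijn_graph(["ATGGCG", "GGCGTG"], k=4)
contig = walk_unambiguous(graph, "ATG")
```

The two reads share the k-mer GGCG, so its edge has weight 2 and the walk spells out the merged sequence without ever computing pairwise read overlaps.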
De novo assembly tools
List of de-novo assemblers
Short read assembler
• SPAdes; http://cab.spbu.ru/software/spades/
• Velvet (European Bioinformatics Institute); https://en.wikipedia.org/wiki/Velvet_assembler
• SOAPdenovo; https://www.animalgenome.org/bioinfo/resources/manuals/SOAP.html
• ABySS; https://github.com/bcgsc/abyss
Coverage
GC Content
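Both quantities can be computed directly. This is a minimal sketch under two assumptions not stated on the slides: coverage is taken as the expected average depth (number of reads × read length / genome size), and GC content is the fraction of G and C among called bases, ignoring N. The numbers in the usage lines are invented.

```python
def gc_content(sequence):
    """Fraction of G and C among the A/C/G/T bases of a sequence."""
    sequence = sequence.upper()
    called = sum(sequence.count(base) for base in "ACGT")
    if called == 0:
        raise ValueError("sequence contains no called bases")
    return (sequence.count("G") + sequence.count("C")) / called


def expected_coverage(read_count, read_length, genome_size):
    """Average depth expected if reads were spread evenly over the genome."""
    return read_count * read_length / genome_size


depth = expected_coverage(read_count=1_000_000, read_length=100, genome_size=4_000_000)
gc = gc_content("ATGCGC")
```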
Genome annotation
• repeats analysis and annotation
• protein-coding gene annotation (including gene structure
prediction and gene function annotation)
• non-coding RNA gene annotation (including annotation of
microRNA, tRNA, rRNA, and other ncRNA)
• transposon and tandem repeats annotation
• QUAST (QUality ASsessment Tool) evaluates
genome assemblies by computing various
metrics, including:
1. N50: length for which the collection of all contigs of that
length or longer covers at least 50% of assembly length.
2. L50: The minimum number X such that X longest contigs
cover at least 50% of the assembly
Evaluating the assembly
Basic statistics
N50 = the length of the shortest contig such that the sum of contigs of equal length
or longer is at least 50% of the total length of all contigs. Equivalently, 50% of the
entire assembly is contained in contigs or scaffolds of length N50 or longer.
Contig size (bp)   Cumulative size (bp)
3000               3000
2000               5000   N50 (5000 ≥ 4000)
1200               6200
800                7000
600                7600   N90 (7600 ≥ 7200)
400                8000
Total: 8000
N90 = the length of the shortest contig such that the sum of contigs
of equal length or longer is at least 90% of the total length of all
contigs.
Contig or Scaffold N50
• Most widely used statistic for genome
assemblies
• Measure of contiguity
• Sort all contigs from longest to shortest and add up
their lengths. The N50 is the length of the contig at
which the running total first reaches half of the
assembly.
• More informative than mean
Contig or Scaffold N50
• 1,1,1,1,1,1,1,1,2,2,3,4,6,6,8,9,9,9,10,24
– Mean = 5
– N50 = 9
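The N50, N90 and L50 definitions above can be checked on both worked examples. This is a small sketch of the definitions, not QUAST's implementation; the function name `nx` and its return convention are assumptions made here.

```python
def nx(lengths, percent=50):
    """Return (Nx, Lx): the contig length and contig count at which the
    running total of lengths, longest first, reaches percent% of the assembly."""
    ordered = sorted(lengths, reverse=True)
    total = sum(ordered)
    running = 0
    for count, length in enumerate(ordered, start=1):
        running += length
        if running * 100 >= percent * total:   # integer comparison avoids rounding issues
            return length, count
    raise ValueError("lengths must be non-empty")


table = [3000, 2000, 1200, 800, 600, 400]            # contig sizes from the N50/N90 table
toy = [1] * 8 + [2, 2, 3, 4, 6, 6, 8, 9, 9, 9, 10, 24]  # the 20-contig example

n50, l50 = nx(table, 50)
n90, l90 = nx(table, 90)
toy_n50, toy_l50 = nx(toy, 50)
toy_mean = sum(toy) / len(toy)
```

In the 20-contig example the mean is pulled down to 5 by the many 1-unit contigs, while the N50 of 9 reflects that half of the assembly sits in just four contigs.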
Assemblathon 1
• Genome Res. 2011 21: 2224-2241
Genome Assembly Gold-standard Evaluations (GAGE)
• Genome Res. 2012 22: 557-567
• http://gage.cbcb.umd.edu/results/index.html
Genome annotation
• Two main levels:
• Structural annotation (nucleotide/protein-level
annotation): finding genes and other biologically
relevant sites, thus building up a model of the genome as
objects with specific locations
• Functional annotation: objects are used in
database searches (and experiments); the aim is to
attribute biologically relevant information to the whole
sequence and to individual objects
• Annotation is the rate-limiting step of sequencing
projects
Things we are looking to annotate?
• Protein Coding genes
• CDS
• mRNA
• Promoter and Poly-A Signal
• Alternatively spliced RNA
• Pseudogenes
• ncRNA
What are genes?
• Complete DNA segments responsible for making functional
products
• Products
• Proteins
• Functional RNA molecules
• miRNA (micro RNA)
• rRNA (ribosomal RNA)
• snRNA (small nuclear)
• snoRNA (small nucleolar)
• tRNA (transfer RNA)
Pseudogenes
• Non-functional copy of a gene
• Processed pseudogene
• Retro-transposon derived
• No 5’ promoters
• No introns
• Often includes polyA tail
• Non-processed pseudogene
• Gene duplication derived
• Both include events that make the gene non-functional
• Frameshift
• Stop codons
• We assume pseudogenes have no function, but we
really don’t know!
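The two disabling events listed above can be screened for in a candidate coding sequence. This is a deliberately crude sketch: it assumes the standard genetic code, a sequence that starts in frame, and treats a length that is not a multiple of three as a sign of a frameshift. Real pseudogene detection compares the candidate against its functional parent gene.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}  # standard genetic code


def disabling_features(cds):
    """List frameshift and premature-stop signals in an in-frame coding sequence."""
    cds = cds.upper()
    problems = []
    if len(cds) % 3 != 0:
        problems.append("length not a multiple of 3 (possible frameshift)")
    usable = len(cds) - len(cds) % 3
    codons = [cds[i:i + 3] for i in range(0, usable, 3)]
    # Every codon except the last is checked: a stop there is premature.
    for index, codon in enumerate(codons[:-1], start=1):
        if codon in STOP_CODONS:
            problems.append(f"premature stop codon {codon} at codon {index}")
    return problems
```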
Noncoding RNA (ncRNA)
TIGR: http://www.tigr.org/tigr-scripts/CMR2/CMRGenomes.spl
NCBI: http://www.ncbi.nlm.nih.gov/PMGifs/Genomes/micr.html
The challenge of eukaryotic genomes
4 million bp E. coli Genome
• Rule-based programs
• Use explicit set of rules to make decisions.
• Example: GeneFinder
• Neural Network-based programs
• Use data set to build rules.
• Examples: Grail, GrailEXP
• Hidden Markov Model-based programs
• Use probabilities of states and transitions between
these states to predict features.
• Examples: Genscan, GenomeScan
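The HMM approach can be illustrated with a two-state toy model decoded by the Viterbi algorithm. All probabilities below are invented for illustration and the model is far simpler than the state structure used by programs such as Genscan; it only shows how state and transition probabilities turn a sequence into a labelling.

```python
import math

STATES = ("noncoding", "coding")
START = {"noncoding": 0.5, "coding": 0.5}
TRANSITION = {                       # hypothetical: states tend to persist
    "noncoding": {"noncoding": 0.9, "coding": 0.1},
    "coding": {"noncoding": 0.1, "coding": 0.9},
}
EMISSION = {                         # hypothetical: coding regions are GC-richer
    "noncoding": {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3},
    "coding": {"A": 0.2, "C": 0.3, "G": 0.3, "T": 0.2},
}


def viterbi(sequence):
    """Return the most probable state path for a DNA sequence (log-space Viterbi)."""
    if not sequence:
        return []
    scores = {s: math.log(START[s]) + math.log(EMISSION[s][sequence[0]]) for s in STATES}
    backpointers = []
    for base in sequence[1:]:
        new_scores, pointers = {}, {}
        for s in STATES:
            # Best predecessor state for being in state s at this position.
            prev = max(STATES, key=lambda p: scores[p] + math.log(TRANSITION[p][s]))
            new_scores[s] = scores[prev] + math.log(TRANSITION[prev][s]) + math.log(EMISSION[s][base])
            pointers[s] = prev
        scores = new_scores
        backpointers.append(pointers)
    state = max(STATES, key=lambda s: scores[s])
    path = [state]
    for pointers in reversed(backpointers):   # trace the best path backwards
        state = pointers[state]
        path.append(state)
    return path[::-1]
```

With these toy parameters a run of G is labelled coding and a run of A noncoding; the high self-transition probability keeps the labelling from flipping at every base.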
Common difficulties