4 - 7 Genome Assembly To Annotation - Final

A course that gives practical insight into the path from genome assembly to annotation

Uploaded by

Desye Melese
Copyright
© All Rights Reserved
Genome sequence assembly

Assembly concepts and methods

(some slides courtesy of Mihai Pop, Amel Ghouila)

General sequencing and assembly workflow

Sequence assembly strategies

Whole genome shotgun

Map-based sequencing

By Commins, J., Toft, C., Fares, M. A. - "Computational Biology Methods and Their Application to the
Comparative Genomics of Endocellular Symbiotic Bacteria of Insects." Biol. Procedures Online
(2009). Accessed via SpringerImages. CC BY-SA 2.5,
https://commons.wikimedia.org/w/index.php?curid=17509619
Building a library

• Break DNA into random fragments (8-10x coverage)


• Sequence the ends of the fragments
– Amplify the fragments in a vector
– Sequence 800-1000 (500-700) bases at each end of the fragment
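The coverage figure on this slide can be related to the number of reads with a short calculation. The sketch below assumes the usual definition, coverage = (number of reads × read length) / genome size, and the idealised Lander-Waterman estimate that a fraction of about e^(-coverage) of bases is hit by no read. The genome size used in the demo is a made-up example value, and the function names are illustrative only.

```python
import math

def reads_needed(genome_size, read_length, coverage):
    """Smallest number of reads N with N * read_length / genome_size >= coverage."""
    return math.ceil(coverage * genome_size / read_length)

def expected_uncovered_fraction(coverage):
    """Idealised (Poisson) estimate of the fraction of bases covered by no read."""
    return math.exp(-coverage)

if __name__ == "__main__":
    genome_size = 4_000_000  # assumed example genome size in bp
    read_length = 800        # lower end of the 800-1000 base reads on the slide
    for coverage in (8, 10):
        n = reads_needed(genome_size, read_length, coverage)
        gap = expected_uncovered_fraction(coverage)
        print(f"{coverage}x coverage: {n} reads, expected uncovered fraction {gap:.6f}")
```

Real libraries deviate from this model (cloning bias, repeats, uneven read lengths), so the numbers are a planning estimate rather than a guarantee.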

Assembling the fragments

Forward-reverse constraints
• The sequenced ends are facing towards each other
• The distance between the two sequenced ends is known
(within a certain experimental error)
[Figure: an insert (clone) with its forward (F) and reverse (R) end reads, I and II, shown in each possible orientation]
Building Scaffolds

• Break DNA into random fragments (8-10x coverage)


• Sequence the ends of the fragments
• Assemble the sequenced ends
• Build scaffolds

Assembly gaps
• Sequencing gap: we know the order and orientation of the contigs and have at
least one clone spanning the gap
• Physical gap: no information is known about the adjacent contigs or about the DNA
spanning the gap
Unifying view of assembly

Assembly

Scaffolding

Alignment versus De Novo Assembly

Short sequence “reads”: is a reference genome available?
(http://www.ncbi.nlm.nih.gov/sites/genome, “Browse by organism groups”)
• Yes: alignment to reference
• No: de novo assembly
Workflow: QC & Mapping reads

1. Input reads (fastq files)
2. Quality check with FastQC
– Not OK? Quality- & adapter-trimming
– OK? Continue to mapping
3. Map reads to reference genome using e.g. BWA or Bowtie2
4. Sort by coordinates using SAMtools sort or PicardTools SortSam
5. Output: sorted BAM file (binary SAM, Sequence Alignment/Map)
6. Call variants, structural variation etc.
Steps in Alignment/Mapping
1. Get your sequence data

2. Check quality of sequence data

3. Choose an alignment/mapping program

4. Run the alignment

5. View the alignments

6. Downstream Processing

Public Short Read Repositories
• NIH/NCBI
– Sequence Read Archive, SRA (formerly Short Read Archive) (http://www.ncbi.nlm.nih.gov/sra)
– Gene Expression Omnibus (http://www.ncbi.nlm.nih.gov/geo/)
– 1000 Genomes (ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/)
• EMBL-EBI
– European Nucleotide Archive (http://www.ebi.ac.uk/ena/)

Example: download one run with the SRA Toolkit
fastq-dump SRR036642
Running FastQC

• Open the FastQC program
• Open the report in a browser: fastqc_report.html
• Per base sequence quality: the levels marked on the plot correspond to p-values (base-call error probabilities) of 0.0001, 0.001, 0.01 and 0.05

Babraham Bioinformatics http://www.bioinformatics.babraham.ac.uk/projects/fastqc/
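The p-values on the per-base quality plot are base-call error probabilities, and FASTQ files store them as Phred scores, Q = -10 · log10(p). The following sketch converts between the two and decodes a quality string; it assumes the common Phred+33 ASCII encoding, and the function names are illustrative, not part of FastQC.

```python
import math

def error_prob_to_phred(p):
    """Phred quality score for a base-call error probability p."""
    return -10 * math.log10(p)

def phred_to_error_prob(q):
    """Base-call error probability for a Phred quality score q."""
    return 10 ** (-q / 10)

def decode_quality_string(qual, offset=33):
    """Convert a FASTQ quality string to Phred scores (Phred+33 assumed)."""
    return [ord(ch) - offset for ch in qual]

if __name__ == "__main__":
    for p in (0.0001, 0.001, 0.01, 0.05):
        print(f"p-value {p}: Q = {error_prob_to_phred(p):.1f}")
    print(decode_quality_string("II5"))
```

So the four thresholds on the slide correspond to roughly Q40, Q30, Q20 and Q13.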


Short Read Alignment Software
• BFAST MAQ SSAHA and
• BLASTN mrFAST and SSAHA2
• BLAT mrsFAST STAR
• Bowtie MOSAIK TopHat
• Bowtie2 Novoalign ~20 more…
• BWA RUM
• ELAND SHRiMP
• GNUMAP SOAP
• GMAP and SpliceMap
GSNAP
http://en.wikipedia.org/wiki/List_of_sequence_alignment_software

http://tinyurl.com/seqanswers-mapping
20
Issues of Consideration for Alignment
Software

Library types:
• Genomic DNA (for resequencing)
• ChIP DNA (PCR bias)
• RNA-seq cDNA
– mRNA-seq (junction mapping)
– smRNA-seq (adapter trimming)

[Figure: Illumina library fragment, a DNA fragment flanked by adapters]
Issues of Consideration for Alignment
Software

Types of reads
• Single-end, e.g. SRR036642.fastq
• Paired-end (mates 1 and 2), e.g. SRR027894_1.fastq, SRR027894_2.fastq
– Mean and standard deviation of the inner distance
Issues of Consideration for Alignment
Software

Platform differences

• Bases (ACTG)

• Colorspace (2-base encoding, SOLiD)

• Read Length

• 454 (homopolymers)

Issues of Consideration for Alignment
Software

Software Properties
• Open-source or proprietary ($)
• Accuracy
• Speed of algorithm
• Multi-threaded or single processor
• RAM requirements (2GB vs 50GB for loading index)
• Use of base quality score
• Gapped alignment (indels)

“Mapping reads to the reference” is
finding where their sequence occurs in the genome

[Figure: a paired-end fragment with 100 bp identified at each end and 200–500 bp of unknown sequence in between]

Source: Wikimedia, file:Mapping Reads.png


“Mapping reads to the reference”:
naïve text search algorithms are too slow
• Naïve approach: compare each read with every position in the genome
– Takes too long, will not find sequences with mismatches

• Search programs typically create an index of the reference sequence (or
text) and store the reference sequence (text) in an advanced data
structure for fast searching.
• An index is basically like a phone book (with addresses): quickly find
the address (location) of a person

Example of an algorithm using ‘indexed seed tables’ to quickly find locations
of exact parts of a read
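The phone-book idea can be sketched with a plain dictionary that maps every k-mer of the reference to the positions where it occurs. Each k-mer of a read is then looked up as a seed, and the whole read is verified at the implied start position, allowing a few mismatches. This is a toy seed-and-verify illustration only; BWA and Bowtie2 use compressed FM-indexes rather than a hash table, and the function names and the example reference are made up.

```python
from collections import defaultdict

def build_index(reference, k):
    """Map every k-mer of the reference to the list of positions where it starts."""
    index = defaultdict(list)
    for i in range(len(reference) - k + 1):
        index[reference[i:i + k]].append(i)
    return index

def map_read(read, reference, index, k, max_mismatches=1):
    """Seed-and-verify: look up each k-mer of the read, then check the full read."""
    hits = set()
    for offset in range(len(read) - k + 1):
        for pos in index.get(read[offset:offset + k], []):
            start = pos - offset  # where the read would begin on the reference
            if start < 0 or start + len(read) > len(reference):
                continue
            window = reference[start:start + len(read)]
            mismatches = sum(a != b for a, b in zip(read, window))
            if mismatches <= max_mismatches:
                hits.add((start, mismatches))
    return sorted(hits)

if __name__ == "__main__":
    reference = "ACGTTGCATGCCGATAGCTAGGCTA"  # made-up example sequence
    index = build_index(reference, k=4)
    print(map_read("GCCGATAGC", reference, index, k=4))  # exact copy of a region
    print(map_read("GCCGTTAGC", reference, index, k=4))  # same read with one mismatch
```

The index is built once per reference and reused for every read, which is what makes the lookup fast compared with scanning the genome for each read.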
Read Mapping: General problems

• Read can match equally well at more than one location (e.g.
repeats, pseudo-genes)

• A read can have an imperfect hit to its actual position, e.g. if it
carries a breakpoint, SNP, insertion and/or deletion compared
to the reference sequence
Output of read mapping: SAM and BAM files

• SAM = Sequence Alignment/Map format
• BAM = Binary SAM = compressed SAM
• Contains information about how sequence reads map to a
reference genome
• Supports paired-end reads and color space from SOLiD
• Is produced by Bowtie, BWA and other mapping tools
SAM and BAM formats/files
• After mapping the FASTQ file to the reference genome you will end
up with a SAM or BAM alignment file
• SAM stands for Sequence Alignment/Map format
• A single SAM/BAM file can store mapped, unmapped, and even QC-failed
reads from a sequencing run, so the raw sequencing data can be fully
recapitulated from the SAM/BAM file. A coordinate-sorted BAM file can
also be indexed to allow rapid access.
SAM and BAM formats/files

• Plain-text SAM is rarely needed directly and takes up a lot of
space, which is why in practice we usually keep only the BAM

• A BAM file (.bam) is the binary version of a SAM file (it saves
storage and allows faster manipulation)
SAM and BAM formats/files
• A SAM file (.sam) is a tab-delimited text file that contains sequence
alignment data
• SAM files can be opened using a text editor or viewed using the UNIX
"more" command
• Most alignment programs will supply:
– a header: describing the format version, sorting order of the reads,
genomic sequences to which the reads were mapped
– an alignment section: contains the information for each sequence about
where/how it aligns to the reference genome
http://samtools.sourceforge.net/SAM1.pdf
http://genome.sph.umich.edu/wiki/SAM

CIGAR stands for Concise Idiosyncratic Gapped Alignment Report. It is
a compressed representation of an alignment that is used in the SAM file
format.
Output of read mapping: SAM file
Read_1 0 ENSG00000262694|HG1257_PATCH|72905355|72987235 937 1 70M * 0 0 CCACGAAAACTC…. III….. AS:i:-12 XS:i:-12 XN:i:0 XM:i:2 XO:i:0 XG:i:0 NM:i:2 MD:Z:11T38T19 YT:Z:UU

70M = 70 aligned positions (M covers both matches and mismatches; NM:i:2 reports an edit distance of 2, here 2 mismatches)
Total match proportion = 70/70

Read_2 0 ENSG00000091664|11|22359643|22401049 39302 24 4M1I1M8I56M * 0 0 AATGACAAAGAATAA….. IIII… AS:i:-79 XN:i:0 XM:i:7 XO:i:2 XG:i:9 NM:i:16 MD:Z:1C28C14C0C1A8T2T0 YT:Z:UU

4M1I1M8I56M = 4 matches, 1 inserted base, 1 match, 8 inserted bases, 56 matches
Total match proportion = (4 + 1 + 56) / 70 = 61/70

CIGAR informs –
• How much of a read has been matched (and has insertions and deletions)
• Where those matches (and insertions/deletions) are
(http://samtools.github.io/hts-specs/SAMv1.pdf)

• QNAME: Query template NAME. Reads/segments having identical QNAME are
regarded to come from the same template. A QNAME ‘*’ indicates the
information is unavailable.
• Used to group/identify alignments that belong together, like paired alignments or a
read that appears in multiple alignments.
Explain flag tool:
https://broadinstitute.github.io/picard/explain-flags.html
Example: POS: 5, CIGAR: 3M1I3M1D2M
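The CIGAR arithmetic on these slides can be reproduced with a small parser. The sketch below follows the operation semantics of the SAM specification (M, I, S, = and X consume read bases; M, D, N, = and X consume reference bases) and reports the read length, the number of aligned (M-type) bases, and the reference interval covered, given the 1-based POS. The function names are illustrative and not part of any library.

```python
import re

CIGAR_RE = re.compile(r"(\d+)([MIDNSHP=X])")
QUERY_OPS = set("MIS=X")      # operations that consume read bases
REFERENCE_OPS = set("MDN=X")  # operations that consume reference bases
ALIGNED_OPS = set("M=X")      # aligned columns (match or mismatch)

def parse_cigar(cigar):
    """Split a CIGAR string into (length, operation) pairs."""
    return [(int(n), op) for n, op in CIGAR_RE.findall(cigar)]

def summarize(cigar, pos):
    """Summarise an alignment from its CIGAR string and 1-based leftmost POS."""
    ops = parse_cigar(cigar)
    ref_len = sum(n for n, op in ops if op in REFERENCE_OPS)
    return {
        "query_length": sum(n for n, op in ops if op in QUERY_OPS),
        "aligned_bases": sum(n for n, op in ops if op in ALIGNED_OPS),
        "ref_start": pos,
        "ref_end": pos + ref_len - 1,
    }

if __name__ == "__main__":
    print(summarize("70M", 937))
    print(summarize("4M1I1M8I56M", 39302))
    print(summarize("3M1I3M1D2M", 5))
```

For 4M1I1M8I56M this gives 61 aligned bases out of a 70-base read, matching the 61/70 worked out above; for POS 5 with 3M1I3M1D2M the alignment spans reference positions 5 to 13.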
Short Read Alignment: Focus on BWA

Visualizing mapping results

IGV: Integrative Genomics Viewer
Harvesting Information from SAM

• Query name, QNAME (SAM) / read_name (BAM).


• FLAG provides the following information:
– are there multiple fragments?
– are all fragments properly aligned?
– is this fragment unmapped?
– is the next fragment unmapped?
– is this query the reverse strand?
– is the next fragment the reverse strand?
– is this the last fragment?
– is this a secondary alignment?
– did this read fail quality controls?
– is this read a PCR or optical duplicate?
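Each of these questions is one bit of the FLAG integer, so a flag can be decoded with bitwise AND. The bit values below are those defined in the SAM specification (which also defines a "first segment" bit, 0x40, and a "supplementary alignment" bit, 0x800, not listed on the slide); the descriptive labels and the function name are my own wording.

```python
SAM_FLAGS = [
    (0x1, "template has multiple segments (paired)"),
    (0x2, "each segment properly aligned"),
    (0x4, "segment unmapped"),
    (0x8, "next segment unmapped"),
    (0x10, "read on reverse strand"),
    (0x20, "next segment on reverse strand"),
    (0x40, "first segment in template"),
    (0x80, "last segment in template"),
    (0x100, "secondary alignment"),
    (0x200, "failed quality controls"),
    (0x400, "PCR or optical duplicate"),
    (0x800, "supplementary alignment"),
]

def explain_flag(flag):
    """Return the descriptions of all bits that are set in a SAM FLAG value."""
    return [name for bit, name in SAM_FLAGS if flag & bit]

if __name__ == "__main__":
    for flag in (0, 16, 99):
        print(flag, explain_flag(flag))
```

A FLAG of 0, as in the two example records shown earlier, means none of the bits is set: an unpaired read mapped to the forward strand.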

Source: www.cs.colostate.edu/~cs680/Slides/lecture3.pdf
Visualization of output in Integrative Genomics Viewer (IGV)

• IGV
– http://www.broadinstitute.org/igv/projects/current/igv_mm.jnlp (Windows)
– http://www.broadinstitute.org/igv/projects/current/igv_lm.jnlp (Mac)
SV/CNV/Variant Calling

• Structural variations (SV): deletions, duplications, copy-number
variations, insertions, inversions, translocations.
• Copy number variations (CNV): deletions or duplications
of genes or relatively large regions of the genome that
affect chromosomes
• Variant calling (SNPs and small InDels)
– SNPs: affect only 1 nucleotide
– InDels: affect 1 or several nucleotides
Overview of SV/CNV/Variant Calling

Adapted from Scherer et al 2007


VCF (Variant Call Format)
• VCF (Variant Call Format) - Text file format
storing SNP and InDel information
(http://www.1000genomes.org/node/101)
• Obtaining variants listed in this format is a
multistep but standardized procedure involving
different tools
• Headers (meta-information) + data lines - 8
required fields, tab-delimited
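The eight required, tab-delimited columns of a VCF data line are CHROM, POS, ID, REF, ALT, QUAL, FILTER and INFO. The sketch below skips header lines (those starting with "#"), splits a data line into these fields, and labels each record as a SNP or an InDel by comparing allele lengths. The example lines are made up, and the classification assumes simple sequence alleles (no symbolic alleles such as <DEL>).

```python
VCF_FIELDS = ["CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO"]

def parse_vcf_line(line):
    """Split one VCF data line into the eight required fields."""
    parts = line.rstrip("\n").split("\t")
    record = dict(zip(VCF_FIELDS, parts[:8]))
    record["POS"] = int(record["POS"])
    record["ALT"] = record["ALT"].split(",")  # several alternate alleles are comma-separated
    return record

def variant_type(record):
    """Classify by allele length: SNP if all alleles are 1 base, InDel if lengths differ."""
    ref = record["REF"]
    if all(len(alt) == len(ref) == 1 for alt in record["ALT"]):
        return "SNP"
    if any(len(alt) != len(ref) for alt in record["ALT"]):
        return "InDel"
    return "other"

def read_vcf(lines):
    """Yield parsed records, skipping meta-information and header lines."""
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue
        yield parse_vcf_line(line)

if __name__ == "__main__":
    example = [  # made-up records for illustration
        "##fileformat=VCFv4.2",
        "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
        "chr1\t100\t.\tA\tG\t50\tPASS\tDP=10",
        "chr1\t200\t.\tAT\tA\t40\tPASS\tDP=12",
    ]
    for rec in read_vcf(example):
        print(rec["CHROM"], rec["POS"], rec["REF"], rec["ALT"], variant_type(rec))
```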
Variants annotation
• Variant annotation programs: SnpEff
- A variant annotation and effect prediction tool.
- Annotates and predicts the effects of variants on genes: Are they in a
gene? In an exon? Do they change protein coding? Do they cause
premature stop codons?
Variants annotation
Biological interpretation
From Variant annotation to data mining
• web-based
• available packages
Aim
• Functional impact of variants (synonymous or not…)
• Gene Ontology Annotation (BP, MF, CC)
• Pathway/Network information
• Predictions of pathogenicity/severity
NB: DAVID (Database for Annotation, Visualization and
Integrated Discovery) to switch between databases
https://david.ncifcrf.gov/
Downstream Processing
Finding and annotating peaks (ChIP-seq)
Assembling/annotating transcripts, identify differential gene expression (RNA-seq)
SNP and structural variation identification, prediction of effects (DNA-seq)
Etc.

Park, Nat Rev Genet, 2009
http://grimmond.imb.uq.edu.au/mammalian_transcriptome.html
Tutorials

• https://datacarpentry.org/wrangling-genomics/
• https://genomics.sschmeier.com/ngs-variantcalling/index.html
• https://learn.gencore.bio.nyu.edu/variant-calling/
De novo genome assembly

• De novo sequencing refers to sequencing a novel genome
where there is no reference sequence available for
alignment.
• Sequence reads are assembled as contigs, and the coverage
quality of de novo sequence data depends on the size and
continuity of the contigs (i.e., the number of gaps in the data).
Alignment versus De Novo Assembly

Short sequence “reads”: is a reference genome available?
(http://www.ncbi.nlm.nih.gov/sites/genome, “Browse by organism groups”)
• No: de novo assembly
General strategy of assembling a genome de novo
1. Pre-process short reads (trim, quality filter…)
2. Assemble sequences into contigs
3. Order contigs into scaffolds
4. Annotate genome
Pipeline for de novo assembly
1. Choose genome, gather info
2. DNA library preparation
3. Sequencing
4. Quality check
5. Trimming
6. Error correction
7. Merge overlapping reads
8. ASSEMBLE!
9. Assemble again and again (different tools, k-mers)
10. Fill gaps
11. Evaluate assembly contiguity
12. Evaluate assembly gene content
13. Choose a final assembly
14. Re-scaffold
15. ANNOTATE!
Best Assembly Advice

• Remember: your goal is to have a genome assembly.
• But you will not be doing one assembly.
• In the end you will have many assemblies to
choose from.
– Because you will be doing a lot of work!
• Use a lot of assembly tools for a lot of k values.
– Large k can better resolve repeats
– Comes at coverage cost
• The whole process should take a few months.
De novo assembly basics
• Find all overlaps between reads
• Build a graph
• Simplify the graph (sequencing errors)
• Traverse a graph to produce a consensus.
Assembly Algorithms

1. Greedy

2. Overlap-layout-consensus (OLC)

3. De Bruijn Graph

Schatz M C et al. Genome Res. 2010;20:1165-1173

Greedy
Was used in the very early next-gen assemblers (e.g. SSAKE, VCAKE)
1. The highest-scoring overlap is merged first, and the resulting contig is
then extended with the read that gives the next highest score
2. The paired-end reads are used to generate supercontigs
3. Mate pairs could also be used to determine contig order

* Repeats can cause big problems in this approach
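The greedy idea can be sketched in a few lines: repeatedly find the pair of sequences with the longest exact suffix-prefix overlap and merge them, until no overlap of at least a minimum length remains. This is a toy illustration under strong assumptions (error-free reads, one strand only, exact overlaps, no paired-end information); it is not how SSAKE or VCAKE are implemented, and the reads in the demo are made up.

```python
def overlap(a, b, min_length=3):
    """Length of the longest suffix of a that equals a prefix of b (0 if too short)."""
    for length in range(min(len(a), len(b)), min_length - 1, -1):
        if a.endswith(b[:length]):
            return length
    return 0

def greedy_assemble(reads, min_length=3):
    """Merge the best-overlapping pair again and again until nothing overlaps."""
    contigs = list(dict.fromkeys(reads))  # drop duplicate reads, keep order
    while True:
        best = (0, None, None)
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    olen = overlap(a, b, min_length)
                    if olen > best[0]:
                        best = (olen, i, j)
        olen, i, j = best
        if olen == 0:
            return contigs  # no remaining overlaps: assembly stops here
        merged = contigs[i] + contigs[j][olen:]
        contigs = [c for k, c in enumerate(contigs) if k not in (i, j)] + [merged]

if __name__ == "__main__":
    reads = ["ATGGCGT", "GCGTGCA", "TGCAATC"]  # made-up, error-free reads
    print(greedy_assemble(reads))
```

Because each merge is final, a repeat that overlaps several reads equally well can send the assembly down the wrong path, which is the weakness noted on the slide.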
Imperfect Overlap Between Reads Can Lead to
Incorrect Assembly in the Greedy Approach

[Figure: an imperfect overlap between reads, with the correct and the incorrect assembly that can result]

Brief Bioinform. 2009 July; 10(4): 354–366.
Greedy Extension Leads to Arrested Assembly
if Multiple Matches are Found
[Figure: two unassembled reads both match the end of the existing contig; the assembler can’t resolve which one to use, so assembly stops]
Overlap Graph or Overlap-layout-consensus
(OLC)

• Performs better overall
• All-against-all read comparison using k-mers as seeds;
a seed-and-extend algorithm is used.
• Good for long reads (e.g. Sanger or
other >100 bp, such as 454, Ion Torrent,
PacBio) due to the minimum overlap
threshold
• Examples: CABOG (Celera), ARACHNE
• Newbler, developed for 454, is based on
OLC and is now being used for
Ion Torrent
De Bruijn Graph
• It breaks reads into successive k-mers and the graph maps the k-mers
• Each k-mer is a node and edges are drawn between each k-mer in a
read.
• Repeat sequences create a fork in the graph; alternative sequences
create a bubble.
• The k-mer size can only be determined by “trial and error”.
• A small value of K will create a complex graph but a large value of K
may miss small overlaps. A good starting point would be a k-mer size
that is 2/3 the size of the read
• Good for short reads or small genomes. With long reads and/or large
genomes, may require lots of RAM (e.g., ~0.5 TB for human)

Examples are: Velvet, SOAPdenovo, ALLPATHS-LG, ABySS
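The graph construction described above can be sketched directly: every k-mer becomes a node, and an edge links each k-mer to the next k-mer in the same read (the two overlap by k-1 bases). A node with more than one outgoing edge marks the start of a fork or bubble. This follows the slide's k-mers-as-nodes formulation; real assemblers add coverage counts, reverse complements and graph simplification, and the reads below are made up.

```python
def kmers(read, k):
    """All successive k-mers of a read."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]

def build_graph(reads, k):
    """Nodes are k-mers; an edge links each k-mer to the next one in a read."""
    graph = {}
    for read in reads:
        ks = kmers(read, k)
        for kmer in ks:
            graph.setdefault(kmer, set())  # make sure every k-mer is a node
        for left, right in zip(ks, ks[1:]):
            graph[left].add(right)
    return graph

def branching_nodes(graph):
    """Nodes with more than one successor, i.e. where a fork or bubble begins."""
    return sorted(node for node, nxt in graph.items() if len(nxt) > 1)

if __name__ == "__main__":
    reads = ["ATGCGTA", "TGCGAT"]  # made-up reads that diverge after GCG
    graph = build_graph(reads, k=3)
    for node in sorted(graph):
        print(node, "->", sorted(graph[node]))
    print("branching:", branching_nodes(graph))
```

Re-running the sketch with a different k changes the number of nodes and branches, which is the trade-off behind the "trial and error" choice of k-mer size.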
De novo assembly tools
List of de-novo assemblers
Short read assembler
• SPAdes; http://cab.spbu.ru/software/spades/
• Velvet (European Bioinformatics Institute);
https://en.wikipedia.org/wiki/Velvet_assembler
• SOAPdenovo;
https://www.animalgenome.org/bioinfo/resources/manuals/SOAP.html
• ABySS; https://github.com/bcgsc/abyss
De novo assembly tools
Assembly quality assessment/
evaluation tools
• QUAST – Quality Assessment Tool for Genome
Assemblies
– evaluates assemblies both with a reference genome and
without a reference.
• Benchmarking Universal Single-Copy Orthologs (BUSCO)
– is a tool to assess the completeness of a genome assembly,
gene set or transcriptome. It is based on the concept of
single-copy orthologs that should be highly conserved
among closely related species.
Evaluating the assembly
• Genome assembly results:
– contig size and number of contigs produced
– scaffold size and number
– N50 and N90
• Coverage
• GC content
• Genome annotation
– repeats analysis and annotation
– protein-coding gene annotation (including gene structure
prediction and gene function annotation)
– non-coding RNA gene annotation (including annotation of
microRNA, tRNA, rRNA, and other ncRNA)
– transposon and tandem repeats annotation
• Comparative genomics and evolution (chromosome structure,
conserved gene families)
• QUAST (QUality ASsessment Tool) evaluates
genome assemblies by computing various
metrics, including:
1. N50: length for which the collection of all contigs of that
length or longer covers at least 50% of assembly length.
2. L50: The minimum number X such that X longest contigs
cover at least 50% of the assembly
Evaluating the assembly
Basic statistics
N50 = the length of the shortest contig such that the sum of contigs of equal length
or longer is at least 50% of the total length of all contigs. Equivalently, 50% of the entire
assembly is contained in contigs or scaffolds equal to or larger than the N50.
Contig size (bp)
3000
2000 (N50: 3000 + 2000 = 5000 ≥ 50% of 8000)
1200
800
600 (N90: 3000 + 2000 + 1200 + 800 + 600 = 7600 ≥ 90% of 8000)
400
Total: 8000
N90 = the length of the shortest contig such that the sum of contigs
of equal length or longer is at least 90% of the total length of all
contigs.
Contig or Scaffold N50
• Most widely used statistic for genome
assemblies
• Measure of contiguity
• Take all contigs and sort them from shortest to
longest. The N50 is the length of the contig for
which half of the assembly is comprised of
contigs at least this length.
• More informative than mean
Contig or Scaffold N50
• 1,1,1,1,1,1,1,1,2,2,3,4,6,6,8,9,9,9,10,24
– Mean = 5
– N50 = 9

• N50 can be manipulated if you eliminate small contigs
– Which may be useless anyway

• NG50 – uses genome size instead of assembly length
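The definitions of N50, N90, L50 and NG50 all come down to the same loop: sort contigs from longest to shortest and add up lengths until the running sum reaches the required percentage of the total. The sketch below uses the two contig sets from these slides; the function name and the genome size used for the NG50 call are assumptions for illustration, and QUAST's own implementation may differ in detail.

```python
def nx(lengths, percent=50, total=None):
    """Return (Nx, Lx): the contig length and the number of contigs at which the
    running sum, longest first, reaches `percent` of `total`.
    `total` defaults to the assembly length; pass a genome size to get NGx."""
    ordered = sorted(lengths, reverse=True)
    if total is None:
        total = sum(ordered)
    running = 0
    for count, length in enumerate(ordered, start=1):
        running += length
        if running * 100 >= percent * total:  # integer arithmetic, no rounding issues
            return length, count
    return None, None  # contigs never reach the requested share of `total`

if __name__ == "__main__":
    contigs = [3000, 2000, 1200, 800, 600, 400]
    print("N50, L50:", nx(contigs, 50))
    print("N90, L90:", nx(contigs, 90))

    small = [1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 3, 4, 6, 6, 8, 9, 9, 9, 10, 24]
    print("mean:", sum(small) / len(small))
    print("N50, L50:", nx(small, 50))
    print("NG50 for an assumed genome size of 120:", nx(small, 50, total=120))
```

Dropping the eight contigs of length 1 from the second set raises the mean but leaves N50 unchanged in this case, which illustrates why N50 is considered more informative than the mean.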
Choosing a de novo Assembler

Assemblathon 1
• Genome Res. 2011 21: 2224-2241
Genome Assembly Gold-standard Evaluations (GAGE)
• Genome Res. 2012 22: 557-567
• http://gage.cbcb.umd.edu/results/index.html
Genome annotation
• Two main levels:
– Structural annotation = nucleotide/protein-level
annotation: finding genes and other biologically
relevant sites, thus building up a model of the genome as
objects with specific locations
– Functional annotation: objects are used in
database searches (and experiments); the aim is to
attribute biologically relevant information to the whole
sequence and to individual objects
• Annotation is the rate-limiting step of sequencing
projects
What are we looking to annotate?
• Protein Coding genes
• CDS
• mRNA
• Promoter and Poly-A Signal
• Alternatively spliced RNA
• Pseudogenes
• ncRNA
What are genes?
• Complete DNA segments responsible for making functional
products
• Products
• Proteins
• Functional RNA molecules
• miRNA (micro RNA)
• rRNA (ribosomal RNA)
• snRNA (small nuclear)
• snoRNA (small nucleolar)
• tRNA (transfer RNA)
Pseudogenes
• Non-functional copy of a gene
• Processed pseudogene
• Retro-transposon derived
• No 5’ promoters
• No introns
• Often includes polyA tail
• Non-processed pseudogene
• Gene duplication derived
• Both include events that make the gene non-functional
• Frameshift
• Stop codons
• We assume pseudogenes have no function, but we
really don’t know!
Noncoding RNA (ncRNA)

• ncRNA represent 98% of all transcripts in a mammalian cell
• ncRNA have not been taken into account in gene
counts
• cDNA
• ORF computational prediction
• Comparative genomics looking at ORF
• ncRNA can be:
• Structural
• Catalytic
• Regulatory
Genome Annotation Approaches

• Gene Predictions using software


• Identifying Open Reading Frames
• Looking for well studied splice junction sites.
• Training the algorithm on what is already known
about gene structure.
• Experimental evidence
• Sequencing RNA molecules (ESTs)
• Homology to genes in other species that are already
experimentally validated.
Prokaryotic gene model: ORF-genes

• “Small” genomes, high gene density


• Haemophilus influenzae genome 85% genic
• Operons
• One transcript, many genes
• No introns.
• One gene, one protein
• Open reading frames
• One ORF per gene
• ORFs begin with a start codon and end with a stop codon (by definition)
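Because a prokaryotic gene is essentially one ORF, a first-pass gene finder can simply scan all six reading frames for a start codon followed in frame by a stop codon. The sketch below assumes ATG as the only start codon and the three standard stop codons (TAA, TAG, TGA); real prokaryotic gene finders also consider alternative starts such as GTG and TTG and score ORFs statistically. The function names, the minimum-length parameter and the example sequence are made up for illustration.

```python
START = "ATG"
STOPS = {"TAA", "TAG", "TGA"}

def reverse_complement(seq):
    """Reverse complement of an uppercase DNA sequence."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]

def find_orfs(seq, min_codons=3):
    """Yield (strand, start, end, orf) for ATG...stop ORFs in all six frames.
    Coordinates are 0-based, end-exclusive, on the strand that was scanned;
    min_codons counts the start and stop codons as well."""
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            start = None
            for i in range(frame, len(s) - 2, 3):
                codon = s[i:i + 3]
                if start is None and codon == START:
                    start = i  # first start codon opens the ORF
                elif start is not None and codon in STOPS:
                    if (i + 3 - start) // 3 >= min_codons:
                        yield strand, start, i + 3, s[start:i + 3]
                    start = None  # look for the next ORF in this frame

if __name__ == "__main__":
    sequence = "CCATGAAATTTGGGTAACC"  # made-up example
    for orf in find_orfs(sequence):
        print(orf)
```

This approach breaks down in eukaryotes, where introns interrupt the coding sequence, which is why the gene prediction programs discussed later use more elaborate models.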

TIGR: http://www.tigr.org/tigr-scripts/CMR2/CMRGenomes.spl
NCBI: http://www.ncbi.nlm.nih.gov/PMGifs/Genomes/micr.html
The challenge of eukaryotic genomes
• E. coli genome: 4 million bp
• The human genome: 3 billion bp
• 50% of the human genome is repeat sequences!
Gene prediction programs

• Rule-based programs
• Use explicit set of rules to make decisions.
• Example: GeneFinder
• Neural Network-based programs
• Use data set to build rules.
• Examples: Grail, GrailEXP
• Hidden Markov Model-based programs
• Use probabilities of states and transitions between
these states to predict features.
• Examples: Genscan, GenomeScan
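To make "probabilities of states and transitions" concrete, here is a toy two-state hidden Markov model decoded with the Viterbi algorithm. The states, the GC-rich versus AT-rich emission probabilities and the transition probabilities are all invented for illustration; programs such as Genscan use far richer models (exons, introns, splice sites, codon structure), so this is only a sketch of the principle.

```python
import math

STATES = ("noncoding", "coding")
START_P = {"noncoding": 0.5, "coding": 0.5}
# Assumed toy parameters: states tend to persist, and "coding" is GC-rich.
TRANS_P = {
    "noncoding": {"noncoding": 0.9, "coding": 0.1},
    "coding": {"noncoding": 0.1, "coding": 0.9},
}
EMIT_P = {
    "noncoding": {"A": 0.35, "T": 0.35, "G": 0.15, "C": 0.15},
    "coding": {"A": 0.15, "T": 0.15, "G": 0.35, "C": 0.35},
}

def viterbi(seq):
    """Most probable state path for a non-empty DNA sequence (log-space Viterbi)."""
    # score[s] = best log-probability of any path that ends in state s
    score = {s: math.log(START_P[s]) + math.log(EMIT_P[s][seq[0]]) for s in STATES}
    backpointers = []
    for base in seq[1:]:
        new_score, pointers = {}, {}
        for s in STATES:
            prev = max(STATES, key=lambda p: score[p] + math.log(TRANS_P[p][s]))
            new_score[s] = (score[prev] + math.log(TRANS_P[prev][s])
                            + math.log(EMIT_P[s][base]))
            pointers[s] = prev
        score = new_score
        backpointers.append(pointers)
    # Trace back from the best final state
    state = max(STATES, key=lambda s: score[s])
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    return path[::-1]

if __name__ == "__main__":
    sequence = "ATATATAT" + "GCGCGCGC" + "ATATATAT"
    for base, state in zip(sequence, viterbi(sequence)):
        print(base, state)
```

In practice the probabilities are estimated from known genes, which is the source of the training bias listed under the common difficulties.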
Common difficulties

● First and last exons are difficult to annotate because they
contain UTRs.
● Smaller genes are not statistically significant so they are
thrown out.
● Algorithms are trained with sequences from known genes
which biases them against genes about which nothing is
known.
● Masking repeats frequently removes potentially
indicative chunks from the untranslated regions of genes
that contain repetitive elements.
Tutorials
● https://www.hadriengourle.com/tutorials/assembly/
● https://colauttilab.github.io/NGS/deNovoTutorial.html
● https://www.geneious.com/tutorials/de-novo-assembly/
