Lecture 05


BLAST & Beyond

Computational Biology Lecture 5


Review from last time

Pairwise local alignment (e.g., Smith-Waterman) asks,

“Is any part of A similar to any part of B?”

This is useful for answering specific analytical questions when you have A and B in hand.

We would frequently prefer to ask a question such as,

“Is any part of A similar to ... anything?”

Here we have sequence A, and “anything” represents a special set of sequences or a sequence database.
FASTA file

Here is a nucleotide sequence as represented in a FASTA (or Pearson) format file:
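In FASTA format, each record starts with a header line beginning with `>`, followed by one or more lines of sequence. A minimal Python sketch of reading such text is given here; the record `seq1_hypothetical` and its bases are made up for illustration and do not correspond to any real database entry.

```python
def parse_fasta(text):
    """Return a list of (header, sequence) pairs from FASTA-formatted text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        else:
            chunks.append(line)  # sequence may span several lines
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


# Hypothetical record, invented for illustration only.
EXAMPLE = """>seq1_hypothetical example nucleotide record
ATGGCGTACGTTAGC
GGATCCTAA
"""

records = parse_fasta(EXAMPLE)
for header, sequence in records:
    print(header, len(sequence))
```

The header line carries the identifier and description; the sequence itself says nothing about its origin, which is what motivates the questions that follow.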


➢ Where does this come from?
➢ Which organism/gene/protein does this belong to?
➢ What other similar sequences exist out there?
For these, we need a database search
In database searching, the basic operation is to sequentially align a query
sequence to each subject sequence in the database. The results are reported as a
ranked hit list followed by a series of individual sequence alignments, plus
various scores and statistics.
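The basic operation described above can be sketched as a loop: align the query to every subject, then sort the hits by score. This is a toy illustration, assuming a simple match/mismatch/linear-gap scoring scheme and a made-up three-sequence database; the function names and parameter values are hypothetical, and real search tools avoid this exhaustive scan.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score between a and b (linear gap penalty)."""
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Local alignment: a cell never drops below zero.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


def search_database(query, database):
    """Align the query to each subject in turn; return hits ranked by score."""
    hits = [(name, smith_waterman_score(query, subject))
            for name, subject in database.items()]
    return sorted(hits, key=lambda hit: hit[1], reverse=True)


# Made-up subject sequences, for illustration only.
DATABASE = {
    "subject_1": "TTGACGTACGTTAGG",
    "subject_2": "CCCCCCCC",
    "subject_3": "ACGTAC",
}

hit_list = search_database("ACGTACGT", DATABASE)
for name, score in hit_list:
    print(name, score)
```

Each alignment costs time proportional to the product of the two sequence lengths, so the total cost grows with the size of the whole database.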
National Center for Biotechnology Information (NCBI)

The most widely used interface for the retrieval of information from biological
databases is the NCBI Entrez system. Entrez capitalizes on the fact that there are
preexisting, logical relationships between the individual entries found in numerous
public databases.

FASTA search:

The first widely used program for database similarity searching was FASTA (Lipman and Pearson, 1985; Pearson and Lipman, 1988; Pearson, 2000).
BLAST
BLAST: the Basic Local Alignment Search Tool

Time complexity & the motivation for BLAST.

Tour of the online version of BLAST.

Interpreting BLAST statistics.


91,568 citations
USING BLAST

Where to use BLAST

BLAST Website.

Run BLAST on a computer that talks to the internet.

Run BLAST on a computer with a built-in database.

*All three are conceptually the same


https://blast.ncbi.nlm.nih.gov/Blast.cgi
BLAST a protein sequence
Entering data and parameters

Input query sequence.

Select your database.

Fine-tune behavior.

RUN!
Alignments summary: Best hit!

This is often your query sequence, provided that it was already in the BLAST sequence database.
Alignments summary: Another hit
BLAST STATISTICS

Coverage

Compare the length of the aligned portion of a sequence to its original length to determine coverage.
Coverage tells us how global/local the alignment is.

Note: “subject” always refers to a sequence in the database.
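As a minimal sketch of the coverage calculation, assuming 1-based inclusive alignment coordinates (the function name and the example numbers are hypothetical):

```python
def coverage(aligned_start, aligned_end, full_length):
    """Fraction of a sequence that falls inside the aligned region.

    Coordinates are assumed to be 1-based and inclusive.
    """
    aligned_length = aligned_end - aligned_start + 1
    return aligned_length / full_length


# A 200-residue query aligned over positions 11..110 of itself,
# against a 100-residue subject aligned over its full length.
query_cov = coverage(11, 110, 200)
subject_cov = coverage(1, 100, 100)
print(f"query coverage {query_cov:.0%}, subject coverage {subject_cov:.0%}")
```

Here the alignment is local with respect to the query (half of it is covered) but global with respect to the subject.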


Coverage: visualized

[Figure: alignment diagrams contrasting high coverage of the query with low coverage of the query.]
Identities

The number of identically paired amino acids in the alignment (e.g., ‘Val’ over ‘Val’).
Divide Identities by the sequence length, or length of the
aligned portion, to determine percent identity.
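A small sketch of counting identities and computing percent identity from a gapped alignment. It assumes the two aligned strings have equal length with `-` marking gaps, and it divides by the length of the aligned portion (one of the two denominators mentioned above); the sequences are made up.

```python
def identities(aligned_query, aligned_subject):
    """Count columns where both sequences carry the same residue."""
    return sum(1 for q, s in zip(aligned_query, aligned_subject)
               if q == s and q != "-")


def percent_identity(aligned_query, aligned_subject):
    """Identities divided by the alignment length (gap columns included)."""
    if len(aligned_query) != len(aligned_subject):
        raise ValueError("aligned sequences must have equal length")
    return 100.0 * identities(aligned_query, aligned_subject) / len(aligned_query)


# Made-up alignment, one-letter amino acid codes.
aligned_q = "MKV-LIA"
aligned_s = "MRVALLA"

n_identical = identities(aligned_q, aligned_s)
pid = percent_identity(aligned_q, aligned_s)
print(n_identical, round(pid, 1))
```

Choosing the full sequence length as the denominator instead would lower the percentage whenever coverage is below 100%.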
Positives (Similarity)

The number of positively scoring matches and mismatches in the alignment based on your scoring scheme (e.g., “Ile” over “Leu”).

Recall the BLOSUM62 matrix. Similar amino acids (positives) have a positive score here.
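To illustrate the difference between identities and positives, here is a sketch that uses only a handful of BLOSUM62 entries, typed in by hand for the pairs in a made-up ungapped alignment. The dictionary is a tiny subset, not the full matrix, and the helper names are hypothetical.

```python
# Hand-copied subset of BLOSUM62; keys are alphabetically sorted pairs.
BLOSUM62_SUBSET = {
    ("M", "M"): 5,
    ("K", "R"): 2,
    ("V", "V"): 4,
    ("L", "L"): 4,
    ("I", "L"): 2,
    ("A", "G"): 0,
}


def pair_score(a, b):
    """Look up the score for a residue pair, ignoring order."""
    return BLOSUM62_SUBSET[tuple(sorted((a, b)))]


def count_positives(aligned_query, aligned_subject):
    """Count columns whose substitution score is strictly positive."""
    return sum(1 for q, s in zip(aligned_query, aligned_subject)
               if pair_score(q, s) > 0)


# Made-up ungapped alignment.
aligned_q = "MKVLIA"
aligned_s = "MRVLLG"

n_positive = count_positives(aligned_q, aligned_s)
n_identical = sum(q == s for q, s in zip(aligned_q, aligned_s))
print(n_identical, n_positive)
```

Identical pairs always count as positives, so Positives is never smaller than Identities; the Ala/Gly column scores 0 and is therefore not a positive.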
Score = X bits (Y)

Y is the sum of the values (costs) for all paired amino acids in
the alignment, minus gap penalties.
BLOSUM62 values are the default.
The bit score X is normalized to be independent of the
scoring system; the details are complicated.
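A sketch of both numbers follows. The raw score Y is a plain sum; for the bit score, the standard Karlin-Altschul normalization X = (λ·Y − ln K) / ln 2 is used. The gap costs (11 to open, 1 per gapped position) and the values λ = 0.267 and K = 0.041 are assumptions here, taken as the figures commonly reported for gapped BLOSUM62 searches; the function names are hypothetical.

```python
import math


def raw_score(pair_scores, gap_opens=0, gap_positions=0,
              gap_open_cost=11, gap_extend_cost=1):
    """Sum of substitution scores minus gap penalties.

    Assumed gap model: each gap costs gap_open_cost once, plus
    gap_extend_cost for every gapped position.
    """
    penalty = gap_opens * gap_open_cost + gap_positions * gap_extend_cost
    return sum(pair_scores) - penalty


def bit_score(raw, lam, k):
    """Normalize a raw score to bits: (lambda * S - ln K) / ln 2."""
    return (lam * raw - math.log(k)) / math.log(2)


# Six aligned pairs plus one gap spanning two positions.
y = raw_score([5, 2, 4, 4, 2, 0], gap_opens=1, gap_positions=2)

# Assumed statistical parameters (see the lead-in).
x = bit_score(100, lam=0.267, k=0.041)
print(y, round(x, 1))
```

Because λ and K depend on the scoring matrix and gap costs, two searches with different scoring systems can still be compared through their bit scores.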
Expect a.k.a. E-value

Very important! The number of alignments at least this strong (based on Score) that would be expected by chance.

Expect = Query Length × Database Size × 2^(−Bit Score)

≈ p-value when Expect < 0.01
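The formula above translates directly into code. In this sketch the query length, database size, and bit score are made-up numbers chosen only to show the arithmetic.

```python
def expect_value(query_length, database_size, bit_score):
    """Expect = query length x database size x 2^(-bit score)."""
    return query_length * database_size * 2.0 ** (-bit_score)


# Hypothetical search: 250-residue query, database of 10^9 residues.
e_at_50_bits = expect_value(250, 10**9, 50)
e_at_51_bits = expect_value(250, 10**9, 51)
print(f"{e_at_50_bits:.2e} {e_at_51_bits:.2e}")
```

Two properties follow from the formula: every additional bit of score halves the E-value, and the same alignment becomes less significant as the database grows.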
Interpreting BLAST results

                HIGH similarity                     LOW similarity

HIGH coverage   Your Gene*, Population Variants,    Distant Homologs
                Close Homologs, Pseudogenes

LOW coverage    Shared Domains                      Dubious
Other BLAST programs

PROGRAM   QUERY                                             DATABASE

blastp    protein                                           protein

blastn    nucleotide                                        nucleotide

blastx    nucleotide (translated in all 6 reading frames)   protein

tblastn   protein                                           nucleotide (translated in all 6 reading frames)

tblastx   nucleotide (translated in all 6 reading frames)   nucleotide (translated in all 6 reading frames)

blastp is the most commonly used.
blastx is useful for aligning a coding nucleotide sequence to its protein product.
(Highlights non-coding DNA; useful in evolutionary analysis.)
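"Translated in all 6 reading frames" means three codon offsets on the given strand plus three on its reverse complement. A minimal sketch, assuming the standard genetic code and uppercase A/C/G/T input only; the frame labels (`+1` ... `-3`) and the example sequence are chosen here for illustration.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def reverse_complement(dna):
    """Complement each base, then reverse the strand."""
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(dna):
    """Translate complete codons; a trailing partial codon is dropped."""
    return "".join(CODON_TABLE[dna[i:i + 3]]
                   for i in range(0, len(dna) - 2, 3))


def six_frame_translations(dna):
    """Return the translation of each of the six reading frames."""
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


frames = six_frame_translations("ATGGCCAAATAG")
for label, protein in frames.items():
    print(label, protein)
```

A translated search then compares each of these protein strings against the other side of the search, which is why blastx can find a protein match even when the correct frame of the query is unknown.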
SUMMARY
BLAST provides a fast method for searching/aligning a
query sequence against a large sequence database.

The central idea is to quickly filter database sequences based on small-scale similarity before using DP.
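The filtering idea can be illustrated with a deliberately simplified sketch: keep only subjects that share at least one exact short word with the query, and reserve dynamic programming for those. This is not BLAST's actual procedure; the word length, the exact-match criterion, the function names, and the sequences are all assumptions made for illustration.

```python
def words(seq, k):
    """Set of all length-k substrings (words) of a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def candidate_subjects(query, database, k=3):
    """Names of subjects sharing at least one exact k-letter word with the query."""
    query_words = words(query, k)
    return [name for name, subject in database.items()
            if query_words & words(subject, k)]


# Made-up protein fragments.
DATABASE = {
    "subject_a": "TTMKVPP",
    "subject_b": "WWWWWW",
    "subject_c": "QLIAGQ",
}

candidates = candidate_subjects("MKVLIAGG", DATABASE)
print(candidates)
```

Word lookups are cheap compared with filling a DP matrix, so discarding subjects with no shared word removes most of the work before any full alignment is attempted.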

But how does it do it?



BLAST Algorithm (very brief)
