Bioinformatics: Blast and Sequence Analysis

BLAST allows rapid comparison of a query sequence against database sequences. It is a fast and accurate algorithm accessible via the web. BLAST is fundamental for understanding relatedness of a query to known proteins or DNA. It has applications like identifying orthologs and paralogs, discovering new genes, finding variants, and exploring protein structure and function. A typical BLAST search involves choosing a query sequence, selecting a BLAST program and database, and optional parameters before submitting the search.

Uploaded by

Salix Matt
Copyright
© All Rights Reserved

Bioinformatics

Chapter 3
BLAST AND SEQUENCE ANALYSIS
BLAST

BLAST (Basic Local Alignment Search Tool) allows rapid sequence comparison
of a query sequence against a database.

The BLAST algorithm is fast, accurate, and web-accessible.

page 101
Why use BLAST?

BLAST searching is fundamental to understanding the relatedness of any
query sequence of interest to other known proteins or DNA sequences.

Applications include
• identifying orthologs and paralogs:
For example, besides RBP (retinol-binding protein), what other lipocalins are known?
When a new bacterial genome is sequenced and several thousand
proteins are identified, how many of these proteins are paralogous?
How many of the predicted genes have no significantly related
matches in GenBank?

page 102
Why use BLAST?

Applications include
• discovering new genes or proteins
Are there any lipocalins such as RBP in plants?
Are there any reverse transcriptase genes (such as HIV-1 pol gene)
in fish?

page 102
Why use BLAST?

• discovering variants of genes or proteins
A DNA sequence may be searched against a protein database to learn
what proteins are most related to the protein encoded by your DNA
sequence. For example, many viruses are extremely mutable; what
HIV-1 pol variants are known?

• investigating expressed sequence tags (ESTs) that may exhibit
alternative splicing

• exploring protein structure and function
A BLAST search can reveal conserved residues such as cysteines that
are likely to have important biological roles.

page 102
A typical BLAST SEARCH has 4 components

(1) Choose the sequence (query)

(2) Select the BLAST program

(3) Choose the database to search

(4) Choose optional parameters

Then click “BLAST”
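
The four components can also be expressed as a command line instead of a web form. The sketch below assumes the NCBI BLAST+ command-line programs, which these slides do not cover; it only assembles the argument list and does not run a search. The function name and the file name `query.fasta` are hypothetical.

```python
def build_blast_command(program, query_path, database, evalue=10.0, remote=True):
    """Assemble a BLAST+ argument list from the four search components."""
    allowed = {"blastn", "blastp", "blastx", "tblastn", "tblastx"}
    if program not in allowed:
        raise ValueError(f"unknown BLAST program: {program}")
    command = [
        program,                 # (2) the BLAST program
        "-query", query_path,    # (1) the query sequence, stored as a FASTA file
        "-db", database,         # (3) the database to search
        "-evalue", str(evalue),  # (4) one optional parameter: the expect threshold
    ]
    if remote:
        command.append("-remote")  # send the search to NCBI's servers
    return command


# Hypothetical query file; nothing is executed here.
command = build_blast_command("blastp", "query.fasta", "nr")
print(" ".join(command))
```

Passing the list to `subprocess.run` would play the role of clicking “BLAST”, provided BLAST+ is installed.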

page 102
Step 1: Choose your sequence

• There are three main forms of data input: (1) cutting and
pasting a DNA or protein sequence, (2) using a sequence in the
FASTA format, and (3) simply using an accession number
[e.g., a RefSeq or GenBank Identification (GI) number].
• A sequence in FASTA format begins with a single-line
description followed by lines of sequence data.
• Do not confuse the FASTA format with the FASTA program.
• The description line is distinguished from the sequence data
by a greater-than (">") symbol in the first column.
• It is recommended that all lines of text be shorter than 80
characters in length.
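
The FASTA rules above are simple enough to check in a few lines of code. The following is a minimal sketch of a reader, not the parser BLAST itself uses; the function name and the example record are hypothetical.

```python
def parse_fasta(text):
    """Return a list of (description, sequence) pairs from FASTA-formatted text."""
    records = []
    description = None
    chunks = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            # A ">" in the first column starts a new description line.
            if description is not None:
                records.append((description, "".join(chunks)))
            description = line[1:].strip()
            chunks = []
        elif description is None:
            raise ValueError("sequence data found before any '>' description line")
        else:
            # Sequence data may span several lines; store it in capital letters.
            chunks.append(line.upper())
    if description is not None:
        records.append((description, "".join(chunks)))
    return records


# Illustrative record, not a real database entry.
example = """>example_1 hypothetical protein
mgrttseg
IHGFVDDL
"""
records = parse_fasta(example)
```

Each record keeps the description (without the leading ">") and the sequence joined into one string.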

page 103
Example of the FASTA format for a BLAST query

Fig. 2.9
page 32
Step 2: Choose the BLAST program

page 104
Step 3: choose the database

nr = non-redundant (most general database)
dbest = database of expressed sequence tags
dbsts = database of sequence tag sites
gss = genomic survey sequences

(Figure labels: protein databases, nucleotide databases)

page 106
Step 3: choose the database
• For protein database searches (blastp
and blastx), the main two options are
the nr database and SwissProt.
• For proteins, the nr database consists
of the combined protein records from
GenBank, the Protein Data Bank (PDB),
SwissProt, PIR, and PRF
• For DNA database searches (blastn,
tblastn, tblastx) the options are to search
the nucleotide nr database or the EST
database.
• The nr database for DNA searches
includes nucleotide sequences from
GenBank, EMBL, DDBJ, and PDB.
• However, the nr database does not have
records from the EST, STS, GSS, or high-
throughput genomic sequence (HTGS)
databases.
• The nr databases are derived by
merging several main protein or DNA
databases.
• These databases often contain identical
sequences, but only one of these
sequences is retained by the nr database.
• (Even if two sequences in the nr database
appear to be identical, they in fact have
some subtle difference.)
• The nr databases are typically the
preferred sites for searching the majority
of available sequences.
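
The pairing of programs and databases described above can be written as a small lookup. This is an illustrative sketch: the main database options come from these slides, while the query and database types listed for each program are the standard BLAST definitions rather than something stated here. All names in the code are hypothetical.

```python
# program -> (type of query, type of database searched)
# blastx, tblastn and tblastx translate nucleotide sequences in six reading
# frames before comparing them at the protein level.
PROGRAMS = {
    "blastp": ("protein", "protein"),
    "blastn": ("nucleotide", "nucleotide"),
    "blastx": ("nucleotide", "protein"),
    "tblastn": ("protein", "nucleotide"),
    "tblastx": ("nucleotide", "nucleotide"),
}

# Main database options named in the slides for each database type.
MAIN_DATABASES = {
    "protein": ["nr", "SwissProt"],
    "nucleotide": ["nr", "EST"],
}


def main_databases(program):
    """Return the main database options for a given BLAST program."""
    _query_type, database_type = PROGRAMS[program]
    return MAIN_DATABASES[database_type]


for name in PROGRAMS:
    print(name, "->", main_databases(name))
```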
Step 4a: Select optional search parameters

organism

Entrez!

algorithm

page 107
Step 4a: optional blastp search parameters

Expect
Word size

Scoring matrix

Filter, mask

page 108
Step 4a: optional blastn search parameters

Expect
Word size

Match/mismatch scores

Filter, mask

page 108
Step 4a: Selecting Optional Search
Parameters
• A score is a numerical value that describes the overall
quality of an alignment. Higher numbers correspond to
higher similarity.
• The bit score S' is the raw pairwise alignment score after
normalization; it measures sequence similarity independently
of query sequence length and database size.
• The E value (associated with a score S) is the number of distinct
alignments expected by chance with a score equal to or better
than S. The lower the E value, the more significant the score.
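
The relationship between raw score, bit score, and E value can be made concrete with the standard Karlin-Altschul formulas, which these slides do not spell out: S' = (λS − ln K) / ln 2 and E = m · n · 2^(−S'), where m is the query length and n is the total database length. The sketch below is a simplification (BLAST uses corrected "effective" lengths), and the λ and K values are illustrative assumptions.

```python
import math


def bit_score(raw_score, lambda_, k):
    """Normalize a raw alignment score S into a bit score S'."""
    return (lambda_ * raw_score - math.log(k)) / math.log(2)


def expect_value(bits, query_length, database_length):
    """Number of alignments expected by chance with at least this bit score."""
    return query_length * database_length * 2.0 ** (-bits)


# Illustrative parameters; the real values depend on the scoring system used.
LAMBDA = 0.267
K = 0.041

bits = bit_score(100, LAMBDA, K)
e_small_db = expect_value(bits, query_length=250, database_length=1_000_000)
e_large_db = expect_value(bits, query_length=250, database_length=2_000_000)
print(f"bit score {bits:.1f}, E = {e_small_db:.3g} vs {e_large_db:.3g}")
```

Doubling the database length doubles E for the same bit score, which is why the same alignment looks less significant in a larger database.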
Step 4a: Selecting Optional Search
Parameters
• The default setting for the expect value is 10 for blastn,
blastp, blastx, and tblastn.
• At this E value, 10 hits with scores equal to or better than
the alignment score S are expected to occur by chance.
(This assumes that you search the database using a
random query with similar length to your actual query.)
• By changing the expect option to a lower number, fewer
database hits are returned; fewer chance matches are
reported. Increasing E returns more hits.
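
The effect of the expect threshold can be sketched as a simple filter. The hits below are made-up (name, E value) pairs used only for illustration, and the function name is hypothetical.

```python
def reported_hits(hits, expect=10.0):
    """Keep only the hits whose E value does not exceed the expect threshold."""
    return [(name, e_value) for name, e_value in hits if e_value <= expect]


# Hypothetical hits: (name, E value).
hits = [("hit_A", 1e-50), ("hit_B", 0.002), ("hit_C", 3.5), ("hit_D", 9.0)]

default_hits = reported_hits(hits)              # default expect value of 10
strict_hits = reported_hits(hits, expect=0.01)  # lower threshold, fewer hits
print(len(default_hits), len(strict_hits))
```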
BLAST OUTPUTS
BLAST search output: graphical output
Score, Bit-score, P-value, E-value: example

Example: BLAST - Pho4p (S. cerevisiae)

Query (input) sequence (Pho4p from S. cerevisiae):

>gi|259146228|emb|CAY79487.1| Pho4p [Saccharomyces cerevisiae EC1118]
MGRTTSEGIHGFVDDLEPKSSILDKVGDFITVNTKRHDGREDFNEQNDELNSQEHHNSSENGNENENEQD
SLALDDLDRAFELVEGMDMDWMMPSHAHHSPATTATIKPRLLYSPLIHTQSAVPVTISPNLVATATSTTS
ANKVTKNKSNSSPYLNKRRGKPGPDSATSLFELPDSVIPTPKPKPKPKQYPKVILPSNSTRRISPVTAKT
SSSAEGVVVASESPVIAPHGSSHSRSLSKRRSSGALVDDDKRESHKHAEQARRNRLAVALHELASLIPAE
WKQQNVSAAPSKATTVEAACRYIRHLQQNVST

BLAST (default parameters)

Results (output) of BLAST


The top segment displays the color key and the query-based scale.
The colored bars represent the actual HSPs (high-scoring segment pairs).
The position of each bar indicates the region of the query the HSP covers.
The thin line (see *) indicates that the two HSPs are from the same sequence.
Small vertical lines (not obtained here) indicate breaks, i.e., segments which
are not connected in the actual alignment.

Explanation of Output of a BLAST Search:
http://www.ncbi.nlm.nih.gov/staff/tao/URLAPI/new_view.html

page 112
Score, Bit-score, P-value, E-value: example
Example: BLAST - Pho4p (S. cerevisiae)

Results (output) of BLAST

Max score = highest alignment score (bit-score) between the query sequence and the database sequence segment.
Total score = sum of alignment scores of all segments from the same database sequence that match the query
sequence (calculated over all segments). This score is different from the max score if several parts of the database
sequence match different parts of the query sequence (see " * " in the example).
Query coverage = percent of the query length that is included in the aligned segments. This coverage is calculated
over all segments (cf. total score).
E-value = number of alignments expected by chance with a particular score or better. The expect value is the default
sorting metric and normally gives the same sorting order as Max score.
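
Query coverage over several segments can be sketched as the length of the union of the aligned query intervals. This assumes that positions covered by more than one segment are counted once; the function name and the coordinates are hypothetical.

```python
def query_coverage(query_length, segments):
    """Percent of the query covered by aligned segments.

    segments: list of (start, end) query positions, 1-based and inclusive.
    """
    covered = 0
    last_end = 0
    for start, end in sorted(segments):
        start = max(start, last_end + 1)  # skip positions already counted
        if end >= start:
            covered += end - start + 1
            last_end = end
    return 100.0 * covered / query_length


# Two overlapping segments on a 200-residue query (made-up coordinates).
coverage = query_coverage(200, [(1, 100), (51, 150)])
print(f"{coverage:.0f}% of the query is covered")
```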
Total Score
• The Total Score is important when BLAST finds multiple,
but not joined, sections of similarity between the
query and the hit.
• For each area of similarity, BLAST generates an
alignment and a score.
• If the Max Score is equal to the Total Score, then
only a single alignment is present.
• If the Total Score is larger than the Max Score,
then multiple alignments are present and their
individual scores contributed to the Total Score.
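
The relation between Max Score and Total Score can be sketched from the per-alignment scores of a single hit. The bit scores below are invented for illustration, and the function name is hypothetical.

```python
def summarize_hit(alignment_scores):
    """Return (max score, total score, multiple alignments?) for one database hit."""
    max_score = max(alignment_scores)
    total_score = sum(alignment_scores)
    # A Total Score larger than the Max Score means several alignments contributed.
    return max_score, total_score, total_score > max_score


single = summarize_hit([85.0])          # one alignment: Max Score equals Total Score
multiple = summarize_hit([85.0, 40.5])  # two separate regions of similarity
print(single, multiple)
```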
More on the E Value
• The E value, or Expect value, represents the number of hits you
would expect to find by chance given the quality of the
alignment and the size of the database.
• If a database of only As, Ts, Gs, and Cs gets large, you start
finding sequence similarities by chance, particularly with
short queries.
• The E value takes into account both the length and composition of
the alignment along with the percentage identity found.
• A number close to zero means that the hit is significant and
unlikely to be due to chance.
• BLAST results tables are sorted by E value, with the most
significant hits appearing at the top. When there are two or
more identical E values, the Max Score is then used to sort
the hits.
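
The sorting rule described above (E value first, Max Score as the tie-breaker) can be sketched as follows. The hit records are invented for illustration, and the field names are hypothetical.

```python
def sort_hits(hits):
    """Sort by ascending E value; break ties by descending Max Score."""
    return sorted(hits, key=lambda hit: (hit["e_value"], -hit["max_score"]))


# Hypothetical hits; two of them share the same E value.
hits = [
    {"name": "hit_C", "e_value": 2e-5, "max_score": 48.0},
    {"name": "hit_A", "e_value": 0.0, "max_score": 610.0},
    {"name": "hit_B", "e_value": 0.0, "max_score": 655.0},
]
ordered_names = [hit["name"] for hit in sort_hits(hits)]
print(ordered_names)
```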
BLAST search output: tabular output

High scores
low E values

page 113
• We can assess the relatedness of any two
proteins by performing a pairwise alignment.
• In this procedure, we place the two sequences
directly next to each other.
• It is often extremely difficult to align two proteins
by visual inspection.
• Also, if we allow gaps in the alignment to account
for deletions or insertions in the two sequences,
the number of possible alignments rises
exponentially.
• Clearly, we need a computer algorithm, such as BLAST,
to perform an alignment.
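
How quickly the number of gapped alignments grows can be checked by counting. The sketch below uses one common counting convention, which is an assumption rather than something taken from the slides: each alignment column is either a residue pair, a gap in the first sequence, or a gap in the second.

```python
def count_alignments(m, n):
    """Count the gapped global alignments of sequences of lengths m and n."""
    # counts[i][j] = number of ways to align the first i and first j residues.
    counts = [[1] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            counts[i][j] = (
                counts[i - 1][j - 1]  # last column pairs two residues
                + counts[i - 1][j]    # last column is a gap in the second sequence
                + counts[i][j - 1]    # last column is a gap in the first sequence
            )
    return counts[m][n]


for length in (1, 2, 3, 10, 50):
    print(length, count_alignments(length, length))
```

Even for two short sequences the count becomes far too large to inspect by eye, which is the motivation for algorithmic alignment.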
TIPS
• Always use CAPITAL letters for the one letter
codes.
• When displaying these sequences as a word-
processing document, use the Courier font for
easy alignment.
Sequence Analysis
• The term "sequence analysis" in biology
implies subjecting a DNA or peptide sequence
to sequence alignment, sequence databases,
repeated sequence searches, or other
bioinformatics methods on a computer.
The Concept
• An alignment is a mutual arrangement of two
sequences.
• It shows where the two sequences are similar and where
they differ.
• Sequences that are similar probably have the same
function.
• Sequence alignment involves the identification of the
correct location of deletions and insertions that have
occurred in either of the two lineages since the
divergence from a common ancestor.
Pairwise sequence alignment
• One of the most basic questions about a gene or
protein is whether it is related to any other gene or
protein.
• Relatedness of two proteins at the sequence level
suggests that they are homologous.
• Relatedness also suggests that they may have common
functions.
• By analyzing many DNA and protein sequences, it is
possible to identify, e.g., domains or motifs that are
shared among a group of molecules.
• These analyses of the relatedness of proteins and
genes are accomplished by aligning sequences.
Definitions: Homology, Similarity,
Identity
• Let us consider the lipocalin family of proteins.
• We will begin with human RBP4 (accession
number NP_006735) and bovine β-lactoglobulin
(accession number P02754) as two proteins that
are distantly but significantly related.
• Both accession numbers can be obtained from
Entrez or other search engines.
• Two sequences are homologous if they share a
common evolutionary ancestry.
• Homologous proteins almost always share a
significantly related three-dimensional
structure.
• RBP and β-lactoglobulin have very similar
structures as determined by X-ray
crystallography.
• When two sequences are homologous, their
amino acid or nucleotide sequences usually
share significant identity.
• Thus, while homology is an inference
(sequences are homologous or not),
identity and similarity are quantities that
describe the relatedness of sequences.
• Identity is a quantitative measurement of the
number of residues that are identical in both
of the sequences being aligned. It can be
expressed as a percentage.
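
Percent identity can be computed directly from two aligned sequences. In this minimal sketch the denominator is the number of alignment columns, gaps included, which is one of several conventions in use and is an assumption here; the example sequences are made up.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percent of alignment columns in which both sequences have the same residue."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    columns = 0
    identical = 0
    for a, b in zip(aligned_a.upper(), aligned_b.upper()):
        if a == gap and b == gap:
            continue  # a column of two gaps carries no information
        columns += 1
        if a == b:
            identical += 1
    if columns == 0:
        raise ValueError("alignment has no columns to compare")
    return 100.0 * identical / columns


# Made-up aligned peptides: 4 identical columns out of 6.
identity = percent_identity("MKV-LA", "MRVALA")
print(f"{identity:.1f}% identity")
```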
Terms of Sequence Comparison
Sequence identity
• Exactly the same nucleotide or amino acid at the same position

Sequence similarity
• Substitutions with similar chemical properties

Sequence homology
• General term that indicates evolutionary relatedness among
sequences
• Sequences are homologous if they are derived from a
common ancestral sequence.
• Two molecules may be homologous without
sharing statistically significant amino acid (or
nucleotide) identity.
• In the lipocalin family, all the members are
homologous, but some have sequences that
have diverged so greatly that they share no
recognizable sequence identity.
• In general, three-dimensional structures
diverge much more slowly than amino acid
sequence identity between two proteins
Orthologous & Paralogous
• Proteins that are homologous may be
orthologous or paralogous.
• Orthologs (ortho = exact) are homologous
sequences/genes in different species that evolved
from a common ancestral gene during
speciation.
• Orthologous RBPs. This unrooted tree was
generated from a multiple sequence
alignment of 13 lipocalin
• In this tree, sequences that are more closely
related to each other are grouped closer
together.
• Orthologs are presumed to have similar
biological functions; in this example, human
and rat RBPs both transport vitamin A in
serum.
• Paralogous genes often develop different
functions due to missing selective pressure on
one copy of the duplicated gene.
• Paralogs (para = in parallel) are homologous
sequences within the same organism
(genome) that arose by a mechanism such as
gene duplication.
• For example, human plasma RBP is
homologous to another carrier protein,
human apolipoprotein D (NP_001638).
• These two proteins are paralogs.
• All of the lipocalins have distinct distributions
in the body and are thought to have distinct
but related functions as carrier proteins.
More on Paralogs
• Human α-globin and β-globin are paralogs, as are
mouse α-globin and mouse β-globin.
• Human α-globin and mouse α-globin are orthologs.
What is the relation of human α-globin to mouse β-globin?
• They could be considered paralogs, because α-globin and
β-globin originate from a gene duplication event
rather than from a speciation event.
• However, they are not paralogs in the strict sense because they do not
occur in the same species. It may thus be most
appropriate to simply call them “homologs,”
reflecting their descent from a common ancestor.
• Paralogous human lipocalins: Each of these
proteins is human, and each is a member of
the lipocalin family
• In all genome sequencing projects, orthologs
are identified based on database searches.
• Two DNA (or protein) sequences are defined
as homologous based on achieving significant
alignment scores
• However, homologous proteins do not
necessarily share the same function.
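
The ortholog/paralog distinctions drawn in the globin example can be sketched as a small decision function. It follows the stricter convention from that example (paralogs must occur in the same species) and assumes the two genes are already known to be homologous and that the event separating them is known; the function name and labels are hypothetical.

```python
def classify_homologs(species_a, species_b, divergence_event):
    """Label a pair of homologous genes as orthologs, paralogs, or simply homologs."""
    if divergence_event not in {"speciation", "duplication"}:
        raise ValueError("divergence_event must be 'speciation' or 'duplication'")
    if divergence_event == "speciation" and species_a != species_b:
        return "orthologs"  # same gene, separated when the species split
    if divergence_event == "duplication" and species_a == species_b:
        return "paralogs"   # duplicated copies within one genome
    return "homologs"       # e.g. human alpha-globin vs mouse beta-globin


print(classify_homologs("human", "mouse", "speciation"))   # alpha-globin in both species
print(classify_homologs("human", "human", "duplication"))  # human alpha- vs beta-globin
print(classify_homologs("human", "mouse", "duplication"))  # human alpha- vs mouse beta-globin
```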
Recap: Homology
• Homology designates a qualitative relationship of
common descent between entities.
• Two genes are either homologs or not!
- It doesn’t make sense to say “two genes are 43%
homologous”
- It doesn’t make sense to say “John is 43% diabetic”
• Two genes are orthologs if they originated from a single
ancestral gene in the most recent common ancestor of
their respective genomes.
• Two genes are paralogs if they are related by duplication.
