TY-Exercise_4_(35)(Updated)

BLAST (Basic Local Alignment Search Tool) is a bioinformatics algorithm used for comparing biological sequences, allowing researchers to identify similar sequences in databases. Variants of BLAST include BLASTn for nucleotide sequences, tBLASTn for protein searches in unannotated DNA, BLASTx for translating nucleotide queries into protein sequences, and BLASTp for comparing protein sequences. The document also provides exercises for using BLAST to identify proteins and nucleotide sequences, detailing metrics like query length, E-value, max score, and percent identity.

Uploaded by

Arpita Upadhyay

Roll No. – 35

EXERCISE 4
Title- Performing Database Search using NCBI BLAST

Write about BLAST & its variants


BLAST:-
In bioinformatics, BLAST (basic local alignment search tool) is an algorithm and program for
comparing primary biological sequence information, such as the amino-acid sequences
of proteins or the nucleotides of DNA and/or RNA sequences. A BLAST search enables a
researcher to compare a subject protein or nucleotide sequence (called a query) with a library
or database of sequences, and identify database sequences that resemble the query sequence
above a certain threshold. For example, following the discovery of a previously unknown gene in
the mouse, a scientist will typically perform a BLAST search of the human genome to see if
humans carry a similar gene; BLAST will identify sequences in the human genome that resemble
the mouse gene based on similarity of sequence. BLAST is one of the most widely used
bioinformatics programs for sequence searching.[4] It addresses a fundamental problem in
bioinformatics research. The heuristic algorithm it uses is much faster than other approaches,
such as calculating an optimal alignment. This emphasis on speed is vital to making the
algorithm practical on the huge genome databases currently available, although subsequent
algorithms can be even faster. BLAST is more time-efficient than FASTA because it searches only
for the more significant patterns in the sequences, yet retains comparable sensitivity. While BLAST
is faster than any Smith-Waterman implementation for most cases, it cannot "guarantee the optimal
alignments of the query and database sequences" as the Smith-Waterman algorithm does. The
Smith-Waterman algorithm was an extension of a previous optimal method, the Needleman–
Wunsch algorithm, which was the first sequence alignment algorithm that was guaranteed to find
the best possible alignment. However, the time and space requirements of these optimal
algorithms far exceed the requirements of BLAST.
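To make the cost of an optimal method concrete, here is a minimal Python sketch of Smith-Waterman local alignment scoring. It is an illustration only, not how BLAST works internally: the scoring values (match, mismatch, gap) and the simple linear gap penalty are assumptions chosen for the example, and only the best score is returned rather than the alignment itself.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Return the best local alignment score between sequences a and b.

    Scoring values are illustrative assumptions; a linear gap penalty is used.
    """
    prev = [0] * (len(b) + 1)  # previous row of the dynamic-programming table
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A local alignment may restart anywhere, so a cell never drops below 0.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


print(smith_waterman_score("TTACGTTT", "GGACGGG"))
```

Every cell of a table of size len(a) x len(b) is filled in, so the running time grows with the product of the two lengths. That is the cost BLAST's heuristic avoids when the second sequence is an entire database.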
Variants of BLAST:-
a) BLASTn (Nucleotide BLAST):
BLASTn compares one or more nucleotide sequences to a database or to another sequence. This
is useful when trying to identify evolutionary relationships between organisms.

b) tBLASTn:

tBLASTn is used to search for proteins in nucleotide sequences that have not yet been
translated into proteins. It takes a protein query and compares it to all six possible
translations of each DNA sequence in the database. This is useful when looking for similar
protein-coding regions in DNA sequences that haven't been fully annotated, like ESTs (short,
single-read cDNA sequences) and HTGs (draft genome sequences). Since these sequences don't
have known protein translations, a protein query can only be searched against them using tBLASTn.

c) BLASTx:
BLASTx translates a nucleotide query sequence in all six reading frames (three on each
strand) and compares the resulting protein sequences against a database of known protein
sequences. This tool is useful when the reading frame of the DNA sequence is uncertain or
the sequence contains errors, such as frameshifts, that would disrupt the protein-coding
region. BLASTx provides combined statistics for hits across all frames, making it helpful
for the initial analysis of new DNA sequences.

d) BLASTp:
BLASTp, or Protein BLAST, is used to compare protein sequences. You can
input one or more protein sequences that you want to compare against a single
protein sequence or a database of protein sequences. This is useful when you're
trying to identify a protein by finding similar sequences in existing protein
databases.
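The six-frame translation that BLASTx applies to the query (and tBLASTn to the database) can be sketched in a few lines of Python. This is a simplified illustration, not BLAST's own code: it assumes the standard genetic code, the function names are hypothetical, and codons containing anything other than A, C, G or T are reported as 'X'.

```python
from itertools import product

# Standard genetic code, with codons enumerated in TCAG order ('*' marks a stop codon).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the reverse complement of an uppercase DNA string."""
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate complete codons from position 0; unknown codons become 'X'."""
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3))


def six_frame_translations(dna):
    """Return translations for frames +1, +2, +3 and -1, -2, -3."""
    dna = dna.upper()
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


for frame, protein in six_frame_translations("ATGGCC").items():
    print(frame, protein)
```

Each of the six translated sequences is then compared at the protein level, which is why these programs tolerate an unknown reading frame.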

Exercises:
1. Given a protein sequence, identify the protein and the organism it comes from using the
BLAST program.

>sample sequence
MERGVRRGAALVAAWRSLWERGGLALFRPQCRTGCGACRVQGTRPFSLSAAASAVLGLGSWGGDSGKQ
KLTLQDVAELIRKKECRRVVVMAGAGIS
TPSGIPDFRSPGSGLYSNLEQYNIPYPEAIFELAYFFINPKPFFTLAKELYPGNYRPNYAHYFLRLLHDKGLLL
RLYTQNIDGLERVAGIPPDRLVEAHGT
FATATCTVCRRKFPGEDFRGDVMADKVPHCRVCTGIVKPDIVFFGEELPQRFFLHMTDFPMADLLFVIGTSL
EVEPFASLAGAVRNSVPRVLINRDLV
GPFAWQQRYNDIAQLGDVVTGVEKMVELLDWNEEMQTLIQKEKEKLDAKDK

It is the NAD-dependent protein deacetylase sirtuin-3, mitochondrial isoform X1, of Gallus gallus.

From the given details, find:

• What is the length of the query sequence?
346 amino acids
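• What is the number of sequences in the database searched?
100. This is the number of sequences listed in the results, which matches the default maximum
number of target sequences on the NCBI BLAST page; the database itself is much larger.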
• What is the E-value?
The E-value is the number of hits with a score at least this good that would be expected by chance
when searching a database of this size. The closer to 0, the more significant the hit. Hits are
automatically sorted by E-value (best to worst), which makes this metric extremely useful for
identifying real hits. For this search, the E-value was 0.0.
• What is max score?
The highest bit score among the aligned segments between the query and a database sequence. The
bit score is derived from the raw alignment score, which rewards matches and penalizes mismatches
and gaps, and is normalized on a log2 scale: a bit score of S means that a match this good would be
expected by chance about once in a search space of size 2^S. The higher the score, the better the
alignment. For this search, the max score was 714.
• What is total score?
The sum of the alignment scores of all aligned segments from the same database sequence. The
higher the score, the better the alignment. For this search, the total score was 714.
• What is percent identity?
The percentage of aligned positions at which the query and the database sequence have the same
residue. A hit can have a low percent identity and still be real, so it is essential to take the
E-value into account and look for homology between conserved regions, which is most evident at the
protein level. For this search, the percent identity was 100%.
• Are there any homologous (orthologous/paralogous) sequences? Which are those? Mention them.
Yes, only orthologous sequences. Sequences are said to be orthologous if they were separated by a
speciation event: when a species diverges into two separate species, the divergent copies of a single
gene in the resulting species are orthologous. The five best-matching sequences for the given
protein sequence come from the following species:
a) Gallus gallus
b) Phasianus colchicus
c) Centrocercus urophasianus
d) Lagopus muta
e) Tympanuchus pallidicinctus
• Which is the best match?
NP_001186422.1 (Gallus gallus)
Max score: 714
Total Score: 714
E value: 0.0
• Which is the weakest match?
NWZ52525.1 (Haliaeetus albicilla)
Max score: 579
Total score: 579
E value: 0.0
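The relationship between bit score, E-value and percent identity described in the answers above can be sketched in Python. The formula E = m x n x 2^(-S) is the standard relation between a bit score S and the expected number of chance hits, where m and n are the effective lengths of the query and the database. The function names and the database length used here are hypothetical, since the exercise does not report the size of the database searched.

```python
def expected_hits(bit_score, query_len, db_len):
    """E-value from a bit score: E = m * n * 2**(-S)."""
    return query_len * db_len * 2.0 ** (-bit_score)


def percent_identity(aligned_query, aligned_subject):
    """Percentage of alignment columns where both sequences carry the same residue.

    Both strings must come from the same alignment (equal length, '-' for gaps).
    """
    if not aligned_query or len(aligned_query) != len(aligned_subject):
        raise ValueError("aligned sequences must be non-empty and of equal length")
    identical = sum(1 for q, s in zip(aligned_query, aligned_subject) if q == s and q != "-")
    return 100.0 * identical / len(aligned_query)


# Query length 346 and bit score 714 are taken from the exercise;
# the database length of 10**9 residues is an assumed, illustrative value.
print(expected_hits(714, 346, 10**9))
print(percent_identity("MERGVR", "MERGVR"))
```

With a bit score of 714 the expected number of chance hits is vanishingly small for any realistic database size, which is consistent with the E-value being displayed as 0.0.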

2. Given a nucleotide sequence, find details of the sequence, such as the source organism,
name and accession ID, using BLAST.

>sample sequence
AGTGCCGCGCGTCGAGCGGAGCAGAGGAGGCGAGGGCGGAGGGCCAGAGAGGCAGTTGGAAGATGG
CGGACGAGGTGGCGCTCGCCCTTCAGGCCGCCGGCTCCCCTTCCGCGGCGGCCGCCATGGAGGCCGCG
TCGCAGCCGGCGGACGAGCCGCTCCGCAAGAGGCCCCGCCGAGACGGGCCTGGCCTCGGGCGCAGCC
CGGGCGAGCCGAGCGCAGCAGTGGCGCCGGCGGCCGCGGGGTGTGAGGCGGCGAGCGCCGCGGC
CCCGGCGGCGCTGTGGCGGGAGGCGGCAGGGGCGGCGGCGAGCGCGGAGCGGGAGGCCCCGGCGAC
GGCCGTGGCCGGGGACGGAGACAATGGGTCCGGCCTGCGGCGGGAGCCGAGGGCGGCTGACGACTTC
GACGACGACGAGGGCGAGGAGGAGGACGAGGCGGCGGCGGCAGCGGCGGCGGCAGCGATCGGCTAC
CGAGACAACCTCCTGTTGACCGATGGACTCCTCACTAATGGCTTTCATTCCTGTGAAAGTGATGACGAT
GACAGAACGTCACACGCCAGCTCTAGTGACTGGACTCCGCGGCCGCGGATAGGTCCATATACTTTTGT
TCAGCAACATCTCATGATTGGCACCGATCCTCGAACAATTCTTAAAGATTTATTACCAGAAACAATTCC
TCCACCTGAGCTGGATGATATGACGCTGTGGCAGATTGTTATTAATATCCTTTCAGAACCACCAAAGC
GGAAAAAAAGAAAAGATATCAATACAATTGAAGATGCTGTGAAGTTACTGCAGGAGTGTAAAAAGAT
AATAGTTCTGACTGGAGCTGGGGTTTCTGTCTCCTGTGGGATTCCTGACTTCAGATCAAGAGACGGTAT
CTATGCTCGCCTTGCGGTGGACTTCCCAGACCTCCCAGACCCTCAAGCCATGTTTGATATTGAGTATTT
TAGAAAAGACCCAAGACCATTCTTCAAGTTTGCAAAGGAAATATATCCCGGACAGTTCCAGCCGTCTC
TGTGTCACAAATTCATAGCTTTGTCAGATAAGGAAGGAAAACTACTTCGAAATTATACTCAAAATATA
GATACCTTGGAGCAGGTTGCAGGAATCCAAAGGATCCTTCAGTGTCATGGTTCCTTTGCAACAGCATC
TTGCCTGATTTGTAAATACAAAGTTGATTGTGAAGCTGTTCGTGGAGACATTTTTAATCAGGTAGTTCC
TCGGTGCCCTAGGTGCCCAGCTGATGAGCCACTTGCCATCATGAAGCCAGAGATTGTCTTCTTTGGTGA
AAACTTACCAGAACAGTTTCATAGAGCCATGAAGTATGACAAAGATGAAGTTGACCTCCTCATTGTTA
TTGGATCTTCTCTGAAAGTGAGACCAGTAGCACTAATTCCAAGTTCTATACCCCATGAAGTGCCTCAAA
TATTAATAAATAGGGAACCTTTGCCTCATCTACATTTTGATGTAGAGCTCCTTGGAGACTGCGATGTT
ATAATTAATGAGTTGTGTCATAGGCTAGGTGGTGAATATGCCAAACTTTGTTGTAACCCTGTAAAGCTT
TCAGAAATTACTGAAAAACCTCCACGCCCACAAAAGGAATTGGTTCATTTATCAGAGTTGCCACCAAC
ACCTCTTCATATTTCGGAAGACTCAAGTTCACCTGAAAGAACTGTACCACAAGACTCTTCTGTGATTGC
TACACTTGTAGACCAAGCAACAAACAACAATGTTAATGATTTAGAAGTATCTGAATCAAGTTGTGTGG
AAGAAAAACCACAAGAAGTACAGACTAGTAGGAATGTTGAGAACATTAATGTGGAAAATCCAGATTT
TAAGGCTGTTGGTTCCAGTACTGCAGACAAAAATGAAAGAACTTCAGTTGCAGAAACAGTGAGAAAA
TGCTGGCCTAATAGACTTGCAAAGGAGCAGATTAGTAAGCGGCTTGAGGGTAATCAATACCTGTTTGT
ACCACCAAATCGTTACATATTCCACGGTGCTGAGGTATACTCAGACTCTGAAGATGACGTCTTGTCCTC
TAGTTCCTGTGGCAGTAACAGTGACAGTGGCACATGCCAGAGTCCAAGTTTAGAAGAACCCTTGGAAG
ATGAAAGTGAAATTGAAGAATTCTACAATGGCTTGGAAGATGATACGGAGAGGCCCGAATGTGCTGG
AGGATCTGGATTTGGAGCTGATGGAGGGGATCAAGAGGTTGTTAATGAAGCTATAGCTACAAGACAG
GAATTGACAGATGTAAACTATCCATCAGACAAATCATAACACTATTGAAGCTGTCCGGATTCAGGAAT
TGCTCCACCAGCATTGGGAACTTTAGCATGTCAAAAAATGAATGTTTACTTGTGAACTTGAACAAGGA
AATCTGAAAGATGTATTATTTATAGACTGGAAAATAGATTGTCTTCTTGGATAATTTCTAAAGTTCCAT
CATTTCTGTTTGTACTTGTACATTCAACACTGTTGGTTGACTTCATCTTCCTTTCAAGGTTCATTTGTAT
GATACATTCGTATGTATGTATAATTTTGTTTTTTGCCTAATGAGTTTCAACCTTTTAAAGTTTTCAAAAG
CCATTGGAATGTTAATGTAAAGGGAACAGCTTATCTAGACCAAAGAATGGTATTTCACACTTTTTTGTT
TGTAACATTGAATAGTTTAAAGCCCTCAATTTCTGTTCTGCTGAACTTTTATTTTTAGGACAGTTAACTT
TTTAAACACTGGCATTTTCCAAAACTTGTGGCAGCTAACTTTTTAAAATCACAGATGACTTGTAATGTG
AGGAGTCAGCACCGTGTCTGGAGCACTCAAAACTTGGTGCTCAGTGTGTGAAGCGTACTTACTGCATC
GTTTTTGTACTTGCTGCAGACGTGGTAATGTCCAAACAGGCCCCTGAGACTAATCTGATAAATGATTTG
GAAATGTGTTTCAGTTGTTCTAGAAACAATAGTGCCTGTCTATATAGGTCCCCTTAGTTTGAATATTTG
CCATTGTTTAATTAAATACCTATCACTGTGGTAGAGCCTGCATAGATCTTCACCACAAATACTGCCAAG
ATGTGAATATGCAAAGCCTTTCTGAATCTAATAATGGTACTTCTACTGGGGAGAGTGTAATATTTTGGA
CTGCTGTTTTTCCATTAATGAGGAAAGCAATAGGCCTCTTAATTAAAGTCCCAAAGTCATAAGATAAA
TTGTAGCTCAACCAGAAAGTACACTGTTGCCTGTTGAGGATTTGGTGTAATGTATCCCAAGGTGTTAGC
CTTGTATTATGGAGATGAATACAGATCCAATAGTCAAATGAAACTAGTTCTTAGTTATTTAAAAGCTTA
GCTTGCCTTAAAACTAGGGATCAATTTTCTCAACTGCAGAAACTTTTAGCCTTTCAAACAGTTCACACC
TCAGAAAGTCAGTATTTATTTTACAGACTTCTTTGGAACATTGCCCCCAAATTTAAATATTCATGTGGG
TTTAGTATTTATTACAAAAAAATGATTTGAAATATAGCTGTTCTTTATGCATAAAATACCCAGTTAGGA
CCATTACTGCCAGAGGAGAAAAGTATTAAGTAGCTCATTTCCCTACCTAAAAGATAACTGAATTTATTT
GGCTACACTAAAGAATGCAGTATATTTAGTTTTCCATTTGCATGATGTGTTTGTGCTATAGACAATATT
TTAAATTGAAAAATTTGTTTTAAATTATTTTTACAGTGAAGACTGTTTTCAGCTCTTTTTATATTGTACA
TAGACTTTTATGTAATCTGGCATATGTTTTGTAGACCGTTTAATGACTGGATTATCTTCCTCCAACTTTT
GAAATACAAAAACAGTGTTTTATACTTGTATCTTGTTTTAAAGTCTTATATTAAAATTGTCATTTGACTT
TTTTCCCGTTAAAAAAAAAAAAAAA

It is the mRNA sequence of Mus musculus sirtuin 1 (Sirt1). Its accession ID is NM_019812.3.

From the given details, find:

• What is the length of the query sequence?
3920 nucleotides
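• What is the number of sequences in the database searched?
100. This is the number of sequences listed in the results, which matches the default maximum
number of target sequences on the NCBI BLAST page; the database itself is much larger.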
• What is the E-value?
The E-value is the number of hits with a score at least this good that would be expected by chance
when searching a database of this size. The closer to 0, the more significant the hit. Hits are
automatically sorted by E-value (best to worst), which makes this metric extremely useful for
identifying real hits. For this search, the E-value was 0.0.
• What is max score?
The highest bit score among the aligned segments between the query and a database sequence. The
bit score is derived from the raw alignment score, which rewards matches and penalizes mismatches
and gaps, and is normalized on a log2 scale: a bit score of S means that a match this good would be
expected by chance about once in a search space of size 2^S. The higher the score, the better the
alignment. For this search, the max score was 7215.
• What is total score?
The sum of the alignment scores of all aligned segments from the same database sequence. The
higher the score, the better the alignment. For this search, the total score was 7215.
• What is percent identity?
The percentage of aligned positions at which the query and the database sequence have the same
base. A hit can have a low percent identity and still be real, so it is essential to take the
E-value into account and look for homology between conserved regions, which is most evident at the
protein level. For this search, the percent identity was 100%.
• Are there any homologous (orthologous/paralogous) sequences? Which are those? Mention them.
Only orthologous sequences are present. Sequences are said to be orthologous if they were
separated by a speciation event: when a species diverges into two separate species, the divergent
copies of a single gene in the resulting species are orthologous. The five best-matching
sequences come from the following species:
a) Mus musculus
b) Mus caroli
c) Arvicanthis niloticus
d) Mastomys coucha
e) Rattus norvegicus

• Which is the best match?
NM_019812.3 (Mus musculus)
Max score: 7215
Total score: 7215
E value: 0.0
• Which is the weakest match?
XM_060642624.1 (Panthera onca)
Max score: 2483
Total score: 3639
E value: 0.0
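The difference between max score and total score in the last hit can be illustrated with a short Python sketch. When a database sequence aligns to the query in more than one segment, the max score is the bit score of the best segment and the total score is the sum over all segments. The function name is hypothetical, and the split of the Panthera onca total into two segments is an assumption made for illustration: the exercise reports only the max score (2483) and the total score (3639), not the individual segment scores.

```python
def summarize_hit(segment_bit_scores):
    """Return max score, total score and segment count for one database hit."""
    if not segment_bit_scores:
        raise ValueError("a hit needs at least one aligned segment")
    return {
        "max_score": max(segment_bit_scores),
        "total_score": sum(segment_bit_scores),
        "segments": len(segment_bit_scores),
    }


# Best match: a single segment, so max score and total score coincide.
print(summarize_hit([7215]))

# Weakest match: hypothetical two-segment split consistent with the reported values.
print(summarize_hit([2483, 1156]))
```

A total score larger than the max score therefore signals that the query matched the database sequence in several separate regions rather than in one continuous alignment.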
