MAM 5108: Microbial Bioinformatics

Dr. Pasan Fernando

Practical 2: Sequence analysis and alignment


In this practical class, you will learn how to perform local and global alignments using
bioinformatics tools. Furthermore, you will perform computational translation of coding
sequences and multiple sequence alignments. Again, you will work with Bacillus thuringiensis
(Bt) endotoxins.

1. Using the Basic Local Alignment Search Tool (BLAST) for performing a database search
in GenBank (20 marks)

BLAST is the most popular bioinformatics program to conduct local alignment between
sequences. It is widely used to retrieve similar sequences for an input sequence from
biological databases.

Go to the National Center for Biotechnology Information (NCBI) BLAST tool by
performing a Google search or using the following link:
https://blast.ncbi.nlm.nih.gov/Blast.cgi

Select the Nucleotide BLAST program from the program list. Then, copy and paste the
“Unknown_cry_cds.fasta” sequence given to you, including the header, into the query
sequence box. Alternatively, you can upload the FASTA file by clicking on the “choose
file” button. After you copy and paste the sequence, the FASTA header will be
automatically added to the “Job title” field. You can give another title if required.

Then, under the “Choose search set” section, select “Nucleotide collection” as the
standard database (default option). Under the “program selection” section, select
“Highly similar sequences (megablast)” as the program optimization method.

Finally, tick the box to show results in a new window and click on the “BLAST” button to
run the program. You can run the program using the default algorithm parameters.
However, for advanced search queries, you can change these parameters by expanding
the “Algorithm parameters” box. The BLAST program may take a few minutes,
depending on the query and connection speed, and will display the results in a new
window. Make sure that you have a good network connection while running the
program.

Answer the following questions based on the BLAST results.

a. According to the top hits, what are the gene name and the organism (including
subspecies) of the input sequence? (2 marks)

b. Give two synonyms for the gene name of the input sequence (use knowledge
from your previous lab). (2 marks)

c. Based on the Max Score of the BLAST hits, are there multiple hits with the
highest Max Score? If yes, how many hits? And what is the highest Max Score? (3
marks)

d. What can you conclude from the Query Cover and the E-values of the hit(s) you
found in part (c)? (4 marks)

e. What can you conclude about the homology of hit(s) you found in part (c) based
on the Percent Identity value(s)? (3 marks)

f. What can you conclude from the accession length of the hit(s) you found in part
(c)? (2 marks)

g. What could be the reason for finding multiple hits with the same highest Max
Score? (2 marks)

h. Considering all the results, which hit would you pick as the best hit? Write its
GenBank accession number and explain the reason for your selection. (2 marks)

2. Using the Expert Protein Analysis System (Expasy) Translate tool to translate a coding
sequence of a gene. (5 marks)

You can use the Expasy translate tool to translate a coding sequence (without introns)
or an mRNA sequence to the corresponding amino acid sequence.

Go to the Expasy translate tool by clicking on the given link or googling it.
https://web.expasy.org/translate/

Copy the sequence in the “Unknown_cry_cds.fasta” file and paste it into the DNA or
RNA sequence box. Select the following parameter settings and click on the translate
button.
- Output format: Compact
- DNA strands: both forward and reverse
- Genetic codes: standard
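
What the translate tool does with these settings can be sketched in a few lines of Python. This is only an illustration, not the Expasy implementation: it uses the standard genetic code, translates both strands in all three offsets, and picks the longest stretch that starts at a methionine. The function names, the frame labels, and the toy sequence in the usage note are hypothetical.

```python
from itertools import product

# Standard genetic code (NCBI translation table 1), codons ordered T, C, A, G.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}


def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA sequence."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(seq):
    """Translate codon by codon; '*' is a stop, 'X' an unrecognised codon."""
    seq = seq.upper().replace("U", "T")
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3)
    )


def six_frames(seq):
    """Translate three forward and three reverse frames (labels are illustrative)."""
    seq = seq.upper().replace("U", "T")
    frames = {}
    for strand, s in (("forward", seq), ("reverse", reverse_complement(seq))):
        for offset in range(3):
            frames[f"{strand} frame {offset + 1}"] = translate(s[offset:])
    return frames


def longest_orf(protein):
    """Longest stretch from an 'M' up to the next stop (or the end of the frame)."""
    best = ""
    for segment in protein.split("*"):
        start = segment.find("M")
        if start != -1 and len(segment) - start > len(best):
            best = segment[start:]
    return best
```

For example, `six_frames("ATGGCCTAA")["forward frame 1"]` gives `"MA*"`, and `longest_orf("AAMKL*MP*")` gives `"MKL"`.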

a. In the results, select the open reading frame with the longest continuous amino
acid sequence. What is this reading frame for the sequence? What is the reason
for selecting the longest continuous sequence? (3 marks)

b. Then, click on the starting methionine residue (letter “M” in red color) of the
longest amino acid sequence. This will give you two views for the selected
sequence: Pseudo-entry and FASTA format. What is the length of this amino acid
sequence? (2 marks)

c. Now, click on the download button on top of the Pseudo-entry result and select
the FASTA format to download. Then change the file extension of the downloaded
file from “.txt” to “.fasta”.

3. Using the NCBI blastx program to search protein databases using a translated nucleotide
query. (10 marks)

With the blastx program, you do not need to separately translate your coding sequence
or mRNA sequence. The program will take the nucleotide sequence as the input,
translate it in all six reading frames, and search for the best protein hit. Use the blastx
program to search for the best protein hit for the “Unknown_cry_cds.fasta” coding
sequence.

Go to the NCBI BLAST using the following link or a Google search. Then, select the blastx
search.
https://blast.ncbi.nlm.nih.gov/Blast.cgi

Click on the “Choose file” button under the “Enter query sequence” section and upload the
“Unknown_cry_cds.fasta” file you saved before. Alternatively, you can also copy and
paste the sequence into the sequence box. Then, make sure that UniProtKB/Swiss-Prot is
selected as the database. Tick “Show results in a new window” and click on the BLAST
button.

a. What is the UniProt ID of the best hit? What are the reasons for selecting this
UniProt record as the best hit? Explain using result metrics such as Max Score, E-
value, etc. (8 marks)

b. Access the UniProt record for the best hit. What are the name and the function
of the protein according to the UniProt record? (2 marks)

c. Finally, download the amino acid sequence in FASTA format from the UniProt
record.

4. Performing a pairwise global alignment using the Needleman-Wunsch alignment
algorithm. (7 marks)

EMBOSS Needle is an online tool that reads two input sequences and writes their
optimal global sequence alignment to a file. It uses the Needleman-Wunsch alignment
algorithm to find the optimum alignment (including gaps) of two sequences along their
entire length. You can access EMBOSS Needle using the following link:
https://www.ebi.ac.uk/Tools/psa/emboss_needle/
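
To make the idea of a global alignment concrete, here is a minimal Needleman-Wunsch sketch in Python. It is a teaching simplification and not the EMBOSS code: it assumes a flat match/mismatch score and a linear gap penalty, whereas the web tool scores protein pairs with a substitution matrix and separate gap-open and gap-extend penalties. The function name and the default scores are hypothetical.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
    """Global alignment of a and b; returns (score, aligned_a, aligned_b)."""
    n, m = len(a), len(b)

    def sub(i, j):
        # Score for aligning a[i-1] with b[j-1] (simplified, no matrix).
        return match if a[i - 1] == b[j - 1] else mismatch

    # score[i][j] = best score for aligning a[:i] with b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(
                score[i - 1][j - 1] + sub(i, j),  # align the two residues
                score[i - 1][j] + gap,            # gap in b
                score[i][j - 1] + gap,            # gap in a
            )

    # Trace back from the bottom-right corner to recover one optimal alignment.
    top, bottom = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub(i, j):
            top.append(a[i - 1])
            bottom.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            top.append(a[i - 1])
            bottom.append("-")
            i -= 1
        else:
            top.append("-")
            bottom.append(b[j - 1])
            j -= 1
    return score[n][m], "".join(reversed(top)), "".join(reversed(bottom))
```

With the default scores, `needleman_wunsch("ACGT", "AGT")` returns `(1, "ACGT", "A-GT")`: three matches and one gap. Because the whole table is filled and the traceback starts at the last cell, both sequences are aligned along their entire length, which is what distinguishes a global alignment from the local alignment performed by BLAST.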

In this exercise, you will use EMBOSS Needle to perform a global alignment of the
translated amino acid sequence you generated in 2(c) and the amino acid sequence you
downloaded from the UniProt record in 3(c). Go to EMBOSS Needle and copy and
paste the two sequences into the sequence boxes. Make sure “protein” is selected as the
molecule type and click on the submit button at the bottom. You can also enter your email
address for notifications by ticking the corresponding box.

Answer the following questions based on the alignment result.

a. Write the identity, similarity and gap percentages, and final score for the
alignment. (4 marks)

b. What can you conclude about the alignment of the two sequences from the
above results? (1 mark)

c. What does this indicate about the Expasy translate prediction you performed
during question 2? (2 marks)
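
The identity and gap percentages reported for a pairwise alignment can be reproduced from the two aligned strings. The sketch below assumes that both percentages are taken over the full alignment length (including gap columns) and that "-" marks a gap; the function name is hypothetical. Similarity is left out because it depends on which substitution matrix is used.

```python
def alignment_stats(top, bottom):
    """Identity and gap percentages for two aligned sequences of equal length."""
    if len(top) != len(bottom) or not top:
        raise ValueError("aligned sequences must be non-empty and of equal length")
    length = len(top)
    # A column counts as identical when both residues match and neither is a gap.
    identical = sum(1 for x, y in zip(top, bottom) if x == y and x != "-")
    # A column counts as a gap when either sequence has '-'.
    gaps = sum(1 for x, y in zip(top, bottom) if x == "-" or y == "-")
    return {
        "length": length,
        "identity_pct": 100 * identical / length,
        "gap_pct": 100 * gaps / length,
    }
```

For the toy alignment `ACGT` over `A-GT`, this gives an identity of 75.0% and a gap percentage of 25.0%.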

5. Conducting a multiple sequence alignment using the ClustalW algorithm. (8 marks)

You will perform a multiple sequence alignment using the ClustalW algorithm in the MEGA
software. First, you will download sequences similar to the amino acid sequence you
downloaded in 3(c) by performing a Protein BLAST at the NCBI.

Go to the NCBI BLAST using the following link or a Google search. Then, select the
Protein BLAST search.
https://blast.ncbi.nlm.nih.gov/Blast.cgi

Click on the “Choose file” button under the “Enter query sequence” section and upload the
amino acid sequence you downloaded in 3(c) in FASTA format. Alternatively, you can
also copy and paste the sequence into the sequence box.

Then, make sure that UniProtKB/Swiss-Prot is selected as the database and blastp is
selected as the algorithm. Then, expand the Algorithm parameters and limit the Max
target sequences to 10. Then, tick “Show results in a new window” and click on the
BLAST button.

This will return the 10 protein sequences in UniProtKB that are most similar
to the input sequence. Click on the download button on top of the hit list and select the
“FASTA (complete sequence)” option. This will download the 10 sequences in FASTA
format within a single file. Change the file extension of this file from “.txt” to “.fasta”.
Now, you can open this file in the MEGA software.
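
Before opening the file in MEGA, it can be useful to confirm that it really contains 10 records. A minimal multi-FASTA reader, assuming the usual layout of a ">" header line followed by one or more sequence lines, could look like this (the function name and the file name in the usage note are hypothetical):

```python
def parse_fasta(text):
    """Return a list of (header, sequence) pairs from FASTA-formatted text."""
    records = []
    header, parts = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines between records
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(parts)))
            header, parts = line[1:], []
        elif header is not None:
            parts.append(line)  # sequence lines may be wrapped over several lines
    if header is not None:
        records.append((header, "".join(parts)))
    return records
```

Reading the downloaded file with `open("hits.fasta").read()` and passing the text to `parse_fasta` should then give a list of length 10.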

To perform the multiple sequence alignment, first, open the MEGA software. Then, click
on the Align button and select “Edit/Build Alignment”. Then select “Retrieve sequences
from a file” from the next window and click “Ok”. Then locate the FASTA file that you
downloaded before and open the file. Now, you will see all 10 sequences in the
sequence viewer.

Then, click on the Edit menu and select the “Select All” option. This will select all the
sequences. Then, click on the Alignment menu and click on the “Align by ClustalW”
option. In the resulting parameter box, change the Protein Weight Matrix to BLOSUM
(click on the “Weight” submenu). Keep the default values for the remaining parameters
and click on the “Ok” button. This will run the multiple sequence alignment.

The aligned sequences will then appear in the sequence viewer. Deselect the initial
selection by clicking on one sequence. This will bring back the colors so that you can observe
similar amino acid residues in the aligned sequences. Fully conserved sites will be
marked with a star at the top of the site.
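
The star marking can be reproduced outside MEGA from an exported alignment. The sketch below assumes that a site counts as fully conserved when every sequence carries the same residue there and none has a gap; the function name and the toy alignment are hypothetical.

```python
def conserved_columns(aligned):
    """Return 0-based indices of alignment columns with one residue and no gaps."""
    if len({len(seq) for seq in aligned}) != 1:
        raise ValueError("need at least one sequence, all of the same aligned length")
    return [
        i
        for i, column in enumerate(zip(*aligned))
        if len(set(column)) == 1 and column[0] != "-"
    ]
```

For the toy alignment `["MK-LV", "MKALV", "MR-LV"]`, the function returns `[0, 3, 4]`: the M, L and V columns are fully conserved, while the K/R column and the gapped column are not.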

You can save the session by clicking on the Data menu and selecting the “save session”
option. Or you can click on “Export alignment” in the Data menu and export the
alignment in FASTA format.

Now, access the UniProt record of the input sequence (which you already accessed in
3(b)) and locate the corresponding Pfam record.

a. Write the names of the distinct domains found in the sequence with their
sequence coordinates. Also, take a screenshot of the domain organization
diagram and paste it below. (6 marks)

b. Now, you can observe the locations of the above domains in the sequence
alignment. First, click on the very first amino acid residue of the input sequence
(3(b)) in the MEGA alignment viewer. At the bottom of the window, you will see
the position number. Make sure to select the “W/O Gaps” option to avoid counting
gaps. Now, by clicking on different residues, you can identify the domain regions
of the Pfam record. What can you conclude about the conservation of residues in
domain regions? (2 marks)
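
The reason for counting without gaps is that Pfam coordinates refer to residue numbers in the unaligned sequence, while the alignment viewer shows columns that also include gap characters. The conversion between the two can be sketched as follows, assuming "-" marks a gap and positions are 1-based (the function name is hypothetical):

```python
def column_of_residue(aligned_seq, residue_number):
    """Return the 1-based alignment column holding the given 1-based residue."""
    count = 0
    for column, char in enumerate(aligned_seq, start=1):
        if char != "-":
            count += 1  # only real residues advance the ungapped position
            if count == residue_number:
                return column
    raise ValueError("residue number exceeds the ungapped sequence length")
```

For the aligned row `M-K--LV`, residue 3 (the L) sits in alignment column 6, so `column_of_residue("M-K--LV", 3)` returns `6`.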
