BI Manual
Department: Zoology
Roll. No : bsf2001858
Semester:7th
Bioinformatics
Practicals
Practical # 1
Retrieval of FASTA sequence using NCBI
Introduction to NCBI
NCBI stands for the National Center for Biotechnology Information. It is a division of the
National Library of Medicine (NLM) at the U.S. National Institutes of Health and a leader in
the field of bioinformatics. It studies computational approaches to fundamental questions in
biology and provides online delivery of biomedical information and bioinformatics tools. NCBI
produces a variety of online information resources for biology, including the GenBank nucleic
acid sequence database and the PubMed database of citations and abstracts published in life
science journals. NCBI provides search and retrieval operations for most of these data from 35
distinct databases.
Entrez Global Query is an integrated search and retrieval system that provides access to all
databases simultaneously with a single query string and user interface.
METHODS
First, open the NCBI website https://www.ncbi.nlm.nih.gov in a web browser.
Then open the database menu and choose the Nucleotide option.
Retrieve FASTA
FASTA Sequence:
A gene is the molecular unit of heredity of a living organism. It is the name given to certain
stretches of DNA and RNA that code for a type of protein or a strand of RNA that has some
function in an organism. Knowledge of gene sequences has become indispensable for basic
biological research, other research branches using sequencing, and in many applied fields such
as diagnostics, biotechnology, forensic biology, and biological systematics. In bioinformatics, the
FASTA format is a textual format for representing either nucleotide sequences or peptide
sequences in which nucleotides or amino acids are represented by single-letter codes. The
format also allows a sequence name and comments to precede each sequence, on a header line
that begins with ">". The format originates from the FASTA software package but has now become
a standard in bioinformatics.
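The FASTA layout described above can be read with a few lines of Python. This is only a minimal sketch: it assumes the common convention that each record starts with a header line beginning with ">" followed by one or more sequence lines. The function name `parse_fasta` and the demo record are made up for illustration.

```python
def parse_fasta(text):
    """Return a list of (header, sequence) tuples from FASTA-formatted text."""
    records = []
    header = None
    chunks = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                records.append((header, "".join(chunks)))
            header = line[1:]
            chunks = []
        elif header is not None:
            # Sequence lines are joined into one string per record.
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


# Hypothetical record used only to show the format.
example = """>demo_record hypothetical example, not a real accession
ACGTACGT
TTGA
"""
records = parse_fasta(example)
print(records)
```

Sequence text that appears before any header line is ignored by this sketch.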
Obtain the relevant information about the gene, then retrieve its sequence in FASTA format by
clicking the FASTA link at the left corner of the record page.
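The same retrieval can also be done without the browser through NCBI's E-utilities `efetch` service. The sketch below is an assumption-laden illustration, not the procedure used in this practical: the accession `NM_005228` is assumed here to be a RefSeq record for the human EGFR transcript (the manual does not state which record was used), and the helper names are hypothetical.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

EFETCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def build_efetch_url(accession, db="nuccore"):
    """Build an efetch URL that asks for one record as plain-text FASTA."""
    params = {"db": db, "id": accession, "rettype": "fasta", "retmode": "text"}
    return EFETCH_URL + "?" + urlencode(params)


def fetch_fasta(accession, db="nuccore"):
    """Download the FASTA text for an accession (needs internet access)."""
    with urlopen(build_efetch_url(accession, db)) as response:
        return response.read().decode("utf-8")


# Assumed accession, used only as an example value.
url = build_efetch_url("NM_005228")
print(url)
```

Calling `fetch_fasta("NM_005228")` would download the record; it is left out of the example so that the sketch runs offline.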
FASTA Sequence of EGFR gene
CTGGTTGTGCATTTGCTGTGGGTTCCCTCCGGCAGGCGACCTCTCCGCGCTGAGAAGGTTATCCGGATAAC
CAAGTAATTATGTGGTGACAGATCACGGCTCGTGCGTCCGAGCCTGTGGGGCCGACAGCTATGAGATGGA
GGAAGACGGCGTCCGCAAGTGTAAGAAGTGCGAAGGGCCTTGCCGCAAAGTGTGTAACGGAATAGGTAT
TGGTGAATTTAAAGACTCACTCTCCATAAATGCTACGAATATTAAACACTTCAAAAACTGCACCTCCATCAGT
GGCGATC
Practical # 2
Determination of protein physical and chemical
parameters
Physical Parameters:
A characteristic of matter that may be observed and measured without changing the chemical
identity of a sample. The measurement of a physical property may change the arrangement of
matter in a sample but not the structure of its molecules.
Chemical Parameters:
Chemical parameters include pH, acidity, alkalinity, chlorine, hardness, dissolved oxygen and
biological oxygen demand. Biological parameters include nutrients, bacteria, algae and viruses.
Water quality parameters are important because different application scenarios will generally
have different requirements.
METHOD
Open the Expasy ProtParam website https://web.expasy.org/protparam/ in a web browser.
Then paste the amino acid sequence of the EGFR protein (retrieved from NCBI) into the box and
click Compute parameters.
The results page appears, showing different physical and chemical parameters of the EGFR
protein.
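Two of the values that ProtParam reports, amino acid composition and molecular weight, can be approximated offline. The sketch below assumes rounded average residue masses, so results may differ slightly from the ProtParam output. The function names and the demo peptide are hypothetical.

```python
from collections import Counter

# Rounded average residue masses in daltons (free amino acid minus one water).
RESIDUE_MASS = {
    "A": 71.0788, "R": 156.1875, "N": 114.1038, "D": 115.0886, "C": 103.1388,
    "E": 129.1155, "Q": 128.1307, "G": 57.0519, "H": 137.1411, "I": 113.1594,
    "L": 113.1594, "K": 128.1741, "M": 131.1926, "F": 147.1766, "P": 97.1167,
    "S": 87.0782, "T": 101.1051, "W": 186.2132, "Y": 163.1760, "V": 99.1326,
}
WATER_MASS = 18.01528


def composition_percent(protein):
    """Percentage of each amino acid in the sequence."""
    if not protein:
        raise ValueError("empty sequence")
    counts = Counter(protein.upper())
    total = sum(counts.values())
    return {aa: 100.0 * n / total for aa, n in counts.items()}


def molecular_weight(protein):
    """Sum of residue masses plus one water for the free chain ends."""
    if not protein:
        raise ValueError("empty sequence")
    return sum(RESIDUE_MASS[aa] for aa in protein.upper()) + WATER_MASS


demo = "MGAAK"  # hypothetical peptide, not the EGFR sequence
print(composition_percent(demo))
print(round(molecular_weight(demo), 2))
```

A letter outside the 20 standard amino acids raises a `KeyError` in `molecular_weight`, which is a deliberate choice to keep the sketch short.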
Practical # 3
Finding similar sequences for proteins and DNA
BLAST
BLAST stands for Basic Local Alignment Search Tool. BLAST finds regions of similarity between
biological sequences. The program compares nucleotide or protein sequences to sequence
databases and calculates the statistical significance of the matches. BLAST can be used to infer functional and
evolutionary relationships between sequences as well as help identify members of gene
families.
Method:
Open the BLAST website https://blast.ncbi.nlm.nih.gov/Blast.cgi in a web browser.
Select Nucleotide BLAST; the blastn suite page opens.
Paste the FASTA sequence of EGFR into the query box, add a job title in the respective field,
then click the BLAST button.
The BLAST results are shown in the following sections:
1. Description
2. Graphic summary
3. Alignment
4. Taxonomy
Description: This section lists the similar sequences found, sorted by default from the most
to the least significant hit (ascending E value).
Query coverage: Query cover is the percentage of the query sequence length that is included
in the alignment with the matched (subject) sequence.
Percentage identity: Percent identity is the percentage of aligned positions at which the
query and the matched sequence are identical.
E value: "E-value" (Expect Value) is a statistical measure that represents the expected
number of random alignments that would have a score equal to or better than the one
obtained in the search, purely by chance.
Graphic summary: A graphical summary in the context of a BLAST report typically refers to
a visual representation of the sequence comparison between the query sequence and
database sequences found to be similar.
Alignment: In BLAST, alignment refers to the process of finding and displaying regions of
similarity between two or more sequences. The primary purpose of alignment in BLAST
is to identify regions where the query sequence and a sequence from a database (or
multiple sequences) share similarity, which can provide insights into potential homology,
functional conservation, or evolutionary relationships.
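The three numbers in the Description table can be illustrated with small calculations. This is a simplified sketch: it assumes a single gapped pairwise alignment, and it uses the bit-score form of the expect value, E = m × n × 2^(−S′), with raw lengths, whereas BLAST itself uses length-corrected (effective) values. All function names are hypothetical.

```python
def percent_identity(aligned_query, aligned_subject):
    """Identical positions as a percentage of the alignment length."""
    if len(aligned_query) != len(aligned_subject) or not aligned_query:
        raise ValueError("aligned sequences must be non-empty and equal length")
    matches = sum(
        1 for q, s in zip(aligned_query, aligned_subject) if q == s and q != "-"
    )
    return 100.0 * matches / len(aligned_query)


def query_coverage(query_start, query_end, query_length):
    """Percentage of the query covered by an alignment (1-based, inclusive)."""
    return 100.0 * (query_end - query_start + 1) / query_length


def expect_value(query_length, database_length, bit_score):
    """Expected number of chance hits with at least this bit score."""
    return query_length * database_length * 2.0 ** (-bit_score)


print(percent_identity("ACGT-A", "ACCTTA"))
print(query_coverage(11, 60, 100))
print(expect_value(100, 1_000_000, 40))
```

A lower E value means a hit is less likely to have arisen by chance, which is why the most significant hits appear first in the table.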
Tool: ClustalW
ClustalW is a widely used system for aligning any number of homologous nucleotide or protein
sequences. For multi-sequence alignments, ClustalW uses progressive alignment methods. In
these, the most similar sequences, that is, those with the best alignment score, are aligned first.
Then progressively more distant groups of sequences are aligned until a global alignment is
obtained.
Enter the multiple sequences in the input box and click Execute Multiple Alignment.
The output page opens.
Result Interpretation
Conserved Regions: Positions where most or all sequences
have the same nucleotide (A, T, C, G). These positions
indicate conserved regions. Conserved regions are often
biologically significant, as they may represent functional
domains or important structural elements in the
sequences.
Gaps in the alignment are represented by "-" characters.
They indicate insertions or deletions (indels) in the
sequences. The length and position of gaps can vary. Long
gaps may suggest significant sequence differences or
structural variations.
Consensus Sequence: Some alignment files may include a
consensus sequence, which represents the most common
nucleotide at each position in the alignment. ClustalW
output also prints a conservation line under the alignment,
in which "*" marks fully conserved columns and, in protein
alignments, ":" and "." mark strongly and weakly similar
residues.
Phylogenetic analysis: The multiple sequence alignment
can be used as input for phylogenetic analysis to infer
evolutionary relationships between sequences.
Phylogenetic trees can be generated to visualize these
relationships based on the alignment.
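The interpretation points above can be checked on a small alignment. The sketch below assumes the aligned sequences are already available as equal-length strings with "-" for gaps. It marks only fully identical columns with "*", which is a simplification of the ClustalW conservation line, and the three demo sequences are made up.

```python
from collections import Counter


def conservation_line(aligned):
    """Return '*' under columns where every sequence has the same base."""
    if len({len(seq) for seq in aligned}) != 1:
        raise ValueError("all aligned sequences must have the same length")
    marks = []
    for column in zip(*aligned):
        first = column[0]
        conserved = first != "-" and all(base == first for base in column)
        marks.append("*" if conserved else " ")
    return "".join(marks)


def consensus(aligned):
    """Most common non-gap character in each column ('-' if only gaps)."""
    result = []
    for column in zip(*aligned):
        counts = Counter(base for base in column if base != "-")
        result.append(counts.most_common(1)[0][0] if counts else "-")
    return "".join(result)


# Hypothetical aligned sequences, not real EGFR data.
alignment = ["ACGT-A", "ACGTTA", "ACCTTA"]
for seq in alignment:
    print(seq)
print(conservation_line(alignment))
print(consensus(alignment))
```

Columns left blank in the conservation line are the ones that contain a gap or a substitution in at least one sequence.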
Practical # 5
Predicting Protein Secondary Structure
Tool: PSIPRED
PSIPRED first normalizes the sequence profile generated by PSI-BLAST. A neural network then
predicts an initial secondary structure. For each amino acid in the sequence, the neural
network is fed a window of 15 amino acids centered on that residue.
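The idea of a 15-residue window can be sketched as follows. This is an illustration only: the real PSIPRED input is the PSI-BLAST profile for each window position rather than raw letters, and the padding character and function name used here are assumptions.

```python
def residue_windows(sequence, width=15, pad="-"):
    """Return one window per residue, centered on it and padded at the ends."""
    if width % 2 == 0:
        raise ValueError("width must be odd so the window has a center")
    half = width // 2
    padded = pad * half + sequence + pad * half
    return [padded[i:i + width] for i in range(len(sequence))]


# Hypothetical short peptide and a narrow window, so the output is easy to read.
windows = residue_windows("ACDEFGHIK", width=5)
for residue, window in zip("ACDEFGHIK", windows):
    print(residue, window)
```

With the default `width=15`, each residue is seen together with seven neighbours on either side, which is the context the network uses for its prediction.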
Method :
Open the PSIPRED website http://bioinf.cs.ucl.ac.uk/psipred/ in a web browser.
Then enter the EGFR sequence in FASTA format in the input box and add a job name.
Then click the Submit button.
The results are shown on the screen.
Practical # 6
Predicting RNA Secondary Structure
Tool used: RNAfold
RNAfold predicts the minimum free energy secondary structure of a single RNA or DNA sequence.
The related RNAalifold server predicts the consensus structure of a set of aligned DNA or RNA
sequences. It extends standard dynamic programming algorithms for RNA secondary structure
prediction by averaging the energy contributions over all sequences and incorporating
covariation terms into the energy model to reward compensatory mutations and penalize
non-compatible base-pairs. Like RNAfold, it supports prediction of the minimum free energy
structure and base-pairing probabilities and can handle circular sequences. The input is a
single multiple sequence alignment in CLUSTAL W or FASTA format. There are only two additional
parameters compared to the RNAfold server, namely ‘Weight of covariance term’ and ‘Penalty for
non-compatible sequences’, which affect the covariance scoring schema and the penalization of
non-compatible base-pairs of the RNAalifold algorithm. The output is similar to that of the
RNAfold server, but also features a structure annotated alignment. Plots are augmented by a
special coloring schema that indicates compensatory mutations. Note that the more mutations
are observed that support a certain base-pair, the more evidence is given that this base-pair
might be correctly predicted.
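The dynamic programming idea behind such tools can be shown with a much simpler stand-in. The sketch below is the classic Nussinov base-pair maximization, not the energy model that RNAfold or RNAalifold actually use. It assumes Watson-Crick and G-U pairs and a minimum hairpin loop of three unpaired bases, and the demo sequence is made up.

```python
PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}


def max_base_pairs(seq, min_loop=3):
    """Maximum number of nested base pairs (Nussinov recursion)."""
    seq = seq.upper()
    n = len(seq)
    if n == 0:
        return 0
    # dp[i][j] holds the best pair count for the subsequence seq[i..j].
    dp = [[0] * n for _ in range(n)]
    for span in range(min_loop + 1, n):
        for i in range(n - span):
            j = i + span
            best = dp[i][j - 1]  # case 1: base j stays unpaired
            # case 2: base j pairs with some k, leaving >= min_loop bases between
            for k in range(i, j - min_loop):
                if (seq[k], seq[j]) in PAIRS:
                    left = dp[i][k - 1] if k > i else 0
                    best = max(best, left + 1 + dp[k + 1][j - 1])
            dp[i][j] = best
    return dp[0][n - 1]


# Hypothetical hairpin: three G-C pairs closing a four-base loop.
print(max_base_pairs("GGGAAAUCCC"))
```

Energy-based tools replace the simple count of pairs with free energy contributions for stacks and loops, but the table-filling order over growing subsequences is the same idea.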
Method:
Open the RNAfold website
http://rna.tbi.univie.ac.at/cgi-bin/RNAWebSuite/RNAfold.cgi in a web browser.
Features:
For each family in Pfam one can:
● View a description of the family
● Look at multiple alignments
● View protein domain architectures
● Examine species distribution
● Follow links to other databases
● View known protein structures
Method
Open the Pfam website http://pfam-legacy.xfam.org/ in a web browser.
Then enter the accession number of the protein and click the Go button.
Practical # 11
Primer design