
Lecture Bioinfo Databases

Uploaded by

Khushal Khan

Who needs to study Bioinformatics?

•What is Bioinformatics?

“Bioinformatics is about searching biological databases, comparing sequences, looking at protein structures, and (more generally) asking biological questions with a computer”

•Introduced by French scientist Jean-Michel Claverie in the late 1980s (“bioinformatique”)

•Saves you months of work!


Before the era of Bioinformatics

• Only two ways to perform experiments:

1. In vivo

2. In vitro

• We are now in the age of in silico biology!
Bioinformatics is a must do!
Bioinformatics in context
[Figure: Bioinformatics at the centre, surrounded by mathematics/computer science, genomics, molecular biology, biophysics, molecular evolution, and ethical, legal, and social implications]
What does this mean?

• Think of Bioinformatics as a tool!

• Now you are equipped with computational tools to answer biological questions
The biological foundations of Bioinformatics
• Proteins and Nucleic acids

• Proteins are made up of amino acids, while nucleic acids are made up of nucleotides

• How best to represent proteins and nucleic acids?
• Need a formula to describe their composition
• The identity of a protein is determined by its composition and the precise order of the amino acids it contains
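
The point about composition and order can be illustrated with a minimal Python sketch. The two peptides below are invented for illustration and written in one-letter amino acid codes; the function name `composition` is likewise an assumption, not part of the lecture.

```python
from collections import Counter


def composition(sequence: str) -> Counter:
    """Count how often each one-letter residue code occurs in a sequence."""
    return Counter(sequence.upper())


# Two invented peptides: same residues, different order.
peptide_a = "GAVLS"
peptide_b = "SLVAG"

# Composition alone cannot tell them apart...
print(composition(peptide_a) == composition(peptide_b))
# ...but the order of residues can.
print(peptide_a == peptide_b)
```

Both peptides have the same composition yet are different sequences, which is why a protein is recorded as an ordered string of residues rather than as a simple tally.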
The Birth of Bioinformatics

• Protein sequences started to accumulate in 1960s

• People started manual comparisons (pre-computer era)

• With the advent of computers, people started to write algorithms from scratch to analyze “sequence data”

• This was the genesis of bioinformatics


“The holy grail of Bioinformatics”
GCTCCTCACTGTCTGTGTTTATTCTTTTAGCTTCTTCAGA
TCTTTTAGTCTGAGGAAGCCTGGCATGTGCAAATGAAG
TTAACCTAA...
> 500,000 genes sequenced

Expected number of unique protein structures: ~ 700-1,000
The core of Bioinformatics to date
•Relationships between sequence, 3D structure, and protein functions

[Figure: Sequence → 3D structure → protein functions, illustrated with the protein sequence below]
TDQAAFDTNIVTLTRFVM
EQGRKARGTGEMTQLLNS
LCTAVKAISTAVRKAGIA
HLYGIAGSTNVTGDQVKK
LDVLSNDLVINVLKSSFA
TCVLVTEEDKNAIIVEPE
KRGKYVVCFDPLDGSSNI
DCLVSIGTIFGIYRKNST
DEPSEKDALQPGRNLVAA
GYALYGSATMLV

•Properties and evolution of genes, genomes, proteins, metabolic pathways in cells

•Use of this knowledge for prediction, modelling, and design


From sequence to structure

• Proteins adopt a three-dimensional (3D) structure, which is functionally important

• Structure is determined by the composition and order of amino acids in that protein

1. Hydrophobic amino acids (e.g., Valine, Leucine) do not want to be on the surface
2. Hydrophilic amino acids (e.g., Serine) love to be on the surface to interact with water
3. Structure is also affected by the electric charge on some residues and their size
In Short!
• Proteins have a unique order and composition of amino acids, simply
referred to as the ‘sequence’

• Sequence determines the 3D shape of the protein, simply referred to as the ‘structure’

• Structure determines the molecular activities of proteins, simply referred to as the ‘function’

• Sequence -> Structure -> Function (but not always!)


What about DNA & RNA?
• DNA & RNA are made up of nucleotide chains

• Nucleotides consist of a sugar (a carbohydrate), a phosphate group, and one of five nitrogenous bases

• Adenine, Guanine, Cytosine, Thymine, and Uracil, or simply A, G, C, T, and U (DNA uses T; RNA uses U in its place)
What should be cheaper and faster? DNA/RNA or protein sequencing?

DNA/RNA sequencing is faster and cheaper simply because of fewer characters: four nucleotides vs. twenty amino acids
What do we mean by complementarity?

T is always facing A, while G is always facing C, in a one-to-one reciprocal relationship
How can this knowledge help us?

If we know the sequence of one strand, we can get the sequence of the other strand
Example
• 5’-ATGCTGA-3’

• What is the complementary sequence?


• 5’-ATGCTGA-3’
• 3’-TACGACT-5’

• How is this reported?


• 5’-ATGCTGA-3’ and 5’-TCAGCAT-3’
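
The example above can be reproduced with a short Python sketch. It assumes a DNA strand containing only A, C, G, and T; the function names `complement` and `reverse_complement` are illustrative choices, not taken from the lecture.

```python
# Translation table for the base-pairing rule: A<->T and C<->G.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def complement(strand: str) -> str:
    """Return the facing bases, position by position (read 3'->5')."""
    return strand.upper().translate(COMPLEMENT)


def reverse_complement(strand: str) -> str:
    """Return the other strand written in the conventional 5'->3' direction."""
    return complement(strand)[::-1]


top = "ATGCTGA"
print(complement(top))          # TACGACT, i.e. 3'-TACGACT-5'
print(reverse_complement(top))  # TCAGCAT, i.e. 5'-TCAGCAT-3'
```

Reversing the complement is what turns 3’-TACGACT-5’ into the reported form 5’-TCAGCAT-3’, since sequences are conventionally written 5’ to 3’.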
What is a Database?

A database is an organized collection of related information


What are the advantages of using databases?

• Easy and quick retrieval of information

• Provide backup support


Biological Databases
•Need to collect and store biological data and its associated knowledge
into databases

•Fundamental to the survival of science


Two kinds of Biological Databases

1. Primary
• Contain primary sequence information (nucleotide or protein) and associated
annotations

2. Secondary
• Summarize the results from primary databases
Primary Databases

• Nucleotide sequence databases

• Protein sequence databases


Nucleotide Sequence Databases
• GenBank
• Perhaps the best-known database
• Contains all publicly available annotated DNA sequences
• Exchanges data daily with the DNA Data Bank of Japan (DDBJ) and the European Molecular Biology Laboratory (EMBL)
• Contains roughly 179 million sequence entries (Dec 2014)
• Prior submission of a sequence to GenBank/DDBJ/EMBL is a prerequisite for publishing a new sequence in any scientific journal
• Submission is easy and can be done electronically
• Each entry has a unique ID known as the “Accession Number (AN)”
Accession number

• A unique identifier of each record in the database

• Usually alpha-numeric in nature


Why do we need accession numbers?
• Common names lead to non-specific results
• A search on “Cytochrome” will output many different types of cytochromes (a, b, c, and others)

• Cannot distinguish among species

• A search on “Insulin” will return insulin sequences from many organisms
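
The difference between a name search and an accession lookup can be sketched with a toy in-memory table. Everything here is hypothetical: the `DEMO…` identifiers are invented placeholders, not real GenBank accession numbers, and the function names are illustrative only.

```python
# Toy records keyed by accession; the IDs are invented placeholders.
RECORDS = {
    "DEMO0001": {"name": "insulin", "organism": "Homo sapiens"},
    "DEMO0002": {"name": "insulin", "organism": "Mus musculus"},
    "DEMO0003": {"name": "cytochrome c", "organism": "Homo sapiens"},
    "DEMO0004": {"name": "cytochrome b", "organism": "Homo sapiens"},
}


def search_by_name(keyword: str) -> list:
    """Return the accessions of every record whose name contains the keyword."""
    keyword = keyword.lower()
    return sorted(acc for acc, rec in RECORDS.items() if keyword in rec["name"])


def fetch_by_accession(accession: str) -> dict:
    """Return exactly one record; raises KeyError if the ID is unknown."""
    return RECORDS[accession]


print(search_by_name("insulin"))       # several hits, one per organism
print(search_by_name("cytochrome"))    # several hits, one per cytochrome type
print(fetch_by_accession("DEMO0002"))  # exactly one record
```

A common name matches several records, whereas an accession identifies a single one, which is the property that makes it suitable as a database key.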
Example GenBank Entry
Secondary Databases
PROSITE
• Sometimes a newly sequenced protein gives no hits to sequence databases

• How do we determine its function then?

“In some cases, the structure and function of an unknown protein which is too distantly related to any protein of known structure to detect its affinity by overall sequence alignment may be identified by its possession of a particular cluster of residue types classified as a motif. The motifs, or templates, or fingerprints, arise because of particular requirements of binding sites that impose very tight constraint on the evolution of portions of a protein sequence” - A. M. Lesk, 1988
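
Motif searching of this kind can be sketched in Python by translating a PROSITE-style pattern into a regular expression. This is a simplified sketch, not the PROSITE scanning tool: it handles only the basic pattern elements (`x`, `[...]`, `{...}`, repeat counts, and terminus anchors outside brackets) and does not report overlapping matches. The pattern used, `N-{P}-[ST]-{P}`, is the N-glycosylation site pattern; the test sequence and the function names are invented for illustration.

```python
import re


def prosite_to_regex(pattern: str) -> str:
    """Translate a basic PROSITE-style pattern into a Python regex string."""
    parts = []
    for element in pattern.rstrip(".").split("-"):
        element = element.replace("{", "[^").replace("}", "]")  # {P} -> [^P]
        element = element.replace("(", "{").replace(")", "}")   # x(2,4) -> x{2,4}
        element = element.replace("x", ".")                     # x = any residue
        element = element.replace("<", "^").replace(">", "$")   # terminus anchors
        parts.append(element)
    return "".join(parts)


def find_motif(pattern: str, sequence: str) -> list:
    """Return (0-based start, matched residues) for each non-overlapping hit."""
    regex = re.compile(prosite_to_regex(pattern))
    return [(m.start(), m.group()) for m in regex.finditer(sequence)]


# Invented test sequence: "NGSA" fits the pattern, "NPTA" does not (P after N).
print(prosite_to_regex("N-{P}-[ST]-{P}"))
print(find_motif("N-{P}-[ST]-{P}", "AANGSAANPTA"))
```

A match only says that the motif is present in the sequence; whether the site is functional still has to be established by other evidence.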
