First Lecture

The document provides an overview of genomics and proteomics. It discusses that genomics deals with DNA sequences, organization, function and evolution of genomes, while proteomics aims to identify all proteins in a cell including modified forms, localization, functions and interactions. It also describes how genomics was enabled by techniques like gene cloning and recombinant DNA.

Uploaded by

Mohamed Hasan
Copyright
© All Rights Reserved

Introduction to Genomics, Proteomics

Genomics and Proteomics
• The field of genomics deals with the DNA sequence,
organization, function, and evolution of genomes

• Proteomics aims to identify all the proteins in a cell or
organism including any post-translationally modified
forms, as well as their cellular localization, functions,
and interactions

• Genomics was made possible by the invention of
techniques of recombinant DNA, also known as gene
cloning or genetic engineering

Genetic Engineering
• In genetic engineering, the immediate goal of an experiment
is to insert a particular fragment of chromosomal DNA into a
plasmid or a viral DNA molecule

• This is accomplished by breaking DNA molecules at specific
sites and isolating particular DNA fragments

• DNA fragments are usually obtained by the treatment of DNA
samples with restriction enzymes

• Cloning from mRNA molecules depends on an unusual
polymerase, reverse transcriptase, which can use a single-stranded
RNA molecule as a template and synthesize a
complementary DNA (cDNA)
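
As a rough illustration of how a restriction enzyme breaks DNA at specific sites, the sketch below cuts one strand wherever a recognition sequence occurs. It assumes the EcoRI site GAATTC, cut after the first G; the function name `digest`, the example sequence, and the single-strand simplification (the complementary strand and sticky ends are ignored) are illustrative assumptions, not part of the lecture.

```python
def digest(dna, site="GAATTC", cut_offset=1):
    """Split a single DNA strand at every occurrence of a recognition site.

    cut_offset is the number of bases of the site that stay on the
    left-hand fragment (1 for EcoRI: G^AATTC).
    """
    dna = dna.upper()
    fragments = []
    previous_cut = 0
    position = dna.find(site)
    while position != -1:
        cut = position + cut_offset
        fragments.append(dna[previous_cut:cut])
        previous_cut = cut
        position = dna.find(site, position + 1)
    fragments.append(dna[previous_cut:])  # whatever follows the last cut
    return fragments


# Hypothetical 18-base sequence containing two EcoRI sites.
print(digest("AAGAATTCTTGAATTCGG"))
```

Joining the fragments in order gives back the input, which is a quick sanity check that no bases were lost.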
cDNA Cloning

• The resulting full-length cDNA contains a coding
sequence for the protein of interest that is
uninterrupted by introns
• If the DNA sequence at both ends of the cDNA is known,
so that appropriate primers can be designed, the
cDNA produced by reverse transcriptase can be
amplified by reverse transcriptase PCR (RT-PCR)
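
The base-pairing rule behind cDNA synthesis can be sketched in a few lines. This is only a string-level illustration: the function name `first_strand_cdna` and the example mRNA are assumptions, and the biochemistry (primers, enzyme, second-strand synthesis) is not modelled.

```python
# Pairing of each RNA base with the DNA base placed opposite it.
RNA_TO_CDNA = {"A": "T", "U": "A", "G": "C", "C": "G"}


def first_strand_cdna(mrna):
    """Return the first-strand cDNA, written 5'->3', for an mRNA written 5'->3'."""
    # The cDNA strand is antiparallel to its template, so read the mRNA backwards.
    return "".join(RNA_TO_CDNA[base] for base in reversed(mrna.upper()))


print(first_strand_cdna("AUGGCC"))
```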

Origin of “Genomics”: 1987

“For the newly developing discipline of [genome] mapping/sequencing
(including the analysis of the information), we have adopted the term
GENOMICS… The new discipline is born from a marriage of molecular and
cell biology with classical genetics and is fostered by computational
science.”
- McKusick and Ruddle, A new discipline, a new name, a new journal,
Genomics, Vol. 1, No. 1. (September 1987), pp. 1-2
What is genomics?

“Genomics is a discipline in genetics that applies
recombinant DNA, DNA sequencing methods, and
bioinformatics to sequence, assemble, and analyze the
function and structure of genomes (the complete set of
DNA within a single cell of an organism).”

-
What is genomics?

“Research of single genes does not fall into the definition
of genomics unless the aim of this genetic, pathway, and
functional information analysis is to elucidate its effect on,
place in, and response to the entire genome's networks.”

-
Central Dogma of Biology

http://www.lhsc.on.ca/Patients_Families_Visitors/Genetics/Inherited_Metabolic/Mitochondria/DiseasesattheMolecularLevel.htm
What can genomics tell us?

DNA Sequence

Gene Sequence Regulatory Sequence DNA Variation

Protein Sequence Gene Expression Human disease

Protein/Gene Function
How do we study genomics?
1. Isolate nucleic acid molecules from biological samples
2. Determine nucleotide sequence using biochemical
techniques
3. Digitize nucleotide sequence
4. Examine digital sequences to identify patterns with
algorithms and statistics
5. Relate patterns to biological observations by:
   a. Comparing patterns detected across many samples
   b. Manipulating a system to see how patterns change
Sequencing Techniques
Sanger sequencing – fluorescent-labeled DNA fragments

Sequencing by synthesis, NGS

Adapted from http://web.uri.edu/gsc/next-generation-sequencing/


Bioinformatics
• Rapid automated DNA sequencing was instrumental in the success of
the Human Genome Project, an international effort begun in 1990 to
sequence the human genome and those of a number of other organisms
• However, a genomic sequence is like a book using an alphabet of
only four letters, without spaces or punctuation. Identifying genes and
their functions is a major challenge
• The annotation of genomic sequences at this level is one aspect of
bioinformatics, defined broadly as the use of computers in the
interpretation and management of biological data

What’s Next ?
THE “POST-GENOMICS” ERA

• Annotation
• Comparative genomics
• Structural genomics
• Functional genomics

Goal: to understand the living cell

Annotation

CCTGACAAATTCGACGTGCGGCATTGCATGCAGACGTGCATG
CGTGCAAATAATCAATGTGGACTTTTCTGCGATTATGGAAGAA
CTTTGTTACGCGTTTTTGTCATGGCTTTGGTCCCGCTTTGTTC
AGAATGCTTTTAATAAGCGGGGTTACCGGTTTGGTTAGCGAGA
AGAGCCAGTAAAAGACGCAGTGACGGAGATGTCTGATG CAA
TAT GGA CAA TTG GTT TCT TCT CTG AAT ......
.............. TGAAAAACGTA
Annotation
• Identify the genes within a given sequence of DNA
• Identify the sites which regulate the gene
• Predict the function

[Figure: the sequence below is annotated on the slide with a promoter, a TF binding site, the transcription start site, and a ribosome binding site]

CCTGACAAATTCGACGTGCGGCATTGCATGCAGACGTGCATG
CGTGCAAATAATCAATGTGGACTTTTCTGCGATTATGGAAGAA
CTTTGTTACGCGTTTTTGTCATGGCTTTGGTCCCGCTTTGTTC
AGAATGCTTTTAATAAGCGGGGTTACCGGTTTGGTTAGCGAGA
AGAGCCAGTAAAAGACGCAGTGACGGAGATGTCTGATG CAA
TAT GGA CAA TTG GTT TCT TCT CTG AAT ......
.............. TGAAAAACGTA

ORF = Open Reading Frame
CDS = Coding Sequence
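
Finding open reading frames is one of the simplest annotation steps, and a minimal sketch of it follows. It assumes the standard start codon ATG and stop codons TAA, TAG and TGA, scans only the three forward reading frames, and uses a made-up toy sequence; the name `find_orfs` and the `min_codons` threshold are illustrative. Real gene finders also look at the reverse strand and at signals such as promoters and ribosome binding sites.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(dna, min_codons=2):
    """Return (start, end, sequence) for ATG...stop ORFs in the 3 forward frames.

    start is 0-based, end is exclusive, and the stop codon is included.
    min_codons counts the codons before the stop codon (ATG included).
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # first in-frame start codon opens the ORF
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, dna[start:i + 3]))
                start = None  # look for the next ORF in this frame
    return orfs


# Toy sequence (hypothetical): one ORF in the third reading frame.
print(find_orfs("CCATGAAATTTTAGGG"))
```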
Comparative
genomics

Human ATAGCGGGGGGATGCGGGCCCTATACCC
Chimp ATAGGGG - - GGATGCGGGCCCTATACCC
Mouse ATAGCG - - - GGATGCGGCGC -TATACCA

Structural
genomics

Functional Genomics
• Genomic sequencing has made possible a new approach to genetics called functional
genomics, which focuses on genome-wide patterns of gene expression and the
mechanisms by which gene expression is coordinated

• DNA microarray (or chip) - a flat surface about the size of a postage stamp with up to
100,000 distinct spots, each containing a different immobilized DNA sequence suitable
for hybridization with DNA or RNA isolated from cells growing under different conditions

• DNA microarrays are used to estimate the relative level of gene expression of each
gene in the genome
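
One common way to express the relative level of expression from a two-condition microarray is a log2 ratio of spot intensities. The sketch below assumes that convention; the gene names, the intensity values and the pseudocount of 1 (added to avoid dividing by zero) are hypothetical, and background correction and normalisation are left out.

```python
import math


def log2_ratios(treated, control, pseudocount=1.0):
    """Return log2(treated / control) per gene for genes present in both dicts."""
    return {
        gene: math.log2((treated[gene] + pseudocount) / (control[gene] + pseudocount))
        for gene in treated
        if gene in control
    }


# Hypothetical spot intensities for two genes under two growth conditions.
treated = {"geneA": 399.0, "geneB": 99.0}
control = {"geneA": 99.0, "geneB": 399.0}
for gene, ratio in log2_ratios(treated, control).items():
    print(gene, ratio)
```

A positive value means higher signal in the treated sample, a negative value means lower, and 0 means no change.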

Assigning the structures of all proteins

[Figure labels: fold, evolutionary relationship, protein complexes, biologic processes, shape and electrostatics, protein-ligand complexes, active sites, functional sites]

Sequence Analysis
● Assembly – putting short sequences together to
reconstruct a longer, source sequence
● Mapping – locating where one short sequence is found
in a longer sequence
● Pattern recognition – looking for specific patterns
within sequences that have special meaning

In each of these cases, sequences are aligned to one
another
Sequence Alignment
● Provides a measure of relatedness
● Alignment quantified by similarity (% identity)
● Useful for any sequential data type:
○ DNA/RNA
○ Amino acids
○ Protein secondary structure
● High sequence similarity might imply:
○ Common evolutionary history
○ Similar biological function
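
Percent identity can be computed directly from two rows of an alignment. The sketch below assumes the rows are already aligned, have equal length and use `-` for gaps; it divides by the number of columns that are not gap-only, which is one of several conventions in use. The function name and the example rows are illustrative.

```python
def percent_identity(row_a, row_b, gap="-"):
    """Percent of aligned columns in which both rows carry the same residue."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")
    columns = [(x, y) for x, y in zip(row_a, row_b) if not (x == gap and y == gap)]
    if not columns:
        raise ValueError("no aligned columns to compare")
    matches = sum(1 for x, y in columns if x == y)  # a gap never equals a residue here
    return 100.0 * matches / len(columns)


print(round(percent_identity("ATACCAGTAAGGGAG", "ATACCA-TAAGAGAG"), 1))
```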
What Alignments Can Tell Us

● Homology - Orthologs, Paralogs
● Genomic identity/origin of a
sequence/individual
● Genome/gene structure
○ Genic structure (exons, introns, etc)
○ RNA 2D structure
○ Chromosome rearrangements/3D structure
DNA Sequence Alignment Example
Sequence 1   ATACACAGTAGGAGATACCAGTAAGGGAGGGGG

Sequence 2   ATACCATAAGCGAG

(The slide labels examples of a match, a mismatch, and a gap.)

Alignment 1  ATACACAGTAGGAGATACCAGTAAGGGAGGGGG
             --------------ATACCA-TAAGCGAG----

Alignment 2  ATACACAGTAGGAGATACCAGTAAGGGAGGGGG
             ATAC-CA--------------TAAGCGAG----

Alignment 3  ATACACAGTAGGAGATACCAGTAAGGGAGGGGG
             ATAC-CA-TA--AG---C--G--AG--------
Scoring/Substitution Matrices

● Given alignment, how “good” is it?
● Higher score = better alignment
● Implicitly represent evolutionary patterns

      A    C    G    T    -
A     2   -3   -1   -3   -3
C    -3    2   -3   -1   -3
G    -1   -3    2   -3   -3
T    -3   -1   -3    2   -3
-    -3   -3   -3   -3   NA

ATACCAGTAAGGGAG     Score = 22
ATACCA-TAAGAGAG

ATACCAGTAAGG-GAG    Score = 17
ATACCA-TAAG-AGAG

ATACCA-GTAAGGGAG    Score = -20
A-TACCATAAGAGAG-
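
Scoring an alignment with the matrix on this slide is a column-by-column table lookup followed by a sum. The sketch below encodes the matrix as given (2 for a match, -1 for A/G and C/T, -3 for other mismatches and for a gap against a base); the variable and function names are illustrative.

```python
ALPHABET = "ACGT-"
ROWS = [
    [2, -3, -1, -3, -3],       # A
    [-3, 2, -3, -1, -3],       # C
    [-1, -3, 2, -3, -3],       # G
    [-3, -1, -3, 2, -3],       # T
    [-3, -3, -3, -3, None],    # gap; gap against gap is undefined (NA)
]
MATRIX = {
    (row_symbol, col_symbol): ROWS[i][j]
    for i, row_symbol in enumerate(ALPHABET)
    for j, col_symbol in enumerate(ALPHABET)
}


def alignment_score(row_a, row_b):
    """Sum the matrix entries over every column of a pairwise alignment."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")
    total = 0
    for x, y in zip(row_a, row_b):
        value = MATRIX[(x, y)]
        if value is None:
            raise ValueError("a column may not contain two gaps")
        total += value
    return total


print(alignment_score("ATACCAGTAAGGGAG", "ATACCA-TAAGAGAG"))
print(alignment_score("ATACCA-GTAAGGGAG", "A-TACCATAAGAGAG-"))
```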
Sequence Alignment Algorithms

● Global alignments - beginning and end of
both sequences must align
● Local alignments - one sequence may align
anywhere within the other
● Multiplicity:
○ Pairwise alignments (2 sequences)
○ Multiple sequence alignment (3+ sequences)
Global Alignment
Both sequences are aligned from end to end

AAANTAIYYDPNPDMP
A--NTAI-YDPN--M-

Interior sequences are aligned as well as possible

AERAKDNLCRLEHTTLRKVTAAANTAIYYDPNPDMPVVAEDQEWVNVYYEM
A-----N------T-----------AI-YD--P------------N----M

However, sequences of vastly different length can produce
meaningless alignments
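
A global alignment score is usually computed by dynamic programming in the style of Needleman and Wunsch, which the slides do not name. The sketch below is a minimal version under assumed scores (match +1, mismatch -1, gap -1, linear gap cost); it returns only the best score, not the alignment itself, and the function name is illustrative.

```python
def global_alignment_score(a, b, match=1, mismatch=-1, gap=-1):
    """Best end-to-end alignment score of a and b (Needleman-Wunsch style)."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    # Aligning a prefix against nothing costs one gap per residue.
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            diagonal = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            score[i][j] = max(diagonal, score[i - 1][j] + gap, score[i][j - 1] + gap)
    return score[-1][-1]


print(global_alignment_score("DOROTHYHODGKIN", "DOROTHYCROWFOOTHODGKIN"))
```

Because both ends must align, every residue that has no partner in the other sequence costs a gap, which is why sequences of very different length score poorly.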
Local Alignment
Alignment may begin and end at any position

AAANTAIYYDPNPDMP
-AANTAI-YDPN--M-

AERAKDNLCRLEHTTLRKVTAAANTAIYYDPNPDMPVVAEDQEWVNVYYEM
---------------------AANTAI-YDPN--M----------------

Local alignment may produce better alignments when
sequence lengths differ greatly
LOCAL ALIGNMENT
SMITH-WATERMAN
• BEST SCORE FOR ALIGNING PART OF SEQUENCES
• OFTEN BEATS GLOBAL ALIGNMENT SCORE
Global Alignment
ATTGCAGTG-TCGAGCGTCAGGCT
ATTGCGTCGATCGCAC-GCACGCT

Local Alignment
CATATTGCAGTGGTCCCGCGTCAGGCT
TAAATTGCGT-GGTCGCACTGCACGCT
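
The Smith-Waterman idea can be sketched with the same dynamic-programming table as a global alignment, with two changes: no cell may drop below zero, and the answer is the largest cell anywhere in the table. The scores used here (match +1, mismatch -1, gap -1) and the function name are assumptions for illustration, and only the score is returned.

```python
def local_alignment_score(a, b, match=1, mismatch=-1, gap=-1):
    """Best score for aligning any part of a with any part of b."""
    best = 0
    previous = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        current = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diagonal = previous[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # The 0 lets an alignment restart wherever the running score turns negative.
            current[j] = max(0, diagonal, previous[j] + gap, current[j - 1] + gap)
            best = max(best, current[j])
        previous = current
    return best


print(local_alignment_score("HODGKIN", "DOROTHYCROWFOOTHODGKIN"))
```

Only two rows of the table are kept at a time, which is enough when the aligned region itself is not reconstructed.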
GLOBAL VS. LOCAL ALIGNMENT

Global alignment:
DOROTHY--------HODGKIN
DOROTHYCROWFOOTHODGKIN

Local alignment:
DOROTHY        HODGKIN
DOROTHY        HODGKIN
Multiple Sequence Alignment
Like pairwise alignment, but with N sequences

Sequence consensus among many species suggests
evolutionary pressure
Alignment Examples
Example: Genome Assembly
We need multiple copies of each book
(genome) to arrive at a consensus text
(DNA sequence) of the original

If your genome were a book that had
its sentences chopped into
fragments, assembly would be analogous to
reconstructing all the sentences.
Great explanation of DNA sequence assembly: http://gcat.davidson.edu/phast/
Example: Genome Assembly

An error?
A polymorphism?
A different allele?
Incorrect alignment?

Greedy approach: take most frequent nucleotide at each aligned position
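
The greedy rule can be written as a column-wise majority vote. The sketch below assumes the reads have already been aligned and padded with `-` to a common length, which is the hard part of real assembly and is not shown; the function name and the three example reads are hypothetical.

```python
from collections import Counter


def consensus(aligned_reads, gap="-"):
    """Take the most frequent non-gap nucleotide in every aligned column."""
    if len({len(read) for read in aligned_reads}) != 1:
        raise ValueError("reads must be padded to the same length")
    result = []
    for column in zip(*aligned_reads):
        counts = Counter(base for base in column if base != gap)
        # A column holding only gaps stays a gap in the consensus.
        result.append(counts.most_common(1)[0][0] if counts else gap)
    return "".join(result)


# Three hypothetical reads covering the same region; the middle one disagrees at one position.
print(consensus(["ATGCA", "ATGAA", "ATGCA"]))
```

A majority vote cannot tell whether the disagreeing base is an error, a polymorphism, a different allele or a misalignment; it only picks the most common call.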

Example: Exon Microarray Probes
● Microarray probes are short single-stranded DNA
sequences from a reference genome
● Exon Microarrays have probes only from exons

● Exon probes must map to the correct exon, BUT
● Probes must NOT map anywhere else; they must be
unique in the genome
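
The uniqueness requirement can be checked, in its simplest form, by counting exact occurrences of a probe on both strands of the reference. The sketch below assumes exact matching only, whereas real probe design also has to consider near matches that could cross-hybridise; the function names and the toy genome are illustrative.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def reverse_complement(seq):
    return "".join(COMPLEMENT[base] for base in reversed(seq.upper()))


def count_occurrences(text, pattern):
    """Count matches of pattern in text, including overlapping ones."""
    count = 0
    start = text.find(pattern)
    while start != -1:
        count += 1
        start = text.find(pattern, start + 1)
    return count


def is_unique_probe(genome, probe):
    """True if the probe matches exactly once when both strands are considered."""
    genome, probe = genome.upper(), probe.upper()
    hits = count_occurrences(genome, probe)
    reverse = reverse_complement(probe)
    if reverse != probe:  # a palindromic probe would otherwise be counted twice
        hits += count_occurrences(genome, reverse)
    return hits == 1


toy_genome = "ACGTTTGACCA"  # hypothetical
print(is_unique_probe(toy_genome, "TTGAC"), is_unique_probe(toy_genome, "AC"))
```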
Example: mRNA-Seq Analysis
1. Start with a pool of mRNA molecules
2. Millions of DNA sequences 30-150 nucleotides long
3. Find all locations where sequences map in genome
4. Count the number of sequences that map to individual
regions (e.g. genes)
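
The counting step can be sketched once the mapping step has produced a position for each read. The example below assumes each read is summarised by a chromosome name and one mapped position, and each gene by a half-open interval; all names and coordinates are hypothetical, and strand, splicing and reads that map to several places are ignored.

```python
def count_reads_per_gene(read_positions, genes):
    """Count reads whose mapped position falls inside each gene interval.

    read_positions: iterable of (chromosome, position)
    genes: dict of name -> (chromosome, start, end), with end exclusive
    """
    counts = {name: 0 for name in genes}
    for chromosome, position in read_positions:
        for name, (gene_chromosome, start, end) in genes.items():
            if chromosome == gene_chromosome and start <= position < end:
                counts[name] += 1
    return counts


genes = {"geneA": ("chr1", 100, 200), "geneB": ("chr1", 900, 1000)}
reads = [("chr1", 150), ("chr1", 180), ("chr1", 950), ("chr2", 150)]
print(count_reads_per_gene(reads, genes))
```

With millions of reads, the inner loop over genes would normally be replaced by an interval index, but the counting logic stays the same.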
Example: DNA Binding Site Discovery
Identify genomic regions where a particular
TF is bound across the entire genome

By extracting and aligning the DNA
sequence corresponding to these binding
events, we can identify which DNA
sequences this TF tends to bind
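
Once the bound sequences are aligned, a common summary is a position frequency matrix: the fraction of each base at each position. The sketch below assumes the sites are already aligned and of equal length; the four example sites and the function names are hypothetical and serve only to show the bookkeeping.

```python
def position_frequency_matrix(sites):
    """Return one dict per position giving the fraction of A, C, G and T."""
    sites = [site.upper() for site in sites]
    if len({len(site) for site in sites}) != 1:
        raise ValueError("sites must be aligned to the same length")
    return [
        {base: column.count(base) / len(sites) for base in "ACGT"}
        for column in zip(*sites)
    ]


def consensus_motif(matrix):
    """Most frequent base at each position of a position frequency matrix."""
    return "".join(max("ACGT", key=lambda base: column[base]) for column in matrix)


# Hypothetical aligned binding sites for one TF.
sites = ["TGACTCA", "TGAGTCA", "TGACTCA", "TTACTCA"]
matrix = position_frequency_matrix(sites)
print(consensus_motif(matrix), matrix[1])
```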
Human Genomics and Gene Expression
The Human Genome Project

● Planning begins 1984, launched 1990,
“completed” 2001, “finished” 2004
● Championed by Dr. Charles DeLisi
● Overview of the Human Genome Project:
http://www.genome.gov/12011238
Human Genome Composition

● Key findings:
○ ~20k genes
○ More segmental duplications than expected
○ Fewer than 7% of protein families are
vertebrate-specific
○ ~3% of the sequence codes for protein-coding genes
○ >85% of the genome is transcribed
○ Repetitive elements may comprise >66% of genome
How The Genome Was Determined
International Human Genome Sequencing Consortium

● Fragment DNA with restriction enzymes
● Ligate fragments into bacterial artificial
chromosomes (BACs)
● Amplify BACs with tagged DNA fragments
● Fragment isolated BAC vectors
● Sequence via Sanger-style sequencing to 4x coverage
● Finished draft genome in ~10 years
How The Genome Was Determined
Celera Genomics: shotgun sequencing

● Used public BAC contigs from the
Human Genome Project and their own
● Much shorter DNA reads, assembled
later in silico using the HGP BAC clones
as a scaffold
● Finished draft genome in ~3 years
The Genome Is All About Genes

● Genic sequences
● What do our genes do?
● How are genes controlled?
● What genes are different between humans?
● How are genes associated with disease?

Gene Expression
Gene Expression

“Gene expression is the process by which
information from a gene is used in the
synthesis of a functional gene product.”

- Wikipedia
But What Is A Gene?

● A specific DNA sequence
● A fundamental unit of inheritance
● A molecule created by transcription of an
RNA product (then translated into a protein)
which has a function
● A “gene” is an abstract concept
But What Is A Gene?

● DNA?
● RNA?
● Protein?
● Informational molecule?
● Functional molecule?

Yes, all of them


What Is Gene Expression?
● Active mRNA transcription?
● mRNA abundance?
● mRNA translation?
● RNA function?
● Protein abundance?
● Protein function?

Yes, all of them


The Gene Expression Landscape
● mRNA - protein coding genes
● Functional non-coding RNA (ncRNA) biotypes:
○ microRNA (miRNA)/small interfering RNA (siRNA)
○ Long (intergenic) non-coding RNA (lncRNA/lincRNA)
○ Ribosomal RNA (rRNA)
○ Transfer RNA (tRNA)
○ Many more (30+)
● Antisense: transcript initiated from TSS
in opposite direction of primary gene
● Pseudogenes
How We Measure Gene Expression

● mRNA transcription/translation
○ Fluorescent tagging + microscopy
○ ribosomal capture
● mRNA abundance
○ Northern blots
○ Quantitative polymerase chain reaction (qPCR)
○ Microarrays
○ High-throughput sequencing
How We Measure Gene Expression

● Protein abundance
○ Western blots
○ Fluorescent tagging + microscopy
○ Mass spectrometry
○ Protein arrays
● mRNA/Protein localization
○ Fluorescent tagging + microscopy
mRNA Measurement Considerations

● Most mRNA quantification techniques
measure steady state abundance
● mRNA measurements are snapshots
○ Measure large populations of cells to quantify
“average” abundance
● Poor concordance between mRNA and
corresponding protein abundance
The Holy Grail of bioinformatics

...to be able to understand the words in a sequence sentence
that form a particular protein structure
In silico function prediction
…a reality check
• What is the function of this
structure?

• What is the function of this sequence?

• What is the function of this motif?


– the fold provides a scaffold, which can be
decorated in different ways by different
sequences to confer different functions -
knowing the fold & function allows us to
rationalise how the structure effects its
function at the molecular level
How Is It Possible?
• The structure of a protein is uniquely determined by
its amino acid sequence
(but sequence is sometimes not enough):
   ○ prions
   ○ pH, ions, cofactors, chaperones

• Structure is conserved much longer than sequence in
evolution.
   ○ Structure > Function >> Sequence
How Often Can We Do It?
• There are currently ~47000 structures in the PDB
(but only ~4000 if you include only ones that are not
more than 30% identical and have a resolution better
than 3.0 Å).

• An estimated 25% of all sequences can be modeled
and structural information can be obtained for ~50%.
Protein Basics:
Proteins are macromolecules

Amino acids are the basic building blocks of proteins
Amino acids are classified by properties: polar,
nonpolar, and charged (ionic)
Polypeptides are constructed by condensation reactions
between amino acids
Four Levels of Protein Structure
Different Levels of Protein Structure
Protein function depends on
specific conformation (shape)

There are four levels of protein
structure.
The primary structure is the
linear sequence of amino acids.
What determines this sequence?
Where in the cell are amino acids
joined this way?
Four Levels of Protein Structure

• Primary Structure:
Linear Sequence of Amino Acids
Each amino acid has a central carbon linked to
--- hydrogen (H)
--- amino group (NH2)
--- acid group (COOH)
--- unique group (R)
[Structural formula on slide: H2N-CH(R)-COOH]
The carboxyl group of one amino acid is linked
to the amino group of the next amino acid.
Amino acids are linked together by covalent peptide bonds
(Fig. 4-1)
Proteins are made up of a polypeptide backbone with
attached side chains
(Fig. 4-2)
Schematic amino acid R groups
(Atom key on the slide: C, N, O, S)
A Ala
C Cys
D Asp
E Glu
F Phe*
G Gly
H His*
I Ile*
K Lys*
L Leu*
M Met*
N Asn
P Pro
Q Gln
R Arg*
S Ser
T Thr*
V Val*
W Trp*
Y Tyr
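
The one-letter and three-letter codes listed above map onto each other directly, so a primary structure written in one notation can be rewritten in the other. The short sketch below uses exactly the 20 pairs from the list; the function name and the example peptide are illustrative.

```python
ONE_TO_THREE = {
    "A": "Ala", "C": "Cys", "D": "Asp", "E": "Glu", "F": "Phe",
    "G": "Gly", "H": "His", "I": "Ile", "K": "Lys", "L": "Leu",
    "M": "Met", "N": "Asn", "P": "Pro", "Q": "Gln", "R": "Arg",
    "S": "Ser", "T": "Thr", "V": "Val", "W": "Trp", "Y": "Tyr",
}


def to_three_letter(sequence, separator="-"):
    """Rewrite a one-letter protein sequence using three-letter codes."""
    return separator.join(ONE_TO_THREE[residue] for residue in sequence.upper())


print(to_three_letter("MKV"))
```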
The secondary structure of a
protein depends on hydrogen
bonding between C=O and N-H groups.
Alpha helix, beta sheets, turn
and loop
Four Levels of Protein Structure
• Secondary Structure:

Polypeptide folding into α helix, β sheet, or
random coil (H bonds involved)

[Diagram: hydrogen bonds between backbone N-H and C=O groups]
Secondary structure of proteins - α helix
An H bond links the N-H of every peptide bond to the C=O of the peptide bond four residues away in the
same chain. R groups are not involved.
(e.g. in the protein α-keratin - abundant in skin, hair, nails and horns)
[Fig. 4-10, p. 128]
Secondary structure of proteins – β sheet
Polypeptide chains are held together by H bonds between N-H group of one polypeptide chain
and C=O group of the other chain
(e.g. in the protein fibroin - abundant in silk) [Fig. 4-10, p. 128]
α helices can wrap around one another by interactions between their
hydrophobic side chains to form a stable coiled-coil. [Fig. 4-16]
e.g. α-keratin in the skin and myosin in muscles
Tertiary structure is determined by the interactions
between the side chains (R groups)

List these types of
interactions and
which ones are
weak or strong
Four Levels of Protein Structure
• Tertiary Structure:

Three dimensional folded structure due to
attractions and repulsions between R
groups

All but peptide bonds are
involved in tertiary structure.
Tertiary structure of proteins
• 3D conformation or shape
• Depends on the properties of the R groups of amino acid
residues
• Fold spontaneously or with the help of molecular
chaperones
• Stabilized by covalent and non-covalent bonds
Noncovalent bonds help protein folding (Fig. 4-4)
Also review Panel 2-7 (pp. 78,79) on noncovalent bonds
Covalent disulfide bonds between adjacent cysteine side chains
help stabilize a favored protein conformation [Fig. 4-29]
Quaternary structure is the overall protein structure
resulting from combinations of polypeptide subunits
Four Levels of Protein Structure
• Quaternary structure:
Association of two or more protein chains

e.g. Hemoglobin is composed of
4 protein chains
2 are called alpha hemoglobin
2 are called beta hemoglobin
Quaternary structure of proteins:
hemoglobin, a protein in red blood cells, has
four subunits (two copies each of α- and β-globins),
each containing a heme molecule [Fig. 4-23].
Bioinformatics how to …

use publicly available free tools to
predict protein structure
Learning Objectives
After this lesson you should be able to:
• Explain the individual steps involved in calculating a protein structure
prediction.
• Identify suitable templates for modelling.
• Outline the principles behind protein structure prediction methods.
• Describe the differences between homology modelling and ab initio
structure prediction.
• Describe the major pitfalls in protein modelling.
Protein Bioinformatics: Protein sequence analysis
• Helps to characterize protein sequences in silico and allows
prediction of protein structure and function
• Statistically significant BLAST hits usually signify sequence
homology
• Homologous sequences may or may not have the same function
but would always (with very few exceptions) have the same structural
fold
• Protein sequence analysis allows protein classification

Development of protein sequence databases
• Atlas of Protein Sequence and Structure – Dayhoff (1965), the first
sequence database (pre-bioinformatics). Currently known as
Protein Information Resource (PIR)
• Protein Data Bank (PDB) – structural database (1971), remains the
most widely used database of structures
• UniProt – The United Protein Databases (UniProt, 2003) is a
central database of protein sequence and function created by
joining the forces of the SWISS-PROT, TrEMBL and PIR protein
database activities

The Protein Data Bank (PDB) is a repository for the 3-D
structural data of large biological molecules, such as
proteins and nucleic acids.

Obtained by X-ray crystallography or NMR spectroscopy.

Submitted by biologists and biochemists from around the
world.
Protein sequence analysis overview
• Protein databases
   ○ PIR and UniProt
• Searching databases
   ○ Peptide search, BLAST search, Text search
• Information retrieval and analysis
   ○ Protein records at UniProt and PIR
   ○ Multiple sequence alignment
   ○ Secondary structure prediction
   ○ Homology modeling

Universal Protein Knowledgebase
(UniProt)
PIR (Protein Information Resource) has recently joined forces with EBI (European
Bioinformatics Institute) and SIB (Swiss Institute of Bioinformatics) to establish the
UniProt

http://www.uniprot.org/

[Diagram labels: UniProt NREF (clustering at 100, 90, 50%); UniProt Knowledgebase (automated annotation, literature-based annotation, classification); UniProt Archive; source databases: Swiss-Prot, TrEMBL, PIR-PSD, RefSeq, GenBank/EMBL/DDBJ, EnsEMBL, PDB, patent data, other data]
Peptide Search
