First Lecture
Proteomics
Genomics and Proteomics
• The field of genomics deals with the DNA sequence,
organization, function, and evolution of genomes
Genetic Engineering
• In genetic engineering, the immediate goal of an experiment
is to insert a particular fragment of chromosomal DNA into a
plasmid or a viral DNA molecule
Origin of “Genomics”: 1987
What is genomics?
Central Dogma of Biology
http://www.lhsc.on.ca/Patients_Families_Visitors/Genetics/Inherited_Metabolic/Mitochondria/DiseasesattheMolecularLevel.htm
What can genomics tell us?
DNA Sequence
Protein/Gene Function
How do we study genomics?
1. Isolate nucleic acid molecules from biological samples
2. Determine nucleotide sequence using biochemical
techniques
3. Digitize nucleotide sequence
4. Examine digital sequences to identify patterns with
algorithms and statistics
5. Relate patterns to biological observations by:
   a. Comparing patterns detected across many samples
   b. Manipulating a system to see how patterns change
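Steps 3 and 4 can be illustrated with a minimal sketch. It assumes the digitized sequence is a plain string of A, C, G and T; the function names and the choice of GC content and k-mer counts as example "patterns" are illustrative assumptions, not part of the lecture.

```python
from collections import Counter


def gc_content(seq: str) -> float:
    """Fraction of bases in the sequence that are G or C."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


def kmer_counts(seq: str, k: int) -> Counter:
    """Count every overlapping substring of length k."""
    seq = seq.upper()
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


# Hypothetical digitized read, used only for illustration.
sample = "ATGCGCGATATGC"
print(gc_content(sample))
print(kmer_counts(sample, 3).most_common(3))
```

Comparing such summaries across many samples corresponds to step 5a.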
Sequencing Techniques
Sanger sequencing – fluorescent-labeled DNA fragments
What’s Next?
THE “POST-GENOMICS” ERA
Goal:
CCTGACAAATTCGACGTGCGGCATTGCATGCAGACGTGCATG
CGTGCAAATAATCAATGTGGACTTTTCTGCGATTATGGAAGAA
CTTTGTTACGCGTTTTTGTCATGGCTTTGGTCCCGCTTTGTTC
AGAATGCTTTTAATAAGCGGGGTTACCGGTTTGGTTAGCGAGA
AGAGCCAGTAAAAGACGCAGTGACGGAGATGTCTGATG CAA
TAT GGA CAA TTG GTT TCT TCT CTG AAT ......
.............. TGAAAAACGTA
Identify the genes within a
given sequence of DNA
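One simple way to approach gene identification is to scan for open reading frames. The sketch below is an assumption-laden illustration: it looks only at the forward strand, treats ATG as the only start codon, and ignores promoters and other signals that real gene finders use. The name `find_orfs` and the `min_codons` parameter are hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(seq: str, min_codons: int = 2):
    """Return (start, end, orf) tuples for ATG...stop runs on the forward strand.

    start and end are 0-based with end exclusive; the stop codon is included.
    min_codons counts the codons before the stop, including ATG.
    """
    seq = seq.upper().replace(" ", "")
    orfs = []
    for frame in range(3):
        i = frame
        while i + 3 <= len(seq):
            if seq[i:i + 3] == "ATG":
                # Walk codon by codon until the first in-frame stop codon.
                for j in range(i + 3, len(seq) - 2, 3):
                    if seq[j:j + 3] in STOP_CODONS:
                        if (j - i) // 3 >= min_codons:
                            orfs.append((i, j + 3, seq[i:j + 3]))
                        i = j  # resume scanning after this ORF
                        break
            i += 3
    return orfs


# Hypothetical toy sequence, used only for illustration.
print(find_orfs("CCATGAAATAGCC"))
```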
[Figure: the sequence above annotated with a promoter, a TF binding site, the transcription start site, the ribosome binding site, and the open reading frame]
ORF = Open Reading Frame
CDS = Coding Sequence
Comparative genomics
Human  ATAGCGGGGGGATGCGGGCCCTATACCC
Chimp  ATAGGGG--GGATGCGGGCCCTATACCC
Mouse  ATAGCG---GGATGCGGCGC-TATACCA
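A common summary of an alignment like the one above is percent identity. The sketch below assumes the rows are already aligned to equal length with `-` for gaps, and it skips gap columns when counting; other conventions count gaps as mismatches. The function name is hypothetical.

```python
def percent_identity(row_a: str, row_b: str) -> float:
    """Percent of non-gap columns in which two aligned rows agree."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have equal length")
    compared = 0
    matches = 0
    for a, b in zip(row_a, row_b):
        if a == "-" or b == "-":
            continue  # assumption: gap columns are not counted
        compared += 1
        if a == b:
            matches += 1
    return 100.0 * matches / compared if compared else 0.0


# Rows from the slide, with the spacing between gap characters removed.
human = "ATAGCGGGGGGATGCGGGCCCTATACCC"
chimp = "ATAGGGG--GGATGCGGGCCCTATACCC"
mouse = "ATAGCG---GGATGCGGCGC-TATACCA"
print(percent_identity(human, chimp))
print(percent_identity(human, mouse))
```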
Structural genomics
Functional Genomics
• Genomic sequencing has made possible a new approach to genetics called functional
genomics, which focuses on genome-wide patterns of gene expression and the
mechanisms by which gene expression is coordinated
• DNA microarray (or chip) - a flat surface about the size of a postage stamp with up to
100,000 distinct spots, each containing a different immobilized DNA sequence suitable
for hybridization with DNA or RNA isolated from cells growing under different conditions
• DNA microarrays are used to estimate the relative level of gene expression of each
gene in the genome
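Relative expression from a two-condition microarray is often reported as a log2 ratio of spot intensities. The sketch below assumes intensities are already background-corrected and normalized; the gene names, the values, and the pseudocount are made-up illustrations, not data from the lecture.

```python
import math


def log2_ratios(treated: dict, control: dict, pseudocount: float = 1.0) -> dict:
    """log2(treated / control) for each gene present in both conditions.

    The pseudocount (an assumption) avoids division by zero for silent spots.
    """
    return {
        gene: math.log2((treated[gene] + pseudocount) / (control[gene] + pseudocount))
        for gene in treated
        if gene in control
    }


# Hypothetical spot intensities for three genes.
treated = {"geneA": 799.0, "geneB": 99.0, "geneC": 49.0}
control = {"geneA": 199.0, "geneB": 99.0, "geneC": 199.0}
for gene, ratio in log2_ratios(treated, control).items():
    print(gene, round(ratio, 2))
```

A positive value indicates higher expression in the treated sample, a negative value lower expression, and zero no change.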
Assigning the structures of all proteins
[Figure labels: protein complexes, fold, biological processes, evolutionary relationship]
Sequence 2   ATACCATAAGCGAG
Figure labels: Match, Mismatch, Gap
Alignment 1  ATACACAGTAGGAGATACCAGTAAGGGAGGGGG
             --------------ATACCA-TAAGCGAG----
Alignment 2  ATACACAGTAGGAGATACCAGTAAGGGAGGGGG
             ATAC-CA--------------TAAGCGAG----
Alignment 3  ATACACAGTAGGAGATACCAGTAAGGGAGGGGG
             ATAC-CA-TA--AG---C--G--AG--------
Scoring/Substitution Matrices
G    -1  -3   2  -3  -3
T    -3  -1  -3   2  -3
-    -3  -3  -3  -3  NA
ATACCAGTAAGG-GAG
ATACCA-TAAG-AGAG   Score = 19
ATACCA-GTAAGGGAG
A-TACCATAAGAGAG-   Score = -20
Sequence Alignment Algorithms
Global alignment
AERAKDNLCRLEHTTLRKVTAAANTAIYYDPNPDMPVVAEDQEWVNVYYEM
A-----N------T-----------AI-YD--P------------N----M
Local alignment: interior sequences are aligned as well as possible
AERAKDNLCRLEHTTLRKVTAAANTAIYYDPNPDMPVVAEDQEWVNVYYEM
---------------------AANTAI-YDPN--M----------------
CATATTGCAGTGGTCCCGCGTCAGGCT
TAAATTGCGT-GGTCGCACTGCACGCT
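The difference between the two alignment styles shows up in the dynamic-programming recurrence. The sketch below computes only the optimal score, in the style of Needleman-Wunsch for global alignment and Smith-Waterman for local alignment. The scoring values (match 2, mismatch -1, gap -2) are assumptions chosen for illustration and are not the matrix shown earlier.

```python
def alignment_score(a: str, b: str, local: bool = False,
                    match: int = 2, mismatch: int = -1, gap: int = -2) -> int:
    """Best alignment score of a and b under a simple linear gap penalty."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    if not local:
        # Global alignment pays for gaps at the ends of either sequence.
        for i in range(1, rows):
            score[i][0] = i * gap
        for j in range(1, cols):
            score[0][j] = j * gap
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cell = max(diag, score[i - 1][j] + gap, score[i][j - 1] + gap)
            if local:
                # Local alignment may restart anywhere, so scores never drop below 0.
                cell = max(cell, 0)
                best = max(best, cell)
            score[i][j] = cell
    return best if local else score[-1][-1]


# Hypothetical toy sequences: a short motif embedded in a longer sequence.
print(alignment_score("TTTACGTTT", "ACG"))
print(alignment_score("TTTACGTTT", "ACG", local=True))
```

The global score is dragged down by the unmatched flanks, while the local score reflects only the best-matching interior region.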
GLOBAL VS. LOCAL ALIGNMENT
An error?
A polymorphism?
A different allele?
Incorrect alignment?
Millions of DNA sequences 30-150 nucleotides long
Find all locations where sequences map in genome
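The mapping task can be sketched with a naive exact-match search. This is only an illustration under strong assumptions: it reports exact matches on the forward strand, whereas practical read mappers index the genome and tolerate mismatches, which is why the questions above (error, polymorphism, allele, misalignment) arise. The function name is hypothetical.

```python
def map_reads(genome: str, reads):
    """Return, for each read, every 0-based position where it matches exactly."""
    positions = {}
    for read in reads:
        hits = []
        start = genome.find(read)
        while start != -1:
            hits.append(start)
            # Continue from the next base so overlapping hits are found too.
            start = genome.find(read, start + 1)
        positions[read] = hits
    return positions


# Hypothetical toy genome and reads.
print(map_reads("ACGTACGTTTGACG", ["ACG", "TTG", "GGG"]))
```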
● Key findings:
○ ~20k genes
○ More segmental duplications than expected
○ Fewer than 7% of protein families are vertebrate specific
○ ~3% of sequence codes for protein coding genes
○ >85% of the genome is transcribed
○ Repetitive elements may comprise >66% of genome
How The Genome Was Determined
International Human Genome Sequencing Consortium
● Genic sequences
● What do our genes do?
● How are genes controlled?
● What genes are different between humans?
● How are genes associated with disease?
Gene Expression
(Figure source: Wikipedia)
But What Is A Gene?
● DNA?
● RNA?
● Protein?
● Informational molecule?
● Functional molecule?
How We Measure Gene Expression
● mRNA transcription/translation
○ Fluorescent tagging + microscopy
○ Ribosomal capture
● mRNA abundance
○ Northern blots
○ Quantitative polymerase chain reaction (qPCR)
○ Microarrays
○ High-throughput sequencing
● Protein abundance
○ Western blots
○ Fluorescent tagging + microscopy
○ Mass spectrometry
○ Protein arrays
● mRNA/Protein localization
○ Fluorescent tagging + microscopy
mRNA Measurement Considerations
Four Levels of Protein Structure
Different Levels of Protein Structure
Protein function depends on
specific conformation (shape)
Primary Structure:
Linear Sequence of Amino Acids
Each amino acid has a central carbon linked to
---hydrogen (H)
---amino group (NH2)
---acid group (COOH)
---unique group (R)
[Figure: structural formula of an amino acid, H2N-CH(R)-COOH]
The carboxyl group of one amino acid is linked
to the amino group of the next amino acid.
Amino acids are linked together by covalent peptide bonds
(Fig. 4-1)
Proteins are made up of a polypeptide backbone with
attached side chains
(Fig. 4-2)
Schematic amino acid R groups
A Ala
C Cys
D Asp
E Glu
F Phe*
G Gly
H His*
I Ile*
K Lys*
L Leu*
M Met*
N Asn
P Pro
Q Gln
R Arg*
S Ser
T Thr*
V Val*
W Trp*
Y Tyr
[Figure legend: atoms C, N, O, S]
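The one-letter and three-letter codes listed above map naturally onto a lookup table. The sketch below converts a primary sequence written in one-letter code to three-letter code; the function name and the example peptide are hypothetical.

```python
# One-letter to three-letter amino acid codes, as listed on the slide.
THREE_LETTER = {
    "A": "Ala", "C": "Cys", "D": "Asp", "E": "Glu", "F": "Phe",
    "G": "Gly", "H": "His", "I": "Ile", "K": "Lys", "L": "Leu",
    "M": "Met", "N": "Asn", "P": "Pro", "Q": "Gln", "R": "Arg",
    "S": "Ser", "T": "Thr", "V": "Val", "W": "Trp", "Y": "Tyr",
}


def to_three_letter(sequence: str) -> str:
    """Spell out a one-letter protein sequence, residue by residue."""
    return "-".join(THREE_LETTER[residue] for residue in sequence.upper())


# Hypothetical short peptide.
print(to_three_letter("MKV"))
```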
The secondary structure of a
protein depends on hydrogen
bonding between C=O and N-H
groups.
Alpha helix, beta sheets, turns
and loops
Secondary Structure:
[Figure: hydrogen bonds between the backbone N-H and C=O groups]
Secondary structure of proteins – α helix
H bonds form between the N-H of every peptide bond and the C=O of the peptide bond four amino acids away in the
same chain. R groups are not involved.
(e.g. in the protein α-keratin - abundant in skin, hair, nails and horns)
[Fig. 4-10, p. 128; the figure labels the helix pitch]
Secondary structure of proteins – β sheet
Polypeptide chains are held together by H bonds between N-H group of one polypeptide chain
and C=O group of the other chain
(e.g. in the protein fibroin - abundant in silk) [Fig. 4-10, p. 128]
α helices can wrap around one another through interactions between their
hydrophobic side chains to form a stable coiled-coil. [Fig. 4-16]
e.g. keratin in the skin and myosin in muscles
Tertiary structure is determined by the interactions
between the side chains (R groups)
prediction.
Identify suitable templates for modelling.
structure prediction.
Describe the major pitfalls in protein modelling.
Protein Bioinformatics: Protein sequence analysis
Helps to characterize protein sequences in silico and allows
prediction of protein structure and function
Statistically significant BLAST hits usually signify sequence
homology
Homologous sequences may or may not have the same function
but almost always (with very few exceptions) have the same structural
fold
Protein sequence analysis allows protein classification
Development of protein sequence databases
Atlas of Protein Sequence and Structure – Dayhoff (1965), the first
sequence database (pre-bioinformatics). Currently known as the
Protein Information Resource (PIR)
Protein Data Bank (PDB) – structural database (1971); remains the
most widely used database of structures
UniProt – The United Protein Databases (UniProt, 2003) is a
central database of protein sequence and function created by
joining the forces of the SWISS-PROT, TrEMBL and PIR protein
database activities
The Protein Data Bank (PDB) is a repository for the 3-D
structural data of large biological molecules, such as
proteins and nucleic acids.
Universal Protein Knowledgebase (UniProt)
PIR (Protein Information Resource) has recently joined forces with EBI (European
Bioinformatics Institute) and SIB (Swiss Institute of Bioinformatics) to establish
UniProt
http://www.uniprot.org/
[Figure: UniProt components: UniProt Archive; UniProt Knowledgebase (automated annotation, literature-based annotation, classification); UniProt NREF (clustering at 100, 90, 50%)]