Cours M1OSBIntroductionProteoIF-TC-2023

This document provides an introduction to proteomics and the metabolome. It discusses how the proteome and metabolome represent highly dynamic layers of physiological mechanisms and are regulated by both genetic and external factors. Modern proteomics combines biological questions with cutting-edge technologies and bioinformatics to identify proteins on a large scale, obtain their quantification levels, and map them to signaling pathways to understand protein functions in physiological or pathophysiological mechanisms. Protein databases like UniProtKB are crucial for large-scale protein identification by providing annotated protein sequences.


Master's in Bioinformatics, M1

Omics 1: Introduction to Omics Data


Proteomics & Metabolomics

Tristan Cardon, PhD ([email protected]), based on the course by
Prof. Isabelle Fournier, Université de Lille, Bât SN3, Porte 113 - [email protected]
The Proteome & Metabolome
One layer of the physiological mechanisms

• The proteome and metabolome are highly dynamic
• Their regulation is involved in the expression of the phenotype, i.e. "the phenome"
• Their regulation depends on the genome, though external factors strongly influence the regulation of both metabolites and proteins
• Feedback loops also allow proteins and metabolites to control gene expression

Trends in Genetics 2020


Proteome & Proteomics
The post transcriptomics
The proteome is the set of proteins expressed in a cell, a group of cells, an organ
or an organism in a given physiological state and at a given time
The proteomic approach refers to the global analysis of protein expression
The term "proteome" was coined in 1994 by Marc Wilkins (Macquarie
University, Sydney) to describe the total set of proteins encoded by a genome

Proteomics

Includes all the tools and strategies used to study the proteome, i.e. to
identify and characterize proteins
N.B. The term was introduced in 1997 by P. James in his publication
"Protein identification in the post-genome era: the rapid rise of proteomics",
Quarterly Reviews of Biophysics
The complexity of the Proteome
High dynamics giving rise to huge complexity
The proteome is highly dynamic compared with the genome
A single genome, but different proteomes

[Figure: from the human genome to the proteome]
Human genome: 2.9 billion bp, 20,000-25,000 genes
Human transcriptome: 10,000-12,000 gene products
About 75,000 reference proteins
+ Post-translational modifications (PTMs), e.g. phosphorylation, glycosylation …
+ Isoforms & truncated proteins

Not all expressed at the same time or in the same cells
The PTMs
Chemical Modifications Fundamental to the Cell Signaling


e.g. Phosphorylation
is an important cellular regulatory mechanism: many enzymes and
receptors are activated or deactivated by phosphorylation and
dephosphorylation events, catalyzed by kinases and phosphatases. In
particular, protein kinases are responsible for cellular signal
transduction and their hyperactivity.

6
Grabbing the Proteome
What for?
Large Scale Identification of Proteins

Identify the proteins
(get the primary structure, i.e. the amino acid sequences)

Obtain the relative quantification of the proteins
(access fold changes; figure labels: x3, Ref, /2 and Sample 1, Sample 2, Sample 3)

Identify protein interaction partners
(Protein-Protein Interaction = PPI)

Identify PTMs and get their localization

Plot back proteins in signaling pathways

Protein functions in physiological or pathophysiological mechanisms
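
Relative quantification is usually reported as a fold change against a reference. The short Python sketch below illustrates the idea with a log2 fold change; the function name and the intensity values are hypothetical and only mirror the "x3" and "/2" labels of the figure.

```python
import math

def log2_fold_change(sample_intensity, reference_intensity):
    """Return log2(sample / reference); positive means higher than the reference."""
    if sample_intensity <= 0 or reference_intensity <= 0:
        raise ValueError("intensities must be strictly positive")
    return math.log2(sample_intensity / reference_intensity)

# Hypothetical intensities: one protein at three times the reference level,
# another at half the reference level.
reference = 1.0e6
up = log2_fold_change(3.0e6, reference)
down = log2_fold_change(0.5e6, reference)
print(f"x3 -> log2 fold change = {up:.3f}")
print(f"/2 -> log2 fold change = {down:.3f}")
```

Working on the log2 scale makes increases and decreases symmetric around zero, which is why it is a common choice for plotting fold changes.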
8
Grabbing the Proteome
What does it Mean?

MODERN PROTEOMICS
Is a combination of biological questions, cutting-edge bioanalytical
technologies and bioinformatics
9
Protein Databases
A Mandatory Step to the Proteome Identification

Protein sequences are stored in databases

• GenPept (GenBank Gene Products Data Bank): National Center for Biotechnology Information (NCBI)
-Protein sequences are directly translated from nucleotide sequences with minimal annotation
-Gene banks: GenBank/EMBL/DDBJ

• NCBI's Entrez Protein: NCBI
-Specific bank containing both protein sequences from gene translations (GenBank/EMBL/DDBJ) and sequences annotated in Swiss-Prot, PIR (Protein Information Resource), RefSeq and PDB (Protein Data Bank)
-Redundant bank. May contain annotation errors

• RefSeq (Reference Sequence): NCBI
-Provides a panel of non-redundant reference proteins
-Only for a limited number of species: 1,100 viruses, 150 bacteria and only a few other organisms

10
Protein Databases
A Mandatory Step to the Proteome Identification

• PIR-PSD (Protein Information Resource Protein Sequence Database): created in 1984
-Successor to the National Biomedical Research Foundation Protein Sequence Database
-Protein sequences are organized by families and superfamilies and annotated with functional, structural, genetic and bibliographic data

• Swiss-Prot: created in 1986 by Amos Bairoch
-Best curated and least redundant
-Annotations contain various information on the proteins, including function, nature, structure, post-translational modifications, mutations, similarities, involvement in pathologies, etc.

• TrEMBL:
-Complementary bank to Swiss-Prot, giving access to new protein sequences
-All sequences translated from DDBJ/EMBL/GenBank as well as sequences from publications submitted by users

UniProtKB
11
Protein Databases
UniProtKB - https://www.uniprot.org/
• The UniProt Knowledgebase (UniProtKB) is the central hub for the collection of functional
information on proteins, with accurate, consistent and rich annotation.
• In addition to capturing the core data mandatory for each UniProtKB entry (mainly, the amino
acid sequence, protein name or description, taxonomic data and citation information), as
much annotation information as possible is added.

12
Protein Databases
UniProtKB - https://www.uniprot.org/

Large-scale protein identification is obtained by
database interrogation (no de novo sequencing)
If proteins are not referenced in the databases,
they will not be identified

14
NeXtProt
The Human Protein Project

15
The Hidden Proteome
The Alternative Proteins: A novel class of Proteins

Death of a dogma:
Eukaryotic mRNAs can code for
more than one protein

The proteome is even more complex than initially expected

The alternative proteins, or the so-called Ghost Proteome
Proteome Identification
OpenProt: A novel database

17
The Proteome Dynamic Range
An additional difficulty in capturing the proteome complexity
The proteome vs. the transcriptome
• The dynamic range of the proteome is wide
• mRNA = 3-4 orders of magnitude of dynamic range
• Proteins = span >7 orders of magnitude
• Abundances from 1 to 10,000,000 copies per cell
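
The "orders of magnitude" quoted above are simply the base-10 logarithm of the ratio between the highest and the lowest abundance. A minimal Python sketch of that arithmetic, using the copy numbers from the bullet above (the function name is illustrative):

```python
import math

def orders_of_magnitude(lowest, highest):
    """Dynamic range expressed as log10(highest / lowest)."""
    if lowest <= 0 or highest < lowest:
        raise ValueError("expected 0 < lowest <= highest")
    return math.log10(highest / lowest)

# Protein abundances from 1 to 10,000,000 copies per cell.
protein_range = orders_of_magnitude(1, 10_000_000)
print(f"Proteome dynamic range: about {protein_range:.0f} orders of magnitude")

# Compared with the upper mRNA figure of 4 orders of magnitude,
# the proteome spans roughly 10**(7 - 4) times more.
print(f"Ratio versus a 4-order range: {10 ** (round(protein_range) - 4)}x")
```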

Need for a technology that offers
• Analysis of complex mixtures
• High sensitivity
• Good dynamic range
• Speed

Mass Spectrometry (MS)

18
Mass Spectrometry
A versatile & robust technology
Mass Spectrometry
• Gives access to the molecular weight (MW) of compounds with high accuracy
• Through the measurement of m/z
• Provides structural information
• For peptides and proteins = access to the amino acid sequence
• MS separates molecules = can cope with the analysis of mixtures
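
The relation between molecular weight and the measured m/z can be sketched in a few lines of Python. This is a minimal illustration assuming positive-mode ions formed by adding protons; the peptide mass used is hypothetical.

```python
PROTON_MASS = 1.007276  # Da, mass of a proton (rounded)

def mz_from_mass(neutral_mass, charge):
    """m/z of an ion carrying `charge` protons: (M + z * m_proton) / z."""
    if charge < 1:
        raise ValueError("charge must be a positive integer")
    return (neutral_mass + charge * PROTON_MASS) / charge

def mass_from_mz(mz, charge):
    """Recover the neutral mass M from a measured m/z and a known charge."""
    if charge < 1:
        raise ValueError("charge must be a positive integer")
    return charge * (mz - PROTON_MASS)

# Hypothetical peptide of 1050.52 Da observed at charge states 1+ to 3+.
peptide_mass = 1050.52
for z in (1, 2, 3):
    print(f"z = {z}+  m/z = {mz_from_mass(peptide_mass, z):.4f}")
```

The same molecule therefore appears at several m/z values depending on its charge state, which is why the charge must be known before a molecular weight can be reported.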

Instruments shown: Orbitrap MS, FT-ICR MS, timsTOF MS

19
Mass Spectrometry
A versatile & robust technology
• Structural information is obtained by gas-phase fragmentation

So-called tandem MS or MS/MS

Cleavage of the peptide backbone bonds

Peptide fragmentation pattern nomenclature
(Biemann 1986; Hunt et al. 1986; Roepstorff 1984)
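
As an illustration of the fragmentation nomenclature, the following Python sketch computes the singly charged b and y ion series produced by cleavage of the backbone amide bonds. It is a simplified sketch: monoisotopic residue masses are rounded standard values, only 1+ ions are considered, no modifications are handled, and the peptide "PEPTIDE" is an arbitrary example.

```python
# Monoisotopic residue masses in Da (standard values, rounded).
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.010565   # Da
PROTON = 1.007276   # Da

def fragment_ions(peptide):
    """Return the m/z lists of singly charged b ions and y ions."""
    masses = [RESIDUE_MASS[aa] for aa in peptide]
    n = len(masses)
    # b_i keeps the first i residues (N-terminal side of the cleavage).
    b_ions = [sum(masses[:i]) + PROTON for i in range(1, n)]
    # y_i keeps the last i residues plus water (C-terminal side).
    y_ions = [sum(masses[n - i:]) + WATER + PROTON for i in range(1, n)]
    return b_ions, y_ions

b_ions, y_ions = fragment_ions("PEPTIDE")
for i, (b, y) in enumerate(zip(b_ions, y_ions), start=1):
    print(f"b{i} = {b:9.4f}   y{i} = {y:9.4f}")
```

Mass differences between consecutive ions of the same series correspond to residue masses, which is how a sequence is read from an MS/MS spectrum.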
20
The Proteome Dynamic Range
MS vs. Proteome Dynamic Range

Modern MS instrumentation
performs analysis with up to
four orders of magnitude of the
dynamic range in the untargeted
mode, in a stark mismatch with
the proteome dynamic range

Important improvements
have been made over the
last decade in the speed
and depth of the proteome
analysis

21
Large-Scale MS Based Strategies
Untargeted Proteome-Wise Analysis
I. Samples

II. Extraction
According to the physico-chemical properties, e.g. large hydrophobic proteins (membrane proteins) vs. small hydrophilic proteins (e.g. cytokines)

III. Separation
Proteome complexity is far too high for direct analysis by MS

IV. MS Analysis
Identification, relative quantification

V. Purification/production

V. Absolute quantification

VI. Protein higher structural orders
Structural biology, e.g. NMR, CryoEM
22
Large-Scale MS Based Strategies
What Samples ?
Sample quality is central
• Beware of protein degradation
• Low-temperature storage (-80 °C)
• Enzyme inactivation (protease inhibitor cocktails, temperature spike, …)
• Sampling standardization

Sample types: organism, organs, biopsies, tissues, cells in culture

23
Large-Scale MS Based Strategies
The Historic Gel Based Workflow
The conventional 2D gel separation pipeline
[Figure: 2D gel workflow]
• 2D gel: proteins are separated by isoelectric focusing (IEF) according to their isoelectric point (pI), then by molecular weight (MW; markers at 100, 50, 25 and 10 kDa)
• Spot excision and in-gel enzymatic digestion
• MS analysis (MALDI, ESI): Peptide Mass Fingerprint (PMF), spectrum of intensity vs. m/z, databank identification with low-confidence ID
• Fragmentation (MS/MS) of a parent ion: Peptide Sequence Tag (PST) from fragments f1+ to f5+, high-confidence ID
Large-Scale MS Based Strategies
The Historic Gel Based Workflow
The relative quantification is obtained from the 2D gel spots

Limitations: low throughput, manual, limited reproducibility

25
Large-Scale MS Based Strategies
The Historic Gel Based Workflow
The 2D DIGE Proteomics

26
Large-Scale MS Based Strategies
Protein Identification Through Database interrogation
Protein identification is obtained by comparing in silico predictions with experimental measurements

Experimental peak list (Peptide Mass Fingerprint, PMF):
546.45, 576.32, 678.09, 897.98, 1002.56, 1152.30, 1198.65, 1211.32, 1234.78, 1564.90, 1678.09, 1876.86, 2009.76

Comparison using a search engine:
MS-Fit, Mascot, GlobalServer, ProFound, PeptideSearch

Protein databases, in silico protein digestion, theoretical peak list
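
The comparison between a theoretical and an experimental peak list can be sketched as follows. This is a simplified Python illustration, not how any of the listed search engines is implemented: it assumes trypsin (cleavage after K or R, not before P), no missed cleavages, no modifications, singly protonated peptides and a fixed tolerance in Da. The protein sequence and the experimental peaks are hypothetical.

```python
import re

# Monoisotopic residue masses in Da (standard values, rounded).
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.010565
PROTON = 1.007276

def tryptic_digest(sequence):
    """In silico digestion: cleave after K or R unless the next residue is P."""
    return [p for p in re.split(r"(?<=[KR])(?!P)", sequence) if p]

def mh_plus(peptide):
    """Monoisotopic m/z of the singly protonated peptide, [M+H]+."""
    return sum(RESIDUE_MASS[aa] for aa in peptide) + WATER + PROTON

def count_matches(experimental, theoretical, tol_da=0.2):
    """Number of experimental peaks within tol_da of any theoretical peak."""
    return sum(any(abs(e - t) <= tol_da for t in theoretical) for e in experimental)

# Hypothetical database entry and hypothetical measured peaks.
protein = "MKWVTFISLLRPAGKSEEDR"
peptides = tryptic_digest(protein)
theoretical = [mh_plus(p) for p in peptides]
experimental = [theoretical[0] + 0.05, theoretical[-1] - 0.10, 999.99]

print(peptides)
print(count_matches(experimental, theoretical), "of", len(experimental), "peaks matched")
```

A search engine repeats this comparison for every protein of the database and ranks the candidates by a score derived from the matches.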

27
28
Multilayer organization of the cell

What does this representation show?

Explain the link between the different layers

Trends in Genetics 2020


Identification by MS

Primary structure: amino acid sequence
Secondary structure: alpha helix and beta sheet
Tertiary structure: 3D structure in space
Quaternary structure: combination of multiple tertiary structures

[Figure: MS spectrum (MS1) plotted against m/z, with one peak labeled 526.27 and the other peaks marked "?"; Peptide Mass Fingerprint (PMF)]
Large-Scale MS Based Strategies
Protein Identification Through Database interrogation

[Figure: parameters of a database search]
• Protein database: species
• Digestion: enzyme, digestion efficiency
• Expected modifications: chemical modifications, or modifications from biological processes (e.g. post-translational modifications)
• Information known beforehand, if any (e.g. from 2D gels)
• m/z measurement: precision (ppm or u), spectral resolution (mono or avg)
• MS and/or MS/MS peak list
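
The measurement precision given to a search engine is often expressed in ppm. The Python sketch below shows how a ppm error is computed and how a ppm tolerance translates into an absolute window in Da; the function names and the m/z values are illustrative assumptions.

```python
def ppm_error(measured_mz, theoretical_mz):
    """Relative mass error in parts per million."""
    return (measured_mz - theoretical_mz) / theoretical_mz * 1e6

def tolerance_in_da(mz, tol_ppm):
    """Absolute half-width (Da) of a ppm tolerance window at a given m/z."""
    return mz * tol_ppm / 1e6

def is_match(measured_mz, theoretical_mz, tol_ppm):
    """True when the measured peak lies within the ppm tolerance."""
    return abs(ppm_error(measured_mz, theoretical_mz)) <= tol_ppm

# Hypothetical peaks compared with a theoretical m/z of 1000.000.
print(f"{ppm_error(1000.004, 1000.0):.1f} ppm")
print(is_match(1000.004, 1000.0, tol_ppm=5))   # inside a 5 ppm window
print(is_match(1000.020, 1000.0, tol_ppm=5))   # outside a 5 ppm window
print(f"5 ppm at m/z 1000 = {tolerance_in_da(1000.0, 5):.3f} Da")
```

Because the window scales with m/z, a ppm tolerance is tighter in absolute terms for small peptides than for large ones.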

32
Large-Scale MS Based Strategies
Protein Identification Through Database interrogation

Careful: Only partial sequence information obtained


Identification by MS

Primary structure: amino acid sequence
Secondary structure: alpha helix and beta sheet
Tertiary structure: 3D structure in space
Quaternary structure: combination of multiple tertiary structures

[Figure: MS spectrum (MS1) with one peak labeled 526.27 and the other peaks marked "?", and the corresponding MS/MS spectrum (MS2) with fragment peaks marked "?"; both plotted against m/z]
Large-Scale MS Based Strategies
Bottom-Up vs. Top-Down Strategies
Bottom-Up
• Protein separation
• Protein digestion
• MS on digestion products (PMF)
• MS2 to MSn of peptides (PST)
• AA sequences of peptides
• Databank interrogation
• Protein ID

Shot-Gun
• Protein digestion in bulk
• MS on all protein digestion products
• MS2 to MSn of peptides
• AA sequences of peptides of all proteins
• Databank interrogation
• Protein ID

Top-Down
• Protein separation or not
• MS of native proteins
• MS2 on native proteins
• Partial AA sequence of proteins (<60 AA)
• Databank interrogation
• Protein ID
Large-Scale MS Based Strategies
The ShotGun Approach
The shotgun approach is achieved using LC-MS (nanoESI source)

In shotgun proteomics, the dynamic range of signal intensities of the peptides resulting from the proteome's digestion is at least an order of magnitude larger than that of the original proteome, which makes protein identification even more difficult

• In shotgun proteomics, identification is based on the MS/MS data (fragmentation) alone
• LC is used to separate the digested peptides
• LC requires the instrument to be fast (a few seconds per chromatographic peak, and each peak still corresponds to tens of peptides)
Large-Scale MS Based Strategies
Label Free vs. Label-Based Methods for Relative Quantification

Label-free quantification (LFQ) may be based on precursor signal intensity or on spectral counting
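
The two label-free readouts can be contrasted on a toy list of peptide-spectrum matches. This Python sketch is only an illustration under simple assumptions: each match is a (protein, precursor intensity) pair, intensities are summed per protein, and no normalization is applied. Protein names and values are hypothetical.

```python
from collections import defaultdict

def label_free_quant(psms):
    """Return (spectral counts, summed precursor intensities) per protein."""
    spectral_counts = defaultdict(int)
    summed_intensity = defaultdict(float)
    for protein, precursor_intensity in psms:
        spectral_counts[protein] += 1                     # spectral counting
        summed_intensity[protein] += precursor_intensity  # intensity-based
    return dict(spectral_counts), dict(summed_intensity)

# Hypothetical peptide-spectrum matches from one LC-MS/MS run.
psms = [
    ("PROT_A", 2.0e6),
    ("PROT_A", 1.5e6),
    ("PROT_B", 4.0e5),
    ("PROT_A", 5.0e5),
]
counts, intensities = label_free_quant(psms)
print(counts)
print(intensities)
```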
41
Large-Scale MS Based Strategies
Processing Tools for Identification

Overview of the computational workflow.
MaxQuant supports the processing of raw data
derived from Thermo Fisher Scientific, Bruker
Daltonics, AB Sciex and Agilent Technologies
mass spectrometry systems. After specifying the
minimum set of required (Box 3) and optional
parameters, the workflow passes through 12
main steps, including feature detection, first and
main search, and peptide/protein identification
and quantification. The results can be visualized
in the MaxQuant Viewer module or be further
analyzed using the bioinformatics platform
Perseus to gain biologically meaningful insights.
*Requires the installation of vendor-specific
software to read the measured raw data.

43
Large-Scale MS Based Strategies
Statistical Tools for Data Analysis

The central data format of Perseus is the data matrix, in which biological samples are represented as columns, and proteins or other molecular species are represented as rows. Perseus distinguishes several different types of columns.
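
The layout described above (samples as columns, proteins as rows) can be mimicked with plain Python structures. The sketch below is not Perseus code and does not reproduce its file format; sample names, protein names and intensities are hypothetical, and the log2 transformation is shown only as a typical first processing step.

```python
import math

# Columns: one entry per biological sample.
samples = ["ctrl_1", "ctrl_2", "treated_1", "treated_2"]

# Rows: one entry per protein, with one intensity per sample.
matrix = {
    "PROT_A": [1024.0, 2048.0, 4096.0, 8192.0],
    "PROT_B": [512.0, 512.0, 256.0, 128.0],
}

def log2_transform(data):
    """Return a new matrix with every intensity replaced by its log2."""
    return {protein: [math.log2(v) for v in values] for protein, values in data.items()}

def column(data, sample_names, sample):
    """Extract one sample (column) as a protein -> value mapping."""
    j = sample_names.index(sample)
    return {protein: values[j] for protein, values in data.items()}

log_matrix = log2_transform(matrix)
print(column(log_matrix, samples, "treated_1"))
```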

Machine learning for clinical proteomics and biomarker discovery. The Learning plugin in Perseus provides implementations of classification and regression analyses and of various feature-selection methods. The accuracy of a trained predictor, including the feature-selection step, is estimated in a cross-validation procedure: the data set is first split into training and test subsets, the classifier is trained on the training set, and its performance is then estimated on the test set. After training, the classification-regression model assigns a predicted class to the samples of unknown class. The feature-selection procedure outputs the ranks for all proteins, the best ranks corresponding to the most discriminative proteins in the data.
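
The key point of the procedure, namely that feature selection is repeated inside each training split, can be sketched in plain Python. This is not the Perseus implementation: the leave-one-out splitting, the ranking by difference of class means and the nearest-centroid classifier are all assumptions chosen for brevity, and the data are hypothetical.

```python
def mean(values):
    return sum(values) / len(values)

def rank_features(X, y):
    """Rank feature indices by absolute difference between the two class means."""
    first, second = sorted(set(y))
    scores = []
    for j in range(len(X[0])):
        m1 = mean([row[j] for row, label in zip(X, y) if label == first])
        m2 = mean([row[j] for row, label in zip(X, y) if label == second])
        scores.append(abs(m1 - m2))
    return sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)

def predict(X_train, y_train, x, features):
    """Nearest-centroid prediction restricted to the selected features."""
    best_label, best_dist = None, float("inf")
    for label in sorted(set(y_train)):
        rows = [r for r, l in zip(X_train, y_train) if l == label]
        centroid = [mean([r[j] for r in rows]) for j in features]
        dist = sum((x[j] - c) ** 2 for j, c in zip(features, centroid))
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label

def leave_one_out_accuracy(X, y, n_features):
    """Cross-validation in which feature selection only sees the training set."""
    correct = 0
    for i in range(len(X)):
        X_train, y_train = X[:i] + X[i + 1:], y[:i] + y[i + 1:]
        features = rank_features(X_train, y_train)[:n_features]
        correct += predict(X_train, y_train, X[i], features) == y[i]
    return correct / len(X)

# Hypothetical intensities for 6 samples (rows) and 3 proteins (columns).
X = [[10, 5, 1], [11, 4, 2], [9, 5, 1], [20, 5, 2], [21, 4, 1], [19, 5, 2]]
y = ["ctrl", "ctrl", "ctrl", "disease", "disease", "disease"]
accuracy = leave_one_out_accuracy(X, y, n_features=1)
print("protein ranking:", rank_features(X, y))
print("accuracy:", accuracy)
```

Selecting features on the full data set before splitting would leak information from the test samples and inflate the estimated accuracy, which is why the ranking is recomputed in every fold.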
45
Shot Gun Proteomics
From Relative quantification to Signaling Pathways

Heatmaps are used to find protein clusters specific to certain conditions

GO terms (Gene Ontology) are searched for each cluster

Subnetwork enrichment analysis, e.g. Pathway Studio v10.0 (Elsevier)

Careful: network analysis is based on bibliographic data (here >15,000 refs included)
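
Whether a GO term is over-represented in a cluster is commonly assessed with a hypergeometric test. The Python sketch below illustrates that general idea; it is an assumption for teaching purposes and not a description of how Pathway Studio computes its enrichment. All counts are hypothetical.

```python
from math import comb

def enrichment_p_value(population, annotated, cluster, hits):
    """P(X >= hits) when drawing `cluster` proteins without replacement from
    `population` proteins, of which `annotated` carry the GO term."""
    upper = min(annotated, cluster)
    favourable = sum(
        comb(annotated, k) * comb(population - annotated, cluster - k)
        for k in range(hits, upper + 1)
    )
    return favourable / comb(population, cluster)

# Hypothetical numbers: 1000 quantified proteins, 50 annotated with the term,
# a cluster of 40 proteins containing 10 annotated ones.
p = enrichment_p_value(population=1000, annotated=50, cluster=40, hits=10)
print(f"p-value = {p:.3e}")
```

In practice one such test is run per GO term, so the resulting p-values need a multiple-testing correction before interpretation.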
Shot Gun Proteomics
From Relative quantification to Signaling Pathways

Three classes describe proteins and genes:
• Molecular function
• Biological process
• Cellular component

For example, cytochrome c is characterized by:
- Molecular function: oxidoreductase activity
- Biological process: oxidative phosphorylation
- Cellular component: mitochondrial matrix

Gene Ontology describes the meaning and organization of known data and signaling pathways for genes

http://www.pantherdb.org/

47
Shot Gun Proteomics
From Relative quantification to Signaling Pathways

[Figure: example of the Gene Ontology hierarchy, from the most general to the most specific term]
Cellular process
  Signal transduction
    Cell surface receptor signaling pathway
      Enzyme linked receptor protein signaling pathway
        Transmembrane receptor protein tyrosine kinase signaling pathway
        Transmembrane receptor protein serine/threonine kinase signaling pathway
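
The hierarchy shown above can be walked programmatically from a specific term up to the most general one. The Python sketch below encodes only the terms of this example and assumes a single parent per term, which is a simplification: in the real Gene Ontology a term may have several parents.

```python
# Child -> parent relations taken from the example hierarchy (single parent assumed).
PARENT = {
    "transmembrane receptor protein tyrosine kinase signaling pathway":
        "enzyme linked receptor protein signaling pathway",
    "transmembrane receptor protein serine/threonine kinase signaling pathway":
        "enzyme linked receptor protein signaling pathway",
    "enzyme linked receptor protein signaling pathway":
        "cell surface receptor signaling pathway",
    "cell surface receptor signaling pathway": "signal transduction",
    "signal transduction": "cellular process",
}

def lineage(term):
    """Return the term followed by all of its ancestors, most general last."""
    path = [term]
    while path[-1] in PARENT:
        path.append(PARENT[path[-1]])
    return path

path = lineage("transmembrane receptor protein tyrosine kinase signaling pathway")
for depth, term in enumerate(reversed(path)):
    print("  " * depth + term)
```

A protein annotated with a specific term is implicitly annotated with every ancestor, which is why enrichment results often list general and specific terms together.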
Shot Gun Proteomics
From Relative quantification to Signaling Pathways

Gene

GO term
49
Shot Gun Proteomics
From Relative quantification to Signaling Pathways
Database of interaction partners: https://string-db.org/
• Add proteins known to interact with the targets
• Gives access to signaling pathways, by biological process or molecular function
• Remove nodes without connections
• k-means clustering to recover groups of closely related proteins
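
The "remove nodes without connections" step amounts to a simple filter on the interaction network. The Python sketch below shows it on a hypothetical edge list; it does not call the STRING service, and the protein identifiers are placeholders.

```python
def remove_isolated_nodes(nodes, edges):
    """Keep only the nodes that take part in at least one interaction."""
    connected = {node for edge in edges for node in edge}
    return [node for node in nodes if node in connected]

def neighbours(edges):
    """Build an adjacency mapping from an undirected edge list."""
    adjacency = {}
    for a, b in edges:
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)
    return adjacency

# Hypothetical identified proteins and interactions between them.
nodes = ["P1", "P2", "P3", "P4"]
edges = [("P1", "P2"), ("P2", "P3")]

kept = remove_isolated_nodes(nodes, edges)
print(kept)                      # P4 has no partner and is dropped
print(sorted(neighbours(edges)["P2"]))
```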
Large-Scale MS Based Strategies
PTMs Identification Strategies

Identification of PTMs requires an enrichment step because of the low
abundance of modified proteins and their transient nature


Phosphoproteins enrichment
53
Large-Scale MS Based Strategies
The Top-Down Approach
The top-down approach requires higher-performance MS instruments
Intact protein fragmentation is more difficult

Gives access to PTMs and proteoforms


54
Large-Scale MS Based Strategies
Proteogenomic Approach

[Figure: proteogenomic workflow]
DNA, RNA, Protein

Mutation analysis and protein impact

• DNA/RNA: sequencing, followed either by de novo assembly (contig 1, contig 2, assembly into a new genome) or by alignment on a reference genome (variant identification)
• Protein: LC-MS/MS, extraction of the peptide mass spectra
• Dedicated databases: metaproteomic identification, genome annotation by peptide mapping

55
Application of Large-Scale Proteomics
Cancer Research
Studying pre-cancerous lesions from at-risk patients (BRCA1 mutated) who had
undergone prophylactic ovariectomy

Lesion progression:
• Benign
• SCOUT: Secretory Cell OUTgrowth (PAX2-, Bcl2)
• STIL: Serous Tubal Intraepithelial Lesion (p53 signature)
• TILT: Tubal Intraepithelial Lesion in Transition (p53 & Ki67+)
• STIC: Serous Tubal Intraepithelial Carcinoma (p53+ & Ki67+++)
• Invasive high-grade serous carcinoma

Sample processing:
• IHC p53 or Ki67
• Remove cover slide
• Antigen retrieval
• Micro-digestion & LMJ micro-extraction

National PHRC Fimbria, PI E. Leblanc


Application of Large-Scale Proteomics
Cancer Research
Protein names                                               Gene names
Metastasis-associated protein MTA2                          MTA2
Transcription elongation factor A protein 1                 TCEA1
Fumarylacetoacetase                                         FAH
YLP motif-containing protein 1                              YLPM1
RNA-binding protein 10                                      RBM10
Serine/threonine-protein phosphatase 4 catalytic subunit    PPP4C
Cullin-5                                                    CUL5
C-type mannose receptor 2                                   MRC2
DnaJ homolog subfamily C member 3                           DNAJC3
Malectin                                                    MLEC
NHL repeat-containing protein 2                             NHLRC2
Calcyphosin-like protein                                    CAPSL
Syntaxin-binding protein 3                                  STXBP3
Aldo-keto reductase family 1 member                         AKR1
ATPase family AAA domain-containing protein 3               ATAD3
Thioredoxin domain-containing protein 17                    TXNDC17

[Figure: three panels, each labeled Carcinoma]

57
Application of Large-Scale Proteomics
Cancer Research
325 out of 1242 proteins show significant variations after ANOVA (FDR 0.01)

[Figure: bar chart of the percent of genes (axis from 10 to 25), under-expressed vs. over-expressed, for the groups Normal-p53, Normal-p53-STIL, Normal-p53-STIC and Carcinoma]

Normal-p53: Syndecan-4 mediated signaling pathway
Normal-p53-STIL: Beta1 integrin and integrin family cell surface interactions, TRAIL, proteoglycan syndecan
National PHRC Fimbria, PI E. Leblanc
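
Selecting proteins at a given false discovery rate (FDR) means correcting the per-protein p-values for multiple testing. The slides do not say which correction was used; the Python sketch below shows the Benjamini-Hochberg procedure as one common choice, applied to hypothetical p-values with the FDR of 0.01 quoted above.

```python
def benjamini_hochberg(p_values, fdr):
    """Return one boolean per p-value: True if significant at the given FDR."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k such that p_(k) <= (k / m) * fdr.
    max_rank = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= rank / m * fdr:
            max_rank = rank
    # Every p-value ranked at or below max_rank is declared significant.
    significant = [False] * m
    for rank, i in enumerate(order, start=1):
        if rank <= max_rank:
            significant[i] = True
    return significant

# Hypothetical ANOVA p-values for four proteins.
p_values = [0.0001, 0.004, 0.03, 0.5]
flags = benjamini_hochberg(p_values, fdr=0.01)
print(flags)
print(sum(flags), "of", len(p_values), "proteins retained")
```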
58
Application of Large-Scale Proteomics
Agri-food

60
Application of Large-Scale Proteomics
Paleoproteomics

62
Application of Large-Scale Proteomics
Space

63
Thanks for Your Attention

64
