Proteogenomics
Editor
Ákos Végvári
Clinical Protein Science & Imaging,
Department of Medical
Bioengineering, Biomedical Center
Lund University
Lund, Sweden
Department of Pharmacology &
Toxicology
University of Texas Medical Branch
Galveston, TX, USA
1 Proteogenomic Tools and Approaches to Explore Protein Coding Landscapes of Eukaryotic Genomes
D. Kumar and D. Dash

Abstract
Proteogenomic strategies aim to refine genome-wide annotations of protein coding features by using actual protein level observations. Most of the currently applied proteogenomic approaches include integrative analysis of multiple types of high-throughput omics data, e.g., genomics, transcriptomics and proteomics. Recent efforts towards creating a human proteome map were primarily targeted to experimentally detect at least one protein product for each gene in the genome and extensively utilized proteogenomic approaches. The 14 year long wait to get a draft human proteome map, after completion of similar efforts to sequence the genome, reflects the huge complexity and technical hurdles of such efforts. Further, the integrative analysis of large-scale multi-omics datasets inherent to these studies becomes a major bottleneck to their success. However, recent developments of various analysis tools and pipelines dedicated to proteogenomics reduce both the time and complexity of such analysis. Here, we summarize notable approaches, studies, software developments and their potential applications towards eukaryotic genome annotation and clinical proteogenomics.

Keywords
Shotgun proteomics • Peptide identification • RNA-Seq • HUPO • Genome annotation

1.1 Introduction

... the predictability of potential outcomes. However, the techniques for probing these proteome components are not completely unbiased, i.e., knowledge of each component of the proteome is a necessary prerequisite to probe its expression. These proteomic techniques are largely dependent on mass spectrometry (MS) based shotgun proteomics. Mass spectra, containing mass to charge ratios and intensities for peptides and their fragments, are searched against a database of known proteins to identify the expressed proteins and their quantities (Eng et al. 2011). One of the limitations of this method lies in the database itself, against which the spectral data generated in MS are searched. A protein missing from the database cannot be probed for its expression, despite being present in the sample (Frank et al. 2007). Thus, for comprehensive proteome profiling, the search database should be complete. However, most of these databases are neither complete nor error free (Kumar et al. 2016b). Proteogenomic techniques address this problem by designing custom databases to identify the errors and achieve the completeness of the proteome definition for any organism (Castellana and Bafna 2010; Nesvizhskii 2014). Contrary to the routine proteomic searches, proteogenomic databases include proteins beyond the annotated proteome. Proteins from any organism are generally annotated by computationally predicting protein coding genes in the genome. While largely correct, these predictions also contain several inaccuracies. Proteogenomics relies on the detection of unique peptides from the MS data to correct these inaccuracies and refine the protein annotations on a genome wide scale (Jaffe et al. 2004; Yates et al. 1995).

Although very useful, these approaches are full of conceptual and technical challenges (Castellana and Bafna 2010). The order of complexity of proteogenomic approaches varies for different organisms. For example, for a prokaryotic genome, a six frame translated genome database should represent almost all possible protein coding genomic regions (Armengaud 2013; Kelkar et al. 2011; Kumar et al. 2013, 2014, 2016a). However, in the case of complex eukaryotes it would represent only a fraction of the possible protein species arising from the genome (Tanner et al. 2007). This is primarily due to alternative splicing of transcripts and only a tiny fraction of the eukaryotic genome being protein coding. Proteogenomic databases for eukaryotes therefore generally integrate high-throughput transcriptomic information to discover novel protein isoforms from MS data searches. The high error rate, a byproduct of searching an extremely large database, is one of the major concerns in most of these studies (Krug et al. 2013; Yadav et al. 2013). Another factor contributing to potential false positive identifications is genomic polymorphism between individual genomes and the reference genome. These individual polymorphisms may result in new peptides from known genes, which may be mapped incorrectly to other places in the genome, leading to incorrect assignment of novel translated genomic regions. Additionally, inferring the exact isoform expressed in a given biological state is a difficult task in eukaryotic proteogenomics. Since various proteogenomic studies utilize a translated transcriptome as the search database, which comprises sequences of several transcripts from the same gene, many of the peptide identifications are shared among multiple database entries. Inferring the expressed protein isoform(s) from the identified peptide list then becomes a non-trivial exercise and, if incorrect, may adversely affect the conclusions. In addition to these, proteogenomic approaches are compute resource intensive (Castellana and Bafna 2010). Modern day approaches integrate multiple layers of omics information to discover novel protein isoforms. Each of these omics datasets, for example genomics, transcriptomics and proteomics, is difficult to analyze independently. Further, their integration requires multivariate analyses (Horvatovich et al. 2015; Zhang et al. 2014) and consideration of multiple possible explanations for the observation (Omenn et al. 2015).

The complexity of such an analysis is reflected in several of the recent studies. For example, even a decade after the human genome was sequenced, the characterization of the human proteome was achieved only recently and only as a draft version (Kim et al. 2014; Wilhelm et al. 2014). Nearly 20 % of the defined human protein coding genes are yet to be characterized at the protein level. Several worldwide initiatives are underway to detect at least one protein product for each of the human protein coding genes (Deutsch et al. 2015; Kumar et al. 2015; Nilsson et al. 2015; Paik et al. 2015). A similarly incomplete proteome scenario exists for other model organisms, like mouse (Brosch et al. 2011), rat (Kumar et al. 2016b; Low et al. 2013), zebrafish (Kelkar et al. 2014), corn (Zea mays) (Castellana et al. 2014), etc. Despite various advances in MS instrumentation and analysis methods, defining the protein coding fraction for any genome remains incomplete. While the dynamics of protein expression is certainly one of the causes, the limited sensitivity of the method to detect low abundant proteins remains an open challenge and a primary cause of not detecting many proteins. Complexity of data analysis is another bottleneck in the detection of many proteins. Proteogenomic analyses directly address this point but are yet to be adopted in mainstream proteomic practice. Several of the recent tools and software packages that have been developed for use in proteogenomic analyses should make it an easy to implement approach and should expand its applications. Here, we describe various analysis tools and pipelines targeted at eukaryotic proteogenomics.

... peptides, identified from proteogenomics, may reveal translation at the intergenic, intronic or annotated untranslated regions (UTRs), which may facilitate discovery of new genes, exons, splice variants and mutated proteins. However, such an analysis would require creation of custom search databases which maximize the representation of such novel proteoforms (isoforms of proteins). Figure 1.1 highlights various possible custom database approaches and associated potential discoveries. Recently, various software tools and pipelines have been developed which either create a custom database or provide an end to end solution for proteogenomic data analysis and conclusions. The most significant contribution of these software solutions is to expand the outreach of such approaches to a larger scientific community, in addition to reducing the technical complexity and potential errors.

Fig. 1.1 Proteogenomic databases and refinement of genome annotations. ORF open reading frame, CDS coding DNA sequences, TIS translation initiation site, UTR untranslated regions (annotated). Blue rectangles for CDS and peptides correspond to genes on the positive strand, whereas red ones correspond to genes on the negative strand

1.3 Proteogenomics Software Tools and Pipelines

A typical proteogenomic analysis includes custom database creation, peptide identification, genomic mapping of identified peptides and inferring the corrected or new gene model. Several of the recently developed tools offer only a part of the proteogenomic analysis, whereas few pipelines offer a complete proteogenomic workflow implementation. Examples of both kinds are listed in this section.
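To make the stages of a typical analysis more concrete, the following toy sketch in Python covers two of them: building a six frame translated database from a short DNA string, and mapping a peptide back to genomic coordinates. It is an illustration under stated assumptions and not the implementation of any tool named in this chapter. The sequence, the chromosome name chrToy and all function names are hypothetical. Exact substring matching stands in for a real MS/MS search engine, the standard genetic code is assumed, and splicing is ignored, which is one reason such a database falls short for eukaryotes.

```python
"""Toy six frame translation and peptide-to-genome mapping (illustrative only)."""
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def reverse_complement(dna):
    """Return the reverse complement of an uppercase DNA string."""
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(dna):
    """Translate codon by codon; stop codons become '*', unknown codons 'X'."""
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X")
                   for i in range(0, len(dna) - 2, 3))


def six_frame(genome):
    """Return (strand, offset, protein) for three forward and three reverse frames."""
    frames = []
    for strand, seq in (("+", genome), ("-", reverse_complement(genome))):
        for offset in range(3):
            frames.append((strand, offset, translate(seq[offset:])))
    return frames


def map_peptide(peptide, genome):
    """Find exact peptide matches; return 0-based, half-open (start, end, strand)."""
    hits = []
    for strand, offset, protein in six_frame(genome):
        pos = protein.find(peptide)
        while pos != -1:
            start = offset + 3 * pos
            end = start + 3 * len(peptide)
            if strand == "-":
                # Convert reverse-complement coordinates to forward-strand ones.
                start, end = len(genome) - end, len(genome) - start
            hits.append((start, end, strand))
            pos = protein.find(peptide, pos + 1)
    return hits


def to_bed(chrom, peptide, hit):
    """Format one hit as a six column BED line (score fixed at 0)."""
    start, end, strand = hit
    return f"{chrom}\t{start}\t{end}\t{peptide}\t0\t{strand}"


if __name__ == "__main__":
    toy_genome = "GGATGGCTAAATGGTAACC"  # hypothetical sequence
    for pep in ("MAKW", "GYHL"):
        for hit in map_peptide(pep, toy_genome):
            print(to_bed("chrToy", pep, hit))
```

In this toy genome, MAKW is encoded on the forward strand and GYHL on the reverse strand. Coordinates are 0-based and half-open, following the BED convention, so the resulting lines could in principle be loaded as a custom track in a genome browser.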
– The Proteogenomic Mapping Tool (Sanders et al. 2011) allows mapping of peptides back to the genome in a quick and effective manner
– SpliceVista (Zhu et al. 2014) is a Python package that maps identified peptides on all of the known splice variants of proteins. It also allows integrated visualization of proteomics data with transcript information
– dasHPPboard (Tabas-Madrid et al. 2015) is a HUPO endorsed data integration platform which permits analysis and visualization of multiple omics datasets including proteomics
– VESPA (Peterson et al. 2012) is a Java based application that enables integrated visualization of transcriptomic and proteomics datasets in a proteogenomic context
– iPiG (Kuhring and Renard 2012) allows integration of peptide identifications into a genome browser and thus enables concurrent analysis of multiple omics information
– PGx (Askenazi et al. 2015), a recent tool, converts peptide identifications into browser extensible data (BED) format, which contains genomic coordinates of features and can be visualized in genome browsers like UCSC
– Among the earliest proteogenomic pipelines, Genome Annotating Proteomic Pipeline (GAPP) (Shadforth et al. 2006) was designed specifically for the human genome. This web based application improved the annotation for various genes by analyzing publicly available proteomics data. However, this pipeline is no longer active
– PepLine (Ferro et al. 2008) is standalone software for genome annotation which is independent of the database search method. It instead relies on a hybrid tag based search to identify peptide tags and then maps and clusters these tags back to the genome to discover potential translated regions. Due to the suspected low sensitivity and high error rates of the tag based peptide detection and genome mapping approach, it has seen only limited application in proteogenomics research
– Peppy (Risk et al. 2013) is one of the earliest developed pipelines for proteogenomic analysis. It is a fast and automated framework for quickly searching MS data against the extremely large eukaryotic genome translated databases to discover novel translated regions. Use of advanced computational methods in this tool makes proteogenomic searches implementable on a simple desktop even for higher eukaryotic genomes, which generally necessitate higher memory and compute infrastructure. Additionally, it allows a blind modification search to account for novel post translational modifications, which otherwise are very difficult to detect by regular proteomics searches. Despite these positive features, Peppy has limited application in eukaryotic analyses, as a large fraction of novel proteins in eukaryotes originate from alternative splicing of transcripts, which cannot be represented in a genome translated search database as implemented in this pipeline
– The Enosi (Castellana et al. 2014) proteogenomic pipeline comprises two functionalities. First, the SpliceDB tool (Burset et al. 2001) is used to create a comprehensive yet compact database of splice junctions from RNA-Seq reads. This FASTA formatted splice graph database is then searched with MS data using the MS-GF+ search engine (Kim and Pevzner 2014), which is a sensitive tool to detect more peptides. To evaluate novel proteogenomic events including splice junctions, Enosi utilizes a probabilistic scoring which takes into account the number of spectra and peptides assigned to the locus, the quality of the assigned peptide spectral matches and the shared mapping of the peptide. The eventProb probabilistic score allows Enosi to rank and filter the proteogenomic findings according to their confidence. Further, the framework can utilize ab initio gene predictions and RNA-Seq information to estimate the boundaries of alternate gene models which accommodate the identified novel peptides. Additionally, the Enosi pipeline is fully automated software and utilizes multi-threading to speed up the MS data searches
– PGTools (Nagaraj et al. 2015) is an end to end solution which seamlessly integrates multiple components of proteogenomic analysis. It is an open source software suite which offers fully automated searches along with the meta-analysis and visualization of novel findings. It allows searches against multiple custom databases, e.g., databases containing translated entries from transcripts, non-coding genes, UTRs, six frame translated genome, splice junctions and somatic variations. By enabling searches against cancer specific variations from the COSMIC database and fusion proteins, PGTools also allows human cancer specific proteogenomic studies. Further, its multiple search engine approach adds sensitivity to the overall peptide detection process. However, due to differences in peptide detection confidence inherent to variable database sizes, result integration from these different databases presents new challenges. Additionally, the approach lacks the strength of individual or tissue specific proteogenomic searches such as those based on RNA-Seq data
– ProteoAnnotator (Ghali et al. 2014) is a recent, open source and powerful pipeline for proteogenomic discoveries from MS datasets. It addresses one of the common problems of proteomics and proteogenomics research: file format standards. The entire pipeline supports and exports Human Proteome Organization (HUPO) Proteomics Standards Initiative (PSI) file formats like mzIdentML. ProteoAnnotator also allows multiple database searches but primarily relies on gene predictions. Searching MS data against gene predictions is an excellent approach for a newly sequenced genome, primarily due to the increased sensitivity of peptide detection attributable to the small search database compared to genomic or transcriptomic databases. The pipeline also introduces a “non-canonical gene model score” calculation, which assigns confidence values to novel discoveries and thus allows automated assessment of the quality of novel findings. In addition to these new features, it also presents an automated framework which integrates multiple peptide search engines and a comprehensive statistical algorithm, FDRscore, for result integration. Although it is very effective for proteogenomically annotating new genomes, individual or sample based database searches are difficult to implement in this framework
– Integrated transcriptomic-proteomic pipeline (ITP) (Kumar et al. 2016b) is a recently published pipeline and comprises two analysis modules, one each for transcriptomics and proteomics data. The transcriptomic analysis module uses the Tuxedo suite of tools to align and assemble RNA-Seq reads into transcripts by utilizing the reference genome. The second module creates a translated transcriptome database from the assembled transcripts and then searches mass spectra against this database using multiple search engines. Although the pipeline lacks an entirely automated structure for public use, the approach has several advantages. For example, using a reference genome guided transcriptome assembly provides a definitive transcript model for the discovered novel peptides and thus, proper reannotation of exon boundaries and coding splice variants is possible. Similarly, quantities of transcript isoforms may indicate the most probable protein coding isoform despite extensive peptide sharing among isoforms. It also allows creation of tissue or individual specific search databases, which are specifically useful in clinical studies. In addition to these, multiple search engines and FDRscore (Jones et al. 2009; Kumar et al. 2013) based result integration within the second module, EuGenoSuite, maximize both the sensitivity and specificity of peptide detection. Identified peptides are also exported into gene transfer format (GTF), which can be easily integrated into most of the genome browsers, thus enabling easy visualization of novel regions
– PPLine (Krasnov et al. 2015) is a Python based automated proteogenomic pipeline which integrates proteomics with exome sequencing and transcriptome sequencing technologies. Its major focus is to discover variant novel peptides resulting from single nucleotide polymorphisms (SNPs), insertions-deletions in the genomic DNA and alternative splicing. It integrates several tools to accurately call SNPs from exome sequencing reads, align RNA-Seq reads, assemble transcripts including splice junction isoforms from reads and then allows proteomics data searches against the variant peptide database. This comprehensive software enables sample/tissue specific database creation and thus facilitates clinical proteogenomic analysis
– Galaxy-P (Jagtap et al. 2014) is among the few web-based frameworks for proteogenomics. Despite its web based implementation, it allows extensive analysis for eukaryotic genomes with flexibility at every step of analysis. It extends the Galaxy bioinformatics framework for proteomics data analysis and allows users to create custom integrative analysis workflows. Default workflows within Galaxy-P allow MS data format conversion, creation of proteogenomic databases from various web resources, two step database search and statistical assessment of identified peptides, sequence similarity searches of novel findings, evaluation of peptide-spectral matches by visualization and comprehensive genomic visualization of novel peptides. The Galaxy framework allows smooth integration of various genomics and transcriptomics data analysis and, with the Galaxy-P development, integration of proteomics with other omics datasets becomes easy to implement. For example, Sheynkman et al. (2014) developed three analysis workflows which enable proteomics data searching within the Galaxy-P framework against single amino acid polymorphism (SAP) and splice variant databases developed from RNA-Seq data
– QUILTS (Zhang et al. 2009) is software to create individual specific human proteogenomic search databases by integrating SAP variations, splice variants and gene fusions into canonical protein sequences. Individual specific genomic and transcriptomic variations have been attributed to different diseases, primarily cancers, and thus it should allow clinical proteogenomic studies focused on detecting disease specific variants. However, it is limited to human only and does not allow similar analysis for other model organisms used to study human diseases.

With so many alternatives, one compelling question still remains: Which one is the best? Although there have not been many studies
which compare the various pipelines available for eukaryotic proteogenomics, our recent study suggests that many of these are actually complementary in their results (Kumar et al. 2016b). We concluded that, due to differences in their search database compositions, ITP, Enosi, ProteoAnnotator and Peppy bring complementary peptide detections. Although there are many technical challenges in running multiple proteogenomic pipelines on a large scale proteomic dataset, the strategy would help achieve a comprehensive catalogue of novel translation events across the genome.

1.4 Future Perspectives

Although these tools have reduced the technical complexity of proteogenomic searches, quality assessment of novel discoveries still remains a formidable challenge. Many studies indicate the necessity of manual inspection of identified peptide spectrum matches to ascertain true identifications (Omenn et al. 2015). However, it is not feasible to implement manual inspection on large scale studies. Tools like Enosi and ProteoAnnotator devised automated scoring systems to evaluate the novel identifications separately for their authenticity, but a comprehensive statistical framework dedicated to large scale proteogenomic studies is still needed. For example, both of the studies claiming to achieve a draft human proteome map have been heavily criticized for their high number of “low quality” identifications, adding up to false positives (Ezkurdia et al. 2014). A few approaches have been suggested to overcome these hurdles (Shanmugam and Nesvizhskii 2015; Zhang et al. 2015). However, these are yet to be implemented in automated pipelines. Other than statistical attributes, false positives may also arise due to incorrect genomic mapping of identified peptides. The genome of an individual can vary considerably from the reference genome at various places, characterized by genomic variations like SNPs, insertions and deletions. If these are not taken into account, many of the peptide identifications may be incorrectly assigned to novel loci. Proteogenomic pipelines need to include this consideration while evaluating a novel translated region.

Integration of other omics readouts in proteogenomic frameworks could also be extremely beneficial. In particular, using ribosome bound RNAs (ribosome profiling), rather than the entire transcriptome, to create a custom search database would allow for a better profiling of translated proteins and thus a better genome annotation. The recently developed PROTEOFORMER (Crappe et al. 2014) pipeline integrates ribosome profiling with MS based proteomics and proteogenomics analysis and could be extremely useful in eukaryotic genome annotations. However, a similar integration in other existing pipelines would expand the reach of such methods. These pipelines also need to include provisions for unsequenced genomes. Custom de novo assembled transcriptomes may provide templates for proteome profiling from MS data (Brinkman et al. 2015). Proteogenomic pipelines need to be extended to include genome independent database creation, to facilitate similar analysis for unsequenced or partially sequenced genomes.

Proteogenomic analyses hold promise for human disease related studies as well. Recent studies suggest the potential of proteogenomics in discovering novel candidates in different cancers (Alfaro et al. 2014; Rivers et al. 2014; Woo et al. 2014; Zhang et al. 2014). However, most of the existing pipelines do not consider disease related genetic components. Extending these analysis frameworks would not only benefit new studies but would also assist in revisiting previous datasets for proteogenomic reanalysis.

Acknowledgements Authors would like to thank CSIR-IGIB for compute infrastructure and project BSC0121 for publication charges.

References

Alfaro, J. A., Sinha, A., Kislinger, T., & Boutros, P. C. (2014). Onco-proteogenomics: Cancer proteomics joins forces with genomics. Nature Methods, 11(11), 1107–1113. Available from: PM:25357240.
Armengaud, J. (2009). A perfect genome annotation is within reach with the proteomics and genomics alliance. Current Opinion in Microbiology, 12(3), 292–300. Available from: PM:19410500.