The document discusses the concept of proteogenomics, which integrates proteomics and genomics to enhance genome annotation and understand biological functions. It highlights the advancements in technologies like high-throughput DNA sequencing and mass spectrometry that enable the identification of novel genes and protein-coding features. The book aims to provide insights into the applications of proteogenomics in human disease research, particularly in cancer and personalized medicine.
Proteogenomics

More information about this series at http://www.springer.com/series/5584
Ákos Végvári
Editor

Proteogenomics
Editor
Ákos Végvári
Clinical Protein Science & Imaging,
Department of Medical
Bioengineering, Biomedical Center
Lund University
Lund, Sweden
Department of Pharmacology &
Toxicology
University of Texas Medical Branch
Galveston, TX, USA

ISSN 0065-2598 ISSN 2214-8019 (electronic)


Advances in Experimental Medicine and Biology
ISBN 978-3-319-42314-2 ISBN 978-3-319-42316-6 (eBook)
DOI 10.1007/978-3-319-42316-6

Library of Congress Control Number: 2016951213

© Springer International Publishing Switzerland 2016


This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or
part of the material is concerned, specifically the rights of translation, reprinting, reuse of
illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way,
and transmission or information storage and retrieval, electronic adaptation, computer software,
or by similar or dissimilar methodology now known or hereafter developed.
The use of general descriptive names, registered names, trademarks, service marks, etc. in this
publication does not imply, even in the absence of a specific statement, that such names are
exempt from the relevant protective laws and regulations and therefore free for general use.
The publisher, the authors and the editors are safe to assume that the advice and information in
this book are believed to be true and accurate at the date of publication. Neither the publisher nor
the authors or the editors give a warranty, express or implied, with respect to the material
contained herein or for any errors or omissions that may have been made.

Printed on acid-free paper

This Springer imprint is published by Springer Nature


The registered company is Springer International Publishing AG Switzerland
Preface

The concept of proteogenomics, utilizing advances from the fields of proteomics and
genomics, was introduced at around the time of the completion of the sequencing of the
human genome. The emergence of proteogenomics is mainly due to the rapid development of
two key technologies: high-throughput DNA sequencing and mass spectrometry-based
proteomics. The ability to determine protein sequences by mass spectrometry has provided
a unique tool for the identification and the verification of novel genes, predicted
exons, and open reading frames. Consequently, proteogenomics has been used for genome
annotation, including the validation of known or annotated protein-coding genes; the
improvement of gene annotations by assigning correct start sites; the mapping of signal
peptides, proteolysis, and other posttranslational modifications (an important element
of biological function that is not encoded directly in the genome); as well as the
identification of splicing variants and mutant proteoforms often associated with disease
progression.
Considering the rapid advancement in the field, it is perhaps appropriate to define
proteogenomics as an intensive research area that investigates the correlations between
proteomic data and their corresponding genomic and transcriptomic data, with the goal of
improving our knowledge about life at the molecular level, which is a more complete view
than was initially suggested. The interplay between the two data streams of genomics and
proteomics certainly allows for a better understanding of biological functions and
molecular mechanisms in health and disease. Today, genome sequencing provides nearly
complete coverage, including transcriptome profiling, while targeted proteomics can be
focused on specific regions of the proteome and determine predicted proteins.
The goal of this book is to display this extended view on proteogenomics, depicting
research areas where proteogenomics is actively playing an essential role and also
highlighting some emerging research arenas without pretending to cover all fields of
application. The chapters of this book offer the readers a general insight into the
integrative analyses of various types of omics data and present advances within specific
principles, such as next-generation sequencing of DNA, mRNA sequencing, ribosome
profiling, as well as mass spectrometry- and antibody-based proteomics. The applications
are selected to exemplify the great potential of proteogenomics to contribute to human
disease research, particularly to cancer and personalized medicine.

Importantly, this book attempts to identify some common features that integrate the
various fields and areas where intensive efforts should be made to drive research more
efficiently in the near future. One of these is certainly bioinformatics, which has shown
amazing power and development during the last couple of years and which is anticipated to
provide powerful approaches to improve our ability to work with and combine the large
data sets that genomics, transcriptomics, and proteomics generate.
Finally, I would like to thank all the authors of this book for their exceptional
contributions, sharing their expert views of the field, and presenting their original
research. Their enthusiasm and timely delivery of their manuscripts helped me
tremendously to realize this project. It is my sincere hope that the readers will enjoy
this book as much as I enjoyed preparing it.

Galveston, TX, USA Ákos Végvári


March 1, 2016
Contents

1 Proteogenomic Tools and Approaches to Explore Protein Coding Landscapes of Eukaryotic Genomes
Dhirendra Kumar and Debasis Dash
2 Next Generation Sequencing Data and Proteogenomics
Kelly V. Ruggles and David Fenyö
3 Proteogenomics: Key Driver for Clinical Discovery and Personalized Medicine
Ruggero Barbieri, Victor Guryev, Corry-Anke Brandsma, Frank Suits, Rainer Bischoff, and Peter Horvatovich
4 Identification of Small Novel Coding Sequences, a Proteogenomics Endeavor
Volodimir Olexiouk and Gerben Menschaert
5 Using Proteomics Bioinformatics Tools and Resources in Proteogenomic Studies
Marc Vaudel, Harald Barsnes, Helge Ræder, and Frode S. Berven
6 Mutant Proteogenomics
Ákos Végvári
7 Proteogenomic Analysis of Single Amino Acid Polymorphisms in Cancer Research
Alba Garin-Muga, Fernando J. Corrales, and Victor Segura
8 Developments for Personalized Medicine of Lung Cancer Subtypes: Mass Spectrometry-Based Clinical Proteogenomic Analysis of Oncogenic Mutations
Toshihide Nishimura and Haruhiko Nakamura
9 Proteogenomics for the Study of Gastrointestinal Stromal Tumors
Tadashi Kondo
10 Proteogenomics for the Comprehensive Analysis of Human Cellular and Serum Antibody Repertoires
Paula Díez and Manuel Fuentes
11 Antibody-Based Proteomics
Christer Wingren

Index


1 Proteogenomic Tools and Approaches to Explore Protein
Coding Landscapes of Eukaryotic Genomes

Dhirendra Kumar and Debasis Dash

Abstract
Proteogenomic strategies aim to refine genome-wide annotations of protein coding features
by using actual protein level observations. Most of the currently applied proteogenomic
approaches include integrative analysis of multiple types of high-throughput omics data,
e.g., genomics, transcriptomics, proteomics, etc. Recent efforts towards creating a human
proteome map were primarily targeted to experimentally detect at least one protein
product for each gene in the genome and extensively utilized proteogenomic approaches.
The 14 year long wait to get a draft human proteome map, after completion of similar
efforts to sequence the genome, explains the huge complexity and technical hurdles of
such efforts. Further, the integrative analysis of large-scale multi-omics datasets
inherent to these studies becomes a major bottleneck to their success. However, recent
developments of various analysis tools and pipelines dedicated to proteogenomics reduce
both the time and complexity of such analysis. Here, we summarize notable approaches,
studies, software developments and their potential applications towards eukaryotic
genome annotation and clinical proteogenomics.

Keywords
Shotgun proteomics • Peptide identification • RNA-Seq • HUPO • Genome
annotation

D. Kumar • D. Dash (*)
G.N. Ramachandran Knowledge Centre for Genome Informatics, CSIR-Institute of Genomics and
Integrative Biology, South Campus, Sukhdev Vihar, Mathura Road, Delhi 110025, India
e-mail: [email protected]

© Springer International Publishing Switzerland 2016
Á. Végvári (ed.), Proteogenomics, Advances in Experimental Medicine and Biology 926,
DOI 10.1007/978-3-319-42316-6_1

1.1 Introduction

Biological systems are complex, self-replicable machineries of which major components are
proteins. Understanding the dynamics of protein expression in these systems may lead to a
better interpretation of the underlying mechanisms and
the predictability of potential outcomes. However, the techniques for probing these
proteome components are not completely unbiased, i.e., knowledge of each component of the
proteome is a necessary prerequisite to probe its expression. These proteomic techniques
are largely dependent on mass spectrometry (MS) based shotgun proteomics. Mass spectra,
containing mass to charge ratios and intensities for peptides and their fragments, are
searched against a database of known proteins to identify the expressed proteins and
their quantities (Eng et al. 2011). One of the limitations of this method lies in the
database itself, against which the spectral data generated in MS are searched. A protein
missing from the database cannot be probed for its expression, despite being present in
the sample (Frank et al. 2007). Thus, for comprehensive proteome profiling, the search
database should be complete. However, most of these databases are neither complete nor
error free (Kumar et al. 2016b). Proteogenomic techniques address this problem by
designing custom databases to identify the errors and achieve the completeness of the
proteome definition for any organism (Castellana and Bafna 2010; Nesvizhskii 2014).
Contrary to the routine proteomic searches, proteogenomic databases include proteins
beyond the annotated proteome. Proteins from any organism are generally annotated by
computationally predicting protein coding genes in the genome. While largely correct,
these predictions also contain several inaccuracies. Proteogenomics relies on the
detection of unique peptides from the MS data to correct these inaccuracies and refine
the protein annotations on a genome wide scale (Jaffe et al. 2004; Yates et al. 1995).
Although very useful, these approaches are full of conceptual and technical challenges
(Castellana and Bafna 2010). The order of complexity of proteogenomic approaches varies
for different organisms. For example, for a prokaryotic genome, a six frame translated
genome database should represent almost all possible protein coding genomic regions
(Armengaud 2013; Kelkar et al. 2011; Kumar et al. 2013, 2014, 2016a). However, in the
case of complex eukaryotes it would represent only a fraction of the possible protein
species arising from the genome (Tanner et al. 2007). This is primarily due to
alternative splicing of transcripts and only a tiny fraction of the eukaryotic genome
being protein coding. Alternatively, proteogenomic databases for eukaryotes generally
integrate high-throughput transcriptomic information to discover novel protein isoforms
from MS data searches. The high error rate, a byproduct of searching an extremely large
database, is one of the major concerns in most of these studies (Krug et al. 2013; Yadav
et al. 2013). Another factor contributing to potential false positive identifications is
genomic polymorphism between individual genomes and the reference genome. These
individual polymorphisms may result in new peptides from known genes, which may be mapped
incorrectly to other places in the genome, leading to incorrect assignment of novel
translated genomic regions. Additionally, inferring the exact isoform expressed in a
given biological state is a difficult task in eukaryotic proteogenomics. Since various
proteogenomic studies utilize a translated transcriptome as the search database, which
comprises sequences of several transcripts from the same gene, many of the peptide
identifications are shared among multiple database entries. Inferring the expressed
protein isoform/s from the identified peptide list then becomes a non-trivial exercise
and, if incorrect, it may adversely affect the conclusions. In addition to these,
proteogenomic approaches are compute resource intensive (Castellana and Bafna 2010).
Modern day approaches integrate multiple layers of omics information to discover novel
protein isoforms. Each of these omics datasets, for example genomics, transcriptomics,
proteomics, etc., is difficult to analyze independently. Further, their integration
requires multivariate analyses (Horvatovich et al. 2015; Zhang et al. 2014) and
consideration of multiple possible explanations for the observation (Omenn et al. 2015).
The complexity of such an analysis is reflected in several of the recent studies. For
example, even a decade after the human genome was sequenced, the characterization of the
human proteome was achieved only recently and only as
a draft version (Kim et al. 2014; Wilhelm et al. 2014). Nearly 20 % of the defined human
protein coding genes are yet to be characterized at the protein level. Several worldwide
initiatives are underway to detect at least one protein product for each of the human
protein coding genes (Deutsch et al. 2015; Kumar et al. 2015; Nilsson et al. 2015; Paik
et al. 2015). A similarly incomplete proteome scenario exists for other model organisms,
like mouse (Brosch et al. 2011), rat (Kumar et al. 2016b; Low et al. 2013), zebrafish
(Kelkar et al. 2014), corn (Zea mays) (Castellana et al. 2014), etc. Despite various
advances in MS instrumentation and analysis methods, defining the protein coding fraction
for any genome remains incomplete. While the dynamics of protein expression is certainly
one of the causes, the limited sensitivity of the method to detect low abundant proteins
remains an open challenge and a primary cause of not detecting many proteins. Complexity
of data analysis is another bottleneck in the detection of many proteins. Proteogenomic
analyses directly address this point but are yet to be adopted in mainstream proteomic
practice. Several of the recent tools and software packages that have been developed for
use in proteogenomic analyses should make it an easy to implement approach and should
expand its applications. Here, we describe various analysis tools and pipelines targeted
at eukaryotic proteogenomics.

1.2 Basics of Proteogenomics

Proteomics allows probing the expression of proteins from biological samples in a
high-throughput manner (Steen and Mann 2004). Peptides are identified from mass spectra
by searching against a protein sequence database using a search engine (Geer et al. 2004;
Yadav et al. 2011) and identified peptides are mapped back to protein sequences to infer
the expressed proteins (Eng et al. 2011). Proteogenomic approaches integrate these
large-scale peptide discoveries with genomics and transcriptomics data to refine or
enrich the annotation of protein coding genes (Armengaud 2009). Novel peptides, identified
from proteogenomics, may reveal translation at the intergenic, intronic or annotated
untranslated regions (UTRs), which may facilitate discovery of new genes, exons, splice
variants and mutated proteins. However, such an analysis would require creation of custom
search databases that maximize the representation of such novel proteoforms (isoforms of
proteins). Figure 1.1 highlights various possible custom database approaches and
associated potential discoveries. Recently, various software tools and pipelines have
been developed which either create a custom database or provide an end to end solution
for proteogenomic data analysis and conclusions. The most significant contribution of
these software solutions is to expand the outreach of such approaches to a larger
scientific community, in addition to reducing the technical complexity and potential
errors.

1.3 Proteogenomics Software Tools and Pipelines

A typical proteogenomic analysis includes custom database creation, peptide
identification, genomic mapping of identified peptides and inferring the corrected or new
gene model. Several of the recently developed tools offer only a part of the
proteogenomic analysis, whereas few pipelines offer a complete proteogenomic workflow
implementation. For example:

– CustomProDB (Wang and Zhang 2013), an R package that allows for the creation of custom
proteogenomic databases by incorporating single nucleotide polymorphism information from
a common variant call format (vcf) file or from RNA-Seq data
– SpliceDB (Burset et al. 2001) allows creation of a highly sensitive yet compact splice
graph database in FASTA format which can be searched by any of the peptide identification
tools
– MSProGene (Zickmann and Renard 2015) is another standalone application that allows
creation of a sample specific search database from RNA-Seq data with network information
of peptide sharing among the database entries

Fig. 1.1 Proteogenomic databases and refinement of genome annotations. ORF Open Reading
Frame, CDS coding DNA sequences, TIS translation initiation site, UTR untranslated regions
(Annotated). Blue color rectangles for CDS and peptides correspond to gene on positive
strand whereas red colored ones for gene on negative strand
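
As a rough illustration of the database and mapping ideas summarized in Fig. 1.1, the following Python sketch translates a short genomic sequence in all six reading frames and maps an identified peptide back to genomic coordinates on either strand, reporting the hit as a BED-style record. It is a toy example under stated assumptions (standard genetic code, exact string matching, no splicing or variants) and does not reproduce any of the tools listed in this section; all function names and the chromosome label chrDemo are hypothetical.

```python
# Toy sketch: six-frame translation and peptide-to-genome mapping.
# Names (map_peptide, chrDemo, ...) are illustrative, not from any published tool.

BASES = "TCAG"
# Standard genetic code, codons enumerated in TCAG order.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {a + b + c: AMINO[16 * i + 4 * j + k]
               for i, a in enumerate(BASES)
               for j, b in enumerate(BASES)
               for k, c in enumerate(BASES)}


def reverse_complement(seq):
    """Reverse complement of an upper-case DNA string."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(seq):
    """Translate codon by codon; unknown codons become 'X', stops become '*'."""
    return "".join(CODON_TABLE.get(seq[i:i + 3], "X")
                   for i in range(0, len(seq) - 2, 3))


def six_frame_translations(genome):
    """Yield (strand, frame, protein) for three forward and three reverse frames."""
    for strand, seq in (("+", genome), ("-", reverse_complement(genome))):
        for frame in range(3):
            yield strand, frame, translate(seq[frame:])


def map_peptide(genome, peptide, chrom="chrDemo"):
    """Return BED-style tuples (chrom, start, end, name, score, strand).

    Coordinates are 0-based, half-open, and always on the forward strand.
    """
    hits = []
    n = len(genome)
    for strand, frame, protein in six_frame_translations(genome):
        pos = protein.find(peptide)
        while pos != -1:
            start = frame + 3 * pos            # offset on the translated strand
            end = start + 3 * len(peptide)
            if strand == "-":                  # convert to forward coordinates
                start, end = n - end, n - start
            hits.append((chrom, start, end, peptide, 0, strand))
            pos = protein.find(peptide, pos + 1)
    return hits


genome = "CCATGGCTAAGTGGTAACC"                 # made-up sequence for the demo
for hit in map_peptide(genome, "MAKW"):
    print("\t".join(str(field) for field in hit))
```

Real pipelines additionally have to handle splice junctions, enzymatic digestion rules and sequence variants, none of which this sketch attempts.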

– The Proteogenomic Mapping Tool (Sanders et al. 2011) allows mapping of peptides back to
the genome in a quick and effective manner
– SpliceVista (Zhu et al. 2014) is a Python package that maps identified peptides on all
of the known splice-variants of proteins. It also allows integrated visualization of
proteomics data with transcript information
– dasHPPboard (Tabas-Madrid et al. 2015) is a HUPO endorsed data integration platform
which permits analysis and visualization of multiple omics datasets including proteomics
– VESPA (Peterson et al. 2012) is a JAVA based application that enables integrated
visualization of transcriptomic and proteomics datasets in proteogenomic context
– iPiG (Kuhring and Renard 2012) allows integration of peptide identifications into a
genome browser and thus enables concurrent analysis of multiple omics information
– PGx (Askenazi et al. 2015), a recent tool, converts peptide identifications into
Browser Extensible Data (BED) format, which contains genomic co-ordinates of features and
can be visualized in genome browsers like UCSC
– Among the earliest proteogenomic pipelines, Genome Annotating Proteomic Pipeline (GAPP)
(Shadforth et al. 2006) was designed specifically for the human genome. This web based
application improved the annotation for various genes by analyzing publicly available
proteomics data. However, this pipeline is no longer active for use
– PepLine (Ferro et al. 2008) is standalone software for genome annotation which is
independent of the database search method. It rather relies on a hybrid tag based search
to identify peptide tags and then maps and clusters these tags back to the genome to
discover potential translated regions. Due to the suspected low sensitivity and high
error rates of the tag based peptide detection and genome mapping approach, it has only
seen limited application in proteogenomics research
– Peppy (Risk et al. 2013) is one of the earliest developed pipelines for proteogenomic
analysis. It is a fast and automated framework for quickly searching MS data against the
extremely large eukaryotic genome translated databases to discover novel translated
regions. Use of advanced computational methods in this tool makes proteogenomic searches
implementable on a simple desktop even for higher eukaryotic genomes which generally
necessitate higher memory and compute infrastructure. Additionally, it allows a blind
modification search to account for novel post-translational modifications which otherwise
are very difficult to detect by regular proteomics searches. Despite these positive
features, Peppy has limited application in eukaryotic analyses, as a large fraction of
novel proteins in eukaryotes originate from alternate splicing of transcripts which
cannot be represented in a genome translated search database as implemented in this
pipeline
– Enosi (Castellana et al. 2014) proteogenomic pipeline comprises two functionalities.
First, the SpliceDB tool (Burset et al. 2001) is used to create a comprehensive yet
compact database of splice junctions from RNA-Seq reads. This FASTA formatted splice
graph database is then searched with MS data using the MS-GF+ search engine (Kim and
Pevzner 2014), which is a sensitive tool to detect more peptides. To evaluate novel
proteogenomic events including splice junctions, Enosi utilizes a probabilistic scoring
which takes into account the number of spectra and peptides assigned to the locus, the
quality of the assigned peptide spectral matches and the shared mapping of the peptide.
The eventProb probabilistic score allows Enosi to rank and filter the proteogenomic
findings according to their confidence. Further, the framework can utilize ab initio gene
predictions and RNA-Seq information to estimate the boundaries of alternate gene models
which accommodate the identified novel peptides. Additionally, the Enosi pipeline is
fully automated software and utilizes multi-threading to speed up the MS data searches
– PGTools (Nagaraj et al. 2015) is an end to end solution which seamlessly integrates
multiple components of proteogenomic analysis. It is an open source software suite which
offers fully automated searches along with the meta-analysis and visualization of novel
findings. It allows searches against multiple custom databases, e.g., databases
containing translated entries from transcripts, non-coding genes, UTRs, six frame
translated genome, splice junctions and somatic variations. By enabling searches against
cancer specific variations from the COSMIC database and fusion proteins, PGTools also
allows human cancer specific proteogenomic studies. Further, its multiple search engine
approach adds sensitivity to the overall peptide detection process. However, due to
differences in peptide detection confidence inherent to variable database sizes, result
integration from these different databases presents new challenges. Additionally, the
approach lacks the strength of individual or tissue specific proteogenomic searches such
as those based on RNA-Seq data
– ProteoAnnotator (Ghali et al. 2014) is a recent, open source and powerful pipeline for
proteogenomic discoveries from MS datasets. It addresses one of the common problems of
proteomics and proteogenomics research: file format standards. The entire pipeline
supports and exports Human Proteome Organization (HUPO) Proteomics Standards Initiative
(PSI) supported file formats like mzIdentML. ProteoAnnotator also allows multiple
database searches but primarily relies on gene predictions. Searching MS data against
gene predictions is an excellent approach for a newly sequenced genome, primarily due to
the increased sensitivity of peptide detection attributable to a small search database
compared to genomic or transcriptomic databases. The pipeline also introduces a
“non-canonical gene model score” calculation which allows assigning confidence values to
novel discoveries and thus automated assessment of the quality of novel findings. In
addition to these new features, it also presents an automated framework which integrates
multiple peptide search engines and a comprehensive statistical algorithm, FDRscore, for
result integration. Although it is very effective for proteogenomically annotating new
genomes, individual or sample based database searches are difficult to implement in this
framework
– Integrated transcriptomic-proteomic pipeline (ITP) (Kumar et al. 2016b) is a recently
published pipeline and comprises two analysis modules, one each for transcriptomics and
pro-
teomics data. The transcriptomic analysis module uses the Tuxedo suite of tools to align
and assemble RNA-Seq reads into transcripts by utilizing the reference genome. The second
module creates a translated transcriptome database from the assembled transcripts and
then searches mass spectra against this database using multiple search engines. Although
the pipeline lacks an entirely automated structure for public use, the approach has
several advantages. For example, using a reference genome guided transcriptome assembly
provides a definitive transcript model for the discovered novel peptides and thus proper
reannotation of exon boundaries and coding splice variants is possible. Similarly,
quantities of transcript isoforms may indicate the most probable protein coding isoform
despite extensive peptide sharing among isoforms. It also allows creation of tissue or
individual specific search databases, specifically useful in clinical studies. In
addition to these, multiple search engines and FDRscore (Jones et al. 2009; Kumar et al.
2013) based result integration within the second module, EuGenoSuite, maximize both the
sensitivity and specificity of peptide detection. Identified peptides are also exported
into gene transfer format (GTF), which can be easily integrated into most of the genome
browsers, thus enabling easy visualization of novel regions
– PPLine (Krasnov et al. 2015) is a Python language based automated proteogenomic
pipeline which integrates proteomics with exome sequencing and transcriptome sequencing
technologies. Its major focus is to discover variant novel peptides resulting from single
nucleotide polymorphisms (SNPs), insertions-deletions in the genomic DNA and alternative
splicing. It integrates several tools to accurately call SNPs from exome sequencing
reads, align RNA-Seq reads, assemble transcripts including splice junction isoforms from
reads and then allows proteomics data searches against a variant peptide database. This
comprehensive software enables sample/tissue specific database creation and thus
facilitates clinical proteogenomic analysis
– GALAXY-P (Jagtap et al. 2014) is among the few web-based frameworks for proteogenomics.
Despite its web based implementation, it allows extensive analysis for eukaryotic genomes
with flexibility at every step of analysis. It extends the Galaxy bioinformatics
framework for proteomics data analysis and allows the user to create custom integrative
analysis workflows. Default workflows within Galaxy-P allow MS data format conversion,
creation of proteogenomic databases from various web resources, two step database search
and statistical assessment of identified peptides, sequence similarity searches of novel
findings, evaluation of peptide-spectral matches by visualization and comprehensive
genomic visualization of novel peptides. The Galaxy framework allows smooth integration
of various genomics and transcriptomics data analyses and, with the Galaxy-P development,
integration of proteomics with other omics datasets becomes easy to implement. For
example, Sheynkman et al. (2014) developed three analysis workflows which enable
proteomics data searching within the Galaxy-P framework against single amino acid
polymorphism (SAP) and splice variant databases developed from RNA-Seq data
– QUILTS (Zhang et al. 2009) is software for creating individual specific human
proteogenomic search databases by integrating SAP variations, splice variants and gene
fusions into canonical protein sequences. Individual specific genomic and transcriptomic
variations have been attributed to different diseases, primarily cancers, and thus it
should allow clinical proteogenomic studies focused on detecting disease specific
variants. However, it is limited to human only and does not allow similar analysis for
other model organisms used to study human diseases.

With so many alternatives, one compelling question still remains: Which one is the best?
Although there have not been many studies
which compare the various pipelines available for eukaryotic proteogenomics, our recent
study suggests that many of these are actually complementary in their results (Kumar
et al. 2016b). We concluded that, due to differences in their search database
compositions, ITP, Enosi, ProteoAnnotator and Peppy bring complementary peptide
detections. Although there are many technical challenges in running multiple
proteogenomic pipelines on a large scale proteomic dataset, the strategy would help
achieve a comprehensive catalogue of novel translation events across the genome.

1.4 Future Perspectives

Although these tools have reduced the technical complexity of proteogenomic searches,
quality assessment of novel discoveries still remains a formidable challenge. Many
studies indicate the necessity of manual inspection of identified peptide spectrum
matches to ascertain true identifications (Omenn et al. 2015). However, it is not
feasible to implement manual inspection on large scale studies. Tools like Enosi and
ProteoAnnotator devised automated scoring systems to evaluate the novel identifications
separately for their authenticity, but a comprehensive statistical framework dedicated to
large scale proteogenomic studies is still needed. For example, both of the studies
claiming to achieve a draft human proteome map have been heavily criticized for their
high number of “low quality” identifications, adding up to false positives (Ezkurdia
et al. 2014). A few approaches have been suggested to overcome these hurdles (Shanmugam
and Nesvizhskii 2015; Zhang et al. 2015). However, these are yet to be implemented in
automated pipelines. Other than statistical attributes, false positives may also arise
due to incorrect genomic mapping of identified peptides. The genome of an individual can
vary considerably from the reference genomes at various places, characterized by genomic
variations like SNPs, insertions and deletions. If these are not taken into account, many
of the peptide identifications may be incorrectly assigned to novel loci. Proteogenomic
pipelines need to include this consideration while evaluating a novel translated region.
Integration of other omics readouts in proteogenomic frameworks could also be extremely
beneficial. In particular, using ribosome bound RNAs (ribosome profiling), rather than the
entire transcriptome, to create a custom search database would allow for a better
profiling of translated proteins and thus a better genome annotation. The recently
developed PROTEOFORMER (Crappe et al. 2014) pipeline integrates ribosome profiling with
MS based proteomics and proteogenomics analysis and could be extremely useful in
eukaryotic genome annotations. However, a similar integration in other existing pipelines
would expand the reach of such methods. These pipelines also need to include provisions
for unsequenced genomes. Custom de novo assembled transcriptomes may provide templates
for proteome profiling from MS data (Brinkman et al. 2015). Proteogenomic pipelines need
to be extended to include genome independent database creation, to facilitate similar
analysis for unsequenced or partially sequenced genomes.
Proteogenomic analyses hold promise for human disease related studies as well. Recent
studies suggest the potential of proteogenomics in discovering novel candidates in
different cancers (Alfaro et al. 2014; Rivers et al. 2014; Woo et al. 2014; Zhang et al.
2014). However, most of the existing pipelines do not consider disease related genetic
components. Extending these analysis frameworks would not only benefit new studies, they
would also assist in revisiting previous datasets for proteogenomic reanalysis.

Acknowledgements Authors would like to thank CSIR-IGIB for compute infrastructure and
project BSC0121 for publication charges.

References

Alfaro, J. A., Sinha, A., Kislinger, T., & Boutros, P. C. (2014). Onco-proteogenomics:
Cancer proteomics joins forces with genomics. Nature Methods, 11(11), 1107–1113.
Available from: PM:25357240.
Armengaud, J. (2009). A perfect genome annotation is within reach with the proteomics and
genomics alliance. Current Opinion in Microbiology, 12(3), 292–300. Available from:
PM:19410500.
