PIIS2667237522002892
OPEN ACCESS
Review
Applications of deep learning in understanding gene regulation
Zhongxiao Li,1,2,4 Elva Gao,3,4 Juexiao Zhou,1,2 Wenkai Han,1,2 Xiaopeng Xu,1,2 and Xin Gao1,2,*
1Computer Science Program, Computer, Electrical and Mathematical Sciences and Engineering (CEMSE) Division, King Abdullah University
*Correspondence: [email protected]
https://doi.org/10.1016/j.crmeth.2022.100384
SUMMARY
Gene regulation is a central topic in cell biology. Advances in omics technologies and the accumulation of
omics data have provided better opportunities for gene regulation studies than ever before. For this reason,
deep learning, as a data-driven predictive modeling approach, has been successfully applied to this field
during the past decade. In this article, we aim to give a brief yet comprehensive overview of representative
deep-learning methods for gene regulation. Specifically, we discuss and compare the design principles
and datasets used by each method, creating a reference for researchers who wish to replicate or improve
existing methods. We also discuss the common problems of existing approaches and prospectively
introduce the emerging deep-learning paradigms that will potentially alleviate them. We hope that this article
will provide a rich and up-to-date resource and shed light on future research directions in this area.
Cell Reports Methods 3, 100384, January 23, 2023 © 2022 The Author(s).
This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/).
poor model generalizability, as features used in one scenario may not be as effective in other cases. It would therefore be highly desirable if such discriminative features could be discovered automatically by the learning algorithms, directly from the data themselves.

Since 2012, deep learning has achieved remarkable success in various other fields via a data-driven approach.16 Deep learning is a general term for machine-learning algorithms that are made up of deep neural networks (DNNs). DNNs consist of multiple artificial neural network layers, which are biologically inspired data-processing units that serve as non-linear transformation functions of their inputs. As each layer takes as input the result from the previous layer, the transformation becomes increasingly complex as the number of layers increases. These functions are learnable in the sense that they can adjust themselves during the "training" process. Deep-learning models are usually trained by fitting themselves to the training data through the optimization of an objective function (the "loss function") using the gradient descent algorithm.16 A trained model is then expected to perform inference tasks on data that come from the same or a similar statistical distribution as the training data.

Although the success of deep learning is due in part to its learning capacity, generalizability, and computational efficiency on dedicated computational architectures, its most important aspect is its representation learning ability. In contrast to "shallow learning" algorithms, deep learning, based on DNNs, performs inference tasks with a deep and hierarchical architecture. The lower layers in the hierarchy learn the "representations," which are highly discriminative features discovered by the algorithm using a data-driven approach. The higher layers summarize the representations from the lower layers and produce the result of the inference. This makes deep learning especially useful in omics because it overcomes the limitations of the "shallow learning" methods and can discover patterns in biological sequences or measurements that are yet unknown. This is undoubtedly one of the reasons why many successful deep-learning applications in omics have emerged in the past decade. In addition, copious omics data take a form that is amenable to being processed by deep-learning algorithms. For example, there is a similarity between biological sequences and natural languages: certain sequence motifs serve as regulatory codes, and the interactions between such codes serve as the regulatory grammar. This has led to numerous successful biological applications of off-the-shelf deep-learning models.17–19 Returning to the topic of gene regulation, it will be of great interest to find out whether deep learning in omics can decipher the regulatory code and grammar, model the regulatory process, help us understand the regulatory mechanism, and assist us in achieving the major goals of omics.

In this survey and perspective, we aim to give a brief yet systematic review of the application of deep learning in gene regulation studies with various kinds of omics data. We will cover applications of deep learning at various omics levels, including the genomic, transcriptomic, and proteomic levels (Figure 1). We will focus on the formulation of various prediction tasks addressing different biological questions that are attempted by deep learning. We will investigate the model architectures used by the studies and discuss their functionalities and design principles. In particular, we comprehensively list the datasets that are used in each study as a convenient reference for researchers willing to replicate existing methods or develop new methods in this field. Prospectively, we will discuss the application potential of various emerging deep-learning paradigms, such as self-supervised learning, meta-learning, and large-scale pre-trained models for biological sequences, that will potentially alleviate the problems of existing approaches. We also point out the trend of integrating structural information, multi-omics profiles, and single-cell profiles for gene regulation studies. We hope that this article will provide a rich and up-to-date resource and serve as a starting point for new researchers interested in this area.

REVIEW OF DEEP-LEARNING APPLICATIONS IN GENE REGULATION

Types of neural networks used in gene regulation studies
During the past decade, the neural network models used in gene regulation studies have largely followed those used in computer vision and natural language processing, from which deep learning first originated. The most popular types include multi-layer perceptrons (MLPs) (Figure 2A), which were quite popular in early applications of deep learning to tabular data. The input layer of an MLP directly takes in the data values from the input dataset, which are subsequently processed by one or more hidden layers. Finally, an output layer summarizes the processed information of the earlier hidden layers to produce the final prediction. Following their success in image recognition and text classification, convolutional neural networks (CNNs)20 have proved to be very useful for handling raw biological sequence data, whether DNA, RNA, or protein sequences (Figure 2B). CNNs employ convolution filters to process sequential or image data in a way that respects the spatial structure of the data. To handle long-range interactions in sequential data, recurrent neural networks (RNNs) such as gated recurrent units (GRUs)21 and long short-term memory (LSTM)22 have received particular interest in biological sequence analysis (Figure 2C). RNNs employ hidden states that remember sequential information from earlier locations; these hidden states benefit the modeling of long-range interactions. Graph neural networks (GNNs) are designed to handle structured datasets that are represented as a graph (Figure 2D). GNNs also have input, hidden, and output layers in their architectures, but in contrast to plain MLPs, which handle individual data points independently, the hidden and output layers of GNNs respect the topological structure of the dataset. In more recent years, inspired by the success in natural language processing and understanding, Transformers19,23,24 have also received a lot of attention for biological sequence data processing (Figure 2E). Transformers are powerful learners of sequential data, partly due to their use of the self-attention mechanism, which handles the pairwise interdependencies between sequence elements. More recently, Transformers have also become popular choices for self-supervised learning on biological sequences,18 which we will discuss in subsequent sections. According to the nature of the prediction tasks formulated in each study, researchers have designed deep-learning architectures utilizing one or more of the aforementioned networks with highly customized configurations to achieve higher performance, greater computational efficiency, and better biological interpretability. In this article, we do not wish to provide a comprehensive introduction to the neural network types used in deep learning. Instead, we refer readers to dedicated hands-on tutorials25,26 and introductory textbooks27 on deep learning.

Genomic-level applications
In this section, we review the most representative research works that have applied deep learning to the study of genomic-level regulation. All works that we review have formulated a supervised learning problem and can predict functional genomic features from the genomic sequence. In this way, they aim to decipher the regulatory code and grammar from the genomic sequence and predict how genetic variations will affect a particular regulatory mechanism.

The relevant studies and methods are listed and summarized in Table 1. In particular, we list the functionalities each method achieves, the datasets each method uses, and the deep-learning architectures on which the methods are based. As the regulatory code and grammar of genomic sequences are interpreted differently in different organisms and tissue/cell types, most models have particular ways to provide organism- and tissue-/cell-type-specific predictions. Therefore, we particularly highlight the species and tissue/cell types that are involved in each study.

DeepBind17 is a pioneering work dedicated to the prediction of nucleic acid-protein binding. Based on a CNN architecture, it can learn from multiple DNA-protein binding experimental profiling technologies, including protein binding microarrays60 (PBM), ChIP-seq,1 and HT-SELEX.61 DeepSEA30 is one of the seminal works that applied deep learning to whole-genome functional annotation. DeepSEA uses a CNN-based architecture that takes in a 1,000-bp human genomic DNA sequence and performs a multi-task prediction of DNase I hypersensitivity (DHS), transcription factor (TF) binding, and histone modification in multiple cell lines. DeepSEA was trained on 125 DHS profiles (by DNase-seq4) and 690 TF binding profiles (by ChIP-seq1 for 160 distinct TFs) from ENCODE,8 and 104 histone modification profiles (by ChIP-seq) from Roadmap Epigenomics.31 DeepSEA uses one model to predict the signals measured from 919 epigenomic profiles (125 DHS predictions, 690 TF binding predictions, and 104 histone modification predictions). This multi-task design allows it to share the learned genomic grammar while performing different tasks. Additionally, the authors trained a boosted logistic classifier using the predictions of DeepSEA and showed that it could prioritize functional non-coding regulatory mutations in HGMD32 and expression quantitative trait loci (eQTL) in GRASP.33

Motivated by the success of DeepSEA, follow-up works have improved and extended DeepSEA in multiple different aspects. Basset34 is also a CNN-based model, particularly focused on DHS prediction. It extends DeepSEA's DHS prediction to a total of 164 cell types (125 cell types from ENCODE8 and 39 cell types from Roadmap Epigenomics9). Basset trained cell-type-specific models by fitting one model for each cell type. Instead of using a pure CNN architecture, DanQ35 explored the effectiveness of a hybrid architecture combining a CNN and a bidirectional LSTM (BiLSTM). It outperformed DeepSEA even though the two were trained on exactly the same dataset.

The above integrative functional genomic models do not overshadow the effectiveness of dedicated predictive models. CpGenie36 predicts the methylation status of genomic DNA from its sequence. It was trained on reduced representation bisulfite
Table 1. Continued
(For each method: year; functionalities, abbreviated as in the legend below; datasets; model; species; tissue/cell types.)

- Basset34 (continued from the previous page): datasets also include Roadmap Epigenomics (DNase-seq, 39 cell types).

- DanQ35 (2016). Functionalities: DHS, Histone, TF, Variant. Datasets: the same dataset as DeepSEA. Model: CNN + BiLSTM (1 kb input). Species: human; tissue/cell types: multiple.

- CpGenie36 (2017). Functionalities: Variant, DNA met. Datasets: ENCODE (RRBS and WGBS data). Model: CNN (1,001 bp input). Species: human, mouse; tissue/cell types: multiple lymphoblastoid cell lines.

- De-Fine37 (2018). Functionalities: TF, Variant. Datasets: ENCODE (TF binding profiles of K562 and GM12878). Model: CNN (300 bp input). Species: human; tissue/cell types: K562 and GM12878.

- Basenji38 (2018). Functionalities: DHS, Histone, Variant. Datasets: ENCODE (593 DNase-seq profiles, 1,704 histone modification profiles); Roadmap Epigenomics (356 DNase-seq profiles, 603 histone modification profiles); FANTOM539 (973 CAGE40 experiments); all sequencing datasets are remapped, and genomic coverage is re-estimated with multi-mapping reads in consideration. Model: CNN (131 kb input, with dilated convolution and densely connected layers); multi-task learning. Species: human; tissue/cell types: multiple.

- Expecto41 (2018). Functionalities: DHS, Histone, TF, Variant. Datasets: ENCODE (125 DNase-seq profiles, 690 TF ChIP-seq profiles); Roadmap Epigenomics (209 DNase-seq profiles, 978 histone modification profiles); 218 tissue expression profiles from ENCODE, Roadmap Epigenomics, and GTEx.10 Model: three-stage model; stage one: epigenomic effects model (CNN-based, 2,000 bp input); stage two: spatial transformation; stage three: expression prediction. Species: human; tissue/cell types: multiple (>200).

- Xpresso42 (2020). Functionalities: gene expression prediction from promoter sequence. Datasets: gene expression data (protein-coding gene expression of 56 tissues from Roadmap Epigenomics; 254 mouse RNA-seq datasets from ENCODE); promoter sequences from FANTOM5. Model: CNN (with dilated convolution, 10.5 kbp input). Species: human, mouse; tissue/cell types: multiple.

- DeepMEL43 (2020). Functionalities: chromatin accessibility ("topic") prediction. Datasets: omniATAC-seq data containing 16 human melanoma cell lines44; 24 co-accessible regions ("topics") identified by cisTopic.45 Model: similar architecture to DanQ (500 bp input). Species: trained on human, with cross-species generalizability; tissue/cell types: melanoma samples.

- Enformer46 (2021). Functionalities: DHS, Histone, TF, Variant. Datasets: same dataset as used in Kelley47 (human: 38,171 sequences, 2,131 TF binding profiles, and 357 CAGE profiles). Model: CNN + Transformer (196 kb input); cross-species training. Species: human, mouse; tissue/cell types: multiple.

- BPNet48 (2021). Functionalities: TF. Datasets: ChIP-nexus49 profiles of four pluripotency TFs (Oct4, Sox2, Nanog, and Klf4) at 147,974 genomic regions. Model: 10-layer CNN (1 kb input, with dilated convolution and residual connections); multi-task prediction of the four TFs. Species: mouse; tissue/cell types: mESC.

3D genomic models:

- Akita50 (2020). Functionalities: Variant, 3D. Datasets: human cell line HFF Micro-C51; human cell line H1hESC Micro-C51; human cell line GM12878 Hi-C52; human cell line IMR90 Hi-C52; human cell line HCT116 Hi-C53; mouse mESC Micro-C54; mouse neural development Hi-C55. Model: Akita "trunk": Basenji-like architecture for genomic DNA sequence processing (1 Mb); Akita "head": transforms 1D genome representations into 2D genome-folding maps (2 kb bins) via pairwise averaging of deep-learning representations and addition of genomic distances between bins via positional embedding. Species: human, mouse; tissue/cell types: multiple (multiple human and mouse cell lines; mouse neuronal tissues).

- Orca56 (2022). Functionalities: DHS, Histone, Variant, 3D. Datasets: human cell line HFF Micro-C51; human cell line H1hESC Micro-C51. Model: multi-resolution 1D-CNN encoder (256, 32, and 1 Mb inputs), with an auxiliary prediction of histone modifications and DNase-seq; cascading 2D-CNN decoder. Species: human; tissue/cell types: multiple (HFF and H1 hESC cell lines).

DHS, DNase I hypersensitivity prediction; Histone, histone modification prediction; TF, transcription factor binding prediction; Variant, effect of genetic variants prediction; DNA met, DNA methylation prediction.
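Nearly all of the sequence-based models in Table 1 consume one-hot-encoded DNA and begin with convolution filters that act as learned motif detectors. The following toy sketch illustrates only that first step; the "TATA" filter and the example sequence are illustrative assumptions, not taken from any of the cited models:

```python
import numpy as np

# One-hot encode a DNA sequence into an L x 4 matrix (columns A, C, G, T).
def one_hot(seq):
    mapping = {"A": 0, "C": 1, "G": 2, "T": 3}
    x = np.zeros((len(seq), 4))
    for i, base in enumerate(seq):
        x[i, mapping[base]] = 1.0
    return x

# A single convolution filter is a k x 4 weight matrix; sliding it along the
# sequence yields one activation per window, highest where the window
# matches the motif that the filter encodes.
def conv1d(x, w):
    k = w.shape[0]
    return np.array([np.sum(x[i:i + k] * w) for i in range(x.shape[0] - k + 1)])

# A toy filter that matches the motif "TATA".
w = one_hot("TATA")
acts = conv1d(one_hot("GGTATACC"), w)
print(acts.argmax())  # the window starting at position 2 ("TATA") scores highest
```

In a trained CNN, many such filters are learned from data rather than hand-set, and their activations are passed through non-linearities and further layers.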
sequencing (RRBS) and whole-genome bisulfite sequencing (WGBS) profiles from ENCODE. De-Fine37 focused on TF binding prediction and was trained specifically on TF binding profiles from the K562 and GM12878 cell lines from ENCODE. In contrast to the previous studies, De-Fine used a cell-line-specific genome sequence instead of the human reference genome
Table 1. Continued

- GraphReg57 (2022). Functionalities: tissue-aware and tissue-agnostic gene expression prediction. Datasets: 3D genomic profiles from 4D Nucleome (accession GEO: GSE63525; accession GEO: GSE162819). Model: graph attention networks59 for 3D genomic interactions; Epi-GraphReg: tissue-agnostic gene expression inference based on epigenetic profiles; Seq-GraphReg: tissue-aware gene expression inference, with a CNN for sequential inputs. Species: human.
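The graph attention networks59 used by GraphReg can be illustrated with a minimal attention-weighted neighbor aggregation over genomic bins connected by 3D contacts. The feature sizes, the single attention head, and the linear scoring below are illustrative assumptions for a sketch, not the published architecture:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())  # subtract max for numerical stability
    return e / e.sum()

def graph_attention_layer(h, adj, w, a):
    """Single-head graph attention: each node (genomic bin) aggregates its
    neighbors' projected features, weighted by learned attention scores.
    h: n x d node features; adj: n x n 0/1 adjacency (e.g., 3D contacts,
    including self-loops); w: d x d projection; a: 2*d attention vector."""
    z = h @ w                                  # project node features
    n = h.shape[0]
    out = np.zeros_like(z)
    for i in range(n):
        nbrs = [j for j in range(n) if adj[i, j]]
        # attention logit for each neighbor, from the concatenated pair
        logits = np.array([a @ np.concatenate([z[i], z[j]]) for j in nbrs])
        alpha = softmax(logits)                # normalize over neighbors
        out[i] = sum(al * z[j] for al, j in zip(alpha, nbrs))
    return out

rng = np.random.default_rng(1)
h = rng.normal(size=(5, 4))                    # 5 bins, 4 features each
adj = np.eye(5, dtype=int)                     # self-loops ...
adj[0, 1] = adj[1, 0] = 1                      # ... plus one 3D contact
w = rng.normal(size=(4, 4))
a = rng.normal(size=8)
out = graph_attention_layer(h, adj, w, a)
print(out.shape)  # (5, 4)
```

A bin with only a self-loop simply keeps its projected features, while contacting bins blend information, which is how spatial genome structure can influence per-gene predictions.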
instead of predicting the binary chromatin accessibility per location, it predicts 24 co-accessible regions as identified by cisTopic.45 This better utilizes the co-regulatory mechanism of accessible chromatin regions. BPNet,48 a 10-layer CNN with dilated convolution and residual connections, trained on ChIP-nexus49 (a high-resolution improvement of ChIP-seq) data of four pluripotency TFs (Oct4, Sox2, Nanog, and Klf4), is able to produce base-resolution binding affinity predictions for all four TFs on genomic sequences in a multi-task fashion. BPNet is able to discover interesting cooperativity between TF motifs located within 1,000-bp regions, such as between the Oct4-Sox2 motif and the Oct4-Oct4 motif, and the periodic cooperation between the Nanog motif and an AT-rich motif.

All of the above methods regard the genome as linear sequences, one per chromosome. However, the cellular genome has a three-dimensional (3D) structure. The 3D structure of the genome is under extensive regulation and is able to affect gene expression, DNA replication, and DNA repair. Several approaches have been dedicated to deciphering the regulatory code and grammar of the 3D structure of the genome from genomic sequences. Akita50 is a deep-learning method that can predict genome folding from the genomic sequence. After training on Hi-C65 and Micro-C51 profiling data from five human cell lines, one mouse cell line, and multiple mouse neuronal tissues, Akita is able to infer the genome-folding map of each cell type, which is a two-dimensional (2D) matrix representing pairwise contacts between genomic regions. Akita utilizes a Basenji-like architecture as its "trunk" for processing 1-Mb genomic sequences. It then uses a "head" to transform the one-dimensional (1D) genomic sequence representations into 2D maps. The mean squared error between the predicted 2D map and the experimental Hi-C or Micro-C data is used as the training objective. Orca,56 a very recent improvement on Akita, enables the prediction of the genome-folding map at multiple resolutions. Orca uses a multi-resolution 1D-CNN genomic sequence encoder that can take in 256-Mb, 32-Mb, or 1-Mb inputs and encode them into 1D sequence representations. Orca then uses a cascading 2D-CNN decoder to decode the sequence representations into 2D genome-folding maps. Using only the Micro-C profiles of the two human cell lines (HFF and H1 hESC) that Akita had used, Orca produces genome-folding map predictions at various scales, from 1-Mbp regions within one chromosome to 256-Mbp regions that cover multiple chromosomes. Furthermore, Orca's sequence encoder is simultaneously trained on the DHS and histone modification profiles of the two cell lines from ENCODE and Roadmap Epigenomics, making it an integrative and multi-purpose model. In contrast to the earlier works that have focused on sequence-based prediction of genome-folding maps, a recent model, GraphReg,57 instead utilizes the 3D structure of the genome for better prediction of gene expression levels. GraphReg contains a set of two models, Epi-GraphReg and Seq-GraphReg. Epi-GraphReg infers tissue-agnostic gene expression levels based on epigenetic and 3D genomic profiles, and Seq-GraphReg infers tissue-aware gene expression levels based on genomic sequence and 3D genomic profiles. Both Epi-GraphReg and Seq-GraphReg utilize graph attention networks59 for modeling the spatial interactions between genomic locations.

Transcriptomic-level applications
The transcriptome serves as a central stage for gene regulation. The initiation of transcription requires the recognition of a promoter sequence by an RNA polymerase, binding of transcription factors to enhancers, and determination of a transcriptional start site (TSS). This process can be extensively regulated to control the rate of gene expression, and, if a gene has multiple promoters, the use of different promoters may produce RNA transcripts with different 5′ UTRs that potentially differ in translational efficiency.66 RNA splicing is also a highly regulated process and is a significant contributor to eukaryotic transcriptome diversity.67 In eukaryotes, the possible usage of multiple polyadenylation sites (PASs) produces mRNAs with different 3′ UTRs that may contain important regulatory elements.68 After the completion of the above processes, the mRNA molecule is transported out of the nucleus. RNA subcellular localization controls the spatial distribution of the newly transcribed mRNAs. Post-transcriptional mRNAs may also be selectively targeted by microRNAs (miRNAs), which are able to down-regulate the expression of certain genes. The 5′ UTR of an mRNA has an important effect on its translational efficiency, which directly controls the rate of protein synthesis.

We summarize research works that use deep learning to model each of the aforementioned processes in Table 2. Although some of the "genomic-level" prediction methods in the previous section may also make some transcriptomic-level predictions, especially the integrative functional genomic models such as Basenji and Expecto, we focus here on methods that are dedicated to particular aspects of transcriptomic-level regulation.

CNNProm69 is an early deep-learning-based method for promoter sequence recognition. The model uses one to two CNN layers for the binary classification of sequences into promoter/non-promoter sequences. Its effectiveness has been demonstrated in both prokaryotes (Escherichia coli and Bacillus subtilis) and eukaryotes (human, mouse, and Arabidopsis). As a successor to CNNProm, DeeReCT-PromID73 enlarged the input to 600 bp and enabled genome-wide scanning for promoters. The authors pointed out that models for promoter recognition that are trained on curated, balanced datasets may not be directly applicable to genome-wide scanning. This is because the majority of genomic regions are negative examples (non-promoters) and, therefore, the tolerable false-positive rate is much lower. DeeReCT-PromID employed a strategy of iteratively selecting hard negative samples to reduce the false-positive rate of the model. DeeReCT-TSS further improved on DeeReCT-PromID by inferring promoter usage in different cell lines from both promoter sequences and RNA-seq evidence. It demonstrated its functionality by training and evaluating on ten FANTOM539 cell lines, using genomic sequence and RNA-seq as input and matched CAGE-seq40 data as ground truth.

RNA splicing plays a critical role in transcriptomic-level regulation. By producing transcripts with different combinations of exons and introns, it contributes significantly to eukaryotic transcriptomic diversity.67 Given the complexity of the different patterns of alternative splicing, early works on deep-learning-based splicing prediction particularly focused on one alternative
Table 2. Continued
(For each method: year; main functionalities; datasets; model; species; tissue/cell types.)

Splicing:

- Barash et al.75 (2010). Splicing prediction of cassette exons. Datasets: microarray profiles of 3,665 cassette exons in 27 mouse tissues from Fagnani et al.76; 1,014-dim features extracted from the flanking sequence of each cassette exon. Model: a dedicated probabilistic model to estimate (q_inc, q_exc, q_nc) from microarray profiles; a one-layer NN (1,014-dim input) to predict the above probabilities from flanking-sequence features. Species: mouse; tissue/cell types: multiple (27 tissues).

- Leung et al.77 (2014). Splicing prediction of cassette exons. Datasets: RNA-seq profiles of 11,019 cassette exons in five mouse tissues from Brawand et al.78; 1,393-dim features extracted from the flanking sequence of each cassette exon. Model: MLP (1,393-dim input), with indices of the two tissues to compare. Species: mouse; tissue/cell types: multiple (5 tissues).

- Xiong et al.79 (2015). Splicing prediction of cassette exons in exon triplets. Datasets: Bodymap 2.0 (NCBI accession GEO: GSE30611), with 10,689 cassette exons in 16 normal tissues; 1,393-dim features extracted from the flanking sequence of each cassette exon. Model: MLP (1,393-dim input); Bayesian MCMC for learning without overfitting. Species: human; tissue/cell types: multiple (16 tissues).

- DARTS80 (2019). Splicing prediction of cassette exons guided by RNA-seq. Datasets: training: ENCODE8 K562 and HepG2 shRNA RBP knockdown datasets; testing: Roadmap Epigenomics31 RNA-seq data. Model: MLP (2,926 + 1,498 × 2 dim input), with 2,926 cis sequence features and 1,498 × 2 RBP expression levels; BHT integration of deep-learning prediction and RNA-seq evidence. Species: human; tissue/cell types: K562, HepG2.

- SpliceAI81 (2019). Splice site prediction from pre-mRNA. Datasets: GENCODE82 v24 (train: 13,384 genes, 130,796 donor-acceptor pairs; test: 1,652 genes, 14,289 donor-acceptor pairs); GTEx10 novel isoforms (67,012 splice donors, 62,911 splice acceptors). Model: CNN with dilated convolution62 and residual blocks (5,000 nt input); dense classification into (no splice site, donor, acceptor). Species: human; tissue/cell types: non-specific.

- Pangolin83 (2022). Splice site prediction from pre-mRNA. Datasets: reference transcripts (GENCODE v34 for human transcripts; ENSEMBL84 release 100 for rhesus monkey transcripts; GENCODE m25 for mouse transcripts; ENSEMBL release 101 for rat transcripts); RNA-seq data of four tissues (heart, liver, brain, and testis) of human, rhesus monkey, mouse, and rat from Cardoso-Moreira et al.85 Model: CNN with dilated convolution and residual blocks (15,000 nt input); predicts per-tissue splicing events. Species: human, rhesus monkey, mouse, rat; tissue/cell types: multiple (4 tissues in each species).

Polyadenylation:

- Leung et al.86 (2018). PAS quantification (pairwise comparison). Datasets: PAS reference: PolyA_DB 2,87 GENCODE, APADB,88 Derti et al.89 (polyA-seq data), and Lianoglou et al.90 (3′-seq data); PAS quantification: Lianoglou et al.90 (3′-seq data). Model: two-branch CNN for pairwise comparison of competing PASs. Species: human; tissue/cell types: multiple (7 tissue types).

- DeeReCT-PolyA91 (2019). PAS recognition. Datasets: Dragon human poly(A) dataset92 (14,740 sequences for the 12 main human PAS motif variants); Omni human poly(A) dataset93 (18,786 true positive PAS sequences for 12 human PAS motif variants); Xiao et al.94 3′-READS sequencing of mouse fibroblast cells of C57BL/6J (BL), SPRET/EiJ (SP), and their F1. Model: CNN with group normalization95 (200 nt input). Species: human, mouse; tissue/cell types: non-specific.

- APARENT96 (2019). PAS quantification. Datasets: 3 million APA massively parallel reporter assay variants from 13 libraries (9 of the 13 libraries, 2.4 million variants, used for training; the other four held out entirely). Model: two-layer CNN (186 nt input, a length that all randomized regions of the reporters can fit in); predicts the inclusion ratio of the proximal PAS in the reporter assay; gradient-based forward engineering of PAS sequences. Species: human; tissue/cell types: non-specific.

- DeeReCT-APA97 (2021). PAS quantification. Datasets: Xiao et al.94 3′-READS sequencing of mouse fibroblast cells of C57BL/6J (BL), SPRET/EiJ (SP), and their F1. Model: CNN + BiLSTM (448 nt per PAS, variable number of PASs in each example); models the interactions between competing PASs. Species: mouse (BL, SP, and BL×SP F1 hybrid); tissue/cell types: fibroblast.

RNA subcellular localization:

- RNATracker98 (2019). Subcellular localization prediction. Datasets: mRNA sequences from the Ensembl84 2017 release; RNA secondary structure implied from RNAplfold99; mRNA subcellular localization profiles: CeFra-seq data from Benoit Bouvrette et al.100 (cytosol, nuclear, membranes, insoluble) and APEX-RIP data from Kaewsapsak et al.101 (ER, mitochondrial, cytosol, nuclear). Model: CNN + BiLSTM (200 nt to more than 30,000 nt input). Species: human; tissue/cell types: HepG2 and HEK293T cell lines.

MicroRNA targets:

- MiRTDL102 (2015). MicroRNA target prediction. Datasets: TarBase dataset103 (1,297 positive miRNA-mRNA pairs and 309 negative pairs in human, mouse, and rat; further extended by a constraint-relaxing method to 198,620 positive pairs and 19,660 negative pairs). Model: CNN prediction based on 20 features of the miRNA-mRNA pair. Species: human, mouse, rat; tissue/cell types: non-specific.

Translation:

- Cuperus et al.104 (2017). 5′ UTR translational efficiency prediction. Datasets: measurements of 489,348 50-nt-long 5′ UTRs of yeast in a massively parallel growth selection experiment. Model: three-layer CNN (>50 nt input, to fit the randomized region); forward engineering of 5′ UTR sequences. Species: yeast; tissue/cell types: N/A.

RNA-protein binding:

- DeepBind17 (2015). Sequence-based RNA-protein binding prediction. Datasets: RNAcompete105 (207 distinct RBPs from 24 eukaryotes). Model: CNN (101 nt input). Species: multiple (24 eukaryotes); tissue/cell types: non-specific.

- NucleicNet106 (2019). Structure-based RNA-protein binding prediction. Datasets: 483 RNA-protein complexes from PDB,107 de-duplicated to 158 ribonucleoprotein structures. Model: CNN with ResNet-like architecture. Species: multiple; tissue/cell types: non-specific.
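SpliceAI's "dense classification" in Table 2 assigns every nucleotide a probability over the three classes (no splice site, donor, acceptor). A minimal sketch of that output layer, with toy logits standing in for the per-position scores that the CNN would produce:

```python
import numpy as np

def per_position_softmax(logits):
    """logits: L x 3 array of per-nucleotide scores for the classes
    (no splice site, donor, acceptor); returns L x 3 probabilities."""
    z = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

# Toy logits for a 4-nt stretch; position 1 strongly favors "donor" and
# position 3 favors "acceptor".
logits = np.array([[5.0, 0.0, 0.0],
                   [0.0, 6.0, 0.0],
                   [4.0, 1.0, 1.0],
                   [0.0, 0.0, 5.0]])
probs = per_position_softmax(logits)
print(probs.argmax(axis=1))  # -> [0 1 0 2]
```

Each row sums to one, so thresholding the donor and acceptor channels directly yields candidate splice sites along the transcript.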
splicing type: the cassette exons. For example, an early work of Barash et al. developed a one-layer neural network for the prediction of cassette exon differential usage across mouse tissues.75 The network takes in a 1,014-dimensional vector containing sequence features flanking the exon of interest and outputs three-class classification scores for "increased usage," "decreased usage," or "no change." The model is trained on microarray profiles of 3,665 cassette exons in 27 mouse tissues.76 Leung et al. further enlarged the dataset to a total of 11,019 cassette exons in mouse and increased the dimension of sequence features to 1,393 for tissue-specific splicing pattern prediction.77 Using the same sequence features as Leung et al., Xiong et al. developed an MLP for the prediction of 10,689 human cassette exons.79 The model was trained on Bodymap 2.0 RNA-seq data (NCBI accession GEO: GSE30611) and used a Bayesian MCMC procedure to reduce overfitting. Using the predictive model, the researchers were able to examine mutations that alter splicing in genes involved in several human genetic diseases. DARTS80 provided an example of integrating deep-learning-processed sequence features and RNA-binding protein (RBP) expression levels with low-coverage RNA-seq evidence for the differential usage analysis of cassette exons between conditions. DARTS integrates MLP-based deep learning with the Bayesian hypothesis test (BHT) by asking its deep-learning module to provide a prior distribution for its BHT module. In this way, DARTS enables deep-learning-guided study of alternative splicing even when the experimental RNA-seq data are not of sufficient sequencing depth.

SpliceAI81 is the first deep-learning-based splice site predictor for all splicing types. SpliceAI simulates the in vivo pre-mRNA processing machinery and directly predicts splice sites from raw pre-mRNA sequences. SpliceAI takes in long input (5,000 nt) to handle large chunks of pre-mRNA sequences and performs three-class classification (no splice site, donor site, or acceptor site) at each pre-mRNA position. SpliceAI also utilizes the dilated convolution62 and residual block108 for an increased receptive field of high-level neurons. Pangolin83 further extends SpliceAI in a multi-task fashion for the detection of splice sites in a total of four tissue types (heart, liver, brain, and testis) from four species (human, rhesus monkey, mouse, and rat).

The termination of Pol II transcription in eukaryotic cells requires cleavage at the 3′ end of the transcript and the addition of a poly(A) tail, a process called polyadenylation. Similar to promoters that determine TSSs, PASs determine transcription termination sites. A gene may have multiple competing PASs, and cells from different tissue types or conditions may preferentially use each of them. Such alternative polyadenylation (APA) can modify the 3′ UTRs of transcripts, can strongly affect mRNA stability109 and cellular localization,110 and is involved in various human diseases. DeeReCT-PolyA91 is one of the first deep-learning methods for recognizing PASs. Using a CNN with group normalization95 to increase robustness, DeeReCT-PolyA takes in 200 nt sequences and predicts whether they contain a PAS or not. The model achieved state-of-the-art performance and substantially outperformed non-deep-learning methods on two human polyadenylation datasets, the Dragon human poly(A) dataset92 and the Omni human poly(A) dataset,93 and one mouse polyadenylation dataset from Xiao et al.94

Instead of tackling the PAS recognition problem, Leung et al. were the first to address the PAS quantification problem by applying a two-branch CNN for pairwise comparison of competing PASs of a gene.86 Leung et al. assembled a reference of PASs in the human genome using four different reference databases and used 3′-seq data from Lianoglou et al.90 for quantification. Instead of casting the PAS quantification problem into pairwise comparison problems, DeeReCT-APA97 handles variable numbers of PASs per gene using a combined CNN and BiLSTM architecture. DeeReCT-APA is able to model the interactions between competing PASs and achieves better performance than the model from Leung et al.86 Instead of training models only on endogenous PAS sequences, APARENT96 trained a two-layer CNN on 3 million synthesized massively parallel reporter assay (MPRA) sequences of APA. The MPRA is able to measure hundreds of thousands of synthesized PAS sequences' regulatory activities in parallel. APARENT's CNN was trained to predict the measured regulatory activity given the PAS sequence. Using gradient-based optimization of input sequences, the APARENT model is able to engineer PAS sequences to have desired levels of regulatory activity.

Transcriptomic-level regulation can also be carried out through other post-transcriptional mechanisms. mRNA subcellular localization controls gene expression both spatially (by transporting mRNA into different subcellular structures) and quantitatively (by modulating the accessibility of mRNA to ribosomes). RNATracker98 is a deep-learning tool that predicts such localization patterns of mRNAs. Utilizing the mRNA sequence and the RNA secondary structure predicted by RNAplfold,99 RNATracker is able to classify mRNAs into their plausible subcellular localizations. The version of RNATracker trained on CeFra-Seq data100 classifies mRNA localizations into cytosol, nuclear, membranes, and insoluble, while the version trained on APEX-RIP data101 classifies localizations into ER, mitochondrial, cytosol, and nuclear. mRNAs can also be targeted by miRNAs that silence mRNA expression. MiRTDL102 is a CNN-based tool that is able to predict potential miRNA-mRNA interactions. The 5′ UTR of an mRNA greatly affects ribosomal translational efficiency and is under frequent regulation. Similar to the motivation of the APARENT model for the prediction of PAS strength, Cuperus et al. developed a 3-layer CNN to predict 5′ UTR translational efficiency by training it on 489,348 synthesized 5′ UTRs in yeast.104 The translational efficiency of each synthesized 50-nt-long 5′ UTR was measured by a massively parallel growth selection experiment and was used as the CNN's prediction target.

Under the hood of all the transcriptomic-level gene regulations are complex interactions between RNA and various other types of biomolecules. RNA-protein binding is certainly one of the most important, as most post-transcriptional regulations are mediated through RBPs. Besides predicting DNA-protein binding, DeepBind17 is also able to predict RNA-protein binding. After being trained on the RNAcompete assay data,105 it is able to predict the binding preference of an RNA molecule to 207 distinct RBPs from 24 eukaryotes based on the RNA sequence. NucleicNet106 pursued a path different from DeepBind. Instead of performing sequence-based RNA-protein binding predictions, it makes RBP-centric predictions based on their structures. Trained on 158 ribonucleoprotein structures from the PDB,107 it is able to predict an RBP's binding sites for RNAs as well as each binding site's preference for each type of RNA constituent. Readers are referred to Wei et al.111 for a detailed survey of deep-learning applications in RNA-protein binding predictions.

Proteomic-level applications
After a protein is translated from its mRNA template, regulation can take place at the proteomic level via a number of mechanisms. Protein post-translational modifications (PTMs) comprise a family of such regulatory processes that covalently modify a protein after it is translated. For example, serine (Ser), threonine (Thr), and tyrosine (Tyr) residues can be modified by phosphorylation, which plays an important role in intracellular signal transduction. Furthermore, lysine (Lys) residues can be modified through ubiquitination, which can mark a protein for degradation. Protein subcellular localization determines the cellular compartments where a protein resides and exerts its functions, which substantially affects its function and activities.

We summarize these proteomic-level deep-learning applications in Table 3 (upper half). Similarly, we highlight their functionalities, training datasets, and model architectures to facilitate future explorations in this area.

Deep-learning models have been developed for the prediction of PTMs from protein sequences. DeepPhos112 is a densely connected CNN architecture119 that predicts phosphorylation sites from protein sequences. DeepPhos112 formulates three different phosphorylation prediction tasks. The general prediction involves predicting whether an amino acid position is a phosphorylation site. The residue-specific prediction requires a model to predict in which amino acid type phosphorylation occurs. The kinase-specific prediction requires a model to predict which kinase is responsible for the phosphorylation event. DeepUbi120 is a CNN architecture that predicts ubiquitination from protein sequences, achieving an area under the curve of 0.9 across a total of 176 species. MusiteDeep122–124 is a series of works for multiple PTM type prediction. Its latest version uses an ensemble of multi-layer CNN and Capsule Network125 that is able to handle 13 different PTM types. MusiteDeep updates its predictions for UniProt protein sequences every 3 months, and its prediction results are available at https://www.musite.net.

DeepLoc126 is a deep-learning-based prediction tool that is able to infer a protein's subcellular localization from its sequence. DeepLoc is designed as a combined CNN and LSTM architecture. It uses attention-based decoding to identify sequence regions with high predictive power. It organizes the ten subcellular locations in a hierarchical manner and uses a hierarchical tree classification likelihood to train the model. In this way, without using information from homologous sequences, it is able to achieve 78% accuracy for a total of ten subcellular locations.

Phenotypic-level applications in animal and plant species
The aforementioned deep-learning applications in gene regulation have been mainly at the microscopic level of biological processes. However, bridging the gap between the genotype and the phenotype represents one of the ultimate goals of gene regulation studies. In this section, we summarize deep-learning models that have been developed to assist in genotype-to-phenotype and phenotype-to-genotype inferences in animal and plant species (Table 3, lower half).

Following the development of DeepSEA, Zhou et al. further developed DeepSEA-based models for the prediction of the effect of non-coding variants on autism spectrum disorder (ASD).128 In this work, two DeepSEA-based models were trained to predict transcriptional and post-transcriptional regulatory effects separately. The resulting predictions were summarized into disease impact scores by training an LR model on top of the model predictions using known disease-associated mutations. Using the disease impact scores, it was then able to prioritize disease-associated mutations observed in 1,790 ASD-affected families. DeepWAS129 introduces a deep-learning-assisted genome-wide association study (GWAS) pipeline. DeepWAS utilizes the pre-trained DeepSEA model to produce a list of candidate variants for a GWAS. In this way, it reduces the number of candidates for GWAS and increases its statistical power. The authors demonstrated its effectiveness by improving three existing GWASs for multiple sclerosis,130 major depressive disorder,131 and body height.132

In plant species, deep learning has also been applied in multiple plant phenotype prediction tasks. For example, DeepGP133 applied a CNN-based model for phenotype prediction in two polyploid outcrossing species: strawberry and blueberry. Five strawberry fruit quality traits were predicted for strawberry individuals based on microarray genotypes, and five blueberry fruit quality traits were predicted for blueberry individuals based on genotypes obtained from Rapid Genomics Capture-seq.136 Shook et al. studied the possibility of predicting crop yield based on genotype and environmental factors. Using the Uniform Soybean Tests data,138 which contain soybean yields in the United States and Canada during 2003–2015, the authors separately built LSTM and temporal attention139 models for soybean crop yield prediction. At each time step, the model considers the crop's genotype and seven weather variables during its growth period and forecasts the yield during harvest seasons. Such deep-learning applications in plant phenotype prediction tasks will provide valuable insights for plant breeding.

PROBLEMS AND LIMITATIONS OF CURRENT DEEP-LEARNING APPLICATIONS

Challenges in training and interpreting deep-learning models
In this section we discuss challenges that are common in the applications of deep learning, especially those in model training and model interpretation.

Deep-learning models are known to be difficult to train because their high non-linearity makes the optimization of the objective function difficult. Typically, deep-learning models are trained using stochastic gradient descent (SGD), as it is an efficient first-order optimization algorithm and its stochasticity allows it to jump out from local minima.27 However, the size of each gradient descent step (the "learning rate") can be difficult to configure. A small learning rate can result in slow training
Table 3. Continued

Protein subcellular localization:
DeepLoc126 | 2017 | multiple eukaryotes
  Main functionalities: subcellular localization prediction
  Datasets: UniProt release 2016_04 (sequences ≥40 aa, with no more than 1 subcellular location, with experimental support, CD-HIT similarity ≤30%); Höglund et al., 2006127; 10 subcellular locations in total
  Model: CNN + LSTM (max. 1,000 aa input); attention-based decoding; hierarchical tree classification and likelihood

Genotype-to-phenotype inference in animal species:
Zhou et al.128 | 2019 | human
  Main functionalities: prediction of the effect of non-coding variants on autism spectrum disorder
  Datasets: Roadmap Epigenomics histone marks and DNase I profiles (2,002 epigenetic features); ENCODE and previously published CLIP datasets (231 profiles for a total of 82 RBPs); the Simons Simplex Collection of whole-genome sequencing data of 7,097 genomes for 1,790 ASD-affected families
  Model: transcriptional regulatory effects model (DeepSEA with doubled convolution layers; model prediction expanded from the original 919 epigenetic targets to 2,002 targets); post-transcriptional regulatory effects model (similar architecture as DeepSEA; prediction of binding affinity of 82 unique RBPs)

DeepWAS129 | 2020 | human
  Main functionalities: using a genomic deep-learning model to enhance genome-wide association studies
  Datasets: KKNMS microarray profiles for multiple sclerosis (MS)130; MDDC microarray profiles for major depressive disorder (MDD)131; KORA microarray profiles for body height132
  Model: using the pre-trained DeepSEA model for prioritizing variants that affect genomic functional units; using the prioritized variants to propose candidate variants for GWAS analysis
Table 3. Continued

Genotype-to-phenotype inference in plant species:
DeepGP133 | 2020 | strawberry, blueberry
  Main functionalities: multiple phenotype prediction in polyploid outcrossing species
  Datasets: five advanced selection trials of strawberry (University of Florida),134 with evaluation of five yield and fruit quality traits (soluble solid content, average fruit weight, total marketable yield, early marketable yield, percentage of culled fruit) and microarray genotyping of 1,233 individuals; one cycle of a blueberry breeding program (University of Florida),135 with evaluation of yield
  Model: using both CNNs and Bayesian penalized linear regression for phenotype prediction

Shook et al. (row partially shown)
  Datasets: soybean yield in USA and Canada during 2003–2015
progress and becoming stuck in a local minimum; a high learning rate will make the algorithm fail to converge. The direction of an SGD update, i.e., the direction toward which the objective function decreases the fastest, could also alternate too rapidly, making the optimization trajectory oscillate around a local minimum.140

Therefore, several improvements to SGD have been made to make the algorithm more efficient and stable, as summarized in Table 4. Adagrad141 adaptively modulates the learning rate for each model parameter based on its magnitude. RMSprop142 also adaptively modulates the learning rates, but based on the exponential moving averages of the parameters' magnitudes. Improvements to the direction of the parameter update are also available, such as momentum and Nesterov momentum.156 Adam140 introduces the momentum update to RMSprop and has been effective in the optimization of large CNNs. More recently, several successors to Adam have become more and more popular, including NAdam143 (Adam with Nesterov momentum), AdamW144 (Adam with decoupled weight decay), and RAdam145 (Adam with a more stabilized adaptive learning rate in the warm-up process). To boost model performance, researchers are always encouraged to apply these algorithms and their variants in real-world practice.

Another challenge in training deep-learning models concerns hyperparameter selection. The selection of hyperparameters can substantially affect training stability and model performance. As the model grows larger, the space of hyperparameter combinations increases exponentially. Therefore, it is necessary to employ heuristic search strategies in hyperparameter tuning. Techniques such as random search,157 coordinate descent, and Bayesian optimization158 are common choices in practice. To facilitate the hyperparameter tuning process, software libraries with integrated hyperparameter selection algorithms such as Ray Tune146 and KerasTuner149 can be applied to existing projects with minimal modifications to the existing source code.

Another challenge of deep learning is the difficulty of its interpretation. Unlike shallow models such as linear models, decision trees, and SVMs, deep-learning models have complex hierarchical architectures, and their hidden states cannot be interpreted in easy-to-understand terminologies. However, existing deep-learning studies in gene regulation have employed various kinds of methods to improve model interpretability. We categorize such methods into four general categories, which are summarized in Table 4.

Example-based methods
For example-based methods, the deep-learning models are explained by training or testing examples in the dataset. To interpret a specific layer or a hidden-state neuron, examples that result in their high activation are selected (e.g., top 5% among all examples). The commonality among those examples can be used as an interpretation. For example, the subsequences that result in high activation of a specific convolution filter can be collapsed into position frequency matrices (PFMs) and visualized by sequence logos. In this way, the regulatory motifs that the model is "looking at" can be revealed. This technique is commonly used by sequence-based models, such as those described by Alipanahi et al.17 and Bogard et al.96

Perturbation-based methods
Another way to interpret model predictions is based on perturbation. This is done by modifying ("perturbing") the model's input and inspecting the changes in the output. It is expected that the model's prediction will substantially decrease if the most discriminative part of the input example is perturbed. For example, by performing in silico pointwise mutations for a biological sequence, we can produce a so-called mutation map of the sequence by asking the model to predict a score for each of the mutated sequences. This method has been systematically investigated by DeepBind to confirm the putative sequence motifs that are recognized by its convolution filters.17

Attribution-based methods
In contrast to example-based and perturbation-based methods, which still treat the deep-learning model as a black box, attribution-based methods open up the black box by attributing a model's intermediate network value to the model's input. Such methods compute an attribution score for each element of the input. The magnitude of the score indicates the amount of its contribution, and the sign of the score shows whether the contribution is positive or negative. The saliency map159 is one of the simplest attribution-based methods. It is defined simply as the model's gradient with respect to the input. In recent years, more theoretically guaranteed approaches have been developed, including Integrated Gradients,152 SHAP,153 and DeepLIFT.154 For instance, BPNet extensively utilized DeepLIFT when inspecting the model's binding affinity predictions for the TFs. Attribution-based methods seem to have a higher sensitivity than example-based methods. In the BPNet paper, the researchers systematically discussed the complex sequence motifs that could only be discovered by DeepLIFT but not by PFMs, such as the helical periodicity patterns of Nanog.48

Model-based methods
Instead of making post hoc interpretations of the model, it is better to consider interpretability during the model's development. There are building blocks of deep-learning models that are inherently interpretable, such as attention modules.23 Attention scores can provide the location of regions to which the model is paying attention. Dividing models into different stages and producing interpretable results at the end of each stage is also a common strategy. For example, Expecto predicts gene expression in three stages.41 In the first stage, it transforms genomic sequences into epigenomic features, which consist of 2,002 genome-wide histone marks; in the second stage, it aggregates the epigenomic features produced in the first stage based on spatial closeness; in the third stage, it predicts the gene expression level from the aggregated features. In this way, the transparency at each stage provides interpretability for the whole pipeline.

Limitations of existing deep-learning applications in gene regulation
Apart from the aforementioned general problems of deep-learning algorithms, there are limitations specific to gene regulation that will potentially challenge their application potential.
The problem of overfitting
Deep-learning models are well known for the overfitting issue. The apparent high performance on a benchmark dataset does not always imply successful generalization to other unseen examples. This is particularly true for applications in gene regulation, due to three problems that are not trivial to overcome.

1. The limitation of data volume in biological studies may hinder machine-learning model development. Unlike fields such as computer vision and natural language processing, where it is easy to collect terabytes or even petabytes of training data from the internet or from crowd-sourcing platforms,160 biological data have to be generated from biological experiments. If a particular regulatory mechanism cannot be studied by an established experimental technique that generates a large enough number of training examples, it will be impossible to study it using machine-learning methods. Even when such experimental techniques are available, the unavailability of such data may also arise from financial constraints or privacy concerns.
2. The biological and technical variations across experimental conditions may limit the model's generalization performance. For example, in the Basenji paper, the authors observed, on average, a Pearson correlation of 0.479 between biological replicates,38 even though they are from the same consortium. It is therefore difficult to tell whether a model with a high performance score is generalizable to other examples or is simply overfitting to those random variations.
3. The use of endogenous sequences does not always imply the model's generalizability to unseen cases. Most models take in biological sequences as input for their prediction, but most of them only use endogenous sequences for training. For example, DeepSEA,30 Basset,34 and Expecto41 were trained solely on the human reference genome GRCh37. It remains elusive how well those models generalize to genetic variations that may have a different "regulatory grammar" from those in the observed endogenous sequences. Furthermore, the number of such sequences pertinent to a particular regulatory event may comprise only a tiny fraction of the organism's genome, transcriptome, or proteome. For example, the promoter sequences that regulate TSSs only consist of genomic sequences at the beginning of each gene. This could further limit the diversity of training data. As we have introduced in previous sections, only APARENT96 for polyadenylation and the Cuperus et al.104 model for 5′ UTR translation efficiency utilized measurements of synthesized exogenous sequences from MPRA data for model training and evaluation.

The limitation of sequence-only models
Most gene regulation is a concerted effect of both cis-acting sequence motifs and trans-acting binding molecules (mainly binding proteins) residing in a cellular environment. Most of the aforementioned deep-learning models take nucleotide or amino acid sequences containing only cis-acting information as input. Some methods model the trans-acting effects implicitly by making tissue-specific predictions. For example, DeepSEA, Basset, and Basenji38 perform multi-task prediction across multiple tissue types, and in Leung et al.,78 the researchers trained separate models for each tissue type. For some methods, such trans-acting effects are ignored completely (e.g., in CNNProm69 and APARENT96). For all those models with tissue-specific predictions, the trans-acting environment is assumed to be static, and the models always produce the same results for a tissue type even though gene regulation is dynamic with respect to internal and external conditions. When prediction in a new tissue type is needed, the model needs to be retrained using experimental profiles coming from that tissue, which may not always be available. The only model that explicitly models trans-acting effects is DARTS,80 which considers the expression levels of 1,498 RBPs for splicing prediction.

Not enough consideration of interactions among regulatory events
The multiple layers of gene regulation do not happen independently. A proteomic-level regulation of one protein may affect the transcriptomic-level regulation of another gene. However, most methods developed so far have only considered regulatory events independently. Even for the multi-task models that predict multiple genomic features simultaneously, the interactions between those predicted events are not explicitly taken into consideration. DeeReCT-APA97 considers the interactions among multiple PASs; however, the interaction with regulatory events of other types, e.g., splicing, is beyond its reach.

NEW DEEP-LEARNING METHODS AND PERSPECTIVES

In the following sections, we discuss several promising new paradigms in deep learning that will potentially overcome the limitations already described (Figure 3). We list related works in Table 5 as examples for each new paradigm and hope that they can shed light on new deep-learning-based gene regulation studies.

Pre-trained self-supervised models could alleviate the problem of data insufficiency
In recent years, pre-trained models have achieved great success in processing and understanding natural language. Pre-trained Transformer models such as BERT162 and GPT207–209 perform self-supervised learning on massive corpora, aiming to predict randomly masked tokens from their context (the "masked language modeling" task) or the next token given the previous tokens (the "causal language modeling" task). The pre-trained models then display strong transfer learning ability. After fine-tuning on a very small amount of data from some downstream task, the model achieves state-of-the-art performance (Figure 3A).

Pre-trained models for biological sequences have been developed in parallel. For example, Rives et al.18 developed a protein sequence model, ESM-1b, which is a 33-layer Transformer architecture with 650 million parameters. ESM-1b performs BERT-like masked language modeling and is trained on 250 million protein sequences from UniRef50,210 which contains clusters from the UniProt Archive with 50% sequence similarity. Taking the network representations from ESM-1b, downstream classifiers trained on small datasets perform quite well on protein secondary structure and protein contact map prediction. DNABERT19 is a DNA sequence model based on a 12-layer BERT-base162 Transformer architecture with 110 million parameters, pre-trained on the k-mer representation of the human genome for genomic sequence modeling. DNABERT is trained with the masked language modeling task by tokenizing the human genome into k-mers. The model showed similar or even better performance on several sequence classification tasks such as promoter recognition, TF binding site prediction, splice site prediction, and functional genetic variant classification. The model also showed cross-species transfer learning ability through the prediction of mouse TF binding sites. Instead of performing the pre-training task on one amino acid sequence only, the MSA Transformer24 extended the Transformer model to handle multiple sequence alignments (MSAs) of amino acid sequences to better utilize contextual information both within sequences and across homologous sequences. The MSA Transformer showed even better performance than ESM-1b on downstream protein secondary structure prediction and protein contact map prediction.

Through a language modeling objective, the pre-trained models can utilize a massive amount of unlabeled biological sequence data that are not specific to one species or one prediction task. In this way, they are able to discover regulatory grammars across multiple genomic regions or from multiple species. It will be of great interest to see whether such pre-trained models are systematically beneficial to downstream prediction tasks of gene regulation, especially when the size of the downstream task datasets is not enough to train deep-learning models from scratch.

Few-shot and meta-learning mechanisms produce data-efficient deep-learning models
Another trend in the deep-learning community to tackle the problem of data insufficiency is to develop deep-learning models that utilize data efficiently. In particular, "few-shot learning" is aimed at solving a prediction task with only a few training examples (Figure 3B). This challenging problem is usually tackled by "meta-learning," whereby a "meta-model" is trained that is easily generalizable across a set of similar tasks. When it is required to perform a specific task, it is able to quickly adapt itself to that task with a few provided training examples. Such methods have already been applied in the classification of biological sequences. For example, the previously introduced DeeReCT-TSS74 applied a gradient-based meta-learning algorithm, Reptile,164 for the fast adaptation of the TSS prediction model to a total of ten cell types. The authors discovered that using 20% of the data from each cell type to pre-train a meta-model and then adapting it to a specific cell type using the rest of the data benefited model performance. MIMML165 is a newly proposed meta-learning framework for bioactive peptide function prediction. MIMML is based on the Prototypical Network,168 which performs few-shot classification by measuring the distance from a query example to a few exemplars of each class. MIMML is able to perform few-shot prediction of a total of 16 peptide functions. With the above successful applications, we expect meta-learning to have a greater impact, especially on prediction tasks with many related classes but only a few training examples for each of them.
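The prototypical-network classification rule underlying MIMML can be sketched in a few lines: each class prototype is the mean embedding of that class's support examples, and a query is assigned to the nearest prototype. The following is a minimal NumPy sketch with invented 2D embeddings and hypothetical peptide-function labels; in a real system the embeddings come from a trained encoder network.

```python
import numpy as np

def prototypes(support_embeddings, support_labels):
    """Compute one mean embedding ('prototype') per class from the support set."""
    classes = sorted(set(support_labels))
    protos = np.array([
        np.mean([e for e, y in zip(support_embeddings, support_labels) if y == c], axis=0)
        for c in classes
    ])
    return classes, protos

def classify(query_embedding, classes, protos):
    """Assign the query to the class whose prototype is nearest (Euclidean distance)."""
    dists = np.linalg.norm(protos - query_embedding, axis=1)
    return classes[int(np.argmin(dists))]

# Toy 2-way, 2-shot episode with hand-made 2D "embeddings" (labels are illustrative only).
support = np.array([[0.0, 0.1], [0.1, 0.0],    # cluster for a hypothetical "antimicrobial" class
                    [1.0, 0.9], [0.9, 1.0]])   # cluster for a hypothetical "antioxidant" class
labels = ["antimicrobial", "antimicrobial", "antioxidant", "antioxidant"]
classes, protos = prototypes(support, labels)
pred = classify(np.array([0.05, 0.0]), classes, protos)  # query near the first cluster
```

Because the prototypes are just class means, adapting to a new task only requires embedding its few support examples; no gradient updates are needed at test time, which is what makes the approach data efficient.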
Self-supervised pre-trained models:
ESM-1b Transformer18 | 2020
  Datasets: 250 million protein sequences from the UniProt Archive (UniParc)161 collected by UniRef50129
  Model: Transformer (33 layers, 650M params)
  Functionalities: pre-training task: protein masked language modeling with amino acid sequence; downstream tasks: contact map prediction, secondary structure prediction

DNABERT19 | 2021
  Datasets: k-mers of the human genome
  Model: BERT-base162 (12 layers, 110M params)
  Functionalities: DNA masked language modeling; downstream fine-tuning achieves strong performance on promoter recognition, TF binding site prediction, splice site prediction, functional genetic variant classification, and cross-species transfer learning

MSA Transformer24 | 2021
  Datasets: 260 million MSAs from UniProt collected by UniClust30163
  Model: Transformer adapted to MSA (12 layers, 100M params)
  Functionalities: protein masked language modeling with MSA; downstream tasks: unsupervised and supervised contact map prediction, 8-class secondary structure prediction

Few-shot/meta-learning:
DeeReCT-TSS74 | 2021
  Datasets: FANTOM539
  Model: Reptile algorithm164 for meta-learning
  Functionalities: the Reptile meta-learning algorithm allows fast adaptation to new tissue types

MIMML165 | 2022
  Datasets: starPepDB166; BIOPEP-UWM167
  Model: Prototypical Network168 for metric-based meta-learning (few-shot learning); mutual information maximization loss
  Functionalities: few-shot bioactive peptide function prediction

Incorporation of structural information:
NucleicNet106 | 2019
  Datasets: 483 RNA-protein complexes from the PDB107, de-duplicated to 158 ribonucleoprotein structures
  Model: CNN with ResNet-like architecture
  Functionalities: structure-based RNA-protein binding prediction

MaSIF169 | 2020
  Datasets: PDB
  Model: geodesic convolutional neural networks170
  Functionalities: protein pocket classification (MaSIF-ligand); protein interface prediction (MaSIF-site); protein-protein interaction (PPI) search (MaSIF-search)

dMaSIF171 | 2021
  Datasets: PDB
  Model: quasi-geodesic convolution on a point-cloud representation of protein surfaces
  Functionalities: protein interface prediction; PPI search
Review
Review
Table 5. Continued
Method name Year Datasets Model Functionalities
Multi-omics models:
MOMA172 2016 the study curated the dataset combination of RNN-based deep using a layer-by-layer approach, the
‘‘Ecomics,’’ which has integrated data learning and LASSO regression model predicts multi-omics quantities
of the transcriptome, proteome, (transcriptomic, proteomic,
metabolome, fluxome, and phenome of metabolomic, fluxomic, and phenomic)
E. coli under different experimental
conditions (available from http://
prokaryomics.com)
DSPN173 2018 the study curated the dataset resource conditional deep Boltzmann machine174 the model predicts brain phenotypes in
‘‘PsychENCODE,’’ which includes an interpretable and generative way and
comprehensive functional genomic data is able to impute intermediate
(genotype, bulk transcriptome, ‘‘molecular phenotypes’’
chromatin, and Hi-C profiles) of the
brain of 1,866 individuals
deepManReg175 2022 the Patch-seq176 multi-omics DNN with manifold alignment178 multi-modal alignment of multi-omics
transcriptomic and electrophysiological data
data for neuron phenotype
classification177
Chaudhary et al.179 2018 230 samples from TCGA with RNA-seq autoencoder-based dimensionality the multi-omics model is able to cluster
data, microRNA-seq data, and DNA reduction,180 feature selection, and patients into different survival groups,
methylation profiles integration of multi-omics data and survival-correlated autoencoder
features have verified predictive
performance on independent datasets
Utilizing single-cell profiles:
Pseudobulk level:
Cusanovich et al., 2018181 2018 scATAC-seq of 100,000 cells from 13 Basset model trained to predict
Cell Reports Methods 3, 100384, January 23, 2023 27
OPEN ACCESS
Hou et al.184 (human and mouse
scRRBS-seq)
ll
(Continued on next page)
28 Cell Reports Methods 3, 100384, January 23, 2023
Table 5. Continued
Method name Year Datasets Model Functionalities
OPEN ACCESS
CNNC185 2019 CNN TF target gene prediction, disease-
ll
d scRNA-seq profiles related genes prediction, and causality
o mouse scRNA-seq dataset186 inference between genes
d 43,261 expression profiles
from
d over 500 different scRNA-
seq studies
o mESC data (GEO: GSE65525)
d prediction targets datasets
o the GTRD database187 for
mESC ChIP-seq peak regions
o KEGG188 and Reactome
pathway189 data
SCALE190 2019 variational autoencoder197 with clustering, batch effect removal, and
d acute myeloid leukemia dataset Gaussian mixture model imputation of scATAC-seq data
from191
d GM12878/HEK293T dataset
from192
d InSilico dataset192,193
o in silico mixture of scATAC-
seq experiments of six cell
lines
d mixture of mouse splenocyte da-
taset194
d P56 mouse forebrain dataset195
d breast tumor dataset196
o mixture of tumor epithelial
cells and tumor-infiltrating
immune cells
scFAN198 2020 3-layer CNN inferring TF binding activity of scATAC-
d ENCODE GM12878, H1-ESC, seq using TF binding model pre-trained
K562 TF binding profiles on bulk data
scGNN199 2021 GNN-based autoencoder clustering, scRNA-seq data imputation
d four scRNA-seq datasets
o the Chung data (GEO:
GSE75688)
o the Klein data (GEO:
GSE65525)
o the Zeisel data (GEO:
GSE60361)
o the AD case data (GEO:
Review
GSE138852)
(Continued on next page)
ll
Review OPEN ACCESS
quasi-manifold alignment206
2022
Year
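The pseudobulk strategy used by several single-cell entries above, averaging the single-cell measurements within each cell cluster into one bulk-like profile per cluster, can be sketched as follows. This is a minimal numpy illustration on assumed toy data, not any specific method's code:

```python
import numpy as np

def pseudobulk(counts, clusters):
    """Aggregate a cells x features single-cell matrix into one averaged
    profile per cluster; returns (cluster ids, clusters x features matrix)."""
    counts = np.asarray(counts, dtype=float)
    clusters = np.asarray(clusters)
    ids = np.unique(clusters)  # sorted unique cluster labels
    profiles = np.vstack([counts[clusters == c].mean(axis=0) for c in ids])
    return ids, profiles

# Four cells in two clusters, three features (e.g., peaks or genes).
counts = [[2, 0, 4],
          [0, 2, 4],
          [6, 6, 0],
          [2, 2, 0]]
ids, prof = pseudobulk(counts, ["a", "a", "b", "b"])
```

Each row of the resulting matrix can then be fed to a model in the same way as a bulk omics profile, at the cost of the per-cell variation that averaging discards.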
multi-omics data is essentially dealing with data coming from multiple modalities. The concept of multi-modal fusion strategies for multi-modal inputs, e.g., early fusion versus late fusion, is also applicable to multi-omics models. Early fusion, i.e., data integration at the networks' early processing stage, may favor two omics profiles that are similar in the technology of measurement, and late fusion, i.e., data integration at the networks' late processing stage, may favor two omics profiles that are similar in their subject of measurement but with very different technologies. Production of so-called joint representations215 by mapping data points from multiple omics data sources into the same semantic space may also be beneficial for unified downstream network processing and analysis. One such example is deepManReg175. Designed as a DNN with a manifold alignment objective, it performs multi-modal alignment of transcriptomic and electrophysiological data in a Patch-seq multi-omics experiment176 and has been effective in neuron phenotype classification.

Although most existing deep-learning methods independently consider each regulatory event, gene regulation itself is a holistic cellular process. Future deep-learning models for gene regulation modeling should not only integrate multi-omics data sources as input but also consider the relationship between multi-omics quantities in their output (Figure 3D). For this, the Multi-Omics Model and Analytics (MOMA)172 provided such an example in E. coli. The MOMA model predicts multi-omics quantities of E. coli given their different growth conditions. Using RNN-based deep learning and LASSO regression, MOMA adopts a layer-by-layer approach to predict transcriptomic, proteomic, metabolomic, fluxomic, and phenomic quantities one after another, specifically taking into consideration the effect on an omics quantity by the quantities from previous omics layers. Similarly, the Deep Structured Phenotype Network (DSPN)173 predicts brain phenotypes from multiple functional genomic data modalities based on a hierarchical conditional deep Boltzmann machine (DBM) architecture.174 The DBMs are also arranged in a layer-by-layer fashion by first predicting the "intermediate molecular phenotypes" and then the brain phenotypes. This makes DSPN a generative model that is more interpretable than the common discriminative models in deep learning. These could shed light on future gene regulation modeling works that aim to simulate the underlying biological processes more realistically.

Utilization of single-cell profiles
Nearly all of the aforementioned deep-learning models for gene regulation have been trained on bulk sequencing profiles. In recent years, single-cell omics profiling technologies have improved substantially. These include single-cell RNA-seq (scRNA-seq) for gene expression level profiling, single-cell ATAC-seq (scATAC-seq)192,216 for chromatin accessibility profiling, single-cell bisulfite sequencing (scBS-seq)183 and single-cell reduced representation bisulfite sequencing (scRRBS-seq)217 for methylation profiling, single-cell ChIP-seq (scChIP-seq)218 for protein-DNA binding profiling, and Smart-seq219 for full-length transcriptome profiling. Therefore, more and more data at single-cell resolution have accumulated. Single-cell profiles distinguish themselves from bulk profiles in their high dimensionality, high dropout rate (sparsity), and low sequencing quality and coverage. This introduces new challenges not only in data processing, analysis, and interpretation but also in the development of gene regulation models that utilize them.

Current deep-learning-based gene regulation models generally utilize single-cell profiles in two different ways (Figure 3E). The first operates at the pseudobulk level. The model aggregates single-cell measurements within each cell cluster into one profile. The model then utilizes the aggregated pseudobulk profiles in the same way as bulk omics profiles. Despite information loss during the aggregation process, the utilization of pseudobulk profiles still has an advantage over real bulk omics profiles because they represent measurements from pure cell types without interference from others. As an example, Cusanovich et al. performed scATAC-seq on 100,000 somatic cells from adult mice.181 The researchers developed a model based on the architecture of Basset to predict chromatin accessibility in each of the 85 identified cell types in a multi-task fashion. The model was trained on the aggregated pseudobulk profiles within each cell cluster. Very recently, Janssens et al. developed the DeepFlyBrain model, based on DeepMEL, for the prediction of chromatin co-accessible regions in the Drosophila brain.151 Similarly, the authors trained the DeepFlyBrain model on the aggregated pseudobulk profiles of three cell types, namely Kenyon cells, T neurons, and glia.

The other way is to utilize single-cell profiles at the genuine single-cell level. As single-cell profiles are well known for their sparsity, much research has been dedicated to applying deep learning for imputation and inference on single-cell profiles. For example, DeepCpG,182 trained on the scBS-seq and scRRBS-seq profiles of multiple human and mouse tissues, uses a CNN + bidirectional GRU architecture and can impute methylation status for low-coverage single-cell methylation profiles. scGNN199 is a GNN-based autoencoder model for scRNA-seq data enhancement. scGNN utilizes multiple GNNs and autoencoders that are effective in producing relationship-aware cell embeddings. The authors demonstrated that scGNN was effective in improving cell clustering and data imputation among four independent publicly available scRNA-seq datasets. SCALE190 is a variational autoencoder and Gaussian mixture model-based deep-learning model that performs imputation for low-coverage scATAC-seq profiles. Additionally, SCALE's latent embedding of each cell was shown to be effective in scATAC-seq cell clustering and batch effect removal. scBasset is a recent model for scATAC-seq profile imputation. It improves upon SCALE by guiding imputation with the underlying genomic sequence. This is achieved by processing the genomic sequences into deep representations with a 6-layer CNN and incorporating them at the imputation step. scFAN198 is able to infer single-cell TF binding activity from scATAC-seq profiles. scFAN utilizes a sequence-based TF binding model that was trained on bulk TF binding profiles. scFAN then infers the per-cell TF binding activity by asking the model to predict the TF binding affinity to the chromatin-accessible regions of each cell as reported by the scATAC-seq profile.

Deep learning has also been effective in making inferences on gene regulation networks using scRNA-seq data. For example, CNNC185 infers the causality between two genes, e.g., gene A
and gene B, in a gene regulatory network. CNNC uses a CNN to analyze the 2D expression level histogram between the two genes as if it were a 2D image and predicts whether there is an interaction between gene A and gene B, and, if so, whether gene A causally influences gene B or vice versa. scTenifoldKnk202 is a model for the in silico prediction of gene knockout (KO) effects based on scRNA-seq data and gene regulatory networks. scTenifoldKnk first constructs a gene regulatory network based on a given scRNA-seq dataset. It then performs an in silico KO experiment by modifying the edges of the target genes in the network. scTenifoldKnk then performs a quasi-manifold alignment of the network before and after KO to predict its influence on the gene expression levels of all genes in the network.

As more and more single-cell profiling techniques emerge and mature, more deep-learning applications for imputation and inference on those data modalities are expected to follow. With the accumulation of evidence provided by single-cell gene regulation profiles, future gene regulation models will certainly better capture the gene regulation heterogeneity among cells.

Conclusions
To conclude, deep learning has certainly had successful applications in gene regulation. Being a data-driven approach, deep-learning-based methods have successfully modeled regulatory processes at various omics levels with high accuracy. With further improvement in deep-learning paradigms, ongoing development in omics technologies, and accumulation of omics data, deep-learning models are expected to become more accurate and to make breakthroughs by providing biologically insightful predictions. We believe that in the foreseeable future, deep-learning-based predictive models for gene regulation will become indispensable tools that will aid biologists in solving real-world biological problems.

ACKNOWLEDGMENTS

This work was supported by the Office of Research Administration (ORA) at KAUST under award numbers FCC/1/1976-44-01, FCC/1/1976-45-01, URF/1/4098-01-01, URF/1/4352-01-01, REI/1/5202-01-01, REI/1/4940-01-01, RGC/3/4816-01-01, and REI/1/0018-01-01.

AUTHOR CONTRIBUTIONS

Z.L., E.G., and X.G. conceived the project. Z.L. and E.G. collected and reviewed relevant articles. Z.L. and E.G. drafted major parts of the manuscript. J.Z. contributed to the design of figure illustrations. W.H. and X.X. contributed to the design of tables. All authors read and approved the final manuscript.

DECLARATION OF INTERESTS

The authors declare no competing interests.

REFERENCES

1. Kharchenko, P.V., Tolstorukov, M.Y., and Park, P.J. (2008). Design and analysis of ChIP-seq experiments for DNA-binding proteins. Nat. Biotechnol. 26, 1351–1359.

2. Ule, J., Jensen, K., Mele, A., and Darnell, R.B. (2005). CLIP: a method for identifying protein–RNA interaction sites in living cells. Methods 37, 376–386. https://doi.org/10.1016/j.ymeth.2005.07.018.

3. Licatalosi, D.D., Mele, A., Fak, J.J., Ule, J., Kayikci, M., Chi, S.W., Clark, T.A., Schweitzer, A.C., Blume, J.E., Wang, X., et al. (2008). HITS-CLIP yields genome-wide insights into brain alternative RNA processing. Nature 456, 464–469. https://doi.org/10.1038/nature07488.

4. Song, L., and Crawford, G.E. (2010). DNase-seq: a high-resolution technique for mapping active gene regulatory elements across the genome from mammalian cells. Cold Spring Harb. Protoc. 2010, pdb.prot5384.

5. Buenrostro, J.D., Giresi, P.G., Zaba, L.C., Chang, H.Y., and Greenleaf, W.J. (2013). Transposition of native chromatin for fast and sensitive epigenomic profiling of open chromatin, DNA-binding proteins and nucleosome position. Nat. Methods 10, 1213–1218. https://doi.org/10.1038/nmeth.2688.

6. Hoque, M., Ji, Z., Zheng, D., Luo, W., Li, W., You, B., Park, J.Y., Yehia, G., and Tian, B. (2013). Analysis of alternative cleavage and polyadenylation by 3' region extraction and deep sequencing. Nat. Methods 10, 133–139. https://doi.org/10.1038/nmeth.2288.

7. Siva, N. (2008). 1000 Genomes project. Nat. Biotechnol. 26, 256–257.

8. ENCODE Project Consortium (2012). An integrated encyclopedia of DNA elements in the human genome. Nature 489, 57–74.

9. Roadmap Epigenomics Consortium; Kundaje, A., Meuleman, W., Ernst, J., Bilenky, M., Yen, A., Heravi-Moussavi, A., Kheradpour, P., Zhang, Z., Wang, J., Ziller, M.J., et al. (2015). Integrative analysis of 111 reference human epigenomes. Nature 518, 317–330. https://doi.org/10.1038/nature14248.

10. Lonsdale, J., Thomas, J., Salvatore, M., Phillips, R., Lo, E., Shad, S., Hasz, R., Walters, G., Garcia, F., Young, N., et al. (2013). The genotype-tissue expression (GTEx) project. Nat. Genet. 45, 580–585.

11. Leinonen, R., Sugawara, H., and Shumway, M.; International Nucleotide Sequence Database Collaboration (2011). The sequence read archive. Nucleic Acids Res. 39, D19–D21.

12. Leinonen, R., Akhtar, R., Birney, E., Bower, L., Cerdeno-Tárraga, A., Cheng, Y., Cleland, I., Faruque, N., Goodgame, N., Gibson, R., et al. (2011). The European nucleotide archive. Nucleic Acids Res. 39, D28–D31.

13. The UniProt Consortium (2017). UniProt: the universal protein knowledgebase. Nucleic Acids Res. 45, D158–D169.

14. Eddy, S.R. (1996). Hidden Markov models. Curr. Opin. Struct. Biol. 6, 361–365. https://doi.org/10.1016/S0959-440X(96)80056-X.

15. Zeng, I.S.L., and Lumley, T. (2018). Review of statistical learning methods in integrated omics studies (an integrated information science). Bioinform. Biol. Insights 12, 1177932218759292.

16. LeCun, Y., Bengio, Y., and Hinton, G. (2015). Deep learning. Nature 521, 436–444. https://doi.org/10.1038/nature14539.

17. Alipanahi, B., Delong, A., Weirauch, M.T., and Frey, B.J. (2015). Predicting the sequence specificities of DNA- and RNA-binding proteins by deep learning. Nat. Biotechnol. 33, 831–838. https://doi.org/10.1038/nbt.3300.

18. Rives, A., Meier, J., Sercu, T., Goyal, S., Lin, Z., Liu, J., Guo, D., Ott, M., Zitnick, C.L., Ma, J., and Fergus, R. (2021). Biological structure and function emerge from scaling unsupervised learning to 250 million protein sequences. Proc. Natl. Acad. Sci. USA 118, e2016239118. https://doi.org/10.1073/pnas.2016239118.

19. Ji, Y., Zhou, Z., Liu, H., and Davuluri, R.V. (2021). DNABERT: pre-trained bidirectional encoder representations from Transformers model for DNA-language in genome. Bioinformatics 37, 2112–2120. https://doi.org/10.1093/bioinformatics/btab083.

20. LeCun, Y., Bottou, L., Bengio, Y., and Haffner, P. (1998). Gradient-based learning applied to document recognition. Proc. IEEE 86, 2278–2324. https://doi.org/10.1109/5.726791.
26. Li, Y., Huang, C., Ding, L., Li, Z., Pan, Y., and Gao, X. (2019). Deep 43. Minnoye, L., Taskiran, I.I., Mauduit, D., Fazio, M., Van Aerschot, L., Hul-
learning in bioinformatics: introduction, application, and perspective in selmans, G., Christiaens, V., Makhzami, S., Seltenhammer, M., Karras,
the big data era. Methods 166, 4–21. P., et al. (2020). Cross-species analysis of enhancer logic using deep
learning. Genome Res. 30, 1815–1834.
27. Goodfellow, I., Bengio, Y., and Courville, A. (2016). Deep Learning (MIT
press). 44. Wouters, J., Kalender-Atak, Z., Minnoye, L., Spanier, K.I., De Waege-
neer, M., Bravo González-Blas, C., Mauduit, D., Davie, K., Hulselmans,
28. Weirauch, M.T., Cote, A., Norel, R., Annala, M., Zhao, Y., Riley, T.R.,
G., Najem, A., et al. (2020). Robust gene expression programs underlie
Saez-Rodriguez, J., Cokelaer, T., Vedenko, A., Talukder, S., et al.
recurrent cell states and phenotype switching in melanoma. Nat. Cell
(2013). Evaluation of methods for modeling transcription factor sequence
Biol. 22, 986–998. https://doi.org/10.1038/s41556-020-0547-3.
specificity. Nat. Biotechnol. 31, 126–134.
45. Bravo González-Blas, C., Minnoye, L., Papasokrati, D., Aibar, S., Hulsel-
29. Jolma, A., Yan, J., Whitington, T., Toivonen, J., Nitta, K.R., Rastas, P.,
mans, G., Christiaens, V., Davie, K., Wouters, J., and Aerts, S. (2019).
Morgunova, E., Enge, M., Taipale, M., Wei, G., et al. (2013). DNA-binding
cisTopic: cis-regulatory topic modeling on single-cell ATAC-seq data.
specificities of human transcription factors. Cell 152, 327–339. https://
Nat. Methods 16, 397–400. https://doi.org/10.1038/s41592-019-0367-1.
doi.org/10.1016/j.cell.2012.12.009.
Agarwal, V., Visentin, D., Ledsam, J.R., Grabska-Barwinska,
46. Avsec, Z.,
30. Zhou, J., and Troyanskaya, O.G. (2015). Predicting effects of noncoding A., Taylor, K.R., Assael, Y., Jumper, J., Kohli, P., and Kelley, D.R.
variants with deep learning–based sequence model. Nat. Methods 12, (2021). Effective gene expression prediction from sequence by inte-
931–934. https://doi.org/10.1038/nmeth.3547. grating long-range interactions. Nat. Methods 18, 1196–1203. https://
31. Bernstein, B.E., Stamatoyannopoulos, J.A., Costello, J.F., Ren, B., Milo- doi.org/10.1038/s41592-021-01252-x.
savljevic, A., Meissner, A., Kellis, M., Marra, M.A., Beaudet, A.L., Ecker, 47. Kelley, D.R. (2020). Cross-species regulatory sequence activity predic-
J.R., et al. (2010). The NIH roadmap epigenomics mapping consortium. tion. PLoS Comput. Biol. 16, e1008050. https://doi.org/10.1371/jour-
Nat. Biotechnol. 28, 1045–1048. nal.pcbi.1008050.
32. Stenson, P.D., Ball, E.V., Mort, M., Phillips, A.D., Shiel, J.A., Thomas, Weilert, M., Shrikumar, A., Krueger, S., Alexandari, A., Dalal,
48. Avsec, Z.,
N.S.T., Abeysinghe, S., Krawczak, M., and Cooper, D.N. (2003). Human K., Fropf, R., McAnany, C., Gagneur, J., Kundaje, A., and Zeitlinger, J.
gene mutation database (HGMD): 2003 update. Hum. Mutat. 21, (2021). Base-resolution models of transcription-factor binding reveal
577–581. soft motif syntax. Nat. Genet. 53, 354–366. https://doi.org/10.1038/
33. Leslie, R., O’Donnell, C.J., and Johnson, A.D. (2014). GRASP: analysis of s41588-021-00782-6.
genotype–phenotype results from 1390 genome-wide association 49. He, Q., Johnston, J., and Zeitlinger, J. (2015). ChIP-nexus enables
studies and corresponding open access database. Bioinformatics 30, improved detection of in vivo transcription factor binding footprints.
i185–i194. https://doi.org/10.1093/bioinformatics/btu273. Nat. Biotechnol. 33, 395–401. https://doi.org/10.1038/nbt.3121.
34. Kelley, D.R., Snoek, J., and Rinn, J.L. (2016). Basset: learning the regu- 50. Fudenberg, G., Kelley, D.R., and Pollard, K.S. (2020). Predicting 3D
latory code of the accessible genome with deep convolutional neural net- genome folding from DNA sequence with Akita. Nat. Methods 17,
works. Genome Res. 26, 990–999. 1111–1117. https://doi.org/10.1038/s41592-020-0958-x.
35. Quang, D., and Xie, X. (2016). DanQ: a hybrid convolutional and recurrent 51. Krietenstein, N., Abraham, S., Venev, S.V., Abdennur, N., Gibcus, J.,
deep neural network for quantifying the function of DNA sequences. Nu- Hsieh, T.-H.S., Parsi, K.M., Yang, L., Maehr, R., Mirny, L.A., et al.
cleic Acids Res. 44, e107. https://doi.org/10.1093/nar/gkw226. (2020). Ultrastructural details of mammalian chromosome architecture.
36. Zeng, H., and Gifford, D.K. (2017). Predicting the impact of non-coding Mol. Cell 78, 554–565.e7.
variants on DNA methylation. Nucleic Acids Res. 45, e99. https://doi. 52. Rao, S.S.P., Huntley, M.H., Durand, N.C., Stamenova, E.K., Bochkov,
org/10.1093/nar/gkx177. I.D., Robinson, J.T., Sanborn, A.L., Machol, I., Omer, A.D., and Lander,
37. Wang, M., Tai, C., E, W., and Wei, L. (2018). DeFine: deep convolutional E.S. (2014). A 3D map of the human genome at kilobase resolution re-
neural networks accurately quantify intensities of transcription factor- veals principles of chromatin looping. Cell 159, 1665–1680. https://doi.
DNA binding and facilitate evaluation of functional non-coding variants. org/10.1016/j.cell.2014.11.021.
Nucleic Acids Res. 46, e69. https://doi.org/10.1093/nar/gky215. 53. Rao, S.S.P., Huang, S.-C., Glenn St Hilaire, B., Engreitz, J.M., Perez,
38. Kelley, D.R., Reshef, Y.A., Bileschi, M., Belanger, D., McLean, C.Y., and E.M., Kieffer-Kwon, K.-R., Sanborn, A.L., Johnstone, S.E., Bascom,
Snoek, J. (2018). Sequential regulatory activity prediction across chro- G.D., Bochkov, I.D., et al. (2017). Cohesin loss eliminates all loop do-
mosomes with convolutional neural networks. Genome Res. 28, mains. Cell 171, 305–320.e24. https://doi.org/10.1016/j.cell.2017.
739–750. 09.026.
39. Noguchi, S., Arakawa, T., Fukuda, S., Furuno, M., Hasegawa, A., Hori, F., 54. Hsieh, T.-H.S., Cattoglio, C., Slobodyanyuk, E., Hansen, A.S., Rando,
Ishikawa-Kato, S., Kaida, K., Kaiho, A., Kanamori-Katayama, M., et al. O.J., Tjian, R., and Darzacq, X. (2020). Resolving the 3D landscape of
transcription-linked mammalian chromatin folding. Mol. Cell 78, 539– Martı́nez-Flores, I., Pannier, L., Castro-Mondragón, J.A., et al. (2016).
553.e8. https://doi.org/10.1016/j.molcel.2020.03.002. RegulonDB version 9.0: high-level integration of gene regulation, coex-
55. Bonev, B., Mendelson Cohen, N., Szabo, Q., Fritsch, L., Papadopoulos, pression, motif clustering and beyond. Nucleic Acids Res. 44,
G.L., Lubling, Y., Xu, X., Lv, X., Hugnot, J.-P., Tanay, A., and Cavalli, G. D133–D143.
(2017). Multiscale 3D genome rewiring during mouse neural develop- 72. Ishii, T., Yoshida, K., Terai, G., Fujita, Y., and Nakai, K. (2001). DBTBS: a
ment. Cell 171, 557–572.e24. https://doi.org/10.1016/j.cell.2017.09.043. database of Bacillus subtilis promoters and transcription factors. Nucleic
56. Zhou, J. (2022). Sequence-based modeling of three-dimensional Acids Res. 29, 278–280.
genome architecture from kilobase to chromosome scale. Nat. Genet. 73. Umarov, R., Kuwahara, H., Li, Y., Gao, X., and Solovyev, V. (2019). Pro-
54, 725–734. https://doi.org/10.1038/s41588-022-01065-4. moter analysis and prediction in the human genome using sequence-
57. Karbalayghareh, A., Sahin, M., and Leslie, C.S. (2022). Chromatin based deep learning models. Bioinformatics 35, 2730–2737.
interaction–aware gene regulatory modeling with graph attention net- 74. Zhou, J., zhang, b., Li, H., Zhou, L., Li, Z., Long, Y., Han, W., Wang, M.,
works. Genome Res. 32, 930–944. Cui, H., Chen, W., and Gao, X. (2021). DeeReCT-TSS: a novel meta-
58. Reiff, S.B., Schroeder, A.J., Kırlı, K., Cosolo, A., Bakker, C., Mercado, L., learning-based method annotates TSS in multiple cell types based on
Lee, S., Veit, A.D., Balashov, A.K., Vitzthum, C., et al. (2022). The 4D Nu- DNA sequences and RNA-seq data. Preprint at bioRxiv. https://doi.
cleome Data Portal as a resource for searching and visualizing curated org/10.1101/2021.07.14.452328.
nucleomics data. Nat. Commun. 13, 2365. https://doi.org/10.1038/
75. Barash, Y., Calarco, J.A., Gao, W., Pan, Q., Wang, X., Shai, O., Blen-
s41467-022-29697-4.
cowe, B.J., and Frey, B.J. (2010). Deciphering the splicing code. Nature
59. Velickovic, P., Cucurull, G., Casanova, A., Romero, A., Lio, P., and Ben- 465, 53–59. http://www.nature.com/nature/journal/v465/n7294/suppinfo/
gio, Y. (2017). Graph attention networks. Preprint at arXiv. https://doi.org/ nature09000_S1.html.
10.48550/arXiv.1710.10903.
76. Fagnani, M., Barash, Y., Ip, J.Y., Misquitta, C., Pan, Q., Saltzman, A.L.,
60. Mukherjee, S., Berger, M.F., Jona, G., Wang, X.S., Muzzey, D., Snyder, Shai, O., Lee, L., Rozenhek, A., Mohammad, N., et al. (2007). Functional
M., Young, R.A., and Bulyk, M.L. (2004). Rapid analysis of the DNA-bind- coordination of alternative splicing in the mammalian central nervous
ing specificities of transcription factors with DNA microarrays. Nat. system. Genome Biol. 8, R108. https://doi.org/10.1186/gb-2007-8-
Genet. 36, 1331–1339. 6-r108.
61. Jolma, A., Kivioja, T., Toivonen, J., Cheng, L., Wei, G., Enge, M., Taipale, 77. Leung, M.K.K., Xiong, H.Y., Lee, L.J., and Frey, B.J. (2014). Deep learning
M., Vaquerizas, J.M., Yan, J., Sillanpää, M.J., et al. (2010). Multiplexed of the tissue-regulated splicing code. Bioinformatics 30, i121–i129.
massively parallel SELEX for characterization of human transcription fac- https://doi.org/10.1093/bioinformatics/btu277.
tor binding specificities. Genome Res. 20, 861–873. https://doi.org/10.
1101/gr.100552.109. 78. Brawand, D., Soumillon, M., Necsulea, A., Julien, P., Csárdi, G., Harrigan,
P., Weier, M., Liechti, A., Aximu-Petri, A., Kircher, M., et al. (2011). The
62. Yu, F., and Koltun, V. (2016). Multi-scale context aggregation by dilated
evolution of gene expression levels in mammalian organs. Nature 478,
convolutions. Preprint at arXiv. https://doi.org/10.48550/arXiv.1511.
343–348.
07122.
79. Xiong, H.Y., Alipanahi, B., Lee, L.J., Bretschneider, H., Merico, D., Yuen,
63. Cao, R., Wang, L., Wang, H., Xia, L., Erdjument-Bromage, H., Tempst, P.,
R.K.C., Hua, Y., Gueroussov, S., Najafabadi, H.S., Hughes, T.R., et al.
Jones, R.S., and Zhang, Y. (2002). Role of histone H3 lysine 27 methyl-
ation in Polycomb-group silencing. Science 298, 1039–1043. https://doi.org/10.1126/science.1076997.
64. Corces, M.R., Trevino, A.E., Hamilton, E.G., Greenside, P.G., Sinnott-Armstrong, N.A., Vesuna, S., Satpathy, A.T., Rubin, A.J., Montine, K.S., Wu, B., et al. (2017). An improved ATAC-seq protocol reduces background and enables interrogation of frozen tissues. Nat. Methods 14, 959–962. https://doi.org/10.1038/nmeth.4396.
65. Lieberman-Aiden, E., van Berkum, N.L., Williams, L., Imakaev, M., Ragoczy, T., Telling, A., Amit, I., Lajoie, B.R., Sabo, P.J., Dorschner, M.O., et al. (2009). Comprehensive mapping of long-range interactions reveals folding principles of the human genome. Science 326, 289–293. https://doi.org/10.1126/science.1181369.
66. Davuluri, R.V., Suzuki, Y., Sugano, S., Plass, C., and Huang, T.H.M. (2008). The functional consequences of alternative promoter use in mammalian genomes. Trends Genet. 24, 167–177. https://doi.org/10.1016/j.tig.2008.01.008.
67. Witten, J.T., and Ule, J. (2011). Understanding splicing regulation through RNA splicing maps. Trends Genet. 27, 89–97.
68. Elkon, R., Ugalde, A.P., and Agami, R. (2013). Alternative cleavage and polyadenylation: extent, regulation and function. Nat. Rev. Genet. 14, 496–506. https://doi.org/10.1038/nrg3482.
69. Umarov, R.K., and Solovyev, V.V. (2017). Recognition of prokaryotic and eukaryotic promoters using convolutional deep learning neural networks. PLoS One 12, e0171410.
70. Dreos, R., Ambrosini, G., Cavin Périer, R., and Bucher, P. (2013). EPD and EPDnew, high-quality promoter resources in the next-generation sequencing era. Nucleic Acids Res. 41, D157–D164.
71. Gama-Castro, S., Salgado, H., Santos-Zavaleta, A., Ledezma-Tejeida, D., Muñiz-Rascado, L., García-Sotelo, J.S., Alquicira-Hernández, K.,
(2015). RNA splicing. The human splicing code reveals new insights into the genetic determinants of disease. Science 347, 1254806. https://doi.org/10.1126/science.1254806.
80. Zhang, Z., Pan, Z., Ying, Y., Xie, Z., Adhikari, S., Phillips, J., Carstens, R.P., Black, D.L., Wu, Y., and Xing, Y. (2019). Deep-learning augmented RNA-seq analysis of transcript splicing. Nat. Methods 16, 307–310. https://doi.org/10.1038/s41592-019-0351-9.
81. Jaganathan, K., Kyriazopoulou Panagiotopoulou, S., McRae, J.F., Darbandi, S.F., Knowles, D., Li, Y.I., Kosmicki, J.A., Arbelaez, J., Cui, W., Schwartz, G.B., et al. (2019). Predicting splicing from primary sequence with deep learning. Cell 176, 535–548.e24. https://doi.org/10.1016/j.cell.2018.12.015.
82. Harrow, J., Frankish, A., Gonzalez, J.M., Tapanari, E., Diekhans, M., Kokocinski, F., Aken, B.L., Barrell, D., Zadissa, A., Searle, S., et al. (2012). GENCODE: the reference human genome annotation for the ENCODE Project. Genome Res. 22, 1760–1774.
83. Zeng, T., and Li, Y.I. (2022). Predicting RNA splicing from DNA sequence using Pangolin. Genome Biol. 23, 103. https://doi.org/10.1186/s13059-022-02664-4.
84. Hubbard, T., Barker, D., Birney, E., Cameron, G., Chen, Y., Clark, L., Cox, T., Cuff, J., Curwen, V., Down, T., et al. (2002). The Ensembl genome database project. Nucleic Acids Res. 30, 38–41.
85. Cardoso-Moreira, M., Halbert, J., Valloton, D., Velten, B., Chen, C., Shao, Y., Liechti, A., Ascenção, K., Rummel, C., Ovchinnikova, S., et al. (2019). Gene expression across mammalian organ development. Nature 571, 505–509.
86. Leung, M.K.K., Delong, A., and Frey, B.J. (2018). Inference of the human polyadenylation code. Bioinformatics 34, 2889–2898. https://doi.org/10.1093/bioinformatics/bty211.
122. Wang, D., Zeng, S., Xu, C., Qiu, W., Liang, Y., Joshi, T., and Xu, D. (2017). MusiteDeep: a deep-learning framework for general and kinase-specific phosphorylation site prediction. Bioinformatics 33, 3909–3916.
123. Wang, D., Liang, Y., and Xu, D. (2019). Capsule network for protein post-translational modification site prediction. Bioinformatics 35, 2386–2394.
124. Wang, D., Liu, D., Yuchi, J., He, F., Jiang, Y., Cai, S., Li, J., and Xu, D. (2020). MusiteDeep: a deep-learning based webserver for protein post-translational modification site prediction and visualization. Nucleic Acids Res. 48, W140–W146.
125. Sabour, S., Frosst, N., and Hinton, G.E. (2017). Dynamic routing between capsules. Preprint at arXiv. https://doi.org/10.48550/arXiv.1710.09829.
126. Almagro Armenteros, J.J., Sønderby, C.K., Sønderby, S.K., Nielsen, H., and Winther, O. (2017). DeepLoc: prediction of protein subcellular localization using deep learning. Bioinformatics 33, 3387–3395.
127. Höglund, A., Dönnes, P., Blum, T., Adolph, H.-W., and Kohlbacher, O. (2006). MultiLoc: prediction of protein subcellular localization using N-terminal targeting sequences, sequence motifs and amino acid composition. Bioinformatics 22, 1158–1165.
128. Zhou, J., Park, C.Y., Theesfeld, C.L., Wong, A.K., Yuan, Y., Scheckel, C., Fak, J.J., Funk, J., Yao, K., Tajima, Y., et al. (2019). Whole-genome deep-learning analysis identifies contribution of noncoding mutations to autism risk. Nat. Genet. 51, 973–980. https://doi.org/10.1038/s41588-019-0420-0.
129. Arloth, J., Eraslan, G., Andlauer, T.F.M., Martins, J., Iurato, S., Kühnel, B., Waldenberger, M., Frank, J., Gold, R., Hemmer, B., et al. (2020). DeepWAS: multivariate genotype-phenotype associations by directly integrating regulatory information using deep learning. PLoS Comput. Biol. 16, e1007616. https://doi.org/10.1371/journal.pcbi.1007616.
130. Andlauer, T.F.M., Buck, D., Antony, G., Bayas, A., Bechmann, L., Berthele, A., Chan, A., Gasperi, C., Gold, R., Graetz, C., et al. (2016). Novel multiple sclerosis susceptibility loci implicated in epigenetic regulation. Sci. Adv. 2, e1501678.
131. Muglia, P., Tozzi, F., Galwey, N.W., Francks, C., Upmanyu, R., Kong, X.Q., Antoniades, A., Domenici, E., Perry, J., Rothen, S., et al. (2010). Genome-wide association study of recurrent major depressive disorder in two European case–control cohorts. Mol. Psychiatry 15, 589–601.
132. Wichmann, H.-E., Gieger, C., and Illig, T.; MONICA/KORA Study Group (2005). KORA-gen-resource for population genetics, controls and a broad spectrum of disease phenotypes. Gesundheitswesen 67, 26–30.
133. Zingaretti, L.M., Gezan, S.A., Ferrão, L.F.V., Osorio, L.F., Monfort, A., Muñoz, P.R., Whitaker, V.M., and Pérez-Enciso, M. (2020). Exploring deep learning for complex trait genomic prediction in polyploid outcrossing species. Front. Plant Sci. 11, 25. https://doi.org/10.3389/fpls.2020.00025.
134. Gezan, S.A., Osorio, L.F., Verma, S., and Whitaker, V.M. (2017). An experimental validation of genomic selection in octoploid strawberry. Hortic. Res. 4, 16070.
135. de Bem Oliveira, I., Resende, M.F.R., Jr., Ferrão, L.F.V., Amadeu, R.R., Endelman, J.B., Kirst, M., Coelho, A.S.G., and Munoz, P.R. (2019). Genomic prediction of autotetraploids; influence of relationship matrices, allele dosage, and continuous genotyping calls in phenotype prediction. G3 9, 1189–1198.
136. Benevenuto, J., Ferrão, L.F.V., Amadeu, R.R., and Munoz, P. (2019). How can a high-quality genome assembly help plant breeders? Gigascience 8, giz068. https://doi.org/10.1093/gigascience/giz068.
137. Shook, J., Gangopadhyay, T., Wu, L., Ganapathysubramanian, B., Sarkar, S., and Singh, A.K. (2021). Crop yield prediction integrating genotype and weather variables using deep learning. PLoS One 16, e0252402. https://doi.org/10.1371/journal.pone.0252402.
138. Abney, T.S., and Crochet, W.D. (2005). The Uniform Soybean Tests: Northern Region 2005.
139. Bahdanau, D., Cho, K., and Bengio, Y. (2014). Neural machine translation by jointly learning to align and translate. Preprint at arXiv. https://doi.org/10.48550/arXiv.1409.0473.
140. Kingma, D.P., and Ba, J. (2014). Adam: a method for stochastic optimization. Preprint at arXiv. https://doi.org/10.48550/arXiv.1412.6980.
141. Duchi, J., Hazan, E., and Singer, Y. (2011). Adaptive subgradient methods for online learning and stochastic optimization. J. Mach. Learn. Res. 12.
142. Graves, A. (2013). Generating sequences with recurrent neural networks. Preprint at arXiv. https://doi.org/10.48550/arXiv.1308.0850.
143. Dozat, T. (2016). Incorporating Nesterov momentum into Adam.
144. Loshchilov, I., and Hutter, F. (2017). Decoupled weight decay regularization. Preprint at arXiv. https://doi.org/10.48550/arXiv.1711.05101.
145. Liu, L., Jiang, H., He, P., Chen, W., Liu, X., Gao, J., and Han, J. (2019). On the variance of the adaptive learning rate and beyond. Preprint at arXiv. https://doi.org/10.48550/arXiv.1908.03265.
146. Liaw, R., Liang, E., Nishihara, R., Moritz, P., Gonzalez, J.E., and Stoica, I. (2018). Tune: a research platform for distributed model selection and training. Preprint at arXiv. https://doi.org/10.48550/arXiv.1807.05118.
147. Abadi, M. (2016). TensorFlow: learning functions at scale.
148. Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., Killeen, T., Lin, Z., Gimelshein, N., and Antiga, L. (2019). PyTorch: an imperative style, high-performance deep learning library. Preprint at arXiv. https://doi.org/10.48550/arXiv.1912.01703.
149. O'Malley, T., Bursztein, E., Long, J., Chollet, F., Jin, H., and Invernizzi, L. (2019). KerasTuner. https://github.com/keras-team/keras-tuner.
150. Chollet, F., et al. (2015). Keras. https://keras.io.
151. Janssens, J., Aibar, S., Taskiran, I.I., Ismail, J.N., Gomez, A.E., Aughey, G., Spanier, K.I., De Rop, F.V., González-Blas, C.B., Dionne, M., et al. (2022). Decoding gene regulation in the fly brain. Nature 601, 630–636. https://doi.org/10.1038/s41586-021-04262-z.
152. Sundararajan, M., Taly, A., and Yan, Q. (2017). Axiomatic attribution for deep networks (PMLR), pp. 3319–3328.
153. Lundberg, S.M., and Lee, S.-I. (2017). A unified approach to interpreting model predictions. Preprint at arXiv. https://doi.org/10.48550/arXiv.1705.07874.
154. Shrikumar, A., Greenside, P., and Kundaje, A. (2017). Learning important features through propagating activation differences (PMLR), pp. 3145–3153.
155. Kokhlikyan, N., Miglani, V., Martin, M., Wang, E., Alsallakh, B., Reynolds, J., Melnikov, A., Kliushkina, N., Araya, C., and Yan, S. (2020). Captum: a unified and generic model interpretability library for PyTorch. Preprint at arXiv. https://doi.org/10.48550/arXiv.2009.07896.
156. Nesterov, Y.E. (1983). A method for solving the convex programming problem with convergence rate O(1/k²), pp. 543–547.
157. Bergstra, J., and Bengio, Y. (2012). Random search for hyper-parameter optimization. J. Mach. Learn. Res. 13, 281–305.
158. Shahriari, B., Swersky, K., Wang, Z., Adams, R.P., and De Freitas, N. (2016). Taking the human out of the loop: a review of Bayesian optimization. Proc. IEEE 104, 148–175.
159. Simonyan, K., Vedaldi, A., and Zisserman, A. (2013). Deep inside convolutional networks: visualising image classification models and saliency maps. Preprint at arXiv. https://doi.org/10.48550/arXiv.1312.6034.
160. Paolacci, G., Chandler, J., and Ipeirotis, P.G. (2010). Running experiments on Amazon Mechanical Turk. Judgment and Decision Making 5, 411–419.
161. UniProt Consortium (2021). UniProt: the universal protein knowledgebase in 2021. Nucleic Acids Res. 49, D480–D489.
162. Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. (2019). BERT: pre-training of deep bidirectional transformers for language understanding (Minneapolis, Minnesota: Association for Computational Linguistics), pp. 4171–4186.
166. Aguilera-Mendoza, L., Marrero-Ponce, Y., Beltran, J.A., Tellez Ibarra, R., Guillen-Ramirez, H.A., and Brizuela, C.A. (2019). Graph-based data integration from bioactive peptide databases of pharmaceutical interest: toward an organized collection enabling visual network analysis. Bioinformatics 35, 4739–4747.
167. Minkiewicz, P., Iwaniak, A., and Darewicz, M. (2019). BIOPEP-UWM database of bioactive peptides: current opportunities. Int. J. Mol. Sci. 20, 5978.
168. Snell, J., Swersky, K., and Zemel, R.S. (2017). Prototypical networks for few-shot learning. Preprint at arXiv. https://doi.org/10.48550/arXiv.1703.05175.
169. Gainza, P., Sverrisson, F., Monti, F., Rodolà, E., Boscaini, D., Bronstein, M.M., and Correia, B.E. (2020). Deciphering interaction fingerprints from protein molecular surfaces using geometric deep learning. Nat. Methods 17, 184–192. https://doi.org/10.1038/s41592-019-0666-6.
170. Masci, J., Boscaini, D., Bronstein, M., and Vandergheynst, P. (2015). Geodesic convolutional neural networks on Riemannian manifolds. Preprint at arXiv. https://doi.org/10.48550/arXiv.1501.06297.
171. Sverrisson, F., Feydy, J., Correia, B.E., and Bronstein, M.M. (2021). Fast end-to-end learning on protein surfaces. Preprint at bioRxiv. https://doi.org/10.1101/2020.12.28.424589.
172. Kim, M., Rai, N., Zorraquino, V., and Tagkopoulos, I. (2016). Multi-omics integration accurately predicts cellular state in unexplored conditions for Escherichia coli. Nat. Commun. 7, 13090. https://doi.org/10.1038/ncomms13090.
173. Wang, D., Liu, S., Warrell, J., Won, H., Shi, X., Navarro, F.C.P., Clarke, D., Gu, M., Emani, P., Yang, Y.T., et al. (2018). Comprehensive functional genomic resource and integrative model for the human brain. Science 362, eaat8464. https://doi.org/10.1126/science.aat8464.
174. Salakhutdinov, R., and Hinton, G. (2009). Deep Boltzmann machines. In Proceedings of the Twelfth International Conference on Artificial Intelligence and Statistics, D. van Dyk and M. Welling, eds. (PMLR).
175. Nguyen, N.D., Huang, J., and Wang, D. (2022). A deep manifold-regularized learning model for improving phenotype prediction from multi-modal data. Nat. Comput. Sci. 2, 38–46. https://doi.org/10.1038/s43588-021-00185-x.
176. Cadwell, C.R., Scala, F., Li, S., Livrizzi, G., Shen, S., Sandberg, R., Jiang, X., and Tolias, A.S. (2017). Multimodal profiling of single-cell morphology, electrophysiology, and gene expression using Patch-seq. Nat. Protoc. 12, 2531–2553.
177. Gouwens, N.W., Sorensen, S.A., Baftizadeh, F., Budzillo, A., Lee, B.R., Jarsky, T., Alfiler, L., Baker, K., Barkan, E., Berry, K., et al. (2020). Integrated morphoelectric and transcriptomic classification of cortical GABAergic cells. Cell 183, 935–953.e19.
178. Nguyen, N.D., Blaby, I.K., and Wang, D. (2019). ManiNetCluster: a novel manifold learning approach to reveal the functional links between gene networks. BMC Genom. 20, 1003.
179. Chaudhary, K., Poirion, O.B., Lu, L., and Garmire, L.X. (2018). Deep learning–based multi-omics integration robustly predicts survival in liver cancer. Clin. Cancer Res. 24, 1248–1259.
183. Smallwood, S.A., Lee, H.J., Angermueller, C., Krueger, F., Saadeh, H., Peat, J., Andrews, S.R., Stegle, O., Reik, W., and Kelsey, G. (2014). Single-cell genome-wide bisulfite sequencing for assessing epigenetic heterogeneity. Nat. Methods 11, 817–820.
184. Hou, Y., Guo, H., Cao, C., Li, X., Hu, B., Zhu, P., Wu, X., Wen, L., Tang, F., Huang, Y., and Peng, J. (2016). Single-cell triple omics sequencing reveals genetic, epigenetic, and transcriptomic heterogeneity in hepatocellular carcinomas. Cell Res. 26, 304–319.
185. Yuan, Y., and Bar-Joseph, Z. (2019). Deep learning for inferring gene relationships from single-cell expression data. Proc. Natl. Acad. Sci. USA 116, 27151–27158. https://doi.org/10.1073/pnas.1911536116.
186. Alavi, A., Ruffalo, M., Parvangada, A., Huang, Z., and Bar-Joseph, Z. (2018). A web server for comparative analysis of single-cell RNA-seq data. Nat. Commun. 9, 4768.
187. Yevshin, I., Sharipov, R., Valeev, T., Kel, A., and Kolpakov, F. (2016). GTRD: a database of transcription factor binding sites identified by ChIP-seq experiments. Nucleic Acids Res., gkw951.
188. Kanehisa, M., Furumichi, M., Tanabe, M., Sato, Y., and Morishima, K. (2017). KEGG: new perspectives on genomes, pathways, diseases and drugs. Nucleic Acids Res. 45, D353–D361.
189. Fabregat, A., Jupe, S., Matthews, L., Sidiropoulos, K., Gillespie, M., Garapati, P., Haw, R., Jassal, B., Korninger, F., May, B., et al. (2018). The Reactome pathway knowledgebase. Nucleic Acids Res. 46, D649–D655.
190. Xiong, L., Xu, K., Tian, K., Shao, Y., Tang, L., Gao, G., Zhang, M., Jiang, T., and Zhang, Q.C. (2019). SCALE method for single-cell ATAC-seq analysis via latent feature extraction. Nat. Commun. 10, 4576. https://doi.org/10.1038/s41467-019-12630-7.
191. Corces, M.R., Buenrostro, J.D., Wu, B., Greenside, P.G., Chan, S.M., Koenig, J.L., Snyder, M.P., Pritchard, J.K., Kundaje, A., Greenleaf, W.J., et al. (2016). Lineage-specific and single-cell chromatin accessibility charts human hematopoiesis and leukemia evolution. Nat. Genet. 48, 1193–1203.
192. Buenrostro, J.D., Wu, B., Litzenburger, U.M., Ruff, D., Gonzales, M.L., Snyder, M.P., Chang, H.Y., and Greenleaf, W.J. (2015). Single-cell chromatin accessibility reveals principles of regulatory variation. Nature 523, 486–490.
193. Li, W.V., and Li, J.J. (2018). An accurate and robust imputation method scImpute for single-cell RNA-seq data. Nat. Commun. 9, 997.
194. Chen, X., Miragaia, R.J., Natarajan, K.N., and Teichmann, S.A. (2018). A rapid and robust method for single cell chromatin accessibility profiling. Nat. Commun. 9, 5345.
195. Preissl, S., Fang, R., Huang, H., Zhao, Y., Raviram, R., Gorkin, D.U., Zhang, Y., Sos, B.C., Afzal, V., Dickel, D.E., et al. (2018). Single-nucleus analysis of accessible chromatin in developing mouse forebrain reveals cell-type-specific transcriptional regulation. Nat. Neurosci. 21, 432–439.
196. Chen, X., Litzenburger, U.M., Wei, Y., Schep, A.N., LaGory, E.L., Choudhry, H., Giaccia, A.J., Greenleaf, W.J., and Chang, H.Y. (2018). Joint single-cell DNA accessibility and protein epitope profiling reveals environmental regulation of epigenomic heterogeneity. Nat. Commun. 9, 4590.
197. Kingma, D.P., and Welling, M. (2013). Auto-encoding variational Bayes. Preprint at arXiv. https://doi.org/10.48550/arXiv.1312.6114.
198. Fu, L., Zhang, L., Dollinger, E., Peng, Q., Nie, Q., and Xie, X. (2020). Predicting transcription factor binding in single cells through deep learning. Sci. Adv. 6, eaba9031. https://doi.org/10.1126/sciadv.aba9031.
199. Wang, J., Ma, A., Chang, Y., Gong, J., Jiang, Y., Qi, R., Wang, C., Fu, H., Ma, Q., and Xu, D. (2021). scGNN is a novel graph neural network framework for single-cell RNA-Seq analyses. Nat. Commun. 12, 1882. https://doi.org/10.1038/s41467-021-22197-x.
200. Yuan, H., and Kelley, D.R. (2022). scBasset: sequence-based modeling of single-cell ATAC-seq using convolutional neural networks. Nat. Methods 19, 1088–1096. https://doi.org/10.1038/s41592-022-01562-8.
201. Buenrostro, J.D., Corces, M.R., Lareau, C.A., Wu, B., Schep, A.N., Aryee, M.J., Majeti, R., Chang, H.Y., and Greenleaf, W.J. (2018). Integrated single-cell analysis maps the continuous regulatory landscape of human hematopoietic differentiation. Cell 173, 1535–1548.e16.
202. Osorio, D., Zhong, Y., Li, G., Xu, Q., Yang, Y., Tian, Y., Chapkin, R.S., Huang, J.Z., and Cai, J.J. (2022). scTenifoldKnk: an efficient virtual knockout tool for gene function predictions via single-cell gene regulatory network perturbation. Patterns 3, 100434. https://doi.org/10.1016/j.patter.2022.100434.
203. Little, D.R., Gerner-Mauro, K.N., Flodby, P., Crandall, E.D., Borok, Z., Akiyama, H., Kimura, S., Ostrin, E.J., and Chen, J. (2019). Transcriptional control of lung alveolar type 1 cell development and maintenance by NK homeobox 2-1. Proc. Natl. Acad. Sci. USA 116, 20545–20555.
204. Nugent, A.A., Lin, K., Van Lengerich, B., Lianoglou, S., Przybyla, L., Davis, S.S., Llapashtica, C., Wang, J., Kim, D.J., Xia, D., et al. (2020). TREM2 regulates microglial cholesterol metabolism upon chronic phagocytic challenge. Neuron 105, 837–854.e9.
205. Chen, L., Toke, N.H., Luo, S., Vasoya, R.P., Fullem, R.L., Parthasarathy, A., Perekatt, A.O., and Verzi, M.P. (2019). A reinforcing HNF4–SMAD4 feed-forward module stabilizes enterocyte identity. Nat. Genet. 51, 777–785.
206. Wang, C., and Mahadevan, S. (2009). A general framework for manifold alignment.
207. Radford, A., Narasimhan, K., Salimans, T., and Sutskever, I. (2018). Improving language understanding by generative pre-training.
208. Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., and Sutskever, I. (2019). Language models are unsupervised multitask learners. OpenAI Blog 1, 9.
209. Brown, T.B., Mann, B., Ryder, N., Subbiah, M., Kaplan, J., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al. (2020). Language models are few-shot learners. Preprint at arXiv. https://doi.org/10.48550/arXiv.2005.14165.
210. Suzek, B.E., Wang, Y., Huang, H., McGarvey, P.B., and Wu, C.H.; UniProt Consortium (2015). UniRef clusters: a comprehensive and scalable alternative for improving sequence similarity searches. Bioinformatics 31, 926–932.
211. Jumper, J., Evans, R., Pritzel, A., Green, T., Figurnov, M., Ronneberger, O., Tunyasuvunakool, K., Bates, R., Žídek, A., Potapenko, A., et al. (2021). Highly accurate protein structure prediction with AlphaFold. Nature 596, 583–589. https://doi.org/10.1038/s41586-021-03819-2.
212. Townshend, R.J.L., Eismann, S., Watkins, A.M., Rangan, R., Karelina, M., Das, R., and Dror, R.O. (2021). Geometric deep learning of RNA structure. Science 373, 1047–1051. https://doi.org/10.1126/science.abe5650.
213. Chen, X., Li, Y., Umarov, R., Gao, X., and Song, L. (2020). RNA secondary structure prediction by learning unrolled algorithms (ICLR).
214. Sverrisson, F., Feydy, J., Correia, B.E., and Bronstein, M.M. (2021). Fast end-to-end learning on protein surfaces. Preprint at bioRxiv. https://doi.org/10.1101/2020.12.28.424589.
215. Baltrusaitis, T., Ahuja, C., and Morency, L.-P. (2019). Multimodal machine learning: a survey and taxonomy. IEEE Trans. Pattern Anal. Mach. Intell. 41, 423–443. https://doi.org/10.1109/tpami.2018.2798607.
216. Cusanovich, D.A., Daza, R., Adey, A., Pliner, H.A., Christiansen, L., Gunderson, K.L., Steemers, F.J., Trapnell, C., and Shendure, J. (2015). Multiplex single-cell profiling of chromatin accessibility by combinatorial cellular indexing. Science 348, 910–914. https://doi.org/10.1126/science.aab1601.
217. Farlik, M., Sheffield, N.C., Nuzzo, A., Datlinger, P., Schönegger, A., Klughammer, J., and Bock, C. (2015). Single-cell DNA methylome sequencing and bioinformatic inference of epigenomic cell-state dynamics. Cell Rep. 10, 1386–1397.
218. Rotem, A., Ram, O., Shoresh, N., Sperling, R.A., Goren, A., Weitz, D.A., and Bernstein, B.E. (2015). Single-cell ChIP-seq reveals cell subpopulations defined by chromatin state. Nat. Biotechnol. 33, 1165–1172.
219. Picelli, S., Faridani, O.R., Björklund, A.K., Winberg, G., Sagasser, S., and Sandberg, R. (2014). Full-length RNA-seq from single cells using Smart-seq2. Nat. Protoc. 9, 171–181.