OPEN ACCESS

Review
Deep Learning in Protein Structural
Modeling and Design
Wenhao Gao,1 Sai Pooja Mahajan,1 Jeremias Sulam,2 and Jeffrey J. Gray1,*
1Department of Chemical and Biomolecular Engineering, Johns Hopkins University, Baltimore, MD 21218, USA
2Department of Biomedical Engineering, Johns Hopkins University, Baltimore, MD 21218, USA
*Correspondence: [email protected]
https://doi.org/10.1016/j.patter.2020.100142
© 2020 The Author(s). This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/).

THE BIGGER PICTURE
Proteins are linear polymers that fold into an incredible variety of three-dimensional
structures that enable sophisticated functionality for biology. Computational modeling allows scientists to
predict the three-dimensional structure of proteins from genomes, predict properties or behavior of a protein,
and even modify or design new proteins for a desired function. Advances in machine learning, especially
deep learning, are catalyzing a revolution in the paradigm of scientific research. In this review, we summarize
recent work in applying deep learning techniques to tackle problems in protein structural modeling and
design. Some deep learning-based approaches, especially in structure prediction, now outperform conven-
tional methods, often in combination with higher-resolution physical modeling. Challenges remain in exper-
imental validation, benchmarking, leveraging known physics and interpreting models, and extending to other
biomolecules and contexts.

Development/Pre-production: Data science output has been rolled out/validated across multiple domains/problems.

SUMMARY

Deep learning is catalyzing a scientific revolution fueled by big data, accessible toolkits, and powerful
computational resources, impacting many fields, including protein structural modeling. Protein structural
modeling, such as predicting structure from amino acid sequence and evolutionary information, designing
proteins toward desirable functionality, or predicting properties or behavior of a protein, is critical to under-
stand and engineer biological systems at the molecular level. In this review, we summarize the recent ad-
vances in applying deep learning techniques to tackle problems in protein structural modeling and design.
We dissect the emerging approaches using deep learning techniques for protein structural modeling and
discuss advances and challenges that must be addressed. We argue for the central importance of structure,
following the "sequence → structure → function" paradigm. This review is directed to help both computa-
tional biologists to gain familiarity with the deep learning methods applied in protein modeling, and computer
scientists to gain perspective on the biologically meaningful problems that may benefit from deep learning
techniques.

INTRODUCTION

Proteins are linear polymers that fold into various specific conformations to function. The incredible variety of three-dimensional (3D) structures, determined by the combination and order in which 20 amino acids thread the protein polymer chain (the sequence of the protein), enables the sophisticated functionality of proteins responsible for most biological activities. Hence, obtaining the structures of proteins is of paramount importance both for understanding the fundamental biology of health and disease and for developing therapeutic molecules. While protein structure is primarily determined by sophisticated experimental techniques, such as X-ray crystallography,1 NMR spectroscopy,2 and, increasingly, cryoelectron microscopy,3 computational structure prediction from the genetically encoded amino acid sequence of a protein has been used as an alternative when experimental approaches are limited. Computational methods have been used to predict the structure of proteins,4 illustrate the mechanisms of biological processes,5 and determine the properties of proteins.6 Furthermore, all naturally occurring proteins are the result of an evolutionary process of random variants arising under various selective pressures. Through this process, nature has explored only a small subset of theoretically possible protein sequence space. To explore a broader sequence and
structural space that potentially contains proteins with enhanced or novel properties, techniques such as de novo design can be used to generate new biological molecules that have the potential to tackle many outstanding challenges in biomedicine and biotechnology.7,8

While the application of machine learning and more general statistical methods in protein modeling can be traced back decades,9–13 recent advances in machine learning, especially in deep learning (DL)-related techniques,14 have opened up new avenues in many areas of protein modeling.15–18 DL is a set of machine learning techniques based on stacked neural network layers that parameterize functions in terms of compositions of affine transformations and non-linear activation functions. Their ability to extract domain-specific features that are adaptively learned from data for a particular task often enables them to surpass the performance of more traditional methods. DL has made dramatic impacts on digital applications like image classification,19 speech recognition,20 and game playing.21 Success in these areas has inspired increasing interest in more complex data types, including protein structures.22 In the most recent Critical Assessment of Structure Prediction (CASP13, held in 2018),4 a biennial community experiment to determine the state of the art in protein structure prediction, DL-based methods accomplished a striking improvement in model accuracy (see Figure 1), especially in the "difficult" target category where comparative modeling (starting with a known, related structure) is ineffective. The CASP13 results show that the complex mapping from amino acid sequence to 3D protein structure can be successfully learned by a neural network and generalized to unseen cases. Concurrently, for the protein design problem, progress in the field of deep generative models has spawned a range of promising approaches.23–25

In this review, we summarize the recent progress in applying DL techniques to the problem of protein modeling and discuss the potential pros and cons. We limit our scope to protein structure and function prediction, protein design with DL (see Figure 2), and a wide array of popular frameworks used in these applications. We discuss the importance of protein representation, and we summarize the approaches to protein design based on DL for the first time. We also emphasize the central importance of protein structure, following the sequence → structure → function paradigm, and argue that approaches based on structures may be most fruitful. We refer the reader to other review papers for more information on applications of DL in biology and medicine,15,16 bioinformatics,27 structural biology,17 folding and dynamics,18,28 antibody modeling,29 and structural annotation and prediction of proteins.30,31 Because DL is a fast-moving, interdisciplinary field, we chose to include preprints in this review. We caution the reader that these contributions have not been peer-reviewed, yet they are still worthy of attention for their ideas. In fact, in communities such as computer science, it is not uncommon for manuscripts to remain at this stage indefinitely, and some seminal contributions, such as Kingma and Welling's definitive paper on autoencoders (AEs),32 are available only as preprints. In addition, we urge caution with any protein design studies that are purely in silico, and we highlight those that include experimental validation as a sign of their trustworthiness.

Figure 1. Striking Improvement in Model Accuracy in CASP13 Due to the Deployment of Deep Learning Methods
(A) Trend lines of backbone accuracy for the best models in each of the 13 CASP experiments. Individual target points are shown for the two most recent experiments. The accuracy metric, GDT_TS, is a multiscale indicator of the closeness of the Cα atoms in a model to those in the corresponding experimental structure (higher numbers are more accurate). Target difficulty is based on sequence and structure similarity to other proteins with known experimental structures (see Kryshtafovych et al.4 for details). Figure from Kryshtafovych et al. (2019).4
(B) Number of FM + FM/TBM (FM, free modeling; TBM, template-based modeling) domains (out of 43) solved to a TM score threshold for all groups in CASP13. AlphaFold ranked first among them, showing that the progress is mainly due to the development of DL-based methods. Figure from Senior et al. (2020).26

PROTEIN STRUCTURE PREDICTION AND DESIGN

Problem Definition
The prediction of protein 3D structure from amino acid sequence has been a grand challenge in computational biophysics for decades.33,34 Folding of peptide chains is a fundamental concept in biophysics, and atomic-level structures of proteins and complexes are often the starting point for understanding their function and for modulating or engineering them. Thanks to recent advances in next-generation sequencing technology, there are now over 180 million protein sequences recorded in the UniProt dataset.35 In contrast, only 158,000 experimentally determined structures are available in the Protein Data Bank. Thus, computational structure prediction is a critical problem of both practical and theoretical interest.

More recently, the advances in structure prediction have led to an increasing interest in the protein design problem. In design,



Figure 2. Schematic Comparison of Three Major Tasks in Protein Modeling: Function Prediction, Structure Prediction, and Protein Design
In function prediction, the sequence and/or the structure is known, and the functionality is the desired output of a neural net. In structure prediction, the sequence is the known input and the structure is the unknown output. Protein design starts from a desired functionality or, a step further, a structure that can perform this functionality; the desired output is a sequence that folds into the structure or has the functionality.

the objective is to obtain a novel protein sequence that will fold into a desired structure or perform a specific function, such as catalysis. Naturally occurring proteins represent only an infinitesimal subset of all possible amino acid sequences, selected by the evolutionary process to perform specific biological functions.7 Proteins with more robustness (higher thermal stability, resistance to degradation) or enhanced properties (faster catalysis, tighter binding) might lie in the space that has not been explored by nature, but that is potentially accessible by de novo design. The current approach to computational de novo design is based on physical and evolutionary principles and requires significant domain expertise. Some successful examples include novel folds,36 enzymes,37 vaccines,38 novel protein assemblies,39 ligand-binding proteins,40 and membrane proteins.41 While some papers occasionally refer to redesign of naturally occurring proteins or interfaces as "de novo," in this review we restrict that term to works where completely new folds or interfaces are created.

Conventional Computational Approaches
The current methodology for computational protein structure prediction is largely based on Anfinsen's42 thermodynamic hypothesis, which states that the native structure of a protein must be the one with the lowest free energy, governed by the energy landscape of all possible conformations associated with its sequence. Finding the lowest-energy state is challenging because of the immense space of possible conformations available to a protein, also known as the "sampling problem" or Levinthal's43 paradox. Furthermore, the approach requires accurate free energy functions to describe the protein energy landscape and rank different conformations based on their energy, referred to as the "scoring problem." In light of these challenges, current computational techniques rely heavily on multiscale approaches. Low-resolution, coarse-grained energy functions are used to capture large-scale conformational sampling, such as hydrophobic burial and the formation of local secondary structural elements. Higher-resolution energy functions are used to explicitly model finer details, such as amino acid side-chain packing, hydrogen bonding, and salt bridges.44

Protein design problems, sometimes known as the inverse of structure prediction problems, require a similar toolbox. Instead of sampling the conformational space, a protein design protocol samples the sequence space that folds into the desired topology. Past efforts can be broadly divided into two classes: modifying an existing protein with known sequence and properties, or generating novel proteins with sequences and/or folds unrelated to those found in nature. The former class evolves an existing protein's amino acid sequence (and, as a result, its structure and properties) and can be loosely referred to as protein engineering or protein redesign. The latter class of methods is called de novo protein design, a term originally coined in 1997 when Dahiyat and Mayo45 designed the FSD-1 protein, a soluble protein with a completely new sequence that folded into the previously known structure of a zinc finger. Korendovych and DeGrado's46 recent retrospective chronicles the development of de novo design. Originally, de novo design meant the creation of entirely new proteins from scratch exploiting a target structure but, especially in the DL era, many authors now use the term to include methods that ignore structure in creating new sequences, often using extensive training data from known proteins in a particular functional class. In this review, we split our discussion of methods according to whether they train directly between sequence and function (as certain natural language processing [NLP]-based DL paradigms allow), or whether they directly include protein structural data (like historical methods in rational protein design; see the section on "Protein Design" below).

Despite significant progress over the last several decades in the fields of computational protein structure prediction and design,7,34 accurate structure prediction and reliable design both remain challenging. Conventional approaches rely heavily on the accuracy of the energy functions used to describe protein physics and the efficiency of the sampling algorithms used to explore the immense protein sequence and structure space. Both protein engineering and de novo approaches are often combined with experimental directed evolution8,47 to achieve the optimal final molecules.7

Figure 3. Schematic Representation of Several Architectures Used in Protein Modeling and Design
(A) CNNs are widely used in structure prediction.
(B) RNNs learn in an auto-regressive way and can be used for sequence generation.
(C) A VAE can be trained jointly on proteins and their properties to construct a latent space correlated with the properties.
(D) In the GAN setting, a mapping from a prior distribution to the design space is obtained via adversarial training.

DL ARCHITECTURES

In conventional computational approaches, predictions from data are made by means of physical equations and modeling. Machine learning puts forward a different paradigm in which algorithms automatically infer, or learn, a relationship between inputs and outputs from a set of hypotheses. Consider a collection of $N$ training samples comprising features $x$ in an input space $\mathcal{X}$ (e.g., amino acid sequences) and corresponding labels $y$ in some output space $\mathcal{Y}$ (e.g., residue pairwise distances), where $\{x_i, y_i\}_{i=1}^{N}$ are sampled independently and identically distributed from some joint distribution $P$. In addition, consider a function $f: \mathcal{X} \to \mathcal{Y}$ in some function class $\mathcal{H}$, and a loss function $\ell: \mathcal{Y} \times \mathcal{Y} \to \mathbb{R}$ that measures how much $f(x)$ deviates from the corresponding label $y$. The goal of supervised learning is to find a function $f \in \mathcal{H}$ that minimizes the expected loss, $\mathbb{E}[\ell(f(x), y)]$, for $(x, y)$ sampled from $P$. Since one does not have access to the true distribution but rather $N$ samples from it, the popular empirical risk minimization (ERM) approach seeks to minimize the loss over the training samples instead. In neural network models, in particular, the function class is parameterized by a collection of weights. Denoting these parameters collectively by $\theta$, ERM boils down to an optimization problem of the form

$$\min_{\theta} \frac{1}{N} \sum_{i=1}^{N} \ell\left(f_{\theta}(x_i),\, y_i\right). \quad \text{(Equation 1)}$$

The choice of the network determines how the hypothesis class is parameterized. Deep neural networks typically implement a non-linear function as the composition of affine maps, $\mathcal{W}_l: \mathbb{R}^{n_l} \to \mathbb{R}^{n_{l+1}}$, where $\mathcal{W}_l x = W_l x + b_l$, and other non-linear activation functions, $\sigma(\cdot)$. Rectified linear units and max-pooling are some of the most popular non-linear transformations applied in practice. The architecture of the model determines how these functions are composed, the most popular option being their sequential composition $f(x) = \mathcal{W}_L\, \sigma(\mathcal{W}_{L-1}\, \sigma(\mathcal{W}_{L-2}\, \sigma(\cdots \mathcal{W}_2\, \sigma(\mathcal{W}_1 x))))$ for a network with $L$ layers. Computing $f(x)$ is typically referred to as the forward pass.

We will not dwell on the details of the optimization problem in Equation (1), which is typically carried out via stochastic gradient descent algorithms or variations thereof, efficiently implemented via back-propagation (see instead, e.g., LeCun et al.,14 Sun,48 and Schmidhuber49). Rather, in this section we summarize some of the most popular models widely used in protein structural modeling, including how different approaches are best suited for particular data types or applications. High-level diagrams of the major architectures are shown in Figure 3.
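To make the ERM recipe of Equation 1 concrete, here is a minimal sketch (ours, not from the review) of a small feed-forward network trained in PyTorch; the data shapes, two-layer architecture, and hyperparameters are illustrative assumptions, and full-batch gradient descent stands in for the stochastic variant for brevity.

```python
import torch
import torch.nn as nn

# Toy supervised setup: x could encode protein features, y a regression target.
N, d_in, d_hidden = 256, 40, 64
x = torch.randn(N, d_in)   # placeholder features
y = torch.randn(N, 1)      # placeholder labels

# f_theta: a composition of affine maps and ReLU non-linearities.
f = nn.Sequential(nn.Linear(d_in, d_hidden), nn.ReLU(), nn.Linear(d_hidden, 1))
loss_fn = nn.MSELoss()     # the loss l(f(x), y)
opt = torch.optim.SGD(f.parameters(), lr=1e-2)

for step in range(100):
    opt.zero_grad()
    empirical_risk = loss_fn(f(x), y)  # (1/N) sum_i l(f_theta(x_i), y_i)
    empirical_risk.backward()          # back-propagation
    opt.step()                         # gradient descent update of theta
```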

Convolutional Neural Networks
Convolutional network architectures50 are most commonly applied to image analysis or other problems where shift-invariance or covariance is needed. Inspired by the fact that an object in an image can be shifted and still be the same object, convolutional neural networks (CNNs) adopt convolutional kernels for the layer-wise affine transformation to capture this translational invariance. A 2D convolutional kernel $w$ applied to 2D image data $x$ can be defined as

$$S(i, j) = (x * w)(i, j) = \sum_{m} \sum_{n} x(m, n)\, w(i - m,\, j - n), \quad \text{(Equation 2)}$$

where $S(i, j)$ represents the output at position $(i, j)$, $x(m, n)$ is the value of the input $x$ at position $(m, n)$, $w(i - m, j - n)$ is the parameter of kernel $w$ at position $(i - m, j - n)$, and the summation is over all possible positions. An important variant of the CNN is the residual network (ResNet),51 which incorporates skip connections between layers. These modifications have shown great advantages in practice, aiding the optimization of these typically huge models. CNNs, especially ResNets, have been widely used in protein structure prediction. An example is AlphaFold,22 which used ResNets to predict protein inter-residue distance maps from amino acid sequences (Figure 3A).
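As an illustration of Equation 2 (our sketch, not code from AlphaFold or any cited method), the double sum can be evaluated directly; deep learning libraries compute the same sliding-window operation, usually as cross-correlation (i.e., without the kernel flip), in optimized form.

```python
import numpy as np

def conv2d(x, w):
    """Naive 2D convolution per Equation 2 ("valid" region only).

    x: (H, W) input, e.g., a patch of a residue-residue distance map.
    w: (kH, kW) kernel. Indices are shifted so the kernel fully overlaps x.
    """
    H, W = x.shape
    kH, kW = w.shape
    out = np.zeros((H - kH + 1, W - kW + 1))
    w_flipped = w[::-1, ::-1]  # flipping turns the sum in Eq. 2 into a sliding dot product
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(x[i:i + kH, j:j + kW] * w_flipped)
    return out

x = np.random.rand(8, 8)
w = np.random.rand(3, 3)
# Check against the literal double sum of Equation 2 at one output position:
assert np.allclose(conv2d(x, w)[0, 0],
                   sum(x[m, n] * w[2 - m, 2 - n] for m in range(3) for n in range(3)))
```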

Recurrent Neural Networks
Recurrent architectures are based on applying several iterations of the same function along a sequential input.52 This can be seen as an unfolded architecture, and it has been widely used to process sequential data, such as time series data and written text (i.e., NLP). With an initial hidden state $h^{(0)}$ and sequential data $[x^{(1)}, x^{(2)}, \ldots, x^{(n)}]$, we can obtain hidden states recursively:

$$h^{(t)} = g^{(t)}\left(x^{(t)}, x^{(t-1)}, x^{(t-2)}, \ldots, x^{(1)}\right) = f\left(h^{(t-1)}, x^{(t)};\, \theta\right), \quad \text{(Equation 3)}$$

where $f$ represents a function or transformation from one position to the next, and $g^{(t)}$ represents the accumulated transformation up to position $t$. The hidden state vector at position $i$, $h^{(i)}$, contains all the information that has been seen before. As the same set of parameters (usually called a cell) can be applied recurrently along the sequential data, an input of variable length can be fed to a recurrent neural network (RNN). Due to the gradient vanishing and explosion problem (the error signal decreases or increases exponentially during training), more recent variants of standard RNNs, namely the long short-term memory (LSTM)53 and the gated recurrent unit,54 are more widely used. An example of an RNN approach in the context of protein structure prediction is using an N-terminal subsequence of a protein to predict the next amino acid in the protein sequence (Figure 3B; e.g., Müller et al.55).
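A minimal sketch of such an autoregressive next-residue model is shown below (our illustration in the spirit of Müller et al.,55 not their code); the embedding size, hidden size, and example sequence are arbitrary assumptions.

```python
import torch
import torch.nn as nn

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues
aa_to_idx = {aa: i for i, aa in enumerate(AMINO_ACIDS)}

class NextResidueRNN(nn.Module):
    """Autoregressive model: given residues 1..t, predict residue t+1."""
    def __init__(self, n_tokens=20, d_embed=16, d_hidden=64):
        super().__init__()
        self.embed = nn.Embedding(n_tokens, d_embed)
        self.lstm = nn.LSTM(d_embed, d_hidden, batch_first=True)
        self.head = nn.Linear(d_hidden, n_tokens)

    def forward(self, seq_idx):
        h, _ = self.lstm(self.embed(seq_idx))  # h[:, t] summarizes x(1..t), as in Eq. 3
        return self.head(h)                    # logits for the next residue at each t

model = NextResidueRNN()
seq = torch.tensor([[aa_to_idx[a] for a in "MKTAYIAKQR"]])  # hypothetical N-terminal fragment
logits = model(seq)
# Cross-entropy against the sequence shifted by one trains next-residue prediction.
loss = nn.functional.cross_entropy(logits[:, :-1].reshape(-1, 20), seq[:, 1:].reshape(-1))
```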
In conjunction with recurrent networks, attention mechanisms were first proposed (in an encoder-decoder framework) to learn which parts of a source sentence are most relevant to predicting a target word.56 Compared with RNN models, attention-based models are more parallelizable and better at capturing long-range dependencies, and they are driving big advances in NLP.57,58 Recently, the transformer model, which adopts attention layers alone without any recurrent or convolutional layers, was able to surpass state-of-the-art methods on language translation tasks.57 For proteins, these methods could learn which parts of an amino acid sequence are critical to predicting a target residue or the properties of a target residue. For example, transformer-based models have been used to generate protein sequences conditioned on a target structure,23 to learn protein sequence data to predict secondary structure and fitness landscapes,59 and to encode the context of the binding partner in antibody-antigen binding surface prediction.60
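The core operation underlying these models is scaled dot-product attention (Vaswani et al.57); the sketch below (ours, with illustrative dimensions) shows how each residue position computes a weighted summary over all other positions in the sequence.

```python
import torch

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.transpose(-2, -1) / d_k ** 0.5  # pairwise relevance of positions
    weights = torch.softmax(scores, dim=-1)        # each query attends over all keys
    return weights @ V

# For a protein of L residues with d-dimensional per-residue features:
L, d = 50, 32
x = torch.randn(1, L, d)                       # per-residue embeddings (illustrative)
out = scaled_dot_product_attention(x, x, x)    # self-attention over the sequence
```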
ple distribution pz ðzÞ (e.g., Gaussian), seeks to map it to the dis-
Variational Autoencoder tribution of some data class (e.g., naturally looking images); and
AEs,61 unlike the networks discussed so far, provide a a discriminator, D, whose task is to detect whether the images
model for unsupervised learning. Within this unsupervised are real (i.e., belonging to the true distribution of the data,

PATTER 1, December 11, 2020 5


ll
OPEN ACCESS Review
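A minimal VAE trained with Equation 4 is sketched below (our illustration, not the model of Das et al.64); it assumes a Gaussian encoder with a standard normal prior, for which the KL term has a closed form, and a Gaussian likelihood, for which the reconstruction term reduces to a squared error up to a constant.

```python
import torch
import torch.nn as nn

class SeqVAE(nn.Module):
    """Minimal VAE: Gaussian q_phi(z|x) with a closed-form KL to an N(0, I) prior."""
    def __init__(self, d_in, d_latent=8):
        super().__init__()
        self.enc = nn.Linear(d_in, 2 * d_latent)  # outputs (mu, log-variance)
        self.dec = nn.Linear(d_latent, d_in)

    def forward(self, x):
        mu, logvar = self.enc(x).chunk(2, dim=-1)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()      # reparameterization
        x_hat = self.dec(z)
        recon = -((x - x_hat) ** 2).sum(-1)                       # ~ E_q[log p(x|z)], Gaussian
        kl = 0.5 * (mu ** 2 + logvar.exp() - 1 - logvar).sum(-1)  # D_KL(q || N(0, I))
        elbo = recon - kl                                         # Equation 4
        return -elbo.mean()                                       # minimize negative ELBO

vae = SeqVAE(d_in=21 * 30)  # e.g., a flattened one-hot peptide of length 30
loss = vae(torch.randn(4, 21 * 30))
```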

Generative Adversarial Network
Generative adversarial networks (GANs)65 are another class of unsupervised (generative) models. Unlike VAEs, GANs are trained by an adversarial game between two models, or networks: a generator, $G$, which, given a sample $z$ from some simple distribution $p_z(z)$ (e.g., Gaussian), seeks to map it to the distribution of some data class (e.g., natural-looking images); and a discriminator, $D$, whose task is to detect whether the images are real (i.e., belonging to the true distribution of the data, $p_{data}(x)$) or fake (produced by the generator). With this game-based setup, the generator model is trained by maximizing the error rate of the discriminator, thereby training it to "fool" the discriminator. The discriminator, on the other hand, is trained to foil such fooling. The original objective function as formulated by Goodfellow et al.65 is:

$$\min_{G} \max_{D} V(D, G) = \mathbb{E}_{x \sim p_{data}(x)}\left[\log D(x)\right] + \mathbb{E}_{z \sim p_z(z)}\left[\log\left(1 - D(G(z))\right)\right]. \quad \text{(Equation 5)}$$

Training is performed by stochastic optimization of this differentiable loss function. While intuitive, this original GAN objective can suffer from issues such as mode collapse and instabilities during training. The Wasserstein GAN (WGAN)66 is a popular extension of the GAN that introduces a Wasserstein-1 distance measure between distributions, leading to easier and more robust training.67 An example of a GAN in the context of protein modeling is learning the distribution of protein backbone distances to generate novel protein-like folds (Figure 3D).68 During training, one network, $G$, generates folds, and a second network, $D$, aims to distinguish the generated (fake) folds from real ones.
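The following sketch (ours, with toy data standing in for, e.g., backbone distance maps) spells out the adversarial game of Equation 5; as is common practice, the generator update uses the non-saturating variant of the objective.

```python
import torch
import torch.nn as nn

d_z, d_x = 16, 64  # latent and data dimensions (illustrative)
G = nn.Sequential(nn.Linear(d_z, 64), nn.ReLU(), nn.Linear(64, d_x))
D = nn.Sequential(nn.Linear(d_x, 64), nn.ReLU(), nn.Linear(64, 1), nn.Sigmoid())
bce = nn.BCELoss()
opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)

for step in range(1000):
    x_real = torch.randn(32, d_x)  # stand-in for real data samples
    z = torch.randn(32, d_z)
    # Discriminator: maximize log D(x) + log(1 - D(G(z))), per Equation 5.
    opt_d.zero_grad()
    loss_d = bce(D(x_real), torch.ones(32, 1)) + bce(D(G(z).detach()), torch.zeros(32, 1))
    loss_d.backward()
    opt_d.step()
    # Generator: fool D (non-saturating form maximizes log D(G(z))).
    opt_g.zero_grad()
    loss_g = bce(D(G(z)), torch.ones(32, 1))
    loss_g.backward()
    opt_g.step()
```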
cific scoring matrices (PSSMs),77 or pairwise residue co-evolu-
PROTEIN REPRESENTATION AND FUNCTION tion features. Table 1 lists typical features as used in CU-
PREDICTION Protein.78

One of the most fundamental challenges in protein modeling is Learned Representation from Amino Acid Sequence
the prediction of functionality from sequence or structure. Func- Because the performance of machine learning algorithms highly
tion prediction is typically formulated as a supervised learning depends on the features we choose, labor-intensive and
problem. The property to predict can either be a protein-level domain-based feature engineering was vital for traditional ma-
property, such as a classification as an enzyme or non- chine learning projects. Now, the exceptional feature extraction
enzyme,69 or a residue-level property, such as the sites or motifs ability of neural networks makes it possible to ‘‘learn’’ the repre-
of phosphorylation (DeepPho)70 and cleavage by proteases.71 sentation, with or without giving the model any labels.72 As pub-
The challenging part here and in the following models is how to licly available sequence data are abundant (see Table 2), a well-
represent the protein. Representation refers to the encoding of learned representation that utilizes these data to capture more
a protein that serves as an input for prediction tasks or the output information is of particular interest. The class of algorithms that
for generation tasks. Although a deep neural network is in princi- address the label-less learning problem fall under the umbrella
ple capable of extracting complex features, a well-chosen repre- of unsupervised or semi-supervised learning, which extracts in-
sentation can make learning more effective and efficient.72 In this formation from unlabeled data to reduce the number of labeled
section, we will introduce the commonly used representations of samples needed.
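A one-hot encoder matching the n × 21 "AA sequence" entry in Table 1 takes only a few lines; the sketch below is our illustration, with "X" as an assumed catch-all for nonstandard or unknown residues.

```python
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWYX"  # 20 standard residues + X for nonstandard/unknown

def one_hot(seq):
    """Encode a sequence as an (n, 21) binary matrix: one high bit per residue."""
    idx = {aa: i for i, aa in enumerate(AMINO_ACIDS)}
    enc = np.zeros((len(seq), len(AMINO_ACIDS)), dtype=np.float32)
    for pos, aa in enumerate(seq):
        enc[pos, idx.get(aa, idx["X"])] = 1.0
    return enc

x = one_hot("MKTAYIAKQR")  # shape (10, 21); note only one nonzero entry per row (sparse)
```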

Learned Representation from Amino Acid Sequence
Because the performance of machine learning algorithms depends heavily on the features we choose, labor-intensive, domain-based feature engineering was vital for traditional machine learning projects. Now, the exceptional feature-extraction ability of neural networks makes it possible to "learn" the representation, with or without giving the model any labels.72 As publicly available sequence data are abundant (see Table 2), a well-learned representation that utilizes these data to capture more information is of particular interest. The class of algorithms that address the label-less learning problem falls under the umbrella of unsupervised or semi-supervised learning, which extracts information from unlabeled data to reduce the number of labeled samples needed.

Table 2. A Summary of Publicly Available Molecular Biology Databases

Dataset                                              | Description                                                        | N          | Website
European Bioinformatics Institute (EMBL-EBI)         | a collection of a wide range of datasets                           | –          | https://www.ebi.ac.uk
National Center for Biotechnology Information (NCBI) | a collection of biomedical and genomic databases                   | –          | https://www.ncbi.nlm.nih.gov
Protein Data Bank (PDB)                              | 3D structural data of biomolecules, such as proteins and nucleic acids | 160,000 | https://www.rcsb.org
Nucleic Acid Database (NDB)                          | structures of nucleic acids and complex assemblies                 | 10,560     | http://ndbserver.rutgers.edu
Universal Protein Resource (UniProt)                 | protein sequence and function information                          | ~562,000   | http://www.uniprot.org/
Sequence Read Archive (SRA)                          | raw sequence data from "next-generation" sequencing technologies   | ~3 × 10^16 | NCBI database

The most straightforward way to learn from amino acid sequences is to directly apply NLP algorithms. Word2Vec79 and Doc2Vec80 are groups of algorithms widely used for learning word or paragraph embeddings. These models are trained by either predicting a word from its context or predicting its context from one central word. To apply these algorithms, Asgari and Mofrad81 first proposed a Word2Vec-based model called BioVec, which interprets non-overlapping 3-mer sequences of amino acids (e.g., alanine-glutamine-lysine, or AQL) as "words" and lists of shifted "words" as "sentences." They then represent a protein as the summation of all overlapping sequence fragments of length k, or k-mers (called ProtVec). Predictions based on the ProtVec representation outperformed state-of-the-art machine learning methods in Pfam protein family82 classification (93% accuracy for ~7,000 proteins, versus 69.1%–99.6%83 and 75%84 for previous methods). Many Doc2Vec-type extensions were developed based on the 3-mer protocol. Yu et al.85 showed that non-overlapping k-mers perform better than overlapping ones, and Yang et al.86 compared the performance of all Doc2Vec frameworks for thermostability and enantioselectivity prediction.
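The BioVec/ProtVec recipe can be sketched with an off-the-shelf Word2Vec implementation such as gensim's (our illustration; the sequence is made up, and real models are trained on millions of sequences rather than one).

```python
import numpy as np
from gensim.models import Word2Vec

seq = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"  # hypothetical sequence

# Non-overlapping 3-mers from the three reading frames, as in BioVec81:
# each shifted list of "words" acts as one "sentence."
sentences = [[seq[i:i + 3] for i in range(start, len(seq) - 2, 3)] for start in range(3)]

model = Word2Vec(sentences, vector_size=100, window=5, min_count=1, sg=1)

# A ProtVec-style protein embedding sums the vectors of all overlapping 3-mers.
protvec = np.sum([model.wv[seq[i:i + 3]] for i in range(len(seq) - 2)], axis=0)
```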
In these approaches, the three-residue segmentation of a protein sequence is arbitrary and does not embody any biophysical meaning. Alternatively, Alley et al.87 directly used an RNN (unidirectional multiplicative long short-term memory, or mLSTM)88 model, called UniRep, to summarize arbitrary-length protein sequences into a fixed-length real-valued representation by averaging over the representation of each residue.87 Their representation achieved lower mean squared errors on 15 property prediction tasks (e.g., absorbance, activity, stability) compared with former models, including Yang et al.'s86 Doc2Vec. Heinzinger et al.89 adopted a bidirectional LSTM in a manner similar to Peters et al.'s90 ELMo (Embeddings from Language Models) model and surpassed Asgari and Mofrad's81 Word2Vec model at predicting secondary structure and regions with intrinsic disorder at the per-residue level.89 The success of the transformer model in language processing, especially models trained with large numbers of parameters, such as BERT58 and GPT-3,91 has inspired its application in biological sequence modeling. Rives et al.59 trained a transformer model with 670 million parameters on 86 billion amino acids across 250 million protein sequences spanning evolutionary diversity. Their transformer model was superior to traditional LSTM-based models on tasks such as the prediction of secondary structure and long-range contacts, as well as the effect of mutations on activity in deep mutational scanning benchmarks.

AEs can also provide representations for subsequent supervised tasks.32 Ding et al.92 showed that a VAE model is able to capture evolutionary relationships between sequences and the stability of proteins, while Sinai et al.93 and Riesselman et al.94 showed that the latent vectors learned by VAEs are able to predict the effects of mutations on fitness and activity for a range of proteins, such as poly(A)-binding protein, DNA methyltransferase, and β-lactamase. Recently, a lower-dimensional embedding of the sequence was learned for the more complex task of structure prediction.78 Alley et al.'s87 UniRep surpassed former models, but since UniRep was trained on 24 million sequences and previous models (e.g., ProtVec) were trained on much smaller datasets (0.5 million), it is not clear whether the improvement was due to better methods or the larger training dataset. Rao et al.95 introduced TAPE, a set of multiple biologically relevant semi-supervised learning tasks, and benchmarked the performance of various protein representations. Their results show that conventional alignment-based inputs still outperform current self-supervised models on multiple tasks, and that performance on a single task cannot evaluate the capacity of a model. A comprehensive and persuasive comparison of representations is still required.

Structure as Representation
Since the most important functions of a protein (e.g., binding, signaling, catalysis) can be traced back to its 3D structure, the direct use of 3D structural information, and analogously, learning a good representation based on 3D structure, are highly desired. The direct use of raw 3D representations (such as coordinates of atoms) is hindered by considerable challenges, including the processing of unnecessary information due to translation, rotation, and permutation of atomic indexing. Townshend et al.96 and Simonovsky and Meyers97 obtained a translationally invariant 3D representation of each residue by voxelizing its atomic neighborhood for a grid-based 3D CNN model. The work of Kolodny et al.,98 Taylor,99 and Li and Koehl100 representing the 3D structure of a protein as 1D strings of geometric fragments for structure comparison and fold recognition may also prove useful in DL approaches.
Alternatively, the torsion angles of the protein backbone, which are invariant to translation and rotation, can fully recapitulate the protein backbone structure under the common assumption that variation in bond lengths and bond angles is negligible. AlQuraishi101 used backbone torsion angles to represent the 3D structure of a protein as a 1D data vector. However, because a change in a backbone torsion angle at one residue affects the inter-residue distances between all preceding and subsequent residues, these 1D variables are highly interdependent, which can frustrate learning. To circumvent these limitations, many approaches use 2D projections of 3D protein structure data, such as residue-residue distance and contact maps,24,102 and pseudo-torsion angles and bond angles that capture the relative orientations between pairs of residues.103 While these representations guarantee translational and rotational invariance, they do not guarantee invertibility back to the 3D structure. The structure must be reconstructed by applying constraints on distance or contact parameters using algorithms such as gradient descent minimization or multidimensional scaling, a program like the Crystallography and NMR System (CNS),104 or in conjunction with an energy-function-based protein structure prediction program.22
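For example, a residue-residue distance map and a derived contact map can be computed in a few lines (our sketch; the 8 Å cutoff is one commonly used convention, not a universal standard).

```python
import numpy as np

def distance_map(ca_coords):
    """Pairwise Calpha distance matrix (n x n) from 3D coordinates (n x 3)."""
    diff = ca_coords[:, None, :] - ca_coords[None, :, :]
    return np.sqrt((diff ** 2).sum(-1))

ca = np.random.rand(100, 3) * 30.0  # placeholder coordinates, in Angstroms
D = distance_map(ca)                # invariant to translation and rotation
contacts = (D < 8.0) & ~np.eye(len(ca), dtype=bool)  # binary contact map at an 8 A cutoff
```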
An alternative to the above approaches for representing protein structures is the use of a graph, i.e., a collection of nodes or vertices connected by edges. Such a representation is highly amenable to the graph neural network (GNN) paradigm,105 which has recently emerged as a powerful framework for non-Euclidean data,106 in which the data are represented with relationships and inter-dependencies, or edges, between objects, or nodes.107 While the representation of proteins as graphs and the application of graph theory to study their structure and properties have a long history,108 efforts to apply GNNs to protein modeling and design are quite recent. As a benchmark, many GNNs69,109 have been applied to classify enzymes from non-enzymes in the PROTEINS110 and D&D111 datasets. Fout et al.112 utilized a GNN in developing a model for protein-protein interface prediction. In their model, the node features comprised residue composition and conservation, accessible surface area, residue depth, and protrusion index; the edge features comprised a distance and an angle between the normal vectors of the amide planes of each pair of nodes/residues. A similar framework was used to predict antibody-antigen binding interfaces.60 Zamora-Resendiz and Crivelli113 and Gligorijević et al.114 further generalized and validated the use of graph-based representations and the graph convolutional network (GCN) framework in protein function prediction tasks, using a class activation map to interpret the structural determinants of the functionalities. Torng and Altman115 applied GCNs to model pocket-like cavities in proteins to predict the interactions of proteins with small molecules, and Ingraham et al.23 adopted a graph-based transformer model to perform a protein sequence design task. These examples demonstrate the generality and potential of graph-based representations and GNNs to encode structural information for protein modeling.
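As a minimal sketch of the graph representation (ours, loosely following the ingredients described for Fout et al.;112 the 10 Å cutoff is an assumption), nodes carry per-residue features and edges connect spatially proximal residues.

```python
import numpy as np

def residue_graph(D, node_features, cutoff=10.0):
    """Build a simple residue graph from a distance map D (n x n).

    Nodes carry per-residue features (e.g., composition, conservation,
    surface area); edges connect residues closer than the cutoff and
    carry the distance as an edge feature. GNN layers would then pass
    messages between nodes along these edges.
    """
    n = D.shape[0]
    edges, edge_feats = [], []
    for i in range(n):
        for j in range(i + 1, n):
            if D[i, j] < cutoff:
                edges.append((i, j))
                edge_feats.append([D[i, j]])
    return node_features, np.array(edges), np.array(edge_feats)
```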
The surface of a protein or a cavity is an information-rich region that encodes how a protein may interact with other molecules and its environment. Recently, Gainza et al.116 used a geometric DL framework117 to learn a surface-based representation of the protein, called MaSIF. They calculated "fingerprints" for patches on the protein surface using geodesic convolutional layers, which were further used to perform tasks such as binding site prediction and ultra-fast protein-protein interaction (PPI) search. The performance of MaSIF approached the baseline of current methods in docking and function prediction, providing a proof of concept to inspire more applications of geometry-based representation learning.

Score Function and Force Field
A high-quality force field (or, more generally, score function) for sampling and/or ranking models (decoys) is one of the most vital requirements for protein structural modeling.118 A force field describes the potential energy surface of a protein. A score function may contain knowledge-based terms that do not necessarily have a valid physical meaning and that are designed to distinguish near-native conformations from non-native ones (for example, learning the GDT_TS).119 A molecular dynamics (MD) or Monte Carlo (MC) simulation with a state-of-the-art force field or score function can reproduce reasonable statistical behaviors of biomolecules.120–122

Current DL-based efforts to learn force fields can be divided into two classes: "fingerprint" based and graph based. Behler and Parrinello123 developed roto-translationally invariant features, i.e., the Behler-Parrinello fingerprint, to encode the atomic environment for neural networks to learn potential surfaces from density functional theory (DFT) calculations. Smith et al. extended this framework and tested its accuracy by simulating systems of up to 312 atoms (Trp-cage) for 1 ns.124,125 Another family, which includes deep tensor neural networks126 and SchNet,127 uses graph convolutions to learn a representation for each atom within its chemical environment. Although the prediction quality and the ability to learn a representation with novel chemical insight make the graph-based approach increasingly popular,28 the application scales poorly to larger systems and has thus mainly focused on small organic molecules.

We anticipate a shift toward DL-based score functions because of the enormous gains in speed and efficiency. For example, Zhang et al.128 showed that MD simulation on a neural potential was able to reproduce energies, forces, and time-averaged properties comparable with ab initio MD (AIMD) at a cost that scales linearly with system size, compared with the cubic scaling typical for AIMD with DFT. Although these force fields are, in principle, generalizable to larger systems, direct applications of learned potentials to model full proteins are still rare. PhysNet, trained on a set of small peptide fragments (at most eight heavy atoms), was able to generalize to deca-alanine (Ala10),129 and ANI-1x and AIMNet have been tested on chignolin (10 residues) and Trp-cage (20 residues) within the ANI-MD benchmark dataset.125,130 Lahey and Rowley131 and Wang et al.132 combined the quantum mechanics/molecular mechanics (QM/MM) strategy133 with neural potentials to model docking with small ligands and larger proteins.131,132 Recently, Wang et al.134 proposed an end-to-end differentiable MM force field by training a GNN on energies and forces to learn atom-typing and force field parameters.

Coarse-Grained Models
Coarse-grained models are higher-level abstractions of biomolecules, such as using a single pseudo-atom or bead to

represent multiple atoms, grouped based on local connectivity and/or chemical properties. Coarse graining smoothens the energy landscape, thereby helping to avoid trapping in local minima and speeding up conformational sampling.135 Once the coarse-grained mapping is given, one can learn the atomic-level properties to construct a fast and accurate neural coarse-grained model. Early attempts to apply DL-based methods to coarse graining focused on water molecules with roto-translationally invariant features.136,137 Wang et al.138 developed CGNet and learned a coarse-grained model of the mini-protein chignolin, in which the atoms of a residue are mapped to the corresponding Cα atom. The free energy surface learned with CGNet is quantitatively correct, and MD simulations performed with the CGNet potential predict the same set of metastable states (folded, unfolded, and misfolded). Another critical question for coarse graining is determining which sets of atoms to map into a united atom. For example, one choice is to use a single coarse-grained atom to represent a whole residue, and a different choice is to use two coarse-grained atoms, one to represent the backbone and the other to represent the side chain. To determine the optimal choice, Wang and Gómez-Bombarelli139 applied an encoder-decoder-based model to explicitly learn a lower-dimensional representation of proteins by minimizing the information loss at different levels of coarse graining. Li et al.140 treated this problem as a graph segmentation problem and presented a GNN-based coarse-graining mapping predictor called the Deep Supervised Graph Partitioning Model.
STRUCTURE DETERMINATION

The most successful application of DL in the field of protein modeling so far has been the prediction of protein structure. Protein structure prediction is formulated as a well-defined problem with clear inputs and outputs: predict the 3D structure (output) given the amino acid sequence (input), with experimental structures as the ground truth (labels). This problem perfectly fits the classical supervised learning approach, and once the problem is defined in these terms, the remaining challenge is to choose a framework to handle the complex relationship between input and output. The CASP experiment for structure prediction is held every 2 years and has served as a platform for DL to compete with state-of-the-art methods and, impressively, outshine them in certain categories. We first discuss the application of DL to the protein folding problem, and then comment on some related problems in structure determination. Table 3 summarizes major DL efforts in structure prediction.

Protein Structure Prediction
Before the notable success of DL at CASP12 (2016) and CASP13 (2018), the state-of-the-art methodology used complex workflows based on a combination of fragment insertion and structure optimization methods, such as simulated annealing with a score function or energy potential. Over the last decade, the introduction of co-evolution information in the form of evolutionary coupling analysis (ECA)154 improved predictions. ECA relies on the rationale that residue pairs in contact in 3D space tend to evolve or mutate together; otherwise, a mutation would disrupt the structure, destabilizing the fold or causing a large conformational change. Thus, evolutionary couplings from sequencing data suggest distance relationships between residue pairs and aid structure construction from sequence through contact or distance constraints. Because co-evolution information relies on statistical averaging of sequence information from a large number of MSAs,145,155,156 this approach is not effective when the protein target has only a few sequence homologs. Neural networks were at first introduced to deduce evolutionary couplings between distant homologs, thereby improving ECA-type contact predictions for contact-assisted protein folding.154 While the application of neural networks to learn inter-residue protein contacts dates back to the early 2000s,157,158 more recently this approach was adopted by MetaPSICOV (two-layer NN),146 PconsC2 (two-layer NN),145 and CoinDCA-NN (five-layer NN),155 which combined neural networks with ECAs. However, at that time there was no significant advantage to neural networks compared with other machine learning methods.159

In 2017, Wang et al.102 proposed RaptorX-Contact, a residual neural network (ResNet)-based model,51 which for the first time used a deep neural network for protein contact prediction, significantly improving the accuracy on blind, challenging targets with novel folds. RaptorX-Contact ranked first on free modeling targets at CASP12.161 Its architecture (Figure 5A) entails (1) a 1D ResNet that inputs MSAs, predicted secondary structure, and solvent accessibility (from the DL-based prediction tool RaptorX-Property)162 and (2) a 2D ResNet with dilations that inputs the 1D ResNet output and inter-residue co-evolution information from CCMpred.144 In its original formulation, RaptorX-Contact outputs a binary classification of contacting versus non-contacting residue pairs.102 Later versions were trained to learn a multi-class classification of distance distributions between Cβ atoms.147 The primary contributors to the accuracy of the predictions were the co-evolution information from CCMpred and the depth of the 2D ResNet, suggesting that the deep neural network learned co-evolution information better than previous methods. Later, the method was extended to predict Cα-Cα, Cα-Cγ, Cγ-Cγ, and N-O distances and torsion angles (DL-based RaptorX-Angle),163 giving constraints to locate side chains and additionally constrain the backbone; all five distances, torsions, and secondary structure predictions were converted to constraints for folding by CNS.147 At CASP12, however, RaptorX-Contact (in its original contact-based formulation) and DL drew limited attention because the difference between the top-ranked predictions from DL-based methods and hybrid DCA-based methods was small.


Table 3. A Summary of Structure Prediction Models

Model           | Architecture        | Dataset                       | N_train | Performance                                                              | Test Set                      | Citation
–               | MLP (2-layer)       | proteases                     | 13      | 3.0 Å RMSD (1TRM), 1.2 Å RMSD (6PTI)                                     | 1TRM, 6PTI                    | Bohr et al.9
PSICOV          | graphical lasso     | –                             | –       | precision: Top-L 0.4, Top-L/2 0.53, Top-L/5 0.67, Top-L/10 0.73          | 150 Pfam                      | Jones et al.141
CMAPpro         | 2D biRNN + MLP      | ASTRAL                        | 2,352   | precision: Top-L/5 0.31, Top-L/10 0.4                                    | ASTRAL 1.75, CASP8, 9         | Di Lena et al.142
DNCON           | RBM                 | PDB, SVMcon                   | 1,230   | precision: Top-L 0.46, Top-L/2 0.55, Top-L/5 0.65                        | SVMCON_TEST, D329, CASP9      | Eickholt et al.143
CCMpred         | LM                  | –                             | –       | precision: Top-L 0.5, Top-L/2 0.6, Top-L/5 0.75, Top-L/10 0.8            | 150 Pfam                      | Seemayer et al.144
PconsC2         | stacked RF          | PSICOV set                    | 150     | positive predictive value (PPV) 0.44                                     | set of 383, CASP10 (114)      | Skwark et al.145
MetaPSICOV      | MLP                 | PDB                           | 624     | precision: Top-L 0.54, Top-L/2 0.70, Top-L/5 0.83, Top-L/10 0.88         | 150 Pfam                      | Jones et al.146
RaptorX-Contact | ResNet              | subset of PDB25               | 6,767   | TM score: 0.518 (CCMpred: 0.333, MetaPSICOV: 0.377)                      | Pfam, CASP11, CAMEO, MP       | Wang et al. (2017)102
RaptorX-Distance| ResNet              | subset of PDB25               | 6,767   | TM score: 0.466 (CASP12), 0.551 (CAMEO), 0.474 (CASP13)                  | CASP12+13, CAMEO              | Xu (2018)147
DeepCov         | 2D CNN              | PDB                           | 6,729   | precision: Top-L 0.406, Top-L/2 0.523, Top-L/5 0.611, Top-L/10 0.642     | CASP12                        | Jones et al. (2018)148
SPOT            | ResNet, Res-bi-LSTM | PDB                           | 11,200  | AUC: 0.958 (RaptorX-Contact, ranked 2nd: 0.909)                          | 1,250 chains after June 2015  | Hanson et al.149
DeepMetaPSICOV  | ResNet              | PDB                           | 6,729   | precision: Top-L/5 0.6618                                                | CASP13                        | Kandathil et al. (2019)150
MULTICOM        | 2D CNN              | CASP 8–11                     | 425     | TM score: 0.69, GDT_TS: 63.54, SUM Z score (>−2.0): 99.47                | CASP13                        | Hou et al.151
C-I-TASSER*     | 2D CNN              | –                             | –       | TM score: 0.67, GDT_HA: 0.44, RMSD: 6.19, SUM Z score (>−2.0): 107.59    | CASP13                        | Zheng et al.152
AlphaFold       | ResNet              | PDB                           | 31,247  | TM score: 0.70, GDT_TS: 61.4, SUM Z score (>−2.0): 120.43                | CASP13                        | Senior et al.22
MapPred         | ResNet              | PISCES                        | 7,277   | precision: 78.94% on SPOT, 77.06% on CAMEO, 77.05% on CASP12             | SPOT, CAMEO, CASP12           | Wu et al. (2019)153
trRosetta       | ResNet              | PDB                           | 15,051  | TM score: 0.625 (AlphaFold: 0.587)                                       | CASP13, CAMEO                 | Yang et al. (2020)103
RGN             | bi-LSTM             | ProteinNet 12 (before 2016)** | 104,059 | 10.7 Å dRMSD on FM, 6.9 Å on TBM                                         | CASP12                        | AlQuraishi (2019)101
–               | biGRU, Res-LSTM     | CUProtein                     | 75,000  | preceded CASP12 winning team; comparable with AlphaFold in RMSD          | CASP12+13                     | Drori et al.78

FM, free modeling; GRU, gated recurrent unit; LM, pseudo-likelihood maximization; MLP, multi-layer perceptron; MP, membrane protein; RBM, restricted Boltzmann machine; RF, random forest; RMSD, root-mean-square deviation; TBM, template-based modeling.
*C-I-TASSER and C-QUARK were both reported; we report only one here.
**RGN was trained on a different ProteinNet for each CASP; we report the latest one here.
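Several entries in Table 3 report Top-L/k precision; for reference, this metric can be computed as follows (our sketch; the sequence-separation filter of six residues is one common evaluation convention, not a fixed standard).

```python
import numpy as np

def top_l_precision(pred_probs, true_contacts, k=5, min_sep=6):
    """Precision of the top L/k predicted contacts, as reported in Table 3.

    pred_probs, true_contacts: (L, L) matrices; residue pairs closer than
    min_sep positions along the chain are excluded from evaluation.
    """
    L = pred_probs.shape[0]
    i, j = np.triu_indices(L, k=min_sep)            # long-range upper-triangle pairs
    order = np.argsort(pred_probs[i, j])[::-1][: L // k]  # top L/k by confidence
    return true_contacts[i[order], j[order]].mean()
```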

This situation changed at CASP13,4 when one DL-based model, AlphaFold, developed by team A7D (DeepMind),22,26,164 ranked first and significantly improved the accuracy on "free modeling" (no templates available) targets (Figure 1). The A7D team modified the traditional simulated annealing protocol with DL-based predictions and tested three protocols based on deep neural networks. Two protocols used memory-augmented simulated annealing (with domain segmentation and fragment assembly) with potentials generated from predicted inter-residue distance distributions and predicted GDT_TS,165 respectively, whereas the third protocol directly applied gradient descent optimization to a hybrid potential combining the predicted distances and the Rosetta score. For the distance prediction network, a deep ResNet, similar to that of RaptorX,102 inputs MSA data and predicts the probability of distances between β carbons. A second network was trained to predict the GDT_TS of the candidate structure with respect to the true (native) structure. The simulated annealing process was improved with a conditional variational autoencoder (CVAE)166 model that constructs a mapping between the backbone torsions and a latent space conditioned on sequence. With this network, the team generated a database of nine-residue fragments for the memory-augmented simulated annealing system. Gradient-based optimization performed slightly better than the simulated annealing, suggesting that traditional simulated annealing is no longer necessary and that state-of-the-art performance can be reached by simply optimizing a network-predicted potential. AlphaFold's authors,


like the RaptorX-Contact group, emphasized that the accuracy of predictions relied heavily on learned distance distributions and co-evolutionary data.

Yang et al.103 further improved the accuracy of predictions on CASP13 targets using a shallower network than former models (61 versus 220 ResNet blocks in AlphaFold) by also training their neural network model (named trRosetta) to learn inter-residue orientations along with β-carbon distances. The geometric features (Cα-Cβ torsions, pseudo-bond angles, and azimuthal rotations) directly describe the relevant coordinates for the physical interaction of two amino acid side chains. These additional outputs created a significant improvement on a relatively fixed DL framework, suggesting that there is room for additional improvement.

Figure 5. Two Representative DL Approaches to Protein Structure Prediction
(A) Residue distance prediction by RaptorX: the overall network architecture of the deep dilated ResNet used in CASP13. Inputs of the first-stage 1D convolutional layers are a sequence profile, predicted secondary structure, and solvent accessibility. The output of the first stage is then converted into a 2D matrix by concatenation and fed into a deep ResNet along with pairwise features (co-evolution information, pairwise contact, and distance potential). A discretized inter-residue distance is the output. Additional network layers can be attached to predict torsion angles and secondary structures. Figure from Xu and Wang (2019).160
(B) Direct structure prediction: overview of the recurrent geometric network (RGN) approach. The raw amino acid sequence along with a PSSM are fed as input features, one residue at a time, to a bidirectional LSTM. Three torsion angles for each residue are predicted to directly construct the 3D structure. Figure from AlQuraishi (2019).101


ll
OPEN ACCESS Review

An alternative and intuitive approach to structure prediction is directly learning the mapping from sequence to structure with a neural network. AlQuraishi101 developed such an end-to-end differentiable protein structure predictor, called RGN, that allows direct prediction of torsion angles to construct the protein backbone (Figure 5B). RGN is a bidirectional LSTM that inputs a sequence, PSSM, and positional information and outputs predicted backbone torsions. Overall 3D structure predictions are within 1–2 Å of those made by top-ranked groups at CASP13, and this approach boasts a considerable advantage in prediction time compared with strategies that learn potentials. Moreover, the method does not use MSA-based information and could potentially be improved with the inclusion of evolutionary information. The RGN strategy is generalizable and well suited for protein structure prediction. Several generative methods (see below) also entail end-to-end structure prediction models, such as the CVAE framework used by AlphaFold, albeit with more limited success.22
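As a schematic of this end-to-end idea, the minimal sketch below maps per-residue input features to three backbone torsions with a bidirectional LSTM. All dimensions and names are illustrative assumptions, not AlQuraishi's implementation.

```python
import torch
import torch.nn as nn

class TorsionRGN(nn.Module):
    """Toy RGN-style model: per-residue features -> 3 backbone torsions."""
    def __init__(self, n_feats=41, hidden=128):  # e.g., 20 one-hot + 20 PSSM + position
        super().__init__()
        self.lstm = nn.LSTM(n_feats, hidden, batch_first=True, bidirectional=True)
        self.to_angles = nn.Linear(2 * hidden, 3)      # phi, psi, omega per residue

    def forward(self, feats):                          # feats: (batch, length, n_feats)
        out, _ = self.lstm(feats)
        return torch.pi * torch.tanh(self.to_angles(out))  # angles in (-pi, pi)

model = TorsionRGN()
angles = model(torch.randn(1, 120, 41))                # torsions for a 120-residue chain
```

In the full RGN, the predicted torsions are chained through differentiable geometry into Cartesian coordinates, so a structure-level loss (distance-based RMSD) can be back-propagated end to end.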
Related Applications
Side-chain prediction is required for homology modeling and various protein engineering tasks, such as fixed-backbone design. Side-chain prediction is often embedded in high-resolution structure prediction methods, traditionally with dead-end elimination167 or preferential sampling from backbone-dependent side-chain rotamer libraries.168 Liu et al.169 specifically trained a 3D CNN to evaluate the probability score for different potential rotamers. Du et al.170 adopted an energy-based model (EBM)171 to recover rotamers for backbone structures. Recent protein structure prediction models, such as Gao et al.'s163 RaptorX-Angle and Yang et al.'s103 trRosetta, predict structural features that help locate the positions of side-chain atoms as well.
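To illustrate the voxelized formulation used in several of these models, the toy network below scores candidate rotamers from a gridded local environment. The grid size, channel count, and rotamer library size are illustrative assumptions rather than any published architecture.

```python
import torch
import torch.nn as nn

class RotamerCNN(nn.Module):
    """Toy 3D CNN: voxelized local environment -> score per rotamer candidate."""
    def __init__(self, n_channels=4, n_rotamers=81):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv3d(n_channels, 16, 3, padding=1), nn.ReLU(),
            nn.MaxPool3d(2),
            nn.Conv3d(16, 32, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool3d(1),                 # pool to a single voxel
        )
        self.classify = nn.Linear(32, n_rotamers)

    def forward(self, grid):                         # grid: (batch, C, 20, 20, 20)
        return self.classify(self.features(grid).flatten(1))

# Probabilities over rotamer candidates for two voxelized environments.
probs = RotamerCNN()(torch.randn(2, 4, 20, 20, 20)).softmax(dim=-1)
```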
PPI prediction identifies residues at the interface of two proteins forming a complex. Once the interface residues are determined, a local search and scoring protocol can be used to determine the structure of the complex. Similar to protein folding, efforts have focused on learning to classify contact or not. For example, Townshend et al.96 developed a 3D CNN model (SASNet) that voxelizes the 3D environment around the target residue, and Fout et al.112 developed a GCN-based model with each interacting partner represented as a graph. Unlike those starting from the unbound structures, Zeng et al.172 reused the model trained on single-chain proteins (i.e., RaptorX-Contact) to predict PPIs with sequence information alone, resulting in RaptorX-Complex, which outperforms ECA-based methods at contact prediction. Another interesting approach directly compares the geometry of two protein patches. Gainza et al.116 trained their MaSIF model by minimizing the Euclidean distances between complementary surface patches on the two proteins while maximizing the distances between non-interacting surface patches. This step is followed by a quick nearest-neighbor scan to predict binding partners. The accuracy of MaSIF was comparable with traditional docking methods. However, MaSIF, similar to existing methods, showed low prediction accuracy for targets that involve conformational changes during binding.
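The underlying objective is standard metric learning. The sketch below shows a contrastive loss of this kind applied to precomputed patch descriptors; it is a simplified stand-in, not MaSIF's geometric deep learning architecture.

```python
import torch
import torch.nn.functional as F

def patch_contrastive_loss(desc_a, desc_b, interacting, margin=1.0):
    """Toy metric-learning objective: pull descriptors of complementary patches
    together, push non-interacting pairs at least `margin` apart.
    desc_a, desc_b: (batch, dim) patch descriptors from some encoder.
    interacting: (batch,) float mask, 1 = true binding pair, 0 = decoy."""
    dist = F.pairwise_distance(desc_a, desc_b)
    pos = interacting * dist.pow(2)                  # attract true pairs
    neg = (1 - interacting) * F.relu(margin - dist).pow(2)  # repel decoys
    return (pos + neg).mean()

loss = patch_contrastive_loss(torch.randn(8, 80), torch.randn(8, 80),
                              (torch.rand(8) > 0.5).float())
```

Once descriptors are trained this way, finding binding partners reduces to a fast nearest-neighbor search in descriptor space.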
Membrane proteins (MPs) are partially or fully embedded in a hydrophobic environment composed of a lipid bilayer and, consequently, they exhibit hydrophobic motifs on the surface, unlike the majority of proteins, which are water soluble. Wang et al.173 used a DL transfer learning framework comprising one-shot learning from non-MPs to MPs. They showed that transfer learning works surprisingly well here because the most frequently occurring contact patterns in soluble proteins and MPs are similar. Other efforts include classification of trans-membrane topology.174 Since experimental biophysical data are sparse for MPs, Alford and Gray175 compiled a collection of 12 diverse benchmark sets for membrane protein prediction and design, for testing and learning of implicit membrane energy models.
Loop modeling is a special case of structure prediction, where most of the 3D protein structure is given, but coordinates of segments of the polypeptide are missing and need to be completed. Loops are irregular and sometimes flexible segments, and thus their structures have been difficult to capture experimentally or computationally.176,177 So far, DL frameworks based on inter-residue distance prediction (similar to protein structure prediction)178 and those treating the distances between loop residues and the remaining residues as an image inpainting problem179 have been applied to loop modeling. Recently, Ruffolo et al.177 used a RaptorX-like network setup and a trRosetta geometric representation to predict the structure of antibody hypervariable complementarity-determining region (CDR) H3 loops, which are critical for antigen binding.

PROTEIN DESIGN

We divide the current DL approaches to protein design into two broad categories. The first uses knowledge of other sequences (either "all" sequenced proteins or a certain class of proteins) to design sequences directly (Table 4). These approaches are well suited to create new proteins with functionality matching existing proteins based on sequence information alone, in a manner similar to consensus design.180 The second class follows the "fold-before-function" scheme and seeks to stabilize specific 3D structures, perhaps but not necessarily with the intent to perform a desired function (Tables 5 and 6). The first approach can be described as function → sequence (structure agnostic), and the second approach fits the traditional stepwise inverse design: function → structure → sequence.
Many of the recent studies describe novel algorithms that output putative designed protein sequences, but only a few studies also present experimental validation. In traditional protein design studies, it is not uncommon for most designs to fail, and some of the early reports of protein designs were later withdrawn when the experimental evidence was not confirmed by others. As a result, it is usually expected that design studies offer rigorous experimental evidence. In this review, because we are interested in creative, emerging DL methods for design, we include papers that lack experimental validation, and many of these have in silico tests that help gauge validity. In addition, we make a special note of recent studies that present experimental validation of designs.

Direct Design of Sequence
Approaches that attempt to design sequences parallel work in the field of NLP, where an auto-regressive framework is common, most notably the RNN. In language processing, an RNN


Table 4. Generative Models to Identify Sequence from Function (Design for Function)

Model | Architecture | Output | Dataset | N_train | Performance | Citation
– | WGAN + AM | DNA | chromosome 1 of human hg38 | 4.6M | ~4 times stronger than training data in predicted TF binding | Killoran et al.181
– | VAE | AA | 5 protein families | – | natural mutation probability prediction rho = 0.58 | Sinai et al.93
– | LSTM | AA | ADAM, APD, DADP | 1,554 | predicted antimicrobial property 0.79 ± 0.25 (random: 0.63 ± 0.26) | Müller et al.55
PepCVAE | CVAE | AA | – | 15K labeled, 1.7M unlabeled | generated predicted AMPs at 83% (random, 28%; length, 30) | Das et al.64
FBGAN | WGAN | DNA | UniProt (res., 50) | 3,655 | predicted antimicrobial property over 0.9 after 60 epochs | Gupta et al.182
DeepSequence | VAE | AA | mutational scan data | 41 scans | aimed at mutation effect prediction; outperformed previous models | Riesselman et al.94
DbAS-VAE | VAE + AS | DNA | simulated data | – | predicted protein expression surpassed FB-GAN/VAE | Brookes et al.183
– | LSTM | musical scores | – | 56 betas + 38 alphas | generated proteins capture the secondary structure feature | Yu et al.184
BioSeqVAE | VAE | AA | UniProt | 200,000 | 83.7% reconstruction accuracy, 70.6% EC accuracy | Costello et al.185
– | WGAN | AA | antibiotic resistance determinants | 6,023 | 29% similar to training sequences (BLASTp) | Chhibbar et al.186
PEVAE | VAE | AA | 3 protein families | 31,062 | latent space captures phylogeny, ancestral relationships, and stability | Ding et al.92
– | ResNet | AA | mutation data + llama immune repertoire | 1.2M (nano) | predicted mutation effect reached state of the art; built a library of CDR3 seq | Riesselman et al.187
Vampire | VAE | AA | immuneACCESS | – | generated sequences predicted to be similar to real CDR3 sequences | Davidson et al.188
ProGAN | CGAN | AA | eSol | 2,833 | solubility prediction R2 improved from 0.41 to 0.45 | Han et al.189
ProteinGAN | GAN | AA | MDH from UniProt | 16,706 | 60 sequences tested in vitro: 19 soluble, 13 with catalytic activity | Repecka et al.190
CbAS-VAE | VAE + AS | AA | protein fluorescence dataset | 5,000 | predicted protein fluorescence surpassed FB-VAE/DbAS | Brookes et al.183

AA, amino acid sequence; AM, activation maximization; AS, adaptive sampling; CGAN, conditional generative adversarial network; CVAE, conditional variational autoencoder; DNA, DNA sequence; EC, enzyme commission.

model is able to take the beginning of a sentence and predict the next word in that sentence. Likewise, given a starting amino acid residue or a sequence of residues, a protein design model can output a categorical distribution over the 20 amino acid residues for the next position in the sequence. The next residue in the sequence is sampled from this categorical distribution, which in turn is used as the input to predict the following one. Following this approach, new sequences, sampled from the distribution of the training data, are generated, with the goal of having properties similar to those in the training set. Müller et al.55 first applied an LSTM RNN framework to learn sequence patterns of antimicrobial peptides (AMPs),204 a highly specialized sequence space of cationic, amphipathic helices. The same group then applied this framework to design membranolytic anticancer peptides.205 Twelve of the generated peptides were synthesized, and six of them killed MCF7 human breast adenocarcinoma cells with at least 3-fold selectivity against human erythrocytes. In another application, instead of traditional RNNs, Riesselman et al.187 used a residual causal dilated CNN206 in an auto-regressive way and generated a functional single-domain antibody library conditioned on the naive immune repertoires of llamas, although experimental validation was not presented. Such applications could potentially speed up and simplify the task of generating sequence libraries in the lab.
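The sampling loop itself is short. The sketch below implements the generic auto-regressive recipe described above, with an untrained LSTM standing in for a fitted model; all names and dimensions are illustrative.

```python
import torch
import torch.nn as nn

AA = "ACDEFGHIKLMNPQRSTVWY"                      # 20 amino acid alphabet

class AutoRegressiveLM(nn.Module):
    def __init__(self, hidden=64):
        super().__init__()
        self.embed = nn.Embedding(len(AA), hidden)
        self.lstm = nn.LSTM(hidden, hidden, batch_first=True)
        self.head = nn.Linear(hidden, len(AA))    # logits over the next residue

    def forward(self, tokens, state=None):
        h, state = self.lstm(self.embed(tokens), state)
        return self.head(h), state

model = AutoRegressiveLM()                        # untrained stand-in
seq, state = [0], None                            # start from 'A'
for _ in range(30):                               # grow a 31-residue peptide
    logits, state = model(torch.tensor([[seq[-1]]]), state)
    probs = logits[0, -1].softmax(dim=-1)         # categorical over 20 residues
    seq.append(torch.multinomial(probs, 1).item())
print("".join(AA[i] for i in seq))
```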
Another approach to sequence generation is mapping a latent space to the sequence space; common strategies to train such a mapping include AEs and GANs. As mentioned earlier, AEs are trained to learn a bidirectional mapping between a discrete design space (sequence) and a continuous real-valued space (latent space). Thus, many applications of AEs use the learnt latent representation to capture the sequence distribution of a specific class of proteins and, subsequently, to predict the effect of variations in sequence (or mutations) on protein function.92–94 The utility of this learned latent space, however, is more than that. A well-trained real-valued latent space can be used to interpolate between two training samples, or even


Table 5. Generative Models for Protein Structure Design

Model | Architecture | Representation | Dataset | N_train | Performance | Citation
– | DCGAN | Cα-Cα distances | PDB (16-, 64-, 128-residue fragments) | 115,850 | meaningful secondary structure, reasonable Ramachandran plot | Anand et al.24
RamaNet | GAN | torsion angles | ideal helical structures from PDB | 607 | generated torsions are concentrated around the helical region | Sabban et al.191
– | DCGAN | backbone distances | PDB (64-residue fragments) | 800,000 | smooth interpolations; recovery from sequence design and folding | Anand et al.68
Ig-VAE | VAE | coordinates and backbone distances | AbDb (antibody structures) | 10,768 | sampled 5,000 Igs screened for SARS-CoV-2 binders | Eguchi et al.192
– | CNN (input design) | same as trRosetta | – | – | 27 out of 129 sequence-structure pairs experimentally validated | Anishchenko et al.193

CNN, convolutional neural network; DCGAN, deep convolutional generative adversarial network; GAN, generative adversarial network; VAE, variational autoencoder.

extrapolate beyond the training data to yield novel sequences. One such example is the PepCVAE model.64 Following a semi-supervised learning approach, Das et al.64 trained a VAE model on an unlabeled dataset of 1.7 × 10⁶ sequences and then refined the model for the AMP subspace using a 15,000-sequence labeled dataset. By concatenating a conditional code indicating whether a peptide is antimicrobial, the CVAE framework allows efficient sampling of AMPs selectively from the broader peptide space. More than 82% of the generated peptides were predicted to exhibit antimicrobial properties according to a state-of-the-art AMP classifier.
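To illustrate what "using the latent space" means in practice, the sketch below decodes evenly spaced points between the latent codes of two sequences. The one-layer encoder and decoder are untrained placeholders, not the PepCVAE networks.

```python
import torch
import torch.nn as nn

L, V, Z = 30, 20, 16                         # peptide length, alphabet, latent dim
encode = nn.Linear(L * V, Z)                 # stand-in encoder (mean code only)
decode = nn.Linear(Z, L * V)                 # stand-in decoder

def interpolate(seq_a, seq_b, steps=5):
    """Decode evenly spaced points between the latent codes of two peptides."""
    za, zb = encode(seq_a.flatten()), encode(seq_b.flatten())
    out = []
    for t in torch.linspace(0, 1, steps):
        logits = decode((1 - t) * za + t * zb).view(L, V)
        out.append(logits.argmax(dim=-1))    # most likely residue per position
    return out

a = torch.eye(V)[torch.randint(V, (L,))]     # two random one-hot "peptides"
b = torch.eye(V)[torch.randint(V, (L,))]
intermediates = interpolate(a, b)
```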
Unlike AEs, GANs focus on learning the unidirectional mapping from a continuous real-valued space to the design space. In an early example, Killoran et al.181 developed a model that combines a standard GAN and activation maximization to design DNA sequences that bind to a specific protein. Repecka et al.190 trained ProteinGAN on the bacterial enzyme malate dehydrogenase (MDH) to generate new enzyme sequences that were active and soluble in vitro, some with over 100 mutations, with a 24% success rate. Another interesting GAN-based framework is Gupta and Zou's207 FeedBack GAN (FBGAN), which learns to generate cDNA sequences for peptides. They add a feedback-loop architecture to optimize the synthetic gene sequences for desired properties using an oracle (an external function analyzer). At every epoch, they update the positive training data for the discriminator with high-scoring sequences from the generator so that the score of generated sequences increases gradually. They demonstrated the efficacy of their model by successfully biasing generated sequences toward antimicrobial activity and a desired secondary structure.
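The feedback loop can be written compactly. In the sketch below, `generator`, `oracle`, and the positive dataset are placeholders for components that FBGAN trains or supplies; only the data-update logic is shown.

```python
def feedback_update(generator, oracle, positive_data, n_samples=64, threshold=0.8):
    """One FBGAN-style epoch step: replace the oldest positive examples with
    newly generated sequences that the oracle scores above `threshold`.
    `generator` returns one sequence per call; `oracle` maps sequence -> score."""
    candidates = [generator() for _ in range(n_samples)]
    accepted = [seq for seq in candidates if oracle(seq) > threshold]
    # Keep the dataset size fixed: drop the oldest entries, append the new ones.
    positive_data[:] = positive_data[len(accepted):] + accepted
    return len(accepted)
```

Over successive epochs, the discriminator's notion of "real" drifts toward oracle-approved sequences, which is what biases the generator toward the desired property.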
Design with Structure as Intermediate
Within the fold-before-function scheme, for design one first picks a protein fold or topology according to certain desirable properties, then determines the amino acid sequence that could fold into that structure (function → structure → sequence). Under the supervised learning setting, most efforts use the native sequences as the ground truth and the recovery rate of native sequences (i.e., the percentage of the sequence that matches the native one) as a success metric. To compare, Kuhlman and Baker208 reported sequence recovery rates of 51% for core residues and 27% for all amino acid residues using traditional de novo design approaches. Because the mapping from sequence to structure is not unique (within a neighborhood of each structure), it is not clear that higher sequence recovery rates would be meaningful.
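The recovery metric itself is simple (plain Python; the sequences below are toy examples):

```python
def sequence_recovery(designed: str, native: str) -> float:
    """Fraction of positions where the designed sequence matches the native."""
    assert len(designed) == len(native)
    matches = sum(d == n for d, n in zip(designed, native))
    return matches / len(native)

print(sequence_recovery("MKTAYIA", "MKTWYIA"))  # 6/7 ≈ 0.857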
A class of efforts, pioneered by the SPIN model,209 inputs a five-residue sliding window to predict the amino acid probabilities for the center position to generate sequences compatible with a desired structure. The features in such models include φ and ψ dihedrals, a sequence profile of a five-residue fragment derived from similar structures, and a rotamer-based energy profile of the target residue using the DFIRE potential. SPIN209 reached a 30.7% sequence recovery rate, and Wang et al.194 and O'Connell et al.'s25 SPIN2 further improved it to 34%. Another class of efforts inputs the voxelized local environment of an amino acid residue. In Zhang et al.'s197 and Shroff et al.'s198 models, the voxelized local environment was fed into a 3D CNN framework to predict the most stable residue type at the center of a region. Shroff et al.198 reported a 70% recovery rate, and the mutation sites were validated experimentally. Anand et al.202 trained a similar model to design sequences for a given backbone. Their protocol involves iteratively sampling from predicted conditional distributions, and it recovered from 33% to 87% of native sequence identities. They tested their model by designing sequences for five proteins, including a de novo TIM barrel. The designed sequences were 30%–40% identical to native sequences, and predicted structures were 2–5 Å root-mean-square deviation from the native conformation.
Other approaches generate full sequences conditioned on a target structure. Greener et al.195 trained a CVAE model to generate sequences conditioned on protein topology represented as a string.99 The resulting sequences were verified to be stable with molecular simulation. Karimi et al.210 developed gcWGAN, which combined a CGAN and a guidance strategy to bias the generated sequences toward a desired structure. They used a fast structure prediction algorithm211 as an "oracle"


Table 6. Generative Models to Identify Sequence from Structure (Protein Design)

Model | Architecture | Input | Dataset | N_train | Performance | Citation
SPIN | MLP | sliding window with 136 features | PISCES | 1,532 | sequence recovery of 30.7% on 1,532 proteins (CV) | Li et al.100
SPIN2 | MLP | sliding window with 190 features | PISCES | 1,532 | sequence recovery of 34.4% on 1,532 proteins (CV) | O'Connell et al.25
– | MLP | target residue and its neighbors as pairs | PDB | 10,173 | sequence recovery of 34% on 10,173 proteins | Wang et al.194
– | CVAE | string-encoded structure | PDB, MetalPDB | 3,785 | verified with structure prediction and dynamics simulation | Greener et al.195
SPROF | Bi-LSTM + 2D ResNet | 112 1D features + Cα distance map | PDB | 11,200 | sequence recovery of 39.8% on proteins | Chen et al.196
ProDCoNN | 3D CNN | gridded atomic coordinates | PDB | 17,044 | sequence recovery of 42.2% on 5,041 proteins | Zhang et al.197
– | 3D CNN | gridded atomic coordinates | PDB-REDO | 19,436 | sequence recovery of 70%; experimental validation of mutations | Shroff et al.198
ProteinSolver | Graph NN | partial sequence, adjacency matrix | UniParc | 72 × 10⁶ residues | sequence recovery of 35%; folding and MD tests with 4 proteins | Strokach et al.199
gcWGAN | CGAN | random noise + structure | SCOPe | 20,125 | diversity and TM score of prediction from designed sequences ≥ cVAE | Karimi et al.200
– | Graph Transformer | backbone structure as graph | CATH-based | 18,025 | perplexity: 6.56 (rigid), 11.13 (flexible); random: 20.00 | Ingraham et al.23
DenseCPD | ResNet | gridded backbone atomic density | PISCES | 2.6 × 10⁶ residues | sequence recovery of 54.45% on 500 proteins | Qi et al.201
– | 3D CNN | gridded atomic coordinates | PDB | 21,147 | sequence recovery from 33% to 87%; test with folding of TIM barrel | Anand et al.202
– | CNN (input design) | same as trRosetta | – | – | – | Norn et al.203

Bi-LSTM, bidirectional long short-term memory; CV, cross-validation; MLP, multi-layer perceptron.

to assess the output sequences and provide feedback to refine the model. They examined the model for six folds using Rosetta-based structure prediction, and gcWGAN had higher TM score distributions and more diverse sequence profiles than CVAE.195 Another notable experiment is Ingraham et al.'s23 graph transformer model, which inputs a structure, represented as a graph, and outputs the sequence profile. They treat the sequence design problem similar to a machine translation problem, i.e., a translation from structure to sequence. Like the original transformer model,57 they adopted an encoder-decoder framework with self-attention mechanisms to dynamically learn the relationship between information in two neighboring layers. They measured their results by perplexity, a widely used metric in speech recognition,212 and the per-residue perplexity (lower is better) for single chains was 9.15, lower than the perplexity for SPIN2 (12.86).
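For reference, per-residue perplexity is the exponentiated average negative log-likelihood that the model assigns to the native residues, so a uniform model over 20 amino acids scores 20.00. A minimal illustration:

```python
import torch
import torch.nn.functional as F

def per_residue_perplexity(logits, native):
    """logits: (length, 20) model scores; native: (length,) true residue ids.
    Perplexity = exp(mean negative log-likelihood); lower is better."""
    nll = F.cross_entropy(logits, native)     # mean NLL over residues
    return nll.exp().item()

logits = torch.randn(50, 20)                  # an untrained model's scores
print(per_residue_perplexity(logits, torch.randint(20, (50,))))  # poor, as expected
```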
Norn et al. treated the protein design problem as that of maximizing the probability of a sequence given a structure. They back-propagate through the trRosetta structure prediction network103 to find a sequence that minimizes the distance between the predicted structure and a desired structure.203 Norn et al. validate their designs computationally by showing that the generated sequences have deep wells in their modeled energy landscapes.
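A minimal sketch of this design-by-back-propagation idea, with a frozen linear map standing in for the trRosetta network (which in reality predicts distance and orientation distributions from sequence features):

```python
import torch

L = 60                                          # residues
predictor = torch.nn.Linear(L * 20, L * L)      # frozen stand-in for trRosetta
for p in predictor.parameters():
    p.requires_grad_(False)

target = torch.rand(L * L)                      # desired distance map (flattened)
seq_logits = torch.zeros(L, 20, requires_grad=True)   # relaxed (soft) sequence
opt = torch.optim.Adam([seq_logits], lr=0.1)

for step in range(200):
    opt.zero_grad()
    soft_seq = seq_logits.softmax(dim=-1)       # differentiable sequence
    pred = predictor(soft_seq.flatten())
    loss = (pred - target).pow(2).mean()        # match predicted to desired map
    loss.backward()                             # gradients flow into the sequence
    opt.step()

design = seq_logits.argmax(dim=-1)              # discretize only at the end
```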
Strokach et al. treated the design of a protein sequence given a target structure as a constraint satisfaction problem. They optimized their GNN architecture on the related problem of filling in a Sudoku puzzle, followed by training on millions of protein sequences corresponding to thousands of structural folds. They were able to validate designed sequences in silico and demonstrate that some designs folded to their target structures in vitro.213
An ambitious design goal is to generate new structures without specifying the target structure. Anand and Huang were the first to generate new structures using DL. They tested various representations (e.g., full atom, torsion-only) with a deep convolutional GAN (DCGAN) framework that generates sequence-agnostic, fixed-length short protein structural fragments.24 They found that the distance map of Cα atoms gives the most meaningful protein structures, although the asymmetry of the ψ and φ torsion angles214 was only recovered with torsion-based representations. Later they extended this work to all atoms in the backbone and combined it with a recovery network to avoid the time-consuming structure reconstruction process.68 They showed that some of the designed folds are stable in molecular simulation. In a more narrowly focused study, Eguchi et al.192 trained a VAE model on the structures of immunoglobulin (Ig) proteins, called Ig-VAE. By sampling the latent space, they

PATTER 1, December 11, 2020 15


ll
OPEN ACCESS Review

generated 5,000 new Ig structures (sequence-agnostic) and then screened them with computational docking to identify putative binders to the SARS-CoV-2 RBD.
Another approach exploits a DL structure prediction algorithm and a Markov chain Monte Carlo (MCMC) search to find sequences that fold into novel compact structures. Anishchenko et al.193 iterated sequences through the DL network, trRosetta,103 to "hallucinate"215 mutually compatible sequence-structure pairs in a manner similar to "input design".183 By maximizing the contrast between the distance distributions predicted by trRosetta and a background network trained on noise, they obtained new sequences whose predicted geometric maps have sharp features. Impressively, 27 of the 129 hallucinated sequences were experimentally validated to fold into monomeric, highly stable proteins with circular dichroism spectra compatible with the predicted structure.
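Generically, hallucination is a Metropolis search that rewards divergence from a background distribution. The toy version below assumes hypothetical callables `predict_distogram` and `background` in place of trRosetta and the noise-trained network.

```python
import math
import random

def kl_contrast(seq, predict_distogram, background):
    """Hypothetical score: KL divergence between predicted and background
    distograms; both callables are assumed to return flat bin probabilities."""
    p, q = predict_distogram(seq), background(seq)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def hallucinate(seq, predict_distogram, background, steps=1000, temp=0.1):
    alphabet = "ACDEFGHIKLMNPQRSTVWY"
    score = kl_contrast(seq, predict_distogram, background)
    for _ in range(steps):
        pos = random.randrange(len(seq))
        mutant = seq[:pos] + random.choice(alphabet) + seq[pos + 1:]
        new = kl_contrast(mutant, predict_distogram, background)
        # Metropolis criterion: always accept improvements, sometimes accept losses.
        if new > score or random.random() < math.exp((new - score) / temp):
            seq, score = mutant, new
    return seq
```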
OUTLOOK AND CONCLUSION

In this review, we have summarized the current state-of-the-art DL techniques applied to the problem of protein structure prediction and design. As in many other areas, DL shows the potential to revolutionize the field of protein modeling. While DL originated from computer vision, NLP, and machine learning, its fast development, combined with knowledge from operations research,216 game theory,65 and variational inference,32 among other fields, has resulted in many new and powerful frameworks to solve increasingly complex problems. The application of DL to biomolecular structure has just begun, and we expect to see more efforts on methodology development and applications in protein modeling and design. We observed several trends.

Experimental Validation
An important gap in current DL work in protein modeling, especially protein design (with few notable exceptions),190,193,198,205 is the lack of experimental validation. Past blind challenges, e.g., CASP and CAPRI, and design claims have shown that experimental validation in this field is of paramount importance, where computational models are still prone to error. A key next stage for this field is to engage collaborations between machine learning experts and experimental protein engineers to test and validate these emerging approaches.

Importance of Benchmarking
In other fields of machine learning, standardized benchmarks have triggered rapid progress.217–219 CASP is a great example that provides a standardized platform for benchmarking diverse algorithms, including emerging DL-based approaches. A well-defined question and proper evaluation (especially experimental) would lead to more open competition among a broader range of groups and, eventually, the innovation of more diverse and powerful algorithms.

Imposing a Physics-Based Prior
One common topic among the machine learning community is how to utilize existing domain knowledge to reduce the effort during training. Unlike certain classical ML problems, such as image classification, in protein modeling a wide range of biophysical principles restrict the range of plausible solutions. Some examples in related fields include imposing a physics-based model prior,220,221 adding a regularization term with physical meaning,222 and adopting a specific formula to conserve physical symmetry.223,224 Similarly, in protein modeling, well-established empirical observations can help restrict the solution space, such as the Ramachandran distribution of backbone torsion angles214 and the Dunbrack or Richardson libraries of side-chain conformations.225,226

Closed-Loop Design
The performance of DL methodologies relies heavily on the quality of data, but publicly available datasets may not cover important sample space because of experimental accessibility at the time of the experiments. Furthermore, a dataset may contain harmful noise from non-uniform experimental protocols and conditions. A possible solution may be to combine model training with experimental data generation. For instance, one may devise a closed-loop strategy to generate experimental data, on-the-fly, for queries (or model inputs) that are most likely to improve the model, and update the training dataset with the newly generated data.227–230 For such a strategy to be feasible, automated synthesis and characterization are necessary. As high-throughput synthesis and testing of proteins (or DNA and RNA) can be carried out in parallel, automation is possible. While such a strategy may seem far-fetched, automated platforms such as those from Ginkgo Bioworks or Transcriptic are already on the market.
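Schematically, such a loop alternates model refitting with acquisition of the most informative measurements. In the sketch below, `train`, `score`, and `measure` are user-supplied placeholders for the modeling and laboratory steps; all names are illustrative.

```python
def closed_loop(model, dataset, pool, train, score, measure, rounds=10, batch=96):
    """Toy active-learning loop. `train` refits the model, `score(model, x)`
    rates how informative candidate x would be (e.g., model uncertainty), and
    `measure(x)` runs the experiment and returns a label."""
    for _ in range(rounds):
        model = train(model, dataset)                  # refit on all data so far
        ranked = sorted(pool, key=lambda x: score(model, x), reverse=True)
        queries, pool = ranked[:batch], ranked[batch:]
        dataset += [(x, measure(x)) for x in queries]  # append new measurements
    return model, dataset
```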
Reinforcement Learning
Another approach to overcome the limitation of data availability is reinforcement learning (RL). Biologically meaningful data may be generated on-the-fly in simulated environments, such as the Foldit game. In the most famous application of RL, AlphaGo Zero,21 an RL agent (network) was able to learn and master the game by learning from the game environment alone. There are already some examples of RL in the fields of chemistry and electrical engineering to optimize organic molecules or computational chips.231–233 One suitable protein modeling problem for an RL algorithm would be training an artificial intelligence (AI) agent to make a series of "moves" to fold a protein, similar to the Foldit game.234,235 Such studies are still rare, and previous attempts have focused on folding the 2D hydrophobic-polar model of proteins.236,237 Although the results did not yet beat conventional methods, Gao238 recently explored using policy and reward networks in an RL scheme to fold 3D protein structures de novo by guiding the selection of MC moves in Rosetta. Angermueller et al.239 applied a model-based RL framework to designing sequences of AMPs and transcription factor binding sites.

Model Interpretability
One should keep in mind that a neural network represents nothing more (and nothing less) than a powerful and flexible regression model. In addition, due to their highly recursive nature, neural networks tend to be regarded as "black boxes", i.e., too complicated for practitioners to understand the resulting parameters and functions. Although model interpretability in ML


is a rapidly developing field, many popular approaches, such as saliency analysis240–242 for image classification models, are far from satisfactory.243 Although other approaches244,245 offer more reliable interpretations, their application to DL model interpretation has been largely missing in protein modeling. As a result, current DL models offer limited understanding of the complex patterns they learn.

Beyond Proteins
DL-based methods are general and so, with appropriate representation and sufficient training data, they can be applied to other molecules. Like proteins, nucleic acids, carbohydrates, and lipids are also polymers, composed of nucleotides, monosaccharides, and aliphatic subunits and head groups, respectively. Many approaches developed for learning protein sequence and structural information can be extended to these other classes of biomolecules.246,247 Finally, biology often conjugates these molecules, e.g., for glycoproteins. DL approaches that build up from basic chemistry, such as those being developed for small molecule drugs,248–251 may inspire methods to treat these biomolecules that do not fall into a strict polymer type.

The "Sequence → Structure → Function" Paradigm
We know from molecular biophysics that a sequence translates into function through the physical intermediary of a 3D molecular structure. Allosteric proteins,252 for instance, may exhibit different structural conformations under different physiological conditions (e.g., pH) or environmental stimuli (e.g., small molecules, inhibitors), reminding us that context is as important as protein sequence. That is, despite Anfinsen's42 hypothesis, sequence alone does not always fully determine the structure. Some proteins require chaperones to fold to their native structure, meaning that a sequence could result in non-native conformations when the kinetics of folding to the native structure may be unfavorable in the absence of a chaperone. Because many powerful DL algorithms in NLP operate on sequential data, it may seem reasonable to use protein sequences alone for training DL models. In principle, with a suitable framework and training, DL could disentangle the underlying relationships between sequence and structural elements. However, a careful selection of DL frameworks that are structure or mechanism-aware will accelerate learning and improve predictive power. Indeed, many successful DL frameworks applied so far (e.g., CNNs or graph CNNs) factor in the importance of learning on structural information.
Finally, with the hope of gaining insight into the fundamental science of biomolecules, there is a desire to link AI approaches to the underlying biochemical and biophysical principles that drive biomolecular function. For more practical purposes, a deeper understanding of underlying principles and hidden patterns that lead to pathology is important in the development of therapeutics. Thus, while efforts strictly limited to sequences are abundant, we believe that models with structural insights will play a more critical role in the future.

ACKNOWLEDGMENTS

This work was supported by the NIH through grant R01-GM078221. We thank Dr. Justin S. Smith at the Center for Nonlinear Studies at Los Alamos National Laboratory, NM, for helpful discussion and Dr. Andrew D. White at the Department of Chemical Engineering at University of Rochester, NY, and Alexander Rives at the Department of Computer Science at New York University, NY, for helpful suggestions. We are also grateful for insightful suggestions from the reviewers.

AUTHOR CONTRIBUTIONS

Conceptualization, W.G. and J.J.G.; Investigation, W.G. and S.P.M.; Writing – Original Draft, W.G.; Writing – Review & Editing, W.G., S.P.M., J.S., and J.J.G.; Funding Acquisition, J.J.G.; Resources, J.J.G.; Supervision, J.S. and J.J.G.

REFERENCES

1. Slabinski, L., Jaroszewski, L., Rodrigues, A.P., Rychlewski, L., Wilson, I.A., Lesley, S.A., and Godzik, A. (2007). The challenge of protein structure determination—lessons from structural genomics. Protein Sci. 16, 2472–2482.
2. Markwick, P.R.L., Malliavin, T., and Nilges, M. (2008). Structural biology by NMR: structure, dynamics, and interactions. PLoS Comput. Biol. 4, e1000168.
3. Jonic, S., and Vénien-Bryan, C. (2009). Protein structure determination by electron cryo-microscopy. Curr. Opin. Pharmacol. 9, 636–642.
4. Kryshtafovych, A., Schwede, T., Topf, M., Fidelis, K., and Moult, J. (2019). Critical assessment of methods of protein structure prediction (CASP)—Round XIII. Proteins 87, 1011–1020.
5. Hollingsworth, S.A., and Dror, R.O. (2018). Molecular dynamics simulation for all. Neuron 99, 1129–1143.
6. Ranjan, A., Fahad, M.S., Fernandez-Baca, D., Deepak, A., and Tripathi, S. (2019). Deep robust framework for protein function prediction using variable-length protein sequences. IEEE/ACM Trans. Comput. Biol. Bioinform. 17, 1648–1659.
7. Huang, P.S., Boyken, S.E., and Baker, D. (2016). The coming of age of de novo protein design. Nature 537, 320–327.
8. Yang, K.K., Wu, Z., and Arnold, F.H. (2019). Machine-learning-guided directed evolution for protein engineering. Nat. Methods 16, 687–694.
9. Bohr, H., Bohr, J., Brunak, S., Cotterill, J.R., Fredholm, H., Lautrup, B., and Petersen, S. (1990). A novel approach to prediction of the 3-dimensional structures of protein backbones by neural networks. FEBS Lett. 261, 43–46.
10. Schneider, G., and Wrede, P. (1994). The rational design of amino acid sequences by artificial neural networks and simulated molecular evolution: de novo design of an idealized leader peptidase cleavage site. Biophys. J. 66, 335–344.
11. Schneider, G., Schrödl, W., Wallukat, G., Müller, J., Nissen, E., Rönspeck, W., Wrede, P., and Kunze, R. (1998). Peptide design by artificial neural networks and computer-based evolutionary search. Proc. Natl. Acad. Sci. U S A 95, 12179–12184.
12. Ofran, Y., and Rost, B. (2003). Predicted protein-protein interaction sites from local sequence information. FEBS Lett. 544, 236–239.
13. Nielsen, M., Lundegaard, C., Worning, P., Lauemøller, S.L., Lamberth, K., Buus, S., Brunak, S., and Lund, O. (2003). Reliable prediction of T-cell epitopes using neural networks with novel sequence representations. Protein Sci. 12, 1007–1017.
14. LeCun, Y., Bengio, Y., and Hinton, G. (2015). Deep learning. Nature 521, 436–444.
15. Angermueller, C., Pärnamaa, T., Parts, L., and Stegle, O. (2016). Deep learning for computational biology. Mol. Syst. Biol. 12, 878.
16. Ching, T., Himmelstein, D.S., Beaulieu-Jones, B.K., Kalinin, A.A., Do, B.T., Way, G.P., Ferrero, E., Agapow, P.-M., Zietz, M., Hoffman, M.M., et al. (2018). Opportunities and obstacles for deep learning in biology and medicine. J. R. Soc. Interfaces 15, 20170387.
17. Mura, C., Draizen, E.J., and Bourne, P.E. (2018). Structural biology meets data science: does anything change? Curr. Opin. Struct. Biol. 52, 95–102.
18. Noé, F., De Fabritiis, G., and Clementi, C. (2020). Machine learning for protein folding and dynamics. Curr. Opin. Struct. Biol. 60, 77–84.
19. Guo, Y., Liu, Y., Oerlemans, A., Lao, S., Wu, S., and Lew, M.S. (2016). Deep learning for visual understanding: a review. Neurocomputing 187, 27–48.
20. Young, T., Hazarika, D., Poria, S., and Cambria, E. (2018). Recent trends in deep learning based natural language processing. IEEE Comput. Intelligence Mag. 13, 55–75.
21. Silver, D., Schrittwieser, J., Simonyan, K., Antonoglou, I., Huang, A., Guez, A., Hubert, T., Baker, L., Lai, M., Bolton, A., et al. (2017). Mastering the game of go without human knowledge. Nature 550, 354.
22. Senior, A.W., Evans, R., Jumper, J., Kirkpatrick, J., Sifre, L., Green, T., Qin, C., Žídek, A., Nelson, A.W.R., Bridgland, A., et al. (2019). Protein structure prediction using multiple deep neural networks in the 13th Critical Assessment of Protein Structure Prediction (CASP13). Proteins 87, 1141–1148.
23. Ingraham, J., Garg, V., Barzilay, R., and Jaakkola, T. (2019). Generative models for graph-based protein design. Adv. Neural Inf. Process. Syst. 15820–15831.
24. Anand, N., and Huang, P. (2018). Generative modeling for protein structures. Adv. Neural Inf. Process. Syst. 7494–7505.
25. O'Connell, J., Li, Z., Hanson, J., Heffernan, R., Lyons, J., Paliwal, K., Dehzangi, A., Yang, Y., and Zhou, Y. (2018). SPIN2: predicting sequence profiles from protein structures using deep neural networks. Proteins: Struct. Funct. Bioinformatics 86, 629–633.
26. Senior, A.W., Evans, R., Jumper, J., Kirkpatrick, J., Sifre, L., Green, T., Qin, C., Žídek, A., Nelson, A.W., Bridgland, A., et al. (2020). Improved protein structure prediction using potentials from deep learning. Nature, 1–5.
27. Li, Y., Huang, C., Ding, L., Li, Z., Pan, Y., and Gao, X. (2019). Deep learning in bioinformatics: introduction, application, and perspective in the big data era. Methods 166, 4–21.
28. Noé, F., Tkatchenko, A., Müller, K.-R., and Clementi, C. (2020). Machine learning for molecular simulation. Annu. Rev. Phys. Chem. 71, 361–390.
29. Graves, J., Byerly, J., Priego, E., Makkapati, N., Parish, S.V., Medellin, B., and Berrondo, M. (2020). A review of deep learning methods for antibodies. Antibodies 9, 12.
30. Kandathil, S.M., Greener, J.G., and Jones, D.T. (2019). Recent developments in deep learning applied to protein structure prediction. Proteins: Struct. Funct. Bioinformatics 87, 1179–1189.
31. Torrisi, M., Pollastri, G., and Le, Q. (2020). Deep learning methods in protein structure prediction. Comput. Struct. Biotechnol. J. 18, 1301–1310.
32. Kingma, D.P., and Welling, M. (2013). Auto-encoding variational Bayes. arXiv 1312, 6114.
33. Pauling, L., and Niemann, C. (1939). The structure of proteins. J. Am. Chem. Soc. 61, 1860–1867.
34. Kuhlman, B., and Bradley, P. (2019). Advances in protein structure prediction and design. Nat. Rev. Mol. Cell Biol. 20, 681–697.
35. UniProt Consortium (2019). UniProt: a worldwide hub of protein knowledge. Nucleic Acids Res. 47, D506–D515.
36. Kuhlman, B., Dantas, G., Ireton, G.C., Varani, G., Stoddard, B.L., and Baker, D. (2003). Design of a novel globular protein fold with atomic-level accuracy. Science 302, 1364–1368.
37. Fisher, M.A., McKinley, K.L., Bradley, L.H., Viola, S.R., and Hecht, M.H. (2011). De novo designed proteins from a library of artificial sequences function in Escherichia coli and enable cell growth. PLoS One 6, e15364.
38. Correia, B.E., Bates, J.T., Loomis, R.J., Baneyx, G., Carrico, C., Jardine, J.G., Rupert, P., Correnti, C., Kalyuzhniy, O., Vittal, V., et al. (2014). Proof of principle for epitope-focused vaccine design. Nature 507, 201.
39. King, N.P., Sheffler, W., Sawaya, M.R., Vollmar, B.S., Sumida, J.P., André, I., Gonen, T., Yeates, T.O., and Baker, D. (2012). Computational design of self-assembling protein nanomaterials with atomic level accuracy. Science 336, 1171–1174.
40. Tinberg, C.E., Khare, S.D., Dou, J., Doyle, L., Nelson, J.W., Schena, A., Jankowski, W., Kalodimos, C.G., Johnsson, K., Stoddard, B.L., et al. (2013). Computational design of ligand-binding proteins with high affinity and selectivity. Nature 501, 212–216.
41. Joh, N.H., Wang, T., Bhate, M.P., Acharya, R., Wu, Y., Grabe, M., Hong, M., Grigoryan, G., and DeGrado, W.F. (2014). De novo design of a transmembrane Zn2+-transporting four-helix bundle. Science 346, 1520–1524.
42. Anfinsen, C.B. (1973). Principles that govern the folding of protein chains. Science 181, 223–230.
43. Levinthal, C. (1968). Are there pathways for protein folding? J. Chim. Phys. 65, 44–45.
44. Li, B., Fooksa, M., Heinze, S., and Meiler, J. (2018). Finding the needle in the haystack: towards solving the protein-folding problem computationally. Crit. Rev. Biochem. Mol. Biol. 53, 1–28.
45. Dahiyat, B.I., and Mayo, S.L. (1997). De novo protein design: fully automated sequence selection. Science 278, 82–87.
46. Korendovych, I.V., and DeGrado, W.F. (2020). De novo protein design, a retrospective. Q. Rev. Biophys. 53. https://doi.org/10.1017/S0033583519000131.
47. Dougherty, M.J., and Arnold, F.H. (2009). Directed evolution: new parts and optimized function. Curr. Opin. Biotechnol. 20, 486–491.
48. Sun, R. (2019). Optimization for deep learning: theory and algorithms. arXiv 1912, 08957.
49. Schmidhuber, J. (2015). Deep learning in neural networks: an overview. Neural Networks 61, 85–117.
50. LeCun, Y., Boser, B.E., Denker, J.S., Henderson, D., Howard, R.E., Hubbard, W.E., and Jackel, L.D. (1990). Handwritten digit recognition with a back-propagation network. Adv. Neural Inf. Process. Syst. 396–404.
51. He, K., Zhang, X., Ren, S., and Sun, J. (2016). Deep residual learning for image recognition. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 770–778.
52. Jordan, M.I. (1997). Serial order: a parallel distributed processing approach. Adv. Psychol. 121, 471–495.
53. Hochreiter, S., and Schmidhuber, J. (1997). Long short-term memory. Neural Comput. 9, 1735–1780.
54. Cho, K., Van Merriënboer, B., Gulcehre, C., Bahdanau, D., Bougares, F., Schwenk, H., and Bengio, Y. (2014). Learning phrase representations using RNN encoder-decoder for statistical machine translation. arXiv 1406, 1078.
55. Müller, A.T., Hiss, J.A., and Schneider, G. (2018). Recurrent neural network model for constructive peptide design. J. Chem. Inf. Model. 58, 472–479.
56. Bahdanau, D., Cho, K.H., and Bengio, Y. (2015). Neural machine translation by jointly learning to align and translate. 3rd International Conference on Learning Representations, ICLR 2015—Conference Track Proceedings.
57. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., and Polosukhin, I. (2017). Attention is all you need. Adv. Neural Inf. Process. Syst. 2017, 5999–6009.
58. Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. (2018). BERT: pre-training of deep bidirectional transformers for language understanding. arXiv 1810, 04805.
59. Rives, A., Goyal, S., Meier, J., Guo, D., Ott, M., Zitnick, C.L., et al. (2019). Biological structure and function emerge from scaling unsupervised learning to 250 million protein sequences. bioRxiv, 622803. https://doi.org/10.1101/622803.

60. Pittala, S., and Bailey-Kellogg, C. (2020). Learning context-aware structural representations to predict antigen and antibody binding interfaces. Bioinformatics 36, 3996–4003.
61. Hinton, G.E., and Zemel, R.S. (1994). Autoencoders, minimum description length and Helmholtz free energy. Adv. Neural Inf. Process. Syst. 3–10.
62. Kingma, D.P., and Welling, M. (2019). An introduction to variational autoencoders. arXiv 1906, 02691.
63. Blei, D.M., Kucukelbir, A., and McAuliffe, J.D. (2017). Variational inference: a review for statisticians. J. Am. Stat. Assoc. 112, 859–877.
64. Das, P., Wadhawan, K., Chang, O., Sercu, T., Santos, C.D., Riemer, M., Chenthamarakshan, V., Padhi, I., and Mojsilovic, A. (2018). PepCVAE: semi-supervised targeted design of antimicrobial peptide sequences. arXiv 1810, 07743.
65. Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., et al. (2014). Generative adversarial nets. Adv. Neural Inf. Process. Syst. 2672–2680. https://papers.nips.cc/paper/5423-generative-adversarial-nets.
66. Arjovsky, M., Chintala, S., and Bottou, L. (2017). Wasserstein GAN. arXiv 1701, 07875.
67. Kurach, K., Lučić, M., Zhai, X., Michalski, M., and Gelly, S. (2019). A large-scale study on regularization and normalization in GANs. Int. Conf. Mach. Learn. 3581–3590.
68. Anand, N., Eguchi, R., and Huang, P.-S. (2019). Fully differentiable full-atom protein backbone generation. Int. Conf. Learn. Rep. 35. https://openreview.net/revisions?id=SJxnVL8YOV.
69. Niepert, M., Ahmed, M., and Kutzkov, K. (2016). Learning convolutional neural networks for graphs. Int. Conf. Mach. Learn. 2014–2023.
70. Luo, F., Wang, M., Liu, Y., Zhao, X.-M., and Li, A. (2019). DeepPhos: prediction of protein phosphorylation sites with deep learning. Bioinformatics 35, 2766–2773.
71. Li, F., Chen, J., Leier, A., Marquez-Lago, T., Liu, Q., Wang, Y., Revote, J., Smith, A.I., Akutsu, T., Webb, G.I., et al. (2020). DeepCleave: a deep learning predictor for caspase and matrix metalloprotease substrates and cleavage sites. Bioinformatics 36, 1057–1065.
72. Bengio, Y., Courville, A., and Vincent, P. (2013). Representation learning: a review and new perspectives. IEEE Trans. Pattern Anal. Mach. Intelligence 35, 1798–1828.
73. Romero, P.A., Krause, A., and Arnold, F.H. (2013). Navigating the protein fitness landscape with Gaussian processes. Proc. Natl. Acad. Sci. U S A 110, E193–E201.
74. Bedbrook, C.N., Yang, K.K., Rice, A.J., Gradinaru, V., and Arnold, F.H. (2017). Machine learning to design integral membrane channelrhodopsins for efficient eukaryotic expression and plasma membrane localization. PLoS Comput. Biol. 13, e1005786.
75. Ofer, D., and Linial, M. (2015). ProFET: feature engineering captures high-level protein functions. Bioinformatics 31, 3429–3436.
76. Kawashima, S., Pokarowski, P., Pokarowska, M., Kolinski, A., Katayama, T., and Kanehisa, M. (2007). AAindex: amino acid index database, progress report 2008. Nucleic Acids Res. 36, D202–D205.
77. Wang, S., Peng, J., Ma, J., and Xu, J. (2016). Protein secondary structure prediction using deep convolutional neural fields. Sci. Rep. 6, 18962.
78. Drori, I., Thaker, D., Srivatsa, A., Jeong, D., Wang, Y., Nan, L., Wu, F., Leggas, D., Lei, J., Lu, W., et al. (2019). Accurate protein structure prediction by embeddings and deep learning representations. arXiv 1911, 05531.
79. Mikolov, T., Chen, K., Corrado, G., and Dean, J. (2013). Efficient estimation of word representations in vector space. arXiv 1301, 3781.
80. Le, Q., and Mikolov, T. (2014). Distributed representations of sentences and documents. Int. Conf. Mach. Learn. 1188–1196.
81. Asgari, E., and Mofrad, M.R. (2015). Continuous distributed representation of biological sequences for deep proteomics and genomics. PLoS One 10, e0141287.
82. El-Gebali, S., Mistry, J., Bateman, A., Eddy, S.R., Luciani, A., Potter, S.C., Qureshi, M., Richardson, L.J., Salazar, G.A., Smart, A., et al. (2019). The Pfam protein families database in 2019. Nucleic Acids Res. 47, D427–D432.
83. Cai, C., Han, L., Ji, Z.L., Chen, X., and Chen, Y.Z. (2003). SVM-Prot: web-based support vector machine software for functional classification of a protein from its primary sequence. Nucleic Acids Res. 31, 3692–3697.
84. Aragues, R., Sali, A., Bonet, J., Marti-Renom, M.A., and Oliva, B. (2007). Characterization of protein hubs by inferring interacting motifs from protein interactions. PLoS Comput. Biol. 3, e178.
85. Yu, C., van der Schaar, M., and Sayed, A.H. (2016). Distributed learning for stochastic generalized Nash equilibrium problems. CoRR. https://doi.org/10.1109/TSP.2017.2695451.
86. Yang, K.K., Wu, Z., Bedbrook, C.N., and Arnold, F.H. (2018). Learned protein embeddings for machine learning. Bioinformatics 34, 2642–2648.
87. Alley, E.C., Khimulya, G., Biswas, S., AlQuraishi, M., and Church, G.M. (2019). Unified rational protein engineering with sequence-based deep representation learning. Nat. Methods 16, 1315–1322.
88. Krause, B., Lu, L., Murray, I., and Renals, S. (2016). Multiplicative LSTM for sequence modelling. arXiv 1609, 07959.
89. Heinzinger, M., Elnaggar, A., Wang, Y., Dallago, C., Nechaev, D., Matthes, F., and Rost, B. (2019). Modeling aspects of the language of life through transfer-learning protein sequences. BMC Bioinformatics 20, 723.
90. Peters, M.E., Neumann, M., Iyyer, M., Gardner, M., Clark, C., Lee, K., and Zettlemoyer, L. (2018). Deep contextualized word representations. arXiv 1802, 05365.
91. Brown, T.B., Mann, B., Ryder, N., Subbiah, M., Kaplan, J., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al. (2020). Language models are few-shot learners. arXiv 2005, 14165.
92. Ding, X., Zou, Z., and Brooks, C.L., III (2019). Deciphering protein evolution and fitness landscapes with latent space models. Nat. Commun. 10, 1–13.
93. Sinai, S., Kelsic, E., Church, G.M., and Nowak, M.A. (2017). Variational auto-encoding of protein sequences. arXiv 1712, 03346.
94. Riesselman, A.J., Ingraham, J.B., and Marks, D.S. (2018). Deep generative models of genetic variation capture the effects of mutations. Nat. Methods 15, 816–822.
95. Rao, R., Bhattacharya, N., Thomas, N., Duan, Y., Chen, P., Canny, J., et al. (2019). Evaluating protein transfer learning with TAPE. Adv. Neural Inf. Process. Syst. 9689–9701. http://papers.nips.cc/paper/9163-evaluating-protein-transfer-learning-with-tape.
96. Townshend, R., Bedi, R., and Dror, R.O. (2018). Generalizable protein interface prediction with end-to-end learning. arXiv 1807, 01297.
97. Simonovsky, M., and Meyers, J. (2020). DeeplyTough: learning structural comparison of protein binding sites. J. Chem. Inf. Model. 60, 2356–2366.
98. Kolodny, R., Koehl, P., Guibas, L., and Levitt, M. (2002). Small libraries of protein fragments model native protein structures accurately. J. Mol. Biol. 323, 297–307.
99. Taylor, W.R. (2002). A "periodic table" for protein structures. Nature 416, 657–660.
100. Li, J., and Koehl, P. (2014). 3D representations of amino acids–applications to protein sequence comparison and classification. Comput. Struct. Biotechnol. J. 11, 47–58.
101. AlQuraishi, M. (2019). End-to-end differentiable learning of protein structure. Cell Syst. 8, 292–301.e3.
102. Wang, S., Sun, S., Li, Z., Zhang, R., and Xu, J. (2017). Accurate de novo prediction of protein contact map by ultra-deep learning model. PLoS Comput. Biol. 13, e1005324.
103. Yang, J., Anishchenko, I., Park, H., Peng, Z., Ovchinnikov, S., and Baker, D. (2020). Improved protein structure prediction using predicted interresidue orientations. Proc. Natl. Acad. Sci. U S A 117, 1496–1503.
104. Brunger, A.T. (2007). Version 1.2 of the crystallography and NMR system. Nat. Protoc. 2, 2728.
105. Zhou, J., Cui, G., Zhang, Z., Yang, C., Liu, Z., and Sun, M. (2018). Graph neural networks: a review of methods and applications. arXiv 1812, 08434.
106. Ahmed, E., Saint, A., Shabayek, A., Cherenkova, K., Das, R., Gusev, G., Aouada, D., and Ottersten, B. (2018). Deep learning advances on different 3D data representations: a survey. arXiv 1, 01462.
107. Wu, Z., Pan, S., Chen, F., Long, G., Zhang, C., and Philip, S.Y. (2020). A comprehensive survey on graph neural networks. IEEE Trans. Neural Networks Learn. Syst. 1–21. https://ieeexplore.ieee.org/abstract/document/9046288.
108. Vishveshwara, S., Brinda, K., and Kannan, N. (2002). Protein structure: insights from graph theory. J. Theor. Comput. Chem. 1, 187–211.
109. Ying, Z., You, J., Morris, C., Ren, X., Hamilton, W., and Leskovec, J. (2018). Hierarchical graph representation learning with differentiable pooling. Adv. Neural Inf. Process. Syst. 4800–4810. https://papers.nips.cc/paper/7729-hierarchical-graph-representation-learning-with-differentiable-pooling.
110. Borgwardt, K.M., Ong, C.S., Schönauer, S., Vishwanathan, S., Smola, A.J., and Kriegel, H.-P. (2005). Protein function prediction via graph kernels. Bioinformatics 21, i47–i56.
111. Dobson, P.D., and Doig, A.J. (2003). Distinguishing enzyme structures from non-enzymes without alignments. J. Mol. Biol. 330, 771–783.
112. Fout, A., Byrd, J., Shariat, B., and Ben-Hur, A. (2017). Protein interface prediction using graph convolutional networks. Adv. Neural Inf. Process. Syst. 6530–6539. https://papers.nips.cc/paper/7231-protein-interface-prediction-using-graph-convolutional-networks.
113. Zamora-Resendiz, R., and Crivelli, S. (2019). Structural learning of proteins using graph convolutional neural networks. bioRxiv, 610444. https://www.biorxiv.org/content/10.1101/610444v1.
114. Gligorijevic, V., Renfrew, P.D., Kosciolek, T., Leman, J.K., Cho, K., Vatanen, T., et al. (2019). Structure-based function prediction using graph convolutional networks. bioRxiv, 786236. https://www.biorxiv.org/content/10.1101/786236v2.
115. Torng, W., and Altman, R.B. (2019). Graph convolutional neural networks for predicting drug-target interactions. J. Chem. Inf. Model. 59, 4131–4149.
116. Gainza, P., Sverrisson, F., Monti, F., Rodola, E., Boscaini, D., Bronstein, M., and Correia, B. (2020). Deciphering interaction fingerprints from protein molecular surfaces using geometric deep learning. Nat. Methods 17, 184–192.
117. Bronstein, M.M., Bruna, J., LeCun, Y., Szlam, A., and Vandergheynst, P. (2017). Geometric deep learning: going beyond Euclidean data. IEEE Signal. Process. Mag. 34, 18–42.
118. Nerenberg, P.S., and Head-Gordon, T. (2018). New developments in force fields for biomolecular simulations. Curr. Opin. Struct. Biol. 49, 129–138.
119. Derevyanko, G., Grudinin, S., Bengio, Y., and Lamoureux, G. (2018). Deep convolutional networks for quality assessment of protein folds. Bioinformatics 34, 4046–4053.
120. Best, R.B., Zhu, X., Shim, J., Lopes, P.E., Mittal, J., Feig, M., and MacKerell, A.D., Jr. (2012). Optimization of the additive CHARMM all-atom protein force field targeting improved sampling of the backbone φ, ψ and side-chain χ1 and χ2 dihedral angles. J. Chem. Theor. Comput. 8, 3257–3273.
121. Weiner, S.J., Kollman, P.A., Case, D.A., Singh, U.C., Ghio, C., Alagona, G., Profeta, S., and Weiner, P. (1984). A new force field for molecular mechanical simulation of nucleic acids and proteins. J. Am. Chem. Soc. 106, 765–784.
122. Alford, R.F., Leaver-Fay, A., Jeliazkov, J.R., O'Meara, M.J., DiMaio, F.P., Park, H., Shapovalov, M.V., Renfrew, P.D., Mulligan, V.K., Kappel, K., et al. (2017). The Rosetta all-atom energy function for macromolecular modeling and design. J. Chem. Theor. Comput. 13, 3031–3048.
123. Behler, J., and Parrinello, M. (2007). Generalized neural-network representation of high-dimensional potential-energy surfaces. Phys. Rev. Lett. 98, 146401.
124. Smith, J.S., Isayev, O., and Roitberg, A.E. (2017). ANI-1: an extensible neural network potential with DFT accuracy at force field computational cost. Chem. Sci. 8, 3192–3203.
125. Smith, J.S., Nebgen, B., Lubbers, N., Isayev, O., and Roitberg, A.E. (2018). Less is more: sampling chemical space with active learning. J. Chem. Phys. 148, 241733.
126. Schütt, K.T., Arbabzadah, F., Chmiela, S., Müller, K.R., and Tkatchenko, A. (2017). Quantum-chemical insights from deep tensor neural networks. Nat. Commun. 8, 1–8.
127. Schütt, K.T., Sauceda, H.E., Kindermans, P.-J., Tkatchenko, A., and Müller, K.-R. (2018). SchNet—a deep learning architecture for molecules and materials. J. Chem. Phys. 148, 241722.
128. Zhang, L., Han, J., Wang, H., Car, R., and Weinan, E. (2018). Deep potential molecular dynamics: a scalable model with the accuracy of quantum mechanics. Phys. Rev. Lett. 120, 143001.
129. Unke, O.T., and Meuwly, M. (2019). PhysNet: a neural network for predicting energies, forces, dipole moments, and partial charges. J. Chem. Theor. Comput. 15, 3678–3693.
130. Zubatyuk, R., Smith, J.S., Leszczynski, J., and Isayev, O. (2019). Accurate and transferable multitask prediction of chemical properties with an atoms-in-molecules neural network. Sci. Adv. 5, eaav6490.
131. Lahey, S.-L.J., and Rowley, C.N. (2020). Simulating protein-ligand binding with neural network potentials. Chem. Sci. 11, 2362–2368.
132. Wang, Z., Han, Y., Li, J., and He, X. (2020). Combining the fragmentation approach and neural network potential energy surfaces of fragments for accurate calculation of protein energy. J. Phys. Chem. B 124, 3027–3035.
133. Senn, H.M., and Thiel, W. (2009). QM/MM methods for biomolecular systems. Angew. Chem. Int. Ed. 48, 1198–1229.
134. Wang, Y., Fass, J., and Chodera, J.D. (2020). End-to-end differentiable molecular mechanics force field construction. arXiv. https://arxiv.org/abs/2010.01196.
135. Kmiecik, S., Gront, D., Kolinski, M., Wieteska, L., Dawid, A.E., and Kolinski, A. (2016). Coarse-grained protein models and their applications. Chem. Rev. 116, 7898–7936.
136. Zhang, L., Han, J., Wang, H., Car, R., and Weinan, E. (2018). DeePCG: constructing coarse-grained models via deep neural networks. J. Chem. Phys. 149, 034101.
137. Patra, T.K., Loeffler, T.D., Chan, H., Cherukara, M.J., Narayanan, B., and Sankaranarayanan, S.K. (2019). A coarse-grained deep neural network model for liquid water. Appl. Phys. Lett. 115, 193101.
138. Wang, J., Olsson, S., Wehmeyer, C., Pérez, A., Charron, N.E., De Fabritiis, G., Noé, F., and Clementi, C. (2019). Machine learning of coarse-grained molecular dynamics force fields. ACS Cent. Sci. 5, 755–767.
139. Wang, W., and Gómez-Bombarelli, R. (2019). Learning coarse-grained particle latent space with auto-encoders. Adv. Neural Inf. Process. Syst. 1.
140. Li, Z., Wellawatte, G.P., Chakraborty, M., Gandhi, H.A., Xu, C., and White, A.D. (2020). Graph neural network based coarse-grained mapping prediction. Chem. Sci. 11, 9524–9531.
141. Jones, D.T., Buchan, D.W., Cozzetto, D., and Pontil, M. (2011). PSICOV: precise structural contact prediction using sparse inverse covariance estimation on large multiple sequence alignments. Bioinformatics 28, 184–190.
142. Di Lena, P., Nagata, K., and Baldi, P. (2012). Deep architectures for protein contact map prediction. Bioinformatics 28, 2449–2457.
143. Eickholt, J., and Cheng, J. (2012). Predicting protein residue-residue contacts using deep networks and boosting. Bioinformatics 28, 3066–3072.

144. Seemayer, S., Gruber, M., and Söding, J. (2014). CCMpred—fast and precise prediction of protein residue-residue contacts from correlated mutations. Bioinformatics 30, 3128–3130.

145. Skwark, M.J., Raimondi, D., Michel, M., and Elofsson, A. (2014). Improved contact predictions using the recognition of protein like contact patterns. PLoS Comput. Biol. 10, e1003889.

146. Jones, D.T., Singh, T., Kosciolek, T., and Tetchner, S. (2014). MetaPSICOV: combining coevolution methods for accurate prediction of contacts and long range hydrogen bonding in proteins. Bioinformatics 31, 999–1006.

147. Xu, J. (2019). Distance-based protein folding powered by deep learning. Proc. Natl. Acad. Sci. U S A 116, 16856–16865.

148. Jones, D.T., and Kandathil, S.M. (2018). High precision in protein contact prediction using fully convolutional neural networks and minimal sequence features. Bioinformatics 34, 3308–3315.

149. Hanson, J., Paliwal, K., Litfin, T., Yang, Y., and Zhou, Y. (2018). Accurate prediction of protein contact maps by coupling residual two-dimensional bidirectional long short-term memory with convolutional neural networks. Bioinformatics 34, 4039–4045.

150. Kandathil, S.M., Greener, J.G., and Jones, D.T. (2019). Prediction of interresidue contacts with DeepMetaPSICOV in CASP13. Proteins 87, 1092–1099.

151. Hou, J., Wu, T., Cao, R., and Cheng, J. (2019). Protein tertiary structure modeling driven by deep learning and contact distance prediction in CASP13. Proteins 87, 1165–1178.

152. Zheng, W., Li, Y., Zhang, C., Pearce, R., Mortuza, S., and Zhang, Y. (2019). Deep-learning contact-map guided protein structure prediction in CASP13. Proteins 87, 1149–1164.

153. Wu, Q., Peng, Z., Anishchenko, I., Cong, Q., Baker, D., and Yang, J. (2020). Protein contact prediction using metagenome sequence data and residual neural networks. Bioinformatics 36, 41–48.

154. Marks, D.S., Colwell, L.J., Sheridan, R., Hopf, T.A., Pagnani, A., Zecchina, R., and Sander, C. (2011). Protein 3D structure computed from evolutionary sequence variation. PLoS One 6, e28766.

155. Ma, J., Wang, S., Wang, Z., and Xu, J. (2015). Protein contact prediction by integrating joint evolutionary coupling analysis and supervised learning. Bioinformatics 31, 3506–3513.

156. Remmert, M., Biegert, A., Hauser, A., and Söding, J. (2012). HHblits: lightning-fast iterative protein sequence searching by HMM-HMM alignment. Nat. Methods 9, 173–175.

157. Fariselli, P., Olmea, O., Valencia, A., and Casadio, R. (2001). Prediction of contact maps with neural networks and correlated mutations. Protein Eng. 14, 835–843.

158. Horner, D.S., Pirovano, W., and Pesole, G. (2007). Correlated substitution analysis and the prediction of amino acid structural contacts. Brief. Bioinform. 9, 46–56.

159. Monastyrskyy, B., d'Andrea, D., Fidelis, K., Tramontano, A., and Kryshtafovych, A. (2014). Evaluation of residue–residue contact prediction in CASP10. Proteins 82, 138–153.

160. Xu, J., and Wang, S. (2019). Analysis of distance-based protein structure prediction by deep learning in CASP13. Proteins 87, 1069–1081.

161. Moult, J., Fidelis, K., Kryshtafovych, A., Schwede, T., and Tramontano, A. (2018). Critical assessment of methods of protein structure prediction (CASP)—Round XII. Proteins 86, 7–15.

162. Wang, S., Li, W., Liu, S., and Xu, J. (2016). RaptorX-Property: a web server for protein structure property prediction. Nucleic Acids Res. 44, W430–W435.

163. Gao, Y., Wang, S., Deng, M., and Xu, J. (2018). RaptorX-Angle: real-value prediction of protein backbone dihedral angles through a hybrid method of clustering and deep learning. BMC Bioinformatics 19, 100.

164. AlQuraishi, M. (2019). AlphaFold at CASP13. Bioinformatics 35, 4862–4865.

165. Zemla, A., Venclovas, Č., Moult, J., and Fidelis, K. (1999). Processing and analysis of CASP3 protein structure predictions. Proteins 37, 22–29.

166. Kingma, D.P., Mohamed, S., Rezende, D.J., and Welling, M. (2014). Semi-supervised learning with deep generative models. Adv. Neural Inf. Process. Syst. 3581–3589.

167. Desmet, J., De Maeyer, M., Hazes, B., and Lasters, I. (1992). The dead-end elimination theorem and its use in protein side-chain positioning. Nature 356, 539–542.

168. Krivov, G.G., Shapovalov, M.V., and Dunbrack, R.L. (2009). Improved prediction of protein side-chain conformations with SCWRL4. Proteins 77, 778–795.

169. Liu, K., Sun, X., Ma, J., Zhou, Z., Dong, Q., Peng, S., Wu, J., Tan, S., Blobel, G., and Fan, J. (2017). Prediction of amino acid side chain conformation using a deep neural network. arXiv, 1707.08381.

170. Du, Y., Meier, J., Ma, J., Fergus, R., and Rives, A. (2020). Energy-based models for atomic-resolution protein conformations. arXiv, 2004.13167.

171. LeCun, Y., Chopra, S., Hadsell, R., Ranzato, M., and Huang, F. (2006). A Tutorial on Energy-Based Learning (Predicting Structured Data), p. 1.

172. Zeng, H., Wang, S., Zhou, T., Zhao, F., Li, X., Wu, Q., and Xu, J. (2018). ComplexContact: a web server for inter-protein contact prediction using deep learning. Nucleic Acids Res. 46, W432–W437.

173. Wang, S., Li, Z., Yu, Y., and Xu, J. (2017). Folding membrane proteins by deep transfer learning. Cell Syst. 5, 202–211.e3.

174. Tsirigos, K.D., Peters, C., Shu, N., Käll, L., and Elofsson, A. (2015). The TOPCONS web server for consensus prediction of membrane protein topology and signal peptides. Nucleic Acids Res. 43, W401–W407.

175. Alford, R.F., and Gray, J.J. (2020). Big data from sparse data: diverse scientific benchmarks reveal optimization imperatives for implicit membrane energy functions. Biophys. J. 118, 361a.

176. Stein, A., and Kortemme, T. (2013). Improvements to robotics-inspired conformational sampling in Rosetta. PLoS One 8, e63090.

177. Ruffolo, J.A., Guerra, C., Mahajan, S.P., Sulam, J., and Gray, J.J. (2020). Geometric potentials from deep learning improve prediction of CDR H3 loop structures. Bioinformatics 36, i268–i275.

178. Nguyen, S.P., Li, Z., Xu, D., and Shang, Y. (2017). New deep learning methods for protein loop modeling. IEEE/ACM Trans. Comput. Biol. Bioinform. 16, 596–606.

179. Li, Z., Nguyen, S.P., Xu, D., and Shang, Y. (2018). Protein loop modeling using deep generative adversarial network. Proceedings—International Conference on Tools with Artificial Intelligence (ICTAI), pp. 1085–1091.

180. Porebski, B.T., and Buckle, A.M. (2016). Consensus protein design. Protein Eng. Des. Select. 29, 245–251.

181. Killoran, N., Lee, L.J., Delong, A., Duvenaud, D., and Frey, B.J. (2017). Generating and designing DNA with deep generative models. arXiv, 1712.06148.

182. Gupta, A., and Zou, J. (2018). Feedback GAN (FBGAN) for DNA: a novel feedback-loop architecture for optimizing protein functions. arXiv, 1804.01694.

183. Brookes, D.H., Park, H., and Listgarten, J. (2019). Conditioning by adaptive sampling for robust design. arXiv, 1901.10060.

184. Yu, C.-H., Qin, Z., Martin-Martinez, F.J., and Buehler, M.J. (2019). A self-consistent sonification method to translate amino acid sequences into musical compositions and application in protein design using artificial intelligence. ACS Nano 13, 7471–7482.

185. Costello, Z., and Martin, H.G. (2019). How to hallucinate functional proteins. arXiv, 1903.00458.

186. Chhibbar, P., and Joshi, A. (2019). Generating protein sequences from antibiotic resistance genes data using generative adversarial networks. arXiv, 1904.13240.
187. Riesselman, A.J., Shin, J.-E., Kollasch, A.W., McMahon, C., Simon, E., Sander, C., Manglik, A., Kruse, A.C., and Marks, D.S. (2019). Accelerating protein design using autoregressive generative models. bioRxiv, 757252.

188. Davidsen, K., Olson, B.J., DeWitt, W.S., III, Feng, J., Harkins, E., Bradley, P., and Matsen IV, F.A. (2019). Deep generative models for T cell receptor protein sequences. eLife 8. https://doi.org/10.7554/eLife.46935.

189. Han, X., Zhang, L., Zhou, K., and Wang, X. (2019). ProGAN: protein solubility generative adversarial nets for data augmentation in DNN framework. Comput. Chem. Eng. 131, 106533.

190. Repecka, D., Jauniskis, V., Karpus, L., Rembeza, E., Zrimec, J., Poviloniene, S., et al. (2019). Expanding functional protein sequence space using generative adversarial networks. bioRxiv, 789719. https://doi.org/10.1101/789719.

191. Sabban, S., and Markovsky, M. (2020). RamaNet: computational de novo helical protein backbone design using a long short-term memory generative neural network. F1000Research 9, 298.

192. Eguchi, R.R., Anand, N., Choe, C.A., and Huang, P.-S. (2020). Ig-VAE: generative modeling of immunoglobulin proteins by direct 3D coordinate generation. bioRxiv, 242347. https://www.biorxiv.org/content/10.1101/2020.08.07.242347v1.

193. Anishchenko, I., Chidyausiku, T.M., Ovchinnikov, S., Pellock, S.J., and Baker, D. (2020). De novo protein design by deep network hallucination. bioRxiv, 211482. https://www.biorxiv.org/content/10.1101/2020.07.22.211482v1.

194. Wang, J., Cao, H., Zhang, J.Z., and Qi, Y. (2018). Computational protein design with deep learning neural networks. Sci. Rep. 8, 6349.

195. Greener, J.G., Moffat, L., and Jones, D.T. (2018). Design of metalloproteins and novel protein folds using variational autoencoders. Sci. Rep. 8, 1–12.

196. Chen, S., Sun, Z., Lin, L., Liu, Z., Liu, X., Chong, Y., Lu, Y., Zhao, H., and Yang, Y. (2019). To improve protein sequence profile prediction through image captioning on pairwise residue distance map. J. Chem. Inf. Model. 60, 391–399.

197. Zhang, Y., Chen, Y., Wang, C., Lo, C.-C., Liu, X., Wu, W., and Zhang, J. (2019). ProDCoNN: protein design using a convolutional neural network. Proteins 88, 819–829.

198. Shroff, R., Cole, A.W., Morrow, B.R., Diaz, D.J., Donnell, I., Gollihar, J., Ellington, A.D., and Thyer, R. (2019). A structure-based deep learning framework for protein engineering. bioRxiv, 833905.

199. Strokach, A., Becerra, D., Corbi-Verge, C., Perez-Riba, A., and Kim, P.M. (2019). Designing real novel proteins using deep graph neural networks. bioRxiv, 868935.

200. Karimi, M., Zhu, S., Cao, Y., and Shen, Y. (2019). De novo protein design for novel folds using guided conditional Wasserstein generative adversarial networks (gcWGAN). bioRxiv, 769919.

201. Qi, Y., and Zhang, J.Z. (2020). DenseCPD: improving the accuracy of neural-network-based computational protein sequence design with DenseNet. J. Chem. Inf. Model. 60, 1245–1252.

202. Anand, N., Eguchi, R.R., Derry, A., Altman, R.B., and Huang, P. (2020). Protein sequence design with a learned potential. bioRxiv, 895466.

203. Norn, C., Wicky, B.I., Juergens, D., Liu, S., Kim, D., Koepnick, B., et al. (2020). Protein sequence design by explicit energy landscape optimization. bioRxiv, 218917. https://doi.org/10.1101/2020.07.23.218917.

204. Waghu, F.H., Gopi, L., Barai, R.S., Ramteke, P., Nizami, B., and Idicula-Thomas, S. (2014). CAMP: collection of sequences and structures of antimicrobial peptides. Nucleic Acids Res. 42, D1154–D1158.

205. Grisoni, F., Neuhaus, C.S., Gabernet, G., Müller, A.T., Hiss, J.A., and Schneider, G. (2018). Designing anticancer peptides by constructive machine learning. ChemMedChem 13, 1300–1302.

206. Yu, F., and Koltun, V. (2015). Multi-scale context aggregation by dilated convolutions. arXiv, 1511.07122.

207. Gupta, A., and Zou, J. (2019). Feedback GAN for DNA optimizes protein functions. Nat. Mach. Intell. 1, 105–111.

208. Kuhlman, B., and Baker, D. (2000). Native protein sequences are close to optimal for their structures. Proc. Natl. Acad. Sci. U S A 97, 10383–10388.

209. Li, Z., Yang, Y., Faraggi, E., Zhan, J., and Zhou, Y. (2014). Direct prediction of profiles of sequences compatible with a protein structure by neural networks with fragment-based local and energy-based nonlocal profiles. Proteins 82, 2565–2573.

210. Karimi, M., Zhu, S., Cao, Y., and Shen, Y. (2020). De novo protein design for novel folds using guided conditional Wasserstein generative adversarial networks. J. Chem. Inf. Model. https://doi.org/10.1021/acs.jcim.0c00593.

211. Hou, J., Adhikari, B., and Cheng, J. (2017). DeepSF: deep convolutional neural network for mapping protein sequences to folds. Bioinformatics 34, 1295–1303.

212. Jelinek, F., Mercer, R.L., Bahl, L.R., and Baker, J.K. (1977). Perplexity—a measure of the difficulty of speech recognition tasks. J. Acoust. Soc. Am. 62, S63.

213. Strokach, A., Becerra, D., Corbi-Verge, C., Perez-Riba, A., and Kim, P. (2019). Fast and flexible design of novel proteins using graph neural networks. bioRxiv, 868935.

214. Ramachandran, G.N. (1963). Stereochemistry of polypeptide chain configurations. J. Mol. Biol. 7, 95–99.

215. (2015). Inceptionism: going deeper into neural networks. https://research.googleblog.com/2015/06/inceptionism-going-deeper-into-neural.html.

216. Sutton, R.S., and Barto, A.G. (2018). Reinforcement Learning: An Introduction (MIT Press).

217. Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., and Fei-Fei, L. (2009). ImageNet: a large-scale hierarchical image database. 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255.

218. Mayr, A., Klambauer, G., Unterthiner, T., and Hochreiter, S. (2016). DeepTox: toxicity prediction using deep learning. Front. Environ. Sci. 3, 80.

219. Brown, N., Fiscato, M., Segler, M.H., and Vaucher, A.C. (2019). GuacaMol: benchmarking models for de novo molecular design. J. Chem. Inf. Model. 59, 1096–1108.

220. Lutter, M., Ritter, C., and Peters, J. (2019). Deep Lagrangian networks: using physics as model prior for deep learning. arXiv, 1907.04490.

221. Greydanus, S., Dzamba, M., and Yosinski, J. (2019). Hamiltonian neural networks. Adv. Neural Inf. Process. Syst. 15379–15389.

222. Raissi, M., Perdikaris, P., and Karniadakis, G.E. (2019). Physics-informed neural networks: a deep learning framework for solving forward and inverse problems involving nonlinear partial differential equations. J. Comput. Phys. 378, 686–707.

223. Zepeda-Núñez, L., Chen, Y., Zhang, J., Jia, W., Zhang, L., and Lin, L. (2019). Deep Density: circumventing the Kohn-Sham equations via symmetry preserving neural networks. arXiv, 1912.00775.

224. Han, J., Li, Y., Lin, L., Lu, J., Zhang, J., and Zhang, L. (2019). Universal approximation of symmetric and anti-symmetric functions. arXiv, 1912.01765.

225. Shapovalov, M.V., and Dunbrack, R.L., Jr. (2011). A smoothed backbone-dependent rotamer library for proteins derived from adaptive kernel density estimates and regressions. Structure 19, 844–858.

226. Hintze, B.J., Lewis, S.M., Richardson, J.S., and Richardson, D.C. (2016). MolProbity's ultimate rotamer-library distributions for model validation. Proteins 84, 1177–1189.

227. Jensen, K.F., Coley, C.W., and Eyke, N.S. (2019). Autonomous discovery in the chemical sciences part I: progress. Angew. Chem. Int. Ed. 59, 2–38.

228. Coley, C.W., Eyke, N.S., and Jensen, K.F. (2019). Autonomous discovery in the chemical sciences part II: outlook. Angew. Chem. Int. Ed. 59, 2–25.

229. Coley, C.W., Thomas, D.A., Lummiss, J.A., Jaworski, J.N., Breen, C.P., Schultz, V., Hart, T., Fishman, J.S., Rogers, L., Gao, H., et al. (2019). A robotic platform for flow synthesis of organic compounds informed by AI planning. Science 365, eaax1566.
230. Barrett, R., and White, A.D. (2019). Iterative peptide modeling with active learning and meta-learning. arXiv, 1911.09103.

231. You, J., Liu, B., Ying, R., Pande, V., and Leskovec, J. (2018). Graph convolutional policy network for goal-directed molecular graph generation. Adv. Neural Inf. Process. Syst. 6410–6421.

232. Zhou, Z., Kearnes, S., Li, L., Zare, R.N., and Riley, P. (2019). Optimization of molecules via deep reinforcement learning. Sci. Rep. 9, 1–10.

233. Mirhoseini, A., Goldie, A., Yazgan, M., Jiang, J., Songhori, E., Wang, S., Lee, Y.-J., Johnson, E., Pathak, O., Bae, S., et al. (2020). Chip placement with deep reinforcement learning. arXiv, 2004.10746.

234. Cooper, S., Khatib, F., Treuille, A., Barbero, J., Lee, J., Beenen, M., Leaver-Fay, A., Baker, D., Popović, Z., and Foldit Players (2010). Predicting protein structures with a multiplayer online game. Nature 466, 756–760.

235. Koepnick, B., Flatten, J., Husain, T., Ford, A., Silva, D.-A., Bick, M.J., Bauer, A., Liu, G., Ishida, Y., Boykov, A., et al. (2019). De novo protein design by citizen scientists. Nature 570, 390–394.

236. Czibula, G., Bocicor, M.-I., and Czibula, I.-G. (2011). A reinforcement learning model for solving the folding problem. Int. J. Comput. Technol. Appl. 2, 171–182.

237. Jafari, R., and Javidi, M.M. (2020). Solving the protein folding problem in hydrophobic-polar model using deep reinforcement learning. SN Appl. Sci. 2, 259.

238. Gao, W. (2020). Development of a Protein Folding Environment for Reinforcement Learning, M.Sc. thesis (Johns Hopkins University).

239. Angermueller, C., Dohan, D., Belanger, D., Deshpande, R., Murphy, K., and Colwell, L. (2020). Model-Based Reinforcement Learning for Biological Sequence Design (ICLR 2020 Conference). https://openreview.net/forum?id=HklxbgBKvr.

240. Zeiler, M.D., and Fergus, R. (2014). Visualizing and understanding convolutional networks. Eur. Conf. Comput. Vis. 818–833.

241. Smilkov, D., Thorat, N., Kim, B., Viégas, F., and Wattenberg, M. (2017). SmoothGrad: removing noise by adding noise. arXiv, 1706.03825.

242. Sundararajan, M., Taly, A., and Yan, Q. (2017). Axiomatic attribution for deep networks. Proceedings of the 34th International Conference on Machine Learning 70, 3319–3328.

243. Adebayo, J., Gilmer, J., Muelly, M., Goodfellow, I., Hardt, M., and Kim, B. (2018). Sanity checks for saliency maps. Adv. Neural Inf. Process. Syst. 9505–9515.

244. Shrikumar, A., Greenside, P., and Kundaje, A. (2017). Learning important features through propagating activation differences. arXiv, 1704.02685.

245. Lundberg, S.M., and Lee, S.-I. (2017). A unified approach to interpreting model predictions. Proceedings of the 31st International Conference on Neural Information Processing Systems, pp. 4768–4777.

246. Hannon, G.J. (2002). RNA interference. Nature 418, 244–251.

247. Zhang, P., Woen, S., Wang, T., Liau, B., Zhao, S., Chen, C., Yang, Y., Song, Z., Wormald, M.R., Yu, C., et al. (2016). Challenges of glycosylation analysis and control: an integrated approach to producing optimal and consistent therapeutic drugs. Drug Discov. Today 21, 740–765.

248. Sanchez-Lengeling, B., and Aspuru-Guzik, A. (2018). Inverse molecular design using machine learning: generative models for matter engineering. Science 361, 360–365.

249. Coley, C.W., Jin, W., Rogers, L., Jamison, T.F., Jaakkola, T.S., Green, W.H., Barzilay, R., and Jensen, K.F. (2019). A graph-convolutional neural network model for the prediction of chemical reactivity. Chem. Sci. 10, 370–377.

250. Yang, K., Swanson, K., Jin, W., Coley, C., Eiden, P., Gao, H., Guzman-Perez, A., Hopper, T., Kelley, B., Mathea, M., et al. (2019). Analyzing learned molecular representations for property prediction. J. Chem. Inf. Model. 59, 3370–3388.

251. Gao, W., and Coley, C.W. (2020). The synthesizability of molecules proposed by generative models. J. Chem. Inf. Model. https://doi.org/10.1021/acs.jcim.0c00174.

252. Langan, R.A., Boyken, S.E., Ng, A.H., Samson, J.A., Dods, G., Westbrook, A.M., Nguyen, T.H., Lajoie, M.J., Chen, Z., Berger, S., et al. (2019). De novo design of bioactive protein switches. Nature 572, 205–210.