I-TASSER-MTD: a deep-learning-based platform for multi-domain protein structure and function prediction zhang server
https://doi.org/10.1038/s41596-022-00728-0
function annotation in a fully automated pipeline. Advanced deep-learning models have been incorporated into each of the
steps to enhance both the domain modeling and inter-domain assembly accuracy. The protocol allows for the
incorporation of experimental cross-linking data and cryo-electron microscopy density maps to guide the multi-domain
structure assembly simulations. I-TASSER-MTD is built on I-TASSER but substantially extends its ability and accuracy in
modeling large multi-domain protein structures and provides meaningful functional insights for the targets at both the
domain- and full-chain levels from the amino acid sequence alone.
Introduction
Much progress has been made in protein structure prediction as a result of decades of effort1–4. The
progress has been particularly notable in recent years owing to the introduction of coevolution-based
contact prediction5–7 and deep neural-network learning techniques8–10. In particular, the end-to-end
sequence-to-structure training approaches, such as AlphaFold2 (ref. 11), built on the attention and
equivariant transformer networks, have achieved unprecedented modeling accuracy in the protein
structure prediction as witnessed in the recent CASP14 experiment12. However, most of the advanced
methods have mainly focused on the modeling of individual domain structures, which are the
minimum folding units of proteins that fold and function independently. In fact, more than two-
thirds of prokaryotic proteins and four-fifths of eukaryotic proteins contain two or more domains13,
where many proteins perform higher-level cellular functions through cooperative domain interac-
tions14,15. Therefore, determining the full-length structures of multi-domain proteins is a crucial step
towards elucidating their full functions and designing new drugs to regulate these functions.
A common approach for multi-domain protein structure modeling is to split the query sequence
into domains and generate models for each individual domain separately16,17. The individual domain
models are subsequently assembled into full-length models, usually under the guidance of other
homologous multi-domain proteins from the Protein Data Bank (PDB)18. However, many multi-
domain proteins have been solved only as single-domain proteins, and just 35.3% of proteins in the
PDB contain multi-domain structures. The lack of homologous multi-domain structures makes
the template-based domain-assembly approach infeasible for most multi-domain protein targets. On
the other hand, template-free (or ab initio) domain structure assembly is challenging, owing to the
fact that multi-domain proteins have a high degree of freedom in domain-orientation space and we
do not have reliable force fields to accommodate the domain–domain interactions. In a recent study,
1Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, MI, USA. 2College of Information Engineering,
Zhejiang University of Technology, Hangzhou, China. 3Department of Biological Chemistry, University of Michigan, Ann Arbor, MI, USA.
[Fig. 1 graphic: flowchart with the labels Query sequence; Sequence database; MSA; Structure database; LOMETS2 threading; Threading templates; Template fragments; DeepPotential (50 residual blocks); Spatial restraints; FUpred and ThreaDom (folding unit detection); D-I-TASSER (clustering; individual domain models); DEMO (multi-domain library; analogous templates; domain assembly); COFACTOR (BioLiP structure–function database; PPI–function database; function prediction: EC number, GO terms (CC, BP, MF), binding site); decision nodes ‘Hard target or coverage ≤Cov?’ and ‘Multi-domain and coverage <Cov?’.]
Fig. 1 | Overview of the I-TASSER-MTD protocol for multi-domain protein structure and function prediction. Cov is the cutoff of the alignment
coverage for assessing if the query needs to be modeled as a single-unit or multi-unit target, where a unit can contain a single domain or multiple
domains if the latter is fully covered by the LOMETS2 threading alignments. CC, BP and MF represent, respectively, the cellular component, biological
process and molecular function in GO. It should be noted that, following their implementation on full-length sequence, LOMETS2 and DeepPotential
will be run again in D-I-TASSER to generate templates/restraints for each individual domain if the query is deemed as a multi-unit target.
in Fig. 1. Starting from the query sequence, multiple threading templates are first collected from the PDB by
LOMETS2 (ref. 42), a meta-server approach that combines up to 11 sequence/hidden Markov model
profile-based methods and deep-learning threading programs, and domain boundaries
are predicted by FUpred22 and ThreaDom20,43. Meanwhile, residue–residue spatial restraints are
created by DeepPotential44 through residual convolutional network training. If the query sequence is
deemed a multi-domain protein by FUpred or ThreaDom, and none of the top ten template alignments
can cover all domains (i.e., one or more domains have an alignment coverage below the cutoff
Cov = 95%), the ‘multi-domain assembly’ mode will be initiated, where models for each domain will
be independently constructed by D-I-TASSER30, a crucially improved version of I-TASSER powered
by the DeepPotential spatial restraints. Subsequently, the domain models will be assembled into the
full-length model by DEMO16 based on the structurally analogous templates. Otherwise, if the query
is deemed a single-domain protein or one or more top template alignments can cover all domains, the
‘multi-domain assembly’ mode will be turned off and the full-length structure will be directly
modeled by D-I-TASSER. Finally, the protein function annotations, including the Enzyme Com-
mission (EC) numbers, GO terms and ligand-binding sites are predicted by COFACTOR45 for all
individual domains and the full-chain protein, based on the modeled structures, sequences and
protein–protein interactions (PPIs). The main individual programs employed in the I-TASSER-MTD
pipeline are listed in Supplementary Table 1. Next, we briefly describe the main components of the
I-TASSER-MTD pipeline.
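The mode-selection rule described above can be sketched in a few lines of Python. This is only an illustration of the stated logic, not the server's code: the function name, the input layout (one list of per-domain coverages for each top template) and the treatment of a coverage exactly equal to Cov are assumptions.

```python
from typing import Sequence

COV_CUTOFF = 0.95  # alignment coverage cutoff "Cov" stated in the text


def use_multi_domain_assembly(
    n_domains: int,
    template_domain_coverages: Sequence[Sequence[float]],
    cov: float = COV_CUTOFF,
) -> bool:
    """Return True if the 'multi-domain assembly' mode would be triggered.

    n_domains: number of predicted domains (e.g., from FUpred or ThreaDom).
    template_domain_coverages: one entry per top template (up to ten); each
        entry lists the alignment coverage (0 to 1) of every predicted domain.
        This input layout is hypothetical.
    """
    if n_domains < 2:
        return False  # single-domain query: the full chain is modeled directly
    for coverages in template_domain_coverages:
        # A template "covers all domains" if no domain falls below the cutoff.
        # Treating coverage == cov as covered is an assumption of this sketch.
        if all(c >= cov for c in coverages):
            return False
    return True
```

For a two-domain query, a template list in which every template leaves at least one domain below 95% coverage returns True, whereas a single template covering both domains switches the mode off.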
assembly simulations (Supplementary Fig. 5). For each domain, LOMETS2 is used again to identify
domain-level structural templates from a nonredundant PDB structural library. Meanwhile, distance
maps, hydrogen-bonding networks4 and inter-residue torsion angles are predicted by DeepPotential.
The contact maps are also predicted by four deep-learning-based methods (ResPre10, DeepPLM31,
ResTriplet62 and TripletRes63) and a naïve Bayes-based contact predictor (NeBcon64) (see the
description and benchmark results reported in Supplementary Note 1). Then, the domain-level
structure models are constructed using REMC simulations under the guidance of the deep-learning-
based spatial restraints and the I-TASSER potential4, the latter of which contains generic statistical
potentials and threading template-based restraints. Five REMC simulations are performed in parallel
for each protein, where the structural decoys from 8 (or 3 for hard targets) low-temperature replicas
are clustered by SPICKER65. Finally, the decoy of the center of the largest cluster is selected as the
final model, and the side-chains of the final model are then repacked by FASPR66, which is further
refined by FG-MD67. It should be noted that some programs (e.g., LOMETS2 and DeepPotential) that
have been implemented at full-chain level are performed again in this step to create D-I-TASSER
models for each domain. As the programs and the D-I-TASSER force field have been trained mainly
on the domain level and many template structures solved in the PDB contain only single domains,
the domain-level modeling often generates more reliable results than that starting from the full-chain
sequences; this is also one of the major motivations for the development of the I-TASSER-MTD pipeline.
The first term in Equation (1) evaluates the degree of convergence of the domain assembly
simulations, where Mtot is the total number of full-length decoys generated in the domain assembly
simulations, M(k) is the number of structure decoys with root-mean-square deviation (RMSD) <1.5 Å
to the kth full-length model and $\langle \mathrm{RMSD} \rangle_k$ denotes the average RMSD between these decoys and the
kth reported model. The second term assesses the quality of the full-length template, where T-score(i)
is the template score of the ith full-length template, which is calculated as the harmonic mean of the
TM-scores between the domain models and the full-length template that is used for DEMO-based
domain assembly, and T-score0 = 0.85 is the cutoff used to distinguish good from bad templates. The
third term assesses how closely the distances in the reported model match the predicted distances by
DeepPotential, where T is the number of predicted inter-domain distances used to guide the domain
assembly, and $d_t^{pre}$ and $d_t^{model}(k)$ are the distances of the tth residue pair in the predicted distance map
and the kth reported model, respectively. The fourth term accounts for the domain–domain interface
satisfaction rate of the predicted interface map in the reported model, where $N(I^{pre})$ is the number of
predicted domain–domain interfaces and $O(I^{pre}, I^{model})_k$ is the number of overlapped interfaces
between the predicted interface map and the kth reported model. As restraints in the third and fourth
terms are predicted using MSAs, wneff is a weight associated with the quality of the MSA and
calculated on the basis of the number of effective sequences (neff; Supplementary Eq. (S12)). Finally,
the fifth term accounts for the quality of individual domain models from D-I-TASSER, where Ndom is
the total number of domains and $\text{eTM-score}_{dom}(D)$ is the estimated TM-score of the Dth domain
model from D-I-TASSER (Supplementary Note 3). w1 = 0.065, w2 = 0.063, w3 = −0.08, w4 = 0.01,
w5 = 0.96 and w6 = 0.1 are the weighting factors, which are optimized using an improved differential
evolution algorithm72–74 to minimize the error between the eTM-score and the actual TM-score of
the decoys to the native structure on the DEMO training set of 425 nonredundant multi-domain
proteins. In addition, the estimated RMSD (eRMSD) of I-TASSER-MTD models is also calculated
on the basis of the same terms in Equation 1 but with an additional term to count for the protein
length (L) (Eq. (S14) in Supplementary Note 4), where the weighting factors for eRMSD are w1 = −1.40,
w2 = −2.74, w3 = 4.78, w4 = −1.19, w5 = −16.43, w6 = 0.0 and w7 = 2.66. Meanwhile, we define a new
score to quantitatively assess the relative populations of the assembled conformations for the kth
reported model by
$$\text{P-score}(k) = \frac{p(k)}{\sum_{k} p(k)} \qquad (2)$$
where $p(k) = M(k)/M_{tot}$ is the normalized number of structural decoys of the kth model in the I-TASSER-MTD
assembly simulations.
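As a worked illustration of Equation (2), the following minimal Python sketch computes the P-score from the decoy counts M(k) of the reported models. The function name and the example counts are hypothetical; only the formula is taken from the text.

```python
from typing import List, Sequence


def p_scores(decoy_counts: Sequence[int], m_tot: int) -> List[float]:
    """Compute P-score(k) = p(k) / sum_k p(k), with p(k) = M(k) / M_tot.

    decoy_counts: M(k) for each reported model k.
    m_tot: total number of full-length decoys from the assembly simulations.
    """
    if m_tot <= 0:
        raise ValueError("m_tot must be positive")
    p = [m / m_tot for m in decoy_counts]  # normalized decoy numbers p(k)
    total = sum(p)
    if total == 0:
        raise ValueError("at least one model needs a nonzero decoy count")
    return [value / total for value in p]


# Hypothetical example: three reported models out of 1,000 decoys.
scores = p_scores([300, 150, 50], m_tot=1000)
```

Because every p(k) is divided by the same M_tot, the P-scores depend only on the relative decoy counts of the reported models and sum to 1.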
The accuracy of the eTM-score and eRMSD was examined on the DEMO benchmark set, which
includes 356 multi-domain proteins with different domain types that are nonhomologous to the
DEMO training dataset. As shown in Supplementary Fig. 9, the eTM-score has a high Pearson
correlation coefficient (PCC = 0.85) with the actual TM-score, where the average error of the
eTM-score is 0.07. Compared with the eTM-score, the eRMSD and RMSD have a slightly lower
correlation (PCC = 0.82), where the average error between eRMSD and RMSD (2.2 Å) is relatively
high (see the distribution in Supplementary Fig. 9d). It should be noted that RMSD is not the best
measurement for the accuracy of predicted models when the modeling accuracy is low; as all residue
pairs have the same weight in the RMSD calculation, this renders the RMSD value sensitive to local
variations, such as tails or loops, rather than the global fold. In this regard, it is recommended to use
TM-score as a more reliable measurement for model accuracy assessment; because the residue pairs
with smaller distance errors are weighted more heavily than the residues with larger errors in the TM-
score calculation, the TM-score value is generally more sensitive to the accuracy of the global fold of
the predicted models68.
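The different sensitivity of the two measures can be illustrated with a small Python sketch. It uses the standard TM-score definition, TM = (1/L) Σ 1/(1 + (d_i/d0)^2) with d0 = 1.24 (L − 15)^(1/3) − 1.8, and assumes that the per-residue distances d_i are already given for one fixed superposition (the real TM-score maximizes over superpositions). The function names, the guard for short chains and the toy distances are assumptions made for illustration.

```python
import math
from typing import Sequence


def rmsd(distances: Sequence[float]) -> float:
    """RMSD over aligned residue pairs; every pair has the same weight."""
    return math.sqrt(sum(d * d for d in distances) / len(distances))


def tm_score_from_distances(distances: Sequence[float]) -> float:
    """TM-score for one fixed superposition (no search over superpositions)."""
    length = len(distances)
    if length <= 21:
        # Very short chains need special handling of d0; not covered here.
        raise ValueError("this sketch assumes more than 21 residues")
    d0 = 1.24 * (length - 15) ** (1.0 / 3.0) - 1.8
    # Pairs with small distance errors contribute close to 1, large errors ~0.
    return sum(1.0 / (1.0 + (d / d0) ** 2) for d in distances) / length


# Toy model: 95 well-placed residues (1 Å) and a 5-residue tail off by 30 Å.
toy = [1.0] * 95 + [30.0] * 5
toy_rmsd = rmsd(toy)
toy_tm = tm_score_from_distances(toy)
```

In this toy case the five tail residues dominate the RMSD, while the TM-score remains high because the 95 well-placed residues still contribute almost fully.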
As shown in Supplementary Fig. 9a, if we use an eTM-score cutoff of 0.5 to select models with
correct global topologies, both the false-negative and false-positive rates are <0.15, indicating that the
fold-level prediction by the eTM-score is correct in >85% of the cases. As illustrative examples, six
models from two targets are included in Supplementary Fig. 10, showing that the eTM-score is highly
correlated with the actual TM-score of the models in different ranges of model quality. In Supplementary
Fig. 11, we also show that the eTM-score, and thus the model quality, decreases as more
random mutations are introduced to the target sequence of PDBID 1we3F.
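For readers who want to run the same kind of fold-level check on their own benchmark, a minimal Python sketch is given below. It assumes the conventional definitions (false-positive rate = FP/(FP + TN), false-negative rate = FN/(FN + TP)), which the text does not spell out, and it treats a model as correctly folded when its score is >0.5. The function name and the six score pairs are invented toy values, not benchmark data.

```python
from typing import Sequence, Tuple


def fold_error_rates(
    estimated: Sequence[float], actual: Sequence[float], cutoff: float = 0.5
) -> Tuple[float, float]:
    """Return (false-positive rate, false-negative rate) at the given cutoff."""
    tp = fp = tn = fn = 0
    for e_tm, tm in zip(estimated, actual):
        predicted_correct = e_tm > cutoff  # eTM-score says "correct fold"
        truly_correct = tm > cutoff        # actual TM-score says "correct fold"
        if predicted_correct and truly_correct:
            tp += 1
        elif predicted_correct:
            fp += 1
        elif truly_correct:
            fn += 1
        else:
            tn += 1
    fpr = fp / (fp + tn) if (fp + tn) else float("nan")
    fnr = fn / (fn + tp) if (fn + tp) else float("nan")
    return fpr, fnr


# Invented toy values for illustration only.
toy_etm = [0.80, 0.60, 0.40, 0.30, 0.70, 0.45]
toy_tm = [0.90, 0.40, 0.60, 0.20, 0.75, 0.30]
toy_fpr, toy_fnr = fold_error_rates(toy_etm, toy_tm)
```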
In addition to the global quality assessment, to show the local accuracy of each individual domain
model, I-TASSER-MTD estimates the residue-level distance error of the predicted model relative to
the native structure using ResQ75, a method for estimating residue-level quality in protein structure
prediction on the basis of local variations of modeling simulations and the uncertainty of homologous
template alignments.
1–51;
where ‘1–51’ indicates that one of the domains should include residues from 1 to 51. The total sequence length
of the protein is 152, and the complete domain definition should be ‘1–51;52–105;106–152;’ as defined in the
CATH database. The server will keep the domain provided by users when predicting the domain boundary.
When providing the defined domain, please ensure that the length of every undefined domain region is
larger than the minimum length (30) of a domain.
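Before submission, a domain definition of this form can be checked locally with a short script. The sketch below is a hypothetical helper, not part of the server: it parses continuous ranges such as ‘1–51;’ (accepting either a hyphen or an en dash), lists the residues left undefined and flags undefined regions that are too short. It treats a region of exactly 30 residues as acceptable, which is an assumption, and it does not handle discontinuous domains.

```python
import re
from typing import List, Tuple

MIN_DOMAIN_LEN = 30  # minimum domain length stated in the text

Range = Tuple[int, int]


def parse_domain_definition(text: str) -> List[Range]:
    """Parse '1-51;52-105;' into sorted 1-based inclusive (start, end) ranges."""
    domains = []
    for part in text.split(";"):
        part = part.strip()
        if not part:
            continue
        match = re.fullmatch(r"(\d+)\s*[-\u2013]\s*(\d+)", part)
        if match is None:
            raise ValueError(f"cannot parse domain range: {part!r}")
        start, end = int(match.group(1)), int(match.group(2))
        if start < 1 or end < start:
            raise ValueError(f"invalid domain range: {part!r}")
        domains.append((start, end))
    return sorted(domains)


def undefined_regions(domains: List[Range], seq_len: int) -> List[Range]:
    """Return the residue ranges not covered by any user-defined domain."""
    gaps, pos = [], 1
    for start, end in sorted(domains):
        if start < pos:
            raise ValueError("domain ranges overlap")
        if start > pos:
            gaps.append((pos, start - 1))
        pos = end + 1
    if pos > seq_len + 1:
        raise ValueError("a domain extends beyond the sequence")
    if pos <= seq_len:
        gaps.append((pos, seq_len))
    return gaps


def short_undefined_regions(
    domains: List[Range], seq_len: int, min_len: int = MIN_DOMAIN_LEN
) -> List[Range]:
    """Undefined regions shorter than min_len, which should be avoided."""
    return [(s, e) for s, e in undefined_regions(domains, seq_len)
            if e - s + 1 < min_len]
```

For the 152-residue example, defining only ‘1–51;’ leaves residues 52 to 152 undefined, a region long enough for the server to partition further.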
the top five templates with the highest scores will be selected for the initial full-length model
generation. If only one template is specified, the top four analogous templates detected from the
library and the uploaded template will be used to create the initial full-length model. For both cases,
the domain-template alignments are determined by a sliding-window procedure based on TM-score
as used by DEMO16. Furthermore, all templates provided by the user will be utilized to extract the
inter-domain distance profiles, which will be used to mainly guide the domain assembly simulations
along with restraints deduced from the templates identified from the library. If the provided template
comes from computationally modeled structures (such as those from AlphaFold2 predictions) and
includes the confidence score (pLDDT)11 in the ‘Temperature factor’ column, only the distances of
the residue pairs with pLDDT >70 for both residues are extracted from the template to guide the
assembly simulations.
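The pLDDT filter described above can be sketched as follows. This is an illustrative reading of the rule, not the server's implementation: it reads Cα atoms from the fixed-column ATOM records of a PDB file, takes the pLDDT from the ‘Temperature factor’ column and keeps only the distances of residue pairs in which both residues have pLDDT >70. The function name is hypothetical, and the sketch assumes a single chain without alternate locations.

```python
import math
from typing import Dict, Iterable, Tuple

PLDDT_CUTOFF = 70.0  # threshold stated in the text


def confident_ca_distances(
    pdb_lines: Iterable[str], cutoff: float = PLDDT_CUTOFF
) -> Dict[Tuple[int, int], float]:
    """Return CA-CA distances for residue pairs whose pLDDT both exceed cutoff."""
    atoms = []  # (residue number, x, y, z) of confident CA atoms
    for line in pdb_lines:
        # PDB fixed columns: atom name 13-16, residue number 23-26,
        # x/y/z 31-54, temperature factor 61-66 (1-based).
        if not line.startswith("ATOM") or line[12:16].strip() != "CA":
            continue
        plddt = float(line[60:66])
        if plddt > cutoff:
            atoms.append((int(line[22:26]), float(line[30:38]),
                          float(line[38:46]), float(line[46:54])))
    distances = {}
    for a in range(len(atoms)):
        for b in range(a + 1, len(atoms)):
            (i, *xyz_i), (j, *xyz_j) = atoms[a], atoms[b]
            distances[(i, j)] = math.dist(xyz_i, xyz_j)
    return distances
```

The function can be applied to the lines of an uploaded template, for example `confident_ca_distances(open(path))`, where `path` is a placeholder for the template file.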
15 165 28 0.85
63 220 21 0.96
94 168 17 0.79
where the first and the second columns are the residue indices, and the third column is the maximum Cα distance
for the residue pair. The fourth column is the confidence score that the distance of the residue pair is less than
the given distance listed in the third column. The confidence score is optional, and it will be set to 1 if not
provided. Values in the same row should be separated by tabs or spaces. For example, the first row ‘15 165 28
0.85’ indicates that the Cα distance between residues 15 and 165 has a confidence of 0.85 being less than 28 Å.
Note that the residue index is for the full-chain sequence rather than each individual domain, and the residue
index starts from 1 rather than 0.
have relatively low accuracy in the contact/distance prediction, users are recommended to provide
restraints with confidence score >0.5 to reduce noise.
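A restraint file in the four-column format described above can be checked and filtered locally before upload. The following Python sketch is a hypothetical helper: it parses whitespace- or tab-separated rows, sets a missing confidence score to 1, enforces 1-based full-chain residue indices and keeps only restraints with a confidence score >0.5, as recommended. The names `Restraint`, `parse_restraints` and `keep_confident` are illustrative.

```python
from typing import List, NamedTuple, Optional


class Restraint(NamedTuple):
    i: int                  # first residue index (1-based, full chain)
    j: int                  # second residue index (1-based, full chain)
    max_ca_distance: float  # maximum CA distance in angstroms
    confidence: float       # confidence that the distance is below the maximum


def parse_restraints(text: str, seq_len: Optional[int] = None) -> List[Restraint]:
    """Parse rows of 'i j distance [confidence]' separated by tabs or spaces."""
    restraints = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        if len(fields) not in (3, 4):
            raise ValueError(f"line {line_no}: expected 3 or 4 columns")
        i, j = int(fields[0]), int(fields[1])
        distance = float(fields[2])
        confidence = float(fields[3]) if len(fields) == 4 else 1.0
        if i < 1 or j < 1:
            raise ValueError(f"line {line_no}: residue indices start from 1")
        if seq_len is not None and max(i, j) > seq_len:
            raise ValueError(f"line {line_no}: index beyond sequence length")
        restraints.append(Restraint(i, j, distance, confidence))
    return restraints


def keep_confident(restraints: List[Restraint], min_conf: float = 0.5) -> List[Restraint]:
    """Keep restraints with confidence above min_conf to reduce noise."""
    return [r for r in restraints if r.confidence > min_conf]
```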
that mostly predict the function of the individual domains, I-TASSER-MTD performs the function
annotation of the query protein at both the domain level and full-length level.
As a blind test, I-TASSER-MTD (as ‘Zhang-Server’) participated in the most recent community-
wide CASP experiment for fully automated protein structure prediction. In Fig. 2a, we present a
summary of the five best performing servers in CASP14, in which we sorted the servers according to
the average global distance test (GDT) score of the full-length models for all multi-domain proteins
with one or more template-free modeling (FM) or template-free modeling/template-based modeling
(FM/TBM) domain. The average GDT score of I-TASSER-MTD for the multi-domain proteins was
the highest among all participating servers. The accuracy of the individual domain models for multi-
domain proteins was also higher than that of other servers. For example, I-TASSER-MTD achieved
an average GDT score of 61.4 for all individual domain models of the multi-domain proteins, which
was 19.4% higher than that of the second-best server, ROSETTA (51.4). This is mainly due to the
incorporation of the highly accurate deep-learning-based restraints from DeepPotential in the
I-TASSER-MTD simulations. The average GDT score of I-TASSER-MTD was also ranked the highest
for the structure modeling of single-domain proteins in CASP14.
Owing to the employment of FUpred and ThreaDom, I-TASSER-MTD can accurately distinguish
multi-domain proteins from single-domain proteins and predict the domain boundaries with reasonable
accuracy. Here, since we cannot obtain the domain definitions used by other servers in CASP,
I-TASSER-MTD is compared with two state-of-the-art methods, ConDo21 and DoBo80, on all
CASP14 targets. As shown in Fig. 2b, the accuracy of I-TASSER-MTD domain boundary prediction
is significantly higher than that of the two control methods in terms of the normalized domain
overlap (NDO) score81 for the protein domain boundary prediction, as well as accuracy (ACC) and
Matthews correlation coefficient (MCC) for protein classification. For example, the NDO score of
I-TASSER-MTD for multi-domain proteins was 0.86, which was 65.4% and 79.2% higher than that of
ConDo (0.52) and DoBo (0.48), respectively.
We also compared I-TASSER-MTD with AlphaFold2 (ref. 11) and RoseTTAFold82 on all CASP14
targets. As the AlphaFold2 results reported in CASP14 are based on the human-expert group, while
the results of I-TASSER-MTD (as ‘Zhang-Server’) are based on the automated server group, we
regenerated all models by running the standalone AlphaFold2 package for a fair comparison. The
average TM-score of AlphaFold2 is 0.84, which is considerably higher than that of I-TASSER-MTD
(0.65). For RoseTTAFold, which has two options, I-TASSER-MTD’s TM-score is slightly higher than
the RoseTTAFold end-to-end version (0.63) but slightly lower than the RoseTTAFold pyRosetta
version (0.69).
To highlight the effectiveness of I-TASSER-MTD on some protein targets, we list in Fig. 2c–e three
examples of multi-domain models built by I-TASSER-MTD that had a significantly higher TM-score
than models built with the state-of-the-art programs. First, Fig. 2c shows the comparison between the
native and predicted structures of human complement component C6 (Uniprot ID: P13671). Although
AlphaFold2 almost correctly predicted all domain models (TM-score 0.78, 0.93, 0.93, 0.88 and 0.87),
the domain orientations were not correctly generated, resulting in a full-length model with a poorer
TM-score/RMSD of 0.63/31.1 Å. The I-TASSER-MTD model obtained a TM-score/RMSD of 0.95/
3.2 Å since it correctly generated both domain models and inter-domain orientations after the
assembly. Figure 2d shows the second example from Sarcoplasmic/endoplasmic reticulum calcium
ATPase 2 (Uniprot ID: P16615), where I-TASSER-MTD generated a better-quality model (TM-score/
[Fig. 2 graphic: a, bar charts for the CASP14 servers (including Zhang-Server); b, bar charts of the average score for Multi NDO, Single NDO, Overall ACC and Overall MCC; c, P13671: I-TASSER-MTD model (TM-score = 0.95, RMSD = 3.2 Å) and AlphaFold2 model (TM-score = 0.62, RMSD = 31.1 Å); d, P16615: I-TASSER-MTD model (TM-score = 0.97, RMSD = 1.7 Å); e, T1092: I-TASSER-MTD model (TM-score = 0.82, RMSD = 5.3 Å).]
Fig. 2 | Comparison between I-TASSER-MTD and other methods. a, Comparison between I-TASSER-MTD (Zhang-Server) with the other top four
servers of CASP14 on modeling the full-length multi-domain targets assessed by the GDT score. b, Comparison of I-TASSER-MTD with ConDo and
DoBo for the protein domain boundary prediction on the CASP14 targets, where the y-axis is the NDO score of the multi-domain protein, the NDO
score of the single-domain protein, and the ACC and MCC for the four subparts from left to right. c–e, Representative examples in which I-TASSER-
MTD generated better quality full-length models than AlphaFold2 and RoseTTAFold, where the gray and color cartoon represent native structure and
predicted model, respectively, and different colors indicate different domains: human complement component C6 (P13671) (c); human sarcoplasmic/
endoplasmic reticulum calcium ATPase 2 (P16615) (d); DNA-directed RNA polymerase beta′ subunit (T1092 of CASP14) (e).
RMSD 0.97/1.7 Å) than AlphaFold2 (0.65/11.5 Å), while the latter misfolded the larger domain,
resulting in an incorrect overall domain orientation. Finally, Fig. 2e presents an example of
the CASP14 target (T1092) from DNA-directed RNA polymerase beta′ subunit. Although the
RoseTTAFold pyRosetta version generated a correct fold for the domain models (TM-score 0.77 and
0.87), the domain orientations were not correctly modeled, resulting in a full-length TM-score/RMSD
of 0.53/14.2 Å. Again, I-TASSER-MTD correctly constructed both the domain models and domain
orientations, thus obtaining a full-length model with TM-score/RMSD of 0.82/5.3 Å. These data
demonstrate that I-TASSER-MTD is complementary to these state-of-the-art programs, especially for
long protein sequences with multiple domains. However, its overall performance on many other
targets needs further improvement, as its average TM-score is still lower than that of AlphaFold2.
In addition, compared with the deep-learning-based end-to-end models (i.e., AlphaFold2 (ref. 11)
and RoseTTAFold82), which are largely a black box to both developers and users83, I-TASSER-MTD
has several advantages because its simulation process is accessible and interpretable. First,
I-TASSER-MTD reports the templates used to model each of the regions (domains) and full-length
protein, which can help users better understand where the predictions come from and therefore
provide functional insights for further studies on the protein. In fact, I-TASSER-MTD offers a
separate section for protein function annotation built on the structural modeling results.
Limitations
The domain boundary prediction method employed in the I-TASSER-MTD server for Hard targets
with template alignment coverage ≤95% is based on the deep-learning predicted contact maps that
require MSA collection. For extremely large proteins (>2,000 residues), contact map prediction and
MSA collection will require a high amount of random access memory (RAM) that the current
computing server cannot provide for some cases, which will result in failed domain partitioning by
FUpred due to memory limit. However, this limitation can be overcome by using ThreaDom for all
targets with sequence lengths >2,000 residues. Users can download the FUpred standalone package
and run it locally to predict the domain boundaries for a query sequence if they want to use the
FUpred-predicted domain definition for these cases. Furthermore, users also can provide the domain
definition predicted by other external programs such as NCBI Conserved Domain Database or
PFAM. This will also speed up the I-TASSER-MTD structure modeling process as the server will not
take time to predict the domain boundaries. See the ‘Experimental design’ section for detailed
instructions on how to provide the domain definition.
One highlight of the I-TASSER-MTD server is that it independently creates the model of each
individual domain and assembles all domain models into the full-length model. However, the quality
of the final full-length model is dependent on the accuracy of the individual domain models.
Although the domain assembly process and scoring function can accommodate some degree of
structural uncertainty, an incorrect domain model (e.g., TM-score <0.5) may affect the full-length
template identification based on structural alignment and misguide the domain assembly. This will
probably result in a poor final full-length model with a low eTM-score because each model is
considered as a rigid body during the domain assembly. For cases with low eTM-scores (e.g., <0.5),
users are advised to provide other sources of structural information, such as cross-linking experi-
mental data, restraints from mutagenesis, high-confidence full-length templates or inter-domain
contacts/distances determined by alternative programs, to guide the domain assembly.
Materials
Equipment
● A personal computer with Internet connection and a web browser with JavaScript enabled
(the I-TASSER-MTD server is compatible with popular web browsers, including Google Chrome,
Firefox, Microsoft Edge and Safari)
CRITICAL The amino acid sequence should be in
FASTA format, in which only characters from the single-letter code of the 20 standard amino acids are
allowed. Spaces, line breaks and header lines starting with ‘>’ will be ignored and will not affect the
prediction.
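The input requirements can be checked locally before submission. The sketch below is a hypothetical pre-check, not the server's parser: it drops header lines and whitespace, keeps only the first chain if several headers are present, rejects characters outside the 20 standard amino acids and applies the 30 to 3,000 residue limit stated in the Procedure. Converting lowercase letters to uppercase is an assumption of this sketch.

```python
STANDARD_AA = set("ACDEFGHIKLMNPQRSTVWY")  # 20 standard amino acids


def clean_query(text: str, min_len: int = 30, max_len: int = 3000) -> str:
    """Return the first chain of a FASTA/plain-text query as one clean string."""
    chunks = []
    for line in text.splitlines():
        if line.startswith(">"):
            if chunks:
                break  # a second header starts another chain; keep the first
            continue   # header lines are ignored
        chunks.append("".join(line.split()))  # remove spaces and tabs
    sequence = "".join(chunks).upper()  # uppercasing is an assumption
    invalid = sorted(set(sequence) - STANDARD_AA)
    if invalid:
        raise ValueError(f"non-standard characters: {''.join(invalid)}")
    if not min_len <= len(sequence) <= max_len:
        raise ValueError(f"length {len(sequence)} outside {min_len}-{max_len}")
    return sequence
```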
Software
● A web browser such as Google Chrome, Firefox, Microsoft Edge or Safari
● (Optional) Molecular visualization software, such as Jmol, RasMol or PyMOL, for viewing the 3D
structure of the modeled protein and the predicted functional sites locally
Procedure
Query sequence submission ● Timing 5 min
1 Navigate to the I-TASSER-MTD website at https://zhanggroup.org/I-TASSER-MTD/.
2 Provide the amino acid sequence by copying and pasting the sequence into the provided form or
directly uploading a plain text file containing the sequence.
CRITICAL STEP The sequence should contain only one chain. If the provided sequence includes
multiple chains, only the first chain will be used. At present, the I-TASSER-MTD server accepts
protein sequences with a length between 30 and 3,000 amino acids.
3 Input an email address in the text box to receive the results when the job is completed.
CRITICAL STEP It is crucial to provide a correct email address. Otherwise, the user will not be
density map will significantly improve the quality of the final full-length model16,77. In addition,
cross-linking data can also be replaced by contact or distance restraints determined by any contact
or distance prediction program.
? TROUBLESHOOTING
10 Click the ‘Run I-TASSER-MTD’ button to submit the job.
? TROUBLESHOOTING
other jobs before it are processed on the computer cluster. Users may choose to close the job status
page. When the prediction is done, an email notification containing the link to the results page will
be sent to the user. The results can be accessed through this link or the bookmarked page.
12 Click the link in the email notification or open the link bookmarked in Step 11 to visit the page
containing the results. The page starts with a title and a link to download the tarball file including
all results listed on the page (Fig. 3a). An example results page is available at
https://zhanggroup.org/I-TASSER-MTD/example/.
! CAUTION The results will be stored on the server for 1 month. Users are recommended to
download the results to their computers.
[Fig. 3a screenshot: ‘I-TASSER-MTD Results for job ITM118’; ‘Click on ITM118_results.tar.bz2 to download the tarball file including all results listed on this page’.]
Fig. 3 | Example of the I-TASSER-MTD results page (Sections 2 and 3). a, Title of the results page and the link to download all results shown in the
page. b, Query sequence in FASTA format submitted by the user, where the predicted domains are marked by different colors, and the range of each
domain is listed below the sequence. c, Top final full-length models predicted by the server and their estimated accuracy (right). Model 1 is shown
(left), where different domains are represented by different colors. d, Predicted individual domain models that are used to assemble the full-length
model and the estimated distance error of each domain model.
model was directly generated by D-I-TASSER as the threading templates could cover all domains,
and the top templates identified by LOMETS2 had consistent topologies. In these cases, the final
model usually has a relatively high eTM-score, indicating a high-quality final model.
15 View the second column of the table to analyze the eTM-score for each full-length model. As
defined in Equation 1, the eTM-score is calculated on the basis of the confidence of individual
domain models and the confidence of the inter-domain assembly simulations. The eTM-score
usually ranges from 0 to 1, where a higher score indicates a model of better quality. In general,
models with eTM-score >0.5 have a correct global fold.
? TROUBLESHOOTING
16 View the eRMSD to the native structure shown in the third column of the table. As defined in
Eq. (S14), eRMSD is estimated in a similar way as the eTM-score but with the sequence length
incorporated.
! CAUTION Since the top five models are ranked by energy or by cluster size, in rare cases it is
possible that the lower-rank models have a higher eTM-score or lower eRMSD. Accordingly,
although the first model has a better quality on average, it is also possible that the lower-rank
models have a better quality than the higher-rank models as seen in our benchmark tests. Users can
also estimate the total local distance difference test (lDDT) of the model for reference by using
deep-learning based quality assessment methods, such as DeepAccNet98 and DeepUMQA99.
17 View the P-score reported in the fourth column of the table. Here, P-score is used to assess the
relative populations of complex conformations under the assumption that the relative populations
of the complex conformations are approximately proportional to their entropy variations in the
domain structural assembly simulations. The P-score ranges from 0 to 1, and a higher
value indicates that the structure occurs more often in the simulation trajectory.
18 Download the predicted models in PDB format by clicking on the ‘Download model’ link shown in
the corresponding row of the column marked as ‘PDB file’. Users can interactively view the
predicted structure on their computer using the programs mentioned in the ‘Materials’ section.
19 View the predicted probability of the inter-domain interaction for every two domains, which is
listed under the table (see label 1 in Fig. 3c). Here, an inter-domain interaction is defined as one or
more inter-domain residue pairs, excluding the linker region, with distance <8 Å. The domain interaction
probability is estimated by the average probability score of the top 1% of the inter-domain residue
pairs predicted by the deep-learning models. It ranges from 0 to 1, with a higher value indicating
that the two domains have a larger probability of interaction.
20 Click the link labeled ‘More about eTM-score’ (see label 2 in Fig. 3c) to open a new page containing
more information about the eTM-score and eRMSD.
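The domain interaction probability reported in Step 19 can be illustrated with a short Python sketch that averages the top 1% of inter-domain residue-pair probabilities. The input layout (a square probability matrix and 1-based inclusive domain ranges), the function name and the rounding of the top 1% (rounded up, with at least one pair) are assumptions; the exclusion of the linker region mentioned in Step 19 is not modeled here.

```python
import math
from typing import Sequence, Tuple


def domain_interaction_probability(
    pair_prob: Sequence[Sequence[float]],
    domain_a: Tuple[int, int],
    domain_b: Tuple[int, int],
    top_fraction: float = 0.01,
) -> float:
    """Average the highest-scoring fraction of inter-domain pair probabilities.

    pair_prob[i][j]: predicted probability for residues i+1 and j+1 (hypothetical
        layout). domain_a, domain_b: 1-based inclusive residue ranges.
    """
    probs = [
        pair_prob[i - 1][j - 1]
        for i in range(domain_a[0], domain_a[1] + 1)
        for j in range(domain_b[0], domain_b[1] + 1)
    ]
    if not probs:
        raise ValueError("no inter-domain residue pairs")
    n_top = max(1, math.ceil(top_fraction * len(probs)))  # rounding is assumed
    top = sorted(probs, reverse=True)[:n_top]
    return sum(top) / n_top
```

With two 10-residue domains there are 100 inter-domain pairs, so the top 1% reduces to the single most confident pair.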
Fig. 4 | Example of the I-TASSER-MTD results page (Sections 4–6). a, Predicted secondary structure of the full-length sequence, where different
types of secondary structure are marked by different colors. b, Predicted solvent accessibility of the full-length sequence, where a larger confidence
score indicates a higher probability of exposure. c, Results of the domain boundary prediction, which includes the contact map used to guide the
domain boundary prediction (1), the curve of the FU-score for continuous domain detection (2), the heatmap of FU-score for discontinuous domain
detection (3) and the predicted domain definitions (4).
of each domain can be viewed by clicking the link labeled ‘Click to view the predicted function’
below the corresponding predicted domain distance error chart (see label 4 in Fig. 3c). The function
results page will be interpreted in Steps 38–52.
Fig. 5 | Example of the I-TASSER-MTD results page (Sections 7 and 8). a, Top ten full-length templates identified by the global structural alignment,
which are used to guide the domain model assembly. b, Predicted residue–residue distance maps and domain–domain interface maps for domain
model assembly.
with Cα and Cβ distances <20 Å, respectively. The subsequent two columns show the interface maps
with Cα and Cβ distances <18 Å predicted by deep learning, where all residue pairs are included in
the map. In these maps, only the inter-domain distances and inter-domain interface probabilities
are employed to guide the domain model assembly.
? TROUBLESHOOTING
37 Click the link below the figure to download the corresponding predicted distance map for further
analysis. For example, click the link labeled ‘Download CA distance map’ to download the predicted
distance map with Cα distances <20 Å.
Fig. 6 | Example of the I-TASSER-MTD results page (Sections 9–12). a, Top ten analogous structures that are structurally close to the top model of
the query protein. b, Results of the predicted GO terms including MF (1), BP (2) and CC (3). c, Results of the predicted EC numbers from the top five
homologous enzyme templates. d, Results of the predicted ligand-binding site from the top five homologous templates.
visit the AmiGO website (http://amigo.geneontology.org/amigo) to analyze the definition and lineage
of the term.
44 Click on the link ‘full result’ below the table to download the results presented in the table.
users are advised to consult both the CscoreEC and the TM-score. For example, if most
of the identified functional analogs with similar folds (i.e., TM-score >0.5) have the same EC
number digits and the CscoreEC is relatively high, the likelihood of the prediction being correct is
very high.
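The consistency check described above can be sketched as follows. This is a rough illustration only: the tuple layout of the hits, the function name and the choice to compare the first three EC digits are assumptions, and no CscoreEC threshold is implied because the protocol text here does not give one.

```python
from collections import Counter


def ec_consensus(hits, tm_cutoff=0.5, digits=3):
    """Summarize agreement among EC predictions from similar-fold templates.

    hits : list of (ec_number, tm_score, cscore_ec) tuples (hypothetical layout)
    Returns (most common EC prefix, fraction of similar-fold hits sharing it,
             mean CscoreEC of the hits that share it).
    """
    # Keep only templates with a similar fold (TM-score above the cutoff).
    similar = [h for h in hits if h[1] > tm_cutoff]
    if not similar:
        return None, 0.0, 0.0
    # Assumption: "same EC number digits" is read as the first `digits` fields.
    prefixes = [".".join(ec.split(".")[:digits]) for ec, _, _ in similar]
    prefix, count = Counter(prefixes).most_common(1)[0]
    agreeing = [h for h, p in zip(similar, prefixes) if p == prefix]
    mean_cscore = sum(h[2] for h in agreeing) / len(agreeing)
    return prefix, count / len(similar), mean_cscore


example_hits = [
    ("1.1.1.1", 0.8, 1.2),
    ("1.1.1.2", 0.7, 1.0),
    ("2.7.1.1", 0.6, 0.4),
    ("3.1.1.1", 0.3, 0.9),  # dissimilar fold, ignored
]
print(ec_consensus(example_hits))
```

A high agreement fraction together with a relatively high mean CscoreEC would correspond to the favorable case described in the text; how high is "high" remains a judgment for the user.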
48 Click on the EC numbers to visit the ExPASy enzyme database (https://enzyme.expasy.org/) to
further learn about the enzyme families, such as reactions catalyzed by the enzyme, the cofactors
required and the metabolic pathway in which they function.
Step | Problem | Possible reason | Solution
5 | Error! Some residues are missed in the domain definition | Some residue indices are not included in the domain definition | Check the domain definition to make sure that it includes all residue indices
5 | Error! Residue overlaps in the domain definition | Some residue indices are included in multiple domain ranges | Check the domain definition to make sure each residue index is included in only one domain
5 | Error! Wrong format of the domain definition | The domain definition cannot be recognized by the server | Read the instructions on how to correct the format of the domain definition or follow the formatting shown in the example input
5 | Error! Too many domains in the domain definition | Maximum number of domains accepted by the server is 20 | Merge short domains in the domain definition or model the first 20 domains and the rest of the domains independently using the server, and assemble them using DEMO
5 | Error! Too large domain in the domain definition | Maximum length of the domain accepted by the server is 1,000 | Further split large domains into multiple parts
5 | Error! Too short domain in the domain definition | Minimum length of the domain accepted by the server is 30 | Merge short domains with other domains
6 | Error! Incorrect template name | The template name cannot be recognized by the server | Rename the template according to the instructions
6 | Error! Wrong tarball format (XXX) of the templates | The tarball format is not supported by the server | Repackage the templates as *.tar.bz2, *.tar.gz, *.tar or *.zip
9 | Error! Wrong cross-linking data | The cross-linking data cannot be read by the server | Check the instructions or the example input to ensure the cross-linking data is in the correct format. Ensure that the residue index is within the sequence length range
9 | Error! Wrong density map file | The cryo-EM density map data cannot be read by the server | Make sure that the density map is in the 8-bit, 16-bit, or 32-bit MRC and CCP4 format, and fix the error according to the instructions
10 | The protein sequence is too short or too long | The sequence length is not within [30, 2,000] | Check the sequence to make sure that the length is within [30, 2,000]. Users who want to model larger proteins with sequence lengths >2,000 residues should specify their own domain boundary
15 | Low eTM-score full-length model | Low-quality individual domain models, full-length templates or distance map | As described in the ‘Limitations’ section, users are advised to seek other sources of structural information, such as experimental data, full-length templates and contact/distance restraints predicted by other tools
35 | No predicted distance maps | The sequence length is large, which causes a larger distance map than can be predicted by the server owing to RAM limitations | Users can download the standalone package and predict the distance locally
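Several of the Step 5 errors above can be caught before submission. The following is a minimal sketch of such a pre-check, assuming a domain definition is held as a list of domains, each a list of inclusive 1-based (start, end) segments. This in-memory layout and the function name are hypothetical, and the server's own input format should be taken from its instructions; only the limits (at most 20 domains, 30 to 1,000 residues per domain) come from the table.

```python
def check_domain_definition(domains, seq_len, max_domains=20, min_len=30, max_len=1000):
    """Return a list of problems found in a domain definition (empty if none).

    domains : list of domains; each domain is a list of inclusive, 1-based
              (start, end) segments, so discontinuous domains are allowed
    seq_len : length of the query sequence
    """
    problems = []
    if len(domains) > max_domains:
        problems.append("Too many domains in the domain definition")
    hits = [0] * (seq_len + 1)  # hits[r] = number of domains containing residue r
    for k, segments in enumerate(domains, start=1):
        length = 0
        for start, end in segments:
            if not 1 <= start <= end <= seq_len:
                problems.append(f"Domain {k}: segment {start}-{end} is outside 1-{seq_len}")
                continue
            length += end - start + 1
            for r in range(start, end + 1):
                hits[r] += 1
        if length > max_len:
            problems.append(f"Too large domain in the domain definition (domain {k})")
        elif length < min_len:
            problems.append(f"Too short domain in the domain definition (domain {k})")
    if any(h == 0 for h in hits[1:]):
        problems.append("Some residues are missed in the domain definition")
    if any(h > 1 for h in hits[1:]):
        problems.append("Residue overlaps in the domain definition")
    return problems


# A two-domain definition with a discontinuous first domain passes the checks.
print(check_domain_definition([[(1, 50), (120, 160)], [(51, 119)]], seq_len=160))
```

An empty list means none of the listed problems was detected; format errors specific to the server's text input are outside the scope of this sketch.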
Timing
Steps 1–10, query sequence submission: 5 min
Steps 11–12, job monitoring: 6–12 h
Step 13, query sequence and predicted domain definition: 3 min
Steps 14–20, predicted full-length structures: 10 min
Steps 21–24, individual domain modeling: 5 min
Step 25, secondary structure prediction: 2 min
Step 26, solvent accessibility prediction: 2 min
Steps 27–30, domain boundary prediction: 5 min
Anticipated results
Here we used the iron-dependent regulator from Mycobacterium tuberculosis (PDB ID: 1fx7A) as an
example to predict the structure and function of the protein using the I-TASSER-MTD server. In
Step 8, we set ‘Option IV’ to ‘YES’ to predict the function of all individual domains and the
full-length protein, while other options were kept as default. Once the job is finished, the user will
receive an email containing a link to the results page. Clicking on the link will open the results page
containing the following results, which are shown in Figs. 3–6:
● The title of the results page: I-TASSER-MTD results for job jobid (see label 1 in Fig. 3a).
● The link to download the results files. Click on the link to download a compressed file containing all
(Fig. 3b). Users can confirm the sequence length shown in the parentheses after the query name.
● The top five full-length models predicted by I-TASSER-MTD (Fig. 3c). The left panel shows the
predicted model using the JSmol applet, while the right panel summarizes the information of the
predicted models.
● The predicted individual domain models (Fig. 3d). Each domain model is shown in an independent
JSmol applet, with a link to download the model, and the predicted functions are given below the model.
The predicted functions for each individual domain are shown in Supplementary Figs. 17–19.
● The predicted secondary structure of the full-length sequence (Fig. 4a). The results contain three rows.
The first row shows the query sequence, while the second and the third row report the predicted
secondary structure and the confidence score, respectively.
● The predicted solvent accessibility of the full-length sequence, which contains two rows (Fig. 4b). The first
row and the second row display the query sequence and the confidence score for the predicted solvent
accessibility, respectively.
● The predicted domain boundary (Fig. 4c). This section contains the deep-learning-predicted contact map
for the domain boundary prediction, the FU-score curve for the continuous domain detection, the
FU-score heatmap for the discontinuous domain detection and the predicted domain definition.
● The top ten full-length templates identified by TM-align structural alignment (Fig. 5a). This section
reports the sequence identity, template score and the alignment between the query and the template.
● The distance/interface map predicted by deep learning (Fig. 5b). The Cα/Cβ distance maps with distances
<20 Å, the Cα/Cβ interface maps with distances <18 Å and the corresponding links to download the
distance/interface maps are shown in this section.
● The top ten proteins structurally close to the query protein (Fig. 6a). The predicted full-length models
with the aligned analogous models are depicted in the left panel using the JSmol applet, while the
information for the structural analogs is shown in the right panel.
● The predicted GO terms (Fig. 6b). This section contains the predicted MF, BP and CC. For each section,
the predicted GO terms are plotted using a directed acyclic graph displayed in the left panel, while a table
with the predicted GO terms is shown on the right panel.
● The predicted EC numbers (Fig. 6c). The JSmol panel on the left shows the full-length model and the
predicted active site residues. The table on the right summarizes the predicted active sites.
● The predicted ligand-binding sites (Fig. 6d). The full-length model, predicted binding sites and
corresponding ligands are displayed on the left panel using the JSmol applet, and a summary of the
predicted ligand-binding sites is reported in a table on the right panel.
Data availability
The raw data and example files are available at https://zhanggroup.org/I-TASSER-MTD/ or from the
corresponding author upon reasonable request.
Code availability
The I-TASSER-MTD standalone package is freely available for academic use at
https://zhanggroup.org/I-TASSER-MTD/.
References
1. Sali, A. & Blundell, T. L. Comparative protein modelling by satisfaction of spatial restraints. J. Mol. Biol.
234, 779–815 (1993).
2. Simons, K. T., Kooperberg, C., Huang, E. & Baker, D. Assembly of protein tertiary structures from
fragments with similar local sequences using simulated annealing and Bayesian scoring functions. J. Mol.
Biol. 268, 209–225 (1997).
3. Xu, D. & Zhang, Y. Ab initio protein structure assembly using continuous structure fragments and
optimized knowledge-based force field. Proteins 80, 1715–1735 (2012).
4. Yang, J. et al. The I-TASSER Suite: protein structure and function prediction. Nat. Methods 12, 7–8 (2015).
5. Weigt, M., White, R. A., Szurmant, H., Hoch, J. A. & Hwa, T. Identification of direct residue contacts in
protein–protein interaction by message passing. Proc. Natl Acad. Sci. USA 106, 67–72 (2009).
6. Marks, D. S. et al. Protein 3D structure computed from evolutionary sequence variation. PLoS ONE 6,
e28766 (2011).
7. Mortuza, S. et al. Improving fragment-based ab initio protein structure assembly using low-accuracy
contact-map predictions. Nat. Commun. 12, 5011 (2021).
8. Wang, S., Sun, S., Li, Z., Zhang, R. & Xu, J. Accurate de novo prediction of protein contact map by ultra-
deep learning model. PLoS Comput. Biol. 13, e1005324 (2017).
9. Senior, A. W. et al. Improved protein structure prediction using potentials from deep learning. Nature 577,
706–710 (2020).
10. Li, Y., Hu, J., Zhang, C., Yu, D.-J. & Zhang, Y. ResPRE: high-accuracy protein contact prediction by
coupling precision matrix with deep residual neural networks. Bioinformatics 35, 4647–4655 (2019).
11. Jumper, J. et al. Highly accurate protein structure prediction with AlphaFold. Nature 596, 583–589 (2021).
12. Kryshtafovych, A., Schwede, T., Topf, M., Fidelis, K. & Moult, J. Critical assessment of methods of protein
structure prediction (CASP)—Round XIV. Proteins 89, 1607–1617 (2021).
13. Chothia, C., Gough, J., Vogel, C. & Teichmann, S. A. Evolution of the protein repertoire. Science 300,
1701–1703 (2003).
14. Apic, G., Huber, W. & Teichmann, S. A. Multi-domain protein families and domain pairs: comparison with
known structures and a random model of domain recombination. J. Struct. Funct. Genomics 4,
67–78 (2003).
15. Han, J.-H., Batey, S., Nickson, A. A., Teichmann, S. A. & Clarke, J. The folding and
evolution of multidomain proteins. Nat. Rev. Mol. Cell Biol. 8, 319 (2007).
16. Zhou, X. G., Hu, J., Zhang, C. X., Zhang, G. J. & Zhang, Y. Assembling multidomain protein structures
through analogous global structural alignments. Proc. Natl Acad. Sci. USA 116, 15930–15938 (2019).
17. Xu, D., Jaroszewski, L., Li, Z. & Godzik, A. AIDA: ab initio domain assembly for automated
multi-domain protein structure prediction and domain–domain interaction prediction. Bioinformatics 31,
2098–2105 (2015).
18. Berman, H. M. et al. The Protein Data Bank. Nucleic Acids Res. 28, 235–242 (2000).
19. Zhang, Y. & Skolnick, J. TM-align: a protein structure alignment algorithm based on the TM-score. Nucleic
Acids Res. 33, 2302–2309 (2005).
20. Xue, Z., Xu, D., Wang, Y. & Zhang, Y. ThreaDom: extracting protein domain boundary information from
multiple threading alignments. Bioinformatics 29, i247–i256 (2013).
21. Hong, S. H., Joo, K. & Lee, J. ConDo: protein domain boundary prediction using coevolutionary
information. Bioinformatics 35, 2411–2417 (2019).
22. Zheng, W. et al. FUpred: detecting protein domains through deep-learning based contact map prediction.
Bioinformatics 36, 3749–3757 (2020).
23. Wollacott, A. M., Zanghellini, A., Murphy, P. & Baker, D. Prediction of structures of multidomain proteins
from structures of the individual domains. Protein Sci. 16, 165–175 (2007).
24. Zhang, C., Zheng, W., Freddolino, P. L. & Zhang, Y. MetaGO: predicting Gene Ontology of non-
homologous proteins through low-resolution protein structure prediction and protein–protein network
mapping. J. Mol. Biol. 430, 2256–2265 (2018).
25. Yao, S. et al. NetGO 2.0: improving large-scale protein function prediction with massive sequence, text,
domain, family and network information. Nucleic Acids Res. 49, W469–W475 (2021).
26. Piovesan, D. & Tosatto, S. C. INGA 2.0: improving protein function prediction for the dark proteome.
Nucleic Acids Res. 47, W373–W378 (2019).
27. Koo, D. C. E. & Bonneau, R. Towards region-specific propagation of protein functions. Bioinformatics 35,
1737–1744 (2019).
28. Gligorijević, V. et al. Structure-based protein function prediction using graph convolutional networks.
Nat. Commun. 12, 1–14 (2021).
29. Pearce, R. & Zhang, Y. Toward the solution of the protein structure prediction problem. J. Biol. Chem. 297,
100870 (2021).
30. Zheng, W. et al. Protein structure prediction using deep learning distance and hydrogen‐bonding restraints
in CASP14. Proteins 89, 1734–1751 (2021).
31. Zheng, W. et al. Deep-learning contact-map guided protein structure prediction in CASP13. Proteins 87,
1149–1164 (2019).
32. Battey, J. N. et al. Automated server predictions in CASP7. Proteins 69 (Suppl.), 68–82 (2007).
33. Croll, T. I., Sammito, M. D., Kryshtafovych, A. & Read, R. J. Evaluation of template-based modeling in
CASP13. Proteins 87, 1113–1127 (2019).
34. Zhang, Y. Template-based modeling and free modeling by I-TASSER in CASP7. Proteins 69
(Suppl.), 108–117 (2007).
35. Zhang, Y. I-TASSER: fully automated protein structure prediction in CASP8. Proteins 77 (Suppl.), 100–113
(2009).
36. Xu, D., Zhang, J., Roy, A. & Zhang, Y. Automated protein structure modeling in CASP9 by I-TASSER
pipeline combined with QUARK-based ab initio folding and FG-MD-based structure refinement. Proteins
79 (Suppl.), 147–160 (2011).
37. Zhang, Y. Interplay of I-TASSER and QUARK for template-based and ab initio protein structure prediction
in CASP10. Proteins 82 (Suppl.), 175–187 (2014).
38. Zhang, W. et al. Integration of QUARK and I-TASSER for ab initio protein structure prediction in CASP11.
Proteins 84 (Suppl.), 76–86 (2016).
39. Zhang, C., Mortuza, S. M., He, B., Wang, Y. & Zhang, Y. Template-based and free modeling of I-TASSER
and QUARK pipelines using predicted contact maps in CASP12. Proteins 86 (Suppl.), 136–151 (2018).
40. Roy, A., Kucukural, A. & Zhang, Y. I-TASSER: a unified platform for automated protein structure and
function prediction. Nat. Protoc. 5, 725–738 (2010).
41. He, K., Zhang, X., Ren, S. & Sun, J. Deep residual learning for image recognition. in Proceedings of the IEEE
conference on computer vision and pattern recognition 770–778 (2016).
42. Zheng, W. et al. LOMETS2: improved meta-threading server for fold-recognition and structure-based
function annotation for distant-homology proteins. Nucleic Acids Res. 47, W429–W436 (2019).
43. Wang, Y. et al. ThreaDomEx: a unified platform for predicting continuous and discontinuous protein
domains by multiple-threading and segment assembly. Nucleic Acids Res. 45, W400–W407 (2017).
44. Li, Y. et al. Protein inter‐residue contact and distance prediction by coupling complementary coevolution
features with deep residual networks in CASP14. Proteins 89, 1911–1921 (2021).
45. Zhang, C., Freddolino, P. L. & Zhang, Y. COFACTOR: improved protein function prediction by
combining structure, sequence and protein–protein interaction information. Nucleic Acids Res. 45,
W291–W299 (2017).
46. Sillitoe, I. et al. CATH: increased structural coverage of functional space. Nucleic Acids Res. 49, D266–D273
(2021).
47. Xu, Y., Xu, D. & Gabow, H. N. Protein domain decomposition using a graph-theoretic approach.
Bioinformatics 16, 1091–1104 (2000).
48. Steinegger, M. & Söding, J. Clustering huge protein sequence sets in linear time. Nat. Commun. 9, 2542 (2018).
49. Steinegger, M., Mirdita, M. & Söding, J. Protein-level assembly increases protein sequence recovery from
metagenomic samples manyfold. Nat. Methods 16, 603–606 (2019).
50. Mitchell, A. L. et al. MGnify: the microbiome analysis resource in 2020. Nucleic Acids Res. 48, D570–D578
(2020).
51. Chen, I.-M. A. et al. The IMG/M data management and analysis system v. 6.0: new tools and advanced
capabilities. Nucleic Acids Res. 49, D751–D763 (2021).
52. Mirdita, M. et al. Uniclust databases of clustered and deeply annotated protein sequences and alignments.
Nucleic Acids Res. 45, D170–D176 (2017).
53. Suzek, B. E. et al. UniRef clusters: a comprehensive and scalable alternative for improving sequence
similarity searches. Bioinformatics 31, 926–932 (2015).
54. Zhang, C., Zheng, W., Mortuza, S., Li, Y. & Zhang, Y. DeepMSA: constructing deep multiple sequence
alignment to improve contact prediction and fold-recognition for distant-homology proteins. Bioinformatics
36, 2105–2112 (2020).
55. Yan, R., Xu, D., Yang, J., Walker, S. & Zhang, Y. A comparative assessment and analysis of 20 representative
sequence alignment methods for protein structure prediction. Sci. Rep. 3, 1–9 (2013).
92. Zhang, C. et al. Functions of essential genes and a scale-free protein interaction network revealed by
structure-based function and interaction prediction for a minimal genome. J. Proteome Res. 20, 1178–1189
(2021).
93. Zhang, C., Wei, X., Omenn, G. S. & Zhang, Y. Structure and protein interaction-based gene ontology
annotations reveal likely functions of uncharacterized proteins on human chromosome 17. J. Proteome Res.
17, 4186–4196 (2018).
94. Zhang, C., Lane, L., Omenn, G. S. & Zhang, Y. Blinded testing of function annotation for uPE1 proteins by
I-TASSER/COFACTOR pipeline using the 2018–2019 additions to neXtProt and the CAFA3 challenge.
J. Proteome Res. 18, 4154–4166 (2019).
95. Iyer, S., Subramanian, V. & Acharya, K. R. C9orf72, a protein associated with amyotrophic lateral sclerosis
(ALS) is a guanine nucleotide exchange factor. PeerJ 6, e5815 (2018).
96. Skotnicová, P. et al. The cyanobacterial protoporphyrinogen oxidase HemJ is a new b-type heme protein
functionally coupled with coproporphyrinogen III oxidase. J. Biol. Chem. 293, 12394–12404 (2018).
97. Hanson, R. M., Prilusky, J., Renjian, Z., Nakane, T. & Sussman, J. L. JSmol and the next‐generation web‐
based representation of 3D molecular structure as applied to proteopedia. Isr. J. Chem. 53, 207–216 (2013).
98. Hiranuma, N. et al. Improved protein structure refinement guided by deep learning based accuracy
estimation. Nat. Commun. 12, 1–11 (2021).
99. Guo, S.-S., Liu, J., Zhou, X. & Zhang, G. DeepUMQA: ultrafast shape recognition-based protein model
quality assessment using deep learning. Bioinformatics 38, 1895–1903 (2022).
100. Xu, J. & Zhang, Y. How significant is a protein structure similarity with TM-score = 0.5? Bioinformatics 26,
889–895 (2010).
101. Ellson, J., Gansner, E. R., Koutsofios, E., North, S. C. & Woodhull, G. in Graph Drawing Software 127–148
(Springer, 2004).
102. Towns, J. et al. XSEDE: accelerating scientific discovery. Comput. Sci. Eng. 16, 62–74 (2014).
103. Xu, J. Distance-based protein folding powered by deep learning. Proc. Natl Acad. Sci. USA 116,
16856–16865 (2019).
104. Källberg, M. et al. Template-based protein structure modeling using the RaptorX web server. Nat. Protoc. 7,
1511–1522 (2012).
105. Du, Z. et al. The trRosetta server for fast and accurate protein structure prediction. Nat. Protoc. 16,
5634–5651 (2021).
106. Lobley, A., Sadowski, M. I. & Jones, D. T. pGenTHREADER and pDomTHREADER: new methods for
improved protein fold recognition and superfamily discrimination. Bioinformatics 25, 1761–1767 (2009).
107. Zimmermann, L. et al. A completely reimplemented MPI bioinformatics toolkit with a new HHpred server
at its core. J. Mol. Biol. 430, 2237–2243 (2018).
108. Kelley, L. A., Mezulis, S., Yates, C. M., Wass, M. N. & Sternberg, M. J. The Phyre2 web portal for protein
modeling, prediction and analysis. Nat. Protoc. 10, 845–858 (2015).
Acknowledgements
This work is supported in part by the National Institute of General Medical Sciences (GM136422 and S10OD026825 to Y.Z.), the
National Institute of Allergy and Infectious Diseases (AI134678 to Y.Z.), the National Science Foundation (IIS1901191 and DBI2030790
to Y.Z.), the National Natural Science Foundation of China (62173304 and 61773346 to G.Z.), the ‘New Generation Artificial Intelligence’
major project of Science and Technology Innovation 2030 of the Ministry of Science and Technology of China (2021ZD0150100 to G.Z.)
and the Key Project of Zhejiang Provincial Natural Science Foundation of China (LZ20F030002 to G.Z.). This work used the Extreme
Science and Engineering Discovery Environment (XSEDE)102, which is supported by the National Science Foundation (ACI1548562).
Author contributions
Y.Z. conceived and designed the project. X.Z. developed the pipeline and performed the test. W.Z. developed the method for domain
boundaries prediction. Y.L. developed the method for contacts and distances prediction. C.Z. developed the method for protein function
prediction. Y.Z., W.Z., Y.L., C.Z. and R.P. developed the method for individual domain modeling. X.Z. developed the method for multi-
domain protein structure assembly. X.Z. and E.B. tested the server. G.Z. helped supervise the research. X.Z. and Y.Z. wrote the
manuscript, and all authors read and approved the final manuscript.
Competing interests
The authors declare no competing interests.
Additional information
Supplementary information The online version contains supplementary material available at https://doi.org/10.1038/s41596-022-00728-0.
Correspondence and requests for materials should be addressed to Yang Zhang.
Peer review information Nature Protocols thanks Ruben Sánchez-García, Beat R. Vogeli and the other, anonymous, reviewer(s) for their
contribution to the peer review of this work.
Reprints and permissions information is available at www.nature.com/reprints.
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Springer Nature or its licensor holds exclusive rights to this article under a publishing agreement with the author(s) or other
rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing
agreement and applicable law.
Related links
Key references using this protocol
Zhou, X. et al. Proc. Natl Acad. Sci. USA 116, 15930–15938 (2019): https://doi.org/10.1073/pnas.1905068116
Zhang, C. et al. Nucleic Acids Res. 45, W291–W299 (2017): https://doi.org/10.1093/nar/gkx366
Zheng, W. et al. Cell Rep. Methods 1, 100014 (2021): https://doi.org/10.1016/j.crmeth.2021.100014
Hermes, C. et al. Nat. Commun. 12, 144 (2021): https://doi.org/10.1038/s41467-020-20418-3