0% found this document useful (0 votes)
2 views

TargetDBP_Accurate_DNA-Binding_Protein_Prediction_Via_Sequence-Based_Multi-View_Feature_Learning

The document presents TargetDBP, a novel computational method for accurately predicting DNA-binding proteins (DBPs) from protein sequences using a combination of four single-view features and a differential evolution algorithm for weight learning. The method employs support vector machine (SVM) for model training and includes a new benchmark dataset for evaluation, demonstrating superior performance compared to existing predictors. TargetDBP is accessible through a web server for academic use, providing researchers with a tool for efficient DBP identification.

Uploaded by

navakanthcse
Copyright
© © All Rights Reserved
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
2 views

TargetDBP_Accurate_DNA-Binding_Protein_Prediction_Via_Sequence-Based_Multi-View_Feature_Learning

The document presents TargetDBP, a novel computational method for accurately predicting DNA-binding proteins (DBPs) from protein sequences using a combination of four single-view features and a differential evolution algorithm for weight learning. The method employs support vector machine (SVM) for model training and includes a new benchmark dataset for evaluation, demonstrating superior performance compared to existing predictors. TargetDBP is accessible through a web server for academic use, providing researchers with a tool for efficient DBP identification.

Uploaded by

navakanthcse
Copyright
© © All Rights Reserved
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 11

IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS, VOL. 17, NO.

4, JULY/AUGUST 2020 1419

TargetDBP: Accurate DNA-Binding Protein


Prediction Via Sequence-Based
Multi-View Feature Learning
Jun Hu , Xiao-Gen Zhou , Yi-Heng Zhu, Dong-Jun Yu , and Gui-Jun Zhang

Abstract—Accurately identifying DNA-binding proteins (DBPs) from protein sequence information is an important but challenging task
for protein function annotations. In this paper, we establish a novel computational method, named TargetDBP, for accurately targeting
DBPs from primary sequences. In TargetDBP, four single-view features, i.e., AAC (Amino Acid Composition), PsePSSM (Pseudo
Position-Specific Scoring Matrix), PsePRSA (Pseudo Predicted Relative Solvent Accessibility), and PsePPDBS (Pseudo Predicted
Probabilities of DNA-Binding Sites), are first extracted to represent different base features, respectively. Second, differential evolution
algorithm is employed to learn the weights of four base features. Using the learned weights, we weightedly combine these base
features to form the original super feature. An excellent subset of the super feature is then selected by using a suitable feature selection
algorithm SVM-REFþCBR (Support Vector Machine Recursive Feature Elimination with Correlation Bias Reduction). Finally, the
prediction model is learned via using support vector machine on the selected feature subset. We also construct a new gold-standard
and non-redundant benchmark dataset from PDB database to evaluate and compare the proposed TargetDBP with other existing
predictors. On this new dataset, TargetDBP can achieve higher performance than other state-of-the-art predictors. The TargetDBP
web server and datasets are freely available at http://csbio.njust.edu.cn/bioinf/targetdbp/ for academic use.

Index Terms—DNA-binding protein prediction, sequence-based, differential evolution, feature selection, support vector machine

1 INTRODUCTION

I NTERACTIONS between proteins and DNA are indispens-


able for biological activities and play vital roles in a wide
variety of biological processes, such as gene regulation,
generally make use of both the structural and sequential
information of target proteins, while the sequence-based
methods solely employ the protein sequence information.
DNA replication and repair [1]. Hence, accurately targeting For the structure-based methods, although they can show
DNA-binding proteins (DBPs) is of significant importance promising predictive performance, their application is lim-
for the annotation of protein functions. Tremendous wet-lab ited, since the structural information of proteins is not
efforts have been made to uncover the intrinsic mechanism always available. In contrast, the sequence-based methods
of protein-DNA interactions. However, identification of can overcome this shortcoming by only using the sequence
DBPs via wet-lab experimental technologies is often cost- information as input for the DBP prediction. Since the gap
intensive and time-consuming. Facing the difficulty in exper- between protein sequences and structures fast continues
imentally identifying DBPs and the avalanche of new protein to widen in the post-genomic age, the development of
sequences generated in the post-genomic age [2], it is highly sequence-based DBP predictors has become a hot topic in
desired to develop an automatic computational method for bioinformatics.
rapidly and accurately targeting DBPs. In the recent, many researchers have proposed a series of
During the past decades, many computational methods sequence-based methods to predict DBPs. To name a few:
have been emerged for targeting DBPs [3], [4], [5]. These iDNA-Prot [7], PseDNA-Pro [8], iDNAPro-PseAAC [9],
methods can be roughly grouped into two categories iDNA-Prot j dis [10], Local-DPP [11], PSFM-DBT [12],
according to the features they used: structure-based meth- HMMBinder [13], IKP-DBPPred [3], iDNAProt-ES [14], DPP-
ods and sequence-based methods. Note that the structure- PseAAC [15], and the methods proposed in references [16],
based methods, e.g., DBD-Hunter [5] and iDBPs [6], [17], [18]. These methods often use only sequence informa-
tion and recognize DBPs with one or more machine-learning
algorithms, such as support vector machine (SVM) [14] or
 J. Hu, X. G. Zhou, and G. J. Zhang are with the College of Information random forest (RF) [7], [11]. For example, in iDNA-Prot [7],
Engineering, Zhejiang University of Technology, Hangzhou, 310023,
China. E-mail: {hujunum, zxg, zgj}@zjut.edu.cn.
the authors first employed a grey system theory [19] to
 Y. H. Zhu and D. J. Yu are with the School of Computer Science and extract the novel pseudo amino acid composition for repre-
Engineering, Nanjing University of Science and Technology, Nanjing senting the feature of each protein data and then adopted
210094, P. R. China. E-mail: [email protected], [email protected]. the algorithm of RF to train the final prediction model. In
Manuscript received 26 Aug. 2018; revised 17 Dec. 2018; accepted 13 Jan. PseDNA-Pro [8], three kinds of sequence-based information,
2019. Date of publication 18 Jan. 2019; date of current version 6 Aug. 2020.
(Corresponding authors: Dong-Jun Yu and Gui-Jun Zhang.)
i.e., amino acid composition, pseudo amino acid compo-
Digital Object Identifier no. 10.1109/TCBB.2019.2893634 sition (PseAAC) [20], [21], and physicochemical distance
1545-5963 ß 2019 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission.
See ht_tps://www.ieee.org/publications/rights/index.html for more information.
thorized licensed use limited to: NATIONAL INSTITUTE OF TECHNOLOGY TIRUCHIRAPALLI. Downloaded on January 17,2024 at 11:55:14 UTC from IEEE Xplore. Restrictions app
1420 IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS, VOL. 17, NO. 4, JULY/AUGUST 2020

transformation, are used to represent the feature vector of Finally, we have established an online web server of
each protein data and the SVM algorithm is then employed TargetDBP, which can be freely accessible for academic use
to generate the prediction model. In iDNAPro-PseAAC [9], at http://csbio.njust.edu.cn/bioinf/targetdbp/.
the information of pseudo amino acid composition and pro-
file-based protein representation [22] were integrated to rep-
2 MATERIALS AND METHODS
resent the numerical feature of each protein for predicting
DBPs based on the SVM algorithm. In HMMBinder [13], the 2.1 Benchmark Datasets
monogram and bigram features were first extracted from the The first important step of the statistical predictor develop-
HMM profiles, which were generated by HHblits [23] based ment is to establish a comprehensive, reliable, and stringent
on the sequence information, and then were fed into the SVM benchmark dataset [10]. In this paper, the benchmark dataset
algorithm for predicting DBPs. In iDNAProt-ES [14], the S for DBP prediction can be formally denoted as following:
PSSM-based evolutionary information and sequence-driven
structural information, generated by PSI-BLAST and SPI- S ¼ Sposi [ Snega ; (1)
DER2, respectively, were extracted to represent each protein;
then, the recursive feature elimination was utilized to select where Sposi means the positive subset that only contains
the optimal subset of features; finally, the SVM algorithm DBPs, Snega means the negative subset that only includes
were used to learn the final model. In DPP-PseAAC [15], the non-DBPs, and the symbol [ represents the “union” in the
authors first used Chou’s general PseAAC [20], [21] to repre- set theory.
sent protein data; then, the RF and SVM-RFE (support vector In order to construct Sposi , we first extract all DBP chains
machine recursive feature elimination) algorithms were from PDB [27] (as of May 12, 2018). Here, each DBP chain is
employed to rank features; finally, the SVM algorithm was classified into the positive class (i.e., DNA-binding protein)
utilized to generate the final prediction model. Despite the in PDB [27] or contains one DNA-binding residue at least,
promising results of these methods, there remains room for which includes at least one of its non-hydrogen atoms whose
further improvements in accurately predicting DBPs from distance to at least one non-hydrogen atom of the DNA mol-
sequence information. ecule is less than the sum of the Van Der Waals radii of the

To further improve the accuracy of DBP prediction, in this two corresponding atoms plus a tolerance of 0.5 A. Then, the
study, we propose a new sequence-based predictor, named CD-hit software [29] is used to remove the redundant protein
TargetDBP. To make the proposed TargetDBP to be a useful chains such that no two chains have more than 25 percent
sequence-based DBP predictor, we should observe the sequence identity. To ensure no fragment in the final dataset,
Chou’s 5-step rule [20] used in a series of recent publications any chain with less than 50 residues in length is removed.
[24], [26], i.e., making the following five steps clear: (1) how Besides, we also remove the proteins containing the residue
to construct or select a valid benchmark dataset to train ‘X’ since they include unknown residue. Finally, a total of
and test the predictor; (2) how to formulate the biological 1,200 non-redundant protein chains are obtained to form
sequence samples with an effective mathematical expression Sposi . The Sposi is divided into a positive training subset
that can truly reflect their intrinsic correlation with the target (Str
posi ) and a positive independent validation subset (Sposi ).
tst

to be predicted; (3) how to introduce or develop a powerful Sposi consists of 1,052 protein chains, which were all depos-
tr

algorithm (or engine) to operate the prediction; (4) how to ited into the PDB before May 12, 2014, while Stst posi includes
properly perform cross-validation tests to objectively evalu- 148 protein chains, which were all deposited into the PDB
ate the anticipated accuracy of the predictor; (5) how to after May 12, 2014.
establish a user-friendly web-server for the predictor that is According to the same steps as described above, we can
accessible to the public. obtain 16,058 non-DBP chains from PDB, where the sequence
Following the Chou’s 5-step rule, a new gold-standard identity of each two chains is less than 25 percent. Here, each
and non-redundant benchmark data set consisting of 1,200 non-DBP chain is not annotated to the class of DBP in PDB
DBPs and 1,200 non-DBPs is first collected from Protein Data [27] and contains no one DNA-binding residue. To construct
Bank (PDB) [27] and used to train and test the prediction Snega , we first select 1,052 chains, which were all deposited
model of TargetDBP. Second, we generate the feature repre- into the PDB before May 12, 2014, from these non-DBP chains
sentation of each protein by using the following steps: to form a negative training subset Str nega . We then choose 148
(1) four sequence-based single-view features, i.e., AAC, chains, which were all deposited into the PDB after May 12,
PsePSSM, PsePRSA, and PsePPDBS, are extracted to repre- 2014, to form a negative independent validation subset Stst nega .
sent different base features, respectively; (2) differential evo- Finally, Snega ¼ Strnega [ Stst
nega .
lution algorithm [28] is employed to learn the weights of four In this paper, the benchmark dataset S can be also pre-
base features; (3) using these learned weights, we can sented as following:
weightedly combine the four base features to form the origi-
nal super feature; (4) a suitable feature selection algorithm, S ¼ Str [ Stst ; (2)
SVM-REFþCBR, is employed to select an excellent subset of
the super feature for further quantifying the difference where Str ¼ Str posi [ Snega means the training dataset, which is
tr

between DBPs and non-DBPs. Third, the prediction model is used to train the DBP prediction model and test the TargetDBP
learned via using the algorithm of support vector machine performance by the leave-one-out cross-validation test (i.e.,
on the selected feature subset. Then, the leave-one-out cross- jackknife test), and Stst ¼ Ststposi [ Snega represents the inde-
tst

validation and independent dataset tests are used to system- pendent validation dataset, which is employed to test the
atically examine the strengths and weaknesses of TargetDBP. prediction performance by independent test. The protein
thorized licensed use limited to: NATIONAL INSTITUTE OF TECHNOLOGY TIRUCHIRAPALLI. Downloaded on January 17,2024 at 11:55:14 UTC from IEEE Xplore. Restrictions app
HU ET AL.: TARGETDBP: ACCURATE DNA-BINDING PROTEIN PREDICTION VIA SEQUENCE-BASED MULTI-VIEW FEATURE... 1421

name list of the benchmark dataset S can be found in Text S1 different protein sequences are generally different and these
in the Supporting Information (SI). Moreover, the correspond- lengths vary widely; but on the other hand, all existing
ing protein sequence information can be easily downloaded machine-learning algorithms, such as SVM [30], RF [31], and
from the online web server of TargetDBP, whose address is K-nearest neighbor (KNN) [43], cannot handle the length-
http://csbio.njust.edu.cn/bioinf/targetdbp/. different information, such as PSSM.
To deal with the above dilemma, Chou and Shen [37]
2.2 Feature Extraction proposed the PsePSSM approach, which has been demon-
With the explosive growth of biological sequences in the strated to be very useful for the most areas of computational
post-genomic era, one of the most important but also most proteomics, such as protein crystallization prediction [33]
difficult problems in computational biology is how to and protein structural class prediction [44]. The details of
express a biological sequence with a discrete model or a vec- generating PsePSSM feature are described as follows.
tor, yet still keep considerable sequence-order information For one given protein sequence consisting of L amino
or key pattern characteristic. This is because all the existing acids, its PSSM profile can be generated by using the PSI-
machine-learning algorithms (such as SVM [30], RF [31]) can BLAST software [45] to search against the non-redundant
only handle vectors as elaborated in a comprehensive review NCBI database through three iterations, with 0.001 as the
[32]. However, a simple vector (such as AAC [33], [34]) e-value cutoff for multiple sequence alignment. The logistic
defined in a discrete model may completely lose all the function, i.e., fðxÞ ¼ 1=ð1 þ ex Þ, is then used to normalize
sequence-pattern information. To avoid completely losing the score of each element, denoted as x, in a PSSM profile.
the sequence-pattern information for proteins, lots of meth- Let fpi;j gL20 denote the normalization PSSM (PSSMnorm ) of
ods, such as DPC (dipeptide composition) [34], TPC (tripep- the given protein sequence. The PsePSSM can be constructed
tide composition) [35], PseAAC [20], [36], and PsePSSM [37], with the following two steps [44]:
have been proposed to represent the feature of each protein Step 1: Calculating the PSSM Composition
sequence. Recently, two powerful web-servers, i.e., Pse- The PSSM composition (uPSSM ) is the vector with 20 ele-
in-One [38] and BioSeq-Analysis [39], have been established ments as defined:
to make us to generate the feature of protein sequence easily.
 T
In this study, we employed four effective single-view fea- uPSSM ¼ uPSSM ; uPSSM ; . . . ; uPSSM ; (4)
1 2 20
tures, i.e., AAC, PsePSSM, PsePRSA, and PsePPDBS, to repre-
PL
sent four different base features of each protein, respectively. where uPSSM ¼ pk;j =L, 1  j  20.
j k¼1
The details of extracting these single-view features are
described as follows: Step 2: Calculating Correlation Factors
Although the evolutionary information of the given pro-
2.2.1 Amino Acid Composition tein sequence can be partially reflected by uPSSM , all the
Amino acid composition (AAC) is a classical sequence- sequence-order information during the evolution process of
based feature view that has been widely employed in many the protein are fully lost. In order to remedy this defect and
protein attribute prediction tasks including DBP prediction extract the sequence-order information, we calculate the
[4]. Let AA1 ; AA2 ; . . . ; AA19 , and AA20 be the 20 ordered g-tier correlation factor of each column of the PSSMnorm
native amino acid types (i.e., A; C; . . . ; W; and Y ), ni be the matrix as follows:
number of occurrence of AAi in the given protein sequence,
and L be the protein length. The AAC feature (fAAC ) of the
g T
g ¼ ð1g ; 2g ; . . . ; 20 Þ ; (5)
protein is then formulated to be a vector with 20 elements, PLg
where jg ¼ k¼1 ðpk;j  pkþg;j Þ2 =ðL  gÞ, 1  j  20, 0  g 
as follows: G, G  L. The scalar quantity jg is the correlation factor by
n n2 n20 T
coupling the g-most contiguous PSSM scores along the pro-
1
fAAC ¼ ;...;
; ; (3) tein sequence for the amino acid type j.
L L L
Finally, the PsePSSM of the given protein sequence,
denoted as fPsePSSM , is the combination of its PSSM compo-
where T means the transpose of the vector. The dimension-
sition (uPSSM ) and the G correlation factor vectors (fg gGg¼1 )
ality of the AAC-view feature is 20.
as follows:
0 PSSM 1
2.2.2 Pseudo Position-Specific Scoring Matrix u
B 1 C
Position-specific scoring matrix (PSSM) is widely used to B C
B 2 C
extract the evolutionary information of a given protein fPsePSSM ¼ B  C: (6)
B .. C
sequence, which has been proved to be highly effective in a @ . A
variety of prediction tasks in the field of bioinformatics, G
including the DBP prediction [11], protein-ligand binding
site prediction [40], [41], protein contact map prediction [42], Considering the fact that there is no theoretical justification
and so on. However, when developing a machine-learning- on determining the optimal value G, we empirically tested
based DBP predictor, such as TargetDBP in this study, different values of G from 1 to 9 with a step of 1 on the training
PSSM, whose length is equal to that of the given protein dataset (i.e., Str ) and found that the optimal value of G is 6 (see
sequence, cannot directly be employed to represent the fea- details in Text S2 in SI). Accordingly, the dimensionality of
ture vector. This is because, on the one hand, the lengths of the final PsePSSM (fPsePSSM ) is 20 þ 6  20 ¼ 140.
thorized licensed use limited to: NATIONAL INSTITUTE OF TECHNOLOGY TIRUCHIRAPALLI. Downloaded on January 17,2024 at 11:55:14 UTC from IEEE Xplore. Restrictions app
1422 IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS, VOL. 17, NO. 4, JULY/AUGUST 2020

2.2.3 Pseudo Predicted Relative Solvent Accessibility class of DBS. It is importantly noted that we obtain the
The solvent accessibility [46] has close relevance to the spa- predicted probabilities of DBSs (PPDBS) for the training
tial arrangement of a protein, the characteristics of residues proteins via using the leave-one-out cross-validation test.
packing, and the protein-DNA interactions. Many researches For one given protein sequence with L amino acids, we
[4], [47] have experimentally demonstrated that the solvent can obtain the corresponding PPDBS, denoted as fsj gLj¼1 ,
accessibility has a positive impact to the DBP prediction. using the above procedures, where sj means the probability
In this study, the solvent accessibility of each residue is of the j-th residue belonging to DBS. We then extract the fea-
evaluated by the predicted relative solvent accessibility ture of pseudo PPDBS (PsePPDBS) by the following steps:
(PRSA) via feeding the corresponding sequence to the stand- Step 1: Calculating the PPDBS Composition
alone SANN program [48], downloaded from http://lee.kias. The PPDBS composition (uPPDBS ) is the vector with 20
re.kr/newton/sann/. For each given sequence, the SANN elements as defined:
program accurately predicts its PRSA profile (L rows and 3
columns, where L is the length of the given sequence), which  T
uPPDBS ¼ uPPDBS ; uPPDBS ; . . . ; uPPDBS ; (8)
includes the probabilities of three solvent accessibility classes 1 2 20

(i.e., buried (B), intermediate (I), and exposed (E)) of each resi- P
where uPPDBS
i ¼ Lk¼1 sk  Tk ðAAi Þ=L, AAi means i-th order
due. Similar to the generation procedure of PsePSSM, we also amino acid type (see details in Section 2.2.1), Tk ðAAi Þ ¼ 1
extract the pseudo predicted relative solvent accessibility when the amino acid type of the k-th residue of the given pro-
(PsePRSA) from the PRSA profile as follows: tein is equal to AAi , otherwise Tk ðAAi Þ ¼ 0, and 1  i  20.
Let fai;j gL3 be the PRSA profile. The PsePRSA feature
vector, denoted as fPsePRSA , can be formulated as follows: Step 2: Calculating Correlation Factors
0 1 To save some parts of the sequence-order information of
uPRSA the given protein, we calculate the m-tier correlation factor
B o1 C of the PPDBS as follows:
B C
B 2 C
fPsePRSA ¼ B o C; (7) XLm
B .. C
@ . A hm ¼ ðsk  skþm Þ2 =ðL  mÞ (9)
k¼1
oH
where 0  m  M and M  L.
P Finally, the PsePPDBS of the given sequence, denoted as
where u PRSA
¼ ðuPRSA ; uPRSA ; uPRSA ÞT , uPRSA ¼ Lk¼1 ak;j =L,
1 2
PLh 3 j
fPsePPDBS , is the combination of uPPDBS and the M correla-
oh ¼ ðoh1 ; oh2 ; oh3 ÞT , ohj ¼ 2
k¼1 ðak;j  ak þ h;j Þ =ðL  hÞ, tion factors (fhm gMm¼1 ) as follows:
1  j  3, 1  h  H, H  L. Then, we tested different
0 1
values of H from 1 to 9 with a step of 1 on the training uPPDBS
dataset (i.e., Str ) and found that the optimal value of H is 4 B h1 C
B C
B C
¼B h
2
(see details in Text S3 in SI). Finally, the dimensionality of fPsePPDBS C: (10)
the PsePRSA is 3 þ 4  3 ¼ 15. B . C
@ .. A
hM
2.2.4 Pseudo Predicted Probabilities of DNA-Binding
Sites To ensure the optimal value of the parameter M, we
Theoretically, we can correctly target all DBPs, when the empirically tested different values of M from 1 to 9 with a
DNA-binding sites (DBSs) of these proteins can be predicted step of 1 on the training dataset (i.e., Str ) and found that the
with the accuracy of 100 percent. However, most of the state- value optimal M is 4 (see details in Text S4 in SI). Accordingly,
of-the-art DBS predictors can only achieve around 80 percent the dimensionality of the final PsePPDBS is 20 þ 4  1 ¼ 24.
accuracy. Directly using the prediction results of DBS predic-
tors to target the DBPs is not the best way to help us to 2.3 Learning Weights for Combining Multi-View
improve the accuracy of DBP prediction. In this study, we Features
employ the prediction probability results of DBS predictor to It is believed that the above four single-view base features
act as a new feature view and extract the useful feature vector potentially contain the complementary discriminative infor-
from it to enhance the accuracy of DBP prediction. mation for predicting DBPs. How to combine the four base
To avoid the obviously high accuracy of DBS prediction features is one of the most crucial steps in generating a
caused by the potential high sequence identity between the machine-learning-based DBP prediction model. The most
testing proteins and the training proteins of the existing straightforward and convenient method is to serially and
DBS predictors, we use our training dataset to train a new directly combine these base features to gain a super feature
DBS prediction model rather than chose one of the existing (i.e., AAC þ PsePSSM þ PsePRSA þ PsePPDBS) to be
DBS predictors. Inspired by our previous work, i.e., employed for training a DBP prediction model. Here, 0 þ 0
TargetDNA [49], we first extract the discriminative feature of means simply serial combination. However, the simple com-
each residue of each training protein from the views of PSSM bination method cannot guarantee to obtain an optimal
and PRSA with a sliding window of size 9. Then, we employ discriminative capability, since it neglects the relative impor-
the SVM algorithm [30], [50] to train the DBS prediction tance of these base features. Hence, learning the relative
model. Finally, we use this model to gain the probability of importance of these base features would be especially useful
each residue of each testing protein being classified into the for improving the accuracy of DBP prediction.
thorized licensed use limited to: NATIONAL INSTITUTE OF TECHNOLOGY TIRUCHIRAPALLI. Downloaded on January 17,2024 at 11:55:14 UTC from IEEE Xplore. Restrictions app
HU ET AL.: TARGETDBP: ACCURATE DNA-BINDING PROTEIN PREDICTION VIA SEQUENCE-BASED MULTI-VIEW FEATURE... 1423

In this study, we employ the differential evolution (DE) 0:164; 1:489ÞT . That is, the weights of AAC, PsePSSM,
algorithm [28], [51], which is one of the most competitive PsePRSA, and PsePPDBS are 0.275, 0.168, 0.164, and 1.489,
variants of evolution algorithms, to learn the optimal/sub- respectively. Based on wG max
best , we can generated a new super
optimal weights of these base features, since it is easy to be feature, denoted as wAAC þ wPsePSSM þ wPsePRSA þ
implemented and achieve high performance [52], [53]. Fur- wPsePPDBS, by weightedly and serially combining the fea-
thermore, DE has been used to help numerous research fields, tures of AAC, PsePSSM, PsePRSA, and PsePPDBS.
such as protein structure prediction [54] and sensor network
localization [55], to achieve a positive impact. The procedure 2.4 Feature Selection Using SVM-RFEþCBR
of using DE to learn the suitable weights of these single-view Feature selection is a widely-used technique in the fields of
features in this study is briefly described as follows: machine learning and pattern recognition, since it can enhance
Let fðwÞ be the problem that is to search the maximum the performance of the prediction model by removing irrele-
MCC value of the leave-one-out cross-validation test on Str , vant, noisy, and redundant information from the original
w ¼ ðw1 ; w2 ; w3 ; w4 ÞT 2 ½2; 2 4 is one candidate solution, feature space. In order to further improve the performance of
where w1 , w2 , w3 , and w4 mean the weights of AAC, PsePSSM, DBP prediction, we employ the feature selection algorithm
PsePRSA, and PsePPDBS feature views, respectively. Then, named support vector machine recursive feature elimination
the steps of DE can be described as follows: with correlation bias reduction (SVM-RFEþCBR) [56], which
Step 1: Initialization has been successfully applied in DBP prediction [57], to select
Randomly generate the initial population P g ¼ fwg1 ; one excellent feature subset from the weighted feature vector,
w2 ; . . . wgN g, where wgi ¼ ðwgi;1 ; wgi;2 ; wgi;3 ; wgi;4 ÞT means the i-th
g i.e., wAAC þ wPsePSSM þ wPsePRSA þ wPsePPDBS. As
described in [57], SVM-REFþCBR is the enhanced version of
solution in the g-th generation population, N is the popula-
SVM-RFE algorithm [58]. SVM-RFEþCBR has both linear and
tion size. Set the values of scaling factor ðF Þ, crossover rate
nonlinear versions. The nonlinear SVM-RFEþCBR employs a
ðCRÞ, and maximum generation ðGmax Þ to be 0.5, 0.5, and
special kernel function to transform the nonlinear learning
1000, respectively. Set the generation count, i.e., g, to be 1.
problem in the original feature space to the linear one in the
Step 2: Mutation feature space of higher dimensions. In this study, we can gain
For each solution wgi in the population P g , a mutant vec- the output of the nonlinear SVM-RFEþCBR algorithm with a
tor (mgþ1
i ) is generated according to
ranked feature list on the training dataset. We then select an
optimal feature subset based on the ranked features (see
mgþ1
i ¼ wgr1 þ F  ðwgr2  wgr3 Þ; (11) Section 3.3).

where wgr1 , wgr2 , and wgr3 are three different solutions ran- 2.5 Implementation of Support Vector Machine
domly selected from the set of P g  fwgi g. Support vector machine (SVM) algorithm [30], [50], a
Step 3: Crossover machine learning approach based on the structural risk mini-
To increase the diversity of the solutions in the next mization principle of statistics learning theory, has been used
generation P gþ1 , the step of crossover is introduced in DE in a wide variety of bioinformatics fields, including DBP pre-
algorithm. For each solution wgi in P g , a trial vector diction [4]. In this study, we also utilize SVM to construct the
prediction model. We use the LIBSVM software (version
tgþ1
i ¼ ðtgþ1 gþ1 gþ1 gþ1
i;1 ; ti;2 ; ti;3 ; ti;4 Þ is formed as follows: libsvm-3.18) [59], which can be freely downloaded at
( http://www.csie.ntu.edu.tw/cjlin/libsvm/, to implement
mgþ1
i;k ; if Rj < CR or k ¼ krand
tgþ1 ¼ ; (12) the SVM algorithm and train the model of DBP prediction.
i;k
wgi;k ; otherwise Here, a radial basis function is chosen as the kernel function.
The kernel width parameter s and the regularization param-
where Rj is a uniform random number generator with out- eter g, which are two most important parameters, are opti-
come 2 ½0; 1 for the j-th element. krand 2 f1; 2; 3; 4g means a mized over a five-fold cross-validation using a grid search
randomly selected index, which ensures that tgþ1 i gets at tool in the LIBSVM software.
least one element from mgþ1
i .

Step 4: Selection 2.6 Architecture of TargetDBP


For each solution wgi in P g , to decide whether or not it Fig. 1 demonstrates the architecture of the proposed Tar-
should be remained to become a member, i.e., wigþ1 , of the getDBP. For a given protein, TargetDBP can extract the four
next generation population, i.e., P gþ1 , the corresponding different single-view features, i.e., AAC, PsePSSM, PsePRSA,
trial vector tgþ1
i is compared to the solution wgi based on the and PsePPDBS, by calling the corresponding programs.
output of fðwÞ. If fðwgi Þ is larger than fðtgþ1 g
i Þ, wi is retained Based on the weights wG max
best , which is learned by using DE
to be wi ; otherwise, the trial vector ti is set to wgþ1
gþ1 gþ1
i .
algorithm on the training dataset, the procedure of weighted
feature combination can be performed to generate a discrimi-
Step 5: Loop or Termination native feature vector. Then, the SVM-RFEþCBR algorithm is
If g is larger than Gmax , the procedure of DE algorithm is used to choose an optimal feature subset to be the final fea-
terminated and the best solution wgbest in P g should be out- ture representation of each protein. In training phase, after
putted; otherwise, g ¼ g þ 1 and repeat Steps 2-4. generating the final features of all proteins in the training
After the DE procedure is terminated, we can obtain the dataset Str , we can obtain the training sample set. We then
final solution wG best . In this study, the wbest is ð0:275; 0:168;
max Gmax
employ the SVM algorithm to train the final prediction
thorized licensed use limited to: NATIONAL INSTITUTE OF TECHNOLOGY TIRUCHIRAPALLI. Downloaded on January 17,2024 at 11:55:14 UTC from IEEE Xplore. Restrictions app
1424 IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS, VOL. 17, NO. 4, JULY/AUGUST 2020

TABLE 1
Performance Comparison of Four Single-View Features over
Leave-One-Out Cross-Validation Test on the Training Dataset

Feature Rec Spe Acc Pre MCC F1


ACC 67.21 68.16 67.68 67.85 0.354 0.675
PsePSSM 72.34 72.24 72.29 72.27 0.446 0.723
PsePRSA 64.26 65.21 64.73 64.88 0.295 0.646
PsePPDBS 77.00 77.09 77.04 77.07 0.541 0.770

where N þ is the total number of DBPs, Nþ is the number of


DBPs incorrectly predicted as non-DBP, N  is the total num-
ber of non-DBPs, and Nþ is the number of non-DBPs incor-
rectly predicted as DBP. The MCC measure ranges from -1 to
1, and the other five evaluation measures range between 0 to
1. The higher values of these evaluation indexes mean the
better performance of DBP prediction. The reported thresh-
old ðTHÞ, which can maximize the MCC value over the
leave-one-out cross-validation test on the training dataset
Str , is then chosen to calculate the values of Rec; Spe; Acc;
Pre; MCC, and F1 in this study. The value of TH is 0.48 in
this study. Moreover, the area under the receiver operating
characteristic (ROC) curve (denoted as AUC), which is a
Fig. 1. Architecture of TargetDBP.
threshold-independent evaluation index, is also used to
evaluate the overall ability of DBP prediction.
model for targeting DBPs. In prediction phase, for each pro-
tein to be predicted, after generating the corresponding final
feature vector, the final prediction model can be used to give
3 RESULTS AND DISCUSSIONS
the probability of classifying it to be a DBP. Finally, the deci- 3.1 Performance Comparison of Four Single-View
sion is performed based on the probability and the pre- Features
scribed threshold TH: a protein with probability above TH is In this section, we will investigate the discriminative per-
marked as DBP. How to choose the threshold TH will be formances of the four single-view features, i.e., AAC,
described in Section 2.7. PsePSSM, PsePRSA, and PsePPDBS. Each feature is evalu-
ated by performing leave-one-out cross-validation test on
2.7 Evaluation Indexes the training dataset Str with the SVM algorithm. Table 1 sum-
In this study, the performance of DBP prediction is assessed marizes the discriminative performance comparison of these
by using the following six classical evaluation indexes of single-view features.
binary classification, i.e., Recall (Rec), Specificity (Spe), Accu- From Table 1, we can easily find that PsePPDBS consis-
racy (Acc), Precision (Pre), Mathew’s Correlation Coefficient tently outperforms other three single-view features concern-
(MCC) [60], and F1 -score ðF1 Þ: ing the six evaluation indexes. Concretely, the MCC and F1
of PsePPDBS are 0.541 and 0.770, which are 52.82 and 14.07
 
Nþ percent higher than AAC, 21.30 and 6.50 percent higher
Rec ¼ 1  þ  100 (13) than PsePSSM, and 83.39 and 19.20 percent higher than
N
PsePRSA, respectively.
 
Nþ Fig. 2 also shows the ROC curves of AAC, PsePSSM,
Spe ¼ 1  100 (14) PsePRSA, and PsePPDBS. In Fig. 2, it is easy to find that the
N
AUC value of PsePPDBS is 0.831, which is higher than that
  of the other three single-view features. The underlying rea-
Nþ þ Nþ
Acc ¼ 1  100 (15) son for PsePPDBS to outperform other three single-view
Nþ þ N
features is that PsePPDBS contains the direct related infor-
  mation for DBP prediction.
Nþ
Pre ¼ 1  100 (16)
N  Nþ þ Nþ
þ
3.2 Weighted Feature Combination Can Enhance
  Prediction Performance
Nþ N
1  Nþ þ Nþ In this section, to verify the ability of weighted feature com-
MCC ¼ rffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi
   (17) bination, we will compare the discriminative performances
N  N þ N þ N 
1 þ þN þ  1 þ N  þ of AAC þ PsePSSM þ PsePRSA þ PsePPDBS and wAAC þ
wPsePSSM þ wPsePRSA þ wPsePPDBS. Each feature is
tested by performing leave-one-out cross-validation test on
Pre  Rec
F1 ¼ 2  ; (18) Str with SVM algorithm. Table 2 summarizes the compared
Pre þ Rec results.
thorized licensed use limited to: NATIONAL INSTITUTE OF TECHNOLOGY TIRUCHIRAPALLI. Downloaded on January 17,2024 at 11:55:14 UTC from IEEE Xplore. Restrictions app
HU ET AL.: TARGETDBP: ACCURATE DNA-BINDING PROTEIN PREDICTION VIA SEQUENCE-BASED MULTI-VIEW FEATURE... 1425

Fig. 2. ROC curves of AAC, PsePSSM, PsePRSA, and PsePPDBS. Fig. 3. ROC curves of AAC þ PsePSSM þ PsePRSA þ PsePPDBS and
wAAC þ wPsePSSM þ wPsePRSA þ wPsePPDBS.

TABLE 2
Performance Comparison between
AAC þ PsePSSM þ PsePRSA þ PsePPDBS and
wAAC þ wPsePSSM þ wPsePRSA þ wPsePPDBS over
Leave-One-Out Cross-Validation Test on the Training Dataset

Feature Rec Spe Acc Pre MCC F1


D-Feature 76.81 77.19 77.00 77.10 0.540 0.770
W-Feature# 78.52 79.18 78.85 79.04 0.577 0.788

D-Feature means the AAC þ PsePSSM þ PsePRSA þ PsePPDBS feature.


#
W-Feature means the wAAC þ wPsePSSM þ wPsePRSA þ wPsePPDBS
feature.

From Table 2, we can observe that wAAC þ wPsePSSM þ


wPsePRSA þ wPsePPDBS is consistently superior to
AAC þ PsePSSM þ PsePRSA þ PsePPDBS concerning the Fig. 4. The variation curves of MCC and F1 values versus the number of
six evaluation indexes. The Rec, Spe, Acc, Pre, MCC, and F1 selected features based on the SVM-RFE þ CBR-ranked features.
of wAAC þ wPsePSSM þ wPsePRSA þ wPsePPDBS are
78.52, 79.18, 78.85, 79.04, 0.577, and 0.788, which are 2.23, features in this section on the training dataset Str . Concretely,
2.58, 2.40, 2.52, 6.85, and 2.34 percent higher than after obtaining the ranked features by using SVM-RFEþCBR
AAC þ PsePSSM þ PsePRSA þ PsePPDBS, respectively. Fig. 3 algorithm, we evaluate the performance variations of MCC
also demonstrates the ROC curves of the two combination and F1 on the training dataset (i.e., Str ) over leave-one-out
features. From Fig. 3, we can find that the AUC of cross-validation tests by gradually varying the selected num-
wAAC þ wPsePSSM þ wPsePRSA þ wPsePPDBS is 0.855, ber of the top-ranked features from 1 to 199 with a step size of
which is slightly higher than that of AAC þ PsePSSM þ 10. Fig. 4 shows the variation curves of MCC and F1 values
PsePRSA þ PsePPDBS. along with the increasing number of selected features over
By revisiting Table 1, we can find an interesting phenom- leave-one-out cross-validation test.
enon that is the MCC value of PsePPDBS is slightly higher From Fig. 4, it can be observed that when the number of
than that of the directly serial combination feature, i.e., selected features ðNSF Þ is 131, the corresponding MCC and
AAC þ PsePSSM þ PsePRSA þ PsePPDBS. This phenome- F1 values based on leave-one-out cross-validation test both
non can be understood that directly combining these four achieve the highest values; when 1  NSF < 131, MCC and
single-view features, i.e., AAC, PsePSSM, PsePRSA, and F1 values tend to increase with a little fluctuation; when
PsePPDBS, cannot further improve the performance of NSF > 131, the MCC and F1 values slightly fluctuate and
DBP prediction. While, comparing to PsePPDBS, wAAC þ enter a slowly descending state. Thus, the final optimal fea-
wPsePSSM þ wPsePRSA þ wPsePPDBS can obtain a higher ture subset determined by feature selection is composed of
prediction performance. The above comparison results can the 131 top-ranked features, i.e., ON ¼ 131. In the 131 top-
demonstrate that the impact of weighted feature combina- ranked features, there are 18 features selected from the AAC
tion should be positive. feature view, 84 features selected from PsePSSM feature view,
14 features selected from PsePRSA feature view, and 15
3.3 Improving Performance by Feature Selection features selected from PsePPDBS feature view. That is, the
Since SVM-RFEþCBR cannot automatically ensure the opti- 90.00, 60.00, 93.33, and 62.50 percent features in AAC,
mal number ðONÞ of the selected features, we will empirically PsePSSM, PsePRSA, and PsePPDBS feature views are selected
select the value of ON based on the SVM-RFEþCBR ranked to compose the final optimal feature subset, respectively.
thorized licensed use limited to: NATIONAL INSTITUTE OF TECHNOLOGY TIRUCHIRAPALLI. Downloaded on January 17,2024 at 11:55:14 UTC from IEEE Xplore. Restrictions app
1426 IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS, VOL. 17, NO. 4, JULY/AUGUST 2020

TABLE 3 TABLE 4
Performance Comparison Between the Features of with Performance Comparisons between the Proposed
and without Feature Selection over Leave-One-Out TargetDBP and Other Predictors of DNA-Binding
Cross-Validation Test on the Training Dataset Proteins on the Independent Validation Dataset Stst

With/without Predictor Rec Spe Acc Pre MCC F1


Rec Spe Acc Pre MCC F1
feature selection a
iDNA-Prot 63.51 60.81 62.16 61.84 0.243 0.627
Without 78.52 79.18 78.85 79.04 0.577 0.788 PseDNA-Prob 78.38 56.08 67.23 64.09 0.354 0.705
With 79.56 79.85 79.71 79.79 0.594 0.797 iDNAPro-PseAACc 78.38 54.05 66.22 63.04 0.334 0.699
iDNA-Prot j disd 72.30 64.19 68.24 66.88 0.366 0.695
Local-DPPe 3.38 93.92 48.65 35.71 -0.06 0.062
The main reason for this phenomenon should be that the four PSFM-DBTf 71.62 65.54 68.58 67.52 0.372 0.695
feature views are all important for DBP prediction. HMMBinderg 99.32 1.35 50.34 50.17 0.034 0.667
IKP-DBPPredh 52.70 63.51 58.11 59.09 0.163 0.557
To further demonstrate the efficacy of feature selection,
iDNAProt-ES (on PDB1075)i 91.89 51.35 71.62 65.38 0.473 0.764
Table 3 and Fig. 5 list the performance comparisons between iDNAProt-ES (on Str)j 95.95 41.22 68.58 62.01 0.444 0.753
with and without feature selection on Str over leave-one-out DPP-PseAACk 55.41 66.89 61.15 62.60 0.225 0.588
cross-validation test. As shown in Table 3, values obtained TargetDBP 76.35 77.03 76.69 76.87 0.534 0.766
with feature selection are consistently better than those
a
obtained without feature selection in terms of all six evaluation Results computed using the iDNA-Prot server at http://www.jci-bioinfo.cn/
iDNA-Prot/.
indexes. Moreover, as demonstrated in Fig. 5, the AUC value b
Results computed using the PseDNA-Pro server at http://bioinformatics.
of with feature selection is 0.865, which is also 1.17 percent hitsz.edu.cn/PseDNA-Pro/.
c
higher than that of without feature selection. The results Results computed using the iDNAPro-PseAAC server at http://bioinformatics.
shown in Table 3 and Fig. 5 indicate that the performance is hitsz.edu.cn/ iDNAPro-PseAAC/.
d
Results computed using the iDNA-Prot j dis server at http://bioinformatics.
indeed enhanced after applying feature selection. hitsz.edu.cn/iDNA-Prot_dis/.
e
Results computed using the Local-DPP server at http://server.malab.cn/Local-
DPP/Index.html.
3.4 Comparing TargetDBP with Existing f
Results computed using the PSFM-DBT server at http://bioinformatics.hitsz.
DBP Predictors edu.cn/PSFM-DBT/.
g
In this section, we will experimentally demonstrate the effi- Results computed using the standalone version of HMMBinder.
h
Results computed using the IKP-DBPPred server at http://server.malab.cn/
cacy of the proposed TargetDBP by comparing it with other IKP-DBPPred/index.jsp.
existing DBP predictors, including iDNA-Prot [7], PseDNA- i
Results computed using the re-implementation of iDNAProt-ES on PDB1075.
Pro [8], iDNAPro-PseAAC [9], iDNA-Prot j dis [10], Local- j
Results computed using the re-implementation of iDNAProt-ES on Str .
k
DPP [11], PSFM-DBT [12], HMMBinder [13], IKP-DBPPred Results computed using the DPP-PseAAC server at http://77.68.43.135:8080/
DPP-PseAAC/.
[3], iDNAProt-ES [14], and DPP-PseAAC [15], on the inde-
pendent validation dataset, i.e., Stst .
The prediction results of iDNA-Prot [7], PseDNA-Pro [8],
iDNAPro-PseAAC [9], iDNA-Prot j dis [10], Local-DPP [11], on PDB1075, which is constructed in [10], after removing 12
PSFM-DBT [12], IKP-DBPPred [3], and DPP-PseAAC [15] are false data, i.e., 3THWD, 4FCYC, 4GNXK, 4GNXL, 1AOII,
obtained by feeding the 296 proteins in Stst to their corre- 4JJNJ, 4JJNI, 4ESVA, 2G8FA, 4EUWA, 3SWMA, and 4FF1A,
sponding web servers. Since the web servers of HMMBinder and some redundant proteins, e.g., 2AY0B and 2AY0C (see
and iDNAProt-ES are not working, the results of HMMBinder details in Text S5 in SI); the other is learned on Str . Table 4
and iDNAProt-ES are computed using the standalone version demonstrates the performance comparisons between Tar-
of HMMBinder and the re-implementations of iDNAProt-ES, getDBP with other predictors on Stst .
respectively. Here, for an objective and fair comparison, we According to the MCC and F1 , which are two overall
re-implemented two versions of iDNAProt-ES: one is learned measurements of the quality of the binary predictions, listed
in Table 4, we can find that the TargetDBP acts as the best
performer followed by iDNA-Prot [7], PseDNA-Pro [8],
iDNAPro-PseAAC [9], iDNA-Prot j dis [10], Local-DPP [11],
PSFM-DBT [12], HMMBinder [13], IKP-DBPPred [3],
iDNAProt-ES [14], and DPP-PseAAC [15]. Compared with
the second-best predictor, i.e., iDNAProt-ES (trained on
PDB1075), TargetDBP achieves the improvements of 50.01,
7.08, 17.57, and 12.90 percent on Spe, Acc, Pre, and MCC,
respectively, while still possesses almost the equal perfor-
mance on F1 as iDNAProt-ES at the same time.
It has not escaped from our notice that HMMBinder [13]
and iDNAProt-ES (trained on Str ) [14] achieves the highest
and second-highest Rec values, i.e., 99.32 and 95.95, respec-
tively. However, the corresponding Spe values are much
lower, i.e., 1.35 and 41.22, denoting too many false positives
are incurred during prediction. On the other hand, Local-
Fig. 5. ROC curves of the features of with and without feature selection DPP achieves the best performance on Spe ð93:92Þ while with
over leave-one-out cross-validation test on the training dataset. much lower Rec value implying too many false negatives are
thorized licensed use limited to: NATIONAL INSTITUTE OF TECHNOLOGY TIRUCHIRAPALLI. Downloaded on January 17,2024 at 11:55:14 UTC from IEEE Xplore. Restrictions app
HU ET AL.: TARGETDBP: ACCURATE DNA-BINDING PROTEIN PREDICTION VIA SEQUENCE-BASED MULTI-VIEW FEATURE... 1427

produced during prediction. Those are why the Pre values of Central Universities (No. 30918011104). The authors would
HMMBinder [13], iDNAProt-ES (trained on Str ) [14], and like to thank Dr. Swakkhar Shatabda and Dr. Rianon Zaman
Local-DPP [11] are both lower than that of TargetDBP. for providing the standalone program of HMMBinder.

3.5 Online Implementation of TargetDBP


We have implemented an online TargetDBP server, which is REFERENCES
freely available for academic use at http://csbio.njust.edu. [1] K. A. Jones, J. T. Kadonaga, P. J. Rosenfeld, T. J. Kelly, and R. Tjian,
cn/bioinf/targetdbp/. To use the server, one or more pro- “A cellular DNA-binding protein that activates eukaryotic transcrip-
tion and DNA replication,” Cell, vol. 48, no. 1, pp. 79–89, 1987.
tein sequences and an available email address should be [2] D. Eisenberg, E. M. Marcotte, I. Xenarios, and T. O. Yeates, “Protein
inputted. Then, the server will evaluate the running time function in the post-genomic era,” Nature, vol. 405, no. 6788,
and send it to user by email. After the server finished the sub- pp. 823–826, 2000.
mitted prediction task, the result email will automatically be [3] K. Qu, K. Han, S. Wu, G. Wang, and L. Wei, “Identification of DNA-
binding proteins using mixed feature representation methods,”
sent to user with instruction to access the result page, which Molecules, vol. 22, no. 10, 2017, Art. no. 1602.
will be kept on the TargetDBP web server for 3 months. [4] W. Lou, X. Wang, F. Chen, Y. Chen, B. Jiang, and H. Zhang,
If only one given protein sequence is inputted, the “Sequence based prediction of DNA-binding proteins based on
hybrid feature selection using random forest and Gaussian naive
TargetDBP prediction generally takes about 3-70 minutes Bayes,” PloS One, vol. 9, no. 1, 2014, Art. no. e86703.
depending on the length of the given sequence. The relatively [5] M. Gao and J. Skolnick, “DBD-Hunter: A knowledge-based
long computational time stems from the fact that TargetDBP method for the prediction of DNA–protein interactions,” Nucleic
must perform PSI-BLAST, SANN, and the procedure of DBS Acids Res., vol. 36, no. 12, pp. 3978–3992, 2008.
[6] G. Nimrod, M. Schushan, A. Szilagyi, C. Leslie, and N. Ben-Tal,
prediction to gain discriminative features, and run LIBSVM “iDBPs: A web server for the identification of DNA binding
to predict whether the given protein is DBP or not. proteins,” Bioinf., vol. 26, no. 5, pp. 692–693, 2010.
Although there is no limitation in the number of inputted [7] W.-Z. Lin, J.-A. Fang, X. Xiao, and K.-C. Chou, “iDNA-Prot: Iden-
protein sequences in the TargetDBP server, we strongly sug- tification of DNA binding proteins using random forest with grey
model,” PloS One, vol. 6, no. 9, 2011, Art. no. e24756.
gested that you should input less than 10 protein sequences [8] B. Liu, J. H. Xu, S. X. Fan, R. F. Xu, J. Y. Zhou, and X. L. Wang,
once, since our computation resource is limited. “PseDNA-Pro: DNA-binding protein identification by combining
Chou’s PseAAC and physicochemical distance transformation,”
Mol. Informat., vol. 34, no. 1, pp. 8–17, Jan, 2015.
4 CONCLUSION [9] B. Liu, S. Wang, and X. Wang, “DNA binding protein identifica-
tion by combining pseudo amino acid composition and profile-
Accurate identification of DBPs is one of the most important based protein representation,” Sci. Rep., vol. 5, 2015, Art. no.
tasks in the annotation of protein functions. In order to 15479.
enhance the performance of DBP prediction, in this study, [10] B. Liu, J. Xu, X. Lan, R. Xu, J. Zhou, X. Wang, and K.-C. Chou,
we have designed and implemented a new DBP predictor, “iDNA-Prot j dis: Identifying DNA-binding proteins by incorporat-
ing amino acid distance-pairs and reduced alphabet profile into the
named TargetDBP. Experimental results have demonstrated general pseudo amino acid composition,” PloS One, vol. 9, no. 9,
that the proposed TargetDBP significantly outperforms other 2014, Art. no. e106691.
existing DBP predictors. The superior performances of Tar- [11] L. Wei, J. Tang, and Q. Zou, “Local-DPP: An improved DNA-binding
protein prediction method by exploring local evolutionary
getDBP are due to several reasons, including an appropriate information,” Inf. Sci., vol. 384, pp. 135–144, 2017.
benchmark dataset, more discriminative feature design, and [12] J. Zhang, and B. Liu, “PSFM-DBT: Identifying DNA-binding pro-
careful construction of the prediction model. For easy to use, teins by combing position specific frequency matrix and distance-
the TargetDBP has already been implemented as a web bigram transformation,” Int. J. Mol. Sci., vol. 18, no. 9, 2017, Art. no.
1856.
server that is now available at http://csbio.njust.edu.cn/ [13] R. Zaman, S. Y. Chowdhury, M. A. Rashid, A. Sharma, A. Dehzangi,
bioinf/targetdbp/. and S. Shatabda, “HMMBinder: DNA-Binding protein prediction
To further improve the DBP prediction performance, using HMM profile based features,” Bio. Med. Res. Int., vol. 2017,
2017, 4590609.
our future work includes the following three directions: [14] S. Y. Chowdhury, S. Shatabda, and A. Dehzangi, “iDNAProt-ES:
(1) extracting more discriminative features to represent the Identification of DNA-binding proteins using evolutionary and
information buried in protein sequence; (2) employing structural features,” Sci. Rep., vol. 7, no. 1, 2017, pp. 14938.
the deep learning algorithm [61], [62] to obtain the available [15] M. S. Rahman, S. Shatabda, S. Saha, M. Kaykobad, and M. S. Rahman,
“DPP-PseAAC: A DNA-binding protein prediction model using
information extracted from the original feature repre- Chou’s general PseAAC,” J. Theoretical Biol., vol. 452, pp. 22–34,
sentation; (3) using the semi-supervised or unsupervised Sep 7, 2018.
machine-learning algorithms [63], [64] to learn the DBP pre- [16] R. F. Xu, J. Y. Zhou, B. Liu, Y. L. He, Q. Zou, X. L. Wang, and
diction model on these proteins with unknown structures; K. C. Chou, “Identification of DNA-binding proteins by incorporat-
ing evolutionary information into pseudo amino acid composition
(4) developing a more accurate method by combining Tar- via the top-n-gram approach,” J. Biomolecular Struct. Dyn., vol. 33,
getDBP and other state-of-the-art DBP prediction methods. no. 8, pp. 1720–1730, Aug. 3, 2015.
Although the TargetDBP still has room for optimization (for [17] X. W. Zhao, X. T. Li, Z. Q. Ma, and M. H. Yin, “Identify DNA-
binding proteins with optimal Chou’s amino acid composition,”
instance by integrating more programs when available), we Protein Peptide Lett., vol. 19, no. 4, pp. 398–405, Apr. 2012.
believe that it would be one of the most accurate tools for [18] Y. Fang, Y. Guo, Y. Feng, and M. Li, “Predicting DNA-binding
DBP prediction. proteins: approached from Chou’s pseudo amino acid composition
and other specific sequence features,” Amino Acids, vol. 34, no. 1,
pp. 103–109, Jan. 2008.
ACKNOWLEDGMENTS [19] D. Julong, “Introduction to grey system theory,” J. Grey Syst., vol. 1,
no. 1, pp. 1–24, 1989.
This work was supported in part by the National Natural Sci- [20] K. C. Chou, “Some remarks on protein attribute prediction and
ence Foundation of China (No. 61772273, 61373062, 61773346, pseudo amino acid composition,” J. Theoretical Biol., vol. 273, no. 1,
and 61573317), and the Fundamental Research Funds for the pp. 236–247, Mar. 21, 2011.
thorized licensed use limited to: NATIONAL INSTITUTE OF TECHNOLOGY TIRUCHIRAPALLI. Downloaded on January 17,2024 at 11:55:14 UTC from IEEE Xplore. Restrictions app
1428 IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS, VOL. 17, NO. 4, JULY/AUGUST 2020

Jun Hu received the BS degree in computer science from Anhui Normal University in 2011 and the PhD degree from the School of Computer Science and Engineering, Nanjing University of Science and Technology, in 2018, and is a member of the Pattern Recognition and Bioinformatics Group led by Professor Dong-Jun Yu. From 2016 to 2017, he was a visiting student at the University of Michigan (Ann Arbor), USA. He is currently a teacher in the College of Information Engineering at Zhejiang University of Technology. His current interests include pattern recognition, data mining, and bioinformatics.

Xiao-Gen Zhou received the PhD degree in control science and engineering from the College of Information Engineering, Zhejiang University of Technology, Hangzhou, China, in 2018. His research interests include intelligent information processing, optimization theory and algorithm design, and bioinformatics.

Yi-Heng Zhu received the BS degree in computer science from the Nanjing Institute of Technology, China, in 2015. He is currently working toward the PhD degree in the School of Computer Science and Engineering, Nanjing University of Science and Technology, and is a member of the Pattern Recognition and Bioinformatics Group led by Professor Dong-Jun Yu. His research interests include bioinformatics, data mining, and pattern recognition.

Dong-Jun Yu received the PhD degree in pattern recognition and intelligent systems from the Nanjing University of Science and Technology (NUST) in 2003. In 2008, he was an academic visitor at the Department of Computer Science of the University of York in the UK, and in 2016 he visited the Department of Computational Medicine of the University of Michigan (Ann Arbor). He is currently a full professor in the School of Computer Science and Engineering, Nanjing University of Science and Technology. His research interests include pattern recognition, machine learning, and bioinformatics. He is the author of more than 50 scientific papers in pattern recognition and bioinformatics, and a senior member of both the China Computer Federation (CCF) and the China Association of Artificial Intelligence (CAAI).

Gui-Jun Zhang received the PhD degree in control theory and control engineering from Shanghai Jiaotong University, Shanghai, China, in 2004. He is currently a professor with the College of Information Engineering, Zhejiang University of Technology, Hangzhou, China. His current research interests include intelligent information processing, optimization theory and algorithm design, and bioinformatics.