
DeepFinder An Integration of Feature Based and Deep Learning Approach For DNA Motif Discovery

This article proposes an improved DNA motif discovery method called DeepFinder that integrates feature-based and deep learning approaches. DeepFinder uses neural networks trained on sequence features from potential binding sites to construct motif models, and employs multiple prediction tools to obtain more initial motif hits. It is evaluated on ten chromatin immunoprecipitation datasets and shows improved performance over existing methods.


Biotechnology & Biotechnological Equipment

ISSN: 1310-2818 (Print) 1314-3530 (Online) Journal homepage: https://www.tandfonline.com/loi/tbeq20

DeepFinder: An integration of feature-based and deep learning approach for DNA motif discovery

Nung Kion Lee, Farah Liyana Azizan, Yu Shiong Wong & Norshafarina Omar

To cite this article: Nung Kion Lee, Farah Liyana Azizan, Yu Shiong Wong & Norshafarina
Omar (2018) DeepFinder: An integration of feature-based and deep learning approach for
DNA motif discovery, Biotechnology & Biotechnological Equipment, 32:3, 759-768, DOI:
10.1080/13102818.2018.1438209

To link to this article: https://doi.org/10.1080/13102818.2018.1438209

© 2018 The Author(s). Published by Informa UK Limited, trading as Taylor & Francis Group.

Published online: 10 Feb 2018.


Article views: 3765


Full Terms & Conditions of access and use can be found at https://www.tandfonline.com/action/journalInformation?journalCode=tbeq20
BIOTECHNOLOGY & BIOTECHNOLOGICAL EQUIPMENT, 2018
VOL. 32, NO. 3, 759–768
https://doi.org/10.1080/13102818.2018.1438209

ARTICLE; BIOINFORMATICS

DeepFinder: An integration of feature-based and deep learning approach for DNA motif discovery

Nung Kion Lee (a), Farah Liyana Azizan (b), Yu Shiong Wong (a) and Norshafarina Omar (a)

(a) Department of Cognitive Sciences, Faculty of Cognitive Sciences and Human Development, Universiti Malaysia Sarawak, Kota Samarahan, Sarawak, Malaysia
(b) Centre For Pre-University Studies, Universiti Malaysia Sarawak, Kota Samarahan, Sarawak, Malaysia

ABSTRACT

We propose an improved solution to the three-stage DNA motif prediction approach. The three-stage approach uses only a subset of input sequences for initial motif prediction, and the initial motifs obtained are employed for site detection in the remaining input subset of non-overlaps. The currently available solution is not robust because motifs obtained from the initial subset are represented as position weight matrices, which results in high false positives. Our approach, called DeepFinder, employs deep learning neural networks with features associated with binding sites to construct a motif model. Furthermore, multiple prediction tools are used in the initial motif prediction process to obtain a higher number of positive hits. Our features are engineered from the context of binding sites, which are assumed to be enriched with specificity information of sites recognized by transcription factor proteins. DeepFinder is evaluated using several performance metrics on ten chromatin immunoprecipitation (ChIP) datasets. The results show marked improvement of our solution in comparison with the existing solution. This indicates the effectiveness and potential of our proposed DeepFinder for large-scale motif analysis.

ARTICLE HISTORY
Received 17 February 2017
Accepted 3 February 2018

KEYWORDS
Deep learning neural network; motif discovery; DNA sequence feature; chromatin immunoprecipitation-sequencing analysis

CONTACT Nung Kion Lee [email protected]

© 2018 The Author(s). Published by Informa UK Limited, trading as Taylor & Francis Group.
This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Introduction

The ability to identify transcription factor binding sites or motifs in the genome is one of the keys to decipher gene regulation mechanisms. Motifs are recurring sequence patterns in a genome and are the binding sites of transcription factors crucial for the regulation of protein production in cells. Analysis of motifs is important for advancements of medical treatment and understanding of cell processes [1]. Both wet-lab and computational techniques have been widely employed for location identification and analysis of motifs.

Motif analyses with chromatin immunoprecipitation (ChIP) combined with massive parallel DNA sequencing (ChIP-seq) followed by computational prediction have enabled rapid genome-wide location prediction of thousands of high-confidence candidate motif locations. Genome-wide datasets have posed several challenges to the computational algorithm design because of increasing complexities of the sequence search space and the requirement of a large amount of memory space. Early methods for genome-wide motif discovery are based on comparative genomic [2] and motif profile search. The comparative genomic method is based on the principle that functional elements (e.g. motifs) evolved from the common ancestors are conserved, compared to their surrounding non-functional bases. Therefore, such conserved functional elements can be identified by performing conservation analysis between sequences of orthologous or paralogous species using pair-wise and multiple sequence alignment techniques. GenomeVISTA [3], LAGAN/MLAGAN [4], MUMmer [5], AVID [6] and MULAN [7] are examples of such tools. They are mostly based on the dynamic programming algorithm such as Smith–Waterman [8] for the local alignment and Needleman–Wunsch [9] for the global alignment. To speed up the alignment of genomes, heuristic techniques such as anchoring [6], threaded blockset [10] or greedy search [11] have been employed. Although comparative genomic methods enabled identification of conserved motifs, these methods missed many functional motifs that are not conserved [12]. The second group of methods uses a database of annotated motif profiles to detect associated sites in input datasets [13–16]. Motifs are typically represented as a position weight matrix (PWM) [17] or its variants [18]. MATCH [13] combines the matrix and core similarity score for scoring a sequence; MISCORE [14] computes the average mismatch score between a sequence and motif instances for scoring; the MAST [19] score of a sequence is simply the sum of the PWM's

entries that matched the nucleotides in different positions of the sequence; FIMO computes the log-likelihood ratio score for each sequence and converts it into a p-value for scoring purposes. The disadvantages of the motif profile search method are: first, it cannot effectively represent the specificities of DNA segments recognized by transcription factors (TF); second, a single representation cannot model the different 'codes' recognized by different TFs. For instance, some motifs have dependencies while others do not [20]. As a result, most PWM models give poor sensitivity and specificity in motif detection.

The advantages of the approach involving computational prediction of motif patterns are its cost effectiveness and its ability to hypothesize candidate binding sites before wet-lab verification is performed. The ultimate aim of motif prediction tools is to return a set of most potential putative binding site locations. Pre-genomic era tools were targeted mainly on small datasets from prokaryotic species, which cannot be scaled in terms of accuracy and speed [21]. Popular tools in that era can be categorized into multiple local alignment (AlignACE [22], MEME [23], BioProspector [24], MotifSampler [25]), pattern enumeration (MDSCAN [26], Weeder [27]) and heuristic search (GAME [28,29]). With the invention of the ChIP technology, genome-wide motif analysis has become feasible with many computational tools being proposed. Most of these tools employed heuristics and pattern enumerative approaches to search for possible motif patterns for their efficiency. That is, instead of enumerating exhaustively all motif patterns of specified lengths, a heuristic algorithm initially selects statistically significant seed consensus motif patterns which, for example, are short (3–8 bp in DREME [30]) or short patterns spaced by gaps [31]. Seeds are used to form longer patterns or initialize a search algorithm. Computational time, thus, is significantly reduced by starting the search using the resulting sub-optimal motif patterns. Although these tools are useful, most can only predict short motif patterns. A genetic algorithm-based tool has also been proposed [32], but search-based tools are not scalable.

A recently published article made an intriguing finding regarding the theoretical limit of the number of DNA sequences that should be used for computational DNA motif discovery [33]. The authors reported that increasing the number of input sequences does improve the motif prediction accuracy; however, after it reaches a certain quantity, the improvement is no longer significant. This finding contradicted many studies that assumed that better results can be expected when more input sequences are used for computational tools. This implies that a sufficient number of input sequences will be adequate to predict motifs in a dataset, and it is not necessary to use the whole set as input for computational tools.

The finding by Zia and Moses [33] was evidenced in an earlier study by Hu et al. [21], who reported that using more input sequences does not improve the prediction accuracy. They suggested that 'one can input only partial input sequences to a motif discovery algorithm to obtain a motif model and then use this model to find motifs in the remaining sequences. In this manner, a significant reduction in the running time can be achieved without sacrificing the prediction accuracy.' The accuracy of that method holds when: (a) The primary motif's appearances in the dataset are evenly distributed in the dataset of a sufficiently large size. (b) The primary motif model obtained from the partial input sequences can effectively detect associated binding sites in the remaining sequences. To the best of our knowledge, this finding has not been incorporated into the design of any tool. In fact, many newly proposed computational motif prediction tools [34] are designed to tackle large numbers of DNA sequences. Examples of such tools are AMD [31], RSAT peakmotifs [34] and DREME [30], which are based on consensus pattern enumeration; genetic algorithm based GADEM [35], and CompleteMotifs [36], which employs multiple motif discovery tools. Some authors have employed an ad-hoc motif discovery pipeline that works per the recommendation by Hu et al. [21]. We termed this strategy the three-stage approach.

In this study, DeepFinder, a motif discovery pipeline, is proposed to improve the current implementation of the three-stage method. The two novel features of this approach are: first, we employ an ensemble of motif discovery tools for initial prediction of candidate binding sites from a subset of input sequences; second, features associated with the most potential candidate binding sites are extracted for deep neural network learning. Using ten ChIP datasets for evaluation, our results have demonstrated that DeepFinder is able to improve the overall sensitivity and specificity rates in comparison to the three-stage approach.

Related works

The three-stage approach tackles the motif search in a large ChIP dataset by dividing the task into three consecutive steps [37,38]: (1) select a small subset of input dataset; (2) perform motif discovery in the subset using a computational tool and select the most potential candidate motifs; (3) use the candidate motif models to detect binding sites in the input subset not used in stage 1. The three-stage approach reduced the computational time significantly by avoiding the motif search in the whole sequence space. It conjectures that prominent

motifs can be obtained using any subset of the input sequences of a reasonable size. The existing approach employed a single motif tool in stage 2 to predict candidate motifs. The obtained motifs are typically modeled using the PWM, which is subsequently used for site detection in stage 3. Nevertheless, the existing solution is not robust. First, motif detection in the third stage relies on a good binding model that can represent a protein's specificities. Although PWM usually fits the binding affinity and specificity of a TF well, it is incapable of capturing motifs with positional dependencies. Second, in motif detection, setting the threshold value of a match is often difficult to ensure the balance of high sensitivity and specificity [39].

Supervised learning based on deep learning neural networks for enhancer motif prediction has become popular recently. DeepBind [40] employed convolutional neural networks (CNN) to identify the DNA- or RNA-binding regions. DeepBind's binding model showed excellent performance with an average area under curve (AUC) of 0.85, when it was trained on in vitro and tested on in vivo motif datasets, which outperformed the state-of-the-art methods using several performance metrics. DanQ [41] is a hybrid of convolutional and recurrent deep neural networks for learning enhancer-associated histone marks. It was claimed to outperform DeepSEA, another CNN-based model, using ChIP-Seq datasets for evaluation. However, DeepSEA's performance is still considered unsatisfactory with its precision-recall AUC being under 70%.

Materials and methods

Framework

DeepFinder involves a three-stage approach that utilizes an ensemble of motif finders and a machine-learning technique for motif prediction. Figure 1 illustrates the DeepFinder computational framework. It has three consecutive steps: (a) The dataset is partitioned into five non-overlapped subsets. (b) Four de novo motif discovery tools are applied on one of the partitioned subsets to predict putative motifs and the respective binding sites. The top three motifs returned from each tool are merged and divided using a clustering algorithm. Seventy-six features associated with candidate binding sites in merged clusters are extracted and used for stacked-autoencoder neural network learning. (c) Learned neural network is used to predict associated binding sites in input sequences not used in the initial motif prediction.

Figure 1. DeepFinder framework.

Candidate motif prediction and selection

A subset of input sequences is randomly selected from the input dataset for motif prediction by four de novo motif discovery tools: MEME [23], BioProspector [24], MDscan [26] and MotifSampler [25]. We employed toolbox of motif discovery (Tmod) [42], which implemented the four selected tools for candidate motif prediction. The top three motifs ranked by each tool's scoring function are selected for further processing. Putative site locations in the DNA sequences are located. Regions in DNA sequences where many overlapping putative sites are located are most likely to be legitimate binding regions. We called these regions binding segments (i.e. covered with at least one binding site or several overlapping binding sites in the vicinity). After identifying all binding segments, pairwise similarity between every possible pair is computed to generate a symmetric distance matrix (see the next subsection). Two clusters are generated by using the k-medoid clustering algorithm, implemented in Pycluster [43]. The cluster with a higher number of binding



segments is fed as input to the deep learning neural network for building a binding model. We decided to generate two clusters because it is crucial to pursue more sequences in a cluster to avoid inadequate training data since there is no practical guideline on how many sequences are needed for model learning. Therefore, it is utterly important to keep the cluster number small in order to avoid missing any significant binding segments.

Motif similarity

We have modified the similarity function described by [44], to compute the similarity score of two binding segments $x$ and $y$. Let $A(x) = \{x_1, x_2, \ldots, x_n\}$ be the set of binding sites clustered at binding segment $x$. The similarity score between two binding segments is computed from the average alignment of every pair of binding sites in the two segments. Suppose binding sites $x_i \in A(x)$ and $y_j \in A(y)$ have $m_{ij} = \max(l(x_i), l(y_j)) - \min(l(x_i), l(y_j)) + 1$ possible alignment positions; $l(x_i)$ is the length of $x_i$. Alignments of $x_i$ and $y_j$ are performed by starting at the left end position and then right shifting one base at a time, the shorter one on the longer one. At a particular alignment position $k$, the alignment score is computed as $s_k(x_i, y_j) = |x_i \cap y_j| / |x_i \cup y_j|$, where $|x_i \cap y_j|$ is the number of matched nucleotides of the aligned two sites, and $|x_i \cup y_j|$ is the total nucleotides in the alignment. Note that $0 \le s_k(x_i, y_j) \le 1$. Therefore, the similarity score between two binding sites $x_i$ and $y_j$ is defined as

$$\mathrm{sim}(x_i, y_j) = \sum_{1 \le k \le m_{ij}} s_k(x_i, y_j). \tag{1}$$

The similarity score between two binding segments $x$ and $y$ is defined as

$$\mathrm{sim}(A(x), A(y)) = 1 - \frac{\sum_{ij} \mathrm{sim}(x_i, y_j)}{\sum_{ij} m_{ij}}. \tag{2}$$

We use the scores obtained to populate our distance matrix, which is used by the k-medoid algorithm. As an illustrative example, suppose A(x) = {ATGCA, GCCG} and A(y) = {CGGA} are binding sites in each binding segment x and y, respectively. There are two pair-wise alignments between the two sets. The alignment pairs are (ATGCA, CGGA) and (GCCG, CGGA). The table below shows the alignment scores in different positions for the two pairs.

                     Position
    x        y       1      2
    ATGCA    CGGA    1/8    2/7
    GCCG     CGGA    0

The following gives an example of how one of the alignment scores is obtained for the pair (ATGCA, CGGA). There is a total of max(l(ATGCA), l(CGGA)) − min(l(ATGCA), l(CGGA)) + 1 = 5 − 4 + 1 = 2 possible alignment positions as shown below.

    ATGCA    Score
    CGGA-    1/8
    -CGGA    2/7
             Sum = 0.411

For alignment position 1, the score is 1/8 since there is a position (i.e. 3) where the nucleotide matched from the total nucleotides of 8. The matched nucleotide is counted as one for calculating |ATGCA ∪ CGGA−|. The alignment scores are summed to obtain 0.411 using Equation (1). The second pair is computed similarly which obtains sim(CGGA, GCCG) = 0. Finally, the similarity score between the two binding segments x and y is sim(A(x), A(y)) = 1 − (0.411 + 0)/3 = 0.863.

Motif features

Several DNA sequence features are highly associated with binding sites. Osada et al. [20] reported that adjacent bases of motifs have high occurrence dependencies, which, when modeled, can significantly improve the motif prediction sensitivity and specificity rates. Furthermore, Yáñez-Cuna et al. [45] observed that enhancer regions have high occurrences of repeated dinucleotides CA, GA, CG or GC. For classifier learning, a feature vector comprising three distinct feature sets is generated from each DNA binding segment: (a) k-mer feature as a simple count of co-occurrences of bases that have strong dependencies, and k is set to 3, which gives 64 feature values; (b) the frequency counts of A, C, G, T; and (c) selected 2-mers count: CA, CG, GA and GC and the dinucleotide dependencies of CA, CG, GA and GC, where the dependency value of an arbitrary dinucleotide XY is computed as c(XY)/(c(XA) + c(XC) + c(XG) + c(XT)); c() is the frequency count in a binding segment. The feature value of a k-mer g is computed as f(g) = c(g)/c(*), where c(*) is the sum of counts from all possible k-mers. The frequency values are normalized using the min–max method.

Classifier learning

DeepFinder employs a stacked autoencoder [46] to construct binding models using the 76 engineered sequence features. The stacked autoencoder is well known for its feature discovery, especially in unlabeled data, and its capacity in part-whole decomposition. In addition, a single stacked autoencoder acquires greater expressive power compared with any deep learning neural network. An equal number of negative controls is added to the sequence segments of positive data.
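The segment distance of Equations (1) and (2) in the Motif similarity subsection can be reproduced with a few lines of Python. The sketch below is not the authors' code: the function names are hypothetical, and it assumes, as the worked example suggests, that the union size at each alignment position equals the two site lengths added together minus the number of matched positions.

```python
from itertools import product


def alignment_scores(a, b):
    """Return s_k for every position when the shorter site slides along the longer."""
    longer, shorter = (a, b) if len(a) >= len(b) else (b, a)
    scores = []
    for offset in range(len(longer) - len(shorter) + 1):
        window = longer[offset:offset + len(shorter)]
        matches = sum(p == q for p, q in zip(window, shorter))
        # Assumption: a matched nucleotide is counted once in the union.
        union = len(longer) + len(shorter) - matches
        scores.append(matches / union)
    return scores


def site_similarity(a, b):
    """Equation (1): sum of the alignment scores over all m_ij positions."""
    return sum(alignment_scores(a, b))


def segment_distance(sites_x, sites_y):
    """Equation (2): 1 minus the summed similarities over the summed position counts."""
    pairs = list(product(sites_x, sites_y))
    total_similarity = sum(site_similarity(a, b) for a, b in pairs)
    total_positions = sum(abs(len(a) - len(b)) + 1 for a, b in pairs)
    return 1 - total_similarity / total_positions


# Worked example from the text: A(x) = {ATGCA, GCCG}, A(y) = {CGGA}
print(round(site_similarity("ATGCA", "CGGA"), 3))                   # 0.411
print(round(segment_distance(["ATGCA", "GCCG"], ["CGGA"]), 3))      # 0.863
```

Under this reading of the union term, the script returns the same 1/8 and 2/7 position scores and the same final value of 0.863 as the illustrative example.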
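The 76 engineered features listed under Motif features (64 3-mer values, 4 base frequencies, 4 selected 2-mer values and 4 dinucleotide dependencies) could be assembled roughly as follows. This is a minimal sketch rather than the DeepFinder implementation. The function names are hypothetical, and several details are assumptions: each group of k-mers is divided by its own total count c(*), the feature order is arbitrary, and min–max normalization is applied column-wise across segments.

```python
from itertools import product

BASES = "ACGT"
SELECTED_DINUCLEOTIDES = ("CA", "CG", "GA", "GC")


def count_kmers(seq, k):
    """Count every k-mer over ACGT; windows containing other symbols are skipped."""
    counts = {"".join(p): 0 for p in product(BASES, repeat=k)}
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if kmer in counts:
            counts[kmer] += 1
    return counts


def fraction(count, total):
    """c(g)/c(*), returning 0.0 when nothing was counted."""
    return count / total if total else 0.0


def segment_features(seq):
    """Build the 76-value feature vector for one binding segment."""
    seq = seq.upper()
    mono, di, tri = (count_kmers(seq, k) for k in (1, 2, 3))
    mono_total, di_total, tri_total = (sum(c.values()) for c in (mono, di, tri))

    features = [fraction(c, tri_total) for c in tri.values()]             # (a) 64 values
    features += [fraction(c, mono_total) for c in mono.values()]          # (b) 4 values
    features += [fraction(di[d], di_total) for d in SELECTED_DINUCLEOTIDES]  # (c) 4 values
    for d in SELECTED_DINUCLEOTIDES:                                      # (c) 4 dependencies
        first_base_total = sum(di[d[0] + b] for b in BASES)
        features.append(fraction(di[d], first_base_total))
    return features


def min_max_normalize(rows):
    """Assumption: rescale each feature column to [0, 1] across all segments."""
    columns = list(zip(*rows))
    lows = [min(c) for c in columns]
    highs = [max(c) for c in columns]
    return [[(v - lo) / (hi - lo) if hi > lo else 0.0
             for v, lo, hi in zip(row, lows, highs)]
            for row in rows]
```

For a segment such as CACG, the dependency of CA is c(CA)/(c(CA) + c(CC) + c(CG) + c(CT)) = 1/2, which is the last-but-three entry of the vector returned by `segment_features`.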

Together, the positive segments and the negative controls form the input dataset. The negative controls used are complementary intervals of the positive datasets retrieved from Galaxy Online [47]. The architecture of the stacked autoencoder used in this study is shown in Figure 2. It consists of two hidden layers, with 25 and 15 hidden neurons, respectively. The neural network was trained with a learning rate between 0.5 and 1, a mini-batch size of 100 and a dropout value of 0.5. We found that five fixed epochs were sufficient for the classifier to learn the features, since no significant improvement in learning was obtained when more epochs were used. Therefore, five epochs were used for all the cross-validation experiments. The stacked autoencoder implemented by the DeepLearnToolbox [48] was employed in this study.

Figure 2. Stacked autoencoder neural network architecture. It consists of 76 input neurons in the input layer, 25 and 15 neurons in the first and second hidden layer, respectively. The output layer has two output neurons which represent motif and non-motif class; b is bias neuron.

Datasets and performance metrics

Ten DNA datasets were downloaded from the UCSC Genome database (hg19, February 2009 (GRCh37)) [49]. These datasets are enhancer DNA sequences bound by various TFs, which allowed us to evaluate the robustness of DeepFinder.

Table 1 shows the number of binding sites annotated in the database for the selected TFs. For our evaluation purpose, only five thousand locations were randomly selected and each was extended in both directions by adding symmetric margins of 1000 bp along the genome. Twenty percent of the total binding sites is treated as input to the four de novo motif discovery tools for all the experiments.

Table 1. Statistics of ten datasets used in this study.

    Transcription factor    Total sites
    CREB                    6822
    GATA1                   9248
    P53                     18 282
    P300                    13 189
    SRF                     24 889
    STAT1                   14 030
    NFE2                    19 635
    MEF2                    41 426
    ELK1                    10 781
    HNF4                    14 306

The accuracy of the classifier is assessed by the f-measure and false discovery rate (FDR). The f-measure is given by the following formula:

$$f\text{-measure} = \frac{2pr}{p + r}$$

where $p$ and $r$ are the precision and recall rates, respectively [50]. FDR is defined by 1 − precision [51]. Precision and recall rates are computed using the following formulas:

$$p = \frac{TP}{TP + FP}, \qquad r = \frac{TP}{TP + FN}$$

where TP, FP and FN are counts of true positives (TP), false positives (FP) and false negatives (FN), respectively, from the cross-validation experiment.

Figure 3 shows that TP is where the region (100 bp symmetric margins on both sides of TFBS) is correctly classified as a TF-binding region; FP is where the region is incorrectly classified as a TF-binding region; true negative (TN) is where the region is correctly identified as a non-TF binding region; and FN is where the region is incorrectly identified as a non-TF-binding region. Sensitivity, or true positive rate (TPR), and specificity (SPC), or true negative rate, are defined as follows:

$$TPR = \frac{TP}{TP + FN}, \qquad SPC = \frac{TN}{TN + FP}$$

Lastly, the Matthews correlation coefficient (MCC) [52] is computed using the following formula:

$$MCC = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}$$

Figure 3. Visual description of true positives (TP), false positives (FP), false negatives (FN) and true negatives (TN) on binding and non-binding regions in prediction. The black region indicates binding sites and the blue stretch indicates DNA region.
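The performance metrics defined in this subsection can be computed directly from the four confusion-matrix counts. The helper below is an illustrative sketch with a hypothetical function name. Two points are assumptions rather than statements from the text: accuracy, which is reported in the results but not defined here, is taken in its usual form (TP + TN)/(TP + FP + TN + FN), and any ratio with a zero denominator is returned as 0.0.

```python
from math import sqrt


def classification_metrics(tp, fp, tn, fn):
    """Precision, recall, f-measure, FDR, specificity, accuracy and MCC from counts."""
    def ratio(numerator, denominator):
        # Assumption: undefined ratios (zero denominator) are reported as 0.0.
        return numerator / denominator if denominator else 0.0

    precision = ratio(tp, tp + fp)
    recall = ratio(tp, tp + fn)              # also the sensitivity (TPR)
    mcc_denominator = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "precision": precision,
        "recall": recall,
        "f_measure": ratio(2 * precision * recall, precision + recall),
        "fdr": 1 - precision,
        "specificity": ratio(tn, tn + fp),
        "accuracy": ratio(tp + tn, tp + fp + tn + fn),   # standard definition, assumed
        "mcc": ratio(tp * tn - fp * fn, mcc_denominator),
    }


# Example with made-up counts (not taken from the paper's experiments)
metrics = classification_metrics(tp=90, fp=10, tn=80, fn=20)
print({name: round(value, 3) for name, value in metrics.items()})
```

A classifier that labels every region correctly gives an MCC of 1, and one that labels every region incorrectly gives −1, which matches the interpretation of the coefficient's range.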

The MCC takes a value in the range of [−1, 1], in which 1 indicates a perfect prediction, 0 means random prediction, and −1 represents a negative correlation.

Hardware specifications

For the simulation, all the individual motif discovery tools in DeepFinder ran on Intel Core i5, 1.7 GHz CPU with 16 GB of memory.

Results and discussion

We investigated the performance of DeepFinder by comparing it with the original implementation of the three-stage approach. All tools used the datasets listed in Table 1 for the evaluation. For the three-stage approach, MAST [15] and FIMO [15] were used as site detection tools in stage 3, while MEME was used for initial motif prediction. The top three motifs obtained from 20% of the randomly selected input sequences were used to detect binding sites in the remaining input subset. The top motifs were chosen based on the ranking function used in MEME.

Tables 2 and 3 show the comparison results between DeepFinder, MAST and FIMO. The values in the tables are averaged scores from five-fold cross-validation. The comparison of these three models revealed that MAST consistently performed the worst in all of the evaluation metrics. FIMO is slightly better than DeepFinder in terms of precision and false discovery rate for ELK1, GATA1, P300 and SRF datasets. It is worth noticing that DeepFinder outperformed others for recall, f-measure and accuracy rates for all the datasets. For example, the recall rates for DeepFinder are >0.90 for all datasets, whereas those for FIMO reached 0.68 at best. However, MAST and FIMO both have no true negatives because the scanning is performed on positive controls; therefore, no specificity rates are shown.

Table 2. Comparison of average precision, recall and f-measure rates of MAST, FIMO and DeepFinder (DF) using five-fold cross-validation.

             Precision            Recall               f-measure
    TF       MAST  FIMO  DF       MAST  FIMO  DF       MAST  FIMO  DF
    CREB     0.60  0.99  1.00     0.03  0.45  0.97     0.06  0.62  0.99
    ELK1     0.69  0.99  0.91     0.04  0.54  0.91     0.08  0.70  0.90
    GATA1    0.68  0.99  0.95     0.04  0.53  0.94     0.08  0.69  0.95
    HNF4     0.68  1.00  1.00     0.05  0.51  0.99     0.09  0.67  1.00
    MEF2     0.80  1.00  1.00     0.01  0.20  0.98     0.03  0.33  0.99
    NFE2     0.73  1.00  1.00     0.03  0.36  0.99     0.06  0.52  1.00
    P53      0.71  0.99  1.00     0.05  0.49  0.99     0.09  0.66  0.99
    P300     0.65  0.99  0.98     0.06  0.68  0.96     0.11  0.80  0.97
    SRF      0.75  1.00  0.98     0.03  0.37  0.99     0.06  0.51  0.98
    STAT1    0.67  1.00  1.00     0.04  0.44  0.99     0.07  0.61  0.99

Table 3. Comparison of false discovery rate (FDR), accuracy, and Matthews correlation coefficient (MCC) of MAST, FIMO, and DeepFinder (DF).

             FDR                  Accuracy             MCC
    TF       MAST  FIMO  DF       MAST  FIMO  DF       MAST   FIMO   DF
    CREB     0.40  0.01  0.00     0.03  0.45  0.99     -0.62  -0.06  0.97
    ELK1     0.31  0.01  0.09     0.04  0.54  0.89     -0.54  -0.05  0.81
    GATA1    0.32  0.01  0.05     0.04  0.53  0.94     -0.55  -0.05  0.89
    HNF4     0.32  0.00  0.00     0.05  0.51  1.00     -0.55  -0.04  0.99
    MEF2     0.20  0.00  0.00     0.01  0.99  0.99     -0.44  -0.04  0.98
    NFE2     0.27  0.00  0.00     0.03  0.36  1.00     -0.51  -0.03  0.99
    P53      0.29  0.01  0.00     0.05  0.49  0.99     -0.53  -0.06  0.99
    P300     0.35  0.01  0.02     0.06  0.67  0.97     -0.57  -0.05  0.94
    SRF      0.25  0.00  0.02     0.03  0.37  0.98     -0.49  -0.03  0.96
    STAT1    0.33  0.00  0.00     0.03  0.44  0.99     -0.56  -0.04  0.99

Table 3 illustrates the MCC values of the three tools. It is observed that the MCC values of DeepFinder are >0.8, which indicates a positive correlation between the predicted and the actual classes. In contrast, the MCC values of MAST and FIMO are negative for all the datasets, which indicates a poor correlation between the predicted and actual classes. In particular, FIMO MCC values are mostly near to 0, implying very poor prediction.

Figure 4 depicts the average f-measure, precision, recall and accuracy rates for all datasets produced by DeepFinder. It is clearly shown that ELK1 and GATA1 datasets have lower performance than that of the other datasets. Figure 5 shows the overall performance of the three tools for the three performance metrics, precision rate, recall rate and FDR. It can be observed that DeepFinder outperformed MAST and FIMO in overall results for the ten datasets. For example, the average recall rate obtained by DeepFinder for the ten datasets was 0.97, whereas those obtained by MAST and FIMO were 0.04 and 0.46, respectively. Likewise, DeepFinder performed considerably better in terms of average accuracy rate, with 0.97 in comparison with only 0.04 and 0.54 by MAST and FIMO, respectively. FIMO performed marginally better than DeepFinder in terms of precision rate and FDR. MAST consistently performed the worst in all the datasets and performance metrics.

The precision and recall rates obtained by DeepFinder by using five different subsets of each dataset are presented in Figures 6 and 7. The SRF dataset has rather inconsistent precision and recall rates when different subsets are used. For example, it is observed that there is a sharp decrease in the precision and recall rates when subsets 2 and 5 were used. In general, the performance was quite robust when different input subsets were used for initial motif prediction. This result supported our assumption that the same set of motif features occurs quite consistently across different subsets of input sequences.

DeepFinder showed promising results in our evaluation using the ten datasets listed in Table 1. It demonstrates that using supervised learning with sequence content feature improves the performance of the three-stage approach. DeepFinder has better predictive power,

Figure 4. Average f-measure, precision, recall and accuracy rates obtained by DeepFinder on the ten datasets using five-fold cross-
validation.

Figure 5. Average precision, recall, f-measure, false discovery rate (FDR) and accuracy for MAST, FIMO and DeepFinder for the ten
datasets.

Figure 6. Comparisons of precision rates for five different data subsets used in prediction.

Figure 7. Comparisons of recall rates for five different data subsets used in prediction.

compared to those of MAST and FIMO, particularly for TFs that bind within the enhancer regions. The improved performance is due to a better representation of the features associated with binding sites and the ability of the supervised neural network to learn those features effectively. The 76 features were selected based on scientific evidence in the literature to represent the specificity of transcription factor proteins. Better representation improves the robustness of the motif model to noise in the final site detection stage. In addition, the multi-layer structure of the neural network coupled with supervised learning can effectively learn the input-output mapping. The strength of the neural network is in its ability to learn different abstractions of features through different configurations of the number of hidden layers and hidden neurons in each layer [53]. Our results suggest that a motif model obtained from supervised learning has better discriminative power compared with the PWM. The use of multiple tools for initial prediction in the first stage increases the chances of obtaining true binding sites. The poor performance of MAST and FIMO is inherited from the performance level of MEME as well as the binding model used. It is known that the PWM binding model is poor in capturing sequence specificities of various TFs [14,54]. It is seen in Table 2 that FIMO achieved a slightly higher precision rate in three of the ten datasets (i.e. ELK1, GATA1, SRF) compared with DeepFinder. Upon checking the consensus motif patterns of those datasets, it is observed that they are highly conserved, while for CREB and p53, which have less conserved motif patterns, DeepFinder performed better than FIMO. Although this explanation is far from conclusive with the limited datasets evaluated, it is reasonable to infer that one of DeepFinder’s strengths lies in better modelling the specificity of less conserved motifs.

Conclusions

We have proposed the DeepFinder framework, a simple yet effective ensemble method that utilizes positive information in the motifs returned by multiple motif discovery tools for motif identification purposes. Overall, our solution produced highly improved results in comparison with the original implementation of the three-stage approach. Our solution, which incorporates sequence content features and deep learning, demonstrates promising potential for modeling TF specificities. The various sequence features used are effective in representing the salient characteristics of binding sites in comparison with PWM. In addition, our results also suggest an effective implementation of the three-stage approach for tackling large input datasets. Using DeepFinder has several advantages. First, the stacked autoencoder employed in DeepFinder can accurately predict regulatory sequences without any prior knowledge about transcription factor binding sites (e.g. lengths of possible motifs and pre-fixed motif model) by using only sequence content information for model construction. Second, DeepFinder is flexible enough to capitalize on novel feature information related to binding sites that may become available in the future. The new features can easily be used as inputs to DeepFinder for more effective motif modeling. Third, deep learning neural networks are known to learn better when more data are available. While large data are not a requirement for DeepFinder, we can expect it to perform better when larger subsets are used. In contrast, the PWM is restricted by the number of free parameters that are predefined in the model. This implies that more data would not necessarily increase the amount of information it can represent. For future studies, one of the pertinent issues is the representation of DNA sequences for

other types of deep learning neural networks such as CNN. CNN is powerful because it can learn different abstractions of features in DNA sequences without the need for handcrafted features. However, it requires DNA sequences to be represented as vectors or in matrix form (i.e. as an image) for effective learning. The currently available solutions mainly focus on one-hot encoding, which we feel is not a ‘natural’ representation of DNA sequences. We are currently conducting exploratory research into more meaningful representations of DNA sequences as ‘images.’ In addition, a more effective method is needed to merge and filter the large number of candidate motifs from multiple motif prediction tools.

Acknowledgements

YS is supported by the Malaysian MyBrain15 MyPhD Scholarship. ON is supported by the Fundamental Research Grant Scheme FRGS/1/2014/SG03/UNIMAS/02/2.

Disclosure statement

No potential conflict of interest was reported by the authors.

Funding

This study was supported by the Malaysia Research Acculturation Grant Scheme of the Ministry of Higher Education Malaysia [grant number RAGS/b(5)/927/2012(28)].

References

[1] Maston GA, Evans SK, Green MR. Transcriptional regulatory elements in the human genome. Annu Rev Genomics Hum Genet. 2006;7:29–59.
[2] Zambelli F, Pesole G, Pavesi G. Motif discovery and transcription factor binding sites before and after the next-generation sequencing era. Brief Bioinform. 2013;14:225–237.
[3] Poliakov A, Foong J, Brudno M, et al. GenomeVISTA—an integrated software package for whole-genome alignment and visualization. Bioinformatics. 2014;30:2654–2655.
[4] Brudno M, Do CB, Cooper GM, et al. LAGAN and Multi-LAGAN: efficient tools for large-scale multiple alignment of genomic DNA. Genome Res. 2003;13:721–731.
[5] Kurtz S, Phillippy A, Delcher AL, et al. Versatile and open software for comparing large genomes. Genome Biol. 2004 [cited 2017 Mar 12];5:R12. DOI:10.1186/gb-2004-5-2-r12.
[6] Bray N, Dubchak I, Pachter L. AVID: A global alignment program. Genome Res. 2003;13:97–102.
[7] Ovcharenko I, Loots GG, Giardine BM, et al. Mulan: multiple-sequence local alignment and visualization for studying function and evolution. Genome Res. 2005;15:184–194.
[8] Smith TF, Waterman MS. Identification of common molecular subsequences. J Mol Biol. 1981;147:195–197.
[9] Needleman SB, Wunsch CD. A general method applicable to the search for similarities in the amino acid sequence of two proteins. J Mol Biol. 1970;48:443–453.
[10] Blanchette M, Kent WJ, Riemer C, et al. Aligning multiple genomic sequences with the threaded blockset aligner. Genome Res. 2004;14:708–715.
[11] Al Ait L, Yamak Z, Morgenstern B. DIALIGN at GOBICS–multiple sequence alignment using various sources of external information. Nucleic Acids Res. 2013;41:W3–7.
[12] King DC, Taylor J, Zhang Y, et al. Finding cis-regulatory elements using comparative genomics: some lessons from ENCODE data. Genome Res. 2007;17:775–786.
[13] Kel AE, Gössling E, Reuter I, et al. MATCH: A tool for searching transcription factor binding sites in DNA sequences. Nucleic Acids Res. 2003;31:3576–3579.
[14] Wang D, Lee NK. MISCORE: Mismatch-based matrix similarity scores for DNA motif detection. In: Köppen M, Kasabov N, Coghill G, editors. Adv. Neuro-Information Process. Berlin, Heidelberg: Springer; 2009. p. 478–485.
[15] Grant CE, Bailey TL, Noble WS. FIMO: scanning for occurrences of a given motif. Bioinformatics. 2011;27:1017–1018.
[16] Bailey T, Boden M, Whitington T, et al. The value of position-specific priors in motif discovery using MEME. BMC Bioinformatics. 2010 [cited 2017 Mar 12];11:179. DOI:10.1186/1471-2105-11-179.
[17] Stormo GD. DNA binding sites: representation and discovery. Bioinformatics. 2000;16:16–23.
[18] Bi Y, Kim H, Gupta R, et al. Tree-based position weight matrix approach to model transcription factor binding site profiles. PLoS One. 2011 [cited 2017 Mar 12];6:e24210. DOI:10.1371/journal.pone.0024210.
[19] Bailey TL, Gribskov M. Combining evidence using p-values: application to sequence homology searches. Bioinformatics. 1998;14:48–54.
[20] Osada R, Zaslavsky E, Singh M. Comparative analysis of methods for representing and searching for transcription factor binding sites. Bioinformatics. 2004;20:3516–3525.
[21] Hu J, Li B, Kihara D. Limitations and potentials of current motif discovery algorithms. Nucleic Acids Res. 2005;33:4899–4913.
[22] Hughes JD, Estep PW, Tavazoie S, et al. Computational identification of Cis-regulatory elements associated with groups of functionally related genes in Saccharomyces cerevisiae. J Mol Biol. 2000;296:1205–1214.
[23] Bailey TL, Elkan C. Fitting a mixture model by expectation maximization to discover motifs in biopolymers. Proc Int Conf Intell Syst Mol Biol. 1994;2:28–36.
[24] Liu XS, Brutlag DL, Liu JS. BioProspector: discovering conserved DNA motifs in upstream regulatory regions of co-expressed genes. Pacific Symp Biocomput. 2001;6:127–138.
[25] Thijs G, Marchal K, Lescot M, et al. Gibbs sampling method to detect overrepresented motifs in the upstream regions of coexpressed genes. J Comput Biol. 2002;9:447–464.
[26] Liu XS, Brutlag DL, Liu JS. An algorithm for finding protein-DNA binding sites with applications to chromatin-immunoprecipitation microarray experiments. Nat Biotechnol. 2002;20:835–839.
[27] Pavesi G, Mauri G, Pesole G. An algorithm for finding signals of unknown length in DNA sequences. Bioinformatics. 2001;17:S207–214.

[28] Wei Z, Jensen ST. GAME: detecting cis-regulatory elements using a genetic algorithm. Bioinformatics. 2006;22:1577–1584.
[29] Fogel GB, Weekes DG, Varga G, et al. Discovery of sequence motifs related to coexpression of genes using evolutionary computation. Nucleic Acids Res. 2004;32:3826–3835.
[30] Bailey TL. DREME: motif discovery in transcription factor ChIP-seq data. Bioinformatics. 2011;27:1653–1659.
[31] Shi J, Yang W, Chen M, et al. AMD, an automated motif discovery tool using stepwise refinement of gapped consensuses. Aiyar A, editor. PLoS One. 2011 [cited 2017 Mar 12];6:e24576. DOI:10.1371/journal.pone.0024576.
[32] Lee NK, Fong PK, Abdullah MT. Modelling complex features from histone modification signatures using genetic algorithm for the prediction of enhancer region. Bio-Medical Mater Eng. 2014;24:3807–3814.
[33] Zia A, Moses AM. Towards a theoretical understanding of false positives in DNA motif finding. BMC Bioinformatics. 2012 [cited 2017 Mar 12];13:151. DOI:10.1186/1471-2105-13-151.
[34] Thomas-Chollier M, Herrmann C, Defrance M, et al. RSAT peak-motifs: motif analysis in full-size ChIP-seq datasets. Nucleic Acids Res. 2012 [cited 2017 Mar 12];40:e31. DOI:10.1093/nar/gkr1104.
[35] Li L. GADEM: A genetic algorithm guided formation of spaced dyads coupled with an EM algorithm for motif discovery. J Comput Biol. 2009;16:317–329.
[36] Kuttippurathu L, Hsing M, Liu Y, et al. CompleteMOTIFs: DNA motif discovery platform for transcription factor binding experiments. Bioinformatics. 2011;27:715–717.
[37] Carroll JS, Meyer CA, Song J, et al. Genome-wide analysis of estrogen receptor binding sites. Nat Genet. 2006;38:1289–1297.
[38] Wei C-L, Wu Q, Vega VB, et al. A global map of p53 transcription-factor binding sites in the human genome. Cell. 2006;124:207–219.
[39] Lee NK, Wang D. Optimization of MISCORE-based motif identification systems. 3rd International Conference on Bioinformatics and Biomedical Engineering (ICBBE 2009); 2009; Beijing, China.
[40] Alipanahi B, Delong A, Weirauch MT, et al. Predicting the sequence specificities of DNA- and RNA-binding proteins by deep learning. Nat Biotechnol. 2015;33:831–838.
[41] Quang D, Xie X. DanQ: a hybrid convolutional and recurrent deep neural network for quantifying the function of DNA sequences. Nucleic Acids Res. 2016 [cited 2017 Mar 12];44:e107. DOI:10.1093/nar/gkw226.
[42] Sun H, Yuan Y, Wu Y, et al. Tmod: toolbox of motif discovery. Bioinformatics. 2010;26:405–407.
[43] de Hoon MJL, Imoto S, Nolan J, et al. Open source clustering software. Bioinformatics. 2004;20:1453–1454.
[44] Wijaya E, Yiu S-M, Son NT, et al. MotifVoter: a novel ensemble method for fine-grained integration of generic motif finders. Bioinformatics. 2008;24:2288–2295.
[45] Yañez-Cuna JO, Arnold CD, Stampfel G, et al. Dissection of thousands of cell type-specific enhancers identifies dinucleotide repeat motifs as general enhancer features. Genome Res. 2014;24:1147–1156.
[46] Vincent P, Larochelle H, Bengio Y, et al. Extracting and composing robust features with denoising autoencoders. In: Proceeding of 25th International Conference on Machine Learning; 2008. p. 1096–1103. New York, NY: ACM.
[47] Giardine B, Riemer C, Hardison RC, et al. Galaxy: a platform for interactive large-scale genome analysis. Genome Res. 2005;15:1451–1455.
[48] Palm RB. Deep learning toolbox. [2015-09]. http://www.mathworks.com/matlabcentral/fileexchange/38310-deep-learning-toolbox. 2012.
[49] Rosenbloom KR, Armstrong J, Barber GP, et al. The UCSC Genome Browser database: 2015 update. Nucleic Acids Res. 2015;43:D670–D681.
[50] Manning CD, Raghavan P, Schütze H, et al. Introduction to information retrieval. Cambridge: Cambridge University Press; 2008.
[51] Benjamini Y, Hochberg Y. Controlling the false discovery rate: a practical and powerful approach to multiple testing. J R Stat Soc Ser B. 1995;57:289–300.
[52] Matthews BW. Comparison of the predicted and observed secondary structure of T4 phage lysozyme. Biochim Biophys Acta (BBA)-Protein Struct. 1975;405:442–451.
[53] Haykin S. Neural networks: A comprehensive foundation. 2nd ed. Upper Saddle River (NJ, USA): Prentice Hall PTR; 1998.
[54] Wasserman WW, Sandelin A. Applied bioinformatics for the identification of regulatory elements. Nat Rev Genet. 2004;5:276–287.
