DeepFinder: An integration of feature-based and deep learning approach for DNA motif discovery
Nung Kion Lee, Farah Liyana Azizan, Yu Shiong Wong & Norshafarina Omar
To cite this article: Nung Kion Lee, Farah Liyana Azizan, Yu Shiong Wong & Norshafarina
Omar (2018) DeepFinder: An integration of feature-based and deep learning approach for
DNA motif discovery, Biotechnology & Biotechnological Equipment, 32:3, 759-768, DOI:
10.1080/13102818.2018.1438209
ARTICLE; BIOINFORMATICS
entries that matched the nucleotides in different positions of the sequence; FIMO computes the log-likelihood ratio score for each sequence and converts it into a p-value for scoring purposes. The disadvantages of the motif profile search method are: first, it cannot effectively represent the specificities of DNA segments recognized by transcription factors (TFs); second, a single representation cannot model the different ‘codes’ recognized by different TFs. For instance, some motifs have dependencies while others do not [20]. As a result, most PWM models give poor sensitivity and specificity in motif detection.

The advantages of the approach involving computational prediction of motif patterns are its cost effectiveness and its ability to hypothesize candidate binding sites before wet-lab verification is performed. The ultimate aim of motif prediction tools is to return a set of the most promising putative binding site locations. Pre-genomic era tools were targeted mainly at small datasets from prokaryotic species and cannot be scaled in terms of accuracy and speed [21]. Popular tools in that era can be categorized into multiple local alignment (AlignACE [22], MEME [23], BioProspector [24], MotifSampler [25]), pattern enumeration (MDSCAN [26], Weeder [27]) and heuristic search (GAME [28,29]). With the invention of the ChIP technology, genome-wide motif analysis has become feasible, with many computational tools being proposed. Most of these tools employ heuristics and pattern enumeration to search for possible motif patterns efficiently. That is, instead of exhaustively enumerating all motif patterns of specified lengths, a heuristic algorithm initially selects statistically significant seed consensus motif patterns, which are, for example, short (3–8 bp in DREME [30]) or short patterns spaced by gaps [31]. Seeds are used to form longer patterns or to initialize a search algorithm. Computational time is thus significantly reduced by starting the search from the resulting sub-optimal motif patterns. Although these tools are useful, most can only predict short motif patterns. A genetic algorithm-based tool has also been proposed [32], but search-based tools are not scalable.

A recently published article made an intriguing finding regarding the theoretical limit of the number of DNA sequences that should be used for computational DNA motif discovery [33]. The authors reported that increasing the number of input sequences does improve the motif prediction accuracy; however, after it reaches a certain quantity, the improvement is no longer significant. This finding contradicted many studies that assumed that better results can be expected when more input sequences are used for computational tools. This implies that a sufficient number of input sequences will be adequate to predict motifs in a dataset, and it is not necessary to use the whole set as input for computational tools.

The finding by Zia and Moses [33] was evidenced in an earlier study by Hu et al. [21], who reported that using more input sequences does not improve the prediction accuracy. They suggested that ‘one can input only partial input sequences to a motif discovery algorithm to obtain a motif model and then use this model to find motifs in the remaining sequences. In this manner, a significant reduction in the running time can be achieved without sacrificing the prediction accuracy.’ The accuracy of that method holds when: (a) the primary motif’s appearances are evenly distributed in a dataset of a sufficiently large size; (b) the primary motif model obtained from the partial input sequences can effectively detect associated binding sites in the remaining sequences. To the best of our knowledge, this finding has not been incorporated into the design of any tool. In fact, many newly proposed computational motif prediction tools [34] are designed to tackle large numbers of DNA sequences. Examples of such tools are AMD [31], RSAT peak-motifs [34] and DREME [30], which are based on consensus pattern enumeration; the genetic algorithm-based GADEM [35]; and CompleteMotifs [36], which employs multiple motif discovery tools. Some authors have employed an ad-hoc motif discovery pipeline that works per the recommendation by Hu et al. [21]. We termed this strategy the three-stage approach.

In this study, DeepFinder, a motif discovery pipeline, is proposed to improve the current implementation of the three-stage method. The two novel features of this approach are: first, we employ an ensemble of motif discovery tools for initial prediction of candidate binding sites from a subset of input sequences; second, features associated with the most promising candidate binding sites are extracted for deep neural network learning. Using ten ChIP datasets for evaluation, our results have demonstrated that DeepFinder is able to improve the overall sensitivity and specificity rates in comparison to the three-stage approach.

Related works

The three-stage approach tackles the motif search in a large ChIP dataset by dividing the task into three consecutive steps [37,38]: (1) select a small subset of the input dataset; (2) perform motif discovery in the subset using a computational tool and select the most promising candidate motifs; (3) use the candidate motif models to detect binding sites in the input subset not used in stage 1. The three-stage approach reduced the computational time significantly by avoiding the motif search in the whole sequence space. It conjectures that prominent
BIOTECHNOLOGY & BIOTECHNOLOGICAL EQUIPMENT 761
motifs can be obtained using any subset of the input sequences of a reasonable size. The existing approach employed a single motif tool in stage 2 to predict candidate motifs. The obtained motifs are typically modeled using the PWM, which is subsequently used for site detection in stage 3. Nevertheless, the existing solution is not robust. First, motif detection in the third stage relies on a good binding model that can represent a protein’s specificities. Although PWM usually fits the binding affinity and specificity of a TF well, it is incapable of capturing motifs with positional dependencies. Second, in motif detection, it is often difficult to set the threshold value of a match so that high sensitivity and high specificity are balanced [39].

Supervised learning based on deep learning neural networks for enhancer motif prediction has become popular recently. DeepBind [40] employed convolutional neural networks (CNN) to identify the DNA- or RNA-binding regions. DeepBind’s binding model showed excellent performance with an average area under curve (AUC) of 0.85, when it was trained on in vitro and tested on in vivo motif datasets, which outperformed the state-of-the-art methods using several performance metrics. DanQ [41] is a hybrid of convolutional and recurrent deep neural networks for learning enhancer-associated histone marks. It was claimed to outperform DeepSEA, another CNN-based model, using ChIP-Seq datasets for evaluation. However, DeepSEA’s performance is still considered unsatisfactory with its precision-recall AUC being under 70%.

Materials and methods

Framework

DeepFinder involves a three-stage approach that utilizes an ensemble of motif finders and a machine-learning technique for motif prediction. Figure 1 illustrates the DeepFinder computational framework. It has three consecutive steps: (a) The dataset is partitioned into five non-overlapping subsets. (b) Four de novo motif discovery tools are applied on one of the partitioned subsets to predict putative motifs and the respective binding sites. The top three motifs returned from each tool are merged and divided using a clustering algorithm. Seventy-six features associated with candidate binding sites in merged clusters are extracted and used for stacked-autoencoder neural network learning. (c) The learned neural network is used to predict associated binding sites in input sequences not used in the initial motif prediction.

Candidate motif prediction and selection

A subset of input sequences is randomly selected from the input dataset for motif prediction by four de novo motif discovery tools: MEME [23], BioProspector [24], MDscan [26] and MotifSampler [25]. We employed the toolbox of motif discovery (Tmod) [42], which implements the four selected tools for candidate motif prediction. The top three motifs ranked by each tool’s scoring function are selected for further processing. Putative site locations in the DNA sequences are located. Regions in DNA sequences where many overlapping putative sites are located are most likely to be legitimate binding regions. We called these regions binding segments (i.e. covered with at least one binding site or several overlapping binding sites in the vicinity). After identifying all binding segments, pairwise similarity between every possible pair is computed to generate a symmetric distance matrix (see the next subsection). Two clusters are generated by using the k-medoid clustering algorithm, implemented in Pycluster [43]. The cluster with a higher number of binding
segments is fed as input to the deep learning neural network for building a binding model. We decided to generate two clusters because it is crucial to keep more sequences in a cluster to avoid inadequate training data, since there is no practical guideline on how many sequences are needed for model learning. Therefore, it is important to keep the cluster number small in order to avoid missing any significant binding segments.

Motif similarity

We have modified the similarity function described by [44] to compute the similarity score of two binding segments x and y. Let $A(x) = \{x_1, x_2, \ldots, x_n\}$ be the set of binding sites clustered at binding segment x. The similarity score between two binding segments is computed from the average alignment of every pair of binding sites in the two segments. Suppose binding sites $x_i \in A(x)$ and $y_j \in A(y)$ have $m_{ij} = \max(l(x_i), l(y_j)) - \min(l(x_i), l(y_j)) + 1$ possible alignment positions; $l(x_i)$ is the length of $x_i$. Alignments of $x_i$ and $y_j$ are performed by starting at the left end position and then shifting the shorter site along the longer one, one base to the right at a time. At a particular alignment position k, the alignment score is computed as $s_k(x_i, y_j) = |x_i \cap y_j| / |x_i \cup y_j|$, where $|x_i \cap y_j|$ is the number of matched nucleotides of the two aligned sites and $|x_i \cup y_j|$ is the total number of nucleotides in the alignment. Note that $0 \le s_k(x_i, y_j) \le 1$. Therefore, the similarity score between two binding sites $x_i$ and $y_j$ is defined as

$$\mathrm{sim}(x_i, y_j) = \sum_{1 \le k \le m_{ij}} s_k(x_i, y_j). \qquad (1)$$

The similarity score between two binding segments x and y is defined as

$$\mathrm{sim}(A(x), A(y)) = 1 - \frac{\sum_{ij} \mathrm{sim}(x_i, y_j)}{\sum_{ij} m_{ij}}. \qquad (2)$$

We use the scores obtained to populate our distance matrix, which is used by the k-medoid algorithm. As an illustrative example, suppose A(x) = {ATGCA, GCCG} and A(y) = {CGGA} are the binding sites in binding segments x and y, respectively. There are two pair-wise alignments between the two sets. The alignment pairs are (ATGCA, CGGA) and (GCCG, CGGA). The table below shows the alignment scores in different positions for the two pairs.

x       y      Position 1   Position 2
ATGCA   CGGA   1/8          2/7
GCCG    CGGA   0

The following gives an example of how one of the alignment scores is obtained for the pair (ATGCA, CGGA). There is a total of max(l(ATGCA), l(CGGA)) − min(l(ATGCA), l(CGGA)) + 1 = 5 − 4 + 1 = 2 possible alignment positions as shown below.

ATGCA   Score
CGGA-   1/8
-CGGA   2/7
Sum = 0.411

For alignment position 1, the score is 1/8 since there is one position (i.e. 3) where the nucleotides match, out of a total of 8 nucleotides. The matched nucleotide is counted as one for calculating |ATGCA ∪ CGGA-|. The alignment scores are summed to obtain 0.411 using Equation (1). The second pair is computed similarly, which gives sim(CGGA, GCCG) = 0. Finally, the similarity score between the two binding segments x and y is sim(A(x), A(y)) = 1 − (0.411 + 0)/3 = 0.863.

Motif features

Several DNA sequence features are highly associated with binding sites. Osada et al. [20] reported that adjacent bases of motifs have high occurrence dependencies, which, when modeled, can significantly improve the motif prediction sensitivity and specificity rates. Furthermore, Yáñez-Cuna et al. [45] observed that enhancer regions have high occurrences of repeated dinucleotides CA, GA, CG or GC. For classifier learning, a feature vector comprising three distinct feature sets is generated from each DNA binding segment: (a) the k-mer feature, a simple count of co-occurrences of bases that have strong dependencies, with k set to 3, which gives 64 feature values; (b) the frequency counts of A, C, G, T; and (c) the counts of the selected 2-mers CA, CG, GA and GC together with the dinucleotide dependencies of CA, CG, GA and GC, where the dependency value of an arbitrary dinucleotide XY is computed as c(XY)/(c(XA) + c(XC) + c(XG) + c(XT)); c() is the frequency count in a binding segment. The feature value of a k-mer g is computed as f(g) = c(g)/c(*), where c(*) is the sum of counts from all possible k-mers. The frequency values are normalized using the min–max method.

Classifier learning

DeepFinder employs a stacked autoencoder [46] to construct binding models using the 76 engineered sequence features. The stacked autoencoder is well known for its feature discovery, especially in unlabeled data, and its capacity in part-whole decomposition. In addition, a single stacked autoencoder acquires greater expressive power compared with any deep learning neural network. An equal number of negative controls are added to the sequence segments of positive data to
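The similarity score of Equations (1) and (2) in the Motif similarity subsection can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' implementation: the function names `alignment_scores` and `segment_similarity` are hypothetical, and the "total nucleotides in the alignment" is assumed to be the sum of the two site lengths minus the number of matches, which is what the worked example (1/8 and 2/7) implies.

```python
from itertools import product


def alignment_scores(a: str, b: str) -> list[float]:
    """Score every ungapped alignment of the shorter site slid along the longer one."""
    longer, shorter = (a, b) if len(a) >= len(b) else (b, a)
    scores = []
    # The number of alignment positions is max(l) - min(l) + 1 (m_ij in the text).
    for offset in range(len(longer) - len(shorter) + 1):
        window = longer[offset:offset + len(shorter)]
        matched = sum(p == q for p, q in zip(window, shorter))
        # Assumption: a matched pair counts once, so the union is the
        # combined length of both sites minus the number of matches.
        union = len(longer) + len(shorter) - matched
        scores.append(matched / union)
    return scores


def segment_similarity(sites_x: list[str], sites_y: list[str]) -> float:
    """Equation (2): one minus summed pair scores over summed position counts."""
    total_score = 0.0
    total_positions = 0
    for xi, yj in product(sites_x, sites_y):
        scores = alignment_scores(xi, yj)
        total_score += sum(scores)       # Equation (1) for this pair
        total_positions += len(scores)   # m_ij for this pair
    return 1.0 - total_score / total_positions


if __name__ == "__main__":
    # Worked example from the text: A(x) = {ATGCA, GCCG}, A(y) = {CGGA}.
    print(round(segment_similarity(["ATGCA", "GCCG"], ["CGGA"]), 3))  # 0.863
```

Applied to every pair of binding segments, the returned values would fill the symmetric matrix that is handed to the k-medoid algorithm.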
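The 76-value feature vector described in the Motif features subsection (64 3-mer frequencies, 4 base counts, 4 dinucleotide counts and 4 dinucleotide dependencies) could be assembled roughly as follows. This is a hedged sketch: the names `segment_features` and `min_max_normalize` are hypothetical, the ordering of the features is arbitrary, and applying min–max scaling column-wise across all segments is an assumption, since the text does not say over which values the normalization is taken.

```python
from collections import Counter
from itertools import product

BASES = "ACGT"
TRIMERS = ["".join(p) for p in product(BASES, repeat=3)]  # 64 k-mers, k = 3
DINUCLEOTIDES = ("CA", "CG", "GA", "GC")


def count_kmers(seq: str, k: int) -> Counter:
    """Count overlapping k-mers in a DNA string."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def segment_features(segment: str) -> list[float]:
    """Return the 76 raw feature values of one binding segment."""
    seq = segment.upper()
    mono, di, tri = Counter(seq), count_kmers(seq, 2), count_kmers(seq, 3)
    # (a) f(g) = c(g) / c(*) for each of the 64 possible 3-mers.
    tri_total = sum(tri[g] for g in TRIMERS)
    features = [tri[g] / tri_total if tri_total else 0.0 for g in TRIMERS]
    # (b) Frequency counts of A, C, G, T.
    features += [float(mono[b]) for b in BASES]
    # (c) Counts of CA, CG, GA, GC ...
    features += [float(di[d]) for d in DINUCLEOTIDES]
    # ... and their dependencies c(XY) / (c(XA) + c(XC) + c(XG) + c(XT)).
    for d in DINUCLEOTIDES:
        denominator = sum(di[d[0] + b] for b in BASES)
        features.append(di[d] / denominator if denominator else 0.0)
    return features


def min_max_normalize(rows: list[list[float]]) -> list[list[float]]:
    """Assumed column-wise min-max scaling over all segments."""
    columns = list(zip(*rows))
    lows = [min(c) for c in columns]
    spans = [max(c) - lo for c, lo in zip(columns, lows)]
    return [[(v - lo) / sp if sp else 0.0
             for v, lo, sp in zip(row, lows, spans)] for row in rows]
```

A call such as `min_max_normalize([segment_features(s) for s in segments])` would then yield one scaled row per binding segment, ready to be passed to a classifier.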
$$f\text{-measure} = \frac{2pr}{p + r}$$

MCC produces a value in the range of [−1, 1], in which 1 indicates a perfect prediction, 0 means random prediction, and −1 represents a negative correlation.

Hardware specifications

For the simulation, all the individual motif discovery tools in DeepFinder ran on Intel Core i5, 1.7 GHz CPU with 16 GB of memory.

Results and discussion

Table 3. Comparison of false discovery rate (FDR), accuracy, and Matthews correlation coefficient (MCC) of MAST, FIMO, and DeepFinder (DF).

        FDR                  Accuracy             MCC
TF      MAST  FIMO  DF       MAST  FIMO  DF       MAST   FIMO   DF
CREB    0.40  0.01  0.00     0.03  0.45  0.99     -0.62  -0.06  0.97
ELK1    0.31  0.01  0.09     0.04  0.54  0.89     -0.54  -0.05  0.81
GATA1   0.32  0.01  0.05     0.04  0.53  0.94     -0.55  -0.05  0.89
HNF4    0.32  0.00  0.00     0.05  0.51  1.00     -0.55  -0.04  0.99
MEF2    0.20  0.00  0.00     0.01  0.99  0.99     -0.44  -0.04  0.98
NFE2    0.27  0.00  0.00     0.03  0.36  1.00     -0.51  -0.03  0.99
P53     0.29  0.01  0.00     0.05  0.49  0.99     -0.53  -0.06  0.99
P300    0.35  0.01  0.02     0.06  0.67  0.97     -0.57  -0.05  0.94
SRF     0.25  0.00  0.02     0.03  0.37  0.98     -0.49  -0.03  0.96
STAT1   0.33  0.00  0.00     0.03  0.44  0.99     -0.56  -0.04  0.99
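The measures reported in Table 3, together with the f-measure, can be computed from confusion-matrix counts. The sketch below assumes the standard textbook definitions (precision p = TP/(TP + FP), recall r = TP/(TP + FN), FDR = FP/(TP + FP), and the usual MCC formula); the paper's own definitions are not visible in this excerpt, and the function name `classification_metrics` is hypothetical.

```python
import math


def classification_metrics(tp: int, fp: int, tn: int, fn: int) -> dict[str, float]:
    """Standard binary classification measures from confusion-matrix counts."""

    def ratio(numerator: float, denominator: float) -> float:
        # Return 0.0 instead of failing when a denominator is empty.
        return numerator / denominator if denominator else 0.0

    precision = ratio(tp, tp + fp)
    recall = ratio(tp, tp + fn)
    mcc_denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "precision": precision,
        "recall": recall,
        "f_measure": ratio(2 * precision * recall, precision + recall),
        "fdr": ratio(fp, tp + fp),
        "accuracy": ratio(tp + tn, tp + fp + tn + fn),
        "mcc": ratio(tp * tn - fp * fn, mcc_denominator),
    }


if __name__ == "__main__":
    # Made-up counts for illustration only; they are not taken from the paper.
    for name, value in classification_metrics(tp=8, fp=2, tn=9, fn=1).items():
        print(f"{name}: {value:.3f}")
```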
Figure 4. Average f-measure, precision, recall and accuracy rates obtained by DeepFinder on the ten datasets using five-fold cross-
validation.
Figure 5. Average precision, recall, f-measure, false discovery rate (FDR) and accuracy for MAST, FIMO and DeepFinder for the ten
datasets.
Figure 6. Comparisons of precision rates for five different data subsets used in prediction.
766 N. K. LEE ET AL.
Figure 7. Comparisons of recall rates for five different data subsets used in prediction.
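Figures 6 and 7 compare results across the five data subsets. A minimal sketch of how a dataset might be split into five non-overlapping subsets, with each subset used in turn for motif discovery and the remaining sequences held out for prediction, is given below. The function names, the random shuffling and the fixed seed are assumptions made for illustration; the paper does not describe how its partitions were drawn.

```python
import random
from typing import Iterator


def partition_dataset(sequences: list[str], n_subsets: int = 5,
                      seed: int = 0) -> list[list[str]]:
    """Shuffle the sequences and deal them into non-overlapping subsets."""
    shuffled = list(sequences)
    random.Random(seed).shuffle(shuffled)
    return [shuffled[i::n_subsets] for i in range(n_subsets)]


def discovery_and_prediction_splits(
        subsets: list[list[str]]) -> Iterator[tuple[list[str], list[str]]]:
    """Yield (discovery subset, held-out sequences) for each subset in turn."""
    for i, discovery in enumerate(subsets):
        held_out = [s for j, subset in enumerate(subsets) if j != i for s in subset]
        yield discovery, held_out
```

Iterating over `discovery_and_prediction_splits(partition_dataset(sequences))` gives five runs, each of which never scores the sequences that were used to discover the candidate motifs.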
other types of deep learning neural networks such as CNN. CNN is powerful because it can learn the different abstraction of features in DNA sequences without the need of handcrafted features. However, it requires DNA sequences to be represented as vectors or in the matrix form (i.e. as an image) for effective learning. The currently available solution mainly focuses on one-hot encoding, which we feel is not a ‘natural’ representation of DNA sequences. We are currently conducting exploratory research for more meaningful representation of DNA sequences as ‘images.’ In addition, a more effective method is needed to merge and filter large number of candidate motifs from multiple motif prediction tools.

Acknowledgements

YS is supported by the Malaysian MyBrain15 MyPhD Scholarship. ON is supported by the Fundamental Research Grant Scheme FRGS/1/2014/SG03/UNIMAS/02/2.

Disclosure statement

No potential conflict of interest was reported by the authors.

Funding

This study was supported by the Malaysia Research Acculturation Grant Scheme of the Ministry of Higher Education Malaysia [grant number RAGS/b(5)/927/2012(28)].

References

[1] Maston GA, Evans SK, Green MR. Transcriptional regulatory elements in the human genome. Annu Rev Genomics Hum Genet. 2006;7:29–59.
[2] Zambelli F, Pesole G, Pavesi G. Motif discovery and transcription factor binding sites before and after the next-generation sequencing era. Brief Bioinform. 2013;14:225–237.
[3] Poliakov A, Foong J, Brudno M, et al. GenomeVISTA—an integrated software package for whole-genome alignment and visualization. Bioinformatics. 2014;30:2654–2655.
[4] Brudno M, Do CB, Cooper GM, et al. LAGAN and Multi-LAGAN: efficient tools for large-scale multiple alignment of genomic DNA. Genome Res. 2003;13:721–731.
[5] Kurtz S, Phillippy A, Delcher AL, et al. Versatile and open software for comparing large genomes. Genome Biol. 2004 [cited 2017 Mar 12];5:R12. DOI:10.1186/gb-2004-5-2-r12
[6] Bray N, Dubchak I, Pachter L. AVID: A global alignment program. Genome Res. 2003;13:97–102.
[7] Ovcharenko I, Loots GG, Giardine BM, et al. Mulan: multiple-sequence local alignment and visualization for studying function and evolution. Genome Res. 2005;15:184–194.
[8] Smith TF, Waterman MS. Identification of common molecular subsequences. J Mol Biol. 1981;147:195–197.
[9] Needleman SB, Wunsch CD. A general method applicable to the search for similarities in the amino acid sequence of two proteins. J Mol Biol. 1970;48:443–453.
[10] Blanchette M, Kent WJ, Riemer C, et al. Aligning multiple genomic sequences with the threaded blockset aligner. Genome Res. 2004;14:708–715.
[11] Al Ait L, Yamak Z, Morgenstern B. DIALIGN at GOBICS–multiple sequence alignment using various sources of external information. Nucleic Acids Res. 2013;41:W3–7.
[12] King DC, Taylor J, Zhang Y, et al. Finding cis-regulatory elements using comparative genomics: some lessons from ENCODE data. Genome Res. 2007;17:775–786.
[13] Kel AE, Gößling E, Reuter I, et al. MATCH: A tool for searching transcription factor binding sites in DNA sequences. Nucleic Acids Res. 2003;31:3576–3579.
[14] Wang D, Lee NK. MISCORE: Mismatch-based matrix similarity scores for DNA motif detection. In: Köppen M, Kasabov N, Coghill G, editors. Adv. Neuro-Information Process. Berlin, Heidelberg: Springer; 2009. p. 478–485.
[15] Grant CE, Bailey TL, Noble WS. FIMO: scanning for occurrences of a given motif. Bioinformatics. 2011;27:1017–1018.
[16] Bailey T, Boden M, Whitington T, et al. The value of position-specific priors in motif discovery using MEME. BMC Bioinformatics. 2010 [cited 2017 Mar 12];11:179. DOI:10.1186/1471-2105-11-179.
[17] Stormo GD. DNA binding sites: representation and discovery. Bioinformatics. 2000;16:16–23.
[18] Bi Y, Kim H, Gupta R, et al. Tree-based position weight matrix approach to model transcription factor binding site profiles. PLoS One. 2011 [cited 2017 Mar 12];6:e24210. DOI:10.1371/journal.pone.0024210.
[19] Bailey TL, Gribskov M. Combining evidence using p-values: application to sequence homology searches. Bioinformatics. 1998;14:48–54.
[20] Osada R, Zaslavsky E, Singh M. Comparative analysis of methods for representing and searching for transcription factor binding sites. Bioinformatics. 2004;20:3516–3525.
[21] Hu J, Li B, Kihara D. Limitations and potentials of current motif discovery algorithms. Nucleic Acids Res. 2005;33:4899–4913.
[22] Hughes JD, Estep PW, Tavazoie S, et al. Computational identification of Cis-regulatory elements associated with groups of functionally related genes in Saccharomyces cerevisiae. J Mol Biol. 2000;296:1205–1214.
[23] Bailey TL, Elkan C. Fitting a mixture model by expectation maximization to discover motifs in biopolymers. Proc Int Conf Intell Syst Mol Biol. 1994;2:28–36.
[24] Liu XS, Brutlag DL, Liu JS. BioProspector: discovering conserved DNA motifs in upstream regulatory regions of co-expressed genes. Pacific Symp Biocomput. 2001;6:127–138.
[25] Thijs G, Marchal K, Lescot M, et al. Gibbs sampling method to detect overrepresented motifs in the upstream regions of coexpressed genes. J Comput Biol. 2002;9:447–464.
[26] Liu XS, Brutlag DL, Liu JS. An algorithm for finding protein-DNA binding sites with applications to chromatin-immunoprecipitation microarray experiments. Nat Biotechnol. 2002;20:835–839.
[27] Pavesi G, Mauri G, Pesole G. An algorithm for finding signals of unknown length in DNA sequences. Bioinformatics. 2001;17:S207–214.
[28] Wei Z, Jensen ST. GAME: detecting cis-regulatory elements using a genetic algorithm. Bioinformatics. 2006;22:1577–1584.
[29] Fogel GB, Weekes DG, Varga G, et al. Discovery of sequence motifs related to coexpression of genes using evolutionary computation. Nucleic Acids Res. 2004;32:3826–3835.
[30] Bailey TL. DREME: motif discovery in transcription factor ChIP-seq data. Bioinformatics. 2011;27:1653–1659.
[31] Shi J, Yang W, Chen M, et al. AMD, an automated motif discovery tool using stepwise refinement of gapped consensuses. Aiyar A, editor. PLoS One. 2011 [cited 2017 Mar 12];6:e24576. DOI:10.1371/journal.pone.0024576.
[32] Lee NK, Fong PK, Abdullah MT. Modelling complex features from histone modification signatures using genetic algorithm for the prediction of enhancer region. Bio-Medical Mater Eng. 2014;24:3807–3814.
[33] Zia A, Moses AM. Towards a theoretical understanding of false positives in DNA motif finding. BMC Bioinformatics. 2012 [cited 2017 Mar 12];13:151. DOI:10.1186/1471-2105-13-151.
[34] Thomas-Chollier M, Herrmann C, Defrance M, et al. RSAT peak-motifs: motif analysis in full-size ChIP-seq datasets. Nucleic Acids Res. 2012 [cited 2017 Mar 12];40:e31. DOI:10.1093/nar/gkr1104.
[35] Li L. GADEM: A genetic algorithm guided formation of spaced dyads coupled with an EM algorithm for motif discovery. J Comput Biol. 2009;16:317–329.
[36] Kuttippurathu L, Hsing M, Liu Y, et al. CompleteMOTIFs: DNA motif discovery platform for transcription factor binding experiments. Bioinformatics. 2011;27:715–717.
[37] Carroll JS, Meyer CA, Song J, et al. Genome-wide analysis of estrogen receptor binding sites. Nat Genet. 2006;38:1289–1297.
[38] Wei C-L, Wu Q, Vega VB, et al. A global map of p53 transcription-factor binding sites in the human genome. Cell. 2006;124:207–219.
[39] Lee NK, Wang D. Optimization of MISCORE-based motif identification systems. 3rd International Conference on Bioinformatics and Biomedical Engineering (ICBBE 2009); 2009; Beijing, China.
[40] Alipanahi B, Delong A, Weirauch MT, et al. Predicting the sequence specificities of DNA- and RNA-binding proteins by deep learning. Nat Biotechnol. 2015;33:831–838.
[41] Quang D, Xie X. DanQ: a hybrid convolutional and recurrent deep neural network for quantifying the function of DNA sequences. Nucleic Acids Res. 2016 [cited 2017 Mar 12];44:e107. DOI:10.1093/nar/gkw226.
[42] Sun H, Yuan Y, Wu Y, et al. Tmod: toolbox of motif discovery. Bioinformatics. 2010;26:405–407.
[43] de Hoon MJL, Imoto S, Nolan J, et al. Open source clustering software. Bioinformatics. 2004;20:1453–1454.
[44] Wijaya E, Yiu S-M, Son NT, et al. MotifVoter: a novel ensemble method for fine-grained integration of generic motif finders. Bioinformatics. 2008;24:2288–2295.
[45] Yáñez-Cuna JO, Arnold CD, Stampfel G, et al. Dissection of thousands of cell type-specific enhancers identifies dinucleotide repeat motifs as general enhancer features. Genome Res. 2014;24:1147–1156.
[46] Vincent P, Larochelle H, Bengio Y, et al. Extracting and composing robust features with denoising autoencoders. In: Proceeding of 25th International Conference on Machine Learning; 2008. p. 1096–1103. New York, NY: ACM.
[47] Giardine B, Riemer C, Hardison RC, et al. Galaxy: a platform for interactive large-scale genome analysis. Genome Res. 2005;15:1451–1455.
[48] Palm RB. Deep learning toolbox. [2015-09]. http://www.mathworks.com/matlabcentral/fileexchange/38310-deep-learning-toolbox. 2012.
[49] Rosenbloom KR, Armstrong J, Barber GP, et al. The UCSC Genome Browser database: 2015 update. Nucleic Acids Res. 2015;43:D670–D681.
[50] Manning CD, Raghavan P, Schütze H, et al. Introduction to information retrieval. Cambridge: Cambridge University Press; 2008.
[51] Benjamini Y, Hochberg Y. Controlling the false discovery rate: a practical and powerful approach to multiple testing. J R Stat Soc Ser B. 1995;57:289–300.
[52] Matthews BW. Comparison of the predicted and observed secondary structure of T4 phage lysozyme. Biochim Biophys Acta (BBA)-Protein Struct. 1975;405:442–451.
[53] Haykin S. Neural networks: A comprehensive foundation. 2nd ed. Upper Saddle River (NJ, USA): Prentice Hall PTR; 1998.
[54] Wasserman WW, Sandelin A. Applied bioinformatics for the identification of regulatory elements. Nat Rev Genet. 2004;5:276–287.