
296 Current Bioinformatics, 2010, 5, 296-308

A Review of Ensemble Methods in Bioinformatics

Pengyi Yang1,2,3,*, Yee Hwa Yang2, Bing B. Zhou1,4 and Albert Y. Zomaya1,4

1School of Information Technologies, University of Sydney, NSW 2006, Australia
2School of Mathematics and Statistics, University of Sydney, NSW 2006, Australia
3NICTA, Australian Technology Park, Eveleigh, NSW 2015, Australia
4Centre for Distributed and High Performance Computing, University of Sydney, NSW 2006, Australia

Abstract: Ensemble learning is an intensively studied technique in machine learning and pattern recognition. Recent work in computational biology has seen an increasing use of ensemble learning methods due to their unique advantages in dealing with small sample sizes, high dimensionality, and complex data structures. The aim of this article is two-fold. First, it provides a review of the most widely used ensemble learning methods and their application to various bioinformatics problems, covering the main topics of gene expression, mass spectrometry-based proteomics, gene-gene interaction identification from genome-wide association studies, and prediction of regulatory elements from DNA and protein sequences. Second, it identifies and summarizes future trends of ensemble methods in bioinformatics. Promising directions such as ensembles of support vector machines, meta-ensembles, and ensemble-based feature selection are discussed.

Keywords: Ensemble learning, bioinformatics, microarray, mass spectrometry-based proteomics, gene-gene interaction, regulatory elements prediction, ensemble of support vector machines, meta-ensemble, ensemble feature selection.

INTRODUCTION

Modern biology has seen an increasing use of computational techniques for large-scale and complex biological data analysis. Various computational techniques, especially machine learning algorithms [1], are applied, for example, to select genes or proteins associated with the trait of interest and to classify different types of samples in gene expression microarray data [2] or mass spectrometry (MS)-based proteomics data [3]; to identify disease-associated genes, gene-gene interactions, and gene-environment interactions from genome-wide association (GWA) studies [4]; to recognize regulatory elements in DNA or protein sequences [5]; to identify protein-protein interactions [6]; or to predict protein structure [7].

Ensemble learning is an effective technique that has increasingly been adopted to combine multiple learning algorithms to improve overall prediction accuracy [8]. Ensemble techniques can alleviate the small-sample-size problem by averaging and incorporating multiple classification models to reduce the potential for overfitting the training data [9]. In this way the training data set may be used more efficiently, which is critical to many biological applications with small sample sizes. Some ensemble methods, such as random forests, are particularly useful for high-dimensional datasets because increased classification accuracy can be achieved by generating multiple prediction models, each with a different feature subset. These properties, as we will review later, have a major impact on many different bioinformatics applications.

A large number of ensemble methods have been applied to biological data analysis. This article aims to provide a review of the most widely used methods and their variants in bioinformatics applications, and to identify the future development directions of ensemble methods in bioinformatics. In the next section, we briefly discuss the rationale of ensemble approaches and introduce the three most popular ensemble methods – bagging [10], boosting [11], and random forests [12]. This is followed by a section discussing the application of ensemble methods to three different bioinformatics problems. These are: (1) gene expression microarray and MS-based proteomics data classification, (2) identification of gene-gene interactions in GWA studies, and (3) regulatory elements prediction from DNA or protein sequences. Several other applications are also reviewed. The fourth section describes several extensions of ensemble methods and the adaptation of ensemble learning theory for feature selection problems. The last section concludes the paper.

POPULAR ENSEMBLE METHODS

Improvements in classification tasks are often obtained by aggregating a group of classifiers (referred to as base classifiers) into an ensemble committee and making the prediction for unseen data in a consensus way. The aim of designing and using ensemble methods is to achieve more accurate classification (on training data) as well as better generalization (on unseen data). However, this is often achieved at the expense of increased model complexity (decreased model interpretability) [13]. A better generalization property of the

*Address correspondence to this author at the School of Information Technologies (J12), University of Sydney, NSW 2006, Australia; Tel: (61 2) 9036-9112; Fax: (61 2) 9351-3838; E-mail: [email protected]

1574-8936/10 $55.00+.00 © 2010 Bentham Science Publishers Ltd.


ensemble approach is often explained using the classic bias-variance decomposition analysis [14]. Specifically, previous studies found that methods like bagging (Fig. 1(a)) improve generalization by decreasing variance [15], while methods similar to boosting (Fig. 1(b)) achieve this by decreasing bias [16]. Here we provide a more intuitive interpretation of the advantage of the ensemble approach.

Let the best classification rule (called hypothesis) hbest of a given induction algorithm for a certain kind of data be the circle in Fig. (2). Suppose the training data is free from noise, without any missing values, and sufficiently large to represent the underlying pattern. Then, we expect the classifier trained on the dataset to capture the best classification hypothesis, represented as the circle. In practice, however, the training datasets are often hampered by small sample size, high dimensionality, a high noise-to-signal ratio, etc. Therefore, obtaining the best classification hypothesis is often nontrivial because there are a large number of suboptimal hypotheses in the hypothesis space (denoted as H in Fig. 2) which can fit the training data but do not generalize well on unseen data.

Creating multiple classifiers by manipulating the training data in an intelligent way allows one to obtain a different hypothesis space with each classifier (H1, H2, ..., HL, where L is the number of classifiers), which may lead to a narrowed overlap hypothesis space (Ho). By combining the classification rules of multiple classifiers using integration methods that take advantage of the overlapped region (such as averaging and majority voting), we approximate the best classification rule by using multiple rules. As a result, an ensemble composed in this manner often proves more accurate.

From the above analysis, it is clear that in order to obtain an improvement the base classifiers need to be accurate (better than chance) and diverse from each other [17]. The need for diversity originates from the assumption that if a classifier makes a misclassification, there may be another classifier that complements it by correctly classifying the misclassified sample. Ideally, each classifier makes incorrect classifications independently.

Popular ensemble methods like bagging (Fig. 1(a)) and random forests (Fig. 1(c)) (note that random forests can be considered a special form of the bagging algorithm) harness diversity by using differently perturbed data sets and different feature sets, respectively, for training base classifiers. That is, each base classifier is trained on a subset of samples/features to obtain a slightly different classification hypothesis, and the classifiers are then combined to form the ensemble. As for boosting (Fig. 1(b)), diversity is obtained by increasing the weights of misclassified samples in an iterative manner. Each base classifier is trained and combined from the samples with different classification weights, and therefore yields different hypotheses.

Fig. (1). Schematic illustration of the three most popular ensemble methods.

Fig. (2). A schematic illustration of hypothesis space partitioning with an ensemble of classifiers. By combining diverse and moderately accurate base classifiers, we can approximate the best classification rule hbest at the cost of increased model complexity. This can be achieved by combining base classifiers with averaging or majority voting, which take advantage of the overlapped region.
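To make the perturbation idea concrete, the toy sketch below (our illustration only, not code from any work cited here; the dataset and helper names are invented) trains decision stumps on bootstrap samples and randomly chosen features, as bagging and random forests do, and combines them by majority vote:

```python
import random
from collections import Counter

random.seed(0)

# Toy dataset: each sample is ([f0, f1, f2, f3], label).
# Class 1 tends to have larger feature values than class 0.
data = [([random.gauss(label * 2.0, 1.0) for _ in range(4)], label)
        for label in (0, 1) for _ in range(30)]

def train_stump(samples, feature):
    """A decision stump: threshold one feature at the midpoint of the class means."""
    means = {}
    for lab in (0, 1):
        vals = [x[feature] for x, y in samples if y == lab]
        means[lab] = sum(vals) / len(vals)
    thr = (means[0] + means[1]) / 2.0
    flip = means[1] < means[0]            # orient the stump correctly
    return lambda x: int((x[feature] > thr) != flip)

# Perturbation: each stump sees a bootstrap resample (sampling with
# replacement) and a randomly chosen feature, yielding diverse hypotheses.
stumps = []
for _ in range(25):
    boot = [random.choice(data) for _ in data]
    stumps.append(train_stump(boot, random.randrange(4)))

def ensemble_predict(x):
    votes = Counter(s(x) for s in stumps)  # majority vote over the committee
    return votes.most_common(1)[0][0]

accuracy = sum(ensemble_predict(x) == y for x, y in data) / len(data)
print(f"training accuracy of the bagged stump ensemble: {accuracy:.2f}")
```

Each individual stump is only a weak, moderately accurate rule, yet the voted committee recovers a much better approximation of the underlying class boundary, which is exactly the intuition sketched in Fig. (2).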
By default, these three methods use decision trees as base classifiers because decision trees are sensitive to small changes in the training set [8], and thus are suited to the perturbation procedures applied to the training data.

It is worth noting that there are many other well-established methods for creating ensemble classifiers. For example, stacked generalization [18] combines the base classifiers through a meta-classifier to maximize generalization. Methods like base classifier selection and cascade-based classifiers are also widely used [19, 20].

To aggregate the base classifiers in a consensus manner, strategies such as majority voting or simple averaging are commonly used. Assuming the prediction outputs of the base classifiers are independent of each other (which, in practice, is partially achieved by promoting diversity among the base classifiers), the majority voting error rate εmv can be expressed as follows [21]:

εmv = Σ_{i=⌊M/2⌋+1}^{M} (M choose i) ε^i (1 − ε)^(M−i)

where M is the number of base classifiers in the ensemble and ε is the error rate of each base classifier. Provided that ε < εrandom, where εrandom is the error rate of a random guess, and that all base classifiers have the identical error rate ε, the majority voting error rate εmv monotonically decreases and approaches 0 as M → ∞.

Fig. (3) shows an ideal scenario in which the dataset has two classes, each with the same number of samples, the predictions of the base classifiers are independent of each other, and all base classifiers have an identical error rate. It can be seen from the figure that, when the error rate of the base classifiers is smaller than 0.5 (a random guess for a binary dataset with equal numbers of positive and negative samples), the ensemble error rate quickly becomes smaller than the error rate of the base classifiers. If we add more base classifiers, the improvement becomes more significant. In this example, we used odd numbers of base classifiers, where the consensus is reached by (M+1)/2 classifiers. When using an even number of base classifiers, the consensus is reached by M/2+1 classifiers.

Besides majority voting, one can also apply other methods to combine base classifiers, such as weighted majority voting, Bayesian combination [22], and probabilistic approximation [23]. Yet majority voting remains one of the most popular choices because of its simplicity and effectiveness compared to more complex decision fusion methods [24].

APPLICATION

In this section, we describe the application of ensemble methods in bioinformatics under three broad topics. They are as follows:

- Classification of gene expression microarray data and MS-based proteomics data;
- Gene-gene interaction identification using single nucleotide polymorphism (SNP) data from GWA studies;
- Prediction of regulatory elements from DNA and protein sequences.

The Application of Ensemble Methods to Microarray and MS-Based Proteomics

Many biological studies are designed to distinguish patients from normal people, or to distinguish different disease types, progression, etc. based on gene expression profiles or protein abundance.

Fig. (3). The relationship between the error rates of the base classifiers and the error rate of the ensemble classifier under majority voting. The diagonal line represents the case in which the base classifiers are identical to each other, while the three curved lines represent combining different numbers of base classifiers which are independent of each other.
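The error curves of this kind are easy to reproduce numerically. The short sketch below (illustrative only) evaluates the majority-voting formula for an ensemble of independent base classifiers and confirms that, for a base error rate below 0.5, adding more classifiers drives the ensemble error toward zero:

```python
from math import comb

def majority_vote_error(eps, M):
    """Error rate of a majority vote over M independent base classifiers,
    each with identical error rate eps: the ensemble errs when more than
    half of the base classifiers err simultaneously."""
    return sum(comb(M, i) * eps**i * (1 - eps)**(M - i)
               for i in range(M // 2 + 1, M + 1))

eps = 0.3                      # base classifiers better than random guessing
for M in (1, 5, 25, 101):      # odd ensemble sizes, consensus by (M+1)/2
    print(M, round(majority_vote_error(eps, M), 6))
```

With eps = 0.3 the ensemble error falls from 0.3 for a single classifier to well below 0.01 for 101 classifiers, whereas at eps = 0.5 (pure chance) no number of voters helps, matching the conditions stated above.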
Typical high-throughput techniques include using microarrays to measure gene expression [2] (Fig. 4) and using a mass spectrometer to measure protein abundance [3] (Fig. 5). These techniques provide genome-wide monitoring of transcription or translation. However, when such high-throughput techniques are applied, the experiments often result in the evaluation of a huge number of features (gene probesets in microarray studies, mass/charge (m/z) ratios in MS studies, etc.) with a limited number of samples [25]. This is commonly known as the "curse of dimensionality" [26], and selecting the most relevant features [27, 28] and making the most of the limited data samples [29] are the key issues in microarray or MS-based proteomics classification problems.

The unique advantage offered by ensemble methods is their ability to deal with small sample size and high dimensionality. For this reason, they have been widely applied to both microarray and MS-based proteomics data analysis.

The initial work applying bagging and boosting methods to classify tumors using gene expression profiles was pioneered by Ben-Dor et al. [30] and Dudoit et al. [31]. Both studies compared the ensemble methods with other individual classifiers such as k-nearest neighbors (kNN), clustering-based classifiers, support vector machines (SVM), linear discriminant analysis (LDA), and classification trees.

Fig. (4). Gene expression microarray data matrix. From the computational viewpoint, microarray data form an N × M matrix. Each row represents a sample and each column represents a gene, except the last column, which holds the class label of each sample. The entry in row i and column j is a numeric value representing the gene expression level of the ith sample in the jth gene. ci in the last column is the class label of the ith sample.

Fig. (5). Mass spectrometry-based proteomics. The proteomics data generated by a mass spectrometer are very similar to microarray data from the computational viewpoint. The difference is that, instead of measuring gene expression, each column represents the abundance of a protein or peptide in the tissue derived from a sample.
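Computationally, both data types reduce to the same N × M matrix with a class-label column. A minimal sketch of this representation, with made-up values and variable names of our own choosing, together with the per-tree feature subsampling that makes random forests attractive when M greatly exceeds N, might look like:

```python
import random

random.seed(1)

# An N x M expression matrix: rows = samples, columns = features
# (gene probesets or m/z ratios), plus a class label per row.
features = [[round(random.uniform(0.0, 10.0), 2) for _ in range(8)]
            for _ in range(6)]
labels = ["tumour", "normal", "tumour", "normal", "tumour", "normal"]

n_samples, n_features = len(features), len(features[0])
print(f"{n_samples} samples x {n_features} features")
print("first row:", features[0], "->", labels[0])

# Random forests train each tree on a random subset of the columns,
# so no single tree has to cope with the full dimensionality.
subset_size = 3
feature_subsets = [random.sample(range(n_features), subset_size)
                   for _ in range(4)]
print("per-tree feature subsets:", feature_subsets)
```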
The conclusion was that the ensemble methods bagging and boosting performed similarly to the other single classification algorithms included in the comparison.

In contrast to the results obtained by Dudoit et al. and Ben-Dor et al., follow-up studies revealed that much better results can be achieved through minor tuning and modification. For instance, Dettling and Buhlmann [32] proposed an algorithm called LogitBoost which replaces the exponential loss function used in AdaBoost with a log-likelihood loss function. They demonstrated that LogitBoost is more accurate in the classification of gene expression data than the original AdaBoost algorithm. Long [33] argued that the performance of AdaBoost can be enhanced by improving the base classifiers. He then proposed several customized boosting algorithms for microarray data classification. The experimental results indicate that the customized boosting algorithms performed favorably compared to SVM-based algorithms. In comparison to the single tree classifier, Tan and Gilbert [34] demonstrated that, overall, the ensemble methods bagging and boosting are more robust and accurate in microarray data classification, using seven publicly available datasets.

In MS-based proteomics, Qu et al. [35] conducted the first study using boosting ensembles for classifying mass spectra serum profiles. 100% classification accuracy was estimated using the standard AdaBoost algorithm, while a simpler ensemble called boosted decision stump feature selection (BDSFS) showed slightly lower classification accuracy (97%) but gives more interpretable classification rules. A thorough comparison study was conducted by Wu et al. [36], who compared the ensemble methods of bagging, boosting, and random forests with the individual classifiers LDA, quadratic discriminant analysis, kNN, and SVM for MALDI-TOF (matrix-assisted laser desorption/ionization with time-of-flight) data classification. The study found that among all methods random forests, on average, gives the lowest error rate with the smallest variance. Another recent study by Gertheiss and Tutz [37] designed a block-wise boosting algorithm to integrate feature selection and sample classification of mass spectrometry data. Based on LogitBoost, their method addresses the horizontal variability of the m/z values by dividing them into small subsets called blocks. Finally, the boosting ensemble has also been adopted as the classification and biomarker discovery component in the proteomic data analysis framework proposed by Yasui et al. [38].

In comparison to the bagging and boosting ensemble methods, random forests hold a unique advantage because their use of multiple feature subsets is well suited to high-dimensional data such as those generated by microarray and MS-based proteomics studies. This has been demonstrated by several studies such as [39] and [40]. In [39], Lee et al. compared bagging, boosting, and random forests ensembles under the same experimental settings and found random forests to be the most successful. In [40], experimental results on ten microarray datasets suggest that random forests are able to preserve predictive accuracy while yielding smaller gene sets compared to diagonal linear discriminant analysis (DLDA), kNN, SVM, shrunken centroids (SC), and kNN with feature selection. Other advantages of random forests, such as robustness to noise, lack of dependence on tuning parameters, and speed of computation, have been demonstrated by Izmirlian [41] in classifying SELDI-TOF proteomic data.

Due to the good performance of random forests in high-dimensional data classification, the development of random forest variants is a very active research topic. For instance, Zhang et al. [42] proposed a deterministic procedure to form a forest of classification trees. Their results indicate that the performance of the proposed deterministic forest is similar to that of random forests, but with better reproducibility and interpretability. Geurts et al. [43] proposed a tree ensemble method called "extra-trees" which selects at each node the best among k randomly generated splits. This method improves on random forests in that, unlike random forests, which are grown with multiple subsets, the base trees of extra-trees are grown from the complete learning set, with the cut-points explicitly randomized.

Besides the development of more effective ensemble methods, current studies also focus on more objective comparisons [44]. For example, a recent study by Ge and Wong [45] compared the single classifier of decision trees with six ensemble methods including random forests, stacked generalization, bagging, AdaBoost, LogitBoost, and MultiBoost, using three different feature selection schemes (Student's t-test, Wilcoxon rank sum test, and genetic algorithms). Another comprehensive study by Statnikov et al. [46] compared random forests with SVM for microarray-based cancer classification across 22 datasets.

Lastly, genes are connected by pathways and function in groups, and therefore there is a growing trend to analyse microarray data at the pathway level [47]. Pang et al. [48] proposed to combine microarray data with the pathway information from the KEGG database [49]. The dataset is subsequently divided into categorical phenotype data and clinical outcome data, and then used to train a random forests ensemble. The genes selected by random forests for sample classification are treated as informative genes, while the error rate of random forests is used to evaluate the association between pathways and the disease of interest.

The Application of Random Forests to Identify Gene-Gene Interactions

Besides measuring gene expression and protein expression, screening and comparing the genotypes of different samples can also give critical information about different diseases and their pathogeneses, because the development of the disease is studied from the very source of the genetic makeup – DNA. More importantly, such studies, termed association studies, can help to determine different individuals' susceptibility to various diseases as well as their response to different drugs based on their genetic variations [50].

A widely used design for association studies is to screen common single nucleotide polymorphisms (SNPs) and compare the variation between case and control samples for disease-associated gene identification at the genome-wide scale (termed genome-wide association (GWA) studies) [4]. It is commonly accepted that many complex diseases such as
diabetes and cancer arise from a combination of multiple genes which often regulate and interact with each other to produce the traits [51]. Therefore, the goal of these studies is to identify the complex interactions among multiple genes which, together with environmental factors, may substantially increase the risk of the development of diseases. Using SNPs as genetic markers, this problem is commonly formulated as the task of SNP-SNP and SNP-environment interaction identification. Fig. (6) illustrates the pairwise interaction relationships among multiple SNPs.

Among many pattern recognition algorithms, the decision tree algorithm has long been recognized as a promising tool for SNP-SNP interaction identification [52, 53]. Initial attempts to identify gene-gene interactions using decision tree based methods were investigated on relatively small datasets. For instance, Cook et al. [54] applied the decision tree algorithm with a multivariate adaptive regression spline model to explore the presence of genetic interactions among 92 SNPs.

With the increasing popularity of tree-based ensemble methods, they became the focus of many recent studies in the context of SNP-SNP interaction identification for complex disease analysis. Although different ensemble methods have been proposed for identifying SNP-SNP interactions [55, 56], it is random forests that has enjoyed the most popularity [51]. This is largely due to its intrinsic ability to take multiple SNPs jointly into consideration in a nonlinear fashion [57]. In addition, random forests can easily be used as an embedded feature evaluation algorithm [58], which is very useful for disease-associated SNP selection.

The initial work of Bureau et al. [58] shows the advantage of the random forests regression method in linkage data mapping. Several quantitative trait loci were successfully identified. The same group [59] then applied the random forests algorithm in the context of case-control association studies. A similar method was also used by Lunetta et al. [60] for complex interaction identification. However, these early studies limited the SNPs under analysis to a relatively small number (30-40 SNPs).

Recent studies focus on developing customized random forests algorithms and applying them to gene-gene interaction identification at a much higher data dimension, containing several hundred thousand candidate SNPs. Specifically, Cheng et al. [61] investigated the statistical power of random forests in SNP interaction pair identification. Their algorithm was then applied to analyse the SNP data from the complex disease of age-related macular degeneration (AMD) [62], using a haplotype-based method for dimension reduction. Meng et al. [63] modified random forests to take into account linkage disequilibrium (LD) information when measuring the importance of SNPs. Jiang et al. [64] developed a sequential forward feature selection procedure to improve random forests in epistatic interaction identification. The random forests algorithm was first used to compute the Gini index for a total of 116,204 SNPs from the AMD dataset [62] and then used as a classifier to minimize the classification error by selecting a subset of SNPs in a forward sequential manner with a predefined window size.

The Application of Ensemble Methods to Regulatory Elements Prediction

Regulatory elements prediction is a general term that encompasses tasks such as promoter region recognition [5, 65], transcription start site prediction [66], and glycosylation site and phosphorylation site prediction [67]. What these tasks share is the computational identification of functional sites based on DNA or protein sequences together with other biological and/or genomic information. Fig. (7) illustrates different functional sites on a DNA sequence of a gene.

Fig. (6). Schematic illustration of SNP-SNP interactions. A SNP chip is used for genotyping, and the data matrix obtained is similar to those from microarray and MS-based proteomics studies, except that each feature is a SNP variable which can take the genotype AA, AB, or BB. The SNP-SNP interactions are schematically illustrated as the red boxes in the "heat map", with brighter colors indicating stronger interactions and associations with the disease of interest.
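Several of the random-forest approaches above rank SNPs by the decrease in Gini impurity that splitting on their genotypes produces. A self-contained sketch of that scoring idea, with toy data and names of our own invention rather than any published implementation, is:

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def gini_decrease(genotypes, labels):
    """Impurity decrease when samples are partitioned by genotype
    (AA / AB / BB), as a tree node splitting on this SNP would do."""
    n = len(labels)
    groups = {}
    for g, y in zip(genotypes, labels):
        groups.setdefault(g, []).append(y)
    split = sum(len(ys) / n * gini(ys) for ys in groups.values())
    return gini(labels) - split

# Toy case/control study: snp1 tracks disease status, snp2 is noise.
labels = ["case"] * 4 + ["control"] * 4
snp1 = ["AA", "AA", "AB", "AA", "BB", "BB", "AB", "BB"]
snp2 = ["AA", "BB", "AA", "BB", "AA", "BB", "AA", "BB"]

scores = {"snp1": gini_decrease(snp1, labels),
          "snp2": gini_decrease(snp2, labels)}
print(sorted(scores, key=scores.get, reverse=True))  # prints ['snp1', 'snp2']
```

In a real random forest the decrease is accumulated over every node and tree in which a SNP is used, but the per-split score above is the basic ingredient of the Gini importance referred to in the studies cited.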
Fig. (7). Schematic illustration of functional sites. This is a schematic illustration of different functional sites on a DNA sequence of a gene. The task of regulatory elements prediction could be the computational identification of the promoter region (promoter region recognition), the computational identification of the transcription start sites (transcription start site recognition), etc.

Ensemble methods have recently been introduced to this domain due to the diverse data types and features each task employs to perform the recognition, and the diverse patterns presented in different promoter sequences. For instance, Hong et al. [68] proposed a modified boosting approach for identifying transcription factor binding sites. The modified boosting algorithm was applied to ChIP-chip data. It automatically chooses the number of base classifiers to be used so as to avoid overfitting. Xie et al. [69] utilized the AdaBoost algorithm to combine a variety of features for promoter site identification. The features used range from the local distribution of pentamers and positional CpG island (genomic regions with a high density of CpG sites) features to the digitized DNA sequence. AdaBoost is adopted to select the most informative features while building the ensemble of classifiers.

Zhao et al. [70] adopted a similar method that utilizes LogitBoost with stumps for transcription start site prediction. They used a diverse collection of features including core promoter elements, transcription factor binding sites, mechanical properties, Markovian scores, and k-mer frequencies. The resulting program, called CoreBoost, contains two classifiers which are specific for CpG-related promoter prediction and non-CpG-related promoter prediction, respectively. By integrating specific genome-wide histone modifications as a set of extra features, Wang et al. [71] proposed an improved CoreBoost algorithm called CoreBoost with histone modification features, or CoreBoost_HM. They then demonstrated that CoreBoost_HM can successfully be used to predict the core promoters of both coding and noncoding genes. Quite uniquely, Gordon et al. [72] combined a group of SVMs, each with a different mismatch string kernel, for transcription start site prediction. They found significantly reduced false positives in the prediction results, which, from a practical viewpoint, is extremely useful to biologists.

Glycosylation sites and phosphorylation sites are the functional sites of post-translational modifications (PTMs) in protein sequences. Accurate localization of these functional sites can elucidate many important biological processes such as protein folding, subcellular localization, protein transportation, and function. In [73], Hamby and Hirst utilized the random forests algorithm for glycosylation site prediction and prediction rule extraction. A significant increase in prediction accuracy is observed in the prediction of Thr and Asn glycosylation sites. In [74], Caragea et al. devised an ensemble using bagging with SVM base classifiers. Their comparison to a single SVM indicates that the ensemble of SVMs is more accurate according to several evaluation metrics.

Moreover, ensemble methods can be used as an embedded component for model tuning. A typical example is the study [75] in which Yoo et al. employed the AdaBoost algorithm for tuning multiple neural networks. The tuned system was then used for phosphorylation site prediction, and the performance of this system compared favorably to nine existing machine learning algorithms and four widely used phosphorylation site predictors.

Other Emerging Applications of Ensemble Methods in Bioinformatics

Besides the above three main areas, ensemble methods have also been widely applied to many other bioinformatics problems.

In gene function prediction, Guan et al. [76] introduced a meta-ensemble based on SVM. This meta-ensemble contains three "base classifiers": the ensemble of SVMs trained using bagging for each gene ontology (GO) term, the hierarchical Bayesian combination of SVM classifiers, and the naive Bayes combination of SVM classifiers. The prediction of this meta-ensemble is made by selecting the best performing one on each GO term.

Protein folding recognition, structure prediction, and function prediction are closely related problems. In [77], Shen and Chou designed nine sets of features for ensemble recognition of protein folding. The features extracted from the protein sequences include predicted secondary structure, hydrophobicity, van der Waals volume, polarity, polarizability, and different dimensions of pseudo-amino acid composition. Modified kNN base classifiers are trained using the different feature sets and combined in a weighted voting manner.

Melvin et al. [78] proposed to combine a kNN classifier with an SVM classifier for protein structure prediction using sequence information. The kNN classifier is trained using global sequence information (called full-coverage) while the SVM is trained using local sequence information.
A Review of Ensemble Methods in Bioinformatics Current Bioinformatics, 2010, Vol. 5, No. 4 303

Melvin et al. [78] proposed to combine a kNN classifier with an SVM classifier for protein structure prediction using sequence information. The kNN classifier is trained using global sequence information (called full-coverage) while the SVM is trained using local sequence information. The classifiers are then combined by a punting method using a specified threshold. Lee et al. [79] compared random forests to SVM in identifying protein functions with features derived from protein sequence properties. In their study, 484 features are extracted solely from the protein sequence, and a correlation-based feature selection (CFS) procedure coupled with either SVM (SVM_CFS) or random forests (RF_CFS) is applied to identify the final 39 features, which are then used in function classification of 11 different protein classes. The performance of SVM_CFS and RF_CFS is compared with that of SVM and RF without feature selection using five evaluation metrics. Overall, SVM and RF are comparable, and when coupled with CFS the performance can be significantly improved. Wang et al. [80] employed stacked generalization to predict membrane protein types. An SVM and a kNN were used as the base classifiers and a decision tree was adopted to combine the base classifiers.

The problem of protein-protein interaction prediction has also been approached from the ensemble perspective. In [81], Chen and Liu introduced a domain-based random forests method to infer protein interactions. The protein-protein interactions are inferred at the protein domain level, and the proposed "domain-based random decision forest framework" predicts possible domain-domain interactions by considering all single domains as well as domain combination pairs. More recently, Deng et al. [82] applied an SVM-based ensemble algorithm using bootstrap resampling and a weighted voting strategy for protein-protein interaction site prediction. One difficulty of this learning task is the imbalance of the data classes due to the lack of positive training examples. Deng et al. found that their ensemble of SVMs can alleviate the imbalance problem and significantly improve the prediction performance.

Finally, many recent studies also focus on elucidating genetic networks using ensemble methods. For instance, Wu et al. [83] proposed to use a relevance vector machine (RVM)-based ensemble for prediction of human functional genetic networks from multiple sources of data. The proposed ensemble is combined in a boosting manner, and the comparison with a naive Bayes classifier indicates that the ensemble is more effective even with massive missing values. The study of Altay and Emmert-Streib [84] adopted the ensemble approach from a different perspective. In particular, they use an ensemble of datasets drawn under the same condition to reveal the differences in gene networks inferred by different algorithms. They identified the bias of different inference algorithms with respect to different network components, and subsequently use this information to interpret the inferred networks more objectively.

The applications of ensemble methods in bioinformatics reviewed above are by no means an exhaustive list but merely the major topics which have received much attention. In most reviewed studies, ensemble methods have been shown to be very useful. Given the flexibility and the numerous ways to create and tune them, it is likely that much more effort will be directed to solving many more biological problems (both old and new) from the ensemble perspective in the coming years.

EXTENSION OF ENSEMBLE METHODS IN BIOINFORMATICS

The accumulating evidence suggests that the ensemble method is one of the most promising solutions to many biological problems. Due to the immense success of many ensemble methods in bioinformatics applications, numerous extensions have been proposed. In this section, we review some of the most promising directions. They are divided into two major topics. The first discusses different extensions for achieving better prediction, while the second discusses the adaptation of the ensemble theory for feature selection – ensemble feature selection.

The Extension of Ensemble Methods for Classification

Ensemble of SVMs

SVM is generally considered the best "off-the-shelf" classifier. If it can be successfully used as the base classifier of an ensemble, the further improvement gained could be noteworthy. One simple way to use SVM in the ensemble framework is to apply a bagging procedure with an SVM base classifier. This is the approach taken by Caragea et al. [85], who applied a bagging ensemble with SVM base classifiers for glycosylation site prediction. The experimental results indicate that by training each base classifier with a resampling of the "balanced" training set, the performance of the SVM ensemble surpasses both the single SVM and the balanced SVM. Similarly, in [76], Guan et al. applied the bagging procedure for constructing an ensemble of SVMs for gene function prediction. In gene ontology (GO) term recognition, the ensemble of SVMs consistently outperformed the single SVM classifier.

In the study of Peng [86], the concepts of over-generating and then selecting an appropriate subset of base classifiers were investigated. The base classifier used is SVM, and bootstrap sampling is used to generate multiple training sets. Compared to the decision tree, SVM is much more stable under small perturbations of the training samples. In order to obtain diversity among the base classifiers, a clustering-based base classifier selection procedure is employed to explicitly ensure that the base classifiers are accurate while also disagreeing with each other. By comparing it to a single SVM classifier and to the bagging and boosting ensembles, Peng demonstrated that the proposed clustering-based SVM ensemble achieved the best result.

The study by Gordon et al. [72] utilized a unique ensemble approach in which multiple SVMs, each with a different kernel, are combined for transcription start sites prediction. This approach provides a new way to create ensembles of SVMs. It could be useful for problems with heterogeneous data sources and feature types.

Meta Ensemble

One pursuable idea is to gain more improvement by building ensembles of ensembles – meta ensembles. This idea was first investigated by Dettling [87], who proposed to combine the bagging and boosting algorithms (called BagBoosting) for microarray data classification. The underlying hypothesis is that the boosting ensemble has a lower bias but relatively high variance, while the bagging ensemble has a lower variance but approximately unaltered bias. Therefore, combining these two ensemble methods may result in a prediction tool which achieves both low bias and low variance. The empirical evaluation indicates that the proposed BagBoosting can improve the prediction compared to bagging and boosting alone, and it is competitive with several other classifiers such as SVM, kNN, DLDA, and PAM. In [76], three different ensembles of SVMs are treated as "base" classifiers and are further combined as a meta-ensemble of SVMs for gene function prediction. The final predictions of genes are made by selecting the best performing classifier according to each GO term. Another study, by Liu and Xu [88], explored a different way of forming a meta-ensemble of classifiers. Their ensemble system is based on a genetic programming approach which optimizes a group of small-scale ensembles, called sub-ensembles, each consisting of a group of decision trees trained using different sets of input features. The experiment demonstrates that the system outperforms several other evolutionary algorithms.

Ensemble of Multiple Classification Algorithms

Another direction for extending the ensemble idea is to gain the disagreement in sample classification by using different classification algorithms. That is, instead of manipulating the dataset to train different classification models using a given classification algorithm such as decision trees or SVM, these methods attempt to obtain the diversity of the base classifiers by using heterogeneous classification algorithms.

For example, Bhanot et al. [89] combined ANN, SVM, weighted voting, kNN, decision trees, and logistic regression for the classification of mass spectrometry data. Kedarisetti et al. [90] extracted different sets of features from the protein sequence database to train an ensemble of classifiers using kNN, decision trees, logistic regression, and SVM classification algorithms. The ensemble is then used for protein structural class prediction. Hassan et al. [91] combined a set of fifteen classifiers ranging from rule-based classifiers such as kNN and decision trees to function-based classifiers such as SVM and neural networks. This ensemble of classification algorithms is applied to three microarray datasets to find a small number of highly differentially expressed (DE) genes. Yang et al. [92] proposed a multi-filter enhanced genetic ensemble system for microarray analysis. The system combines multiple classifiers and filtering algorithms with a multi-objective genetic algorithm. By introducing a combinatorial ranking component and optimizing a set of base classifiers, Yang et al. [93] extended the genetic ensemble system for gene-gene interaction identification from GWA studies.

The similarity of this class of ensemble methods is that the diversity of the ensemble classifier is imposed by using different classification algorithms. However, this could be further combined with data-level perturbation to produce a meta-ensemble of classifiers, which could potentially increase the overall diversity while providing higher classification accuracy. A schematic illustration of this kind of ensemble method is depicted in Fig. (8).

Fig. (8). Schematic illustration of the ensemble using different classification algorithms. (a) Classification algorithms are trained using the same training set. (b) Classification algorithms are trained using different perturbations of the training set.

Other Approaches

It is possible to create ensembles using many other approaches. Liu et al. [94] introduced a novel ensemble of neural networks by using three different feature selection/extraction methods coupled with bootstrapping to generate diverse base classifiers. Their study demonstrated that the diversity of base classifiers can also be obtained by incorporating different feature generating algorithms which provide several different gene ranking lists. A similar idea was used by Koziol et al. [95]. They assembled five disjoint lists of genes and created base classifiers of decision trees, each trained on the dataset filtered by one gene list. The prediction is then made by simple voting.
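Stripped to its essentials, this family of methods trains one base classifier per feature subset and takes a plain majority vote. The following stdlib-only sketch is purely illustrative (decision stumps with invented thresholds and toy labels, not the code of [94] or [95]):

```python
# Hypothetical sketch: base classifiers over disjoint feature subsets,
# combined by simple (unweighted) majority voting.
from collections import Counter

def make_stump(feature_idx, threshold, low_label, high_label):
    """A one-feature decision stump acting as a weak base classifier."""
    def predict(sample):
        return high_label if sample[feature_idx] > threshold else low_label
    return predict

def majority_vote(classifiers, sample):
    """Each base classifier casts one vote; ties resolve to the first label seen."""
    votes = Counter(clf(sample) for clf in classifiers)
    return votes.most_common(1)[0][0]

# Three stumps, each looking at a different (disjoint) "gene" of the profile.
ensemble = [
    make_stump(0, 0.5, "normal", "tumor"),
    make_stump(1, 1.0, "normal", "tumor"),
    make_stump(2, 0.2, "normal", "tumor"),
]

profile = (0.8, 0.4, 0.9)  # toy expression profile
print(majority_vote(ensemble, profile))  # prints "tumor" (two of three votes)
```

With disjoint feature subsets, each vote reflects a different slice of the data, which is exactly the diversity argument made above.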
To enhance the random forests algorithm for very high-dimensional datasets, Amaratunga et al. [96] designed a random forests variant called enriched random forests, which weighs the importance of features when selecting splitting nodes. This modified random forests demonstrated very promising results when the dimension of the microarray data is huge while the number of discriminative genes is small. More selection chance is given to these informative genes, while the diversity of the base classifiers is still preserved by also including other, different genes for building each base classifier.

Yanover et al. [97] introduced a statistical ensemble method called the solution-aggregating motif finder (SAMF). Their method is based on a Markov random field with the BMMF algorithm [98], which gives the M top-scoring solutions. The final result is given by aggregating the clustering output of the BMMF solutions.

There are also many extensions of ensemble methods under the Bayesian framework. For example, Armananzas et al. [99] proposed a hierarchy of Bayesian network classifiers for detecting gene interactions from microarray data, and Robles et al. [100] used Bayesian networks to combine multiple classifiers for protein secondary structure prediction.

Finally, utilizing the general theory of ensembles, Hu et al. [101] and Wijaya et al. [102] proposed to combine the outputs of multiple motif finder algorithms so as to improve the final prediction result. In this regard, the focus is shifted to the design of a proper integration function for combining multiple results.

The Adaptation of the Ensemble Theory for Feature Selection

The idea of ensembles in biological data analysis originates from combining multiple classifiers for improving sample classification accuracy. However, it has been adapted and increasingly used in feature selection, possibly as a consequence of the growing concern over the instability of feature selection results from high-dimensional data [103].

One direct adaptation of ensemble methods for feature selection is to modify them as embedded feature selection algorithms by incorporating a feature extraction component. This idea is very similar to the use of random forests for SNP-SNP interaction identification. For instance, Jiang et al. [104] employed a gene shaving method with random forests so as to select differentially expressed genes. Levner [105] designed a feature extraction procedure for a boosting ensemble. The feature selection procedure is similar to a sequential forward selection procedure in that the algorithm selects a single best feature during each boosting iteration. Saeys et al. [106] also applied random forests as an embedded feature selection algorithm. However, it is further combined with two other filtering algorithms – Symmetrical Uncertainty and RELIEF – and an SVM with recursive feature elimination (SVM-RFE). The results of these studies generally support the adaptation of the ensemble method for feature selection.

A more general approach is to utilize the ensemble theory of combining multiple models. Specifically, Dutkowski and Gambin [107] combined several filtering algorithms in a cross-validation framework for biomarker selection from mass spectrometry data. Multiple classification algorithms are used to evaluate the selected biomarkers so as to yield more stable results. Zhang et al. [108] incorporated multiple filtering algorithms and classification algorithms to improve the prediction accuracy and the stability of the gene ranking results in a genetic algorithm based wrapper procedure. Abeel et al. [109] studied the ensemble of filters in a bootstrap framework. Netzer et al. [110] developed a feature selection approach using the principle of stacked generalization. The feature selection algorithm, termed stacked feature ranking, is reported to identify important markers and improve sample classification accuracy.

Yang et al. [111] integrated various statistical methods such as the t-test, penalized t-test, mixture models, and linear models to improve the robustness of the gene ranking results of microarray data. Similarly, Chan et al. [112] combined the Wilcoxon test with different feature selection procedures and different classification algorithms. They divided the feature selection into two levels – statistical feature selection and secondary feature selection. The underlying principle behind these methods is that genes and proteins that are selected or highly ranked by different measures are more likely to have genuine biological relevance than those selected by a single measure [111].

CONCLUSIONS

In classification and prediction, a carefully engineered ensemble algorithm generally offers higher accuracy and stability than a single algorithm can achieve. In addition, ensemble algorithms can often alleviate the problems of small sample size and high dimensionality which commonly occur in many bioinformatics applications. It is worth mentioning that the increased accuracy is often accompanied by increased model complexity, which causes decreased model interpretability and higher computational intensity. Nevertheless, the theoretical study of ensemble approaches and the increase of computational power may counter those difficulties.

Besides classification, many ensemble methods can also be used, with minor modifications, for feature selection or measuring feature importance. These are the main tasks in many biological studies such as disease-associated gene selection from microarray, disease-associated protein selection from mass spectrometry data, or high-risk SNP and SNP-SNP interaction identification from GWA studies. In feature selection, the development of novel methods which are guided by general ensemble learning theory has proven to be fruitful. Therefore, they are likely to be effective methods and tools to address the ever-widening gap between the sample size and the data dimension generated by high-throughput biological experiments.

This review mainly focused on the most popular methods and the main applications. Yet, the idea of ensembles has been widely applied to many other bioinformatics problems, which is beyond the scope of this review. The utilization of ensemble methods has been one of the recent growing trends in the field of bioinformatics. It is our expectation that ensemble methods will become a flexible and promising technique for addressing many more bioinformatics problems in the years to come.
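The ensemble feature selection strategy discussed in this review, aggregating feature rankings over resampled datasets to stabilize the selection, can be sketched in a few lines of stdlib Python. The data, the crude mean-difference score, and all parameter values below are invented for illustration; this is not code from any of the cited studies.

```python
# Hypothetical sketch of ensemble feature selection by rank aggregation
# over bootstrap resamples.
import random

def score_feature(samples, labels, f):
    """Absolute difference of class means for feature f (a crude filter score)."""
    a = [s[f] for s, y in zip(samples, labels) if y == 0]
    b = [s[f] for s, y in zip(samples, labels) if y == 1]
    if not a or not b:  # degenerate bootstrap draw: one class missing
        return 0.0
    return abs(sum(a) / len(a) - sum(b) / len(b))

def ensemble_rank(samples, labels, n_boot=50, seed=0):
    """Average each feature's rank over bootstrap resamples of the samples."""
    rng = random.Random(seed)
    n, n_feat = len(samples), len(samples[0])
    total_rank = [0.0] * n_feat
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]  # bootstrap resample
        bx = [samples[i] for i in idx]
        by = [labels[i] for i in idx]
        scores = [score_feature(bx, by, f) for f in range(n_feat)]
        for rank, f in enumerate(sorted(range(n_feat), key=lambda g: -scores[g])):
            total_rank[f] += rank
    # Lowest aggregated rank = most consistently informative feature.
    return sorted(range(n_feat), key=lambda g: total_rank[g])

# Feature 1 separates the two classes; features 0 and 2 are noise (toy data).
x = [(0.5, 0.0, 0.3), (0.4, 0.1, 0.5), (0.5, 1.0, 0.4), (0.6, 0.9, 0.2)]
y = [0, 0, 1, 1]
print(ensemble_rank(x, y))  # feature 1 is expected to rank first
```

In practice the per-resample score would be a t-statistic, a RELIEF weight, or a random forests importance, and the aggregation could use medians or selection frequencies instead of mean ranks; the stabilizing effect of resampling and aggregation is the same.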
SUMMARY

• Ensemble methods have been increasingly applied to bioinformatics problems in dealing with small sample size, high dimensionality, and complex data structures.

• The main applications of ensemble methods in bioinformatics are classification of gene expression and mass spectrometry-based proteomics data, gene-gene interaction identification from genome-wide association studies, and prediction of regulatory elements from DNA and protein sequences.

• Sampling methods such as bagging and boosting are effective in dealing with data with small sample size, while random forests holds a unique advantage in dealing with high-dimensional data.

• Emerging ensemble methods such as ensembles of support vector machines, meta-ensembles, and ensembles of heterogeneous classification algorithms are promising directions for more accurate classification in bioinformatics.

• Ensemble-based feature selection is a promising approach for feature selection and biomarker identification in bioinformatics.

ACKNOWLEDGEMENTS

We thank Professor Joachim Gudmundsson for critical comments and constructive suggestions which have greatly improved the early version of this article. We also thank Georgina Wilcox for editing the article. Pengyi Yang is supported by the NICTA International Postgraduate Award (NIPA) and the NICTA Research Project Award (NRPA).

REFERENCES

[1] Larranaga P, Calvo B, Santana R, et al. Machine learning in bioinformatics. Brief Bioinform 2006; 7: 86-112.
[2] Allison DB, Cui X, Page GP, et al. Microarray data analysis: from disarray to consolidation and consensus. Nat Rev Genet 2006; 7: 55-66.
[3] Aebersold R, Mann M. Mass spectrometry-based proteomics. Nature 2003; 422: 198-207.
[4] Hirschhorn J, Daly M. Genome-wide association studies for common diseases and complex traits. Nat Rev Genet 2005; 6: 95-108.
[5] Zeng J, Zhu S, Yan H. Towards accurate human promoter recognition: a review of currently used sequence features and classification methods. Brief Bioinform 2009; 10: 498-508.
[6] Valencia A, Pazos F. Computational methods for the prediction of protein interactions. Curr Opin Struct Biol 2002; 12: 368-73.
[7] Jones DT. Protein structure prediction in genomics. Brief Bioinform 2001; 2: 111-25.
[8] Dietterich TG. Ensemble methods in machine learning. In: Proceedings of Multiple Classifier Systems, LNCS 1857. Springer: Italy 2000; pp. 1-15.
[9] Dietterich TG. An experimental comparison of three methods for constructing ensembles of decision trees: bagging, boosting and randomization. Mach Learn 2000; 40: 139-58.
[10] Breiman L. Bagging predictors. Mach Learn 1996; 24: 123-40.
[11] Freund Y, Schapire R. Experiments with a new boosting algorithm. In: Proceedings of the Thirteenth International Conference on Machine Learning. Italy 1996; pp. 148-56.
[12] Breiman L. Random forests. Mach Learn 2001; 45: 5-32.
[13] Kuncheva L. Combining pattern classifiers: methods and algorithms. Wiley: New Jersey 2004.
[14] Webb GI, Zheng Z. Multistrategy ensemble learning: reducing error by combining ensemble learning techniques. IEEE Trans Knowl Data Eng 2004; 16: 980-91.
[15] Breiman L. Arcing classifiers (with discussion). Ann Stat 1998; 26: 801-49.
[16] Schapire RE, Freund Y, Bartlett P, et al. Boosting the margin: a new explanation for the effectiveness of voting methods. Ann Stat 1998; 26: 1651-86.
[17] Tsymbal A, Pechenizkiy M, Cunningham P. Diversity in search strategies for ensemble feature selection. Inf Fusion 2005; 6: 83-98.
[18] Wolpert DH. Stacked generalization. Neural Netw 1992; 5: 241-59.
[19] Kuncheva LI. Switching between selection and fusion in combining classifiers: an experiment. IEEE Trans Syst Man Cybern 2002; 32: 146-56.
[20] Gama J, Brazdil P. Cascade generalization. Mach Learn 2000; 41: 315-43.
[21] Lam L, Suen Y. Application of majority voting to pattern recognition: an analysis of its behavior and performance. IEEE Trans Syst Man Cybern 1997; 27: 553-68.
[22] Bahler D, Navarro L. Methods for combining heterogeneous sets of classifiers. In: Proceedings of the 17th National Conference on Artificial Intelligence (AAAI), Workshop on New Research Problems for Machine Learning. Austin, Texas 2000.
[23] Kang H, Kim K, Kim J. A framework for probabilistic combination of multiple classifiers at an abstract level. Eng Appl Artif Intell 1997; 10: 379-85.
[24] Kittler J, Hatef M, Duin RP, et al. On combining classifiers. IEEE Trans Pattern Anal Mach Intell 1998; 20: 226-39.
[25] Asyali MH, Colak D, Demirkaya O, et al. Gene expression profile classification: a review. Curr Bioinform 2006; 1: 55-73.
[26] Somorjai RL, Dolenko B, Baumgartner R, et al. Class prediction and discovery using gene microarray and proteomics mass spectroscopy data: curses, caveats, cautions. Bioinformatics 2003; 19: 1484-91.
[27] Saeys Y, Inza I, Larranaga P. A review of feature selection techniques in bioinformatics. Bioinformatics 2007; 23: 2507-17.
[28] Hilario M, Kalousis A. Approaches to dimensionality reduction in proteomic biomarker studies. Brief Bioinform 2008; 9: 102-18.
[29] Braga-Neto U, Dougherty E. Is cross-validation valid for small-sample microarray classification? Bioinformatics 2004; 20: 374-80.
[30] Ben-Dor A, Bruhn L, Friedman N, et al. Tissue classification with gene expression profiles. J Comput Biol 2000; 7: 559-83.
[31] Dudoit S, Fridlyand J, Speed T. Comparison of discrimination methods for the classification of tumors using gene expression data. J Am Stat Assoc 2002; 97: 77-87.
[32] Dettling M, Buhlmann P. Boosting for tumor classification with gene expression data. Bioinformatics 2003; 19: 1061-9.
[33] Long P. Boosting and microarray data. Mach Learn 2003; 53: 31-44.
[34] Tan A, Gilbert D. Ensemble machine learning on gene expression data for cancer classification. Appl Bioinformatics 2003; 2: S75-S83.
[35] Qu Y, Adam B, Yasui Y, et al. Boosted decision tree analysis of surface-enhanced laser desorption/ionization mass spectral serum profiles discriminates prostate cancer from noncancer patients. Clin Chem 2002; 48: 1835-43.
[36] Wu B, Abbott T, Fishman D, et al. Comparison of statistical methods for classification of ovarian cancer using mass spectrometry data. Bioinformatics 2003; 19: 1636-43.
[37] Gertheiss J, Tutz G. Supervised feature selection in mass spectrometry-based proteomic profiling by blockwise boosting. Bioinformatics 2009; 25: 1076-7.
[38] Yasui Y, Pepe M, Thompson M, et al. A data-analytic strategy for protein biomarker discovery: profiling of high-dimensional proteomic data for cancer detection. Biostatistics 2003; 4: 449-63.
[39] Lee J, Lee J, Park M, et al. An extensive comparison of recent classification tools applied to microarray data. Comput Stat Data Anal 2005; 48: 869-85.
[40] Diaz-Uriarte R, de Andres S. Gene selection and classification of microarray data using random forest. BMC Bioinformatics 2006; 7: 3.
[41] Izmirlian G. Application of the random forest classification algorithm to a SELDI-TOF proteomics study in the setting of a cancer prevention trial. Ann N Y Acad Sci 2004; 1020: 154-74.
[42] Zhang H, Yu C, Singer B. Cell and tumor classification using gene expression data: construction of forests. Proc Natl Acad Sci USA 2003; 100: 4168-72.
[43] Geurts P, Fillet M, Seny D, et al. Proteomic mass spectra classification using decision tree based ensemble methods. Bioinformatics 2005; 21: 3138-45.
[44] Dougherty ER, Sima C, Hanczar B, et al. Performance of error estimators for classification. Curr Bioinform 2010; 5: 53-67.
[45] Ge G, Wong G. Classification of premalignant pancreatic cancer mass-spectrometry data using decision tree ensembles. BMC Bioinformatics 2008; 9: 275.
[46] Statnikov A, Wang L, Aliferis C. A comprehensive comparison of random forests and support vector machines for microarray-based cancer classification. BMC Bioinformatics 2008; 9: 319.
[47] Curtis RK, Oresic M, Vidal-Puig A. Pathways to the analysis of microarray data. Trends Biotechnol 2005; 23: 429-35.
[48] Pang H, Lin A, Holford M, et al. Pathway analysis using random forests classification and regression. Bioinformatics 2006; 22: 2028-36.
[49] Kanehisa M, Araki M, Goto S, et al. KEGG for linking genomes to life and the environment. Nucleic Acids Res 2008; 36: D480-D484.
[50] Montana G. Statistical methods in genetics. Brief Bioinform 2006; 7: 297-308.
[51] Cordell HJ. Detecting gene-gene interactions that underlie human diseases. Nat Rev Genet 2009; 10: 392-404.
[52] Zhang H, Bonney G. Use of classification trees for association studies. Genet Epidemiol 2000; 19: 323-32.
[53] Huang J, Lin A, Narasimhan B, et al. Tree-structured supervised learning and the genetics of hypertension. Proc Natl Acad Sci USA 2004; 101: 10529-34.
[54] Cook N, Zee R, Ridker P. Tree and spline based association analysis of gene-gene interaction models for ischemic stroke. Stat Med 2004; 23: 1439-53.
[55] Ye Y, Zhong X, Zhang H. A genome-wide tree- and forest-based association analysis of comorbidity of alcoholism and smoking. BMC Genet 2005; 6: S135.
[56] Zhang Z, Zhang S, Wong MY, et al. An ensemble learning approach jointly modeling main and interaction effects in genetic association studies. Genet Epidemiol 2008; 32: 285-300.
[57] McKinney BA, Reif DM, Ritchie MD, et al. Machine learning for detecting gene-gene interactions: a review. Appl Bioinformatics 2006; 5: 77-88.
[58] Bureau A, Dupuis J, Hayward B, et al. Mapping complex traits using random forests. BMC Genet 2003; 4: S64.
[59] Bureau A, Dupuis J, Falls K, et al. Identifying SNPs predictive of phenotype using random forests. Genet Epidemiol 2005; 28: 171-82.
[60] Lunetta KL, Hayward LB, Segal J, et al. Screening large-scale association study data: exploiting interactions using random forests. BMC Genet 2004; 5: 32.
[61] Chen X, Liu C, Zhang M, et al. A forest-based approach to identifying gene and gene-gene interactions. Proc Natl Acad Sci USA 2007; 104: 19199-203.
[62] Klein RJ, Zeiss C, Chew EY, et al. Complement factor H polymorphism in age-related macular degeneration. Science 2005; 308: 385-9.
[63] Meng Y, Yu Y, Cupples L, et al. Performance of random forest when SNPs are in linkage disequilibrium. BMC Bioinformatics 2009; 10: 78.
[64] Jiang R, Tang W, Wu X, et al. A random forest approach to the detection of epistatic interactions in case-control studies. BMC Bioinformatics 2009; 10: S65.
[65] Zhang M. Computational analyses of eukaryotic promoters. BMC Bioinformatics 2007; 8: S3.
[66] Down TA, Hubbard TJP. Computational detection and location of transcription start sites in mammalian genomic DNA. Genome Res 2002; 12: 458-61.
[67] Blom N, Ponten T, Gupta R, et al. Prediction of post-translational glycosylation and phosphorylation of proteins from amino acid sequence. Proteomics 2004; 4: 1633-49.
[68] Hong P, Liu XS, Zhou Q, et al. A boosting approach for motif modeling using ChIP-chip data. Bioinformatics 2005; 21: 2636-43.
[69] Xie X, Wu S, Lam KM, et al. PromoterExplorer: an effective promoter identification method based on the AdaBoost algorithm. Bioinformatics 2006; 22: 2722-8.
[70] Zhao X, Xuan Z, Zhang M. Boosting with stumps for predicting transcription start sites. Genome Biol 2007; 8: R17.
[71] Wang X, Xuan Z, Zhao X, et al. High-resolution human core-promoter prediction with CoreBoost_HM. Genome Res 2009; 19: 266-75.
[72] Gordon JJ, Towsey MW, Hogan JM, et al. Improved prediction of bacterial transcription start sites. Bioinformatics 2006; 22: 142-8.
[73] Hamby SE, Hirst JD. Prediction of glycosylation sites using random forests. BMC Bioinformatics 2008; 9: 500.
[74] Caragea C, Sinapov J, Silvescu A, et al. Glycosylation site prediction using ensembles of support vector machine classifiers. BMC Bioinformatics 2007; 8: 438.
[75] Yoo PD, Ho YS, Zhou BB, et al. SiteSeek: post-translational modification analysis using adaptive locality-effective kernel methods and new profiles. BMC Bioinformatics 2008; 9: 272.
[76] Guan Y, Myers C, Hess D, et al. Predicting gene function in a hierarchical context with an ensemble of classifiers. Genome Biol 2008; 9: S3.
[77] Shen HB, Chou KC. Ensemble classifier for protein fold pattern recognition. Bioinformatics 2006; 22: 1717-22.
[78] Melvin I, Weston J, Leslie CS, et al. Combining classifiers for improved classification of proteins from sequence or structure. BMC Bioinformatics 2008; 9: 389.
[79] Lee B, Shin M, Oh Y, et al. Identification of protein functions using a machine learning approach based on sequence-derived properties. Proteome Sci 2009; 7: 27.
[80] Wang SQ, Yang J, Chou KC. Using stacked generalization to predict membrane protein types based on pseudo-amino acid composition. J Theor Biol 2006; 242: 941-6.
[81] Chen XW, Liu M. Prediction of protein-protein interactions using random decision forest framework. Bioinformatics 2005; 21: 4394-400.
[82] Deng L, Guan J, Dong Q, et al. Prediction of protein-protein interaction sites using an ensemble method. BMC Bioinformatics 2009; 10: 426.
[83] Wu CC, Asgharzadeh S, Triche TJ, et al. Prediction of human functional genetic networks from heterogeneous data using RVM-based ensemble learning. Bioinformatics 2010; 26: 807-13.
[84] Altay G, Emmert-Streib F. Revealing differences in gene network inference algorithms on the network level by ensemble methods. Bioinformatics 2010; 26: 1738-44.
[85] Caragea C, Sinapov J, Silvescu A, et al. Glycosylation site prediction using ensembles of support vector machine classifiers. BMC Bioinformatics 2007; 8: 438.
[86] Peng Y. A novel ensemble machine learning for robust microarray data classification. Comput Biol Med 2006; 36: 553-73.
[87] Dettling M. BagBoosting for tumor classification with gene expression data. Bioinformatics 2004; 20: 3583-93.
[88] Liu KH, Xu CG. A genetic programming-based approach to the classification of multiclass microarray datasets. Bioinformatics 2009; 25: 331-7.
[89] Bhanot G, Alexe G, Venkataraghavan B, et al. A robust meta-classification strategy for cancer detection from MS data. Proteomics 2006; 6: 592-604.
[90] Kedarisetti KD, Kurgan L, Dick S. Classifier ensembles for protein structural class prediction with varying homology. Biochem Biophys Res Commun 2006; 348: 981-8.
[91] Hassan MR, Hossain MM, Bailey J, et al. A voting approach to identify a small number of highly predictive genes using multiple classifiers. BMC Bioinformatics 2009; 10: S19.
[92] Yang P, Zhou BB, Zhang Z, et al. A multi-filter enhanced genetic ensemble system for gene selection and sample classification of microarray data. BMC Bioinformatics 2010; 11: S5.
[93] Yang P, Ho JWK, Zomaya AY, et al. A genetic ensemble approach for gene-gene interaction identification. BMC Bioinformatics 2010; 11: 524.
[94] Liu B, Cui Q, Jiang T, et al. A combinational feature selection and ensemble neural network method for classification of gene expression data. BMC Bioinformatics 2004; 5: 136.
[95] Koziol JA, Feng AC, Jia Z, et al. The wisdom of the commons: ensemble tree classifiers for prostate cancer prognosis. Bioinformatics 2009; 25: 54-60.
[96] Amaratunga D, Cabrera J, Lee Y. Enriched random forests. Bioinformatics 2008; 24: 2010-4.
308 Current Bioinformatics, 2010, Vol. 5, No. 4 Yang et al.

[97] Yanover C, Singh M, Zaslavsky E. M are better than one: an en- [106] Saeys Y, Abeel T, Van de Peer Y. Robust feature selection using
semble-based motif finder and its application to regulatory element ensemble feature selection techniques. In: Proceedings of the euro-
prediction. Bioinformatics 2009; 25: 868-74. pean conference on machine learning and knowledge discovery in
[98] Yanover C, Weiss Y. Finding the AI most probable configurations databases. Part II, LNCS 5212. Belgium: Springer 2008; pp. 313-
using loopy belief propagation. In: Proceedings of advances in neu- 25.
ral information processing systems. Vancover: The MIT Press [107] Dutkowski J, Gambin A. On consensus biomarker selection. BMC
2004; p. 289. Bioinformatics 2007; 8: S5.
[99] Armananzas R, Inza I, Larranaga P. Detecting reliable gene interac- [108] Zhang Z, Yang P, Wu X, et al. An agent-based hybrid system for
tions by a hierarchy of Bayesian network classifiers. Comput microarray data analysis. IEEE Intel Syst 2009; 24: 53-63.
Methods Programs Biomed 2008; 91: 110-21. [109] Abeel T, Helleputte T, Van de Peer Y, et al. Robust biomarker
[100] Robles V, Larranaga P, Pena JM, et al. Bayesian network multi- identification for cancer diagnosis with ensemble feature selection
classifiers for protein secondary structure prediction. Artif Intell methods. Bioinformatics 2010; 26: 392-8.
Med 2004; 31: 117-36. [110] Netzer M, Millonig G, Osl M, et al. A new ensemble-based algo-
[101] Hu J, Yang YD, Kihara D. EMD: an ensemble algorithm for dis- rithm for identifying breath gas marker candidates in liver disease
covering regulatory motifs in DNA sequences. BMC Bioinformat- using ion molecule reaction mass spectrometry. Bioinformatics
ics 2006; 7: 342. 2009; 25: 941-7.
[102] Wijaya E, Yiu SM, Son NT, et al. MotifVoter: a novel ensemble [111] Yang YH, Xiao Y, Segal MR. Identifying differentially expressed
method for fine-grained integration of generic motif finders. Bioin- genes from microarray experiments via statistic synthesis. Bioin-
formatics 2008; 24: 2288-95. formatics 2005; 21: 1084-93.
[103] Boulesteix AL, Slawski M. Stability and aggregation of ranked [112] Chan D, Bridges SM, Burgess SC. An ensemble method for identi-
gene lists. Brief Bioinform 2009; 10: 556-68. fying robust features for biomarker identification. In: Liu H, Ed.
[104] Jiang H, Deng Y, Chen HS, et al. Joint analysis of two microarray Computational Methods of Feature Selection. Arizona State Uni-
gene-expression data sets to select lung adenocarcinoma marker versity, Tempe, AZ, Hiroshi Motoda, AFOSR/AOARD, Tokyo,
genes. BMC Bioinformatics 2004; 5: 81. Japan. Series: Chapman & Hall/CRC Data Mining and Knowledge
[105] Levner I. Feature selection and nearest centroid classification for Discovery Series 2007; pp. 377-92.
protein mass spectrometry. BMC Bioinformatics 2005; 6: 68.

Received: September 28, 2010 Revised: November 04, 2010 Accepted: November 15, 2010
