Deep centroid a general deep cascade classifier for biomedical omics data classification

The paper introduces Deep Centroid, a novel deep cascade classifier designed for biomedical omics data classification, addressing challenges like high dimensionality and limited sample sizes. It demonstrates superior performance in cancer early diagnosis, prognosis, and drug sensitivity prediction compared to traditional machine learning models, while also providing biologically significant feature interpretations. The Deep Centroid classifier is available for public use, enhancing the applicability of machine learning in precision medicine.


Bioinformatics, 2024, 40(2), btae039

https://doi.org/10.1093/bioinformatics/btae039
Advance Access Publication Date: 1 February 2024
Original Paper

Gene expression

Downloaded from https://academic.oup.com/bioinformatics/article/40/2/btae039/7596621 by National Science & Technology Library user on 26 March 2024
Deep centroid: a general deep cascade classifier for biomedical omics data classification
Kuan Xie1, Yuying Hou1, Xionghui Zhou1,2,*
1 Hubei Key Laboratory of Agricultural Bioinformatics, College of Informatics, Huazhong Agricultural University, Wuhan 430070, People’s Republic of China
2 Key Laboratory of Smart Farming for Agricultural Animals, Huazhong Agricultural University, Wuhan 430070, People’s Republic of China
*Corresponding author. Hubei Key Laboratory of Agricultural Bioinformatics, College of Informatics, Huazhong Agricultural University, Wuhan 430070, People’s Republic of China. E-mail: [email protected] (X.Z.)
Associate Editor: Anthony Mathelier

Abstract
Motivation: Classification of samples using biomedical omics data is a widely used method in biomedical research. However, these datasets often possess challenging characteristics, including high dimensionality, limited sample sizes, and inherent biases across diverse sources. These factors limit the performance of traditional machine learning models, particularly when applied to independent datasets.
Results: To address these challenges, we propose a novel classifier, Deep Centroid, which combines the stability of the nearest centroid classifier and the strong fitting ability of the deep cascade strategy. Deep Centroid is an ensemble learning method with a multi-layer cascade structure, consisting of feature scanning and cascade learning stages that can dynamically adjust the training scale. We apply Deep Centroid to three precision medicine applications (cancer early diagnosis, cancer prognosis, and drug sensitivity prediction) using cell-free DNA fragmentations, gene expression profiles, and DNA methylation data. Experimental results demonstrate that Deep Centroid outperforms six traditional machine learning models in all three applications, showcasing its potential in biological omics data classification. Furthermore, functional annotations reveal that the features scanned by the model exhibit biological significance, indicating its interpretability from a biological perspective. Our findings underscore the promising application of Deep Centroid in the classification of biomedical omics data, particularly in the field of precision medicine.
Availability and implementation: Deep Centroid is available at both GitHub (github.com/xiexiexiekuan/DeepCentroid) and Figshare (https://figshare.com/articles/software/Deep_Centroid_A_General_Deep_Cascade_Classifier_for_Biomedical_Omics_Data_Classification/24993516).

1 Introduction

The rapid advancement of next-generation sequencing technologies has led to the generation of large-scale omics data, providing ample opportunities for harnessing machine learning models to investigate associations between genetic features and specific phenotypes, such as cancer, as well as predicting sample categories based on omics features. However, omics data often possess high-dimensional features and are constrained by limited sample sizes (Sun et al. 2019, Basavegowda and Dagnew 2020). Furthermore, data from different sources often exhibit inherent inconsistencies that are difficult to eliminate (Papiez et al. 2019). Consequently, these factors pose significant hurdles in constructing classifiers with exceptional classification performance and robust discriminative ability across datasets derived from various sources (Demsar and Zupan 2021, Greener et al. 2022).

In recent years, machine learning, especially deep learning, has made significant progress in various fields, and deep learning models have achieved remarkable breakthroughs in some biomedical domains (Jumper et al. 2021, Townshend et al. 2021, Baek and Baker 2022, Cheng et al. 2023). Because deep learning models typically require large-scale training data, deep learning has achieved remarkable success primarily in tasks such as molecular structure prediction, where abundant training data are available. However, for most biomedical omics data, characterized by small sample sizes and high data dimensions, there is currently no widely accepted universal deep learning model suitable for omics classification problems. It is worth noting that many deep learning models suffer from limited interpretability, which conflicts with the emphasis that biomedical omics research places on the biological functional relevance of features and further constrains the applicability of deep learning models in this domain.

To address the issues of large training sample requirements and lack of interpretability in deep learning models, Zhou and colleagues (Zhou and Feng 2017, Zhou and Feng 2019) developed a cascaded classifier based on random forest called Deep Forest (DF). This model has achieved performance comparable to deep learning models on multiple datasets. Some researchers have modified the DF architecture to make it suitable for biomedical data (Su et al. 2019, Chu et al. 2021, Wu et al. 2023). However, these studies still lack validation on independent datasets and are only applicable to specific problems. Therefore, an easy-to-use, general, and robust classification model is crucial for the application of biomedical omics data.

The centroid classifier (Simon 2003), also known as Nearest Centroid Classifier, is renowned for its simplicity and consistent performance on independent datasets (Lu and Zhou 2019). Simultaneously, the deep cascade strategy

Received: 30 October 2023; Revised: 13 January 2024; Editorial Decision: 15 January 2024
© The Author(s) 2024. Published by Oxford University Press.
This is an Open Access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.

significantly enhances the model’s fitting capability. Building upon these strengths, we present Deep Centroid, a novel classifier that combines the stability of the centroid classifier with the robust fitting ability of the deep cascade strategy. Deep Centroid uses an ensemble learning approach with a multi-layer cascade structure, comprising feature scanning and cascade learning stages that allow for dynamic adjustment of the training scale. To validate the effectiveness of our model, we applied Deep Centroid to three important problems in precision medicine research: early cancer diagnosis, cancer prognosis, and drug sensitivity prediction, utilizing diverse biomedical omics data as features. The results demonstrate that our model outperforms six alternative models in both cross-validation and independent validation. Notably, the features identified by Deep Centroid hold biological significance, highlighting the potential applicability of our approach. In addition, we provide a comprehensive Python package for Deep Centroid, equipped with user-friendly functions and interfaces for parameter tuning. This package ensures effective and robust classification using biomedical omics data.

2 Materials and methods

2.1 Datasets
Our method has been applied to three typical precision medicine domains: early cancer diagnosis, cancer prognosis, and drug sensitivity prediction. For early cancer diagnosis, we obtained whole-genome cell-free DNA (cfDNA) sequencing data from plasma samples of lung cancer patients, referred to as the LUCAS dataset and the LUCAS validation dataset, from the study by Mathios et al. (2021). Following the approach outlined by Ulz et al. (2016), we used the number of fragments near the transcription start site (TSS) within [−50, 150] as features to construct the early cancer diagnosis model. Consistent with the approach of Mathios et al., we used the LUCAS dataset for cross-validation and model construction and the LUCAS validation dataset for independent validation.

In the case of cancer prognosis, we utilized four breast cancer transcriptome datasets along with clinical information for each sample. Building upon our previous work (Lu and Zhou 2019), we used the GSE2034 dataset (Wang et al. 2005) for cross-validation and model construction, while the union of the other three datasets (Desmedt et al. 2007, Schmidt et al. 2008, Zhang et al. 2009) was used for independent validation. All transcriptome data were based on the GPL96 chip, and chip probe information was mapped to genes. For each gene, the average value of all probes was used as the gene’s value. Patients whose metastasis/recurrence event occurred within five years were treated as samples with poor prognosis. Patients whose event-free survival exceeded five years were treated as having a good prognosis, and all other samples were discarded.

All datasets for drug sensitivity prediction were obtained from GDSC (Genomics of Drug Sensitivity in Cancer) (Iorio et al. 2016), and we used 949 cell line datasets that included DNA methylation data (GSE68379) and gene transcriptome data (RMA normalized expression data for cell lines). The drug response information for 48 FDA-approved drugs in these cell lines was used as class labels. Since there was no independent validation dataset, cross-validation was used to evaluate our models for each drug. For all three tasks, z-score normalization was applied to the datasets. The details of the datasets are shown in Table 1.

Table 1. Details of all the datasets used in this work.

Dataset name              Number of samples            Sample ratio (positive versus negative)   Application
LUCAS                     287                          1:1.22                                    Early detection
LUCAS validation cohort   431                          1:8.37                                    Early detection
GSE2034                   276                          1:1.97                                    Cancer prognosis
GSE7390                   190                          1:4.28                                    Cancer prognosis
GSE11121                  182                          1:5.50                                    Cancer prognosis
GSE12093                  136                          1:10.33                                   Cancer prognosis
GSE68379                  949 cell lines               /                                         Drug sensitivity prediction
Expression data           949 cell lines               /                                         Drug sensitivity prediction
S5C                       48 drugs on 949 cell lines   1:7.65                                    Drug sensitivity prediction

2.2 Deep centroid classifier
Deep Centroid Classifier is a deep cascaded ensemble classifier that utilizes the centroid classifier (Supplementary Note S1) as the base classifier. It takes heterogeneous omics data from one or multiple layers of samples as input and outputs binary class labels or scores for the samples. The model consists of three stages: feature scanning, deep cascading, and majority voting.

2.2.1 Feature scanning
In the field of image processing, local information in images is beneficial for subsequent recognition tasks. Therefore, convolutional neural networks utilize convolutional kernels to extract meaningful local information for further analysis. However, in omics data, gene features are usually arranged in the order of names or IDs; as a result, adjacent features in the data often lack biological correlation. In this paper, we use many random scans to obtain feature combinations involved in certain biological functions (such as biological processes). The number of random scans and the size of each random feature set can be adjusted based on the number of candidate features (see Supplementary Fig. S1 for the scan methods). Our toolkit also provides parameter structures for user selection. In addition, we offer known functional gene sets (GO term biological processes and KEGG pathways from MSigDB) as candidate feature sets, which can be selected by the users through parameters. The model also provides a known feature set interface, through which users can import their own feature sets to supply prior knowledge for feature scanning.

2.2.2 Cascading learning
Deep Centroid utilizes a multi-layer cascade structure to construct the model, where multiple centroid classifiers are trained separately in each layer. The output results, along with the scanned initial data, are then used as input for the subsequent layer until the model achieves convergence. In addition, the model has a pruning function, which deletes classifiers with low scores to improve prediction accuracy. The scanned initial features are sets obtained using the method described in the feature scanning stage. Each set is trained with a machine learning model and scored to generate a new feature. Within the package, users have the flexibility to choose different base classifiers, such as centroid classifier (default), Support Vector Machine (SVM), Random Forest (RF), and more. By incorporating spliced data, the training dataset can be supplemented to avoid insufficient training

features or highly concentrated data features that may result in rapid model convergence. This approach enhances the reusability of the original data and improves the model’s fitting ability. The details of the cascading learning strategy are as follows:

1) In the training set, a re-sampling strategy (sampling with replacement) is used to ensure that the samples involved in training the classifiers have relatively balanced categories. The sampling ratio of the two categories is defined by the sampling coefficient (adjustable, default: 0.65). The samples that are not included in the sampling process are used as the validation set to evaluate the performance of the model.
2) The samples used for training the model are divided into two classes by the centroid classifier based on their labels. For each class, the centroid vectors are computed for each feature set, resulting in n sets of centroid vectors. Each set contains both positive and negative centroid vectors.
3) The validation set is used to evaluate the classification performance, and classifiers with MCC ≤ Threshold (adjustable, default: 0) are removed. The sample score is obtained by calculating the distance between the sample and the centroid vector.
4) If the classification performance of the current layer improves compared to the previous layer, training continues to the next layer. The n centroid distances output by the n centroid classifiers are used as new data features, which are combined with the scanned features from the original data (or the other types of omics data) and serve as input for the next layer.
5) Training is stopped if the classification performance of the current layer no longer improves.

2.2.3 Majority voting
When the classification performance of the model no longer improves, the model ceases to cascade further and obtains the final prediction for each sample by majority voting among all classifiers in the last layer. Depending on user-defined parameters, in addition to the predicted class labels for the samples, the model can also provide the probability values indicating the likelihood of the samples belonging to the positive class.

2.2.4 Data fusion strategy
Our model is designed to handle both single-level omics data and multi-omics data. For single omics data, this data serves as the input to the data input layer, and after undergoing random feature scanning, it is used as the input to the Nearest Centroid Classifier (NCC). Simultaneously, to prevent rapid convergence during cascading, the omics data, after feature scanning, is used as new information for each cascaded layer. For multi-omics data, multiple omics data are concatenated and used together as the input for the first layer and subsequent deep cascaded layers.

2.3 Evaluations
To evaluate the performance of Deep Centroid (DC), we compared it with six classical classifier models: Random Forest (RF) (Ho 1995), Support Vector Machine (SVM) (Cortes and Vapnik 1995), Deep Forest (DF) (Zhou and Feng 2017), eXtreme Gradient Boosting (XGBoost) (Chen and Guestrin 2016), Nearest Centroid Classifier (NCC) (Hart et al. 2000, Blagus and Lusa 2013), and Deep Neural Network (DNN) (Liu et al. 2021). For early cancer diagnosis and cancer prognosis analysis, we performed 10 repetitions of 5-fold cross-validation on one dataset and conducted independent validation on another dataset (for cancer prognosis, the merged set of three different datasets was used), as detailed in Table 1. For drug sensitivity prediction, due to the lack of an independent dataset, we presented the average results of 10 repetitions of 5-fold cross-validation for 48 clinically approved drugs. As many models, including our Deep Centroid classifier, used random seeds during training, the independent validation results presented are the averages of multiple independent validations. For more comprehensive information regarding the specific experimental parameters used for each model, please refer to Supplementary Note S2.

We evaluated the performance of each classifier using metrics such as Matthews Correlation Coefficient (MCC), Area under the curve (AUC), Accuracy, and F1-score. Since omics data often face the issue of class imbalance, this study primarily utilized MCC, a metric that effectively evaluates classification performance on imbalanced datasets, as the main evaluation criterion. For the details of all the indices, please refer to Supplementary Note S3.

2.4 Ablation experiments
Our model innovates in feature scanning and base classifier selection compared to Deep Forest. To confirm the impact of these enhancements on performance, we performed two ablation experiments.

2.4.1 Feature scanning ablation experiment
In the feature scanning ablation experiment, we compared two strategies for feature selection: random scanning and sliding window scanning. For random scanning, we randomly selected feature sets of varying sizes (from 10 to 200 features) from the original feature set. The features within each set were chosen randomly. On the other hand, sliding window scanning involved selecting contiguous sets of neighboring features from the original features. In our ablation implementation, the size of the sliding window feature sets was fixed at 100, corresponding to the median size of the random sets. In addition, the step size for sliding the window was set to X (defined as the total number of features divided by the number of feature sets). Both random scanning and sliding window scanning resulted in a total of 500 feature sets.

2.4.2 Base classifier ablation experiment
In Deep Centroid, the nearest centroid classifiers were used as base classifiers. In the ablation experiments, we compared this strategy with two others: random forest (denoted as Integrated RF) and multiple models (denoted as Integrated MM). For Integrated RF, all NCC were replaced with RF. For Integrated MM, we kept the centroid classifiers as the first layer of the model. In each layer after the first layer, we used five types of classifiers (NCC, RF, support vector machines (SVM), XGBoost, and deep neural networks (DNN)) as the base classifiers. The parameters of all classifiers were consistent with the previous description.

3 Results
In the results section, we first provide a brief overview of the method’s framework. Subsequently, we conducted a

comprehensive assessment of Deep Centroid in three critical domains: early cancer diagnosis, cancer prognosis, and drug sensitivity prediction. This included a comparison of its classification performance with six typical, general-purpose classifiers. We also performed functional interpretation of the features identified by the DC model to demonstrate the biological interpretability of our model. Finally, we carried out ablation experiments on the two key innovations of Deep Centroid, namely using centroid classifiers as base classifiers and implementing random feature scanning, to clarify the significance of these modifications in Deep Centroid.

3.1 The main framework of deep centroid
Considering the robustness of the centroid classifier and the strong fitting ability of the deep cascading strategy, we introduce a novel deep cascading classifier, the Deep Centroid classifier (Fig. 1). Deep Centroid consists of three stages: feature scanning, deep cascading, and result prediction. In the feature scanning stage, because the feature arrangement in biomedical data has no actual biological significance, local scanning cannot extract meaningful local features as it does for image data. Therefore, we use a random scanning method (by default, the strategy used in this manuscript) and an optional feature scanning strategy that utilizes biological prior knowledge (biological functional gene sets) to extract features from high-dimensional and heterogeneous input data. In the deep cascade stage, each layer uses several centroid classifiers as base classifiers. The first-layer centroid classifiers receive features extracted by the feature scanning stage as input; after the centroid classifier calculations, the results are used as input for the second-layer centroid classifiers. Subsequent layers receive inputs from the centroid classifiers of the previous layer and provide input to the next layer. Meanwhile, to prevent the deep cascade process from converging too quickly, each cascade layer also receives features scanned during the feature scanning stage to provide new information to the cascade layers. In the result prediction stage, we use majority voting to calculate the score and corresponding label for each sample. Please refer to the Methods section for detailed steps of the Deep Centroid classifier.

3.2 Application of deep centroid to cancer early detection
Early cancer diagnosis can significantly reduce the mortality rate among cancer patients, and cell-free DNA fragmentomics data has proven to be an effective, minimally invasive biomarker for early cancer diagnosis (Mathios et al. 2021). We obtained whole-genome cfDNA sequencing data from lung cancer patients and controls from the study by Mathios et al. (LUCAS dataset and LUCAS validation dataset, Table 1) and calculated the number of fragments near transcription start sites as features (Ulz et al. 2016). Following the same strategy as the original paper (Mathios et al. 2021), we used the LUCAS dataset for cross-validation and model construction, with the LUCAS validation dataset serving as an independent test set. In a comparative analysis with six other methods (Fig. 2a and b, Supplementary Tables S1 and S2), our method outperformed the others in both cross-validation and independent validation. Apart from our method, the Nearest Centroid classifier and SVM also demonstrated good performance, indicating that in such high-dimensional, low-sample datasets, simpler models may yield more stable classification abilities. In addition, our deep cascading strategy indeed improved the fitting ability of the Nearest Centroid classifier.

To evaluate whether our feature scanning strategy identified important genes for cancer diagnosis, we selected the important features scanned by our method (genes included in base classifiers with high classification performance, detailed methods in Supplementary Note S4) and performed an enrichment analysis on these features (Supplementary Note S5). All enrichment results are provided in Supplementary Table S3, with key results presented in Fig. 2c. The results indicate that the features scanned by our strategy were primarily enriched in cancer-related pathways (such as Pathways in Cancer) and in functional gene sets related to cell differentiation, apoptosis, and cell adhesion. In addition, some features were enriched in immune cell or other leukocyte-related pathways. cfDNA in plasma is known to originate mainly from white blood cells (Bryzgunova et al. 2021), and these annotation results validate that our feature scanning strategy can indeed identify features of significant biological relevance, further affirming the reliability of our model.

3.3 Application of deep centroid to cancer prognosis
Cancer prognosis plays a guiding role in the treatment of cancer patients (Wang et al. 2005). We downloaded the transcriptome data and corresponding prognosis information of four breast cancer datasets from NCBI GEO, and selected the dataset with the largest sample size, GSE2034 (Wang et al. 2005), to build the model and perform cross-validation, while using the other three datasets as independent datasets.

Figure 1. Schematic diagram of the model structure of Deep Centroid. In the feature scanning stage, using a random scanning strategy, the model divides heterogeneous data into multiple feature sets, with each feature set corresponding to one nearest centroid classifier. After each layer of the model in the cascade learning stage is trained, the output results are integrated with the optimized features as new features and continue to be used for the next layer of training. When the model converges, the model stops training and uses majority voting to obtain the predicted result.
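
The structure in Figure 1 and the cascade steps of Section 2.2.2 can be illustrated with a minimal NumPy sketch. This is not the authors' package or its API: all function and parameter names below are hypothetical, and several details that the text leaves open are filled with assumptions. Specifically, the sampling coefficient is read as the fraction of the minority class drawn per class, the sample score is taken as the distance to the negative centroid minus the distance to the positive centroid, layer performance is the mean out-of-bag MCC of the kept classifiers, and every classifier after the first layer sees all previous-layer scores plus one fresh random feature set.

```python
import numpy as np

def mcc(y_true, y_pred):
    """Matthews correlation coefficient for 0/1 labels (0.0 if undefined)."""
    tp = np.sum((y_true == 1) & (y_pred == 1))
    tn = np.sum((y_true == 0) & (y_pred == 0))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    denom = np.sqrt(float((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)))
    return 0.0 if denom == 0 else float(tp * tn - fp * fn) / denom

def centroid_score(X, pos_c, neg_c):
    """Assumed score: positive when a sample is closer to the positive centroid."""
    return np.linalg.norm(X - neg_c, axis=1) - np.linalg.norm(X - pos_c, axis=1)

def balanced_bootstrap(y, coef, rng):
    """Step 1: draw the same number of samples per class, with replacement."""
    k = max(1, int(coef * min(np.sum(y == 0), np.sum(y == 1))))
    idx = np.concatenate([rng.choice(np.flatnonzero(y == c), size=k, replace=True)
                          for c in (0, 1)])
    oob = np.setdiff1d(np.arange(len(y)), idx)  # never drawn: validation set
    return idx, oob

def fit_layer(A, y, feature_sets, coef, threshold, rng):
    """Steps 2-3: one centroid classifier per feature set, pruned by OOB MCC."""
    kept, scores = [], []
    for cols in feature_sets:
        idx, oob = balanced_bootstrap(y, coef, rng)
        Xs, ys = A[idx][:, cols], y[idx]
        pos_c, neg_c = Xs[ys == 1].mean(axis=0), Xs[ys == 0].mean(axis=0)
        pred = (centroid_score(A[oob][:, cols], pos_c, neg_c) > 0).astype(int)
        m = mcc(y[oob], pred)
        if m > threshold:  # classifiers with MCC <= threshold are removed
            kept.append((cols, pos_c, neg_c))
            scores.append(m)
    return kept, (float(np.mean(scores)) if scores else -np.inf)

def layer_scores(A, layer):
    """One new feature (centroid score) per kept classifier."""
    return np.column_stack([centroid_score(A[:, c], p, n) for c, p, n in layer])

def fit_cascade(X, y, n_sets=50, set_size=20, coef=0.65, threshold=0.0,
                max_layers=5, seed=0):
    """Steps 4-5: add layers while the layer performance keeps improving."""
    rng = np.random.default_rng(seed)
    d, layers, best, A = X.shape[1], [], -np.inf, X
    for _ in range(max_layers):
        prev = np.arange(d, A.shape[1])  # columns holding previous-layer scores
        sets = [np.concatenate([rng.choice(d, size=min(set_size, d), replace=False),
                                prev]) for _ in range(n_sets)]
        layer, perf = fit_layer(A, y, sets, coef, threshold, rng)
        if not layer or perf <= best:
            break
        layers.append(layer)
        best = perf
        A = np.hstack([X, layer_scores(A, layer)])  # original data + new features
    if not layers:
        raise ValueError("all first-layer classifiers were pruned")
    return layers

def predict(layers, X):
    """Majority vote over the classifiers of the last layer."""
    A = X
    for layer in layers:
        S = layer_scores(A, layer)
        A = np.hstack([X, S])
    return (np.mean(S > 0, axis=1) >= 0.5).astype(int)
```

Calling `fit_cascade(X_train, y_train)` on a z-scored matrix with 0/1 labels returns the list of trained layers, and `predict(layers, X_test)` returns the voted labels. The sketch keeps only the default centroid base classifier; the prior-knowledge gene sets, alternative base classifiers, and multi-omics fusion described in the text are omitted.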
Figure 2. Performance of Deep Centroid in cancer early detection. (a) Classification performance of Deep Centroid in cross-validation. (b) Classification
performance of Deep Centroid in independent validation. (c) Functional annotation results of important features.
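
Section 2.3 names MCC as the main evaluation criterion for these comparisons. For reference, the standard definition in terms of the binary confusion matrix is given below; the counts in the worked example are hypothetical and are not taken from the paper.

```latex
\mathrm{MCC} =
  \frac{TP \cdot TN - FP \cdot FN}
       {\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}

% Hypothetical example: TP = 40, TN = 45, FP = 5, FN = 10
\mathrm{MCC} = \frac{40 \cdot 45 - 5 \cdot 10}{\sqrt{45 \cdot 50 \cdot 50 \cdot 55}}
             = \frac{1750}{\sqrt{6\,187\,500}} \approx 0.70
```

MCC ranges from −1 to 1 and uses all four cells of the confusion matrix, which is why it remains informative when one class is much larger than the other, as in several datasets of Table 1.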

As shown in Fig. 3a and b (see Supplementary Tables S4 and S5 for detailed results), our model performs better in both cross-validation and independent validation, followed by Centroid classifier, SVM, and DNN.

We also performed functional annotation on the important features identified from the breast cancer prognosis dataset, as shown in Fig. 3c and Supplementary Table S6. The enriched functional gene sets include Pathways in Cancer, cell differentiation, cell adhesion, immune response, and other cancer-related functions. Notably, the annotation results also include hormone-mediated signaling pathway and mammary gland morphogenesis, indicating that our feature scanning strategy can identify genes highly relevant to breast cancer development and prognosis.

3.4 Application of deep centroid to drug sensitivity prediction
We used 949 cell lines from the CCLE dataset that have gene expression data, DNA methylation data, and information on whether they are sensitive to 48 FDA-approved drugs to test the performance of the Deep Centroid classifier in predicting drug sensitivity. We built a model for each drug to predict whether each cell line is sensitive or resistant to it, using cross-validation to evaluate our models. In each model, the cell lines’ gene expression data were used as input features for the Deep Centroid classifier, while the cell lines’ DNA methylation scan results were used as additional features for each subsequent cascade layer to avoid premature convergence. The overall results of the 48 models are shown in Fig. 4a (Supplementary Table S7). The results show that the Deep Centroid Classifier performed best, followed by NCC, SVM, and DCC. In addition, we conducted a case study of the well-known anticancer drug Tamoxifen and found that its results were similar to the overall results, with our method still performing best (Fig. 4b).

Finally, we conducted functional annotation on the features identified by the Tamoxifen model, which included both gene expression and DNA methylation data. We analyzed the important features identified in gene expression data and DNA methylation data separately (Fig. 4c and Supplementary Table S8 for gene expression data, Fig. 4d and Supplementary Table S9 for DNA methylation data). The results show that these features are enriched not only in pathways in cancer, cell adhesion, DNA replication, cell differentiation, cell cycle, DNA damage repair, and PD1 (PDL1) related biological processes but also in breast cancer-specific functions such as regulation of hormone level and ‘breast cancer’. This is consistent with the fact that Tamoxifen is a drug used for breast cancer treatment (Osborne 1998). In addition, these features are also enriched in biological functions related to drug transport and metabolism, such as regulation of transferase activity. All these results suggest that our model not only has stable classification performance but also has biologically interpretable results.

3.5 Ablation experiments
We conducted ablation studies to assess the impact of two key components of our model: the feature scanning strategy and the selection of the NCC as the base classifier. In terms of the feature scanning strategy, we compared our approach (random scanning) with the method used by Deep Forest (sliding window scanning). Regarding the base classifiers, we compared our approach with two different methods: Random Forest as the base classifier (denoted as Integrated RF) and
Figure 3. Performance of Deep Centroid in cancer prognosis. (a) Classification performance of Deep Centroid in cross-validation. (b) Classification
performance of Deep Centroid in independent validation. (c) Functional annotation results of important features.

several models as base classifiers (denoted as Integrated MM, Method).
The results of the ablation experiments in terms of cross-validation and independent validation for cancer early detection and cancer prognosis, as well as cross-validation for drug sensitivity prediction, are presented in Fig. 5. These results show that both contributions made by Deep Centroid significantly enhance the model's performance (see Supplementary Tables S10–S15 for detailed information).

4 Conclusion and discussion
With the advancement of sequencing technologies, an increasing number of omics resources (Kaushik et al. 2020a,b) and methods (Kaushik et al. 2020a,b; Zhao et al. 2023) have been developed to address biomedical challenges. Sample classification based on biological omics data is a common task in biomedical research. Although classifier models have achieved great success in other fields, there is still a lack of stable and reliable general classifier models for sample classification of biological omics data, owing to the high feature dimension, the low sample size, and the insufficient reproducibility of data from different sources. Leveraging the stability of the nearest centroid classifier and the strong fitting ability of the deep cascading strategy, we propose a novel deep cascading ensemble model. In this model, during the feature scanning stage, we use a random scanning strategy (while also providing a biological prior knowledge-based scanning strategy for user selection) to extract biologically meaningful feature sets. In the deep cascading stage, we use nearest centroid classifiers as base classifiers. In the final prediction stage, we apply a majority voting strategy. This model has achieved better results than mainstream classification models in three typical applications of precision medicine: early cancer diagnosis (cfDNA fragmentomics data), cancer prognosis (gene transcriptome data), and drug sensitivity prediction (gene transcriptome data and DNA methylation data). In addition, the model's feature scanning stage can identify biologically meaningful essential features.
We found that Deep Centroid had the best performance in the MCC evaluation metric for predicting class labels, and performed well, though not always best, in AUC. This may be due to the primary advantage of nearest centroid classifiers, which are simple, resistant to overfitting, and exhibit stable predictive performance. This characteristic allows our model to predict sample class labels quite accurately across different datasets, consistently resulting in a high MCC. However, the drawback of NCC is its limited fitting ability. Although our deep cascaded strategy enhances the model's fitting ability (our model consistently outperforms a simple NCC), it is constrained by NCC's fitting ability, resulting in our model showing the best performance in MCC but not
Figure 4. Performance of Deep Centroid in drug sensitivity prediction. (a) Classification performance of Deep Centroid in all the drugs. (b) Classification
performance of Deep Centroid in Tamoxifen. (c) Functional annotation results of important features scanned in gene expression data. (d) Functional
annotation results of important features scanned in DNA methylation data.
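
The difference between a label-based metric (MCC) and a score-based metric (AUC) that the discussion draws can be illustrated with a small sketch. The function names `mcc`, `auc`, and `labels`, the 0.5 threshold, and the toy scores are illustrative assumptions; none of them come from the Deep Centroid toolkit or from the paper's data.

```python
from math import sqrt

def mcc(y_true, y_pred):
    """Matthews correlation coefficient for binary labels (0/1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Convention: return 0.0 when a marginal count is zero.
    return (tp * tn - fp * fn) / denom if denom else 0.0

def auc(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def labels(scores, threshold=0.5):
    """Turn continuous scores into hard class labels (assumed threshold)."""
    return [1 if s >= threshold else 0 for s in scores]

# Toy data (hypothetical): three positives followed by three negatives.
y_true = [1, 1, 1, 0, 0, 0]
scores_a = [0.9, 0.8, 0.45, 0.55, 0.3, 0.1]   # good ranking, two label errors
scores_b = [0.6, 0.6, 0.6, 0.95, 0.2, 0.2]    # one confident false positive

for name, s in (("A", scores_a), ("B", scores_b)):
    print(name, round(mcc(y_true, labels(s)), 3), round(auc(y_true, s), 3))
```

In this toy case, model B assigns more samples to the correct class (higher MCC) while model A ranks the samples better (higher AUC), so the two metrics can order the same pair of models differently.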

always in AUC. As far as we know, providing a definite class label for a sample (rather than its score for belonging to a certain class) is more useful for practical clinical applications. Therefore, we believe our model holds significant value. Further improving the fitting ability of the NCC-based model will be a focus of our future work, specifically by providing more diverse information in the deep cascading stage to prevent nearest centroid classifiers from being overly stable and unchanging, which leads to faster convergence without further enhancement of the model's fitting ability.
In addition, all three applications of our model involve high-dimensional, low-sample-size data, so the results only show that our method can be well applied to sample classification of typical biological omics data. In other biomedical research applications such as predicting protein structure, predicting genomic variant sites, and predicting open chromatin regions,
Figure 5. Ablation experiment results. Performance comparison between models using random scanning strategy and sliding window scanning strategy
in early cancer diagnosis (a), cancer prognosis (c), and drug sensitivity prediction (e). Performance comparison between models using nearest centroid
classifier as the base classifier and models applying random forest as the base classifier, as well as the model applying multiple classifier models as the
base classifier in early cancer diagnosis (b), cancer prognosis (d), and drug sensitivity prediction (f).
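
The two scanning strategies compared in Fig. 5, together with nearest centroid base classifiers and majority voting, can be sketched as follows. This is a minimal single-layer illustration under stated assumptions: it omits the deep cascade entirely, every function name and parameter is hypothetical, and it is not the authors' toolkit or its API.

```python
import random
from collections import Counter

def random_scan(n_features, n_sets, set_size, seed=0):
    """Random scanning: each feature set is a random sample of indices."""
    rng = random.Random(seed)
    return [sorted(rng.sample(range(n_features), set_size))
            for _ in range(n_sets)]

def sliding_window_scan(n_features, window, stride=1):
    """Sliding-window scanning: each feature set is a run of adjacent indices."""
    return [list(range(start, start + window))
            for start in range(0, n_features - window + 1, stride)]

def fit_centroids(X, y, idx):
    """Nearest centroid 'training': per-class mean of the selected features."""
    counts = Counter(y)
    sums = {}
    for row, label in zip(X, y):
        acc = sums.setdefault(label, [0.0] * len(idx))
        for k, j in enumerate(idx):
            acc[k] += row[j]
    return {c: [v / counts[c] for v in acc] for c, acc in sums.items()}

def predict_ncc(centroids, row, idx):
    """Assign the class whose centroid is closest (squared Euclidean distance)."""
    def dist(c):
        return sum((row[j] - m) ** 2 for j, m in zip(idx, centroids[c]))
    return min(sorted(centroids), key=dist)

def predict_majority(X_train, y_train, feature_sets, row):
    """One nearest centroid classifier per feature set, then majority voting."""
    votes = [predict_ncc(fit_centroids(X_train, y_train, idx), row, idx)
             for idx in feature_sets]
    return Counter(votes).most_common(1)[0][0]

# Toy data (hypothetical): 4 samples, 4 features, 2 classes.
X = [[0, 0, 0, 0], [0, 1, 0, 1], [5, 5, 5, 5], [5, 4, 5, 4]]
y = [0, 0, 1, 1]
sets_random = random_scan(n_features=4, n_sets=3, set_size=2)
sets_window = sliding_window_scan(n_features=4, window=2)
print(predict_majority(X, y, sets_random, [1, 1, 1, 1]))
print(predict_majority(X, y, sets_window, [4, 4, 4, 4]))
```

The only difference between the two strategies in this sketch is how the index sets are drawn: sliding windows assume that adjacent features are related, whereas random sampling makes no assumption about feature order.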

the application value of our model still needs further validation. These issues will be a focus of our future work.
In summary, we have proposed a general ensemble classification model with stable classification capabilities for sample classification of biological omics data in biomedical research. This model not only has good classification performance but also has biological interpretability. In addition, we have created a user-friendly Python toolkit for this model, providing valuable support for biomedical research focused on biological omics data analysis.

Acknowledgements
We thank Drs Robert B. Scharpf, Victor E. Velculescu and their research group in the Johns Hopkins University School of Medicine for their cell-free DNA data.

Supplementary data
Supplementary data are available at Bioinformatics online.

Conflict of interest
None declared.

Funding
This work was supported by the Fundamental Research Funds for the Central Universities to X.Z. [2662023XXPY003].

References
Baek M, Baker D. Deep learning and protein structure modeling. Nat Methods 2022;19:13–4.
Basavegowda HS, Dagnew G. Deep learning approach for microarray cancer data classification. CAAI Trans Intell Technol 2020;5:22–33.
Blagus R, Lusa L. Improved shrunken centroid classifiers for high-dimensional class-imbalanced data. BMC Bioinformatics 2013;14:64.
Bryzgunova O, Konoshenko MY, Laktionov P. Concentration of cell-free DNA in different tumor types. Expert Rev Mol Diagn 2021;21:63–75.
Chen T, Guestrin C. XGBoost: A scalable tree boosting system. In: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 13–17 August, San Francisco, CA, USA: Association for Computing Machinery, 2016, 785–94.
Cheng J, Novati G, Pan J et al. Accurate proteome-wide missense variant effect prediction with AlphaMissense. Science 2023;381:eadg7492.
Chu Y, Kaushik AC, Wang X et al. DTI-CDF: a Cascade deep Forest model towards the prediction of drug–target interactions based on hybrid features. Brief Bioinform 2021;22:451–62.
Cortes C, Vapnik V. Support-vector networks. Mach Learn 1995;20:273–97.
Demsar J, Zupan B. Hands-on training about overfitting. PLoS Comput Biol 2021;17:e1008671.
Desmedt C, Piette F, Loi S et al.; TRANSBIG Consortium. Strong time dependence of the 76-gene prognostic signature for node-negative breast cancer patients in the TRANSBIG multicenter independent validation series. Clin Cancer Res 2007;13:3207–14.
Greener JG, Kandathil SM, Moffat L et al. A guide to machine learning for biologists. Nat Rev Mol Cell Biol 2022;23:40–55.
Hart PE, Stork DG, Duda RO. Pattern Classification. Hoboken: Wiley, 2000.
Ho TK. Random decision forests. In: Proceedings of 3rd International Conference on Document Analysis and Recognition, 14-16 August. Montreal, QC, Canada: IEEE, 1995, 278–82.
Iorio F, Knijnenburg TA, Vis DJ et al. A landscape of pharmacogenomic interactions in cancer. Cell 2016;166:740–54.
Jumper J, Evans R, Pritzel A et al. Highly accurate protein structure prediction with AlphaFold. Nature 2021;596:583–9.
Kaushik AC, Mehmood A, Dai X et al. WeiBI (web-based platform): enriching integrated interaction network with increased coverage and functional proteins from genome-wide experimental OMICS data. Sci Rep 2020a;10:5618.
Kaushik AC, Mehmood A, Upadhyay AK et al. CytoMegaloVirus infection database: a public omics database for systematic and comparable information of CMV. Interdiscip Sci 2020b;12:169–77.
Lu B, Zhou X-H. Ensemble Classifier based on gene synergistic network improves breast cancer outcome prediction. In: 2019 IEEE International Conference on Bioinformatics and Biomedicine (BIBM), 18-21 November. San Diego, CA, USA: IEEE. 2019, 207–10.
Mathios D, Johansen JS, Cristiano S et al. Detection and characterization of lung cancer using cell-free DNA fragmentomes. Nat Commun 2021;12:5060.
Osborne CK. Tamoxifen in the treatment of breast cancer. N Engl J Med 1998;339:1609–18.
Papiez A, Marczyk M, Polanska J et al. BatchI: batch effect identification in high-throughput screening data using a dynamic programming algorithm. Bioinformatics 2019;35:1885–92.
Schmidt M, Böhm D, von Törne C et al. The humoral immune system has a key prognostic impact in node-negative breast cancer. Cancer Res 2008;68:5405–13.
Simon R. Diagnostic and prognostic prediction using gene expression profiles in high-dimensional microarray data. Br J Cancer 2003;89:1599–604.
Su R, Liu X, Wei L et al. Deep-Resp-Forest: a deep Forest model to predict anti-cancer drug response. Methods 2019;166:91–102.
Sun L, Zhang X, Qian Y et al. Feature selection using neighborhood entropy-based uncertainty measures for gene expression data classification. Inf Sci 2019;502:18–41.
Townshend RJL, Eismann S, Watkins AM et al. Geometric deep learning of RNA structure. Science 2021;373:1047–51.
Ulz P, Thallinger GG, Auer M et al. Inferring expressed genes by whole-genome sequencing of plasma DNA. Nat Genet 2016;48:1273–8.
Wang Y, Klijn JGM, Zhang Y et al. Gene-expression profiles to predict distant metastasis of lymph-node-negative primary breast cancer. Lancet 2005;365:671–9.
Wu L, Gao J, Zhang Y et al. A hybrid deep Forest-based method for predicting synergistic drug combinations. Cell Rep Methods 2023;3:100411.
Zhang Y, Sieuwerts AM, McGreevy M et al. The 76-gene signature defines high-risk patients that benefit from adjuvant tamoxifen therapy. Breast Cancer Res Treat 2009;116:303–9.
Zhao J et al. Subtype-DCC: decoupled contrastive clustering method for cancer subtype identification based on multi-omics data. Brief Bioinform 2023;24:bbad025.
Zhou Z-H, Feng J. Deep forest: towards an alternative to deep neural networks. In: International Joint Conferences on Artificial Intelligence, Melbourne, Australia, 19-25 August 2017. 3553–9.
Zhou Z-H, Feng J. Deep forest. Natl Sci Rev 2019;6:74–86.
