Machine Learning Based Study For The Classification of Type 2 Diabetes Mellitus Subtypes

*Correspondence: [email protected]

1 Cinvestav Tamaulipas, Carretera Victoria-Soto la Marina km 5.5, Victoria 87130, Tamaulipas, Mexico
2 CONAHCYT-Centro de Investigación y de Estudios Avanzados del IPN, Unidad Tamaulipas, Carretera Victoria-Soto la Marina km 5.5, Victoria, Tamaulipas 87130, Mexico

Abstract

Purpose: Data-driven diabetes research has shown increasing interest in exploring the heterogeneity of the disease, aiming to support the development of more specific prognoses and treatments within so-called precision medicine. Recently, one of these studies found five diabetes subgroups with varying risks of complications and treatment responses. Here, we tackle the development and assessment of different models for classifying Type 2 Diabetes (T2DM) subtypes through machine learning approaches, with the aim of providing a performance comparison and new insights on the matter.
Methods: We developed a three-stage methodology, starting with the preprocessing of the public databases NHANES (USA) and ENSANUT (Mexico) to construct a dataset with N = 10,077 adult diabetes patient records. We used N = 2,768 records for training/validation of the models and left the remaining N = 7,309 for testing. In the second stage, groups of observations – each one representing a T2DM subtype – were identified. We tested different clustering techniques and strategies and validated them using internal and external clustering indices, obtaining two annotated datasets, Dset A and Dset B. In the third stage, we developed different classification models, evaluating four algorithms, seven input-data schemes, and two validation settings on each annotated dataset. We also tested the obtained models using a majority-vote approach for classifying unseen patient records in the hold-out dataset.

Results: From the independently obtained bootstrap validation for Dset A and Dset B, mean accuracies across all seven data schemes were 85.3% (±9.2%) and 97.1% (±3.4%), respectively. The best accuracies were 98.8% and 98.9%. Results from both validation settings were consistent. For the hold-out dataset, results were consonant with most of those obtained in the literature in terms of class proportions.

Conclusion: The development of machine learning systems for the classification of diabetes subtypes constitutes an important task to support physicians in fast and timely decision-making. We expect to deploy this methodology in a data-analysis platform to conduct studies identifying T2DM subtypes in patient records from hospitals.
Keywords: Diabetes, Diabetes subtypes, Data-driven, Classification
Ordoñez‑Guillen et al. BioData Mining (2023) 16:24 Page 2 of 37
Introduction
Background
Diabetes has usually been broadly categorized into Gestational (GDM), Type 1 (T1DM), and Type 2 (T2DM). GDM occurs during pregnancy and increases the chances of developing T2DM later in life. T1DM usually appears at early ages, when the pancreas stops producing insulin due to an autoimmune response; the reasons why this occurs are still not well understood. It is very important to monitor the glucose levels of these patients, as sudden changes might be life threatening, and patients with this type often need daily doses of insulin to lower their blood glucose levels. T2DM is the most common type of diabetes, encompassing 95% of diabetic patients, who are commonly adults with a sedentary lifestyle and a poor-quality diet. Although it can be easily controlled in its early stages, comorbidities might appear years later. Stages of T2DM are related to parameters such as glucose concentration, insulin sensitivity, insulin secretion, overweight, and aging. However, recent studies have found that not all patients present the same manifestations.
According to the International Diabetes Federation (IDF) [1], diagnostic guidelines for diabetes include two measures obtained from blood tests: the glycated hemoglobin (HbA1C) test and the plasma glucose (PG) test. The latter can be obtained in three different manners: in a fasting state, called Fasting Plasma Glucose (FPG); from an oral glucose tolerance test (OGTT), which consists of administering an oral dose of glucose and measuring PG after two hours; or from a sample taken at a random time (normally carried out when symptoms are present), called Random Plasma Glucose (RPG). A positive diagnosis is reached when any one of the following conditions holds (the IDF recommends two conditions in the absence of symptoms): (1) FPG ≥ 7.0 mmol/L (126 mg/dL), (2) PG after OGTT ≥ 11.1 mmol/L (200 mg/dL), (3) HbA1C ≥ 6.5%, or (4) RPG ≥ 11.1 mmol/L (200 mg/dL).
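The four diagnostic conditions can be encoded directly as a small helper. The sketch below is illustrative (the function name and parameters are ours, not from the study), with glucose values in mg/dL and HbA1C in percent:

```python
def idf_conditions_met(fpg=None, ogtt_2h=None, hba1c=None, rpg=None):
    """Return the list of IDF diagnostic conditions a patient meets.

    Illustrative helper: glucose values in mg/dL, HbA1C in percent;
    None means the measure is unavailable.
    """
    met = []
    if fpg is not None and fpg >= 126:          # (1) FPG >= 7.0 mmol/L
        met.append("FPG")
    if ogtt_2h is not None and ogtt_2h >= 200:  # (2) 2-h OGTT PG >= 11.1 mmol/L
        met.append("OGTT")
    if hba1c is not None and hba1c >= 6.5:      # (3) HbA1C >= 6.5 %
        met.append("HbA1C")
    if rpg is not None and rpg >= 200:          # (4) RPG >= 11.1 mmol/L
        met.append("RPG")
    return met
```

Under the IDF guideline, one condition with symptoms (or two without) suffices for a positive diagnosis.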
These parameters make it possible to readily identify diabetic patients and, when combined with risk factors such as demographics, family history, and diet, may help to predict the tendency to develop the disease or its related complications. Understanding the relation of distinct parameters to the pathology of the disease also helps scientists develop new ways to treat it. In this regard, data-driven analysis provides powerful means to discover such relations.
With the relatively recent advent of big data supporting precision medicine [2], the understanding of diabetes has changed from the classical division into T1DM, T2DM, and other minority subtypes to the notion of a highly heterogeneous disease [3]. The field has directed its efforts towards the analysis of available big data – particularly from electronic health records – in search of refined classification schemes for diabetes [4]. Indeed, recent diabetes research has stressed the importance of the underlying etiological processes associated with the development of important adverse outcomes of the disease, along with response to treatment [5–7]. Exploring this heterogeneity, a recent data-driven unsupervised analysis [8] found that T2DM might have different manifestations, including five subtypes related to varying risks of developing typical diabetes complications such as kidney disease, retinopathy, and neuropathy. Based on the data-driven analysis of Ahlqvist et al. [8], in this paper we tackle the development of methods for classifying T2DM subtypes through machine learning approaches, with the aim of providing a comparison and new insights on the matter.
Compared to previous work, our study introduces the following contributions and main
results:
• The development of classification models for T2DM subgroups. To the best of our knowledge, there is only one preceding study that tackled this issue [9].
• Validation of T2DM subtypes in a relatively large dataset predominantly composed of Mexican and other Hispanic population.
• An evaluation of clustering algorithms and strategies, including indices to measure clustering quality.
• An assessment of the performance of classification models for T2DM subtypes. This assessment included four algorithms, seven data schemes, two datasets, and two validation methods.
• Our models reached accuracies of up to 98.8% and 98.9% on the two datasets. Simpler and faster algorithms such as SVM and MLP performed better. Models adjusted notably better to Dset B, and performance was more consistent across the schemes on this dataset. Both validation settings, bootstrap and 10-fold cross-validation, yielded similar results.
• Finally, the simple majority vote implemented in the testing stage showed a great amount of consensus, providing class proportions akin to those previously reported for other populations.

In the remainder of this Introduction, we briefly review artificial intelligence work related to general diabetes and diabetes subgroup classification.
Related work
Artificial intelligence – and particularly, machine learning – methods have been exten-
sively applied within the biomedical field mainly for development of computational tools
to aid in diagnosis of diabetes or its complications [10]. Data analysis has been applied
in several diabetes studies, covering five different main fields: risk factors, diagnosis,
pathology, progression, and management [11]. A number of studies deal with identifi-
cation of diabetes biomarkers, generally by means of feature selection methods, such
as evaluating filter/wrapper strategies [12], combining feature ranking with regression
models to predict short-term subcutaneous glucose [13], and proposing new methods
for feature extraction [14, 15] and generation [16]. Another subfield of research regard-
ing machine learning applied to diabetes mellitus is devoted to detection/prediction
of complications. With the rise of deep learning within the last decade, much of this
Ahlqvist et al. [8] identified the following subtypes:

1. Severe Autoimmune Diabetes (SAID): probably the same as T1DM, but classified as a subtype in this scheme, in which the pancreas stops producing natural insulin through an autoimmune response. It is identified by the presence of GAD antibodies.
2. Severe Insulin-Deficient Diabetes (SIDD): similar to SAID, but the antibodies responsible for the autoimmune response are missing.
3. Severe Insulin-Resistant Diabetes (SIRD): patients seem to produce a normal amount of insulin, but their bodies do not respond as expected, maintaining high blood sugar levels.
4. Mild Obesity-Related Diabetes (MORD): related to a high body mass index; when moderate, it can be treated with a better diet and exercise.
5. Mild Age-Related Diabetes (MARD): mostly present in elderly patients, corresponding to natural body ageing.
For this subgroup identification, they used a cohort comprising 8,980 patients for the initial clustering; the found centroids were then used to cluster three more cohorts and replicate the results. Importantly, these groups were associated with different disease progression and risks of developing particular complications.

Soon after this pioneering study, a number of works based on the proposed cluster-analysis method emerged to replicate diabetes subgroup assessment in different cohorts (see Table 1). The subject was systematically reviewed in [31]. The ADOPT and RECORD trial databases, with international and multicenter clinical data comprising 4,351 and 4,447 observations, respectively, were analyzed in [32] to investigate glycaemic and renal progression. They found cluster results similar to those reported by Ahlqvist et al., but also that simpler models based on single clinical features were more descriptive for their purposes. In a 5-year follow-up study of a German cohort with 1,105 patients [33], the authors evaluated prevalence
Table 1 Datasets and found proportions of diabetes subgroups reported in the literature. Columns: Reference, Database/study, Origin, N, SAID (%), MARD (%), MORD (%), SIDD (%), SIRD (%)
prevalence of the SIDD class. The latter recruited 1,152 inpatients of a tertiary care hospital. After performing clustering on the data, the proportions were similar for SIDD and SIRD, but in this case MORD assembled the majority of records instead of MARD.

A team of researchers [9] verified the reproducibility of diabetes subgroups by introducing classification models trained as Self-Normalizing Neural Networks (SNNN). They clustered NHANES data to obtain a labeled dataset on which four input-data models were fitted. These models were later used to classify data from four different Mexican cohorts to assess the risk of complications, risk factors of incidence, and treatment response within subgroups. In a subsequent work [39], with the purpose of assessing the prevalence of diabetes subtypes among different ethnic groups in the US population, the research team applied their SNNN models to classify an extended NHANES dataset comprising cycles up to 2018.
A replication and cross-validation study was performed in [38], where the authors used an alternative input-data scheme replacing the HOMA2 values – originally used for clustering – with C-peptide together with high-density lipoprotein cholesterol. Five clusters were produced with the proposed scheme, three of them matching MORD, SIDD, and SIRD well, whereas the combination of the remaining two corresponded well to MARD. Cross-validation among three different cohorts exhibited fair to good cluster correspondence. Pigeyre et al. [40] also replicated the clustering results of the original Swedish cohort using data from an international trial named ORIGIN. In this cohort, they investigated differences in cardiovascular and renal outcomes within the subgroups, as well as the varying effect of glargine insulin therapy compared to standard care for hyperglycemia. Finally, the risk of developing sarcopenia was evaluated in a Japanese cohort previously characterized using cluster analysis [41]. Among diabetes subtypes, SAID and SIDD patients exhibited a higher risk for the onset of this ailment.
Methods
Our interest was to explore different ways of obtaining classification-model variations for assigning T2DM subtypes to patients according to a set of attributes. This required us to characterize T2DM subtypes from existing databases, train the models, and apply them to unseen patient records. The study followed a procedure with three main sequential stages, shown in Fig. 1:
Fig. 1 Overview of the general procedure applied in the study: (1) dataset construction, (2) data characterization, (3) classification model training, (4) classification model testing
1. Dataset construction, where the tasks of acquiring, cleansing, merging, and preprocessing the data are performed to obtain a tidy subset from the databases. This subset is used for training, validating, and testing the clustering and classification models in the subsequent stages.
2. Data characterization, where diabetes patients (instances of the dataset) are segmented, yielding diabetes groups that are labeled according to their feature-distribution patterns.
3. Classification model training, where different classification models are trained and validated using the datasets from the previous characterization; the obtained classification models are then used and evaluated by assigning T2DM subtypes to unseen patient records.

The best classification models were obtained according to different strategies varying correlated attributes. The following subsections describe these stages and steps in more detail.
Dataset construction
The study was performed over real data (NHANES and ENSANUT databases). These
data come from health surveys but was curated in several ways to obtain the better fit-
ting of classification models.
• The National Health and Nutrition Examination Survey (NHANES) database [42] is, as its name suggests, a U.S. national survey performed by the National Center for Health Statistics (NCHS), which in turn is part of the Centers for Disease Control and Prevention (CDC). It gathers information from interviews in which people answer questionnaires covering demographic, nutritional, socioeconomic, and health-related aspects. For some of the participants, physical examination and laboratory information are included. The database is divided into cycles, which after NHANES III (1988 to 1998) are biennial. Several datasets (views) can be obtained from NHANES for a vast number of works, depending on the research interests. The NHANES dataset assembled in the present work consists of the merging of cycle III (1988-1998) with all continuous NHANES cycles from 1999-2000 to 2017-2020. This latter cycle was the 2017-2018 cycle joined to the incomplete "pre-pandemic" cycle from 2019 to March 2020.
• The Encuesta Nacional de Salud y Nutrición (ENSANUT) [43] (in English, the National Health and Nutrition Survey) is the Mexican analogue of the NHANES database. The ENSANUT survey methodology, data gathering, and curation are carried out by the Center for Research on Evaluation and Surveys, which is part of the National Institute of Public Health (Mexican Ministry of Health). The database is the product of a systematic effort to provide a trustworthy resource for assessing the status and tendencies of the population's health, along with the utilization and perception of health services. Starting in 1988 as the National Nutrition Survey, it was not until 2000 that it became a six-year survey (with some special issues) including health information such as anthropometric measures, dietary habits, clinical history, vaccination, common diseases, and laboratory analysis (in some issues). Similarly to NHANES, several views can be obtained focusing on specific attributes. The ENSANUT dataset used here included the 2006, 2016, and 2018 cycles.
The dataset construction comprised the following preprocessing steps:

1. Data cleansing, which consisted in replacing some invalid values with zeroes to represent absent values.
2. An imputation process to assign values to missing but needed variable inputs in records that would otherwise be dismissed. When handling survey data, it is very likely that some values are missing for various reasons: participants may not have answered some questions, or laboratory samples may not have been analysed. We imputed missing values using a Multivariate Feature Imputation procedure, which infers absent values from the values available in other attributes. The considered variables were Weight, Height, Waist, HbA1c, Glucose1, Glucose2, Insulin, and Age at Diabetes Onset, taking the median value returned by four regression techniques (see Appendix B for details).
3. A selection step to keep only those records that met the inclusion criteria: (a) being a diagnosed patient, (b) having OGTT glucose ≥ 200 mg/dL, or (c) having HbA1C ≥ 6.5%. Extreme values, i.e. values more than five standard deviations away from their attribute mean, were removed.
4. Scaling. Due to variations in the ranges of values of the selected attributes, computations are generally biased; thus, scaling is required. We transformed the selected attributes by means of min-max normalization and z-score standardization.
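The imputation, outlier-removal, and scaling steps above can be sketched with scikit-learn. This is a simplified illustration: the study takes the median of four regression techniques for imputation (Appendix B), whereas the sketch below uses a single IterativeImputer, and the function name is ours:

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.preprocessing import MinMaxScaler, StandardScaler

def preprocess(X):
    """Impute missing values, drop >5-SD outliers, and scale.

    X: 2-D float array with np.nan marking absent entries.
    Returns min-max-normalized and z-score-standardized copies,
    the two scalings used by the different clustering strategies.
    """
    # Multivariate feature imputation: each feature is regressed
    # on the others to infer absent values
    X = IterativeImputer(random_state=0).fit_transform(X)
    # Remove records more than five standard deviations from the mean
    z = np.abs((X - X.mean(axis=0)) / X.std(axis=0))
    X = X[(z < 5).all(axis=1)]
    return MinMaxScaler().fit_transform(X), StandardScaler().fit_transform(X)
```

In practice each attribute would be imputed and filtered on the variable set listed in step 2, but the flow is the same.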
As a result of the whole dataset-construction process, a curated dataset combining NHANES and ENSANUT records was obtained. The process is illustrated in the left panel of Fig. 2. The dataset was fully preprocessed according to the requirements of the study and, at this point, was ready for use in the data-analysis algorithms. The final dataset comprised a total of 10,077 patient records, split into a training/validation dataset termed D1 (N = 2,768) and a hold-out dataset termed Test Dset (N = 7,309). D1 consisted of the records including values for the C-peptide variable, whereas Test Dset did not include these values.
Data characterization
The objective of this stage was to characterize the selected instances in the curated dataset. The overall flow is depicted in Fig. 2 (central panel). Since this dataset was not labeled with any group or T2DM subtype, we applied clustering algorithms over selected attributes with the purpose of finding groups of instances in the dataset according to similarities in the attribute values. In a preliminary analysis, we explored three algorithms with different clustering approaches: partitional (K-means [44]), hierarchical (agglomerative clustering [45]), and density-based (DBSCAN [46]). Since these preliminary results (not included in this paper) showed meaningful dissimilarities between the DBSCAN and agglomerative clusters and those obtained with K-means, we decided to focus on the latter.
Thus, we applied K-means to group T2DM patients into clusters, relying on the principle that similar patients in a cluster denote a T2DM subtype. We used a fixed number of groups (K = 4), corresponding to the previously found diabetes subtypes [8] with the exception of SAID; we did not take this class into account, considering all patients to be GADA negative. The five clinical features previously reported in the literature [8, 30] were taken into account: Age at Diabetes Onset (ADO), Body Mass Index (BMI), Glycated Haemoglobin (HbA1C), and the Homeostasis Model Assessment 2 [47] estimates of beta-cell function and insulin resistance (HOMA2-%β and HOMA2-IR, respectively). HOMA2 values are obtained by computationally solving a system of empirical differential equations with software provided by the authors [48]. There are two types of HOMA2 values: one derived from FPG plus C-peptide, and the other derived from FPG plus insulin. We used both types of HOMA2 values, as will be explained later; hereafter, we refer to them as CP-HOMA2 and IN-HOMA2, respectively.
As mentioned earlier, dataset D1 included only those records with C-peptide values (N = 2,768), and thus CP-HOMA2 measures can be computed for these records. Dataset D2 (N = 680), in turn, consists of the subset of D1 that only includes patients with less than five years since diabetes onset (i.e. AGE − ADO < 5). We carried out a two-stage clustering: first on D2, and then, in the second stage, using the obtained centroids to cluster the remaining instances of D1, those with five or more years since diabetes onset (i.e. the difference set D1 − D2). In total, we tested four clustering strategies in the first stage (numbered 1.1 to 1.4) and six in the second stage (numbered 2.1 to 2.6). In both stages we aimed to contrast two overall clustering alternatives: (1) centroid initialization versus de novo clustering; and (2) taking each gender separately versus both genders at once. In the first stage, we also tested the alternative of only assigning instances to initial centroids (i.e. no iteration) versus assigning and iterating until centroid convergence. Strategies 1.1 to 1.4 are thus defined as follows:
For strategies 1.1 and 1.2 we took the centroids reported by Ahlqvist et al. [8]. These centroids are defined per gender; therefore, centroid assignment is performed in this manner in 1.1 and 1.2. De novo strategies 1.3 and 1.4 used a repeated K-means procedure, consisting of several (51) executions of K-means. This procedure yielded a string with 51 positions per instance, where each position holds one of {0, 1, 2, 3} (the four groups); hence, each string corresponds to a group-assignment pattern for that instance. Similarity among strings was then compared to constitute the final four groups. In this way, two identical strings mean that those instances were assigned the same groups across the 51 executions; strings that were not identical were grouped with their most similar instances. In all these executions, we used the scikit-learn K-means function with K = 4, 100 randomized centroid initializations (with the k-means++ method), and 300 maximum iterations.
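A minimal sketch of this repeated K-means procedure, under our reading of the description above (rare patterns are merged into the nearest of the k most frequent patterns by Hamming distance; the study's exact merging rule may differ):

```python
import numpy as np
from collections import Counter
from sklearn.cluster import KMeans

def repeated_kmeans(X, k=4, runs=51, seed=0):
    """Group instances by their label strings over `runs` K-means executions.

    Each instance gets a string of `runs` cluster labels; identical strings
    form groups, and infrequent patterns are merged into the most similar
    (Hamming-nearest) of the k most frequent patterns. Illustrative sketch.
    """
    rng = np.random.RandomState(seed)
    labels = np.stack([
        KMeans(n_clusters=k, init="k-means++", n_init=100, max_iter=300,
               random_state=rng.randint(2 ** 16)).fit_predict(X)
        for _ in range(runs)
    ], axis=1)  # shape: (n_samples, runs)
    patterns = ["".join(map(str, row)) for row in labels]
    # The k most frequent patterns anchor the final groups
    anchors = [p for p, _ in Counter(patterns).most_common(k)]
    def nearest(p):
        return min(range(len(anchors)),
                   key=lambda i: sum(a != b for a, b in zip(p, anchors[i])))
    return np.array([nearest(p) for p in patterns])
```

On well-separated data, every run recovers the same partition (up to label permutation), so instances of one cluster share an identical pattern and the final groups coincide with the clusters.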
After analyzing the results of the first-stage strategies, we selected strategies 1.2 and 1.4 according to intrinsic and extrinsic clustering validation indices (Appendix A). We then moved on to the second-stage clustering, computing centroids from strategies 1.2 and 1.4: for both genders, denoted C1.2 and C1.4; and separated by gender, (W)omen and (M)en, denoted C1.2(W), C1.2(M), C1.4(W), and C1.4(M). In the second stage, we also carried out de novo clusterings with the repeated K-means procedure. Here we included two forms of de novo clustering: in addition to using CP-HOMA2 parameters, we also tested a clustering using IN-HOMA2 parameters and scaling the data with min-max normalization instead of z-score. Importantly, this latter strategy was the only one that implemented these changes. In this manner, the six strategies in the second stage were:
Strategies 2.1 to 2.4 used the centroids found in the first stage for dataset D2 and thus only cluster the remaining instances in D1. Strategies 2.5 and 2.6 cluster the whole dataset D1 without taking into account previous first-stage results. Again, we evaluated the results by means of intrinsic and extrinsic validation indices, selecting strategies 2.5 and 2.6 as the best-performing ones. At the end of the second-stage clustering, we obtained two labeled datasets from D1, named Dset A and Dset B, from the groups obtained with clusterings 2.5 and 2.6, respectively. The matching of groups with T2DM subtype labels was performed by comparing the obtained attribute-distribution patterns against those reported in the literature [8, 9, 30], as will be further explained in the Results section.
Classification schemes
We explored how classification algorithms behave when fed with different input data. The seven classification schemes, denoted S1 to S7, are the following:

Note that all schemes include ADO and BMI and, with the exception of scheme S3, all also include HbA1C. The attributes interchanged among schemes are those related to pancreatic beta-cell function and insulin resistance (i.e. HOMA measures and their related input variables: glucose and C-peptide/insulin). Notice that schemes S1 and S2 consist of the same attributes on which Dset A and Dset B were respectively clustered. Scheme S3 is the same as S2 with HbA1C replaced by FPG. Schemes S4 and S5 substitute the HOMA2 measures in schemes S1 and S2 with their respective input attributes. Scheme S6 makes use of an earlier HOMA model [49] that uses simple formulas to approximate beta-cell function and insulin resistance. Finally, scheme S7 applies the Metabolic Scores for Insulin Resistance (METS-IR) [50] and Visceral Fat (METS-VF) [51], which are proposed measures of insulin resistance and intra-abdominal fat content, respectively. Schemes S1, S2, S3, and S7 were implemented elsewhere [9]; here we added schemes S4, S5, and S6.
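For reference, the closed-form HOMA1 approximations and METS-IR are commonly cited in the forms sketched below. This is a sketch from the published formulas, not code from the study, and units are noted in the comments:

```python
import math

def homa1(glucose_mmol, insulin_uU_ml):
    """Commonly cited HOMA1 closed-form estimates.

    glucose in mmol/L, insulin in microU/mL.
    Returns (insulin resistance, beta-cell function %).
    """
    ir = glucose_mmol * insulin_uU_ml / 22.5       # HOMA1-IR
    beta = 20.0 * insulin_uU_ml / (glucose_mmol - 3.5)  # HOMA1-%B
    return ir, beta

def mets_ir(fpg_mgdl, tg_mgdl, bmi, hdl_mgdl):
    """METS-IR as commonly published: glucose, triglycerides and HDL-C
    in mg/dL, BMI in kg/m^2."""
    return math.log(2 * fpg_mgdl + tg_mgdl) * bmi / math.log(hdl_mgdl)
```

Scheme S6 would feed the HOMA1 estimates in place of the HOMA2 software outputs, and scheme S7 would use METS-IR (together with METS-VF) in place of HOMA measures.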
Final evaluation
After the validation stage, we saved the trained models of the best-performing algorithms, in terms of accuracy, for each of the seven classification schemes. These were obtained from the bootstrapping procedure and thus achieved the best accuracy among 1,000 runs in each case. Since the hold-out dataset (N = 7,309) did not contain C-peptide values, we classified it with the five trained models from schemes S2, S3, S5, S6, and S7, which do not use this attribute. To obtain a final classification we applied a majority-vote approach, breaking ties (i.e. two pairs of schemes each voting for a different class) by selecting the option of the model that achieved the highest accuracy during validation.
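This vote with tie-breaking can be sketched as follows (scheme names and the dictionary layout are illustrative, not from the study):

```python
from collections import Counter

def majority_vote(predictions, validation_acc):
    """Final class by majority vote over per-scheme predictions.

    predictions: scheme name -> predicted class for one record.
    validation_acc: scheme name -> validation accuracy, used to break
    ties in favor of the best-validated model among the tied classes.
    """
    counts = Counter(predictions.values())
    top = max(counts.values())
    tied = [c for c, n in counts.items() if n == top]
    if len(tied) == 1:
        return tied[0]
    # Tie: pick the class predicted by the most accurate tied scheme
    best = max((s for s in predictions if predictions[s] in tied),
               key=lambda s: validation_acc[s])
    return predictions[best]
```

With five voters a 2-2-1 split is the tie case described above; the scheme with the highest validation accuracy among the four tied voters decides.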
Results
This section describes the results corresponding to the data characterization following the different clustering strategies previously defined, the classification models obtained from validation on Dset A and Dset B using bootstrapping and cross-validation, and the final classification on the test dataset.

Data characterization

For the first-stage clustering, Table 2 shows the number of patients in each group and the intrinsic validation values of the four clustering strategies applied to dataset D2. Overall, strategies 1.1, 1.2, and 1.4 obtained comparable scores and fairly similar distributions of patients among the groups, while clustering 1.3 produced considerably lower values on the validation indices. As might be intuitively expected, allowing K-means to iterate until convergence after assigning initial centroids performed slightly better than the assignment-only counterpart. In terms of the validation values obtained, performing a repeated K-means clustering without initial centroids and without gender separation outperformed the rest of the strategies.
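The three intrinsic indices are available in scikit-learn. The sketch below runs them on synthetic stand-in data (the real D2 attributes are ADO, BMI, HbA1C, HOMA2-B, and HOMA2-IR); the data and variable names are illustrative:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import (silhouette_score, davies_bouldin_score,
                             calinski_harabasz_score)

# Synthetic stand-in for D2: 680 records, 5 features, 4 groups
X, _ = make_blobs(n_samples=680, centers=4, n_features=5, random_state=0)
labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X)

sil = silhouette_score(X, labels)        # SIL: higher is better, in [-1, 1]
db = davies_bouldin_score(X, labels)     # DB: lower is better, >= 0
ch = calinski_harabasz_score(X, labels)  # CH: higher is better
```

Note the opposite orientations: a good clustering has high SIL and CH but low DB, which is why Table 2 marks the best value per metric rather than per row.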
The comparison among the first-stage clustering strategies is provided in Table 3. The similarities among clusterings provide further means to evaluate them. The best-validated clustering, 1.4, attained good similarities with strategies 1.1 and 1.2. On the contrary, strategy 1.3 yielded a rather dissimilar grouping with respect to its counterparts, even on this relatively small dataset. In addition to these results, Fig. 3 contains box plots showing the distribution of attributes per group for each of the four implemented clustering strategies. Groups of the four strategies were identified and matched by observing the corresponding patterns in the plots. The order of attributes per group is the same: ADO, BMI, HbA1C, HOMA2-B, and HOMA2-IR. As is apparent from these plots, strategies 1.1, 1.2, and 1.4 also yielded similar clusters. It is also noticeable that the attribute distributions of clustering 1.3 did not match the rest, particularly in groups 1 and 3.

Table 2 Results for first stage clustering. Dataset D2 (N = 680). Columns: observations per group (0-3) and intrinsic indices SIL (silhouette), DB (Davies-Bouldin), CH (Calinski-Harabasz). Best metric value achieved appears in bold

Table 3 Comparison metrics for first stage clustering. Dataset D2 (N = 680). Columns: strategy pair (i, j), ARI (adjusted Rand index), AMI (adjusted mutual information), FM (Fowlkes-Mallows index). Best metric value achieved appears in bold
Based on these first stage clustering results, we chose strategies 1.2 and 1.4 and computed centroids both for the whole clustering (strategies 2.1 and 2.3, respectively) and for clusterings separated by gender (strategies 2.2 and 2.4, respectively). Additionally, we performed repeated K-means procedures for the CP-HOMA2 and IN-HOMA2 attributes, the latter using Min-Max normalization instead of z-score (strategies 2.5 and 2.6, respectively). Table 4 summarizes the results from the second stage clustering. Overall, group proportions were similar across all the strategies, with Group 0 being the majority group, with proportions ranging from 39.4 to 43.2%. Groups 1, 2, and 3 showed almost identical proportions in strategies 2.1 to 2.5. These percentages ranged from 18.9 to 21.1%, 18.5 to 19.9%, and 19.2 to 20.7%, respectively, for Groups 1, 2, and 3. On the other hand, clustering 2.6 generated slightly differently populated clusters with proportions of 17.6, 16.3, and 22.8%, respectively, in Groups 1, 2, and 3. In terms of clustering validation indices, both strategies implemented with the repeated K-means procedure outperformed those with centroid initialization. Moreover, clustering 2.6 achieved notably better metric scores than its nearest competitor, strategy 2.5. Also, comparing strategies with initial centroids, it is observable that those without gender separation (2.1 and 2.3) obtained better scores than their gender-separated counterparts.
Fig. 3 Box plots of the four implemented clustering strategies in first stage clustering. (A) to (D) correspond to strategies 1.1 to 1.4, in that order
Ordoñez‑Guillen et al. BioData Mining (2023) 16:24 Page 15 of 37
Table 4 Results for second stage clustering. Dataset D1 (N = 2,768). SIL silhouette, DB Davies-Bouldin, CH Calinski-Harabasz. Best metric value achieved appears in bold. Columns: Strat.; Obs. per group (0, 1, 2, 3); Intrinsic index (SIL, DB, CH)
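The repeated K-means procedure under the two scaling choices (z-score as in strategy 2.5, Min-Max as in strategy 2.6) can be sketched as follows; the data and the feature stand-ins are synthetic and illustrative, not the study's:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler, MinMaxScaler

rng = np.random.default_rng(0)
# Illustrative stand-ins for ADO, BMI, HbA1c, HOMA2-B, HOMA2-IR
X = rng.normal(loc=[55, 30, 7.0, 80, 2.0],
               scale=[10, 5, 1.5, 25, 1.0], size=(300, 5))

def repeated_kmeans(X, scaler, k=4, repeats=25):
    """Scale the data, then keep the K-means run with lowest inertia."""
    Xs = scaler.fit_transform(X)
    best = min((KMeans(n_clusters=k, n_init=1, random_state=r).fit(Xs)
                for r in range(repeats)),
               key=lambda m: m.inertia_)
    return best.labels_

labels_z = repeated_kmeans(X, StandardScaler())   # strategy 2.5 style
labels_mm = repeated_kmeans(X, MinMaxScaler())    # strategy 2.6 style
print(len(set(labels_z)), len(set(labels_mm)))    # -> 4 4
```

The scaler is the only difference between the two runs, which mirrors how strategies 2.5 and 2.6 differ only in normalization.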
Table 5 shows comparison metrics obtained for all pairs of the six clustering strategies implemented in the second stage clustering. Interestingly, the pair of strategies (2.1, 2.3) attained the highest similarity scores, despite the fact that they originated from different first stage centroids. These scores were substantially higher, even compared with those of the pairs (2.1, 2.2) and (2.3, 2.4), which originated from the same first stage clusterings 1.2 and 1.4, respectively. Moreover, the second most similar pair was (2.2, 2.4), which also came from different centroid initializations. Pairs of strategies that came from the same first stage clusterings (i.e. (2.1, 2.2) and (2.3, 2.4)) obtained the third and fourth places in terms of these clustering validity metrics. The remaining clustering pairs that used z-score normalization and CP-HOMA2 values (2.1 to 2.5) reached scores ranging from 0.6538 to 0.7512 (ARI), 0.6122 to 0.6853 (AMI), and 0.7517 to 0.8221 (FM). Finally, all comparison pairs involving clustering 2.6, which used IN-HOMA2 values with Min-Max normalization, obtained lower score ranges: 0.3289-0.3784 (ARI), 0.3380-0.3828 (AMI), and 0.5250-0.5533 (FM).
Table 5 Comparison metrics for second stage clustering. Dataset D1 (N = 2,768). ARI adjusted rand index, AMI adjusted mutual information, FM Fowlkes-Mallows index. Best metric value achieved appears in bold
Strat. i Strat. j ARI AMI FM
Figure 4 shows the distribution patterns of involved attributes for the six clustering
strategies applied on dataset D1. The order of attributes per Group is the same: ADO,
BMI, HBA1C, CP-HOMA2-%β, and CP-HOMA2-IR. Importantly, these distribution plots allowed us to assign a T2DM subtype to each cluster, by means of visual inspection and direct comparison of the patterns against previous results in T2DM sub-classifications [8, 9, 30]. Indeed, the patterns of attributes obtained within the different clusters matched the distributions previously reported for MARD, MORD, SIDD, and SIRD. In general, patterns from all six clustering strategies matched those previously reported in the literature sufficiently well to distinguish and assign a T2DM subtype to each group. Nevertheless, as is observable in the plots, there are some slight differences in ranges, interquartile ranges, and outliers when comparing the distributions of attributes in the T2DM subtypes. Among these minor discrepancies, the most appreciable were (see Fig. 4): both HOMA2 values in MARD (Panels A-E compared to F); BMI in MORD (Panels A-D compared to E and F); HBA1C in SIDD (Panels A-E compared to F); ADO and HBA1C in SIRD (Panels A-D compared to E and F).
From the second stage clustering on dataset D1, and considering validation and
comparison metrics, we selected the groups produced by two clustering strategies to
constitute two labeled datasets: Dset A and Dset B, from strategies 2.5 and 2.6, respectively. On these datasets, T2DM subtype labels were assigned to patients by matching the patterns of the groups identified by clusterings 2.5 and 2.6 (Panels (E) and (F) in Fig. 4). Both datasets were used in the next stage for developing classification models.
Fig. 4 Box plots of the six implemented clustering strategies on dataset D1 (N = 2,768). Panels (A) to (F) correspond to strategies 2.1 to 2.6, in that order
Table 6 Bootstrap validation results. Global classification metrics obtained for models A and B.
Median accuracies (ACC) and F1-scores (F1) are presented with respective 95% CI. Best performing
model on each scheme appears in bold
Models A Models B
Scheme Algorithm ACC (95% CI) F1 (95% CI) ACC (95% CI) F1 (95% CI)
Naturally, as it consists of the same attributes on which Dset A was clustered, the best performance among models A was attained by scheme S1, with ACC and F1 values both of 98.8% (98.1–99.4% CI). Nonetheless, the next best performing scheme (S4) was not far from these metrics, reaching up to 97.8% (96.8–98.6% CI) ACC and F1. The remaining (best performing) models A produced ACCs ranging from 75.8 to 82.7% and F1s ranging from 73.9 to 82.3%. Algorithms that yielded the highest ACC were SVM (schemes S2, S3, S5, and S7) and MLP (schemes S1, S4, and S6). Moreover, these algorithms obtained the best and second-best performance in all schemes except S3 and S7, where KNN and SNNN attained the second-best performance, respectively. SVM kernels that performed best were linear (schemes S1, S3, S4, S6, and S7) and rbf (schemes S2 and S5). There were marginal differences among the K values tested in KNN, with K=54 and K=55 achieving the best results in most schemes. Evaluating the mean performance of the best models across all seven schemes, mean ACC and F1 were 85.3% (± 9.2%) and 84.8% (± 9.7%), respectively.
Among models B, the best performing models were also the ones from whose scheme the input dataset was labeled (in this case, scheme S2), with best ACC and F1 both of 98.9% (97.9–99.5% CI). However, in this case, the rest of the models offered considerably closer performance with respect to S2, in all schemes except S3. Indeed, the second to sixth performing models (schemes S5, S4, S1, S6, and S7) achieved ACCs and F1s ranging from 97.9 to 98.5% (i.e. only 1.0 to 0.4% lower than S2), while S3 attained lower ACC = 89.3% and F1 = 89.2%. In this case and within all schemes, MLP outperformed the rest of the algorithms, closely followed by SVM, particularly in schemes S6 and S7. Interestingly, the SVM kernel that produced the best results within these models was the polynomial one. Again, the tested K values did not yield substantial differences in performance for models B. The mean performance of the best models in all schemes is given by ACC and F1 values of 97.1% (± 3.4%) and 97.0% (± 3.5%), respectively.
Supplementary Tables S1 and S2 show corresponding per-class results of models A and B,
respectively, in terms of F1-score, Sensitivity, and Specificity. In these tables, each entry dis-
plays the metrics for the best performing model (i.e. best ACC), out of the 1000 bootstrap
samples. Corresponding confusion matrices from which these metrics were computed are
also included in Supplementary Figs. S1 and S2. By observing Table S1 and corresponding
Fig. S1, it can be noticed that the lower performance of models A within schemes S2, S3, S5,
S6, and S7 is mainly due to a poor Sensitivity for Class 3 (SIRD). This metric was drastically
low in schemes S6 and S7 where some algorithms reached values even lower than 40%. This
effect is evidenced in the confusion matrices by observing that most errors come from Class
3 cases being misclassified as Class 0, and vice versa. Interestingly, that was not the case for models B (Table S2). In these models, abnormally low sensitivities occurred only in Class 1 (MORD) and only for SNNN. This result is also explained by observing that many Class 1 records are misclassified as Class 0, 2, or 3 (Fig. S2) in most schemes.
The amounts of records of each class left in the out-of-bag (validation) set are also shown in Tables S1 and S2. It can be observed that the proportion of validation records from the input dataset is ∼ 35–38% in these samples. This means that the models were trained using a proportion of ∼ 62–65% of distinct records from the input dataset. In other words, 35 to 38% of the training records are repeated draws in the bootstrap process.
For this reason, and with the purpose of contrasting results with those reported by [9], we also aimed at assessing the performance of classification models A and B using a stratified 10-fold cross validation. We selected the best performing algorithm in each scheme from the bootstrap validation stage, as reviewed above (i.e. those appearing in bold in Table 6). Table 7 shows these classification results computed as the mean values across the 10 folds for global Accuracy and per-class Precision, Sensitivity, Specificity, and Area Under the Curve (AUC). The overall performance of all models was consistent with the bootstrap results, with minor increases and decreases in ACC. For models A, the same behavior observed in the bootstrap validation is noticeable regarding the low sensitivity in Class 3 for schemes S2, S3, S5, S6, and S7. With respect to schemes S1, S2, S3, and S7, also implemented in [9], our models A achieved comparable performance in S1, but yielded lower metric values in the rest of them. Conversely, models B produced remarkably competitive performance in all compared schemes. Lastly, Fig. 5 compares macro-averaged Receiver Operating Characteristics (ROC) curves and displays corresponding AUCs for both models A and B, and for each of the seven implemented schemes. In the case of models A (upper panel), these plots show how schemes S1 and S4 attained the best performance, with considerably higher AUC than the rest of the schemes. For models B (lower panel), it can be observed that, except for S3, all schemes obtained closely similar curves and AUC values.
As a final step in the classification stage of our data analysis flow, we tested our trained
models on unseen data. The hold-out dataset comprised N = 7,309 patient records that
did not include C-peptide values and thus, was a disjoint set with respect to the training/
validation dataset. As previously explained, we applied a majority vote approach using
the best performing models A, considering the five schemes which did not make use of
C-peptide parameter (i.e. S2, S3, S5, S6, and S7). Table 8 shows the number of records
that were classified in each class by the five predictors. Despite the fact that there were disparities in these amounts (e.g. predictor S5), in general, there was consensus among the five predictors. On 77.3% of the observations, all five or four of the predictors agreed on the resulting class. Moreover, the cases where three or more predictors agreed amounted to 97% of observations. The ties (cases where two pairs of predictors voted for two different classes) totaled 175 (2.4%) and were resolved by simply assigning the class predicted by the predictor that achieved the best performance during the bootstrap validation stage.
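A minimal sketch of this voting rule, assuming the per-scheme predictions are ordered by bootstrap performance (best first); names and values are illustrative:

```python
from collections import Counter

def majority_vote(preds_by_rank):
    """Resolve a class from per-scheme predictions.

    preds_by_rank: predicted classes ordered so that the predictor
    with the best bootstrap performance comes first. Ties (e.g. two
    classes with two votes each) are resolved by taking the vote of
    the best-ranked predictor among the tied classes.
    """
    counts = Counter(preds_by_rank)
    top = max(counts.values())
    tied = [c for c, n in counts.items() if n == top]
    if len(tied) == 1:
        return tied[0]
    # Tie: fall back to the best-performing predictor's vote
    for p in preds_by_rank:
        if p in tied:
            return p

# Five predictors (e.g. S2, S3, S5, S6, S7), best bootstrap performer first
print(majority_vote([0, 3, 0, 3, 2]))  # tie between 0 and 3 -> 0
```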
Figure 6 depicts our final classification results (Panel A) on the test set in terms of
the proportions of each class separated by gender or including both. For comparison
purposes, we also include proportions obtained by landmark studies [8, 9]. The former (Panel B) were acquired by classifying our test set using the authors' web tool with attributes corresponding to our scheme S2. The latter (Panel C) consists of the authors' reported results obtained with a dataset of their own (ANDIS, Swedish population, N = 8,980). For the latter results, we recalculated the number of observations accordingly, after eliminating those belonging to the SAID class, which we did not consider. Proportions of classes from our majority vote approach were similar to those of [8], in spite of the fact that both were obtained from different populations. On the
other hand, although the charts in Fig. 6 display different proportions with respect to
[9], there was still an overall matching of 57.2% with 1152, 938, 1510, and 578 equally
classified observations for MARD, MORD, SIDD, and SIRD, respectively. 90.3% of dis-
crepancies came from observations that were respectively classified in our method/
web tool as: MARD/MORD (1228), MARD/SIDD (790), MORD/SIDD (411), and
SIRD/MORD (398).
Table 7 Stratified 10-fold cross-validation results. Global accuracy (ACC) with per-class precision (PRE), sensitivity (SEN), specificity (SPE), and area under the curve (AUC) are shown for models A and B; and contrasted with those reported by [9]. Each entry of our results corresponds to the mean value obtained across the 10 folds. Only best performing algorithms from bootstrap validation were included (i.e. those appearing in bold from Table 6). In each scheme, the MARD row additionally lists the scheme's global ACC as the first value of each column group ("–" marks schemes not reported by [9])

Scheme Class | Models A: (ACC) PRE SEN SPE AUC | Models B: (ACC) PRE SEN SPE AUC | Ref. [9]: (ACC) PRE SEN SPE AUC
S1 MARD | 0.990 0.991 0.993 0.994 1.000 | 0.982 0.989 0.989 0.992 0.999 | 0.981 1.000 0.987 1.000 1.000
   MORD | 0.983 0.985 0.996 1.000 | 0.969 0.986 0.991 0.998 | 0.992 1.000 0.973 1.000
   SIDD | 0.991 0.987 0.998 1.000 | 0.982 0.969 0.996 0.999 | 0.998 0.994 0.991 0.990
   SIRD | 0.994 0.992 0.999 1.000 | 0.985 0.975 0.997 1.000 | 0.998 1.000 0.991 1.000
S2 MARD | 0.832 0.824 0.894 0.863 0.945 | 0.988 0.990 0.996 0.992 1.000 | 0.903 0.945 0.876 0.914 0.880
   MORD | 0.863 0.875 0.967 0.985 | 0.987 0.978 0.996 0.999 | 0.988 0.966 0.954 0.930
   SIDD | 0.901 0.904 0.975 0.994 | 0.986 0.990 0.997 0.999 | 0.984 0.988 0.931 0.970
   SIRD | 0.733 0.582 0.949 0.907 | 0.989 0.982 0.998 1.000 | 0.901 0.959 0.656 0.860
S3 MARD | 0.760 0.734 0.902 0.764 0.898 | 0.899 0.937 0.964 0.951 0.990 | 0.859 0.912 0.856 0.891 0.840
   MORD | 0.804 0.809 0.953 0.954 | 0.898 0.903 0.970 0.989 | 0.979 0.950 0.921 0.920
   SIDD | 0.814 0.792 0.955 0.958 | 0.891 0.940 0.975 0.994 | 0.935 0.954 0.774 0.850
   SIRD | 0.716 0.372 0.965 0.891 | 0.789 0.677 0.965 0.942 | 0.903 0.949 0.627 0.840
S4 MARD | 0.979 0.985 0.991 0.989 0.999 | 0.985 0.990 0.993 0.992 0.999 | – – – – –
   MORD | 0.970 0.964 0.993 0.999 | 0.971 0.982 0.992 0.998 | – – – –
   SIDD | 0.981 0.978 0.995 0.999 | 0.994 0.974 0.999 0.999 | – – – –
   SIRD | 0.974 0.968 0.994 0.998 | 0.982 0.982 0.997 1.000 | – – – –
S5 MARD | 0.833 0.824 0.877 0.865 0.943 | 0.987 0.992 0.994 0.994 0.999 | – – – – –
   MORD | 0.874 0.904 0.969 0.985 | 0.973 0.991 0.992 0.999 | – – – –
   SIDD | 0.910 0.922 0.977 0.994 | 0.994 0.978 0.999 0.999 | – – – –
   SIRD | 0.714 0.578 0.945 0.906 | 0.991 0.976 0.998 1.000 | – – – –
S6 MARD | 0.825 0.829 0.861 0.873 0.948 | 0.983 0.987 0.992 0.990 0.999 | – – – – –
   MORD | 0.857 0.878 0.965 0.984 | 0.977 0.974 0.993 0.998 | – – – –
   SIDD | 0.914 0.912 0.979 0.994 | 0.984 0.981 0.996 0.999 | – – – –
   SIRD | 0.678 0.605 0.930 0.911 | 0.979 0.972 0.996 1.000 | – – – –
S7 MARD | 0.738 0.715 0.904 0.738 0.881 | 0.982 0.991 0.990 0.993 0.999 | 0.820 0.923 0.878 0.884 0.880
   MORD | 0.797 0.819 0.949 0.957 | 0.975 0.970 0.992 0.998 | 0.981 0.971 0.934 0.940
   SIDD | 0.880 0.881 0.971 0.980 | 0.971 0.976 0.994 0.999 | 0.984 0.979 0.928 0.950
   SIRD | 0.404 0.153 0.943 0.753 | 0.980 0.981 0.996 0.999 | 0.897 0.941 0.601 0.820
Fig. 5 Macro-averaged Receiver Operating Characteristics curves for each scheme. (A) Models A. (B) Models
B
Lastly, Fig. S3 shows a comparison of per-class distribution patterns for ADO, BMI,
HBA1C, IN-HOMA2-%β, and IN-HOMA2-IR; for results obtained in the test set from
our study (Panel A) and the aforementioned web classifier (Panel B). Overall, the resemblance of patterns is appreciable for all variables, although there was some variation derived from the disparities in the number of observations per class. Due to the MARD/MORD and MARD/SIDD mismatched classifications, it is observable that the web classifier yielded a narrower distribution and a higher median for ADO in the MARD class, as this class has fewer instances. Conversely, as a consequence of having more instances classified within them, classes MORD and SIDD present less defined distributions of BMI and HBA1C, respectively.
Table 8 Number of records classified per class in the hold-out dataset for each of the five predictors
considered
Schemes
Class S2 S3 S5 S6 S7
Fig. 6 Proportion of observations of T2DM classes. (A) Our majority vote scheme with models trained with the LD1 dataset. (B) Classification of the test dataset using the insulin-based HOMA2 model developed by Bello-Chavolla et al. [9]. (C) Clustering results reported by Ahlqvist et al. [8] with their dataset ANDIS
Discussion
In the present study, we have focused on developing and testing classification models for
T2DM subtypes. Our methodology consisted of three main stages: dataset construction, data characterization, and classification model development. In view of our results, we highlight the following findings.
First, producing an enriched large dataset by fusing information from two repre-
sentative health databases, NHANES and ENSANUT. Although NHANES includes
Finally, our majority vote approach demonstrated a great deal of consensus among the classifiers used on the hold-out dataset. Class proportions were similar to those found in the pioneering study of Ahlqvist et al. [8]. On the other hand, we believe that the disparity between our results and those of the web classifier of Bello-Chavolla et al. [9] is mainly attributable to the standardization step. Indeed, during experimentation we found that this step, which depends on the distribution of variables in the dataset, greatly impacts classification results.
Conclusion
We have introduced a new pipeline for the analysis of datasets with the goal of obtaining classifiers for T2DM subtypes. With this purpose, we described detailed data curation and characterization processes to obtain labeled datasets. Unlike previous work, our analysis included a clustering validation step through well-known indices, which allowed us to evaluate the quality of clusters. We have obtained results consistent with most previous work in terms of subgroup proportions (see Table 1). From the classifiers we trained, it is remarkable that simpler and faster algorithms such as SVM and MLP fit the clustered data better than the more involved convolutional architectures. Also, the results showed that classifiers learned better from normalized (Min-Max) data than from standardized (z-score) data. The performances obtained using this scaling approach were consistent across the seven data schemes, since normalized data produced better defined clusters according to the validation indices.
The present work was based on cross-sectional data and thus, we have limited the
scope of our analysis to the development of classification tools for T2DM subtypes,
without further association with risks of complications, incidence, prevalence, and treat-
ment response. We left such analyses as future work, with the hope of establishing data
sharing collaborations. However, we believe that the study offers valuable insights into the process of developing classification models for T2DM subtypes. Further limitations of the present study are those inherent to the population (i.e. dataset) used for the analysis and the preprocessing steps applied, as well as the fact that we considered all the patients within the dataset as GADA negative (i.e. not considering the SAID class), since this variable was not available in most NHANES and ENSANUT records.
Clustering techniques
Clustering techniques differ in the way clusters are identified, which basically depends on the desired grouping approach and on the distribution of the data. Three different clustering approaches were explored:
Intrinsic methods
When the ground truth labels of instances are not available, intrinsic methods make it possible to quantify the quality of clusterings. The general idea consists in minimizing distances of instances within the same partition (i.e. obtaining more compact partitions), and maximizing distances of observations belonging to different partitions (i.e. obtaining more separation among partitions). The methods we applied were:
• Silhouette (SIL) [53]. The Silhouette index computes for each instance p_i a score SIL_i, given by SIL_i = (b_i − a_i)/max(a_i, b_i), where a_i is the average distance of p_i to every instance within its cluster and b_i is the average distance of p_i to all instances of the nearest cluster. The overall index for a clustering C, SIL_C, is obtained by averaging the index over all instances.
• Davies-Bouldin (DB) [54]. This index is defined as the average similarity between each cluster C_i (1 ≤ i ≤ k) and its most similar cluster C_j. For each pair of clusters let R_ij = (s_i + s_j)/d_ij be this measure of similarity, where s_i and s_j are respectively the average distance of each instance in C_i and C_j to its centroid, and d_ij is the distance between centroids i and j. The Davies-Bouldin index is defined as the average of the maximum similarity of each cluster:

    DB = (1/k) ∑_{i=1}^{k} max_{j≠i} R_ij
• Calinski-Harabasz (CH) [55]. This index is also known as the Variance Ratio Criterion. For a K-clustering of a dataset with N instances, the between- and within-cluster dispersion matrices are respectively defined as:

    B_K = ∑_{k=1}^{K} n_k (c_k − c_N)(c_k − c_N)^T

    W_K = ∑_{k=1}^{K} ∑_{p∈C_k} (p − c_k)(p − c_k)^T

where n_k and c_k are the number of instances and centroid of the k-th cluster C_k, and c_N is the global centroid of the dataset. The Calinski-Harabasz index is defined as the ratio

    CH = (trace(B_K)/trace(W_K)) × ((N − K)/(K − 1))
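All three intrinsic indices are available in scikit-learn; a small sketch on synthetic, well-separated data (illustrative only):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import (silhouette_score,
                             davies_bouldin_score,
                             calinski_harabasz_score)

rng = np.random.default_rng(0)
# Two well-separated synthetic groups of observations
X = np.vstack([rng.normal(0, 0.3, (100, 2)),
               rng.normal(3, 0.3, (100, 2))])

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

sil = silhouette_score(X, labels)        # higher is better, in [-1, 1]
db = davies_bouldin_score(X, labels)     # lower is better, >= 0
ch = calinski_harabasz_score(X, labels)  # higher is better
print(sil > 0.7, db < 0.6, ch > 100)
```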
Extrinsic methods
On the other hand, extrinsic methods assist in the evaluation of clustering quality only with respect to a ground truth label assignment, without considering any other information about distances among the data points. We used the following indices with the purpose of comparing similarity among our different clustering strategies:
• Adjusted Rand Index (ARI) [56]. For a ground truth label assignment C and a clustering K, let us define a, b, c, and d, respectively, as the number of pairs of instances that: are grouped together in both C and K (a); are grouped together in K but separated in C (b); are grouped together in C but separated in K (c); and are separated in both C and K (d).
Terms a, b, c, and d can be calculated from the contingency matrix [57]. The unadjusted Rand Index is defined by:

    RI = (a + d) / (a + b + c + d)
To guarantee that random label assignments will get a value close to zero, the Adjusted Rand Index is defined as:

    ARI = [ C(N,2)(a + d) − [(a + b)(a + c) + (c + d)(b + d)] ] / [ C(N,2)² − [(a + b)(a + c) + (c + d)(b + d)] ]

where C(N,2) = N(N − 1)/2 = a + b + c + d is the total number of instance pairs.
• Adjusted Mutual Information (AMI) [58]. Let U and V be two label assignments for N instances in a clustering, and let P(i) = |U_i|/N and P'(j) = |V_j|/N be the probabilities that a randomly picked instance falls into cluster U_i of U and into cluster V_j of V, respectively. The entropies of the assignments are:

    H(U) = − ∑_{i=1}^{|U|} P(i) log(P(i)),    H(V) = − ∑_{j=1}^{|V|} P'(j) log(P'(j))

With the mutual information MI(U, V) and its expected value E[MI], the Adjusted Mutual Information score is defined as:

    AMI = (MI(U, V) − E[MI]) / (mean(H(U), H(V)) − E[MI])
• Fowlkes-Mallows (FM) [59]. This index is defined as the geometric mean of the pairwise precision and recall metrics. In the notation of terms a, b, c, and d defined previously for the ARI index, the FM score is:

    FM = a / √((a + b)(a + c))
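The three extrinsic indices are likewise available in scikit-learn; note that all of them are invariant to permutations of the cluster IDs, since only the grouping of pairs matters (sketch on toy assignments):

```python
from sklearn.metrics import (adjusted_rand_score,
                             adjusted_mutual_info_score,
                             fowlkes_mallows_score)

# Two clusterings of the same ten instances; cluster IDs are arbitrary,
# only the grouping matters for these pair-counting indices
u = [0, 0, 0, 1, 1, 1, 2, 2, 2, 2]
v = [1, 1, 1, 0, 0, 0, 2, 2, 2, 2]  # identical grouping, permuted labels

print(adjusted_rand_score(u, v))         # 1.0 for identical groupings
print(adjusted_mutual_info_score(u, v))  # 1.0 as well
print(fowlkes_mallows_score(u, v))       # 1.0 as well
```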
Classification algorithms
Algorithms used for developing classification models were K-Nearest Neighbors (K-NN),
Support Vector Machine (SVM), MultiLayer Perceptron (MLP), and Self-Normalized Neu-
ral Networks (SNNN). These are briefly described in the following.
• K-NN. This is one of the simplest algorithms for classification. It is based on two simple notions: a measure of distance, and the premise that closeness among patients is helpful to infer their class (T2DM subtype) membership [60]. Classification is made in two basic steps: first, find the K nearest neighbors of an input patient, and then classify the patient on a majority vote basis. It is called a lazy learning method in the sense that it does not perform training like other methods, but rather classifies new patients using the data itself.
• SVM. This algorithm finds the hyperplane that best separates the classes in the input data; the training instances closest to this hyperplane are the support vectors. In the binary case, the objective is to find the hyperplane with the maximum margin between instances of a pair of classes. This maximum-margin separation is useful so that future instances can be classified. In case the distribution of the classes in the training data is not linearly separable, it is necessary to map the data to a higher-dimensional space via a kernel function. In some cases, it is necessary to try different kernel functions to find the most suitable one.
• MLP. This is one of the simplest neural networks, but powerful for classification because it can learn linear and non-linear relations in data. The input data is combined with a set of initial weights and biases arranged into layers, and each linear combination in a layer is propagated to the next layer. By this, the model learns a set of patterns that describe the input data of each class. It can use any arbitrary activation function at the output. Over several iterations on the input data, the algorithm readjusts the weights and learning rate until no improvement is noticed in the classification. The resulting model classifies unseen patients.
• SNNN. This is a kind of deep learning technique; it was first proposed as a new neural network architecture, but was later regarded as a variant of MLP. Its main feature is the SELU (Scaled Exponential Linear Unit) activation function, with which neuron activations converge towards zero mean and unit variance even in the presence of perturbations in the data. In this way, the data is self-normalized as it passes through each layer of the network, making learning highly robust.
Values for K in K-NN were selected in the neighborhood of ⌊√N⌋ (i.e. the interval [⌊√N⌋ − 3, ⌊√N⌋ + 3]), where N is the number of patients. The hyperparameters for SVM, MLP, and SNNN were selected from preliminary executions of the algorithms using grid search.
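The K-selection rule and the grid search can be sketched with scikit-learn as follows; the dataset and parameter grids are illustrative, not the study's exact configuration:

```python
import math
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

# Synthetic stand-in for a labeled four-class patient dataset
X, y = make_classification(n_samples=300, n_features=5,
                           n_informative=4, n_redundant=0,
                           n_classes=4, n_clusters_per_class=1,
                           random_state=0)

# K candidates in the neighborhood of floor(sqrt(N))
k0 = math.floor(math.sqrt(len(X)))
k_grid = list(range(k0 - 3, k0 + 4))

knn = GridSearchCV(KNeighborsClassifier(), {"n_neighbors": k_grid}, cv=5)
knn.fit(X, y)

# SVM kernels tried via grid search, as described in the text
svm = GridSearchCV(SVC(), {"kernel": ["linear", "rbf", "poly"]}, cv=5)
svm.fit(X, y)

print(knn.best_params_, svm.best_params_)
```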
Classification metrics
Each of the obtained data models was evaluated by means of the following multi-class
metrics. For an M-class classification problem with N instances, let us consider the M × M confusion matrix CONF = [c_ij], where by convention, we put the actual labels (ground truth) in columns and predicted labels in rows. For the k-th class (1 ≤ k ≤ M), TP_k = c_kk is the number of true positives, FP_k = ∑_{j≠k} c_kj the false positives, FN_k = ∑_{i≠k} c_ik the false negatives, and TN_k the sum of the remaining entries (true negatives).
With these values, the per-class metrics Precision (PRE), Sensitivity or Recall (REC), their harmonic mean termed F1-score (F1), and Specificity (SPE) are respectively defined by:

    PRE_k = TP_k / (TP_k + FP_k)

    REC_k = TP_k / (TP_k + FN_k)

    F1_k = 2 · PRE_k · REC_k / (PRE_k + REC_k)

    SPE_k = TN_k / (TN_k + FP_k)

Global PRE, REC, and F1 are computed by averaging or adding up per-class values. These forms of averaging include the macro (unweighted mean of per-class values), weighted (mean weighted by class support), and micro (computed from global TP, FP, and FN counts) variants.
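A small sketch of these per-class computations from a confusion matrix laid out as above (actual labels in columns, predicted in rows); the matrix values are illustrative:

```python
import numpy as np

# Confusion matrix with actual labels in columns, predicted in rows
CONF = np.array([[50,  2,  1],
                 [ 3, 40,  4],
                 [ 2,  3, 45]])

def per_class_metrics(conf, k):
    tp = conf[k, k]
    fp = conf[k, :].sum() - tp   # predicted k, actually another class
    fn = conf[:, k].sum() - tp   # actually k, predicted another class
    tn = conf.sum() - tp - fp - fn
    pre = tp / (tp + fp)
    rec = tp / (tp + fn)
    f1 = 2 * pre * rec / (pre + rec)
    spe = tn / (tn + fp)
    return pre, rec, f1, spe

pre0, rec0, f1_0, spe0 = per_class_metrics(CONF, 0)
print(round(pre0, 3), round(rec0, 3))
```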
Data merging
After downloading the appropriate files/tables containing the variables needed for our study, we converted the files to .CSV format when necessary. In particular, we used the Python XPORT library to convert NHANES SAS .XPT files. Afterwards, we first merged tables within each cycle separately, and then concatenated the partial datasets obtained from each cycle. The set of attributes we initially selected from the databases is shown in Table 9. To deal with variable naming inconsistencies across different cycles, we defined a consistent ordering and naming. For some categorical variables (e.g. ETHNICITY or DIABETES) we also defined a consistent scheme of values and changed them accordingly. All values that were identified in the database documentation as codes for missing or irrelevant data were set to zero. Each observation was identified with a new sequential variable “NSEQN” and a cycle identifier “CYCLE” was also added. Table 10 summarizes this information for each cycle.
Table 10 Datasets generated per cycle. Cycles 1-11 are from NHANES and cycles 21-23 are from
ENSANUT
CYCLE Years NSEQN Size
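The merge-then-concatenate step can be sketched with pandas as follows; the table and variable names are illustrative stand-ins for the NHANES/ENSANUT files:

```python
import pandas as pd

# Illustrative per-cycle tables sharing a respondent key (here "SEQN")
demo = pd.DataFrame({"SEQN": [1, 2, 3], "AGE": [45, 60, 52]})
lab = pd.DataFrame({"SEQN": [1, 2, 3], "HBA1C": [6.1, 7.4, 5.9]})

# 1) Merge tables within a cycle on the respondent identifier
cycle1 = demo.merge(lab, on="SEQN", how="inner")
cycle1["CYCLE"] = 1

# A second cycle would be built the same way from its own files
cycle2 = cycle1.copy()
cycle2["CYCLE"] = 2

# 2) Concatenate the cycles and add a new sequential identifier
merged = pd.concat([cycle1, cycle2], ignore_index=True)
merged["NSEQN"] = range(1, len(merged) + 1)
print(len(merged), list(merged["NSEQN"]))
```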
Data cleansing
Data cleansing consisted of double-checking, variable by variable, for null/blank or coded values, which were set to zero. In this step we also corrected data inconsistencies (e.g. some observations with AGE < ADO). Values that were declared in the database documentation as “below the limit of detection” (C-PEPTIDE and INSULIN in cycles 1 and 4) were also set to zero.
Imputation
Prior to imputation, we kept only adult patients (AGE ≥ 20, N = 172,909) to avoid including young diabetic patients, who often have type 1 diabetes. Imputation was implemented in six incremental stages. For each imputed variable, we used four Sklearn estimator methods:
1. Bayesian Regression. It uses the available data to train a Bayesian ridge regression model to infer the missing data. By using a ridge approach, the resulting regression is intentionally offset from the original data to avoid overfitting the model.
2. Decision Tree Regression. It splits the existing data into several ranges; depending on the ranges the predictor variables fall into, the imputed value is the mean of the rest of the data in the same range.
3. Extra Trees Regression. It works in a similar way to Decision Tree Regression, but instead of making rigorous calculations to find the optimal split, a random splitting is performed.
4. KNN Regression. It uses the K-NN approach, where the weighted mean of the k-nearest neighbours of the existing values in the row to impute is calculated to fill in the blanks.
We used the median of the four estimated values for each imputed variable. Estimators require both the dataset to impute and the dataset to fit, with the same variables. Pandas functionality allows logical formulae with the operator symbols “∼, |, &” (NOT, OR, and AND, respectively) to be provided as queries to retrieve subsets of observations in a dataset. For instance, if we define v1, v2, v3, and v4 as the results of queries where variables satisfy some conditions (in our case, checking whether v1, v2, v3, and v4 are present), then the query v1 | (v2 & ∼v3) retrieves the subset where either v1 is present, or v2 is present and v3 is absent. Using this notation, in the following we describe the six stages of our imputation scheme by providing the dataset to impute. We also provide the number of observations in each dataset. In stages 1-4 we imputed observations separated by gender, and thus the amounts are given as Nm (men) and Nw (women). For notational brevity, we will use the first two letters of each of the involved variables WEIGHT, HEIGHT, WAIST, HBA1C, GLUCOSE1, INSULIN, ADO, AGE, and BMI.
• Stage 1. Dataset to impute (Nm = 2380, Nw = 3887): AG & [(∼HE & WE & WA) | (HE & ∼WE & WA) | (HE & WE & ∼WA)]. Dataset to fit (Nm = 55319, Nw = 69051): AG & HE & WE & WA.
• Stage 2. Dataset to impute (Nm = 130, Nw = 160): AG & [(∼HE & WE) | (HE & ∼WE)]. Dataset to fit (Nm = 57699, Nw = 72938): AG & HE & WE.
• Stage 3. Dataset to impute (Nm = 130, Nw = 160): AG & HE & WE & ∼WA. Dataset to fit (Nm = 57699, Nw = 72938): AG & HE & WE & WA.
• Stage 4. Dataset to impute (Nm = 3974, Nw = 6338): AG & BM & [(∼HB & GL & IN) | (HB & ∼GL & IN) | (HB & GL & ∼IN)]. Dataset to fit (Nm = 26191, Nw = 31288): AG & BM & HB & GL & IN.
• Stage 5. Dataset to impute (N = 61037): AG & HB & GL & IN & ∼AD. Dataset to fit (N = 7406): AG & HB & GL & IN & AD.
• Stage 6. Dataset to impute (N = 24398): AG & HB & ∼AD. Dataset to fit (N = 72216): AG & HB & AD.
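This query notation maps directly onto pandas boolean masks; a small example following the stage 1 pattern (the data are illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    "HEIGHT": [170.0, None, 165.0, None],
    "WEIGHT": [80.0, 72.0, None, None],
    "WAIST":  [95.0, 88.0, 90.0, None],
})

# Presence masks for each variable (HE, WE, WA in the text's shorthand)
HE = df["HEIGHT"].notna()
WE = df["WEIGHT"].notna()
WA = df["WAIST"].notna()

# Stage 1 style selection: exactly one of the three variables missing
to_impute = df[(~HE & WE & WA) | (HE & ~WE & WA) | (HE & WE & ~WA)]
# Fit set: all three variables present
to_fit = df[HE & WE & WA]
print(len(to_impute), len(to_fit))  # -> 2 1
```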
HOMA2 computation
HOMA2 values, which are derived from GLUCOSE1 and either C-PEPTIDE or INSULIN, were computed using the Excel version of the HOMA2 calculator downloaded from the authors' webpage [48]. This calculator imposes limits on the accepted input values; we dealt with these restrictions by assigning the limit values when necessary. After the HOMA2 computation, we performed a second extreme value removal procedure based only on the HOMA2 (from C-peptide) variables, from which we obtained a final dataset with N = 2768 observations.
Abbreviations
ADO Age at Diabetes Onset
BMI Body Mass Index
HBA1C Glycated hemoglobin test
FPG Fasting Plasma Glucose
RPG Random Plasma Glucose
HDLC High-Density Lipoprotein Cholesterol
OGTT Oral Glucose Tolerance Test
HOMA Homeostasis Model Assessment
METS-VF Metabolic Score for Visceral Fat
METS-IR Metabolic Score for Insulin Resistance
T2DM Type 2 Diabetes Mellitus
MARD Mild Age-Related Diabetes
MORD Mild Obesity-Related Diabetes
SIDD Severe Insulin-Deficient Diabetes
SIRD Severe Insulin-Resistant Diabetes
Supplementary Information
The online version contains supplementary material available at https://doi.org/10.1186/s13040-023-00340-2.
Additional file 1.
Acknowledgements
Not applicable.
Authors’ contributions
Conceptualization, N.E.O.-G., J.L.G-C and I.L.-A.; methodology, N.E.O.-G., J.L.G-C, I.L.-A. and E.A.-B.; software, N.E.O.-G.,
M.C.-M and E.A.-B.; validation, N.E.O.-G., J.L.G-C, I.L.-A. and E.A.-B.; formal analysis, N.E.O.-G., J.L.G-C, I.L.-A. E.A.-B. and M.C.-M;
investigation, N.E.O.-G., J.L.G-C and I.L.-A.; resources, N.E.O.-G., J.L.G-C and I.L.-A.; data curation, N.E.O.-G., J.L.G-C, I.L.-A.
E.A.-B. and M.C.-M; writing—original draft preparation, N.E.O.-G., J.L.G-C, I.L.-A. E.A.-B. and M.C.-M; writing—review and
editing, N.E.O.-G., J.L.G-C, I.L.-A. E.A.-B. and M.C.-M; visualization, N.E.O.-G. and M.C.-M; supervision, N.E.O.-G., J.L.G-C and
I.L.-A.; project administration, N.E.O.-G. and J.L.G-C; funding acquisition, J.L.G-C.
Funding
This research was funded by the FORDECYT-PRONACES project 41756 “Plataforma tecnológica para la gestión, ase‑
guramiento, intercambio y preservación de grandes volúmenes de datos en salud y construcción de un repositorio
nacional de servicios de análisis de datos de salud” by CONACYT (Mexico) together with the CONACYT postdoctoral
fellowship granted to N.E.O.-G.
Declarations
Ethics approval and consent to participate
Not applicable.
Competing interests
The authors declare no competing interests.
References
1. International Diabetes Federation. IDF Diabetes Atlas, 10th edn. Brussels, Belgium; 2021. https://www.diabetesatlas.org. Accessed 03 Oct 2022.
2. Zhang Y, Zhu Q, Liu H. Next generation informatics for big data in precision medicine era. BioData Min. 2015;8(34).
https://doi.org/10.1186/s13040-015-0064-2.
3. Tuomi T, Santoro N, Caprio S, Cai M, Weng J, Groop L. The many faces of diabetes: a disease with increasing hetero‑
geneity. Lancet. 2014;383(9922):1084–94. https://doi.org/10.1016/S0140-6736(13)62219-9.
4. Capobianco E. Systems and precision medicine approaches to diabetes heterogeneity: a Big Data perspective. Clin
Transl Med. 2017;6(1):23. https://doi.org/10.1186/s40169-017-0155-4.
5. Del Prato S. Heterogeneity of diabetes: heralding the era of precision medicine. Lancet Diabetes Endocrinol.
2019;7(9):659–61. https://doi.org/10.1016/S2213-8587(19)30218-9.
6. Nair ATN, Wesolowska-Andersen A, Brorsson C, Rajendrakumar AL, Hapca S, Gan S, et al. Heterogeneity in pheno‑
type, disease progression and drug response in type 2 diabetes. Nat Med. 2022;28(5):982–8. https://doi.org/10.1038/
s41591-022-01790-7.
7. Cefalu WT, Andersen DK, Arreaza-Rubín G, Pin CL, Sato S, Verchere CB, et al. Heterogeneity of Diabetes: β-Cells,
Phenotypes, and Precision Medicine: Proceedings of an International Symposium of the Canadian Institutes of
Health Research’s Institute of Nutrition, Metabolism and Diabetes and the U.S. National Institutes of Health’s National
Institute of Diabetes and Digestive and Kidney Diseases. Diabetes Care. 2021;45(1):3–22. https://doi.org/10.2337/
dci21-0051.
8. Ahlqvist E, Storm P, Käräjämäki A, Martinell M, Dorkhan M, Carlsson A, et al. Novel subgroups of adult-onset diabetes
and their association with outcomes: a data-driven cluster analysis of six variables. Lancet Diabetes Endocrinol.
2018;6(5):361–9. https://doi.org/10.1016/S2213-8587(18)30051-2.
9. Bello-Chavolla OY, Bahena-López JP, Vargas-Vázquez A, Antonio-Villa NE, Márquez-Salinas A, Fermín-Martínez CA,
et al. Clinical characterization of data-driven diabetes subgroups in Mexicans using a reproducible machine learning
approach. BMJ Open Diabetes Res Care. 2020;8(1). https://doi.org/10.1136/bmjdrc-2020-001550.
10. Kavakiotis I, Tsave O, Salifoglou A, Maglaveras N, Vlahavas I, Chouvarda I. Machine Learning and Data Mining Meth‑
ods in Diabetes Research. Comput Struct Biotechnol J. 2017;15:104–16. https://doi.org/10.1016/j.csbj.2016.12.005.
11. Gautier T, Ziegler LB, Gerber MS, Campos-Náñez E, Patek SD. Artificial intelligence and diabetes technology: A
review. Metab Clin Exp. 2021;124:154872. https://doi.org/10.1016/j.metabol.2021.154872.
12. Bagherzadeh-Khiabani F, Ramezankhani A, Azizi F, Hadaegh F, Steyerberg EW, Khalili D. A tutorial on variable
selection for clinical prediction models: feature selection methods in data mining could improve the results. J Clin
Epidemiol. 2016;71:76–85. https://doi.org/10.1016/j.jclinepi.2015.10.002.
13. Georga EI, Protopappas VC, Polyzos D, Fotiadis DI. Evaluation of short-term predictors of glucose concentration in
type 1 diabetes combining feature ranking with regression models. Med Biol Eng Comput. 2015;53(12):1305–18.
https://doi.org/10.1007/s11517-015-1263-1.
14. Wang KJ, Adrian AM, Chen KH, Wang KM. An improved electromagnetism-like mechanism algorithm and its appli‑
cation to the prediction of diabetes mellitus. J Biomed Inform. 2015;54:220–9. https://doi.org/10.1016/j.jbi.2015.02.
001.
15. Sideris C, Pourhomayoun M, Kalantarian H, Sarrafzadeh M. A flexible data-driven comorbidity feature extraction
framework. Comput Biol Med. 2016;73:165–72. https://doi.org/10.1016/j.compbiomed.2016.04.014.
16. Aslam MW, Zhu Z, Nandi AK. Feature generation using genetic programming with comparative partner selection for
diabetes classification. Expert Syst Appl. 2013;40(13):5402–12. https://doi.org/10.1016/j.eswa.2013.04.003.
17. Ling D, Liang W, Huating L, Chun C, Qiang W, Hongyu K, et al. A deep learning system for detecting diabetic retinopa‑
thy across the disease spectrum. Nat Commun. 2021;12(1):3242. https://doi.org/10.1038/s41467-021-23458-5.
18. Kangrok O, Hae Min K, Dawoon L, Hyungyu L, Kyoung Yul S, Sangchul Y. Early detection of diabetic retinopathy
based on deep learning and ultra-wide-field fundus images. Sci Rep. 2021;11(1):1897. https://doi.org/10.1038/
s41598-021-81539-3.
19. Gulshan V, Peng L, Coram M, Stumpe MC, Wu D, Narayanaswamy A, et al. Development and Validation of a Deep
Learning Algorithm for Detection of Diabetic Retinopathy in Retinal Fundus Photographs. JAMA. 2016;316(22):2402–
10. https://doi.org/10.1001/jama.2016.17216.
20. Bawankar P, Shanbhag N, Smitha KS, Dhawan B, Palsule A, Kumar D, et al. Sensitivity and specificity of automated
analysis of single-field non-mydriatic fundus photographs by Bosch DR Algorithm-Comparison with mydriatic
fundus photography (ETDRS) for screening in undiagnosed diabetic retinopathy. PLoS ONE. 2017;12(12):e0189854.
https://doi.org/10.1371/journal.pone.0189854.
21. Huang GM, Huang KY, Lee TY, Weng JTY. An interpretable rule-based diagnostic classification of diabetic nephropa‑
thy among type 2 diabetes patients. BMC Bioinformatics. 2015;16(1):S5. https://doi.org/10.1186/1471-2105-16-S1-S5.
22. Leung RK, Wang Y, Ma RC, Luk AO, Lam V, Ng M, et al. Using a multi-staged strategy based on machine learning and
mathematical modeling to predict genotype-phenotype risk patterns in diabetic kidney disease: a prospective
case-control cohort analysis. BMC Nephrol. 2013;14(1):162. https://doi.org/10.1186/1471-2369-14-162.
23. Yudong C, Jitendra J, Siaw-Teng L, Pradeep R, Manish K, Hong-Jie D, et al. Identification and Progression of
Heart Disease Risk Factors in Diabetic Patients from Longitudinal Electronic Health Records. BioMed Res Int.
2015;2015:636371. https://doi.org/10.1155/2015/636371.
24. Baskozos G, Themistocleous AC, Hebert HL, Pascal MMV, John J, Callaghan BC, et al. Classification of painful or
painless diabetic peripheral neuropathy and identification of the most powerful predictors using machine learning
models in large cross-sectional cohorts. BMC Med Inform Decis Making. 2022;22(1):144. https://doi.org/10.1186/
s12911-022-01890-x.
25. Nanda R, Nath A, Patel S, Mohapatra E. Machine learning algorithm to evaluate risk factors of diabetic foot ulcers
and its severity. Med Biol Eng Comput. 2022;60(8):2349–57. https://doi.org/10.1007/s11517-022-02617-w.
26. Mueller L, Berhanu P, Bouchard J, Alas V, Elder K, Thai N, et al. Application of Machine Learning Models to
Evaluate Hypoglycemia Risk in Type 2 Diabetes. Diabetes Ther. 2020;11(3):681–99. https://doi.org/10.1007/
s13300-020-00759-4.
27. Deng Y, Lu L, Aponte L, Angelidi AM, Novak V, Karniadakis GE, et al. Deep transfer learning and data augmentation
improve glucose levels prediction in type 2 diabetes patients. npj Digit Med. 2021;4(1):109. https://doi.org/10.1038/
s41746-021-00480-x.
28. Saxena R, Sharma SK, Gupta M, Sampada GC. A Comprehensive Review of Various Diabetic Prediction Models: A
Literature Survey. J Healthc Eng. 2022;2022:15. https://doi.org/10.1155/2022/8100697.
29. Chaki J, Thillai Ganesh S, Cidham SK, Ananda Theertan S. Machine learning and artificial intelligence based Diabetes
Mellitus detection and self-management: A systematic review. J King Saud Univ Comput Inf Sci. 2022;34(6, Part
B):3204–3225. https://doi.org/10.1016/j.jksuci.2020.06.013.
30. Ahlqvist E, Prasad RB, Groop L. Subtypes of Type 2 Diabetes Determined From Clinical Parameters. Diabetes.
2020;69(10):2086–93. https://doi.org/10.2337/dbi20-0001.
31. Sarría-Santamera A, Orazumbekova B, Maulenkul T, Gaipov A, Atageldiyeva K. The Identification of Diabetes Mellitus
Subtypes Applying Cluster Analysis Techniques: A Systematic Review. Int J Environ Res Public Health. 2020;17(24).
https://doi.org/10.3390/ijerph17249523.
32. Dennis JM, Shields BM, Henley WE, Jones AG, Hattersley AT. Disease progression and treatment response in data-
driven subgroups of type 2 diabetes compared with models based on simple clinical features: an analysis using
clinical trial data. Lancet Diabetes Endocrinol. 2019;7(6):442–51. https://doi.org/10.1016/S2213-8587(19)30087-7.
33. Zaharia OP, Strassburger K, Strom A, Bönhof GJ, Karusheva Y, Antoniou S, et al. Risk of diabetes-associated diseases
in subgroups of patients with recent-onset diabetes: a 5-year follow-up study. Lancet Diabetes Endocrinol.
2019;7(9):684–94. https://doi.org/10.1016/S2213-8587(19)30187-1.
34. Herder C, Maalmi H, Strassburger K, Zaharia OP, Ratter JM, Karusheva Y, et al. Differences in Biomarkers of Inflamma‑
tion Between Novel Subgroups of Recent-Onset Diabetes. Diabetes. 2021;70(5):1198–208. https://doi.org/10.2337/
db20-1054.
35. Maalmi H, Herder C, Bönhof GJ, Strassburger K, Zaharia OP, Rathmann W, et al. Differences in the prevalence of
erectile dysfunction between novel subgroups of recent-onset diabetes. Diabetologia. 2022;65(3):552–62. https://
doi.org/10.1007/s00125-021-05607-z.
36. Li X, Yang S, Cao C, Yan X, Zheng L, Zheng L, et al. Validation of the Swedish Diabetes Re-Grouping Scheme in Adult-
Onset Diabetes in China. J Clin Endocrinol Metab. 2020;105(10):e3519–28. https://doi.org/10.1210/clinem/dgaa524.
37. Wang W, Pei X, Zhang L, Chen Z, Lin D, Duan X, et al. Application of new international classification of adult-onset
diabetes in Chinese inpatients with diabetes mellitus. Diabetes/Metab Res Rev. 2021;37(7):e3427. https://doi.org/10.1002/dmrr.3427.
38. Slieker RC, Donnelly LA, Fitipaldi H, Bouland GA, Giordano GN, Åkerlund M, et al. Replication and cross-validation
of type 2 diabetes subtypes based on clinical variables: an IMI-RHAPSODY study. Diabetologia. 2021;64(9):1982–9.
https://doi.org/10.1007/s00125-021-05490-8.
39. Antonio-Villa NE, Fernández-Chirino L, Vargas-Vázquez A, Fermín-Martínez CA, Aguilar-Salinas CA, Bello-Chavolla
OY. Prevalence Trends of Diabetes Subgroups in the United States: A Data-driven Analysis Spanning Three Decades
From NHANES (1988-2018). J Clin Endocrinol Metab. 2021;107(3):735–742. https://doi.org/10.1210/clinem/dgab762.
40. Pigeyre M, Hess S, Gomez MF, Asplund O, Groop L, Paré G, et al. Validation of the classification for type 2 diabetes
into five subgroups: a report from the ORIGIN trial. Diabetologia. 2022;65(1):206–15. https://doi.org/10.1007/
s00125-021-05567-4.
41. Tanabe H, Hirai H, Saito H, Tanaka K, Masuzaki H, Kazama JJ, et al. Detecting Sarcopenia Risk by Diabetes Cluster‑
ing: A Japanese Prospective Cohort Study. J Clin Endocrinol Metab. 2022;107(10):2729–36. https://doi.org/10.1210/
clinem/dgac430.
42. Centers for Disease Control and Prevention (CDC). National Center for Health Statistics (NCHS). National Health
and Nutrition Examination Survey Data. Hyattsville, MD: U.S. Department of Health and Human Services. 2022.
https://www.cdc.gov/nchs/nhanes/index.htm. Accessed 01 Mar 2022.
43. Secretaría de Salud. Instituto Nacional de Salud Pública (INSP). Encuesta Nacional de Salud y Nutrición. 2022. https://
ensanut.insp.mx/index.php. Accessed 01 Mar 2022.
44. MacQueen J. Some Methods for Classification and Analysis of Multivariate Observations. In: Proceedings of the 5th
Berkeley Symposium on Mathematical Statistics and Probability, vol. 1. Berkeley: University of California Press; 1967. p. 281–97.
45. Bridges CC. Hierarchical Cluster Analysis. Psychol Rep. 1966;18:851–4.
46. Ester M, Kriegel HP, Sander J, Xu X. A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases
with Noise. In: Proceedings of the Second International Conference on Knowledge Discovery and Data Mining.
KDD’96. AAAI Press; 1996. p. 226–231.
47. Levy JC, Matthews DR, Hermans MP. Correct Homeostasis Model Assessment (HOMA) Evaluation Uses the Com‑
puter Program. Diabetes Care. 1998;21(12):2191–2. https://doi.org/10.2337/diacare.21.12.2191.
48. University of Oxford. HOMA2 Calculator. 2022. https://www.dtu.ox.ac.uk/homacalculator/. Accessed 01 May 2022.
49. Matthews DR, Hosker JP, Rudenski AS, Naylor BA, Treacher DF, Turner RC. Homeostasis model assessment: insulin
resistance and β-cell function from fasting plasma glucose and insulin concentrations in man. Diabetologia.
1985;28(7):412–9. https://doi.org/10.1007/BF00280883.
50. Bello-Chavolla OY, Almeda-Valdes P, Gomez-Velasco D, Viveros-Ruiz T, Cruz-Bautista I, Romo-Romo A, et al. METS-IR,
a novel score to evaluate insulin sensitivity, is predictive of visceral adiposity and incident type 2 diabetes. Eur J
Endocrinol. 2018;178(5):533–44. https://doi.org/10.1530/EJE-17-0883.
51. Bello-Chavolla OY, Antonio-Villa NE, Vargas-Vázquez A, Viveros-Ruiz TL, Almeda-Valdes P, Gomez-Velasco D, et al.
Metabolic Score for Visceral Fat (METS-VF), a novel estimator of intra-abdominal fat content and cardio-metabolic
health. Clin Nutr. 2020;39(5):1613–21. https://doi.org/10.1016/j.clnu.2019.07.012.
52. Beleites C, Baumgartner R, Bowman C, Somorjai R, Steiner G, Salzer R, et al. Variance reduction in estimating classification error using sparse datasets. Chemometr Intell Lab Syst. 2005;79(1):91–100. https://doi.org/10.1016/j.chemolab.2005.04.008.
53. Rousseeuw PJ. Silhouettes: A graphical aid to the interpretation and validation of cluster analysis. J Comput Appl
Math. 1987;20:53–65. https://doi.org/10.1016/0377-0427(87)90125-7.
54. Davies DL, Bouldin DW. A Cluster Separation Measure. IEEE Trans Pattern Anal Mach Intell. 1979;PAMI-1(2):224–227.
https://doi.org/10.1109/TPAMI.1979.4766909.
55. Caliński T, Harabasz J. A dendrite method for cluster analysis. Commun Stat. 1974;3(1):1–27. https://doi.org/10.1080/
03610927408827101.
56. Hubert L, Arabie P. Comparing partitions. J Classif. 1985;2(1):193–218. https://doi.org/10.1007/BF01908075.
57. Santos JM, Embrechts M. On the Use of the Adjusted Rand Index as a Metric for Evaluating Supervised Classification.
In: Alippi C, Polycarpou M, Panayiotou C, Ellinas G, editors. Artificial Neural Networks - ICANN 2009. Springer Berlin
Heidelberg; 2009. p. 175–84.
58. Strehl A, Ghosh J. Cluster Ensembles — a Knowledge Reuse Framework for Combining Multiple Partitions. J Mach Learn Res. 2003;3:583–617. https://doi.org/10.1162/153244303321897735.
59. Fowlkes EB, Mallows CL. A Method for Comparing Two Hierarchical Clusterings. J Am Stat Assoc. 1983;78(383):553–
69. https://doi.org/10.1080/01621459.1983.10478008.
60. Altman N. An introduction to kernel and nearest-neighbor nonparametric regression. Am Stat. 1992;46(3):175–85.
https://doi.org/10.1080/00031305.1992.10475879.
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.