Calibration of Transformer-Based Models for Identifying Stress and Depression

Abstract—In today's fast-paced world, the rates of stress and depression are rising. People use social media to express their thoughts and feelings through posts. Therefore, social media can assist in the early detection of mental health conditions. Existing methods mainly introduce feature extraction approaches and train shallow machine learning (ML) classifiers. To address the need for creating a large feature set and to obtain better performance, other research studies use deep neural networks or language models based on transformers. Although transformer-based models achieve noticeable improvements, they often cannot capture rich factual knowledge. While a number of studies have been proposed for enhancing pretrained transformer-based models with extra information or additional modalities, no prior work has exploited these modifications for detecting stress and depression through social media. In addition, although the reliability of an ML model's confidence in its predictions is critical for high-risk applications, no prior work has taken model calibration into consideration. To resolve these issues, we present the first study in the task of depression and stress detection in social media that injects extra-linguistic information into transformer-based models, namely, bidirectional encoder representations from transformers (BERT) and MentalBERT. Specifically, the proposed approach employs a multimodal adaptation gate for creating the combined embeddings, which are given as input to a BERT (or MentalBERT) model. To take model calibration into account, we apply label smoothing. We test our proposed approaches on three publicly available datasets and demonstrate that the integration of linguistic features into transformer-based models yields an increase in performance. Also, the use of label smoothing contributes to both the improvement of the model's performance and the calibration of the model. We finally perform a linguistic analysis of the posts and show differences in language between stressful and nonstressful texts, as well as depressive and nondepressive posts.

Index Terms—Calibration, depression, emotion, mental health, stress, transformers.

I. INTRODUCTION

Excessive stress can lead to anxiety disorders or even depression. Depression entails a great number of symptoms, including loss of interest, anger, pessimism, changes in weight, feelings of worthlessness, thoughts of suicide, and many more. According to the WHO [3], around 280 million people in the world have depression. China, India, the United States, Russia, Indonesia, and Nigeria are some of the countries presenting the highest rates of depression [4]. People with stress and depression use social media platforms, including Twitter and Reddit, and share their thoughts, emotions, and feelings through posts or comments with other users. Therefore, social media constitute a valuable source of information, where the linguistic patterns of depressive/stressful posts can be investigated.

Existing research initiatives exploit social media data for identifying depressive and stressful posts. The majority of these research works [5], [6] employ feature extraction approaches and train shallow ML algorithms. Employing feature extraction approaches constitutes a tedious procedure and demands domain expertise, since the authors may not find the optimal feature set for the specific problem. At the same time, training shallow ML algorithms does not yield optimal performance and does not generalize well to new data. To address these limitations, other approaches [7] use deep neural networks, including convolutional neural networks (CNNs), BiLSTMs, and so on, or transformer-based networks. In addition, there are research studies employing ensemble strategies [8]. However, these approaches increase the training time substantially, since multiple models must be trained separately. In addition, recent studies [9], [10] have shown that transformer-based models struggle or fail to capture rich knowledge. For this reason, methods have been proposed for enhancing these models with external information or additional modalities [11], [12], [13], [14]. However, existing research initiatives in the tasks of stress and depression detection through social media have
In terms of the tweet-level attributes, they exploited linguistic features, i.e., positive and negative emotion words, punctuation marks, emoticons, and so on; visual features, i.e., saturation, brightness, five-color theme, and so on; and social features, i.e., number of comments, retweets, and likes. Regarding user-level attributes, they extracted features pertinent to the users' posting behavior and social interaction.

Emotion-enhanced approaches in conjunction with explainability techniques have been introduced. Turcan et al. [31] proposed three emotion-enhanced models that incorporate emotional information in various ways to enhance the task of binary stress prediction. In terms of the first approach, the authors exploited two single-task models sharing the same BERT representation layers. Specifically, the authors trained the stress task with the Dreaddit data and the emotion task with the GoEmotions or Vent data. Regarding the second model, the authors proposed a multitask learning framework, where the two tasks were performed at the same time. With regards to the third model, the authors first fine-tuned a BERT model using the emotion task and then applied this fine-tuned BERT model to stress detection. Finally, the authors introduced a new framework for model interpretation using local interpretable model-agnostic explanations (LIME).

B. Depression Detection

Some studies have focused on the extraction of features and then the training of shallow ML classifiers. For instance, Tadesse et al. [5] extracted n-grams via the tf–idf approach, LIWC features, and LDA topics. Then, they trained LR, SVM, random forest (RF), AdaBoost, and multilayer perceptron (MLP) classifiers. Results showed that the bigram features trained on an SVM classifier achieved 80.00% accuracy, while the best accuracy, accounting for 91.00%, was achieved by exploiting the MLP classifier with all the features, i.e., LIWC, LDA, and bigrams. Liu and Shi [32] extracted a set of textual features, namely, part-of-speech, emotional words, personal pronouns, polarity, and so on, and a set of features indicating the posting behavior of the user, i.e., posting habits and time. Next, feature selection techniques were applied, including recursive elimination, mutual information, and extreme random tree. Finally, Naive Bayes, the k-nearest neighbor, regularized LR, and SVM were used as base learners, and a simple LR algorithm was used as a combination strategy to build a stacking model. Nguyen et al. [33] extracted a set of features, including LDA topics, LIWC features, affective features obtained using the affective norms for English words (ANEW) lexicon, and mood labels. The authors trained a LASSO regression classifier for detecting depressive posts and analyzing the importance of each feature. The authors also applied statistical tests and found significant differences between depressive and nondepressive posts. Tsugawa et al. [34] extracted features and trained an SVM classifier to detect depression on Twitter. Specifically, the authors extracted the frequency of words used in tweets, the ratio of tweet topics found by LDA, the ratio of positive and negative words, and many more. Pirina and Çöltekin [35] collected several corpora and trained an SVM classifier using character and word n-grams. Doc2vec and tf–idf features were extracted and given as input to AdaBoost, LR, RF, and SVM for identifying the severity of depression.

Recently, deep learning approaches have been introduced, since they obtain better performance than traditional ML algorithms and often do not require the tedious procedure of feature extraction. For example, Wani et al. [36] represented words via word2vec and the tf–idf approach and trained a deep neural network consisting of CNNs and LSTMs. Kim et al. [37] collected a dataset consisting of posts written by people who suffer from mental disorders, including depression, anxiety, bipolar disorder, borderline personality disorder, schizophrenia, and autism. This study developed six binary classification models for detecting mental disorders, i.e., depression versus nondepression, and so on. Specifically, the authors utilized the tf–idf approach and trained an XGBoost classifier. Next, the authors used word2vec and trained a CNN model. Naseem et al. [38] reformulated depression identification as an ordinal classification problem, where they used four depression severity levels. The authors introduced a deep neural network consisting of a text graph convolutional network, bidirectional long short-term memory (BiLSTM), and an attention layer. A similar approach was proposed by Ghosh and Anwar [39], where the authors extracted features and trained LSTMs for estimating the depression intensity levels. A hybrid deep neural network consisting of CNN and BiLSTM was introduced by Kour and Gupta [40]. Zogan et al. [41] introduced the first dataset including posts from users with and without depression during COVID-19 and presented a new hierarchical CNN. An emotion-based attention network model was proposed by Ren et al. [42], where the authors extracted the positive and negative words and passed them through two separate BiLSTM layers followed by attention layers.

Ensemble strategies have also been explored in the literature. This means that multiple models are trained separately and the final decision is usually taken by a majority voting approach. For instance, an ensemble strategy was introduced by Ansari et al. [8]. First, the authors exploited some sentiment lexicons, including AFINN, NRC, SenticNet, and multiperspective question answering (MPQA), extracted features, and applied principal component analysis for reducing the dimensionality of the feature set. An LR classifier was trained using the respective feature set. Next, the authors trained an LSTM neural network coupled with an attention mechanism. Finally, the authors combined the predictions of these two approaches via an ensemble method. Also, an ensemble approach was proposed by Trotzek et al. [43]. First, the authors trained an LR classifier using user-level linguistic metadata as input. Specifically, the authors extracted LIWC features, the length of the text, four readability scores, and so on. Next, the authors trained a CNN model. Finally, the authors combined the outputs of these approaches via a late fusion strategy, i.e., by averaging the predictions of the classifiers. Figuerêdo et al. [7] designed a CNN along with early and late fusion strategies. Specifically, the authors exploited fastText and GloVe embeddings. In the early fusion approach, multiple word embeddings were concatenated and passed to the CNN model. In the late fusion strategy, a majority-vote approach was performed based on the predictions of multiple CNN models. The CNN model comprised a simple convolution
layer, max-pooling, fully connected layers, and concatenated rectified linear units as the activation function.

Explainable approaches have also been introduced. Souza et al. [44] introduced a stacking ensemble neural network, which addresses a multilabel classification task. Specifically, the proposed architecture consists of two levels. In the first level, binary base classifiers were trained with two distinct roles, i.e., expert and differentiating. The expert base classifiers were used for differentiating between users belonging to the control group and those diagnosed with anxiety, depression, or comorbidity. The differentiating base models aimed at distinguishing between two target conditions, e.g., anxiety versus depression. In the second level, a meta-classifier uses the base models' outputs to learn a mapping function that manages the multilabel problem of assigning control or diagnosed labels. The authors used LSTMs and CNNs. Finally, this study explored Shapley additive explanations (SHAP) metrics for identifying the influential classification features. Zogan et al. [45] also proposed an explainable approach, where textual, behavioral, temporal, and semantic aspect features from social media were exploited. A hierarchical attention network was used for explainability purposes. A hierarchical attention network was also used by Uban et al. [46], where the authors extracted a feature set consisting of content, style, LIWC, and emotions/sentiment features. An interpretable approach was proposed by Song et al. [47], where the authors introduced the feature attention network. The feature attention network consists of four feature networks, each of which analyzes posts based on an established theory related to depression, and a post-level attention on top of the networks. However, this method did not attain satisfactory results.

Recently, transformer-based models have been applied to the task of depression detection in social media. Specifically, Boinepelli et al. [48] introduced a method for finding the subset of posts that would be a good representation of all the posts made by the user. First, they employed BERT and computed the embeddings for all posts made by the user. Next, they used a clustering and ranking algorithm. After finding the representative posts per user, the authors added domain-specific elements by exploiting RoBERTa. Finally, the authors experimented with two ways of diagnosing depression, i.e., by either employing a majority-vote approach or training a hierarchical attention network. Anantharaman et al. [49] fine-tuned a BERT model for classifying the signs of depression into three labels, namely, "not depressed," "moderately depressed," and "severely depressed." Similarly, Nilsson and Kovács [50] exploited a BERT model and used abstractive summarization techniques for data augmentation. Zogan et al. [51] presented an abstractive–extractive automatic text summarization model based on BERT, k-means clustering, and bidirectional auto-regressive transformers (BART). Then, they proposed a deep learning framework, which combines user behavior and user post history or user activity.

Multimodal approaches combining both text and images have also been proposed. For instance, a multimodal approach was introduced by Ghosh et al. [52] for detecting depression on Twitter. Specifically, the authors utilized the user's description and profile image. The authors used the IBM Watson Natural Language Understanding tool and extracted sentiment and emotion information for all user descriptions, along with the possible categories (at most three) that the description may belong to. Next, the authors designed a neural network consisting of BiGRU, attention layers, convolution layers, and dense layers. The authors used GloVe embeddings. The proposed architecture can predict whether the user suffers from depression or not, as well as predict the sadness, joy, fear, disgust, and anger scores. Li et al. [53] exploited text, pictures, and auxiliary information (post time, dictionary, and social information) and used attention mechanisms within and between the modalities at the same time. The authors exploited TextCNN, ResNet-18, and fully connected layers for extracting representation vectors of text, images, and auxiliary information, respectively. A multimodal approach was proposed by Cheng and Chen [54], where the authors exploited texts, images, posting time, and the time interval between the posts on Instagram. Shen et al. [55] collected multimodal datasets and extracted six depression-oriented feature groups, namely, social network, user profile, visual, emotional, topic-level, and domain-specific features. Gui et al. [56] combined texts and images and proposed a new cooperative multiagent reinforcement learning method.

Multitask approaches have also been introduced. A multitask approach was introduced by Zhou et al. [57]. Specifically, the authors proposed a hierarchical attention network consisting of BiGRU layers and integrated LDA topics. The main task was the identification of depression, i.e., a binary classification task, while the auxiliary task was the prediction of the domain category of the post, i.e., a multiclass classification task. Both multitask and multimodal approaches were introduced by Wang et al. [58]. The authors extracted a total of ten features from the text, social behavior, and pictures. XLNet and BiGRU coupled with attention layers, and dense layers were used.

C. Related Work Review Findings

Existing research initiatives rely on the feature extraction process and the training of shallow ML classifiers for diagnosing mental disorders in social media. This demands domain expertise and does not generalize well to new data. Other existing approaches train CNNs or BiLSTMs, or employ hybrid models and ensemble strategies. Recently, transformer-based models have also been used. Only a few works have experimented with injecting linguistic features, including emotion features, into deep neural networks. These approaches employ multitask learning models, fine-tuning, or multimodal approaches. All these approaches employing transformer-based models usually just fine-tune them. None of these approaches have used modifications of BERT aiming to enhance its performance by injecting external knowledge into it. Also, no prior work has taken into account model calibration, creating in this way overconfident models.

Therefore, our work differs from the existing research initiatives, since we: 1) present a new method, which injects linguistic features into transformer-based models; 2) apply label smoothing for calibrating our model and evaluate both the performance and calibration of our proposed models; and 3) present a methodology for gaining linguistic insight into
the tasks of depression and stress, investigating their common characteristics.

[…]

1) NRC: […] Each element is the proportion of tokens belonging to each category.
2) LIWC: LIWC is a dictionary-based approach to count words in linguistic, psychological, and topical categories [60]. We use LIWC 2022 [61] to represent each text as a 117-d vector.
3) LDA Topics: Before training the LDA model, we remove stop words and punctuation. We exploit LDA (with 25 topics) and extract 25 topic probabilities per text [62]. These probabilities describe the topics of interest in each text. Inspired by Liu et al. [22], we use the following feature vector.
   a) Global Outlier Standard Score (GOSS): For evaluating the ith text's interest on a certain topic k, compared to the rest of the texts, we use the GOSS feature (a short computational sketch is given after this list)

      \mu(x_k) = \frac{\sum_{i=1}^{n} x_{ik}}{n}    (1)

      \mathrm{GOSS}(x_{ik}) = \frac{x_{ik} - \mu(x_k)}{\sqrt{\sum_{i} \left( x_{ik} - \mu(x_k) \right)^{2}}}    (2)

   Therefore, each text is represented as a 25-d vector.
4) Top2Vec: Top2Vec [63] is an algorithm for topic modeling, which automatically detects the topics present in the text and generates jointly embedded topic, document, and word vectors. After training Top2Vec by exploiting the universal sentence encoder, each text is represented as a 512-d vector.
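To make the computation in (1) and (2) concrete, the following is a minimal NumPy sketch. It assumes the 25 LDA topic probabilities per text are already available as a matrix (the variable name topic_probs is ours), and it illustrates the formulas above rather than reproducing the authors' released code.

```python
import numpy as np

def goss_features(topic_probs: np.ndarray) -> np.ndarray:
    """Global Outlier Standard Score per Eqs. (1)-(2).

    topic_probs: array of shape (n_texts, n_topics), e.g., 25 LDA topic
    probabilities per text. Returns an array of the same shape, so each
    text is represented by a 25-d GOSS vector.
    """
    mu = topic_probs.mean(axis=0, keepdims=True)                  # Eq. (1): mean interest per topic
    centered = topic_probs - mu                                   # x_ik - mu(x_k)
    denom = np.sqrt((centered ** 2).sum(axis=0, keepdims=True))   # sqrt of summed squared deviations
    return centered / (denom + 1e-12)                             # Eq. (2); epsilon avoids division by zero

# Toy example: 4 texts, 3 topics.
probs = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.8, 0.1],
                  [0.3, 0.3, 0.4],
                  [0.2, 0.2, 0.6]])
print(goss_features(probs).shape)  # (4, 3)
```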
We experiment with the following pretrained models: BERT [64] and MentalBERT [65].

First, we pass each text through the aforementioned transformer-based models. Let C ∈ R^{N×d} be the output of the transformer-based models, where N denotes the sequence length, while d denotes the dimensionality of the models. We have omitted the dimension corresponding to the batch size for the sake of simplicity.

Then, we project the feature vectors to a dimensionality equal to 128. We repeat the feature vector N times, so as to ensure that the feature vector and the output of the transformer-based models can be concatenated. Given the word representation e^{(i)}, we concatenate e^{(i)} with the feature vector, i.e., h_v^{(i)}

   w_v^{(i)} = \sigma\left( W_{hv} \left[ e^{(i)} ; h_v^{(i)} \right] + b_v \right)    (3)

[…]

where β is a hyperparameter. Then, we apply a layer normalization [66] and a dropout layer [67] to e_m^{(i)}. Next, the combined embeddings are fed to a BERT/MentalBERT model. We get the classification [CLS] token of this model and pass it through a dense layer consisting of 128 units with a ReLU activation function. Finally, we use a dense layer consisting of either two units (binary classification task) or four units (multiclass classification task).

We denote our proposed models as M-BERT and multimodal MentalBERT (M-MentalBERT), followed by the linguistic features that are integrated into them. For example, the injection of LIWC features into a BERT model is denoted as M-BERT (LIWC).
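The following is a rough PyTorch sketch of how such a gated fusion could look. Equation (3) gives the gate; the shifting equations (4)-(6) are not visible in this excerpt, so the displacement and β-scaled shift below follow the multimodal adaptation gate of Rahman et al. [23] as an assumption, and the module and parameter names (LinguisticAdaptationGate, d_feat, beta) are ours, not the authors'.

```python
import torch
import torch.nn as nn

class LinguisticAdaptationGate(nn.Module):
    """MAG-style fusion of word embeddings e with a repeated linguistic
    feature vector h_v. The gate follows Eq. (3); the shifting step is
    assumed from Rahman et al. [23], since Eqs. (4)-(6) are not shown here."""

    def __init__(self, d_model: int, d_feat: int = 128, beta: float = 1e-3, dropout: float = 0.1):
        super().__init__()
        self.gate = nn.Linear(d_model + d_feat, d_feat)   # W_hv and b_v in Eq. (3)
        self.proj = nn.Linear(d_feat, d_model)            # maps gated features back to d_model
        self.beta = beta
        self.norm = nn.LayerNorm(d_model)                 # layer normalization [66]
        self.drop = nn.Dropout(dropout)                   # dropout [67]

    def forward(self, e: torch.Tensor, h_v: torch.Tensor) -> torch.Tensor:
        # e: (N, d_model) word representations; h_v: (N, d_feat) repeated feature vector.
        w_v = torch.sigmoid(self.gate(torch.cat([e, h_v], dim=-1)))        # Eq. (3)
        h_m = self.proj(w_v * h_v)                                         # displacement vector (assumed)
        alpha = torch.clamp(
            self.beta * e.norm(dim=-1, keepdim=True)
            / (h_m.norm(dim=-1, keepdim=True) + 1e-6),
            max=1.0,
        )                                                                  # beta-controlled scaling (assumed)
        e_m = e + alpha * h_m                                              # combined embedding e_m^(i)
        return self.drop(self.norm(e_m))
```

In the pipeline described above, the returned embeddings would then be fed to BERT or MentalBERT, and the [CLS] vector passed through the 128-unit ReLU layer and the final two- or four-unit output layer.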
B. Model Calibration

To prevent the model from becoming too overconfident, we use label smoothing [25], [68]. Specifically, label smoothing calibrates learned models so that the confidences of their predictions are more aligned with the accuracies of their predictions.

For a network trained with hard targets, the cross-entropy loss is minimized between the true targets y_k and the network's outputs p_k, as in H(y, p) = \sum_{k=1}^{K} -y_k \log(p_k), where y_k is "1" for the correct class and "0" for the others. For a network trained with label smoothing, we minimize instead the cross-entropy between the modified targets y_k^{LS} and the network's outputs p_k

   y_k^{LS} = y_k (1 - \alpha) + \frac{\alpha}{K}    (7)

   H(y, p) = \sum_{k=1}^{K} -y_k^{LS} \log(p_k)    (8)

where α is the smoothing parameter, and K is the number of classes.
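As a concrete illustration of (7) and (8), a minimal PyTorch-style loss could look as follows; this is a generic label-smoothing sketch, not the authors' training code.

```python
import torch
import torch.nn.functional as F

def label_smoothing_loss(logits: torch.Tensor, targets: torch.Tensor, alpha: float = 0.1) -> torch.Tensor:
    """Cross-entropy with smoothed targets, Eqs. (7)-(8).

    logits: (batch, K) raw scores; targets: (batch,) integer class labels.
    """
    K = logits.size(-1)
    log_p = F.log_softmax(logits, dim=-1)             # log p_k
    y = F.one_hot(targets, num_classes=K).float()     # hard targets y_k
    y_ls = y * (1.0 - alpha) + alpha / K              # Eq. (7)
    return -(y_ls * log_p).sum(dim=-1).mean()         # Eq. (8), averaged over the batch

# Example: binary task (K = 2) with a smoothing parameter of 0.1.
logits = torch.tensor([[2.0, -1.0], [0.3, 0.8]])
targets = torch.tensor([0, 1])
print(label_smoothing_loss(logits, targets, alpha=0.1))
```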
IV. EXPERIMENTS

A. Datasets

1) Dreaddit: This dataset includes stressful and nonstressful texts posted by users of Reddit [27]. Specifically, these posts
belong to five domains, namely, abuse, anxiety, financial, social, and posttraumatic stress disorder (PTSD). A subset of the data has been annotated using Amazon Mechanical Turk. This dataset also includes lexical, syntactic, and social media features per post. The dataset has been divided by the authors into a train and a test set. The train set comprises 1488 stressful texts and 1350 nonstressful ones, while the test set includes 369 stressful texts and 346 nonstressful ones.

2) Depression_Mixed: This dataset [35] consists of 1482 nondepressive posts and 1340 depressive posts. These posts have been written by users on Reddit and English depression forums.

3) Depression_Severity: This dataset includes posts on Reddit [38] and assigns each post to a severity level, i.e., minimal (2587 posts), mild (290 posts), moderate (394 posts), and severe depression (282 posts).

B. Experimental Setup

We use the Adam optimizer with a learning rate of 0.001.

[…] of our proposed approach. We use these metrics similar to Wani et al. [36]. Regarding the multiclass classification task reported on the Depression_Severity dataset, we use weighted precision, weighted recall, and weighted F1-score. We use these metrics similar to Mishra et al. [71].

2) Calibration: We evaluate the calibration of our model using the metrics proposed by the relevant literature [72], [73], [74]. Specifically, we use the following metrics.
1) Expected Calibration Error (ECE): The calibration error is the difference between the fraction of predictions in a bin that are correct (accuracy) and the mean of the probabilities in the bin (confidence). First, we divide the predictions into M equally spaced bins (of size 1/M)

   \mathrm{acc}(B_m) = \frac{1}{|B_m|} \sum_{i \in B_m} \mathbb{1}\left( \hat{y}_i = y_i \right)    (9)

   \mathrm{conf}(B_m) = \frac{1}{|B_m|} \sum_{i \in B_m} \hat{p}_i    (10)
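The ECE then aggregates the per-bin gaps |acc(B_m) − conf(B_m)|, weighted by the fraction of samples in each bin. Since the closing formula is not visible in this excerpt, the sketch below follows the standard definition of Guo et al. [74]; variable names are ours.

```python
import numpy as np

def expected_calibration_error(confidences: np.ndarray, predictions: np.ndarray,
                               labels: np.ndarray, num_bins: int = 10) -> float:
    """ECE with M equally spaced bins, per Eqs. (9)-(10) and Guo et al. [74]."""
    bins = np.linspace(0.0, 1.0, num_bins + 1)
    ece = 0.0
    n = len(labels)
    for lo, hi in zip(bins[:-1], bins[1:]):
        in_bin = (confidences > lo) & (confidences <= hi)          # predictions falling in bin B_m
        if in_bin.any():
            acc = (predictions[in_bin] == labels[in_bin]).mean()   # acc(B_m), Eq. (9)
            conf = confidences[in_bin].mean()                      # conf(B_m), Eq. (10)
            ece += (in_bin.sum() / n) * abs(acc - conf)            # |B_m|/n weighted gap
    return ece

# Toy example.
conf = np.array([0.9, 0.6, 0.8, 0.55])
pred = np.array([1, 0, 1, 1])
true = np.array([1, 0, 0, 1])
print(expected_calibration_error(conf, pred, true, num_bins=5))
```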
TABLE I
Performance Comparison Among Proposed Models and Baselines Using the Depression_Mixed and Dreaddit Datasets
surpassing the performance of the BERT model in F1-score by 0.57%. We speculate that the injection of top2vec features obtains better performance than the injection of features derived from LDA topics, i.e., GOSS features, since the top2vec algorithm is capable of identifying the number of topics automatically. In terms of MentalBERT, we observe that the injection of top2vec features obtains an F1-score of 92.69%, surpassing MentalBERT by 1.52%. We observe that the integration of NRC and top2vec features improves the performance obtained by MentalBERT. Regarding the proposed approaches with label smoothing, we observe that these models attain better performances than the ones obtained by the models without label smoothing. Specifically, we observe that M-BERT (top2vec) with label smoothing surpasses the respective model without label smoothing in F1-score and accuracy by 0.61% and 0.36%, respectively. Similarly, M-MentalBERT (top2vec) with label smoothing obtains the highest F1-score and accuracy, accounting for 93.06% and 93.45%, respectively. This model surpasses the respective model without label smoothing in F1-score and accuracy by 0.37% and 0.18%. Apart from the improvement of the performance metrics, i.e., precision, recall, F1-score, and accuracy, we observe that the models with label smoothing obtain better results in terms of the calibration metrics, i.e., ECE and ACE, than the ones obtained by the models without label smoothing. For example, we observe that M-BERT (top2vec) with label smoothing improves the ECE and ACE scores obtained by M-BERT (top2vec) without label smoothing by 0.008 and 0.013, respectively. Similarly, M-MentalBERT (LDA topics) with label smoothing improves the ECE and ACE scores obtained by M-MentalBERT (LDA topics) without label smoothing by 0.042 and 0.043, respectively.

With regards to the Depression_Severity dataset, we first compare our proposed approaches without label smoothing with the BERT and MentalBERT models. We observe that the integration of LIWC features and features extracted by LDA topic modeling, i.e., GOSS features, into the BERT model leads to a performance surge in comparison with the BERT model. Specifically, M-BERT (LIWC) outperforms BERT in weighted F1-score by 1.13%. At the same time, the integration of all the features, except NRC, into a MentalBERT model yields a performance improvement compared to the MentalBERT model. Specifically, M-MentalBERT (LDA topics) attains the highest weighted F1-score, accounting for 72.58%, surpassing MentalBERT by 0.91%. When it comes to the proposed models with label smoothing, we observe an improvement in both the performance metrics and the calibration ones. More specifically, the integration of NRC features into a BERT model obtains a weighted F1-score of 72.81%, outperforming BERT by 1.81%, M-BERT (NRC) without label smoothing by 2.85%, and M-BERT (LIWC) without label smoothing by 0.68%. In addition, M-MentalBERT (LDA topics) with label smoothing obtains the highest F1-score, accounting for 73.16%, surpassing MentalBERT by 1.49% and M-MentalBERT (LDA topics) without label smoothing by 0.58%. In terms of the calibration metrics, we observe that both ECE and ACE scores are improved when we apply label smoothing. For example, M-BERT (LIWC) with label smoothing obtains an ECE score of 0.094 and an ACE score of 0.069, which are improved by 0.016 and 0.009, respectively, compared with the respective model without label smoothing.

VI. LINGUISTIC ANALYSIS

We finally perform an analysis on the Depression_Mixed and Dreaddit datasets to uncover the peculiarities of stress and depression. Specifically, we seek to find the correlations of LIWC features with stressful/depressive and nonstressful/nondepressive texts. To do this, we adopt the methodology of Ilias and Askounis [75]. First, we normalize the LIWC features, so as to ensure that they sum up to 1 across each post. Next, we use the point-biserial correlation between each LIWC category and the label of the post. The output of the point-biserial correlation is a number ranging from −1 to 1. Positive correlations mean that the specific LIWC category is correlated with the stressful/depressive class (label 1), while negative correlations mean that the specific LIWC category is correlated with the nonstressful/nondepressive class (label 0). We consider the absolute values of the correlations. Results are reported in Table III. All the correlations are significant at p < 0.05 with Benjamini–Hochberg correction [76] for multiple comparisons.
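As an illustration of this procedure, the sketch below normalizes a LIWC feature matrix per post, computes the point-biserial correlation of each category with the binary label, and applies the Benjamini–Hochberg correction. It assumes SciPy and statsmodels and placeholder variable names (liwc, labels) rather than the authors' own scripts.

```python
import numpy as np
from scipy.stats import pointbiserialr
from statsmodels.stats.multitest import multipletests

def liwc_label_correlations(liwc: np.ndarray, labels: np.ndarray, alpha: float = 0.05):
    """Point-biserial correlation of each LIWC category with the post label (0/1),
    with Benjamini-Hochberg correction for multiple comparisons."""
    # Normalize so the LIWC features of each post sum to 1.
    liwc = liwc / (liwc.sum(axis=1, keepdims=True) + 1e-12)
    corrs, pvals = [], []
    for j in range(liwc.shape[1]):
        r, p = pointbiserialr(labels, liwc[:, j])   # label is the dichotomous variable
        corrs.append(r)
        pvals.append(p)
    reject, pvals_corrected, _, _ = multipletests(pvals, alpha=alpha, method="fdr_bh")
    return np.array(corrs), pvals_corrected, reject

# Usage (hypothetical shapes): liwc has one row per post, one column per LIWC category.
# corrs, p_adj, significant = liwc_label_correlations(liwc, labels)
```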
In terms of the Depression_Mixed dataset, we observe that the control group tends to use words with positive tone and emotion, i.e., good, well, happy, hope, and so on. In addition, the healthy control group discusses topics of everyday life, including lifestyle (work, home, and school), culture (car and phone), politics (govern and congress), family, and friends (boyfriend, girlfriend, and dude). Also, these people make plans for the future, thus using words indicating a focus on the future (correlation equal to 0.0666). However, it must be noted that this is a very weak correlation. On the other hand, people with depression focus on the present and do not make plans for the future. They discuss negative topics, including death, illnesses, mental health, and substances. This can be justified by the fact that people with depression often have suicidal tendencies and believe that they cannot achieve anything. In addition, they use swear words, i.e., shit, fuck, and damn, since they think that everything goes wrong in their life. Also, their posts are full of sadness, anxiety, and negative tone.

Regarding the Dreaddit dataset, we observe similar patterns. Specifically, nonstressful posts include words with positive tone and emotion. People in nonstressful conditions discuss topics pertinent to lifestyle, money, leisure, technology, food, work, religion, and many more. Also, people use personal pronouns, including third-person singular, second-person, and first-person plural. Also, their posts include words indicating politeness. On the contrary, people in stressful conditions use swear words or words indicating interpersonal conflict, including fight, kill, attacking, and so on. The first-person singular constitutes the LIWC category with the highest degree of correlation with the stressful class. This result agrees with previous work [77], where it is mentioned that self-references by individuals increase in emotionally vulnerable conditions. Therefore, the use of the first-person singular indicates increased self-focus. In addition, similar to the depressive posts, we observe that stressful posts include words with
negative emotions, anger, sadness, and anxiety. Also, topics discussed by these users are pertinent to illness, health, and death.

TABLE III
LIWC Features Associated With Depressive/Stressful and Nondepressive/Nonstressful Posts, Sorted by Point-Biserial Correlation. All Correlations Are Significant at p < 0.05 After Benjamini–Hochberg Correction

VII. DISCUSSION

Our study contributes to the literature by introducing the first approach that integrates extra-linguistic information into pretrained language models based on transformers, namely, BERT and MentalBERT. Specifically, we adapt M-BERT [23] by replacing multimodal information with linguistic information. To be more precise, we extract NRC features, LIWC features, features derived from LDA topics, and top2vec features. We apply a multimodal adaptation gate and also exploit a shifting component for creating new combined embeddings, which are given as input to BERT (and MentalBERT) models. In addition, motivated by the fact that in real-world decision-making systems classification networks must not only be accurate but should also indicate when they are likely to be incorrect, we apply label smoothing and evaluate our proposed approaches both in terms of classification and calibration.

Therefore, our study is different from the state-of-the-art approaches described in Section II.
1) Prior works that proposed multimodal, multitask, and ensemble strategies in conjunction with transformer-based models have just fine-tuned these pretrained transformer-based models instead of using modifications of them. Thus, this study is the first attempt to inject extra knowledge into BERT (and MentalBERT) in order to enhance its performance.
2) All the prior works evaluate only the classification performance of their approaches, neglecting the confidence of the predictions. To tackle this, this is the first study in the task of stress and depression detection through social media posts utilizing label smoothing and evaluating both the classification performance and the calibration of the models.
3) Finally, this is the first study utilizing features derived from LDA topics, namely, the GOSS, which captures the text's interest compared to other texts.

From the results of this study, we found the following.
1) Finding 1: The integration of linguistic features into transformer-based models yields an increase in classification performance. However, it is worth noting that in some cases this improvement is limited. For instance, the integration of LIWC features into the MentalBERT
model with label smoothing obtains better performance than MentalBERT in F1-score by 3.36% and better performance than M-MentalBERT (LIWC) without label smoothing in F1-score by 0.63%. However, we believe that even a small improvement can make a difference.
2) Finding 2: Label smoothing improves both the performance and the calibration of the proposed approaches. The calibration of the proposed approaches is measured via two metrics, namely, ECE and ACE.
3) Finding 3: Findings from the linguistic analysis revealed that people in stressful and/or depressive conditions use words belonging to specific LIWC categories more frequently than others.

There are several limitations related to this study.
1) Hyperparameter Tuning: Due to limited access to GPU resources, we were not able to perform hyperparameter tuning. Instead, we tried some combinations of parameters. We believe that adopting a hyperparameter tuning procedure, given access to GPU resources, would further increase the classification performance.
2) Explainability: The present study is not accompanied by explainability techniques, e.g., integrated gradients [78], and so on. Therefore, we aim to apply explainability techniques in the future.
3) Due to limited access to GPU resources, and similar to prior work [8], [36], [39], we were not able to perform multiple runs for testing statistical significance on the Depression_Mixed and Dreaddit datasets.

VIII. CONCLUSION

In this article, we present a new method for identifying stress and depression in social media text by injecting linguistic information into transformer-based models. Also, it is the first study exploiting label smoothing in order to ensure that our model is calibrated. We evaluate our proposed methods on three publicly available datasets, which include a depression detection dataset (binary classification), a stress detection dataset, and a depression detection dataset (multiclass classification: severity of depression). Findings suggest that transformer-based networks combined with linguistic information lead to a performance improvement in comparison with plain transformer-based networks. Also, applying label smoothing yields both a performance improvement and better calibration of the proposed models. Specifically, in terms of the Depression_Mixed dataset, we found that the injection of top2vec features into the BERT and MentalBERT models along with label smoothing obtained the highest F1-score and accuracy. Regarding the Dreaddit dataset, results showed that the integration of LIWC features into language models based on transformers in conjunction with label smoothing yielded the highest F1-score and accuracy. With regards to the Depression_Severity dataset, findings showed that the injection of NRC features into the BERT model and the integration of features derived from LDA topics, namely, GOSS features, into the MentalBERT model yielded the highest weighted F1-scores. We also conduct a linguistic analysis and show that stressful and depressive posts present high correlations with common LIWC categories.

In the future, we plan to exploit transfer learning and domain adaptation methods. Also, employing explainable multimodal models is one of our future plans. In addition, we plan to exploit more methods for enhancing transformer-based models with external knowledge. Finally, we aim to contribute further to uncertainty estimation by exploiting Monte Carlo dropout [79].

REFERENCES

[1] World Health Organization. (2023). Stress. Accessed: Mar. 30, 2023. [Online]. Available: https://www.who.int/news-room/questions-and-answers/item/stress
[2] W. J. Friedman. (2023). Types of Stress and Their Symptoms. Accessed: Mar. 30, 2023. [Online]. Available: https://www.mentalhelp.net/blogs/types-of-stress-and-their-symptoms/
[3] World Health Organization. (2021). Depression. Accessed: Mar. 30, 2023. [Online]. Available: https://www.who.int/news-room/fact-sheets/detail/depression
[4] World Health Organization. (2017). Depression and Other Common Mental Disorders. Accessed: Mar. 30, 2023. [Online]. Available: https://www.who.int/publications/i/item/depression-global-health-estimates
[5] M. M. Tadesse, H. Lin, B. Xu, and L. Yang, "Detection of depression-related posts in Reddit social media forum," IEEE Access, vol. 7, pp. 44883–44893, 2019.
[6] S. Chandra Guntuku, A. Buffone, K. Jaidka, J. C. Eichstaedt, and L. H. Ungar, "Understanding and measuring psychological stress using social media," in Proc. Int. AAAI Conf. Web Social Media, vol. 13, no. 1, Jul. 2019, pp. 214–225.
[7] J. S. L. Figuerêdo, A. L. L. M. Maia, and R. T. Calumby, "Early depression detection in social media based on deep learning and underlying emotions," Online Social Netw. Media, vol. 31, Sep. 2022, Art. no. 100225.
[8] L. Ansari, S. Ji, Q. Chen, and E. Cambria, "Ensemble hybrid learning methods for automated depression detection," IEEE Trans. Computat. Social Syst., vol. 10, no. 1, pp. 211–219, Feb. 2023.
[9] N. Poerner, U. Waltinger, and H. Schütze, "E-BERT: Efficient-yet-effective entity embeddings for BERT," in Proc. Findings Assoc. Comput. Linguistics (EMNLP), 2020, pp. 803–818.
[10] N. Kassner and H. Schütze, "Negated and misprimed probes for pretrained language models: Birds can talk, but cannot fly," in Proc. 58th Annu. Meeting Assoc. Comput. Linguistics, 2020, pp. 7811–7818.
[11] N. Peinelt, M. Rei, and M. Liakata, "GiBERT: Enhancing BERT with linguistic information using a lightweight gated injection method," in Proc. Findings Assoc. Comput. Linguistics (EMNLP), 2021, pp. 2322–2336.
[12] R. Wang et al., "K-Adapter: Infusing knowledge into pre-trained models with adapters," in Proc. Findings Assoc. Comput. Linguistics (ACL-IJCNLP), 2021, pp. 1405–1418.
[13] J. Lu, D. Batra, D. Parikh, and S. Lee, "ViLBERT: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks," in Proc. Adv. Neural Inf. Process. Syst., Red Hook, NY, USA: Curran Associates, 2019.
[14] M. E. Peters et al., "Knowledge enhanced contextual word representations," in Proc. Conf. Empirical Methods Natural Lang. Process. 9th Int. Joint Conf. Natural Lang. Process. (EMNLP-IJCNLP), 2019, pp. 43–54.
[15] A. P. Dawid, "The well-calibrated Bayesian," J. Amer. Stat. Assoc., vol. 77, no. 379, pp. 605–610, Sep. 1982.
[16] A. H. Murphy and E. S. Epstein, "Verification of probabilistic predictions: A brief review," J. Appl. Meteorol., vol. 6, no. 5, pp. 748–755, Oct. 1967.
[17] C. S. Crowson, E. J. Atkinson, and T. M. Therneau, "Assessing calibration of prognostic risk scores," Stat. Methods Med. Res., vol. 25, no. 4, pp. 1692–1706, Aug. 2016.
[18] X. Jiang, M. Osl, J. Kim, and L. Ohno-Machado, "Calibrating predictive model estimates to support personalized medicine," J. Amer. Med. Inform. Assoc., vol. 19, no. 2, pp. 263–274, Mar. 2012.
[19] M. Raghu et al., "Direct uncertainty prediction for medical second opinions," in Proc. Mach. Learn. Res., vol. 97, Long Beach, CA, USA, Jun. 2019, pp. 5281–5290.
[20] R. Sawhney, A. Neerkaje, and M. Gaur, "A risk-averse mechanism for suicidality assessment on social media," in Proc. 60th Annu. Meeting Assoc. Comput. Linguistics (Short Papers), vol. 2, 2022, pp. 628–635.
[21] L. Fiorillo, P. Favaro, and F. D. Faraci, "DeepSleepNet-lite: A simplified automatic sleep stage scoring model with uncertainty estimates," IEEE Trans. Neural Syst. Rehabil. Eng., vol. 29, pp. 2076–2085, 2021.
[22] L. Liu, Y. Lu, Y. Luo, R. Zhang, L. Itti, and J. Lu, "Detecting 'smart' spammers on social network: A topic model approach," in Proc. NAACL Student Res. Workshop, San Diego, CA, USA: Association for Computational Linguistics, Jun. 2016, pp. 45–50.
[23] W. Rahman et al., "Integrating multimodal information in large pretrained transformers," in Proc. 58th Annu. Meeting Assoc. Comput. Linguistics, 2020, pp. 2359–2369.
[24] M. Jin and N. Aletras, "Complaint identification in social media with transformer networks," in Proc. 28th Int. Conf. Comput. Linguistics, 2020, pp. 1765–1771.
[25] R. Müller, S. Kornblith, and G. E. Hinton, "When does label smoothing help?" in Proc. Adv. Neural Inf. Process. Syst., vol. 32, Red Hook, NY, USA: Curran Associates, 2019, pp. 1–13.
[26] S. Muñoz and C. A. Iglesias, "A text classification approach to detect psychological stress combining a lexicon-based feature framework with distributional representations," Inf. Process. Manage., vol. 59, no. 5, Sep. 2022, Art. no. 103011.
[27] E. Turcan and K. McKeown, "Dreaddit: A Reddit dataset for stress analysis in social media," in Proc. 10th Int. Workshop Health Text Mining Inf. Anal. (LOUHI), 2019, pp. 97–107.
[28] K. Yang, T. Zhang, and S. Ananiadou, "A mental state knowledge-aware and contrastive network for early stress and depression detection on social media," Inf. Process. Manage., vol. 59, no. 4, 2022, Art. no. 102961.
[29] G. I. Winata, O. P. Kampman, and P. Fung, "Attention-based LSTM for psychological stress detection from spoken language using distant supervision," in Proc. IEEE Int. Conf. Acoust., Speech Signal Process. (ICASSP), Apr. 2018, pp. 6204–6208.
[30] H. Lin et al., "Detecting stress based on social interactions in social networks," IEEE Trans. Knowl. Data Eng., vol. 29, no. 9, pp. 1820–1833, Sep. 2017.
[31] E. Turcan, S. Muresan, and K. McKeown, "Emotion-infused models for explainable psychological stress detection," in Proc. Conf. North Amer. Chapter Assoc. Comput. Linguistics, Hum. Lang. Technol., 2021, pp. 2895–2909.
[32] J. Liu and M. Shi, "A hybrid feature selection and ensemble approach to identify depressed users in online social media," Frontiers Psychol., vol. 12, p. 6571, Jan. 2022.
[33] T. Nguyen, D. Phung, B. Dao, S. Venkatesh, and M. Berk, "Affective and content analysis of online depression communities," IEEE Trans. Affect. Comput., vol. 5, no. 3, pp. 217–226, Jul. 2014.
[34] S. Tsugawa, Y. Kikuchi, F. Kishino, K. Nakajima, Y. Itoh, and H. Ohsaki, "Recognizing depression from Twitter activity," in Proc. 33rd Annu. ACM Conf. Hum. Factors Comput. Syst., Apr. 2015, pp. 3187–3196.
[35] I. Pirina and Ç. Çöltekin, "Identifying depression on Reddit: The effect of training data," in Proc. EMNLP Workshop SMM4H, 3rd Social Media Mining Health Appl. Workshop Shared Task, 2018, pp. 9–12.
[36] M. A. Wani, M. A. ELAffendi, K. A. Shakil, A. S. Imran, and A. A. A. El-Latif, "Depression screening in humans with AI and deep learning techniques," IEEE Trans. Computat. Social Syst., early access, Sep. 28, 2022, doi: 10.1109/TCSS.2022.3200213.
[37] J. Kim, J. Lee, E. Park, and J. Han, "A deep learning model for detecting mental illness from user content on social media," Sci. Rep., vol. 10, no. 1, pp. 1–6, Jul. 2020.
[38] U. Naseem, A. G. Dunn, J. Kim, and M. Khushi, "Early identification of depression severity levels on Reddit using ordinal classification," in Proc. ACM Web Conf., Apr. 2022, pp. 2563–2572.
[39] S. Ghosh and T. Anwar, "Depression intensity estimation via social media: A deep learning approach," IEEE Trans. Computat. Social Syst., vol. 8, no. 6, pp. 1465–1474, Dec. 2021.
[40] H. Kour and M. K. Gupta, "An hybrid deep learning approach for depression prediction from user tweets using feature-rich CNN and bi-directional LSTM," Multimedia Tools Appl., vol. 81, no. 17, pp. 23649–23685, Jul. 2022.
[41] H. Zogan, I. Razzak, S. Jameel, and G. Xu, "Hierarchical convolutional attention network for depression detection on social media and its impact during pandemic," IEEE J. Biomed. Health Informat., early access, Feb. 9, 2023, doi: 10.1109/JBHI.2023.3243249.
[42] L. Ren, H. Lin, B. Xu, S. Zhang, L. Yang, and S. Sun, "Depression detection on Reddit with an emotion-based attention network: Algorithm development and validation," JMIR Med. Informat., vol. 9, no. 7, Jul. 2021, Art. no. e28754.
[43] M. Trotzek, S. Koitka, and C. M. Friedrich, "Utilizing neural networks and linguistic metadata for early detection of depression indications in text sequences," IEEE Trans. Knowl. Data Eng., vol. 32, no. 3, pp. 588–601, Mar. 2020.
[44] V. B. de Souza, J. C. Nobre, and K. Becker, "DAC stacking: A deep learning ensemble to classify anxiety, depression, and their comorbidity from Reddit texts," IEEE J. Biomed. Health Informat., vol. 26, no. 7, pp. 3303–3311, Jul. 2022.
[45] H. Zogan, I. Razzak, X. Wang, S. Jameel, and G. Xu, "Explainable depression detection with multi-aspect features using a hybrid deep learning model on social media," World Wide Web, vol. 25, no. 1, pp. 281–304, Jan. 2022.
[46] A.-S. Uban, B. Chulvi, and P. Rosso, "An emotion and cognitive based analysis of mental health disorders from social media data," Future Gener. Comput. Syst., vol. 124, pp. 480–494, Nov. 2021.
[47] H. Song, J. You, J.-W. Chung, and J. C. Park, "Feature attention network: Interpretable depression detection from social media," in Proc. 32nd Pacific Asia Conf. Lang., Inf. Comput., Hong Kong: Association for Computational Linguistics, Dec. 2018, pp. 1–3.
[48] S. Boinepelli, T. Raha, H. Abburi, P. Parikh, N. Chhaya, and V. Varma, "Leveraging mental health forums for user-level depression detection on social media," in Proc. 13th Lang. Resour. Eval. Conf., Marseille, France: European Language Resources Association, Jun. 2022, pp. 5418–5427.
[49] K. Anantharaman, A. S, R. Sivanaiah, S. Madhavan, and S. M. Rajendram, "SSN_MLRG1@LT-EDI-ACL2022: Multi-class classification using BERT models for detecting depression signs from social media text," in Proc. 2nd Workshop Lang. Technol. Equality, Diversity Inclusion, 2022, pp. 296–300.
[50] F. Nilsson and G. Kovács, "FilipN@LT-EDI-ACL2022-detecting signs of depression from social media: Examining the use of summarization methods as data augmentation for text classification," in Proc. 2nd Workshop Lang. Technol. Equality, Diversity Inclusion, 2022, pp. 283–286.
[51] H. Zogan, I. Razzak, S. Jameel, and G. Xu, "DepressionNet: Learning multi-modalities with user post summarization for depression detection on social media," in Proc. 44th Int. ACM SIGIR Conf. Res. Develop. Inf. Retr., Jul. 2021, pp. 133–142.
[52] S. Ghosh, A. Ekbal, and P. Bhattacharyya, "What does your bio say? Inferring Twitter users' depression status from multimodal profile information using deep learning," IEEE Trans. Computat. Social Syst., vol. 9, no. 5, pp. 1484–1494, Oct. 2022.
[53] Z. Li, Z. An, W. Cheng, J. Zhou, F. Zheng, and B. Hu, "MHA: A multimodal hierarchical attention model for depression detection in social media," Health Inf. Sci. Syst., vol. 11, no. 1, p. 6, Jan. 2023.
[54] J. C. Cheng and A. L. P. Chen, "Multimodal time-aware attention networks for depression detection," J. Intell. Inf. Syst., vol. 59, no. 2, pp. 319–339, Oct. 2022.
[55] G. Shen et al., "Depression detection via harvesting social media: A multimodal dictionary learning solution," in Proc. 27th Int. Joint Conf. Artif. Intell., Aug. 2017, pp. 3838–3844.
[56] T. Gui et al., "Cooperative multimodal approach to depression detection in Twitter," in Proc. AAAI Conf. Artif. Intell., vol. 33, no. 1, Jul. 2019, pp. 110–117.
[57] D. Zhou, J. Yuan, and J. Si, "Health issue identification in social media based on multi-task hierarchical neural networks with topic attention," Artif. Intell. Med., vol. 118, Aug. 2021, Art. no. 102119.
[58] Y. Wang, Z. Wang, C. Li, Y. Zhang, and H. Wang, "Online social network individual depression detection using a multitask heterogenous modality fusion approach," Inf. Sci., vol. 609, pp. 727–749, Sep. 2022.
[59] S. M. Mohammad and P. D. Turney, "Crowdsourcing a word-emotion association lexicon," Comput. Intell., vol. 29, no. 3, pp. 436–465, Aug. 2013.
[60] J. W. Pennebaker, M. E. Francis, and R. J. Booth, Linguistic Inquiry and Word Count: LIWC 2001. Mahwah, NJ, USA: Lawrence Erlbaum Associates, 2001.
[61] R. L. Boyd, A. Ashokkumar, S. Seraj, and J. W. Pennebaker, The Development and Psychometric Properties of LIWC-22. Austin, TX, USA: Univ. Texas Austin, 2022.
[62] D. M. Blei, A. Y. Ng, and M. I. Jordan, "Latent Dirichlet allocation," J. Mach. Learn. Res., vol. 3, pp. 993–1022, Mar. 2003.
[63] D. Angelov, "Top2Vec: Distributed representations of topics," 2020, arXiv:2008.09470. [Online]. Available: https://github.com/ddangelov/Top2Vec
[64] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, "BERT: Pre-training of deep bidirectional transformers for language understanding," in Proc. Conf. North Amer. Chapter Assoc. Comput. Linguistics, Hum. Lang. Technol., vol. 1, Jun. 2019, pp. 4171–4186.
[65] S. Ji, T. Zhang, L. Ansari, J. Fu, P. Tiwari, and E. Cambria, "MentalBERT: Publicly available pretrained language models for mental healthcare," in Proc. 13th Lang. Resour. Eval. Conf., Marseille, France: European Language Resources Association, Jun. 2022, pp. 7184–7190.
[66] J. L. Ba, J. R. Kiros, and G. E. Hinton, "Layer normalization," 2016, arXiv:1607.06450.
[67] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, "Dropout: A simple way to prevent neural networks from overfitting," J. Mach. Learn. Res., vol. 15, no. 56, pp. 1929–1958, 2014.
[68] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna, "Rethinking the inception architecture for computer vision," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2016, pp. 2818–2826.
[69] T. Wolf et al., "Transformers: State-of-the-art natural language processing," in Proc. Conf. Empirical Methods Natural Lang. Process., Syst. Demonstrations, Stroudsburg, PA, USA: Association for Computational Linguistics, Oct. 2020, pp. 38–45.
[70] A. Paszke et al., "PyTorch: An imperative style, high-performance deep learning library," in Proc. Adv. Neural Inf. Process. Syst., vol. 32, Red Hook, NY, USA: Curran Associates, 2019.
[71] P. Mishra, M. Del Tredici, H. Yannakoudakis, and E. Shutova, "Author profiling for abuse detection," in Proc. 27th Int. Conf. Comput. Linguistics, Santa Fe, NM, USA: Association for Computational Linguistics, Aug. 2018, pp. 1088–1098.
[72] M. P. Naeini, G. Cooper, and M. Hauskrecht, "Obtaining well calibrated probabilities using Bayesian binning," in Proc. 29th AAAI Conf. Artif. Intell., 2015, pp. 1–7.
[73] J. Nixon, M. W. Dusenberry, L. Zhang, G. Jerfel, and D. Tran, "Measuring calibration in deep learning," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR) Workshops, Jun. 2019, pp. 1–4.
[74] C. Guo, G. Pleiss, Y. Sun, and K. Q. Weinberger, "On calibration of modern neural networks," in Proc. 34th Int. Conf. Mach. Learn. (ICML), vol. 70, Aug. 2017, pp. 1321–1330.
[75] L. Ilias and D. Askounis, "Explainable identification of dementia from transcripts using transformer networks," IEEE J. Biomed. Health Informat., vol. 26, no. 8, pp. 4153–4164, Aug. 2022.
[76] Y. Benjamini and Y. Hochberg, "Controlling the false discovery rate: A practical and powerful approach to multiple testing," J. Roy. Stat. Soc., Ser. B, Methodol., vol. 57, no. 1, pp. 289–300, Jan. 1995.
[77] J. W. Pennebaker and T. C. Lay, "Language use and personality during crises: Analyses of mayor Rudolph Giuliani's press conferences," J. Res. Personality, vol. 36, no. 3, pp. 271–282, Jun. 2002.
[78] M. Sundararajan, A. Taly, and Q. Yan, "Axiomatic attribution for deep networks," in Proc. 34th Int. Conf. Mach. Learn., vol. 70, Aug. 2017, pp. 3319–3328.
[79] Y. Gal and Z. Ghahramani, "Dropout as a Bayesian approximation: Representing model uncertainty in deep learning," in Proc. 33rd Int. Conf. Mach. Learn., New York, NY, USA, vol. 48, Jun. 2016, pp. 1050–1059.

Loukas Ilias received the integrated master's degree from the School of Electrical and Computer Engineering (SECE), National Technical University of Athens (NTUA), Athens, Greece, in June 2020, where he is currently pursuing the Ph.D. degree with the Decision Support Systems (DSS) Laboratory, SECE. He has completed a Research Internship with University College London (UCL), London, U.K.
He is a Researcher with the DSS Laboratory, NTUA, where he is involved in EU-funded research projects. He has published in numerous journals, including IEEE JOURNAL OF BIOMEDICAL AND HEALTH INFORMATICS, Expert Systems With Applications (Elsevier), Applied Soft Computing (Elsevier), Computer Speech and Language (Elsevier), IEEE ACCESS, and Frontiers in Aging Neuroscience. His research has also been accepted for presentation at international conferences, including the IEEE-Engineering in Medicine and Biology Society (EMBS) International Conference on Biomedical and Health Informatics (BHI'22) and the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP) 2023. His research interests include speech processing, natural language processing, social media analysis, and the detection of complex brain disorders.

Spiros Mouzakitis received the Ph.D. degree from the School of Electrical and Computer Engineering, National Technical University of Athens, Athens, Greece, in 2009.
He has more than ten years of industry experience in software engineering for banking, e-commerce systems, and energy suppliers. He worked as a Project Manager for 16 years in more than 20 projects (H2020, FP7, FP6, and FP5 research projects) related to big data, open data, market/impact analysis in the context of ICT technologies, web development, and enterprise interoperability. He is a Senior Research Analyst with the Decision Support Systems Laboratory, National Technical University of Athens. He has published in numerous journals and presented his research at international conferences. His current research is focused on decision analysis in the field of decision support systems based on machine learning/deep learning, big and linked data analytics, as well as optimization systems and algorithms.

Dimitris Askounis was the Scientific Director of over 50 European research projects in the above areas (FP7, Horizon 2020, and so on). For a number of years, he was an Advisor to the Minister of Justice and the Special Secretary for Digital Convergence for the introduction of information and communication technologies in public administration. Since June 2019, he has been the President of the Information Society SA, Kallithea, Greece. He is currently a Professor at the School of Electrical and Computer Engineering, National Technical University of Athens (NTUA), Athens, Greece, and the Deputy Director of the Decision Support Systems Laboratory. He has over 25 years of experience in decision support systems, intelligent information systems and manufacturing, e-business, e-government, open and linked data, big data analytics, Artificial Intelligence (AI) algorithms, and the application of modern Information Technology (IT) techniques in the management of companies and organizations.