Applying Human-in-the-Loop to construct a dataset for determining content reliability to combat fake news
Keywords: Natural language processing; Fake news detection; Assisted annotation; Dataset construction; Human-in-the-Loop Artificial Intelligence; Active learning

Abstract: Annotated corpora are indispensable tools to train computational models in Natural Language Processing. However, in the case of more complex semantic annotation processes, annotation is a costly, arduous, and time-consuming task, resulting in a shortage of resources to train Machine Learning and Deep Learning algorithms. In consideration, this work proposes a methodology, based on the human-in-the-loop paradigm, for the semi-automatic annotation of complex tasks. This methodology is applied in the construction of a reliability dataset of Spanish news so as to combat disinformation and fake news. We obtain a high-quality resource by implementing the proposed methodology for semi-automatic annotation, increasing annotator efficacy and speed with fewer examples. The methodology consists of three incremental phases and results in the construction of the RUN dataset. The resource was evaluated in terms of time reduction (an annotation time reduction of almost 64% with respect to fully manual annotation), annotation quality (measuring consistency of annotation and inter-annotator agreement), and performance, by training a model with the RUN semi-automatic dataset (Accuracy 95%, F1 95%), validating the suitability of the proposal.
When a problem is approached from the AI perspective, either with Machine Learning (ML) or Deep Learning (DL) techniques, a large and costly amount of human-labeled instances is required to construct the datasets that will be used to train and evaluate the systems in charge of solving the problem (Stenetorp et al., 2012). With current pre-trained models (i.e. those based on the Transformer architecture (Vaswani et al., 2017)), the number of examples required is smaller than in classical approaches, but obtaining them is still very expensive.

Building efficient datasets is a complex task, as dataset annotations can have different degrees of difficulty. This may imply not only a time cost but also the need for a high level of expertise in a given annotation. An efficient dataset would be one that can be created as quickly and inexpensively as possible and that also includes the most appropriate annotated examples to assist learning for problem-solving. Our challenge is to create tailor-made quality datasets, selected according to specific criteria, that increase, or at least maintain, accuracy while saving time and effort. This would result in larger and more efficient datasets that combine both automatic and manual annotation. Moreover, the datasets need to be constantly updated at a minimum cost so that the tools derived from them do not become obsolete.

Hence, the overall objective of this work is to implement a novel methodology for semi-automatic dataset construction that allows the efficient and effective generation of quality resources. In our research, we focus on news content reliability to support disinformation detection, but the methodology could be easily adapted and applied to any semantically complex annotation task.

To address the overall objective, we focus on a paradigm called Human-in-the-Loop (HITL), an extensive area of research that covers the intersection of computer science, cognitive science, and psychology, and that is, of course, being applied in the AI area. Specifically, HITL machine learning is a set of strategies for combining human and machine intelligence in applications that use AI (Monarch, 2021).

As a result, the following contributions to this research area are provided:

• The design and implementation of an innovative HITL-based methodology for semi-automatic annotation in complex annotation tasks, thereby assisting annotators and optimizing resources and performance, while facilitating the periodic updating of the language models used by the tools.
• The creation of an efficiently annotated dataset by applying the proposed methodology, which is a fundamental requirement for AI and NLP tasks. In this work, the dataset is applied to disinformation detection and, more specifically, to reliable and unreliable information in news articles in Spanish.
• The evaluation of the quality and the benefits of applying this type of HITL-based methodology for the semi-automatic construction of datasets. The performance is determined in terms of a balance between time-effort consumption, annotation quality of the dataset, and accuracy achievement.
• Making the semi-automatically generated dataset available to the research community (at https://gplsi.dlsi.ua.es/resources/NewsReliabilityAnnotation), once the validity of the generated dataset has been corroborated.

This paper is structured as follows: Section 2 presents an overview of the most relevant scientific literature concerning disinformation datasets, human-in-the-loop AI, and corpus construction methodologies; Section 3 details RUN-AS, the annotation guideline created for our research; Section 4 introduces the methodology for semi-automatic annotation of datasets; Section 5 presents the specific implementation of the methodology; Section 6 describes the evaluation framework and discussion; and finally Section 7 presents the conclusions of this research and future work.

2. Background

Our research is based on the application of AI to disinformation detection, where the dataset becomes the cornerstone. In this context, we face two challenges. The first is the lack of high-quality datasets with labeled examples in Spanish on which statistical models for automatic disinformation detection can be trained. The second is the time and effort required to obtain the examples. As stated by Alex et al. (2010), very little work has been done to obtain better annotation methods that maximize both the quality and quantity of the annotated data.

This section presents the state of the art related to disinformation datasets (Section 2.1), followed by the literature regarding the human-in-the-loop concept (Section 2.2), and finally the different methodologies usually adopted in corpus construction in NLP (Section 2.3).

2.1. Disinformation corpora

According to the literature consulted, most of the corpora created to address disinformation or fake news detection use a binary classification, categorizing news as Fake or True (Salem et al., 2019; Silva et al., 2020). Others, such as those focused on fact-checking tasks, use a fine-grained scale of labels covering several veracity degrees (Wang, 2017; Vlachos and Riedel, 2014). However, in all cases, the annotation is global for the whole document, and the veracity or reliability of the different parts of the document is not considered. This single global classification of news, whether with binary or multiple values, depends on external knowledge, such as fact-checking platforms. Few datasets use a reliability classification, and this classification is usually applied on the basis of the source's credibility (Dhoju et al., 2019) and not of purely textual or linguistic characteristics (Assaf and Saheb, 2021).

To the authors' knowledge, corpora that address the disinformation task in Spanish (Posadas-Durán et al., 2019) are scarce, since such resources are usually released in English. Therefore, we aim to create a Spanish resource to train models for this task. Furthermore, all the disinformation datasets found in the literature were created entirely manually; thus, as explained in the next section, we introduce more efficient ways to create datasets.

2.2. Human-in-the-loop machine learning

Supervised learning accounts for approximately 90% of today's machine learning-based applications, i.e. they learn from examples created by humans. As stated by Okoro et al. (2018), there is a need for a hybrid model solution that combines the efforts of both humans and machines. Due to the complexity of our semantic annotation proposal, and given that researchers spend more time generating data than building machine learning models (Monarch, 2021), we focus on HITL methodologies to increase the efficacy of our work. HITL is an umbrella term for the new types of interactions between humans and machine learning algorithms (Mosqueira-Rey et al., 2022).

HITL-AI systems continuously improve because of human input, addressing the limitations of previous AI solutions and bridging the gap between machines and humans. These systems aim at leveraging the ability of AI to scale processing to very large amounts of data while relying on human intelligence to perform very complex tasks, such as natural language understanding (Demartini et al., 2020). The HITL methodology is being used in several studies to increase efficiency in data collection, such as in the cases of Fanton et al. (2021) and Cañizares-Díaz et al. (2021), since the continuous executive loop contributes to higher accuracy and stronger robustness of the systems. Fanton et al. (2021) proposed a novel human-in-the-loop data collection methodology in which a generative language model is refined iteratively by using its own data from the previous loops to generate new training samples that experts review and/or post-edit.
Cañizares-Díaz et al. (2021) apply the HITL active learning approach to reduce the human effort required during the annotation of natural language corpora composed of entities and semantic relations. The approach assists human annotators by intelligently selecting the most informative sentences to annotate and then pre-annotating them with a few highly accurate entities and semantic relations. Finally, a survey of existing works on human-in-the-loop from a data perspective is presented in Wu et al. (2022), summarizing the major approaches in the field along with their technical strengths and weaknesses, and classifying them into three main categories: (i) work on improving model performance from data processing; (ii) work on improving model performance through interventional model training; and (iii) the design of the system-independent human-in-the-loop.

One of the principles of HITL is to assist human tasks with machine learning to increase efficiency. In line with this, our work builds a semi-automatically annotated dataset, but HITL is used in many parts of the ML cycle, from sampling unlabeled data to updating the model.

Depending on who is in control of the learning process, we can identify different approaches to HITL-ML (Mosqueira-Rey et al., 2022):

• Active Learning (AL): Active learning is a widely used HITL strategy applied when obtaining labeled data demands a large amount of time or money, since AL aims at selecting examples with high utility for the model (Tomanek et al., 2007), thereby increasing the performance of the learning model while reducing the amount of annotated data required (Kholghi et al., 2016). In AL methods, labels are collected from humans, fed back to a supervised learning model, and used to decide which data items humans should label next (Spina et al., 2015). AL is applied in several ML tasks such as object detection, semantic segmentation, sequence labeling or language generation (Monarch, 2021). In our work, AL is focused on disinformation detection, and it enables us to optimize performance with fewer but better chosen news documents to incorporate into our training set. An important aspect of AL is the iterative process, since it allows retraining on human feedback, which in turn enables the system to improve in terms of accuracy. Two broad active learning sampling strategies are (Monarch, 2021):
  – Uncertainty sampling. This is the set of strategies for identifying unlabeled items that are near a decision boundary in the current machine learning model.
  – Diversity sampling. This is the set of strategies for identifying unlabeled items that are underrepresented or unknown for the machine learning model in its current state. The goal of this sampling is to target new, unusual and underrepresented items for annotation to give the model a more complete picture of the problem space.
• Interactive Machine Learning (IML): IML is an active machine learning technique in which models are designed and implemented with humans in the loop (Fails and Olsen, 2003; Wondimu et al., 2022). When using machine learning in an interactive design setting, feature selection must be automatic rather than manual, and classifier training time must be relatively fast. There is a closer interaction between users and learning systems, with people interactively supplying information in a more focused, frequent, and incremental way compared to traditional machine learning (Amershi et al., 2014). Ramos et al. (2020) define IML as ''the process in which a person (or people) engages with a learning algorithm, in an interactive loop, to generate useful artifacts''. The authors describe these artifacts as data, insights about data, or machine-learned models. The difference between AL and IML lies more in who has control of the learning process than in the interactivity of the approach. While in AL the model retains control and uses the human as an oracle, in IML there is a closer interaction between users and learning systems, so the control is shared. AL focuses on building better models in an algorithm-centered evaluation, but in IML systems human factors have to be taken into account, so there is also a human-centered evaluation, focusing on the utility and effectiveness of the application for end-users. AL is considered the basis for IML (Mosqueira-Rey et al., 2022).
• Machine Teaching (MT): MT describes the idea of a teacher who teaches an ML model to an ML algorithm. In MT, human domain experts have control over the learning process by delimiting the knowledge that they intend to transfer to the machine learning model (Ramos et al., 2020; Simard et al., 2017). Even though the MT paradigm is quite different in nature from the other paradigms described in this paper (and represents an alternative to them), there are many common factors. Over time, the process has become iterative and incremental. Occasionally, it has been inspired by other approaches, such as active learning. MT has, at times, ended up obtaining results that are comparable to those of other techniques, such as curriculum learning (a training strategy that trains a machine learning model from easier data to harder data, imitating the meaningful learning order in human curricula).

Beyond these well-known approaches, the HITL paradigm encompasses all those strategies that combine two goals: improving the accuracy of the ML application via human input, and facilitating the human task with the aid of ML. In our proposal, both goals are involved in the design of a methodology to create a semi-automatic annotation platform that enables an increase in the amount of annotated data, reaching the target accuracy more quickly and easily.

HITL-ML has been successfully applied in a variety of areas such as government (Benedikt et al., 2020), medicine (Budd et al., 2021), and energy (Jung and Jazizadeh, 2019). More specifically, regarding the application of HITL to dis- and mis-information detection, some works are key. Demartini et al. (2020) presented the challenges and opportunities of combining automatic and manual fact-checking approaches to misinformation, developing a human-AI framework; this work is more focused on fact-checking than on reliability. Additionally, Daniel (2021) proposed a human-AI hybrid disinformation detection system that performs as follows: the human user identifies a topic or claim about which they believe disinformation will be found, and seeks to learn more about what is being said and (possibly) who is saying it. The machine learning algorithm, taking in the topic/claim and large amounts of text scraped from the internet, is able, by using the appropriate approval/disapproval relationships (as decided by the user), to separate the text into two groups: relevant disinformation, and everything else. Closing the loop, the human can determine whether the result satisfies their interests and, if necessary, revise the search and run the process again. The paper aimed to determine which of stance, sentiment (or other) techniques is best suited for use in this human-in-the-loop disinformation detection. The author indicates that the results obtained are not conclusive, but the data suggest that sentiment analysis algorithms outperform the stance detection algorithms; however, sentiment analysis and stance methods should be studied further for this purpose. Although these two papers address the issue of disinformation using HITL techniques, they take different perspectives from the one presented in this paper.
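To make the two sampling families concrete, the following minimal Python sketch ranks an unlabeled pool by combining an entropy-based uncertainty score with a distance-based diversity score. It illustrates the general strategies described in Section 2.2, not the implementation used in this work; the `alpha` weighting is an assumption.

```python
# Illustrative sketch of uncertainty and diversity sampling for active learning.
import numpy as np

def uncertainty_scores(probabilities: np.ndarray) -> np.ndarray:
    """Entropy of the predicted class distribution for each unlabeled item.

    `probabilities` has shape (n_items, n_classes), e.g. the output of a
    scikit-learn classifier's predict_proba().
    """
    eps = 1e-12
    return -(probabilities * np.log(probabilities + eps)).sum(axis=1)

def diversity_scores(unlabeled_vecs: np.ndarray, labeled_vecs: np.ndarray) -> np.ndarray:
    """Mean distance of each unlabeled item to the already-labeled items.

    Higher values indicate items that are underrepresented in the labeled set.
    """
    dists = np.linalg.norm(unlabeled_vecs[:, None, :] - labeled_vecs[None, :, :], axis=2)
    return dists.mean(axis=1)

def rank_for_annotation(probabilities, unlabeled_vecs, labeled_vecs, alpha=0.5):
    """Combine both criteria and return item indices, most informative first."""
    score = alpha * uncertainty_scores(probabilities) \
            + (1 - alpha) * diversity_scores(unlabeled_vecs, labeled_vecs)
    return np.argsort(-score)
```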
2.3. Corpora construction methodologies

The design, creation and annotation of a corpus is an essential task in the development of tools and datasets in NLP but, as stated by Stenetorp et al. (2012), ''annotation is also one of the most time-consuming and financially costly components of many NLP research efforts''. Nowadays, the number of labeled datasets available for training purposes is low, and data collection is one of the challenges in deception research due to the scarce availability of such datasets (Saquete et al., 2020). This scarcity is due to the time and cost that the annotation task requires, because annotating and compiling a corpus demands effort, time, consistency, and human expertise. This subject is at the forefront of NLP research, and particularly of disinformation detection research, since ''the development of new resources such as annotated corpora can help to increase the performance of automatic methods aiming at detecting this kind of news'' (Posadas-Durán et al., 2019).

According to the literature consulted, corpus construction in NLP can be approached via several methodologies. Even if there are cases in which both the compilation and the annotation tasks are completely automated (Abacha et al., 2015) or carried out manually (Evrard et al., 2020), most of the corpora released for the disinformation task follow semi-automatic, but not intelligent, methodologies. In the semi-automatic approach, data collection is mostly carried out in an automatic way via social media, fact-checking website APIs, and web crawling or web scraping, whereas the annotation task is mostly carried out manually by experts, as in the corpora introduced by Shahi and Nandini (2020) and Wang (2017). Even if manual annotation allows quality examples to be obtained, created and verified by experts, it is an arduous process that leads to small-size resources that require more time to achieve the desired goal.

Another type of methodology is crowdsourcing, in which both compilation and annotation can be automatic or manual, such as the corpora introduced by Mitra and Gilbert (2015), Färber et al. (2020), and Pérez-Rosas and Mihalcea (2015). This practice enables the bulk outsourcing of multiple labeling tasks, typically with low overall cost and fast completion (Hsueh et al., 2009). It facilitates the creation of larger training datasets, but the quality is often lower than that of corpora developed by teams of experts working in the same field and cooperating in the same research group.

Besides semi-automatic and crowdsourced corpora, there is increasing interest in applying supervised or semi-supervised learning to build corpora (Feller et al., 2018). Fairly extensive research grew out of the Text REtrieval Conference (TREC, https://trec.nist.gov/overview.html) and resulted in enhancing the fairness and efficiency of annotation tasks. The TREC task was based on judging document relevance, not token-level annotations such as those performed in the present work, but the approaches presented constitute a very important basis for dataset generation methodologies (Voorhees, 2018; Vu and Gallinari, 2006). Furthermore, applying human-in-the-loop strategies, and especially Active Learning, to obtain datasets more efficiently is not new (Olsson, 2009). These approaches enable the creation of quality resources supervised by human experts, thus obtaining a considerable corpus through automation while keeping the quality of the human process. In this case, the system makes decisions in an automatic way but under the supervision of the expert, who corrects, validates or refutes those decisions (Cañizares-Díaz et al., 2021; Rahman et al., 2020). Most of the research in the literature on facilitating the annotation of entities in datasets, through supervised learning or human-in-the-loop, is applied in the medical domain (Kholghi et al., 2017; Tchoua et al., 2019; Settles et al., 2007).

Therefore, considering the task of disinformation and fake news detection, and taking into account that this task requires evidence to justify why a certain decision has been made about the veracity of a news item, a finer-grained annotation is needed that allows for the explainability of the model and of the obtained veracity classification, instead of a single veracity value for the whole document as in state-of-the-art works. At the same time, this finer-grained annotation also makes the work of annotating the datasets difficult and costly, so it is necessary to find a methodology that allows the construction of these datasets in an efficient and effective way. The HITL proposals collaborate in this task of improving efficiency, and they have been successfully tested in other fields. The novelty of this work is to propose a methodology that can be generalized to any complex annotation task and that facilitates the creation of these complex, high-quality resources.

To the best of the authors' knowledge, none of the works presented in the literature addresses the annotation process for reliability detection within the disinformation context in Spanish by means of the HITL paradigm.

Although it is not the main aim of this work, Section 3 explains the peculiarities of the annotation scheme proposed for the disinformation dataset used, so that the construction process can be better understood, given its complexity on account of its high semantic and linguistic load.

3. RUN-AS annotation proposal

Our work is focused on the annotation of news collected from digital newspapers in Spanish and belonging to different domains, and it is based on two well-known journalistic techniques: the Inverted Pyramid and the 5W1H.

Regarding the journalistic structure, well-built news using the inverted pyramid tends to present five common parts, placed in order of relevance (Zhang and Liu, 2016): TITLE, SUBTITLE, LEAD, BODY and CONCLUSION. According to Thomson et al. (2008), neutrality and the inverted pyramid structure are distinctive features of hard news.

In terms of content, well-built articles present semantic information represented through a journalistic technique known as the 5W1H, whose elements ''clearly describe key information of news in an explicit manner'' (Zhang et al., 2019). The technique consists of answering six key questions: WHO?, WHAT?, WHEN?, WHERE?, WHY?, and HOW?. These questions allow the extraction of semantic information related to a news item and ''are essential for people to understand the whole story'' (Wang et al., 2010). Hamborg et al. (2018) explain that journalists typically answer these questions to describe the main event of a news story and that they are answered within the first few sentences of a news article. These studies point out that the reliability of the information lies in the clearly identifiable existence of these items, as well as in the way they are expressed. This is why our annotation for measuring reliability is based on these two journalistic practices.

A fine-grained annotation guideline called RUN-AS (Reliable and Unreliable News Annotation Scheme, available at https://gplsi.dlsi.ua.es/resources/NewsReliabilityAnnotation) has been designed to train our dataset, specifically created for the reliability detection of news. The novelty of this annotation scheme lies in the reliability classification based on purely textual, linguistic, and semantic analysis (without depending on external knowledge). RUN-AS presents three levels of annotation: structure (Inverted Pyramid), content (5W1H), and Elements of Interest (textual clues about formatting or phraseology that enable the detection of suspicious information). We propose a complex semantic annotation that is based on a multi-level annotation, two journalistic techniques, and an in-depth linguistic analysis. For this reason, our annotation required expert linguistic annotators as well as technical experts for building the algorithms and models. An example of this annotation is shown in Fig. 1.

To the authors' knowledge, current datasets focus on determining a global and single veracity value for news items. However, our proposal enables the annotation of essential content within a news item and assigns a reliability classification based on a purely textual, linguistic and semantic analysis that takes into account several elements such as vagueness, subjectivity, lack of evidence or emotionally charged content that influences reader opinions and feelings (Zhang et al., 2019). The complete reliability criteria, based on the concepts of accuracy and neutrality, are fully defined in Bonet-Jover et al. (2023).
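As an illustration only (field names are hypothetical and do not reproduce the official RUN-AS format), an annotated news item combining the three annotation levels could be represented as follows:

```python
# Hypothetical in-memory representation of a RUN-AS-style annotated news item.
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Span:
    label: str                          # e.g. "TITLE", "WHAT", "KEY_EXPRESSIONS"
    start: int                          # character offsets in the news text
    end: int
    reliability: Optional[str] = None   # "Reliable" / "Unreliable" for 5W1H spans

@dataclass
class AnnotatedNews:
    text: str
    structure: List[Span] = field(default_factory=list)             # Inverted Pyramid parts
    content_5w1h: List[Span] = field(default_factory=list)          # WHO, WHAT, WHEN, WHERE, WHY, HOW
    elements_of_interest: List[Span] = field(default_factory=list)  # textual clues
    global_label: str = "Reliable"                                   # document-level class

news = AnnotatedNews(
    text="Titular de ejemplo. El lunes, el ministerio anuncio nuevas medidas.",
    structure=[Span("TITLE", 0, 19)],
    content_5w1h=[Span("WHEN", 20, 28, reliability="Reliable")],
)
```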
Fig. 1. Example of part of the structure (Inverted Pyramid) and content (5W1H) labels of a news item.
In our annotation guideline, the emotional charge is marked in the Elements of Interest, especially with the labels KEY_EXPRESSIONS, author_stance of the QUOTE, and style of the TITLE. All this information is provided in the reference included.

For predicting the veracity of a news item, world knowledge is essential, but what our proposal aims to achieve is not to detect veracity, but rather the features that can be decisive in classifying a news item as reliable or unreliable, thereby providing support to users and journalists via useful information at a first-level, text-only annotation.

In our annotation procedure, one expert linguistic annotator was involved to ensure the coordination and harmonization of the annotation process as well as compliance with the annotation guideline previously described. News items were annotated using Brat (https://brat.nlplab.org/), an intuitive and practical annotation tool.

4. Human-in-the-loop based methodology for semi-automatic annotation

This section presents the design of a HITL-based methodology to semi-automate the dataset construction task. In order to simplify compilation and annotation tasks, the HITL paradigm was used to gradually automate the tasks involved in the construction of the dataset. This minimizes the effort of the human participant in the annotation, and creates larger and less costly datasets. This methodology could be easily adapted to any complex annotation task, optimizing the annotation procedure, as we will discuss in Section 5.4.

The methodology proposed here was performed in three phases, consisting of gradually integrating automation into the annotation process and observing the changes compared to the fully manual annotation. The news items were annotated in small batches and following different strategies in each phase, but keeping the same number of news items for each batch and always following the annotation scheme.

Prior to the main procedure, there was a data collection stage to source a large set of news items from various news and information providers.

Fig. 2 shows a high-level diagram of the three phases of the methodology based on HITL: Phase 1: manual compilation and annotation of the corpus; Phase 2: automated compilation and manual annotation; and Phase 3: automated compilation and semi-automatic annotation.

4.1. Phase 1: Reporting on the manual compilation and annotation

In the first phase, news was compiled and annotated in an entirely manual way, as illustrated in step 1 of Fig. 2. Concerning the compilation task, a total of 40 news items in Spanish was collected from 9 sources. Then, the news was manually annotated following the RUN-AS guideline (see Section 3).

This first phase was an arduous and slow process, since searching for news items one by one and annotating them from scratch was time-consuming.

The result of this phase was the first version of the dataset (the green cylinder in Fig. 2): labeled news items that were used as input for Phase 2.

4.2. Phase 2: Automated compilation and manual annotation

The second phase introduced a well-known HITL strategy to increase the productivity of the annotation process. Specifically, an active learning approach was implemented in this phase, where the human annotator interacted with a machine learning model that automatically selected the most informative documents to annotate. Active learning was chosen over other existing HITL techniques because it provided an easy and unobtrusive way to enhance the annotator's performance, without requiring additional training of annotators. In fact, annotators do not even need to know there is a machine learning model selecting the documents to annotate. This means that we can leverage annotators with previous experience in the domain even though they may not be technically savvy enough to participate in more complex HITL scenarios.

The process involved in this work is described next. Starting from a small batch of annotated news items (from Phase 1), a supervised model was trained and applied to a larger batch from an unlabeled news pool (orange cylinder). For each item, an informativeness metric was computed (see Section 5.1) based on a balance between model uncertainty and content diversity. All unlabeled news items were sorted by this score, and a tentative list of suggestions was created by interleaving news items from different sources. Thus, in this list, the most informative news items appeared first, taking into consideration content diversity. From this list of suggestions, an expert annotator filtered out those that did not follow the language, format, extension, or other semantic characteristics desired in the corpus. The final list consisted of the K most informative news items that fit all the desired criteria evaluated prior to annotation, balanced in terms of the original sources. Finally, this batch of K news items was annotated and added to the training set (steps 6 and 7 in Fig. 2), and the whole active learning cycle was repeated. In Fig. 2, after steps 3, 4 and 5, the model selected the most appropriate news items to be annotated from the unlabeled news pool.

Specifically, a total of 4 batches was performed in this phase, and a total of 10 news items was selected in each batch. Thus, after Phase 2 finished, 40 novel news items had been added to the corpus. As explained, these news items were manually annotated by an expert annotator, but their selection, which was based on an active learning strategy, helped to guarantee a minimum level of diversity and consistency that would have been difficult to attain with a purely manual selection.
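A compact sketch of the Phase 2 loop just described is shown below. The helper callables (training, informativeness scoring, source interleaving, expert filtering and annotation) are placeholders for the components detailed in Section 5; the batch size and number of batches follow the values reported above.

```python
# Schematic Phase 2 loop; the callables are hypothetical stand-ins for the
# actual components (see Section 5 for the concrete informativeness metric).
def phase2_active_learning(labeled, unlabeled_pool, train_model, informativeness,
                           interleave_by_source, expert_accepts, expert_annotate,
                           batch_size=10, n_batches=4):
    """Active-learning-driven compilation with manual annotation."""
    for _ in range(n_batches):
        model = train_model(labeled)                                   # retrain on current corpus
        scored = sorted(unlabeled_pool,
                        key=lambda doc: informativeness(model, doc, labeled),
                        reverse=True)                                  # most informative first
        suggestions = interleave_by_source(scored)                     # balance news sources
        candidates = [doc for doc in suggestions if expert_accepts(doc)]
        batch = candidates[:batch_size]                                # K selected items
        labeled = labeled + [expert_annotate(doc) for doc in batch]    # manual annotation
        unlabeled_pool = [doc for doc in unlabeled_pool if doc not in batch]
    return labeled
```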
4.3. Phase 3: Automated compilation and semi-automatic annotation

Finally, the third phase was an evolution of the second phase, with the aim of significantly improving annotation times. This phase performed a human–machine interaction consisting of the human reviewing and improving the automatic pre-annotation of the dataset, provided by an ML-assisted labeling system, in a machine-human-machine loop. This resulted in an improvement of the reliability detection classification by increasing the size of the training dataset. Furthermore, the new examples reviewed by the human served to re-train the ML-assisted labeling system.

The novelty here compared with Phase 2 is the pre-annotation carried out by the system to assist the expert, so that the annotator does not label from scratch but only revises and completes the pre-annotation done by the system. In this stage, the system only pre-annotated the news items with the second annotation level (5W1H labels) defined in RUN-AS, because it is the more complex and time-consuming annotation level.

As seen in Fig. 2, human intervention remains important, since it is necessary to check that the automatic selection of news and the pre-annotation proposed by the semi-automatic system meet the criteria of the dataset. In this sense, the loop presented in Phase 2 is extended (steps 8, 9, and 10). This phase used a 5W1H model, previously trained (step 8) with 5W1H label examples (pink cylinder), to pre-annotate 5W1H labels (step 9) in the selected news. The last step in this phase is step 10, where the items pre-annotated according to the 5W1H model were edited by the annotators and the rest of the RUN-AS annotation was added. Finally, a new annotated batch was added to the dataset (step 7) to conclude one loop. The new corrected examples were also used to re-train the 5W1H pre-annotation model (this loop is not indicated in the figure for clarity).

In this stage, 40 news items were initially annotated in order to keep the same number of annotated news items as in the previous phases. However, after validating that the semi-automatic annotation accelerated the process (see Section 6.4), we decided to annotate another 50 news items (90 in total in this phase) in order to increase the dataset. A total of 9 batches was annotated.

5. Implementation of the methodology

Following the conceptual definition of the methodology, a specific implementation was performed for the semi-automatic annotation of news content reliability. The specific implementation of Phases 2 and 3 in this domain is fully explained next. In the second phase, 4 batches were performed, and in the third, 9 batches. All batches had 10 annotated news items. A proposal for the generalization of the methodology is presented in Section 5.4.

5.1. Phase 2 (active learning model)

The active learning model used in this phase was an implementation based on the proposal by Cañizares-Díaz et al. (2021) for entity and relation annotation. The original model consisted of two different classifiers, one for entity recognition and another for relation extraction. However, since our annotation scheme does not contain relations, only the entity classifier was used.

The model is based on a logistic regression classifier trained on token-level entity labels. Thus, a preprocessing of the annotated text was performed that transforms the Brat-based annotation, which is defined at the text span level, into a sequence of annotated tokens. The logistic regression model was fed with token-level syntactic, semantic (extracted with spaCy), and contextual features (i.e., the combined features of a small window of surrounding tokens).

To compute the informativeness of a news item, the trained model was executed on each sentence of the whole document, and the probability distributions of all possible labels for each token were stored. Based on this distribution, a token-level measure of entropy was computed as shown in Eq. (1), where p^{(t)}_{label} is the probability associated with a specific label in token t.

H(t) = -\sum_{label} p^{(t)}_{label} \log p^{(t)}_{label}    (1)
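A minimal sketch of this token-level pipeline is shown below: Brat character-offset spans are projected onto spaCy tokens and a logistic regression classifier is trained on simple windowed features. The specific feature set and the Spanish spaCy model name are assumptions for illustration, not the authors' exact configuration.

```python
# Illustrative token-level entity classifier over Brat-style span annotations.
import spacy
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

nlp = spacy.load("es_core_news_sm")  # Spanish spaCy model (assumed to be installed)

def tokens_and_labels(text, spans):
    """spans: list of (start, end, label) character offsets, as in a Brat .ann file."""
    doc = nlp(text)
    labels = []
    for tok in doc:
        lab = "O"
        for start, end, name in spans:
            if tok.idx >= start and tok.idx + len(tok) <= end:
                lab = name
                break
        labels.append(lab)
    return doc, labels

def token_features(doc, i, window=2):
    """Combined features of a small window of surrounding tokens."""
    feats = {}
    for off in range(-window, window + 1):
        j = i + off
        if 0 <= j < len(doc):
            feats[f"text[{off}]"] = doc[j].lower_
            feats[f"pos[{off}]"] = doc[j].pos_
    return feats

def train(texts_with_spans):
    X, y = [], []
    for text, spans in texts_with_spans:
        doc, labels = tokens_and_labels(text, spans)
        X += [token_features(doc, i) for i in range(len(doc))]
        y += labels
    model = make_pipeline(DictVectorizer(), LogisticRegression(max_iter=1000))
    model.fit(X, y)
    return model
```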
The overall entropy of a document D was computed as the mean entropy of all its tokens t ∈ D (see Eq. (2)), which corresponds to a standard interpretation of the annotation process as a stochastic process with independent decisions. This is a simplification, since the labels of a specific token are often correlated with the labels of nearby tokens. However, this simplification makes the problem tractable.

H(D) = \frac{1}{\|D\|} \sum_{t \in D} H(t)    (2)

Finally, a similarity factor sim(D_i, \mathbf{D}) was defined between every new document D_i and the set of annotated documents \mathbf{D}. This similarity was computed as the mean dot-product similarity between document D_i and all documents already annotated in \mathbf{D}, based on their doc2vec representation obtained with the Python library gensim (see Eq. (3)). This similarity factor was used to decrease the informativeness of potential outliers, e.g., news items in other languages, or documents that are not news items but nevertheless were included in the unlabeled set during data collection.

sim(D_i, \mathbf{D}) = \frac{1}{\|\mathbf{D}\|} \sum_{D_j \in \mathbf{D}} doc2vec(D_i) \cdot doc2vec(D_j)    (3)

The final informativeness score of a news item, I(D_i), was thus defined as the product of the document-level entropy and the similarity factor, the latter discounted by a factor β that balances between exploration and exploitation (see Eq. (4)). In our experiments β = 1: given no additional information, we chose β = 1 as a middle ground between exploration and exploitation, and further experimentation is necessary to fine-tune this parameter. This is the score by which news items were sorted before being presented to the expert annotator in Phase 2.

I(D_i) = H(D_i) \times sim(D_i, \mathbf{D})^{\beta}    (4)

Intuitively, the informativeness score can be interpreted as balancing two conflicting factors: diversity (H(·)) versus domain consistency (sim(·)). Using an entropy-based score encourages the model to prefer novel documents for which the uncertainty is higher. This is an indirect measure of diversity, but it is better than explicitly using the similarity measure because it directly leverages the classifier's learned hypothesis. Otherwise, documents with a very similar overall content, but that differ in a subtle way that completely changes the annotation (e.g., a negation), might not be considered. In turn, overall content is a good heuristic to detect out-of-domain documents or documents in a different language, which the rules used in the web scraping phase were unable to filter out.
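The following sketch shows how Eqs. (1)-(4) can be computed from the classifier's token-level label distributions and gensim doc2vec vectors. The Doc2Vec training parameters are illustrative; the paper only specifies that gensim doc2vec representations are used.

```python
# Sketch of the informativeness computation in Eqs. (1)-(4).
import numpy as np
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

def document_entropy(token_probas: np.ndarray) -> float:
    """token_probas: (n_tokens, n_labels) label distributions from the classifier."""
    eps = 1e-12
    token_entropy = -(token_probas * np.log(token_probas + eps)).sum(axis=1)  # Eq. (1)
    return float(token_entropy.mean())                                        # Eq. (2)

def similarity_factor(doc_tokens, annotated_docs_tokens, d2v: Doc2Vec) -> float:
    """Mean dot-product similarity with the already-annotated documents."""
    v = d2v.infer_vector(doc_tokens)
    sims = [np.dot(v, d2v.infer_vector(toks)) for toks in annotated_docs_tokens]
    return float(np.mean(sims))                                               # Eq. (3)

def informativeness(token_probas, doc_tokens, annotated_docs_tokens, d2v, beta=1.0):
    return document_entropy(token_probas) * similarity_factor(
        doc_tokens, annotated_docs_tokens, d2v) ** beta                       # Eq. (4)

# Fitting the doc2vec representation on the annotated pool (toy example).
corpus = [TaggedDocument(words=["noticia", "de", "ejemplo"], tags=["0"])]
d2v = Doc2Vec(corpus, vector_size=50, min_count=1, epochs=20)
```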
5.2. Phase 3 (Pre-annotation 5W1H)

In order to implement the third phase defined in Section 4.3, a model that annotated the 5W1H labels was required. The 5W1H labels consist of finding the answers to the questions WHAT, WHEN, WHO, WHERE, WHY, and HOW in the news item to be annotated, as explained in Section 3. To accomplish this task, we proposed using a question answering (QA) model available in the Hugging Face repository (https://huggingface.co/mrm8488/distill-bert-base-spanish-wwm-cased-finetuned-spa-squad2-es). This model was built by fine-tuning a distilled version of the BETO model (Canete et al., 2020) on the SQuAD-es-v2.0 dataset (Rajpurkar et al., 2016) to fit the QA task.

The 5W1H examples of the previous two phases were divided into three sets (training, development, and test) in order to adapt this model to our dataset, which is known as fine-tuning. The fine-tuning was performed through training and evaluation with the training and development sets. This process was carried out using the Simple Transformers library (https://simpletransformers.ai/). The initial hyperparameter settings for this fine-tuning are a maximum sequence length of 128, a batch size of 8, a learning rate of 4e-5, and training performed over 3 epochs. This model can be replicated from the GitHub repository at https://gplsi.dlsi.ua.es/resources/BETO_QA_SPANISH_5W1H_fine_tuning.

The training inputs of the model are the questions (5W1H), their context, and their respective answers. The model returns an answer as well as a score representing the probability of certainty associated with that answer. Fig. 3 shows the loss curves for training and evaluation, where the behavior of the model can be seen during three epochs of training. According to the graph in Fig. 3, after the first training epoch the loss in the training curve decreases from 1.41 to 0.96, and the loss in the evaluation curve increases from 6.19 to 6.49. This behavior remains in the third iteration, which indicates that the model is overfitting. We therefore selected the first iteration to annotate the 5W1H labels. At this point, we started the third phase of annotation with the fine-tuned 5W1H model.

Fig. 3. Loss curves using the training and development sets during the training process.

5.2.1. Fine-tuning performance of the QA model

We compared the performance of the QA model with and without fine-tuning. Table 1 shows the results of annotating with the QA model without fine-tuning on 5W1H and with the QA model fine-tuned on 5W1H. The main metrics for the QA task – Exact Match (EM) and F1 score – are included. These metrics are used to measure ML model performance on well-known datasets such as The Stanford Question Answering Dataset (SQuAD) (Rajpurkar et al., 2016). Additionally, we compute other metrics that allow us to measure similar and incorrect answers, because we consider them important for assessing the quality of the pre-annotation task. The definition of each metric is:

• Exact Match (EM): the number of exact matches of the predicted answer with the manual answers.
• Similar: the number of partial matches of the predicted answer with the manual answers.
• Incorrect: the number of predicted answers that do not match the manual responses.
• Overall EM: the percentage of exact matches over the number of predicted examples.
• F1: the F1 score is the harmonic mean of precision and recall (Grandini et al., 2020). Precision is the ratio of the number of overlapping words to the total number of words in the prediction, and recall is the ratio of the number of overlapping words to the total number of words in the ground truth (Rajpurkar et al., 2016).

Considering the QA metrics defined previously, our main objective is to maximize the number of exact matches (EM), reducing incorrect and similar matches as much as possible, as EM implies that annotators do not have to modify the pre-annotation provided by the system. As shown in Table 1, the QA models with fine-tuning obtained better results on all the metrics presented. The improvement is particularly noteworthy in overall EM and F1 after the second fine-tuning. Consequently, these results confirm that fine-tuning is beneficial for the 5W1H label pre-annotation task.

Table 1
Comparison between the QA model with and without fine-tuning using the BETO transformers model.
Model | EM | Similar | Incorrect | Overall EM | F1
QA without fine-tuning 5W1H | 30 | 396 | 141 | 0.052 | 0.191
QA fine-tuning 5W1H after Phase 2 | 236 | 178 | 153 | 0.416 | 0.613
QA fine-tuning 5W1H after batch 6 in Phase 3 | 263 | 152 | 152 | 0.463 | 0.641

After batch 6 of Phase 3, with 60 more news items, the 5W1H model was retrained. The new model (QA fine-tuning 5W1H after batch 6) attained the best results in terms of EM, Overall EM, and F1 (Table 1). This finding confirms that, with a greater number of examples of the 5W1H labels, a model with high precision can be obtained that reduces annotation times and ultimately assists the human annotator in this complex task.

5.2.2. Tuning the QA model threshold

Pre-annotation aims to annotate as many 5W1H elements as possible with high precision, so that the annotator has to discard very few examples as incorrect, because the higher the number of corrections, the longer the delay in the annotation process. In order to reduce the error rate in the examples pre-annotated by the model, we automatically annotated the 5W1H labels in the 10 news items of batch 1 of Phase 3 and classified them manually as correct or incorrect. In Fig. 4, a scatter plot is used to represent each 5W1H label by an arbitrarily assigned label index (x-axis) and the score assigned by the QA model (y-axis). Finally, an ordinary least squares regression trendline was added to distinguish the correct and similar answers (blue points) from the incorrect ones (orange points).

Fig. 4. Representation of 5W1H labels by QA model prediction scores, an index of each label, and the manual classification by an expert annotator of the correct and similar (blue points) and incorrect (orange points) labels.

This process defined a threshold to separate the incorrect answers from the correct or similar ones. In this case, the threshold selected was 0.11, because it was the score closest at all times to the ordinary least squares regression trendline, thereby separating the manually classified label types. This threshold was configured in the semi-automatic annotation system with the best QA model obtained at the start of the annotation process in Phase 3 (QA fine-tuning 5W1H after Phase 2).
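A minimal sketch of the fine-tuning and threshold-filtered pre-annotation described in Sections 5.2-5.2.2 is given below, using the Simple Transformers question-answering API. The training example, the model_type string, and the exact structure of the prediction output are assumptions based on the library's documented usage and may need adjusting to the installed version; the hyperparameters and the 0.11 threshold follow the values reported above.

```python
# Sketch only: fine-tune the Spanish QA model on 5W1H examples and keep
# pre-annotations whose score passes the 0.11 threshold.
from simpletransformers.question_answering import QuestionAnsweringModel

train_args = {
    "max_seq_length": 128,
    "train_batch_size": 8,
    "learning_rate": 4e-5,
    "num_train_epochs": 3,
    "overwrite_output_dir": True,
}

# Starting point: BETO-based model already fine-tuned on SQuAD-es-v2.0.
# The model_type ("bert") depends on the checkpoint architecture.
model = QuestionAnsweringModel(
    "bert",
    "mrm8488/distill-bert-base-spanish-wwm-cased-finetuned-spa-squad2-es",
    args=train_args,
    use_cuda=False,
)

# SQuAD-style training data built from the manually annotated 5W1H examples (toy record).
train_data = [{
    "context": "El lunes, el ministerio anuncio nuevas medidas en Madrid.",
    "qas": [{
        "id": "0-when",
        "question": "Cuando?",
        "answers": [{"text": "El lunes", "answer_start": 0}],
        "is_impossible": False,
    }],
}]
model.train_model(train_data)

# Pre-annotation with threshold filtering.
to_predict = [{"context": train_data[0]["context"],
               "qas": [{"id": "q1", "question": "Donde?"}]}]
answers, probabilities = model.predict(to_predict)
THRESHOLD = 0.11
accepted = [a for a, p in zip(answers, probabilities)
            if max(p["probability"]) >= THRESHOLD]
```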
5.3. Computational prototype

The computational prototype enables the annotator to select news, skip news that are not of interest for the dataset, and pre-annotate news in the same interface (see Fig. 5; the original news source shown in the figure is available at https://www.eldiestro.es/2021/04/mintiendo-y-manipulando-asi-pretenden-marcar-los-miserables-medios-de-comunicacion-de-espana-a-las-personas-que-decidan-no-vacunarse/).

The user interface is the Brat tool together with the assisted system implemented, which allows the annotator to discard, accept or modify the annotation proposals (see Fig. 6). The use of this interface enabled us to annotate quickly, accurately and easily.

Despite using this somewhat dated annotation software, which is not as full-featured as more modern alternatives, we found that its simplicity and minimalist design are an advantage when introducing new annotators to a semi-automated workflow. Furthermore, its file-based storage model and its open annotation syntax allowed us to integrate our semi-automated workflow without having to access or modify Brat's source code. In addition, in this phase both the compilation and the annotation tasks were carried out in the same interface, without having to switch screens and search external sites. The assisted annotation system was based on a predictive interface which, as stated by Monarch (2021), consists of items that have been pre-annotated by a machine learning model. This type of interface enables annotators to edit items and readjust the model with the errors detected and corrected. Thanks to this navigability, the system integrated the news recommendation, which not only saved time, but also took into account the annotator's selection and, on that basis, retrained the model through active learning.

5.4. Generalization of the proposed methodology

To sum up, the proposed methodology comprises three phases. In Phase 1, a set of news items is manually compiled and annotated. After that, Phase 2 introduces active learning, where the human annotator interacts with an ML model that automatically selects the most informative news items to annotate. While this phase creates a more efficient process by reducing compilation costs, the annotation is still manual. Finally, Phase 3 adds human–machine interaction in the annotation process, consisting of the human reviewing and improving the automatic pre-annotation of the dataset, provided by an ML-assisted labeling system and implying a machine-human-machine loop. As stated before, the proposed methodology was implemented and applied to reliability annotation within the disinformation framework, but it is easily applicable to a broad range of complex annotation problems, following the same phases. Some preliminary ongoing studies on other annotation schemes are being performed to confirm this fact. Regardless of the concrete entities and complex relationships that one wants to deal with in a complex annotation scheme, the entropy and informativeness formulas used in AL are not specific to the annotated entities or relationships. Likewise, the pre-annotation module can be replaced by an ML-assisted labeling system adapted to the concrete problem. Furthermore, the fact that the corpus is in the Spanish language is irrelevant to the experimental results, since the machine learning models used are language-agnostic and no language-specific heuristics are applied. Hence, these results should generalize to other languages and annotation schemes, albeit with different baseline F1 scores according to the complexity of the underlying learning problem.

6. Evaluation framework

In order to measure both the benefits of the methodology adopted and the quality of the resource generated, this section undertakes a set of experiments. First, the features of the dataset constructed are presented. Second, two dataset quality measures are provided, in terms of both labeling consistency and inter-annotator agreement. Regarding the methodology, both its efficiency and effectiveness are measured. Finally, the limitations of the proposal are discussed.
6.1. RUN dataset description

After applying the HITL-based methodology, a Reliable and Unreliable News (RUN) dataset in Spanish (both from Spain and Latin America) was obtained. The RUN dataset consists of a set of 170 news items collected from mainstream digital media that have the traditional journalistic structure.

As indicated in Section 3, each 5W1H item is assigned a reliability value. The reliability criteria adopted for this annotation (Bonet-Jover et al., 2023) are not related to the source credibility but to the accuracy and neutrality of the semantic elements of the news item. The dataset is balanced according to the Reliable (85) and Unreliable (85) classification, and it was created via an incremental procedure whereby 40 news items were included in Phase 1, 40 more news items were included in Phase 2, and 90 more news items in Phase 3. Because Phase 3 was more efficient, we were able to annotate more news items in less time. The size of the dataset is limited in this initial phase of its creation, since the aim was to prove the validity of the semi-automatic building methodology proposed. Even so, given the characteristics of its creation, where human-in-the-loop strategies have been used, the size should not be a problem for demonstrating the validity of the methodology, since the examples are more representative than if the same sample had been chosen randomly, as is the case when the annotation is entirely manual and no human-in-the-loop strategy is involved in the process (Monarch, 2021).

Previous research found that news mixes unreliable and reliable information, which hinders the disinformation detection task (Bonet-Jover et al., 2021). We consider that the different parts and content elements of a news item have specific reliability values that influence the global reliability value of a news item. To gauge how information is distorted in news, we need to analyze each part and component separately. This requires a balanced dataset of unreliable and reliable news to train our system. Details are presented in Table 2.

Table 2
Numerical description of the 5W1H in the RUN dataset.
5W1H | Unreliable items | Reliable items | Total items
WHAT | 687 | 1600 | 2296
WHEN | 117 | 573 | 690
WHERE | 685 | 58 | 747
WHO | 326 | 1525 | 1856
WHY | 142 | 241 | 384
HOW | 165 | 358 | 529
TOTAL dataset | 2122 | 4355 | 6502

As can be seen from the figures in Table 2, 4,355 Reliable 5W1Hs are available in the dataset compared to 2,122 Unreliable 5W1Hs. Although the news dataset is balanced between reliable and unreliable, the fact that there are many more reliable labels indicates that, in news intended to spread disinformation, not all labels will be unreliable; both types of information (reliable and unreliable) are mixed in order to confuse the reader. As stated by Juez and Mackenzie (2019), ''in most cases, fake news is not totally false, but rather a distorted version of something that really happened or a manipulated account of true facts''. This does not mean that in a reliable news item all the labels are also reliable; some might not be, probably unintentionally, but precisely because of this, the model will be able to learn to classify into reliable or unreliable based on those data.

The aim of the present work is to increase the speed of creating the dataset without compromising accuracy. Given the time spent on the annotation task, as well as the complexity and the semantic nature of our annotation guideline (which makes the agreement between annotators more complicated and subjective), the annotation methodology improved the procedure, as demonstrated in the next subsections. To validate our methodology, we do not need a huge dataset, but a quality dataset with rich and well-chosen examples that increase the accuracy.

6.2. Annotation process details

Two experts performed the annotation task. The expert annotators are linguistic researchers specialized in NLP (one PhD annotator, who is the author of the annotation guidelines, and one PhD student), and both are native Spanish speakers. The first annotator has a high level of expertise in annotating text with this type of annotation. The second was an experienced annotator, but in other types of semantic annotation.

Since HITL techniques involve humans in the training process, aspects related to human–computer interaction (HCI) have to be considered in the annotation process. For this reason, the annotation plan comprises the following elements:

• Objectives of the annotation process: The purpose of this annotation task is to determine the structure-type items of the inverted pyramid, the essential content given by the 5W1H items of the text, and the reliability of each of these elements. All the items to be annotated are clearly defined in the annotation guideline RUN-AS (Reliable and Unreliable News Annotation Scheme, available at https://gplsi.dlsi.ua.es/resources/NewsReliabilityAnnotation). The items to be annotated will be defined in the annotation tool, to be selected when the annotator deems it necessary.
• User-friendly annotation interface: The user interface, as indicated in Section 5.3, is the Brat tool along with the assisted system implemented. This interface enables the annotator to select, skip, and pre-annotate news in the same interface. The web tool, hosted on a server, automatically saved the work done and showed the labels with colors and symbols without the annotator having to place the cursor over them for identification purposes. The interface is minimalist and functional, facilitating the onboarding of new annotators, as it is not necessary for them to deal with complex functionalities for project management or credentials. At the same time, Brat's annotation system is powerful enough to deal with a complex schema like ours, with several different types of entities and relations.
• Help tutorial: A document with clear instructions on how to use the annotation tool was given to the annotators, and it is available at https://gplsi.dlsi.ua.es/resources/brat_guideline.
• Annotation sessions' planning: Bearing in mind that the annotation process includes human intervention, either annotating from scratch as in Phase 1 or assisted annotation as in Phase 3, and taking into account the possible fatigue of the annotators, 30 to 40 minute sessions were planned in each phase. The rationale is to maximize the quality of the annotation rather than its quantity.
• Usability proofs: A post-annotation usability satisfaction questionnaire was carried out to collect the opinion of the human annotators. The usability questionnaire was derived from Zhang and Adams (2012). Usability testing helps us to identify problematic areas and improve the user experience. After analyzing the results, annotators somewhat agree on the usability of the tool (Lewis, 1995). All the feedback obtained from these proofs will be considered for future improvements in the annotation process. The questions used in the usability proofs are available at https://gplsi.dlsi.ua.es/resources/usability_annotation_proofs.
Fig. 7. Comparison of the distribution of annotations over 5W1H for batches with and without pre-annotations.
Table 5
Ratio of EM, Similar and Incorrect 5W1H labels for the best-performing QA model. In bold, the highest values for EM and the lowest values for Incorrect.

5W1H label   EM     Similar   Incorrect
WHAT         0.88   0.09      0.04
WHEN         0.70   0.12      0.22
WHERE        0.62   0.01      0.30
WHO          0.84   0.03      0.12
WHY          0.45   0.10      0.45
HOW          0.46   0.08      0.46
Fig. 9. Internal structure of the ML baseline system.
Fig. 10. Internal structure of the DL baseline system.

6.5. Measuring effectiveness of the methodology

To validate the effectiveness of the methodology proposed for the construction of the dataset, the semi-automatically generated dataset is evaluated using two baseline systems to predict a random test set with 20 news items.12 This test set was obtained before the active learning process, using the initial news pool (randomly selected), to guarantee that the test data and the initial training data (Phase 1 of the methodology) come from approximately the same distribution (Monarch, 2021). Figs. 9 and 10 show the internal structure of the ML and DL baseline systems, respectively.

12 Available at https://gplsi.dlsi.ua.es/resources/NewsReliabilityAnnotation

The baseline systems were trained with the RUN dataset to predict the test set. In all cases, the baselines received the news content (TITLE and BODY text) as input; in some cases, they additionally used as input the features obtained from the fine-grained annotation. The output of each baseline was the Reliable/Unreliable classification.

Fig. 9 represents the ML-based baseline, where the news content is encoded as TF-IDF vectors that are concatenated with the extracted features. The resulting vectors were used as input to the ML model (in this case, Logistic Regression) to perform training and prediction. For this baseline, we chose TF-IDF encoding vectors and the Logistic Regression algorithm because, although they are a classic word representation and classifier, they still perform well in similar tasks when numerical and categorical features are used, such as fake news detection (Li, 2021; Jiang et al., 2021; Posadas-Durán et al., 2019; Lahby et al., 2022).
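As a rough illustration of this pipeline, the following minimal sketch concatenates TF-IDF vectors with annotation features and trains a Logistic Regression classifier. The two toy news items, the three-column feature matrix, and the labels are placeholders rather than RUN data; the actual implementation is the one released in the repositories referenced below (footnotes 14 and 15).

# Minimal sketch of the ML baseline (Fig. 9): news content encoded as TF-IDF
# vectors, concatenated with the numerical/categorical annotation features,
# and classified with Logistic Regression. All data below are toy placeholders.
import numpy as np
from scipy.sparse import csr_matrix, hstack
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

texts = ["Title and body of a reliable news item ...",
         "Title and body of an unreliable news item ..."]
# One row per news item with extracted counts (e.g. LEAD_WHAT_Reliable).
features = np.array([[2.0, 0.0, 1.0],
                     [0.0, 3.0, 0.0]])
labels = ["Reliable", "Unreliable"]

vectorizer = TfidfVectorizer()
x_text = vectorizer.fit_transform(texts)            # sparse TF-IDF matrix
x_full = hstack([x_text, csr_matrix(features)])     # text + annotation features

clf = LogisticRegression(max_iter=1000)
clf.fit(x_full, labels)
# At prediction time the same vectorizer and feature extraction are applied.
print(clf.predict(x_full))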
Fig. 10 represents the DL-based baseline. In this case, the news content was encoded using a transformer model. The encoded vector (the model output) and the annotated features were used as input to a neural network – in this case, a multilayer perceptron (MLP) – to perform training and prediction. The baseline used the classification architecture proposed by Sepúlveda-Torres et al. (2021), which combines the transformer model with external features. To design the baseline system, we selected the BETO pre-trained model as the transformer model because it is a Spanish language model that obtains good performance in multiple NLP tasks (Canete et al., 2020). The BETO model is used in fine-tuning mode.
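A condensed sketch of this kind of architecture is shown below: the [CLS] representation produced by BETO is concatenated with the annotation features and fed to an MLP. The checkpoint name is the publicly released BETO model, and the layer sizes and example inputs are illustrative assumptions, not the exact configuration of Sepúlveda-Torres et al. (2021).

# Simplified sketch of the DL baseline (Fig. 10): BETO encodes the news text,
# the [CLS] vector is concatenated with the annotation features, and an MLP
# produces the Reliable/Unreliable prediction. Layer sizes and inputs are
# illustrative assumptions only.
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

BETO = "dccuchile/bert-base-spanish-wwm-cased"  # public BETO checkpoint

class ReliabilityClassifier(nn.Module):
    def __init__(self, n_features: int, n_classes: int = 2):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(BETO)
        hidden = self.encoder.config.hidden_size
        self.mlp = nn.Sequential(
            nn.Linear(hidden + n_features, 256),
            nn.ReLU(),
            nn.Linear(256, n_classes),
        )

    def forward(self, input_ids, attention_mask, features):
        out = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        cls_vector = out.last_hidden_state[:, 0]      # [CLS] representation
        return self.mlp(torch.cat([cls_vector, features], dim=-1))

tokenizer = AutoTokenizer.from_pretrained(BETO)
batch = tokenizer(["Titular y cuerpo de la noticia ..."],
                  padding=True, truncation=True, return_tensors="pt")
feats = torch.tensor([[2.0, 0.0, 1.0]])               # toy annotation features
model = ReliabilityClassifier(n_features=feats.shape[1])
logits = model(batch["input_ids"], batch["attention_mask"], feats)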
From the three annotation levels (Structure, Content, and Elements of Interest) of the RUN-AS annotation, 42 numerical and categorical features were extracted. Table 6 shows the extracted features.

Table 6
Overview of the 42 numerical and categorical features used.

Level                      Features
Structure (7 Categorical)  Title, Subtitle, Lead, Body, Conclusion, Title_Stance, Title_Style
Content (28 Numerical)     What, What_Reliability_Reliable, What_Reliability_Unreliable, What_Main_Event, What_Lack_Of_Information_Yes, Who, Who_Reliability_Reliable, Who_Reliability_Unreliable, Who_Lack_Of_Information_Yes, Who_Role_Subject, Who_Role_Target, Who_Role_Both, When, When_Reliability_Reliable, When_Reliability_Unreliable, When_Lack_Of_Information_Yes, Where, Where_Reliability_Reliable, Where_Reliability_Unreliable, Where_Lack_Of_Information_Yes, Why, Why_Reliability_Reliable, Why_Reliability_Unreliable, Why_Lack_Of_Information_Yes, How, How_Reliability_Reliable, How_Reliability_Unreliable, How_Lack_Of_Information_Yes
EoI (7 Numerical)          Quote, Quote_Author_Stance_Agree, Quote_Author_Stance_Disagree, Quote_Author_Stance_Unknown, Key_Expression, Orthotypography, Figure

A simplified example of the numerical and categorical features extracted from the TITLE and LEAD of a news piece is presented next.13 The same feature types were generated from the other parts of the inverted pyramid structure of the document. Each feature indicates the number of 5W1H components with a specific label and reliability attribute that appear in each part of the news. For example, LEAD_WHAT_Reliable: 2 indicates that the LEAD contains two WHAT items annotated with a Reliable value.

13 Only some of the features are shown to exemplify the generation of these features.

{
TITLE_style: Objective,
TITLE_stance: Agree,
TITLE_WHAT_Reliable: 0,
TITLE_WHAT_Unreliable: 1,
TITLE_WHO_Reliable: 0,
TITLE_WHO_Unreliable: 1,
TITLE_WHEN_Reliable: 0,
TITLE_WHEN_Unreliable: 1,
# ...
LEAD_WHAT_Reliable: 2,
LEAD_WHAT_Unreliable: 2,
LEAD_WHO_Reliable: 0,
LEAD_WHO_Unreliable: 1,
LEAD_WHEN_Reliable: 0,
LEAD_WHEN_Unreliable: 3,
# ...
}
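To make the mapping from the fine-grained annotation to these counts concrete, the following small sketch derives such features by counting annotated 5W1H spans per news part. The annotation tuples are invented for illustration and do not come from the RUN dataset.

# Minimal sketch: each annotated 5W1H span carries the news part it belongs to
# and its reliability attribute; the feature value is the number of such spans.
from collections import Counter

annotations = [
    ("LEAD", "WHAT", "Reliable"),
    ("LEAD", "WHAT", "Reliable"),
    ("LEAD", "WHEN", "Unreliable"),
    ("TITLE", "WHO", "Unreliable"),
]

counts = Counter(f"{part}_{slot}_{reliability}"
                 for part, slot, reliability in annotations)
print(counts["LEAD_WHAT_Reliable"])   # -> 2, matching the example above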
The models were trained to predict the overall document reliability label based on these numerical and categorical features.

Table 7
Performance of the ML and DL models when predicting the test set, with and without features.

Model                                   Without features     With features
                                        F1      Acc          F1      Acc
Logistic Regression (TF-IDF encoded)    0.733   0.75         0.949   0.95
BETO                                    0.84    0.85         0.89    0.9

Table 7 shows the results of the baselines, in terms of F1 and accuracy, when predicting the test set. In order to evaluate the contribution provided by the annotation, two experiments were performed with each baseline. The first used only the news content (Without features column), and the second concatenated the extracted features of the RUN dataset (With features column). To replicate the results shown in Table 7, the following GitHub repositories can be used: the ML-based baseline14 and the DL-based baseline.15

As can be seen in the table, both baselines outperformed their feature-less counterparts when using the RUN-AS features. The Logistic Regression model performed better than BETO when using the features, whereas the BETO model performed better than Logistic Regression when not using the RUN-AS features, showing the power of pre-trained models in the absence of specific features. This result corroborates not only the benefits of this fine-grained annotation (the RUN-AS scheme) for reliability detection but also the feasibility of the semi-automatic dataset generated by applying the HITL-based methodology.

The dataset construction time was reduced by around 64% without compromising performance (95% F1 and accuracy). To support this statement, a comparison is made with previous research (Bonet-Jover et al., 2023) in which the dataset was built entirely manually and which serves as a performance baseline, since that previous dataset was costlier and less efficient to build than the one proposed here. This is not a direct comparison, since the previous dataset differs in size: it contained only 80 Reliable and Unreliable news items. However, the same annotation guideline was used. The results obtained there for Logistic Regression and BETO were 0.88 and 0.85, respectively. As can be observed, results slightly increase in the current proposal, since the dataset is composed of more examples after applying the methodology, which enables a more efficient construction of the dataset.
6.6. Discussion and limitations of the proposal

A limitation of our proposal relates to using active learning to select the best news examples, because active learning models inherently learn about the annotator's preferences. As the model's training continues, the model learns the annotator's biases through this interactive process, and this can lead to a cycle of positive reinforcement whereby the model proposes only those documents that the annotator considers valuable or important.

However, this may be mitigated in two ways. Firstly, by always introducing an element of randomness, not necessarily choosing only the K most informative news items that the model proposes but also including random documents (a minimal sketch of such a mixed selection follows). Secondly, by having a variety of annotators, rather than a single annotator, train the AL model, combining what they annotate so that the biases are somewhat counteracted.
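The sketch below illustrates the first mitigation under simple assumptions: part of each batch is taken from the items the active-learning model scores as most informative, and the rest is drawn at random from the unlabelled pool. The pool, scores, and batch parameters are toy values, not the selection procedure actually used to build RUN.

# Minimal sketch of mixed batch selection: top-scored items plus a random share.
import random

def select_batch(pool, informativeness, k=10, random_fraction=0.3, seed=0):
    """Return k document ids: the most informative ones plus a random share."""
    rng = random.Random(seed)
    n_random = int(k * random_fraction)
    ranked = sorted(pool, key=lambda doc: informativeness[doc], reverse=True)
    top = ranked[: k - n_random]
    rest = [doc for doc in pool if doc not in top]
    return top + rng.sample(rest, min(n_random, len(rest)))

pool = [f"news_{i}" for i in range(100)]
scores = {doc: random.random() for doc in pool}   # toy informativeness scores
print(select_batch(pool, scores))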
This discussion of bias applies to models in general, with the addition that, since the model is being trained interactively and its output is used to guide the annotator, the positive reinforcement cycles of biases may be shorter.

Furthermore, in this phase of the work a simple AL model based on logistic regression was used, and although it allowed us to obtain positive results, in the future we will consider using other ML or DL approaches and even other HITL strategies. Likewise, the pre-annotation had a considerable error rate for the more complex labels, such as WHY and HOW, whose pre-annotation should be studied in depth in order to be improved.

Despite the dataset's limited size at present, namely a set of 170 news items in Spanish, one of the basic principles of human-in-the-loop techniques – and more specifically of active learning (see Section 2.2) – is precisely to reach the target performance of a machine learning model faster, since the best examples are obtained automatically; indeed, high performance was obtained with this corpus. After analyzing the experimental results, the size proved to be sufficient for the purpose of this work, which was to determine the validity of the proposed methodology, in part thanks to the way it was created through the application of the human-in-the-loop paradigm. However, this is currently a preliminary study, and although the results are promising, the dataset would need to be extended to determine how the methodology responds in the case of a large-scale dataset. Furthermore, should the results hold with a larger dataset, the next step would be to analyze how to apply them to more complex tasks such as veracity detection. Lastly, regarding the computational prototype, a different annotation software could be considered in the future, given that Brat is appropriate but limited in some aspects compared to more modern alternatives.

7. Conclusions and future work

The main novelty of this proposal is the design and implementation of a human-in-the-loop based methodology for the semi-automatic annotation of semantically complex datasets. Firstly, the methodology applied Active Learning (AL), a well-known strategy, to determine the most suitable news to be annotated; specifically, the diversity sampling strategy was used. Secondly, a human–machine interaction procedure was implemented in which the human reviews and improves the automatic pre-annotation provided by the ML-assisted labeling system. This was carried out to: (i) enhance the reliability detection classification by improving the training dataset; and (ii) re-train the ML-assisted labeling system (a QA system in our case) with the newly reviewed examples.

The application of the methodology results in an improvement of the annotation task that derives from two independent factors: intelligently sorting which news to annotate, and providing pre-annotated suggestions with a high degree of certainty.

The methodology is implemented in the disinformation framework, specifically for the reliability of news content in Spanish, producing the RUN dataset. The building of this dataset using the methodology, although limited in size at its current stage, constitutes a proof of concept of how applying this methodology to the construction of much bigger datasets could significantly reduce the cost of dataset creation. Furthermore, this dataset is also a novel contribution, since state-of-the-art disinformation datasets are annotated with a single veracity value for the whole news item, whereas in our annotation proposal the essential content of each news item is identified and classified as reliable or unreliable. This is more suitable for applying explainable AI in future work, providing the evidence of disinformation for each news item.

The performance of the methodology is evaluated in terms of a balance between time-effort consumption and accuracy achievement. The annotation procedure time was reduced by 63.61% with respect to the fully manual annotation, and the inter-annotator agreement increased. Regarding the performance of the dataset in the reliable-unreliable classification task, the baseline ML model trained with the RUN dataset reached 95% F1 and accuracy.

14 https://gplsi.dlsi.ua.es/resources/Logistic_Regression_RUN_Dataset
15 https://gplsi.dlsi.ua.es/resources/BETO_RUN_AS
Ireton, C., Posetti, J., 2018. Journalism, Fake News & Disinformation: Handbook for Journalism Education and Training. Unesco Publishing.
Jiang, T., Li, J.P., Haq, A.U., Saboor, A., Ali, A., 2021. A novel stacking approach for accurate detection of fake news. IEEE Access 9, 22626–22639.
Juez, L.A., Mackenzie, J.L., 2019. Emotion, lies, and "bullshit" in journalistic discourse. Ibérica (38), 17–50.
Jung, W., Jazizadeh, F., 2019. Human-in-the-loop HVAC operations: A quantitative review on occupancy, comfort, and energy-efficiency dimensions. Appl. Energy 239, 1471–1508. http://dx.doi.org/10.1016/j.apenergy.2019.01.070.
Kholghi, M., Sitbon, L., Zuccon, G., Nguyen, A., 2016. Active learning: A step towards automating medical concept extraction. J. Am. Med. Inf. Assoc. 23 (2), 289–296.
Kholghi, M., Sitbon, L., Zuccon, G., Nguyen, A., 2017. Active learning reduces annotation time for clinical concept extraction. Int. J. Med. Inf. 106, 25–31. http://dx.doi.org/10.1016/j.ijmedinf.2017.08.001, URL: https://www.sciencedirect.com/science/article/pii/S1386505617302009.
Lahby, M., Aqil, S., Yafooz, W.M., Abakarim, Y., 2022. Online fake news detection using machine learning techniques: A systematic mapping study. In: Combating Fake News with Computational Intelligence Techniques. Springer, pp. 3–37.
Lewis, J.R., 1995. IBM computer usability satisfaction questionnaires: Psychometric evaluation and instructions for use. Int. J. Hum.-Comput. Interact. 7 (1), 57–78. http://dx.doi.org/10.1080/10447319509526110.
Li, K., 2021. HAHA at FakeDeS 2021: A fake news detection method based on TF-IDF and ensemble machine learning. In: IberLEF@ SEPLN. pp. 630–638.
Mitra, T., Gilbert, E., 2015. Credbank: A large-scale social media corpus with associated credibility annotations. In: Proceedings of the International AAAI Conference on Web and Social Media, Vol. 9. pp. 258–267.
Monarch, R.M., 2021. Human-in-the-Loop Machine Learning: Active Learning and Annotation for Human-Centered AI. Simon and Schuster.
Mosqueira-Rey, E., Hernández-Pereira, E., Alonso-Ríos, D., Bobes-Bascarán, J., Fernández-Leal, Á., 2022. Human-in-the-loop machine learning: A state of the art. Artif. Intell. Rev. 1–50.
Névéol, A., Doğan, R.I., Lu, Z., 2011. Semi-automatic semantic annotation of PubMed queries: A study on quality, efficiency, satisfaction. J. Biomed. Inf. 44 (2), 310–318.
Okoro, E., Abara, B., Umagba, A., Ajonye, A., Isa, Z., 2018. A hybrid approach to fake news detection on social media. Nigerian J. Technol. 37 (2), 454–462.
Olsson, F., 2009. A Literature Survey of Active Machine Learning in the Context of Natural Language Processing. Technical Report, Swedish Institute of Computer Science.
Pérez-Rosas, V., Mihalcea, R., 2015. Experiments in open domain deception detection. In: Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing. pp. 1120–1125.
Piad-Morffis, A., Gutiérrez, Y., Muñoz, R., 2019. A corpus to support ehealth knowledge discovery technologies. J. Biomed. Inf. 94, 103172.
Posadas-Durán, J.-P., Gómez-Adorno, H., Sidorov, G., Escobar, J.J.M., 2019. Detection of fake news in a new corpus for the Spanish language. J. Intell. Fuzzy Systems 36 (5), 4869–4876.
Rahman, M.M., Kutlu, M., Elsayed, T., Lease, M., 2020. Efficient test collection construction via active learning. In: Proceedings of the 2020 ACM SIGIR on International Conference on Theory of Information Retrieval. pp. 177–184.
Rajpurkar, P., Zhang, J., Lopyrev, K., Liang, P., 2016. SQuAD: 100,000+ questions for machine comprehension of text. arXiv preprint arXiv:1606.05250.
Ramos, G., Meek, C., Simard, P., Suh, J., Ghorashi, S., 2020. Interactive machine teaching: A human-centered approach to building machine-learned models. Human–Comput. Interact. 35 (5–6), 413–451.
Salem, F.K.A., Al Feel, R., Elbassuoni, S., Jaber, M., Farah, M., 2019. FA-KES: A fake news dataset around the Syrian war. In: Proceedings of the International AAAI Conference on Web and Social Media, Vol. 13. pp. 573–582.
Saquete, E., Tomás, D., Moreda, P., Martínez-Barco, P., Palomar, M., 2020. Fighting post-truth using natural language processing: A review and open challenges. Expert Syst. Appl. 141, 112943.
Sepúlveda-Torres, R., Saquete Boró, E., et al., 2021. GPLSI team at CheckThat! 2021: Fine-tuning BETO and RoBERTa. CEUR.
Settles, B., Craven, M., Ray, S., 2007. Multiple-instance active learning. Adv. Neural Inf. Process. Syst. 20.
Shahi, G.K., Nandini, D., 2020. FakeCovid – A multilingual cross-domain fact check news dataset for COVID-19. arXiv preprint arXiv:2006.11343.
Shu, K., Bhattacharjee, A., Alatawi, F., Nazer, T.H., Ding, K., Karami, M., Liu, H., 2020. Combating disinformation in a social media age. Wiley Interdiscip. Rev.: Data Min. Knowl. Discov. 10 (6), e1385.
Silva, R.M., Santos, R.L., Almeida, T.A., Pardo, T.A., 2020. Towards automatically filtering fake news in Portuguese. Expert Syst. Appl. 146, 113199.
Simard, P.Y., Amershi, S., Chickering, D.M., Pelton, A.E., Ghorashi, S., Meek, C., Ramos, G., Suh, J., Verwey, J., Wang, M., et al., 2017. Machine teaching: A new paradigm for building machine learning systems. arXiv preprint arXiv:1707.06742.
Spina, D., Peetz, M.-H., de Rijke, M., 2015. Active learning for entity filtering in microblog streams. In: Proceedings of the 38th International ACM SIGIR Conference on Research and Development in Information Retrieval. SIGIR '15, Association for Computing Machinery, New York, NY, USA, pp. 975–978. http://dx.doi.org/10.1145/2766462.2767839.
Stenetorp, P., Pyysalo, S., Topić, G., Ohta, T., Ananiadou, S., Tsujii, J., 2012. brat: A web-based tool for NLP-assisted text annotation. In: Proceedings of the Demonstrations at the 13th Conference of the European Chapter of the Association for Computational Linguistics. Association for Computational Linguistics, Avignon, France, pp. 102–107. URL: https://www.aclweb.org/anthology/E12-2021.
Tchoua, R., Ajith, A., Hong, Z., Ward, L., Chard, K., Audus, D., Patel, S., de Pablo, J., Foster, I., 2019. Active learning yields better training data for scientific named entity recognition. In: 2019 15th International Conference on EScience. EScience, IEEE, pp. 126–135.
Thomson, E.A., White, P.R., Kitley, P., 2008. "Objectivity" and "hard news" reporting across cultures: Comparing the news report in English, French, Japanese and Indonesian journalism. Journalism Stud. 9 (2), 212–228.
Tomanek, K., Wermter, J., Hahn, U., 2007. An approach to text corpus construction which cuts annotation costs and maintains reusability of annotated data. In: Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning. EMNLP-CoNLL, pp. 486–495.
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I., 2017. Attention is all you need. Adv. Neural Inf. Process. Syst. 30.
Vlachos, A., Riedel, S., 2014. Fact checking: Task definition and dataset construction. In: Proceedings of the ACL 2014 Workshop on Language Technologies and Computational Social Science. pp. 18–22.
Voorhees, E.M., 2018. On building fair and reusable test collections using bandit techniques. In: Proceedings of the 27th ACM International Conference on Information and Knowledge Management. pp. 407–416.
Vu, H.-T., Gallinari, P., 2006. A machine learning based approach to evaluating retrieval systems. In: Proceedings of the Human Language Technology Conference of the NAACL, Main Conference. pp. 399–406.
Wang, W.Y., 2017. "Liar, liar pants on fire": A new benchmark dataset for fake news detection. arXiv preprint arXiv:1705.00648.
Wang, W., Zhao, D., Zou, L., Wang, D., Zheng, W., 2010. Extracting 5W1H event semantic elements from Chinese online news. In: International Conference on Web-Age Information Management. Springer, pp. 644–655.
Wardle, C., et al., 2018. Information Disorder: The Essential Glossary. Shorenstein Center on Media, Politics, and Public Policy, Harvard Kennedy School, Harvard, MA.
Wondimu, N.A., Buche, C., Visser, U., 2022. Interactive machine learning: A state of the art review. arXiv preprint arXiv:2207.06196.
Wu, X., Xiao, L., Sun, Y., Zhang, J., Ma, T., He, L., 2022. A survey of human-in-the-loop for machine learning. Future Gener. Comput. Syst.
Zhang, T., Adams, J., 2012. Evaluation of a geospatial annotation tool for unmanned vehicle specialist interface. Intl. J. Hum.–Comput. Interact. 28, 361–372. http://dx.doi.org/10.1080/10447318.2011.590122.
Zhang, H., Chen, X., Ma, S., 2019. Dynamic news recommendation with hierarchical attention network. In: 2019 IEEE International Conference on Data Mining. ICDM, IEEE, pp. 1456–1461.
Zhang, H., Liu, H., 2016. Visualizing structural "inverted pyramids" in English news discourse across levels. Text Talk 36 (1), 89–110.