
Received August 10, 2020, accepted August 24, 2020, date of publication August 26, 2020, date of current version September 9, 2020.


Digital Object Identifier 10.1109/ACCESS.2020.3019735

Fake News Stance Detection Using Deep Learning Architecture (CNN-LSTM)
MUHAMMAD UMER 1, ZAINAB IMTIAZ 1, SALEEM ULLAH 1, ARIF MEHMOOD 2, GYU SANG CHOI 3, AND BYUNG-WON ON 4
1 Department of Computer Science, Khwaja Fareed University of Engineering and Information Technology, Rahim Yar Khan 64200, Pakistan
2 Department of Computer Science and Information Technology, The Islamia University of Bahawalpur, Bahawalpur 63100, Pakistan
3 Department of Information and Communication Engineering, Yeungnam University, Gyeongsan 38542, South Korea
4 Department of Statistics and Computer Science, Kunsan National University, Gunsan 54150, South Korea

Corresponding authors: Gyu Sang Choi ([email protected]) and Byung-Won On ([email protected])


This research was supported by Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by
the Ministry of Education (NRF-2019R1A2C1006159) and MSIT (Ministry of Science and ICT), Korea, under the ITRC (Information
Technology Research Center) support program (IITP-2020-2016-0-00313) supervised by the IITP (Institute for Information &
communications Technology Promotion), and in part by the Fareed Computing Research Center, Department of Computer Science,
Khwaja Fareed University of Engineering and Information Technology (KFUEIT), Rahim Yar Khan, Pakistan.

ABSTRACT Society and individuals are negatively influenced, both politically and socially, by the widespread increase of fake news, whether generated by humans or machines. In the era of social networks, the rapid circulation of news makes it challenging to evaluate its reliability promptly. Therefore, automated fake news detection tools have become a crucial requirement. To address this issue, a hybrid neural network architecture that combines the capabilities of CNN and LSTM is used with two different dimensionality reduction approaches, Principal Component Analysis (PCA) and Chi-square. This work proposes employing dimensionality reduction techniques to reduce the dimensionality of the feature vectors before passing them to the classifier. To develop the reasoning, this work acquired a dataset from the Fake News Challenge (FNC) website, which has four types of stances: agree, disagree, discuss, and unrelated. The nonlinear features are fed to PCA and Chi-square, which provide more contextual features for fake news detection. The motivation of this research is to determine the relative stance of a news article towards its headline. The proposed model improves results by ~4% and ~20% in terms of accuracy and F1-score, respectively. The experimental results show that PCA outperforms both Chi-square and state-of-the-art methods, with 97.8% accuracy.

INDEX TERMS Fake news detection, text mining, deep learning, PCA, Chi-square, CNN-LSTM, word
embedding.

The associate editor coordinating the review of this manuscript and approving it for publication was Keli Xiao.

This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
VOLUME 8, 2020 156695
M. Umer et al.: Fake News Stance Detection Using Deep Learning Architecture (CNN-LSTM)

I. INTRODUCTION

In the age of technology, a tremendous amount of data is being generated online every day. However, an unprecedented amount of the data flooding the Internet is fake news, which is generated to attract an audience, to influence people's beliefs and decisions [1]–[3], to increase the revenue generated by clicks [4], and to affect major events such as political elections [5]. Readers are misled by deliberately spread false information. Obtaining and spreading information through social media platforms has become extremely easy, which makes fake news difficult and nontrivial to detect based merely on the content of the news. For example, some reports illustrate that Russia has created fake accounts and social bots to spread fake news. According to a research poll, 64% of US citizens reported that fake news has caused a "great deal of confusion" about the factual content of reported events [6]. On top of that, large-scale cascades of false information have increasingly harmful consequences in business, marketing, and the stock market. For instance, in 2013, 130 billion dollars were wiped out in stock value after false news spread on Twitter that Barack Obama had been injured in an explosion [5]. In the US presidential campaign of 2016, fake news was accused of being a foremost contributing factor to the increasing political polarization and partisan conflict, as well as of affecting the outcome [7]–[9]. Thus, fake news identification is undeniably a grave challenge for the news industry and journalists, and tools for the detection of fake news have become a dire necessity.

As manual fact checking is a very tedious task, automatic identification of fake news has drawn considerable attention in the Natural Language Processing (NLP) community to help alleviate the burdensome and time-consuming human activity of fact checking [10], [11]. Despite that, the task of evaluating the authenticity of news remains very complex even for automated systems [12]. Identifying fake news articles by understanding what other news organizations are reporting about the same topic could be a valuable first step. This step is known as stance detection. Stance detection has always been an important foundation for various tasks, such as analyzing online debates [13]–[15], determining the authenticity of rumors on Twitter [16], [17], or understanding the argumentative structure of persuasive essays [18].

In order to encourage the development of automated fake news detection tools using AI technology and machine learning, Pomerleau and Rao (2017) organized the first Fake News Challenge (FNC-1) [19] to evaluate what a news source is saying about a particular issue. Around 50 teams from both industry and academia participated in this challenge. The purpose of the FNC-1 challenge is to determine the stance of a news article relative to a given headline. An article can take one of four stances: it can agree or disagree with the headline, discuss the same topic, or be unrelated. Information on the FNC-1 task, its rules, the dataset, and the evaluation metrics is given on the official website [19]. Table 1 shows four example documents elaborating these stances.

TABLE 1. Headline and text bodies with respective stances from FNC Dataset.

Deep learning models such as recurrent neural networks (RNN) and their variants [20]–[22] and convolutional neural networks (CNN) [23] have been used effectively in many NLP tasks that share similarities with fake news detection and consist of calculating semantic similarity between sentences [24], [25] and community-based question answering [26], [27]. In [28], a Siamese MaLSTM is used to compute the semantic similarity of question pairs. A deep neural network converts the text sequence into a fixed-length vector representation, which is then used to measure the relevance of two textual sequences, which in our case is the relevance of each headline-body pair [9], [29]–[31].

In this paper, we propose a model that automatically classifies news articles with stance labels of agree, disagree, unrelated, or discuss. The classification is based on the level of agreement between the headline and the body assigned to it. The proposed methodology is based on the observation that the relevance of articles can be found using keywords within headlines. Some keywords in the headlines are useful for identifying key sentences in the body of the article. As shown in Table 1, only the first body is related to the headline; the rest do not have much relevance to it. Keywords such as 'seven girls' and 'pregnant' are used to retrieve all the bodies related to these keywords and then classify them.

In the proposed model, the feature set is first passed, with and without preprocessing, to the embedding layer for conversion of features to word vectors. Another set of experiments is carried out using PCA and Chi-square to perform component-level analysis and obtain the reduced feature set. One of the most popular statistical techniques for feature selection is PCA [32]. The discriminative power of classifiers can be enhanced by utilizing PCA. It has many applications in face recognition, text categorization, and image compression [32]. The crux of PCA is to transform the original variables into a subset of variables by computing the highest correlation of the original variables [33]. Principal Component Analysis (PCA) is a widely used technique that uses a linear transformation to reduce the dimensions of a feature set. The resulting dataset is simplified but retains the characteristics of the original dataset [34]–[36]. The new dataset might have an equal or smaller number of features than the original dataset. The covariance matrix is used to compute the principal components.

After obtaining the features through any of the above-mentioned methods, the feature set is passed to the embedding layer of the deep model. The embedding layer vectorizes all the features, and these vectors are then fed to a one-dimensional (1-D) Convolutional Neural Network (CNN) layer that further extracts the useful features by applying 64 filters of 5 different dimensions. The extracted features are fed to the max-pooling layer to select the features with the highest importance value during the computation. The remaining useful features are passed to the LSTM layer for sequence modeling and to find the hidden relevance of keywords and bodies.

We compare the effectiveness of the feature-reduction-based methods used with two deep learning models, i.e., CNN and LSTM. The experimental results indicate that the proposed model, when used with the reduced feature set, improves the F1-score and accuracy by 20% and 4%, respectively, over the other techniques discussed in the related work section.

The rest of the paper is structured as follows. Section II describes state-of-the-art work related to this study. Section III gives a summary of the dataset and the preprocessing steps performed on it. Section IV gives a brief explanation of the deep learning model, experiment details, and machine specifications used for the experiment. Section V discusses the model performance evaluation metrics. Section VI presents the results and discussion, and finally Section VII concludes the paper with possible future research directions.

II. RELATED WORK

Stance detection is a well-established and well-researched task in NLP. It is defined as determining from the text whether the audience is in favor of, against, or neutral about the target [37]. Stance detection has become the foundation for many tasks such as fake news detection [19], claim validation [38], and argument search [39]. Previous studies in fake news detection focused on target-specific stance prediction, in which the stance of a text entity relating to a topic or a named entity is determined. In many studies, target-specific stance prediction is performed for tweets (where tweets are the text and the target is a single stance) [37], [40], [41] and online debates [13], [15], [40]. Such target-specific approaches are based on structural features [13] and on linguistic and lexical features [15]. Stance prediction in tweets and online debates differs from stance detection for news articles, in which the stance of a news article is determined relative to its headline. The authenticity of claims is predicted using the stance of articles and the reliability of their sources in [38]. In detecting the reliability of fake news, stance features are used, which are also defined as unsupported claims [42]. One study used tweet publishing times and stances as the only features for determining the authenticity of tweets using Hidden Markov Models [43]. Another study [44] provides an approach to the claim-relevance discovery problem by leveraging various information retrieval and machine learning techniques, yielding 91.6% accuracy.

The first fake news stance detection challenge was initiated back in 2017. The inspiration behind the FNC-1 stance detection task was taken from the work proposed in [45], which classifies the stance of a single sentence of a news headline towards a specific claim. The dataset used in the FNC-1 challenge was partially labeled and based on the Emergent dataset [45]. In FNC-1, the stance is detected at the document level, in which the entire news article is classified relative to a headline. The top-performing system in FNC-1 was developed by the Talos Research Intelligence team, called SOLAT in the SWEN [46]. It is based on a 50/50 weighted average ensemble method that combines a deep CNN with Google News pre-trained vectors and gradient-boosted decision trees. The model achieved 82.02% accuracy.

The first runner-up, the 'Athene' team, consisting of members from the Ubiquitous Knowledge Processing Lab and the Adaptive Preparation of Information from Heterogeneous Sources Research Training Group at Technische Universität Darmstadt (TU Darmstadt), uses a multi-layer perceptron (MLP) as an ensemble of six hidden layers with hand-crafted features [47] and obtains 81.97% accuracy. The second runner-up team, UCL Machine Reading (UCLMR), uses two bags of words (BOW), term frequency (TF), and term frequency-inverse document frequency (TF-IDF) as features and proposes a model consisting of an MLP with 81.72% accuracy [48]. The fourth-best team extracts both semantic embedding and lexical matching features and passes them to another gradient-boosted trees model. Additionally, a two-step logistic regression classifier [4] and an ensemble model of five classifiers [49] achieve 9th and 11th places, respectively. All three challenge winners [46]–[48] in SemEval and FNC make use of both hand-crafted and neural network-based features with classification-based algorithms [50].

In another study, the focus was on predicting rumor news using an agreement-aware article search. The authors developed an agreement-aware search framework designed to provide users with a holistic view of a question for which the ground truth was uncertain. They designed a two-step model consisting of a tree-based model built on handcrafted features and an RNN-plus-attention model focusing on only a few key sentences [51]. The proposed model in [50] is a single, end-to-end ranking-based algorithm with an MLP. TF-IDF is used to extract features to represent both headlines and bodies of the news articles. The model obtains 86.66% accuracy on FNC-1.

In [12], a deep learning method is used for addressing the stance detection problem from the FNC-1 task. It incorporates bi-directional RNNs together with max-pooling and neural attention mechanisms to build representations from headlines and from the body of news articles, and combines these representations with external similarity features. The use of pre-training and the combination of neural representations with external similarity features produces 83.8% accuracy. Another work [9] uses a deep recurrent model to compute the neural embedding, a weighted n-gram bag-of-words model to compute the statistical features, and feature engineering heuristics to extract hand-crafted external features. Finally, all the features are combined using a deep neural layer for the classification of the headline-body news pair as agree, disagree, discuss, or unrelated. The obtained accuracy is 89.29%. It is shown in [30] that neural networks outperform hand-crafted features. Implementing bilateral multi-perspective matching (BiMPM) models and improving the existing Attentive Reader with a full attention mechanism between words in the body text and headlines enables the model to achieve 86.5% accuracy. A Conditional Encoding LSTM model with attention yields an 80.8% score in [31]. In another work [29], a conditioned bidirectional LSTM with global features is used. It demonstrates that the combination of global features and local word embedding features is better at predicting the stance of headline-article pairs than either of them individually, obtaining 87.4% accuracy. Rather than using a classification-based method, the work in [50] tackles the news stance detection task with a ranking-based method. The ranking-based method compares and maximizes the difference between the true and false stances of a given pair of headlines and article bodies. This approach results in 86.66% accuracy [50].


Recently, an approach based on stacked Bi-LSTM layers was introduced in [52], and a novel stacked CNN was introduced in [53]. The LSTM layer is used for sequence modeling. Bi-LSTM contains information on both ends of the sentence, which results in much better accuracy. In [54], many models were applied and tested on FNC-1. These models include CNN, LSTM, a combination of CNN and LSTM, and end-to-end memory networks. The authors also propose a novel extension of the general architecture based on a similarity-based matrix. Their work shows that the proposed model sMemNN with TF achieves the highest accuracy of 88.57%, whereas CNN+LSTM and LSTM+CNN show limited results, achieving 48.54% and 65.36% accuracy, respectively. The reason is that the data taken for training is 80% and for testing is 20%. In order to balance the data during training, equal instances of each class are randomly selected for each epoch. Furthermore, there is no mention of a pooling layer in the CNN architecture, which might have caused the low accuracy. A large-scale language model for stance detection was proposed that performs transfer learning on a RoBERTa deep bidirectional transformer language model [55]. The model achieved 93.71% accuracy on the Fake News Challenge Stage 1 (FNC-I) benchmark dataset.

However, the aforementioned studies that employ machine learning models make use of hand-crafted features. These features do not take the context of the text into account and hence produce limited results. In addition, most of the models are unsuccessful in obtaining adequate detection performance for the agree and disagree classes. To overcome these limitations, we employ CNN and LSTM layers along with dimensionality reduction techniques including PCA and Chi-square. Overall, the proposed pipeline leads to better results than the other described deep learning strategies, reaching 97.8% accuracy.

III. DATASET AND PREPROCESSING

A. DATASET

The benchmark dataset of the Fake News Challenge was collected from the official website [56]. The FNC dataset consists of 75,385 labeled instances and 2,587 article bodies, which relate to approximately 300 headlines, and for each claim there are 5 to 20 news articles. Of these headlines, 7.4% are labeled agree, 2.0% disagree, 17.7% discuss, and 72.8% unrelated, as shown in Table 2. The claims related to the article bodies are labeled manually. The details of the labels are as follows:

• Agree: There is a relation between headline and article body.
• Disagree: There is no relation between headline and article body.
• Discuss: There is a little bit of match between headline and article body, taking it as neutral.
• Unrelated: The topics discussed in the headline and body are completely different.

TABLE 2. Dataset statistics.

The dataset has been divided into 49,972 and 25,413 instances for training and testing, respectively. This distribution of training and testing data follows the rules of the FNC-1 challenge. The training data contains 1,648 headlines and 1,683 article bodies. The test data contains around 880 headlines with 904 article bodies.

B. PRE-PROCESSING

Pre-processing is a data mining technique that transforms incomplete and inconsistent raw data into a machine-understandable format. Several text pre-processing tasks were performed on the FNC-1 dataset. To perform these tasks, NLP techniques such as converting the text to lowercase, stopword removal, stemming, and tokenization were applied, using algorithms available in the Keras library.

Stopwords are very common words that have very minor importance as features and are irrelevant for this work, e.g., 'of', 'the', 'and', 'an', etc. By removing the stopwords, we reduce the processing time and save the space otherwise taken by such meaningless words. In the text, words having similar meanings can occur more than once, e.g., game and games. In that case, reducing the words to a common basic form is very effective. This process is known as stemming, and it is performed with the open-source NLTK implementation of the Porter stemmer algorithm.

After the execution of the pre-processing steps described above, the number of terms in the headlines was reduced to 372. The tokenizer function from the Keras library was used to split each headline into a vector of words. Once the preprocessing is done, we use word embedding (word2vec) to map words/text to a list of vectors. Finally, a dictionary of the 5,000 unigram words of headlines and article bodies is created. The length of all the headlines is set to the maximum headline length. Headlines shorter than the maximum length are zero-padded. Next, the features are fed into the hybrid neural network architecture consisting of CNN [57] and LSTM layers.

IV. PROPOSED METHODOLOGY

A. PROPOSED MODEL

The main contribution of this work is to propose feature reduction techniques combined with a hybrid deep learning model involving two neural network layers, i.e., CNN and LSTM. The proposed approach produces higher predictive performance than traditional deep learning models. To analyze the relationship, four data models are developed. In the first model, all the features are used without preprocessing for classification. In the second model, the non-reduced feature set is used after preprocessing.
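The pre-processing and encoding steps of Section III-B (lowercasing, stopword removal, stemming, tokenization, a unigram dictionary, and zero-padding) can be sketched in a few lines. This is a minimal illustration in plain Python, not the authors' pipeline: the paper uses the Keras tokenizer and NLTK's Porter stemmer, whereas the `STOPWORDS` set, the crude `naive_stem` suffix stripper, the helper names, and the two sample headlines below are stand-ins assumed for the example.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list (assumption; a real list is much longer).
STOPWORDS = {"of", "the", "and", "an", "a", "in", "to", "is", "are"}


def naive_stem(word):
    """Very rough stand-in for a Porter stemmer: strips a few common suffixes."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(text):
    """Lowercase, tokenize, drop stop words, and stem."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [naive_stem(t) for t in tokens if t not in STOPWORDS]


def build_vocab(token_lists, max_words=5000):
    """Map the most frequent unigrams to integer ids; id 0 is reserved for padding."""
    counts = Counter(tok for toks in token_lists for tok in toks)
    return {w: i + 1 for i, (w, _) in enumerate(counts.most_common(max_words))}


def encode_and_pad(token_lists, vocab):
    """Convert tokens to ids and zero-pad every sequence to the longest one."""
    seqs = [[vocab[t] for t in toks if t in vocab] for toks in token_lists]
    max_len = max(len(s) for s in seqs)
    return [s + [0] * (max_len - len(s)) for s in seqs]


# Hypothetical headlines, for illustration only.
headlines = ["Seven girls are playing games", "The game of the season"]
tokens = [preprocess(h) for h in headlines]
vocab = build_vocab(tokens)
padded = encode_and_pad(tokens, vocab)
print(tokens)
print(padded)
```

Here "games" and "game" collapse to the same stem, so they share one dictionary entry, and the shorter headline is padded with zeros up to the length of the longer one.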


FIGURE 1. Proposed model architecture diagram.
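The layer stack of Fig. 1 can be followed with simple shape arithmetic. The sketch below is only an illustration: the embedding size (100), the 5000 features, the 64 filters, the kernel size of 5, the pool size of 4, and the 100 LSTM units are the values quoted in the text, while the convolution stride of 1, the absence of padding, non-overlapping pooling, and the four-way output layer are assumptions made for the example, since the layer table is not reproduced here.

```python
def conv1d_out_len(length, kernel, stride=1):
    """Output length of a 1-D convolution without padding."""
    return (length - kernel) // stride + 1


def pool1d_out_len(length, pool):
    """Output length of non-overlapping 1-D max pooling."""
    return length // pool


def trace_shapes(seq_len=5000, embed_dim=100, filters=64, kernel=5,
                 pool=4, lstm_units=100, num_classes=4):
    """Return (layer name, output shape) pairs for one input sample."""
    shapes = [("embedding", (seq_len, embed_dim))]
    conv_len = conv1d_out_len(seq_len, kernel)
    shapes.append(("conv1d + ReLU", (conv_len, filters)))
    pool_len = pool1d_out_len(conv_len, pool)
    shapes.append(("max_pooling1d", (pool_len, filters)))
    # The LSTM is assumed to return only its final hidden state.
    shapes.append(("lstm", (lstm_units,)))
    shapes.append(("dense", (num_classes,)))
    return shapes


for name, shape in trace_shapes():
    print(f"{name:15s} -> {shape}")
```

Under these assumptions the convolution shortens the sequence from 5000 to 4996 positions and the pooling layer reduces it by a factor of four, which is the feature reduction the max-pooling step is meant to provide before the LSTM.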

The third and fourth models are developed using dimensionality reduction [58], [59] techniques, namely PCA and Chi-square. This work further investigates which of these models is most suited for use in conjunction with the hybrid CNN and LSTM model when dealing with text data.

After the features are selected by any of the four models discussed above, the selected features are fed to the CNN-LSTM architecture. The first layer of the model is the embedding layer, which accepts the input headlines and article bodies and converts each word into a vector of size 100. The number of features is 5000; thus, this layer outputs a matrix of size 5000 × 100. The output matrix contains weights obtained through matrix multiplication, producing a vector for each word. These vectors are passed to the CNN layer to extract contextual features. The output of the CNN layer is fed into the LSTM and then passed to a fully connected dense layer to produce a single stance as the final output. The proposed model, shown in Fig. 1, is trained and tested on small batches of size 32.

B. DIMENSIONALITY REDUCTION METHODS

There are two ways to perform dimensionality reduction in text categorization: feature extraction and feature selection. In feature selection methods, the most significant and relevant features are retained and the remaining features are discarded [60]. On the other hand, in feature extraction methods, a new vector space with special characteristics is created by transforming the original vector space. The features are reduced in the new vector space [32].

The advantage of reducing features is that the processing time is reduced, which in turn results in higher performance [61]. Feature reduction has a great impact on text classification results [62]. Therefore, it is crucial to choose the right selection algorithm to reduce dimensions. Information Gain (IG), Mutual Information (MI) [62], Gini Coefficient (GI), Term Frequency-Inverse Document Frequency (TF-IDF) [63], Principal Component Analysis (PCA), and Chi-Square Statistics (CHI) are some of the common feature reduction algorithms. To improve the scalability of the text classifier, PCA and Chi-square are the two dimensionality reduction approaches used in combination with deep learning models.

C. PRINCIPAL COMPONENT ANALYSIS (PCA)

Principal Component Analysis (PCA) is a widely used technique that uses a linear transformation to reduce the dimensions of a feature set. The resulting dataset is simplified but retains the characteristics of the original dataset [35]. The new dataset might have an equal or smaller number of features than the original dataset. The covariance matrix is used to compute the principal components. These components are arranged in decreasing order of importance [64]. Let us assume that the original matrix comprises $a$ dimensions and $b$ observations and that the dimensionality must be reduced to a $t$-dimensional subspace; the transformation is then given by the following equation.

$$Y = E^{\top} X \tag{1}$$

In the above equation, $E_{a \times t}$ is the projection matrix, which contains the $t$ eigenvectors corresponding to the $t$ highest eigenvalues, and $X_{a \times b}$ is the mean-centered data matrix.

D. CHI-SQUARE

Chi-Square Statistics is one of the most effective feature selection algorithms [65]. It is designed for testing relationships between categorical variables. It is used to estimate the lack of independence between a and b, and the result is compared to the chi-square distribution with one degree of freedom to judge extremeness [62], [66]. The test for independence and the test for goodness of fit are the two types of tests for which Chi-square is used. For feature selection, the test for independence is implemented and the dependency of the target label on the feature(s) is examined. Chi-square investigates the dependence between each feature and the target label. Features that show dependence are kept and the remaining features are discarded. For each feature, chi-square is calculated independently with respect to the target class, and its significance is decided against a predefined threshold on the p-value (commonly 0.05). The greater the chi-square value, the smaller the p-value and the stronger the evidence that the feature depends on the target class; the smaller the chi-square value, the less significant the feature. Many researchers have reported improved results when using chi-square for feature reduction in text categorization [61], [65].

The formula for chi-square feature selection is shown in Equation (2), where $c$ is the degrees of freedom (threshold value), $O$ is the observed value, $E$ is the expected value, and $\chi^2$ is the chi-square value computed for the feature.

$$\chi_c^2 = \sum_i \frac{(O_i - E_i)^2}{E_i} \tag{2}$$

E. INPUT AND CONVOLUTION LAYER

In the dataset, a text sequence 'a' contains 'w' entries. A d-dimensional dense vector is used to represent each entry 'w'. The feature map of input 'a' has d × w dimensions. In the first step, we tokenize the headline and body texts using the Keras tokenizer. After that, the Keras embedding layer uses word2vec word embeddings to transform the tokens into word vectors. For models one and two, the word vectors obtained from the word embedding layer are fed as input to the convolution layer. In contrast, for models three and four, significant features are first extracted by PCA and Chi-square. These features are then converted into word vectors by the embedding layer. Finally, these word vectors are passed to the convolution layer.

The function of the convolution layers is to capture specific semantic or structural features from the input matrix. Each word vector is passed to the CNN neurons n. By applying filters with different sizes, we can get different kinds of features. Multiple filters f of different kernel sizes c are applied to each word embedding e, and the output is generated as (c × e). In our work, the kernel size is 5; therefore, each of the 64 filters creates 5-word combinations. The input and output shapes of the CNN, with several parameters, are shown in Table 3 and Table 4.

TABLE 3. Layers structure of proposed model used in this work.

TABLE 4. Model parameter structure.

F. ACTIVATION FUNCTION, MAX-POOLING AND DROPOUT

The ReLU activation function is applied to the output of each CNN neuron. The purpose of this activation layer is to convert any negative value to zero and to introduce non-linearity into the network. The function does not affect the output shape of the CNN layer; thus, it is the same as the input shape.

The value of each neuron, after passing through the ReLU activation function, is then fed to a 1-D max-pooling layer. This layer reduces each pooling window to a single output by selecting the maximum value in the window. This greatly reduces the size of the input features for the next layers and helps avoid overfitting. The pool size p in our case is 4; thus, this layer reduces the number of features by a factor of the pool size (p).

The dropout rate D for the whole network model is 0.2. The dropout layer is another way to reduce overfitting: during training, it randomly sets a fraction D of its inputs to zero. The output shape of the dropout layer is the same as that of the input passed to it.

G. LSTM

The next layer is an LSTM with 100 units. We have to generate a long chain-like sequence structure of our data and keep the knowledge of previous inputs. For this purpose, LSTM is the most suitable choice, as it consists of three gates: the input gate $i_k$, the output gate $o_k$, and the forget gate $f_k$. These gates decide which information is important for classification and which information can be forgotten. Previous input, which is necessary for


prediction, is stored in the cell memory block $c_k$. Many variants of LSTM are available, but the one we used in our model is as follows.

$$i_k = \sigma(W_i s_k + V_i h_{k-1} + b_i) \tag{3}$$
$$f_k = \sigma(W_f s_k + V_f h_{k-1} + b_f) \tag{4}$$
$$o_k = \sigma(W_o s_k + V_o h_{k-1} + b_o) \tag{5}$$
$$c_k = \tanh(W_c s_k + V_c h_{k-1} + b_c) \tag{6}$$

where $s = (s_1, s_2, s_3, \ldots, s_N)$ is the input sequence and $s_k$ is its $k$-th vector representation. W and V are the weights associated with each matrix element. h is the hidden state related to time step $k-1$, where $s_k$ is the input at that time and b is the bias vector. c is the cell memory block, which gets updated each time at step $k-1$. In the output of the LSTM layer, all 100 units are connected to every unit in the dense layer.

H. DENSE
The last layer of the proposed model is a fully connected dense layer, which produces a single output as a result. This layer is followed by a softmax activation function. Softmax activation is used for multi-class classification. We have used softmax activation because our dataset contains four classes (agree, disagree, discuss, and unrelated). We used Adam as the optimizer for testing purposes. The batch size used in testing is 32 and the number of epochs is set to 50.

V. PERFORMANCE EVALUATION METRICS
To compare and evaluate our model, we use accuracy (A), precision (P), recall (R), and F1-score (F) as evaluation metrics. Precision and recall are computed using equations 7 and 8, whereas the F1-score is the harmonic mean of precision and recall, as expressed in equation 9.

$$P = \frac{\text{True Positive}}{\text{True Positive} + \text{False Positive}} \tag{7}$$

Precision is calculated as the ratio of the correctly classified positive class to the sum of the correctly and falsely classified values of the positive class. It tells us about the factualness of the model.

$$R = \frac{\text{True Positive}}{\text{True Positive} + \text{False Negative}} \tag{8}$$

Recall is calculated as the ratio of the correctly classified positive class to the sum of the correctly classified values of the positive class and the falsely classified values of the negative class. It tells us about the completeness of the model.

$$F1 = 2 \times \frac{\text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}} \tag{9}$$

The F1-score determines the accuracy of the model for each class. The F1-score metric is usually used when the dataset is imbalanced. Because the FNC-1 dataset is also highly imbalanced, we use the F1-score as the evaluation metric for class-wise accuracy, to show the completeness of the proposed model.

VI. RESULTS AND DISCUSSION
In the final set of experiments, the proposed ensemble model of CNN-LSTM is trained on 49,972 samples and tested on 25,413 headlines and articles. The training is performed using a 2 GB Dell PowerEdge T 430 graphical processing unit on a 2x Intel Xeon 8 Cores 2.4 GHz machine equipped with 32 GB DDR4 Random Access Memory (RAM). The training takes 3 hours to run the epochs on the 'Fake News Challenge Dataset' using pre-trained word embeddings and to show the classification results. In contrast, the feature reduction techniques take 1.8 hours for the computation.
We have compared the outputs of the non-reduced feature set, PCA, and chi-square used by a CNN-LSTM architecture. By analyzing all the results, one can conclude that using PCA is more effective for severe dimensionality reduction, as it significantly improved the accuracy. The presented model outperforms all other models by producing an accuracy of 97.8%. The average precision, recall, and F1-score for all classes are 97.4%, 98.2%, and 97.8% respectively; the detailed statistical results of our proposed model are shown in Table 5. The statistical significance ensures that one can easily classify any news as fake or legitimate using our proposed model. The training and testing accuracy and loss are shown in Figures 2a and 2b, whereas the accuracy and F1-score comparison of our proposed model with state-of-the-art techniques is shown in Figure 3.

TABLE 5. CNN-LSTM model classification results.

As we know, the LSTM can handle sequential data. If the amount of data is quite large, it takes a lot of time to generate sequences and is likely to overfit. CNN, in contrast, cannot handle sequences of data, as it does not contain a memory unit. However, we can use principal component analysis (PCA) and chi-square to extract significant features to feed into the CNN-LSTM model.
It is evident from the results that when the features are used without any data cleaning or preprocessing, the accuracy is only 78%, which is remarkably low. It indicates that the original dataset contains inconsistent, redundant, and noisy data in abundance. After performing the preprocessing steps and eliminating useless data, the accuracy goes up dramatically, to 93.0%. Besides, the application of chi-square further raises the accuracy to 95% by selecting relevant features. Finally, we observed that the use of PCA with the CNN and LSTM architecture resulted in the highest accuracy, 97.8%, with an insignificant decrease in categorization effectiveness. There was a sharp increase in precision, recall and F1-score as well. Results of k-fold cross-validation to show sample variations are presented in Table 7. Moreover, there is a drastic decrease in the time required for performing a prediction when using dimensionality reduction methods. One of the many reasons


is that PCA has low sensitivity to noise. It requires low capacity and memory. Moreover, it does not require large computation [32]. Thus, it has great advantages in terms of time and space complexity.

FIGURE 2. Training and Testing Accuracy and Loss of Proposed Model.

FIGURE 3. Statistical and Accuracy Comparison of Different Methodologies.

However, there is a limitation of using feature reduction techniques. The features in the dataset should be correlated enough to produce better results. Otherwise, these techniques would not have much effect on the final outcome. Furthermore, this work is limited to the training of claims and articles in English only. If this work is extended to other languages, cultural norms and differences in writing style might result in different performance.
Table 6 shows the comparison of the F1-scores of all the different approaches evaluated in [52] with our proposed model. It is evident from the results that the F1-score of 'unrelated' is the highest among all the stances in every model. The reason is that the dataset is not balanced: the records of class 'unrelated' far outnumber those of the other classes. Our model has the highest F1-score for the discuss and agree stances. Upper bound has a slightly higher F1-score


for the disagree stance than our model. The complete comparison of the class-wise F1-scores is shown in Figure 4.

FIGURE 4. Class-wise F1 score comparison.

TABLE 6. Accuracy and F1-score comparison of different approaches.

TABLE 7. CNN-LSTM model k-fold cross-validation with PCA.

A. COMPARISON WITH DEEP LEARNING MODELS

1) BERT
BERT stands for Bidirectional Encoder Representations from Transformers [67]. The results presented on the FNC-1 task used the fine-tuning approach, where all parameters are jointly fine-tuned and a simple classification layer is added to the pre-trained model. All masked positions are predicted by BERT independently. This means that it neglects dependencies between predicted masked positions during training. Because of the reduced set of dependencies that BERT learns simultaneously, it suffers from a pretrain-finetune inconsistency. This model obtains 91.3% accuracy on the FNC-1 task. The F1-score achieved by BERT is far lower than that of our model, as are its F1-scores for the agree, disagree, and unrelated classes.

2) XLNet
XLNet combines a bidirectional context with the avoidance of independent predictions [68]. It introduces "permutation language modeling", in which it predicts tokens in some random order rather than predicting them in sequence. XLNet uses Transformer-XL as its foundation architecture and outperforms BERT on 20 tasks. These tasks include document ranking, natural language inference, question answering, and sentiment analysis. It improves upon BERT


on the FNC-1 task and obtains 92.1% accuracy and a 76.0% F1-score.

3) RoBERTa
An open-source language model named RoBERTa (Robustly Optimized BERT Approach) was released in July of 2019 [69]. In [67], the author constructs a large-scale language model using transfer learning on the RoBERTa-based deep transformer model, consisting of 12 layers of 768 hidden units, each with 12 attention heads, adding up to 125M parameters. To perform transfer learning, they train for fifty epochs and follow the hyperparameter recommendations of [69]; the resulting model outperforms both the BERT and XLNet models. The model achieves 93.71% accuracy, which is lower than that of our model. Better accuracy can be achieved by using our proposed model with PCA and only one CNN layer and one LSTM layer. We modified only a limited number of parameters, whereas in RoBERTa there are 125M parameters that require precise tuning. Moreover, with 12 layers of 768 hidden units in RoBERTa, the computational costs increase immensely. By comparing F1-scores, it is clear that the performance of RoBERTa on individual classes is lower than that of our model. It results in insufficient performance even on the agree and disagree classes. The F1-scores for the discuss and unrelated classes are almost the same. The complete comparison of all these deep learning approaches is shown in Table 8.

TABLE 8. Accuracy and F1-score comparison with other deep learning approaches.

VII. CONCLUSION AND FUTURE WORK
This study proposed a fake news stance detection model based on the headline and the body of the news, unlike previous studies, which considered only individual sentences or phrases. The proposed model incorporates principal component analysis (PCA) and chi-square with CNN and LSTM, in which PCA and chi-square extract the quality features that are passed to the CNN-LSTM model. First, we pass the non-reduced feature set, with and without preprocessing, to the neural network. Then the dimensionality reduction techniques are applied and the results are compared. PCA elevates the performance of the classifier for fake news detection as it removes the irrelevant, noisy, and redundant features from the feature vector. This process produces promising results, scoring up to 97.8% accuracy, which is considerably better than the previous studies. It is pertinent to say that dimensionality reduction approaches can reduce the number of features while preserving the high performance of classifiers. Our future work entails: (a) validating the performance of our proposed model on larger datasets, (b) investigating whether tree-based learning performs better than simple approaches, and (c) analyzing different textual features and their fusion to boost performance.

CONFLICT OF INTEREST
Authors declare no conflict of interest exists.

REFERENCES
[1] T. Mihaylov, G. Georgiev, and P. Nakov, "Finding opinion manipulation trolls in news community forums," in Proc. 19th Conf. Comput. Natural Lang. Learn., Beijing, China, Jul. 2015, pp. 310–314. [Online]. Available: https://www.aclweb.org/anthology/K15-1032
[2] T. Mihaylov, I. Koychev, G. Georgiev, and P. Nakov, "Exposing paid opinion manipulation trolls," in Proc. Int. Conf. Recent Adv. Natural Lang. Process., Hissar, Bulgaria, Sep. 2015, pp. 443–450. [Online]. Available: https://www.aclweb.org/anthology/R15-1058
[3] T. Mihaylov and P. Nakov, "Hunting for troll comments in news community forums," in Proc. 54th Annu. Meeting Assoc. for Comput. Linguistics, vol. 2, 2016, pp. 399–405, doi: 10.18653/v1/P16-2065.
[4] P. Bourgonje, J. Moreno Schneider, and G. Rehm, "From clickbait to fake news detection: An approach based on detecting the stance of headlines to articles," in Proc. EMNLP Workshop: Natural Lang. Process. meets Journalism, 2017, pp. 84–89. [Online]. Available: https://www.aclweb.org/anthology/W17-4215
[5] S. Vosoughi, D. Roy, and S. Aral, "The spread of true and false news online," Science, vol. 359, no. 6380, pp. 1146–1151, Mar. 2018.
[6] A. M. Michael Barthel and J. Holcomb. (2016). Many Americans Believe Fake News is Sowing Confusion. Accessed: Sep. 29, 2019. [Online]. Available: https://www.journalism.org/2016/12/15/many-americans-believe-fake-news-is-sowing-confusion/
[7] A. K. Chaudhry, "Stance detection for the fake news challenge: identifying textual relationships with deep neural nets," in Proc. Natural Lang. Process. Deep Learn., 2017, pp. 1–10.
[8] S. Chopra, "Towards automatic identification of fake news: Headline-article stance detection with LSTM attention models," Stanford Univ., Stanford, CA, USA, Tech. Rep., 2017.
[9] G. Bhatt, A. Sharma, S. Sharma, A. Nagpal, B. Raman, and A. Mittal, "Combining neural, statistical and external features for fake news stance identification," in Proc. Companion Web Conf. Web Conf. (WWW), Geneva, Switzerland, 2018, p. 1353, doi: 10.1145/3184558.3191577.
[10] L. Konstantinovskiy, O. Price, M. Babakar, and A. Zubiaga, "Towards automated factchecking: Developing an annotation schema and benchmark for consistent automated claim detection," 2018, arXiv:1809.08193. [Online]. Available: http://arxiv.org/abs/1809.08193
[11] D. M. J. Lazer, M. A. Baum, Y. Benkler, A. J. Berinsky, K. M. Greenhill, F. Menczer, M. J. Metzger, B. Nyhan, G. Pennycook, D. Rothschild, M. Schudson, S. A. Sloman, C. R. Sunstein, E. A. Thorson, D. J. Watts, and J. L. Zittrain, "The science of fake news," Science, vol. 359, no. 6380, pp. 1094–1096, 2018.
[12] L. Borges, B. Martins, and P. Calado, "Combining similarity features and deep representation learning for stance detection in the context of checking fake news," J. Data Inf. Qual., vol. 11, no. 3, pp. 1–26, Jul. 2019, doi: 10.1145/3287763.
[13] M. A. Walker, P. Anand, R. Abbott, and R. Grant, "Stance classification using dialogic properties of persuasion," in Proc. Conf. North Amer. Chapter Assoc. Comput. Linguistics: Hum. Lang. Technol., Stroudsburg, PA, USA, 2012, pp. 592–596. [Online]. Available: http://dl.acm.org/citation.cfm?id=2382029.2382124
[14] D. Sridhar, J. Foulds, B. Huang, L. Getoor, and M. Walker, "Joint models of disagreement and stance in online debate," in Proc. 53rd Annu. Meeting Assoc. Comput. Linguistics 7th Int. Joint Conf. Natural Lang. Process., vol. 1, Beijing, China, Jul. 2015, pp. 116–125. [Online]. Available: https://www.aclweb.org/anthology/P15-1012
[15] S. Somasundaran and J. Wiebe, "Recognizing stances in ideological on-line debates," in Proc. NAACL HLT Workshop Comput. Approaches Anal. Gener. Emotion Text, Los Angeles, CA, USA, Jun. 2010, pp. 116–124. [Online]. Available: https://www.aclweb.org/anthology/W10-0214
[16] M. Lukasik, P. K. Srijith, D. Vu, K. Bontcheva, A. Zubiaga, and T. Cohn, "Hawkes processes for continuous time sequence classification: an application to rumour stance classification in twitter," in Proc. 54th Annu. Meeting Assoc. Comput. Linguistics, vol. 2, Berlin, Germany, Aug. 2016, pp. 393–398. [Online]. Available: https://www.aclweb.org/anthology/P16-2064


[17] L. Derczynski, K. Bontcheva, M. Liakata, R. Procter, G. Wong Sak Hoi, and A. Zubiaga, "SemEval-2017 Task 8: RumourEval: Determining rumour veracity and support for rumours," in Proc. 11th Int. Workshop Semantic Eval. (SemEval), Vancouver, BC, Canada, Aug. 2017, pp. 69–76. [Online]. Available: https://www.aclweb.org/anthology/S17-2006
[18] C. Stab and I. Gurevych, "Parsing argumentation structures in persuasive essays," Comput. Linguistics, vol. 43, no. 3, pp. 619–659, Sep. 2017. [Online]. Available: https://www.aclweb.org/anthology/J17-3005
[19] D. P. Rao. (2017). Exploring How Artificial Intelligence Technologies Could be Leveraged to Combat Fake News. Accessed: Oct. 29, 2019. [Online]. Available: http://www.fakenewschallenge.org/
[20] J. Chung, Ç. Gülçehre, K. Cho, and Y. Bengio, "Empirical evaluation of gated recurrent neural networks on sequence modeling," CoRR, vol. abs/1412.3555, 2014. [Online]. Available: http://arxiv.org/abs/1412.3555
[21] A. Graves and J. Schmidhuber, "Framewise phoneme classification with bidirectional LSTM and other neural network architectures," Neural Netw., vol. 18, pp. 602–610, Jul./Aug. 2005.
[22] P. Neculoiu, M. Versteegh, and M. Rotaru, "Learning text similarity with siamese recurrent networks," in Proc. 1st Workshop Represent. Learn., Jan. 2016, pp. 148–157.
[23] H. He, K. Gimpel, and J. Lin, "Multi-perspective sentence similarity modeling with convolutional neural networks," in Proc. Conf. Empirical Methods Natural Lang. Process., 2015, pp. 1576–1586. [Online]. Available: https://www.aclweb.org/anthology/D15-1181
[24] R. Kiros, Y. Zhu, R. Salakhutdinov, R. S. Zemel, A. Torralba, R. Urtasun, and S. Fidler, "Skip-thought vectors," in Proc. Adv. Neural Inf. Process. Syst., 2015, pp. 3294–3302.
[25] K. S. Tai, R. Socher, and C. D. Manning, "Improved semantic representations from tree-structured long short-term memory networks," Proc. 53rd Annu. Meeting Assoc. Comput. Linguistics 7th Int. Joint Conf. Natural Lang. Process., vol. 1, 2015, pp. 1–11. [Online]. Available: http://dx.doi.org/10.3115/v1/p15-1150
[26] L. Yang, Q. Ai, D. Spina, R.-C. Chen, L. Pang, W. B. Croft, J. Guo, and F. Scholer, "Beyond factoid qa: Effective methods for non-factoid answer sentence retrieval," in ECIR, 2016.
[27] Y. Yang, W.-T. Yih, and C. Meek, "WikiQA: A challenge dataset for open-domain question answering," in Proc. Conf. Empirical Methods Natural Lang. Process., Lisbon, Portugal, Sep. 2015, pp. 2013–2018. [Online]. Available: https://www.aclweb.org/anthology/D15-1237
[28] Z. Imtiaz, M. Umer, M. Ahmad, S. Ullah, G. S. Choi, and A. Mehmood, "Duplicate questions pair detection using siamese MaLSTM," IEEE Access, vol. 8, pp. 21932–21942, 2020.
[29] B. Ghanem, P. Rosso, and F. Rangel, "Stance detection in fake news a combined feature representation," in Proc. 1st Workshop Fact Extraction Verification (FEVER), Brussels, Belgium, Nov. 2018, pp. 66–71. [Online]. Available: https://www.aclweb.org/anthology/W18-5510
[30] Q. Zeng, "Neural stance detectors for fake news challenge," Stanford Univ., Stanford, CA, USA, Tech. Rep., 2017.
[31] S. R. Pfohl, "Stance detection for the fake news challenge with attention and conditional encoding," Stanford Univ., Stanford, CA, USA, Tech. Rep., 2017.
[32] D. A., A. I., and S. S., "A comparative study on using principle component analysis with different text classifiers," Int. J. Comput. Appl., vol. 180, no. 31, pp. 1–6, Apr. 2018, doi: 10.5120/ijca2018916800.
[33] S. Karamizadeh, S. Abdullah, A. Manaf, M. Zamani, and A. Hooman, "An overview of principal component analysis," J. Signal Inf. Process., vol. 4, no. 3B, p. 173, Aug. 2013.
[34] M. Ahmad, A. M. Khan, J. A. Brown, S. Protasov, and A. M. Khattak, "Gait fingerprinting-based user identification on smartphones," in Proc. Int. Joint Conf. Neural Netw. (IJCNN), Jul. 2016, pp. 3060–3067.
[35] S. Deegalla and H. Bostrom, "Reducing high-dimensional data by principal component analysis vs. Random projection for nearest neighbor classification," in Proc. 5th Int. Conf. Mach. Learn. Appl. (ICMLA), Dec. 2006, pp. 245–250.
[36] M. Ahmad, D. I. U. Haq, Q. Mushtaq, and M. Sohaib, "A new statistical approach for band clustering and band selection using K-means clustering," IACSIT Int. J. Eng. Technol., vol. 3, no. 6, pp. 606–614, Dec. 2011.
[37] S. Mohammad, S. Kiritchenko, P. Sobhani, X. Zhu, and C. Cherry, "SemEval-2016 task 6: Detecting stance in tweets," in Proc. 10th Int. Workshop Semantic Eval. (SemEval-2016), San Diego, CA, USA, Jun. 2016, pp. 31–41. [Online]. Available: https://www.aclweb.org/anthology/S16-1003
[38] K. Popat, S. Mukherjee, J. Strötgen, and G. Weikum, "Where the truth lies: Explaining the credibility of emerging claims on the Web and social media," in Proc. 26th Int. Conf. World Wide Web Companion, Apr. 2017, pp. 1003–1012.
[39] C. Stab, T. Miller, and I. Gurevych, "Cross-topic argument mining from heterogeneous sources using attention-based neural networks," 2018, arXiv:1802.05758. [Online]. Available: http://arxiv.org/abs/1802.05758
[40] I. Augenstein, T. Rocktäschel, A. Vlachos, and K. Bontcheva, "Stance detection with bidirectional conditional encoding," in Proc. 2016 Conf. Empirical Methods Natural Lang. Process., Austin, TX, USA, Nov. 2016, pp. 876–885. [Online]. Available: https://www.aclweb.org/anthology/D16-1084
[41] G. Zarrella and A. Marsh, "MITRE at SemEval-2016 task 6: Transfer learning for stance detection," 2016, arXiv:1606.03784. [Online]. Available: http://arxiv.org/abs/1606.03784
[42] O. Enayet and S. R. El-Beltagy, "NileTMRG at SemEval-2017 task 8: Determining rumour and veracity support for rumours on Twitter," in Proc. 11th Int. Workshop Semantic Eval. (SemEval), Vancouver, BC, Canada, Aug. 2017, pp. 470–474. [Online]. Available: https://www.aclweb.org/anthology/S17-2082
[43] S. Dungs, A. Aker, N. Fuhr, and K. Bontcheva, "Can rumour stance alone predict veracity?" in Proc. 27th Int. Conf. Comput. Linguistics, Santa Fe, NM, USA, Aug. 2018, pp. 3360–3370. [Online]. Available: https://www.aclweb.org/anthology/C18-1284
[44] X. Wang, C. Yu, S. Baumgartner, and F. Korn, "Relevant document discovery for fact-checking articles," in Proc. Companion Web Conf., 2018, pp. 525–533.
[45] W. Ferreira and A. Vlachos, "Emergent: a novel data-set for stance classification," in Proc. Conf. North Amer. Chapter Assoc. Comput. Linguistics: Hum. Lang. Technol., San Diego, CA, USA, Jun. 2016, pp. 1163–1168. [Online]. Available: https://www.aclweb.org/anthology/N16-1138
[46] B. Sean, S. Doug, and P. Yuxi. (2017). Talos Targets Disinformation With Fake News Challenge Victory. Accessed: Oct. 29, 2019. [Online]. Available: http://blog.talosintelligence.com/2017/06/talos-fake-news-challenge.html
[47] A. Hanselowski, A. PVS, B. Schiller, and F. Caspelherr. (2017). Team Athene on the Fake News Challenge. Accessed: Oct. 29, 2019. [Online]. Available: https://medium.com/@andre134679/team-athene-on-the-fake-news-challenge-28a5cf5e017b
[48] B. Riedel, I. Augenstein, G. P. Spithourakis, and S. Riedel, "A simple but tough-to-beat baseline for the fake news challenge stance detection task," 2017, arXiv:1707.03264. [Online]. Available: http://arxiv.org/abs/1707.03264
[49] J. Thorne, M. Chen, G. Myrianthous, J. Pu, X. Wang, and A. Vlachos, "Fake news stance detection using stacked ensemble of classifiers," in Proc. EMNLP Workshop: Natural Lang. Process. Meets Journalism, Copenhagen, Denmark, Sep. 2017, pp. 80–83. [Online]. Available: https://www.aclweb.org/anthology/W17-4214
[50] Q. Zhang, E. Yilmaz, and S. Liang, "Ranking-based method for news stance detection," in Proc. Companion Web Conf., Geneva, Switzerland, 2018, pp. 41–42, doi: 10.1145/3184558.3186919.
[51] J. Shang, T. Sun, J. Shen, X. Liu, A. Gruenheid, F. Korn, A. Lelkes, C. Yu, and J. Han, "Investigating rumor news using agreement-aware search," 2018, arXiv:1802.07398. [Online]. Available: http://arxiv.org/abs/1802.07398
[52] A. Hanselowski, A. PVS, B. Schiller, F. Caspelherr, D. Chaudhuri, C. M. Meyer, and I. Gurevych, "A retrospective analysis of the fake news challenge stance-detection task," in Proc. 27th Int. Conf. Comput. Linguistics, Santa Fe, NM, USA, Aug. 2018, pp. 1859–1874. [Online]. Available: https://www.aclweb.org/anthology/C18-1158
[53] M. Umer, S. Sadiq, M. Ahmad, S. Ullah, G. S. Choi, and A. Mehmood, "A novel stacked CNN for malarial parasite detection in thin blood smear images," IEEE Access, vol. 8, pp. 93782–93792, 2020.
[54] M. Mohtarami, R. Baly, J. Glass, P. Nakov, L. Marquez, and A. Moschitti, "Automatic stance detection using End-to-End memory networks," 2018, arXiv:1804.07581. [Online]. Available: http://arxiv.org/abs/1804.07581
[55] C. Dulhanty, J. L. Deglint, I. Ben Daya, and A. Wong, "Taking a stance on fake news: Towards automatic disinformation assessment via deep bidirectional transformer language models for stance detection," 2019, arXiv:1911.11951. [Online]. Available: https://arxiv.org/abs/1911.11951
[56] D. Pomerleau and Rao. (2017). Fake News Challenge Dataset. Accessed: Oct. 29, 2019. [Online]. Available: http://www.fakenewschallenge.org/
[57] M. Ahmad, "A fast 3D CNN for hyperspectral image classification," 2020, arXiv:2004.14152. [Online]. Available: http://arxiv.org/abs/2004.14152


[58] M. Ahmad, D. Ihsan, and D. Ulhaq, "Linear unmixing and target detection of hyperspectral imagery using OSP," in Proc. Int. Conf. Modeling, Simulation Control, Singapore, Jan. 2011.
[59] M. Ahmad, M. A. Alqarni, A. M. Khan, R. Hussain, M. Mazzara, and S. Distefano, "Segmented and non-segmented stacked denoising autoencoder for hyperspectral band reduction," Optik, vol. 180, pp. 370–378, Feb. 2019.
[60] M. Ahmad, A. M. Khan, and R. Hussain, "Graph-based spatial–spectral feature learning for hyperspectral image classification," IET Image Process., vol. 11, no. 12, pp. 1310–1316, 2017.
[61] P. Meesad, P. Boonrawd, and V. Nuipian, "A chi-square-test for word importance differentiation in text classification," in Proc. Int. Conf. Inf. Electron. Eng., 2011, pp. 110–114.
[62] Y. Yang and J. O. Pedersen, "A comparative study on feature selection in text categorization," in Proc. ICML, 1997, pp. 412–420.
[63] S. M. H. Dadgar, M. S. Araghi, and M. M. Farahani, "A novel text mining approach based on TF-IDF and support vector machine for news classification," in Proc. IEEE Int. Conf. Eng. Technol. (ICETECH), Mar. 2016, pp. 112–116.
[64] D. Hou, J. Zhang, Z. Yang, S. Liu, P. Huang, and G. Zhang, "Distribution water quality anomaly detection from UV optical sensor monitoring data by integrating principal component analysis with chi-square distribution," Opt. Express, vol. 23, no. 13, p. 17487, Jun. 2015, doi: 10.1364/OE.23.017487.
[65] Y. Zhai, W. Song, X. Liu, L. Liu, and X. Zhao, "A chi-square statistics based feature selection method in text classification," in Proc. IEEE 9th Int. Conf. Softw. Eng. Service Sci. (ICSESS), Nov. 2018, pp. 160–163.
[66] X. Xia, D. Lo, W. Qiu, X. Wang, and B. Zhou, "Automated configuration bug report prediction using text mining," in Proc. IEEE 38th Annu. Comput. Softw. Appl. Conf., Jul. 2014, pp. 107–116.
[67] V. Slovikovskaya, "Transfer learning from transformers to fake news challenge stance detection (FNC-1) task," 2019, arXiv:1910.14353. [Online]. Available: http://arxiv.org/abs/1910.14353
[68] Z. Yang, Z. Dai, Y. Yang, J. Carbonell, R. Salakhutdinov, and Q. V. Le, "XLNet: Generalized autoregressive pretraining for language understanding," 2019, arXiv:1906.08237. [Online]. Available: http://arxiv.org/abs/1906.08237
[69] Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, and V. Stoyanov, "RoBERTa: A robustly optimized BERT pretraining approach," 2019, arXiv:1907.11692. [Online]. Available: http://arxiv.org/abs/1907.11692

MUHAMMAD UMER received the B.S. and M.S. degrees from the Department of Computer Science, Khwaja Fareed University of Engineering and IT (KFUEIT), Pakistan, 2018 and 2020, respectively, where he is currently pursuing the Ph.D. degree in computer science. He is also serving as a Research and Teaching Associate at KFUEIT. His recent research interests are related to data mining, mainly working machine learning and deep learning based IoT, text mining, and computer vision tasks. He is a Regular Reviewer for several top tier journals including but not limited to MTAP (Springer), PR Letter and IMAVIS (Elsevier), and so on.

ZAINAB IMTIAZ received the B.S. degree from the Department of Computer Science, University of Central Punjab (UCP), Pakistan, in 2015. She is currently pursuing the Master of Computer Science degree with the Khwaja Fareed University of Engineering and Information Technology (KFUEIT), Pakistan. She was a Microsoft Dynamic AX Developer from 2015 to 2017. She is also a Research Assistant with the Fareed Computing and Research Center, KFUEIT, and an Assistant Lecturer with the Computer Science Department. Her recent research interests include data mining, machine learning, and deep learning-based text mining.

SALEEM ULLAH was born in Ahmedpur East, Pakistan, in 1983. He received the B.Sc. degree in computer science from The Islamia University of Bahawalpur, Pakistan, in 2003, the M.I.T. degree in computer science from Bahauddin Zakariya University, Multan, in 2005, and the Ph.D. degree from Chongqing University, China, in 2012. From 2006 to 2009, he was a Network/IT Administrator with different companies. From August 2012 to February 2016, he was an Assistant Professor with The Islamia University of Bahawalpur. He has been an Associate Dean with the Khwaja Fareed University of Engineering and Information Technology, Rahim Yar Khan, since February 2016. He has almost 14 years of Industry experience in IT. His current research interests include adhoc networks, the IoTs, congestion control, data science, and network security.

ARIF MEHMOOD received the Ph.D. degree from the Department of Information and Communication Engineering, Yeungnam University, South Korea, in November 2017. Since November 2017, he has been a Faculty Member with the Department of Computer Science, KFUEIT, Pakistan. His recent research interests include data mining, mainly working on AI and deep learning-based text mining, and data science management technologies.

GYU SANG CHOI received the Ph.D. degree from the Department of Computer Science and Engineering, Pennsylvania State University, State College, PA, USA, in 2005. He was a Research Staff Member with Samsung Advanced Institute of Technology (SAIT), Samsung Electronics, from 2006 to 2009. Since 2009, he has been a Faculty Member with the Department of Information and Communication, Yeungnam University, South Korea. His research interests include non-volatile memory and storage systems.

BYUNG-WON ON received the Ph.D. degree from the Department of Computer Science and Engineering, Pennsylvania State University, State College, PA, USA, in 2007. He was a Full-Time Researcher with the Advanced Digital Sciences Center, The University of British Columbia, and the Advanced Institutes of Convergence Technology, for a period of seven years. Since 2014, he has been a Faculty Member with the Department of Software Convergence Engineering, Kunsan National University, South Korea. His recent research interests include data mining (probability theory and applications), machine learning, and artificial intelligence, mainly working on abstractive summarization, creative computing, and multiagent reinforcement learning.
